- Selenium is a web browser automation tool.
- It allows us to open a browser of our choice and perform tasks just as a human would:
- Clicking buttons.
- Entering information into forms.
- Searching for information on web pages.
- It uses a web-driver package that can take control of the browser and mimic user actions.
BeautifulSoup
- A module that can be used for pulling data out of HTML and XML documents.
- Depends heavily on other libraries, such as requests or urllib, for sending web requests.
- Does not have a built-in document parser; we need to choose one, such as html.parser or html5lib.
- It is difficult to scrape websites that return JavaScript-generated content.
- It does not automate the web browser; it saves a copy of the web page source and then does further processing.
Selenium
- A tool developed for automated testing of web applications.
- Can send web requests on its own.
- It comes with a parser.
- Loads JavaScript and can help access data behind JavaScript as well.
- Faster than BeautifulSoup when interacting with web pages.
- Comes in handy when handling JavaScript-heavy websites.
Setup
- Selenium - we need to install the selenium package:
pip install selenium
- Selenium drivers - these web drivers enable Python to control the browser for interactions.
- The browser that you will use, Chrome or Firefox, should be pre-installed.
- Check the version of your browser.
- Visit chromedriver.chromium.org/downloads.
- Download the version of chromedriver that matches the version of your browser.
- Keep note of the path where the chromedriver is downloaded.
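The setup steps above can be sketched as follows. This is an illustrative sketch, not something to run as-is: it needs a local Chrome install, and "/path/to/chromedriver" is a placeholder for the path you noted in the previous step.

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at the chromedriver binary downloaded above
# ("/path/to/chromedriver" is a placeholder for the path you noted).
service = Service(executable_path="/path/to/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("https://www.google.com")  # open a page in the controlled browser
print(driver.title)                   # the page title as seen by the browser
driver.quit()                         # always close the browser when done
```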
Selenium offers a wide variety of functions to locate an element on the web page:
- find_element_by_id: Uses id to find an element.
- find_element_by_name: Uses name to find an element.
- find_element_by_xpath: Uses XPath to find an element.
- find_element_by_tag_name: Uses tag name to find an element.
- find_element_by_class_name: Uses value of class attribute to find an element.
- find_element_by_css_selector: Uses a CSS selector to find an element.
- find_element_by_link_text: Uses the exact text of a link to find an element.
- find_element_by_partial_link_text: Uses partial link text to find an element.
There are other functions as well which help us locate elements on the web page.
- XPath, known as the XML path, is a language for querying XML documents.
- It consists of a path expression along with certain conditions to locate a particular element.
- The basic format of XPath is:
//tagname[@attribute='value']
There are two types of XPath:
- Absolute XPath
- Relative XPath
- Absolute XPath: begins with a single forward slash (/), meaning we select the element from the root node and go all the way down to the element needed, for example:
/html/body/div[2]/div[1]/div/h4[1]/b[1]
- Relative XPath: starts from the middle of the HTML structure. It begins with a double forward slash (//) and can search for elements anywhere on the web page without writing the long absolute XPath, for example:
//div[@class='featured-box columnsize1']//h4[1]//b[1]
1) Basic XPath: an XPath expression selects a node or list of nodes on the basis of attributes like ID, Name, Classname, etc., as shown below:
XPath= //input[@name='uid']
XPath= //a[@href='http://google.com/']
2) contains(): used when the value of an attribute changes dynamically, for example, login information. It can find an element by a partial attribute value.
For finding the 'submit' button, where type='submit':
XPath= //*[contains(@type, 'sub')]
XPath= //img[contains(@src, 'content')]
3) Using OR & AND: with OR, two conditions are combined and the element is found if either the 1st or the 2nd condition is true, i.e. any one condition being true is enough.
XPath= //*[@type='submit' or @name='btnReset']
In an AND expression, both conditions must be true to find the element; it fails to find the element if either condition is false.
XPath= //input[@type='submit' and @name='btnlogin']
4) starts-with(): used to find a web element whose attribute value changes on refresh or through some other dynamic operation on the web page. We match the starting text of the attribute to locate an element whose attribute value has changed dynamically.
XPath= //img[starts-with(@src,'https')]
5) text(): used to locate an element by its exact text.
Here, the expression can match anywhere in the document, irrespective of the tag, but the element must contain text whose value is 'Search Google or type a URL'. The asterisk (*) implies any tag with that text value.
XPath= //*[text()='Search Google or type a URL']
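The attribute-based selectors above can be tried without a browser. Python's standard-library ElementTree understands a small XPath subset (attribute tests, but not contains(), starts-with(), or text()); the following uses a made-up form document to illustrate the basic pattern:

```python
import xml.etree.ElementTree as ET

# A made-up snippet of page source to query against.
doc = ET.fromstring(
    "<form>"
    "<input name='uid' type='text'/>"
    "<input name='btnlogin' type='submit'/>"
    "<a href='http://google.com/'>Google</a>"
    "</form>"
)

# Basic XPath: select by attribute value (ElementTree needs the
# leading '.' to make the // search relative to the root element).
uid = doc.find(".//input[@name='uid']")
login = doc.find(".//input[@type='submit']")
link = doc.find(".//a[@href='http://google.com/']")

print(uid.get("type"))    # text
print(login.get("name"))  # btnlogin
print(link.text)          # Google
```

Full XPath 1.0 functions such as contains() and starts-with() need lxml or a browser engine, which is what Selenium uses when you pass By.XPATH.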
- Web scraping is a technique used to extract large amounts of data from websites.
- The data extracted can be stored in structured formats like CSV.
- Then we can use the extracted data according to our needs.
- For example, we can collect data from e-commerce portals and social media platforms to understand customer behaviour, sentiments, and buying patterns, which are critical insights for any business.
- Web scraping is an automated technique used to extract large amounts of data from websites.
- The data on websites is unstructured; web scraping helps collect this unstructured data and store it in a structured form.
- There are different ways to scrape websites, such as online services, APIs, or writing your own code.
- We write code that sends a request to the server hosting the page we specified.
- Our code downloads that page's source code.
- It then filters through the page looking for the HTML elements we've specified and extracts whatever content we've instructed it to extract.
For example, if we want to get all of the titles inside H2 tags from a website, we could write code that works as shown in the following steps:
- Our code would request the site's content from its server and download it.
- Then it would go through the page's HTML code looking for H2 tags.
- Whenever it finds an H2 tag, it would copy whatever text is inside the tag and save it in whatever format we have specified.
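The steps above can be sketched with BeautifulSoup. The HTML string below is a made-up stand-in for a downloaded page; in practice it would come from something like requests.get(url).text:

```python
from bs4 import BeautifulSoup

# Stand-in for a page's downloaded source code.
html = """
<html><body>
  <h2>First headline</h2>
  <p>Intro text.</p>
  <h2>Second headline</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # parser chosen explicitly
# Find every H2 tag and copy the text inside it.
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)  # ['First headline', 'Second headline']
```

From here, the titles list could be written out to CSV or any other structured format.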
A web page is generally made up of 4 types of files:
1. HTML file: contains the main content of the web page.
2. CSS file: handles the styling of the web page.
3. JS file: the JavaScript file brings interactivity to the web page.
4. Image files: JPG/PNG formats for showing images on the web page.
As we are interested in extracting data from the web page, we will be using the HTML file, as it contains the main content of the page.
- assert is a built-in Python keyword used for debugging.
- It checks whether a given condition is True.
- If the condition is False, Python raises an AssertionError.
- assert "Selenium" in title checks whether the string "Selenium" is present in the variable title.
- If "Selenium" exists inside title, the assertion passes, and the program continues running normally.
- If "Selenium" is not in title, an AssertionError is raised.
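As a runnable illustration (the title string here is hard-coded in place of a real driver.title):

```python
# Stand-in for driver.title after loading a page.
title = "Selenium with Python - Tutorial"

# Condition is True: execution simply continues.
assert "Selenium" in title
print("first assertion passed")

# Condition is False: an AssertionError is raised, carrying the
# optional message given after the comma.
try:
    assert "Playwright" in title, "'Playwright' not found in title"
except AssertionError as err:
    caught = str(err)
    print("AssertionError:", caught)
```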
Synchronization issues in Selenium with Python primarily refer to challenges related to timing misalignment between the automation script and the web application being tested. These issues arise due to the asynchronous nature of web applications, where elements may load or change dynamically, and the automation script may not synchronize with these changes.
- Element Visibility: the script tries to interact with an element before it is visible on the page, leading to NoSuchElementException or ElementNotVisibleException.
- Element Interactability: the script attempts to interact with an element before it becomes interactable, such as clicking a button or typing in a text field.
- Page Load: the script proceeds with actions before the page has finished loading completely, resulting in StaleElementReferenceException or ElementNotInteractableException.
- Asynchronous Operations: web applications often use asynchronous operations such as AJAX requests or JavaScript timers to update content dynamically. The script needs to wait for these operations to complete before proceeding.
- Dynamic Content: elements on the page may load or change dynamically after a certain period or due to user interactions. The script needs to wait for these changes to be reflected before performing actions.
- Selenium has a built-in way to automatically wait for elements, called an implicit wait.
- An implicit wait value can be set either with the timeouts capability in the browser options, or with a driver method.
- This is a global setting that applies to every element location call for the entire session.
- The default value is 0, which means that if the element is not found, it will immediately return an error.
- If an implicit wait is set, the driver will wait for the duration of the provided value before returning the error.
The syntax for explicit waits in Selenium Python involves two main parts:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(driver, timeout)
element = wait.until(EC.presence_of_element_located((By.ID, "my_element_id")))
| Feature | Implicit Wait | Explicit Wait |
| --- | --- | --- |
| Scope | Global (applies to all element find operations) | Specific to a particular element |
| Waits For | Element presence in the DOM | Specific conditions (presence, visibility, etc.) |
| Syntax | Simpler (set a global timeout value) | More complex (uses ExpectedConditions functions) |
| Control | Less control over waiting behavior | More control over waiting logic |
| Readability | Less readable (unclear what you're waiting for) | More readable (explicitly states wait conditions) |