FACULTY OF INFORMATION SYSTEMS
Course:
Web Data Analysis
(3 credits)
Lecturer: Nguyen Thon Da Ph.D.
Chapter 10
Working with Web-Based APIs,
Beautiful Soup and Selenium
(Part 1)
MAIN CONTENTS
Introduction to web APIs
Accessing web API and data formats
Web scraping using APIs
Introduction to Selenium
Using Selenium for web scraping
Hypertext Markup Language: HTML
Using Your Browser as a Development Tool
Cascading Style Sheets: CSS
The Beautiful Soup Library
Scraping JavaScript
Introduction to web APIs
A web API is an interface for websites to return information in
response to requests.
It enables websites to share information with users and third-
party web applications.
Web APIs are language-agnostic and typically return
information in JSON, XML, or CSV formats.
APIs are used to develop applications and usually come with documentation, defined methods, and supporting libraries.
Web APIs operate based on the HTTP protocol and often
require authentication, such as an API key, for requests.
Introduction to web APIs
REST (Representational State Transfer)
REST is an architectural style built on simple, resource-oriented principles. It uses HTTP methods such as GET, POST, PUT, and DELETE to perform operations on resources.
REST uses URLs (Uniform Resource Locators) to identify resources and typically
returns data in JSON or XML format.
REST is easy to understand, flexible, and suitable for lightweight and high-
performance web applications.
SOAP (Simple Object Access Protocol)
SOAP is an XML-based messaging protocol: requests and responses are packaged as XML documents.
It defines its operations and message formats in a special file called a WSDL (Web Services Description Language) document.
SOAP is commonly used in complex systems that require high reliability, such as enterprise web services.
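As a quick illustration of the REST style, the sketch below uses Python's requests library against a hypothetical resource URL (https://api.example.com/books is a placeholder, not a real service) to show how the HTTP methods map to operations on resources:

import requests

BASE = 'https://api.example.com/books'   # hypothetical REST endpoint (placeholder)

# GET – read resources; REST APIs typically return JSON
books = requests.get(BASE, params={'author': 'Austen'}).json()

# POST – create a new resource
requests.post(BASE, json={'title': 'Emma', 'author': 'Austen'})

# PUT – update/replace an existing resource identified by its URL
requests.put(BASE + '/42', json={'title': 'Emma', 'author': 'Jane Austen'})

# DELETE – remove the resource
requests.delete(BASE + '/42')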
Introduction to web APIs
Benefits of web APIs
The data an API returns is specific to the request that was made, together with any filters or parameters applied to it.
Parsing HTML or XML with Python libraries such as Beautiful Soup, pyquery, and lxml is not always required.
The format of the data is structured and easy to handle.
Data cleaning and processing of the final results is easier, or may not be required at all.
Processing time is significantly reduced compared to writing code to analyze the page and apply XPath and CSS selectors to retrieve data.
Responses are easy to process.
Accessing web API and data formats
Making requests to the web API using a web browser
Case 1 – accessing a simple API (request and response)
Case 2 – demonstrating RESTful API cache functionality
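These cases are demonstrated in a web browser, but the same requests can be made from Python. The sketch below sends a simple request to a public API (Case 1) and then re-requests the same resource with the ETag it received, to show RESTful cache behaviour (Case 2); exact headers and status codes depend on the service.

import requests

# Case 1 – a simple API request and response
resp = requests.get('https://api.github.com')
print(resp.status_code)                  # e.g. 200
print(resp.headers.get('Content-Type'))  # e.g. application/json; charset=utf-8
print(resp.json())                       # the JSON body as a Python dict

# Case 2 – RESTful cache functionality: conditional request with the ETag
etag = resp.headers.get('ETag')
cached = requests.get('https://api.github.com', headers={'If-None-Match': etag})
print(cached.status_code)                # 304 Not Modified if the cached copy is still current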
Web scraping using APIs
Example 1 – searching and collecting university names and URLs
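A minimal sketch of this example, assuming the free universities API at http://universities.hipolabs.com (the field names below are those returned by that service):

import requests

# Search the public universities API for a given country (assumed endpoint)
url = 'http://universities.hipolabs.com/search'
results = requests.get(url, params={'country': 'United States'}).json()

# Collect each university's name and URL(s) from the JSON response
for university in results[:10]:
    print(university['name'], university['web_pages'])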
Web scraping using APIs
Example 2 – scraping information from GitHub events
Demo source code: Chapter10_Ex3.ipynb
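A minimal sketch of the GitHub events example (the full version is in Chapter10_Ex3.ipynb), using the public endpoint https://api.github.com/events:

import requests

# Fetch the most recent public events from the GitHub API
events = requests.get('https://api.github.com/events').json()

# Each event is a JSON object; print a few fields of interest
for event in events[:10]:
    print(event['type'], event['repo']['name'], event['created_at'])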
Introduction to Selenium
Four key points about Selenium:
Web Application Framework: Selenium is a web application framework: it provides a structured set of tools and resources for automating and testing web applications.
Web Scraping: Selenium can be utilized for web scraping activities. Web scraping
involves extracting data from websites.
Browser Automation: Selenium can function as a browser automation tool: It can
perform various tasks within a web browser without human intervention. This
includes actions like clicking links, saving screenshots, downloading images, and
filling out HTML <form> templates.
Dynamic and Secure Web Services: Selenium is suitable for working with dynamic or secure web services that use technologies like JavaScript, cookies, and scripts. It can load, test, crawl, and scrape data from these types of websites.
Introduction to Selenium
Selenium projects: Selenium WebDriver
Selenium WebDriver for Browser Automation: A crucial part of Selenium
used to automate web browsers.
Multiple Language Bindings and Third-Party Drivers: It supports multiple
programming languages and interfaces with browsers like Chrome, Firefox,
and Opera through third-party drivers.
No External Dependencies: Selenium WebDriver drives the browser directly and does not rely on an external server (unlike Selenium RC).
Enhanced Features and Overcoming Limitations: It offers an object-
oriented API with improved features to address limitations seen in previous
Selenium versions and Selenium Remote Control (RC).
Introduction to Selenium
Selenium projects: (cont.)
Selenium RC is a server that is programmed in Java. It uses HTTP
to accept commands for the browser and is used to test complex
AJAX-based web applications.
Selenium Grid is a server enabling parallel test execution on
multiple machines, across different browsers and operating
systems, reducing performance issues and time consumption.
Selenium IDE is an open-source integrated development
environment for building test cases with Selenium.
Python Selenium – Open Chrome
Use webdriver_manager to create a driver object for Chrome.
Python Selenium – Open Chrome
To open a given URL in the browser window using Selenium for Python, call the get() method on the driver object and pass the URL (including the scheme, e.g. https://) as an argument. Code: driver.get('https://www.anywebsite.com')
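A minimal sketch combining the two steps above (creating the driver and opening a URL), assuming Selenium 4 and the webdriver_manager package are installed; the URL below is an arbitrary example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver_manager downloads a chromedriver that matches the installed Chrome
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)   # opens a Chrome window

# Open a given URL in the browser window
driver.get('https://www.python.org')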
Python Selenium – Find Element by Link Text
To find a link element (hyperlink) by the value (link text) inside the link, using Selenium in Python, call the find_element() method and pass By.LINK_TEXT as the first argument, and the link text as the second argument. Code: driver.find_element(By.LINK_TEXT, 'Contact MySQL')
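A small sketch reusing the driver created earlier, assuming the MySQL homepage still contains a link whose text is exactly 'Contact MySQL' (as in the snippet above):

from selenium.webdriver.common.by import By

driver.get('https://www.mysql.com')

# Find the hyperlink by its exact link text
link = driver.find_element(By.LINK_TEXT, 'Contact MySQL')
print(link.get_attribute('href'))   # the URL the link points to
link.click()                        # follow the link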
Python Selenium – Get Title of a Website
To get the title of the webpage using Selenium for Python, read the title property of the WebDriver object. Code: driver.title
Python Selenium – Get Current URL (after clicking a link)
To get the current URL in the browser window using Selenium for Python, read the current_url property of the WebDriver object. Code: driver.current_url
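A small sketch covering both properties, reusing the driver from the earlier sketches (the title shown in the comment is only an example):

driver.get('https://www.python.org')
print(driver.title)        # e.g. 'Welcome to Python.org'

# After navigating (for instance by clicking a link, as in the earlier sketch),
# current_url reflects the page that is now loaded in the browser
print(driver.current_url)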
Python Selenium – Find Element by ID
To find an HTML Element by its id attribute using Selenium in Python, call the find_element() method and pass By.ID as the first argument, and the id attribute's value (of the HTML Element we need to find) as the second argument. Code: find_element(By.ID, "id_value")
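A small sketch, where 'id_value' is a placeholder to be replaced with a real id attribute from the page being scraped:

from selenium.webdriver.common.by import By

# Find a single element by its id attribute ('id_value' is a placeholder)
element = driver.find_element(By.ID, 'id_value')
print(element.tag_name)   # the element's HTML tag
print(element.text)       # its visible text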
Python Selenium – Find Element by Class Name
To find an HTML Element by class name attribute using Selenium in Python, call find_element()
method and pass By.CLASS_NAME as the first argument, and the class name (of the HTML
Element we need to find) as the second argument.
Code: find_element(By.CLASS_NAME, "class_name_value")
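A small sketch, where 'class_name_value' is a placeholder class name:

from selenium.webdriver.common.by import By

# find_element() returns the first element with the given class name
element = driver.find_element(By.CLASS_NAME, 'class_name_value')
print(element.text)

# find_elements() returns every matching element as a list
elements = driver.find_elements(By.CLASS_NAME, 'class_name_value')
print(len(elements))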
Python Selenium – Find Element by CSS Selector
To find an HTML Element by CSS Selector, call the find_element() method and pass By.CSS_SELECTOR as the first argument, and the CSS selector string (of the HTML Element we need to find) as the second argument.
Code: find_element(By.CSS_SELECTOR, "css_selector_value")
If multiple HTML Elements match the given CSS selector, find_element() returns the first of them. Regarding the CSS selector string: if we would like to get the first paragraph element whose class name is 'that_class_name' and the paragraph is inside a div, then the CSS selector string is 'div p.that_class_name'.
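A small sketch using the selector described above ('div p.that_class_name' is an illustrative selector, not taken from a specific page):

from selenium.webdriver.common.by import By

# First <p class="that_class_name"> element that sits inside a <div>
paragraph = driver.find_element(By.CSS_SELECTOR, 'div p.that_class_name')
print(paragraph.text)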
Python Selenium – Find Elements by Partial Link Text
To find link elements (hyperlinks) by partial link text, call find_elements() method, pass
By.PARTIAL_LINK_TEXT as the first argument, and the partial link text value as the second
argument. Code: find_elements(By.PARTIAL_LINK_TEXT, "partial_link_text_value")
The find_elements() method returns all the HTML Elements that match the given partial link text, as a list. If no elements match the given partial link text, find_elements() returns an empty list.
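A small sketch, where 'MySQL' is an assumed partial link text:

from selenium.webdriver.common.by import By

# All hyperlinks whose link text contains 'MySQL' (partial match)
links = driver.find_elements(By.PARTIAL_LINK_TEXT, 'MySQL')
for link in links:
    print(link.text, '->', link.get_attribute('href'))
# links is simply an empty list if nothing matches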
Python Selenium – Find Elements by Tag Name
To find all HTML Elements that have a given tag name in a document, using Selenium in Python, call the find_elements() method, pass By.TAG_NAME as the first argument, and the tag name (of the HTML Elements we need to find) as the second argument. Code: find_elements(By.TAG_NAME, "tag_name_value")
The find_elements() method returns all the HTML Elements that match the given tag name, as a list. If no elements match the given tag name, find_elements() returns an empty list.
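A small sketch collecting every anchor element on the current page:

from selenium.webdriver.common.by import By

# Every anchor (<a>) element in the document, returned as a list
anchors = driver.find_elements(By.TAG_NAME, 'a')
print(len(anchors))
for anchor in anchors[:5]:
    print(anchor.get_attribute('href'))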