To run a Python command in Jupyter:
● Press SHIFT + ENTER after entering each line to run the command. ENTER alone adds a
new line below so you can continue typing.
OR
● Click the RUN button
TEST WITH SIMPLE HTML TAGS USING BEAUTIFULSOUP
1. Download simple.txt from i-learn
2. Copy the text, paste it into Notepad++, and save it as an HTML file on your desktop.
3. Open Anaconda Navigator
4. Launch Jupyter Notebook
5. New 🡪 Python 3
6. Enter the following code:
a. from bs4 import BeautifulSoup as bs SHIFT + ENTER
b. test_url = "C:\\Users\\User\\Desktop\\simple.html" SHIFT + ENTER [path depends on
where you saved the file]
c. soup = bs(open(test_url), 'html.parser') SHIFT + ENTER
d. print (soup) SHIFT + ENTER
e. print (soup.prettify()) SHIFT + ENTER
f. soup.title SHIFT + ENTER
g. soup.body SHIFT + ENTER
h. soup.body.contents[1] SHIFT + ENTER
i. soup.get_text()
j. print (soup.get_text())
k. print (soup.get_text(strip=True))
l. print (soup.get_text(' ', strip=True))
m. soup.findAll('p')
n. soup.findAll('p',{'id':'First content'})
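Since simple.txt comes from i-learn, its exact contents are not reproduced here; the session above can be sketched against a small stand-in HTML string (the tags and the id value below are assumptions, not the real file):

```python
from bs4 import BeautifulSoup as bs

# Stand-in for simple.html; the actual file from i-learn may differ
html_doc = """<html><head><title>Simple Page</title></head>
<body>
<p id="First content">Hello</p>
<p>World</p>
</body></html>"""

soup = bs(html_doc, 'html.parser')
print(soup.title)                                   # the <title> tag
print(soup.get_text(' ', strip=True))               # all text, space-separated
paragraphs = soup.findAll('p')                      # every <p> tag
first = soup.findAll('p', {'id': 'First content'})  # <p> with a specific id
print(len(paragraphs), first[0].text)
```

Running this in one cell shows the same behaviour as steps a–n: findAll returns a list, and get_text collapses the document to plain text.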
How to read and write data from/to a file using Python
open(filename, mode)
1. How to write data into a file. (If the file already exists, its content will be overwritten.)
Example : writing text into a file named ‘lineText.txt’
filename = "lineText.txt"                  # specify the file name lineText.txt
f = open(filename, 'w')                    # open the file and write to the file
for i in range(10):                        # repeat 10 times, just to print the text
    f.write("This is line %d\r\n" % (i+1)) # writes "This is line ..."; %d prints the integer from %(i + 1)
f.close()                                  # close the file lineText.txt
# \r inserts a carriage return (ENTER key); \n starts a new line
2. How to append to an existing file. (If the file exists, the new content is appended and the existing content stays intact.)
Example : append the text to file named ‘lineText.txt’
filename = "lineText.txt"
f = open(filename, 'a+')
for i in range(5):
f.write("Appended line %d\r\n" % (i+1))
f.close()
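As a side note, the two examples above can also be written with a `with` block, which closes the file automatically even if an error occurs; a sketch using the same lineText.txt name (and plain '\n', which Python translates to the platform's line ending in text mode):

```python
import os

filename = "lineText.txt"
with open(filename, 'w') as f:              # 'w' overwrites any existing file
    for i in range(10):
        f.write("This is line %d\n" % (i + 1))
with open(filename, 'a') as f:              # 'a' appends to the end
    for i in range(5):
        f.write("Appended line %d\n" % (i + 1))
with open(filename, 'r') as f:
    lines = f.readlines()
print(len(lines))                           # 10 written + 5 appended
os.remove(filename)                         # clean up the demo file
```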
****Note : you can find the file in the anaconda3/script folder; since it is a text file,
you can view it using Notepad. You can also view it from the Jupyter Notebook
file browser: select the file, then choose View to see its content.
3. How to read all contents in the file.
Example : read all the contents in a file named ‘lineText.txt’
filename = "lineText.txt"
f = open(filename, 'r')
f1 = f.read()
print (f1)
f.close()
4. How to read content in a file line by line.
Example : read the content in a file named ‘lineText.txt’ line by line
filename = "lineText.txt"
f = open(filename, 'r')
f1 = f.readlines()
for x in f1:
print (x)
f.close()
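Each line returned by readlines() still ends with its newline, and print() adds another, so the loop above prints double-spaced output; stripping the line ending avoids that (a small sketch that creates its own demo file):

```python
import os

filename = "lineText.txt"
with open(filename, 'w') as f:       # create a small demo file
    f.write("first\nsecond\n")

stripped = []
with open(filename, 'r') as f:
    for line in f.readlines():
        stripped.append(line.rstrip('\n'))  # drop the trailing newline
        print(stripped[-1])
os.remove(filename)                  # clean up the demo file
```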
How to start web scraping in Jupyter Notebook
1. Import BeautifulSoup from the bs4 package
2. Import urlopen from the urllib.request package, to fetch pages from a URL
3. Copy the URL address of the page selected from the website
4. Request to open the connection, then read the webpage and download it to our machine
5. Read the HTML tags from the webpage (the scraped contents)
6. Close the connection to the webpage
7. Parse the contents (turn the raw HTML into a structure you can navigate)
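The seven steps above can be sketched end to end; the data: URL below embeds the page inline so the sketch runs without a network connection (an assumption for the demo — with a real site you would pass its http(s) address to urlopen):

```python
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

# Stand-in URL: a data: URL carries the page content inline
my_url = "data:text/html,<html><body><p>Hello</p></body></html>"

uClient = uReq(my_url)                      # open the connection
page_html = uClient.read()                  # read the webpage / download it
uClient.close()                             # close the connection
page_soup = soup(page_html, 'html.parser')  # parse the contents
print(page_soup.p.text)
```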
How to write the scraped data into a file (CSV file – Excel-readable format)
Example :
Scrape data from the webpage –
https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD
and save the data into a file named test.csv (comma-delimited); it will be saved in the
anaconda3/script folder
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'html.parser')
………..
filename = 'test.csv'
f = open(filename,'w')
f.close()
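The skeleton above opens test.csv by hand; the standard csv module takes care of delimiters and quoting (a field containing a comma is quoted automatically). A sketch with assumed column names:

```python
import csv
import os

filename = 'test.csv'
with open(filename, 'w', newline='') as f:   # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(['model', 'product_name', 'shipping'])      # header row
    writer.writerow(['Brand A', 'Card, 8GB', 'Free Shipping'])  # comma gets quoted
    writer.writerow(['Brand B', 'Card 4GB', '$4.99'])

with open(filename, newline='') as f:        # read it back to check
    rows = list(csv.reader(f))
os.remove(filename)                          # clean up the demo file
print(rows[1])
```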
Example :
To scrape
https://www.newegg.com/Laptops-Notebooks/Category/ID-
223?Tid=17489
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = "https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489"
uClient = uReq(my_url) #to request the connection to URL specified
my_page = uClient.read() #read the webpage connected
page_soup = soup(my_page, "html.parser") #to parse the webpage content
#to select all tags <div class = item-container>
my_content = page_soup.findAll("div", {"class":"item-container"})
print (my_content) #to display what is in my_content
for x in my_content: #looping through all the contents in the item-container
model = x.div.div.a.img['title'] #scrape the title and put it in a variable – tree navigation
print (model)
for x in my_content:
model = x.div.div.a.img['title'] #different title name for each image
item_desc = x.findAll('a',{'class':'item-title'}) #find all the <a> tags with class 'item-title'
print(len(item_desc)) #how many contents are there?
print(item_desc[0]) # Array index always starts with 0
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
print(len(item_desc))
print(item_desc[0].text) # display only text
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
print ('Model : ' + model)
print ('Product Name : ' + item_desc[0].text + '\n')
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
shipping = x.findAll('li',{'class':'price-ship'}) #shipping information
print(shipping[0].text.strip())
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
shipping = x.findAll('li',{'class':'price-ship'})
print('Model : ' + model)
print('Product Description : ' + item_desc[0].text)
print('Shipping : ' + shipping[0].text.strip() + '\n')
To write those data into a CSV file (Excel-readable, comma-delimited):
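A hedged sketch of that last step: the inline HTML below imitates two Newegg-style item-container blocks (the markup is an assumption based on the selectors used above; the real page may differ), the same findAll loop extracts the fields, and each row is written to test.csv with the csv module:

```python
import csv
import os
from bs4 import BeautifulSoup as soup

# Inline stand-in for the scraped page; real Newegg markup may differ
page = """
<div class="item-container"><div><div><a href="#"><img title="Brand A"></a></div></div>
  <a class="item-title" href="#">Laptop A 8GB</a>
  <li class="price-ship">Free Shipping</li></div>
<div class="item-container"><div><div><a href="#"><img title="Brand B"></a></div></div>
  <a class="item-title" href="#">Laptop B 16GB</a>
  <li class="price-ship">$4.99 Shipping</li></div>
"""
page_soup = soup(page, 'html.parser')
my_content = page_soup.findAll('div', {'class': 'item-container'})

filename = 'test.csv'
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['model', 'product_name', 'shipping'])  # header row
    for x in my_content:                                    # same loop as above
        model = x.div.div.a.img['title']
        item_desc = x.findAll('a', {'class': 'item-title'})
        shipping = x.findAll('li', {'class': 'price-ship'})
        writer.writerow([model, item_desc[0].text, shipping[0].text.strip()])

with open(filename, newline='') as f:   # read it back to check
    rows = list(csv.reader(f))
os.remove(filename)                     # clean up the demo file
print(rows)
```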