To run a Python command in Jupyter:
● Press SHIFT + ENTER after entering each line to run the command. ENTER alone adds a
new line below so you can continue typing.
OR
● Click the RUN button
TEST WITH SIMPLE HTML TAGS USING BEAUTIFULSOUP
1. Download simple.txt from i-learn
2. Copy the text, paste it into Notepad++, and save it as an HTML file on your desktop.
3. Open Anaconda Navigator
4. Launch Jupyter Notebook
5. New 🡪 Python 3
6. Enter the following code:
a. from bs4 import BeautifulSoup as bs SHIFT + ENTER
b. test_url = "C:\\Users\\User\\Desktop\\simple.html" SHIFT + ENTER [path depends on
where you saved the file]
c. soup = bs(open(test_url), 'html.parser') SHIFT + ENTER
d. print (soup) SHIFT + ENTER
e. print (soup.prettify()) SHIFT + ENTER
f. soup.title SHIFT + ENTER
g. soup.body SHIFT + ENTER
h. soup.body.contents[1] SHIFT + ENTER
i. soup.get_text()
j. print (soup.get_text())
k. print (soup.get_text(strip=True))
l. print (soup.get_text(' ', strip=True))
m. soup.findAll('p')
n. soup.findAll('p',{'id':'First content'})
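Since simple.txt comes from i-learn, its exact contents are not reproduced here; the session above can be sketched against a small stand-in HTML string (the tags and the id value below are assumptions, not the real file):

```python
from bs4 import BeautifulSoup as bs

# Stand-in for simple.html; the actual file from i-learn may differ
html_doc = """<html><head><title>Simple Page</title></head>
<body>
<p id="First content">Hello</p>
<p>World</p>
</body></html>"""

soup = bs(html_doc, 'html.parser')
print(soup.title)                                   # the <title> tag
print(soup.get_text(' ', strip=True))               # all text, space-separated
paragraphs = soup.findAll('p')                      # every <p> tag
first = soup.findAll('p', {'id': 'First content'})  # <p> with a specific id
print(len(paragraphs), first[0].text)
```

Running this in one cell shows the same behaviour as steps a–n: findAll returns a list, and get_text collapses the document to plain text.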
How to read and write data from/to a file using Python
open(filename, mode)
1. How to write data into a file. (If the file already exists, its content will be overwritten.)
Example : writing text into a file named ‘lineText.txt’
filename = "lineText.txt"                  # specify the file name lineText.txt
f = open(filename, 'w')                    # open the file and write to the file
for i in range(10):                        # repeat 10 times, just to print the text
    f.write("This is line %d\r\n" % (i+1)) # writes "This is line ..."; %d prints the integer from %(i + 1)
f.close()                                  # close the file lineText.txt
# \r inserts a carriage return (ENTER key); \n starts a new line
2. How to append to an existing file. (If the file exists, the new content is appended and the existing content stays intact.)
Example : append the text to file named ‘lineText.txt’
filename = "lineText.txt"
f = open(filename, 'a+')
for i in range(5):
f.write("Appended line %d\r\n" % (i+1))
f.close()
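As a side note, the two examples above can also be written with a `with` block, which closes the file automatically even if an error occurs; a sketch using the same lineText.txt name (and plain '\n', which Python translates to the platform's line ending in text mode):

```python
import os

filename = "lineText.txt"
with open(filename, 'w') as f:              # 'w' overwrites any existing file
    for i in range(10):
        f.write("This is line %d\n" % (i + 1))
with open(filename, 'a') as f:              # 'a' appends to the end
    for i in range(5):
        f.write("Appended line %d\n" % (i + 1))
with open(filename, 'r') as f:
    lines = f.readlines()
print(len(lines))                           # 10 written + 5 appended
os.remove(filename)                         # clean up the demo file
```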
****Note : you can find the file in the anaconda3/script folder; since it is a text file,
you can view it using Notepad. You can also view it from the Jupyter Notebook
file browser: select the file, then choose View to see its content.
3. How to read all contents in the file.
Example : read all the contents in a file named ‘lineText.txt’
filename = "lineText.txt"
f = open(filename, 'r')
f1 = f.read()
print (f1)
f.close()
4. How to read content in a file line by line.
Example : read the content in a file named ‘lineText.txt’ line by line
filename = "lineText.txt"
f = open(filename, 'r')
f1 = f.readlines()
for x in f1:
print (x)
f.close()
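Each line returned by readlines() still ends with its newline, and print() adds another, so the loop above prints double-spaced output; stripping the line ending avoids that (a small sketch that creates its own demo file):

```python
import os

filename = "lineText.txt"
with open(filename, 'w') as f:       # create a small demo file
    f.write("first\nsecond\n")

stripped = []
with open(filename, 'r') as f:
    for line in f.readlines():
        stripped.append(line.rstrip('\n'))  # drop the trailing newline
        print(stripped[-1])
os.remove(filename)                  # clean up the demo file
```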
How to start web scraping in Jupyter Notebook
1. Import BeautifulSoup from the bs4 package
2. Import urlopen from the urllib.request package, to fetch pages from a URL
3. Copy the URL address of the page selected from the website
4. Request to open the connection, then read the webpage and download it to our machine
5. Read the HTML tags from the webpage (the scraped contents)
6. Close the connection to the webpage
7. Parse the contents (turn the raw HTML into a structure you can navigate)
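The seven steps above can be sketched end to end; the data: URL below embeds the page inline so the sketch runs without a network connection (an assumption for the demo — with a real site you would pass its http(s) address to urlopen):

```python
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

# Stand-in URL: a data: URL carries the page content inline
my_url = "data:text/html,<html><body><p>Hello</p></body></html>"

uClient = uReq(my_url)                      # open the connection
page_html = uClient.read()                  # read the webpage / download it
uClient.close()                             # close the connection
page_soup = soup(page_html, 'html.parser')  # parse the contents
print(page_soup.p.text)
```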
How to write the scraped data into a file (CSV file – Excel-readable format)
Example :
Scrape data from the webpage –
https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD
and save the data into a file named test.csv (comma-delimited); it will be saved in the
anaconda3/script folder
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20CARD'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'html.parser')
………..
filename = 'test.csv'
f = open(filename,'w')
f.close()
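The skeleton above opens test.csv by hand; the standard csv module takes care of delimiters and quoting (a field containing a comma is quoted automatically). A sketch with assumed column names:

```python
import csv
import os

filename = 'test.csv'
with open(filename, 'w', newline='') as f:   # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(['model', 'product_name', 'shipping'])      # header row
    writer.writerow(['Brand A', 'Card, 8GB', 'Free Shipping'])  # comma gets quoted
    writer.writerow(['Brand B', 'Card 4GB', '$4.99'])

with open(filename, newline='') as f:        # read it back to check
    rows = list(csv.reader(f))
os.remove(filename)                          # clean up the demo file
print(rows[1])
```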
Example :
To scrape
https://www.newegg.com/Laptops-Notebooks/Category/ID-
223?Tid=17489
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = "https://www.newegg.com/Laptops-Notebooks/Category/ID-223?Tid=17489"
uClient = uReq(my_url) #to request the connection to URL specified
my_page = uClient.read() #read the webpage connected
page_soup = soup(my_page, "html.parser") #to parse the webpage content
#to select all tags <div class = item-container>
my_content = page_soup.findAll("div", {"class":"item-container"})
print (my_content) #to display what is in my_content
for x in my_content: #looping through all the contents in the item-container
model = x.div.div.a.img['title'] #scrape the title and put it in a variable – tree navigation
print (model)
for x in my_content:
model = x.div.div.a.img['title'] #different title name for each image
item_desc = x.findAll('a',{'class':'item-title'}) #find all the <a> tags with class 'item-title'
print(len(item_desc)) #how many contents are there?
print(item_desc[0]) # Array index always starts with 0
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
print(len(item_desc))
print(item_desc[0].text) # display only text
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
print ('Model : ' + model)
print ('Product Name : ' + item_desc[0].text + '\n')
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
shipping = x.findAll('li',{'class':'price-ship'}) #shipping information
print(shipping[0].text.strip())
for x in my_content:
model = x.div.div.a.img['title']
item_desc = x.findAll('a',{'class':'item-title'})
shipping = x.findAll('li',{'class':'price-ship'})
print('Model : ' + model)
print('Product Description : ' + item_desc[0].text)
print('Shipping : ' + shipping[0].text.strip() + '\n')
To write those data into a CSV file (Excel-readable, comma-delimited):
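A hedged sketch of that last step: the inline HTML below imitates two Newegg-style item-container blocks (the markup is an assumption based on the selectors used above; the real page may differ), the same findAll loop extracts the fields, and each row is written to test.csv with the csv module:

```python
import csv
import os
from bs4 import BeautifulSoup as soup

# Inline stand-in for the scraped page; real Newegg markup may differ
page = """
<div class="item-container"><div><div><a href="#"><img title="Brand A"></a></div></div>
  <a class="item-title" href="#">Laptop A 8GB</a>
  <li class="price-ship">Free Shipping</li></div>
<div class="item-container"><div><div><a href="#"><img title="Brand B"></a></div></div>
  <a class="item-title" href="#">Laptop B 16GB</a>
  <li class="price-ship">$4.99 Shipping</li></div>
"""
page_soup = soup(page, 'html.parser')
my_content = page_soup.findAll('div', {'class': 'item-container'})

filename = 'test.csv'
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['model', 'product_name', 'shipping'])  # header row
    for x in my_content:                                    # same loop as above
        model = x.div.div.a.img['title']
        item_desc = x.findAll('a', {'class': 'item-title'})
        shipping = x.findAll('li', {'class': 'price-ship'})
        writer.writerow([model, item_desc[0].text, shipping[0].text.strip()])

with open(filename, newline='') as f:   # read it back to check
    rows = list(csv.reader(f))
os.remove(filename)                     # clean up the demo file
print(rows)
```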