
Data Preprocessing

Prof. Chia-Yu Lin


National Central University

2025 Fall
1
Have Interest in Data Analytics?
Mission: Finish a pork carrot soup

[Pipeline diagram] Stage 1: Data Collection → Stage 2: Data Storage → Stage 3: Data Preprocessing (Cleaning, Labeling, …) → Stage 4: Data Analysis → Stage 5: Data Visualization

2
Outline
• Introduction to Data Analytics Libraries
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing
• Reference book: 東京大學資料科學家養成全書:使用Python動手學習資料分析
3
Data Analysis Libraries
• There are four commonly used libraries.
• Numpy:
– A library for basic array and numerical operations.
– Besides supporting advanced and complex computation, Numpy operations run faster than equivalent plain-Python operations.
• Scipy:
– A library that further enhances the functionality of Numpy.
– Can perform statistical and signal-processing operations.
• Pandas
– A library for processing various data in the form of DataFrames.
• Matplotlib
– A library for data visualization
4
Libraries Related to Data Analysis

[Ecosystem diagram] Python (with IPython and Jupyter) at the base; Numpy, Scipy, and Sympy on top of it; then Pandas, Matplotlib, Scikit-learn, Scikit-image, StatsModels, PyMC, PyTables, NetworkX, Astropy, Biopython, Nipy, and DIPY built on those.
5
Import Library
• Using “import”
– Import the Numpy library with the name "np"
– “np.function name” to use the function of Numpy

• Using “from”
– Originally, we need to write the full path “module.submodule.function” to call a function.
– We can use “from” to omit the prefix:
– After “from numpy import random”, random.function is the same as np.random.function.
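A minimal sketch of the two import styles (the array values here are illustrative):

```python
import numpy as np

# Style 1: "import ... as" -- every function is reached via the np. prefix.
data = np.array([3, 1, 2])
sorted_data = np.sort(data)   # np.<function name>

# Style 2: "from" -- omit the np. prefix for a submodule.
from numpy import random
random.seed(0)                # same as np.random.seed(0)
value = random.randint(10)    # one integer in [0, 10)
```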

6
Magic Command
• Some modules in the Jupyter environment provide “magic commands.”
• A magic command starts with %, and performs various environment-level operations in the Jupyter environment.
• %precision
– When displaying data, you can specify the number of digits shown after the decimal point.
• %matplotlib
– The extension command of Matplotlib; it specifies how charts are displayed.
– If you use "inline”, charts are shown directly in the notebook at that position. If you do not specify %matplotlib, they are displayed in another window.
Outline
• Introduction to Data Analytics Modules
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing

8
Numpy
• NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• Numpy is written in C instead of Python, so its calculations are fast.

9
Install Numpy
• pip install numpy

10
Array Operation
• Array
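A minimal sketch of creating an array (the values are illustrative):

```python
import numpy as np

# Build a one-dimensional NumPy array from a Python list.
arr = np.array([9, 2, 3, 4, 10, 6, 7, 8, 1, 5])

print(arr)        # the elements
print(type(arr))  # <class 'numpy.ndarray'>
```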

11
Data Type (1/3)
• When using Numpy to process data, each array has a “data type” (dtype) that enables high-speed calculation and ensures the precision of the calculation.
• Integers and floating-point numbers are examples of data types.

int(Signed Integer) uint(Unsigned Integer)


Data Description Data Description
Type Type
int8 8 bits signed integer uint8 8 bits unsigned integer
int16 16 bits signed integer uint16 16 bits unsigned integer
int32 32 bits signed integer uint32 32 bits unsigned integer
int64 64 bits signed integer uint64 64 bits unsigned integer

12
Data Type (2/3)

float(Floating point) bool(Boolean)


Data Description Data Description
Type Type
float16 16 bits floating point bool Show True or False
float32 32 bits floating point
float64 64 bits floating point

13
Data Type (3/3)
• If you want to query the data type, you can specify ".dtype"
after the variable.

14
Dimension and Number of Elements
• Dimension: ndim
• Number of elements: size

15
Calculate All Elements
• In plain Python, you have to use a ”for loop” to operate on all elements.
• Numpy can operate on all elements without a for loop.
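For example (illustrative values), the same doubling written both ways:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Plain Python: loop over every element.
doubled_loop = []
for x in arr:
    doubled_loop.append(x * 2)

# NumPy: one expression operates on all elements at once.
doubled_np = arr * 2
```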

16
Sort
• Use sort. The default is ascending (small to large).
• If you want to sort from large to small, you can use “data[::-1].sort()”

• <Review>
• Q1: What is [n:m:s] in Python?
– (1) From the nth element to the (m-1)th, take every sth element.

• If n and m are omitted, what does it mean?
– (2) Take "all" elements at intervals of s.

• What does s = -1 represent?
– (3) Take elements one by one starting from the end.
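A sketch of both sort directions (values illustrative):

```python
import numpy as np

data = np.array([2, 9, 5, 1, 7])

data.sort()            # in place, ascending
ascending = data.copy()

data[::-1].sort()      # sort the reversed view -> data ends up descending
descending = data.copy()
```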

17
Min, Max, Sum, Cumulative Sum
• Array of Numpy can call:
• min: minimum
• max: maximum
• sum: summation
• cumsum: cumulative sum
– For sequence a, b, c
– First value: a
– Second value: a+b
– Third value: a+b+c
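The four calls in one sketch (illustrative values):

```python
import numpy as np

arr = np.array([3, 1, 4, 2])

minimum = arr.min()      # 1
maximum = arr.max()      # 4
total = arr.sum()        # 10
running = arr.cumsum()   # [3, 4, 8, 10] -> a, a+b, a+b+c, ...
```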

18
Random (1/2)
• When we analyze data, we use a “random function” to separate data or to generate data with different distributions.
• Python has its own random function, but during data analysis we usually use the random functions in Numpy.
• A random function generates random values based on a mathematical formula.
• The initial value of a random function is called the “seed.”
• Although it is not necessary to specify a seed, specifying the same seed ensures we obtain the same sequence of random numbers in each run.
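A sketch of seed reproducibility:

```python
import numpy as np

np.random.seed(0)            # fix the seed
first = np.random.randn(5)   # 5 standard-normal random numbers

np.random.seed(0)            # same seed again...
second = np.random.randn(5)  # ...so the same sequence comes back

same = np.array_equal(first, second)  # True
```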

19
Random (2/2)
function Definition
rand Uniform distribution, 0.0 <= x < 1.0
random_sample Uniform distribution, 0.0 <= x < 1.0 (the generating method is different from “rand”)
randint Uniform distribution. Integers within a given range.
randn Normal distribution. Random numbers whose mean is 0 and std is 1.
normal Normal distribution. Random numbers with arbitrary mean and std.
binomial Random numbers from a binomial distribution.
beta Random numbers from a beta distribution.
gamma Random numbers from a gamma distribution.
chisquare Random numbers from a chi-squared distribution.
Retrieve Data Randomly
• random.choice
• There are 2 arguments and 1 parameter.
– The first argument: the array to retrieve data from.
– The second argument: the number of data items to retrieve.
– Parameter: replace
• If replace is True or unspecified, data can be retrieved repeatedly (sampling with replacement).
• If replace is False, data cannot be retrieved repeatedly (sampling without replacement).
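A sketch of both modes (the array values are illustrative):

```python
import numpy as np

np.random.seed(0)
data = np.array([10, 20, 30, 40, 50])

# replace=True (the default): the same element may be drawn repeatedly.
with_repeat = np.random.choice(data, 8)

# replace=False: each element is drawn at most once,
# so the sample size cannot exceed len(data).
without_repeat = np.random.choice(data, 5, replace=False)
```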

21
Matrix
• We can use Numpy to do matrix operations.
• The “arange” function generates specified consecutive integers.
– arange(9) generates the integers 0~8.
• To retrieve rows or columns from a matrix, we can use “[range of rows, range of columns].”

22
Matrix Multiplication
• The “dot” function is matrix multiplication.
• If you use “*”, it is the element-wise product. (The respective elements are multiplied.)
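The two products side by side (matrices are illustrative):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)       # 0..8 as a 3x3 matrix
b = np.ones((3, 3), dtype=np.int64)  # 3x3 matrix of ones

elementwise = a * b    # element-wise: multiplying by 1 leaves a unchanged
matmul = np.dot(a, b)  # matrix product: each entry is a row sum of a
```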

23
Matrix Whose Elements Are 0 Or 1
• np.zeros: Matrix whose elements are 0
• np.ones: Matrix whose elements are 1
• dtype: Specify data type

24
HW 1-1
• 1. Using “np.array” to generate
an array contains 1-50 and show
the sum of 1-50.
• 2. Create an array with 10
random numbers generating
based on normal distribution
with random seed 0. Show the
minimum, maximum, and sum.
• 3. Create a 5*5 matrix with all
elements of 3 and calculate
matrix multiplication .

26
HW 1-2
• Generate 4*4 matrix with
normal distribution and
random seed 1.
• Generate 4*4 matrix with
normal distribution and
random seed 2.
• Calculate element-wise
product.
• Calculate matrix
multiplication.

28
Outline
• Introduction to Data Analytics Modules
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing

29
Pandas
• In Python, Pandas is a library to preprocess data before
building model.
• Pandas can flexibly process all kinds of data, and
perform operations such as table calculation, data
extraction, and search.

• EX:
– Find rows from data that match certain criteria
– Set a benchmark to calculate the average of each
– Merge data

30
Install Pandas
• pip install pandas

31
Import Pandas
• Import Pandas

• Import the Series class, which processes one-dimensional arrays.
• Import the DataFrame class, which processes two-dimensional arrays.

32
Series (1/3)
• A pandas Series is a one-dimensional labelled data structure which can hold data such as strings, integers, and even other Python objects.
• A Pandas Series is built on top of a Numpy array.
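A sketch covering values, index, and a custom index (values illustrative):

```python
import pandas as pd

# Default integer index 0, 1, 2, ...
s = pd.Series([10, 20, 30])
values = s.values   # the underlying NumPy array
index = s.index     # RangeIndex(start=0, stop=3, step=1)

# A custom text index can also be specified.
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
b_value = s2['b']   # 20
```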

[Figure: a Series showing its elements, index, and data type]
Series (2/3)
• Get data: Series.values
• Get index: Series.index

34
Series (3/3)
• Specify index in Series

35
DataFrame (1/2)
• A Pandas DataFrame is a 2 dimensional data structure, like
a 2 dimensional array, or a table with rows and columns.
• We can set a different dtype (data type) for each column.
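A sketch of a DataFrame with per-column dtypes (the column names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'],              # strings
                   'City': ['Taipei', 'Tainan', 'Taipei'],
                   'Birth_year': [1990, 1989, 1992]})  # integers

dtypes = df.dtypes   # each column keeps its own dtype
transposed = df.T    # transpose: swap rows and columns, as with a matrix
```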


36
DataFrame (2/2)
• DataFrame, like Series, can change the index value and set
the text as the index value


37
In Jupyter
• Previously, we used “print” to show a Series or DataFrame.
• In Jupyter, a Series or DataFrame is automatically recognized and rendered.
• Therefore, we can simply input the Series or DataFrame directly.

38
Exchange Rows and Columns of a DataFrame

• Exchanging rows and columns is the transpose, as with matrices (DataFrame.T).

39
Extract a Specific Column (1/2)
• To extract one specified column, specify the column name directly.

40
Extract Specific Columns (2/2)
• To extract several specified columns, use a Python “list” of column names.

(Extract columns with bracket notation, not with “.”)

41
Extract Data (1/3)
• For DataFrame objects, you can keep only the data that meets certain conditions, or combine multiple conditions => like a filter.

42
Extract Data (2/3)
• Extract the “City” column and compare it to “Taipei.” The result is a Series of True/False values.

43
Extract Data (3/3)
• If you want to specify multiple conditions, you can use
“isin(list).”
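A sketch of both filtering styles, plus the practice question; the column names and values are illustrative stand-ins for the slides' example data:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Taipei', 'Tainan', 'Taipei', 'Taichung'],
                   'Birth_year': [1990, 1989, 1992, 1985]})

# One condition: a True/False Series used directly as a row filter.
taipei_rows = df[df['City'] == 'Taipei']

# Several values at once: isin(list).
two_cities = df[df['City'].isin(['Taipei', 'Tainan'])]

# Practice: rows whose Birth_year is before 1990, excluding 1990.
before_1990 = df[df['Birth_year'] < 1990]
```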

44
Practice
• Filter the data whose Birth_year is before 1990, excluding
1990.

45
Drop Column or Row in DataFrame
(1/4)
• To remove a specific column or row, we can use “drop.”
• Use the “axis” parameter to specify row or column:
– “axis=0” drops rows
– “axis=1” drops columns

46
Drop Column or Row in DataFrame
(2/4)
• Drop Column:

47
Drop Column or Row in DataFrame
(3/4)
• When we execute “attri_data_frame1.drop([‘Birth_year’], axis=1),” the column is not deleted from the original data.
• If you want to delete the column in the data itself, you have to assign the result back (e.g., attri_data_frame1 = attri_data_frame1.drop([‘Birth_year’], axis=1)).

48
Drop Column or Row in DataFrame
(4/4)

49
Merge Data in DataFrame (1/3)
• DataFrame objects can be merged.
• Data often come from different sources; we need to merge them before analysis.
• We can use “merge.”

50
Merge Data in DataFrame (2/3)
• Merge dataframe “attri_data_frame1” and dataframe “attri_data_frame2.”
• The common column of these two dataframes is “ID.”
• Rows with the same ID are matched and merged.
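A small sketch of the merge; the frames below are illustrative stand-ins for attri_data_frame1 and attri_data_frame2:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['1', '2', '3'],
                    'City': ['Taipei', 'Tainan', 'Taichung']})
df2 = pd.DataFrame({'ID': ['2', '3', '4'],
                    'Math': [60, 30, 40]})

# merge joins on the common column "ID"; by default only IDs
# present in BOTH frames are kept (an inner join).
merged = pd.merge(df1, df2)
```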

51
Merge Data in DataFrame (3/3)

52
Statistics (1/2)
• We can compute statistics on a DataFrame.
• Use “groupby” to compute statistics grouped by a specific condition/column.

53
Statistics (2/2)
• Use the Gender column as the grouping key to calculate the average math grade.
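A sketch of the groupby, covering both this slide and the practice that follows; the grades are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'],
                   'Math': [60, 80, 70, 90],
                   'English': [50, 95, 65, 85]})

# Average math grade per gender.
avg_math = df.groupby('Gender')['Math'].mean()

# Practice: max and min English grade per gender.
max_english = df.groupby('Gender')['English'].max()
min_english = df.groupby('Gender')['English'].min()
```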

54
Practice
• Use the Gender column as the grouping key to calculate the maximum and minimum English grades.

55
Sort (1/3)
• For objects of Series and DataFrame, we can do sorting.
• Sort by index
• Sort by value

56
Sort (2/3)
• Sort by index

57
Sort (3/3)
• Sort by value
• Use “value of Birth_year column” to sort.

58
Check nan (null) (1/4)
• Sometimes there is missing information in data.
• If we calculate the average directly, we cannot get a correct result.
• Missing information should be deleted (or imputed).

59
Check nan (null) (2/4)
• Compare eligible data
– Check whether the data contains “Taipei”

60
Check nan (null) (3/4)
• Use “isnull” to check whether the value is “nan.”

61
Check nan (null) (4/4)
• Calculate the number of nan

There are 5 True in “Name” column.

62
HW 1-3 (1/5)
• Use the following data as input.

from pandas import Series,DataFrame


import pandas as pd

attri_data1 = {'ID':['1','2','3','4','5'],
'Sex':['F','F','M','M','F'],
'Money':[1000,2000,500,300,700],
'Name':['Alice','Bob','Candy','David','Ella']}

attri_data_frame1 = DataFrame(attri_data1)

64
HW 1-3 (2/5)
• Extract and show the data whose money is more than 500
(including 500).

66
HW 1-3 (3/5)
• Calculate the average money of male and female.

68
HW 1-3 (4/5)
• Input dataframe “attri_data2.”

attri_data2 = {'ID':['3','4','7'],
'Math':[60,30,40],
'English':[80,20,30]}
attri_data_frame2 = DataFrame(attri_data2)

69
HW 1-3 (5/5)
• Merge dataframe “attri_data1” and “attri_data2.”

• Show the average of “Money, Math, and English.”


• (The average of ID column can be ignored.)

72
HW 1-4
• Generate 100 data with ID, Gender, Money
• Randomly generate 100 gender data.
• np.random.seed(2)
• array2 = np.random.randint(2, size=100)
• 0:Female
• 1:Male

• Generate money randomly.


• np.random.seed(3)
• array3=np.random.normal(1000, 10, size=100)

74
HW 1-4: Generated Matrix

75
HW 1-4 (1)
• Extract the data who has least money.

77
HW 1-4 (2)
• Extract data whose money more than 1010.

79
HW 1-4 (3)
• Sort the result of question (2) by money.

81
Outline
• Introduction to Data Analytics Modules
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing

82
The Importance of Data Preprocessing
• If there were some soil and dirt in your carrot pork soup, how would you feel?
• Cleaning is an absolutely indispensable step in preparing food materials.
• The same goes for data: data needs preprocessing.
• Wrong data cannot produce correct results, even with a powerful analytical method.

83
Data Preprocessing
1. Data Cleaning (資料清洗 )
2. Impute missing value (資料補值)
3. Data Labeling (資料標註)

84
Data Cleaning (1/3)
• If a company's data contains boy, girl, male, female, Male, Female, M, F, etc., many values are duplicates.

[Figure: the pie chart is divided into many blocks, and many of the blocks are duplicates.]

85
https://ithelp.ithome.com.tw/articles/10199944
Data Cleaning (2/3)
• First, decide the format of the data:
– “Boy, Girl”
– “Male, Female”
– “M, F”
• If the format is set to “M, F”, we need to start the conversion process: change every other representation of male and female to “M” or “F”.
• Finally, if some values do not represent male or female at all, such as a cell phone number or an address, we should change those values to “null.”

86
Data Cleaning (3/3)

87
Data Cleaning: Outlier
• Outlier:
• There is no uniform definition; outliers are judged by data analysts or decision makers.
• Methods to judge outliers:
– Draw boxplots and treat values beyond a certain percentile as outliers.
– Use the normal distribution.
– Map data into a specific space and observe the distances among data points.

88
Data Preprocessing
1. Data Cleaning (資料清洗 )
2. Impute missing value (資料補值)
3. Data Labeling (資料標註)

89
Impute missing value
• Missing value and outlier unavoidable situations when dealing
with data.
• There are various reasons for missing value. For example,
forgetting to fill the data, system issue.
• For missing value, should it be ignored, or the closest value fill
in? ??
– Different methods will produce great deviations, which may lead to
wrong decisions and cause heavy losses. Thus, missing value should
be handled with caution.

90
Example Data
• Assume “NaN(NA)” is missing value.

#Data Preparation
import numpy as np
from numpy import nan as NA
import pandas as pd
import numpy.random as random  # needed for random.seed below

random.seed(0)
df = pd.DataFrame(np.random.rand(10, 4))

# Set some cells to NA
df.iloc[1, 0] = NA
df.iloc[2:3, 2] = NA
df.iloc[5:, 3] = NA
print(df)

92
Listwise Deletion (成批刪除)
• Delete every row that has a NaN.
• Use “dropna”; this is called list-wise deletion (成批刪除).

93
Pairwise deletion (逐對刪除)
• List-wise deletion can leave too little data, making the data unusable.
• In pairwise deletion, we ignore the columns that have missing values.
• We extract only the columns we want and then use dropna.
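A sketch of both deletion styles on a small frame (the NaN positions are illustrative):

```python
import numpy as np
from numpy import nan as NA
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3))
df.iloc[1, 0] = NA
df.iloc[3, 2] = NA

# List-wise deletion: drop every row containing any NaN.
listwise = df.dropna()

# Pair-wise deletion: restrict to the columns of interest first,
# then drop only rows that are NaN within those columns.
pairwise = df[[0, 1]].dropna()
```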

94
Filling Missing Value: fillna
• Use “fillna(value)” for NaN value.
• EX: If we want to fill 0 for NaN, we can use fillna(0).

95
Filling Missing Value: ffill
• We can use “ffill” to fill in the value of “previous row.”

96
Filling Missing Value: mean
• We can use “mean” to fill in the average value of each column.
• Notice: with time-series data, this method may use future values to compute the mean; you should avoid this.
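The three filling strategies from these slides in one sketch (the values are illustrative):

```python
import pandas as pd
from numpy import nan as NA

df = pd.DataFrame({'a': [1.0, NA, 3.0],
                   'b': [NA, 5.0, 6.0]})

filled_zero = df.fillna(0)          # every NaN becomes 0
filled_prev = df.ffill()            # copy the previous row's value downward
filled_mean = df.fillna(df.mean())  # each column's NaN gets that column's mean
```

Note that ffill leaves a NaN in the first row of column 'b', since there is no previous row to copy from.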

The average value for each


column.

97
More fillna
• Besides the methods introduced above, there are more fillna options.
• Use “?df.fillna” to look up the other methods.
• None of the above methods works in all cases. When dealing with missing values, you should consider the background and situation of the data and choose an appropriate fillna method.

98
HW 1-5 (1/4)
• Assume “NaN(NA)” is missing value.

import numpy as np
from numpy import nan as NA
import pandas as pd
import numpy.random as random

random.seed(0)
df2 = pd.DataFrame(np.random.rand(15, 6))

df2.iloc[2,0] = NA
df2.iloc[5:8,2] = NA
df2.iloc[7:9,3] = NA
df2.iloc[10,5] = NA

df2

100
HW 1-5 (2/4)
• 1. Delete the row with NaN.

101
HW 1-5 (3/4)
• 2. Fill 0 for NaN.

102
HW 1-5 (4/4)
• 3. Fill mean value for NaN.

103
Data Preprocessing
1. Data Cleaning (資料清洗 )
2. Impute missing value (資料補值)
3. Data Labeling (資料標註)

112
Data Labeling
• Label the features in data.

• Label the correct answer in data.

113
Example of Data Labeling
• Label the correct answer in data.

The color is not good (著色不佳) Milk Residue (乳汁吸附)

117
Tool of Data Labeling
• Labelbox
• LabelImg
– Supports the YOLO format

https://1applehealth.com/info/33701239 118
Test
• What are the 3 parts of data preprocessing?
(1) Data Cleaning (資料清洗)
(2) Imputing missing values (資料補值)
(3) Data Labeling (資料標註)

148
Have Interest in Data Analytics?
Mission: Finish a pork carrot soup

[Pipeline diagram] Stage 1: Data Collection → Stage 2: Data Storage → Stage 3: Data Preprocessing (Cleaning, Labeling, …) → Stage 4: Data Analysis → Stage 5: Data Visualization
149
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

150
Data Visualization
• Before analyzing data…….
– Sometimes you can't find a message by just observing the numbers.
– We can obtain some implicit information by data visualization.
– Enhance data understanding through charts and infographics.

• After analyzing data……


– Analysis results are easily interpreted with graphs.
– By converting the information into charts, it is easier for people
to understand.

151
Example of Data Visualization
before Analyzing Data
• Visualize the data to see whether it conforms to your understanding.

[Figure panels: before anomaly, start of anomaly, after anomaly]

152
Example of Data Visualization
After Analyzing Data
• The distribution of generated data and real world data.

153
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

154
Library for Data Visualization
• Matplotlib
– In Matplotlib, most data visualization functions are provided as “pyplot.function name.”
– Thus, after importing the library with “import matplotlib.pyplot as plt,” we can call a visualization function as “plt.function name.”
• Seaborn
– A library that makes Matplotlib's charts more beautiful.

• In the following slides, we will use Matplotlib.


155
Import Matplot Library

156
Scatter Plot (1/2)
• A scatter plot (aka scatter chart, scatter graph) uses dots
to represent values for two different numeric variables.
• The position of each dot on the horizontal and vertical
axis indicates values for an individual data point.
• Scatter plots are used to observe relationships between
variables.

157
Scatter Plot (1/2)
• “plt.plot(x, y, 'o')” generates a scatter plot.
– 'o' specifies the marker type (dots).

158
Scatter Plot (2/2)

159
Continuous Scatter Plot (1/2)
• If the data is continuous, the chart will look like a curve instead of separate dots.

160
Continuous Scatter Plot(2/2)

161
Subplot (1/2)
• We can divide a figure into multiple plots with “subplot.”
• plt.subplot(2,1,1)
– A grid with 2 rows and 1 column; draw in the first plot.
• linspace(-10, 10, 100)
– Generates 100 evenly spaced numbers from -10 to 10.
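A sketch of the two-row layout; the sine/cosine curves are illustrative, and the non-GUI backend is an assumption so the script also runs outside Jupyter (in a notebook you would use %matplotlib inline instead):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; no display window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)   # 100 evenly spaced points from -10 to 10

plt.subplot(2, 1, 1)            # grid of 2 rows x 1 column, first plot
plt.plot(x, np.sin(x))

plt.subplot(2, 1, 2)            # same grid, second plot
plt.plot(x, np.cos(x))

plt.tight_layout()
```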

162
Subplot(2/2)

163
Function Graph (1/2)
• Draw the graph of f(x) = x² + 2x + 1.

164
Function Graph (2/2)

165
Histogram (1/3)
• A histogram is an approximate representation of the distribution of numerical data.
• When you want to observe the overall picture of the data, you can use a histogram.
• With a histogram, we can see which values are more frequent, which are less frequent, and whether there is any skew.
• We can use “hist” to generate a histogram.

166
Histogram (2/3)

167
Histogram (3/3)

168
Practice (1/2)
• Generate two datasets. Each dataset contains 1000
uniform 0~1 random numbers.
• Draw histogram for two datasets.

169
Practice (2/2)
• <Hint>
• Use “np.random.uniform” to generate random numbers.
• => np.random.uniform(0.0,1.0,10)
• Use “plt.subplot” to draw two plots.
• Use “plt.tight_layout” to automatically adjust the spacing between plots.

170
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

171
Bar Chart (1/3)
• A bar chart provides a way of showing data values represented as vertical bars. It is sometimes used to show trend data and to compare multiple data sets side by side.
• We can use the bar function in the pyplot module.
• To show tags on a bar chart, we can use the “xticks” function.
• To center the bars on the ticks: align='center'

• Q: What is the difference between a histogram and a bar chart?
Bar Chart(2/3)

173
Bar Chart (3/3)

174
Horizontal Bar Chart (1/2)
• We can use “barh” to show horizontal bar chart.
• Exchange x axis and y axis and set label again.

175
Horizontal Bar Chart (2/2)

176
Show Many Bar Charts (1/2)
• Visualize first and final math grades by class for comparison.

177
Show Many Bar Charts (2/2)

178
Stacked Bar Chart (1/3)
• Also uses the “bar” function, with different settings for the “bottom” parameter.
• When drawing the series stacked on top, set bottom to the heights of the series below it:
– bar(x, upper, bottom=lower)

179
Stacked Bar Chart (2/3)

180
Stacked Bar Chart (3/3)

181
Pie Chart (1/3)
• A pie chart (or a circle chart) is a circular statistical graphic, which is
divided into slices to illustrate numerical proportion.
• In a pie chart, the arc length of each slice (and consequently
its central angle and area) is proportional to the quantity it
represents.

182
Pie Chart (2/3)

Offset the second slice by a distance of 0.1 (the explode parameter)

183
Pie Chart (3/3)

184
Bubbles Diagram (1/2)

185
Bubbles Diagram (2/2)

186
About Data Visualization
• In recent years, data analysis and data visualization have
attracted much attention.
• There are many data visualization tools, such as Tableau, Excel, and PowerBI.
• Companies usually use these tools instead of Python.
• However, Python libraries are more flexible for adjusting graphs. As engineers, we still need to learn the libraries for data visualization.

187
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

188
HW: Visualize the Price Data of
Taiwan Stock Market
• Stocks are an important investment for many people.
There are more than one million stock investors in
Taiwan. How to correctly obtain stock information is a
major issue.
• Taiwan Stock Exchange Corporation
(https://www.twse.com.tw/zh/) provides various
historical and real-time stock information of Taiwan stock
market, which is a very important website for Taiwan
stock investors.

193
Get Monthly Trading Information of
Individual stocks (English)
• Market Info -> Historical Trading Data -> Trading Value->
Monthly

194
Get Monthly Trading Information of
Individual stocks (Chinese)
• Trading Information -> After-Hours Information -> Monthly Trading Info of Individual Stocks (交易資訊 -> 盤後資訊 -> 個股月交資訊)

195
Input the Data Time (Chinese)
• For the data date, select year 113 (ROC calendar, i.e., 2024)
• Stock code: 2330 (台積電, TSMC)
• Click Query

196
Input the Data Time (English)
• Year: 2024
• Stock code: 2330 (TSMC)
• Query

197
Open Data Page (Chinese)
• Click the Print/HTML button to display the downloaded information as a table; the data can also be downloaded directly via the URL.

198
Open Data Page (English)
• Use the Print/HTML button to show the table data in another page.

199
Observe Webpage
• In Chrome, open the “developer tools.”

201
Analyze the Architecture of
Webpage
• Locate the first “table” object.
• The monthly transaction information is in the “tr” rows of the first table's tbody, and each data value is stored in a “td” cell.

202
Draw Line Chart
• Analyze the trading value of Taiwan Stock Exchange
Corporation.
• Extract the highest trading value of each month, the lowest
trading value of each month and draw a line chart.

203
Appendix:Three Steps for Crawler
1) Request the specified URL to get the response
2) Parse the response content and analyze the required
information from it
3) Save information from the previous analysis to a database
or file

205
Requests module: read website files
• To collect information systematically and automatically on the
Internet, you must extract the page content or files on the
website for processing.
• Python provides a "requests" module.
– Users can easily request the website and get the response content.

• Anaconda includes it by default.

• Query installed packages with this command: pip list
• If you use other environments, use this command: pip install -U requests

207
Send a “GET” Request
• When the browser is opened and a URL is entered, the designated web server responds after receiving the request, and the web page appears in the browser. This kind of request is called GET.
• The requests module can complete GET requests without going through a browser. The syntax is:

209
Response object attributes
• The Response object provides the following attributes to access different response content.
• text: the source code of the web page as a string
• The default encoding Requests assumes is Latin-1. If the page uses a different encoding, this often produces garbled characters. You can set the encoding of the Response object to UTF-8 or Big5:
• Response object.encoding = 'UTF-8'
• content: the binary data of the response
• status_code: the HTTP status code
– Informational responses: 100–199
– Successful responses: 200–299
– Redirects: 300–399
– Client errors: 400–499
– Server errors: 500–599

211
Appendix:Three Steps for Crawler
1) Request the specified URL to get the response
2) Parse the response content and analyze the required
information from it
3) Save information from the previous analysis to a database
or file

213
BeautifulSoup Module: Web Parsing
• BeautifulSoup: can quickly and accurately analyze and
extract specific objects in the page

• Anaconda includes it by default.

• If you use other environments, use this command: pip install -U beautifulsoup4

215
The structure of web pages (1/3)
• The content of web pages is plain text, usually saved as
.htm or .html files.
• A web page uses HTML (Hypertext Markup Language)
syntax to construct content with tags so the browser can
show the web page according to its description.

217
The structure of web pages (2/3)
• HTML provides a structured representation of documents:
DOM (Document Object Model, document object model)
• All tags are enclosed by <…..>, most have start and end tags
– <h1>Title</h1>
– <div>Block Content</div>
– <p>Paragraph</p>
– <img>Image</img>
– <a>Hyperlink</a>

219
The structure of web pages (3/3)

[Figure: DOM tree — the html element contains head and body]

• The function of the BeautifulSoup module parses the web page source code
into structured objects, allowing the program to obtain the content quickly.

221
How to use BeautifulSoup
• After importing BeautifulSoup, use the “requests” module to obtain the source code of the webpage, and then parse the source code with a parser such as “lxml.”

• Creating a BeautifulSoup object requires two parameters:
• First parameter: the source code to be parsed
• Second parameter: the parser
– ”html.parser” is Python's built-in parser
– “lxml” is a C-based parser that performs faster
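A sketch covering the attributes and the find/find_all/select functions from the following slides; the HTML string is an illustrative stand-in for a page that would in practice come from requests.get(url).text:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
  <h1 id="main">Hello</h1>
  <a class="link" href="https://example.com">first</a>
  <a class="link" href="https://example.org">second</a>
</body></html>
"""

# First parameter: the source code; second parameter: the parser.
sp = BeautifulSoup(html, 'html.parser')   # 'lxml' is faster if installed

title_text = sp.title.text     # content of the <title> tag
first_link = sp.find('a')      # first matching tag
all_links = sp.find_all('a')   # list of all matching tags
by_id = sp.select('#main')     # CSS selector by id
by_class = sp.select('.link')  # CSS selector by class
```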

223
Attributes of BeautifulSoup

Attribute Description
tag name Returns the specified tag's content, ex: sp.title returns the content of the <title> tag
text Returns the text content of the web page after removing all HTML tags

225
BeautifulSoup: find(), find_all()
Function Description
find() Finds the first matching tag and returns it, Ex: sp.find("a")
find_all() Finds all matching tags and returns them as a list, Ex: sp.find_all("a")

• Add tag attributes as search criteria

227
BeautifulSoup: select()

Function Description
select() Find the content of the specified CSS selector
such as id or class, and return it in a list
Ex:
Read by id : sp.select(“#id”)
Read by class : sp.select(“.classname”)

229
Test
• When do we need data visualization?
(1)
– Before analyzing data
– After analyzing data

• What are the libraries for data visualization?
(2)
– Matplotlib
– Seaborn

236
