
Data Preprocessing

Prof. Chia-Yu Lin


National Central University

2025 Fall
1
Have Interest in Data Analytics?
Mission: Finish a pork carrot soup

[Pipeline diagram] Stage 1: Data Collection → Stage 2: Data Storage → Stage 3: Data Preprocessing (Cleaning, Labeling, …) → Stage 4: Data Analysis → Stage 5: Data Visualization

2
Outline
• Introduction to Data Analytics Libraries
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing
• Reference book: 東京大學資料科學家養成全書:使用Python動手學習資料分析
3
Data Analysis Libraries
• There are four commonly used libraries.
• Numpy:
– A library for basic array and numerical operations.
– Besides supporting advanced and complex computation, Numpy operations run faster than equivalent plain-Python operations.
• Scipy:
– A library that further enhances the functionality of Numpy.
– Can perform statistical and signal-processing operations.
• Pandas
– A library for processing various data in the form of DataFrames.
• Matplotlib
– A library for data visualization
4
Libraries Related to Data Analysis

[Ecosystem diagram] Python (with IPython and Jupyter) at the base; Numpy, Scipy, and Sympy on top of it; then Pandas, Matplotlib, Scikit-learn, Scikit-image, StatsModels, PyMC, PyTables, NetworkX, Astropy, Biopython, Nipy, and DIPY built on those.
5
Import Library
• Using “import”
– Import the Numpy library with the name "np"
– “np.function name” to use the function of Numpy

• Using “from”
– Originally, we need to write the full path “module.submodule.function” to call a function.
– We can use “from” to omit the prefix:
– After “from numpy import random”, random.function is the same as np.random.function.
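A minimal sketch of the two import styles (the array values here are illustrative):

```python
import numpy as np

# Style 1: "import ... as" -- every function is reached via the np. prefix.
data = np.array([3, 1, 2])
sorted_data = np.sort(data)   # np.<function name>

# Style 2: "from" -- omit the np. prefix for a submodule.
from numpy import random
random.seed(0)                # same as np.random.seed(0)
value = random.randint(10)    # one integer in [0, 10)
```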

6
Magic Command
• Some modules in the Jupyter environment provide “magic commands.”
• A magic command starts with %, and performs various environment-level operations in the Jupyter environment.
• %precision
– When displaying data, you can specify the number of digits shown after the decimal point.
• %matplotlib
– The extension command of Matplotlib; it specifies how charts are displayed.
– If you use "inline”, charts are shown directly in the notebook at that position. If you do not specify %matplotlib, they are displayed in another window.
Outline
• Introduction to Data Analytics Modules
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing

8
Numpy
• NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• Numpy is written in C instead of Python, so its calculations are fast.

9
Install Numpy
• pip install numpy

10
Array Operation
• Array
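A minimal sketch of creating an array (the values are illustrative):

```python
import numpy as np

# Build a one-dimensional NumPy array from a Python list.
arr = np.array([9, 2, 3, 4, 10, 6, 7, 8, 1, 5])

print(arr)        # the elements
print(type(arr))  # <class 'numpy.ndarray'>
```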

11
Data Type (1/3)
• When using Numpy to process data, each array has a “data type” (dtype) that enables high-speed calculation and ensures the precision of the calculation.
• Integers and floating-point numbers are examples of data types.

int(Signed Integer) uint(Unsigned Integer)


Data Description Data Description
Type Type
int8 8 bits signed integer uint8 8 bits unsigned integer
int16 16 bits signed integer uint16 16 bits unsigned integer
int32 32 bits signed integer uint32 32 bits unsigned integer
int64 64 bits signed integer uint64 64 bits unsigned integer

12
Data Type (2/3)

float(Floating point) bool(Boolean)


Data Description Data Description
Type Type
float16 16 bits floating point bool Show True or False
float32 32 bits floating point
float64 64 bits floating point

13
Data Type (3/3)
• If you want to query the data type, you can specify ".dtype"
after the variable.

14
Dimension and Number of Elements
• Dimension: ndim
• Number of elements: size

15
Calculate All Elements
• In plain Python, you have to use a ”for loop” to operate on all elements.
• Numpy can operate on all elements without a for loop.
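For example (illustrative values), the same doubling written both ways:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Plain Python: loop over every element.
doubled_loop = []
for x in arr:
    doubled_loop.append(x * 2)

# NumPy: one expression operates on all elements at once.
doubled_np = arr * 2
```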

16
Sort
• Use sort. The default is ascending (small to large).
• If you want to sort from large to small, you can use “data[::-1].sort()”

• <Review>
• Q1: What is [n:m:s] in Python?
– (1) From the nth element to the (m-1)th, take every sth element.

• If n and m are omitted, what does it mean?
– (2) Take "all" elements at intervals of s.

• What does s = -1 represent?
– (3) Take elements one by one starting from the end.
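A sketch of both sort directions (values illustrative):

```python
import numpy as np

data = np.array([2, 9, 5, 1, 7])

data.sort()            # in place, ascending
ascending = data.copy()

data[::-1].sort()      # sort the reversed view -> data ends up descending
descending = data.copy()
```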

17
Min, Max, Sum, Cumulative Sum
• Array of Numpy can call:
• min: minimum
• max: maximum
• sum: summation
• cumsum: cumulative sum
– For sequence a, b, c
– First value: a
– Second value: a+b
– Third value: a+b+c
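The four calls in one sketch (illustrative values):

```python
import numpy as np

arr = np.array([3, 1, 4, 2])

minimum = arr.min()      # 1
maximum = arr.max()      # 4
total = arr.sum()        # 10
running = arr.cumsum()   # [3, 4, 8, 10] -> a, a+b, a+b+c, ...
```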

18
Random (1/2)
• When we analyze data, we use a “random function” to separate data or to generate data with different distributions.
• Python has its own random function, but during data analysis we usually use the random functions in Numpy.
• A random function generates random values based on a mathematical formula.
• The initial value of a random function is called the “seed.”
• Although it is not necessary to specify a seed, specifying the same seed ensures we obtain the same sequence of random numbers in each run.
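A sketch of seed reproducibility:

```python
import numpy as np

np.random.seed(0)            # fix the seed
first = np.random.randn(5)   # 5 standard-normal random numbers

np.random.seed(0)            # same seed again...
second = np.random.randn(5)  # ...so the same sequence comes back

same = np.array_equal(first, second)  # True
```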

19
Random (2/2)
function Definition
rand Uniform distribution, 0.0 <= x < 1.0
random_sample Uniform distribution, 0.0 <= x < 1.0 (the generating method is different from “rand”)
randint Uniform distribution. Integers within a given range.
randn Normal distribution. Random numbers whose mean is 0 and std is 1.
normal Normal distribution. Random numbers with arbitrary mean and std.
binomial Random numbers from a binomial distribution.
beta Random numbers from a beta distribution.
gamma Random numbers from a gamma distribution.
chisquare Random numbers from a chi-squared distribution.
Retrieve Data Randomly
• random.choice
• There are 2 arguments and 1 parameter.
– The first argument: the array to retrieve data from.
– The second argument: the number of data items to retrieve.
– Parameter: replace
• If replace is True or unspecified, data can be retrieved repeatedly (sampling with replacement).
• If replace is False, data cannot be retrieved repeatedly (sampling without replacement).
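A sketch of both modes (the array values are illustrative):

```python
import numpy as np

np.random.seed(0)
data = np.array([10, 20, 30, 40, 50])

# replace=True (the default): the same element may be drawn repeatedly.
with_repeat = np.random.choice(data, 8)

# replace=False: each element is drawn at most once,
# so the sample size cannot exceed len(data).
without_repeat = np.random.choice(data, 5, replace=False)
```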

21
Matrix
• We can use Numpy to do matrix operations.
• The “arange” function generates specified consecutive integers.
– arange(9) generates the integers 0~8.
• To retrieve rows or columns from a matrix, we can use “[range of rows, range of columns].”

22
Matrix Multiplication
• The “dot” function is matrix multiplication.
• If you use “*”, it is the element-wise product. (The respective elements are multiplied.)
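The two products side by side (matrices are illustrative):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)       # 0..8 as a 3x3 matrix
b = np.ones((3, 3), dtype=np.int64)  # 3x3 matrix of ones

elementwise = a * b    # element-wise: multiplying by 1 leaves a unchanged
matmul = np.dot(a, b)  # matrix product: each entry is a row sum of a
```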

23
Matrix Whose Elements Are 0 Or 1
• np.zeros: Matrix whose elements are 0
• np.ones: Matrix whose elements are 1
• dtype: Specify data type

24
HW 1-1
• 1. Using “np.array” to generate
an array contains 1-50 and show
the sum of 1-50.
• 2. Create an array with 10
random numbers generating
based on normal distribution
with random seed 0. Show the
minimum, maximum, and sum.
• 3. Create a 5*5 matrix with all
elements of 3 and calculate
matrix multiplication .

26
HW 1-2
• Generate 4*4 matrix with
normal distribution and
random seed 1.
• Generate 4*4 matrix with
normal distribution and
random seed 2.
• Calculate element-wise
product.
• Calculate matrix
multiplication.

28
Outline
• Introduction to Data Analytics Modules
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing

29
Pandas
• In Python, Pandas is a library to preprocess data before
building model.
• Pandas can flexibly process all kinds of data, and
perform operations such as table calculation, data
extraction, and search.

• EX:
– Find rows from data that match certain criteria
– Set a benchmark to calculate the average of each
– Merge data

30
Install Pandas
• pip install pandas

31
Import Pandas
• Import Pandas

• Import the Series class, which processes one-dimensional arrays.
• Import the DataFrame class, which processes two-dimensional arrays.

32
Series (1/3)
• A pandas Series is a one-dimensional labelled data structure which can hold data such as strings, integers, and even other Python objects.
• A Pandas Series is built on top of a Numpy array.
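A sketch covering values, index, and a custom index (values illustrative):

```python
import pandas as pd

# Default integer index 0, 1, 2, ...
s = pd.Series([10, 20, 30])
values = s.values   # the underlying NumPy array
index = s.index     # RangeIndex(start=0, stop=3, step=1)

# A custom text index can also be specified.
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
b_value = s2['b']   # 20
```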

[Figure: a Series showing its elements, index, and data type]
Series (2/3)
• Get data: Series.values
• Get index: Series.index

34
Series (3/3)
• Specify index in Series

35
DataFrame (1/2)
• A Pandas DataFrame is a 2 dimensional data structure, like
a 2 dimensional array, or a table with rows and columns.
• We can set a different dtype (data type) for each column.
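A sketch of a DataFrame with per-column dtypes (the column names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'],              # strings
                   'City': ['Taipei', 'Tainan', 'Taipei'],
                   'Birth_year': [1990, 1989, 1992]})  # integers

dtypes = df.dtypes   # each column keeps its own dtype
transposed = df.T    # transpose: swap rows and columns, as with a matrix
```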


36
DataFrame (2/2)
• DataFrame, like Series, can change the index value and set
the text as the index value


37
In Jupyter
• Previously, we used “print” to show a Series or DataFrame.
• In Jupyter, a Series or DataFrame is automatically recognized and rendered.
• Therefore, we can simply input the Series or DataFrame directly.

38
Exchange Rows and Columns of a DataFrame

• Exchanging rows and columns is the transpose, as with matrices (DataFrame.T).

39
Extract a Specific Column (1/2)
• To extract one specified column, specify the column name directly.

40
Extract Specific Columns (2/2)
• To extract several specified columns, use a Python “list” of column names.

(Extract columns with bracket notation, not with “.”)

41
Extract Data (1/3)
• For DataFrame objects, you can keep only the data that meets certain conditions, or combine multiple conditions => like a filter.

42
Extract Data (2/3)
• Extract the “City” column and compare it to “Taipei.” The result is a Series of True/False values.

43
Extract Data (3/3)
• If you want to specify multiple conditions, you can use
“isin(list).”
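A sketch of both filtering styles, plus the practice question; the column names and values are illustrative stand-ins for the slides' example data:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Taipei', 'Tainan', 'Taipei', 'Taichung'],
                   'Birth_year': [1990, 1989, 1992, 1985]})

# One condition: a True/False Series used directly as a row filter.
taipei_rows = df[df['City'] == 'Taipei']

# Several values at once: isin(list).
two_cities = df[df['City'].isin(['Taipei', 'Tainan'])]

# Practice: rows whose Birth_year is before 1990, excluding 1990.
before_1990 = df[df['Birth_year'] < 1990]
```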

44
Practice
• Filter the data whose Birth_year is before 1990, excluding
1990.

45
Drop Column or Row in DataFrame
(1/4)
• To remove a specific column or row, we can use “drop.”
• Use the “axis” parameter to specify row or column:
– “axis=0” drops rows
– “axis=1” drops columns

46
Drop Column or Row in DataFrame
(2/4)
• Drop Column:

47
Drop Column or Row in DataFrame
(3/4)
• When we execute “attri_data_frame1.drop([‘Birth_year’], axis=1),” the column is not deleted from the original data.
• If you want to delete the column in the data itself, you have to assign the result back (e.g., attri_data_frame1 = attri_data_frame1.drop([‘Birth_year’], axis=1)).

48
Drop Column or Row in DataFrame
(4/4)

49
Merge Data in DataFrame (1/3)
• DataFrame objects can be merged.
• Data often come from different sources; we need to merge them before analysis.
• We can use “merge.”

50
Merge Data in DataFrame (2/3)
• Merge dataframe “attri_data_frame1” and dataframe “attri_data_frame2.”
• The common column of these two dataframes is “ID.”
• Rows with the same ID are matched and merged.
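A small sketch of the merge; the frames below are illustrative stand-ins for attri_data_frame1 and attri_data_frame2:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['1', '2', '3'],
                    'City': ['Taipei', 'Tainan', 'Taichung']})
df2 = pd.DataFrame({'ID': ['2', '3', '4'],
                    'Math': [60, 30, 40]})

# merge joins on the common column "ID"; by default only IDs
# present in BOTH frames are kept (an inner join).
merged = pd.merge(df1, df2)
```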

51
Merge Data in DataFrame (3/3)

52
Statistics (1/2)
• We can compute statistics on a DataFrame.
• Use “groupby” to compute statistics grouped by a specific condition/column.

53
Statistics (2/2)
• Use the Gender column as the grouping key to calculate the average math grade.
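A sketch of the groupby, covering both this slide and the practice that follows; the grades are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'],
                   'Math': [60, 80, 70, 90],
                   'English': [50, 95, 65, 85]})

# Average math grade per gender.
avg_math = df.groupby('Gender')['Math'].mean()

# Practice: max and min English grade per gender.
max_english = df.groupby('Gender')['English'].max()
min_english = df.groupby('Gender')['English'].min()
```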

54
Practice
• Use the Gender column as the grouping key to calculate the maximum and minimum English grades.

55
Sort (1/3)
• For objects of Series and DataFrame, we can do sorting.
• Sort by index
• Sort by value

56
Sort (2/3)
• Sort by index

57
Sort (3/3)
• Sort by value
• Use “value of Birth_year column” to sort.

58
Check nan (null) (1/4)
• Sometimes there is missing information in data.
• If we calculate the average directly, we cannot get a correct result.
• Missing information should be deleted (or imputed).

59
Check nan (null) (2/4)
• Compare eligible data
– Check whether the data contains “Taipei”

60
Check nan (null) (3/4)
• Use “isnull” to check whether the value is “nan.”

61
Check nan (null) (4/4)
• Calculate the number of nan

There are 5 True in “Name” column.

62
HW 1-3 (1/5)
• Use the following data as input.

from pandas import Series,DataFrame


import pandas as pd

attri_data1 = {'ID':['1','2','3','4','5'],
'Sex':['F','F','M','M','F'],
'Money':[1000,2000,500,300,700],
'Name':['Alice','Bob','Candy','David','Ella']}

attri_data_frame1 = DataFrame(attri_data1)

64
HW 1-3 (2/5)
• Extract and show the data whose money is more than 500
(including 500).

66
HW 1-3 (3/5)
• Calculate the average money of male and female.

68
HW 1-3 (4/5)
• Input dataframe “attri_data2.”

attri_data2 = {'ID':['3','4','7'],
'Math':[60,30,40],
'English':[80,20,30]}
attri_data_frame2 = DataFrame(attri_data2)

69
HW 1-3 (5/5)
• Merge dataframe “attri_data1” and “attri_data2.”

• Show the average of “Money, Math, and English.”


• (The average of ID column can be ignored.)

72
HW 1-4
• Generate 100 data with ID, Gender, Money
• Randomly generate 100 gender data.
• np.random.seed(2)
• array2 = np.random.randint(2, size=100)
• 0:Female
• 1:Male

• Generate money randomly.


• np.random.seed(3)
• array3=np.random.normal(1000, 10, size=100)

74
HW 1-4: Generated Matrix

75
HW 1-4 (1)
• Extract the data who has least money.

77
HW 1-4 (2)
• Extract data whose money more than 1010.

79
HW 1-4 (3)
• Sort the result of question (2) by money.

81
Outline
• Introduction to Data Analytics Modules
• Basic Concepts of Numpy
• Basic Concepts of Pandas
• Data preprocessing

82
The Importance of Data Preprocessing
• If there were some soil and dirt in your carrot pork soup, how would you feel?
• Cleaning is an absolutely indispensable step in preparing food materials.
• The same goes for data: data needs preprocessing.
• Wrong data cannot produce correct results, even with a powerful analytical method.

83
Data Preprocessing
1. Data Cleaning (資料清洗 )
2. Impute missing value (資料補值)
3. Data Labeling (資料標註)

84
Data Cleaning (1/3)
• If a company's data contains boy, girl, male, female, Male, Female, M, F, etc., many values are duplicates.

[Figure: the pie chart is divided into many blocks, and many of the blocks are duplicates.]

85
https://ithelp.ithome.com.tw/articles/10199944
Data Cleaning (2/3)
• First, decide the format of the data:
– “Boy, Girl”
– “Male, Female”
– “M, F”
• If the format is set to “M, F”, we need to start the conversion process: change every other representation of male and female to “M” or “F”.
• Finally, if some values do not represent male or female at all, such as a cell phone number or an address, we should change those values to “null.”

86
Data Cleaning (3/3)

87
Data Cleaning: Outlier
• Outlier:
• There is no uniform definition; outliers are judged by data analysts or decision makers.
• Methods to judge outliers:
– Draw boxplots and treat values beyond a certain percentile as outliers.
– Use the normal distribution.
– Map data into a specific space and observe the distances among data points.

88
Data Preprocessing
1. Data Cleaning (資料清洗 )
2. Impute missing value (資料補值)
3. Data Labeling (資料標註)

89
Impute missing value
• Missing value and outlier unavoidable situations when dealing
with data.
• There are various reasons for missing value. For example,
forgetting to fill the data, system issue.
• For missing value, should it be ignored, or the closest value fill
in? ??
– Different methods will produce great deviations, which may lead to
wrong decisions and cause heavy losses. Thus, missing value should
be handled with caution.

90
Example Data
• Assume “NaN(NA)” is missing value.

#Data Preparation
import numpy as np
from numpy import nan as NA
import pandas as pd
import numpy.random as random  # needed for random.seed below

random.seed(0)
df = pd.DataFrame(np.random.rand(10, 4))

# Set some cells to NA
df.iloc[1, 0] = NA
df.iloc[2:3, 2] = NA
df.iloc[5:, 3] = NA
print(df)

92
Listwise Deletion (成批刪除)
• Delete every row that has a NaN.
• Use “dropna”; this is called list-wise deletion (成批刪除).

93
Pairwise deletion (逐對刪除)
• List-wise deletion can leave too little data, making the data unusable.
• In pairwise deletion, we ignore the columns that have missing values.
• We extract only the columns we want and then use dropna.
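A sketch of both deletion styles on a small frame (the NaN positions are illustrative):

```python
import numpy as np
from numpy import nan as NA
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3))
df.iloc[1, 0] = NA
df.iloc[3, 2] = NA

# List-wise deletion: drop every row containing any NaN.
listwise = df.dropna()

# Pair-wise deletion: restrict to the columns of interest first,
# then drop only rows that are NaN within those columns.
pairwise = df[[0, 1]].dropna()
```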

94
Filling Missing Value: fillna
• Use “fillna(value)” for NaN value.
• EX: If we want to fill 0 for NaN, we can use fillna(0).

95
Filling Missing Value: ffill
• We can use “ffill” to fill in the value of “previous row.”

96
Filling Missing Value: mean
• We can use “mean” to fill in the average value of each column.
• Notice: with time-series data, this method may use future values to compute the mean; you should avoid this.
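The three filling strategies from these slides in one sketch (the values are illustrative):

```python
import pandas as pd
from numpy import nan as NA

df = pd.DataFrame({'a': [1.0, NA, 3.0],
                   'b': [NA, 5.0, 6.0]})

filled_zero = df.fillna(0)          # every NaN becomes 0
filled_prev = df.ffill()            # copy the previous row's value downward
filled_mean = df.fillna(df.mean())  # each column's NaN gets that column's mean
```

Note that ffill leaves a NaN in the first row of column 'b', since there is no previous row to copy from.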

The average value for each


column.

97
More fillna
• Besides the methods introduced above, there are more fillna options.
• Use “?df.fillna” to look up the other methods.
• None of the above methods works in all cases. When dealing with missing values, you should consider the background and situation of the data and choose an appropriate fillna method.

98
HW 1-5 (1/4)
• Assume “NaN(NA)” is missing value.

import numpy as np
from numpy import nan as NA
import pandas as pd
import numpy.random as random

random.seed(0)
df2 = pd.DataFrame(np.random.rand(15, 6))

df2.iloc[2,0] = NA
df2.iloc[5:8,2] = NA
df2.iloc[7:9,3] = NA
df2.iloc[10,5] = NA

df2

100
HW 1-5 (2/4)
• 1. Delete the row with NaN.

101
HW 1-5 (3/4)
• 2. Fill 0 for NaN.

102
HW 1-5 (4/4)
• 3. Fill mean value for NaN.

103
Data Preprocessing
1. Data Cleaning (資料清洗 )
2. Impute missing value (資料補值)
3. Data Labeling (資料標註)

112
Data Labeling
• Label the features in data.

• Label the correct answer in data.

113
Example of Data Labeling
• Label the correct answer in data.

The color is not good (著色不佳) Milk Residue (乳汁吸附)

117
Tool of Data Labeling
• Labelbox
• LabelImg
– Supports the YOLO format

https://1applehealth.com/info/33701239 118
Test
• What are the 3 parts of data preprocessing?
(1) Data Cleaning (資料清洗)
(2) Imputing missing values (資料補值)
(3) Data Labeling (資料標註)

148
Have Interest in Data Analytics?
Mission: Finish a pork carrot soup

[Pipeline diagram] Stage 1: Data Collection → Stage 2: Data Storage → Stage 3: Data Preprocessing (Cleaning, Labeling, …) → Stage 4: Data Analysis → Stage 5: Data Visualization
149
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

150
Data Visualization
• Before analyzing data…….
– Sometimes you can't find a message by just observing the numbers.
– We can obtain some implicit information by data visualization.
– Enhance data understanding through charts and infographics.

• After analyzing data……


– Analysis results are easily interpreted with graphs.
– By converting the information into charts, it is easier for people
to understand.

151
Example of Data Visualization
before Analyzing Data
• Visualize the data to see whether it conforms to your understanding.

[Figure panels: before anomaly, start of anomaly, after anomaly]

152
Example of Data Visualization
After Analyzing Data
• The distribution of generated data and real world data.

153
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

154
Library for Data Visualization
• Matplotlib
– In Matplotlib, most data visualization functions are provided as “pyplot.function name.”
– Thus, after importing the library with “import matplotlib.pyplot as plt,” we can call a visualization function as “plt.function name.”
• Seaborn
– A library that makes Matplotlib's charts more beautiful.

• In the following slides, we will use Matplotlib.


155
Import Matplot Library

156
Scatter Plot (1/2)
• A scatter plot (aka scatter chart, scatter graph) uses dots
to represent values for two different numeric variables.
• The position of each dot on the horizontal and vertical
axis indicates values for an individual data point.
• Scatter plots are used to observe relationships between
variables.

157
Scatter Plot (1/2)
• “plt.plot(x, y, 'o')” generates a scatter plot.
– 'o' specifies the marker type (dots).

158
Scatter Plot (2/2)

159
Continuous Scatter Plot (1/2)
• If the data is continuous, the chart will look like a curve instead of separate dots.

160
Continuous Scatter Plot(2/2)

161
Subplot (1/2)
• We can divide a figure into multiple plots with “subplot.”
• plt.subplot(2,1,1)
– A grid with 2 rows and 1 column; draw in the first plot.
• linspace(-10, 10, 100)
– Generates 100 evenly spaced numbers from -10 to 10.
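A sketch of the two-row layout; the sine/cosine curves are illustrative, and the non-GUI backend is an assumption so the script also runs outside Jupyter (in a notebook you would use %matplotlib inline instead):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; no display window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)   # 100 evenly spaced points from -10 to 10

plt.subplot(2, 1, 1)            # grid of 2 rows x 1 column, first plot
plt.plot(x, np.sin(x))

plt.subplot(2, 1, 2)            # same grid, second plot
plt.plot(x, np.cos(x))

plt.tight_layout()
```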

162
Subplot(2/2)

163
Function Graph (1/2)
• Draw the graph of f(x) = x² + 2x + 1.

164
Function Graph (2/2)

165
Histogram (1/3)
• A histogram is an approximate representation of the distribution of numerical data.
• When you want to observe the overall picture of the data, you can use a histogram.
• With a histogram, we can see which values are more frequent, which are less frequent, and whether there is any skew.
• We can use “hist” to generate a histogram.

166
Histogram (2/3)

167
Histogram (3/3)

168
Practice (1/2)
• Generate two datasets. Each dataset contains 1000
uniform 0~1 random numbers.
• Draw histogram for two datasets.

169
Practice (2/2)
• <Hint>
• Use “np.random.uniform” to generate random numbers.
• => np.random.uniform(0.0,1.0,10)
• Use “plt.subplot” to draw two plots.
• Use “plt.tight_layout” to automatically adjust the spacing between plots.

170
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

171
Bar Chart (1/3)
• A bar chart provides a way of showing data values represented as vertical bars. It is sometimes used to show trend data and to compare multiple data sets side by side.
• We can use the bar function in the pyplot module.
• To show tags on a bar chart, we can use the “xticks” function.
• To center the bars on the ticks: align='center'

• Q: What is the difference between a histogram and a bar chart?
Bar Chart(2/3)

173
Bar Chart (3/3)

174
Horizontal Bar Chart (1/2)
• We can use “barh” to show horizontal bar chart.
• Exchange x axis and y axis and set label again.

175
Horizontal Bar Chart (2/2)

176
Show Many Bar Charts (1/2)
• Visualize first and final math grades by class for comparison.

177
Show Many Bar Charts (2/2)

178
Stacked Bar Chart (1/3)
• Also uses the “bar” function, with different settings for the “bottom” parameter.
• When drawing the series stacked on top, set bottom to the heights of the series below it:
– bar(x, upper, bottom=lower)

179
Stacked Bar Chart (2/3)

180
Stacked Bar Chart (3/3)

181
Pie Chart (1/3)
• A pie chart (or a circle chart) is a circular statistical graphic, which is
divided into slices to illustrate numerical proportion.
• In a pie chart, the arc length of each slice (and consequently
its central angle and area) is proportional to the quantity it
represents.

182
Pie Chart (2/3)

Offset the second slice by a distance of 0.1 (the explode parameter)

183
Pie Chart (3/3)

184
Bubbles Diagram (1/2)

185
Bubbles Diagram (2/2)

186
About Data Visualization
• In recent years, data analysis and data visualization have
attracted much attention.
• There are many data visualization tools, such as Tableau, Excel, and PowerBI.
• Companies usually use these tools instead of Python.
• However, Python libraries are more flexible for adjusting graphs. As engineers, we still need to learn the libraries for data visualization.

187
Outline of Data Visualization
• What is data visualization?
• Basic library for data visualization.
• Advanced library data visualization.
• Stock market data visualization.

188
HW: Visualize the Price Data of
Taiwan Stock Market
• Stocks are an important investment for many people.
There are more than one million stock investors in
Taiwan. How to correctly obtain stock information is a
major issue.
• Taiwan Stock Exchange Corporation
(https://www.twse.com.tw/zh/) provides various
historical and real-time stock information of Taiwan stock
market, which is a very important website for Taiwan
stock investors.

193
Get Monthly Trading Information of
Individual stocks (English)
• Market Info -> Historical Trading Data -> Trading Value->
Monthly

194
Get Monthly Trading Information of
Individual stocks (Chinese)
• Trading Information -> After-Hours Information -> Monthly Trading Info of Individual Stocks (交易資訊 -> 盤後資訊 -> 個股月交資訊)

195
Input the Data Time (Chinese)
• For the data date, select year 113 (ROC calendar, i.e., 2024)
• Stock code: 2330 (台積電, TSMC)
• Click Query

196
Input the Data Time (English)
• Year: 2024
• Stock code: 2330 (TSMC)
• Query

197
Open Data Page (Chinese)
• Click the Print/HTML button to display the downloaded information as a table; the data can also be downloaded directly via the URL.

198
Open Data Page (English)
• Use the Print/HTML button to show the table data in another page.

199
Observe Webpage
• In Chrome, open the “developer tools.”

201
Analyze the Architecture of
Webpage
• Locate the first “table” object.
• The monthly transaction information is in the “tr” rows of the first table's tbody, and each data value is stored in a “td” cell.

202
Draw Line Chart
• Analyze the trading value of Taiwan Stock Exchange
Corporation.
• Extract the highest trading value of each month, the lowest
trading value of each month and draw a line chart.

203
Appendix:Three Steps for Crawler
1) Request the specified URL to get the response
2) Parse the response content and analyze the required
information from it
3) Save information from the previous analysis to a database
or file

205
Requests module: read website files
• To collect information systematically and automatically on the
Internet, you must extract the page content or files on the
website for processing.
• Python provides a "requests" module.
– Users can easily request the website and get the response content.

• Anaconda includes it by default.

• Query installed packages with this command: pip list
• If you use other environments, use this command: pip install -U requests

207
Send a “GET” Request
• When the browser is opened and a URL is entered, the designated web server responds after receiving the request, and the web page appears in the browser. This kind of request is called GET.
• The requests module can complete GET requests without going through a browser. The syntax is:

209
Response object attributes
• The Response object provides the following attributes to access different response content.
• text: the source code of the web page as a string
• The default encoding Requests assumes is Latin-1. If the page uses a different encoding, this often produces garbled characters. You can set the encoding of the Response object to UTF-8 or Big5:
• Response object.encoding = 'UTF-8'
• content: the binary data of the response
• status_code: the HTTP status code
– Informational responses: 100–199
– Successful responses: 200–299
– Redirects: 300–399
– Client errors: 400–499
– Server errors: 500–599

211
Appendix:Three Steps for Crawler
1) Request the specified URL to get the response
2) Parse the response content and analyze the required
information from it
3) Save information from the previous analysis to a database
or file

213
BeautifulSoup Module: Web Parsing
• BeautifulSoup: can quickly and accurately analyze and
extract specific objects in the page

• Anaconda includes it by default.

• If you use other environments, use this command: pip install -U beautifulsoup4

215
The structure of web pages (1/3)
• The content of web pages is plain text, usually saved as
.htm or .html files.
• A web page uses HTML (Hypertext Markup Language)
syntax to construct content with tags so the browser can
show the web page according to its description.

217
The structure of web pages (2/3)
• HTML provides a structured representation of documents:
DOM (Document Object Model, document object model)
• All tags are enclosed by <…..>, most have start and end tags
– <h1>Title</h1>
– <div>Block Content</div>
– <p>Paragraph</p>
– <img>Image</img>
– <a>Hyperlink</a>

219
The structure of web pages (3/3)

[Figure: DOM tree — the html element contains head and body]

• The function of the BeautifulSoup module parses the web page source code
into structured objects, allowing the program to obtain the content quickly.

221
How to use BeautifulSoup
• After importing BeautifulSoup, use the “requests” module to obtain the source code of the webpage, and then parse the source code with a parser such as “lxml.”

• Creating a BeautifulSoup object requires two parameters:
• First parameter: the source code to be parsed
• Second parameter: the parser
– ”html.parser” is Python's built-in parser
– “lxml” is a C-based parser that performs faster
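A sketch covering the attributes and the find/find_all/select functions from the following slides; the HTML string is an illustrative stand-in for a page that would in practice come from requests.get(url).text:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
  <h1 id="main">Hello</h1>
  <a class="link" href="https://example.com">first</a>
  <a class="link" href="https://example.org">second</a>
</body></html>
"""

# First parameter: the source code; second parameter: the parser.
sp = BeautifulSoup(html, 'html.parser')   # 'lxml' is faster if installed

title_text = sp.title.text     # content of the <title> tag
first_link = sp.find('a')      # first matching tag
all_links = sp.find_all('a')   # list of all matching tags
by_id = sp.select('#main')     # CSS selector by id
by_class = sp.select('.link')  # CSS selector by class
```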

223
Attributes of BeautifulSoup

Attribute Description
tag name Returns the specified tag's content, ex: sp.title returns the content of the <title> tag
text Returns the text content of the web page after removing all HTML tags

225
BeautifulSoup: find(), find_all()
Function Description
find() Finds the first matching tag and returns it, Ex: sp.find("a")
find_all() Finds all matching tags and returns them as a list, Ex: sp.find_all("a")

• Add tag attributes as search criteria

227
BeautifulSoup: select()

Function Description
select() Find the content of the specified CSS selector
such as id or class, and return it in a list
Ex:
Read by id : sp.select(“#id”)
Read by class : sp.select(“.classname”)

229
Test
• When do we need data visualization?
(1)
– Before analyzing data
– After analyzing data

• What are the libraries for data visualization?
(2)
– Matplotlib
– Seaborn

236
