AI VIET NAM — COURSE 2024
Data Analysis - Exercise
August 17, 2024
Part I: Theory
Pandas is a Python library that is fast, powerful, flexible, easy to use, and open source; it is a tool for analyzing and manipulating data. Pandas is built on top of the NumPy library and provides many functions for cleaning, analyzing, and manipulating data, which helps us extract valuable insights from datasets. Pandas is very effective on tabular data such as SQL tables or Excel spreadsheets.
Figure 1: The Pandas library logo
Some features of Pandas:
* Works with data sources such as CSV files, Excel files, SQL, and JSON files.
* Provides several data structures such as Series, DataFrame, and Panel.
* Can handle many kinds of datasets such as time series, heterogeneous data, tabular data, and matrix data.
* Can work with missing data by deleting it, filling it with zeros, or filling it with another suitable value.
* Can be used for parsing and converting data.
* Provides data filtering techniques.
* Provides time series functionality: date range generation, frequency conversion, moving window statistics, data shifting and lagging.
* Integrates well with other Python libraries such as scikit-learn, statsmodels, and SciPy.
* Offers high performance.
Data structures in Pandas: Pandas is built on NumPy arrays and includes Series, DataFrame, and Panel:
* Series: A 1D array of homogeneous data; the data type can be integer, string, float, ... The axis labels are called the index. The size of a Series is immutable, while its values are mutable. A Series can be created with pandas.Series(data, index, dtype, copy), where:
  - data: accepts values in the form of an ndarray, list, dictionary, or constant.
  - index: index values should be unique, hashable, and the same length as data; by default the index takes the values 0, 1, 2, ...
  - dtype: the data type of the values inside the Series.
[Figure: an index column shown next to its data column]
Figure 2: An example of a Series in Pandas
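The Series constructor described above can be sketched as follows. This is a minimal illustration, not part of the original exercise; the labels film_a, film_b, film_c and their values are made up for the example.

```python
import pandas as pd

# Hypothetical data: three ratings labelled by made-up film names
ratings = pd.Series(
    data=[8.1, 7.0, 7.3],
    index=["film_a", "film_b", "film_c"],
    dtype="float64",
)

# Values are mutable: an existing entry can be overwritten
ratings["film_b"] = 7.5

# Without an explicit index, pandas assigns 0, 1, 2, ...
default_index = pd.Series(["x", "y", "z"])

print(ratings)
print(list(default_index.index))
```

The explicit index turns the Series into a labelled array, so values are looked up by label rather than by position.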
* DataFrame: A 2D, table-like data structure made up of columns and rows, where each column can have a different data type, such as float64, int, bool, ... Each column of a DataFrame is a Series. The dimensions of a DataFrame are labeled by rows and columns, so we can operate on both rows and columns. A DataFrame can be created with pandas.DataFrame(data, index, columns, dtype, copy), where:
  - data: accepts values such as ndarray, Series, map, lists, dict, constants, and other DataFrames.
  - The other parameters are the same as for Series. A pandas DataFrame can be created from inputs such as lists, dicts, Series, NumPy ndarrays, or other DataFrames.
[Figure: two Series sharing the index 0 to 3, placed side by side to form a two-column DataFrame]
Figure 3: An example of a DataFrame in Pandas. A DataFrame can be viewed as a collection of Series.
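As a small sketch of the idea in Figure 3, a DataFrame can be assembled from Series that share an index. The column names Name and Gender and the values below are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical columns: a name column and a gender column
names = pd.Series(["A", "B", "C", "D"])
genders = pd.Series(["M", "F", "F", "M"])

# Each dict key becomes a column; each column is itself a Series
people = pd.DataFrame({"Name": names, "Gender": genders})

# Selecting one column gives back a Series
name_column = people["Name"]

print(people)
print(type(name_column).__name__)
```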
* Panel: A 3D container (removed from pandas in version 0.25), where:
  - items: axis 0, each item corresponds to one DataFrame held inside.
  - major_axis: axis 1, the rows of each DataFrame.
  - minor_axis: axis 2, the columns of each DataFrame.
[Figure: several DataFrames stacked along the items axis, with the major axis running down the rows]
Figure 4: An example of a Panel in Pandas. A Panel can be viewed as a collection of DataFrames.
Some commonly used Pandas functions for data processing:
* Handle missing values: isna(), notna(), isnull() to find NA values.
* Indexing and slicing in Pandas: .loc (label based), .iloc (integer based), .ix (label and integer based; removed in pandas 1.0).
* Queries like those in Excel or SQL: where(), query()
* Sort: sort_index(), sort_values()
* Series basic functionality: axes, dtype, empty, ndim, size, values, head(), tail()
* DataFrame basic functionality: T, axes, dtypes, empty, ndim, shape, size, values, head(), tail()
* Statistics-related functions: count(), sum(), mean(), median(), mode(), std(), min(), max(), abs(), prod(), cumsum(), cumprod(), describe(), pct_change(), cov(), corr(), rank(), var(), skew(), apply(), ...
* Functions for filtering data: groupby(), get_group(), merge(), concat(), append() (removed in pandas 2.0), melt(), pivot(), pivot_table()
* Some other functions: get_option(), set_option(), reset_option(), describe_option(), option_context()
Part II: Exercises
In this part, we use pandas to carry out several analysis techniques on two datasets: one on text data and one on time series. The exercises are split into the steps carried out in each problem.
A. Data Analysis with IMDB Movie data
The IMDB Movie dataset is a movie rating dataset used to analyze how much interest films attract according to criteria such as the cast, the film title, ... in order to draw insights and make predictions for the future. You can download the IMDB-Movie-Data.csv dataset here.
The steps to carry out in this problem:
1. Read data
2. View the data
3. Understand some basic information about the data
4. Data Selection - Indexing and Slicing data
5. Data Selection - Based on Conditional filtering
6. Groupby operations
7. Sorting operation
8. View missing values
9. Deal with missing values - Deleting
10. Deal with missing values - Filling
11. Apply() functions
We now carry out each step and comment on it; the code is run on Google Colab:
1. Import libraries and load dataset: To read a CSV file in pandas, we can use the read_csv function as follows:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    dataset_path = 'IMDB-Movie-Data.csv'

    # Read data from .csv file
    data = pd.read_csv(dataset_path)
In addition, we can read the file and at the same time specify a column to use as the index of the table (by default, pandas creates a separate index). Here, we can choose the Title column as the index, as follows (an index column should not contain duplicate values):
    # Read data with an explicitly specified index
    # We will use this later in our analysis
    data_indexed = pd.read_csv(dataset_path, index_col="Title")
2. View the data: Take a look at the first 5 rows of the table using head():
    # Preview the top 5 rows using head()
    data.head()
[Figure: the first five rows of the table, with columns including Rank, Title, Genre, Description, Director, and Actors]
Figure 5: The first few samples of the dataset
3. Understand some basic information about the data:
    # Let's first understand the basic information about this data
    data.info()
[Output: a pandas.core.frame.DataFrame with a RangeIndex of 1000 entries (0 to 999) and 12 columns: Rank, Title, Genre, Description, Director, Actors, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), Metascore. Every column has 1000 non-null values except Revenue (Millions) with 872 and Metascore with 936.]
Figure 6: Basic information about the table
    data.describe()
[Output: count, mean, std, min, 25%, 50%, 75%, and max for the numeric columns Rank, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), and Metascore]
Figure 7: Summary statistics
From the dataset we can see that:
* The min and max of Year show that the dataset contains films from 2006 to 2016.
* The average Rating of the films is 6.7; the lowest is 1.9 and the highest is 9.0.
* The highest revenue reached is 936.6 million dollars.
4. Data Selection - Indexing and Slicing data: From the table, we can extract any column so that it becomes a Series or a DataFrame, depending on the extraction method we use. Here, we extract some columns of data using the Indexing technique. To extract a column as a Series, we do:
    # Extract data as a Series
    genre = data['Genre']
    genre

    0       Action,Adventure,Sci-Fi
    1      Adventure,Mystery,Sci-Fi
    2               Horror,Thriller
    3       Animation,Comedy,Family
    4      Action,Adventure,Fantasy
                     ...
    995         Crime,Drama,Mystery
    996                      Horror
    997         Drama,Music,Romance
    998            Adventure,Comedy
    999       Comedy,Family,Fantasy
    Name: Genre, Length: 1000, dtype: object
Figure 8: Extracting the Genre column as a Series
To extract a column as a DataFrame, we do:
    # Extract data as a DataFrame
    data[['Genre']]

                              Genre
    0       Action,Adventure,Sci-Fi
    1      Adventure,Mystery,Sci-Fi
    2               Horror,Thriller
    3       Animation,Comedy,Family
    4      Action,Adventure,Fantasy
    ...                         ...
    995         Crime,Drama,Mystery
    996                      Horror
    997         Drama,Music,Romance
    998            Adventure,Comedy
    999       Comedy,Family,Fantasy
    1000 rows x 1 columns
Figure 9: Extracting the Genre column as a DataFrame
We can select and extract several columns at once, creating a new DataFrame:
    some_cols = data[['Title', 'Genre', 'Actors', 'Director', 'Rating']]
For rows, we can extract a given range of rows, from index X to index Y of the table; this is called Slicing. For example, to extract rows 10 through 14 (the end of the slice, 15, is excluded), we do the following:
    data.iloc[10:15][['Title', 'Rating', 'Revenue (Millions)']]
Combined with column selection, we get a table of 5 samples with the fields Title, Rating, and Revenue (Millions).
[Output: rows 10 to 14 with the columns Title, Rating, and Revenue (Millions), starting with "Fantastic Beasts and Where to Find Them" (7.5, 234.02) and "Hidden Figures" (7.8, 169.27)]
Figure 10: Extracting some columns to create a new DataFrame
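The difference between single brackets, double brackets, and iloc slicing can be checked on a small table. The sketch below uses a made-up four-row DataFrame (the titles and numbers are hypothetical), not the IMDB file.

```python
import pandas as pd

# Hypothetical stand-in for the movie table
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Rating": [7.5, 6.1, 8.2, 5.4],
    "Year": [2010, 2012, 2014, 2016],
})

as_series = movies["Rating"]     # single brackets give a Series
as_frame = movies[["Rating"]]    # double brackets give a DataFrame

# Rows at positions 1 and 2 (the end position 3 is excluded), two columns
subset = movies.iloc[1:3][["Title", "Rating"]]

print(type(as_series).__name__, type(as_frame).__name__)
print(subset)
```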
5. Data Selection - Based on Conditional filtering: We can also take rows from the table based on a set of conditions. For example, suppose we want the films from 2010 to 2015 that have a rating below 6.0 but whose revenue is in the top 5% of the whole dataset. We can implement this as follows:
    data[((data['Year'] >= 2010) & (data['Year'] <= 2015))
         & (data['Rating'] < 6.0)
         & (data['Revenue (Millions)'] > data['Revenue (Millions)'].quantile(0.95))]
[Output: a single row, Rank 942, a film from the Twilight series (Adventure,Drama,Fantasy) with Rating 4.9]
Figure 11: High-revenue films in the period 2010-2015
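To see how the combined mask behaves, here is a minimal sketch on synthetic data. The ten rows and the short column name Revenue are assumptions for illustration; only the filtering pattern mirrors the step above.

```python
import pandas as pd

# Hypothetical table with one very high-revenue, low-rated film
movies = pd.DataFrame({
    "Title": [f"Film {i}" for i in range(10)],
    "Year": [2008, 2010, 2011, 2012, 2013, 2014, 2015, 2015, 2016, 2016],
    "Rating": [7.0, 5.5, 6.5, 5.9, 8.0, 4.8, 7.2, 5.0, 5.1, 9.0],
    "Revenue": [10, 20, 30, 40, 50, 900, 60, 70, 80, 90],
})

# 95th percentile of revenue over the whole table
threshold = movies["Revenue"].quantile(0.95)

# Each condition is a boolean Series; & combines them element-wise
mask = (
    (movies["Year"] >= 2010)
    & (movies["Year"] <= 2015)
    & (movies["Rating"] < 6.0)
    & (movies["Revenue"] > threshold)
)
selected = movies[mask]
print(selected)
```

Each comparison needs its own parentheses because & binds more tightly than the comparison operators.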
6. Groupby Operations: Groupby groups the data based on one or more variables (here, the columns of the table). For example, we can find the average rating each director achieved by grouping the Rating values of the films by Director:
    data.groupby('Director')[['Rating']].mean().head()

                         Rating
    Director
    Aamir Khan              8.5
    Abdellatif Kechiche     7.8
    Adam Leon               6.5
    Adam McKay              7.8
    Adam Shankman           6.3
Figure 12: Using groupby to find the average rating achieved by the directors in the dataset
7. Sorting Operations: Sorting lets us order the rows of the table in ascending or descending order according to the values of a given column. For example, building on the groupby result above, we can find the top 5 directors with the highest average rating as follows:
    data.groupby('Director')[['Rating']].mean().sort_values(['Rating'], ascending=False).head()

                       Rating
    Director
    Nitesh Tiwari        8.80
    Christopher Nolan    8.68
    Olivier Nakache      8.60
    Makoto Shinkai       8.60
    Aamir Khan           8.50
Figure 13: The 5 directors with the highest average Rating
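The groupby and sorting steps can be tried on a toy table first. In this sketch the director names Ann, Bob, and Cy and their ratings are invented for the example.

```python
import pandas as pd

# Hypothetical ratings for three made-up directors
movies = pd.DataFrame({
    "Director": ["Ann", "Ann", "Bob", "Cy", "Cy", "Cy"],
    "Rating": [8.0, 9.0, 6.0, 7.0, 7.5, 8.0],
})

# Mean rating per director (one row per group)
mean_rating = movies.groupby("Director")[["Rating"]].mean()

# Highest averages first, keep the top 2
top = mean_rating.sort_values("Rating", ascending=False).head(2)
print(top)
```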
8. View missing values: Datasets often have empty values (missing values) in some fields of some samples. When processing data, we need to deal with this problem. So the first thing to do is check where data is missing, as follows:
    # Count the null values in each column
    data.isnull().sum()

    Rank                    0
    Title                   0
    Genre                   0
    Description             0
    Director                0
    Actors                  0
    Year                    0
    Runtime (Minutes)       0
    Rating                  0
    Votes                   0
    Revenue (Millions)    128
    Metascore              64
    dtype: int64
Figure 14: The number of null values in each column of the table
Here we see that Revenue (Millions) and Metascore are the two columns that contain null data. There are two main options for handling missing data: either fill the empty cells with some value or remove them.
9. Deal with missing values - Deleting: For the removal option, we can remove an entire column that contains many null values (if possible) or remove only the rows that contain null values. To delete a column, we do:
    # Use the drop function to drop columns
    data.drop('Metascore', axis=1).head()
To delete rows, we use dropna():
    data.dropna()
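A minimal sketch of counting and deleting missing values on a small made-up table is shown below; the rows are hypothetical, and the subset argument of dropna is an addition not used in the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps in two columns
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Revenue": [100.0, np.nan, 50.0, np.nan],
    "Metascore": [70.0, 65.0, np.nan, 80.0],
})

null_counts = movies.isnull().sum()             # nulls per column
without_column = movies.drop("Metascore", axis=1)  # drop a whole column
complete_rows = movies.dropna()                 # drop rows with any null
revenue_known = movies.dropna(subset=["Revenue"])  # only check one column

print(null_counts)
print(complete_rows)
```

None of these calls modifies movies; each returns a new object, so the result has to be assigned if it should be kept.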
10. Deal with missing values - Filling: For the option of filling new values into the empty cells, we can use values such as the mean, median, ... of the corresponding column as the replacement (the choice of replacement value depends on the nature of the dataset, the problem being solved, ...). For example, some rows have a null Revenue; we can assign them the mean value as follows:
    revenue_mean = data_indexed['Revenue (Millions)'].mean()
    print("The mean revenue is: ", revenue_mean)

    # We can fill the null values with this mean revenue.
    # Assigning the result back avoids calling fillna(inplace=True) on a
    # column selection, which may not update the original DataFrame.
    data_indexed['Revenue (Millions)'] = data_indexed['Revenue (Millions)'].fillna(revenue_mean)
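The effect of mean filling can be verified on a few values. This is a sketch with invented numbers and a shortened column name Revenue.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue column indexed by film title, with two gaps
revenue = pd.DataFrame(
    {"Revenue": [100.0, np.nan, 50.0, np.nan]},
    index=["Film A", "Film B", "Film C", "Film D"],
)

# mean() skips NaN values: (100 + 50) / 2
revenue_mean = revenue["Revenue"].mean()

# Replace every NaN with the mean and store the result back in the column
revenue["Revenue"] = revenue["Revenue"].fillna(revenue_mean)
print(revenue)
```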
11. apply() functions: Apply functions are used when we want to run some function on the rows of the table. After it runs, the value returned by the function becomes the new value for the corresponding row. For example, to classify films into three levels ['Good', 'Average', 'Bad'] based on Rating, we can define a function that does this and apply it to the DataFrame:
    # Classify movies based on ratings
    def rating_group(rating):
        if rating >= 7.5:
            return 'Good'
        elif rating >= 6.0:
            return 'Average'
        else:
            return 'Bad'

    # Let's apply this function on our movies data,
    # creating a new variable in the dataset to hold the rating category
    data['Rating_category'] = data['Rating'].apply(rating_group)

    data[['Title', 'Director', 'Rating', 'Rating_category']].head(5)
[Output: the first five films (Guardians of the Galaxy, Prometheus, Split, Sing, Suicide Squad) with their Director, Rating, and Rating_category; the first is classified Good and the other four Average]
Figure 15: The DataFrame after applying the rating_group() function. The value returned by the function is placed in a new column named Rating_category.
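The same classification can be checked on a handful of values. The sketch below also shows pd.cut as one possible vectorized alternative; that alternative is an addition for comparison, not the method used in the exercise, and the four ratings are made up.

```python
import numpy as np
import pandas as pd

def rating_group(rating):
    """Map a numeric rating to 'Good', 'Average', or 'Bad'."""
    if rating >= 7.5:
        return "Good"
    elif rating >= 6.0:
        return "Average"
    return "Bad"

# Hypothetical ratings, including the two boundary values
ratings = pd.Series([8.1, 7.5, 6.0, 5.9])

# Element-wise application of the Python function
by_apply = ratings.apply(rating_group)

# Alternative: left-closed bins [-inf, 6.0), [6.0, 7.5), [7.5, inf)
by_cut = pd.cut(
    ratings,
    bins=[-np.inf, 6.0, 7.5, np.inf],
    right=False,
    labels=["Bad", "Average", "Good"],
)
print(by_apply.tolist())
```

apply() is the more flexible option because it accepts any Python function, while pd.cut only covers binning of numeric values.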
B. Data Analysis with Time Series data
Time series data is data whose values are measured at different points in time. Some time series are spaced at a fixed frequency, for example the weather every hour, the number of website visits per day, or the total revenue per month. Time series data can also be irregularly spaced, for example the number of emergency calls in a day or system logs.
[Figure: a variable plotted against time]
Figure 16: Illustration of a time series plot
In this part, we explore how to wrangle and visualize time series data. Specifically, using energy time series data, we get familiar with and apply the techniques of time-based indexing, resampling, and rolling. This helps us analyze important aspects of the data. For example, rolling windows can help us explore variations in electricity demand and renewable energy supply over time. We use the daily time series dataset from Open Power System Data (OPSD) for Germany, which contains the total electricity consumption, wind power production, and solar power production nationwide for the period 2006-2017. You can download the opsd_germany_daily.csv dataset here.
We will work through the following topics:
* Import libraries and read dataset
* Time-based indexing
* Visualizing time series data
* Seasonality
* Frequencies
* Resampling
* Rolling windows
* Trends
We will explore how electricity consumption and production in Germany have changed over time, and answer the following questions:
* When is electricity consumption typically highest and lowest?
* How does wind and solar power production vary with the seasons?
* What are the long-term trends in electricity consumption, solar power, and wind power?
* How does wind and solar power production compare with electricity consumption, and how has this ratio changed over time?
1. Import libraries and read dataset: First, we again use the read_csv() function to read the table:
    import pandas as pd

    dataset_path = './opsd_germany_daily.csv'

    # Read data from .csv file
    opsd_daily = pd.read_csv(dataset_path)

    print(opsd_daily.shape)
    print(opsd_daily.dtypes)
    opsd_daily.head(3)
We get the result shown in the figure below, where we can see many missing values in the Wind, Solar, and Wind+Solar columns:
    (4383, 5)
    Date            object
    Consumption    float64
    Wind           float64
    Solar          float64
    Wind+Solar     float64
    dtype: object

             Date  Consumption  Wind  Solar  Wind+Solar
    0  2006-01-01     1069.184   NaN    NaN         NaN
    1  2006-01-02     1380.521   NaN    NaN         NaN
    2  2006-01-03     1442.533   NaN    NaN         NaN
Figure 17: The first few rows of the DataFrame
For time series data, we can choose the Date column as the index (because the values of this column in the dataset are always unique):
    opsd_daily = opsd_daily.set_index('Date')
    opsd_daily.head(3)

                Consumption  Wind  Solar  Wind+Solar
    Date
    2006-01-01     1069.184   NaN    NaN         NaN
    2006-01-02     1380.521   NaN    NaN         NaN
    2006-01-03     1442.533   NaN    NaN         NaN
Figure 18: The table after choosing the Date column as the index
We can redo the file loading step, this time specifying the index column right in the function call, and at the same time add the columns Year, Month, and Weekday Name extracted from the Date column to make some later steps more convenient:
    opsd_daily = pd.read_csv('opsd_germany_daily.csv', index_col=0, parse_dates=True)

    # Add columns with year, month, and weekday name
    opsd_daily['Year'] = opsd_daily.index.year
    opsd_daily['Month'] = opsd_daily.index.month
    opsd_daily['Weekday Name'] = opsd_daily.index.day_name()

    # Display a random sampling of 5 rows
    opsd_daily.sample(5, random_state=0)
[Output: five randomly sampled rows showing the new Year, Month, and Weekday Name columns, for example 2013-08-08 with Consumption 1291.984, Wind 79.666, Solar 93.371, Wind+Solar 173.037, Year 2013, Month 8, Thursday]
Figure 19: The DataFrame with the new columns: Year, Month, Weekday Name
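Without the CSV file at hand, the same index handling can be sketched on three rows typed in by hand. The values are the first three Consumption readings shown in Figure 17; converting the column with pd.to_datetime plays the role that parse_dates=True plays when reading the file.

```python
import pandas as pd

# First three rows of the dataset, entered manually for illustration
daily = pd.DataFrame({
    "Date": ["2006-01-01", "2006-01-02", "2006-01-03"],
    "Consumption": [1069.184, 1380.521, 1442.533],
})

# Parse the strings into timestamps and use them as the index
daily["Date"] = pd.to_datetime(daily["Date"])
daily = daily.set_index("Date")

# A DatetimeIndex exposes date components directly
daily["Year"] = daily.index.year
daily["Month"] = daily.index.month
daily["Weekday Name"] = daily.index.day_name()
print(daily)
```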
2. Time-based indexing: One of the standout features of pandas for handling time series data is time-based indexing, which uses dates and times to organize and access the data (quite similar to the Indexing of the previous part, but the values are now dates). It lets us use the loc accessor for this. For example, we can access the data for the period from 2014-01-20 to 2014-01-22:
    opsd_daily.loc['2014-01-20':'2014-01-22']

                Consumption    Wind   Solar  Wind+Solar  Year  Month Weekday Name
    Date
    2014-01-20     1590.687  78.647   6.371      85.018  2014      1       Monday
    2014-01-21     1624.806  15.643   5.835      21.478  2014      1      Tuesday
    2014-01-22     1625.155  60.259  11.992      72.251  2014      1    Wednesday
Figure 20: Taking the samples from 20/1/2014 to 22/1/2014
Another pandas feature is partial-string indexing, which lets us slice by a loosely specified period of time instead of giving the full date as above. For example:
    opsd_daily.loc['2012-02']
[Output: all rows for February 2012, starting from 2012-02-01 (Wednesday), with the columns Consumption, Wind, Solar, Wind+Solar, Year, Month, Weekday Name]
Figure 21: Partial-string indexing. By specifying only "2012-02", we get all the samples belonging to "2012-02"
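Both forms of date selection can be tried on a synthetic daily series. The date range and the integer values below are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series spanning the whole of February 2012
index = pd.date_range("2012-01-30", "2012-03-02", freq="D")
series = pd.Series(np.arange(len(index)), index=index)

# Partial-string indexing: every day in the given month
february = series.loc["2012-02"]

# Date-range slicing: both endpoints are included
three_days = series.loc["2012-02-10":"2012-02-12"]

print(len(february), len(three_days))
```

Unlike positional slicing with iloc, a label slice on dates includes the end date.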
3. Visualizing time series data: Pandas supports plotting data, and combined with the seaborn library we can easily visualize time series data. For example, we plot the Consumption column as follows:
    import matplotlib.pyplot as plt
    # Display figures inline in Jupyter notebook
    %matplotlib inline
    import seaborn as sns

    # Use seaborn style defaults and set the default figure size
    sns.set(rc={'figure.figsize': (11, 4)})
    opsd_daily['Consumption'].plot(linewidth=0.5);
[Figure: line plot of Consumption against Date, 2006 to 2017]
Figure 22: Plot of the daily electricity consumption in Germany
We can plot several other columns at the same time, each as a separate plot:
    cols_plot = ['Consumption', 'Solar', 'Wind']
    axes = opsd_daily[cols_plot].plot(marker='.', alpha=0.5, linestyle='None',
                                      figsize=(11, 9), subplots=True)
    for ax in axes:
        ax.set_ylabel('Daily Totals (GWh)')
    plt.show()
[Figure: three stacked scatter plots of daily totals (GWh) for Consumption, Solar, and Wind]
Figure 23: Plots of electricity consumption, solar power production, and wind power production
4. Seasonality: This refers to patterns that repeat over a fixed period of time throughout the years. Such patterns are usually influenced by many different factors. In the data of this exercise, we can explore the seasonality of the data by using seaborn to draw box plots, grouping the data by month as follows:
    fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
    for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
        sns.boxplot(data=opsd_daily, x='Month', y=name, ax=ax)
        ax.set_ylabel('GWh')
        ax.set_title(name)
        # Remove the automatic x-axis label from all but the bottom subplot
        if ax != axes[-1]:
            ax.set_xlabel('')
[Figure: three box plots by month, titled Consumption, Solar, and Wind]
Figure 24: Distribution of the Consumption, Wind, and Solar columns by Month
5. Frequencies: With the DatetimeIndex of pandas, we can use existing time values to create a sequence of values at a given frequency. For example, with the two values "1998-03-10" and "1998-03-15", we can create a list of times at daily frequency, so that our new list becomes: "1998-03-10", "1998-03-11", "1998-03-12", "1998-03-13", "1998-03-14", "1998-03-15". This is done by setting the freq argument:
    pd.date_range('1998-03-10', '1998-03-15', freq='D')

    DatetimeIndex(['1998-03-10', '1998-03-11', '1998-03-12', '1998-03-13',
                   '1998-03-14', '1998-03-15'],
                  dtype='datetime64[ns]', freq='D')
Figure 25: An example of a daily frequency from 10/3/1998 to 15/3/1998
With this pandas feature, we can fill in missing data using the forward fill (ffill) technique. This technique uses the value recorded at the previous point in time as the replacement for all the missing values that follow it, until a sample that has a value is reached. For example, suppose we know the Consumption values of a few days as follows:
    # To select an arbitrary sequence of date/time values from a pandas time series,
    # we need to use a DatetimeIndex, rather than simply a list of date/time strings
    times_sample = pd.to_datetime(['2013-02-03', '2013-02-06', '2013-02-08'])
    # Select the specified dates and just the Consumption column
    consum_sample = opsd_daily.loc[times_sample, ['Consumption']].copy()
    consum_sample

                Consumption
    2013-02-03     1109.639
    2013-02-06     1451.449
    2013-02-08     1433.098
Figure 26: Taking the data of 3 days from the original dataset as an illustrative example
    # Convert the data to daily frequency, without filling any missings
    consum_freq = consum_sample.asfreq('D')
    # Create a column with missings forward filled
    consum_freq['Consumption - Forward Fill'] = consum_sample.asfreq('D', method='ffill')
    consum_freq

                Consumption  Consumption - Forward Fill
    2013-02-03     1109.639                    1109.639
    2013-02-04          NaN                    1109.639
    2013-02-05          NaN                    1109.639
    2013-02-06     1451.449                    1451.449
    2013-02-07          NaN                    1451.449
    2013-02-08     1433.098                    1433.098
Figure 27: Filling in the other days in the range from 3/2/2013 to 8/2/2013
Given the electricity consumption values of 3 days, we can fill in values for the remaining days within the range of those 3 days using ffill.
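The forward fill step can be reproduced without the dataset by typing in the three Consumption values from Figure 26. The variable names sample and daily and the column name Forward Fill are chosen for this sketch.

```python
import pandas as pd

# The three known days and their values, entered manually
times_sample = pd.to_datetime(["2013-02-03", "2013-02-06", "2013-02-08"])
sample = pd.DataFrame(
    {"Consumption": [1109.639, 1451.449, 1433.098]},
    index=times_sample,
)

# Daily frequency: the missing days appear as NaN rows
daily = sample.asfreq("D")

# Same conversion, but each gap takes the last known value before it
daily["Forward Fill"] = sample["Consumption"].asfreq("D", method="ffill")
print(daily)
```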
6. Resampling: A technique used to change the sampling frequency of a time series dataset; the frequency can be increased or decreased. For example, we can reduce the frequency of the current dataset from daily to monthly. This means that our new dataset will have fewer samples than the original one.
Resampling is often useful for converting a time series to a lower or higher frequency. Resampling to a lower frequency (downsampling) usually involves an aggregation operation, for example computing monthly revenue from daily data. Resampling to a higher frequency (upsampling) is less common and is often used for interpolation. Here, we try applying downsampling to the current dataset as follows:
    # Specify the data columns we want to include (i.e. exclude Year, Month, Weekday Name)
    data_columns = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
    # Resample to weekly frequency, aggregating with mean
    opsd_weekly_mean = opsd_daily[data_columns].resample('W').mean()
    opsd_weekly_mean.head(3)
In the code above, we downsample from a daily frequency to a weekly one. The value of each column is now the mean over the 7 days of the week.
                Consumption  Wind  Solar  Wind+Solar
    Date
    2006-01-01  1069.184000   NaN    NaN         NaN
    2006-01-08  1381.300143   NaN    NaN         NaN
    2006-01-15  1486.730286   NaN    NaN         NaN
Figure 28: Using the Resampling technique to change the sampling frequency of the dataset from daily to weekly
Naturally, when we downsample the dataset, the new table has fewer samples than the original table, about 1/7 as many. We can check this using the shape attribute of the DataFrame:
    print(opsd_daily.shape[0])
    print(opsd_weekly_mean.shape[0])
We visualize the daily and weekly time series of Solar over 6 months as follows:
    # Start and end of the date range to extract
    start, end = '2017-01', '2017-06'
    # Plot daily and weekly resampled time series together
    fig, ax = plt.subplots()
    ax.plot(opsd_daily.loc[start:end, 'Solar'],
            marker='.', linestyle='-', linewidth=0.5, label='Daily')
    ax.plot(opsd_weekly_mean.loc[start:end, 'Solar'],
            marker='o', markersize=8, linestyle='-', label='Weekly Mean Resample')
    ax.set_ylabel('Solar Production (GWh)')
    ax.legend()
    plt.show()
[Figure: daily Solar production and its weekly mean, January to June 2017]
Figure 29: Time series plot of Solar by day and by week
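How weekly downsampling groups the days can be seen on a short synthetic series. The 28 values below are made up; the series starts on a Monday so that each weekly bin holds exactly 7 days.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values for four full weeks (2017-01-02 is a Monday)
index = pd.date_range("2017-01-02", periods=28, freq="D")
daily = pd.Series(np.arange(28, dtype=float), index=index, name="Solar")

# 'W' bins end on Sunday and are labelled with that Sunday's date
weekly_mean = daily.resample("W").mean()

print(weekly_mean)
print(daily.shape[0], weekly_mean.shape[0])
```

The output has one row per week, which is why the resampled table has about 1/7 as many rows as the daily one.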
Note that our original table contains some null values. So, to make sure the aggregated samples are based on enough data, we set the min_count parameter to handle this. For example, when we resample the dataset to a yearly frequency, to make sure that nearly all the days of a year have non-null values, we can set min_count=360 (you can choose another min_count value depending on your observation of the data):
    # Compute the annual sums, setting the value to NaN for any year which has
    # fewer than 360 days of data ('YE' requires pandas >= 2.2; older versions use 'A')
    opsd_annual = opsd_daily[data_columns].resample('YE').sum(min_count=360)
    # The default index of the resampled DataFrame is the last day of each year,
    # ('2006-12-31', '2007-12-31', etc.) so to make life easier, set the index
    # to the year component
    opsd_annual = opsd_annual.set_index(opsd_annual.index.year)
    opsd_annual.index.name = 'Year'
    # Compute the ratio of Wind+Solar to Consumption
    opsd_annual['Wind+Solar/Consumption'] = opsd_annual['Wind+Solar'] / opsd_annual['Consumption']
    opsd_annual.tail(3)

           Consumption        Wind      Solar  Wind+Solar  Wind+Solar/Consumption
    Year
    2015  505264.56300   77468.994  34907.138  112376.132                0.222410
    2016  505927.35400   77008.126  34562.824  111570.950                0.220528
    2017  504736.36939  102667.365  35882.643  138550.008                0.274500
Figure 30: Annual resampling of the current dataset
We can draw a chart showing how much wind and solar power production contributes to electricity consumption from 2012 onwards, as follows:
    # Plot from 2012 onwards, because there is no solar production data in earlier years
    ax = opsd_annual.loc[2012:, 'Wind+Solar/Consumption'].plot.bar(color='C0')
    ax.set_ylabel('Fraction')
    ax.set_ylim(0, 0.3)
    ax.set_title('Wind + Solar Share of Annual Electricity Consumption')
    plt.xticks(rotation=0)
[Figure: bar chart titled "Wind + Solar Share of Annual Electricity Consumption", Fraction (0 to 0.3) against Year (2012 to 2017)]
Figure 31: Bar chart showing the contribution of Solar + Wind to electricity consumption
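The role of min_count in the annual resampling can be illustrated at a smaller scale. In this sketch the six values, the 3-day bins, and the threshold of 2 are assumptions chosen so that the effect is visible by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values with three missing days
index = pd.date_range("2020-01-01", periods=6, freq="D")
values = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, 4.0], index=index)

# Plain sum: NaN values are skipped, so the second bin still returns 4.0
plain = values.resample("3D").sum()

# With min_count=2, a bin with fewer than 2 valid values becomes NaN
strict = values.resample("3D").sum(min_count=2)

print(plain.tolist())
print(strict.tolist())
```

Without min_count, a sum over a mostly empty period looks like a small but valid total, which would distort the yearly comparison.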
7. Rolling windows: Rolling windows are another important transformation for time series data. Like downsampling, rolling windows split the data into time windows (periods such as a week, a month, ... slid over the daily samples), and the data in each window is aggregated with a function such as mean(), median(), sum(), ... However, unlike downsampling, where the windows do not overlap and the output always has a lower frequency than the input, rolling windows overlap and move along at the same frequency as the data, so the transformed time series has the same frequency as the original time series. Here is an example of rolling over 7 days:
    # Compute the centered 7-day rolling mean
    opsd_7d = opsd_daily[data_columns].rolling(7, center=True).mean()
    opsd_7d.head(10)

                Consumption  Wind  Solar  Wind+Solar
    Date
    2006-01-01          NaN   NaN    NaN         NaN
    2006-01-02          NaN   NaN    NaN         NaN
    2006-01-03          NaN   NaN    NaN         NaN
    2006-01-04  1361.471429   NaN    NaN         NaN
    2006-01-05  1381.300143   NaN    NaN         NaN
    2006-01-06  1402.557571   NaN    NaN         NaN
    2006-01-07  1421.754429   NaN    NaN         NaN
    2006-01-08  1438.891429   NaN    NaN         NaN
Figure 32: Rolling windows with a period of 7 days
Here, the window from 2006-01-01 to 2006-01-07 is labeled 2006-01-04, the window from 2006-01-02 to 2006-01-08 is labeled 2006-01-05, and similarly for the other rows.
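The labelling of centered windows can be checked on a short synthetic series. The nine values 1 to 9 below are made up so that each window mean is easy to verify by hand.

```python
import pandas as pd

# Hypothetical daily series with the values 1..9
index = pd.date_range("2006-01-01", periods=9, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], index=index)

# Centered 7-day window: each result is labelled with the middle day
rolling_mean = daily.rolling(7, center=True).mean()

print(rolling_mean)
```

The first and last three days have no full window around them, so they stay NaN, while the output keeps the same daily frequency and length as the input.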
8. Trends: A characteristic describing the tendency of the data, which may rise or fall over a long period of time. With the rolling windows technique, we can easily visualize the trends of the dataset at different time scales. For example, we compute the 365-day rolling mean:
    import matplotlib.dates as mdates

    # The min_periods=360 argument accounts for a few isolated missing days in the
    # wind and solar production time series
    opsd_365d = opsd_daily[data_columns].rolling(window=365, center=True,
                                                 min_periods=360).mean()

    # Plot daily, 7-day rolling mean, and 365-day rolling mean time series
    fig, ax = plt.subplots()
    ax.plot(opsd_daily['Consumption'], marker='.', markersize=2, color='0.6',
            linestyle='None', label='Daily')
    ax.plot(opsd_7d['Consumption'], linewidth=2, label='7-d Rolling Mean')
    ax.plot(opsd_365d['Consumption'], color='0.2', linewidth=3,
            label='Trend (365-d Rolling Mean)')
    # Set x-ticks to yearly interval and add legend and labels
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.legend()
    ax.set_xlabel('Year')
    ax.set_ylabel('Consumption (GWh)')
    ax.set_title('Trends in Electricity Consumption')
    plt.show()
[Figure: plot titled "Trends in Electricity Consumption", Consumption (GWh) against Year]
Figure 33: Electricity consumption trends by week and by year, rising sharply at the end of each year
    # Plot 365-day rolling mean time series of wind and solar power
    fig, ax = plt.subplots()
    for nm in ['Wind', 'Solar', 'Wind+Solar']:
        ax.plot(opsd_365d[nm], label=nm)
    # Set x-ticks to yearly interval, adjust y-axis limits, add legend and labels
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.set_ylim(0, 400)
    ax.legend()
    ax.set_ylabel('Production (GWh)')
    ax.set_title('Trends in Electricity Production (365-d Rolling Means)')
[Figure: plot titled "Trends in Electricity Production (365-d Rolling Means)", Production (GWh) against year]
Figure 34: Wind and solar power production shows an upward trend from year to year, especially wind power
Thus, with the steps above, we have seen how to wrangle, analyze, and visualize time series data in pandas using techniques such as time-based indexing, resampling, and rolling windows. Applying these techniques to the OPSD dataset yields detailed information about the timing, the seasonal cycles, and the trends in electricity production and consumption.
Part III: Questions
A. Multiple-choice questions
1. What is Data Analysis?
(a) The process of collecting data
(b) The process of searching for and mining data
(c) The process of processing data
(d) All of the above
2. Which of the following data structures does not belong to pandas?
(a) Series
(b) DataFrame
(c) Panel
(d) Tensor
3. What does the head() method do for a table in pandas?
(a) Displays the last rows
(b) Displays the first rows
(c) Displays some random rows
(d) Displays all the rows
4. What does the describe() method do for a table in pandas?
(a) Summary statistics of the string columns
(b) Summary statistics of the numeric columns
(c) Summary statistics of the list columns
(d) Summary statistics of the dict columns
5. Which of the following is used to read a CSV file from storage in pandas?
(a) pd.load_csv()
(b) pd.read_csv()
(c) pd.read_file()
(d) pd.load_file()
6. What does the groupby() method do for a table in pandas?
(a) Filters rows by a condition
(b) Aggregates statistics of the columns
(c) Joins tables
(d) Groups the data by the values of one or more columns
7. Which of the following methods is used to check for NaN values in a table?
(a) df.isna()
(b) df.notna()
(c) df.notnull()
(d) df.tail()
8. Which of the following methods can be used to drop a row that has null values in a table?
(a) df.drop_null()
(b) df.drop_missing()
(c) df.dropna()
(d) df.remove_null()
9. Which of the following is used in pandas to fill the missing values in a table using the forward filling technique?
(a) fillna(method='bfill')
(b) fillna(method='pad')
(c) fillna(method='ffill')
(d) fillna(method='forward')
10. Which of the following methods is used in pandas to resample data?
(a) resample()
(b) downsample()
(c) reduce()
(d) shrink()
11. Which of the following methods is used in pandas to compute rolling windows?
(a) rolling()
(b) mean()
(c) average()
(d) smooth()
12. Which of the following methods is used in pandas to run an arbitrary function on all the elements of a Series?
(a) pd.Series.transform()
(b) pd.Series.applymap()
(c) pd.Series.map()
(d) pd.Series.apply()
13. Which of the following can be used to get an entire column from a table using its name?
(a) df[col]
(b) df.loc[col]
(c) df.ix[col]
(d) df.iloc[col]
Look at the following table (df) and answer the questions below:
Date      | Open      | High      | Low       | Close     | Volume   | Adj Close
6/29/2010 | 19.000000 | 25.000000 | 17.540001 | 23.889999 | 18766300 | 23.889999
6/30/2010 | 25.790001 | 30.420000 | 23.299999 | 23.830000 | 17187100 | 23.830000
7/1/2010  | 25.000000 | 25.920000 | 20.270000 | 21.959999 | 8218800  | 21.959999
7/2/2010  | 23.000000 | 23.100000 | 18.709999 | 19.200001 | 5139800  | 19.200001
7/6/2010  | 20.000000 | 20.000000 | 15.830000 | 16.110001 | 6866900  | 16.110001
14. Which of the following statements selects the rows whose 'Close' value is greater than 25?
(a) df[df['Close'] > 25]
(b) df['Close'] > 25
(c) df.iloc[df['Close'] > 25]
(d) df[df.Close > 25]
15. Which of the following statements selects the rows whose 'Volume' value is less than or equal to 1000000?
(a) df.iloc[df['Volume'] <= 1000000]
(b) df[df['Volume'] <= 1000000]
(c) df[df.Volume <= 1000000]
(d) df['Volume'] <= 1000000
16. Which of the following statements selects the rows whose 'High' value is less than or equal to 'Low'?
(a) df.iloc[df['High'] <= df['Low']]
(b) df[df['High'] <= df['Low']]
(c) df[df.High <= df.Low]
(d) df['High'] <= df['Low']
17. Which of the following statements finds the mean value of the 'Close' column?
(a) df.mean()
(b) df['Close'].mean
(c) df['Close'].sum()
(d) df['Close'].mean()
- End -