0 ratings0% found this document useful (0 votes) 111 views16 pagesEDA Assignment
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here.
Available Formats
Download as PDF or read online on Scribd
‘191723, 348 PM
14..EDA_Assignment_Day_14 Doneipynb -Colaboratory
Exploratory Data Analysis
Problem Statement:
We have used Cars dataset from kagoe wit features including make, model year, engine, and other properties ofthe car used to predict its
price
Importing the necessary libraries
{ngort watplotiib.pyplot as plt Avisualts
fngort seaborn ae ste avisualisation
Suratplot ib inline
warnings. 1 terwarnings("Sgnore")
Load the dataset into dataframe
oF = pdareaderu("Veontent/Cars_eota.c3¥")
ée.neaso
cngine vanber
ake ode Year "Yael EINE Engine Transmission oriven heels of market categot
Dee 7 a ears
© BMW Sees 2011. Unleaded 3950 «80 MANUAL rearwheslarve 20 TunerLumuyhig
" Perarmane
2 BNW gure! 2011 traded 3000 «6.0 “MANUAL rearwhesleme — 20==«(Sumaitig
onset)
3 BNW gay 20M enced 250080 MANUAL 20 us Patan
(resoree)
Now we observe the each features present Inthe dataset
ake: The Make feature is the company name ofthe Car.
Yodel: The Model feature isthe model or diffrent version of Car models
ear: The year describes the model has been launched
Engine Fuel Type: It defines the Fueltype ofthe car model
Engine +: Its say the Horsepower that refers tothe power an engine produces.
Fngie Cylinders: It define the nos of eylindersin present inthe engine
Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
16‘191723, 348 PM
Driven, trees
arket category: This features
Its say about the about car size,
The feature i all about the style that belongs tocar
hagimay Hea: The average a car will get while driving on an open etretch of road without stepping or starting. ypealy ata higher speed,
vehicle style:
14..EDA_Assignment_Day_14 Doneipynb -Colaboratory
Inis the type of feature that describe about the car transmission type Le Manual or automatic
The lype of wheel drive.
It defined nos of éoars present inthe ca,
about
type of car or which category the car belongs.
city apg: City MPG refers to driving with occasional stopping ane braking
Ian refered to rating ofthat car or popularity of ca.
sho The pice ofthat ca,
Popularity,
Check thi
\e datatypes
eF.antoo
anger
oF sen0ni0.
ne
ware
fine Cylinders 1188 noncrall
=) m8 1944 non-null
#leatea(s), imtea(S), obsect(@)
sunt)
Fuel Type
Sylsoaers
B8ucee
Dropping irrevalent columns
11 we consider all columns present inthe dataset then unneccessary columns wilimpact onthe mode's accuracy
Not all he columns are important tous inthe given daterame, and hence we would drop the columns that are ievalento us. k would reflect
four models accucary so we need to drop them. Otherwise it wll affect our model
‘Thelistcols_to.drop contains the names ofthe cols that are ievalent, drop all these cals from the dataframe,
Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
216‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory
col
trop = [Engine Fuel Type", "harkat Category", "Venicle style", “Popularity, “Munber of Doors",
Vehicle size")
These features are not neccessary to obtaln the models accucary. It does not contan any relevant information inthe dataset
cols te_dvop = ["Engine Fuel Type", “Market Category", "Vehicle Style", "Popularity", "Wunber of Doors",
SF = ef deop(cois 20 crop, ae
efsnesde)
ake pode Year FEIN Engine TONSRLSSION osssen ypeens MEMMBY
earn *5°5 20r1 5350 60 MANUAL rwarwacldive 25,
4 BNW 1Sees 2011 S000 60 MANUAL ~
2 BNW 1Sere8 2001 3000 60 MANUAL ~
Renaming the columns
19
2»
"Now; ts time for renaming the feature to useful feature name. It wll help to use them In model raining purpose.
We have aeady dropped the unneccesary columns, and now we are let with useful columns. One extra thing that we would do iste rename
the columns such thatthe name clearly represents the essence ofthe column,
The given det represents (In key value pat the previous name, and the new name forthe dataframe columns
rorename(coluans = (Rake :‘Conpany"), inplace = Teve)
tforenana(columns = CASA": Price"), inplace = Thus)
efcrenane(columns = {Engine HP°:"H?"), inplace = True)
“«
7
1 eM geal auth 9000 CMAMUAL arwhenive 28
2 eww gel aot 90 CMAWAL raruhasave 28
3 aww 1 con 2300 60 MANUAL roar wheel drive 28
11908 pera 20K Zot2 3000 O.—AUTOMATIC. aetna 23,
foto Aewa 20K 2012 300069 “AUTOMATC. atwhocidne 28
1 ose 9 pandas fu
hare
|ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
6
16
40120
esr
‘Vehicle size")
ane‘191723, 348 PM
# Prins she nese
erhead)
° BMW
2 aww
Dropping the duplicate rows
of the datsfrane
1 Series
1 Sores
1 Sores
zon
zon
aon
260
000
000
ojtinders
60
60
60
14..EDA_Assignment_Day_14 Doneipynb -Colaboratory
‘ype
MANUAL
MANUAL
rearhoel ve
rearubeel ve
rearboel ve
%
0
~
0
2»
There are many rows in the dataframe which are duplicate, and hence they are just repeating the information
as thoy dont add any value to the dataframe.
ht betterif we remove these rows
For given data, we would Ike to see how many rows were duplicates For his, we will count the numberof rows, remove the dublicated rows,
‘and again count the number of rows
oF = ef.drop duplicates)
etsheas0)
+ aww
(9525, 10)
Dropping the null or missing values
1 Sores
1 Series
1 sees
aon
Engine
60
Tranentssion
MANUAL
‘Missing values are usually represented inthe form of Nan or nullor None inthe dataset
Finding whether we have null values Inthe data is by using the isnull function
iehway
=
ay
40850
‘There are may values which ae missing, in pandas dataframe these vlues ar reffered to as np nan, We went to deal wth these values
bbeause we cart use nan values to train models, Ether we can remave them to apply some svvategy to replace them with ather values,
To keep things simple we willbe dropping nan values
ef-ssnull().sue()
|ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
ane‘191723, 348 PM
company
caty moe
[As we can see that the HP and Cylinders have rl values of 69 and 20. As these nul values will mpact on models’ accuracy. So to avoid the
Impact we wil drop the these values As these values are small camparing with datase that will ot impact any major affect on model accuracy
0 we ill drop the values.
#1 = 4F.droona(ants-0)
641.ssn100).s460)
Srananassion Tye
highway mee
ef ante)
3 iponay 906827 nancnull
types: floatea(2y, inte), cbseer’e)
eF.asnali(). suet)
ronission Type
highway 9G
14..EDA_Assignment_Day_14 Doneipynb -Colaboratory
eck nunber of nan values dn each col agein
|ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
56‘191723, 348 PM
‘count 10827, 000000
in
20%
Removing
‘Sometimes a dataset can contain extreme values that are outside the range of what expected and unlike the ther data. These are called
cuties and often machine learning modeling and model skin general can be improved by understanclng and even removing these outer
2mr0.89ex70
1990000000
2007.000000,
outliers
plt-boxplot (det Price’ 1)
14..EDA_Assignment_Day_14 Doneipynb -Colaboratory
HP Engine cytinders
woazrooo009 10827 a90000
254 953062 se0teo4
55:000000 0.000000
+173.900000 +,000000
ce" column An dataset
“Sratplotlab. inet Linea® at 0x7/20008ab¢39>),
aps" (a
“natplotisb.
seasons" [erat
ghway Wee ehty ape
~0827-090000 70827000000,
zeavene 19227607
‘2000000 7.000000
22,900000 16:000000,
Hines. Line2D at @x7#2@01Redsde>],
LénstplotiibrLines-Linezo at 07#208184b810>]
(eauepioel
c
observation:
Here as you see that we got some values near to 1S and 2.0. So these values are called outliers, Because there are away from the normal
values, Now we have detect the outliers of the feature of Price. Similarly we will eecking of anothers features.
pit.bosolo(s
cH)
sooiaesie> |,
*O82TODe OF
42493286008
2,900000640
21972500404
|ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
ene‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory
“Ghiplociib.lines-inead at ecifeatsataede)y
ton [coaplotib nes Lined st Wo riOLalfo),
‘SiritSie ines tinet st xrrasnsoneon,
yen: [atplotlie ines Linea at OC FS00L41 0)
sedion'! [enatpot is ins, ina or fieeenees],
Fitera': [cneprolib Les Linea) at ex 500130700}
see's
observation:
Here boxplots show the proper dstrbution of of 25 percentile and 75 percentile ofthe feature of HP,
ou
print all the columne which are of int or Neat datatype ino.
Hint: Use loc with condition
yoes(inciuées[np- Ant, float)
Year HP engine Cylinders highay #96 clty ape Price
0 20m 350 60 26 9 48195
4 20m 3000 60 219 406850
3 20 200 60 2 18 20450
4 2011 2000 60 2 18 346500
stort 2012 2000 60 2 16 50620
stos2 2019 2000 60 216 0020
Save the colunn nanes of the above output in variable list named ‘1°
Tef'vear’, "18", engine Cylinders", highay PS", city wpa "Prtce"y
en
Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue m6‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory
We Engine Cylinders highaay APG city ape
2 201 a0 60 280 36260
2 am gy! 201 s000 £0 MANUAL osrwhosline 28a. 860
1900 pou 70x 012 900 60 AUTOMATIC. stvhesline 2318 48120
time pou 20x 2012 800 0 AUTOMATIC abwhoaline «288.8870
Outliers removal techniques - IQR Method
Here comes cool Fact for you!
1ORis the frst quartile subtracted fom the third quartile; these quartiles can be clearly seen on a box plot onthe data,
+ Calculate IQR and give a suitable threshold to remove the outliers and save this new dataframeinto 2,
Let us help you to decide threshold: Ollie in this case are defined ae the observations that are below (Q1 ~1.5x QR) or above (03 + 15x IQR)
18 define QL ane @3
QL, B= wpwpercentslecae{‘Price'}s[25 , 75)
1 # define 198 (interauantse range)
Ta ga
wa gas serge
lw = Gh - 2.58708
for 4 sn af 'Pr8ce")
etsclow or UP )
outliers. appena(s)
ef{‘Price'] = (op. if x in Outliers else x for x in ef "Price’1]
entert"eece'])
GL, B = wpspercentslecae('H#"16[25 » 751)
48 cefine 198 (Jntorquantsle range)
Te = ga
we = gas .sera8
lowe Qh 2.58108
Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue ane‘191723, 348 PM 14..EDA_Assignment_Oay_14 Done ipynb - Colaboratory
for sn dF 18")
‘(Selon or UP )
Ovt1iers.append(i)
SATHPD = [apsNeN A x Jn OutlSors else x for x 30 "11
tentare'We"))
sns.boxplot(de.tP, ontente"h*)
“oatplotlib.sk08._subplots.eesSubplot st 0741803976760
[nt © cetine 442 after renoving ovtliers
812 = af dropna(axisn8)
efa.intet)
colin en-Mll Count otype
types! Floatea(ay, snes), ebseee(e)
4 Fane the shape of of & a¢2
print(at. shine)
(20827, 10)
(os, '20)
003_coledf.relect_atypes( snclude=[09.0bJect])
soiieal
|ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
one‘191723, 348 PM
2 aww
‘1908 Acura
1 Fine unique values and there counts in each column an df using value counts fan
for 4 40 dfcoluans
print (
print(af(]-value_counts())
1 Sones
MANUAL
auTowanie
14..EDA_Assignment_Day_14 Doneipynb -Colaboratory
roar wool sive
at wheel ve
ay
Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
1016‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory
Visualising Univariate Distributions
‘We willuse seaborn Ibrary to visualize eye catchy univariate plots
10 you know? you have just now already explored one unvarate pt, guess which one? Yeah te box plot
Histogram & Density Pl:
Histograms and density plots show the frequency of a numeric variable along the y-axis, and the value along the x-axis, The sns.distplot()
ss
function plots a density curve. Notice that ths is aesthetically better than vanilla
Aploting distplot for variable 1?
ple figure( Figsize-(24,8))
Ens.oistplor(af[ #8" ],colon =
ple titleCofstplot of vasa
ple-xdaver (9°)
ple-ylaver (*coune*)
ple-stont)
Detect arate
Bs
observation:
We plo the Histogram of feature HP with help of distplot in seaborn,
In this graph we can see that there is max values neat at 200, simiary we have also the 2nd highest value near 400 ane so on,
It represents the overall distribution of continuous data variables
Since seaborn uses matplotib behind he scenes, the usual matplotib functions work well with sesborn, For example, you can use subplots to
plot multiple univariate dstebutions
+ Hint use matpltio subplot function
Levene’, “HP, “Engine cylincers", “highway Ho", “city moa’, “Pesce']
H plot alt the colunns present sn List 1 together using subplot of ainention (2,3)
Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue ne11917, 948 Pat 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory
ple. figure (#4gss200(35,10)) ‘
for {im range(@,ten(2)) H tenayy
putesubplet(2,3,442)
ns countplotef[1[3)))
Tt suptitle( "subplot of list)
pit-eieleQ(ip)
pit-wlabel s1)
pit ylabet ("count")
plt-stont)
sus. pateplet(ef(t))
subplot st
tt rs shoe
; h sl
: | hs. “ i
‘eget
amore ee
Bar Chart Plots
Plot ahistagram depicting the make in X axis and number of car in y axis,
pit. Figure(rigsize = (52,8))
na histpioe 4 Company")
ple tight Layout)
Fuse nlangest and then plot to get bar plot 2tke below output
-ntps:feolab rosearch google. com/riva/ x5SszpUp7KEpS3azD0RUKW\'SqIXEn6w_HprintModo=tue
1216‘191723, 348 PM 14..EDA_Assignment_Oay_14 Done ipynb - Colaboratory
observation:
In this plot we can see that we have plot the bar plot with the cars model and nos. of cars
count Plot
‘Acount plot canbe thought of asa histogram across a categorical, instead of quantitative, variable.
Plot a countslt fora variable Transmission vertically with hue as Drive mode
ple figure(Figsszee(35,5))
Sns.countplot(x = "Transmission Type’ ,catarce2)
plt-grig(color = "k’, Linestyle = "=~", Linewidth = @.5) 1 srcxcroun STYLING
plt-legend(loc = 4)
pis-stont)
awuannatpLotlib.Legend:No handles with Labels found to put in legend
ecm nee
Observation:
In this count plot, We have plot the feature of Transmission with help of hue.
We can see thatthe the nos of count and the transmission type and automated manuals plotted, Drive mode as been given with help of hue
Visualising Bivariate Distributions
Bivariate datributions are simply two univariate distributions plotted on x and y axes respectively. They help you cbserve the relationship
between the two variables
Scatter Plots
‘Scatterplots are used to find the correlation between two continuos variables,
Using seatteplot nd the correlation between HP’ and Price’ column af the data
s(#igsizes(10,6))
pit. gure Figeize = (8,8))
Ens. seatterplot(et( HP"), Af Price’)
ple legens(loe = 4)
ple-stont)
|ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 1316‘191723, 348 PM 14..EDA_Assignment_Oay_14 Done ipynb - Colaboratory
anuan:natplotlib.legend:Ne handles
observation:
Itis a type of plot or mathematical diagram using Cartesian coordinates to dlsplay values for typically two variables fr a set of data
‘We have plot the scatter plot with x axis as HP and y axis as Price.
‘The datapoints between the festures shouldbe same either wise it give errors,
Plotting Aggregated Values across Categories
Bar Plots - Mean, Median and Count Plots
ar plots are used to daplay aggregated values ofa variable rather than entire distributions. This ie especially useful when youhave alot of
data whichis cificu to visualise in a single figure.
For example, say you want to visualise and compare the Price across Cylinders. The sns.barplot() function can be used to do that,
et. grovpby("acear_oroxinity”, a5_tndexeratse){ median house value’ ).222n()
Ssns-barplot(‘oceah_proxinity’, “median touse value’, datacdf, ci-False)
ers grovpby( ocean proxinity" i necian_ house value’ J.naan()-piot-bar()
sns.barplot(de[ Price” J, "Engine Cylinders"])
PIetitLe( the Price across Cylinders")
plt-alasel(“price") fx-uaves ron ata
plt.ylaved ‘eyLinder’)
ple ston)
|ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 146‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory
len)
the rice across Cynder
Tex(@, 0:5, °0
Observation:
By default seabor plots the mean value across categories, though you can plot the count, median, sum et.
‘Also, barplot compures and shows the confidence interval of the mean as well
TT
When you want to visualise having a large number of categories, it is helpful to plot the
categories across the y-axis.
Let's now érill down i
Transmission sub categories.
‘These plots looks beutiful isntit? In Data Analyst life such chars ae there unavoidable frend.)
Multivariate Plots
Heatmaps
[Aheat map is @ two-dimensional presentation of information withthe help of colors. Heat maps ean help the user visualize simple or
complex information
Using heatmaps plot the correlation between the featutes present inthe dataset.
tind she correlation of features of the
A Set size of graphs (12,8)
observation:
[Aheatmap contains values representing various shades of the same colour foreach value tobe plotted. Usually the darker shades ofthe chart
represent higher values than the lighter shade, Fora very different value a completely diferent colour can also be used
‘The above heatmap plot shows corelation between various variables inthe colored scale of 1 to 1
Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue
1911611917, 948 Pat 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory
-ntps:feolab rosearch google. com/riva/ x5SszpUp7KEpS3azD0RUKW\'SqIXEn6w_HprintModo=tue 1616