ADS Assignment 1
Applied Data Science: questions and answers
Q. Explain the data science life cycle in detail.
Data Science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information and make business decisions.
The data science life cycle revolves around the use of ML and different analytical strategies to produce insights and predictions from information in order to achieve a commercial enterprise objective. It is an iterative set of steps required to deliver a project or analysis.
Life cycle of Data Science
1) Identifying the problem and understanding the business goal
Identifying the problem is one of the major steps in the data science process. A clear objective is needed, since it will decide the final goal of the analysis.
This phase should examine and assess past case studies of similar analyses, and assess in-house resources, infrastructure, total time and technology needs.
This phase should:
+ Clearly state the problem that requires solutions and why it should be resolved at once
+ Define the potential value of the business project
+ Find risks, including the ethical aspects involved in the project
+ Build and communicate a highly integrated, flexible project plan
2) Data Collection
In this step raw data is captured from relevant sources. The data captured can be either in structured or unstructured form.
The data might be collected from logs from websites, social media data, data from online repositories, and even data streamed from online sources via APIs, web scraping, or data present in Excel.
One must know the difference between the various data sets available and the data investment strategy of the organisation, and keep track of where each piece of data comes from and whether it is up to date or not.
3) Data Processing
In this step the data is converted into a unified format for smooth data processing. The data is processed with the ETL process (Extract, Transform and Load) before data science operations are carried out.
The actions to be performed at this stage are:
+ Selection of applicable data
+ Data integration by means of merging data sets
+ Data cleaning and filtration of relevant information
+ Treating the missing values through either eliminating them or imputing them
+ Treating inaccurate data through eliminating them
+ Testing for outliers with the use of box plots, and coping with them
It is the most time consuming but most essential step, as your model will only be as accurate as your data.
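The cleaning actions listed above (imputing missing values and finding outliers with the box-plot rule) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the choice of median imputation and the sample readings are assumptions made for the example, not part of the original notes.

```python
import statistics

def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings with two missing values and one suspicious value.
raw = [12, 15, None, 14, 13, 95, 16, None, 14]
clean = impute_missing(raw)      # missing values replaced by the median
outliers = iqr_outliers(clean)   # values a box plot would flag
print(clean, outliers)
```

Whether a flagged value is then removed, capped or kept is a judgment call that depends on the data; the sketch only identifies candidates.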
4) Data Exploration
Data analysis is carried out using various statistical tools. With the support of the data engineers, the following steps are carried out for exploratory data analysis:
+ Examine the data by formulating the various statistical functions
+ Identify independent and dependent variables
+ Analyze the key features of the data to work on
+ Define the spread of the data
Data scientists explore the distribution of data inside distinct variables graphically by usage of bar graphs. Relations between distinct features are also captured via graphical representations like scatter plots or heat maps.
5) Data Modeling
A model should use the prepared and analysed data to provide the desired output. The environment needed for executing the data model will be decided and created before meeting the specific requirements.
The team works together to develop datasets for training and testing the model for production purposes. It also involves tasks like choosing the appropriate model type and learning whether the problem is a classification, regression or clustering problem.
After analysing, choose the algorithm to implement.
6) Model Deployment
The model is finally prepared to be deployed in the desired format and preferred channel. Machine learning models may have to be recorded before time progresses. In general, these models are integrated and coupled with products and applications.
This stage involves the creation of a delivery mechanism required to get the model out in the market among the users or to another system.
ML models are also being deployed on devices, and are gaining adoption and popularity in the field of computing.
Q. Write a note on the role of statistics in data science.
Statistics is a field of maths that helps us work with data, whether it is numbers or descriptions. It is all about collecting, checking and making sense of data so we can make better choices.
Statistics is a crucial topic for data science. Using statistics in data science helps us uncover new insights, make better decisions and even predict future trends.
Role of Statistics in Data Science
1) Data Cleaning: Statistics aids in identifying and dealing with errors or outliers in datasets, ensuring data quality.
2) Descriptive Analysis: It allows us to summarize and understand data through measures like mean, median and standard deviation.
3) Inferential Analysis: Statistics helps make predictions and draw conclusions about larger populations based on sample data.
4) A/B Testing and Experimentation: This is where A/B testing, a powerful technique in data science, comes into play.
5) Hypothesis Testing: It is vital for testing ideas and hypotheses and determining if observed patterns are statistically significant.
6) Machine Learning: Statistical methods underpin many ML algorithms, guiding model development and evaluation.
7) Data Visualization: Statistics enhances data visualization, making it easier to convey insights to non-technical stakeholders.
Benefits of Statistics in Data Science:
1) Statistics helps data scientists make sense of the heaps of data they work with.
2) With statistics, we can make informed decisions. Imagine you are choosing between two ice creams. Statistics can tell you which one people like more based on surveys, so you pick the right one.
3) Statistics helps us predict what might happen in the future. It is like looking at past weather data to guess whether it will rain tomorrow or not.
4) Data scientists use statistics to catch errors and mistakes in our data analysis.
Q. Explain the significance of data science considering volume and dimensions of data.
Data science is a multidisciplinary field that involves extracting insights from data.
1) Volume of Data
- Big Data Handling:
In today's digital world large amounts of data are generated daily, commonly referred to as 'big data'. This includes data from social media, sensors, online transactions and more. Data science provides tools and techniques to efficiently process, analyze and extract valuable insights from these massive datasets.
- Scalability:
Traditional data processing methods may struggle to handle large volumes of data. Data science leverages distributed computing and parallel processing to scale up the processing power, enabling the analysis of enormous datasets in a reasonable amount of time.
- Pattern Recognition:
With a large volume of data, data scientists can identify patterns, trends and correlations that may not be apparent in smaller datasets. This helps in making more accurate predictions and informed decisions.
2) Dimensions of Data
- High Dimensionality:
Data can have many attributes or features. For example, in image data, each pixel can be a dimension. Data science methods help to analyze and make sense of such complex datasets.
- Feature Engineering:
Data scientists work on feature selection and engineering to identify the most relevant dimensions for analysis. This involves choosing the right variables or attributes that contribute the most to the model's predictive power.
- Dimensionality Reduction:
High-dimensional data is hard to analyze, leading to the curse of dimensionality. Data analysis employs techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce dimensionality while preserving the essential information.
Thus the significance of data science lies in its ability to tackle the challenges posed by the sheer volume and dimensions of data. It empowers organizations and individuals to extract meaningful insights, make data-driven decisions and unlock the potential within vast and complex datasets.
To apply data exploration and visualization techniques
Q. Explain hypothesis testing in detail.
A hypothesis test is a statistical method that is used to make a statistical decision using experimental data. A hypothesis is basically an assumption that we make about a population. The test evaluates two mutually exclusive statements about a population to determine which statement is better supported by the sample data.
Working of Hypothesis Testing
1] Define null and alternate hypothesis
Null Hypothesis (H0): It is a general statement that there is no relationship between the two measured cases.
Alternate Hypothesis (H1): It is the statement contrary to the null hypothesis.
2] Choose a significance level
Select a significance level $\alpha$, typically 0.05, as the threshold for rejecting the null hypothesis.
The p-value is the probability of getting the observed results when H0 for the given problem is true. If p-value < $\alpha$, then reject the null hypothesis. Thus the p-value is the criterion used to judge significance.
3] Collect and analyze data
Gather relevant data through observation or experimentation. Analyze the data using an appropriate statistical method to obtain a test statistic.
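4] Calculate Test Statistic
The data for the test is evaluated in this step. The choice of test statistic depends on the type of test being conducted.
Types of test:
a) Z-test:
When the population mean and standard deviation are known and the sample size is large, the Z-test is used.
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$\bar{x}$ = sample mean, $\mu$ = population mean, $\sigma$ = population SD, $n$ = sample size
b) T-test:
The t-test is used when comparing means (a sample mean against a given value, or the means of two groups). If the population SD is unknown and the sample size is small, then use the t-test.
The t-test is used when n < 30.
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
$s$ = SD of sample, $n$ = sample size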
c) Chi-Square Test:
The chi-square test is used for categorical data, or for testing independence in contingency tables.
$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
$O_{ij}$ = observed frequency in cell (i, j)
$E_{ij}$ = expected frequency in cell (i, j), calculated as (Row total × Column total) / Total observations
i, j = row and column indices
d) ANOVA test:
Analysis of Variance is a statistical technique used to check if the means of two or more groups are significantly different from each other.
$F = \frac{MSB}{MSE}$
F = ANOVA coefficient, MSB = mean sum of squares between the groups, MSE = mean sum of squares due to error (within the groups)
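The chi-square statistic for a contingency table, with expected frequencies computed as row total × column total / total observations, can be sketched as below. The function name and the 2×2 table are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected frequency under independence of rows and columns.
            e_ij = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2

# Hypothetical 2x2 table: every expected frequency is 50 * 50 / 100 = 25.
table = [[20, 30],
         [30, 20]]
chi2 = chi_square_statistic(table)
dof = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom = (r-1)(c-1)
print(chi2, dof)
```

The resulting statistic would then be compared with the tabulated chi-square critical value for the computed degrees of freedom.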
5] Comparing test statistic:
In this step we have to decide whether to reject the null hypothesis.
There are 2 ways to decide this:
A) Using critical values
Compare the test statistic (in absolute value for a two-tailed test) with the tabulated critical value.
Test statistic > critical value → Reject null hypothesis
Test statistic ≤ critical value → Fail to reject (accept) null hypothesis
B) Using p-values
p-value < $\alpha$ → Reject null hypothesis
p-value ≥ $\alpha$ → Fail to reject null hypothesis
6] Interpret the results
At last, we can conclude our experiment using method A or B.
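The two decision methods (critical value and p-value) can be sketched for a two-tailed z-test as follows. This is only an illustration using the standard library's `NormalDist`; the function name and the input numbers are hypothetical and are not taken from the worked example in these notes.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    """Two-tailed z-test: returns (z, critical value, p-value, reject H0?)."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # both tails
    reject = abs(z) > critical                       # method A: critical value
    return z, critical, p_value, reject

# Hypothetical inputs: sample mean 52 from n = 25, population mean 50, SD 5.
z, critical, p_value, reject = z_test(52.0, 50.0, 5.0, 25)
print(z, critical, p_value, reject)
```

Both methods agree by construction: |z| exceeds the critical value exactly when the p-value falls below the significance level.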
Example: Suppose a sample of 25 people is taken and their cholesterol levels are measured in mg/dL:
205,196, 210) 190,218,205, 200, 192,198,202, 208, 200,
205, 198, 208 5 20, 192,145,196, 205,210, 192,20
Population mean ($\mu$) = 200 mg/dL
Population SD ($\sigma$) = 5 mg/dL
Step 1: Define the hypotheses
Null hypothesis (H0): the average cholesterol level in the population is 200 mg/dL.
Alternate hypothesis (H1): the average cholesterol level in the population is different from 200 mg/dL.
Step 2: Define the significance level
As it is a two-tailed test, based on the normal distribution table the critical value for a significance level of 0.05 can be read from the z-table as 1.96.
Step 3: Compute the test statistic
We use the z-test, as the population SD and mean are known.
Step 4: As the absolute value of the test statistic, 2.04, is greater than the critical value of 1.96, the null hypothesis is rejected.
Thus there is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.
Sol:
In this case, we want to test whether the average thickness of the washers produced by the machine is significantly different from the expected thickness = 0.025 cm.
Step 1: Define hypotheses
Null hypothesis (H0): $\mu = 0.025$
Alternate hypothesis (H1): $\mu \neq 0.025$
Step 2: Define the significance level
It is a two-tailed test; the critical value is read from the distribution table at a significance level of 5% (0.05).
Step 3: Compute the test statistic
Here we use the t-test. The sample SD $s$ is computed with divisor $n$, so the standard error is $s / \sqrt{n - 1}$.
$t = \frac{\bar{x} - \mu}{s / \sqrt{n - 1}}$
$\bar{x}$ = 0.024 cm, $s$ = 0.002 cm, $\mu$ = 0.025 cm, $n$ = 10
$t = \frac{0.024 - 0.025}{0.002 / \sqrt{10 - 1}}$
This t statistic follows a t-distribution with number of degrees of freedom $\nu = n - 1 = 9$.
$t = \frac{-0.001 \times 3}{0.002} = -1.5$
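The arithmetic of this t statistic can be checked with a short sketch. It assumes n = 10 washers (which is what the factor of 3 = √(n − 1) in the working implies) and that s is the sample SD computed with divisor n. The critical value 2.262 is the standard two-tailed t-table value for 9 degrees of freedom at the 5% level; it is not stated in the notes and is included here as an assumption for illustration.

```python
import math

def t_statistic(sample_mean, pop_mean, sample_sd, n):
    """One-sample t statistic when sample_sd was computed with divisor n."""
    standard_error = sample_sd / math.sqrt(n - 1)
    return (sample_mean - pop_mean) / standard_error

t = t_statistic(sample_mean=0.024, pop_mean=0.025, sample_sd=0.002, n=10)

# Assumption: two-tailed critical value from a standard t-table (9 df, 5%).
T_CRITICAL_9DF = 2.262
reject = abs(t) > T_CRITICAL_9DF
print(round(t, 3), reject)
```

Under these assumptions |t| is below the critical value, so the sketch would not reject the null hypothesis.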
Q. For the following data, find the correlation coefficient using Spearman's rank correlation coefficient.
Q. Suppose a given restaurant receives on average 100 customers per day. Use the Poisson distribution to find the probability that the restaurant receives more than a certain number of customers. Plot the Poisson distribution for at least any 5 discrete numbers of customers.
Sol:
Average rate of customers $\lambda$ = 100
Suppose the restaurant receives more than 110 customers per day.
$P(X > 110) = 1 - P(X \le 110)$
$= 1 - \sum_{k=0}^{110} \frac{e^{-100} \cdot 100^{k}}{k!}$
$= 1 - 0.852$
$= 0.148$
Probability that the restaurant receives more than 110 customers per day is 0.148.
Let us consider 5 discrete numbers of customers: 111, 112, 113, 114, 115
1) $P(X = 111) = \frac{e^{-100} \cdot 100^{111}}{111!} = 0.0211$
2) $P(X = 112) = \frac{e^{-100} \cdot 100^{112}}{112!} = 0.0188$
3) $P(X = 113) = \frac{e^{-100} \cdot 100^{113}}{113!} = 0.0167$
4) $P(X = 114) = \frac{e^{-100} \cdot 100^{114}}{114!} = 0.0146$
5) $P(X = 115) = \frac{e^{-100} \cdot 100^{115}}{115!} = 0.0127$
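The Poisson probabilities above can be reproduced with a short standard-library sketch. The helper names are hypothetical; the pmf is evaluated in log space because 100^k and k! overflow ordinary floats for these values of k, and the "plot" is a simple text bar chart so the example needs no plotting library.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_tail(k, lam):
    """P(X > k) = 1 - P(X <= k)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 100                        # average customers per day
tail = poisson_tail(110, lam)    # probability of more than 110 customers
print(f"P(X > 110) = {tail:.3f}")

# Text "plot" of the pmf at five discrete customer counts.
pmf_values = {k: poisson_pmf(k, lam) for k in range(111, 116)}
for k, p in pmf_values.items():
    print(f"P(X = {k}) = {p:.4f}  " + "#" * round(p * 2000))
```

The bar lengths shrink steadily from 111 to 115, which is the expected shape of the distribution to the right of its mean.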
Q.8 Explain in brief: 1) Measure of central tendency 2) Measure of spread 3) Skewness in data, with example.
1] Measure of Central Tendency
Central tendencies in statistics are the numerical values that are used to represent the mid-value or central value of a large collection of numerical data.
A measure of central tendency is the representative value of a dataset: the central value or the most occurring value, which gives a general idea of the whole dataset.
Some of the most commonly used measures of central tendency are: Mean, Median & Mode.
1> Mean
The mean is the average of the given set of values. It denotes the equal distribution of values for a given dataset.
There are three types of mean: Arithmetic mean, Geometric mean, Harmonic mean.
a) Arithmetic Mean: when we say calculate the mean, we usually calculate the arithmetic mean.
- Mean for ungrouped data
$\bar{x} = \frac{\sum x}{n}$ = Sum of all observations / Total no. of observations
Example: 10 students have received the following marks (%): 88, 83, 87, 88, 90, ..., 84, 82, 83, 85. Find the mean.
Mean $\bar{x}$ = (sum of the 10 marks) / 10 = 85.2%
- Mean for grouped data
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$, where $f_i$ = frequency of the i-th datapoint and $x_i$ = i-th datapoint
Example:
x_i | 4 | 6 | 15
f_i | 5 | 10 | 8
$\bar{x} = \frac{(4 \times 5) + (6 \times 10) + (15 \times 8)}{5 + 10 + 8} = \frac{200}{23} \approx 8.70$
b) Geometric Mean
GM of 4, 8, 9: $\sqrt[3]{4 \times 8 \times 9} = \sqrt[3]{288} \approx 6.60$
c) Harmonic Mean
$HM = \frac{n}{\sum \frac{1}{x_i}}$
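The grouped arithmetic mean and the geometric mean can be checked with a short sketch. The values used (x = 4, 6, 15 with frequencies 5, 10, 8, and the numbers 4, 8, 9) are the ones the worked examples appear to use; treat them as a reconstruction, and `grouped_mean` as a hypothetical helper name.

```python
from statistics import geometric_mean

def grouped_mean(values, freqs):
    """Arithmetic mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

x = [4, 6, 15]    # data points (reconstructed from the example)
f = [5, 10, 8]    # their frequencies
mean_grouped = grouped_mean(x, f)   # (20 + 60 + 120) / 23

gm = geometric_mean([4, 8, 9])      # cube root of 4 * 8 * 9 = 288
print(round(mean_grouped, 2), round(gm, 2))
```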
2> Median
Median in statistics is the middle value of the given data when arranged in ascending or descending order.
A median is a number that separates the higher half of a data sample from the lower half.
Formula: Median of ungrouped data
For odd number of observations:
Median = $\left(\frac{n+1}{2}\right)$th term
For even number of observations:
Median = $\frac{\left(\frac{n}{2}\right)\text{th term} + \left(\frac{n}{2} + 1\right)\text{th term}}{2}$
Median of Grouped Data
Median = $l + \left(\frac{\frac{N}{2} - cf}{f}\right) \times h$
l = lower limit of the median class
h = magnitude (width) of the median class
f = frequency of the median class
N = total frequency
cf = cumulative frequency of the class preceding the median class
3> Mode
Mode is the value having the maximum frequency corresponding to it.
- Mode of Ungrouped Data
It is the observation with the highest frequency.
Example → 2, 3, 3, 4, ...
3 has occurred 2 times.
Mode of the data is 3.
- Mode of Grouped Data
Mode = $l + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$
l = lower limit of the modal class, h = class width
f_1 = frequency of the modal class
f_0 = frequency of the class which precedes the modal class
f_2 = frequency of the class which succeeds the modal class
Example:
Class | 10-20 | 20-30 | 30-40 | 40-50 | 50-60
Frequency | 5 | 8 | 12 | 16 | 10
As class 40-50 has the highest frequency, it is the modal class.
h = 10, l = 40, f_1 = 16, f_0 = 12, f_2 = 10
Mode = $40 + \left(\frac{16 - 12}{2(16) - 12 - 10}\right) \times 10 = 40 + 4 = 44$
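The grouped-mode formula can be sketched as a small function and checked against the class table above (classes 10-20 to 50-60 with frequencies 5, 8, 12, 16, 10). The function name is hypothetical, and it assumes equal class widths and a single modal class.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data: l + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    m = max(range(len(freqs)), key=freqs.__getitem__)  # index of modal class
    f1 = freqs[m]
    f0 = freqs[m - 1] if m > 0 else 0                  # preceding class
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0     # succeeding class
    return lower_limits[m] + (f1 - f0) / (2 * f1 - f0 - f2) * width

lower_limits = [10, 20, 30, 40, 50]
freqs = [5, 8, 12, 16, 10]
mode = grouped_mode(lower_limits, freqs, width=10)
print(mode)
```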
2] Measure of Spread
A measure of spread describes how similar or varied the set of observed values are for a particular variable (data item).
It is used to describe the variability of data within a dataset.
These measures tell us how much the observations are varied or similar to each other.
Some major ways to measure the spread are:
1> Range
The range of data is given as the difference between the maximum and the minimum values of the observations in the data.
2> Variance ($\sigma^2$)
+ The variance of data is given by measuring the distance of the observed values from the mean of the distribution.
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ (population variance)
+ If all the data values are identical, then the variance is 0.
+ A small variance represents that the data points are close to the mean, whereas data points that are highly spread out from the mean and from one another indicate high variance.
+ It is the average of the squared distances from each point to the mean.
3> Standard Deviation ($\sigma$)
+ It is a measure which shows how much variation from the mean exists.
+ It calculates the extent to which the values differ from the average.
+ It is based on all values; even a change in one of the values will affect the value of SD.
S.D. = $\sqrt{\sigma^2} = \sqrt{\text{variance}}$
4> Quartile Deviation
Quartiles are the values that divide a list of numerical data into four quarters: Q1, Q2, Q3.
Q1 → lower quartile, Q2 → median, Q3 → upper quartile
QD is half the difference between the upper and lower quartiles.
$QD = \frac{Q_3 - Q_1}{2}$
Q3 = upper quartile, Q1 = lower quartile
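The range, population variance and standard deviation described above can be computed with the standard library as sketched here. The data values are hypothetical and were picked so the results come out as round numbers.

```python
import statistics

# Hypothetical observations; their mean is 6.
data = [4, 8, 6, 5, 3, 7, 9]

data_range = max(data) - min(data)      # maximum minus minimum
variance = statistics.pvariance(data)   # average squared distance from the mean
std_dev = statistics.pstdev(data)       # square root of the variance

print(data_range, variance, std_dev)
```

`pvariance` and `pstdev` divide by N (population formulas); `statistics.variance` and `statistics.stdev` divide by N − 1 and would give slightly larger values for the same data.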
Sol: Data in ascending order:
2, 5, 7, 7, 8, 8, 10, 10, 14, 15, 17, 18, 24, 27, 28, 48
Median = (10 + 14) / 2 = 12
Now lower half of data:
2, 5, 7, 7, 8, 8, 10, 10 (even)
Q1 = Median of lower half = (7 + 8) / 2 = 7.5
Now upper half of data:
14, 15, 17, 18, 24, 27, 28, 48 (even)
Q3 = Median of upper half = (18 + 24) / 2 = 21
QD = (Q3 - Q1) / 2 = (21 - 7.5) / 2 = 6.75
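The median-of-halves method used in this solution can be sketched as follows. The 16 data values are a reconstruction of the list in the worked example and should be treated as an assumption; the function name is hypothetical.

```python
from statistics import median

def quartile_deviation(values):
    """Return (Q1, Q2, Q3, QD) using the median of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd count the middle value belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    q1, q2, q3 = median(lower), median(data), median(upper)
    return q1, q2, q3, (q3 - q1) / 2

data = [2, 5, 7, 7, 8, 8, 10, 10, 14, 15, 17, 18, 24, 27, 28, 48]
q1, q2, q3, qd = quartile_deviation(data)
print(q1, q2, q3, qd)
```

Other quartile conventions (for example interpolation-based ones) can give slightly different Q1 and Q3 for the same data.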
3] Skewness in data