BA Notes - Data Science
NOTES ON BUSINESS ANALYTICS
IV B.Tech II Semester (2012PE04)
Prepared by
Mr. P. Raviprakash
Assistant Professor
DEPARTMENT OF CSE-AIML
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN
(Autonomous Institution - UGC, Govt. of India)
NIRF Indian Ranking 2018, Accepted by MHRD, Govt. of India
Permanently Affiliated to JNTUH, Approved by AICTE, ISO 9001:2015 Certified Institution, AAAA+ Rated by Digital Learning Magazine, AAA+ Rated by Careers 360 Magazine, 6th Rank CSR, Platinum Rated by AICTE-CII Survey, Top 100 Rank band by ARIIA, MHRD, Govt. of India
National Ranking - Top 100 Rank band by Outlook, National Ranking - Top 100 Rank band by Times News Magazine
Maisammaguda, Dhulapally, Secunderabad, Kompally - 500100
2023-2024
Course Objectives:
• To explore the fundamental concepts of data analytics.
• To learn the principles and methods of statistical analysis.
• To discover interesting patterns, analyze supervised and unsupervised models, and estimate the accuracy of the algorithms.
• To understand the various search methods and visualization techniques.
2. Problem Analysis: Identify, formulate, research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities, with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO/CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 H H L L H L L
CO2 M L L L H L
CO3 M L L L L
CO4 M L L L L
CO5 M L L L L
PROGRAM SPECIFIC OUTCOMES - PSOs:
PSO1:
PSO2:
PSO3:
CO-PSO MAPPING:
CO1 M M
CO2 M M
CO3 M M
CO4 M M
CO5 L M
SYLLABUS
UNIT-I
Data Management: Design Data Architecture and manage the data for analysis, understand various sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Pre-processing & Processing.
UNIT-II
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of Modeling in Business, Databases & Types of Data and variables, Data Modeling Techniques, Missing Imputations etc. Need for Business Modeling.
UNIT-III
Regression - Concepts, BLUE property assumptions, Least Square Estimation, Variable Rationalization, and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics applications to various Business Domains etc.
UNIT-IV
Object Segmentation: Regression vs Segmentation - Supervised and Unsupervised Learning, Tree Building - Regression, Classification, Overfitting, Pruning and Complexity, Multiple Decision Trees etc. Time Series Methods: ARIMA, Measures of Forecast Accuracy, STL approach, Extract features from generated model such as Height, Average Energy etc. and Analyze for prediction.
UNIT-V
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques, Visualizing Complex Data and Relations.
TEXT BOOKS
1. Student's Handbook for Associate Analytics - II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.
REFERENCE BOOKS
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira.
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), Jeffrey D. Ullman (Stanford Univ.).
INDEX
1. UNIT-1
   Introduction
   Data and architecture design
   Understand various sources of the Data
   Data Management
   Data Quality
   Data Pre-processing
   Data Processing
2. UNIT-2
   Introduction to tools
   Databases & Types of Data and variables
   Variables
   Missing Imputations
   Need for Business Modelling
   Data Modelling Techniques
3. UNIT-3
   Regression Concepts
   Logistic Regression
   Analytics applications to various Business Domains
4. UNIT-4
   Segmentation
   Regression vs Segmentation
   Multiple Decision Trees
   Overfitting and Underfitting
   Time Series Methods
   ARIMA & ARMA
   Measure of Forecast Accuracy
5. UNIT-5
   Data Visualization
   Pixel-Oriented Visualization Techniques
   Geometric Projection Visualization Techniques
   Icon-Based Visualization Techniques
   Hierarchical Visualization
   Visualizing Complex Data and Relations
1. UNIT-1
Introduction:
In the early days of computers and the Internet, far less data was generated than today; it could easily be stored and managed by users and business enterprises on a single computer, because the world's data never exceeded about 19 exabytes. In this era, however, roughly 2.5 quintillion bytes of data are created every day.
Most of the data is generated from social media sites like Facebook, Instagram, Twitter, etc., and other sources such as e-business and e-commerce transactions, hospital, school, and bank data, etc. This data is impossible to manage with traditional data storage techniques.
Whether the data is generated by a large-scale enterprise or by an individual, each and every aspect of it needs to be analysed to benefit from it. But how do we do it? Well, that's where the term Data Analytics comes in.
Why is Data Analytics important?
Data Analytics has a key role in improving your business as it is used to gather hidden insights and interesting patterns in data, generate reports, perform market analysis, and improve business requirements.
What is the role of Data Analytics?
• Gather Hidden Insights - Hidden insights from data are gathered and then analyzed with respect to business requirements.
• Generate Reports - Reports are generated from the data and are passed on to the respective teams and individuals to deal with further actions for a high rise in business.
• Perform Market Analysis - Market Analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements - Analysis of data allows improving business-to-customer requirements and experience.
What are the tools used in Data Analytics?
With the increasing demand for Data Analytics in the market, many tools with various functionalities have emerged for this purpose. Whether open-source or user-friendly, the top tools in the data analytics market are as follows.
• R programming
• Python
• Tableau Public
• QlikView
• SAS
• Microsoft Excel
• RapidMiner
• KNIME
• OpenRefine
• Apache Spark
Data and architecture design:
Data architecture in Information Technology is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.
During the definition of the target state, the Data Architecture breaks a subject down to the atomic level and then builds it back up to the desired form.
The Data Architect breaks the subject down by going through three traditional architectural processes:
Conceptual model: A business model which uses the Entity Relationship (ER) model to describe the relationships between entities and their attributes.
Logical model: A model where the problem is represented in logical form, such as rows and columns of data, classes, XML tags and other DBMS techniques.
Physical model: The physical model holds the database design, such as which type of database technology will be suitable for the architecture.
The data architecture is formed by dividing the subject into these three essential models, which are then combined.
Factors that influence Data Architecture:
Various constraints and influences will have an effect on data architecture design. These include enterprise requirements, technology drivers, economics, business policies and data processing needs.
Enterprise requirements:
• These will generally include such elements as economical and effective system expansion, acceptable performance levels (especially system access speed), transaction reliability, and transparent data management.
• In addition, the conversion of raw data such as transaction records and image files into more useful information forms through such features as data warehouses is also a common organizational requirement, since this enables managerial decision making and other organizational processes.
• One of the architecture techniques is the split between managing transaction data and (master) reference data. Another one is splitting data capture systems from data retrieval systems (as done in a data warehouse).
Technology drivers:
• These are usually suggested by the completed data architecture and database architecture designs.
• In addition, some technology drivers will derive from existing organizational integration frameworks and standards, organizational economics, and existing site resources (e.g. previously purchased software licensing).
Economics:
• These are also important factors that must be considered during the data architecture phase. It is possible that some solutions, while optimal in principle, may not be potential candidates due to their cost.
• External factors such as the business cycle, interest rates, market conditions, and legal considerations could all have an effect on decisions relevant to data architecture.
Business policies:
• Business policies that also drive data architecture design include internal organizational policies, rules of regulatory bodies, professional standards, and applicable governmental laws that can vary by applicable agency.
• These policies and rules will help describe the manner in which an enterprise wishes to process its data.
Data processing needs:
• These include accurate and reproducible transactions performed in high volumes, data warehousing for the support of management information systems (and potential data mining), repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives as required (i.e. annual budgets, new product development).
• The General Approach is based on designing the Architecture at three Levels of Specification:
➢ The Logical Level
➢ The Physical Level
➢ The Implementation Level
Understand various sources of the Data:
• Data can be generated from two types of sources, namely primary and secondary sources.
• Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form like text, video, audio, XML files, records, or other image files, used in later stages of data analysis.
• In the process of big data analysis, data collection is the initial step before starting to analyse the patterns or useful information in the data. The data which is to be analysed must be collected from different valid sources.
• The data which is collected is known as raw data, which is not useful as-is; cleaning out the impurities and utilizing that data for further analysis produces information, and the insight obtained from it is known as knowledge. Knowledge has many meanings, like business knowledge or sales of enterprise products, disease treatment, etc.
• The main goal of data collection is to collect information-rich data.
• Data collection starts with asking some questions, such as what type of data is to be collected and what is the source of collection.
• Most of the data collected is of two types: qualitative data, which is a group of non-numerical data such as words and sentences that mostly focus on the behaviour and actions of the group, and quantitative data, which is in numerical form and can be calculated using different scientific tools and sampling data.
The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data
1. Primary data:
• The data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must be according to the demand and requirements of the target audience on which the analysis is performed; otherwise it would be a burden in the data processing.
Few methods of collecting primary data:
1. Interview method:
• The data collected during this process is gathered by interviewing the target audience; the person conducting the interview is called the interviewer, and the person who answers the interview is known as the interviewee.
• Some basic business or product related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing.
• These can be both structured and unstructured, like personal interviews or formal interviews through telephone, face to face, email, etc.
2. Survey method:
• The survey method is the process of research where a list of relevant questions is asked and answers are noted down in the form of text, audio, or video.
• The survey method can be conducted in both online and offline mode, like through website forms and email. The survey answers are then stored for analysing the data. Examples are online surveys or surveys through social media polls.
3. Observation method:
• The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using some data collecting tool and stores the observed data in the form of text, audio, video, or any raw format.
• In this method, the data is collected directly by posing a few questions to the participants. For example, observing a group of customers and their behaviour towards the products. The data obtained will be sent for processing.
4. Experimental method:
• The experimental method is the process of collecting data through performing experiments, research, and investigation.
• The most frequently used experimental designs are CRD, RBD, LSD, and FD.
CRD - Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
RBD - Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks.
• Random experiments are performed on each of the blocks, and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector.
• Randomized Block Design - the term originated from agricultural research. In this design, several treatments of variables are applied to different blocks of land to ascertain their effect on the yield of the crop.
• Blocks are formed in such a manner that each block contains as many plots as the number of treatments, so that one plot from each block is selected at random for each treatment. The production of each plot is measured after the treatment is given.
• These data are then interpreted, and inferences are drawn by using the analysis of variance technique, so as to know the effect of various treatments like different doses of fertilizers, different types of irrigation, etc.
LSD - Latin Square Design is an experimental design that is similar to CRD and RBD blocks but contains rows and columns.
• It is an arrangement of N x N squares with an equal number of rows and columns which contain letters that occur only once in a row. Hence the differences can be easily found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
• A Latin square is one of the experimental designs which has a balanced two-way classification scheme, say for example a 4 x 4 arrangement. In this scheme each letter from A to D occurs only once in each row and also only once in each column.
• The Latin square is probably underused in most fields of research because textbook examples tend to be restricted to agriculture, the area which spawned most original work on ANOVA. Agricultural examples often reflect geographical designs where rows and columns are literally two dimensions of a grid in a field.
• Rows and columns can be any two sources of variation in an experiment. In this sense a Latin square is a generalisation of a randomized block design with two different blocking systems.
A B C D
B C D A
C D A B
D A B C
• The balanced arrangement achieved in a Latin Square is its main strength. In this design, the comparisons among treatments will be free from both differences between rows and columns. Thus, the magnitude of error will be smaller than in any other design.
FD - Factorial Design is an experimental design where each experiment has two or more factors, each with possible values, and on performing trials other combinational factors are derived. This design allows the experimenter to test two or more variables simultaneously. It also measures the interaction effects of the variables and analyses the impacts of each of the variables. In a true experiment, randomization is essential so that the experimenter can infer cause and effect without any bias.
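Since RBD and LSD both draw their inferences with ANOVA, a minimal one-way ANOVA sketch in Python may help; the yield figures for three hypothetical fertilizer treatments below are made up for illustration:

```python
# Minimal one-way ANOVA sketch (hypothetical crop-yield data for
# three fertilizer treatments; the numbers are illustrative only).
from scipy import stats

treatment_a = [20.1, 21.3, 19.8, 20.7]   # yields under treatment A
treatment_b = [22.5, 23.0, 22.1, 23.4]   # yields under treatment B
treatment_c = [19.0, 18.7, 19.5, 18.9]   # yields under treatment C

# H0: all treatment means are equal; a small p-value suggests that
# at least one treatment affects the yield.
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```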
2. Secondary data:
Secondary data is data which has already been collected and is reused again for some valid purpose. This type of data was previously recorded from primary data and has two types of sources, named internal source and external source.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption is less in obtaining internal sources.
▪ Accounting resources - These give a great deal of information which can be used by the marketing researcher. They give information about internal factors.
▪ Sales Force Report - It gives information about the sales of a product. The information provided is from outside the organization.
▪ Internal Experts - These are people who are heading the various departments. They can give an idea of how a particular thing is working.
▪ Miscellaneous Reports - These are the information you get from operational reports. If the data available within the organization is unsuitable or inadequate, the marketer should extend the search to external secondary data sources.
External source:
The data which can't be found in internal organizations and can be gained through external third-party resources is external source data. The cost and time consumption are more because this contains a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
3. Government Publications-
▪ Government sources provide an extremely rich pool of data for researchers. In addition, many of these data are available free of cost on internet websites. There are a number of government agencies generating data.
4. These are like: Registrar General of India - It is an office which generates demographic data. It includes details of gender, age, occupation, etc.
5. Central Statistical Organization-
▪ This organization publishes the national accounts statistics. It contains estimates of national income for several years, growth rate, and rate of major economic activities. The Annual Survey of Industries is also published by the CSO.
▪ It gives information about the total number of workers employed, production units, material used and value added by the manufacturer.
6. Director General of Commercial Intelligence-
▪ This office operates from Kolkata. It gives information about foreign trade, i.e. import and export. These figures are provided region-wise and country-wise.
7. Ministry of Commerce and Industries-
1. The Bombay Stock Exchange
▪ It publishes a directory containing financial accounts, key profitability ratios and other relevant matter.
▪ Various Associations of Press Media
• Export Promotion Council
• Confederation of Indian Industries (CII)
• Small Industries Development Board of India
• Different mills like woollen mills, textile mills, etc.
▪ The only disadvantage of the above sources is that the data may be biased; they are likely to gloss over their negative points.
2. Syndicate Services-
▪ These services are provided by certain organizations which collect and tabulate marketing information on a regular basis for a number of clients who are subscribers to these services.
▪ These services are useful in television viewing, movement of consumer goods, etc.
▪ These syndicate services provide information data from both households as well as institutions.
In collecting data from households, they use three approaches:
Survey - They conduct surveys regarding lifestyle, sociographic and general topics.
Mail Diary Panel - It may be related to two fields: purchase and media.
Electronic Scanner Services - These are used to generate data on volume.
They collect data for institutions from:
• Wholesalers
• Retailers, and
• Industrial Firms
▪ Various syndicate services are the Operations Research Group (ORG) and The Indian Marketing Research Bureau (IMRB).
Importance of Syndicate Services:
• Syndicate services are becoming popular since the constraints of decision making are changing and we need more specific decision-making in the light of the changing environment. Also, syndicate services are able to provide information to the industries at a low unit cost.
Disadvantages of Syndicate Services:
• The information provided is not exclusive. A number of research agencies provide customized services which suit the requirements of each individual organization.
International Organizations-
These include:
• The International Labour Organization (ILO): It publishes data on the total and active population, employment, unemployment, wages and consumer prices.
• The Organization for Economic Co-operation and Development (OECD): It publishes data on foreign trade, industry, food, transport, and science and technology.
• The International Monetary Fund (IMF): It publishes reports on national and international foreign exchange regulations.
Other sources:
Sensors data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
Satellites data: Satellites collect a lot of images and data, in terabytes, on a daily basis through surveillance cameras, which can be used to collect useful information.
Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be predicted and collected with their permission for data analysis. The search engines also provide their data through the keywords and queries searched most often.
Export all the Data onto the cloud, like Amazon Web Services S3:
We usually export our data to the cloud for purposes like safety, multiple access and real-time simultaneous analysis.
Data Management:
Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-effectively. The goal of data management is to help people, organizations, and connected things optimize the use of data within the bounds of policy and regulation so that they can make decisions and take actions that maximize the benefit to the organization.
Managing digital data in an organization involves a broad range of tasks, policies, procedures, and practices. The work of data management has a wide scope, covering factors such as how to:
• Create, access, and update data across a diverse data tier
• Store data across multiple clouds and on premises
• Provide high availability and disaster recovery
• Use data in a growing variety of apps, analytics, and algorithms
• Ensure data privacy and security
• Archive and destroy data in accordance with retention schedules and compliance requirements
What is Cloud Computing?
Cloud computing is a term that refers to storing and accessing data over the internet. It doesn't store any data on the hard disk of your personal computer; in cloud computing, you can access data from a remote server.
Service Models of Cloud Computing are the reference models on which Cloud Computing is based. These can be categorized into three basic service models as listed below:
1. INFRASTRUCTURE as a SERVICE (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual machines, virtual storage, etc.
2. PLATFORM as a SERVICE (PaaS)
PaaS provides the runtime environment for applications, development & deployment tools, etc.
3. SOFTWARE as a SERVICE (SaaS)
The SaaS model allows the use of software applications as a service by end users.
For providing the above service models, AWS is one of the popular platforms; Amazon Cloud (Web) Services is one of the popular service platforms for Data Management.
Amazon Cloud (Web) Services Tutorial
What is AWS?
The full form of AWS is Amazon Web Services. It is a platform that offers flexible, reliable, scalable, easy-to-use and cost-effective cloud computing solutions.
AWS is a comprehensive, easy to use computing platform offered by Amazon. The platform is developed with a combination of infrastructure as a service (IaaS), platform as a service (PaaS) and packaged software as a service (SaaS) offerings.
History of AWS
2002 - AWS services launched
2006 - Launched its cloud products
2012 - Holds first customer event
2015 - Reveals revenues achieved of $4.6 billion
2016 - Surpassed $10 billion revenue target
2016 - Release Snowball and Snowmobile
2019 - Offers nearly 100 cloud services
2021 - AWS comprises over 200 products and services
Important AWS Services
Amazon Web Services offers a wide range of different business-purpose global cloud-based products. The products include storage, databases, analytics, networking, mobile, development tools, and enterprise applications, with a pay-as-you-go pricing model.
Amazon Web Services - Amazon S3:
1. How to Configure S3?
Following are the steps to configure an S3 account.
Step 1 − Open the Amazon S3 console using this link − https://console.aws.amazon.com/s3/home
Step 2 − Create a Bucket using the following steps.
• A prompt window will open. Click the Create Bucket button at the bottom of the page.
• The Create a Bucket dialog box will open. Fill in the required details and click the Create button.
• Select the Static Website Hosting option. Click the radio button Enable website hosting and fill in the required details.
Step 3 − Add an Object to a bucket using the following steps.
• Open the Amazon S3 console using the following link: https://console.aws.amazon.com/s3/home
• Click the Upload button.
• Click the Add files option. Select those files which are to be uploaded from the system and then click the Open button.
• Click the Start Upload button. The files will get uploaded into the bucket.
• Afterwards, we can create, edit, modify and update the objects and other files in a wide range of formats.
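The same bucket-and-upload workflow can also be scripted. A minimal sketch using the AWS SDK for Python (boto3), assuming credentials are already configured; the bucket and file names are hypothetical:

```python
# Minimal S3 sketch using boto3 (assumes AWS credentials are already
# configured, e.g. via `aws configure`; names below are hypothetical).
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create a bucket (bucket names must be globally unique).
s3.create_bucket(Bucket="my-analytics-notes-bucket")

# Upload a local file as an object into the bucket.
s3.upload_file("report.csv", "my-analytics-notes-bucket", "data/report.csv")

# List the objects we just stored.
response = s3.list_objects_v2(Bucket="my-analytics-notes-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```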
Amazon S3 Features
• Low cost and Easy to Use − Using Amazon S3, the user can store a large amount of data at very low charges.
• Secure − Amazon S3 supports data transfer over SSL, and the data gets encrypted automatically once it is uploaded. The user has complete control over their data by configuring bucket policies using AWS IAM.
• Scalable − Using Amazon S3, there need not be any worry about storage concerns. We can store as much data as we have and access it anytime.
• Higher performance − Amazon S3 is integrated with Amazon CloudFront, which distributes content to the end users with low latency and provides high data transfer speeds without any minimum usage commitments.
• Integrated with AWS services − Amazon S3 is integrated with AWS services including Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.
Data Quality:
What is Data Quality?
There are many definitions of data quality. In general, data quality is the assessment of how usable the data is and how well it fits its serving context.
Why is Data Quality Important?
Enhancing data quality is a critical concern, as data is considered the core of all activities within organizations; poor data quality leads to inaccurate reporting, which will result in inaccurate decisions and surely economic damages.
Many factors help in measuring data quality, such as:
• Data Accuracy: Data are accurate when the data values stored in the database correspond to real-world values.
• Data Uniqueness: A measure of unwanted duplication existing within or across systems for a particular field, record, or data set.
• Data Consistency: The absence of violations of semantic rules defined over the data set.
• Data Completeness: The degree to which values are present in a data collection.
• Data Timeliness: The extent to which the age of the data is appropriate for the task at hand.
Other factors can be taken into consideration, such as Availability, Ease of Manipulation, and Believability.
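Some of these factors can be quantified directly. A minimal pandas sketch that profiles completeness and duplication for a data set (the file and column names are hypothetical):

```python
# Minimal data-quality profiling sketch with pandas
# (the file name and columns are hypothetical).
import pandas as pd

df = pd.read_csv("customers.csv")

# Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# Uniqueness: share of fully duplicated records.
duplicate_rate = df.duplicated().mean()

print("Completeness per column:\n", completeness)
print(f"Duplicate record rate: {duplicate_rate:.2%}")
```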
OUTLIERS:
• An outlier is a point or an observation that deviates significantly from the other observations.
• Outlier is a commonly used terminology by analysts and data scientists, as it needs close attention, else it can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from and diverges from an overall pattern in a sample.
• Reasons for outliers: due to experimental errors or special circumstances.
• There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
• There are various methods of outlier detection. Some are graphical, such as normal probability plots. Others are model-based. Box plots are a hybrid.
Types of Outliers:
Outliers can be of two types:
Univariate: These outliers can be found when we look at the distribution of a single variable.
Multivariate: Multivariate outliers are outliers in an n-dimensional space. In order to find them, you have to look at distributions in multiple dimensions.
Impact of Outliers on a dataset:
Outliers can drastically change the results of the data analysis and statistical modelling. There are numerous unfavourable impacts of outliers in the data set:
• They increase the error variance and reduce the power of statistical tests
• If the outliers are non-randomly distributed, they can decrease normality
• They can bias or influence estimates that may be of substantive interest
• They can also impact the basic assumptions of regression, ANOVA and other statistical models.
Detect Outliers:
The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plot, histogram, and scatter plot.
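Alongside visual inspection, the box-plot rule can be applied programmatically. A minimal sketch of IQR-based outlier detection (the data values are illustrative):

```python
# IQR (box-plot) rule for univariate outliers: points beyond
# 1.5 * IQR from the quartiles are flagged. Values are illustrative.
import numpy as np

data = np.array([12, 14, 15, 13, 14, 16, 15, 13, 95, 14])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("Bounds:", lower, upper)
print("Outliers:", outliers)   # 95 is flagged here
```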
Outlier treatments are of three types:
Retention:
• There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise. There are various methods of outlier detection. Some are graphical, such as normal probability plots. Others are model-based. Box plots are a hybrid.
Exclusion:
• According to the purpose of the study, it is necessary to decide whether and which outliers will be removed/excluded from the data, since they could highly bias the final results of the analysis.
Rejection:
• Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known.
• An outlier resulting from an instrument reading error may be excluded, but it is desirable that the reading is at least verified.
Other treatment methods:
• The OUTLIER package in R can be used to detect and treat outliers in data.
• Outlier detection from graphical representation: scatter plot and box plot; the observations outside the box are treated as outliers in the data.
Missing Data treatment:
Missing Values
➢ Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analyzed the behavior and relationship with other variables correctly. It can lead to wrong predictions or classifications.
➢ In R, missing values are represented by the symbol NA (not available).
➢ Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number), and R outputs the result of dividing by zero as Inf (Infinity).
PMM approach to treat missing values:
• PMM -> Predictive Mean Matching (PMM) is a semi-parametric imputation approach.
• It is similar to the regression method, except that for each missing value it fills in a value picked randomly from among the observed donor values of observations whose regression-predicted values are closest to the regression-predicted value for the missing value under the simulated regression model.
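PMM is implemented in R's mice package. As a rough illustration of the donor-matching idea only, here is a simplified toy sketch in Python (not the full multiple-imputation algorithm; the data is simulated):

```python
# Simplified Predictive Mean Matching sketch: a toy illustration of
# the donor idea, not the full algorithm from R's mice package.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)
y[::10] = np.nan                          # knock out every 10th value

obs = ~np.isnan(y)
model = LinearRegression().fit(x[obs].reshape(-1, 1), y[obs])
pred = model.predict(x.reshape(-1, 1))    # predicted mean for every row

k = 5                                      # number of candidate donors
for i in np.where(~obs)[0]:
    # find observed rows whose predictions are closest to this row's
    dist = np.abs(pred[obs] - pred[i])
    donors = np.argsort(dist)[:k]
    # impute with the *observed* value of a randomly chosen donor
    y[i] = rng.choice(y[obs][donors])
```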
Data Pre-processing:
Preprocessing in Data Mining: Data preprocessing is a data mining technique which is used to transform raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some data is missing in the data set. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
2. Fill the missing values:
There are various ways to do this task. You can choose to fill in the missing values manually, by the attribute mean, or with the most probable value.
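Both options map directly onto pandas operations. A minimal sketch (the column names are hypothetical):

```python
# Handling missing data with pandas (column names are hypothetical).
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, None],
                   "city": ["A", "B", None, "B"]})

# 1. Ignore the tuples: drop rows containing any missing value.
dropped = df.dropna()

# 2. Fill the missing values: attribute mean for a numeric column,
#    most probable (modal) value for a categorical one.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```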
(b) Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways (a binning sketch follows this list):
1. Binning Method:
This method works on sorted data in order to smooth it. Binning, also called discretization, is a technique for reducing the cardinality (the total number of unique values for a dimension is known as its cardinality) of continuous and discrete data. Binning groups related values together in bins to reduce the number of distinct values.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
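The binning sketch referenced above: sort values into bins and smooth each value by its bin mean (the data values are illustrative):

```python
# Binning sketch: split values into equal-width bins, then smooth
# each value by its bin mean (data values are illustrative).
import pandas as pd

s = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.cut(s, bins=3)                       # 3 equal-width bins
smoothed = s.groupby(bins).transform("mean")   # replace by bin mean
print(pd.DataFrame({"raw": s, "bin": bins, "smoothed": smoothed}))
```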
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways (a sketch of normalization and discretization follows this list):
1. Normalization:
Normalization is a technique often applied as part of data preparation in data analytics through machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. For machine learning, not every dataset requires normalization. It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
Discretization is the process through which we can transform continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function. Continuous data is measured, while discrete data is counted.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute "city" can be converted to "country".
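The sketch referenced above, showing min-max normalization to the 0.0 to 1.0 range and discretization into labeled intervals (the values and bin edges are illustrative):

```python
# Min-max normalization and discretization sketch (values illustrative).
import pandas as pd

income = pd.Series([22_000, 35_000, 58_000, 91_000, 120_000])

# Normalization: rescale to the common range 0.0 to 1.0.
normalized = (income - income.min()) / (income.max() - income.min())

# Discretization: map the continuous variable into contiguous bins.
bracket = pd.cut(income, bins=[0, 40_000, 90_000, float("inf")],
                 labels=["low", "medium", "high"])
print(pd.DataFrame({"income": income, "norm": normalized,
                    "bracket": bracket}))
```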
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and while working with a huge volume of data, analysis becomes harder. In order to get rid of this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute. An attribute with a p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such reduction is called lossless reduction; otherwise it is called lossy reduction. The two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
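A minimal PCA sketch with scikit-learn, projecting standardized features onto the first two principal components (the input data is randomly generated for illustration):

```python
# Dimensionality reduction via PCA (lossy): project standardized
# features onto the first two principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(100, 6))   # illustrative data

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance kept per component
```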
Data Processing:
Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or team of data scientists, it is important for data processing to be done correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization.
Six stages of data processing
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well-built, so the data collected (and later used as information) is of the highest possible quality.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as pre-processing, is the stage at which raw data is cleaned up and organized for the following stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create high-quality data for the best business intelligence.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift), and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information.
4. Processing
During this stage, the data inputted to the computer in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of the data being processed (data lakes, social networks, connected devices, etc.) and its intended use (examining advertising patterns, medical diagnosis from connected devices, determining customer needs, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. When data is properly stored, it can be quickly and easily accessed by members of the organization when needed.
*** End of Unit-1 ***
UNIT-2
Data has been the buzzword for ages now. Whether the data is generated by a large-scale enterprise or by an individual, each and every aspect of it needs to be analyzed to benefit from it.
Why is Data Analytics important?
Data Analytics has a key role in improving your business as it is used to gather hidden insights, generate reports, perform market analysis, and improve business requirements.
What is the role of Data Analytics?
• Gather Hidden Insights - Hidden insights from data are gathered and then analyzed with respect to business requirements.
• Generate Reports - Reports are generated from the data and are passed on to the respective teams and individuals to deal with further actions for a high rise in business.
• Perform Market Analysis - Market Analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements - Analysis of data allows improving business-to-customer requirements and experience.
Ways to Use Data Analytics:
Now that you have looked at what data analytics is, let's understand how we can use data analytics.
Fig: Ways to use Data Analytics
1. Improved Decision Making: Data Analytics eliminates guesswork and manual tasks, be it choosing the right content, planning marketing campaigns, or developing products. Organizations can use the insights they gain from data analytics to make informed decisions, leading to better outcomes and customer satisfaction.
2. Better Customer Service: Data analytics allows you to tailor customer service according to customers' needs. It also provides personalization and builds stronger relationships with customers. Analyzed data can reveal information about customers' interests, concerns, and more. It helps you give better recommendations for products and services.
3. Efficient Operations: With the help of data analytics, you can streamline your processes, save money, and boost production. With an improved understanding of what your audience wants, you spend less time creating ads and content that aren't in line with your audience's interests.
4. Effective Marketing: Data analytics gives you valuable insights into how your campaigns are performing. This helps in fine-tuning them for optimal outcomes. Additionally, you can also find potential customers who are most likely to interact with a campaign and convert into leads.
Steps Involved in Data Analytics:
The next step to understanding what data analytics is, is to learn how data is analyzed in organizations. There are a few steps involved in the data analytics lifecycle. Below are the steps that you can take to solve your problems.
Fig: Data Analytics process steps
1. Understand the problem: Understanding the business problems, defining the organizational goals, and planning a lucrative solution is the first step in the analytics process. E-commerce companies often encounter issues such as predicting the return of items, giving relevant product recommendations, cancellation of orders, identifying frauds, optimizing vehicle routing, etc.
2. Data Collection: Next, you need to collect transactional business data and customer-related information from the past few years to address the problems your business is facing. The data can have information about the total units that were sold for a product, the sales and profit that were made, and also when the order was placed. Past data plays a crucial role in shaping the future of a business.
3. Data Cleaning: Now, all the data you collect will often be disorderly, messy, and contain unwanted missing values. Such data is not suitable or relevant for performing data analysis. Hence, you need to clean the data to remove unwanted, redundant, and missing values to make it ready for analysis.
4. Data Exploration and Analysis: After you gather the right data, the next vital step is to execute exploratory data analysis. You can use data visualization and business intelligence tools, data mining techniques, and predictive modelling to analyze, visualize, and predict future outcomes from this data. Applying these methods can tell you the impact and relationship of a certain feature as compared to other variables.
Below are the results you can get from the analysis:
• You can identify when a customer purchases the next product.
• You can understand how long it took to deliver the product.
• You get a better insight into the kind of items a customer looks for, product returns, etc.
• You will be able to predict the sales and profit for the next quarter.
• You can minimize order cancellation by dispatching only relevant products.
• You'll be able to figure out the shortest route to deliver the product, etc.
5. Interpret the results: The final step is to interpret the results and validate whether the outcomes meet your expectations. You can find out hidden patterns and future trends. This will help you gain insights that will support you with appropriate data-driven decision making.
Introduction to tools:
What are the tools used in Data Analytics?
With the increasing demand for Data Analytics in the market, many tools with various functionalities have emerged for this purpose. Whether open-source or user-friendly, the top tools in the data analytics market are as follows.
• R programming - This tool is the leading analytics tool used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac OS. It also provides tools to automatically install all packages as per user requirements.
• Python - Python is an open-source, object-oriented programming language that is easy to read, write, and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also be assembled on any platform like a SQL Server, a MongoDB database, or JSON.
• Tableau Public - This is free software that connects to any data source such as Excel, a corporate Data Warehouse, etc. It then creates visualizations, maps, dashboards, etc., with real-time updates on the web.
• QlikView - This tool offers in-memory data processing with the results delivered to the end-users quickly. It also offers data association and data visualization with data being compressed to almost 10% of its original size.
• SAS - A programming language and environment for data manipulation and analytics, this tool is easily accessible and can analyze data from different sources.
• Microsoft Excel - This tool is one of the most widely used tools for data analytics. Mostly used for clients' internal data, this tool analyzes the tasks that summarize the data with a preview of pivot tables.
• RapidMiner - A powerful, integrated platform that can integrate with any data source type such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, etc. This tool is mostly used for predictive analytics, such as data mining, text analytics, and machine learning.
• KNIME - Konstanz Information Miner (KNIME) is an open-source data analytics platform which allows you to analyze and model data. With the benefit of visual programming, KNIME provides a platform for reporting and integration through its modular data pipeline concept.
• OpenRefine - Also known as Google Refine, this data cleaning software will help you clean up data for analysis. It is used for cleaning messy data, the transformation of data, and parsing data from websites.
• Apache Spark - One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. This tool is also popular for data pipelines and machine learning model development.
Data Analytics Applications:
Data analytics is used in almost every sector of business; let's discuss a few of them:
1. Retail: Data analytics helps retailers understand their customers' needs and buying habits to predict trends, recommend new products, and boost their business. They optimize the supply chain and retail operations at every step of the customer journey.
2. Healthcare: Healthcare industries analyse patient data to provide lifesaving diagnoses and treatment options. Data analytics helps in discovering new drug development methods as well.
3. Manufacturing: Using data analytics, manufacturing sectors can discover new cost-saving opportunities. They can solve complex supply chain issues, labour constraints, and equipment breakdowns.
4. Banking sector: Banking and financial institutions use analytics to find out probable loan defaulters and the customer churn rate. It also helps in detecting fraudulent transactions immediately.
5. Logistics: Logistics companies use data analytics to develop new business models and optimize routes. This, in turn, ensures that the delivery reaches on time in a cost-efficient manner.
Cluster computing:
▪ Cluster computing is a collection of tightly or loosely connected computers that work together so that they act as a single entity.
▪ The connected computers execute operations all together, thus creating the idea of a single system.
▪ The clusters are generally connected through fast local area networks (LANs).
Why is Cluster Computing important?
Apache Spark:
Evolution of Apache Spark
Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.
Features of Apache Spark:
Apache Spark has the following features.
Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk. It stores the intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes up with 80 high-level operators for interactive querying.
Advanced Analytics − Spark not only supports 'Map' and 'Reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Spark Built on Hadoop
The following diagram shows three ways of how Spark can be built with Hadoop components.
There are three ways of Spark deployment, as explained below.
Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce will run side by side to cover all Spark jobs on the cluster.
Hadoop Yarn − Hadoop Yarn deployment means, simply, Spark runs on Yarn (Yet Another Resource Negotiator) without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack. It allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch a Spark job in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
Components of Spark
The following illustration depicts the different components of Spark.
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and semi-
structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, made possible by the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
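Spark itself is usually driven from Scala, Python or Java, but since these notes use R for worked examples, here is a minimal, hedged sketch of driving Spark from R through the sparklyr package. The package, the local master URL and the reuse of the mtcars data frame are assumptions for illustration, not part of the original notes.

> # Sketch: using Spark from R via the 'sparklyr' package (assumed installed)
> library(sparklyr)
> sc <- spark_connect(master = "local")       # a local Spark instance; a cluster URL in practice
> cars_tbl <- copy_to(sc, mtcars)             # copy an R data frame into Spark
> fit <- ml_linear_regression(cars_tbl, mpg ~ wt + hp)  # MLlib linear regression
> summary(fit)
> spark_disconnect(sc)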
What is Scala?
• Scala is a statically typed, general-purpose programming language that combines functional and object-oriented programming, and is also suitable for imperative approaches, in order to increase the scalability of applications. In Scala, everything is an object, whether it is a function or a number; it does not have the concept of primitive data types.
• Scala primarily runs on the JVM platform, and it can also be used to write software for native platforms using Scala Native, and for JavaScript runtimes through Scala.js.
• This language was originally built for the Java Virtual Machine (JVM), and one of Scala's strengths is that it makes it very easy to interact with Java code.
• Scala is a Scalable Language used to write software for multiple platforms; hence it got the name Scala. The language is intended to solve the problems of Java while simultaneously being more concise. Initially designed by Martin Odersky, it was released in 2003.
Why Scala?
• Scala is the core language used to write the most popular distributed big data processing framework, Apache Spark. Big data processing is becoming inevitable for small to large enterprises.
• Extracting valuable insights from data requires state-of-the-art processing tools and frameworks.
• Scala is easy to learn for object-oriented programmers and Java developers. It is becoming one of the popular languages in recent years.
• Scala offers first-class functions for users.
• Scala can be executed on the JVM, thus paving the way for interoperability with other languages.
• It is designed for applications that are concurrent (parallel), distributed, and resilient (robust) message-driven. It is one of the most in-demand languages of this decade.
• It is a concise, powerful language that can quickly grow according to the demands of its users.
• It is object-oriented and has a lot of functional programming features, providing a lot of flexibility to developers to code in the way they want.
• Scala offers many Duck Types (Structural Types).
• Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching.
• The name Scala is a portmanteau of "scalable" and "language", signifying that it is designed to grow with the demands of its users.
Where can Scala be used?
• Web applications
• Utilities and libraries
• Data streaming
• Parallel batch processing
• Concurrency and distributed applications
• Data analytics with Spark
• AWS Lambda expressions
Cloudera Impala:
• Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop; it serves as a native analytic database for Hadoop.
• It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon.
• The project was announced in October 2012 with a public beta test distribution and became generally available in May 2013.
• Impala enables users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.
• Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.
• Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools.
• The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.
Features include:
• Supports HDFS and Apache HBase storage
• Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet
• Supports Hadoop security (Kerberos authentication)
• Fine-grained, role-based authorization with Apache Sentry
• Uses metadata, ODBC driver, and SQL syntax from Apache Hive
Databases & Types of Data and Variables
Database: A database is a collection of related data.
Database Management System: A DBMS is a software or set of programs used to define, construct and manipulate the data.
Relational Database Management System: An RDBMS is a software system used to maintain relational databases. Many relational database systems have the option of using SQL.
NoSQL:
• A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins, and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for big data and real-time web apps. For example, companies like Twitter, Facebook and Google collect terabytes of user data every single day.
• NoSQL stands for "Not Only SQL" or "Not SQL". Though a better term would be NoREL, NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
• A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. In contrast, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.
Why NoSQL?
• The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc. who deal with huge volumes of data. The system response time becomes slow when you use an RDBMS for massive volumes of data.
• To resolve this problem, we could scale up our systems by upgrading our existing hardware, but this process is expensive. The alternative is to distribute the database load on multiple hosts whenever the load increases. This method is known as scaling out.
Types of NoSQL Databases:
• Document-oriented (JSON documents): MongoDB and CouchDB
• Key-value: Redis and DynamoDB
• Wide-column: Cassandra and HBase
• Graph: Neo4j and Amazon Neptune
Relational Databases (SQL)    Non-relational Databases (NoSQL)
Oracle                        MongoDB
MySQL                         CouchDB
SQL Server                    BigTable
SQL vs NoSQL DB:
SQL                                              NoSQL
Relational database management system (RDBMS)    Non-relational or distributed database system
Fixed, static or predefined schema               Dynamic schema
Vertically scalable                              Horizontally scalable
Follows the ACID properties                      Follows CAP (consistency, availability, partition tolerance)
Differences between SQL and NoSQL
The table below summarizes the main differences between SQL and NoSQL databases.
Examples:
- SQL databases: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL.
- NoSQL databases: Document: MongoDB and CouchDB; Key-value: Redis and DynamoDB; Wide-column: Cassandra and HBase; Graph: Neo4j and Amazon Neptune.
Multi-record ACID transactions:
- SQL databases: Supported.
- NoSQL databases: Most do not support multi-record ACID transactions; however, some, like MongoDB, do.
Benefits of NoSQL
➢ The NoSQL data model addresses several issues that the relational model is not designed to address:
➢ Large volumes of structured, semi-structured, and unstructured data.
➢ Object-oriented programming that is easy to use and flexible.
➢ Efficient, scale-out architecture instead of expensive, monolithic architecture.
Variables:
➢ Data consist of individuals and variables that give us information about those individuals. An individual can be an object or a person.
➢ A variable is an attribute, such as a measurement or a label.
➢ Two types of data:
➢ Quantitative data (Numerical)
➢ Categorical data
Discrete vs continuous variables
Discrete variables: counts of individual items or values.
- Examples: number of students in a class; number of different tree species in a forest.
Continuous variables: measurements of continuous or non-finite values.
- Examples: distance, volume, age.
Missing Imputations:
Imputation is the process of replacing missing data with substituted values.
Types of missing data
Missing data can be classified into one of three categories:
1. MCAR
Data which is Missing Completely At Random has nothing systematic about which observations are missing values. There is no relationship between missingness and either observed or unobserved covariates.
2. MAR
Missing At Random is weaker than MCAR. The missingness is still random, but due entirely to observed variables. For example, those from a lower socioeconomic status may be less willing to provide salary information (but we know their SES status). The key is that the missingness is not due to the values which are not observed. MCAR implies MAR, but not vice-versa.
3. MNAR
If the data are Missing Not At Random, then the missingness depends on the values of the missing data. Censored data falls into this category. For example, individuals who are heavier are less likely to report their weight. As another example, a device measuring some response can only measure values above 0.5; anything below that is missing.
There can be two types of gaps in data:
1. Missing Data Imputation
2. Model-based Technique
Imputations: (Treatment of Missing Values)
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or -∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common - that of "Unknown". Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Take the average value of that particular attribute and use this value to replace the missing values in that attribute column (a small R sketch of this method follows this list).
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your dataset, you may construct a decision tree to predict the missing values for income.
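The following minimal R sketch (an addition for illustration; the income values are made up) shows method 4, mean imputation, on a numeric column:

> # Sketch: mean imputation for a numeric column (method 4)
> df <- data.frame(income = c(52, 61, NA, 47, NA, 58))
> df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)
> df$income
[1] 52.0 61.0 54.5 47.0 54.5 58.0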
Need for Business Modelling:
Companies that embrace big data analytics and transform their business models in parallel will create new opportunities for revenue streams, customers, products and services. This requires having a big data strategy and vision that identifies and capitalizes on new opportunities.
Analytics applications to various Business Domains
Application of Modelling in Business:
• Applications of Data Modelling can be termed as Business Analytics.
• Business analytics involves the collating, sorting, processing, and studying of business-related data using statistical models and iterative methodologies. The goal of BA is to narrow down which datasets are useful and which can increase revenue, productivity, and efficiency.
• Business analytics (BA) is the combination of skills, technologies, and practices used to examine an organization's data and performance as a way to gain insights and make data-driven decisions in the future using statistical analysis.
Data Modelling Techniques in Data Analytics:
What is Data Modelling?
• Data modelling is the process of analyzing data objects and their relationships to other objects. It is used to analyze the data requirements that are needed for the business processes. Data models are created for the data to be stored in a database.
• The data model's main focus is on what data is needed and how we have to organize the data, rather than on what operations we have to perform.
• A data model is basically an architect's building plan: a way of documenting a complex software system design as a diagram that can be easily understood.
Uses of Data Modelling:
• Data modelling helps create a robust design with a data model that can show an organization's entire data on the same platform.
• The database at the logical, physical, and conceptual levels can be designed with the help of the data model.
• Data modelling tools help in the improvement of data quality.
• Redundant data and missing data can be identified with the help of data models.
• Building the data model is quite time-consuming, but it makes maintenance cheaper and faster.
Data Modelling Techniques:
Given below are 5 different types of techniques used to organize the data:
1. Hierarchical Technique
The hierarchical model is a tree-like structure. There is one root node, or we can say one parent node, and the other child nodes are sorted in a particular order. But the hierarchical model is very rarely used now. This model can be used for real-world model relationships.
2. Object-oriented Model
The object-oriented approach is the creation of objects that contain stored values. The object-oriented model communicates while supporting data abstraction, inheritance, and encapsulation.
3. Network Technique
The network model provides us with a flexible way of representing objects and the relationships between these entities. It has a feature known as a schema, representing the data in the form of a graph. An object is represented inside a node and the relation between them as an edge, enabling them to maintain multiple parent and child records in a generalized manner.
4. Entity-relationship Model
The ER model (Entity-relationship model) is a high-level relational model which is used to define data elements and relationships for the entities in a system. This conceptual design provides a better view of the data that is easy to understand. In this model, the entire database is represented in a diagram called an entity-relationship diagram, consisting of Entities, Attributes, and Relationships.
5. Relational Technique
The relational model is used to describe the different relationships between the entities. There are different sets of relations between the entities, such as one-to-one, one-to-many, many-to-one, and many-to-many.
*** End of Unit-2 ***
3. UNIT 3
Regression Concepts:
Introduction:
▪ The term regression is used to indicate the estimation or prediction of the average value of one variable for a specified value of another variable.
▪ Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.
▪ Regression analysis is a statistical process for estimating the relationships between the dependent variables (criterion variables / response variables) and one or more independent variables (predictor variables).
▪ Regression describes how an independent variable is numerically related to the dependent variable.
▪ Regression can be used for prediction, estimation and hypothesis testing, and modeling causal relationships.
When is Regression chosen?
▪ A regression problem is when the output variable is a real or continuous value, such as salary or weight.
▪ Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane that goes through the points.
▪ Mathematically, a linear relationship represents a straight line when plotted as a graph.
▪ A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
Types of Regression Analysis Techniques:
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Bayesian Linear Regression
Advantages & Limitations:
• Fast and easy to model, and particularly useful when the relationship to be modeled is not extremely complex and you don't have a lot of data.
• Very intuitive to understand and interpret.
• Linear regression is very sensitive to outliers.
Linear regression:
• Linear regression is a very simple method but has proven to be very useful for a large number of situations.
• When we have a single input attribute (x) and we want to use linear regression, this is called simple linear regression.
• In simple linear regression we want to model our data as follows:
  y = B0 + B1 * x
• Here x and y are known, and B0 and B1 are coefficients that we need to estimate, which move the line around.
• Simple regression is great because, rather than having to search for values by trial and error or calculate them analytically using more advanced linear algebra, we can estimate them directly from our data.
OLS Regression:
Linear Regression using the Ordinary Least Squares approximation, based on the Gauss-Markov Theorem.
We can start off by estimating the value for B1 as:

B1 = [ Σ(i=1..n) (xi - mean(x)) * (yi - mean(y)) ] / [ Σ(i=1..n) (xi - mean(x))^2 ]

B0 = mean(y) - B1 * mean(x)
• If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear regression. The procedure for simple linear regression is different from, and simpler than, that for multiple linear regression.
Let us consider the following example for the equation y = 2*x + 3:

 x    y    xi-mean(x)   yi-mean(y)   (xi-mean(x))*(yi-mean(y))   (xi-mean(x))^2
-3   -3    -4.4         -8.8         38.72                       19.36
-1    1    -2.4         -4.8         11.52                        5.76
 2    7     0.6          1.2          0.72                        0.36
 4   11     2.6          5.2         13.52                        6.76
 5   13     3.6          7.2         25.92                       12.96
                                     Sum = 90.4                  Sum = 45.2

Mean(x) = 1.4 and Mean(y) = 5.8

B1 = [ Σ (xi - mean(x)) * (yi - mean(y)) ] / [ Σ (xi - mean(x))^2 ] = 90.4 / 45.2 = 2
B0 = mean(y) - B1 * mean(x) = 5.8 - 2 * 1.4 = 3

We can find from the above formulas that B1 = 2 and B0 = 3.
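A quick R check of this hand computation, in the style of the notes' other R examples:

> x <- c(-3, -1, 2, 4, 5)
> y <- c(-3, 1, 7, 11, 13)
> B1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
> B0 <- mean(y) - B1 * mean(x)
> B1; B0
[1] 2
[1] 3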
Example for Linear Regression using R:
Consider the following dataset:
x = {1, 2, 4, 3, 5} and y = {1, 3, 3, 2, 5}
We use R to apply linear regression to the above data.
> rm(list=ls())  # removes the list of variables in the current session of R
> x <- c(1,2,4,3,5)  # assigns values to x
> y <- c(1,3,3,2,5)  # assigns values to y
> x; y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off()  # to clear the existing plot/s
> plot(x, y, pch=16, col="red")
> relxy <- lm(y~x)
> relxy
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
        0.4          0.8
> abline(relxy, col="Blue")
> a <- data.frame(x=7)
> a
  x
1 7
> result <- predict(relxy, a)
> print(result)
  1
  6
> # Note: you can observe that
> 0.8*7 + 0.4
[1] 6  # the same value calculated using the line equation y = 0.8*x + 0.4
Simple linear regression is the simplest form of regression and the most studied.
Calculating B1 & B0 using correlations and standard deviations:

B1 = corr(x, y) * stdev(y) / stdev(x)

where corr(x, y) is the correlation between x & y and stdev() is the standard deviation of a variable. The same is calculated in R as follows:
> x <- c(1,2,4,3,5)
> y <- c(1,3,3,2,5)
> x; y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> B1 = cor(x,y)*sd(y)/sd(x)
> B1
[1] 0.8
> B0 = mean(y) - B1*mean(x)
> B0
[1] 0.4
Estimating Error: (RMSE: Root Mean Squared Error)
We can calculate the error for our predictions, called the Root Mean Squared Error or RMSE:

RMSE = sqrt( [ Σ(i=1..n) (pi - yi)^2 ] / n )

where pi is the predicted value, yi is the actual value, i is the index for a specific instance, and n is the number of predictions, because we must calculate the error across all predicted values.

Estimating the error for the model y = 0.8*x + 0.4 (with mean(x) = 3):

x   y (actual)   p (predicted)   p-y    (p-y)^2
1   1            1.2              0.2   0.04
2   3            2.0             -1.0   1.00
4   3            3.6              0.6   0.36
3   2            2.8              0.8   0.64
5   5            4.4             -0.6   0.36

s = sum of (p-y)^2 = 2.4
s/n = 2.4/5 = 0.48
RMSE = sqrt(s/n) = sqrt(0.48) = 0.692
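The same RMSE can be verified in R in a few lines (a small sketch in the style of the earlier examples):

> x <- c(1,2,4,3,5); y <- c(1,3,3,2,5)
> p <- 0.8*x + 0.4               # predictions from the fitted line
> rmse <- sqrt(mean((p - y)^2))  # root mean squared error
> rmse
[1] 0.6928203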
Properties and Assumptions of the OLS approximation:
1. Unbiasedness:
i. The bias of an estimator is defined as the difference between its expected value and the true value, i.e., e(y) = y_actual - y_predicted.
ii. If the bias is zero, then the estimator is unbiased.
iii. Unbiasedness is important only when it is combined with small variance.
2. Least Variance:
i. An estimator is best when it has the smallest or least variance.
ii. The least variance property is more important when it is combined with small bias.
3. Efficient estimator:
i. An estimator is said to be efficient when it fulfils both conditions:
ii. the estimator should be unbiased and have least variance.
4. Best Linear Unbiased Estimator (BLUE properties):
i. An estimator is said to be BLUE when it fulfils the above properties.
ii. An estimator is BLUE if it is an unbiased, least-variance, linear estimator.
5. Minimum Mean Square Error (MSE):
i. An estimator is said to be an MSE estimator if it has the smallest mean square error,
ii. i.e., the least difference between the estimated value and the true value.
6. Sufficient Estimator:
i. An estimator is sufficient if it utilizes all the information of a sample about the true parameter.
ii. It must use all the observations of the sample.
Assumptions of OLS Regression:
1. There is random sampling of observations.
2. The conditional mean should be zero.
3. There is homoscedasticity and no auto-correlation.
4. Error terms should be normally distributed (optional).
5. The OLS estimates are for the simple linear regression equation
   y = B0 + B1*x + µ   (µ -> error term)
6. The above equation is based on the following assumptions:
a. Randomness of µ
b. Mean of µ is zero
c. Variance of µ is constant
d. The variance of µ has a normal distribution
e. Errors µ of different observations are independent.
Homoscedasticity vs Heteroscedasticity:
Homoscedasticity means the error terms have the same (constant) variance across all observations; heteroscedasticity means the error variance changes across observations, violating an OLS assumption.
Model Building Life Cycle in Data Analytics:
When we come across a business analytical problem, we often proceed towards execution without acknowledging the stumbling blocks, and try to implement and predict outcomes before realizing the misfortunes. The problem-solving steps involved make up the data science model-building life cycle.
Let's understand every model-building step in depth.
The data science model-building life cycle includes some important steps to follow. The following are the steps to follow to build a data model:
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment
1. Problem Definition
• The first step in constructing a model is to understand the industrial problem in a more comprehensive way. To identify the purpose of the problem and the prediction target, we must define the project objectives appropriately.
• Therefore, to proceed with an analytical approach, we have to recognize the obstacles first. Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation
• Hypothesis generation is the guessing approach through which we derive some essential data parameters that have a significant correlation with the prediction target.
• Your hypothesis research must be in-depth, taking the perspective of all stakeholders into account. We search for every suitable factor that can influence the outcome.
• Hypothesis generation focuses on what you can create rather than what is available in the dataset.
3. Data Collection
• Data collection is gathering data from relevant sources regarding the analytical problem; we then extract meaningful insights from the data for prediction.
The data gathered must have:
• Proficiency in answering hypothesis questions.
• Capacity to elaborate on every data parameter.
• Effectiveness to justify your research.
• Competency to predict outcomes accurately.
4. Data Exploration/Transformation
• The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary features, null values, unanticipated small values, or immense values. So, before applying any algorithmic model to data, we have to explore it first.
• By inspecting the data, we get to understand the explicit and hidden trends in the data. We find the relation between data features and the target variable.
• Usually, a data scientist invests 60-70% of project time dealing with data exploration only.
• There are several sub-steps involved in data exploration:
o Feature Identification:
▪ You need to analyze which data features are available and which ones are not.
▪ Identify independent and target variables.
▪ Identify data types and categories of these variables.
o Univariate Analysis:
▪ We inspect each variable one by one. This kind of analysis depends on whether the variable type is categorical or continuous.
• Continuous variable: We mainly look for statistical trends like mean, median, standard deviation, skewness, and many more in the dataset.
• Categorical variable: We use a frequency table to understand the spread of data for each category. We can measure the counts and frequency of occurrence of values.
o Multi-variate Analysis:
▪ Bi-variate (and multi-variate) analysis helps to discover the relation between two or more variables.
▪ We can find the correlation in the case of continuous variables, and in the case of categorical variables we look for association and dissociation between them.
o Filling Null Values:
▪ Usually, the dataset contains null values, which lower the potential of the model. With a continuous variable, we fill these null values using the mean or median of that specific column. For the null values present in a categorical column, we replace them with the most frequently occurring categorical value. Remember, don't delete those rows, because you may lose information.
5. Predictive Modeling
• Predictive modeling is a mathematical approach to create a statistical model to forecast future behavior based on input test data.
Steps involved in predictive modeling:
• Algorithm Selection:
o When we have a structured dataset and we want to estimate a continuous or categorical outcome, we use supervised machine learning methodologies like regression and classification techniques. When we have unstructured data and want to predict the clusters of items to which a particular input test sample belongs, we use unsupervised algorithms. In practice, a data scientist applies multiple algorithms to get a more accurate model.
• Train Model:
o After assigning the algorithm and getting the data handy, we train our model using the input data, applying the preferred algorithm. It is an action to determine the correspondence between independent variables and the prediction targets.
• Model Prediction:
o We make predictions by giving the input test data to the trained model. We measure the accuracy by using a cross-validation strategy or ROC curve, which performs well to derive model output for test data.
6. Model Deployment
• There is nothing better than deploying the model in a real-time environment. It helps us to gain analytical insights into the decision-making procedure. You constantly need to update the model with additional features for customer satisfaction.
• To predict business decisions, plan market strategies, and create personalized customer interests, we integrate the machine learning model into the existing production domain.
• When you go through the Amazon website, you notice product recommendations based entirely on your interests, and you can see the increase in customer involvement utilizing these services. That is how a deployed model changes the mindset of the customer and convinces him to purchase the product.
Key Takeaways
SUMMARY OF THE DATA MODEL LIFE CYCLE:
• Understand the purpose of the business analytical problem.
• Generate hypotheses before looking at the data.
• Collect reliable data from well-known resources.
• Invest most of the time in data exploration to extract meaningful insights from the data.
• Choose the signature algorithm to train the model, and use test data to evaluate it.
• Deploy the model into the production environment so it will be available to users, and strategize to make business decisions effectively.
Logistic Regression:
Model Theory, Model Fit Statistics, Model Construction
Introduction:
• Logistic regression is one of the most popular machine learning algorithms, which comes under the supervised learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
• The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something, such as whether or not cells are cancerous, or whether a mouse is obese based on its weight, etc.
• Logistic regression uses the concept of predictive modeling as regression; therefore, it is called logistic regression, but it is used to classify samples; therefore, it falls under the classification algorithms.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Types of Logistic Regression:
On the basis of the categories, logistic regression can be classified into three types:
• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
Definition: Multi-collinearity:
• Multicollinearity is a statistical phenomenon in which multiple independent variables show high correlation with each other; they are too inter-related.
• Multicollinearity is also called collinearity. It is an undesired situation for any statistical regression model, since it diminishes the reliability of the model itself.
• If two or more independent variables are too correlated, the results obtained from the regression will be disturbed, because the independent variables are actually dependent on each other.
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variables should not have multi-collinearity.
Logistic Regression Equation:
• The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below.
• Logistic regression uses a more complex cost function; this cost function can be defined as the sigmoid function (also known as the logistic function) instead of a linear function.
• The hypothesis of logistic regression tends to limit the cost function between 0 and 1. Therefore, linear functions fail to represent it, as they can have a value greater than 1 or less than 0, which is not possible as per the hypothesis of logistic regression:
0 ≤ h(x) ≤ 1   --- Logistic Regression Hypothesis Expectation
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• The sigmoid function maps any real value into another value within a range of 0 and 1, and so forms an S-shaped curve.
• The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it forms a curve like the "S" form.
• The below image shows the logistic function:
Fig: Sigmoid Function Graph
The sigmoid function can be interpreted as a probability indicating Class-1 or Class-0, so the regression model makes the following prediction:

z = sigmoid(y) = σ(y) = 1 / (1 + e^(-y))
Hypothesis Representation
• When using linear regression, we used a formula for the line equation:
y = b0 + b1x1 + b2x2 + ... + bnxn
• In the above equation x1, x2, ..., xn are the predictor variables, y is the response variable, and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
• For logistic regression, we need the maximum-likelihood hypothesis h(y).
• Apply the sigmoid function on y:

z = σ(y) = σ(b0 + b1x1 + b2x2 + ... + bnxn) = 1 / (1 + e^-(b0 + b1x1 + b2x2 + ... + bnxn))
Example for Sigmoid Function in R:
> # Example for Sigmoid Function
> y <- c(-10:10); y
 [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9  10
> z <- 1/(1+exp(-y)); z
 [1] 4.539787e-05 1.233946e-04 3.353501e-04 9.110512e-04 2.472623e-03 6.692851e-03 1.798621e-02 4.742587e-02
 [9] 1.192029e-01 2.689414e-01 5.000000e-01 7.310586e-01 8.807971e-01 9.525741e-01 9.820138e-01 9.933071e-01
[17] 9.975274e-01 9.990889e-01 9.996646e-01 9.998766e-01 9.999546e-01
> plot(y, z)
> rm(list=ls())
> attach(mtcars)  # attaching a dataset into the R environment
> input <- mtcars[,c("mpg","disp","hp","wt")]
> head(input)
                   mpg disp  hp    wt
Mazda RX4         21.0  160 110 2.620
Mazda RX4 Wag     21.0  160 110 2.875
Datsun 710        22.8  108  93 2.320
Hornet 4 Drive    21.4  258 110 3.215
Hornet Sportabout 18.7  360 175 3.440
Valiant           18.1  225 105 3.460
> # model <- lm(mpg~disp+hp+wt); model  # the same model could be fitted with lm()
> model <- glm(mpg~disp+hp+wt); model
Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891
Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
Null Deviance:     1126
Residual Deviance: 195    AIC: 158.6
> newx <- data.frame(disp=150, hp=150, wt=4)  # new input for prediction
> predict(model, newx)
       1
17.08791
> 37.105505 + (-0.000937)*150 + (-0.031157)*150 + (-3.800891)*4  # checking the prediction for newx by hand
[1] 17.08784
> y <- input[,c("mpg")]
> y
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
> z = 1/(1+exp(-y)); z
 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999994 1.0000000
 [9] 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 0.9999696 0.9999696
[17] 0.9999996 1.0000000 1.0000000 1.0000000 1.0000000 0.9999998 0.9999997 0.9999983
[25] 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 1.0000000
> plot(y, z)
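Note that glm() above was fitted without a family argument, so it defaults to gaussian and is effectively linear regression. For a genuine logistic regression the outcome must be binary and we pass family = binomial. The sketch below is an addition for illustration, using the binary am (transmission) column of mtcars:

> # Sketch: logistic regression proper, with a binary outcome
> model2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)
> newcar <- data.frame(wt = 2.5, hp = 110)
> predict(model2, newcar, type = "response")  # predicted probability that am = 1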
Confusion Matrix (or) Error Matrix (or) Contingency Table:
What is a Confusion Matrix?
A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making. It is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix).
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:
Let's decipher the matrix:
• The target variable has two values: Positive or Negative
• The columns represent the actual values of the target variable
• The rows represent the predicted values of the target variable
• True Positive
• True Negative
• False Positive - Type 1 Error
• False Negative - Type 2 Error
Why do we need a Confusion Matrix?
• Precision vs Recall
• F1-score
Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
• The predicted value matches the actual value
• The actual value was positive and the model predicted a positive value
True Negative (TN)
• The predicted value matches the actual value
• The actual value was negative and the model predicted a negative value
False Positive (FP) - Type 1 error
• The predicted value was falsely predicted
• The actual value was negative but the model predicted a positive value
• Also known as the Type 1 error
False Negative (FN) - Type 2 error
• The predicted value was falsely predicted
• The actual value was positive but the model predicted a negative value
• Also known as the Type 2 error
To evaluate the performance of a model, we have the performance metrics called Accuracy, Precision, Recall & F1-Score.
Accuracy:
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations.
• Accuracy is a great measure for understanding whether the model is good.
• Accuracy is dependable only when you have symmetric datasets where the counts of false positives and false negatives are almost the same.

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It tells us how many of the predicted positive cases actually turned out to be positive.

Precision = TP / (TP + FP)

• Precision is a useful metric in cases where a False Positive is a higher concern than a False Negative.
• Precision is important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
Recall: (Sensitivity)
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.

Recall = TP / (TP + FN)

• Recall is a useful metric in cases where a False Negative trumps a False Positive.
• Recall is important in medical cases, where it doesn't matter whether we raise a false alarm, but the actual positive cases should not go undetected!
F1-Score:
The F1-score is the harmonic mean of Precision and Recall. It gives a combined idea about these two metrics. It is maximum when Precision is equal to Recall.

F1 Score = 2 / (1/Recall + 1/Precision) = 2 * (Precision * Recall) / (Precision + Recall)

Therefore, this score takes both false positives and false negatives into account.
• F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
• Accuracy works best if false positives and false negatives have similar cost.
• If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
• But there is a catch here: the interpretability of the F1-score is poor, meaning that we don't know what our classifier is maximizing - precision or recall? So we use it in combination with other evaluation metrics, which gives us a complete picture of the result.
Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the below confusion matrix:
• True Positive (TP) = 560
- Means 560 positive class data points were correctly classified by the model.
• True Negative (TN) = 330
- Means 330 negative class data points were correctly classified by the model.
• False Positive (FP) = 60
- Means 60 negative class data points were incorrectly classified as belonging to the positive class by the model.
• False Negative (FN) = 50
- Means 50 positive class data points were incorrectly classified as belonging to the negative class by the model.
This turned out to be a pretty decent classifier for our dataset, considering the relatively larger number of true positive and true negative values.
Precisely, we have the outcomes represented in the Confusion Matrix as:
TP = 560, TN = 330, FP = 60, FN = 50
Accuracy:
The accuracy for our model turns out to be:
Accuracy = (TP + TN) / (TP + FP + TN + FN) = (560 + 330) / (560 + 60 + 330 + 50) = 890 / 1000 = 0.89
Hence the accuracy is 89%... not bad!
Precision:
Precision tells us how many of the predicted positive cases actually turned out to be positive; this would determine whether our model is reliable or not.
Precision = TP / (TP + FP) = 560 / (560 + 60) = 0.903
Recall:
Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
Recall = TP / (TP + FN) = 560 / (560 + 50) = 0.918
We can easily calculate Precision and Recall for our model by plugging the values into the above equations.
F1-Score:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
=> F1 Score = 2 * (0.903 * 0.918) / (0.903 + 0.918) = 1.6579 / 1.821 = 0.910
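These metrics are easy to verify in R (a small sketch added for illustration):

> TP <- 560; TN <- 330; FP <- 60; FN <- 50
> accuracy  <- (TP + TN) / (TP + FP + TN + FN); accuracy
[1] 0.89
> precision <- TP / (TP + FP); precision
[1] 0.9032258
> recall    <- TP / (TP + FN); recall
[1] 0.9180328
> f1 <- 2 * precision * recall / (precision + recall); f1
[1] 0.9105691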
AUC (Area Under Curve) - ROC (Receiver Operating Characteristics) Curves:
Performance measurement is an essential task in data modelling evaluation. The AUC-ROC curve is one of the most important evaluation metrics for checking any classification model's performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics). So when it comes to a classification problem, we can count on an AUC-ROC curve.
When we need to check or visualize the performance of a (multi-class) classification problem, we use the AUC (Area Under The Curve) ROC (Receiver Operating Characteristics) curve.
What is the AUC-ROC Curve?
The curve is defined in terms of:
• TPR (True Positive Rate) / Recall / Sensitivity
• Specificity
• FPR (False Positive Rate)
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
• True Positive Rate
• False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.
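An ROC curve and its AUC can be drawn in R with the pROC package (a hedged sketch added for illustration; the package and the reuse of the mtcars logistic model are assumptions, not part of the original notes):

> # Sketch: ROC/AUC in R using the 'pROC' package (assumed installed)
> library(pROC)
> model2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)
> probs <- predict(model2, type = "response")  # fitted probabilities
> r <- roc(mtcars$am, probs)                   # build the ROC object
> plot(r)                                      # draw the ROC curve
> auc(r)                                       # area under the curve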
Analytics applications to various Business Domains:
Application of Modelling in Business:
• Applications of Data Modelling can be termed as Business Analytics.
• Business analytics involves the collating, sorting, processing, and studying of business-related data using statistical models and iterative methodologies. The goal of BA is to narrow down which datasets are useful and which can increase revenue, productivity, and efficiency.
• Business analytics (BA) is the combination of skills, technologies, and practices used to examine an organization's data and performance as a way to gain insights and make data-driven decisions in the future using statistical analysis.
4. Human Resources
... employment, etc. By working with this information, business analysts help HR by forecasting the best fits between the company and candidates.
5. Manufacturing
Business analysts work with data to help stakeholders understand the things that affect operations and the bottom line. Identifying things like equipment downtime, inventory levels, and maintenance costs helps companies streamline inventory management, risks, and supply-chain management to create maximum efficiency.
6. Marketing
Business analysts help answer these questions and many more by measuring marketing and advertising metrics, identifying consumer behaviour and the target audience, and analyzing market trends.
*** End of Unit-3 ***
Add-ons for Unit-3
TO BE DISCUSSED:
Receiver Operating Characteristics: ROC & AUC
Derivation for Logistic Regression:
The logistic regression model assumes that the log-odds of an observation y can be expressed as a linear function of the K input variables x:

log( P(y=1|x) / (1 - P(y=1|x)) ) = z = b0 + b1x1 + ... + bKxK

Let's take the exponent of both sides of the logit equation and solve for the probability; this gives the sigmoid:

P(z) = 1 / (1 + e^(-z))

The right-hand side of the top equation is the sigmoid of z, which maps the real line to the interval (0, 1), and is approximately linear near the origin. A useful fact about P(z) is that the derivative P'(z) = P(z) (1 - P(z)). Here's the derivation:

P'(z) = d/dz [ (1 + e^(-z))^(-1) ]
      = e^(-z) / (1 + e^(-z))^2
      = P(z) * [ e^(-z) / (1 + e^(-z)) ]
      = P(z) (1 - P(z))

Later, we will want to take the gradient of P with respect to the set of coefficients b, rather than z. In that case, ∇P(z) = P(z) (1 - P(z)) ∇z, where ∇ is the gradient taken with respect to b.
4. UNIT 4
Supervised and Unsupervised Learning
Supervised Learning:
• Supervised learning is a machine learning method in which models are trained using labeled data. In supervised learning, models need to find the mapping function to map the input variable (X) to the output variable (Y).
• We find a relation between x & y such that y = f(x).
• Supervised learning needs supervision to train the model, similar to how a student learns things in the presence of a teacher. Supervised learning can be used for two types of problems: Classification and Regression.
• Example: Suppose we have images of different types of fruits. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the image in supervised learning, we will give the input data as well as the output for it, which means we will train the model on the shape, size, color, and taste of each fruit. Once the training is completed, we will test the model by giving it a new set of fruit. The model will identify the fruit and predict the output using a suitable algorithm.
Unsupervised Machine Learning:
• Unsupervised learning is another machine learning method in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns from the input data. Unsupervised learning does not need any supervision; instead, it finds patterns from the data on its own.
• Unsupervised learning can be used for two types of problems: Clustering and Association.
• Example: To understand unsupervised learning, we will use the example given above. Unlike supervised learning, here we will not provide any supervision to the model. We will just provide the input dataset to the model and allow the model to find the patterns from the data. With the help of a suitable algorithm, the model will train itself and divide the fruits into different groups according to the most similar features between them.
The main differences between Supervised and Unsupervised learning are given below:
Algorithms used:
• Supervised Learning: Classification, Decision Trees, Random Forest, Bayesian Logic, etc.
• Unsupervised Learning: Neural Networks, Principal Component Analysis, Independent Component Analysis, Apriori algorithms, etc.
Segmentation
• Segmentation refers to the act of segmenting data according to your company's needs in order to refine your analyses based on a defined context. It is a technique of splitting customers into separate groups depending on their attributes or behavior.
• The purpose of segmentation is to better understand your customers (visitors), and to obtain actionable data in order to improve your website or mobile app. In concrete terms, a segment enables you to filter your analyses based on certain elements (single or combined).
• Segmentation can be done on elements related to a visit, as well as on elements related to multiple visits during a studied period.
Steps:
• Define purpose - already mentioned in the statement above.
• Identify critical parameters - Some of the variables which come to mind are skill, motivation, vintage, department, education, etc. Let us say that, based on past experience, we know that skill and motivation are the most important parameters. Also, for the sake of simplicity, we just select 2 variables. Taking additional variables will increase the complexity, but can be done if it adds value.
• Granularity - Let us say we are able to classify both skill and motivation into High and Low using various techniques.
There are two broad sets of methodologies for segmentation:
• Objective (supervised) segmentation
• Non-objective (unsupervised) segmentation
Objective Segmentation
• Segmentation to identify the type of customers who would respond to a particular offer.
• Segmentation to identify high spenders among customers who will use the e-commerce channel for festive shopping.
• Segmentation to identify customers who will default on their credit obligation for a loan or credit card.
Non-Objective Segmentation
https://www.yieldify.com/blog/types-of-market-segmentation/
• Segmentation of the customer base to understand the specific profiles which exist within the customer base, so that multiple marketing actions can be personalized for each segment.
• Segmentation of geographies on the basis of affluence and lifestyle of people living in each geography, so that sales and distribution strategies can be formulated accordingly.
• Hence, it is critical that the segments created on the basis of an objective segmentation methodology are different with respect to the stated objective (e.g. response to an offer).
• However, in the case of a non-objective methodology, the segments are different with respect to the generic profile of observations belonging to each segment, but not with regard to any specific outcome of interest.
• The most common techniques for building non-objective segmentation are cluster analysis, k-nearest-neighbour techniques, etc.; a small clustering sketch follows.
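As an illustration of non-objective segmentation, base R's kmeans() can cluster individuals on two attributes. This sketch is an addition to the notes; the skill and motivation scores are made-up numbers:

> # Sketch: k-means clustering on two hypothetical attributes
> set.seed(42)
> skill      <- c(8, 7, 9, 2, 3, 1, 8, 2)
> motivation <- c(9, 8, 7, 3, 2, 2, 9, 1)
> seg <- kmeans(scale(cbind(skill, motivation)), centers = 2)
> seg$cluster  # segment membership for each individual (cluster numbering is arbitrary)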
Regression vs Segmentation
• Regression analysis focuses on finding a relationship between a dependent variable and one or more independent variables.
• It predicts the value of a dependent variable based on the value of at least one independent variable.
• It explains the impact of changes in an independent variable on the dependent variable.
• We use linear or logistic regression techniques for developing accurate models for predicting an outcome of interest.
• Often, we create separate models for separate segments.
• Segmentation methods such as CHAID or CRT are used to judge their effectiveness.
Decision Tree Classification Algorithm:
There are two main types of Decision Trees:
1. Classification trees (Yes/No types)
What we've seen above is an example of a classification tree, where the outcome is a variable like 'fit' or 'unfit'. Here the decision variable is categorical.
2. Regression trees (continuous data types)
Here the decision or the outcome variable is continuous, e.g. a number like 123.
Decision Tree Terminologies
Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Each internal node of the tree tests an attribute, and each branch descending from that node corresponds to one of the possible values for this attribute.
• An instance is classified by starting at the root node of the decision tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute. This process is then repeated at the node on this branch, and so on, until a leaf node is reached.
Appropriate Problems for Decision Tree Learning
Decision tree learning is generally best suited to problems with the following characteristics:
• Instances are represented by attribute-value pairs.
o There is a finite list of attributes (e.g. hair colour) and each instance stores a value for that attribute (e.g. blonde).
o When each attribute has a small number of distinct values (e.g. blonde, brown, red) it is easier for the decision tree to reach a useful solution.
o The algorithm can be extended to handle real-valued attributes (e.g. a floating-point temperature).
• The target function has discrete output values.
o A decision tree classifies each example as one of the output values.
▪ The simplest case exists when there are only two possible classes (Boolean classification).
▪ However, it is easy to extend the decision tree to produce a target function with more than two possible output values.
o Although it is less common, the algorithm can also be extended to produce a target function with real-valued outputs.
• Disjunctive descriptions may be required.
o Decision trees naturally represent disjunctive expressions.
• The training data may contain errors.
o Errors in the classification of examples, or in the attribute values describing those examples, are handled well by decision trees, making them a robust learning method.
• The training data may contain missing attribute values.
o Decision tree methods can be used even when some training examples have unknown values (e.g., humidity is known for only a fraction of the examples).
After a decision tree learns classification rules, it can also be re-represented as a set of if-then rules in order to improve readability.
How does the Decision Tree algorithm work?
The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree. The algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
• Step-6: Continue this process until a stage is reached where the nodes cannot be classified any further; these final nodes are called leaf nodes.
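As a minimal sketch of this process (an illustration added to these notes, not a tool prescribed by them), scikit-learn's DecisionTreeClassifier performs the same recursive splitting internally, and export_text re-represents the learned tree as readable if-then rules:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load a small sample dataset: 150 instances, 4 attributes, 3 classes
    iris = load_iris()

    # Fit a decision tree; max_depth limits how deep the recursive splitting goes
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(iris.data, iris.target)

    # Re-represent the learned tree as a set of if-then rules for readability
    print(export_text(tree, feature_names=list(iris.feature_names)))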
Entropy:
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an example of an action that provides information that is random.
Plotting entropy against probability for a binary event shows that the entropy H(X) is zero when the probability is either 0 or 1. Entropy is maximum when the probability is 0.5, because that reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.
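The standard definition is H(X) = − Σ pi · log2(pi), summed over the possible classes. The short Python check below is an illustration added to these notes:

    import math

    def entropy(probs):
        """Shannon entropy in bits: H = -sum(p * log2(p))."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin        -> 1.0 (maximum)
    print(entropy([1.0]))        # certain outcome  -> 0.0
    print(entropy([0.9, 0.1]))   # biased coin      -> about 0.469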
Information Gain
Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding the attribute that returns the highest information gain and the smallest entropy.
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy greater than zero needs further splitting.
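For a set S and attribute A, information gain is defined as IG(S, A) = H(S) − Σv (|Sv| / |S|) · H(Sv), where Sv is the subset of S for which A takes the value v. A minimal self-contained sketch (added here as an illustration, with a made-up toy dataset):

    import math
    from collections import Counter

    def entropy_of_labels(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        """IG(S, A) = H(S) - sum over v of |S_v|/|S| * H(S_v)."""
        n = len(labels)
        remainder = 0.0
        for v in set(attribute_values):
            subset = [l for l, a in zip(labels, attribute_values) if a == v]
            remainder += (len(subset) / n) * entropy_of_labels(subset)
        return entropy_of_labels(labels) - remainder

    # Toy example: how well does "outlook" separate play / no-play?
    labels  = ["yes", "yes", "no", "no", "yes", "no"]
    outlook = ["sun", "sun", "rain", "rain", "sun", "sun"]
    print(information_gain(labels, outlook))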
Advantages of Decision Trees:
• Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
Multiple Decision Trees:
Classification & Regression Trees:
✓ Classification and regression trees is a term used to describe decision tree algorithms that are used for classification and regression learning tasks.
✓ The Classification and Regression Tree (CART) methodology was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone.
Classification Trees:
✓ A classification tree is an algorithm where the target variable is fixed or categorical. The algorithm is then used to identify the class within which a target variable would most likely fall.
✓ An example of a classification-type problem would be determining who will or will not subscribe to a digital platform, or who will or will not graduate from high school.
✓ These are examples of simple binary classifications where the categorical dependent variable can assume only one of two, mutually exclusive values.
Regression Trees
✓ A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value.
✓ As an example of a regression-type problem, you may want to predict the selling price of a residential house, which is a continuous dependent variable.
✓ This will depend on both continuous factors like square footage as well as categorical factors.
Difference Between Classification and Regression Trees
✓ Classification trees are used when the dataset needs to be split into classes that belong to the response variable. In many cases, the classes are simply Yes or No.
✓ In other words, there are just two classes and they are mutually exclusive. In some cases, there may be more than two classes, in which case a variant of the classification tree algorithm is used.
✓ Regression trees, on the other hand, are used when the response variable is continuous. For instance, if the response variable is something like the price of a property or the temperature of the day, a regression tree is used.
✓ In other words, regression trees are used for prediction-type problems while classification trees are used for classification-type problems, as the sketch below illustrates.
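As an added illustration (not part of the original notes), scikit-learn exposes both tree types behind the same interface; the synthetic data here is made up purely for demonstration:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 2))            # e.g. square footage, age

    # Classification tree: categorical target (subscribes: yes=1 / no=0)
    y_class = (X[:, 0] + X[:, 1] > 10).astype(int)
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
    print(clf.predict([[8.0, 5.0]]))                 # -> predicted class label

    # Regression tree: continuous target (e.g. selling price)
    y_price = 50_000 + 20_000 * X[:, 0] + rng.normal(0, 5_000, size=200)
    reg = DecisionTreeRegressor(max_depth=3).fit(X, y_price)
    print(reg.predict([[8.0, 5.0]]))                 # -> predicted price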
1. CART (Classification And Regression Tree)
✓ The CART algorithm was introduced by Breiman et al. (1984). A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. The CART growing method attempts to maximize within-node homogeneity.
✓ The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity. For example, a terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is "pure." For categorical (nominal, ordinal) dependent variables the common measure of impurity is Gini, which is based on squared probabilities of membership for each category. Splits are found that maximize the homogeneity of child nodes with respect to the value of the dependent variable.
Decision tree pruning:
Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant for classifying instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy through the reduction of overfitting.
There are many techniques for tree pruning that differ in the measurement that is used to optimize performance.
Pruning Techniques:
Pruning processes can be divided into two types: pre-pruning and post-pruning.
• Pre-pruning procedures prevent a complete induction of the training set by applying a stop() criterion in the induction algorithm (e.g. maximum tree depth, or information gain(Attr) > minGain). They are considered more efficient because they do not induce an entire set; rather, trees remain small from the start.
• Post-pruning (or just pruning) is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity.
The procedures are differentiated on the basis of their approach in the tree: the top-down approach and the bottom-up approach.
Bottom-up pruning approach:
• These procedures start at the last node in the tree (the lowest point).
• Proceeding recursively upwards, they determine the relevance of each individual node.
• If the relevance for the classification is not given, the node is dropped or replaced by a leaf.
• The advantage is that no relevant sub-trees can be lost with this method.
• These methods include Reduced Error Pruning (REP), Minimum Cost Complexity Pruning (MCCP), and Minimum Error Pruning (MEP).
Top-down pruning approach:
• In contrast to the bottom-up method, this method starts at the root of the tree. Following the structure downwards, a relevance check is carried out which decides whether a node is relevant for the classification of all n items or not.
• By pruning the tree at an inner node, it can happen that an entire sub-tree (regardless of its relevance) is dropped.
• One of these representatives is Pessimistic Error Pruning (PEP), which brings quite good results with unseen items.
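As an added sketch (not prescribed by these notes), scikit-learn implements post-pruning via minimal cost-complexity pruning through the ccp_alpha parameter; larger values of alpha prune more aggressively:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Unpruned tree (grown to purity) vs. a cost-complexity-pruned tree
    full   = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

    print("leaves:", full.get_n_leaves(), "->", pruned.get_n_leaves())
    print("test accuracy:", full.score(X_te, y_te), "->", pruned.score(X_te, y_te))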
2. CHAID DECISION TREE (Chi-square Automatic Interaction Detector)
• As is evident from the name of this algorithm, it is based on the chi-square statistic.
• The Chi-square Automatic Interaction Detector (CHAID) was a technique created by Gordon V. Kass in 1980.
• CHAID is a tool used to discover the relationship between variables.
• A chi-square test yields a probability value as a result, lying anywhere between 0 and 1.
o A value closer to 0 indicates that there is a significant difference between the two classes being compared.
o Similarly, a value closer to 1 indicates that there is no significant difference between the two classes.
• In CHAID analysis, nominal, ordinal, and continuous data can be used, where continuous predictors are split into categories with approximately equal numbers of observations.
• CHAID creates all possible cross tabulations for each categorical predictor until the best outcome is achieved and no further splitting can be performed.
• CHAID analysis splits the target into two or more categories that are called the initial, or parent, nodes; those nodes are then split using statistical algorithms into child nodes.
• Unlike regression analysis, the CHAID technique does not require the data to be normally distributed.
• The nature of the CHAID algorithm is to create wide trees.
Variable types used in the CHAID algorithm:
• Dependent variable: continuous OR categorical.
• Independent variables: categorical ONLY (can have more than 2 categories).
• Thus, if there are continuous predictor variables, we need to transform them into categorical variables before they can be supplied to the CHAID algorithm.
• Statistical tests used to determine the next best split:
o Continuous dependent variable: F-test (regression problems).
o Categorical dependent variable: chi-square (classification problems).
How does CHAID handle different types of variables?
• Nominal variable: automatically groups the data as per point 2 above.
• Ordinal variable: automatically groups the data as per point 2 above.
• Continuous variable: converted into segments/deciles before performing the grouping in point 2 above.
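As an added illustration of the continuous-to-categorical step (the column names and data here are made up), pandas can bin a continuous predictor into roughly equal-sized segments, and a chi-square test can then score the banded predictor against the target:

    import pandas as pd
    from scipy.stats import chi2_contingency

    # Made-up data: a continuous predictor and a binary target
    df = pd.DataFrame({
        "income":   [21, 35, 48, 52, 67, 71, 80, 95, 110, 150] * 3,
        "response": ["no", "yes"] * 15,
    })

    # Convert the continuous predictor into 5 roughly equal-sized bins
    df["income_band"] = pd.qcut(df["income"], q=5)

    # Chi-square test of independence between the banded predictor and target
    table = pd.crosstab(df["income_band"], df["response"])
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)   # smaller p-value -> stronger candidate split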
GINI Index Impurity Measure:
• Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability fi of each item being chosen times the probability (1 − fi) of a mistake in categorizing that item.
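In symbols, Gini = Σ fi · (1 − fi) = 1 − Σ fi². A short check (an illustration added to these notes):

    def gini_impurity(probs):
        """Gini = sum of f_i * (1 - f_i) = 1 - sum of f_i squared."""
        return 1.0 - sum(f * f for f in probs)

    print(gini_impurity([1.0, 0.0]))   # pure node    -> 0.0
    print(gini_impurity([0.5, 0.5]))   # 50/50 split  -> 0.5 (maximum for 2 classes)
    print(gini_impurity([0.7, 0.3]))   # mixed node   -> 0.42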
Overfitting and Underfitting
• Let's clearly understand overfitting, underfitting and perfectly fit models. In the usual three-panel illustration of these cases (figure omitted here), the leftmost fitted line does not cover all the data points, so we can say that the model is under-fitted. In this case, the model has failed to generalize the pattern to the new dataset, leading to poor performance on testing. An under-fitted model is easy to spot, as it gives very high errors on both training and testing data. This happens when the dataset is not clean and contains noise, when the model has high bias, or when the size of the training data is not enough.
• The best-fit model corresponds to the middle panel, where both training and testing (validation) loss are at a minimum; equivalently, training and testing accuracy should be near each other and high in value.
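A minimal sketch (an added illustration on made-up data) of how the train/test error gap exposes under- and over-fitting as model complexity grows; the high-degree fit typically drives training error down while test error rises:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=60)
    y = np.sin(x) + rng.normal(0, 0.2, size=60)       # noisy underlying pattern
    x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

    for degree in (1, 3, 15):                         # underfit, good fit, overfit
        coeffs = np.polyfit(x_tr, y_tr, degree)
        tr_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
        te_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
        print(f"degree {degree:2d}: train MSE {tr_err:.3f}, test MSE {te_err:.3f}")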
Time Series Methods:
• Time series forecasting focuses on analyzing data changes across equally spaced time intervals.
• Time series analysis is used in a wide variety of domains, ranging from econometrics to geology and earthquake prediction; it is also used in almost all applied sciences and engineering.
• Time series analysis finds hidden patterns and helps obtain useful insights from the time series data.
• Time series data is data that is observed at different points in time.
• Time series analysis is useful in predicting future values or detecting anomalies from the data. Such analysis typically requires many data points to be present in the dataset to ensure consistency and reliability.
• The different types of models and analyses that can be created through time series analysis are:
o Classification: identify and assign categories to the data.
o Curve fitting: plot the data along a curve and study the relationships of variables present within the data.
o Descriptive analysis: help identify certain patterns in time-series data such as trends, cycles, or seasonal variation.
o Explanative analysis: understand the data and its relationships, the dependent features, and cause and effect and their trade-offs.
o Exploratory analysis: describe and focus on the main characteristics of the time series data, usually in a visual format.
o Forecasting: predict future data based on historical trends, using the historical data as a model for future data and predicting scenarios that could happen along with the future plot points.
o Intervention analysis: the study of how an event can change the data.
o Segmentation: splitting the data into segments to discover the underlying properties from the source information.
Components of Time Series:
• Long-term trend – the smooth long-term direction of a time series, where the data can increase or decrease in some pattern.
• Seasonal variation – patterns of change in a time series within a year which tend to repeat every year.
ARIMA & ARMA:
What is ARIMA?
• In time series analysis, ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average; an ARIMA model is a generalization of an autoregressive moving average (ARMA) model. These models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
• They are applied in some cases where the data show evidence of non-stationarity.
• A popular and very widely used statistical method for time series forecasting and analysis is the ARIMA model.
• It is a class of models that captures a spectrum of different standard temporal structures present in time series data. By implementing an ARIMA model, you can forecast and analyze a time series using past values, such as predicting future prices based on historical earnings.
A nonseasonal ARIMA model is classified as an "ARIMA(p, d, q)" model, where:
• p is the number of autoregressive terms,
• d is the number of nonseasonal differences needed for stationarity, and
• q is the number of lagged forecast errors in the prediction equation.
If d = 0: yt = Yt
If d = 1: yt = Yt − Yt-1
If d = 2: yt = (Yt − Yt-1) − (Yt-1 − Yt-2) = Yt − 2Yt-1 + Yt-2
Note that the second difference of Y (the d = 2 case) is not the difference from 2 periods ago. Rather, it is the first difference of the first difference, which is the discrete analog of a second derivative, i.e., the local acceleration of the series rather than its local trend.
In terms of y, the general forecasting equation is:
ŷt = μ + ϕ1·yt-1 + … + ϕp·yt-p − θ1·et-1 − … − θq·et-q
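A minimal forecasting sketch (an added illustration, assuming the statsmodels package is available and using a synthetic series in place of real data):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic non-stationary series (stand-in for e.g. monthly sales)
    rng = np.random.default_rng(0)
    series = np.cumsum(rng.normal(0.5, 1.0, size=120))

    # ARIMA(p=1, d=1, q=1): one AR term, one difference, one MA term
    model = ARIMA(series, order=(1, 1, 1))
    fitted = model.fit()
    print(fitted.forecast(steps=6))   # predict the next 6 points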
Measure of Forecast Accuracy:
Forecast accuracy can be defined as the deviation of the forecast or prediction from the actual results:
Error = Actual demand − Forecast, or et = At − Ft
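Common summary measures built on this error, shown in the added sketch below with made-up numbers, are the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE):

    import numpy as np

    actual   = np.array([112, 130, 95, 160, 125], dtype=float)
    forecast = np.array([105, 128, 102, 150, 131], dtype=float)
    errors = actual - forecast                        # et = At - Ft

    mae  = np.mean(np.abs(errors))                    # Mean Absolute Error
    rmse = np.sqrt(np.mean(errors ** 2))              # Root Mean Squared Error
    mape = np.mean(np.abs(errors / actual)) * 100     # Mean Abs. Percentage Error
    print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.1f}%")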
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that:
• Extracts data from homogeneous or heterogeneous data sources
• Transforms the data for storing it in the proper format or structure for querying and analysis purposes
• Loads it into the final target (a database; more specifically, an operational data store, data mart, or data warehouse)
Usually all three phases execute in parallel. Since the data extraction takes time, while the data is being pulled another transformation process executes, processing the already received data and preparing it for loading; as soon as there is some data ready to be loaded into the target, the data loading kicks off without waiting for the completion of the previous phases.
ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example, a cost accounting system may combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:
• Microsoft SQL Server Integration Services (SSIS)
• CampaignRunner
• Oracle Data Integrator (ODI)
• Oracle Warehouse Builder (OWB)
• Rhino ETL
• SAP BusinessObjects Data Services
• SAS Data Integration Studio
• SnapLogic
There are various steps involved in ETL. They are described in detail below.
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract.
Transform:
• The transform step applies a set of rules to transform the data from the source to the target.
• This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.
• The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
Load:
• During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database.
• In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
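A toy end-to-end sketch of the three phases (an added illustration; the column and table names and the conversion rate are made up):

    import sqlite3
    import pandas as pd

    # Extract: in practice this would be pd.read_csv / pd.read_sql against the
    # source system; a small inline frame keeps the sketch self-contained
    raw = pd.DataFrame({
        "date":       ["2024-01-05", "2024-01-09", "2024-02-02", "2024-02-20"],
        "amount_usd": [120.0, 80.0, 200.0, 50.0],
        "region":     ["north", "south", "north", "south"],
    })

    # Transform: conform types/units and derive an aggregate
    raw["date"] = pd.to_datetime(raw["date"])
    raw["amount_eur"] = raw["amount_usd"] * 0.92          # assumed fixed rate
    summary = raw.groupby("region", as_index=False)["amount_eur"].sum()

    # Load: write the transformed data into the target warehouse table
    with sqlite3.connect("warehouse.db") as conn:
        summary.to_sql("sales_by_region", conn, if_exists="replace", index=False)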
Managing the ETL Process
The ETL process seems quite straightforward. As with every application, there is a possibility that the ETL process fails. This can be caused by missing extracts from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery in mind.
Staging:
It should be possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the Extract step. We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the Staging Area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing; this is fine for the ETL process, which uses it for this purpose. However, the staging area should be accessed by the load ETL process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-processing data.
*** End of Unit-4 ***
5. UNIT-5
Data Visualization:
• Why data visualization?
• Categorization of visualization methods
• Pixel-oriented visualization techniques
• Geometric projection visualization techniques
• Icon-based visualization techniques
• Hierarchical visualization techniques
• Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
• To save space and show the connections among multiple dimensions, space filling is often done in a circle segment.
Geometric Projection Visualization Techniques
Visualization of geometric transformations and projections of the data. Methods include:
• Direct visualization
• Scatterplot and scatterplot matrices
• Landscapes
• Projection pursuit technique: helps users find meaningful projections of multidimensional data
• Prosection views
• Hyperslice
• Parallel coordinates
Line Plot:
• This is the plot that you can see in the nooks and corners of any sort of analysis between 2 variables.
Bar Plot
• This is one of the widely used plots that we would have seen multiple times, not just in data analysis; we use this plot wherever there is a trend analysis in many fields.
• We can visualize the data in a cool plot and can convey the details straightforwardly to others.
• This plot may be simple and clear, but it is not very frequently used in Data Science applications.
Stacked Bar Graph:
Stacked Bar Graphs are used to show how a larger category is divided into smaller categories and what relationship each part has to the total amount. There are two types of Stacked Bar Graphs:
• Simple Stacked Bar Graphs place each value for the segment after the previous one. The total value of the bar is all the segment values added together. They are ideal for comparing the total amounts across each group/segmented bar.
• 100% Stacked Bar Graphs show the percentage-of-the-whole of each group and are plotted by the percentage of each value to the total amount in each group. This makes it easier to see the relative differences between quantities in each group.
• One major flaw of Stacked Bar Graphs is that they become harder to read the more segments each bar has. Also, comparing each segment to each other is difficult, as they're not aligned on a common baseline.
Scatter Plot:
• It is one of the most commonly used plots for visualizing simple data in Machine Learning and Data Science.
• This plot gives us a representation in which each point in the entire dataset is plotted with respect to any 2 to 3 features (columns).
• Scatter plots are available in both 2-D as well as 3-D. The 2-D scatter plot is the common one, where we will primarily try to find the patterns, clusters, and separability of the data.
• Colors are assigned to different data points based on how they appear in the dataset, i.e., the target column representation.
• We can color the data points as per their class label given in the dataset.
Box and Whisker Plot
• This plot can be used to obtain more statistical details about the data.
• The straight lines at the maximum and minimum are also called whiskers.
• Points that lie outside the whiskers are considered outliers.
• The box plot also gives us a description of the 25th, 50th and 75th percentiles (quartiles).
• With the help of a box plot, we can also determine the interquartile range (IQR), where most of the details of the data will be present.
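A short matplotlib sketch (an added illustration with made-up data) producing the four basic plots discussed above:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.arange(12)
    sales = 100 + 5 * x + rng.normal(0, 8, size=12)   # made-up monthly figures

    fig, axes = plt.subplots(2, 2, figsize=(8, 6))
    axes[0, 0].plot(x, sales)
    axes[0, 0].set_title("Line plot")
    axes[0, 1].bar(x, sales)
    axes[0, 1].set_title("Bar plot")
    axes[1, 0].scatter(x, sales)
    axes[1, 0].set_title("Scatter plot")
    axes[1, 1].boxplot(sales)
    axes[1, 1].set_title("Box and whisker")
    plt.tight_layout()
    plt.show()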
Pie Chart:
A pie chart shows a static number and how categories represent part of a whole, i.e., the composition of something. A pie chart represents numbers in percentages, and the total sum of all segments needs to equal 100%.
• Extensively used in presentations and offices, Pie Charts help show proportions and percentages between categories by dividing a circle into proportional segments. Each arc length represents a proportion of each category, while the full circle represents the total sum of all the data, equal to 100%.
Donut Chart:
A Donut Chart somewhat remedies the difficulty of comparing slice areas in a Pie Chart by de-emphasizing the use of area. Instead, readers focus more on reading the length of the arcs, rather than comparing the proportions between slices. Also, Donut Charts are more space-efficient than Pie Charts because the blank space inside a Donut Chart can be used to display information inside it.
Marimekko Chart:
Also known as a Mosaic Plot.
Icon-Based Visualization Techniques
• Use small icons to represent multidimensional data values.
• Visualization of the data values as features of icons.
• Typical visualization methods:
o Chernoff Faces
o Stick Figures
Chernoff Faces
• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
• The canonical figure shows faces produced using 10 characteristics: head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening. Each is assigned one of 10 possible values.
Stick Figures
• A census data figure can show age, income, gender, and education.
• A 5-piece stick figure (1 body and 4 limbs with different angles/lengths).
• Age and income are indicated by the position of the figure.
• Gender and education are indicated by angle/length.
• The visualization can show a texture pattern.
• 2 dimensions are mapped to the display axes and the remaining dimensions are mapped to the angle and/or length of the limbs.
Visualizing Complex Data and Relations
Word Cloud:
Also known as a Tag Cloud.
Colour used on Word Clouds is usually meaningless and is primarily aesthetic, but it can be used to categorise words or to display another data variable.
Typically, Word Clouds are used on websites or blogs to depict keyword or tag usage. Word Clouds can also be used to compare two different bodies of text.
Although simple and easy to understand, Word Clouds have some major flaws:
• Long words are emphasised over short words.
• Words whose letters contain many ascenders and descenders may receive more attention.
• They're not great for analytical accuracy, so they are used more for aesthetic reasons instead.
*** End of Unit-5 ***