
LECTURE NOTES

ON

BUSINESS ANALYTICS

IV B.Tech II Semester (2012PE04)

Prepared by

Mr. P. Raviprakash
Assistant Professor

DEPARTMENT OF CSE-AIML
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN
(Autonomous Institution - UGC, Govt. of India)

NIRF Indian Ranking 2018, Accepted by MHRD, Govt. of India
Permanently Affiliated to JNTUH, Approved by AICTE, ISO 9001:2015 Certified Institution, AAAA+
Rated by Digital Learning Magazine, AAA+ Rated by Careers 360 Magazine, 6th Rank CSR Platinum
Rated by AICTE-CII Survey, Top 100 Rank band by ARIIA, MHRD, Govt. of India
National Ranking - Top 100 Rank band by Outlook, National Ranking - Top 100 Rank band by Times News Magazine
Maisammaguda, Dhulapally, Secunderabad, Kompally - 500100

2023-2024
Course Objectives:
• To explore the fundamental concepts of data analytics.

• To learn the principles and methods of statistical analysis.

• To discover interesting patterns, analyze supervised and unsupervised models, and estimate the accuracy of the algorithms.

• To understand the various search methods and visualization techniques.

Course Outcomes: After completion of this course, students will be able to:

CO1: Understand the impact of data analytics on business decisions and strategy
CO2: Carry out data analysis/statistical analysis
CO3: Carry out standard data visualization and formal inference procedures
CO4: Design data architecture
CO5: Understand various data sources
PROGRAM OUTCOMES:

1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

2. Problem Analysis: Identify, formulate, research literature, and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.

3. Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.

4. Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of the information, to provide valid conclusions.

5. Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities, with an understanding of the limitations.

6. The Engineer and Society: Apply reasoning informed by contextual knowledge to assess societal, health, safety, legal, and cultural issues and the consequent responsibilities relevant to professional engineering practice.

7. Environment and Sustainability: Understand the impact of professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.

8. Ethics: Apply ethical principles and commit to professional ethics, responsibilities, and norms of engineering practice.

9. Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams and in multidisciplinary settings.

10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.

11. Project Management and Finance: Demonstrate knowledge and understanding of engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects in multidisciplinary environments.

12. Life-long Learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
CO-PO MAPPING:

PO/CO  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
CO1    H    H    L    L    H    L    L
CO2    M    L    L    L    H    L
CO3    M    L    L    L    L
CO4    M    L    L    L    L
CO5    M    L    L    L    L
PROGRAM SPECIFIC OUTCOMES (PSOs):

PSO1: Understand a range of analytical and logical programming languages, architecture, construction, and design underlying the field of AI and ML and its related disciplinary areas.

PSO2: Ability to acquire knowledge in analysis, design, and development of human perception, Artificial Intelligence, Machine Learning, and data analytics in terms of real-world problems to meet future challenges.

PSO3: Develop computational knowledge and project development skills using innovative tools and practices related to optimization techniques, pattern analysis, and speech recognition for the efficient design of computer-based systems of varying complexity to solve problems in areas related to deep learning, machine learning, and artificial intelligence.

CO-PSO MAPPING:

CO    PSO1  PSO2  PSO3
CO1   M     M
CO2   M     M
CO3   M     M
CO4   M     M
CO5   L     M
SYLLABUS

UNIT-I
Data Management: Design data architecture and manage the data for analysis; understand various sources of data such as sensors/signals/GPS etc.; data management; data quality (noise, outliers, missing values, duplicate data); and data pre-processing & processing.

UNIT-II
Data Analytics: Introduction to analytics; introduction to tools and environment; application of modeling in business; databases & types of data and variables; data modeling techniques; missing imputations etc.; need for business modeling.

UNIT-III
Regression: Concepts, BLUE property assumptions, least square estimation, variable rationalization, and model building etc.
Logistic Regression: Model theory, model fit statistics, model construction, analytics applications to various business domains etc.

UNIT-IV
Object Segmentation: Regression vs segmentation; supervised and unsupervised learning; tree building (regression, classification); overfitting, pruning and complexity; multiple decision trees etc.
Time Series Methods: ARIMA, measures of forecast accuracy, STL approach; extract features from the generated model such as height, average energy etc., and analyze for prediction.

UNIT-V
Data Visualization: Pixel-oriented visualization techniques, geometric projection visualization techniques, icon-based visualization techniques, hierarchical visualization techniques, visualizing complex data and relations.

TEXT BOOKS
1. Student's Handbook for Associate Analytics-II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.

REFERENCE BOOKS
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira.
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), Jeffrey D. Ullman (Stanford Univ.).
INDEX
1. UNIT-1 ........................................................ 1
   Introduction ................................................. 1
   Data and architecture design ................................. 2
   Understand various sources of the Data ....................... 5
   Data Management .............................................. 11
   Data Quality ................................................. 16
   Data Pre-processing .......................................... 19
   Data Processing .............................................. 22
2. UNIT-2 ........................................................ 24
   Introduction to tools ........................................ 26
   Databases & Types of Data and variables ...................... 34
   Variables .................................................... 38
   Missing Imputations .......................................... 40
   Need for Business Modelling .................................. 41
   Data Modelling Techniques .................................... 43
3. UNIT-3 ........................................................ 45
   Regression - Concepts ........................................ 45
   Logistic Regression .......................................... 55
   Analytics applications to various Business Domains .......... 68
4. UNIT-4 ........................................................ 72
   Segmentation ................................................. 74
   Regression Vs Segmentation ................................... 75
   Multiple Decision Trees ...................................... 84
   Overfitting and Underfitting ................................. 89
   Time Series Methods .......................................... 91
   ARIMA & ARMA ................................................. 92
   Measure of Forecast Accuracy ................................. 94
5. UNIT-5 ........................................................ 98
   Data Visualization ........................................... 98
   Pixel-Oriented Visualization Techniques ...................... 98
   Geometric Projection Visualization Techniques ................ 99
   Icon-Based Visualization Techniques .......................... 106
   Hierarchical Visualization ................................... 108
   Visualizing Complex Data and Relations ....................... 111
Business Analytics - Dept. of CSE-AIML
1. UNIT-1
Introduction:

In the early days of computers and the Internet there was far less data than there is today; it could easily be stored and managed by users and business enterprises on a single computer, because the total volume never exceeded about 19 exabytes. Now, in this era, about 2.5 quintillion bytes of data are generated every day.

Most of this data is generated by social media sites such as Facebook, Instagram, and Twitter; other sources include e-business and e-commerce transactions, and hospital, school, and bank data. Such data is impossible to manage with traditional data storage techniques. Whether the data is generated by a large-scale enterprise or by an individual, every aspect of it needs to be analysed to benefit from it. But how do we do that? That is where the term 'Data Analytics' comes in.

Why is Data Analytics important?
Data Analytics plays a key role in improving your business, as it is used to uncover hidden insights and interesting patterns in data, generate reports, perform market analysis, and improve business requirements.

What is the role of Data Analytics?

• Gather hidden insights: Hidden insights from data are gathered and then analyzed with respect to business requirements.
• Generate reports: Reports are generated from the data and passed on to the respective teams and individuals, who take further action to grow the business.
• Perform market analysis: Market analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve business requirements: Analysis of data allows a business to improve on customer requirements and experience.

Malla Reddy Engineering College For Women (Autonomous Institution - UGC, Govt. of India)

What are the tools used in Data Analytics?
With the increasing demand for data analytics in the market, many tools with various functionalities have emerged for this purpose. Ranging from open-source tools to commercial ones, the top tools in the data analytics market are as follows:
• R programming
• Python
• Tableau Public
• QlikView
• SAS
• Microsoft Excel
• RapidMiner
• KNIME
• OpenRefine
• Apache Spark
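None of the tools above are required for a first taste of data analysis; a minimal sketch using only Python's standard library can already compute the kind of summary statistics these tools report. The monthly sales figures below are made up purely for illustration.

```python
import statistics

# Made-up monthly sales figures, purely for illustration.
sales = [120, 135, 128, 150, 142, 160]

mean = statistics.mean(sales)      # average monthly sales
median = statistics.median(sales)  # middle value, robust to outliers
stdev = statistics.stdev(sales)    # sample standard deviation

print(f"mean={mean:.1f} median={median:.1f} stdev={stdev:.1f}")
```

The same three numbers are the starting point of most exploratory analyses, whatever tool is used.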

Data and architecture design:

Data architecture in Information Technology is composed of the models, policies, rules, or standards that govern which data is collected and how it is stored, arranged, integrated, and put to use in data systems and in organizations.

• A data architecture should set data standards for all its data systems, as a vision or a model of the eventual interactions between those data systems.
• Data architectures address data in storage and data in motion; descriptions of data stores, data groups, and data items; and mappings of those data artifacts to data qualities, applications, locations, etc.
• Essential to realizing the target state, data architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also control the flow of data in the system.
• The data architect is typically responsible for defining the target state, aligning during development, and then following up to ensure enhancements are done in the spirit of the original blueprint.


During the definition of the target state, the data architecture breaks a subject down to the atomic level and then builds it back up to the desired form.

The data architect breaks the subject down by going through three traditional architectural processes:

Conceptual model: a business model which uses the Entity-Relationship (ER) model to describe the relationships between entities and their attributes.
Logical model: a model in which the problem is represented in the form of logic, such as rows and columns of data, classes, XML tags, and other DBMS techniques.
Physical model: the physical model holds the database design, such as which type of database technology will be suitable for the architecture.
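The three models can be made concrete with a small sketch. The `customer` entity, its attributes, and the choice of SQLite below are all hypothetical, chosen only to show a conceptual entity becoming a logical table of rows and columns and finally a physical design on one specific database technology.

```python
import sqlite3

# Conceptual model: a hypothetical "customer" entity with name and
# email attributes (the names here are illustrative, not from any
# real architecture).
# Logical model: the entity becomes a table of rows and columns.
# Physical model: concrete DDL for one chosen engine, here SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    )
""")
conn.execute("INSERT INTO customer (name, email) VALUES (?, ?)",
             ("Asha", "asha@example.com"))
row = conn.execute("SELECT name, email FROM customer").fetchone()
print(row)  # ('Asha', 'asha@example.com')
```

Swapping SQLite for another engine would change only the physical model; the conceptual and logical models would remain the same.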

Layer  View                        Data (What)                                  Stakeholder
1      Scope/Contextual            List of things and architectural             Planner
                                   standards important to the business
2      Business Model/Conceptual   Semantic model or Conceptual/                Owner
                                   Enterprise Data Model
3      System Model/Logical        Enterprise/Logical Data Model                Designer
4      Technology Model/Physical   Physical Data Model                          Builder
5      Detailed Representations    Actual databases                             Subcontractor

The data architecture is thus formed by dividing the design into these three essential models, which are then combined.

Factors that influence data architecture:
Various constraints and influences affect data architecture design. These include enterprise requirements, technology drivers, economics, business policies, and data processing needs.

Enterprise requirements:
• These will generally include such elements as economical and effective system expansion, acceptable performance levels (especially system access speed), transaction reliability, and transparent data management.
• In addition, the conversion of raw data such as transaction records and image files into more useful information forms through such features as data warehouses is also a common organizational requirement, since this enables managerial decision making and other organizational processes.
• One architectural technique is the split between managing transaction data and (master) reference data. Another is splitting data capture systems from data retrieval systems (as done in a data warehouse).

Technology drivers:
• These are usually suggested by the completed data architecture and database architecture designs.
• In addition, some technology drivers will derive from existing organizational integration frameworks and standards, organizational economics, and existing site resources (e.g. previously purchased software licensing).

Economics:
• These are also important factors that must be considered during the data architecture phase. It is possible that some solutions, while optimal in principle, may not be potential candidates due to their cost.
• External factors such as the business cycle, interest rates, market conditions, and legal considerations could all have an effect on decisions relevant to data architecture.

Business policies:
• Business policies that also drive data architecture design include internal organizational policies, rules of regulatory bodies, professional standards, and applicable governmental laws that can vary by agency.
• These policies and rules help describe the manner in which the enterprise wishes to process its data.

Data processing needs:
• These include accurate and reproducible transactions performed in high volumes, data warehousing for the support of management information systems (and potential data mining), repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives as required (e.g. annual budgets, new product development).
• The general approach is based on designing the architecture at three levels of specification:
➢ The Logical Level
➢ The Physical Level
➢ The Implementation Level
Understand various sources of the Data:
• Data can be generated from two types of sources, namely primary and secondary sources.
• Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or other image files, used in later stages of data analysis.
• In the process of big data analysis, data collection is the initial step, before starting to analyse the patterns or useful information in the data. The data to be analysed must be collected from valid sources.
• The data collected is known as raw data, which is not useful as-is; cleaning out the impurities and utilizing the data for further analysis produces information, and the insight obtained from that information is known as "knowledge". Knowledge has many meanings, such as business knowledge, sales of enterprise products, disease treatment, etc.
• The main goal of data collection is to collect information-rich data.
• Data collection starts with asking questions such as what type of data is to be collected and what the source of collection is.
• Most of the data collected falls into two types: qualitative data, which is non-numerical data such as words and sentences focusing mostly on the behaviour and actions of a group, and quantitative data, which is in numerical form and can be calculated using different scientific tools and sampling methods.

The actual data is then further divided into two main types:
1. Primary data
2. Secondary data


1. Primary data:
• Data that is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly using techniques such as questionnaires, interviews, and surveys. The data collected must be according to the demand and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden in data processing.

A few methods of collecting primary data:
1. Interview method:
• The data collected in this process is obtained by interviewing the target audience; the person conducting the interview is called the interviewer, and the person who answers is the interviewee.
• Some basic business- or product-related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing.
• Interviews can be both structured and unstructured, such as personal or formal interviews conducted by telephone, face to face, email, etc.

2. Survey method:
• The survey method is the research process in which a list of relevant questions is asked and answers are noted down in the form of text, audio, or video.
• Surveys can be conducted both online and offline, for example through website forms and email; the survey answers are then stored for data analysis. Examples are online surveys or surveys through social media polls.

3. Observation method:
• The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using some data collecting tool and stores the observed data in the form of text, audio, video, or any raw format.
• In this method, data can also be collected directly by posing a few questions to the participants; for example, observing a group of customers and their behaviour towards a product. The data obtained is then sent for processing.
4. Experimental method:
• The experimental method is the process of collecting data by performing experiments, research, and investigation.
• The most frequently used experimental designs are CRD, RBD, LSD, and FD.

CRD - Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.

RBD - Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks.
• Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA).
• The term Randomized Block Design originated in agricultural research. In this design, several treatments of variables are applied to different blocks of land to ascertain their effect on the yield of the crop.
• Blocks are formed in such a manner that each block contains as many plots as the number of treatments, so that one plot from each block is selected at random for each treatment. The production of each plot is measured after the treatment is given.
• These data are then interpreted and inferences are drawn using the analysis of variance technique, so as to know the effect of various treatments such as different doses of fertilizer, different types of irrigation, etc.
LSD - Latin Square Design is an experimental design similar to CRD and RBD, but organized into rows and columns.
• It is an arrangement of N x N squares with an equal number of rows and columns, containing symbols that each occur only once in a row. Hence the differences can easily be found with fewer errors in the experiment. The Sudoku puzzle is an example of a Latin square design.
• A Latin square is an experimental design with a balanced two-way classification scheme, for example a 4 x 4 arrangement. In this scheme, each letter from A to D occurs only once in each row and only once in each column.
• The Latin square is probably underused in most fields of research because textbook examples tend to be restricted to agriculture, the area which spawned most of the original work on ANOVA. Agricultural examples often reflect geographical designs where rows and columns are literally two dimensions of a grid in a field.
• Rows and columns can be any two sources of variation in an experiment. In this sense, a Latin square is a generalisation of a randomized block design with two different blocking systems.
    A B C D
    B C D A
    C D A B
    D A B C
• The balanced arrangement achieved in a Latin square is its main strength. In this design, comparisons among treatments are free from both differences between rows and differences between columns. Thus, the magnitude of error will be smaller than in any other design.
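The cyclic construction behind squares like the 4 x 4 example above can be sketched in a few lines of Python; the helper name `latin_square` is ours, not a library function.

```python
# Build an N x N Latin square by cyclically shifting a row of symbols,
# then check the defining property: each symbol appears exactly once
# in every row and once in every column.
def latin_square(symbols):
    n = len(symbols)
    return [[symbols[(i + j) % n] for j in range(n)] for i in range(n)]

square = latin_square(["A", "B", "C", "D"])
for row in square:
    print(" ".join(row))

n = len(square)
assert all(len(set(row)) == n for row in square)                           # rows
assert all(len({square[i][j] for i in range(n)}) == n for j in range(n))   # columns
```

Running this prints exactly the 4 x 4 square shown above.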
FD - Factorial Design is an experimental design in which each experiment has two or more factors, each with its possible values, and on performing a trial the combinations of factor levels are derived. This design allows the experimenter to test two or more variables simultaneously. It also measures the interaction effects of the variables and analyses the impact of each variable. In a true experiment, randomization is essential so that the experimenter can infer cause and effect without any bias.

2. Secondary data:

Secondary data is data which has already been collected and is reused for some valid purpose. This type of data is derived from previously recorded primary data, and it has two types of sources, named internal and external.

Internal sources:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption are lower when obtaining data from internal sources.
▪ Accounting resources: These give much information which can be used by the marketing researcher. They give information about internal factors.
▪ Sales force reports: These give information about the sales of a product. The information provided is from outside the organization.
▪ Internal experts: These are the people heading the various departments. They can give an idea of how a particular thing is working.
▪ Miscellaneous reports: This is the information obtained from operational reports. If the data available within the organization is unsuitable or inadequate, the marketer should extend the search to external secondary data sources.

External sources:
Data which cannot be found within the organization and is gained through external third-party resources is external source data. The cost and time consumption are greater because this involves a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Government publications:
▪ Government sources provide an extremely rich pool of data for researchers. In addition, much of this data is available free of cost on internet websites. There are a number of government agencies generating data, such as:

1. Registrar General of India:
▪ This is an office which generates demographic data. It includes details of gender, age, occupation, etc.

2. Central Statistical Organization (CSO):
▪ This organization publishes the national accounts statistics. It contains estimates of national income for several years, growth rates, and rates of major economic activities. The Annual Survey of Industries is also published by the CSO.
▪ It gives information about the total number of workers employed, production units, material used, and value added by the manufacturer.

3. Director General of Commercial Intelligence:
▪ This office operates from Kolkata. It gives information about foreign trade, i.e. imports and exports. These figures are provided region-wise and country-wise.

4. Ministry of Commerce and Industries:
▪ This ministry, through the office of the Economic Advisor, provides information on the wholesale price index. These indices may relate to a number of sectors like food, fuel, power, food grains, etc.
▪ It also generates the All India Consumer Price Index numbers for industrial workers, urban non-manual employees, and cultural labourers.

5. Planning Commission:
▪ It provides the basic statistics of the Indian economy.

6. Reserve Bank of India:
▪ This provides information on banking, savings, and investment. The RBI also prepares currency and finance reports.

7. Labour Bureau:
▪ It provides information on skilled, unskilled, white-collared jobs, etc.

8. National Sample Survey:
▪ This is done by the Ministry of Planning, and it provides social, economic, demographic, industrial, and agricultural statistics.

9. Department of Economic Affairs:
▪ It conducts the economic survey and also generates information on income, consumption, expenditure, investment, savings, and foreign trade.

10. State Statistical Abstract:
▪ This gives information on various types of activities related to the state, such as commercial activities, education, occupation, etc.

Non-government publications:
▪ These include publications of various industrial and trade associations, such as the Indian Cotton Mill Association and various chambers of commerce.


1. TheBombayStockExchange
▪ Itpublishesadirectorycontainingfinancialaccounts,keyprofitabilityandotherrelevan
t matter) Various Associations of Press Media.
• ExportPromotionCouncil.
• ConfederationofIndianIndustries(CII)
• SmallIndustriesDevelopmentBoardofIndia
• DifferentMillslike-Woollenmills,Textilemillsetc
▪ Theonly disadvantageof the above sources is that the data may bebiased. They
are likely to colour their negative points.
2. Syndicate Services -
▪ These services are provided by certain organizations which collect and tabulate marketing information on a regular basis for a number of clients who subscribe to these services.
▪ These services are useful for tracking television viewing, the movement of consumer goods, etc.
▪ These syndicate services provide information data from both households as well as institutions.

In collecting data from households, they use three approaches:
• Survey - They conduct surveys regarding lifestyle, sociographics and general topics.
• Mail Diary Panel - It may be related to two fields: purchase and media.
• Electronic Scanner Services - These are used to generate data on volume.

They collect data for institutions from:
• Wholesalers
• Retailers, and
• Industrial firms
▪ Various syndicate services are the Operations Research Group (ORG) and The Indian Marketing Research Bureau (IMRB).
Importance of Syndicate Services:
• Syndicate services are becoming popular since the constraints of decision-making are changing, and we need more specific decision-making in the light of a changing environment. Also, syndicate services are able to provide information to industries at a low unit cost.
Disadvantages of Syndicate Services:
• The information provided is not exclusive. A number of research agencies provide customized services which suit the requirements of each individual organization.

International Organizations -
These include:
• The International Labour Organization (ILO): It publishes data on the total and active population, employment, unemployment, wages and consumer prices.
• The Organization for Economic Co-operation and Development (OECD): It publishes data on foreign trade, industry, food, transport, and science and technology.
• The International Monetary Fund (IMF): It publishes reports on national and international foreign exchange regulations.
Other sources:
Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
Satellite data: Satellites collect a large number of images and terabytes of data on a daily basis through surveillance cameras, which can be mined for useful information.
Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide their data through the keywords and queries searched most often.
Exporting data to the cloud (e.g. Amazon Web Services S3): We usually export our data to the cloud for purposes such as safety, multiple access and real-time simultaneous analysis.

Data Management:
Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-effectively. The goal of data management is to help people, organizations, and connected things optimize the use of data within the bounds of policy and regulation, so that they can make decisions and take actions that maximize the benefit to the organization.
Managing digital data in an organization involves a broad range of tasks, policies, procedures, and practices. The work of data management has a wide scope, covering factors such as how to:
• Create, access, and update data across a diverse data tier
• Store data across multiple clouds and on premises
• Provide high availability and disaster recovery
• Use data in a growing variety of apps, analytics, and algorithms
• Ensure data privacy and security
• Archive and destroy data in accordance with retention schedules and compliance requirements
What is Cloud Computing?

Cloud computing is a term that refers to storing and accessing data over the internet. It doesn't store any data on the hard disk of your personal computer; in cloud computing, you access data from a remote server.
The service models of cloud computing are the reference models on which cloud computing is based. These can be categorized into three basic service models, as listed below:
1. INFRASTRUCTURE as a SERVICE (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual machines, virtual storage, etc.
2. PLATFORM as a SERVICE (PaaS)
PaaS provides the runtime environment for applications, development & deployment tools, etc.
3. SOFTWARE as a SERVICE (SaaS)
The SaaS model allows software applications to be used as a service by end users.

AWS is one of the popular platforms for providing the above service models. Amazon Cloud (Web) Services is one of the popular service platforms for data management.
Amazon Cloud (Web) Services Tutorial
What is AWS?
The full form of AWS is Amazon Web Services. It is a platform that offers flexible, reliable, scalable, easy-to-use and cost-effective cloud computing solutions.

AWS is a comprehensive, easy-to-use computing platform offered by Amazon. The platform is developed with a combination of infrastructure as a service (IaaS), platform as a service (PaaS) and packaged software as a service (SaaS) offerings.

History of AWS
2002 - AWS services launched
2006 - Launched its cloud products
2012 - Holds first customer event
2015 - Reveals revenues achieved of $4.6 billion
2016 - Surpasses $10 billion revenue target
2016 - Releases Snowball and Snowmobile
2019 - Offers nearly 100 cloud services
2021 - AWS comprises over 200 products and services

Important AWS Services
Amazon Web Services offers a wide range of global cloud-based products for different business purposes. The products include storage, databases, analytics, networking, mobile, development tools, and enterprise applications, with a pay-as-you-go pricing model.

Amazon Web Services - Amazon S3:

• Amazon S3 (Simple Storage Service) is a scalable, high-speed, low-cost web-based service designed for online backup and archiving of data and application programs.
• It allows you to upload, store, and download any type of file up to 5 TB in size. This service allows subscribers to access the same systems that Amazon uses to run its own websites.
• The subscriber has control over the accessibility of the data, i.e. privately/publicly accessible.

1. How to Configure S3?
Following are the steps to configure an S3 account.
Step 1 − Open the Amazon S3 console using this link − https://console.aws.amazon.com/s3/home
Step 2 − Create a bucket using the following steps.

• A prompt window will open. Click the Create Bucket button at the bottom of the page.
• A Create a Bucket dialog box will open. Fill in the required details and click the Create button.
• The bucket is created successfully in Amazon S3. The console displays the list of buckets and their properties.
• Select the Static Website Hosting option. Click the radio button Enable website hosting and fill in the required details.

Step 3 − Add an object to the bucket using the following steps.
• Open the Amazon S3 console using the following link: https://console.aws.amazon.com/s3/home


• Click the Upload button.
• Click the Add files option. Select the files to be uploaded from the system and then click the Open button.
• Click the Start Upload button. The files will get uploaded into the bucket.
• Afterwards, we can create, edit, modify and update the objects and other files in a wide range of formats.

Amazon S3 Features
• Low cost and easy to use − Using Amazon S3, the user can store a large amount of data at very low charges.
• Secure − Amazon S3 supports data transfer over SSL, and the data gets encrypted automatically once it is uploaded. The user has complete control over their data by configuring bucket policies using AWS IAM.
• Scalable − Using Amazon S3, there need not be any worry about storage capacity. We can store as much data as we have and access it anytime.
• Higher performance − Amazon S3 is integrated with Amazon CloudFront, which distributes content to end users with low latency and provides high data-transfer speeds without any minimum usage commitments.
• Integrated with AWS services − Amazon S3 is integrated with AWS services including Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.

For a broader discussion of Amazon S3, see: https://d1.awsstatic.com/whitepapers/aws-overview.pdf

Data Quality:

What is Data Quality?
There are many definitions of data quality. In general, data quality is the assessment of how usable the data is and how well it fits its serving context.

Why is Data Quality Important?
Enhancing data quality is a critical concern, as data is considered the core of all activities within organizations. Poor data quality leads to inaccurate reporting, which results in inaccurate decisions and, surely, economic damage.

Many factors help in measuring data quality, such as:
• Data Accuracy: Data are accurate when the data values stored in the database correspond to real-world values.
• Data Uniqueness: A measure of unwanted duplication existing within or across systems for a particular field, record, or data set.
• Data Consistency: The absence of violations of the semantic rules defined over the data set.
• Data Completeness: The degree to which values are present in a data collection.
• Data Timeliness: The extent to which the age of the data is appropriate for the task at hand.
Other factors can be taken into consideration, such as availability, ease of manipulation, and believability.
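Two of the factors above, completeness and uniqueness, lend themselves to a direct numeric check. The sketch below is illustrative only; the field values and record identifiers are invented for the example:

```python
# Minimal sketch of two data-quality measurements: completeness
# (fraction of non-missing values) and uniqueness (fraction of
# distinct records). The sample data is hypothetical.
def completeness(values):
    """Fraction of values that are present (not None and not empty)."""
    present = sum(1 for v in values if v not in (None, ""))
    return present / len(values)

def uniqueness(records):
    """Fraction of records that are distinct."""
    return len(set(records)) / len(records)

emails = ["a@x.com", None, "b@x.com", "", "c@x.com"]
print(completeness(emails))            # 0.6, i.e. 40% of values missing
print(uniqueness(("r1", "r2", "r2")))  # below 1.0, so duplication exists
```

In practice such checks are run per column and tracked over time, so a sudden drop in completeness or uniqueness flags a data-quality regression.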

OUTLIERS:

• An outlier is a point or an observation that deviates significantly from the other observations.
• Outlier is a commonly used term among analysts and data scientists, as outliers need close attention; otherwise they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
• Reasons for outliers: experimental errors or "special circumstances".
• There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
• There are various methods of outlier detection. Some are graphical, such as normal probability plots; others are model-based. Box plots are a hybrid.

Types of Outliers:

Outliers can be of two types:
Univariate: These outliers can be found when we look at the distribution of a single variable.
Multivariate: Multivariate outliers are outliers in an n-dimensional space. In order to find them, you have to look at distributions in multiple dimensions.

Impact of Outliers on a dataset:
Outliers can drastically change the results of data analysis and statistical modelling. There are numerous unfavourable impacts of outliers on a data set:
• They increase the error variance and reduce the power of statistical tests
• If the outliers are non-randomly distributed, they can decrease normality
• They can bias or influence estimates that may be of substantive interest
• They can also violate the basic assumptions of regression, ANOVA and other statistical models

Detecting Outliers:
The most commonly used method to detect outliers is visualization, using methods such as box plots, histograms and scatter plots.
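The box plot flags points numerically using the 1.5 × IQR rule: any observation more than 1.5 times the interquartile range below the first quartile or above the third is marked as an outlier. A minimal Python sketch of that rule (the sample values are made up):

```python
# The numeric rule behind the box-plot whiskers: points more than
# 1.5 * IQR below Q1 or above Q3 are flagged as outliers.
import statistics

def iqr_outliers(data):
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

sample = [10, 12, 11, 13, 12, 11, 95]   # 95 deviates from the pattern
print(iqr_outliers(sample))             # [95]
```

Note that `statistics.quantiles` uses exclusive interpolation by default; other conventions give slightly different quartiles, which is another reminder that the outlier boundary is ultimately a judgment call.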

Outlier treatments are of three types:
Retention:
• There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise. There are various methods of outlier detection: some are graphical, such as normal probability plots; others are model-based. Box plots are a hybrid.
Exclusion:
• Depending on the purpose of the study, it is necessary to decide whether, and which, outliers will be removed/excluded from the data, since they could highly bias the final results of the analysis.
Rejection:
• Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured, and the usual distribution of measurement error, are confidently known.
• An outlier resulting from an instrument reading error may be excluded, but it is desirable that the reading is at least verified.

Other treatment methods
• The OUTLIER package in R can be used to detect and treat outliers in data.
• Outlier detection from graphical representations:
– Scatter plot and box plot
– The observations outside the box are treated as outliers in the data.

Missing Data treatment:


Missing Values
➢ Missing data in the training data set can reduce the power/fit of a model, or can lead to a biased model, because we have not analyzed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
➢ In R, missing values are represented by the symbol NA (not available).
➢ Impossible values (e.g., the result of 0/0) are represented by the symbol NaN (not a number), while R outputs the result of dividing a non-zero number by zero as 'Inf' (infinity).

PMM approach to treat missing values:
• PMM -> Predictive Mean Matching (PMM) is a semi-parametric imputation approach.
• It is similar to the regression method, except that for each missing value it fills in a value drawn randomly from among the observed donor values of observations whose regression-predicted values are closest to the regression-predicted value for the missing value under the simulated regression model.
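The donor-matching idea behind PMM can be shown in a heavily simplified, single-predictor sketch. Real PMM implementations (for example, the mice package in R) also draw the regression parameters from their posterior distribution, a step omitted here; all data and names below are illustrative:

```python
# A much-simplified sketch of the PMM donor-matching idea: fit a
# regression on the observed rows, then fill each missing value with
# an OBSERVED value from a donor whose prediction is closest to the
# prediction for the missing case. Not the full PMM algorithm.
import random

def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y using donors with the closest
    regression-predicted values."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    # ordinary least-squares fit y = a + b*x on the observed rows
    n = len(obs)
    mx = sum(xi for xi, _ in obs) / n
    my = sum(yi for _, yi in obs) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in obs) / \
        sum((xi - mx) ** 2 for xi, _ in obs)
    a = my - b * mx
    rng = random.Random(seed)
    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        pred = a + b * xi
        # the k observed donors whose predictions are closest
        donors = sorted(obs, key=lambda o: abs((a + b * o[0]) - pred))[:k]
        filled.append(rng.choice([d[1] for d in donors]))  # random donor value
    return filled

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, None, 8.2, 9.9]
print(pmm_impute(x, y))  # the None is replaced by an observed donor value
```

Because the imputed value is always one that actually occurred in the data, PMM avoids impossible imputations (such as negative ages) that a plain regression fill can produce.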

Data Pre-processing:
Preprocessing in data mining: Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves the handling of missing data, noisy data etc.
(a) Missing Data:
This situation arises when some data is missing in the data set. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.

2. Fill in the missing values:
There are various ways to do this task. You can choose to fill in the missing values manually, by the attribute mean, or with the most probable value.

(b) Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data (the total number of unique values for a dimension is known as its cardinality). Binning groups related values together in bins to reduce the number of distinct values.
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
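The binning method described above can be sketched in a few lines: each value is assigned to an equal-width bin and replaced by its bin mean, which smooths the data and reduces cardinality. The bin count and sample data are arbitrary choices for the example:

```python
# Equal-width binning sketch: replace each value by the mean of its
# bin, smoothing the data and reducing the number of distinct values.
# Assumes the values are not all equal (otherwise width would be 0).
def bin_by_means(values, n_bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    def bin_index(v):
        return min(int((v - lo) / width), n_bins - 1)  # clamp max into last bin
    bins = [[] for _ in range(n_bins)]
    for v in values:
        bins[bin_index(v)].append(v)
    means = [sum(b) / len(b) if b else None for b in bins]
    return [means[bin_index(v)] for v in values]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
# 3 distinct smoothed values now replace 11 distinct raw ones
print(bin_by_means(prices))
```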
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. This involves the following ways:

1. Normalization:
Normalization is a technique often applied as part of data preparation for data analytics and machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

3. Discretization:
Discretization is the process through which we can transform continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of the desired variable/model/function. Continuous data is measured, while discrete data is counted.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
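The min-max scaling described under Normalization (step 1 above) can be sketched as follows; the 0.0-1.0 target range and the sample column are illustrative:

```python
# Min-max normalization sketch: rescale a numeric column to a target
# range (here 0.0-1.0) without distorting relative differences.
# Assumes the values are not all equal (otherwise hi - lo would be 0).
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

ages = [18, 30, 45, 60]
print(min_max_normalize(ages))  # smallest maps to 0.0, largest to 1.0
```

The spacing between values is preserved: a gap of 12 years maps to the same normalized distance wherever it occurs in the column.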

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data. While working with a huge volume of data, analysis becomes harder. To get around this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
An aggregation operation is applied to the data for the construction of the data cube.

2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute. Attributes having a p-value greater than the significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.

4. Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
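Numerosity reduction via a regression model (step 3 above) can be illustrated with a least-squares line: instead of storing every observation, we keep only two fitted parameters and reconstruct approximate values on demand. The data here is invented for the example:

```python
# Numerosity reduction sketch: store only the two parameters of a
# least-squares line y = a + b*x instead of all the (x, y) pairs,
# and reconstruct approximate values when needed (a lossy reduction).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b          # intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 6.0, 7.9, 10.0]    # roughly y = 2x
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))    # two numbers stand in for ten
print([round(a + b * x, 1) for x in xs])  # approximate reconstruction
```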


Data Processing:
Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or a team of data scientists, it is important for data processing to be done correctly, so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization.

Six stages of data processing
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well-built, so the data collected (and later used as information) is of the highest possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as "pre-processing", is the stage at which raw data is cleaned up and organized for the following stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create high-quality data for the best business intelligence.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift) and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information.

4. Processing
During this stage, the data input to the computer in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of the data being processed (data lakes, social networks, connected devices etc.) and its intended use (examining advertising patterns, medical diagnosis from connected devices, determining customer needs, etc.).

5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. When data is properly stored, it can be quickly and easily accessed by members of the organization when needed.

*** End of Unit-1 ***

UNIT-2
Data has been the buzzword for ages now. Whether the data is generated by large-scale enterprises or by an individual, each and every aspect of data needs to be analyzed to benefit from it.

Why is Data Analytics important?
Data Analytics has a key role in improving your business, as it is used to gather hidden insights, generate reports, perform market analysis, and improve business requirements.

What is the role of Data Analytics?
• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals to deal with further actions for a high rise in business.
• Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Analysis of data allows improving business-to-customer requirements and experience.

WaystoUseDataAnalytics:
Nowthatyouhavelookedatwhatdataanalyticsis,let’sunderstandhowwecanusedataanalytics.

Fig:WaystouseDataAnalytics
1. Improved Decision Making: Data Analytics eliminates guesswork and manual
tasks. Be it choosing the right content, planning marketing campaigns, or developing
products. Organizations can use the insights they gain from data analytics to make
informed decisions. Thus, leading to better outcomes and customer satisfaction.
2. Better Customer Service: Data analytics allows you to tailor customer service
according to their needs. It also provides personalization and builds stronger
relationships with customers. Analyzeddatacan reveal information aboutcustomers’
interests, concerns,andmore. Ithelps you give better recommendations for products

MallaReddyEngineeringCollegeForWomen(AutonomousInstitution-UGC,Govt.ofIndia) Page 29
BusinessAnalytics Dept.ofCSE-AIML
and services.

3. Efficient Operations: With the help of data analytics, you can streamline your processes, save money, and boost production. With an improved understanding of what your audience wants, you spend less time creating ads and content that aren't in line with your audience's interests.
4. Effective Marketing: Data analytics gives you valuable insights into how your campaigns are performing. This helps in fine-tuning them for optimal outcomes. Additionally, you can also find potential customers who are most likely to interact with a campaign and convert into leads.

Steps Involved in Data Analytics:
The next step in understanding data analytics is to learn how data is analyzed in organizations. There are a few steps involved in the data analytics lifecycle. Below are the steps that you can take to solve your problems.

Fig: Data Analytics process steps
1. Understand the problem: Understanding the business problems, defining the organizational goals, and planning a lucrative solution is the first step in the analytics process. E-commerce companies often encounter issues such as predicting the return of items, giving relevant product recommendations, cancellation of orders, identifying fraud, optimizing vehicle routing, etc.
2. Data Collection: Next, you need to collect transactional business data and customer-related information from the past few years to address the problems your business is facing. The data can have information about the total units that were sold for a product, the sales and profit that were made, and also when the order was placed. Past data plays a crucial role in shaping the future of a business.
3. Data Cleaning: Now, all the data you collect will often be disorderly, messy, and contain unwanted missing values. Such data is not suitable or relevant for performing data analysis. Hence, you need to clean the data to remove unwanted, redundant, and missing values to make it ready for analysis.

4. Data Exploration and Analysis: After you gather the right data, the next vital step is to execute exploratory data analysis. You can use data visualization and business intelligence tools, data mining techniques, and predictive modelling to analyze, visualize, and predict future outcomes from this data. Applying these methods can tell you the impact and relationship of a certain feature as compared to other variables.
Below are the results you can get from the analysis:
• You can identify when a customer will purchase the next product.
• You can understand how long it took to deliver the product.
• You get a better insight into the kind of items a customer looks for, product returns, etc.
• You will be able to predict the sales and profit for the next quarter.
• You can minimize order cancellations by dispatching only relevant products.
• You'll be able to figure out the shortest route to deliver the product, etc.
5. Interpret the results: The final step is to interpret the results and validate whether the outcomes meet your expectations. You can find out hidden patterns and future trends. This will help you gain insights that will support you with appropriate data-driven decision making.

Introduction to tools:

What are the tools used in Data Analytics?
With the increasing demand for data analytics in the market, many tools with various functionalities have emerged for this purpose. Either open-source or user-friendly, the top tools in the data analytics market are as follows.

• R programming – This tool is the leading analytics tool used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and macOS. It also provides tools to automatically install all packages as per user requirements.
• Python – Python is an open-source, object-oriented programming language that is easy to read, write, and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also be assembled on any platform, like a SQL Server, a MongoDB database or JSON.
• Tableau Public – This is free software that connects to any data source such as Excel, a corporate data warehouse, etc. It then creates visualizations, maps, dashboards etc. with real-time updates on the web.
• QlikView – This tool offers in-memory data processing, with the results delivered to end users quickly. It also offers data association and data visualization, with data being compressed to almost 10% of its original size.
• SAS – A programming language and environment for data manipulation and analytics, this tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly used for clients' internal data, this tool analyzes the tasks that summarize the data with a preview of pivot tables.
• RapidMiner – A powerful, integrated platform that can integrate with any data source type such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase etc. This tool is mostly used for predictive analytics, such as data mining, text analytics and machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform which allows you to analyze and model data. With the benefit of visual programming, KNIME provides a platform for reporting and integration through its modular data pipeline concept.
• OpenRefine – Also known as Google Refine, this data cleaning software will help you clean up data for analysis. It is used for cleaning messy data, the transformation of data, and parsing data from websites.
• Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. This tool is also popular for data pipelines and machine learning model development.

Data Analytics Applications:
Data analytics is used in almost every sector of business; let's discuss a few of them:
1. Retail: Data analytics helps retailers understand their customers' needs and buying habits to predict trends, recommend new products, and boost their business. They optimize the supply chain and retail operations at every step of the customer journey.
2. Healthcare: Healthcare industries analyse patient data to provide lifesaving diagnoses and treatment options. Data analytics also helps in discovering new drug development methods.
3. Manufacturing: Using data analytics, manufacturing sectors can discover new cost-saving opportunities. They can solve complex supply chain issues, labour constraints, and equipment breakdowns.
4. Banking sector: Banking and financial institutions use analytics to find probable loan defaulters and the customer churn rate. It also helps in detecting fraudulent transactions immediately.
5. Logistics: Logistics companies use data analytics to develop new business models and optimize routes. This, in turn, ensures that deliveries arrive on time in a cost-efficient manner.

Cluster computing:
▪ Cluster computing is a collection of tightly or loosely connected computers that work together so that they act as a single entity.
▪ The connected computers execute operations all together, thus creating the idea of a single system.
▪ The clusters are generally connected through fast local area networks (LANs).

Why is Cluster Computing important?

▪ Cluster computing gives a relatively inexpensive, unconventional alternative to large server or mainframe computer solutions.
▪ It resolves the demand for content criticality and process services in a faster way.
▪ Many organizations and IT companies are implementing cluster computing to augment their scalability, availability, processing speed and resource management at economic prices.
▪ It ensures that computational power is always available. It provides a single general strategy for the implementation and application of parallel high-performance systems, independent of particular hardware vendors and their product decisions.

Apache Spark:

• Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce, and it extends the MapReduce model to efficiently use it for more types of computations, including interactive queries and stream processing.
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.
• Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark
Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.

Features of Apache Spark:
Apache Spark has the following features.
Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes with 80 high-level operators for interactive querying.
Advanced Analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Spark Built on Hadoop
The following diagram shows three ways of how Spark can be built with Hadoop components.

There are three ways of Spark deployment, as explained below.
Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop Yarn − Hadoop Yarn deployment means, simply, Spark runs on Yarn (Yet Another Resource Negotiator) without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch a Spark job in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
Components of Spark
The following illustration depicts the different components of Spark.

Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and referencing of datasets in external storage systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark, because of the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
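Spark Core's operators follow the functional map/reduce model described above. The plain-Python sketch below is not the actual Spark API; it is a hypothetical miniature showing the flatMap → map → reduceByKey flow that a Spark word count would use:

```python
from collections import defaultdict
from functools import reduce

def flat_map(f, data):
    # flatMap: apply f to each element and flatten the results into one list
    return [y for x in data for y in f(x)]

def reduce_by_key(f, pairs):
    # reduceByKey: combine all values that share the same key using f
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(f, values) for key, values in grouped.items()}

lines = ["spark runs fast", "spark scales"]
words = flat_map(str.split, lines)            # split each line into words
pairs = [(w, 1) for w in words]               # map each word to (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts["spark"])  # 2
```

In real Spark the same pipeline would be distributed across the cluster's executors; here the functions run locally only to illustrate the model.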

What is Scala?
• Scala is a statically typed programming language that incorporates both functional and object-oriented styles, and is also suitable for imperative programming approaches, to increase the scalability of applications. It is a general-purpose programming language with strong static typing. In Scala, everything is an object, whether it is a function or a number; it does not have the concept of primitive data.
• Scala primarily runs on the JVM platform, and it can also be used to write software for native platforms using Scala Native and for JavaScript runtimes through Scala.js.
• This language was originally built for the Java Virtual Machine (JVM), and one of Scala's strengths is that it makes it very easy to interact with Java code.
• Scala is a scalable language used to write software for multiple platforms; hence it got the name 'Scala'. The language is intended to solve the problems of Java while simultaneously being more concise. Initially designed by Martin Odersky, it was released in 2003.

Why Scala?
• Scala is the core language used to write the most popular distributed big data processing framework, Apache Spark. Big data processing is becoming inevitable for small to large enterprises.
• Extracting valuable insights from data requires state-of-the-art processing tools and frameworks.
• Scala is easy to learn for object-oriented programmers and Java developers, and it has become one of the popular languages of recent years.
• Scala offers first-class functions for users.
• Scala can be executed on the JVM, thus paving the way for interoperability with other languages.
• It is designed for applications that are concurrent (parallel), distributed, and resilient (robust) message-driven; it is one of the most in-demand languages of this decade.
• It is a concise, powerful language that can quickly grow according to the demand of its users.
• It is object-oriented and has many functional programming features, giving developers a lot of flexibility to code the way they want.
• Scala offers many duck types (structural types).
• Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching.
• The name Scala is a portmanteau of 'scalable' and 'language', signifying that it is designed to grow with the demands of its users.

Where can Scala be used?
• Web applications
• Utilities and libraries
• Data streaming
• Parallel batch processing
• Concurrency and distributed applications
• Data analytics with Spark
• AWS Lambda expressions


Cloudera Impala:

• Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
• Impala is the open source MPP SQL query engine for a native analytic database in a computer cluster running Apache Hadoop.
• It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon.
• The project was announced in October 2012 with a public beta test distribution and became generally available in May 2013.
• Impala enables users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.
• Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.
• Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools.
• The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.

Features include:
• Supports HDFS and Apache HBase storage,
• Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet,
• Supports Hadoop security (Kerberos authentication),
• Fine-grained, role-based authorization with Apache Sentry,
• Uses metadata, ODBC driver, and SQL syntax from Apache Hive.

Databases & Types of Data and Variables
Database: A database is a collection of related data.
Database Management System: A DBMS is a software system, or set of programs, used to define, construct and manipulate the data.
Relational Database Management System: An RDBMS is a software system used to maintain relational databases. Many relational database systems have an option of using SQL.
NoSQL:
• A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for big data and real-time web apps; for example, companies like Twitter, Facebook and Google collect terabytes of user data every single day.
• NoSQL stands for 'Not Only SQL' or 'Not SQL'. Though a better term would be 'NoREL', NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
• A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.

Why NoSQL?
• The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc., who deal with huge volumes of data. The system response time becomes slow when you use an RDBMS for massive volumes of data.
• To resolve this problem, we could 'scale up' our systems by upgrading our existing hardware, but this process is expensive. The alternative is to distribute the database load over multiple hosts whenever the load increases. This method is known as 'scaling out'.

Types of NoSQL Databases:

• Document-oriented: JSON documents (MongoDB and CouchDB)
• Key-value: Redis and DynamoDB
• Wide-column: Cassandra and HBase
• Graph: Neo4j and Amazon Neptune
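As a conceptual sketch (plain Python, not a real database client), a document store can be modeled as a collection of schema-free dictionaries keyed by id, which is why NoSQL documents need no fixed schema:

```python
# A toy in-memory "document store": each document is a free-form dict,
# so two documents in the same collection may have entirely different fields.
users = {}

def insert(doc_id, document):
    users[doc_id] = document

insert("u1", {"name": "Asha", "city": "Hyderabad"})
insert("u2", {"name": "Ravi", "followers": 1200})  # different fields: no fixed schema

# A simple "query": find all documents that have a given field
with_followers = [d for d in users.values() if "followers" in d]
print(len(with_followers))  # 1
```

A real document database (e.g. MongoDB) adds persistence, indexing and a query language on top of this basic idea.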
Relational Databases (SQL)      Non-relational Databases (NoSQL)
Oracle                          MongoDB
MySQL                           CouchDB
SQL Server                      BigTable

SQL vs NoSQL DB:

SQL                                                 NoSQL
Relational database management system (RDBMS)       Non-relational or distributed database system
Fixed, static or predefined schema                  Dynamic schema
Not suited for hierarchical data storage            Best suited for hierarchical data storage
Best suited for complex queries                     Not so good for complex queries
Vertically scalable                                 Horizontally scalable
Follows the ACID properties                         Follows CAP (consistency, availability, partition tolerance)


Differences between SQL and NoSQL

The comparison below summarizes the main differences between SQL and NoSQL databases.

Data Storage Model
  SQL: tables with fixed rows and columns.
  NoSQL: document: JSON documents; key-value: key-value pairs; wide-column: tables with rows and dynamic columns; graph: nodes and edges.

Development History
  SQL: developed in the 1970s with a focus on reducing data duplication.
  NoSQL: developed in the late 2000s with a focus on scaling and allowing for rapid application change driven by agile and DevOps practices.

Examples
  SQL: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL.
  NoSQL: document: MongoDB and CouchDB; key-value: Redis and DynamoDB; wide-column: Cassandra and HBase; graph: Neo4j and Amazon Neptune.

Primary Purpose
  SQL: general purpose.
  NoSQL: document: general purpose; key-value: large amounts of data with simple lookup queries; wide-column: large amounts of data with predictable query patterns; graph: analyzing and traversing relationships between connected data.

Schemas
  SQL: rigid.
  NoSQL: flexible.

Scaling
  SQL: vertical (scale up with a larger server).
  NoSQL: horizontal (scale out across commodity servers).

Multi-Record ACID Transactions
  SQL: supported.
  NoSQL: most do not support multi-record ACID transactions; however, some, like MongoDB, do.

Joins
  SQL: typically required.
  NoSQL: typically not required.

Data to Object Mapping
  SQL: requires ORM (object-relational mapping).
  NoSQL: many do not require ORMs; MongoDB documents map directly to data structures in most popular programming languages.

Benefits of NoSQL
➢ The NoSQL data model addresses several issues that the relational model is not designed to address:
➢ Large volumes of structured, semi-structured, and unstructured data.
➢ Object-oriented programming that is easy to use and flexible.
➢ An efficient, scale-out architecture instead of an expensive, monolithic architecture.

Variables:
➢ Data consist of individuals and variables that give us information about those individuals. An individual can be an object or a person.
➢ A variable is an attribute, such as a measurement or a label.
➢ There are two types of data:
➢ Quantitative data (numerical)
➢ Categorical data

➢ Quantitative variables: quantitative data contain numerical values that can be added, subtracted, divided, etc.
There are two types of quantitative variables: discrete and continuous.


Discrete vs continuous variables

Discrete variables: counts of individual items or values. Examples: number of students in a class; number of different tree species in a forest.
Continuous variables: measurements of continuous or non-finite values. Examples: distance, volume, age.

Categorical variables: categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things.
There are three types of categorical variables: binary, nominal, and ordinal variables.

Binary variables: yes/no outcomes. Examples: heads/tails in a coin flip; win/lose in a football game.
Nominal variables: groups with no rank or order between them. Examples: colors, brands, ZIP code.
Ordinal variables: groups that are ranked in a specific order. Examples: finishing place in a race; rating-scale responses in a survey.

Missing Imputations:
Imputation is the process of replacing missing data with substituted values.

Types of missing data
Missing data can be classified into one of three categories:
1. MCAR
Data which is Missing Completely At Random has nothing systematic about which observations are missing values. There is no relationship between missingness and either observed or unobserved covariates.

2. MAR
Missing At Random is weaker than MCAR. The missingness is still random, but due entirely to observed variables. For example, those from a lower socioeconomic status may be less willing to provide salary information (but we know their SES status). The key is that the missingness is not due to the values which are not observed. MCAR implies MAR, but not vice versa.

3. MNAR
If the data are Missing Not At Random, then the missingness depends on the values of the missing data. Censored data fall into this category. For example, individuals who are heavier are less likely to report their weight. As another example, a device measuring some response may only be able to measure values above 0.5; anything below that is missing.
There can be two types of gaps in data:
1. Missing data imputation
2. Model-based techniques

Imputations: (Treatment of Missing Values)
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like 'Unknown' or -∞. If missing values are replaced by, say, 'Unknown', then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of 'Unknown'. Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Take the average value of that particular attribute and use this value to replace the missing values in that attribute column.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit-risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
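Methods 4 and 5 above can be sketched in plain Python; the data values and names (incomes, credit-risk classes) below are made up for illustration:

```python
from statistics import mean

# Method 4: fill missing values (None) with the overall attribute mean
incomes = [30000, None, 45000, None, 60000]
observed = [v for v in incomes if v is not None]
filled = [v if v is not None else mean(observed) for v in incomes]
print(filled[1])  # the mean of the observed incomes (45000)

# Method 5: fill with the mean of the same class (here, a credit-risk category)
rows = [("low", 50000), ("low", None), ("high", 20000), ("high", 30000)]
class_values = {}
for risk, income in rows:
    if income is not None:
        class_values.setdefault(risk, []).append(income)
class_means = {risk: mean(vals) for risk, vals in class_values.items()}
imputed = [(r, i if i is not None else class_means[r]) for r, i in rows]
print(imputed[1])  # ('low', 50000): imputed from the "low" class mean
```

Class-wise imputation (method 5) typically distorts the data less than a single global mean when the classes have very different distributions.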

Need for Business Modelling:
The main need for business modelling is that companies that embrace big data analytics and transform their business models in parallel will create new opportunities for revenue streams, customers, products and services. Having a big data strategy and vision helps identify and capitalize on new opportunities.

Analytics Applications to Various Business Domains
Application of Modelling in Business:
• Applications of data modelling can be termed business analytics.
• Business analytics involves the collating, sorting, processing, and studying of business-related data using statistical models and iterative methodologies. The goal of BA is to narrow down which data sets are useful and which can increase revenue, productivity, and efficiency.
• Business analytics (BA) is the combination of skills, technologies, and practices used to examine an organization's data and performance as a way to gain insights and make data-driven decisions in the future using statistical analysis.


Although business analytics is being leveraged in most commercial sectors and


industries, the following applications are the most common.
1. CreditCardCompanies
Credit and debit cards are an everyday part of consumer spending, and they
are an ideal way of gathering information about a purchaser’s spending
habits, financial situation, behavior trends, demographics, and lifestyle
preferences.
2. CustomerRelationshipManagement(CRM)
Excellentcustomerrelationsiscriticalforanycompanythatwantstoretaincustomerl
oyalty
tostayinbusinessforthelonghaul.CRMsystemsanalyzeimportantperformanceindi
cators such as demographics, buying patterns, socio-economic information,
and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract
insights that
helporganizationsmaneuvertheirwaythroughtrickyterrain.Corporationsturntobu
siness analysts to optimize budgeting, banking, financial planning, forecasting,
and portfolio management.
4. HumanResources
Business analysts help the process by pouring through data that characterizes
high performing candidates, such as educational background, attrition rate, the
average length of employment, etc. By working with this information, business
analysts help HR by forecasting the best fits between the company and
candidates.
5. Manufacturing
Business analysts work with data to help stakeholders understand the things
that affect operations and the bottom line. Identifying things like equipment
downtime, inventory
levels,andmaintenancecostshelpcompaniesstreamlineinventorymanagement,ri
sks,and supply-chain management to create maximum efficiency.
6. Marketing
Businessanalystshelpanswerthesequestionsandsomanymore,bymeasuringmark
eting and advertising metrics, identifying consumer behavior and the target
audience, and analyzing market trends.

Data Modelling Techniques in Data Analytics:
What is Data Modelling?
• Data modelling is the process of analyzing data objects and their relationships to other objects. It is used to analyze the data requirements that are needed for the business processes. Data models are created for the data to be stored in a database.
• The data model's main focus is on what data is needed and how we have to organize it, rather than what operations we have to perform.
• A data model is basically an architect's building plan: a process of documenting a complex software system design as a diagram that can be easily understood.

Uses of Data Modelling:
• Data modelling helps create a robust design with a data model that can show an organization's entire data on the same platform.
• The database at the logical, physical, and conceptual levels can be designed with the help of the data model.
• Data modelling tools help in the improvement of data quality.
• Redundant data and missing data can be identified with the help of data models.
• Building a data model is quite time-consuming, but it makes maintenance cheaper and faster.

Data Modelling Techniques:

Given below are 5 different types of techniques used to organize the data:
1. Hierarchical Technique
The hierarchical model is a tree-like structure. There is one root node, or we can say one parent node, and the other child nodes are sorted in a particular order. The hierarchical model is very rarely used now. This model can be used for real-world model relationships.

2. Object-oriented Model
The object-oriented approach is the creation of objects that contain stored values. The object-oriented model communicates while supporting data abstraction, inheritance, and encapsulation.
3. Network Technique
The network model provides us with a flexible way of representing objects and the relationships between these entities. It has a feature known as a schema, representing the data in the form of a graph. An object is represented inside a node and the relation between them as an edge, enabling them to maintain multiple parent and child records in a generalized manner.
4. Entity-relationship Model
The ER model (entity-relationship model) is a high-level relational model which is used to define data elements and relationships for the entities in a system. This conceptual design provides a better view of the data that is easy to understand. In this model, the entire database is represented in a diagram called an entity-relationship diagram, consisting of entities, attributes, and relationships.
5. Relational Technique
The relational technique is used to describe the different relationships between the entities. There are different sets of relations between the entities, such as one to one, one to many, many to one, and many to many.

*** End of Unit-2 ***

3. UNIT 3
Regression – Concepts:
Introduction:
▪ The term regression is used to indicate the estimation or prediction of the average value of one variable for a specified value of another variable.
▪ Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.

"Regression analysis is a statistical process for estimating the relationships between the dependent variables (criterion variables / response variables) and one or more independent variables (predictor variables)."
▪ Regression describes how an independent variable is numerically related to the dependent variable.
▪ Regression can be used for prediction, estimation and hypothesis testing, and for modeling causal relationships.

When is Regression chosen?
▪ A regression problem is when the output variable is a real or continuous value, such as "salary" or "weight".
▪ Many different models can be used; the simplest is linear regression. It tries to fit the data with the best hyperplane which goes through the points.
▪ Mathematically, a linear relationship represents a straight line when plotted as a graph.
▪ A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
Types of Regression Analysis Techniques:
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Bayesian Linear Regression

Advantages & Limitations:
• Fast and easy to model, and particularly useful when the relationship to be modeled is not extremely complex and you don't have a lot of data.
• Very intuitive to understand and interpret.
• Linear regression is very sensitive to outliers.
Linear Regression:
• Linear regression is a very simple method but has proven to be very useful for a large number of situations.
• When we have a single input attribute (x) and we want to use linear regression, this is called simple linear regression.
• In simple linear regression we want to model our data as follows:
      y = B0 + B1 * x
• x and y are known, and B0 and B1 are the coefficients that we need to estimate; they move the line around.
• Simple regression is great because, rather than having to search for values by trial and error or calculate them analytically using more advanced linear algebra, we can estimate them directly from our data.
OLS Regression:
Linear Regression using the Ordinary Least Squares (OLS) approximation, based on the Gauss-Markov theorem.
We can start off by estimating the values for B1 and B0 as:

      B1 = sum_{i=1..n} (xi - mean(x)) * (yi - mean(y))  /  sum_{i=1..n} (xi - mean(x))^2

      B0 = mean(y) - B1 * mean(x)

• If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear regression. The procedure for simple linear regression is different and simpler than that for multiple linear regression.
Let us consider the following example, for the equation y = 2*x + 3:

x     y     xi-mean(x)   yi-mean(y)   (xi-mean(x))*(yi-mean(y))   (xi-mean(x))^2
-3    -3    -4.4         -8.8         38.72                       19.36
-1     1    -2.4         -4.8         11.52                        5.76
 2     7     0.6          1.2          0.72                        0.36
 4    11     2.6          5.2         13.52                        6.76
 5    13     3.6          7.2         25.92                       12.96
                                      Sum = 90.4                  Sum = 45.2

Mean(x) = 1.4 and Mean(y) = 5.8

      B1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2) = 90.4 / 45.2 = 2
      B0 = mean(y) - B1 * mean(x) = 5.8 - 2 * 1.4 = 3

From the above formulas we find that B1 = 2 and B0 = 3.
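The same least-squares arithmetic can be sketched in plain Python, mirroring the formulas above:

```python
# Ordinary least squares for simple linear regression:
# B1 = sum((xi - mx)*(yi - my)) / sum((xi - mx)^2),  B0 = my - B1*mx
x = [-3, -1, 2, 4, 5]
y = [-3, 1, 7, 11, 13]   # generated by y = 2*x + 3

mx = sum(x) / len(x)     # mean(x) = 1.4
my = sum(y) / len(y)     # mean(y) = 5.8

num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # 90.4
den = sum((xi - mx) ** 2 for xi in x)                     # 45.2
B1 = num / den           # slope, approximately 2
B0 = my - B1 * mx        # intercept, approximately 3
print(round(B1, 4), round(B0, 4))
```

The estimates recover the coefficients of the generating line exactly (up to floating-point rounding) because the data lie perfectly on it.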
Example for Linear Regression using R:
Consider the following data set:
x = {1, 2, 4, 3, 5} and y = {1, 3, 3, 2, 5}
We use R to apply linear regression to the above data.
> rm(list=ls())          # removes the list of variables in the current session of R
> x <- c(1,2,4,3,5)      # assigns values to x
> y <- c(1,3,3,2,5)      # assigns values to y
> x; y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off()         # to clear the existing plot/s
> plot(x, y, pch=16, col="red")
> relxy <- lm(y ~ x)
> relxy

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
        0.4          0.8

> abline(relxy, col="Blue")
> a <- data.frame(x=7)
> a
  x
1 7
> result <- predict(relxy, a)
> print(result)
1
6
> # Note: you can observe that
> 0.8*7 + 0.4
[1] 6                    # the same value calculated using the line equation y = 0.8*x + 0.4
Simple linear regression is the simplest form of regression and the most studied.
Calculating B1 & B0 using Correlations and Standard Deviations:

      B1 = corr(x, y) * stdev(y) / stdev(x)
      B0 = mean(y) - B1 * mean(x)

where corr(x, y) is the correlation between x and y, and stdev() is the calculation of the standard deviation for a variable. The same is calculated in R as follows:
> x <- c(1,2,4,3,5)
> y <- c(1,3,3,2,5)
> x; y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> B1 = cor(x,y) * sd(y) / sd(x)
> B1
[1] 0.8
> B0 = mean(y) - B1 * mean(x)
> B0
[1] 0.4
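The same correlation-based shortcut can be checked in plain Python, computing the sample correlation and standard deviations by hand (R's cor() and sd() use the same sample, n-1 denominator, definitions):

```python
from math import sqrt

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# sample standard deviations (n - 1 denominator, as R's sd() uses)
sdx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sdy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))

# sample Pearson correlation coefficient
corr = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ((n - 1) * sdx * sdy)

B1 = corr * sdy / sdx    # approximately 0.8
B0 = my - B1 * mx        # approximately 0.4
print(round(B1, 4), round(B0, 4))
```

Note that the standard deviations cancel so that B1 reduces to the covariance-over-variance formula used earlier; both routes give the same slope.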

Estimating Error: (RMSE: Root Mean Squared Error)
We can calculate the error for our predictions, called the Root Mean Squared Error or RMSE:

      RMSE = sqrt( sum_{i=1..n} (pi - yi)^2 / n )

where pi is the predicted value and yi is the actual value, i is the index for a specific instance, and n is the number of predictions, because we must calculate the error across all predicted values.

Estimating the error for y = 0.8*x + 0.4:

x    y (actual)   p (predicted)   p - y    (p - y)^2
1    1            1.2              0.2     0.04
2    3            2.0             -1.0     1.00
4    3            3.6              0.6     0.36
3    2            2.8              0.8     0.64
5    5            4.4             -0.6     0.36

sum of (p - y)^2 = 2.4
s/n = 2.4 / 5 = 0.48
RMSE = sqrt(0.48) ≈ 0.693
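The same computation can be sketched in plain Python:

```python
from math import sqrt

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]                      # actual values
p = [0.8 * xi + 0.4 for xi in x]         # predictions from y = 0.8*x + 0.4

# RMSE = square root of the mean of squared prediction errors
rmse = sqrt(sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y))
print(round(rmse, 3))  # 0.693
```

Because the errors are squared before averaging, RMSE penalizes a few large errors more heavily than many small ones.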
Properties and Assumptions of the OLS approximation:
1. Unbiasedness:
   i. The bias of an estimator is defined as the difference between its expected value and the true value, i.e., e(y) = y_actual - y_predicted.
   ii. If the bias is zero, then the estimator is unbiased.
   iii. Unbiasedness is important only when it is combined with small variance.
2. Least Variance:
   i. An estimator is best when it has the smallest or least variance.
   ii. The least-variance property is more important when it is combined with small bias.
3. Efficient Estimator:
   i. An estimator is said to be efficient when it fulfills both conditions:
   ii. The estimator should be unbiased and have least variance.
4. Best Linear Unbiased Estimator (BLUE properties):
   i. An estimator is said to be BLUE when it fulfills the above properties.
   ii. An estimator is BLUE if it is an unbiased, least-variance, linear estimator.
5. Minimum Mean Square Error (MSE):
   i. An estimator is said to be an MSE estimator if it has the smallest mean square error.
   ii. There is less difference between the estimated value and the true value.
6. Sufficient Estimator:
   i. An estimator is sufficient if it utilizes all the information of a sample about the true parameter.
   ii. It must use all the observations of the sample.
Assumptions of OLS Regression:
1. There is random sampling of observations.
2. The conditional mean should be zero.
3. There is homoscedasticity and no auto-correlation.
4. Error terms should be normally distributed (optional).
5. The OLS estimates of the simple linear regression equation are based on
      y = B0 + B1*x + µ    (µ -> error)
6. The above equation is based on the following assumptions:
   a. Randomness of µ
   b. Mean of µ is zero
   c. Variance of µ is constant
   d. The variance of µ has a normal distribution
   e. Errors µ of different observations are independent.
Homoscedasticity vs Heteroscedasticity:

• The assumption of homoscedasticity (meaning "same variance") is central to linear regression models. Homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
• Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable.
• The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases.
• Homoscedasticity means "having the same scatter". For it to exist in a set of data, the points must be about the same distance from the regression line.
• The opposite is heteroscedasticity ("different scatter"), where points are at widely varying distances from the regression line.
Variable Rationalization:
• The data set may have a large number of attributes, but some of those attributes can be irrelevant or redundant. The goal of variable rationalization is to improve the data processing in an optimal way through attribute subset selection.
• This process finds a minimum set of attributes such that dropping the irrelevant attributes does not much affect the utility of the data, while the cost of data analysis is reduced.
• Mining on a reduced data set also makes the discovered pattern easier to understand. As part of data processing, we use the below methods of attribute subset selection:
1. Stepwise Forward Selection
2. Stepwise Backward Elimination
3. Combination of Forward Selection and Backward Elimination
4. Decision Tree Induction
All the above methods are greedy approaches to attribute subset selection.
1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attributes are chosen (having minimum p-value) and are added to the minimal set. In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, one attribute is eliminated from the set of attributes, namely the one whose p-value is higher than the significance level.
3. Combination of Forward Selection and Backward Elimination: The stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most common technique, which is generally used for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure with nodes denoting a test on an attribute. Each branch corresponds to the outcome of a test, and leaf nodes give a class prediction. An attribute that is not part of the tree is considered irrelevant and hence discarded.
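Stepwise forward selection can be sketched minimally in plain Python. This is illustrative only: real implementations score each candidate by the p-value from a fitted model, whereas this toy version scores candidates by their absolute correlation with the target; the feature names ("hours", "noise") are made up:

```python
from math import sqrt

def correlation(a, b):
    # Pearson correlation between two equal-length numeric lists
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sqrt(sum((x - ma) ** 2 for x in a))
    vb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def forward_select(features, target, k):
    # Start from an empty set; in each iteration greedily add the
    # most relevant remaining attribute until k attributes are chosen.
    selected = []
    remaining = dict(features)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda name: abs(correlation(remaining[name], target)))
        selected.append(best)
        del remaining[best]
    return selected

# toy data: "hours" tracks the target almost perfectly, "noise" does not
features = {"hours": [1, 2, 3, 4, 5], "noise": [2, 1, 2, 1, 2]}
target = [10, 20, 31, 39, 50]
print(forward_select(features, target, 1))  # ['hours']
```

Backward elimination is the mirror image: start with all attributes and repeatedly drop the least relevant one.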

Model Building Life Cycle in Data Analytics:
When we come across a business analytical problem, we proceed towards the execution without acknowledging the stumbling blocks. Before realizing the misfortunes, we try to implement and predict the outcomes. The problem-solving steps involved make up the data science model-building life cycle.
Let's understand every model-building step in depth.
The data science model-building life cycle includes some important steps to follow. The following are the steps to follow to build a data model:

1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment

1. Problem Definition
• The first step in constructing a model is to understand the industrial problem in a more comprehensive way. To identify the purpose of the problem and the prediction target, we must define the project objectives appropriately.
• Therefore, to proceed with an analytical approach, we have to recognize the obstacles first. Remember, excellent results always depend on a better understanding of the problem.

2. Hypothesis Generation
• Hypothesis generation is the guessing approach through which we derive some essential data parameters that have a significant correlation with the prediction target.
• Your hypothesis research must be in-depth, taking the perspective of every stakeholder into account. We search for every suitable factor that can influence the outcome.
• Hypothesis generation focuses on what you can create rather than what is available in the dataset.

3. Data Collection
• Data collection is gathering data from relevant sources regarding the analytical problem; we then extract meaningful insights from the data for prediction.

The data gathered must have:
• Proficiency in answering hypothesis questions.
• Capacity to elaborate on every data parameter.
• Effectiveness to justify your research.
• Competency to predict outcomes accurately.

4. Data Exploration/Transformation
• The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary features, null values, unanticipated small values, or immense values. So, before applying any algorithmic model to data, we have to explore it first.
• By inspecting the data, we get to understand the explicit and hidden trends in the data. We find the relation between data features and the target variable.
• Usually, a data scientist invests 60–70% of project time dealing with data exploration only.
• There are several sub-steps involved in data exploration:
o Feature Identification:
▪ You need to analyze which data features are available and which ones are not.
▪ Identify independent and target variables.
▪ Identify data types and categories of these variables.
o Univariate Analysis:
▪ We inspect each variable one by one. This kind of analysis depends on whether the variable type is categorical or continuous.
• Continuous variable: We mainly look for statistical trends like mean, median, standard deviation, skewness, and many more in the dataset.
• Categorical variable: We use a frequency table to understand the spread of data for each category. We can measure the counts and frequency of occurrence of values.
o Multi-variate Analysis:
▪ Bi-variate and multi-variate analysis help to discover the relation between two or more variables.
▪ We can find the correlation in the case of continuous variables; in the case of categorical variables, we look for association and dissociation between them.
o Filling Null Values:
▪ Usually, the dataset contains null values, which lower the potential of the model. With a continuous variable, we fill these null values using the mean or median of that specific column. For the null values present in a categorical column, we replace them with the most frequently occurring categorical value. Remember, don't delete those rows, because you may lose information.
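The filling rules above can be sketched in plain Python (an illustrative example; real projects would typically use a library such as pandas for this):

```python
# Filling nulls as described above: mean for a continuous column,
# most frequent value for a categorical column (a minimal sketch).
from statistics import mean
from collections import Counter

def fill_continuous(values):
    observed = [v for v in values if v is not None]
    fill = mean(observed)                       # mean of the non-null entries
    return [fill if v is None else v for v in values]

def fill_categorical(values):
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]   # most frequent category
    return [fill if v is None else v for v in values]

print(fill_continuous([10, None, 20, 30]))        # None -> 20 (the mean)
print(fill_categorical(["a", "b", None, "a"]))    # None -> "a" (most frequent)
```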
5. Predictive Modeling
• Predictive modeling is a mathematical approach to create a statistical model to forecast future behavior based on input test data.
Steps involved in predictive modeling:
• Algorithm Selection:
o When we have a structured dataset and we want to estimate a continuous or categorical outcome, we use supervised machine learning methodologies like regression and classification techniques. When we have unstructured data and want to predict the clusters of items to which a particular input test sample belongs, we use unsupervised algorithms. An actual data scientist applies multiple algorithms to get a more accurate model.
• Train Model:
o After assigning the algorithm and getting the data handy, we train our model using the input data, applying the preferred algorithm. It is an action to determine the correspondence between independent variables and the prediction targets.
• Model Prediction:
o We make predictions by giving the input test data to the trained model. We measure the accuracy by using a cross-validation strategy or ROC curve, which performs well to derive model output for test data.

6. Model Deployment
• There is nothing better than deploying the model in a real-time environment. It helps us to gain analytical insights into the decision-making procedure. You constantly need to update the model with additional features for customer satisfaction.
• To predict business decisions, plan market strategies, and create personalized customer interests, we integrate the machine learning model into the existing production domain.
• When you go through the Amazon website, you notice product recommendations based entirely on your interests. You can experience the increase in customer involvement utilizing these services. That's how a deployed model changes the mindset of the customer and convinces him to purchase the product.

Key Takeaways

SUMMARY OF DATA MODEL LIFE CYCLE:
• Understand the purpose of the business analytical problem.
• Generate hypotheses before looking at data.
• Collect reliable data from well-known resources.
• Invest most of the time in data exploration to extract meaningful insights from the data.
• Choose the signature algorithm to train the model and use test data to evaluate.
• Deploy the model into the production environment so it will be available to users, and strategize to make business decisions effectively.
Logistic Regression:
Model Theory, Model Fit Statistics, Model Construction
Introduction:
• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
• The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something, such as whether or not cells are cancerous, or whether a mouse is obese based on its weight, etc.
• Logistic regression uses the concept of predictive modeling as regression; therefore, it is called logistic regression. But it is used to classify samples; therefore, it falls under the classification algorithms.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
Types of Logistic Regression:
On the basis of the categories, logistic regression can be classified into three types:
• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "Low", "Medium", or "High".
Definition: Multicollinearity:
• Multicollinearity is a statistical phenomenon in which multiple independent variables show high correlation with each other; they are too inter-related.
• Multicollinearity is also called collinearity, and it is an undesired situation for any statistical regression model, since it diminishes the reliability of the model itself.
• If two or more independent variables are too correlated, the results obtained from the regression will be disturbed, because the independent variables are actually dependent on each other.
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variables should not have multicollinearity.
Logistic Regression Equation:
• The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below:
• Logistic regression uses a more complex cost function; this cost function can be defined as the 'Sigmoid function', also known as the 'logistic function', instead of a linear function.
• The hypothesis of logistic regression tends to limit the cost function between 0 and 1. Therefore linear functions fail to represent it, as they can have a value greater than 1 or less than 0, which is not possible as per the hypothesis of logistic regression:

0 ≤ h(x) ≤ 1   --- Logistic Regression Hypothesis Expectation

Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• The sigmoid function maps any real value into another value within a range of 0 and 1, and so forms an S-shaped curve.
• The value of the logistic regression must be between 0 and 1, and cannot go beyond this limit, so it forms a curve like the "S" form.
• The below image is showing the logistic function:

Fig: Sigmoid Function Graph

The sigmoid function can be interpreted as a probability pointing to Class-1 or Class-0. So the regression model makes predictions as:

z = sigmoid(y) = σ(y) = 1 / (1 + e^(-y))
Hypothesis Representation
• When using linear regression, we used a formula for the line equation as:

y = b0 + b1x1 + b2x2 + ... + bnxn

• In the above equation, x1, x2, ..., xn are the predictor variables, y is the response variable, and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.

• For logistic regression, we need the maximum likelihood hypothesis h(y).

• Apply the sigmoid function on y as:

z = σ(y) = σ(b0 + b1x1 + b2x2 + ... + bnxn)

z = σ(y) = 1 / (1 + e^(-(b0 + b1x1 + b2x2 + ... + bnxn)))
Example for Sigmoid Function in R:

Business Analytics Dept. of CSE-AIML

> # Example for Sigmoid Function
> y <- c(-10:10); y
 [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9  10
> z <- 1/(1+exp(-y)); z
 [1] 4.539787e-05 1.233946e-04 3.353501e-04 9.110512e-04 2.472623e-03 6.692851e-03 1.798621e-02 4.742587e-02
 [9] 1.192029e-01 2.689414e-01 5.000000e-01 7.310586e-01 8.807971e-01 9.525741e-01 9.820138e-01 9.933071e-01
[17] 9.975274e-01 9.990889e-01 9.996646e-01 9.998766e-01 9.999546e-01
> plot(y, z)

> rm(list = ls())
> attach(mtcars)   # attaching a dataset into the R environment
> input <- mtcars[, c("mpg", "disp", "hp", "wt")]
> head(input)

                   mpg disp  hp    wt
Mazda RX4         21.0  160 110 2.620
Mazda RX4 Wag     21.0  160 110 2.875
Datsun 710        22.8  108  93 2.320
Hornet 4 Drive    21.4  258 110 3.215
Hornet Sportabout 18.7  360 175 3.440
Valiant           18.1  225 105 3.460
> # model <- lm(mpg ~ disp + hp + wt); model   # Show the model
> model <- glm(mpg ~ disp + hp + wt); model

Call:  glm(formula = mpg ~ disp + hp + wt)

Coefficients:
(Intercept)       disp         hp         wt
  37.105505  -0.000937  -0.031157  -3.800891

Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
Null Deviance:     1126
Residual Deviance: 195    AIC: 158.6
> newx <- data.frame(disp = 150, hp = 150, wt = 4)   # new input for prediction
> predict(model, newx)
       1
17.08791
> 37.15 + (-0.000937)*150 + (-0.0311)*150 + (-3.8008)*4   # checking with the data newx
[1] 17.14125
Malla Reddy Engineering College For Women (Autonomous Institution - UGC, Govt. of India)   Page 72
> y <- input[, c("mpg")]; y
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
> z = 1/(1+exp(-y)); z
 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999994 1.0000000
 [9] 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 0.9999696 0.9999696
[17] 0.9999996 1.0000000 1.0000000 1.0000000 1.0000000 0.9999998 0.9999997 0.9999983
[25] 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 1.0000000
> plot(y, z)

Confusion Matrix (or) Error Matrix (or) Contingency Table:
What is a Confusion Matrix?
"A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making. It is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix)."
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

Let's decipher the matrix:

• The target variable has two values: Positive or Negative
• The columns represent the actual values of the target variable
• The rows represent the predicted values of the target variable

• True Positive
• True Negative
• False Positive – Type 1 Error
• False Negative – Type 2 Error
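Counting these four cells from actual and predicted labels is straightforward; here is a minimal Python sketch for the binary case (1 = positive, 0 = negative), using made-up labels:

```python
# Counting the four confusion-matrix cells from actual vs. predicted
# labels for a binary problem (1 = positive, 0 = negative).

def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```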

Why do we need a Confusion Matrix?
• Precision vs Recall
• F1-score


Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
• The predicted value matches the actual value
• The actual value was positive and the model predicted a positive value
True Negative (TN)
• The predicted value matches the actual value
• The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
• The predicted value was falsely predicted
• The actual value was negative but the model predicted a positive value
• Also known as the Type 1 error
False Negative (FN) – Type 2 error
• The predicted value was falsely predicted
• The actual value was positive but the model predicted a negative value
• Also known as the Type 2 error
To evaluate the performance of a model, we have the performance metrics called Accuracy, Precision, Recall & F1-Score.
Accuracy:
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations.
• Accuracy is a great measure to understand whether the model is performing well.
• Accuracy is dependable only when you have symmetric datasets, where the counts of false positives and false negatives are almost the same.

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
It tells us how many of the predicted positive cases actually turned out to be positive.

Precision = TP / (TP + FP)

• Precision is a useful metric in cases where a False Positive is a higher concern than a False Negative.
• Precision is important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
Recall (Sensitivity):
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.

Recall = TP / (TP + FN)

• Recall is a useful metric in cases where a False Negative trumps a False Positive.
• Recall is important in medical cases, where it doesn't matter whether we raise a false alarm, but the actual positive cases should not go undetected!
F1-Score:
F1-score is the harmonic mean of Precision and Recall. It gives a combined idea about these two metrics. It is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account.

F1 Score = 2 / (1/Recall + 1/Precision) = 2 * (Precision * Recall) / (Precision + Recall)

• F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
• Accuracy works best if false positives and false negatives have similar cost.
• If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
• But there is a catch here. The interpretability of the F1-score is poor, meaning that we don't know what our classifier is maximizing – precision or recall? So we use it in combination with other evaluation metrics, which gives us a complete picture of the result.

Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the below confusion matrix:

The different values of the confusion matrix would be as follows:
• True Positive (TP) = 560: 560 positive-class data points were correctly classified by the model.
• True Negative (TN) = 330: 330 negative-class data points were correctly classified by the model.
• False Positive (FP) = 60: 60 negative-class data points were incorrectly classified as belonging to the positive class by the model.
• False Negative (FN) = 50: 50 positive-class data points were incorrectly classified as belonging to the negative class by the model.
This turned out to be a pretty decent classifier for our dataset, considering the relatively larger number of true positive and true negative values.
Precisely, we have the outcomes represented in the confusion matrix as:
TP = 560, TN = 330, FP = 60, FN = 50
Accuracy:
The accuracy for our model turns out to be:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Accuracy = (560 + 330) / (560 + 60 + 330 + 50) = 890 / 1000 = 0.89

Hence Accuracy is 89% ... Not bad!
Precision:
Precision tells us how many of the predicted positive cases actually turned out to be positive. This would determine whether our model is reliable or not.
Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
We can easily calculate Precision and Recall for our model by plugging the values into the above equations:

Precision = TP / (TP + FP) = 560 / (560 + 60) = 0.903

Recall = TP / (TP + FN) = 560 / (560 + 50) = 0.918

F1-Score:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

F1 Score = 2 * (0.903 * 0.918) / (0.903 + 0.918) = 2 * 0.8289 / 1.821 ≈ 0.910
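The metric formulas can be checked with a few lines of Python using the same counts (TP = 560, TN = 330, FP = 60, FN = 50):

```python
# Accuracy, precision, recall and F1 from confusion-matrix counts,
# applied to the worked example's numbers.

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(560, 330, 60, 50)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.89 0.903 0.918 0.911
```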

AUC (Area Under Curve) ROC (Receiver Operating Characteristics) Curves:
Performance measurement is an essential task in data model evaluation. The AUC-ROC curve is one of the most important evaluation metrics for checking any classification model's performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics). So when it comes to a classification problem, we can count on an AUC-ROC curve.
When we need to check or visualize the performance of a multi-class classification problem, we use the AUC (Area Under The Curve) ROC (Receiver Operating Characteristics) curve.

What is the AUC-ROC Curve?

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and no disease.

The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis.

TPR (True Positive Rate) is also called Recall or Sensitivity; FPR (False Positive Rate) equals 1 − Specificity.

ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

• True Positive Rate

• False Positive Rate

• True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

TPR = TP / (TP + FN)

False Positive Rate (FPR) is defined as follows:

FPR = FP / (FP + TN)

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.
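The threshold sweep can be sketched directly: for each threshold we binarize the predicted probabilities and compute an (FPR, TPR) point. The labels and scores below are made up for illustration.

```python
# Sweeping the classification threshold to get (FPR, TPR) points of an
# ROC curve from predicted probabilities (a minimal sketch).

def roc_points(actual, scores, thresholds):
    points = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, preds) if a == 1 and p == 1)
        fn = sum(1 for a, p in zip(actual, preds) if a == 1 and p == 0)
        fp = sum(1 for a, p in zip(actual, preds) if a == 0 and p == 1)
        tn = sum(1 for a, p in zip(actual, preds) if a == 0 and p == 0)
        tpr = tp / (tp + fn)          # recall / sensitivity
        fpr = fp / (fp + tn)          # 1 - specificity
        points.append((fpr, tpr))
    return points

y = [1, 1, 0, 0]
p = [0.9, 0.6, 0.4, 0.1]
print(roc_points(y, p, [0.8, 0.5, 0.2]))
# lowering the threshold moves up and right: [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0)]
```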

Analytics Applications to Various Business Domains:
Application of Modelling in Business:
• Applications of data modelling can be termed as business analytics.
• Business analytics involves the collating, sorting, processing, and studying of business-related data using statistical models and iterative methodologies. The goal of BA is to narrow down which datasets are useful and which can increase revenue, productivity, and efficiency.
• Business analytics (BA) is the combination of skills, technologies, and practices used to examine an organization's data and performance as a way to gain insights and make data-driven decisions in the future using statistical analysis.

Although business analytics is being leveraged in most commercial sectors and industries, the following applications are the most common.
1. Credit Card Companies
Credit and debit cards are an everyday part of consumer spending, and they are an ideal way of gathering information about a purchaser's spending habits, financial situation, behaviour trends, demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations is critical for any company that wants to retain customer loyalty to stay in business for the long haul. CRM systems analyze important performance indicators such as demographics, buying patterns, socio-economic information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract insights that help organizations maneuver their way through tricky terrain. Corporations turn to business analysts to optimize budgeting, banking, financial planning, forecasting, and portfolio management.
4. Human Resources
Business analysts help the process by poring through data that characterizes high-performing candidates, such as educational background, attrition rate, the average length of employment, etc. By working with this information, business analysts help HR by forecasting the best fits between the company and candidates.

5. Manufacturing
Business analysts work with data to help stakeholders understand the things that affect operations and the bottom line. Identifying things like equipment downtime, inventory levels, and maintenance costs helps companies streamline inventory management, risks, and supply-chain management to create maximum efficiency.
6. Marketing
Business analysts help answer these questions and so many more, by measuring marketing and advertising metrics, identifying consumer behaviour and the target audience, and analyzing market trends.

*** End of Unit-3 ***


Add-ons for Unit-3

TO BE DISCUSSED:
Receiver Operating Characteristics:
ROC & AUC

Derivation for Logistic Regression:
The logistic regression model assumes that the log-odds of an observation y can be expressed as a linear function of the K input variables x:

ln( P(x) / (1 − P(x)) ) = b0x0 + b1x1 + ... + bKxK

Here, we add the constant term b0 by setting x0 = 1. This gives us K + 1 parameters. The left-hand side of the above equation is called the logit of P (hence, the name logistic regression).

Let's take the exponent of both sides of the logit equation:

P(x) / (1 − P(x)) = exp(b0x0) exp(b1x1) ... exp(bKxK)

(Since ln(ab) = ln(a) + ln(b) and exp(a + b) = exp(a) exp(b).)

We can also invert the logit equation to get a new expression for P(x):

P(x) = 1 / (1 + exp(−z)),  where z = b0x0 + b1x1 + ... + bKxK

The righthand side of the topequation is thesigmoid of z, which maps the real
line to
theinterval(0,1),andisapproximatelylinearneartheorigin.Ausefulfactabout P( z)
is thatthe derivative P'(z) = P(z) (1 –P(z)). Here’s the derivation:

Later, we will want to take the gradient of P with respect to the set of
coefficients b, rather than z. Inthatcase, P'(z) = P(z) (1 –P(z))z‘, where‘
isthe gradient taken with respectto b.

UNIT 4
Supervised and Unsupervised Learning
Supervised Learning:
• Supervised learning is a machine learning method in which models are trained using labeled data. In supervised learning, models need to find the mapping function to map the input variable (X) with the output variable (Y).
• We find a relation between x & y, such that y = f(x).
• Supervised learning needs supervision to train the model, which is similar to how a student learns things in the presence of a teacher. Supervised learning can be used for two types of problems: Classification and Regression.
• Example: Suppose we have images of different types of fruits. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the image in supervised learning, we will give the input data as well as the output for it, which means we will train the model by the shape, size, color, and taste of each fruit. Once the training is completed, we will test the model by giving a new set of fruit. The model will identify the fruit and predict the output using a suitable algorithm.
Unsupervised Machine Learning:
• Unsupervised learning is another machine learning method in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns from the input data. Unsupervised learning does not need any supervision. Instead, it finds patterns from the data on its own.
• Unsupervised learning can be used for two types of problems: Clustering and Association.
• Example: To understand unsupervised learning, we will use the example given above. Unlike supervised learning, here we will not provide any supervision to the model. We will just provide the input dataset to the model and allow the model to find the patterns from the data. With the help of a suitable algorithm, the model will train itself and divide the fruits into different groups according to the most similar features between them.


The main differences between Supervised and Unsupervised learning are given below:

• Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
• A supervised learning model takes direct feedback to check if it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
• A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
• In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
• The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
• Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision to train the model.
• Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
• Supervised learning is used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning is used for cases where we have only input data and no corresponding output data.
• A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result as compared to supervised learning.
• Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns the way a child learns daily routine things from experience.
• Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, KNN (k-nearest neighbors), Decision Trees, Random Forest, Bayesian Logic, etc.; unsupervised learning includes K-means clustering, Hierarchical clustering, Anomaly detection, Neural Networks, Principal Component Analysis, Independent Component Analysis, Apriori algorithms, etc.
Segmentation
• Segmentation refers to the act of segmenting data according to your company's needs in order to refine your analyses based on a defined context. It is a technique of splitting customers into separate groups depending on their attributes or behavior.
• The purpose of segmentation is to better understand your customers (visitors), and to obtain actionable data in order to improve your website or mobile app. In concrete terms, a segment enables you to filter your analyses based on certain elements (single or combined).
• Segmentation can be done on elements related to a visit, as well as on elements related to multiple visits during a studied period.

Steps:
• Define purpose – Already mentioned in the statement above.
• Identify critical parameters – Some of the variables which come to mind are skill, motivation, vintage, department, education, etc. Let us say that, based on past experience, we know that skill and motivation are the most important parameters. Also, for the sake of simplicity, we just select 2 variables. Taking additional variables will increase the complexity, but can be done if it adds value.
• Granularity – Let us say we are able to classify both skill and motivation into High and Low using various techniques.
There are two broad sets of methodologies for segmentation:
• Objective (supervised) segmentation
• Non-objective (unsupervised) segmentation

Objective Segmentation
• Segmentation to identify the type of customers who would respond to a particular offer.
• Segmentation to identify high spenders among customers who will use the e-commerce channel for festive shopping.
• Segmentation to identify customers who will default on their credit obligation for a loan or credit card.

Non-Objective Segmentation

https://www.yieldify.com/blog/types-of-market-segmentation/
• Segmentation of the customer base to understand the specific profiles which exist within the customer base, so that multiple marketing actions can be personalized for each segment.
• Segmentation of geographies on the basis of affluence and lifestyle of people living in each geography, so that sales and distribution strategies can be formulated accordingly.
• Hence, it is critical that the segments created on the basis of an objective segmentation methodology must be different with respect to the stated objective (e.g. response to an offer).
• However, in the case of a non-objective methodology, the segments are different with respect to the "generic profile" of observations belonging to each segment, but not with regard to any specific outcome of interest.
• The most common techniques for building non-objective segmentation are cluster analysis, K nearest neighbor techniques, etc.

Regression Vs Segmentation
• Regression analysis focuses on finding a relationship between a dependent variable and one or more independent variables.
• It predicts the value of a dependent variable based on the value of at least one independent variable.
• It explains the impact of changes in an independent variable on the dependent variable.
• We use linear or logistic regression techniques for developing accurate models for predicting an outcome of interest.
• Often, we create separate models for separate segments.
• Segmentation methods such as CHAID or CRT are used to judge their effectiveness.


• Creating a separate model for each segment may be time-consuming and not
worth the effort. But creating separate models for separate segments may
provide higher predictive power.
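A minimal sketch of "separate models for separate segments" (the segment labels and the income/spend values are illustrative assumptions), fitting one simple linear regression per segment with numpy:

```python
# Fit one simple linear regression per customer segment using numpy.
import numpy as np

# Illustrative data: x = income, y = spend, seg = segment label
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0, 5.0, 6.0, 7.0])  # slope 2 in seg A, slope 1 in seg B
seg = np.array(["A", "A", "A", "B", "B", "B"])

models = {}
for s in np.unique(seg):
    mask = seg == s
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    models[s] = (slope, intercept)

print(models)  # separate coefficients per segment
```

Each segment gets its own slope and intercept, which is why segment-wise models can have higher predictive power than one pooled model.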

Decision Tree Classification Algorithm:

• A decision tree is a supervised learning technique that can be used for
both classification and regression problems, but it is mostly preferred
for solving classification problems.
• Decision trees usually mimic human thinking while making a decision,
so they are easy to understand.
• A decision tree simply asks a question and, based on the answer (Yes/No),
further splits the tree into subtrees.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules, and each leaf
node represents the outcome.
• In a decision tree, there are two kinds of nodes: the decision node and the
leaf node. Decision nodes are used to make a decision and have multiple
branches, whereas leaf nodes are the outputs of those decisions and do not
contain any further branches.
◼ Basic Decision Tree Learning Algorithm:
• Now that we know what a decision tree is, we will see how it works
internally. There are many algorithms that construct decision trees, but
one of the best known is the ID3 algorithm. ID3 stands for Iterative
Dichotomiser 3.

There are two main types of decision trees:
1. Classification trees (Yes/No types)
What we have seen above is an example of a classification tree, where the
outcome was a variable like 'fit' or 'unfit'. Here the decision variable is
categorical.
2. Regression trees (continuous data types)
Here the decision or outcome variable is continuous, e.g. a number like 123.

Decision Tree Terminologies


Root Node: The root node is where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be
segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
Branch/Sub-Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child Node: The root node of the tree is called the parent node, and
the other nodes are called child nodes.
Decision Tree Representation:
• Each non-leaf node is connected to a test that splits its set of possible
answers into subsets corresponding to different test results.
• Each branch carries a particular test result's subset to another node.
• Each node is connected to a set of possible answers.
• The diagram below explains the general structure of a decision tree:

• A decision tree is an arrangement of tests that provides an appropriate
classification at every step in an analysis.
• "In general, decision trees represent a disjunction of conjunctions of
constraints on the attribute-values of instances. Each path from the tree
root to a leaf corresponds to a conjunction of attribute tests, and the
tree itself to a disjunction of these conjunctions" (Mitchell, 1997, p. 53).
• More specifically, decision trees classify instances by sorting them down
the tree from the root node to some leaf node, which provides the
classification of the instance. Each node in the tree specifies a test of
some attribute of the instance, and each branch descending from that node
corresponds to one of the possible values of this attribute.
• An instance is classified by starting at the root node of the decision
tree, testing the attribute specified by this node, then moving down the
tree branch corresponding to the value of the attribute. This process is
then repeated at the node on this branch, and so on, until a leaf node is
reached.

Appropriate Problems for Decision Tree Learning
Decision tree learning is generally best suited to problems with the
following characteristics:
• Instances are represented by attribute-value pairs.
  o There is a finite list of attributes (e.g. hair colour) and each
    instance stores a value for that attribute (e.g. blonde).
  o When each attribute has a small number of distinct values (e.g.
    blonde, brown, red) it is easier for the decision tree to reach a
    useful solution.
  o The algorithm can be extended to handle real-valued attributes
    (e.g. a floating-point temperature).
• The target function has discrete output values.
  o A decision tree classifies each example as one of the output values.
    ▪ The simplest case exists when there are only two possible
      classes (Boolean classification).
    ▪ However, it is easy to extend the decision tree to produce a
      target function with more than two possible output values.
  o Although it is less common, the algorithm can also be extended to
    produce a target function with real-valued outputs.
• Disjunctive descriptions may be required.
  o Decision trees naturally represent disjunctive expressions.
• The training data may contain errors.
  o Errors in the classification of examples, or in the attribute values
    describing those examples, are handled well by decision trees,
    making them a robust learning method.
• The training data may contain missing attribute values.
  o Decision tree methods can be used even when some training examples
    have unknown values (e.g., humidity is known for only a fraction of
    the examples).


After a decision tree learns classification rules, it can also be
re-represented as a set of if-then rules in order to improve readability.
How does the Decision Tree algorithm work?
The choice of strategic splits heavily affects a tree's accuracy. The
decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide how to split a node into two
or more sub-nodes. The creation of sub-nodes increases the homogeneity of the
resultant sub-nodes. In other words, we can say that the purity of the node
increases with respect to the target variable. The decision tree splits the
nodes on all available variables and then selects the split which results in
the most homogeneous sub-nodes.

Tree Building: Decision tree learning is the construction of a decision tree
from class-labeled training tuples. A decision tree is a flow-chart-like
structure, where each internal (non-leaf) node denotes a test on an
attribute, each branch represents the outcome of a test, and each leaf (or
terminal) node holds a class label. The topmost node in a tree is the root
node. There are many specific decision-tree algorithms. Notable ones include
the following:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level
splits when computing classification trees)
MARS → (Multivariate Adaptive Regression Splines; extends decision trees to
handle numerical data better)
Conditional Inference Trees → (statistics-based approach that uses
non-parametric tests as splitting criteria, corrected for multiple testing to
avoid overfitting)

The ID3 algorithm builds decision trees using a top-down greedy search
through the space of possible branches, with no backtracking. A greedy
algorithm, as the name suggests, always makes the choice that seems to be
the best at that moment.

In a decision tree, to predict the class of a given record, the algorithm
starts from the root node of the tree. It compares the value of the root
attribute with the record's (real dataset's) attribute and, based on the
comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues the process until it reaches
a leaf node of the tree. The complete process can be better understood using
the algorithm below:
• Step 1: Begin the tree with the root node, say S, which contains the
complete dataset.
• Step 2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
• Step 3: Divide S into subsets that contain the possible values of the
best attribute.
• Step 4: Generate the decision tree node, which contains the best attribute.
• Step 5: Recursively make new decision trees using the subsets of the
dataset created in Step 3.
• Step 6: Continue this process until a stage is reached where the nodes
cannot be classified further; call the final node a leaf node.
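The steps above can be run end to end with scikit-learn, one of the tools listed later in these notes (the toy weather dataset below is an illustrative assumption):

```python
# Train a small decision tree classifier and inspect its learned rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: [outlook (0=sunny, 1=rain), humidity (0=normal, 1=high)] -> play?
X = [[0, 1], [0, 1], [1, 0], [1, 0], [0, 0], [1, 1]]
y = [0, 0, 1, 1, 1, 0]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=["outlook", "humidity"]))  # if-then rules
print(clf.predict([[0, 0]]))  # predict for a sunny, normal-humidity day
```

`export_text` shows the tree re-represented as if-then rules, as described above.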
Entropy:
Entropy is a measure of the randomness in the information being processed.
The higher the entropy, the harder it is to draw any conclusions from that
information. Flipping a coin is an example of an action that provides random
information.
From the graph, it is quite evident that the entropy H(X) is zero when the
probability is either 0 or 1. The entropy is maximum when the probability is
0.5, because that represents perfect randomness in the data and there is no
chance of perfectly determining the outcome.
Information Gain
Information gain (IG) is a statistical property that measures how well a
given attribute separates the training examples according to their target
classification. Constructing a decision tree is all about finding the
attribute that returns the highest information gain and the smallest entropy.
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a
branch with entropy greater than zero needs further splitting.
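Information gain of a split is the parent node's entropy minus the weighted average entropy of the child nodes. A short sketch (the yes/no labels are illustrative):

```python
# Information gain = H(parent) - weighted average H(children).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5       # H(parent) = 1 bit
split = [["yes"] * 5, ["no"] * 5]       # a perfect split: both children pure
print(information_gain(parent, split))  # -> 1.0
```

A perfect split recovers the full 1 bit of parent entropy; an impure split recovers less, which is how ID3 ranks candidate attributes.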


Hypothesis space search in decision tree learning:

In order to derive the hypothesis space, we compute the entropy and
information gain of the class and the attributes. For these we use the
following statistical formulae.
Entropy of the class is:
Entropy(S) = −Σi pi log2(pi), where pi is the proportion of examples in S
belonging to class i.

Illustrative Example:

Concept: "Play Tennis". Data set:


Basic algorithm for inducing a decision tree from training tuples:

Algorithm: Generate_decision_tree.
Generate a decision tree from the training tuples of data partition D.
Input:
  Data partition, D, which is a set of training tuples and their
  associated class labels;
  attribute_list, the set of candidate attributes;
  Attribute_selection_method, a procedure to determine the splitting
  criterion that "best" partitions the data tuples into individual
  classes. This criterion consists of a splitting attribute and,
  possibly, either a split point or a splitting subset.
Output: A decision tree.
Method:
(1)  create a node N;
(2)  if tuples in D are all of the same class, C, then
       return N as a leaf node labeled with the class C;
(3)  if attribute_list is empty then
       return N as a leaf node labeled with the majority class in D;
       // majority voting
(4)  apply Attribute_selection_method(D, attribute_list) to find the
     "best" splitting criterion;
(5)  label node N with the splitting criterion;
(6)  if the splitting attribute is discrete-valued and multiway splits
     are allowed then  // not restricted to binary trees
(7)    attribute_list = attribute_list − splitting attribute;
(8)  for each outcome j of the splitting criterion
     // partition the tuples and grow subtrees for each partition
(9)    let Dj be the set of data tuples in D satisfying outcome j;
       // a partition
(10)   if Dj is empty then
         attach a leaf labeled with the majority class in D to node N;
       else
         attach the node returned by Generate_decision_tree(Dj,
         attribute_list) to node N;
(11) return N;
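A compact Python sketch of the method above, in the ID3 style (the tiny "play tennis"-like table, attribute names, and helper names are illustrative assumptions):

```python
# Minimal ID3-style tree induction over categorical attributes.
from collections import Counter
from math import log2

def entropy(rows, target):
    n = len(rows)
    return -sum((c / n) * log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def best_attribute(rows, attrs, target):
    # Step (4): pick the attribute with the highest information gain.
    def gain(a):
        parts = Counter(r[a] for r in rows)
        rem = sum(cnt / len(rows) * entropy([r for r in rows if r[a] == v], target)
                  for v, cnt in parts.items())
        return entropy(rows, target) - rem
    return max(attrs, key=gain)

def build_tree(rows, attrs, target):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                 # step (2): pure node -> leaf
        return classes[0]
    if not attrs:                              # step (3): majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    a = best_attribute(rows, attrs, target)
    node = {a: {}}
    for v in set(r[a] for r in rows):          # steps (8)-(10): grow subtrees
        subset = [r for r in rows if r[a] == v]
        node[a][v] = build_tree(subset, [x for x in attrs if x != a], target)
    return node

data = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "rain",     "windy": "no",  "play": "yes"},
    {"outlook": "rain",     "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
]
tree = build_tree(data, ["outlook", "windy"], "play")
print(tree)
```

On this data, "outlook" has the higher information gain, so it becomes the root; the "rain" branch is then split further on "windy".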

Advantages of Decision Trees:
• Simple to understand and interpret. People are able to understand
decision tree models after a brief explanation.
• Requires little data preparation. Other techniques often require data
normalization, dummy variables to be created, and blank values to be
removed.
• Able to handle both numerical and categorical data. Other techniques are
usually specialized in analysing datasets that have only one type of
variable. (For example, relation rules can be used only with nominal
variables, while neural networks can be used only with numerical variables.)
• Uses a white-box model. If a given situation is observable in a model, the
explanation for the condition is easily given by Boolean logic. (An example
of a black-box model is an artificial neural network, since the explanation
for the results is difficult to understand.)
• Possible to validate a model using statistical tests. That makes it
possible to account for the reliability of the model.
• Robust: performs well with large datasets. Large amounts of data can be
analyzed using standard computing resources in reasonable time.
Tools used to make Decision Trees:
Many data mining software packages provide implementations of one or more
decision tree algorithms. Several examples include:
▪ SAS Enterprise Miner
▪ MATLAB
▪ R (an open-source software environment for statistical computing which
includes several CART implementations such as the rpart, party and
randomForest packages)
▪ Weka (a free and open-source data mining suite which contains many
decision tree algorithms)
▪ Orange (a free data mining software suite, which includes the tree module
orngTree)
▪ KNIME
▪ Microsoft SQL Server
▪ scikit-learn (a free and open-source machine learning library for the
Python programming language)
▪ Salford Systems CART (which licensed the proprietary code of the original
CART authors)
▪ IBM SPSS Modeler
▪ RapidMiner


Multiple Decision Trees:
Classification & Regression Trees:

✓ Classification and regression trees is a term used to describe decision
tree algorithms that are used for classification and regression learning
tasks.
✓ The Classification and Regression Tree methodology, also known as CART,
was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen, and
Charles Stone.

Classification Trees:
✓ A classification tree is an algorithm where the target variable is fixed
or categorical. The algorithm is then used to identify the "class" within
which a target variable would most likely fall.
✓ An example of a classification-type problem would be determining who will
or will not subscribe to a digital platform, or who will or will not
graduate from high school.
✓ These are examples of simple binary classifications where the categorical
dependent variable can assume only one of two mutually exclusive values.


Regression Trees
✓ A regression tree refers to an algorithm where the target variable is
continuous and the algorithm is used to predict its value.
✓ As an example of a regression-type problem, you may want to predict the
selling price of a residential house, which is a continuous dependent
variable.
✓ This will depend on both continuous factors like square footage as well
as categorical factors.

Difference Between Classification and Regression Trees
✓ Classification trees are used when the dataset needs to be split into
classes that belong to the response variable. In many cases, the classes
are Yes or No.
✓ In other words, they are just two and mutually exclusive. In some cases,
there may be more than two classes, in which case a variant of the
classification tree algorithm is used.
✓ Regression trees, on the other hand, are used when the response variable
is continuous. For instance, if the response variable is something like the
price of a property or the temperature of the day, a regression tree is used.
✓ In other words, regression trees are used for prediction-type problems
while classification trees are used for classification-type problems.
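A minimal scikit-learn sketch contrasting the two tree types (the subscription and house-price toy datasets are illustrative assumptions):

```python
# Classification tree vs. regression tree on tiny illustrative datasets.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: categorical target (subscribe: 0/1) from age
Xc, yc = [[18], [22], [35], [40]], [0, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(Xc, yc)

# Regression: continuous target (house price, in thousands) from square footage
Xr, yr = [[800], [1000], [1500], [2000]], [100.0, 120.0, 180.0, 240.0]
reg = DecisionTreeRegressor(random_state=0).fit(Xr, yr)

print(clf.predict([[20]]))    # a class label
print(reg.predict([[1600]]))  # a predicted price (a continuous value)
```

The classifier returns a class label, while the regressor returns a numeric value: the only API difference is the estimator class, but the target types differ exactly as described above.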

1. CART (Classification And Regression Tree)
✓ The CART algorithm was introduced by Breiman et al. (1984). A CART tree is
a binary decision tree that is constructed by splitting a node into two
child nodes repeatedly, beginning with the root node that contains the whole
learning sample. The CART growing method attempts to maximize within-node
homogeneity.
✓ The extent to which a node does not represent a homogeneous subset of
cases is an indication of impurity. For example, a terminal node in which
all cases have the same value for the dependent variable is a homogeneous
node that requires no further splitting because it is "pure." For
categorical (nominal, ordinal) dependent variables the common measure of
impurity is Gini, which is based on squared probabilities of membership for
each category. Splits are found that maximize the homogeneity of child nodes
with respect to the value of the dependent variable.

Decision tree pruning:
Pruning is a data compression technique in machine learning and search
algorithms that reduces the size of decision trees by removing sections of
the tree that are non-critical and redundant for classifying instances.
Pruning reduces the complexity of the final classifier, and hence improves
predictive accuracy by reducing overfitting.

One of the questions that arises in a decision tree algorithm is the optimal
size of the final tree. A tree that is too large risks overfitting the
training data and generalizing poorly to new samples. A small tree might not
capture important structural information about the sample space. However, it
is hard to tell when a tree algorithm should stop, because it is impossible
to tell whether the addition of a single extra node will dramatically
decrease error. This problem is known as the horizon effect. A common
strategy is to grow the tree until each node contains a small number of
instances, then use pruning to remove nodes that do not provide additional
information. Pruning should reduce the size of a learning tree without
reducing predictive accuracy as measured by a cross-validation set. There
are many techniques for tree pruning that differ in the measurement that is
used to optimize performance.

Pruning Techniques:
Pruning processes can be divided into two types: pre-pruning and post-pruning.
• Pre-pruning procedures prevent a complete induction of the training set
by applying a stop criterion in the induction algorithm (e.g. maximum tree
depth, or information gain(Attr) > minGain). They are considered to be more
efficient because they do not induce an entire tree; rather, trees remain
small from the start.
• Post-pruning (or just pruning) is the most common way of simplifying
trees. Here, nodes and subtrees are replaced with leaves to reduce
complexity.
The procedures are differentiated on the basis of their approach in the
tree: top-down approach and bottom-up approach.

Bottom-up pruning approach:
• These procedures start at the last node in the tree (the lowest point).
• Following recursively upwards, they determine the relevance of each
individual node.
• If the relevance for the classification is not given, the node is dropped
or replaced by a leaf.
• The advantage is that no relevant sub-trees can be lost with this method.
• These methods include Reduced Error Pruning (REP), Minimum Cost
Complexity Pruning (MCCP), and Minimum Error Pruning (MEP).

Top-down pruning approach:
• In contrast to the bottom-up method, this method starts at the root of
the tree. Following the structure downwards, a relevance check is carried
out which decides whether a node is relevant for the classification of all
n items or not.
• By pruning the tree at an inner node, it can happen that an entire
sub-tree (regardless of its relevance) is dropped.
• One of these representatives is Pessimistic Error Pruning (PEP), which
gives quite good results with unseen items.
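Minimal cost-complexity pruning, one of the post-pruning methods mentioned above, is exposed in scikit-learn via the `ccp_alpha` parameter; a hedged sketch (the alpha value is an arbitrary illustration):

```python
# Post-pruning via minimal cost-complexity pruning (ccp_alpha).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# Pruning trades a little training accuracy for a much simpler tree.
print(full.tree_.node_count, pruned.tree_.node_count)
```

Larger `ccp_alpha` values prune more aggressively; in practice the value is chosen by cross-validation, consistent with the note above that pruning should not reduce accuracy measured on a validation set.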


2. CHAID Decision Tree (Chi-square Automatic Interaction Detector)
• As is evident from the name of this algorithm, it is based on the
chi-square statistic.
• Chi-square Automatic Interaction Detector (CHAID) was a technique created
by Gordon V. Kass in 1980.
• CHAID is a tool used to discover the relationship between variables.
• A chi-square test yields a probability value as a result, lying anywhere
between 0 and 1.
  o A chi-square value closer to 0 indicates that there is a significant
    difference between the two classes which are being compared.
  o Similarly, a value closer to 1 indicates that there is not any
    significant difference between the two classes.
• In CHAID analysis, nominal, ordinal, and continuous data can be used,
where continuous predictors are split into categories with approximately
equal numbers of observations.
• CHAID creates all possible cross-tabulations for each categorical
predictor until the best outcome is achieved and no further splitting can
be performed.
• CHAID analysis splits the target into two or more categories that are
called the initial, or parent, nodes, and then the nodes are split using
statistical algorithms into child nodes.
• Unlike in regression analysis, the CHAID technique does not require the
data to be normally distributed.
• The nature of the CHAID algorithm is to create WIDE trees.
Variable types used in the CHAID algorithm:
• Dependent variable: continuous OR categorical
• Independent variables: categorical ONLY (can have more than 2 categories)
• Thus, if there are continuous predictor variables, we need to transform
them into categorical variables before they can be supplied to the CHAID
algorithm.
• Statistical tests used to determine the next best split:
  o Continuous dependent variable: F-test (regression problems)
  o Categorical dependent variable: chi-square (classification problems)
How does CHAID handle different types of variables?
Nominal variable: automatically groups the data as per point 2 above.
Ordinal variable: automatically groups the data as per point 2 above.
Continuous variable: converted into segments/deciles before performing
point 2 above.

GINI Index Impurity Measure:
• The Gini index is used by the CART (classification and regression tree)
algorithm. Gini impurity is a measure of how often a randomly chosen element
from the set would be incorrectly labeled if it were randomly labeled
according to the distribution of labels in the subset. Gini impurity can be
computed by summing the probability fi of each item being chosen times the
probability (1 − fi) of a mistake in categorizing that item.
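The definition above, Gini = Σ fi(1 − fi), which simplifies to 1 − Σ fi², in a short sketch:

```python
# Gini impurity of a label distribution: 1 - sum(f_i^2).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # pure node -> 0.0
print(gini(["a", "a", "b", "b"]))  # 50/50 split -> 0.5
```

Like entropy, Gini is 0 for a pure node and maximal when classes are evenly mixed, which is why CART uses it to rank candidate splits.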

Overfitting and Underfitting

• Let us clearly understand overfitting, underfitting, and perfectly fit
models.
• From the three graphs shown above, one can clearly understand that in the
leftmost figure the line does not cover all the data points, so we can say
that the model is under-fitted. In this case, the model has failed to
generalize the pattern to the new dataset, leading to poor performance on
testing. The under-fitted model can be easily spotted as it gives very high
errors on both training and testing data. This happens when the dataset is
not clean and contains noise, the model has high bias, or the size of the
training data is not enough.
• When it comes to overfitting, as shown in the rightmost graph, the model
is covering all the data points correctly, and you might think this is a
perfect fit. But actually, no, it is not a good fit! Because the model
learns too many details from the dataset, it also considers noise. Thus, it
negatively affects the new data set; not every detail that the model has
learned during training also applies to the new data points, which gives
poor performance on the testing or validation dataset. This is because the
model has trained itself in a very complex manner and has high variance.
• The best-fit model is shown by the middle graph, where both training and
testing (validation) loss are at a minimum; or we can say training and
testing accuracy should be near each other and high in value.

Time Series Methods:
• Time series forecasting focuses on analyzing data changes across equally
spaced time intervals.
• Time series analysis is used in a wide variety of domains, ranging from
econometrics to geology and earthquake prediction; it is also used in
almost all applied sciences and engineering.
• Time series analysis finds hidden patterns and helps obtain useful
insights from time series data.
• Time series data is data that is observed at different points in time.
• Time series analysis is useful in predicting future values or detecting
anomalies in the data. Such analysis typically requires many data points to
be present in the dataset to ensure consistency and reliability.
• The different types of models and analyses that can be created through
time series analysis are:
  o Classification: identify and assign categories to the data.
  o Curve fitting: plot the data along a curve and study the relationships
    of variables present within the data.
  o Descriptive analysis: identify patterns in time series data, such as
    trends, cycles, or seasonal variation.
  o Explanative analysis: understand the data and its relationships: the
    dependent features, cause and effect, and their trade-offs.
  o Exploratory analysis: describe and focus on the main characteristics
    of the time series data, usually in a visual format.
  o Forecasting: predict future data based on historical trends, using the
    historical data as a model for future data and predicting scenarios
    that could happen along the future plot points.
  o Intervention analysis: study how an event can change the data.
  o Segmentation: split the data into segments to discover the underlying
    properties of the source information.
Components of Time Series:
Long-term trend – The smooth long-term direction of a time series, where the
data can increase or decrease in some pattern.
Seasonal variation – Patterns of change in a time series within a year which
tend to repeat every year.
Cyclical variation – Much like seasonal variation, but the rise and fall of
the time series occurs over periods longer than one year.
Irregular variation – Any variation that is not explainable by any of the
three components mentioned above. It can be classified into stationary and
non-stationary variation:
Stationary variation: when the data neither increases nor decreases, i.e. it
is completely random, it is called stationary variation. When the data has
some explainable portion remaining and can be analyzed further, the case is
called non-stationary variation.

ARIMA & ARMA:
What is ARIMA?
• In time series analysis, ARIMA is an acronym that stands for
AutoRegressive Integrated Moving Average. The ARIMA model is a
generalization of an autoregressive moving average (ARMA) model. These
models are fitted to time series data either to better understand the data
or to predict future points in the series (forecasting).
• They are applied in some cases where the data show evidence of
non-stationarity.
• A popular and very widely used statistical method for time series
forecasting and analysis is the ARIMA model.
• It is a class of models that captures a spectrum of different standard
temporal structures present in time series data. By implementing an ARIMA
model, you can forecast and analyze a time series using past values, such
as predicting future prices based on historical earnings.
• Univariate models such as these are used to better understand a single
time-dependent variable present in the data, such as temperature over time,
and to predict future data points of that variable.
• An initial differencing step (corresponding to the "integrated" part of
the model) can be applied to reduce the non-stationarity. A standard
notation used for describing ARIMA is by the parameters p, d and q.
• Non-seasonal ARIMA models are generally denoted ARIMA(p, d, q), where the
parameters p, d, and q are non-negative integers: p is the order of the
autoregressive model, d is the degree of differencing, and q is the order
of the moving-average model.
• The parameters are substituted with an integer value to indicate the
specific ARIMA model being used. The parameters of the ARIMA model are
further described as follows:
  o p: the number of lag observations included in the model, also known as
    the lag order.
  o d: the number of times the raw observations are differenced, also
    called the degree of differencing.
  o q: the size of the moving-average window, also called the order of the
    moving average.
Univariate stationary processes (ARMA)
A covariance-stationary process is an ARMA(p, q) process of autoregressive
order p and moving-average order q if it can be written as

yt = μ + ϕ1yt-1 + … + ϕpyt-p + et − θ1et-1 − … − θqet-q

The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags
of the stationarized series in the forecasting equation are called
"autoregressive" terms, lags of the forecast errors are called "moving
average" terms, and a time series which needs to be differenced to be made
stationary is said to be an "integrated" version of a stationary series.
Random-walk and random-trend models, autoregressive models, and exponential
smoothing models are all special cases of ARIMA models.

A non-seasonal ARIMA model is classified as an "ARIMA(p, d, q)" model, where:

• p is the number of autoregressive terms,
• d is the number of non-seasonal differences needed for stationarity, and
• q is the number of lagged forecast errors in the prediction equation.


The forecasting equation is constructed as follows. First, let y denote the
dth difference of Y, which means:

If d=0:  yt = Yt
If d=1:  yt = Yt − Yt-1
If d=2:  yt = (Yt − Yt-1) − (Yt-1 − Yt-2) = Yt − 2Yt-1 + Yt-2

Note that the second difference of Y (the d=2 case) is not the difference
from 2 periods ago. Rather, it is the first-difference-of-the-first-difference,
which is the discrete analog of a second derivative, i.e., the local
acceleration of the series rather than its local trend.

In terms of y, the general forecasting equation is:

ŷt = μ + ϕ1yt-1 + … + ϕpyt-p − θ1et-1 − … − θqet-q
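The differencing cases above can be verified with numpy, whose `diff` function applies first differences d times (the sample series is an illustrative assumption):

```python
# d-th differencing of a series: np.diff(Y, n=d) implements y_t as defined above.
import numpy as np

Y = np.array([3.0, 5.0, 9.0, 15.0])

d1 = np.diff(Y, n=1)  # Y_t - Y_{t-1}
d2 = np.diff(Y, n=2)  # Y_t - 2*Y_{t-1} + Y_{t-2}

# Check the d=2 identity from the text explicitly:
print(d1, d2, Y[2:] - 2 * Y[1:-1] + Y[:-2])
```

Applying `diff` twice (the first-difference-of-the-first-difference) matches the expanded d=2 formula term by term.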

Measure of Forecast Accuracy:
Forecast accuracy can be defined as the deviation of the forecast or
prediction from the actual results.
Error = Actual demand − Forecast, or ei = At − Ft

We measure forecast accuracy by two methods:
1. Mean Forecast Error (MFE)
For n time periods where we have actual demand and forecast values:
MFE = (Σ i=1..n ei) / n
Ideal value = 0; if MFE > 0, the model tends to under-forecast; if MFE < 0,
the model tends to over-forecast.
2. Mean Absolute Deviation (MAD)
For n time periods where we have actual demand and forecast values:
MAD = (Σ i=1..n |ei|) / n
While MFE is a measure of forecast model bias, MAD indicates the absolute
size of the errors.
Uses of forecast error:
• Forecast model bias
• Absolute size of the forecast errors
• Compare alternative forecasting models
• Identify forecast models that need adjustment
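A short sketch computing both measures (the demand and forecast values are illustrative assumptions):

```python
# Mean Forecast Error (bias) and Mean Absolute Deviation (error size).
def mfe(actual, forecast):
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)

def mad(actual, forecast):
    errors = [abs(a - f) for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)

actual   = [100, 110, 120, 130]
forecast = [ 90, 115, 115, 135]

print(mfe(actual, forecast))  # positive -> on average the model under-forecasts
print(mad(actual, forecast))  # typical absolute error size
```

Note how positive and negative errors cancel in MFE (bias) but not in MAD (error magnitude), which is why the two measures are reported together.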

ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that:
• Extracts data from homogeneous or heterogeneous data sources
• Transforms the data into the proper format or structure for querying and analysis
• Loads it into the final target (a database, or more specifically an operational data store, data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, a transformation process runs while data is still being pulled, processing the records already received and preparing them for loading; as soon as some data is ready to be loaded into the target, the load kicks off without waiting for the previous phases to complete.
ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example, a cost accounting system may combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:
• Microsoft SQL Server Integration Services (SSIS)
• CampaignRunner
• Oracle Data Integrator (ODI)
• Oracle Warehouse Builder (OWB)
• Rhino ETL
• SAP BusinessObjects Data Services
• SAS Data Integration Studio
• SnapLogic
There are various steps involved in ETL. They are described in detail below:
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time, or any kind of locking.
There are several ways to perform the extract:


• Update notification - if the source system is able to provide a notification that a record has been changed and describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way to get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to identify changes. A full extract handles deletions as well.
• When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can be in tens of gigabytes.
Clean:
The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to a standard Male/Female/Unknown)
• Converting null values into a standardized Not Available/Not Provided value
• Converting phone numbers and ZIP codes to a standardized form
• Validating address fields and converting them into proper naming, e.g. Street/St/St./Str./Str.
• Validating address fields against each other (State/Country, City/State, City/ZIP code, City/Street).
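A minimal sketch of such unification rules in Python, assuming a hypothetical record schema with `sex` and `zip` fields (the mapping table and 5-digit ZIP form are illustrative choices, not a standard):

```python
SEX_MAP = {"m": "Male", "man": "Male", "male": "Male",
           "f": "Female", "woman": "Female", "female": "Female"}

def clean_record(record):
    """Apply basic unification rules to one source record (hypothetical schema)."""
    cleaned = dict(record)

    # Unify sex identifiers to a standard Male/Female/Unknown
    raw_sex = (record.get("sex") or "").strip().lower()
    cleaned["sex"] = SEX_MAP.get(raw_sex, "Unknown")

    # Convert nulls and blanks into a standardized marker
    for key, value in cleaned.items():
        if value is None or value == "":
            cleaned[key] = "Not Available"

    # Normalize ZIP codes to a 5-digit form (US-style assumption)
    digits = "".join(ch for ch in str(cleaned.get("zip", "")) if ch.isdigit())
    if digits:
        cleaned["zip"] = digits.zfill(5)

    return cleaned

cleaned = clean_record({"sex": "M", "zip": "750 01", "city": None})
```

Real cleaning steps would add many more rules (address validation, cross-field checks), but each follows this same translate-to-canonical-form pattern.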

Transform:
• The transform step applies a set of rules to transform the data from the source to the target.
• This includes converting any measured data to the same dimension (i.e., a conformed dimension) using the same units so that they can later be joined.
• The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.

Load:
• During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database.
• In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
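The three phases can be sketched end-to-end as a toy pipeline. This uses an in-memory SQLite database as the target and a made-up source of (name, amount-in-cents) rows; the table schema and function names are illustrative only:

```python
import sqlite3

def extract(source_rows):
    """Extract: pull raw records out of a (simulated) source system."""
    return list(source_rows)

def transform(rows):
    """Transform: conform names and convert cents to a common dollar unit."""
    return [(name.strip().title(), cents / 100.0) for name, cents in rows]

def load(rows, conn):
    """Load: write the conformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

source = [(" alice ", 1250), ("BOB", 3075)]   # made-up source records
conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

A production pipeline would stream records between the phases rather than pass whole lists, which is what allows the parallel execution described earlier.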

Managing the ETL Process
The ETL process seems quite straightforward. As with every application, there is a possibility that the ETL process fails. This can be caused by missing extracts from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. Therefore, it is necessary to design the ETL process with fail-recovery in mind.

Staging:
It should be possible to restart at least some of the phases independently of the others. For example, if the transformation step fails, it should not be necessary to restart the Extract step. We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. However, the staging area should be accessed by the load ETL process only; it should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-processing data.


*** End of Unit-4 ***

5. UNIT-5
Data Visualization:

Data visualization is the art and practice of gathering, analyzing, and graphically representing empirical information.
Such graphics are sometimes called information graphics, or even just charts and graphs.
The goal of visualizing data is to tell the story in the data.
Telling the story is predicated on understanding the data at a very deep level, and gathering insight from comparisons of data points in the numbers.

Why data visualization?

• Gain insight into an information space by mapping data onto graphical primitives.
• Provide a qualitative overview of large data sets.
• Search for patterns, trends, structure, irregularities, and relationships among data.
• Help find interesting regions and suitable parameters for further quantitative analysis.
• Provide a visual proof of computer representations derived.

Categorization of visualization methods:
• Pixel-oriented visualization techniques
• Geometric projection visualization techniques
• Icon-based visualization techniques
• Hierarchical visualization techniques
• Visualizing complex data and relations

Pixel-Oriented Visualization Techniques


• For a data set of m dimensions, create m windows on the screen, one for each dimension.
• The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
• The colors of the pixels reflect the corresponding values.
• To save space and show the connections among multiple dimensions, space filling is often done in a circle segment.
Geometric Projection Visualization Techniques
Visualization of geometric transformations and projections of the data. Methods include:
• Direct visualization
• Scatterplot and scatterplot matrices
• Landscapes
• Projection pursuit technique: helps users find meaningful projections of multidimensional data
• Prosection views
• Hyperslice
• Parallel coordinates


Line Plot:
• This is the plot that you can see in the nooks and corners of any sort of analysis between 2 variables.
• A line plot is nothing but a series of data points connected with straight lines.
• The plot may seem very simple, but it has many applications not only in machine learning but in many other areas.
• It is used, for example, to analyze the performance of a model via the ROC-AUC curve.

Bar Plot:
• This is one of the most widely used plots; we see it multiple times not just in data analysis, but wherever there is trend analysis in many fields.
• We can visualize the data in a clear plot and convey the details straightforwardly to others.
• This plot is simple and clear, but it is not frequently used in data science applications.

Stacked Bar Graph:

• Unlike a Multi-set Bar Graph, which displays its bars side-by-side, Stacked Bar Graphs segment their bars. Stacked Bar Graphs are used to show how a larger category is divided into smaller categories and what relationship each part has to the total amount. There are two types of Stacked Bar Graphs:
• Simple Stacked Bar Graphs place each value for the segment after the previous one. The total value of the bar is all the segment values added together. Ideal for comparing the total amounts across each group/segmented bar.
• 100% Stacked Bar Graphs show the percentage-of-the-whole of each group and are plotted by the percentage of each value to the total amount in each group. This makes it easier to see the relative differences between quantities in each group.
• One major flaw of Stacked Bar Graphs is that they become harder to read the more segments each bar has. Also, comparing each segment to the others is difficult, as they are not aligned on a common baseline.

Scatter Plot:

• It is one of the most commonly used plots for visualizing simple data in machine learning and data science.
• This plot is a representation where each point in the dataset is plotted with respect to any 2 to 3 features (columns).
• Scatter plots are available in both 2-D and 3-D. The 2-D scatter plot is the common one, where we primarily try to find patterns, clusters, and the separability of the data.
• Colors are assigned to different data points based on how they appear in the dataset, i.e., the target column representation.
• We can color the data points as per their class label given in the dataset.
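The line, bar, and scatter plots described above can be sketched with matplotlib (assuming matplotlib is available; the data and styling are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no display required
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 6, 5]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Line plot: data points joined by straight lines
ax1.plot(x, y, marker="o")
ax1.set_title("Line plot")

# Bar plot: one bar per category
ax2.bar(["A", "B", "C"], [10, 7, 4])
ax2.set_title("Bar plot")

# Scatter plot: each point colored by a (made-up) class label
labels = [0, 0, 1, 1, 1]
ax3.scatter(x, y, c=labels)
ax3.set_title("Scatter plot")

fig.savefig("basic_plots.png")
```

The same `Axes` methods accept many other keyword options (colors, line styles, marker shapes) for tailoring each plot.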

Box and Whisker Plot:
• This plot can be used to obtain more statistical details about the data.
• The straight lines at the maximum and minimum are called whiskers.
• Points that lie outside the whiskers are considered outliers.
• The box plot also gives us a description of the 25th, 50th, and 75th quartiles.
• With the help of a box plot, we can also determine the interquartile range (IQR), within which the bulk of the data will be present.
• Box plots come under univariate analysis, which means that we are exploring the data with only one variable.
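The quartiles, IQR, and whisker fences behind a box plot can be computed with the standard-library `statistics` module; the 1.5 × IQR fence used here is the common convention for flagging outliers, and the sample data is made up:

```python
from statistics import quantiles

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # made-up sample

# 25th, 50th, and 75th percentiles: the box edges and the median line
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                   # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr    # whisker limits under the usual
upper_fence = q3 + 1.5 * iqr    # 1.5 * IQR convention

# Points beyond the whisker fences are treated as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

Different quantile conventions (`method="inclusive"` vs `"exclusive"`) shift the box edges slightly, which is why box plots from different tools can disagree on the same data.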

Pie Chart:
A pie chart shows a static number and how categories represent part of a whole: the composition of something. A pie chart represents numbers in percentages, and the total sum of all segments needs to equal 100%.
• Extensively used in presentations and offices, pie charts help show proportions and percentages between categories by dividing a circle into proportional segments. Each arc length represents a proportion of each category, while the full circle represents the total sum of all the data, equal to 100%.
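The percentage and arc-angle arithmetic behind a pie chart can be sketched directly (the category totals below are made up):

```python
values = {"A": 30, "B": 45, "C": 15, "D": 10}  # hypothetical category totals

total = sum(values.values())
# Each segment's share of the whole, in percent (must sum to 100)
percentages = {k: 100 * v / total for k, v in values.items()}
# Each segment's arc, in degrees (must sum to the full 360-degree circle)
angles = {k: 360 * v / total for k, v in values.items()}
```

This part-to-whole constraint is exactly what makes pie charts unsuitable for data whose categories do not sum to a meaningful total.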


Donut Chart:

A donut chart is essentially a pie chart with the area of the centre cut out. Pie charts are sometimes criticised for focusing readers on the proportional areas of the slices to one another and to the chart as a whole. This makes it tricky to see the differences between slices, especially when you try to compare multiple pie charts together.


A donut chart somewhat remedies this problem by de-emphasizing the use of area. Instead, readers focus more on reading the length of the arcs, rather than comparing the proportions between slices. Also, donut charts are more space-efficient than pie charts because the blank space inside a donut chart can be used to display information.

Marimekko Chart:

Also known as a Mosaic Plot.

Marimekko Charts are used to visualise categorical data over a pair of variables. In a Marimekko Chart, both axes are variable with a percentage scale, which determines both the width and height of each segment. So Marimekko Charts work as a kind of two-way 100% Stacked Bar Graph. This makes it possible to detect relationships between categories and their subcategories via the two axes.
The main flaws of Marimekko Charts are that they can be hard to read, especially when there are many segments. Also, it is hard to accurately make comparisons between segments, as they are not all arranged next to each other along a common baseline. Therefore, Marimekko Charts are better suited for giving a more general overview of the data.

Icon-Based Visualization Techniques
• Uses small icons to represent multidimensional data values
• Visualization of the data values as features of icons
• Typical visualization methods:
• Chernoff Faces
• Stick Figures

Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.

• The figure shows faces produced using 10 characteristics: head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening. Each is assigned one of 10 possible values.

Stick Figure

• A census data figure showing age, income, gender, and education.
• A 5-piece stick figure (1 body and 4 limbs with different angle/length).
• Age and income are indicated by the position of the figure.
• Gender and education are indicated by angle/length.
• The visualization can show a texture pattern.
• 2 dimensions are mapped to the display axes and the remaining dimensions are mapped to the angle and/or length of the limbs.

Hierarchical Visualization
Circle Packing

• Circle Packing is a variation of a Treemap that uses circles instead of rectangles. Containment within each circle represents a level in the hierarchy: each branch of the tree is represented as a circle and its sub-branches are represented as circles inside of it. The area of each circle can also be used to represent an additional arbitrary value, such as quantity or file size. Colour may also be used to assign categories or to represent another variable via different shades.
• As beautiful as Circle Packing appears, it is not as space-efficient as a Treemap, as there is a lot of empty space within the circles. Despite this, Circle Packing actually reveals hierarchical structure better than a Treemap.

Sunburst Diagram

• Also known as a Sunburst Chart, Ring Chart, Multi-level Pie Chart, Belt Chart, or Radial Treemap.
• This type of visualisation shows hierarchy through a series of rings that are sliced for each category node. Each ring corresponds to a level in the hierarchy, with the central circle representing the root node and the hierarchy moving outwards from it.
• Rings are sliced up and divided based on their hierarchical relationship to the parent slice. The angle of each slice is either divided equally under its parent node or can be made proportional to a value.
• Colour can be used to highlight hierarchical groupings or specific categories.

Treemap:

• Treemaps are an alternative way of visualising the hierarchical structure of a Tree Diagram while also displaying quantities for each category via area size. Each category is assigned a rectangle area with its subcategory rectangles nested inside it.
• When a quantity is assigned to a category, its area size is displayed in proportion to that quantity and to the other quantities within the same parent category in a part-to-whole relationship. Also, the area size of the parent category is the total of its subcategories. If no quantity is assigned to a subcategory, then its area is divided equally amongst the other subcategories within its parent category.
• The way rectangles are divided and ordered into sub-rectangles depends on the tiling algorithm used. Many tiling algorithms have been developed, but the "squarified" algorithm, which keeps each rectangle as square as possible, is the one commonly used.
• Ben Shneiderman originally developed Treemaps as a way of visualising a vast file directory on a computer without taking up too much space on the screen. This makes Treemaps a more compact and space-efficient option for displaying hierarchies, giving a quick overview of the structure. Treemaps are also great at comparing the proportions between categories via their area size.
• The downside to a Treemap is that it does not show the hierarchical levels as clearly as other charts that visualise hierarchical data (such as a Tree Diagram or Sunburst Diagram).

Visualizing Complex Data and Relations

• For a large data set of high dimensionality, it would be difficult to visualize all dimensions at the same time.
• Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces).
• The subspaces are visualized in a hierarchical manner.
• "Worlds-within-Worlds," also known as n-Vision, is a representative hierarchical visualization method.
• To visualize a 6-D data set, where the dimensions are F, X1, X2, X3, X4, X5: we want to observe how F changes w.r.t. the other dimensions. We can fix the X3, X4, X5 dimensions to selected values and visualize changes to F w.r.t. X1, X2.
• Most visualization techniques were mainly designed for numeric data.
• Recently, more and more non-numeric data, such as text and social networks, have become available.
• Many people on the Web tag various objects such as pictures, blog entries, and product reviews.
• A tag cloud is a visualization of statistics of user-generated tags.
• Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
• The importance of a tag is indicated by font size or color.

Word Cloud:

Also known as a Tag Cloud.

A visualisation method that displays how frequently words appear in a given body of text, by making the size of each word proportional to its frequency. All the words are then arranged in a cluster or cloud of words. Alternatively, the words can be arranged in any format: horizontal lines, columns, or within a shape.
Word Clouds can also be used to display words that have meta-data assigned to them. For example, in a Word Cloud with all the world's country names, the population could be assigned to each name to determine its size.
Colour used on Word Clouds is usually meaningless and primarily aesthetic, but it can be used to categorise words or to display another data variable.
Typically, Word Clouds are used on websites or blogs to depict keyword or tag usage. Word Clouds can also be used to compare two different bodies of text.
Although simple and easy to understand, Word Clouds have some major flaws:
• Long words are emphasised over short words.
• Words whose letters contain many ascenders and descenders may receive more attention.
• They are not great for analytical accuracy, so they are used more for aesthetic reasons.
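The core of a word cloud, mapping word frequency to font size, can be sketched with `collections.Counter`. The linear size-scaling formula below is an illustrative choice, not a standard:

```python
from collections import Counter

text = ("data analytics uses data visualization to tell the story in the data "
        "and visualization helps compare data")

freq = Counter(text.split())     # word -> number of occurrences
max_count = max(freq.values())

# Linearly scale each frequency into a font-size range
min_size, max_size = 10, 40
sizes = {word: min_size + (max_size - min_size) * count / max_count
         for word, count in freq.items()}
```

A real word-cloud tool would also drop stop words ("the", "and") before sizing, since raw frequency otherwise rewards them heavily.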
*** End of Unit-5 ***

