STATISTICS FOR DATA SCIENCE
UE20CS203
Unit 1: Introduction
Mamatha.H.R
K.M Mitravinda
Department of Computer Science and Engineering
What is Data Science?
● Have you ever wondered how YouTube recommends videos to your liking?
● How does Google’s autocomplete work?
● How does Gmail filter your emails into spam and non-spam categories?
These are some of the simplest applications of Data Science. Such tasks would be impossible without data. In simple words, Data Science is all about using data to solve problems.
Source: https://coralogix.com/blog/elasticsearch-autocomplete-with-search-as-you-type/
Data Science is an interdisciplinary field.
● It is focused on extracting knowledge and insights from data.
● Those insights are then applied to solve problems across a wide
range of domains.
● It incorporates skills from Statistics, Computer Science,
Mathematics, Business etc.
Source: theblog.adobe.com
Applications of Data Science
Source: edureka.com
Applications of Data Science : Data Science in Airlines Industry
Data Science is used for purposes such as route planning, revenue management, and predicting in-flight sales and food supplies.
Sources: Simplilearn, datasciencecentral.com
Applications of Data Science : Data Science in Logistics Industries
Source: Simplilearn
Logistics is a sector where data scientists can make a significant
impact in several areas such as:
● waste reduction
● optimizing delivery routes (which can translate into lower
delivery costs)
● selecting carriers that deploy best practices in mitigating the
effects of CO2 emissions
● ensuring that hazardous materials are handled with the
utmost care
● forecasting the supply and demand cycles
Applications of Data Science : Data Science in Recommender systems
Source: https://www.martechadvisor.com/articles/customer-experience-2/recommendation-engines-how-amazon-and-netflix-are-winning-the-personalization-battle/
Amazon has a huge bank of data on online consumer purchasing
behaviour.
The data includes:
● purchased shopping carts
● items added to carts but abandoned
● wish lists
● dwell time
● referral sites
● customers’ demographic information
● number of times an item was viewed before the final purchase
● click paths in a session, online pricing experiments etc.
Using this data, Amazon can find hidden factors and patterns to generate the “Recommended for You” section, which helps create a personalized shopping experience for every customer.
Source: https://medium.com/swlh/recommendations-in-time-context-93b32f73d98d
Netflix has set up 1300 recommendation clusters based on users’ viewing preferences.
Netflix’s personalized recommendation algorithms produce $1 billion a year in value from customer retention, and recommendations account for 80% of its total views. Some of the user information that Netflix captures to help in recommendation includes:
● Viewer interactions with Netflix services, such as viewer ratings, viewing history, etc.
● Information about each title, such as categories, year of release, title, genres etc.
● Other viewers with similar watching preferences.
● The length of time a viewer watches a show.
● The device on which a viewer is watching.
● The time of day a viewer watches.
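The idea of finding "other viewers with similar watching preferences" can be illustrated with a small sketch. This is not Netflix's actual method. It is a generic cosine-similarity comparison, and the viewer names, ratings and helper functions below are hypothetical.

```python
import math

# Hypothetical ratings (1-5) given by three viewers to the same four titles.
ratings = {
    "viewer_a": [5, 4, 1, 1],
    "viewer_b": [4, 5, 2, 1],
    "viewer_c": [1, 1, 5, 4],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, all_ratings):
    """Return the name of the viewer whose ratings are closest to the target's."""
    others = [name for name in all_ratings if name != target]
    return max(
        others,
        key=lambda name: cosine_similarity(all_ratings[target], all_ratings[name]),
    )

print(most_similar("viewer_a", ratings))
```

Titles liked by the most similar viewer but not yet seen by the target viewer would then be natural candidates to recommend.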
Applications of Data Science : Data Science in Weather Forecasting
Source: phys.org/news
Weather forecasts are made by collecting quantitative data
about the current state of the atmosphere at a given place and
using meteorology to project how the atmosphere will change.
So in general, weather forecasting is driven by the data about the
atmosphere.
A wide variety of devices and technologies gather information about the weather, such as thermometers, barometers, anemometers, weather balloons, radar systems and satellites.
Various weather models analyse and try to make sense of all
the incoming information to accurately predict the weather.
Applications of Data Science : Data Science in Sports
Source: https://arstechnica.com/information-technology/2015/10/big-data-an-it-buzzword-that-is-actually-producing-results/
Players, team managers, coaches and fans rely on sports analytics
before making decisions or developing strategies to win games.
Sports data analysts spend their time collecting on-field and off-
field data from a variety of sources and then analyzing and
interpreting that data looking for meaningful insights.
The main objective of sports analysis is to improve team
performance and enhance the chances of winning the game.
Major teams and their analytics partners:
(i) Real Madrid and Microsoft
(ii) Manchester United and Aon
● Moneyball, an American biographical film, recounts the attempts of a baseball team’s general manager to assemble a competitive team using sports analytics.
● He used sabermetrics to evaluate his potential roster, performing data mining on hundreds of individual baseball players to identify statistics that were highly predictive of how many runs a player would score.
Source: https://en.wikipedia.org/wiki/Moneyball_(film)
Source: https://fivethirtyeight.com/features/billion-dollar-billy-beane/
Applications of Data Science : Data Science in Politics
Political parties and their strategists have realized the importance of
mining real-time demographic and polling data.
The data points may include voter sentiment, mass emotions, citizen concerns in different constituencies, popular outlooks in various states, etc. Political parties can use these insights to:
● attract voter donations
● convert undecided voters
● enroll young volunteers
● organize resources
● run social media campaigns
● improve the effectiveness of electioneering activities etc.
Source: https://fivethirtyeight.com/features/todays-polls-and-final-election/
Political strategists and digital analysts can deploy modern software
analytics to create detailed maps of voting patterns.
Data analytics can help these campaigners to paint a vivid picture of
political winds, party supporters, and trenchant opponents in every
demographic region.
This demographic data and other information can be used in
campaign-spending management. It can help determine whether a
voter would be most receptive to a phone call, a flyer or mailer, an in-
person visit, or some other form of campaigning.
By using data in this way, campaigns can avoid wasting money on
ineffective or unnecessary advertising, and have a better chance of
reaching someone who is receptive.
Source: https://www.datasciencecentral.com/profiles/blogs/top-politics-focused-data-science-analytics-organizations
Applications of Data Science : Data Science in Healthcare & Medicine
Source: http://www.primeclasses.in/blog/2019/08/26/the-need-for-data-science-in-healthcare-industry/
Several fields in healthcare, such as medical imaging, drug discovery, genetics and predictive diagnosis, make use of data science.
● Hospitals analyse medical data and patient records to predict which patients are likely to be readmitted within a few months of discharge.
● Omada Health is a digital medical company that uses smart
devices to create customized behavioral plans and online
training to help prevent chronic health conditions, such as
diabetes, high blood pressure, and high cholesterol.
● On the mental health side, the Canadian start-up Awake Labs tracks data from children with autism through wearable devices, alerting parents before a meltdown.
Source: https://allofus.nih.gov
Applications of Data Science : Data Science in predicting people’s opinions
Source: Simplilearn
What is Data?
Technically, data refers to individual facts, statistics, or items of
information, often numeric, that are collected through
observation.
Source: https://www.twinkl.de/teaching-wiki/data
Data vs Information
➔ Data
● Raw facts, usually formatted in a special way.
● Based on records, observations etc.
● Unorganized.
➔ Information
● A collection of facts organized in such a way that they have
additional value beyond the value of the facts themselves.
● Based on analysis of data.
● Organized and always depends on data.
Ex: Data – thermometer readings of temperature taken every hour: (16.0, 17.0, 16.0, 18.5, 17.0, 15.5, …)
[on transformation]
Information – today’s high: 18.5, today’s low: 15.5
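The thermometer example can be sketched in a few lines of code: the raw readings are the data, and summarizing them produces the information. The function name `summarize` is a hypothetical helper, and only the six readings listed on the slide are used (the slide's list is elided after that).

```python
# Hourly thermometer readings from the slide (raw data).
readings = [16.0, 17.0, 16.0, 18.5, 17.0, 15.5]

def summarize(temps):
    """Turn raw readings into information: the day's high and low."""
    return {"high": max(temps), "low": min(temps)}

info = summarize(readings)
print(f"today's high: {info['high']}, today's low: {info['low']}")
```

The list of readings on its own answers no question. The summary answers "how warm and how cold was it today?", which is what makes it information.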
Source: https://effectualsystems.com/data-need-information/
Types of Data
Data               | Represented by
Alphanumeric data  | Numbers, letters, and other characters
Image data         | Graphic images or pictures
Audio data         | Sound, noise, tones
Video data         | Moving images or pictures
Structured, Unstructured & Semi-structured Data
Source: https://towardsdatascience.com/data-extraction-from-a-pdf-table-with-semi-structured-layout-ef694f3f8ff1
Source: slidegeeks.com
Structured Data:
Structured data is data whose elements are addressable for effective analysis. The data is organized into a formatted repository, typically a database. Ex: Relational data.
Semi-Structured Data:
Data that doesn’t reside in a relational database but has some organizational properties that make it easier to analyse. Ex: XML data.
Unstructured Data:
Data that is not organized in a predefined manner or doesn’t have a predefined data model, and is thus not a good fit for a mainstream relational database.
Ex: Word documents, PDFs, text files etc.
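The three categories can be contrasted with a short sketch. The sample records, names and cities below are made up for illustration. CSV stands in for a relational table, and only Python standard-library modules are used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in a relational table (hypothetical rows).
table = io.StringIO("id,name,city\n1,Asha,Bengaluru\n2,Ravi,Mysuru\n")
rows = list(csv.DictReader(table))
cities = [row["city"] for row in rows]

# Semi-structured: tags give some organization, but fields may be missing.
xml_text = (
    "<people>"
    "<person name='Asha'><city>Bengaluru</city></person>"
    "<person name='Ravi'/>"
    "</people>"
)
root = ET.fromstring(xml_text)
xml_cities = [p.findtext("city", default="unknown") for p in root.iter("person")]

# Unstructured: free text has no data model, so we can only search it.
note = "Asha moved to Bengaluru last year. Ravi is still deciding."
mentions_bengaluru = "Bengaluru" in note

print(cities, xml_cities, mentions_bengaluru)
```

Every structured row is guaranteed to have a `city` column, the XML record for Ravi simply omits it, and the free text needs further processing before any field can be extracted.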
Source: https://www.slidegeeks.com/pics/dgm/l/f/Forms_Type_Of_Big_Data_Ppt_PowerPoint_Presentation_Infographic_Template_Slide_1-.jpg
Data Information
Source: guru99.com
Information Concepts
Source: https://learningforsustainability.net
Science
■ The word "science" comes from the Latin word scientia, meaning knowledge.
■ Science is a systematic enterprise that builds and organizes
knowledge in the form of testable explanations and
predictions about the universe.
Why do we need Data Science?
Source: https://static.seekingalpha.com/uploads/2020/1/14/50485001-15789998083991578_origin.png
The main reason we need data science is its ability to process and interpret data. This enables users and industries to make informed decisions and supports their growth, optimization, and performance.
Unstructured data is generated everywhere, every second. It isn't well organized or easy to access, but it is growing enormously, so analyzing and drawing inferences from it is crucial.
Data Science provides a number of methods and techniques to deal with such data.
This helps many businesses and industries improve their productivity significantly.
How is Data generated?
Tons of data are generated each day.
Some of the major sources of data are: web, databases, media, IoT, cloud etc.
A snapshot of the data generated in a day over the internet:
● 500 million tweets are sent
● 294 billion emails are sent
● 4 petabytes of data are created on Facebook
● 4 terabytes of data are created from each connected car
● 65 billion messages are sent on WhatsApp
● 5 billion searches are made
Slide courtesy: Dr. Uma
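By 2025, it’s estimated that 463 exabytes of data will be created each day globally. At 4.7 GB per single-layer DVD, that is roughly 98.5 billion DVDs per day; the commonly quoted figure of 212,765,957 DVDs corresponds to about one exabyte.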
Source: theblog.adobe.com
Slide courtesy: Dr. Uma
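The DVD comparison on this slide can be checked with a little arithmetic. The sketch below assumes decimal units (1 exabyte = 10^18 bytes) and a single-layer DVD capacity of 4.7 GB. Both are assumptions, not figures taken from the slide's source.

```python
EXABYTE = 10**18        # bytes, decimal units (assumption)
DVD_BYTES = 47 * 10**8  # 4.7 GB single-layer DVD (assumed capacity)

# Whole DVDs needed to hold one exabyte, and 463 exabytes.
dvds_per_exabyte = EXABYTE // DVD_BYTES
dvds_for_463_exabytes = 463 * EXABYTE // DVD_BYTES

print(f"{dvds_per_exabyte:,} DVDs hold about 1 exabyte")
print(f"{dvds_for_463_exabytes:,} DVDs hold about 463 exabytes")
```

Integer arithmetic is used so the counts are exact rather than subject to floating-point rounding.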
Data generation
● In 2014, Oscars host Ellen DeGeneres’ “celeb selfie” tweet was viewed 26 million times across the Web during a 12-hour period.
● More than one billion hours of TV shows and movies are streamed from Netflix per month.
● Walmart handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes (the equivalent of 167 times the books in America's Library of Congress).
● Facebook is home to 40 billion photos.
Source: https://twitter.com/theellenshow
Source: https://trak.in/tags/business/2014/04/15/digital-data-universe-expansion-2020/
Growth in Data generation
The total amount of data created, captured, copied and consumed globally has been increasing exponentially.
In 2020, the amount of data created and replicated was higher than expected because of increased demand during the pandemic.
By 2025, global data creation is projected to grow to more than 180 zettabytes.
How much of data is put into use?
Source: IDC, 2014
Though a huge amount of data is generated each day, it serves no purpose if it is left unused.
This can lead to information overload, where there is an overabundance of information but it is not put to work due to lack of time, resources or understanding of the information, irrelevance of the information, or other reasons.
Thus, it is important to understand the data and know how to use it in the right manner.
No one knows how to use it
Source: https://image.slidesharecdn.com/instroductiontodatascience-160420090623/95/introduction-to-data-science-38-638.jpg?cb=1461307670
But is data all we need?
The graph below shows a strong correlation between ‘Age of Miss America’ and ‘Murders by steam, hot vapour and hot objects’, yet a cause and effect relationship between the two is implausible.
Thus, we see that the presence of an interesting pattern need not imply a meaningful relationship.
Blindly applying various processes and techniques to data can result in incorrect inferences.
Source: https://i2.wp.com/boingboing.net/wp-content/uploads/2016/02/chart.jpg?fit=800%2C315&ssl=1
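A small sketch shows how easily a high correlation can appear between quantities that have nothing to do with each other. The two series below are made-up illustrative numbers (not the Miss America or murder data). Both simply rise over time, and the `pearson` helper is a plain implementation of the Pearson correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up yearly values for two unrelated quantities that both happen to rise.
ice_cream_sales = [20, 22, 25, 27, 30, 34, 35, 39]
library_members = [110, 118, 121, 130, 138, 141, 150, 155]

r = pearson(ice_cream_sales, library_members)
print(f"correlation = {r:.2f}")
```

The coefficient comes out close to 1 even though neither quantity influences the other; a shared upward trend is enough to produce it.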
The following work highlights the risk of amplifying and reinforcing biases present in data when machine learning is applied to it blindly.
Source: https://arxiv.org/abs/1607.06520
Learn how to use data
The above examples help us understand that we need to learn
how to utilize and handle the available data in the right manner
to be able to arrive at correct results and draw meaningful
inferences.
➔ Explore: identify patterns
➔ Predict: make informed guesses
➔ Infer: quantify what you know
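The three steps can be illustrated on a tiny dataset. The numbers below (hours studied and exam scores for six students) are hypothetical, and the least-squares line and standard error are just one simple choice for each step.

```python
import math
import statistics

# Hypothetical data: hours studied and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

# Explore: summarize the data to look for a pattern.
mean_hours = statistics.mean(hours)
mean_score = statistics.mean(scores)

# Predict: fit a least-squares line and make an informed guess for 7 hours.
sxy = sum((h - mean_hours) * (s - mean_score) for h, s in zip(hours, scores))
sxx = sum((h - mean_hours) ** 2 for h in hours)
slope = sxy / sxx
intercept = mean_score - slope * mean_hours
predicted = intercept + slope * 7

# Infer: quantify uncertainty in the mean score via its standard error.
std_error = statistics.stdev(scores) / math.sqrt(len(scores))

print(f"mean score = {mean_score}, predicted score at 7 hours = {predicted:.1f}")
print(f"standard error of the mean = {std_error:.2f}")
```

Exploration suggests that scores rise with hours, prediction turns that pattern into a guess for a new case, and inference attaches a measure of uncertainty to what was estimated.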
Source: slidesharecdn.com
Data Science project life cycle
The correct process of using available data is shown in this life
cycle. It outlines the major stages in a data science project.
Source: https://static.javatpoint.com/tutorial/data-science/images/data-science-lifecycle.png
Source: https://res.cloudinary.com/practicaldev
Data Scientist
Data Scientists, in simple words, are those who make sense of all the available data and figure out what can be done with it.
Source: proschoolonline.com
Source: edureka!
What does a Data Scientist do?
They are responsible for collecting, analyzing, modelling and
interpreting large amounts of data. Their role combines Computer
Science, Mathematics, Statistics etc.
Source: https://edvancer.in/wp-content/uploads/2015/11/76c99311fc4be19bf4353cfc3c2e94b2.png
Source: medium.com
Slide courtesy: Dr. Uma
Prerequisites for a Data Scientist
● Curiosity
● Common sense
● Communication skills
Sources: quickanddirtytips.com, dreamstime.com, linkedin.com
Slide courtesy: Dr. Uma
Source: data-flair.training
Slide courtesy: Dr. Uma
Demand for Data Scientist
Data Science is a growing field and a popular, lucrative profession. Glassdoor ranked this profession #2 in 2021 despite the pandemic.
Sources : Glassdoor, Forbes
Source: https://cdn.ttgtmedia.com/rms/onlineimages/business_analytics-data_scientist_01_mobile.png
How is it different from what Statisticians have been doing?
Both Statisticians and Data Scientists work closely with data.
● Statisticians use mathematical equations and statistical
models to analyze data and arrive at conclusions.
● Data Scientists, however, focus on delivering actionable results and sometimes need to deploy the model to a production system.
Source: https://scientistcafe.com/ids/images/softskill1.png
Data Science vs Data Analysis
● Data Science is primarily used to make decisions and predictions, using predictive causal analytics, prescriptive analytics (predictive plus decision science) and machine learning.
● Data Analysis includes descriptive analytics and, to a certain extent, prediction.
Source: https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2017/01/Data-Analyst-vs-Data-Science-1-422x300.png
Source: edureka!
Common tasks in Data Science
Source: Simplilearn
Source: https://static.javatpoint.com/tutorial/data-science/images/how-to-solve-a-problem-in-data-science.png
Course Overview
Data Visualization
Simple Linear Regression
Sources : keydifferences.com, enhancedigitech.com, barnraisersllc.com, edugyan.in,
pinterest.co.uk, gopika – lashitha.blogspot.com
References
Text Book:
Statistics for Engineers and Scientists, William Navidi, 4th Edition, McGraw Hill Education, India
THANK YOU
Dr.Mamatha H R
Professor, Department of Computer Science
+91 80 2672 1983 Extn 712
K.M Mitravinda