Introduction to Data Science
Dr. Seema Gupta Bhol
Data Science
• Data Science is the science of understanding
data using processes, tools and techniques
which aid in decision making.
• Data science (DS) is a multidisciplinary field
of study whose goal is to address the challenges of
big data.
• It involves techniques for identifying,
collecting, and exploring data using
plots and graphs.
Importance of Data Science
• Increasing usage of the internet, which has generated more data
• Growing usage of smart phones, tablets and digital devices
• Increasing usage of social media
• Increasing computational capability with both hardware and
software becoming powerful by the day.
• Programmers across the world are creating complex
algorithms and contributing to the open source developers’
community.
• Easy and speedy access to such data for every individual or
organization, irrespective of its size
• Storage of data becoming cheaper
Big Data
• Big Data is also data but with a huge size.
• Big Data is a term used to describe a collection of
data that is huge in volume and yet growing
exponentially with time.
• The data is so large and complex that none of the
traditional data management tools are able to
store it or process it efficiently.
• Extremely large data sets that may be analyzed
computationally to reveal patterns, trends, and
associations, especially relating to human behavior
and interaction, are known as Big Data.
Examples Of Big Data
• The New York Stock Exchange generates
about one terabyte of new trade data per day.
• Social media: Statistics show that more than
500 terabytes of new data are ingested into the
databases of the social media site Facebook every
day. This data is mainly generated through
photo and video uploads, message exchanges,
comments, etc.
Examples Of Big Data
• Tracking consumer behavior and shopping habits to
deliver personalized retail product recommendations tailored to
individual customers
• Monitoring payment patterns and analyzing them against historical
customer activity to detect fraud in real time.
• Using AI-powered technologies like natural language processing to
analyze unstructured medical data (such as research reports, clinical
notes, and lab results) to gain new insights for improved treatment
development and enhanced patient care
• Using image data from cameras and sensors, as well as GPS data,
to detect potholes and improve road maintenance in cities
Memory Sizes
Characteristics Of Big Data
The following are known as “Big Data
Characteristics”.
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
Volume
• Volume means “how much data is generated”.
• Nowadays, organizations, people, and
systems generate or receive vast
amounts of data, from terabytes (TB) to petabytes (PB)
to exabytes (EB) and more.
• Volume = Very large amount of data
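To make these units concrete, the sketch below converts a raw byte count into the largest fitting unit. It assumes decimal (SI) prefixes, where each unit is 1000 times the previous one; the function name `human_readable` and the unit list are illustrative choices, not part of the original slides.

```python
# Decimal (SI) units, each 1000 times larger than the one before it.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

print(human_readable(10**12))        # 1.0 TB
print(human_readable(500 * 10**12))  # 500.0 TB
print(human_readable(2 * 10**18))    # 2.0 EB
```

Binary prefixes (1024 per step) are also common in practice; swapping the divisor is the only change needed.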
Velocity
• Velocity means “how fast data is produced”.
• Nowadays, organizations, people, and
systems generate huge amounts of
data at a very fast rate.
• Velocity = Data produced at a very fast rate
Variety
• Variety means “different forms of data”.
• Nowadays, organizations, people, and
systems generate huge amounts of
data at a very fast rate and in different formats.
• Variety = Data produced in different formats
Veracity
• Veracity means “the quality, correctness, or
accuracy of captured data”.
• Among the Vs, it is arguably the most important for any Big
Data solution, because without correct
data there is little use in storing
large amounts of data at a fast rate and in different
formats.
• The data should deliver correct business value.
• Veracity = Correctness of data
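One rough way to gauge veracity is to measure what share of captured records pass basic validity checks. The sketch below is a minimal illustration only; the record fields (`age`, `email`) and the validity rules are hypothetical.

```python
def is_valid(record: dict) -> bool:
    """Apply simple, hypothetical validity rules to one record."""
    age = record.get("age")
    email = record.get("email") or ""
    return isinstance(age, int) and 0 <= age <= 120 and "@" in email

def veracity_score(records: list) -> float:
    """Fraction of records that pass the validity rules (0.0 to 1.0)."""
    if not records:
        return 0.0
    return sum(is_valid(r) for r in records) / len(records)

records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},  # impossible age
    {"age": 51, "email": None},             # missing email
    {"age": 27, "email": "d@example.com"},
]
print(veracity_score(records))  # 0.5
```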
Value
• Value means whether the data is useful to an organization.
• It’s essential to determine the business value
of the data you collect.
• Big data must contain the right data and then
be effectively analyzed in order to yield
insights that can help drive decision-making.
Big Data Analytics
• Big Data analytics is the process of collecting,
organizing and analyzing large sets of data (called
Big Data) to discover patterns and other useful
information.
• Big Data analytics can help organizations to
better understand the information contained
within the data and will also help identify the data
that is most important to the business and future
business decisions.
• Analysts working with Big Data typically want
the knowledge that comes from analyzing the data.
High-Performance Analytics Required
• To analyze such a large volume of data, Big Data
analytics is typically performed using specialized
software tools and applications for predictive analytics,
data mining, text mining, forecasting and data
optimization.
• Collectively these processes are separate but highly
integrated functions of high-performance analytics.
• Using Big Data tools and software enables an
organization to process extremely large volumes of data
that a business has collected to determine which data is
relevant and can be analyzed to drive better business
decisions in the future.
FUNDAMENTAL FIELDS OF STUDY RELATING TO DATA SCIENCE
Theories and techniques from many fields and
disciplines are used to investigate and analyze a large
amount of data to help decision makers in many
industries such as science, engineering, economics,
politics, finance, and education
– Computer Science
• Pattern recognition, visualization, data warehousing, High
performance computing, Databases, AI
– Mathematics
• Mathematical Modeling
– Statistics
• Statistical and Stochastic modeling, Probability.
Data science
• Mathematics and Applied Mathematics
• Applied Statistics/Data Analysis
• Solid Programming Skills (R, Python, Julia, SQL)
• Data Mining
• Database Storage and Management
• Machine Learning and discovery
DATA SCIENCE AND BIG DATA
• They are not the “same thing”.
• Consider an analogy in which data is crude oil:
• Big data is about extracting the “crude oil”, transporting it
in “mega tankers”, siphoning it through “pipelines”,
and storing it in “massive silos”.
• Data science is about refining the “crude oil”.
Aspect | Big Data | Data Science
Definition | Handling and processing vast amounts of data | Extracting insights and knowledge from data
Objective | Efficient storage, processing, and management of data | Analyzing data to inform decisions and predict trends
Primary Tasks | Collection, storage, and processing of data | Data analysis, modeling, and interpretation
Tools/Technologies | Hadoop, Spark, NoSQL databases (e.g., MongoDB) | Python, R, TensorFlow, scikit-learn
Data Types | Structured, semi-structured, and unstructured data | Processed and cleaned data for analysis
Outcome | Accessible data repositories for analysis | Actionable insights, predictive models
Typical Roles | Data Engineers, Big Data Analysts | Data Scientists, Machine Learning Engineers
Applications | Real-time data processing, large-scale data storage | Predictive analytics, data-driven decision making
Key Techniques | Distributed computing, data warehousing | Statistical modeling, machine learning algorithms
Application of Big Data
• Big Data in Healthcare
• Big Data in Education
• Big Data in E-commerce
• Big Data in Media and Entertainment
• Big Data in Finance
• Big Data in Travel Industry
• Big Data in Telecom
• Big Data in Automobile
Big Data in Retail
• Retailers need to understand their customers
better in order to meet their needs.
• Through advanced analysis of customer data,
retailers can now understand their customers from many
angles. They gather this data from various
sources such as social media, loyalty programs, etc.
• This enables them to provide customers with
more personalized services and to predict
demand in advance.
• This helps them build a loyal customer base.
Big Data in Healthcare
• The amount of data the healthcare industry has to deal with is
very large.
• From the search for a cancer cure to detecting Ebola and much
more, Big Data is helping researchers achieve
life-saving outcomes.
• Big Data and analytics are helping to build more personalized
medications.
• Data analysts are harnessing this data to develop
more effective treatments. Identifying unusual patterns associated with
certain medicines in order to find more
economical solutions is a common practice these days.
• Smart wearables generate massive amounts of real-time data,
including alerts that can help save
lives.
Big Data in Education
• Big Data can help shape people’s
futures and has the power to transform the
education system for the better.
• Some of the top universities are using Big Data
as a tool to revise their academic curricula.
• Additionally, universities can track
student dropout rates and take
measures to reduce them as much as
possible.
Big Data in E-commerce
• Some of the biggest e-commerce companies in the
world, such as Amazon, Flipkart, and Alibaba, use Big
Data and analytics.
• The recommendation engine is one of the most
prominent applications of Big Data.
It gives companies a 360-degree
view of their customers.
• Companies then make suggestions to customers accordingly,
and customers experience more personalized services.
• Big Data has completely redefined people’s online
shopping experiences.
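The idea behind a recommendation engine can be sketched with a simple "bought together" counter. This is a minimal illustration under stated assumptions, not how any particular retailer works: the basket data is made up, and `build_cooccurrence` and `recommend` are hypothetical helper names.

```python
from collections import Counter
from itertools import permutations

def build_cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts[(a, b)] += 1
    return counts

def recommend(item, counts, top_n=2):
    """Return the items most often bought together with `item`."""
    related = Counter({b: n for (a, b), n in counts.items() if a == item})
    return [name for name, _ in related.most_common(top_n)]

# Hypothetical purchase histories, one list per order.
baskets = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "earbuds"],
    ["laptop", "mouse"],
    ["phone", "case", "earbuds"],
]
counts = build_cooccurrence(baskets)
print(recommend("phone", counts, top_n=1))  # ['case']
```

Production systems typically use far richer signals (browsing history, ratings, similar users), but the principle of learning from past behavior is the same.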
Big Data in Media and Entertainment
• Viewers these days want content that matches
their preferences and that is fresh
compared with what they watched previously.
• Earlier, companies broadcast ads
randomly, without any kind of analysis.
• Since the advent of Big Data analytics in the
industry, companies know which kinds of
ads attract a customer and the most
appropriate time to broadcast them for
maximum attention.
Big Data in Finance
• For financial firms, data has been the second most important commodity
after money.
• Financial firms were among the earliest adopters of Big
Data and analytics.
• Digital banking and payments are two of the most
prominent Big Data applications.
• Big Data is helping with fraud detection, risk analysis,
algorithmic trading, and customer satisfaction.
• This has made their systems run more
smoothly. Firms can now focus more on
providing better services to their customers rather than
on security issues.
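Fraud detection often starts by comparing a new payment against a customer's historical activity. The sketch below flags a payment whose amount is far from the historical mean, measured in standard deviations (a z-score). It is a simplified illustration; the threshold of 3.0, the sample amounts, and the function name `is_suspicious` are assumptions.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard deviations
    away from the customer's historical mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Hypothetical past payment amounts for one customer.
history = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0]
print(is_suspicious(history, 49.0))   # False
print(is_suspicious(history, 950.0))  # True
```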
Big Data in Travel Industry
“50% Off on Your Next Flight Booking!!”
• Through Big Data and analytics, travel
companies can now offer a more
customized travel experience and
understand their customers’
requirements much better.
• From providing the best offers to
making suggestions in real time, Big
Data can be a valuable guide for any
traveler.
Big Data in Telecom
• The ever-increasing popularity of smartphones has flooded
the telecom industry with massive amounts of data.
• This data is like a goldmine; telecom companies just need to
know how to mine it properly.
• Through Big Data and analytics, companies can provide
customers with smoother connectivity by removing the network
barriers that customers have to deal with.
• With the help of Big Data and analytics, companies can track
the areas with the lowest and the highest network traffic and
take the steps needed to ensure hassle-free network connectivity.
• As in other industries, Big Data has helped the telecom industry
understand its customers well.
• Telecom companies now provide customers with offers that are as
customized as possible.
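Tracking the areas with the lowest and highest network traffic amounts to aggregating usage per area. A minimal sketch follows, assuming traffic samples arrive as hypothetical `(area, gigabytes)` pairs.

```python
from collections import defaultdict

def traffic_by_area(samples):
    """Sum traffic (in GB) per area from (area, gigabytes) samples."""
    totals = defaultdict(float)
    for area, gigabytes in samples:
        totals[area] += gigabytes
    return dict(totals)

# Hypothetical traffic samples reported by cell towers.
samples = [
    ("north", 120.0), ("south", 45.5), ("east", 300.0),
    ("north", 80.0), ("south", 30.0), ("east", 150.5),
]
totals = traffic_by_area(samples)
busiest = max(totals, key=totals.get)
quietest = min(totals, key=totals.get)
print(busiest, quietest)  # east south
```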
Big Data in Automobile
• Predictive maintenance: AI models can analyze historical data and
sensor readings to predict when a vehicle part might fail. This can
help extend the lifespan of the vehicle.
• Manufacturing processes: AI models can analyze data from
manufacturing processes to identify areas for improvement and
inefficiencies.
• Parking areas: Big data can help determine where to build parking
areas.
• Traffic lights and signs: Big data can help identify areas with many
accidents and determine where to install traffic lights or signs.
• Navigation systems: Big data can help set up more accurate
navigation systems.
• Self-driving cars: Self-driving cars use sensors to collect data about
their surroundings, which is then processed and analyzed to create
an environment map.
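As a rough illustration of predictive maintenance, the sketch below fits a straight line to equally spaced sensor readings and estimates how many more readings remain before the trend crosses a failure limit. Real systems use much richer models; the vibration values, the limit, and the function name `readings_until_limit` are hypothetical.

```python
def readings_until_limit(values, limit):
    """Fit a least-squares line to equally spaced readings and estimate how
    many more readings remain before the trend reaches `limit`.
    Returns None if there are too few readings or the trend is not rising."""
    n = len(values)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing = (limit - intercept) / slope  # x position where the line hits the limit
    return max(0.0, crossing - (n - 1))

vibration = [1.0, 1.5, 2.0, 2.5, 3.0]  # hypothetical sensor readings
print(readings_until_limit(vibration, limit=5.0))  # 4.0
```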
Datafication
• Datafication is the process of transforming
various aspects of our lives into data that can
be quantified and analyzed.
• It simply means a process of turning many
physical aspects of life into computerized data.
• Think of it as the digital translation of the real
world.
• E.g., Fitbit datafies our physical activities to
derive useful information.
BFSI (Banking, Financial Services, and Insurance)
CAGR (Compound Annual Growth Rate)
Datafication
The following six steps constitute the data science process:
1. Data Collection:
It all starts with data collection and retrieval.
The data could come from anything we do: clicking on a website, using an
app, or even just walking around with a smartphone. Devices
and sensors collect this data, often without our even noticing.
2. Data Storage:
Once collected, this data needs a place to go. The data is saved in
databases or cloud storage, where it can be accessed and used later.
3. Data Processing:
The raw data collected isn’t very useful on its own.
Data processing involves data cleaning, organizing, and
transforming this data into a more usable format.
4. Data Analysis:
At this stage data becomes valuable information. Using various tools
and techniques, analysts can look at patterns and trends in the data.
For example, vendors might discover that people tend to buy more
ice cream on hot days—a useful insight for a business.
5. Data Visualization:
To make the data easy to understand, it’s often presented visually,
like in charts or graphs. This step helps people quickly grasp the
insights hidden in the data.
6. Data Application:
Finally, the insights gained from data analysis are put to use. This
could mean anything from tweaking a marketing strategy to
designing a new product. For example, if data shows that customers
prefer shopping online at certain times of the day, a business might
run targeted ads during those hours.
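The six steps above can be sketched end to end on a toy version of the ice cream example. Everything here is an assumption for illustration: the observations are made up, SQLite stands in for "databases or cloud storage", and a text bar chart stands in for real plotting tools.

```python
import sqlite3
from statistics import mean

# 1. Collection: hypothetical (temperature in Celsius, ice creams sold) observations.
raw = [(31, 120), (18, 40), (None, 75), (35, 150), (22, 55), (29, 110)]

# 2. Storage: keep the records in an in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (temp_c REAL, units INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", raw)

# 3. Processing: drop rows with a missing temperature reading.
rows = db.execute(
    "SELECT temp_c, units FROM sales WHERE temp_c IS NOT NULL"
).fetchall()

# 4. Analysis: compare average sales on hot days (28 degrees or more) with other days.
hot = mean(u for t, u in rows if t >= 28)
mild = mean(u for t, u in rows if t < 28)

# 5. Visualization: a crude text bar chart, one '#' per 10 units sold.
for label, value in (("hot", hot), ("mild", mild)):
    print(f"{label:>5} | {'#' * round(value / 10)} {value:.1f}")

# 6. Application: act on the insight.
stock_up = hot > mild
print("Stock extra ice cream on hot days:", stock_up)
```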
Digitization and Datafication
• Digitization refers to the process of converting information
from a physical format into a digital one. This could involve
transforming handwritten notes into typed text, scanning a
photograph to create a digital image, or converting analog
audio recordings into digital files. The primary aim of
digitization is to preserve information and make it easier to
store, access, and share using digital technologies.
• Datafication is the process of turning all aspects of life into
quantifiable data through the capture and analysis of data from
various activities and interactions. Datafication involves
extracting data from processes and behaviors that weren’t
previously quantified—like tracking people’s movements via
their smartphones, logging interactions on social media, or
recording shopping habits online. The focus is on turning these
activities into data that can be analyzed to gain insights,
improve services, and predict future behaviors.
Why is Datafication Important?
• Informed Decision-Making
• Personalization
• Efficiency and Innovation
• Enhanced User Experience
• Predictive Capabilities
Real-World Examples of Datafication
Social Media
• Social media platforms like Facebook, Instagram, or Twitter
show content that is closely aligned with our interests.
• These platforms collect data on our interactions: likes, shares,
comments, and the time we spend looking at certain posts. By
analyzing this data, social media companies can tailor our
content feed.
• This isn’t just about keeping us engaged; it’s also about
delivering targeted ads that are relevant to us. Datafication here
means analyzing our behavior and preferences.
Smart Homes
• A smart home adjusts the lighting and temperature, and even plays
a favorite song, as a person walks in.
• Devices like smart thermostats, lights, and security systems
collect data on our daily routines and preferences.
• They learn when we typically get home, our preferred
temperature settings, and even the times when we are usually
away.
• This data helps automate tasks, making our lives more convenient
and our homes more energy-efficient.
Fitness Trackers
• Fitness trackers like a Fitbit or an Apple Watch collect data on
our steps, heart rate, sleep patterns, and more.
• This data isn’t just for show; it helps us understand our health
and fitness levels. For example, by tracking steps and calories
burned, one can set and achieve fitness goals.
• If the heart rate spikes unexpectedly, the device can alert the wearer to
potential health issues.
• Moreover, many fitness apps allow users to share the data with
healthcare providers, giving them valuable insights
that can lead to better, more personalized care.
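The heart-rate alert described above can be approximated by comparing each reading with the average of the readings just before it. This is a simplified sketch, not how any specific device works; the window size, the 30 bpm jump, and the sample readings are assumptions.

```python
def find_spikes(heart_rates, window=5, jump=30):
    """Return indices where a reading exceeds the average of the previous
    `window` readings by more than `jump` beats per minute."""
    spikes = []
    for i in range(window, len(heart_rates)):
        baseline = sum(heart_rates[i - window:i]) / window
        if heart_rates[i] - baseline > jump:
            spikes.append(i)
    return spikes

# Hypothetical readings in beats per minute, one per minute.
readings = [72, 74, 71, 73, 75, 74, 118, 76, 73]
print(find_spikes(readings))  # [6]
```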
Retail and Online Shopping
• Retailers like Amazon and Flipkart track our browsing history,
past purchases, and even the items we looked at but didn’t buy.
By analyzing this data, they can recommend products that are
tailored to our tastes and needs.
• This personalized shopping experience not only makes it easier
for you to find what you’re looking for but also introduces you to
new products you might not have considered otherwise.
Navigation and Ride-Sharing Apps
• Google Maps and ride-sharing apps like Uber or Ola are
excellent examples of datafication in action.
• They collect data from millions of users to provide real-time
traffic updates, optimal routes, and estimated arrival times.
• For ride-sharing apps, this data helps match you with drivers and
calculate fare estimates based on distance, traffic, and time of
day. This not only makes your commute more efficient but also
enhances safety.
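A fare estimate based on distance, traffic, and time of day might be sketched as follows. The formula, the rates, and the peak-hour windows are entirely hypothetical and are not taken from any real ride-sharing service.

```python
def estimate_fare(distance_km, traffic_factor, hour, base=50.0, per_km=12.0):
    """Estimate a fare from distance, a traffic multiplier (1.0 = free flow),
    and the hour of day (0-23). Peak hours add a 25% surcharge."""
    peak = 8 <= hour < 10 or 17 <= hour < 20  # assumed rush-hour windows
    fare = (base + per_km * distance_km) * traffic_factor
    if peak:
        fare *= 1.25
    return round(fare, 2)

print(estimate_fare(10, 1.0, 14))  # 170.0
print(estimate_fare(10, 1.2, 18))  # 255.0
```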
Data Scientists
• Data scientists are the key to realizing the
opportunities presented by big data. They
bring structure to it, find important patterns in
it, and advise executives on the implications
for products, processes, and decisions.
Data scientist
• A data scientist is an expert who uses data to help
organizations solve complex problems and achieve their
goals. They use their skills in statistics, computer
science, business, and communication to:
• Collect, analyze, and interpret large amounts of data
• Uncover trends and challenge assumptions
• Create visual representations of data findings
• Collaborate with other departments to understand their
data needs
• Data scientists use a variety of tools, including machine
learning, artificial intelligence, and statistical analysis.
Roles and responsibilities of data scientists
• Data mining or extracting usable data from valuable data sources
• Using machine learning tools to select features, create and optimize
classifiers
• Carrying out the preprocessing of structured and unstructured data
• Enhancing data collection procedures to include all relevant information
for developing analytic systems
• Processing, cleansing, and validating the integrity of data to be used for
analysis
• Analyzing large amounts of information to find patterns and solutions
• Developing prediction systems and machine learning algorithms
• Presenting results in a clear manner
• Proposing solutions and strategies to tackle business challenges
• Collaborating with business and IT teams
• In addition, some data scientists develop AI technologies for use internally
or by customers, for example conversational AI systems, AI-driven
robots, and other autonomous machines, including key components in
self-driving cars.