Module 1
Introduction to Data Science

What is Data?
Example:
Raw Data: 100, 85, 93 — meaningless unless we know what it represents.
If we say: "These are marks of students in Mathematics", now the same data becomes useful information.
Knowledge
When we analyze and understand patterns from
information, it becomes knowledge.
Example: "On average, students perform better in Math if they attend all classes regularly."
In simple words: Knowledge = Insights +
Understanding from Information
Therefore:
Data → Processing → Information → Knowledge
Importance of Classifying Data
1 Structured Data
2 Unstructured Data
3 Semi-structured Data
Structured Data
Definition:
Structured data refers to data that is organized and
formatted in a fixed schema, such as rows and
columns in a relational database (SQL). Each column
represents a field or attribute (e.g., Name, Age,
Salary). Each row represents a record (e.g., data for
one person).
Characteristics:
Stored in tables (databases, spreadsheets).
Has a fixed format and predefined schema.
Easy to search and query (using SQL).
Well suited for statistical analysis.
Example:
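As a minimal sketch of the idea (the table name and values here are hypothetical, not from the slides), structured data can be created and queried with SQL using Python's built-in sqlite3 module:

```python
import sqlite3

# In-memory database with a fixed schema: each column is an attribute,
# each row is one record (hypothetical employee data for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("John", 28, 50000), ("Asha", 35, 72000), ("Ravi", 41, 61000)],
)

# Easy to search and query: SQL filters rows by attribute values.
rows = conn.execute("SELECT name FROM employees WHERE age > 30").fetchall()
print([r[0] for r in rows])  # → ['Asha', 'Ravi']
conn.close()
```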
Structured Data: Real-life Scenarios
Examples:
Banking system: customer records
College management system: student details
Inventory management: product stock data
Real-life Scenarios:
Airline ticket booking databases
Hospital patient records
Railway reservation system
Unstructured Data
Definition:
Unstructured data refers to data that does not follow
any specific format or structure. It is not organized in
rows/columns. Needs special tools (NLP, AI, ML) for
analysis.
Characteristics:
No predefined schema.
Cannot be easily stored in databases.
Usually text-heavy, image-heavy, or multimedia data.
Requires advanced techniques to process.
Examples:
Data Type   Example
Text        Emails, PDF documents, WhatsApp messages
Images      Instagram photos, X-ray images
Audio       Voice messages, customer service calls
Video       YouTube videos, CCTV footage
Unstructured Data: Real-life Scenarios
Customer reviews on Amazon
Tweets on Twitter
CCTV camera feeds
Lecture videos on YouTube
More than 80% of real-world data is unstructured.
Semi-structured Data
Definition:
Semi-structured data refers to data that is partially organized
— does not fit perfectly into tables, but still contains
tags, markers, or key-value pairs to separate data
elements.
Characteristics:
No rigid table format.
Uses tags/keys to give structure.
Flexible and scalable.
Suitable for handling web data, APIs.
Semi-structured Data: Examples and Scenarios
Examples:
JSON: { "name": "John", "age": 28, "salary": 50000 }
XML:
<employee>
<name>John</name>
<age>28</age>
<salary>50000</salary>
</employee>
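Both formats can be parsed with Python's standard library; the tags/keys are what give the data its partial structure. A small sketch using the same employee record:

```python
import json
import xml.etree.ElementTree as ET

# The same record in the two semi-structured formats shown above.
json_text = '{"name": "John", "age": 28, "salary": 50000}'
xml_text = "<employee><name>John</name><age>28</age><salary>50000</salary></employee>"

record = json.loads(json_text)        # keys separate the data elements
employee = ET.fromstring(xml_text)    # tags separate the data elements

print(record["age"])                  # → 28
print(employee.find("salary").text)   # → 50000
```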
Real-life Scenarios:
E-commerce websites: product details stored in JSON
Social media APIs: posts/comments data
Mobile app logs: user activity in key-value format
Comparison Table
Aspect    Structured                Semi-structured             Unstructured
Schema    Fixed, predefined         Partial (tags/key-value)    None
Storage   SQL tables, spreadsheets  JSON/XML, NoSQL             Files (text, images, video)
Example   Bank customer records     Product details in JSON     CCTV footage, emails
Importance of Understanding Data Types
Storage system: Understanding data type helps select
the right storage solution, such as using SQL
databases for structured data, NoSQL for flexible
semi-structured data, or file systems for unstructured
data like images and videos.
Processing technique: Knowing the data type guides
the choice of processing method — using ETL
pipelines for structured data, machine learning
algorithms for pattern recognition, NLP for text
analysis, or computer vision techniques for image
data.
Analysis method: Identifying data type allows us to
choose proper analysis tools — BI tools are
effective for structured data reporting, while AI
tools are needed for analyzing unstructured or
complex data types.
Modern context: In today’s AI and Data Science world, where around 80% of data is unstructured, it is essential to build systems and workflows that can handle diverse data types efficiently.
Data Sets and Patterns
What is a Data Set?
A Data Set is a collection of related data items,
usually presented in a tabular or organized form.
It contains multiple records (rows) and attributes (columns).
Each row = 1 instance/record
Each column = 1 attribute/feature
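A tiny illustration of rows and attributes (the names and marks are hypothetical), using a plain list of dicts so no extra library is needed:

```python
# A tiny data set: each dict is one record (row); the keys are the
# attributes (columns). All values are hypothetical.
data_set = [
    {"name": "Asha", "math_marks": 93, "attendance": 0.95},
    {"name": "Ravi", "math_marks": 85, "attendance": 0.80},
    {"name": "John", "math_marks": 100, "attendance": 0.98},
]

n_records = len(data_set)              # number of rows (instances)
attributes = list(data_set[0].keys())  # column (feature) names
print(n_records, attributes)           # → 3 ['name', 'math_marks', 'attendance']
```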
Definition and Sources of Data Sets
Definition:
In Data Science: “A Data Set is a structured
collection of data that is typically used for analysis,
training machine learning models, or drawing
conclusions.”
What are Patterns in Data?
Examples of Patterns:
E-commerce: “Customers who buy baby diapers
often buy baby wipes.”
Banking: “High-value customers usually visit bank
during weekends.”
Healthcare: “High cholesterol patients over 50
years often develop hypertension.”
Types of Patterns
Types of Patterns in Data
Association and Correlation
Association
Finding items or events that occur together
frequently in the data.
Goal: Discover "if a customer buys A, they are likely to buy B."
Example: In a supermarket, 60% of customers who buy diapers also buy baby wipes.
Use case: Market Basket Analysis — used in
retail for cross-selling.
Correlation
Finding how two variables are related — when one
changes, does the other also change?
Goal: Measure strength and direction of relationship
between variables.
Example:
More study hours → higher exam marks
More exercise → lower body weight
Technical term: Correlation Coefficient (r) value
between -1 and +1.
Use case: Predictive modeling, feature selection in ML.
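The correlation coefficient r can be computed from scratch; here is a short sketch on hypothetical study-hours/marks data (the strongly linear numbers are chosen for illustration):

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r, always between -1 and +1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: study hours vs exam marks.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 65, 71, 80]
r = pearson_r(hours, marks)
print(round(r, 3))  # close to +1: strong positive correlation
```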
Types of Patterns in Data
Sequence and Trends
Trends
Discovering long-term movements or changes in data over time.
Goal: Identify overall direction (upward, downward, seasonal).
Example:
Increasing sales of woolen clothes in winter season.
Rising housing prices over 5 years.
Use case: Business forecasting, sales planning, investment decisions.
Types of Patterns in Data
Outliers and Clusters
Outliers
Meaning: Detecting data points that are very different from others.
Goal: Identify unusual events or errors.
Example:
Fraudulent bank transaction: a suddenly very large amount withdrawn.
Sensor error: a sudden temperature reading of 500 degrees.
Use case: Fraud detection, quality control, anomaly detection.
Clusters
Meaning: Grouping similar data points into clusters based on features.
Goal: Find natural groupings in data.
Example:
Grouping customers with similar buying behavior.
Grouping patients with similar symptoms.
Use case: Customer segmentation, targeted marketing,
personalized medicine.
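The sensor-error example above can be sketched with a simple z-score rule (flag points far from the mean); the readings are hypothetical and the 2-standard-deviation cutoff is just one common choice:

```python
import statistics

# Hypothetical sensor readings with one obvious error (500 degrees).
readings = [21.5, 22.0, 21.8, 22.3, 21.9, 500.0, 22.1]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag points more than 2 standard deviations from the mean as outliers.
outliers = [x for x in readings if abs(x - mean) > 2 * stdev]
print(outliers)  # → [500.0]
```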
How are Patterns Found?
How are Patterns Found? (Part 1)
1. Statistical Analysis
Use descriptive statistics (mean, median, standard
deviation).
Find relationships between variables (correlation,
regression).
Example: Find if higher age correlates with higher
cholesterol.
2. Data Visualization
Using charts/plots to visually detect patterns.
Tools: Bar chart, line graph, scatter plot,
histogram, heatmap.
Example: Visualizing study hours vs marks – linear
pattern seen.
How are Patterns Found? (Part 2)
3. Machine Learning Models
ML algorithms learn patterns automatically from data.
Example: Decision trees learn which features
are important to predict customer churn.
Use case: Predictive models, classification, regression.
Brief History of Data Science
1960s - 1970s
1960s: Data Processing
Use of mainframe computers to process structured data.
Focus on automating repetitive tasks (e.g., payroll, inventory).
Example: Banks maintained customer accounts on mainframes.
1980s - 1990s
1980s: Business Intelligence (BI)
Companies began analyzing data for insights.
Rise of BI tools (reports, dashboards, graphs).
Managers made data-driven decisions.
Example: Sales reports guided marketing campaigns.
Present Trends
AutoML — automate model selection/tuning.
Explainable AI (XAI).
Edge AI — run models on devices.
Cloud + Big Data integration.
Focus on data privacy and ethical AI.
Summary Timeline
Period Milestone
1960s Data Processing (mainframes)
1970s Relational Databases, SQL
1980s Business Intelligence (BI)
1990s Data Mining, KDD
2000s Big Data (Hadoop, MapReduce)
2010s Rise of Data Science, ML/DL
Present AI-driven Data Science, AutoML, XAI
Introduction to Data Science
What is Data Science?
Data Science is an interdisciplinary field that uses:
Scientific methods
Mathematical & statistical techniques
Programming & computing tools
Domain knowledge
to extract knowledge and insights from both structured and unstructured data.
Definition:
“Data Science is the process of collecting, cleaning,
analyzing, and interpreting large amounts of data to
uncover useful patterns, make predictions, and
support decision making.”
Data Science — Interdisciplinary Field
Why is Data Science Important? (Part 2)
Solving social and environmental problems
Governments and NGOs use Data Science to:
Track disease outbreaks
Manage traffic
Monitor pollution
Plan disaster response
Example: COVID-19 infection tracking.
Differences between AI, ML, DL, Data Science, and Data Analytics
Artificial Intelligence (AI)
Definition:
AI is the science of making computers or machines "think" and "act" like humans.
Examples:
Google Assistant, Siri, Alexa — understand voice and give responses
Self-driving cars — detect objects and drive safely
AI in games — computer players in chess
Machine Learning (ML)
Definition:
ML is a subset of AI. It is a technique where
machines learn from data and improve their
performance without being explicitly
programmed.
In simple terms: ML = Learning from data
Examples of ML:
Email spam filters — learn from emails which are
spam and which are not.
Netflix recommendations — learns user preferences.
Credit card fraud detection — learns patterns
of fraudulent transactions.
Goal of ML:
To make machines learn from past data and predict or make decisions on new data.
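A toy illustration of "learning from data" (not a real ML algorithm — just enough to show that the decision rule comes from past examples, not from hard-coded logic; all numbers are hypothetical):

```python
# Past (labeled) examples: marks and whether the student passed.
past_marks  = [30, 35, 42, 55, 60, 72]
past_passed = [False, False, False, True, True, True]

# "Training": learn a threshold as the midpoint between the highest
# failing mark and the lowest passing mark.
highest_fail = max(m for m, p in zip(past_marks, past_passed) if not p)
lowest_pass  = min(m for m, p in zip(past_marks, past_passed) if p)
threshold = (highest_fail + lowest_pass) / 2   # learned from data

def predict(mark):
    """Predict pass/fail for new, unseen data using the learned rule."""
    return mark >= threshold

print(threshold)                 # the learned decision boundary
print(predict(50), predict(40))  # predictions on new marks
```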
Deep Learning (DL)
Definition:
DL is a subset of ML that uses neural networks
with many layers (deep networks) to learn
complex patterns.
DL can handle:
Images
Videos
Audio
Natural Language
In simple terms: DL = ML using neural networks
Examples of DL:
Face recognition on Facebook
Speech recognition in Google Assistant
Medical image analysis for detecting tumors
Goal of DL:
To learn from complex, large data and perform tasks like vision, speech, and language.
Data Science
Definition:
Data Science is a field that combines:
Statistics
Programming
Machine Learning
Domain knowledge
to analyze and interpret data, and extract useful insights.
Data Analytics
Definition:
Data Analytics is the process of examining existing data to:
Discover patterns
Generate reports
Support decision making
It is a part of Data Science — but focuses more on
analyzing current or past data.
AI, ML, DL, and Data Science — Relationships
Real-world Applications of Data Science — Introduction
Healthcare & Banking
Healthcare
Predict disease risk
Personalized treatment
Analyze medical images
Monitor patients
Predict epidemics
Examples:
Cancer detection (MRI)
Hospital readmission prediction
Wearables track heart rate
Impact: Better outcomes, faster diagnosis.
Banking & Finance
Fraud detection
Credit scoring
Customer segmentation
Risk analysis
Personalized marketing
Examples:
Fraud detection in real-time
Loan default prediction
Algorithmic trading
Impact: Reduced fraud, improved services.
Retail & Transportation
Retail & E-commerce
Personalized recommendations
Dynamic pricing
Inventory management
Sentiment analysis
Targeted ads
Examples:
Amazon recommendations
Flipkart dynamic pricing
Zara sales data analysis
Impact: Higher customer satisfaction, optimized inventory.
Transportation & Logistics
Predict traffic
Optimize delivery routes
Demand forecasting
Self-driving tech
Examples:
Uber surge pricing
DHL route optimization
Airlines adjust ticket pricing
Impact: Reduced costs, better efficiency.
Entertainment & Social Media
Entertainment & Media
Personalized recommendations
Audience analysis
Content creation
Ad optimization
Examples:
Netflix movie suggestions
Spotify song recommendations
YouTube video suggestions
Impact: More engagement, personalized experience.

Social Media
Personalized feed
Targeted advertising
Sentiment analysis
Fake news detection
User behavior prediction
Examples:
Facebook/Instagram feed
Twitter trending topics
Fake account detection
Impact: Better moderation, improved engagement.
Govt/Public Sector & Scientific Research
Government & Public Sector
Crime prediction
Traffic management
Public health monitoring
Disaster response
Environmental monitoring
Examples:
Predictive policing
Smart traffic lights
COVID-19 spread tracking
Impact: Safer cities, better public services.
Scientific Research
Analyze experimental data
Discover new patterns
Accelerate discoveries
Examples:
New planet discovery
Climate modeling
Genomics research
Impact: Faster progress, scientific breakthroughs.
Steps in Data Science Process
Main Steps:
1 Business Objective
2 Data Requirement
3 Data Collection
4 Exploratory Data Analysis
5 Model Building
6 Evaluation
7 Deployment
8 Monitoring (loop after deployment)
Step 1: Business Objective
Step 3 — Exploratory Data Analysis (EDA)
EDA = "Getting to know your data"
What we do in EDA:
Understand data types (numerical, categorical)
Check for missing values
Find outliers
Visualize: histograms, scatter plots, box plots
Study variable correlations
Identify patterns and trends
Why important?
Detect errors/outliers
Guide feature selection for ML models
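A first EDA pass can be done with plain Python before reaching for plotting libraries; a sketch on a hypothetical column with missing entries (marked as None):

```python
# Quick EDA-style checks: missing values and basic summary statistics.
# Values are hypothetical; None marks a missing entry.
ages = [23, 25, None, 31, 29, None, 40]

n_missing = sum(1 for a in ages if a is None)
present = [a for a in ages if a is not None]

print("missing:", n_missing)
print("min/max:", min(present), max(present))
print("mean:", round(sum(present) / len(present), 1))
```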
Monitoring (loop from EDA)
Step 4 — Model Building
Purpose: To create predictive models that learn from
data and make accurate predictions or
classifications.
Main Steps:
Algorithm Selection: Choose suitable ML
algorithms (e.g., Decision Tree, SVM, Logistic
Regression, Neural Networks)
Data Splitting: Split dataset into training and
testing sets (e.g., 80%-20%)
Model Training: Feed training data into the
algorithm to learn patterns
Hyperparameter Tuning: Adjust model settings
(e.g., learning rate, max depth) for best results
Cross-Validation: Test model performance on
multiple subsets to avoid overfitting
Output: A trained model ready for evaluation and deployment.
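The data-splitting step above (e.g., 80%-20%) can be sketched in a few lines; the records here are just placeholder IDs, and the fixed seed is one common way to make the split reproducible:

```python
import random

# Hypothetical data set of 10 records, split 80%-20% into train/test.
records = list(range(10))

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(records)  # shuffle before splitting to avoid ordering bias

split = int(0.8 * len(records))
train, test = records[:split], records[split:]
print(len(train), len(test))  # → 8 2
```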
Step 5 — Evaluation
Common Metrics:
Problem Type              Evaluation Metrics
Classification (Yes/No)   Accuracy, Precision, Recall, F1-score
Regression (Numeric)      RMSE, MAE, R² Score
Examples:
Spam classifier → 95% accuracy
House price predictor → R² = 0.89
If poor performance → tune or improve model
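The classification metrics in the table can be computed from scratch; a sketch on hypothetical spam-classifier predictions (1 = spam, 0 = not spam):

```python
# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Count true positives, false positives, and false negatives.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy  = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)          # of predicted spam, how much was spam
recall    = tp / (tp + fn)          # of actual spam, how much was caught
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 2))  # → 0.75 0.75 0.75 0.75
```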
Step 6 — Deployment
Examples:
Fraud detection model → Live in bank system
Netflix recommender → Runs on streaming platform
Disease risk tool → Used in hospital EMR system
Step 7 — Monitoring (Post Deployment)
Examples:
E-commerce → Retrain model monthly as preferences change
Fraud detection → Update model with new fraud types
Summary of Data Science Process
Step                        What Happens
Business Objective          Define the problem to solve
Data Requirement            Decide what data is needed
Data Collection             Gather data from sources
Exploratory Data Analysis   Understand data, find patterns; loop back if data is insufficient
Ethical and Privacy Implications of Data Science (Part 1)
Introduction
Data Science is powerful — but with great power
comes great responsibility.
Data Scientists handle:
Personal data
Sensitive data
Financial data
Health data
Social media data
If misused → leads to privacy violations,
discrimination, and harm.
Understanding ethics and privacy is critical.
Ethical and Privacy Implications of Data Science (Part 2)
What is Ethics in Data Science?
Ethics = Principles of right and wrong in data use.
Questions to consider:
Is data used fairly?
With consent?
Does it create bias or discrimination?
Is data use transparent?
Does it respect privacy?
What is Privacy in Data Science?
Privacy = Right of individuals to control:
What data is collected about them
How their data is used
Who has access to their data
Example: A person should know how their medical data is used and consent to it.
Ethical & Privacy Challenges in Data Science
1. Data Collection Without Consent
Companies sometimes collect personal data without informing users.
3. Lack of Transparency
AI/ML models often work as black boxes.
4. Data Misuse
Data collected for one purpose is used for another without consent.
Example: Fitness app health data sold to insurance companies.
Violation: Privacy breach — user never agreed.
5. Data Breaches
Poor security leads to hacking or leaks.
Example: Millions of user records leaked from banks/social media.
Result: Huge privacy loss for users.
Principles of Ethical Data Science
1. Transparency
Users should know:
What data is collected
How it is used
How decisions are made
2. Fairness
Algorithms must not:
Discriminate
Reinforce bias
3. Accountability
Companies must take responsibility for:
Model performance
Data usage
Principles of Ethical Data Science (Part 3)
4. Privacy Protection
Data should be:
Collected with consent
Used only for intended purpose
Stored securely
Anonymized when possible
5. Human-centered AI
AI systems should:
Benefit humans
Respect human values and rights
Tools and Skills Needed in Data Science
1. Introduction
Data Science is interdisciplinary — it blends:
Computer Science
Statistics
Machine Learning
Domain Knowledge
Data Visualization
A Data Scientist needs a well-stocked toolbox — knowing which tool to use for which task.
Types of Tools in Data Science
2. Types of Tools in Data Science
Typical project stages:
1 Data Collection
2 Data Cleaning / Preprocessing
3 Data Analysis
4 Visualization
5 Model Building
6 Deployment
7 Monitoring
Students should learn:
Platforms
Languages
Libraries
Databases
Frameworks
Platforms
3. Platforms
Environments for:
Writing code
Running experiments
Storing notebooks
Sharing results
Examples:
Platform         Purpose
Jupyter          Python code + visualization
Google Colab     Free cloud notebooks
Kaggle Kernels   Community sharing
Anaconda         Bundled Python + packages
AWS Sagemaker    Cloud ML platform
Programming Languages
4. Programming Languages
Core skill of Data Scientist
Popular Languages:
Python — easy, huge libraries
R — statistics, visualization
SQL — querying data
Scala, Java — Big Data tools
Python is #1 → easy syntax, big community, rich libraries
Libraries
5. Libraries — ready-made packages → do tasks faster
Examples:
Pandas — data manipulation
NumPy — numerical computation
Matplotlib, Seaborn — visualization
Scikit-learn — ML algorithms
TensorFlow, PyTorch — Deep Learning
NLTK, spaCy — NLP
OpenCV — Computer Vision
6. Frameworks — higher-level tools that simplify complex tasks
Popular Frameworks:
TensorFlow, Keras — Deep Learning
PyTorch — research + production DL
Apache Spark — Big Data
HuggingFace — NLP models
Databases
7. Databases
Data stored in:
Relational DB — MySQL, PostgreSQL
NoSQL DB — MongoDB, Cassandra
Cloud — AWS S3, BigQuery
Need SQL skills to extract data
8. Data Visualization Tools
Helps explore data, communicate insights
Examples:
Matplotlib, Seaborn — Python plots
Tableau — dashboards
Power BI — Microsoft tool
Plotly — interactive plots
Cloud Platforms and Big Data Tools
Cloud Platforms
Store large data, run large models
Examples:
AWS (Sagemaker, EC2, S3)
GCP (BigQuery)
Azure ML
Big Data Tools
Hadoop — distributed storage
Spark — fast processing
Hive — SQL querying on big data
Soft Skills and Summary
1. Soft Skills
Communication — explain results
Storytelling — use data to tell a story
Curiosity — explore and question
Teamwork — work with others
Summary
Mix of technical and soft skills needed
Continuous learning — tools evolve
Current Trends and Major Research Challenges in Data Science — Introduction
1. Introduction
Data Science evolves rapidly — new tools, techniques,
and applications constantly emerge.
Data Scientists must:
Keep up with current trends
Be aware of major research challenges
Understanding latest trends and challenges is critical
to stay competitive.
Current Trends
1. Automated Machine Learning (AutoML)
Automates feature selection, algorithm choice,
parameter tuning.
Popular tools: Google Cloud AutoML, H2O.ai AutoML,
Azure AutoML
Impact: Easier for non-experts, faster model development.
2. Explainable AI (XAI)
Makes AI transparent, interpretable
Tools: LIME, SHAP
Impact: Builds trust, essential in sensitive domains
(healthcare, finance)
3. Edge AI
AI on edge devices (phones, sensors)
Benefits: real-time processing, privacy, less cloud dependence
Examples: Face unlock, smart cameras
4. DataOps & MLOps
DevOps for Data Science:
Version control for data
Automated ML testing
Continuous deployment (CI/CD)
Impact: Scalable, maintainable ML pipelines
5. Ethical and Responsible AI
Focus on reducing bias, ensuring fairness, protecting privacy
Critical concern for companies and governments
6. NLP Advancements (LLMs)
Transformer-based models (BERT, GPT) revolutionize NLP
LLMs: understand context, generate text, translate languages
Examples: ChatGPT, Google Bard
7. Data Democratization
Makes data and tools accessible to non-technical users
Tools:
Self-service BI tools
No-code ML platforms
Open datasets
Impact:
Business users can explore data and build models
Not just limited to data scientists
Major Research Challenges
1. Handling Unstructured Data
Most data = text, images, video, audio
Challenge: Efficient processing, analysis
2. Scalability and Big Data
Data growing exponentially (petabytes, real-time streams)
Challenge: Process large data in real-time
3. Data Privacy and Security
More data → higher privacy risks
Challenges:
Secure storage
Compliance with laws (GDPR, CCPA)
Balance utility vs privacy
4. Bias and Fairness in AI
Bias in data → unfair models
Challenge: detect, remove bias; ensure fairness
5. Interpretability of Complex Models
Deep models often black-box
Challenge: make models interpretable to users/regulators
6. Data Quality and Governance
"Garbage in, garbage out"
Challenge: ensure quality, manage governance across
organizations
7. Bridging Research to Production
Models often fail in real-world deployment
Challenge: build robust, scalable, production-ready models
8. Cost and Energy Efficiency of AI
Training large models consumes huge energy
Challenge: make AI energy-efficient, cost-effective