Introduction to Data Science

Data Science is an interdisciplinary field that combines statistics, computer science, domain knowledge, and data analysis techniques to extract insights from data. It involves collecting, processing, analyzing, and visualizing data to aid decision making.

MODULE 1

Introduction to data: Structured, Unstructured,


Semi-structured, Data sets & Patterns; Brief history
of Data Science; Introduction to Data Science;
Importance of Data Science; Differences between AI,
ML, DL, Data Science & Data Analytics; Real world
applications of data science; Steps in data science
process.
Ethical and privacy implications of Data Science;
Tools and Skills Needed — Introduction to platforms,
tools, frameworks, languages, databases and
libraries; Current trends & major research challenges
in data science.

What is Data?

Data refers to raw facts and figures, which can be numbers, text, images, audio, video, symbols, etc. By itself, data may not convey meaning. But when processed correctly, data provides information that helps in decision making.

Example
Raw Data: 100, 85, 93 — meaningless unless we
know what it represents.
If we say: "These are marks of students in Mathematics", now the same data becomes useful information.

Knowledge
When we analyze and understand patterns from
information, it becomes knowledge.
Example: ”On average, students perform better in
Math if they attend all classes regularly.”
In simple words: Knowledge = Insights +
Understanding from Information

Therefore:
Data → Processing → Information → Knowledge

Importance of Classifying Data

In today's digital world, we generate data from various sources — social media, sensors, IoT devices, websites, etc. The nature of data differs depending on the source. To effectively store, manage, process, and analyze it, we need to classify data into different types.

This classification is mainly done into:


1 Structured Data

2 Unstructured Data

3 Semi-structured Data

Structured Data
Definition:
Structured data refers to data that is organized and
formatted in a fixed schema, such as rows and
columns in a relational database (SQL). Each column
represents a field or attribute (e.g., Name, Age,
Salary). Each row represents a record (e.g., data for
one person).

Characteristics:
Stored in tables (databases, spreadsheets).
Has a fixed format and predefined schema.
Easy to search and query (using SQL).
Well suited for statistical analysis.

Example:

Name Age Salary
John 28 50,000
Mary 25 45,000
Akash 30 55,000
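Because structured data has a fixed schema, it maps directly onto SQL. A minimal sketch using Python's built-in sqlite3 module; the table name and query are illustrative, the rows are the ones from the example above:

```python
import sqlite3

# In-memory database with a fixed schema: each column is an attribute
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("John", 28, 50000), ("Mary", 25, 45000), ("Akash", 30, 55000)],
)

# Structured data is easy to query: employees earning above 46,000
rows = conn.execute(
    "SELECT name, salary FROM employees WHERE salary > 46000 ORDER BY salary"
).fetchall()
print(rows)  # [('John', 50000), ('Akash', 55000)]
```

The fixed rows-and-columns layout is exactly what makes the WHERE/ORDER BY query possible.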
Structured Data: Real-life Scenarios

Examples:
Banking system: customer records
College management system: student details
Inventory management: product stock data

Real-life Scenarios:
Airline ticket booking databases
Hospital patient records
Railway reservation system
Unstructured Data
Definition:
Unstructured data refers to data that does not follow
any specific format or structure. It is not organized in
rows/columns. Needs special tools (NLP, AI, ML) for
analysis.
Characteristics:
No predefined schema.
Cannot be easily stored in databases.
Usually text-heavy, image-heavy or multimedia data.
Requires advanced techniques to process.
Examples:

Data Type Example
Text Emails, PDF documents, WhatsApp messages
Images Instagram photos, X-ray images
Audio Voice messages, customer service calls
Video YouTube videos, CCTV footage

Unstructured Data: Real-life Scenarios

Customer reviews on Amazon
Tweets on Twitter
CCTV camera feeds
Lecture videos on YouTube
More than 80% of real-world data is unstructured.

Semi-structured Data

Definition:
Semi-structured data refers to data that is partially organized
— does not fit perfectly into tables, but still contains
tags, markers, or key-value pairs to separate data
elements.

Characteristics:
No rigid table format.
Uses tags/keys to give structure.
Flexible and scalable.
Suitable for handling web data, APIs.

Semi-structured Data: Examples and Scenarios
Examples:
JSON: {"name": "John", "age": 28, "salary": 50000}
XML:
<employee>
<name>John</name>
<age>28</age>
<salary>50000</salary>
</employee>
Real-life Scenarios:
E-commerce websites: product details stored in JSON
Social media APIs: posts/comments data
Mobile app logs: user activity in key-value format
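Semi-structured records like the JSON above can be parsed with Python's standard json module; a short sketch (the field names follow the example above, the second record is invented to show schema flexibility):

```python
import json

# A semi-structured record: key-value pairs give partial structure
record = '{"name": "John", "age": 28, "salary": 50000}'

data = json.loads(record)          # parse text into a Python dict
print(data["name"], data["age"])   # fields accessed by key, not column position

# Records need not share one schema: another record can add or omit keys
record2 = json.loads('{"name": "Mary", "skills": ["SQL", "Python"]}')
print(record2.get("age"))          # None — the field simply isn't there
```

This flexibility is why semi-structured stores (NoSQL, document DBs) are a natural fit for web and API data.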

Comparison Table

Feature Structured Data Unstructured Data Semi-structured Data
Format Tables (Rows/Cols) No format Tags, key-values
Schema Fixed No Partial
Storage Relational DB File systems, Cloud NoSQL, Document DB
Example Student records Images, Videos JSON, XML, HTML
Analysis SQL, Stats NLP, CV, DL Data parsing
Flexibility Low High Medium

Importance of Understanding Data Types
Storage system: Understanding data type helps select
the right storage solution, such as using SQL
databases for structured data, NoSQL for flexible
semi-structured data, or file systems for unstructured
data like images and videos.
Processing technique: Knowing the data type guides
the choice of processing method — using ETL
pipelines for structured data, machine learning
algorithms for pattern recognition, NLP for text
analysis, or computer vision techniques for image
data.
Analysis method: Identifying data type allows us to
choose proper analysis tools — BI tools are
effective for structured data reporting, while AI
tools are needed for analyzing unstructured or
complex data types.
Modern context: In today's AI and Data Science world, where around 80% of data is unstructured, it is essential to build systems and workflows that can handle diverse data types efficiently.

Data Sets and Patterns
What is a Data Set?
A Data Set is a collection of related data items,
usually presented in a tabular or organized form.
It contains multiple records (rows) and attributes (columns).
Each row = 1 instance/record
Each column = 1 attribute/feature

Example — Student Marks Data Set


Student Name Age Subject Marks
John 18 Math 90
Mary 19 Science 85
David 18 Math 75
This entire table is called a Data Set.

Definition and Sources of Data Sets
Definition:
In Data Science: “A Data Set is a structured
collection of data that is typically used for analysis,
training machine learning models, or drawing
conclusions.”

Sources of Data Sets:


Government portals (data.gov)
Kaggle (online data science platform)
Sensors (IoT devices)
Business transaction logs
Social media feeds
Scientific experiments
Types of Data Sets and Importance
Types of Data Sets:
Type Example
Numerical Data Set Heights, Weights, Prices
Categorical Data Set Gender, Product categories
Time-Series Data Set Stock prices over time
Image Data Set Collection of labeled images
Text Data Set Collection of Tweets

Importance of Data Sets in Data Science


Data Sets are the foundation of data analysis and AI.
Without data sets → No analysis possible → No model building.
"Garbage In, Garbage Out" → If data set quality is bad, the analysis will be poor.
Good Data Set = Useful Patterns = Better Decisions

What are Patterns in Data?

A pattern in data means a regularity, trend, or relationship found among the data items.
Pattern = Hidden structure inside the data
Finding patterns = Core goal of data science

Examples of Patterns:
E-commerce: “Customers who buy baby diapers
often buy baby wipes.”
Banking: “High-value customers usually visit bank
during weekends.”
Healthcare: “High cholesterol patients over 50
years often develop hypertension.”

Types of Patterns

Pattern Type Example
Association People who buy A also buy B
Correlation More study hours → Higher marks
Sequence Customer usually buys milk first, then bread
Trends Increasing sales during festive seasons
Outliers Fraudulent transactions with large amounts
Clusters Grouping similar customers for marketing

Types of Patterns in Data — Association and Correlation

Association
Finding items or events that occur together frequently in the data.
Goal: Discover "if a customer buys A, they are likely to buy B."
Example: In a supermarket, 60% of customers who buy diapers also buy baby wipes.
Use case: Market Basket Analysis — used in retail for cross-selling.

Correlation
Finding how two variables are related — when one changes, does the other also change?
Goal: Measure strength and direction of relationship between variables.
Example:
More study hours → higher exam marks
More exercise → lower body weight
Technical term: Correlation Coefficient (r), value between -1 and +1.
Use case: Predictive modeling, feature selection in ML.
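The correlation coefficient r can be computed directly from its definition (covariance divided by the product of standard deviations); a minimal pure-Python sketch with made-up study-hours data:

```python
import math

# Toy data: study hours vs exam marks (illustrative numbers)
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 65, 70, 79]

def pearson_r(x, y):
    """Pearson correlation coefficient: covariance / (std_x * std_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r(hours, marks)
print(round(r, 3))  # close to +1: strong positive relationship
```

A value near +1 means the two variables rise together; near -1 means one falls as the other rises; near 0 means little linear relationship.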
Types of Patterns in Data — Sequence and Trends

Sequence (Sequential Patterns)
Finding the order in which events occur — what happens after what.
Goal: Understand temporal (time-based) patterns.
Example:
In online shopping: Customers first search for "shoes" → then buy "socks".
In stock markets: Price of a stock increases after the company announces good news.
Use case: Recommendation engines, stock prediction.

Trends
Discovering long-term movements or changes in data over time.
Goal: Identify overall direction (upward, downward, seasonal).
Example:
Increasing sales of woolen clothes in winter season.
Rising housing prices over 5 years.
Use case: Business forecasting, sales planning, investment decisions.

Types of Patterns in Data — Outliers and Clusters

Outliers
Meaning: Detecting data points that are very different from the others.
Goal: Identify unusual events or errors.
Example:
Fraudulent bank transaction: suddenly a very large amount withdrawn.
Sensor error: suddenly reading temperature as 500 degrees.
Use case: Fraud detection, quality control, anomaly detection.

Clusters
Meaning: Grouping similar data points into clusters based on features.
Goal: Find natural groupings in data.
Example:
Grouping customers with similar buying behavior.
Grouping patients with similar symptoms.
Use case: Customer segmentation, targeted marketing, personalized medicine.
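A common simple way to flag outliers is the z-score rule: mark any point more than a fixed number of standard deviations from the mean. A sketch with invented transaction amounts, using the standard statistics module:

```python
import statistics

# Daily transaction amounts; one value is suspiciously large
amounts = [120, 135, 110, 150, 140, 125, 5000, 130]

mean = statistics.mean(amounts)
sd = statistics.stdev(amounts)

# Flag points more than 2 standard deviations from the mean
outliers = [a for a in amounts if abs(a - mean) > 2 * sd]
print(outliers)  # [5000]
```

In practice the threshold (here 2 standard deviations) is a tunable choice, and fraud-detection systems use far more sophisticated models, but the idea is the same: unusual = far from the bulk of the data.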

How are Patterns Found?

Patterns in data are found using:
Statistical analysis
Data visualization (plots, graphs)
Machine Learning models
Association rule mining (e.g., Apriori algorithm)
Clustering algorithms (e.g., K-Means)

How are Patterns Found? (Part 1)

1. Statistical Analysis
Use descriptive statistics (mean, median, standard deviation).
Find relationships between variables (correlation, regression).
Example: Find if higher age correlates with higher cholesterol.

2. Data Visualization
Using charts/plots to visually detect patterns.
Tools: Bar chart, line graph, scatter plot, histogram, heatmap.
Example: Visualizing study hours vs marks — linear pattern seen.

How are Patterns Found? (Part 2)

3. Machine Learning Models
ML algorithms learn patterns automatically from data.
Example: Decision trees learn which features are important to predict customer churn.
Use case: Predictive models, classification, regression.

4. Association Rule Mining (e.g., Apriori Algorithm)
Special algorithms used to find item associations.
Example: {Milk, Bread} → {Butter} (people buying milk and bread often also buy butter).
Use case: Market basket analysis, retail promotions.
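Support and confidence, the two measures behind Apriori-style rules, can be computed directly from transactions; a sketch with an invented set of shopping baskets for the rule {milk, bread} → {butter}:

```python
# Toy transactions for market basket analysis (illustrative items)
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

antecedent, consequent = {"milk", "bread"}, {"butter"}

# Support: fraction of all transactions containing antecedent ∪ consequent
both = sum(1 for t in transactions if antecedent | consequent <= t)
support = both / len(transactions)

# Confidence: of the baskets with {milk, bread}, how many also add butter?
has_antecedent = sum(1 for t in transactions if antecedent <= t)
confidence = both / has_antecedent

print(support, confidence)  # 0.4 and about 0.67
```

The full Apriori algorithm adds an efficient search over all candidate item sets; this sketch only evaluates one given rule.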

5. Clustering Algorithms (e.g., K-Means)

Used to group similar data points.
Example: K-Means finds 3 clusters in customer spending patterns: High spenders, Medium spenders, Low spenders.
Use case: Targeted marketing, image segmentation.
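The K-Means idea — assign each point to its nearest centroid, move each centroid to the mean of its cluster, repeat — can be sketched in one dimension; the spending figures and starting centroids below are invented:

```python
# Minimal 1-D K-Means on customer spending (illustrative amounts)
spending = [100, 120, 110, 900, 950, 920, 4000, 4200, 4100]

def kmeans_1d(points, centroids, iterations=10):
    """Assign each point to the nearest centroid, then move centroids
    to the mean of their cluster; repeat."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centroids = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centroids)

centers = kmeans_1d(spending, centroids=[0, 1000, 5000])
print(centers)  # the low, medium and high spender averages
```

Real K-Means works in many dimensions with Euclidean distance (scikit-learn's KMeans is the usual tool); the one-dimensional version just makes the assign/recompute loop easy to see.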

Brief History of Data Science

What is Data Science? (Quick Recap)

Data Science is an interdisciplinary field combining:
Mathematics & Statistics
Computer Science (Programming)
Domain Knowledge
Goal: Extract useful knowledge and insights from data to support decision making.

1960s - 1970s

1960s: Data Processing
Use of mainframe computers to process structured data.
Focus on automating repetitive tasks (e.g., payroll, inventory).
Example: Banks maintained customer accounts on mainframes.

1970s: Relational Databases and SQL

Edgar F. Codd introduced the Relational Database Model.
Data stored in tables (rows/columns).
SQL developed for querying data.
Example: Airline companies used relational databases for reservations.

1980s - 1990s

1980s: Business Intelligence (BI)
Companies began analyzing data for insights.
Rise of BI tools (reports, dashboards, graphs).
Managers made data-driven decisions.
Example: Sales reports guided marketing campaigns.

1990s: Data Mining and Knowledge Discovery

Goal: Find hidden patterns, trends, relationships.
Popular techniques: association rules, clustering, decision trees.
Process formalized as Knowledge Discovery in Databases (KDD).
Example: Market Basket Analysis in supermarkets (diapers & baby wipes).
2000s - 2010s

2000s: Big Data Era
Explosion of data from internet, social media, IoT.
Challenges: Volume, Velocity, Variety (3Vs).
Tools developed: Hadoop, MapReduce.
Example: Google improved search with Big Data.

2010s: Rise of Data Science


Traditional BI not sufficient.
Needed new skills: Python, R, Big Data, ML, Data
Visualization, Ethics.
Role of Data Scientist emerged.
Example: Netflix used Data Science for movie
recommendations.
2010s onward and Present Trends

2010s onward: AI & Deep Learning Integration
Deep Learning (DL) outperforms traditional models.
Neural Networks became popular with GPUs, TensorFlow, PyTorch.
Example: Siri, Alexa use Data Science and NLP.

Present Trends
AutoML — automate model selection/tuning.
Explainable AI (XAI).
Edge AI — run models on devices.
Cloud + Big Data integration.
Focus on data privacy and ethical AI.

Summary Timeline

Period Milestone
1960s Data Processing (mainframes)
1970s Relational Databases, SQL
1980s Business Intelligence (BI)
1990s Data Mining, KDD
2000s Big Data (Hadoop, MapReduce)
2010s Rise of Data Science, ML/DL
Present AI-driven Data Science, AutoML, XAI

Introduction to Data Science

What is Data Science?
Data Science is an interdisciplinary field that uses:
Scientific methods
Mathematical & statistical techniques
Programming & computing tools
Domain knowledge
to extract knowledge and insights from both structured and unstructured data.

Definition:
“Data Science is the process of collecting, cleaning,
analyzing, and interpreting large amounts of data to
uncover useful patterns, make predictions, and
support decision making.”
Data Science — Interdisciplinary Field

Data Science combines mathematics, statistics, computer science, and domain expertise to extract insights from data. The Venn diagram shows how these fields intersect to form the foundation of Data Science.
Why is Data Science Important? (Part 1)

1. Turning data into actionable insights
Raw data → Information → Knowledge → Insights.
Example: Netflix recommends movies based on viewing data.

2. Supporting data-driven decision making
Moves from gut feeling to data-driven decisions.
Example: Banks analyze transactions; hospitals plan treatments.

3. Gaining competitive advantage
Understand customers, optimize operations, predict trends.
Example: Amazon boosts sales with personalized recommendations.

4. Generating revenue and reducing costs
Identify revenue opportunities, reduce inefficiencies.
Example: Predictive maintenance prevents equipment failures.

5. Driving innovation with AI and ML
Enables: Smart assistants, Self-driving cars, Language translation, Facial recognition.

6. Improving healthcare and saving lives
Predict disease risk, personalize treatments.
Example: AI tools predict heart disease risk.
Why is Data Science Important? (Part 2)

7. Solving social and environmental problems
Governments and NGOs use Data Science to:
Track disease outbreaks
Manage traffic
Monitor pollution
Plan disaster response
Example: COVID-19 infection tracking.

8. Enhancing human understanding

Helps scientists analyze complex data, leading to discoveries in:
Medicine
Climate change
Space research
Psychology
Example: Climate models predict global warming.

Differences between AI, ML, DL, Data Science, and Data Analytics
Artificial Intelligence (AI)
Definition:
AI is the science of making computers or machines "think" and "act" like humans.

AI systems can perform tasks such as:
Recognizing speech
Understanding language
Seeing (computer vision)
Making decisions
Playing games
without being explicitly programmed for each action.

Examples:
Google Assistant, Siri, Alexa — understand voice and give responses
Self-driving cars — detect objects and drive safely
AI in games — computer players in chess
Machine Learning (ML)
Definition:
ML is a subset of AI. It is a technique where machines learn from data and improve their performance without being explicitly programmed.
In simple terms: ML = Learning from data

Examples of ML:
Email spam filters — learn from emails which are spam and which are not.
Netflix recommendations — learns user preferences.
Credit card fraud detection — learns patterns of fraudulent transactions.

Goal of ML:
To make machines learn from past data and predict or make decisions on new data.
Deep Learning (DL)
Definition:
DL is a subset of ML that uses neural networks with many layers (deep networks) to learn complex patterns.

DL can handle:
Images
Videos
Audio
Natural Language
In simple terms: DL = ML using neural networks

Examples of DL:
Face recognition on Facebook
Speech recognition in Google Assistant
Medical image analysis for detecting tumors

Goal of DL:
To learn from complex, large data and perform tasks like vision, speech, language.
Data Science
Definition:
Data Science is a field that combines:
Statistics
Programming
Machine Learning
Domain knowledge
to analyze and interpret data, and extract useful insights.

Examples of Data Science:
Predicting customer churn in telecom
Forecasting sales for next quarter
Analyzing social media sentiment
Recommending personalized products

Goal: Turn raw data into meaningful insights and help businesses or society.
Data Analytics
Definition:
Data Analytics is the process of examining existing data to:
Discover patterns
Generate reports
Support decision making
It is a part of Data Science — but focuses more on analyzing current or past data.

Examples of Data Analytics:
Creating dashboards of sales performance
Analyzing website traffic using Google Analytics
Summarizing past customer purchase behavior

Goal: Understand what happened in the past and why.
Comparison Table

Term What it is Example
AI Making machines "think" Self-driving car
ML Machines learning from data Netflix recommendations
DL ML using deep neural networks Face recognition
Data Science Extracting insights from data Predicting customer churn
Data Analytics Analyzing data for insights Sales dashboards
AI, ML, DL, and Data Science — Relationships

ML is a key part of AI, DL is a subset of ML, and Data Science overlaps with these areas while focusing on extracting insights from data.

Applications of Data Science
Real-world Applications of Data Science — Introduction

Data Science is now used in almost every field — business, healthcare, entertainment, transportation, and more.
Data is everywhere: social media, e-commerce, healthcare, sensors & IoT, scientific research, banking.
Organizations use this data to:
Understand customers
Improve products/services
Predict trends
Make smarter decisions
Solve real-world problems

Healthcare & Banking

Healthcare
Predict disease risk
Personalized treatment
Analyze medical images
Monitor patients
Predict epidemics
Examples
Cancer detection (MRI)
Hospital readmission prediction
Wearables track heart rate
Impact
Better outcomes, faster diagnosis.

Banking & Finance
Fraud detection
Credit scoring
Customer segmentation
Risk analysis
Personalized marketing
Examples
Fraud detection in real-time
Loan default prediction
Algorithmic trading
Impact
Reduced fraud, improved services.
Retail & Transportation

Retail & E-commerce
Personalized recommendations
Dynamic pricing
Inventory management
Sentiment analysis
Targeted ads
Examples
Amazon recommendations
Flipkart dynamic pricing
Zara sales data analysis
Impact
Higher customer satisfaction, optimized inventory.

Transportation & Logistics
Predict traffic
Optimize delivery routes
Demand forecasting
Self-driving tech
Examples
Uber surge pricing
DHL route optimization
Airlines adjust ticket pricing
Impact
Reduced costs, better efficiency.
Entertainment & Social Media

Entertainment & Media
Personalized recommendations
Audience analysis
Content creation
Ad optimization
Examples
Netflix movie suggestions
Spotify song recommendations
YouTube video suggestions
Impact
More engagement, personalized experience.

Social Media
Personalized feed
Targeted advertising
Sentiment analysis
Fake news detection
User behavior prediction
Examples
Facebook/Instagram feed
Twitter trending topics
Fake account detection
Impact
Better moderation, improved engagement.
Govt/Public Sector & Scientific Research

Government & Public Sector
Crime prediction
Traffic management
Public health monitoring
Disaster response
Environmental monitoring
Examples
Predictive policing
Smart traffic lights
COVID-19 spread tracking
Impact
Safer cities, better public services.

Scientific Research
Analyze experimental data
Discover new patterns
Accelerate discoveries
Examples
New planet discovery
Climate modeling
Genomics research
Impact
Faster progress, scientific breakthroughs.
Steps in Data Science Process

Main Steps:
1 Business Objective
2 Data Requirement
3 Data Collection
4 Exploratory Data Analysis
5 Model Building
6 Evaluation
7 Deployment
8 Monitoring (loop after deployment)

Step 1: Business Objective

Understand the problem or goal from a business/organizational point of view.
Without a clear objective, no data analysis will be useful.
Examples of Business Objectives:
Domain Business Objective
E-commerce Predict customer churn
Healthcare Predict disease risk
Banking Detect fraudulent transactions
Transport Optimize delivery routes

Data Scientists must work with domain experts to define objectives clearly.
Step 2 — Data Collection
Next step: Gather data from multiple sources
Internal databases (SQL, NoSQL)
APIs (public or private)
Sensors (IoT devices)
Social media
Web scraping
External purchased datasets
Open data portals (Kaggle, UCI)

Examples of Data Sources:
Bank → Transaction records
Healthcare → EMR (Electronic Medical Records)
E-commerce → User clickstream data
Government → Census data

Note: Ensure good data quality — avoid missing or corrupted records.

Step 3 — Exploratory Data Analysis (EDA)
EDA = "Getting to know your data"
What we do in EDA:
Understand data types (numerical, categorical)
Check for missing values
Find outliers
Visualize: histograms, scatter plots, box plots
Study variable correlations
Identify patterns and trends

Example (Student dataset):
Plot marks vs study hours
Identify outlier students
Correlation between attendance and marks

Why important?
Detect errors/outliers
Guide feature selection for ML models

Monitoring (loop from EDA)

After EDA, we may find:
Some features missing
Poor data quality
Need for more data

Loop back to:
Data Requirement
Data Collection

Continuous Monitoring ensures:
High data quality
Completeness
Alignment with business objectives

Step 4 — Model Building
Purpose: To create predictive models that learn from
data and make accurate predictions or
classifications.
Main Steps:
Algorithm Selection: Choose suitable ML
algorithms (e.g., Decision Tree, SVM, Logistic
Regression, Neural Networks)
Data Splitting: Split dataset into training and
testing sets (e.g., 80%-20%)
Model Training: Feed training data into the
algorithm to learn patterns
Hyperparameter Tuning: Adjust model settings
(e.g., learning rate, max depth) for best results
Cross-Validation: Test model performance on
multiple subsets to avoid overfitting
Output: A trained model ready for evaluation and deployment.
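The data-splitting step above can be sketched with the standard library; the dataset here is synthetic, and the fixed seed just makes the split reproducible:

```python
import random

# Illustrative dataset of (feature, label) pairs
data = [(i, i % 2) for i in range(100)]

# Shuffle with a fixed seed so the split is reproducible
random.seed(42)
random.shuffle(data)

# 80%-20% train/test split, as described in the step above
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters: if the data is ordered (by time, by class), an unshuffled split gives the model a training set that does not represent the test set. In practice, scikit-learn's train_test_split does this in one call.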
Step 5 — Evaluation

Once the model is built → Evaluate performance:
Accuracy on test data
Generalization to new data
Business relevance
Common Metrics:
Problem Type Evaluation Metrics
Classification (Yes/No) Accuracy, Precision, Recall, F1-score
Regression (Numeric) RMSE, MAE, R² Score
Examples:
Spam classifier → 95% accuracy
House price predictor → R² = 0.89
If poor performance → tune or improve model
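The classification metrics in the table can be computed from the confusion-matrix counts; a sketch with invented labels for a toy spam classifier (1 = spam):

```python
# True labels vs model predictions for a toy spam classifier (1 = spam)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of predicted spam, how much really was spam
recall = tp / (tp + fn)      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Accuracy alone can mislead on imbalanced data (a model that never predicts spam is "accurate" if spam is rare), which is why precision, recall, and F1 are reported alongside it.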

Step 6 — Deployment

After evaluation, deploy the model into real-world use.
How deployment happens:
API endpoint
Web app integration
Business system integration
Dashboard with live predictions

Examples:
Fraud detection model → Live in bank system
Netflix recommender → Runs on streaming platform
Disease risk tool → Used in hospital EMR system

Step 7 — Monitoring (Post Deployment)

Even after deployment, monitoring is essential:
Check if performance remains high
Detect changes in data (concept drift)
Adapt to new patterns or behaviors

Examples:
E-commerce → Retrain model monthly as
preferences change
Fraud detection → Update model with new fraud types

Goal: Keep model relevant and effective over time

Summary of Data Science Process

Step What Happens
Business Objective Define the problem to solve
Data Requirement Decide what data is needed
Data Collection Gather data from sources
Exploratory Data Analysis Understand data, find patterns
Monitoring (from EDA) Loop back if data is insufficient
Model Building Train machine learning models using training data
Evaluation Test and measure model performance
Deployment Put model into real-world use
Monitoring (post-deploy) Ensure continued performance and relevance
Ethical and Privacy Implications of Data Science (Part 1)

Introduction
Data Science is powerful — but with great power comes great responsibility.
Data Scientists handle:
Personal data
Sensitive data
Financial data
Health data
Social media data
If misused → leads to privacy violations, discrimination, and harm.
Understanding ethics and privacy is critical.

Ethical and Privacy Implications of Data Science (Part 2)
What is Ethics in Data Science?
Ethics = Principles of right and wrong in data use.
Questions to consider:
Is data used fairly?
With consent?
Does it create bias or discrimination?
Is data use transparent?
Does it respect privacy?
What is Privacy in Data Science?
Privacy = Right of individuals to control:
What data is collected about them
How their data is used
Who has access to their data
Example: A person should know how their medical data is used and consent to it.
Ethical & Privacy Challenges in Data Science
1. Data Collection Without Consent
Companies sometimes collect personal data without informing users.

Example: Mobile apps accessing location, contacts unnecessarily.

Violation: No informed consent → ethical breach.

2. Bias in Data and Algorithms
ML models reflect bias in training data.
Example: Hiring algorithms discriminate; facial recognition performs poorly on dark-skinned faces.
Result: Unfair treatment → violates fairness.

3. Lack of Transparency
AI/ML models often work as black boxes.

Example: Bank rejects loan — no explanation given.

Result: Lack of trust — violates transparency.


Ethical & Privacy Challenges (Contd.)

4. Data Misuse
Data collected for one purpose used for another without consent.
Example: Fitness app health data sold to insurance companies.
Violation: Privacy breach — user never agreed.

5. Data Breaches
Poor security leads to hacking or leaks.
Example: Millions of user records leaked from banks/social media.
Result: Huge privacy loss for users.

Principles of Ethical Data Science

1. Transparency
Users should know:
What data is collected
How it is used
How decisions are made

2. Fairness
Algorithms must not:
Discriminate
Reinforce bias

3. Accountability
Companies must take responsibility for:
Model performance
Data usage

Principles of Ethical Data Science (Part 3)

4. Privacy Protection
Data should be:
Collected with consent
Used only for intended purpose
Stored securely
Anonymized when possible
5. Human-centered AI
AI systems should:
Benefit humans
Respect human values and rights

Tools and Skills Needed in Data Science

1. Introduction
Data Science is interdisciplinary — it blends:
Computer Science
Statistics
Machine Learning
Domain Knowledge
Data Visualization
A Data Scientist needs a well-stocked toolbox — and must know which tool fits which task.

Types of Tools in Data Science

2. Types of Tools in Data Science
Typical project stages:
1 Data Collection
2 Data Cleaning / Preprocessing
3 Data Analysis
4 Visualization
5 Model Building
6 Deployment
7 Monitoring
Students should learn:
Platforms
Languages
Libraries
Databases
Frameworks

Platforms

3. Platforms
Environments for:
Writing code
Running experiments
Storing notebooks
Sharing results
Examples:
Platform Purpose
Jupyter Python code + visualization
Google Colab Free cloud notebooks
Kaggle Kernels Community sharing
Anaconda Bundled Python + packages
AWS Sagemaker Cloud ML platform

Programming Languages

4. Programming Languages
Core skill of a Data Scientist
Popular Languages:
Python — easy, huge libraries
R — statistics, visualization
SQL — querying data
Scala, Java — Big Data tools
Python is #1 → easy syntax, big community, rich libraries

Libraries

5. Libraries — Ready-made packages → do tasks faster
Examples:
Pandas — data manipulation
NumPy — numerical computation
Matplotlib, Seaborn — visualization
Scikit-learn — ML algorithms
TensorFlow, PyTorch — Deep Learning
NLTK, spaCy — NLP
OpenCV — Computer Vision

6. Frameworks — Higher-level tools simplify complex tasks
Popular Frameworks:
TensorFlow, Keras — Deep Learning
PyTorch — research + production DL
Apache Spark — Big Data
HuggingFace — NLP models

Databases

7. Databases
Data stored in:
Relational DB — MySQL, PostgreSQL
NoSQL DB — MongoDB, Cassandra
Cloud — AWS S3, BigQuery
Need SQL skills to extract data

8. Data Visualization Tools
Helps explore data, communicate insights
Examples:
Matplotlib, Seaborn — Python plots
Tableau — dashboards
Power BI — Microsoft tool
Plotly — interactive plots

Cloud Platforms and Big Data Tools

Cloud Platforms
Store large data, run large models
Examples:
AWS (Sagemaker, EC2, S3)
GCP (BigQuery)
Azure ML
Big Data Tools
Hadoop — distributed storage
Spark — fast processing
Hive — SQL querying on big data

Soft Skills and Summary

Soft Skills
Communication — explain results
Storytelling — use data to tell a story
Curiosity — explore and question
Teamwork — work with others

Summary
Mix of technical and soft skills needed
Continuous learning — tools evolve

Current Trends and Major Research Challenges in Data Science — Introduction

1. Introduction
Data Science evolves rapidly — new tools, techniques,
and applications constantly emerge.
Data Scientists must:
Keep up with current trends
Be aware of major research challenges
Understanding latest trends and challenges is critical
to stay competitive.

Current Trends
1. Automated Machine Learning (AutoML)
Automates feature selection, algorithm choice, parameter tuning.
Popular tools: Google Cloud AutoML, H2O.ai AutoML, Azure AutoML
Impact: Easier for non-experts, faster model development.
2. Explainable AI (XAI)
Makes AI transparent, interpretable
Tools: LIME, SHAP
Impact: Builds trust, essential in sensitive domains (healthcare, finance)
3. Edge AI
AI on edge devices (phones, sensors)
Benefits: real-time processing, privacy, less cloud dependence
Examples: Face unlock, smart cameras

4. DataOps & MLOps
DevOps for Data Science:
Version control for data
Automated ML testing
Continuous deployment (CI/CD)
Impact: Scalable, maintainable ML pipelines
5. Ethical and Responsible AI
Focus on reducing bias, ensuring fairness, protecting privacy
Critical concern for companies and governments
6. NLP Advancements (LLMs)
Transformer-based models (BERT, GPT) revolutionize NLP
LLMs: understand context, generate text, translate languages
Examples: ChatGPT, Google Bard

7. Data Democratization
Makes data and tools accessible to non-technical users
Tools:
Self-service BI tools
No-code ML platforms
Open datasets
Impact:
Business users can explore data and build models
Not just limited to data scientists

Major Research Challenges
1. Handling Unstructured Data
Most data = text, images, video, audio
Challenge: Efficient processing, analysis
2. Scalability and Big Data
Data growing exponentially (petabytes, real-time streams)
Challenge: Process large data in real-time
3. Data Privacy and Security
More data → higher privacy risks
Challenges:
Secure storage
Compliance with laws (GDPR, CCPA)
Balance utility vs privacy
4. Bias and Fairness in AI
Bias in data → unfair models
Challenge: detect, remove bias; ensure fairness

5. Interpretability of Complex Models
Deep models often black-box
Challenge: make models interpretable to users/regulators
6. Data Quality and Governance
”Garbage in, garbage out”
Challenge: ensure quality, manage governance across
organizations
7. Bridging Research to Production
Models often fail in real-world deployment
Challenge: build robust, scalable, production-ready models
8. Cost and Energy Efficiency of AI
Training large models consumes huge energy
Challenge: make AI energy-efficient, cost-effective