Introduction to Data Science
What is Data Science?
Data Science is an interdisciplinary field that extracts insights from
data using techniques from:
• Statistics
• Computer Science
• Mathematics
• Domain Knowledge
It turns raw data into actionable knowledge through processes like
data cleaning, analysis, visualization, and machine learning.
Big Data and Data Science Hype
• Big Data refers to datasets that are too large, fast, or complex
for traditional tools to handle.
• Data Science hype came from the explosion of:
o IoT devices
o Social media platforms
o Online transactions
o Cloud storage
o Mobile apps
Why the hype?
Because companies discovered that data = money when used to:
• Predict behavior
• Optimize operations
• Personalize user experience
• Improve decision-making
Why Now? (The Timing)
• Storage became cheaper
• Computing power increased
• Machine learning libraries matured (like Scikit-learn,
TensorFlow)
• Demand for automation and smarter decisions rose globally
We reached a point where data is being produced faster than ever,
and tools to analyze it are finally accessible.
Datafication
• Datafication is the process of turning actions, processes, or
objects into data.
o Example: Your step count, your Google searches, and
Netflix viewing habits — all are datafied.
Everything we do is now trackable — and analyzable.
Current Landscape of Data Science
• Used in almost every industry: Healthcare, finance, marketing,
e-commerce, sports, and even agriculture.
• Job roles have evolved:
o Data Analyst
o Data Engineer
o Machine Learning Engineer
o Business Intelligence Analyst
o Research Scientist
Data Science is not just tech; it's everywhere decision-making
matters.
Skillsets Needed in Data Science
• Programming: Python, R, SQL
• Math/Stats: Probability, Statistics, Linear Algebra
• Data Handling: Pandas, NumPy, Excel, Databases
• Machine Learning: Supervised/Unsupervised Learning, Scikit-learn, TensorFlow
• Visualization: Matplotlib, Seaborn, Power BI, Tableau
• Domain Knowledge: Understanding the industry you're working in
• Soft Skills: Communication, Problem-solving, Storytelling with data
Introduction to Statistics (for Data Science)
What is Statistical Inference?
Statistical inference is the process of making educated guesses
about a population using data from a sample.
Two major types:
• Estimation – Predicting unknown values (e.g., average height of
all students from a class sample).
• Hypothesis Testing – Testing assumptions (e.g., "Do ads
increase sales?").
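A minimal sketch of the second type, assuming a two-sample t-test on made-up sales figures (the numbers and variable names are purely illustrative):
import numpy as np
from scipy import stats

sales_with_ads = np.array([120, 135, 128, 140, 150, 132])     # hypothetical
sales_without_ads = np.array([110, 118, 125, 115, 122, 119])  # hypothetical

# Two-sample t-test: do the group means differ?
t_stat, p_value = stats.ttest_ind(sales_with_ads, sales_without_ads)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value (e.g., < 0.05) suggests ads are associated with higher sales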
Populations and Samples
• Population: The full group you want to study (e.g., all Twitter users).
• Sample: A subset of the population used to draw conclusions (e.g., 500 Twitter users randomly picked).
Sampling saves time and money — but must be done well to
avoid bias.
Statistical Modeling
A statistical model is a mathematical representation of observed
data.
It usually has:
• Assumptions (e.g., data follows a normal distribution)
• Parameters (e.g., mean, variance)
• A goal: describe or predict real-world outcomes
Example:
Linear Regression →
Sales = α + β × AdSpend + error
Probability Distributions
Distributions tell you how likely different outcomes are.
Type Examples
Discrete Binomial, Poisson
Continuous Normal, Exponential
• Normal Distribution (bell curve): Most common, used in many
real-world cases.
• Binomial Distribution: Used for yes/no outcomes (e.g., coin
toss).
• Poisson: Counts of events in a time period (e.g., calls per hour
in a call center).
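A quick sketch of drawing samples from the three distributions above with NumPy (the parameter values are arbitrary examples):
import numpy as np

rng = np.random.default_rng(42)

normal_sample = rng.normal(loc=0, scale=1, size=1000)    # bell curve
binomial_sample = rng.binomial(n=10, p=0.5, size=1000)   # yes/no trials (10 coin tosses)
poisson_sample = rng.poisson(lam=4, size=1000)           # events per interval (e.g., calls/hour)

print(normal_sample.mean(), binomial_sample.mean(), poisson_sample.mean())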
Fitting a Model
Model fitting = Finding the best mathematical function to describe
your data.
• Use training data to fit the model
• Check performance with test/validation data
• Metrics: MSE (Mean Squared Error), Accuracy, R², etc.
A good model generalizes well to unseen data — not just
memorizes the training set.
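A minimal fit-then-evaluate sketch with scikit-learn, assuming synthetic ad-spend/sales data invented for illustration:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))           # e.g., ad spend
y = 3 * X.ravel() + 5 + rng.normal(0, 2, 200)   # e.g., sales with noise

# Fit on training data, evaluate on held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, pred))
print("R²:", r2_score(y_test, pred))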
Intro to R (for Stats & DS)
R is a powerful language built for statistical analysis and visualization.
• Popular in academia and among statisticians
• Known for packages like:
o ggplot2 – for visualization
o dplyr – for data manipulation
o caret – for machine learning
• Basic example:
# Load a dataset
data(mtcars)
# Summary statistics
summary(mtcars)
# Linear regression
model <- lm(mpg ~ wt + hp, data=mtcars)
summary(model)
Data Analysis in Data Science
Exploratory Data Analysis (EDA)
EDA is the first step in data analysis where you:
• Understand your data
• Spot patterns, trends, or outliers
• Form hypotheses for further analysis
Think of EDA as “data detective work” before jumping to
machine learning or predictions.
Basic Tools of EDA
• Summary Statistics: Mean, median, mode, std deviation, min, max, quartiles
• Visualizations:
o Histograms – distribution of a single variable
o Box plots – detect outliers
o Bar charts – for categorical data
o Scatter plots – check relationships between variables
o Heatmaps – correlation matrices
Also includes: Missing value checks, frequency tables, group-wise aggregation
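A short sketch of these basic EDA tools in pandas, reusing the built-in 'tips' dataset that appears later in these notes:
import seaborn as sns

df = sns.load_dataset('tips')

print(df.describe())                            # summary statistics
print(df.isna().sum())                          # missing value check
print(df['day'].value_counts())                 # frequency table for a categorical column
print(df.groupby('day')['total_bill'].mean())   # group-wise aggregation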
Philosophy of EDA (by John Tukey)
“EDA is not about confirming a hypothesis, but suggesting new
ones.”
• It's non-formal and interactive
• Emphasizes visualizations over complex equations
• Focuses on understanding data before modeling
The idea: Let the data speak for itself, and only then decide the
next step.
The Data Science Process
Here’s the typical pipeline:
1. Define the problem (business or research objective)
2. Collect data (from files, APIs, web scraping, databases)
3. Clean the data (handle missing values, remove duplicates)
4. EDA (explore patterns, visualize relationships)
5. Model the data (ML or statistical models)
6. Evaluate the model (accuracy, precision, etc.)
7. Communicate results (dashboard, report, presentation)
8. Deploy & monitor (in real-world apps)
Case Study: RealDirect (Online Real Estate Firm)
• Goal: Help homeowners decide how to sell their homes — with
or without a real estate agent.
• Data used:
o Listing prices
o Home features (location, size, no. of beds/baths)
o Time on market
o Selling strategy (self-listed vs agent-listed)
Key EDA insights:
• Self-listed homes tended to start with higher prices.
• Homes with better photos sold faster.
• Pricing too high led to homes sitting longer on the market.
Outcome: Data-driven advice engine to help sellers choose the
best listing strategy.
Machine Learning: Core Algorithms
1. Linear Regression (Supervised Learning)
Goal: Predict a continuous numeric value
Example: Predict house price based on size.
How it works:
• Fits a straight line:
Y = mX + b
Where Y = predicted value, X = input, m = slope, b = intercept
Used when:
• You want to estimate trends or predict quantities
• Your target output is numerical
Example use cases:
• Predicting salary from years of experience
• Estimating sales based on ad spend
Key idea: It minimizes the error between predicted and actual
values using least squares.
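A tiny least-squares sketch with NumPy, using made-up experience/salary numbers just to show how m and b are estimated:
import numpy as np

years = np.array([1, 2, 3, 5, 7, 10])        # hypothetical years of experience
salary = np.array([35, 42, 50, 61, 75, 95])  # hypothetical salary (in thousands)

# Slope and intercept that minimize the squared error
m, b = np.polyfit(years, salary, deg=1)
print(f"salary ≈ {m:.1f} * years + {b:.1f}")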
2. k-Nearest Neighbors (k-NN) (Supervised Learning)
Goal: Predict the label of a new data point based on its neighbors
How it works:
• Choose a value of k (like 3 or 5)
• Find the k closest data points (neighbors) to the new point
using distance (usually Euclidean)
• Assign the most common label among them (for classification)
Or take the average value (for regression)
Used when:
• You want to classify objects (like emails as spam/ham) or
predict values based on similarity
Example use cases:
• Image recognition
• Recommendation systems
• Customer behavior prediction
Sensitive to outliers and irrelevant features, and slow with large
datasets.
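A minimal k-NN classification sketch with scikit-learn on the built-in iris dataset (k = 5, Euclidean distance by default):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", knn.score(X_test, y_test))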
3. k-Means Clustering (Unsupervised Learning)
Goal: Group data into k clusters without predefined labels
How it works:
1. Choose k (number of clusters)
2. Randomly place k centroids
3. Assign each point to the nearest centroid
4. Recalculate centroids as the mean of the cluster
5. Repeat steps 3–4 until stable
Used when:
• You want to discover natural groupings in data
Example use cases:
• Customer segmentation
• Market basket analysis
• Document/topic clustering
Doesn't work well with non-spherical clusters or uneven densities.
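A minimal k-means sketch with scikit-learn on synthetic blobs (the steps above happen inside fit):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # k = 3
labels = kmeans.fit_predict(X)        # assign/recalculate until stable
print(kmeans.cluster_centers_)        # final centroids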
Summary Table:
• Linear Regression: Supervised; task = prediction; output = numeric value
• k-NN: Supervised; task = classification/regression; output = class or value
• k-Means: Unsupervised; task = clustering; output = groups (clusters)
Applications of Machine Learning
Real-world Usage of ML
ML is behind many everyday technologies:
• Spam filtering
• Facial recognition
• Recommendation engines (Amazon, Netflix)
• Self-driving cars
• Disease prediction
Let’s zoom into one specific use case:
Spam Filtering
Why Linear Regression & k-NN struggle here:
• Linear Regression: Outputs continuous values, not probabilities or categories. Can't handle word frequencies or text data well.
• k-NN: Too slow for large datasets (like emails); needs to compare with every training sample. Also, high dimensionality (lots of words) = poor accuracy.
Why Naive Bayes Works Well
Naive Bayes = Probabilistic classifier based on Bayes’ Theorem
• Assumes features (words) are independent (naive assumption)
• Calculates:
P(Spam | words) ∝ P(words | Spam) · P(Spam)
Why it works:
• Extremely fast, even with large text data
• Works well with word counts / frequencies
• Good with high-dimensional text (like emails)
It can classify emails as spam or not spam based on the
presence/absence of specific keywords, punctuation, headers, etc.
Data Wrangling – Prepping Data for ML
Before training ML models, you need clean data. That’s where data
wrangling comes in.
Key Tools & Techniques:
• APIs: Access structured data from web services (e.g., Twitter API, Reddit API)
• Web Scraping: Extract data from websites using tools like BeautifulSoup, Scrapy, or Selenium
• Cleaning: Remove duplicates, fill missing data, normalize formats
• Parsing: Convert raw data (HTML, XML, JSON) into structured tables
• Libraries: pandas (Python) for manipulation, re for regex parsing, requests for HTTP calls
Mini Workflow:
1. Collect email data via IMAP or a CSV export
2. Clean + tokenize email content
3. Label spam/ham
4. Vectorize text (Bag of Words / TF-IDF)
5. Train a Naive Bayes model
6. Predict new emails
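A toy sketch of steps 4–6, assuming a handful of made-up emails and labels (Bag of Words + Naive Bayes):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at 10am tomorrow",
          "free money click here", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()               # Bag of Words
X = vec.fit_transform(emails)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["claim your free prize"])))  # likely 'spam'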
Introduction to Feature Engineering & Selection
What’s a Feature?
A feature is an individual measurable property or characteristic of
the data.
Example: For customer retention, features could be:
• Number of purchases
• Days since last login
• Total amount spent
Feature Generation
(aka Creating Useful Inputs for Your ML Model)
Goal: Turn raw data into smart, meaningful variables
Process involves:
1. Brainstorming: Think of all potentially useful info (domain
knowledge helps!)
2. Domain Expertise: Talk to someone who knows the field (e.g., a
marketing analyst for churn prediction)
3. Imagination: Think outside the box—combine features or
create ratios
Example: Customer Retention
Raw data:
• Login timestamps
• Purchase amounts
• Support tickets
Generated features might include:
• Avg. time between logins
• Purchase frequency
• Complaints per month
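A tiny pandas sketch of the first generated feature (average time between logins), assuming a hypothetical login log with made-up column names:
import pandas as pd

logins = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'login_time': pd.to_datetime(['2024-01-01', '2024-01-04', '2024-01-10',
                                  '2024-01-02', '2024-01-03']),
})

# Sort, then compute the gap (in days) between consecutive logins per user
logins = logins.sort_values(['user_id', 'login_time'])
logins['gap_days'] = logins.groupby('user_id')['login_time'].diff().dt.days

# Average gap per user = the "avg. time between logins" feature
features = logins.groupby('user_id')['gap_days'].mean().rename('avg_days_between_logins')
print(features)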
Feature Selection
(Remove useless/noisy features to improve model performance)
Why it matters:
• Speeds up training
• Reduces overfitting
• Improves accuracy
Feature Selection Techniques
• Filters: Select features using statistical tests (e.g., correlation, mutual info). Pros: fast. Cons: ignores model performance.
• Wrappers: Use a model (e.g., logistic regression) to evaluate feature subsets. Pros: accurate. Cons: computationally expensive.
• Embedded (e.g., Trees): Built into model training (e.g., Decision Trees, Lasso). Pros: efficient. Cons: depends on model choice.
Decision Trees & Random Forests for Feature Selection
• Decision Tree: Splits based on the most important features first
The first few splits = high-importance features
• Random Forest: Builds multiple trees, averages feature
importance
Can rank features by how much they reduce error
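A minimal sketch of ranking features with Random Forest importances, using synthetic data and purely illustrative feature names:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
cols = ['purchases', 'days_since_login', 'total_spent',
        'complaints', 'avg_gap', 'tickets']   # hypothetical names

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False))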
Summary Flow (Customer Retention Use Case):
1. Raw data: User logs, purchases, complaints
2. Feature Generation: Turn logs into counts, ratios, gaps
3. Feature Selection:
o Use filter methods for quick elimination
o Use Random Forests to rank what's most predictive
4. Train ML model (e.g., Logistic Regression, XGBoost)
Recommendation Systems
Used to predict user preferences and suggest relevant items
(products, movies, music, etc.).
Examples:
• 🛍 Amazon → “People also bought”
• Netflix → “Top picks for you”
• Spotify → “Recommended for you”
Building a User-Facing Data Product
A Recommendation System = Data + Algorithms + UX
To make it useful:
• Easy to access (search or suggestions)
• Personalized and timely
• Fast and scalable
Think of it as a smart assistant embedded in a product.
Core Algorithmic Ingredients
1. Collaborative Filtering
(Users who liked X also liked Y)
• User-based: Recommend based on similar users
• Item-based: Recommend items similar to ones the user liked
2. Content-Based Filtering
(Recommend similar items based on features)
• E.g., movie genre, product category, price, etc.
3. Hybrid Models
Combine both approaches above for better accuracy.
Dimensionality Reduction Techniques
Used to simplify user-item matrices (which are often huge and
sparse).
Why?
• Reduce computation
• Improve generalization
• Remove noise
Key Techniques
1. SVD (Singular Value Decomposition)
Breaks down a user-item matrix into:
R = U · Σ · Vᵀ
Where:
• R is the original matrix (users vs items)
• U and V capture user/item relationships
• Σ holds the strengths (singular values)
Used in Latent Factor Models (like matrix factorization in Netflix's
engine).
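A small NumPy sketch of SVD on a toy ratings matrix (the numbers are invented), keeping only the two strongest latent factors:
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]])          # rows = users, columns = items (toy data)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
print(s)                              # singular values (strengths of latent factors)

# Rank-2 approximation using the 2 strongest latent factors
R_approx = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
print(np.round(R_approx, 2))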
2. PCA (Principal Component Analysis)
• Finds important directions (principal components) in data
• Less used directly in recommender systems, but helpful for
feature reduction before building a model
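A minimal PCA sketch with scikit-learn, reusing the iris dataset just to show feature reduction:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)                 # keep the 2 most important directions
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)      # variance captured by each component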
Mini Exercise: Build Your Own Recommender System (Basic
Version)
Goal: Recommend books/movies based on user ratings
Steps:
1. Collect data (e.g., MovieLens dataset)
2. Create a user-item matrix
3. Choose an algorithm:
o User-based CF → use cosine similarity
o SVD → use matrix factorization with scikit-surprise or
numpy
4. Generate predictions
5. Recommend top-N items for a user
Example Code (Skeleton – Python):
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
# Sample rating matrix (rows = items, columns = users)
ratings = pd.DataFrame({
    'User1': [5, 3, 0, 0],
    'User2': [4, 0, 0, 2],
    'User3': [1, 1, 0, 5],
}, index=['ItemA', 'ItemB', 'ItemC', 'ItemD'])

# Transpose so rows are users, then compute user-user cosine similarity
user_sim = cosine_similarity(ratings.T)

# Show the user similarity matrix
print(pd.DataFrame(user_sim, index=ratings.columns, columns=ratings.columns))
Social Network Graphs — Introduction
What’s a Social Network Graph?
• A way to represent relationships using nodes (users) and edges
(connections).
• Example: In Facebook,
o Nodes = users
o Edges = friendships (undirected) or followers (directed)
Social networks are basically big graphs — with millions of people
(nodes) and billions of interactions (edges).
Mining Social-Network Graphs
This involves extracting useful insights from large-scale network data.
Why?
• To find influential users
• Detect communities or cliques
• Track information spread (like viral posts or fake news)
• Recommend connections
Key Concepts:
1. Social Networks as Graphs
• Undirected Graphs → Friendship (mutual)
• Directed Graphs → Followers/following, email (1-way)
Real-world networks tend to be:
• Sparse: Not every node is connected to every other
• Small-world: Most nodes can be reached in a few steps (e.g., 6
degrees of separation)
2. Clustering of Graphs
• Grouping similar nodes together based on structure
• Real-world example: Friends from the same school, workplace,
or interest group
Tools:
• Clustering coefficient: Measures how tightly-knit a node's
neighbors are
• k-means on graph embeddings: When you turn graph nodes
into vectors
3. Direct Discovery of Communities
• Communities = Densely connected clusters of nodes
• Algorithms:
o Louvain (for large graphs)
o Girvan-Newman (uses edge betweenness)
• Application: Find topic-based groups on Twitter or LinkedIn
circles
4. Partitioning of Graphs
• Dividing a graph into non-overlapping subgraphs (like
departments in a company)
• Goal: Maximize internal edges within partitions, minimize
external edges
Used in:
• Load balancing
• Organizing network traffic
• Parallel computing
5. Neighborhood Properties
• Refers to the properties of nodes that are 1 hop or k hops
away
Examples:
• Degree: Number of direct neighbors
• Ego Network: The node and all of its neighbors + their
connections
• Clustering coefficient: How connected the neighbors are with
each other
Useful for:
• Predicting friendships
• Detecting anomalies (like bots)
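A short sketch of these neighborhood properties with NetworkX, using its built-in karate club graph as a stand-in social network:
import networkx as nx

G = nx.karate_club_graph()            # classic small social network

node = 0
print("Degree:", G.degree[node])                           # number of direct neighbors
print("Clustering coefficient:", nx.clustering(G, node))   # how connected the neighbors are
ego = nx.ego_graph(G, node)                                # the node + its neighbors + their edges
print("Ego network size:", ego.number_of_nodes())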
Summary
Concept Purpose
Social Graphs Represent connections
Clustering Group similar nodes
Communities Detect densely connected subgroups
Partitioning Split network for analysis or load
Neighborhoods Analyze local node environment
Data Visualization
The art and science of converting complex data into visual form so
patterns, trends, and insights are obvious at a glance.
Basic Principles of Data Visualization
1. Clarity over clutter
→ Show the data, not the decoration. Don’t confuse with too many
colors or 3D effects.
2. Choose the right chart
Data Type Best Chart
Trends over time Line chart
Category comparison Bar chart
Proportions Pie chart (use sparingly)
Relationships Scatter plot
Distribution Histogram, box plot
Network/Flow Graphs, Sankey diagrams
3. Use color meaningfully
Color should communicate, not just decorate (e.g., red for danger,
blue for cool).
4. Minimize chartjunk
Avoid unnecessary grid lines, bold fonts, shadows. Keep it clean.
Common Tools for Data Visualization
Programming-based:
• Matplotlib / Seaborn (Python) – for static plots
• Plotly / Bokeh / Altair – for interactive visualizations
• ggplot2 (R) – great grammar of graphics
GUI-based (No-code / Low-code):
• Tableau – industry favorite
• Power BI – great for business dashboards
• Google Data Studio – web-based, integrates with Google
services
• Excel – surprisingly powerful when used well
Inspiring Real-World Visualization Projects
1. Spotify Wrapped
→ Personalized data storytelling using visualizations of your
listening habits.
2. The New York Times Graphics Department
→ Stunning visual journalism combining interactivity +
storytelling.
3. Gapminder by Hans Rosling
→ Animated bubble charts showing global development over
time.
Exercise: Create Your Own Visualization
Dataset Idea:
Use a public dataset like:
• COVID-19 daily cases
• IMDb movie ratings
• India rainfall dataset
• Airbnb listings in a city
• Kaggle datasets
Steps:
1. Import & clean the data (Pandas / Excel)
2. Explore using plots:
o Time series → line plot
o Category-wise → bar chart
o Correlation → heatmap
3. Visualize using:
o Matplotlib / Seaborn (Python)
o or Tableau if you prefer drag-n-drop
Example (Python):
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Load sample dataset
df = sns.load_dataset('tips')
# Plot total bill vs tip
sns.scatterplot(data=df, x='total_bill', y='tip', hue='sex')
plt.title("Tips vs Total Bill")
plt.show()
Data Science & Ethical Issues
As data becomes the new gold, ethical questions are the new
minefields.
Privacy
• What’s the concern?
Companies and governments collect enormous data: location,
searches, purchases, health, etc.
Users often don’t realize how much is being tracked.
• Ethical issues:
o Is consent truly informed?
o Can data be anonymized enough?
o Who owns your data — you or the company?
Real case:
Facebook–Cambridge Analytica scandal → user data used to
influence elections.
Security
• Security is about protecting data from unauthorized access.
• Data breaches can leak millions of users’ personal and financial
details.
Must-dos for ethical data science:
• Encrypt sensitive data
• Follow security protocols (HTTPS, access control, 2FA)
• Limit data access to need-to-know roles
Ethics in Data Science
The big 3 questions every data scientist should ask:
1. Should I build this model?
2. Who benefits, who gets hurt?
3. Is the algorithm biased?
Examples of ethical failure:
• Racist AI in hiring
• Gender-biased facial recognition
• Predictive policing reinforcing injustice
A Look Back at Data Science
• Earlier: Data science was about crunching numbers and finding
patterns.
• Now: It’s about responsibility, fairness, transparency, and
human impact.
• Tools have evolved, and so have the questions we ask of the data.
Next-Generation Data Scientists
They must be:
• Not just coders, but critical thinkers
• Fluent in ethics, privacy laws, and social consequences
• Skilled in explaining how and why a model works (model
interpretability)
• Comfortable saying no to shady projects — not just doing
what’s possible, but what’s right