
Introduction to Data Science

What is Data Science?


Data Science is an interdisciplinary field that extracts insights from
data using techniques from:
• Statistics
• Computer Science
• Mathematics
• Domain Knowledge
It turns raw data into actionable knowledge through processes like
data cleaning, analysis, visualization, and machine learning.

Big Data and Data Science Hype


• Big Data refers to datasets that are too large, fast, or complex
for traditional tools to handle.
• Data Science hype came from the explosion of:
o IoT devices
o Social media platforms
o Online transactions
o Cloud storage
o Mobile apps
Why the hype?
Because companies discovered that data = money when used to:
• Predict behavior
• Optimize operations
• Personalize user experience
• Improve decision-making

Why Now? (The Timing)


• Storage became cheaper
• Computing power increased
• Machine learning libraries matured (like Scikit-learn,
TensorFlow)
• Demand for automation and smarter decisions rose globally
We reached a point where data is being produced faster than ever,
and tools to analyze it are finally accessible.

Datafication
• Datafication is the process of turning actions, processes, or
objects into data.
o Example: Your step count, your Google searches, and
Netflix viewing habits — all are datafied.
Everything we do is now trackable — and analyzable.

Current Landscape of Data Science


• Used in almost every industry: Healthcare, finance, marketing,
e-commerce, sports, and even agriculture.
• Job roles have evolved:
o Data Analyst
o Data Engineer
o Machine Learning Engineer
o Business Intelligence Analyst
o Research Scientist
Data Science is not just a tech discipline; it shows up wherever decision-making matters.

Skillsets Needed in Data Science


• Programming: Python, R, SQL
• Math/Stats: Probability, Statistics, Linear Algebra
• Data Handling: Pandas, NumPy, Excel, Databases
• Machine Learning: Supervised/Unsupervised Learning, Scikit-learn, TensorFlow
• Visualization: Matplotlib, Seaborn, Power BI, Tableau
• Domain Knowledge: Understanding the industry you're working in
• Soft Skills: Communication, Problem-solving, Storytelling with data

Introduction to Statistics (for Data Science)

What is Statistical Inference?


Statistical inference is the process of making educated guesses
about a population using data from a sample.
Two major types:
• Estimation – Predicting unknown values (e.g., average height of
all students from a class sample).
• Hypothesis Testing – Testing assumptions (e.g., "Do ads
increase sales?").
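A quick sketch of such a hypothesis test in Python (assuming SciPy is available; the sales figures below are invented for illustration):

from scipy import stats

# Hypothetical daily sales with and without ads (made-up numbers)
sales_with_ads = [120, 135, 128, 140, 132, 138]
sales_without_ads = [110, 118, 125, 115, 121, 119]

# Two-sample t-test: do the two group means differ?
t_stat, p_value = stats.ttest_ind(sales_with_ads, sales_without_ads)
print(t_stat, p_value)  # a small p-value (e.g., < 0.05) suggests a real difference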

Populations and Samples


• Population: The full group you want to study (e.g., all Twitter users).
• Sample: A subset of the population used to draw conclusions (e.g., 500 Twitter users randomly picked).
Sampling saves time and money, but it must be done well to avoid bias.

Statistical Modeling
A statistical model is a mathematical representation of observed
data.
It usually has:
• Assumptions (e.g., data follows a normal distribution)
• Parameters (e.g., mean, variance)
• A goal: describe or predict real-world outcomes
Example:
Linear Regression →
Sales = α + β × AdSpend + error

Probability Distributions
Distributions tell you how likely different outcomes are.
Type Examples

Discrete Binomial, Poisson

Continuous Normal, Exponential


• Normal Distribution (bell curve): Most common, used in many
real-world cases.
• Binomial Distribution: Used for yes/no outcomes (e.g., coin
toss).
• Poisson: Counts of events in a time period (e.g., calls per hour
in a call center).
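As a rough illustration, NumPy can simulate draws from each of these distributions (the parameter values below are only examples):

import numpy as np

rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=10, size=1000)   # Normal: e.g., heights in cm
heads = rng.binomial(n=10, p=0.5, size=1000)         # Binomial: heads in 10 coin tosses
calls = rng.poisson(lam=4, size=1000)                # Poisson: calls per hour, average 4
print(heights.mean(), heads.mean(), calls.mean())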

Fitting a Model
Model fitting = Finding the best mathematical function to describe
your data.
• Use training data to fit the model
• Check performance with test/validation data
• Metrics: MSE (Mean Squared Error), Accuracy, R², etc.
A good model generalizes well to unseen data — not just
memorizes the training set.
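A minimal scikit-learn sketch of this fit-then-evaluate loop (the data is synthetic and only for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: X drives y with some noise
X = np.arange(1, 41).reshape(-1, 1)
y = 3.0 * X.ravel() + np.random.default_rng(0).normal(0, 5, size=40)

# Hold out a test set to check generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R2:", r2_score(y_test, pred))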
Intro to R (for Stats & DS)
R is a powerful language built for statistical analysis and visualization.
• Popular in academia and among statisticians
• Known for packages like:
o ggplot2 – for visualization
o dplyr – for data manipulation
o caret – for machine learning
• Basic example (R):
# Load a dataset
data(mtcars)

# Summary statistics
summary(mtcars)

# Linear regression
model <- lm(mpg ~ wt + hp, data=mtcars)
summary(model)

Data Analysis in Data Science

Exploratory Data Analysis (EDA)


EDA is the first step in data analysis where you:
• Understand your data
• Spot patterns, trends, or outliers
• Form hypotheses for further analysis
Think of EDA as “data detective work” before jumping to
machine learning or predictions.

Basic Tools of EDA


• Summary Statistics: Mean, median, mode, standard deviation, min, max, quartiles
• Visualizations:
o Histograms – distribution of a single variable
o Box plots – detect outliers
o Bar charts – for categorical data
o Scatter plots – check relationships between variables
o Heatmaps – correlation matrices
Also includes: Missing value checks, frequency tables, group-wise aggregation (a quick pandas sketch follows below).
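A quick pandas/seaborn sketch of these checks, using the built-in 'tips' dataset as a stand-in for your own data:

import seaborn as sns

df = sns.load_dataset('tips')
print(df.describe())                            # summary statistics
print(df.isna().sum())                          # missing value check
print(df['day'].value_counts())                 # frequency table for a categorical column
print(df.groupby('day')['total_bill'].mean())   # group-wise aggregation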

Philosophy of EDA (by John Tukey)


“EDA is not about confirming a hypothesis, but suggesting new
ones.”
• It's non-formal and interactive
• Emphasizes visualizations over complex equations
• Focuses on understanding data before modeling
The idea: Let the data speak for itself, and only then decide the
next step.

The Data Science Process


Here’s the typical pipeline:
1. Define the problem (business or research objective)
2. Collect data (from files, APIs, web scraping, databases)
3. Clean the data (handle missing values, remove duplicates)
4. EDA (explore patterns, visualize relationships)
5. Model the data (ML or statistical models)
6. Evaluate the model (accuracy, precision, etc.)
7. Communicate results (dashboard, report, presentation)
8. Deploy & monitor (in real-world apps)

Case Study: RealDirect (Online Real Estate Firm)


• Goal: Help homeowners decide how to sell their homes — with
or without a real estate agent.
• Data used:
o Listing prices
o Home features (location, size, no. of beds/baths)
o Time on market
o Selling strategy (self-listed vs agent-listed)
Key EDA insights:
• Self-listed homes tended to start with higher prices.
• Homes with better photos sold faster.
• Pricing too high led to homes sitting longer on the market.
Outcome: Data-driven advice engine to help sellers choose the
best listing strategy.

Machine Learning: Core Algorithms

1. Linear Regression (Supervised Learning)


Goal: Predict a continuous numeric value
Example: Predict house price based on size.
How it works:
• Fits a straight line:
Y = mX + b
Where Y = predicted value, X = input, m = slope, b = intercept
Used when:
• You want to estimate trends or predict quantities
• Your target output is numerical
Example use cases:
• Predicting salary from years of experience
• Estimating sales based on ad spend
Key idea: It minimizes the error between predicted and actual
values using least squares.
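Example (Python, a minimal sketch with invented salary data):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience -> salary in thousands
X = np.array([[1], [2], [3], [5], [8], [10]])
y = np.array([35, 42, 48, 60, 80, 95])

model = LinearRegression().fit(X, y)
print("slope m:", model.coef_[0], "intercept b:", model.intercept_)
print("predicted salary for 6 years:", model.predict([[6]])[0])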
2. k-Nearest Neighbors (k-NN) (Supervised Learning)
Goal: Predict the label of a new data point based on its neighbors
How it works:
• Choose a value of k (like 3 or 5)
• Find the k closest data points (neighbors) to the new point
using distance (usually Euclidean)
• Assign the most common label among them (for classification)
Or take the average value (for regression)
Used when:
• You want to classify objects (like emails as spam/ham) or
predict values based on similarity
Example use cases:
• Image recognition
• Recommendation systems
• Customer behavior prediction
Sensitive to outliers and irrelevant features, and slow with large
datasets.
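Example (Python, a minimal classification sketch with made-up email features):

from sklearn.neighbors import KNeighborsClassifier

# Toy features per email: [word_count, num_links]; label 1 = spam, 0 = ham
X = [[100, 0], [120, 1], [900, 8], [950, 10], [130, 0], [880, 9]]
y = [0, 0, 1, 1, 0, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3
knn.fit(X, y)

# The new point takes the majority label of its 3 nearest neighbours
print(knn.predict([[870, 7]]))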

3. k-Means Clustering (Unsupervised Learning)


Goal: Group data into k clusters without predefined labels
How it works:
1. Choose k (number of clusters)
2. Randomly place k centroids
3. Assign each point to the nearest centroid
4. Recalculate centroids as the mean of the cluster
5. Repeat steps 3–4 until stable
Used when:
• You want to discover natural groupings in data
Example use cases:
• Customer segmentation
• Market basket analysis
• Document/topic clustering
Doesn’t work well with non-spherical clusters or uneven
densities
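Example (Python, a minimal sketch with invented customer data):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual_spend, visits_per_month]
X = np.array([[200, 2], [220, 3], [800, 10], [850, 12], [210, 1], [790, 11]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:", kmeans.labels_)              # cluster assigned to each customer
print("centroids:", kmeans.cluster_centers_)  # cluster centres after convergence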

Summary Table:
• Linear Regression: Supervised; task: prediction; output: numeric value
• k-NN: Supervised; task: classification/regression; output: class or value
• k-Means: Unsupervised; task: clustering; output: groups (clusters)

Applications of Machine Learning

Real-world Usage of ML
ML is behind many everyday tech:
• Spam filtering
• Facial recognition
• Recommendation engines (Amazon, Netflix)
• Self-driving cars
• Disease prediction
Let’s zoom into one specific use case:

Spam Filtering
Why Linear Regression & k-NN struggle here:
• Linear Regression: Outputs continuous values, not probabilities or categories, and can't handle word frequencies or text data well.
• k-NN: Too slow for large datasets (like email corpora) because it must compare against every training sample; high dimensionality (lots of words) also hurts accuracy.

Why Naive Bayes Works Well


Naive Bayes = Probabilistic classifier based on Bayes’ Theorem
• Assumes features (words) are independent (naive assumption)
• Calculates:
P(Spam | words) ∝ P(words | Spam) · P(Spam)
Why it works:
• Extremely fast, even with large text data
• Works well with word counts / frequencies
• Good with high-dimensional text (like emails)
It can classify emails as spam or not spam based on the
presence/absence of specific keywords, punctuation, headers, etc.
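A tiny worked example of the idea, with invented numbers: suppose 40% of training emails are spam, and the word "free" appears in 60% of spam emails but only 5% of ham emails. For a new email containing "free":

# Invented counts for illustration only
p_spam, p_ham = 0.40, 0.60            # priors P(Spam), P(Ham)
p_free_given_spam = 0.60              # P("free" | Spam)
p_free_given_ham = 0.05               # P("free" | Ham)

spam_score = p_free_given_spam * p_spam   # 0.24
ham_score = p_free_given_ham * p_ham      # 0.03

# Normalize to get the posterior probability of spam
print(spam_score / (spam_score + ham_score))  # about 0.89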

Data Wrangling – Prepping Data for ML


Before training ML models, you need clean data. That’s where data
wrangling comes in.
Key Tools & Techniques:
• APIs: Access structured data from web services (e.g., Twitter API, Reddit API)
• Web Scraping: Extract data from websites using tools like BeautifulSoup, Scrapy, or Selenium
• Cleaning: Remove duplicates, fill missing data, normalize formats
• Parsing: Convert raw data (HTML, XML, JSON) into structured tables
• Libraries: pandas (Python) for manipulation, re for regex parsing, requests for HTTP calls

Mini Workflow:
1. Collect email data via IMAP API or CSV
2. Clean + tokenize email content
3. Label spam/ham
4. Vectorize text (Bag of Words / TF-IDF)
5. Train a Naive Bayes model
6. Predict new emails
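A minimal scikit-learn sketch of steps 4-6 (the example messages and labels are invented; a real project would load them in step 1):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 3: labelled training messages (toy examples)
emails = ["win free money now", "meeting at noon tomorrow",
          "free prize claim now", "project report attached"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham

# Step 4: vectorize text (Bag of Words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Step 5: train a Naive Bayes model
model = MultinomialNB().fit(X, labels)

# Step 6: predict new emails
new = vectorizer.transform(["claim your free prize"])
print(model.predict(new))   # likely [1], i.e. spam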

Introduction to Feature Engineering & Selection

What’s a Feature?
A feature is an individual measurable property or characteristic of
the data.
Example: For customer retention, features could be:
• Number of purchases
• Days since last login
• Total amount spent

Feature Generation
(aka Creating Useful Inputs for Your ML Model)
Goal: Turn raw data into smart, meaningful variables
Process involves:
1. Brainstorming: Think of all potentially useful info (domain
knowledge helps!)
2. Domain Expertise: Talk to someone who knows the field (e.g., a
marketing analyst for churn prediction)
3. Imagination: Think outside the box—combine features or
create ratios
Example: Customer Retention
Raw data:
• Login timestamps
• Purchase amounts
• Support tickets
Generated features might include:
• Avg. time between logins
• Purchase frequency
• Complaints per month
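A pandas sketch of generating such features from a raw event log (the log below is invented for illustration):

import pandas as pd

logs = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "login_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-20",
                                  "2024-01-02", "2024-01-03"]),
    "purchase_amount": [20.0, 0.0, 35.0, 0.0, 15.0],
})

# One row per user: counts, totals, and the gap between logins
features = logs.sort_values("login_time").groupby("user_id").agg(
    logins=("login_time", "count"),
    total_spent=("purchase_amount", "sum"),
    avg_days_between_logins=("login_time", lambda s: s.diff().dt.days.mean()),
)
print(features)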

Feature Selection
(Remove useless/noisy features to improve model performance)
Why it matters:
• Speeds up training
• Reduces overfitting
• Improves accuracy

Feature Selection Techniques


• Filters: Select features using statistical tests (e.g., correlation, mutual information). Pros: fast. Cons: ignores model performance.
• Wrappers: Use a model (e.g., logistic regression) to evaluate feature subsets. Pros: accurate. Cons: computationally expensive.
• Embedded (e.g., Trees): Built into model training (e.g., Decision Trees, Lasso). Pros: efficient. Cons: depends on model choice.

Decision Trees & Random Forests for Feature Selection


• Decision Tree: Splits based on the most important features first
The first few splits = high-importance features
• Random Forest: Builds multiple trees, averages feature
importance
Can rank features by how much they reduce error
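A short sketch of ranking features with a Random Forest (using a public scikit-learn dataset as a stand-in for your own data):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Rank features by how much they reduce impurity across the trees
importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))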

Summary Flow (Customer Retention Use Case):


1. Raw data: User logs, purchases, complaints
2. Feature Generation: Turn logs into counts, ratios, gaps
3. Feature Selection:
o Use filter methods for quick elimination
o Use Random Forests to rank what's most predictive
4. Train ML model (e.g., Logistic Regression, XGBoost)
Recommendation Systems
Used to predict user preferences and suggest relevant items
(products, movies, music, etc.).
Examples:
• 🛍 Amazon → “People also bought”
• Netflix → “Top picks for you”
• Spotify → “Recommended for you”

Building a User-Facing Data Product


A Recommendation System = Data + Algorithms + UX
To make it useful:
• Easy to access (search or suggestions)
• Personalized and timely
• Fast and scalable
Think of it as a smart assistant embedded in a product.

Core Algorithmic Ingredients


1. Collaborative Filtering
(Users who liked X also liked Y)
• User-based: Recommend based on similar users
• Item-based: Recommend items similar to ones the user liked
2. Content-Based Filtering
(Recommend similar items based on features)
• E.g., movie genre, product category, price, etc.
3. Hybrid Models
Combine both approaches above for better accuracy.

Dimensionality Reduction Techniques


Used to simplify user-item matrices (which are often huge and
sparse).
Why?
• Reduce computation
• Improve generalization
• Remove noise

Key Techniques
1. SVD (Singular Value Decomposition)
Breaks down a user-item matrix into:
R = U · Σ · Vᵀ
Where:
• R is the original matrix (users vs items)
• U and V capture user/item relationships
• Σ holds the strengths (singular values)
Used in Latent Factor Models (like matrix factorization in Netflix's
engine).
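A minimal sketch using scikit-learn's TruncatedSVD (a common practical stand-in for full SVD on sparse rating matrices; the ratings below are made up):

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Tiny user-item rating matrix (rows = users, columns = items), 0 = unrated
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]])

svd = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent factors
user_factors = svd.fit_transform(R)                  # roughly U * Sigma
item_factors = svd.components_                       # roughly V^T

# Reconstruct approximate ratings, including the unrated cells
print(np.round(user_factors @ item_factors, 1))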

2. PCA (Principal Component Analysis)


• Finds important directions (principal components) in data
• Less used directly in recommender systems, but helpful for
feature reduction before building a model

Mini Exercise: Build Your Own Recommender System (Basic Version)
Goal: Recommend books/movies based on user ratings
Steps:
1. Collect data (e.g., MovieLens dataset)
2. Create a user-item matrix
3. Choose an algorithm:
o User-based CF → use cosine similarity
o SVD → use matrix factorization with scikit-surprise or
numpy
4. Generate predictions
5. Recommend top-N items for a user

Example Code (Skeleton – Python):


from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Sample user-item rating matrix


ratings = pd.DataFrame({
'User1': [5, 3, 0, 0],
'User2': [4, 0, 0, 2],
'User3': [1, 1, 0, 5],
}, index=['ItemA', 'ItemB', 'ItemC', 'ItemD'])

# Transpose so rows are users, then compute user-user cosine similarity
user_sim = cosine_similarity(ratings.T)

# Show similarity matrix


print(pd.DataFrame(user_sim, index=ratings.columns,
columns=ratings.columns))

Social Network Graphs — Introduction


What’s a Social Network Graph?
• A way to represent relationships using nodes (users) and edges
(connections).
• Example: In Facebook,
o Nodes = users
o Edges = friendships (undirected) or followers (directed)
Social networks are basically big graphs — with millions of people
(nodes) and billions of interactions (edges).

Mining Social-Network Graphs


This involves extracting useful insights from large-scale network data.
Why?
• To find influential users
• Detect communities or cliques
• Track information spread (like viral posts or fake news)
• Recommend connections

Key Concepts:

1. Social Networks as Graphs


• Undirected Graphs → Friendship (mutual)
• Directed Graphs → Followers/following, email (1-way)
Real-world networks tend to be:
• Sparse: Not every node is connected to every other
• Small-world: Most nodes can be reached in a few steps (e.g., 6
degrees of separation)

2. Clustering of Graphs
• Grouping similar nodes together based on structure
• Real-world example: Friends from the same school, workplace,
or interest group
Tools:
• Clustering coefficient: Measures how tightly-knit a node's
neighbors are
• k-means on graph embeddings: When you turn graph nodes
into vectors

3. Direct Discovery of Communities


• Communities = Densely connected clusters of nodes
• Algorithms:
o Louvain (for large graphs)
o Girvan-Newman (uses edge betweenness)
• Application: Find topic-based groups on Twitter or LinkedIn
circles
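A small sketch with NetworkX (assuming a recent release; louvain_communities in particular needs NetworkX 2.8 or later):

import networkx as nx
from networkx.algorithms import community

# Toy graph: two tightly-knit friend groups joined by one bridge edge
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),    # group 1
                  ("d", "e"), ("e", "f"), ("d", "f"),    # group 2
                  ("c", "d")])                           # bridge

# Girvan-Newman: repeatedly remove the edge with highest betweenness
first_split = next(community.girvan_newman(G))
print([sorted(g) for g in first_split])

# Louvain community detection
print(community.louvain_communities(G, seed=0))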

4. Partitioning of Graphs
• Dividing a graph into non-overlapping subgraphs (like
departments in a company)
• Goal: Maximize internal edges within partitions, minimize
external edges
Used in:
• Load balancing
• Organizing network traffic
• Parallel computing

5. Neighborhood Properties
• Refers to the properties of nodes that are 1 hop or k hops
away
Examples:
• Degree: Number of direct neighbors
• Ego Network: The node and all of its neighbors + their
connections
• Clustering coefficient: How connected the neighbors are with
each other
Useful for:
• Predicting friendships
• Detecting anomalies (like bots)
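Example (Python, assuming the NetworkX library and a made-up friendship graph):

import networkx as nx

G = nx.Graph([("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
              ("alice", "dave"), ("dave", "erin")])

print(G.degree("alice"))           # degree: number of direct neighbours
print(nx.clustering(G, "alice"))   # how connected alice's neighbours are to each other
ego = nx.ego_graph(G, "alice")     # alice + her neighbours + their connections
print(ego.nodes(), ego.edges())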

Summary
Concept Purpose

Social Graphs Represent connections

Clustering Group similar nodes

Communities Detect densely connected subgroups

Partitioning Split network for analysis or load

Neighborhoods Analyze local node environment

Data Visualization
The art and science of converting complex data into visual form so
patterns, trends, and insights are obvious at a glance.
Basic Principles of Data Visualization
1. Clarity over clutter
→ Show the data, not the decoration. Don’t confuse with too many
colors or 3D effects.
2. Choose the right chart
Data Type Best Chart

Trends over time Line chart

Category comparison Bar chart

Proportions Pie chart (use sparingly)

Relationships Scatter plot

Distribution Histogram, box plot

Network/Flow Graphs, Sankey diagrams


3. Use color meaningfully
Color should communicate, not just decorate (e.g., red for danger,
blue for cool).
4. Minimize chartjunk
Avoid unnecessary grid lines, bold fonts, shadows. Keep it clean.

Common Tools for Data Visualization


Programming-based:
• Matplotlib / Seaborn (Python) – for static plots
• Plotly / Bokeh / Altair – for interactive visualizations
• ggplot2 (R) – great grammar of graphics
GUI-based (No-code / Low-code):
• Tableau – industry favorite
• Power BI – great for business dashboards
• Google Data Studio – web-based, integrates with Google
services
• Excel – surprisingly powerful when used well

Inspiring Real-World Visualization Projects


1. Spotify Wrapped
→ Personalized data storytelling using visualizations of your
listening habits.
2. The New York Times Graphics Department
→ Stunning visual journalism combining interactivity +
storytelling.
3. Gapminder by Hans Rosling
→ Animated bubble charts showing global development over
time.

Exercise: Create Your Own Visualization


Dataset Idea:
Use a public dataset like:
• COVID-19 daily cases
• IMDb movie ratings
• India rainfall dataset
• Airbnb listings in a city
• Kaggle datasets
Steps:
1. Import & clean the data (Pandas / Excel)
2. Explore using plots:
o Time series → line plot
o Category-wise → bar chart
o Correlation → heatmap
3. Visualize using:
o Matplotlib / Seaborn (Python)
o or Tableau if you prefer drag-n-drop
Example (Python):
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load sample dataset


df = sns.load_dataset('tips')

# Plot total bill vs tip


sns.scatterplot(data=df, x='total_bill', y='tip', hue='sex')
plt.title("Tips vs Total Bill")
plt.show()
Data Science & Ethical Issues
As data becomes the new gold, ethical questions are the new
minefields.

Privacy
• What’s the concern?
Companies and governments collect enormous data: location,
searches, purchases, health, etc.
Users often don’t realize how much is being tracked.
• Ethical issues:
o Is consent truly informed?
o Can data be anonymized enough?
o Who owns your data — you or the company?
Real case:
Facebook–Cambridge Analytica scandal → user data used to
influence elections.

Security
• Security is about protecting data from unauthorized access.
• Data breaches can leak millions of users’ personal and financial
details.
Must-dos for ethical data science:
• Encrypt sensitive data
• Follow security protocols (HTTPS, access control, 2FA)
• Limit data access to need-to-know roles
Ethics in Data Science
The big 3 questions every data scientist should ask:
1. Should I build this model?
2. Who benefits, who gets hurt?
3. Is the algorithm biased?
Examples of ethical failure:
• Racist AI in hiring
• Gender-biased facial recognition
• Predictive policing reinforcing injustice

A Look Back at Data Science


• Earlier: Data science was about crunching numbers and finding
patterns.
• Now: It’s about responsibility, fairness, transparency, and
human impact.
• Tools have evolved, and so have the questions we ask of the data.

Next-Generation Data Scientists


They must be:
• Not just coders, but critical thinkers
• Fluent in ethics, privacy laws, and social consequences
• Skilled in explaining how and why a model works (model
interpretability)
• Comfortable saying no to shady projects — not just doing
what’s possible, but what’s right
