What is Data Science? How is it different from traditional data analysis?
Data Science is an interdisciplinary field that uses scientific methods, algorithms,
processes, and systems to extract knowledge and insights from structured and
unstructured data. It combines concepts from:
Statistics
Computer science
Mathematics
Domain knowledge
Data Science involves the entire data lifecycle: data collection, cleaning,
exploration, modeling, visualization, and decision-making using techniques
like machine learning, data mining, and predictive analytics.
Key Components of Data Science:
1. Data Collection & Storage: Using databases, APIs, sensors, etc.
2. Data Cleaning & Preprocessing: Removing noise and inconsistencies.
3. Exploratory Data Analysis (EDA): Summarizing main characteristics using visualization and statistics.
4. Model Building: Applying machine learning algorithms to make predictions or classifications.
5. Model Evaluation & Tuning: Measuring performance and optimizing.
6. Deployment & Decision Making: Implementing the model in real-world systems.

Define statistical inference. How is it used in Data Science?
Statistical inference is the process of using data from a sample to draw
conclusions or make estimates about a larger population. It involves applying
probability theory to estimate population parameters, test hypotheses, and
quantify uncertainty.
Key Elements of Statistical Inference:
1. Population: The entire group you're interested in studying.
2. Sample: A subset of the population used to make inferences.
3. Parameter: A measurable characteristic of a population (e.g., mean, variance).
4. Statistic: A measurable characteristic of a sample, used to estimate a parameter.
5. Inference: Drawing conclusions about the parameter based on the statistic.
Common Techniques in Statistical Inference:
Estimation (Point & Interval): Estimating population parameters.
Hypothesis Testing: Assessing whether a claim about a population is
likely true.
Confidence Intervals: Giving a range of plausible values for a parameter.
P-values: Measuring the strength of evidence against a null hypothesis.

How is Statistical Inference Used in Data Science?
Statistical inference is fundamental to Data Science, particularly in:
Use Case | Role of Statistical Inference
A/B Testing | Determines if a new feature or product leads to a significant improvement.
Model Evaluation | Helps assess whether model performance is statistically significant.
Sampling | Allows working with large datasets by analyzing representative samples.
Feature Selection | Identifies which variables are significantly associated with the target.
Uncertainty Quantification | Provides confidence intervals for predictions.
Bias Detection | Tests whether models or data contain significant bias.
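For example, hypothesis testing and confidence intervals can be carried out directly in base R. The sketch below is only a minimal illustration; the sample vector and the hypothesized mean of 30 are invented for the example:

# Hypothetical sample of delivery times (minutes) drawn from a larger population
times <- c(31.2, 28.5, 33.1, 29.8, 35.0, 30.4, 32.7, 27.9, 34.2, 31.5)
# One-sample t-test of H0: population mean = 30
test <- t.test(times, mu = 30)
test$p.value   # strength of evidence against the null hypothesis
test$conf.int  # 95% confidence interval for the population mean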
What is the difference between a population and a
sample? Why do we sample?
Aspect | Population | Sample
Definition | The complete set of individuals, items, or data you're interested in studying. | A subset of the population selected for analysis.
Size | Usually large or infinite. | Smaller, manageable portion.
Purpose | Provides true characteristics (parameters) of the entire group. | Used to make inferences about the population.
Data Type | Contains all possible observations. | Contains only part of the data.
Measure Type | Results are called parameters (e.g., population mean, μ). | Results are called statistics (e.g., sample mean, x̄).

We sample because measuring every member of a population is usually too costly, slow, or impossible; a well-chosen sample lets us estimate population parameters with quantified uncertainty.
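This can be illustrated with a quick R simulation; the population below is synthetic, generated only for the demonstration:

# Synthetic population of 100,000 values
set.seed(1)
population <- rnorm(100000, mean = 50, sd = 10)
# Draw a random sample of 200 observations
s <- sample(population, size = 200)
mean(population)  # parameter: the true population mean
mean(s)           # statistic: the sample mean, used to estimate the parameter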
What is a probability distribution? Name and explain
any two types.
A probability distribution is a function or rule that assigns probabilities to the
possible outcomes of a random variable. It describes how likely different values
of the variable are and provides a mathematical framework to model uncertainty.
Types of Probability Distributions
Probability distributions are generally classified into:
- Discrete distributions: for variables that take countable values.
- Continuous distributions: for variables that can take any value within a continuous range.
1. Binomial Distribution (Discrete)
Used when:
- There are a fixed number of trials (n).
- Each trial has only two possible outcomes: success or failure.
- The probability of success (p) remains constant.

Formula:

P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}

Where:
- X: Number of successes
- \binom{n}{k}: Number of ways to choose k successes in n trials
Example:
Flipping a coin 10 times and counting how many times it lands heads.
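A quick sketch of this in R, using the coin example above:

# Probability of exactly 6 heads in 10 fair coin flips
dbinom(6, size = 10, prob = 0.5)
# Probability of at most 6 heads (cumulative probability)
pbinom(6, size = 10, prob = 0.5)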
2. Normal Distribution (Continuous)
Used when:
- The data is symmetrically distributed around a mean.
- Common in natural and social phenomena (e.g., height, IQ).

Characteristics:
- Bell-shaped curve
- Mean = Median = Mode
- Described by two parameters: mean (\mu) and standard deviation (\sigma)

Formula (PDF):

f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}

Example:
Heights of adult males in a country often follow a normal distribution.

Write basic R code to calculate mean, median, and
standard deviation of a dataset.
# Sample dataset (numeric vector)
data <- c(12, 15, 22, 9, 18, 30, 24, 17, 21, 14)
# Calculate mean
mean_value <- mean(data)
print(paste("Mean:", mean_value))
# Calculate median
median_value <- median(data)
print(paste("Median:”, median_value))
# Calculate standard deviation
sd_value <- sd(data)
print(paste("Standard Deviation:", sd_value))
Explanation:
- mean(data): Returns the average.
- median(data): Returns the middle value.
- sd(data): Returns the standard deviation (a measure of spread).

Explain the process of fitting a model to data.
Fitting a model to data means finding a mathematical relationship between input
variables (features) and output variables (targets) so that the model can make
predictions or understand patterns.
Steps in Model Fitting:
1. Data Collection
- Gather raw data from experiments, sensors, databases, or APIs.

2. Data Preprocessing
- Cleaning: Handle missing values, outliers, and duplicates.
- Encoding: Convert categorical variables to numerical (e.g., one-hot encoding).
- Normalization/Scaling: Standardize numeric values for better model performance.
- Splitting: Divide data into training and testing sets (commonly 70/30 or 80/20).

3. Choosing a Model
- Select a suitable model based on the task:
  - Linear Regression for continuous outcomes.
  - Logistic Regression or Decision Trees for classification.
  - Clustering algorithms for grouping, etc.

4. Model Training (Fitting)
- Use the training data to teach the model the relationship between inputs and outputs.
- The model "learns" by minimizing a loss or cost function (e.g., Mean Squared Error).

5. Model Evaluation
- Test the model on the testing/validation data.
- Use performance metrics:
  - Regression: RMSE, R²
  - Classification: Accuracy, Precision, Recall, F1-score

6. Model Tuning (Hyperparameter Optimization)
- Improve performance by adjusting hyperparameters (e.g., learning rate, depth).
- Use techniques like grid search or cross-validation.

7. Model Deployment
- Integrate the model into real-world applications or systems for prediction.

8. Monitoring and Maintenance
- Continuously monitor model performance.
- Retrain if performance drops due to new data or concept drift.
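A minimal R sketch of steps 2 to 5 above (splitting, fitting, and evaluating); the built-in mtcars dataset, the 80/20 split, and RMSE as the metric are only illustrative choices:

# Reproducible 80/20 train/test split of mtcars
set.seed(42)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]
# Fit a linear model on the training set: mpg predicted from weight
fit <- lm(mpg ~ wt, data = train)
# Evaluate on the held-out test set using RMSE
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
print(paste("Test RMSE:", round(rmse, 2)))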
Example in R (Linear Model):
# Load data
data <- mtcars
# Fit linear model: mpg (target) ~ wt (feature)
model <- lm(mpg ~ wt, data = data)
# Summary of the model
summary(model)

What is Exploratory Data Analysis (EDA)? Why is it important?
Exploratory Data Analysis (EDA) is the process of analyzing and
summarizing datasets to uncover patterns, detect anomalies, test hypotheses,
and check assumptions before applying modeling techniques. It involves both
visual and statistical methods to better understand the structure of the data.
Key Objectives of EDA:
1. Understand Data Distribution: See how values are spread across variables.
2. Detect Outliers and Missing Values: Identify data quality issues.
3. Identify Relationships: Find correlations or associations between variables.
4. Summarize Data: Use descriptive statistics like mean, median, range.
5. Guide Model Selection: Choose appropriate modeling techniques based on data behavior.
Common EDA Techniques:
Numerical Summaries:
Mean, median, mode, variance, standard deviation
Min, max, percentiles
Visualizations:
- Histograms: for the distribution of numerical data
- Boxplots: for spotting outliers
- Scatter plots: for relationships between two variables
- Bar charts: for categorical variables
- Correlation matrix/heatmaps
Missing Value Analysis
Check how much and where data is missing.
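A short EDA sketch in base R, using the built-in mtcars data purely as an illustration:

summary(mtcars)                                  # numerical summaries for every column
hist(mtcars$mpg, main = "Distribution of mpg")   # distribution of a numeric variable
boxplot(mtcars$hp, main = "Horsepower")          # outlier check
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight", ylab = "mpg")              # relationship between two variables
cor(mtcars)                                      # correlation matrix
colSums(is.na(mtcars))                           # missing values per column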
Compare supervised and unsupervised learning with
examples.
Aspect | Supervised Learning | Unsupervised Learning
Definition | Learns from labeled data (input-output pairs). | Learns from unlabeled data (no target variable).
Goal | Predict or classify the output based on input features. | Discover hidden patterns, structures, or groupings in data.
Data Requirement | Requires a dataset with known outcomes (labels). | Works with raw input data only.
Output | Predictive model (e.g., class, value). | Groupings, associations, or dimensionality reduction.
Evaluation Metrics | Accuracy, precision, recall, RMSE, etc. | Silhouette score, inertia, variance explained, etc.
Examples | Spam email classification, house price prediction. | Customer segmentation, topic modeling, anomaly detection.
Common Algorithms | Linear/Logistic Regression, Decision Trees, SVM, Neural Networks | K-Means, Hierarchical Clustering, PCA, DBSCAN

Explain Linear Regression with an example.
Linear regression is a supervised learning algorithm used to model the
relationship between a dependent variable (target) and one or more
independent variables (features) by fitting a straight line.
Types of Linear Regression:
1. Simple Linear Regression: One independent variable

y = \beta_0 + \beta_1 x + \epsilon

2. Multiple Linear Regression: More than one independent variable

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon

Where:
- y: Dependent variable (target)
- x: Independent variable(s)
- \beta_0: Intercept
- \beta_1, ..., \beta_n: Slopes (effect of each x on y)
- \epsilon: Error term

Objective:
Find the line (model) that minimizes the difference (error) between the predicted values and the actual values, typically using the least squares method.

Example: Predicting House Price
Size (sq ft) | Price (in $1000s)
1000 | 200
1500 | 250
2000 | 300
2500 | 350
You want to build a model to predict house price based on size.
In R: Simple Linear Regression
# Data
size <- c(1000, 1500, 2000, 2500)
price <- c(200, 250, 300, 350)
# Fit linear model
model <- lm(price ~ size)
# Model summary
summary(model)
# Predict price for 1800 sq ft
predict(model, data.frame(size = 1800))
# Plot
plot(size, price, main = "House Price vs Size", col = "blue", pch = 19)
abline(model, col = "red")  # regression line

Interpretation:
If the model outputs:
Price = 100 + 0.1 x Size
It means:
- Intercept (100): Base price when size = 0 (theoretically).
- Slope (0.1): Each additional square foot increases the price by 0.1 thousand dollars, i.e., $100.
Describe how the k-Nearest Neighbors (k-NN)
algorithm works.
The k-Nearest Neighbors (k-NN) algorithm is a simple, non-parametric,
supervised learning method used for both classification and regression. It
predicts the outcome for a new data point based on how similar it is to nearby
points in the training set.
How k-NN Works (Step-by-Step):
1. Choose k
Decide how many neighbors (k) to consider (commonly odd numbers like 3, 5, 7).

2. Calculate Distance
Compute the distance between the new point and all points in the training data.
- Common distance metrics:
  - Euclidean distance: d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
  - Manhattan and Minkowski distances are also used.

3. Find Nearest Neighbors
Identify the k closest data points (neighbors).

4. Make Prediction
- Classification: Take a majority vote among the k neighbors' classes.
- Regression: Take the average (mean) of the neighbors' values.

5. Return Result
Assign the most common class (or average value) to the new data point.
Example (Classification):
Suppose you want to classify whether a fruit is an apple or orange based on
weight and color.
- You set k = 3.
- You calculate the distance between the new fruit and all labeled fruits.
- You pick the 3 closest ones.
- If 2 of them are apples and 1 is an orange, classify the new fruit as an apple.
Advantages of k-NN:
Simple to understand and implement.
No training phase — it's a lazy learner.
Works well with small datasets.
Disadvantages:
- Slow with large datasets (computational cost).
- Sensitive to irrelevant features and outliers.
- Choosing the right k is critical.

In R: Simple k-NN Example
library(class)
# Features
train_X <- data.frame(height = c(6.5, 6.0, 5.0, 5.8),
weight = c(150, 180, 120, 165))
# Labels
train_Y <- factor(c("Male", "Male", "Female", "Male"))
# New observation
test_X <- data.frame(height = 5.4, weight = 130)
#k-NN Classification
prediction <- knn(train = train_X, test = test_X, cl = train_Y, k = 3)
print(prediction)
What is Data Wrangling? Why is it essential?
Data wrangling (also called data cleaning or data preprocessing) is the
process of transforming and cleaning raw data into a usable format for analysis. It
involves tasks such as handling missing values, correcting errors, standardizing
data, and converting data into the appropriate structure and format for further
processing.
Key Steps in Data Wrangling:
1. Data Collection: Gathering data from various sources (databases, CSV
files, APIs, etc.).
2. Data Cleaning:
o Handling missing data (e.g., filling, dropping, or imputing missing
values).
o Removing duplicates to avoid redundancy.
o Correcting inconsistencies (e.g., correcting spelling errors in categorical data).
3. Data Transformation:
o Converting data types (e.g., changing a column from a string to numeric).
o Normalizing or scaling data for consistency.
o Aggregating or grouping data to summarize or derive new insights.
4. Data Integration: Merging data from different sources or formats into a
single unified dataset.
5. Data Formatting: Structuring data in a consistent format (e.g., date
formats, consistent units).
Why is Data Wrangling Essential?
1. Improves Data Quality:
o Ensures that the dataset is accurate, complete, and consistent, which
is crucial for building reliable models.
2. Prepares Data for Analysis:
o Raw data is often unstructured and messy. Wrangling transforms it
into a clean, structured format that can be easily analyzed.
3. Enables Better Insights:
o A clean dataset allows for clearer trends and patterns to emerge,
leading to more reliable and actionable insights.
4. Reduces Errors in Analysis:
o Cleaning the data minimizes the risk of errors, such as biases or inconsistencies that could lead to incorrect conclusions.
5. Optimizes Model Performance:
o Models trained on well-preprocessed data are more likely to perform better, as the features are consistent and relevant.
6. Saves Time and Resources:
o Without proper wrangling, a data analysis process may involve unnecessary troubleshooting or even lead to invalid results.
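A small wrangling sketch in R; the data frame and the cleaning choices below are hypothetical and only illustrate the steps described above:

# Hypothetical raw data with missing values, a duplicate row, and inconsistent labels
raw <- data.frame(name = c("Ana", "Ben", "Ben", "Cara", "Dev"),
                  age = c(25, NA, NA, 31, 40),
                  city = c("Pune", "pune", "pune", "Delhi", NA),
                  stringsAsFactors = FALSE)
clean <- raw[!duplicated(raw), ]                               # remove duplicate rows
clean$age[is.na(clean$age)] <- median(clean$age, na.rm = TRUE) # impute missing ages
clean$city <- tolower(clean$city)                              # standardize categorical values
clean <- na.omit(clean)                                        # drop remaining incomplete rows
clean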
What is the difference between feature generation and
feature selection?
Feature Generation:
Definition:
Feature generation refers to the process of creating new features from the
existing ones. This process aims to improve the model's performance by
introducing more relevant or meaningful features.
Goal:
- To enrich the dataset with additional features that may improve the predictive power of the model.
- To create new representations of the data that the model can learn from.
How It Works:
- Mathematical transformations: Applying mathematical functions (e.g., logarithms, square roots) to existing features.
- Interaction terms: Creating new features by combining two or more existing features (e.g., multiplying, adding, or creating ratios).
- Domain-specific features: Creating new features based on domain knowledge (e.g., combining year and month to create a "seasonality" feature).
- Time-based features: Extracting features like "day of the week," "hour of the day," or "month" from timestamps.
Example:
If you have height and weight, you could create a new feature called BMI (body mass index) using the formula:

BMI = \frac{\text{weight (kg)}}{\text{height (m)}^2}
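In R, generating such a feature is a one-liner; the height and weight columns below are hypothetical:

# Hypothetical data with height in metres and weight in kilograms
people <- data.frame(height_m = c(1.70, 1.85, 1.60),
                     weight_kg = c(68, 90, 55))
# Feature generation: derive BMI from the existing columns
people$bmi <- people$weight_kg / people$height_m^2
people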
Feature Selection:
Definition:
Feature selection refers to the process of selecting a subset of the most
relevant features from the existing set. This helps reduce the complexity of the
model and improves model performance by removing irrelevant or redundant
features.
Goal:
- To improve model efficiency by reducing dimensionality.
- To eliminate noise and irrelevant data, making the model simpler and
faster.
How It Works:
- Filter Methods: Select features based on statistical measures (e.g., correlation, chi-square test, ANOVA) without using a model.
- Wrapper Methods: Evaluate subsets of features based on model performance (e.g., forward selection, backward elimination, recursive feature elimination).
- Embedded Methods: Perform feature selection during model training (e.g., LASSO, decision trees with feature importance).

Example:
You may have a dataset with 100 features, but after feature selection, you may
find that only 10 of them are significantly contributing to the model's prediction.
You then select only these 10 features to build your model.
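As a rough sketch in R: a filter-style selection based on correlation with the target, and a wrapper-style backward elimination using step(). The mtcars data and the 0.5 cutoff are only illustrative assumptions:

# Filter method: keep features whose absolute correlation with mpg exceeds 0.5
cors <- abs(cor(mtcars)[, "mpg"])
selected <- names(cors[cors > 0.5 & names(cors) != "mpg"])
print(selected)
# Wrapper-style method: backward elimination from a full linear model (AIC-based)
full_model <- lm(mpg ~ ., data = mtcars)
reduced_model <- step(full_model, direction = "backward", trace = 0)
summary(reduced_model)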
Key Differences:
Aspect | Feature Generation | Feature Selection
Purpose | Create new, potentially more useful features. | Reduce the number of features to eliminate noise.
Impact on Dataset | Increases the number of features in the dataset. | Reduces the number of features in the dataset.
Approach | Adding new variables or transforming existing ones. | Removing irrelevant or redundant features.
Focus | Expanding the feature space. | Narrowing down to the most important features.
Techniques | Mathematical transformations, domain knowledge, etc. | Statistical tests, model-based methods, etc.
What are the components of a user-facing
recommendation engine?
A recommendation engine is a system that suggests products, services,
content, or information to users based on their preferences, behavior, or
characteristics. User-facing recommendation engines are typically part of
platforms like e-commerce sites (e.g., Amazon), streaming services (e.g., Netflix),
social media (e.g., Facebook), and other online platforms.
The main components of a user-facing recommendation engine are:
1. User Interaction Data (Input Data)
- Description: Data generated from the user's interactions with the system (e.g., clicks, purchases, ratings, time spent on content).
- Types of Data:
  o Explicit Feedback: Ratings, likes, or direct user input (e.g., "I like this product").
  o Implicit Feedback: Behavior-based data such as clicks, browsing history, or time spent on content.
  o Demographic Data: Information about users (e.g., age, gender, location) that can influence recommendations.
Example: On a movie streaming platform, user actions such as watching a
movie, rating it, or adding it to a watchlist can serve as interaction data.
2. Item Database (Product/Content Catalog)
- Description: The collection of all the items that the recommendation engine can suggest to the user. This includes products, movies, books, or other content.
- Types of Data:
  o Item Features: Descriptive data about each item (e.g., genre, price, artist, director).
  o Metadata: Additional information like release date, description, or keywords that help categorize and filter items.
Example: In an e-commerce platform, this would be the database of all products,
including details like product descriptions, price, and category.
3. Recommendation Algorithm(s)
Description: The core engine that analyzes user data and item information
to generate recommendations. The algorithm is based on one or more
methods, such as:
1. Collaborative Filtering:
   - User-based Collaborative Filtering: Recommends items that similar users have liked.
   - Item-based Collaborative Filtering: Recommends items similar to what the user has liked in the past.
2. Content-Based Filtering: Recommends items that are similar to ones the user has liked, based on item features (e.g., recommending action movies if the user has watched many action movies).
3. Hybrid Models: Combines collaborative filtering and content-based filtering to take advantage of both methods.
4. Matrix Factorization: Techniques like Singular Value Decomposition (SVD) decompose the user-item interaction matrix to find latent factors that explain user preferences.
5. Deep Learning: Uses neural networks to learn complex patterns from user-item interactions.
6. Popularity-based Recommendations: Suggests items based on their overall popularity (e.g., top-rated items).
Example: Netflix uses collaborative filtering (e.g., "users who watched this movie
also watched...") combined with content-based filtering (e.g., recommending
movies from the same genre).
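A toy sketch of item-based collaborative filtering in base R, using an invented user-item rating matrix and cosine similarity (all names and numbers are made up for illustration):

# Hypothetical user-item rating matrix (0 = not rated)
ratings <- matrix(c(5, 3, 0, 1,
                    4, 0, 0, 1,
                    1, 1, 0, 5,
                    0, 1, 5, 4),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("user", 1:4),
                                  c("movieA", "movieB", "movieC", "movieD")))
# Cosine similarity between item columns
norms <- sqrt(colSums(ratings^2))
item_sim <- (t(ratings) %*% ratings) / (norms %o% norms)
# Items most similar to movieA (excluding movieA itself)
sort(item_sim["movieA", colnames(ratings) != "movieA"], decreasing = TRUE)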
4. Ranking and Personalization
- Description: Once recommendations are generated, they need to be ranked and personalized to suit each user.
  o Ranking: Sorting the recommended items based on relevance, considering factors such as:
    - User preferences
    - Popularity
    - Recency
    - Expected value to the user
  o Personalization: Tailoring the recommendations to each individual user based on their specific interests, past behavior, and context.
Example: On Spotify, not only are song recommendations personalized based
on the user's listening history, but they're also ranked based on what is most
relevant (e.g., the latest releases, user's most listened genres).
5. Filtering and Diversity
Description: To avoid recommending the same items repeatedly and to
provide variety, filters are applied. Diversity ensures the user sees a mix of
content.
o Diversity Filtering: Prevents the engine from recommending items too similar to each other or ones the user has already seen.
o Contextual Filters: Filters based on context like location, time of day, or season.
Example: In an e-commerce website, the engine might filter out products that the
user has already purchased or viewed in the last 30 days.

What are the principles of effective data visualization?
Effective data visualization helps communicate information clearly and concisely
by using visual elements like charts, graphs, and maps. To create impactful
visualizations, it's essential to follow certain principles that ensure clarity,
accuracy, and engagement.
1. Know Your Audience
- Principle: Understand the target audience's expertise, background, and goals. Tailor the complexity and style of the visualization to suit their needs.
- Why It's Important: Different audiences (e.g., executives vs. data scientists) will have varying levels of technical knowledge and specific interests. A simple, intuitive design might be needed for non-experts, while advanced users may require more detailed and interactive visuals.
Example: A pie chart for an executive might work well to show market share
percentages, while a data scientist might prefer a more complex bar chart or line
graph to identify trends over time.
2. Clarity and Simplicity
- Principle: Strive for simplicity and clarity in design. Avoid unnecessary elements (e.g., extraneous text, excessive colors) that might confuse the message.
- Why It's Important: A cluttered or overly complex visualization can overwhelm the viewer, leading to confusion and misunderstanding. Keep the design minimal and focused on the data story.
Example: Instead of using 5 different colors in a bar chart, limit the chart to 2 or 3
colors that are easy to distinguish and convey the main points.
3. Use the Right Type of Chart
- Principle: Choose the most appropriate chart or graph to represent the data. Different types of charts work better for different data types.
- Why It's Important: The chart type should match the message you want to convey and the nature of the data (e.g., trends, comparisons, distributions).
Example:
- Line charts for trends over time.
- Bar charts for comparing quantities across categories.
- Pie charts for showing parts of a whole.
- Scatter plots for relationships between two variables.
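A quick base R sketch of these chart types, using built-in datasets purely as placeholders:

plot(Nile, type = "l", main = "Trend over time")              # line chart
barplot(table(mtcars$cyl), main = "Cars by cylinder count")   # bar chart
pie(table(mtcars$gear), main = "Share of cars by gear count") # pie chart
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight", ylab = "mpg")                           # scatter plot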
4. Highlight Key Insights
- Principle: Emphasize the most important parts of the data so the audience can quickly identify key insights.
- Why It's Important: Viewers should be able to extract actionable insights at a glance without having to interpret the whole chart.
Example: Use color to highlight important trends or outliers, or add annotations
to key data points that need attention.
5. Maintain Proportions and Accuracy
- Principle: Ensure that the visual representation is proportionate and accurately reflects the data.
- Why It's Important: Distorted or misleading visuals (e.g., using an inappropriate scale or changing axis ranges) can mislead the audience, causing wrong interpretations.
Example: In bar charts, ensure the y-axis starts at zero, unless there's a
compelling reason not to, to avoid exaggerating the differences between bars.