Unit 1: Introduction to Data Science
1 Mark Questions:
1. Define Data Science.
Ans: Data Science is an interdisciplinary field that uses scientific methods,
algorithms, and systems to extract insights from structured and unstructured
data.
2. Mention one application of Data Science.
Ans: Fraud detection in banking.
2 Mark Questions:
1. List any two skills required for a data scientist.
Ans: Programming (e.g., Python) and knowledge of statistics.
2. What is the difference between data and information?
Ans: Data are raw facts and figures, while information is processed data that is
meaningful.
3 Mark Questions:
1. Explain the lifecycle of data science.
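Ans: The data science lifecycle consists of the following stages, usually carried
out in order and revisited as needed: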
○ Data collection
○ Data cleaning
○ Data exploration
○ Modeling
○ Evaluation
○ Deployment
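Example: the stages above can be sketched end to end on a toy problem. This is
only an illustration, not a production pipeline. The data set (hours studied vs.
exam score), the variable and function names, and the choice of a one-variable
least-squares line as the model are all assumptions made for this example.

```python
from statistics import mean

# 1. Data collection: hypothetical (hours_studied, exam_score) records.
raw = [(1, 52), (2, 56), (3, None), (4, 60), (5, 65), (6, 66), (7, 71)]

# 2. Data cleaning: drop records with a missing score.
clean = [(x, y) for x, y in raw if y is not None]

# 3. Data exploration: look at simple summaries.
xs = [x for x, _ in clean]
ys = [y for _, y in clean]
x_bar, y_bar = mean(xs), mean(ys)
print("mean hours:", round(x_bar, 2), "mean score:", round(y_bar, 2))

# 4. Modeling: fit score = a + b * hours by ordinary least squares.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in clean)
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar


def predict(hours):
    """6. Deployment stand-in: a function that other code can call."""
    return a + b * hours


# 5. Evaluation: mean absolute error, measured here on the fitting data
#    for brevity (an assumption; held-out data is normally used).
mae = mean(abs(predict(x) - y) for x, y in clean)
print("slope:", round(b, 2), "intercept:", round(a, 2), "MAE:", round(mae, 2))
```

In practice each stage is far larger, and evaluation is normally done on data
that was not used to fit the model.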
4 Mark Questions (Long Answer):
1. Explain the role and responsibilities of a data scientist.
Ans:
A data scientist analyzes large sets of data to find actionable insights. Their role
includes data cleaning, statistical analysis, model building, and interpreting
results to support decision-making. They work closely with business teams to
identify problems and provide data-driven solutions using tools like Python, R,
SQL, and machine learning algorithms.
Unit 2: Data Preprocessing and Data Wrangling
1 Mark Questions:
1. What is data cleaning?
Ans: Data cleaning is the process of detecting and correcting (or removing) errors,
inconsistencies, and missing values in data.
2. Name any one technique used for handling missing data.
Ans: Mean imputation.
2 Mark Questions:
1. What is normalization?
Ans: Normalization is the rescaling of numeric data so that it falls within a small,
specified range such as [0, 1].
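Example: one common way to do this is min-max scaling, where each value x becomes
(x - min) / (max - min). The sketch below assumes a plain list of numbers; the
function name `min_max_scale` and the sample ages are illustrative only.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


ages = [18, 25, 40, 60]  # hypothetical sample
scaled = min_max_scale(ages)
print(scaled)  # first value is 0.0, last value is 1.0
```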
2. Define data wrangling.
Ans: Data wrangling is the process of transforming and mapping raw data into a
more usable format.
3 Mark Questions:
1. Explain outlier detection.
Ans: Outliers are extreme values that differ markedly from the rest of the data.
They can be detected using:
○ Z-score method
○ Box plot analysis
○ IQR method
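Example: the Z-score and IQR methods listed above can be sketched with Python's
standard library. The thresholds used here (|z| > 3 and 1.5 x IQR) are common
conventions rather than fixed rules, and the function names and data are
assumptions made for this illustration.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 11, 10, 13, 12, 11, 12, 10, 13, 95]

print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

A box plot marks the same points as the IQR method: values beyond the whiskers
are drawn individually as outliers.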
4 Mark Questions (Long Answer):
1. Discuss various techniques for handling missing data.
Ans:
○ Deletion Methods: Remove rows or columns with missing values.
○ Imputation Methods: Fill missing values using:
■ Mean/median/mode imputation
■ Regression imputation
■ KNN imputation
○ Advanced Techniques: Use ML models to predict missing values.
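Example: deletion and simple imputation, as listed above, can be sketched as
follows. Missing values are assumed to be stored as `None`, and the helper names
`drop_missing` and `impute` and the sample rows are hypothetical.

```python
from statistics import mean, median, mode


def drop_missing(rows):
    """Deletion: keep only rows in which no field is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute(rows, column, strategy=mean):
    """Imputation: replace None in `column` with a statistic of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = strategy(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


rows = [
    {"age": 20, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 30, "city": "Pune"},
    {"age": 40, "city": None},
]

print(len(drop_missing(rows)))              # 2 complete rows remain
print(impute(rows, "age")[1])               # age filled with the mean, 30
print(impute(rows, "age", median)[1])       # age filled with the median, 30
print(impute(rows, "city", mode)[3])        # city filled with the mode, "Pune"
```

Mean and median suit numeric columns, while mode also works for categorical
columns. Regression and KNN imputation follow the same pattern but predict the
fill value from the other columns.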
Unit 3: Exploratory Data Analysis (EDA)
1 Mark Questions:
1. What is EDA?
Ans: EDA is the process of analyzing data sets to summarize their main
characteristics, often using visual methods.
2. Name any one graphical tool used in EDA.
Ans: Histogram.
2 Mark Questions:
1. What is the purpose of a box plot?
Ans: To visualize the distribution and detect outliers in the data.
2. Mention two summary statistics used in EDA.
Ans: Mean and standard deviation.
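Example: these summary statistics, along with the quartiles that a box plot
displays, can be computed with Python's `statistics` module. The scores below are
made up for illustration.

```python
from statistics import mean, median, stdev, quantiles

scores = [62, 70, 75, 75, 81, 90]  # hypothetical exam scores

print("mean:", mean(scores))
print("median:", median(scores))
print("sample standard deviation:", round(stdev(scores), 2))

# Quartiles are what a box plot draws as the box edges and the centre line.
q1, q2, q3 = quantiles(scores, n=4)
print("Q1, Q2, Q3:", q1, q2, q3)
```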
3 Mark Questions:
1. Explain the importance of correlation analysis.
Ans: Correlation analysis identifies the strength and direction of relationships
between variables, which is useful for feature selection, feature engineering, and
model selection.
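Example: the Pearson correlation coefficient r is one common measure. It ranges
from -1 (perfect negative linear relationship) to +1 (perfect positive linear
relationship). The sketch below computes it from the definition; the function
name `pearson_r` and the data are assumptions for this example.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


hours = [1, 2, 3, 4, 5]
print(round(pearson_r(hours, [2, 4, 6, 8, 10]), 3))   # 1.0  (rises together)
print(round(pearson_r(hours, [10, 8, 6, 4, 2]), 3))   # -1.0 (one rises, one falls)
```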
4 Mark Questions (Long Answer):
1. Explain the different types of visualizations used in EDA.
Ans:
○ Histogram: Shows the frequency distribution.
○ Box Plot: Displays median, quartiles, and outliers.
○ Scatter Plot: Shows relationships between two numeric variables.
○ Bar Chart: Used for categorical data.
○ Heatmap: Used to show correlation matrices.
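Example: a histogram, the first chart in the list above, is built by counting how
many values fall into each bin. The sketch below computes those counts and prints
a rough text version instead of using a plotting library. The function name
`histogram_counts`, the bin width, and the marks are assumptions.

```python
def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


marks = [35, 42, 47, 51, 55, 58, 58, 63, 67, 71, 88]  # hypothetical marks
width = 10

for start, n in histogram_counts(marks, width).items():
    # One '#' per value in the bin, e.g. the 50-59 bin gets '####'.
    print(f"{start}-{start + width - 1}: {'#' * n}")
```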
Unit 4: Statistical Foundations
1 Mark Questions:
1. Define population.
Ans: Population is the entire set of individuals or items that we are interested in
studying.
2. What is a hypothesis?
Ans: A hypothesis is a testable assumption or claim about a population, made for
the purpose of statistical testing.
2 Mark Questions:
1. Differentiate between population and sample.
Ans: A population includes all members of a group, while a sample is a subset of
the population.
2. What is a p-value?
Ans: The p-value is the probability of obtaining test results at least as extreme
as the observed results, assuming the null hypothesis is true.
3 Mark Questions:
1. Explain Type I and Type II errors.
Ans:
○ Type I Error (False Positive): Rejecting a true null hypothesis.
○ Type II Error (False Negative): Failing to reject a false null hypothesis.
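Example: both error types can be estimated by simulation. The sketch below
repeatedly draws samples and applies a two-sided z-test. When the null hypothesis
is actually true, every rejection is a Type I error; when it is actually false,
every failure to reject is a Type II error. The test choice, sample size, effect
size (0.3), number of trials, and function names are all assumptions.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with sigma assumed known."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=30, alpha=0.05,
                   trials=2000, seed=0):
    """Fraction of simulated samples for which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        if z_test_p_value(sample, mu0, sigma) < alpha:
            rejections += 1
    return rejections / trials


# H0 is true (true mean equals mu0): rejections are Type I errors.
type_1_rate = rejection_rate(true_mu=0.0)
# H0 is false (true mean is 0.3): non-rejections are Type II errors.
type_2_rate = 1 - rejection_rate(true_mu=0.3)

print("estimated Type I error rate:", type_1_rate)    # close to alpha
print("estimated Type II error rate:", type_2_rate)
```

The Type I error rate is controlled by the significance level alpha, while the
Type II error rate depends on the sample size and on how far the truth lies from
the null hypothesis.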
4 Mark Questions (Long Answer):
1. Describe the steps involved in hypothesis testing.
Ans:
○ Formulate null and alternative hypotheses.
○ Select significance level (alpha).
○ Choose the appropriate test.
○ Compute the test statistic.
○ Determine the p-value.
○ Compare p-value with alpha.
○ Make a decision: reject or fail to reject the null hypothesis.
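Example: the steps above can be followed on a small worked case. This sketch uses
a one-sample, two-sided z-test and assumes the population standard deviation is
known. The packet weights, the claimed mean of 500 g, and sigma = 3 g are
invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical data: fill weights (grams) of 10 packets.
sample = [498, 502, 497, 495, 499, 501, 496, 494, 500, 498]

# Step 1: H0: mu = 500, H1: mu != 500.
mu0 = 500.0
# Step 2: significance level.
alpha = 0.05
# Step 3: one-sample z-test, with the population sigma assumed known.
sigma = 3.0
# Step 4: test statistic.
z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
# Step 5: two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
# Steps 6 and 7: compare with alpha and decide.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print("z =", round(z, 3), "p =", round(p_value, 3), "->", decision)
```

Here the sample mean is 498 g, the p-value is about 0.035, and since that is
below alpha = 0.05 the null hypothesis is rejected. When sigma is unknown, a
t-test is the usual choice instead.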
Unit 5: Introduction to Machine Learning
1 Mark Questions:
1. Define machine learning.
Ans: Machine learning is a method of data analysis that automates analytical
model building.
2. Name one supervised learning algorithm.
Ans: Linear Regression.
2 Mark Questions:
1. Differentiate between supervised and unsupervised learning.
Ans: Supervised learning uses labeled data; unsupervised learning uses
unlabeled data.
2. What is overfitting?
Ans: Overfitting occurs when a model learns the training data too closely,
including its noise, so it performs well on training data but poorly on new,
unseen data.
3 Mark Questions:
1. Explain the K-Nearest Neighbors algorithm.
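Ans: KNN is a supervised learning algorithm, most often used for classification.
To classify a new point, it finds the k training points nearest to it (using a
distance measure such as Euclidean distance) and assigns the majority label among
those k neighbors. It can also be used for regression by averaging the neighbors'
values.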
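Example: a minimal KNN classifier can be sketched in a few lines. It assumes
numeric feature tuples and Euclidean distance; the function name `knn_predict`
and the training points are made up for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    `train` is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tied vote, the label seen first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_predict(train, (2.0, 2.0)))  # A
print(knn_predict(train, (6.0, 7.0)))  # B
```

An odd k is often chosen for two-class problems to avoid tied votes, and
features are usually scaled first so that no single feature dominates the
distance.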
4 Mark Questions (Long Answer):
1. Compare and contrast supervised, unsupervised, and reinforcement learning.
Ans:
○ Supervised Learning: Input-output pairs provided; used for classification
and regression.
○ Unsupervised Learning: Only input data, with no labels; used for clustering
and association rule mining.
○ Reinforcement Learning: Learning through trial and error using feedback
from actions (rewards or penalties).
Each technique is suited to different types of problems and data
availability.