DELHI PUBLIC SCHOOL, KANPUR
CLASS 12
SUBJECT - ARTIFICIAL INTELLIGENCE (843)
Chapter - Data Science Methodology : An Analytic Approach to Capstone Project
Questions / Answers
1. How many steps are there in Data Science Methodology? Name them in order.
Ans. Data Science Methodology is a prescribed sequence of iterative steps that data scientists
follow to approach a problem and find a solution. It provides a structured way to handle and comprehend the data.
Data Science Methodology consists of 10 steps:
1. Business Understanding
2. Analytic Approach
3. Data Requirements
4. Data Collection
5. Data Understanding
6. Data Preparation
7. Modelling
8. Evaluation
9. Deployment
10. Feedback
2. What are the different types of data analytics used in the Analytic Approach?
Ans. There are four main types of data analytics.
Descriptive Analytics:
This summarizes past data to understand what has happened. It is the first step undertaken in data analytics.
It describes trends and patterns using tools such as graphs and charts, and statistical measures such as mean,
median, and mode to understand central tendency. This method also examines the spread of the data using range,
variance, and standard deviation.
Diagnostic Analytics:
It helps to understand why something has happened. This is normally done by
analyzing past data using techniques like root cause analysis, hypothesis testing, correlation analysis, etc.
The main purpose is to identify the causes or factors that led to a certain outcome.
Predictive Analytics:
This uses past data to make predictions about future events or trends, using techniques like regression,
classification, clustering, etc. Its main purpose is to foresee future outcomes and support informed decisions.
Prescriptive Analytics:
This recommends the action to be taken to achieve the desired outcome, using techniques such as
optimization, simulation, decision analysis etc. Its purpose is to guide decisions by suggesting the best
course of action based on data analysis.
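A minimal Python sketch of descriptive analytics (the sales figures are made-up, illustrative values):

    import statistics

    # Illustrative daily sales figures
    sales = [120, 150, 150, 180, 200, 210, 650]

    # Central tendency
    print("Mean:  ", statistics.mean(sales))
    print("Median:", statistics.median(sales))
    print("Mode:  ", statistics.mode(sales))

    # Spread of the data
    print("Range:   ", max(sales) - min(sales))
    print("Variance:", statistics.variance(sales))   # sample variance
    print("Std dev: ", statistics.stdev(sales))      # sample standard deviation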
3. Data is collected from different sources. Explain the different types of sources with examples.
Ans. Data collection is a systematic process of gathering observations or measurements. In this phase, the
data requirements are revised and decisions are made as to whether the collection requires more or less data.
Today’s high-performance database analytics enable data scientists to utilize large datasets. There are
mainly two sources of data collection:
→ Primary Data Source -
A primary data source refers to the original source of data, where the data is collected firsthand
through direct observation, experimentation, surveys, interviews, or other methods.
This data is raw, unprocessed, and unbiased, providing the most accurate and reliable information for
research, analysis, or decision-making purposes.
Examples include marketing campaigns, feedback forms, IoT sensor data etc.
→ Secondary Data Source -
A secondary data source refers to the data which is already stored and ready for use.
Data given in books, journals, websites, internal transactional databases, etc. can be reused for data
analysis.
Some methods of collecting secondary data are social media data tracking, web scraping, and
satellite data tracking.
Some sources of online data are data.gov, World Bank Open Data, UNICEF, Open Data Network,
Kaggle, World Health Organization, Google, etc.
Smart forms are an easy way to procure data online.
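As a minimal illustration, a secondary dataset downloaded from one of these portals can be loaded for analysis with pandas (the filename below is a placeholder):

    import pandas as pd

    # "survey.csv" stands in for any dataset downloaded from an open
    # data portal such as data.gov or Kaggle
    df = pd.read_csv("survey.csv")

    print(df.shape)   # number of rows and columns
    print(df.head())  # first five records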
4. Write a short note on the steps done during Data Preparation.
Ans. The Data Preparation stage covers all the activities needed to build the dataset that will be used in the modelling step.
Data is transformed into a state where it is easier to work with. Data preparation includes (a short sketch follows this list) -
● cleaning data (dealing with invalid or missing values, removing duplicate values and assigning a suitable format)
● combining data from multiple sources (archives, tables and platforms)
● transforming data into meaningful input variables
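A minimal pandas sketch of these steps, assuming a small hypothetical dataset with age and city columns:

    import pandas as pd

    # Hypothetical raw data with missing and duplicate values
    df = pd.DataFrame({
        "age":  [25, None, 25, 40],
        "city": ["Kanpur", "Delhi", "Kanpur", None],
    })

    df = df.drop_duplicates()                          # remove duplicate rows
    df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages
    df = df.dropna(subset=["city"])                    # drop rows with no city
    df["age"] = df["age"].astype(int)                  # assign a suitable format

    print(df)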
5. What do you mean by Feature Engineering?
Ans. Feature Engineering is a part of Data Preparation, which is the most time-consuming of the Data Science
stages. Feature engineering is the process of selecting, modifying, or creating
new features (variables) from raw data to improve the performance of machine learning models.
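A minimal sketch, assuming a hypothetical dataset with height_m and weight_kg columns: a new BMI feature is derived from the raw columns, and a categorical feature is encoded numerically:

    import pandas as pd

    df = pd.DataFrame({
        "height_m":  [1.60, 1.75, 1.82],
        "weight_kg": [55, 72, 90],
        "size":      ["S", "M", "L"],
    })

    # Create a new feature from existing raw columns
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

    # Encode a categorical feature as numeric dummy variables
    df = pd.get_dummies(df, columns=["size"])

    print(df)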
6. Which step of Data Science Methodology is related to constructing the data set? Explain.
Ans. Data Understanding encompasses all activities related to constructing the dataset. In this stage, we check
whether the data collected actually represents the problem to be solved. The relevance, comprehensiveness,
and suitability of the data for addressing the specific problem or question at hand are evaluated. Techniques
such as descriptive statistics and visualization can be applied to the dataset to assess its content, its quality,
and initial insights about the data.
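A minimal pandas sketch of such checks ("dataset.csv" is a placeholder filename):

    import pandas as pd

    df = pd.read_csv("dataset.csv")

    print(df.describe())       # descriptive statistics for numeric columns
    print(df.isnull().sum())   # count of missing values per column
    print(df.dtypes)           # check that each column has a suitable type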
7. What do you mean by AI Modelling?
Ans. The modelling stage uses the initial version of the prepared dataset and focuses on developing models
according to the analytic approach previously defined. The modelling process is usually iterative, often leading
to adjustments in the preparation of the data. For a given technique, data scientists can test multiple
algorithms to identify the most suitable model for the Capstone Project.
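A minimal scikit-learn sketch of this process, using synthetic data as a stand-in for the capstone dataset and testing a few candidate algorithms:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic dataset standing in for the prepared capstone data
    X, y = make_classification(n_samples=200, random_state=42)

    candidates = {
        "Logistic Regression":  LogisticRegression(max_iter=1000),
        "Decision Tree":        DecisionTreeClassifier(random_state=42),
        "k-Nearest Neighbours": KNeighborsClassifier(),
    }

    # Test multiple algorithms to identify the most suitable model
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy = {scores.mean():.3f}")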
8. Differentiate between descriptive modelling and predictive modelling.
Ans. Data Modelling focuses on developing models that are either descriptive or predictive.
1. Descriptive Modeling:
It is a concept in data science and statistics that focuses on summarizing and understanding the
characteristics of a dataset without making predictions or decisions.
The goal of descriptive modeling is to describe the data rather than predict or make decisions based
on it.
This includes summarizing the main characteristics, patterns, and trends that are present in the data.
Descriptive modeling is useful when you want to understand what is happening within your data and
how it behaves, but not necessarily why it happens.
Common Descriptive Techniques:
o Summary Statistics: This includes measures like:
Mean (average), Median, Mode
Standard deviation, Variance
Range (difference between the highest and lowest values)
Percentiles (e.g., quartiles)
o Visualizations: Graphs and charts to represent the data, such as:
Bar charts
Histograms
Pie charts
Box plots
Scatter plots
2. Predictive Modeling:
It involves using data and statistical algorithms to identify patterns and trends in order to predict
future outcomes or values.
It relies on historical data and uses it to create a model that can predict future behavior or trends or
forecast what might happen next.
It involves techniques like regression, classification, and time-series forecasting, and can be applied
in a variety of fields, from predicting exam scores to forecasting weather or stock prices.
While it is a powerful tool, students must also understand its limitations and the importance of good
data.
The data scientist will use a training set for predictive modeling. A training set is a set of historical
data in which the outcomes are already known. The training set acts like a gauge to determine if the
model needs to be calibrated.
In this stage, the data scientist experiments with different algorithms to ensure that the variables
selected are actually required.
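A minimal sketch of predictive modelling, assuming a tiny made-up training set of study hours and exam scores:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training set: historical data in which the outcomes are already known
    hours_studied = np.array([[1], [2], [3], [4], [5]])
    exam_scores   = np.array([35, 48, 60, 72, 85])

    model = LinearRegression()
    model.fit(hours_studied, exam_scores)   # learn from known outcomes

    # Predict a future outcome for an unseen input
    print(model.predict([[6]]))             # forecast score for 6 hours of study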
9. What do you understand by Evaluation? Write its phases also.
Ans. Evaluation in an AI project cycle is the process of assessing how well a model performs after training.
It involves using test data to measure metrics like accuracy, precision, recall, or F1 score. This helps
determine if the model is reliable and effective before deploying it in real-world situations.
Model evaluation can have two main phases.
First phase – Diagnostic measures
It is used to ensure the model is working as intended. If the model is a predictive model, a decision tree can
be used to evaluate whether the output of the model is aligned with the initial design or requires any
adjustments. If the model is a descriptive model, one in which relationships are being assessed, then a testing
set with known outcomes can be applied and the model refined as needed.
Second phase – Statistical significance test
This type of evaluation is applied to the model to verify that it accurately processes and interprets the
data. It is designed to avoid unnecessary second-guessing when the answer is revealed.
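As one illustrative choice of significance test (an assumption, not the only option), a paired t-test can compare the fold-by-fold cross-validation scores of two candidate models:

    from scipy.stats import ttest_rel
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, random_state=0)

    scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
    scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

    # Paired t-test on the fold-by-fold scores of the two models
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # a small p suggests a real difference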
10. Is Feedback a necessary step in Data Science Methodology? Justify your answer.
Ans. Yes, Feedback is a necessary step in Data Science Methodology. This includes results collected from
the deployment of the model, feedback on the model’s performance from the users and clients, and
observations from how the model works in the deployed environment. Feedback from the users will help to
refine the model and assess it for performance and impact.
11. Why is model validation important?
Ans. Model Validation offers a systematic approach to measuring a model's accuracy and reliability, providing
insights into how well it generalizes to new, unseen data. It is the step conducted after Model Training,
wherein the effectiveness of the trained model is assessed using a testing dataset. Validating the machine
learning model during the training and development stages is crucial for ensuring accurate predictions.
The benefits of Model Validation include -
• Enhanced model quality
• Reduced risk of errors
• Prevention of overfitting and underfitting
12. Write a comparative study on train-test split and cross validation.
Ans.
Train-Test Split | Cross Validation
Normally applied on large datasets. | Normally applied on small datasets.
Divides the data into a training dataset and a testing dataset. | Divides the dataset into subsets (folds), trains the model on some folds, and evaluates its performance on the remaining data.
Clear demarcation between training data and testing data. | Every data point could, at some stage, be in either the testing or the training dataset.
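A minimal scikit-learn sketch contrasting the two approaches on the same synthetic dataset:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, random_state=1)
    model = LogisticRegression(max_iter=1000)

    # Train-test split: one fixed partition of the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=1)
    model.fit(X_train, y_train)
    print("Hold-out accuracy:", model.score(X_test, y_test))

    # Cross validation: every point serves in both training and testing folds
    scores = cross_val_score(model, X, y, cv=5)
    print("5-fold mean accuracy:", scores.mean())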
13. Explain the different metrics used for evaluating Classification models.
Ans. Evaluation metrics help assess the performance of a trained model on a test dataset, providing insights
into its strengths and weaknesses. These metrics enable comparison of different models, including variations
of the same model, to select the best-performing one for a specific task. For classification models, the
commonly used metrics are Accuracy, Precision, Recall, and the F1-Score, all of which are derived from the
Confusion Matrix; each is explained in the questions that follow.
14. Explain Confusion Matrix.
Ans. A Confusion Matrix is a table used to evaluate the performance of a classification model. It
summarizes the predictions against the actual outcomes in an N x N matrix, where N is the number of classes
or categories to be predicted. For a binary classification problem, N = 2 (Yes/No), giving a 2 x 2 matrix with
four cells:
True Positives (TP): the model predicted Yes and the real output was also Yes.
True Negatives (TN): the model predicted No and the real output was also No.
False Positives (FP): the model predicted Yes but it was actually No.
False Negatives (FN): the model predicted No but it was actually Yes.
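A minimal scikit-learn sketch of a 2 x 2 confusion matrix (the label lists are illustrative):

    from sklearn.metrics import confusion_matrix

    # Illustrative actual vs predicted labels (1 = Yes, 0 = No)
    y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
    y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

    # Rows = actual class, columns = predicted class
    cm = confusion_matrix(y_actual, y_predicted)
    print(cm)
    # [[TN FP]
    #  [FN TP]]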
15. How do you calculate Accuracy, Precision, Recall and F1-Score?
Ans. Accuracy
Accuracy = Number of correct predictions / Total number of predictions
Accuracy = (TP+TN) / (TP+FP+FN+TN)
Precision and Recall
Precision measures “What proportion of predicted Positives is truly Positive?”
Precision = (TP) / (TP+FP).
Precision should be as high as possible.
Recall measures “What proportion of actual Positives is correctly classified?”
Recall = (TP) / (TP+FN)
F1 Score
The F1 score is the harmonic mean of Precision and Recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
A good F1 score means that you have low false positives and low false negatives, so you are correctly
identifying real positives without being disturbed by false alarms. An F1 score of 1 is perfect, while a
score of 0 means the model is a total failure.
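A minimal sketch computing all four metrics from illustrative confusion-matrix counts:

    # Illustrative confusion-matrix counts
    TP, TN, FP, FN = 40, 45, 5, 10

    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = 2 * precision * recall / (precision + recall)

    print(f"Accuracy:  {accuracy:.2f}")    # 0.85
    print(f"Precision: {precision:.2f}")   # 0.89
    print(f"Recall:    {recall:.2f}")      # 0.80
    print(f"F1 score:  {f1:.2f}")          # 0.84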
16. Explain MAE, MSE and RMSE.
Ans. 1. MAE - Mean Absolute Error is the mean of the absolute differences between predictions and actual
values:
MAE = (1/n) * Σ |actual - predicted|
A value of 0 indicates no error, i.e. perfect predictions.
2. MSE - Mean Square Error is the most commonly used metric to evaluate the performance of a regression
model. It is the mean (average) of the squared differences between the target variable and the predicted
values:
MSE = (1/n) * Σ (actual - predicted)²
3. RMSE - Root Mean Square Error is the square root of MSE and corresponds to the standard deviation of
the residuals (prediction errors):
RMSE = √MSE
RMSE is often preferred over MSE because it is easier to interpret, since it is in the same units as the target
variable.
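A minimal numpy sketch of all three metrics, using illustrative actual and predicted values:

    import numpy as np

    # Illustrative actual and predicted values from a regression model
    actual    = np.array([3.0, 5.0, 7.5, 10.0])
    predicted = np.array([2.5, 5.0, 8.0, 11.0])

    errors = actual - predicted

    mae  = np.mean(np.abs(errors))   # Mean Absolute Error
    mse  = np.mean(errors ** 2)      # Mean Square Error
    rmse = np.sqrt(mse)              # Root Mean Square Error

    print(f"MAE:  {mae:.3f}")    # 0.500
    print(f"MSE:  {mse:.3f}")    # 0.375
    print(f"RMSE: {rmse:.3f}")   # 0.612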