Course 4

This lesson covers the processes of data preparation, model building, and cross-validation using a telecom case study. Key techniques discussed include data normalization, imputation of missing values, data transformation, and dimensionality reduction, all of which are essential for preparing quality data for modeling. The lesson also emphasizes the importance of model evaluation through performance metrics and cross-validation to select the best model before deployment.


Welcome to the lesson on Data Preparation, Model Building, and Cross-Validation: Introduction

In this lesson we will first discuss how to prepare data; once we finish the data preparation part, we will go into the model building part; and finally we will show the steps to be taken before deploying the model.

In this regard we have considered a case study with telecom data, where we first show how to explore the data using EDA, then data visualization, and after that Data Preparation, Model Building, and Model Evaluation.

For Model Evaluation, we will show various kinds of performance metrics.


For example, for a regression problem, what kinds of performance metrics are there? While implementing the regression model we will also show the multicollinearity issue.

One thing we should keep in mind while deploying a machine learning model: there are various kinds of models, and for one particular dataset it is possible that we create 5, 6, or many more models.

So how do we select the best model out of the ones we have developed so far?


After that we will talk about cross-validation; this is one of the most crucial concepts in model building and model evaluation.

In this lesson we will discuss the different types of cross-validation that exist, give a cross-validation example, and briefly talk about the classification problem, to show you how to evaluate the model if you have a classification problem.

What are the different types of performance metrics for a classification problem? The performance metrics for a regression model differ from those for a classification model.

Finally, there is one important concept called the ROC-AUC curve that will actually help you understand how a particular classification model performs (a quick sketch follows below). With this we will conclude the introduction and move into the session itself.
Now let us look at model building.

In model building there are various stages: Data Preparation, Model Building, and Model Deployment. Within each of these there are further steps; within Data Preparation we have included only a few, such as normalization and standardization, which are techniques to scale the data.

In many cases we observe data like salary in a dataset: some employees will be earning in millions and some will be earning in thousands.

Now if you want to run any model on that, any regression model or any classification model, then because of the range of the data, which lies from thousands to millions, the model may not give you proper output, may not give you a proper result. Because of that, what we do is normalize or standardize the data.
In normalization, whatever the data is, it will lie between 0 and 1; standardization is very similar to a z-score calculation, i.e. (X - µ)/σ: depending on the mean and standard deviation, the data will be standardized.
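As a hedged illustration (the salary numbers here are our own toy example), both scalings can be done with scikit-learn:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Toy salary column ranging from thousands to millions
    salary = np.array([[35_000.0], [52_000.0], [78_000.0],
                       [1_200_000.0], [2_500_000.0]])

    # Normalization: rescale every value into the range [0, 1]
    normalized = MinMaxScaler().fit_transform(salary)

    # Standardization: z-score, (X - mean) / standard deviation
    standardized = StandardScaler().fit_transform(salary)

    print(normalized.ravel())
    print(standardized.ravel())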
Data normalization and standardization are used in various algorithms. Speaking of normalization: it is used in clustering problems, where you calculate some distance metric to find the distance between particular data points.

So normally if you are using a clustering technique, we use normalization, but it is not restricted to clustering; we are just giving you an example. It can be used in regression also: if you want to minimize the impact of large numbers in your dataset, you can normalize the data.
One popular example of standardization is in artificial neural networks and various deep learning models, where we require the data to be standardized. For some of the deep learning models, not all of them, there is no hard and fast rule that you have to standardize the data. But some research suggests that in some cases a deep learning model performs better on standardized data than on non-standardized data; in those cases you would normally apply normalization or standardization.

In this lesson we will also learn about data imputation and data transformation. Now let us talk about the second technique used for data preparation, that is, imputation of missing values.

In many cases we have seen that a dataset will have many missing values. For example, we have a dataset of location coordinates: maybe we have 10 cities, but the coordinates of a few of the cities are missing; or say the age of one particular individual is missing.

Let us say there are 100 observations, and out of those 100, the age is missing for 10 or 20 of them. One way to handle that, although it is very trivial and we should not do it, is to drop those observations. But that has a detrimental effect on the model, so you should not randomly drop observations, because we may lose out on other important information. In many cases what we do instead is replace the missing data with the mean.

We would take the average age for that particular column and replace the missing values with the mean. However, in many cases we build some machine learning model to predict the missing values. There are various other kinds of techniques as well; we are just giving you an overall idea of the techniques we normally consider while preparing the data.

The next technique is data transformation. I have already mentioned that in many cases we require the data to follow some kind of distribution; if it doesn't, we then apply some mathematical transformation, for example a logarithmic transformation.

After plotting the density function we may find that a particular variable does not follow the normal distribution, but in many cases we observe that if you take the logarithm of that variable, it does follow the normal distribution; that is called a logarithmic transformation. It is one example; you can apply many other mathematical transformations as well. So data transformation is another technique we apply for data preparation.
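A minimal sketch of a log transformation (assuming a strictly positive, right-skewed variable; the data here is simulated by us):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)

    # Log-normally distributed data is heavily right-skewed on the raw scale
    income = rng.lognormal(mean=10, sigma=1, size=1000)

    # Taking the logarithm brings it back to an (approximately) normal shape
    log_income = np.log(income)  # np.log1p(income) if zeros are possible

    print("skewness before:", skew(income), "after:", skew(log_income))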

In this video we will learn about dimensionality reduction.

Finally, one more technique to cover is dimensionality reduction; many other techniques exist as well.

In dimensionality reduction, what happens is this: suppose in our dataset there are 200 or 300 variables or columns. In many cases, as we will show in a future session if possible, if you have too many variables in a model, the model may not do well. In that case what we do is try to remove some of the variables, that is, remove some of the dimensions from the data. Principal component analysis and many other techniques can be used to reduce the dimension, and to find out which particular columns or variables should be considered for the modelling.
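A minimal, self-contained sketch of PCA with scikit-learn (the 200-column table is simulated by us, not the telecom data):

    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # A wide toy table: 500 rows, 200 columns
    X, _ = make_regression(n_samples=500, n_features=200, random_state=42)

    # PCA is sensitive to scale, so standardize the columns first
    X_scaled = StandardScaler().fit_transform(X)

    # Keep just enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)

    print(X.shape, "->", X_reduced.shape)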
preparation, we have just listed down some of the important points or some of the important
techniques, you will find out some other techniques also in the future so that’s all about data
preparation so keep in mind for xxxx data science project, data preparation takes significant amount
of time, so model is good as the data
so a data scientist or as researcher a manager, your objective is prepare quality data and for that if
you have to apply many different techniques to prepare quality data you should do that so once you
have very good data set than we get into model building

In this video let us understand the model building process.

In model building we apply different models in different cases. For example, if it is a regression problem, maybe we apply a decision tree model, maybe a random forest regression model, and various other models, and from those we find out which model is doing reasonably well and we fix that model.

We are selecting one model out of many, and while selecting it there are a few things we normally consider; for example, we do cross-validation, where we validate the model on folds of the training dataset and finally test it on the test dataset.
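A hedged sketch of k-fold cross-validation with scikit-learn (the data and the choice of a random forest are illustrative, not prescribed by the lesson):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=500, n_features=10, noise=10,
                           random_state=42)

    # 5-fold cross-validation: fit on 4 folds, score on the held-out fold
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")

    print("per-fold R^2:", scores)
    print("mean R^2:", scores.mean())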

Now, talking about selecting the model based on some measure: for each model we will have some performance metric, depending on the type of machine learning model. For example, if it is a regression model, it will have its own performance metrics, such as root mean squared error and others; if you are working on a classification problem, your performance metrics will be different. So depending on the type of machine learning project you are working on, your performance metrics will change.
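For example, root mean squared error can be computed like this (a minimal sketch with made-up true and predicted values):

    import numpy as np
    from sklearn.metrics import mean_squared_error

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])

    # RMSE = sqrt(mean((y_true - y_pred)^2))
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print("RMSE:", rmse)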
After that we basically test the model on various datasets. This session will not focus on it in depth, but in machine learning projects we use hyperparameter tuning techniques, where we plug and play with various parameters of the model, for example the tree depth: we plug and play with all those things and find out for which combination of parameters the machine learning model performs best. That is all about model building.
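A hedged sketch of this plug-and-play tuning using grid search (the parameter grid values here are our own illustrative choices):

    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=10,
                           random_state=42)

    # Try every combination of these tree parameters, scored by cross-validation
    param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]}
    search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                          cv=5)
    search.fit(X, y)

    print("best parameters:", search.best_params_)
    print("best CV score:", search.best_score_)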
Once we have prepared the data, developed the model, and obtained the expected accuracy on the testing data, we deploy the model.

In many cases, if you see that it is not doing well, then we go back to the beginning: we prepare the data again, build the model, test it, and redeploy after a couple of iterations. And once it keeps performing well in production, that is, after deployment, once users have started using the application, monitoring it becomes a periodic mandate, say once a week or once a month. So these are the basic steps involved in model building.
Once we do that, we find that whether we are building a machine learning model or a deep learning model, we follow a very similar pattern.

There will be a few other techniques depending on the domain you are working in, but on average this is the process the model you develop will follow.
Now let us look at an example case study, where we will use simulated data for a telecom case.

The dataset represents a mobile telecom service, and for understanding purposes we have created the variables from real-life instances, but the data we have used is simulated based on the normal distribution, in line with most of the assumptions in regression. I will be giving you a basic demo in data science: if you want to develop a model, you first have to prepare the data for the model.