Course 4

This lesson covers the processes of data preparation, model building, and cross-validation using a telecom case study. Key techniques discussed include data normalization, imputation of missing values, data transformation, and dimensionality reduction, all of which are essential for preparing quality data for modeling. The lesson also emphasizes the importance of model evaluation through performance metrics and cross-validation to select the best model before deployment.


Welcome to the lesson on Data Preparation, Model Building, and Cross-Validation: Introduction

In this lesson we will first discuss how to prepare data; once we finish the data preparation part, we will go into the model building part; and finally we will show the steps to be taken before deploying the model.

In this regard we have considered a case study with telecom data, where we first show how to explore the data using EDA, then data visualization, and after that Data Preparation, Model Building, and Model Evaluation.

For Model Evaluation, we will show various kinds of performance metrics.


For example, for a regression problem, what kinds of performance metrics are there? While implementing the regression model we will also show the multicollinearity issue.

One thing we should keep in mind while deploying a machine learning model: there are various kinds of models, and for one particular dataset it is possible that we create 5, 6, or many more models.

So how do we select the best model out of the ones we have developed so far?


After that we will talk about cross-validation; this is one of the most crucial concepts in model building and model evaluation.

In this lesson we will discuss the different types of cross-validation that exist, give a cross-validation example, and briefly talk about the classification problem, to show you how to evaluate the model if you have a classification problem.

What are the different types of performance metrics for a classification problem? The performance metrics for a regression model differ from those for a classification model.

Finally, there is one important concept called the ROC-AUC curve that will actually help you understand how a particular classification model performs (a quick sketch follows below). With this we will conclude the introduction and move into the session itself.
Now let us look at model building.

In model building there are various stages: Data Preparation, Model Building, and Model Deployment. Within each of these there are further steps; within Data Preparation we have included only a few, such as normalization and standardization, which are techniques to scale the data.

In many cases we observe data like salary in a dataset: some employees will be earning in millions and some will be earning in thousands.

Now if you want to run any model on that, any regression model or any classification model, then because of the range of the data, which lies from thousands to millions, the model may not give you proper output, may not give you a proper result. Because of that, what we do is normalize or standardize the data.
In normalization, whatever the data is, it will lie between 0 and 1; standardization is very similar to a z-score calculation, i.e. (X - µ)/σ: depending on the mean and standard deviation, the data will be standardized.
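As a hedged illustration (the salary numbers here are our own toy example), both scalings can be done with scikit-learn:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Toy salary column ranging from thousands to millions
    salary = np.array([[35_000.0], [52_000.0], [78_000.0],
                       [1_200_000.0], [2_500_000.0]])

    # Normalization: rescale every value into the range [0, 1]
    normalized = MinMaxScaler().fit_transform(salary)

    # Standardization: z-score, (X - mean) / standard deviation
    standardized = StandardScaler().fit_transform(salary)

    print(normalized.ravel())
    print(standardized.ravel())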
Data normalization and standardization are used in various algorithms. Speaking of normalization: it is used in clustering problems, where you calculate some distance metric to find the distance between particular data points.

So normally if you are using a clustering technique, we use normalization, but it is not restricted to clustering; we are just giving you an example. It can be used in regression also: if you want to minimize the impact of large numbers in your dataset, you can normalize the data.
One popular example of standardization is in artificial neural networks and various deep learning models, where we require the data to be standardized. For some of the deep learning models, not all of them, there is no hard and fast rule that you have to standardize the data. But some research suggests that in some cases a deep learning model performs better on standardized data than on non-standardized data; in those cases you would normally apply normalization or standardization.

In this lesson we will also learn about data imputation and data transformation. Now let us talk about the second technique used for data preparation, that is, imputation of missing values.

In many cases we have seen that a dataset will have many missing values. For example, we have a dataset of location coordinates: maybe we have 10 cities, but the coordinates of a few of the cities are missing; or say the age of one particular individual is missing.

Let us say there are 100 observations, and out of those 100, the age is missing for 10 or 20 of them. One way to handle that, although it is very trivial and we should not do it, is to drop those observations. But that has a detrimental effect on the model, so you should not randomly drop observations, because we may lose out on other important information. In many cases what we do instead is replace the missing data with the mean.

We would take the average age for that particular column and replace the missing values with the mean. However, in many cases we build some machine learning model to predict the missing values. There are various other kinds of techniques as well; we are just giving you an overall idea of the techniques we normally consider while preparing the data.

The next technique is data transformation. I have already mentioned that in many cases we require the data to follow some kind of distribution; if it doesn't, we then apply some mathematical transformation, for example a logarithmic transformation.

After plotting the density function we may find that a particular variable does not follow the normal distribution, but in many cases we observe that if you take the logarithm of that variable, it does follow the normal distribution; that is called a logarithmic transformation. It is one example; you can apply many other mathematical transformations as well. So data transformation is another technique we apply for data preparation.
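A minimal sketch of a log transformation (assuming a strictly positive, right-skewed variable; the data here is simulated by us):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)

    # Log-normally distributed data is heavily right-skewed on the raw scale
    income = rng.lognormal(mean=10, sigma=1, size=1000)

    # Taking the logarithm brings it back to an (approximately) normal shape
    log_income = np.log(income)  # np.log1p(income) if zeros are possible

    print("skewness before:", skew(income), "after:", skew(log_income))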

In this video we will learn about dimensionality reduction.

Finally, one more technique to cover is dimensionality reduction; many other techniques exist as well.

In dimensionality reduction, what happens is this: suppose in our dataset there are 200 or 300 variables or columns. In many cases, as we will show in a future session if possible, if you have too many variables in a model, the model may not do well. In that case what we do is try to remove some of the variables, that is, remove some of the dimensions from the data. Principal component analysis and many other techniques can be used to reduce the dimension, and to find out which particular columns or variables should be considered for the modelling.
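A minimal, self-contained sketch of PCA with scikit-learn (the 200-column table is simulated by us, not the telecom data):

    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # A wide toy table: 500 rows, 200 columns
    X, _ = make_regression(n_samples=500, n_features=200, random_state=42)

    # PCA is sensitive to scale, so standardize the columns first
    X_scaled = StandardScaler().fit_transform(X)

    # Keep just enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)

    print(X.shape, "->", X_reduced.shape)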
preparation, we have just listed down some of the important points or some of the important
techniques, you will find out some other techniques also in the future so that’s all about data
preparation so keep in mind for xxxx data science project, data preparation takes significant amount
of time, so model is good as the data
so a data scientist or as researcher a manager, your objective is prepare quality data and for that if
you have to apply many different techniques to prepare quality data you should do that so once you
have very good data set than we get into model building

In this video let us understand the model building process.

In model building we apply different models in different cases. For example, if it is a regression problem, maybe we apply a decision tree model, maybe a random forest regression model, and various other models, and from those we find out which model is doing reasonably well and we fix that model.

We are selecting one model out of many, and while selecting it there are a few things we normally consider; for example, we do cross-validation, where we validate the model on folds of the training dataset and finally test it on the test dataset.
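A hedged sketch of k-fold cross-validation with scikit-learn (the data and the choice of a random forest are illustrative, not prescribed by the lesson):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=500, n_features=10, noise=10,
                           random_state=42)

    # 5-fold cross-validation: fit on 4 folds, score on the held-out fold
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")

    print("per-fold R^2:", scores)
    print("mean R^2:", scores.mean())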

Now, talking about selecting the model based on some measure: for each model we will have some performance metric, depending on the type of machine learning model. For example, if it is a regression model, it will have its own performance metrics, such as root mean squared error and others; if you are working on a classification problem, your performance metrics will be different. So depending on the type of machine learning project you are working on, your performance metrics will change.
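For example, root mean squared error can be computed like this (a minimal sketch with made-up true and predicted values):

    import numpy as np
    from sklearn.metrics import mean_squared_error

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])

    # RMSE = sqrt(mean((y_true - y_pred)^2))
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print("RMSE:", rmse)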
After that we basically test the model on various datasets. This session will not focus on it in depth, but in machine learning projects we use hyperparameter tuning techniques, where we plug and play with various parameters of the model, for example the tree depth: we plug and play with all those things and find out for which combination of parameters the machine learning model performs best. That is all about model building.
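A hedged sketch of this plug-and-play tuning using grid search (the parameter grid values here are our own illustrative choices):

    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=10,
                           random_state=42)

    # Try every combination of these tree parameters, scored by cross-validation
    param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]}
    search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                          cv=5)
    search.fit(X, y)

    print("best parameters:", search.best_params_)
    print("best CV score:", search.best_score_)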
Once we have prepared the data, developed the model, and obtained the expected accuracy on the testing data, we deploy the model.

In many cases, if you see that it is not doing well, then we go back to the beginning: we prepare the data again, build the model, test it, and redeploy after a couple of iterations. And once it keeps performing well in production, that is, after deployment, once users have started using the application, monitoring it becomes a periodic mandate, say once a week or once a month. So these are the basic steps involved in model building.
Once we do that, we find that whether we are building a machine learning model or a deep learning model, we follow a very similar pattern.

There will be a few other techniques depending on the domain you are working in, but on average this is the process the model you develop will follow.
Now let us look at an example case study, where we will use simulated data for a telecom case.

The dataset represents a mobile telecom service, and for understanding purposes we have created the variables from real-life instances, but the data we have used is simulated based on the normal distribution, in line with most of the assumptions in regression. I will be giving you a basic demo in data science: if you want to develop a model, you first have to prepare the data for the model.