Mohamed Imran
-Data Scientist
Ganit Inc.
Data Preprocessing
Real World Data
Any Problem?
S.No Credit_rating Age Income Credit_cards
1 0.00 21 10000 y
2 1.0 2500 n
3 2.0 62 -500 y
4 100.012 42 n
5 yes 200 1y
6 30 0 "Seventy thousand" No
Data Preprocessing
● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation
Data Cleaning
1. Missing Data
● Central Imputation
● KNN Imputation
2. Noisy Data
● Smoothing
● Clustering
3. Outlier Removal
● Using Boxplot
Imputation
● Replace with the mean or the median
● When to use the mean?
● Replace with the nearest neighbour
● How many nearest neighbours to consider?
S.No Qualification Age Income
1 B.Tech 25 30k
2 M.Tech 30 50k
3 B.Tech 26 32k
4 B.Tech 25 ?
5 M.Tech 29 60k
6 B.Tech ? 30k
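The two strategies can be sketched on the sample table. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names `central_impute` and `knn_impute_income`, the choice of `k = 2`, and the use of age as the only distance feature are all made up for the example. Incomes are in thousands.

```python
from statistics import mean, median

# Sample table from the slide; None marks a missing value ("?").
# Income is in thousands (30 means 30k).
rows = [
    {"qualification": "B.Tech", "age": 25, "income": 30},
    {"qualification": "M.Tech", "age": 30, "income": 50},
    {"qualification": "B.Tech", "age": 26, "income": 32},
    {"qualification": "B.Tech", "age": 25, "income": None},
    {"qualification": "M.Tech", "age": 29, "income": 60},
    {"qualification": "B.Tech", "age": None, "income": 30},
]

def central_impute(rows, column, statistic=median):
    """Hypothetical helper: fill value = mean or median of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    return statistic(known)

def knn_impute_income(rows, target, k=2):
    """Hypothetical helper: average income of the k rows closest in age."""
    candidates = [
        r for r in rows
        if r is not target and r["age"] is not None and r["income"] is not None
    ]
    nearest = sorted(candidates, key=lambda r: abs(r["age"] - target["age"]))[:k]
    return mean(r["income"] for r in nearest)

mean_income = central_impute(rows, "income", mean)      # uses all known incomes
median_income = central_impute(rows, "income", median)  # less sensitive to extremes
knn_income = knn_impute_income(rows, rows[3], k=2)      # neighbours aged 25 and 26
```

Here the mean is pulled toward the 60k row while the median is not, which is one reason to prefer the median when a column is skewed. The value of `k` is the "how many neighbours" question and is usually tuned on the data.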
Outlier
● BoxPlot
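A boxplot flags points that fall beyond its whiskers. A minimal sketch of that rule, assuming the common 1.5 × IQR whisker length and a hypothetical list of ages (neither is given on the slide):

```python
from statistics import quantiles

def iqr_outliers(values, whisker=1.5):
    """Return values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]

ages = [21, 25, 26, 29, 30, 42, 62, 200]  # hypothetical ages; 200 is implausible
outliers = iqr_outliers(ages)
```

Different tools compute quartiles slightly differently, so the exact fences may vary from one library to another.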
Data Transformation
● Normalization
1. Min-Max Normalization
2. Z-Score Normalization
3. Decimal Scaling
Min-max normalization
v' = (v - min) / (max - min)
Decimal scaling
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
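The three normalization methods can be sketched in a few lines. This is an illustrative version only: the function names are hypothetical, the z-score uses the population standard deviation (an assumption, since the slide does not say which one), and the inputs are assumed to contain at least two distinct values.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

scaled = min_max([10, 20, 30])
standardized = z_score([10, 20, 30])
shifted = decimal_scaling([-991, 45, 120])  # j = 3 here
```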
Data Integration
● Check for correlation
● Remove uncorrelated data
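One common way to check for correlation is the Pearson coefficient, which ranges from -1 to 1. A minimal sketch, using the four complete age and income pairs from the imputation table as example input (the function name `pearson` is hypothetical):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

ages = [25, 30, 26, 29]
incomes = [30, 50, 32, 60]  # in thousands
r = pearson(ages, incomes)
```

A value near 0 suggests no linear relationship, while a value near 1 or -1 suggests the two columns carry largely the same information.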
Data Reduction
● Data Cube Aggregation
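Data cube aggregation replaces detailed records with summaries along one dimension, for example rolling quarterly figures up to yearly totals. A minimal sketch with hypothetical sales records (the data and the `aggregate` helper are invented for illustration):

```python
from collections import defaultdict

# Hypothetical records: (year, quarter, branch, amount).
sales = [
    (2021, "Q1", "A", 224), (2021, "Q2", "A", 408),
    (2021, "Q3", "A", 350), (2021, "Q4", "A", 586),
    (2022, "Q1", "A", 300), (2022, "Q2", "A", 412),
]

def aggregate(records, key_index, value_index):
    """Sum the value column for each distinct key, dropping the other dimensions."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[value_index]
    return dict(totals)

annual_sales = aggregate(sales, key_index=0, value_index=3)
```

The six quarterly rows shrink to one row per year, which is the reduction.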
Relationship
Y = ?
x Y
2 8
6 20
4 14
3 11
7 23
4 14
2 8
5 17
Relationship
Y = 2 + 3(X)
What is 2 here?
Y = 2 + 3(X)
Find the Y in ?
Y = 2 + 3(X)
x Y
10 ?
1 ?
Value for Y with given X
Y = 2 + 3(X)
x Y
10 32
1 5
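The same calculation as a short sketch: once the relationship is known, any new X can be turned into a Y. The function name `predict` and the constant names are illustrative choices.

```python
CONSTANT = 2    # the "2" in Y = 2 + 3(X)
MULTIPLIER = 3  # the "3" in Y = 2 + 3(X)

def predict(x):
    """Apply Y = 2 + 3(X) to a single input."""
    return CONSTANT + MULTIPLIER * x

# The observed table is reproduced exactly by the formula.
observed = [(2, 8), (6, 20), (4, 14), (3, 11), (7, 23), (4, 14), (2, 8), (5, 17)]
fits_all = all(predict(x) == y for x, y in observed)

# Values for the two new inputs.
new_values = {x: predict(x) for x in (10, 1)}
```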
Terminology
Y = 2 + 3(X)
Y = Model
2 = Intercept
3 = Slope
X = Input
Formula for a line
Linear Regression
Welcome to the world of data science
What is linear?
A straight line
What is Regression?
The relationship between two variables
What is Linear Regression?
A straight line that attempts to predict the relationship between two variables
Help me find the relationship
x y
1 1
2 3
4 3
3 2
5 5
y = B0 + B1 * x
B1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
B0 = mean(y) - B1 * mean(x)
x   mean(x)   x - mean(x)   y   mean(y)   y - mean(y)
1   3         -2            1   2.8       -1.8
2   3         -1            3   2.8       0.2
4   3         1             3   2.8       0.2
3   3         0             2   2.8       -0.8
5   3         2             5   2.8       2.2

x - mean(x)   y - mean(y)   Multiplication
-2            -1.8          3.6
-1            0.2           -0.2
1             0.2           0.2
0             -0.8          0
2             2.2           4.4
Sum                         8

x - mean(x)   squared
-2            4
-1            1
1             1
0             0
2             4
Sum           10

B1 = 8 / 10
B1 = 0.8

B0 = mean(y) - B1 * mean(x)
B0 = 2.8 - 0.8 * 3
B0 = 0.4

y = B0 + B1 * x
y = 0.4 + 0.8 * x
x y predicted y
1 1 1.2
2 3 2
4 3 3.6
3 2 2.8
5 5 4.4
RMSE = sqrt(mean((predicted y - y)^2)) = sqrt(2.4 / 5) ≈ 0.693
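The worked example can be reproduced with a short sketch that applies the B1 and B0 formulas directly and then computes the RMSE. The function names `fit_line` and `rmse` are illustrative, not from any particular library.

```python
from math import sqrt

xs = [1, 2, 4, 3, 5]
ys = [1, 3, 3, 2, 5]

def fit_line(xs, ys):
    """Return (B0, B1) using the closed-form least squares formulas."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    b1 = numerator / denominator
    b0 = mean_y - b1 * mean_x
    return b0, b1

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

b0, b1 = fit_line(xs, ys)                # intercept and slope
predicted = [b0 + b1 * x for x in xs]    # the "predicted y" column
error = rmse(ys, predicted)
```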
Gradient Descent
Finding the optimal relationship, where the error is minimal.
Finding the values of the intercept and the coefficients.
Find the solution?
Any Suggestions?
Line of best fit
Ordinary least squares line
Cost Function
Gradient Descent
Learning Rate
Momentum
Partial Derivative
Finds the direction in which the intercept and the slope should move.
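These pieces fit together as follows. This is a minimal sketch, assuming mean squared error as the cost function and the five-point dataset from the earlier worked example; the function name, the starting values of zero, the learning rate of 0.05, and the iteration count are all illustrative choices rather than the settings behind any figures in these slides.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000, momentum=0.0):
    """Fit y = b0 + b1 * x by repeatedly stepping against the gradient of the MSE."""
    n = len(xs)
    b0 = b1 = 0.0            # starting guess (an assumption)
    step0 = step1 = 0.0      # previous steps, used only when momentum > 0
    history = []             # cost at the start of each iteration
    for _ in range(iterations):
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)   # cost function (MSE)
        # Partial derivatives of the cost with respect to b0 and b1.
        grad_b0 = 2 / n * sum(errors)
        grad_b1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        # The learning rate sets the step size; momentum reuses part of the last step.
        step0 = momentum * step0 - learning_rate * grad_b0
        step1 = momentum * step1 - learning_rate * grad_b1
        b0 += step0
        b1 += step1
    return b0, b1, history

xs = [1, 2, 4, 3, 5]
ys = [1, 3, 3, 2, 5]
b0, b1, history = gradient_descent(xs, ys)
```

On this data the result approaches the closed-form answer (intercept 0.4, slope 0.8). A learning rate that is too large makes the cost grow instead of shrink, so it is worth watching `history` while tuning.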
Error Metrics for Regression
Iteration Error
1 9.556915033600001
2 9.514033718864932
3 9.471355093177891
4 9.42887819847207
5 9.302648387373978
10 9.302648387373978
20 9.260968926175824
30 8.775918820666949
40 8.392252947074406
50 8.02634104901006
60 7.677361561773854
100 6.160260505649477
200 4.018554474422596
300 2.685046327855845
400 1.854748522005687
800 0.6906129091698867
1000 0.5644839798882763
1600 0.4891352315933852
[Figures: Step 1 to Step 5]
Advantages of Linear Regression
● Gives near-optimal results when the relationship between the independent variables and the dependent variable is approximately linear.
● A good starting point for understanding data analysis
● Easy to explain
Disadvantages
● Linear regression is often inappropriately used to model non-linear relationships.
● Linear regression is limited to predicting numeric output.
● A lack of explanation about what has been learned can be a problem.
● Prone to the bias-variance problem
How to evaluate our model?
Overfitting vs Underfitting
Overfitting: Training Data (Less Error), Testing Data (More Error)
Underfitting: Training Data (More Error), Testing Data (Still More Error)
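One way to see which situation applies is to hold back part of the data, fit on the rest, and compare the two errors. A minimal sketch on synthetic data; the data generator, the 75/25 split, and the helper names are assumptions made for illustration.

```python
import random
from math import sqrt

def fit_line(xs, ys):
    """Closed-form least squares fit; returns (intercept, slope)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root mean squared error of the line on the given points."""
    return sqrt(sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Synthetic data: y = 2 + 3x plus Gaussian noise (hypothetical).
random.seed(0)
xs = [i / 2 for i in range(40)]
ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]

# Shuffle the indices, then keep 75% for training and 25% for testing.
indices = list(range(len(xs)))
random.shuffle(indices)
split = int(0.75 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]

train_x, train_y = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
test_x, test_y = [xs[i] for i in test_idx], [ys[i] for i in test_idx]

intercept, slope = fit_line(train_x, train_y)       # fit on training data only
train_rmse = rmse(train_x, train_y, intercept, slope)
test_rmse = rmse(test_x, test_y, intercept, slope)  # error on unseen data
```

A training error far below the testing error points toward overfitting, while high error on both points toward underfitting.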
Bias-Variance Trade-off
An ideal model should have low variance and low bias.