
learning-concepts-hackers-realm

August 13, 2024

1 Data Normalization
[1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import numpy as np
warnings.filterwarnings('ignore')
%matplotlib inline

[2]: df = pd.read_csv('data/winequality.csv')
df.head()

[2]: type fixed acidity volatile acidity citric acid residual sugar \
0 white 7.0 0.27 0.36 20.7
1 white 6.3 0.30 0.34 1.6
2 white 8.1 0.28 0.40 6.9
3 white 7.2 0.23 0.32 8.5
4 white 7.2 0.23 0.32 8.5

chlorides free sulfur dioxide total sulfur dioxide density pH \


0 0.045 45.0 170.0 1.0010 3.00
1 0.049 14.0 132.0 0.9940 3.30
2 0.050 30.0 97.0 0.9951 3.26
3 0.058 47.0 186.0 0.9956 3.19
4 0.058 47.0 186.0 0.9956 3.19

sulphates alcohol quality


0 0.45 8.8 6
1 0.49 9.5 6
2 0.44 10.1 6
3 0.40 9.9 6
4 0.40 9.9 6

[3]: df.describe()

[3]: fixed acidity volatile acidity citric acid residual sugar \
count 6487.000000 6489.000000 6494.000000 6495.000000
mean 7.216579 0.339691 0.318722 5.444326
std 1.296750 0.164649 0.145265 4.758125
min 3.800000 0.080000 0.000000 0.600000
25% 6.400000 0.230000 0.250000 1.800000
50% 7.000000 0.290000 0.310000 3.000000
75% 7.700000 0.400000 0.390000 8.100000
max 15.900000 1.580000 1.660000 65.800000

chlorides free sulfur dioxide total sulfur dioxide density \


count 6495.000000 6497.000000 6497.000000 6497.000000
mean 0.056042 30.525319 115.744574 0.994697
std 0.035036 17.749400 56.521855 0.002999
min 0.009000 1.000000 6.000000 0.987110
25% 0.038000 17.000000 77.000000 0.992340
50% 0.047000 29.000000 118.000000 0.994890
75% 0.065000 41.000000 156.000000 0.996990
max 0.611000 289.000000 440.000000 1.038980

pH sulphates alcohol quality


count 6488.000000 6493.000000 6497.000000 6497.000000
mean 3.218395 0.531215 10.491801 5.818378
std 0.160748 0.148814 1.192712 0.873255
min 2.720000 0.220000 8.000000 3.000000
25% 3.110000 0.430000 9.500000 5.000000
50% 3.210000 0.510000 10.300000 6.000000
75% 3.320000 0.600000 11.300000 6.000000
max 4.010000 2.000000 14.900000 9.000000

[19]: sns.distplot(df['free sulfur dioxide'])

[19]: <AxesSubplot:xlabel='free sulfur dioxide', ylabel='Density'>

[20]: sns.distplot(df['alcohol'])

[20]: <AxesSubplot:xlabel='alcohol', ylabel='Density'>

1.1 Max absolute scaling
[10]: ## value / max_value

[11]: df_temp = df.copy()

[12]: df_temp['free sulfur dioxide'] = df_temp['free sulfur dioxide'] / df_temp['free sulfur dioxide'].abs().max()

[13]: sns.distplot(df_temp['free sulfur dioxide'])

C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:2619:
FutureWarning: `distplot` is a deprecated function and will be removed in a
future version. Please adapt your code to use either `displot` (a figure-level
function with similar flexibility) or `histplot` (an axes-level function for
histograms).
warnings.warn(msg, FutureWarning)

[13]: <AxesSubplot:xlabel='free sulfur dioxide', ylabel='Density'>

[21]: df_temp['alcohol'] = df_temp['alcohol'] / df_temp['alcohol'].abs().max()

[22]: sns.distplot(df_temp['alcohol'])

[22]: <AxesSubplot:xlabel='alcohol', ylabel='Density'>

[ ]: # original_value = scaled_value * max
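
The same scaling is available as a scikit-learn transformer; a minimal sketch using MaxAbsScaler (an equivalent not used in the notebook above):

from sklearn.preprocessing import MaxAbsScaler

mas = MaxAbsScaler()
# scales each column by its maximum absolute value, into [-1, 1]
scaled = mas.fit_transform(df[['alcohol']])
# recover the original values
original = mas.inverse_transform(scaled)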

1.2 Min-Max Scaling


[ ]: # (value - min) / (max - min)

[23]: df_temp = df.copy()

[24]: df_temp['alcohol'] = (df_temp['alcohol'] - df_temp['alcohol'].min()) / (df_temp['alcohol'].max() - df_temp['alcohol'].min())

[25]: sns.distplot(df_temp['alcohol'])

[25]: <AxesSubplot:xlabel='alcohol', ylabel='Density'>

[ ]: # original_value = scaled_value * (max-min) + min
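
scikit-learn's MinMaxScaler wraps the same formula and remembers the min and max for the inverse; a minimal equivalent sketch:

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
# maps each column to [0, 1] via (value - min) / (max - min)
scaled = mms.fit_transform(df[['alcohol']])
original = mms.inverse_transform(scaled)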

1.2.1 Log Transformation

[27]: sns.distplot(df['total sulfur dioxide'])

[27]: <AxesSubplot:xlabel='total sulfur dioxide', ylabel='Density'>

[28]: df_temp = df.copy()

[30]: df_temp['total sulfur dioxide'] = np.log(df_temp['total sulfur dioxide']+1)

[32]: sns.distplot(df_temp['total sulfur dioxide'])

[32]: <AxesSubplot:xlabel='total sulfur dioxide', ylabel='Density'>

[ ]:

[ ]:
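
np.log1p is the numerically safer spelling of np.log(x + 1), and np.expm1 inverts it; a minimal sketch:

df_temp['total sulfur dioxide'] = np.log1p(df['total sulfur dioxide'])
# original_value = exp(scaled_value) - 1
original = np.expm1(df_temp['total sulfur dioxide'])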

1.3 Standardization of Data


[ ]: ## z-score method
# scaled_value = (value - mean) / std

[ ]: # original_value = scaled_value * std + mean

[40]: sns.distplot(df['fixed acidity'])

[40]: <AxesSubplot:xlabel='fixed acidity', ylabel='Density'>

[39]: sns.distplot(df['pH'])

[39]: <AxesSubplot:xlabel='pH', ylabel='Density'>

[41]: scaled_data = df.copy()

[42]: ## apply the formula
for col in ['fixed acidity', 'pH']:
    scaled_data[col] = (scaled_data[col] - scaled_data[col].mean()) / scaled_data[col].std()

[43]: sns.distplot(scaled_data['fixed acidity'])

[43]: <AxesSubplot:xlabel='fixed acidity', ylabel='Density'>

[44]: sns.distplot(scaled_data['pH'])

[44]: <AxesSubplot:xlabel='pH', ylabel='Density'>

[45]: from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

[47]: sc.fit(df[['pH']])

[47]: StandardScaler()

[48]: sc_data = sc.transform(df[['pH']])

[54]: sc_data = sc_data.reshape(-1)

[56]: sns.distplot(df['pH'])

[56]: <AxesSubplot:xlabel='pH', ylabel='Density'>

[55]: sns.distplot(sc_data)

[55]: <AxesSubplot:ylabel='Density'>

[ ]:

[ ]:
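
The fitted scaler keeps the statistics it learned, so the transform is reversible; a quick sketch with the sc object above:

# mean and std learned from df[['pH']]
print(sc.mean_, sc.scale_)
# original_value = scaled_value * std + mean
original = sc.inverse_transform(sc_data.reshape(-1, 1))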

2 Detect and Remove Outliers


[4]: df.describe()

[4]: fixed acidity volatile acidity citric acid residual sugar \


count 6487.000000 6489.000000 6494.000000 6495.000000
mean 7.216579 0.339691 0.318722 5.444326
std 1.296750 0.164649 0.145265 4.758125
min 3.800000 0.080000 0.000000 0.600000
25% 6.400000 0.230000 0.250000 1.800000
50% 7.000000 0.290000 0.310000 3.000000
75% 7.700000 0.400000 0.390000 8.100000
max 15.900000 1.580000 1.660000 65.800000

chlorides free sulfur dioxide total sulfur dioxide density \


count 6495.000000 6497.000000 6497.000000 6497.000000
mean 0.056042 30.525319 115.744574 0.994697
std 0.035036 17.749400 56.521855 0.002999
min 0.009000 1.000000 6.000000 0.987110
25% 0.038000 17.000000 77.000000 0.992340
50% 0.047000 29.000000 118.000000 0.994890
75% 0.065000 41.000000 156.000000 0.996990
max 0.611000 289.000000 440.000000 1.038980

pH sulphates alcohol quality


count 6488.000000 6493.000000 6497.000000 6497.000000
mean 3.218395 0.531215 10.491801 5.818378
std 0.160748 0.148814 1.192712 0.873255
min 2.720000 0.220000 8.000000 3.000000
25% 3.110000 0.430000 9.500000 5.000000
50% 3.210000 0.510000 10.300000 6.000000
75% 3.320000 0.600000 11.300000 6.000000
max 4.010000 2.000000 14.900000 9.000000

[5]: sns.distplot(df['residual sugar'])

[5]: <AxesSubplot:xlabel='residual sugar', ylabel='Density'>

[6]: # to see outliers clearly
sns.boxplot(df['residual sugar'])

[6]: <AxesSubplot:xlabel='residual sugar'>

2.1 Z-score method
[41]: # find the limits
upper_limit = df['residual sugar'].mean() + 3*df['residual sugar'].std()
lower_limit = df['residual sugar'].mean() - 3*df['residual sugar'].std()
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)

upper limit: 19.71870063294501
lower limit: -8.830047823091236

[42]: # find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]

[42]: type fixed acidity volatile acidity citric acid residual sugar \
0 white 7.0 0.270 0.36 20.70
7 white 7.0 0.270 0.36 20.70
182 white 6.8 0.280 0.40 22.00
191 white 6.8 0.280 0.40 22.00
292 white 7.4 0.280 0.42 19.80
444 white 6.9 0.240 0.36 20.80
1454 white 8.3 0.210 0.49 19.80
1608 white 6.9 0.270 0.49 23.50
1653 white 7.9 0.330 0.28 31.60
1663 white 7.9 0.330 0.28 31.60
2489 white 6.1 0.280 0.24 19.95
2492 white 6.1 0.280 0.24 19.95
2620 white 6.5 0.280 0.28 20.40
2781 white 7.8 0.965 0.60 65.80
2785 white 6.4 0.240 0.25 20.20
2787 white 6.4 0.240 0.25 20.20
3014 white 7.0 0.450 0.34 19.80
3023 white 7.0 0.450 0.34 19.80
3420 white 7.6 0.280 0.49 20.15
3497 white 7.7 0.430 1.00 19.95
3547 white 7.3 0.200 0.29 19.90
3619 white 6.8 0.450 0.28 26.05
3623 white 6.8 0.450 0.28 26.05
3730 white 6.2 0.220 0.20 20.80
4107 white 6.8 0.300 0.26 20.30
4480 white 5.9 0.220 0.45 22.60

chlorides free sulfur dioxide total sulfur dioxide density pH \


0 0.045 45.0 170.0 1.00100 3.00

7 0.045 45.0 170.0 1.00100 3.00
182 0.048 48.0 167.0 1.00100 2.93
191 0.048 48.0 167.0 1.00100 2.93
292 0.066 53.0 195.0 1.00000 2.96
444 0.031 40.0 139.0 0.99750 3.20
1454 0.054 50.0 231.0 1.00120 2.99
1608 0.057 59.0 235.0 1.00240 2.98
1653 0.053 35.0 176.0 1.01030 3.15
1663 0.053 35.0 176.0 1.01030 3.15
2489 0.074 32.0 174.0 0.99922 3.19
2492 0.074 32.0 174.0 0.99922 3.19
2620 0.041 40.0 144.0 1.00020 3.14
2781 0.074 8.0 160.0 1.03898 3.39
2785 0.083 35.0 157.0 0.99976 3.17
2787 0.083 35.0 157.0 0.99976 3.17
3014 0.040 12.0 67.0 0.99760 3.07
3023 0.040 12.0 67.0 0.99760 3.07
3420 0.060 30.0 145.0 1.00196 3.01
3497 0.032 42.0 164.0 0.99742 3.29
3547 0.039 69.0 237.0 1.00037 3.10
3619 0.031 27.0 122.0 1.00295 3.06
3623 0.031 27.0 122.0 1.00295 3.06
3730 0.035 58.0 184.0 1.00022 3.11
4107 0.037 45.0 150.0 0.99727 3.04
4480 0.120 55.0 122.0 0.99636 3.10

sulphates alcohol quality


0 0.45 8.8 6
7 0.45 8.8 6
182 0.50 8.7 5
191 0.50 8.7 5
292 0.44 9.1 5
444 0.33 11.0 6
1454 0.54 9.2 5
1608 0.47 8.6 5
1653 0.38 8.8 6
1663 0.38 8.8 6
2489 0.44 9.3 6
2492 0.44 9.3 6
2620 0.38 8.7 5
2781 0.69 11.7 6
2785 0.50 9.1 5
2787 0.50 9.1 5
3014 0.38 11.0 6
3023 0.38 11.0 6
3420 0.44 8.5 5
3497 0.50 12.0 6

3547 0.48 9.2 6
3619 0.42 10.6 6
3623 0.42 10.6 6
3730 0.53 9.0 6
4107 0.38 12.3 6
4480 0.35 12.8 5

[43]: # trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:', len(new_df))
print('outliers:', len(df)-len(new_df))

before removing outliers: 6497
after removing outliers: 6469
outliers: 28

[44]: sns.boxplot(new_df['residual sugar'])

[44]: <AxesSubplot:xlabel='residual sugar'>

[45]: # capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>=upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<=lower_limit), 'residual sugar'] = lower_limit

[46]: sns.boxplot(new_df['residual sugar'])

[46]: <AxesSubplot:xlabel='residual sugar'>

[47]: len(new_df)

[47]: 6497
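
pandas can do the same capping in one call with Series.clip; a minimal equivalent sketch:

new_df = df.copy()
# values outside the limits are set to the nearest limit
new_df['residual sugar'] = new_df['residual sugar'].clip(lower=lower_limit, upper=upper_limit)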

2.2 IQR method


[48]: q1 = df['residual sugar'].quantile(0.25)
q3 = df['residual sugar'].quantile(0.75)
iqr = q3-q1

[49]: q1, q3, iqr

[49]: (1.8, 8.1, 6.3)

[50]: upper_limit = q3 + (1.5 * iqr)
lower_limit = q1 - (1.5 * iqr)
lower_limit, upper_limit

[50]: (-7.6499999999999995, 17.549999999999997)

[51]: sns.boxplot(df['residual sugar'])

[51]: <AxesSubplot:xlabel='residual sugar'>

[52]: # find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]

[52]: type fixed acidity volatile acidity citric acid residual sugar \
0 white 7.0 0.270 0.36 20.70
7 white 7.0 0.270 0.36 20.70
14 white 8.3 0.420 0.62 19.25
38 white 7.3 0.240 0.39 17.95
39 white 7.3 0.240 0.39 17.95
… … … … … …
4691 white 6.9 0.190 0.31 19.25
4694 white 6.9 0.190 0.31 19.25
4748 white 6.1 0.340 0.24 18.35
4749 white 6.2 0.350 0.25 18.40

4778 white 5.8 0.315 0.19 19.40

chlorides free sulfur dioxide total sulfur dioxide density pH \


0 0.045 45.0 170.0 1.00100 3.00
7 0.045 45.0 170.0 1.00100 3.00
14 0.040 41.0 172.0 1.00020 2.98
38 0.057 45.0 149.0 0.99990 3.21
39 0.057 45.0 149.0 0.99990 3.21
… … … … … …
4691 0.043 38.0 167.0 0.99954 2.93
4694 0.043 38.0 167.0 0.99954 2.93
4748 0.050 33.0 184.0 0.99943 3.12
4749 0.051 28.0 182.0 0.99946 3.13
4778 0.031 28.0 106.0 0.99704 2.97

sulphates alcohol quality


0 0.45 8.80 6
7 0.45 8.80 6
14 0.67 9.70 5
38 0.36 8.60 5
39 0.36 8.60 5
… … … …
4691 0.52 9.10 7
4694 0.52 9.10 7
4748 0.61 9.30 5
4749 0.62 9.30 6
4778 0.40 10.55 6

[118 rows x 13 columns]

[53]: # trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:', len(new_df))
print('outliers:', len(df)-len(new_df))

before removing outliers: 6497
after removing outliers: 6377
outliers: 120

[54]: sns.boxplot(new_df['residual sugar'])

[54]: <AxesSubplot:xlabel='residual sugar'>

[55]: # capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<lower_limit), 'residual sugar'] = lower_limit

[56]: sns.boxplot(new_df['residual sugar'])

[56]: <AxesSubplot:xlabel='residual sugar'>
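
The IQR fences generalize to any numeric column; a small helper sketch (the function name is hypothetical):

def iqr_limits(series, k=1.5):
    # returns the (lower, upper) fences for a numeric Series
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower_limit, upper_limit = iqr_limits(df['residual sugar'])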

2.3 Percentile method
[57]: upper_limit = df['residual sugar'].quantile(0.99)
lower_limit = df['residual sugar'].quantile(0.01)
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)

upper limit: 18.2
lower limit: 0.9

[58]: sns.boxplot(df['residual sugar'])

[58]: <AxesSubplot:xlabel='residual sugar'>

[59]: # find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]

[59]: type fixed acidity volatile acidity citric acid residual sugar \
0 white 7.0 0.270 0.36 20.70
7 white 7.0 0.270 0.36 20.70
14 white 8.3 0.420 0.62 19.25
103 white 7.5 0.305 0.40 18.90
111 white 7.2 0.270 0.46 18.75
… … … … … …
4749 white 6.2 0.350 0.25 18.40
4778 white 5.8 0.315 0.19 19.40
4779 white 6.0 0.590 0.00 0.80
4877 white 5.9 0.540 0.00 0.80
4897 white 6.0 0.210 0.38 0.80

chlorides free sulfur dioxide total sulfur dioxide density pH \


0 0.045 45.0 170.0 1.00100 3.00
7 0.045 45.0 170.0 1.00100 3.00
14 0.040 41.0 172.0 1.00020 2.98
103 0.059 44.0 170.0 1.00000 2.99
111 0.052 45.0 255.0 1.00000 3.04
… … … … … …
4749 0.051 28.0 182.0 0.99946 3.13

4778 0.031 28.0 106.0 0.99704 2.97
4779 0.037 30.0 95.0 0.99032 3.10
4877 0.032 12.0 82.0 0.99286 3.25
4897 0.020 22.0 98.0 0.98941 3.26

sulphates alcohol quality


0 0.45 8.80 6
7 0.45 8.80 6
14 0.67 9.70 5
103 0.46 9.00 5
111 0.52 8.90 5
… … … …
4749 0.62 9.30 6
4778 0.40 10.55 6
4779 0.40 10.90 4
4877 0.36 8.80 5
4897 0.32 11.80 6

[97 rows x 13 columns]

[60]: # trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:', len(new_df))
print('outliers:', len(df)-len(new_df))

before removing outliers: 6497
after removing outliers: 6398
outliers: 99

[61]: sns.boxplot(new_df['residual sugar'])

[61]: <AxesSubplot:xlabel='residual sugar'>

[62]: # capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<lower_limit), 'residual sugar'] = lower_limit

[63]: sns.boxplot(new_df['residual sugar'])

[63]: <AxesSubplot:xlabel='residual sugar'>

[65]: sns.distplot(df['residual sugar'])

[65]: <AxesSubplot:xlabel='residual sugar', ylabel='Density'>

[64]: sns.distplot(new_df['residual sugar'])

[64]: <AxesSubplot:xlabel='residual sugar', ylabel='Density'>

[ ]:

[6]: # data preparation


df = pd.DataFrame()
df['season'] = ['summer', 'autumn', 'spring', 'winter', 'autumn', 'winter']

2.4 Label Encoding


[7]: df.head()

[7]: season
0 summer
1 autumn
2 spring
3 winter
4 autumn

[8]: from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()

df['season_label'] = le.fit_transform(df['season'])
df.head()

[8]: season season_label


0 summer 2
1 autumn 0
2 spring 1
3 winter 3
4 autumn 0

[5]: # map the labels


mapping = {'spring':0, 'summer':1, 'autumn':2, 'winter':3}
df['season_custom_label'] = df['season'].map(mapping)
df.head()

[5]: season season_label season_custom_label


0 summer 2 1
1 autumn 0 2
2 spring 1 0
3 winter 3 3
4 autumn 0 2
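
LabelEncoder assigns codes in sorted order of the classes and stores them in classes_, so the encoding is reversible; a quick sketch with the le object above:

print(le.classes_)  # ['autumn' 'spring' 'summer' 'winter'] -> codes 0..3
# map the integer labels back to the original strings
print(le.inverse_transform(df['season_label']))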

[ ]:

2.5 One-Hot Encoding


[9]: df.head()

[9]: season season_label


0 summer 2
1 autumn 0
2 spring 1
3 winter 3
4 autumn 0

[10]: from sklearn.preprocessing import OneHotEncoder


ohe = OneHotEncoder()

[14]: ohe.fit_transform(df[['season']]).toarray()

[14]: array([[0., 0., 1., 0.],


[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 0., 0., 1.]])

[15]: ohe_values = ohe.fit_transform(df[['season']]).toarray()
ohe_df = pd.DataFrame(ohe_values)
enc_df = pd.concat([df, ohe_df], axis=1)
enc_df.head()

[15]: season season_label 0 1 2 3


0 summer 2 0.0 0.0 1.0 0.0
1 autumn 0 1.0 0.0 0.0 0.0
2 spring 1 0.0 1.0 0.0 0.0
3 winter 3 0.0 0.0 0.0 1.0
4 autumn 0 1.0 0.0 0.0 0.0

[17]: ## second ohe method using pandas
enc_df = pd.get_dummies(df, prefix=['season'], columns=['season'], drop_first=True)
enc_df.head()

[17]: season_label season_spring season_summer season_winter


0 2 0 1 0
1 0 0 0 0
2 1 1 0 0
3 3 0 0 1
4 0 0 0 0
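
The numeric 0-3 column names can be replaced with the category names; a minimal sketch (get_feature_names_out needs scikit-learn >= 1.0):

ohe_df = pd.DataFrame(ohe_values, columns=ohe.get_feature_names_out(['season']))
enc_df = pd.concat([df, ohe_df], axis=1)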

[ ]:

2.6 Mean/Target Encoding


[18]: df = pd.read_csv('data/Loan Prediction Dataset.csv')
df.head()

[18]: Loan_ID Gender Married Dependents Education Self_Employed \


0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No
2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No
4 LP001008 Male No 0 Graduate No

ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \


0 5849 0.0 NaN 360.0
1 4583 1508.0 128.0 360.0
2 3000 0.0 66.0 360.0
3 2583 2358.0 120.0 360.0
4 6000 0.0 141.0 360.0

Credit_History Property_Area Loan_Status


0 1.0 Urban Y

1 1.0 Rural N
2 1.0 Urban Y
3 1.0 Urban Y
4 1.0 Urban Y

[ ]: #!pip install category_encoders

[20]: df['Loan_Status'] = df['Loan_Status'].map({'Y':1, 'N':0})


df.head()

[20]: Loan_ID Gender Married Dependents Education Self_Employed \


0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No
2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No
4 LP001008 Male No 0 Graduate No

ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \


0 5849 0.0 NaN 360.0
1 4583 1508.0 128.0 360.0
2 3000 0.0 66.0 360.0
3 2583 2358.0 120.0 360.0
4 6000 0.0 141.0 360.0

Credit_History Property_Area Loan_Status


0 1.0 Urban 1
1 1.0 Rural 0
2 1.0 Urban 1
3 1.0 Urban 1
4 1.0 Urban 1

[21]: from category_encoders import TargetEncoder

cols = ['Gender', 'Dependents']
target = 'Loan_Status'
for col in cols:
    te = TargetEncoder()
    # fit the data
    te.fit(X=df[col], y=df[target])
    # transform
    values = te.transform(df[col])
    df = pd.concat([df, values], axis=1)

df.head()

[21]: Loan_ID Gender Married Dependents Education Self_Employed \


0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No

2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No
4 LP001008 Male No 0 Graduate No

ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \


0 5849 0.0 NaN 360.0
1 4583 1508.0 128.0 360.0
2 3000 0.0 66.0 360.0
3 2583 2358.0 120.0 360.0
4 6000 0.0 141.0 360.0

Credit_History Property_Area Loan_Status Gender Dependents


0 1.0 Urban 1 0.693252 0.689855
1 1.0 Rural 0 0.693252 0.647059
2 1.0 Urban 1 0.693252 0.689855
3 1.0 Urban 1 0.693252 0.689855
4 1.0 Urban 1 0.693252 0.689855
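
Target encoding is essentially the per-category mean of the target, smoothed toward the global mean; the raw means can be checked with a plain groupby (re-reading the CSV here to get the unencoded Gender column):

raw = pd.read_csv('data/Loan Prediction Dataset.csv')
raw['Loan_Status'] = raw['Loan_Status'].map({'Y':1, 'N':0})
# unsmoothed per-category target means; TargetEncoder shrinks these toward
# the global mean, so its outputs differ slightly
print(raw.groupby('Gender')['Loan_Status'].mean())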

[25]: df.sample(frac=1).head(10)

[25]: Loan_ID Gender Married Dependents Education Self_Employed \


382 LP002231 Female No 0 Graduate No
477 LP002530 NaN Yes 2 Graduate No
71 LP001245 Male Yes 2 Not Graduate Yes
474 LP002524 Male No 2 Graduate No
266 LP001877 Male Yes 2 Graduate No
541 LP002743 Female No 0 Graduate No
354 LP002143 Female Yes 0 Graduate No
116 LP001404 Female Yes 0 Graduate No
16 LP001034 Male No 1 Not Graduate No
598 LP002945 Male Yes 0 Graduate Yes

ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \


382 6000 0.0 156.0 360.0
477 2873 1872.0 132.0 360.0
71 1875 1875.0 97.0 360.0
474 5532 4648.0 162.0 360.0
266 4708 1387.0 150.0 360.0
541 2138 0.0 99.0 360.0
354 2423 505.0 130.0 360.0
116 3167 2283.0 154.0 360.0
16 3596 0.0 100.0 240.0
598 9963 0.0 180.0 360.0

Credit_History Property_Area Loan_Status Gender Dependents


382 1.0 Urban 1 0.669643 0.689855
477 0.0 Semiurban 0 0.615385 0.752475

71 1.0 Semiurban 1 0.693252 0.752475
474 1.0 Rural 1 0.693252 0.752475
266 1.0 Semiurban 1 0.693252 0.752475
541 0.0 Semiurban 0 0.669643 0.689855
354 1.0 Semiurban 1 0.669643 0.689855
116 1.0 Semiurban 1 0.669643 0.689855
16 NaN Urban 1 0.693252 0.647059
598 1.0 Rural 1 0.693252 0.689855

[29]: # the target-encoded concat above duplicated the 'Dependents' column name,
# so select by name (which returns both copies) and trim back to the first three columns
df = df[['Education', 'Self_Employed', 'Dependents']]
df = df.iloc[:,:3]
df.head()

[29]: Education Self_Employed Dependents


0 Graduate No 0
1 Graduate No 1
2 Graduate Yes 0
3 Not Graduate No 0
4 Graduate No 0

2.7 Frequency Encoding


[30]: df.head()

[30]: Education Self_Employed Dependents


0 Graduate No 0
1 Graduate No 1
2 Graduate Yes 0
3 Not Graduate No 0
4 Graduate No 0

[31]: df.groupby('Education').size()

[31]: Education
Graduate 480
Not Graduate 134
dtype: int64

[32]: col = 'Education'


# group by frequency
freq = df.groupby(col).size()/len(df)
# map the values
df.loc[:, "{}_freq".format(col)] = df[col].map(freq)
df.head()

[32]: Education Self_Employed Dependents Education_freq
0 Graduate No 0 0.781759
1 Graduate No 1 0.781759
2 Graduate Yes 0 0.781759
3 Not Graduate No 0 0.218241
4 Graduate No 0 0.781759
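
value_counts(normalize=True) produces the same relative frequencies in one call; a minimal equivalent sketch:

df['Education_freq'] = df['Education'].map(df['Education'].value_counts(normalize=True))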

2.8 Binary Encoding


[ ]: # 0 0 - 0
# 0 1 - 1
# 1 0 - 2
# 1 1 - 3

[38]: # fill null values


df['Self_Employed'] = df['Self_Employed'].fillna('No')

[39]: from category_encoders import BinaryEncoder


be = BinaryEncoder()
be_enc = be.fit_transform(df['Self_Employed'])

[40]: enc_df = pd.concat([df, be_enc], axis=1)


enc_df.sample(frac=1).head(10)

[40]: Education Self_Employed Dependents Education_freq Self_Employed_0 \


291 Graduate No 2 0.781759 0
594 Graduate Yes 0 0.781759 1
179 Not Graduate No 0 0.218241 0
401 Not Graduate No 0 0.218241 0
443 Graduate No 1 0.781759 0
26 Graduate No 0 0.781759 0
219 Graduate No 2 0.781759 0
483 Graduate No 0 0.781759 0
340 Not Graduate No 3+ 0.218241 0
273 Graduate No 0 0.781759 0

Self_Employed_1
291 1
594 0
179 1
401 1
443 1
26 1
219 1
483 1
340 1
273 1

[ ]:

3 Extract Features from Datetime Attributes


[2]: df = pd.read_csv('data/Traffic data.csv', nrows=100)
df.head()

[2]: ID Datetime Count


0 0 25-08-2012 00:00 8
1 1 25-08-2012 01:00 2
2 2 25-08-2012 02:00 6
3 3 25-08-2012 03:00 2
4 4 25-08-2012 04:00 2

[4]: # drop columns


df = df.drop(columns=['ID','Count'])

[5]: # the timestamps are day-first, so state the format explicitly
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%d-%m-%Y %H:%M')
df.head()

[5]: Datetime
0 2012-08-25 00:00:00
1 2012-08-25 01:00:00
2 2012-08-25 02:00:00
3 2012-08-25 03:00:00
4 2012-08-25 04:00:00

3.0.1 Extract date features


[12]: df['year'] = df['Datetime'].dt.year
df['month'] = df['Datetime'].dt.month
df['day'] = df['Datetime'].dt.day
df['quarter'] = df['Datetime'].dt.quarter
df['day_of_week'] = df['Datetime'].dt.dayofweek
df['week'] = df['Datetime'].dt.week
df['is_weekend'] = np.where(df['day_of_week'].isin([5,6]), 1, 0)
df.sample(frac=1).head(10)

[12]: Datetime year month day quarter day_of_week week \


4 2012-08-25 04:00:00 2012 8 25 3 5 34
37 2012-08-26 13:00:00 2012 8 26 3 6 34
49 2012-08-27 01:00:00 2012 8 27 3 0 35
28 2012-08-26 04:00:00 2012 8 26 3 6 34
88 2012-08-28 16:00:00 2012 8 28 3 1 35
32 2012-08-26 08:00:00 2012 8 26 3 6 34
58 2012-08-27 10:00:00 2012 8 27 3 0 35

63 2012-08-27 15:00:00 2012 8 27 3 0 35
59 2012-08-27 11:00:00 2012 8 27 3 0 35
54 2012-08-27 06:00:00 2012 8 27 3 0 35

is_weekend
4 1
37 1
49 0
28 1
88 0
32 1
58 0
63 0
59 0
54 0

3.0.2 Extract time features


[13]: df['hour'] = df['Datetime'].dt.hour
df['minute'] = df['Datetime'].dt.minute
df['second'] = df['Datetime'].dt.second
df.sample(frac=1).head()

[13]: Datetime year month day quarter day_of_week week \


11 2012-08-25 11:00:00 2012 8 25 3 5 34
64 2012-08-27 16:00:00 2012 8 27 3 0 35
50 2012-08-27 02:00:00 2012 8 27 3 0 35
3 2012-08-25 03:00:00 2012 8 25 3 5 34
46 2012-08-26 22:00:00 2012 8 26 3 6 34

is_weekend hour minute second


11 1 11 0 0
64 0 16 0 0
50 0 2 0 0
3 1 3 0 0
46 1 22 0 0

[14]: df['date'] = df['Datetime'].dt.date


df['time'] = df['Datetime'].dt.time
df.head()

[14]: Datetime year month day quarter day_of_week week \


0 2012-08-25 00:00:00 2012 8 25 3 5 34
1 2012-08-25 01:00:00 2012 8 25 3 5 34
2 2012-08-25 02:00:00 2012 8 25 3 5 34
3 2012-08-25 03:00:00 2012 8 25 3 5 34
4 2012-08-25 04:00:00 2012 8 25 3 5 34

is_weekend hour minute second date time
0 1 0 0 0 2012-08-25 00:00:00
1 1 1 0 0 2012-08-25 01:00:00
2 1 2 0 0 2012-08-25 02:00:00
3 1 3 0 0 2012-08-25 03:00:00
4 1 4 0 0 2012-08-25 04:00:00

[15]: import datetime


# find difference from current day
df['difference'] = datetime.datetime.today() - df['Datetime']
df.head()

[15]: Datetime year month day quarter day_of_week week \


0 2012-08-25 00:00:00 2012 8 25 3 5 34
1 2012-08-25 01:00:00 2012 8 25 3 5 34
2 2012-08-25 02:00:00 2012 8 25 3 5 34
3 2012-08-25 03:00:00 2012 8 25 3 5 34
4 2012-08-25 04:00:00 2012 8 25 3 5 34

is_weekend hour minute second date time \


0 1 0 0 0 2012-08-25 00:00:00
1 1 1 0 0 2012-08-25 01:00:00
2 1 2 0 0 2012-08-25 02:00:00
3 1 3 0 0 2012-08-25 03:00:00
4 1 4 0 0 2012-08-25 04:00:00

difference
0 3522 days 13:24:49.747950
1 3522 days 12:24:49.747950
2 3522 days 11:24:49.747950
3 3522 days 10:24:49.747950
4 3522 days 09:24:49.747950
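
The difference column is a timedelta, so its components can be pulled out as numeric features; a minimal sketch:

df['days_ago'] = df['difference'].dt.days
df['seconds_ago'] = df['difference'].dt.total_seconds()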

[ ]:

4 Fill Missing Values in Dataset


[24]: df = pd.read_csv('data/Loan Prediction Dataset.csv')
df.head()

[24]: Loan_ID Gender Married Dependents Education Self_Employed \


0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No
2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No

4 LP001008 Male No 0 Graduate No

ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \


0 5849 0.0 NaN 360.0
1 4583 1508.0 128.0 360.0
2 3000 0.0 66.0 360.0
3 2583 2358.0 120.0 360.0
4 6000 0.0 141.0 360.0

Credit_History Property_Area Loan_Status


0 1.0 Urban Y
1 1.0 Rural N
2 1.0 Urban Y
3 1.0 Urban Y
4 1.0 Urban Y

[25]: # check null values


df.isnull().sum()

[25]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

4.0.1 Fill with negative values

[37]: new_df = df.copy()

[38]: new_df = df.fillna(-999)


new_df.isnull().sum()

[38]: Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0

ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64

4.0.2 Consider NULL Value as new category

[39]: new_df = df.copy()

[40]: df['Gender'].value_counts()

[40]: Male 489


Female 112
Name: Gender, dtype: int64

[41]: # consider nan as category


new_df['Gender'] = df['Gender'].fillna('nan')

[42]: new_df['Gender'].value_counts()

[42]: Male 489


Female 112
nan 13
Name: Gender, dtype: int64

4.0.3 Drop rows which have NULL values

[43]: new_df = df.copy()

[44]: len(df)

[44]: 614

[45]: new_df = df.dropna(axis=0)


len(new_df)

[45]: 480

[46]: new_df.isnull().sum()

[46]: Loan_ID 0
Gender 0

Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64

4.0.4 Fill missing value with mean, median and mode

[47]: new_df = df.copy()

[48]: df['LoanAmount'].mean()

[48]: 146.41216216216216

[49]: sns.distplot(df['LoanAmount'])

[49]: <AxesSubplot:xlabel='LoanAmount', ylabel='Density'>

[50]: # fill missing value for numerical
new_df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
new_df.isnull().sum()

[50]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

[52]: new_df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].median())
new_df.isnull().sum()

[52]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

[53]: sns.countplot(df['Self_Employed'])

[53]: <AxesSubplot:xlabel='Self_Employed', ylabel='count'>

[55]: df['Self_Employed'].mode()[0]

[55]: 'No'

[56]: # fill missing value for categorical
new_df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
new_df.isnull().sum()

[56]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

4.0.5 Fill missing value based on grouping category

[58]: new_df = df.copy()

[62]: df.groupby('Loan_Status').mean()['LoanAmount']

[62]: Loan_Status
N 151.220994
Y 144.294404
Name: LoanAmount, dtype: float64

[61]: mean_df = df.groupby('Loan_Status').mean()['LoanAmount']

[63]: mean_df['N']

[63]: 151.22099447513813

[69]: # fill missing value for numerical column
new_df.loc[(new_df['Loan_Status']=='N'), 'LoanAmount'] = new_df.loc[(new_df['Loan_Status']=='N'), 'LoanAmount'].fillna(mean_df['N'])
new_df.loc[(new_df['Loan_Status']=='Y'), 'LoanAmount'] = new_df.loc[(new_df['Loan_Status']=='Y'), 'LoanAmount'].fillna(mean_df['Y'])

[70]: new_df.isnull().sum()

[70]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

[72]: for val in mean_df.keys():
    print(val)

N
Y

[73]: mean_df = df.groupby('Loan_Status').mean()['Loan_Amount_Term']
mean_df

[73]: Loan_Status
N 344.064516
Y 341.072464
Name: Loan_Amount_Term, dtype: float64

[80]: df.groupby('Loan_Status')['Loan_Amount_Term'].agg(pd.Series.mean)

[80]: Loan_Status
N 344.064516
Y 341.072464
Name: Loan_Amount_Term, dtype: float64

[76]: for val in mean_df.keys():
    new_df.loc[(new_df['Loan_Status']==val), 'Loan_Amount_Term'] = new_df.loc[(new_df['Loan_Status']==val), 'Loan_Amount_Term'].fillna(mean_df[val])

[77]: new_df.isnull().sum()

[77]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

[79]: # fill missing value for categorical
mode_df = df.groupby('Loan_Status')['Self_Employed'].agg(pd.Series.mode)
mode_df

[79]: Loan_Status
N No
Y No
Name: Self_Employed, dtype: object

[81]: for val in mode_df.keys():
    new_df.loc[(new_df['Loan_Status']==val), 'Self_Employed'] = new_df.loc[(new_df['Loan_Status']==val), 'Self_Employed'].fillna(mode_df[val])

[82]: new_df.isnull().sum()

[82]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

4.0.6 Fill missing value using ML Model

[83]: new_df = df.copy()

[92]: new_df = new_df[['LoanAmount', 'Loan_Amount_Term', 'ApplicantIncome', 'CoapplicantIncome']]
new_df.head()

[92]: LoanAmount Loan_Amount_Term ApplicantIncome CoapplicantIncome


0 NaN 360.0 5849 0.0
1 128.0 360.0 4583 1508.0
2 66.0 360.0 3000 0.0
3 120.0 360.0 2583 2358.0
4 141.0 360.0 6000 0.0

[93]: len(new_df)

[93]: 614

[94]: col = "LoanAmount"

[95]: # fill numerical values


new_df_temp = new_df.dropna(subset=[col], axis=0)
print(col, len(new_df_temp))

LoanAmount 592

[96]: # input and output split
X = new_df_temp.drop(columns=[col], axis=1)
y = new_df_temp[col]

[97]: from lightgbm import LGBMRegressor


model = LGBMRegressor(use_missing=False)
model.fit(X, y)

[97]: LGBMRegressor(use_missing=False)

[98]: d = {}
temp = new_df.drop(columns=[col], axis=1)
d[col] = list(model.predict(temp))

[99]: i = 0
for val, d_val in zip(new_df[col], d[col]):
    if pd.isna(val):
        new_df.at[i, col] = d_val
    i += 1

[100]: new_df.isnull().sum()

[100]: LoanAmount 0
Loan_Amount_Term 14
ApplicantIncome 0
CoapplicantIncome 0
dtype: int64

[101]: new_df.head()

[101]: LoanAmount Loan_Amount_Term ApplicantIncome CoapplicantIncome


0 152.335142 360.0 5849 0.0
1 128.000000 360.0 4583 1508.0
2 66.000000 360.0 3000 0.0
3 120.000000 360.0 2583 2358.0
4 141.000000 360.0 6000 0.0

[ ]: # fill missing values for categorical - LGBMClassifier
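
The same idea extends to categorical columns with LGBMClassifier; a minimal sketch (the column choices mirror the numeric example above and are illustrative):

from lightgbm import LGBMClassifier
from sklearn.preprocessing import LabelEncoder

col = 'Self_Employed'
features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

# train on rows where the category is known (LightGBM tolerates NaN features)
known = df[df[col].notna()]
le = LabelEncoder()
clf = LGBMClassifier()
clf.fit(known[features], le.fit_transform(known[col]))

# predict the missing entries and write them back
mask = df[col].isna()
df.loc[mask, col] = le.inverse_transform(clf.predict(df.loc[mask, features]))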

[ ]:

[ ]:

5 Feature Selection Techniques
5.1 Correlation Matrix (Numerical Attributes)
[2]: df = pd.read_csv('data/bike sharing dataset.csv')
df.head()

[2]: instant dteday season yr mnth hr holiday weekday workingday \


0 1 2011-01-01 1 0 1 0 0 6 0
1 2 2011-01-01 1 0 1 1 0 6 0
2 3 2011-01-01 1 0 1 2 0 6 0
3 4 2011-01-01 1 0 1 3 0 6 0
4 5 2011-01-01 1 0 1 4 0 6 0

weathersit temp atemp hum windspeed casual registered cnt


0 1 0.24 0.2879 0.81 0.0 3 13 16
1 1 0.22 0.2727 0.80 0.0 8 32 40
2 1 0.22 0.2727 0.80 0.0 5 27 32
3 1 0.24 0.2879 0.75 0.0 3 10 13
4 1 0.24 0.2879 0.75 0.0 0 1 1

[3]: corr = df.corr()


corr

[3]: instant season yr mnth hr holiday \


instant 1.000000 0.404046 0.866014 0.489164 -0.004775 0.014723
season 0.404046 1.000000 -0.010742 0.830386 -0.006117 -0.009585
yr 0.866014 -0.010742 1.000000 -0.010473 -0.003867 0.006692
mnth 0.489164 0.830386 -0.010473 1.000000 -0.005772 0.018430
hr -0.004775 -0.006117 -0.003867 -0.005772 1.000000 0.000479
holiday 0.014723 -0.009585 0.006692 0.018430 0.000479 1.000000
weekday 0.001357 -0.002335 -0.004485 0.010400 -0.003498 -0.102088
workingday -0.003416 0.013743 -0.002196 -0.003477 0.002285 -0.252471
weathersit -0.014198 -0.014524 -0.019157 0.005400 -0.020203 -0.017036
temp 0.136178 0.312025 0.040913 0.201691 0.137603 -0.027340
atemp 0.137615 0.319380 0.039222 0.208096 0.133750 -0.030973
hum 0.009577 0.150625 -0.083546 0.164411 -0.276498 -0.010588
windspeed -0.074505 -0.149773 -0.008740 -0.135386 0.137252 0.003988
casual 0.158295 0.120206 0.142779 0.068457 0.301202 0.031564
registered 0.282046 0.174226 0.253684 0.122273 0.374141 -0.047345
cnt 0.278379 0.178056 0.250495 0.120638 0.394071 -0.030927

weekday workingday weathersit temp atemp hum \


instant 0.001357 -0.003416 -0.014198 0.136178 0.137615 0.009577
season -0.002335 0.013743 -0.014524 0.312025 0.319380 0.150625
yr -0.004485 -0.002196 -0.019157 0.040913 0.039222 -0.083546
mnth 0.010400 -0.003477 0.005400 0.201691 0.208096 0.164411

hr -0.003498 0.002285 -0.020203 0.137603 0.133750 -0.276498
holiday -0.102088 -0.252471 -0.017036 -0.027340 -0.030973 -0.010588
weekday 1.000000 0.035955 0.003311 -0.001795 -0.008821 -0.037158
workingday 0.035955 1.000000 0.044672 0.055390 0.054667 0.015688
weathersit 0.003311 0.044672 1.000000 -0.102640 -0.105563 0.418130
temp -0.001795 0.055390 -0.102640 1.000000 0.987672 -0.069881
atemp -0.008821 0.054667 -0.105563 0.987672 1.000000 -0.051918
hum -0.037158 0.015688 0.418130 -0.069881 -0.051918 1.000000
windspeed 0.011502 -0.011830 0.026226 -0.023125 -0.062336 -0.290105
casual 0.032721 -0.300942 -0.152628 0.459616 0.454080 -0.347028
registered 0.021578 0.134326 -0.120966 0.335361 0.332559 -0.273933
cnt 0.026900 0.030284 -0.142426 0.404772 0.400929 -0.322911

windspeed casual registered cnt


instant -0.074505 0.158295 0.282046 0.278379
season -0.149773 0.120206 0.174226 0.178056
yr -0.008740 0.142779 0.253684 0.250495
mnth -0.135386 0.068457 0.122273 0.120638
hr 0.137252 0.301202 0.374141 0.394071
holiday 0.003988 0.031564 -0.047345 -0.030927
weekday 0.011502 0.032721 0.021578 0.026900
workingday -0.011830 -0.300942 0.134326 0.030284
weathersit 0.026226 -0.152628 -0.120966 -0.142426
temp -0.023125 0.459616 0.335361 0.404772
atemp -0.062336 0.454080 0.332559 0.400929
hum -0.290105 -0.347028 -0.273933 -0.322911
windspeed 1.000000 0.090287 0.082321 0.093234
casual 0.090287 1.000000 0.506618 0.694564
registered 0.082321 0.506618 1.000000 0.972151
cnt 0.093234 0.694564 0.972151 1.000000

[8]: # display correlation matrix in heatmap


corr = df.corr()
plt.figure(figsize=(14,9))
sns.heatmap(corr, annot=True, cmap='coolwarm')

[8]: <AxesSubplot:>

5.2 Chi-Square (Categorical Attributes)
[10]: df = pd.read_csv('data/Loan Prediction Dataset.csv')
df = df[['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area', 'Loan_Status']]

# fill null values
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])
df.head()

[10]: Gender Married Dependents Education Self_Employed Credit_History \


0 Male No 0 Graduate No 1.0
1 Male Yes 1 Graduate No 1.0
2 Male Yes 0 Graduate Yes 1.0
3 Male Yes 0 Not Graduate No 1.0
4 Male No 0 Graduate No 1.0

Property_Area Loan_Status
0 Urban Y
1 Rural N
2 Urban Y

3 Urban Y
4 Urban Y

[11]: # label encoding
from sklearn.preprocessing import LabelEncoder
for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
df.head()

[11]: Gender Married Dependents Education Self_Employed Credit_History \


0 1 0 0 0 0 1
1 1 1 1 0 0 1
2 1 1 0 0 1 1
3 1 1 0 1 0 1
4 1 0 0 0 0 1

Property_Area Loan_Status
0 2 1
1 0 0
2 2 1
3 2 1
4 2 1

[12]: from sklearn.feature_selection import chi2


X = df.drop(columns=['Loan_Status'], axis=1)
y = df['Loan_Status']

[14]: chi_scores = chi2(X, y)

[15]: chi_scores

[15]: (array([3.62343084e-02, 1.78242499e+00, 8.59527587e-02, 3.54050246e+00,
        7.28480330e-03, 2.60058772e+01, 3.77837464e-01]),
 array([8.49032435e-01, 1.81851834e-01, 7.69386856e-01, 5.98873168e-02,
        9.31982300e-01, 3.40379591e-07, 5.38762867e-01]))

[16]: # the higher the chi-square value, the more important the feature
chi_values = pd.Series(chi_scores[0], index=X.columns)
chi_values.sort_values(ascending=False, inplace=True)
chi_values.plot.bar()

[16]: <AxesSubplot:>

[17]: # features with p-value > 0.05 are unlikely to be related to the target
p_values = pd.Series(chi_scores[1], index=X.columns)
p_values.sort_values(ascending=False, inplace=True)
p_values.plot.bar()

[17]: <AxesSubplot:>
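
The same chi-square scores drive scikit-learn's SelectKBest, which keeps the top-k features directly; a minimal sketch:

from sklearn.feature_selection import SelectKBest

selector = SelectKBest(score_func=chi2, k=3)
X_new = selector.fit_transform(X, y)
# names of the retained columns
print(X.columns[selector.get_support()])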

5.3 Recursive Feature Elimination (RFE)
[18]: df.head()

[18]: Gender Married Dependents Education Self_Employed Credit_History \


0 1 0 0 0 0 1
1 1 1 1 0 0 1
2 1 1 0 0 1 1
3 1 1 0 1 0 1
4 1 0 0 0 0 1

Property_Area Loan_Status
0 2 1
1 0 0
2 2 1
3 2 1
4 2 1

[19]: from sklearn.feature_selection import RFE


from sklearn.tree import DecisionTreeClassifier

[20]: # input split
X = df.drop(columns=['Loan_Status'], axis=1)
y = df['Loan_Status']

[21]: rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=3)


rfe.fit(X, y)

[21]: RFE(estimator=DecisionTreeClassifier(), n_features_to_select=3)

[24]: for i, col in zip(range(X.shape[1]), X.columns):
    print(f"{col} selected={rfe.support_[i]} rank={rfe.ranking_[i]}")

Gender selected=False rank=4
Married selected=False rank=5
Dependents selected=False rank=3
Education selected=True rank=1
Self_Employed selected=False rank=2
Credit_History selected=True rank=1
Property_Area selected=True rank=1
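
The fitted selector can reduce the feature matrix directly; a minimal sketch:

# keep only the three selected features
X_selected = X.loc[:, rfe.support_]
# rfe.transform(X) returns the same columns as a NumPy array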

[ ]:

6 Cross Validation Techniques


6.1 KFold Cross Validation
[3]: df = pd.read_csv('data/bike sharing dataset.csv')
df = df.drop(columns=['instant', 'dteday', 'casual', 'registered'], axis=1)
df.head()

[3]: season yr mnth hr holiday weekday workingday weathersit temp \


0 1 0 1 0 0 6 0 1 0.24
1 1 0 1 1 0 6 0 1 0.22
2 1 0 1 2 0 6 0 1 0.22
3 1 0 1 3 0 6 0 1 0.24
4 1 0 1 4 0 6 0 1 0.24

atemp hum windspeed cnt


0 0.2879 0.81 0.0 16
1 0.2727 0.80 0.0 40
2 0.2727 0.80 0.0 32
3 0.2879 0.75 0.0 13
4 0.2879 0.75 0.0 1

[4]: X = df.drop(columns=['cnt'], axis=1)


y = df['cnt']

[6]: from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, random_state=42, shuffle=True)


model = RandomForestRegressor()
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
print(f"Error Mean: {np.mean(scores)} Error Std: {np.std(scores)}")

Error Mean: 0.9451816375903522 Error Std: 0.0034610555321333914
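
Despite the "Error" label in the print, cross_val_score falls back to the estimator's default scorer, which is R^2 for regressors, so ~0.945 is a goodness-of-fit rather than an error. An explicit error metric can be requested via the scoring parameter; a minimal sketch:

scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error', n_jobs=-1)
print(f"MSE Mean: {-np.mean(scores)} MSE Std: {np.std(scores)}")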

6.2 Repeated Stratified KFold Cross Validation


[8]: from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)


model = RandomForestRegressor()
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
print(f"Error Mean: {np.mean(scores)} Error Std: {np.std(scores)}")

Error Mean: 0.9450597389924781 Error Std: 0.0036935612975137313

[ ]:

7 Handling Imbalanced Classes


[13]: from collections import Counter

[9]: df = pd.read_csv('data/creditcard.csv')
df.head()

[9]: Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941

V8 V9 … V21 V22 V23 V24 V25 \


0 0.098698 0.363787 … -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 … -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 … 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 … -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 … -0.009431 0.798278 -0.137458 0.141267 -0.206010

V26 V27 V28 Amount Class

0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0

[5 rows x 31 columns]

[10]: X = df.drop(columns=['Class'], axis=1)


y = df['Class']

[12]: sns.countplot(y)

[12]: <AxesSubplot:xlabel='Class', ylabel='count'>

[14]: Counter(y)

[14]: Counter({0: 284315, 1: 492})

7.1 Over Sampling Techniques
7.1.1 RandomOverSampler

[17]: ## randomly repeats existing minority-class samples in the dataset

[ ]: !pip install imblearn

[15]: from imblearn.over_sampling import RandomOverSampler


oversampler = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversampler.fit_resample(X, y)
Counter(y_over)

[15]: Counter({0: 284315, 1: 284315})

[16]: oversampler = RandomOverSampler(sampling_strategy=0.3)


X_over, y_over = oversampler.fit_resample(X, y)
Counter(y_over)

[16]: Counter({0: 284315, 1: 85294})

7.1.2 SMOTE (Synthetic Minority Over-sampling Technique)

[ ]: # it creates new synthetic samples based on nearest neighbors

[21]: from imblearn.over_sampling import SMOTE


oversampler = SMOTE(sampling_strategy=0.4)
X_over, y_over = oversampler.fit_resample(X, y)
Counter(y_over)

[21]: Counter({0: 284315, 1: 113726})

[22]: sns.countplot(y_over)

[22]: <AxesSubplot:xlabel='Class', ylabel='count'>

7.2 Under Sampling Technique
7.2.1 RandomUnderSampler

[23]: from imblearn.under_sampling import RandomUnderSampler


undersampler = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersampler.fit_resample(X, y)
Counter(y_under)

[23]: Counter({0: 492, 1: 492})

[30]: undersampler = RandomUnderSampler(sampling_strategy=0.2)


X_under, y_under = undersampler.fit_resample(X, y)
Counter(y_under)

[30]: Counter({0: 2460, 1: 492})

[31]: sns.countplot(y_under)

[31]: <AxesSubplot:xlabel='Class', ylabel='count'>

7.3 Combine Oversampling and Undersampling
[37]: from imblearn.pipeline import Pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
pipeline = Pipeline([('o', over), ('u', under)])
X_resample, y_resample = pipeline.fit_resample(X, y)
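
Counter (imported earlier) confirms the class balance after the two-step pipeline; a quick check:

print('before:', Counter(y))
print('after :', Counter(y_resample))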

[38]: sns.countplot(y)

[38]: <AxesSubplot:xlabel='Class', ylabel='count'>

[39]: sns.countplot(y_resample)

[39]: <AxesSubplot:xlabel='Class', ylabel='count'>

[ ]:

8 Ensembling Techniques
[46]: df = pd.read_csv('data/winequality.csv')
df = df.drop(columns=['type'], axis=1)
df = df.fillna(-2)
df.head()

[46]: fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.0 0.27 0.36 20.7 0.045
1 6.3 0.30 0.34 1.6 0.049
2 8.1 0.28 0.40 6.9 0.050
3 7.2 0.23 0.32 8.5 0.058
4 7.2 0.23 0.32 8.5 0.058

free sulfur dioxide total sulfur dioxide density pH sulphates \


0 45.0 170.0 1.0010 3.00 0.45
1 14.0 132.0 0.9940 3.30 0.49
2 30.0 97.0 0.9951 3.26 0.44
3 47.0 186.0 0.9956 3.19 0.40
4 47.0 186.0 0.9956 3.19 0.40

alcohol quality
0 8.8 6
1 9.5 6
2 10.1 6
3 9.9 6
4 9.9 6

[47]: X = df.drop(columns=['quality'], axis=1)


y = df['quality']

[48]: from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25,␣
↪random_state=42, stratify=y)

[49]: from sklearn.linear_model import LogisticRegression


model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)

[49]: 0.4664615384615385

8.1 Voting Classifier
[52]: from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = RandomForestClassifier()

model = VotingClassifier(estimators=[('lr', model1), ('kn', model2), ('rf', model3)], voting='soft')  # soft averages probability scores; hard takes the majority class

model.fit(x_train, y_train)
model.score(x_test, y_test)

[52]: 0.6344615384615384

8.2 Averaging
[53]: model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = RandomForestClassifier()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

final_pred = (pred1+pred2+pred3)/3

[56]: sns.countplot(y)

[56]: <AxesSubplot:xlabel='quality', ylabel='count'>

[58]: final_pred

[58]: array([[2.00197494e-03, 8.00622692e-03, 4.62873630e-01, …, 3.14175685e-02, 1.19202672e-02, 1.93443421e-05],
       [4.84634692e-03, 1.76344906e-02, 5.34755720e-01, …, 2.74161344e-02, 8.46195147e-03, 2.05848950e-05],
       [5.11534773e-03, 1.96147189e-02, 2.74824437e-01, …, 1.86035677e-01, 3.98274064e-02, 4.01583045e-03],
       …,
       [1.83534781e-03, 2.05818375e-02, 6.62530834e-01, …, 1.02678348e-01, 6.85639154e-03, 6.13865903e-04],
       [1.88135495e-03, 1.91471538e-02, 2.21197692e-01, …, 2.26948363e-01, 2.04132847e-02, 3.42074213e-03],
       [1.47880961e-03, 1.04958076e-02, 3.94366651e-01, …, 5.45452751e-02, 9.34159743e-03, 9.95620183e-05]])

[79]: pred = []
for res in final_pred:
    pred.append(np.argmax(res)+3)  # quality labels start at 3, so shift the argmax index

[80]: from sklearn.metrics import accuracy_score


accuracy_score(y_test, pred)

[80]: 0.6350769230769231

8.3 Weighted Average
[85]: model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = RandomForestClassifier()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

final_pred = (pred1*0.25+pred2*0.25+pred3*0.5)/3  # the /3 is redundant (the weights already sum to 1) and does not change the argmax

[86]: pred = []
for res in final_pred:
    pred.append(np.argmax(res)+3)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

[86]: 0.6652307692307692

[ ]: # advanced ensembling - stacking, blending, bagging, boosting


# ensemble algorithms
# bagging - random forest, bagging
# boosting - gbm, xgboost, lightgbm, catboost
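
Of the advanced methods listed above, stacking is also available directly in scikit-learn; a minimal sketch (the estimator choices are illustrative):

from sklearn.ensemble import StackingClassifier

# base models feed their predictions to a logistic-regression meta-model
stack = StackingClassifier(
    estimators=[('kn', KNeighborsClassifier()), ('rf', RandomForestClassifier())],
    final_estimator=LogisticRegression()
)
stack.fit(x_train, y_train)
print(stack.score(x_test, y_test))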

[ ]:

9 Dimensionality Reduction Techniques


[2]: from keras.datasets import mnist

[3]: (X, y), (_,_) = mnist.load_data()


print(X.shape, y.shape)

(60000, 28, 28) (60000,)

[4]: X = X.reshape(len(X), -1)


X.shape

[4]: (60000, 784)

[5]: from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from umap import UMAP

9.1 PCA
[13]: x_pca = PCA(n_components=2).fit_transform(X)

[14]: x_pca.shape

[14]: (60000, 2)

[21]: plt.figure(figsize=(10,10))
sc = plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y)
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()
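
PCA also reports how much of the total variance each component retains, which helps judge whether two components are enough; a quick check:

pca = PCA(n_components=2).fit(X)
# fraction of total variance captured by each component
print(pca.explained_variance_ratio_)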

9.2 LDA
[22]: x_lda = LDA(n_components=2).fit_transform(X, y)

[23]: plt.figure(figsize=(10,10))
sc = plt.scatter(x_lda[:, 0], x_lda[:, 1], c=y)
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()

9.3 t-SNE
[12]: # taking only 10k samples for quick results
x_tsne = TSNE(n_jobs=-1).fit_transform(X[:10000])

[13]: plt.figure(figsize=(10,10))
sc = plt.scatter(x_tsne[:, 0], x_tsne[:, 1], c=y[:10000])
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()

9.4 UMAP
[6]: x_umap = UMAP(n_neighbors=10, min_dist=0.1, metric='correlation').fit_transform(X)

[7]: plt.figure(figsize=(10,10))
sc = plt.scatter(x_umap[:, 0], x_umap[:, 1], c=y)
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()

[ ]:

10 Handle Large Data (CSV)
[2]: df = pd.read_csv('data/1000000 Sales Records.csv')
df.head()

[2]: Region Country Item Type Sales Channel \


0 Sub-Saharan Africa South Africa Fruits Offline
1 Middle East and North Africa Morocco Clothes Online
2 Australia and Oceania Papua New Guinea Meat Offline
3 Sub-Saharan Africa Djibouti Clothes Offline
4 Europe Slovakia Beverages Offline

Order Priority Order Date Order ID Ship Date Units Sold Unit Price \
0 M 7/27/2012 443368995 7/28/2012 1593 9.33
1 M 9/14/2013 667593514 10/19/2013 4611 109.28
2 M 5/15/2015 940995585 6/4/2015 360 421.89
3 H 5/17/2017 880811536 7/2/2017 562 109.28
4 L 10/26/2016 174590194 12/4/2016 3973 47.45

Unit Cost Total Revenue Total Cost Total Profit


0 6.92 14862.69 11023.56 3839.13
1 35.84 503890.08 165258.24 338631.84
2 364.69 151880.40 131288.40 20592.00
3 35.84 61415.36 20142.08 41273.28
4 31.79 188518.85 126301.67 62217.18

[3]: df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 14 entries, Region to Total Profit
dtypes: float64(5), int64(2), object(7)
memory usage: 489.9 MB

10.1 nrows
[4]: df = pd.read_csv('data/1000000 Sales Records.csv', nrows=1000)
df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 14 entries, Region to Total Profit
dtypes: float64(5), int64(2), object(7)
memory usage: 502.1 KB

[6]: cols = df.columns.values


cols

[6]: array(['Region', 'Country', 'Item Type', 'Sales Channel',
'Order Priority', 'Order Date', 'Order ID', 'Ship Date',
'Units Sold', 'Unit Price', 'Unit Cost', 'Total Revenue',
'Total Cost', 'Total Profit'], dtype=object)

[7]: req_cols = ['Region', 'Country', 'Item Type', 'Sales Channel',


'Order Priority',
'Units Sold', 'Unit Price', 'Unit Cost', 'Total Revenue',
'Total Cost', 'Total Profit']

[8]: df = pd.read_csv('data/1000000 Sales Records.csv', usecols=req_cols)


df.head()

[8]: Region Country Item Type Sales Channel \


0 Sub-Saharan Africa South Africa Fruits Offline
1 Middle East and North Africa Morocco Clothes Online
2 Australia and Oceania Papua New Guinea Meat Offline
3 Sub-Saharan Africa Djibouti Clothes Offline
4 Europe Slovakia Beverages Offline

Order Priority Units Sold Unit Price Unit Cost Total Revenue \
0 M 1593 9.33 6.92 14862.69
1 M 4611 109.28 35.84 503890.08
2 M 360 421.89 364.69 151880.40
3 H 562 109.28 35.84 61415.36
4 L 3973 47.45 31.79 188518.85

Total Cost Total Profit


0 11023.56 3839.13
1 165258.24 338631.84
2 131288.40 20592.00
3 20142.08 41273.28
4 126301.67 62217.18

[9]: df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 11 entries, Region to Total Profit
dtypes: float64(5), int64(1), object(5)
memory usage: 356.5 MB

10.2 Convert Datatype of the Columns


[10]: df.describe()

[10]: Units Sold Unit Price Unit Cost Total Revenue \
count 1000000.000000 1000000.000000 1000000.000000 1.000000e+06
mean 4998.867302 266.025488 187.522978 1.329563e+06
std 2885.334142 216.987966 175.650798 1.468527e+06
min 1.000000 9.330000 6.920000 9.330000e+00
25% 2502.000000 81.730000 35.840000 2.778672e+05
50% 4998.000000 154.060000 97.440000 7.844445e+05
75% 7496.000000 421.890000 263.330000 1.822444e+06
max 10000.000000 668.270000 524.960000 6.682700e+06

Total Cost Total Profit


count 1.000000e+06 1.000000e+06
mean 9.372671e+05 3.922956e+05
std 1.148954e+06 3.788199e+05
min 6.920000e+00 2.410000e+00
25% 1.617289e+05 9.510480e+04
50% 4.667818e+05 2.810549e+05
75% 1.196327e+06 5.653076e+05
max 5.249600e+06 1.738700e+06

[11]: for col in df.columns:
    if df[col].dtype == 'float64':
        df[col] = df[col].astype('float16')  # float16 saves memory but keeps only ~3 significant digits
    if df[col].dtype == 'int64':
        df[col] = df[col].astype('int16')
    if df[col].dtype == 'object':
        df[col] = df[col].astype('category')

[12]: df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 11 entries, Region to Total Profit
dtypes: category(5), float16(5), int16(1)
memory usage: 17.2 MB

[13]: df = pd.read_csv('data/1000000 Sales Records.csv', usecols=req_cols, dtype={'Region': 'category', 'Units Sold': 'int16'})
df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 11 entries, Region to Total Profit
dtypes: category(1), float64(5), int16(1), object(4)
memory usage: 282.3 MB

10.3 Load Dataset Faster using chunks
[17]: %%time
df = pd.read_csv('data/1000000 Sales Records.csv')
len(df)

Wall time: 2.1 s

[17]: 1000000

[19]: %%time
chunks = pd.read_csv('data/1000000 Sales Records.csv', iterator=True, chunksize=1000)

# df = pd.concat(chunks, ignore_index=True)
# df.head()

Wall time: 5.01 ms

[20]: length = 0
for chunk in chunks:
    length += len(chunk)
length

[20]: 1000000

[ ]: # multiprocessing, dask module, numpy
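
As a sketch of the dask option mentioned above (assuming dask is installed), the CSV is read lazily in partitions and only computed on demand:

import dask.dataframe as dd

ddf = dd.read_csv('data/1000000 Sales Records.csv')
# nothing is loaded until .compute() is called
print(ddf['Total Profit'].mean().compute())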

[ ]:

11 Sampling Techniques
[1]: import pandas as pd
df = pd.read_csv('data/winequality.csv')
df.head()

[1]: type fixed acidity volatile acidity … sulphates alcohol quality


0 white 7.0 0.27 … 0.45 8.8 6
1 white 6.3 0.30 … 0.49 9.5 6
2 white 8.1 0.28 … 0.44 10.1 6
3 white 7.2 0.23 … 0.40 9.9 6
4 white 7.2 0.23 … 0.40 9.9 6

[5 rows x 13 columns]

[2]: len(df)

[2]: 6497

11.0.1 Random Sample

[8]: # without duplicates


sample_df = df.sample(n=500, replace=False).reset_index(drop=True)
sample_df.head()

[8]: type fixed acidity volatile acidity … sulphates alcohol quality


0 white 7.6 0.285 … 0.45 9.2 5
1 white 6.3 0.320 … 0.46 12.3 6
2 white 7.3 0.220 … 0.41 9.1 7
3 red 6.5 0.530 … 0.83 10.3 6
4 white 7.0 0.480 … 0.35 9.0 5

[5 rows x 13 columns]

[4]: len(sample_df)

[4]: 500

[9]: # with duplicates, creates more samples


sample_df = df.sample(n=10000, replace=True).reset_index(drop=True)
sample_df.head()

[9]: type fixed acidity volatile acidity … sulphates alcohol quality


0 white 6.0 0.200 … 0.47 9.8 5
1 red 8.5 0.210 … 0.67 10.4 5
2 white 6.1 0.380 … 0.69 10.4 6
3 red 6.8 0.775 … 0.56 10.7 5
4 white 5.6 0.320 … 0.49 11.1 6

[5 rows x 13 columns]

[10]: len(sample_df)

[10]: 10000

11.0.2 Stratified Sampling

[11]: # useful to get uniform train and test data

[13]: import pandas as pd


import seaborn as sns
df = pd.read_csv('data/winequality.csv')
df.head()

[13]: type fixed acidity volatile acidity … sulphates alcohol quality


0 white 7.0 0.27 … 0.45 8.8 6

1 white 6.3 0.30 … 0.49 9.5 6
2 white 8.1 0.28 … 0.44 10.1 6
3 white 7.2 0.23 … 0.40 9.9 6
4 white 7.2 0.23 … 0.40 9.9 6

[5 rows x 13 columns]

[14]: sns.countplot(x='quality', data=df)

[14]: <Axes: xlabel='quality', ylabel='count'>

[15]: from sklearn.model_selection import train_test_split

X = df.drop(columns=['quality'])
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

[16]: sns.countplot(x=y_test)

[16]: <Axes: xlabel='quality', ylabel='count'>

[17]: # stratified samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

[18]: sns.countplot(x=y_test)

[18]: <Axes: xlabel='quality', ylabel='count'>

[ ]:

12 L1 & L2 Regularization (Reduce Overfitting & Perform Feature Selection)
[27]: import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/winequality.csv')
df.head()

[27]: type fixed acidity volatile acidity … sulphates alcohol quality


0 white 7.0 0.27 … 0.45 8.8 6
1 white 6.3 0.30 … 0.49 9.5 6
2 white 8.1 0.28 … 0.44 10.1 6
3 white 7.2 0.23 … 0.40 9.9 6
4 white 7.2 0.23 … 0.40 9.9 6

[5 rows x 13 columns]

[20]: df = df.drop(columns=['type'])
df = df.fillna(-2)
df.head(2)

[20]: fixed acidity volatile acidity citric acid … sulphates alcohol quality
0 7.0 0.27 0.36 … 0.45 8.8 6
1 6.3 0.30 0.34 … 0.49 9.5 6

[2 rows x 12 columns]

[21]: X = df.drop(columns=['quality'])
y = df['quality']

[22]: from sklearn.model_selection import train_test_split


from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

[23]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

12.0.1 Lasso (L1)

[40]: lasso_model = Lasso(alpha=0.5) # alpha = regularization strength


# train the model
lasso_model.fit(X_train, y_train)
# predict from model
y_pred = lasso_model.predict(X_test)
# calculate mse
lasso_mse = mean_squared_error(y_test, y_pred)
print("Lasso MSE:", lasso_mse)

Lasso MSE: 0.7094490695927799

12.0.2 Ridge (L2)

[41]: ridge_model = Ridge(alpha=0.5)


# train the model
ridge_model.fit(X_train, y_train)
# predict from model
y_pred = ridge_model.predict(X_test)
# calculate mse
ridge_mse = mean_squared_error(y_test, y_pred)
print("Ridge MSE:", ridge_mse)

Ridge MSE: 0.48188801180027196

[42]: lasso_model.coef_

[42]: array([-0.        , -0.        ,  0.        , -0.        , -0.        ,
        0.00527723, -0.0017967 , -0.        ,  0.        ,  0.        ,
        0.        ])
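
The zeroed coefficients are how Lasso performs feature selection; the surviving features can be read off directly:

# features with nonzero L1 coefficients
print(X.columns[lasso_model.coef_ != 0])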

[44]: plt.figure(figsize=(10, 6))
plt.bar(X.columns, lasso_model.coef_, color='blue')
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.title('Coefficients of Features')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

[45]: ridge_model.coef_

[45]: array([ 0.00305946, -1.07868134,  0.01248917,  0.02339235, -0.11589844,
        0.00677758, -0.00228324, -0.40177486,  0.02487401,  0.52749414,
        0.33607207])

[46]: plt.figure(figsize=(10, 6))
plt.bar(X.columns, ridge_model.coef_, color='orange')
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.title('Coefficients of Features')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

[ ]:

13 Pipeline Module
[1]: from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

[2]: # load the data


X, y = load_iris(return_X_y=True)

[5]: # split for training and testing


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,␣
↪random_state=42)

[6]: # build the pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scalar', StandardScaler()),
    ('model', LogisticRegression())
])

[8]: # train the model


pipeline.fit(X_train, y_train)

[8]: Pipeline(steps=[('imputer', SimpleImputer()), ('scalar', StandardScaler()),


('model', LogisticRegression())])

[10]: # evaluate the model


accuracy = pipeline.score(X_test, y_test)
print('Accuracy:', accuracy)

Accuracy: 1.0

[11]: # get test predictions


y_pred = pipeline.predict(X_test)
print('Predictions:', y_pred)

Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
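
Because the whole preprocessing chain lives inside one estimator, it can be tuned and cross-validated as a unit; a minimal sketch (the grid is illustrative, and step names prefix the parameter names):

from sklearn.model_selection import GridSearchCV

param_grid = {'model__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)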

[ ]:

[ ]:

[ ]:

[ ]:
