
Practical 6 - Scaling the Dataset Using Standardization.

Standardization:
Standardization, also known as Z-score normalization, is a data preprocessing
technique that transforms the distribution of a feature (or set of features) so
that the mean of the observed values is 0 and the standard deviation is 1. This
technique is essential when machine learning algorithms are sensitive to the
scale of the input features. For instance, Support Vector Machines (SVM) and
k-Nearest Neighbors (k-NN) rely on distances between samples, and neural
networks are trained by gradient descent; both kinds of algorithms work best
when all features are on a comparable scale.

Formula for Standardization:

z = (x - μ) / σ

where x is an observed value of the feature, μ is the mean of the observed
values, and σ is their standard deviation.
After standardization:
- The mean becomes 0.
- The standard deviation becomes 1.
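As a quick sanity check of the formula, it can be applied by hand with NumPy
(a minimal sketch; the sample values are the Age column used later in this
practical):

import numpy as np

x = np.array([22, 25, 47, 52, 46], dtype=float)

# z = (x - mean) / std, using the population std (ddof=0),
# which is the convention scikit-learn's StandardScaler follows
z = (x - x.mean()) / x.std()

print(z.mean())  # ~0, up to floating-point round-off
print(z.std())   # 1.0, up to floating-point round-off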

Why Standardization is Important:


1. Equal Feature Contribution: In many machine learning models, features with
larger ranges may dominate the model, making it more challenging for the
algorithm to learn from other features. Standardization helps ensure that each
feature contributes equally.
2. Faster Convergence: In optimization algorithms like gradient descent,
standardizing data can speed up convergence since all features will be on a
similar scale.
3. Better Performance: Many algorithms behave more predictably when features
are centered and scaled (mean = 0, std = 1). Note that standardization changes
only a feature's location and scale, not the shape of its distribution.

When to Use Standardization:


- Distance-Based Algorithms: Algorithms like SVM, k-NN, and k-means
clustering compute distances between samples and work better when all features
are on the same scale (see the pipeline sketch after this list).
- Gradient-Based Algorithms: Neural networks and linear/logistic regression use
gradient descent. Standardization helps these algorithms converge faster.
- Features with Different Units: If your features are measured in different units
(e.g., height in centimeters and weight in kilograms), standardization helps
bring them to a common scale.
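As a sketch of how scaling is typically combined with a distance-based model
in scikit-learn (the classifier choice and variable names here are
illustrative, not prescribed by this practical):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Putting the scaler inside a pipeline applies the same scaling
# consistently during both fitting and prediction
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
# model.fit(X_train, y_train)     # X_train, y_train assumed to exist
# y_pred = model.predict(X_test)  # X_test assumed to exist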

Example
Let’s consider a dataset where we have two features: age and salary. Age
typically ranges from 20 to 60, while salary can range from $30,000 to
$120,000. Due to the large difference in scales, salary will dominate any
distance calculations. Standardizing both features ensures that age and salary
contribute equally to the model.
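The domination effect is easy to verify numerically. Below is a minimal
sketch with two invented sample points (the exact numbers are for
illustration only):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two people described by (age, salary)
X = np.array([[25.0, 30000.0],
              [60.0, 32000.0]])

# Raw Euclidean distance: the $2,000 salary gap swamps the 35-year age gap
print(np.linalg.norm(X[0] - X[1]))  # about 2000.3

# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))  # about 2.83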

Practical Example in Python:


Below is a step-by-step guide to standardizing a dataset using Python and the
`StandardScaler` class from the `scikit-learn` library.

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
Step 2: Create a Dataset
Let’s create a small dataset with two features, `Age` and `Salary`, to
demonstrate the effect of standardization.
# Create a sample dataset
data = {'Age': [22, 25, 47, 52, 46],
        'Salary': [20000, 32000, 45000, 58000, 79000]}
df = pd.DataFrame(data)

Step 3: Apply Standardization


Use `StandardScaler` to standardize the dataset. This will transform each feature
so that the mean is 0 and the standard deviation is 1.
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the data and transform it
standardized_data = scaler.fit_transform(df)

# Convert the standardized data back to a DataFrame
standardized_df = pd.DataFrame(standardized_data, columns=['Age', 'Salary'])
print("\nStandardized Dataset:\n", standardized_df)

Step 4: View Results


In this step, you'll observe that each transformed feature has a mean of 0
(up to floating-point error). The standard deviation printed by pandas will be
slightly above 1, for the reason explained after the code.
# Check mean and std deviation after standardization
mean = standardized_df.mean()
std_dev = standardized_df.std()

print("\nMean after Standardization:\n", mean)
print("\nStandard Deviation after Standardization:\n", std_dev)
Full Code Example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Step 1: Create a sample dataset
data = {'Age': [22, 25, 47, 52, 46],
        'Salary': [20000, 32000, 45000, 58000, 79000]}
df = pd.DataFrame(data)
print("Original Dataset:\n", df)

# Step 2: Initialize the StandardScaler
scaler = StandardScaler()

# Step 3: Fit the scaler to the data and transform it
standardized_data = scaler.fit_transform(df)

# Step 4: Convert the standardized data back to a DataFrame
standardized_df = pd.DataFrame(standardized_data, columns=['Age', 'Salary'])
print("\nStandardized Dataset:\n", standardized_df)

# Step 5: Check mean and std deviation after standardization
mean = standardized_df.mean()
std_dev = standardized_df.std()

print("\nMean after Standardization:\n", mean)
print("\nStandard Deviation after Standardization:\n", std_dev)
Output:

1. Original Dataset:
   Age  Salary
0   22   20000
1   25   32000
2   47   45000
3   52   58000
4   46   79000

2. Standardized Dataset:
        Age    Salary
0 -1.325688 -1.306835
1 -1.083184 -0.721685
2  0.695178 -0.087773
3  1.099351  0.546140
4  0.614344  1.570153

3. Mean and Standard Deviation after Standardization:

Mean after Standardization:
Age       3.330669e-16
Salary    0.000000e+00
dtype: float64

Standard Deviation after Standardization:
Age       1.118034
Salary    1.118034
dtype: float64
Explanation of Results:
- Each transformed feature now has a mean of 0; the tiny e-16 value is
ordinary floating-point round-off.
- The reported standard deviation is 1.118034 rather than exactly 1 for the
ddof reason explained in Step 4: with n = 5, sqrt(5/4) ≈ 1.118034. Calling
.std(ddof=0) returns exactly 1.
- Both `Age` and `Salary` are now on a common scale, ensuring that both
features contribute equally to any machine learning algorithm applied to this
data.
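One practical caveat, not covered by the code above: when the data is split
into training and test sets, the scaler should be fitted on the training data
only, and the fitted scaler then reused to transform the test data, so that no
information leaks from the test set. A minimal sketch (X_train and X_test are
assumed to exist):

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_std = scaler.transform(X_test)        # reuse the same statistics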

Conclusion:
Standardization is a crucial preprocessing step, especially when dealing with
features of different scales. By transforming features to have a mean of 0 and a
standard deviation of 1, it ensures that algorithms sensitive to the magnitude of
input values perform better and converge faster.
