Practical 6 - Scaling the data set using Standardization.
Standardization:
Standardization, also known as Z-score normalization, is a data preprocessing
technique that transforms the distribution of a feature (or set of features) so
that the mean of the observed values is 0 and the standard deviation is 1. This
technique is essential when machine learning algorithms are sensitive to the
scale of input features. For instance, Support Vector Machines (SVM), k-Nearest
Neighbors (k-NN), and k-means clustering rely on distances between data points,
while neural networks train faster on standardized inputs; all of these benefit
from features being on the same scale.
Formula for Standardization:
z = (x - μ) / σ
where x is an observed value, μ is the mean of the feature, and σ is its
standard deviation.
After standardization:
- The mean becomes 0.
- The standard deviation becomes 1.
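As a quick check, the standardization formula can be applied directly with NumPy. The values below are made-up illustration data, not the dataset used later in this practical:

```python
import numpy as np

# Made-up illustration values
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is also what scikit-learn's StandardScaler uses
z = (x - x.mean()) / x.std()

print(z.mean())  # effectively 0
print(z.std())   # 1.0
```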
Why Standardization is Important:
1. Equal Feature Contribution: In many machine learning models, features with
larger ranges may dominate the model, making it more challenging for the
algorithm to learn from other features. Standardization helps ensure that each
feature contributes equally.
2. Faster Convergence: In optimization algorithms like gradient descent,
standardizing data can speed up convergence since all features will be on a
similar scale.
3. Better Performance: Many algorithms perform better when features are
centered (mean = 0) and on a unit scale (std = 1). Note that standardization
rescales the data but does not change the shape of its distribution.
When to Use Standardization:
- Distance-Based Algorithms: Algorithms like SVM, k-NN, and k-means
clustering are distance-based and work better when all features are on the same
scale.
- Gradient-Based Algorithms: Neural networks and linear/logistic regression
models are commonly trained with gradient descent. Standardization helps these
algorithms converge faster.
- Features with Different Units: If your features are measured in different units
(e.g., height in centimeters and weight in kilograms), standardization helps
bring them to a common scale.
Example
Let’s consider a dataset where we have two features: age and salary. Age
typically ranges from 20 to 60, while salary can range from $30,000 to
$120,000. Due to the large difference in scales, salary will dominate any
distance calculations. Standardizing both features ensures that age and salary
contribute equally to the model.
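To make this dominance effect concrete, here is a small sketch comparing Euclidean distances before and after rescaling. The two people, their ages and salaries, and the per-feature scales are made-up numbers for illustration only:

```python
import numpy as np

# Two hypothetical people as (age, salary) points
a = np.array([25.0, 50000.0])
b = np.array([60.0, 52000.0])

# On the raw scale, the $2,000 salary gap swamps the 35-year age gap
raw_dist = np.linalg.norm(a - b)

# Dividing each feature by an assumed typical spread (stand-ins for the
# fitted standard deviations) lets both features contribute comparably
scale = np.array([10.0, 20000.0])
scaled_dist = np.linalg.norm((a - b) / scale)

print(raw_dist)     # ~2000: dominated almost entirely by salary
print(scaled_dist)  # ~3.5: now dominated by the (large) age difference
```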
Practical Example in Python:
Below is a step-by-step guide to standardizing a dataset using Python and the
`StandardScaler` class from the `scikit-learn` library.
Step 1: Import Required Libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
Step 2: Create a Dataset
Let’s create a small dataset with two features, `age` and `salary`, to demonstrate
the effect of standardization.
# Create a sample dataset
data = {'Age': [22, 25, 47, 52, 46],
        'Salary': [20000, 32000, 45000, 58000, 79000]}
df = pd.DataFrame(data)
Step 3: Apply Standardization
Use `StandardScaler` to standardize the dataset. This will transform each feature
so that the mean is 0 and the standard deviation is 1.
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler to the data and transform it
standardized_data = scaler.fit_transform(df)
# Convert the standardized data back to a DataFrame
standardized_df = pd.DataFrame(standardized_data, columns=['Age', 'Salary'])
print("\nStandardized Dataset:\n", standardized_df)
Step 4: View Results
In this step, you'll observe that the transformed data has a mean of
(effectively) 0. Keep in mind that pandas' `.std()` uses the sample formula
(ddof=1), so it reports roughly 1.118 for n = 5 rather than exactly 1;
`StandardScaler` standardizes to a population standard deviation (ddof=0) of
exactly 1.
# Check mean and std deviation after standardization
mean = standardized_df.mean()
std_dev = standardized_df.std()
print("\nMean after Standardization:\n", mean)
print("\nStandard Deviation after Standardization:\n", std_dev)
Full Code Example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Step 1: Create a sample dataset
data = {'Age': [22, 25, 47, 52, 46],
        'Salary': [20000, 32000, 45000, 58000, 79000]}
df = pd.DataFrame(data)
print("Original Dataset:\n", df)
# Step 2: Initialize the StandardScaler
scaler = StandardScaler()
# Step 3: Fit the scaler to the data and transform it
standardized_data = scaler.fit_transform(df)
# Step 4: Convert the standardized data back to a DataFrame
standardized_df = pd.DataFrame(standardized_data, columns=['Age', 'Salary'])
print("\nStandardized Dataset:\n", standardized_df)
# Step 5: Check mean and std deviation after standardization
mean = standardized_df.mean()
std_dev = standardized_df.std()
print("\nMean after Standardization:\n", mean)
print("\nStandard Deviation after Standardization:\n", std_dev)
Output:
1. Original Dataset:
Age Salary
0 22 20000
1 25 32000
2 47 45000
3 52 58000
4 46 79000
2. Standardized Dataset:
        Age    Salary
0 -1.325688 -1.306835
1 -1.083184 -0.721685
2  0.695178 -0.087773
3  1.099351  0.546140
4  0.614343  1.570153
3. Mean and Standard Deviation after Standardization:
Mean after Standardization:
Age 3.330669e-16
Salary 0.000000e+00
dtype: float64
Standard Deviation after Standardization:
Age 1.118034
Salary 1.118034
dtype: float64
Explanation of Results:
- After transformation, each feature has a mean of 0 and a population standard
deviation of 1. The printed standard deviation is about 1.118 rather than 1
only because pandas' `.std()` defaults to the sample formula (ddof=1), giving
sqrt(5/4) ≈ 1.118034 for n = 5.
- Both `Age` and `Salary` are now on a common scale, ensuring that both
features contribute equally to any machine learning algorithm applied to this
data.
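The ≈ 1.118 figure in the printout can be reproduced without scikit-learn; it is simply the gap between the population and sample standard deviation formulas, sqrt(n/(n-1)) for n = 5, shown here on the `Age` column:

```python
import numpy as np

age = np.array([22, 25, 47, 52, 46], dtype=float)

# Standardize with the population std (ddof=0), as StandardScaler does
z = (age - age.mean()) / age.std()

print(np.std(z))           # population std: exactly 1.0
print(np.std(z, ddof=1))   # sample std: sqrt(5/4) ~ 1.118034, as in the output
```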
Conclusion:
Standardization is a crucial preprocessing step, especially when dealing with
features of different scales. By transforming features to have a mean of 0 and a
standard deviation of 1, it ensures that algorithms sensitive to the magnitude of
input values perform better and converge faster.
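One practical follow-up worth noting: in a real project, the scaler should be fitted on the training data only, and the same fitted statistics reused for any new data. The sketch below uses the dataset from this practical plus a single made-up "unseen" row:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'Age': [22, 25, 47, 52, 46],
                      'Salary': [20000, 32000, 45000, 58000, 79000]})
new = pd.DataFrame({'Age': [30], 'Salary': [40000]})  # hypothetical unseen row

scaler = StandardScaler()
scaler.fit(train)                  # learn mean and std from training data only
train_z = scaler.transform(train)
new_z = scaler.transform(new)      # reuse the same statistics; do not refit
print(new_z)
```

Fitting a fresh scaler on new data would leak its statistics into the pipeline and make the standardized values incomparable with the training set.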