Skewness in Data: Impact and Mitigation Strategies
Skewness is a statistical measure that quantifies the asymmetry of a probability distribution. It indicates
how far the distribution departs from a symmetric shape such as the bell curve of a normal distribution.
Two main types of skewness:
Right-skewed (positive skewness): The tail on the right side of the distribution is longer. This means there are a few
extreme values on the higher end.
Left-skewed (negative skewness): The tail on the left side of the distribution is longer. This indicates a few extreme
values on the lower end.
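The two cases can be illustrated with a minimal sketch on synthetic data (the samples, seed, and variable names below are assumptions for illustration, not part of the wine dataset). Note that scipy.stats.skew uses the biased estimator by default, while pandas' .skew() applies a bias correction, so the two can differ slightly on small samples.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Exponential draws pile up near zero with a long tail to the right.
right_skewed = rng.exponential(scale=1.0, size=10_000)
# Mirroring the sample moves the long tail to the left.
left_skewed = -right_skewed
# Normal draws are symmetric, so their sample skewness is close to zero.
symmetric = rng.normal(size=10_000)

skew_right = skew(right_skewed)
skew_left = skew(left_skewed)
skew_sym = skew(symmetric)

print(f"right-skewed: {skew_right:.2f}")
print(f"left-skewed:  {skew_left:.2f}")
print(f"symmetric:    {skew_sym:.2f}")
```

A positive value signals a longer right tail, a negative value a longer left tail, and a value near zero an approximately symmetric sample.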
Why Remove Skewness?
Removing skewness is often necessary for several reasons:
Assumption Violation: Many statistical methods, especially parametric tests like t-tests and ANOVA, assume approximately
normal data or residuals. Strong skewness violates this assumption and can lead to inaccurate results, particularly in small samples.
Model Performance: Skewed data can impact the performance of machine learning models, especially those sensitive to
outliers or extreme values.
Interpretation: Skewness can make it difficult to interpret the data and draw meaningful conclusions. A skewed distribution
might mask important patterns or relationships.
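Normalization: Standardization (z-scoring) does not require a normal distribution, but it only shifts and rescales the data, so a
skewed feature stays skewed and its z-scores are harder to interpret. Skewness can therefore limit the effectiveness of these techniques.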
Methods for Removing Skewness
There are several methods to address skewness:
Transformation: Applying mathematical functions like logarithmic or square root transformations can often reduce
skewness.
Winsorization: Capping extreme values to a certain percentile.
Trimming: Removing outliers directly.
Non-parametric Tests: Using statistical methods that are less sensitive to skewness.
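Winsorization and trimming from the list above can be sketched as follows. This is a minimal illustration on a synthetic sample; the 1st and 99th percentile cut-offs and all variable names are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Synthetic right-skewed sample (assumption: stands in for a real feature).
values = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# Hypothetical cut-offs: the 1st and 99th percentiles.
low, high = np.percentile(values, [1, 99])

# Winsorization: keep every observation, but cap values outside the cut-offs.
winsorized = np.clip(values, low, high)

# Trimming: drop the observations outside the cut-offs instead.
trimmed = values[(values >= low) & (values <= high)]

print(f"original:   n={values.size}, skew={skew(values):.2f}")
print(f"winsorized: n={winsorized.size}, skew={skew(winsorized):.2f}")
print(f"trimmed:    n={trimmed.size}, skew={skew(trimmed):.2f}")
```

Winsorization preserves the sample size, which matters when rows must stay aligned with other columns; trimming changes the number of rows.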
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.precision', 2)
sns.set(rc={"axes.facecolor":"Beige" , "axes.grid" : False})
import warnings
warnings.filterwarnings("ignore")
# Load the wine dataset (you may need to adjust the path)
wine = pd.read_csv('/kaggle/input/titanic/WineQT.csv')
wine = wine.drop('Id',axis=1)
wine.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9       0.08                 11.0                  34.0      1.0  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6       0.10                 25.0                  67.0      1.0  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3       0.09                 15.0                  54.0      1.0  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9       0.07                 17.0                  60.0      1.0  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9       0.08                 11.0                  34.0      1.0  3.51       0.56      9.4        5
# pandas computes the sample skewness of every numeric column
skewness_values = wine.skew()
# Print the skewness values, largest first
print("Skewness values for each feature:")
skewness_values.sort_values(ascending=False)
Skewness values for each feature:
chlorides 6.03
residual sugar 4.36
sulphates 2.50
total sulfur dioxide 1.67
free sulfur dioxide 1.23
fixed acidity 1.04
alcohol 0.86
volatile acidity 0.68
citric acid 0.37
quality 0.29
pH 0.22
density 0.10
dtype: float64
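To make the list above easier to scan, the features can be grouped by how strongly they are skewed. The sketch below reuses the printed values; the 0.5 and 1.0 thresholds are a common rule of thumb and the skew_level helper is hypothetical, neither comes from the original notebook.

```python
import pandas as pd

# Skewness values copied from the output above.
skewness_values = pd.Series({
    "chlorides": 6.03,
    "residual sugar": 4.36,
    "sulphates": 2.50,
    "total sulfur dioxide": 1.67,
    "free sulfur dioxide": 1.23,
    "fixed acidity": 1.04,
    "alcohol": 0.86,
    "volatile acidity": 0.68,
    "citric acid": 0.37,
    "quality": 0.29,
    "pH": 0.22,
    "density": 0.10,
})


def skew_level(value, moderate=0.5, high=1.0):
    """Label a skewness value using rule-of-thumb thresholds (assumed, not fixed)."""
    magnitude = abs(value)
    if magnitude >= high:
        return "high"
    if magnitude >= moderate:
        return "moderate"
    return "low"


levels = skewness_values.apply(skew_level)
print(levels)
```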
df_num = wine.select_dtypes(include='number')
df_cat = wine.select_dtypes(include=['object','category'])
# 12 numeric columns, so use a 3x4 grid (a 3x3 grid would silently drop three of them)
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(20, 12), sharex=False, sharey=False)
axes = axes.ravel()
cols = df_num.columns
for col, ax in zip(cols, axes):
    # fill=True replaces the deprecated shade=True in recent seaborn versions
    sns.kdeplot(data=df_num, x=col, fill=True, ax=ax)
    ax.set(title=f'Distribution of Variable: {col}', xlabel=None)
fig.tight_layout()
plt.show()
Transformations
Log Transformation: Useful for reducing positive skewness; requires positive values (log1p, i.e. log(1 + x), also accepts zeros).
Square Root Transformation: Can help in reducing skewness, especially for non-negative data with moderate skew.
Box-Cox Transformation: A more flexible power transformation that can handle both positive and negative skewness, but only for strictly positive data.
Yeo-Johnson Transformation: Similar to Box-Cox but can be applied to data with zero or negative values.
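The difference between the last two transformations shows up as soon as the data contains non-positive values. The following sketch uses a synthetic sample (an assumption for illustration) to show that scipy.stats.boxcox rejects such data while scipy.stats.yeojohnson accepts it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic right-skewed sample, shifted so that it contains negative values.
data = rng.exponential(scale=2.0, size=2_000) - 1.0

# Box-Cox is only defined for strictly positive input.
try:
    stats.boxcox(data)
    boxcox_failed = False
except ValueError as err:
    boxcox_failed = True
    print(f"Box-Cox failed: {err}")

# Yeo-Johnson accepts zeros and negatives; lam is the fitted power parameter.
transformed, lam = stats.yeojohnson(data)
print(f"Yeo-Johnson lambda: {lam:.3f}")
print(f"Skewness before: {stats.skew(data):.2f}, after: {stats.skew(transformed):.2f}")
```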
Skewness Removal
%%time
from scipy import stats
# Function to plot histograms before and after skewness removal
def plot_before_after(data, column, transformed_data, method):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 5))
    sns.histplot(data[column], kde=True, ax=ax1)
    ax1.set_title(f'Before: {column}')
    sns.histplot(transformed_data, kde=True, ax=ax2)
    ax2.set_title(f'After: {column} ({method})')
    plt.tight_layout()
    plt.show()
# Function to remove skewness using different methods
def remove_skewness(data, column):
    # Original data
    original = data[column]
    # 1. Log transformation (log1p = log(1 + x), so zeros are allowed)
    log_transformed = np.log1p(original)
    plot_before_after(data, column, log_transformed, 'Log')
    # 2. Square root transformation (requires non-negative values)
    sqrt_transformed = np.sqrt(original)
    plot_before_after(data, column, sqrt_transformed, 'Square Root')
    # 3. Box-Cox transformation
    # Adding 1 to handle zero values; for non-negative data this coincides with
    # Yeo-Johnson, which is why both report the same skewness below
    boxcox_transformed, _ = stats.boxcox(original + 1)
    plot_before_after(data, column, boxcox_transformed, 'Box-Cox')
    # 4. Yeo-Johnson transformation
    yeojohnson_transformed, _ = stats.yeojohnson(original)
    plot_before_after(data, column, yeojohnson_transformed, 'Yeo-Johnson')
    print(f"\033[033m\033[1m")
    print(f"Skewness before transformation: {original.skew():.2f}")
    print(f"Skewness after Log transformation: {log_transformed.skew():.2f}")
    print(f"Skewness after Square Root transformation: {sqrt_transformed.skew():.2f}")
    print(f"Skewness after Box-Cox transformation: {pd.Series(boxcox_transformed).skew():.2f}")
    print(f"Skewness after Yeo-Johnson transformation: {pd.Series(yeojohnson_transformed).skew():.2f}")
    print(f"\033[0m") # Reset text color
# Apply skewness removal to all numeric columns
numeric_columns = wine.select_dtypes(include=[np.number]).columns
for column in numeric_columns:
    print(f"\nProcessing column: {column}")
    remove_skewness(wine, column)
Processing column: fixed acidity
Skewness before transformation: 1.04
Skewness after Log transformation: 0.48
Skewness after Square Root transformation: 0.73
Skewness after Box-Cox transformation: -0.00
Skewness after Yeo-Johnson transformation: -0.00
Processing column: volatile acidity
Skewness before transformation: 0.68
Skewness after Log transformation: 0.27
Skewness after Square Root transformation: 0.11
Skewness after Box-Cox transformation: 0.00
Skewness after Yeo-Johnson transformation: 0.00
Processing column: citric acid
Skewness before transformation: 0.37
Skewness after Log transformation: 0.13
Skewness after Square Root transformation: -0.50
Skewness after Box-Cox transformation: 0.03
Skewness after Yeo-Johnson transformation: 0.03
Processing column: residual sugar
Skewness before transformation: 4.36
Skewness after Log transformation: 2.16
Skewness after Square Root transformation: 2.81
Skewness after Box-Cox transformation: 0.01
Skewness after Yeo-Johnson transformation: 0.01
Processing column: chlorides
Skewness before transformation: 6.03
Skewness after Log transformation: 5.35
Skewness after Square Root transformation: 3.85
Skewness after Box-Cox transformation: -0.21
Skewness after Yeo-Johnson transformation: -0.21
Processing column: free sulfur dioxide
Skewness before transformation: 1.23
Skewness after Log transformation: -0.09
Skewness after Square Root transformation: 0.48
Skewness after Box-Cox transformation: -0.01
Skewness after Yeo-Johnson transformation: -0.01
Processing column: total sulfur dioxide
Skewness before transformation: 1.67
Skewness after Log transformation: 0.00
Skewness after Square Root transformation: 0.69
Skewness after Box-Cox transformation: 0.00
Skewness after Yeo-Johnson transformation: 0.00
Processing column: density
Skewness before transformation: 0.10
Skewness after Log transformation: 0.10
Skewness after Square Root transformation: 0.10
Skewness after Box-Cox transformation: 0.00
Skewness after Yeo-Johnson transformation: 0.00
Processing column: pH
Skewness before transformation: 0.22
Skewness after Log transformation: 0.07
Skewness after Square Root transformation: 0.12
Skewness after Box-Cox transformation: -0.00
Skewness after Yeo-Johnson transformation: -0.00
Processing column: sulphates
Skewness before transformation: 2.50
Skewness after Log transformation: 1.68
Skewness after Square Root transformation: 1.62
Skewness after Box-Cox transformation: 0.01
Skewness after Yeo-Johnson transformation: 0.01
Processing column: alcohol
Skewness before transformation: 0.86
Skewness after Log transformation: 0.68
Skewness after Square Root transformation: 0.76
Skewness after Box-Cox transformation: 0.11
Skewness after Yeo-Johnson transformation: 0.11
Processing column: quality
Skewness before transformation: 0.29
Skewness after Log transformation: -0.17
Skewness after Square Root transformation: 0.03
Skewness after Box-Cox transformation: 0.01
Skewness after Yeo-Johnson transformation: 0.01
CPU times: user 1min 40s, sys: 11.5 s, total: 1min 51s
Wall time: 1min 3s
Customized Skewness Removal
%%time
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
wine = pd.read_csv('/kaggle/input/titanic/WineQT.csv')
wine = wine.drop('Id',axis=1)
display(wine.head(2))
# Function to plot histograms before and after skewness removal
def plot_before_after(data, column, transformed_data, method):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 5))
    sns.histplot(data[column], kde=True, ax=ax1)
    ax1.set_title(f'Before: {column}')
    sns.histplot(transformed_data, kde=True, ax=ax2)
    ax2.set_title(f'After: {column} ({method})')
    plt.tight_layout()
    plt.show()
# Function to remove skewness using different methods
def remove_skewness(data, column):
    # Original data
    original = data[column]
    # 1. Log transformation
    log_transformed = np.log1p(original)
    plot_before_after(data, column, log_transformed, 'Log')
    # 2. Square root transformation
    sqrt_transformed = np.sqrt(original)
    plot_before_after(data, column, sqrt_transformed, 'Square Root')
    # 3. Box-Cox transformation
    boxcox_transformed, _ = stats.boxcox(original + 1) # Adding 1 to handle zero values
    plot_before_after(data, column, boxcox_transformed, 'Box-Cox')
    # 4. Yeo-Johnson transformation
    yeojohnson_transformed, _ = stats.yeojohnson(original)
    plot_before_after(data, column, yeojohnson_transformed, 'Yeo-Johnson')
    print(f"\033[031m\033[1m")
    print(f"Skewness before transformation: {original.skew():.2f}")
    print(f"Skewness after Log transformation: {log_transformed.skew():.2f}")
    print(f"Skewness after Square Root transformation: {sqrt_transformed.skew():.2f}")
    print(f"Skewness after Box-Cox transformation: {pd.Series(boxcox_transformed).skew():.2f}")
    print(f"Skewness after Yeo-Johnson transformation: {pd.Series(yeojohnson_transformed).skew():.2f}")
    print(f"\033[0m") # Reset text color
# Apply skewness removal to all numeric columns except 'quality'
numeric_columns = wine.select_dtypes(include=[np.number]).columns
for column in numeric_columns:
    if column != 'quality':
        print(f"\nProcessing column: {column}")
        remove_skewness(wine, column)
    else:
        print(f"\nSkipping 'quality' column as requested.")
# Display information about the 'quality' column
print(f"\n\033[034m\033[1mInformation about 'quality' column:\033[0m")
print(f"Skewness of 'quality': {wine['quality'].skew():.2f}")
print("\nDistribution of 'quality':")
plt.figure(figsize=(10, 5))
sns.histplot(wine['quality'], kde=True)
plt.title("Distribution of 'quality'")
plt.show()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70          0.0             1.9       0.08                 11.0                  34.0      1.0  3.51       0.56      9.4        5
1            7.8              0.88          0.0             2.6       0.10                 25.0                  67.0      1.0  3.20       0.68      9.8        5
Processing column: fixed acidity
Skewness before transformation: 1.04
Skewness after Log transformation: 0.48
Skewness after Square Root transformation: 0.73
Skewness after Box-Cox transformation: -0.00
Skewness after Yeo-Johnson transformation: -0.00
Processing column: volatile acidity
Skewness before transformation: 0.68
Skewness after Log transformation: 0.27
Skewness after Square Root transformation: 0.11
Skewness after Box-Cox transformation: 0.00
Skewness after Yeo-Johnson transformation: 0.00
Processing column: citric acid
Skewness before transformation: 0.37
Skewness after Log transformation: 0.13
Skewness after Square Root transformation: -0.50
Skewness after Box-Cox transformation: 0.03
Skewness after Yeo-Johnson transformation: 0.03
Processing column: residual sugar
Skewness before transformation: 4.36
Skewness after Log transformation: 2.16
Skewness after Square Root transformation: 2.81
Skewness after Box-Cox transformation: 0.01
Skewness after Yeo-Johnson transformation: 0.01
Processing column: chlorides
Skewness before transformation: 6.03
Skewness after Log transformation: 5.35
Skewness after Square Root transformation: 3.85
Skewness after Box-Cox transformation: -0.21
Skewness after Yeo-Johnson transformation: -0.21
Processing column: free sulfur dioxide
Skewness before transformation: 1.23
Skewness after Log transformation: -0.09
Skewness after Square Root transformation: 0.48
Skewness after Box-Cox transformation: -0.01
Skewness after Yeo-Johnson transformation: -0.01
Processing column: total sulfur dioxide
Skewness before transformation: 1.67
Skewness after Log transformation: 0.00
Skewness after Square Root transformation: 0.69
Skewness after Box-Cox transformation: 0.00
Skewness after Yeo-Johnson transformation: 0.00
Processing column: density
Skewness before transformation: 0.10
Skewness after Log transformation: 0.10
Skewness after Square Root transformation: 0.10
Skewness after Box-Cox transformation: 0.00
Skewness after Yeo-Johnson transformation: 0.00
Processing column: pH
Skewness before transformation: 0.22
Skewness after Log transformation: 0.07
Skewness after Square Root transformation: 0.12
Skewness after Box-Cox transformation: -0.00
Skewness after Yeo-Johnson transformation: -0.00
Processing column: sulphates
Skewness before transformation: 2.50
Skewness after Log transformation: 1.68
Skewness after Square Root transformation: 1.62
Skewness after Box-Cox transformation: 0.01
Skewness after Yeo-Johnson transformation: 0.01
Processing column: alcohol
Skewness before transformation: 0.86
Skewness after Log transformation: 0.68
Skewness after Square Root transformation: 0.76
Skewness after Box-Cox transformation: 0.11
Skewness after Yeo-Johnson transformation: 0.11
Skipping 'quality' column as requested.
Information about 'quality' column:
Skewness of 'quality': 0.29
Distribution of 'quality':
CPU times: user 1min 33s, sys: 10.8 s, total: 1min 44s
Wall time: 59 s
Transformations for Skewed Data
Skewness, a measure of asymmetry in a probability distribution, can significantly impact statistical analyses. To address this,
various transformations can be applied to the data. The appropriate transformation depends on the direction of the skewness
(right or left).
Right-Skewed Data (Positive Skewness)
When the tail on the right side of the distribution is longer, indicating a few extreme values on the higher end, the data is said to
be right-skewed or positively skewed. Here are common transformations for right-skewed data:
Logarithmic Transformation: This is often the first choice for right-skewed data, provided all values are strictly positive. The logarithm
compresses large values more than small ones, pulling in the right tail and bringing the distribution closer to normal. Example: log(x)
Square Root Transformation: A milder approach compared to logarithmic, suitable for moderately skewed data. It can be useful
when the data includes zeros. Example: sqrt(x)
Cube Root Transformation: Stronger than the square root but milder than the logarithm, and also defined for zero and negative values. Example: x^(1/3)
Box-Cox Transformation: A family of power transformations that estimates the power that makes the data most nearly normal, which
usually reduces skewness. It is more flexible than a fixed power, but requires strictly positive data. Example: (x^lambda - 1) / lambda
for lambda != 0, and log(x) for lambda = 0 (where lambda is a parameter estimated from the data)
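The Box-Cox formula can be checked directly against SciPy. The sketch below implements the formula by hand (boxcox_manual is a hypothetical helper written for this illustration) and compares it with scipy.stats.boxcox called with a fixed lmbda, including the lambda = 0 case, where the transformation is defined as log(x).

```python
import numpy as np
from scipy import stats


def boxcox_manual(x, lam):
    """Box-Cox transform for strictly positive x and a given lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        # Limit of (x**lam - 1) / lam as lam approaches 0.
        return np.log(x)
    return (x**lam - 1.0) / lam


x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

manual_half = boxcox_manual(x, 0.5)
scipy_half = stats.boxcox(x, lmbda=0.5)  # fixed lambda: returns only the transformed array

manual_zero = boxcox_manual(x, 0.0)
scipy_zero = stats.boxcox(x, lmbda=0.0)

print(np.allclose(manual_half, scipy_half))
print(np.allclose(manual_zero, scipy_zero))
```

When lmbda is omitted, scipy.stats.boxcox instead estimates lambda by maximum likelihood and returns it alongside the transformed data.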
Left-Skewed Data (Negative Skewness)
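When the tail on the left side of the distribution is longer, indicating a few extreme values on the lower end, the data is said to
be left-skewed or negatively skewed. In such cases, the transformations can be applied to the reciprocal of the data, provided all values are strictly positive:
Reciprocal Logarithmic Transformation: log(1/x), which equals -log(x)
Reciprocal Square Root Transformation: sqrt(1/x)
Reciprocal Cube Root Transformation: (1/x)^(1/3)
Because the reciprocal reverses the order of the values, these transformations turn the left tail into a right tail and can overcorrect,
so check the skewness of the result. A common alternative is to reflect the data first (for example max(x) + 1 - x) and then apply one of
the right-skew transformations above.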
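For left-skewed data it is worth verifying what a transformation actually does to the skewness. The sketch below uses a synthetic left-skewed sample (an assumption for illustration) and compares log(1/x) with a reflect-then-log approach. Reflection is an assumption of this sketch rather than one of the listed transformations.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
# Synthetic left-skewed, strictly positive sample: a constant minus a lognormal draw.
x = 20.0 - rng.lognormal(mean=0.0, sigma=0.5, size=5_000)

# log(1/x) is the same as -log(x), so it mirrors the skewness of log(x).
skew_log = skew(np.log(x))
skew_recip_log = skew(np.log(1.0 / x))

# Reflection: turn the left tail into a right tail, then apply a right-skew transform.
reflected = x.max() + 1.0 - x  # smallest reflected value is exactly 1
reflect_log = np.log(reflected)

skew_original = skew(x)
skew_reflect_log = skew(reflect_log)

print(f"original:        {skew_original:.2f}")
print(f"log(x):          {skew_log:.2f}")
print(f"log(1/x):        {skew_recip_log:.2f}")
print(f"log(reflected):  {skew_reflect_log:.2f}")
```

Note that reflection reverses the ordering of the values, which matters when interpreting coefficients or correlations computed on the transformed feature.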
Choosing the Right Transformation:
Visual Inspection: Always visualize your data (e.g., using histograms) to understand the nature of skewness and assess the
effectiveness of different transformations.
Domain Knowledge: Consider the meaning of the variable and whether transformations are appropriate in context.
Desired Outcome: Do you need a specific distribution (e.g., normal)?
Impact on Other Variables: Be mindful of how transformations might affect relationships with other variables.
Note: While these transformations are common, there might be other suitable transformations depending on the specific
characteristics of your data. It's often a good practice to try different transformations and evaluate their impact on the
skewness and overall distribution.
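The advice to try several transformations and compare them can be sketched as a small helper. This is one possible approach under stated assumptions: compare_transformations is a hypothetical function, the sample is synthetic, and the candidate set assumes non-negative input (log1p and sqrt are not defined for negative values).

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_transformations(series):
    """Return the skewness of a non-negative series under several candidate transformations."""
    yeo_johnson, _ = stats.yeojohnson(series.to_numpy())
    candidates = {
        "none": series,
        "log1p": np.log1p(series),
        "sqrt": np.sqrt(series),
        "cbrt": np.cbrt(series),
        "yeo-johnson": pd.Series(yeo_johnson, index=series.index),
    }
    return pd.Series({name: values.skew() for name, values in candidates.items()})


rng = np.random.default_rng(4)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=3_000))  # synthetic right-skewed data

skews = compare_transformations(sample)
best = skews.abs().idxmin()  # candidate whose skewness is closest to zero

print(skews.round(2))
print(f"Closest to symmetric: {best}")
```

The smallest absolute skewness is only one criterion; visual inspection and domain meaning, as listed above, should still inform the final choice.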
Skewness Removal with Yeo-Johnson transformation
%%time
from scipy import stats
wine = pd.read_csv('/kaggle/input/titanic/WineQT.csv')
wine = wine.drop('Id', axis=1)
display(wine.head(2))
def plot_before_after(data, column, transformed_data, method):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 5))
    sns.histplot(data[column], kde=True, ax=ax1)
    ax1.set_title(f'Before: {column}')
    sns.histplot(transformed_data, kde=True, ax=ax2)
    ax2.set_title(f'After: {column} ({method})')
    plt.tight_layout()
    plt.show()
# Function to remove skewness using Yeo-Johnson transformation
def remove_skewness(data, column):
    original = data[column]
    transformed, _ = stats.yeojohnson(original)
    plot_before_after(data, column, transformed, 'Yeo-Johnson')
    print(f"\033[035m\033[1m")
    print(f"Skewness before transformation: {original.skew():.2f}")
    print(f"Skewness after Yeo-Johnson transformation: {pd.Series(transformed).skew():.2f}")
    print(f"\033[0m") # Reset text color
    return transformed
# Create a new DataFrame to store the transformed data
wine_transformed = wine.copy()
# Apply skewness removal to all numeric columns except 'quality'
numeric_columns = wine.select_dtypes(include=[np.number]).columns
for column in numeric_columns:
    if column != 'quality':
        print(f"\nProcessing column: {column}")
        wine_transformed[column] = remove_skewness(wine, column)
    else:
        print(f"\nSkipping 'quality' column as requested.")
# Display information about the 'quality' column
print(f"\n\033[032m\033[1mInformation about 'quality' column:\033[0m")
print(f"Skewness of 'quality': {wine['quality'].skew():.2f}")
print("\nDistribution of 'quality':")
plt.figure(figsize=(10, 5))
sns.histplot(wine['quality'], kde=True)
plt.title("Distribution of 'quality'")
plt.show()
# Display the head of the transformed DataFrame
print("\nHead of the transformed DataFrame:")
display(wine_transformed.head())
# Compare skewness before and after transformation
print("\nSkewness comparison:")
skewness_before = wine.select_dtypes(include=[np.number]).skew()
skewness_after = wine_transformed.select_dtypes(include=[np.number]).skew()
skewness_comparison = pd.DataFrame({'Before': skewness_before, 'After': skewness_after})
display(skewness_comparison)
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70          0.0             1.9       0.08                 11.0                  34.0      1.0  3.51       0.56      9.4        5
1            7.8              0.88          0.0             2.6       0.10                 25.0                  67.0      1.0  3.20       0.68      9.8        5
Processing column: fixed acidity
Skewness before transformation: 1.04
Skewness after Yeo-Johnson transformation: -0.00
Processing column: volatile acidity
Skewness before transformation: 0.68
Skewness after Yeo-Johnson transformation: 0.00
Processing column: citric acid
Skewness before transformation: 0.37
Skewness after Yeo-Johnson transformation: 0.03
Processing column: residual sugar
Skewness before transformation: 4.36
Skewness after Yeo-Johnson transformation: 0.01
Processing column: chlorides
Skewness before transformation: 6.03
Skewness after Yeo-Johnson transformation: -0.21
Processing column: free sulfur dioxide
Skewness before transformation: 1.23
Skewness after Yeo-Johnson transformation: -0.01
Processing column: total sulfur dioxide
Skewness before transformation: 1.67
Skewness after Yeo-Johnson transformation: 0.00
Processing column: density
Skewness before transformation: 0.10
Skewness after Yeo-Johnson transformation: 0.00
Processing column: pH
Skewness before transformation: 0.22
Skewness after Yeo-Johnson transformation: -0.00
Processing column: sulphates
Skewness before transformation: 2.50
Skewness after Yeo-Johnson transformation: 0.01
Processing column: alcohol
Skewness before transformation: 0.86
Skewness after Yeo-Johnson transformation: 0.11
Skipping 'quality' column as requested.
Information about 'quality' column:
Skewness of 'quality': 0.29
Distribution of 'quality':
Head of the transformed DataFrame:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0           0.96              0.44         0.00            0.43       0.04                 2.70                  3.54     0.04  1.07       0.20     0.27        5
1           0.97              0.50         0.00            0.45       0.05                 3.64                  4.20     0.04  1.03       0.21     0.27        5
2           0.97              0.46         0.04            0.44       0.04                 3.05                  3.99     0.04  1.04       0.21     0.27        5
3           1.01              0.23         0.40            0.43       0.04                 3.19                  4.10     0.04  1.03       0.20     0.27        6
4           0.96              0.44         0.00            0.43       0.04                 2.70                  3.54     0.04  1.07       0.20     0.27        5
Skewness comparison:
Before After
fixed acidity 1.04 -4.78e-03
volatile acidity 0.68 2.16e-03
citric acid 0.37 2.63e-02
residual sugar 4.36 7.58e-03
chlorides 6.03 -2.08e-01
free sulfur dioxide 1.23 -9.34e-03
total sulfur dioxide 1.67 2.77e-04
density 0.10 0.00e+00
pH 0.22 -4.37e-03
sulphates 2.50 5.71e-03
alcohol 0.86 1.10e-01
quality 0.29 2.87e-01
CPU times: user 23.9 s, sys: 2.86 s, total: 26.7 s
Wall time: 14.9 s
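The Yeo-Johnson cell above discards the fitted lambda (the second return value of stats.yeojohnson). If the same transformation has to be applied later to new observations, one option is to keep lambda and pass it back through the lmbda argument. The sketch below uses synthetic data, and the train/new split is a hypothetical scenario rather than part of the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical split: data seen when fitting, and observations that arrive later.
train = rng.lognormal(mean=0.0, sigma=0.8, size=1_000)
new = rng.lognormal(mean=0.0, sigma=0.8, size=200)

# Fit: with lmbda omitted, SciPy estimates lambda and returns it with the result.
train_transformed, lam = stats.yeojohnson(train)

# Reuse: with lmbda given, SciPy applies that fixed lambda and returns only the array.
new_transformed = stats.yeojohnson(new, lmbda=lam)

# Sanity check: re-applying the stored lambda reproduces the fitted transform.
train_again = stats.yeojohnson(train, lmbda=lam)

print(f"fitted lambda: {lam:.3f}")
print(f"skew of new data before: {stats.skew(new):.2f}, after: {stats.skew(new_transformed):.2f}")
```

Reusing one lambda keeps old and new data on the same scale; refitting on each batch would produce values that are not directly comparable.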
https://www.kaggle.com/code/pythonafroz/skewness-in-data-impact-and-mitigation-strategies