
A GUIDE TO

FEATURE ENGINEERING

CONTENTS
FEATURE TRANSFORMATION
FEATURE CONSTRUCTION
FEATURE SELECTION
FEATURE EXTRACTION
FEATURE TRANSFORMATION
1. Missing Data Imputation
What is Missing Data Imputation?

Missing data imputation is the process of replacing missing values in a dataset with substituted values to
enable analysis and modelling without bias or loss of information.

Why is it Important?

• Many machine learning algorithms cannot handle missing values directly.


• Preserves dataset size and statistical properties.
• Prevents biased or incorrect model results.

Common Techniques for Missing Data Imputation

• Mean Imputation: replace missing values with the mean of the feature. Suitable for numerical, symmetric data. Advantages: simple, fast. Disadvantages: affects variance, biased for skewed data.
• Median Imputation: replace missing values with the median. Suitable for numerical, skewed data. Advantages: robust to outliers. Disadvantages: may reduce data variability.
• Mode Imputation: replace with the most frequent value. Suitable for categorical data. Advantages: preserves category information. Disadvantages: not suitable for numerical data.
• Constant Imputation: replace with a fixed constant (e.g., -9999, "Missing"). Suitable for numerical or categorical data. Advantages: easy to implement and tracks missingness. Disadvantages: can mislead models if the encoding is not handled.
• Forward Fill: replace a missing value with the previous known value (time-series data). Suitable for time-series. Advantages: maintains continuity. Disadvantages: can propagate wrong values.
• Backward Fill: replace a missing value with the next known value (time-series data). Suitable for time-series. Advantages: uses future known data. Disadvantages: may cause data leakage.
• Interpolation: estimate missing values by interpolating between known data points (linear, polynomial, spline). Suitable for time-series and ordered data. Advantages: smooth, follows data trends. Disadvantages: assumes continuity; inaccurate for abrupt changes.
• K-Nearest Neighbors (KNN) Imputation: uses the nearest neighbors' values to estimate missing points. Suitable for numerical datasets. Advantages: preserves local data structure. Disadvantages: computationally expensive.
• Multiple Imputation by Chained Equations (MICE): iterative model-based imputation using the other features. Suitable for multivariate data. Advantages: captures complex relationships. Disadvantages: complex and time-consuming.
• Random Imputation: randomly picks observed values to fill missing spots. Suitable for any feature. Advantages: maintains the feature distribution. Disadvantages: adds randomness, less reproducible.

Key Considerations
• Identify missingness type: MCAR (Missing Completely at Random), MAR (Missing at Random), MNAR
(Missing Not at Random).
• Avoid imputing before splitting dataset to prevent data leakage.
• For time-series, validate stationarity before using forward/backward fill or interpolation.
• Use domain knowledge to choose the best imputation method.
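
As a minimal sketch of how a few of these techniques look in practice (the column names and values below are hypothetical), median and KNN imputation can be applied with scikit-learn's imputers, fitted on the training split only so that no test-set information leaks in:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan, 33],
    "income": [40000, 52000, np.nan, 61000, 58000, np.nan],
})
train, test = df.iloc[:4], df.iloc[4:]

# Median imputation: fit on the training split only to avoid data leakage
median_imputer = SimpleImputer(strategy="median")
train_median = median_imputer.fit_transform(train)
test_median = median_imputer.transform(test)

# KNN imputation: each missing value is estimated from the k most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
train_knn = knn_imputer.fit_transform(train)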

2. Feature Scaling
What is Feature Scaling?

Feature scaling adjusts the range and distribution of numeric features to a standard scale, improving model
performance and convergence.

Why Scale Features?

• Ensures features contribute equally in distance-based models.


• Speeds up gradient-based optimization.
• Prevents dominance of features with large magnitude.

Common Feature Scaling Methods

• Min-Max Scaling: X_scaled = (X - X_min) / (X_max - X_min). Output range: [0, 1]. Suitable for neural nets and image data. Advantages: preserves the shape of the original distribution. Disadvantages: sensitive to outliers.
• Standardization (Z-Score): Z = (X - μ) / σ. Output range: unbounded. Suitable for most ML models. Advantages: centers data to zero mean and unit variance. Disadvantages: skewed data may affect performance.
• Robust Scaling: X_scaled = (X - median(X)) / IQR(X). Output range: unbounded. Suitable for data with outliers. Advantages: resistant to outliers. Disadvantages: less interpretable scaling.
• MaxAbs Scaling: X_scaled = X / max(|X|). Output range: [-1, 1]. Suitable for sparse data. Advantages: preserves sparsity. Disadvantages: sensitive to large maximum values.
• Unit Vector Scaling: X_scaled = X / ||X|| (each sample divided by its vector norm). Output: unit norm. Suitable for text data and clustering. Advantages: useful for direction-based similarity. Disadvantages: alters the shape of the data distribution.
• L1 Normalization: X_scaled = X / ||X||_1 = X / Σ|x_i|. Output: unit sum of absolute values. Suitable for NLP and sparse vectors. Advantages: useful for sparse datasets. Disadvantages: sensitive to large values.
• Mean Normalization: X_scaled = (X - μ) / (X_max - X_min). Output range: approximately [-1, 1]. Suitable for PCA and regression. Advantages: centers and scales data. Disadvantages: still sensitive to outliers.

Best Practices for Scaling

• Always apply scaling after train-test split.


• Apply scaling only to numerical features, never to categorical.
• Use Standardization for most models unless data contains extreme outliers.
• Use Robust scaling if dataset has many outliers.
• Use Min-Max scaling for image pixel data or neural network inputs.
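
A minimal sketch of this workflow, assuming a small hypothetical feature matrix: split first, fit the scaler on the training portion, and reuse the fitted statistics on the test portion.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical numeric feature matrix with one large-magnitude column
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0],
              [4.0, 5000.0], [5.0, 450.0], [6.0, 380.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# Split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

scaler = StandardScaler()                 # swap in RobustScaler() when outliers dominate
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics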

3. Encoding Categorical Values


Definition

Encoding categorical features is the process of converting categorical data (non-numeric labels) into a numeric
format that can be used by machine learning algorithms.

Why Encoding is Needed

• Most ML algorithms require numerical input.


• Categorical data must be transformed to retain meaningful information.
• Proper encoding preserves relationships and avoids introducing bias.

Types of Categorical Variables

1. Nominal: Categories with no intrinsic order (e.g., color: red, blue, green).
2. Ordinal: Categories with a defined order (e.g., size: small, medium, large).

Encoding Techniques

• Label Encoding: assigns each category a unique integer value. Suitable for ordinal (and sometimes nominal) data. Pros: simple; preserves order for ordinal data. Cons: can mislead the model if the implied order is not meaningful (nominal data).
• One-Hot Encoding: creates a binary column (0 or 1) for each category. Suitable for nominal data. Pros: no ordinal assumptions, widely used. Cons: can create high dimensionality.
• Ordinal Encoding: maps categories to integers based on a known order. Suitable for ordinal data. Pros: preserves order. Cons: requires a known category order.
• Binary Encoding: converts categories to binary code, reducing dimensionality compared to one-hot. Suitable for high-cardinality nominal data. Pros: more compact than one-hot. Cons: slightly more complex, less interpretable.
• Frequency Encoding: replaces categories with their frequency/count in the data. Suitable for nominal data. Pros: simple, captures category prevalence. Cons: may mislead if frequency is unrelated to the target.
• Target Encoding: replaces categories with the mean of the target variable for each category. Suitable for nominal data. Pros: can improve model performance. Cons: risk of target leakage, requires careful validation.
• Hashing Encoding: applies a hash function to map categories to a fixed number of columns (hash buckets). Suitable for very high cardinality. Pros: scalable, fixed dimension. Cons: possible collisions, loss of interpretability.

Practical Tips

• For nominal features with few categories, use One-Hot Encoding.


• For ordinal features, use Ordinal Encoding with proper order.
• For high cardinality features (many unique categories), prefer Binary Encoding, Target Encoding, or
Hashing.
• Avoid target leakage with target encoding by applying it only on training data or using cross-validation
schemes.
• After encoding, check for multicollinearity and dimensionality issues.
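
The sketch below shows one-hot and ordinal encoding with scikit-learn on a hypothetical two-column frame; the sparse_output argument assumes a recent scikit-learn version (1.2 or later).

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical frame with one nominal and one ordinal feature
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size":  ["small", "large", "medium", "small"],
})

# One-hot encode the nominal feature (no order assumed)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
color_encoded = ohe.fit_transform(df[["color"]])

# Ordinal encode the ordered feature with an explicit category order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ord_enc.fit_transform(df[["size"]])
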
4. Outlier Detection
Definition

Outliers are data points that significantly deviate from the majority of the dataset. Outlier detection identifies
these anomalous points which may indicate errors, variability, or novel insights.

Why Detect Outliers?

• Outliers can distort statistical analyses and machine learning models.


• They may indicate data quality issues, fraud, or rare events.
• Removing or treating outliers can improve model accuracy and robustness.

Types of Outliers

1. Point Outliers: Single data points far from others.


2. Contextual Outliers: Data points anomalous in a specific context (e.g., time-series).
3. Collective Outliers: A group of data points anomalous together.

Outlier Detection Techniques

• Z-Score Method: measures how many standard deviations a point is from the mean. Use case: normally distributed data. Pros: simple, effective for Gaussian data. Cons: sensitive to non-normal data and extreme values.
• Interquartile Range (IQR): flags points more than 1.5*IQR below Q1 or above Q3. Use case: skewed data. Pros: robust to non-normal data. Cons: may miss outliers in multimodal data.
• Winsorization: limits extreme values by capping them at specific percentiles (e.g., the 5th and 95th percentiles). Use case: when you want to reduce impact without removal. Pros: preserves data size, reduces the effect of extremes. Cons: alters original values, may hide true outliers.
• Visualization: boxplots, scatter plots, and histograms to visually spot outliers. Use case: exploratory analysis. Pros: intuitive and quick. Cons: subjective and manual.
• Distance-based Methods: identify points far from their neighbours (e.g., k-NN distance, Mahalanobis distance). Use case: multidimensional data. Pros: capture complex relationships. Cons: computationally expensive for large data.
• Density-based Methods: detect points in low-density regions (e.g., Local Outlier Factor (LOF)). Use case: high-dimensional data. Pros: detect local anomalies. Cons: parameter sensitive.
• Clustering-based Methods: points that do not belong well to any cluster are considered outliers (e.g., DBSCAN). Use case: unlabeled data. Pros: unsupervised, capture complex patterns. Cons: depend on clustering quality.
• Model-based Methods: use predictive models to detect anomalies (e.g., Isolation Forest, One-Class SVM). Use case: large and complex datasets. Pros: scalable, effective for various data types. Cons: require tuning and expertise.

Best Practices

• Understand data distribution before choosing method.


• Visualize data to get insights.
• Use robust methods for skewed or high-dimensional data.
• Consider domain knowledge to decide on outlier treatment.
• Decide whether to remove, transform, or keep outliers depending on context.

Handling Outliers

• Removal: Delete outliers if they are errors or irrelevant.


• Transformation: Apply log, square root, or winsorization to reduce impact.
• Imputation: Replace outliers with mean, median, or other estimates.
• Separate Modelling: Treat outliers as a separate class or anomaly.
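
As a rough illustration on a hypothetical series with one extreme value, the IQR rule and winsorization can be written in a few lines of pandas:

import pandas as pd

# Hypothetical skewed feature with one extreme value
s = pd.Series([12, 15, 14, 13, 16, 15, 14, 250])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Winsorization: cap values at the 5th and 95th percentiles instead of removing them
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))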

5. Mathematical Feature Transformation


Definition

Mathematical Feature Transformation refers to applying mathematical functions or operations to raw features to improve model performance by making data more suitable for learning. Transformations help in handling skewed data, stabilizing variance, and revealing hidden relationships.

Why Use Mathematical Transformations?

• To normalize data distributions


• To reduce skewness and make data more symmetric
• To stabilize variance across features
• To improve linearity between features and target variables
• To enhance model accuracy and convergence speed

Common Mathematical Transformations

• Log Transformation: X' = log(X + c), c ≥ 0. Use case: right-skewed data, positive values only. Effect: compresses large values, reduces skewness.
• Square Root: X' = √X. Use case: moderately skewed data, positive values. Effect: reduces skewness, less aggressive than log.
• Cube Root: X' = X^(1/3). Use case: skewed data with negative and positive values. Effect: reduces skewness, handles negative values.
• Reciprocal: X' = 1 / (X + c). Use case: highly skewed, positive values. Effect: inverts the scale, compresses large values.
• Power Transformation: X' = X^p, p ∈ R. Use case: data needing variance stabilization. Effect: changes distribution shape.
• Exponential: X' = e^X. Use case: to reverse a log transform or model exponential relationships. Effect: expands the range.
• Box-Cox Transformation: X' = (X^λ − 1)/λ if λ ≠ 0; X' = log(X) if λ = 0. Use case: data normalization, variance stabilization. Effect: transforms data toward normality.
• Yeo-Johnson Transformation: X' = ((X + 1)^λ − 1)/λ if X ≥ 0 and λ ≠ 0; X' = log(X + 1) if X ≥ 0 and λ = 0; X' = −((−X + 1)^(2−λ) − 1)/(2 − λ) if X < 0 and λ ≠ 2; X' = −log(−X + 1) if X < 0 and λ = 2. Use case: data normalization including negative/zero values. Effect: flexible normalization.
Important Notes

• For log and reciprocal transforms, data must be strictly positive; otherwise, add a small constant c to shift values.
• Box-Cox requires strictly positive data.
• Yeo-Johnson is more flexible and can handle zero or negative values.
• Always analyze the data distribution before and after transformation.
• Transformation choice depends on data characteristics and domain knowledge.

When to Use

• When data shows non-normality or skewness


• When model assumptions (e.g., linear regression) require normal or homoscedastic errors
• When variance stabilizing improves model training and prediction

Practical Steps

1. Visualize the feature distribution (histogram, Q-Q plot).


2. Check skewness/kurtosis statistics.
3. Apply appropriate transformation based on data properties.
4. Re-evaluate distribution and model performance.
5. Reverse transform predictions if needed (e.g., exponentiate log-transformed predictions).

Python Tools for Transformation

• NumPy / Pandas: basic transformations like np.log(), np.sqrt()


• Scikit-learn
o PowerTransformer(method='box-cox') for Box-Cox
o PowerTransformer(method='yeo-johnson') for Yeo-Johnson
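
A minimal sketch combining these tools on a hypothetical right-skewed feature (generated here with a lognormal distribution) might look like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed, strictly positive feature
rng = np.random.default_rng(0)
x = pd.DataFrame({"amount": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Log transform: log1p computes log(1 + x), so zeros are handled without an explicit shift
x_log = np.log1p(x["amount"])

# Box-Cox requires strictly positive input; Yeo-Johnson also accepts zero and negative values
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)
x_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)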

6. Encoding Numerical Features to Categorical Features


Definition

Encoding numerical values to categorical values is the process of converting continuous or discrete numerical
data into discrete categories or bins. This is useful when the model or analysis requires categorical input or
when grouping numerical data improves interpretability or model performance.

Why Convert Numerical to Categorical?

• To capture non-linear relationships by grouping values into meaningful categories.


• When categorical models or methods are preferred.
• To reduce noise and variability in numerical data.
• To simplify data interpretation and feature analysis.

Practical Tips

• Choose number of bins (k) based on data size and domain knowledge.
• Visualize distribution before and after binning.
• Consider encoding binned categories further if needed (e.g., one-hot).
• For ordered bins, ordinal encoding is appropriate.
• Beware of information loss with coarse binning.
• Use libraries like pandas for easy binning.
Common Techniques for Encoding Numerical to Categorical

• Binning (Discretization): divides the numerical range into intervals (bins) and assigns bin labels. Use case: when ranges have semantic meaning. Pros: simple, interpretable, reduces noise. Cons: may lose information, sensitive to bin edges.
• Equal-width Binning: divides the range into equal-sized intervals. Use case: when a uniform bin size is desired. Pros: easy to implement. Cons: may lead to an uneven data distribution per bin.
• Equal-frequency Binning (Quantile Binning): divides the data so each bin has an equal number of samples. Use case: when balanced bins are desired. Pros: handles skewed data better. Cons: bin widths vary, may merge dissimilar values.
• Custom Binning: define bins based on domain knowledge (e.g., age groups). Use case: when specific cut-offs matter. Pros: high interpretability. Cons: requires domain knowledge.
• Clustering-based Binning: use clustering (e.g., k-means) to group similar values. Use case: when natural clusters exist. Pros: captures data patterns better. Cons: more complex, requires tuning.
• Thresholding/Binarization: convert values based on thresholds or conditions (e.g., >100 = High). Use case: simple binary or multi-class splits. Pros: easy to understand. Cons: may oversimplify.
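
As a brief sketch with pandas, using a hypothetical series of customer ages, equal-width, equal-frequency, and custom binning look like this:

import pandas as pd

# Hypothetical continuous feature: customer ages
ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70, 83])

# Equal-width binning: three bins spanning equal ranges
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency (quantile) binning: each bin holds roughly the same number of rows
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Custom binning with domain-chosen cut-offs
custom = pd.cut(ages, bins=[0, 30, 60, 120], labels=["under_30", "30_to_60", "over_60"])
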
Feature Construction
Definition

Feature Construction is the process of creating new features from the existing raw data to improve the
performance of machine learning models. It involves transforming, combining, or extracting information to
create features that better capture the underlying patterns relevant for prediction.

Why Feature Construction?

• Raw data often lacks explicit information needed by ML models.


• Constructed features can reveal hidden relationships.
• Can improve model accuracy and generalization.
• Helps reduce dimensionality when done thoughtfully.

Key Concepts

Types of Feature Construction

1. Mathematical Transformations
o Applying functions like log, square root, polynomial terms, etc.
o Example: Creating log(age) to handle skewness.
2. Aggregation Features
o Summaries over groups or time windows (sum, mean, count).
o Example: Average purchase amount per customer.
3. Interaction Features
o Combining two or more features to capture relationships.
o Example: Multiplying age and income to capture interaction effects.
4. Decomposition Features
o Decomposing complex data into components (e.g., PCA components).
5. Date-Time Features
o Extracting parts like day, month, hour, weekday, or time since an event.
6. Text Features
o Extracting word counts, TF-IDF scores, sentiment scores from text.
7. Domain-specific Features
o Features based on domain knowledge or business logic.
o Example: Risk score in finance or BMI from height and weight in healthcare.

Process of Feature Construction

1. Understand the Data


o Perform exploratory data analysis (EDA).
o Identify raw features and target variable relationships.
2. Identify Feature Gaps
o Look for missing or weak signals.
o Consider domain knowledge to brainstorm new features.
3. Select Construction Methods
o Choose transformations or aggregations based on data types and goals.
4. Create Features
o Use formulas, group-by operations, or feature extraction techniques.
5. Validate New Features
o Check feature importance using models or correlation analysis.
o Use visualization to understand feature distribution and relevance.
6. Iterate and Refine
o Remove redundant or noisy features.
o Optimize for simplicity and model performance.

Best Practices

• Start simple: Basic aggregations or interactions can add value.


• Avoid leakage: Use only training data statistics when constructing features.
• Use automation tools: Featuretools, TSFresh for automated feature engineering.
• Track feature lineage: Keep clear documentation of how features were created.
• Beware of high dimensionality: Excessive features may cause overfitting.
• Scale/normalize features if required by the model.
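
To make the idea concrete, here is a small sketch on a hypothetical transactions table that constructs date-time, aggregation, and transformation features with pandas; the column names are illustrative only.

import numpy as np
import pandas as pd

# Hypothetical transactions table
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 10.0, 50.0, 40.0],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:30", "2024-02-10 18:00",
        "2024-01-07 12:15", "2024-03-01 08:45", "2024-03-02 22:10",
    ]),
})

# Date-time features extracted from the timestamp
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.dayofweek

# Aggregation feature: average purchase amount per customer
df["avg_amount_per_customer"] = df.groupby("customer_id")["amount"].transform("mean")

# Mathematical transformation: log of amount to reduce skew
df["log_amount"] = np.log1p(df["amount"])
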
Feature Selection
Definition

Feature selection is the process of identifying and selecting a subset of relevant features (variables,
predictors) for use in model construction. It helps improve model performance, reduce overfitting, and
enhance interpretability.

Why Feature Selection?

• Improves model accuracy by removing noisy or irrelevant features


• Reduces overfitting by limiting complexity
• Speeds up training and reduces computational cost
• Improves model interpretability by focusing on important features

Types of Feature Selection Methods

• Filter Methods: evaluate features using statistical measures, independently of any model (examples: correlation, chi-square, mutual information). Pros: fast, scalable, independent of the model. Cons: ignore feature interactions, less accurate.
• Wrapper Methods: use predictive models to evaluate feature subsets by training and validating repeatedly (examples: Recursive Feature Elimination (RFE), forward/backward selection). Pros: consider feature interactions, generally more accurate. Cons: computationally expensive, risk of overfitting.
• Embedded Methods: perform feature selection as part of model training (examples: Lasso (L1 regularization), tree-based feature importance). Pros: efficient, balance accuracy and speed. Cons: dependent on model choice, may miss features important to other models.

Common Feature Selection Techniques

1. Filter Methods

• Correlation Coefficient: Remove features that are highly correlated with each other (multicollinearity) or that have low correlation with the target.
• Chi-Square Test: For categorical features, test independence with target variable.
• Mutual Information: Measures dependency between variables (non-linear relationships).
• Variance Threshold: Remove features with very low variance (almost constant).

2. Wrapper Methods

• Recursive Feature Elimination (RFE): Repeatedly build model and remove least important features
based on model coefficients or importance scores.
• Sequential Feature Selection: Add or remove features sequentially based on model performance
(forward or backward).

3. Embedded Methods

• Lasso Regression (L1 Regularization): Penalizes absolute size of coefficients, shrinking some to zero,
effectively performing feature selection.
• Tree-based Models: Random Forest, Gradient Boosting provide feature importance scores naturally.
Features with low importance can be discarded.
How to Choose a Method?

• For large datasets with many features, start with filter methods for speed.
• For smaller datasets where accuracy is critical, wrapper or embedded methods are preferred.
• When using complex models like trees or regularized regression, embedded methods simplify the
pipeline.
• Use domain knowledge to guide or validate feature choices.

Feature Selection Workflow (Project Reference)

1. Understand Data & Domain: Identify potentially relevant features based on prior knowledge.
2. Preprocessing: Clean data, handle missing values, and encode categorical variables.
3. Initial Filter: Use statistical tests or variance threshold to remove irrelevant or constant features.
4. Apply Wrapper or Embedded Methods: Use RFE or model-based selection for refined feature subset.
5. Validate: Evaluate model performance with selected features using cross-validation.
6. Iterate: Refine feature set based on performance and domain feedback.

Practical Tips

• Avoid selecting features using statistics computed on the full dataset before the train-test split; base selection on the training data only to prevent data leakage.
• Use cross-validation to estimate true performance impact of feature selection.
• Keep track of removed features to maintain reproducibility and interpretability.
• Combine feature selection with feature engineering for best results.
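
A minimal sketch of a filter-then-wrapper pipeline on a synthetic dataset (generated with scikit-learn's make_classification, so no real data is implied) could look like this:

from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with a handful of informative features among 20
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter step: drop constant features, then keep the 10 best by mutual information
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
X_filtered = SelectKBest(mutual_info_classif, k=10).fit_transform(X_var, y)

# Wrapper step: recursive feature elimination down to 5 features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_selected = rfe.fit_transform(X_filtered, y)
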
Feature Extraction
Feature extraction transforms raw data into a reduced set of informative features. This improves model
performance and reduces computational cost.

1. Principal Component Analysis (PCA)

• Purpose: Reduce dimensionality while retaining maximum variance.


• How It Works:
o Standardize data.
o Compute covariance matrix.
o Calculate eigenvalues and eigenvectors.
o Select top-k eigenvectors (principal components).
o Project data onto new feature space.
• Use Case: Linear datasets, image compression, noise reduction.
• Pros: Fast, interpretable variance, unsupervised.
• Cons: Linear method, sensitive to scale.
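
A short scikit-learn sketch of this pipeline, on a synthetic matrix of partially correlated features, standardizes the data and keeps enough components to explain 95% of the variance:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 100 samples, 10 partially correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7)) + 0.1 * rng.normal(size=(100, 7))])

# Standardize first, since PCA is sensitive to feature scale
X_std = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)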

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

• Purpose: Visualize high-dimensional data in 2D or 3D.


• How It Works:
o Measures pairwise similarity in high-dimensional space.
o Maps to low-dimensional space, preserving local structure.
• Use Case: Visualizing clusters in image/text embeddings.
• Pros: Excellent for cluster visualization.
• Cons: Computationally expensive, non-deterministic, not suitable for downstream ML.

3. Linear Discriminant Analysis (LDA)

• Purpose: Supervised dimensionality reduction that maximizes class separability.


• How It Works:
o Computes between-class and within-class scatter.
o Maximizes ratio of between-class to within-class variance.
• Use Case: Classification tasks with labeled data.
• Pros: Works well with labeled data, better than PCA for classification.
• Cons: Assumes normal distribution and equal covariance.

4. Autoencoders

• Purpose: Learn compressed representation of data using neural networks.


• How It Works:
o Encoder compresses input to a latent space.
o Decoder reconstructs input from latent space.
• Use Case: Image compression, anomaly detection, pretraining.
• Pros: Non-linear feature extraction.
• Cons: Requires large datasets and careful tuning.

5. Independent Component Analysis (ICA)

• Purpose: Decompose multivariate signals into independent non-Gaussian signals.


• How It Works:
o Maximizes statistical independence between components.
• Use Case: Signal separation, EEG/ECG data.
• Pros: Captures hidden independent factors.
• Cons: Sensitive to noise, assumes non-Gaussianity.

6. Uniform Manifold Approximation and Projection (UMAP)

• Purpose: Non-linear dimensionality reduction, faster alternative to t-SNE.


• How It Works:
o Preserves both local and global structure using graph theory.
• Use Case: Visualization, preprocessing for clustering.
• Pros: Faster than t-SNE, preserves structure.
• Cons: Harder to interpret than PCA.

7. Feature Hashing (Hashing Trick)

• Purpose: Transform categorical variables into numerical space using a hash function.
• How It Works:
o Applies hash function to categories and maps to fixed-size feature vector.
• Use Case: Text data, large sparse categorical features.
• Pros: Memory-efficient.
• Cons: Hash collisions can lead to information loss.

8. Bag of Words (BoW)

• Purpose: Text to numerical feature representation.


• How It Works:
o Counts frequency of each word in the document.
• Use Case: Text classification, spam detection.
• Pros: Simple and interpretable.
• Cons: Ignores word order and context.

9. TF-IDF (Term Frequency–Inverse Document Frequency)

• Purpose: Weighs words by frequency and uniqueness across documents.


• How It Works:
o TF: Frequency of word in a document.
o IDF: Log inverse of frequency across all documents.
• Use Case: Text mining, information retrieval.
• Pros: Reduces impact of common words.
• Cons: Still ignores word semantics.
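
The following sketch contrasts Bag of Words and TF-IDF on a hypothetical three-document corpus using scikit-learn's vectorizers:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)

# TF-IDF: the same counts re-weighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())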

10. Word Embeddings (Word2Vec, GloVe, FastText)

• Purpose: Capture semantic meaning of words in dense vectors.


• How It Works:
o Learns relationships between words based on context in large corpus.
• Use Case: NLP tasks (sentiment analysis, translation).
• Pros: Rich semantic representation.
• Cons: Requires large pre-trained models or training data.
11. Fourier Transform (FFT)

• Purpose: Transform time-domain signals into frequency domain.


• How It Works:
o Decomposes signals into sinusoids with different frequencies.
• Use Case: Signal processing, fault detection, time series analysis.
• Pros: Effective for periodic pattern extraction.
• Cons: Assumes stationarity of signals.
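
As a rough illustration, the snippet below builds a synthetic 5 Hz sine wave and recovers its dominant frequency with NumPy's real FFT; the sampling rate and signal are made up for the example.

import numpy as np

# Synthetic signal: a 5 Hz sine wave sampled at 100 Hz with added noise
fs = 100
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

# Frequency-domain features from the real FFT
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC component
print(dominant_freq)  # close to 5.0 for this synthetic signal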

12. Wavelet Transform

• Purpose: Time-frequency analysis of non-stationary signals.


• How It Works:
o Decomposes signal using wavelets at different scales.
• Use Case: ECG, EEG analysis, compression.
• Pros: Good time and frequency localization.
• Cons: More complex than FFT.

Best Practices

• Always normalize or standardize data before applying extraction techniques.


• Use dimensionality reduction only after understanding model requirements.
• Apply domain knowledge to combine extracted features with constructed ones.
