
A GUIDE TO

FEATURE ENGINEERING

CONTENTS
FEATURE TRANSFORMATION
FEATURE CONSTRUCTION
FEATURE SELECTION
FEATURE EXTRACTION
FEATURE TRANSFORMATION
1. Missing Data Imputation
What is Missing Data Imputation?

Missing data imputation is the process of replacing missing values in a dataset with substituted values to
enable analysis and modelling without bias or loss of information.

Why is it Important?

• Many machine learning algorithms cannot handle missing values directly.


• Preserves dataset size and statistical properties.
• Prevents biased or incorrect model results.

Common Techniques for Missing Data Imputation

• Mean Imputation: replace missing values with the mean of the feature. Suitable for numerical, symmetric data. Advantages: simple, fast. Disadvantages: affects variance, biased for skewed data.
• Median Imputation: replace missing values with the median. Suitable for numerical, skewed data. Advantages: robust to outliers. Disadvantages: may reduce data variability.
• Mode Imputation: replace with the most frequent value. Suitable for categorical data. Advantages: preserves category information. Disadvantages: not suitable for numerical data.
• Constant Imputation: replace with a fixed constant (e.g., -9999, "Missing"). Suitable for numerical or categorical data. Advantages: easy to implement and tracks missingness. Disadvantages: can mislead models if the encoding is not handled.
• Forward Fill: replace a missing value with the previous known value (time-series data). Suitable for time-series. Advantages: maintains continuity. Disadvantages: can propagate wrong values.
• Backward Fill: replace a missing value with the next known value (time-series data). Suitable for time-series. Advantages: uses future known data. Disadvantages: may cause data leakage.
• Interpolation: estimate missing values by interpolating between known data points (linear, polynomial, spline). Suitable for time-series and ordered data. Advantages: smooth, follows data trends. Disadvantages: assumes continuity; inaccurate for abrupt changes.
• K-Nearest Neighbors (KNN) Imputation: uses the nearest neighbors' values to estimate missing points. Suitable for numerical datasets. Advantages: preserves local data structure. Disadvantages: computationally expensive.
• Multiple Imputation by Chained Equations (MICE): iterative model-based imputation using the other features. Suitable for multivariate data. Advantages: captures complex relationships. Disadvantages: complex and time-consuming.
• Random Imputation: randomly picks observed values to fill missing spots. Suitable for any feature. Advantages: maintains the feature distribution. Disadvantages: adds randomness, less reproducible.

Key Considerations
• Identify missingness type: MCAR (Missing Completely at Random), MAR (Missing at Random), MNAR
(Missing Not at Random).
• Avoid imputing before splitting dataset to prevent data leakage.
• For time-series, validate stationarity before using forward/backward fill or interpolation.
• Use domain knowledge to choose the best imputation method.
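
As a minimal sketch of how a few of these techniques look in practice (the column names and values below are hypothetical), median and KNN imputation can be applied with scikit-learn's imputers, fitted on the training split only so that no test-set information leaks in:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan, 33],
    "income": [40000, 52000, np.nan, 61000, 58000, np.nan],
})
train, test = df.iloc[:4], df.iloc[4:]

# Median imputation: fit on the training split only to avoid data leakage
median_imputer = SimpleImputer(strategy="median")
train_median = median_imputer.fit_transform(train)
test_median = median_imputer.transform(test)

# KNN imputation: each missing value is estimated from the k most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
train_knn = knn_imputer.fit_transform(train)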

2. Feature Scaling
What is Feature Scaling?

Feature scaling adjusts the range and distribution of numeric features to a standard scale, improving model
performance and convergence.

Why Scale Features?

• Ensures features contribute equally in distance-based models.


• Speeds up gradient-based optimization.
• Prevents dominance of features with large magnitude.

Common Feature Scaling Methods

• Min-Max Scaling: X_scaled = (X - X_min) / (X_max - X_min). Output range: [0, 1]. Suitable for neural nets and image data. Advantages: preserves the shape of the original distribution. Disadvantages: sensitive to outliers.
• Standardization (Z-Score): Z = (X - μ) / σ. Output range: unbounded. Suitable for most ML models. Advantages: centers data to zero mean and unit variance. Disadvantages: skewed data may affect performance.
• Robust Scaling: X_scaled = (X - median(X)) / IQR(X). Output range: unbounded. Suitable for data with outliers. Advantages: resistant to outliers. Disadvantages: less interpretable scaling.
• MaxAbs Scaling: X_scaled = X / max(|X|). Output range: [-1, 1]. Suitable for sparse data. Advantages: preserves sparsity. Disadvantages: sensitive to large maximum values.
• Unit Vector Scaling: X_scaled = X / ||X|| (each sample divided by its vector norm). Output: unit norm. Suitable for text data and clustering. Advantages: useful for direction-based similarity. Disadvantages: alters the shape of the data distribution.
• L1 Normalization: X_scaled = X / ||X||_1 = X / Σ|x_i|. Output: unit sum of absolute values. Suitable for NLP and sparse vectors. Advantages: useful for sparse datasets. Disadvantages: sensitive to large values.
• Mean Normalization: X_scaled = (X - μ) / (X_max - X_min). Output range: approximately [-1, 1]. Suitable for PCA and regression. Advantages: centers and scales data. Disadvantages: still sensitive to outliers.

Best Practices for Scaling

• Always apply scaling after train-test split.


• Apply scaling only to numerical features, never to categorical.
• Use Standardization for most models unless data contains extreme outliers.
• Use Robust scaling if dataset has many outliers.
• Use Min-Max scaling for image pixel data or neural network inputs.
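
A minimal sketch of this workflow, assuming a small hypothetical feature matrix: split first, fit the scaler on the training portion, and reuse the fitted statistics on the test portion.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical numeric feature matrix with one large-magnitude column
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0],
              [4.0, 5000.0], [5.0, 450.0], [6.0, 380.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# Split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

scaler = StandardScaler()                 # swap in RobustScaler() when outliers dominate
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics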

3. Encoding Categorical Values


Definition

Encoding categorical features is the process of converting categorical data (non-numeric labels) into a numeric
format that can be used by machine learning algorithms.

Why Encoding is Needed

• Most ML algorithms require numerical input.


• Categorical data must be transformed to retain meaningful information.
• Proper encoding preserves relationships and avoids introducing bias.

Types of Categorical Variables

1. Nominal: Categories with no intrinsic order (e.g., color: red, blue, green).
2. Ordinal: Categories with a defined order (e.g., size: small, medium, large).

Encoding Techniques

• Label Encoding: assigns each category a unique integer value. Suitable for ordinal (and sometimes nominal) data. Pros: simple; preserves order for ordinal data. Cons: can mislead the model if the implied order is not meaningful (nominal data).
• One-Hot Encoding: creates a binary column (0 or 1) for each category. Suitable for nominal data. Pros: no ordinal assumptions, widely used. Cons: can create high dimensionality.
• Ordinal Encoding: maps categories to integers based on a known order. Suitable for ordinal data. Pros: preserves order. Cons: requires a known category order.
• Binary Encoding: converts categories to binary code, reducing dimensionality compared to one-hot. Suitable for high-cardinality nominal data. Pros: more compact than one-hot. Cons: slightly more complex, less interpretable.
• Frequency Encoding: replaces categories with their frequency/count in the data. Suitable for nominal data. Pros: simple, captures category prevalence. Cons: may mislead if frequency is unrelated to the target.
• Target Encoding: replaces categories with the mean of the target variable for each category. Suitable for nominal data. Pros: can improve model performance. Cons: risk of target leakage, requires careful validation.
• Hashing Encoding: applies a hash function to map categories to a fixed number of columns (hash buckets). Suitable for very high cardinality. Pros: scalable, fixed dimension. Cons: possible collisions, loss of interpretability.

Practical Tips

• For nominal features with few categories, use One-Hot Encoding.


• For ordinal features, use Ordinal Encoding with proper order.
• For high cardinality features (many unique categories), prefer Binary Encoding, Target Encoding, or
Hashing.
• Avoid target leakage with target encoding by applying it only on training data or using cross-validation
schemes.
• After encoding, check for multicollinearity and dimensionality issues.
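
The sketch below shows one-hot and ordinal encoding with scikit-learn on a hypothetical two-column frame; the sparse_output argument assumes a recent scikit-learn version (1.2 or later).

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical frame with one nominal and one ordinal feature
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size":  ["small", "large", "medium", "small"],
})

# One-hot encode the nominal feature (no order assumed)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
color_encoded = ohe.fit_transform(df[["color"]])

# Ordinal encode the ordered feature with an explicit category order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ord_enc.fit_transform(df[["size"]])
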
4. Outlier Detection
Definition

Outliers are data points that significantly deviate from the majority of the dataset. Outlier detection identifies
these anomalous points which may indicate errors, variability, or novel insights.

Why Detect Outliers?

• Outliers can distort statistical analyses and machine learning models.


• They may indicate data quality issues, fraud, or rare events.
• Removing or treating outliers can improve model accuracy and robustness.

Types of Outliers

1. Point Outliers: Single data points far from others.


2. Contextual Outliers: Data points anomalous in a specific context (e.g., time-series).
3. Collective Outliers: A group of data points anomalous together.

Outlier Detection Techniques

• Z-Score Method: measures how many standard deviations a point is from the mean. Use case: normally distributed data. Pros: simple, effective for Gaussian data. Cons: sensitive to non-normal data and extreme values.
• Interquartile Range (IQR): flags points more than 1.5*IQR below Q1 or above Q3. Use case: skewed data. Pros: robust to non-normal data. Cons: may miss outliers in multimodal data.
• Winsorization: limits extreme values by capping them at specific percentiles (e.g., the 5th and 95th percentiles). Use case: when you want to reduce impact without removal. Pros: preserves data size, reduces the effect of extremes. Cons: alters original values, may hide true outliers.
• Visualization: boxplots, scatter plots, and histograms to visually spot outliers. Use case: exploratory analysis. Pros: intuitive and quick. Cons: subjective and manual.
• Distance-based Methods: identify points far from their neighbours (e.g., k-NN distance, Mahalanobis distance). Use case: multidimensional data. Pros: capture complex relationships. Cons: computationally expensive for large data.
• Density-based Methods: detect points in low-density regions (e.g., Local Outlier Factor (LOF)). Use case: high-dimensional data. Pros: detect local anomalies. Cons: parameter sensitive.
• Clustering-based Methods: points that do not belong well to any cluster are considered outliers (e.g., DBSCAN). Use case: unlabeled data. Pros: unsupervised, capture complex patterns. Cons: depend on clustering quality.
• Model-based Methods: use predictive models to detect anomalies (e.g., Isolation Forest, One-Class SVM). Use case: large and complex datasets. Pros: scalable, effective for various data types. Cons: require tuning and expertise.

Best Practices

• Understand data distribution before choosing method.


• Visualize data to get insights.
• Use robust methods for skewed or high-dimensional data.
• Consider domain knowledge to decide on outlier treatment.
• Decide whether to remove, transform, or keep outliers depending on context.

Handling Outliers

• Removal: Delete outliers if they are errors or irrelevant.


• Transformation: Apply log, square root, or winsorization to reduce impact.
• Imputation: Replace outliers with mean, median, or other estimates.
• Separate Modelling: Treat outliers as a separate class or anomaly.
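
As a rough illustration on a hypothetical series with one extreme value, the IQR rule and winsorization can be written in a few lines of pandas:

import pandas as pd

# Hypothetical skewed feature with one extreme value
s = pd.Series([12, 15, 14, 13, 16, 15, 14, 250])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Winsorization: cap values at the 5th and 95th percentiles instead of removing them
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))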

5. Mathematical Feature Transformation


Definition

Mathematical Feature Transformation refers to applying mathematical functions or operations to raw features to improve model performance by making data more suitable for learning. Transformations help in handling skewed data, stabilizing variance, and revealing hidden relationships.

Why Use Mathematical Transformations?

• To normalize data distributions


• To reduce skewness and make data more symmetric
• To stabilize variance across features
• To improve linearity between features and target variables
• To enhance model accuracy and convergence speed

Common Mathematical Transformations

• Log Transformation: X' = log(X + c), c ≥ 0. Use case: right-skewed data, positive values only. Effect: compresses large values, reduces skewness.
• Square Root: X' = √X. Use case: moderately skewed data, positive values. Effect: reduces skewness, less aggressive than log.
• Cube Root: X' = X^(1/3). Use case: skewed data with negative and positive values. Effect: reduces skewness, handles negative values.
• Reciprocal: X' = 1 / (X + c). Use case: highly skewed, positive values. Effect: inverts the scale, compresses large values.
• Power Transformation: X' = X^p, p ∈ R. Use case: data needing variance stabilization. Effect: changes distribution shape.
• Exponential: X' = e^X. Use case: to reverse a log transform or model exponential relationships. Effect: expands the range.
• Box-Cox Transformation: X' = (X^λ − 1)/λ if λ ≠ 0; X' = log(X) if λ = 0. Use case: data normalization, variance stabilization. Effect: transforms data toward normality.
• Yeo-Johnson Transformation: X' = ((X + 1)^λ − 1)/λ if X ≥ 0 and λ ≠ 0; X' = log(X + 1) if X ≥ 0 and λ = 0; X' = −((−X + 1)^(2−λ) − 1)/(2 − λ) if X < 0 and λ ≠ 2; X' = −log(−X + 1) if X < 0 and λ = 2. Use case: data normalization including negative/zero values. Effect: flexible normalization.
Important Notes

• For log and reciprocal transforms, data must be strictly positive; otherwise, add a small constant c to shift values.
• Box-Cox requires strictly positive data.
• Yeo-Johnson is more flexible and can handle zero or negative values.
• Always analyze the data distribution before and after transformation.
• Transformation choice depends on data characteristics and domain knowledge.

When to Use

• When data shows non-normality or skewness


• When model assumptions (e.g., linear regression) require normal or homoscedastic errors
• When variance stabilizing improves model training and prediction

Practical Steps

1. Visualize the feature distribution (histogram, Q-Q plot).


2. Check skewness/kurtosis statistics.
3. Apply appropriate transformation based on data properties.
4. Re-evaluate distribution and model performance.
5. Reverse transform predictions if needed (e.g., exponentiate log-transformed predictions).

Python Tools for Transformation

• NumPy / Pandas: basic transformations like np.log(), np.sqrt()


• Scikit-learn
o PowerTransformer(method='box-cox') for Box-Cox
o PowerTransformer(method='yeo-johnson') for Yeo-Johnson
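
A minimal sketch combining these tools on a hypothetical right-skewed feature (generated here with a lognormal distribution) might look like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed, strictly positive feature
rng = np.random.default_rng(0)
x = pd.DataFrame({"amount": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Log transform: log1p computes log(1 + x), so zeros are handled without an explicit shift
x_log = np.log1p(x["amount"])

# Box-Cox requires strictly positive input; Yeo-Johnson also accepts zero and negative values
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)
x_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)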

6. Encoding Numerical Features to Categorical Features


Definition

Encoding numerical values to categorical values is the process of converting continuous or discrete numerical
data into discrete categories or bins. This is useful when the model or analysis requires categorical input or
when grouping numerical data improves interpretability or model performance.

Why Convert Numerical to Categorical?

• To capture non-linear relationships by grouping values into meaningful categories.


• When categorical models or methods are preferred.
• To reduce noise and variability in numerical data.
• To simplify data interpretation and feature analysis.

Practical Tips

• Choose number of bins (k) based on data size and domain knowledge.
• Visualize distribution before and after binning.
• Consider encoding binned categories further if needed (e.g., one-hot).
• For ordered bins, ordinal encoding is appropriate.
• Beware of information loss with coarse binning.
• Use libraries like pandas for easy binning.
Common Techniques for Encoding Numerical to Categorical

• Binning (Discretization): divides the numerical range into intervals (bins) and assigns bin labels. Use case: when ranges have semantic meaning. Pros: simple, interpretable, reduces noise. Cons: may lose information, sensitive to bin edges.
• Equal-width Binning: divides the range into equal-sized intervals. Use case: when a uniform bin size is desired. Pros: easy to implement. Cons: may lead to an uneven data distribution per bin.
• Equal-frequency Binning (Quantile Binning): divides the data so each bin has an equal number of samples. Use case: when balanced bins are desired. Pros: handles skewed data better. Cons: bin widths vary, may merge dissimilar values.
• Custom Binning: define bins based on domain knowledge (e.g., age groups). Use case: when specific cut-offs matter. Pros: high interpretability. Cons: requires domain knowledge.
• Clustering-based Binning: use clustering (e.g., k-means) to group similar values. Use case: when natural clusters exist. Pros: captures data patterns better. Cons: more complex, requires tuning.
• Thresholding/Binarization: convert values based on thresholds or conditions (e.g., >100 = High). Use case: simple binary or multi-class splits. Pros: easy to understand. Cons: may oversimplify.
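
As a brief sketch with pandas, using a hypothetical series of customer ages, equal-width, equal-frequency, and custom binning look like this:

import pandas as pd

# Hypothetical continuous feature: customer ages
ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70, 83])

# Equal-width binning: three bins spanning equal ranges
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency (quantile) binning: each bin holds roughly the same number of rows
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Custom binning with domain-chosen cut-offs
custom = pd.cut(ages, bins=[0, 30, 60, 120], labels=["under_30", "30_to_60", "over_60"])
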
Feature Construction
Definition

Feature Construction is the process of creating new features from the existing raw data to improve the
performance of machine learning models. It involves transforming, combining, or extracting information to
create features that better capture the underlying patterns relevant for prediction.

Why Feature Construction?

• Raw data often lacks explicit information needed by ML models.


• Constructed features can reveal hidden relationships.
• Can improve model accuracy and generalization.
• Helps reduce dimensionality when done thoughtfully.

Key Concepts

Types of Feature Construction

1. Mathematical Transformations
o Applying functions like log, square root, polynomial terms, etc.
o Example: Creating log(age) to handle skewness.
2. Aggregation Features
o Summaries over groups or time windows (sum, mean, count).
o Example: Average purchase amount per customer.
3. Interaction Features
o Combining two or more features to capture relationships.
o Example: Multiplying age and income to capture interaction effects.
4. Decomposition Features
o Decomposing complex data into components (e.g., PCA components).
5. Date-Time Features
o Extracting parts like day, month, hour, weekday, or time since an event.
6. Text Features
o Extracting word counts, TF-IDF scores, sentiment scores from text.
7. Domain-specific Features
o Features based on domain knowledge or business logic.
o Example: Risk score in finance or BMI from height and weight in healthcare.

Process of Feature Construction

1. Understand the Data


o Perform exploratory data analysis (EDA).
o Identify raw features and target variable relationships.
2. Identify Feature Gaps
o Look for missing or weak signals.
o Consider domain knowledge to brainstorm new features.
3. Select Construction Methods
o Choose transformations or aggregations based on data types and goals.
4. Create Features
o Use formulas, group-by operations, or feature extraction techniques.
5. Validate New Features
o Check feature importance using models or correlation analysis.
o Use visualization to understand feature distribution and relevance.
6. Iterate and Refine
o Remove redundant or noisy features.
o Optimize for simplicity and model performance.

Best Practices

• Start simple: Basic aggregations or interactions can add value.


• Avoid leakage: Use only training data statistics when constructing features.
• Use automation tools: Featuretools, TSFresh for automated feature engineering.
• Track feature lineage: Keep clear documentation of how features were created.
• Beware of high dimensionality: Excessive features may cause overfitting.
• Scale/normalize features if required by the model.
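
To make the idea concrete, here is a small sketch on a hypothetical transactions table that constructs date-time, aggregation, and transformation features with pandas; the column names are illustrative only.

import numpy as np
import pandas as pd

# Hypothetical transactions table
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 10.0, 50.0, 40.0],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:30", "2024-02-10 18:00",
        "2024-01-07 12:15", "2024-03-01 08:45", "2024-03-02 22:10",
    ]),
})

# Date-time features extracted from the timestamp
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.dayofweek

# Aggregation feature: average purchase amount per customer
df["avg_amount_per_customer"] = df.groupby("customer_id")["amount"].transform("mean")

# Mathematical transformation: log of amount to reduce skew
df["log_amount"] = np.log1p(df["amount"])
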
Feature Selection
Definition

Feature selection is the process of identifying and selecting a subset of relevant features (variables,
predictors) for use in model construction. It helps improve model performance, reduce overfitting, and
enhance interpretability.

Why Feature Selection?

• Improves model accuracy by removing noisy or irrelevant features


• Reduces overfitting by limiting complexity
• Speeds up training and reduces computational cost
• Improves model interpretability by focusing on important features

Types of Feature Selection Methods

• Filter Methods: evaluate features using statistical measures, independently of any model (examples: correlation, chi-square, mutual information). Pros: fast, scalable, independent of the model. Cons: ignore feature interactions, less accurate.
• Wrapper Methods: use predictive models to evaluate feature subsets by training and validating repeatedly (examples: Recursive Feature Elimination (RFE), forward/backward selection). Pros: consider feature interactions, generally more accurate. Cons: computationally expensive, risk of overfitting.
• Embedded Methods: perform feature selection as part of model training (examples: Lasso (L1 regularization), tree-based feature importance). Pros: efficient, balance accuracy and speed. Cons: dependent on model choice, may miss features important to other models.

Common Feature Selection Techniques

1. Filter Methods

• Correlation Coefficient: Remove features that are highly correlated with each other (multicollinearity) or that have low correlation with the target.
• Chi-Square Test: For categorical features, test independence with target variable.
• Mutual Information: Measures dependency between variables (non-linear relationships).
• Variance Threshold: Remove features with very low variance (almost constant).

2. Wrapper Methods

• Recursive Feature Elimination (RFE): Repeatedly build model and remove least important features
based on model coefficients or importance scores.
• Sequential Feature Selection: Add or remove features sequentially based on model performance
(forward or backward).

3. Embedded Methods

• Lasso Regression (L1 Regularization): Penalizes absolute size of coefficients, shrinking some to zero,
effectively performing feature selection.
• Tree-based Models: Random Forest, Gradient Boosting provide feature importance scores naturally.
Features with low importance can be discarded.
How to Choose a Method?

• For large datasets with many features, start with filter methods for speed.
• For smaller datasets where accuracy is critical, wrapper or embedded methods are preferred.
• When using complex models like trees or regularized regression, embedded methods simplify the
pipeline.
• Use domain knowledge to guide or validate feature choices.

Feature Selection Workflow (Project Reference)

1. Understand Data & Domain: Identify potentially relevant features based on prior knowledge.
2. Preprocessing: Clean data, handle missing values, and encode categorical variables.
3. Initial Filter: Use statistical tests or variance threshold to remove irrelevant or constant features.
4. Apply Wrapper or Embedded Methods: Use RFE or model-based selection for refined feature subset.
5. Validate: Evaluate model performance with selected features using cross-validation.
6. Iterate: Refine feature set based on performance and domain feedback.

Practical Tips

• Avoid selecting features using statistics computed on the full dataset before the train-test split; base selection on the training data only to prevent data leakage.
• Use cross-validation to estimate true performance impact of feature selection.
• Keep track of removed features to maintain reproducibility and interpretability.
• Combine feature selection with feature engineering for best results.
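
A minimal sketch of a filter-then-wrapper pipeline on a synthetic dataset (generated with scikit-learn's make_classification, so no real data is implied) could look like this:

from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with a handful of informative features among 20
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter step: drop constant features, then keep the 10 best by mutual information
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
X_filtered = SelectKBest(mutual_info_classif, k=10).fit_transform(X_var, y)

# Wrapper step: recursive feature elimination down to 5 features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_selected = rfe.fit_transform(X_filtered, y)
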
Feature Extraction
Feature extraction transforms raw data into a reduced set of informative features. This improves model
performance and reduces computational cost.

1. Principal Component Analysis (PCA)

• Purpose: Reduce dimensionality while retaining maximum variance.


• How It Works:
o Standardize data.
o Compute covariance matrix.
o Calculate eigenvalues and eigenvectors.
o Select top-k eigenvectors (principal components).
o Project data onto new feature space.
• Use Case: Linear datasets, image compression, noise reduction.
• Pros: Fast, interpretable variance, unsupervised.
• Cons: Linear method, sensitive to scale.
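
A short scikit-learn sketch of this pipeline, on a synthetic matrix of partially correlated features, standardizes the data and keeps enough components to explain 95% of the variance:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 100 samples, 10 partially correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7)) + 0.1 * rng.normal(size=(100, 7))])

# Standardize first, since PCA is sensitive to feature scale
X_std = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)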

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

• Purpose: Visualize high-dimensional data in 2D or 3D.


• How It Works:
o Measures pairwise similarity in high-dimensional space.
o Maps to low-dimensional space, preserving local structure.
• Use Case: Visualizing clusters in image/text embeddings.
• Pros: Excellent for cluster visualization.
• Cons: Computationally expensive, non-deterministic, not suitable for downstream ML.

3. Linear Discriminant Analysis (LDA)

• Purpose: Supervised dimensionality reduction that maximizes class separability.


• How It Works:
o Computes between-class and within-class scatter.
o Maximizes ratio of between-class to within-class variance.
• Use Case: Classification tasks with labeled data.
• Pros: Works well with labeled data, better than PCA for classification.
• Cons: Assumes normal distribution and equal covariance.

4. Autoencoders

• Purpose: Learn compressed representation of data using neural networks.


• How It Works:
o Encoder compresses input to a latent space.
o Decoder reconstructs input from latent space.
• Use Case: Image compression, anomaly detection, pretraining.
• Pros: Non-linear feature extraction.
• Cons: Requires large datasets and careful tuning.

5. Independent Component Analysis (ICA)

• Purpose: Decompose multivariate signals into independent non-Gaussian signals.


• How It Works:
o Maximizes statistical independence between components.
• Use Case: Signal separation, EEG/ECG data.
• Pros: Captures hidden independent factors.
• Cons: Sensitive to noise, assumes non-Gaussianity.

6. Uniform Manifold Approximation and Projection (UMAP)

• Purpose: Non-linear dimensionality reduction, faster alternative to t-SNE.


• How It Works:
o Preserves both local and global structure using graph theory.
• Use Case: Visualization, preprocessing for clustering.
• Pros: Faster than t-SNE, preserves structure.
• Cons: Harder to interpret than PCA.

7. Feature Hashing (Hashing Trick)

• Purpose: Transform categorical variables into numerical space using a hash function.
• How It Works:
o Applies hash function to categories and maps to fixed-size feature vector.
• Use Case: Text data, large sparse categorical features.
• Pros: Memory-efficient.
• Cons: Hash collisions can lead to information loss.

8. Bag of Words (BoW)

• Purpose: Text to numerical feature representation.


• How It Works:
o Counts frequency of each word in the document.
• Use Case: Text classification, spam detection.
• Pros: Simple and interpretable.
• Cons: Ignores word order and context.

9. TF-IDF (Term Frequency–Inverse Document Frequency)

• Purpose: Weighs words by frequency and uniqueness across documents.


• How It Works:
o TF: Frequency of word in a document.
o IDF: Log inverse of frequency across all documents.
• Use Case: Text mining, information retrieval.
• Pros: Reduces impact of common words.
• Cons: Still ignores word semantics.
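
The following sketch contrasts Bag of Words and TF-IDF on a hypothetical three-document corpus using scikit-learn's vectorizers:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)

# TF-IDF: the same counts re-weighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())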

10. Word Embeddings (Word2Vec, GloVe, FastText)

• Purpose: Capture semantic meaning of words in dense vectors.


• How It Works:
o Learns relationships between words based on context in large corpus.
• Use Case: NLP tasks (sentiment analysis, translation).
• Pros: Rich semantic representation.
• Cons: Requires large pre-trained models or training data.
11. Fourier Transform (FFT)

• Purpose: Transform time-domain signals into frequency domain.


• How It Works:
o Decomposes signals into sinusoids with different frequencies.
• Use Case: Signal processing, fault detection, time series analysis.
• Pros: Effective for periodic pattern extraction.
• Cons: Assumes stationarity of signals.
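
As a rough illustration, the snippet below builds a synthetic 5 Hz sine wave and recovers its dominant frequency with NumPy's real FFT; the sampling rate and signal are made up for the example.

import numpy as np

# Synthetic signal: a 5 Hz sine wave sampled at 100 Hz with added noise
fs = 100
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

# Frequency-domain features from the real FFT
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC component
print(dominant_freq)  # close to 5.0 for this synthetic signal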

12. Wavelet Transform

• Purpose: Time-frequency analysis of non-stationary signals.


• How It Works:
o Decomposes signal using wavelets at different scales.
• Use Case: ECG, EEG analysis, compression.
• Pros: Good time and frequency localization.
• Cons: More complex than FFT.

Best Practices

• Always normalize or standardize data before applying extraction techniques.


• Use dimensionality reduction only after understanding model requirements.
• Apply domain knowledge to combine extracted features with constructed ones.
