Sampling
Sampling Techniques in Machine Learning

1. Random Sampling: Randomly select data points from the dataset without any bias.
   - Example: Splitting a dataset into training and test sets.
2. Under-sampling (or Sub-sampling): Reduce the size of the majority class to balance class distribution in imbalanced datasets.
   - Example: In fraud detection, randomly reducing non-fraudulent transactions to match the number of fraudulent transactions.
3. Over-sampling: Increase the size of the minority class by duplicating its data points or generating synthetic data.
   - Example: Using techniques like SMOTE (Synthetic Minority Over-sampling Technique).
4. Resampling: Drawing samples from a dataset in different ways to improve model performance, evaluate models, or address data-related challenges like imbalance or scarcity.
   - Example: Bootstrap, Cross-Validation.

A minimal sketch of these operations follows below.

Up-Sampling

- Up-sampling: A type of over-sampling where additional instances of the minority class are added to balance the dataset.
- Achieved either by duplicating existing instances or generating synthetic data.
- Example: SMOTE (Synthetic Minority Over-sampling Technique) generates new data points for the minority class using interpolation.
- Use Case: Handling imbalanced datasets in classification problems.

Sub-Sampling

- Sub-sampling: Reduces the dataset's size by randomly removing data points, often from the majority class, to balance the class distribution or reduce the dataset size.
- Example: Randomly removing non-fraudulent transactions in a fraud detection dataset.
- Use Cases:
  - Reducing computational cost or memory usage.
  - Balancing datasets when the majority class dominates.
" Definition: Re-sampling is a general term that encompasses both "Imbalanced Dataset: A dataset where the distribution of classes is
adding (up-sampling) or removing (sub-sampling) data points to skewed.
achieve specific goals, such as balancing classes, creating balanced " Majority classes dominate while minority classes are
training/testing splits, or bootstrapping. underrepresented
" Re-sampling is an overarching concept that applies to many of " Real world problems: fraud detection, medical diagnosis, rare
the specific techniques already listed, such as: event prediction
Re "Up-sampling and Sub-sampling: Balancing class " Challenges:
Sampling distributions.
"Bootstrapping: Re-sampling with replacement.
Imbalanced " Bias towards the majority class;
" Cross-validation: Creating new training/testing splits by re Datasets " Poor generalization for the minority class
sampling data subsets. "Accuracy metrics may not reflect true performance
Use Case: " How can this be resolved?
" Model evaluation: Re-sampling strategies like k "Under-sampling: Reduce the size of the majority class
fold crosS-validation. "Over-sampling: Increase the size of the minority class
" Improving training data representativeness. 1. Duplicate minority class; no new information is added
2. Generate new data from the existing ones (SMOTE)
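As a sketch of the two resolutions, here is the duplication-based and removal-based route using the imbalanced-learn package (the same library the SMOTE example below uses); RandomOverSampler and RandomUnderSampler are imblearn classes.

# Sketch: resolving imbalance by random under- and over-sampling
# (duplication only; compare with SMOTE, which synthesizes new points)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.98], flip_y=0,
                           random_state=1)
print(Counter(y))                                    # skewed class counts

# Under-sampling: shrink the majority class
X_u, y_u = RandomUnderSampler(random_state=1).fit_resample(X, y)
print(Counter(y_u))                                  # balanced by removal

# Over-sampling: duplicate minority-class rows (no new information)
X_o, y_o = RandomOverSampler(random_state=1).fit_resample(X, y)
print(Counter(y_o))                                  # balanced by duplication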
" Borderline-SMOTE: Focuses on generating synthetic data near the #Generate and plota synthettc Imbalanced classifletion dataset
decision boundary. from collections import Counter
" ADASYN (Adaptive Synthetic Sampling):Adjusts the number of from sklearn.datasets import make_classification
synthetic samples for each instance based on its difficulty to from matplotlib import pyplot
classify. Python from numpy import where
#define dataset
Variants "SMOTE-ENN and SMOTE-Tomek: Combines SMOTE With under
code X, y=make_classificationln_samples=10000, n features=2, n_redundant-0,
sampling techniques like Edited Nearest Neighbor (ENN)or Tomek
of Links to remove noisy instances. for n_clusters per_class=1,weights=[0.98], lip_y-0, random_ state=1)
# summarlze class distributlon
SMOTE SMOTE counter =Counter(y)
print(counter)
# scatter plot of examples by class label
colors {0:'maroon', 1:'darkgreen'}
for label, _in counter.items():
row_0x =wherely ==label)[O]
Pyplot.scatter (X(row ix, 0], X(row_ix, 1], label=str(label), color-colors(label)
Pyplot.legend)
Pyplot.show()
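The variants listed above share the same fit_resample interface in imbalanced-learn; a minimal sketch, reusing the X and y defined above (the class names are as provided by imblearn):

# Sketch: SMOTE variants via imbalanced-learn's common fit_resample interface
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTEENN, SMOTETomek

for sampler in (BorderlineSMOTE(random_state=1),   # synthesize near the boundary
                ADASYN(random_state=1),            # more samples for hard instances
                SMOTEENN(random_state=1),          # SMOTE + ENN cleaning
                SMOTETomek(random_state=1)):       # SMOTE + Tomek-link removal
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))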
Python code: SMOTE

# Oversample and plot imbalanced dataset with SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from matplotlib import pyplot
from numpy import where

# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.98], flip_y=0,
                           random_state=1)

# summarize class distribution
counter = Counter(y)
print(counter)

# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)

# summarize the new class distribution
counter = Counter(y)
print(counter)

# scatter plot of examples by class label
colors = {0: 'maroon', 1: 'darkgreen'}
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label), alpha=0.5,
                   color=colors[label])
pyplot.legend()
pyplot.show()
Common Re-Sampling Methods

2. k-fold Cross-Validation
- Process:
  - Partition the dataset into groups (folds).
  - Each fold takes turns being the test set, with the remaining folds serving as the training set.
- Use Cases:
  - Primarily used for evaluating predictive models.
  - Repeatedly trains on one subset and evaluates on another subset.

Comparison: Bootstrap vs. k-fold Cross-Validation

| Aspect                  | Bootstrap                                               | k-fold Cross-Validation                            |
| Sampling Technique      | Sampling is done with replacement                       | Partitioning is done without replacement           |
| Use Cases               | General-purpose parameter estimation                    | Model evaluation                                   |
| Flexibility             | Simpler and more general                                | Specifically suited for predictive modelling       |
| Computational Cost      | Potentially higher, depending on the number of resamples | Moderate, depends on the value of k (number of folds) |
| Independence of Samples | Sub-samples may overlap                                 | Folds are mutually exclusive                       |
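A minimal sketch of k-fold cross-validation with scikit-learn's KFold and cross_val_score (the logistic-regression model is an illustrative choice, not from the slides):

# Sketch: k-fold cross-validation with scikit-learn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=1)

# 5 folds: each fold is the test set once, the other 4 form the training set
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)
print(scores.mean(), scores.std())   # average accuracy across folds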
" Definition: Are-sampling technique that
creates multiple datasets "Need for Bootstrapping:
(called bootstrap samples) bysampling with replacement from the
original dataset. " Uncertainty Estimation: It provides an empirical way to
" Helps estimate the sampling distribution of a estimate the uncertainty of a statistic (e.g., mean, median)
statistic and without making strong assumptions about the data
assess the variability of a model.
distributíon.
Boot " Key Characteristics
" Sampling with Boot " Small Sample Sizes: Useful when the dataset is too small to
Replacement: Each sample is created by split into separate training and test sets.
strapping randomly selecting data points from the original dataset,
where a data point can appear more than once in a sample. strapping " Model Validation: Provides an alternative to cross-validation
for assessing modelperformance.
" Sample Size: The bootstrap sample typically has the same size Robustness: Enhances model reliability by reducing the impact
as the original dataset. of outliers and variance in small datasets.
" Multiple Samples: Bootstrapping generates multiple bootstrap
datasets to improve the robustness of the results.
" Advantages
import numpy as np
1. Non-parametric: Does not rely on assumptions about data
distribution. from sklearn.utils import resample
2. Flexibility: Applicable to various problems, including regression and
classification.
Bootstrapping #Original dataset
3. Improved Accuracy: Generates more robust estimations of model in data = [1, 2, 3, 4, 5]
performance and uncertainty. # Number of bootstrap samples
Bootstrap Limitations
Python B= 1000
" ComputationalCost: Repeating the sampling process BBB times can be bootstrap_means = 0
computationally expensive.
" Overfitting Risk: In small datasets, repetitive use of the same samples
may lead to overfitting in model training. # Generate B bootstrap samples
" Dependence on Original Data:The quality of bootstrap estimates for_in range(B):
depends heavily on the representativeness of the original dataset. replace -True,
Sa mp le = resample(data,
n_samples=len(data))
bootstrap_means.append(np.mean(sample))
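Continuing the example, the recorded means form an empirical sampling distribution of the mean; a percentile confidence interval (a standard bootstrap summary, added here as an illustrative extension rather than taken from the slides) can be read off with numpy:

# Summarize the bootstrap distribution of the mean
print(np.mean(bootstrap_means))                      # bootstrap estimate of the mean
lo, hi = np.percentile(bootstrap_means, [2.5, 97.5])
print(lo, hi)                                        # 95% percentile interval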