Sampling

The document discusses the concepts of population and sampling in the context of statistics and machine learning, defining population as the complete set of observations and sampling as a subset representing that population. It outlines various sampling techniques, including random sampling, over-sampling, and under-sampling, highlighting their applications in handling imbalanced datasets. Additionally, it covers bootstrapping and cross-validation methods for model evaluation and uncertainty estimation.

Uploaded by

nabinkoirala53

1/22/2025

Population
• Definition: The complete set of all possible observations, individuals, or items of interest in a study or experiment.
• Includes every possible data point relevant to the analysis.
• Example:
  1. Analyzing the buying behavior of customers in a country.
     • Population: All customers in that country.
  2. Analyzing stock prices in machine learning.
     • Population: The complete historical data of stock prices.

Sample
• Definition: A subset of the population, chosen to represent the population in a study.
• Example:
  1. From the population of all customers in a country, a sample could be 1,000 randomly selected customers.
  2. In machine learning, a sample might be a subset of labeled data used for training and testing models.

Figure: Sampling (Image Source)

Statistical Sampling
• In the context of machine learning or statistics, the population is often too large to be analyzed directly.
• Statistical Sampling: The process of selecting subsets of observations from a broader population, with the goal of estimating properties of that population or using the subsets for machine learning.
• Need for sampling:
  • High cost or difficulty in making additional observations.
  • Challenges in gathering all existing observations.
  • More observations are expected to be made in the future.
  • Processing the entire dataset is computationally expensive or impractical.
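As a minimal sketch of estimating a population property from a sample, the snippet below draws 1,000 values at random from a synthetic population and compares the sample mean with the population mean. The population itself (gamma-distributed yearly spend for one million customers) is an assumption made up for illustration, not data from the slides.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: yearly spend of 1,000,000 customers (synthetic data).
population = rng.gamma(shape=2.0, scale=500.0, size=1_000_000)

# Simple random sample of 1,000 customers, drawn without replacement.
sample = rng.choice(population, size=1_000, replace=False)

population_mean = population.mean()
sample_mean = sample.mean()
# Standard error of the sample mean, estimated from the sample itself.
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f} (standard error ~ {standard_error:.1f})")
```

The sample mean should land within a few standard errors of the population mean, which is the "generalize, with some error" trade-off that sampling accepts in exchange for lower cost.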


Sampling
Goals of Sampling: Sampling helps us estimate population properties efficiently. By working with samples, we achieve:
• Reduced costs and time compared to using complete datasets.
• The ability to generalize findings to the population, albeit with some error.
• Steps for Effective Sampling:
  1. Define the Population: Clearly specify the domain or scope from which observations could theoretically be made.
  2. Set a Sampling Goal: Identify the population property to be estimated using the sample.
  3. Establish Selection Criteria: Decide the methodology to accept or reject observations for inclusion in the sample.
  4. Determine Sample Size: Choose the number of observations that will constitute the sample.

Sampling
• Points of Consideration in Sampling:
  1. Sample Goal. The population property that you wish to estimate using the sample.
  2. Population. The scope or domain from which observations could theoretically be made.
  3. Selection Criteria. The methodology that will be used to accept or reject observations in your sample.
  4. Sample Size. The number of observations that will constitute the sample.

Sampling Techniques in Machine Learning
1. Random Sampling: Randomly select data points from the dataset without any bias.
   • Example: Splitting a dataset into training and test sets.
2. Under-sampling (or Sub-Sampling): Reduce the size of the majority class to balance class distribution in imbalanced datasets.
   • Example: In fraud detection, randomly reducing non-fraudulent transactions to match the number of fraudulent transactions.
3. Over-sampling: Increases the size of the minority class by duplicating its data points or generating synthetic data.
   • Example: Using techniques like SMOTE (Synthetic Minority Over-sampling Technique).
4. Resampling: Drawing samples from a dataset in different ways to improve model performance, evaluate models, or address data-related challenges like imbalance or scarcity.
   • Example: Bootstrap, Cross-Validation

Up-Sampling and Sub-Sampling
• Up-sampling: A type of over-sampling where additional instances of the minority class are added to balance the dataset.
  • Achieved either by duplicating existing instances or generating synthetic data.
  • Example: SMOTE (Synthetic Minority Over-sampling Technique) generates new data points for the minority class using interpolation.
  • Use Case:
    • Handling imbalanced datasets in classification problems.
• Sub-sampling: Reduces the dataset's size by randomly removing data points, often from the majority class, to balance the class distribution or reduce the dataset size.
  • Example: Randomly removing non-fraudulent transactions in a fraud detection dataset.
  • Use Case:
    • Reducing computational cost or memory usage.
    • Balancing datasets when the majority class dominates.

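The under-sampling and over-sampling-by-duplication ideas above can be sketched with NumPy alone. The data here is synthetic (980 majority and 20 minority examples, chosen only for illustration), and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic labels: 980 majority (0) and 20 minority (1) examples.
y = np.array([0] * 980 + [1] * 20)
X = rng.normal(size=(y.size, 2))

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Under-sampling: keep a random subset of the majority class, without replacement.
kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Over-sampling by duplication: draw extra minority rows with replacement.
n_extra = majority_idx.size - minority_idx.size
extra_minority = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra_minority])
X_over, y_over = X[over_idx], y[over_idx]

print("original:     ", np.bincount(y))
print("under-sampled:", np.bincount(y_under))
print("over-sampled: ", np.bincount(y_over))
```

Under-sampling discards 960 majority rows, while duplication repeats the same 20 minority rows many times, which is why duplication adds no new information.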
Re-Sampling
• Definition: Re-sampling is a general term that encompasses both adding (up-sampling) and removing (sub-sampling) data points to achieve specific goals, such as balancing classes, creating balanced training/testing splits, or bootstrapping.
• Re-sampling is an overarching concept that applies to many of the specific techniques already listed, such as:
  • Up-sampling and Sub-sampling: Balancing class distributions.
  • Bootstrapping: Re-sampling with replacement.
  • Cross-validation: Creating new training/testing splits by re-sampling data subsets.
• Use Case:
  • Model evaluation: Re-sampling strategies like k-fold cross-validation.
  • Improving training data representativeness.

Imbalanced Datasets
• Imbalanced Dataset: A dataset where the distribution of classes is skewed.
  • Majority classes dominate while minority classes are underrepresented.
  • Real world problems: fraud detection, medical diagnosis, rare event prediction.
• Challenges:
  • Bias towards the majority class.
  • Poor generalization for the minority class.
  • Accuracy metrics may not reflect true performance.
• How can this be resolved?
  • Under-sampling: Reduce the size of the majority class.
  • Over-sampling: Increase the size of the minority class.
    1. Duplicate minority class; no new information is added.
    2. Generate new data from the existing ones (SMOTE).
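To illustrate why accuracy may not reflect true performance on an imbalanced dataset, the sketch below scores a trivial "classifier" that always predicts the majority class. The 98/2 class split is a synthetic assumption chosen for illustration.

```python
import numpy as np

# Synthetic labels: 98% majority class (0), 2% minority class (1).
y_true = np.array([0] * 980 + [1] * 20)

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)

# Recall on the minority class: the share of actual positives that were found.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
actual_positives = np.sum(y_true == 1)
minority_recall = true_positives / actual_positives

print(f"accuracy:        {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Accuracy comes out at 0.98 even though not a single minority instance is detected, which is the failure mode that the re-sampling techniques above try to address.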

SMOTE
SMOTE (Synthetic Minority Oversampling Technique): Interpolates between instances of the minority class to create synthetic examples.
• Proposed by Chawla et al. (2002)
• Mechanism: It generates new instances along the line segments connecting neighboring instances of the minority class.
• Algorithm:
  1. Identify Minority Class Instance: Identify one or more minority classes that are significantly underrepresented compared to others.
  2. Identify the k-Nearest Neighbours: For each minority class instance, find its k-nearest neighbours in the feature space.
  3. Randomly Select Neighbours: Randomly choose one or more of these neighbours.
  4. Generate Synthetic Data: For each selected neighbour, create a synthetic instance using:
     $\text{Synthetic Instance} = \text{Original Instance} + \lambda \times (\text{Neighbour Instance} - \text{Original Instance})$, where $\lambda \in [0, 1]$
  5. Repeat: Continue this process until the desired level of balance is achieved.

SMOTE
• Advantages:
  1. Mitigates Overfitting: Unlike simple duplication, SMOTE generates new instances, reducing the risk of overfitting.
  2. Improves Model Generalization: Provides more diverse training data for the minority class.
  3. Works with Multiple Algorithms: Can be used with most machine learning classifiers.
• Limitations:
  1. Risk of Overlap: May generate synthetic data in regions of feature space where classes overlap, leading to noise.
  2. No Guarantee of Realistic Instances: Synthetic instances are interpolations and may not always represent real-world scenarios.
  3. Fixed k-Neighbors: A fixed value of k might not be optimal for all datasets.
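The algorithm steps above can be sketched from scratch with NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the brute-force neighbour search are all assumptions made for this example.

```python
import numpy as np


def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = X_minority.shape[0]
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class (brute force).
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(distances, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each point.
    neighbours = np.argsort(distances, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(n)                    # pick a minority instance
        neighbour = rng.choice(neighbours[base])  # pick one of its neighbours
        lam = rng.random()                        # interpolation weight in [0, 1)
        synthetic[i] = X_minority[base] + lam * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Synthetic minority class: 20 points around (2, 2).
rng = np.random.default_rng(1)
X_min = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
X_new = smote_sketch(X_min, n_synthetic=60)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority points, no point can fall outside the bounding box of the original minority data. That is also why SMOTE cannot guarantee realistic instances when the class is not convex in feature space.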

Variants of SMOTE
• Borderline-SMOTE: Focuses on generating synthetic data near the decision boundary.
• ADASYN (Adaptive Synthetic Sampling): Adjusts the number of synthetic samples for each instance based on its difficulty to classify.
• SMOTE-ENN and SMOTE-Tomek: Combine SMOTE with under-sampling techniques like Edited Nearest Neighbor (ENN) or Tomek Links to remove noisy instances.

Python code for SMOTE

# Generate and plot a synthetic imbalanced classification dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where

# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.98], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
colors = {0: 'maroon', 1: 'darkgreen'}
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label), color=colors[label])
pyplot.legend()
pyplot.show()

Python code for SMOTE

# Oversample and plot imbalanced dataset with SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from matplotlib import pyplot
from numpy import where

# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.98], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
colors = {0: 'maroon', 1: 'darkgreen'}
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label), alpha=0.5, color=colors[label])
pyplot.legend()
pyplot.show()


SMOTE
Figure: two scatter plots, "Imbalanced Dataset" and "Data distribution after Balancing Minority class".

Common Re-Sampling Methods
1. Bootstrap Method
• Process:
  • Draw samples from the dataset with replacement.
  • Some instances may appear multiple times in a subsample, while others may not appear at all.
  • Instances not included in the subsample can be used as a test set.
• Use Cases:
  • General-purpose estimation of population parameters.
  • Can be adapted for predictive model evaluation.
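The bootstrap process described above leaves some instances out of each subsample, and those left-out ("out-of-bag") instances can serve as a test set. A minimal sketch, assuming a dataset of 1,000 instances identified only by their index:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1_000
indices = np.arange(n)

# One bootstrap sample: n draws with replacement.
bootstrap_idx = rng.choice(indices, size=n, replace=True)

# Out-of-bag instances: those never drawn into the bootstrap sample.
oob_idx = np.setdiff1d(indices, bootstrap_idx)

unique_in_bag = np.unique(bootstrap_idx).size
print(f"distinct instances in the bootstrap sample: {unique_in_bag}")
print(f"out-of-bag instances (usable as a test set): {oob_idx.size}")
print(f"out-of-bag fraction: {oob_idx.size / n:.3f}")
```

For large n the expected out-of-bag fraction is (1 - 1/n)^n, which approaches 1/e, or roughly 37% of the data.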

Common Re-Sampling Methods
2. k-fold Cross-Validation
• Process:
  • Partition the dataset into k groups (folds).
  • Each fold takes turns being the test set, with the remaining folds serving as the training set.
• Use Cases:
  • Primarily used for evaluating predictive models.
  • Repeatedly trains on one subset and evaluates on another subset.

Comparison: Bootstrap vs. k-fold Cross-Validation

| Aspect                  | Bootstrap                                                | k-fold Cross-Validation                               |
|-------------------------|----------------------------------------------------------|-------------------------------------------------------|
| Sampling Technique      | Sampling is done with replacement                        | Partitioning is done without replacement              |
| Use Cases               | General-purpose parameter estimation                     | Model evaluation                                      |
| Flexibility             | Simpler and more general                                 | Specifically suited for predictive modelling          |
| Computational Cost      | Potentially higher, depending on the number of resamples | Moderate, depends on the value of k (number of folds) |
| Independence of Samples | Sub-samples may overlap                                  | Folds are mutually exclusive                          |
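The "Sampling Technique" and "Independence of Samples" rows of the comparison can be checked directly. The sketch below, using a hypothetical dataset of 20 indexed instances, builds 5 folds and 5 bootstrap resamples and compares how they overlap.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, k = 20, 5
indices = rng.permutation(n)

# k-fold: partition the shuffled indices into k disjoint folds.
folds = np.array_split(indices, k)
all_fold_indices = np.concatenate(folds)
folds_are_disjoint = np.unique(all_fold_indices).size == n

# Bootstrap: k resamples of size n, each drawn with replacement.
resamples = [rng.choice(n, size=n, replace=True) for _ in range(k)]
shared = np.intersect1d(resamples[0], resamples[1])

print("fold sizes:", [len(f) for f in folds])
print("folds are disjoint and cover the data:", folds_are_disjoint)
print("instances shared by the first two bootstrap resamples:", shared.size)
```

Each instance appears in exactly one fold, whereas the same instance can appear in several bootstrap resamples, and more than once within a single resample.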

Bootstrapping
• Definition: A re-sampling technique that creates multiple datasets (called bootstrap samples) by sampling with replacement from the original dataset.
• Helps estimate the sampling distribution of a statistic and assess the variability of a model.
• Key Characteristics
  • Sampling with Replacement: Each sample is created by randomly selecting data points from the original dataset, where a data point can appear more than once in a sample.
  • Sample Size: The bootstrap sample typically has the same size as the original dataset.
  • Multiple Samples: Bootstrapping generates multiple bootstrap datasets to improve the robustness of the results.

Bootstrapping
• Need for Bootstrapping:
  • Uncertainty Estimation: It provides an empirical way to estimate the uncertainty of a statistic (e.g., mean, median) without making strong assumptions about the data distribution.
  • Small Sample Sizes: Useful when the dataset is too small to split into separate training and test sets.
  • Model Validation: Provides an alternative to cross-validation for assessing model performance.
  • Robustness: Enhances model reliability by reducing the impact of outliers and variance in small datasets.

Method of Creating Bootstrap Samples
1. Original Dataset: Start with a dataset containing n samples.
2. Resample: Randomly draw n samples with replacement from the original dataset to create a bootstrap sample.
3. Compute Statistic: Calculate the statistic (e.g., mean, variance, accuracy) of interest on the bootstrap sample.
4. Repeat: Repeat the process B times to generate B bootstrap samples and compute the statistic for each.
5. Aggregate Results: Analyze the distribution of the computed statistics to estimate confidence intervals or variability.

Applications of Bootstrap
1. Model Evaluation
   • Estimate the bias and variance of a machine learning model.
   • Compare model performance metrics like accuracy, precision, and recall across bootstrap samples.
2. Feature Importance
   • Assess the stability of feature importance rankings across bootstrap samples.
3. Ensemble Methods
   • The Bagging technique (e.g., Random Forests) uses bootstrapping to train multiple models on different subsets of data.
4. Confidence Interval Estimation
   • Provides confidence intervals for model predictions or parameter estimates.
Bootstrap
• Advantages
  1. Non-parametric: Does not rely on assumptions about data distribution.
  2. Flexibility: Applicable to various problems, including regression and classification.
  3. Improved Accuracy: Generates more robust estimations of model performance and uncertainty.
• Limitations
  • Computational Cost: Repeating the sampling process B times can be computationally expensive.
  • Overfitting Risk: In small datasets, repetitive use of the same samples may lead to overfitting in model training.
  • Dependence on Original Data: The quality of bootstrap estimates depends heavily on the representativeness of the original dataset.

Bootstrapping in Python

import numpy as np
from sklearn.utils import resample

# Original dataset
data = [1, 2, 3, 4, 5]
# Number of bootstrap samples
B = 1000
bootstrap_means = []

# Generate B bootstrap samples and record the mean of each
for _ in range(B):
    sample = resample(data, replace=True, n_samples=len(data))
    bootstrap_means.append(np.mean(sample))

Bootstrapping in Python (continued)

# Calculate the 95% percentile confidence interval
lower_bound = np.percentile(bootstrap_means, 2.5)
upper_bound = np.percentile(bootstrap_means, 97.5)
print(f"Bootstrap Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")

Bootstrapping for Bagging in Random Forest
In Random Forests, bootstrapping is used to:
• Generate multiple training datasets.
• Train each decision tree on a unique bootstrap sample.
• Aggregate predictions from all trees for the final output (e.g., majority vote for classification, average for regression).
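The three bagging steps above (resample, train, aggregate) can be sketched with NumPy. To keep the example self-contained, each base model is a least-squares line fitted with `np.polyfit`, a stand-in assumption for the decision trees a Random Forest would actually train. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic regression data: y = 3x + 1 plus noise.
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

n_models = 50
x_query = np.array([2.0, 5.0, 8.0])
predictions = np.empty((n_models, x_query.size))

for m in range(n_models):
    # Each base model sees its own bootstrap sample of the training data.
    idx = rng.integers(0, x.size, size=x.size)
    # Stand-in base learner: a least-squares line (a Random Forest fits a tree here).
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    predictions[m] = slope * x_query + intercept

# Aggregate: average the base models' predictions (regression case).
bagged_prediction = predictions.mean(axis=0)
print(np.round(bagged_prediction, 2))
```

For classification the aggregation step would be a majority vote over predicted labels instead of a mean.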

Cross-Validation
• Definition: Statistical technique used to evaluate the performance of a machine learning model.
• The dataset is divided into subsets to test the model's ability to generalize to unseen data.
• By systematically training and testing the model on different partitions of the data, cross-validation helps in estimating the accuracy of the predictive model.
• Types of Cross-Validation
  1. Holdout Method
     The dataset is randomly split into two parts:
     • Training set: Used to train the model.
     • Testing set: Used to evaluate the model.
     • Typically, a 70-30 or 80-20 split is used.
     • Limitations:
       • Performance may depend on the specific split.
       • Results can vary due to randomness in the split.

Types of Cross-Validation
2. K-Fold Cross-Validation
   • The dataset is divided into k subsets (folds).
   • The model is trained on k-1 folds and tested on the remaining fold.
   • This process is repeated k times, with each fold used as a test set once.
   • The final performance is the average of the k iterations.
   • Common choices for k: k = 5 or k = 10.
3. Stratified K-Fold Cross-Validation
   • Similar to K-Fold but ensures that each fold has a representative distribution of the target variable.
   • Useful for imbalanced datasets.

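As a rough sketch of the stratified idea above, the snippet below assigns indices to folds class by class, so every fold keeps the overall class ratio. The helper name `stratified_folds` and the 90/10 label split are assumptions for illustration; this is not the scikit-learn implementation.

```python
import numpy as np


def stratified_folds(y, k, seed=0):
    """Assign each index to one of k folds, spreading every class evenly."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(y.size, dtype=int)
    for label in np.unique(y):
        class_idx = rng.permutation(np.flatnonzero(y == label))
        # Deal the shuffled class members round-robin across the folds.
        fold_of[class_idx] = np.arange(class_idx.size) % k
    return fold_of


# Synthetic imbalanced labels: 90 majority (0), 10 minority (1).
y = np.array([0] * 90 + [1] * 10)
fold_of = stratified_folds(y, k=5)

for fold in range(5):
    test_labels = y[fold_of == fold]
    print(f"fold {fold}: size={test_labels.size}, minority={int(test_labels.sum())}")
```

Every fold ends up with 18 majority and 2 minority examples, so no test fold is left without minority instances, which plain random folds cannot guarantee.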
Types of Cross-Validation
4. Leave-One-Out Cross-Validation (LOOCV)
   • A special case of K-Fold where k equals the number of data points.
   • Each data point is used as a test set exactly once.
   • Advantages: Maximizes training data usage.
   • Disadvantages: Computationally expensive.
5. Leave-P-Out Cross-Validation
   • Similar to LOOCV but leaves out p samples as the test set.
   • Less common due to high computational cost.
6. Time Series Cross-Validation
   • Designed for time-dependent data.
   • Ensures that training data precedes testing data to respect temporal order.
   • Common methods include rolling window and expanding window validation.

Purpose of Cross-Validation
1. Model Selection: Compare different models or hyper-parameter settings.
2. Performance Estimation: Provide a realistic estimate of the model's performance on unseen data.
3. Prevent Overfitting: Ensure that the model does not learn the noise or specifics of a particular dataset split.
• Steps in K-fold Cross Validation
  1. Shuffle the dataset (if not time-series data).
  2. Split the dataset into k folds.
  3. For each fold:
     • Train the model on k-1 folds.
     • Test the model on the remaining fold.
  4. Calculate the performance metric for each fold (e.g., accuracy, RMSE, F1-score).
  5. Average the metrics across all folds.
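The five K-fold steps above can be sketched end to end with NumPy. The model (a least-squares line from `np.polyfit`), the synthetic data, and the choice of RMSE as the metric are all assumptions made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic regression data: y = 2x plus noise with standard deviation 1.
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=x.size)

k = 5
# Step 1: shuffle (appropriate here because the data is not a time series).
shuffled = rng.permutation(x.size)
# Step 2: split into k folds.
folds = np.array_split(shuffled, k)

fold_rmse = []
for i in range(k):
    # Step 3: train on k - 1 folds, test on the remaining fold.
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    residuals = y[test_idx] - (slope * x[test_idx] + intercept)
    # Step 4: record the metric for this fold.
    fold_rmse.append(np.sqrt(np.mean(residuals ** 2)))

# Step 5: average the metric across folds.
mean_rmse = float(np.mean(fold_rmse))
print("RMSE per fold:", np.round(fold_rmse, 3))
print(f"mean RMSE: {mean_rmse:.3f}")
```

Since the noise has standard deviation 1, the averaged RMSE should sit close to 1. The spread of the per-fold values gives a sense of how much a single holdout split could vary.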


Cross-Validation
Performance Metrics Used in Cross-Validation:
• Classification Tasks: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
• Regression Tasks: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R²).
• Key Considerations
  1. Data Shuffling: Improves the randomness and ensures no bias in splits.
  2. Choice of k: Larger k values provide a better estimate but increase computational cost.
  3. Imbalanced Datasets: Use stratified variants to maintain the target variable distribution.
  4. Time-Series Data: Avoid data shuffling; ensure temporal order is preserved.

Cross-Validation
• Advantages:
  • Reduces variability in performance estimates.
  • Ensures that all data points contribute to training and testing.
  • Provides insight into the model's ability to generalize.
• Disadvantages
  • Can be computationally expensive, especially for large datasets or complex models.
  • May require additional considerations for time-series or imbalanced datasets.
  • Over-reliance on cross-validation metrics can sometimes overlook domain-specific insights.
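The time-series consideration (no shuffling, training data always precedes test data) can be sketched as an expanding-window split. The generator name `expanding_window_splits` and its parameters are hypothetical, chosen for this illustration only.

```python
import numpy as np


def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = np.arange(0, test_start)  # everything before the test block
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=12, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

The training window grows from 6 to 8 to 10 observations while each test block covers the next 2 time steps, so no future observation ever leaks into training.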
