
Introduction to Statistical Sampling and Resampling

Data is the currency of applied machine learning. Therefore, it is important that it is both collected and used effectively.

Data sampling refers to statistical methods for selecting observations from the domain with the objective of estimating a population parameter, whereas data resampling refers to methods for economically using a collected dataset to improve the estimate of the population parameter and help to quantify the uncertainty of the estimate.

Both data sampling and data resampling are methods required in a predictive modeling problem.
In this tutorial, you will discover statistical sampling and statistical resampling
methods for gathering and making best use of data.

After completing this tutorial, you will know:

 Sampling is an active process of gathering observations with the intent of estimating a population variable.
 Resampling is a methodology of economically using a data sample to
improve the accuracy and quantify the uncertainty of a population
parameter.
 Resampling methods, in fact, make use of a nested sampling method to draw each subsample.

This tutorial is divided into two parts:

1. Statistical Sampling
2. Statistical Resampling

Statistical Sampling
Each row of data represents an observation about something in the world.

When working with data, we often do not have access to all possible
observations. This could be for many reasons; for example:

 It may be difficult or expensive to make more observations.
 It may be challenging to gather all observations together.
 More observations are expected to be made in the future.

Observations made in a domain represent samples of some broader idealized and unknown population of all possible observations that could be made in the domain. This is a useful conceptualization as we can see the separation and relationship between observations and the idealized population.
We can also see that, even if we intend to use big data infrastructure on all available data, the data still represents a sample of observations from an idealized population.

Nevertheless, we may wish to estimate properties of the population. We do this by using samples of observations.

Sampling consists of selecting some part of the population to observe so that one may estimate something about the whole population.

How to Sample
Statistical sampling is the process of selecting subsets of examples from a
population with the objective of estimating properties of the population.

Sampling is an active process. There is a goal of estimating population properties and control over how the sampling is to occur. This control falls short of influencing the process that generates each observation, such as performing an experiment. As such, sampling as a field sits neatly between pure uncontrolled observation and controlled experimentation.

Sampling is usually distinguished from the closely related field of experimental design, in that in experiments one deliberately perturbs some part of the population in order to see what the effect of that action is. […] Sampling is also usually distinguished from observational studies, in which one has little or no control over how the observations on the population were obtained.

There are many benefits to sampling compared to working with fuller or complete datasets, including reduced cost and greater speed.

Performing sampling requires that you carefully define your population and the method by which you will select (and possibly reject) observations to be a part of your data sample. This may very well be defined by the population parameters that you wish to estimate using the sample.

Some aspects to consider prior to collecting a data sample include:

 Sample Goal. The population property that you wish to estimate using
the sample.
 Population. The scope or domain from which observations could
theoretically be made.
 Selection Criteria. The methodology that will be used to accept or reject
observations in your sample.
 Sample Size. The number of observations that will constitute the sample.

Some obvious questions […] are how best to obtain the sample and make the observations and, once the sample data are in hand, how best to use them to estimate the characteristics of the whole population. Obtaining the observations involves questions of sample size, how to select the sample, what observational methods to use, and what measurements to record.

Statistical sampling is a large field of study, but in applied machine learning, there may be three types of sampling that you are likely to use: simple random sampling, systematic sampling, and stratified sampling.

 Simple Random Sampling: Samples are drawn with a uniform probability from the domain.
 Systematic Sampling: Samples are drawn using a pre-specified pattern,
such as at intervals.
 Stratified Sampling: Samples are drawn within pre-specified categories
(i.e. strata).

Although these are the more common types of sampling that you may encounter, there are other techniques.
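
To make these concrete, here is a minimal sketch that draws each of the three sample types from a small synthetic population; the population, its column names, and the sample sizes are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# A hypothetical population of 1,000 observations, each tagged with a category.
population = pd.DataFrame({
    "value": rng.normal(loc=50, scale=10, size=1000),
    "stratum": rng.choice(["A", "B", "C"], size=1000, p=[0.5, 0.3, 0.2]),
})

# Simple random sampling: every observation has a uniform probability of selection.
simple_random = population.sample(n=100, random_state=1)

# Systematic sampling: every k-th observation, following a pre-specified pattern.
k = len(population) // 100
systematic = population.iloc[::k]

# Stratified sampling: draw within each pre-specified category (stratum),
# preserving the relative size of each stratum.
stratified = population.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=1)
```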

Sampling Errors
Sampling requires that we make a statistical inference about the population
from a small set of observations.

We can generalize properties from the sample to the population. This process of estimation and generalization is much faster than working with all possible observations, but will contain errors. In many cases, we can quantify the uncertainty of our estimates and add error bars, such as confidence intervals.
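
As a minimal sketch of such error bars, the example below computes a 95% confidence interval for a sample mean using the normal approximation; the simulated data and the z-value of 1.96 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
sample = rng.normal(loc=50, scale=10, size=100)  # one sample of 100 observations

# Point estimate of the population mean from the sample.
mean = sample.mean()

# Standard error of the mean: sample standard deviation / sqrt(n).
se = sample.std(ddof=1) / np.sqrt(len(sample))

# 95% confidence interval under the normal approximation (z = 1.96).
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"estimate: {mean:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```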

There are many ways to introduce errors into your data sample.

Two main types of errors include selection bias and sampling error.

 Selection Bias. Caused when the method of drawing observations skews the sample in some way.
 Sampling Error. Caused due to the random nature of drawing observations skewing the sample in some way.

Other types of errors may be present, such as systematic errors in the way observations or measurements are made.

In these cases and more, the statistical properties of the sample may be different from what would be expected in the idealized population, which in turn may impact the properties of the population that are being estimated.

Simple methods, such as reviewing raw observations, summary statistics, and visualizations, can help expose simple errors, such as measurement corruption and the over- or underrepresentation of a class of observations.

Nevertheless, care must be taken both when sampling and when drawing conclusions about the population from the sample.

Statistical Resampling
Once we have a data sample, it can be used to estimate the population
parameter.

The problem is that we only have a single estimate of the population parameter,
with little idea of the variability or uncertainty in the estimate.

One way to address this is by estimating the population parameter multiple times from our data sample. This is called resampling.

Statistical resampling methods are procedures that describe how to economically use available data to estimate a population parameter. The result can be both a more accurate estimate of the parameter (such as taking the mean of the estimates) and a quantification of the uncertainty of the estimate (such as adding a confidence interval).

Resampling methods are very easy to use, requiring little mathematical knowledge. They are methods that are easy to understand and implement compared to specialized statistical methods that may require deep technical skill in order to select and interpret.

The resampling methods […] are easy to learn and easy to apply. They require no mathematics beyond introductory high-school algebra, yet are applicable in an exceptionally broad range of subject areas.

A downside of the methods is that they can be computationally very expensive, requiring tens, hundreds, or even thousands of resamples in order to develop a robust estimate of the population parameter.

The key idea is to resample from the original data — either directly or via a fitted model — to create replicate datasets, from which the variability of the quantities of interest can be assessed without long-winded and error-prone analytical calculation. Because this approach involves repeating the original data analysis procedure with many replicate sets of data, these are sometimes called computer-intensive methods.

Each new subsample from the original data sample is used to estimate the
population parameter. The sample of estimated population parameters can then
be considered with statistical tools in order to quantify the expected value and
variance, providing measures of the uncertainty of the estimate.
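
A minimal sketch of this procedure, assuming simulated data and the mean as the parameter of interest: draw many replicate datasets with replacement, estimate the parameter from each, and summarize the spread of the estimates.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
data = rng.normal(loc=50, scale=10, size=200)  # the collected data sample

n_resamples = 1000
estimates = np.empty(n_resamples)
for i in range(n_resamples):
    # Each replicate is drawn from the original sample with replacement.
    replicate = rng.choice(data, size=len(data), replace=True)
    estimates[i] = replicate.mean()

# Expected value and spread of the estimated parameter across replicates.
print(f"mean of estimates: {estimates.mean():.2f}")
print(f"std of estimates (uncertainty): {estimates.std(ddof=1):.2f}")

# A percentile-based 95% interval from the replicate estimates.
print(f"95% interval: {np.percentile(estimates, [2.5, 97.5])}")
```

The spread of the replicate estimates stands in for the sampling variability that an analytical calculation would otherwise have to derive.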

Statistical sampling methods can be used in the selection of a subsample from the original sample.

A key difference is that the process must be repeated multiple times. The problem with this is that there will be some relationship between the samples, as observations will be shared across multiple subsamples. This means that the subsamples and the estimated population parameters are not strictly independent and identically distributed. This has implications for statistical tests performed on the sample of estimated population parameters downstream, i.e. paired statistical tests may be required.
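
As a minimal sketch of such a paired test, assume two models scored on the same cross-validation folds; the fold scores below are made-up numbers for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

# Accuracy of two models on the same 5 folds (paired by fold).
model_a = np.array([0.81, 0.79, 0.84, 0.80, 0.82])
model_b = np.array([0.78, 0.77, 0.80, 0.79, 0.80])

# A paired t-test accounts for the fact that the scores share folds
# and are therefore not independent.
stat, p_value = ttest_rel(model_a, model_b)
print(f"t = {stat:.3f}, p = {p_value:.3f}")
```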

Two commonly used resampling methods that you may encounter are k-fold
cross-validation and the bootstrap.

 Bootstrap. Samples are drawn from the dataset with replacement (allowing the same sample to appear more than once in the sample), where those instances not drawn into the data sample may be used for the test set.
 k-fold Cross-Validation. A dataset is partitioned into k groups, where
each group is given the opportunity of being used as a held out test set
leaving the remaining groups as the training set.

The k-fold cross-validation method specifically lends itself to use in the evaluation of predictive models that are repeatedly trained on one subset of the data and evaluated on a second held-out subset of the data.
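
A minimal sketch of 5-fold cross-validation with scikit-learn follows; the logistic regression model and the synthetic classification dataset are assumptions chosen purely to make the example runnable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# A synthetic binary classification dataset standing in for real data.
X, y = make_classification(n_samples=200, n_features=5, random_state=4)

kfold = KFold(n_splits=5, shuffle=True, random_state=4)
scores = []
for train_idx, test_idx in kfold.split(X):
    # Each group (fold) is held out once as the test set;
    # the remaining groups form the training set.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"accuracy per fold: {np.round(scores, 3)}")
print(f"mean accuracy: {np.mean(scores):.3f}")
```

Shuffling before splitting guards against any ordering in the dataset, and each observation appears in the held-out test set exactly once across the k folds.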

Generally, resampling techniques for estimating model performance operate similarly: a subset of samples are used to fit a model and the remaining samples are used to estimate the efficacy of the model. This process is repeated multiple times and the results are aggregated and summarized. The differences in techniques usually centre around the method in which subsamples are chosen.

The bootstrap method can be used for the same purpose, but is a more general
and simpler method intended for estimating a population parameter.
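
A minimal sketch of this use of the bootstrap, assuming a toy dataset: draw a sample with replacement, then treat the out-of-bag observations, those never drawn, as a test set.

```python
import numpy as np
from sklearn.utils import resample

data = np.arange(20)            # stand-in for a dataset of 20 observations
indices = np.arange(len(data))

# Bootstrap sample of indices, drawn with replacement (duplicates allowed).
boot_idx = resample(indices, replace=True, n_samples=len(data), random_state=5)
train = data[boot_idx]

# Out-of-bag observations: those never drawn into the bootstrap sample.
oob_idx = np.setdiff1d(indices, boot_idx)
test = data[oob_idx]

print(f"bootstrap sample: {np.sort(train)}")
print(f"out-of-bag test set: {test}")
```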
