
Commit 639127d

made changes based on review
1 parent c8ae4b3 commit 639127d

2 files changed

+50
-70
lines changed


_posts/python/statistics/normality-test/2015-06-30-python-Normality-Test.html

Lines changed: 40 additions & 64 deletions
@@ -15,8 +15,7 @@
1515
---
1616
{% raw %}
1717
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
18-
</div>
19-
<div class="inner_cell">
18+
</div><div class="inner_cell">
2019
<div class="text_cell_render border-box-sizing rendered_html">
2120
<h4 id="New-to-Plotly?">New to Plotly?<a class="anchor-link" href="#New-to-Plotly?">&#182;</a></h4><p>Plotly's Python library is free and open source! <a href="https://plot.ly/python/getting-started/">Get started</a> by downloading the client and <a href="https://plot.ly/python/getting-started/">reading the primer</a>.
2221
<br>You can set up Plotly to work in <a href="https://plot.ly/python/getting-started/#initialization-for-online-plotting">online</a> or <a href="https://plot.ly/python/getting-started/#initialization-for-offline-plotting">offline</a> mode, or in <a href="https://plot.ly/python/getting-started/#start-plotting-online">jupyter notebooks</a>.
@@ -26,16 +25,14 @@ <h4 id="New-to-Plotly?">New to Plotly?<a class="anchor-link" href="#New-to-Plotl
2625
</div>
2726
</div>
2827
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
29-
</div>
30-
<div class="inner_cell">
28+
</div><div class="inner_cell">
3129
<div class="text_cell_render border-box-sizing rendered_html">
3230
<h3 id="Normality-Tests">Normality Tests<a class="anchor-link" href="#Normality-Tests">&#182;</a></h3>
3331
</div>
3432
</div>
3533
</div>
3634
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
37-
</div>
38-
<div class="inner_cell">
35+
</div><div class="inner_cell">
3936
<div class="text_cell_render border-box-sizing rendered_html">
4037
<p>In statistics, normality tests are used to determine whether a data set is well modeled by a normal (Gaussian) distribution. Many statistical procedures require that the data be normal or nearly normal.</p>
4138
<p>There are several methods of assessing whether data are normally distributed or not. They fall into two broad categories: <em>graphical</em> and <em>statistical</em>.
@@ -57,16 +54,14 @@ <h3 id="Normality-Tests">Normality Tests<a class="anchor-link" href="#Normality-
5754
</div>
5855
</div>
5956
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
60-
</div>
61-
<div class="inner_cell">
57+
</div><div class="inner_cell">
6258
<div class="text_cell_render border-box-sizing rendered_html">
6359
<h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#182;</a></h3>
6460
</div>
6561
</div>
6662
</div>
6763
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
68-
</div>
69-
<div class="inner_cell">
64+
</div><div class="inner_cell">
7065
<div class="text_cell_render border-box-sizing rendered_html">
7166
<p>Let's first develop a test dataset that we can use throughout this tutorial.</p>
7267
<p><em>The tutorial below imports <a href="http://www.numpy.org/">NumPy</a>, <a href="https://plot.ly/pandas/intro-to-pandas-tutorial/">Pandas</a>, and <a href="https://www.scipy.org/">SciPy</a>.</em></p>
@@ -89,14 +84,13 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
8984
<span class="kn">import</span> <span class="nn">scipy</span>
9085
</pre></div>
9186

92-
</div>
87+
</div>
9388
</div>
9489
</div>
9590

9691
</div>
9792
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
98-
</div>
99-
<div class="inner_cell">
93+
</div><div class="inner_cell">
10094
<div class="text_cell_render border-box-sizing rendered_html">
10195
<p><em>Generate Gaussian Data</em></p>
10296

@@ -116,7 +110,7 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
116110
<span class="nb">print</span><span class="p">(</span><span class="s1">&#39;mean=</span><span class="si">%.3f</span><span class="s1"> stdv=</span><span class="si">%.3f</span><span class="s1">&#39;</span> <span class="o">%</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">gauss_data</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">gauss_data</span><span class="p">)))</span>
117111
</pre></div>
118112

119-
</div>
113+
</div>
120114
</div>
121115
</div>
122116

@@ -126,7 +120,7 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
126120

127121
<div class="output_area">
128122

129-
<div class="prompt"></div>
123+
<div class="prompt"></div>
130124

131125

132126
<div class="output_subarea output_stream output_stdout output_text">
@@ -140,17 +134,15 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
140134

141135
</div>
142136
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
143-
</div>
144-
<div class="inner_cell">
137+
</div><div class="inner_cell">
145138
<div class="text_cell_render border-box-sizing rendered_html">
146139
<p>We can see that the mean and standard deviation are reasonable but rough estimations of the true underlying population mean and standard deviation, given the small-ish sample size.</p>
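To see why the sample estimates are only rough, here is a minimal sketch that draws a small Gaussian sample and compares its mean and standard deviation with the values used to generate it. The seed, the population mean of 50, the standard deviation of 5, and the sample size of 100 are illustrative assumptions, not necessarily the values used in the tutorial.

```python
import numpy as np

# Assumed, illustrative population parameters (not taken from the tutorial).
TRUE_MEAN = 50.0
TRUE_STD = 5.0
N = 100

# Seeded generator so the sketch is reproducible.
rng = np.random.default_rng(12345)
gauss_data = TRUE_STD * rng.standard_normal(N) + TRUE_MEAN

# Sample estimates differ from the population values because N is small.
sample_mean = np.mean(gauss_data)
sample_std = np.std(gauss_data)
print('mean=%.3f stdv=%.3f' % (sample_mean, sample_std))
```

Increasing `N` should pull both estimates closer to the assumed population values.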
147140

148141
</div>
149142
</div>
150143
</div>
151144
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
152-
</div>
153-
<div class="inner_cell">
145+
</div><div class="inner_cell">
154146
<div class="text_cell_render border-box-sizing rendered_html">
155147
<h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Plot">&#182;</a></h3><p>A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.</p>
156148
<p>In the histogram, the data is divided into a pre-specified number of groups called bins. The data is then sorted into each bin and the count of the number of observations in each bin is retained.</p>
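The binning step described above can be sketched with NumPy alone, before any plotting. This is a minimal illustration; the sample, the seed, and the choice of 10 bins are assumptions made for the example.

```python
import numpy as np

# Assumed, illustrative sample (seed and parameters are not from the tutorial).
rng = np.random.default_rng(1)
sample = 5 * rng.standard_normal(100) + 50

# Divide the data range into 10 equal-width bins and count observations per bin.
counts, bin_edges = np.histogram(sample, bins=10)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print('%.2f to %.2f: %d' % (left, right, count))
```

The counts are what a histogram trace draws as bar heights; for Gaussian-like data they should peak near the middle bins.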
@@ -172,7 +164,7 @@ <h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Pl
172164
<span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">([</span><span class="n">trace</span><span class="p">],</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;normality-histogram&#39;</span><span class="p">)</span>
173165
</pre></div>
174166

175-
</div>
167+
</div>
176168
</div>
177169
</div>
178170

@@ -182,7 +174,7 @@ <h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Pl
182174

183175
<div class="output_area">
184176

185-
<div class="prompt output_prompt">Out[3]:</div>
177+
<div class="prompt output_prompt">Out[3]:</div>
186178

187179

188180

@@ -197,25 +189,22 @@ <h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Pl
197189

198190
</div>
199191
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
200-
</div>
201-
<div class="inner_cell">
192+
</div><div class="inner_cell">
202193
<div class="text_cell_render border-box-sizing rendered_html">
203194
<p>We can see a Gaussian-like shape to the data that, although not strongly the familiar bell shape, is a rough approximation.</p>
204195

205196
</div>
206197
</div>
207198
</div>
208199
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
209-
</div>
210-
<div class="inner_cell">
200+
</div><div class="inner_cell">
211201
<div class="text_cell_render border-box-sizing rendered_html">
212202
<h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" href="#Quantile-Quantile-Plot">&#182;</a></h3>
213203
</div>
214204
</div>
215205
</div>
216206
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
217-
</div>
218-
<div class="inner_cell">
207+
</div><div class="inner_cell">
219208
<div class="text_cell_render border-box-sizing rendered_html">
220209
<p>Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short.</p>
221210
<p>This plot generates its own sample of the idealized distribution that we are comparing with, in this case the Gaussian distribution. The idealized samples are divided into groups (e.g. 5), called quantiles. Each data point in the sample is paired with a similar member from the idealized distribution at the same cumulative probability.</p>
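The pairing of sample quantiles with theoretical quantiles can be sketched numerically. The example below uses <code>scipy.stats.probplot</code> rather than the statsmodels <code>qqplot</code> call that appears in the tutorial code; the sample and seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Assumed, illustrative sample (seed and parameters are not from the tutorial).
rng = np.random.default_rng(1)
sample = 5 * rng.standard_normal(100) + 50

# probplot pairs each ordered observation with the matching theoretical
# normal quantile and fits a least-squares line through the pairs.
(theoretical_q, ordered_sample), (slope, intercept, r) = stats.probplot(
    sample, dist='norm'
)

print('slope=%.3f intercept=%.3f r=%.3f' % (slope, intercept, r))
```

For data that is close to Gaussian, the points lie near the fitted line and the correlation coefficient `r` is close to 1; the slope and intercept roughly estimate the standard deviation and mean.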
@@ -237,7 +226,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
237226
<span class="n">qqplot_data</span> <span class="o">=</span> <span class="n">qqplot</span><span class="p">(</span><span class="n">gauss_data</span><span class="p">,</span> <span class="n">line</span><span class="o">=</span><span class="s1">&#39;s&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">lines</span>
238227
</pre></div>
239228

240-
</div>
229+
</div>
241230
</div>
242231
</div>
243232

@@ -289,7 +278,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
289278
<span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">fig</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;normality-QQ&#39;</span><span class="p">)</span>
290279
</pre></div>
291280

292-
</div>
281+
</div>
293282
</div>
294283
</div>
295284

@@ -299,7 +288,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
299288

300289
<div class="output_area">
301290

302-
<div class="prompt output_prompt">Out[5]:</div>
291+
<div class="prompt output_prompt">Out[5]:</div>
303292

304293

305294

@@ -314,8 +303,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
314303

315304
</div>
316305
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
317-
</div>
318-
<div class="inner_cell">
306+
</div><div class="inner_cell">
319307
<div class="text_cell_render border-box-sizing rendered_html">
320308
<p>Running the example creates the QQ plot showing the scatter plot of points in a diagonal line, closely fitting the expected diagonal pattern for a sample from a Gaussian distribution.</p>
321309
<p>There are a few small deviations, especially at the bottom of the plot, which is to be expected given the small data sample.</p>
@@ -324,16 +312,14 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
324312
</div>
325313
</div>
326314
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
327-
</div>
328-
<div class="inner_cell">
315+
</div><div class="inner_cell">
329316
<div class="text_cell_render border-box-sizing rendered_html">
330317
<h3 id="Statistical-Normality-Tests">Statistical Normality Tests<a class="anchor-link" href="#Statistical-Normality-Tests">&#182;</a></h3>
331318
</div>
332319
</div>
333320
</div>
334321
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
335-
</div>
336-
<div class="inner_cell">
322+
</div><div class="inner_cell">
337323
<div class="text_cell_render border-box-sizing rendered_html">
338324
<p>There are many statistical tests that we can use to quantify whether a sample of data looks as though it was drawn from a Gaussian distribution.</p>
339325
<p>Each test makes different assumptions and considers different aspects of the data.</p>
@@ -358,16 +344,14 @@ <h4 id="Interpretation-of-a-Test">Interpretation of a Test<a class="anchor-link"
358344
</div>
359345
</div>
360346
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
361-
</div>
362-
<div class="inner_cell">
347+
</div><div class="inner_cell">
363348
<div class="text_cell_render border-box-sizing rendered_html">
364349
<h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapiro-Wilk-Test">&#182;</a></h3>
365350
</div>
366351
</div>
367352
</div>
368353
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
369-
</div>
370-
<div class="inner_cell">
354+
</div><div class="inner_cell">
371355
<div class="text_cell_render border-box-sizing rendered_html">
372356
<p>The <a href="https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test">Shapiro-Wilk test</a> evaluates a data sample and quantifies how likely it is that the data was drawn from a Gaussian distribution, named for Samuel Shapiro and Martin Wilk.</p>
373357
<p>In practice, the Shapiro-Wilk test is believed to be a reliable test of normality, although there is some suggestion that the test is best suited to smaller samples of data, e.g. thousands of observations or fewer.</p>
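A minimal sketch of running the Shapiro-Wilk test with <code>scipy.stats.shapiro</code> follows. The sample, the seed, and the alpha of 0.05 are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Assumed, illustrative sample (seed and parameters are not from the tutorial).
rng = np.random.default_rng(1)
sample = 5 * rng.standard_normal(100) + 50

# shapiro returns the W statistic and the p-value.
stat, p = stats.shapiro(sample)
print('statistic=%.3f p=%.3f' % (stat, p))

# Assumed significance level for the decision.
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
```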
@@ -406,7 +390,7 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
406390
<span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">swt_table</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;shapiro-wilk-table&#39;</span><span class="p">)</span>
407391
</pre></div>
408392

409-
</div>
393+
</div>
410394
</div>
411395
</div>
412396

@@ -416,7 +400,7 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
416400

417401
<div class="output_area">
418402

419-
<div class="prompt output_prompt">Out[6]:</div>
403+
<div class="prompt output_prompt">Out[6]:</div>
420404

421405

422406

@@ -431,8 +415,7 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
431415

432416
</div>
433417
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
434-
</div>
435-
<div class="inner_cell">
418+
</div><div class="inner_cell">
436419
<div class="text_cell_render border-box-sizing rendered_html">
437420
<p>Running the above example calculates the statistic and p-value.</p>
438421
<p>The p-value is interpreted and indicates that the data is likely drawn from a Gaussian distribution.</p>
@@ -441,16 +424,14 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
441424
</div>
442425
</div>
443426
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
444-
</div>
445-
<div class="inner_cell">
427+
</div><div class="inner_cell">
446428
<div class="text_cell_render border-box-sizing rendered_html">
447429
<h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href="#Anderson-Darling-Test">&#182;</a></h3>
448430
</div>
449431
</div>
450432
</div>
451433
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
452-
</div>
453-
<div class="inner_cell">
434+
</div><div class="inner_cell">
454435
<div class="text_cell_render border-box-sizing rendered_html">
455436
<p>The <a href="https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test">Anderson-Darling test</a>, named for Theodore Anderson and Donald Darling, is a statistical test that can be used to evaluate whether a data sample comes from one of many known probability distributions.</p>
456437
<p>It can be used to check whether a data sample is normal. The test is a modification of the nonparametric goodness-of-fit statistical test called the <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov-Smirnov test</a> that gives more weight to the tails of the distribution.</p>
@@ -511,7 +492,7 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
511492
<span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">andar_table</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;anderson-darling-table&#39;</span><span class="p">)</span>
512493
</pre></div>
513494

514-
</div>
495+
</div>
515496
</div>
516497
</div>
517498

@@ -521,7 +502,7 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
521502

522503
<div class="output_area">
523504

524-
<div class="prompt output_prompt">Out[7]:</div>
505+
<div class="prompt output_prompt">Out[7]:</div>
525506

526507

527508

@@ -536,8 +517,7 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
536517

537518
</div>
538519
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
539-
</div>
540-
<div class="inner_cell">
520+
</div><div class="inner_cell">
541521
<div class="text_cell_render border-box-sizing rendered_html">
542522
<p>Running the example calculates the statistic on the test data set and the critical values are tabulated.</p>
543523
<p>Critical values in a statistical test are a range of pre-defined significance boundaries: we fail to reject H0 if the calculated statistic is less than the critical value. Rather than just a single p-value, the test returns a critical value for each of a range of commonly used significance levels.</p>
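The comparison of the statistic against each critical value can be sketched with <code>scipy.stats.anderson</code>. The sample and seed are illustrative assumptions, and the sketch relies on the <code>statistic</code>, <code>critical_values</code>, and <code>significance_level</code> fields of the returned result; newer SciPy releases may offer additional options.

```python
import numpy as np
from scipy import stats

# Assumed, illustrative sample (seed and parameters are not from the tutorial).
rng = np.random.default_rng(1)
sample = 5 * rng.standard_normal(100) + 50

result = stats.anderson(sample, dist='norm')
print('statistic=%.3f' % result.statistic)

# One decision per significance level: fail to reject H0 when the
# statistic is below the critical value for that level.
decisions = []
for level, critical in zip(result.significance_level, result.critical_values):
    looks_normal = result.statistic < critical
    decisions.append((level, critical, looks_normal))
    verdict = 'fail to reject H0' if looks_normal else 'reject H0'
    print('%.1f%%: critical=%.3f, %s' % (level, critical, verdict))
```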
@@ -548,16 +528,14 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
548528
</div>
549529
</div>
550530
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
551-
</div>
552-
<div class="inner_cell">
531+
</div><div class="inner_cell">
553532
<div class="text_cell_render border-box-sizing rendered_html">
554533
<h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$ Test<a class="anchor-link" href="#D'Agostino's-$K^{2}$Test">&#182;</a></h3>
555534
</div>
556535
</div>
557536
</div>
558537
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
559-
</div>
560-
<div class="inner_cell">
538+
</div><div class="inner_cell">
561539
<div class="text_cell_render border-box-sizing rendered_html">
562540
<p>The <a href="https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test">D'Agostino's $K^{2}$ test</a> calculates summary statistics from the data, namely kurtosis and skewness, to determine if the data distribution departs from the normal distribution, named for Ralph D’Agostino.</p>
563541
<ul>
@@ -600,7 +578,7 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
600578
<span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">normt_table</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s2">&quot;D&#39;Agostino-test-table&quot;</span><span class="p">)</span>
601579
</pre></div>
602580

603-
</div>
581+
</div>
604582
</div>
605583
</div>
606584

@@ -610,7 +588,7 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
610588

611589
<div class="output_area">
612590

613-
<div class="prompt output_prompt">Out[8]:</div>
591+
<div class="prompt output_prompt">Out[8]:</div>
614592

615593

616594

@@ -625,8 +603,7 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
625603

626604
</div>
627605
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
628-
</div>
629-
<div class="inner_cell">
606+
</div><div class="inner_cell">
630607
<div class="text_cell_render border-box-sizing rendered_html">
631608
<p>Running the above example calculates the statistic and p-value.
632609
The p-value is interpreted against an alpha of 5% and finds that the test dataset does not significantly deviate from normal.</p>
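A minimal sketch of the D'Agostino $K^{2}$ check with <code>scipy.stats.normaltest</code>, which combines skewness and kurtosis into a single statistic, is shown here. The sample and seed are illustrative assumptions; the alpha of 5% matches the interpretation above.

```python
import numpy as np
from scipy import stats

# Assumed, illustrative sample (seed and parameters are not from the tutorial).
rng = np.random.default_rng(1)
sample = 5 * rng.standard_normal(100) + 50

# normaltest returns the K^2 statistic and a p-value.
stat, p = stats.normaltest(sample)
print('statistic=%.3f p=%.3f' % (stat, p))

alpha = 0.05
if p > alpha:
    print('Sample does not significantly deviate from normal (fail to reject H0)')
else:
    print('Sample deviates from normal (reject H0)')
```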
@@ -635,14 +612,13 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
635612
</div>
636613
</div>
637614
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
638-
</div>
639-
<div class="inner_cell">
615+
</div><div class="inner_cell">
640616
<div class="text_cell_render border-box-sizing rendered_html">
641617
<h4 id="Conclusion">Conclusion<a class="anchor-link" href="#Conclusion">&#182;</a></h4><p>We have covered a few normality tests, but these are not the only tests that exist. We recommend applying every test that is appropriate for your data.</p>
642618
<p><strong><em>How to interpret the results?</em></strong></p>
643619
<ul>
644-
<li>Your data may not be normal for lots of different reasons. Each test looks at the question of whether a sample was drawn from a Gaussian distribution from a slightly different perspective.</li>
645-
<li>Investigate why your data is not normal and perhaps use data preparation techniques to make the data more normal.</li>
620+
<li>Your data may not be normal for many different reasons. Each test looks at the question of whether a sample was drawn from a Gaussian distribution from a slightly different perspective.</li>
621+
<li>Investigate why your data is not normal and perhaps use data preparation techniques to normalize the data.</li>
646622
<li>Start looking into the use of nonparametric statistical methods instead of the parametric methods.</li>
647623
<li>If some of the methods suggest that the sample is Gaussian and some not, then perhaps take this as an indication that your data is Gaussian-like.</li>
648624
<li>In many situations, you can treat your data as though it is Gaussian and proceed with your chosen parametric statistical methods.</li>
