olecom
diff --git a/_posts/python/statistics/normality-test/2015-06-30-python-Normality-Test.html b/_posts/python/statistics/normality-test/2015-06-30-python-Normality-Test.html
Lines changed: 40 additions & 64 deletions
@@ -15,8 +15,7 @@
 ---
 {% raw %}
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h4 id="New-to-Plotly?">New to Plotly?<a class="anchor-link" href="#New-to-Plotly?">&#182;</a></h4><p>Plotly's Python library is free and open source! <a href="https://plot.ly/python/getting-started/">Get started</a> by downloading the client and <a href="https://plot.ly/python/getting-started/">reading the primer</a>.
 <br>You can set up Plotly to work in <a href="https://plot.ly/python/getting-started/#initialization-for-online-plotting">online</a> or <a href="https://plot.ly/python/getting-started/#initialization-for-offline-plotting">offline</a> mode, or in <a href="https://plot.ly/python/getting-started/#start-plotting-online">jupyter notebooks</a>.
@@ -26,16 +25,14 @@ <h4 id="New-to-Plotly?">New to Plotly?<a class="anchor-link" href="#New-to-Plotl
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="Normality-Tests">Normality Tests<a class="anchor-link" href="#Normality-Tests">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>In statistics, normality tests are used to determine whether a data set is well modeled by a normal (Gaussian) distribution. Many statistical procedures assume that the data are normal or nearly normal.</p>
 <p>There are several methods of assessing whether data are normally distributed or not. They fall into two broad categories: <em>graphical</em> and <em>statistical</em>. 
@@ -57,16 +54,14 @@ <h3 id="Normality-Tests">Normality Tests<a class="anchor-link" href="#Normality-
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>Let's first develop a test dataset that we can use throughout this tutorial.</p>
 <p><em>The tutorial below imports <a href="http://www.numpy.org/">NumPy</a>, <a href="https://plot.ly/pandas/intro-to-pandas-tutorial/">Pandas</a>, and <a href="https://www.scipy.org/">SciPy</a>.</em></p>
@@ -89,14 +84,13 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
 <span class="kn">import</span> <span class="nn">scipy</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p><em>Generate Gaussian Data</em></p>
 
@@ -116,7 +110,7 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
 <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;mean=</span><span class="si">%.3f</span><span class="s1"> stdv=</span><span class="si">%.3f</span><span class="s1">&#39;</span> <span class="o">%</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">gauss_data</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">gauss_data</span><span class="p">)))</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
@@ -126,7 +120,7 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
 
 <div class="output_area">
 
-<div class="prompt"></div>
+    <div class="prompt"></div>
 
 
 <div class="output_subarea output_stream output_stdout output_text">
@@ -140,17 +134,15 @@ <h3 id="Test-Dataset">Test Dataset<a class="anchor-link" href="#Test-Dataset">&#
 
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>We can see that the mean and standard deviation are reasonable but rough estimations of the true underlying population mean and standard deviation, given the small-ish sample size.</p>
 
 </div>
 </div>
 </div>
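<p>As a minimal sketch of why the estimates are only rough, the snippet below draws a sample and prints its mean and standard deviation along with the standard error of the mean. The seed, the sample size of 100, and the population mean of 50 and standard deviation of 5 are assumptions made for illustration; they are not taken from the tutorial's own cell.</p>

```python
import numpy as np

# Assumed parameters for illustration only: seed 1, n=100, mean 50, std 5.
rng = np.random.default_rng(1)
n = 100
gauss_data = 5 * rng.standard_normal(n) + 50

sample_mean = np.mean(gauss_data)
sample_std = np.std(gauss_data)          # np.std defaults to ddof=0
std_error = sample_std / np.sqrt(n)      # typical error of the sample mean

print('mean=%.3f stdv=%.3f' % (sample_mean, sample_std))
print('standard error of the mean=%.3f' % std_error)
```

<p>With 100 observations, the standard error is about a tenth of the standard deviation, so the sample mean is expected to land near, but not exactly on, the population mean.</p>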
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Plot">&#182;</a></h3><p>A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.</p>
 <p>In the histogram, the data is divided into a pre-specified number of groups called bins. The data is then sorted into each bin and the count of the number of observations in each bin is retained.</p>
@@ -172,7 +164,7 @@ <h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Pl
 <span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">([</span><span class="n">trace</span><span class="p">],</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;normality-histogram&#39;</span><span class="p">)</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
@@ -182,7 +174,7 @@ <h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Pl
 
 <div class="output_area">
 
-<div class="prompt output_prompt">Out[3]:</div>
+    <div class="prompt output_prompt">Out[3]:</div>
 
 
 
@@ -197,25 +189,22 @@ <h3 id="Histogram-Plot">Histogram Plot<a class="anchor-link" href="#Histogram-Pl
 
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>We can see a Gaussian-like shape in the data; although it is not strongly the familiar bell shape, it is a rough approximation.</p>
 
 </div>
 </div>
 </div>
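<p>The binning step behind a histogram can be sketched with NumPy alone. This is an illustration rather than the tutorial's plotting code: the choice of 10 bins and the data-generation parameters (seed 1, 100 points, mean 50, standard deviation 5) are assumptions.</p>

```python
import numpy as np

# Assumed sample for illustration: seed 1, n=100, mean 50, std 5.
rng = np.random.default_rng(1)
gauss_data = 5 * rng.standard_normal(100) + 50

# Split the data range into 10 equal-width bins and count observations per bin.
counts, bin_edges = np.histogram(gauss_data, bins=10)

# Print a crude text histogram: one '#' per observation in the bin.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print('%6.2f to %6.2f | %s' % (left, right, '#' * int(count)))
```

<p>For Gaussian data the counts should rise toward the middle bins and fall off toward both ends, although with a small sample the outline is often uneven.</p>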
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" href="#Quantile-Quantile-Plot">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short.</p>
 <p>This plot generates its own sample of the idealized distribution that we are comparing with, in this case the Gaussian distribution. The idealized samples are divided into groups (e.g. 5), called quantiles. Each data point in the sample is paired with a similar member from the idealized distribution at the same cumulative distribution.</p>
@@ -237,7 +226,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
 <span class="n">qqplot_data</span> <span class="o">=</span> <span class="n">qqplot</span><span class="p">(</span><span class="n">gauss_data</span><span class="p">,</span> <span class="n">line</span><span class="o">=</span><span class="s1">&#39;s&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">lines</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
@@ -289,7 +278,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
 <span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">fig</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;normality-QQ&#39;</span><span class="p">)</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
@@ -299,7 +288,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
 
 <div class="output_area">
 
-<div class="prompt output_prompt">Out[5]:</div>
+    <div class="prompt output_prompt">Out[5]:</div>
 
 
 
@@ -314,8 +303,7 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
 
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>Running the example creates the QQ plot showing the scatter plot of points in a diagonal line, closely fitting the expected diagonal pattern for a sample from a Gaussian distribution.</p>
 <p>There are a few small deviations, especially at the bottom of the plot, which is to be expected given the small data sample.</p>
@@ -324,16 +312,14 @@ <h3 id="Quantile-Quantile-Plot">Quantile-Quantile Plot<a class="anchor-link" hre
 </div>
 </div>
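<p>To put a number on how closely the points follow the diagonal, one option is <code>scipy.stats.probplot</code>, which returns the paired quantiles together with a least-squares line and its correlation coefficient. This is a different function from the <code>qqplot</code> call used in the tutorial, and the sample parameters below (seed 1, 100 points, mean 50, standard deviation 5) are assumptions for illustration.</p>

```python
import numpy as np
from scipy import stats

# Assumed sample for illustration: seed 1, n=100, mean 50, std 5.
rng = np.random.default_rng(1)
gauss_data = 5 * rng.standard_normal(100) + 50

# Pair each sorted observation with the matching theoretical normal quantile
# and fit a straight line through the pairs.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    gauss_data, dist='norm'
)

print('slope=%.3f intercept=%.3f r=%.3f' % (slope, intercept, r))
```

<p>For a Gaussian sample, <code>r</code> should be close to 1, and the slope and intercept roughly track the sample standard deviation and mean.</p>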
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="Statistical-Normality-Tests">Statistical Normality Tests<a class="anchor-link" href="#Statistical-Normality-Tests">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>There are many statistical tests that we can use to quantify whether a sample of data looks as though it was drawn from a Gaussian distribution.</p>
 <p>Each test makes different assumptions and considers different aspects of the data.</p>
@@ -358,16 +344,14 @@ <h4 id="Interpretation-of-a-Test">Interpretation of a Test<a class="anchor-link"
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapiro-Wilk-Test">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>The <a href="https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test">Shapiro-Wilk test</a>, named for Samuel Shapiro and Martin Wilk, evaluates a data sample and quantifies how likely it is that the data was drawn from a Gaussian distribution.</p>
 <p>In practice, the Shapiro-Wilk test is believed to be a reliable test of normality, although there is some suggestion that the test may be suitable for smaller samples of data, e.g. thousands of observations or fewer.</p>
@@ -406,7 +390,7 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
 <span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">swt_table</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;shapiro-wilk-table&#39;</span><span class="p">)</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
@@ -416,7 +400,7 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
 
 <div class="output_area">
 
-<div class="prompt output_prompt">Out[6]:</div>
+    <div class="prompt output_prompt">Out[6]:</div>
 
 
 
@@ -431,8 +415,7 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
 
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>Running the above example calculates the statistic and p-value.</p>
 <p>The p-value is then interpreted, and the result suggests that the data is likely drawn from a Gaussian distribution.</p>
@@ -441,16 +424,14 @@ <h3 id="Shapiro-Wilk-Test">Shapiro-Wilk Test<a class="anchor-link" href="#Shapir
 </div>
 </div>
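<p>A minimal sketch of the decision rule, using <code>scipy.stats.shapiro</code>. The alpha of 0.05, the sample parameters, and the exponential comparison sample are assumptions added for illustration; they are not part of the tutorial's own cell.</p>

```python
import numpy as np
from scipy import stats

# Assumed samples for illustration: a Gaussian one and a clearly skewed one.
rng = np.random.default_rng(1)
gauss_data = 5 * rng.standard_normal(100) + 50
skewed_data = rng.exponential(scale=1.0, size=100)

alpha = 0.05  # assumed significance level

for name, sample in [('gaussian', gauss_data), ('exponential', skewed_data)]:
    stat, p = stats.shapiro(sample)
    # H0: the sample was drawn from a normal distribution.
    verdict = 'fail to reject H0' if p > alpha else 'reject H0'
    print('%-12s W=%.3f p=%.4f -> %s' % (name, stat, p, verdict))

# Keep the individual results for inspection.
gauss_stat, gauss_p = stats.shapiro(gauss_data)
skew_stat, skew_p = stats.shapiro(skewed_data)
```

<p>Failing to reject H0 does not prove normality; it only means the sample gives no strong evidence against it.</p>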
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href="#Anderson-Darling-Test">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>The <a href="https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test">Anderson-Darling test</a>, named for Theodore Anderson and Donald Darling, is a statistical test that can be used to evaluate whether a data sample comes from one of many known probability distributions.</p>
 <p>It can be used to check whether a data sample is normal. The test is a modified version of a more sophisticated nonparametric goodness-of-fit statistical test called the <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov-Smirnov test</a>.</p>
@@ -511,7 +492,7 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
 <span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">andar_table</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">&#39;anderson-darling-table&#39;</span><span class="p">)</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
@@ -521,7 +502,7 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
 
 <div class="output_area">
 
-<div class="prompt output_prompt">Out[7]:</div>
+    <div class="prompt output_prompt">Out[7]:</div>
 
 
 
@@ -536,8 +517,7 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
 
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>Running the example calculates the statistic on the test data set and the critical values are tabulated.</p>
 <p>Critical values in a statistical test are pre-defined significance boundaries: if the calculated statistic is less than the critical value at a given significance level, we fail to reject H0 at that level. Rather than just a single p-value, the test returns a critical value for each of a range of commonly used significance levels.</p>
@@ -548,16 +528,14 @@ <h3 id="Anderson-Darling-Test">Anderson-Darling Test<a class="anchor-link" href=
 </div>
 </div>
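<p>The comparison of the statistic against each critical value can be sketched as follows with <code>scipy.stats.anderson</code>. The sample parameters are assumptions for illustration, and the sketch relies on the result exposing <code>statistic</code>, <code>critical_values</code> and <code>significance_level</code>, which may differ in newer SciPy releases.</p>

```python
import numpy as np
from scipy import stats

# Assumed sample for illustration: seed 1, n=100, mean 50, std 5.
rng = np.random.default_rng(1)
gauss_data = 5 * rng.standard_normal(100) + 50

result = stats.anderson(gauss_data, dist='norm')
print('statistic=%.3f' % result.statistic)

# One critical value per significance level (levels are given in percent).
for level, critical in zip(result.significance_level, result.critical_values):
    # H0 (normality) is rejected only when the statistic exceeds the critical value.
    verdict = 'fail to reject H0' if result.statistic < critical else 'reject H0'
    print('%5.1f%%: critical=%.3f -> %s' % (level, critical, verdict))
```

<p>Because stricter significance levels come with larger critical values, a sample can be rejected at one level and still pass at another.</p>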
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$ Test<a class="anchor-link" href="#D'Agostino's-$K^{2}$Test">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p><a href="https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test">D'Agostino's $K^{2}$ test</a>, named for Ralph D'Agostino, calculates summary statistics from the data, namely kurtosis and skewness, to determine if the data distribution departs from the normal distribution.</p>
 <ul>
@@ -600,7 +578,7 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
 <span class="n">py</span><span class="o">.</span><span class="n">iplot</span><span class="p">(</span><span class="n">normt_table</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s2">&quot;D&#39;Agostino-test-table&quot;</span><span class="p">)</span>
 </pre></div>
 
-</div>
+    </div>
 </div>
 </div>
 
@@ -610,7 +588,7 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
 
 <div class="output_area">
 
-<div class="prompt output_prompt">Out[8]:</div>
+    <div class="prompt output_prompt">Out[8]:</div>
 
 
 
@@ -625,8 +603,7 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
 
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <p>Running the above example calculates the statistic and p-value. 
 The p-value is interpreted against an alpha of 5% and finds that the test dataset does not significantly deviate from normal.</p>
@@ -635,14 +612,13 @@ <h3 id="D'Agostino's-$K^{2}$Test">D'Agostino's $K^{2}$Test<a class="anchor-link"
 </div>
 </div>
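<p>As a sketch of how the statistic is built from skewness and kurtosis, the snippet below runs <code>scipy.stats.normaltest</code> and checks that its statistic equals the sum of the squared z-scores from <code>scipy.stats.skewtest</code> and <code>scipy.stats.kurtosistest</code>. The sample parameters are assumptions for illustration; the alpha of 5% follows the text above.</p>

```python
import numpy as np
from scipy import stats

# Assumed sample for illustration: seed 1, n=100, mean 50, std 5.
rng = np.random.default_rng(1)
gauss_data = 5 * rng.standard_normal(100) + 50

k2, p = stats.normaltest(gauss_data)

# The two component tests each return a z-score and a p-value.
z_skew, _ = stats.skewtest(gauss_data)
z_kurt, _ = stats.kurtosistest(gauss_data)
k2_by_hand = z_skew ** 2 + z_kurt ** 2   # K^2 combines both departures

alpha = 0.05
verdict = 'fail to reject H0' if p > alpha else 'reject H0'
print('K^2=%.3f (by hand %.3f) p=%.4f -> %s' % (k2, k2_by_hand, p, verdict))
```

<p>A large K^2 can come from skew, from heavy or light tails, or from both, so inspecting the two z-scores separately shows which departure drives a rejection.</p>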
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div>
-<div class="inner_cell">
+</div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h4 id="Conclusion">Conclusion<a class="anchor-link" href="#Conclusion">&#182;</a></h4><p>We have covered a few normality tests, but these are not all of the tests that exist. It is recommended to use every test that is appropriate for your data.</p>
 <p><strong><em>How to interpret the results?</em></strong></p>
 <ul>
-<li>Your data may not be normal for lots of different reasons. Each test looks at the question of whether a sample was drawn from a Gaussian distribution from a slightly different perspective.</li>
-<li>Investigate why your data is not normal and perhaps use data preparation techniques to make the data more normal.</li>
+<li>Your data may not be normal for many different reasons. Each test looks at the question of whether a sample was drawn from a Gaussian distribution from a slightly different perspective.</li>
+<li>Investigate why your data is not normal and perhaps use data preparation techniques to normalize the data.</li>
 <li>Start looking into the use of nonparametric statistical methods instead of the parametric methods.</li>
 <li>If some of the methods suggest that the sample is Gaussian and some not, then perhaps take this as an indication that your data is Gaussian-like.</li>
 <li>In many situations, you can treat your data as though it is Gaussian and proceed with your chosen parametric statistical methods.</li>