Commit c48fdb3

Committed: Pushing the docs to dev/ for branch: master, commit b4db36d337a4ff83f1bcb37c5a8c615d3134d372
1 parent 69acbb1, commit c48fdb3

1,238 files changed: +4184 / -3859 lines


dev/_downloads/a2486a67d0a96c8526fd62fbb80c78ba/plot_mahalanobis_distances.py

Lines changed: 137 additions & 88 deletions
@@ -3,67 +3,86 @@
 Robust covariance estimation and Mahalanobis distances relevance
 ================================================================
 
-An example to show covariance estimation with the Mahalanobis
+This example shows covariance estimation with Mahalanobis
 distances on Gaussian distributed data.
 
 For Gaussian distributed data, the distance of an observation
 :math:`x_i` to the mode of the distribution can be computed using its
-Mahalanobis distance: :math:`d_{(\mu,\Sigma)}(x_i)^2 = (x_i -
-\mu)'\Sigma^{-1}(x_i - \mu)` where :math:`\mu` and :math:`\Sigma` are
-the location and the covariance of the underlying Gaussian
-distribution.
+Mahalanobis distance:
+
+.. math::
+
+    d_{(\mu,\Sigma)}(x_i)^2 = (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)
+
+where :math:`\mu` and :math:`\Sigma` are the location and the covariance of
+the underlying Gaussian distributions.
 
 In practice, :math:`\mu` and :math:`\Sigma` are replaced by some
-estimates. The usual covariance maximum likelihood estimate is very
-sensitive to the presence of outliers in the data set and therefor,
-the corresponding Mahalanobis distances are. One would better have to
+estimates. The standard covariance maximum likelihood estimate (MLE) is very
+sensitive to the presence of outliers in the data set and therefore,
+the downstream Mahalanobis distances also are. It would be better to
 use a robust estimator of covariance to guarantee that the estimation is
-resistant to "erroneous" observations in the data set and that the
-associated Mahalanobis distances accurately reflect the true
-organisation of the observations.
+resistant to "erroneous" observations in the dataset and that the
+calculated Mahalanobis distances accurately reflect the true
+organization of the observations.
 
-The Minimum Covariance Determinant estimator is a robust,
+The Minimum Covariance Determinant estimator (MCD) is a robust,
 high-breakdown point (i.e. it can be used to estimate the covariance
 matrix of highly contaminated datasets, up to
 :math:`\frac{n_\text{samples}-n_\text{features}-1}{2}` outliers)
-estimator of covariance. The idea is to find
+estimator of covariance. The idea behind the MCD is to find
 :math:`\frac{n_\text{samples}+n_\text{features}+1}{2}`
 observations whose empirical covariance has the smallest determinant,
 yielding a "pure" subset of observations from which to compute
-standards estimates of location and covariance.
-
-The Minimum Covariance Determinant estimator (MCD) has been introduced
-by P.J.Rousseuw in [1].
+standard estimates of location and covariance. The MCD was introduced by
+P. J. Rousseeuw in [1]_.
 
 This example illustrates how the Mahalanobis distances are affected by
-outlying data: observations drawn from a contaminating distribution
+outlying data. Observations drawn from a contaminating distribution
 are not distinguishable from the observations coming from the real,
-Gaussian distribution that one may want to work with. Using MCD-based
+Gaussian distribution when using standard covariance MLE based Mahalanobis
+distances. Using MCD-based
 Mahalanobis distances, the two populations become
-distinguishable. Associated applications are outliers detection,
-observations ranking, clustering, ...
-For visualization purpose, the cubic root of the Mahalanobis distances
-are represented in the boxplot, as Wilson and Hilferty suggest [2]
+distinguishable. Associated applications include outlier detection,
+observation ranking and clustering.
+
+.. note::
+
+    See also :ref:`sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py`
 
-[1] P. J. Rousseeuw. Least median of squares regression. J. Am
-    Stat Ass, 79:871, 1984.
-[2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square.
-    Proceedings of the National Academy of Sciences of the United States
-    of America, 17, 684-688.
+.. topic:: References:
 
-"""
-print(__doc__)
+    .. [1] P. J. Rousseeuw. `Least median of squares regression
+        <http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/LeastMedianOfSquares.pdf>`_. J. Am
+        Stat Ass, 79:871, 1984.
+    .. [2] Wilson, E. B., & Hilferty, M. M. (1931). `The distribution of chi-square.
+        <https://water.usgs.gov/osw/bulletin17b/Wilson_Hilferty_1931.pdf>`_
+        Proceedings of the National Academy of Sciences of the United States
+        of America, 17, 684-688.
+
+"""  # noqa: E501
+
+# %%
+# Generate data
+# --------------
+#
+# First, we generate a dataset of 125 samples and 2 features. Both features
+# are Gaussian distributed with mean of 0 but feature 1 has a standard
+# deviation equal to 2 and feature 2 has a standard deviation equal to 1. Next,
+# 25 samples are replaced with Gaussian outlier samples where feature 1 has
+# a standard deviation equal to 1 and feature 2 has a standard deviation equal
+# to 7.
 
 import numpy as np
-import matplotlib.pyplot as plt
 
-from sklearn.covariance import EmpiricalCovariance, MinCovDet
+# for consistent results
+np.random.seed(7)
 
 n_samples = 125
 n_outliers = 25
 n_features = 2
 
-# generate data
+# generate Gaussian data of shape (125, 2)
 gen_cov = np.eye(n_features)
 gen_cov[0, 0] = 2.
 X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
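
As a quick numerical check of the squared-distance formula in the docstring above, the sketch below (not part of the commit; the toy data, seed and variable names are illustrative assumptions) computes d^2 = (x - mu)^T Sigma^{-1} (x - mu) by hand and compares it with EmpiricalCovariance.mahalanobis, which returns squared distances:

import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
X_check = rng.randn(100, 2) * [2.0, 1.0]   # toy Gaussian data, feature scales assumed

emp = EmpiricalCovariance().fit(X_check)
centered = X_check - emp.location_         # center on the estimated location
d2_manual = np.einsum("ij,jk,ik->i", centered, emp.get_precision(), centered)

# mahalanobis() returns squared distances, matching the hand computation
assert np.allclose(d2_manual, emp.mahalanobis(X_check))

This is also why the example takes np.sqrt of the mahalanobis() output before contouring: the square root recovers the distance itself.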
@@ -72,73 +91,103 @@
 outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.
 X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)
 
-# fit a Minimum Covariance Determinant (MCD) robust estimator to data
-robust_cov = MinCovDet().fit(X)
+# %%
+# Comparison of results
+# ---------------------
+#
+# Below, we fit MCD and MLE based covariance estimators to our data and print
+# the estimated covariance matrices. Note that the estimated variance of
+# feature 2 is much higher with the MLE based estimator (7.5) than
+# that of the MCD robust estimator (1.2). This shows that the MCD based
+# robust estimator is much more resistant to the outlier samples, which were
+# designed to have a much larger variance in feature 2.
 
-# compare estimators learnt from the full data set with true parameters
-emp_cov = EmpiricalCovariance().fit(X)
+import matplotlib.pyplot as plt
+from sklearn.covariance import EmpiricalCovariance, MinCovDet
 
-# #############################################################################
-# Display results
-fig = plt.figure()
-plt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)
-
-# Show data set
-subfig1 = plt.subplot(3, 1, 1)
-inlier_plot = subfig1.scatter(X[:, 0], X[:, 1],
-                              color='black', label='inliers')
-outlier_plot = subfig1.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
-                               color='red', label='outliers')
-subfig1.set_xlim(subfig1.get_xlim()[0], 11.)
-subfig1.set_title("Mahalanobis distances of a contaminated data set:")
-
-# Show contours of the distance functions
+# fit an MCD robust estimator to data
+robust_cov = MinCovDet().fit(X)
+# fit an MLE estimator to data
+emp_cov = EmpiricalCovariance().fit(X)
+print('Estimated covariance matrix:\n'
+      'MCD (Robust):\n{}\n'
+      'MLE:\n{}'.format(robust_cov.covariance_, emp_cov.covariance_))
+
+# %%
+# To better visualize the difference, we plot contours of the
+# Mahalanobis distances calculated by both methods. Notice that the robust
+# MCD based Mahalanobis distances fit the inlier black points much better,
+# whereas the MLE based distances are more influenced by the outlier
+# red points.
+
+fig, ax = plt.subplots(figsize=(10, 5))
+# Plot data set
+inlier_plot = ax.scatter(X[:, 0], X[:, 1],
+                         color='black', label='inliers')
+outlier_plot = ax.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
+                          color='red', label='outliers')
+ax.set_xlim(ax.get_xlim()[0], 10.)
+ax.set_title("Mahalanobis distances of a contaminated data set")
+
+# Create meshgrid of feature 1 and feature 2 values
 xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
                      np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
 zz = np.c_[xx.ravel(), yy.ravel()]
-
+# Calculate the MLE based Mahalanobis distances of the meshgrid
 mahal_emp_cov = emp_cov.mahalanobis(zz)
 mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
-emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
-                                  cmap=plt.cm.PuBu_r,
-                                  linestyles='dashed')
-
+emp_cov_contour = plt.contour(xx, yy, np.sqrt(mahal_emp_cov),
+                              cmap=plt.cm.PuBu_r, linestyles='dashed')
+# Calculate the MCD based Mahalanobis distances
 mahal_robust_cov = robust_cov.mahalanobis(zz)
 mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
-robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
-                                 cmap=plt.cm.YlOrBr_r, linestyles='dotted')
+robust_contour = ax.contour(xx, yy, np.sqrt(mahal_robust_cov),
+                            cmap=plt.cm.YlOrBr_r, linestyles='dotted')
 
-subfig1.legend([emp_cov_contour.collections[1], robust_contour.collections[1],
-                inlier_plot, outlier_plot],
-               ['MLE dist', 'robust dist', 'inliers', 'outliers'],
-               loc="upper right", borderaxespad=0)
-plt.xticks(())
-plt.yticks(())
+# Add legend
+ax.legend([emp_cov_contour.collections[1], robust_contour.collections[1],
+           inlier_plot, outlier_plot],
+          ['MLE dist', 'MCD dist', 'inliers', 'outliers'],
+          loc="upper right", borderaxespad=0)
 
-# Plot the scores for each point
-emp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)
-subfig2 = plt.subplot(2, 2, 3)
-subfig2.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=.25)
-subfig2.plot(np.full(n_samples - n_outliers, 1.26),
-             emp_mahal[:-n_outliers], '+k', markeredgewidth=1)
-subfig2.plot(np.full(n_outliers, 2.26),
-             emp_mahal[-n_outliers:], '+k', markeredgewidth=1)
-subfig2.axes.set_xticklabels(('inliers', 'outliers'), size=15)
-subfig2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
-subfig2.set_title("1. from non-robust estimates\n(Maximum Likelihood)")
-plt.yticks(())
+plt.show()
+
+# %%
+# Finally, we highlight the ability of MCD based Mahalanobis distances to
+# distinguish outliers. We take the cubic root of the Mahalanobis distances,
+# yielding approximately normal distributions (as suggested by Wilson and
+# Hilferty [2]_), then plot the values of inlier and outlier samples with
+# boxplots. The distribution of outlier samples is more separated from the
+# distribution of inlier samples for robust MCD based Mahalanobis distances.
 
+fig, (ax1, ax2) = plt.subplots(1, 2)
+plt.subplots_adjust(wspace=.6)
+
+# Calculate cubic root of MLE Mahalanobis distances for samples
+emp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)
+# Plot boxplots
+ax1.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=.25)
+# Plot individual samples
+ax1.plot(np.full(n_samples - n_outliers, 1.26), emp_mahal[:-n_outliers],
+         '+k', markeredgewidth=1)
+ax1.plot(np.full(n_outliers, 2.26), emp_mahal[-n_outliers:],
+         '+k', markeredgewidth=1)
+ax1.axes.set_xticklabels(('inliers', 'outliers'), size=15)
+ax1.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
+ax1.set_title("Using non-robust estimates\n(Maximum Likelihood)")
+
+# Calculate cubic root of MCD Mahalanobis distances for samples
 robust_mahal = robust_cov.mahalanobis(X - robust_cov.location_) ** (0.33)
-subfig3 = plt.subplot(2, 2, 4)
-subfig3.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]],
-                widths=.25)
-subfig3.plot(np.full(n_samples - n_outliers, 1.26),
-             robust_mahal[:-n_outliers], '+k', markeredgewidth=1)
-subfig3.plot(np.full(n_outliers, 2.26),
-             robust_mahal[-n_outliers:], '+k', markeredgewidth=1)
-subfig3.axes.set_xticklabels(('inliers', 'outliers'), size=15)
-subfig3.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
-subfig3.set_title("2. from robust estimates\n(Minimum Covariance Determinant)")
-plt.yticks(())
+# Plot boxplots
+ax2.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]],
+            widths=.25)
+# Plot individual samples
+ax2.plot(np.full(n_samples - n_outliers, 1.26), robust_mahal[:-n_outliers],
+         '+k', markeredgewidth=1)
+ax2.plot(np.full(n_outliers, 2.26), robust_mahal[-n_outliers:],
+         '+k', markeredgewidth=1)
+ax2.axes.set_xticklabels(('inliers', 'outliers'), size=15)
+ax2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
+ax2.set_title("Using robust estimates\n(Minimum Covariance Determinant)")
 
 plt.show()
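
For readers who want the MLE-versus-MCD contrast from this example without the plotting code, here is a minimal sketch (the sample size, 10% contamination level, seed and variable names are illustrative assumptions, not values from the commit):

import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(42)
X_demo = rng.randn(200, 2)               # inliers drawn from N(0, I)
X_demo[-20:] = rng.randn(20, 2) * 7.0    # 10% outliers with inflated variance

mle = EmpiricalCovariance().fit(X_demo)
mcd = MinCovDet(random_state=42).fit(X_demo)

print("MLE covariance:\n", mle.covariance_)  # variances pulled up by the outliers
print("MCD covariance:\n", mcd.covariance_)  # stays close to the identity

MinCovDet also accepts a support_fraction parameter to control how many observations define the raw MCD fit; left at its default it falls back to the minimal (n_samples + n_features + 1) / 2 support that the docstring mentions.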
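
The cube-root transform used for the boxplots can also be checked directly: for Gaussian data the squared Mahalanobis distances follow a chi-squared distribution, and Wilson and Hilferty's result says their cube root is approximately normal. A small sketch (assuming 2 features, hence 2 degrees of freedom, and using scipy only for the skewness measure):

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
d2 = rng.chisquare(df=2, size=10000)   # squared distances ~ chi^2 with 2 dof

print("skewness of d^2:      ", stats.skew(d2))           # strongly right-skewed
print("skewness of cbrt(d^2):", stats.skew(np.cbrt(d2)))  # near 0, roughly normal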

0 commit comments
