3 | 3 | Robust covariance estimation and Mahalanobis distances relevance
4 | 4 | ================================================================
5 | 5 |
6 |  | -An example to show covariance estimation with the Mahalanobis
| 6 | +This example shows covariance estimation with Mahalanobis
7 | 7 | distances on Gaussian distributed data.
8 | 8 |
9 | 9 | For Gaussian distributed data, the distance of an observation
10 | 10 | :math:`x_i` to the mode of the distribution can be computed using its
11 |  | -Mahalanobis distance: :math:`d_{(\mu,\Sigma)}(x_i)^2 = (x_i -
12 |  | -\mu)'\Sigma^{-1}(x_i - \mu)` where :math:`\mu` and :math:`\Sigma` are
13 |  | -the location and the covariance of the underlying Gaussian
14 |  | -distribution.
| 11 | +Mahalanobis distance:
| 12 | +
| 13 | +.. math::
| 14 | +
| 15 | +    d_{(\mu,\Sigma)}(x_i)^2 = (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)
| 16 | +
| 17 | +where :math:`\mu` and :math:`\Sigma` are the location and the covariance of
| 18 | +the underlying Gaussian distribution.
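A quick illustrative aside (not part of the committed example): the squared Mahalanobis distance defined above can be computed directly with NumPy. The toy data, and the use of the sample mean and sample covariance as stand-ins for :math:`\mu` and :math:`\Sigma`, are assumptions of this sketch.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 2)             # toy Gaussian data, assumed for illustration
mu = X.mean(axis=0)               # location estimate
Sigma = np.cov(X, rowvar=False)   # covariance estimate
Sigma_inv = np.linalg.inv(Sigma)

centered = X - mu
# d_i^2 = (x_i - mu)^T Sigma^{-1} (x_i - mu), evaluated for every row at once
d2 = np.einsum('ij,jk,ik->i', centered, Sigma_inv, centered)

Large values of d2 flag observations that lie far from the bulk of the data under the fitted Gaussian model.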
15 | 19 |
16 | 20 | In practice, :math:`\mu` and :math:`\Sigma` are replaced by some
17 |  | -estimates. The usual covariance maximum likelihood estimate is very
18 |  | -sensitive to the presence of outliers in the data set and therefor,
19 |  | -the corresponding Mahalanobis distances are. One would better have to
| 21 | +estimates. The standard covariance maximum likelihood estimate (MLE) is very
| 22 | +sensitive to the presence of outliers in the data set, and so are the
| 23 | +resulting Mahalanobis distances. It would be better to
20 | 24 | use a robust estimator of covariance to guarantee that the estimation is
21 |  | -resistant to "erroneous" observations in the data set and that the
22 |  | -associated Mahalanobis distances accurately reflect the true
23 |  | -organisation of the observations.
| 25 | +resistant to "erroneous" observations in the dataset and that the
| 26 | +calculated Mahalanobis distances accurately reflect the true
| 27 | +organization of the observations.
24 | 28 |
25 |  | -The Minimum Covariance Determinant estimator is a robust,
| 29 | +The Minimum Covariance Determinant estimator (MCD) is a robust,
26 | 30 | high-breakdown point (i.e. it can be used to estimate the covariance
27 | 31 | matrix of highly contaminated datasets, up to
28 | 32 | :math:`\frac{n_\text{samples}-n_\text{features}-1}{2}` outliers)
29 |  | -estimator of covariance. The idea is to find
| 33 | +estimator of covariance. The idea behind the MCD is to find
30 | 34 | :math:`\frac{n_\text{samples}+n_\text{features}+1}{2}`
31 | 35 | observations whose empirical covariance has the smallest determinant,
32 | 36 | yielding a "pure" subset of observations from which to compute
33 |  | -standards estimates of location and covariance.
34 |  | -
35 |  | -The Minimum Covariance Determinant estimator (MCD) has been introduced
36 |  | -by P.J.Rousseuw in [1].
| 37 | +standard estimates of location and covariance. The MCD was introduced by
| 38 | +P. J. Rousseeuw in [1]_.
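To make the subset idea concrete: with the 125 samples and 2 features used later in this example, the MCD fits on (125 + 2 + 1) / 2 = 64 observations and tolerates up to (125 - 2 - 1) / 2 = 61 outliers. The sketch below is illustrative only and not part of the diff; the contaminated toy data are an assumption. It shows how scikit-learn's MinCovDet exposes the retained "pure" subset through its support_ mask (a reweighting step follows the raw subset search, so support_ may flag somewhat more than 64 points).

import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.RandomState(0)
X = rng.randn(125, 2)      # inlier Gaussian data
X[-25:] *= 7.              # crudely contaminate the last 25 samples

mcd = MinCovDet(random_state=0).fit(X)
print(mcd.support_.sum())  # number of observations retained for the robust fit
print(mcd.location_)       # robust location estimate
print(mcd.covariance_)     # robust covariance estimate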
37 | 39 |
38 | 40 | This example illustrates how the Mahalanobis distances are affected by
39 |  | -outlying data: observations drawn from a contaminating distribution
| 41 | +outlying data. Observations drawn from a contaminating distribution
40 | 42 | are not distinguishable from the observations coming from the real,
41 |  | -Gaussian distribution that one may want to work with. Using MCD-based
| 43 | +Gaussian distribution when using standard covariance MLE based Mahalanobis
| 44 | +distances. Using MCD-based
42 | 45 | Mahalanobis distances, the two populations become
43 |  | -distinguishable. Associated applications are outliers detection,
44 |  | -observations ranking, clustering, ...
45 |  | -For visualization purpose, the cubic root of the Mahalanobis distances
46 |  | -are represented in the boxplot, as Wilson and Hilferty suggest [2]
| 46 | +distinguishable. Associated applications include outlier detection,
| 47 | +observation ranking and clustering.
| 48 | +
| 49 | +.. note::
| 50 | +
| 51 | +    See also :ref:`sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py`
47 | 52 |
48 |  | -[1] P. J. Rousseeuw. Least median of squares regression. J. Am
49 |  | -    Stat Ass, 79:871, 1984.
50 |  | -[2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square.
51 |  | -    Proceedings of the National Academy of Sciences of the United States
52 |  | -    of America, 17, 684-688.
| 53 | +.. topic:: References:
53 | 54 |
54 |  | -"""
55 |  | -print(__doc__)
| 55 | +    .. [1] P. J. Rousseeuw. `Least median of squares regression
| 56 | +        <http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/LeastMedianOfSquares.pdf>`_. J. Am
| 57 | +        Stat Ass, 79:871, 1984.
| 58 | +    .. [2] Wilson, E. B., & Hilferty, M. M. (1931). `The distribution of chi-square.
| 59 | +        <https://water.usgs.gov/osw/bulletin17b/Wilson_Hilferty_1931.pdf>`_
| 60 | +        Proceedings of the National Academy of Sciences of the United States
| 61 | +        of America, 17, 684-688.
| 62 | +
| 63 | +"""  # noqa: E501
| 64 | +
| 65 | +# %%
| 66 | +# Generate data
| 67 | +# --------------
| 68 | +#
| 69 | +# First, we generate a dataset of 125 samples and 2 features. Both features
| 70 | +# are Gaussian distributed with a mean of 0, but feature 1 has a standard
| 71 | +# deviation equal to 2 and feature 2 has a standard deviation equal to 1. Next,
| 72 | +# 25 samples are replaced with Gaussian outlier samples where feature 1 has
| 73 | +# a standard deviation equal to 1 and feature 2 has a standard deviation equal
| 74 | +# to 7.
56 | 75 |
57 | 76 | import numpy as np
58 |  | -import matplotlib.pyplot as plt
59 | 77 |
60 |  | -from sklearn.covariance import EmpiricalCovariance, MinCovDet
| 78 | +# for consistent results
| 79 | +np.random.seed(7)
61 | 80 |
62 | 81 | n_samples = 125
63 | 82 | n_outliers = 25
64 | 83 | n_features = 2
65 | 84 |
66 |  | -# generate data
| 85 | +# generate Gaussian data of shape (125, 2)
67 | 86 | gen_cov = np.eye(n_features)
68 | 87 | gen_cov[0, 0] = 2.
69 | 88 | X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
70 | 89 | # add some outliers
71 | 90 | outliers_cov = np.eye(n_features)
72 | 91 | outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.
73 | 92 | X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)
74 | 93 |
75 |  | -# fit a Minimum Covariance Determinant (MCD) robust estimator to data
76 |  | -robust_cov = MinCovDet().fit(X)
| 94 | +# %%
| 95 | +# Comparison of results
| 96 | +# ---------------------
| 97 | +#
| 98 | +# Below, we fit MCD and MLE based covariance estimators to our data and print
| 99 | +# the estimated covariance matrices. Note that the estimated variance of
| 100 | +# feature 2 is much higher with the MLE based estimator (7.5) than
| 101 | +# that of the MCD robust estimator (1.2). This shows that the MCD based
| 102 | +# robust estimator is much more resistant to the outlier samples, which were
| 103 | +# designed to have a much larger variance in feature 2.
77 | 104 |
78 |  | -# compare estimators learnt from the full data set with true parameters
79 |  | -emp_cov = EmpiricalCovariance().fit(X)
| 105 | +import matplotlib.pyplot as plt
| 106 | +from sklearn.covariance import EmpiricalCovariance, MinCovDet
80 | 107 |
81 |  | -# #############################################################################
82 |  | -# Display results
83 |  | -fig = plt.figure()
84 |  | -plt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)
85 |  | -
86 |  | -# Show data set
87 |  | -subfig1 = plt.subplot(3, 1, 1)
88 |  | -inlier_plot = subfig1.scatter(X[:, 0], X[:, 1],
89 |  | -                              color='black', label='inliers')
90 |  | -outlier_plot = subfig1.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
91 |  | -                               color='red', label='outliers')
92 |  | -subfig1.set_xlim(subfig1.get_xlim()[0], 11.)
93 |  | -subfig1.set_title("Mahalanobis distances of a contaminated data set:")
94 |  | -
95 |  | -# Show contours of the distance functions
| 108 | +# fit an MCD robust estimator to data
| 109 | +robust_cov = MinCovDet().fit(X)
| 110 | +# fit an MLE estimator to data
| 111 | +emp_cov = EmpiricalCovariance().fit(X)
| 112 | +print('Estimated covariance matrices:\n'
| 113 | +      'MCD (Robust):\n{}\n'
| 114 | +      'MLE:\n{}'.format(robust_cov.covariance_, emp_cov.covariance_))
| 115 | +
| 116 | +# %%
| 117 | +# To better visualize the difference, we plot contours of the
| 118 | +# Mahalanobis distances calculated by both methods. Notice that the robust
| 119 | +# MCD based Mahalanobis distances fit the inlier black points much better,
| 120 | +# whereas the MLE based distances are more influenced by the outlier
| 121 | +# red points.
| 122 | +
| 123 | +fig, ax = plt.subplots(figsize=(10, 5))
| 124 | +# Plot data set
| 125 | +inlier_plot = ax.scatter(X[:, 0], X[:, 1],
| 126 | +                         color='black', label='inliers')
| 127 | +outlier_plot = ax.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
| 128 | +                          color='red', label='outliers')
| 129 | +ax.set_xlim(ax.get_xlim()[0], 10.)
| 130 | +ax.set_title("Mahalanobis distances of a contaminated data set")
| 131 | +
| 132 | +# Create meshgrid of feature 1 and feature 2 values
96 | 133 | xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
97 | 134 |                      np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
98 | 135 | zz = np.c_[xx.ravel(), yy.ravel()]
99 |  | -
| 136 | +# Calculate the MLE based Mahalanobis distances of the meshgrid
100 | 137 | mahal_emp_cov = emp_cov.mahalanobis(zz)
101 | 138 | mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
102 |  | -emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
103 |  | -                                  cmap=plt.cm.PuBu_r,
104 |  | -                                  linestyles='dashed')
105 |  | -
| 139 | +emp_cov_contour = plt.contour(xx, yy, np.sqrt(mahal_emp_cov),
| 140 | +                              cmap=plt.cm.PuBu_r, linestyles='dashed')
| 141 | +# Calculate the MCD based Mahalanobis distances
106 | 142 | mahal_robust_cov = robust_cov.mahalanobis(zz)
107 | 143 | mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
108 |  | -robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
109 |  | -                                 cmap=plt.cm.YlOrBr_r, linestyles='dotted')
| 144 | +robust_contour = ax.contour(xx, yy, np.sqrt(mahal_robust_cov),
| 145 | +                            cmap=plt.cm.YlOrBr_r, linestyles='dotted')
110 | 146 |
111 |  | -subfig1.legend([emp_cov_contour.collections[1], robust_contour.collections[1],
112 |  | -                inlier_plot, outlier_plot],
113 |  | -               ['MLE dist', 'robust dist', 'inliers', 'outliers'],
114 |  | -               loc="upper right", borderaxespad=0)
115 |  | -plt.xticks(())
116 |  | -plt.yticks(())
| 147 | +# Add legend
| 148 | +ax.legend([emp_cov_contour.collections[1], robust_contour.collections[1],
| 149 | +           inlier_plot, outlier_plot],
| 150 | +          ['MLE dist', 'MCD dist', 'inliers', 'outliers'],
| 151 | +          loc="upper right", borderaxespad=0)
117 | 152 |
118 |  | -# Plot the scores for each point
119 |  | -emp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)
120 |  | -subfig2 = plt.subplot(2, 2, 3)
121 |  | -subfig2.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=.25)
122 |  | -subfig2.plot(np.full(n_samples - n_outliers, 1.26),
123 |  | -             emp_mahal[:-n_outliers], '+k', markeredgewidth=1)
124 |  | -subfig2.plot(np.full(n_outliers, 2.26),
125 |  | -             emp_mahal[-n_outliers:], '+k', markeredgewidth=1)
126 |  | -subfig2.axes.set_xticklabels(('inliers', 'outliers'), size=15)
127 |  | -subfig2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
128 |  | -subfig2.set_title("1. from non-robust estimates\n(Maximum Likelihood)")
129 |  | -plt.yticks(())
| 153 | +plt.show()
| 154 | +
| 155 | +# %%
| 156 | +# Finally, we highlight the ability of MCD based Mahalanobis distances to
| 157 | +# distinguish outliers. We take the cubic root of the Mahalanobis distances,
| 158 | +# yielding approximately normal distributions (as suggested by Wilson and
| 159 | +# Hilferty [2]_), then plot the values of inlier and outlier samples with
| 160 | +# boxplots. The distribution of outlier samples is more separated from the
| 161 | +# distribution of inlier samples for robust MCD based Mahalanobis distances.
130 | 162 |
| 163 | +fig, (ax1, ax2) = plt.subplots(1, 2)
| 164 | +plt.subplots_adjust(wspace=.6)
| 165 | +
| 166 | +# Calculate cubic root of MLE Mahalanobis distances for samples
| 167 | +emp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)
| 168 | +# Plot boxplots
| 169 | +ax1.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=.25)
| 170 | +# Plot individual samples
| 171 | +ax1.plot(np.full(n_samples - n_outliers, 1.26), emp_mahal[:-n_outliers],
| 172 | +         '+k', markeredgewidth=1)
| 173 | +ax1.plot(np.full(n_outliers, 2.26), emp_mahal[-n_outliers:],
| 174 | +         '+k', markeredgewidth=1)
| 175 | +ax1.axes.set_xticklabels(('inliers', 'outliers'), size=15)
| 176 | +ax1.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
| 177 | +ax1.set_title("Using non-robust estimates\n(Maximum Likelihood)")
| 178 | +
| 179 | +# Calculate cubic root of MCD Mahalanobis distances for samples
131 | 180 | robust_mahal = robust_cov.mahalanobis(X - robust_cov.location_) ** (0.33)
132 |  | -subfig3 = plt.subplot(2, 2, 4)
133 |  | -subfig3.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]],
134 |  | -                widths=.25)
135 |  | -subfig3.plot(np.full(n_samples - n_outliers, 1.26),
136 |  | -             robust_mahal[:-n_outliers], '+k', markeredgewidth=1)
137 |  | -subfig3.plot(np.full(n_outliers, 2.26),
138 |  | -             robust_mahal[-n_outliers:], '+k', markeredgewidth=1)
139 |  | -subfig3.axes.set_xticklabels(('inliers', 'outliers'), size=15)
140 |  | -subfig3.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
141 |  | -subfig3.set_title("2. from robust estimates\n(Minimum Covariance Determinant)")
142 |  | -plt.yticks(())
| 181 | +# Plot boxplots
| 182 | +ax2.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]],
| 183 | +            widths=.25)
| 184 | +# Plot individual samples
| 185 | +ax2.plot(np.full(n_samples - n_outliers, 1.26), robust_mahal[:-n_outliers],
| 186 | +         '+k', markeredgewidth=1)
| 187 | +ax2.plot(np.full(n_outliers, 2.26), robust_mahal[-n_outliers:],
| 188 | +         '+k', markeredgewidth=1)
| 189 | +ax2.axes.set_xticklabels(('inliers', 'outliers'), size=15)
| 190 | +ax2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
| 191 | +ax2.set_title("Using robust estimates\n(Minimum Covariance Determinant)")
143 | 192 |
144 | 193 | plt.show()