Conversation


@Erotemic Erotemic commented Sep 9, 2016

The following change significantly speeds up the kmeans++ initialization used
in MiniBatchKMeans.

The Euclidean distance computation is the bottleneck in kmeans++, but every call
to euclidean_distances also triggers a call to check_pairwise_arrays. In kmeans++,
the same Y array is re-checked on every call, and one of those checks, specifically
the finite check, turns out to cause a real speed issue. This patch adds a flag to
disable that check.

I'm not sure if this is the desired way to go about this change, but I do think
something needs to be done about this function's efficiency.
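
For context, here is a minimal usage sketch of the flag as it appears in this
version of the patch (the force_all_finite keyword is the one added here, later
renamed check_input; it is not pre-existing sklearn API):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

centers = np.random.rand(1, 128)
X = np.random.rand(10000, 128)

# Validate once up front, then skip the full finite scan of X on every
# subsequent call in the hot loop.
assert np.isfinite(X).all() and np.isfinite(centers).all()
for _ in range(1000):
    dists = euclidean_distances(centers, X, squared=True,
                                force_all_finite=False)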

Here is some data that shows the speed increase. For these tests I'm clustering
with n_clusters=1000, the feature dimension is 128, and the number of data points
is 10 * n_clusters. I then profiled different versions of the code. First, here
is the slow version, where I force it to perform the finite check every time.

Total time: 5.67384 s
File: /home/joncrall/code/scikit-learn/sklearn/metrics/pairwise.py
Function: euclidean_distances at line 173

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   173                                           @ut.profile
   174                                           def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
   175                                                                   X_norm_squared=None, force_all_finite=True):
   237      1000      1402395   1402.4     24.7      X, Y = check_pairwise_arrays(X, Y, force_all_finite=True)
   239                                           
   240      1000         1000      1.0      0.0      if X_norm_squared is not None:
   241                                                   XX = check_array(X_norm_squared)
   242                                                   if XX.shape == (1, X.shape[0]):
   243                                                       XX = XX.T
   244                                                   elif XX.shape != (X.shape[0], 1):
   245                                                       raise ValueError(
   246                                                           "Incompatible dimensions for X and X_norm_squared")
   247                                               else:
   248      1000        36429     36.4      0.6          XX = row_norms(X, squared=True)[:, np.newaxis]
   249                                           
   250      1000         1251      1.3      0.0      if X is Y:  # shortcut in the common case euclidean_distances(X, X)
   251                                                   YY = XX.T
   252      1000          976      1.0      0.0      elif Y_norm_squared is not None:
   253      1000        15173     15.2      0.3          YY = np.atleast_2d(Y_norm_squared)
   254                                           
   255      1000         2450      2.5      0.0          if YY.shape != (1, Y.shape[0]):
   256                                                       raise ValueError(
   257                                                           "Incompatible dimensions for Y and Y_norm_squared")
   258                                               else:
   259                                                   YY = row_norms(Y, squared=True)[np.newaxis, :]
   260                                           
   261      1000      3846672   3846.7     67.8      distances = safe_sparse_dot(X, Y.T, dense_output=True)
   262      1000       108011    108.0      1.9      distances *= -2
   263      1000        64575     64.6      1.1      distances += XX
   264      1000        64196     64.2      1.1      distances += YY
   265      1000       128121    128.1      2.3      np.maximum(distances, 0, out=distances)
   266                                           
   267      1000         1648      1.6      0.0      if X is Y:
   268                                                   # Ensure that distances between vectors and themselves are set to 0.0.
   269                                                   # This may not be the case due to floating point rounding errors.
   270                                                   distances.flat[::distances.shape[0] + 1] = 0.0
   271                                           
   272      1000          941      0.9      0.0      return distances if squared else np.sqrt(distances, out=distances)

Digging a little deeper shows the timings inside check_pairwise_arrays.
As you can see, most of this function's time is spent in check_array.

Total time: 1.40017 s
File: /home/joncrall/code/scikit-learn/sklearn/metrics/pairwise.py
Function: check_pairwise_arrays at line 58

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    58                                           @ut.profile
    59                                           def check_pairwise_arrays(X, Y, precomputed=False, dtype=None,
    60                                                                     force_all_finite=True):
   104      1000         8652      8.7      0.6      X, Y, dtype_float = _return_float_dtype(X, Y)
   105                                           
   106      1000          659      0.7      0.0      warn_on_dtype = dtype is not None
   107      1000          460      0.5      0.0      estimator = 'check_pairwise_arrays'
   108      1000          507      0.5      0.0      if dtype is None:
   109      1000          524      0.5      0.0          dtype = dtype_float
   110                                           
   111      1000          581      0.6      0.0      if Y is X or Y is None:
   112                                                   X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
   113                                                                       warn_on_dtype=warn_on_dtype, estimator=estimator,
   114                                                                       force_all_finite=force_all_finite)
   115                                               else:
   116      1000          761      0.8      0.1          X = check_array(X, accept_sparse='csr', dtype=dtype,
   117      1000          481      0.5      0.0                          warn_on_dtype=warn_on_dtype, estimator=estimator,
   118      1000       153559    153.6     11.0                          force_all_finite=force_all_finite)
   119      1000          633      0.6      0.0          Y = check_array(Y, accept_sparse='csr', dtype=dtype,
   120      1000          514      0.5      0.0                          warn_on_dtype=warn_on_dtype, estimator=estimator,
   121      1000      1230357   1230.4     87.9                          force_all_finite=force_all_finite)
   122                                           
   123      1000          559      0.6      0.0      if precomputed:
   124                                                   if X.shape[1] != Y.shape[0]:
   125                                                       raise ValueError("Precomputed metric requires shape "
   126                                                                        "(n_queries, n_indexed). Got (%d, %d) "
   127                                                                        "for %d indexed." %
   128                                                                        (X.shape[0], X.shape[1], Y.shape[0]))
   129      1000         1287      1.3      0.1      elif X.shape[1] != Y.shape[1]:
   130                                                   raise ValueError("Incompatible dimension for X and Y matrices: "
   131                                                                    "X.shape[1] == %d while Y.shape[1] == %d" % (
   132                                                                        X.shape[1], Y.shape[1]))
   133                                           
   134      1000          638      0.6      0.0      return X, Y

Looking at check_array, we see the offending call to _assert_all_finite.

Total time: 1.28438 s
File: /home/joncrall/code/scikit-learn/sklearn/utils/validation.py
Function: check_array at line 271

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   271                                           @ut.profile
   272                                           def check_array(array, accept_sparse=None, dtype="numeric", order=None,
   273                                                           copy=False, force_all_finite=True, ensure_2d=True,
   274                                                           allow_nd=False, ensure_min_samples=1, ensure_min_features=1,
   275                                                           warn_on_dtype=False, estimator=None):

   342      2001         2498      1.2      0.2      if isinstance(accept_sparse, str):
   343      2000         2531      1.3      0.2          accept_sparse = [accept_sparse]
   344                                           
   345                                               # store whether originally we wanted numeric dtype
   346      2001         4113      2.1      0.3      dtype_numeric = dtype == "numeric"
   347                                           
   348      2001         3419      1.7      0.3      dtype_orig = getattr(array, "dtype", None)
   349      2001         4128      2.1      0.3      if not hasattr(dtype_orig, 'kind'):
   350                                                   # not a data type (e.g. a column named dtype in a pandas DataFrame)
   351                                                   dtype_orig = None
   352                                           
   353      2001         1811      0.9      0.1      if dtype_numeric:
   354         1            1      1.0      0.0          if dtype_orig is not None and dtype_orig.kind == "O":
   355                                                       # if input is object, convert to float.
   356                                                       dtype = np.float64
   357                                                   else:
   358         1            1      1.0      0.0              dtype = None
   359                                           
   360      2001         4210      2.1      0.3      if isinstance(dtype, (list, tuple)):
   361                                                   if dtype_orig is not None and dtype_orig in dtype:
   362                                                       # no dtype conversion required
   363                                                       dtype = None
   364                                                   else:
   365                                                       # dtype conversion required. Let's select the first element of the
   366                                                       # list of accepted types.
   367                                                       dtype = dtype[0]
   368                                           
   369      2001         1944      1.0      0.2      if estimator is not None:
   370      2000         3129      1.6      0.2          if isinstance(estimator, six.string_types):
   371      2000         1908      1.0      0.1              estimator_name = estimator
   372                                                   else:
   373                                                       estimator_name = estimator.__class__.__name__
   374                                               else:
   375         1            1      1.0      0.0          estimator_name = "Estimator"
   376      2001         3926      2.0      0.3      context = " by %s" % estimator_name if estimator is not None else ""
   377                                           
   378      2001         3753      1.9      0.3      if sp.issparse(array):
   379                                                   array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
   380                                                                                 force_all_finite)
   381                                               else:
   382      2001         8476      4.2      0.7          array = np.array(array, dtype=dtype, order=order, copy=copy)
   383                                           
   384      2001         2034      1.0      0.2          if ensure_2d:
   385      2001         2534      1.3      0.2              if array.ndim == 1:
   386                                                           if ensure_min_samples >= 2:
   387                                                               raise ValueError("%s expects at least 2 samples provided "
   388                                                                                "in a 2 dimensional array-like input"
   389                                                                                % estimator_name)
   390                                                           warnings.warn(
   391                                                               "Passing 1d arrays as data is deprecated in 0.17 and will "
   392                                                               "raise ValueError in 0.19. Reshape your data either using "
   393                                                               "X.reshape(-1, 1) if your data has a single feature or "
   394                                                               "X.reshape(1, -1) if it contains a single sample.",
   395                                                               DeprecationWarning)
   396      2001        16453      8.2      1.3              array = np.atleast_2d(array)
   397                                                       # To ensure that array flags are maintained
   398      2001         3891      1.9      0.3              array = np.array(array, dtype=dtype, order=order, copy=copy)
   399                                           
   400                                                   # make sure we actually converted to numeric:
   401      2001         2012      1.0      0.2          if dtype_numeric and array.dtype.kind == "O":
   402                                                       array = array.astype(np.float64)
   403      2001         2282      1.1      0.2          if not allow_nd and array.ndim >= 3:
   404                                                       raise ValueError("Found array with dim %d. %s expected <= 2."
   405                                                                        % (array.ndim, estimator_name))
   406      2001         1898      0.9      0.1          if force_all_finite:
   407      2001      1114783    557.1     86.8              _assert_all_finite(array)
   408                                           
   409      2001        61351     30.7      4.8      shape_repr = _shape_repr(array.shape)
   410      2001         2295      1.1      0.2      if ensure_min_samples > 0:
   411      2001        16251      8.1      1.3          n_samples = _num_samples(array)
   412      2001         2250      1.1      0.2          if n_samples < ensure_min_samples:
   413                                                       raise ValueError("Found array with %d sample(s) (shape=%s) while a"
   414                                                                        " minimum of %d is required%s."
   415                                                                        % (n_samples, shape_repr, ensure_min_samples,
   416                                                                           context))
   417                                           
   418      2001         2557      1.3      0.2      if ensure_min_features > 0 and array.ndim == 2:
   419      2001         2229      1.1      0.2          n_features = array.shape[1]
   420      2001         1982      1.0      0.2          if n_features < ensure_min_features:
   421                                                       raise ValueError("Found array with %d feature(s) (shape=%s) while"
   422                                                                        " a minimum of %d is required%s."
   423                                                                        % (n_features, shape_repr, ensure_min_features,
   424                                                                           context))
   425                                           
   426      2001         1958      1.0      0.2      if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:
   427                                                   msg = ("Data with input dtype %s was converted to %s%s."
   428                                                          % (dtype_orig, array.dtype, context))
   429                                                   warnings.warn(msg, _DataConversionWarning)
   430      2001         1776      0.9      0.1      return array

Disabling this check after it runs the first time gives a better profile (shown
further below).

We could probably squeeze out a bit more performance by checking everything once
at the start of kmeans++ and then disabling all subsequent checks. This should be
ok because the new centroids are always chosen from rows of the already-checked
data.
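
As a rough illustration of that check-once idea, here is a self-contained sketch
of basic D^2 seeding (illustrative code, not the actual _k_init; with the patch,
the euclidean_distances calls inside the loop would additionally pass the new
flag to skip validation):

import numpy as np
from sklearn.utils import check_array
from sklearn.utils.extmath import row_norms
from sklearn.metrics.pairwise import euclidean_distances

def kmeanspp_seed(X, n_clusters, rng):
    X = check_array(X, dtype=np.float64)  # validate once, up front
    x_squared_norms = row_norms(X, squared=True)
    centers = [X[rng.randint(len(X))]]
    closest = euclidean_distances(centers[0][np.newaxis, :], X,
                                  Y_norm_squared=x_squared_norms,
                                  squared=True).ravel()
    for _ in range(1, n_clusters):
        # D^2 sampling: points far from the existing centers are more likely
        # to be picked. Every candidate is a row of X, which was validated
        # above, so re-checking inside euclidean_distances is redundant work.
        idx = rng.choice(len(X), p=closest / closest.sum())
        centers.append(X[idx])
        dist = euclidean_distances(X[idx][np.newaxis, :], X,
                                   Y_norm_squared=x_squared_norms,
                                   squared=True).ravel()
        np.minimum(closest, dist, out=closest)
    return np.array(centers)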

Total time: 3.82659 s
File: /home/joncrall/code/scikit-learn/sklearn/metrics/pairwise.py
Function: euclidean_distances at line 173

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   173                                           @ut.profile
   174                                           def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
   175                                                                   X_norm_squared=None, force_all_finite=True):
   238      1000       288126    288.1      7.5      X, Y = check_pairwise_arrays(X, Y, force_all_finite=force_all_finite)
   239                                           
   240      1000          718      0.7      0.0      if X_norm_squared is not None:
   241                                                   XX = check_array(X_norm_squared)
   242                                                   if XX.shape == (1, X.shape[0]):
   243                                                       XX = XX.T
   244                                                   elif XX.shape != (X.shape[0], 1):
   245                                                       raise ValueError(
   246                                                           "Incompatible dimensions for X and X_norm_squared")
   247                                               else:
   248      1000        21165     21.2      0.6          XX = row_norms(X, squared=True)[:, np.newaxis]
   249                                           
   250      1000          927      0.9      0.0      if X is Y:  # shortcut in the common case euclidean_distances(X, X)
   251                                                   YY = XX.T
   252      1000          699      0.7      0.0      elif Y_norm_squared is not None:
   253      1000         8768      8.8      0.2          YY = np.atleast_2d(Y_norm_squared)
   254                                           
   255      1000         1956      2.0      0.1          if YY.shape != (1, Y.shape[0]):
   256                                                       raise ValueError(
   257                                                           "Incompatible dimensions for Y and Y_norm_squared")
   258                                               else:
   259                                                   YY = row_norms(Y, squared=True)[np.newaxis, :]
   260                                           
   261      1000      3174245   3174.2     83.0      distances = safe_sparse_dot(X, Y.T, dense_output=True)
   262      1000        93573     93.6      2.4      distances *= -2
   263      1000        69584     69.6      1.8      distances += XX
   264      1000        72021     72.0      1.9      distances += YY
   265      1000        92431     92.4      2.4      np.maximum(distances, 0, out=distances)
   266                                           
   267      1000         1575      1.6      0.0      if X is Y:
   268                                                   # Ensure that distances between vectors and themselves are set to 0.0.
   269                                                   # This may not be the case due to floating point rounding errors.
   270                                                   distances.flat[::distances.shape[0] + 1] = 0.0
   271                                           
   272      1000          804      0.8      0.0      return distances if squared else np.sqrt(distances, out=distances)

Disabling the profiler and using a coarser function timer, we get the following timings:

Without Checks: 3.9655s
With Checks: 4.9669s

This is a 20% decrease in the amount of time taken (1 second total).

To ensure that this speedup was not specific to parameters resembling my problem,
I did a gridsearch over various parameter values and looked at the percent change.
For larger datasets the change is consistently positive. There are a few negative
changes for small datasets, but these are likely just random fluctuations. For
datasets with at least a 0.1 second speed increase, there is a 15% average
improvement, and the improvement grows for larger datasets.

    n_clusters  n_features  per_cluster  new_speed  old_speed  percent_change  absolute_change
0         2000         512          200  45.520926  56.996406        0.201337        11.475480
1         2000         512          100  46.009816  57.047427        0.193481        11.037611
2         2000         512           10  46.023527  56.965505        0.192081        10.941978
3         2000         512            1  46.036122  57.028242        0.192749        10.992120
4         2000         128          200  13.359815  16.861238        0.207661         3.501423
5         2000         128          100  13.594423  16.169258        0.159243         2.574835
6         2000         128           10  13.041987  16.357325        0.202682         3.315338
7         2000         128            1  13.517229  15.883567        0.148980         2.366338
8         2000          32          200   4.691806   5.384530        0.128651         0.692724
9         2000          32          100   4.785984   5.434597        0.119349         0.648613
10        2000          32           10   4.816490   5.309634        0.092877         0.493144
11        2000          32            1   4.678081   5.419021        0.136729         0.740940
12        2000           4          200   2.194186   2.311234        0.050643         0.117048
13        2000           4          100   2.195776   2.310168        0.049517         0.114392
14        2000           4           10   2.197284   2.313143        0.050087         0.115859
15        2000           4            1   2.245128   2.298406        0.023181         0.053278
16        1000         512          200  10.325005  13.263920        0.221572         2.938915
17        1000         512          100  10.372535  13.245843        0.216922         2.873308
18        1000         512           10  10.340922  13.195515        0.216331         2.854593
19        1000         512            1  10.377185  13.245806        0.216568         2.868621
20        1000         128          200   2.862865   3.559318        0.195670         0.696453
21        1000         128          100   2.814686   3.682527        0.235664         0.867841
22        1000         128           10   2.866434   3.692132        0.223637         0.825698
23        1000         128            1   3.000796   3.592812        0.164778         0.592016
24        1000          32          200   1.063762   1.172979        0.093111         0.109217
25        1000          32          100   1.053794   1.172330        0.101111         0.118536
26        1000          32           10   1.042551   1.164388        0.104636         0.121837
27        1000          32            1   1.045254   1.165895        0.103475         0.120641
28        1000           4          200   0.576731   0.612182        0.057909         0.035451
29        1000           4          100   0.576767   0.609157        0.053172         0.032390
30        1000           4           10   0.598036   0.607962        0.016327         0.009926
31        1000           4            1   0.578046   0.610441        0.053068         0.032395
32         100         512          200   0.124780   0.144742        0.137913         0.019962
33         100         512          100   0.122643   0.144813        0.153094         0.022170
34         100         512           10   0.123119   0.142057        0.133312         0.018938
35         100         512            1   0.122787   0.144781        0.151913         0.021994
36         100         128          200   0.049460   0.058768        0.158387         0.009308
37         100         128          100   0.060528   0.072691        0.167326         0.012163
38         100         128           10   0.049410   0.061536        0.197052         0.012126
39         100         128            1   0.052959   0.061433        0.137941         0.008474
40         100          32          200   0.029961   0.035051        0.145223         0.005090
41         100          32          100   0.028987   0.033165        0.125970         0.004178
42         100          32           10   0.030262   0.039366        0.231266         0.009104
43         100          32            1   0.028978   0.033200        0.127173         0.004222
44         100           4          200   0.022328   0.027136        0.177188         0.004808
45         100           4          100   0.015451   0.017483        0.116230         0.002032
46         100           4           10   0.014638   0.016714        0.124215         0.002076
47         100           4            1   0.014893   0.018615        0.199944         0.003722
48          10         512          200   0.003733   0.002868       -0.301604        -0.000865
49          10         512          100   0.002347   0.002843        0.174507         0.000496
50          10         512           10   0.002344   0.002907        0.193718         0.000563
51          10         512            1   0.002358   0.006877        0.657086         0.004519
52          10         128          200   0.001717   0.002016        0.148196         0.000299
53          10         128          100   0.007036   0.002073       -2.394020        -0.004963
54          10         128           10   0.001862   0.002053        0.093021         0.000191
55          10         128            1   0.001747   0.002061        0.152244         0.000314
56          10          32          200   0.001596   0.001831        0.128385         0.000235
57          10          32          100   0.001543   0.001787        0.136606         0.000244
58          10          32           10   0.001550   0.001786        0.132159         0.000236
59          10          32            1   0.001551   0.001857        0.164719         0.000306
60          10           4          200   0.001525   0.001715        0.110663         0.000190
61          10           4          100   0.007363   0.001704       -3.320974        -0.005659
62          10           4           10   0.001505   0.001696        0.112470         0.000191
63          10           4            1   0.001494   0.009352        0.840255         0.007858

The script I used to generate these numbers is:

"""
python speedup_kmeans.py --profile
python speedup_kmeans.py

For 1000 clusters and 10000 datapoints:
    5 seconds with checks
    3 seconds without checks
    3 seconds without checks and with out safe_dot
"""
from __future__ import absolute_import, division, print_function, unicode_literals
import utool as ut
import sklearn  # NOQA
from sklearn.datasets.samples_generator import make_blobs
from sklearn.utils.extmath import row_norms, squared_norm  # NOQA
import sklearn.cluster
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances  # NOQA
(print, rrr, profile) = ut.inject2(__name__, '[tester]')


#@profile
def test_kmeans_plus_plus_speed(n_clusters=1000, n_features=128, per_cluster=10, fix=True):
    # Make random cluster centers on a ball
    rng = np.random.RandomState(42)
    centers = rng.rand(n_clusters, n_features)
    centers /= np.linalg.norm(centers, axis=0)[None, :]
    centers = (centers * 512).astype(np.int) / 512  # quantize the centers onto a grid
    centers /= np.linalg.norm(centers, axis=0)[None, :]

    n_samples = n_clusters * per_cluster  # dataset size scales with per_cluster
    n_clusters, n_features = centers.shape
    X, true_labels = make_blobs(n_samples=n_samples, centers=centers,
                                cluster_std=1., random_state=42)

    x_squared_norms = row_norms(X, squared=True)

    _k_init = sklearn.cluster.k_means_._k_init
    random_state = rng
    n_local_trials = None  # NOQA

    #print('Testing kmeans init')
    with ut.Timer('testing kmeans init') as t:
        centers = _k_init(X, n_clusters, random_state=random_state, x_squared_norms=x_squared_norms, fix=fix)
    #print('Done testing kmeans init')
    return t.ellapsed


def main():
    basis = {
        'n_clusters': [10, 100, 1000, 2000][::-1],
        'n_features': [4, 32, 128, 512][::-1],
        'per_cluster': [1, 10, 100, 200][::-1],
    }
    vals = []
    for kw in ut.ProgIter(ut.all_dict_combinations(basis), lbl='gridsearch',
                          bs=False, adjust=False, freq=1):
        print(kw)
        new_speed = test_kmeans_plus_plus_speed(fix=True, **kw)
        old_speed = test_kmeans_plus_plus_speed(fix=False, **kw)
        kw['new_speed'] = new_speed
        kw['old_speed'] = old_speed
        vals.append(kw)

    import pandas as pd
    pd.options.display.max_rows = 64
    pd.options.display.width = 100
    pd.options.display.max_columns = 64
    df = pd.DataFrame.from_dict(vals)
    df['percent_change'] = (df['old_speed'] - df['new_speed']) / df['old_speed']
    df = df.reindex_axis(['n_clusters', 'n_features', 'per_cluster', 'new_speed', 'old_speed', 'percent_change'], axis=1)
    df['absolute_change'] = (df['old_speed'] - df['new_speed'])
    print(df)

    print(df['percent_change'][df['absolute_change'] > .1].mean())
    #print(df.loc[df['percent_change'].argsort()[::-1]])

    #try:
    #    profile.dump_stats('out.lprof')
    #    profile.print_stats(stripzeros=True)
    #except Exception:
    #    pass

if __name__ == '__main__':
    main()


# Initialize list of closest distances and calculate current potential
closest_dist_sq = euclidean_distances(
    centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
    squared=True)
Member

silly question: didn't we already check finiteness of everything in the caller?

Contributor Author

There is a check_arrays in the fit method of MiniBatchKMeans, but there doesn't seem to be one that happens before this one in regular KMeans.


Contributor Author
@Erotemic Erotemic Sep 12, 2016

I think the best choice would be to add a check_input flag to _k_init. Then if this flag is true the input is checked at the start of the _k_init function. For minibatch this flag can be set to false and for regular kmeans this flag can be set to true (or we can just put the check in kmeans as well). In all cases check_input would be set to false in the euclidean_distances calls inside _k_init.
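
A sketch of what that plumbing might look like (illustrative signatures, not the merged code):

import numpy as np
from sklearn.utils import check_array

def _k_init(X, n_clusters, x_squared_norms, random_state,
            n_local_trials=None, check_inputs=True):
    if check_inputs:
        # Validate once here; every euclidean_distances call inside the
        # seeding loop can then run with its checks disabled.
        X = check_array(X, accept_sparse='csr', dtype=np.float64)
    ...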

Member

Are you going with that approach?

@amueller
Member

amueller commented Sep 9, 2016

Please address the flake8 errors.
I'm not sure we should add a flag for finiteness specifically. We could also just skip check_array entirely and add a check_input flag. That seems conceptually simpler.

@GaelVaroquaux
Member

Great work!

I am in favor of @amueller's suggestion: skipping check_array entirely in the inner loop, rather than adding a flag to check_array.

@Erotemic
Contributor Author

Sounds good. I'll make those changes and push.

@Erotemic
Contributor Author

I finished these changes. Let me know if anything looks like it needs more work.

    on the number of seeds (2+log(k)); this is the default.

check_inputs : boolean (default=True)
    Whether to check if inputs are are finite and floats.
Member

are are

@amueller amueller changed the title Added flag to disable l2-dist finite check [MRG + 1] Added flag to disable l2-dist finite check Sep 13, 2016
@amueller
Member

LGTM. Can you please re-run your benchmarks with the current version and check that the improvement over master is still there? (it's easy to mess up things when refactoring ;)

@Erotemic
Contributor Author

Erotemic commented Sep 14, 2016

I ran the benchmarks on a smaller grid because the older grid took a long time. This one still shows the same speed improvements. I also added a parameter to test giving integer inputs.

    n_features  asint  per_cluster  n_clusters  new_speed  old_speed  percent_change  absolute_change
0          128   True           20         500   1.728701   3.175227       45.556624         1.446526
1          128   True           10         500   0.964327   1.461100       33.999946         0.496773
18         128  False           20         500   1.820726   2.117415       14.011850         0.296689
19         128  False           10         500   0.737921   1.026572       28.117949         0.288651
3           32   True           20         500   0.681702   0.790321       13.743655         0.108619
22          32  False           10         500   0.361195   0.459011       21.310159         0.097816
5           32   True            1         500   0.117711   0.205754       42.790332         0.088043
4           32   True           10         500   0.428249   0.515482       16.922577         0.087233
20         128  False            1         500   0.173600   0.255813       32.137977         0.082213
2          128   True            1         500   0.209229   0.289853       27.815446         0.080624
25         128  False           10         100   0.069253   0.129692       46.601908         0.060439
23          32  False            1         500   0.105972   0.155374       31.795527         0.049402
21          32  False           20         500   0.687639   0.727779        5.515399         0.040140
7          128   True           10         100   0.047142   0.085005       44.541926         0.037863
27          32  False           20         100   0.064948   0.098792       34.257885         0.033844
24         128  False           20         100   0.076329   0.098159       22.239494         0.021830
10          32   True           10         100   0.056360   0.077750       27.511216         0.021390
9           32   True           20         100   0.075062   0.096348       22.092830         0.021286
6          128   True           20         100   0.122499   0.142712       14.163424         0.020213
26         128  False            1         100   0.023037   0.041862       44.969302         0.018825
8          128   True            1         100   0.038797   0.053073       26.898888         0.014276
11          32   True            1         100   0.008956   0.020436       56.175699         0.011480
29          32  False            1         100   0.013403   0.021140       36.597607         0.007737
28          32  False           10         100   0.028573   0.034658       17.557063         0.006085
15          32   True           20          10   0.001318   0.002301       42.720962         0.000983
13         128   True           10          10   0.001378   0.002235       38.346667         0.000857
35          32  False            1          10   0.001117   0.001886       40.771176         0.000769
16          32   True           10          10   0.001199   0.001932       37.936567         0.000733
33          32  False           20          10   0.001339   0.002054       34.811376         0.000715
14         128   True            1          10   0.001133   0.001839       38.397718         0.000706
34          32  False           10          10   0.001206   0.001889       36.147924         0.000683
17          32   True            1          10   0.001096   0.001775       38.253862         0.000679
32         128  False            1          10   0.001167   0.001811       35.558189         0.000644
12         128   True           20          10   0.002095   0.002675       21.677511         0.000580
30         128  False           20          10   0.002064   0.002503       17.536674         0.000439
31         128  False           10          10   0.007326   0.002133     -243.471943        -0.005193

The modified benchmark script is in this gist https://gist.github.com/Erotemic/5230d93ccc9fa5329b0a02a351b02939

I also just squashed everything into a single commit to make the history nicer.

@amueller
Member

great, thanks :)

@Erotemic
Contributor Author

I made a few more changes to this branch. Here is a summary of the differences.

FIX: logic error in euclidean_distances

I noticed a small issue in euclidean_distances when check_inputs is False. I had incorrect indentation leading to a NameError because XX was undefined. I fixed this and added a new test to cover this case.

FIX: documentation error in euclidean_distances

When fixing this I also noticed that the documentation states that the input shape of X_norm_squared should be (1, n_samples_1), but the underlying function actually wants it to be (n_samples_1, 1). If check_inputs is True the underlying code will correct itself, but with the new check_inputs=False behavior the documentation needs an update. I corrected this in the documentation.
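
To illustrate the shape the code actually consumes (a small sketch with dummy data; not part of the patch):

import numpy as np
from sklearn.utils.extmath import row_norms

X = np.random.rand(5, 3)
XX = row_norms(X, squared=True)[:, np.newaxis]  # shape (5, 1): (n_samples_1, 1)
# XX has to broadcast down the rows of the (n_samples_1, n_samples_2)
# distance matrix, hence the column shape. A (1, n_samples_1) row vector
# only works because, when checking is enabled, the code detects that
# shape and transposes it.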

ADD: check_inputs flag to the other pairwise metrics functions

I also went ahead and added the check_inputs flag to the other metrics in pairwise.py in order to maintain a consistent API. The only function I skipped was paired_distances because I couldn't see how to nicely work with its callable version without using kwds in complicated ways. I added a total of 4 new tests to cover all of this new behavior.

FIX: Converted one test in test_pairwise.py to non-yield version

While I was in that file I changed the behavior of a yield test which was producing a lot of warnings on my system with py.test version 3.0.1. It seems this behavior will be deprecated in a future version, so it may be worth the effort to go through and change those tests to their non-yield versions soon.
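
For reference, the conversion looks roughly like this (illustrative test and helper names, not the actual sklearn tests):

def check_metric(metric):
    # stand-in for a real assertion on a pairwise metric
    assert metric in ('euclidean', 'cityblock')

# Old yield style, deprecated by newer py.test versions:
def test_metrics_yield():
    for metric in ['euclidean', 'cityblock']:
        yield check_metric, metric

# Non-yield replacement: just call the check directly in a loop.
def test_metrics():
    for metric in ['euclidean', 'cityblock']:
        check_metric(metric)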


I'm still waiting on the results of AppVeyor, but it seems as if all tests are now passing in this branch.

@amueller amueller changed the title [MRG + 1] Added flag to disable l2-dist finite check [MRG] Added flag to disable l2-dist finite check Oct 10, 2016
@Erotemic
Contributor Author

I dropped the commit that changes the yield test behavior and made a new PR #7654 which looks at the issue independently.


 def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
-                    init_size=None):
+                    init_size=None, check_inputs=True):
Member

where is this check_inputs used?

Contributor Author
@Erotemic Erotemic Oct 14, 2016

It's passed to _k_init (the kmeans++ implementation). On reviewing this again, it may not be worth the extra API complexity to have the check_inputs parameter in _init_centroids, because it is not called very many times. Do you think I should remove it?

Member

Probably, if it doesn't provide speed gains.

Contributor Author

I made the change.


-def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None):
+def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None,
+            check_inputs=True):
Member

My question was more: is it called with both "True" and "False" in the current code?

Contributor Author

Yes; it is only called from _init_centroids. In the KMeans code it is called with check_inputs=True by default, and in MiniBatchKMeans it is called with check_inputs=False because the inputs have already been checked at that point.
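
Illustratively (abbreviated, not verbatim sklearn code):

# KMeans path: X has not been validated at this point, so keep checks on.
centers = _k_init(X, n_clusters, x_squared_norms=x_squared_norms,
                  random_state=random_state)  # check_inputs=True by default

# MiniBatchKMeans path: fit() already ran check_array on X, so skip it.
centers = _k_init(X, n_clusters, x_squared_norms=x_squared_norms,
                  random_state=random_state, check_inputs=False)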

@jnothman
Member

jnothman commented Nov 1, 2016

@Erotemic, will #7548 (perhaps with a context manager) fix this in your opinion?

@Erotemic
Contributor Author

Erotemic commented Nov 1, 2016

@jnothman I think that a context manager could work, and I would be ok with redoing this patch to use such a context manager if the sklearn team decided they liked that direction. However, in my opinion I don't particularly like it as a solution because of the way it would interact with global variables. This could cause issues with solutions that involve any sort of threading. Perhaps this isn't a huge issue because the GIL is so omnipresent and I don't think a race condition would exist in a multiprocessing solution, but from a stylistic design perspective I think it is a bit too obfuscated. I would prefer to have a flag explicitly passed around so it is very clear when inputs are not being validated.

That being said, as long as it stays constant, I do think SKLEARN_SUPPRESS_VALIDATION is great to have as a global option. I think it's a great way to eke out a bunch of performance when the sklearn framework can't guarantee the input is ok, but a user running experiments in scripts can.

To summarize, here's a list of pros and cons of a context manager from my perspective. I'll try to weight them by importance.

Pros:

* Easy, requires very little modification of code to achieve the desired effect. (weight=.9)
* Having a `check_inputs` flag on every function that needs it can be a bit cumbersome. (weight=.8)

Cons:

* Modifying global variables is not the best practice (weight=.4)
* Global variable context managers can cause issues with threading (but who really uses threading in a GIL environment?). (weight=.2)
* Not always clear if inputs may be validated or not. It is difficult for a developer to ensure input validation if it really needs to happen. (weight=.6)
* Context managers increase indentation and reduce the available horizontal space by 4 spaces. This can be quite annoying.  (weight=.6)
* A flag allows for conditional checking. To disable the flag all that is necessary is to overwrite its value. To disable a context manager it would need an enabled flag, which might be ugly.  (weight=.4)

That's my opinion, based mostly on the grounds of explicit > implicit and an avoidance of manipulating global variables. There are more cons than pros, and although the pros carry slightly higher (subjectively chosen) weights, I do think the cons outweigh the pros. However, if other developers are in support of a context manager, I'm willing to jump on board.
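
For concreteness, here is a minimal sketch of what such a context manager might look like (entirely hypothetical names; a thread-local flag sidesteps the threading concern):

import contextlib
import threading

_validation_state = threading.local()

@contextlib.contextmanager
def suppress_validation():
    # Temporarily disable finiteness checks for calls made in this thread.
    old = getattr(_validation_state, 'suppress', False)
    _validation_state.suppress = True
    try:
        yield
    finally:
        _validation_state.suppress = old

# check_array would then consult the flag, e.g.:
#   if force_all_finite and not getattr(_validation_state, 'suppress', False):
#       _assert_all_finite(array)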

@amueller
Member

amueller commented Nov 1, 2016

I feel that we can be explicit on repeated internal calls, such as here, and leave global behavior to the user.
@jnothman so you would use a context manager to set the flag for the subsequent calls?

@jnothman
Member

jnothman commented Nov 1, 2016

TBH, I wasn't well acquainted with this issue when I made that comment. I'll need to give it a more thorough look.

@jnothman
Member

jnothman commented Nov 1, 2016

I'd forgotten, @Erotemic, that this was about reducing finiteness checks in a nested context. Here, I agree that a global context switch is not great. I was asking whether, as a user, disabling finiteness checks for your call would be an okay solution. But perhaps it'd only be a temporary fix.

For this sort of thing, it'd be nice to be able to attach to the data a tag to say that it's to be presumed finite. Is there a way to do that which does not destroy everything we believe in?

I agree that pairwise_distances of all things is used frequently as a helper and should be targeted by the present kind of fix.

Threading could be a concern; really the context manager should use a lock.

@Erotemic
Contributor Author

Erotemic commented Nov 2, 2016

@jnothman I think some of the confusion also stems from me being undisciplined about keeping a single feature to a single branch, which caused this PR to be put together a bit haphazardly. It started off as me noticing that I could get a performance gain with a change to a single function, so I made that change and it got +1ed. Then I ended up changing all of the pairwise distance functions in order to achieve a consistent API, while making other changes that I believe I've since dropped from the PR.

Perhaps it might be better to close this PR so I can reorganize it a bit. First I'd create one that simply adds the check_input flags to the pairwise functions. Then I could build on top of that and use the new feature to get the speed gain this PR was originally intended for. Thoughts?

@amueller
Member

amueller commented Nov 2, 2016

I think the PR is reasonably scoped. If you want, you can break it up, but it's not necessary.

@Erotemic
Contributor Author

Erotemic commented Nov 3, 2016

It's less work not to break it up, so I'll default to that as long as there are no objections. I just thought I'd bring it up.

@Erotemic
Contributor Author

Erotemic commented Nov 3, 2016

I just noticed that elsewhere in the code a similar flag is referred to as check_input, not check_inputs (check_input was also the name originally suggested; I must have just mistyped it). I renamed the flag to check_input for consistency.

Member
@jnothman jnothman left a comment

I assume we already have tests in there to ensure finiteness is checked when check_input=True...

 Y : {array-like, sparse matrix}, shape (n_samples_2, n_features)
-Y_norm_squared : array-like, shape (n_samples_2, ), optional
+Y_norm_squared : array-like, shape (1, n_samples_2), optional
Member

I'm pretty sure this is tested to work with a 1-d array. Can we leave this shape alone?

 Return squared Euclidean distances.
-X_norm_squared : array-like, shape = [n_samples_1], optional
+X_norm_squared : array-like, shape = (n_samples_1, 1), optional
Member

Ditto

        raise ValueError(
            "Incompatible dimensions for X and X_norm_squared")
    else:
        XX = X_norm_squared
Member

I think we should still be putting this into the right shape if 1d.
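
One way to keep accepting a 1d array while still producing the column shape the math needs (a sketch of the reshaping logic, assuming numpy is imported as np):

XX = np.asarray(X_norm_squared)
if XX.ndim == 1:
    XX = XX[:, np.newaxis]  # promote (n_samples_1,) to (n_samples_1, 1)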

raise ValueError("additive_chi2 does not support sparse matrices.")
X, Y = check_pairwise_arrays(X, Y)
if check_input:
if issparse(X) or issparse(Y):
Member

I think there's negligible harm to having this outside the if statement.

@cmarmo
Contributor

cmarmo commented Aug 20, 2020

@jeremiedbb, @ogrisel, is this PR still relevant? KMeans has been largely modified and kmeans++ is under revision... Thanks!

@cmarmo cmarmo added the Superseded (PR has been replaced by a newer PR) label and removed the Waiting for Reviewer label Dec 15, 2020
@jeremiedbb jeremiedbb removed the Superseded (PR has been replaced by a newer PR) label Dec 16, 2020
@jeremiedbb
Member

@cmarmo I removed the superseded label because this PR actually does more than what I do in #19002. It would still be interesting to implement the rest.

@cmarmo
Contributor

cmarmo commented Dec 16, 2020

> @cmarmo I removed the superseded label because this PR actually does more than what I do in #19002. It would still be interesting to implement the rest.

Thanks @jeremiedbb for clarifying. This is above my understanding :).
Since @Erotemic has been quite responsive in another old PR he's working on, perhaps it could be useful to be more specific about what is still needed, so this PR can be finalized?

@Erotemic
Contributor Author

This one is a bit more complex than the previous one. I'll probably have to look at it this weekend to refresh myself on what I was doing here.

Base automatically changed from master to main January 22, 2021 10:49