Path for pluggable low-level computational routines #22438
Here is a proposal for the developer API for multi-dispatching:

**scikit-learn dispatching**

This registers a dispatchable function:

```python
from sklearn.utils.dispatch import multidispatch

# Dispatch on "X" and "metric"
@multidispatch(domain="sklearn.metrics", keys=["X", "metric"])
def pairwise_distance(...):
    ...
```

**Backend Provider**

**Using Python types**

The following registers a backend implementation:

```python
from sklearn.utils.dispatch import Backend
import numpy as np

be = Backend(domain="sklearn.metrics")

@be.register(name="pairwise_distance")
def faster_pairwise(X: np.ndarray, ..., metric: Literal["euclidean", "manhattan"]):
    ...
```

**Specifying a CuPy array**

```python
import cupy as cp

@be.register(name="pairwise_distance")
def faster_pairwise(X: cp.ndarray, ...):
    ...
```

**Backend only supports float32**

There can be a callable that is passed the dispatch keys and returns True if the function supports the input.

```python
# Uses **kwargs in case we want to dispatch on more keys in the future
@be.register(
    name="pairwise_distance",
    supports=lambda **kwargs: kwargs["X"].dtype == np.float32
)
def faster_pairwise(X: np.ndarray, ...):
    ...
```

Note that this callable can be used instead of Python typing.
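To make the intended dispatch behaviour concrete, here is a rough sketch of how a resolver could combine the type annotations and the optional `supports` callable. Everything below (registry layout, helper names) is made up for illustration and is not part of the proposal.

```python
# Illustrative only: a toy resolver combining type annotations and the optional
# `supports` callable to pick a backend, falling back to the default
# scikit-learn implementation. Not an actual scikit-learn API.
import typing

_BACKENDS = {}  # (domain, name) -> list of (type_hints, supports, func)


def _register(domain, name, func, supports=None):
    hints = typing.get_type_hints(func)
    _BACKENDS.setdefault((domain, name), []).append((hints, supports, func))


def _resolve(domain, name, default, **kwargs):
    for hints, supports, func in _BACKENDS.get((domain, name), []):
        types_ok = all(
            isinstance(kwargs[key], hint)
            for key, hint in hints.items()
            if key in kwargs and isinstance(hint, type)
        )
        if types_ok and (supports is None or supports(**kwargs)):
            return func  # first matching backend wins
    return default
```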
It might be more complex than this because we will need to:
If dispatching is decided at the estimator level with `set_computational_engine`:

```python
# Set the computational backend for training
clf = KNeighborsClassifier(n_neighbors=5).set_computational_engine("sklearn_dppy")
clf.fit(X_train, y_train)

# Change the computational backend for prediction
clf.set_computational_engine("sklearn_numba")
y_pred = clf.predict(X_test)
```

This has the benefit of not needing the global state that the context manager approach requires. What do you think?
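For comparison, the context-manager flavour mentioned above could look roughly like this. The `computational_engine` helper and its import path are purely hypothetical assumptions, not an existing scikit-learn API:

```python
# Purely hypothetical context-manager API for engine selection; the
# `computational_engine` helper does not exist in scikit-learn.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.dispatch import computational_engine  # hypothetical

clf = KNeighborsClassifier(n_neighbors=5)

with computational_engine("sklearn_dppy"):
    clf.fit(X_train, y_train)       # engine applies to everything in the block

with computational_engine("sklearn_numba"):
    y_pred = clf.predict(X_test)
```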
Indeed, but it might become tedious to set the engine for each step in a pipeline once we start supporting many estimators.
Do we want to support using different dispatchers for each estimator in a pipeline?

```python
pipe = make_pipeline(
    StandardScaler().set_computational_engine("sklearn_faster_scalar"),
    PCA().set_computational_engine("sklearn_numba"),
    LogisticRegression(),  # use default scikit-learn
)
```

On another note, what would the API look like for device transfers? For example:

```python
pipe = make_pipeline(
    PCA().set_computational_engine("sklearn_cuml"),
    LogisticRegression(),  # use default scikit-learn
)

# X and y are NumPy arrays
pipe.fit(X, y)
```

If we want the above to "just work", the backend
Indeed, maybe we could allow for both APIs (context manager + estimator-level engine registration). We will make all of the API private in the short term (in the experimental phase) so that we can get a better feeling for both options. I am working on a draft branch and will try to publish it somewhere by the end of the week.
IIUC, this would mean that we define a plugin API. Is this correct?
More code/documentation reuse and public API consistency.
From the developers of cuML, we think this is a great proposal to improve the user experience and extend scikit-learn without impacting its ease of use, and we'd love to collaborate and contribute towards making it happen. The main advantages of this approach as we see it mirror what @ogrisel says: when compared to just having separate libraries that follow the scikit-learn API, it ensures a more consistent user experience, reduces the barrier of entry (still using scikit-learn proper with an option, as opposed to a new library), and improves discoverability/documentation. There are quite a few elements where we would like to give our feedback based on the past few years of developing a scikit-learn-like library for GPUs. First, I think the API that probably would have the least need for maintenance from scikit-learn itself is indeed:
using
I think that second point is particularly important to make the effort easily adoptable by future libraries that might use different types of hardware. Today, for cuML for example, that means it's on us to accept NumPy/CPU objects and do the transfers to device and back, which is something we've learnt we already had to support due to users' expectations anyway. That said, the mechanism could be even more powerful if the pipeline machinery in scikit-learn could relax some validations so that memory transfers could be minimized in pipelines like:
Perhaps a mechanism that registers the "preferred" device/format of a computational engine, so that if there are multiple consecutive algorithms on the same device, the data doesn't need to be transferred back and forth. One problem which we've had to address in cuML is how to minimize data transfers and conversions (for example, row-major to column-major). Generally, computational engines may have preferred memory formats for particular algorithms (e.g. row-major vs column-major), so one thing we might want to think about is a mechanism that allows an engine to keep data in its preferred location and format through several chained calls to that engine (see the sketch below). Being able to register this "preference" allows backends to take advantage of it if desired, or just default to using NumPy arrays, so it is opt-in, which means it wouldn't complicate engine development unless the engine needs it. It would also keep maintenance on the scikit-learn codebase side low, by keeping the bulk of that responsibility (and flexibility) on the engine side.
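A hedged sketch of what such a "preference" registration could look like, reusing the hypothetical `Backend` registration API from earlier in the thread; the `preferred_memory` argument and its semantics are assumptions:

```python
# Hypothetical: an engine declares its preferred device and memory layout so
# that consecutive pipeline steps using the same engine can skip conversions.
from sklearn.utils.dispatch import Backend  # hypothetical API from the proposal above

be = Backend(domain="sklearn.decomposition")


@be.register(
    name="pca_fit",
    preferred_memory={"device": "gpu", "order": "F"},  # column-major, on device
)
def pca_fit_gpu(X, n_components):
    ...


# The pipeline machinery could consult `preferred_memory` between steps: if the
# next step's engine prefers the same device/order, the intermediate result is
# left in place instead of being converted back to a host NumPy array.
```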
Indeed. Having the ability to register the memory layout for the output data structure of a transformer is also linked to the pandas-in / pandas-out discussion:
I also want to see if relying on uarray (a pluggable type-based dispatching mechanism that is progressively being leveraged in SciPy for a similar purpose) would be useful, or if we should rather handle the backend dispatching manually. I am thinking that non-trivial dispatching decisions might happen during the fit call itself. In particular, before entering an iterative loop in a fit method, some temporary data structures would probably need to be allocated consistently with that dispatching decision and then passed as arguments to the backend-specific implementation called repeatedly inside the fit loop, while the fit loop itself would stay in the scikit-learn code base, for instance to call per-iteration callbacks (#22000) or to mutualize the stopping-criterion logic across all backends.
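A rough sketch of that split (all names hypothetical): scikit-learn keeps the outer fit loop, callbacks and convergence check, while the backend allocates the temporaries and runs the per-iteration kernel.

```python
# Hypothetical backend protocol: the outer loop stays in scikit-learn, the
# per-iteration kernel and buffer allocation are delegated to the backend.
def fit_kmeans(X, n_clusters, max_iter, tol, backend):
    centers = backend.init_centers(X, n_clusters)
    buffers = backend.allocate_buffers(X, n_clusters)  # device-consistent temporaries

    inertia = None
    for _ in range(max_iter):
        inertia, center_shift = backend.lloyd_iteration(X, centers, buffers)
        # scikit-learn keeps control here: per-iteration callbacks (#22000),
        # verbose logging, and a stopping criterion shared by all backends.
        if center_shift <= tol:
            break
    return centers, inertia
```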
cc @scikit-learn/core-devs
I like this way of thinking about this, in particular because you want to replace more than just one method. Something like https://pluggy.readthedocs.io/en/stable/ looks like a nice way to define plugins and might be a good place to learn from/get inspiration (I'm not sure using pluggy directly is the right thing to do). Some questions that come up if you use a plugin system:
I think there must be prior art for this kind of stuff out there that we could learn from. And maybe it is very unlikely that someone has more than one accelerator in the same computer/would want to disable acceleration for some estimators but not others?
Thanks for participating in this conversation, @betatim.
This probably is not (yet) explicit, but I think we somewhat plan to have a coarser granularity, i.e. have a plugin per accelerator/vendor.
I also think it is appropriate to consider some UX regarding plugins (beforehand). Do you know of projects that are using pluggy, so as to learn from their experience? I will try to search for such projects.
pytest (pluggy was "invented" for/by pytest) and datasette are two projects I know of that use pluggy.
I started writing some code to get a feeling for what a "plugin system" based on pluggy would look like (code over architecture docs). I already learnt a lot and made more sense of how pluggy works. My plan is to get the basic infrastructure working and to see if it's possible to do things like having multiple plugins that implement the same function, disabling plugins dynamically, etc. I'll post a link to my branch here instead of opening a PR. WDYT?
Thanks for having explored this. I am interested in your branch, and I don't think a PR is required if it's experimental. cc @fcharras, who might also be interested.
I picked the kmeans example/function linked in the top comment. The only reason for that is that it was easy; I didn't think about whether it is the best example or not. This means if you have a preference/opinion, let's switch to that. My goal was to write some code based on pluggy to be able to see it in action. I wanted to have a "base plugin" with the default sklearn implementation and then one "other plugin" (the fake cupy plugin in this case). I wanted two plugins in order to see what would happen if one of them wanted to pass on the computation or if it only implements a subset of the hooks. I haven't tried to find an optimal place for all the code to live; I think it feels a bit clunky right now. I haven't thought about performance either. Diff of the first commit (diff of the branch, might change as I work on it).

I used the bisecting kmeans example to run the code. I haven't really thought about how kmeans works, but it feels like the hook is in the wrong place now. Maybe it should be "higher level".

Related to "where should the hook be?": I think we should have the hooks "high enough" so that the overhead of picking a hook and calling it doesn't matter. This means we shouldn't hook a function that only takes 50ns to execute but is called 1000 times. Instead the hook should be "above" the loop that is making the 1000 calls. Ideally you call a hook with enough work that it would take a few seconds or more to execute.

I like that this example allows the hook implementations to decide if they want to "take the call" or not. For example, a GPU plugin could decide that it will take action if the input data is already in GPU memory, or if the input data is not in GPU memory yet but is "big enough" that transferring it to GPU memory is worth it. Basically, I think the decision of which hook implementation to use is more complicated than just the datatype of the input data.

Unlike in my hack above, all the "go faster" plugins should be their own Python package that people install. I like that a plugin is "activated" simply by being installed.

This is a super long comment already, so I'll stop now and wait to hear what you think 😃
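To make that pattern concrete, here is a minimal self-contained pluggy sketch: one hookspec, a default implementation that is tried last, and a fake accelerated plugin that can decline the call by returning None. The hook name, plugin classes and size threshold are all made up for illustration and are not the code from the branch above.

```python
# Minimal pluggy demo: a default engine plus a fake accelerated engine that can
# "pass" on the call by returning None. All names here are illustrative.
import numpy as np
import pluggy

hookspec = pluggy.HookspecMarker("sklearn_engines")
hookimpl = pluggy.HookimplMarker("sklearn_engines")


class EngineSpec:
    @hookspec(firstresult=True)  # stop at the first plugin returning non-None
    def kmeans_single_lloyd(self, X, centers_init):
        """Run one Lloyd k-means pass; return (labels, centers) or None to pass."""


class DefaultEngine:
    @hookimpl(trylast=True)  # fallback: only reached if every other plugin declines
    def kmeans_single_lloyd(self, X, centers_init):
        print("default CPU implementation")
        labels = np.zeros(X.shape[0], dtype=np.intp)  # stand-in for the Cython code
        return labels, centers_init


class FakeGPUEngine:
    @hookimpl
    def kmeans_single_lloyd(self, X, centers_init):
        if X.nbytes < 1_000_000:  # too small to be worth a device transfer
            return None           # decline, let the next plugin handle it
        print("pretend GPU implementation")
        return np.zeros(X.shape[0], dtype=np.intp), centers_init


pm = pluggy.PluginManager("sklearn_engines")
pm.add_hookspecs(EngineSpec)
pm.register(DefaultEngine())
pm.register(FakeGPUEngine())

X = np.random.rand(100, 3)
labels, centers = pm.hook.kmeans_single_lloyd(X=X, centers_init=X[:2])
```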
It's cool that you picked KMeans, @betatim. We are in the process of implementing a GPU KMeans plugin here; the plugin interface is not done yet, but it was planned to work with this branch of scikit-learn that has been drafted by @ogrisel. We can also work on an interface with your branch.
Sorry, indeed I had not seen the notifications for this discussion. I will try to resume my work on this branch later today to get it in a minimally functional state and hope we can get a constructive discussion on the design from there.
Thank you! How do you think we should try and decide things? Should we start with figuring out the features we need/care about/don't care about to see if we can already filter out some approaches? From reading all the comments above it isn't clear to me what people think is important, was important but not anymore, etc. For me it would be good to figure out "what is the actual problem we want to solve, and what are problems we don't want to solve". Maybe updating (together) https://hackmd.io/@oliviergrisel/S1RDJ3HCF?
For information: we had a quick chat this afternoon with @betatim and @fcharras where we discussed the current design proposed in #24497 and the choice to make engine activation explicitly manual (at least for now) and not dependent on the type of the input container.
This is not 100% clear to me yet either. Any current API choice is subject to change as we start implementing engines and get some practical experience on their usability when we try to use them for "real-life"-ish data science tasks. We plan to organize a public online meeting dedicated to the topic on the scikit-learn Discord server in the coming weeks for those interested. We will announce it on the mailing list and maybe Twitter.
I'll put some thoughts "for the future"/to keep at the back of my mind here for now. I don't have answers or expect answers to them, but I need to write them down somewhere, otherwise I'll forget:
If I may, I'd like to raise some stupid questions:
Good questions to be asking, some thoughts on possible answers:
Related to (3), but a new thought: you could imagine having an engine/plugin that provides a different implementation of
I can provide some insights for 3. from the experimental plugin we're developing:
Does anyone have thoughts on what API the different engines (for estimators) should have? Right now it looks like we are leaning towards "bespoke engine API per estimator", based on the naming of the methods on the KMeans proof-of-concept. Is this a conscious choice? Something in flux? The more I've thought about having engines for many estimators, the more I am thinking it would make sense to have one (or a few) engine APIs. For example "prepare to fit", "fit", "post fit", "prepare to predict", "predict", "post predict", etc. WDYT?
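One way to picture such a shared engine API is a `typing.Protocol`; the method names below simply mirror the "prepare to fit" / "fit" / "post fit" idea and are not an agreed-upon design.

```python
# Sketch of a single, non-bespoke engine API shared by all estimators.
# Method names are purely illustrative; the prediction side would follow the
# same pattern (prepare_predict / predict / post_predict).
from typing import Any, Protocol


class EstimatorEngine(Protocol):
    def accepts(self, X, y=None) -> bool:
        """Return True if this engine wants to handle these inputs."""
        ...

    def prepare_fit(self, X, y=None):
        """Validate/convert the inputs (e.g. move them to the device)."""
        ...

    def fit(self, X, y=None) -> Any:
        """Run the core computation and return engine-native fitted state."""
        ...

    def post_fit(self, state) -> dict:
        """Convert engine-native state back into scikit-learn attributes."""
        ...
```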
Given the discussion in #26010, a plugin mechanism seems a desirable feature. As it is a considerable amount of API change/addition, I would appreciate having a SLEP for it. We already have public APIs with
Summary: case by case then?
I agree there is still a tension between estimator-level and lower-level APIs, and also about whether the dispatch should automatically move data from host to device, without extra verbose configuration, when it is likely to enable GPU-related speed-ups. Note that SciPy is concurrently making progress by vendoring
I am still not sure what approach would work best, both from a code-complexity point of view (for maintainers, contributors and users who want to read the code to understand what's going on) and for users trying to understand which library is actually used when fitting a non-trivial machine learning pipeline on a host equipped with GPUs.
The goal of this issue is to discuss the design and prototype a way to register alternative implementations for core low-level routines in scikit-learn, in particular to benefit from hardware-optimized implementations (e.g. using GPUs efficiently).
Motivation
scikit-learn aims to provide reasonably easy to maintain and portable implementations of standard machine learning algorithms. Those implementations are typically written in Python (with the help of NumPy and SciPy) or in Cython when the overhead of the Python interpreter prevents us from efficiently implementing algorithms with (nested) tight loops. This allows us to ship reasonably fast implementations as binary packages (installable with pip/PyPI, conda/conda-forge, conda/anaconda or Linux distros) for a variety of platforms (Linux / macOS / Windows) x (x86_64, i686, arm64, ppc64le) from a single code base with no external runtime dependencies beyond Python, NumPy and SciPy.
Recently, GPU hardware has proven very competitive for many machine learning related workloads, either from a pure latency standpoint, or from the standpoint of a better computation/energy trade-off (irrespective of raw speed considerations). However, hardware-optimized implementations are typically not portable and mandate additional dependencies.
We therefore propose to design a way for our users to register alternative implementations of low-level computation routines in scikit-learn, provided they have installed the required extension package(s) that match their specific hardware.
Relationship to adopting the Array API spec
This proposal is related and complementary to another effort, namely:
The Array API spec makes it possible for some scikit-learn estimators written using pure NumPy syntax to delegate their computation to alternative Array API compatible libraries such as CuPy.
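As a small illustration of that complementary effort, a function written against the array namespace (here via the third-party `array_api_compat` helper) runs unchanged on NumPy or CuPy inputs:

```python
# Toy example of Array API style code: the same function works on NumPy or
# CuPy arrays because it only uses the namespace retrieved from the input.
from array_api_compat import array_namespace


def standardize(X):
    xp = array_namespace(X)        # numpy, cupy, ... depending on X
    mean = xp.mean(X, axis=0)
    std = xp.std(X, axis=0)
    return (X - mean) / std        # result stays on X's device/library
```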
However, some algorithms in scikit-learn cannot be efficiently written using NumPy operations only. For instance, the main K-Means loop is written in Cython to process chunks of samples in parallel (using `prange` and OpenMP), compute the distances to the centroids and reduce those distances on the fly to assign each sample to its closest centroid, while preventing unnecessary memory transfers between CPU cache and RAM.

If we want to run this algorithm efficiently on GPU hardware, one would need to dispatch the computation of this low-level function to an alternative implementation that can work on GPU, either written in C/C++ with GPU-specific supporting runtime libraries and compilers (e.g. OpenCL, NVIDIA CUDA, Intel oneAPI DPC++, AMD ROCm...) or using Python syntax with the help of the GPU support provided by numba, for instance.
List of candidate routines
Explicit registration API design ideas
I started to draft some API ideas in:
Feel free to comment here, or there.
This design is likely to evolve, in particular to make it possible to register both Array API extensions and non-Array API extensions with the same registration API.
Next steps