[WIP] Callback API continued #22000
Conversation
Thanks! Another use case I see is structured logging: instead of generating lines in a text file, generate an event log in a JSON file, records in a database (e.g. MongoDB or PostgreSQL, possibly via a JSON column type), a Kafka stream, or an integration with ML tracking platforms, for instance MLflow's tracking features or Weights & Biases' wandb.log.
Thanks for working on this!
@@ -515,6 +518,22 @@ def sag{{name_suffix}}(SequentialDataset{{name_suffix}} dataset,
                                   fabs(weights[idx] -
                                        previous_weights[idx]))
                    previous_weights[idx] = weights[idx]

        with gil:
            if _eval_callbacks_on_fit_iter_end(
How does the overhead of taking the GIL compare to early stopping directly using the stopping_criterion?
It has an impact on performance for sure. But if we want to enable callbacks at this step of the fit, there's no way around it.
What we can do, however, is check before entering the nogil section whether the estimator has callbacks, and only execute this part if it does. Let me try something like that. We might encounter the same issue as in #13389.
sklearn/callback/_progressbar.py
        else:
            # node is a leaf, look for tasks of its sub computation tree before
            # going to the next node
            child_dir = this_dir / str(node.tree_status_idx)
I think we should abstract away the filesystem that backs the computation trees because:
- I do not think we want third party developers writing Callbacks to worry about the filesystem.
- It will be easier to switch to another inter-process communication method in the future.
That's probably better, yes. I'll try to come up with a friendlier solution.
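For example, something along these lines could hide the filesystem behind a small storage interface (just a sketch; `TaskTreeStore`, `FileSystemStore` and their methods are hypothetical names, not code from this PR):

```python
from abc import ABC, abstractmethod
from pathlib import Path


class TaskTreeStore(ABC):
    """Hypothetical backend-agnostic store for computation tree state."""

    @abstractmethod
    def set_status(self, node_path, status):
        """Record the status of one node of the computation tree."""

    @abstractmethod
    def get_status(self, node_path):
        """Return the recorded status of one node, or None."""


class FileSystemStore(TaskTreeStore):
    """File-based backend, mirroring the current file-per-node behaviour."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, node_path):
        return self.root.joinpath(*map(str, node_path))

    def set_status(self, node_path, status):
        path = self._path(node_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(status)

    def get_status(self, node_path):
        path = self._path(node_path)
        return path.read_text() if path.exists() else None
```

Callback authors would then only talk to the store, and switching to sockets or another IPC mechanism later would mean adding another `TaskTreeStore` implementation.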
@@ -0,0 +1,268 @@
# License: BSD 3 clause
@adrinjalali Discussing with @jeremiedbb IRL, while I was explaining the sample-props PR to him, he was under the impression that the MetaDataRequest class would be similar to the ComputationTree in some regards. Maybe you could have a look for some inspiration :)
        else:
            sub_estimator._callbacks.extend(propagated_callbacks)

    def _eval_callbacks_on_fit_begin(self, *, levels, X=None, y=None):
Would it be too magical to only have a single _eval_callbacks_begin and internally inspect the call stack to infer which method called it? Of course it would make sense only if the same methods are expected to be called for fit/predict/etc.
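For reference, the kind of call-stack inspection being suggested could look roughly like this (purely illustrative; the dispatch-by-name convention is an assumption, not something in this PR):

```python
import inspect


def _eval_callbacks_begin(estimator, **kwargs):
    # Look one frame up to infer which public method (fit, predict, ...)
    # triggered this call.
    caller = inspect.stack()[1].function
    for callback in getattr(estimator, "_callbacks", []):
        # Dispatch to on_fit_begin / on_predict_begin / ... when it exists.
        hook = getattr(callback, f"on_{caller}_begin", None)
        if hook is not None:
            hook(estimator=estimator, **kwargs)
```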
> Of course it would make sense only if the same methods are expected to be called for fit/predict/etc.

Well, that's not obvious at all and I haven't really thought about it. This first iteration is all about fit. I think it will be easier not to try to be too magical for now.
@jeremiedbb Would this PR cover early stopping for
@chritter For now EarlyStopping based on a time budget in SearchCV estimators doesn't seem possible due to joblib (it might be possible at some point if the possibility to return a generator is merged, see joblib/joblib#588).
@jeremiedbb What is the current status of this feature? Is it abandoned? :(
No it's not :) I haven't been working on it for some time but I started working on it again a few weeks ago. There's still a lot of work to do though.
Maybe you could keep this WIP branch up to date ;)
❌ Linting issues — this PR is introducing linting issues.
Remark: #27663 implements a smaller portion of this.
I think I'm -1 on using callbacks for early stopping since I don't see a way of making it work within pipelines.
Fixes #78 #7574 #10973
Continuation of the work started in #16925 by @rth.
Goal
The goal of this PR is to propose a callback API that can handle the most important / most requested use cases:
- Monitor some quantities / metrics at each iteration. This can also be very useful for maintenance / debugging / implementation of new features.
- Allow stopping the iterations based on some external metric evaluated on a validation set.
- Take regular snapshots of an estimator during fit, to be able to recover a working estimator if the fit is interrupted, for instance.
(not implemented yet)
Challenges
Supporting all these features and making each of these callbacks available is not easy and will probably require some refactoring in many estimators.
The proposed API makes it possible to enable the callbacks one estimator at a time: setting callbacks on estimators that don't support them yet has no effect. Thus we can do it incrementally in subsequent dedicated PRs. Here I only did NMF, LogisticRegression and Pipeline to show what the necessary changes to the code base are.
The proposed API also makes it possible to only enable a subset of the features for an estimator and add the remaining ones later. For LogisticRegression, for instance, I only passed the minimum.
Callbacks should not impact the performance of the estimators. Some quantities passed to the callbacks might be costly to compute. We don't want to spend time computing them if the only callback is a progress bar for instance.
The solution I found is to do lazy evaluation using lambdas and only actually compute these quantities if there's at least one callback requesting them. For now callbacks request them by defining specific class attributes, but maybe there's a better way (mixins?).
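Roughly, the lazy-evaluation idea looks like this (a simplified sketch; the helper name and the `request_validation_loss` class attribute are illustrative, not the exact names used in the PR):

```python
def _eval_callbacks_on_fit_iter_end(estimator, node, *, validation_loss=None):
    """Hypothetical helper: `validation_loss` is a 0-argument callable (lambda)."""
    callbacks = getattr(estimator, "_callbacks", [])

    # Only pay the cost of computing the quantity if at least one callback
    # declares (here via a class attribute) that it wants it.
    if any(getattr(cb, "request_validation_loss", False) for cb in callbacks):
        loss_value = validation_loss() if validation_loss is not None else None
    else:
        loss_value = None

    results = [
        cb.on_fit_iter_end(estimator=estimator, node=node, validation_loss=loss_value)
        for cb in callbacks
    ]
    return any(results)  # True -> the caller breaks out of its loop


# Inside a fit loop, something like:
#     if _eval_callbacks_on_fit_iter_end(
#         self, node, validation_loss=lambda: loss(X_val, y_val)
#     ):
#         break
```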
The callbacks described above are not meant to be evaluated at the same fitting step of an estimator.
When an estimator has several nested loops (LogisticRegressionCV(multi_class="ovr") for instance has a loop over the Cs, a loop over the classes and then the final loop of iterations over the dataset), the snapshot callback can only be evaluated at the end of an outermost loop, while EarlyStopping would be evaluated at the end of an innermost loop, and the ProgressBar could be evaluated at each level of nesting.
In this PR I propose that each estimator holds a computation tree as a private attribute representing these nested loops, the root being the beginning of fit and each node being one step of a loop. This structure is defined in `_computation_tree.py`. It gives a simple way to know exactly at which step of the fit we are at each evaluation of the callbacks, and is the best solution I found to the challenges described above. It imposes the main changes to the code base, i.e. passing the parent node around.
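A stripped-down sketch of such a tree (the actual `_computation_tree.py` in this PR is more complete; names here are simplified):

```python
class ComputationNode:
    """One step of one loop of an estimator's fit (the root is fit itself)."""

    def __init__(self, description, max_iter=None, parent=None, idx=0):
        self.description = description  # e.g. "outer loop over Cs"
        self.max_iter = max_iter        # expected number of children, if known
        self.parent = parent
        self.idx = idx                  # position within the parent's loop
        self.children = []

    def add_child(self, description, max_iter=None):
        child = ComputationNode(
            description, max_iter=max_iter, parent=self, idx=len(self.children)
        )
        self.children.append(child)
        return child

    @property
    def depth(self):
        return 0 if self.parent is None else self.parent.depth + 1


# e.g. for LogisticRegressionCV(multi_class="ovr"):
# root = ComputationNode("fit", max_iter=len(Cs))
# node_C = root.add_child("C #0", max_iter=n_classes)
# node_class = node_C.add_child("class #0", max_iter=max_iter)
```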
Dealing with parallelism, and especially multiprocessing, is the main challenge to me. Typically with a callback you might want to accumulate a bunch of info during fit and recover it at the end. The issue is that the callback is not shared between sub-processes, so modifying its state in a sub-process (e.g. setting an attribute) will not be visible from the main process. The joblib API doesn't allow the inter-process communication that would be needed to overcome this.
The solution we found is that the callbacks write the information they want to keep to files (files in this first implementation, but we might consider sockets or another solution?). It's relatively easy to avoid race conditions with this design.
As an example, this is necessary to be able to report progress in real time. In an estimator running in parallel there is no single "current" computation node; we are at different nodes at the same time. But having the status of each node in a file, updated at each call to the callbacks, makes it possible to know the current overall progress from the main process (there are other difficulties, described later).
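As a rough illustration of the file-based communication (hypothetical helpers, not the PR's actual implementation), each worker marks a node as done by creating a file named after the node, and the main process derives the overall progress by counting those files:

```python
from pathlib import Path


def mark_node_finished(progress_dir, node_idx):
    # Called from any worker process; each finished node gets its own file,
    # so concurrent writers never touch the same path.
    Path(progress_dir).mkdir(parents=True, exist_ok=True)
    Path(progress_dir, str(node_idx)).touch()


def overall_progress(progress_dir, n_total_nodes):
    # Called from the main process, e.g. by a progress bar callback.
    path = Path(progress_dir)
    n_done = sum(1 for _ in path.iterdir()) if path.exists() else 0
    return n_done / n_total_nodes
```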
The last main challenge is meta-estimators. We'd like some callbacks to be set on the meta-estimator, like progress bars, but others to be set on the underlying estimator(s), like early stopping. Moreover, we encounter the parallelism issue again if the meta-estimator fits clones of the underlying estimator in parallel, like GridSearchCV.
For that, I propose to have a mixin to tell a callback that it should be propagated to sub-estimators. This way the meta-estimator only propagates the appropriate callbacks to its sub-estimators, and these sub-estimators can also have their own regular callbacks.
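In condensed form, the propagation could work like this (a simplified sketch; the helper and the stand-in mixin class are illustrative, the actual AutoPropagatedMixin is described in the API section below):

```python
class AutoPropagatedMixin:
    # Stand-in for the marker mixin proposed in this PR: callbacks inheriting
    # from it are meant to be passed on to sub-estimators by meta-estimators.
    pass


def _propagate_callbacks(meta_estimator, sub_estimator):
    # e.g. a progress bar set on GridSearchCV would be propagated to each
    # cloned sub-estimator, while an early-stopping callback set directly on
    # the sub-estimator stays untouched.
    propagated = [
        cb
        for cb in getattr(meta_estimator, "_callbacks", [])
        if isinstance(cb, AutoPropagatedMixin)
    ]
    if not hasattr(sub_estimator, "_callbacks"):
        sub_estimator._callbacks = []
    sub_estimator._callbacks.extend(propagated)
```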
The API
This PR adds a new module, `sklearn.callback`, which exposes `BaseCallback`, the abstract base class for the callbacks. All callbacks must inherit from `BaseCallback`. It also exposes `AutoPropagatedMixin`; callbacks that should be propagated to sub-estimators by meta-estimators must inherit from this.
`BaseCallback` has 3 abstract methods:
- `on_fit_begin`. Called at the beginning of fit, after all validations. We pass a reference to the estimator, X_train and y_train.
- `on_fit_iter_end`. Called at the end of each node of the computation tree, i.e. each step of each nested loop. We pass a reference to the estimator (which at this point might be different from the one passed to `on_fit_begin` for propagated callbacks) and the computation node where it was called. We also pass some of the following, depending on what the callbacks request: for instance the stopping criterion and the tolerance (to allow early stopping when `stopping_criterion <= tol`), or what is needed to make the estimator usable for `predict`, `transform`, ...
- `on_fit_end`. Called at the end of fit. Takes no argument. It allows the callback to do some clean-up.
Examples
Progress bars.
Here's an example of progress monitoring using rich. I used custom estimators to simulate a complex setting with a meta-estimator (like a GridSearchCV) running in parallel with a sub-estimator also running in parallel.
simplescreenrecorder-2021-12-16_19.17.35.mp4
Convergence Monitoring
Snapshot
EarlyStopping
If the on_fit_iter_end method of the callbacks returns True, the iteration loop breaks.
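For instance, a bare-bones early-stopping callback could look like this (a sketch assuming the stopping criterion and tolerance are among the optional quantities passed to on_fit_iter_end):

```python
from sklearn.callback import BaseCallback  # only available on this PR's branch


class ThresholdEarlyStopping(BaseCallback):
    """Stop iterating as soon as the reported criterion falls below tol."""

    def on_fit_begin(self, estimator, X=None, y=None):
        pass

    def on_fit_iter_end(self, *, estimator, node, stopping_criterion=None,
                        tol=None, **kwargs):
        # Returning True breaks the estimator's iteration loop.
        return (
            stopping_criterion is not None
            and tol is not None
            and stopping_criterion <= tol
        )

    def on_fit_end(self):
        pass
```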
Verbose
TODO
This PR is still WIP.