Description
Describe the workflow you want to enable
Motivation
The dummy module is fantastic for testing pipelines at any scale, up to enterprise workloads. The strategies offered in the DummyClassifier are excellent for testing corner cases. However, they fall short when testing pipelines that include downstream tasks that depend on moments of the predicted probabilities (e.g. gains charts).
This is because none of the existing strategies sample random probabilities: predict_proba returns either one-hot rows or the same probability vector for every sample.
Proposed API:
Consider adding a new strategy with a name like "uniform-proba" or "score-random" or something similar that results in this behavior for binary classification:
    print(DummyClassifier(strategy="uniform-proba").fit(X, y).predict_proba(X))
    """
    [[0.5651713  0.4348287 ]
     [0.36557341 0.63442659]
     [0.42386353 0.57613647]
     ...
     [0.30348692 0.69651308]
     [0.59589879 0.40410121]
     [0.32664176 0.67335824]]
    """
Describe your proposed solution
Proposed implementation
I had something like this in mind:
    class DummyClassifier(MultiOutputMixin, ClassifierMixin, BaseEstimator):
        ...
        def predict_proba(self, X):
            ...
            for k in range(self.n_outputs_):
                if self._strategy == "uniform-proba":
                    out = rs.dirichlet([1] * n_classes_[k], size=n_samples)
                    out = out.astype(np.float64)
                ...
Similar to the "stratified" strategy, this simple implementation relies on numpy.random, in this case the dirichlet distribution. By setting all the alphas to 1, we specify a uniform distribution over valid probability vectors, so no class is favored and each row still sums to 1. In contrast, the "stratified" strategy returns one-hot rows drawn from a multinomial over the empirical class priors, which is roughly the limiting case of a dirichlet distribution whose alphas all shrink toward 0 in proportion to those priors.
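To make the comparison concrete, here is a minimal NumPy-only sketch of the two sampling steps, written outside of scikit-learn. The helper names `sample_uniform_proba` and `sample_stratified_proba` are hypothetical and exist only for illustration. The one-hot multinomial draw is my understanding of what "stratified" does today, so treat it as an assumption and not as the library's exact code.

```python
import numpy as np


def sample_uniform_proba(n_samples, n_classes, random_state=None):
    """Hypothetical helper: one flat-Dirichlet probability vector per sample."""
    rng = np.random.RandomState(random_state)
    # alpha = 1 for every class is uniform over the probability simplex,
    # so each row is non-negative and sums to 1.
    return rng.dirichlet([1] * n_classes, size=n_samples)


def sample_stratified_proba(n_samples, class_prior, random_state=None):
    """Hypothetical helper: one-hot rows drawn according to the class priors."""
    rng = np.random.RandomState(random_state)
    # A single multinomial trial per sample yields exactly one 1 per row.
    return rng.multinomial(1, class_prior, size=n_samples).astype(np.float64)


proba = sample_uniform_proba(n_samples=1000, n_classes=2, random_state=0)
one_hot = sample_stratified_proba(1000, class_prior=[0.3, 0.7], random_state=0)

print(proba[:3])           # continuous scores spread over (0, 1)
print(np.unique(one_hot))  # only the values 0.0 and 1.0 appear
```

With continuous scores like `proba`, a gains chart or any other statistic over the predicted probabilities has a non-degenerate distribution to work with, which is the gap described in the motivation.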
Describe alternatives you've considered, if relevant
No response
Additional context
I am happy to make the PR. The biggest question is what the strategy string should be.
Thank you for reading 🙏.