Feat: DummyClassifier strategy that produces randomized probabilities #31462

Open
@tmcclintock

Description

Describe the workflow you want to enable

Motivation

The dummy module is fantastic for testing pipelines, all the way up to enterprise scale. The strategies offered in the DummyClassifier are excellent for testing corner cases. However, they fall short when testing pipelines whose downstream tasks depend on moments of the predicted probabilities (e.g. gains charts).

This is because none of the existing strategies sample random probabilities: predict_proba returns either the same row for every sample or one-hot rows.

Proposed API:

Consider adding a new strategy, named something like uniform-proba or score-random, that produces this behavior for binary classification:

print(DummyClassifier(strategy="uniform-proba").fit(X, y).predict_proba(X))
"""
[[0.5651713  0.4348287 ]
 [0.36557341 0.63442659]
 [0.42386353 0.57613647]
 ...
 [0.30348692 0.69651308]
 [0.59589879 0.40410121]
 [0.32664176 0.67335824]]
"""
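
For the binary case, output of this shape can be approximated today with plain NumPy, since a two-class Dirichlet with both alphas equal to 1 reduces to drawing one probability uniformly from [0, 1]. The sketch below is only an illustration: `uniform_binary_proba` is a hypothetical helper, not part of scikit-learn, and the seed handling is an assumption.

```python
import numpy as np


def uniform_binary_proba(n_samples, seed=0):
    """Hypothetical helper: rows [p, 1 - p] with p ~ Uniform(0, 1)."""
    rng = np.random.RandomState(seed)
    p = rng.uniform(0.0, 1.0, size=n_samples)
    # Stack the two class probabilities so that each row sums to 1.
    return np.column_stack([p, 1.0 - p])


proba = uniform_binary_proba(5)
print(proba.shape)
print(proba)
```

Each row is a valid probability vector, which is what a downstream gains chart or calibration step would consume.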

Describe your proposed solution

Proposed implementation

I had something like this in mind:

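class DummyClassifier(MultiOutputMixin, ClassifierMixin, BaseEstimator):
    ...

    def predict_proba(self, X):
        ...
        for k in range(self.n_outputs_):
            if self._strategy == "uniform-proba":
                # One row per sample, drawn uniformly from the probability simplex.
                # `rs` is assumed to be the RandomState used by the "stratified" branch.
                out = rs.dirichlet([1] * n_classes_[k], size=n_samples)
                out = out.astype(np.float64)
            ...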

Similar to the "stratified" strategy, this simple implementation relies on numpy.random, in this case the Dirichlet distribution. Setting all the alphas to 1 makes the sampled probability vectors uniformly distributed over the probability simplex, so no class is favored. In contrast, the "stratified" strategy returns one-hot rows drawn from a multinomial over the class priors, which corresponds to the limit of a Dirichlet whose alphas are proportional to the class priors and shrink toward 0 (all probability mass lands on a single class).
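
To make the contrast concrete, here is a minimal sketch comparing the two sampling schemes outside of scikit-learn. The sample size, the three classes, and the uniform class prior are assumptions chosen for illustration. For a Dirichlet with K alphas all equal to 1, each component has mean 1/K and variance (K - 1) / (K^2 (K + 1)), which is 1/18 for K = 3.

```python
import numpy as np

rs = np.random.RandomState(0)
n_samples, n_classes = 10_000, 3

# Proposed strategy: rows drawn uniformly from the probability simplex.
dirichlet_rows = rs.dirichlet([1] * n_classes, size=n_samples)

# "stratified"-like sampling: one-hot rows drawn from a multinomial.
# A uniform class prior is assumed here purely for illustration.
class_prior = np.full(n_classes, 1.0 / n_classes)
one_hot_rows = rs.multinomial(1, class_prior, size=n_samples)

# Per-class moments of the sampled probabilities.
print("dirichlet mean:", dirichlet_rows.mean(axis=0))
print("dirichlet var: ", dirichlet_rows.var(axis=0))
print("one-hot var:   ", one_hot_rows.var(axis=0))
```

Both schemes produce rows that sum to 1, but only the Dirichlet rows take values strictly inside the simplex, so downstream code that depends on the spread of the scores gets exercised.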

Describe alternatives you've considered, if relevant

No response

Additional context

I am happy to make the PR. The biggest question is what the strategy string should be.

Thank you for reading 🙏.
