Lack of consistency for decision_function methods in outlier detection #8693
If there is substantial inconsistency in `decision_function`, should we be providing a different method?
On 4 Apr 2017 2:36 am, "Albert Thomas" wrote:
Description
I think we could improve the consistency of the `decision_function` of the outlier detection algorithms implemented in scikit-learn.
- `decision_function` for OCSVM is such that a positive value means the sample is an inlier and a negative value means it is an outlier. It takes into account the parameter `nu`, which can be seen as a contamination parameter.
- The `decision_function` of IsolationForest does not take the `contamination` parameter into account; it just returns the score of the samples.
- For LOF, the decision function is private (`_decision_function`) and does not take the contamination parameter into account.
- For EllipticEnvelope, `decision_function` takes the contamination parameter into account, and the documentation says it is meant to "ensure a compatibility with other outlier detection tools such as the One-Class SVM".

`decision_function` should maybe stick with the OCSVM convention, and we could add a `score_samples` method, as for kernel density estimation, which would return the scores of the algorithms as defined in their original papers. This would be useful when performing benchmarks with ROC curves, for instance. When I did a benchmark with sklearn anomaly detection algorithms, I defined a subclass for each algorithm, each with a `score` method.
If you think this should be addressed I can submit a PR.
See also #8677.
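As a rough illustration of the benchmarking point, here is a minimal sketch (toy data; the exact score ranges and offsets depend on the scikit-learn version) comparing the `decision_function` outputs of three of these estimators with a ROC AUC:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import roc_auc_score

# Toy data: Gaussian inliers plus uniformly scattered outliers.
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(200, 2), rng.uniform(-6, 6, size=(20, 2))])
y = np.r_[np.ones(200), -np.ones(20)]  # 1 = inlier, -1 = outlier

for est in (OneClassSVM(nu=0.1),
            IsolationForest(random_state=42),
            EllipticEnvelope(contamination=0.1, random_state=42)):
    est.fit(X)
    scores = est.decision_function(X).ravel()
    # Higher score = more normal for all three, but only some of them
    # place the inlier/outlier threshold at 0.
    print(type(est).__name__, roc_auc_score(y == 1, scores))
```

The ROC AUC works here because all three agree that higher means more normal, but the differing offsets and contamination handling are exactly what a common `score_samples` convention would make less error-prone.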
This is what I suggest, but feedback is welcome:
+1 for more consistency. Also it would be great to remove `OutlierDetectionMixin`, which is not used by any outlier detection algorithm except `EllipticEnvelope`. IMHO it would be clearer to have a standardized `decision_function` method for all the algorithms, with an optional contamination parameter. (Except for OCSVM and EllipticEnvelope, decision functions structurally cannot depend on the dataset contamination, which is then just used to define a threshold for prediction.)
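To make the suggestion concrete, here is a purely hypothetical sketch (the class and attribute names below are made up for illustration, not an existing scikit-learn API) of a convention where the raw score is exposed separately and the contamination is only used to place the zero threshold of `decision_function`:

```python
import numpy as np

class ThresholdedDetector:
    """Illustrative wrapper: raw outlier scores plus a contamination-based threshold."""

    def __init__(self, estimator, contamination=0.1):
        # `estimator` is assumed to expose its raw outlier score through
        # decision_function (higher = more normal), e.g. IsolationForest.
        self.estimator = estimator
        self.contamination = contamination

    def fit(self, X):
        self.estimator.fit(X)
        raw = self.score_samples(X)
        # Offset chosen so that a `contamination` fraction of the training
        # samples ends up below 0 after shifting.
        self.offset_ = np.percentile(raw, 100.0 * self.contamination)
        return self

    def score_samples(self, X):
        # Raw score, ideally the one defined in the algorithm's original paper.
        return self.estimator.decision_function(X).ravel()

    def decision_function(self, X):
        return self.score_samples(X) - self.offset_

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)
```

With such a scheme, benchmarks (ROC curves, etc.) would rely on `score_samples` only, while `decision_function` and `predict` would be the only places the contamination matters.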
Hi all, I've been using Isolation Forests and I have some questions regarding this issue. I read that you have the contamination parameter in IsolationForest for consistency. Correct me if I'm wrong, but I think that with the approach from the original paper you can discover the proportion of anomalies in the dataset, whereas with the scikit-learn implementation you have to define it. Isn't that like removing a good property of the algorithm for exploratory analysis? Thanks.
@rcamino I don't think the original paper gives a method to find the proportion of anomalies. Are you referring to this paragraph of the original paper?
The question then is how do you define 'very close to 1' and 'much smaller than 0.5'? I think this would need more work than what the original paper says. BTW you can still access the values of s with `decision_function`.
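For reference, the score s discussed here is the anomaly score defined in the original isolation forest paper (Liu et al., 2008):

s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772156649

where E(h(x)) is the average path length of x over the trees and c(n) is the average path length of an unsuccessful search in a binary search tree built on n samples. So s takes values in (0, 1], with values close to 1 flagging anomalies and values well below 0.5 indicating normal points.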
@albertcthomas Yes, I'm referring to that paragraph. It is true that 0.5 is not exactly defined as the threshold for anomalies; the expressions "very close to 1", "much smaller than 0.5" and "all the instances return s ≈ 0.5" need some interpretation and analysis on your dataset. I want to analyse this with the `decision_function`, but I don't know exactly the range of the output scores. Sorry for asking here, I don't know what the right place is (I had no luck on Stack Exchange). Thanks again.
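If it helps, one quick way to get a feel for the range empirically (a sketch; the exact range depends on the scikit-learn version and on parameters such as `max_samples`):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # toy training data

iso = IsolationForest(random_state=0).fit(X)
scores = iso.decision_function(X)
# Lower values are more abnormal; print the empirical range of the scores.
print(scores.min(), scores.max())
```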
+1 to both suggestions.
@rcamino the
@ngoix Great, thank you! Do you think that small formula should be added in the `decision_function` documentation?