DOC: clarify the documentation for the loss functions used in GBRT, and Absolute Error in particular. #30339

Closed
@AhmedThahir

Description

Describe the bug

From my understanding, there is currently no way to minimize the MAE (Mean Absolute Error). Quantile regression with quantile=0.5 optimizes the Median Absolute Error, which differs from optimizing the MAE when the conditional distribution of the response variable is not symmetric.

        if sample_weight is None:
            return np.median(y_true, axis=0)
        else:
            return _weighted_percentile(y_true, sample_weight, 50)
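As a self-contained illustration of the distinction above (not scikit-learn code): for a skewed sample, the constant that minimizes the sum of absolute deviations is the sample median, which sits well below the sample mean.

```python
import numpy as np

# Illustrative only: brute-force search for the constant c that minimizes
# the sum of absolute deviations sum(|y - c|) over a skewed (log-normal) sample.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)

candidates = np.linspace(y.min(), y.max(), 4_001)
sad = np.abs(y[:, None] - candidates[None, :]).sum(axis=0)
best = candidates[sad.argmin()]

# The minimizer lands on the median; the mean is noticeably larger
# because the distribution is right-skewed.
print(best, np.median(y), np.mean(y))
```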

What I expect

  • Using HistGradientBoostingRegressor(loss="absolute_error") should optimize for the mean of absolute errors.
  • Using HistGradientBoostingRegressor(loss="quantile", quantile=0.5) should optimize for the median of absolute errors.
        if sample_weight is None:
            return np.mean(y_true, axis=0)
        else:
            return _weighted_mean(y_true, sample_weight)
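Note that `_weighted_mean` in the snippet above is a hypothetical helper, not an existing scikit-learn function; one way to sketch it with plain NumPy (`np.average` already handles both the weighted and unweighted cases) would be:

```python
import numpy as np

def weighted_mean(y_true, sample_weight=None):
    # Hypothetical helper: np.average returns the plain mean when
    # weights is None, and the weighted mean otherwise.
    return np.average(y_true, weights=sample_weight)
```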

What happens
Both give the same results

  • Using HistGradientBoostingRegressor(loss="absolute_error") optimizes for the median of absolute errors
  • Using HistGradientBoostingRegressor(loss="quantile", quantile=0.5) optimizes for the median of absolute errors

Suggested Actions

If this is intended behavior:

  • Feel free to close this issue as resolved.
  • Kindly add a note to the documentation stating that "absolute_error" optimizes the Median Absolute Error, not the Mean Absolute Error, since the name "absolute_error" alone does not make this clear.
  • I would appreciate more documentation on using custom loss functions (Custom Loss function? #21614). That way, users could optimize for Mean Absolute Error, Median Absolute Error, Log-Cosh, etc. as required.
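For instance, a custom Log-Cosh loss could be sketched in NumPy as below. This is illustrative only, it does not plug into scikit-learn's (private) loss machinery; the function names are mine, and the gradient/hessian follow from d/dz log(cosh(z)) = tanh(z):

```python
import numpy as np

def log_cosh_loss(y_true, raw_pred):
    z = raw_pred - y_true
    # Numerically stable log(cosh(z)) = |z| + log1p(exp(-2|z|)) - log(2),
    # avoiding overflow in cosh for large |z|.
    return np.abs(z) + np.log1p(np.exp(-2.0 * np.abs(z))) - np.log(2.0)

def log_cosh_gradient(y_true, raw_pred):
    # First derivative w.r.t. the raw prediction.
    return np.tanh(raw_pred - y_true)

def log_cosh_hessian(y_true, raw_pred):
    # Second derivative: 1 - tanh(z)^2, always in (0, 1].
    return 1.0 - np.tanh(raw_pred - y_true) ** 2
```

For small residuals this behaves like squared error, and for large residuals like absolute error, which is why it is sometimes proposed as a smooth MAE surrogate.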

Note
I have tried my best to go through the documentation before creating this issue. I am a fresh Computer Science graduate, and if you believe this issue is poorly framed due to a misunderstanding on my part, kindly advise me and I'll work on it.

Steps/Code to Reproduce

# Imports
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np

# Dataset Generation
x = np.linspace(start=0, stop=10, num=100)

n_repeat = 100  # number of samples for each unique x value
X = np.repeat(x, n_repeat)[:, np.newaxis]
y_true_mean = 1 * np.repeat(x, n_repeat)
noise = np.random.RandomState(0).lognormal(mean=0, sigma=1, size=y_true_mean.shape[0])
y_noisy = y_true_mean + noise

# Model Creation
mae = HistGradientBoostingRegressor(loss="absolute_error") # should be mean of absolute errors
quantile = HistGradientBoostingRegressor(loss="quantile", quantile=0.5) # should be median of absolute errors

# Fit & Prediction
y_pred_mae = mae.fit(X, y_noisy).predict(X)
y_pred_quantile = quantile.fit(X, y_noisy).predict(X)

# Prediction Comparison
print((y_pred_mae - y_pred_quantile).sum()) # both give same results

Expected Results

The median and mean of absolute errors should give different results for a log-normally distributed response. Hence, the predictions should differ from each other, and the sum of the differences between their predictions should be non-zero.

Actual Results

Predictions by both models are identical, which can be seen in the difference of their predictions summing to 0.

0.

Versions

System:
    python: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.1.85+-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.5.2
          pip: 24.1.2
   setuptools: 75.1.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: 3.0.11
       pandas: 2.2.2
   matplotlib: 3.8.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 2
         prefix: libopenblas
       filepath: /usr/local/lib/python3.10/dist-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 2
         prefix: libopenblas
       filepath: /usr/local/lib/python3.10/dist-packages/scipy.libs/libopenblasp-r0-01191904.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 2
         prefix: libgomp
       filepath: /usr/local/lib/python3.10/dist-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
