Describe the bug
From my understanding, there is currently no way to minimize the MAE (Mean Absolute Error). Quantile regression with quantile=0.5 optimizes for the Median Absolute Error, which differs from optimizing the MAE when the conditional distribution of the response variable is not symmetric.
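For example, with the log-normal noise used in the reproduction below, the mean and the median of the distribution differ noticeably (a quick sketch; values are approximate):

import numpy as np

rng = np.random.RandomState(0)
samples = rng.lognormal(mean=0, sigma=1, size=100_000)
print(samples.mean())      # ~1.65, i.e. exp(sigma**2 / 2)
print(np.median(samples))  # ~1.0, i.e. exp(mean)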
Relevant code in scikit-learn/sklearn/_loss/loss.py, lines 574 to 577 in 46a7c9a:

if sample_weight is None:
    return np.mean(y_true, axis=0)
else:
    return _weighted_mean(y_true, sample_weight)

What I expect
- Using HistGradientBoostingRegressor(loss="absolute_error") should optimize for the mean of absolute errors.
- Using HistGradientBoostingRegressor(loss="quantile", quantile=0.5) should optimize for the median of absolute errors.
What happens
Both give the same results:
- Using HistGradientBoostingRegressor(loss="absolute_error") optimizes for the median of absolute errors.
- Using HistGradientBoostingRegressor(loss="quantile", quantile=0.5) optimizes for the median of absolute errors.
Suggested Actions
If this is intended behavior:
- Feel free to close this issue as resolved.
- Kindly add a note in the documentation that "absolute_error" optimizes for the Median Absolute Error, not the Mean Absolute Error, since the name alone does not make this clear.
- I would appreciate more explanation on using custom loss functions (Custom Loss function? #21614). That way, one could optimize for Mean Absolute Error, Median Absolute Error, Log Cosh, etc., as required; a sketch of what such a loss would involve is given after this list.
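To my knowledge, scikit-learn does not currently expose a public hook for plugging a custom loss into HistGradientBoostingRegressor, so the sketch below only illustrates the mathematics such a loss would need: a per-sample loss plus its gradient and hessian with respect to the raw prediction. The function names are hypothetical.

import numpy as np

def log_cosh_loss(y_true, raw_prediction):
    # Per-sample log-cosh loss: behaves like squared error for small
    # residuals and like absolute error for large ones.
    # For large |r|, a numerically stable form is
    # np.abs(r) + np.log1p(np.exp(-2 * np.abs(r))) - np.log(2).
    r = raw_prediction - y_true
    return np.log(np.cosh(r))

def log_cosh_gradient_hessian(y_true, raw_prediction):
    # First and second derivatives w.r.t. the raw prediction, which is
    # what a gradient-boosting implementation would consume.
    r = raw_prediction - y_true
    g = np.tanh(r)
    h = 1.0 - g ** 2  # sech(r)**2
    return g, h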
Note
I have tried my best to go through the documentation prior to creating this issue. I am a fresh graduate in Computer Science, so if you believe this issue is poorly framed due to a misunderstanding on my part, kindly advise me and I'll work on it.
Steps/Code to Reproduce
# Imports
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np
# Dataset Generation
x = np.linspace(start=0, stop=10, num=100)
n_repeat = 100  # number of repeated samples per x value
X = np.repeat(x, n_repeat)[:, np.newaxis]
y_true_mean = 1 * np.repeat(x, n_repeat)
noise = np.random.RandomState(0).lognormal(mean=0, sigma=1, size=y_true_mean.shape[0])
y_noisy = y_true_mean + noise
# Model Creation
mae = HistGradientBoostingRegressor(loss="absolute_error") # should be mean of absolute errors
quantile = HistGradientBoostingRegressor(loss="quantile", quantile=0.5) # should be median of absolute errors
# Fit & Prediction
y_pred_mae = mae.fit(X, y_noisy).predict(X)
y_pred_quantile = quantile.fit(X, y_noisy).predict(X)
# Prediction Comparison
print(np.abs(y_pred_mae - y_pred_quantile).sum())  # sum of absolute differences; 0 means identical predictions
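To check which conditional statistic loss="absolute_error" actually tracks, its predictions can be compared per x against the empirical conditional mean and median (a sketch reusing x, n_repeat, y_noisy, and y_pred_mae from above; all repeats of a given x share the same feature value and hence the same prediction). If the behavior reported above is right, the distance to the conditional median should be the smaller of the two.

# Empirical conditional statistics for each unique x
y_grid = y_noisy.reshape(len(x), n_repeat)
cond_mean = y_grid.mean(axis=1)
cond_median = np.median(y_grid, axis=1)
# One prediction per unique x
pred_per_x = y_pred_mae.reshape(len(x), n_repeat)[:, 0]
print(np.abs(pred_per_x - cond_mean).mean())    # distance to conditional means
print(np.abs(pred_per_x - cond_median).mean())  # distance to conditional medians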
Expected Results
The median and the mean of absolute errors should give different results for a log-normally distributed response. Hence the predictions of the two models should differ, and the sum of absolute differences between them should be non-zero.
Actual Results
Predictions from both models are identical, as shown by the sum of absolute differences between their predictions totaling 0:
0.
Versions
System:
python: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]
executable: /usr/bin/python3
machine: Linux-6.1.85+-x86_64-with-glibc2.35
Python dependencies:
sklearn: 1.5.2
pip: 24.1.2
setuptools: 75.1.0
numpy: 1.26.4
scipy: 1.13.1
Cython: 3.0.11
pandas: 2.2.2
matplotlib: 3.8.0
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 2
prefix: libopenblas
filepath: /usr/local/lib/python3.10/dist-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
version: 0.3.23.dev
threading_layer: pthreads
architecture: Haswell
user_api: blas
internal_api: openblas
num_threads: 2
prefix: libopenblas
filepath: /usr/local/lib/python3.10/dist-packages/scipy.libs/libopenblasp-r0-01191904.3.27.so
version: 0.3.27
threading_layer: pthreads
architecture: Haswell
user_api: openmp
internal_api: openmp
num_threads: 2
prefix: libgomp
filepath: /usr/local/lib/python3.10/dist-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None