Two different versions for weighted Lorenz curve calculation in the examples #28534
Comments
Indeed, this looks like a bug. Someone would need to do a quick literature review to find the correct way to compute this curve when there are per-individual weights / exposure. EDIT: I have the feeling that the cumsum variant is the correct one, but I am not 100% sure.
I think you're right: the horizontal axis is the cumulative sum of the probabilities Pr(Y = y), which here corresponds to the individual weights/exposure.
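In symbols, my understanding of that variant is the following (my paraphrase, not a formula from the examples), with samples sorted by increasing predicted risk and $w_{(i)}$, $y_{(i)}$ the sorted weights and observations:

$$
x_k = \frac{\sum_{i \le k} w_{(i)}}{\sum_{i=1}^{n} w_i},
\qquad
y_k = \frac{\sum_{i \le k} w_{(i)}\, y_{(i)}}{\sum_{i=1}^{n} w_i\, y_i},
$$

so the x axis is the cumulative share of weight (exposure) rather than an equally spaced grid.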
One version is unweighted, the other takes sample weights into account, so neither is wrong. See also #21320 (comment), and note that Wikipedia is not a reliable source on these quantities.
I'm no expert, but something I find strange about the unweighted version is that it does use the weights (exposure), just differently: only on the y axis, not on the x axis.
Indeed, I agree it's weird to use the weights only on the y axis and not on the x axis. I would intuitively expect the following weighting semantics to hold: setting a sample's weight to zero should be equivalent to trimming that sample from the dataset, and giving it an integer weight k should be equivalent to duplicating it k times.
We can empirically check when those properties hold: the curves for the weighted data vs the trimmed/duplicated but unweighted data should match exactly.
I find this topic interesting and I would like to contribute. I checked that the approach with the cumulative sums in the scikit-learn examples matches the theory:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.integrate import quad

# Theoretical Lorenz curve: L(p) = (1 / E[Y]) * integral_0^p ppf(u) du
alpha = np.linspace(0, 1, 100)
rv = stats.gamma(scale=10, a=0.5)

def f2int(x):
    return rv.ppf(x)

res = [1 / rv.mean() * quad(f2int, a=0, b=el)[0] for el in alpha]

# Empirical Lorenz curve from a sample, via cumulative sums:
sample = rv.rvs(size=1000, random_state=518522)
alpha_sample = np.linspace(0, 1, len(sample))
empirical_percentiles = np.percentile(sample, q=alpha_sample * 100)  # q in percent
ranking = np.argsort(sample)
ranked_claims = sample[ranking]
cumulated_claims = np.cumsum(ranked_claims)
cumulated_claims /= cumulated_claims[-1]

plt.plot(
    alpha_sample, cumulated_claims, marker=".", alpha=0.5, linewidth=4, color="orange"
)
plt.plot(alpha, res, alpha=1.0)
```

To check the behaviour when weighting, I simulated unweighted data from a Poisson distribution with parameter 0.1 and then aggregated it: I aggregated the exposure and the claim count, and recomputed the frequency for the weighted data (claim count divided by exposure). Update: when building the Lorenz curve using the weighted data I then used this approach and obtained points lying on the unweighted curve.
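For concreteness, here is a small sketch of that aggregation experiment as I read it; the grouping scheme and variable names are my assumptions, not the exact code used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unit-exposure records with Poisson(0.1) claim counts, assigned to a few
# risk classes so that records can later be aggregated per class.
n, n_groups = 10_000, 50
group = rng.integers(0, n_groups, size=n)
counts = rng.poisson(0.1, size=n)

# Aggregate exposure and claim count per class, then recompute the frequency.
exposure = np.bincount(group, minlength=n_groups).astype(float)
agg_counts = np.bincount(group, weights=counts, minlength=n_groups)
mask = exposure > 0  # guard against empty classes
frequency = agg_counts[mask] / exposure[mask]  # claim count / exposure
```

Plugging `frequency` with `sample_weight = exposure[mask]` into the cumulative-sum construction should then reproduce the points of the curve built from the raw unit-exposure records.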
@m-maggi sorry I had not seen your reply. Could you try to do this simple experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)
```

And then compute the curves for:

```python
y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight = None  # or equivalently: np.ones_like(y_true_repeated)
```

vs

```python
y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight = exposure
```
But feel free to open a PR to fix the wrong example in any case.
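A possible way to run that comparison end to end is sketched below; the `lorenz_curve` helper is an illustrative implementation of the cumulative-sum variant (with the origin prepended), not the exact code from the examples:

```python
import numpy as np

def lorenz_curve(y_true, y_pred, sample_weight):
    """Weighted Lorenz curve using cumulative sums on both axes."""
    y_true = np.asarray(y_true, dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    ranking = np.argsort(y_pred)  # order by increasing predicted risk
    ranked_weight = sample_weight[ranking]
    ranked_amount = y_true[ranking] * ranked_weight
    # Prepend the origin so both curves start at (0, 0).
    cum_weight = np.hstack([0.0, np.cumsum(ranked_weight) / ranked_weight.sum()])
    cum_amount = np.hstack([0.0, np.cumsum(ranked_amount) / ranked_amount.sum()])
    return cum_weight, cum_amount

rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

# Repeated-but-unweighted vs weighted versions of the same data:
x_rep, y_rep = lorenz_curve(
    y_true.repeat(exposure), y_pred.repeat(exposure), np.ones(exposure.sum())
)
x_wgt, y_wgt = lorenz_curve(y_true, y_pred, exposure)

# With the cumulative-sum variant, the weighted curve's vertices lie on the
# repeated curve (which simply has more intermediate points):
print(np.allclose(np.interp(x_wgt, x_rep, y_rep), y_wgt))  # expected: True
```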
Describe the issue linked to the documentation
There are 2 definitions of (weighted) `lorenz_curve()` functions, here and here. The difference is in the X coordinates that these functions return. Both return X coordinates between 0 and 1: the first example returns equally spaced X coordinates, while the second example returns unequally spaced X coordinates (spaced using the sample weights). Both constructions are sketched below.
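For reference, the two constructions boil down to something like the following paraphrase (not the exact code from the examples); both sort by increasing predicted risk and weight the y axis, and they differ only in the x axis:

```python
import numpy as np

def lorenz_curve_equal_x(y_true, y_pred, exposure):
    """First variant: equally spaced X coordinates."""
    ranking = np.argsort(y_pred)
    ranked_amount = np.asarray(y_true)[ranking] * np.asarray(exposure)[ranking]
    cumulated_amount = np.cumsum(ranked_amount) / np.sum(ranked_amount)
    x = np.linspace(0, 1, len(cumulated_amount))  # ignores the weights
    return x, cumulated_amount

def lorenz_curve_weighted_x(y_true, y_pred, exposure):
    """Second variant: X coordinates spaced using the sample weights."""
    ranking = np.argsort(y_pred)
    ranked_exposure = np.asarray(exposure)[ranking]
    ranked_amount = np.asarray(y_true)[ranking] * ranked_exposure
    cumulated_amount = np.cumsum(ranked_amount) / np.sum(ranked_amount)
    x = np.cumsum(ranked_exposure) / np.sum(ranked_exposure)  # weight-spaced
    return x, cumulated_amount
```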
Suggest a potential alternative/fix
No response