Two different versions for weighted lorenz curve calculation in the examples #28534

Open
yonil7 opened this issue Feb 26, 2024 · 8 comments

@yonil7

yonil7 commented Feb 26, 2024

Describe the issue linked to the documentation

There are two different definitions of a (weighted) lorenz_curve() function, here and here.

The difference is in the X coordinates that these functions return. Both return X coordinates between 0 and 1, but the first example returns equally spaced X coordinates:

cumulated_samples = np.linspace(0, 1, len(cumulated_claim_amount))

and the second example returns unequally spaced X coordinates (spaced according to the sample weights):

cumulated_exposure = np.cumsum(ranked_exposure)
cumulated_exposure /= cumulated_exposure[-1]
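
For reference, here is a condensed sketch of how the two variants differ; the function and variable names are approximate, not the exact code of either example:

import numpy as np

def lorenz_curve(y_true, y_pred, exposure, weighted_x=True):
    # Order samples by increasing predicted risk.
    ranking = np.argsort(y_pred)
    ranked_exposure = exposure[ranking]
    ranked_amount = y_true[ranking] * ranked_exposure
    cumulated_amount = np.cumsum(ranked_amount)
    cumulated_amount = cumulated_amount / cumulated_amount[-1]
    if weighted_x:
        # Second example: X spaced by cumulated exposure (sample weights).
        cumulated_x = np.cumsum(ranked_exposure)
        cumulated_x = cumulated_x / cumulated_x[-1]
    else:
        # First example: equally spaced X, ignoring exposure on the X axis.
        cumulated_x = np.linspace(0, 1, len(cumulated_amount))
    return cumulated_x, cumulated_amount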

Suggest a potential alternative/fix

No response

@yonil7 yonil7 added Documentation Needs Triage Issue requires triage labels Feb 26, 2024
@ogrisel ogrisel added Bug and removed Needs Triage Issue requires triage labels Feb 27, 2024
@ogrisel
Member

ogrisel commented Feb 27, 2024

Indeed this looks like a bug. Someone would need to do a quick literature review to find the correct way to compute this curve when there are per-individual weights / exposure.

EDIT: I have the feeling that the cumsum variant is the correct one but I am not 100% sure.

@lamdang2k
Contributor

Indeed this looks like a bug. Someone would need to do a quick literature review to find the correct way to compute this curve when there are per-individual weights / exposure.

EDIT: I have the feeling that the cumsum variant is the correct one but I am not 100% sure.

I think you're right: the horizontal axis is the cumulative sum of the probabilities Pr(Y=y), which here correspond to the individual weights/exposure.
We can only use linspace when all claims are equally probable, with probability 1/n each.
Source: https://en.wikipedia.org/wiki/Lorenz_curve#:~:text=For%20a%20discrete,to%20n%3A
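
Concretely, the discrete form I have in mind (my own paraphrase of that definition, extended to per-sample probabilities): for observations y_1 <= ... <= y_n with probabilities p_i, the curve passes through the points

\left( \sum_{i \le k} p_i ,\ \frac{\sum_{i \le k} p_i y_i}{\sum_{i=1}^{n} p_i y_i} \right), \qquad k = 0, 1, \dots, n.

With equal probabilities p_i = 1/n, the X coordinates become evenly spaced, which is what np.linspace produces.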

@lorentzenchr
Member

lorentzenchr commented Mar 8, 2024

One version is unweighted, the other takes into account sample weights. So neither is wrong.

See also #21320 (comment), and note that Wikipedia is not a reliable source on these quantities.

@yonil7
Author

yonil7 commented Mar 8, 2024

I'm not an expert, but something I find strange about the unweighted version is that it does use the weights (exposure), just differently:
cumulated_claim_amount = np.cumsum(ranked_pure_premium * ranked_exposure)
In more intuitive terms: it uses the weights to make each bar higher but not wider (each bar has the same width, and its height is affected by the weight), whereas in the weighted version the weight affects both the height and the width of each bar.
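
To make that concrete, here is a toy example (the numbers are mine, not from either scikit-learn example):

import numpy as np

# Three policies, already ranked; the last one has twice the exposure of the others.
y = np.array([1.0, 2.0, 3.0])   # ranked pure premium
w = np.array([1.0, 1.0, 2.0])   # ranked exposure (weights)

heights = np.cumsum(y * w) / np.sum(y * w)   # Y axis: weights change the bar heights
x_unweighted = np.linspace(0, 1, len(y))     # first example: every bar has the same width
x_weighted = np.cumsum(w) / np.sum(w)        # second example: bar widths follow the weights

print(x_unweighted)  # [0.   0.5  1.  ]
print(x_weighted)    # [0.25 0.5  1.  ]
print(heights)       # [0.111... 0.333... 1.  ]

In both versions the exposure doubles the last policy's contribution to the Y axis; only the second version also doubles its width on the X axis.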

@ogrisel
Member

ogrisel commented Mar 28, 2024

Indeed, I agree it's weird to only use the weights on the y axis and not on the x axis.

I would intuitively expect the following weighting semantics to hold:

  • setting the weights of a fraction of the observations to 0 should be equivalent to trimming those data points;
  • setting the weights of a fraction of the observations to 2 (while keeping the others at unit weights) should be equivalent to duplicating those data points.

We can empirically check whether those properties hold: the curves for the weighted vs. trimmed/duplicated but unweighted data should match exactly.

@m-maggi
Contributor

m-maggi commented Jun 15, 2024

Indeed, I agree it's weird to only use the weights on the y axis and not on the x axis.

I would intuitively expect the following weighting semantics to hold:

  • setting the weights of a fraction of the observations to 0 should be equivalent to trimming those data points;
  • setting the weights of a fraction of the observations to 2 (while keeping the others at unit weights) should be equivalent to duplicating those data points.

We can empirically check whether those properties hold: the curves for the weighted vs. trimmed/duplicated but unweighted data should match exactly.

I find this topic interesting and I would like to contribute.
I started from the definition of the Lorenz curve given in this paper:
[Screenshot: the definition of the Lorenz curve from the paper]

and checked that the approach with the cumulative sums in the scikit-learn examples matches the theory:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.integrate import quad

alpha = np.linspace(0, 1, 100)
rv = stats.gamma(scale=10, a=0.5)


def f2int(x):
    # quantile function F^{-1}(x), the integrand of the theoretical Lorenz curve
    return rv.ppf(x)


# theoretical Lorenz curve: 1 / E[Y] * integral of F^{-1}(t) from 0 to alpha
res = [1 / rv.mean() * quad(f2int, a=0, b=el)[0] for el in alpha]

# empirical curve from a sample, using the cumulative-sum approach of the examples
sample = rv.rvs(size=1000, random_state=518522)
alpha_sample = np.linspace(0, 1, len(sample))
empirical_percentiles = np.percentile(sample, q=alpha_sample * 100)  # not used below
ranking = np.argsort(sample)
ranked_claims = sample[ranking]
cumulated_claims = np.cumsum(ranked_claims)
cumulated_claims /= cumulated_claims[-1]
plt.plot(
    alpha_sample, cumulated_claims, marker=".", alpha=0.5, linewidth=4, color="orange"
)
plt.plot(alpha, res, alpha=1.0)
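
In formulas, the theoretical curve computed in res above is (as far as I can reconstruct it, this is also what the screenshotted definition says)

LC(\alpha) = \frac{1}{\mathbb{E}[Y]} \int_0^{\alpha} F^{-1}(t)\, dt, \qquad \alpha \in [0, 1],

where F^{-1} is the quantile function (rv.ppf); cumulated_claims is the empirical counterpart obtained from the sorted sample.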

[Figure: the empirical cumulative-sum curve (orange) overlapping the theoretical Lorenz curve]

To check the behaviour under weighting, I simulated unweighted data from a Poisson distribution with parameter 0.1 and then aggregated it: I aggregated the exposure and the claim count, and recomputed the frequency for the weighted data (claim count divided by exposure).
You can find this example here
My reasoning: when simulating iid samples from a Poisson with parameter 0.1, we have many 0s and few values >= 1. If we aggregate the data, the claim frequencies will converge toward 0.1 as group sizes increase, hence there will be no inequality at all and the Lorenz curve will be the diagonal line.

[Figure: Lorenz curve of the aggregated Poisson data, lying on the diagonal]
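
A minimal sketch of that aggregation check (my own reconstruction with illustrative group sizes, not the exact code of the linked example):

import numpy as np

rng = np.random.default_rng(0)

# Unweighted data: iid Poisson(0.1) claim counts, one unit of exposure each.
n = 100_000
claim_count = rng.poisson(0.1, size=n)

# Aggregate into groups of 1000 policies: exposures and claim counts add up,
# and the per-group frequency (claims / exposure) concentrates around 0.1.
group_size = 1000
agg_exposure = np.full(n // group_size, float(group_size))
agg_claims = claim_count.reshape(-1, group_size).sum(axis=1)
agg_frequency = agg_claims / agg_exposure

# Weighted Lorenz curve of the aggregated data, with the cumsum-based X axis.
ranking = np.argsort(agg_frequency)
ranked_exposure = agg_exposure[ranking]
ranked_frequency = agg_frequency[ranking]
cumulated_claims = np.cumsum(ranked_frequency * ranked_exposure)
cumulated_claims /= cumulated_claims[-1]
cumulated_exposure = np.cumsum(ranked_exposure)
cumulated_exposure /= cumulated_exposure[-1]

# With (almost) no inequality between groups, the curve stays close to the diagonal.
print(np.max(np.abs(cumulated_claims - cumulated_exposure)))  # small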

Update:
I did another example, following the procedure for creating weighted data in #29248: one first simulates the weighted data and then repeats the observations to get the unweighted data.

When building the Lorenz curve using the weighted data I then used this approach

cumulated_exposure = np.cumsum(ranked_exposure)
cumulated_exposure /= cumulated_exposure[-1]

and obtained points lying on the unweighted curve.

[Figure: points from the weighted-data curve lying on the unweighted (repeated-data) curve]

@ogrisel
Member

ogrisel commented Oct 28, 2024

@m-maggi sorry, I had not seen your reply.

Could you try the following simple experiment:

import numpy as np
rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

And then compute the curves for:

  • the exposure-repeated data:
y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight = None  # or equivalently: np.ones_like(y_true_repeated)

vs

  • exposure-weighted data:
y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight = exposure
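
For concreteness, here is a rough sketch of the full comparison, using a cumsum-based helper in the spirit of the second example (the helper and its exact details are illustrative):

import numpy as np

def lorenz_curve(y_true, y_pred, sample_weight=None):
    # Weighted Lorenz curve with the cumsum-based X axis (the second example's variant).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones_like(y_true)
    sample_weight = np.asarray(sample_weight, dtype=float)
    ranking = np.argsort(y_pred)  # order by increasing predicted risk
    ranked_weight = sample_weight[ranking]
    ranked_amount = y_true[ranking] * ranked_weight
    cumulated_amount = np.cumsum(ranked_amount) / ranked_amount.sum()
    cumulated_weight = np.cumsum(ranked_weight) / ranked_weight.sum()
    return cumulated_weight, cumulated_amount

rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

# Curve on the exposure-repeated, unweighted data (zero-exposure rows drop out):
x_rep, y_rep = lorenz_curve(y_true.repeat(exposure), y_pred.repeat(exposure))

# Curve on the exposure-weighted data:
x_wgt, y_wgt = lorenz_curve(y_true, y_pred, sample_weight=exposure)

# With the cumsum-based X axis, every weighted point should lie on the repeated
# curve; the origin is prepended so zero-weight points interpolate correctly.
print(np.allclose(y_wgt, np.interp(x_wgt, np.r_[0.0, x_rep], np.r_[0.0, y_rep])))

This is just the empirical check of the trimming/duplication semantics described above, written out end to end.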

@ogrisel
Member

ogrisel commented Oct 28, 2024

But feel free to open a PR to fix the wrong example in any case.
