Two different versions for weighted lorenz curve calculation in the examples #28534

Open
yonil7 opened this issue Feb 26, 2024 · 8 comments

@yonil7

yonil7 commented Feb 26, 2024

Describe the issue linked to the documentation

There are two different definitions of a (weighted) lorenz_curve() function, here and here.

The difference is in the X coordinates that these functions return. Both return X coordinates between 0 and 1, but the first example returns equally spaced X coordinates:

cumulated_samples = np.linspace(0, 1, len(cumulated_claim_amount))

and the second example returns unequally spaced X coordinates (spaced according to the sample weights):

cumulated_exposure = np.cumsum(ranked_exposure)
cumulated_exposure /= cumulated_exposure[-1]
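
For reference, here is a condensed sketch of how the two variants differ; the function and variable names are approximate, not the exact code of either example:

import numpy as np

def lorenz_curve(y_true, y_pred, exposure, weighted_x=True):
    # Order samples by increasing predicted risk.
    ranking = np.argsort(y_pred)
    ranked_exposure = exposure[ranking]
    ranked_amount = y_true[ranking] * ranked_exposure
    cumulated_amount = np.cumsum(ranked_amount)
    cumulated_amount = cumulated_amount / cumulated_amount[-1]
    if weighted_x:
        # Second example: X spaced by cumulated exposure (sample weights).
        cumulated_x = np.cumsum(ranked_exposure)
        cumulated_x = cumulated_x / cumulated_x[-1]
    else:
        # First example: equally spaced X, ignoring exposure on the X axis.
        cumulated_x = np.linspace(0, 1, len(cumulated_amount))
    return cumulated_x, cumulated_amount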

Suggest a potential alternative/fix

No response

@yonil7 yonil7 added Documentation Needs Triage Issue requires triage labels Feb 26, 2024
@ogrisel ogrisel added Bug and removed Needs Triage Issue requires triage labels Feb 27, 2024
@ogrisel
Member

ogrisel commented Feb 27, 2024

Indeed this looks like a bug. Someone would need to do a quick literature review to find the correct way to compute this curve when there are per-individual weights / exposure.

EDIT: I have the feeling that the cumsum variant is the correct one but I am not 100% sure.

@lamdang2k
Contributor

Indeed this looks like a bug. Someone would need to do a quick literature review to find the correct way to compute this curve when there are per-individual weights / exposure.

EDIT: I have the feeling that the cumsum variant is the correct one but I am not 100% sure.

I think you're right: the horizontal axis is the cumulative sum of the probabilities Pr(Y=y), which here correspond to the individual weights/exposure.
We can only use linspace when all claims are equally probable, with probability 1/n each.
Source: https://en.wikipedia.org/wiki/Lorenz_curve#:~:text=For%20a%20discrete,to%20n%3A
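
Concretely, the discrete form I have in mind (my own paraphrase of that definition, extended to per-sample probabilities): for observations y_1 <= ... <= y_n with probabilities p_i, the curve passes through the points

\left( \sum_{i \le k} p_i ,\ \frac{\sum_{i \le k} p_i y_i}{\sum_{i=1}^{n} p_i y_i} \right), \qquad k = 0, 1, \dots, n.

With equal probabilities p_i = 1/n, the X coordinates become evenly spaced, which is what np.linspace produces.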

@lorentzenchr
Member

lorentzenchr commented Mar 8, 2024

One version is unweighted, the other takes into account sample weights. So neither is wrong.

See also #21320 (comment), and note that Wikipedia is not a reliable source on these quantities.

@yonil7
Author

yonil7 commented Mar 8, 2024

I'm not an expert, but something I find strange about the unweighted version is that it does use the weights (exposure), just differently:
cumulated_claim_amount = np.cumsum(ranked_pure_premium * ranked_exposure)
In more intuitive terms: it uses the weights to make each bar higher but not wider (each bar has the same width, and its height is affected by the weight), whereas in the weighted version the weight affects both the height and the width of each bar.
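
To make that concrete, here is a toy example (the numbers are mine, not from either scikit-learn example):

import numpy as np

# Three policies, already ranked; the last one has twice the exposure of the others.
y = np.array([1.0, 2.0, 3.0])   # ranked pure premium
w = np.array([1.0, 1.0, 2.0])   # ranked exposure (weights)

heights = np.cumsum(y * w) / np.sum(y * w)   # Y axis: weights change the bar heights
x_unweighted = np.linspace(0, 1, len(y))     # first example: every bar has the same width
x_weighted = np.cumsum(w) / np.sum(w)        # second example: bar widths follow the weights

print(x_unweighted)  # [0.   0.5  1.  ]
print(x_weighted)    # [0.25 0.5  1.  ]
print(heights)       # [0.111... 0.333... 1.  ]

In both versions the exposure doubles the last policy's contribution to the Y axis; only the second version also doubles its width on the X axis.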

@ogrisel
Member

ogrisel commented Mar 28, 2024

Indeed, I agree it's weird to only use the weights on the y axis and not on the x axis.

I would intuitively expect the following weighting semantics to hold:

  • setting the weights of a fraction of the observations to 0 should be equivalent to trimming those data points;
  • setting the weights of a fraction of the observations to 2 (while keeping the others at unit weights) should be equivalent to duplicating those data points.

We can empirically check whether those properties hold: the curves for the weighted vs. trimmed/duplicated but unweighted data should match exactly.

@m-maggi
Contributor

m-maggi commented Jun 15, 2024

Indeed, I agree it's weird to only use the weights on the y axis and not on the x axis.

I would intuitively expect the following weighting semantics to hold:

  • setting the weights of a fraction of the observations to 0 should be equivalent to trimming those data points;
  • setting the weights of a fraction of the observations to 2 (while keeping the others at unit weights) should be equivalent to duplicating those data points.

We can empirically check whether those properties hold: the curves for the weighted vs. trimmed/duplicated but unweighted data should match exactly.

I find this topic interesting and I would like to contribute.
I started from the definition of the Lorenz curve given in this paper:
[Screenshot: the definition of the Lorenz curve from the paper]

and checked that the approach with the cumulative sums in the scikit-learn examples matches the theory:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.integrate import quad

alpha = np.linspace(0, 1, 100)
rv = stats.gamma(scale=10, a=0.5)


def f2int(x):
    # quantile function F^{-1}(x), the integrand of the theoretical Lorenz curve
    return rv.ppf(x)


# theoretical Lorenz curve: 1 / E[Y] * integral of F^{-1}(t) from 0 to alpha
res = [1 / rv.mean() * quad(f2int, a=0, b=el)[0] for el in alpha]

# empirical curve from a sample, using the cumulative-sum approach of the examples
sample = rv.rvs(size=1000, random_state=518522)
alpha_sample = np.linspace(0, 1, len(sample))
empirical_percentiles = np.percentile(sample, q=alpha_sample * 100)  # not used below
ranking = np.argsort(sample)
ranked_claims = sample[ranking]
cumulated_claims = np.cumsum(ranked_claims)
cumulated_claims /= cumulated_claims[-1]
plt.plot(
    alpha_sample, cumulated_claims, marker=".", alpha=0.5, linewidth=4, color="orange"
)
plt.plot(alpha, res, alpha=1.0)
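
In formulas, the theoretical curve computed in res above is (as far as I can reconstruct it, this is also what the screenshotted definition says)

LC(\alpha) = \frac{1}{\mathbb{E}[Y]} \int_0^{\alpha} F^{-1}(t)\, dt, \qquad \alpha \in [0, 1],

where F^{-1} is the quantile function (rv.ppf); cumulated_claims is the empirical counterpart obtained from the sorted sample.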

[Figure: the empirical cumulative-sum curve (orange) overlapping the theoretical Lorenz curve]

To check the behaviour under weighting, I simulated unweighted data from a Poisson distribution with parameter 0.1 and then aggregated it: I aggregated the exposure and the claim count, and recomputed the frequency for the weighted data (claim count divided by exposure).
You can find this example here
My reasoning: when simulating iid samples from a Poisson with parameter 0.1, we have many 0s and few values >= 1. If we aggregate the data, the claim frequencies will converge toward 0.1 as group sizes increase, hence there will be no inequality at all and the Lorenz curve will be the diagonal line.

[Figure: Lorenz curve of the aggregated Poisson data, lying on the diagonal]
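
A minimal sketch of that aggregation check (my own reconstruction with illustrative group sizes, not the exact code of the linked example):

import numpy as np

rng = np.random.default_rng(0)

# Unweighted data: iid Poisson(0.1) claim counts, one unit of exposure each.
n = 100_000
claim_count = rng.poisson(0.1, size=n)

# Aggregate into groups of 1000 policies: exposures and claim counts add up,
# and the per-group frequency (claims / exposure) concentrates around 0.1.
group_size = 1000
agg_exposure = np.full(n // group_size, float(group_size))
agg_claims = claim_count.reshape(-1, group_size).sum(axis=1)
agg_frequency = agg_claims / agg_exposure

# Weighted Lorenz curve of the aggregated data, with the cumsum-based X axis.
ranking = np.argsort(agg_frequency)
ranked_exposure = agg_exposure[ranking]
ranked_frequency = agg_frequency[ranking]
cumulated_claims = np.cumsum(ranked_frequency * ranked_exposure)
cumulated_claims /= cumulated_claims[-1]
cumulated_exposure = np.cumsum(ranked_exposure)
cumulated_exposure /= cumulated_exposure[-1]

# With (almost) no inequality between groups, the curve stays close to the diagonal.
print(np.max(np.abs(cumulated_claims - cumulated_exposure)))  # small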

Update:
I did another example, following the procedure for creating weighted data in #29248: one first simulates the weighted data and then repeats the observations to get the unweighted data.

When building the Lorenz curve using the weighted data I then used this approach

cumulated_exposure = np.cumsum(ranked_exposure)
cumulated_exposure /= cumulated_exposure[-1]

and obtained points lying on the unweighted curve.

[Figure: points from the weighted-data curve lying on the unweighted (repeated-data) curve]

@ogrisel
Member

ogrisel commented Oct 28, 2024

@m-maggi sorry, I had not seen your reply.

Could you try the following simple experiment:

import numpy as np
rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

And then compute the curves for:

  • the exposure-repeated data:
y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight = None  # or equivalently: np.ones_like(y_true_repeated)

vs

  • exposure-weighted data:
y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight = exposure
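
For concreteness, here is a rough sketch of the full comparison, using a cumsum-based helper in the spirit of the second example (the helper and its exact details are illustrative):

import numpy as np

def lorenz_curve(y_true, y_pred, sample_weight=None):
    # Weighted Lorenz curve with the cumsum-based X axis (the second example's variant).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones_like(y_true)
    sample_weight = np.asarray(sample_weight, dtype=float)
    ranking = np.argsort(y_pred)  # order by increasing predicted risk
    ranked_weight = sample_weight[ranking]
    ranked_amount = y_true[ranking] * ranked_weight
    cumulated_amount = np.cumsum(ranked_amount) / ranked_amount.sum()
    cumulated_weight = np.cumsum(ranked_weight) / ranked_weight.sum()
    return cumulated_weight, cumulated_amount

rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

# Curve on the exposure-repeated, unweighted data (zero-exposure rows drop out):
x_rep, y_rep = lorenz_curve(y_true.repeat(exposure), y_pred.repeat(exposure))

# Curve on the exposure-weighted data:
x_wgt, y_wgt = lorenz_curve(y_true, y_pred, sample_weight=exposure)

# With the cumsum-based X axis, every weighted point should lie on the repeated
# curve; the origin is prepended so zero-weight points interpolate correctly.
print(np.allclose(y_wgt, np.interp(x_wgt, np.r_[0.0, x_rep], np.r_[0.0, y_rep])))

This is just the empirical check of the trimming/duplication semantics described above, written out end to end.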

@ogrisel
Member

ogrisel commented Oct 28, 2024

But feel free to open a PR to fix the wrong example in any case.
