Hi, I'm trying to understand why the following code does not find a good fit. The convergence rate is still high at the end, which indicates that a good solution has not been found, and the iteration stops only because the change in convergence_rate has dropped below the rate_tolerance.
The dummy use-case is that I have 3 tables spanning [education, age], [education, gender], and [education, children], and I want a joint distribution over [education, age, gender, children] that fits the marginals of all 3 tables. In my real use-case I cannot expand the distributions sequentially, because the data consists of the tables [age, municipality], [municipality, children], and [age, children], which connect in a triangle.
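To make "expanding sequentially" concrete for the dummy case, here is a minimal NumPy sketch. It assumes that age, gender and children are conditionally independent given education, which is one way to read the star-shaped layout of the three tables. The random `truth` array and all variable names are hypothetical and only serve to produce consistent marginals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint table with axes (education, age, gender, children),
# used only to derive three mutually consistent 2-D marginals.
truth = rng.uniform(1.0, 8.0, size=(4, 3, 2, 2))
p_ea = truth.sum(axis=(2, 3))    # [education, age]
p_eg = truth.sum(axis=(1, 3))    # [education, gender]
p_ec = truth.sum(axis=(1, 2))    # [education, children]
p_e = truth.sum(axis=(1, 2, 3))  # [education]

# Sequential expansion through the shared education axis:
# w(e, a, g, c) = w(e, a) * w(e, g) * w(e, c) / w(e)**2
joint = np.einsum("ea,eg,ec->eagc", p_ea, p_eg, p_ec)
joint /= (p_e ** 2)[:, None, None, None]

# Largest absolute deviation of each reconstructed marginal from its target.
err_ea = np.abs(joint.sum(axis=(2, 3)) - p_ea).max()
err_eg = np.abs(joint.sum(axis=(1, 3)) - p_eg).max()
err_ec = np.abs(joint.sum(axis=(1, 2)) - p_ec).max()
print(err_ea, err_eg, err_ec)
```

This closed form relies on every table sharing the education axis. The triangle of [age, municipality], [municipality, children] and [age, children] has no single shared axis to expand through, which is why an iterative fit is needed there.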
I've only tested this on dummy data so far, which conveniently makes it easy to paste and reproduce.
In the dummy data, I have the full distribution for [education, age, gender, children]. I get the marginals by grouping on subsets of the axes. I then pass the full distribution as the initial value and try to fit it to those marginals. This should converge after 1 step, because the initial value already fits perfectly. Instead, the result is a very bad fit, with a convergence rate of 2.218433 at the final iteration step, where it stops because the change in convergence rate is below the rate_tolerance.
Am I using the function wrong, or is there a convergence issue?
`# library imports
from ipfn import ipfn
import numpy as np
import pandas as pd
# Generate the full joint distribution
weight = np.array([1., 2., 1., 3., 5., 5., 6., 2., 2., 1., 7., 6.,
5., 4., 2., 5., 5., 5., 3., 8., 7., 2., 7., 6.,
1., 2., 1., 3., 5., 5., 6., 2., 2., 1., 7., 6.,
5., 4., 2., 5., 5., 5., 3., 8., 7., 2., 7., 6.,],)
weight = weight * 0.5 # to still sum to 100.
gender_l = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,]
education_l = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4,
1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4,
1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4,
1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4,]
age_l = ['20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',
'20-25','30-35','40-45',]
children_l = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,]
df = pd.DataFrame()
df['gender'] = gender_l
df['education'] = education_l
df['age'] = age_l
df['children'] = children_l
df['weight'] = weight
# Define the 2d target marginal distributions dataframes
gender_education_conditional = df.groupby(['gender', 'education'])['weight'].sum()
education_age_conditional = df.groupby(['education', 'age'])['weight'].sum()
education_children_conditional = df.groupby(['education','children'])['weight'].sum()
# Perform the ipfn
aggregates = [gender_education_conditional, education_age_conditional, education_children_conditional]
dimensions = [['gender', 'education'], ['education', 'age'], ['education','children']]
IPF = ipfn.ipfn(df, aggregates, dimensions, weight_col="weight", verbose = 2)
ipf_out = IPF.iteration()
df = ipf_out[0]
flag = ipf_out[1]
convergence_rate = ipf_out[2]
# And print the results for evaluation
print(flag)
print(convergence_rate)
print(df.groupby(["education", "age"])["weight"].sum(), education_age_conditional)
print(df.groupby(["education", "children"])["weight"].sum(), education_children_conditional)
print(df.groupby(["gender", "education"])["weight"].sum(), gender_education_conditional)`
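For comparison, the one-step expectation described above can be checked independently of ipfn. The following is a minimal hand-written IPF sweep in plain NumPy, not the library's implementation. The helper `ipf_pass`, the random `full` table, and the axis layout (gender, education, age, children) are all assumptions made for this sketch, and it assumes strictly positive cells so no division-by-zero guard is needed.

```python
import numpy as np

def ipf_pass(table, targets):
    """Run one IPF sweep over all target marginals.

    `targets` maps a tuple of kept axes (in ascending order) to the marginal
    array over those axes. Returns the rescaled table and the largest
    deviation of any scaling factor from 1.
    """
    table = table.copy()
    max_dev = 0.0
    for axes, target in targets.items():
        summed = tuple(i for i in range(table.ndim) if i not in axes)
        # Current marginal, kept broadcastable against the full table.
        current = table.sum(axis=summed, keepdims=True)
        factor = target.reshape(current.shape) / current
        table *= factor
        max_dev = max(max_dev, float(np.abs(factor - 1.0).max()))
    return table, max_dev

rng = np.random.default_rng(0)
# Hypothetical full distribution with axes (gender, education, age, children).
full = rng.uniform(1.0, 8.0, size=(2, 4, 3, 2))
targets = {
    (0, 1): full.sum(axis=(2, 3)),  # [gender, education]
    (1, 2): full.sum(axis=(0, 3)),  # [education, age]
    (1, 3): full.sum(axis=(0, 2)),  # [education, children]
}

# Seeding with the full distribution: every scaling factor should be ~1.
same, dev_exact = ipf_pass(full, targets)

# Seeding with a uniform table: sweep once, then measure what a second sweep changes.
fitted, _ = ipf_pass(np.ones_like(full), targets)
_, dev_second = ipf_pass(fitted, targets)
print(dev_exact, dev_second)
```

If this sketch leaves a perfectly fitting seed unchanged while the ipfn call does not, the difference is more likely to lie in how the inputs are passed to the library than in the IPF procedure itself.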