Replication package for the paper "How Much is Unseen Depends Chiefly on Information About the Seen"
This repository contains the replication package for the paper "How Much is Unseen Depends Chiefly on Information About the Seen," accepted at the ICLR 2025 conference as a spotlight paper.
Check the requirements.txt file for the required packages.
The replication package is organized as follows:
.
├── readme.md # This file
├── result/ # The directory where the raw results are stored. The results are then processed to generate the figures and tables in the paper using the scripts in the `script/` directory.
├── figures # Figures in the paper
├── script/ # Scripts for generating the results in the paper
│   ├── minimal-bias/ # Experiments for the minimal bias experiment (Section 4.1)
│   │   ├── min-bias-table.ipynb # Python Jupyter notebook for generating the table for the minimum bias (Table 2)
│   │   └── min-bias-figures.ipynb # R notebook for generating the figures for the minimum bias (Figure 2)
│   └── evolutionary/ # Experiments for the evolutionary algorithm (Section 4.2)
│       ├── ga-analysis.ipynb # Python Jupyter notebook for analyzing the results of the genetic algorithm (RQ 1, 2, 4)
│       ├── larger-sample.ipynb # Python Jupyter notebook for analyzing the results of adapting the evolved estimator to a larger sample size (RQ 3)
│       └── phi-variance-calc.ipynb # Python Jupyter notebook for calculating the variance of Phis, the frequency of frequencies (RQ 3)
├── realdata/ # Real data used in the paper (RQ 5)
├── ga.py # Genetic algorithm
├── ga-runner.py # Runner for the suite of experiments for the genetic algorithm
├── ga-planner.txt # Suite of experiments for the genetic algorithm
├── larger-sample-assessment.py # Script for assessing the performance of the evolved estimator on a larger sample size
├── ga-variance-crosscheck-gen.py # Variance generation script for RQ4 of the genetic algorithm
├── help1var.py # Helper functions for the genetic algorithm
└── requirements.txt # Required packages
Due to the size of the raw results, the result/ directory is not included in the repository. The raw results can be generated by running the ga.py script. The raw results will be publicly shared upon request or after the acceptance of the paper.
The minimal bias experiment in the paper can be replicated using the min-bias-table.ipynb and min-bias-figures.ipynb notebooks in the script/minimal-bias/ directory.
The genetic algorithm can be run using the ga.py script. The command line arguments are as follows:
$ python ga.py -h
usage: ga.py [-h] [--p_est_method {none,emp,nGT}] [--max_term MAX_TERM]
[--max_iter MAX_ITER] [--seed SEED] [--rep_smp REP_SMP]
[--rep_evo REP_EVO]
S {uniform,zipf,half,zipfhalf,diri,dirihalf} n_total m_target
k_target {knowing,sampling,onlycovmat}
positional arguments:
S number of species
{uniform,zipf,half,zipfhalf,diri,dirihalf}
distribution type
n_total number of samples
m_target target m
k_target target k
{knowing,sampling,onlycovmat}
setting
options:
-h, --help show this help message and exit
--p_est_method {none,emp,nGT}
estimation method for p for evolution
--max_term MAX_TERM maximum number of terms in the formula
--max_iter MAX_ITER maximum number of iterations for the genetic algorithm
--seed SEED random seed
--rep_smp REP_SMP sampling repetitions
--rep_evo REP_EVO evoluation repititions
- `S` is the number of species in the underlying distribution.
- `distribution type` is the type of the underlying distribution. The options are:
  - `uniform`: Uniform distribution
  - `half`: Half&Half distribution
  - `zipf`: Zipf distribution with $\alpha = 1$
  - `zipfhalf`: Zipf distribution with $\alpha = 0.5$
  - `diri`: The distribution is drawn from a Dirichlet distribution with prior $r = 1$
  - `dirihalf`: The distribution is drawn from a Dirichlet distribution with prior $r = 0.5$
- `n_total` is the number of samples used to estimate the probability mass.
- `m_target` is the sample size at which the estimator is used to estimate the probability mass. Normally, this is the same as `n_total`. If `m_target` is larger than `n_total`, the evolved estimator is adjusted to the larger sample size and the mean squared error (MSE) is calculated (used for RQ3).
- `k_target` is the target frequency of the probability mass. If `k_target` is 0, the estimator estimates the missing mass.
- `setting` is the setting used for the genetic algorithm.
  - Use `sampling` to run the genetic algorithm as in the experiments in the paper; it uses `n_total` samples to estimate the probability mass.
  - The `knowing` setting uses the true probability distribution, instead of the probability distribution estimated from the samples, when computing the bias, variance, and MSE of the estimator during the genetic algorithm.
  - The `onlycovmat` setting only computes the covariance matrix of the frequencies of frequencies used in the genetic algorithm. It does not run the genetic algorithm. It is intended for parallelizing the experiments.
- `p_est_method` selects which approximated probability distribution to use for approximating the MSE. In the paper, we used `nGT`, the natural probability estimate with the Good-Turing estimator. If `setting` is `knowing`, this parameter should be `none`.
- `max_term` limits the number of terms in the estimator. If there is no limit, set this parameter to `-1`.
- `max_iter` is the maximum number of iterations for the genetic algorithm.
- `seed` is the random seed for the sampling. If `setting` is `knowing`, this parameter should be `-1`.
- `rep_smp` is the number of repetitions for the sampling.
- `rep_evo` is the number of repetitions of the genetic algorithm for each sample.
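To make the terms above concrete, the following is a minimal sketch of the frequencies of frequencies and of the classical Good-Turing estimate of the mass of species seen `k` times, where `k = 0` gives the missing mass. It is not taken from ga.py; the function names `frequencies_of_frequencies` and `good_turing_mass` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def frequencies_of_frequencies(sample):
    """Return {k: Phi_k}, where Phi_k is the number of species seen exactly k times."""
    species_counts = Counter(sample)          # species -> number of occurrences
    return Counter(species_counts.values())   # k -> number of species with count k


def good_turing_mass(sample, k=0):
    """Classical Good-Turing estimate of the total mass of species seen k times.

    For k = 0 this is the missing mass estimate Phi_1 / n.
    (Illustrative helper, not the estimator evolved by ga.py.)
    """
    n = len(sample)
    phi = frequencies_of_frequencies(sample)
    return (k + 1) * phi.get(k + 1, 0) / n


sample = ["a", "b", "a", "c", "d", "a", "b"]
phi = frequencies_of_frequencies(sample)
print(dict(phi))                       # species counts are a:3, b:2, c:1, d:1
print(good_turing_mass(sample, k=0))   # Phi_1 / n
```

In this toy sample two species are seen once, so the missing mass estimate is 2/7.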
Note that there are several hard-coded parameters in the script. For example,
- the result directory (default: `{root_dir}/result/`)
- the frequency of the probability mass (default: `0`, which means the missing mass)
- the increment of the generation limit (default: `100`)
- the maximum number of generations (default: `2000`)
- the selection pool size (default: `100`)
For example, the following command runs the genetic algorithm to search for a missing mass estimator with at most 20 terms, for the Zipf distribution with `S` = 200 species and `n_total` = 200 samples:
$ python ga.py 200 zipf 200 200 0 sampling --p_est_method nGT --seed 0 --rep_smp 1 --rep_evo 1 --max_term 20
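As an illustration of the quantity being estimated in this example, the sketch below draws 200 samples from a Zipf distribution over 200 species and computes the true missing mass, i.e. the total probability of the species that were never observed. It assumes the common form $p_i \propto i^{-\alpha}$ for the Zipf distribution; ga.py may construct the distribution differently, and the helper names are hypothetical.

```python
import random


def zipf_pmf(S, alpha=1.0):
    """Probabilities p_i proportional to i**(-alpha) for i = 1..S (assumed form)."""
    weights = [i ** -alpha for i in range(1, S + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def true_missing_mass(pmf, sample):
    """Total probability of the species that never appear in the sample."""
    seen = set(sample)
    return sum(p for species, p in enumerate(pmf) if species not in seen)


rng = random.Random(0)                      # fixed seed for reproducibility
pmf = zipf_pmf(S=200, alpha=1.0)
sample = rng.choices(range(200), weights=pmf, k=200)   # 200 draws, as n_total = 200
missing = true_missing_mass(pmf, sample)
print(f"observed species: {len(set(sample))}, true missing mass: {missing:.4f}")
```

The true missing mass is only available because the distribution is known here; an estimator has to approximate it from the sample alone.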
The ga-runner.py script runs the suite of experiments defined in the ga-planner.txt file. Each experiment in that file is a space-separated list of the parameters for the genetic algorithm, in the order S, dist, n, m, k, setting, p_est_method, start_seed, rep_smp, rep_evo, max_term, max_iter. For instance, an experiment on the uniform distribution can be written as follows:
100 uniform 100 100 0 sampling nGT 0 30 1 20 2000
Any line starting with # is ignored.
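The planner format can be read with a few lines of Python. The sketch below is not the code of ga-runner.py; it only shows one plausible way to turn a planner line into a ga.py command line using the flags listed in the help output. Two points are assumptions: that `start_seed` is passed as `--seed` (the actual runner may derive several seeds from it), and that blank lines are skipped.

```python
import shlex

# Field order as listed in this README for ga-planner.txt.
FIELDS = ["S", "dist", "n", "m", "k", "setting", "p_est_method",
          "start_seed", "rep_smp", "rep_evo", "max_term", "max_iter"]


def planner_line_to_command(line):
    """Turn one planner line into a ga.py argument list, or None if it is skipped."""
    line = line.strip()
    if not line or line.startswith("#"):   # comments; skipping blanks is an assumption
        return None
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    f = dict(zip(FIELDS, values))
    return [
        "python", "ga.py",
        f["S"], f["dist"], f["n"], f["m"], f["k"], f["setting"],
        "--p_est_method", f["p_est_method"],
        "--seed", f["start_seed"],          # assumption: start_seed maps to --seed
        "--rep_smp", f["rep_smp"],
        "--rep_evo", f["rep_evo"],
        "--max_term", f["max_term"],
        "--max_iter", f["max_iter"],
    ]


plan = [
    "# uniform distribution, missing mass",
    "100 uniform 100 100 0 sampling nGT 0 30 1 20 2000",
]
commands = [c for c in map(planner_line_to_command, plan) if c is not None]
for command in commands:
    print(shlex.join(command))
```

Building an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.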
The larger-sample-assessment.py script assesses the performance of the evolved estimator on a larger sample size. It generates the assess-incn.csv file, which contains the MSE of the evolved estimator on a larger sample size. The generated files are used in the larger-sample.ipynb notebook.
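The idea of measuring the MSE of a missing mass estimator at different sample sizes can be sketched with a small Monte Carlo simulation. The code below is an illustration only: it uses the classical Good-Turing estimator as a stand-in for an evolved estimator, does not reproduce how larger-sample-assessment.py adjusts an estimator to a larger sample size, and does not write the assess-incn.csv file. All function names are hypothetical.

```python
import random


def good_turing_missing_mass(sample):
    """Stand-in estimator: number of species seen exactly once, divided by n."""
    counts = {}
    for species in sample:
        counts[species] = counts.get(species, 0) + 1
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def monte_carlo_mse(estimator, pmf, n, repetitions, seed=0):
    """Average squared error between the estimate and the true missing mass."""
    rng = random.Random(seed)
    species = range(len(pmf))
    squared_errors = []
    for _ in range(repetitions):
        sample = rng.choices(species, weights=pmf, k=n)
        seen = set(sample)
        truth = sum(p for s, p in enumerate(pmf) if s not in seen)
        squared_errors.append((estimator(sample) - truth) ** 2)
    return sum(squared_errors) / repetitions


uniform = [1 / 100] * 100                  # uniform distribution over 100 species
results = {}
for n in (100, 200):                       # original and larger sample size
    results[n] = monte_carlo_mse(good_turing_missing_mass, uniform, n, repetitions=200)
    print(f"n = {n}: estimated MSE = {results[n]:.6f}")
```

The number of repetitions controls the noise of the MSE estimate; 200 is an arbitrary choice for a quick run.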
The ga-variance-crosscheck-gen.py script generates the variance of estimators that were evolved for one probability distribution when they are applied to another probability distribution. This script is used for RQ4 of the genetic algorithm. Its result is used in the ga-analysis.ipynb notebook.
The ga-analysis.ipynb notebook analyzes the results of the genetic algorithm. It generates the figures in the paper for RQ1, RQ2, and RQ4. The larger-sample.ipynb notebook analyzes the results of adapting the evolved estimator to a larger sample size. The phi-variance-calc.ipynb notebook calculates the variance of Phis, the frequency of frequencies, for RQ3.