Replication package for the paper "How Much is Unseen Depends Chiefly on Information About the Seen"

This repository contains the replication package for the paper "How Much is Unseen Depends Chiefly on Information About the Seen," accepted at the ICLR 2025 conference as a spotlight paper.

Requirements

Check the requirements.txt file for the required packages.

Structure

The replication package is organized as follows:

.
├── readme.md                       # This file
├── result/                         # The directory where the raw results are stored. The results are then processed to generate the figures and tables in the paper using the scripts in the `script/` directory.
├── figures/                        # Figures in the paper
├── script/                         # Scripts for generating the results in the paper
│   ├── minimal-bias/                   # Experiments for the minimal bias experiment (Section 4.1)
│   │   ├── min-bias-table.ipynb            # Python Jupyter notebook for generating the table for the minimum bias (Table 2)
│   │   └── min-bias-figures.ipynb          # R notebook for generating the figures for the minimum bias (Figure 2)
│   └── evolutionary/                   # Experiments for the evolutionary algorithm (Section 4.2)
│       ├── ga-analysis.ipynb               # Python Jupyter notebook for analyzing the results of the genetic algorithm (RQ 1, 2, 4)
│       ├── larger-sample.ipynb             # Python Jupyter notebook for analyzing the results of adapting the evolved estimator to a larger sample size (RQ 3)
│       └── phi-variance-calc.ipynb         # Python Jupyter notebook for calculating the variance of Phis, the frequency of frequencies (RQ 3)
├── realdata/                       # Real data used in the paper (RQ 5)
├── ga.py                           # Genetic algorithm
├── ga-runner.py                    # Runner for the suite of experiments for the genetic algorithm
├── ga-planner.txt                  # Suite of experiments for the genetic algorithm
├── larger-sample-assessment.py     # Script for assessing the performance of the evolved estimator on a larger sample size
├── ga-variance-crosscheck-gen.py   # Variance generation script for RQ4 of the genetic algorithm
├── helpf1var.py                    # Helper functions for the genetic algorithm
└── requirements.txt                # Required packages

Due to the size of the raw results, the result/ directory is not included in the repository. The raw results can be generated by running the ga.py script. The raw results will be shared publicly upon request or after the acceptance of the paper.

Minimal Bias Experiment

The minimal bias experiment in the paper can be replicated using the min-bias-table.ipynb and min-bias-figures.ipynb notebooks in the script/minimal-bias/ directory.

The Genetic Algorithm for the Minimal MSE Experiment

`ga.py`

The genetic algorithm can be run using the ga.py script. The command line arguments are as follows:

$ python ga.py -h
usage: ga.py [-h] [--p_est_method {none,emp,nGT}] [--max_term MAX_TERM]
             [--max_iter MAX_ITER] [--seed SEED] [--rep_smp REP_SMP]
             [--rep_evo REP_EVO]
             S {uniform,zipf,half,zipfhalf,diri,dirihalf} n_total m_target
             k_target {knowing,sampling,onlycovmat}

positional arguments:
  S                     number of species
  {uniform,zipf,half,zipfhalf,diri,dirihalf}
                        distribution type
  n_total               number of samples
  m_target              target m
  k_target              target k
  {knowing,sampling,onlycovmat}
                        setting

options:
  -h, --help            show this help message and exit
  --p_est_method {none,emp,nGT}
                        estimation method for p for evolution
  --max_term MAX_TERM   maximum number of terms in the formula
  --max_iter MAX_ITER   maximum number of iterations for the genetic algorithm
  --seed SEED           random seed
  --rep_smp REP_SMP     sampling repetitions
  --rep_evo REP_EVO     evoluation repititions

S is the number of species in the underlying distribution.
distribution type is the type of the underlying distribution. The options are:
- uniform: Uniform distribution
- half: Half&Half distribution
- zipf: Zipf distribution with $\alpha = 1$
- zipfhalf: Zipf distribution with $\alpha = 0.5$
- diri: The distribution is drawn from a Dirichlet distribution with prior $r = 1$
- dirihalf: The distribution is drawn from a Dirichlet distribution with prior $r = 0.5$
n_total is the number of samples used to estimate the probability mass.
m_target is the sample size at which the estimator is applied to estimate the probability mass. Normally, this is the same as n_total. If m_target is larger than n_total, the evolved estimator is adjusted to the larger sample size and the mean squared error (MSE) is calculated (used for RQ3).
k_target is the target frequency of the probability mass. If k_target is 0, the estimator is estimating the missing mass.
setting is the setting used for the genetic algorithm.
- One can use sampling for running the genetic algorithm to replicate the experiment in the paper; it will use n_total samples to estimate the probability mass.
- The knowing setting uses the true probability distribution, instead of the estimated probability distribution from the samples, when computing the bias, variance, and MSE of the estimator during the genetic algorithm.
- The onlycovmat setting only computes the covariance matrix of the frequencies of frequencies used in the genetic algorithm. It does not run the genetic algorithm. It is used for the purpose of parallelizing the experiments.
p_est_method selects which approximate probability distribution is used to approximate the MSE. In the paper, we used nGT, the natural probability estimate with the Good-Turing estimator. If setting is knowing, this parameter should be none.
max_term limits the number of terms in the estimator. If there is no limit, set this parameter to -1.
max_iter is the maximum number of iterations for the genetic algorithm.
seed is the random seed for the sampling. If setting is knowing, this parameter should be -1.
rep_smp is the number of repetitions for the sampling.
rep_evo is the number of repetitions of the genetic algorithm per sample.
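To make the terminology above concrete, the following is a minimal sketch, not the code in ga.py, of how some of the listed distributions, the frequencies of frequencies, and the classical Good-Turing estimate could be computed with the Python standard library. The function names are hypothetical. The sketch assumes that "prior $r$" means a symmetric Dirichlet distribution with concentration $r$, and it omits the Half&Half options because their exact definition is not given here.

```python
import random
from collections import Counter


def make_distribution(kind, S, rng):
    """Return a probability vector over S species (illustrative sketch only)."""
    if kind == "uniform":
        weights = [1.0] * S
    elif kind == "zipf":        # Zipf with alpha = 1: p_i proportional to 1 / i
        weights = [1.0 / i for i in range(1, S + 1)]
    elif kind == "zipfhalf":    # Zipf with alpha = 0.5: p_i proportional to 1 / sqrt(i)
        weights = [1.0 / i ** 0.5 for i in range(1, S + 1)]
    elif kind == "diri":        # assumed: symmetric Dirichlet, r = 1 (normalized Gamma draws)
        weights = [rng.gammavariate(1.0, 1.0) for _ in range(S)]
    elif kind == "dirihalf":    # assumed: symmetric Dirichlet, r = 0.5
        weights = [rng.gammavariate(0.5, 1.0) for _ in range(S)]
    else:
        raise ValueError(f"distribution not covered by this sketch: {kind}")
    total = sum(weights)
    return [w / total for w in weights]


def frequencies_of_frequencies(sample):
    """Map k -> Phi_k, the number of species observed exactly k times."""
    counts = Counter(sample)          # species -> number of occurrences
    return Counter(counts.values())   # k -> number of species with that count


def good_turing_mass(phi, n, k):
    """Classical Good-Turing estimate of the total mass of species seen k times."""
    return (k + 1) * phi.get(k + 1, 0) / n


rng = random.Random(0)
S, n = 200, 200
p = make_distribution("zipf", S, rng)
sample = rng.choices(range(S), weights=p, k=n)
phi = frequencies_of_frequencies(sample)
# k = 0 corresponds to the missing mass, estimated by Phi_1 / n.
print("Good-Turing missing mass estimate:", good_turing_mass(phi, n, 0))
```

With k = 0 the estimate reduces to the number of singletons divided by n, which matches the convention above that k_target = 0 targets the missing mass.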

Note that there are several hard-coded parameters in the script. For example:

- the result directory (default: {root_dir}/result/)
- the frequency of the probability mass (default: 0, which means the missing mass)
- the increment of the generation limit (default: 100)
- the maximum number of generations (default: 2000)
- the selection pool size (default: 100)

For example, the following command runs the genetic algorithm to search for a missing mass estimator with at most 20 terms, for the Zipf distribution with $\alpha = 1$ and support size 200, using 200 samples:

$ python ga.py 200 zipf 200 200 0 sampling --p_est_method nGT --seed 0 --rep_smp 1 --rep_evo 1 --max_term 20
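As an illustration of the quantity such a search targets, the sketch below approximates the bias, variance, and MSE of a missing mass estimator by simulation when the true distribution is known. It is not the computation performed by ga.py: it uses the classical Good-Turing estimator (singletons divided by n) as a stand-in for an evolved estimator, and the function name and repetition count are hypothetical.

```python
import random
from collections import Counter


def missing_mass_error_stats(p, n, reps, rng):
    """Monte Carlo bias, variance, and MSE of the Good-Turing missing mass estimate."""
    S = len(p)
    errors = []
    for _ in range(reps):
        sample = rng.choices(range(S), weights=p, k=n)
        counts = Counter(sample)
        # True missing mass: total probability of the species absent from the sample.
        true_missing = sum(p[i] for i in range(S) if i not in counts)
        # Good-Turing estimate: number of species seen exactly once, divided by n.
        singletons = sum(1 for c in counts.values() if c == 1)
        errors.append(singletons / n - true_missing)
    bias = sum(errors) / reps
    mse = sum(e * e for e in errors) / reps
    variance = mse - bias ** 2   # variance of the estimation error
    return bias, variance, mse


# Zipf distribution with alpha = 1 over 200 species, 200 samples per repetition.
weights = [1.0 / i for i in range(1, 201)]
total = sum(weights)
p = [w / total for w in weights]

bias, variance, mse = missing_mass_error_stats(p, n=200, reps=500, rng=random.Random(0))
print(f"bias={bias:.5f} variance={variance:.5f} mse={mse:.5f}")
```

The returned values satisfy MSE = bias² + variance by construction, where the variance is that of the estimation error across repetitions.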

`ga-runner.py` and `ga-planner.txt`

The ga-runner.py script runs the suite of experiments defined in the ga-planner.txt file. Each line of ga-planner.txt is a space-separated list of the parameters for one experiment, in the order S, dist, n, m, k, setting, p_est_method, start_seed, rep_smp, rep_evo, max_term, max_iter. For instance, the following line uses the sampling setting for the uniform distribution with 100 species and 100 samples, targets the missing mass (k = 0) with the nGT estimate, starts from seed 0, and uses 30 sampling repetitions, 1 evolution repetition per sample, at most 20 terms, and at most 2000 iterations:

100 uniform 100 100 0 sampling nGT 0 30 1 20 2000

Any line starting with # is ignored.
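The planner format described above can be read with a few lines of Python. The sketch below is not the code in ga-runner.py; the function names are hypothetical, skipping blank lines is an assumption, and so is the mapping of start_seed to the --seed option of ga.py.

```python
import shlex

# Field order as documented for ga-planner.txt.
FIELDS = [
    "S", "dist", "n", "m", "k", "setting",
    "p_est_method", "start_seed", "rep_smp", "rep_evo", "max_term", "max_iter",
]


def parse_planner(text):
    """Parse planner text into a list of dicts, one per experiment line."""
    experiments = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # comment lines (and, by assumption, blank lines)
            continue
        values = line.split()
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
        experiments.append(dict(zip(FIELDS, values)))
    return experiments


def to_command(exp):
    """Build a ga.py argument list for one experiment (option mapping is assumed)."""
    return [
        "python", "ga.py",
        exp["S"], exp["dist"], exp["n"], exp["m"], exp["k"], exp["setting"],
        "--p_est_method", exp["p_est_method"],
        "--seed", exp["start_seed"],
        "--rep_smp", exp["rep_smp"],
        "--rep_evo", exp["rep_evo"],
        "--max_term", exp["max_term"],
        "--max_iter", exp["max_iter"],
    ]


planner_text = """\
# S dist n m k setting p_est_method start_seed rep_smp rep_evo max_term max_iter
100 uniform 100 100 0 sampling nGT 0 30 1 20 2000
"""

for exp in parse_planner(planner_text):
    print(shlex.join(to_command(exp)))
```

The sketch only prints the commands; passing each list to subprocess.run would execute them.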

`larger-sample-assessment.py`

The larger-sample-assessment.py script assesses the performance of the evolved estimator on a larger sample size. It generates the assess-incn.csv file, which contains the MSE of the evolved estimator on a larger sample size. The generated files are used in the larger-sample.ipynb notebook.

`ga-variance-crosscheck-gen.py`

The ga-variance-crosscheck-gen.py script generates the variance of estimators evolved for one probability distribution when they are applied to another probability distribution. This script is used for RQ4 of the genetic algorithm. Its result is used in the ga-analysis.ipynb notebook.

`ga-analysis.ipynb`, `larger-sample.ipynb`, and `phi-variance-calc.ipynb`

The ga-analysis.ipynb notebook analyzes the results of the genetic algorithm. It generates the figures in the paper for RQ1, RQ2, and RQ4. The larger-sample.ipynb notebook analyzes the results of adapting the evolved estimator to a larger sample size. The phi-variance-calc.ipynb notebook calculates the variance of Phis, the frequencies of frequencies, for RQ3.
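For readers unfamiliar with the variance of the frequencies of frequencies, the sketch below estimates the mean and variance of $\Phi_k$ by repeated sampling. It is only an illustration under simple assumptions and may differ from how phi-variance-calc.ipynb computes the variance; the function names and the repetition count are hypothetical.

```python
import random
import statistics
from collections import Counter


def phi_k(sample, k_target):
    """Number of species that occur exactly k_target times in the sample."""
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c == k_target)


def phi_mean_and_variance(p, n, k_target, reps, rng):
    """Empirical mean and sample variance of Phi_k over reps independent samples."""
    species = range(len(p))
    values = [
        phi_k(rng.choices(species, weights=p, k=n), k_target)
        for _ in range(reps)
    ]
    return statistics.mean(values), statistics.variance(values)


# Uniform distribution over 100 species, 100 samples, singletons (k = 1).
p = [1.0 / 100] * 100
mean_phi1, var_phi1 = phi_mean_and_variance(p, n=100, k_target=1, reps=300, rng=random.Random(0))
print(f"mean Phi_1 = {mean_phi1:.2f}, variance of Phi_1 = {var_phi1:.2f}")
```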
