This repository contains the replication package for the paper "How Much is Unseen Depends Chiefly on Information About the Seen," accepted at the ICLR 2025 conference as a spotlight paper.

niMgnoeSeeL/UnseenGA

Replication package for the paper "How Much is Unseen Depends Chiefly on Information About the Seen"

This repository contains the replication package for the paper "How Much is Unseen Depends Chiefly on Information About the Seen," accepted at the ICLR 2025 conference as a spotlight paper.

Requirements

Check the requirements.txt file for the required packages.

Structure

The replication package is organized as follows:

.
├── readme.md                       # This file
├── result/                         # The directory where the raw results are stored. The results are then processed to generate the figures and tables in the paper using the scripts in the `script/` directory.
├── figures                         # Figures in the paper
├── script/                         # Scripts for generating the results in the paper
│   ├── minimal-bias/                   # Experiments for the minimal bias experiment (Section 4.1)
│   │   ├── min-bias-table.ipynb            # Python Jupyter notebook for generating the table for the minimum bias (Table 2)
│   │   └── min-bias-figures.ipynb          # R notebook for generating the figures for the minimum bias (Figure 2)
│   └── evolutionary/                   # Experiments for the evolutionary algorithm (Section 4.2)
│       ├── ga-analysis.ipynb               # Python Jupyter notebook for analyzing the results of the genetic algorithm (RQ 1, 2, 4)
│       ├── larger-sample.ipynb             # Python Jupyter notebook for analyzing the results of adapting the evolved estimator to a larger sample size (RQ 3)
│       └── phi-variance-calc.ipynb         # Python Jupyter notebook for calculating the variance of Phis, the frequency of frequencies (RQ 3)
├── realdata/                       # Real data used in the paper (RQ 5)
├── ga.py                           # Genetic algorithm
├── ga-runner.py                    # Runner for the suite of experiments for the genetic algorithm
├── ga-planner.txt                  # Suite of experiments for the genetic algorithm
├── larger-sample-assessment.py     # Script for assessing the performance of the evolved estimator on a larger sample size
├── ga-variance-crosscheck-gen.py   # Variance generation script for RQ4 of the genetic algorithm
├── help1var.py                     # Helper functions for the genetic algorithm
└── requirements.txt                # Required packages

Due to the size of the raw results, the result/ directory is not included in the repository. The raw results can be generated by running the ga.py script. They will be shared publicly upon request or after the acceptance of the paper.

Minimal Bias Experiment

The minimal bias experiment in the paper can be replicated using the min-bias-table.ipynb and min-bias-figures.ipynb notebooks in the script/minimal-bias/ directory.

The Genetic Algorithm for the Minimal MSE Experiment

ga.py

The genetic algorithm can be run using the ga.py script. The command line arguments are as follows:

$ python ga.py -h
usage: ga.py [-h] [--p_est_method {none,emp,nGT}] [--max_term MAX_TERM]
             [--max_iter MAX_ITER] [--seed SEED] [--rep_smp REP_SMP]
             [--rep_evo REP_EVO]
             S {uniform,zipf,half,zipfhalf,diri,dirihalf} n_total m_target
             k_target {knowing,sampling,onlycovmat}

positional arguments:
  S                     number of species
  {uniform,zipf,half,zipfhalf,diri,dirihalf}
                        distribution type
  n_total               number of samples
  m_target              target m
  k_target              target k
  {knowing,sampling,onlycovmat}
                        setting

options:
  -h, --help            show this help message and exit
  --p_est_method {none,emp,nGT}
                        estimation method for p for evolution
  --max_term MAX_TERM   maximum number of terms in the formula
  --max_iter MAX_ITER   maximum number of iterations for the genetic algorithm
  --seed SEED           random seed
  --rep_smp REP_SMP     sampling repetitions
  --rep_evo REP_EVO     evoluation repititions
  • S is the number of species in the underlying distribution.
  • distribution type is the type of distribution used for the underlying distribution. The options are
    • uniform: Uniform distribution
    • half: Half&Half distribution
    • zipf: Zipf distribution with $\alpha = 1$
    • zipfhalf: Zipf distribution with $\alpha = 0.5$
    • diri: The distribution is drawn from a Dirichlet distribution with prior $r = 1$
    • dirihalf: The distribution is drawn from a Dirichlet distribution with prior $r = 0.5$
  • n_total is the number of samples used to estimate the probability mass.
  • m_target is the sample size at which the estimator is used to estimate the probability mass. Normally, this is the same as n_total. If m_target is larger than n_total, the evolved estimator is adjusted to the larger sample size and the mean squared error (MSE) is calculated (used for RQ3).
  • k_target is the target frequency of the probability mass. If k_target is 0, the estimator is estimating the missing mass.
  • setting is the setting used for the genetic algorithm.
    • One can use sampling for running the genetic algorithm to replicate the experiment in the paper; it will use n_total samples to estimate the probability mass.
    • The knowing setting uses the true probability distribution, instead of the estimated probability distribution from the samples, when computing the bias, variance, and MSE of the estimator during the genetic algorithm.
    • The onlycovmat setting only computes the covariance matrix of the frequencies of frequencies used in the genetic algorithm. It does not run the genetic algorithm. It is used for the purpose of parallelizing the experiments.
  • p_est_method selects which approximated probability distribution is used to approximate the MSE. In the paper, we used nGT, the natural probability estimate with the Good-Turing estimator. If setting is knowing, this parameter should be none.
  • max_term limits the number of terms in the estimator. If there is no limit, set this parameter to -1.
  • max_iter is the maximum number of iterations for the genetic algorithm.
  • seed is the random seed for the sampling. If setting is knowing, this parameter should be -1.
  • rep_smp is the number of repetitions for the sampling.
  • rep_evo is the number of repetitions of the genetic algorithm per sample.
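
To make the roles of the distribution type, n_total, and k_target concrete, the following is a minimal sketch that is independent of ga.py. It builds a uniform or Zipf distribution, draws a sample, and computes the true probability mass of the species seen exactly k times, together with the Good-Turing estimate of the missing mass. The helper names (`make_distribution`, `probability_mass`, `good_turing_missing_mass`) are hypothetical and do not come from the repository; the Half&Half and Dirichlet options are left out because their exact construction is not described here.

```python
import random
from collections import Counter


def make_distribution(kind, S, alpha=1.0):
    """Return a list of S probabilities (hypothetical helper, not from ga.py)."""
    if kind == "uniform":
        weights = [1.0] * S
    elif kind == "zipf":
        # Zipf weights proportional to rank^(-alpha), ranks starting at 1.
        weights = [(i + 1) ** -alpha for i in range(S)]
    else:
        raise ValueError(f"unsupported kind in this sketch: {kind}")
    total = sum(weights)
    return [w / total for w in weights]


def probability_mass(p, sample, k):
    """Total probability of the species observed exactly k times in the sample.

    With k = 0 this is the missing mass.
    """
    counts = Counter(sample)
    return sum(p_x for x, p_x in enumerate(p) if counts.get(x, 0) == k)


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass: (number of singletons) / n."""
    counts = Counter(sample)
    phi_1 = sum(1 for c in counts.values() if c == 1)
    return phi_1 / len(sample)


# Assumed setup mirroring S = 200, Zipf with alpha = 1, n_total = 200.
rng = random.Random(0)
p = make_distribution("zipf", S=200, alpha=1.0)
sample = rng.choices(range(len(p)), weights=p, k=200)
true_m0 = probability_mass(p, sample, k=0)
gt_m0 = good_turing_missing_mass(sample)
print(f"true missing mass: {true_m0:.4f}, Good-Turing estimate: {gt_m0:.4f}")
```

The evolved estimators in the paper are searched for by the genetic algorithm; Good-Turing appears here only as a familiar baseline for comparison.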

Note that there are several hard-coded parameters in the script. For example,

  • the result directory (default: {root_dir}/result/)
  • the frequency of the probability mass (default: 0, which means the missing mass)
  • the increment of the generation limit (default: 100)
  • the maximum number of generations (default: 2000)
  • the selection pool size (default: 100)

For example, the following command runs the genetic algorithm to search for a missing mass estimator with at most 20 terms, for the Zipf distribution with $\alpha = 1$ and support size 200, using 200 samples:

$ python ga.py 200 zipf 200 200 0 sampling --p_est_method nGT --seed 0 --rep_smp 1 --rep_evo 1 --max_term 20
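
As a reading aid for the command above, here is a sketch of an `argparse` parser that would accept the same arguments as the help text shown earlier. It is a reconstruction from that help output, not the parser in ga.py: the argument types, the destination name `dist`, and the use of `None` as the default for every option are assumptions, since the help text does not state them.

```python
import argparse

DISTRIBUTIONS = ["uniform", "zipf", "half", "zipfhalf", "diri", "dirihalf"]
SETTINGS = ["knowing", "sampling", "onlycovmat"]


def build_parser():
    """Build a parser mirroring the documented ga.py interface (a sketch)."""
    parser = argparse.ArgumentParser(prog="ga.py")
    # Positional arguments, in the documented order.
    parser.add_argument("S", type=int, help="number of species")
    parser.add_argument("dist", choices=DISTRIBUTIONS, help="distribution type")
    parser.add_argument("n_total", type=int, help="number of samples")
    parser.add_argument("m_target", type=int, help="target m")
    parser.add_argument("k_target", type=int, help="target k")
    parser.add_argument("setting", choices=SETTINGS, help="setting")
    # Options; defaults are unknown from the help text, so None is used here.
    parser.add_argument("--p_est_method", choices=["none", "emp", "nGT"], default=None)
    parser.add_argument("--max_term", type=int, default=None)
    parser.add_argument("--max_iter", type=int, default=None)
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--rep_smp", type=int, default=None)
    parser.add_argument("--rep_evo", type=int, default=None)
    return parser


example = (
    "200 zipf 200 200 0 sampling "
    "--p_est_method nGT --seed 0 --rep_smp 1 --rep_evo 1 --max_term 20"
)
args = build_parser().parse_args(example.split())
print(args.S, args.dist, args.n_total, args.m_target, args.k_target, args.setting)
```

Read this way, the example sets S = 200, n_total = m_target = 200, and k_target = 0 (the missing mass), and it leaves max_iter unspecified.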

ga-runner.py and ga-planner.txt

The ga-runner.py script runs the suite of experiments defined in the ga-planner.txt file. Each line of ga-planner.txt is a space-separated list of the parameters for one run of the genetic algorithm, in the order S, dist, n, m, k, setting, p_est_method, start_seed, rep_smp, rep_evo, max_term, max_iter. For instance, an experiment on the uniform distribution with support size 100 using 100 samples can be written as follows:

100 uniform 100 100 0 sampling nGT 0 30 1 20 2000

Any line starting with # is ignored.
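
The planner format can be illustrated with a short sketch that reads planner lines and turns each one into a ga.py command line. This is not the code of ga-runner.py. The function names are hypothetical, and two details are assumptions: that `start_seed` is passed to ga.py as `--seed`, and that blank lines are skipped in addition to `#` lines.

```python
FIELDS = [
    "S", "dist", "n", "m", "k", "setting",
    "p_est_method", "start_seed", "rep_smp", "rep_evo", "max_term", "max_iter",
]


def parse_planner(text):
    """Parse planner text into a list of dicts, one per experiment line."""
    experiments = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip comment lines; skipping blank lines is an assumption.
        if not stripped or stripped.startswith("#"):
            continue
        values = stripped.split()
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
        experiments.append(dict(zip(FIELDS, values)))
    return experiments


def to_command(exp):
    """Build a ga.py argument list from one parsed experiment (assumed mapping)."""
    return [
        "python", "ga.py",
        exp["S"], exp["dist"], exp["n"], exp["m"], exp["k"], exp["setting"],
        "--p_est_method", exp["p_est_method"],
        "--seed", exp["start_seed"],  # assumption: start_seed maps to --seed
        "--rep_smp", exp["rep_smp"],
        "--rep_evo", exp["rep_evo"],
        "--max_term", exp["max_term"],
        "--max_iter", exp["max_iter"],
    ]


planner_text = """\
# S dist n m k setting p_est_method start_seed rep_smp rep_evo max_term max_iter
100 uniform 100 100 0 sampling nGT 0 30 1 20 2000
"""
experiments = parse_planner(planner_text)
commands = [to_command(exp) for exp in experiments]
for command in commands:
    print(" ".join(command))
```

A runner could pass each list to `subprocess.run`; how ga-runner.py actually schedules or parallelizes the runs is not described here.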

larger-sample-assessment.py

The larger-sample-assessment.py script assesses the performance of the evolved estimator on a larger sample size. It generates the assess-incn.csv file, which contains the MSE of the evolved estimator on a larger sample size. The generated files are used in the larger-sample.ipynb notebook.

ga-variance-crosscheck-gen.py

The ga-variance-crosscheck-gen.py script computes the variance of estimators that were evolved on one probability distribution when they are applied to another probability distribution. This script is used for RQ4 of the genetic algorithm. Its result is used in the ga-analysis.ipynb notebook.

ga-analysis.ipynb, larger-sample.ipynb, and phi-variance-calc.ipynb

The ga-analysis.ipynb notebook analyzes the results of the genetic algorithm and generates the figures in the paper for RQ1, RQ2, and RQ4. The larger-sample.ipynb notebook analyzes the results of adapting the evolved estimator to a larger sample size. The phi-variance-calc.ipynb notebook calculates the variance of the Phis, the frequencies of frequencies, for RQ3.
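
For readers unfamiliar with the frequencies of frequencies, the sketch below computes Phi_k (the number of species observed exactly k times) from a sample and estimates the variance of each Phi_k by repeated sampling. It is a Monte Carlo illustration under assumed parameters, not the calculation in phi-variance-calc.ipynb, which may well be analytical; the function names are hypothetical.

```python
import random
from collections import Counter
from statistics import variance


def frequency_of_frequencies(sample, max_k):
    """Return [Phi_1, ..., Phi_max_k], where Phi_k counts species seen exactly k times."""
    species_counts = Counter(sample)
    phi = Counter(species_counts.values())
    return [phi.get(k, 0) for k in range(1, max_k + 1)]


def empirical_phi_variance(p, n, max_k, reps, seed=0):
    """Estimate Var[Phi_k] for k = 1..max_k from `reps` independent samples of size n."""
    if reps < 2:
        raise ValueError("at least two repetitions are needed for a sample variance")
    rng = random.Random(seed)
    species = range(len(p))
    draws = [
        frequency_of_frequencies(rng.choices(species, weights=p, k=n), max_k)
        for _ in range(reps)
    ]
    # zip(*draws) groups the repeated observations of each Phi_k together.
    return [variance(column) for column in zip(*draws)]


# Assumed setup: uniform distribution over 100 species, 100 samples per draw.
p_uniform = [1.0 / 100] * 100
phi_var = empirical_phi_variance(p_uniform, n=100, max_k=5, reps=200)
print([round(v, 2) for v in phi_var])
```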
