# Additional Data for API examples
Example data comes from The Great Repertoire Project (Briney et al., 2019) as mirrored on GitHub. We will be using synth01.fasta from the “Ten batches of 100M synthetic sequences, generated with IGoR's default V(D)J recombination model” data set.
Extract the first file from the archive:

```shell
tar -xvf igor_synthetic_100M_default-model_fastas.tar.gz igor_synthetic_100M_default-model_fastas/synth01.fasta
```
To run on a smaller subset of the data (recommended), take the first N lines with `head`. Each FASTA record in this data set spans two lines (a header line plus a sequence line), so 100 million lines yield 50 million sequences:

```shell
head -n 100000000 synth01.fasta > synth01-50M.fasta
head -n 20000000 synth01.fasta > synth01-10M.fasta
```
Each FASTA file from the Great Repertoire Project contains 100 million sequences and is about 37 GB. Plan to have at least 1 TB of hard disk space available if you will also test this data using the MiAIRR TSV or JSON output options.
Memory usage was benchmarked on subsets of this data set totaling 10M, 50M, and 100M sequences. When saving to file, the analysis uses about 40-50 MB per thread, depending on chunk size; in our 224-thread tests on the 50M and 100M sequence subsets, 10-12 GB of total RAM usage was typical. However, when using the dictionary output for the API, the entire data structure is held in memory. Expect the following ranges:
- 10 million sequences: 25-50 GB
- 50 million sequences: 75-150 GB
- 100 million sequences: 175-300 GB
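For planning runs on subset sizes between the benchmarked points, a rough range can be interpolated linearly from the table above. This is a back-of-the-envelope sketch based only on the three benchmarks listed, not a guarantee; the function name is illustrative:

```python
# Benchmarked RAM ranges for dictionary (in-memory) output,
# taken from the list above: sequences -> (low GB, high GB).
BENCHMARKS = {
    10_000_000: (25, 50),
    50_000_000: (75, 150),
    100_000_000: (175, 300),
}

def estimate_ram_gb(n_sequences):
    """Linearly interpolate a (low, high) GB range between benchmarks.

    Counts outside the benchmarked range are clamped to the nearest
    benchmark rather than extrapolated.
    """
    points = sorted(BENCHMARKS.items())
    if n_sequences <= points[0][0]:
        return points[0][1]
    if n_sequences >= points[-1][0]:
        return points[-1][1]
    for (n0, (lo0, hi0)), (n1, (lo1, hi1)) in zip(points, points[1:]):
        if n_sequences <= n1:
            t = (n_sequences - n0) / (n1 - n0)
            return (lo0 + t * (lo1 - lo0), hi0 + t * (hi1 - hi0))
```

For example, a 30M-sequence subset lands halfway between the 10M and 50M benchmarks, giving an estimated range of roughly 50-100 GB.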
References:
- https://github.com/briney/grp_paper
- Briney, B., Inderbitzin, A., Joyce, C. et al. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566, 393–397 (2019). https://doi.org/10.1038/s41586-019-0879-y