# Additional Data for API examples
Example data comes from The Great Repertoire Project (Briney et al., 2019) as mirrored on GitHub. We will be using synth01.fasta from the “Ten batches of 100M synthetic sequences, generated with IGoR's default V(D)J recombination model” data set.
Extract the first file from the archive:

```shell
tar -xvf igor_synthetic_100M_default-model_fastas.tar.gz igor_synthetic_100M_default-model_fastas/synth01.fasta
```
To run on a smaller subset of the data (recommended), take the first N lines with `head`. Each FASTA record in this data set spans two lines (a header line plus a sequence line), so 100 million lines yield 50 million sequences:

```shell
head -n 100000000 synth01.fasta > synth01-50M.fasta
head -n 20000000 synth01.fasta > synth01-10M.fasta
```
Each FASTA file from the Great Repertoire Project contains 100 million sequences and is about 37 GB. Plan to have at least 1 TB of hard disk space available if you will also test this data using the MiAIRR TSV or JSON output options.
Memory usage was benchmarked on subsets of this data set totaling 10M, 50M, and 100M sequences. When saving to file, the analysis uses about 40-50 MB per thread, depending on chunk size; in our 224-thread tests on the 50M and 100M sequence subsets, 10-12 GB of total RAM usage was typical. However, when using the dictionary output for the API, the entire data structure is held in memory. Expect the following ranges:
- 10 million sequences: 25-50 GB
- 50 million sequences: 75-150 GB
- 100 million sequences: 175-300 GB
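For planning runs on subset sizes between the benchmarked points, a rough range can be interpolated linearly from the table above. This is a back-of-the-envelope sketch based only on the three benchmarks listed, not a guarantee; the function name is illustrative:

```python
# Benchmarked RAM ranges for dictionary (in-memory) output,
# taken from the list above: sequences -> (low GB, high GB).
BENCHMARKS = {
    10_000_000: (25, 50),
    50_000_000: (75, 150),
    100_000_000: (175, 300),
}

def estimate_ram_gb(n_sequences):
    """Linearly interpolate a (low, high) GB range between benchmarks.

    Counts outside the benchmarked range are clamped to the nearest
    benchmark rather than extrapolated.
    """
    points = sorted(BENCHMARKS.items())
    if n_sequences <= points[0][0]:
        return points[0][1]
    if n_sequences >= points[-1][0]:
        return points[-1][1]
    for (n0, (lo0, hi0)), (n1, (lo1, hi1)) in zip(points, points[1:]):
        if n_sequences <= n1:
            t = (n_sequences - n0) / (n1 - n0)
            return (lo0 + t * (lo1 - lo0), hi0 + t * (hi1 - hi0))
```

For example, a 30M-sequence subset lands halfway between the 10M and 50M benchmarks, giving an estimated range of roughly 50-100 GB.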
References:
- https://github.com/briney/grp_paper
- Briney, B., Inderbitzin, A., Joyce, C. et al. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566, 393–397 (2019). https://doi.org/10.1038/s41586-019-0879-y