Additional Data for API examples

Luke Myers edited this page May 27, 2020 · 2 revisions

Creating a CDR3 Length Histogram Using PyIR’s API and Matplotlib

Example data comes from The Great Repertoire Project (Briney et al., 2019) as mirrored on GitHub. We will be using synth01.fasta from the “Ten batches of 100M synthetic sequences, generated with IGoR's default V(D)J recombination model” data set.

Extract only the first file from the archive: tar -xvf igor_synthetic_100M_default-model_fastas.tar.gz igor_synthetic_100M_default-model_fastas/synth01.fasta

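To run on a smaller subset of the data (recommended), take the first lines of the file. As the file names indicate, each record spans two lines (header and sequence), so 100,000,000 lines yield 50 million sequences and 20,000,000 lines yield 10 million:

head -n 100000000 synth01.fasta > synth01-50M.fasta

head -n 20000000 synth01.fasta > synth01-10M.fasta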
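If a FASTA file wraps sequences over several lines, counting lines no longer matches counting sequences. The following is a minimal sketch of a subsetting step that counts header lines instead. The function names `take_fasta_records` and `subset_fasta` are hypothetical helpers written for this illustration; they are not part of PyIR.

```python
def take_fasta_records(lines, n_records):
    """Yield the lines belonging to the first n_records FASTA records.

    A record starts at a line beginning with '>' and runs until the next
    such line, so wrapped (multi-line) sequences are handled correctly.
    """
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n_records:
                break
        if seen:  # ignore anything that precedes the first header
            yield line


def subset_fasta(src_path, dst_path, n_records):
    """Stream the first n_records records of src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(take_fasta_records(src, n_records))
```

For example, `subset_fasta("synth01.fasta", "synth01-10M.fasta", 10_000_000)` would be expected to produce the same file as the second head command when every record is exactly two lines long. The file is streamed line by line, so memory use stays small regardless of input size.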

System Requirements

Each FASTA file from the Great Repertoire Project contains 100 million sequences, or about 37GB. Plan to have at least 1TB of hard disk space available if you will also test this data using the MiAIRR TSV or JSON output options.

Memory usage was benchmarked for subsets of this data set totaling 10M, 50M, and 100M sequences. When saving to file, the analysis uses about 40-50MB per thread, depending on chunk size; in our 224-thread tests on 50M and 100M sequences, 10-12GB of RAM usage was common. With dictionary output through the API, however, the entire data structure is held in memory. The following ranges should be considered:

Memory usage

  • 10 million: 25-50GB
  • 50 million: 75-150GB
  • 100 million: 175-300GB
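The figures above can be turned into a rough planning aid. The sketch below only restates the numbers quoted in this section; the function names are hypothetical, 1GB is taken as 1000MB, and rounding a sequence count up to the next benchmarked size is an assumption made here for a conservative estimate, not a measured result.

```python
# Per-thread memory when saving to file, in MB (range quoted above).
FILE_OUTPUT_MB_PER_THREAD = (40, 50)

# Dictionary-output ranges from the list above: sequences -> (low GB, high GB).
DICT_OUTPUT_GB = {
    10_000_000: (25, 50),
    50_000_000: (75, 150),
    100_000_000: (175, 300),
}


def file_output_memory_gb(threads):
    """Estimate (low, high) RAM in GB for file output with the given thread count."""
    low_mb, high_mb = FILE_OUTPUT_MB_PER_THREAD
    return threads * low_mb / 1000, threads * high_mb / 1000


def dict_output_memory_gb(n_sequences):
    """Return the quoted (low, high) GB range for the smallest benchmarked
    size that is at least n_sequences, or None beyond the benchmarked sizes."""
    for size in sorted(DICT_OUTPUT_GB):
        if n_sequences <= size:
            return DICT_OUTPUT_GB[size]
    return None


if __name__ == "__main__":
    print(file_output_memory_gb(224))
    print(dict_output_memory_gb(20_000_000))
```

For 224 threads the file-output estimate comes to roughly 9-11GB, in line with the 10-12GB observed in the tests described above.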

Data Availability:
