- Corrected an error in SNVBOX feature files that led to decreased performance
- Please see: https://zenodo.org/record/6016720
3Cnet was trained using the following versions of software:
- Follow the steps in this section if you prefer a ready-to-go environment.
- If you prefer to set up the environment on your own, skip directly to "Clone the 3Cnet repository."
- We recommend you have at least 40GB of free storage.
- See the NVIDIA NGC PyTorch container docs for other execution examples.
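As a quick pre-flight check for the storage recommendation above, the following sketch reports whether a location has enough free space. The helper name has_enough_free_space and the reading of 40GB as 40 x 10^9 bytes are illustrative assumptions, not part of the 3Cnet repository.

```python
import shutil


def has_enough_free_space(path=".", required_gb=40):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    Hypothetical helper (not part of 3Cnet); 1 GB is taken as 10**9 bytes.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    # Check the current directory against the recommended 40 GB.
    print("Enough free space:", has_enough_free_space("."))
```

Point path at the directory you intend to mount into the container, since that is where the downloaded data will be stored.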
Docker Engine (we use Docker 20.10.5): https://docs.docker.com/engine/install/
NVIDIA/nvidia-docker2:
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
- The Docker image for 3Cnet is based on one of NVIDIA NGC's offerings.
$ sudo docker pull 3billion/3cnet:0.0.1
$ sudo docker run --gpus all -it -v </absolute/path/to/mount>:/workspace 3billion/3cnet:0.0.1
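The docker run invocation above requires an absolute host path for the mount. The sketch below assembles the same argument list from Python and rejects relative paths before Docker is called. The function name build_docker_run_command and the example path are hypothetical; the flags and image tag are the ones shown above.

```python
import shlex
from pathlib import PurePosixPath


def build_docker_run_command(mount_path, image="3billion/3cnet:0.0.1"):
    """Build the `docker run` argument list for a given host directory.

    Hypothetical helper: mirrors the command in this README and only adds
    a check that the host path is absolute.
    """
    if not PurePosixPath(mount_path).is_absolute():
        raise ValueError(f"mount path must be absolute, got: {mount_path!r}")
    return [
        "sudo", "docker", "run",
        "--gpus", "all",                   # expose all GPUs to the container
        "-it",                             # interactive terminal
        "-v", f"{mount_path}:/workspace",  # host directory -> /workspace
        image,
    ]


if __name__ == "__main__":
    # "/home/user/3cnet_work" is a placeholder; substitute your own directory.
    print(shlex.join(build_docker_run_command("/home/user/3cnet_work")))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems when the mount path contains spaces.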
$ cd workspace
$ git clone https://github.com/KyoungYeulLee/3Cnet.git
STEP 3: Run download_data.py to retrieve necessary files from Zenodo
$ cd 3Cnet
$ python download_data.py
$ tar -xvf data.tar.gz
- To train 3Cnet
$ cd ./src/model
$ python train_model.py
- To evaluate 3Cnet performance
$ python test_model.py
- To re-create the training/evaluation datasets
$ cd ../featurize
$ python build_dataset.py
Underlined files are the top-level files or scripts intended to be directly modified or executed by the user.
- download_data.py: Retrieves the data/ directory from Zenodo.
- 3cnet.yaml: Anaconda-compatible environment yaml. (deprecated, also contains dependencies not directly used by 3cnet)
- build_dataset.py: Run this to parse and process raw data into PyTorch-compatible inputs.
- config.yaml: A file specifying paths used by scripts in src/featurize
- featurize_data.py: A dependency used by build_dataset.py that converts HGVSp nomenclature to amino acid sequences.
- get_variants.py: Used to map sequences in /data/variant_data back to HGVSp identifiers.
- merge_data.py: Script that bundles the two different sequence data with SNVBox features. (generates *_dataset.bin, *_mut.npy, and *_snvbox.npy files)
- config.yaml: A file specifying paths and hyperparameters used for model training or evaluation. Alter this to modify file paths and/or settings.
- deep_utilities.py: A generic collection of utility functions and classes used broadly by other files in src/model.
- LSTM_datasets.py: Dataset class definition for 3Cnet.
- LSTM_models.py: A wrapper class for 3Cnet defining low-level training routines.
- LSTM_networks.py: The 3Cnet architecture is defined here (nn.Module).
- evaluate_metrics.py: Definition of metrics used for model evaluation.
- train_model.py: Top-level script for 3Cnet training. Outcomes are saved in data/model/(MODEL_TYPE)/(MODEL_NAME).
- test_model.py: Evaluate model using values under the VALID key in config.yaml. The MODEL_NUM key in config.yaml represents the epoch # to load and must be defined for this script to run as intended.
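Because test_model.py depends on the VALID and MODEL_NUM entries of config.yaml, it can help to validate the parsed configuration before launching an evaluation. The sketch below works on a plain dict (for example, the result of parsing config.yaml with a YAML library). It assumes VALID, MODEL_NUM, MODEL_TYPE, and MODEL_NAME are top-level keys, which this README does not confirm; the function names and example values are hypothetical.

```python
from pathlib import Path


def check_eval_config(config):
    """Raise ValueError if keys needed for evaluation are missing.

    Assumption: VALID and MODEL_NUM are top-level keys of the parsed config.
    """
    missing = [key for key in ("VALID", "MODEL_NUM") if config.get(key) is None]
    if missing:
        raise ValueError("config.yaml is missing required key(s): " + ", ".join(missing))


def model_output_dir(config, root="data/model"):
    """Compose data/model/(MODEL_TYPE)/(MODEL_NAME) from the parsed config.

    Assumption: MODEL_TYPE and MODEL_NAME are top-level keys.
    """
    return Path(root) / str(config["MODEL_TYPE"]) / str(config["MODEL_NAME"])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = {
        "MODEL_TYPE": "example_type",
        "MODEL_NAME": "example_run",
        "VALID": {"placeholder": "value"},
        "MODEL_NUM": 10,
    }
    check_eval_config(example)
    print(model_output_dir(example).as_posix())
```

If the real config.yaml nests these keys differently, adjust the lookups accordingly.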
- msa_data/: NP_*.npy files holding, for each residue, the conservation proportions of 21 amino acids
- variant_data/
  - clinvar_data.txt: pathogenic-or-benign-labeled variants from ClinVar
  - common_variants.txt: benign-labeled variants from gnomAD
  - conservation_data.txt: pathogenic-like and benign-like variants inferred from conservation data
  - truncated_variants.txt: ClinVar (ver. 2020.04) variants with the following consequences: start lost, stop gained, deletion, frameshift
  - nonsynonymous_variants.txt: Entries from truncated_variants.txt plus missense variants (also ver. 2020.04)
  - external_missense_variants.txt: Missense variants found in ClinVar 2020.08 but not in ClinVar 2020.04
  - external_truncated_variants.txt: Start lost, stop gained, deletion, and frameshift variants found in ClinVar 2020.08 but not in ClinVar 2020.04
  - external_nonsyn_variants.txt: File that combines external_missense_variants.txt and external_truncated_variants.txt
  - patient_variants.txt: Collection of disease-causing and non-causal variants from 111 patients (variant duplicates removed)
- sequence_data/: Contains output generated by build_dataset.py. (required to train 3Cnet)
  - *_dataset.bin files: files containing amino acid sequences
  - *_mut.npy files: files containing HGVSp identifiers
- SNVBOX_features/: SNVBOX-generated feature vectors. Files in this folder correspond to those in sequence_data/.
  - *_snvbox.npy: Tabular features of variants generated by SNVBox
- validation_result/: Contains data pertaining to the external ClinVar test set and patient data test results.
  - external_clinvar_missense.tsv: Variants from external_missense_variants.txt and their scores generated by various algorithms (3Cnet, REVEL, VEST4, SIFT, PolyPhen2, PrimateAI, CADD, FATHMM, DANN)
  - external_clinvar_nonsynonymous.tsv: Variants from external_nonsyn_variants.txt and their scores generated by various algorithms
  - external_clinvar_truncated.tsv: Variants from external_truncated_variants.txt and their scores generated by various algorithms
  - patient_all_scores.tsv: Variants from patient_variants.txt and their scores generated by various algorithms
  - patient_3scores.tsv: In-house patient variant data for validation, scored by three in silico predictors (3Cnet, REVEL, PrimateAI)
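Since files in SNVBOX_features/ correspond to those in sequence_data/, a small helper can confirm that every dataset has all three of its files before training. The sketch below assumes the correspondence is by shared filename prefix (for example, a hypothetical train_dataset.bin, train_mut.npy, and train_snvbox.npy); that naming rule and the function name are assumptions, not something this README states.

```python
from pathlib import Path


def pair_dataset_files(sequence_dir, snvbox_dir):
    """Map each dataset prefix to its (dataset.bin, mut.npy, snvbox.npy) paths.

    Assumption: the three files of one dataset share a filename prefix.
    Prefixes with any file missing are left out of the result.
    """
    sequence_dir, snvbox_dir = Path(sequence_dir), Path(snvbox_dir)
    suffix = "_dataset.bin"
    pairs = {}
    for dataset_path in sorted(sequence_dir.glob("*" + suffix)):
        prefix = dataset_path.name[: -len(suffix)]
        mut_path = sequence_dir / f"{prefix}_mut.npy"
        snvbox_path = snvbox_dir / f"{prefix}_snvbox.npy"
        if mut_path.is_file() and snvbox_path.is_file():
            pairs[prefix] = (dataset_path, mut_path, snvbox_path)
    return pairs


if __name__ == "__main__":
    # Paths follow the data/ layout described above.
    found = pair_dataset_files("data/sequence_data", "data/SNVBOX_features")
    for prefix, paths in found.items():
        print(prefix, [p.name for p in paths])
```

An empty result suggests either that build_dataset.py has not been run yet or that the files follow a different naming scheme than assumed here.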