sillitoe/3Cnet

 
 

Update Log

Feb 7, 2022

Installation

3Cnet was trained using the following versions of software:


STEP 1: Preprocess: Docker

  • Follow the steps in this section if you prefer a ready-to-go environment.
  • If you prefer to set up the environment on your own, skip directly to "Clone the 3Cnet repository."
  • We recommend you have at least 40GB of free storage.
  • See the NVIDIA NGC PyTorch container docs for other execution examples.

Install Docker and nvidia-docker2

Docker Engine (we use Docker 20.10.5)

https://docs.docker.com/engine/install/

NVIDIA/nvidia-docker2

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2

Pull the 3Cnet Docker image from Docker Hub

  • The Docker image for 3Cnet is based on one of NVIDIA NGC's offerings.
$ sudo docker pull 3billion/3cnet:0.0.1

Run docker image interactively

$ sudo docker run --gpus all -it -v </absolute/path/to/mount>:/workspace 3billion/3cnet:0.0.1
$ cd /workspace

STEP 2: Clone the 3Cnet repository

$ git clone https://github.com/KyoungYeulLee/3Cnet.git

STEP 3: Run download_data.py to retrieve necessary files from Zenodo

$ cd 3Cnet
$ python download_data.py
$ tar -xvf data.tar.gz
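
If you would rather unpack the archive from Python than call tar, the extraction step can be sketched with the standard library. This is an illustrative sketch only, not part of the repository: the function name extract_archive is hypothetical, and it assumes download_data.py has already placed data.tar.gz in the current directory.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    top_level = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if parts:  # skip a bare "." entry, which has no path components
            top_level.add(parts[0])
    return sorted(top_level)


# Hypothetical usage, run from the 3Cnet directory after download_data.py:
# extract_archive("data.tar.gz")
```

Only extract archives from a source you trust, since extractall writes whatever paths the archive contains.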

Code execution (continuing from data download)

  1. To train 3Cnet
$ cd ./src/model
$ python train_model.py
  2. To evaluate 3Cnet performance
$ python test_model.py
  3. To re-create the training/evaluation datasets
$ cd ../featurize
$ python build_dataset.py
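
The commands above can also be driven from a single Python script. The following is a minimal sketch, not something the repository provides: STEPS, build_command, and run_step are hypothetical names, and it assumes each script should be launched from inside its own directory, as the cd commands suggest.

```python
import subprocess
import sys
from pathlib import Path

# Step name -> (directory relative to the repository root, script to run).
# The directories and script names come from the commands above.
STEPS = {
    "train": ("src/model", "train_model.py"),
    "test": ("src/model", "test_model.py"),
    "build": ("src/featurize", "build_dataset.py"),
}


def build_command(repo_root, step):
    """Return (working directory, argv) for one step without running it."""
    subdir, script = STEPS[step]
    return Path(repo_root) / subdir, [sys.executable, script]


def run_step(repo_root, step):
    """Run one step from its own directory; raises CalledProcessError on failure."""
    cwd, argv = build_command(repo_root, step)
    subprocess.run(argv, cwd=str(cwd), check=True)


# Hypothetical usage, from the directory that contains the 3Cnet clone:
# run_step("3Cnet", "train")
# run_step("3Cnet", "test")
```

Keeping command construction separate from execution makes it easy to print or inspect the commands before starting a long training run.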

Data and file structures

Underlined files are the top-level files or scripts intended to be directly modified or executed by the user.

  • download_data.py: Retrieves data/ directory from Zenodo.
  • 3cnet.yaml: Anaconda-compatible environment yaml. (deprecated, also contains dependencies not directly used by 3cnet)

src/featurize

  • build_dataset.py: Run this to parse and process raw data into pytorch-compatible inputs.
  • config.yaml: A file specifying paths used by scripts in src/featurize
  • featurize_data.py: A dependency used by build_dataset.py that converts HGVSp nomenclature to amino acid sequences.
  • get_variants.py: Used to map sequences in /data/variant_data back to HGVSp identifiers.
  • merge_data.py: Script that bundles the two different sequence data with SNVbox features. (generates *_dataset.bin, *_mut.npy, and *_snvbox.npy files)

src/model

  • config.yaml: A file specifying paths and hyperparameters used for model training or evaluation. Alter this to modify file paths and/or settings.
  • deep_utilities.py: A generic collection of utility functions and classes used broadly by other files in src/model.
  • LSTM_datasets.py: Dataset class definition for 3Cnet.
  • LSTM_models.py: A wrapper class for 3Cnet defining low-level training routines.
  • LSTM_networks.py: The 3Cnet architecture is defined here (nn.Module).
  • evaluate_metrics.py: Definition of metrics used for model evaluation.
  • train_model.py: Top-level script for 3Cnet training. Outcomes are saved in data/model/(MODEL_TYPE)/(MODEL_NAME).
  • test_model.py: Evaluate model using values under the VALID key in config.yaml. The MODEL_NUM key in config.yaml represents the epoch # to load and must be defined for this script to run as intended.
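
Because test_model.py depends on the VALID and MODEL_NUM keys, it can help to check a loaded configuration before starting an evaluation. The sketch below is an assumption-laden illustration: check_test_config is a hypothetical helper, and it assumes the configuration has already been loaded into a dict (for example with PyYAML's yaml.safe_load) with VALID and MODEL_NUM as top-level keys. The actual layout of config.yaml may differ.

```python
def check_test_config(config):
    """Return a list of problems that would block evaluation; empty if none found.

    Assumes VALID and MODEL_NUM are top-level keys of the loaded config
    (an assumption, not confirmed by the repository).
    """
    problems = []
    if config.get("MODEL_NUM") is None:
        problems.append("MODEL_NUM is not set: no epoch number to load")
    if not config.get("VALID"):
        problems.append("VALID is missing or empty: no evaluation settings")
    return problems


# Hypothetical usage with placeholder values:
# for problem in check_test_config({"VALID": {"example_key": "example_value"}}):
#     print(problem)
```

The check compares MODEL_NUM against None rather than testing truthiness, so an epoch number of 0 would still count as defined.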

data/

  • msa_data/: NP_*.npy files giving, for each residue, the conservation proportions of the 21 amino acids

  • variant_data/

    • clinvar_data.txt: pathogenic-or-benign-labeled variants from ClinVar
    • common_variants.txt: benign-labeled variants from gnomAD
    • conservation_data.txt: pathogenic-like and benign-like variants inferred from conservation data
    • truncated_variants.txt: ClinVar (ver. 2020.04) variants with the following consequences: start lost, stop gained, deletion, frameshift
    • nonsynonymous_variants.txt: Entries from truncated_variants.txt plus missense variants (also ver. 2020.04)
    • external_missense_variants.txt: Missense variants found in ClinVar 2020.08 but not in ClinVar 2020.04
    • external_truncated_variants.txt: Start lost, stop gained, deletion, and frameshift variants found in ClinVar 2020.08 but not in ClinVar 2020.04
    • external_nonsyn_variants.txt: File that combines external_missense_variants.txt and external_truncated_variants.txt
    • patient_variants.txt: Collection of disease-causing and non-causal variants from 111 patients (variant duplicates removed)
  • sequence_data/: Contains output generated by build_dataset.py. (required to train 3Cnet)

    • *_dataset.bin files: files containing amino acid sequences
    • *_mut.npy files: files containing HGVSp identifiers
  • SNVBOX_features/: SNVBOX-generated feature vectors. Files in this folder correspond to those in sequence_data/.

    • *_snvbox.npy: Tabular features of variants generated by SNVBox
  • validation_result/: Contains data pertaining to the external clinvar test set and patient data test results.

    • external_clinvar_missense.tsv: Variants from external_missense_variants.txt and their scores generated by various algorithms (3Cnet, REVEL, VEST4, SIFT, PolyPhen2, PrimateAI, CADD, FATHMM, DANN)
    • external_clinvar_nonsynonymous.tsv: Variants from external_nonsyn_variants.txt and their scores generated by various algorithms
    • external_clinvar_truncated.tsv: Variants from external_truncated_variants.txt and their scores generated by various algorithms
    • patient_all_scores.tsv: Variants from patient_variants.txt and their scores generated by various algorithms
    • patient_3scores.tsv: In-house patient variant data for validation, with scores from three in silico predictors (3Cnet, REVEL, PrimateAI)
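
Since training needs matching files in sequence_data/ and SNVBOX_features/, a quick presence check can catch an incomplete download or dataset build. This is a minimal sketch under stated assumptions: missing_files is a hypothetical helper, and it assumes the three files for one dataset share a common prefix (the "train" prefix in the usage comment is an example, not a confirmed file name).

```python
from pathlib import Path

# Folder under data/ -> file suffixes expected there, per the list above.
EXPECTED_SUFFIXES = {
    "sequence_data": ("_dataset.bin", "_mut.npy"),
    "SNVBOX_features": ("_snvbox.npy",),
}


def missing_files(data_dir, prefix):
    """Return the expected files for one dataset prefix that are not present."""
    data_dir = Path(data_dir)
    expected = [
        data_dir / folder / f"{prefix}{suffix}"
        for folder, suffixes in EXPECTED_SUFFIXES.items()
        for suffix in suffixes
    ]
    return [path for path in expected if not path.is_file()]


# Hypothetical usage, from the repository root:
# for path in missing_files("data", "train"):
#     print("missing:", path)
```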
