Maintainer: Elvisa Mehinovic
Laboratory of Dr. Tychele N. Turner, Ph.D.
Washington University in St. Louis
This is an empty folder provided for users to store all input query files that will be run through the workflow. Using this folder is optional; however, if the user keeps a query file outside of this folder, they must include the full path to that file in config.json under 'query'. A query file may not be a repeat sequence or a file larger than 1 MB; such files will not generate accurate information.
Prior to running the pipeline, ensure that you have the reference genome data as described on our wiki.
Steps:
Refer to the provided outline image to better understand pipeline file locations.
1. Start by cloning the ACES GitHub repository:
All required script files are available on GitHub and can be downloaded to a desktop by using:
wget https://github.com/TNTurnerLab/ACES.git
Or, more reliably, cloned with git as follows (wget on a .git URL may fetch only a web page rather than the repository contents):
git clone https://github.com/TNTurnerLab/ACES.git
2. Pull down the ready-to-run Docker image with the code provided below:
- This Docker image is pre-built and needs no modification. If the user wishes to build their own image manually, use the Dockerfile provided with this pipeline.
docker pull tnturnerlab/vgp_ens_pipeline:latest
3. Put all VGP species '*-unmasked.fa' files and all '*.dna.toplevel.fa' species files from Ensembl pub/release-103 in the provided Genomes directory and unzip them.
- See the file DOWNLOADING VGP AND ENSEMBL SPECIES FILES for command-line commands that will help achieve this.
4. Use or generate empty files corresponding to files named in SUB-FILES GUIDE and put your query input files inside the pre-generated folder USER query Files. This folder is found in the folder VGP SnakeFile Pipeline.
- Files can be modified or changed based on user's requirements
5. Configure all file pathways in the file config.json. This file can be found in VGP SnakeFile Pipeline.
- Reference FILES GUIDE: config.json
- genomesdb:
  - Defaults to VGP_AND_ENSEMBL_TOGETHER.txt. Unless the user changes it, this file runs all VGP and Ensembl genomes against the user's query sequence.
- query:
  - The pathway to this file does not have to change if the user puts their input file inside the pre-generated folder, USER query Files. In that case, edit only the filename after the last '/'. The rest of the path needs editing only if the input file was not placed in the provided folder.
  - The input file may not be a repeat sequence or a file larger than 1 MB; such files will not generate accurate information.
- dbs:
  - Do not edit this path; it is the pathway to the /Genomes folder.
- threshold:
  - E-value threshold requirement. Default is set to 0.0001. The user may change it if desired; decimal format is recommended.
- queryLengthPer:
  - Minimum fraction of query length required. Default is set to 0.3, or 30%. The user may keep the default or replace it with another decimal value.
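The config.json fields described above can be sanity-checked before a run. The following is a minimal sketch, not part of the pipeline: it assumes config.json is a flat JSON object with exactly these five keys, and the "query" and "dbs" paths shown are hypothetical placeholders. Only the three default values (VGP_AND_ENSEMBL_TOGETHER.txt, 0.0001, 0.3) come from this guide.

```python
import json

# Hypothetical example content. The "query" and "dbs" values are placeholders;
# the other three values are the defaults named in this guide.
EXAMPLE_CONFIG = """
{
  "genomesdb": "VGP_AND_ENSEMBL_TOGETHER.txt",
  "query": "/path/to/query_folder/my_query.fa",
  "dbs": "/path/to/Genomes",
  "threshold": 0.0001,
  "queryLengthPer": 0.3
}
"""

REQUIRED_KEYS = ("genomesdb", "query", "dbs", "threshold", "queryLengthPer")


def _as_number(value):
    """Convert a JSON value (number or numeric string) to float, else None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def check_config(config):
    """Return a list of problems found in a config dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if "threshold" in config:
        threshold = _as_number(config["threshold"])
        if threshold is None or threshold < 0:
            problems.append("threshold must be a non-negative number")
    if "queryLengthPer" in config:
        fraction = _as_number(config["queryLengthPer"])
        if fraction is None or not 0 < fraction <= 1:
            problems.append(
                "queryLengthPer must be a fraction greater than 0 and at most 1"
            )
    return problems


print(check_config(json.loads(EXAMPLE_CONFIG)))  # prints []
```

To check a real file, the same function could be applied to the result of json.load on an opened config.json.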
- Open the file config.json and fill in the value for "threshold":
  - Enter a single value with a decimal point; scientific notation is accepted but not required.
  - The value is the threshold requirement that a species' BLAST output must meet before a parse file can be generated for it.
  - The threshold is an E-value, the expected number of chance matches under a random model. For more information about threshold values, visit this link: http://pathblast.org/docs/e_value.html
- Open the file config.json and fill in the value for "queryLengthPer":
  - Enter a decimal value for the fraction of the query length that a sequence must reach in order to be included in the results.
  - This requirement helps eliminate short sequences that BLASTn may have reported as hits.
  - The fraction of query length is applied to all subject sequence lengths; only sequences that meet or exceed the minimum requirement move further into the pipeline.
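How the two settings work together can be illustrated with a small filter. This is a sketch under stated assumptions, not the pipeline's actual code: the Hit record, its field names, and the use of "less than or equal" for the E-value comparison are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit; field names are illustrative."""
    subject_id: str
    evalue: float
    length: int  # length of the subject sequence in the hit


def filter_hits(hits, query_length, threshold=0.0001, query_length_per=0.3):
    """Keep hits whose E-value is within the threshold and whose length
    is at least query_length_per * query_length."""
    min_length = query_length_per * query_length
    return [
        hit for hit in hits
        if hit.evalue <= threshold and hit.length >= min_length
    ]


hits = [
    Hit("speciesA", 1e-50, 950),  # passes both requirements
    Hit("speciesB", 1e-10, 120),  # too short for a 1000 bp query
    Hit("speciesC", 0.5, 800),    # E-value above the threshold
]
kept = filter_hits(hits, query_length=1000)
print([hit.subject_id for hit in kept])  # prints ['speciesA']
```

With the defaults and a 1000 bp query, the minimum length is 300 bp, so a short hit is dropped even when its E-value is very small.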
- Open the file named by "genomesdb" in config.json. This file is located in genomes input document.
  - The default file is set to run all VGP and Ensembl genomes.
  - Create a new file, or modify this file and close it when satisfied.
- Users must upload or have on hand their {query} file for BLAST.
  - Open config.json to set which file is the user's query file: "query"
  - The query file should be put in USERS_query_Files; if it is not, enter the complete pathway to the input file in config.json.
  - The query file cannot be a full genome or a repeat element.
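The size restriction on the query file can be checked before launching the pipeline. The sketch below is an illustration only: it reads "1 MB" as 1,000,000 bytes, which is an assumption, and the helper name check_query_file is hypothetical. Whether a sequence is a repeat element cannot be determined this way and still has to be judged by the user.

```python
import os
import tempfile

# Assumption: "1 MB" is taken to mean 10**6 bytes.
MAX_QUERY_BYTES = 1_000_000


def check_query_file(path):
    """Return a problem description for the query file, or None if its
    size is acceptable."""
    size = os.path.getsize(path)
    if size == 0:
        return "query file is empty"
    if size > MAX_QUERY_BYTES:
        return f"query file is {size} bytes, above the {MAX_QUERY_BYTES}-byte limit"
    return None


# Demonstration on a small temporary file with made-up sequence content.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_query.fa")
    with open(demo_path, "w") as handle:
        handle.write(">demo\nACGTACGTACGT\n")
    print(check_query_file(demo_path))  # prints None
```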
- Locate Local_ACES_Version.smk and LSF_ACES_Version.smk in the ACES Pipeline folder, and decide whether to use LSF_ACES_Version.smk for running on an LSF server or Local_ACES_Version.smk for running on a local machine.
- Run Docker command - CHECK:
docker run tnturnerlab/vgp_ens_pipeline:latest
(Checks that the pull was successful and the image is ready to run.)
- Run the following script:
docker run -v "/##FULLPATH TO GITHUB CLONE##/ACES/ACES_Pipeline:/ACES/ACES_Pipeline" tnturnerlab/vgp_ens_pipeline:latest /opt/conda/bin/snakemake -s /ACES/ACES_Pipeline/Local_ACES_Version.smk -k -w 120 --rerun-incomplete --keep-going
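The placeholder in the command above must be replaced with the absolute path to the cloned repository. As a hedged sketch, the command can be assembled programmatically so the volume mapping is always built from an absolute path. The function name local_run_command and the example clone location /data/clones/ACES are hypothetical; the image name, container path, and Snakemake options are copied from the command above (-k is the short form of --keep-going, so it is listed once).

```python
import os
import shlex

IMAGE = "tnturnerlab/vgp_ens_pipeline:latest"
CONTAINER_DIR = "/ACES/ACES_Pipeline"


def local_run_command(aces_clone_dir):
    """Build the local docker run command as an argument list.

    aces_clone_dir is the path to the cloned ACES folder on the host.
    """
    host_dir = os.path.join(os.path.abspath(aces_clone_dir), "ACES_Pipeline")
    return [
        "docker", "run",
        "-v", f"{host_dir}:{CONTAINER_DIR}",
        IMAGE,
        "/opt/conda/bin/snakemake",
        "-s", f"{CONTAINER_DIR}/Local_ACES_Version.smk",
        "-k", "-w", "120", "--rerun-incomplete",
    ]


# Hypothetical clone location, used only to show the resulting command line.
command = local_run_command("/data/clones/ACES")
print(shlex.join(command))
```

The argument list could be passed to subprocess.run to launch the container; printing it first makes it easy to confirm the volume mapping before anything is executed.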
- Tell Docker where the data and code are.
- To execute on LSF, set:
export LSF_DOCKER_VOLUMES="/##PATH_TO##/##_DIRECTORY_##/ACES/ACES_Pipeline/:/ACES/ACES_Pipeline/"
Example:
export LSF_DOCKER_VOLUMES="/path/to/data:/path/name /home/directory:/home-
Run Docker interactively to see if successful:
bsub -Is -R 'rusage[mem=50GB]' -a 'docker(tnturnerlab/vgp_ens_pipeline:latest)' /bin/bash
- Create a group job:
bgadd -L 2000 /username/###ANY NAME YOU WOULD LIKE TO CALL JOB###
- Run the following script:
- MUST MODIFY SCRIPT TO RUN:
bsub -q general -g /username/VGP -oo Done.log.out -R 'span[hosts=1] rusage[mem=30GB]' -G compute-NAME -a 'docker(tnturnerlab/vgp_ens_pipeline:latest)' /opt/conda/bin/snakemake --cluster " bsub -q general -g /username/VGP -oo %J.log.out -R 'span[hosts=1] rusage[mem=300GB]' -M 300GB -a 'docker(tnturnerlab/vgp_ens_pipeline:latest)' -n 4 " -j 100 -s LSF_ACES_Version.smk -k -w 120 --rerun-incomplete --keep-going -F
Example:
bsub -q general -g /elvisa/VGP -oo Done.log.out -R 'span[hosts=1] rusage[mem=30GB]' -G compute-tychele -a 'docker(tnturnerlab/vgp_ens_pipeline:latest)' /opt/conda/bin/snakemake --cluster " bsub -q general -g /elvisa/VGP -oo %J.log.out -R 'span[hosts=1] rusage[mem=300GB]' -M 300GB -a 'docker(tnturnerlab/vgp_ens_pipeline:latest)' -n 4 " -j 100 -s LSF_ACES_Version.smk -k -w 120 --rerun-incomplete --keep-going -F
- Output files will be generated in the Output folder provided in this pipeline.
- See Output Files Generated: Output for the files that are generated and more information on each. Output files will be generated inside the ACES_Pipeline folder, where two folders will be created. One folder holds all BLAST outputs from the pipeline execution, and the other holds the output files. The one named BLAST_Outputfiles_ARCHIVE_For_Genomes_* can be deleted or kept. Outputfiles_For_Genomes_* is the folder holding all outputs. The names of these folders will vary based on the name of the genomes input document used, the user's query file name, and the threshold value used.
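Because the folder names vary from run to run, it can help to locate them by their fixed prefixes. The following is a small sketch, assuming only the two prefixes named above; the function name find_output_folders and the "_demo" suffix used in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

ARCHIVE_PREFIX = "BLAST_Outputfiles_ARCHIVE_For_Genomes_"
RESULTS_PREFIX = "Outputfiles_For_Genomes_"


def find_output_folders(pipeline_dir):
    """List entries in pipeline_dir whose names start with each prefix."""
    base = Path(pipeline_dir)
    return {
        "blast_archive": sorted(p.name for p in base.glob(ARCHIVE_PREFIX + "*")),
        "results": sorted(p.name for p in base.glob(RESULTS_PREFIX + "*")),
    }


# Demonstration in a temporary folder with made-up names.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / (ARCHIVE_PREFIX + "demo")).mkdir()
    (Path(folder) / (RESULTS_PREFIX + "demo")).mkdir()
    (Path(folder) / "Done.log.out").touch()
    found = find_output_folders(folder)
    print(found["blast_archive"])  # prints ['BLAST_Outputfiles_ARCHIVE_For_Genomes_demo']
    print(found["results"])        # prints ['Outputfiles_For_Genomes_demo']
```

In practice the function would be pointed at the ACES_Pipeline folder of the clone after a run has finished.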
- Once satisfied, the user can move or delete all log files with basic mv or rm commands.