DupyliCate

DupyliCate is a Python tool for mining, classifying, and analysing gene duplications. It can find and classify gene duplicates in several organisms concurrently and is designed for high throughput. It can be run in a reference-free manner for intra-species identification and classification of gene duplicates, or with a reference organism for comparative genomic analysis. In both cases, gene duplicates are identified and classified using intra-species local alignment. The only difference is that, when a reference is provided, orthologs of the sample organism's genes are assigned in the reference genome, and further analyses of synteny and gene copy number variation are carried out. With a reference organism, the tool can also produce a comparative gene table for specific reference genes whose copy number variation the user wants to examine in the sample organisms, facilitating in-depth comparative genomic analyses.

There are two main modes of DupyliCate:

'Overlap': Gene duplicates can repeat across groups. This mode also produces an output called 'Duplicates_relationships' that connects the genes repeating across the duplicate classes (tandem, proximal, dispersed).

'Strict': Gene duplicates do not repeat across groups. This mode produces a fourth duplicate class called 'Mixed_duplicates', in which connected gene duplicates from the different classes (tandem, proximal, dispersed) are merged so that both their individual classification and their relationships across groups are retained.

DupyliCate also supports optional expression analysis of gene duplicates using count table data. It further gives the user the option to calculate Ka/Ks values for the duplicates.

Workflow

(1) DupyliCate needs the structural annotation file(s) (GFF3) along with one of the following FASTA file types: assembly, coding sequence, or peptide sequence. If a structural annotation is not available for a sample organism in which duplicates need to be analysed, the annotation of a related organism can be provided as a reference. This is then used to produce the required annotation and carry out further analysis. (Step 1)

(2) The input files are checked and validated before the actual run starts. If the check fails, the script exits and the errors are recorded and displayed organism-wise under the path Tmp/Errors_warnings. For GFF errors, a helper script provided along with the main script uses AGAT to process and correct the GFF files. The corrected and validated files then enter the main analysis and are processed to give PEP files without alternate transcripts. (Steps 2,3)

(3) Since the output depends on the quality of the input files and is also influenced by the ploidy of the organisms being analysed, a BUSCO-based QC step is included. This provides detailed QC reports containing the BUSCO completeness, duplication and BUSCO-derived pseudo-ploidy number. (Step 4)

(4) Before moving on to the duplication analysis, it is important to segregate singletons and duplicates correctly. For this, two cut-offs are offered: one based on normalized bit score and one based on self-similarity. More detailed information on the cut-offs and parameters can be obtained here. By default, a BUSCO-based auto threshold method is used, where BUSCO single copy genes are used to identify the normalized bit score threshold that segregates singletons and duplicates. If BUSCO is not available, or if the organism has a very low number of BUSCO single copy genes, the tool falls back to a default self-similarity threshold instead of the normalized bit score threshold. Both the normalized bit score threshold and the self-similarity threshold can also be set manually. (Step 5)
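
The segregation idea can be sketched as follows. This is a minimal illustration, not DupyliCate's implementation: it assumes that a gene's normalized bit score is the bit score of its best non-self hit divided by the bit score of its self hit in the self alignment. The function name, the input layout, and the threshold value of 0.3 are hypothetical.

```python
def split_singletons_duplicates(hits, threshold):
    """Partition genes by normalized bit score.

    hits: dict mapping gene -> (self_bit_score, best_non_self_bit_score or None)
    threshold: float between 0 and 1 (the role a manual --score value would play)
    """
    singletons, duplicates = [], []
    for gene, (self_score, second_score) in hits.items():
        # Assumption: a gene with no non-self hit gets a normalized score of 0.
        normalized = (second_score / self_score) if second_score else 0.0
        if normalized >= threshold:
            duplicates.append(gene)
        else:
            singletons.append(gene)
    return singletons, duplicates


# Hypothetical bit scores for three genes.
example_hits = {
    "geneA": (500.0, 450.0),  # normalized score 0.90
    "geneB": (400.0, 60.0),   # normalized score 0.15
    "geneC": (300.0, None),   # no hit other than itself
}
singletons, duplicates = split_singletons_duplicates(example_hits, threshold=0.3)
print(singletons)
print(duplicates)
```

In the default auto mode the threshold would instead be derived from BUSCO single copy genes, which this sketch does not attempt to reproduce.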

(5) Self alignment of each sample organism is performed and, if a reference organism is provided, forward alignment of each sample is performed against the reference. This is followed by a comprehensive ortholog detection step that combines local alignment, global alignment, and phylogeny to obtain orthologs of the sample genes among the reference genes. (Steps 5,6)

(6) Next, the duplicates clustering and grouping step is performed for each sample organism. In the presence of a reference organism, synteny analysis is also carried out for the small scale gene duplicate groups (tandem and proximal duplicates) to add a confidence layer of genomic positional context. Also, if a reference is involved, the output describes the nature of each small scale duplicate group (tandem and proximal), for example gene group expansion or de novo duplication. (Steps 7,8)

(7) Finally, because a clustering approach is used to obtain gene duplicate groups/arrays, an internal scoring scheme classifies every duplicate group in the output as a low, moderate, or high confidence group, which helps in interpreting the results. Along with the different duplicate group files, the singletons of each sample organism are also provided organism-wise. Note that in the reference-free mode, the ortholog detection, synteny analysis, and gene group nature analysis steps are absent. (Steps 7,8)

(9) Ka/Ks computation can also be done for all gene duplicates on an individual gene basis using either the Nei-Gojobori or the Modified Yang-Nielsen method. Please note that assembly FASTA file(s) or coding sequence file(s) are required along with the annotation file for Ka/Ks analysis. (Step 9)

(10) If expression data is available for the sample organism(s), expression analysis of gene duplicates can also be performed with the script. This step reports the correlation among genes in duplicate groups, generates pairwise gene expression plots in a matrix figure, and helps determine divergent expression among duplicates. (Step 10)

Gene duplicates types output by DupyliCate

Installation and dependencies

(1) Manual installation

Clone this repository

git clone https://github.com/ShakNat/DupyliCate

Mandatory dependencies:

Tools: BLAST, DIAMOND, MMSeqs2 (latest versions preferred)
Python libraries: pandas (v2.3.1 or greater), numpy (v2.3.2 or greater), seaborn (v0.13.2 or greater), matplotlib (v3.10.5 or greater), scipy (v1.16.1 or greater), jinja2 (3.1.6), openpyxl (3.1.2 or greater)

Optional dependencies:

Tools: GeMoMa, BUSCO, AGAT, MAFFT, FastTree (latest versions preferred)
Python libraries: dendropy (v5.0.8), tqdm (v4.67.1)

(2) Docker installation

docker pull shakunthalan/dupylicate:latest

It is recommended to run the docker image as a user and not root:

 docker run --rm -u $(id -u) -v /path/to/data:/data shakunthalan/dupylicate:latest

Note: If you are using a docker image of the tool, you need not specify the full paths of the dependencies while running the tool. The dependencies are built into the docker image, which makes it simple to use the tool across systems without the need for local installations.

(3) Installation in a conda environment

This method of installation installs all the dependencies in a conda environment using the environment.yml file in this repository

git clone https://github.com/ShakNat/DupyliCate

cd DupyliCate

conda env create -f environment.yml

conda activate dupylicate_v1.0

Running the script

Usage:

  python3 dupylicate.py --gff <GFF_FILE_OR_DIR>
                        [--fasta <ASSEMBLY_FASTA_FILE_OR_DIR> | --cds <CDS_FASTA_FILE_OR_DIR> | --pep <PEP_FASTA_FILE_OR_DIR>]
                        --out <DIR>


MANDATORY:

	--gff 									STR			Full path to folder containing GFF files

	Choose one of FASTA/PEP/CDS

	[--fasta								STR			Full path to folder containing WGS FASTA files

    --cds									STR			Full path to folder containing CDS FASTA files

    --pep									STR			Full path to folder containing PEP files]

    Output directory

	--out									STR			full path to output folder

OPTIONAL:

	--ref									STR			Name of the organism to be taken as reference [Default is NA and the script runs in reference-free mode]

	--prokaryote							STR			Use the flag if the organisms for analysis are prokaryotes

	--pseudos_gff							STR			yes | no Optional inclusion or exclusion of pseudogenes with coding features in the GFF file [DEFAULT is no]

	--gff_config							STR			Full path to TXT file containing the gff parameters

	--mode									STR			overlap | strict [DEFAULT is overlap] 

	--to_annotate							STR			Full path to TXT file 

	--seq_aligner							STR			blast | diamond | mmseqs2 [DEFAULT is diamond]

	--blast									STR			Full path to BLAST if not already in your PATH environment variable

	--diamond								STR			Full path to diamond if not already in your PATH environment variable

	--mmseqs								STR			Full path to mmseqs2 if not already in your PATH environment variable

	--mafft									STR			Full path to MAFFT if not already in your PATH environment variable

	--evalue								FLOAT		evalue for alignment [DEFAULT is 1e-5]

	--gemoma								STR			Full path to GeMoMa if not already in your PATH environment variable

	--qc									STR			yes | no Quality check with BUSCO [DEFAULT is no]

	--busco									STR			Full path to BUSCO | busco_docker [DEFAULT is busco]

	--busco_lineage							STR			Full path to BUSCO config file [DEFAULT is auto]

	--busco_version							STR			BUSCO version in the format vx.x.x [DEFAULT is v5.8.2]

	--container_version						STR			Docker container version of BUSCO [DEFAULT is cv1]

	--docker_host_path						STR			Full host folder path 

	--docker_container_path					STR			Full mount path in the docker container

	--score									STR | FLOAT	auto | float  number between 0 and 1 [DEFAULT is auto]

	--self_simcut							FLOAT		Similarity percentage to remove self alignment hits with low similarity percentage [DEFAULT is 50.0]

    --hits									INT			Number of top hits to be considered for finding suitable orthologs in the reference organism [DEFAULT is 10]

	--ortho_candidates						INT			User specified integer value for listing the potential ortholog candidates [DEFAULT is 3]

	--occupancy								FLOAT		Occupancy cutoff for MAFFT aligned file trimming [DEFAULT is 0.1]

    --scoreratio							FLOAT		Ratio of forward alignment bit score and self alignment bit score of query to assess if the forward hit is valid [DEFAULT is 0.3]

	--fwd_simcut							FLOAT		Similarity percentage to remove forward alignment hits with low similarity percentage [DEFAULT is 40.0]

    --cores									INT			Number of cores needed to run Dupylicate analysis [DEFAULT is 4]

	--proximity								INT			Value for the number of intervening genes to detect proximal duplications [DEFAULT is 10]

    --synteny_score							FLOAT		Value which is used as a cut-off or threshold for synteny analysis [DEFAULT is 0.5]

	--flank									INT			Value specifying the number of flanking genes to be considered to determine the synteny window size in synteny analysis [DEFAULT is 5]

	--side									INT			Value for synteny support from either side of a flanking region of a synteny window [DEFAULT is 1]

	--ka_ks									STR			yes | no  To calculate ka, ks values [DEFAULT is no]

	--ka_ks_method							STR			MYN | NG  Methods for Ka/Ks ratio calculation [DEFAULT is NG]

	--duplicates_analysis					STR			yes | no  For further statistical analysis of identified gene duplicates [DEFAULT is no]

	--specific_duplicates_analysis			STR			yes | no  For further statistical analysis of specified ref genes' gene duplicates [DEFAULT is no]

	--ref_free_specific_duplicates_analysis	STR			Full path to TXT file containing the names of genes whose expression analysis is to be done in the absence of a reference 

	--dpi									STR			low | moderate | high | very high  Resolution level desired for plots [DEFAULT is moderate]

	--analyse_disperse_duplicates			STR			yes | no  For statistical analysis of dispersed gene duplicates [DEFAULT is no]

	--exp									STR			Full path to the folder with expression counts files

	--avg_exp_cut							FLOAT		Value cutoff for average expression value across samples [Default is 1]

	--genes									INT			Value referring to the number of genes to be taken for random pairing [DEFAULT is 10000] 

	--specific_genes						STR			Full path to TXT file 

	--clean_up								STR			yes | no  Cleans up intermediate files and folders [DEFAULT is yes]
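
As an illustration of how the flags listed above combine, the following sketch assembles a command line for a reference-based run in strict mode. The helper function, the folder paths, and the organism name Athaliana are hypothetical; only flags documented above are used, and the command is printed rather than executed.

```python
import shlex


def build_command(gff_dir, pep_dir, out_dir, ref=None, mode="overlap", cores=4):
    """Assemble a dupylicate.py call from documented flags (hypothetical helper)."""
    cmd = [
        "python3", "dupylicate.py",
        "--gff", gff_dir,
        "--pep", pep_dir,      # one of --fasta / --cds / --pep is required
        "--out", out_dir,
        "--mode", mode,        # overlap | strict
        "--cores", str(cores),
    ]
    if ref is not None:
        # Without --ref the script runs in reference-free mode.
        cmd += ["--ref", ref]
    return cmd


cmd = build_command("/data/gff/", "/data/pep/", "/data/results/",
                    ref="Athaliana", mode="strict")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when paths contain spaces.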

ALLOWED FILE EXTENSIONS:

	<'.gff', '.gff.gz', '.gff3', '.gff3.gz', '.fa', '.fa.gz', '.fna', '.fna.gz','.fasta', '.fasta.gz',
	'.genome.fasta', '.genome.fa', '.genome.fasta.gz', '.genome.fa.gz', '.genome.fna',
	'.genome.fna.gz','.cds.fa', '.cds.fasta','.cds.fa.gz', '.cds.fasta.gz', '.cds.fna',
	'.cds.fna.gz','.pep.fa','.pep.fa.gz','.pep.fasta','.pep.fasta.gz', '.pep.fna',
	'.pep.fna.gz','.tsv','.txt','.txt.gz','.tpms.txt','.tpms.txt.gz'>

More details on some parameters

--mode In the overlap mode genes repeat among the different duplicates classification. In the strict mode there is no gene repetition and you have a new classification group called mixed duplicates containing the related connected components from the other three duplicate classes

--to_annotate This needs the path to a TXT file. The TXT file must contain names of queries and the corresponding reference organism for GeMoMa annotation separated by comma - Query,Reference -> one pair per line

--gemoma The full path to GeMoMa must also include the name of the GeMoMa jar file

--busco This flag can take the full path to busco. If you have installed busco via conda, activate the environment and give the full path. If you have a docker-based installation of busco, just give busco_docker as the parameter to this flag

--busco_lineage This flag can take the full path to BUSCO config file that specifies the BUSCO lineage to be used for the different sample organisms in an analysis

--busco_version --container_version --docker_host_path --docker_container_path These flags are needed only if you want to do a BUSCO-based QC or auto scoring and if you have a docker-based busco installation

--score This flag can take in string parameter auto if you want the thresholding for singleton duplicates segregation to be done based on BUSCO; If you want to tune this parameter, instead of auto, you can specify any float number between 0 and 1

--hits These are the number of reference genes that will be picked for tree building in the ortholog finding step from among the list of valid forward hits in the reference organism for each sample organism gene

--ortho_candidates Because a low confidence ortholog assignment is uncertain, low confidence orthologs have an additional output column listing other potential orthologs. This parameter tunes the number of such potential orthologs to be listed

--analyse_disperse_duplicates The number of dispersed duplicates is generally higher, and a routine expression analysis of all identified dispersed duplicate groups could be time consuming. Hence, this has been made optional. The expression analysis activated with the --exp flag covers the small scale duplicate groups of tandem and proximal genes. If the dispersed duplicates also need to be subjected to expression analysis, set this flag's parameter to yes

--genes This flag's parameter gives the number of genes to be considered for random gene pairing to identify a correlation threshold that can be used to assess functional divergence of gene duplicates based on gene expression values

--specific_genes The TXT file pointed by this flag must contain a user given list of specific genes from the reference with one transcript name per line, for analysing gene duplication outputs

--ref_free_specific_duplicates_analysis This flag can be used for analysing expression of gene duplicates in the sample when a reference is not given

IMPORTANT: The difference between the --specific_duplicates_analysis and the --ref_free_specific_duplicates_analysis flags is that the former needs specification of yes or no and the genes must be specified in a separate TXT file whose path must be given with the --specific_genes flag. The genes specified must be genes in the reference organism, and this option will help perform expression analysis of the orthologs of the specified reference genes if they belong to a gene duplicates group in the sample. These two flags can be used only in the presence of a reference organism. On the other hand the --ref_free_specific_duplicates_analysis is a single flag needing the full path to the TXT file with the genes in the sample organism whose expression analysis needs to be done and can be used with or without reference as it is only sample organism dependent

GFF fields config file preparation instructions for Dupylicate.py:

This is a simple .TXT file containing the gff parameters in different columns
The columns can be separated by tabs or spaces
There are four columns mandatorily needed in the config file in the following order:

(i) base file name - same as the base name you use for the gff file | all in case all the gff have the same gff pattern

(ii) child_attribute: attribute field of the mRNA or transcript feature in the file like ID

(iii) child_parent_linker: attribute field of the mRNA or transcript, CDS, exon features that link them with their respective parent feature like Parent - Note: base assumption by the tool is that all child levels have the same child-parent linker attribute fields. For eg., if Parent is the child-parent linker in the mRNA feature line, then Parent will be the child-parent linker for all other child-level feature lines in the GFF

(iv) parent_attribute: attribute field of the gene feature like ID

(v) By default in the script, child_attribute is ID, child_parent_linker is Parent, and parent_attribute is ID

Sample config file and GFF file example:

If the config looks like this -

all	Name	Parent	ID

And the corresponding GFF file looks as below - 

##gff-version 3

##annot-version Araport11

##species Arabidopsis thaliana columbia

Chr1    	phytozomev12    	gene    	3631    	5899    	.       	+       	.       	ID=AT1G01010.Araport11.447;Name=AT1G01010

Chr1    	phytozomev12    	mRNA    	3631    	5899    	.       	+       	.       	ID=AT1G01010.1.Araport11.447;Name=AT1G01010.1;pacid=37401853;longest=1;geneName=NAC001;Parent=AT1G01010.Araport11.447

Understanding the GFF config file:

The first column of the file says all. This means all the files in the analysis will have the same GFF file format/ pattern
The second column that is the child attribute column is Name. So the text following the Name field in the last column of the mRNA feature will be picked which is AT1G01010.1 above
The third column that is the child-parent linker column is Parent. It is the field Parent in mRNA that links it to its parent gene, and in the above example its corresponding text picked will be AT1G01010.Araport11.447
The fourth column is the parent attribute that is mentioned as ID. In the above example it is the ID field in the gene feature line which is AT1G01010.Araport11.447

IMPORTANT: It is important to note that the child-parent linker and the parent must be chosen in such a way that they point to the same text. For example, both the child-parent linker and the parent attribute in the above example point to AT1G01010.Araport11.447; This is important to ensure that the transcripts correctly map to the parent gene especially in the alternate transcript removal step
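
The lookup described above can be sketched in a few lines. This is an illustration of how the three config columns select attribute fields from the example GFF lines, not the tool's own parser; the function name parse_attributes is hypothetical.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Name=y' into {'ID': 'x', 'Name': 'y'}."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key.strip(): value.strip() for key, value in pairs}


# Config line from the example: base name, child attribute, linker, parent attribute.
config_line = "all\tName\tParent\tID"
_, child_attribute, child_parent_linker, parent_attribute = config_line.split()

gene_line = ("Chr1\tphytozomev12\tgene\t3631\t5899\t.\t+\t.\t"
             "ID=AT1G01010.Araport11.447;Name=AT1G01010")
mrna_line = ("Chr1\tphytozomev12\tmRNA\t3631\t5899\t.\t+\t.\t"
             "ID=AT1G01010.1.Araport11.447;Name=AT1G01010.1;pacid=37401853;"
             "longest=1;geneName=NAC001;Parent=AT1G01010.Araport11.447")

gene_attrs = parse_attributes(gene_line.split("\t")[8])
mrna_attrs = parse_attributes(mrna_line.split("\t")[8])

transcript_name = mrna_attrs[child_attribute]    # text of the Name field in mRNA
linked_parent = mrna_attrs[child_parent_linker]  # text of the Parent field in mRNA
gene_id = gene_attrs[parent_attribute]           # text of the ID field in gene

print(transcript_name, linked_parent, gene_id)
# The child-parent linker and the parent attribute must point to the same text.
print(linked_parent == gene_id)
```

Running such a check on a few lines of each GFF file before an analysis is a quick way to confirm that the chosen config columns are consistent.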

Additionally, when including a gene expression counts file for expression analysis, the gene names in the first column of the counts file must match the text pointed to by the child attribute or the parent attribute. If they do not, the cleaned expression file will not be written and expression analysis will not be possible.

Preparing list of reference genes for specific analysis

If you have a known list of genes in the reference organism whose copy number variation you want to analyze in the sample, the --specific_genes flag can be used.
This needs the full path to a simple .TXT file
This .TXT file should have one transcript name per line

Preparing the TXT file for GeMoMa annotation

In case, some of your input files lack structural annotation, the --to_annotate flag in the script can be used
This needs the full path to a simple .TXT file
The name of the organism to be annotated followed by comma and the name of the reference organism to be used for annotation must be mentioned in a line
If there are multiple organisms for annotation, specify each of them along with its respective reference on a separate line in the format specified above

eg.

Vamurensis,Vvinifera

Vrotundifolia,Vvinifera
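
A file in this format can be checked before a run with a short script. The sketch below is only an illustration of the Query,Reference layout; the function name is hypothetical and the tool's own validation may differ.

```python
import io


def read_annotation_pairs(handle):
    """Read 'Query,Reference' lines and return a list of (query, reference) tuples."""
    pairs = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(
                f"line {line_number}: expected 'Query,Reference', got {line!r}"
            )
        pairs.append((parts[0], parts[1]))
    return pairs


# In practice, pass an open file handle instead of this in-memory example.
example = io.StringIO("Vamurensis,Vvinifera\n\nVrotundifolia,Vvinifera\n")
pairs = read_annotation_pairs(example)
print(pairs)
```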

Preparing the BUSCO config file

DupyliCate by default uses the --auto-lineage-euk, --auto-lineage-prok to detect the BUSCO lineages of sample organisms. However in case you want to specify a lineage to choose, you can use the --busco_lineage flag and specify the path to this BUSCO config file.

This is a simple two column TXT file separated by tab
The first column is the sample organism name and the second column is the busco lineage you want to specify
Please ensure that this organism name and the input annotation, assembly file names of the sample organism without the extensions are matching.
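
The two-column layout can be generated programmatically, for example as below. The organism names are taken from the annotation example above, and the lineage name eudicots_odb10 is only an assumed illustration of a BUSCO lineage dataset name; check which naming form your BUSCO installation expects.

```python
import csv
import io

# Hypothetical mapping of sample organism base names to BUSCO lineages.
lineages = {
    "Vamurensis": "eudicots_odb10",
    "Vrotundifolia": "eudicots_odb10",
}

buffer = io.StringIO()  # replace with open("busco_config.txt", "w", newline="") to write a file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for organism, lineage in lineages.items():
    # Column 1: organism name (must match the input file base names); column 2: lineage.
    writer.writerow([organism, lineage])

config_text = buffer.getvalue()
print(config_text, end="")
```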

Helper scripts

Usage:

	python3 AGAT_wrapper.py

MANDATORY:

	--gff_in <full path to GFF3 file or folder containing GFF3 annotation files>

	--gff_out <full path to output folder or output file>

OPTIONAL:

	--agat <full path to the agat_convert_sp_gxf2gxf.pl script including the script name>

Usage:

	python3 Fasta_fix.py

MANDATORY:

	--in <full path to a folder containing FASTA files or a single FASTA file>

	--out <full path to output directory>

	--config <full path to config file including the config file name>

OPTIONAL:

	--cores <number of cores for running the script>

Config file preparation instructions for Fasta_fix.py:

The config file is a simple .TXT file
Specify the organism name i.e. the base name without the extension of your file in the first column
If the same set of string manipulation operations are to be performed for all the files in a folder, just specify the word all in the first column and follow it up in the next columns with the desired operations - This single line is enough if the same set of operations are to be performed on all the files
Specify the various string manipulation operations to be performed on the FASTA header of this particular organism's file in the subsequent columns separated by white space or tab
IMPORTANT: The operations will be performed in the same order as you specify them in the config wise i.e. column wise order of the different operations you specify will be followed by the script; Hence operation ORDER is IMPORTANT
Possible operations and the manner in which they are to be specified are as follows

i. extract: example- extract:ID=:;

ii. split: example- split:_

iii. take: example- take:1,3

iv. join: example- join:- <This will join a list of strings into a single string separated by - >

v. remove: example- remove:v2

vi. add_prefix: add_suffix: example: add_suffix:ath

vii. replace: <to replace a specific character or pattern with another character or pattern; specify the pattern to be replaced first and the pattern to be added in its place, separated by a comma> example- replace:phytozome,phyt

viii. uppercase

lowercase
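
To make the effect of operation order concrete, here is a small sketch of how such a chain of operations could transform a FASTA header. It is an interpretation for illustration only and not the code of Fasta_fix.py: the function name is hypothetical, take is assumed to use 1-based positions, and extract is left out because its exact semantics are not described here.

```python
def apply_operations(header, operations):
    """Apply config-style operations to a header, strictly in the given order."""
    value = header  # a string, or a list of strings after a split
    for operation in operations:
        name, _, arg = operation.partition(":")
        if name == "split":
            value = value.split(arg)
        elif name == "take":
            # Assumption: positions are 1-based, e.g. take:1,3 keeps the 1st and 3rd parts.
            value = [value[int(i) - 1] for i in arg.split(",")]
        elif name == "join":
            value = arg.join(value)
        elif name == "remove":
            value = value.replace(arg, "")
        elif name == "add_prefix":
            value = arg + value
        elif name == "add_suffix":
            value = value + arg
        elif name == "replace":
            old, new = arg.split(",", 1)
            value = value.replace(old, new)
        elif name == "uppercase":
            value = value.upper()
        elif name == "lowercase":
            value = value.lower()
        else:
            raise ValueError(f"unsupported operation in this sketch: {operation!r}")
    return value


# Hypothetical header and operation chains.
fixed = apply_operations("AT1G01010_1_v2", ["split:_", "take:1,2", "join:-", "add_suffix:ath"])
print(fixed)
shouted = apply_operations("phytozome_gene", ["replace:phytozome,phyt", "uppercase"])
print(shouted)
```

Swapping the order of operations changes the result (for example, join before split fails), which is why the column order in the config file matters.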

Description of output files

md5sum.tsv: Contains the md5sums of every input file used in a particular run of the tool
run_parameters.json: json file that keeps a record of all the parameters used for a particular run of the tool
Duplication landscape plots: Histogram plots of normalized bit scores of a gene's second best hit and self hit in self alignment of the sample; Depicts the genome level gene duplication status
Duplication frequency plots: Bar plots depicting the actual and relative percentage of genes in the different classification categories
BUSCO QC Summary file: TSV file with information about BUSCO completeness, duplication, Feff (pseudo-ploidy number - approximate indication of ploidy)
Tandem duplicates, Proximal duplicates, Dispersed duplicates: Folders containing organism-wise tandem, proximal and dispersed gene duplicates output TSV files
Mixed duplicates: Mixed duplicates folder containing organism-wise mixed duplicates output TSV files (output in strict mode)
Duplicates_relationships: Duplicates relationships folder containing organism-wise duplicates relationships across duplicate groups (output in overlap mode)
The different gene duplicates classification output files have a number of columns. In all the duplicates results files, both in the presence or absence of a reference organism, the following columns are present:

(i) Pairwise gene distance (bp) - Genetic distance in bp between two neighbouring genes in an identified duplicates group

(ii) Actual intervening gene number: The tool considers only protein coding genes for identifying and classifying gene duplicates. However, it is also important to know the actual number of protein coding and non-coding genes between gene duplicates, and the actual intervening gene number gives this information

(iii) Apparent intervening gene number: The number of protein coding genes between neighbouring genes in an identified duplicates group

(iv) Group confidence: The gene duplicates are clustered into groups. This column gives a confidence score for the reliability of the identified groups -

confidence_score <= 0.3 -> Low confidence gene duplicates group

0.3 < confidence_score <= 0.5 -> Moderate confidence gene duplicates group

confidence_score > 0.5 -> High confidence gene duplicates group

(v) Nature of duplicate group: This column is present only in the small scale duplicate groups - tandems and proximals output files in the presence of a reference organism;

Small scale duplicates are mainly responsible for the evolutionary innovations seen in stress response mechanisms and biosynthetic pathways;

Understanding their nature like whether they lead to gene expansion, conservation or de novo duplication can lead to crucial biological insights;

The nature of duplicate group provides this information as detailed in the image below

Singletons: Singletons folder containing organism-wise singleton gene output TSV files
In the presence of a reference organism, the small scale gene duplicates (tandems, proximals) files and the singletons files have a column called Synteny information

(vi) Synteny information: This column evaluates the synteny between the identified gene duplicates and their corresponding ortholog genes in the reference organism

Orthologs: Folder of ortholog files, one per sample organism. This folder contains output files only when a reference organism is specified. Each ortholog file within the folder is a TSV file with clear headers. Some parameter columns are explained here for a better understanding of the file output. The third column denotes the confidence score of the assigned ortholog. The individual ortholog scoring is defined as follows:
- 0.0625 - Low confidence ortholog
  - (Ortholog is determined via phylogeny of the forward hits. But it is not among the top forward hits, and all the forward hits are very similar, lowering the confidence greatly)
- 0.125 - Low confidence ortholog
  - (Ortholog is determined via phylogeny of the forward hits. All the forward hits are very similar, making phylogeny analysis results less confident)
- 0.25 - Low confidence ortholog
  - (Ortholog is determined via phylogeny of the forward hits. Top forward hit and the next best hit are not significantly different, lowering the confidence)
- 0.5 - Moderate confidence ortholog
  - (Ortholog is the top forward hit determined via synteny analysis. But the top hit is not significantly different from the next best hit, making it a moderately confident assignment)
- 1 - High confidence ortholog
  - (Ortholog is the top forward hit, and it is significantly different from the next best hit, making it a confident assignment)

The ortholog confidence is denoted in the last column. In case of low confidence orthologs, the fourth column has a list of other potential orthologs for the particular sample gene in the first column
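
For downstream filtering, the score values listed above can be mapped to their confidence levels with a simple lookup. The sketch below assumes rows of the form (sample gene, assigned ortholog, score); this row layout and the gene names are hypothetical, so check the headers of the actual ortholog files.

```python
# Score values and levels as listed in the ortholog scoring scheme above.
ORTHOLOG_CONFIDENCE = {
    0.0625: "low",
    0.125: "low",
    0.25: "low",
    0.5: "moderate",
    1.0: "high",
}


def ortholog_confidence(score):
    """Return the confidence level for one of the documented score values."""
    try:
        return ORTHOLOG_CONFIDENCE[float(score)]
    except KeyError:
        raise ValueError(f"score {score!r} is not one of the documented values") from None


# Hypothetical rows: (sample gene, assigned ortholog, score as read from a TSV).
rows = [
    ("sampleGene1", "refGeneA", "1"),
    ("sampleGene2", "refGeneB", "0.5"),
    ("sampleGene3", "refGeneC", "0.0625"),
]
high_confidence = [row for row in rows if ortholog_confidence(row[2]) == "high"]
print(high_confidence)
```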

Copy_number_table.tsv: This file is produced only in the presence of a reference organism and if the user has given specific reference genes for analysis using the --specific_genes flag;

Comparative genomics table depicting the specific user-given reference genes and their copy numbers in sample organisms;

These specific genes are orthologs of user specified genes in the reference organism whose copy number variation the user wants to know;

Every cell in each organism-wise column has the orthologs corresponding to specific reference organism genes in that sample organism;

Each such identified orthologous group has a score called Ortholog group confidence score (OGCS) appended to it to show the reliability of the identified orthologous group.

The OGCS scoring system is as follows:

OGCS <= 0.5 -> Low confidence orthologous group

0.5 < OGCS <= 0.8 -> Moderate confidence orthologous group

OGCS > 0.8 -> High confidence orthologous group

The ortholog confidence scoring is more stringent than the gene duplicates confidence scoring system because of the species specific variations and distance considerations between the reference and sample organisms
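
The two threshold schemes can be compared side by side with two small functions. This is a sketch that encodes only the cut-offs stated in this document (0.3 and 0.5 for duplicate groups, 0.5 and 0.8 for OGCS); the function names are hypothetical and how the scores themselves are computed is not covered.

```python
def group_confidence(score):
    """Confidence level of a gene duplicates group."""
    if score <= 0.3:
        return "low"
    if score <= 0.5:
        return "moderate"
    return "high"


def ogcs_confidence(score):
    """Confidence level of an orthologous group (OGCS), which is more stringent."""
    if score <= 0.5:
        return "low"
    if score <= 0.8:
        return "moderate"
    return "high"


# The same numeric score lands in different levels under the two schemes.
for value in (0.3, 0.5, 0.6, 0.9):
    print(value, group_confidence(value), ogcs_confidence(value))
```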

Copy_number_table.xlsx: This output file is also produced only in the presence of a reference organism and if the user has given specific reference genes for analysis using the --specific_genes flag;

It has the same content as Copy_number_table.tsv, except that cells without orthologs for a specific reference gene in the sample organism(s) are coloured red;

Being an xlsx file, this colour difference can be used as a quick presence-absence variation assessment of the user-specified genes for comparative genomics between the reference and the sample organism(s)

Summary.tsv: Summary file consolidating the number of different gene duplicates in each analysed sample organism; It also lists the number of gene duplicate groups classified as low/ moderate/ high confidence groups
Ka_Ks_analysis: Folder containing organism-wise Ka/Ks computation TSV files of gene duplicates on an individual gene basis along with statistical significance and nature of selection pressure
- Each file in this folder is a tsv file. Every individual gene in a gene duplicate group is assigned a Ka/Ks value in this file. This gene-level Ka/Ks value is obtained by a series of statistical analyses, and normalization approaches in the script
Ka/Ks plot: Distribution plot of statistically significant Ka/Ks values (adjusted pval<=0.05) of gene duplicates to assess the trend of selection pressures on gene duplicates; This is a filtered Ka/Ks plot that removes extreme Ka/Ks values
Duplicates_analysis: Folder containing organism wise gene expression output folders; Each organism folder shows Plots folder, Stats folder, the kernel density estimation plot of correlation coefficients used for determining the divergent expression threshold, and TXT file of perceived pseudogenes (genes that have low expression across the different RNASeq samples used for generating the counts table file) in that organism;

Note that some gene duplicate groups can have a statistics folder but no corresponding plots folder. This can happen when all or most genes in the group are perceived pseudogenes, which prevents their correlation analysis and gene expression plotting. The stats.tsv file for each duplicate group lists the gene pairs analyzed iteratively within the group, the Bonferroni-corrected p-value for every pair, the Spearman correlation value, and finally the strength of correlation or expression divergence.
Specific_duplicates_analysis: Folder similar to the Duplicates_analysis folder, except that it contains the respective output folders and files for specific gene duplicates
Specific_ref_free_duplicates_analysis: Folder similar to the Duplicates_analysis folder, except that it contains the respective output folders and files for the specified genes if they are present as duplicates in the sample organism
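
The per-pair columns of the stats.tsv files (Spearman correlation and Bonferroni-corrected p-value) can be illustrated with a minimal Python sketch. This is not DupyliCate's own code: the function names (`average_ranks`, `spearman_rho`, `pairwise_rho`, `bonferroni`) are hypothetical, and the raw p-values passed to `bonferroni` are assumed to come from whichever significance test the script applies.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return math.nan  # constant expression: correlation undefined
    return cov / (sx * sy)


def pairwise_rho(expression):
    """Correlate every gene pair in one duplicate group.

    `expression` maps a gene ID to its counts across samples.
    """
    return {
        (a, b): spearman_rho(expression[a], expression[b])
        for a, b in combinations(sorted(expression), 2)
    }


def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For example, `pairwise_rho({"g1": [1, 5, 9, 20], "g2": [2, 6, 11, 30], "g3": [9, 7, 3, 1]})` yields one correlation value for each of the three gene pairs in the group.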

Common error prone steps and some recommendations

The errors and warnings output by DupyliCate are stored in the Tmp folder created within the output folder you specify: /output_folder/Tmp/Errors_warnings
By default, if the run completes successfully, the Tmp folder is removed automatically. If you want to retain the intermediate files, specify no for the --clean_up flag
Please remember to add / at the end of the folder path when giving a folder of input files to DupyliCate
Please ensure that the files have the allowed extensions as listed above in the main script usage
The base names of the files excluding the extensions must be the same across all input files, the config files, and the annotation TXT file for GeMoMa annotation
Gene duplication classification needs the GFF file(s) as input. Since GFF files come in widely varying formats, the analysis can be interrupted at the validation step due to GFF file formatting issues
Some GFF files have mRNA features denoted as mRNA, while some have them denoted as transcript. In cases of merged annotation files, both the terms can denote an mRNA feature. But the script looks for the keyword mRNA first, and if not found, then searches for the word transcript. In case you have a merged annotation, please make the mRNA feature indicator uniform throughout the GFF file - either mRNA throughout or transcript throughout
Next, the FASTA headers need to match specific GFF attribute fields. Otherwise, the analysis would stop after the validation and PEP file generation step
Please look into the GFF config file preparation instructions to avoid such errors during analysis
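
Several of the recommendations above can be checked before starting a run. The following is a minimal Python sketch of such a pre-flight check and not part of DupyliCate: the helper names (`with_trailing_slash`, `unmatched_base_names`, `mrna_labels`) are hypothetical, and the base name is assumed to be the file name with only its last extension removed.

```python
from pathlib import Path


def with_trailing_slash(folder):
    """Return the folder path with a trailing / appended if it is missing."""
    return folder if folder.endswith("/") else folder + "/"


def unmatched_base_names(*file_groups):
    """Return base names that are not present in every group of files.

    Each group is a list of paths, e.g. all GFF files or all FASTA files.
    Assumption: base name = file name without its last extension.
    """
    stems = [{Path(p).stem for p in group} for group in file_groups]
    if not stems:
        return []
    return sorted(set.union(*stems) - set.intersection(*stems))


def mrna_labels(gff_lines):
    """Return which of 'mRNA' and 'transcript' occur in GFF column 3.

    A result containing both labels indicates a merged annotation whose
    mRNA feature indicator should be made uniform before the run.
    """
    labels = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3 and columns[2] in ("mRNA", "transcript"):
            labels.add(columns[2])
    return labels
```

To check a file on disk, pass an open file handle to `mrna_labels`, for example `with open(path) as handle: labels = mrna_labels(handle)`.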

The script is designed so that it can resume from the point at which an analysis was stopped or interrupted. However, truncated or empty output files from previous steps are not detected and may cause errors downstream. It is therefore safest to remove the output files of the interrupted step while retaining the files generated in the earlier steps, which allows a seamless run despite interruptions. If you want to rerun the script with expression analysis, it is recommended to delete the expression analysis result folders of the previous run, since the expression analysis uses a unique naming scheme that could reproduce the same result folders with a different group number appended to the folder name
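
Before resuming an interrupted run, empty output files can be located with a short Python sketch such as the one below. The function name `empty_output_files` is hypothetical and the check is not part of DupyliCate. It only finds zero-byte files; truncated files that still contain some data need to be inspected manually.

```python
from pathlib import Path


def empty_output_files(output_folder):
    """List zero-byte files below output_folder, relative to that folder."""
    root = Path(output_folder)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )
```

Any file reported here belongs to a step whose outputs are best deleted before the rerun.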

Third party tool references and dataset references

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
Buchfink B, Reuter K, Drost HG, "Sensitive protein alignments at tree-of-life scale using DIAMOND", Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x
Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).
Price, M.N., Dehal, P.S., and Arkin, A.P. (2010) FastTree 2 -- Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3):e9490. doi:10.1371/journal.pone.0009490
Kazutaka Katoh, Daron M. Standley, MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability, Molecular Biology and Evolution, Volume 30, Issue 4, April 2013, Pages 772–780, https://doi.org/10.1093/molbev/mst010
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics, 2018. doi: 10.1186/s12859-018-2203-5
Dainat J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. (Version v1.5.1). Zenodo. https://www.doi.org/10.5281/zenodo.3552717
Fredrik Tegenfeldt, Dmitry Kuznetsov, Mosè Manni, Matthew Berkeley, Evgeny M Zdobnov, Evgenia V Kriventseva, OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes, Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D516–D522, https://doi.org/10.1093/nar/gkae987
Jeet Sukumaran, Mark T. Holder, DendroPy: a Python library for phylogenetic computing, Bioinformatics, Volume 26, Issue 12, June 2010, Pages 1569–1571, https://doi.org/10.1093/bioinformatics/btq228
Cheng, Chia-Yi, Vivek Krishnakumar, Agnes P. Chan, Françoise Thibaud-Nissen, Seth Schobel, and Christopher D. Town. 2017. “Araport11: A Complete Reannotation of the Arabidopsis Thaliana Reference Genome.” The Plant Journal 89(4):789–804. doi:10.1111/tpj.13415.
Pucker B, Holtgräwe D, Stadermann KB, Frey K, Huettel B, Reinhardt R, et al. (2019) A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set. PLoS ONE 14(5): e0216233. https://doi.org/10.1371/journal.pone.0216233

When using DupyliCate in your research please cite

Natarajan, S. & Pucker, B. DupyliCate: mining, classifying, and characterizing gene duplications. 2025.10.10.681656 Preprint at https://doi.org/10.1101/2025.10.10.681656 (2025).

Logo credits: The logo for DupyliCate was designed by Marie Hagedorn.
