QUAST: add ANI column #3091
Thanks for the suggestion! Having this metric sounds reasonable, but shouldn't it rather be part of QUAST itself? Their team is very responsive, if you want to take a stab at a pull request there. In MultiQC, we could then parse it from the QUAST report, and keep your code here to calculate it for older versions where it's missing.
I coordinated with the QUAST author to add this metric - but it might take some time, so feel free to file a PR there! And thanks for this PR here - happy to merge it now.
Thanks @vladsavelyev for your feedback, and for accepting this proposal despite its limitations! You are absolutely right that it would be better to upstream this and report it from QUAST. I opened a PR there (ablab/quast#279) and am happy for you to revert this if we can get that accepted.
When comparing a genome against a reference with QUAST, it is helpful to know how similar that genome is to the reference. QUAST reports this in the "# mismatches per 100 kbp" and "# indels per 100 kbp" fields. Here we combine these two fields to calculate an average nucleotide identity (ANI) for a new column, which is simpler and more interpretable.
I acknowledge that sequence identity is complicated and has different meanings (https://lh3.github.io/2018/11/25/on-the-definition-of-sequence-identity). However, the choice is made easier by the outputs available from QUAST. Here I implement the gap-compressed identity, which the author of the above blog post finds most compelling. Although it differs slightly from the BLAST ANI definition used in the literature for prokaryotic species definition (which is not gap-compressed), I think it is sufficient for most applications. The column description also explicitly states that it is gap-compressed.
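For illustration, here is a minimal sketch of the calculation described above. It is not the exact code in this PR: the function name `ani_from_quast` is hypothetical, and it assumes that QUAST counts each indel as a single event and that 100 kbp of aligned sequence can serve as the denominator, so the result is an approximation of the gap-compressed identity.

```python
def ani_from_quast(mismatches_per_100kbp: float, indels_per_100kbp: float) -> float:
    """Approximate gap-compressed identity, in percent.

    Hypothetical helper for illustration only. Assumes both inputs are
    rates per 100,000 aligned bases and that every indel counts once,
    regardless of its length.
    """
    differences = mismatches_per_100kbp + indels_per_100kbp
    return 100.0 * (1.0 - differences / 100_000)


# Example row using the two QUAST field names mentioned above (values are made up).
row = {"# mismatches per 100 kbp": 250.0, "# indels per 100 kbp": 50.0}
ani = ani_from_quast(row["# mismatches per 100 kbp"], row["# indels per 100 kbp"])
print(f"ANI: {ani:.2f}%")
```

With these made-up values, 300 differences per 100 kbp correspond to an identity of about 99.7%.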
Thanks for your consideration!