Rename contigs #104
erikrikarddaniel
left a comment
A few things.
Didn't we decide to only rename the genomes that contain duplicates? I don't see that. I commented along these lines on the param, but maybe it makes sense to run renaming only when the param is set, and then only for genomes containing duplicates? In that case, I think we could set the default for the param to true.
Co-authored-by: Daniel Lundin <[email protected]>
erikrikarddaniel
left a comment
You're not there yet, I think. I have plenty of comments in the code, but to summarise:
- I think the easiest output from the check module is a channel that is a list of fasta files, not a file containing the fasta file names. If you think it's valuable to have a file for the user, then create the file, `cat` it and output both from the module.
- I see no reason to have the outputs from this module optional.
- I see no reason not to use the standard `${prefix}.suffix` pattern in the module.

It feels this is only halfway however, since you actually send all contig files to renaming.
So I changed the logic as we discussed. CHECK_DUPLICATES is now always active; if it finds duplicate contigs, it renames only the affected files and then merges them back with the rest of the samples. This logic applies only to local genomes, not the ones downloaded from NCBI (it shouldn't be an issue there).
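The "rename only the affected files, then remerge" flow described above can be sketched in plain shell. This is only an illustration under stated assumptions, not the PR's implementation: the toy genomes, the `genomes_with_duplicates.txt` list, the `renamed/` directory and the choice to prefix headers with the genome file name are all hypothetical, and `sed` stands in for whatever tool the pipeline uses for renaming.

```shell
set -eu

# Hypothetical scratch directory with three toy genomes.
workdir="$(mktemp -d)"
cd "$workdir"
printf '>contig_1\nACGT\n' | gzip > a.fna.gz
printf '>contig_1\nTTAA\n' | gzip > b.fna.gz
printf '>contig_7\nACGT\n' | gzip > c.fna.gz

# Assumed output of the duplicate check: one flagged genome file per line.
printf 'a.fna.gz\nb.fna.gz\n' > genomes_with_duplicates.txt

mkdir renamed
for f in *.fna.gz; do
    if grep -q -x -F "$f" genomes_with_duplicates.txt; then
        # Flagged genome: prefix every header with the genome name.
        # (Assumes the name contains no characters special to sed.)
        genome="${f%.fna.gz}"
        gzip -dc "$f" | sed "s/^>/>${genome}_/" | gzip > "renamed/$f"
    else
        # Unflagged genome: pass through unchanged.
        cp "$f" "renamed/$f"
    fi
done
```

After the loop, `renamed/` holds the full set again: flagged genomes with rewritten headers and the rest untouched, which is the shape the downstream steps would consume.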
erikrikarddaniel
left a comment
Starting to look good! I assume you tested this properly. It's a little difficult for me to check that it works only by reading the code.
modules/local/check_duplicates.nf
Outdated
zgrep -H '>' *.fna.gz | sed 's/^[^:]*://' | sort | uniq -d > temp_dupes.txt
zgrep -l -F -f temp_dupes.txt *.fna.gz | sort -u > duplicate_contig_names.txt || touch duplicate_contig_names.txt
rm temp_dupes.txt
cat duplicate_contig_names.txt
A couple of things I don't understand here:
- Why do you bother to write a file? You're not outputting that above. Maybe it's good to do though.
- You're outputting the names of genomes with duplicates rather than the duplicate contigs. I think this is actually what you do with `-l` to grep, so it's the file name that's misleading.
- Why don't you use the normal pattern of using the `$prefix`? It doesn't hurt even if we don't think this will be called multiple times.

In summary: I'd write to "${prefix}.genomes_with_duplicates.txt" and make sure that that's declared in the output section.
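A minimal sketch of what that suggestion could look like as a script body, assuming a hypothetical `prefix` value and toy input files. It uses `gzip -dc` with `grep` rather than `zgrep` purely for portability, and like the reviewed snippet it compares whole header lines, so two headers that share an ID but differ in description would not be flagged.

```shell
set -eu

prefix="demo"   # hypothetical; in a module this would come from the task's prefix

# Hypothetical scratch directory: a and b share ">contig_1", c is unique.
workdir="$(mktemp -d)"
cd "$workdir"
printf '>contig_1\nACGT\n>contig_2\nGGCC\n' | gzip > a.fna.gz
printf '>contig_1\nTTAA\n>contig_9\nCCGG\n' | gzip > b.fna.gz
printf '>contig_7\nACGT\n' | gzip > c.fna.gz

# 1. Collect header lines from all genomes; keep those seen more than once.
for f in *.fna.gz; do
    gzip -dc "$f" | grep '^>' || true
done | sort | uniq -d > dupes.txt

# 2. Write the names of genomes that contain at least one duplicated header.
#    The file is always created, so the output need not be optional.
: > "${prefix}.genomes_with_duplicates.txt"
if [ -s dupes.txt ]; then
    for f in *.fna.gz; do
        if gzip -dc "$f" | grep -x -F -f dupes.txt > /dev/null; then
            echo "$f"
        fi
    done | sort -u > "${prefix}.genomes_with_duplicates.txt"
fi
rm dupes.txt
```

Naming the file after its content (genomes, not contigs) and always creating it keeps the output declaration simple: an empty file just means no genome needs renaming.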
"${task.process}":
    seqkit: \$(seqkit version | sed 's/seqkit v//' | sed 's/ Build.*//')
END_VERSIONS
"""
I would add a stub.
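For illustration, the shell body of such a stub might look roughly like the sketch below: it only creates placeholder outputs and a `versions.yml`, without doing any real work. The `prefix` and `process_name` values are hypothetical stand-ins for what Nextflow would interpolate, the output file name follows the suggestion made earlier in this review, and the `unknown` fallback is an assumption added so the sketch runs where seqkit is not installed.

```shell
set -eu

prefix="demo"                    # hypothetical stand-in for the module's prefix
process_name="CHECK_DUPLICATES"  # hypothetical stand-in for ${task.process}

workdir="$(mktemp -d)"
cd "$workdir"

# Placeholder for the declared output; a stub does no real checking.
touch "${prefix}.genomes_with_duplicates.txt"

# Same version parsing as the real script; empty if seqkit is unavailable.
seqkit_version="$(seqkit version 2>/dev/null | sed 's/seqkit v//' | sed 's/ Build.*//' || true)"

cat > versions.yml <<END_VERSIONS
"${process_name}":
    seqkit: ${seqkit_version:-unknown}
END_VERSIONS
```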
I created a file to test it. Maybe I should add it to the test-dataset folder?
We should have this in the automatic pipeline tests, with proper checks for content. If that means we need data in the test-data repo (I assume that's what you mean), we should. I don't know what your test data looks like, but to me it would be easy to just have an alternative genome sheet with the same genome but having different ids (and filenames since you're matching on that, right?).
erikrikarddaniel
left a comment
👍
…into rename-contigs
PR checklist
- … (`nf-core lint`).
- … (`nextflow run . -profile test,docker`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).

I made a PR to solve the issue #97