Description
1. What were you trying to do?
I have 3 genome sequences, each with 12 chromosomes. An MSA was constructed for each chromosome sequence triplet and fed to vg construct. The 12 resulting .vg files were merged with vg combine and converted to a single GFA file with vg convert.
I wanted to create giraffe indexes which included the haplotypes from the gfa paths.
vg gbwt --path-regex "(.*)\.(chr.*)" --path-fields "SC" -G graph.gfa -g graph.gg -o graph.gbwt
My gfa paths are formatted "sampleX.chrXX", so the above command seemed appropriate for parsing the sample and contig metadata from the path names.
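To make the intended parse concrete, here is a minimal Python sketch (not vg code) that applies the same regex to path names of the form sampleX.chrXX. The path names are generated for illustration and `split_path_name` is a hypothetical helper. The sketch only shows how a regex engine numbers the submatches: index 0 is the whole match, and the capture groups start at index 1. How vg assigns the letters of --path-fields to these submatches is not modelled here.

```python
import re

# Same pattern as passed to --path-regex.
PATH_REGEX = re.compile(r"(.*)\.(chr.*)")


def split_path_name(name):
    """Return [whole match, group 1, group 2] for one path name."""
    match = PATH_REGEX.fullmatch(name)
    if match is None:
        raise ValueError(f"path name does not match the regex: {name}")
    # Submatch 0 is the entire match; capture groups are numbered from 1.
    return [match.group(i) for i in range(match.re.groups + 1)]


# Illustrative path names: 3 samples x 12 chromosomes = 36 paths.
names = [f"sample{s}.chr{c:02d}" for s in range(1, 4) for c in range(1, 13)]
parsed = [split_path_name(n) for n in names]

whole = {p[0] for p in parsed}            # distinct whole matches
samples = sorted({p[1] for p in parsed})  # distinct values of group 1
contigs = sorted({p[2] for p in parsed})  # distinct values of group 2

print(len(whole), len(samples), len(contigs))
```

The three counts correspond to the number of distinct whole path names, distinct values of the first capture group, and distinct values of the second capture group. If a tool counted the whole match as its first field (an assumption, not something confirmed here), the first two of those counts are what would be reported for a two-letter field string.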
Inspecting the .gbwt file reveals...
vg gbwt -M graph.gbwt
36 paths with names, 36 samples with names, 36 haplotypes, 3 contigs with names
vg gbwt -SL graph.gbwt
sample1.chr01
sample1.chr02
...
sample3.chr12
vg gbwt -CL graph.gbwt
sample1
sample2
sample3
vg gbwt -HL graph.gbwt
36
2. What did you want to happen?
I may be misinterpreting the results; I couldn't find much documentation for these options. vg gbwt -M reporting 36 paths seems reasonable, but I was expecting 3 samples and 12 contigs. Given --path-regex "(.*)\.(chr.*)" --path-fields "SC", I'd expect the samples (S) to be sample1-sample3 and the contigs (C) to be chr01-chr12, parsed from the sampleX.chrXX path names.
Is the output reasonable, and am I just misunderstanding the contig and sample labels? Do I need to include a haplotype ID in the path names, or will it default to a valid value if omitted?
3. What actually happened?
see above
4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:
n/a
5. What data and command can the vg dev team use to make the problem happen?
input data is too large to post
6. What does running vg version say?
vg version v1.33.0 "Moscona"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon