vg gbwt metadata parsing

**1. What were you trying to do?**
I have 3 genome sequences, each with 12 chromosomes. I built an MSA for each chromosome (one sequence per genome) and fed each MSA to `vg construct`. The 12 resulting .vg files were merged with `vg combine` and converted to a single GFA file with `vg convert`.

I wanted to create Giraffe indexes that include the haplotypes from the GFA paths.

`vg gbwt --path-regex "(.*)\.(chr.*)" --path-fields "SC" -G graph.gfa -g graph.gg -o graph.gbwt`

My GFA paths are named `sampleX.chrXX`, so the above command seemed appropriate for parsing the sample and contig metadata from the path names.

Inspecting the .gbwt file reveals...

```
vg gbwt -M graph.gbwt 
36 paths with names, 36 samples with names, 36 haplotypes, 3 contigs with names
```
```
vg gbwt -SL graph.gbwt
sample1.chr01
sample1.chr02
...
sample3.chr12
```
```
vg gbwt -CL graph.gbwt
sample1
sample2
sample3
```
```
vg gbwt -HL graph.gbwt
36
```

**2. What did you want to happen?**
I may be misinterpreting the results, as I couldn't find much documentation for these options. `vg gbwt -M` reporting 36 paths seems reasonable, but I was expecting 3 samples and 12 contigs rather than 36 samples and 3 contigs. Given `--path-regex "(.*)\.(chr.*)" --path-fields "SC"`, I'd expect the samples (S) to be sample1-sample3 and the contigs (C) to be chr01-chr12, parsed from the `sampleX.chrXX` path names.
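To make the expectation concrete, here is a rough Python sketch of the parse I had in mind. It is only an illustration: Python's `re` module stands in for whatever regex engine vg uses, and the path names are synthetic stand-ins generated from the `sampleX.chrXX` pattern rather than read from my GFA.

```python
import re

# Same pattern as passed to --path-regex. Python's `re` is an assumption here;
# it is not necessarily how vg evaluates the expression.
PATH_REGEX = re.compile(r"(.*)\.(chr.*)")

# Synthetic path names following the sampleX.chrXX pattern
# (3 samples x 12 chromosomes = 36 paths).
path_names = [f"sample{s}.chr{c:02d}" for s in range(1, 4) for c in range(1, 13)]

samples, contigs = set(), set()
for name in path_names:
    match = PATH_REGEX.fullmatch(name)
    # First capture group -> sample, second capture group -> contig.
    samples.add(match.group(1))
    contigs.add(match.group(2))

print(len(path_names), len(samples), len(contigs))
```

Under these assumptions, the two capture groups give 3 distinct samples and 12 distinct contigs across the 36 paths, which is what I expected `vg gbwt -M` to report.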

Is the output reasonable, and am I just misunderstanding the contig and sample labels? Do I need to include a haplotype ID in the path names, or will it default to a valid value if omitted?

**3. What actually happened?**
See above.

**4. If you got a line like `Stack trace path: /somewhere/on/your/computer/stacktrace.txt`, please copy-paste the contents of that file here:**

```
n/a
```

**5. What data and command can the vg dev team use to make the problem happen?**
The input data is too large to post.

**6. What does running `vg version` say?**

```
vg version v1.33.0 "Moscona"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon
```

