--chromid a (as-is) behavior is more like "match query" than "as-is" and causes problems under certain conditions #74

The `--chromid` option currently allows the choices `a`, `s`, and `l`. The latter two are fairly straightforward, though they assume that a sequence ID is always a number (or some other chromosome ID, like X or Y), possibly prefixed by "chr". The `a` choice is named "as-is", which implies that the input (target) sequence IDs will be left alone when written to the output. In practice, the output ID is based on the format of the query ID, where format means prefixed by "chr" (long) or not (short). The name also implies a whole-file decision, but the decision is actually made record by record, so it depends on each pairing of query and target in the chain file. This produces unexpected behavior when the query and/or target sequence IDs use mixed formats and/or formats other than the expected long/short. This matters because modern assemblies often have varying sequence IDs depending on the assembler and subsequent steps. Even fairly benign names can cause a problem.

Consider a target genome where all names look like `chr1`, `chr2`, etc. and a query genome where some names look like `chr1`, `chr2`, etc. and others look like `unassigned-0000006` or `utig4-1234567`. Suppose that query `chr1` maps to target `chr1` in the chain file and query `unassigned-0000006` maps to target `chr9`. The output would have target `chr1` for the query `chr1` inputs and target `9` for the query `unassigned-0000006` inputs. This mixes short and long sequence IDs in the output, which is not ideal. A person could use `--chromid l` or `--chromid s` to force one or the other, but this doesn't work well when the target genome has sequence IDs that are neither short nor long.
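To make the mixing concrete, here is a minimal sketch of the per-record decision as I understand it. `as_is_style` is a hypothetical stand-in written for this issue, not the actual code of `update_chromID`, and the ID pairs are the made-up ones from the example.

```python
def as_is_style(query_id: str, target_id: str) -> str:
    """Hypothetical model of `--chromid a`: copy the query's 'chr' style onto the target."""
    if query_id.startswith("chr"):
        # Query is "long", so the target is made long as well.
        return target_id if target_id.startswith("chr") else "chr" + target_id
    # Query is not "long", so a leading "chr" is stripped from the target.
    return target_id[3:] if target_id.startswith("chr") else target_id


# (query, target) pairs as they might appear in a chain file (illustrative only).
pairs = [("chr1", "chr1"), ("unassigned-0000006", "chr9")]
written = [as_is_style(q, t) for q, t in pairs]
print(written)  # ['chr1', '9']
```

Both targets come from the same all-long genome, yet the output contains one long and one short ID because each record follows its own query.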

Now consider these names swapped. Suppose the query has only names like `chr1`, `chr2`, etc., while the target has some names like `chr1`, `chr2`, etc. and others like `unassigned-0000006`. We wouldn't want the output sequence ID to become `chrunassigned-0000006`, whether from using the default "as-is" or from using the "long" option. Similarly, one may not wish to drop the "chr" from `chr1`, `chr2`, etc. when using the "short" option.
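The same kind of sketch shows why forcing a style does not help here. `force_style` is again a hypothetical model of the "long" and "short" choices as described in this issue, not the project's implementation.

```python
def force_style(target_id: str, style: str) -> str:
    """Hypothetical model of `--chromid l` ("long") and `--chromid s` ("short")."""
    has_prefix = target_id.startswith("chr")
    if style == "l":
        # Long: add "chr" whenever it is missing, whatever the ID looks like.
        return target_id if has_prefix else "chr" + target_id
    if style == "s":
        # Short: strip a leading "chr" whenever it is present.
        return target_id[3:] if has_prefix else target_id
    raise ValueError(f"unsupported style: {style!r}")


# Target IDs from the swapped example (illustrative only).
targets = ["chr1", "unassigned-0000006"]
long_ids = [force_style(t, "l") for t in targets]
short_ids = [force_style(t, "s") for t in targets]
print(long_ids)   # ['chr1', 'chrunassigned-0000006']
print(short_ids)  # ['1', 'unassigned-0000006']
```

Neither result reproduces the target's own names: "long" invents `chrunassigned-0000006` and "short" turns `chr1` into `1`.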

Accordingly, I think adding a new option would be helpful. This new option would make no changes to target sequence IDs when writing output. Ideally, "as-is" would follow this behavior, and a new option, possibly `m` for "match query", would take over the current behavior of "as-is". The truly "as-is" behavior would be an ideal default. I understand if you don't want to change the default, though, and I'd guess the current behavior works well for you and many others, so an alternative is to simply add something like `n` for "no change". This behavior would effectively bypass the `update_chromID` function from `src/cmmodule/utils.py`.
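A rough sketch of what I have in mind, under the assumptions above. `choose_target_id` is a hypothetical helper, and `n` and `m` are the proposed choices, not existing ones. It does not reflect how `update_chromID` is actually written or called.

```python
def choose_target_id(query_id: str, target_id: str, style: str = "n") -> str:
    """Sketch of the proposed choices: 'n' (no change), 'm' (match query), 'l', 's'."""
    if style == "n":
        # Proposed "no change": write the target ID exactly as the chain file has it.
        return target_id
    has_prefix = target_id.startswith("chr")
    if style == "m":
        # Proposed "match query": what the current "as-is" appears to do.
        want_prefix = query_id.startswith("chr")
    elif style == "l":
        want_prefix = True
    elif style == "s":
        want_prefix = False
    else:
        raise ValueError(f"unsupported style: {style!r}")
    if want_prefix and not has_prefix:
        return "chr" + target_id
    if has_prefix and not want_prefix:
        return target_id[3:]
    return target_id


# (query, target) pairs combining both examples (illustrative only).
pairs = [
    ("chr1", "chr1"),
    ("unassigned-0000006", "chr9"),
    ("chr2", "unassigned-0000006"),
]
unchanged = [choose_target_id(q, t, "n") for q, t in pairs]
matched = [choose_target_id(q, t, "m") for q, t in pairs]
print(unchanged)  # ['chr1', 'chr9', 'unassigned-0000006']
print(matched)    # ['chr1', '9', 'chrunassigned-0000006']
```

With `n`, every target ID is written exactly as given, which is what I would expect from something called "as-is".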
