Description
The --chromid option currently allows the choices a, s, and l. The latter two are fairly straightforward, though they assume that the only naming options are a number (or some other chromosome ID, like X or Y), possibly prefixed by "chr". The a choice is named "as-is", which implies that the input (target) sequence IDs will be left alone when written to the output. In practice, the output format is based on the format of the query ID, where format means either prefixed by "chr" (long) or not (short). The name also implies a whole-file decision, but the decision is actually made record by record, so it depends on each pairing of query and target in the chain file. This produces unexpected behavior when the query and/or target sequence IDs use mixed formats and/or formats other than the expected long/short. This matters because modern assemblies often have varying sequence IDs depending on the assembler and subsequent steps. Even fairly benign names can cause a problem.
Consider a target genome in which all names look like chr1, chr2, etc., and a query genome in which some names look like chr1, chr2, etc. and others look like unassigned-0000006 or utig4-1234567. Suppose that query chr1 maps to target chr1 in the chain file and query unassigned-0000006 maps to target chr9. The output would have target chr1 for the query chr1 inputs and target 9 for the query unassigned-0000006 inputs. This mixes the short and long types of sequence IDs in the output, which is not ideal. A person could use --chromid l or --chromid s to force one or the other, but this doesn't work well when the target genome has sequence IDs that are neither short nor long.
Now consider the case where these names are swapped. Suppose the query has all names like chr1, chr2, etc., and the target has some names like chr1, chr2, etc. and other names like unassigned-0000006. We wouldn't want the output sequence ID to become chrunassigned-0000006, whether from using the default "as-is" or from using the "long" option. Similarly, one may not wish to drop the "chr" from chr1, chr2, etc. when using the "short" option.
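To make the two scenarios above concrete, here is a minimal sketch of the behavior as I understand it. It is not the actual code from update_chromID; the function name restyle_target_id and its signature are hypothetical, and it only models the "chr"-prefix logic described in this issue.

```python
def restyle_target_id(query_id: str, target_id: str, style: str = "a") -> str:
    """Hypothetical model of the per-record renaming described in this issue.

    style "a": copy the query ID's style ("chr" prefix or not) onto the target
    style "l": always force a "chr" prefix on the target
    style "s": always strip a leading "chr" from the target
    """
    if style == "a":
        # "as-is" follows the query's format, not the target's
        want_prefix = query_id.startswith("chr")
    elif style == "l":
        want_prefix = True
    elif style == "s":
        want_prefix = False
    else:
        raise ValueError(f"unknown style: {style!r}")

    has_prefix = target_id.startswith("chr")
    if want_prefix and not has_prefix:
        return "chr" + target_id
    if not want_prefix and has_prefix:
        return target_id[len("chr"):]
    return target_id


# First scenario: target is all "chr"-style, query is mixed
print(restyle_target_id("chr1", "chr1"))                # chr1
print(restyle_target_id("unassigned-0000006", "chr9"))  # 9

# Second scenario: query is all "chr"-style, target is mixed
print(restyle_target_id("chr1", "unassigned-0000006"))  # chrunassigned-0000006
```

Because the decision is made per record, a single output file can end up with both chr1 and 9, depending on which query sequence each record came from.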
Accordingly, I think adding a new option would be helpful. This new option would make no changes to target sequence IDs when writing output. Ideally, "as-is" would follow this behavior, and a new option, possibly m for "match query", would take over the current behavior of "as-is". The truly "as-is" behavior would be an ideal default. It's understandable if you don't want to change the default behavior, though, and I could guess that the current behavior works well for you and many others, so an alternative is to simply add something like n for "no change". This behavior would effectively bypass the update_chromID function from src/cmmodule/utils.py.
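As a rough sketch of the simpler alternative (adding n without touching the existing choices), something like the following is what I have in mind. The function name format_output_id is hypothetical and this does not reflect how update_chromID is actually written; it only shows n returning the target ID untouched while a, l, and s keep the behavior described above.

```python
def format_output_id(query_id: str, target_id: str, style: str = "a") -> str:
    """Hypothetical sketch: existing styles plus a proposed "n" (no change)."""
    if style == "n":
        # Proposed: write the target ID exactly as it appears in the chain file
        return target_id

    if style == "a":
        # Current "as-is": follow the query ID's format
        want_prefix = query_id.startswith("chr")
    elif style == "l":
        want_prefix = True
    elif style == "s":
        want_prefix = False
    else:
        raise ValueError(f"unknown style: {style!r}")

    has_prefix = target_id.startswith("chr")
    if want_prefix and not has_prefix:
        return "chr" + target_id
    if not want_prefix and has_prefix:
        return target_id[len("chr"):]
    return target_id


# With "n", both problem cases from above keep the target's own naming
print(format_output_id("unassigned-0000006", "chr9", "n"))  # chr9
print(format_output_id("chr1", "unassigned-0000006", "n"))  # unassigned-0000006
```

If the default were changed instead, the same early return would sit under a and the query-matching branch would move to m.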