Special characters in taxonomy strings cause false duplicate detection in `lefserClades`?

Hi,

We (together with @CTMARGREITER) are applying `lefserClades` and are running into trouble with the taxonomy path strings. Although all row names (taxonomy path strings) in the assay are unique, `lefserClades` reports duplicates.

It might boil down to some "problematic" tip names in the taxonomy strings. Specifically, we have genera at the lowest taxonomy level whose names contain characters such as hyphens, periods, and brackets, and these seem to be detected as duplicates.

## Example

Here's an example:

### Example `SummarizedExperiment`

```r
pathStrings <- c(
  "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Burkholderiales|f__Burkholderiaceae|g__Burkholderia-Caballeronia-Paraburkholderia",  # problematic
  "k__Bacteria|p__Verrucomicrobiota|c__Verrucomicrobiia|o__Pedosphaerales|f__Pedosphaeraceae|g__ADurb.Bin063_1",  # problematic
  "k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Caudovirales|f__Siphoviridae|g__Siphoviridae_noname" 
)

assay_data <- matrix(
  c(
    542, 1, 3250, 1,
    40, 0, 425, 2,
    150, 2, 2350, 1
  ),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(
    pathStrings,
    1:4
  )
)

se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = assay_data),
    colData = data.frame(
        Group = c("Group1", "Group2", "Group1", "Group2")
    )
)
```

### Applying `lefserClades`

```r
relative_abundance <- lefser::relativeAb(se)
relative_abundance <- lefser::rowNames2RowData(relative_abundance)

rescl <- lefser::lefserClades(
  relative_abundance, assay=1L, classCol = "Group",
  kruskal.threshold = 1,
  wilcox.threshold = 1,
  lda.threshold = 0
)
```

Running `lefserClades` drops two of the three features because their tip names are reported as duplicated:

```
Dropped 2 features with duplicated tip names.
lefser will be run at the kingdom, phylum, class, order, family, genus level.

>>>> Running lefser at the kingdom level. <<<<
The outcome variable is specified as 'Group' and the reference category is 'Group1'.
 See `?factor` or `?relevel` to change the reference category.

...
```
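To narrow down which features are affected, a quick check along the following lines can help. It is only a sketch in base R: it repeats the path strings from the example so it runs on its own, and the variable names `tips` and `hasSpecial` are ours, not part of `lefser`.

```r
# Same path strings as in the example above, repeated so this snippet runs standalone
pathStrings <- c(
  "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Burkholderiales|f__Burkholderiaceae|g__Burkholderia-Caballeronia-Paraburkholderia",
  "k__Bacteria|p__Verrucomicrobiota|c__Verrucomicrobiia|o__Pedosphaerales|f__Pedosphaeraceae|g__ADurb.Bin063_1",
  "k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Caudovirales|f__Siphoviridae|g__Siphoviridae_noname"
)

# The full path strings are unique (anyDuplicated() returns 0 if there are no duplicates)
anyDuplicated(pathStrings)

# Tip = last "|"-separated element of each path
tips <- vapply(
  strsplit(pathStrings, "|", fixed = TRUE),
  function(x) x[length(x)],
  character(1)
)

# Flag tips containing anything other than letters, digits, or underscores
hasSpecial <- grepl("[^A-Za-z0-9_]", tips)
data.frame(tip = tips, hasSpecial = hasSpecial)
```

In this example, the two tips flagged by `hasSpecial` are the two marked as problematic above.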

## Cause?

It might be due to the regular expression used in `.dropFeatures`:

https://github.com/waldronlab/lefser/blob/0ac110dc526796d7b609532722b0e14452c0b838/R/lefserClades.R#L204

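Extracting the tips from the `pathStrings` above with this pattern returns an empty string for both problematic tips. Our guess is that this happens because `\w` matches only letters, digits, and underscores, so the match cannot span the hyphen or the period. The two empty strings are identical, which would explain why they are treated as duplicates: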

```r
print(stringr::str_extract(pathStrings, "(|\\w__\\w+)?$"))
#> [1] ""                       ""                       "g__Siphoviridae_noname"
```
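For comparison, here is a minimal sketch of a tip extraction that does not depend on `\w`. It assumes that `|` is the only level separator and never occurs inside a taxon name. This is not what `lefser` does internally, just an illustration in base R, and `tipsAlt` is a name we made up.

```r
# Same path strings as in the example above, repeated so this snippet runs standalone
pathStrings <- c(
  "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Burkholderiales|f__Burkholderiaceae|g__Burkholderia-Caballeronia-Paraburkholderia",
  "k__Bacteria|p__Verrucomicrobiota|c__Verrucomicrobiia|o__Pedosphaerales|f__Pedosphaeraceae|g__ADurb.Bin063_1",
  "k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Caudovirales|f__Siphoviridae|g__Siphoviridae_noname"
)

# Remove everything up to and including the last "|"; the greedy ".*" ensures
# that only the final path element (the tip) remains, whatever characters it contains
tipsAlt <- sub("^.*\\|", "", pathStrings)
print(tipsAlt)

# No duplicates among the extracted tips (anyDuplicated() returns 0)
anyDuplicated(tipsAlt)
```

With this approach, hyphens and periods in a genus name are kept as part of the tip, so all three tips stay distinct.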

---

Since we often deal with such taxonomy strings derived from mothur, we would like to ask whether this is intended behavior, or whether we simply missed something.
Thank you!
