Special characters in taxonomy strings cause false duplicate detection in lefserClades? #87

@JakobKlotz

Hi,

We (together with @CTMARGREITER) are applying lefserClades and are running into trouble with taxonomy path strings. Although all row names (taxonomy path strings) in the assay are unique,
duplicates appear to be detected.

It might boil down to some "problematic" tip names in the taxonomy strings. Specifically, we have genera at the lowest taxonomy level whose names contain
characters such as hyphens, periods and brackets, and these seem to be flagged as duplicates.

Example

Here's an example:

Example SummarizedExperiment

pathStrings <- c(
  "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Burkholderiales|f__Burkholderiaceae|g__Burkholderia-Caballeronia-Paraburkholderia",  # problematic
  "k__Bacteria|p__Verrucomicrobiota|c__Verrucomicrobiia|o__Pedosphaerales|f__Pedosphaeraceae|g__ADurb.Bin063_1",  # problematic
  "k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Caudovirales|f__Siphoviridae|g__Siphoviridae_noname" 
)

assay_data <- matrix(
  c(
    542, 1, 3250, 1,
    40, 0, 425, 2,
    150, 2, 2350, 1
  ),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(
    pathStrings,
    1:4
  )
)

se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = assay_data),
    colData = data.frame(
        Group = c("Group1", "Group2", "Group1", "Group2")
    )
)

Applying lefserClades

relative_abundance <- lefser::relativeAb(se)
relative_abundance <- lefser::rowNames2RowData(relative_abundance)

rescl <- lefser::lefserClades(
  relative_abundance, assay=1L, classCol = "Group",
  kruskal.threshold = 1,
  wilcox.threshold = 1,
  lda.threshold = 0
)

Upon running lefserClades, features are being dropped due to duplicated tip names:

Dropped 2 features with duplicated tip names.
lefser will be run at the kingdom, phylum, class, order, family, genus level.

>>>> Running lefser at the kingdom level. <<<<
The outcome variable is specified as 'Group' and the reference category is 'Group1'.
 See `?factor` or `?relevel` to change the reference category.

...

Cause?

It might be due to the regular expression used in .dropFeatures:

tips <- stringr::str_extract(pathStrings, "(|\\w__\\w+)?$")

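Applied to the pathStrings from above, this returns empty strings for the two problematic tips. `\w` matches only letters, digits and underscores, so the tip pattern stops at the hyphen or period and cannot reach the end of the string; the only remaining match is the empty string at the end. The two empty strings are identical, which would explain why they are reported as duplicates: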

print(stringr::str_extract(pathStrings, "(|\\w__\\w+)?$"))

> [1] ""                       ""                       "g__Siphoviridae_noname"
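
For comparison, here is a minimal sketch of a more permissive extraction. This is only our assumption of what might work, not code from lefser: it takes everything after the last `|` as the tip, regardless of which characters the name contains. It uses base R only, and the variable name `tips` is just for illustration.

```r
# Same path strings as in the example above
pathStrings <- c(
  "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Burkholderiales|f__Burkholderiaceae|g__Burkholderia-Caballeronia-Paraburkholderia",
  "k__Bacteria|p__Verrucomicrobiota|c__Verrucomicrobiia|o__Pedosphaerales|f__Pedosphaeraceae|g__ADurb.Bin063_1",
  "k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Caudovirales|f__Siphoviridae|g__Siphoviridae_noname"
)

# Greedy ".*" consumes everything up to and including the last "|",
# so only the final taxonomy level (the tip) remains.
tips <- sub("^.*\\|", "", pathStrings)
print(tips)

# anyDuplicated() returns 0 when all elements are unique
print(anyDuplicated(tips))
```

With this approach the three tips stay distinct, so none of them would be flagged as a duplicate. Whether it fits the other assumptions made in .dropFeatures is something we cannot judge.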

Since we often deal with such taxonomy strings derived from mothur, we would like to ask whether this is intended behavior, or whether we have simply missed something.
Thank you!
