Description
Hi,
we're (@CTMARGREITER) applying lefserClades and are having trouble with path strings. Although all row names (taxonomy path strings) in the assay are unique, some of them appear to be detected as duplicates.
It might boil down to some "problematic" tip names in the taxonomy strings. Specifically, we have genera at the lowest taxonomy level whose names contain characters such as hyphens, periods and brackets, and these seem to be detected as duplicates.
Example
Here's an example:
Example SummarizedExperiment
pathStrings <- c(
  "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Burkholderiales|f__Burkholderiaceae|g__Burkholderia-Caballeronia-Paraburkholderia", # problematic
  "k__Bacteria|p__Verrucomicrobiota|c__Verrucomicrobiia|o__Pedosphaerales|f__Pedosphaeraceae|g__ADurb.Bin063_1", # problematic
  "k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Caudovirales|f__Siphoviridae|g__Siphoviridae_noname"
)
assay_data <- matrix(
  c(
    542, 1, 3250, 1,
    40, 0, 425, 2,
    150, 2, 2350, 1
  ),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(
    pathStrings,
    1:4
  )
)
se <- SummarizedExperiment::SummarizedExperiment(
  assays = list(counts = assay_data),
  colData = data.frame(
    Group = c("Group1", "Group2", "Group1", "Group2")
  )
)
Applying lefserClades
relative_abundance <- lefser::relativeAb(se)
relative_abundance <- lefser::rowNames2RowData(relative_abundance)
rescl <- lefser::lefserClades(
  relative_abundance, assay = 1L, classCol = "Group",
  kruskal.threshold = 1,
  wilcox.threshold = 1,
  lda.threshold = 0
)
Upon running lefserClades, features are dropped due to duplicated tip names:
Dropped 2 features with duplicated tip names.
lefser will be run at the kingdom, phylum, class, order, family, genus level.
>>>> Running lefser at the kingdom level. <<<<
The outcome variable is specified as 'Group' and the reference category is 'Group1'.
See `?factor` or `?relevel` to change the reference category.
...
Cause?
It might be due to the regular expression used in .dropFeatures.
Line 204 in 0ac110d
tips <- stringr::str_extract(pathStrings, "(|\\w__\\w+)?$") |
Using the pathStrings from above results in two empty strings for the problematic tips:
print(stringr::str_extract(pathStrings, "(|\\w__\\w+)?$"))
> [1] "" "" "g__Siphoviridae_noname"
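The empty matches seem to come from `\w`, which covers only letters, digits and underscores. A genus name containing a hyphen or a period cannot be matched through to the end of the string, so the empty alternative in the pattern wins. The base R sketch below illustrates this on the genus names from our example. That the two identical empty tips are then what gets counted as duplicated is our assumption; we have not verified it against the rest of .dropFeatures.

```r
# Genus names taken from the example path strings above
genera <- c(
  "Burkholderia-Caballeronia-Paraburkholderia",
  "ADurb.Bin063_1",
  "Siphoviridae_noname"
)

# \w matches letters, digits and "_" only, so names containing
# "-" or "." are not a single run of word characters
only_word_chars <- grepl("^\\w+$", genera)
print(only_word_chars)

# Tips as returned by the str_extract() call above
tips <- c("", "", "g__Siphoviridae_noname")

# Assumption: every feature whose tip occurs more than once is dropped
is_dup <- tips %in% tips[duplicated(tips)]
print(sum(is_dup))
```

Under that assumption, the count of flagged tips agrees with the "Dropped 2 features" message.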
Since we often deal with such taxonomy strings derived from mothur, we would like to ask whether this is the intended behavior, or whether we have simply missed something.
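In case it is not intended, here is a minimal sketch of a more permissive tip extraction that takes everything after the last "|" regardless of which characters it contains. This is only our suggestion in base R, not the current lefser implementation, and it assumes "|" is always the rank separator and never occurs inside a taxon name.

```r
pathStrings <- c(
  "k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Burkholderiales|f__Burkholderiaceae|g__Burkholderia-Caballeronia-Paraburkholderia",
  "k__Bacteria|p__Verrucomicrobiota|c__Verrucomicrobiia|o__Pedosphaerales|f__Pedosphaeraceae|g__ADurb.Bin063_1",
  "k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Caudovirales|f__Siphoviridae|g__Siphoviridae_noname"
)

# Remove everything up to and including the last "|",
# which leaves the lowest rank with hyphens and periods intact
tips <- sub("^.*\\|", "", pathStrings)
print(tips)

# Index of the first duplicated tip, or 0 if all tips are unique
print(anyDuplicated(tips))
```

With this extraction all three tips in the example stay distinct, so none of them would be treated as duplicates.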
Thank you!