Hi Alex,
A (reproducible) issue ("GSEA hangs") was posted on the clusterProfiler GitHub.
See: YuLab-SMU/clusterProfiler#659 (comment), and posts below that one.
Since clusterProfiler uses fgsea under the hood for gene set enrichment analysis, I checked whether the reported issue originates from the way clusterProfiler processes the input/output data, or from fgsea itself. It turns out that I could reproduce the issue when using fgsea directly, hence this post.
Please note that the OP reported this issue when using R-4.2.2, but I could also reproduce it with current versions of R and fgsea on both my Windows (R-4.3.0) and Linux (R-4.3.3) machines.
Also note that the issue occurs when minSize is set to 10; when minSize = 11 is used, fgsea runs as expected.
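Since only the gene sets of size exactly 10 differ between the two settings, listing them may help to narrow down the culprit. Below is a minimal sketch in base R. The helper `effective_sizes()` and the toy data are hypothetical, and the sketch assumes that a gene set's size is counted as the number of its unique genes that also occur in `names(stats)`; fgsea's internal counting may differ in details.

```r
## Sketch only: effective_sizes() is a hypothetical helper, not part of fgsea.
## Assumption: the size of a gene set is the number of its unique genes
## that are also present in names(stats).
effective_sizes <- function(pathways, stats) {
  vapply(pathways,
         function(genes) length(intersect(genes, names(stats))),
         integer(1))
}

## toy data, made up for illustration only
toy.stats <- setNames(c(2.1, 1.3, 0.4, -0.2, -1.7),
                      c("g1", "g2", "g3", "g4", "g5"))
toy.pathways <- list(setA = c("g1", "g2", "g3"),        # 3 genes, all ranked
                     setB = c("g4", "g5", "g9"),        # g9 is not ranked
                     setC = c("g1", "g1", "g5", "g4"))  # g1 is duplicated

sizes <- effective_sizes(toy.pathways, toy.stats)

## names of the gene sets with an effective size of exactly 2
names(sizes)[sizes == 2]
```

With the real objects from this post, `sizes <- effective_sizes(term2gene.go, input.genes)` followed by `names(sizes)[sizes == 10]` would list the gene sets that are included only when minSize = 10.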
For your convenience I have attached the 2 input objects to this post as an RData file (which I compressed into a ZIP archive in order to be able to upload it). See below for how these objects were generated, in case you would like to generate them yourself.
I would appreciate it if you could have a look at this to see whether it can be fixed.
G
> ## load libraries
> library(clusterProfiler)
> library(fgsea)
> library(org.Hs.eg.db)
>
> ## import input genes (human ENSEMBL) and GO-BP gene sets
> load("fgsea.input.Rdata")
>
> ######
> ## if preferred, code to generate input
>
> ## copy/paste list of input genes ('hgene_list') from:
> ## https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
>
>
> ## create GO-based gene sets; limit to BP
> ## 'ont' should be one of "BP", "CC", "MF" or "ALL"
> library(GO.db)
> ont <- "BP"
>
> goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
> if (ont != "ALL") {goterms <- goterms[goterms == ont]}
>
> term2gene.go <- AnnotationDbi::mapIds(org.Hs.eg.db,
+ keys=names(goterms),
+ column="ENTREZID",
+ keytype="GOALL",
+ multiVals='list')
'select()' returned 1:many mapping between keys and columns
>
> ## end code to generate input.
> ######
>
> ## manually convert ENSEMBL into ENTREZID using function bitr from clusterProfiler.
> ## when using the function gseGO from clusterProfiler, this is being done on the fly;
> ## see for gseGO function call: https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
>
> ensembl.2.eg <- bitr( names(hgene_list),
+ fromType="ENSEMBL",
+ toType="ENTREZID",
+ OrgDb="org.Hs.eg.db",
+ drop = TRUE)
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(names(hgene_list), fromType = "ENSEMBL", toType = "ENTREZID", :
0.05% of input gene IDs are fail to map...
>
>
> input.genes <- hgene_list[ensembl.2.eg$ENSEMBL]
> names(input.genes) <- ensembl.2.eg$ENTREZID
> ## perform GSEA
> ## with minSize = 11; works fine!
>
> system.time({
+
+ res <- fgseaMultilevel(
+ pathways = term2gene.go,
+ stats = input.genes,
+ minSize = 11,
+ maxSize = 500,
+ eps = 0,
+ scoreType = c("std") )
+
+ })
user system elapsed
3.47 0.87 20.19
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes, :
There were 8 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
3: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes, :
For some of the pathways the P-values were likely overestimated. For such pathways log2err is set to NA.
>
> ## perform GSEA
> ## now with minSize = 10; run was aborted (after ~10 min elapsed, see timing below) since it still had not finished...
>
> system.time({
+
+ res <- fgseaMultilevel(
+ pathways = term2gene.go,
+ stats = input.genes,
+ minSize = 10,
+ maxSize = 500,
+ eps = 0,
+ scoreType = c("std") )
+
+ })
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes, :
There were 4 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
Timing stopped at: 3.07 0.91 592.6
>
>
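Because the minSize = 10 run above does not return, a time limit can make the reproduction easier to script. The following is a minimal sketch in base R. `run_with_time_limit()` is a hypothetical helper (not part of fgsea or clusterProfiler), and it relies on `setTimeLimit()`, which only takes effect at points where the running code checks for user interrupts. That the run above could be aborted manually suggests this is the case for fgsea, but I have not verified it.

```r
## Sketch only: run_with_time_limit() is a hypothetical helper.
## It evaluates 'expr' and gives up once 'seconds' of elapsed time have passed.
run_with_time_limit <- function(expr, seconds) {
  start <- proc.time()[["elapsed"]]
  setTimeLimit(elapsed = seconds, transient = TRUE)
  ## always remove the limit again, whatever happens
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE), add = TRUE)

  tryCatch(
    ## 'expr' is a promise, so it is evaluated here, inside tryCatch()
    list(value = expr, timed_out = FALSE),
    error = function(e) {
      elapsed <- proc.time()[["elapsed"]] - start
      if (elapsed >= seconds) {
        ## the error was raised at or after the limit: treat it as a timeout
        list(value = NULL, timed_out = TRUE)
      } else {
        ## any earlier error is a genuine one: pass it on
        stop(e)
      }
    }
  )
}

## stand-in for a call that hangs; Sys.sleep() can be interrupted
slow <- run_with_time_limit(Sys.sleep(5), seconds = 0.5)
slow$timed_out

## a call that finishes in time is returned unchanged
fast <- run_with_time_limit(1 + 1, seconds = 5)
fast$value
```

For the case reported here, the `fgseaMultilevel(...)` call with minSize = 10 would take the place of `Sys.sleep(5)`, with a limit of, say, 300 seconds.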
sessionInfo() Windows machine:
> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1 IRanges_2.36.0
[4] S4Vectors_0.40.2 Biobase_2.62.0 BiocGenerics_0.48.1
[7] fgsea_1.28.0 clusterProfiler_4.10.1
loaded via a namespace (and not attached):
[1] DBI_1.2.2 bitops_1.0-7 shadowtext_0.1.3
[4] gson_0.1.0 gridExtra_2.3 rlang_1.1.3
[7] magrittr_2.0.3 DOSE_3.28.2 compiler_4.3.0
[10] RSQLite_2.3.6 png_0.1-8 vctrs_0.6.5
[13] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
[16] crayon_1.5.2 fastmap_1.1.1 XVector_0.42.0
[19] ggraph_2.2.1 utf8_1.2.4 HDO.db_0.99.1
[22] enrichplot_1.23.1.992 purrr_1.0.2 bit_4.0.5
[25] zlibbioc_1.48.2 cachem_1.0.8 aplot_0.2.2
[28] GenomeInfoDb_1.38.8 jsonlite_1.8.8 blob_1.2.4
[31] BiocParallel_1.36.0 tweenr_2.0.3 parallel_4.3.0
[34] R6_2.5.1 stringi_1.8.3 RColorBrewer_1.1-3
[37] GOSemSim_2.29.1.001 Rcpp_1.0.12 snow_0.4-4
[40] Matrix_1.6-5 splines_4.3.0 igraph_2.0.3
[43] tidyselect_1.2.1 qvalue_2.34.0 viridis_0.6.5
[46] codetools_0.2-20 lattice_0.22-6 tibble_3.2.1
[49] plyr_1.8.9 treeio_1.26.0 withr_3.0.0
[52] KEGGREST_1.42.0 gridGraphics_0.5-1 scatterpie_0.2.1
[55] polyclip_1.10-6 Biostrings_2.70.3 pillar_1.9.0
[58] ggtree_3.10.1 ggfun_0.1.4 generics_0.1.3
[61] RCurl_1.98-1.14 ggplot2_3.5.0 munsell_0.5.1
[64] scales_1.3.0 tidytree_0.4.6 glue_1.7.0
[67] lazyeval_0.2.2 tools_4.3.0 data.table_1.15.4
[70] fs_1.6.3 graphlayouts_1.1.1 fastmatch_1.1-4
[73] tidygraph_1.3.1 cowplot_1.1.3 grid_4.3.0
[76] tidyr_1.3.1 ape_5.7-1 colorspace_2.1-0
[79] nlme_3.1-164 GenomeInfoDbData_1.2.11 patchwork_1.2.0
[82] ggforce_0.4.2 cli_3.6.2 fansi_1.0.6
[85] viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.4
[88] yulab.utils_0.1.4 digest_0.6.35 ggrepel_0.9.5
[91] ggplotify_0.1.2 farver_2.1.1 memoise_2.0.1
[94] lifecycle_1.0.4 httr_1.4.7 GO.db_3.18.0
[97] bit64_4.0.5 MASS_7.3-60.0.1
>
sessionInfo() Linux machine:
> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 39 (Thirty Nine)
Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1 IRanges_2.36.0
[4] S4Vectors_0.40.2 Biobase_2.62.0 BiocGenerics_0.48.1
[7] fgsea_1.28.0 clusterProfiler_4.10.1
loaded via a namespace (and not attached):
[1] DBI_1.2.2 bitops_1.0-7 shadowtext_0.1.3
[4] gson_0.1.0 gridExtra_2.3 rlang_1.1.3
[7] magrittr_2.0.3 DOSE_3.28.2 compiler_4.3.3
[10] RSQLite_2.3.6 png_0.1-8 vctrs_0.6.5
[13] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
[16] crayon_1.5.2 fastmap_1.1.1 XVector_0.42.0
[19] ggraph_2.2.1 utf8_1.2.4 HDO.db_0.99.1
[22] enrichplot_1.22.0 purrr_1.0.2 bit_4.0.5
[25] zlibbioc_1.48.2 cachem_1.0.8 aplot_0.2.2
[28] GenomeInfoDb_1.38.8 jsonlite_1.8.8 blob_1.2.4
[31] BiocParallel_1.36.0 tweenr_2.0.3 parallel_4.3.3
[34] R6_2.5.1 stringi_1.8.3 RColorBrewer_1.1-3
[37] GOSemSim_2.28.1 Rcpp_1.0.12 Matrix_1.6-5
[40] splines_4.3.3 igraph_2.0.3 tidyselect_1.2.1
[43] qvalue_2.34.0 viridis_0.6.5 codetools_0.2-20
[46] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
[49] treeio_1.26.0 withr_3.0.0 KEGGREST_1.42.0
[52] gridGraphics_0.5-1 scatterpie_0.2.2 polyclip_1.10-6
[55] Biostrings_2.70.3 pillar_1.9.0 ggtree_3.10.1
[58] ggfun_0.1.4 generics_0.1.3 RCurl_1.98-1.14
[61] ggplot2_3.5.0 munsell_0.5.1 scales_1.3.0
[64] tidytree_0.4.6 glue_1.7.0 lazyeval_0.2.2
[67] tools_4.3.3 data.table_1.15.4 fs_1.6.3
[70] graphlayouts_1.1.1 fastmatch_1.1-4 tidygraph_1.3.1
[73] cowplot_1.1.3 grid_4.3.3 tidyr_1.3.1
[76] ape_5.7-1 colorspace_2.1-0 nlme_3.1-164
[79] GenomeInfoDbData_1.2.11 patchwork_1.2.0 ggforce_0.4.2
[82] cli_3.6.2 fansi_1.0.6 viridisLite_0.4.2
[85] dplyr_1.1.4 gtable_0.3.4 yulab.utils_0.1.4
[88] digest_0.6.35 ggrepel_0.9.5 ggplotify_0.1.2
[91] farver_2.1.1 memoise_2.0.1 lifecycle_1.0.4
[94] httr_1.4.7 GO.db_3.18.0 bit64_4.0.5
[97] MASS_7.3-60.0.1
>