Description
Hi Alex,
A (reproducible) issue ("GSEA hangs") was posted on the clusterProfiler GitHub.
See: YuLab-SMU/clusterProfiler#659 (comment), and posts below that one.
Since clusterProfiler uses fgsea under the hood for gene set enrichment analysis, I checked whether the reported issue originates from the way input/output data is processed by clusterProfiler, or from fgsea itself. It turns out that I could reproduce the issue when using fgsea directly, hence this post.
Please note that the OP reported this issue when using R-4.2.2, but I could also reproduce it with more recent versions of R and the current fgsea on both my Windows (R-4.3.0) and Linux (R-4.3.3) machines.
Also note that the issue occurs when minSize is set to 10; when minSize=11 is used, fgsea runs as expected...
For your convenience, I have attached the two input objects to this post as an RData file (compressed into a ZIP archive so that it could be uploaded). See below for how these objects were generated, in case you would like to regenerate them yourself.
I would appreciate it if you could have a look at this to see whether it can be fixed.
G
> ## load libraries
> library(clusterProfiler)
> library(fgsea)
> library(org.Hs.eg.db)
> 
> ## import input genes (human ENSEMBL) and GO-BP gene sets
> load("fgsea.input.Rdata")
> 
> ######
> ## if preferred, code to generate input
> 
> ## copy/paste list of input genes ('hgene_list') from:
> ## https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
> 
> 
> ## create GO-based gene sets; limit to BP
> ## 'ont' should either be "BP", "CC", "MF" or all
> library(GO.db)
> ont <- "BP" 
> 
> goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
> if (ont != "ALL") {goterms <- goterms[goterms == ont]}
> 
> term2gene.go <- AnnotationDbi::mapIds(org.Hs.eg.db,
+                                       keys=names(goterms),
+                                       column="ENTREZID",
+                                       keytype="GOALL",
+                                       multiVals='list')
'select()' returned 1:many mapping between keys and columns
> 
> ## end code to generate input.
> ######
> 
> ## manually convert ENSEMBL into ENTREZID using function bitr from clusterProfiler.
> ## when using the function gseGO from clusterProfiler, this is being done on the fly;
> ## see for gseGO function call: https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
> 
> ensembl.2.eg <- bitr( names(hgene_list),
+                       fromType="ENSEMBL",
+                       toType="ENTREZID",
+                       OrgDb="org.Hs.eg.db",
+                       drop = TRUE)
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(names(hgene_list), fromType = "ENSEMBL", toType = "ENTREZID",  :
  0.05% of input gene IDs are fail to map...
> 
> 
> input.genes <- hgene_list[ensembl.2.eg$ENSEMBL]
> names(input.genes) <- ensembl.2.eg$ENTREZID
> ## perform GSEA
> ## with minSize = 11; works fine!
> 
> system.time({
+ 
+ res <- fgseaMultilevel(
+   pathways = term2gene.go,
+   stats = input.genes,
+   minSize = 11,
+   maxSize = 500,
+   eps = 0,
+   scoreType = c("std") )
+ 
+   })
   user  system elapsed 
   3.47    0.87   20.19 
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  There were 8 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
3: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  For some of the pathways the P-values were likely overestimated. For such pathways log2err is set to NA.
> 
> ## perform GSEA
> ## now with minSize = 10; run was aborted after 5 mins since it wasn't finished by then...
> 
> system.time({
+ 
+ res <- fgseaMultilevel(
+   pathways = term2gene.go,
+   stats = input.genes,
+   minSize = 10,
+   maxSize = 500,
+   eps = 0,
+   scoreType = c("std") )
+ 
+   })
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  There were 4 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
Timing stopped at: 3.07 0.91 592.6
> 
>
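Since the only difference between the two runs above is the lower size bound, the gene sets that enter the analysis at minSize = 10 but not at minSize = 11 are those with exactly 10 genes present in the ranked statistics. Below is a minimal sketch of how such sets could be listed. The helper `matched_sizes` and the toy objects are hypothetical (they are not part of fgsea), and the sketch assumes that fgsea determines gene set size by counting the unique genes of a set that are found among the names of `stats`.

```r
## Hypothetical helper (not part of fgsea): for each gene set, count how many
## of its unique genes are present among the names of the ranked statistics.
matched_sizes <- function(pathways, stats) {
  vapply(pathways,
         function(genes) sum(unique(genes) %in% names(stats)),
         integer(1))
}

## Toy data standing in for 'term2gene.go' and 'input.genes'
toy.stats <- c(g1 = 2.1, g2 = 1.3, g3 = -0.4, g4 = -1.8)
toy.pathways <- list(setA = c("g1", "g2", "g3"),
                     setB = c("g1", "g4"),
                     setC = c("g2", "g3", "g9"))

sizes <- matched_sizes(toy.pathways, toy.stats)

## Gene sets that are only admitted when the lower bound drops from 3 to 2
suspects <- names(sizes)[sizes == 2]
print(suspects)
```

With the real objects, the same idea would be `sizes <- matched_sizes(term2gene.go, input.genes)` followed by `names(sizes)[sizes == 10]`. Running fgseaMultilevel on only those gene sets (for example one at a time) might help to pinpoint which of them triggers the hang.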
sessionInfo() Windows machine:
> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     
other attached packages:
[1] org.Hs.eg.db_3.18.0    AnnotationDbi_1.64.1   IRanges_2.36.0        
[4] S4Vectors_0.40.2       Biobase_2.62.0         BiocGenerics_0.48.1   
[7] fgsea_1.28.0           clusterProfiler_4.10.1
loaded via a namespace (and not attached):
 [1] DBI_1.2.2               bitops_1.0-7            shadowtext_0.1.3       
 [4] gson_0.1.0              gridExtra_2.3           rlang_1.1.3            
 [7] magrittr_2.0.3          DOSE_3.28.2             compiler_4.3.0         
[10] RSQLite_2.3.6           png_0.1-8               vctrs_0.6.5            
[13] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
[16] crayon_1.5.2            fastmap_1.1.1           XVector_0.42.0         
[19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
[22] enrichplot_1.23.1.992   purrr_1.0.2             bit_4.0.5              
[25] zlibbioc_1.48.2         cachem_1.0.8            aplot_0.2.2            
[28] GenomeInfoDb_1.38.8     jsonlite_1.8.8          blob_1.2.4             
[31] BiocParallel_1.36.0     tweenr_2.0.3            parallel_4.3.0         
[34] R6_2.5.1                stringi_1.8.3           RColorBrewer_1.1-3     
[37] GOSemSim_2.29.1.001     Rcpp_1.0.12             snow_0.4-4             
[40] Matrix_1.6-5            splines_4.3.0           igraph_2.0.3           
[43] tidyselect_1.2.1        qvalue_2.34.0           viridis_0.6.5          
[46] codetools_0.2-20        lattice_0.22-6          tibble_3.2.1           
[49] plyr_1.8.9              treeio_1.26.0           withr_3.0.0            
[52] KEGGREST_1.42.0         gridGraphics_0.5-1      scatterpie_0.2.1       
[55] polyclip_1.10-6         Biostrings_2.70.3       pillar_1.9.0           
[58] ggtree_3.10.1           ggfun_0.1.4             generics_0.1.3         
[61] RCurl_1.98-1.14         ggplot2_3.5.0           munsell_0.5.1          
[64] scales_1.3.0            tidytree_0.4.6          glue_1.7.0             
[67] lazyeval_0.2.2          tools_4.3.0             data.table_1.15.4      
[70] fs_1.6.3                graphlayouts_1.1.1      fastmatch_1.1-4        
[73] tidygraph_1.3.1         cowplot_1.1.3           grid_4.3.0             
[76] tidyr_1.3.1             ape_5.7-1               colorspace_2.1-0       
[79] nlme_3.1-164            GenomeInfoDbData_1.2.11 patchwork_1.2.0        
[82] ggforce_0.4.2           cli_3.6.2               fansi_1.0.6            
[85] viridisLite_0.4.2       dplyr_1.1.4             gtable_0.3.4           
[88] yulab.utils_0.1.4       digest_0.6.35           ggrepel_0.9.5          
[91] ggplotify_0.1.2         farver_2.1.1            memoise_2.0.1          
[94] lifecycle_1.0.4         httr_1.4.7              GO.db_3.18.0           
[97] bit64_4.0.5             MASS_7.3-60.0.1        
> 
sessionInfo() Linux machine:
> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 39 (Thirty Nine)
Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.11.0
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     
other attached packages:
[1] org.Hs.eg.db_3.18.0    AnnotationDbi_1.64.1   IRanges_2.36.0        
[4] S4Vectors_0.40.2       Biobase_2.62.0         BiocGenerics_0.48.1   
[7] fgsea_1.28.0           clusterProfiler_4.10.1
loaded via a namespace (and not attached):
 [1] DBI_1.2.2               bitops_1.0-7            shadowtext_0.1.3       
 [4] gson_0.1.0              gridExtra_2.3           rlang_1.1.3            
 [7] magrittr_2.0.3          DOSE_3.28.2             compiler_4.3.3         
[10] RSQLite_2.3.6           png_0.1-8               vctrs_0.6.5            
[13] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
[16] crayon_1.5.2            fastmap_1.1.1           XVector_0.42.0         
[19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
[22] enrichplot_1.22.0       purrr_1.0.2             bit_4.0.5              
[25] zlibbioc_1.48.2         cachem_1.0.8            aplot_0.2.2            
[28] GenomeInfoDb_1.38.8     jsonlite_1.8.8          blob_1.2.4             
[31] BiocParallel_1.36.0     tweenr_2.0.3            parallel_4.3.3         
[34] R6_2.5.1                stringi_1.8.3           RColorBrewer_1.1-3     
[37] GOSemSim_2.28.1         Rcpp_1.0.12             Matrix_1.6-5           
[40] splines_4.3.3           igraph_2.0.3            tidyselect_1.2.1       
[43] qvalue_2.34.0           viridis_0.6.5           codetools_0.2-20       
[46] lattice_0.22-6          tibble_3.2.1            plyr_1.8.9             
[49] treeio_1.26.0           withr_3.0.0             KEGGREST_1.42.0        
[52] gridGraphics_0.5-1      scatterpie_0.2.2        polyclip_1.10-6        
[55] Biostrings_2.70.3       pillar_1.9.0            ggtree_3.10.1          
[58] ggfun_0.1.4             generics_0.1.3          RCurl_1.98-1.14        
[61] ggplot2_3.5.0           munsell_0.5.1           scales_1.3.0           
[64] tidytree_0.4.6          glue_1.7.0              lazyeval_0.2.2         
[67] tools_4.3.3             data.table_1.15.4       fs_1.6.3               
[70] graphlayouts_1.1.1      fastmatch_1.1-4         tidygraph_1.3.1        
[73] cowplot_1.1.3           grid_4.3.3              tidyr_1.3.1            
[76] ape_5.7-1               colorspace_2.1-0        nlme_3.1-164           
[79] GenomeInfoDbData_1.2.11 patchwork_1.2.0         ggforce_0.4.2          
[82] cli_3.6.2               fansi_1.0.6             viridisLite_0.4.2      
[85] dplyr_1.1.4             gtable_0.3.4            yulab.utils_0.1.4      
[88] digest_0.6.35           ggrepel_0.9.5           ggplotify_0.1.2        
[91] farver_2.1.1            memoise_2.0.1           lifecycle_1.0.4        
[94] httr_1.4.7              GO.db_3.18.0            bit64_4.0.5            
[97] MASS_7.3-60.0.1        
>