fgsea hangs forever for highly enriched pathways in the presence of repeated high scored genes #151

@guidohooiveld

Hi Alex,

A (reproducible) issue ("GSEA hangs") was posted on the clusterProfiler GitHub.
See: YuLab-SMU/clusterProfiler#659 (comment), and posts below that one.

Since clusterProfiler uses fgsea under the hood for gene set enrichment analysis, I checked whether the reported issue originates from the way clusterProfiler processes the input/output data, or from fgsea itself. It turns out that I could reproduce the issue when using fgsea directly, hence this post.

Please note that the OP reported this issue when using R-4.2.2, but I could also reproduce it with current versions of R and fgsea on both my Windows (R-4.3.0) and Linux (R-4.3.3) machines.

Also note that the issue occurs when minSize is set to 10; when minSize=11 is used, fgsea runs as expected...

For your convenience I have attached the 2 input files to this post as an RData file (which I compressed into a ZIP archive in order to be able to upload it). See below how these objects were generated, in case you would like to generate them yourself.

I would appreciate it if you could have a look at this to see whether it can be fixed.
G

> ## load libraries
> library(clusterProfiler)
> library(fgsea)
> library(org.Hs.eg.db)
> 
> ## import input genes (human ENSEMBL) and GO-BP gene sets
> load("fgsea.input.Rdata")
> 
> ######
> ## if preferred, code to generate input
> 
> ## copy/paste list of input genes ('hgene_list') from:
> ## https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
> 
> 
> ## create GO-based gene sets; limit to BP
> ## 'ont' should be one of "BP", "CC", "MF" or "ALL"
> library(GO.db)
> ont <- "BP" 
> 
> goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
> if (ont != "ALL") {goterms <- goterms[goterms == ont]}
> 
> term2gene.go <- AnnotationDbi::mapIds(org.Hs.eg.db,
+                                       keys=names(goterms),
+                                       column="ENTREZID",
+                                       keytype="GOALL",
+                                       multiVals='list')
'select()' returned 1:many mapping between keys and columns
> 
> ## end code to generate input.
> ######
> 
> ## manually convert ENSEMBL into ENTREZID using function bitr from clusterProfiler.
> ## when using the function gseGO from clusterProfiler, this is being done on the fly;
> ## see for gseGO function call: https://github.com/YuLab-SMU/clusterProfiler/issues/659#issuecomment-2027820878
> 
> ensembl.2.eg <- bitr( names(hgene_list),
+                       fromType="ENSEMBL",
+                       toType="ENTREZID",
+                       OrgDb="org.Hs.eg.db",
+                       drop = TRUE)
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(names(hgene_list), fromType = "ENSEMBL", toType = "ENTREZID",  :
  0.05% of input gene IDs are fail to map...
> 
> 
> input.genes <- hgene_list[ensembl.2.eg$ENSEMBL]
> names(input.genes) <- ensembl.2.eg$ENTREZID
> ## perform GSEA
> ## with minSize = 11; works fine!
> 
> system.time({
+ 
+ res <- fgseaMultilevel(
+   pathways = term2gene.go,
+   stats = input.genes,
+   minSize = 11,
+   maxSize = 500,
+   eps = 0,
+   scoreType = c("std") )
+ 
+   })
   user  system elapsed 
   3.47    0.87   20.19 
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  There were 8 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
3: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  For some of the pathways the P-values were likely overestimated. For such pathways log2err is set to NA.
> 
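
The first warning above reports ties in the preranked stats, and the `bitr()` step returned a 1:many mapping, so it may be useful to check how many tied values and duplicated gene names the input vector contains. Below is a minimal base-R sketch on a toy vector; `toy_stats` is a hypothetical stand-in for `input.genes`, and the tie fraction computed here is not guaranteed to match the percentage fgsea prints.

```r
# Toy stand-in for 'input.genes' (hypothetical values, not the real data):
# a named numeric vector of gene-level statistics.
toy_stats <- c(g1 = 5.2, g2 = 5.2, g3 = 5.2, g4 = 1.1, g5 = -0.4, g2 = 3.0)

# Gene names that occur more than once (e.g. after a 1:many ID mapping).
dup_names <- unique(names(toy_stats)[duplicated(names(toy_stats))])

# Flag every entry whose value is shared with at least one other entry.
is_tied <- duplicated(toy_stats) | duplicated(toy_stats, fromLast = TRUE)
tied_fraction <- mean(is_tied)

# Number of genes sharing the single highest score.
n_top_tied <- sum(toy_stats == max(toy_stats))

print(dup_names)
print(tied_fraction)
print(n_top_tied)
```

On the real data, replacing `toy_stats` with `input.genes` should show whether the highest-scoring genes are among the tied ones.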

> ## perform GSEA
> ## now with minSize = 10; run was aborted after 5 mins since it wasn't finished by then...
> 
> system.time({
+ 
+ res <- fgseaMultilevel(
+   pathways = term2gene.go,
+   stats = input.genes,
+   minSize = 10,
+   maxSize = 500,
+   eps = 0,
+   scoreType = c("std") )
+ 
+   })

Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (2.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In fgseaMultilevel(pathways = term2gene.go, stats = input.genes,  :
  There were 4 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)

Timing stopped at: 3.07 0.91 592.6
> 
>
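
Since the run completes with minSize = 11 but not with minSize = 10, the gene sets that differ between the two runs are presumably those with exactly 10 genes. The sketch below lists such sets on toy data. It assumes that set size is counted after restricting each pathway to genes present in the ranked list; `toy_stats` and `toy_pathways` are hypothetical stand-ins for `input.genes` and `term2gene.go`.

```r
# Toy stand-ins (hypothetical data, not the attached input).
toy_stats <- setNames(seq(12, 1), paste0("g", 1:12))
toy_pathways <- list(
  setA = paste0("g", 1:10),            # 10 genes, all ranked
  setB = paste0("g", 1:11),            # 11 genes, all ranked
  setC = c(paste0("g", 1:9), "x1"),    # 10 genes, only 9 ranked
  setD = c(paste0("g", 3:12), "x2")    # 11 genes, only 10 ranked
)

# Size of each pathway, counting only genes present in the ranked list.
effective_size <- vapply(
  toy_pathways,
  function(genes) length(intersect(genes, names(toy_stats))),
  integer(1)
)

# Pathways that pass minSize = 10 but would be dropped by minSize = 11.
only_at_10 <- names(effective_size)[effective_size == 10L]
print(only_at_10)
```

With the real objects, running fgseaMultilevel on only `term2gene.go[only_at_10]` might help narrow down which gene set triggers the hang.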

sessionInfo() Windows machine:

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.18.0    AnnotationDbi_1.64.1   IRanges_2.36.0        
[4] S4Vectors_0.40.2       Biobase_2.62.0         BiocGenerics_0.48.1   
[7] fgsea_1.28.0           clusterProfiler_4.10.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.2               bitops_1.0-7            shadowtext_0.1.3       
 [4] gson_0.1.0              gridExtra_2.3           rlang_1.1.3            
 [7] magrittr_2.0.3          DOSE_3.28.2             compiler_4.3.0         
[10] RSQLite_2.3.6           png_0.1-8               vctrs_0.6.5            
[13] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
[16] crayon_1.5.2            fastmap_1.1.1           XVector_0.42.0         
[19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
[22] enrichplot_1.23.1.992   purrr_1.0.2             bit_4.0.5              
[25] zlibbioc_1.48.2         cachem_1.0.8            aplot_0.2.2            
[28] GenomeInfoDb_1.38.8     jsonlite_1.8.8          blob_1.2.4             
[31] BiocParallel_1.36.0     tweenr_2.0.3            parallel_4.3.0         
[34] R6_2.5.1                stringi_1.8.3           RColorBrewer_1.1-3     
[37] GOSemSim_2.29.1.001     Rcpp_1.0.12             snow_0.4-4             
[40] Matrix_1.6-5            splines_4.3.0           igraph_2.0.3           
[43] tidyselect_1.2.1        qvalue_2.34.0           viridis_0.6.5          
[46] codetools_0.2-20        lattice_0.22-6          tibble_3.2.1           
[49] plyr_1.8.9              treeio_1.26.0           withr_3.0.0            
[52] KEGGREST_1.42.0         gridGraphics_0.5-1      scatterpie_0.2.1       
[55] polyclip_1.10-6         Biostrings_2.70.3       pillar_1.9.0           
[58] ggtree_3.10.1           ggfun_0.1.4             generics_0.1.3         
[61] RCurl_1.98-1.14         ggplot2_3.5.0           munsell_0.5.1          
[64] scales_1.3.0            tidytree_0.4.6          glue_1.7.0             
[67] lazyeval_0.2.2          tools_4.3.0             data.table_1.15.4      
[70] fs_1.6.3                graphlayouts_1.1.1      fastmatch_1.1-4        
[73] tidygraph_1.3.1         cowplot_1.1.3           grid_4.3.0             
[76] tidyr_1.3.1             ape_5.7-1               colorspace_2.1-0       
[79] nlme_3.1-164            GenomeInfoDbData_1.2.11 patchwork_1.2.0        
[82] ggforce_0.4.2           cli_3.6.2               fansi_1.0.6            
[85] viridisLite_0.4.2       dplyr_1.1.4             gtable_0.3.4           
[88] yulab.utils_0.1.4       digest_0.6.35           ggrepel_0.9.5          
[91] ggplotify_0.1.2         farver_2.1.1            memoise_2.0.1          
[94] lifecycle_1.0.4         httr_1.4.7              GO.db_3.18.0           
[97] bit64_4.0.5             MASS_7.3-60.0.1        
> 

sessionInfo() Linux machine:

> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 39 (Thirty Nine)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.18.0    AnnotationDbi_1.64.1   IRanges_2.36.0        
[4] S4Vectors_0.40.2       Biobase_2.62.0         BiocGenerics_0.48.1   
[7] fgsea_1.28.0           clusterProfiler_4.10.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.2               bitops_1.0-7            shadowtext_0.1.3       
 [4] gson_0.1.0              gridExtra_2.3           rlang_1.1.3            
 [7] magrittr_2.0.3          DOSE_3.28.2             compiler_4.3.3         
[10] RSQLite_2.3.6           png_0.1-8               vctrs_0.6.5            
[13] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
[16] crayon_1.5.2            fastmap_1.1.1           XVector_0.42.0         
[19] ggraph_2.2.1            utf8_1.2.4              HDO.db_0.99.1          
[22] enrichplot_1.22.0       purrr_1.0.2             bit_4.0.5              
[25] zlibbioc_1.48.2         cachem_1.0.8            aplot_0.2.2            
[28] GenomeInfoDb_1.38.8     jsonlite_1.8.8          blob_1.2.4             
[31] BiocParallel_1.36.0     tweenr_2.0.3            parallel_4.3.3         
[34] R6_2.5.1                stringi_1.8.3           RColorBrewer_1.1-3     
[37] GOSemSim_2.28.1         Rcpp_1.0.12             Matrix_1.6-5           
[40] splines_4.3.3           igraph_2.0.3            tidyselect_1.2.1       
[43] qvalue_2.34.0           viridis_0.6.5           codetools_0.2-20       
[46] lattice_0.22-6          tibble_3.2.1            plyr_1.8.9             
[49] treeio_1.26.0           withr_3.0.0             KEGGREST_1.42.0        
[52] gridGraphics_0.5-1      scatterpie_0.2.2        polyclip_1.10-6        
[55] Biostrings_2.70.3       pillar_1.9.0            ggtree_3.10.1          
[58] ggfun_0.1.4             generics_0.1.3          RCurl_1.98-1.14        
[61] ggplot2_3.5.0           munsell_0.5.1           scales_1.3.0           
[64] tidytree_0.4.6          glue_1.7.0              lazyeval_0.2.2         
[67] tools_4.3.3             data.table_1.15.4       fs_1.6.3               
[70] graphlayouts_1.1.1      fastmatch_1.1-4         tidygraph_1.3.1        
[73] cowplot_1.1.3           grid_4.3.3              tidyr_1.3.1            
[76] ape_5.7-1               colorspace_2.1-0        nlme_3.1-164           
[79] GenomeInfoDbData_1.2.11 patchwork_1.2.0         ggforce_0.4.2          
[82] cli_3.6.2               fansi_1.0.6             viridisLite_0.4.2      
[85] dplyr_1.1.4             gtable_0.3.4            yulab.utils_0.1.4      
[88] digest_0.6.35           ggrepel_0.9.5           ggplotify_0.1.2        
[91] farver_2.1.1            memoise_2.0.1           lifecycle_1.0.4        
[94] httr_1.4.7              GO.db_3.18.0            bit64_4.0.5            
[97] MASS_7.3-60.0.1        
> 

fgsea.input.zip
