Stochastic variational variable selection - SVVS

Publication

Dang, T., Kumaishi, K., Usui, E. et al. Stochastic variational variable selection for high-dimensional microbiome data. Microbiome 10, 236 (2022). https://doi.org/10.1186/s40168-022-01439-0 https://doi.org/10.1101/2022.12.25.521906

Getting Started

Get the SVVS Source

git clone https://github.com/tungtokyo1108/SVVS.git 

How To Use

  • Import packages
import warnings
from abc import ABCMeta, abstractmethod
from time import time 
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.utils import check_array
from sklearn.metrics.cluster import adjusted_rand_score
from DMM_Dirichlet_SVVS import DMM_SVVS
  • Load the example dataset from data/ (the paths below assume data/ is the working directory)
dataset_A_count = pd.read_csv("datasetA_count.csv", index_col=0)
dataset_A_meta = pd.read_csv("datasetA_meta.csv", index_col=0)
  • Set up parameters for SVVS function
    • n_components: the maximum number of clusters. Depending on the data, SVVS can determine the effective number of clusters.
    • max_iter: the maximum number of iterations to perform.
    • init_params: the method used to initialize the weights. There are two options: "kmeans": responsibilities are initialized using kmeans; "random": responsibilities are initialized randomly. Default = "random".
    • weight_concentration_prior: the Dirichlet concentration of each component in the weight distribution. Default = 0.1.
    • select_prior: the prior on the selection distribution. Default = 1.
X = check_array(dataset_A_count, dtype=[np.float64, np.float32])
dmm = DMM_SVVS(n_components = 10, max_iter = 100, init_params = "random")
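To illustrate what n_components as an upper bound means in practice, here is a minimal sketch using a made-up responsibility matrix (samples in rows, candidate components in columns). It assumes that a component counts as "effective" when at least one sample is assigned to it; that definition is an illustration, not necessarily the criterion used inside DMM_SVVS.

```python
import numpy as np

# Made-up responsibilities: 6 samples (rows) x 4 candidate components (columns).
# Each row sums to 1; components 1 and 3 never receive the largest share.
resp = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.05, 0.80, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.10, 0.60, 0.10],
])

# Hard assignment: each sample goes to its most probable component.
hard_labels = resp.argmax(axis=1)

# Components that received at least one sample, and how many samples each got.
used, sizes = np.unique(hard_labels, return_counts=True)

# Assumed definition of "effective": a component with at least one sample.
n_effective = len(used)
print(f"{n_effective} of {resp.shape[1]} candidate components are in use")
```

In this toy case only two of the four candidate components end up in use, which is why n_components can safely be set larger than the expected number of groups.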
  • Run SVVS
log_resp_, clus_update, prob_selected, sel_update = dmm.fit_predict(X)
  • Evaluate the predicted clusters
resp_ = clus_update[100]
log_resp_max_ = resp_.argmax(axis=1)
df_cluster = {'Diseases': dataset_A_meta['Label'], 'Predicted_cluster': log_resp_max_}
clus_labeled = pd.DataFrame(data=df_cluster)
clus_labeled["True_cluster"] = clus_labeled["Diseases"].apply(lambda x: 2 if x == "D" else 4)
ARI_score = adjusted_rand_score(clus_labeled["Predicted_cluster"], clus_labeled["True_cluster"])
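The numeric codes 2 and 4 used for the true clusters are arbitrary, because the adjusted Rand index depends only on how samples are grouped and not on the label values. The following is a small standard-library sketch of the index (the function name adjusted_rand_index is hypothetical; for real analyses use sklearn's adjusted_rand_score as above) that shows this invariance on made-up labels.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index computed from pair counts (illustrative sketch)."""
    n = len(labels_a)
    # Pairs of samples that share a group in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of samples that share a group within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case (for example, a single group in both labelings).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Made-up labels: the same grouping expressed with different codes.
true_labels = [2, 2, 2, 4, 4, 4]
relabeled = [7, 7, 7, 0, 0, 0]
# A different grouping that only partly agrees with the true one.
partial = [0, 0, 1, 1, 2, 2]

score_same = adjusted_rand_index(true_labels, relabeled)
score_partial = adjusted_rand_index(true_labels, partial)
score_partial_swapped = adjusted_rand_index(partial, true_labels)
```

The index is also symmetric in its two arguments, so the order in which predicted and true labels are passed does not change the score.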
  • Evaluate the selected microbiome species
selected_features = prob_selected.sum(axis=0)/prob_selected.shape[0]
df = {'Microbiome_species': dataset_A_count.columns, 'Selected_probability': selected_features}
clus_selected = pd.DataFrame(data=df).sort_values(by='Selected_probability', ascending=False).reset_index(drop=True)
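As a small self-contained illustration of this last step, the sketch below averages a made-up selection-probability matrix over samples, ranks the species, and keeps those above a cut-off. The matrix values, the species names, the column name Mean_selection_probability, and the 0.5 threshold are all assumptions chosen for the example, not values from the paper.

```python
import numpy as np
import pandas as pd

# Made-up selection probabilities: 4 samples (rows) x 3 species (columns).
prob_selected = np.array([
    [0.9, 0.2, 0.7],
    [0.8, 0.1, 0.5],
    [1.0, 0.3, 0.6],
    [0.9, 0.2, 0.6],
])
species = ["OTU_1", "OTU_2", "OTU_3"]  # hypothetical species names

# mean(axis=0) is equivalent to sum(axis=0) / shape[0] used above.
mean_prob = prob_selected.mean(axis=0)

# Rank species from most to least frequently selected.
ranking = (
    pd.DataFrame({"Microbiome_species": species,
                  "Mean_selection_probability": mean_prob})
    .sort_values(by="Mean_selection_probability", ascending=False)
    .reset_index(drop=True)
)

# Arbitrary illustrative cut-off; choose one appropriate for your data.
threshold = 0.5
selected = ranking[ranking["Mean_selection_probability"] >= threshold]
print(selected)
```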

Get the human gut microbiome datasets

We used the human gut microbiome datasets from Duvallet et al. (2017): http://dx.doi.org/10.1038/s41467-017-01973-8. Additional information about the datasets is available in the MicrobiomeHD GitHub repo. MicrobiomeHD contains all 28 datasets of human gut microbiome studies in health and disease. In our paper, we used only 3 of them: cdi_schubert (dataset B), ibd_gevers (dataset C) and ob_goodrich (dataset D).

Directory structure

Data

  • The main files for dataset A of our paper, in the data/ folder, are:
    • datasetA_count.csv: an OTU count table for dataset A
    • datasetA_meta.csv: metadata that includes the true group label for each sample in dataset A
    • datasetA_tree.nwk: the phylogenetic tree for dataset A
  • Please follow the instructions in the MicrobiomeHD GitHub repo to get the OTU tables, metadata and phylogenetic trees for datasets B, C and D
  • Please follow the instructions in HMP16SData to get the OTU table, metadata and phylogenetic tree for dataset E
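When you prepare your own count table and metadata, it can help to confirm that both files describe the same samples in the same order before comparing predicted clusters with true labels. The sketch below uses tiny in-memory tables in place of the real CSV files; the sample IDs, OTU names and the "H" label are made up, and only the Label column name and the "D" value come from the usage example.

```python
import io

import pandas as pd

# Tiny made-up tables standing in for a count table and its metadata file.
count_csv = io.StringIO(
    "sample,OTU_1,OTU_2,OTU_3\n"
    "S1,10,0,3\n"
    "S2,0,7,1\n"
    "S3,4,2,0\n"
)
meta_csv = io.StringIO(
    "sample,Label\n"
    "S3,H\n"
    "S1,D\n"
    "S2,H\n"
)

counts = pd.read_csv(count_csv, index_col=0)
meta = pd.read_csv(meta_csv, index_col=0)

# Samples present in one table but missing from the other.
missing_in_meta = counts.index.difference(meta.index)
missing_in_counts = meta.index.difference(counts.index)
if len(missing_in_meta) or len(missing_in_counts):
    raise ValueError("Count table and metadata do not cover the same samples")

# Reorder the metadata so that row i matches row i of the count table.
meta_aligned = meta.loc[counts.index]
```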

Source code

All of the code is in the src/ folder; you can use it to reproduce the analyses, figures and tables in the paper:

  • DMM_Dirichlet_SVVS.py: contains the Python code for the SVVS algorithm.
  • SVVS_application.py: runs the SVVS algorithm on a specific dataset. Its outputs are the ARI score for each dataset (Table 3) and the number of selected species (Tables S1 and S2).
  • Figure_phylogenetic_analysis.R: performs the phylogenetic analysis for each dataset. Its output is Figure 2.
  • Figure_NMDS_analysis.R: produces the NMDS figures. Its outputs are Figures 1 and 3.

If you have any problems, please contact me via email: [email protected]
