SIFA (tumor Subclone Identification by Feature Allocation) is a Bayesian method to identify tumor subclones using whole-genome sequencing (WGS) data. This page will guide you through the basic steps of using SIFA.
SIFA currently requires at least two samples, since a unique tree cannot be identified from a single sample.
Zeng, L., Warren, J.L. and Zhao, H. (2017) Phylogeny-based tumor subclone identification using Bayesian feature allocation model.
SIFA is written in R and C++. Please install the following packages in R prior to implementing our software:
- data manipulation: tidyr, reshape, dplyr
- segment calling: copynumber
- Bayesian analysis: coda
- integrating C++ functionality: Rcpp, RcppArmadillo
- visualization: ggplot2, igraph
- others: gtools
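The dependency check can be scripted. The sketch below is one way to find which of the packages listed above are missing. It assumes that copynumber is distributed through Bioconductor rather than CRAN (please confirm availability for your R version), and the helper name find_missing is illustrative and not part of SIFA.

```r
# Packages named in the list above, grouped by the repository they are
# assumed to come from (assumption: copynumber is a Bioconductor package).
cran_pkgs <- c("tidyr", "reshape", "dplyr", "coda", "Rcpp",
               "RcppArmadillo", "ggplot2", "igraph", "gtools")
bioc_pkgs <- c("copynumber")

# Hypothetical helper: return the subset of pkgs whose namespace cannot be loaded.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_cran <- find_missing(cran_pkgs)
missing_bioc <- find_missing(bioc_pkgs)

message("Missing CRAN packages: ", paste(missing_cran, collapse = ", "))
message("Missing Bioconductor packages: ", paste(missing_bioc, collapse = ", "))

# Installation is left commented out so that running this block only reports.
# if (length(missing_cran) > 0) install.packages(missing_cran)
# if (length(missing_bioc) > 0) {
#   if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#   BiocManager::install(missing_bioc)
# }
```

Keeping the install calls commented out means the block can be run safely on any machine to see what is missing before anything is changed.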
SIFA takes an .Rdata file as input.
The .Rdata file should contain a list obs_data with required fields:
- obs_data$D for total reads matrix
- obs_data$X for mutant reads matrix
- obs_data$loc for mutation location matrix
- obs_data$segments for loci segmentation matrix
When obs_data$segments is not provided, we will use the information provided in obs_data$loc and obs_data$D to call the genome segments.
Input format:
- obs_data$D and obs_data$X should take the following format:

  | loci | sample 1 | sample 2 | sample 3 | ... |
  |---|---|---|---|---|
  | locus 1 | 12 | 11 | 33 | ... |
  | locus 2 | 5 | 8 | 7 | ... |
  | ... | ... | ... | ... | ... |
  | locus J | 22 | 10 | 17 | ... |

- obs_data$loc should take the following format:

  | chromosome | position | gene |
  |---|---|---|
  | 1 | 7660469 | CAMTA1 |
  | 3 | 88482840 | |
  | 13 | 102703724 | FGF14 |
  | ... | ... | ... |
  | 23 | 153383479 | |

For loci in non-coding regions, the gene column can be left blank.
- obs_data$segments should take the following format:

  | segments | start | end |
  |---|---|---|
  | segment 1 | 1 | 5 |
  | segment 2 | 6 | 25 |
  | segment 3 | 26 | 40 |
  | ... | ... | ... |
  | segment S | 155 | J |

Each row of the matrix represents one segment, with the two entries marking the starting and ending locus of the segment.
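To make the input layout concrete, the following sketch builds a toy obs_data list in the format described above and saves it to an .Rdata file. All read counts, the extra positions, the file name toy_input.Rdata, and the helper check_obs_data are made up for illustration. Storing loc as a data frame is an assumption (its columns mix numbers and text), and the check that segments tile loci 1 to J without gaps is read off the example table rather than a documented requirement.

```r
J <- 6                                   # number of loci (toy value)
sample_names <- c("sample 1", "sample 2", "sample 3")

set.seed(1)
# Total reads D and mutant reads X: one row per locus, one column per sample.
D <- matrix(sample(20:60, J * 3, replace = TRUE), nrow = J,
            dimnames = list(paste("locus", 1:J), sample_names))
X <- matrix(rbinom(J * 3, size = as.vector(D), prob = 0.3), nrow = J,
            dimnames = dimnames(D))

# Mutation locations (toy values); gene is blank for non-coding loci.
loc <- data.frame(chromosome = c(1, 1, 1, 3, 3, 13),
                  position   = c(7660469, 7660900, 7661500,
                                 88482840, 88483500, 102703724),
                  gene       = c("CAMTA1", "CAMTA1", "CAMTA1", "", "", "FGF14"),
                  stringsAsFactors = FALSE)

# Each row is one segment: index of its first and last locus.
segments <- matrix(c(1, 3,
                     4, 5,
                     6, 6), ncol = 2, byrow = TRUE,
                   dimnames = list(paste("segment", 1:3), c("start", "end")))

obs_data <- list(D = D, X = X, loc = loc, segments = segments)

# Hypothetical sanity check; stops with an error if the layout is inconsistent.
check_obs_data <- function(obs) {
  J <- nrow(obs$D)
  stopifnot(
    identical(dim(obs$D), dim(obs$X)),        # same loci x samples layout
    ncol(obs$D) >= 2,                         # at least two samples are required
    all(obs$X >= 0), all(obs$X <= obs$D),     # mutant reads cannot exceed total reads
    nrow(obs$loc) == J                        # one location row per locus
  )
  if (!is.null(obs$segments)) {               # segments may be omitted
    s <- obs$segments
    stopifnot(
      s[1, 1] == 1, s[nrow(s), 2] == J,       # cover locus 1 through locus J
      all(s[, 1] <= s[, 2]),
      all(s[-1, 1] == s[-nrow(s), 2] + 1)     # assumed: no gaps or overlaps
    )
  }
  invisible(TRUE)
}

check_obs_data(obs_data)
save(obs_data, file = file.path(tempdir(), "toy_input.Rdata"))
```

The file is written to a temporary directory here; for a real run you would save it wherever SIFA_app.R will load it from.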
To use SIFA, clone this repository and set the R working directory to SIFA_package. Make sure all the dependencies are correctly installed and your input .Rdata file is ready, then open the source file SIFA_app.R and execute the commands line by line, following the instructions below.
- In the MODEL INPUT section of the code, load the .Rdata where your inputs are saved, specify random seed myseed, and specify the folder foldername to store output files (a new folder will be created if it does not exist). For example:

      #############################################
      ########## MODEL INPUT ######################
      #############################################
      load("example.Rdata")
      myseed = 1                # set random seed
      foldername = "temp_out"   # set output foldername
      dir.create(foldername)    # folder where outputs are saved

- Next, you need to specify Bayesian sampling parameters in specify_pars.R. For most of the parameters, default values work just fine. Some of the parameters you can change are:

      #### maximum number of copy
      Params$max_CN=4
      #### maximum number of mutant copies
      Params$max_mut=2
      #### MCMC sampling parameters
      MCMC_par$burnin=4000  # burnin sample size
      MCMC_par$Nsamp=4000   # number of samples for inference
      MCMC_par$Ntune=2000   # number of samples used for adaptive parameter tuning
      Nclone=c(3:7)         # candidate subclone numbers K

- Run the remaining sections one by one:
  - sampler.R to perform sampling
  - Model_select.R to perform model selection. Plot of model selection will be saved in selection.pdf
  - Fit_visual(foldername,X,D) for results visualization:
    - Visualization results will list top 3 frequent trees (when >= 3 tree structures exist) in posterior samples, and display corresponding parameter estimations.
  - get_point_estimate():
    - get parameter point estimates from a given posterior sample .Rdata file
    - will identify up to top 3 trees from posterior samples, and calculate point estimates for each tree
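The idea of ranking tree structures by how often they appear in the posterior samples can be illustrated in a few lines. This is a minimal sketch and not SIFA's own implementation: it assumes each sampled tree is encoded as a parent vector (entry k is the parent of subclone k, with 0 marking the root), and both the toy samples and the helper top_trees are hypothetical.

```r
# Toy posterior samples: one sampled tree per row, encoded as a parent vector.
# The encoding and the values are assumptions for illustration only.
tree_samples <- rbind(
  c(0, 1, 1, 2),
  c(0, 1, 2, 3),
  c(0, 1, 1, 2),
  c(0, 1, 1, 1),
  c(0, 1, 1, 2),
  c(0, 1, 2, 3)
)

# Return the posterior frequency of up to n_top most common tree structures.
top_trees <- function(samples, n_top = 3) {
  keys <- apply(samples, 1, paste, collapse = "-")  # one text label per sampled tree
  freq <- sort(table(keys), decreasing = TRUE)      # count each distinct structure
  prop <- as.vector(freq) / length(keys)            # convert counts to proportions
  names(prop) <- names(freq)
  head(prop, n_top)                                 # fewer than n_top if fewer exist
}

top <- top_trees(tree_samples)
print(top)
```

Point estimates for a given tree could then be computed from only those posterior samples whose label matches that tree.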
During the sampling process, samples for each individual K will be stored in one .Rdata file.
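Because the saved samples are ordinary .Rdata files, they can be inspected without rerunning the sampler. The sketch below lists every .Rdata file in an output folder and reports the names of the objects stored in each one. The file names and object names that SIFA writes are not specified here, so the helper (list_rdata_contents, a hypothetical name) makes no assumption about them.

```r
# Hypothetical helper: map each .Rdata file in a folder to the objects it contains.
list_rdata_contents <- function(folder) {
  files <- list.files(folder, pattern = "\\.[Rr][Dd]ata$", full.names = TRUE)
  contents <- lapply(files, function(f) {
    env <- new.env()        # load into a scratch environment, not the workspace
    load(f, envir = env)
    sort(ls(env))           # names of the objects stored in this file
  })
  names(contents) <- basename(files)
  contents
}

# Usage, with "temp_out" being the output folder from the MODEL INPUT example:
# str(list_rdata_contents("temp_out"))
```

Loading into a scratch environment avoids overwriting objects such as obs_data that are already in the workspace.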
- Estimated phylogenetic tree:
- Estimated subclone mutated copy numbers:
- Estimated subclone total copy numbers:
- Estimated subclone fractions across samples:
Please feel free to contact [email protected] if you have any questions.