caroljlsun/sysbiol_pt3


Investigation of DNA structural motifs, CpG methylation and their role in transcriptional regulation

Welcome to my GitHub repo for my Part III project.

The code for this systems biology project is mostly within the folder "Code". The trained models, mostly CNNs, are within the folder "Models".

This project mainly investigates how we can predict the presence of G-quadruplexes (G4s) within promoters in the mouse genome. Three machine learning (ML) methods were assessed for this purpose: Support Vector Machines (SVMs), Random Forests (RFs) and Convolutional Neural Networks (CNNs). The best predictive model is a convolutional neural network, named ML_5_model, with an AUC of 0.88. Its model architecture is:

image

The features used in training the models are shown: image

The second part of the project investigates whether there are cell-type-specific features associated with G4s. To do this, the top 100 predicted G4s from the SVMs, RFs and CNNs were listed, the gene names at these G4 locations were identified, and the resulting gene lists were analysed using Enrichr as well as "goProfiles" in R. In general, all ML methods were found to score promoters of white-blood-cell-related genes highly. This is quite interesting, considering how many of the features used in training carry B and T cell information. For example, the cell types associated with the genes the SVMs found:

image

Finally, there is a positive correlation between high G4 score and gene expression:

image
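The ranking step in the second part of the project (taking the top 100 predicted G4s from each model) can be sketched roughly as follows. This is a minimal illustration and not the code from G4_scores_CNN_1: the function name `top_k_predictions`, the promoter identifiers and the scores are all hypothetical, and the sketch assumes each promoter already has a single predicted G4 score.

```python
import heapq


def top_k_predictions(scores, k=100):
    """Return the k (promoter_id, score) pairs with the highest score, best first.

    `scores` is assumed to map a promoter identifier to a predicted G4 score.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical promoter identifiers and predicted G4 scores, for illustration only.
example_scores = {
    "promoter_A": 0.91,
    "promoter_B": 0.15,
    "promoter_C": 0.78,
    "promoter_D": 0.55,
}

top_two = top_k_predictions(example_scores, k=2)
print(top_two)  # [('promoter_A', 0.91), ('promoter_C', 0.78)]
```

With k=100 the same call would give the kind of list that is then mapped to gene names for the enrichment analysis.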

Guide to code uploaded

Here is a quick description of the code uploaded in this GitHub repo:

Collapse_arrays changes the feature vector into a form that is useable for SVMs and RFs

Correlations and Correlations_negative_G4s explore the correlations of the features used in ML training

DeSeq2 and Expression files explore the expression data in the feature vector (FV)

FV_ * G4_ * and G_quadruplex _ * _ * files process raw G4 counts into a form useable in the FV

FV _* minimal _ 1 files make differently sized FVs, used to assess ML code before using the full size FV

Feature_masking trains a CNN using the one-hot encoding of the promoters only

Feature_masking 2 trains a CNN using all features except the one-hot encoding

G4_scores_CNN_1 finds the top 100 G4 predictions using the best CNN model

GC_content calculates the GC content of a promoter sequence using a rolling window of 25 bp

GO_profiles compares the gene lists predicted by SVMs, RFs and CNNs

Histones * _ * _ processes raw histone data

Kmer_PCA_UMAP_copy and write_2_fasta_copy conduct k-mer analysis of the promoters with and without G4s

Methylated_positions processes raw methylation data to get the positions of methylated bases in promoters

Methylkit_2 uses the methylation data to get DNAshapeR features for the FV

R_loops obtains R-loop positions

Random_forests_1 trains multiple RFs

SVM_1 trains multiple SVMs

Saliency_maps_*_ finds the saliency maps of the multiple CNNs trained in this project

Venn diagram of G quadruplexes briefly assesses how many G4s are shared between different experimental conditions

compile_sequences lists all the sequences of the promoters used
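Two of the sequence-derived features mentioned in this guide, the one-hot encoding of a promoter (Feature_masking) and the rolling-window GC content (GC_content), can be illustrated with a short sketch. This is not the code from the repo: the function names, the A/C/G/T column order, the handling of non-ACGT bases and the choice of one value per full window are all assumptions made for the example. Only the 25 bp window size comes from the description above.

```python
def one_hot_encode(sequence):
    """One-hot encode a DNA sequence as a list of [A, C, G, T] rows.

    Assumption: any base outside A/C/G/T (for example N) becomes an all-zero row.
    """
    order = "ACGT"
    return [[1 if base == letter else 0 for letter in order]
            for base in sequence.upper()]


def rolling_gc_content(sequence, window=25):
    """Return the GC fraction of every full window of `window` bases.

    Assumption: windows slide one base at a time and partial windows at the
    ends are not reported, so the result has len(sequence) - window + 1 values.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    seq = sequence.upper()
    if len(seq) < window:
        return []
    is_gc = [1 if base in "GC" else 0 for base in seq]
    count = sum(is_gc[:window])
    fractions = [count / window]
    # Slide the window: add the base entering it, drop the base leaving it.
    for i in range(window, len(seq)):
        count += is_gc[i] - is_gc[i - window]
        fractions.append(count / window)
    return fractions


# A short toy sequence with a 4 bp window, for illustration only.
print(rolling_gc_content("GGCCAATT", window=4))  # [1.0, 0.75, 0.5, 0.25, 0.0]
print(one_hot_encode("ACGN"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```

For a real promoter the default `window=25` would be used, giving one GC fraction per 25 bp window along the sequence.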

Guide to models uploaded

Here is a quick description of the models uploaded in this GitHub repo:

ML_1 a CNN which uses 2D convolution, AUC = 0.7772

ML_4_model a CNN which uses the full FV and 1D convolution, AUC = 0.84

ML_5_model a CNN which is feature masked, only looks at one-hot encoding, AUC = 0.88

ML_6_model a CNN which is feature masked, looks at all features except one-hot encoding, AUC = 0.81
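To make the AUC values above easier to interpret, here is a small sketch of how such a score can be computed. It assumes that "AUC" means the area under the ROC curve, which equals the probability that a randomly chosen positive example (a promoter with a G4) is scored higher than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are hypothetical, and the models in this repo were presumably evaluated with a library routine rather than this pairwise loop.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` holds 1 (G4 present) or 0 (G4 absent); `scores` holds model outputs.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.75
```

Under this reading, an AUC of 0.88 means a promoter with a G4 is ranked above one without in about 88% of such pairs, while 0.5 would correspond to random ranking.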

About

Code for part III systems biology main project
