AI Talent Hub Hackathon 2023, Data Driven Life Science track
This tool aims to give researchers a simpler and more effective workflow than conventional tools such as DESeq2 for detecting biomarkers: genes that may indicate a specific biological condition (e.g. a disease).
Team:
- Fedor Logvin (ITMO University, Saint Petersburg, Russia)
- Anton Changailidi (ITMO University, Saint Petersburg, Russia)
- Danil Trotsenko (ITMO University, Saint Petersburg, Russia)
- Timur Sheydaev (ITMO University, Saint Petersburg, Russia)
- Xenia Sukhanova (ITMO University, Saint Petersburg, Russia)
The analysis consists of the following steps:
- A maximum-based low counts filter (Rau et al.) to eliminate genes with low counts.
- Bayesian search to obtain the best hyperparameter values for the XGBoost and elastic net logistic regression models.
- Each model is fitted on a specified number of random subsamples, each containing 80% of the rows, using the best hyperparameters found in the previous step.
- For both models, (n_obs) important genes are retained from each iteration; at the end, all genes that occur in a specified number of iterations are kept.
- To identify whether the expression of the selected genes differs significantly between the defined groups, a Mann-Whitney U test is performed. FDR is controlled at level α = 0.05.
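The steps above can be illustrated with a minimal sketch. It is not the tool's actual code: it uses only the elastic net model from scikit-learn (XGBoost and the Bayesian hyperparameter search are left out), a fixed filter threshold instead of a data-driven one, absolute coefficients as the importance measure, "at least min_hits iterations" as the retention rule, and the Benjamini-Hochberg procedure for FDR control. All function names, parameter values and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression


def max_count_filter(counts, threshold):
    """Boolean mask of genes (columns) whose maximum count reaches the threshold."""
    return counts.max(axis=0) >= threshold


def stable_genes(X, y, n_iter=20, n_top=3, min_hits=15, seed=0):
    """Indices of genes ranked among the n_top largest |coefficients|
    in at least min_hits of n_iter random 80% subsamples (assumed rule)."""
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    hits = np.zeros(n_genes, dtype=int)
    for _ in range(n_iter):
        idx = rng.choice(n_samples, size=int(0.8 * n_samples), replace=False)
        model = LogisticRegression(penalty="elasticnet", solver="saga",
                                   l1_ratio=0.5, C=1.0, max_iter=5000)
        model.fit(X[idx], y[idx])
        top = np.argsort(np.abs(model.coef_[0]))[-n_top:]
        hits[top] += 1
    return np.flatnonzero(hits >= min_hits)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj


# Synthetic data: 40 samples (rows), 30 genes (columns), two groups of 20.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 20)
lam = np.full((40, 30), 50.0)
lam[:, 25:] = 0.5        # five low-count genes
lam[y == 1, 0] = 200.0   # gene 0 is higher in group 1
lam[y == 1, 1] = 10.0    # gene 1 is lower in group 1
counts = rng.poisson(lam)

# Step 1: drop low-count genes.
keep = max_count_filter(counts, threshold=10)
kept_ids = np.flatnonzero(keep)

# Steps 2-3: log-transform, standardize, then select genes on 80% subsamples.
X = np.log1p(counts[:, keep])
X = (X - X.mean(axis=0)) / X.std(axis=0)
selected = kept_ids[stable_genes(X, y)]

# Step 4: Mann-Whitney U test per selected gene, FDR controlled at 0.05.
pvals = [mannwhitneyu(counts[y == 0, g], counts[y == 1, g]).pvalue
         for g in selected]
significant = selected[benjamini_hochberg(pvals) < 0.05]
print("Significant genes:", significant)
```

Counting how often a gene is selected across subsamples makes the final list less sensitive to any single split of the data, and the final test checks each retained gene independently of the models.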
- The custom Rau filter is written in C# and requires the .NET framework, which can be found here.
- Python 3.9+ is recommended; older versions were not tested.
- Required Python packages are listed in requirements.txt. Keep in mind that scikit-optimize requires an older NumPy version (<=1.23.5).
- Specify the Telegram bot token and a logging directory in config.toml.
- On Linux, install the systemd services for the Dash app and the Telegram bot (copy the service config files to /lib/systemd/system/).
- Start the services with systemctl.
- Run a Redis server for the RQ job scheduler.