DNA Sequence Encoding and Classification

This project compares three encodings of DNA sequences (di-nucleotide frequencies, tri-nucleotide frequencies, and a Fourier transform) to find which one works best for classification. Each encoding is used to train and evaluate a Random Forest classifier.

Objective

To identify which of the three encodings (di-nucleotide, tri-nucleotide, or Fourier transform) is most effective for classifying DNA sequences.

Dataset

The dataset contains DNA sequences and their associated class labels, which represent different biological categories. It is loaded from a TSV file.
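Loading the file could look like the minimal sketch below. The README does not state the column names, so `sequence` and `class` are hypothetical placeholders, and the upper-casing and dropping of incomplete rows are assumptions rather than steps confirmed by the project.

```python
import pandas as pd


def load_dataset(path, sequence_col="sequence", label_col="class"):
    """Read a tab-separated file and return (sequences, labels) as lists.

    The column names are hypothetical; adjust them to match the header
    of the actual TSV file.
    """
    df = pd.read_csv(path, sep="\t")
    # Drop rows with a missing sequence or label (an assumption, not a
    # step documented by the project).
    df = df.dropna(subset=[sequence_col, label_col])
    # Normalise case so that 'acgt' and 'ACGT' are treated alike.
    sequences = df[sequence_col].astype(str).str.upper().tolist()
    labels = df[label_col].tolist()
    return sequences, labels
```

With the layout described under Project Structure, a call might be `load_dataset("data/extraA3_Data.tsv")`, provided the column names match.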

Project Structure

- dna_sequence_classification.py: Main Python file containing data processing, feature engineering, model training, and evaluation code.
- README.md: Project overview and instructions (this file).
- data: Folder for storing the input dataset file (extraA3_Data.tsv).

Feature Engineering and Encodings

- Di-nucleotide Encoding: Calculates the frequency of each di-nucleotide (two consecutive nucleotides) within each DNA sequence.
- Tri-nucleotide Encoding: Calculates the frequency of each tri-nucleotide (three consecutive nucleotides) within each DNA sequence.
- Fourier Transform Encoding: Maps each nucleotide to a numerical value and applies the Fourier transform to the resulting signal, capturing frequency information.
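The three encodings can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the README does not say whether k-mer counts are overlapping or normalised, which numeric value each nucleotide maps to, or how many Fourier coefficients are kept. The helper names, the `NUC_TO_NUM` mapping, and the `n_coeffs` parameter are all hypothetical.

```python
from collections import Counter
from itertools import product

import numpy as np

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k):
    """Relative frequencies of all 4**k k-mers, in lexicographic order.

    k=2 gives a di-nucleotide encoding (16 features),
    k=3 gives a tri-nucleotide encoding (64 features).
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    seq = sequence.upper()
    # Count overlapping windows of length k (an assumption).
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing symbols outside ACGT are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return np.zeros(len(kmers))
    return np.array([counts[km] / total for km in kmers])


# Hypothetical integer mapping; the project may use a different one.
NUC_TO_NUM = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def fourier_features(sequence, n_coeffs=16):
    """Magnitudes of the first n_coeffs real-FFT coefficients.

    Shorter spectra are zero-padded so every sequence yields a vector
    of the same length.
    """
    signal = np.array([NUC_TO_NUM.get(b, 0.0) for b in sequence.upper()])
    features = np.zeros(n_coeffs)
    if signal.size == 0:
        return features
    spectrum = np.abs(np.fft.rfft(signal))
    m = min(n_coeffs, len(spectrum))
    features[:m] = spectrum[:m]
    return features
```

Stacking `kmer_frequencies(s, 2)`, `kmer_frequencies(s, 3)`, or `fourier_features(s)` over all sequences gives one feature matrix per encoding, each with a fixed number of columns regardless of sequence length.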

Model and Evaluation

For each encoding, a Random Forest classifier is trained and evaluated with stratified 10-fold cross-validation. The reported metrics are the F1 score and Average Precision (AP), each summarized as a mean and standard deviation across folds.
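The evaluation loop might look like the sketch below, built on scikit-learn. It assumes binary labels encoded as 0/1; the README does not state the number of classes, the Random Forest hyperparameters, or the random seed, so those are placeholders. The function name `cross_validate_encoding` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import StratifiedKFold


def cross_validate_encoding(X, y, n_splits=10, seed=0):
    """Return (mean F1, std F1, mean AP, std AP) over stratified folds.

    Assumes binary labels encoded as 0/1 (an assumption; a multi-class
    dataset would need an averaging strategy for both metrics).
    """
    X, y = np.asarray(X), np.asarray(y)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    f1s, aps = [], []
    for train_idx, test_idx in skf.split(X, y):
        # Hyperparameters are placeholders, not the project's settings.
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        # F1 is computed from hard predictions.
        f1s.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
        # AP is computed from the predicted probability of class 1.
        proba = clf.predict_proba(X[test_idx])[:, 1]
        aps.append(average_precision_score(y[test_idx], proba))
    return np.mean(f1s), np.std(f1s), np.mean(aps), np.std(aps)
```

Calling this once per feature matrix yields the four summary numbers reported for each encoding. Stratification keeps the class ratio roughly equal in every fold, which matters when classes are imbalanced.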

Results Summary

| Encoding Method | Mean F1 Score | Std. Dev. F1 Score | Mean Average Precision | Std. Dev. Average Precision |
| --- | --- | --- | --- | --- |
| Di-nucleotide | 0.6148 | 0.0418 | 0.5475 | 0.0615 |
| Tri-nucleotide | 0.5324 | 0.0751 | 0.5372 | 0.0971 |
| Fourier Transform | 0.4278 | 0.0015 | 0.2834 | 0.0458 |

Precision-Recall Analysis

The following precision-recall curves show the performance of each encoding method across all folds. The di-nucleotide and tri-nucleotide encodings perform better than the Fourier transform encoding.

Di-Nucleotide Encoding Precision-Recall Curve
Tri-Nucleotide Encoding Precision-Recall Curve
Fourier Transform Encoding Precision-Recall Curve
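Curves like these could be produced with a helper along the following lines. It is a sketch only: it assumes binary labels and that the true labels and predicted scores of each fold were collected during cross-validation. The function name `plot_pr_curves` and its arguments are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve


def plot_pr_curves(fold_results, title, out_path):
    """Plot one precision-recall curve per fold and save the figure.

    fold_results: list of (y_true, y_score) pairs, one per fold,
    with binary labels (an assumption).
    """
    fig, ax = plt.subplots(figsize=(6, 5))
    for i, (y_true, y_score) in enumerate(fold_results, start=1):
        precision, recall, _ = precision_recall_curve(y_true, y_score)
        ap = average_precision_score(y_true, y_score)
        ax.plot(recall, precision, label=f"Fold {i} (AP = {ap:.2f})")
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_title(title)
    ax.legend(loc="lower left", fontsize="small")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

One call per encoding, with the figure title used as the file name, would give three image files to compare side by side.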

How to Run

Clone the repository:

git clone https://github.com/your_username/dna-sequence-classification.git
cd dna-sequence-classification

Install required packages:

pip install pandas numpy scikit-learn matplotlib scipy

Run the code:
```
python dna_sequence_classification.py
```
Results:
- Outputs F1-scores and Average Precision for each encoding method.
- Generates precision-recall curves for visual comparison.

Insights

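- Di-nucleotide Encoding: Achieved the highest mean F1 score (0.6148) and the highest mean Average Precision (0.5475), making it the best-performing encoding in this comparison.
- Tri-nucleotide Encoding: Reached a similar mean Average Precision (0.5372) but a lower mean F1 score (0.5324) with more variation across folds; it may merit further exploration.
- Fourier Transform Encoding: Scored lowest on both metrics (mean F1 0.4278, mean AP 0.2834), suggesting it captures the sequence patterns relevant to this task less effectively.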

License

This project is licensed under the MIT License. See the LICENSE file for details.
