This project contains Python scripts that implement a small-scale genome assembly workflow: greedy overlap assembly, followed by coverage analysis and open reading frame (ORF) detection using motif-based scoring. It demonstrates how raw sequencing reads can be assembled into contigs and how biological features such as ORFs can be extracted from the assembled sequences.
The assembly script:
- Reads short DNA reads from a text file
- Computes pairwise overlaps using suffix–prefix matching
- Merges reads into longer contigs based on a minimum overlap
- Removes redundant contigs
- Writes final contigs to FASTA and CSV
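
The assembly steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`overlap`, `greedy_assemble`, `to_fasta`), the default minimum overlap of 3, and the `contig_N` header format are all assumptions made for the example, and file reading and CSV output are left out.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            size = overlap(a, b, min_len)
            if size > best_len:
                best_len, best_a, best_b = size, a, b
        if best_len == 0:
            break  # no remaining pair overlaps by at least min_len
        contigs.remove(best_a)
        contigs.remove(best_b)
        # Append only the part of best_b that extends beyond the overlap.
        contigs.append(best_a + best_b[best_len:])
    return contigs


def to_fasta(contigs: list[str]) -> str:
    """Format contigs as FASTA text (header format is an assumption)."""
    return "".join(
        f">contig_{i} length={len(c)}\n{c}\n"
        for i, c in enumerate(contigs, start=1)
    )


if __name__ == "__main__":
    reads = ["ATGCGT", "CGTACC", "ACCTTA"]
    print(to_fasta(greedy_assemble(reads, min_len=3)))
```

Comparing every ordered pair on each round is quadratic in the number of sequences, which is acceptable for a small teaching dataset but would not scale to real sequencing runs.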
Contig refinement includes:
- Finding contigs that fully or partially contain each other
- Merging nested or overlapping contigs
- Producing a minimal set of long, non-redundant contigs
The coverage analysis:
- Calculates how many reads map to each contig
- Generates a horizontal bar plot of read coverage
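
The refinement and coverage steps above could look something like the sketch below. It rests on several assumptions: a contig is treated as redundant only when it is an exact substring of a longer one, a read "maps" to a contig when it occurs in it as an exact substring, and reverse complements are ignored. Merging partially overlapping contigs would need a suffix-prefix merge step, which is omitted here. The names `drop_contained`, `read_coverage`, and `plot_coverage`, and the output file `coverage.png`, are hypothetical; plotting assumes matplotlib is installed.

```python
def drop_contained(contigs: list[str]) -> list[str]:
    """Remove duplicates and contigs that are substrings of a longer contig."""
    # Longest first, so every contig is checked against all longer ones.
    unique = sorted(set(contigs), key=lambda c: (-len(c), c))
    kept: list[str] = []
    for contig in unique:
        if not any(contig in longer for longer in kept):
            kept.append(contig)
    return kept


def read_coverage(reads: list[str], contigs: list[str]) -> dict[str, int]:
    """Count, per contig, the reads that occur in it as an exact substring."""
    counts = {contig: 0 for contig in contigs}
    for read in reads:
        for contig in contigs:
            if read in contig:
                counts[contig] += 1
    return counts


def plot_coverage(counts: dict[str, int], path: str = "coverage.png") -> None:
    """Save a horizontal bar plot of reads per contig."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    labels = [f"contig_{i}" for i in range(1, len(counts) + 1)]
    fig, ax = plt.subplots()
    ax.barh(labels, list(counts.values()))
    ax.set_xlabel("Reads mapped")
    ax.set_ylabel("Contig")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


if __name__ == "__main__":
    contigs = drop_contained(["ATGCGTACC", "GCGT", "ATGCGTACC", "TTTT"])
    coverage = read_coverage(["ATGC", "GTAC", "TTTT", "GGGG"], contigs)
    print(coverage)
```

Exact substring matching keeps the example short; real reads contain sequencing errors, so a practical version would tolerate mismatches.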
The ORF detection script:
- Reads contigs from a FASTA file
- Uses a 13-base motif scoring matrix
- Searches for high-scoring start sites (ATG/GTG)
- Extends ORFs to stop codons (TAA, TAG, TGA)
- Keeps only ORFs that meet a minimum length
- Writes annotated ORFs to a FASTA output
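
A rough sketch of the ORF search described above is shown below. The project's actual scoring matrix is not reproduced here, so `TOY_MATRIX` is a placeholder built from a made-up 13-base consensus. The sketch also assumes that the 13-base window lies immediately upstream of the start codon, that only the forward strand is scanned, and that the minimum length is counted in nucleotides including the stop codon. The threshold and all function names are hypothetical.

```python
STARTS = {"ATG", "GTG"}
STOPS = {"TAA", "TAG", "TGA"}
MOTIF_LEN = 13

# Placeholder consensus and matrix for illustration only (not project data):
# +2.0 for the consensus base at each position, -1.0 for any other base.
TOY_CONSENSUS = "AAAGGAGGTGATC"
TOY_MATRIX = [
    {base: (2.0 if base == best else -1.0) for base in "ACGT"}
    for best in TOY_CONSENSUS
]


def motif_score(window: str, matrix=TOY_MATRIX) -> float:
    """Sum the per-position scores of a 13-base window."""
    return sum(col.get(base, 0.0) for col, base in zip(matrix, window))


def find_orfs(seq: str, threshold: float = 7.0, min_len: int = 30,
              matrix=TOY_MATRIX) -> list[tuple[int, int, float, str]]:
    """Return (start, end, score, sequence) for each accepted ORF."""
    seq = seq.upper()
    orfs = []
    for start in range(MOTIF_LEN, len(seq) - 2):
        if seq[start:start + 3] not in STARTS:
            continue
        score = motif_score(seq[start - MOTIF_LEN:start], matrix)
        if score < threshold:
            continue
        # Walk codon by codon, in frame, until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                orf = seq[start:pos + 3]
                if len(orf) >= min_len:
                    orfs.append((start, pos + 3, score, orf))
                break
    return orfs


def orfs_to_fasta(name: str, orfs) -> str:
    """Format ORFs as FASTA with position and score in the header."""
    return "".join(
        f">{name}_orf{i} start={s} end={e} score={sc:.1f}\n{orf}\n"
        for i, (s, e, sc, orf) in enumerate(orfs, start=1)
    )


if __name__ == "__main__":
    contig = TOY_CONSENSUS + "ATG" + "GCC" * 9 + "TAA" + "GGG"
    print(orfs_to_fasta("contig_1", find_orfs(contig)))
```

Start codons with no in-frame stop before the end of the contig are skipped in this sketch; whether such truncated ORFs should be reported is a design choice.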
What the project demonstrates:
- How genome assembly works using a greedy overlap approach
- How contigs can be refined and merged
- How biological features such as ORFs can be extracted using motifs and codon logic