TEXT TO SPEECH
MOHAMMED REHAN SAADI | SAZZAD ISLAM RAFEW | FARDEEN ABDULLAH TASEEN
PROJECT BRIEF
• Converts written text into natural-sounding speech using AI.
• Helps visually impaired users, audiobook creators, and AI-powered assistants.
• Uses deep learning models such as Tacotron2 and WaveGlow to generate high-quality
speech.
• Provides a realistic voice output with adjustable pitch and speed.
EXPECTED OUTCOME
• A system where users input text and receive clear, human-like speech
output.
• Supports multiple languages and voice customizations.
• Helps in accessibility, AI assistants, and content creation.
PROBLEM STATEMENT
• Existing TTS solutions are often expensive, robotic-sounding, or limited to a few languages.
• Visually impaired individuals struggle to access written content.
• Content creators need high-quality AI voices for audiobooks and podcasts.
• Solution: Our TTS system generates natural, expressive speech at low cost, making it
accessible and customizable.
METHODOLOGY OVERVIEW
The Text-to-Speech (TTS) system follows a structured process to convert text into natural-sounding speech
using deep learning techniques. Below are the key steps involved:
• Step 1: Text Input
User provides text input via a UI.
Text can be loaded from a file or typed directly.
• Step 2: Text Preprocessing
Normalize text (convert numbers, abbreviations, and symbols into readable words).
Remove unnecessary punctuation.
Convert text into phonemes for accurate pronunciation.
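The normalization part of Step 2 can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: the abbreviation and symbol tables, the function names, and the 0-999 number range are all assumptions chosen to keep the example short, and it handles English text only. Phoneme conversion is left out.

```python
import re

# Illustrative lookup tables (assumptions); a real system needs far larger ones.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "etc.": "et cetera"}
SYMBOL_WORDS = {"&": "and", "%": "percent"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize_text(text: str) -> str:
    """Lowercase, expand symbols/abbreviations/numbers, drop other punctuation."""
    text = text.lower()
    # Turn symbols into separate words before splitting into tokens.
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        # Keep only ASCII letters, digits, and apostrophes.
        token = re.sub(r"[^a-z0-9']", "", token)
        if token.isdigit() and int(token) <= 999:
            words.append(number_to_words(int(token)))
        elif token:
            words.append(token)
    return " ".join(words)


print(normalize_text("Dr. Smith has 42 cats & 3 dogs."))
```

Working token by token avoids expanding an abbreviation that merely appears inside a longer word.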
• Step 3: Feature Extraction
Tokenize text and convert it into a phonetic representation.
Extract linguistic and prosodic features.
• Step 4: Speech Synthesis Model
Use Tacotron2 or FastSpeech for sequence-to-sequence text-to-speech conversion.
Generate Mel spectrograms as an intermediate representation.
• Step 5: Audio Waveform Generation
Use WaveGlow or HiFi-GAN to convert Mel spectrograms into audio waveforms.
Apply post-processing for noise reduction and clarity.
• Step 6: Output & Playback
Play the generated speech audio.
Allow customization of voice parameters (pitch, speed, tone).
• Tools & Technologies Used:
gTTS, pyttsx3 (Basic TTS APIs)
Tacotron2, FastSpeech (Deep Learning Models)
WaveGlow, HiFi-GAN (Audio Waveform Generation)
Flask/Streamlit (User Interface)
FEATURE LIST
The core features of our project are the following:
• Text to Speech Conversion
Convert text to speech using deep learning models.
Aims for accurate pronunciation, natural rhythm, and intonation.
Uses open-source models such as Tacotron2 and FastSpeech for synthesis, with WaveGlow as the vocoder.
• Multi Language Support
Supports languages in addition to English.
Uses open-source datasets such as Common Voice (multilingual) and LJSpeech (English) to train speech synthesis for different languages.
Users can select their preferred language for text-to-speech conversion.
• Adjustable Voice and Speech Speed
Offers a range of voices, such as male, female, and robotic.
Allows speech speed control (slow, normal, or fast).
Generates high-quality audio files in formats such as MP3.
Includes a built-in audio player for listening to the generated speech.
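One simple way to picture the slow/normal/fast control is resampling the generated waveform. The sketch below is only an illustration under stated assumptions: the preset multipliers and function names are hypothetical, a sine tone stands in for synthesized speech, and the output is WAV (the Python standard library has no MP3 encoder). Naive resampling also shifts pitch, so a real system would more likely adjust durations inside the model or use a pitch-preserving time-stretch.

```python
import io
import math
import struct
import wave

# Hypothetical speed presets (playback-rate multipliers).
SPEED_PRESETS = {"slow": 0.75, "normal": 1.0, "fast": 1.5}


def make_tone(freq_hz: float, seconds: float, sample_rate: int = 22050) -> list:
    """Generate a sine tone as float samples in [-1, 1] (stand-in for speech)."""
    count = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


def change_speed(samples: list, speed: float) -> list:
    """Resample by linear interpolation so playback is `speed` times faster."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    out_len = int(len(samples) / speed)
    out = []
    for i in range(out_len):
        pos = i * speed
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def to_wav_bytes(samples: list, sample_rate: int = 22050) -> bytes:
    """Encode float samples as 16-bit mono WAV data held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        wav_file.writeframes(frames)
    return buffer.getvalue()


tone = make_tone(440.0, 1.0)
fast_audio = to_wav_bytes(change_speed(tone, SPEED_PRESETS["fast"]))
print(len(fast_audio), "bytes of WAV data")
```

Writing `fast_audio` to a file ending in `.wav` gives something any audio player can open.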
DATASET DETAILS
• Dataset Name: LJSpeech Dataset
• Source: Open-source dataset with 13,100 English audio clips
• Size: Approximately 24 hours of recorded speech
• Features:
Text – The sentence to be converted into speech
Audio File – Corresponding recorded human speech
Speaker ID – Identifies the speaker (only for multi-speaker datasets; LJSpeech has a single speaker)
Duration – Length of the audio clip
• Use Case: The model learns speech patterns from the text-audio pairs and converts new text into natural-sounding audio.
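LJSpeech ships a pipe-delimited metadata.csv (clip ID, transcript, normalized transcript) next to a wavs/ folder. The sketch below shows how text-audio pairs could be read from it and checks the average clip length implied by the figures above. The sample rows are made up for illustration and are not real dataset entries, and the function names are assumptions.

```python
import csv
import io

# Made-up rows in the metadata.csv layout: id|transcript|normalized transcript
SAMPLE_METADATA = (
    "LJ000-0001|Dr. Smith has 2 cats.|Doctor Smith has two cats.\n"
    "LJ000-0002|Hello, world!|Hello, world!\n"
)


def load_metadata(text: str, wav_dir: str = "wavs") -> list:
    """Parse metadata text into records that pair each transcript with its audio path."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines
        clip_id, transcript, normalized = row
        records.append({
            "id": clip_id,
            "text": transcript,
            "normalized_text": normalized,
            "audio_path": f"{wav_dir}/{clip_id}.wav",
        })
    return records


def average_clip_seconds(total_hours: float, n_clips: int) -> float:
    """Average clip duration in seconds."""
    return total_hours * 3600 / n_clips


records = load_metadata(SAMPLE_METADATA)
print(records[0]["audio_path"])
print(round(average_clip_seconds(24, 13100), 1), "seconds per clip on average")
```

With about 24 hours spread over 13,100 clips, the average clip is roughly 6.6 seconds long, which is why the dataset suits sentence-level training.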
TECHNOLOGY STACK
• Programming Language: Python
• Frameworks & Libraries:
Tacotron2, WaveGlow – Deep Learning models for speech synthesis
pyttsx3, gTTS – Simple text-to-speech conversion
Librosa – Audio processing
TensorFlow / PyTorch – Model training and optimization
• Web Framework (Optional): Flask / Streamlit (for the UI)
• Database (Optional): SQLite / Firebase (for storing user text inputs)
• Deployment: Google Cloud / AWS
TARGET MARKET
The target market for the Text-to-Speech system includes:
• Visually Impaired Individuals – Provides accessible reading options.
• Audiobook & Podcast Creators – Converts text into natural speech.
• Educational Institutions – Converts textbooks into audio for students.
• Elderly & Disabled Users – Assists with communication and reading.
THANK YOU!