
Voice and Hand Sign Language Recognition

Software Design Specification

By
Junaid Sajid 2021-BS-AI-018
Zalwisha Khan 2021-BS-AI-019
Muhammad Abdullah 2021-BS-AI-020

Supervised by

Dr Uzair Saeed

Bachelor of Science in Artificial Intelligence

DEPARTMENT OF COMPUTER SCIENCE

The University of Faisalabad


Revision History
Name Date Reason for changes Version

Dr Uzair Saeed 29-10-24 Abstract too long 2.0

Dr Uzair Saeed 31-10-24 Wrong heading 2.1

Dr Uzair Saeed 7-11-24 Report summary 2.2

Dr Uzair Saeed 6-11-24 Missing tables 2.3

Dr Uzair Saeed 27-11-24 Diagram changes 2.4

Application Evaluation History
Comment (by committee): Make it more clear.
Action Taken: Using two languages to make the system clearer and easier to understand.

Supervised by

Dr Uzair Saeed

Signature: __________________

Abstract
Many sign languages are used around the world by millions of deaf people and people with hearing difficulties. The most commonly used sign language standard is ASL (American Sign Language). The communication gap between signers and non-signers has existed for years and is now being narrowed by various techniques that automate the detection of sign gestures. In this project, the user can capture hand images through a web camera, and the system determines which gesture or hand sign is being performed. The gesture can then be translated into text or voice. For deaf users, spoken language is likewise transformed into sign language. The captured hand-gesture images and voice input are processed using image processing tools and machine learning algorithms. Additionally, the system incorporates voice recognition capabilities to enhance user interaction, and it provides a module where users can also learn sign languages. The final solution integrates these components into a user-friendly interface.

Table of Contents
CHAPTER 1: INTRODUCTION .................................................................................. 1

1.1 Introduction .................................................................................................... 1

1.2 Aim & Objectives .......................................................................................... 1

1.3 Problem Statement ......................................................................................... 1

1.4 Proposed System ............................................................................................ 2

1.5 Project Scope ................................................................................................. 2

1.6 Assumptions & Constraints ........................................................................... 2

1.7 Social Benefits ............................................................................................... 3

1.8 Business Plan ................................................................................................. 3

1.8.1 Business Model Canvas ................................................................................ 3

1.8.2 Problem ......................................................................................................... 4

1.8.3 Solution ......................................................................................................... 4

1.8.4 Customers ..................................................................................................... 4

1.8.5 Competitors ................................................................................................... 5

1.8.6 Marketing Plan .............................................................................................. 6

1.8.7 Revenue......................................................................................................... 6

1.8.8 SWOT (Strength Weakness Opportunities Threats) Analysis....................... 6

1.8.9 FAB (Features, Attributes, Benefits) Analysis .............................................. 7

1.9 Report Layout ................................................................................................ 8

CHAPTER 2: LITERATURE REVIEW/BACKGROUND AND EXISTING WORK 9

2.1 Background .................................................................................................... 9

2.1.1 Speech Recognition .................................................................................... 10

2.1.2 Hand sign recognition ................................................................................. 10

2.1.3 Training the System .................................................................................... 10

2.1.4 Feature Extraction ....................................................................................... 11

2.1.5 Real-time Processing .................................................................................. 11

2.1.6 Output Translation ...................................................................................... 11

2.2 Languages we are using: .............................................................................. 11

2.2.1 American Sign Language (ASL):................................................................ 11

2.2.2 British Sign Language (BSL)...................................................................... 12

2.3 Currently Running Apps in the Market:....................................................... 13

2.4 Literature Summary ..................................................................................... 15

CHAPTER 3: REQUIREMENTS ANALYSIS ........................................................... 16

3.1 Stakeholders List (Actors) ........................................................................... 16

3.2 Requirements Elicitation .............................................................................. 17

3.2.1 Functional Requirements ............................................................................ 18

3.2.2 Non-Functional Requirements .................................................................... 20

3.2.3 Requirements Traceability Matrix.............. 22

3.3 Use Case Design/Description ...................................................................... 23

3.4 Software Development Life Cycle Model ................................................... 26

3.5 Specific Requirements (Hardware and Software Requirements) ................ 27

3.5.1 Hardware Requirement ............................................................................... 27

3.5.2 Software Requirement ................................................................................ 28

CHAPTER 4: SOFTWARE DESIGN SPECIFICATION ........................................... 29

4.1 Design Models ............................................................................................. 29

4.2 Work Breakdown Structure.......................................................................... 29

4.3 System Architecture ..................................................................................... 30

4.3.1 Block Diagram ............................................................................................ 33

4.3.2 Component Diagram ................................................................................... 35

4.3.3 Software Architecture Diagram .................................................................. 36

1. Presentation Layer (User Interface) ............................................................. 37

4.4 Data Representation ..................................................................................... 41

4.4.1 Entity-Relationship Diagram (ERD)........................................................... 41

4.4.2 UML Class Diagram ................................................................................... 42

4.4.3 Data Flow Diagram (DFD) ......................................................................... 43

4.4.4 Hierarchical Diagram .................................................................................. 43

4.5 Process Flow/Representation ....................................................................... 44

4.5.1 Flowchart .................................................................................................... 45

4.5.2 Sequence Diagram ...................................................................................... 46

References ................................................................................................................ 48

List of Tables
Table 3.1 Software requirements ................................................................................. 28

CHAPTER 1: INTRODUCTION
1.1 Introduction

Gestures play a significant role in communication. For deaf and mute people, sign language plays an important part in everyday life. Sign language has its own vocabulary, syntax, and meaning; a person without basic knowledge of sign language cannot understand it, which causes many communication problems. Using deep learning algorithms and image processing, we can classify sign language gestures into their corresponding words or letters. Computer vision (CV) enables machines to interpret and process body movements and hand gestures visually. It tracks, detects, and analyzes body and hand movements in real time to translate them into meaningful output. Algorithms such as CNNs identify hand movement by focusing on the shape and position of the fingers, and the system extracts key features from gestures to differentiate between signs. Once a sign is recognized, it is mapped to meaningful text, and the text is then passed to a text-to-speech engine that converts it into voice, and vice versa [1], [2]. Users can also learn sign languages through the different courses that will be available.

1.2 Aim & Objectives

The purpose of this project is to facilitate real-time communication between sign language users and non-sign-language users. Using machine learning algorithms and computer vision, our aim is to develop a system that accurately recognizes gestures and voice, transforming gestures into voice output and on-screen text, and voice input into sign language. This system can be used in the health, education, and industrial sectors to minimize the communication gap.

1.3 Problem Statement

Sign language consists of many gestures which include various hand and arm movements. Existing systems lack real-time responsiveness and may not always be readily available, leading to communication challenges for the deaf and hard-of-hearing community. They provide facilities to learn sign language but do not interpret sign language to voice efficiently in real time. Some existing systems use dedicated hardware to convert sign language into voice, and they are limited to one or two sign languages. To use such systems, a person must already have knowledge of sign language.

1.4 Proposed System

For gesture-to-voice conversion, the movement is performed in front of a web camera that recognizes the gesture and converts it into the desired output; for voice-to-sign-language conversion, a microphone is used for input, and the conversion is performed using machine learning algorithms and computer vision. The proposed system uses computer vision techniques, including deep learning models, to capture and analyze sign language gestures. A CNN is used for gesture recognition, classifying the hand signs into different categories. We will train the CNN model on labeled images of hand signs, using augmentation techniques, and then test the model's validation performance and accuracy. After successful detection and translation, the system gives output in two ways: voice or on-screen text. For the reverse direction, speech-to-text conversion is used to convert voice into text, and the textual data is then converted into sign language. The voice output enables real-time interpretation for users who may not be familiar with sign language, while the on-screen text serves as a visual reference. This dual output mechanism ensures accessibility and inclusivity for a wider audience. [3], [4], [5], [6], [7]

1.5 Project Scope

The scope of the voice and sign language recognition system is to facilitate communication between spoken language users and sign language users. The goal is to build a multimodal system that enables real-time processing of both voice and gesture inputs, handles simultaneous input from voice and hand gestures, and combines both inputs into an actionable response. The system will support multiple languages if required (starting with a single language and extending later) and will provide a user-friendly interface so that users can interact with the system without in-depth knowledge of how it works.

1.6 Assumptions & Constraints

Assumptions:

• The system will be able to remove extra background noise for voice recognition.
• The system will be accessible on almost every device.
• Data privacy and user authentication will be ensured.
• The system will be able to give efficient feedback.
• Complex gestures will be detected with accuracy.

Constraints:

• Fast real-time processing demands high-performance hardware that may not be accessible to all users.
• Internet connectivity is required.
• Time constraints might limit the scope of features implemented in the first phase of development.

1.7 Social Benefits

• Overcome the gap between sign language users and spoken language users.
• Improved access to public services.
• Facilitate accessibility in different sectors.
• Employment opportunities.
• Reduce the need for interpreters.
• Real-time communication.

1.8 Business Plan

1.8.1 Business Model Canvas

1.8.2 Problem

• Limited adoption in the market.
Due to a lack of awareness, some users may not be able to use the app.

• Subscription prices.
Striking the right balance between affordability and profitability is difficult. Incorrect pricing can result in either underuse or financial loss.

• High market competition.
Emerging technologies with similar features may reduce market share.

• Customer retention.
If a user finds the app confusing to use or is not satisfied with its performance, they may cancel their subscription.

1.8.3 Solution

• Conduct seminars, online webinars, and marketing campaigns to highlight the system's benefits.
• Collaborate with organizations to reach large audiences.
• Offer free trials to users to build trust.
• Regularly update the system with new features.
• Introduce features like dual voice and hand sign recognition.

1.8.4 Customers

• Individuals with hearing and speaking difficulties.
They need real-time communication tools that are accessible for day-to-day interactions.

• Non-sign-language speakers.
Individuals who have no knowledge of sign language but interact with sign language users; the system enables two-way communication without formal training.

• Educational institutions.
Educational centres with students requiring special assistance in communication need real-time translation tools that enable communication between students with and without disabilities.

• Healthcare sectors.
Hospitals, clinics, and healthcare professionals that need quick and efficient communication with patients.

• Government agencies.
Organizations offering services to the public, including disabled people, need tools that allow seamless interaction with them.

1.8.5 Competitors

The Hand Talk app

Limitations
• The app supports only a limited number of languages.
• It focuses only on converting spoken or text input into sign language and does not support voice recognition or sign-language-to-text/voice output.
• Designed mainly for students.
• Available only as a mobile app.

Saksham app

Limitations
• Limited vocabulary: The app mainly covers basic words and phrases, which
makes it less useful for complex conversations.
• Inconsistent translation: It sometimes struggles to accurately translate certain
phrases or words due to the nuances of sign language.

Google's Live Transcribe

Limitations:

• Accuracy of transcription can be compromised in noisy environments or with strong accents.
• It does not support sign language, so it is not a complete solution for all users.

Advantages of our project

• Our project supports a dual recognition system for both voice input and hand sign language recognition.
• It uses real-time AI-based hand and voice recognition.
• It is designed for individuals, institutions, and businesses.
• It is offered at both mobile and enterprise level.

1.8.6 Marketing Plan

• Digital marketing
Publish blog content, run targeted ads on Google and social media platforms,
and collaborate with influencers and organizations.
• Partnerships and collaborations.
Partner with non-profit organizations, and companies interested in improving
workplace accessibility.
• Workshops.
Organize workshops for schools, healthcare providers, and businesses to demonstrate the system's benefits. Showcase the system at relevant exhibitions to attract potential partners and customers.
• Promotional trials and offers.
Offer limited trials to allow organizations and individuals to experience the system.

1.8.7 Revenue

• Subscription model.
Charge users a fee for continuous access to the software. Offer a basic free version with essential features and charge for advanced features.

• Courses.
Offer courses in multiple sign languages.

1.8.8 SWOT (Strength Weakness Opportunities Threats) Analysis

Strengths
• Dual recognition system.
The combination of voice and hand sign language recognition enhances communication.
• Inclusivity and accessibility focus.
Support for deaf, mute, and hearing users makes the system highly inclusive.
• Multi-platform deployment.
Availability on different devices offers flexibility.
• Real-time performance.
Fast processing reduces communication delays.
Weaknesses
• Potential technical challenges.
Gesture detection errors or voice misrecognition, especially in noisy environments.
• Limited initial language support.
At the initial stage only a few languages are supported; language coverage will need to be expanded.
Opportunities
• Technology partnerships.
Collaboration with hardware and software manufacturers to enhance functionality.
• Global market expansion.
Opportunity to deploy the system in the global market so that international users can also use the app.
Threats
• Market saturation risk.
Entry of new competitors or alternative solutions could impact market share.
• Competition from established platforms.
Competitors like Google, Microsoft, and Hand Talk pose a threat.
• Technology evolution.
Rapid advancements in AI may require continuous updates to remain competitive.

1.8.9 FAB (Features, Attributes, Benefits) Analysis

• Feature: Dual recognition system.
Attribute: Both voice and sign language are recognized by the system.
Benefit: Enables efficient communication between users.
• Feature: Multi-language support.
Attribute: Supports multiple spoken and sign languages.
Benefit: Increases accessibility.
• Feature: Real-time interaction.
Attribute: Minimal delay in response.
Benefit: Reduces communication delay, improving user experience.
• Feature: Multi-platform availability.
Attribute: Available on smart devices.
Benefit: Ensures flexibility in deployment based on user needs.
• Feature: User-friendly interface.
Attribute: Features visual aids for better usability.
Benefit: No advanced knowledge is required to operate the app.
• Feature: Secure data handling.
Attribute: Handles user data securely.
Benefit: Builds user trust by protecting personal information.

1.9 Report Layout

The project aims to bridge communication gaps between sign language users and non-
sign language speakers using advanced technology. Recognizing the critical role of
sign language for the deaf and mute community, the system utilizes deep learning and
computer vision to translate gestures into voice output and text. Objectives include
facilitating real-time communication across various sectors, such as healthcare and
education. Current systems struggle with real-time responsiveness and often require
users to know sign language. This project proposes a dual-function system that
converts gestures to voice and vice versa. The scope encompasses multi-language
support and a user-friendly interface, ensuring accessibility for all. Assumptions
include efficient noise cancellation and data security, while constraints highlight the
need for high-performance hardware and internet connectivity.

CHAPTER 2: LITERATURE REVIEW/BACKGROUND
AND EXISTING WORK
2.1 Background

Sign language is a structured form of hand motion used in non-verbal communication, since it is not possible to talk to everybody with one language or with simple hand gestures. Sign language is useful when the person you are speaking with cannot speak or hear, and in this case our project facilitates two-way communication. Different authors use different techniques for sign language recognition and classification, such as 3D CNNs, Long Short-Term Memory networks combined with CNNs, and the object detection algorithm YOLO v5 for detecting hand movements. These models give impressive performance in identifying dynamic hand gestures, but because they used relatively small datasets they achieved an accuracy of about 82%, and they support only one-way communication. Our proposed system uses a CNN model for detecting hand gestures while maintaining good accuracy. [8], [9]

The most common sign languages are American Sign Language (ASL), British Sign Language (BSL), and French Sign Language (FSL). Many countries have their own versions of sign language, such as Russian Sign Language (RSL) and Japanese Sign Language (JSL), each with its own vocabulary, structure, and unique features. Hand gesture recognition involves complex processes such as motion analysis, pattern recognition, and motion modeling. Different viewpoints cause gestures to appear differently in 2D space; some researchers use wristbands or colored gloves to help with hand segmentation, which reduces the complexity of the segmentation process. Different evaluation criteria are used to measure the performance of a system and its overall accuracy.[10] Processing sign language through a system involves several techniques, including computer vision, gesture recognition, voice recognition, and machine learning algorithms.[11]

One article uses a deep learning model for detecting and recognizing words from Indian Sign Language (ISL). The authors use Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models to identify signs from ISL video frames. The model was trained on a custom dataset, IISL2020, which includes 11 static signs and 630 samples. The best-performing model, consisting of a single LSTM layer followed by a GRU layer, achieved around 97% accuracy over 11 different signs.[1]

In another article, the author designs a desktop application that translates American Sign Language (ASL) into text in real time. The system uses Convolutional Neural Networks (CNN) for classification, achieving an accuracy of 96.3% for the 26 letters of the alphabet. The
application processes hand images through a filter and classifier to predict the class of
hand gestures, facilitating communication for individuals with hearing or speech
impairments. The researchers used the ASL Hand Sign Dataset from Kaggle, which
contains 24 classes with Gaussian blur applied. The model uses a Convolutional Neural
Network (CNN) with two layers. The first layer processes the hand image to predict
and detect frames, while the second layer checks and predicts symbols or letters that
look similar. The model was trained using 30,526 images for training data and 8,958
images for testing data. The training involved 20 epochs to optimize accuracy and
reduce loss. The application uses a live camera to capture hand gestures, which are then
processed and converted to text in real-time. The system includes an autocorrect feature
using the Hunspell library to improve accuracy. Performance was measured using a
confusion matrix, achieving a final accuracy rate of 96.3%.[12][13][14]

2.1.1 Speech Recognition


For speech recognition, a microphone captures the audio input. Background noise and echoes are removed to make the signal clearer and the predictions more accurate. An NLP model then predicts the sequence of words based on the recognized phonemes.
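As a minimal sketch of the capture step, the snippet below records a short audio clip with PyAudio (which is listed in the software requirements) and saves it for later recognition; the sampling rate, chunk size, and duration are illustrative assumptions rather than the project's final settings.

```python
import wave
import pyaudio

RATE, CHUNK, SECONDS = 16000, 1024, 3   # assumed capture settings: 16 kHz mono, 3 s

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

# Read the microphone in small chunks for the requested duration.
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]

stream.stop_stream()
stream.close()
sample_width = audio.get_sample_size(pyaudio.paInt16)
audio.terminate()

# Save the raw capture so it can be passed on to the speech-to-text model.
with wave.open("command.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
```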

2.1.2 Hand sign recognition


The system captures real-time images of hand gestures and isolates the hand from the background using key-point detection. Each image is resized and normalized to match the model input. A CNN is then used to process the image and extract features and movement patterns.
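A minimal sketch of this capture-and-preprocess step is shown below using OpenCV (listed in the software requirements); the 64x64 grayscale input size is an assumption and may differ from the final model.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)          # open the default webcam
ok, frame = cap.read()             # grab one BGR frame
cap.release()

if ok:
    roi = cv2.resize(frame, (64, 64))               # resize to the assumed model input size
    roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)     # drop colour information
    roi = roi.astype(np.float32) / 255.0            # normalize pixel values to [0, 1]
    tensor = roi[np.newaxis, np.newaxis, :, :]      # shape (1, 1, 64, 64), ready for a CNN
```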

2.1.3 Training the System


The system will be trained on large datasets of hand gestures paired with corresponding commands (e.g., "start", "stop", or identifying different fruit types). Voice datasets will be used for converting voice to sign language.

2.1.4 Feature Extraction
To process sign language, key points on the hands, such as fingertips and joints, are detected and analyzed. These key points help the computer understand the shape and movement of the hands, which are essential for recognizing signs. For speech, key features are extracted from the audio signal to represent its acoustic properties: the signal is converted into a spectrogram, which is then passed through a CNN model to extract features.
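A minimal sketch of the spectrogram step using PyTorch (listed in the software requirements) is given below; the waveform is a random placeholder and the window and hop sizes are illustrative assumptions.

```python
import torch

waveform = torch.randn(16000)            # placeholder: 1 second of audio at 16 kHz

spec = torch.stft(
    waveform,
    n_fft=400,                           # 25 ms analysis window at 16 kHz
    hop_length=160,                      # 10 ms hop between windows
    window=torch.hann_window(400),
    return_complex=True,
)
power_spec = spec.abs() ** 2             # power spectrogram, shape (freq_bins, frames)

# A log-scaled spectrogram can be fed to a CNN as a one-channel "image".
log_spec = torch.log1p(power_spec).unsqueeze(0).unsqueeze(0)   # shape (1, 1, F, T)
```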

2.1.5 Real-time Processing


Real-time voice and hand sign language recognition involves using cameras, sensors, and microphones to continuously track gestures and voice. As the user makes signs, the system processes the visual data, matches it to predefined signs, and responds accordingly; for voice, the system recognizes the audio, determines which words were spoken, and matches them to the corresponding sign language gestures. Advanced algorithms are used to ensure that the interaction feels smooth and instantaneous, allowing for seamless communication.

2.1.6 Output Translation


Once a sign is recognized, the system provides output in the form of text,
speech, or action and vice versa. For example, a sign could be translated into
text displayed on a screen or a verbal response could be given through a voice
synthesizer.
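As a minimal sketch of this output step, the snippet below maps a recognized gesture label to text and speaks it aloud; the gesture-to-text mapping is hypothetical, and pyttsx3 is used here only as an example text-to-speech engine, not necessarily the one the project will adopt.

```python
import pyttsx3

GESTURE_TO_TEXT = {"hello": "Hello", "thank_you": "Thank you"}   # illustrative mapping

def speak_gesture(label: str) -> None:
    text = GESTURE_TO_TEXT.get(label, label)
    print(text)                     # on-screen text output
    engine = pyttsx3.init()         # voice output via text-to-speech
    engine.say(text)
    engine.runAndWait()

speak_gesture("hello")
```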

2.2 Languages we are using:

2.2.1 American Sign Language (ASL):

American Sign Language (ASL) is an expressive visual language used by deaf and mute people. ASL has its own grammar, syntax, and vocabulary, making it different from English despite sharing the same geographical region.

Key Features of ASL:

• Gestures
ASL relies on hand shapes, facial expressions, and body movements. The position, motion, and orientation of the hands play an important role in communication.
• Facial expression
In ASL, facial expressions and head movements play an essential role in grammar. These non-manual signals are as important as hand gestures in creating meaning.
• Grammar and Structure
ASL has its own set of grammatical rules that are separate from English. In ASL, the typical sentence structure is Subject-Object-Verb (SOV), which differs from the English Subject-Verb-Object (SVO) structure.
• Finger Spelling
ASL incorporates a system for spelling out words that do not have specific signs. Each letter of the English alphabet has a unique hand gesture, and words are spelled out by forming these letter shapes sequentially.

2.2.2 British Sign Language (BSL)

British Sign Language (BSL) is the primary sign language used by Deaf
communities in the United Kingdom. Like ASL, BSL is a visual and spatial
language, but it entirely different in its grammar, syntax, and vocabulary. BSL is
not a variant of English, but rather a fully developed language with its own
grammar and linguistic features.

Key Features of BSL:

• Distinct Vocabulary
BSL has its own unique signs for each word, unlike ASL, which have same
signs as English. ASL and BSL are entirely different from each other.
• Facial Expressions and Body Posture
Similar to ASL, BSL has facial expressions, eye contact, and body posture are
critical for effective communication. Non-manual features help clarify meaning
and convey emotion.
• Word Order and Syntax
Similar to ASL, BSL often follows a Topic-Comment structure, where the
topic is introduced first, followed by a comment.

12
• Fingerspelling
Like ASL, BSL uses fingerspelling for proper names, places, and words that do
not have established signs. BSL fingerspelling uses the British Alphabet, which
is different from the ASL alphabet.

2.3 Currently Running Apps in the Market:

1. Saksham
Overview: Saksham is an app developed by the Saksham foundation in Pakistan to
bridge the communication gap between the deaf and the hearing world. The app
translates text or voice into Pakistan Sign Language (PSL). It serves as a dictionary and
translator to make communication easier for people who are deaf or hard of hearing.
2. Google's Live Transcribe
Overview: Google's Live Transcribe app is available in many languages, including
Urdu. It transcribes speech into real-time text, allowing people with hearing
impairments to follow conversations. While it does not directly involve sign language,
it helps bridge the communication gap by providing written translations of spoken
words.

3. The Hand Talk app

Overview: This app offers real-time translation from text or spoken language into sign language using animated 3D characters. The use of 3D characters makes the app user-friendly.

Our Model:

Our model is designed to facilitate communication for people with hearing or speaking difficulties through multiple sign languages such as ASL, PSL, and BSL. The idea is to develop a multilingual platform capable of reaching a wide number of users from different linguistic backgrounds and making it easier for them to communicate and be understood. Our model will recognize hand gestures and convert them into text or voice so that a non-sign-language user can understand what the sign language user is trying to say. To facilitate two-way communication, voice is also converted into hand sign language so that the sign language user can better understand what the hearing person is trying to say. Including these languages will help us reach the global deaf community and help them communicate effectively in their native sign languages. Existing systems support only one-way communication, and most of them provide courses to learn sign language rather than real-time communication. The core of our model lies in its simplicity. The app also features real-time translation of common phrases, which enhances the learning experience and makes it practical for daily use.

App | Features | Languages | Target Audience | Communication Type
Saksham | Text/voice translation into Pakistan Sign Language (PSL); serves as a dictionary and translator. | Pakistan Sign Language (PSL) | Deaf or hard-of-hearing people | One-way (text/voice to PSL)
Google's Live Transcribe | Real-time speech-to-text transcription. | Multiple spoken languages | Hearing-impaired individuals | One-way (speech to text)
Hand Talk App | Real-time text-to-spoken-language and sign language translation using animated 3D characters. | Multiple spoken and sign languages | Deaf and hearing individuals | One-way (text/voice to sign language)
Our Model | Multilingual platform supporting real-time two-way communication. | ASL, PSL, BSL | Global deaf and hearing communities | Two-way (gesture ↔ voice/sign)

2.4 Literature Summary

The proposed model aims to enhance two-way communication by recognizing and translating hand gestures and voice into multiple sign languages, including ASL, BSL, and PSL. The system's core functionality includes real-time recognition of hand gestures through CNN models and voice-to-sign translation, ensuring accessibility and ease of communication for diverse users. Additionally, the platform offers features like phrase translation to facilitate learning and practical use in daily scenarios.

Existing systems employ techniques such as 3D Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks combined with CNNs, and object detection models like YOLO v5.

Hand gesture recognition involves complex stages such as motion analysis, pattern
recognition, and feature extraction. Key points on the hand, such as fingertips and
joints, are identified to capture the gesture's shape and motion. CNN models are
commonly used to analyze these features, ensuring accurate recognition. For real-time
applications, advanced algorithms are employed to process gestures continuously,
enabling seamless communication. Some systems incorporate additional tools like
wristbands or gloves to simplify the segmentation process.

Speech recognition systems use microphones to capture audio, which is then filtered to
remove background noise and enhance clarity. Natural Language Processing (NLP)
models predict word sequences based on recognized phonemes. When integrated with
sign language systems, these models enable two-way communication by converting
spoken words into sign gestures and vice versa.

CHAPTER 3: REQUIREMENTS ANALYSIS
3.1 Stakeholders List (Actors)

• Project Manager
Manages project planning, coordination, and monitoring. Ensures that the project is completed on schedule, within budget, and to quality standards. Stays in touch with all parties and resolves any issues that emerge throughout the project.
• Software Development Team
Designs and implements key functions, including voice recognition and sign language recognition. Develops and upgrades algorithms for gesture detection, machine learning, and natural language processing. Provides recurring reports on development progress and manages technical issues.
• Machine Learning Engineers
Create, train, and upgrade machine learning models for speech and sign language recognition. Assemble and preprocess data to improve the accuracy of the recognition algorithms. Work jointly with the software team to integrate AI models into the system.
• UI and UX Designers
Create an intuitive and accessible user interface that considers the needs of those who have hearing or speech impairments. Organize usability testing to make sure that the platform is easy to operate. Focus on visual elements such as sign language symbols, controls, and text-to-speech or voice-to-text conversion interfaces.
• Quality Assurance (QA) Team
Tests software functionality, performance, and accuracy. Ensures that the sign and speech recognition systems fulfill the required criteria. Cooperates with end users to collect feedback, fix errors, and enhance the application's robustness.
• Sponsors or Investors
Provide financial support for the project. Monitor project milestones and give advice to ensure that the project aligns with company goals.

3.2 Requirements Elicitation

For the project “Voice and Hand Sign Language Recognition,” requirements elicitation makes certain that the system meets the needs of end users, namely individuals with hearing or speech impairments. The goal is to collect functional, non-functional, and domain-specific requirements to guide the system's development, including accurate voice command interpretation and support for various sign languages.

This process includes engaging stakeholders, surveying existing systems, and examining user interactions to understand their challenges and expectations.

Methods for Requirements Elicitation

Various methods were employed to gather the requirements for the project. The following techniques were selected to ensure extensive and precise data collection:

1. Interviews

• Purpose: Conducted with stakeholders, including potential users (e.g., hearing-impaired individuals), domain professionals, and developers.
• Objective: Understand the types of voice commands and sign languages to be supported, system accuracy requirements, and user interface preferences.

2. Observation

• Purpose: Observe how individuals communicate using voice and sign language in real-life scenarios.
• Objective: Identify gaps in current systems and understand user behaviour.

3. Questionnaires and Surveys

• Purpose: Distribute structured questionnaires to a wide audience to gather input on expected system characteristics and usability.
• Objective: Gather feedback on priorities, such as supported languages, accuracy, and recognition speed.

4. Prototyping

• Purpose: Develop an introductory system prototype demonstrating basic functionalities for hand sign and voice recognition.
• Objective: Gather user feedback in order to refine requirements and improve the system design iteratively.

5. Document Analysis

• Purpose: Inspect existing datasets, research papers, and technical documentation related to voice and sign language recognition.
• Objective: Leverage prior work and established practices to define system requirements.

6. Focus Groups

• Purpose: Organize group discussions with users, language professionals, and developers.
• Objective: Explore likely challenges and survey possible solutions.

7. Use Case Modelling

• Purpose: State precisely the real-life scenarios in which the system will be used.
• Objective: Define functional requirements, such as interpreting hand signs into text or recognizing voice commands in noisy environments.

3.2.1 Functional Requirements

The functional requirements define the central functionalities of the Voice and Hand Sign Language Recognition System. These requirements state the expected behaviour of the system to confirm that it meets user needs effectively.

1. Voice Command Recognition

Description: The system shall accurately recognize voice commands from individuals and translate them into actions or text.

Expected Behaviour:

• Recognize speech from multiple individuals with different accents.
• Filter background sounds and control noise interference.
• Support multiple languages for voice recognition.

2. Hand Sign Recognition

Description: The system shall recognize hand signs from individuals and translate them into speech or text in real time.

Expected Behaviour:

• Determine and translate gestures from multiple sign languages (e.g., ASL, BSL).
• Accurately identify hand movements and static gestures using a camera.
• Work reliably under different lighting conditions and viewing angles.

3. Multimodal Input Processing

Description: The system shall process both hand sign inputs and voice commands concurrently.

Expected Behaviour:

• Prioritize inputs based on user preference (e.g., prioritize voice when both inputs are active).
• Switch seamlessly between voice and sign recognition as required.

4. Translation Output

Description: The system shall translate recognized voice and hand signs into speech or text output.

Expected Behaviour:

• Present recognized text in an adaptable format.
• Provide audio output for converted hand signs.

5. User Authentication

Description: The system shall provide a secure user login for customized settings.

Expected Behaviour:

• Permit users to create and access their accounts.
• Store user preferences for language and input mode securely.

6. Language Support

Description: Multiple languages shall be supported by the system for both sign and voice recognition.

Expected Behaviour:

• Permit users to select their preferred language from a list of languages.
• Provide accurate translations for the selected language.

7. Customizable Interface

Description: A user-friendly and customizable interface should be provided by


the system.

Expected Behaviour:

• Permit users to manage system settings, like font size, output preferences and language.
• Provide users with both auditory and visual feedback for interactions.

8. Error Handling and Feedback

Description: The system shall detect any errors during recognition and notify
the users about them.

Expected Behaviour:

• Provide suggestions on how to correct gestures or commands that are not recognized.
• Present clear error messages for system problems and invalid inputs.

9. System Performance

Description: The system shall operate efficiently and give real-time responses.

Expected Behaviour:

• Process hand sign and voice inputs with minimal delay.
• Sustain recognition accuracy of at least 90%.

3.2.2 Non-Functional Requirements

The non-functional requirements define the quality characteristics, performance, and constraints of the Voice and Hand Sign Language Recognition System. These requirements ensure the system's accuracy, manageability, and user satisfaction.

1. Usability Requirements

• The user interface must be intuitive, easy to use, and accessible to individuals with different levels of technical proficiency.
• The system must provide visual and auditory feedback for inputs and outputs to improve user interaction.
• Users shall be able to change settings, such as language preferences, input modes, and font size, to match their needs.

2. Scalability Requirements

• The system must be capable of scaling to support an increasing number of users without degrading performance.
• It shall support additional sign languages and voice recognition models if needed.

3. Reliability Requirements

• The system shall operate continuously with at least 99.5% uptime.
• The system shall handle unexpected inputs gracefully and provide meaningful error messages.

4. Portability Requirements

• The system shall be deployable on multiple platforms, including Windows, macOS,


and Linux.
• A mobile version shall be compatible with Android and iOS devices.
• The system shall function seamlessly across devices with different hardware
configurations.

5. Maintainability Requirements

• The system’s codebase shall be modular and well-documented to facilitate updates and
debugging.
• Regular updates shall be provided to improve performance, add features, and fix bugs.
• A version control system shall be used to manage changes in the codebase.

6. Environmental Requirements

• The system shall perform optimally in varying environmental conditions, including


low-light and noisy environments.
• It shall function efficiently on devices with limited computational resources (e.g., low-
end smartphones).

7. Ethical and Legal Requirements

• The system shall ensure unbiased recognition of all voices and hand signs, regardless
of gender, ethnicity, or cultural background.

• All collected data shall be used solely for improving system performance and shall not
be shared without user consent.

3.2.3 Requirements Traceability Matrix

Requirement ID | Requirement Description | Elicitation Techniques | Stakeholders | Project Goal
FR-01 | Recognize voice commands and translate into sign actions | Interviews, Use Case Modelling | End Users, Project Manager | Ensures seamless interaction for users who rely on voice commands.
FR-02 | Recognize hand signs and translate into text or speech | Focus Groups, Prototyping | End Users, UI/UX Designers | Supports individuals with speech impairments by providing a real-time recognition system.
FR-03 | User authentication for secure access | Document Analysis, Interviews | Project Manager, QA Team | Ensures personalized and secure interaction with the system.
FR-04 | Support multiple languages for input and output | Interviews, Questionnaires | End Users, Sponsors | Expands user reach by accommodating diverse linguistic preferences.
FR-05 | Customizable user interface | Observation, Prototyping | UI/UX Designers, End Users | Improves accessibility and user satisfaction by catering to individual needs.
FR-06 | Error handling and feedback | Document Analysis, Focus Groups | QA Team, End Users | Enhances user experience by providing guidance on correcting errors.
FR-07 | Real-time input processing | Observation, Prototyping | Software Team, ML Engineers | Ensures the system meets performance standards for real-time interactions.
FR-08 | High recognition accuracy | Focus Groups, Prototyping | End Users, Sponsors | Ensures system reliability and trustworthiness.
FR-09 | Intuitive and accessible interface | Prototyping, Interviews | UI/UX Designers, End Users | Ensures users with varying technical skills can operate the system effectively.
FR-10 | Reliable operation with minimal downtime | Observation, Document Analysis | Sponsors, QA Team | Guarantees consistent system availability for users.
FR-11 | Portability across multiple platforms | Document Analysis | Project Manager, Sponsors | Expands system usability across various devices and operating systems.
FR-12 | Adherence to ethical and legal standards | Document Analysis | Project Manager, Sponsors, QA Team | Ensures compliance with regulations and promotes fair use of the system.

3.3 Use Case Design/Description

This section outlines the use cases for the Voice and Hand Sign Language
Recognition System. The use cases describe key interactions between the users and
the system, illustrating how the system supports its functionality.

Use Case 1: Recognize Voice Commands

Actors: End User

Goal: Translate spoken commands into text or actions.

Preconditions: The system is active, and the microphone is functional.

Main Flow:

1. The user speaks a command.

2. The system captures and processes the voice input.

3. The system translates the voice into text or performs the associated
action.

4. The system provides feedback on the output.

Alternative Flow: If the system cannot recognize the command, it provides an


error message and suggests rephrasing.

Use Case 2: Recognize Hand Signs

Actors: End User

Goal: Detect and translate hand signs into text or speech.

Preconditions: The system is active, and the camera is operational.

Main Flow:

1. The user performs a hand sign.

2. The system captures the video input via the camera.

3. The system processes the input to identify the hand sign.

4. The system translates the sign into text or speech output.

5. The system provides feedback on the output.

Alternative Flow: If the sign is not recognized, the system displays an error
message and suggests corrective actions.

Use Case 3: Translate Inputs into Outputs

Actors: End User

Goal: Convert recognized inputs (voice or hand signs) into outputs (text or
speech).

Preconditions: The input (voice or sign) has been recognized successfully.

Main Flow:

1. The system processes the recognized input.

2. It generates an appropriate output (text or speech).

3. The output is displayed on the screen or provided as audio feedback.

Alternative Flow: If translation fails, the system requests the user to repeat
the input.

Use Case 4: Login into User Profile

Actors: End User

Goal: Allow users to create, modify, or delete profiles for personalized


settings.

Preconditions: The user has access to the system.

Main Flow:

1. The user logs into the system.

2. The user selects profile management options.

3. The user updates language preferences, subscription plan, or other


settings.

4. The system saves the changes.

Alternative Flow: If changes cannot be saved, the system provides an error


message.

Use Case 5: Courses

Actors: End User

Goal: User can learn sign languages.

Main Flow:

1. Detailed courses for multiple sign languages are available.

2. The user can log in to their profile and select any sign language to learn.

3.4 Software Development Life Cycle Model

Chosen SDLC Model:

Agile Development Model

The Agile model was chosen for the development of the speech and sign language recognition system. Due to its iterative and adaptive nature, it is well suited to the dynamic needs of the project and the need for frequent feedback.

The Agile model explained

The Agile model is a dynamic, incremental approach to software development. It concentrates on delivering small, functional modules of the system in iterations (sprints), allowing continuous improvement through regular feedback from stakeholders. Agile encourages collaboration, adaptability, and transparency.

Key Characteristics of the Agile Model:

• Incremental Development: The system is developed in small, manageable parts.
• User Involvement: Stakeholders and end users provide regular input.
• Iterative Process: Each sprint delivers a potentially shippable product increment.
• Continuous Testing and Development: Testing is performed in each iteration to identify and correct issues early.
• Adaptability: Changes in requirements are accommodated even at later stages.

Reasons for selecting the Agile Model

• Dynamic Requirements

The project includes novel features such as voice and sign language recognition, which may require continuous refinement based on user feedback or technological progress. Agile's flexibility ensures the system evolves in response to stakeholder needs.

• Stakeholder Cooperation

The system's success depends on addressing the requirements of users, such as people with hearing or speech disabilities. Agile facilitates regular cooperation with end users, language specialists, and developers to align the system with real-world needs.

• Iterative Delivery

Features like voice recognition, hand sign recognition, and multimodal input can be developed, tested, and released in increments. This approach permits timely detection of challenges, minimizing development risks.

• Focus on Quality

Continuous testing in every iteration ensures that the system is reliable and meets performance standards. Feedback loops improve accuracy, usability, and accessibility with every sprint.

• Flexibility and Future Growth

Agile supports the addition of new features, such as support for more languages or newer recognition algorithms, without disrupting the development process. The iterative nature ensures the system can adapt to evolving user requirements and new technologies.

3.5 Specific Requirements (Hardware and Software Requirements)

3.5.1 Hardware Requirement

Processor: 2.6 GHz or faster.
RAM: 4 GB or higher.
HDD: 64 GB for basic application storage.
SSD: 128 GB recommended.
Camera: 1080p resolution.
Microphone: High quality.

3.5.2 Software Requirement

Operating System
Windows 10
Programming Language
Python
Voice Recognition
• PyAudio
Hand Sign Language Recognition
• OpenCV
Data Processing & Analysis
• Matplotlib
• PyTorch
Web Frameworks
• Django
Development Tools

IDE: Jupyter Notebook

Table 3.1 Software requirements

Requirements Versions

Python 3.11.4

PyAudio 0.2.13

OpenCV 4.8.0

Matplotlib 3.8.0

Pytorch 3.9

Django 5.1.4

IDE Jupyter notebook

CHAPTER 4: SOFTWARE DESIGN SPECIFICATION
4.1 Design Models

Object-Oriented Design (OOD):

Object-Oriented Design (OOD) uses objects to model real-world entities, encapsulating data and behavior. For this project, objects represent distinct components like input processing, recognition models, and output generation.

• Voice Recognition:

Objects like Audio Input, Signal Processor, and Voice Recognition Model encapsulate specific tasks. This enables modular representation of processes such as feature extraction, model inference, and output formatting.

• Hand Sign Recognition:

Objects like Image Capture, Preprocessing Module, and Gesture Recognition Model handle key functions such as image acquisition, noise reduction, and classification. Inheritance can be applied to handle variations in gesture data, such as different sign languages, as illustrated in the sketch below.
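A minimal Python sketch of this object-oriented structure is shown below; the class and method names are hypothetical illustrations rather than the project's actual API.

```python
from abc import ABC, abstractmethod

class GestureRecognitionModel(ABC):
    """Base class encapsulating gesture classification behaviour."""

    @abstractmethod
    def classify(self, frame) -> str:
        """Return the label of the gesture found in a preprocessed frame."""

class ASLRecognitionModel(GestureRecognitionModel):
    def classify(self, frame) -> str:
        return "hello"        # placeholder: run the ASL-specific CNN here

class BSLRecognitionModel(GestureRecognitionModel):
    def classify(self, frame) -> str:
        return "thank_you"    # placeholder: run the BSL-specific CNN here

def recognize(model: GestureRecognitionModel, frame) -> str:
    # Callers depend only on the base interface, not on the concrete sign language.
    return model.classify(frame)

print(recognize(ASLRecognitionModel(), frame=None))
```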

4.2 Work Breakdown Structure

1. Data Preparation

• Voice Data Collection


▪ Collect diverse audio datasets with varied accents and noise levels.
▪ Annotate datasets with corresponding transcriptions.
• Gesture Data Collection
▪ Gather datasets of hand sign gestures, including static and dynamic gestures.
▪ Include variations in lighting, hand shapes, and skin tones.
• Data Preprocessing
For voice:
▪ Normalize audio signals and reduce noise.
▪ Extract features like MFCCs or spectrograms.
For gestures:
▪ Preprocess images by resizing, normalizing, and augmenting the data (a minimal sketch follows this list).
2. Model Development
• Voice Recognition Model
▪ Develop a deep learning model for voice-to-text conversion.
▪ Fine-tune using a pre-trained speech recognition model.
• Hand Gesture Recognition Model
▪ Train a CNN-based model for static gestures (a minimal sketch follows this list).
▪ Fine-tune CNN models for dynamic gestures.
• Model Optimization
▪ Implement techniques like pruning to improve model performance on edge devices.
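A minimal PyTorch sketch of such a CNN classifier for static hand signs is given below, assuming single-channel 64x64 inputs and 26 gesture classes; these values are illustrative, not the final architecture.

```python
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, num_classes: int = 26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # 64x64 input -> 16x16 feature map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                   # shape (N, 32, 16, 16)
        return self.classifier(x.flatten(1))   # one score per gesture class

model = GestureCNN()
logits = model(torch.randn(1, 1, 64, 64))      # dummy frame -> tensor of shape (1, 26)
```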
3. System Integration

• Combining Recognition Pipelines


▪ Develop middleware to integrate voice and hand gesture models into a single
pipeline.
▪ Ensure synchronization of outputs from both models.
• Real-Time Processing
▪ Optimize the system for real-time recognition of both inputs simultaneously (a minimal concurrency sketch is shown below).
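The snippet below is a minimal sketch of running both capture loops concurrently and merging their results; the capture and recognition functions are stand-ins for the PyAudio/OpenCV capture code and the recognition models described earlier.

```python
import queue
import threading
import time

results = queue.Queue()

def capture_audio_chunk():
    time.sleep(0.5)               # stand-in for a PyAudio stream read
    return b"\x00" * 1024

def capture_video_frame():
    time.sleep(0.1)               # stand-in for an OpenCV cap.read()
    return None

def recognize_voice(chunk) -> str:
    return "hello"                # stand-in for the speech-to-text model

def recognize_gesture(frame) -> str:
    return "thank_you"            # stand-in for the gesture CNN

def audio_worker():
    while True:
        results.put(("voice", recognize_voice(capture_audio_chunk())))

def video_worker():
    while True:
        results.put(("gesture", recognize_gesture(capture_video_frame())))

threading.Thread(target=audio_worker, daemon=True).start()
threading.Thread(target=video_worker, daemon=True).start()

for _ in range(10):               # consume a few merged results from both pipelines
    source, text = results.get()
    print(f"{source}: {text}")
```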

4.3 System Architecture

1. Application Layer
Purpose: Handles user interaction and delivers the system’s functionalities.
Responsibilities:
• Provide an intuitive and accessible User Interface (UI) for
communication.
• Allow users to input voice or gestures and receive corresponding
outputs.
• Manage user settings, preferences, and feedback collection.
Components:
• Graphical User Interface (GUI)
• Input/Output Control (microphone, camera, speaker, screen)

2. Presentation Layer

Purpose: Ensures proper formatting and representation of data for the


application layer.

Responsibilities:

• Convert raw recognition results into user-friendly outputs.

• Ensure outputs are accessible to the end-user.

• Maintain cross-platform compatibility for displaying results.

Components:

• Data Formatting Module

• Output Generators (audio, text, or gestures)

3. Processing Layer

Purpose: The core computational layer that performs recognition, translation,


and logic operations.

Responsibilities:

• Voice Recognition: Convert spoken words into text using speech-to-


text models.

• Hand Sign Recognition: Detect and interpret gestures from video


inputs.

• Translation Engine: Map recognized inputs (voice or gestures) to the


desired output format.

• Coordinate communication between the recognition modules and


presentation layer.

Components:

• Voice Recognition Engine.

• Gesture Recognition Engine: CNN-based models for hand detection

• Translation Engine

4. Data Management Layer

Purpose: Handle all data storage, retrieval, and management tasks.

Responsibilities:

• Store gesture libraries, speech patterns, and translation mappings.

• Log user interactions and model usage for performance improvements.

• Enable training and updates for recognition models using stored data.

Components:

• APIs

• Model Update Manager

5. Communication Layer

Purpose: Facilitate seamless communication between different system


components.

Responsibilities:

• Manage data exchange between layers.

• Ensure synchronization of input data (voice/video) with processing


modules.

• Optimize communication for real-time performance.

Components:

• Communication Protocol Manager

• Middleware for Data Transfer

6. Input/Output Layer

Purpose: Interface with external hardware for data collection and output
delivery.

Responsibilities:

• Capture raw input data (voice via microphone and gestures via camera).

• Deliver processed output as audio, text, or visual gestures.

Components:

• Input Devices: Microphone, camera

• Output Devices: Screen, speaker, or sign

4.3.1 Block Diagram

1. Input Layer:

Captures user input from hardware devices:

▪ Microphone: Records spoken language for voice recognition.

▪ Camera: Captures hand gestures for sign recognition.

2. Preprocessing Layer:

Enhances and prepares raw input data for recognition:

▪ Removes noise from audio and segments meaningful parts of the


speech.

▪ Normalizes and extracts gesture features from video frames.

3. Recognition Modules:

Core modules for interpreting the inputs:

▪ Voice Recognition: Uses speech-to-text models to convert audio to


textual representation.

▪ Hand Sign Recognition: Employs computer vision algorithms to
detect and classify hand gestures.

4. Translation Engine:

Converts recognized data into the appropriate output format:

▪ Maps voice or sign inputs into text, speech, or gestures.

▪ Supports language translation if necessary.

5. Output Layer:

Delivers the system’s response to the user:

▪ Audio Output: Converts text into spoken language through text-to-


speech synthesis.

▪ Visual Output: Displays text or generates animated gestures for sign


representation.

Interactions Between Components

• Input and Preprocessing:


Raw data from the microphone and camera is processed to reduce noise and
extract relevant features.

• Recognition and Translation:


The processed data is passed to the appropriate recognition module. Outputs
from these modules are sent to the translation engine to generate the desired
response.

• Output Delivery:
The translated content is routed to the appropriate output device for user
communication.

4.3.2 Component Diagram

1. Input System:

• Microphone: Captures the user’s voice input.

• Camera: Captures video input for gesture recognition.

• Voice Input Module: Manages audio signal reception and preparation.

• Gesture Input Module: Manages video signal reception and preprocessing.

2. Processing System:

• Voice Preprocessing: Handles noise reduction and segmentation of audio input.

• Gesture Preprocessing: Extracts and segments relevant frames from the video input.

• Voice Recognition Module: Converts preprocessed audio into text using a speech recognition engine.

• Gesture Recognition Module: Identifies and interprets gestures using a vision-based model.

3. Translation System:

• Text-to-Speech Engine: Converts recognized text into speech for audio output.

• Sign-to-Text Engine: Maps recognized gestures into text or
commands for output.

4. Data Management System:

• Repositories: Stores voice data, gesture data, and translation mappings for training and updates.

• Dependencies: Recognition and translation modules rely on repositories for accurate outputs.

5. Output System:

• Audio Output Module: Manages the conversion of text to audio signals.

• Visual Output Module: Displays recognized text or animates gestures.

• Speakers/Screen: Deliver the final output to the user.

Key Interfaces and Dependencies:

▪ Input System ↔ Processing System: Input data flows from the microphone or camera to preprocessing modules.

▪ Processing System ↔ Translation System: Recognized data is passed to translation modules for conversion.

▪ Translation System ↔ Data Management: Translation modules query repositories to fetch mappings or update logs.

▪ Translation System ↔ Output System: Translated results are sent to output modules for final delivery.

4.3.3 Software Architecture Diagram

The system is structured into six core layers, each containing subsystems and modules
designed for specific functions. The layers ensure clarity in operations and scalability
for future enhancements.

1. Presentation Layer (User Interface)

This layer serves as the system's interface for user interaction. It includes:

• Input Interfaces:

▪ Microphone: Captures voice input for recognition.

▪ Camera: Captures video frames of hand gestures for sign recognition.

▪ Keyboard/Touch Interface: Allows text input or system configuration.

• Output Interfaces:

▪ Speaker:

Outputs synthesized speech for recognized sign language or translated gestures.

▪ Screen/Display:

Displays recognized voice input as text. Presents sign language translations as animations or static visuals. Shows error messages or prompts for corrections.

2. Core Processing Layer


This layer contains the business logic and manages workflows for recognition,
translation, and communication.

• Workflow Manager:

Controls the flow of data between recognition and translation modules. Ensures smooth operation of the system in real time, prioritizing latency-sensitive tasks.

• Communication Manager:

Handles bi-directional communication between users (e.g., a deaf user and a hearing user). Ensures timely feedback and synchronization of audio and visual outputs.

• Error Handling Subsystem:

Detects issues like ambiguous gestures or unclear voice inputs. Provides fallback options, such as prompting the user for clarification or defaulting to text display.

3. Recognition Layer
This layer focuses on converting raw user inputs into machine-readable formats.

• Voice Recognition Module:

Uses Automatic Speech Recognition (ASR) techniques to transcribe spoken language into text. Integrates noise cancellation and speaker identification for higher accuracy in real-world environments.

• Gesture Recognition Module:

Captures and processes video frames using:

▪ Hand Detection Models: Identifies hands within the video feed.

▪ Pose Estimation Models: Recognizes the position and orientation of hand gestures.

▪ Gesture Classification Models: Matches detected gestures to a pre-trained set of sign language expressions.

Utilizes deep learning frameworks like PyTorch for real-time processing; a minimal classifier sketch is given at the end of this layer.

• Multi-Language Support:

Both modules are designed to support multiple spoken and sign languages (e.g.,
ASL, BSL, or ISL for sign language; English, Spanish, or Hindi for voice).
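
As a rough illustration of the gesture classification step referred to above, the PyTorch sketch below defines a tiny CNN over 224x224 grayscale hand crops. The layer sizes and the 26-class output (e.g., an ASL alphabet) are assumptions made for the example and do not describe the project's trained model.

import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    """Tiny CNN classifier over 224x224 grayscale hand crops; layer sizes are illustrative."""

    def __init__(self, num_classes: int = 26):   # e.g., ASL alphabet; assumed class count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = GestureCNN()
dummy_crop = torch.rand(1, 1, 224, 224)          # one preprocessed frame
probs = torch.softmax(model(dummy_crop), dim=1)
print("predicted class:", int(probs.argmax(dim=1)))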

4. Translation Layer

This layer handles the conversion of recognized inputs into meaningful outputs.

• Voice-to-Sign Translation Module:

Converts transcribed speech into sign language. Uses animated avatars or
graphical displays to depict sign language gestures. Incorporates contextual
understanding to improve the accuracy of sign generation.

• Sign-to-Voice Translation Module:

Converts recognized gestures into spoken language. Ensures correct intonation and language fluency in the synthesized speech.

• Text Display Submodule:

Acts as a fallback for showing translated outputs when audio or visual translation is insufficient.
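
The toy Python sketch below illustrates the two translation directions with a small gloss table and a fingerspelling fallback for unknown words. The gloss names and the "FS:" fallback convention are invented for this example; a real module would rely on trained language models and a full sign lexicon.

# Illustrative translation tables only.
WORD_TO_SIGN = {"hello": "HELLO", "thank": "THANK-YOU", "you": "YOU"}
SIGN_TO_WORD = {v: k for k, v in WORD_TO_SIGN.items()}

def voice_to_sign(transcript: str) -> list[str]:
    """Map each transcribed word to a sign gloss, fingerspelling unknown words."""
    glosses = []
    for word in transcript.lower().split():
        gloss = WORD_TO_SIGN.get(word)
        glosses.append(gloss if gloss else "FS:" + "-".join(word.upper()))
    return glosses

def sign_to_voice(glosses: list[str]) -> str:
    """Join recognized glosses back into plain text for speech synthesis."""
    words = [SIGN_TO_WORD.get(g, g.lower()) for g in glosses]
    return " ".join(words).capitalize() + "."

print(voice_to_sign("hello thank you world"))
# ['HELLO', 'THANK-YOU', 'YOU', 'FS:W-O-R-L-D']
print(sign_to_voice(["HELLO", "YOU"]))   # "Hello you."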

5. Data Management Layer


This layer ensures efficient storage, retrieval, and processing of system data.

• Language Model Repository:

Stores pre-trained models for ASR, gesture recognition, and translation. Allows
periodic updates for enhanced accuracy and additional language support.

• User Preferences Database:

Saves user-specific settings, such as preferred language, gesture recognition sensitivity, and output format.

• System Logs and Analytics:

Maintains logs of interactions for debugging, improvement, and user insights. Tracks performance metrics like recognition accuracy and processing time.

6. Integration Layer
This layer connects hardware and external services with the software.

• Device Drivers:

Provides low-level control over input (microphone, camera) and output (display, speaker) devices.

• API Gateway:

Offers a platform for integrating third-party services such as:

▪ Cloud-based ASR or translation APIs for enhanced accuracy.

▪ Gesture recognition libraries for specialized sign languages.

Enables integration with IoT devices or assistive tools for extended functionality.

Data Flow and Collaboration

1. Input Capture:

• A microphone captures spoken words, converting them into audio signals.

• A camera captures video frames, detecting and tracking hand movements.

2. Recognition and Preprocessing:

• Audio signals are processed to remove noise, then transcribed into text.

• Video frames are preprocessed to detect hands, isolate gestures, and classify them.

3. Translation:

• Recognized text from speech is converted into sign language visuals.

• Recognized gestures are translated into text or spoken words using synthesis tools.

4. Output Generation:

• Audio outputs deliver spoken translations.

• Visual outputs display sign language animations or recognized text.

5. Error Management:

• If inputs are unclear, the system requests clarification from the user or
switches to a fallback mode (e.g., text-based interaction).

6. Data Storage:

• Translations, user preferences, and logs are stored for future reference
and system improvements.

4.4 Data Representation

• Voice Input Module: Processes user-provided speech into actionable text using STT.
• Hand Sign Input Module: Captures gestures, extracts features, and translates them to
text.
• Translation Module: Processes text for interpretation or conversion into the desired
output.
• Output Modules: Convert processed data into either spoken words via TTS or visual sign language using a virtual avatar.

4.4.1 Entity-Relationship Diagram (ERD)

• User: Stores user information (UserID, Name, Email, Password).

• Gesture: Captures hand gesture details with a unique GestureID and links to
languages.
• Voice Input: Represents the spoken input, associated with languages and a
unique VoiceID.
• Translation: Converts the source language (text, gesture, or voice) into a target
type (text, sign, or video).
• Output Modules: Includes "Text/Video" and "Sign" output, with relevant IDs
and languages.
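
For illustration, the ERD entities can be mirrored as plain Python dataclasses, as in the sketch below. Field names follow the diagram, while the types and the foreign-key style fields (source_id, user_id) are assumptions made for the example.

from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    name: str
    email: str
    password: str            # stored hashed in a real system

@dataclass
class Gesture:
    gesture_id: int
    language: str            # e.g. "ASL"
    pattern: str             # recognized gesture label

@dataclass
class VoiceInput:
    voice_id: int
    language: str            # e.g. "English"
    transcript: str

@dataclass
class Translation:
    translation_id: int
    source_type: str         # "text", "gesture", or "voice"
    source_id: int           # GestureID or VoiceID being translated
    target_type: str         # "text", "sign", or "video"
    user_id: int             # who requested the translation

t = Translation(1, source_type="voice", source_id=42, target_type="sign", user_id=7)
print(t)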

4.4.2 UML Class Diagram

• User Class: Contains attributes like userID, name, and output format
preferences, along with methods to set and get preferences.

• Gesture Class: Captures gesture patterns, meaning, and recognition logic.

• VoiceInput Class: Manages audio input, processing, and transcription logic.

• Translation Class: Handles translation from input types (gesture or voice) to the desired output type.

• Output Class: Manages the generation of output content (text, video, or sign)
in the desired format.

Significance:

• Defines the modular design of the system.

• Highlights the reusability and scalability of the components for different types
of input and output.

• Helps in implementing object-oriented programming principles, such as encapsulation and inheritance.
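
A small behavioural sketch of a subset of these classes is shown below. Only the spirit of the interfaces (preference setters/getters, translate, generate) is kept; the method bodies are placeholders and do not represent the real implementation.

class User:
    def __init__(self, user_id: int, name: str):
        self.user_id = user_id
        self.name = name
        self._output_format = "text"            # encapsulated preference

    def set_preference(self, output_format: str) -> None:
        if output_format not in {"text", "speech", "sign"}:
            raise ValueError("unsupported output format")
        self._output_format = output_format

    def get_preference(self) -> str:
        return self._output_format


class Translation:
    def translate(self, recognized: str, target_format: str) -> str:
        return f"[{target_format}] {recognized}"   # placeholder mapping


class Output:
    def generate(self, content: str, output_format: str) -> str:
        return f"render<{output_format}>: {content}"


user = User(1, "Aisha")
user.set_preference("sign")
translated = Translation().translate("hello", user.get_preference())
print(Output().generate(translated, user.get_preference()))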

4.4.3 Data Flow Diagram (DFD)

4.4.4 Hierarchical Diagram

• Root System:

The entire system is represented as the root, encompassing all major modules.

• Modules:

Input Module: Handles user inputs (voice and gestures), preprocessing, and
feature extraction.

Translation Module: Maps inputs to appropriate outputs using a language
model and data mapping processes.

Output Module: Produces outputs in the desired formats (text, speech, or sign).

User Management: Manages user profiles, authentication, and personalization preferences.

• Leaf Nodes:

Detailed subcomponents such as Audio Preprocessing, Virtual Avatar, and Text-to-Speech (TTS) represent specific tasks performed by their parent modules.

4.5 Process Flow/Representation

• Input Stage:

The system receives input from the user in two forms. Voice Input: the user speaks into a microphone. Gesture Input: the user performs gestures in front of a camera.

• Preprocessing Stage:

Voice Input: Processes the audio signal to remove noise and extracts
transcribable data. Gesture Input: Analyzes captured images or videos to
recognize gestures.

• Recognition Stage:

Voice is converted to text using speech-to-text (STT) algorithms. Gestures are matched to predefined patterns using a gesture recognition algorithm.

• Translation Stage:

Converts the recognized input (text, gesture, or voice) into the desired output type: text, speech, or sign. Uses language models, translation databases, and gesture mapping.

• Output Generation Stage:

The translated content is formatted into the desired form:

▪ Sign Language: Displayed using a virtual avatar or animation.


▪ Text Output: Shown as plain text.
▪ Speech Output: Produced using text-to-speech (TTS) synthesis.
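
Read end to end, the stages above amount to a simple pipeline. The Python sketch below mirrors them with stub functions; every helper stands in for a module described earlier and returns canned values purely to show the control flow.

def preprocess(kind: str, raw: bytes) -> bytes:
    return raw                                            # noise removal / frame extraction

def recognize(kind: str, prepared: bytes) -> str:
    return "hello" if kind == "voice" else "HELLO"        # STT or gesture matching

def translate(recognized: str, target: str) -> str:
    return recognized.lower() if target != "sign" else f"SIGN({recognized.upper()})"

def render(content: str, target: str) -> str:
    if target == "sign":
        return f"avatar plays: {content}"
    if target == "speech":
        return f"TTS says: {content}"
    return f"display shows: {content}"

def pipeline(kind: str, raw: bytes, target: str) -> str:
    prepared = preprocess(kind, raw)             # Preprocessing Stage
    recognized = recognize(kind, prepared)       # Recognition Stage
    translated = translate(recognized, target)   # Translation Stage
    return render(translated, target)            # Output Generation Stage

print(pipeline("voice", b"...", "sign"))     # avatar plays: SIGN(HELLO)
print(pipeline("gesture", b"...", "speech")) # TTS says: hello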

4.5.1 Flowchart

• Input Stage:

The system starts by receiving input from the user.

A decision is made to identify the input type: voice or gesture.

• Preprocessing Stage:

For voice input, the system preprocesses the audio to remove noise and
enhance clarity.

For gesture input, the system processes image or video data to recognize
hand movements.

• Recognition Stage:

Voice input is converted to text using a speech-to-text (STT) system.

Gesture input is matched with predefined patterns using a gesture recognition algorithm.

• Translation Stage:

Translates the recognized input (text or gesture) into the desired output format
(text, sign, or speech).

• Output Stage:

Based on the user’s preferences:

o Sign Language: Displays the recognized sign.

o Text: Outputs plain text.

o Speech: Produces spoken output using text-to-speech (TTS) synthesis.

4.5.2 Sequence Diagram

• Input Stage:

▪ The User provides an input to the system (either voice or gesture).

▪ The InputModule forwards the raw input data to the Preprocessor for
cleaning and feature extraction.

▪ The Preprocessor returns the preprocessed data to the InputModule.

• Recognition and Translation:

▪ The InputModule sends the preprocessed input to the Translator, which maps the input to the desired output format (text, sign, or speech).

▪ The Translator returns the translated data to the InputModule.

• Output Generation:

▪ The InputModule sends the translated data to the OutputGenerator, specifying the desired output format (text, sign animation, or speech synthesis).

▪ The OutputGenerator delivers the final output back to the User.
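
The same message sequence can be traced as plain method calls, as in the sketch below. The participant names come from the diagram, but their Python interfaces are invented solely to show the call order.

class Preprocessor:
    def clean(self, raw: str) -> str:
        print("Preprocessor: cleaning input")
        return raw.strip()

class Translator:
    def translate(self, prepared: str, target: str) -> str:
        print(f"Translator: mapping to {target}")
        return f"{target}:{prepared}"

class OutputGenerator:
    def deliver(self, translated: str) -> str:
        print("OutputGenerator: delivering result")
        return translated

class InputModule:
    """Central participant: forwards data along the sequence shown in the diagram."""

    def __init__(self):
        self.preprocessor = Preprocessor()
        self.translator = Translator()
        self.output = OutputGenerator()

    def handle(self, raw: str, target: str) -> str:
        prepared = self.preprocessor.clean(raw)                    # Input stage
        translated = self.translator.translate(prepared, target)  # Recognition and translation
        return self.output.deliver(translated)                    # Output generation

print(InputModule().handle("  hello  ", target="speech"))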
