Ling 4807: Applications of Computer in Linguistics
Farig Sadeque
Assistant Professor
Computer Science and Engineering
BRAC University
Development of Bangla language technology: scope and necessity
Talking points
Summary of the current status
Components
Spell and grammar checker
Translation
OCR
Sentiment analysis
Speech to text and text to speech
Plagiarism checker
Question answering
Digital assistant
Sign language to text converter
Summary
Summary
Existing Bangla NLP market analysis
Market opportunity
Existing tech
Entry barrier and challenges
Spell and grammar checker
Market opportunity
Existing tech
One spell checker from EBLICT: https://spell.bangla.gov.bd/
Some other online spell checkers
No grammar check/correction tool
Some state-of-the-art research came out of the ভাষাভ্রম competition this year
Translation
English to Bangla and Bangla to English
Potential Market
75% of consumers are more likely to buy products from websites in their native language
65% of non-native English speakers prefer content in their native tongue
The market was valued at USD 650 million in 2020 and is expected to reach USD 3 billion by 2027
Interested communities:
Technology & manufacturing: translate manuals for machinery and other products
Global business people: translate to understand cultural statements better
Finance and legal: translate documents without any contextual mistakes
Marketing (copy & content writers): translate between Bangla and English to advertise products
E-commerce: translate to communicate product information
Healthcare: translate important healthcare information
Freelance writers
Existing tech
Multiple government initiatives
Amar Vasha was supposed to use artificial intelligence to translate Supreme Court orders and decisions from English to Bangla
BUET CSE published a 2.75 million sentence-pair translation corpus
Google's proprietary machine translation technology, dubbed Google Neural Machine Translation (GNMT), was originally built on recurrent neural networks
Over 4,000 volunteers from 81 locations throughout the nation entered at least
400,000 words into the translation software on a single day to celebrate Independence
Day
Entry barriers
Bangla language structure
The collected corpus was never used to build working software
Why?
OCR
Potential market: globally valued at 10.65 billion USD
Existing tech
Bangla OCR has been studied since the 1980s
BOCRA and Apona Pathak were introduced
These weren’t open source and weren’t maintained
CRBLP OCR, 2007
Tesseract project
Open source, maintained by Google
Google Lens works moderately well for OCR as well
Puthi was developed by TeamEngine, with a claimed accuracy of 95%
But the project failed for technical reasons and was never released for public use
Apurba developed one, funded by EBLICT
Let’s see how well it works, shall we?
https://ocr.bangla.gov.bd/
Entry barriers
Developing a completely new dataset for Bangla is difficult. Why?
Bangla uses an alphasyllabary script with a cursive writing style and frequent diacritics, so segmenting graphical components into characters becomes extremely challenging.
In a complex script like Bangla, large pixel groups in the upper or lower part of a character cannot be removed during noise elimination, because doing so would erase not just noise but also the difference between two characters.
The lettering of Bangla words might also make segmentation difficult.
Complex typefaces, preservation issues, etc.
No pipeline was developed
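The segmentation difficulty described above already shows up at the text level, before any image processing. The following is a minimal sketch, not an OCR method: it groups Unicode code points into approximate visual units using a simplified rule (combining marks attach to the preceding letter, and a consonant after a hasanta joins the same conjunct). The function name `visual_clusters` and the grouping rule are assumptions made for illustration; real grapheme segmentation follows more detailed Unicode rules.

```python
import unicodedata

VIRAMA = "\u09CD"  # Bengali hasanta, which fuses consonants into a conjunct


def visual_clusters(word: str) -> list[str]:
    """Group code points into approximate visual units (simplified rule)."""
    clusters: list[str] = []
    for ch in word:
        # Category "M*" covers vowel signs, anusvara, hasanta, and similar marks.
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch   # attach to the preceding visual unit
        else:
            clusters.append(ch)  # start a new visual unit
    return clusters


word = "\u09AC\u09BE\u0982\u09B2\u09BE"  # the word "Bangla" in Bangla script
conjunct = "\u0995\u09CD\u09B7"          # ka + hasanta + ssa, drawn as one glyph

print(len(word), visual_clusters(word))
print(len(conjunct), visual_clusters(conjunct))
```

The number of stored characters and the number of shapes a reader sees differ, which is one reason character-level ground truth for Bangla OCR is costly to produce.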
Sentiment Analysis
Potential market: The Asia-Pacific market is expected to reach US$523.6 million by
2027, led by nations such as Australia, India, and South Korea.
Existing tech
One publicly available app:
https://sentiment.bangla.gov.bd/sentiment-emotion-analysis
Lots of researchers and students work on sentiment analysis, but there is still no publicly available corpus
Entry barriers
Lack of quality data, no standard corpus
A lot of researchers are willing to work on the problem because it’s trendy, not
because they actually want to develop software that can analyze emotions
Social media data has issues
Speech-to-text
Potential market: valued at USD 1 billion in 2019 and expected to grow to USD 3 billion by 2027
Existing tech
Some major datasets exist, but no usable model
Not enough data
Lacks variety
Needs three major components:
Acoustic model
Pronunciation model
Language model
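One common way to see how the three components fit together is the classical noisy-channel formulation of speech recognition. The notation is introduced here for illustration and does not come from the slides: X is the sequence of acoustic features, W a candidate word sequence, and Q a phone sequence. The second step assumes the audio depends on the words only through their pronunciation.

```latex
\begin{align}
\hat{W} &= \arg\max_{W} P(W \mid X)
         = \arg\max_{W} P(X \mid W)\, P(W) \\
        &\approx \arg\max_{W}
           \underbrace{P(W)}_{\text{language model}}
           \sum_{Q}
           \underbrace{P(X \mid Q)}_{\text{acoustic model}}\,
           \underbrace{P(Q \mid W)}_{\text{pronunciation model}}
\end{align}
```

Each factor needs its own training data, which is part of why data volume and variety matter so much here.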
Entry barriers
Data acquisition
Need 10k+ hours of speech data
None of the datasets mentioned above has more than 500 hours
Text-to-speech
Potential market: the worldwide text-to-speech market is expected to reach USD 5,790.1 million by 2028, up from USD 2,543.1 million in 2021, at a 12.3 percent CAGR between 2022 and 2028
Existing tech
Kotha, based on Festival, was released in 2007 by CRBLP
Other systems include Subachan and Anuprash
Entry barriers
Lack of publicly accessible gold standard data
Difficult to compare models
Long term sustainability is an issue
Kotha is still available online, but no one has maintained it in the last 10 years; it still needs Windows 7 to run
Speech synthesis by its nature is a difficult task
Plagiarism checker
The global market for anti-plagiarism software in the education sector is expected to increase at a CAGR of 13.8 percent between 2020 and 2027, from USD 819.5 million in 2020 to USD 2,029.4 million in 2027
Due to the lack of a national plagiarism policy, institutions are sometimes unable to
take action against plagiarized research. No university in Bangladesh even has a
plagiarism policy
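The market growth figure quoted at the top of this slide can be sanity-checked with the standard compound annual growth rate formula. This is a small sketch using only the start value, end value, and years stated on the slide; the function name `cagr` is chosen here for illustration.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Anti-plagiarism software market figures quoted on the slide (USD million).
rate = cagr(start_value=819.5, end_value=2029.4, years=2027 - 2020)
print(f"{rate:.1%}")
```

The result comes out at roughly 13.8 percent, consistent with the quoted rate.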
Existing tech
No reliable, dedicated tool exists at the moment
A couple of old efforts are there: one tried to detect plagiarism from NCTB books
Entry barriers
Lack of plagiarism policy
Extensive data is required
Document similarity techniques are not new, but what are we going to compare documents against?
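To make the last point concrete, here is a minimal sketch of one well-known document similarity technique: Jaccard overlap of word n-grams (shingles). It is only an illustration, not a description of any existing Bangla tool. The function names, the whitespace tokenization, and the sample sentences are assumptions. The sketch also shows the barrier itself: the score only means something if there is a reference text to compare against.

```python
def shingles(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a whitespace-tokenized text."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(doc_a: str, doc_b: str, n: int = 2) -> float:
    """Share of n-grams the two documents have in common (0.0 to 1.0)."""
    a, b = shingles(doc_a, n), shingles(doc_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample sentences; a real checker needs a large reference corpus.
reference = "আমি আজ সকালে স্কুলে গিয়েছিলাম"
submission = "আমি আজ সকালে স্কুলে যাইনি"

score = jaccard_similarity(reference, submission)
print(round(score, 2))
```

Whitespace tokenization is a simplification; inflected Bangla word forms would usually need stemming or character-level n-grams before overlap scores become reliable.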