
adiparashar/LTRC


Interlingua (USR) based machine translation for Indian languages

This repository contains the code and related files from my work as a research intern at the Language Technologies Research Centre (LTRC), IIIT Hyderabad, under the guidance of Dr. Sukhada (IIT BHU) and Dr. Soma Paul (IIIT-H).

Indian languages are syntactically and morphologically complex, and they are also low-resource. The objective of the project is to create effective methods for neural machine translation with limited resources. We approach this by combining developments in deep learning with heuristics from traditional linguistic and grammatical knowledge.

The main focus is to design an interlingua, the Universal Semantic Representation (USR), based on the Pāṇinian Sanskrit grammatical framework. It serves as a comprehensible intermediary representation for all the languages included in the translation process. Our experiments focused on Hindi, Sanskrit and English (as a proof of concept).

My undertakings can be broadly categorized into the following:

Dataset creation and processing

Concept dictionary creation: Hindi and Sanskrit bilingual dictionaries from various sources (1, 2, 3) were scraped to build a concept dictionary repository. These concept dictionaries map Hindi and Sanskrit words to their concepts (single or compound word meanings in a common language, say English) and to the generated Minimal Recursion Semantics (MRS) features.
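The merging step described above can be pictured with a minimal sketch. The function name `build_concept_dictionary`, the underscore-joined concept labels, and the sample entries are all assumptions made for illustration; the actual scraping code, concept label format, and MRS features used in the project are not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One scraped source is modelled as (word, English gloss) pairs.
# This shape is an assumption; the real scraped data may look different.
Source = Iterable[Tuple[str, str]]


def build_concept_dictionary(sources: Iterable[Source]) -> Dict[str, List[str]]:
    """Merge several bilingual word lists into one word -> concepts mapping."""
    repository = defaultdict(set)
    for entries in sources:
        for word, gloss in entries:
            # Hypothetical normalisation: a compound meaning such as
            # "drinking water" becomes the single label "drinking_water".
            concept = "_".join(gloss.lower().split())
            repository[word].add(concept)
    # Sort the concepts so the result is deterministic.
    return {word: sorted(concepts) for word, concepts in repository.items()}


# Illustrative entries only, not taken from the project's dictionaries.
source_a = [("फल", "fruit"), ("पेयजल", "drinking water")]
source_b = [("फल", "result")]
concept_dictionary = build_concept_dictionary([source_a, source_b])
```

Keeping a set of concepts per word lets senses from different sources accumulate without duplicates, which matters when the same word appears in more than one dictionary.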

USR generation: The Universal Semantic Representation (USR) captures the meaning expressed by a sentence in the discourse. It has rows corresponding to properties of the sentence and its concept words. These properties are the concepts (with the TAM (tense-aspect-modality) specification on the verb), the semantic category of nouns, GNP (gender, number, person) information, dependency relations, anaphora, the speaker's viewpoints, the sentence type, etc. The USR acts as the interlingua in our translation process. USRs are generated by following heuristics from the Pāṇinian Sanskrit grammatical framework.
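To make the row-based layout more concrete, here is a rough sketch of a container with one row per property. The class name, the field names, the `head:relation` notation, the `eat+past` notation for TAM, and the sample values are all hypothetical; the project's actual USR file format is not reproduced here, and rows such as anaphora and speaker's viewpoint are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class USRSketch:
    """Hypothetical row-wise view of a USR; not the project's real format."""
    concepts: List[str]                     # one entry per concept word
    semantic_category: List[Optional[str]]  # filled for nouns only
    gnp: List[Optional[str]]                # gender, number, person
    dependency: List[Optional[str]]         # "head_index:relation"
    sentence_type: str = "affirmative"

    def __post_init__(self) -> None:
        # Every per-concept row must align with the concept row.
        expected = len(self.concepts)
        for name in ("semantic_category", "gnp", "dependency"):
            if len(getattr(self, name)) != expected:
                raise ValueError(f"row '{name}' must have {expected} entries")

    def to_rows(self) -> List[str]:
        """Render each property as one comma-separated row."""
        def join(row: List[Optional[str]]) -> str:
            return ",".join(cell or "" for cell in row)

        return [
            join(self.concepts),
            join(self.semantic_category),
            join(self.gnp),
            join(self.dependency),
            self.sentence_type,
        ]


# Illustrative values for a sentence meaning "Ram ate a fruit".
usr = USRSketch(
    concepts=["Ram", "fruit", "eat+past"],
    semantic_category=["person", None, None],
    gnp=["m sg 3", "m sg 3", None],
    dependency=["3:k1", "3:k2", "0:main"],
)
rows = usr.to_rows()
```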

Sentence generation

To generate the sentence back from a given USR, two kinds of approaches were followed, the hybrid approach and the direct approach, which differ in the proportion of deep learning and linguistics they involve (see footnote 1).

  • The hybrid approach first generated the sentences with a linguistic rule-based system. Since the USRs did not carry postposition-related details, these generated sentences often lacked postpositions or contained incorrect ones. To generate sentences with the correct postpositions, LLMs were finetuned on a mask prediction task for Hindi sentences, where the masks were the unknown postpositions.

  • The direct approach generated the sentences directly from a given USR. The USRs were converted into graphs using hybrid rules drawn from the Abstract Meaning Representation (AMR) and Universal Networking Language (UNL) frameworks. These USR graphs were then linearized into sequences using a depth-first search (DFS) based approach. We finetuned different seq2seq LLMs, such as BART and mT5, to generate the sentences back from these USR linearizations. BART was chosen because of its denoising training objective, which could help in recovering the sentences from the linearizations.
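The input preparation for both approaches can be sketched as follows. This is only an illustration under stated assumptions: the postposition list is a small hand-picked sample, the `<mask>` token depends on the tokenizer of whichever model is finetuned, and the bracketed output format with relation labels such as `k1` and `k2` is borrowed from AMR-style notation rather than taken from the project's code. The helper names `mask_postpositions` and `linearize` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical, incomplete sample of Hindi postpositions.
POSTPOSITIONS = {"ने", "को", "से", "में", "पर", "का", "की", "के"}
MASK = "<mask>"  # assumption: the real mask token depends on the tokenizer


def mask_postpositions(sentence: str) -> Tuple[str, List[str]]:
    """Hybrid approach: hide postpositions so a model can learn to predict them."""
    masked, targets = [], []
    for token in sentence.split():
        if token in POSTPOSITIONS:
            masked.append(MASK)
            targets.append(token)
        else:
            masked.append(token)
    return " ".join(masked), targets


# A graph maps each node to its outgoing (relation, child) edges.
Graph = Dict[str, List[Tuple[str, str]]]


def linearize(graph: Graph, node: str, visited: Optional[Set[str]] = None) -> str:
    """Direct approach: depth-first, bracketed linearization from a root node."""
    if visited is None:
        visited = set()
    if node in visited:
        return node  # already expanded elsewhere: emit only its name
    visited.add(node)
    parts = [node]
    for relation, child in graph.get(node, []):
        parts.append(f":{relation} {linearize(graph, child, visited)}")
    return "( " + " ".join(parts) + " )"


masked_text, targets = mask_postpositions("राम ने सीता को फल दिया")
# Illustrative graph; the relation labels are assumptions.
graph = {"give": [("k1", "Ram"), ("k2", "fruit"), ("k4", "Sita")]}
sequence = linearize(graph, "give")
```

The `visited` set keeps the traversal finite if a node is reachable through more than one path; such a node is expanded once and referred to by name afterwards.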

NOTE: This is ongoing research, with constant developments and revisions across all the moving parts, such as the concept dictionaries and the USR heuristics. My work during the nascent stage of this project involved laying the groundwork for the datasets and experimenting with neural monolingual text generation from a hypothesised and dynamic interlingua (USR).

Footnotes

  1. As a proof-of-concept experiment, the sentences were generated in the same language as the source language of the USR (Hindi). If these approaches achieve satisfactory results, they can be extended to cross-lingual sentence generation, i.e. translation.
