Interlingua (USR) based machine translation for Indian languages

This repository contains the code and related files from my work as a research intern at the Language Technologies Research Centre (LTRC), IIIT Hyderabad, under the guidance of Dr. Sukhada (IIT BHU) and Dr. Soma Paul (IIIT-H).

Indian languages are syntactically and morphologically complex, in addition to being low-resource. The objective of the project is to create effective methods for neural machine translation with limited resources. This can be achieved by combining developments in deep learning with heuristics from traditional linguistic and grammatical knowledge.

The main focus is to design an interlingua (USR, Universal Semantic Representation) based on the Pāṇinian Sanskrit grammatical framework, which will serve as a comprehensible intermediary representation for all the languages included in the translation process. Our experiments focused on Hindi, Sanskrit, and English (as a proof of concept).

My undertakings can be broadly categorized into the following:

Dataset creation and processing

Concept dictionary creation: Hindi and Sanskrit bilingual dictionaries from various sources (1, 2, 3) were scraped to build a concept dictionary repository. These concept dictionaries map Hindi and Sanskrit words to their concepts (single or compound word meanings in a common language, say English) and to the generated Minimal Recursion Semantics (MRS) features.

USR generation: A Universal Semantic Representation (USR) captures the meaning expressed by a sentence in the discourse. It has rows corresponding to properties of the sentence and its concept words. These properties are the concepts (with the TAM (tense-aspect-modality) specification on the verb), the semantic category of nouns, GNP (gender, number, person) information, dependency relations, anaphora, the speaker's viewpoints, sentence type, etc. The USR acts as the interlingua in our translation process. USRs are generated by following heuristics from the Pāṇinian Sanskrit grammatical framework.
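
To make the row-based structure concrete, here is a minimal sketch of how a USR could be held in memory. The class names, field names, and example values (`ConceptRow`, `USR`, `eat+past`, the `k1`/`k2` labels for kartā and karma) are illustrative assumptions, not the file format used in this repository.

```python
from dataclasses import dataclass


@dataclass
class ConceptRow:
    """One concept word with its properties (hypothetical layout)."""
    concept: str            # concept label; the verb also carries its TAM
    semantic_category: str  # semantic category for nouns, empty otherwise
    gnp: str                # gender, number, person
    head: int               # 1-based index of the governing concept, 0 for the root
    relation: str           # dependency relation to the head


@dataclass
class USR:
    """A sentence-level property plus one row per concept (hypothetical layout)."""
    sentence_type: str
    rows: list[ConceptRow]

    def root(self) -> ConceptRow:
        # The root is the only row that has no governing concept.
        return next(row for row in self.rows if row.head == 0)

    def dependents(self, index: int) -> list[ConceptRow]:
        # All rows governed by the concept at the given 1-based index.
        return [row for row in self.rows if row.head == index]


# Illustrative USR for a sentence meaning "Ram ate a fruit".
usr = USR(
    sentence_type="affirmative",
    rows=[
        ConceptRow("Ram", "person", "m sg 3", 3, "k1"),
        ConceptRow("fruit", "", "m sg 3", 3, "k2"),
        ConceptRow("eat+past", "", "", 0, "main"),
    ],
)
```

Keeping the dependency head as an index into the row list means the same rows can later be read as a graph without a separate conversion step.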

Sentence generation

To generate a sentence back from a given USR, two approaches were followed: the hybrid approach and the direct approach, which differ in the proportion of deep learning and linguistics they involve.¹

The hybrid approach first generated the sentences with linguistic rules. Because the USRs did not carry postposition details, these generated sentences often lacked postpositions or contained incorrect ones. To generate sentences with the correct postpositions, LLMs were finetuned on a mask prediction task for Hindi sentences, where the masks were the unknown postpositions.
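
The training data for such a mask prediction task could be prepared roughly as follows. This is a sketch under stated assumptions: the postposition list is a small, non-exhaustive sample, the `[MASK]` token and the function name `mask_postpositions` are hypothetical, and multi-word postpositions are not handled.

```python
# Assumed, non-exhaustive set of single-token Hindi postpositions.
POSTPOSITIONS = {"ने", "को", "से", "में", "पर", "का", "की", "के"}


def mask_postpositions(sentence, mask_token="[MASK]"):
    """Replace each postposition with a mask token.

    Returns the masked sentence and the gold postpositions in order,
    which together form one training example.
    """
    masked_tokens = []
    labels = []
    for token in sentence.split():
        if token in POSTPOSITIONS:
            masked_tokens.append(mask_token)
            labels.append(token)
        else:
            masked_tokens.append(token)
    return " ".join(masked_tokens), labels


masked, labels = mask_postpositions("राम ने बाजार में फल खरीदा")
print(masked)
print(labels)
```

At inference time the same masking idea applies to the rule-generated sentence: a mask is placed at each slot where a postposition may be needed and the finetuned model fills it in.
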
The direct approach generated the sentences directly from a given USR. The USRs were converted into graphs using hybrid rules drawn from the Abstract Meaning Representation (AMR) and Universal Networking Language (UNL) frameworks. These USR graphs were then linearized into sequences using a depth-first search (DFS) based approach. We finetuned different seq2seq LLMs, such as BART and mT5, to generate the sentences back from these USR linearizations. BART was chosen because of its denoising training objective, which could help in recovering the sentences from the linearizations.
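
The DFS-based linearization step can be illustrated with a short sketch. The graph encoding (a dict from node to labelled edges), the bracketed output format loosely modelled on AMR's PENMAN notation, and the relation labels are all assumptions made for illustration; the actual conversion rules in this project are not reproduced here.

```python
def linearize(graph, root):
    """Linearize a labelled graph into a token sequence by depth-first search.

    `graph` maps a node to a list of (relation, child) pairs.
    """
    tokens = []
    visited = set()

    def visit(node):
        if node in visited:
            # Re-entrant node: emit only its label to avoid infinite loops.
            tokens.append(node)
            return
        visited.add(node)
        tokens.extend(["(", node])
        for relation, child in graph.get(node, []):
            tokens.append(":" + relation)
            visit(child)
        tokens.append(")")

    visit(root)
    return " ".join(tokens)


# Hypothetical USR graph for a sentence meaning "Ram ate a sweet fruit".
graph = {
    "eat": [("k1", "Ram"), ("k2", "fruit")],
    "fruit": [("mod", "sweet")],
}
linearized = linearize(graph, "eat")
print(linearized)  # ( eat :k1 ( Ram ) :k2 ( fruit :mod ( sweet ) ) )
```

The resulting string is the kind of source sequence a seq2seq model would be finetuned on, with the original sentence as the target.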

NOTE: This is ongoing research, with constant developments and revisions across all the moving parts, such as the concept dictionaries and the USR heuristics. My work during the nascent stage of this project involved laying the groundwork for the datasets and experimenting with neural monolingual text generation from a hypothesised and dynamic interlingua (USR).

¹ As a proof-of-concept experiment, the sentences were generated in the same language as the source language of the USR (Hindi). If these approaches achieve satisfactory results, they can be extended to cross-lingual sentence generation, i.e. translation.
