The Bhojpuri LT Resources (BHLTR) project was initiated by me (Atul) at Jawaharlal Nehru University (JNU), New Delhi, during my doctoral research. The BHLTR data contains monolingual, parallel (English-Bhojpuri), and POS-annotated monolingual corpora. The POS annotation follows the Bureau of Indian Standards (BIS) part-of-speech (POS) tagset.
BHLTR/
├─ mono-bho-corpus/
│ ├─ monolingual.bho
│ ├─ README.md
│ ├─ pos-annotated/
│ │ └─ pos-tagged.bho
│ ├─ treebank/
│ │ └─ README.md
│
└─ parallel-corpora/
├─ eng--bho.training.eng
├─ eng--bho.training.bho
├─ eng--bho.development.eng
├─ eng--bho.development.bho
├─ eng--bho.test.eng2bho.eng
├─ eng--bho.test.bho2eng.bho
├─ additional-resources.md
├─ license.md
├─ README.md
├─ README.txt
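The training and development files in the tree above come in `.eng`/`.bho` pairs. A minimal sketch of loading one such pair in Python follows. It assumes that the files are UTF-8 plain text with one sentence per line, and that line i of the `.eng` file corresponds to line i of the `.bho` file. Neither assumption is stated in this README, so please check the data before relying on it. The function name `read_parallel` is hypothetical and not part of BHLTR.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def read_parallel(src_path: str, tgt_path: str) -> List[Tuple[str, str]]:
    """Return (source, target) sentence pairs from two line-aligned files.

    Assumption (not confirmed by the README): one sentence per line, and
    line i of the source file is aligned with line i of the target file.
    """
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        # A mismatch means the files cannot be paired line by line.
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return [(s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)]
```

For example, `read_parallel("parallel-corpora/eng--bho.training.eng", "parallel-corpora/eng--bho.training.bho")` would return the training pairs, provided the files sit at those paths relative to the working directory. The line-count check is there so that a misaligned pair fails loudly instead of silently producing wrong translations.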
I would like to thank my doctoral supervisor, Prof. Girish Nath Jha, and the Sanskrit Computational Lab, JNU, New Delhi.
If you use this data, please cite:
@article{ojha2019english,
title={English-Bhojpuri SMT System: Insights from the Karaka Model},
author={Ojha, Atul Kr},
journal={arXiv preprint arXiv:1905.02239},
year={2019}
}
Other papers and references about BHLTR:
@inproceedings{karakanta2019proceedings,
title={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
author={Karakanta, Alina and Ojha, Atul Kr and Liu, Chao-Hong and Washington, Jonathan and Oco, Nathaniel and Lakew, Surafel Melaku and Malykh, Valentin and Zhao, Xiaobing},
booktitle={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
year={2019}
}
@article{kumar2018automatic,
title={Automatic identification of closely-related Indian languages: Resources and experiments},
author={Kumar, Ritesh and Lahiri, Bornini and Alok, Deepak and Ojha, Atul Kr and Jain, Mayank and Basit, Abdul and Dawer, Yogesh},
journal={arXiv preprint arXiv:1803.09405},
year={2018}
}
@inproceedings{ojha2015training,
title={Training \& evaluation of POS taggers in Indo-Aryan languages: a case of Hindi, Odia and Bhojpuri},
author={Ojha, Atul Kr. and Behera, Pitambar and Singh, Srishti and Jha, Girish N},
booktitle={the proceedings of 7th Language \& Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics},
pages={524--529},
year={2015}
}
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: BHLTR v1.0
License: CC BY-NC-SA 4.0
Includes text: yes
Contributors: Ojha, Atul Kr.
Copyright (©) holder: Ojha, Atul Kr.
Contact: [email protected]
===============================================================================