This project implements a sequence-to-sequence (Seq2Seq) LSTM model for lemmatizing Sinhala words, mapping inflected forms to their base forms (lemmas). The model is built using PyTorch and trained on a dataset of Sinhala word-lemma pairs. It supports the Sinhala script and handles morphological variations common in the language.
- Seq2Seq Architecture: Uses an encoder-decoder LSTM with embeddings for character-level lemmatization.
- Sinhala Support: Handles Sinhala Unicode characters, including vowels, consonants, and diacritics.
- Training and Inference: Includes scripts for training (train.py) and testing (test.py) the model.
- Custom Vocabulary: Maps Sinhala characters to indices (mappings.json) for robust encoding.
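
As a rough illustration of the encoder-decoder design listed above, the sketch below wires a character-level LSTM encoder to an LSTM decoder in PyTorch. The class name `Seq2SeqSketch`, the layer sizes, the use of index 0 for padding, and the teacher-forced forward pass are all assumptions made for illustration; the model defined in train.py may be structured differently.

```python
import torch
import torch.nn as nn


class Seq2SeqSketch(nn.Module):
    """Hypothetical character-level encoder-decoder; not the code from train.py."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64, num_layers=2):
        super().__init__()
        # Assumption: index 0 is reserved for padding.
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt_in):
        # Encode the inflected word; keep only the final (hidden, cell) state.
        _, state = self.encoder(self.embedding(src))
        # Decode the lemma with teacher forcing, starting from the encoder state.
        dec_out, _ = self.decoder(self.embedding(tgt_in), state)
        return self.out(dec_out)  # logits of shape (batch, tgt_len, vocab_size)


model = Seq2SeqSketch(vocab_size=100)
src = torch.randint(1, 100, (4, 7))     # 4 words, 7 character indices each
tgt_in = torch.randint(1, 100, (4, 6))  # lemma characters fed to the decoder
logits = model(src, tgt_in)
print(logits.shape)
```

During training, logits like these would typically be compared against the target lemma characters with a cross-entropy loss; at inference time the decoder would instead be run one character at a time, feeding each prediction back in.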
The model is trained on input.json, which contains pairs of inflected Sinhala words and their lemmas. Example:
{
"අකුරු": "<අකුර>",
"අගයන්": "<අගය>",
"ඈත": "<ඈත>",
...
}

Lemmas are wrapped in angle brackets (<, >) to distinguish them from input words.
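
The snippet below is a minimal sketch of how a file in this format could be read, using only the standard library. The helper names `load_pairs`, `strip_brackets`, and `split_pairs`, as well as the 80/20 train/validation split, are hypothetical and not taken from train.py.

```python
import json
import random

# A tiny in-memory stand-in for input.json, using the entries shown above.
raw = '{"අකුරු": "<අකුර>", "අගයන්": "<අගය>", "ඈත": "<ඈත>"}'


def load_pairs(text):
    """Parse word -> lemma JSON and return (word, bracketed_lemma) tuples."""
    data = json.loads(text)
    return list(data.items())


def strip_brackets(lemma):
    """Return the bare lemma without the surrounding < and > markers."""
    if lemma.startswith("<") and lemma.endswith(">"):
        return lemma[1:-1]
    return lemma


def split_pairs(pairs, val_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


pairs = load_pairs(raw)
train_pairs, val_pairs = split_pairs(pairs)
bare = [strip_brackets(lemma) for _, lemma in pairs]
```

Because the brackets mark where a lemma begins and ends, they can double as start and end markers for the decoder, though whether the project uses them that way is not stated here.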
- Python 3.8+
- PyTorch (pip install torch)
- JSON for data handling (standard library)
- Clone the repository:

      git clone https://github.com/your-username/sinhala-lemmatizer.git
      cd sinhala-lemmatizer

- Install dependencies:

      pip install torch

- Ensure input.json and mappings.json are in the project directory.
To train the model:

    python train.py

- Loads input.json and mappings.json.
- Trains a Seq2Seq LSTM model with a two-layer encoder-decoder architecture.
- Saves the best model weights to sinhalemming.pth based on validation loss.
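
Saving the best weights by validation loss usually means keeping track of the lowest loss seen so far and writing a checkpoint only when it improves. The sketch below shows that logic in isolation. The function `train_with_checkpointing` and its `save_fn` callback are hypothetical, and the list of losses stands in for a real training loop.

```python
def train_with_checkpointing(epoch_losses, save_fn):
    """Call save_fn whenever validation loss improves; return the best loss.

    epoch_losses: validation losses, one per epoch (stand-in for real training).
    save_fn: hypothetical callback; in a PyTorch script it could wrap
             torch.save(model.state_dict(), "sinhalemming.pth").
    """
    best = float("inf")
    for epoch, val_loss in enumerate(epoch_losses, start=1):
        if val_loss < best:
            best = val_loss
            save_fn(epoch, val_loss)
    return best


saved = []
best = train_with_checkpointing(
    [0.9, 0.7, 0.8, 0.65, 0.66],
    save_fn=lambda epoch, loss: saved.append((epoch, loss)),
)
print(best, saved)
```

With the made-up losses above, a checkpoint is written at epochs 1, 2, and 4, and the later, slightly worse epochs leave the saved file untouched.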
To test the model on sample words:

    python test.py

- Loads the trained model (sinhalemming.pth) and mappings.json.
- Predicts lemmas for test words, e.g.:

      Starting Predict
      අක්මාවේ → <අක්මා>
      අගයනවා → <අගය>
      ඈත → <ඈත>
      ...
- Add Words: Update input.json with new word-lemma pairs.
- Extend Vocabulary: Modify mappings.json or regenerate it using the create_mappings function in train.py to include additional Sinhala characters.
- Hyperparameters: Adjust train.py (e.g., batch size, learning rate, LSTM layers) for better performance.
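
To give a sense of what regenerating the vocabulary involves, here is a minimal sketch that builds a character-to-index mapping from word-lemma pairs. It is not the `create_mappings` function from train.py: the name `build_mappings`, the `<pad>` and `<unk>` tokens, and their indices are assumptions for illustration.

```python
import json

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_mappings(pairs):
    """Assign an index to every code point seen in the words and lemmas."""
    chars = sorted({ch for word, lemma in pairs for ch in word + lemma})
    mapping = {PAD: 0, UNK: 1}
    for ch in chars:
        mapping[ch] = len(mapping)
    return mapping


def encode(text, mapping):
    """Map each code point to its index, falling back to UNK if unseen."""
    return [mapping.get(ch, mapping[UNK]) for ch in text]


pairs = [("අකුරු", "<අකුර>"), ("ඈත", "<ඈත>")]
mapping = build_mappings(pairs)
encoded = encode("ඈත", mapping)

# ensure_ascii=False keeps Sinhala characters readable in the JSON output.
serialized = json.dumps(mapping, ensure_ascii=False, indent=2)
```

Note that this sketch works per Unicode code point, so a consonant and its attached vowel sign receive separate indices. After changing the mapping, the model would need to be retrained, since the embedding size depends on the vocabulary size.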
- train.py: Script for training the lemmatizer model.
- test.py: Script for testing the model on sample words.
- mappings.json: Character-to-index mappings for Sinhala characters and special tokens.
- input.json: Training dataset with word-lemma pairs.
- sinhalemming.pth: Trained model weights (generated after training).
This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to:
- Share: Copy and redistribute the material in any medium or format.
- Adapt: Remix, transform, and build upon the material for any purpose, even commercially.
Under the following terms:
- Attribution: You must give appropriate credit to the original author, provide a link to the license, and indicate if changes were made.
- No Additional Restrictions: You may not apply legal terms or technological measures that restrict others from doing anything the license permits.
See the full license for details.
- Built with PyTorch.
- Inspired by Seq2Seq models for natural language processing.