This is the group project for the course Natural Language Processing.
This repository has two main folders: datasets and experiments.
- The
datasetsfolder contains code for creating German, Dutch and English corpus. - The
experimentsfolder contains the experiment code, parameter.yaml file and replicable output.
You can prepare all datasets by running sh ./createALLdatasets.sh from within the folder datasets. For English, German and Dutch a list of all available lemma's with their wordforms is saved to [english|dutch|german]_bylemma_[orth|phon].txt
After that, 6100 wordforms are chosen and saved to [src|tgt]_[train|test|valid].txt, and you can use these data files directly to run the experiments.
The experiments folder contains the experiments that replicate the two experiments in the paper Kirov & Cotterell (2018), and also include experiments using new generated English, German and Dutch data. You can run the experiments by starting from experiment_phon.ipynb for experiments related to phoneme features with the data in the folder [english|dutch|german]_phon, and experiment_orth.ipynb for experiments related to orthographic (grapheme) features with the data in the folder [english|dutch|german]_orth from the Experiments folder.