The following repository includes the code and pre-trained cross-lingual word embeddings from the paper Improving cross-lingual word embeddings by meeting in the middle (EMNLP 2018).
We release the 300-dimension word embeddings used in our experiments (English, Spanish, Italian, German and Finnish) as binary bin files:
- Monolingual FastText embeddings: Available here
- Baseline cross-lingual embeddings: Available here
- Cross-lingual embeddings post-processed with Meemi: Available here
Note 1: All vocabulary words are lowercased.
Note 2: If you would like to convert the binary files to txt, you can use convertvec.
Requirements:
- Python 3
- NumPy
- Gensim
- If you use VecMap or MUSE, please also check their corresponding GitHub pages. Note that we use a previous version of these tools, of which there is a copy in this repository (WIP).
get_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILE [-vecmap | -muse TRAIN_DICT VALID_DICT]get_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILEExample:
get_crossembs.sh EN-ES.english.vecmap.txt EN-ES.spanish.vecmap.txt en-es.dict.txtget_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILE -vecmapget_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILE -muse TRAIN_SIZE VALID_SIZEIn order to test your embeddings on bilingual dictionary induction type the following:
python test.py SOURCE_EMBEDDINGS TARGET_EMBEDDINGS < DICTIONARY_FILEIn order to test your embeddings on monolingual word similarity type the following:
python test_similarity_monolingual.py EMBEDDINGS DATASETYou can also test various datasets at the same time:
python test_similarity_monolingual.py EMBEDDINGS DATASET1 [DATASET2] ... [DATASETN]Likewise, to test your cross-lingual embeddings on cross-lingual word similarity type the following:
python test_similarity_crosslingual.py SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DATASETAs with monolingual similarity, you can also test various datasets at the same time. Below is an example of how to test your English-Spanish cross-lingual embeddings on all the monolingual and cross-lingual word similarity datasets:
python test_similarity_monolingual.py EN-ES.english.vecmap.meemi.bin data/SimLex/simlex-999_english.txt data/SemEval2018-subtask1-monolingual/english.txt data/rg65-monolingual/rg65_english.txt data/WS353-monolingual/WS353-english-sim.txt
python test_similarity_monolingual.py EN-ES.english.vecmap.meemi.bin data/SemEval2018-subtask1-monolingual/spanish.txt data/rg65-monolingual/rg65_spanish.txt
python test_similarity_crosslingual.py EN-ES.english.vecmap.meemi.bin EN-ES.spanish.vecmap.meemi.bin data/SemEval2018-subtask2-crosslingual/en-es.txt data/rg65-crosslingual/rg65_EN-ES.txtNote: This code assumes that lowercased word embeddings are provided as input. If you would like to mantain the casing, simply remove the .lower() commands in the evaluation scripts.
Hypernym Discovery is the task to retrieve, for a given term, a ranked list of valid hypernyms. In this experiment, a hypernym discovery system is trained in English data (and possibly in a weakly supervised setting with some target language data), and makes predictions in the target language.
To run the hypernym discovery experiments, launch the following command:
python3 experiments/hypernym_discovery/taxoembed.py -wvtrain SOURCE_EMBEDDINGS -wvtest TARGET_EMBEDDINGS -vtest TARGET_VOCABULARY -hypotrain SOURCE_HYPONYMS -hypertrain SOURCE_HYPERNYMS -test TARGET_HYPONYMS -newtrain TARGET_LANG_TRAINING_INSTANCES -npairs NUMB_TRAINING_INSTANCES -o OUTPUT_FOLDER The predictions of the model are saved in OUTPUT_FOLDER with the name [TARGET_EMBEDDINGS]_[NUMB_TRAINING_INSTANCES]_[TARGET_LANG_TRAINING_INSTANCES]_W.txt.
For example, evaluating a hypernym discovery model for Spanish trained on VecMap English vectors and 500 additional instances in Spanish:
python3 experiments/hypernym_discovery/taxoembed.py -wvtrain EN-ES.english.vecmap.bin -wvtest EN-ES.spanish.vecmap.bin -vtest experiments/hypernym_discovery/SemEval2018-Task9/vocabulary/1C.spanish.vocabulary.txt -hypotrain experiments/hypernym_discovery/SemEval2018-Task9/training/data/1A.english.training.data.txt -hypertrain experiments/hypernym_discovery/SemEval2018-Task9/training/gold/1A.english.training.gold.txt -test experiments/hypernym_discovery/SemEval2018-Task9/test/data/1C.spanish.test.data.txt -o experiments/hypernym_discovery/ -newtrain experiments/hypernym_discovery/SemEval2018-Task9/utils/spanish_train.tsv -npairs 500Then, call the official SemEval task scorer passing as arguments the gold file and the predictions file generated in the previous step.
python experiments/hypernym_discovery/SemEval2018-Task9/task9-scorer.py GOLD_FILE PREDICTIONS_FILEFor the previous example, the exact command would be:
python experiments/hypernym_discovery/SemEval2018-Task9/task9-scorer.py experiments/hypernym_discovery/SemEval2018-Task9/test/gold/1C.spanish.test.gold.txt experiments/hypernym_discovery/EN-ES.spanish.vecmap.bin_500_1C.spanish.output_W.txt If you use any of these resources, please cite the following paper:
@InProceedings{doval:meemiemnlp2018,
author = "Doval, Yerai and Camacho-Collados, Jose and Espinosa-Anke, Luis and Schockaert, Steven",
title = "Improving cross-lingual word embeddings by meeting in the middle",
booktitle = "Proceedings of EMNLP",
year = "2018",
publisher = "Association for Computational Linguistics",
location = "Brussels, Belgium"
}
If you use VecMap or MUSE, please also cite their corresponding papers.