BCWS: Bilingual Contextual Word Similarity Dataset

Description

Every four lines in the file form one test instance. The first line is the POS tag. The second and third lines are the bilingual sentences, with the target word in each sentence indicated by <target_word>. The fourth line contains the scores from 11 annotators and their average value. For more details, please refer to [2].
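The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not the official loader. It rests on several assumptions that the description does not confirm: the file contains no blank separator lines, the target word is wrapped in angle brackets (for example `<bank>`), and the fourth line holds plain decimal numbers separated by some delimiter. The names `parse_bcws` and `find_target` are hypothetical, and the sketch does not guess which of the numbers is the average or which sentence is in which language.

```python
import re
from typing import Dict, List, Optional

# Assumption: scores are plain decimal numbers; the delimiter is not specified,
# so every number on the line is collected regardless of the separator.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
# Assumption: the target word is wrapped in angle brackets, e.g. "<bank>".
TARGET = re.compile(r"<([^<>]+)>")


def find_target(sentence: str) -> Optional[str]:
    """Return the first bracketed word in the sentence, or None if absent."""
    match = TARGET.search(sentence)
    return match.group(1) if match else None


def parse_bcws(text: str) -> List[Dict]:
    """Group the text into four-line instances (hypothetical helper)."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 lines, got %d" % len(lines))

    instances = []
    for i in range(0, len(lines), 4):
        pos, first, second, score_line = lines[i:i + 4]
        instances.append({
            "pos": pos.strip(),
            # The order of the two languages is not assumed here.
            "sentences": (first, second),
            "targets": (find_target(first), find_target(second)),
            # All numbers on the fourth line: annotator scores and the average.
            "scores": [float(x) for x in NUMBER.findall(score_line)],
        })
    return instances
```

To use it, read the dataset file as UTF-8 text and pass the contents to `parse_bcws`. Each returned dictionary then carries the POS tag, the sentence pair, the extracted target words, and the numeric scores for one instance.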

Please note that the dataset uses Traditional Chinese. If your training corpus or vocabulary is in Simplified Chinese, we suggest converting it during preprocessing. For details, please refer to this repository; we use this toolkit with the s2t conversion mode.

References

Please cite [1] and [2] if you find the resources in this repository useful.

CLUSE: Cross-Lingual Unsupervised Sense Embeddings

[1] Ta-Chung Chi and Yun-Nung Chen, CLUSE: Cross-Lingual Unsupervised Sense Embeddings

@inproceedings{chi-chen:2018:EMNLP2018,
  author    = {Chi, Ta-Chung  and  Chen, Yun-Nung},
  title     = {{CLUSE}: Cross-Lingual Unsupervised Sense Embeddings},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
}