Ad-hoc implementation of the CS/CG model proposed by Wei et al.
- Each dataset must be defined as a sub-class of `torch.utils.data.Dataset`, with methods for:
  - preprocessing and building the vocabulary (text -> vocab look-up indices)
  - `__getitem__`, which must return a training example
  - `__len__`
  - generating train/test/valid splits
  - computing language model probabilities (i.e. `P(x)`, where `x` is an anno/code tensor)
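
The dataset requirements above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual class: the name `ToyAnnoCodeDataset`, whitespace tokenization, the `<pad>`/`<unk>` indices, and the `split_indices` helper are all assumptions made for the example. Computing `P(x)` is left out here because it needs a trained LM.

```python
import torch
from torch.utils.data import Dataset


class ToyAnnoCodeDataset(Dataset):
    """Hypothetical dataset of (annotation, code) string pairs."""

    PAD, UNK = 0, 1  # assumed special-token indices

    def __init__(self, pairs):
        # Preprocessing: plain whitespace tokenization (an assumption).
        self.pairs = [(anno.split(), code.split()) for anno, code in pairs]
        self.anno_vocab = self.build_vocab(a for a, _ in self.pairs)
        self.code_vocab = self.build_vocab(c for _, c in self.pairs)

    @classmethod
    def build_vocab(cls, token_lists):
        # Map each distinct token to the next free look-up index.
        vocab = {"<pad>": cls.PAD, "<unk>": cls.UNK}
        for tokens in token_lists:
            for tok in tokens:
                vocab.setdefault(tok, len(vocab))
        return vocab

    def encode(self, tokens, vocab):
        # text -> vocab look-up indices
        return torch.tensor([vocab.get(t, self.UNK) for t in tokens], dtype=torch.long)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # One training example: (anno tensor, code tensor).
        anno, code = self.pairs[idx]
        return self.encode(anno, self.anno_vocab), self.encode(code, self.code_vocab)

    def split_indices(self, valid_frac=0.1, test_frac=0.1, seed=0):
        # Shuffle example indices reproducibly and cut them into three parts.
        gen = torch.Generator().manual_seed(seed)
        perm = torch.randperm(len(self), generator=gen).tolist()
        n_valid = int(len(self) * valid_frac)
        n_test = int(len(self) * test_frac)
        n_train = len(self) - n_valid - n_test
        return {
            "train": perm[:n_train],
            "valid": perm[n_train:n_train + n_valid],
            "test": perm[n_train + n_valid:],
        }
```

The index lists returned by `split_indices` can be passed to `torch.utils.data.Subset` to obtain the three split datasets.
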
- Get train/test/valid splits for a dataset.
- Construct a configuration for the LM.
- For each kind (anno/code), train an LM and dump the model as `lm-{dataset_name}-{kind}.pt` (e.g. `lm-django-anno.pt`).
- Finally, using these models, compute `P(x)` for each `x` (anno/code tensor).
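
The LM steps above might look roughly like the sketch below. It rests on several assumptions: `TinyLM` is a hypothetical single-layer LSTM standing in for the real model, `train_lm` and `train_and_dump` are illustrative helpers, each sequence is assumed to start with a begin-of-sequence token, and `P(x)` is computed in log space via the chain rule, `log P(x) = sum_t log P(x_t | x_<t)`. Only the checkpoint naming scheme comes from the description above.

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLM(nn.Module):
    """Hypothetical stand-in LM; the real architecture may differ."""

    def __init__(self, vocab_size, emb_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):  # x: (batch, time) token indices
        hidden, _ = self.rnn(self.embed(x))
        return self.out(hidden)  # (batch, time, vocab) next-token logits


def checkpoint_name(dataset_name, kind):
    return f"lm-{dataset_name}-{kind}.pt"


def train_lm(model, sequences, epochs=5, lr=1e-2):
    # Next-token prediction, one sequence per step (no batching, for brevity).
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x in sequences:
            logits = model(x[:-1].unsqueeze(0)).squeeze(0)
            loss = F.cross_entropy(logits, x[1:])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model


def train_and_dump(dataset_name, data, out_dir, vocab_size):
    # data maps each kind ("anno" / "code") to a list of 1-D index tensors.
    paths = {}
    for kind, sequences in data.items():
        model = train_lm(TinyLM(vocab_size), sequences)
        paths[kind] = os.path.join(out_dir, checkpoint_name(dataset_name, kind))
        torch.save(model.state_dict(), paths[kind])
    return paths


@torch.no_grad()
def log_prob(model, x):
    # log P(x) for a 1-D tensor x whose first token is an assumed BOS marker.
    model.eval()
    logits = model(x[:-1].unsqueeze(0)).squeeze(0)
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(1, x[1:].unsqueeze(1)).squeeze(1)
    return token_log_probs.sum()
```

Working in log space avoids numerical underflow on long sequences; `log_prob(model, x).exp()` recovers `P(x)` when a plain probability is needed.
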
@article{wei2019code,
title={Code Generation as a Dual Task of Code Summarization},
author={Wei, Bolin and Li, Ge and Xia, Xin and Fu, Zhiyi and Jin, Zhi},
journal={arXiv preprint arXiv:1910.05923},
year={2019}
}