Evaluation of PoET in distributed training mode

Reported by @3bsamad in #10 

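Currently, when PoET is trained in distributed mode, evaluation appears to use only the data assigned to a single GPU (GPU 1), i.e. 1/n of the dataset, where n is the number of GPUs. The reported metrics therefore reflect only that subset rather than the full evaluation set. A possible solution is to use Hugging Face Accelerate to collect the results from all processes before computing the metrics.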
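To make the problem concrete, here is a minimal pure-Python sketch of the effect. It does not use the PoET code. The strided split, the accuracy metric, and the helper names (`shard_indices`, `evaluate_shard`, `merge_results`) are all assumptions chosen for illustration. It simulates n processes that each evaluate their own shard, then compares the metric from one shard against the metric obtained after merging the per-process counts.

```python
# Hypothetical illustration only: the helper names, the strided split and the
# accuracy metric are assumptions, not taken from the PoET code base.

def shard_indices(dataset_size, world_size, rank):
    """Indices that one process sees under a simple strided split."""
    return list(range(rank, dataset_size, world_size))

def evaluate_shard(predictions, targets, indices):
    """Return (num_correct, num_seen) for the samples in one shard."""
    correct = sum(predictions[i] == targets[i] for i in indices)
    return correct, len(indices)

def merge_results(per_rank_results):
    """Combine per-process counts into one accuracy over all samples."""
    total_correct = sum(c for c, _ in per_rank_results)
    total_seen = sum(n for _, n in per_rank_results)
    return total_correct / total_seen, total_seen

# Toy data standing in for an evaluation set of 8 samples on 4 GPUs.
targets = [0, 1, 1, 0, 1, 0, 0, 1]
predictions = [0, 1, 0, 0, 1, 1, 0, 1]
world_size = 4

# Each simulated process evaluates only its own shard.
per_rank = [
    evaluate_shard(predictions, targets,
                   shard_indices(len(targets), world_size, rank))
    for rank in range(world_size)
]

# What a single process would report on its own (2 of 8 samples).
single_rank_accuracy = per_rank[0][0] / per_rank[0][1]

# What is reported once the counts of all processes are merged.
full_accuracy, samples_seen = merge_results(per_rank)

print(f"single process: {single_rank_accuracy:.2f} on {per_rank[0][1]} samples")
print(f"all processes:  {full_accuracy:.2f} on {samples_seen} samples")
```

In a real distributed run, the merge step needs a collective operation across processes. With Accelerate, `Accelerator.gather_for_metrics` is intended for this purpose, but how it would fit into PoET's evaluation loop has not been checked here.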