Hi,
Thank you for the great work!
While training the model with the exact setup from the README (4×A100), I observed a sudden loss spike around epoch 0.4: the loss jumped from ~0.45 to over 3.0 in a single step, and the grad_norm increased sharply at the same point. See the training logs below for reference:

This looks like a sharp outlier compared to the otherwise smooth training curve before and after.
Questions:
- Is this behavior expected or known during early training?
- Could this be related to learning rate scheduling, gradient accumulation, or optimizer dynamics?
- Any recommendations for mitigating it (e.g., gradient clipping, as in the sketch below), or is this not something to worry about?
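
For context on the mitigation part of the question, here is a minimal sketch of the kind of gradient clipping plus spike logging I had in mind (standard PyTorch; the model, data, and clip value are placeholders I made up and are not from this repo):

```python
import torch
import torch.nn as nn

# Placeholder model/data just to make the sketch runnable; the actual model,
# loss, and dataloader from the repo would go here instead.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
max_norm = 1.0  # assumed clip value, not taken from the README

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# clip_grad_norm_ returns the pre-clip total norm, so the same call both
# limits the update and lets us log when a spike would have occurred.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
if grad_norm > 10 * max_norm:
    print(f"grad_norm spike: {grad_norm:.2f} at loss {loss.item():.3f}")
optimizer.step()
```

Would something along these lines be appropriate here, or does the existing training config already handle this?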
Thanks in advance!