Hi,
Thank you for the great work!
While training the model with the exact setup from the README (4×A100), I observed a sudden loss spike around epoch 0.4: the loss jumped from ~0.45 to over 3.0 in a single step, and the grad_norm increased sharply at the same point. See the training logs below for reference:

This looks like a sharp outlier compared to the otherwise smooth training curve before and after.
Questions:
- Is this behavior expected or known during early training?
- Could this be related to learning rate scheduling, gradient accumulation, or optimizer dynamics?
- Any recommendations for mitigating it (e.g., gradient clipping, as in the sketch below), or is this not something to worry about?
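
For context on the mitigation part of the question, here is a minimal sketch of the kind of gradient clipping plus spike logging I had in mind (standard PyTorch; the model, data, and clip value are placeholders I made up and are not from this repo):

```python
import torch
import torch.nn as nn

# Placeholder model/data just to make the sketch runnable; the actual model,
# loss, and dataloader from the repo would go here instead.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
max_norm = 1.0  # assumed clip value, not taken from the README

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# clip_grad_norm_ returns the pre-clip total norm, so the same call both
# limits the update and lets us log when a spike would have occurred.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
if grad_norm > 10 * max_norm:
    print(f"grad_norm spike: {grad_norm:.2f} at loss {loss.item():.3f}")
optimizer.step()
```

Would something along these lines be appropriate here, or does the existing training config already handle this?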
Thanks in advance!