Sudden loss spike at ~0.4 epoch during training on 4×A100 — expected behavior? #19

@xXuHaiyang

Description

Hi,

Thank you for the great work!

While training the model with the exact setup from the README (4×A100), I observed a sudden loss spike around epoch 0.4: the loss jumped from ~0.45 to over 3.0 in a single step, and grad_norm increased sharply at the same point. See the training logs below for reference:

[Screenshot of training logs: loss and grad_norm curves showing the spike at ~0.4 epoch]

This looks like a sharp outlier compared to the otherwise smooth training curve before and after.

Questions:

  • Is this behavior expected or known during early training?
  • Could this be related to learning rate scheduling, gradient accumulation, or optimizer dynamics?
  • Any recommendations for mitigation, or is this something to be concerned about? (A rough sketch of what I had in mind is below.)
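
For the last question, the generic mitigation I had in mind is gradient clipping plus skipping any step whose pre-clip gradient norm is anomalous. Below is a minimal plain-PyTorch sketch; the model, data, and thresholds (`MAX_NORM`, `SKIP_THRESHOLD`) are placeholders of my own, not this repo's actual training loop:

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer/data so the sketch is self-contained.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(100)]

MAX_NORM = 1.0         # assumed clipping threshold
SKIP_THRESHOLD = 10.0  # assumed cutoff for discarding an anomalous step

for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # clip_grad_norm_ returns the total grad norm measured *before*
    # clipping, so it doubles as a spike detector.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_NORM)
    if total_norm > SKIP_THRESHOLD:
        # Drop the update entirely rather than let one bad batch
        # corrupt the weights.
        continue
    optimizer.step()
```

I realize the actual training setup here likely involves gradient accumulation and multi-GPU specifics, so this is only meant to illustrate the idea.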

Thanks in advance!
