The reward on the validation set of the CALX example stays around 0.5 and does not improve.

Why does the reward on the validation set of my CALX example stay around 0.5 and fail to improve? I am using a 0.5B-parameter model. Could the model be too small, or is the reward too sparse? How can I improve it?

<img width="2792" height="1340" alt="Image" src="https://github.com/user-attachments/assets/d1fa0fcc-ef5d-427e-8a6a-f74564455a50" />