
@wizeng23 (Contributor) commented Jun 4, 2025

Description

  • Switch training jobs from the GCP cluster to a Lambda cluster, because GCP's CUDA version (12.2) causes errors in certain CUDA ops needed for SSMs.
  • Update some package installs to follow the Falcon team's new recommendations. I had to install a specific dill version to prevent a dependency version conflict.
  • Switch the 0.5B training job to not use oumi distributed, to avoid a bug.

I've tested the 0.5B training and evaluation jobs to confirm they work.

Related issues

Fixes OPE-1314

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the Pull Request guidelines in the contributor guideline?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

@taenin (Collaborator) left a comment

I'd consider updating the eval jobs to use Lambda as well, for consistency. That way a user can run both training and eval without setting up a new cloud provider.

@wizeng23 wizeng23 merged commit 78bf10e into main Jun 5, 2025
4 checks passed
@wizeng23 wizeng23 deleted the wizeng/falcon-install-fix branch June 5, 2025 00:00
penfever pushed a commit that referenced this pull request Aug 27, 2025