@al-rigazzi
Collaborator

This PR adds a Dragon-based launcher to SmartSim.

al-rigazzi and others added 15 commits April 4, 2024 15:21
This is the first prototype of the new Dragon-based launcher. Batch launch is not yet available for Dragon.

[ committed by @al-rigazzi @ankona @MattToast ]
[ reviewed by @MattToast @ankona @al-rigazzi ]

---------

Co-authored-by: Matt Drozt <[email protected]>
Co-authored-by: Christopher McBride <[email protected]>
1. ZMQ authenticators appear to have clashing inproc addresses when
using the `zmq.Context.instance()` factory method. Replaced as needed.
2. Updated the underlying `Dragon` library version, which included a
breaking change that required swapping `TemplateProcess` for
`ProcessTemplate`
3. Fixed incomplete permission set on curve key files
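
For illustration of item 3, here is a minimal sketch of tightening permissions on a directory of curve key files. The commit message does not say which permissions were missing, so the modes below are an assumption, and `restrict_key_permissions` is a hypothetical helper rather than SmartSim's actual code. The `.key` / `.key_secret` suffixes follow the naming that pyzmq's certificate helper uses.

```python
import stat
from pathlib import Path


def restrict_key_permissions(key_dir: Path) -> None:
    """Tighten permissions on a directory of curve key files (hypothetical helper)."""
    # Assumption: only the owner may list or enter the key directory.
    key_dir.chmod(stat.S_IRWXU)  # 0o700
    for key_file in key_dir.iterdir():
        if key_file.suffix == ".key_secret":
            # Secret keys: owner read/write only.
            key_file.chmod(stat.S_IRUSR | stat.S_IWUSR)  # 0o600
        elif key_file.suffix == ".key":
            # Public keys: owner read/write, everyone else read-only.
            key_file.chmod(
                stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH
            )  # 0o644
```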

[ committed by @ankona]
[ reviewed by @MattToast @al-rigazzi ]
## Fix a defect in retrieving status updates for the dragon launcher

Pre-dragon launchers used the task/step name to retrieve updates while
the dragon launcher uses the `task_id`. This fix ensures that the name
for dragon tasks is mapped appropriately.
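
The idea can be sketched as a small lookup that translates a step name into the identifier the launcher expects. This is only an illustration under stated assumptions: `TaskIdLookup` and its methods are hypothetical names, not the classes SmartSim uses.

```python
from typing import Dict


class TaskIdLookup:
    """Map step names to launcher task ids (hypothetical illustration)."""

    def __init__(self) -> None:
        self._name_to_task_id: Dict[str, str] = {}

    def add(self, step_name: str, task_id: str) -> None:
        # Record the id the launcher assigned when the step was started.
        self._name_to_task_id[step_name] = task_id

    def resolve(self, step_name: str) -> str:
        # Launchers that track updates by name have no mapping entry,
        # so fall back to the name itself.
        return self._name_to_task_id.get(step_name, step_name)


lookup = TaskIdLookup()
lookup.add("model-step", "task-0001")
print(lookup.resolve("model-step"))   # id used to query a task-id-based launcher
print(lookup.resolve("legacy-step"))  # unmapped names pass through unchanged
```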

[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
Reorder experiment startup to ensure telemetry monitor registers event
listeners prior to launching entities.
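
A minimal sketch of why the ordering matters, using a hypothetical in-process `EventBus` rather than SmartSim's telemetry monitor: a listener registered after entities are launched would miss their start events.

```python
from typing import Callable, List


class EventBus:
    """Tiny synchronous event bus (hypothetical stand-in for the real mechanism)."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def register(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def emit(self, event: str) -> None:
        # Only listeners registered before this call see the event.
        for listener in self._listeners:
            listener(event)


def start_experiment(bus: EventBus, entities: List[str], seen: List[str]) -> None:
    # Step 1: register the monitor's listener first.
    bus.register(seen.append)
    # Step 2: only then launch entities, so no start event is lost.
    for entity in entities:
        bus.emit(f"start:{entity}")
```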

[ committed by @ankona ]
[ approved by @MattToast ]
Update the dragon entrypoint to ensure that the log file is removed when
the environment is shut down.

Additional updates:
- minor refactor to enable testing entrypoint features
- add tests for entrypoint functions
- update incorrect license clause
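
One way to guarantee that kind of cleanup is a `try`/`finally` around the serving loop. The sketch below assumes a hypothetical `run_with_log_cleanup` wrapper and a caller-supplied `serve` callable; it is not the actual entrypoint code.

```python
import contextlib
from pathlib import Path
from typing import Callable


def run_with_log_cleanup(log_path: Path, serve: Callable[[], None]) -> None:
    """Run `serve` and remove `log_path` afterwards, even if `serve` raises."""
    log_path.touch()
    try:
        serve()
    finally:
        # The file may already be gone; ignore that case.
        with contextlib.suppress(FileNotFoundError):
            log_path.unlink()
```

Keeping the cleanup in a small function like this also makes it straightforward to test in isolation.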

[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
Add build option to `smart` CLI for installation of Dragon runtime.

### Additional Changes
- minor extract-method refactor to avoid `too-many-statements` linter
issue

### Expected Output

```bash
(ss39) mcbridch@hotlum-login:/lus/bnchlu1/mcbridch/ss> smart build --dragon
[SmartSim] INFO Running SmartSim build process...
[SmartSim] INFO Checking requested versions...
[SmartSim] INFO Checking for build tools...
[SmartSim] DEBUG Retrieved asset metadata: GitReleaseAsset(url="https://api.github.com/repos/DragonHPC/dragon/releases/assets/157545149")
[SmartSim] DEBUG Retrieved https://github.com/DragonHPC/dragon/releases/download/v0.8-beta/dragon-0.8-py3.9.4.1-CRAYEX-ac132fe95.tar.gz to /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Installing dragon from: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon/dragon-0.8/dragon-0.8-cp39-cp39-linux_x86_64.whl
[SmartSim] DEBUG Deleted asset directory: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Dragon installation complete
[SmartSim] INFO Redis build complete!

ML Backends Requested
╒════════════╤════════╤═══════╕
│ PyTorch    │ 2.0.1  │ True  │
│ TensorFlow │ 2.13.1 │ True  │
│ ONNX       │ 1.14.1 │ False │
╘════════════╧════════╧═══════╛

Building for GPU support: False

[SmartSim] INFO Building RedisAI version 1.2.7 from https://github.com/RedisAI/RedisAI.git/
[SmartSim] INFO ML Backends and RedisAI build complete!
[SmartSim] INFO Tensorflow, Torch backend(s) built
[SmartSim] INFO SmartSim build complete!
```

---------

Co-authored-by: Alyssa Cote <[email protected]>
Co-authored-by: amandarichardsonn <[email protected]>
Co-authored-by: Matt Drozt <[email protected]>

[ reviewed by @al-rigazzi @MattToast ]
[ committed by @ankona ]
This PR actually adds several things:
- stdout and stderr redirect of Dragon-launched processes
- `DragonBatchStep` with logic to keep track of batch jobs run through
SLURM and PBS
- additional environment variables in `CONFIG` to support launching
Dragon with options
- mitigation of the Authenticator's locking behavior
- a cooldown period in the `DragonBackend` so that the telemetry
monitor can receive updates before the backend shuts down
- the `DragonBackend` status is now a string representation of two
tables, one for hosts (indicating Free/Busy status) and one for
ProcessGroups (similar to standard WLM output)
- documentation was added for Dragon.
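
To illustrate the two-table status string described in the list above, here is a rough sketch. The column names, the `render_status` function, and the input shapes are all assumptions made for this example; the real `DragonBackend` output may look different.

```python
from typing import Dict, List, Sequence, Tuple


def format_table(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a plain-text table with left-aligned, padded columns."""
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    lines = [" | ".join(str(c).ljust(w) for c, w in zip(headers, widths))]
    lines.append("-+-".join("-" * w for w in widths))
    for row in rows:
        lines.append(" | ".join(str(c).ljust(w) for c, w in zip(row, widths)))
    return "\n".join(lines)


def render_status(
    hosts: Dict[str, bool], groups: List[Tuple[str, str, str]]
) -> str:
    """Build a status string: one table for hosts, one for process groups."""
    # hosts maps hostname -> True when busy (hypothetical representation).
    host_rows = [(name, "Busy" if busy else "Free") for name, busy in hosts.items()]
    host_table = format_table(("Host", "Status"), host_rows)
    # Each group is (step id, status, hosts), loosely mirroring WLM output.
    group_table = format_table(("Step", "Status", "Hosts"), groups)
    return f"{host_table}\n\n{group_table}"


print(render_status({"node1": True, "node2": False}, [("1", "RUNNING", "node1")]))
```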

---------

Co-authored-by: Matt Drozt <[email protected]>
Co-authored-by: Amanda Richardson <[email protected]>
@al-rigazzi requested review from ankona and mellis13 May 11, 2024 09:41
@al-rigazzi requested a review from ashao May 12, 2024 07:53
Contributor

@mellis13 left a comment

LGTM (after the tests finish passing). Lots of incredible work in this PR!

@al-rigazzi added the `type: feature`, `area: launcher`, `area: api`, and `area: Dragon` labels May 13, 2024
@al-rigazzi merged commit 8606e8e into develop May 13, 2024
@al-rigazzi deleted the dragon_launcher branch September 11, 2024 22:49
4 participants