
Conversation

@wvaske (Contributor) commented Feb 28, 2025

The benchmark script is being migrated from Bash to Python for better integration with the results-checking scripts.

  • Updated to the latest version of DLIO
  • Started updating the rules document
  • Separated config locations for training / checkpoint / vectordb

@wvaske requested a review from a team as a code owner February 28, 2025 15:46
@github-actions bot commented Feb 28, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@glimchb commented Feb 28, 2025

@wvaske

Since dlio now publishes to https://pypi.org/project/dlio-benchmark/2.0.0/,
we can completely remove the submodule from here. It is not needed anymore...

and instead of

pip3 install -r dlio_benchmark/requirements.txt

just do

pip3 install dlio-benchmark==2.0.0

wdyt? Much cleaner and more Pythonic...

@FileSystemGuy previously approved these changes Feb 28, 2025

@wvaske (Contributor, Author) commented Mar 4, 2025

@zhenghh04, is the version of DLIO on PyPI up to date with your changes? If not, can you rev the version to 2.1 or 2.0.1 and push a new version?

@FileSystemGuy (Contributor) left a comment

Have you considered encoding some of the CLI options the user gives for "datagen" into a config file such that those option values are then inherited (and not changeable) when they start doing runs? I'm a bit concerned that users will run datagen with some options and then run the benchmark with others (by accident). Certainly the number of accelerators needs to be variable, but total DRAM size?
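
A minimal sketch of what that inheritance could look like (the file name, keys, and function names here are hypothetical, not the actual mlpstorage code):

```python
import json
from pathlib import Path

DATAGEN_CONFIG = "datagen_config.json"  # hypothetical file written at datagen time

def save_datagen_config(sut_path, num_accelerators, total_dram_gb):
    # Record the option values datagen was run with so later stages inherit them
    cfg = {"num_accelerators": num_accelerators, "total_dram_gb": total_dram_gb}
    Path(sut_path, DATAGEN_CONFIG).write_text(json.dumps(cfg))

def load_locked_options(sut_path):
    # "run" reads these back and treats them as authoritative, not overridable via CLI
    return json.loads(Path(sut_path, DATAGEN_CONFIG).read_text())
```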

Are we going to (collect and/or) validate the DRAM size given, by asking the benchmark clients for their DRAM size at runtime?

In any case, I'm approving the change...

@FileSystemGuy (Contributor) left a comment

E.g., both "datagen" and "run" have "results_dir" as an option; IMHO those should be much more tied together as a way to keep the user from fumbling it.

Did you consider supporting a single invocation of the script that did all the stages in one pass? Or at least run and report_gen. It's much more flexible to leave them all separate, but also more error-prone.

The argument is called "--results_dir", but we have the SUT where data is created (datagen), read back (run), and checkpoints are written (run), and then we have where to store the logs and the output report files, which should not be on the SUT. IMHO, consider changing the argument names to "--SUT_path" and "--results_dir" to make this more obvious?
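
A rough illustration of that renaming with argparse (the flag names follow the suggestion above; this is not the script's current interface):

```python
import argparse

parser = argparse.ArgumentParser(description="Illustrative split of SUT path vs. results dir")
# Where datasets and checkpoints live on the system under test
parser.add_argument("--sut-path", required=True,
                    help="Path on the SUT where datagen writes and run reads/writes data")
# Where logs and report files go; should not be on the SUT
parser.add_argument("--results-dir", required=True,
                    help="Local directory for logs and generated reports")

args = parser.parse_args(["--sut-path", "/mnt/sut", "--results-dir", "./results"])
print(args.sut_path, args.results_dir)
```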

I've still approved the change....

@glimchb commented Mar 25, 2025

@FileSystemGuy @wvaske Who can merge my #84? I can't even comment on it... sorry for polluting your PR...

@FileSystemGuy (Contributor) left a comment

Ah, I see you separated the SUT_dir from the results_dir (with better names) and are also on the path to record the operating config from one stage to the next. That's what I get for reviewing them in time order rather than all at once.

Nit: the help text includes "and data rates for each accelerator", which the user might confuse with the output number, which is a "data rate per accelerator".

Why does the datagen step care about the accelerator type? If it generates a different dataset based upon the accelerator type, then the user should not be allowed to "run" with a different accelerator type than the dataset was generated for. I didn't think datagen cared about accelerator type, so this may be moot. E.g., is it kosher to do an A100 "run" against the same datagen'ed set of files as an H100 "run"?

@FileSystemGuy (Contributor) left a comment

Ah, again I should have looked at everything before commenting...

@FileSystemGuy (Contributor) left a comment

The attributes "per_host_mem_kB" and "total_mem_kB" are in kilobytes, but the CLI args are in GB and the raw memory capacity pulled from the node info is in bytes. Would standardizing all memory-capacity variables on GB (or MB) reduce the risk of confusion?
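
A small sketch of the kind of standardization being suggested (the helper names are hypothetical):

```python
BYTES_PER_GB = 1024 ** 3
KB_PER_GB = 1024 ** 2

def mem_gb_from_bytes(mem_bytes: int) -> float:
    # Raw capacity pulled from the node info is in bytes
    return mem_bytes / BYTES_PER_GB

def mem_gb_from_kb(mem_kb: int) -> float:
    # per_host_mem_kB / total_mem_kB attributes are in kilobytes
    return mem_kb / KB_PER_GB

print(round(mem_gb_from_kb(527_998_468), 1))  # ~503.5 GB
```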

cmd += " ++workload.workflow.generate_data=False ++workload.workflow.train=True"

# Training doesn't do checkpoints
cmd += " ++workload.workflow.checkpoint=False"

needs to return cmd variable

@wvaske (Contributor, Author) replied:

Sometimes you find bugs and you wonder how it ever worked... I think this got caught in my refactor. Fixed in the latest commit.
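
For reference, a minimal sketch of the shape of the fix (the function name and signature are hypothetical, not the actual code):

```python
def build_training_command(base_cmd: str) -> str:
    cmd = base_cmd
    cmd += " ++workload.workflow.generate_data=False ++workload.workflow.train=True"
    # Training doesn't do checkpoints
    cmd += " ++workload.workflow.checkpoint=False"
    return cmd  # without this return, the caller silently gets None
```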

{name = "MLCommons Storage Working Group"}
]

dependencies = [

Uncomment dlio?

@wvaske (Contributor, Author) replied:

Since dlio hasn't done a release, using the requirements here from PyPI actually breaks things. I updated the pyproject.toml to pull from the mlperf_storage_v2.0 branch on the Argonne git repo, so the install should work correctly now.

wvaske and others added 3 commits May 2, 2025 18:09
Updated README with latest command structure
Added pyarrow dependency
@wvaske (Contributor, Author) commented May 3, 2025

Checkpointing works.
Check out the history function: "mlpstorage history show"
Added a report generator: "mlpstorage reports reportgen"

Please test and provide feedback. The generated report has a lot of extra information that helps show how the test was run with respect to input files, args, params, and how they combine.

I recommend pulling the CSV into Excel via Power Query so it's a data connection to a table. Then create a pivot table from the data table for doing analysis.
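
If you'd rather stay in Python, a rough equivalent with pandas (the CSV path and column names below are hypothetical):

```python
import pandas as pd

# Load the reportgen CSV (path is hypothetical)
df = pd.read_csv("results/report.csv")

# Rough pivot-table equivalent: one row per workload, one column per accelerator
# type, averaging a throughput-style metric (column names are hypothetical)
pivot = pd.pivot_table(df, index="workload", columns="accelerator_type",
                       values="throughput", aggfunc="mean")
print(pivot)
```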

I currently capture CPU and memory information with passwordless SSH. If that doesn't work for you, please let me know so I can update with a different methodology.
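
For reference, a sketch of the kind of passwordless-SSH collection described (the host name and command are illustrative, not the actual mlpstorage code):

```python
import subprocess

def collect_host_mem_kb(host: str) -> int:
    # Requires passwordless SSH to `host`; returns MemTotal (in kB) from /proc/meminfo
    out = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "grep MemTotal /proc/meminfo"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.split()[1])

# Example usage (hypothetical host name):
# print(collect_host_mem_kb("client-01"))
```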

wvaske added 2 commits May 2, 2025 18:28
BUGFIX: Workflow was not properly getting added to the command
BUGFIX: Subdirectories were not getting generated by datagen process. Added here. Might be DLIO bug
- A **sample** is the unit of data on which training is run, e.g., an image, or a sentence.
- A **step** is defined to be the first batch of data loaded into the (emulated) accelerator.
- **Accelerator Utilization (AU)** is defined as the percentage of time taken by the simulated accelerators, relative to the total benchmark running time. Higher is better.
- **Design power** is defined to be the minimum measurement of electrical power that must be capable of being supplied to a single or collection of power supply units (PSUs) in order to avoid violating regulatory and safety requirements. For individual PSUs, the design power equals the nameplate rated power. For groups of redundant PSUs, the design power is equal to the sum of the nameplate rated power of the minimum number of PSUs required to be simultaneously operational.

2.2 Checkpointing and 2.3 Vector Database above are empty. Maybe at least add a TODO or TBD, and maybe mention that they will be part of the 3.0 release...

@wvaske (Contributor, Author) replied:

Thanks. We'll remove VectorDB, and Huihuo needs to add his proposed checkpoint rules document.

A contributor replied:

Yes, will add the checkpointing section soon.

wvaske added 2 commits May 5, 2025 15:14
Enhancement: Improved parsing of results
Updated new output locations to be more consistent between checkpointing and training. The reportgen is backwards compatible with existing runs.

[tool.setuptools]
packages = {find = {}}

A contributor left a comment:

I pulled the changes locally and tried running the benchmark, but encountered the following error related to missing YAML configuration files:

FileNotFoundError: Configuration file not found: /root/checkpoint/mlperf-storage/.venv/lib/python3.12/site-packages/configs/dlio/workload/unet3d_h100.yaml

It looks like the YAML files under configs/dlio/workload/ are not being included in the installed package. As a result, they're not accessible at runtime.

To fix this, you might need to update pyproject.toml under [tool.setuptools.package-data] like so:

[tool.setuptools.package-data]
"mlpstorage" = ["../configs/dlio/workload/*.yaml"]

After making this change and reinstalling with pip install ., the benchmark worked as expected on my end.
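
A quick sanity check after reinstalling (assumes a regular, non-zipped install; the path mirrors the error above):

```python
import os
import mlpstorage

# After the package-data fix, the workload YAMLs should sit under site-packages/configs/
config = os.path.join(os.path.dirname(mlpstorage.__file__), "..",
                      "configs", "dlio", "workload", "unet3d_h100.yaml")
print(os.path.exists(config))  # expect True after `pip install .`
```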

wvaske added 12 commits May 7, 2025 08:42
Added structure for validating a submission
Early preview of global config for params that don't change (host, memory, etc)
Uses a SubmissionChecker class where we can add additional methods that start with "check_" and they will run automatically.
TODO: Add checks
TODO: Add print of performance tables
Added python 3.10 or greater requirement to pyproject.toml
Added try/excepts to main to try and write metadata file if error occurs to help with debugging
Updated the --params option to support --params P1=V1 P2=V2 or --params p1=v1 --params p2=v2 (see the sketch after this list)
Set name of .hydra directory to "dlio_config"
Added messaging when num_files_train is greater than 10000
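
A minimal sketch of how both --params forms can be handled with argparse (the parsing helper is illustrative, not the actual code):

```python
import argparse

parser = argparse.ArgumentParser()
# action="append" allows repeating --params; nargs="+" allows several k=v per flag
parser.add_argument("--params", action="append", nargs="+", default=[])

def parse_params(groups):
    # Flatten [["p1=v1", "p2=v2"], ["p3=v3"]] into {"p1": "v1", "p2": "v2", "p3": "v3"}
    return dict(item.split("=", 1) for group in groups for item in group)

args = parser.parse_args(["--params", "p1=v1", "p2=v2", "--params", "p3=v3"])
print(parse_params(args.params))  # {'p1': 'v1', 'p2': 'v2', 'p3': 'v3'}
```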
elif self.benchmark.args.closed:
    self.logger.error(f'Number of processes ({num_procs}) should be exactly {LLM_SUBSET_PROCS} or {ClosedGPUs} in closed submission.')
    validations.add(PARAM_VALIDATION.INVALID)
elif not benchmark.args.closed:
A contributor left a comment:

The typo in your code is on this line. Please add self. in front of benchmark.args.closed.

It should be:
elif not self.benchmark.args.closed:

@johnugeorge merged commit cb7fa48 into mlcommons:main May 15, 2025
1 check passed
@github-actions bot locked and limited conversation to collaborators May 15, 2025
@wvaske deleted the wvaske/python-migration branch May 22, 2025 17:06