Benchmark.py script for v2.0 #85
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Since dlio now publishes to https://pypi.org/project/dlio-benchmark/2.0.0/, instead of [...] just do [...] wdyt? Much cleaner and more pythonic...
@zhenghh04, is the version of DLIO on pypi up-to-date with your changes? If not, can you rev the version to 2.1 or 2.0.1 and push a new version?
Have you considered encoding some of the CLI options the user gives for "datagen" into a config file such that those option values are then inherited (and not changeable) when they start doing runs? I'm a bit concerned that users will run datagen with some options and then run the benchmark with others (by accident). Certainly the number of accelerators needs to be variable, but total DRAM size?
Are we going to (collect and/or) validate the DRAM size given by asking the benchmark clients at runtime their DRAM size?
In any case, I'm approving the change...
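As a rough illustration of the inheritance idea above (not the actual benchmark.py code; the helper names, JSON file name, and option keys are all assumptions), a minimal sketch could look like:

```python
import json
from pathlib import Path

# Hypothetical helpers: persist the datagen options so a later "run" stage
# reuses them instead of trusting the user to retype them consistently.
def save_run_config(results_dir: str, options: dict) -> None:
    path = Path(results_dir) / "datagen_config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(options, indent=2))

def load_run_config(results_dir: str) -> dict:
    path = Path(results_dir) / "datagen_config.json"
    if not path.exists():
        raise FileNotFoundError(f"No datagen config found in {results_dir}; run datagen first")
    return json.loads(path.read_text())

# datagen stage: record the parameters that should stay fixed
save_run_config("/tmp/results", {"accelerator_type": "h100", "client_host_memory_in_gb": 512})

# run stage: inherit them rather than accepting new values from the CLI
locked = load_run_config("/tmp/results")
print(locked["client_host_memory_in_gb"])
```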
E.g., both "datagen" and "run" have "results_dir" as an option; IMHO those should be much more tightly tied together as a way to keep the user from fumbling it.
Did you consider supporting a single invocation of the script that does all the stages in one pass? Or at least run and report_gen. It's much more flexible to leave them all separate, but also more error-prone.
The argument is called "--results_dir", but we have the SUT where data is created (datagen), read back (run), and checkpoints written (run), and then we have where to store the logs and the output report files, which should not be on the SUT. IMHO, consider changing the argument names to "--SUT_path" and "--results_dir" to make this more obvious?
I've still approved the change....
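For illustration only, a minimal argparse sketch of the naming split suggested above; the flag spellings and help text are assumptions, not the actual benchmark.py options:

```python
import argparse

# Sketch of separating "where the SUT data lives" from "where results go".
parser = argparse.ArgumentParser(description="mlpstorage benchmark (sketch)")
parser.add_argument("--sut-path", required=True,
                    help="Path on the system under test where the dataset and checkpoints live")
parser.add_argument("--results-dir", required=True,
                    help="Path for logs and report output; should NOT be on the SUT")

args = parser.parse_args(["--sut-path", "/mnt/sut/data", "--results-dir", "/home/user/results"])
print(args.sut_path, args.results_dir)
```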
@FileSystemGuy @wvaske who can merge my #84? I can't even comment on that... sorry for polluting your PR...
Ah, I see you separated the SUT_dir from the results_dir (with better names) and are also on the path to record the operating config from one stage to the next. That's what I get for reviewing them in time order rather than all at once.
Nit: the help text includes "and data rates for each accelerator", which the user might confuse with the output number, which is a "data rate per accelerator".
Why does the datagen step care about the accelerator type? If it generates a different dataset based upon the accelerator type, then the user should not be allowed to "run" with a different accelerator type than the dataset was generated for. I didn't think datagen cared about accelerator type, so this may be moot. E.g., is it kosher to do an A100 "run" against the same datagen'ed set of files as an H100 "run"?
Ah, again I should have looked at everything before commenting...
The attributes "per_host_mem_kB" and "total_mem_kB" are in kilobytes, but the CLI args are in GB and the raw memory capacity pulled from the node info is in bytes. Would standardizing all memory capacity variables on GB (or MB) reduce the risk of confusion?
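A small sketch of what standardizing on one unit at the input boundary could look like; the function and variable names are illustrative, not taken from the benchmark code:

```python
# Normalize every memory figure to GiB as soon as it enters the code,
# so downstream logic never has to guess the unit.
BYTES_PER_GIB = 1024 ** 3

def gib_from_bytes(num_bytes: int) -> float:
    return num_bytes / BYTES_PER_GIB

def gib_from_kib(num_kib: int) -> float:
    return num_kib * 1024 / BYTES_PER_GIB

# CLI arg already in GB/GiB, /proc/meminfo-style value in kB, raw capacity in bytes:
cli_mem_gib = 512.0
meminfo_gib = gib_from_kib(527_000_000)
raw_gib = gib_from_bytes(549_755_813_888)
print(cli_mem_gib, round(meminfo_gib, 1), raw_gib)
```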
Wvaske/bugfixes and reporting
    cmd += " ++workload.workflow.generate_data=False ++workload.workflow.train=True"
    # Training doesn't do checkpoints
    cmd += " ++workload.workflow.checkpoint=False"
This needs to return the cmd variable.
Sometimes you find bugs and you wonder how it ever worked... I think this got caught in my refactor. Fixed in the latest commit
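For reference, a minimal sketch of the shape the fix implies, i.e. the builder returning the assembled cmd string; only the ++workload.workflow overrides come from the diff above, the function name and arguments are assumptions:

```python
def build_training_command(base_cmd: str, checkpointing: bool = False) -> str:
    # Append the DLIO override flags for a training run.
    cmd = base_cmd
    cmd += " ++workload.workflow.generate_data=False ++workload.workflow.train=True"
    if not checkpointing:
        # Training doesn't do checkpoints
        cmd += " ++workload.workflow.checkpoint=False"
    return cmd  # the missing return flagged in the review comment

print(build_training_command("dlio_benchmark workload=unet3d_h100"))
```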
    {name = "MLCommons Storage Working Group"}
]

dependencies = [
Uncomment dlio?
Since dlio hasn't done a release, using the requirements here from PyPI actually breaks things. I updated the pyproject.toml to pull from the mlperf_storage_v2.0 branch on the Argonne git repo, so the install should work correctly now.
Updated README with latest command structure. Added pyarrow dependency.
…mlperf_storage_v2.0 git branch
BUGFIX: loops were overwriting results
Checkpointing works. Please test and provide feedback. The report generated has a lot of extra information that helps to show how the test was run w.r.t. input files, args, params, and how they combine. I recommend pulling the CSV into Excel via Power Query so it's a data connection to a table, then creating a pivot table from the data table for analysis. I currently capture CPU and memory information with passwordless SSH. If that doesn't work for you, please let me know so I can update with a different methodology.
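If you'd rather stay in Python than Excel, a pandas sketch of the same pivot-table workflow could look like the following; the CSV path and column names are placeholders, not the actual report headers:

```python
import pandas as pd

# Load the generated report and pivot it for analysis, analogous to the
# Excel Power Query + pivot table workflow described above.
df = pd.read_csv("results/report.csv")
pivot = df.pivot_table(index="workload",
                       columns="num_accelerators",
                       values="train_throughput_samples_per_second",
                       aggfunc="mean")
print(pivot)
```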
BUGFIX: Workflow was not properly getting added to the command. BUGFIX: Subdirectories were not getting generated by the datagen process; added here. Might be a DLIO bug.
- A **sample** is the unit of data on which training is run, e.g., an image, or a sentence.
- A **step** is defined to be the first batch of data loaded into the (emulated) accelerator.
- **Accelerator Utilization (AU)** is defined as the percentage of time taken by the simulated accelerators, relative to the total benchmark running time. Higher is better.
- **Design power** is defined to be the minimum measurement of electrical power that must be capable of being supplied to a single or collection of power supply units (PSUs) in order to avoid violating regulatory and safety requirements. For individual PSUs, the design power equals the nameplate rated power. For groups of redundant PSUs, the design power is equal to the sum of the nameplate rated power of the minimum number of PSUs required to be simultaneously operational.
2.2 Checkpointing and 2.3 Vector Database above are empty. Maybe at least add a TODO or TBD, and maybe mention they will be part of the 3.0 release...
Thanks. We'll remove vectorDB, and Huihuo needs to add his proposed checkpoint rules document.
Yes, will add the checkpointing section soon.
Enhancement: Improved parsing of results. Updated new output locations to be more consistent between checkpointing and training. The reportgen is backwards compatible with existing runs.
[tool.setuptools]
packages = {find = {}}
I pulled the changes locally and tried running the benchmark, but encountered the following error related to missing YAML configuration files:
FileNotFoundError: Configuration file not found: /root/checkpoint/mlperf-storage/.venv/lib/python3.12/site-packages/configs/dlio/workload/unet3d_h100.yaml
It looks like the YAML files under configs/dlio/workload/ are not being included in the installed package. As a result, they're not accessible at runtime.
To fix this, you might need to update pyproject.toml under [tool.setuptools.package-data] like so:
[tool.setuptools.package-data]
"mlpstorage" = ["../configs/dlio/workload/*.yaml"]
After making this change and reinstalling with pip install ., the benchmark worked as expected on my end.
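A quick post-install sanity check along the same lines (the configs/dlio/workload layout is taken from the error path above; locating it relative to the installed mlpstorage package is an assumption):

```python
import importlib.util
from pathlib import Path

# Verify that the DLIO workload YAMLs landed in site-packages after `pip install .`.
spec = importlib.util.find_spec("mlpstorage")
if spec is None or spec.origin is None:
    raise SystemExit("mlpstorage is not installed")

site_packages = Path(spec.origin).resolve().parent.parent
workload_dir = site_packages / "configs" / "dlio" / "workload"
yamls = sorted(p.name for p in workload_dir.glob("*.yaml"))
print(f"{workload_dir}: {len(yamls)} workload configs found")
```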
Removed "run_#" folders
Added structure for validating a submission. Early preview of global config for params that don't change (host, memory, etc.)
Uses a SubmissionChecker class where we can add additional methods that start with "check_" and they will run automatically. TODO: Add checks. TODO: Add printing of performance tables.
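A minimal sketch of the auto-discovery pattern that commit describes; the class body and check names here are illustrative, not the actual mlpstorage SubmissionChecker:

```python
# Any method whose name starts with "check_" is discovered and run automatically.
class SubmissionChecker:
    def __init__(self, submission: dict):
        self.submission = submission

    def run_all(self) -> dict:
        results = {}
        for name in dir(self):
            if name.startswith("check_"):
                results[name] = getattr(self, name)()
        return results

    # Example checks (hypothetical): each returns True/False.
    def check_results_dir_exists(self) -> bool:
        return bool(self.submission.get("results_dir"))

    def check_metadata_present(self) -> bool:
        return "metadata" in self.submission

print(SubmissionChecker({"results_dir": "/tmp/r", "metadata": {}}).run_all())
```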
Added Python 3.10 or greater requirement to pyproject.toml. Added try/excepts to main to write a metadata file if an error occurs, to help with debugging. Updated the --params option to support --params P1=V1 P2=v2 or --params p1=v1 --params p2=v2.
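A small argparse sketch of how both --params forms could be accepted; only the flag name comes from the commit message, the merging helper is an assumption:

```python
import argparse

# action="append" with nargs="+" collects one list per --params occurrence,
# so "--params P1=V1 P2=V2" and repeated "--params p=v" flags both work.
parser = argparse.ArgumentParser()
parser.add_argument("--params", action="append", nargs="+", default=[])

def merge_params(groups):
    merged = {}
    for group in groups:
        for item in group:
            key, _, value = item.partition("=")
            merged[key] = value
    return merged

args = parser.parse_args(["--params", "P1=V1", "P2=V2", "--params", "p3=v3"])
print(merge_params(args.params))  # {'P1': 'V1', 'P2': 'V2', 'p3': 'v3'}
```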
Set name of .hydra directory to "dlio_config". Added messaging when num_files_train is greater than 10000.
elif self.benchmark.args.closed:
    self.logger.error(f'Number of processes ({num_procs}) should be exactly {LLM_SUBSET_PROCS} or {ClosedGPUs} in closed submission.')
    validations.add(PARAM_VALIDATION.INVALID)
elif not benchmark.args.closed:
The typo in your code is on this line: please add `self.` in front of `benchmark.args.closed`. It should be:
elif not self.benchmark.args.closed:
The benchmark script is being migrated from bash to Python for better integration with the results-checking scripts.