Benchmark.py script for v2.0 #85
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Since dlio now publishes to https://pypi.org/project/dlio-benchmark/2.0.0/, instead of [...] just do [...] wdyt? Much cleaner and more pythonic...
@zhenghh04, is the version of DLIO on pypi up-to-date with your changes? If not, can you rev the version to 2.1 or 2.0.1 and push a new version?
Have you considered encoding some of the CLI options the user gives for "datagen" into a config file such that those option values are then inherited (and not changeable) when they start doing runs? I'm a bit concerned that users will run datagen with some options and then run the benchmark with others (by accident). Certainly the number of accelerators needs to be variable, but total DRAM size?
Are we going to (collect and/or) validate the DRAM size given by asking the benchmark clients at runtime their DRAM size?
In any case, I'm approving the change...
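As a rough illustration of the inheritance idea above (not the actual benchmark.py code; the helper names, JSON file name, and option keys are all assumptions), a minimal sketch could look like:

```python
import json
from pathlib import Path

# Hypothetical helpers: persist the datagen options so a later "run" stage
# reuses them instead of trusting the user to retype them consistently.
def save_run_config(results_dir: str, options: dict) -> None:
    path = Path(results_dir) / "datagen_config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(options, indent=2))

def load_run_config(results_dir: str) -> dict:
    path = Path(results_dir) / "datagen_config.json"
    if not path.exists():
        raise FileNotFoundError(f"No datagen config found in {results_dir}; run datagen first")
    return json.loads(path.read_text())

# datagen stage: record the parameters that should stay fixed
save_run_config("/tmp/results", {"accelerator_type": "h100", "client_host_memory_in_gb": 512})

# run stage: inherit them rather than accepting new values from the CLI
locked = load_run_config("/tmp/results")
print(locked["client_host_memory_in_gb"])
```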
E.g., both "datagen" and "run" have "results_dir" as an option; IMHO those should be much more tightly tied together as a way to keep the user from fumbling it.
Did you consider supporting a single invocation of the script that does all the stages in one pass? Or at least run and report_gen. It's much more flexible to leave them all separate, but also more error-prone.
The argument is called "--results_dir", but we have the SUT where data is created (datagen), read back (run), and checkpoints written (run), and then we have where to store the logs and the output report files, which should not be on the SUT. IMHO, consider changing the argument names to "--SUT_path" and "--results_dir" to make this more obvious?
I've still approved the change....
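For illustration only, a minimal argparse sketch of the naming split suggested above; the flag spellings and help text are assumptions, not the actual benchmark.py options:

```python
import argparse

# Sketch of separating "where the SUT data lives" from "where results go".
parser = argparse.ArgumentParser(description="mlpstorage benchmark (sketch)")
parser.add_argument("--sut-path", required=True,
                    help="Path on the system under test where the dataset and checkpoints live")
parser.add_argument("--results-dir", required=True,
                    help="Path for logs and report output; should NOT be on the SUT")

args = parser.parse_args(["--sut-path", "/mnt/sut/data", "--results-dir", "/home/user/results"])
print(args.sut_path, args.results_dir)
```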
@FileSystemGuy @wvaske who can merge my #84? I can't even comment on that... sorry for polluting your PR...
Ah, I see you separated the SUT_dir from the results_dir (with better names) and are also on the path to record the operating config from one stage to the next. That's what I get for reviewing them in time order rather than all at once.
Nit: the help text includes "and data rates for each accelerator", which the user might confuse with the output number, which is a "data rate per accelerator".
Why does the datagen step care about the accelerator type? If it generates a different dataset based upon the accelerator type, then the user should not be allowed to "run" with a different accelerator type than the dataset was generated for. I didn't think datagen cared about accelerator type, so this may be moot. E.g., is it kosher to do an A100 "run" against the same datagen'ed set of files as an H100 "run"?
Ah, again I should have looked at everything before commenting...
The attributes "per_host_mem_kB" and "total_mem_kB" are in kilobytes, but the CLI args are in GB and the raw memory capacity pulled from the node info is in bytes. Would standardizing all memory capacity variables on GB (or MB) reduce the risk of confusion?
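A small sketch of what standardizing on one unit at the input boundary could look like; the function and variable names are illustrative, not taken from the benchmark code:

```python
# Normalize every memory figure to GiB as soon as it enters the code,
# so downstream logic never has to guess the unit.
BYTES_PER_GIB = 1024 ** 3

def gib_from_bytes(num_bytes: int) -> float:
    return num_bytes / BYTES_PER_GIB

def gib_from_kib(num_kib: int) -> float:
    return num_kib * 1024 / BYTES_PER_GIB

# CLI arg already in GB/GiB, /proc/meminfo-style value in kB, raw capacity in bytes:
cli_mem_gib = 512.0
meminfo_gib = gib_from_kib(527_000_000)
raw_gib = gib_from_bytes(549_755_813_888)
print(cli_mem_gib, round(meminfo_gib, 1), raw_gib)
```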
Wvaske/bugfixes and reporting
    cmd += " ++workload.workflow.generate_data=False ++workload.workflow.train=True"
    # Training doesn't do checkpoints
    cmd += " ++workload.workflow.checkpoint=False"
This needs to return the cmd variable.
Sometimes you find bugs and you wonder how it ever worked... I think this got caught in my refactor. Fixed in the latest commit
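For reference, a minimal sketch of the shape the fix implies, i.e. the builder returning the assembled cmd string; only the ++workload.workflow overrides come from the diff above, the function name and arguments are assumptions:

```python
def build_training_command(base_cmd: str, checkpointing: bool = False) -> str:
    # Append the DLIO override flags for a training run.
    cmd = base_cmd
    cmd += " ++workload.workflow.generate_data=False ++workload.workflow.train=True"
    if not checkpointing:
        # Training doesn't do checkpoints
        cmd += " ++workload.workflow.checkpoint=False"
    return cmd  # the missing return flagged in the review comment

print(build_training_command("dlio_benchmark workload=unet3d_h100"))
```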
    {name = "MLCommons Storage Working Group"}
]

dependencies = [
Uncomment dlio?
Since dlio hasn't done a release, using the requirements here from PyPI actually breaks things. I updated the pyproject.toml to pull from the mlperf_storage_v2.0 branch on the Argonne git repo, so the install should work correctly now.
Updated README with latest command structure. Added pyarrow dependency.
…mlperf_storage_v2.0 git branch
BUGFIX: loops were overwriting results
Checkpointing works. Please test and provide feedback. The report generated has a lot of extra information that helps to show how the test was run w.r.t. input files, args, params, and how they combine. I recommend pulling the CSV into Excel via Power Query so it's a data connection to a table, then creating a pivot table from the data table for analysis. I currently capture CPU and memory information with passwordless SSH. If that doesn't work for you, please let me know so I can update with a different methodology.
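If you'd rather stay in Python than Excel, a pandas sketch of the same pivot-table workflow could look like the following; the CSV path and column names are placeholders, not the actual report headers:

```python
import pandas as pd

# Load the generated report and pivot it for analysis, analogous to the
# Excel Power Query + pivot table workflow described above.
df = pd.read_csv("results/report.csv")
pivot = df.pivot_table(index="workload",
                       columns="num_accelerators",
                       values="train_throughput_samples_per_second",
                       aggfunc="mean")
print(pivot)
```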
BUGFIX: Workflow was not properly getting added to the command. BUGFIX: Subdirectories were not getting generated by the datagen process; added here. Might be a DLIO bug.
- A **sample** is the unit of data on which training is run, e.g., an image, or a sentence.
- A **step** is defined to be the first batch of data loaded into the (emulated) accelerator.
- **Accelerator Utilization (AU)** is defined as the percentage of time taken by the simulated accelerators, relative to the total benchmark running time. Higher is better.
- **Design power** is defined to be the minimum measurement of electrical power that must be capable of being supplied to a single or collection of power supply units (PSUs) in order to avoid violating regulatory and safety requirements. For individual PSUs, the design power equals the nameplate rated power. For groups of redundant PSUs, the design power is equal to the sum of the nameplate rated power of the minimum number of PSUs required to be simultaneously operational.
2.2 Checkpointing and 2.3 Vector Database above are empty. Maybe at least add a TODO or TBD, and maybe mention they will be part of the 3.0 release...
Thanks. We'll remove vectorDB, and Huihuo needs to add his proposed checkpoint rules document.
Yes, will add the checkpointing section soon.
Enhancement: Improved parsing of results. Updated new output locations to be more consistent between checkpointing and training. The reportgen is backwards compatible with existing runs.
[tool.setuptools]
packages = {find = {}}
I pulled the changes locally and tried running the benchmark, but encountered the following error related to missing YAML configuration files:
FileNotFoundError: Configuration file not found: /root/checkpoint/mlperf-storage/.venv/lib/python3.12/site-packages/configs/dlio/workload/unet3d_h100.yaml
It looks like the YAML files under configs/dlio/workload/ are not being included in the installed package. As a result, they're not accessible at runtime.
To fix this, you might need to update pyproject.toml under [tool.setuptools.package-data] like so:
[tool.setuptools.package-data]
"mlpstorage" = ["../configs/dlio/workload/*.yaml"]
After making this change and reinstalling with pip install ., the benchmark worked as expected on my end.
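A quick post-install sanity check along the same lines (the configs/dlio/workload layout is taken from the error path above; locating it relative to the installed mlpstorage package is an assumption):

```python
import importlib.util
from pathlib import Path

# Verify that the DLIO workload YAMLs landed in site-packages after `pip install .`.
spec = importlib.util.find_spec("mlpstorage")
if spec is None or spec.origin is None:
    raise SystemExit("mlpstorage is not installed")

site_packages = Path(spec.origin).resolve().parent.parent
workload_dir = site_packages / "configs" / "dlio" / "workload"
yamls = sorted(p.name for p in workload_dir.glob("*.yaml"))
print(f"{workload_dir}: {len(yamls)} workload configs found")
```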
Removed "run_#" folders
Added structure for validating a submission. Early preview of global config for params that don't change (host, memory, etc.)
Uses a SubmissionChecker class where we can add additional methods that start with "check_" and they will run automatically. TODO: Add checks. TODO: Add printing of performance tables.
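A minimal sketch of the auto-discovery pattern that commit describes; the class body and check names here are illustrative, not the actual mlpstorage SubmissionChecker:

```python
# Any method whose name starts with "check_" is discovered and run automatically.
class SubmissionChecker:
    def __init__(self, submission: dict):
        self.submission = submission

    def run_all(self) -> dict:
        results = {}
        for name in dir(self):
            if name.startswith("check_"):
                results[name] = getattr(self, name)()
        return results

    # Example checks (hypothetical): each returns True/False.
    def check_results_dir_exists(self) -> bool:
        return bool(self.submission.get("results_dir"))

    def check_metadata_present(self) -> bool:
        return "metadata" in self.submission

print(SubmissionChecker({"results_dir": "/tmp/r", "metadata": {}}).run_all())
```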
Added Python 3.10 or greater requirement to pyproject.toml. Added try/excepts to main to write a metadata file if an error occurs, to help with debugging. Updated the --params option to support --params P1=V1 P2=v2 or --params p1=v1 --params p2=v2.
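A small argparse sketch of how both --params forms could be accepted; only the flag name comes from the commit message, the merging helper is an assumption:

```python
import argparse

# action="append" with nargs="+" collects one list per --params occurrence,
# so "--params P1=V1 P2=V2" and repeated "--params p=v" flags both work.
parser = argparse.ArgumentParser()
parser.add_argument("--params", action="append", nargs="+", default=[])

def merge_params(groups):
    merged = {}
    for group in groups:
        for item in group:
            key, _, value = item.partition("=")
            merged[key] = value
    return merged

args = parser.parse_args(["--params", "P1=V1", "P2=V2", "--params", "p3=v3"])
print(merge_params(args.params))  # {'P1': 'V1', 'P2': 'V2', 'p3': 'v3'}
```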
Set name of .hydra directory to "dlio_config". Added messaging when num_files_train is greater than 10000.
elif self.benchmark.args.closed:
    self.logger.error(f'Number of processes ({num_procs}) should be exactly {LLM_SUBSET_PROCS} or {ClosedGPUs} in closed submission.')
    validations.add(PARAM_VALIDATION.INVALID)
elif not benchmark.args.closed:
The typo in your code is on this line: please add `self.` in front of `benchmark.args.closed`. It should be:
elif not self.benchmark.args.closed:
The benchmark script is being migrated from bash to Python for better integration with the results-checking scripts.