Benchmark.py script for v2.0 #85
Merged
119 commits
35ee576
Start of argparse code for benchmark.py
wvaske aad14c1
Start of argparse code for benchmark.py
wvaske 9789204
argument parsing matches original benchmark.sh
wvaske 64e730f
Pulled recent dlio changes. Can run training workloads mostly.
wvaske aa551d6
removing pycharm config files
wvaske 2633b26
Added some parameter validation
wvaske 367a2fb
Removed submodules and switched to requirements.txt files for dlio an…
wvaske 2147cb7
Added memory and cpu core count collection
wvaske 20be565
Added better logging for development and checking inputs against the …
wvaske 584e187
Removing readme changes from this PR
wvaske 8b059f7
Skeleton for vectordb bench
wvaske fdf7ba0
Refactoring to multiple src files
wvaske 95c3c26
Refactoring to multiple src files
wvaske a6c1169
Refactoring to multiple src files
wvaske 95595bd
added --debug option for Wes's sanity
wvaske 5c14cfe
Added separate configs for datagen as datagen is not accelerator spec…
wvaske 1da3a7c
Updated num-processes to pull from num-accelerators vs num-processes …
wvaske 903730e
Added option for mpi-bin and exec-type (only mpi supported now)
wvaske f3c5660
Localhost uses psutil instead of SSHing to the local IP to get cluste…
wvaske e58a804
Added more log levels and color handling
wvaske 635b0cc
Added accumulated cpu count and meminfo values
wvaske 0c1b13b
bugfix.
wvaske fc14191
Added ssh username
wvaske a0f9142
Added datasize
wvaske efe0d42
Reported memory is consistently bytes internally. Inputs from users a…
wvaske 9f6cc54
BUGFIX: executable was linking to local dlio directory instead of ins…
wvaske ad448e8
Merge branch 'mlcommons:main' into wvaske/python-migration
wvaske f10def2
Merge pull request #1 from wvaske/wvaske/python-migration
wvaske 8283415
Added requirement for psutil
wvaske d058662
Updated README.md with commands and help for benchmark.py
wvaske 2cf9dd0
Merge remote-tracking branch 'wvaske-origin/wvaske/python-migration' …
wvaske 693ce59
Cleaned cli to always have results_dir and remove client_host_mem_in_…
wvaske 945c1ff
Moved datasize calculation to rules.py
wvaske 7f31e76
Added default results_dir in default tempdir location
wvaske 02eaf60
Added generation of benchmark.py datagen command in the datasize func…
wvaske 50132df
Added pyproject.toml for installation
wvaske 05a8479
Added pyproject.toml for installation
wvaske cf8724f
Added pyproject.toml for installation
wvaske 7b9851d
Renamed package to mlpstorage
wvaske 8f106a9
Added What-If mode to see what would be executed
wvaske 7118741
Automatically find configs root directory
wvaske 70e7671
Added vdb configuration files. Default is 1 million vectors. 10m is 1…
wvaske a988619
Moved modules
wvaske 0de78c9
Read configs from the correct place
wvaske 2f0a26b
Merging VectorDB Benchmark Support
wvaske b1f7b0f
Updated pyproject.toml to pull vdbbench from github
wvaske 8ebfdc1
Merge branch 'wvaske/vdbbench' into wvaske/datagen_and_checkpointing
wvaske d83e545
Added command executor to capture output of running commands and prin…
wvaske d2fc08f
Added function to logging to apply debug and verbose options. Verbose…
wvaske e62b6d7
Updated vdb execution option to be "run_search" to distinguish betwee…
wvaske 436d3ec
Use the new CommandExecutor
wvaske 3018716
Moved directory naming to rules.py
wvaske 073a8b8
Moved param verification to rules.py
wvaske 19a1ece
Improved the code around verification to be clearer by returning whet…
wvaske 68ed311
Refactored benchmark classes for better isolation
wvaske 839ae0d
Update install
wvaske 7320ef8
Merge pull request #3 from wvaske/wvaske/datagen_and_checkpointing
wvaske fa53c98
Added general timer to Benchmark base class
wvaske 1fbddbb
Added cli option to force db creation and to adjust output frequency.
wvaske 582518d
Updated vectordb configs to have reasonable chunk_size
wvaske 58c93c4
Add power definitions
dslik f790e13
First draft of power requirements for system description PDF
dslik 6b27992
Fixed definition indentation
dslik f9dabfd
Clarified table requirements
dslik 4557c58
Fix section formatting
dslik 084f88b
Consistent capitalization, added hyphen
dslik a4075ac
Set vectordb as preview workload.
wvaske 2704ce3
Added exit codes and serialization of exec type. Added support for h…
wvaske 31bc28d
Added debug module that will drop to the debugger in the Benchmark cl…
wvaske d6a7664
Added history of mlpstorage commands with ability to rerun a command …
wvaske 00e9cf9
Added history of mlpstorage commands with ability to rerun a command …
wvaske 9ff5018
Moved dlio configs to align with dlio requirements
wvaske e270054
Added json encoder to capture metadata from Benchmark class
wvaske 0841f38
Added metadata collection
wvaske 559d8e6
Updated paths to "workload" for dlio
wvaske e670ddd
vdb is open only
wvaske cdb3c1c
Standardized collection of command outputs and generation of output c…
wvaske ac0424f
Merge pull request #5 from wvaske/wvaske/datagen_and_history
wvaske 26841d5
Merge pull request #4 from dslik/power-rules-updates
wvaske 2319cad
Merge pull request #6 from wvaske/main
wvaske 94e896f
Development tracker
wvaske 719d9ea
Merge pull request #7 from wvaske/wvaske/datagen_and_history
wvaske 92c8d63
Merge pull request #8 from wvaske/main
wvaske 6ce3e1f
Updated checkpoint (llama3) config files to latest from DLIO
wvaske 991bc2e
Updated checkpoint (llama3) config files to latest from DLIO
wvaske aecc7e8
Updated debug
wvaske c6d875d
Added LLM Checkpoint config options
wvaske 316ad67
Updated debug hook in main
wvaske d67afe6
Added checkpoint rules
wvaske 0a1b5d7
Checkpoint CLI
wvaske 4b3d69f
Split Training to Training & DLIO benchmark.
wvaske b16f229
Merge pull request #9 from wvaske/wvaske/checkpointing
wvaske fd75157
line endings?
wvaske adcb609
Merge remote-tracking branch 'wvaske-origin/wvaske/python-migration' …
wvaske 9d99579
handle logging options if no args
wvaske 0a9b7aa
Remove benchmark.sh
wvaske 328a049
Util funcs to flatten dictionaries and remove NaN values
wvaske d3998f1
Added report generator
wvaske 727bcea
Tracking command executed
wvaske f64e15d
Merge pull request #10 from wvaske/wvaske/bugfixes_and_reporting
wvaske 6c09ffd
BUGFIX: loops was overwriting results
wvaske fdc414a
Updated dlio benchmark dependency in pyproject.toml to pull from the …
wvaske 9ba0019
Merge pull request #12 from wvaske/wvaske/bugfixes_and_reporting
wvaske 8ac1fdc
llama3_1t had workflow to generate data and do training.
wvaske 39842be
BUGFIX: datasize wasn't generating a proper command with --data-dir
wvaske 64c362e
BUGFIX: Checkpoint wasn't returning command on add_workflow_to_cmd
wvaske e24f80e
BUGFIX: Log levels need to be uppercase
wvaske 721e0aa
BUGFIX: pyproject.toml didn't have package-data to copy workload yaml …
wvaske 1ce551d
Cleaned up closed validation for training. Added more log messages
wvaske 620adc7
BUGFIX: Datagen doesn't run the verifier
wvaske e6108da
Updated commands in the readme
wvaske 3df3d26
REFACTOR: renamed logging because it clashed
wvaske 8d56ab4
Added submission_checker script
wvaske 74f7666
Added example of what a result object looks like in docstring
wvaske 3673d5c
BUGFIX: mpirun vs mpiexec wasn't implemented.
wvaske 80fe51c
Added info to readme about building MPI and upgrading pip
wvaske 1199c82
Updated readme to link to mlcommons/storage:main
wvaske 7460dd6
Added logging of metadata file path.
wvaske 3d2d1b2
BUGFIX: Needed int() for num_files_train comparison
wvaske
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| # Rules Updates | ||
| | ||
| - [ ] Define filesystem caching rules in detail | ||
| - [ ] Define system json schema and creation process | ||
| - [ ] Define allowed time between runs | ||
| - [ ] Define rules that use local SSD for caching data | ||
| - [ ] Define rules for hyperconverged and local cache | ||
| | ||
| # Code Updates | ||
| - [ ] Configure datasize to collect the memory information from the hosts instead of getting a number of hosts for the calculation | ||
| | ||
| - [ ] Determine method to use cgroups for memory limitation in the benchmark script. | ||
| | ||
| - [x] Add a log block at the start of datagen & run that outputs all the params being used, to make clear what a run is. | ||
| | ||
| - [x] Remove accelerator type from datagen | ||
| - [x] datasize should output the datagen command to copy and paste | ||
| | ||
| - [ ] Add autosize parameter for run_benchmark and datasize | ||
| - [ ] for run it's just size of dataset based on memory capacity | ||
| - [ ] For datasize it needs an input of GB/s for the cluster and list of hosts | ||
| - | ||
| - [x] Keep a log of mlperfstorage commands executed in a mlperf.history file in results_dir | ||
| | ||
| - [ ] Add support for datagen to use subdirectories | ||
| - [x] Capture cluster information and write to a json document in outputdir. | ||
| - [ ] Figure out how to get all clients for milvus | ||
| | ||
| ## benchmark[.py | .sh] script | ||
| - [x] Unique names for files and directories with structure for benchmark, accelerator, count, run-sequence, run-number | ||
| - [x] Better installer that manages dependencies | ||
| - [ ] Containerization | ||
| - - [ ] Ease of Deployment of Benchmark (just get it working) | ||
| - - [ ] Cgroups and resource limits (better cache management) | ||
| - [ ] Flush Cache before a run | ||
| - [ ] Validate inputs for --closed runs (e.g., don’t allow runs against datasets that are too small) | ||
| - [ ] Reportgen should run validation against outputs | ||
| - [ ] Add better system.json creation to automate the system description for consistency | ||
| - - [ ] Add json schema checker for system documents that submitters create | ||
| - [ ] Automate execution of multiple runs | ||
| - [ ] ~~Add support for code changes in closed to supported categories [ data loader, s3 connector, etc]~~ | ||
| - - [ ] ~~Add patches directory that gets applied before execution~~ | ||
| - [ ] Add runtime estimation | ||
| - [x] Add --what-if or --dry-run flag | ||
| - [ ] Automate selection of minimum required dataset | ||
| - [ ] ~~Determine if batch sizes in MLPerf Training are representative of batch sizes for realistically sized datasets~~ | ||
| - [ ] Split system.json into automatically capturable (clients) and manual (storage) | ||
| - [ ] Define system.json schema and add schema checker to the tool for reportgen | ||
| - [ ] Add report-dir csv of results from tests as they are run | ||
| - [ ] Collect versions of all prerequisite packages for storage and dlio | ||
| | ||
| ## DLIO Improvements | ||
| - [ ] Reduce verbosity of logging | ||
| - [ ] Add callback handler for custom monitoring | ||
| - - [ ] SPECStorage uses a “PRIME_MON_SCRIPT” environment variable that will execute at different times | ||
| - - [ ] Checkpoint_bench uses RPC to call execution which can be wrapped externally | ||
| - [ ] Add support for DIRECTIO | ||
| - [ ] Add seed for dataset creation so that distribution of sizes is the same for all submitters (file 1 = mean + x bytes, file 2 = mean + y bytes, etc) | ||
| - [ ] Determine if global barrier for each batch matches industry behavior | ||
| | ||
| ## Results Presentation | ||
| - [ ] Better linking and presentation of system diagrams (add working links to system diagrams to supplementals) | ||
| - [ ] Define presentation and rules for hyperconverged or systems with local cache |
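One tracker item above asks for a seed on dataset creation so that every submitter gets the same distribution of file sizes (file 1 = mean + x bytes, file 2 = mean + y bytes, and so on). The sketch below shows roughly how that could work. It is not DLIO's implementation: the function name `seeded_file_sizes`, its parameters, the default seed, and the choice of a Gaussian distribution are all assumptions made for illustration.

```python
import random


def seeded_file_sizes(num_files, mean_bytes, stdev_bytes, seed=1234):
    """Return a reproducible list of file sizes in bytes.

    Hypothetical helper: the name, parameters, and Gaussian distribution
    are illustrative assumptions, not part of DLIO or mlpstorage.
    """
    # A private generator keeps the sequence independent of any other
    # use of the global random module state.
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_files):
        size = int(rng.gauss(mean_bytes, stdev_bytes))
        # Clamp so that no file is empty or negative-sized.
        sizes.append(max(1, size))
    return sizes


if __name__ == "__main__":
    first = seeded_file_sizes(5, 1_000_000, 50_000)
    second = seeded_file_sizes(5, 1_000_000, 50_000)
    # The same seed yields the same per-file sizes on every call.
    print(first == second)
```

Because the sizes depend only on the seed and the file index, two submitters running the same Python version with the same seed would generate files of identical sizes, which is the property the tracker item is after.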
2.2 Checkpointing and 2.3 Vector Database above are empty. Maybe at least add a TODO to TBD. Maybe mention they will be part of the 3.0 release...
Thanks. We'll remove vectorDB, and Huihuo needs to add his proposed checkpoint rules document.
Yes, will add the checkpointing section soon.