Merged
Changes from all commits
119 commits
35ee576
Start of argparse code for benchmark.py
wvaske Oct 3, 2024
aad14c1
Start of argparse code for benchmark.py
wvaske Oct 3, 2024
9789204
argument parsing matches original benchmark.sh
wvaske Nov 21, 2024
64e730f
Pulled recent dlio changes. Can run training workloads mostly.
wvaske Feb 28, 2025
aa551d6
removing pycharm config files
wvaske Feb 28, 2025
2633b26
Added some parameter validation
wvaske Mar 4, 2025
367a2fb
Removed submodules and switched to requirements.txt files for dlio an…
wvaske Mar 4, 2025
2147cb7
Added memory and cpu core count collection
wvaske Mar 5, 2025
20be565
Added better logging for development and checking inputs against the …
wvaske Mar 7, 2025
584e187
Removing readme changes from this PR
wvaske Mar 7, 2025
8b059f7
Skeleton for vectordb bench
wvaske Mar 15, 2025
fdf7ba0
Refactoring to multiple src files
wvaske Mar 25, 2025
95c3c26
Refactoring to multiple src files
wvaske Mar 26, 2025
a6c1169
Refactoring to multiple src files
wvaske Mar 27, 2025
95595bd
added --debug option for Wes's sanity
wvaske Mar 27, 2025
5c14cfe
Added separate configs for datagen as datagen is not accelerator spec…
wvaske Mar 28, 2025
1da3a7c
Updated num-processes to pull from num-accelerators vs num-processes …
wvaske Mar 28, 2025
903730e
Added option for mpi-bin and exec-type (only mpi supported now)
wvaske Mar 28, 2025
f3c5660
Localhost uses psutil instead of SSHing to the local IP to get cluste…
wvaske Mar 28, 2025
e58a804
Added more log levels and color handling
wvaske Mar 28, 2025
635b0cc
Added accumulated cpu count and meminfo values
wvaske Mar 28, 2025
0c1b13b
bugfix.
wvaske Mar 28, 2025
fc14191
Added ssh username
wvaske Mar 28, 2025
a0f9142
Added datasize
wvaske Mar 28, 2025
efe0d42
Reported memory is consistently bytes internally. Inputs from users a…
wvaske Mar 28, 2025
9f6cc54
BUGFIX: executable was linking to local dlio directory instead of ins…
wvaske Mar 31, 2025
ad448e8
Merge branch 'mlcommons:main' into wvaske/python-migration
wvaske Apr 4, 2025
f10def2
Merge pull request #1 from wvaske/wvaske/python-migration
wvaske Apr 4, 2025
8283415
Added requirement for psutil
wvaske Apr 4, 2025
d058662
Updated README.md with commands and help for benchmark.py
wvaske Apr 4, 2025
2cf9dd0
Merge remote-tracking branch 'wvaske-origin/wvaske/python-migration' …
wvaske Apr 4, 2025
693ce59
Cleaned cli to always have results_dir and remove client_host_mem_in_…
wvaske Apr 4, 2025
945c1ff
Moved datasize calculation to rules.py
wvaske Apr 4, 2025
7f31e76
Added default results_dir in default tempdir location
wvaske Apr 4, 2025
02eaf60
Added generation of benchmark.py datagen command in the datasize func…
wvaske Apr 4, 2025
50132df
Added pyproject.toml for installation
wvaske Apr 16, 2025
05a8479
Added pyproject.toml for installation
wvaske Apr 16, 2025
cf8724f
Added pyproject.toml for installation
wvaske Apr 16, 2025
7b9851d
Renamed package to mlpstorage
wvaske Apr 16, 2025
8f106a9
Added What-If mode to see what would be executed
wvaske Apr 16, 2025
7118741
Automatically find configs root directory
wvaske Apr 16, 2025
70e7671
Added vdb configuration files. Default is 1 million vectors. 10m is 1…
wvaske Apr 16, 2025
a988619
Moved modules
wvaske Apr 16, 2025
0de78c9
Read configs from the correct place
wvaske Apr 16, 2025
2f0a26b
Merging VectorDB Benchmark Support
wvaske Apr 16, 2025
b1f7b0f
Updated pyproject.toml to pull vdbbench from github
wvaske Apr 16, 2025
8ebfdc1
Merge branch 'wvaske/vdbbench' into wvaske/datagen_and_checkpointing
wvaske Apr 16, 2025
d83e545
Added command executor to capture output of running commands and prin…
wvaske Apr 17, 2025
d2fc08f
Added function to logging to apply debug and verbose options. Verbose…
wvaske Apr 17, 2025
e62b6d7
Updated vdb execution option to be "run_search" to distinguish betwee…
wvaske Apr 17, 2025
436d3ec
Use the new CommandExecutor
wvaske Apr 17, 2025
3018716
Moved directory naming to rules.py
wvaske Apr 17, 2025
073a8b8
Moved param verification to rules.py
wvaske Apr 17, 2025
19a1ece
Improved the code around verification to be clearer by returning whet…
wvaske Apr 17, 2025
68ed311
Refactored benchmark classes for better isolation
wvaske Apr 18, 2025
839ae0d
Update install
wvaske Apr 18, 2025
7320ef8
Merge pull request #3 from wvaske/wvaske/datagen_and_checkpointing
wvaske Apr 18, 2025
fa53c98
Added general timer to Benchmark base class
wvaske Apr 22, 2025
1fbddbb
Added cli option to force db creation and to adjust output frequency.
wvaske Apr 22, 2025
582518d
Updated vectordb configs to have reasonable chunk_size
wvaske Apr 22, 2025
58c93c4
Add power definitions
dslik Apr 22, 2025
f790e13
First draft of power requirements for system description PDF
dslik Apr 22, 2025
6b27992
Fixed definition indentation
dslik Apr 22, 2025
f9dabfd
Clarified table requirements
dslik Apr 22, 2025
4557c58
Fix section formatting
dslik Apr 22, 2025
084f88b
Consistent capitalization, added hyphen
dslik Apr 22, 2025
a4075ac
Set vectordb as preview workload.
wvaske Apr 24, 2025
2704ce3
Added exit codes and serialization of exec type. Added support for h…
wvaske Apr 24, 2025
31bc28d
Added debug module that will drop to the debugger in the Benchmark cl…
wvaske Apr 24, 2025
d6a7664
Added history of mlpstorage commands with ability to rerun a command …
wvaske Apr 24, 2025
00e9cf9
Added history of mlpstorage commands with ability to rerun a command …
wvaske Apr 24, 2025
9ff5018
Moved dlio configs to align with dlio requirements
wvaske Apr 24, 2025
e270054
Added json encoder to capture metadata from Benchmark class
wvaske Apr 24, 2025
0841f38
Added metadata collection
wvaske Apr 24, 2025
559d8e6
Updated paths to "workload" for dlio
wvaske Apr 24, 2025
e670ddd
vdb is open only
wvaske Apr 24, 2025
cdb3c1c
Standardized collection of command outputs and generation of output c…
wvaske Apr 24, 2025
ac0424f
Merge pull request #5 from wvaske/wvaske/datagen_and_history
wvaske Apr 24, 2025
26841d5
Merge pull request #4 from dslik/power-rules-updates
wvaske Apr 24, 2025
2319cad
Merge pull request #6 from wvaske/main
wvaske Apr 24, 2025
94e896f
Development tracker
wvaske Apr 24, 2025
719d9ea
Merge pull request #7 from wvaske/wvaske/datagen_and_history
wvaske Apr 24, 2025
92c8d63
Merge pull request #8 from wvaske/main
wvaske Apr 24, 2025
6ce3e1f
Updated checkpoint (llama3) config files to latest from DLIO
wvaske May 2, 2025
991bc2e
Updated checkpoint (llama3) config files to latest from DLIO
wvaske May 2, 2025
aecc7e8
Updated debug
wvaske May 2, 2025
c6d875d
Added LLM Checkpoint config options
wvaske May 2, 2025
316ad67
Updated debug hook in main
wvaske May 2, 2025
d67afe6
Added checkpoint rules
wvaske May 2, 2025
0a1b5d7
Checkpoint CLI
wvaske May 2, 2025
4b3d69f
Split Training to Training & DLIO benchmark.
wvaske May 2, 2025
b16f229
Merge pull request #9 from wvaske/wvaske/checkpointing
wvaske May 2, 2025
fd75157
line endings?
wvaske May 2, 2025
adcb609
Merge remote-tracking branch 'wvaske-origin/wvaske/python-migration' …
wvaske May 2, 2025
9d99579
handle logging options if no args
wvaske May 2, 2025
0a9b7aa
Remove benchmark.sh
wvaske May 2, 2025
328a049
Util funcs to flatten dictionaries and remove NaN values
wvaske May 2, 2025
d3998f1
Added report generator
wvaske May 2, 2025
727bcea
Tracking command executed
wvaske May 2, 2025
f64e15d
Merge pull request #10 from wvaske/wvaske/bugfixes_and_reporting
wvaske May 2, 2025
6c09ffd
BUGFIX: loops was overwriting results
wvaske May 3, 2025
fdc414a
Updated dlio benchmark dependency in pyproject.toml to pull from the …
wvaske May 3, 2025
9ba0019
Merge pull request #12 from wvaske/wvaske/bugfixes_and_reporting
wvaske May 3, 2025
8ac1fdc
llama3_1t had workflow to generate data and do training.
wvaske May 3, 2025
39842be
BUGFIX: datasize wasn't generating a proper command with --data-dir
wvaske May 5, 2025
64c362e
BUGFIX: Checkpoint wasn't returning command on add_workflow_to_cmd
wvaske May 5, 2025
e24f80e
BUGFIX: Log levels need to be uppercase
wvaske May 6, 2025
721e0aa
BUGFIX: pyproject.toml didn't have package-data to copy workload yaml …
wvaske May 7, 2025
1ce551d
Cleaned up closed validation for training. Added more log messages
wvaske May 7, 2025
620adc7
BUGFIX: Datagen doesn't run the verifier
wvaske May 7, 2025
e6108da
Updated commands in the readme
wvaske May 7, 2025
3df3d26
REFACTOR: renamed logging because it clashed
wvaske May 7, 2025
8d56ab4
Added submission_checker script
wvaske May 7, 2025
74f7666
Added example of what a result object looks like in docstring
wvaske May 7, 2025
3673d5c
BUGFIX: mpirun vs mpiexec wasn't implemented.
wvaske May 8, 2025
80fe51c
Added info to readme about building MPI and upgrading pip
wvaske May 8, 2025
1199c82
Updated readme to link to mlcommons/storage:main
wvaske May 9, 2025
7460dd6
Added logging of metadata file path.
wvaske May 9, 2025
3d2d1b2
BUGFIX: Needed int() for num_files_train comparison
wvaske May 12, 2025
3 changes: 0 additions & 3 deletions .gitmodules

This file was deleted.

63 changes: 63 additions & 0 deletions DEVELOPMENT.md
@@ -0,0 +1,63 @@
# Rules Updates

- [ ] Define filesystem caching rules in detail
- [ ] Define system json schema and creation process
- [ ] Define allowed time between runs
- [ ] Define rules that use local SSD for caching data
- [ ] Define rules for hyperconverged and local cache

# Code Updates
- [ ] Configure datasize to collect the memory information from the hosts instead of getting a number of hosts for the calculation (see the sketch after this list)

- [ ] Determine method to use cgroups for memory limitation in the benchmark script.

- [x] Add a log block at the start of datagen & run that outputs all the params being used, to be clear on what a run is.

- [x] Remove accelerator type from datagen
- [x] datasize should output the datagen command to copy and paste

- [ ] Add autosize parameter for run_benchmark and datasize
  - [ ] For run it's just the size of the dataset based on memory capacity
  - [ ] For datasize it needs an input of GB/s for the cluster and a list of hosts
- [x] Keep a log of mlperfstorage commands executed in a mlperf.history file in results_dir

- [ ] Add support for datagen to use subdirectories
- [x] Capture cluster information and write to a json document in outputdir.
- [ ] Figure out how to get all clients for milvus
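
The datasize item above depends on collecting memory and CPU counts from the client hosts. Below is a minimal sketch of the localhost path, assuming psutil (which the commit history mentions using instead of SSHing to the local IP); the function and field names are hypothetical, not the actual mlpstorage code:

```python
# Hypothetical sketch: gather local-host CPU and memory info with psutil and
# write it to a JSON document, as the checklist items above describe. The
# field names are illustrative, not the actual mlpstorage schema.
import json
import psutil

def collect_local_cluster_info() -> dict:
    mem = psutil.virtual_memory()
    return {
        "cpu_cores_physical": psutil.cpu_count(logical=False),
        "cpu_cores_logical": psutil.cpu_count(logical=True),
        "mem_total_bytes": mem.total,          # memory kept in bytes internally
        "mem_available_bytes": mem.available,
    }

if __name__ == "__main__":
    with open("cluster_info.json", "w") as f:
        json.dump(collect_local_cluster_info(), f, indent=2)
```

Remote hosts would presumably be queried over SSH and the per-host values accumulated, per the commit history.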

## benchmark[.py | .sh] script
- [x] Unique names for files and directories with structure for benchmark, accelerator, count, run-sequence, run-number
- [x] Better installer that manages dependencies
- [ ] Containerization
  - [ ] Ease of deployment of benchmark (just get it working)
  - [ ] Cgroups and resource limits (better cache management)
- [ ] Flush cache before a run
- [ ] Validate inputs for --closed runs (eg: don’t allow runs against datasets that are too small)
- [ ] Reportgen should run validation against outputs
- [ ] Add better system.json creation to automate the system description for consistency
  - [ ] Add json schema checker for system documents that submitters create
- [ ] Automate execution of multiple runs
- [ ] ~~Add support for code changes in closed to supported categories [data loader, s3 connector, etc]~~
  - [ ] ~~Add patches directory that gets applied before execution~~
- [ ] Add runtime estimation
  - [x] and a --what-if or --dry-run flag
- [ ] Automate selection of minimum required dataset
- [ ] ~~Determine if batch sizes in MLPerf Training are representative of batch sizes for realistically sized datasets~~
- [ ] Split system.json into automatically capturable (clients) and manual (storage)
- [ ] Define system.json schema and add schema checker to the tool for reportgen (see the sketch after this list)
- [ ] Add report-dir csv of results from tests as they are run
- [ ] Collect versions of all prerequisite packages for storage and dlio
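
Several items above call for a JSON schema check of the system documents that submitters create. A minimal sketch using the jsonschema package; the schema here is a placeholder stub, since the real system.json schema is still listed as to-be-defined:

```python
# Placeholder sketch of a system.json schema check with the jsonschema
# package; SYSTEM_SCHEMA is a stub, not the real MLPerf Storage schema.
import json
from jsonschema import validate, ValidationError

SYSTEM_SCHEMA = {
    "type": "object",
    "required": ["storage_system", "networks"],
    "properties": {
        "storage_system": {
            "type": "object",
            "required": ["solution_type"],
        },
        "networks": {"type": "object"},
    },
}

def check_system_document(path: str) -> bool:
    """Return True if the system description passes the schema check."""
    with open(path) as f:
        doc = json.load(f)
    try:
        validate(instance=doc, schema=SYSTEM_SCHEMA)
        return True
    except ValidationError as err:
        print(f"system.json failed validation: {err.message}")
        return False
```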

## DLIO Improvements
- [ ] Reduce verbosity of logging
- [ ] Add callback handler for custom monitoring
  - [ ] SPECStorage uses a “PRIME_MON_SCRIPT” environment variable that will execute at different times
  - [ ] Checkpoint_bench uses RPC to call execution which can be wrapped externally
- [ ] Add support for DIRECTIO
- [ ] Add seed for dataset creation so that the distribution of sizes is the same for all submitters (file 1 = mean + x bytes, file 2 = mean + y bytes, etc; see the sketch after this list)
- [ ] Determine if global barrier for each batch matches industry behavior
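
For the seeding item above, the point is that a fixed seed makes file i's size identical for every submitter, regardless of where the data is generated. A minimal sketch assuming NumPy and a normal size distribution; DLIO's actual distribution and generator may differ:

```python
# Sketch of seeded file-size generation so that every submitter gets the same
# per-file sizes (file 1 = mean + x bytes, file 2 = mean + y bytes, ...).
# The normal distribution is an assumption here, not DLIO's actual choice.
import numpy as np

def deterministic_file_sizes(num_files: int, mean_bytes: int,
                             stdev_bytes: int, seed: int = 1234) -> list[int]:
    rng = np.random.default_rng(seed)  # same seed -> same sizes everywhere
    sizes = rng.normal(loc=mean_bytes, scale=stdev_bytes, size=num_files)
    return [max(1, int(s)) for s in sizes]

# Any two submitters calling this with identical arguments get identical sizes.
print(deterministic_file_sizes(5, mean_bytes=150_000, stdev_bytes=10_000))
```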

## Results Presentation
- [ ] Better linking and presentation of system diagrams (add working links to system diagrams to supplementals)
- [ ] Define presentation and rules for hyperconverged or systems with local cache
517 changes: 392 additions & 125 deletions README.md

Large diffs are not rendered by default.

66 changes: 62 additions & 4 deletions Submission_guidelines.md
@@ -103,6 +103,7 @@ The following definitions are used throughout this document:
- A **sample** is the unit of data on which training is run, e.g., an image, or a sentence.
- A **step** is defined to be the first batch of data loaded into the (emulated) accelerator.
- **Accelerator Utilization (AU)** is defined as the percentage of time taken by the simulated accelerators, relative to the total benchmark running time. Higher is better.
- **Design power** is defined to be the minimum measurement of electrical power that must be capable of being supplied to a single or collection of power supply units (PSUs) in order to avoid violating regulatory and safety requirements. For individual PSUs, the design power equals the nameplate rated power. For groups of redundant PSUs, the design power is equal to the sum of the nameplate rated power of the minimum number of PSUs required to be simultaneously operational.

Review comment: 2.2 Checkpointing and 2.3 Vector Database above are empty. Maybe at least add a TODO to TBD; maybe mention they will be part of the 3.0 release...

Reply (Contributor Author): Thanks. We'll remove vectorDB and Huihuo needs to add his proposed checkpoint rules document.

Reply (Contributor): Yes, will add the checkpointing section soon.

- A **division** is a set of rules for implementing benchmarks from a suite to produce a class of comparable results. MLPerf Storage allows CLOSED and OPEN divisions, detailed in Section 6.
- **DLIO ([code link](https://github.com/argonne-lcf/dlio_benchmark), [paper link](https://ieeexplore.ieee.org/document/9499416))** is a benchmarking tool for deep learning applications. DLIO is the core of the MLPerf Storage benchmark and with specified configurations will emulate the I/O pattern for the workloads listed in Table 1. MLPerf Storage provides wrapper scripts to launch DLIO. There is no need to know the internals of DLIO to do a CLOSED submission, as the wrapper scripts provided by MLPerf Storage will suffice. However, for OPEN submissions changes to the DLIO code might be required (e.g., to add custom data loaders).
- **Dataset content** refers to the data and the total capacity of the data, not the format of how the data is stored. Specific information on dataset content can be found [here](https://github.com/mlcommons/storage/tree/main/storage-conf/workload).
@@ -118,6 +119,13 @@ The following definitions are used throughout this document:
- A **benchmark implementation** is an implementation of a benchmark in a particular framework by a user under the rules of a specific division.
- A **run** is a complete execution of a benchmark implementation on a system.
- A **benchmark result** is the mean of 5 run results, executed consecutively. The dataset is generated only once for the 5 runs, prior to those runs. The 5 runs must be done on the same machine(s).
- **Nameplate rated power** is defined as the maximum power capacity that can be provided by a power supply unit (PSU), as declared to a certification authority. The nameplate rated power can typically be obtained from the PSU datasheet.
- A **Power Supply Unit (PSU)** is a component which converts an AC or DC voltage input to one or more DC voltage outputs for the purpose of powering a system or subsystem. Power supply units may be redundant and hot swappable.
- **SPEC PTDaemon® Interface (PTDaemon®)** is a software component created by the Standard Performance Evaluation Corporation (SPEC) designed to simplify the measurement of power consumption by abstracting the interface between benchmarking software and supported power analyzers.
- A **Supported power analyzer** is a test device supported by the PTDaemon® software that measures the instantaneous voltage and multiplies it by the instantaneous current, then accumulates these values over a specific time period to provide a cumulative measurement of consumed electrical power. For a listing of supported power analyzers, see https://www.spec.org/power/docs/SPECpower-Device_List.html
- A **System Under Test (SUT)** is the storage system being benchmarked.


- The storage system under test must be described via one of the following **storage system access types**. The overall solution might support more than one of the below types, but any given benchmark submission must be described by the access type that was actually used during that submission. Specifically, this is reflected in the `system-name.json` file, in the `storage_system→solution_type`, the `storage_system→software_defined` and `storage_system→hyperconverged` fields, and the `networks→protocols` fields (a hypothetical excerpt of these fields is shown after the list below). An optional vendor-specified qualifier may be added. This will be displayed in the results table after the storage system access type, for example, “NAS - RDMA”.
- **Direct-attached media** – any solution using local media on the ``host node``(s); eg: NVMe-attached storage with a local filesystem layered over it. This will be abbreviated “**Local**” in the results table.
- **Remotely-attached block device** – any solution using remote block storage; eg: a SAN using FibreChannel, iSCSI, NVMeoF, NVMeoF over RDMA, etc, with a local filesystem implementation layered over it. This will be abbreviated “**Remote Block**” in the results table.
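
As referenced above, a hypothetical excerpt showing only the access-type fields of a ``system-name.json``, written as a Python literal; the field names come from the paragraph above, while all values are purely illustrative:

```python
# Hypothetical excerpt of a system-name.json, restricted to the access-type
# fields named in the guidelines text above; all values are illustrative.
system_description = {
    "storage_system": {
        "solution_type": "NAS",        # the access type actually used in the run
        "software_defined": False,
        "hyperconverged": False,
    },
    "networks": {
        # would surface as the optional "NAS - RDMA" qualifier in the results table
        "protocols": ["NFS over RDMA"],
    },
}
```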
@@ -431,17 +439,67 @@ The ``<system-name>.json`` file must pass a validation check with the JSON sc

The goal of the pdf is to complement the JSON file, providing additional detail on the system to enable full reproduction by a third party. We encourage submitters to add details that are more easily captured by diagrams and text description, rather than a JSON.

This file is supposed to include everything that a third party would need in order to recreate the results in the submission, including product model numbers or hardware config details, unit counts of drives and/or components, system and network topologies, software used with version numbers, and any non-default configuration options used by any of the above.

The following *recommended* structure of systems.pdf provides a starting point and is optional. Submitters are free to adjust this structure as they see fit.
This file should include everything that a third party would need in order to recreate the results in the submission, including product model numbers or hardware config details, unit counts of drives and/or components, system and network topologies, software used with version numbers, and any non-default configuration options used by any of the above.

A great example of a system description pdf can be found [here](https://github.com/mlcommons/storage_results_v0.5/tree/main/closed/DDN/systems).


**Cover page**

The following information is required to be included in the system description PDF:

- System name of the submission
- Submitter name
- Submission date
- Version of the benchmark
- Solution type of the submission
- Submission division (OPEN or CLOSED)

**Mandatory Power requirements**

Systems that require customer provisioning of power (for example, systems intended to be deployed in on-premises data centers or in co-located data centers) shall include a “Power Requirements Table”. Systems designed to only run in a cloud or hyper-converged environment do not have to include this table.

The power requirements table shall list all hardware devices required to operate the storage system. Shared network equipment also used for client network communication and optional storage management systems do not need to be included. The power requirements table shall include:

1. Every component in the system that requires electrical power.
2. For each component, every PSU powering that component.
3. For each PSU, the PSU nameplate rated power.
4. For each PSU (or redundant group of PSUs), the design power.

Two examples of power requirements tables are shown below:

**Power Requirements Table** (Large system example)

| System component | Power supply unit | Nameplate rated power | Design power |
| -------------------- | ----------------- | --------------------- | -------------- |
| Storage controller 1 | Power supply 1 | 1200 watts | 3600 watts |
| | Power supply 2 | 1200 watts | |
| | Power supply 3 | 1200 watts | |
| | Power supply 4 | 1200 watts | |
| Storage shelf 1 | Power supply 1 | 1000 watts | 1000 watts |
| | Power supply 2 | 1000 watts | |
| Network switch 1 | Power supply 1 | 1200 watts | 1200 watts |
| | Power supply 2 | 1200 watts | |
| **Totals** | | **9200 watts** | **5800 watts** |

**Power Requirements Table** (Direct-attached media system example)

| System component | Power supply unit | Nameplate rated power | Design power |
| -------------------- | ----------------- | --------------------- | -------------- |
| NVMe SSD 1 | 12VDC supply | 10 watts | 10 watts |
| | 3.3VDC supply | 2 watts | 2 watts |
| **Totals** | | **12 watts** | **12 watts** |

System component and power supply unit names in the above tables are examples. Consistent names should be used in bill-of-material documentation, system diagrams and descriptive text.
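
A minimal sketch of the design-power arithmetic behind these tables, assuming the storage controller in the large-system example needs three of its four PSUs simultaneously operational (inferred from the 3600-watt figure; the table itself does not state the redundancy level):

```python
# Minimal sketch of the design-power rule from the definitions above: sum the
# nameplate rated power of the minimum number of PSUs that must be
# simultaneously operational (assumes equal-rated PSUs within a group).
def design_power(nameplate_watts: int, installed_psus: int, required_psus: int) -> int:
    assert required_psus <= installed_psus
    return required_psus * nameplate_watts

# Large-system example from the table above; 3-of-4 redundancy on the
# controller is an assumption inferred from the 3600-watt figure.
controller = design_power(1200, installed_psus=4, required_psus=3)  # 3600 W
shelf = design_power(1000, installed_psus=2, required_psus=1)       # 1000 W
switch = design_power(1200, installed_psus=2, required_psus=1)      # 1200 W
print(controller + shelf + switch)  # 5800 W, matching the table's design-power total
```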

**Optional information**

The following *recommended* structure of systems.pdf provides a starting point for additional optional information. Submitters are free to adjust this structure as they see fit.

If the submission is for a commercial system, a pdf of the product spec document can add significant value. If it is a system that does not have a spec document (e.g., a research system, HPC etc), or the product spec pdf doesn’t include all the required detail, the document can contain (all these are optional):

- Recommended: A high-level system diagram, e.g., showing the ``host node``(s), the storage system's main components, and the network topology used when connecting everything (e.g., spine-and-leaf, butterfly, etc.), and any non-default configuration options that were set during the benchmark run.
- Optional: Additional text description of the system, if the information is not captured in the JSON, e.g., the storage system’s components (make and model, optional features, capabilities, etc) and all configuration settings that are relevant to ML/AI benchmarks. If the make/model doesn’t specify all the components of the hardware platform it is running on, eg: it’s a Software-Defined-Storage product, then those should be included here (just like the client component list).
- Optional: power requirements – If the system requires the physical deployment of hardware, consider including the “not to exceed” power requirements for the system to run the MLCommons storage benchmark workload. Additional information can include the total nameplate power rating and the peak power consumption during the benchmark.
- Optional: physical requirements – If the system requires the physical deployment of hardware, consider including the number of rack units, required supporting equipment, and any physical constraints on how the equipment must be installed into an industry-standard rack, such as required spacing, weight constraints, etc. We recommend the following three categories for the text description:
1. Software,
2. Hardware, and