Description
Hi,
I am running an mlpstorage training run and saw several PMIX_ERR_NOMEM errors during the test. The test did not stop, though: it kept running, and at the end I still got a success/fail test result.
Are these PMIX_ERR_NOMEM errors a serious concern, and can the test result still be accepted? Please advise, thanks.
//log
(myenv) root@minio1:~# mlpstorage training run --hosts 10.17.73.100 10.17.73.103 10.17.73.106 10.17.73.107 10.17.73.108 10.17.73.109 10.17.73.111 10.17.73.112 --client-host-memory-in-gb 64 --num-accelerators 132 --accelerator-type a100 --model resnet50 --data-dir /mnt/newdisk/fcresnet50/ --results-dir /home/test/resnet50_results/ --param checkpoint.checkpoint_folder=/home/test/ckpoint --param dataset.num_files_train=21103 --param reader.computation_threads=48 --param reader.prefetch_size=0 --param reader.transfer_size=768 --mpi-params "--mca btl_tcp_if_include ens36" --allow-run-as-root --closed
Setting attr from num_accelerators to 132
Hosts is: ['10.17.73.100', '10.17.73.103', '10.17.73.106', '10.17.73.107', '10.17.73.108', '10.17.73.109', '10.17.73.111', '10.17.73.112']
Hosts is: ['10.17.73.100', '10.17.73.103', '10.17.73.106', '10.17.73.107', '10.17.73.108', '10.17.73.109', '10.17.73.111', '10.17.73.112']
2025-08-24 15:39:17|STATUS: Benchmark results directory: /home/test/resnet50_results/training/resnet50/run/20250824_153917
2025-08-24 15:39:17|INFO: Found benchmark run: training_run_resnet50_20250824_153917
2025-08-24 15:39:17|STATUS: Verifying benchmark run for training_run_resnet50_20250824_153917
2025-08-24 15:39:17|RESULT: Minimum file count dictated by 500 step requirement of given accelerator count and batch size.
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: checkpoint.checkpoint_folder = /home/test/ckpoint (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 21103 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.computation_threads = 48 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.prefetch_size = 0 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.transfer_size = 768 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='resnet50', run_datetime='20250824_153917')])
2025-08-24 15:39:17|STATUS: Running benchmark command:: mpirun -n 132 -host 10.17.73.100:17,10.17.73.103:17,10.17.73.106:17,10.17.73.107:17,10.17.73.108:16,10.17.73.109:16,10.17.73.111:16,10.17.73.112:16 --allow-run-as-root --mca btl_tcp_if_include ens36 /root/.venvs/myenv/bin/dlio_benchmark workload=resnet50_a100 ++hydra.run.dir=/home/test/resnet50_results/training/resnet50/run/20250824_153917 ++hydra.output_subdir=dlio_config ++workload.checkpoint.checkpoint_folder=/home/test/ckpoint ++workload.dataset.num_files_train=21103 ++workload.reader.computation_threads=48 ++workload.reader.prefetch_size=0 ++workload.reader.transfer_size=768 ++workload.dataset.data_folder=/mnt/newdisk/fcresnet50/resnet50 --config-dir=/root/storage/configs/dlio
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:226537] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
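In case it helps with triage, here is a minimal sketch (Python, standard library only) that condenses the repeated lines above into a count per source location and a set of reporting PIDs. The function name `summarize_pmix_errors` and the regular expression are my own assumptions based on the line format in this log; they are not part of mlpstorage, DLIO, or PMIx.

```python
import re
from collections import Counter

# Matches lines of the form seen in the log above, e.g.
# [minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
# The pattern is an assumption derived from this log, not an official format.
PMIX_LINE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] PMIX ERROR: (?P<code>\w+) "
    r"in file (?P<path>\S+) at line (?P<line>\d+)\s*$"
)


def summarize_pmix_errors(log_text):
    """Return (counts per (code, source file, line), set of (host, pid))."""
    by_site = Counter()
    pids = set()
    for raw in log_text.splitlines():
        match = PMIX_LINE.match(raw.strip())
        if match is None:
            continue  # not a PMIX error line, skip it
        # Keep only the file name; the leading ../ segments vary per file.
        source = match.group("path").rsplit("/", 1)[-1]
        by_site[(match.group("code"), source, int(match.group("line")))] += 1
        pids.add((match.group("host"), int(match.group("pid"))))
    return by_site, pids


if __name__ == "__main__":
    # Three lines copied from the log above, plus one unrelated line.
    sample = "\n".join([
        "Setting attr from num_accelerators to 132",
        "[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file "
        "../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958",
        "[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file "
        "../../../src/server/pmix_server.c at line 3409",
        "[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file "
        "../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958",
    ])
    by_site, pids = summarize_pmix_errors(sample)
    for (code, source, line), count in sorted(by_site.items()):
        print(f"{code} {source}:{line} x{count}")
    print("reporting processes:", sorted(pids))
```

Running it over the full mpirun output should show whether every reporting process hits the same five source locations, as the excerpt above suggests.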
Thanks,
Mike Chen