PMIX ERR NOMEM error, but the test can still run #197

@mikechenccp

Hi

I am running the mlpstorage training run test and I see PMIX_ERR_NOMEM errors during the test. The test does not stop, though: it keeps running, and at the end I can see a success/fail test result.

Are these PMIX_ERR_NOMEM errors a serious concern, and can the test result still be accepted? Please comment, thanks.

//log
(myenv) root@minio1:~# mlpstorage training run --hosts 10.17.73.100 10.17.73.103 10.17.73.106 10.17.73.107 10.17.73.108 10.17.73.109 10.17.73.111 10.17.73.112 --client-host-memory-in-gb 64 --num-accelerators 132 --accelerator-type a100 --model resnet50 --data-dir /mnt/newdisk/fcresnet50/ --results-dir /home/test/resnet50_results/ --param checkpoint.checkpoint_folder=/home/test/ckpoint --param dataset.num_files_train=21103 --param reader.computation_threads=48 --param reader.prefetch_size=0 --param reader.transfer_size=768 --mpi-params "--mca btl_tcp_if_include ens36" --allow-run-as-root --closed
Setting attr from num_accelerators to 132
Hosts is: ['10.17.73.100', '10.17.73.103', '10.17.73.106', '10.17.73.107', '10.17.73.108', '10.17.73.109', '10.17.73.111', '10.17.73.112']
Hosts is: ['10.17.73.100', '10.17.73.103', '10.17.73.106', '10.17.73.107', '10.17.73.108', '10.17.73.109', '10.17.73.111', '10.17.73.112']
2025-08-24 15:39:17|STATUS: Benchmark results directory: /home/test/resnet50_results/training/resnet50/run/20250824_153917
2025-08-24 15:39:17|INFO: Found benchmark run: training_run_resnet50_20250824_153917
2025-08-24 15:39:17|STATUS: Verifying benchmark run for training_run_resnet50_20250824_153917
2025-08-24 15:39:17|RESULT: Minimum file count dictated by 500 step requirement of given accelerator count and batch size.
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: checkpoint.checkpoint_folder = /home/test/ckpoint (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 21103 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.computation_threads = 48 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.prefetch_size = 0 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.transfer_size = 768 (Parameter: Overrode Parameters)
2025-08-24 15:39:17|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='resnet50', run_datetime='20250824_153917')])
2025-08-24 15:39:17|STATUS: Running benchmark command:: mpirun -n 132 -host 10.17.73.100:17,10.17.73.103:17,10.17.73.106:17,10.17.73.107:17,10.17.73.108:16,10.17.73.109:16,10.17.73.111:16,10.17.73.112:16 --allow-run-as-root --mca btl_tcp_if_include ens36 /root/.venvs/myenv/bin/dlio_benchmark workload=resnet50_a100 ++hydra.run.dir=/home/test/resnet50_results/training/resnet50/run/20250824_153917 ++hydra.output_subdir=dlio_config ++workload.checkpoint.checkpoint_folder=/home/test/ckpoint ++workload.dataset.num_files_train=21103 ++workload.reader.computation_threads=48 ++workload.reader.prefetch_size=0 ++workload.reader.transfer_size=768 ++workload.dataset.data_folder=/mnt/newdisk/fcresnet50/resnet50 --config-dir=/root/storage/configs/dlio
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:220609] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:224733] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:226590] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:228063] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:230458] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[minio1:228468] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[minio1:226537] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
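
To see how widespread the errors are, the PMIX lines in a log like the one above can be tallied per process and per source location. This is only a minimal sketch, assuming the output was captured as plain text; the function name `summarize_pmix_errors` and the regular expression are illustrative and are not part of mlpstorage, DLIO, or PMIx.

```python
import re
from collections import Counter

# Matches lines of the form seen in the log above, e.g.
# [minio1:768540] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
PMIX_LINE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] PMIX ERROR: (?P<code>\w+) "
    r"in file (?P<path>\S+) at line (?P<line>\d+)"
)


def summarize_pmix_errors(log_text):
    """Count PMIX error lines per (host, pid) and per (code, file, line).

    Hypothetical helper for illustration only.
    """
    by_pid = Counter()
    by_site = Counter()
    for raw in log_text.splitlines():
        match = PMIX_LINE.search(raw)
        if match is None:
            continue  # not a PMIX error line (status output, etc.)
        by_pid[(match["host"], int(match["pid"]))] += 1
        # Drop the leading "../" segments so identical sources group together.
        source = match["path"].lstrip("./")
        by_site[(match["code"], source, int(match["line"]))] += 1
    return by_pid, by_site


if __name__ == "__main__":
    import sys

    # Usage: python summarize_pmix.py < run.log
    pids, sites = summarize_pmix_errors(sys.stdin.read())
    for (host, pid), count in sorted(pids.items()):
        print(f"{host}:{pid} -> {count} error lines")
    for (code, source, line), count in sites.most_common():
        print(f"{code} at {source}:{line} -> {count}")
```

Grouping by source location makes it easy to check whether every reporting process hits the same five call sites, as the excerpt above suggests.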

Thanks,
Mike Chen
