Open
Description
Hi folks! I rebuilt a container with a newer maestrowf, and I'm having trouble reproducing a previously working run. I think it might be related to PYTHONPATH and FLUX_URI: it appears that we first find both, but then when maestro runs it doesn't seem to be able to import flux (suggesting the PYTHONPATH was altered). Here are the full logs, and I'll try to annotate them a bit.
# This is starting flux with 'flux start' which is what we call a launcher mode, expecting the workflow tool to run flux jobs, etc.
🌀 Launcher Mode: flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout=256 -Srundir=/mnt/flux/view/run/flux -Sstatedir=/mnt/flux/view/var/lib/flux -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s -Slog-stderr-level=6 -Slog-stderr-mode=local
broker.info[0]: start: none->join 0.335857ms
broker.info[0]: parent-none: join->init 0.016684ms
cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
job-manager.info[0]: restart: 0 jobs
job-manager.info[0]: restart: 0 running jobs
job-manager.info[0]: restart: checkpoint.job-manager not found
broker.info[0]: rc1.0: running /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc1.d/02-cron
broker.info[0]: rc1.0: /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc1 Exited (rc=0) 1.1s
broker.info[0]: rc1-success: init->quorum 1.12243s
broker.info[0]: online: flux-sample-0 (ranks 0)
broker.info[0]: online: flux-sample-[0-3] (ranks 0-3)
# This indicates the quorum is full
broker.info[0]: quorum-full: quorum->run 0.634584s
# Here is where I think maestro starts?
[2023-11-14 17:31:24: INFO] INFO Logging Level -- Enabled
[2023-11-14 17:31:24: WARNING] WARNING Logging Level -- Enabled
[2023-11-14 17:31:24: CRITICAL] CRITICAL Logging Level -- Enabled
[2023-11-14 17:31:24: INFO] Loading specification -- path = ./lulesh-flux.yaml
[2023-11-14 17:31:24: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/logs
[2023-11-14 17:31:24: INFO] Adding step 'make-lulesh' to study 'lulesh_sample1'...
[2023-11-14 17:31:24: INFO] Adding step 'run-lulesh' to study 'lulesh_sample1'...
[2023-11-14 17:31:24: INFO] run-lulesh is dependent on make-lulesh. Creating edge (make-lulesh, run-lulesh)...
[2023-11-14 17:31:24: INFO]
------------------------------------------
Submission attempts = 1
Submission restart limit = 1
Submission throttle limit = 0
Use temporary directory = False
Hash workspaces = False
Dry run enabled = False
Output path = /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124
------------------------------------------
[2023-11-14 17:31:24: INFO] Acquiring -- LULESH
[2023-11-14 17:31:24: INFO] Checking for connectivity to 'https://github.com/LLNL/LULESH.git'
[2023-11-14 17:31:24: INFO] Connectivity achieved!
[2023-11-14 17:31:24: INFO] Cloning 'LULESH' from 'https://github.com/LLNL/LULESH.git'...
[2023-11-14 17:31:25: INFO] Running Maestro Conductor in the foreground.
[2023-11-14 17:31:25: INFO]
------------------------------------------
Submission attempts = 1
Submission throttle limit = 0
Use temporary directory = False
Tmp Dir =
------------------------------------------
[2023-11-14 17:31:25: INFO]
==================================================
name: lulesh_sample1
description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations on FLUX.
==================================================
[2023-11-14 17:31:25: INFO]
==================================================
Constructing parameter study 'lulesh_sample1'
==================================================
[2023-11-14 17:31:25: INFO]
==================================================
Processing step '_source'
==================================================
[2023-11-14 17:31:25: INFO] Encountered '_source'. Adding and continuing.
[2023-11-14 17:31:25: INFO]
==================================================
Processing step 'make-lulesh'
==================================================
[2023-11-14 17:31:25: INFO]
-------------------------------------------------
Adding step 'make-lulesh' (No parameters used)
-------------------------------------------------
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = cd /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH
mkdir build
cd build
cmake -WITH_MPI=Off -WITH_OPENMP=Off ..
make
[2023-11-14 17:31:25: INFO]
==================================================
Processing step 'run-lulesh'
==================================================
[2023-11-14 17:31:25: INFO]
==================================================
Expanding step 'run-lulesh'
==================================================
-------- Used Parameters --------
{'SIZE', 'ITERATIONS'}
---------------------------------
[2023-11-14 17:31:25: INFO]
**********************************
Combo [SIZE.100.ITER.10]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 10 -p > SIZE.100.ITER.10.log
[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 10 -p > SIZE.100.ITER.10.log
[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.10.SIZE.100)...
[2023-11-14 17:31:25: INFO]
**********************************
Combo [SIZE.100.ITER.20]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 20 -p > SIZE.100.ITER.20.log
[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 20 -p > SIZE.100.ITER.20.log
[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.20.SIZE.100)...
[2023-11-14 17:31:25: INFO]
**********************************
Combo [SIZE.100.ITER.30]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 30 -p > SIZE.100.ITER.30.log
[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 30 -p > SIZE.100.ITER.30.log
[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.30.SIZE.100)...
[2023-11-14 17:31:25: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/meta
[2023-11-14 17:31:25: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/meta/study
[2023-11-14 17:31:25: INFO] Checking DAG status at 2023-11-14 17:31:25.588449
# Here is where things start to get weird - at first we have a flux_uri, but then not?
[2023-11-14 17:31:25: INFO] Found FLUX_URI in environment, scheduling jobs to broker uri local:///mnt/flux/view/run/flux/local
[2023-11-14 17:31:25: INFO] No FLUX_URI; scheduling standalone batch job to root instance
Traceback (most recent call last):
  File "/usr/local/bin/maestro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/maestro.py", line 507, in main
    rc = args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/maestro.py", line 341, in run_study
    completion_status = conductor.monitor_study()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/conductor.py", line 352, in monitor_study
    completion_status = dag.execute_ready_steps()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/datastructures/core/executiongraph.py", line 734, in execute_ready_steps
    adapter = adapter(**self._adapter)
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/interfaces/script/fluxscriptadapter.py", line 116, in __init__
    self._broker_version = self._interface.get_flux_version()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/abstracts/interfaces/flux.py", line 75, in get_flux_version
    cls.connect_to_flux()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/abstracts/interfaces/flux.py", line 22, in connect_to_flux
    cls.flux_handle = flux.Flux()
NameError: name 'flux' is not defined
broker.err[0]: rc2.0: maestro run -fg ./lulesh-flux.yaml -y Exited (rc=1) 1.8s
broker.info[0]: rc2-fail: run->cleanup 1.75864s
broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.1s
broker.info[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.1s
broker.info[0]: cleanup.2: flux queue idle --quiet Exited (rc=0) 0.1s
broker.info[0]: cleanup-success: cleanup->shutdown 0.278262s
broker.info[0]: children-complete: shutdown->finalize 95.1042ms
broker.info[0]: rc3.0: /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc3 Exited (rc=0) 0.2s
broker.info[0]: rc3-success: finalize->goodbye 0.17189s
broker.info[0]: goodbye: goodbye->exit 0.046488ms
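A note on the traceback above: the failure is a NameError rather than an ImportError, which is what Python raises when a module-level name was never bound. One pattern that produces this is an optional import wrapped in try/except, where the import fails quietly and the name is only used later. The sketch below reproduces that pattern in isolation. The module name `definitely_missing_backend_xyz` and the functions are hypothetical, and whether maestrowf guards its `import flux` this way is an assumption, not something the logs confirm.

```python
# Hypothetical stand-in for a module that treats a backend import as optional.
try:
    import definitely_missing_backend_xyz as backend  # assumed not installed
except ImportError:
    # The failure is swallowed here, so the name `backend` is never bound.
    pass


def connect():
    """Use the backend; fails late if the guarded import did not succeed."""
    return backend.Handle()


def demo():
    """Show which exception surfaces when the backend is finally used."""
    try:
        connect()
    except NameError as err:
        return f"NameError: {err}"
    return "connected"


if __name__ == "__main__":
    print(demo())
```

If this is what is happening, the error at `flux.Flux()` only says that the earlier import did not succeed in that process. It does not by itself say why, so the environment of the maestro process is the next thing to inspect.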
Can we talk through the flow of logic between the first message (FLUX_URI found) and the second (no FLUX_URI)? If the second step isn't finding flux because the PYTHONPATH isn't being passed forward, that might be the bug?
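One way to narrow this down is to print what the Python interpreter behind maestro actually sees. Here is a minimal sketch, assuming it is run with the same interpreter as /usr/local/bin/maestro and from inside the flux instance (for example in place of the maestro command). `flux_env_report` is a hypothetical helper name, not part of maestrowf or flux.

```python
import importlib.util
import os
import sys


def flux_env_report(env=None):
    """Collect the settings that decide whether `import flux` can work."""
    env = os.environ if env is None else env
    # find_spec returns None when no top-level module named "flux" is found
    # on sys.path (which is built from PYTHONPATH, among other sources).
    spec = importlib.util.find_spec("flux")
    return {
        "FLUX_URI": env.get("FLUX_URI"),
        "PYTHONPATH": env.get("PYTHONPATH"),
        "python": sys.executable,
        "flux_importable": spec is not None,
        "flux_location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in flux_env_report().items():
        print(f"{key}: {value}")
```

If `flux_importable` is False while FLUX_URI is set, that would point at the Python path rather than the broker connection.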
For context, here is the workflow I'm running:
description:
    name: lulesh_sample1
    description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations on FLUX.
env:
    variables:
        OUTPUT_PATH: ./studies/lulesh
    labels:
        outfile: $(SIZE.label).$(ITERATIONS.label).log
    dependencies:
        git:
            - name: LULESH
              path: $(OUTPUT_PATH)
              url: https://github.com/LLNL/LULESH.git
batch:
    type : flux
study:
    - name: make-lulesh
      description: Build the MPI enabled version of LULESH.
      run:
          cmd: |
            cd $(LULESH)
            mkdir build
            cd build
            cmake -WITH_MPI=Off -WITH_OPENMP=Off ..
            make
          depends: []
    - name: run-lulesh
      description: Run LULESH.
      run:
          cmd: |
            $(LAUNCHER) $(LULESH)/build/lulesh2.0 -s $(SIZE) -i $(ITERATIONS) -p > $(outfile)
          depends: [make-lulesh]
          nodes: 1
          procs: 1
          cores per task: 1
          nested: True
          priority: high
          walltime: "00:60:00"
# Note that I reduced these sizes for a single container run
global.parameters:
    SIZE:
        values : [100, 100, 100]
        # values : [100, 100, 100, 200, 200, 200, 300, 300, 300]
        label : SIZE.%%
    ITERATIONS:
        # values : [10, 20, 30, 10, 20, 30, 10, 20, 30]
        values : [10, 20, 30]
        label : ITER.%%