Debugging previously working run with Flux! #429

@vsoch

Hi folks! I rebuilt a container with a newer maestrowf, and I'm having trouble reproducing a previously working run. I think it might be related to PYTHONPATH and FLUX_URI: it appears that we first find both, but then when maestro runs it can't import flux (suggesting the PYTHONPATH was altered). Here are the full logs, which I've annotated a bit.

# This starts flux with 'flux start', which is what we call launcher mode: the workflow tool is expected to run flux jobs, etc.
🌀 Launcher Mode: flux start -o --config /mnt/flux/view/etc/flux/config -Scron.directory=/etc/flux/system/cron.d   -Stbon.fanout=256   -Srundir=/mnt/flux/view/run/flux    -Sstatedir=/mnt/flux/view/var/lib/flux   -Slocal-uri=local:///mnt/flux/view/run/flux/local -Stbon.connect_timeout=5s    -Slog-stderr-level=6    -Slog-stderr-mode=local  
broker.info[0]: start: none->join 0.335857ms
broker.info[0]: parent-none: join->init 0.016684ms
cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
job-manager.info[0]: restart: 0 jobs
job-manager.info[0]: restart: 0 running jobs
job-manager.info[0]: restart: checkpoint.job-manager not found
broker.info[0]: rc1.0: running /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc1.d/02-cron
broker.info[0]: rc1.0: /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc1 Exited (rc=0) 1.1s
broker.info[0]: rc1-success: init->quorum 1.12243s
broker.info[0]: online: flux-sample-0 (ranks 0)
broker.info[0]: online: flux-sample-[0-3] (ranks 0-3)

# This indicates the quorum is full
broker.info[0]: quorum-full: quorum->run 0.634584s

# Here is where I think maestro starts?
[2023-11-14 17:31:24: INFO] INFO Logging Level -- Enabled
[2023-11-14 17:31:24: WARNING] WARNING Logging Level -- Enabled
[2023-11-14 17:31:24: CRITICAL] CRITICAL Logging Level -- Enabled
[2023-11-14 17:31:24: INFO] Loading specification -- path = ./lulesh-flux.yaml
[2023-11-14 17:31:24: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/logs
[2023-11-14 17:31:24: INFO] Adding step 'make-lulesh' to study 'lulesh_sample1'...
[2023-11-14 17:31:24: INFO] Adding step 'run-lulesh' to study 'lulesh_sample1'...
[2023-11-14 17:31:24: INFO] run-lulesh is dependent on make-lulesh. Creating edge (make-lulesh, run-lulesh)...
[2023-11-14 17:31:24: INFO] 
------------------------------------------
Submission attempts =       1
Submission restart limit =  1
Submission throttle limit = 0
Use temporary directory =   False
Hash workspaces =           False
Dry run enabled =           False
Output path =               /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124
------------------------------------------
[2023-11-14 17:31:24: INFO] Acquiring -- LULESH
[2023-11-14 17:31:24: INFO] Checking for connectivity to 'https://github.com/LLNL/LULESH.git'
[2023-11-14 17:31:24: INFO] Connectivity achieved!
[2023-11-14 17:31:24: INFO] Cloning 'LULESH' from 'https://github.com/LLNL/LULESH.git'...
[2023-11-14 17:31:25: INFO] Running Maestro Conductor in the foreground.
[2023-11-14 17:31:25: INFO] 
------------------------------------------
Submission attempts =       1
Submission throttle limit = 0
Use temporary directory =   False
Tmp Dir = 
------------------------------------------
[2023-11-14 17:31:25: INFO] 
==================================================
name: lulesh_sample1
description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations on FLUX.
==================================================

[2023-11-14 17:31:25: INFO] 
==================================================
Constructing parameter study 'lulesh_sample1'
==================================================

[2023-11-14 17:31:25: INFO] 
==================================================
Processing step '_source'
==================================================

[2023-11-14 17:31:25: INFO] Encountered '_source'. Adding and continuing.
[2023-11-14 17:31:25: INFO] 
==================================================
Processing step 'make-lulesh'
==================================================

[2023-11-14 17:31:25: INFO] 
-------------------------------------------------
Adding step 'make-lulesh' (No parameters used)
-------------------------------------------------

[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = cd /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH
mkdir build
cd build
cmake -WITH_MPI=Off -WITH_OPENMP=Off ..
make

[2023-11-14 17:31:25: INFO] 
==================================================
Processing step 'run-lulesh'
==================================================

[2023-11-14 17:31:25: INFO] 
==================================================
Expanding step 'run-lulesh'
==================================================
-------- Used Parameters --------
{'SIZE', 'ITERATIONS'}
---------------------------------
[2023-11-14 17:31:25: INFO] 
**********************************
Combo [SIZE.100.ITER.10]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 10 -p > SIZE.100.ITER.10.log

[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 10 -p > SIZE.100.ITER.10.log

[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.10.SIZE.100)...
[2023-11-14 17:31:25: INFO] 
**********************************
Combo [SIZE.100.ITER.20]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 20 -p > SIZE.100.ITER.20.log

[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 20 -p > SIZE.100.ITER.20.log

[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.20.SIZE.100)...
[2023-11-14 17:31:25: INFO] 
**********************************
Combo [SIZE.100.ITER.30]
**********************************
[2023-11-14 17:31:25: INFO] Searching for workspaces...
cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 30 -p > SIZE.100.ITER.30.log

[2023-11-14 17:31:25: INFO] New cmd = $(LAUNCHER) /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/LULESH/build/lulesh2.0 -s 100 -i 30 -p > SIZE.100.ITER.30.log

[2023-11-14 17:31:25: INFO] Processing regular dependencies.
[2023-11-14 17:31:25: INFO] Adding edge (make-lulesh, run-lulesh_ITER.30.SIZE.100)...
[2023-11-14 17:31:25: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/meta
[2023-11-14 17:31:25: INFO] Directory does not exist. Creating directories to /workflow/study/studies/lulesh/lulesh_sample1_20231114-173124/meta/study
[2023-11-14 17:31:25: INFO] Checking DAG status at 2023-11-14 17:31:25.588449

# Here is where things start to get weird - at first we have a flux_uri, but then not?
[2023-11-14 17:31:25: INFO] Found FLUX_URI in environment, scheduling jobs to broker uri local:///mnt/flux/view/run/flux/local
[2023-11-14 17:31:25: INFO] No FLUX_URI; scheduling standalone batch job to root instance
Traceback (most recent call last):
  File "/usr/local/bin/maestro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/maestro.py", line 507, in main
    rc = args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/maestro.py", line 341, in run_study
    completion_status = conductor.monitor_study()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/conductor.py", line 352, in monitor_study
    completion_status = dag.execute_ready_steps()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/datastructures/core/executiongraph.py", line 734, in execute_ready_steps
    adapter = adapter(**self._adapter)
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/interfaces/script/fluxscriptadapter.py", line 116, in __init__
    self._broker_version = self._interface.get_flux_version()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/abstracts/interfaces/flux.py", line 75, in get_flux_version
    cls.connect_to_flux()
  File "/usr/local/lib/python3.10/dist-packages/maestrowf/abstracts/interfaces/flux.py", line 22, in connect_to_flux
    cls.flux_handle = flux.Flux()
NameError: name 'flux' is not defined
broker.err[0]: rc2.0: maestro run -fg ./lulesh-flux.yaml -y Exited (rc=1) 1.8s
broker.info[0]: rc2-fail: run->cleanup 1.75864s
broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.1s
broker.info[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.1s
broker.info[0]: cleanup.2: flux queue idle --quiet Exited (rc=0) 0.1s
broker.info[0]: cleanup-success: cleanup->shutdown 0.278262s
broker.info[0]: children-complete: shutdown->finalize 95.1042ms
broker.info[0]: rc3.0: /opt/software/linux-ubuntu20.04-x86_64/gcc-9.4.0/flux-core-0.54.0-dxw2ljadlubcgovsrnvijkrkywlgk2ex/etc/flux/rc3 Exited (rc=0) 0.2s
broker.info[0]: rc3-success: finalize->goodbye 0.17189s
broker.info[0]: goodbye: goodbye->exit 0.046488ms

Can we talk through the steps / flow of logic between the first message (FLUX_URI found) and the second (no FLUX_URI)? If the second step can't import flux because the PYTHONPATH isn't being passed forward, that might be the bug.
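One quick way to test the PYTHONPATH theory would be to run a small check under the same `flux start` invocation, in place of `maestro run`. The following is a minimal sketch using only the standard library; the function name `flux_env_report` is hypothetical and not part of maestrowf or flux. It reports the two variables in question and whether a top-level `flux` module can be located by the interpreter that runs it.

```python
import importlib.util
import os
import sys


def flux_env_report(environ=None):
    """Collect the environment details relevant to `import flux`.

    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    environ = os.environ if environ is None else environ
    return {
        "FLUX_URI": environ.get("FLUX_URI"),
        "PYTHONPATH": environ.get("PYTHONPATH"),
        # Which interpreter is running; a mismatch with the one that
        # has the flux bindings installed would explain a failed import.
        "python": sys.executable,
        # find_spec returns None when no top-level module named "flux"
        # can be found on this interpreter's sys.path.
        "flux_importable": importlib.util.find_spec("flux") is not None,
    }


if __name__ == "__main__":
    for key, value in flux_env_report().items():
        print(f"{key}: {value}")
```

If `flux_importable` comes back `False` while `FLUX_URI` is set, that would point at the Python path rather than the broker connection. A `NameError` (instead of an `ImportError`) would also be consistent with the import being wrapped in a `try`/`except` that swallows the failure, though that is an assumption about maestrowf here, not something confirmed from its source.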

For context, here is the workflow I'm running:

description:
    name: lulesh_sample1
    description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations on FLUX.

env:
    variables:
        OUTPUT_PATH: ./studies/lulesh

    labels:
        outfile: $(SIZE.label).$(ITERATIONS.label).log

    dependencies:
      git:
        - name: LULESH
          path: $(OUTPUT_PATH)
          url: https://github.com/LLNL/LULESH.git

batch:
    type        : flux

study:
    - name: make-lulesh
      description: Build the MPI enabled version of LULESH.
      run:
          cmd: |
            cd $(LULESH)
            mkdir build
            cd build
            cmake -WITH_MPI=Off -WITH_OPENMP=Off ..
            make
          depends: []

    - name: run-lulesh
      description: Run LULESH.
      run:
          cmd: |
            $(LAUNCHER) $(LULESH)/build/lulesh2.0 -s $(SIZE) -i $(ITERATIONS) -p > $(outfile)
          depends: [make-lulesh]
          nodes: 1
          procs: 1
          cores per task: 1
          nested: True
          priority: high
          walltime: "00:60:00"

# Note that I reduced these sizes for a single container run
global.parameters:
    SIZE:
        values  : [100, 100, 100]
        # values  : [100, 100, 100, 200, 200, 200, 300, 300, 300]
        label   : SIZE.%%
    ITERATIONS:
        # values  : [10, 20, 30, 10, 20, 30, 10, 20, 30]
        values  : [10, 20, 30]
        label   : ITER.%%
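For anyone reading the logs against this spec: the three `Combo [...]` entries above suggest the parameter lists are paired by index rather than crossed, with `%%` in each label replaced by the value. The sketch below reproduces that naming under this assumption; `expand_parameters` is a hypothetical helper written for illustration, not Maestro's implementation.

```python
def expand_parameters(params):
    """Pair parameter values by index and build one label per combination."""
    names = list(params)
    value_lists = [params[name]["values"] for name in names]
    combos = []
    # zip pairs the i-th value of every parameter (no cross product).
    for values in zip(*value_lists):
        labels = [
            params[name]["label"].replace("%%", str(value))
            for name, value in zip(names, values)
        ]
        combos.append(".".join(labels))
    return combos


# Values copied from the global.parameters block of the workflow.
params = {
    "SIZE": {"values": [100, 100, 100], "label": "SIZE.%%"},
    "ITERATIONS": {"values": [10, 20, 30], "label": "ITER.%%"},
}

print(expand_parameters(params))
# ['SIZE.100.ITER.10', 'SIZE.100.ITER.20', 'SIZE.100.ITER.30']
```

These match the combo names in the log, and appending `.log` gives the `outfile` names such as `SIZE.100.ITER.10.log`.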
