jobs stuck in PENDING state but they've actually completed or are currently running #441

@BenWibking

I've run into a confusing situation: the conductor is still running as a background process and my jobs have all completed successfully, but running maestro status still lists the jobs that were run via SLURM as "PENDING".

Is there any way to figure out what has gone wrong, or otherwise reset the conductor process?

Running ps aux | grep $USER, I see:

login4.stampede3(626)$ ps aux | grep $USER
bwibking 1549018  0.0  0.0  20820 11936 ?        Ss   Apr30   0:09 /usr/lib/systemd/systemd --user
bwibking 1549019  0.0  0.0 202568  6848 ?        S    Apr30   0:00 (sd-pam)
bwibking 1725176  0.0  0.0   8436  3692 ?        S    Apr30   0:00 /bin/sh -c nohup conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240430-160408 > /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240430-160408/medres_compressive.txt 2>&1
bwibking 1725177  0.0  0.0 325552 67876 ?        S    Apr30   0:32 /scratch/projects/compilers/intel24.0/oneapi/intelpython/python3.9/bin/python3.9 /home1/02661/bwibking/.local/bin/conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240430-160408
root     3152422  0.0  0.0  39932 11976 ?        Ss   13:31   0:00 sshd: bwibking [priv]
bwibking 3152677  0.0  0.0  40240  7472 ?        S    13:31   0:00 sshd: bwibking@pts/122
bwibking 3152678  0.0  0.0  17920  6116 pts/122  Ss   13:31   0:00 -bash
root     3163591  0.0  0.0  39932 12044 ?        Ss   13:43   0:00 sshd: bwibking [priv]
bwibking 3163629  0.0  0.0  40076  7236 ?        S    13:43   0:00 sshd: bwibking@notty
root     3168472  0.0  0.0  39932 12020 ?        Ss   13:48   0:00 sshd: bwibking [priv]
bwibking 3168505  0.0  0.0  40080  7208 ?        S    13:48   0:00 sshd: bwibking@notty
bwibking 3174162  0.0  0.0  20408  4220 pts/122  R+   13:54   0:00 ps aux
bwibking 3174163  0.0  0.0   7592  2836 pts/122  S+   13:54   0:00 grep --color=auto bwibking
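
For anyone checking the same thing, a narrower filter than grep $USER makes the conductor processes easier to spot. The following is only a minimal sketch and not part of Maestro: the function name find_conductor_processes is hypothetical, and it assumes the standard 11-column ps aux layout with the full command in the last column. The sample lines are shortened copies of the output above, with the long paths elided as "...".

```python
def find_conductor_processes(ps_output, user):
    """Return (pid, command) pairs for `user` whose command runs conductor.

    Assumes `ps aux` formatting: 11 whitespace-separated columns, where the
    last column (COMMAND) may itself contain spaces.
    """
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user:
            continue
        command = fields[10]
        # Match "conductor" as a bare word or as the basename of a path,
        # so unrelated commands (grep, sshd, bash) are not picked up.
        if any(tok == "conductor" or tok.endswith("/conductor")
               for tok in command.split()):
            matches.append((int(fields[1]), command))
    return matches


# Shortened sample lines in the same layout as the output above.
sample = """\
bwibking 1725176  0.0  0.0   8436  3692 ?        S    Apr30   0:00 /bin/sh -c nohup conductor -t 60 -d 2 ...
bwibking 1725177  0.0  0.0 325552 67876 ?        S    Apr30   0:32 .../python3.9 .../.local/bin/conductor -t 60 -d 2 ...
root     3152422  0.0  0.0  39932 11976 ?        Ss   13:31   0:00 sshd: bwibking [priv]
bwibking 3174163  0.0  0.0   7592  2836 pts/122  S+   13:54   0:00 grep --color=auto bwibking
"""

for pid, command in find_conductor_processes(sample, "bwibking"):
    print(pid, command)
```

To run it against live output instead of the sample, the text can come from subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout. Note that ps aux may abbreviate user names longer than eight characters, in which case the user comparison would need adjusting.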

Partial output from maestro status:

run-sim_amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324                   3  run-sim/amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:52  --                                   0
run-sim_amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822                   3  run-sim/amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:54  --                                   0
run-sim_amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655                  3  run-sim/amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655           PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:55  --                                   0
run-sim_amp.0.0002777777777777778.f_sol.0.0.tc_tff.7.498942093324558                   3  run-sim/amp.0.0002777777777777778.f_sol.0.0.tc_tff.7.498942093324558            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:57  --                                   0
run-sim_amp.0.00044444444444444447.f_sol.0.0.tc_tff.1.1547819846894583                 3  run-sim/amp.0.00044444444444444447.f_sol.0.0.tc_tff.1.1547819846894583          PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:58  --                                   0
run-sim_amp.1.8518518518518515e-05.f_sol.0.0.tc_tff.3.651741272548377                  3  run-sim/amp.1.8518518518518515e-05.f_sol.0.0.tc_tff.3.651741272548377           PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:00  --                                   0
run-sim_amp.0.00018518518518518518.f_sol.0.0.tc_tff.2.0535250264571463                 3  run-sim/amp.0.00018518518518518518.f_sol.0.0.tc_tff.2.0535250264571463          PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:02  --                                   0
run-sim_amp.0.0003518518518518519.f_sol.0.0.tc_tff.6.493816315762113                   3  run-sim/amp.0.0003518518518518519.f_sol.0.0.tc_tff.6.493816315762113            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:04  --                                   0
run-sim_amp.7.407407407407406e-05.f_sol.0.0.tc_tff.1.539926526059492                   3  run-sim/amp.7.407407407407406e-05.f_sol.0.0.tc_tff.1.539926526059492            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:06  --                                   0
run-sim_amp.0.00024074074074074072.f_sol.0.0.tc_tff.4.869675251658631                  3  run-sim/amp.0.00024074074074074072.f_sol.0.0.tc_tff.4.869675251658631           PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:08  --                                   0
