MACHINE_LEARNING_DEVICE_ID is silently overwritten by gunicorn round-robin default, making dGPU selection impossible without undocumented MACHINE_LEARNING_DEVICE_IDS #29457

Description

@Soulplayer

I have searched the existing issues, both open and closed, to make sure this is not a duplicate report.

  • Yes

The bug

immich_ml/config.py reads the target device from MACHINE_LEARNING_DEVICE_ID:

```python
import os

# Method of a settings class in config.py (class body not shown)
@property
def device_id(self) -> str:
    return os.environ.get("MACHINE_LEARNING_DEVICE_ID", "0")
```

But immich_ml/gunicorn_conf.py unconditionally overwrites that variable for every worker before fork, using MACHINE_LEARNING_DEVICE_IDS which defaults to "0":

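```python
import os
from gunicorn.arbiter import Arbiter
from gunicorn.workers.base import Worker
device_ids = os.environ.get("MACHINE_LEARNING_DEVICE_IDS", "0").replace(" ", "").split(",")

# Round-robin per worker; `env` is the process environment mapping (defined elsewhere, not shown)
def pre_fork(arbiter: Arbiter, _: Worker) -> None:
    env["MACHINE_LEARNING_DEVICE_ID"] = device_ids[len(arbiter.WORKERS) % len(device_ids)]
```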
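The overwrite can be reproduced without gunicorn or a GPU. The sketch below mimics the quoted pre_fork logic; FakeArbiter and simulate_workers are hypothetical stand-ins (not Immich or gunicorn code), and a plain dict replaces os.environ. It relies on gunicorn calling pre_fork before the new worker is added to arbiter.WORKERS, so the first worker sees len(WORKERS) == 0.

```python
# Simulation of the quoted pre_fork round-robin. FakeArbiter and simulate_workers
# are hypothetical stand-ins; a plain dict is used instead of os.environ.
from dataclasses import dataclass, field


@dataclass
class FakeArbiter:
    # gunicorn's Arbiter keeps live workers in a WORKERS dict (pid -> worker)
    WORKERS: dict = field(default_factory=dict)


def simulate_workers(env: dict, num_workers: int) -> list[str]:
    """Return the MACHINE_LEARNING_DEVICE_ID each forked worker would inherit."""
    # Same parsing as the module-level line in gunicorn_conf.py
    device_ids = env.get("MACHINE_LEARNING_DEVICE_IDS", "0").replace(" ", "").split(",")
    arbiter = FakeArbiter()
    seen = []
    for pid in range(num_workers):
        # pre_fork: unconditionally overwrite the singular variable
        env["MACHINE_LEARNING_DEVICE_ID"] = device_ids[len(arbiter.WORKERS) % len(device_ids)]
        # A forked child would inherit the environment as it is at this point
        seen.append(env["MACHINE_LEARNING_DEVICE_ID"])
        arbiter.WORKERS[pid] = object()  # register the new worker after the fork
    return seen


# Only the singular variable set, as in this report: the user's "1" is lost
print(simulate_workers({"MACHINE_LEARNING_DEVICE_ID": "1"}, 1))  # ['0']
# Plural variable set instead: honoured
print(simulate_workers({"MACHINE_LEARNING_DEVICE_IDS": "1"}, 1))  # ['1']
# Two workers over two devices: round-robin
print(simulate_workers({"MACHINE_LEARNING_DEVICE_IDS": "0, 1"}, 2))  # ['0', '1']
```

The first call shows the bug in isolation: because MACHINE_LEARNING_DEVICE_IDS is absent, the default list ["0"] wins and the explicitly set value is replaced before any worker starts.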

Result: setting MACHINE_LEARNING_DEVICE_ID=1 in the container environment has no effect. Unless MACHINE_LEARNING_DEVICE_IDS is also set, every worker runs on device 0. On multi-GPU systems (e.g. an iGPU plus an Arc dGPU with OpenVINO), the workload silently lands on GPU.0 (typically the iGPU) no matter what the user configures. There is no warning; the only clue is the debug-level log line OpenVINO: Using GPU device GPU.0.

Impact

  • Users who follow the documented single-device variable get silent misconfiguration.
  • Bug reports about specific GPUs (e.g. the Battlemage reports in #23462, "ML OCR memory leaks?") may be unreliable, because setups described as "tested on my dGPU" may actually have been running on the iGPU.

Expected

Either:

  1. pre_fork should respect an explicitly set MACHINE_LEARNING_DEVICE_ID when MACHINE_LEARNING_DEVICE_IDS is not set, or
  2. the docs should clearly state that MACHINE_LEARNING_DEVICE_IDS is the only variable that works, and MACHINE_LEARNING_DEVICE_ID should be removed from user-facing documentation.
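Option 1 could look roughly like the following sketch. It assumes the change is made where gunicorn_conf.py builds device_ids, leaving pre_fork itself unchanged; resolve_device_ids is a hypothetical helper name, and the precedence order is a proposal, not current Immich behaviour or a maintainer-approved patch.

```python
import os
from collections.abc import Mapping


def resolve_device_ids(environ: Mapping[str, str]) -> list[str]:
    """Hypothetical helper: choose the device list used for round-robin assignment.

    Proposed precedence (not current Immich behaviour):
      1. MACHINE_LEARNING_DEVICE_IDS, if set and non-empty (comma-separated)
      2. MACHINE_LEARNING_DEVICE_ID, if set and non-empty (single device)
      3. "0"
    """
    plural = environ.get("MACHINE_LEARNING_DEVICE_IDS", "").replace(" ", "")
    if plural:
        return plural.split(",")
    singular = environ.get("MACHINE_LEARNING_DEVICE_ID", "").strip()
    return [singular or "0"]


# Would replace the module-level device_ids line in gunicorn_conf.py
device_ids = resolve_device_ids(os.environ)
```

With only MACHINE_LEARNING_DEVICE_ID=1 set, device_ids becomes ["1"], so the existing round-robin in pre_fork writes "1" back instead of "0". Checking the plural variable first keeps today's behaviour for anyone already relying on it.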

Verified on v3.0.0 (-openvino image): with MACHINE_LEARNING_DEVICE_ID=1 set and confirmed present via docker exec ... env, the logs still show OpenVINO: Using GPU device GPU.0; switching to MACHINE_LEARNING_DEVICE_IDS=1 immediately gives GPU.1.

The OS that Immich Server is running on

Unraid (kernel 6.18.33)

Version of Immich Server

v3.0.0

Version of Immich Mobile App

N/A (not relevant to this issue)

Platform with the issue

  • Server
  • Web
  • Mobile

Device make and model

N/A

Your docker-compose.yml content

immich-machine-learning:
  container_name: immich_machine_learning
  image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-openvino
  devices:
    - /dev/dri:/dev/dri
  volumes:
    - /mnt/user/appdata/immich/model-cache:/cache
  environment:
    MACHINE_LEARNING_DEVICE_ID: "1"   # <-- silently ignored, this is the bug
    IMMICH_LOG_LEVEL: debug
    MACHINE_LEARNING_MODEL_TTL: 300
    MACHINE_LEARNING_PRELOAD__CLIP__TEXTUAL: immich-app/ViT-L-14-quickgelu__dfn2b
    MACHINE_LEARNING_PRELOAD__CLIP__VISUAL: immich-app/ViT-L-14-quickgelu__dfn2b
    MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION__DETECTION: buffalo_l
    MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION__RECOGNITION: buffalo_l
  env_file:
    - .env
  restart: always

Your .env content

UPLOAD_LOCATION=/mnt/user/afbeeldingen-immich
DB_DATA_LOCATION=/mnt/user/appdata/immich/database
TZ=Europe/Brussels
IMMICH_VERSION=release
DB_PASSWORD=<redacted>
DB_USERNAME=<redacted>
DB_DATABASE_NAME=immich

Reproduction steps

  1. Use a system with two GPUs visible to OpenVINO (e.g. Intel iGPU + Arc dGPU), with the -openvino ML image and /dev/dri passed through.
  2. Set MACHINE_LEARNING_DEVICE_ID=1 in the ML container environment and recreate the container.
  3. Confirm the variable is present: docker exec immich_machine_learning env | grep MACHINE_LEARNING_DEVICE_ID shows =1.
  4. Set IMMICH_LOG_LEVEL=debug and trigger any ML job (Smart Search / Face Detection).
  5. Logs show OpenVINO: Using GPU device GPU.0 — the setting is ignored and the workload runs on GPU.0 (the iGPU).
  6. Replace it with MACHINE_LEARNING_DEVICE_IDS=1 (plural) and recreate: logs now correctly show GPU.1.

Relevant log output

# With MACHINE_LEARNING_DEVICE_ID=1 confirmed set in the container:
[07/02/26 19:52:09] DEBUG    OpenVINO: Using GPU device GPU.0
[07/02/26 19:52:10] DEBUG    OpenVINO: Using GPU device GPU.0

# After switching to MACHINE_LEARNING_DEVICE_IDS=1:
[07/02/26 19:54:41] DEBUG    OpenVINO: Using GPU device GPU.1
[07/02/26 19:54:42] DEBUG    OpenVINO: Using GPU device GPU.1
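To check which device a container actually used, the debug log can be scanned for the OpenVINO: Using GPU device line shown above. A small sketch; the regex assumes the exact message format in these logs and may need adjusting for other versions, and devices_used is a hypothetical helper name.

```python
import re
from collections import Counter
from collections.abc import Iterable

# Matches the debug message shown above, e.g. "OpenVINO: Using GPU device GPU.1"
DEVICE_RE = re.compile(r"OpenVINO: Using GPU device (GPU\.\d+)")


def devices_used(lines: Iterable[str]) -> Counter:
    """Count how often each OpenVINO GPU device appears in the given log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[07/02/26 19:52:09] DEBUG    OpenVINO: Using GPU device GPU.0",
    "[07/02/26 19:54:41] DEBUG    OpenVINO: Using GPU device GPU.1",
    "[07/02/26 19:54:42] DEBUG    OpenVINO: Using GPU device GPU.1",
]
print(devices_used(sample))  # Counter({'GPU.1': 2, 'GPU.0': 1})
```

In practice the lines could come from the output of docker logs immich_machine_learning (with IMMICH_LOG_LEVEL=debug); any GPU.0 hits after setting a non-zero device indicate the setting was not applied.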

Additional information

Root cause: immich_ml/config.py reads the device from MACHINE_LEARNING_DEVICE_ID:

```python
@property
def device_id(self) -> str:
    return os.environ.get("MACHINE_LEARNING_DEVICE_ID", "0")
```

but immich_ml/gunicorn_conf.py unconditionally overwrites that variable for every worker in pre_fork, using MACHINE_LEARNING_DEVICE_IDS, which defaults to "0":

```python
device_ids = os.environ.get("MACHINE_LEARNING_DEVICE_IDS", "0").replace(" ", "").split(",")

def pre_fork(arbiter: Arbiter, _: Worker) -> None:
    env["MACHINE_LEARNING_DEVICE_ID"] = device_ids[len(arbiter.WORKERS) % len(device_ids)]
```

So an explicitly set MACHINE_LEARNING_DEVICE_ID never reaches the worker. On multi-GPU systems the workload silently lands on GPU.0 (typically the iGPU) with no warning.

Impact beyond misconfiguration: hardware-specific bug reports become unreliable — users who believe they tested on their dGPU may actually have been running on their iGPU the whole time (this affected my own testing in #23462).

Suggested fix: have pre_fork respect an explicitly set MACHINE_LEARNING_DEVICE_ID when MACHINE_LEARNING_DEVICE_IDS is unset — or document that only MACHINE_LEARNING_DEVICE_IDS works and remove the singular variant from the docs.

Metadata


Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    Status
    To triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
