
Conversation

@lukeyeager
Member

With Ubuntu 24.04, attempting to install the tzdata package inside an enroot container leads to the following error:

```
mv: cannot move '/etc/localtime.dpkg-new' to '/etc/localtime': Device or resource busy
```

The package's postinst script only skips setting the timezone if /etc/localtime already exists as a symlink. So, in order to set the timezone inside the container to match the host OS, we must copy both the symlink and the symlink's target into the container rootfs.

See https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/tzdata/tree/debian/tzdata.postinst?h=debian/2024a-1ubuntu1
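For illustration, here is a minimal sketch of that approach. This is not the actual 10-localtime.sh; it assumes `ENROOT_ROOTFS` is the rootfs path exposed to enroot hooks and that the host's /etc/localtime symlink uses an absolute target:

```bash
#!/bin/bash
# Hypothetical sketch, not the real hook: copy the host's /etc/localtime
# symlink *and* the file it points to into the container rootfs, so that
# tzdata's postinst sees a symlink and skips reconfiguring the timezone.
set -eu

if [ -L /etc/localtime ]; then
    target=$(readlink /etc/localtime)   # e.g. /usr/share/zoneinfo/US/Central
    mkdir -p "${ENROOT_ROOTFS}$(dirname "${target}")"
    # Copy the file contents to the target path inside the rootfs...
    cp --dereference /etc/localtime "${ENROOT_ROOTFS}${target}"
    # ...then copy the symlink itself. Note: without --force, this cp
    # fails if the destination already exists (relevant further below).
    cp --no-dereference /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"
fi
```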

@lukeyeager
Member Author

Tested on Ubuntu 20.04, 22.04, and 24.04; CentOS 7 and 8; and Alpine 3.19.

Ubuntu has the annoying habit of canonicalizing the symlink when installing tzdata: when the container starts up, /etc/localtime will be whatever was on your host (e.g. /usr/share/zoneinfo/US/Central), but then tzdata's postinst script (linked above) will rewrite that to /usr/share/zoneinfo/America/Chicago. And if you restart the container, this hook will set it back to US/Central. That feels a little janky, but it seems to work.
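The back-and-forth, illustrated (the paths are the example ones above, not captured output):

```bash
readlink /etc/localtime      # right after container start (hook has run):
                             # -> /usr/share/zoneinfo/US/Central

apt-get install -y tzdata    # postinst canonicalizes the link
readlink /etc/localtime      # -> /usr/share/zoneinfo/America/Chicago

# Restarting the container re-runs the hook, restoring US/Central.
```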

@lukeyeager lukeyeager requested a review from 3XX0 April 1, 2024 18:08
@3XX0 3XX0 merged commit f0552fb into NVIDIA:master Apr 22, 2024
@krono
Contributor

krono commented Nov 21, 2024

Hi, sorry to re-touch this issue, but it seems that this hook might introduce a race condition when used with pyxis.

So a user is re-using the same container on different nodes simultaneously. Previously, the mount was fine, because it happened locally on each machine.

However, copying into the container has the nasty side effect that, since the container is on a shared file system, another node might just have done the same…

@j-hellenberg

To add some more context here:

I'm running an sbatch job using a common pre-set-up container, like

```bash
#!/bin/bash
#SBATCH --container-name ubuntu-2310
#SBATCH --array=0-100%10
# (more #SBATCH directives)

python myscript.py
```

so Slurm will schedule executions on multiple nodes in parallel.

The specific error I'm experiencing is

```
slurmstepd-cXXX: error: pyxis: container start failed with error code: 1
slurmstepd-cXXX: error: pyxis: printing enroot log file:
slurmstepd-cXXX: error: pyxis:     cp: cannot create symbolic link '/<network_share>/.local/share/enroot/pyxis_2310/etc/localtime': File exists
slurmstepd-cXXX: error: pyxis:     [ERROR] /etc/enroot/hooks.d/10-localtime.sh exited with return code 1
slurmstepd-cXXX: error: pyxis: couldn't start container
slurmstepd-cXXX: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd-cXXX: error: Failed to invoke spank plugin stack
```

However, this error only appears, seemingly at random, in about 1/3 of executions. Our guess is that the lines of 10-localtime.sh are executed multiple times in parallel and interleaved across nodes, and depending on the order, some of them succeed and some don't.
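The failure mode is easy to demonstrate with plain cp, independent of enroot (hypothetical file names):

```bash
ln -s /usr/share/zoneinfo/UTC src_link
cp --no-dereference src_link dest   # first "node": succeeds, dest is created
cp --no-dereference src_link dest   # second "node" fails:
# cp: cannot create symbolic link 'dest': File exists
```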

@lukeyeager
Member Author

Whoops! If you're able to reliably reproduce it, can you try simply adding --force to the cp commands in the hook?
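Assuming the hook's copy looks roughly like the sketch in the description, that change would be (hypothetical line, not the exact hook contents):

```bash
# --force unlinks an existing destination before creating the symlink,
# so a racing second copy overwrites instead of erroring out.
cp --no-dereference --force /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"
```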

@lukeyeager
Member Author

I'll let Jon comment on whether we recommend running the same container rootfs across multiple machines (I believe not), but that cp --force thing might unblock you for now?

@j-hellenberg

> Whoops! If you're able to reliably reproduce it, can you try simply adding --force to the cp commands in the hook?

We just tried your suggestion on one of our machines, and that machine was then the only one no longer experiencing errors, so we think that should do the trick 👍

> I'll let Jon comment on whether we recommend running the same container rootfs across multiple machines (I believe not), but that cp --force thing might unblock you for now?

I agree that this approach looks a bit suspicious, and I would not do it this way in any kind of serious production usage. In this case, we are talking about research work, though, and I like the convenience of having a manual change I make in a container immediately affect all future job executions.

In any case, I believe adding the --force flag should increase the resilience of the system without negative side effects, so doing so probably makes sense even though it only matters in this (maybe not fully supported) edge case. I can open a small PR for that if you want.

@3XX0
Member

3XX0 commented Nov 23, 2024

Yeah, we don't usually recommend this since we don't test it.
Having said that, we do flock the rootfs:

```bash
flock -w 30 "${_lock}" > /dev/null 2>&1 || common::err "Could not acquire rootfs lock"
```

So maybe your shared filesystem doesn't support it, or it is not properly configured.
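One quick way to sanity-check that (hypothetical lock path; run the two commands on two different nodes against the shared filesystem):

```bash
# Node 1: take the lock on the shared filesystem and hold it for a minute.
flock -w 30 /gpfs/scratch/test.lock -c 'sleep 60'

# Node 2, while node 1 holds it: should block for 5 seconds, then time out.
flock -w 5 /gpfs/scratch/test.lock -c 'true' || echo "lock enforced across nodes"
```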

@krono
Contributor

krono commented Nov 23, 2024

It's GPFS, and it does support locking.
I'll run a test and report back.

@lukeyeager lukeyeager deleted the localtime-hook branch November 24, 2024 14:26
@krono
Contributor

krono commented Nov 25, 2024

I fail to understand what's happening here.
I'm opening a new issue.
