
Conversation

@lukeyeager
Member

With Ubuntu 24.04, attempting to install the tzdata package inside an enroot container leads to the following error:

```
mv: cannot move '/etc/localtime.dpkg-new' to '/etc/localtime': Device or resource busy
```

The package's postinst script only skips setting the timezone if /etc/localtime already exists as a symlink. So, in order to set the timezone inside the container to match the host OS, we must copy both the symlink and the symlink's target into the container rootfs.

See https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/tzdata/tree/debian/tzdata.postinst?h=debian/2024a-1ubuntu1
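For illustration, here is a minimal sketch of that approach. This is not the actual 10-localtime.sh; it assumes `ENROOT_ROOTFS` is the rootfs path exposed to enroot hooks and that the host's /etc/localtime symlink uses an absolute target:

```bash
#!/bin/bash
# Hypothetical sketch, not the real hook: copy the host's /etc/localtime
# symlink *and* the file it points to into the container rootfs, so that
# tzdata's postinst sees a symlink and skips reconfiguring the timezone.
set -eu

if [ -L /etc/localtime ]; then
    target=$(readlink /etc/localtime)   # e.g. /usr/share/zoneinfo/US/Central
    mkdir -p "${ENROOT_ROOTFS}$(dirname "${target}")"
    # Copy the file contents to the target path inside the rootfs...
    cp --dereference /etc/localtime "${ENROOT_ROOTFS}${target}"
    # ...then copy the symlink itself. Note: without --force, this cp
    # fails if the destination already exists (relevant further below).
    cp --no-dereference /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"
fi
```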

@lukeyeager
Member Author

Tested on Ubuntu 20.04, 22.04, and 24.04; CentOS 7 and 8; and Alpine 3.19.

Ubuntu has the annoying habit of canonicalizing the symlink when installing tzdata: when the container starts up, /etc/localtime will be whatever was on your host (e.g. /usr/share/zoneinfo/US/Central), but then tzdata's postinst script (linked above) will rewrite that to /usr/share/zoneinfo/America/Chicago. And if you restart the container, this hook will set it back to US/Central. That feels a little janky, but it seems to work.
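The back-and-forth, illustrated (the paths are the example ones above, not captured output):

```bash
readlink /etc/localtime      # right after container start (hook has run):
                             # -> /usr/share/zoneinfo/US/Central

apt-get install -y tzdata    # postinst canonicalizes the link
readlink /etc/localtime      # -> /usr/share/zoneinfo/America/Chicago

# Restarting the container re-runs the hook, restoring US/Central.
```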

@lukeyeager lukeyeager requested a review from 3XX0 April 1, 2024 18:08
@3XX0 3XX0 merged commit f0552fb into NVIDIA:master Apr 22, 2024
@krono
Contributor

krono commented Nov 21, 2024

Hi, sorry to re-touch this issue, but it seems that this hook might introduce a race condition when used with pyxis.

So a user is re-using the same container on different nodes simultaneously. Previously, the mount was fine, because it happened locally on each machine.

However, copying into the container has the nasty side effect that, since the container is on a shared file system, another node might just have done the same…

@j-hellenberg

To add some more context here:

I'm running an sbatch job using a common pre-set-up container, like

```bash
#!/bin/bash
#SBATCH --container-name ubuntu-2310
#SBATCH --array=0-100%10
# (more #SBATCH directives)

python myscript.py
```

so Slurm will schedule executions on multiple nodes in parallel.

The specific error I'm experiencing is

```
slurmstepd-cXXX: error: pyxis: container start failed with error code: 1
slurmstepd-cXXX: error: pyxis: printing enroot log file:
slurmstepd-cXXX: error: pyxis:     cp: cannot create symbolic link '/<network_share>/.local/share/enroot/pyxis_2310/etc/localtime': File exists
slurmstepd-cXXX: error: pyxis:     [ERROR] /etc/enroot/hooks.d/10-localtime.sh exited with return code 1
slurmstepd-cXXX: error: pyxis: couldn't start container
slurmstepd-cXXX: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd-cXXX: error: Failed to invoke spank plugin stack
```

However, this error only appears, seemingly at random, in about 1/3 of executions. Our guess is that the lines of 10-localtime.sh are executed multiple times in parallel and interleaved across nodes, and depending on the order, some of them succeed and some don't.
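The failure mode is easy to demonstrate with plain cp, independent of enroot (hypothetical file names):

```bash
ln -s /usr/share/zoneinfo/UTC src_link
cp --no-dereference src_link dest   # first "node": succeeds, dest is created
cp --no-dereference src_link dest   # second "node" fails:
# cp: cannot create symbolic link 'dest': File exists
```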

@lukeyeager
Member Author

Whoops! If you're able to reliably reproduce it, can you try simply adding --force to the cp commands in the hook?
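Assuming the hook's copy looks roughly like the sketch in the description, that change would be (hypothetical line, not the exact hook contents):

```bash
# --force unlinks an existing destination before creating the symlink,
# so a racing second copy overwrites instead of erroring out.
cp --no-dereference --force /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"
```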

@lukeyeager
Member Author

I'll let Jon comment on whether we recommend running the same container rootfs across multiple machines (I believe not), but that cp --force thing might unblock you for now?

@j-hellenberg

> Whoops! If you're able to reliably reproduce it, can you try simply adding --force to the cp commands in the hook?

We just tried your suggestion on one of our machines, and that machine was then the only one no longer experiencing errors, so we think that should do the trick 👍

> I'll let Jon comment on whether we recommend running the same container rootfs across multiple machines (I believe not), but that cp --force thing might unblock you for now?

I agree that this approach looks a bit suspicious, and I would not do it this way in any kind of serious production usage. In this case, we are talking about research work, though, and I like the convenience of having a manual change I make in a container immediately affect all future job executions.

In any case, I believe adding the --force flag should increase the resilience of the system without negative side effects, so doing so probably makes sense even though it only matters in this (maybe not fully supported) edge case. I can open a small PR for that if you want.

@3XX0
Member

3XX0 commented Nov 23, 2024

Yeah, we don't usually recommend this since we don't test it.
Having said that, we do flock the rootfs:

```bash
flock -w 30 "${_lock}" > /dev/null 2>&1 || common::err "Could not acquire rootfs lock"
```

So maybe your shared filesystem doesn't support it, or it is not properly configured.
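One quick way to sanity-check that (hypothetical lock path; run the two commands on two different nodes against the shared filesystem):

```bash
# Node 1: take the lock on the shared filesystem and hold it for a minute.
flock -w 30 /gpfs/scratch/test.lock -c 'sleep 60'

# Node 2, while node 1 holds it: should block for 5 seconds, then time out.
flock -w 5 /gpfs/scratch/test.lock -c 'true' || echo "lock enforced across nodes"
```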

@krono
Contributor

krono commented Nov 23, 2024

It's GPFS, and it does support locking.
I'll run a test and report back.

@lukeyeager lukeyeager deleted the localtime-hook branch November 24, 2024 14:26
@krono
Contributor

krono commented Nov 25, 2024

I fail to understand what's happening here.
I'm opening a new issue.
