
Conversation

@nikola-jokic
Collaborator

@nikola-jokic nikola-jokic commented Aug 15, 2025

As part of the effort to remove the volume as a dependency, this PR intends to fully replace the volumes by copying files over the exec API (a kubectl cp-style copy).
This change eliminates the need for workflow pods and container steps to execute on the same node as the runner (the case when ReadWriteOnce volumes are used). When ReadWriteMany volumes are used instead, affinities or a custom scheduler had to be configured, which was more complicated and not ideal.

Because in most environments the workflow pods need to be able to land on different nodes, we use the exec API with retries. This will increase the duration of the workflow, but it eliminates that whole set of issues.

To reduce the amount of data being copied, only the temp dir is copied for each run step, while the whole work directory is copied during setup.
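
For readers skimming the thread, here is a minimal sketch of what an exec-based directory copy looks like, assuming @kubernetes/client-node's Exec class and the tar-fs package (which the review summary below notes is added as a dependency). The function name and arguments are illustrative, not the hook's actual API.

```ts
// Sketch: stream a local directory into a running pod over the exec API,
// the same mechanism kubectl cp uses (local -> pod). Illustrative only.
import * as k8s from '@kubernetes/client-node'
import * as tar from 'tar-fs'

export async function copyDirIntoPod(
  kc: k8s.KubeConfig,
  namespace: string,
  podName: string,
  containerName: string,
  localDir: string,
  remoteDir: string
): Promise<void> {
  const exec = new k8s.Exec(kc)
  // Pack the local directory into a tar stream and pipe it to `tar -xf -`
  // running inside the target container.
  const source = tar.pack(localDir)
  await new Promise<void>((resolve, reject) => {
    exec
      .exec(
        namespace,
        podName,
        containerName,
        ['tar', '-xf', '-', '-C', remoteDir],
        process.stdout, // remote stdout
        process.stderr, // remote stderr
        source,         // tar stream fed to the remote command's stdin
        false,          // no TTY
        status => {
          // The status callback fires when the exec session reports completion.
          if (status.status === 'Success') resolve()
          else reject(new Error(status.message ?? 'copy failed'))
        }
      )
      .catch(reject)
  })
}
```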

@nikola-jokic nikola-jokic mentioned this pull request Aug 15, 2025

@austinpray-mixpanel austinpray-mixpanel left a comment


Before this gets too far: can you address the concerns with this approach I raised over in #160 (comment) ?

@zarko-a
Contributor

zarko-a commented Aug 18, 2025

Before this gets too far: can you address the concerns with this approach I raised over in #160 (comment) ?

I'll take a stab at responding as I'd really like to get this feature out as soon as possible :)

Cloning a volume via your cloud provider's API and then mounting it inside K8S is FAR more complicated than doing a simple copy via the exec API. My understanding is that the runner copies only the job "spec" (for lack of a better word) and maybe the nodejs binary to what used to be a shared volume. Although maybe node is actually copied from the init container; I don't have the full picture of Nikola's implementation yet. In any case, the size of this is relatively small and I don't see why it shouldn't be reliable. Doing a whole PV clone for <100MB of files seems like huge overkill. Potentially heavy operations like repo cloning actually happen in the workflow pod and wouldn't be copied using the kube exec API.

Most importantly, runner container hooks are written to be pretty generic and not to prefer one cloud provider over another.
Be careful what you wish for: even if they decided to implement something like you are suggesting, GCP/GKE would likely be the last to get support for it. Both AWS and Azure are bigger, and I'm sure GH has more customers on those two clouds than on GCP.

@austinpray-mixpanel

Hey @zarko-a! Yeah, thanks for braining this out with me.

my main concern was
"I have significant doubts that this will be a stable approach. At scale we observe even trivial use cases for the exec api (like exec into a pod and check for existence of a file on a cron) to fail for all sorts of reasons."

To expand on that:

  • Anecdotally, the exec API is super flaky. We experienced lots of random connection issues and timeouts when we issued execs 100s of times per day as part of our deploy workflows. This is anecdotal on Kube 1.25-1.27, though; we removed exec API stuff from the hot path around the time 1.27 was released.
    • I'm happy to burn some $$$ if we want to stress test this, like spin up hundreds of pod pairs and use this code to copy files between the pods.
  • Logically the exec API is dependent on control plane uptime, which is not 100%
    • For instance, in GKE land the control plane has a 99.5% and 99.95% monthly uptime SLA for zonal and regional clusters respectively. Intentional control plane upgrades and other things like that could also cause API downtime, which would fail worker setup.

👉 So at minimum I would expect this implementation to expect these execs to fail or be interrupted. Heavily integrate backoff retries or something like that.
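
(For illustration: a minimal backoff wrapper around an exec-based copy could look like the sketch below. The attempt count, delays, and the `copyDirIntoPod` helper are assumptions, not anything taken from this PR.)

```ts
// Sketch: retry a flaky exec-based operation with exponential backoff.
async function withRetries<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      if (attempt < maxAttempts) {
        // Exponential backoff: 1s, 2s, 4s, 8s, ...
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)))
      }
    }
  }
  throw lastError
}

// Usage: retry the copy if the exec connection drops or times out.
// await withRetries(() => copyDirIntoPod(kc, ns, pod, container, localDir, remoteDir))
```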

GCP/GKE would likely be the last to get support for it. Both AWS and Azure are bigger, and I'm sure GH has more customers on those two clouds than on GCP.

Well, yeah, if there was an ADR out for cloud-specific providers, my team would for sure contribute a GCP one in short order.

@anlesk

anlesk commented Sep 3, 2025

Adding my 5 cents here. As we have altered the hooks code on our end and implemented the copy via the k8s API, similar to how it was originally proposed in #160, we had to implement a retry mechanism, and reliability is still not 100%: sometimes the copy fails even after a number of retries, with or without a sleep/wait between attempts.

We see the retry kick in in roughly 20% of executions.

Copying the nodejs distro still takes a decent amount of time, and the overall lag for workflow container kickstart varies between 40s and 3min on our setup.

@nikola-jokic
Collaborator Author

Hey @anlesk,

Thank you for your feedback! That is one of the reasons I wanted to use an init container to copy the runner assets, so that we only have to copy the temporary directory. The _temp dir should be much smaller in size, resulting in fewer errors, but we will still add retries to ensure it works.
During the initial setup, we copy the workdir, which has the actions downloaded. That one is slightly larger, but the node assets would still be much larger than that.

@nikola-jokic
Collaborator Author

Quick update:

Retries are added. We are trying to minimize the number of files being copied by leveraging the init container as much as possible. It is challenging with container actions, where the workspace should be mounted and we don't know in advance what the workspace will look like at the time the container step is invoked.

But for most use cases, a single workspace copy at the time prepare-job is called should be okay, and each run-step basically copies only the temp dir to the container and back to the runner. This approach may lead to fewer issues.
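
(As a rough illustration of the "copy the temp dir back to the runner" direction, assuming the same Exec/tar-fs approach sketched earlier; the names are placeholders rather than the hook's real functions.)

```ts
// Sketch: stream a directory out of a pod back to the runner,
// the reverse of the copy above. Illustrative only.
import * as k8s from '@kubernetes/client-node'
import * as tar from 'tar-fs'

export async function copyDirFromPod(
  kc: k8s.KubeConfig,
  namespace: string,
  podName: string,
  containerName: string,
  remoteDir: string,
  localDir: string
): Promise<void> {
  const exec = new k8s.Exec(kc)
  // Whatever the remote `tar -cf -` writes to stdout gets unpacked into localDir.
  const sink = tar.extract(localDir)
  await new Promise<void>((resolve, reject) => {
    sink.on('error', reject)
    exec
      .exec(
        namespace,
        podName,
        containerName,
        ['tar', '-cf', '-', '-C', remoteDir, '.'],
        sink,           // remote stdout (a tar archive) -> local extract stream
        process.stderr, // remote stderr
        null,           // no stdin
        false,          // no TTY
        status => {
          // A real implementation would also wait for `sink` to finish flushing.
          if (status.status === 'Success') resolve()
          else reject(new Error(status.message ?? 'copy failed'))
        }
      )
      .catch(reject)
  })
}
```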

@ukhl

ukhl commented Sep 5, 2025

I was discussing this issue with someone and they mentioned it sounds similar to this problem AWS is looking to solve for their mountpoint-s3-csi-driver.
https://github.com/awslabs/mountpoint-s3-csi-driver/blob/0753c0635a38b68a4433683bef53b769ad2c7b40/docs/HEADROOM_FOR_MPPOD.md

This solution wouldn't get rid of the requirement that the runner and workflow pods be on the same node, but it does provide a potential solution for setting requests and limits on workflow pods and ensuring there is enough space on a node for these pods. I think everyone would prefer the flexibility of workflow pods going to whatever node has room for them, but if the issues people are raising about exec being too flaky end up holding true, or the spin-up time of these pods is heavily impacted by the copy solution, maybe this "headroom" pod solution is a decent alternative.

@nikola-jokic
Collaborator Author

Hey @ukhl,

There is no way we can afford to build on top of a cloud-specific solution. However, the intention of this repo is to provide a solution that should work in most cases and allow you to customize/modify the implementation to suit your needs.

The copy is needed because there are many user environments where read-write-many volumes simply don't exist. If we had the luxury of relying on read-write-many volumes, we could avoid all the headaches of scheduling the workflow pod on the same node as the runner: we would simply rely on the shared filesystem maintained by a driver, allowing workflow pods to be scheduled on nodes with enough capacity to handle them.

Since copying over exec is supported by Kubernetes itself, and many people have faced issues using the volume, we are trying our best to remove this dependency. I would personally love to avoid it, since we now have to make sure the file permissions are properly set, that the binaries needed to execute the action are available in arbitrary job containers, that user volume mounts are applied to the correct places, etc. But we can't. That is why using the exec API with retries, transferring the fewest files possible, does seem to be the best approach.

We truly appreciate the feedback, though! It is amazing to see this many people coming up with such helpful and thoughtful feedback and suggestions!

@ukhl

ukhl commented Sep 6, 2025

Hi @nikola-jokic, I think you might have jumped to conclusions based on the repository I linked to. The doc I linked isn't cloud-proprietary; it's a method of getting multiple pods to schedule together via pod affinity.

The idea here would be to schedule the runner pod and a dummy workflow pod at the same time. This pod would then be replaced with a real workflow pod if/when required.

This approach would be less efficient on resources, but only if you don't use workflow/step pods that often. It would solve the problem of trying to schedule a workflow/step pod on a node and there not being enough resources for it.
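
(To make the idea concrete, a hypothetical "headroom" placeholder pod could pin itself next to the runner with pod affinity along these lines. The labels, priority class, image, and resource sizes are made-up assumptions, not something the hooks ship.)

```ts
// Sketch of a placeholder "headroom" pod that co-schedules with a runner pod
// via pod affinity and reserves room for a future workflow pod.
import * as k8s from '@kubernetes/client-node'

const headroomPod: k8s.V1Pod = {
  metadata: {
    name: 'runner-headroom',
    labels: {app: 'runner-headroom'} // hypothetical label
  },
  spec: {
    // Land on the same node as the runner pod (assumed label app=github-runner).
    affinity: {
      podAffinity: {
        requiredDuringSchedulingIgnoredDuringExecution: [
          {
            topologyKey: 'kubernetes.io/hostname',
            labelSelector: {matchLabels: {app: 'github-runner'}}
          }
        ]
      }
    },
    // Low priority so a real workflow pod can preempt/replace it when needed.
    priorityClassName: 'headroom-low-priority', // hypothetical PriorityClass
    containers: [
      {
        name: 'pause',
        image: 'registry.k8s.io/pause:3.9',
        resources: {
          requests: {cpu: '1', memory: '2Gi'},
          limits: {cpu: '1', memory: '2Gi'}
        }
      }
    ]
  }
}
```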

@nikola-jokic nikola-jokic marked this pull request as ready for review September 24, 2025 13:38
Copilot AI review requested due to automatic review settings September 24, 2025 13:38
Contributor

Copilot AI left a comment


Pull Request Overview

This PR removes the dependency on Kubernetes persistent volumes by implementing a local filesystem approach for sharing data between the runner and pods. The changes simplify the architecture by using emptyDir volumes and file copying operations instead of persistent volumes.

  • Replaces persistent volume mounts with emptyDir volumes and file copying operations
  • Adds new functions for copying files to/from pods using tar streams
  • Refactors volume mounting logic to use container-specific volumes instead of shared work volumes

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.

Summary per file:

  • packages/k8s/tests/test-setup.ts — Removes volume creation logic and adds local directory setup for testing
  • packages/k8s/tests/run-container-step-test.ts — Updates tests to use prepareJob before running container steps
  • packages/k8s/tests/prepare-job-test.ts — Removes volume mount validation tests and updates user volume mount tests
  • packages/k8s/tests/k8s-utils-test.ts — Removes containerVolumes tests and renames writeEntryPointScript to writeRunScript
  • packages/k8s/src/k8s/utils.ts — Replaces volume mount logic with script generation for file operations
  • packages/k8s/src/k8s/index.ts — Adds file copying functions and removes persistent volume creation
  • packages/k8s/src/index.ts — Updates runScriptStep call signature
  • packages/k8s/src/hooks/run-script-step.ts — Adds file copying operations before and after script execution
  • packages/k8s/src/hooks/run-container-step.ts — Refactors to use pod-based execution with file copying
  • packages/k8s/src/hooks/prepare-job.ts — Updates job preparation to use file copying instead of volume mounts
  • packages/k8s/package.json — Adds tar-fs dependency and updates dev dependencies
  • examples/extension.yaml — Removes security context from extension example



kc.loadFromDefault()

const k8sApi = kc.makeApiClient(k8s.CoreV1Api)

Copilot AI Sep 24, 2025


The k8sStorageApi variable declaration was removed but may still be referenced elsewhere in the codebase. Ensure all references to k8sStorageApi are also removed or updated.

),
targetVolumePath: '/volume_mount',
sourceVolumePath: userVolumeMount,
targetVolumePath: '/__w/myvolume',

Copilot AI Sep 24, 2025


The targetVolumePath changed from '/volume_mount' to '/__w/myvolume' which appears to be a significant change in the volume mounting strategy. Ensure this path change is intentional and consistent with the new architecture.

Suggested change
targetVolumePath: '/__w/myvolume',
targetVolumePath: '/volume_mount',

}

export function listDirAllCommand(dir: string): string {
return `cd ${dir} && find . -not -path '*/_runner_hook_responses*' -printf '%b %p\n'`

Copilot AI Sep 24, 2025


The listDirAllCommand function uses string interpolation without input validation. The dir parameter should be validated or escaped to prevent command injection vulnerabilities.

Suggested change
return `cd ${dir} && find . -not -path '*/_runner_hook_responses*' -printf '%b %p\n'`
return `cd ${shlex.quote(dir)} && find . -not -path '*/_runner_hook_responses*' -printf '%b %p\n'`

Comment on lines +334 to +336
const child = spawn(commands[0], commands.slice(1), {
stdio: ['ignore', 'pipe', 'ignore']
})

Copilot AI Sep 24, 2025


The localCalculateOutputHash function spawns a child process using the first element of the commands array without validation. This could lead to command injection if the commands parameter is not properly validated by the caller.

@nikola-jokic nikola-jokic requested a review from a team as a code owner September 25, 2025 13:05
@nikola-jokic nikola-jokic merged commit 96c35e7 into main Oct 2, 2025
5 checks passed