Remove dependency on the runner's volume #244

Conversation
austinpray-mixpanel left a comment
Before this gets too far: can you address the concerns with this approach I raised over in #160 (comment) ?
zarko-a left a comment

I'll take a stab at responding, as I'd really like to get this feature out as soon as possible :) Cloning a volume via your cloud provider's API, then mounting it inside K8s, is FAR more complicated than doing a simple copy via the exec API. My understanding is that the runner copies only the job "spec" (for lack of a better word), and maybe … Most importantly, runner container hooks are written to be pretty generic and not to prefer one cloud provider over another.
austinpray-mixpanel left a comment

Hey @zarko-a! Yeah, thanks for braining this out with me. My main concern was … To expand on that:
👉 So at minimum I would expect this implementation to expect these execs to fail or be interrupted. Heavily integrate backoff retries or something like that.
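A minimal sketch of the kind of retry wrapper this implies, in the hooks' TypeScript; the helper name, attempt count, and delays are illustrative, not taken from this PR:

```ts
// Retry an async operation with exponential backoff plus jitter.
// Illustrative helper only; not the PR's actual implementation.
async function withBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      // Backoff schedule: 0.5s, 1s, 2s, ... plus up to 100ms of jitter,
      // so interrupted execs are retried rather than failing the job outright.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100
      await new Promise(resolve => setTimeout(resolve, delay))
    }
  }
  throw lastError
}
```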
Well yeah, if there was an ADR out for cloud-specific providers, my team would for sure contribute a GCP one in short order.
anlesk left a comment

Adding my 5 cents here. We have altered the hooks code on our end and implemented copy via the k8s API, similar to how it was originally proposed in #160. We had to implement a retry mechanism, and uptime is still not 100%, as sometimes the copy fails even after a number of retries, with or without a sleep/wait incorporated between the attempts. We see the retry kick in in roughly 20% of the executions. The copy of the …
Hey @anlesk, Thank you for your feedback! That is one of the reasons I wanted to use an init container to copy runner assets, so we can copy only the temporary directory. The …
Quick update: retries are added. We are trying to minimize the number of files being copied by leveraging the init container as much as possible. It is challenging with the container action, where the workspace should be mounted and we don't know in advance what the workspace will look like at the time the container step is invoked. But for most use cases, a single workspace copy at the time prepare-job is called should be okay, and each …
ukhl left a comment

I was discussing this issue with someone and they mentioned it sounds similar to a problem AWS is looking to solve for their mountpoint-s3-csi-driver. This solution wouldn't get rid of the requirement that the runner and workflow pods be on the same node, but it does provide a potential solution for setting requests and limits on workflow pods and for ensuring there is enough space on a node for them. I think everyone would prefer the flexibility of workflow pods going to whatever node has room for them, but if the issues people are raising about exec being too flaky end up holding true, or the spin-up time of these pods is heavily impacted by the copy solution, maybe this "headroom" pod solution is a decent alternative.
nikola-jokic left a comment

Hey @ukhl, There is no way we can afford to build on top of a cloud-specific solution. However, the intention of this repo is to provide a solution that should work in most cases and allow you to customize/modify the implementation to suit your needs. The copy is needed because there are many environments where read-write-many volumes don't exist. If we had the luxury of relying on read-write-many volumes, it would avoid all the headaches of scheduling the workflow pod on the same node as the runner: we would simply rely on the shared filesystem maintained by a driver, allowing workflow pods to be scheduled on nodes with enough capacity to handle them. Since copy using exec is supported by Kubernetes itself, and many people have faced issues using the volume, we are trying our best to remove this dependency. I would personally love to avoid it, since we now have to make sure file permissions are properly set, that binaries are available on arbitrary job containers in order to execute the action, that user volume mounts are applied to the correct places, etc. But we can't. That is why using the exec API with retries, transferring the least amount of files possible, does seem to be the best approach. We truly appreciate the feedback, though! It is amazing to see this many people coming up with such helpful and thoughtful feedback and suggestions!
ukhl left a comment

Hi @nikola-jokic, I think you might have jumped to conclusions based on the repository I linked to. The doc I linked to isn't cloud-proprietary; it's a method of getting multiple pods to schedule together via pod affinity.
The idea here would be to schedule the runner pod and a dummy workflow pod at the same time. This pod would then be replaced with a real workflow pod if/when required. This approach would be less efficient on resources, but only if you don't use workflow/step pods that often. It would solve the problem of trying to schedule a workflow/step pod on a node and there not being enough resources for it.
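For illustration, such a "headroom" pod could be pinned next to the runner with pod affinity. A minimal sketch using @kubernetes/client-node types; the runner label, image, and resource sizes here are assumptions, not from this thread:

```ts
import * as k8s from '@kubernetes/client-node'

// Hypothetical placeholder pod: a pause container that reserves
// workflow-pod-sized capacity on the runner's node until a real
// workflow pod needs the room.
const headroomPod: k8s.V1Pod = {
  metadata: { name: 'workflow-headroom' },
  spec: {
    affinity: {
      podAffinity: {
        requiredDuringSchedulingIgnoredDuringExecution: [
          {
            labelSelector: { matchLabels: { app: 'github-runner' } }, // assumed runner label
            topologyKey: 'kubernetes.io/hostname' // co-schedule on the runner's node
          }
        ]
      }
    },
    containers: [
      {
        name: 'headroom',
        image: 'registry.k8s.io/pause:3.9', // does nothing; just holds the reservation
        resources: { requests: { cpu: '1', memory: '2Gi' } } // sized like a workflow pod
      }
    ]
  }
}
```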
Pull Request Overview
This PR removes the dependency on Kubernetes persistent volumes by implementing a local filesystem approach for sharing data between the runner and pods. The changes simplify the architecture by using emptyDir volumes and file copying operations instead of persistent volumes.
- Replaces persistent volume mounts with emptyDir volumes and file copying operations
- Adds new functions for copying files to/from pods using tar streams (see the sketch after this list)
- Refactors volume mounting logic to use container-specific volumes instead of shared work volumes
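As a rough sketch of the tar-over-exec pattern (the same approach `kubectl cp` uses), assuming @kubernetes/client-node and the tar-fs package this PR adds; the function name, parameters, and error handling are illustrative, not the PR's exact code:

```ts
import * as k8s from '@kubernetes/client-node'
import * as tar from 'tar-fs'

// Stream a local directory into a pod as a tar archive over the exec API,
// unpacking it on the remote side. Illustrative sketch only.
async function copyDirToPod(
  exec: k8s.Exec,
  namespace: string,
  podName: string,
  containerName: string,
  localDir: string,
  remoteDir: string
): Promise<void> {
  const tarStream = tar.pack(localDir) // packs localDir into a readable tar stream
  await new Promise<void>((resolve, reject) => {
    exec.exec(
      namespace,
      podName,
      containerName,
      ['tar', 'xf', '-', '-C', remoteDir], // remote side unpacks from stdin
      null, // stdout
      null, // stderr
      tarStream, // stdin: the packed local directory
      false, // no tty
      status =>
        status.status === 'Success'
          ? resolve()
          : reject(new Error(status.message ?? 'copy failed'))
    )
  })
}
```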
Reviewed Changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/k8s/tests/test-setup.ts | Removes volume creation logic and adds local directory setup for testing |
| packages/k8s/tests/run-container-step-test.ts | Updates tests to use prepareJob before running container steps |
| packages/k8s/tests/prepare-job-test.ts | Removes volume mount validation tests and updates user volume mount tests |
| packages/k8s/tests/k8s-utils-test.ts | Removes containerVolumes tests and renames writeEntryPointScript to writeRunScript |
| packages/k8s/src/k8s/utils.ts | Replaces volume mount logic with script generation for file operations |
| packages/k8s/src/k8s/index.ts | Adds file copying functions and removes persistent volume creation |
| packages/k8s/src/index.ts | Updates runScriptStep call signature |
| packages/k8s/src/hooks/run-script-step.ts | Adds file copying operations before and after script execution |
| packages/k8s/src/hooks/run-container-step.ts | Refactors to use pod-based execution with file copying |
| packages/k8s/src/hooks/prepare-job.ts | Updates job preparation to use file copying instead of volume mounts |
| packages/k8s/package.json | Adds tar-fs dependency and updates dev dependencies |
| examples/extension.yaml | Removes security context from extension example |
Copilot AI commented on Sep 24, 2025:

```ts
kc.loadFromDefault()

const k8sApi = kc.makeApiClient(k8s.CoreV1Api)
```

The k8sStorageApi variable declaration was removed but may still be referenced elsewhere in the codebase. Ensure all references to k8sStorageApi are also removed or updated.
Copilot AI commented on Sep 24, 2025:

```diff
         ),
-        targetVolumePath: '/volume_mount',
         sourceVolumePath: userVolumeMount,
+        targetVolumePath: '/__w/myvolume',
```

The targetVolumePath changed from '/volume_mount' to '/__w/myvolume', which appears to be a significant change in the volume mounting strategy. Ensure this path change is intentional and consistent with the new architecture.

Suggested change:

```diff
-        targetVolumePath: '/__w/myvolume',
+        targetVolumePath: '/volume_mount',
```
packages/k8s/src/k8s/utils.ts (outdated)

Copilot AI commented on Sep 24, 2025:

```ts
export function listDirAllCommand(dir: string): string {
  return `cd ${dir} && find . -not -path '*/_runner_hook_responses*' -printf '%b %p\n'`
}
```

The listDirAllCommand function uses string interpolation without input validation. The dir parameter should be validated or escaped to prevent command injection vulnerabilities.

Suggested change:

```diff
-  return `cd ${dir} && find . -not -path '*/_runner_hook_responses*' -printf '%b %p\n'`
+  return `cd ${shlex.quote(dir)} && find . -not -path '*/_runner_hook_responses*' -printf '%b %p\n'`
```
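Note that `shlex.quote` is a Python idiom with no built-in Node equivalent, so acting on this suggestion would mean hand-rolling the escape. A hypothetical helper along these lines (not from this PR) is one way to address the concern:

```ts
// Quote a value for safe use inside a POSIX shell command line.
// Wraps the value in single quotes and rewrites each embedded single
// quote as '\'' so the value cannot terminate the quoted argument.
function shellQuote(value: string): string {
  return `'${value.replace(/'/g, `'\\''`)}'`
}

// Usage: `cd ${shellQuote(dir)} && find . ...`
// Example: shellQuote("a'b") produces the string 'a'\''b'
```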
Copilot AI commented on Sep 24, 2025:

```ts
const child = spawn(commands[0], commands.slice(1), {
  stdio: ['ignore', 'pipe', 'ignore']
})
```

The localCalculateOutputHash function spawns a child process using the first element of the commands array without validation. This could lead to command injection if the commands parameter is not properly validated by the caller.
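One illustrative way to address this (a hypothetical guard, not the PR's code) is to check the executable against a small allowlist before spawning; note that without `shell: true`, the remaining array elements are passed as literal arguments and never shell-interpreted:

```ts
import { spawn } from 'child_process'

// Hypothetical allowlist of binaries the hash helper is expected to run.
const ALLOWED_BINARIES = new Set(['find', 'tar', 'sha256sum'])

function spawnValidated(commands: string[]) {
  if (!commands.length || !ALLOWED_BINARIES.has(commands[0])) {
    throw new Error(`refusing to spawn unexpected binary: ${commands[0]}`)
  }
  return spawn(commands[0], commands.slice(1), {
    stdio: ['ignore', 'pipe', 'ignore']
  })
}
```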
Force-pushed from d52f18b to c6748da
Force-pushed from 4602264 to df697c4
Force-pushed from 74a4507 to 1ac92ff
As part of the effort to remove the volume as a dependency, this PR intends to fully replace the volumes by using node exec cp.
This change eliminates the need for workflow pods and container steps to execute on the same node as the runner (when read-write-once volumes are used). If read-write-many volumes were used, affinities or a custom scheduler had to be configured, which made the setup more complicated and not ideal.
Because most environments need workflow pods to be able to land on different nodes, we use the exec API with retries. This will increase the duration of the workflow, but it eliminates a whole set of issues.
To reduce the amount of data being copied, only the temp dir is copied for each run step, while the whole work directory is copied during setup.
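Putting the description together, the per-hook flow might look roughly like this; it relies on the helpers sketched earlier in this thread (declared here with simplified signatures), and the paths are illustrative conventions, not taken from the PR:

```ts
// Helpers sketched earlier in this thread, declared for brevity.
declare function withBackoff<T>(op: () => Promise<T>): Promise<T>
declare function copyDirToPod(localDir: string, remoteDir: string): Promise<void>

// prepare-job: copy the whole work directory into the workflow pod once.
async function prepareJobCopy(): Promise<void> {
  await withBackoff(() => copyDirToPod('/home/runner/_work', '/__w'))
}

// run-script-step: copy only the temp directory for each step.
async function runStepCopy(): Promise<void> {
  await withBackoff(() => copyDirToPod('/home/runner/_work/_temp', '/__w/_temp'))
}
```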