The purpose of this repository is to make it easy to run a development box on GKE.

Reasons for moving development into a container:

- Needing more resources (CPU/RAM/GPU) than your local machine
- Needing a different operating system/architecture in order to compile code
  - e.g. TensorFlow Federated doesn't work on M1 (tensorflow/federated#1254)
The solution consists of the following pieces (a sketch of the pod layout follows the list):

- A StatefulSet for running the container
  - We use a StatefulSet because it gives the pod a stable name
- A PVC for storing the home directory and other files
  - This ensures data isn't lost between pod restarts
  - It also means we can tear down the StatefulSet to save compute costs and restart it later
- Tailscale running in a sidecar to add the pod to your mesh so it is connectable from outside the cluster
  - This makes it easy to connect to the pod, including to Jupyter
- An SSH server running inside the main container
  - This can be used with VSCode over SSH to run VSCode on your local machine but edit/run code inside the container
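A minimal sketch of how these pieces might fit together in a StatefulSet spec; the names, image, and storage size below are illustrative and not the exact manifests in this repo:

```yaml
# Illustrative sketch only; see the manifests in this repo for the real spec.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: devbox
spec:
  serviceName: devbox
  replicas: 1
  selector:
    matchLabels:
      app: devbox
  template:
    metadata:
      labels:
        app: devbox
    spec:
      containers:
        - name: devbox                              # main container: Jupyter + sshd on port 2222
          image: gcr.io/my-project/devbox:latest    # hypothetical image name
        - name: tailscale                           # sidecar that joins the pod to your tailnet
          image: tailscale/tailscale:latest
  volumeClaimTemplates:
    - metadata:
        name: storage                               # durable home directory (mounted at /storage)
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```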
An SSH key is needed for two purposes:

- Connecting to GitHub from the container to push/pull code
  - The private key needs to be stored in the container
- Connecting to the container via SSH from your local machine (e.g. for VSCode)
  - The private key needs to be stored on your local machine
We can use the same key for both. We use a K8s secret to make the key available to the pod. We also use a secret to make the SSH authorized keys available to the pod (see below).
Generate an SSH key:

```bash
ssh-keygen -t ed25519 -C "[email protected]"
```

Save the key to `${HOME}/.ssh/devbox`.
Don't set a passphrase.
Add the public key to your GitHub SSH keys.

Create a K8s secret containing the key pair:

```bash
kubectl create secret generic ${USER}-ssh \
  --from-file=id_ed25519=${HOME}/.ssh/devbox \
  --from-file=id_ed25519.pub=${HOME}/.ssh/devbox.pub
```
We mount the SSH keys into `/secrets` rather than `${HOME}/.ssh`. We do this because, due to kubernetes/kubernetes#81089, it's not clear whether the directory `${HOME}/.ssh`

- will end up being owned by the user the container is running as
- will be writable (e.g. for known_hosts)

If we instead let `${HOME}/.ssh` live on the persistent volume, we can easily do manual setup and have it persist across reboots.
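For illustration, mounting that secret at `/secrets` might look roughly like this in the pod spec; the secret name corresponds to the `${USER}-ssh` secret created above, and the rest is a sketch:

```yaml
# Sketch of the relevant volume/volumeMount entries in the pod spec.
volumes:
  - name: ssh-key
    secret:
      secretName: someuser-ssh     # the ${USER}-ssh secret created above
      defaultMode: 0400            # private keys must not be group/world readable
containers:
  - name: devbox
    volumeMounts:
      - name: ssh-key
        mountPath: /secrets
        readOnly: true
```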
The startup script `startup.sh` starts `ssh-agent` and adds the key in `/secrets`.
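For illustration, the ssh-agent portion of such a startup script could look like this sketch (the actual `startup.sh` in this repo may differ):

```bash
#!/bin/bash
# Start an ssh-agent for this session and load the private key that the
# K8s secret mounts at /secrets.
eval "$(ssh-agent -s)"
ssh-add /secrets/id_ed25519
```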
Create a secret containing the SSH keys authorized to SSH into the container. These will be the public key(s) of the SSH keys on the machines you will be SSH'ing from:

```bash
kubectl create secret generic ${USER}-auth-keys \
  --from-file=authorized_keys=${HOME}/.ssh/id_ed25519.pub
```
We start an SSH server for use with VSCode. The SSH server runs on port 2222 because it runs as user `jupyter` and therefore can't bind port 22, which is privileged. For more info see Run SSHD as non root user.

To use VSCode, follow the instructions for Remote development using ssh.
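As a rough illustration of the sshd settings involved in running as a non-root user; the paths and key locations below are assumptions, not necessarily what this image uses:

```
# Illustrative sshd_config fragment for running sshd as a non-root user.
Port 2222                                   # unprivileged port, so no root needed
HostKey /storage/jupyter/.sshd/host_key     # host key in a user-writable location (assumed path)
PidFile /tmp/sshd.pid                       # PID file the user can write
AuthorizedKeysFile /etc/authorized_keys/authorized_keys   # from the ${USER}-auth-keys secret (assumed mount path)
PasswordAuthentication no
```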
You will need to edit your host settings in `~/.ssh/config` to set the username and port like this:

```
Host 100.92.148.119
  HostName 100.92.148.119
  User jupyter
  Port 2222
```
The hostname should be the IP address assigned by Tailscale.
To store the home directory and other files on durable storage we do the following (see the sketch after this list):

- Mount a PVC at `/storage`
- Set `HOME` to `/storage/jupyter`
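A minimal sketch of the corresponding container settings in the StatefulSet; the volume name is illustrative:

```yaml
# Sketch: home directory lives on the PVC rather than on ephemeral storage.
containers:
  - name: devbox
    env:
      - name: HOME
        value: /storage/jupyter
    volumeMounts:
      - name: storage            # the PVC from the volumeClaimTemplate
        mountPath: /storage
```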
We hit a couple of issues that led to this approach, as opposed to:

- Mounting the PVC at `/home/jupyter`
- Mounting the PVC at `/home`
When we mounted the PVC at `/home/jupyter`, the user/group permissions of the drive caused SSH to complain when using the SSH keys in `/home/jupyter/.ssh` to allow SSH'ing into the pod. SSH expects the home directory to be readable only by the user. However, it looks like the directory at which the PVC is mounted is owned by root.
Mounting the PVC one level higher at `/home` fixed this. However, I observed that some other ephemeral volume was being mounted at `/home/jupyter`, so the home directory wasn't actually on the PVC. This was evident from running `mount`:
```
/dev/sdb on /home type ext4 (rw,relatime)
/dev/sda1 on /home/jupyter type ext4 (rw,nosuid,nodev,relatime,commit=30)
```
Inspecting the Docker image using crane indicates that the Dockerfile declares `VOLUME /home/jupyter`. I think this causes an ephemeral volume to be mounted there by the kubelet if no volume is explicitly mounted at that path.
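One way to check the volumes declared by an image is with crane; the image name here is a placeholder:

```bash
# Print the image config and pull out any VOLUME declarations.
crane config gcr.io/my-project/devbox:latest | jq '.config.Volumes'
```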
I originally tried Kaniko but ran into issue GoogleContainerTools/skaffold#7701 with not being able to increase ephemeralStorage, so I switched to GCB (Google Cloud Build).

With GCB I had to use a 32-CPU machine; it was timing out trying to push the image with 8 CPUs.
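If you build via skaffold, the Cloud Build machine type is configured in the `googleCloudBuild` section of `skaffold.yaml`; this fragment is a sketch and the project id is illustrative:

```yaml
# Fragment of skaffold.yaml: build with Google Cloud Build on a larger machine.
build:
  googleCloudBuild:
    projectId: my-project        # illustrative project id
    machineType: N1_HIGHCPU_32   # 8 CPUs timed out pushing the image
```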
If you can't connect to the devbox, try to confirm whether the problem is with Tailscale or with SSH.

- Check the SSH logs in `/tmp/sshd.log`
- Try SSH'ing to the pod from within the pod itself (see the sketch after this list)
  - i.e. use `kubectl exec` to start a shell in the pod and then run `ssh` in the pod
  - If this succeeds then it is most likely a networking issue with Tailscale
- Try starting an HTTP server on the pod and seeing if you can connect to it
  - `python -m http.server 8000`
- If it appears to be an issue with Tailscale, log in to Tailscale and try removing the device
- Check the Tailscale logs; there will most likely be a link to authenticate to Tailscale
  - Use the link to reauthenticate
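A sketch of the in-pod SSH check mentioned above; the pod name depends on your StatefulSet and is illustrative:

```bash
# Open a shell in the devbox pod (pod name is illustrative).
kubectl exec -it devbox-0 -- /bin/bash

# From inside the pod, try connecting to the local sshd on port 2222.
ssh -p 2222 jupyter@localhost
```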
You can start an HTTP server on the dev box by running:

```bash
python3 -m http.server 8000
```

This is useful if you're trying to debug SSH issues and need to figure out whether it's an SSH issue or a network/Tailscale issue.
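From your local machine you can then test connectivity to that server over Tailscale, for example using the Tailscale IP from the SSH config example above:

```bash
# If this succeeds but SSH doesn't, the problem is likely SSH rather than Tailscale.
curl http://100.92.148.119:8000
```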
When using skaffold with GCB, `skaffold build` exits with the error:

```
error copying logs to stdout: invalid write result
```

Running `skaffold build` with verbose logging, e.g. `skaffold build -v`, appears to fix this.
SSH'ing into the node hangs

- Make sure you include the `jupyter` user and port, e.g. `ssh jupyter@HOSTNAME -p 2222`