Summary
I'm not sure whether this is one problem with the runners, or two problems, one of which is on GitHub's side. Or maybe it's all GitHub's fault and it isn't communicating properly with the webhooks. IDK.
In the past week, I have regularly been seeing jobs get cancelled or just not run at all. For the first problem, it looks like there might be some sort of timing issue between the close order being sent to the spot request and that request picking up another job. My jobs fail with "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled" even though no one has done anything.
The other thing I'm seeing is jobs that never run at all, yet the workflow still fails.
Steps to reproduce
I don't know. It happens all the time in my pipeline, across all of my workflows and jobs. All of my jobs run bash scripts, which in turn run Docker containers for everything. I do have a few differences from the default settings: the instance type is set to m5.4xlarge, and I have a post_install script that provides ECR access:
mkdir /home/ec2-user/.docker
touch /home/ec2-user/.docker/config.json
echo "{" >> /home/ec2-user/.docker/config.json
echo ' "credsStore": "ecr-login"' >> /home/ec2-user/.docker/config.json
echo "}" >> /home/ec2-user/.docker/config.json
amazon-linux-extras enable docker
yum install -y amazon-ecr-credential-helper
It just occurred to me to try updating the lambda zips, since I'm based straight on the GitHub repo and haven't updated them since the last time I ran terraform init. So I'll give that a shot.