Mechanic

Working under the hood to stop disruptions to your AKS nodes

Description

mechanic is a tool for AKS clusters that mitigates the impact of platform maintenance events. Its primary focus is keeping maintenance events that require node reboots or live migrations from affecting applications, without moving pods unnecessarily or causing application downtime.

It does this by monitoring node conditions and, when a condition indicates a maintenance event, querying the Azure Instance Metadata Service (IMDS) for the event details. If the event is deemed impactful to the node, mechanic cordons and drains the node so that pods are rescheduled onto other nodes before the maintenance event occurs.
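
The IMDS lookup described above can be sketched in Go as follows. This is a minimal illustration, not mechanic's actual implementation: the endpoint, the `Metadata: true` header, and the JSON field names follow the documented IMDS Scheduled Events API, but the set of event types treated as impactful and the matching of resources by plain name are assumptions made for the example. Note that the resource name IMDS reports may differ from the Kubernetes node name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// imdsURL is the link-local IMDS Scheduled Events endpoint; it is only
// reachable from inside an Azure VM.
const imdsURL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

// ScheduledEvent holds the subset of IMDS event fields used in this sketch.
type ScheduledEvent struct {
	EventID     string   `json:"EventId"`
	EventType   string   `json:"EventType"`
	EventStatus string   `json:"EventStatus"`
	Resources   []string `json:"Resources"`
	NotBefore   string   `json:"NotBefore"`
}

// ScheduledEvents is the top-level document returned by IMDS.
type ScheduledEvents struct {
	DocumentIncarnation int              `json:"DocumentIncarnation"`
	Events              []ScheduledEvent `json:"Events"`
}

// parseScheduledEvents decodes an IMDS Scheduled Events response body.
func parseScheduledEvents(data []byte) (ScheduledEvents, error) {
	var doc ScheduledEvents
	err := json.Unmarshal(data, &doc)
	return doc, err
}

// fetchScheduledEvents queries IMDS. The "Metadata: true" header is required.
func fetchScheduledEvents() (ScheduledEvents, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	req, err := http.NewRequest(http.MethodGet, imdsURL, nil)
	if err != nil {
		return ScheduledEvents{}, err
	}
	req.Header.Set("Metadata", "true")
	resp, err := client.Do(req)
	if err != nil {
		return ScheduledEvents{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return ScheduledEvents{}, fmt.Errorf("IMDS returned status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return ScheduledEvents{}, err
	}
	return parseScheduledEvents(body)
}

// impactfulTypes is an ASSUMPTION for this sketch: reboots, redeploys, and
// freezes (the event type used for live migration) are treated as impactful.
var impactfulTypes = map[string]bool{"Reboot": true, "Redeploy": true, "Freeze": true}

// isImpactful reports whether ev has an impactful type and targets vmName.
func isImpactful(ev ScheduledEvent, vmName string) bool {
	if !impactfulTypes[ev.EventType] {
		return false
	}
	for _, r := range ev.Resources {
		if r == vmName {
			return true
		}
	}
	return false
}

func main() {
	// A hand-written sample payload, so the sketch runs outside Azure.
	sample := []byte(`{"DocumentIncarnation":1,"Events":[{"EventId":"example-id",
		"EventType":"Reboot","EventStatus":"Scheduled","Resources":["example-vm_0"]}]}`)
	doc, err := parseScheduledEvents(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, ev := range doc.Events {
		fmt.Printf("%s %s impactful=%v\n", ev.EventID, ev.EventType, isImpactful(ev, "example-vm_0"))
	}
}
```

On a real node, `fetchScheduledEvents` would replace the sample payload. The names `example-id` and `example-vm_0` are placeholders.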

What's the best way to use this?

mechanic works best alongside Cluster Autoscaler (CAS). The node problem detector built into AKS manages the VMEventScheduled node condition, which is what triggers mechanic's drain functionality.

Without Cluster Autoscaler, draining pods from a node could exhaust the cluster's available compute resources. Using CAS or Node Autoprovisioning lets the cluster scale to meet the demands of the rescheduled pods.

Installing mechanic in a cluster

The recommended way to run mechanic is as a DaemonSet, which ensures that each node in the cluster has a monitor that can coordinate cordon and drain operations. There are some limitations at this time:

No ARM nodes are supported. The container images for mechanic are built for amd64 architectures.
No Windows node support. The container images target a Linux environment.

To install the DaemonSet in the cluster, you can run the following command:

kubectl apply -f https://raw.githubusercontent.com/amargherio/mechanic/main/deploy/mechanic.ds.yaml

There are some caveats and items worth noting:

The DaemonSet is deployed in a custom mechanic namespace. This is to ensure that the DaemonSet can be managed independently of other resources in the cluster.
The DaemonSet pulls its image from the GitHub Container Registry for this repo. If your cluster has pull restrictions, first copy the image into a registry you are permitted to pull from.
All images use a base container image of Azure Linux.

How does it work?

mechanic runs as a DaemonSet in your cluster. Each daemon pod watches for updates to its node and, on each update, checks the node's conditions. If a VMEventScheduled condition is present, it queries the Instance Metadata Service for maintenance information.

If the maintenance event is deemed impactful, mechanic cordons the node and begins draining its pods to other nodes in the cluster. During the drain flow, a label (mechanic.cordoned) is added to the node to indicate that it was cordoned by mechanic. If the daemon pod is restarted, it checks for this label and uses it as an input when deciding whether to uncordon the node once the VMEventScheduled condition is no longer present.
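
The role of the label in the uncordon decision can be sketched as a small pure function. This is a hedged illustration under stated assumptions, not mechanic's actual logic: the `NodeState` type and `shouldUncordon` function are hypothetical, and only the presence of the mechanic.cordoned label is checked because the README does not state what value it carries.

```go
package main

import "fmt"

// cordonedLabel is the label named in this README.
const cordonedLabel = "mechanic.cordoned"

// NodeState is a hypothetical summary of the node facts the decision needs.
type NodeState struct {
	Labels         map[string]string // node labels
	Unschedulable  bool              // true while the node is cordoned
	EventScheduled bool              // true while VMEventScheduled is indicated
}

// shouldUncordon returns true only when the node is cordoned, the cordon was
// applied by mechanic (label present), and no maintenance event is indicated.
// A node cordoned by someone else is left alone.
func shouldUncordon(n NodeState) bool {
	_, cordonedByMechanic := n.Labels[cordonedLabel]
	return n.Unschedulable && cordonedByMechanic && !n.EventScheduled
}

func main() {
	afterMaintenance := NodeState{
		Labels:        map[string]string{cordonedLabel: "true"},
		Unschedulable: true,
	}
	manualCordon := NodeState{Unschedulable: true}

	fmt.Println("cordoned by mechanic, event cleared:", shouldUncordon(afterMaintenance))
	fmt.Println("cordoned manually, no label:", shouldUncordon(manualCordon))
}
```

Keeping the marker on the node rather than in the pod's memory is what lets a restarted daemon pod tell its own cordons apart from ones an operator applied by hand.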
