Load-balance the apiserver endpoint

**Is this a bug report or a feature request?**:

Feature request.

MetalLB cannot reliably provide a load-balancer for the Kubernetes apiserver, because of circular dependencies.

In a working HA cluster, the setup is: you have N machines with apiserver, and a load-balancer providing a single IP for all of them. Then, you configure all your kubelets to talk to the LB IP, and voila! Miracle.

But how do you configure the LB? The Kubernetes documentation basically says "use a magic load-balancer in the sky, outside your cluster, and it will be fine." That doesn't work for bare metal clusters: we don't have magic load-balancers in the sky.

What about just configuring a LoadBalancer Service in k8s? MetalLB would create and advertise the LB IP, and everything works, right? No, because now you have a circular dependency:

1. Kubelet cannot talk to the control plane until MetalLB has started and configured the LB IP
2. ... But MetalLB cannot start until kubelet can talk to the control plane and discover that it should be running the pod!

So, MetalLB cannot be used to control the LB IP for the apiserver.
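To make the cycle explicit, here is a minimal sketch that models the startup dependencies as a graph and asks the Python standard library for a valid startup order. The component names are illustrative labels invented for this example, not real Kubernetes or MetalLB objects.

```python
from graphlib import CycleError, TopologicalSorter

# Each key can only become ready after everything in its value set is ready.
# The labels are illustrative, not real Kubernetes or MetalLB identifiers.
deps = {
    "kubelet talks to control plane": {"LB IP configured"},
    "MetalLB pod running": {"kubelet talks to control plane"},
    "LB IP configured": {"MetalLB pod running"},
}


def startup_order(deps):
    """Return a valid startup order, or None if the dependencies are circular."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


print(startup_order(deps))
```

With the dependencies as described above, no valid order exists, so `startup_order` returns `None`. Any fix has to remove at least one edge from this graph.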

---

How can we solve this? Basically we need some way of breaking the circular dependency, so that kubelets can join the cluster *and* MetalLB can run, at the "same time."

There are a few options for this. They all require new code/config in MetalLB, but first we should try to agree on a general strategy for solving the problem. The options I can think of are:

- Reconfigure kubelets on apiserver nodes to talk only to 127.0.0.1. This way, MetalLB *can* schedule on the apiserver nodes, and it can set up the cluster LB. From there, all other kubelets can connect to the LB IP, and everything works.
  - One implication of this is that kubelets on the apiserver machines will be "less reliable", because they will drop out of the cluster if their *local* apiserver is unavailable, even if the apiserver LB IP is still working. This is probably OK, because if the apiserver is unhealthy, the machine is probably pretty broken anyway... But it's still forcing users to change the availability semantics of their cluster.
- Run a special "apiserver LB only" version of the speaker as a static addon pod (via a manifest in /etc/kubernetes/addons/...) on apiservers. This pod only connects to the apiserver at localhost, and only configures the apiserver LB IP, nothing else. For sanity of management, the addon manifest would be managed by the MetalLB controller, via some intermediate "addon manager" pod that drops updated manifests on the machines.
  - This adds a lot of complexity to MetalLB (need to write an "addon manager", give it similar semantics to DaemonSets...), but requires ~no changes to the cluster configuration. You just install metallb using the "HA apiserver" manifests, and it takes care of plumbing everything together to make apiserver LB work.
  - The exact separation of which pod is responsible for what is unclear to me. The proposed setup I described is probably not quite right, but I'm trying to convey the general idea of what we want to do, not the exact implementation.
  - A fair question would be: how is this better than just using keepalived-vip? One answer might be that it allows exposing the apiserver IP via BGP, not just ARP. But, for this one special case, it might be fine to just document "here's how you make this work safely using keepalived, the apiserver LB IP is special so you need a special solution".
- Look into whether Kubernetes has any plans for "autonomous pods", i.e. pods that automatically restart even if kubelet cannot talk to the apiserver. For example, when kubelet starts up, it looks at its local checkpoint and goes "oh, this pod is autonomous, I'll start it now". Then, later, when it successfully connects to apiserver, apiserver might tell it "oh, actually, don't run that pod right now", and kubelet stops it.
  - This would let us simply run MetalLB as a set of "autonomous pods", with a tiny bit of config checkpointing so that they can bring up the apiserver LB IP with zero dependencies, which allows the rest of the system to start up and converge.
  - I haven't heard of any plans for a feature like this, so this is probably just a random brainstorming idea.
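As a rough illustration of the "autonomous pods" idea in the last option, the decision kubelet would make at startup could look something like the sketch below. Everything here is hypothetical: `CheckpointedPod`, the `autonomous` flag, and `pods_to_run` are names invented for this example, and nothing like this is known to exist in kubelet.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class CheckpointedPod:
    """A pod recorded in kubelet's local checkpoint (hypothetical structure)."""
    name: str
    autonomous: bool  # hypothetical flag: may start without the apiserver


def pods_to_run(
    checkpoint: Iterable[CheckpointedPod],
    apiserver_desired: Optional[Set[str]],
) -> List[str]:
    """Decide which pods should be running on this node.

    apiserver_desired is None while kubelet cannot reach the apiserver.
    Once the apiserver is reachable, its answer always wins.
    """
    if apiserver_desired is None:
        # No control plane yet: only start pods marked autonomous.
        return sorted(p.name for p in checkpoint if p.autonomous)
    # Control plane reachable: run exactly what it asks for.
    return sorted(apiserver_desired)


checkpoint = [
    CheckpointedPod("metallb-speaker", autonomous=True),
    CheckpointedPod("web", autonomous=False),
]
print(pods_to_run(checkpoint, None))     # before the apiserver is reachable
print(pods_to_run(checkpoint, {"web"}))  # apiserver says: don't run the speaker
```

The point of the sketch is only that the autonomous pod gets a head start from the checkpoint, which is enough to bring up the LB IP, and that the apiserver stays the source of truth once it can be reached.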

In general, cluster bringup without circular dependencies is a can of worms, so it probably won't be trivial to fix. But MetalLB should offer a comprehensive solution for "how do I do LB in my cluster", and apiserver LB is part of that.
