Description
Observed: requests to a service nodePort fail intermittently, while the service cluster IP works 100% of the time.
Debug: tcpdump shows the SYN being sent, but no SYN-ACK returned. I noticed that in the error case the source IP was 127.0.0.1, which is clearly wrong: the packet was leaving without being SNAT'ed. We proved that the KUBE-MARK-MASQ chain was being flushed, which is why we were not getting SNAT'ed. We proved it was kubelet that was flushing, and kube-proxy that eventually restored it (yay for rectification loops!).
We found that the hostport code in kubenet erroneously flushes those chains when starting a pod. After that, it can take up to several minutes for kube-proxy to hit its own sync loop and fix the problem, which is why the failure is intermittent.
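For anyone trying to confirm the same state on a node, here is a minimal sketch that checks whether a chain is declared but empty in `iptables-save` output. The function names and the sample rule text are illustrative assumptions, not taken from kubelet or kube-proxy, and the exact rules on a real node may differ.

```python
def chain_rule_count(iptables_save_output: str, chain: str) -> int:
    """Count the rules appended to `chain` in iptables-save style output."""
    count = 0
    for line in iptables_save_output.splitlines():
        fields = line.split()
        # Rule lines look like: -A <chain> <match and target options...>
        if len(fields) >= 2 and fields[0] == "-A" and fields[1] == chain:
            count += 1
    return count


def is_flushed(iptables_save_output: str, chain: str) -> bool:
    """True if the chain is declared (":<chain> ...") but holds no rules."""
    declared = any(
        line.startswith(":" + chain + " ")
        for line in iptables_save_output.splitlines()
    )
    return declared and chain_rule_count(iptables_save_output, chain) == 0


# Illustrative nat-table dumps (assumed, abbreviated): one healthy, one
# where KUBE-MARK-MASQ exists but its rule has been flushed.
HEALTHY = """*nat
:KUBE-MARK-MASQ - [0:0]
:KUBE-POSTROUTING - [0:0]
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
COMMIT
"""

FLUSHED = """*nat
:KUBE-MARK-MASQ - [0:0]
:KUBE-POSTROUTING - [0:0]
-A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
COMMIT
"""

print(is_flushed(HEALTHY, "KUBE-MARK-MASQ"))
print(is_flushed(FLUSHED, "KUBE-MARK-MASQ"))
```

On a real node, the input would be the text produced by `iptables-save -t nat` (run as root). Sampling it repeatedly while pods start should show the chain going empty and later being refilled.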
The fix is easy - don't flush those chains. @freehan is working on the fix right now.
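One way to picture "don't flush those chains" is an ownership filter: a component only flushes chains it created. This is a rough sketch of that idea, not the actual patch. The chain name `KUBE-HOSTPORTS`, the prefix `KUBE-HP-`, and the helper `chains_safe_to_flush` are assumptions made for illustration.

```python
# Assumed naming for chains owned by the hostport code (illustrative only).
OWNED_CHAIN = "KUBE-HOSTPORTS"
OWNED_PREFIX = "KUBE-HP-"


def chains_safe_to_flush(existing_chains):
    """Return only the chains the hostport code owns.

    Chains managed by other components (for example KUBE-MARK-MASQ,
    which kube-proxy maintains) are left untouched.
    """
    return [
        chain
        for chain in existing_chains
        if chain == OWNED_CHAIN or chain.startswith(OWNED_PREFIX)
    ]


existing = [
    "KUBE-MARK-MASQ",
    "KUBE-HOSTPORTS",
    "KUBE-HP-ABC123",  # hypothetical per-port chain name
    "KUBE-SERVICES",
    "KUBE-NODEPORTS",
]
print(chains_safe_to_flush(existing))
```

Matching on an exact name plus a narrow prefix, rather than on a broad `KUBE-` prefix, is what keeps kube-proxy's chains out of the flush set.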
@fabioy @timstclair for 1.3.x
@pwittrock for 1.4.x
@spxtr for reporting it concretely enough to repro