Fatal error in cilium-agent on 1.16.0-pre.3 with bandwidthManager + BBR #32909

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Today I tried upgrading from 1.16.0-pre.2 (which I run for an unrelated fix that has probably been backported by now, but why mess with success?) to 1.16.0-pre.3, and the cilium pods went into CrashLoopBackOff. Looking at the logs, cilium-agent is dying with either of the following errors:

time=2024-06-05T10:26:51Z level=error msg="Invoke failed" error="reconciler.Config[github.com/cilium/cilium/pkg/maps/bwmap.Edt].Table cannot be nil" function="bwmap.registerReconciler (.../maps/bwmap/cell.go:41)"

or

time="2024-06-05T10:27:16Z" level=fatal msg="failed to start: reconciler.Config[github.com/cilium/cilium/pkg/maps/bwmap.Edt].Table cannot be nil" subsys=daemon

No other relevant errors or warnings.

Looking at the codebase, I see this is related to the bandwidthManager configuration. I have bandwidthManager.enabled: true and .bbr: true in my Helm values, but otherwise no bandwidth-related policies configured. This was working fine under 1.16.0-pre.2, and I don't see any new Helm chart values related to bandwidthManager, so I'm at a loss as to what the source of this error could be.
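
For reference, this is the only bandwidth-related configuration in my Helm values (key names as in the upstream Cilium chart; the comments are just my understanding of what the options do):

bandwidthManager:
  enabled: true   # enable the EDT-based BPF bandwidth manager
  bbr: true       # use BBR congestion control for Pod traffic

It is applied the same way as the rest of the chart, i.e. a plain helm upgrade, driven by Argo CD in my case.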

I'm wondering if this could be a k8s 1.30 / Talos Linux related issue. It could also be that I'm using bandwidthManager improperly without fully understanding the implications (e.g. combined with WireGuard encryption), but it has worked well for me in the past and provided latency improvements.

Cilium Version

cilium-cli: v0.16.8 compiled with go1.22.3 on linux/amd64
cilium image (default): v1.15.5
cilium image (stable): v1.15.5
cilium image (running): unknown. Unable to obtain cilium version. Reason: release: not found

Ran into the issue when upgrading from 1.16.0-pre.2 to 1.16.0-pre.3 via Argo CD.

Kernel Version

Linux version 6.6.32-talos (@buildkitsandbox) (gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP Tue May 28 12:51:33 UTC 2024

Kubernetes Version

Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

Regression

No response

Sysdump

No response

Relevant log output

cilium-dbg status output on previously working 1.16.0-pre.2 install:

KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.30 (v1.30.0) [linux/arm64]
Kubernetes APIs:         ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    True   [enx0************     ***************************** (Direct Routing)]
Host firewall:           Disabled
SRv6:                    Disabled
CNI Chaining:            none
CNI Config file:         successfully wrote CNI configuration file to /host/etc/cni/net.d/05-cilium.conflist
Cilium:                  Ok   1.16.0-pre.2 (v1.16.0-pre.2-1bc9e514)
NodeMonitor:             Listening for events on 128 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok   
IPAM:                    IPv4: 4/254 allocated from 10.244.3.0/24, 
ClusterMesh:             0/0 clusters ready, 0 global-services
IPv4 BIG TCP:            Disabled
IPv6 BIG TCP:            Disabled
BandwidthManager:        EDT with BPF [BBR] [enx0*************]
Routing:                 Network: Tunnel [vxlan]   Host: BPF
Attach Mode:             TCX
Masquerading:            BPF   [enx00cbea9e64a2]   10.244.3.0/24 [IPv4: Enabled, IPv6: Disabled]
Controller Status:       25/25 healthy
Proxy Status:            OK, ip 10.244.3.174, 0 redirects active on ports 10000-20000, Envoy: external
Global Identity Range:   min 65536, max 131071
Hubble:                  Ok              Current/Max Flows: 4095/4095 (100.00%), Flows/s: 8.14   Metrics: Ok
Encryption:              Wireguard       [NodeEncryption: Disabled, cilium_wg0 (Pubkey: ******************************, Port: 51871, Peers: 4)]
Cluster health:          5/5 reachable   (2024-06-05T12:09:59Z)
Modules Health:          Stopped(0) Degraded(0) OK(42)

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

No one assigned

    Labels

    • area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
    • area/modularization: Relates to code modularization and maintenance.
    • feature/bandwidth-manager: Impacts BPF bandwidth manager.
    • kind/bug: This is a bug in the Cilium logic.
    • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
    • kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
    • needs/triage: This issue requires triaging to establish severity and next steps.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests