Cilium BGP
https://sue.eu/blogs/expose-loadbalanced-kubernetes-services-with-bgp-cilium/
This blog shows how your Kubernetes Service can be exposed to the outside world, using Cilium and BGP.
Cilium is an open source project to provide networking, security and observability for cloud native environments such as Kubernetes clusters and other container orchestration platforms.
BGP
Border Gateway Protocol (BGP) is a standardized exterior gateway protocol designed to exchange routing and reachability information among autonomous systems (AS) on the Internet. BGP is classified as a path-vector routing protocol. It makes routing decisions based on paths, network policies or rule sets configured by a network administrator.
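To make "path-vector" a little more concrete, here is a minimal Python sketch of two ideas behind it: a router rejects any path that already contains its own AS number (loop prevention), and among the remaining paths it prefers the shortest AS path. The peer names and AS paths are made up for illustration, and real BGP implementations evaluate several other attributes (local preference, MED, and so on) before and after AS-path length.

```python
from typing import Dict, List, Optional


def accept_route(local_as: int, as_path: List[int]) -> bool:
    """Loop prevention: reject a path that already contains our own AS."""
    return local_as not in as_path


def best_path(local_as: int, candidates: Dict[str, List[int]]) -> Optional[str]:
    """Return the peer offering the shortest loop-free AS path.

    Ties are broken by peer name, purely to keep the sketch deterministic.
    """
    usable = {peer: path for peer, path in candidates.items()
              if accept_route(local_as, path)}
    if not usable:
        return None
    return min(usable, key=lambda peer: (len(usable[peer]), peer))


# Hypothetical advertisements for one destination prefix.
routes = {
    "peer-a": [65010, 65020, 65030],
    "peer-b": [65040, 65030],
    "peer-c": [65050, 64512, 65030],  # contains our own AS, so it is rejected
}

print(best_path(64512, routes))  # peer-b
```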
Cilium and BGP
In release 1.10, Cilium integrated BGP support using MetalLB, which enables it to announce the IP addresses of Kubernetes Services of type LoadBalancer using BGP. The result is that Services are reachable from outside the Kubernetes network without extra components, such as an Ingress router. The ‘without extra components’ part is especially good news: every component in the path adds latency, so fewer components means less latency.
Lab environment
First let me explain how the lab is set up and what the final result will be.
The lab consists of a client network (192.168.10.0/24) and a Kubernetes network (192.168.1.0/24). When a Service gets a LoadBalancer IP address, that address will be served from the pool 172.16.10.0/24. In our lab, the following nodes are present:
Name | IP addresses | Description |
---|---|---|
bgp-router1 | 192.168.1.1/24 (k8s network), 192.168.10.1/24 (client network) | The BGP router |
k8s-control | 192.168.1.5/24 (k8s network), 192.168.10.233/24 (client network) | Management node |
k8s-master1 | 192.168.1.10/24 (k8s network) | k8s master |
k8s-worker1 | 192.168.1.21/24 (k8s network) | k8s worker |
k8s-worker2 | 192.168.1.22/24 (k8s network) | k8s worker |
k8s-worker3 | 192.168.1.23/24 (k8s network) | k8s worker |
k8s-worker4 | 192.168.1.24/24 (k8s network) | k8s worker |
k8s-worker5 | 192.168.1.25/24 (k8s network) | k8s worker |
After all the parts are configured, it will be possible to reach a Service in the Kubernetes network, from the client network, using the announced LoadBalancer IP address. See image below.
The router is a Red Hat 8 system with three network interfaces (external-, kubernetes- and client network) with FRRouting (FRR) responsible for handling the BGP traffic. FRR is a free and open source Internet routing protocol suite for Linux and Unix platforms. It implements many routing protocols like BGP, OSPF and RIP. In our lab only BGP will be enabled.
After installing FRR, the BGP daemon is enabled by changing bgpd=no to bgpd=yes in the configuration file /etc/frr/daemons. It uses the following BGP configuration in /etc/frr/bgpd.conf:
In the config file above, the AS number 64512 is used, which is reserved for private use. The Kubernetes master node and worker nodes are configured as neighbors. The IP address of the router’s interface in the Kubernetes network (192.168.1.1) is used as the router ID.
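As an illustration of what such a configuration could contain, the Python sketch below renders a minimal FRR BGP stanza from the values named in the text (AS 64512, router ID 192.168.1.1, and the six Kubernetes nodes as neighbors in the same AS). It is an assumption-based reconstruction, not the exact file used in the lab, which may well contain additional statements. The helper names render_bgpd_conf and is_private_as are hypothetical.

```python
LOCAL_AS = 64512
ROUTER_ID = "192.168.1.1"
# k8s-master1 plus k8s-worker1..5, as listed in the lab table.
NEIGHBORS = ["192.168.1.10"] + [f"192.168.1.{host}" for host in range(21, 26)]


def is_private_as(asn: int) -> bool:
    """True for the 16-bit private-use AS range 64512-65534 (RFC 6996)."""
    return 64512 <= asn <= 65534


def render_bgpd_conf(local_as: int, router_id: str, neighbors: list) -> str:
    """Render a minimal FRR 'router bgp' stanza.

    Every neighbor is placed in the same AS as the router, which matches
    the AS column of the 'show bgp summary' output in this lab.
    """
    lines = [f"router bgp {local_as}", f" bgp router-id {router_id}"]
    lines += [f" neighbor {ip} remote-as {local_as}" for ip in neighbors]
    return "\n".join(lines) + "\n"


assert is_private_as(LOCAL_AS)
print(render_bgpd_conf(LOCAL_AS, ROUTER_ID, NEIGHBORS))
```

Because the router and all nodes share one AS number, the sessions are internal BGP (iBGP) sessions.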
After the configuration above is applied and the FRR daemon is started using the systemctl start frr command, the command vtysh -c 'show bgp summary' shows the following output:
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
192.168.1.10 4 64512 0 0 0 0 0 never Active 0
192.168.1.21 4 64512 0 0 0 0 0 never Active 0
192.168.1.22 4 64512 0 0 0 0 0 never Active 0
192.168.1.23 4 64512 0 0 0 0 0 never Active 0
192.168.1.24 4 64512 0 0 0 0 0 never Active 0
192.168.1.25 4 64512 0 0 0 0 0 never Active 0
Total number of neighbors 6
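At this point every neighbor is still in the Active state, which means that no BGP session has been established yet. If you want to check this from a script, the summary can be parsed with a few lines of Python. The sketch below assumes the whitespace-separated layout shown in this blog and relies on the State/PfxRcd column showing a state name while a session is down and a prefix count once it is up; the function names are hypothetical.

```python
SUMMARY = """\
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
192.168.1.10 4 64512 0 0 0 0 0 never Active 0
192.168.1.21 4 64512 446 435 0 0 0 03:36:54 1 0
Total number of neighbors 2
"""


def parse_summary(text: str) -> list:
    """Turn the neighbor table into a list of dicts keyed by column name."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    # Keep only rows with one value per column; this skips the 'Total' line.
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


def established(peer: dict) -> bool:
    """An established session reports a numeric prefix count."""
    return peer["State/PfxRcd"].isdigit()


for peer in parse_summary(SUMMARY):
    status = "established" if established(peer) else peer["State/PfxRcd"]
    print(peer["Neighbor"], status)
```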
It goes beyond the scope of this blog to explain how to install the Kubernetes nodes and the Kubernetes cluster. For your information: in this lab, Red Hat 8 (minimal installation) is used as the operating system for all the nodes, and kubeadm was subsequently used to set up the cluster.
The Cilium Helm chart version v1.10.5 is used to install and configure Cilium on the cluster, using these values:
To get Cilium up and running with BGP, only the bgp key and subkeys are needed from the settings above. The other settings are used to get a fully working Cilium environment with, for example, the Hubble user interface.
When Kubernetes is running and Cilium is configured, it is time to create a deployment and expose it to the network using BGP. The following YAML file creates a Deployment web1, which is just a simple NGINX web server serving the default web page. The file also creates a Service web1-lb of type LoadBalancer. This results in an external IP address that is announced to our router using BGP.
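---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web1
spec:
  selector:
    matchLabels:
      svc: web1-lb
  template:
    metadata:
      labels:
        svc: web1-lb
    spec:
      containers:
        - name: web1
          image: nginx
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
---
# Service manifest reconstructed from the surrounding description
# (name web1-lb, type LoadBalancer, label svc: web1-lb, port 80).
apiVersion: v1
kind: Service
metadata:
  name: web1-lb
spec:
  type: LoadBalancer
  selector:
    svc: web1-lb
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP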
After applying the YAML file above, the command kubectl get svc shows that Service web1-lb has an external IP address:
The address 172.16.10.0 seems strange, but it is fine. Often the .0 address is skipped and the .1 address is used as the first address. One of the reasons is that in the early days the .0 address was used for broadcast, which was later changed to .255. Since .0 is still a valid address, MetalLB, which is responsible for the address pool, hands it out as the first address. The command vtysh -c 'show bgp summary' on router bgp-router1 shows that it has received one prefix from each neighbor:
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
192.168.1.10 4 64512 445 435 0 0 0 03:36:56 1 0
192.168.1.21 4 64512 446 435 0 0 0 03:36:54 1 0
192.168.1.22 4 64512 445 435 0 0 0 03:36:56 1 0
192.168.1.23 4 64512 445 435 0 0 0 03:36:56 1 0
192.168.1.24 4 64512 446 435 0 0 0 03:36:56 1 0
192.168.1.25 4 64512 445 435 0 0 0 03:36:56 1 0
Total number of neighbors 6
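Coming back to the 172.16.10.0 address for a moment: the remark that .0 is a perfectly valid first address can be checked with Python's standard ipaddress module. This is a small sketch that only looks at the address arithmetic for the pool 172.16.10.0/24. It does not reproduce MetalLB's actual allocation code.

```python
import ipaddress

pool = ipaddress.ip_network("172.16.10.0/24")
service_ip = ipaddress.ip_address("172.16.10.0")

# The pool contains 256 addresses, and the very first one is .0.
print(pool.num_addresses)      # 256
print(pool[0])                 # 172.16.10.0
print(service_ip in pool)      # True

# Conventional host numbering skips the first and last address of a subnet,
# which is why many people expect .1 to be handed out first.
first_host = next(iter(pool.hosts()))
print(first_host)              # 172.16.10.1
```

A load balancer address is announced as a single host route, so the "network address" convention of a /24 subnet does not apply to it.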
The following snippet of the routing table (ip route) tells us that for the specific IP address 172.16.10.0, six possible routes/destinations are present. In other words, all Kubernetes nodes announced that they can handle traffic for that address. Cool!
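With several equal next hops for one destination, a router typically spreads traffic per flow, so that all packets of one connection take the same path. The Python sketch below illustrates that idea with a hash over the connection's addresses and ports. It is only an illustration: the hash function and the fields a real router uses depend on its multipath settings, which are not shown in this blog, and pick_next_hop is a hypothetical helper.

```python
import hashlib

# The six nodes that announced 172.16.10.0 in this lab.
NEXT_HOPS = ["192.168.1.10", "192.168.1.21", "192.168.1.22",
             "192.168.1.23", "192.168.1.24", "192.168.1.25"]


def pick_next_hop(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                  proto: str = "tcp") -> str:
    """Map a flow to one next hop; the same flow always maps to the same hop."""
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(NEXT_HOPS)
    return NEXT_HOPS[index]


# Two packets of the same connection end up on the same node.
first = pick_next_hop("192.168.10.50", 40000, "172.16.10.0", 80)
second = pick_next_hop("192.168.10.50", 40000, "172.16.10.0", 80)
print(first == second)  # True
```

The client address 192.168.10.50 is a made-up example from the client network.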
Indeed, the web page is now visible from our router.
And a client in our client network can also reach that same page, since it uses bgp-router1 as default route.
Now that it is all working, most engineers want to see more details, so I will not let you down.
One of the first things you will notice is that the load-balanced IP address is not reachable via ping. Diving a bit deeper reveals why, but before that, let’s create some aliases to make it easier to run the cilium command, which is present in each Cilium agent pod.
First, look at this snippet of the output of cilium bpf lb list, which shows the load-balancing configuration inside Cilium for our Service web1-lb:
Here you can see that a mapping is created between source port 80 and destination port 80. This mapping is executed using eBPF logic at the interface and is present on all nodes. It shows that only (!) traffic for port 80 is balanced. All other traffic, including the ping, is not picked up. That is why you can see the ICMP packet reaching the node, but a response is never sent.
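The behaviour described above can be pictured as an exact-match lookup table. The Python sketch below models the load-balancing map as a dictionary keyed on destination address, port and protocol. It is a conceptual model, not Cilium's actual eBPF map layout, and the backend address 10.0.0.10 is a made-up placeholder for the pod's IP address.

```python
from typing import Dict, Optional, Tuple

Frontend = Tuple[str, int, str]  # (destination IP, destination port, protocol)

# One entry for the Service web1-lb: only TCP port 80 has a backend.
SERVICE_MAP: Dict[Frontend, str] = {
    ("172.16.10.0", 80, "tcp"): "10.0.0.10:80",  # hypothetical pod address
}


def lookup(dst_ip: str, dst_port: int, proto: str) -> Optional[str]:
    """Return the backend for a packet, or None if nothing matches."""
    return SERVICE_MAP.get((dst_ip, dst_port, proto))


print(lookup("172.16.10.0", 80, "tcp"))   # 10.0.0.10:80
print(lookup("172.16.10.0", 443, "tcp"))  # None
print(lookup("172.16.10.0", 0, "icmp"))   # None, so a ping gets no answer
```

Because the ping matches no entry, nothing translates it to a backend, and nothing on the node owns the address 172.16.10.0 to answer it.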
Hubble is the networking and security observability platform which is built on top of eBPF and Cilium. Via the command line and via a graphical web GUI, it is possible to see current and historical traffic. In this lab, Hubble is placed on the k8s-control node, which has direct access to the API of Hubble Relay. Hubble Relay is the component that obtains the needed information from the Cilium nodes. Be aware that the hubble command is also present in each Cilium agent pod, but that one will only show information for that specific agent!
The following outputs show the observer information which is a result of the curl http://172.16.10.0/ command on the router.
Oct 31 15:43:41.382: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: SYN)
Oct 31 15:43:41.384: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.384: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:43:41.385: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.385: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.386: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:43:41.386: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
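Reading the flags from top to bottom, you can recognise the life cycle of one TCP connection: SYN to open it, PSH to carry the HTTP request, and FIN to close it. If you want to summarise such output in a script, the lines can be parsed with a regular expression. The Python sketch below assumes the compact one-line format shown above and is not an official Hubble parser; hubble can also produce structured output, which is a better fit for serious processing.

```python
import re
from collections import Counter

FLOW_RE = re.compile(
    r"^(?P<time>\w{3} +\d+ [\d:.]+): "
    r"(?P<src>\S+) (?:->|<-|<>) (?P<dst>\S+) "
    r"(?P<point>\S+) (?P<verdict>[A-Z]+) "
    r"\(TCP Flags: (?P<flags>[^)]*)\)$"
)

LINES = [
    "Oct 31 15:43:41.382: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: SYN)",
    "Oct 31 15:43:41.384: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, PSH)",
    "Oct 31 15:43:41.386: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, FIN)",
]


def parse_flow(line: str):
    """Return the fields of one flow line as a dict, or None if it does not match."""
    match = FLOW_RE.match(line.strip())
    if match is None:
        return None
    flow = match.groupdict()
    flow["flags"] = flow["flags"].split(", ")
    return flow


flows = [parse_flow(line) for line in LINES]
flag_counts = Counter(flag for flow in flows for flag in flow["flags"])
print(flag_counts["SYN"], flag_counts["ACK"], flag_counts["FIN"])  # 1 2 1
```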
Earlier I warned that the hubble command inside a Cilium agent pod only shows information for that agent, but seeing the traffic of one specific node can also be very informative. In this case, hubble observe --namespace default --follow is executed within each Cilium agent pod and the curl from the router is executed once. On the node where the pod is ‘living’ (k8s-worker2), we see the same output as the one above. However, on another node (k8s-worker1) we see the following output:
What we see here is that our router sends the traffic for IP address 172.16.10.0 to k8s-worker1, but that worker does not host our web1 container, so it forwards the traffic to k8s-worker2, which handles it. All the forwarding logic is handled using eBPF: a small BPF program attached to the interface redirects the traffic to another worker if needed. That is also the reason that running tcpdump on k8s-worker1, where the packets are initially received, does not show any traffic. It is already redirected to k8s-worker2 before it can reach the IP stack of k8s-worker1.
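The decision each node makes can be summarised in a few lines. The Python sketch below is a conceptual model of that decision, using the node names from this lab. It does not show how Cilium's BPF programs are actually written, and the handle function and its return strings are hypothetical.

```python
# Where the backend of each Service frontend lives (one pod in this lab).
BACKENDS = {
    ("172.16.10.0", 80): {"pod": "web1", "node": "k8s-worker2"},
}


def handle(receiving_node: str, dst_ip: str, dst_port: int) -> str:
    """Decide what a node does with a packet that arrives on its interface."""
    backend = BACKENDS.get((dst_ip, dst_port))
    if backend is None:
        return "no service match"
    if backend["node"] == receiving_node:
        return f"deliver to pod {backend['pod']} locally"
    # The backend lives elsewhere, so the packet is redirected to that node.
    return f"forward to {backend['node']}"


print(handle("k8s-worker1", "172.16.10.0", 80))  # forward to k8s-worker2
print(handle("k8s-worker2", "172.16.10.0", 80))  # deliver to pod web1 locally
```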
Cilium.io has a lot of information about eBPF and the internals. If you have not heard about eBPF and you are into Linux and/or networking, please do yourself a favor and learn at least the basics. In my humble opinion eBPF will change networking in Linux drastically in the near future and especially for cloud native environments!
With a working BGP set-up, it is quite simple to make the Hubble Web GUI available to the outside world as well.
Due to the integrated MetalLB, it is very easy to set up Cilium with BGP. Plus, you don’t need expensive network hardware. Cilium/BGP, combined with the disabling of kube-proxy, lowers the latency to your cloud-based Services and gives a clear view of what is exposed to the outside world, because only the LoadBalancer IP addresses are announced. Although an Ingress Controller is not required with this set-up, I would still recommend one for most HTTP Services. Ingress Controllers have great value at the protocol level, for example for rewriting URLs or rate limiting requests. Examples are NGINX or Traefik (exposed by BGP, of course).
All in all, it is very exciting to see that cloud native networking, and networking within Linux in general, is still improving!