Note: If you have missed my previous articles on Docker and Kubernetes, you can find them here.
Application deployment models evolution.
Getting started with Docker.
Docker file and images.
Publishing images to Docker Hub and re-using them
Docker- Find out what's going on
Docker Networking- Part 1
Docker Networking- Part 2
Docker Swarm-Multi-Host container Cluster
Docker Networking- Part 3 (Overlay Driver)
Introduction to Kubernetes
Kubernetes- Diving in (Part 1)
Kubernetes-Diving in (Part2)- Services
Kubernetes- Infrastructure As Code with Yaml (part 1)
Kubernetes- Infrastructure As Code Part 2- Creating PODs with YAML
Kubernetes Infrastructure-as-Code part 3- Replicasets with YAML
Kubernetes Infrastructure-as-Code part 4 - Deployments and Services with YAML
Deploying a microservices APP with Kubernetes
Kubernetes- Time based scaling of deployments with python client
Kubernetes does not ship with a built-in networking component; it expects users to deploy a networking solution on their own. Fortunately, many prebuilt solutions with varied feature sets exist for Kubernetes, and they satisfy the following requirements:
"pods on a node can communicate with all pods on all nodes without NAT"
"agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node"
"pods in the host network of a node can communicate with all pods on all nodes without NAT"
When I built my multi-node Kubernetes cluster, I chose flannel as the networking solution (one of the reasons being that it supports my favorite encapsulation protocol, VxLAN). In this article, I am going to show you how flannel works under the hood.
The diagram below depicts the various interfaces on my Kubernetes cluster nodes.
docker0 and docker_gwbridge: These interfaces are created by Docker. For details on Docker networking, refer to my previous articles on the topic.
Let's look at the requirements and how flannel works to satisfy them.
Requirement: "pods on a node can communicate with all pods on all nodes without NAT"
AND
"agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node"
Veth interfaces: For each Kubernetes PoD, flannel (through its CNI plugin) creates a pair of veth devices. One end becomes eth0 inside the PoD/container, and the other end is attached to cni0, a Linux bridge on the host (CNI stands for Container Network Interface).
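To make the mechanics concrete, here is a minimal manual sketch of the same pattern: a veth pair with one end attached to a bridge. The names (demo-br0, veth-host, veth-pod) are made up for illustration; flannel's CNI plugin does the equivalent programmatically.
# Illustrative sketch only - not flannel's actual code path
ip link add veth-host type veth peer name veth-pod   # create the veth pair
ip link add name demo-br0 type bridge                # a bridge playing the role of cni0
ip link set veth-host master demo-br0                # attach the host end to the bridge
ip link set veth-host up
ip link set demo-br0 up
# In a real PoD, veth-pod would be moved into the PoD's network namespace
# and renamed to eth0, e.g.:
#   ip link set veth-pod netns <pod-netns>
#   ip netns exec <pod-netns> ip link set veth-pod name eth0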
Now let's examine the actual veth interfaces and the bridge on the node:
root@sathish-vm2:/home/sathish/pods# bridge vlan show
port vlan ids
docker0 1 PVID Egress Untagged
docker_gwbridge 1 PVID Egress Untagged
cni0 1 PVID Egress Untagged
veth4be13a77 1 PVID Egress Untagged
vethce7fddef 1 PVID Egress Untagged
The cni0 interface is the bridge master/STP root, and the veth interfaces are its slaves.
# Interface CNI0
root@sathish-vm2:/home/sathish/pods# ip -d link show cni0
6: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether de:7e:05:3e:cc:3c brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q bridge_id 8000.de:7e:5:3e:cc:3c designated_root 8000.de:7e:5:3e:cc:3c ...............Output deleted.............................
# The veth interfaces
root@sathish-vm2:/home/sathish/pods# ip -d link show vethce7fddef
8: vethce7fddef@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
link/ether d6:a5:b2:e3:8d:87 brd ff:ff:ff:ff:ff:ff link-netnsid 1 promiscuity 1 minmtu 68 maxmtu 65535
veth
bridge_slave state forwarding priority 32 cost 2 hairpin on guard off root_block off fastleave off learning on flood on port_id 0x8002 port_no 0x2 designated_port 32770 designated_cost 0 designated_bridge 8000.de:7e:5:3e:cc:3c designated_root 8000.de:7e:5:3e:cc:3c
...................................................
root@sathish-vm2:/home/sathish/pods# ip -d link show veth4be13a77
7: veth4be13a77@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
link/ether 7e:7b:07:35:d5:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 1 minmtu 68 maxmtu 65535
veth
bridge_slave state forwarding priority 32 cost 2 hairpin on guard off root_block off fastleave off learning on flood on port_id 0x8001 port_no 0x1 designated_port 32769 designated_cost 0 designated_bridge 8000.de:7e:5:3e:cc:3c designated_root 8000.de:7e:5:3e:cc:3c
...................................................
This means all the PoDs on a node should be able to reach the host and each other through the bridge; after all, they are in the same L2 network/subnet.
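If you prefer a one-line view of bridge membership, the same iproute2 tooling can show it directly (output will, of course, differ on your nodes):
bridge link show             # all bridge ports and the master each belongs to
ip link show master cni0     # only the ports enslaved to cni0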
Let's deploy a pair of busybox PoDs and check things out.
busybox1.yaml
===================
apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  labels:
    tier: linux
spec:
  containers:
  - name: busybox1
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
busybox2.yaml
===================
apiVersion: v1
kind: Pod
metadata:
  name: busybox2
  labels:
    tier: linux
spec:
  containers:
  - name: busybox2
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
root@sathish-vm2:/home/sathish/pods# kubectl create -f busybox1.yaml
pod/busybox1 created
root@sathish-vm2:/home/sathish/pods# kubectl create -f busybox2.yaml
pod/busybox2 created
root@sathish-vm2:/home/sathish/pods# kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox1 1/1 Running 0 79s
busybox2 1/1 Running 0 74s
root@sathish-vm2:/home/sathish/pods# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 1/1 Running 1 87m 10.244.1.36 sathish-vm1 <none> <none>
busybox2 1/1 Running 1 87m 10.244.1.37 sathish-vm1 <none> <none>
These are the expectations:
The subnet of eth0 inside each PoD should match the cni0 subnet on its node.
We should be able to ping between the PoDs.
From the host, we should be able to reach both PoDs.
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox1 -- ifconfig
eth0 Link encap:Ethernet HWaddr E6:2F:5B:32:06:D5
inet addr:10.244.1.36 Bcast:10.244.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:9 errors:0 dropped:0 overruns:0 frame:0
TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:750 (750.0 B) TX bytes:42 (42.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox2 -- ifconfig
eth0 Link encap:Ethernet HWaddr 22:ED:44:82:06:E3
inet addr:10.244.1.37 Bcast:10.244.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:6 errors:0 dropped:0 overruns:0 frame:0
TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:488 (488.0 B) TX bytes:42 (42.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Checking ping
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox1 -- ping 10.244.1.37
PING 10.244.1.37 (10.244.1.37): 56 data bytes
64 bytes from 10.244.1.37: seq=0 ttl=64 time=0.085 ms
64 bytes from 10.244.1.37: seq=1 ttl=64 time=0.059 ms
64 bytes from 10.244.1.37: seq=2 ttl=64 time=0.099 ms
64 bytes from 10.244.1.37: seq=3 ttl=64 time=0.119 ms
We can also ping the cni0 interface IP from inside a PoD. Note that busybox1 runs on sathish-vm1, while the 10.244.0.1 address below belongs to sathish-vm2's cni0, so this ping crosses a routing hop (hence the TTL of 63); the PoD's default route points to the cni0 of its own node.
#From host
root@sathish-vm2:/home/sathish/pods# ip address
....................output deleted..............................................
6: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether de:7e:05:3e:cc:3c brd ff:ff:ff:ff:ff:ff
inet 10.244.0.1/24 brd 10.244.0.255 scope global cni0
valid_lft forever preferred_lft forever
inet6 fe80::dc7e:5ff:fe3e:cc3c/64 scope link
valid_lft forever preferred_lft forever
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox1 -- ping 10.244.0.1
PING 10.244.0.1 (10.244.0.1): 56 data bytes
64 bytes from 10.244.0.1: seq=0 ttl=63 time=1.612 ms
64 bytes from 10.244.0.1: seq=1 ttl=63 time=2.436 ms
64 bytes from 10.244.0.1: seq=2 ttl=63 time=5.298 ms
64 bytes from 10.244.0.1: seq=3 ttl=63 time=3.119 ms
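To confirm a PoD's own default gateway, a quick check (illustrative; the output will vary with the node and subnet the PoD landed on) is to dump its route table:
# Expect a default route via the cni0 address of the PoD's own node,
# e.g. via 10.244.1.1 for a PoD on sathish-vm1 in this lab
kubectl exec busybox1 -- ip route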
Now, let's talk about the third requirement and how flannel accomplishes it.
Requirement: "PoDs in the host network of a node can communicate with all pods on all nodes without NAT"
Let's create a third PoD and force it to be placed on a particular node (sathish-vm2 in my case). This can be done by specifying the "nodeName" field under the spec section of the YAML file. Here is the complete listing of busybox3.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: busybox3
  labels:
    tier: linux
spec:
  nodeName: sathish-vm2
  containers:
  - name: busybox3
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
root@sathish-vm2:/home/sathish/pods# kubectl create -f busybox3.yaml
pod/busybox3 created
root@sathish-vm2:/home/sathish/pods# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 1/1 Running 1 95m 10.244.1.36 sathish-vm1 <none> <none>
busybox2 1/1 Running 1 95m 10.244.1.37 sathish-vm1 <none> <none>
busybox3 1/1 Running 0 3m48s 10.244.0.4 sathish-vm2 <none> <none>
Now let's start a ping from this new PoD, busybox3 (on sathish-vm2), to busybox1 (on sathish-vm1).
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox3 -- ping 10.244.1.36
PING 10.244.1.36 (10.244.1.36): 56 data bytes
64 bytes from 10.244.1.36: seq=0 ttl=62 time=0.722 ms
64 bytes from 10.244.1.36: seq=1 ttl=62 time=0.678 ms
This works as expected, and it is possible thanks to flannel's VxLAN implementation.
VxLAN is a UDP encapsulation protocol that enables hosts belonging to the same network to communicate across an L3 routed infrastructure. Here is the VxLAN header:
In a VxLAN network, hosts in the same VNI (Virtual Network Identifier, 24 bits wide) can talk to each other as if they were connected to the same Ethernet segment.
Note: Flannel uses a VNI ID of "1" for most deployments.
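If you want to see what such a VxLAN device looks like outside of flannel, here is a minimal hand-built sketch. The interface name vxlan-demo is made up for illustration; the parameters (VNI 1, UDP port 8472, nolearning) mirror what we will see on flannel.1 shortly. On a node already running flannel, this would conflict with flannel.1's VNI/port, so try it on a scratch machine or pick a different id/port.
# Illustrative sketch only - flannel creates its own flannel.1 device
ip link add vxlan-demo type vxlan id 1 dev eth0 dstport 8472 nolearning
ip link set vxlan-demo up
ip -d link show vxlan-demo    # inspect the VxLAN parameters
ip link del vxlan-demo        # clean up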
Let's look at a packet capture, and then I will delve into some of the finer points of flannel's VxLAN implementation.
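If you want to take a similar capture yourself, something like the following on the node's physical interface should do (illustrative; adjust the interface name to your setup, and note that Wireshark may need to be told to decode UDP port 8472 as VxLAN before it shows the inner headers):
# Capture the encapsulated traffic on the node's uplink
tcpdump -ni eth0 udp port 8472 -w flannel-vxlan.pcap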
Let's start with the outer headers:
The ICMP echo request originated from the busybox3 container on VM2. The outer source MAC and source IP belong to VM2's eth0 interface:
root@sathish-vm2:/home/sathish# ip address
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:15:5d:db:46:5b brd ff:ff:ff:ff:ff:ff
inet 172.28.147.38/28 brd 172.28.147.47 scope global eth0
The outer destination MAC/IP are VM1's MAC/IP. It is important to note that, in the real world, the nodes in a cluster could be in different subnets, in which case the outer destination MAC would be that of the next-hop router.
root@sathish-vm1:/home/sathish# ip address
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:15:5d:db:46:5a brd ff:ff:ff:ff:ff:ff
inet 172.28.147.44/28 brd 172.28.147.47 scope global eth0
The destination UDP port is 8472. Per the VxLAN standard (RFC 7348), the destination port should be 4789, but 8472 is the Linux kernel's original default, and the mismatch doesn't matter here, since only the flannel nodes need to recognize the packet as VxLAN and decapsulate it.
The VNI, as I mentioned before, is 1.
The inner source MAC and IP are those of busybox3. The inner destination IP is that of busybox1 (10.244.1.36, the ping target), while the inner destination MAC is that of VM1's flannel.1 interface, since traffic between the PoD subnets is routed rather than bridged.
With this, let's look at how the ping actually worked (ignoring ICMP-related details):
1. Busybox3 originates the ICMP echo request; the source IP and source MAC are those of busybox3's eth0. Since eth0 is one end of a veth pair, the packet emerges on the corresponding host-side veth interface.
2. As mentioned before, that veth interface is a port on the cni0 bridge, so the packet is switched to cni0.
3. From cni0, the host routes the packet towards the flannel.1 interface. The following forwarding rules, set up by flannel, permit this traffic:
root@sathish-vm2:/home/sathish# iptables-save
-A FORWARD -s 10.244.0.0/16 -j ACCEPT
-A FORWARD -d 10.244.0.0/16 -j ACCEPT
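Note that the actual next-hop selection is done by the host's route table rather than by iptables; flannel installs a route for each remote node's PoD subnet via the flannel.1 interface. You can inspect those routes yourself (output varies by cluster):
# Routes that steer remote PoD subnets into the VxLAN tunnel
ip route show | grep flannel.1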
4. Flannel performs VxLAN encapsulation and sends the packet to the destination node, in my case sathish-vm1.
But how does flannel know the destination is located on VM1? After all, there could be many nodes in a Kubernetes cluster, and the destination could be on any of them. Let's try to answer this question.
With VxLAN, MACs could be learnt dynamically over the tunnel as packets are received (by associating the inner source MAC with the outer source IP). However, flannel sets nolearning on the interface, so the kernel never learns which node a given MAC lives behind; a separate control plane is required.
root@sathish-vm2:/home/sathish# ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 06:1c:30:c6:7c:cb brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 1 local 172.28.147.38 dev eth0 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 62780 gso_max_segs 65535
Flannel discovers all VxLAN neighbors when it is set up and populates entries in the neighbor cache of the flannel.1 interface:
root@sathish-vm2:/home/sathish# ip neigh show dev flannel.1
10.244.1.0 lladdr 16:ab:0a:e0:e7:60 PERMANENT
The above entry maps VM1's flannel.1 IP (10.244.1.0) to its MAC.
That same MAC is then programmed as a permanent FDB entry on the VxLAN device, pointing at the node that owns it:
root@sathish-vm2:/home/sathish# bridge fdb show flannel.1
..................
16:ab:0a:e0:e7:60 dev flannel.1 dst 172.28.147.44 self permanent
With the above entry, the kernel knows that frames destined to this MAC must be tunneled to 172.28.147.44, and that becomes the outer destination IP.
5. The encapsulated packet is then sent out eth0, from where it is forwarded to the destination node (switched or routed, as the case may be).
One piece of the puzzle I did not elaborate on before relates to the following:
root@sathish-vm2:/home/sathish# bridge fdb show flannel.1
..................
16:ab:0a:e0:e7:60 dev flannel.1 dst 172.28.147.44 self permanent
AND
root@sathish-vm2:/home/sathish# ip neigh show dev flannel.1
10.244.1.0 lladdr 16:ab:0a:e0:e7:60 PERMANENT
How did flannel create a static FDB entry, i.e., how did it know that the interface with MAC 16:ab:0a:e0:e7:60 lives on 172.28.147.44 (VM1)?
Also, how does flannel know 172.28.147.44 is part of VxLAN VNI 1?
The answer to both questions is etcd. etcd is a distributed key-value store that is a core component of Kubernetes. When a Kubernetes node is configured with flannel, flannel adds the following information to etcd:
The public IP address of each VxLAN node (the VTEP IP)
Each node's VTEP MAC (the MAC of its flannel.1 interface), mapped to that IP
Flannel uses this information to program the neighbor (ARP) cache on the flannel.1 interface and to create the corresponding FDB entries.
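As a rough illustration of what programming those entries amounts to, the same state could be created by hand with iproute2, using the values from this lab (your MACs and node IPs will differ). Flannel does this itself via netlink rather than by shelling out; this is only a sketch.
# Illustrative sketch only - flannel programs these entries via netlink
# Neighbor entry: remote node's flannel.1 IP -> its flannel.1 MAC
ip neigh replace 10.244.1.0 lladdr 16:ab:0a:e0:e7:60 dev flannel.1 nud permanent
# FDB entry: remote flannel.1 MAC -> remote node (VTEP) IP
bridge fdb append 16:ab:0a:e0:e7:60 dev flannel.1 dst 172.28.147.44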
Note: For those familiar with EVPN, the first entry is similar to an Inclusive Multicast (type 3) route, which tells you which VNI segment is present behind which VTEP, and the second is similar to a type 2 (MAC/IP advertisement) route, i.e., which VTEP has which hosts per VNI.
Hope this was useful, and once again thanks for all your feedback/comments, keep them coming. Have a wonderful week ahead.