Note: If you have missed my previous articles on Docker and Kubernetes, you can find them here.
Application deployment models evolution.
Getting started with Docker.
Docker file and images.
Publishing images to Docker Hub and re-using them
Docker- Find out what's going on
Docker Networking- Part 1
Docker Networking- Part 2
Docker Swarm-Multi-Host container Cluster
Docker Networking- Part 3 (Overlay Driver)
Introduction to Kubernetes
Kubernetes- Diving in (Part 1)
Kubernetes-Diving in (Part2)- Services
Kubernetes- Infrastructure As Code with Yaml (part 1)
Kubernetes- Infrastructure As Code Part 2- Creating PODs with YAML
Kubernetes Infrastructure-as-Code part 3- Replicasets with YAML
Kubernetes Infrastructure-as-Code part 4 - Deployments and Services with YAML
Deploying a microservices APP with Kubernetes
Kubernetes- Time based scaling of deployments with python client
Kubernetes does not ship with a built-in networking component; it expects users to deploy a networking solution on their own. Fortunately, many prebuilt solutions with varied feature sets exist for Kubernetes, and they satisfy the following requirements:
"pods on a node can communicate with all pods on all nodes without NAT"
"agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node"
"pods in the host network of a node can communicate with all pods on all nodes without NAT"
When I built my multi-node Kubernetes cluster, I chose flannel as the networking solution (one of the reasons being that it supports my favorite encapsulation protocol, VxLAN). In this article, I am going to show you how flannel works under the hood.
The diagram below depicts the various interfaces on my Kubernetes cluster nodes.
docker0 and docker_gwbridge: These interfaces are created by Docker. For details on Docker networking, refer to my previous articles on the topic.
Let's look at the requirements and how flannel works to satisfy them.
Requirement: "pods on a node can communicate with all pods on all nodes without NAT"
AND
"agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node"
Veth interfaces: For each Kubernetes PoD, flannel (through its CNI plugin) creates a pair of veth devices. One end becomes eth0 inside the PoD/container, and the other end is attached to cni0, a Linux bridge on the host (CNI stands for Container Network Interface).
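To make the mechanics concrete, here is a minimal manual sketch of the same pattern: a veth pair with one end attached to a bridge. The names (demo-br0, veth-host, veth-pod) are made up for illustration; flannel's CNI plugin does the equivalent programmatically.
# Illustrative sketch only - not flannel's actual code path
ip link add veth-host type veth peer name veth-pod   # create the veth pair
ip link add name demo-br0 type bridge                # a bridge playing the role of cni0
ip link set veth-host master demo-br0                # attach the host end to the bridge
ip link set veth-host up
ip link set demo-br0 up
# In a real PoD, veth-pod would be moved into the PoD's network namespace
# and renamed to eth0, e.g.:
#   ip link set veth-pod netns <pod-netns>
#   ip netns exec <pod-netns> ip link set veth-pod name eth0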
Now let's examine the actual veth interfaces and the bridge on the node:
root@sathish-vm2:/home/sathish/pods# bridge vlan show
port vlan ids
docker0 1 PVID Egress Untagged
docker_gwbridge 1 PVID Egress Untagged
cni0 1 PVID Egress Untagged
veth4be13a77 1 PVID Egress Untagged
vethce7fddef 1 PVID Egress Untagged
The cni0 interface is the bridge master/STP root, and the veth interfaces are its slaves.
# Interface CNI0
root@sathish-vm2:/home/sathish/pods# ip -d link show cni0
6: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether de:7e:05:3e:cc:3c brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q bridge_id 8000.de:7e:5:3e:cc:3c designated_root 8000.de:7e:5:3e:cc:3c ...............Output deleted.............................
# The veth interfaces
root@sathish-vm2:/home/sathish/pods# ip -d link show vethce7fddef
8: vethce7fddef@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
link/ether d6:a5:b2:e3:8d:87 brd ff:ff:ff:ff:ff:ff link-netnsid 1 promiscuity 1 minmtu 68 maxmtu 65535
veth
bridge_slave state forwarding priority 32 cost 2 hairpin on guard off root_block off fastleave off learning on flood on port_id 0x8002 port_no 0x2 designated_port 32770 designated_cost 0 designated_bridge 8000.de:7e:5:3e:cc:3c designated_root 8000.de:7e:5:3e:cc:3c
...................................................
root@sathish-vm2:/home/sathish/pods# ip -d link show veth4be13a77
7: veth4be13a77@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
link/ether 7e:7b:07:35:d5:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 1 minmtu 68 maxmtu 65535
veth
bridge_slave state forwarding priority 32 cost 2 hairpin on guard off root_block off fastleave off learning on flood on port_id 0x8001 port_no 0x1 designated_port 32769 designated_cost 0 designated_bridge 8000.de:7e:5:3e:cc:3c designated_root 8000.de:7e:5:3e:cc:3c
...................................................
This means all the PoDs on a node should be able to reach the host and each other through the bridge; after all, they are in the same L2 network/subnet.
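If you prefer a one-line view of bridge membership, the same iproute2 tooling can show it directly (output will, of course, differ on your nodes):
bridge link show             # all bridge ports and the master each belongs to
ip link show master cni0     # only the ports enslaved to cni0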
Let's deploy a pair of busybox PoDs and check things out.
busybox1.yaml
===================
apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  labels:
    tier: linux
spec:
  containers:
  - name: busybox1
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
busybox2.yaml
===================
apiVersion: v1
kind: Pod
metadata:
  name: busybox2
  labels:
    tier: linux
spec:
  containers:
  - name: busybox2
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
root@sathish-vm2:/home/sathish/pods# kubectl create -f busybox1.yaml
pod/busybox1 created
root@sathish-vm2:/home/sathish/pods# kubectl create -f busybox2.yaml
pod/busybox2 created
root@sathish-vm2:/home/sathish/pods# kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox1 1/1 Running 0 79s
busybox2 1/1 Running 0 74s
root@sathish-vm2:/home/sathish/pods# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 1/1 Running 1 87m 10.244.1.36 sathish-vm1 <none> <none>
busybox2 1/1 Running 1 87m 10.244.1.37 sathish-vm1 <none> <none>
These are the expectations:
The subnet of eth0 inside each PoD should match the cni0 subnet on its node.
We should be able to ping between the PoDs.
From the host, we should be able to reach both PoDs.
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox1 -- ifconfig
eth0 Link encap:Ethernet HWaddr E6:2F:5B:32:06:D5
inet addr:10.244.1.36 Bcast:10.244.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:9 errors:0 dropped:0 overruns:0 frame:0
TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:750 (750.0 B) TX bytes:42 (42.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox2 -- ifconfig
eth0 Link encap:Ethernet HWaddr 22:ED:44:82:06:E3
inet addr:10.244.1.37 Bcast:10.244.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:6 errors:0 dropped:0 overruns:0 frame:0
TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:488 (488.0 B) TX bytes:42 (42.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Checking ping
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox1 -- ping 10.244.1.37
PING 10.244.1.37 (10.244.1.37): 56 data bytes
64 bytes from 10.244.1.37: seq=0 ttl=64 time=0.085 ms
64 bytes from 10.244.1.37: seq=1 ttl=64 time=0.059 ms
64 bytes from 10.244.1.37: seq=2 ttl=64 time=0.099 ms
64 bytes from 10.244.1.37: seq=3 ttl=64 time=0.119 ms
We can also ping the cni0 interface IP from inside a PoD. Note that busybox1 runs on sathish-vm1, while the 10.244.0.1 address below belongs to sathish-vm2's cni0, so this ping crosses a routing hop (hence the TTL of 63); the PoD's default route points to the cni0 of its own node.
#From host
root@sathish-vm2:/home/sathish/pods# ip address
....................output deleted..............................................
6: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether de:7e:05:3e:cc:3c brd ff:ff:ff:ff:ff:ff
inet 10.244.0.1/24 brd 10.244.0.255 scope global cni0
valid_lft forever preferred_lft forever
inet6 fe80::dc7e:5ff:fe3e:cc3c/64 scope link
valid_lft forever preferred_lft forever
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox1 -- ping 10.244.0.1
PING 10.244.0.1 (10.244.0.1): 56 data bytes
64 bytes from 10.244.0.1: seq=0 ttl=63 time=1.612 ms
64 bytes from 10.244.0.1: seq=1 ttl=63 time=2.436 ms
64 bytes from 10.244.0.1: seq=2 ttl=63 time=5.298 ms
64 bytes from 10.244.0.1: seq=3 ttl=63 time=3.119 ms
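To confirm a PoD's own default gateway, a quick check (illustrative; the output will vary with the node and subnet the PoD landed on) is to dump its route table:
# Expect a default route via the cni0 address of the PoD's own node,
# e.g. via 10.244.1.1 for a PoD on sathish-vm1 in this lab
kubectl exec busybox1 -- ip route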
Now, let's talk about the third requirement and how flannel accomplishes it.
Requirement: "PoDs in the host network of a node can communicate with all pods on all nodes without NAT"
Let's create a third PoD and force it to be placed on a particular node (sathish-vm2 in my case). This can be done by specifying the "nodeName" field under the spec section of the YAML file. Here is the complete listing of busybox3.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: busybox3
  labels:
    tier: linux
spec:
  nodeName: sathish-vm2
  containers:
  - name: busybox3
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
root@sathish-vm2:/home/sathish/pods# kubectl create -f busybox3.yaml
pod/busybox3 created
root@sathish-vm2:/home/sathish/pods# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 1/1 Running 1 95m 10.244.1.36 sathish-vm1 <none> <none>
busybox2 1/1 Running 1 95m 10.244.1.37 sathish-vm1 <none> <none>
busybox3 1/1 Running 0 3m48s 10.244.0.4 sathish-vm2 <none> <none>
Now let's start a ping from this new PoD, busybox3 (on sathish-vm2), to busybox1 (on sathish-vm1).
root@sathish-vm2:/home/sathish/pods# kubectl exec --stdin --tty busybox3 -- ping 10.244.1.36
PING 10.244.1.36 (10.244.1.36): 56 data bytes
64 bytes from 10.244.1.36: seq=0 ttl=62 time=0.722 ms
64 bytes from 10.244.1.36: seq=1 ttl=62 time=0.678 ms
This works as expected, and it is possible thanks to flannel's VxLAN implementation.
VxLAN is a UDP encapsulation protocol that enables hosts belonging to the same network to communicate across an L3 routed infrastructure. Here is the VxLAN header:
In a VxLAN network, hosts in the same VNI (Virtual Network Identifier, 24 bits wide) can talk to each other as if they were connected to the same Ethernet segment.
Note: Flannel uses a VNI ID of "1" for most deployments.
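If you want to see what such a VxLAN device looks like outside of flannel, here is a minimal hand-built sketch. The interface name vxlan-demo is made up for illustration; the parameters (VNI 1, UDP port 8472, nolearning) mirror what we will see on flannel.1 shortly. On a node already running flannel, this would conflict with flannel.1's VNI/port, so try it on a scratch machine or pick a different id/port.
# Illustrative sketch only - flannel creates its own flannel.1 device
ip link add vxlan-demo type vxlan id 1 dev eth0 dstport 8472 nolearning
ip link set vxlan-demo up
ip -d link show vxlan-demo    # inspect the VxLAN parameters
ip link del vxlan-demo        # clean up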
Let's look at a packet capture, and then I will delve into some of the finer points of flannel's VxLAN implementation.
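If you want to take a similar capture yourself, something like the following on the node's physical interface should do (illustrative; adjust the interface name to your setup, and note that Wireshark may need to be told to decode UDP port 8472 as VxLAN before it shows the inner headers):
# Capture the encapsulated traffic on the node's uplink
tcpdump -ni eth0 udp port 8472 -w flannel-vxlan.pcap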
Let's start with the outer headers:
The ICMP echo request originated from the busybox3 container on VM2. The outer source MAC and source IP belong to VM2's eth0 interface:
root@sathish-vm2:/home/sathish# ip address
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:15:5d:db:46:5b brd ff:ff:ff:ff:ff:ff
inet 172.28.147.38/28 brd 172.28.147.47 scope global eth0
The outer destination MAC/IP are VM1's MAC/IP. It is important to note that, in the real world, the nodes in a cluster could be in different subnets, in which case the outer destination MAC would be that of the next-hop router.
root@sathish-vm1:/home/sathish# ip address
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:15:5d:db:46:5a brd ff:ff:ff:ff:ff:ff
inet 172.28.147.44/28 brd 172.28.147.47 scope global eth0
The destination UDP port is 8472. Per the VxLAN standard (RFC 7348), the destination port should be 4789, but 8472 is the Linux kernel's original default, and the mismatch doesn't matter here, since only the flannel nodes need to recognize the packet as VxLAN and decapsulate it.
The VNI, as I mentioned before, is 1.
The inner source MAC and IP are those of busybox3. The inner destination IP is that of busybox1 (10.244.1.36, the ping target), while the inner destination MAC is that of VM1's flannel.1 interface, since traffic between the PoD subnets is routed rather than bridged.
With this, let's look at how the ping actually worked (ignoring ICMP-related details):
1. Busybox3 originates the ICMP echo request; the source IP and source MAC are those of busybox3's eth0. Since eth0 is one end of a veth pair, the packet emerges on the corresponding host-side veth interface.
2. As mentioned before, that veth interface is a port on the cni0 bridge, so the packet is switched to cni0.
3. From cni0, the host routes the packet towards the flannel.1 interface. The following forwarding rules, set up by flannel, permit this traffic:
root@sathish-vm2:/home/sathish# iptables-save
-A FORWARD -s 10.244.0.0/16 -j ACCEPT
-A FORWARD -d 10.244.0.0/16 -j ACCEPT
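Note that the actual next-hop selection is done by the host's route table rather than by iptables; flannel installs a route for each remote node's PoD subnet via the flannel.1 interface. You can inspect those routes yourself (output varies by cluster):
# Routes that steer remote PoD subnets into the VxLAN tunnel
ip route show | grep flannel.1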
4. Flannel performs VxLAN encapsulation and sends the packet to the destination node, in my case sathish-vm1.
But how does flannel know the destination is located on VM1? After all, there could be many nodes in a Kubernetes cluster, and the destination could be on any of them. Let's try to answer this question.
With VxLAN, MACs could be learnt dynamically over the tunnel as packets are received (by associating the inner source MAC with the outer source IP). However, flannel sets nolearning on the interface, so the kernel never learns which node a given MAC lives behind; a separate control plane is required.
root@sathish-vm2:/home/sathish# ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 06:1c:30:c6:7c:cb brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 1 local 172.28.147.38 dev eth0 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 62780 gso_max_segs 65535
Flannel discovers all VxLAN neighbors when it is set up and populates entries in the neighbor cache of the flannel.1 interface:
root@sathish-vm2:/home/sathish# ip neigh show dev flannel.1
10.244.1.0 lladdr 16:ab:0a:e0:e7:60 PERMANENT
The above entry maps VM1's flannel.1 IP (10.244.1.0) to its MAC.
That same MAC is then programmed as a permanent FDB entry on the VxLAN device, pointing at the node that owns it:
root@sathish-vm2:/home/sathish# bridge fdb show flannel.1
..................
16:ab:0a:e0:e7:60 dev flannel.1 dst 172.28.147.44 self permanent
With the above entry, the kernel knows that frames destined to this MAC must be tunneled to 172.28.147.44, and that becomes the outer destination IP.
5. The encapsulated packet is then sent out eth0, from where it is forwarded to the destination node (switched or routed, as the case may be).
One piece of the puzzle I did not elaborate on before relates to the following:
root@sathish-vm2:/home/sathish# bridge fdb show flannel.1
..................
16:ab:0a:e0:e7:60 dev flannel.1 dst 172.28.147.44 self permanent
AND
root@sathish-vm2:/home/sathish# ip neigh show dev flannel.1
10.244.1.0 lladdr 16:ab:0a:e0:e7:60 PERMANENT
How did flannel create a static FDB entry, i.e., how did it know that the interface with MAC 16:ab:0a:e0:e7:60 lives on 172.28.147.44 (VM1)?
Also, how does flannel know 172.28.147.44 is part of VxLAN VNI 1?
The answer to both questions is etcd. etcd is a distributed key-value store that is a core component of Kubernetes. When a Kubernetes node is configured with flannel, flannel adds the following information to etcd:
The public IP address of each VxLAN node (the VTEP IP)
Each node's VTEP MAC (the MAC of its flannel.1 interface), mapped to that IP
Flannel uses this information to program the neighbor (ARP) cache on the flannel.1 interface and to create the corresponding FDB entries.
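As a rough illustration of what programming those entries amounts to, the same state could be created by hand with iproute2, using the values from this lab (your MACs and node IPs will differ). Flannel does this itself via netlink rather than by shelling out; this is only a sketch.
# Illustrative sketch only - flannel programs these entries via netlink
# Neighbor entry: remote node's flannel.1 IP -> its flannel.1 MAC
ip neigh replace 10.244.1.0 lladdr 16:ab:0a:e0:e7:60 dev flannel.1 nud permanent
# FDB entry: remote flannel.1 MAC -> remote node (VTEP) IP
bridge fdb append 16:ab:0a:e0:e7:60 dev flannel.1 dst 172.28.147.44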
Note: For those familiar with EVPN, the first entry is similar to an Inclusive Multicast (type 3) route, which tells you which VNI segment is present behind which VTEP, and the second is similar to a type 2 (MAC/IP advertisement) route, i.e., which VTEP has which hosts per VNI.
Hope this was useful, and once again thanks for all your feedback/comments, keep them coming. Have a wonderful week ahead.