Note: If you have missed my previous articles on Docker and Kubernetes, you can find them here.
Application deployment models evolution.
Getting started with Docker.
Docker file and images.
Publishing images to Docker Hub and re-using them
Docker- Find out what's going on
Docker Networking- Part 1
Docker Networking- Part 2
Docker Swarm-Multi-Host container Cluster
Docker Networking- Part 3 (Overlay Driver)
Introduction to Kubernetes
Kubernetes- Diving in (Part 1)-Installing Kubernetes multi-node cluster
Kubernetes-Diving in (Part2)- Services
Kubernetes- Infrastructure As Code with Yaml (part 1)
Kubernetes- Infrastructure As Code Part 2- Creating PODs with YAML
Kubernetes Infrastructure-as-Code part 3- Replicasets with YAML
Kubernetes Infrastructure-as-Code part 4 - Deployments and Services with YAML
Deploying a microservices APP with Kubernetes
Kubernetes- Time based scaling of deployments with python client
Kubernetes Networking - The Flannel network explained
Kubernetes- Installing and using kubectl top for monitoring nodes and PoDs
Application deployment has evolved over the last two decades, and current application designs favor microservices-based architectures. Kubernetes, as an orchestration system for these applications, makes them easy to deploy and maintain.
A Kubernetes cluster consists of a set of master and worker nodes, and PoDs can be scheduled on any of these nodes. The scheduler component running on the Kubernetes master decides which PoD to put on which node (the kubelet on each node executes this action). However, not all nodes and applications are created equal. Certain nodes may be more powerful than others and could run resource-hungry PoDs. Similarly, some PoDs/deployments may demand more computing power (CPU), memory, or disk. How do we control the placement of PoDs/deployments - in other words, how do we override the default scheduling behavior? In this article, I am going to talk about methods to influence scheduling and ensure the right PoD gets the right node.
Note
I have a simple 2-node cluster setup: 1 master and 1 worker node.
root@sathish-vm2:/home/sathish# kubectl get nodes
NAME STATUS ROLES AGE VERSION
sathish-vm1 Ready <none> 52d v1.19.0
sathish-vm2 Ready master 52d v1.19.0
By default, PoDs will not be scheduled on the master node as it has a default taint set:
root@sathish-vm2:/home/sathish# kubectl describe nodes sathish-vm2 | grep Taint
Taints: node-role.kubernetes.io/master:NoSchedule
I will talk about taints and tolerations in a bit, but if you have a setup similar to mine, it is advisable to remove this taint before following along. To unset the taint:
root@sathish-vm2:/home/sathish# kubectl taint node sathish-vm2 node-role.kubernetes.io/master:NoSchedule-
# Note the - symbol at end of taint
node/sathish-vm2 untainted
root@sathish-vm2:/home/sathish# kubectl describe nodes sathish-vm2 | grep Taint
Taints: <none>
Let's try out some scheduling options
Manual Scheduling (nodeName selector)
It is possible to manually specify which node a PoD should be placed on with the nodeName option. Before trying this out, let's create a PoD with kubectl and check what the scheduler does.
root@sathish-vm2:/home/sathish# kubectl run busybox1 --image busybox --command sleep 3600
pod/busybox1 created
root@sathish-vm2:/home/sathish# kubectl get pods busybox1 -o yaml | grep node
nodeName: sathish-vm1
The scheduler automatically chooses "sathish-vm1" node in my case.
Now, let's create one more pod, busybox2, and place it on "sathish-vm2" using the nodeName selector.
# Create a YAML template file
root@sathish-vm2:/home/sathish# kubectl run busybox2 --image busybox --command sleep 3600 --dry-run=client -o yaml > busybox2.yaml
root@sathish-vm2:/home/sathish# cat busybox2.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox2
  name: busybox2
spec:
  containers:
  - command:
    - sleep
    - "3600"
    image: busybox
    name: busybox2
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
# Edit the YAML file to add the nodeName selector. Refer to the Kubernetes docs.
root@sathish-vm2:/home/sathish# cat busybox2.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox2
  name: busybox2
spec:
  containers:
  - command:
    - sleep
    - "3600"
    image: busybox
    name: busybox2
    resources: {}
  nodeName: sathish-vm2
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@sathish-vm2:/home/sathish# kubectl create -f busybox2.yaml
pod/busybox2 created
root@sathish-vm2:/home/sathish# kubectl get pods -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
busybox1 1/1 Running 0 4m56s 10.244.1.20 sathish-vm1 <none> <none> run=busybox1
busybox2 1/1 Running 0 70s 10.244.0.7 sathish-vm2 <none> <none> run=busybox2
So now we have one PoD on each node, and we were able to influence node selection with nodeName (the master node also acts as a worker node once its taint is removed). Note that nodeName bypasses the scheduler entirely; the kubelet on the named node simply runs the PoD.
Taints and Tolerations
Manual scheduling can be useful for a limited set of PoDs. However, if we have hundreds of PoDs, manual scheduling becomes cumbersome. This is where taints and tolerations become useful: they allow administrators to reserve a particular set of nodes for objects with similar requirements.
The way to think of taints and tolerations is this: you taint a node with a key/value pair and an effect, and when creating objects like PoDs you give the object the "ability" to tolerate that taint with a toleration. This way, if we have a bunch of PoDs with high compute requirements, we can dedicate a node to them by tainting it (say with a "highcpu" taint) so that only PoDs carrying the matching toleration can land there.
Taints and tolerations are described here
Nodes can be tainted with the following command:
kubectl taint nodes node1 key1=value1:effect
There are 3 possible effects (a sample toleration spec follows this list):
1. NoSchedule: PoDs that do not have a matching toleration will not be scheduled on this node (by the scheduler), but PoDs already running on it will not be evicted.
2. PreferNoSchedule: Similar to NoSchedule; however, the scheduler will still schedule PoDs on the tainted node if it cannot find another node that satisfies the PoD's requirements (like CPU).
3. NoExecute: No new PoDs without a matching toleration will be scheduled, and running PoDs without a matching toleration will be evicted from the node.
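For reference, here is what a matching toleration looks like in a PoD spec. This is a minimal sketch using the cputype=highcpu key/value pair from the example below; the Exists operator form tolerates the taint regardless of its value.
tolerations:
- key: "cputype"
  operator: "Equal"
  value: "highcpu"
  effect: "NoSchedule"
# Alternatively, tolerate any value set for the key:
# - key: "cputype"
#   operator: "Exists"
#   effect: "NoSchedule"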
With this, let's start by tainting the "sathish-vm1" node:
root@sathish-vm2:/home/sathish# kubectl taint nodes sathish-vm1 cputype=highcpu:NoSchedule
node/sathish-vm1 tainted
root@sathish-vm2:/home/sathish# kubectl describe nodes sathish-vm1 | grep Taint
Taints: cputype=highcpu:NoSchedule
As we chose the NoSchedule effect, busybox1 is not evicted from the node:
root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 1/1 Running 0 53m 10.244.1.20 sathish-vm1 <none> <none>
busybox2 1/1 Running 0 49m 10.244.0.7 sathish-vm2 <none> <none>
Now I am going to change the effect to "NoExecute" so that "busybox1" is evicted from "sathish-vm1".
#Remove the NoSchedule Taint by adding "-" at end
root@sathish-vm2:/home/sathish# kubectl taint nodes sathish-vm1 cputype=highcpu:NoSchedule-
node/sathish-vm1 untainted
#Setting NoExecute Taint
root@sathish-vm2:/home/sathish# kubectl taint nodes sathish-vm1 cputype=highcpu:NoExecute
node/sathish-vm1 tainted
root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 1/1 Terminating 0 57m 10.244.1.20 sathish-vm1 <none> <none>
busybox2 1/1 Running 0 53m 10.244.0.7 sathish-vm2 <none> <none>
root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox2 1/1 Running 0 54m 10.244.0.7 sathish-vm2 <none> <none>
Now let's re-create the busybox1 PoD, giving it a toleration for the "cputype=highcpu" taint.
root@sathish-vm2:/home/sathish# cat busybox1.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
  name: busybox1
spec:
  containers:
  - command:
    - sleep
    - "3600"
    image: busybox
    name: busybox1
  tolerations:
  - key: "cputype"
    operator: "Equal"
    value: "highcpu"
    effect: "NoExecute"
root@sathish-vm2:/home/sathish# kubectl create -f busybox1.yaml
pod/busybox1 created
root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 0/1 ContainerCreating 0 5s <none> sathish-vm1 <none> <none>
busybox2 1/1 Running 1 64m 10.244.0.7 sathish-vm2 <none> <none>
root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox1 1/1 Running 0 16s 10.244.1.21 sathish-vm1 <none> <none>
busybox2 1/1 Running 1 64m 10.244.0.7 sathish-vm2 <none> <none>
One point to note with tolerations is this: having a toleration for a specific taint does not mean that the PoD will always be scheduled on the node with that taint. It just means that the PoD can be scheduled anywhere on the cluster AND will also tolerate nodes with the taint.
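If you do want tolerant PoDs to end up only on the tainted node, a common approach is to also label that node and combine the toleration with a nodeSelector. A minimal sketch, assuming we add a cputype=highcpu label (this label is not part of the setup above):
# Label the tainted node
kubectl label nodes sathish-vm1 cputype=highcpu
# PoD spec fragment: toleration plus nodeSelector
spec:
  nodeSelector:
    cputype: highcpu
  tolerations:
  - key: "cputype"
    operator: "Equal"
    value: "highcpu"
    effect: "NoExecute"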
This brings us to the next requirement. Let's say we want a set of PoDs to be scheduled on nodes with certain hardware; for example, httpd and redis PoDs should always be scheduled on nodes with SSDs, as they require quick disk reads. And how about scheduling closely coupled PoDs on the same node?
These questions take us to the next method of scheduling: node affinity.
NodeAffinity
For the sake of discussion, let's assume we want both httpd and redis PoDs to be scheduled on any node with SSD disk drives. Let's check out how this is possible.
The first step is to create labels on nodes with an SSD disk. In my case, I am going to label "sathish-vm1"
root@sathish-vm2:/home/sathish# kubectl label nodes sathish-vm1 disktype=ssd
node/sathish-vm1 labeled
root@sathish-vm2:/home/sathish# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
sathish-vm1 Ready <none> 52d v1.19.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm1,kubernetes.io/os=linux
sathish-vm2 Ready master 52d v1.19.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm2,kubernetes.io/os=linux,node-role.kubernetes.io/master=
Now, when creating PoDs we can ask the scheduler to ensure that the object is created on nodes with labels "disktype=ssd". Let's deploy an httpd PoD.
root@sathish-vm2:/home/sathish# cat httpd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: httpd
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: httpd
    image: httpd
root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd 1/1 Running 0 19s 10.244.1.18 sathish-vm1 <none> <none>
"requiredDuringSchedulingIgnoredDuringExecution" essentially means the selector is considered only during scheduling and ignored for already-running PoDs. This implies that if we delete the SSD label, the httpd PoD should keep running on "sathish-vm1" and not be evicted.
#Delete labels with "-" at end of the label name
root@sathish-vm2:/home/sathish# kubectl label nodes sathish-vm1 disktype-
node/sathish-vm1 labeled
root@sathish-vm2:/home/sathish# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
sathish-vm1 Ready <none> 52d v1.19.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm1,kubernetes.io/os=linux
sathish-vm2 Ready master 52d v1.19.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm2,kubernetes.io/os=linux,node-role.kubernetes.io/master=
root@sathish-vm2:/home/sathish# kubectl get pods
NAME READY STATUS RESTARTS AGE
httpd 1/1 Running 0 7m59s
Let's look at the following section of the YAML file I used
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
The operator I have used is "In". Depending on the requirement, you can use the other operators defined here: In, NotIn, Exists, DoesNotExist, Gt, and Lt.
Let's say I want to schedule PoDs on all nodes except master nodes. I can accomplish this using the "DoesNotExist" operator like so:
root@sathish-vm2:/home/sathish# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
sathish-vm1 Ready <none> 53d v1.19.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm1,kubernetes.io/os=linux
sathish-vm2 Ready master 53d v1.19.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm2,kubernetes.io/os=linux,node-role.kubernetes.io/master=
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/master
          operator: DoesNotExist
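Besides the hard requirement used above, nodeAffinity also supports preferredDuringSchedulingIgnoredDuringExecution, a soft preference: the scheduler favors matching nodes but will still place the PoD elsewhere if none match. A minimal sketch, reusing the disktype=ssd label from earlier:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd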
Just as nodeAffinity steers PoDs toward particular nodes, the podAffinity selector can be used to co-locate closely coupled PoDs on the same node (and podAntiAffinity to keep PoDs apart). This is accomplished by defining a topologyKey. Refer to the Kubernetes docs here.
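As a rough sketch of pod affinity, assuming the redis PoDs carry an app=redis label (an assumption made here purely for illustration), a rule that co-locates a PoD on the same node as redis could look like this:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - redis
      topologyKey: kubernetes.io/hostname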
nodeName, taints/tolerations, and nodeAffinity allow us to place PoDs based on predefined criteria. How about scheduling based on the resources required by PoDs? Is it possible to request resources for a PoD, for instance 1 GiB of memory? This is possible with resource request definitions.
Resource Requests
Resource requests for containers are defined here
At a fundamental level, PoDs are container(s) that require resources like CPU, memory, and disk. Multiple PoDs can run on a worker node at any point in time, and hence the hardware resources of the node are shared. Each PoD gets default CPU and memory requests and limits from the LimitRange defined for its namespace. Limit ranges are described here.
The defaults set for a namespace (including the default namespace) can be overridden with per-container resource requests.
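Requests tell the scheduler how much of a resource to reserve for the container, while limits cap what it can consume at runtime. A minimal sketch of a container's resources section covering both CPU and memory (the values here are arbitrary examples, not taken from this cluster):
resources:
  requests:
    cpu: "250m"
    memory: "64Mi"
  limits:
    cpu: "500m"
    memory: "128Mi"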
root@sathish-vm2:/home/sathish# kubectl describe namespace default
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Let's look at the number of CPUs on my node.
root@sathish-vm2:/home/sathish# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
stepping : 3
microcode : 0xffffffff
cpu MHz : 2711.998
cache size : 8192 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
bogomips : 5423.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
stepping : 3
microcode : 0xffffffff
cpu MHz : 2711.998
cache size : 8192 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
bogomips : 5423.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
I have dedicated 2 vCPU cores to each of my VMs. Let's say I want every container to be limited to 1 vCPU by default, and to request 0.5 vCPU (or 50% of a core) by default; the "LimitRange" object makes this possible.
root@sathish-vm2:/home/sathish# cat default-cpu.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:
      cpu: 1
    defaultRequest:
      cpu: 0.5
    type: Container
root@sathish-vm2:/home/sathish# kubectl create -f default-cpu.yaml
limitrange/cpu-limit-range created
root@sathish-vm2:/home/sathish# kubectl describe namespace default
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Container cpu - - 500m 1 -
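Any container created in this namespace without an explicit resources section now inherits these defaults. A quick way to verify this, using a throwaway PoD (the name nolimits is just for illustration):
kubectl run nolimits --image busybox --command sleep 3600
kubectl get pod nolimits -o jsonpath='{.spec.containers[0].resources}'
# Expect to see the injected defaults: requests cpu 500m and limits cpu 1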
Let's create a PoD now that stays within this limit
root@sathish-vm2:/home/sathish# cat underlimit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: httpd
    resources:
      requests:
        cpu: "250m"
      limits:
        cpu: "500m"
root@sathish-vm2:/home/sathish# kubectl create -f underlimit.yaml
pod/web created
root@sathish-vm2:/home/sathish# kubectl get pods
NAME READY STATUS RESTARTS AGE
web 1/1 Running 0 63s
As we can see, this works fine. Now let's create a PoD that requests more CPU than the cluster can accommodate.
apiVersion: v1
kind: Pod
metadata:
  name: datacruncher
spec:
  containers:
  - name: datcruncher
    image: busybox
    resources:
      requests:
        cpu: "600m"
      limits:
        cpu: "800m"
root@sathish-vm2:/home/sathish# kubectl create -f overlimit.yaml
pod/web1 created
root@sathish-vm2:/home/sathish# kubectl describe pods datacruncher
Name: datacruncher
Namespace: default
Priority: 0
Node: sathish-vm2/172.28.147.38
Start Time: Mon, 07 Dec 2020 07:07:34 +0000
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
datcruncher:
Container ID:
Image: busybox
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 800m
Requests:
cpu: 600m
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-spkmz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-spkmz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-spkmz
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 71s 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
Warning FailedScheduling 71s 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
Normal Scheduled 18s Successfully assigned default/datacruncher to sathish-vm2
Normal Pulling 18s kubelet, sathish-vm2 Pulling image "busybox"
As we can see, scheduling initially failed due to insufficient CPU.
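When this happens, it is worth checking how much CPU has already been committed on each node. One way to do that (the grep is just a convenience to trim the output):
kubectl describe node sathish-vm2 | grep -A 8 "Allocated resources"
# Compare the CPU requests shown here with the Allocatable section further up in the describe output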
Daemonsets
Let's discuss another requirement: when a node joins a cluster, you may want to run PoDs that set things up for the system or run applications like a logger. Such a PoD should run on every node and should be deleted only when the node goes down. This requirement can be accomplished with DaemonSets.
Kubernetes uses daemonsets internally to set up its proxy and networking components.
root@sathish-vm2:/home/sathish# kubectl get daemonset --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system kube-flannel-ds 2 2 2 2 2 <none> 53d
kube-system kube-proxy 2 2 2 2 2 kubernetes.io/os=linux 53d
root@sathish-vm2:/home/sathish# kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-f9fd979d6-d8wzr 1/1 Running 0 53d 10.244.0.3 sathish-vm2 <none> <none>
coredns-f9fd979d6-xcxzc 1/1 Running 0 53d 10.244.0.2 sathish-vm2 <none> <none>
etcd-sathish-vm2 1/1 Running 0 53d 172.28.147.38 sathish-vm2 <none> <none>
kube-apiserver-sathish-vm2 1/1 Running 0 53d 172.28.147.38 sathish-vm2 <none> <none>
kube-controller-manager-sathish-vm2 1/1 Running 24 53d 172.28.147.38 sathish-vm2 <none> <none>
kube-flannel-ds-cq56k 1/1 Running 0 53d 172.28.147.38 sathish-vm2 <none> <none>
kube-flannel-ds-x7prc 1/1 Running 0 53d 172.28.147.44 sathish-vm1 <none> <none>
kube-proxy-lcf25 1/1 Running 0 53d 172.28.147.44 sathish-vm1 <none> <none>
kube-proxy-tf8z8 1/1 Running 0 53d 172.28.147.38 sathish-vm2 <none> <none>
kube-scheduler-sathish-vm2 1/1 Running 19 53d 172.28.147.38 sathish-vm2 <none> <none>
metrics-server-56c59cf9ff-9qldp 1/1 Running 0 90m 10.244.0.6 sathish-vm2 <none> <none>
Let's create a daemonset
root@sathish-vm2:/home/sathish# cat daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elasticsearch
spec:
  selector:
    matchLabels:
      name: elasticsearch
  template:
    metadata:
      labels:
        name: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
root@sathish-vm2:/home/sathish# kubectl create -f daemonset.yaml
daemonset.apps/elasticsearch created
root@sathish-vm2:/home/sathish# kubectl get daemonsets
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
elasticsearch 2 2 0 2 0 <none> 6s
root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
elasticsearch-jjvbg 1/1 Running 0 2m22s 10.244.1.18 sathish-vm1 <none> <none>
elasticsearch-pztcj 1/1 Running 0 2m22s 10.244.0.12 sathish-vm2 <none> <none>
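DaemonSet PoDs can also be restricted to a subset of nodes using the same mechanisms discussed earlier, by adding a nodeSelector (or affinity/tolerations) to the PoD template. A minimal sketch of the DaemonSet's .spec.template, reusing the disktype=ssd label from the nodeAffinity section purely for illustration:
  template:
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2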
Static Pods
It is possible to create PoDs on a node even when the node has not joined a cluster. The kubelet running on the node looks for pod manifests in a configured directory and creates those PoDs. These static PoDs are recreated after deletion as long as the manifest file exists.
Kubernetes creates static PoDs for some of its own components. Here is the list of static PoD manifests on a master node:
root@sathish-vm2:/etc/kubernetes/manifests# ls
etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
But how does the kubelet know the path for static PoD manifests? Let's examine the kubelet process and check its configuration file.
root@sathish-vm2:/etc/kubernetes/manifests# ps -aux | grep kubelet | grep config
root 140991 2.9 1.4 1957096 105976 ? Ssl Dec06 21:45 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2
root@sathish-vm2:/etc/kubernetes/manifests# cat /var/lib/kubelet/config.yaml | grep staticPodPath
staticPodPath: /etc/kubernetes/manifests
These settings indicate that static PoD manifests are located in "/etc/kubernetes/manifests". You can create additional YAML files in this location; the corresponding PoDs are automatically spun up and kept running as long as the YAML files exist.
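As a quick test, a manifest like the one below (the file name and PoD are hypothetical) dropped into /etc/kubernetes/manifests makes the kubelet create the PoD on that node; deleting the file removes the PoD.
# /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: httpd
The kubelet appends the node name when such PoDs show up via the API server (for example static-web-sathish-vm2), and deleting them with kubectl only recreates them, because the manifest file remains the source of truth.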
Well, that's it for now. Hope this post was useful and helped you gain a basic understanding of scheduling in Kubernetes and how you can control the placement of PoDs. Thanks for your time.