Sathish Kumar

Kubernetes Administration- Scheduling

Updated: Dec 27, 2020

Note: If you have missed my previous articles on Docker and Kubernetes, you can find them here.
Application deployment models evolution.
Getting started with Docker.
Docker file and images.
Publishing images to Docker Hub and re-using them
Docker- Find out what's going on
Docker Networking- Part 1
Docker Networking- Part 2
Docker Swarm-Multi-Host container Cluster
Docker Networking- Part 3 (Overlay Driver)
Introduction to Kubernetes
Kubernetes- Diving in (Part 1)-Installing Kubernetes multi-node cluster
Kubernetes-Diving in (Part2)- Services
Kubernetes- Infrastructure As Code with Yaml (part 1)
Kubernetes- Infrastructure As Code Part 2- Creating PODs with YAML
Kubernetes Infrastructure-as-Code part 3- Replicasets with YAML
Kubernetes Infrastructure-as-Code part 4 - Deployments and Services with YAML
Deploying a microservices APP with Kubernetes
Kubernetes- Time based scaling of deployments with python client
Kubernetes Networking - The Flannel network explained
Kubernetes- Installing and using kubectl top for monitoring nodes and PoDs

Application deployment has evolved over the last two decades, and current application designs favor microservices-based architectures. Kubernetes, as an orchestration system, makes these applications easy to deploy and maintain.


A Kubernetes cluster consists of a set of master and worker nodes, and PoDs can be scheduled on any of these nodes. The scheduler component running on the Kubernetes master decides which PoD goes on which node (the kubelet on each node executes this decision). However, not all nodes and applications are created equal. Certain nodes may be more powerful than others and better suited to run resource-hungry PoDs. Similarly, some PoDs/deployments demand more compute (CPU), memory, or disk. How do we control the placement of PoDs/deployments - in other words, how do we override the default scheduling behavior? In this article, I am going to talk about methods to influence scheduling and ensure the right PoD lands on the right node.





Note
I have a simple 2-node cluster setup: 1 master and 1 worker node.
root@sathish-vm2:/home/sathish# kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
sathish-vm1   Ready    <none>   52d   v1.19.0
sathish-vm2   Ready    master   52d   v1.19.0

By default, PoDs will not be scheduled on the master node as it has a default taint set:

root@sathish-vm2:/home/sathish# kubectl describe nodes sathish-vm2 | grep  Taint
Taints:             node-role.kubernetes.io/master:NoSchedule

I will talk about taints and tolerations in a bit, but if you have a setup similar to mine, it is advisable to remove this taint before following along. To unset the taint:

root@sathish-vm2:/home/sathish# kubectl taint node sathish-vm2 node-role.kubernetes.io/master:NoSchedule-

# Note the - symbol at end of taint
node/sathish-vm2 untainted
root@sathish-vm2:/home/sathish# kubectl describe nodes sathish-vm2 | grep  Taint
Taints:             <none>

Let's try out some scheduling options


Manual Scheduling (nodeName selector)


It is possible to manually specify which node a PoD should be placed on with the nodeName field. Before trying this out, let's create a PoD with kubectl and see what the scheduler does.


root@sathish-vm2:/home/sathish# kubectl run busybox1 --image busybox --command sleep 3600
pod/busybox1 created
root@sathish-vm2:/home/sathish# kubectl get pods busybox1 -o yaml | grep node
  nodeName: sathish-vm1

The scheduler automatically chose the "sathish-vm1" node in my case.


Now, let's create one more pod busybox2, and place it in "sathish-vm2" using nodeName selector.


# Create a YAML template file

root@sathish-vm2:/home/sathish# kubectl run busybox2 --image busybox --command sleep 3600 --dry-run=client -o yaml > busybox2.yaml
root@sathish-vm2:/home/sathish# cat busybox2.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox2
  name: busybox2
spec:
  containers:
  - command:
    - sleep
    - "3600"
    image: busybox
    name: busybox2
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

# Edit the YAML file to add the nodeName field. Refer to the Kubernetes docs.
root@sathish-vm2:/home/sathish# cat busybox2.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox2
  name: busybox2
spec:
  containers:
  - command:
    - sleep
    - "3600"
    image: busybox
    name: busybox2
    resources: {}
  nodeName: sathish-vm2
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@sathish-vm2:/home/sathish# kubectl create -f busybox2.yaml
pod/busybox2 created
root@sathish-vm2:/home/sathish# kubectl get pods -o  wide --show-labels
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES   LABELS
busybox1   1/1     Running   0          4m56s   10.244.1.20   sathish-vm1   <none>           <none>            run=busybox1
busybox2   1/1     Running   0          70s     10.244.0.7    sathish-vm2   <none>           <none>            run=busybox2

So now we have 1 PoD on each node, and we were able to influence worker node selection with nodeName (the master node also acts as a worker node once the taint is removed).


Taints and Tolerations


Manual scheduling can be useful for a limited set of PoDs. However, if we have hundreds of PoDs, manual scheduling is cumbersome. This is where taints and tolerations become useful: they allow administrators to place objects with similar requirements on a particular set of nodes.


The way to think of taints and tolerations is like this: you taint a node with a key/value pair and an effect, and when creating objects like PoDs you give the object the "ability" to tolerate that taint with a toleration. This way, if we have a bunch of PoDs with high compute requirements, we can place them all on a node carrying a "highcpu" taint.


Taints and tolerations are described here


Nodes can be tainted with the following command


kubectl taint nodes node1 key1=value1:action


There are 3 possible effects:


1. NoSchedule: PoDs that do not have a matching toleration will not be scheduled on this node (by the scheduler), but PoDs already running on it will not be evicted.

2. PreferNoSchedule: Similar to NoSchedule; however, the scheduler will still schedule PoDs on the tainted node if it is not able to find other nodes that satisfy the PoD's requirements (like CPU).

3. NoExecute: No new PoDs without a matching toleration will be scheduled, and running PoDs without the toleration will be evicted from the node.
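
For reference, here is the general shape of a toleration stanza in a PoD spec that matches such a taint. This is a minimal sketch using the "cputype" key we will taint with shortly; operator "Exists" tolerates any value of the key, and leaving out "effect" would tolerate all three effects:

  tolerations:
  - key: "cputype"
    operator: "Exists"      # matches any value of the key
    effect: "NoSchedule"    # omit this field to tolerate all effects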


With this, let's start by tainting the "sathish-vm1" node:

root@sathish-vm2:/home/sathish# kubectl taint nodes sathish-vm1 cputype=highcpu:NoSchedule
node/sathish-vm1 tainted
root@sathish-vm2:/home/sathish# kubectl describe nodes sathish-vm1 | grep Taint
Taints:             cputype=highcpu:NoSchedule

As we chose the NoSchedule effect, busybox1 is not evicted from the node.



root@sathish-vm2:/home/sathish# kubectl  get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
busybox1   1/1     Running   0          53m   10.244.1.20   sathish-vm1   <none>           <none>
busybox2   1/1     Running   0          49m   10.244.0.7    sathish-vm2   <none>           <none>

Now I am going to change the effect to "NoExecute" so that "busybox1" is evicted from "sathish-vm1".


#Remove the NoSchedule Taint by adding "-" at end
root@sathish-vm2:/home/sathish# kubectl taint nodes sathish-vm1 cputype=highcpu:NoSchedule-
node/sathish-vm1 untainted
#Setting NoExecute Taint
root@sathish-vm2:/home/sathish# kubectl taint nodes sathish-vm1 cputype=highcpu:NoExecute
node/sathish-vm1 tainted

root@sathish-vm2:/home/sathish# kubectl  get pods -o wide
NAME       READY   STATUS        RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
busybox1   1/1     Terminating   0          57m   10.244.1.20   sathish-vm1   <none>           <none>
busybox2   1/1     Running       0          53m   10.244.0.7    sathish-vm2   <none>           <none>

root@sathish-vm2:/home/sathish# kubectl  get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
busybox2   1/1     Running   0          54m   10.244.0.7   sathish-vm2   <none>           <none>

Now let's re-create the busybox1 PoD, giving it a toleration for the "cputype=highcpu" taint.




root@sathish-vm2:/home/sathish# cat busybox1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox1
spec:
  containers:
  - command:
    - sleep
    - "3600"
    image: busybox
    name: busybox1
  tolerations:
  - key: "cputype"
    operator: "Equal"
    value: "highcpu"
    effect: "NoExecute"
    
root@sathish-vm2:/home/sathish# kubectl create -f busybox1.yaml
pod/busybox1 created
root@sathish-vm2:/home/sathish# kubectl  get pods  -o wide
NAME       READY   STATUS              RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
busybox1   0/1     ContainerCreating   0          5s    <none>       sathish-vm1   <none>           <none>
busybox2   1/1     Running             1          64m   10.244.0.7   sathish-vm2   <none>           <none>

root@sathish-vm2:/home/sathish# kubectl  get pods  -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
busybox1   1/1     Running   0          16s   10.244.1.21   sathish-vm1   <none>           <none>
busybox2   1/1     Running   1          64m   10.244.0.7    sathish-vm2   <none>           <none>

One point to note with tolerations is this: having a toleration for a specific taint does not mean that the PoD will always be scheduled on the node with the taint. It just means that the PoD can be scheduled anywhere on the cluster AND will also tolerate nodes with the taint.
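
If you want to both tolerate the taint and guarantee placement on the tainted node, the toleration can be combined with one of the placement mechanisms covered in this article, such as nodeName (or node affinity, discussed next). Here is a minimal sketch reusing the names from this setup; the PoD name is just illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-pinned
spec:
  nodeName: sathish-vm1       # pin the PoD to the tainted node
  tolerations:
  - key: "cputype"
    operator: "Equal"
    value: "highcpu"
    effect: "NoExecute"       # tolerate the taint so the PoD is not evicted
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]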


This brings us to the next requirement: let's say we want a set of PoDs to be scheduled on nodes with certain hardware - for example, httpd and redis PoDs should always be scheduled on nodes with SSDs, as they require fast disk reads. And how about scheduling closely coupled PoDs on the same node?


These questions take us to the next method of influencing scheduling: node affinity.


NodeAffinity


For the sake of discussion, let's assume we want both httpd and redis PoDs to be scheduled on any node with SSD disk drives. Let's check out how this is possible.


The first step is to label the nodes that have an SSD disk. In my case, I am going to label "sathish-vm1".



root@sathish-vm2:/home/sathish# kubectl label nodes sathish-vm1 disktype=ssd
node/sathish-vm1 labeled

root@sathish-vm2:/home/sathish# kubectl get nodes --show-labels
NAME          STATUS   ROLES    AGE   VERSION   LABELS
sathish-vm1   Ready    <none>   52d   v1.19.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm1,kubernetes.io/os=linux
sathish-vm2   Ready    master   52d   v1.19.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm2,kubernetes.io/os=linux,node-role.kubernetes.io/master=

Now, when creating PoDs, we can ask the scheduler to ensure that the object is created on nodes with the label "disktype=ssd". Let's deploy an httpd PoD.



root@sathish-vm2:/home/sathish# cat httpd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: httpd
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: httpd
    image: httpd


root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
httpd   1/1     Running   0          19s   10.244.1.18   sathish-vm1   <none>           <none>

"requiredDuringSchedulingIgnoredDuringExecution:" - this essentially means the selector is considered only during scheduling and ignored on running pods. This implies, if we delete the SSD label the HTTP PoD should still run on "sathish-vm1" and not evicted.



#Delete labels with "-" at end of the label name
root@sathish-vm2:/home/sathish# kubectl label nodes sathish-vm1 disktype-
node/sathish-vm1 labeled

root@sathish-vm2:/home/sathish# kubectl get nodes --show-labels
NAME          STATUS   ROLES    AGE   VERSION   LABELS
sathish-vm1   Ready    <none>   52d   v1.19.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm1,kubernetes.io/os=linux
sathish-vm2   Ready    master   52d   v1.19.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm2,kubernetes.io/os=linux,node-role.kubernetes.io/master=

root@sathish-vm2:/home/sathish# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
httpd   1/1     Running   0          7m59s

Let's look at the following section of the YAML file I used


affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd

The operator I have used is "In". Based on the requirement, you can choose to use other operators defined here; In, NotIn, Exists, DoesNotExist, Gt, and Lt are the supported operators.


Let's say I want to schedule PoDs on all nodes except master nodes. I can accomplish this using the "DoesNotExist" operator like so:



root@sathish-vm2:/home/sathish# kubectl get nodes --show-labels
NAME          STATUS   ROLES    AGE   VERSION   LABELS
sathish-vm1   Ready    <none>   53d   v1.19.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm1,kubernetes.io/os=linux
sathish-vm2   Ready    master   53d   v1.19.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sathish-vm2,kubernetes.io/os=linux,node-role.kubernetes.io/master=

            
affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: DoesNotExist
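
Node affinity also has a softer variant, preferredDuringSchedulingIgnoredDuringExecution, where the scheduler tries to honour the rule but will still place the PoD elsewhere if no matching node is available. A minimal sketch, reusing the disktype=ssd label from above:

affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1                  # 1-100; higher weight = stronger preference
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd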

Just like "nodeAffinity" selector is used to group PoDs on the same nodes, podAntiAffinity selector can be used to group PoDs requiring close coupling on the same node. This is accomplished by defining a topologyKey. Refer to Kubernetes docs here.


nodeName, taints/tolerations, and nodeAffinity allow us to place PoDs based on predefined criteria. How about scheduling based on the resources a PoD requires? Is it possible to request resources for PoDs - for instance, 1 GB of memory? This is possible with resource request definitions.


Resource Requests


Resource requests for containers are defined here


At a fundamental level, PoDs are container(s) that require resources like CPU, memory, and disk. Multiple PoDs can run on a worker node at any point in time, and hence the hardware resources of the node are shared. Each container gets default CPU and memory requests/limits from the LimitRange defined for its namespace. Limit ranges are described here.


The defaults set for a namespace (including the default namespace) can be overridden with per-container resource requests and limits.

root@sathish-vm2:/home/sathish# kubectl describe namespace default
Name:         default
Labels:       <none>
Annotations:  <none>
Status:       Active

No resource quota.

Let's look at the number of CPUs on my node.

root@sathish-vm2:/home/sathish# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
stepping        : 3
microcode       : 0xffffffff
cpu MHz         : 2711.998
cache size      : 8192 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
bogomips        : 5423.99
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
stepping        : 3
microcode       : 0xffffffff
cpu MHz         : 2711.998
cache size      : 8192 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
bogomips        : 5423.99
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual

I have dedicated 2 vCPU cores to each of my VMs. Let's say I want every container to default to a request of 0.5 vCPU (500m, or 50% of a core) and a limit of 1 vCPU unless it specifies otherwise - the "LimitRange" object makes this possible.


root@sathish-vm2:/home/sathish# cat default-cpu.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:
      cpu: 1
    defaultRequest:
      cpu: 0.5
    type: Container
    
root@sathish-vm2:/home/sathish# kubectl create -f default-cpu.yaml
limitrange/cpu-limit-range created

root@sathish-vm2:/home/sathish# kubectl describe namespace default
Name:         default
Labels:       <none>
Annotations:  <none>
Status:       Active

No resource quota.

Resource Limits
 Type       Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
 ----       --------  ---  ---  ---------------  -------------  -----------------------
 Container  cpu       -    -    500m             1              -
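
With this LimitRange in place, a container that does not declare any resources gets these defaults injected when the PoD is admitted. As a quick sketch (the PoD name is just illustrative), a PoD like the one below would show a 500m CPU request and a 1 CPU limit in kubectl describe, even though its manifest specifies neither:

apiVersion: v1
kind: Pod
metadata:
  name: defaulted-web
spec:
  containers:
  - name: web
    image: httpd
    # no resources block - the LimitRange defaults
    # (request: 500m, limit: 1 CPU) are applied automatically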

Now let's create a PoD that explicitly sets its own requests and limits, staying within this range.



root@sathish-vm2:/home/sathish# cat underlimit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: httpd
    resources:
      requests:
         cpu: "250m"
      limits:
        cpu: "500m"

root@sathish-vm2:/home/sathish# kubectl create -f underlimit.yaml
pod/web created

root@sathish-vm2:/home/sathish# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
web    1/1     Running   0          63s

As we can see, this works fine. Now let's try requesting more CPU than the nodes have available.


apiVersion: v1
kind: Pod
metadata:
  name: datacruncher
spec:
  containers:
  - name: datcruncher
    image: busybox
    resources:
      requests:
         cpu: "600m"
      limits:
        cpu: "800m"

root@sathish-vm2:/home/sathish# kubectl create -f overlimit.yaml
pod/datacruncher created

root@sathish-vm2:/home/sathish# kubectl describe pods  datacruncher
Name:         datacruncher
Namespace:    default
Priority:     0
Node:         sathish-vm2/172.28.147.38
Start Time:   Mon, 07 Dec 2020 07:07:34 +0000
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  datcruncher:
    Container ID:
    Image:          busybox
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:  800m
    Requests:
      cpu:        600m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spkmz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-spkmz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-spkmz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From                  Message
  ----     ------            ----  ----                  -------
  Warning  FailedScheduling  71s                         0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
  Warning  FailedScheduling  71s                         0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
  Normal   Scheduled         18s                         Successfully assigned default/datacruncher to sathish-vm2
  Normal   Pulling           18s   kubelet, sathish-vm2  Pulling image "busybox"

As we can see from the events, scheduling initially failed because no node had enough free CPU (and the other node carried an unreachable taint at the time); the PoD stayed Pending until the request could be satisfied.


Daemonsets


Let's discuss another requirement: when a node joins the cluster, you may want to run PoDs that set things up on that node or run agents like a logger. Such a PoD should run on every node and be removed only when the node goes down. This requirement can be accomplished with DaemonSets.


Kubernetes uses daemonsets internally to set up its proxy and networking components.


root@sathish-vm2:/home/sathish# kubectl get daemonset --all-namespaces
NAMESPACE     NAME              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   kube-flannel-ds   2         2         2       2            2           <none>                   53d
kube-system   kube-proxy        2         2         2       2            2           kubernetes.io/os=linux   53d

root@sathish-vm2:/home/sathish# kubectl get pods -n kube-system -o wide
NAME                                  READY   STATUS    RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
coredns-f9fd979d6-d8wzr               1/1     Running   0          53d   10.244.0.3      sathish-vm2   <none>           <none>
coredns-f9fd979d6-xcxzc               1/1     Running   0          53d   10.244.0.2      sathish-vm2   <none>           <none>
etcd-sathish-vm2                      1/1     Running   0          53d   172.28.147.38   sathish-vm2   <none>           <none>
kube-apiserver-sathish-vm2            1/1     Running   0          53d   172.28.147.38   sathish-vm2   <none>           <none>
kube-controller-manager-sathish-vm2   1/1     Running   24         53d   172.28.147.38   sathish-vm2   <none>           <none>
kube-flannel-ds-cq56k                 1/1     Running   0          53d   172.28.147.38   sathish-vm2   <none>           <none>
kube-flannel-ds-x7prc                 1/1     Running   0          53d   172.28.147.44   sathish-vm1   <none>           <none>
kube-proxy-lcf25                      1/1     Running   0          53d   172.28.147.44   sathish-vm1   <none>           <none>
kube-proxy-tf8z8                      1/1     Running   0          53d   172.28.147.38   sathish-vm2   <none>           <none>
kube-scheduler-sathish-vm2            1/1     Running   19         53d   172.28.147.38   sathish-vm2   <none>           <none>
metrics-server-56c59cf9ff-9qldp       1/1     Running   0          90m   10.244.0.6      sathish-vm2   <none>           <none>

Let's create a daemonset



root@sathish-vm2:/home/sathish# cat daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elasticsearch
spec:
  selector:
    matchLabels:
      name: elasticsearch
  template:
    metadata:
      labels:
        name:  elasticsearch
    spec:

      containers:
      - name: elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2

root@sathish-vm2:/home/sathish# kubectl create -f daemonset.yaml
daemonset.apps/elasticsearch created
root@sathish-vm2:/home/sathish# kubectl get daemonsets
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
elasticsearch   2         2         0       2            0           <none>          6s

root@sathish-vm2:/home/sathish# kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
elasticsearch-jjvbg   1/1     Running   0          2m22s   10.244.1.18   sathish-vm1   <none>           <none>
elasticsearch-pztcj   1/1     Running   0          2m22s   10.244.0.12   sathish-vm2   <none>           <none>
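
Note that DaemonSet PoDs are still subject to taints: system DaemonSets like kube-proxy and flannel ship with tolerations, which is why they run on master nodes even when the default taint is present. If that taint were still in place on my cluster, the elasticsearch DaemonSet would need a similar toleration added under its PoD template. A minimal sketch of that addition, assuming the standard node-role.kubernetes.io/master:NoSchedule taint:

spec:
  template:
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists          # tolerate the master taint
        effect: NoSchedule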

Static Pods


It is possible to create PoDs on a node even when the node has not joined a cluster. The kubelet running on each node watches a directory for PoD definitions and creates the PoDs it finds there. These static PoDs are recreated after deletion as long as the manifest file exists.


Kubernetes creates static PoDs for some of its components. Here is the list of static PoD manifests on a master node:



root@sathish-vm2:/etc/kubernetes/manifests# ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

But how does the kubelet know the path to look for static PoD manifests? Let's examine the kubelet process and check its configuration file.


root@sathish-vm2:/etc/kubernetes/manifests# ps -aux | grep kubelet | grep config
root      140991  2.9  1.4 1957096 105976 ?      Ssl  Dec06  21:45 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2

root@sathish-vm2:/etc/kubernetes/manifests# cat /var/lib/kubelet/config.yaml | grep staticPodPath
staticPodPath: /etc/kubernetes/manifests

These settings indicate that static PoD manifests are located in "/etc/kubernetes/manifests". You can create additional YAML files in this location - the PoDs will be automatically spun up as long as the YAML files exist.
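
As a quick experiment, dropping a manifest like the sketch below into /etc/kubernetes/manifests makes the kubelet on that node start the PoD directly, without going through the scheduler (the file name and PoD name are just illustrative):

# /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: httpd

The kubelet also creates a read-only mirror PoD for it on the API server, so it shows up in kubectl get pods with the node name appended (for example, static-web-sathish-vm2).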


Well, that's it for now. I hope this post was useful and helped you gain a basic understanding of scheduling in Kubernetes and how you can control the placement of PoDs. Thanks for your time.
