
Kubernetes – taming PODs

In this article, we will answer questions such as how to control where PODs will be launched and how the kube-scheduler works. We will also discuss a number of tools used to determine the relationship of PODs to NODEs: labels, nodeSelector, affinity, and tainting NODEs.
The following examples were carried out on a Kubernetes lab consisting of two NODEs, so we recommend access to a similar setup. We described the process of creating such a test cluster in the article Combining two EuroLinux 8 machines into a Kubernetes cluster.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
euro1 Ready control-plane 35d v1.24.0
euro2 Ready <none> 35d v1.24.0
Kubernetes scheduler
Scheduling is a process that manages the running of PODs on appropriately matched NODEs so that the kubelet process can handle them.
The scheduler process can be presented in the following steps:
1. Waiting for a new POD to appear that does not have an assigned NODE.
2. Finding the best NODE for each detected POD.
3. Informing the API of the selection.
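PODs from step 1, i.e. those that have not yet been assigned to any NODE, are reported with the Pending status. A simple way to list them (a hedged example, output depends on the cluster state):
kubectl get pods --all-namespaces --field-selector=status.phase=Pending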
kube-scheduler
kube-scheduler is the default Kubernetes scheduler that runs as part of the control-plane. It is designed so that other scheduling components, written independently or by third parties, can also be used.
Selecting the best NODE for a new POD requires appropriate filtering of the available NODEs. Those NODEs that meet the scheduling requirements are called feasible NODEs. When none of the NODEs is suitable, the POD remains “unscheduled” until the scheduler finds a suitable NODE for it. After selecting the feasible NODEs, the scheduler runs a set of scoring functions on them and selects the NODE with the highest score. The final step is to inform the API server of the selection in a process called binding.
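The result of the binding can be inspected in the POD's events: for a scheduled POD, kubectl describe pod typically shows a Scheduled event from default-scheduler, while for an “unscheduled” POD a FailedScheduling event explains why no feasible NODE was found. A quick way to look at these events (POD-NAME is a placeholder):
kubectl describe pod POD-NAME | grep -A 5 Events: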
Assigning PODs to NODEs
Kubernetes allows assigning PODs to defined NODE classes. All recommended methods use label selectors. Usually such assignment is not necessary, because the scheduler automatically places PODs on NODEs that have the appropriate resources. On the other hand, one can easily imagine a case where defining an additional class of NODEs is useful, for example when you need access to “fast” SSD storage or want NODEs to belong to one “fast” LAN.
NODE labels
Like many other Kubernetes objects, NODEs have labels. These can be assigned manually. In addition, Kubernetes implements a standard set of labels for all NODEs in the cluster. It’s worth getting to know them for troubleshooting purposes. However, we will not elaborate on that in this article.
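You can inspect a NODE's labels, including the standard ones such as kubernetes.io/hostname, kubernetes.io/os or kubernetes.io/arch, for example with:
kubectl get nodes --show-labels
or, for a single NODE:
kubectl describe node euro1 | grep -A 10 Labels: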
Adding labels allows you to run PODs on a selected group of NODEs. This method is often used to ensure that applications run on specially isolated NODEs that meet specific security requirements. In this case, it is recommended to choose a label key that the kubelet cannot modify. This prevents a compromised NODE from setting such a label on itself. This can be done in the following steps:
- make sure you are using the Node authorizer and that the NodeRestriction admission plugin is enabled:
kubectl -n kube-system describe po kube-apiserver-euro1|grep NodeRestriction
--enable-admission-plugins=NodeRestriction
- add a label with the prefix node-restriction.kubernetes.io/ to selected NODEs:
kubectl label nodes euro1 node-restriction.kubernetes.io/supersecure=true
- we use these labels in the nodeSelector field:
cat << EOF | tee deployment-supersecure.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-supersecure
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
      nodeSelector:
        node-restriction.kubernetes.io/supersecure: "true"
EOF
Running the deployment:
kubectl apply -f deployment-supersecure.yaml
deployment.apps/deployment-supersecure created
Verification of POD distribution:
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deployment-supersecure-5d4ccb7468-gf54d 1/1 Running 0 20s 10.33.0.20 euro1 <none> <none>
deployment-supersecure-5d4ccb7468-vlncv 1/1 Running 0 20s 10.33.0.18 euro1 <none> <none>
deployment-supersecure-5d4ccb7468-znpfw 1/1 Running 0 20s 10.33.0.19 euro1 <none> <none>
Removing the sample deployment:
kubectl delete deployments.apps deployment-supersecure
deployment.apps "deployment-supersecure" deleted
nodeSelector
The nodeSelector field can be added to the POD specification. It contains a list of labels that a NODE must have in order for the Kubernetes scheduler to run the POD on it. The NODE must have all of the listed labels.
Example:
Giving the NODEs euro1 and euro2 the label a=a:
kubectl label nodes euro1 euro2 a=a
node/euro1 labeled
node/euro2 labeled
Giving the euro2 NODE a b=b label:
kubectl label nodes euro2 b=b
node/euro2 labeled
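You can verify which NODEs carry these labels, e.g. by displaying them as columns or filtering with a selector (in our lab only euro2 should match both labels):
kubectl get nodes -L a,b
kubectl get nodes -l a=a,b=b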
Rewriting the previous sample deployment to require the labels a=a and b=b:
cat << EOF | tee deployment-ab.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-ab
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
      nodeSelector:
        a: "a"
        b: "b"
EOF
Running the deployment:
kubectl apply -f deployment-ab.yaml
deployment.apps/deployment-ab created
Verification of POD distribution:
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deployment-ab-74d6cc75b9-74qsq 1/1 Running 0 8s 10.33.2.19 euro2 <none> <none>
deployment-ab-74d6cc75b9-bflh9 1/1 Running 0 8s 10.33.2.18 euro2 <none> <none>
deployment-ab-74d6cc75b9-gvt7l 1/1 Running 0 8s 10.33.2.20 euro2 <none> <none>
Deleting the old deployment:
kubectl delete deployment deployment-ab
deployment.apps "deployment-ab" deleted
Affinity / anti-affinity
nodeSelector is a simplified method of assigning PODs to NODEs. The affinity (linkage) and anti-affinity fields greatly expand the possibilities of tying PODs to NODEs, as well as PODs to PODs.
nodeAffinity
nodeAffinity (linking NODEs) works similarly to nodeSelector. There are two types of nodeAffinity:
- requiredDuringSchedulingIgnoredDuringExecution – the Kubernetes scheduler can only run a POD if the rule is satisfied. The rule can be specified in a more complex way compared to nodeSelector, where the only option is to match all labels
- preferredDuringSchedulingIgnoredDuringExecution – Kubernetes will try to select a NODE that meets the rule. However, this is only a preference with an assigned weight.
IgnoredDuringExecution should be understood as follows: if the NODE's labels change while the POD is already running (DuringExecution), this will not interfere with the operation of the POD.
Example of a POD configuration tied to a NODE with the label a=a, with a preference for NODEs with the label b=b (minimum weight equal to 1) and the label node-role.kubernetes.io/control-plane= (maximum weight equal to 100):
cat <<EOF | tee node-affinity-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a
            operator: In
            values:
            - a
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: b
            operator: In
            values:
            - b
      - weight: 100
        preference:
          matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
  containers:
  - image: nginx
    name: node-affinity
EOF
Running a POD:
kubectl apply -f node-affinity-pod.yaml
The result of the kubectl get pods -o wide command should look like this:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-affinity-pod 1/1 Running 0 19s 10.33.0.23 euro1 <none> <none>
POD can be deleted with the command:
kubectl delete pod node-affinity-pod
In the following example, we will configure a deployment consisting of 4 PODs analogous to the POD above:
cat << EOF | tee node-affinity-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-deployment
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: a
                operator: In
                values:
                - a
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: b
                operator: In
                values:
                - b
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
      containers:
      - name: node-affinity-deployment
        image: nginx
EOF
Running the deployment:
kubectl apply -f node-affinity-deployment.yaml
Checking the distribution of PODs among NODEs:
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-affinity-deployment-bbc88d9-25qr2 1/1 Running 1 (2m2s ago) 14m 10.33.0.30 euro1 <none> <none>
node-affinity-deployment-bbc88d9-947x5 1/1 Running 1 (2m2s ago) 14m 10.33.0.28 euro1 <none> <none>
node-affinity-deployment-bbc88d9-pg4zz 1/1 Running 1 (2m2s ago) 14m 10.33.0.33 euro1 <none> <none>
node-affinity-deployment-bbc88d9-s9vsf 1/1 Running 1 (2m2s ago) 14m 10.33.0.32 euro1 <none> <none>
All PODs were run on the control-plane NODE (euro1) due to the higher preference weight assigned to it.
In the following example, we will set two equal preference weights.
cat << EOF | tee node-affinity-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-deployment
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: a
                operator: In
                values:
                - a
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: b
                operator: In
                values:
                - b
          - weight: 1
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
      containers:
      - name: node-affinity-deployment
        image: nginx
EOF
Applying configuration changes:
kubectl apply -f node-affinity-deployment.yaml
Verification of POD distribution:
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-affinity-deployment-dc8df7bbf-nlkcz 1/1 Running 0 72s 10.33.2.24 euro2 <none> <none>
node-affinity-deployment-dc8df7bbf-nq88k 1/1 Running 0 72s 10.33.0.34 euro1 <none> <none>
node-affinity-deployment-dc8df7bbf-nxt85 1/1 Running 0 70s 10.33.2.25 euro2 <none> <none>
node-affinity-deployment-dc8df7bbf-rpt5k 1/1 Running 0 69s 10.33.0.35 euro1 <none> <none>
The PODs are evenly distributed between the NODEs: euro1 matches the node-role.kubernetes.io/control-plane preference with a weight of 1, and euro2 matches the b=b preference, also with a weight of 1.
What happens if we remove the label a=a from the euro1 NODE and the label b=b from the euro2 NODE?
kubectl label nodes euro1 a- ; kubectl label nodes euro2 b-
node/euro1 unlabeled
node/euro2 unlabeled
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-affinity-deployment-dc8df7bbf-nlkcz 1/1 Running 0 32m 10.33.2.24 euro2 <none> <none>
node-affinity-deployment-dc8df7bbf-nq88k 1/1 Running 0 32m 10.33.0.34 euro1 <none> <none>
node-affinity-deployment-dc8df7bbf-nxt85 1/1 Running 0 32m 10.33.2.25 euro2 <none> <none>
node-affinity-deployment-dc8df7bbf-rpt5k 1/1 Running 0 32m 10.33.0.35 euro1 <none> <none>
Nothing has changed. Removing the labels did not affect the distribution of PODs already running. On the other hand, after a restart all PODs will go to euro2, since only this NODE still has the a=a label assigned.
kubectl rollout restart deployment node-affinity-deployment
deployment.apps/node-affinity-deployment restarted
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-affinity-deployment-979965644-45kch 1/1 Running 0 114s 10.33.2.27 euro2 <none> <none>
node-affinity-deployment-979965644-88gng 1/1 Running 0 110s 10.33.2.28 euro2 <none> <none>
node-affinity-deployment-979965644-gjlwv 1/1 Running 0 114s 10.33.2.26 euro2 <none> <none>
node-affinity-deployment-979965644-mnl6s 1/1 Running 0 109s 10.33.2.29 euro2 <none> <none>
You can delete the sample deployment with the command:
kubectl delete deployments node-affinity-deployment
deployment.apps "node-affinity-deployment" deleted
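Note that after these examples euro2 still carries the a=a label. If you want to start the next section with clean NODEs, you can remove it as well:
kubectl label nodes euro2 a-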
Inter-pod affinity / anti-affinity
The principle of inter-pod affinity or isolation (anti-affinity) takes this form: this POD should (or should not) run on X, provided that PODs satisfying rule Y are already running on X. X is a topology defined by a topologyKey label. Y is a label selector rule with an optional list of namespaces.
There are two types (similar to nodeAffinity):
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution
For the purposes of the demonstration, we will configure two PODs.
cat << EOF | tee examplePODs.yaml
apiVersion: v1
kind: Pod
metadata:
  name: euro1-pod
  labels:
    nr: "1"
spec:
  nodeSelector:
    kubernetes.io/hostname: euro1
  containers:
  - image: nginx
    name: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: euro2-pod
  labels:
    nr: "2"
spec:
  nodeSelector:
    kubernetes.io/hostname: euro2
  containers:
  - image: nginx
    name: nginx
EOF
Running the PODs:
kubectl apply -f examplePODs.yaml
pod/euro1-pod created
pod/euro2-pod created
kubectl get pods -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
euro1-pod 1/1 Running 0 18m 10.33.0.36 euro1 <none> <none> nr=1
euro2-pod 1/1 Running 0 18m 10.33.2.30 euro2 <none> <none> nr=2
In the next step of the demonstration, we will run 2 PODs associated with the euro1-pod and euro2-pod PODs, respectively, using the nr label.
cat << EOF | tee podAffinity-pods.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nr1-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: nr
            operator: In
            values:
            - "1"
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: nr2-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: nr
            operator: In
            values:
            - "2"
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx
EOF
Launching and verifying the distribution of PODs:
kubectl apply -f podAffinity-pods.yaml
pod/nr1-pod created
pod/nr2-pod created
kubectl get pods -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
euro1-pod 1/1 Running 0 70m 10.33.0.36 euro1 <none> <none> nr=1
euro2-pod 1/1 Running 0 70m 10.33.2.30 euro2 <none> <none> nr=2
nr1-pod 1/1 Running 0 2m20s 10.33.0.37 euro1 <none> <none> <none>
nr2-pod 1/1 Running 0 2m20s 10.33.2.31 euro2 <none> <none> <none>
Analogous to nodeAffinity, you can define podAffinity as a preference with an assigned weight (preferredDuringSchedulingIgnoredDuringExecution). In addition, you can define podAntiAffinity, that is, choose which PODs the configured PODs should not (or must not) run with in the same topology; a short sketch is shown below.
It is important that the selected topologyKey is consistently defined on every NODE. The kubernetes.io/hostname label, used in the example, is defined automatically. You can also choose a different topology key, such as topologyKey: city. In this case, make sure that each NODE is assigned this label (e.g. using the kubectl label nodes NODE city=Krakow command).
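For illustration only (this fragment was not applied in our lab), a minimal podAntiAffinity sketch that forbids scheduling a POD on a NODE already running a POD labeled nr=1 could look like this:
# fragment of a POD spec (spec.affinity)
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: nr
            operator: In
            values:
            - "1"
        topologyKey: kubernetes.io/hostname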
You can remove the PODs used for the demonstration:
kubectl delete pods euro1-pod euro2-pod nr1-pod nr2-pod
Taints and Tolerations
Taints are the opposite of nodeAffinity. A NODE marked with a taint cannot be selected by the scheduler for a POD that does not tolerate that taint.
Adding a taint using kubectl taint:
kubectl taint nodes euro1 node-role.kubernetes.io/control-plane:NoSchedule
node/euro1 tainted
Most often, the control-plane NODE has this taint added by default.
Taints can also be defined as KEY=VALUE pairs. Example:
kubectl taint nodes euro2 skaza=tragiczna:NoSchedule
Removing taints (operator -):
kubectl taint nodes euro2 skaza=tragiczna:NoSchedule-
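A toleration matching such a KEY=VALUE taint would use the Equal operator instead of Exists (a minimal sketch of a POD spec fragment, not used in our lab):
# fragment of a POD spec (spec.tolerations)
  tolerations:
  - key: "skaza"
    operator: "Equal"
    value: "tragiczna"
    effect: "NoSchedule"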
After following the above steps, an attempt to run a standard nginx deployment should proceed as follows:
kubectl create deployment nginx --image nginx --replicas 3
deployment.apps/nginx created
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-8f458dc5b-jd56c 1/1 Running 0 68s 10.33.2.36 euro2 <none> <none>
nginx-8f458dc5b-p72s6 1/1 Running 0 68s 10.33.2.37 euro2 <none> <none>
nginx-8f458dc5b-tr5jn 1/1 Running 0 68s 10.33.2.38 euro2 <none> <none>
All PODs have “landed” on euro2. The NODE euro1 is tainted.
Taints can have the following effects:
- NoSchedule – the Kubernetes scheduler will not start new PODs without the specified toleration
- PreferNoSchedule – a preference not to use the NODE; if no other NODE is feasible, the scheduler will run the POD on the marked NODE
- NoExecute – all PODs having no toleration for this taint will be evicted.
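You can check which taints are currently set on the NODEs, for example:
kubectl describe node euro1 | grep Taints
or, for all NODEs at once (a jsonpath-based variant):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'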
The next example deals with the difference between the NoSchedule and NoExecute taint effects:
kubectl taint nodes euro2 skaza:NoSchedule
node/euro2 tainted
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-8f458dc5b-jd56c 1/1 Running 0 7m28s 10.33.2.36 euro2 <none> <none>
nginx-8f458dc5b-p72s6 1/1 Running 0 7m28s 10.33.2.37 euro2 <none> <none>
nginx-8f458dc5b-tr5jn 1/1 Running 0 7m28s 10.33.2.38 euro2 <none> <none>
PODs continue to run on the tainted NODE, but new PODs cannot be started by the scheduler, since all NODEs are now “tainted”.
kubectl create deployment nginx2 --image nginx
deployment.apps/nginx2 created
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-8f458dc5b-jd56c 1/1 Running 0 11m 10.33.2.36 euro2 <none> <none>
nginx-8f458dc5b-p72s6 1/1 Running 0 11m 10.33.2.37 euro2 <none> <none>
nginx-8f458dc5b-tr5jn 1/1 Running 0 11m 10.33.2.38 euro2 <none> <none>
nginx2-7cc8cd4598-tpp2s 0/1 Pending 0 3s <none> <none> <none> <none>
Then we add the NoExecute taint:
kubectl taint nodes euro2 handsUP:NoExecute
node/euro2 tainted
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-8f458dc5b-29h9v 0/1 Pending 0 48s <none> <none> <none> <none>
nginx-8f458dc5b-2bj96 0/1 Pending 0 48s <none> <none> <none> <none>
nginx-8f458dc5b-twf8z 0/1 Pending 0 48s <none> <none> <none> <none>
nginx2-7cc8cd4598-tpp2s 0/1 Pending 0 8m20s <none> <none> <none> <none>
PODs running on euro2 have been evicted.
A toleration is a property of a POD that allows it to run even though the NODE is “tainted”. Tolerations are defined in the POD specification (PodSpec). For example:
cat << EOF | tee toleration.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerancyjny-pod
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "skaza"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "handsUP"
    operator: "Exists"
    effect: "NoExecute"
EOF
Running a POD:
kubectl apply -f toleration.yaml
pod/tolerancyjny-pod created
Verifying the state of the POD:
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-8f458dc5b-29h9v 0/1 Pending 0 5m3s <none> <none> <none> <none>
nginx-8f458dc5b-2bj96 0/1 Pending 0 5m3s <none> <none> <none> <none>
nginx-8f458dc5b-twf8z 0/1 Pending 0 5m3s <none> <none> <none> <none>
nginx2-7cc8cd4598-98h5r 0/1 Pending 0 5m3s <none> <none> <none> <none>
tolerancyjny-pod 1/1 Running 0 68s 10.33.2.46 euro2 <none> <none>
The POD was started on a tainted euro2 host.
Removing the “taints” from euro2 will allow the suspended PODs to be scheduled and started again.
kubectl taint nodes euro2 skaza:NoSchedule-
node/euro2 untainted
kubectl taint nodes euro2 handsUP:NoExecute-
node/euro2 untainted
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7588f7b96-45f9m 1/1 Running 0 11m 10.33.2.50 euro2 <none> <none>
nginx-7588f7b96-f6745 1/1 Running 0 11m 10.33.2.47 euro2 <none> <none>
nginx-7588f7b96-tws56 1/1 Running 0 11m 10.33.2.48 euro2 <none> <none>
nginx2-7cc8cd4598-98h5r 1/1 Running 0 11m 10.33.2.49 euro2 <none> <none>
tolerancyjny-pod 1/1 Running 0 7m46s 10.33.2.46 euro2 <none> <none>
Summary
In this article, we introduced the most important Kubernetes mechanisms for controlling the placement of PODs in a cluster. We used concepts such as nodeSelector, affinity and taints in uncomplicated examples on a cluster consisting of only 2 NODEs. We also explained what a key component of Kubernetes, the scheduler, does.