nodeSelector
The simplest scheduling mechanism is nodeSelector: just add a nodeSelector to the Pod spec that matches a node's labels, and the Pod will be scheduled onto such a node. For example:
View the node's labels
➜ ~ kubectl get node k8s01 --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s01 Ready control-plane,master 564d v1.22.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s01,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
The Pod selects the label via spec.nodeSelector
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
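For this selector to match, the target node must carry the disktype=ssd label. A minimal sketch of adding it (assuming k8s01 is the intended node):
# Label the node so that nodeSelector disktype: ssd can match it
kubectl label nodes k8s01 disktype=ssd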
🩱 Drawback: if no node satisfies the selector, the Pod gets stuck in Pending and is never scheduled.
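A quick way to diagnose a stuck Pod (a sketch, assuming the Pod above is named nginx):
# STATUS stays Pending because no node carries disktype=ssd
kubectl get pod nginx
# The Events section lists a FailedScheduling event explaining why no node matched
kubectl describe pod nginx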
Affinity and anti-affinity policies
➜ ~ kubectl explain pod.spec.affinity
KIND: Pod
VERSION: v1
RESOURCE: affinity <Object>
DESCRIPTION:
If specified, the pod's scheduling constraints
Affinity is a group of affinity scheduling rules.
FIELDS:
nodeAffinity <Object>
Describes node affinity scheduling rules for the pod.
podAffinity <Object>
Describes pod affinity scheduling rules (e.g. co-locate this pod in the
same node, zone, etc. as some other pod(s)).
podAntiAffinity <Object>
Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
in the same node, zone, etc. as some other pod(s)).
Nodes only have affinity, nodeAffinity: if a node satisfies the rule, the Pod is scheduled onto that node.
Pods have affinity (podAffinity), anti-affinity (podAntiAffinity), and a topology key (topologyKey):
- For podAffinity and podAntiAffinity, Kubernetes scopes the rules with a concept called `topologyKey`. The topologyKey splits the nodes into topology domains.
- e.g. topologyKey: "kubernetes.io/hostname" splits the topology domains by hostname. In that case:
  - podAffinity: within each topology domain, if a Pod matching the rule already exists, this Pod may be scheduled into that domain.
  - podAntiAffinity: within each topology domain, if a Pod matching the rule already exists, this Pod may not be scheduled into that domain.
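With the hostname key, every node is its own domain. As a sketch of a larger domain (assuming nodes carry the standard topology.kubernetes.io/zone label), the same anti-affinity rule would spread Pods across zones rather than across hosts:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - store
        topologyKey: "topology.kubernetes.io/zone"   # one domain per zone, not per node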
Types of scheduling policy
nodeAffinity, podAffinity, and podAntiAffinity are supported.
➜ ~ kubectl explain pod.spec.affinity.podAffinity
KIND: Pod
VERSION: v1
RESOURCE: podAffinity <Object>
DESCRIPTION:
Describes pod affinity scheduling rules (e.g. co-locate this pod in the
same node, zone, etc. as some other pod(s)).
Pod affinity is a group of inter pod affinity scheduling rules.
FIELDS:
preferredDuringSchedulingIgnoredDuringExecution <[]Object>
The scheduler will prefer to schedule pods to nodes that satisfy the
affinity expressions specified by this field, but it may choose a node that
violates one or more of the expressions. The node that is most preferred is
the one with the greatest sum of weights, i.e. for each node that meets all
of the scheduling requirements (resource request, requiredDuringScheduling
affinity expressions, etc.), compute a sum by iterating through the
elements of this field and adding "weight" to the sum if the node has pods
which matches the corresponding podAffinityTerm; the node(s) with the
highest sum are the most preferred.
requiredDuringSchedulingIgnoredDuringExecution <[]Object>
If the affinity requirements specified by this field are not met at
scheduling time, the pod will not be scheduled onto the node. If the
affinity requirements specified by this field cease to be met at some point
during pod execution (e.g. due to a pod label update), the system may or
may not try to eventually evict the pod from its node. When there are
multiple elements, the lists of nodes corresponding to each podAffinityTerm
are intersected, i.e. all terms must be satisfied.
- Hard constraint: requiredDuringSchedulingIgnoredDuringExecution — the rule must be satisfied at scheduling time + changes during execution are ignored. The Pod is scheduled only if a node satisfies the rule; if none does, the scheduler keeps retrying. IgnoredDuringExecution means that if the rule stops holding after the Pod is running, the Pod keeps running.
- Soft constraint: preferredDuringSchedulingIgnoredDuringExecution — the rule is satisfied at scheduling time if possible + changes during execution are ignored. The scheduler prefers nodes that satisfy the rule; if none does, the rule is ignored and the Pod is placed as usual. IgnoredDuringExecution means that if the rule stops holding after the Pod is running, the Pod keeps running.
- Soft constraints carry weights: several preferred terms can be set at once, and the scheduler ranks nodes by the sum of the weights of the terms they satisfy.
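A minimal sketch of a weighted soft constraint (assuming some nodes carry a disktype=ssd label):
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                  # nodes matching this term score 80 points higher
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd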
💖 The policies can be combined.
Scheduling policies that may be supported in the future
- requiredDuringSchedulingRequiredDuringExecution — the rule must be satisfied at scheduling time + changes during execution are not ignored. The Pod is scheduled only if a node satisfies the rule; if none does, the scheduler keeps retrying. RequiredDuringExecution means that if the rule stops holding while the Pod is running, the Pod is evicted and rescheduled.
- preferredDuringSchedulingRequiredDuringExecution — the rule is satisfied at scheduling time if possible + changes during execution are not ignored. The scheduler prefers nodes that satisfy the rule; if none does, the rule is ignored and the Pod is placed as usual. RequiredDuringExecution means that if the rule stops holding while the Pod is running, the Pod is evicted and rescheduled.
Example 1
Goal: every node (topology domain) hosts exactly one web-server Pod and one redis Pod.
redis manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity: # anti-affinity: domains are split by hostname; if a domain already has a Pod with app=store, this Pod may not be scheduled into it
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
web-server manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity: # anti-affinity: domains are split by hostname; if a domain already has a Pod with app=web-store, this Pod may not be scheduled into it
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity: # affinity: domains are split by hostname; this Pod may be scheduled into a domain that already has a Pod with app=store
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine
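After applying both Deployments, the placement can be checked as follows (a sketch, assuming a cluster with three schedulable nodes):
# Each node should end up with exactly one redis Pod and one web-server Pod
kubectl get pods -o wide -l 'app in (store, web-store)'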
💛 If there are more Pods than nodes and they should still be spread as evenly as possible, the Pod anti-affinity should use preferredDuringSchedulingIgnoredDuringExecution instead: once every node already hosts one Pod, further Pods can still be scheduled.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - web-store
          topologyKey: "kubernetes.io/hostname"
Example 2
Dedicated nodes for the production (prod) server group
# Add a taint so that non-prod workloads cannot be scheduled onto this node
kubectl taint nodes k8s001 dedicated=prod:NoSchedule
# Add a label
kubectl label nodes k8s001 dedicated=prod
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    dedicated: prod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - prod
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "prod"
    effect: "NoSchedule"
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
affinity.nodeAffinity ensures the Pod can only be scheduled onto a node carrying the dedicated=prod label, and k8s001 carries that label.
- 🩱 Note: if the node's dedicated label later changes away from prod, the Pod will NOT be rescheduled onto another node that satisfies the rule (IgnoredDuringExecution).
tolerations ensures the nginx Pod is allowed onto a node carrying the dedicated=prod:NoSchedule taint, and k8s001 carries that taint.
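To confirm the node is set up as expected, both the taint and the label can be inspected (a sketch):
# Show the taints configured on k8s001
kubectl get node k8s001 -o jsonpath='{.spec.taints}'
# Show the node labels (dedicated=prod should be listed)
kubectl get node k8s001 --show-labels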