Extending Kubernetes: The Scheduler

王磊-AI基础 • 2023-01-02 • Cloud Technology Community • 541 views

Introduction

The job of the Kubernetes scheduler is to bind each pod to the most suitable node, so that the kubelet on that node can take over the next steps.

Extend Kubernetes series: Extend Kubernetes - Kubectl Plugin; Extend Kubernetes - FlexVolume And CSI; Extend Kubernetes - CRI; Extend Kubernetes - CNI

Where the Scheduler Sits

[Figure: where the scheduler sits in the Kubernetes architecture]

How the Scheduler Works

Note: the flow below describes the scheduler based on Predicates and Priorities, which applies to v1.0.0 ~ v1.14.0.

[Figure: scheduler workflow]

Phase 1: Predicates (a.k.a. Filtering)
1. Find qualified nodes which pass all Predicates.
2. If none is qualified, see if preempting low-priority Pods helps.
Phase 2: Priorities (a.k.a. Scoring)
1. For each "filtered" node, score it based on Priorities.
2. The node with the highest score will be chosen as the running node.
The whole scheduling process therefore has two steps: Predicates (Filtering) and Priorities (Scoring).
The default scheduling policy is DefaultProvider. Many predefined policies are available, and DefaultProvider enables a subset of them by default.
You can pass kube-scheduler the startup flag --policy-config-file to point at a custom JSON file, in which you assemble your own Predicates and Priorities policies following the expected format.
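The two phases can be illustrated with a small Go sketch. It is not kube-scheduler code: the Node, Pod, Predicate and Priority types and the fits and mostFree functions are hypothetical stand-ins chosen to show the filter-then-score shape only.

```go
package main

import "fmt"

// Node and Pod are simplified, hypothetical stand-ins for the real API objects.
type Node struct {
	Name      string
	FreeMilli int64 // CPU not yet requested, in millicores
}

type Pod struct {
	Name         string
	RequestMilli int64
}

// Predicate reports whether the pod may run on the node (phase 1).
type Predicate func(pod Pod, node Node) bool

// Priority scores a node for the pod; higher is better (phase 2).
type Priority func(pod Pod, node Node) int64

// schedule filters nodes with every predicate, sums the priority scores of
// the survivors, and returns the name of the best-scoring node.
func schedule(pod Pod, nodes []Node, predicates []Predicate, priorities []Priority) (string, bool) {
	best, bestScore, found := "", int64(0), false
	for _, n := range nodes {
		passed := true
		for _, p := range predicates {
			if !p(pod, n) {
				passed = false
				break
			}
		}
		if !passed {
			continue
		}
		var score int64
		for _, pr := range priorities {
			score += pr(pod, n)
		}
		if !found || score > bestScore {
			best, bestScore, found = n.Name, score, true
		}
	}
	return best, found
}

// fits is an example predicate: the node must have room for the request.
func fits(pod Pod, n Node) bool { return n.FreeMilli >= pod.RequestMilli }

// mostFree is an example priority: prefer the node with the most CPU left over.
func mostFree(pod Pod, n Node) int64 { return n.FreeMilli - pod.RequestMilli }

func main() {
	nodes := []Node{
		{Name: "node-a", FreeMilli: 300},
		{Name: "node-b", FreeMilli: 1500},
		{Name: "node-c", FreeMilli: 900},
	}
	pod := Pod{Name: "web", RequestMilli: 500}
	name, ok := schedule(pod, nodes, []Predicate{fits}, []Priority{mostFree})
	fmt.Println(name, ok) // node-b true
}
```

Here node-a is removed by the predicate, and node-b wins the scoring phase because it keeps the most CPU free. Preemption, which the real scheduler tries when no node passes, is left out of the sketch.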

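Other Concepts That Affect Scheduling

podspec: nodeName
podspec: nodeSelector
Pod Priority and Preemption: since Kubernetes 1.8, a Pod can be assigned a priority, which indicates its importance relative to other Pods. When a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower-priority Pods so that the pending Pod can be scheduled.
Pod Affinity and Anti-affinity: pods influence each other's placement.
Node Affinity/Anti-affinity: node labels influence placement.
Node Taints and Tolerations: a node restricts scheduling through taints, and a pod can match or tolerate those restrictions.
PDB (PodDisruptionBudget): limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions.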
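As one concrete example from the list above, the taint and toleration check can be sketched in Go. The Taint and Toleration structs here are trimmed-down stand-ins for the core/v1 types, and schedulable is a hypothetical helper; the matching rules follow the documented semantics (an empty toleration effect or key matches any effect or key, and the Exists operator ignores the value).

```go
package main

import "fmt"

// Taint and Toleration are trimmed-down stand-ins for the core/v1 types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator
		return t.Value == taint.Value
	}
	return false
}

// schedulable reports whether every hard taint on the node is tolerated.
func schedulable(taints []Taint, tolerations []Toleration) bool {
	for _, taint := range taints {
		if taint.Effect == "PreferNoSchedule" {
			continue // soft preference, never blocks scheduling
		}
		tolerated := false
		for _, t := range tolerations {
			if tolerates(t, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}}
	exact := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "gpu", Effect: "NoSchedule"}}
	wildcard := []Toleration{{Operator: "Exists"}}

	fmt.Println(schedulable(taints, nil))      // false
	fmt.Println(schedulable(taints, exact))    // true
	fmt.Println(schedulable(taints, wildcard)) // true
}
```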

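Notable Scheduler Developments

Scheduling Framework

The current mainstream extension mechanism, the webhook-based Scheduler Extender, has some limitations:

Capability: the number and phases of extension points are limited. Only the Predicate, Priority, Bind, and Preemption extension points are supported, and they are invoked only after the default scheduler has finished the corresponding step. The scheduler also cannot notify the Extender that scheduling of a Pod has been cancelled.
Efficiency: the scheduler talks to the extender in JSON. The Extender is a separate process, so it cannot use the default scheduler's cache and has to build an equivalent cache of its own.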

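The Scheduling Framework was proposed to solve these problems. It defines new extension points and APIs for the default scheduler and exposes them through plugins.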

[Figure: Scheduling Framework extension points]


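Descheduler

An incubator project.
It periodically evicts pods so that they can be rescheduled onto more suitable nodes.
Scenarios:
- Nodes are under- or over-utilized.
- Node taints or labels changed after scheduling, so a pod no longer fits its node.
- A node fails.
- A node is added.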
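The first scenario can be sketched in Go as follows. This is an illustration of the idea only, not the Descheduler's actual strategy code: NodeUsage, classify, evictionSources and the percentage thresholds are all assumptions made for the example.

```go
package main

import "fmt"

// NodeUsage is a hypothetical summary of one node's CPU requests.
type NodeUsage struct {
	Name           string
	RequestedMilli int64
	CapacityMilli  int64
}

// classify splits nodes into under- and over-utilized by requested percentage.
// Nodes between the two thresholds are left alone.
func classify(nodes []NodeUsage, lowPct, highPct int64) (under, over []string) {
	for _, n := range nodes {
		if n.CapacityMilli <= 0 {
			continue // skip nodes with no reported capacity
		}
		pct := n.RequestedMilli * 100 / n.CapacityMilli
		switch {
		case pct < lowPct:
			under = append(under, n.Name)
		case pct > highPct:
			over = append(over, n.Name)
		}
	}
	return under, over
}

// evictionSources returns the nodes worth evicting from. Evicting only makes
// sense when some under-utilized node can take the pods.
func evictionSources(nodes []NodeUsage, lowPct, highPct int64) []string {
	under, over := classify(nodes, lowPct, highPct)
	if len(under) == 0 {
		return nil
	}
	return over
}

func main() {
	nodes := []NodeUsage{
		{Name: "node-a", RequestedMilli: 1800, CapacityMilli: 2000}, // 90%
		{Name: "node-b", RequestedMilli: 200, CapacityMilli: 2000},  // 10%
		{Name: "node-c", RequestedMilli: 1000, CapacityMilli: 2000}, // 50%
	}
	fmt.Println(evictionSources(nodes, 20, 80)) // [node-a]
}
```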

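Extending the Scheduler

Extension Approaches

Generally speaking, there are four ways to extend the Kubernetes scheduler.

Approach	Pros and cons
Clone the official kube-scheduler and modify it	Hard to maintain
Run a separate kube-scheduler, selected via pod.spec.schedulerName	May cause scheduling conflicts; for example, by the time one scheduler binds a pod, the resources may already have been allocated by another scheduler
Scheduler extender	Webhooks are configured in the policy file; supports the Predicate, Priority, Bind, and Preemption extension points; simple to implement
Scheduling Framework	Introduced in Kubernetes v1.15; pluggable; expected to become the mainstream approach and to supersede the scheduler extender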

Extension Examples

example

k8s-scheduler-extender-example

Gang Scheduling

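Kube-batch: gang scheduling is commonly used in domains such as big data and batch computing. A set of resources is treated as one group: if there are enough resources for the whole group, the whole group is scheduled; otherwise none of it is (whereas the traditional Kubernetes scheduling granularity is a single pod). kube-batch tries to solve this class of problem and aims to turn this common requirement into a standard that covers all similar cases.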
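The all-or-nothing idea can be sketched in a few lines of Go. This is not kube-batch's algorithm: gangSchedule is a hypothetical helper that uses a simple first-fit placement, which can miss packings a smarter placer would find, but it shows the key property that a failed group leaves no partial allocation behind.

```go
package main

import (
	"fmt"
	"sort"
)

// gangSchedule tries to place every request of a group. It works on a
// tentative copy of the free capacity and commits only if all pods fit.
// It returns the pod-index-to-node assignment and whether the group was placed.
func gangSchedule(requests []int64, free map[string]int64) (map[int]string, bool) {
	names := make([]string, 0, len(free))
	tentative := make(map[string]int64, len(free))
	for name, f := range free {
		names = append(names, name)
		tentative[name] = f
	}
	sort.Strings(names) // deterministic first-fit order

	assignment := make(map[int]string, len(requests))
	for i, req := range requests {
		placed := false
		for _, name := range names {
			if tentative[name] >= req {
				tentative[name] -= req
				assignment[i] = name
				placed = true
				break
			}
		}
		if !placed {
			return nil, false // free is untouched: nothing was committed
		}
	}
	for name, f := range tentative {
		free[name] = f // commit the whole group at once
	}
	return assignment, true
}

func main() {
	free := map[string]int64{"n1": 1000, "n2": 1000}

	// Three pods of 600m do not all fit, so none is placed.
	_, ok := gangSchedule([]int64{600, 600, 600}, free)
	fmt.Println(ok, free["n1"], free["n2"]) // false 1000 1000

	// Two pods of 600m fit, so both are placed.
	assignment, ok := gangSchedule([]int64{600, 600}, free)
	fmt.Println(ok, assignment, free["n1"], free["n2"]) // true map[0:n1 1:n2] 400 400
}
```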

gpushare-scheduler-extender

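A scheduler extension for GPU sharing devices that lets multiple pods share GPU memory and cards. The current device mechanism can register the total amount of a resource, but that is not enough information for scheduling, so gpushare-scheduler-extender adds a filter layer that helps decide whether a node has enough GPU resources.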
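Why a node-level total is not enough can be shown with a short Go sketch. It assumes, for illustration, that a pod's GPU memory request must fit on a single card; GPUNode, fitsOnOneCard and filterNodes are hypothetical names, not the project's API.

```go
package main

import "fmt"

// GPUNode is a hypothetical view of a node: free GPU memory per card, in GiB.
type GPUNode struct {
	Name       string
	FreeMemGiB []int64
}

// total returns the node-level sum, which is all a plain resource count exposes.
func total(n GPUNode) int64 {
	var sum int64
	for _, f := range n.FreeMemGiB {
		sum += f
	}
	return sum
}

// fitsOnOneCard reports whether some single card can hold the request.
func fitsOnOneCard(n GPUNode, reqGiB int64) bool {
	for _, f := range n.FreeMemGiB {
		if f >= reqGiB {
			return true
		}
	}
	return false
}

// filterNodes is the filter step: it splits nodes into those that can host
// the pod and those that cannot.
func filterNodes(nodes []GPUNode, reqGiB int64) (fit, failed []string) {
	for _, n := range nodes {
		if fitsOnOneCard(n, reqGiB) {
			fit = append(fit, n.Name)
		} else {
			failed = append(failed, n.Name)
		}
	}
	return fit, failed
}

func main() {
	nodes := []GPUNode{
		{Name: "node-a", FreeMemGiB: []int64{4, 4}}, // total 8, largest card 4
		{Name: "node-b", FreeMemGiB: []int64{8, 1}}, // total 9, largest card 8
	}
	fmt.Println(total(nodes[0]), total(nodes[1])) // 8 9

	fit, failed := filterNodes(nodes, 6)
	fmt.Println(fit, failed) // [node-b] [node-a]
}
```

Both nodes report at least 6 GiB free in total, yet only node-b has a card that can actually hold a 6 GiB request.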

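Hands-on Practice

Constrained by the Kubernetes versions in mainstream use today, we still use the scheduler extender approach in this walkthrough.

Imagine this scenario: all nodes in the Kubernetes cluster are split into two groups. Group A consists of fixed nodes purchased on a monthly subscription; group B is billed pay-as-you-go and covers elastic demand.

For this scenario, our requirements for the scheduler are:

Schedule to group A first, and spread pods as evenly as possible within group A, i.e. the default policy LeastRequestedPriority (the higher a Node's share of free resources, the higher its score).
When group A runs out of capacity, schedule to group B, but pack pods onto as few group B Nodes as possible, i.e. MostRequestedPriority (the lower a Node's share of free resources, the higher its score), so that group B can be scaled in once the elastic demand is over.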
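The two policies differ only in which direction they reward. A small Go sketch of both scoring rules follows, using the 0 to 10 scale and the CPU/memory averaging of the classic priority functions; the Usage type and the sample numbers are assumptions made for the example.

```go
package main

import "fmt"

// Usage is a hypothetical summary of requested and total resources on a node.
type Usage struct {
	RequestedCPU, CapacityCPU int64 // millicores
	RequestedMem, CapacityMem int64 // MiB
}

// leastRequested scores one resource from 0 to 10: more free capacity scores higher.
func leastRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// mostRequested scores one resource from 0 to 10: less free capacity scores higher.
func mostRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return requested * 10 / capacity
}

// leastRequestedScore averages the CPU and memory scores (spreads pods out).
func leastRequestedScore(u Usage) int64 {
	return (leastRequested(u.RequestedCPU, u.CapacityCPU) + leastRequested(u.RequestedMem, u.CapacityMem)) / 2
}

// mostRequestedScore averages the CPU and memory scores (packs pods together).
func mostRequestedScore(u Usage) int64 {
	return (mostRequested(u.RequestedCPU, u.CapacityCPU) + mostRequested(u.RequestedMem, u.CapacityMem)) / 2
}

func main() {
	idle := Usage{RequestedCPU: 500, CapacityCPU: 2000, RequestedMem: 1024, CapacityMem: 4096}
	busy := Usage{RequestedCPU: 1500, CapacityCPU: 2000, RequestedMem: 3072, CapacityMem: 4096}

	fmt.Println(leastRequestedScore(idle), leastRequestedScore(busy)) // 7 2
	fmt.Println(mostRequestedScore(idle), mostRequestedScore(busy))   // 2 7
}
```

With LeastRequestedPriority the mostly idle node wins, which spreads load across group A. With MostRequestedPriority the busy node wins, which fills one group B node before touching the next.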

The full implementation is at u2takey/k8s-scheduler-extender-example.

The core implementation is as follows (some minor code omitted):

// GroupPriority is the extender's Prioritize hook. Nodes outside the "Scale"
// group get a fixed high score; "Scale" nodes score higher the more of their
// capacity is already requested, so pods are packed onto as few of them as possible.
// indexer and toFloat are defined elsewhere in the repository (omitted here).
GroupPriority = Prioritize{
	Name: "group_score",
	Func: func(_ v1.Pod, nodes []v1.Node) (*schedulerapi.HostPriorityList, error) {
		priorityList := make(schedulerapi.HostPriorityList, len(nodes))
		for i, node := range nodes {
			// Default score: always above the 0..200 range used for "Scale" nodes.
			priorityList[i] = schedulerapi.HostPriority{
				Host:  node.Name,
				Score: 1000,
			}

			if group, ok := node.Labels["group"]; ok && group == "Scale" {
				// Score = (sum(requested CPU) / capacity + sum(requested memory) / capacity) * 100
				pods, err := indexer.ByIndex("node", node.Name)
				if err != nil {
					return nil, err
				}
				// Sum the requests of every container already on this node.
				cpu, mem := &resource.Quantity{}, &resource.Quantity{}
				for _, obj := range pods {
					if pod, ok := obj.(*v1.Pod); ok {
						for _, container := range pod.Spec.Containers {
							cpu.Add(*container.Resources.Requests.Cpu())
							mem.Add(*container.Resources.Requests.Memory())
						}
					}
				}
				nodeCpu, nodeMem := node.Status.Capacity.Cpu(), node.Status.Capacity.Memory()
				score := (toFloat(cpu)/toFloat(nodeCpu) + toFloat(mem)/toFloat(nodeMem)) * 100.0
				priorityList[i].Score = int64(score)
			}
			log.Printf("score for %s %d\n", node.Name, priorityList[i].Score)
		}
		return &priorityList, nil
	},
}

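We use Terraform to create a Kubernetes cluster for testing with the configuration below (variable definitions omitted). It creates 4 workers, each with 2 vCPUs and 4 GB of memory.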

provider "tencentcloud" {
  secret_id  = var.secret_id
  secret_key = var.secret_key
  region     = var.region
}

# test cluster
resource "tencentcloud_kubernetes_cluster" "managed_cluster" {
  vpc_id                  = var.vpc
  cluster_cidr            = "10.4.0.0/16"
  cluster_max_pod_num     = 32
  cluster_desc            = "cluster created by terraform"
  cluster_max_service_num = 32
  container_runtime          = "containerd"
  cluster_version            = "1.14.3"

  worker_config {
    count                      = 4
    availability_zone          = var.availability_zone
    instance_type              = var.default_instance_type
    system_disk_size           = 50
    security_group_ids         = [var.sg]
    internet_charge_type       = "TRAFFIC_POSTPAID_BY_HOUR"
    internet_max_bandwidth_out = 100
    public_ip_assigned         = true
    subnet_id                  = var.subnet
    key_ids                   = [var.key_id]
  }

  cluster_deploy_type = "MANAGED_CLUSTER"

  provisioner "local-exec" {
    command = <<EOT
    echo "${self.certification_authority}" > /tmp/${self.user_name}.cert;
    kubectl config set-credentials ${self.id} --username=${self.user_name} --password=${self.password};
    kubectl config set-cluster ${self.id}  --server=https://${self.domain} --certificate-authority=/tmp/${self.user_name}.cert --embed-certs=true;
    kubectl config set-context ${self.id}  --cluster=${self.id}  --user=${self.id} ;
    kubectl config use-context ${self.id};
    EOT
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<EOT
    kubectl config unset users.${self.id};
    kubectl config unset contexts.${self.id};
    kubectl config unset clusters.${self.id};
    EOT
  }
}

Once the cluster is up, patch two of the nodes with the label group: Scale. These form group B described above, the group used for scaling.

kubectl patch node 10.203.0.16 10.203.0.6  -p '{"metadata":{"labels":{"group":"Scale"}}}'

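Create a Deployment for testing, with requests and limits of 500m CPU / 500M memory, scale it up step by step, and watch where the pods land. Replicas are first spread evenly across group A (10.203.0.14, 10.203.0.11) until group A runs out of resources. After that they go to group B, where as few nodes as possible are used: one node (10.203.0.6) is preferred until it runs out of resources too.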

# 6 replicas: spread evenly across group A first
➜ kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
nginx-866d5f6df5-4gxcn   1/1     Running   0          5s    10.4.0.107   10.203.0.11   <none>           <none>
nginx-866d5f6df5-4wwn8   1/1     Running   0          18s   10.4.0.41    10.203.0.14   <none>           <none>
nginx-866d5f6df5-cnpld   1/1     Running   0          36s   10.4.0.40    10.203.0.14   <none>           <none>
nginx-866d5f6df5-drpsz   1/1     Running   0          18s   10.4.0.106   10.203.0.11   <none>           <none>
nginx-866d5f6df5-frb6c   1/1     Running   0          18s   10.4.0.42    10.203.0.14   <none>           <none>
nginx-866d5f6df5-xg79m   1/1     Running   0          18s   10.4.0.105   10.203.0.11   <none>           <none>


# 7 replicas: group A is out of resources, so the new pod goes to group B
➜ kubectl get pod -o wide
NAME                     READY   STATUS              RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
nginx-866d5f6df5-4gxcn   1/1     Running             0          12s   10.4.0.107   10.203.0.11   <none>           <none>
nginx-866d5f6df5-4wwn8   1/1     Running             0          25s   10.4.0.41    10.203.0.14   <none>           <none>
nginx-866d5f6df5-89fxh   0/1     ContainerCreating   0          2s    <none>       10.203.0.6    <none>           <none>
nginx-866d5f6df5-cnpld   1/1     Running             0          43s   10.4.0.40    10.203.0.14   <none>           <none>
nginx-866d5f6df5-drpsz   1/1     Running             0          25s   10.4.0.106   10.203.0.11   <none>           <none>
nginx-866d5f6df5-frb6c   1/1     Running             0          25s   10.4.0.42    10.203.0.14   <none>           <none>
nginx-866d5f6df5-xg79m   1/1     Running             0          25s   10.4.0.105   10.203.0.11   <none>           <none>


# 9 replicas: the new replicas are packed onto 10.203.0.6
➜ kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
nginx-866d5f6df5-4gxcn   1/1     Running   0          39s   10.4.0.107   10.203.0.11   <none>           <none>
nginx-866d5f6df5-4wwn8   1/1     Running   0          52s   10.4.0.41    10.203.0.14   <none>           <none>
nginx-866d5f6df5-89fxh   1/1     Running   0          29s   10.4.0.72    10.203.0.6    <none>           <none>
nginx-866d5f6df5-9ng2n   1/1     Running   0          3s    10.4.0.74    10.203.0.6    <none>           <none>
nginx-866d5f6df5-cnpld   1/1     Running   0          70s   10.4.0.40    10.203.0.14   <none>           <none>
nginx-866d5f6df5-drpsz   1/1     Running   0          52s   10.4.0.106   10.203.0.11   <none>           <none>
nginx-866d5f6df5-frb6c   1/1     Running   0          52s   10.4.0.42    10.203.0.14   <none>           <none>
nginx-866d5f6df5-q7rhc   1/1     Running   0          16s   10.4.0.73    10.203.0.6    <none>           <none>
nginx-866d5f6df5-xg79m   1/1     Running   0          52s   10.4.0.105   10.203.0.11   <none>           <none>


# 10 replicas: 10.203.0.6 is now out of resources, so scheduling moves on to 10.203.0.16
➜ kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
nginx-866d5f6df5-4gxcn   1/1     Running   0          56s   10.4.0.107   10.203.0.11   <none>           <none>
nginx-866d5f6df5-4wwn8   1/1     Running   0          69s   10.4.0.41    10.203.0.14   <none>           <none>
nginx-866d5f6df5-89fxh   1/1     Running   0          46s   10.4.0.72    10.203.0.6    <none>           <none>
nginx-866d5f6df5-9ng2n   1/1     Running   0          20s   10.4.0.74    10.203.0.6    <none>           <none>
nginx-866d5f6df5-cnpld   1/1     Running   0          87s   10.4.0.40    10.203.0.14   <none>           <none>
nginx-866d5f6df5-drpsz   1/1     Running   0          69s   10.4.0.106   10.203.0.11   <none>           <none>
nginx-866d5f6df5-frb6c   1/1     Running   0          69s   10.4.0.42    10.203.0.14   <none>           <none>
nginx-866d5f6df5-q7rhc   1/1     Running   0          33s   10.4.0.73    10.203.0.6    <none>           <none>
nginx-866d5f6df5-sc4x6   1/1     Running   0          6s    10.4.0.10    10.203.0.16   <none>           <none>
nginx-866d5f6df5-xg79m   1/1     Running   0          69s   10.4.0.105   10.203.0.11   <none>           <none>

Finally, don't forget to run terraform destroy to tear down the cluster.