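1 Preface
I never expected that installing metrics-server would run into so many problems. Other chores took up some of the time, but it still kept me stuck for two days, so today I am going to concentrate on solving it.
2 Starting the installation over
2.1 The official installation command
Here I again start with the method from the official Metrics-server documentation and install directly with the following command: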
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
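2.2 Trying kubectl top nodes
Unsurprisingly, it still fails with the same error as before.
2.3 Analysis and resolution
2.3.1 Checking the service
kubectl get po -o wide -n kube-system
The installation itself succeeded, but the pod is not Ready, so the next step is to look at the pod in more detail:
2.3.2 Checking the logs
The full output is too long, so only the Events section at the end is kept here (this table is what kubectl describe pod prints):
Events: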
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60s default-scheduler Successfully assigned kube-system/metrics-server-7cb8646cfc-86s47 to docker-desktop
Normal BackOff 28s kubelet Back-off pulling image "k8s.gcr.io/metrics-server/metrics-server:v0.4.1"
Warning Failed 28s kubelet Error: ImagePullBackOff
Normal Pulling 18s (x2 over 44s) kubelet Pulling image "k8s.gcr.io/metrics-server/metrics-server:v0.4.1"
Warning Failed 3s (x2 over 28s) kubelet Failed to pull image "k8s.gcr.io/metrics-server/metrics-server:v0.4.1": rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 3s (x2 over 28s) kubelet Error: ErrImagePull
Key information:
Failed to pull image "k8s.gcr.io/metrics-server/metrics-server:v0.4.1": rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
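This is a network problem: the image k8s.gcr.io/metrics-server/metrics-server:v0.4.1 cannot be pulled from that registry because the connection to k8s.gcr.io times out. OK, that pins down the first problem.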
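When the event list is long, the failing image can be picked out mechanically instead of by eye. The following is a minimal sketch, not part of kubectl: failed_images is a hypothetical helper name of my own, and it only assumes that the input contains event lines in the "Failed to pull image" form shown above.

```shell
# Hypothetical helper: print every image named in a "Failed to pull image"
# event, one per line, without duplicates. Input is read from stdin.
failed_images() {
  grep 'Failed to pull image' \
    | sed 's/.*Failed to pull image "\([^"]*\)".*/\1/' \
    | sort -u
}

# Assumed usage against a live cluster (pod name taken from the events above):
#   kubectl describe pod metrics-server-7cb8646cfc-86s47 -n kube-system | failed_images
```

Because the helper only filters text, it can also be run on a saved copy of the output.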
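2.3.3 Fixing the image source
This part is fairly simple: search for metrics-server on Docker Hub, which lists a number of mirrored images.
Since I am using v0.4.1, I looked for an image that carries that tag.
I picked the first result, phperall/metrics-server, and a test pull with docker pull phperall/metrics-server:v0.4.1 succeeded.
2.3.4 Changing the image source in the yaml
Download https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml to a local file, open it in an editor, and at the image field (line 136 in the file I downloaded) change the source to phperall/metrics-server:v0.4.1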
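The same edit can be scripted rather than done by hand, which avoids depending on the line number. This is only a sketch under assumptions: replace_image is a hypothetical helper name, the output file name in the usage comment is made up, and the manifest is assumed to contain exactly one line of the form "image: ..." (a line written as "- image: ..." would not be matched).

```shell
# Hypothetical helper: print the manifest with its container image replaced.
# $1 = path to the manifest, $2 = replacement image reference.
# Assumes the manifest contains exactly one "image:" line.
replace_image() {
  manifest="$1"
  new_image="$2"
  # Keep the original indentation and the "image:" key; swap only the value.
  # '#' is the sed delimiter because image references contain '/'.
  sed "s#^\([[:space:]]*image:[[:space:]]*\).*#\1${new_image}#" "$manifest"
}

# Assumed usage: write the result to a new file, then compare before applying.
#   replace_image components.yaml phperall/metrics-server:v0.4.1 > components.local.yaml
#   diff components.yaml components.local.yaml
```

Writing to a separate file keeps the downloaded manifest untouched, so the change can be reviewed with diff first.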
2.3.5 Deleting the failed deployment and applying the local file
kubectl delete -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Then deploy again with the local yaml:
kubectl apply -f components.yaml
Checking the pod status again, it is still wrong: the status is now CrashLoopBackOff, which means the container keeps exiting shortly after it starts and is being restarted.
2.3.6 Digging further into the cause
The pod name is metrics-server-767bf7d9b4-qgqdd; check its logs:
metrics-server flamingskys$ kubectl logs metrics-server-767bf7d9b4-qgqdd -c metrics-server -n kube-system
E0427 09:54:41.015293 1 server.go:132] unable to fully scrape metrics: unable to fully scrape metrics from node docker-desktop: unable to fetch metrics from node docker-desktop: Get "https://192.168.65.4:10250/stats/summary?only_cpu_and_memory=true": x509: cannot validate certificate for 192.168.65.4 because it doesn't contain any IP SANs
I0427 09:54:41.062052 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0427 09:54:41.062123 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0427 09:54:41.062232 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0427 09:54:41.062289 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0427 09:54:41.062360 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0427 09:54:41.062397 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0427 09:54:41.069887 1 secure_serving.go:197] Serving securely on [::]:4443
I0427 09:54:41.070139 1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0427 09:54:41.070260 1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0427 09:54:41.164406 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0427 09:54:41.164822 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0427 09:54:41.166841 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0427 09:55:06.806549 1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0427 09:55:06.810530 1 tlsconfig.go:255] Shutting down DynamicServingCertificateController
I0427 09:55:06.810626 1 dynamic_serving_content.go:145] Shutting down serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0427 09:55:06.807152 1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0427 09:55:06.807205 1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0427 09:55:06.819313 1 secure_serving.go:241] Stopped listening on [::]:4443
Key information:
E0427 09:54:41.015293 1 server.go:132] unable to fully scrape metrics: unable to fully scrape metrics from node docker-desktop: unable to fetch metrics from node docker-desktop: Get "https://192.168.65.4:10250/stats/summary?only_cpu_and_memory=true": x509: cannot validate certificate for 192.168.65.4 because it doesn't contain any IP SANs
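So the problem is certificate verification: the kubelet's serving certificate contains no IP SANs, so it cannot be validated when metrics-server connects to the node by IP (192.168.65.4). Searching led to metrics issue #131, and the fix is to set two parameters in the container args in the yaml.
What the parameters mean:
- --kubelet-preferred-address-types:
Prefer InternalIP when connecting to the kubelet. By default (when the flag is not set) the kubelet is reached by node name, which fails when that name has no DNS record;
- --kubelet-insecure-tls:
The kubelet's port 10250 speaks HTTPS, so the connection normally requires verifying its TLS certificate. --kubelet-insecure-tls makes metrics-server skip verification of the serving certificate presented by the kubelet (not a client certificate). It is meant for test environments only.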
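The two flags can be added to the local file with a small script. Treat the following as a sketch under stated assumptions rather than the exact change made here: add_kubelet_flags is a hypothetical helper name, the value InternalIP is assumed (the text above only says InternalIP is preferred), and the manifest is assumed to list an existing argument starting with "- --cert-dir=", which is used as the insertion point and as the source of the indentation. A flag that the manifest already sets is left alone.

```shell
# Hypothetical helper: print the manifest with the two kubelet flags added to
# the container args. $1 = path to the manifest.
# Assumptions: an existing "- --cert-dir=..." argument marks the args list,
# and InternalIP is the desired preferred address type.
add_kubelet_flags() {
  manifest="$1"
  awk '
    # First pass (NR == FNR): remember which flags the manifest already sets.
    NR == FNR {
      if ($0 ~ /--kubelet-preferred-address-types/) have_types = 1
      if ($0 ~ /--kubelet-insecure-tls/) have_insecure = 1
      next
    }
    # Second pass: copy every line through unchanged.
    { print }
    # Right after the anchor argument, add the missing flags at the same indentation.
    /^[ \t]*- --cert-dir=/ {
      indent = $0
      sub(/-.*/, "", indent)   # keep only the leading whitespace
      if (!have_types) print indent "- --kubelet-preferred-address-types=InternalIP"
      if (!have_insecure) print indent "- --kubelet-insecure-tls"
    }
  ' "$manifest" "$manifest"
}

# Assumed usage (output file name is made up):
#   add_kubelet_flags components.yaml > components.fixed.yaml
```

The file is read twice so that the check for existing flags finishes before anything is printed, which also makes the helper safe to run on its own output.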
2.4 Running again with the modified file
After making the changes above, delete the previous deployment and apply the file again:
metrics-server flamingskys$ kubectl apply -f components.yaml
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
Check the pod status:
At last the metrics-server pod shows READY and is running normally.
Verify the top command:
metrics-server flamingskys$ kubectl top node
NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
docker-desktop   1786m        89%    1335Mi          70%
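This is the output I had been waiting for.
3 Summary
With that, the problem is solved. The steps are the ones in section 2.4 above, and the file used is the locally modified components.yaml.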