Cluster environment:
ubuntu:20.04
kubernetes:1.25.4
containerd:1.6.9
While deploying a Kubernetes cluster recently, initialization kept failing for no apparent reason, and manually generating a join token failed as well:
# kubeadm token create --print-join-command
timed out waiting for the condition
To see the stack trace of this error execute with --v=5 or higher
Troubleshooting started from this error:
# systemctl status kubelet
Nov 11 11:37:38 rook-master1 kubelet[50097]: E1111 11:37:38.743508 50097 kubelet.go:2448] "Error getting node" err="node \"rook-master1\" not found"
Nov 11 11:37:38 rook-master1 kubelet[50097]: E1111 11:37:38.844150 50097 kubelet.go:2448] "Error getting node" err="node \"rook-master1\" not found"
# journalctl -xeu kubelet
Nov 11 12:13:40 rook-master1 kubelet[50908]: E1111 12:13:40.500017 50908 kuberuntime_sandbox.go:71] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to get sandbox image \"registry.k8s.io/pause:3.6\": f>
Nov 11 12:13:40 rook-master1 kubelet[50908]: E1111 12:13:40.500084 50908 kuberuntime_manager.go:772] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"registry.k8s.io/pause:3.6\": f>
Nov 11 12:13:40 rook-master1 kubelet[50908]: E1111 12:13:40.500226 50908 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"etcd-rook-master1_kube-system(24d163f695e7dac3fdc912ceb74bfc34)\" wi>
One error in this output stood out as important:
Unknown desc = failed to get sandbox image "registry.k8s.io/pause:3.6"
Next, search syslog for this image reference to get more detail:
# grep 'registry.k8s.io/pause:3.6' /var/log/syslog
Nov 11 00:00:18 rook-master1 containerd[14961]: time="2022-11-11T00:00:18.677976277+08:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-rook-master1,Uid:24d163f695e7dac3fdc912ceb74bfc34,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"registry.k8s.io/pause:3.6\": failed to pull image \"registry.k8s.io/pause:3.6\": failed to pull and unpack image \"registry.k8s.io/pause:3.6\": failed to resolve reference \"registry.k8s.io/pause:3.6\": failed to do request: Head \"https://asia-northeast1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp 74.125.204.82:443: i/o timeout"
Nov 11 00:00:18 rook-master1 kubelet[16170]: E1111 00:00:18.678447 16170 remote_runtime.go:222] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"registry.k8s.io/pause:3.6\": failed to pull image \"registry.k8s.io/pause:3.6\": failed to pull and unpack image \"registry.k8s.io/pause:3.6\": failed to resolve reference \"registry.k8s.io/pause:3.6\": failed to do request: Head \"https://asia-northeast1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp 74.125.204.82:443: i/o timeout"
The immediate cause is now clear: containerd cannot pull the registry.k8s.io/pause:3.6 image, because the request to the registry times out.
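When the logs are long, it can help to pull out just the image references that failed instead of reading each line. The following is a minimal sketch, assuming the log lines use the same escaped-quote format as the excerpts above; `failed_sandbox_images` is a hypothetical helper name, not part of any Kubernetes or containerd tooling.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct sandbox images that failed to load,
# as reported in a kubelet/containerd log file.
# Assumes messages of the form:  failed to get sandbox image \"<image>\"
failed_sandbox_images() {
    # grep -o keeps only the matched part, up to (not including) the closing \"
    grep -o 'failed to get sandbox image \\"[^\\"]*' "$1" |
        # strip everything up to and including the opening \"
        sed 's/.*\\"//' |
        # collapse repeated reports of the same image
        sort -u
}

# Usage on a node (not run here):
#   failed_sandbox_images /var/log/syslog
```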
The next question was why registry.k8s.io showed up at all, when every image was supposed to come from the Alibaba Cloud (Aliyun) mirror.
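One place worth checking for a question like this is containerd's own configuration: as far as I know, the CRI plugin takes the pause image from its `sandbox_image` setting, independently of the image repository used for the control-plane images. The sketch below extracts that value from config text; `sandbox_image_from_config` is a hypothetical helper name, and the usage line assumes `containerd config dump` is available on the node.

```shell
#!/bin/sh
# Hypothetical helper: print the sandbox_image value(s) found in containerd
# config text read from stdin.
sandbox_image_from_config() {
    # Match lines like:   sandbox_image = "registry.k8s.io/pause:3.6"
    # and print only the quoted value.
    sed -n -E 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"([^"]*)".*/\1/p'
}

# Usage on a node (not run here):
#   containerd config dump | sandbox_image_from_config
```

If the printed image points at a registry the node cannot reach, that would explain the sandbox failures even when all other images come from a mirror.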
After two full days of digging, the root cause finally turned up
in the containerd 1.6.9 release notes:
Migrate from k8s.gcr.io to registry.k8s.io
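In other words, containerd now references registry.k8s.io for the sandbox (pause) image, and that registry was unreachable from this network. With that, the problem was identified and resolved.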
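A minimal sketch of one possible fix, offered as an assumption rather than the exact steps taken here: point `sandbox_image` in containerd's config at a mirror the node can reach, then restart containerd. The helper name `set_sandbox_image` is hypothetical, the Aliyun image path in the usage comment is an assumed example, and the `sed -i` call assumes GNU sed (as shipped with Ubuntu).

```shell
#!/bin/sh
# Hypothetical helper: rewrite the sandbox_image value in a containerd config file.
# Usage: set_sandbox_image <config.toml> <image>
set_sandbox_image() {
    config="$1"
    image="$2"
    # Keep the key and its indentation (\1), replace only the quoted value.
    # '|' is used as the sed delimiter because image references contain '/'.
    sed -i -E "s|^([[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*).*|\1\"${image}\"|" "$config"
}

# Usage on a node (not run here; the mirror path is an assumed example):
#   set_sandbox_image /etc/containerd/config.toml registry.aliyuncs.com/google_containers/pause:3.6
#   systemctl restart containerd
```

This only edits an existing `sandbox_image` line. If the config file has no such key, it would need to be added under the CRI plugin section first.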