The R&D team reported that one of their clusters had a master whose filesystem was corrupted and would not boot. The cluster runs as three VMs on OpenStack, and a fault on the virtualization host corrupted the VM's filesystem; all three machines are master nodes. I walked them through the repair and got the machine booting again — the repair process is the same as the openSUSE rescue steps in my earlier article.
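For reference, that rescue routine is roughly: boot from a rescue image, find the damaged filesystem, and run the matching repair tool. A minimal sketch — the device path and filesystem type below are assumptions for illustration, not details from this incident:
# from a rescue/live environment, with the damaged filesystem unmounted
lsblk -f                    # identify the damaged partition and its filesystem type
xfs_repair /dev/vda1        # XFS: repair the metadata (add -L only if the log is unrecoverable)
# fsck -y /dev/vda1         # ext4: use fsck instead
reboot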
After it came back up I logged in to take a look. Since the cluster is HA, only this node was affected and the cluster as a whole was fine:
[root@k8s-m1 ~]# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
10.252.146.104 NotReady <none> 30d v1.16.9 10.252.146.104 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.105 Ready <none> 30d v1.16.9 10.252.146.105 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.106 Ready <none> 30d v1.16.9 10.252.146.106 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
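Optionally, check why the node is NotReady before touching anything; a quick sketch:
kubectl describe node 10.252.146.104 | grep -A 10 Conditions   # the conditions point at the kubelet/runtime being down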
Try starting docker:
[root@k8s-m1 ~]# systemctl start docker
Job for docker.service canceled.
It would not start, so list the failed services:
[root@k8s-m1 ~]# systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● containerd.service loaded failed failed containerd container runtime
Check containerd's logs:
[root@k8s-m1 ~]# journalctl -xe -u containerd
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481459735 08:00" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481472223 08:00" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481517630 08:00" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481562176 08:00" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481964349 08:00" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481996158 08:00" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482048208 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482081110 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482096598 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482112263 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482123307 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482133477 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482142943 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482151644 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482160741 08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482184201 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482194643 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482206871 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482215454 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482365838 08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482404139 08:00" level=info msg="containerd successfully booted in 0.003611s"
Jul 23 11:20:11 k8s-m1 containerd[9186]: panic: runtime error: invalid memory address or nil pointer dereference
Jul 23 11:20:11 k8s-m1 containerd[9186]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5626b983c259]
Jul 23 11:20:11 k8s-m1 containerd[9186]: goroutine 55 [running]:
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Cursor(...)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:84
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Get(0x0, 0x5626bb7e3f10, 0xb, 0xb, 0x0, 0x2, 0x4)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:260 0x39
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots.func6(0x7fe557c63020, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x5626b95eec72)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:222 0xcb
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).ForEach(0xc0003d1780, 0xc00057b640, 0xa, 0xa)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:388 0x100
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots(0x5626bacedde0, 0xc0003d1680, 0xc0002ee2a0, 0xc00031a3c0, 0xc000527a60, 0x7fe586a43fff)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:216 0x4df
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked.func1(0xc0002ee2a0, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:359 0x165
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*DB).View(0xc00000c1e0, 0xc00008b860, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:701 0x92
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
Judging from the panic stack trace, this is very similar to the docker startup panic in my earlier article — in both cases a boltdb file is corrupted. Let's check the build info so we can find the corresponding code path:
[root@k8s-m1 ~]# systemctl cat containerd | grep ExecStart
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
[root@k8s-m1 ~]# /usr/bin/containerd --version
containerd containerd.io 1.2.13 7ad184331fa3e55e52b890ea95e65ba581ae3429
Browsing GitHub with that commit hash as a blob URL returns a 404, so I read the code by tag instead. From the related code, the boltdb file is named meta.db:
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L257
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L79
https://github.com/containerd/containerd/blob/v1.2.13/services/server/server.go#L261-L268
Find out what the ic.Root path is:
[root@k8s-m1 ~]# /usr/bin/containerd --help | grep config
config information on the containerd config
--config value, -c value path to the configuration file (default: "/etc/containerd/config.toml")
[root@k8s-m1 ~]# grep root /etc/containerd/config.toml
#root = "/var/lib/containerd"
[root@k8s-m1 ~]# find /var/lib/containerd -type f -name meta.db
/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
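Before renaming it, you can optionally confirm the file really is corrupted with the bbolt CLI; a sketch, assuming the bbolt tool is available (it is not shipped with containerd):
# go get go.etcd.io/bbolt/cmd/bbolt    # or use a prebuilt binary
bbolt check /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db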
Having found the boltdb file, rename it and start containerd again:
[root@k8s-m1 ~]# mv /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db{,.bak}
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2020-07-23 11:20:11 CST; 17min ago
Docs: https://containerd.io
Process: 9186 ExecStart=/usr/bin/containerd (code=exited, status=2)
Process: 9182 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 9186 (code=exited, status=2)
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
[root@k8s-m1 ~]# systemctl restart containerd.service
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
Active: active (running) since Thu 2020-07-23 11:25:37 CST; 1s ago
Docs: https://containerd.io
Process: 15661 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 15663 (containerd)
Tasks: 16
Memory: 28.6M
CGroup: /system.slice/containerd.service
└─15663 /usr/bin/containerd
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496725460 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496734129 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496742793 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496751740 08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496775185 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496785498 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496794873 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496803178 08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496944458 08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496958031 08:00" level=info msg="containerd successfully booted in 0.003994s"
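Note that containerd simply creates a fresh, empty meta.db on startup, so this discards containerd's own container/image metadata on this node; the corrupted file is kept as meta.db.bak in case it is needed later. A quick check:
ls -l /var/lib/containerd/io.containerd.metadata.v1.bolt/
# meta.db       <- freshly created on startup
# meta.db.bak   <- the corrupted original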
With containerd up, bring docker back up:
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─10-docker.conf
Active: inactive (dead) since Thu 2020-07-23 11:20:13 CST; 18min ago
Docs: https://docs.docker.com
Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 9187 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
Main PID: 9187 (code=exited, status=0/SUCCESS)
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956503485 08:00" level=error msg="Stop container error: Failed to stop container 68860c8d16b9ce7e74e8efd9db00e70a57eef1b752c2e6c703073c0bce5517d3 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954347116 08:00" level=error msg="Stop container error: Failed to stop container 5ec9922beed1276989f1866c3fd911f37cc26aae4e4b27c7ce78183a9a4725cc with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.953615411 08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956557179 08:00" level=error msg="Stop container error: Failed to stop container 6d0096fbcd4055f8bafb6b38f502a0186cd1dfca34219e9dd6050f512971aef5 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954601191 08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956600790 08:00" level=error msg="Stop container error: Failed to stop container 6d1175ba6c55cb05ad89f4134ba8e9d3495c5acb5f07938dc16339b7cca013bf with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957188989 08:00" level=info msg="Daemon shutdown complete"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957212655 08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957209679 08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
Jul 23 11:20:13 k8s-m1 systemd[1]: Stopped Docker Application Container Engine.
[root@k8s-m1 ~]# systemctl start docker
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─10-docker.conf
Active: active (running) since Thu 2020-07-23 11:26:11 CST; 1s ago
Docs: https://docs.docker.com
Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 16156 ExecStartPost=/sbin/iptables -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
Main PID: 15974 (dockerd)
Tasks: 62
Memory: 89.1M
CGroup: /system.slice/docker.service
└─15974 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.851106564 08:00" level=error msg="cb4e16249cd8eac48ed734c71237195f04d63c56c55c0199b3cdf3d49461903d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.860456898 08:00" level=error msg="d9bbcab186ccb59f96c95fc886ec1b66a52aa96e45b117cf7d12e3ff9b95db9f cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.872405757 08:00" level=error msg="07eb7a09bc8589abcb4d79af4b46798327bfb00624a7b9ceea457de392ad8f3d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.877896618 08:00" level=error msg="f5867657025bd7c3951cbd3e08ad97338cf69df2a97967a419e0e78eda869b73 cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.143661583 08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.198200760 08:00" level=info msg="Loading containers: done."
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.219959208 08:00" level=info msg="Docker daemon" commit=42e35e61f3 graphdriver(s)=overlay2 version=19.03.11
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.220049865 08:00" level=info msg="Daemon has completed initialization"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.232373131 08:00" level=info msg="API listen on /var/run/docker.sock"
Jul 23 11:26:11 k8s-m1 systemd[1]: Started Docker Application Container Engine.
etcd also failed to start; check it with journalctl:
[root@k8s-m1 ~]# journalctl -xe -u etcd
Jul 23 11:26:15 k8s-m1 etcd[18129]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:26:15 k8s-m1 etcd[18129]: etcd Version: 3.3.20
Jul 23 11:26:15 k8s-m1 etcd[18129]: Git SHA: 9fd7e2b80
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go Version: go1.12.17
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go OS/Arch: linux/amd64
Jul 23 11:26:15 k8s-m1 etcd[18129]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:26:15 k8s-m1 etcd[18129]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:26:15 k8s-m1 etcd[18129]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring peer auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for peers on https://10.252.146.104:2380
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring client auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: pprof is enabled under /debug/pprof
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 127.0.0.1:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 10.252.146.104:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: skipped unexpected non snapshot file 000000000000002e-000000000052f2be.snap.broken
Jul 23 11:26:15 k8s-m1 etcd[18129]: recovered store from snapshot at index 5426092
Jul 23 11:26:15 k8s-m1 etcd[18129]: restore compact to 3967425
Jul 23 11:26:15 k8s-m1 etcd[18129]: cannot unmarshal event: proto: KeyValue: illegal tag 0 (wire type 0)
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:26:15 k8s-m1 systemd[1]: Failed to start Etcd Service.
[root@k8s-m1 ~]# ll /var/lib/etcd/member/snap/
total 8560
-rw-r--r-- 1 root root 13499 Jul 20 13:36 000000000000002e-000000000052cbac.snap
-rw-r--r-- 2 root root 128360 Jul 20 13:01 000000000000002e-000000000052f2be.snap.broken
-rw------- 1 root root 8617984 Jul 23 11:26 db
This cluster was deployed with my ansible playbooks (stars welcome), which ship with an etcd backup script — but the node broke three days ago, so the newest local backup is from July 20:
[root@k8s-m1 ~]# ll /opt/etcd_bak/
total 41524
-rw-r--r-- 1 root root 8618016 Jul 17 02:00 etcd-2020-07-17-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 18 02:00 etcd-2020-07-18-02:00:01.db
-rw-r--r-- 1 root root 8323104 Jul 19 02:00 etcd-2020-07-19-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 20 02:00 etcd-2020-07-20-02:00:01.db
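The backup script essentially takes a daily v3 snapshot; roughly equivalent to something like this (using the plain-HTTP local client URL seen in the etcd logs above):
ETCDCTL_API=3 etcdctl snapshot save "/opt/etcd_bak/etcd-$(date +%Y-%m-%d-%H:%M:%S).db" \
  --endpoints=http://127.0.0.1:2379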
There is also a restore playbook, but it only works if the etcd v2 and v3 stores do not coexist — otherwise the backup cannot be restored; on our production clusters the v2 store is disabled, so that is fine. The relevant steps are lines 26–42 of the restore tasks file. I copied the July 23 etcd backup file from one of the other masters, pointed the playbook at this host and ran it; what it does underneath is roughly the manual sequence sketched below.
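A minimal sketch of the equivalent manual restore — the member name and peer URL come from the etcd logs further below, while the names of the other two members are assumptions:
systemctl stop etcd
rm -rf /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd_bak/etcd-bak.db \
  --data-dir=/var/lib/etcd \
  --name=etcd-001 \
  --initial-advertise-peer-urls=https://10.252.146.104:2380 \
  --initial-cluster=etcd-001=https://10.252.146.104:2380,etcd-002=https://10.252.146.105:2380,etcd-003=https://10.252.146.106:2380
# (this deployment also uses a dedicated WAL dir under /var/lib/etcd/wal; the playbook takes care of that)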
[root@k8s-m1 ~]# cd Kubernetes-ansible
[root@k8s-m1 Kubernetes-ansible]# ansible-playbook restoreETCD.yml -e 'db=/opt/etcd_bak/etcd-bak.db'
PLAY [10.252.146.104] **********************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *********************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]
TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]
TASK [restoreETCD : 检测备份文件存在否] *************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]
TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]
TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]
TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]
TASK [restoreETCD : 停止etcd] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]
TASK [restoreETCD : 删除etcd数据目录] ************************************************************************************************************************************************************************************************************
ok: [10.252.146.104] => (item=/var/lib/etcd)
TASK [restoreETCD : 分发备份文件] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]
TASK [restoreETCD : 恢复备份] ******************************************************************************************************************************************************************************************************************
changed: [10.252.146.104]
TASK [restoreETCD : 启动etcd] ****************************************************************************************************************************************************************************************************************
fatal: [10.252.146.104]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}
PLAY RECAP *********************************************************************************************************************************************************************************************************************************
10.252.146.104 : ok=7 changed=1 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
Check the log:
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:46 k8s-m1 etcd[58954]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:46 k8s-m1 etcd[58954]: etcd Version: 3.3.20
Jul 23 11:27:46 k8s-m1 etcd[58954]: Git SHA: 9fd7e2b80
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go Version: go1.12.17
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go OS/Arch: linux/amd64
Jul 23 11:27:46 k8s-m1 etcd[58954]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:46 k8s-m1 etcd[58954]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring peer auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring client auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: pprof is enabled under /debug/pprof
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:47 k8s-m1 etcd[58954]: member ac2dcf6aed12e8f1 has already been bootstrapped
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:27:47 k8s-m1 systemd[1]: Failed to start Etcd Service.
The fix for member xxxx has already been bootstrapped is to make the change below in the config file — and remember to change it back once etcd is up:
change initial-cluster-state: 'new' to initial-cluster-state: 'existing'
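For example, as a quick in-place edit (the config path comes from the etcd logs above; revert it once the member has rejoined):
sed -i "s/initial-cluster-state: 'new'/initial-cluster-state: 'existing'/" /etc/etcd/etcd.config.yml
# after etcd is back in the cluster:
# sed -i "s/initial-cluster-state: 'existing'/initial-cluster-state: 'new'/" /etc/etcd/etcd.config.yml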
After that it starts successfully:
[root@k8s-m1 Kubernetes-ansible]# systemctl start etcd
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:55 k8s-m1 etcd[59889]: etcd Version: 3.3.20
Jul 23 11:27:55 k8s-m1 etcd[59889]: Git SHA: 9fd7e2b80
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go Version: go1.12.17
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go OS/Arch: linux/amd64
Jul 23 11:27:55 k8s-m1 etcd[59889]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:55 k8s-m1 etcd[59889]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:27:55 k8s-m1 etcd[59889]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring peer auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring client auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: pprof is enabled under /debug/pprof
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: recovered store from snapshot at index 5952463
Jul 23 11:27:55 k8s-m1 etcd[59889]: restore compact to 4369703
Jul 23 11:27:55 k8s-m1 etcd[59889]: name = etcd-001
Jul 23 11:27:55 k8s-m1 etcd[59889]: data dir = /var/lib/etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: member dir = /var/lib/etcd/member
Jul 23 11:27:55 k8s-m1 etcd[59889]: dedicated WAL dir = /var/lib/etcd/wal
Jul 23 11:27:55 k8s-m1 etcd[59889]: heartbeat = 100ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: election = 1000ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: snapshot count = 5000
Jul 23 11:27:55 k8s-m1 etcd[59889]: advertise client URLs = https://10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: restarting member ac2dcf6aed12e8f1 in cluster 367e2aebc6430cbe at commit index 5952491
Jul 23 11:27:55 k8s-m1 etcd[59889]: ac2dcf6aed12e8f1 became follower at term 47
Jul 23 11:27:55 k8s-m1 etcd[59889]: newRaft ac2dcf6aed12e8f1 [peers: [1e713be314744d53,8b1621b475555fd9,ac2dcf6aed12e8f1], term: 47, commit: 5952491, applied: 5952463, lastindex: 5952491, lastterm: 47]
Jul 23 11:27:55 k8s-m1 etcd[59889]: enabled capabilities for version 3.3
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 1e713be314744d53 [https://10.252.146.105:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 8b1621b475555fd9 [https://10.252.146.106:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member ac2dcf6aed12e8f1 [https://10.252.146.104:2380] to cluster 367e2aebc6430cbe from store
Check the cluster status:
[root@k8s-m1 Kubernetes-ansible]# etcd-ha
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|           ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.252.146.104:2379 | ac2dcf6aed12e8f1 |  3.3.20 |  8.3 MB |     false |        47 |    5953557 |
| https://10.252.146.105:2379 | 1e713be314744d53 |  3.3.20 |  8.6 MB |     false |        47 |    5953557 |
| https://10.252.146.106:2379 | 8b1621b475555fd9 |  3.3.20 |  8.3 MB |      true |        47 |    5953557 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
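etcd-ha here is a helper alias shipped by the deployment; it is roughly equivalent to querying endpoint status in table mode, e.g. (a sketch — the client cert paths are assumptions):
ETCDCTL_API=3 etcdctl endpoint status -w table \
  --endpoints=https://10.252.146.104:2379,https://10.252.146.105:2379,https://10.252.146.106:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key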
Then, after bringing up the three control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) and kubelet — roughly as sketched below — the node is Ready again:
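A sketch of bringing the services back up — the systemd unit names are assumptions based on a typical binary deployment:
systemctl start kube-apiserver kube-controller-manager kube-scheduler kubelet
systemctl --failed    # make sure nothing else is still down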
[root@k8s-m1 Kubernetes-ansible]# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
10.252.146.104 Ready <none> 30d v1.16.9 10.252.146.104 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.105 Ready <none> 30d v1.16.9 10.252.146.105 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
10.252.146.106 Ready <none> 30d v1.16.9 10.252.146.106 <none> CentOS Linux 8 (Core) 4.18.0-193.6.3.el8_2.x86_64 docker://19.3.11
The pods are also slowly self-healing.
Original post:
http://zhangguanzhang.github.io/2020/07/23/fs-error-fix-k8s-master/
This article is reposted from Zhangguanzhang.