Symptom
This morning I tried to log in to the zabbix-server host in the test environment to look something up, but the ssh session kept failing with "No space left on device".
[C:~]$ ssh 172.16.131.142
Last login: Fri Nov 1 11:28:19 2019 from 10.16.75.35
/root/.pyenv/libexec/pyenv-init: line 131: cannot create temp file for here-document: No space left on device
So I hopped over from the ansible host instead and checked disk usage: the root filesystem was at 100%.
[root@ansible ~]# ssh 172.16.131.142
[root@zabbix1 ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/vda1                     50G   50G     0 100% /
/dev/mapper/datavg-home_lv
                             343G  178G  148G  55% /home
/dev/mapper/datavg-swap_lv
                             976M  490M  436M  53% /swap
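A check like the one above is easy to script so that a full root filesystem is noticed before ssh stops working. The following is only a minimal sketch: the function names `usage_percent` and `is_nearly_full` and the default threshold of 90 are my own assumptions, not something from the original setup.

```shell
# usage_percent: print the Use% of the filesystem that holds a path, without "%".
# -P forces POSIX output, so long device names (like /dev/mapper/...) do not
# wrap onto a second line the way they do in the "df -h" output above.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# is_nearly_full: succeed when usage is at or above a threshold (default 90).
is_nearly_full() {
    local path=$1 threshold=${2:-90}
    [ "$(usage_percent "$path")" -ge "$threshold" ]
}
```

Something like `is_nearly_full / 90 && echo "root filesystem is almost full"` could then run from cron or be wrapped as a monitoring item.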
I had run into something similar before, so my guess was that boot.log had filled up again. Sure enough, it was 41 GB.
[root@zabbix1 ~]# cd /var/log/
[root@zabbix1 log]# du -sh *
26M audit
41G boot.log
4.0K dmesg
4.0K dmesg.old
4.0K dracut.log
50M httpd
824M messages
4.0K tallylog
224K wtmp
4.0K yum.log
21M zabbix
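When the culprit is not known in advance, sorting the `du` output helps. A small sketch, assuming a hypothetical helper named `largest_entries`; sizes are printed in KiB so that a plain numeric sort is enough.

```shell
# largest_entries: list the N largest entries directly under a directory,
# biggest first. Sizes are in KiB.
largest_entries() {
    local dir=$1 count=${2:-5}
    # -x keeps du on one filesystem, so anything mounted below dir
    # (for example a separate /home) is not counted.
    (cd "$dir" && du -skx -- * 2>/dev/null | sort -rn | head -n "$count")
}
```

For the case above, `largest_entries /var/log 3` would put boot.log on the first line.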
Looking inside the log, entries were being written at a furious rate. Only an excerpt is shown here.
[root@zabbix1 log]# tail -f boot.log
Nov 1 11:33:22 172.16.32.2 date=2019-11-01 time=11:33:22 devname=BJ-YZ-CO-FW1 devid=FG5H0E5818903326 logid=0103020301 type=event subtype=router level=information vd=root logdesc="Routing log" msg="OSPF: RtrPriority 1"
Nov 1 11:33:22 172.16.32.2 date=2019-11-01 time=11:33:22 devname=BJ-YZ-CO-FW1 devid=FG5H0E5818903326 logid=0103020301 type=event subtype=router level=information vd=root logdesc="Routing log" msg="OSPF: RtrDeadInterval 12"
Nov 1 11:33:22 172.16.32.2 date=2019-11-01 time=11:33:22 devname=BJ-YZ-CO-FW1 devid=FG5H0E5818903326 logid=0103020301 type=event subtype=router level=information vd=root logdesc="Routing log" msg="OSPF: DRouter 0.0.0.0"
Nov 1 11:33:22 172.16.32.2 date=2019-11-01 time=11:33:22 devname=BJ-YZ-CO-FW1 devid=FG5H0E5818903326 logid=0103020301 type=event subtype=router level=information vd=root logdesc="Routing log" msg="OSPF: BDRouter 0.0.0.0"
Nov 1 11:33:22 172.16.32.2 date=2019-11-01 time=11:33:22 devname=BJ-YZ-CO-FW1 devid=FG5H0E5818903326 logid=0103020301 type=event subtype=router level=information vd=root logdesc="Routing log" msg="OSPF: # Neighbors 1"
Nov 1 11:33:22 172.16.32.2 date=2019-11-01 time=11:33:22 devname=BJ-YZ-CO-FW1 devid=FG5H0E5818903326 logid=0103020301 type=event subtype=router level=information vd=root logdesc="Routing log" msg="OSPF: Neighbor 172.16.44.18"
Nov 1 11:33:22 172.16.32.2 date=2019-11-01 time=11:33:22 devname=BJ-YZ-CO-FW1 devid=FG5H0E5818903326 logid=0103020301 type=event subtype=router level=information vd=root logdesc="Routing log" msg="OSPF: NFSM[Vlanif105:172.16.44.18-172.16.46.1]: Full (HelloReceived)"
Nov 1 11:24:02 2019 BJ-YZ-DS-SW1&2 %DRVPLAT/4/DrvDebug: -DevIP=172.16.32.5-Slot=2; Many Parity Errors have been detected in last 10s.
Nov 1 11:24:02 2019 BJ-YZ-DS-SW1&2 %DRVPLAT/4/Log Info: -DevIP=172.16.32.5-Slot=2; Slot 2,unit 0 DLB_HGT_FLOWSET_TIMESTAMP_PAGE_X entry 693 parity error.
Nov 1 11:24:02 2019 BJ-YZ-DS-SW1&2 %DRVPLAT/4/Log Info: -DevIP=172.16.32.5-Slot=2; Slot 2,unit 0 DLB_HGT_FLOWSET_TIMESTAMP_PAGE_X entry 693 parity error.
Nov 1 11:24:02 2019 BJ-YZ-DS-SW1&2 %DRVPLAT/4/Log Info: -DevIP=172.16.32.5-Slot=2; Slot 2,unit 0 DLB_HGT_FLOWSET_TIMESTAMP_PAGE_X entry 693 parity error.
Nov 1 11:24:02 2019 BJ-YZ-DS-SW1&2 %DRVPLAT/4/Log Info: -DevIP=172.16.32.5-Slot=2; Slot 2,unit 0 DLB_HGT_FLOWSET_TIMESTAMP_PAGE_X entry 693 parity error.
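The log clearly contains messages from 172.16.32.5 and 172.16.32.2. I looked them up in zabbix and both turned out to be network devices. Zabbix itself was showing as down at this point, although its pages could still be viewed, probably from cache.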
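With tens of gigabytes of log, eyeballing `tail -f` only goes so far. Counting lines per sender makes the noisy device obvious. This is a rough sketch: `count_senders` is a name I made up, and it simply treats the first IPv4-looking token on each line as the sender, which happens to work for both message formats in the excerpt above.

```shell
# count_senders: read syslog lines on stdin and print "count address" pairs,
# busiest first, using the first IPv4-looking token found on each line.
count_senders() {
    awk '
        match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {
            seen[substr($0, RSTART, RLENGTH)]++
        }
        END { for (ip in seen) print seen[ip], ip }
    ' | sort -rn
}
```

A typical call would be `tail -n 100000 /var/log/boot.log | count_senders`, which samples only the end of the file instead of reading all 41 GB.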
First back up the log, then empty it to free the space. The root filesystem has no room left, so the compressed archive goes under /home.
[root@zabbix1 log]# tar zcvf /home/2019-11-1-boot.log.tar.gz boot.log
[root@zabbix1 log]# cat /dev/null > boot.log
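The two steps can be tied together so that the log is emptied only if the backup actually succeeded. The sketch below is an assumption-laden variant, not the exact commands used above: it swaps `tar` for plain `gzip -c`, and `archive_and_truncate` is a hypothetical name.

```shell
# archive_and_truncate: compress a log into a directory on another filesystem,
# then empty the original in place. The truncate runs only if gzip succeeded.
archive_and_truncate() {
    local log=$1 dest_dir=$2
    local archive="$dest_dir/$(date +%F)-$(basename "$log").gz"
    gzip -c -- "$log" > "$archive" || return 1
    # ": > file" truncates without replacing the inode, so a daemon that
    # still holds the file open keeps writing to the same (now empty) file.
    : > "$log"
    printf '%s\n' "$archive"
}
```

Truncating in place matters here: deleting the file with `rm` while rsyslog still has it open would not release the space until the daemon is restarted. Lines written between the compression and the truncate are lost, which was acceptable for this log.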
Checking the zabbix processes shows that the server's port 10051 is no longer listening. Only the agent's port 10050 is left.
[root@zabbix1 ~]# netstat -tnlp | grep zabbix
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 :::10050 :::* LISTEN 15271/zabbix_agentd
Restart zabbix-server.
[root@zabbix1 ~]# service zabbix-server restart
Shutting down Zabbix server: [FAILED]
Starting Zabbix server: [ OK ]
[root@zabbix1 ~]# netstat -tnlp | grep zabbix
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 :::10050 :::* LISTEN 15271/zabbix_agentd
tcp 0 0 :::10051 :::* LISTEN 1283/zabbix_server
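The "is 10051 back?" check can be scripted rather than read off by eye. A minimal sketch, assuming a hypothetical helper `port_is_listening` that parses `netstat -tnl` style output from stdin:

```shell
# port_is_listening: read "netstat -tnl" style output on stdin and succeed
# if some socket in LISTEN state is bound to the given port.
port_is_listening() {
    local port=$1
    awk -v port="$port" '
        $6 == "LISTEN" {
            # The port is the last ":"-separated piece of the local address,
            # which covers both 0.0.0.0:10051 and :::10051.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}
```

For example: `netstat -tnl | port_is_listening 10051 || echo "zabbix_server is not listening"`.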
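Cause
The network device 172.16.32.5 kept reporting errors and flooding its log output. The zabbix-server host is configured as the rsyslog receiver for the network devices, so a huge volume of error messages was written to boot.log.
Solution
Comment out the /var/log/boot.log rule in the rsyslog configuration, then restart rsyslog so the change takes effect.
[root@zabbix1 rsyslog.d]# vim /etc/rsyslog.conf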
#local7.* /var/log/boot.log
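If the same change has to be made on more than one host, it can be done non-interactively. This is just a sketch under assumptions: `comment_out_boot_log` is a made-up name, and the pattern assumes the rule ends with the path /var/log/boot.log as in the line above.

```shell
# comment_out_boot_log: prefix every active rule that writes to
# /var/log/boot.log with "#". Already commented lines are left alone.
# A copy of the original file is kept next to it with a .bak suffix.
comment_out_boot_log() {
    local conf=$1
    sed -i.bak 's|^\([^#].*[[:space:]]/var/log/boot\.log[[:space:]]*\)$|#\1|' "$conf"
}
```

On this host that would be `comment_out_boot_log /etc/rsyslog.conf` followed by `service rsyslog restart`. Note that this only stops the messages from being stored; the device at 172.16.32.5 is still reporting parity errors and still needs to be looked at.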