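Last week a disk failed in one of our Dell servers. The fault reached icinga2 through Dell's bundled OpenManage software. After replacing the disk, I remembered that another PVE cluster runs on Huawei servers, and Huawei ships no comparable hardware management software. So I installed the RAID controller vendor's tool and wrote a simple script to detect faults and raise an alert.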
Installing the RAID vendor's monitoring tool
Identifying the RAID controller
# lspci | grep -i raid
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
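The RAID controller is an "LSI Logic MegaRAID SAS-3 3108".
Downloading and installing MegaRAID Storage Manager (MSM)
LSI is now part of Broadcom, so the software is downloaded from Broadcom's support site:
https://www.broadcom.cn/support
The downloaded archive contains only RPM packages, but PVE is based on Debian, so the RPMs have to be converted to .deb with alien before they can be installed.
apt install alien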
tar zxvf 17.05.02.01_MSM_linux-x86.tar.gz
cd disk
alien --scripts *.rpm
dpkg --install lib-utils2_1.00-3_all.deb
dpkg --install megaraid-storage-manager_17.05.02-2_all.deb
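By default StorCLI is installed under /usr/local/MegaRAID Storage Manager/StorCLI/. The path contains spaces, so it has to be quoted on the command line.
Testing the program
Show all controller information. The output is long:
# "/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL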
Adapter #0
=====================================
Versions
================
Product Name : SAS3108
Serial No :
FW Package Build: 24.16.0-0106
Mfg. Data
================
Mfg. Date : 00/00/00
......
Image Versions in Flash:
================
BIOS Version : 6.32.02.0_4.17.08.00_0x06150500
......
Pending Images in Flash
================
None
PCI Info
================
Controller Id : 0000
......
HW Configuration
================
......
ROC temperature : 47 degree Celcius
Settings
================
Current Time : 3:37:33 11/4, 2020
Predictive Fail Poll Interval : 300sec
......
Capabilities
================
RAID Level Supported : RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
......
Status
================
ECC Bucket Count : 0
Limitations
================
Max Arms Per VD : 32
......
Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 3
Disks : 2
Critical Disks : 0
Failed Disks : 0
Supported Adapter Operations
================
Rebuild Rate : Yes
......
Supported VD Operations
================
Read Policy : Yes
Write Policy : Yes
......
Supported PD Operations
================
Force Online : Yes
......
T10 Power State : No
Error Counters
================
Memory Correctable Errors : 0
Memory Uncorrectable Errors : 0
High Availability Properties
================
Topology Type : None
Cluster Information
================
Cluster Permitted : No
Cluster Active : No
Default Settings
================
Phy Polarity : 0
Phy PolaritySplit : 0
Background Rate : 30
......
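We only need the "Device Present" section. If "Degraded", "Offline", "Critical Disks" and "Failed Disks" are all "0", the disks are considered healthy; otherwise there is a fault.
The "Device Present" heading is followed by 8 lines, and as long as exactly 4 of them contain a 0, everything is OK. Note that this shortcut only holds while the other counters (Virtual Drives, Physical Devices, Disks) contain no 0 digit.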
Getting the status
A simple pipeline does the job:
"/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | grep -A 8 'Device Present' | grep 0 | wc -l
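The pipeline above counts every line that contains a 0 anywhere, so a value such as "Virtual Drives : 10" would change the count. As a hedged alternative, the sketch below reads the same text from stdin and sums only the four fault counters by name. The function name sum_raid_faults is made up for this example, and it assumes the "key : value" layout shown in the output above.

```shell
#!/bin/bash
# Hypothetical helper (not part of StorCLI): sum the Degraded, Offline,
# Critical Disks and Failed Disks counters of every "Device Present"
# section read from stdin. Prints the sum, or nothing if the four
# counters could not all be found.
sum_raid_faults() {
    awk -F':' '
        /Device Present/ { in_section = 1; next }
        in_section {
            key = $1; val = $2
            # Trim surrounding blanks from key and value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if ((key == "Degraded" || key == "Offline" ||
                 key == "Critical Disks" || key == "Failed Disks") &&
                val ~ /^[0-9]+$/) {
                sum += val
                seen++
                # All four counters of this section read: leave it.
                if (seen % 4 == 0) in_section = 0
            }
        }
        END { if (seen > 0 && seen % 4 == 0) print sum }
    '
}

# Intended use (assumes the install path from this article):
# "/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | sum_raid_faults
```

A result of 0 means no fault counter is set; any other number, or empty output, deserves a closer look.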
Putting it in a script
#!/bin/bash
# Count the lines of the "Device Present" section that contain a 0.
PRESENT=$("/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | grep -A 8 "Device Present" | grep 0 | wc -l)
if [[ $PRESENT -eq 4 ]]; then
    echo 'All are OK' && exit 0
else
    # Any other count is treated as a fault.
    echo 'RAID problem detected' && exit 2
fi
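If the script is meant to run as a Nagios or Icinga plugin, it can help to separate "a disk has failed" from "the check itself could not run". The plugin convention uses exit code 3 (UNKNOWN) for the latter. The following is a minimal sketch under that assumption: it keeps the zero-count test from the script above, and check_raid is a hypothetical function name that takes the storcli64 path as its only argument.

```shell
#!/bin/bash
# Sketch: the zero-count test plus an UNKNOWN state.
# Exit codes follow the Nagios/Icinga plugin convention:
#   0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
check_raid() {
    local storcli="$1" output zeros

    # Without the binary the RAID state is simply not known.
    if [[ ! -x "$storcli" ]]; then
        echo "UNKNOWN: $storcli not found or not executable"
        return 3
    fi

    output=$("$storcli" -AdpAllInfo -aALL 2>/dev/null)

    # No "Device Present" section means there is nothing to judge.
    if ! grep -q 'Device Present' <<<"$output"; then
        echo "UNKNOWN: no Device Present section in storcli output"
        return 3
    fi

    # Same test as before: 4 of the 8 following lines must contain a 0.
    zeros=$(grep -A 8 'Device Present' <<<"$output" | grep -c 0)
    if [[ $zeros -eq 4 ]]; then
        echo "OK: no degraded, offline, critical or failed devices"
        return 0
    fi
    echo "CRITICAL: expected 4 zero counters, found $zeros"
    return 2
}

# Intended use (assumes the install path from this article):
# check_raid "/usr/local/MegaRAID Storage Manager/StorCLI/storcli64"; exit $?
```

Passing the path as an argument keeps the quoting in one place and lets the function be tried against a stand-in script before it is pointed at the real controller.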
Testing the script
# bash /mnt/pve/nfs199/pve/check_MegaRAID.sh
All are OK
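This can now be combined with the DingTalk alert script covered in an earlier post, so that a warning is sent through DingTalk when a fault occurs,
or it can be integrated into a monitoring platform such as nagios, zabbix or icinga.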
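For the simplest case, without a monitoring platform, the glue between the check and an alert command could look roughly like the sketch below. Both names are placeholders: run_and_alert is a hypothetical helper, and the alert command stands in for whatever notification script is in use (the DingTalk script itself is not shown here, so nothing about its interface is assumed beyond "takes the message as its first argument").

```shell
#!/bin/bash
# Hypothetical glue: run_and_alert CHECK_CMD ALERT_CMD
# Runs CHECK_CMD and, if it exits non-zero, calls ALERT_CMD with the
# host name and the check output as a single argument.
# Returns the exit status of CHECK_CMD.
run_and_alert() {
    local check_cmd="$1" alert_cmd="$2" message status

    message=$("$check_cmd" 2>&1)
    status=$?

    if (( status != 0 )); then
        "$alert_cmd" "${HOSTNAME:-unknown-host}: $message"
    fi
    return "$status"
}

# Intended use from cron (the alert script path is a placeholder):
# run_and_alert /mnt/pve/nfs199/pve/check_MegaRAID.sh /path/to/alert-script.sh
```

Because only the exit status decides whether an alert goes out, the same wrapper works with any check that follows the 0 = OK convention.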