因为工作性质改变,有许久没动手处理故障了,今天的排错也是非生产环境,为验证一些测试临时搭的一套11g RAC环境,为了省时间,直接拿之前备份的vbox的环境拷贝,结果启动机器发现集群无法启动:
代码语言:javascript复制[root@jystdrac1 ~]# su - grid
[grid@jystdrac1 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[grid@jystdrac1 ~]$ crsctl stat res -t -init
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.
查看集群alert日志报错:
代码语言:javascript复制[grid@jystdrac1 jystdrac1]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1
[grid@jystdrac1 jystdrac1]$ tail -20f alertjystdrac1.log
2021-07-01 00:26:27.379:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4526)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:26:31.384:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:32.889:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4568)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:28:36.895:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:38.424:
[mdnsd(4644)]CRS-5602:mDNS service stopping by request.
2021-07-01 00:30:38.407:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4633)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:30:42.412:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:43.923:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4676)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:32:47.928:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:49.455:
[mdnsd(4822)]CRS-5602:mDNS service stopping by request.
进一步看mdns.log等最新报错信息(gpnp.log类似,为节省篇幅没有贴出):
代码语言:javascript复制[grid@jystdrac1 mdnsd]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1/mdnsd
[grid@jystdrac1 mdnsd]$ tail -20 mdnsd.log
2021-06-30 22:50:59.275: [ MDNS][1534236416] mdnsd exit
2021-06-30 22:53:03.989: [ default][1342412544]
================================================================================
2021-06-30 22:53:03.989: [ default][1342412544]mdnsd START pid=2201
[ clsdmt][1335961344]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:53:03.991: [ clsdmt][1335961344]PID for the Process [2201], connkey 9
2021-06-30 22:53:03.991: [ clsdmt][1335961344]Creating PID [2201] file for home /opt/app/11.2.0/grid host jystdrac1 bin mdns to /opt/app/11.2.0/grid/mdns/init/
2021-06-30 22:53:03.992: [ clsdmt][1335961344]Writing PID [2201] to the file [/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid]
2021-06-30 22:53:03.992: [ clsdmt][1335961344]Failed to record pid for MDNSD
2021-06-30 22:53:03.992: [ clsdmt][1335961344]Terminating process
2021-06-30 22:53:03.992: [ MDNS][1335961344] clsdm requested mdnsd exit
2021-06-30 22:53:03.992: [ MDNS][1335961344] mdnsd exit
2021-06-30 22:57:14.236: [ default][747345664]
================================================================================
2021-06-30 22:57:14.236: [ default][747345664]mdnsd START pid=2375
[ clsdmt][740894464]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:57:14.239: [ clsdmt][740894464]PID for the Process [2375], connkey 9
2021-06-30 22:57:14.239: [ clsdmt][740894464]Cr[grid@jystdrac1 mdnsd]$
MOS 也有篇文章介绍了RAC起不来的五大问题:
- Grid Infrastructure 启动的五大问题 (Doc ID 1526147.1)
其中问题 4:Agent 或者 mdnsd.bin, gpnpd.bin, gipcd.bin 未运行,就和目前的现象很匹配。
文档中描述了可能的原因和对应解决方案:
代码语言:javascript复制可能的原因:
1. orarootagent 缺少执行权限
2. 缺少进程相关的 <node>.pid 文件或者这个文件的所有者/权限不对
3. GRID_HOME 所有者/权限不对
解决方案:
1. 和一个好的GRID_HOME比较所有者/权限,并做相应的改正,或者以root用户执行:,
# cd <GRID_HOME>/crs/install
# ./rootcrs.pl -unlock
# ./rootcrs.pl -patch
这将停止集群软件,对需要的文件的所有者/权限设置为root用户,并且重启集群软件。
2. 如果对应的 <node>.pid 不存在, 就用touch命令创建一个具有相应所有者/权限的文件, 否则就按要求改正文件<node>.pid的所有者/权限, 然后重启集群软件.
这里是<GRID_HOME>下,所有者属于root:root 权限 644的<node>.pid 文件列表:
./ologgerd/init/<node>.pid
./osysmond/init/<node>.pid
./ctss/init/<node>.pid
./ohasd/init/<node>.pid
./crs/init/<node>.pid
所有者属于<grid>:oinstall,权限644
./mdns/init/<node>.pid
./evm/init/<node>.pid
./gipc/init/<node>.pid
./gpnp/init/<node>.pid
3. 对第3种原因,请参考解决方案1
可是依次排查下来发现均无问题,奇怪了,为啥权限都正确就是写不进去呢?
手工vi试下看看呢?
代码语言:javascript复制[grid@jystdrac1 jystdrac1]$ vi /opt/app/11.2.0/grid/mdns/init/jystdrac1.pid
2201
保存时发现报错:
代码语言:javascript复制"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid"
"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid" E514: write error (file system full?)
Press ENTER or type command to continue
什么?文件系统空间满了???
代码语言:javascript复制[grid@jystdrac1 jystdrac1]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxbase-lv_root 28G 27G 0 100% /
tmpfs 1.5G 0 1.5G 0% /dev/shm
/dev/sda1 485M 39M 421M 9% /boot
额,果然.. 好尴尬,居然是最初级的空间容量问题。 赶紧清理下空间后重启集群再试是否正常启动? It's Ok!