墨墨导读:一套19C CDB数据库,存储更换HBA卡宕,本文详述这起begin backup导致的故障恢复全过程。
半夜接到客户反馈,一套19C CDB数据库,存储更换HBA卡宕,起不来了,OPEN时提示需要介质恢复,这里截了一段ALERT LOG。
代码语言:javascript复制2020-07-28T23:40:53.328908 08:00
Errors in file /u02/app/oracle/diag/rdbms/racdb3/racdb32/trace/racdb32_ora_306493.trc:
ORA-10873: file 1 needs to be either taken out of backup mode or media recovered
ORA-01110: data file 1: ' DATA/RACDB3/DATAFILE/system.278.1037610503'
2020-07-28T23:40:53.387627 08:00
Errors in file /u02/app/oracle/diag/rdbms/racdb3/racdb32/trace/racdb32_ora_306493.trc:
ORA-10873: file 1 needs to be either taken out of backup mode or media recovered
ORA-01110: data file 1: ' DATA/RACDB3/DATAFILE/system.278.1037610503'
ORA-10873 signalled during: ALTER DATABASE OPEN /* db agent *//* {0:17:3557} */...
2020-07-28T23:40:55.357177 08:00
License high water mark = 2
2020-07-28T23:40:55.357492 08:00
USER(prelim) (ospid: 310054): terminating the instance
2020-07-28T23:40:56.369307 08:00
Instance terminated by USER(prelim), pid = 310054
ORA-10873: file 1 needs to be either taken out of backup mode or media recovered
代码语言:javascript复制
这里报ORA-10873是由于数据库或表空间BEGIN BACKUP导致,正确的处理方法只需要end backup即可。
代码语言:javascript复制alter database end backup;
alter tablespace [tablespace_name] end backup;
alter database open;
[oracle@test ~]$ oerr ora 10873 perl: warning: Setting locale failed. perl: warning: Please check that your locale settings: LANGUAGE = (unset), LC_ALL = (unset), LANG = “en_us.utf8” are supported and installed on your system. perl: warning: Falling back to the standard locale (“C”). 10873, 00000, “file %s needs to be either taken out of backup mode or media recovered” // *Cause: An attempt was made to open a database after an instance failure or // SHUTDOWN ABORT interrupted an online backup. // *Action: If the indicated file is not a restored backup, then issue the // ALTER DATABASE END BACKUP command and open the database. If the // file is a restored online backup, then apply media recovery to // it and open the database.
当时RECOVER DATABASE 提示找不到归档(需要6-18号的归档)
由于有存储相关操作,误以为其它原因导致的问题,没有关注该报错,查询vdatafile,vdatafile,vdatafile_header发现检查点为上个月的6-18号。
代码语言:javascript复制---当前日期为2020-07-28T23:40:56
SQL> col name for a10
SQL> select a.con_id,a.name,b.file#,b.rfile#,b.checkpoint_change#,b.checkpoint_time,b.status from v$containers a,v$datafile b where a.con_id=b.con_id order by checkpoint_change#;
CON_ID NAME FILE# RFILE# CHECKPOINT_CHANGE# CHECKPOINT_TIME STATUS
-------------------- ---------- -------------------- -------------------- -------------------- ------------------- -------
2 PDB$SEED 6 4 2336937 2020-04-13 09:26:28 ONLINE
2 PDB$SEED 5 1 2336937 2020-04-13 09:26:28 SYSTEM
2 PDB$SEED 8 9 2336937 2020-04-13 09:26:28 ONLINE
1 CDB$ROOT 178 178 8041849750453 2020-06-18 19:28:19 ONLINE
1 CDB$ROOT 7 7 8041849750453 2020-06-18 19:28:19 ONLINE
1 CDB$ROOT 9 9 8041849750453 2020-06-18 19:28:19 ONLINE
1 CDB$ROOT 4 4 8041849750453 2020-06-18 19:28:19 ONLINE
1 CDB$ROOT 3 3 8041849750453 2020-06-18 19:28:19 ONLINE
1 CDB$ROOT 1 1 8041849750453 2020-06-18 19:28:19 SYSTEM
代码语言:javascript复制
接着检查日志及询问客户也没有做restore的操作,误判断为出现了异常,未找到解决办法,因为有最近的全备,做了restore CDB$ROOT的操作,结果悲剧了,还原出来的数据文件,RECOVER仍然需要从6-18的归档开始,查询文件头检查点还是6-18号,接着查询备份中信息,文件的检查点也是6-18号,没遇到过这种情况,以为是ORACLE BUG。
代码语言:javascript复制RMAN> list backup of datafile 1;
List of Backup Sets
===================
BS Key Type LV Size Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ -------------------
2508 Incr 0 23.19G SBT_TAPE 00:04:26 **2020-07-26 08:32:28**
BP Key: 2508 Status: AVAILABLE Compressed: NO Tag: HOT_DB_BK_LEVEL0
Handle: bk_3490_1_1046766482 Media: @aaaa6
List of Datafiles in backup set 2508
File LV Type Ckp SCN Ckp Time Abs Fuz SCN Sparse Name
---- -- ---- ---------- ------------------- ----------- ------ ----
1 0 Incr 8041849750453 **2020-06-18 19:28:19** 8050242429418 NO DATA/RACDB3/DATAFILE/system.278.1037610503
RMAN>
代码语言:javascript复制
这里有一个大问题,就是由的数据库较大,restore前没有对当前环境做备份。切记,任何危险的变更操作都需要备份。做到可回退!!!
咨询公司专家后,确定为某此表空间做了begin backup导致。begin backup后文件头上的checkpoint不再更新。但经了解实际没有人为发起过backup backup,alert log中也没有找到begin backup的操作记录,需要再分析。
这时由于之前做了restore cdbroot的操作,控制文件,cdbroot的文件已从备份中还原,导致不能再end backup操作,1个月前的归档已清理,也没办法从6-18开始应用归档。没有办法通过正常的途径恢复数据库,悲剧!!!!
代码语言:javascript复制通过v$backup确定文件处于begin backup热备模式。SQL> select * from v$backup;
FILE# STATUS CHANGE# TIME CON_ID
-------------------- ------------------ -------------------- ------------------- --------------------
3 NOT ACTIVE 8041849750453 2020-06-18 19:28:19 1
4 NOT ACTIVE 8041849750453 2020-06-18 19:28:19 1
5 NOT ACTIVE 0 2
6 NOT ACTIVE 0 2
7 ACTIVE 8041849750453 2020-06-18 19:28:19 1
8 NOT ACTIVE 0 2.......
19 ACTIVE 8041849750453 2020-06-18 19:28:19 4
20 ACTIVE 8041849750453 2020-06-18 19:28:19 4
21 ACTIVE 8041849750453 2020-06-18 19:28:19 4SQL> select * from v$backup;
FILE# STATUS CHANGE# TIME CON_ID
-------------------- ------------------ -------------------- ------------------- --------------------
3 NOT ACTIVE 8041849750453 2020-06-18 19:28:19 1
4 NOT ACTIVE 8041849750453 2020-06-18 19:28:19 1
5 NOT ACTIVE 0 2
6 NOT ACTIVE 0 2
7 ACTIVE 8041849750453 2020-06-18 19:28:19 1
8 NOT ACTIVE 0 2
9 NOT ACTIVE 8041849750453 2020-06-18 19:28:19 1
15 ACTIVE 8041849750453 2020-06-18 19:28:19 4
16 ACTIVE 8041849750453 2020-06-18 19:28:19 4
17 ACTIVE 8041849750453 2020-06-18 19:28:19 4
18 ACTIVE 8041849750453 2020-06-18 19:28:19 4
FILE# STATUS CHANGE# TIME CON_ID
-------------------- ------------------ -------------------- ------------------- --------------------
19 ACTIVE 8041849750453 2020-06-18 19:28:19 4
20 ACTIVE 8041849750453 2020-06-18 19:28:19 4
21 ACTIVE 8041849750453 2020-06-18 19:28:19 4
22 ACTIVE 8041849750453 2020-06-18 19:28:19 4
23 ACTIVE 8041849750453 2020-06-18 19:28:19 4
24 ACTIVE 8041849750453 2020-06-18 19:28:19 4
25 ACTIVE 8041849750453 2020-06-18 19:28:19 4
26 ACTIVE 8041849750453 2020-06-18 19:28:19 4
27 ACTIVE 8041849750453 2020-06-18 19:28:19 4
28 ACTIVE 8041849750453 2020-06-18 19:28:19 4
29 ACTIVE 8041849750453 2020-06-18 19:28:19 4
如果不是CDB还可以重建控制文件,然后用以下方法解决,但如果是CDB,需要切到pdb中执行以下操作,重建控制文件后,PDB是ORACLE内部的名字,没办法切换PDB,所以以下方法行不通。 alter database datafile 1 offline; alter database datafile 1 end backup; alter database datafile 1 online;
最终的解决办法是,bbed修改文件头上的检查点信息,再应用近几天的归档,应用到最新状态,open resetlogs,最终0数据丢失恢复。
这里由于文件比较多,不好全copy到本地文件系统,用到了ASM未公开的内部包,只
读取ASM中的数据头数据块到本地,bbed修改完,再copy回去。
代码语言:javascript复制--copy datafile head from asm
Set pagesize 300
Set linesize 300
set numw 20
alter session set nls_date_format='yyyy-mm-dd hh24:mi:ss';
col name for a100
select '@tofs '||b.name||' '||regexp_replace(b.name,'^.*DATAFILE/','/u01/work/') from v$datafile b where b.checkpoint_change#<8050778122014 and con_id!=2 order by checkpoint_change#;
@tofs DATA/RACDB/DATAFILE/system.278.1037610503 /u01/work/system.278.1037610503
@tofs DATA/RACDB/DATAFILE/sysaux.261.1037610537 /u01/work/sysaux.261.1037610537
@tofs DATA/RACDB/DATAFILE/undotbs1.288.1037610553 /u01/work/undotbs1.288.1037610553
@tofs DATA/RACDB/DATAFILE/users.282.1037610553 /u01/work/users.282.1037610553
@tofs DATA/RACDB/DATAFILE/undotbs2.263.1037611185 /u01/work/undotbs2.263.1037611185
@tofs DATA/RACDB/A3D75790AD24522EE053C756D80A788E/DATAFILE/system.302.1038386279 /u01/work/system.302.1038386279
@tofs DATA/RACDB/A3D75790AD24522EE053C756D80A788E/DATAFILE/sysaux.292.1038386277 /u01/work/sysaux.292.1038386277
@tofs DATA/RACDB/A3D75790AD24522EE053C756D80A788E/DATAFILE/undotbs1.299.1038386279
......
@tofs DATA/RACDB/A3EFF303A38BE5E8E053C756D80A801E/DATAFILE/sysaux.272.1038492071 /u01/work/sysaux.272.1038492071
@tofs DATA/RACDB/A3EFF303A38BE5E8E053C756D80A801E/DATAFILE/undotbs1.281.1038492071 /u01/work/undotbs1.281.1038492071
@tofs DATA/RACDB/A3EFF303A38BE5E8E053C756D80A801E/DATAFILE/undo_2.283.1038492071 /u01/work/undo_2.283.1038492071
@tofs DATA/RACDB/A3EFF303A38BE5E8E053C756D80A801E/DATAFILE/users.279.1038492071 /u01/work/users.279.1038492071
@tofs DATA/RACDB/DATAFILE/test11.425.1041690739 /u01/work/test11.425.1041690739
--生成bbed listfile
Set pagesize 300
Set linesize 300
set numw 20
alter session set nls_date_format='yyyy-mm-dd hh24:mi:ss';
col name for a100
select file#||' '||regexp_replace(b.name,'^.*DATAFILE/','/u01/work/') from v$datafile b where b.checkpoint_change#<8050778122014 and con_id!=2 order by checkpoint_change#;
BBED> info
File# Name Size(blks)
----- ---- ----------
1 /u01/work/system.278.1037610503 0
3 /u01/work/sysaux.261.1037610537 0
4 /u01/work/undotbs1.288.1037610553 0
7 /u01/work/users.282.1037610553 0
9 /u01/work/undotbs2.263.1037611185 0
15 /u01/work/system.302.1038386279 0
16 /u01/work/sysaux.292.1038386277 0
17 /u01/work/undotbs1.299.1038386279 0
18 /u01/work/undotbs1.303.1038386279 0
19 /u01/work/undotbs2.294.1038386277 0
20 /u01/work/undotbs2.301.1038386279 0
......
178 /u01/work/test11.425.1041690739 0
---bbed 从kcvfhbcp恢复检查点
assign file 1 block 1 kcvfhckp = file 1 block 1 kcvfhbcp
assign file 3 block 1 kcvfhckp = file 3 block 1 kcvfhbcp
assign file 4 block 1 kcvfhckp = file 4 block 1 kcvfhbcp
assign file 7 block 1 kcvfhckp = file 7 block 1 kcvfhbcp
assign file 9 block 1 kcvfhckp = file 9 block 1 kcvfhbcp
assign file 15 block 1 kcvfhckp = file 15 block 1 kcvfhbcp
assign file 16 block 1 kcvfhckp = file 16 block 1 kcvfhbcp
assign file 17 block 1 kcvfhckp = file 17 block 1 kcvfhbcp
assign file 18 block 1 kcvfhckp = file 18 block 1 kcvfhbcp
assign file 19 block 1 kcvfhckp = file 19 block 1 kcvfhbcp
......
assign file 119 block 1 kcvfhckp = file 119 block 1 kcvfhbcp
assign file 178 block 1 kcvfhckp = file 178 block 1 kcvfhbcp
assign file 181 block 1 kcvfhckp = file 181 block 1 kcvfhbcp
sum apply file 1 block 1
sum apply file 3 block 1
sum apply file 4 block 1
sum apply file 7 block 1
sum apply file 9 block 1
sum apply file 15 block 1
sum apply file 16 block 1
sum apply file 17 block 1
sum apply file 18 block 1
sum apply file 19 block 1
......
sum apply file 118 block 1
sum apply file 119 block 1
sum apply file 178 block 1
sum apply file 181 block 1
kcvfhckp 检查点,恢复起始点,begin backup后再不更新
kcvfhbcp begin backup后检查点(begin backup后检查点更新在该位置,end backup以该检查点更新kcvfhckp)
---copy to asm
Set pagesize 300
Set linesize 300
set numw 20
alter session set nls_date_format='yyyy-mm-dd hh24:mi:ss';
col name for a100
select '@toasm '||regexp_replace(b.name,'^.*DATAFILE/','/u01/work/')||' '||b.name from v$datafile b where b.checkpoint_change#<8050778122014 and con_id!=2 order by checkpoint_change#;
@toasm /u01/work/system.278.1037610503 DATA/RACDB/DATAFILE/system.278.1037610503
@toasm /u01/work/sysaux.261.1037610537 DATA/RACDB/DATAFILE/sysaux.261.1037610537
@toasm /u01/work/undotbs1.288.1037610553 DATA/RACDB/DATAFILE/undotbs1.288.1037610553
@toasm /u01/work/users.282.1037610553 DATA/RACDB/DATAFILE/users.282.1037610553
@toasm /u01/work/undotbs2.263.1037611185 DATA/RACDB/DATAFILE/undotbs2.263.1037611185
@toasm /u01/work/system.302.1038386279 DATA/RACDB/A3D75790AD24522EE053C756D80A788E/DATAFILE/system.302.1038386279
@toasm /u01/work/sysaux.292.1038386277 DATA/RACDB/A3D75790AD24522EE053C756D80A788E/DATAFILE/sysaux.292.1038386277
@toasm /u01/work/undotbs1.299.1038386279 DATA/RACDB/A3D75790AD24522EE053C756D80A788E/DATAFILE/undotbs1.299.1038386279
.......
@toasm /u01/work/test11.425.1041690739 DATA/RACDB/DATAFILE/test11.425.1041690739
代码语言:javascript复制
查询v$datafile_header 确认checkpoint_change#已更新。
代码语言:javascript复制recover database
run {
allocate channel ch00 device type sbt;
allocate channel ch01 device type sbt;
allocate channel ch02 device type sbt;
allocate channel ch03 device type sbt;
SEND ‘NB_ORA_SERV=bak-svr,NB_ORA_CLIENT=racdb6’;
recover database ;
release channel ch00;
release channel ch01;
release channel ch02;
release channel ch03;
}
注册最近两天的归档,继续recover database,直到最新
代码语言:javascript复制catalog archivelog ‘ ARCH/RACDB/archivelog/2020_07_28/thread_2_seq_4935.819.1046934651’;
catalog archivelog ‘ ARCH/RACDB/archivelog/2020_07_28/thread_2_seq_4936.524.1046935119’;
…
catalog archivelog ‘ ARCH/RACDB/archivelog/2020_07_28/thread_2_seq_4942.554.1046945869’;
catalog archivelog ‘ ARCH/RACDB/archivelog/2020_07_28/thread_2_seq_4943.750.1046946151’;
最后
recover database using backup controlfile;
cancel;
alter database open resetlogs;
到些数据库启动成功。
下面测试重现了该问题,及正确的处理方法。不过19C中并没有人为发起begin backup,需要继续排查什么原因导致。
代码语言:javascript复制SQL> alter tablespace ts2 begin backup;
Tablespace altered.
SQL> select * from v$backup;
FILE# STATUS CHANGE# TIME
---------- ------------------ ---------- -----------------
1 NOT ACTIVE 0
2 NOT ACTIVE 0
3 NOT ACTIVE 0
4 NOT ACTIVE 0
5 NOT ACTIVE 0
6 NOT ACTIVE 0
7 NOT ACTIVE 0
8 ACTIVE 2237473 20200730 17:30:44
9 NOT ACTIVE 0
9 rows selected.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> shutdown abort;
ORACLE instance shut down.
SQL>
SQL> startup
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.
Total System Global Area 626327552 bytes
Fixed Size 2255832 bytes
Variable Size 419431464 bytes
Database Buffers 197132288 bytes
Redo Buffers 7507968 bytes
Database mounted.
ORA-10873: file 8 needs to be either taken out of backup mode or media recovered
ORA-01110: data file 8: '/oracle/app/oracle/oradata/TESTA/datafile/o1_mf_ts2_hcw05wo4_.dbf'
SQL> alter tablespace ts2 end backup;
Tablespace altered.
SQL> alter database open;
Database altered.