Understanding the Role of the DBI File in the TimesTen In-Memory Database and Handling Related Failures
Posted by 奋斗的小鸟_oracle
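The DBI (DataBase Information) file records the DSN registration information of a TimesTen instance. The Daemon process reads this registration information at startup, and the file also indicates the startup policy of each DSN's components. In day-to-day operations almost nobody pays attention to the DBI file; some people have run TimesTen for a year or two without knowing that it exists. When a failure does require working with this file, however, it is usually a serious one, and because TimesTen typically carries core business workloads, every minute of downtime is felt directly by customers. I have seen this at a number of the sites I support, the customers who call me about it are always very anxious, and members of my chat group have asked about it as well, so this article summarizes the different cases.
1. Host restarts abnormally and the Checkpoint file directory is not mounted
Nearly all failures involving the DBI file fall into this category. For I/O reasons, during system planning we usually separate the Checkpoint file system and the TransLog file system from the software installation directory and place the Checkpoint and TransLog files on external storage. For storage-related reasons, however, these file systems are generally not configured to mount automatically. When the host restarts abnormally, operators tend to rush to bring the database back and skip the pre-start checks. If the daemon process is started while the Checkpoint file system is not mounted, the Daemon cannot find the Checkpoint files recorded in the registration information. TimesTen then renames the original DBI file to the original file name with a '~' appended and generates a new DBI file. Because the Checkpoint directory does not exist at that point, a new DSN cannot be created from the configuration in sys.odbc.ini either, and the original DSN cannot be loaded. Since the DBI file has already been replaced, restarting again only reads the registration information from the new, modified DBI file.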
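The pre-start check that gets skipped in this scenario can be sketched as a small shell function. This is a minimal sketch, not a TimesTen utility: the function name check_datastore_dirs is hypothetical, and it assumes DataStore= entries appear one per line in the ini file and that the paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of TimesTen): verify that the directory of
# every DataStore= path listed in an ODBC ini file exists.
# Assumption: one DataStore= entry per line, paths without spaces.
# Usage: check_datastore_dirs /path/to/sys.odbc.ini
check_datastore_dirs() {
    ini_file=$1
    missing=0
    # Extract the DataStore= values, dropping any trailing "#" comment.
    paths=$(sed -n 's/^[[:space:]]*DataStore=\([^#]*\).*$/\1/p' "$ini_file")
    for path in $paths; do
        dir=$(dirname "$path")
        if [ ! -d "$dir" ]; then
            echo "MISSING: $dir (DataStore=$path)"
            missing=1
        fi
    done
    # Exit status 0 only when every directory was found.
    return $missing
}
```

One way to use it is to gate the daemon start on the result, for example `check_datastore_dirs "$TT_HOME/info/sys.odbc.ini" && ttDaemonAdmin -start`, so that a missing mount stops the start before the DBI file is touched.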
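2. Host restarts abnormally and the DBI file is lost or corrupted
This case is rather extreme. Under normal circumstances the file does not stay lost, because if it goes missing or is deleted, the Daemon process automatically generates a new DBI file with the same name as soon as a session connects to the DSN. If the file is corrupted, however, the outcome is much worse: while the file exists, a damaged file is not repaired automatically. We therefore recommend backing up this file once the DSN has been created.
The scenarios above and their solutions are simulated below.
1. Host restarts abnormally and the Checkpoint file directory is not mounted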
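The backup recommended above could look like the following. It is a sketch under stated assumptions: backup_dbi is a hypothetical function name, the DBI files are assumed to live in the instance's info directory as in the listings in this article, and the backup directory is whatever location you choose.

```shell
#!/bin/sh
# Hypothetical helper (not part of TimesTen): copy every DBI* file from the
# info directory into a backup directory, tagging each copy with a timestamp.
# Usage: backup_dbi <info_dir> <backup_dir>
backup_dbi() {
    info_dir=$1
    backup_dir=$2
    stamp=$(date +%Y%m%d%H%M%S)
    mkdir -p "$backup_dir" || return 1
    found=0
    for f in "$info_dir"/DBI*; do
        [ -f "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        cp -p "$f" "$backup_dir/$(basename "$f").$stamp" || return 1
        found=1
    done
    # Non-zero exit status when there was nothing to back up.
    [ "$found" -eq 1 ]
}
```

For example, `backup_dbi "$TT_HOME/info" /some/backup/dir` (the backup path is a placeholder) could be run right after a DSN is created or its policies are changed.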
- First, simulate an abnormal host crash or a Daemon process failure
Use ttstatus to find the Daemon process ID, then kill it directly
$ ttstatus|grep -E "Daemon pid"
Daemon pid 15752 port 53396 instance tt1122
$ kill -9 15752
$ ttstatus|grep -E "Daemon pid"
ttStatus: Could not connect to the TimesTen daemon.
If the TimesTen daemon is not running, please start it
by running "ttDaemonAdmin -start".
- Simulate the host file system not being mounted
$ cd $TT_HOME/info
$ cat sys.odbc.ini|grep -E "DataStore="
DataStore=/ttchk/tt1122/DataStore/TYINFO/tyinfodata # if several DSNs are configured, pick out the one you need
$ cd /ttchk/tt1122/DataStore
$ ls
TYINFO
$ mv TYINFO TYINFO1
- Next, look at the files in the info directory
$ cd $TT_HOME/info
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw------- 1 timesten timesten 3520 Jan 20 08:51 DBI54928054.0
-rw------- 1 timesten timesten 3520 Jan 1 01:53 DBI54928054.01~
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
-rw-r--r-- 1 timesten timesten 6 Jan 20 08:51 timestend.pid
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
- Start the daemon process
$ ttDaemonAdmin -start
Pid file exists: /TimesTen/tt1122/info/timestend.pid.
To start anyway, use -force.
$ ttDaemonAdmin -start -force
/TimesTen/tt1122/info/timestend.pid file exists, attempt start due to -force option.
TimesTen Daemon startup OK.
$ ttstatus
TimesTen status report as of Tue Jan 20 09:14:48 2015
Daemon pid 15902 port 53396 instance tt1122
TimesTen server pid 15917 started on port 53397
------------------------------------------------------------------------
Accessible by group timesten
End of report
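- This is where things go wrong: ttstatus shows no DSN registration information for the instance, and no amount of restarting or fiddling brings it back.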
$ ttisql tyinfo
Copyright (c) 1996, 2014, Oracle and/or its affiliates. All rights reserved.
Type ? or "help" for help, type "exit" to quit ttIsql.
connect "DSN=tyinfo";
830: Cannot create data store file. OS-detected error: No such file or directory
The command failed.
Done.
$ ttadmin -ramload tyinfo
ttAdmin: TimesTen Error: 10002; No record of a data store located in '/ttchk/tt1122/DataStore/TYINFO/tyinfodata'
- Now simulate the file system being mounted again
$ mv TYINFO1 TYINFO
- Even so, whatever we try, the instance's DSN registration information still cannot be found
$ ttstatus
TimesTen status report as of Tue Jan 20 09:21:29 2015
Daemon pid 15902 port 53396 instance tt1122
TimesTen server pid 15917 started on port 53397
------------------------------------------------------------------------
Accessible by group timesten
End of report
$ ttisql tyinfo
Copyright (c) 1996, 2014, Oracle and/or its affiliates. All rights reserved.
Type ? or "help" for help, type "exit" to quit ttIsql.
connect "DSN=tyinfo";
10003: Unexpected data store file exists for new data store: /ttchk/tt1122/DataStore/TYINFO/tyinfodata.ds0.
The command failed.
Done.
$ ttadmin -ramload tyinfo
ttAdmin: TimesTen Error: 10002; No record of a data store located in '/ttchk/tt1122/DataStore/TYINFO/tyinfodata'
$ ttdaemonadmin -restart
TimesTen Daemon stopped.
TimesTen Daemon startup OK.
$ ttstatus
TimesTen status report as of Tue Jan 20 09:22:03 2015
Daemon pid 16028 port 53396 instance tt1122
TimesTen server pid 16044 started on port 53397
------------------------------------------------------------------------
Accessible by group timesten
End of report
- Look at the files in the info directory again
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw------- 1 timesten timesten 3520 Jan 20 08:51 DBI54928054.0~
-rw------- 1 timesten timesten 3520 Jan 1 01:53 DBI54928054.01~
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
-rw-r--r-- 1 timesten timesten 6 Jan 20 09:14 timestend.pid
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
- The original DBI54928054.0 has been renamed to DBI54928054.0~
- Restore the original DBI file and restart the Daemon process to recover from the failure
$ mv DBI54928054.0~ DBI54928054.0
$ ttstatus
TimesTen status report as of Tue Jan 20 09:29:21 2015
Daemon pid 16028 port 53396 instance tt1122
TimesTen server pid 16044 started on port 53397
------------------------------------------------------------------------
Accessible by group timesten
End of report
$ ttdaemonadmin -restart
TimesTen Daemon stopped.
TimesTen Daemon startup OK.
$ ttstatus
TimesTen status report as of Tue Jan 20 09:29:39 2015
Daemon pid 16112 port 53396 instance tt1122
TimesTen server pid 16132 started on port 53397
------------------------------------------------------------------------
Data store /ttchk/tt1122/DataStore/TYINFO/tyinfodata
There are no connections to the data store
RAM residence policy: Manual
Data store is manually unloaded from RAM
Replication policy : Manual
Cache Agent policy : Manual
------------------------------------------------------------------------
Accessible by group timesten
End of report
$ ttadmin -ramload tyinfo
RAM Residence Policy : manual
Manually Loaded In RAM : True
Replication Agent Policy : manual
Replication Manually Started : False
Cache Agent Policy : manual
Cache Agent Manually Started : False
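When working under pressure it can help to list the set-aside DBI files before renaming anything. The sketch below only prints candidate mv commands and never runs them; suggest_dbi_restore is a hypothetical name, and deciding which '~' file is the right one to restore (in the run above, DBI54928054.0~) remains the operator's call.

```shell
#!/bin/sh
# Hypothetical helper (not part of TimesTen): list DBI files that carry a
# trailing "~" and print, without executing, the rename that would put
# each one back under its original name.
# Usage: suggest_dbi_restore <info_dir>
suggest_dbi_restore() {
    info_dir=$1
    status=1
    for f in "$info_dir"/DBI*~; do
        [ -f "$f" ] || continue
        target=${f%"~"}           # strip the trailing "~"
        if [ -e "$target" ]; then
            echo "# $target already exists; inspect it before overwriting"
        fi
        echo "mv '$f' '$target'"
        status=0
    done
    # Exit status 1 when no "~" file was found.
    return $status
}
```

After reviewing the printed commands and running the chosen one by hand, the Daemon process still has to be restarted, as in the transcript above, before the DSN reappears in ttstatus.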
2. Host restarts abnormally and the DBI file is lost or corrupted
- First, simulate the loss of the DBI file
$ ttstatus
TimesTen status report as of Wed Jan 21 08:38:44 2015
Daemon pid 3014 port 53396 instance tt1122
TimesTen server pid 3023 started on port 53397
------------------------------------------------------------------------
Data store /ttchk/tt1122/DataStore/TYINFO/tyinfodata
There are 11 connections to the data store
Shared Memory KEY 0x1f01f762 ID 32769
Type PID Context Connection Name ConnID
Subdaemon 3020 0x0000000001fb4d00 Manager 127
Subdaemon 3020 0x000000000200bba0 Rollback 126
Subdaemon 3020 0x00000000020d9fc0 Flusher 125
Subdaemon 3020 0x000000000212ee70 Monitor 124
Subdaemon 3020 0x00000000021a4510 Deadlock Detector 123
Subdaemon 3020 0x00000000021f93c0 Checkpoint 122
Subdaemon 3020 0x000000000224e270 Aging 121
Subdaemon 3020 0x00000000022a3120 Log Marker 120
Subdaemon 3020 0x00000000022f7fd0 AsyncMV 119
Subdaemon 3020 0x000000000234ce80 HistGC 118
Subdaemon 3020 0x00000000023a1d30 IndexGC 117
RAM residence policy: Manual
Data store is manually loaded into RAM
Replication policy : Manual
Cache Agent policy : Manual
------------------------------------------------------------------------
Accessible by group timesten
End of report
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw------- 1 timesten timesten 3520 Jan 21 08:38 DBI54928054.0
-rw------- 1 timesten timesten 3524 Jan 21 05:41 DBI54928054.0~
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
-rw-r--r-- 1 timesten timesten 5 Jan 21 08:37 timestend.pid
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
- As shown above, the DSN is running normally and the DBI file exists. Now delete the DBI file
$ rm -f DBI*
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
-rw-r--r-- 1 timesten timesten 5 Jan 21 08:37 timestend.pid
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
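- The DBI file is now gone. Next, verify that as soon as a session connects to the DSN, the Daemon process automatically creates a DBI file with the same name as before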
$ ttisql tyinfo
Copyright (c) 1996, 2014, Oracle and/or its affiliates. All rights reserved.
Type ? or "help" for help, type "exit" to quit ttIsql.
connect "DSN=tyinfo";
Connection successful: DSN=TYINFO;UID=timesten;DataStore=/ttchk/tt1122/DataStore/TYINFO/tyinfodata;DatabaseCharacterSet=ZHS16GBK;ConnectionCharacterSet=ZHS16GBK;LogFileSize=128;DRIVER=/TimesTen/tt1122/lib/libtten.so;SMPOptLevel=1;LogDir=/ttlog/tt1122/TYINFO;PermSize=128;TempSize=64;Connections=80;CkptFrequency=600;RecoveryThreads=4;TypeMode=0;PLSQL=0;CacheGridEnable=0;LogBufMB=64;ReceiverThreads=1;
(Default setting AutoCommit=1)
Command> exit
Disconnecting...
Done.
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw------- 1 timesten timesten 3520 Jan 21 08:42 DBI54928054.0
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
-rw-r--r-- 1 timesten timesten 5 Jan 21 08:37 timestend.pid
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
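- As shown, after a session connected to the DSN, the Daemon process generated a new DBI file.
- Next, simulate a DSN that is running normally while the DBI file is corrupted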
$ ttstatus
TimesTen status report as of Wed Jan 21 08:55:42 2015
Daemon pid 3014 port 53396 instance tt1122
TimesTen server pid 3023 started on port 53397
------------------------------------------------------------------------
Data store /ttchk/tt1122/DataStore/TYINFO/tyinfodata
There are 11 connections to the data store
Shared Memory KEY 0x1f01f762 ID 32769
Type PID Context Connection Name ConnID
Subdaemon 3020 0x0000000001fb4d00 Manager 127
Subdaemon 3020 0x000000000200bba0 Rollback 126
Subdaemon 3020 0x00000000020d9fc0 Flusher 125
Subdaemon 3020 0x000000000212ee70 Monitor 124
Subdaemon 3020 0x00000000021a4510 Deadlock Detector 123
Subdaemon 3020 0x00000000021f93c0 Checkpoint 122
Subdaemon 3020 0x000000000224e270 Aging 121
Subdaemon 3020 0x00000000022a3120 Log Marker 120
Subdaemon 3020 0x00000000022f7fd0 AsyncMV 119
Subdaemon 3020 0x000000000234ce80 HistGC 118
Subdaemon 3020 0x00000000023a1d30 IndexGC 117
RAM residence policy: Manual
Data store is manually loaded into RAM
Replication policy : Manual
Cache Agent policy : Manual
------------------------------------------------------------------------
Accessible by group timesten
End of report
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw------- 1 timesten timesten 3520 Jan 21 08:42 DBI54928054.0
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
-rw-r--r-- 1 timesten timesten 5 Jan 21 08:37 timestend.pid
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
- Here the instance is running normally and the DSN is loaded; the DBI file is 3520 bytes. Now edit the DBI file by hand to simulate corruption
$ vi DBI54928054.0
^K^@^B^@^B^@^G^@^D^
"DBI54928054.0" [converted] 1L, 3578C written
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw------- 1 timesten timesten 3550 Jan 21 08:58 DBI54928054.0
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
-rw-r--r-- 1 timesten timesten 5 Jan 21 08:37 timestend.pid
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
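- The DBI file has now been edited by hand and its size has changed to 3550 bytes. The instance and the DSN are not affected, and the Daemon process does not update the DBI file when new sessions connect or exit. But if the DSN is unloaded and the Daemon process is stopped at this point, disaster follows.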
$ ttadmin -ramunload tyinfo
RAM Residence Policy : manual
Manually Loaded In RAM : False
Replication Agent Policy : manual
Replication Manually Started : False
Cache Agent Policy : manual
Cache Agent Manually Started : False
$ ttstatus
TimesTen status report as of Wed Jan 21 09:01:53 2015
Daemon pid 3014 port 53396 instance tt1122
TimesTen server pid 3023 started on port 53397
------------------------------------------------------------------------
Data store /ttchk/tt1122/DataStore/TYINFO/tyinfodata
There are no connections to the data store
RAM residence policy: Manual
Data store is manually unloaded from RAM
Replication policy : Manual
Cache Agent policy : Manual
------------------------------------------------------------------------
Accessible by group timesten
End of report
$ ttdaemonadmin -stop
TimesTen Daemon stopped.
$ ls -l
-rw-r----- 1 timesten timesten 3251 Dec 4 09:54 cluster.oracle.ini
-rw------- 1 timesten timesten 3550 Jan 21 09:01 DBI54928054.0
-rw-r----- 1 timesten timesten 349 Dec 4 09:54 snmp.ini
-rw-rw---- 1 timesten timesten 12946 Dec 4 10:51 sys.odbc.ini
-rw-rw---- 1 timesten timesten 422 Dec 4 09:54 sys.ttconnect.ini
drwxr-x--- 2 timesten timesten 4096 Jul 8 2014 ttcacheadvisor
-rw-r----- 1 timesten timesten 316 Dec 4 10:42 ttendaemon.options
$ ttdaemonadmin -start
TimesTen Daemon startup OK.
$ ttstatus
TimesTen status report as of Wed Jan 21 09:02:19 2015
Daemon pid 3196 port 53396 instance tt1122
TimesTen server pid 3216 started on port 53397
------------------------------------------------------------------------
Accessible by group timesten
End of report
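- ttstatus no longer shows the DSN registration information, and the symptoms are the same as in the earlier case where the Checkpoint files were missing. At this point the only way to recover the DSN registration information is from a backup of the DBI file. If there really is no backup, you can try copying a DBI file from elsewhere and modifying it, but the DBI file is binary, so editing it may be troublesome.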
$ ttstatus
TimesTen status report as of Wed Jan 21 09:59:15 2015
Daemon pid 4203 port 53396 instance tt1122
TimesTen server pid 4218 started on port 53397
------------------------------------------------------------------------
Data store /ttchk/tt1122/DataStore/TYINFO/tyinfodata
There are 11 connections to the data store
Shared Memory KEY 0x0301f765 ID 360449
Type PID Context Connection Name ConnID
Subdaemon 4216 0x0000000018c52070 Manager 127
Subdaemon 4216 0x0000000018ca8dd0 Rollback 126
Subdaemon 4216 0x0000000018d569a0 Flusher 125
Subdaemon 4216 0x0000000018dab850 Monitor 124
Subdaemon 4216 0x0000000018e00700 Deadlock Detector 123
Subdaemon 4216 0x0000000018e555b0 Checkpoint 122
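Since recovery depends on having a good copy of the DBI file, it may be worth noticing when the live file has changed since the last backup. The following is a rough sketch using the POSIX cksum utility; both function names are hypothetical. A mismatch only means the file has changed: the listings above show the Daemon itself rewriting the file on some operations, so treat a mismatch as a prompt to take a fresh backup or investigate, not as proof of corruption.

```shell
#!/bin/sh
# Hypothetical helpers (not part of TimesTen): record a checksum of a DBI
# file and later check whether the file still matches it.

# Usage: dbi_checksum_record <dbi_file> <checksum_file>
dbi_checksum_record() {
    # Reading from stdin keeps the file name out of the cksum output.
    cksum < "$1" > "$2"
}

# Usage: dbi_checksum_verify <dbi_file> <checksum_file>
# Exit status: 0 unchanged, 1 changed, 2 a required file is missing.
dbi_checksum_verify() {
    [ -f "$1" ] && [ -f "$2" ] || return 2
    [ "$(cksum < "$1")" = "$(cat "$2")" ]
}
```

The checksum file would naturally sit next to the backup copy, so that the two are refreshed together.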