SQL Server Always On FCI: Cluster Nodes Claiming Resources at the Same Time, and Repairing Suspect Databases

Posted 2020-09-08 薛定谔的DBA


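This is a two-node FCI cluster. The network between the cluster nodes was interrupted overnight, so each node concluded that the other one was down, and the cluster manager on each node showed the other node as offline.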

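Connecting to the cluster IP produced a message that the msdb database had a problem:

The msdb database turned out to be in the "Suspect" state.

With msdb damaged, the SQL Server error log and the Agent log could not be queried either. The Windows event log showed the following: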

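SQL Server Assertion: File: <xdes.cpp>, line=3785 Failed Assertion = 'curr->GetXdesId () == m_xdesId'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

During undoing of a logged operation in database 'msdb', an error occurred at log record ID (106502:3622:2). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

An error occurred while processing the log for database 'msdb'. If possible, restore from backup. If a backup is not available, it might be necessary to rebuild the log.

Database msdb was shutdown due to error 3314 in routine 'XdesRMReadWrite::RollbackToLsn'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.

During undoing of a logged operation in database 'msdb', an error occurred at log record ID (106502:3614:1). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

The log scan number (106502:3144:155) passed to log scan in database 'msdb' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

An error occurred during recovery, preventing the database 'msdb' (4:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.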

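These messages pointed to a damaged msdb transaction log.

SQL Server also runs a default trace that automatically records major system errors, connection errors, and other important events. It can be queried as follows: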

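-- Locate the default trace file, then read it (DEFAULT also reads the rollover files).
DECLARE @path NVARCHAR(1000);
SELECT @path = path FROM sys.traces WHERE is_default = 1;
SELECT * FROM sys.fn_trace_gettable(@path, DEFAULT);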

2017-04-07 01:21:36.05 spid95      Error: 3624, Severity: 20, State: 1.
2017-04-07 01:21:36.05 spid95      A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a QFE from Technical Support. 
2017-04-07 01:21:36.07 spid95      Error: 3314, Severity: 21, State: 3.
2017-04-07 01:21:36.07 spid95      During undoing of a logged operation in database 'msdb', an error occurred at log record ID (106502:3622:2). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
2017-04-07 01:21:37.59 spid95      Error: 9004, Severity: 23, State: 6.
2017-04-07 01:21:37.59 spid95      An error occurred while processing the log for database 'msdb'.  If possible, restore from backup. If a backup is not available, it might be necessary to rebuild the log.
2017-04-07 01:21:39.11 spid95      Error: 9004, Severity: 23, State: 6.
2017-04-07 01:21:39.11 spid95      An error occurred while processing the log for database 'msdb'.  If possible, restore from backup. If a backup is not available, it might be necessary to rebuild the log.
2017-04-07 01:21:40.64 spid95      Error: 9004, Severity: 23, State: 6.
2017-04-07 01:21:40.64 spid95      An error occurred while processing the log for database 'msdb'.  If possible, restore from backup. If a backup is not available, it might be necessary to rebuild the log.
2017-04-07 01:21:40.66 spid95      Error: 3314, Severity: 21, State: 5.
2017-04-07 01:21:40.66 spid95      During undoing of a logged operation in database 'msdb', an error occurred at log record ID (106502:3614:1). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
2017-04-07 01:21:41.43 spid22s     Error: 9003, Severity: 20, State: 6.
2017-04-07 01:21:41.43 spid22s     The log scan number (106502:3144:155) passed to log scan in database 'msdb' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.
2017-04-07 01:21:41.43 spid22s     Error: 3414, Severity: 21, State: 1.
2017-04-07 01:21:41.43 spid22s     An error occurred during recovery, preventing the database 'msdb' (4:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

The main error numbers are 3624, 3314, 9004, and 3414.
Their general causes are covered in:

Troubleshooting Error 3313, 3314, 3414, or 3456 (SQL Server)

How to troubleshoot Error 9004 in SQL Server

In this case the problem was related to the msdb log backup (the msdb data file turned out to be more than 60 GB!).
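To see where the space in msdb goes, something like the following could be run once msdb is accessible again. This is a minimal sketch, not what the author ran: it only uses standard catalog views, and the top-10 cutoff is an arbitrary choice.

```sql
-- File sizes of msdb (size is stored in 8 KB pages).
SELECT name, type_desc, physical_name,
       CAST(size AS bigint) * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID(N'msdb');

-- Largest msdb tables by reserved space (top 10 is an arbitrary cutoff).
USE msdb;
SELECT TOP (10)
       OBJECT_SCHEMA_NAME(object_id) AS schema_name,
       OBJECT_NAME(object_id)        AS table_name,
       SUM(reserved_page_count) * 8 / 1024 AS reserved_mb,
       SUM(CASE WHEN index_id IN (0, 1) THEN row_count ELSE 0 END) AS row_count
FROM sys.dm_db_partition_stats
GROUP BY object_id
ORDER BY reserved_mb DESC;
```

If the backup history tables dominate the result, that would be consistent with a backup-related cause, but the source does not say which tables were large.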

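The cluster was now in trouble: both nodes were holding resources, and the heartbeat was not doing its job. Checking the nodes from inside the database engine, I could connect normally to one of them.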

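-- List the nodes of the failover cluster instance.
SELECT * FROM sys.dm_os_cluster_nodes;
-- Legacy equivalent, deprecated in favor of sys.dm_os_cluster_nodes.
SELECT * FROM sys.fn_virtualservernodes();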

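More importantly, the storage is shared, so the system databases and the user databases are all on it!

Since both nodes use the same shared storage, the data should in principle be consistent. To bring the cluster back to a normal state, I decided to fail the cluster over. Rebooting the server seemed the better option: instead of moving resources one by one, all of them would fail over automatically.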

After the reboot!!

msdb was back to normal!!

But 3 user databases were now "Suspect"!!
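A quick way to find databases in this condition is to query `sys.databases`. The sketch below is illustrative only; including `RECOVERY_PENDING` and `EMERGENCY` alongside `SUSPECT` is my own choice, not something the source specifies.

```sql
-- Databases that are not healthy, with their access mode and recovery model.
SELECT name, state_desc, user_access_desc, recovery_model_desc
FROM sys.databases
WHERE state_desc IN (N'SUSPECT', N'RECOVERY_PENDING', N'EMERGENCY')
ORDER BY name;
```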

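There was no way around it: they had to be repaired!! I tried to set "EMERGENCY" mode and then "SINGLE_USER" for the repair, but both statements ended in deadlocks and could not be executed!

Once single-user mode was set, it was hard to get back to multi-user mode. Some process seemed to keep coming in and taking the connection!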
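To see which session keeps grabbing the single connection, the database-level locks can be inspected. This is a hedged sketch: `dbname` is the same placeholder used in the appendix, and the query assumes you still have some connection (for example in `master`) from which to run it.

```sql
-- Sessions currently holding or waiting for a lock on the database itself.
SELECT l.request_session_id,
       s.login_name,
       s.host_name,
       s.program_name,
       l.request_mode,
       l.request_status
FROM sys.dm_tran_locks AS l
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = l.request_session_id
WHERE l.resource_type = N'DATABASE'
  AND l.resource_database_id = DB_ID(N'dbname');
```

The `program_name` and `host_name` columns usually show whether the competing connection is an application, a monitoring tool, or SQL Server Agent.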

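So I simply shut down the other cluster node (a virtual machine)! After reconnecting, the error was the same.

Next I bypassed the cluster name and ran the statements directly on the node server. Still the same!!

Then I started the service and connected through the dedicated administrator connection (DAC) to see who else could get in! Once connected, I kept being told that another user was already using the database, and setting the database back to multi-user mode did not work either!

Fine! I changed the port and restarted the SQL Server service to see who would still connect! This time nobody was competing! Statements ran without deadlocks and nobody else was active, so the repair could finally start!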

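The two smaller databases (60 GB and 1 GB) were repaired without problems. During the DBCC CHECKDB repair of a 170 GB database, however, tempdb ran out of space! The repair got partway through and then stopped.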
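Before running the check on a large database, DBCC CHECKDB can report how much tempdb space it expects to need. A minimal sketch, assuming the placeholder `dbname` from the appendix; the estimate is approximate and no actual check is performed by this statement.

```sql
-- Report the estimated tempdb space required, without running the check.
DBCC CHECKDB (dbname) WITH ESTIMATEONLY;

-- Current size of the tempdb files, for comparison (size is in 8 KB pages).
SELECT name, type_desc,
       CAST(size AS bigint) * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID(N'tempdb');
```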

Soon after, because several database files were on the same disk, the disk ran out of space, and the SQL Server service stopped by itself!
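Free space on the volumes that hold database files can be checked from inside SQL Server. The sketch below assumes a version that provides `sys.dm_os_volume_stats` (SQL Server 2008 R2 SP1 or later); the source does not state which version was in use.

```sql
-- Total and free space for every volume that hosts a database file.
SELECT DISTINCT
       vs.volume_mount_point,
       vs.total_bytes     / 1048576 AS total_mb,
       vs.available_bytes / 1048576 AS free_mb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
ORDER BY vs.volume_mount_point;
```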

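All right! I deleted some database dump/log files, started the service, detached some unimportant databases (probably no longer in use), and moved their files off the shared disk.

What! Insufficient permissions: even the local administrator could not move the files! I had to grant the local administrator permissions on the data files one by one!

There was now enough space for the business workload, but one database was still not repaired!! I moved tempdb to a local disk and only then ran the DBCC CHECKDB repair!

After the repair finished, I moved tempdb back to the shared storage and set it to a smaller size!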
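Relocating tempdb as described could look roughly like this. It is a sketch under assumptions: `tempdev` and `templog` are the default logical file names, the `D:\TempDB` and `S:\TempDB` paths and the sizes are hypothetical, and the target folder must already exist and be writable by the SQL Server service account. The new location only takes effect after the service is restarted.

```sql
-- Step 1: point tempdb at a local disk (hypothetical path), then restart the service.
ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = N'D:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = N'D:\TempDB\templog.ldf');
GO

-- Step 2 (after the repair): move it back to shared storage with a smaller
-- initial size (hypothetical path and sizes), then restart the service again.
ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = N'S:\TempDB\tempdb.mdf', SIZE = 1024MB);
ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = N'S:\TempDB\templog.ldf', SIZE = 256MB);
GO
```

On a failover cluster instance, a local tempdb path has to exist on every node that can own the instance, which is one reason to move it back afterwards.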

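But a lot of data had been lost!! The only option left was to restore from backup!!

The backup files are kept on dedicated storage, and because of the network changes made the night before, they could not be copied over!!

And the damaged database could not be backed up before the restore!!

The database uses the simple recovery model, so the log could not be backed up either!!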
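It helps to know the recovery model and the most recent full backup of each database before an incident rather than during one. A minimal sketch using `sys.databases` and the backup history in `msdb.dbo.backupset`; it assumes msdb is readable and that the history has not been purged.

```sql
-- Recovery model and latest full backup per database.
-- type = 'D' marks full database backups in msdb.dbo.backupset.
SELECT d.name,
       d.recovery_model_desc,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'D'
GROUP BY d.name, d.recovery_model_desc
ORDER BY d.name;
```

A database in the `SIMPLE` recovery model cannot take log backups, so everything since the last full or differential backup is lost on restore.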

Waiting for the backup to be restored...

The cluster had still not recovered...

In the end, the database was restored from backup!
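The final restore might look like the following. This is only a sketch: the source does not show the actual command, the UNC path and file name are hypothetical, and `dbname` is the appendix placeholder.

```sql
-- Check that the backup file is readable and complete before overwriting anything.
RESTORE VERIFYONLY
FROM DISK = N'\\backupserver\sql\dbname_full.bak';
GO

-- Overwrite the damaged database with the last full backup.
RESTORE DATABASE dbname
FROM DISK = N'\\backupserver\sql\dbname_full.bak'
WITH REPLACE, RECOVERY, STATS = 10;
GO
```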

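So those stories you read online about someone's database being down for half a day and still not recovered: this time it happened to me!

It could have been worse: only one database was affected, and it is used by internal staff! But it is still a fairly important one!

Appendix: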

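-- Put the suspect database into EMERGENCY mode so it can be accessed for repair.
ALTER DATABASE dbname SET EMERGENCY;
GO
-- Take exclusive access, rolling back open transactions from other sessions.
ALTER DATABASE dbname SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO
-- Last-resort repair: it may deallocate damaged pages, so data can be lost.
DBCC CHECKDB (dbname, REPAIR_ALLOW_DATA_LOSS);
GO
-- After the repair completes, reopen the database to all users.
ALTER DATABASE dbname SET MULTI_USER;
GO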
