Service Fabric MultiNode X509 Cluster - Timed out waiting for Installer Service to complete
【Posted】: 2017-11-01 02:01:55
【Question】: To create an Azure SF test environment, I created three Azure VMs in DevTest Labs. They will be secured with X509 certificates.
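I used the information from here and here.
The machines are: Windows 2016 Datacenter; on the same virtual network; all firewalls disabled (each machine can be pinged from the others); all using the same administrator account.
I created the self-signed certificates with the certsetup.ps1 file supplied with the documentation. The server and cluster certificates are combined into one, as suggested.
If I run TestConfiguration.ps1, I get the following output.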
LocalAdminPrivilege : True
IsJsonValid : True
IsCabValid :
RequiredPortsOpen : True
RemoteRegistryAvailable : True
FirewallAvailable : True
RpcCheckPassed : True
NoConflictingInstallations : True
FabricInstallable : True
DataDrivesAvailable : True
Passed : True
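The IsCabValid field is clearly blank, but the "Passed" field still indicates that the installation can proceed. I went on to run the next PowerShell command to start the installation.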
.\CreateServiceFabricCluster.ps1 -ClusterConfigFilePath .\ClusterConfig.X509.MultiMachine.json
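After the command above, the process starts and the console window fills with the following text, which suggests that communication between the nodes is working.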
Creating Service Fabric Cluster...
If it's taking too long, please check in Task Manager details and see if Fabric.exe for each node is running. If not, please look at: 1. traces in DeploymentTraces directory and 2. traces in FabricLogRoot configured in ClusterConfig.json.
Trace folder already exists. Traces will be written to existing trace folder: C:\StandaloneCluster\DeploymentTraces
Running Best Practices Analyzer...
Best Practices Analyzer completed successfully.
Creating Service Fabric Cluster...
Processing and validating cluster config.
Configuring nodes.
Default installation directory chosen based on system drive of machine '10.0.0.4'.
Copying installer to all machines.
Configuring machine '10.0.0.4'.
Configuring machine '10.0.0.5'.
Configuring machine '10.0.0.6'.
Machine 10.0.0.6 configured.
Machine 10.0.0.5 configured.
Machine 10.0.0.4 configured.
Running Fabric service installation.
Successfully started FabricInstallerSvc on machine 10.0.0.4
Successfully started FabricInstallerSvc on machine 10.0.0.6
Successfully started FabricInstallerSvc on machine 10.0.0.5
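There is then a long pause of several minutes, after which a timeout error is shown with no real indication of the cause. I have searched the Windows logs on the nodes but could not find any further information. The error shown in the PS console is as follows: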
Timed out waiting for Installer Service to complete for machine 10.0.0.4. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
Timed out waiting for Installer Service to complete for machine 10.0.0.6. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
Timed out waiting for Installer Service to complete for machine 10.0.0.5. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
CreateCluster Error: System.AggregateException: One or more errors occurred. ---> System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.5. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, Action`1 body)
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.RunFabricServices(List`1 machines, FabricPackageType fabricPackageType)
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.<CreateClusterAsyncInternal>d__7.MoveNext()
---> (Inner Exception #0) System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.5. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )<---
---> (Inner Exception #1) System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.6. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )<---
---> (Inner Exception #2) System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.4. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )<---
Trace folder already exists. Traces will be written to existing trace folder: C:\StandaloneCluster\DeploymentTraces
Cleaning up faulted installation.
Removing configuration from machine 10.0.0.5
Removing configuration from machine 10.0.0.4
Removing configuration from machine 10.0.0.6
Can any Azure SF enthusiasts explain this problem, or offer any advice on where I am going wrong?
【Comments】:
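Have you tried uninstalling the SDK as described here: ***.com/questions/38106961/…
@Oliver The SDK was not present on the machines when I attempted the installation; otherwise TestConfiguration.ps1 would have failed.
How big are your VMs? You may need faster VMs, or to change the installer's timeout (I believe there is a switch for that).
Run the deployment with the -NoCleanupOnFailure flag and check the event logs under "Applications and Services Logs > Microsoft-Service Fabric > Admin". The error/warning logs should indicate whether there was a problem reading the certificates or some other blocking issue. Check that the certificates are ACLed to NETWORK SERVICE on every machine, as this is one of the requirements listed in the documentation.
【Answer 1】: This is a general failure mode that occurs when FabricHost fails to start, and it can happen for a number of reasons.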
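Because you are using raw Azure VMs rather than an SF VMSS deployment, you must also make sure that the ports set under NodeType in the cluster configuration are open on every machine. To test whether this is set up correctly, first try deploying an unsecured cluster on these VMs.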
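As a rough aid for the port check described above, the following Python sketch lists the endpoint ports declared for each node type in a cluster configuration, so they can be compared against the firewall and network rules on every VM. The JSON layout it relies on (a `properties.nodeTypes` array whose entries carry keys ending in `EndpointPort`), the function name `endpoint_ports`, and the sample values are assumptions made for illustration; they are not taken from the post, so adjust them to match your own ClusterConfig file.

```python
import json


def endpoint_ports(config: dict) -> dict:
    """Map each node type name to its {setting: port} endpoint entries.

    Assumes the layout properties.nodeTypes[*] with keys ending in
    'EndpointPort'. This layout is an assumption, not confirmed by the post.
    """
    result = {}
    for node_type in config.get("properties", {}).get("nodeTypes", []):
        ports = {
            key: int(value)
            for key, value in node_type.items()
            if key.endswith("EndpointPort")
        }
        result[node_type.get("name", "<unnamed>")] = ports
    return result


# Hypothetical configuration fragment, for illustration only.
SAMPLE_CONFIG = json.loads("""
{
  "properties": {
    "nodeTypes": [
      {
        "name": "NodeType0",
        "clientConnectionEndpointPort": "19000",
        "clusterConnectionEndpointPort": "19001",
        "httpGatewayEndpointPort": "19080"
      }
    ]
  }
}
""")

if __name__ == "__main__":
    # Print one line per declared endpoint port.
    for name, ports in endpoint_ports(SAMPLE_CONFIG).items():
        for setting, port in sorted(ports.items()):
            print(f"{name}: {setting} = {port}")
```

To run it against a real file, load the file with `json.load` and pass the resulting dictionary to `endpoint_ports`.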
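If the unsecured cluster deploys successfully, investigate further by running the deployment with the -NoCleanupOnFailure flag and checking the event logs under "Applications and Services Logs > Microsoft-Service Fabric > Admin" on one of the failed machines.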
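The error/warning logs should indicate whether there was a problem reading the certificates or some other blocking issue. Check that the certificates are ACLed to NETWORK SERVICE on every machine, as this is one of the requirements listed in the doc.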
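Another common failure occurs when the certificate thumbprint contains invalid characters. A bug in the Windows certificate management tool causes the displayed thumbprint to include hidden invalid characters, which cause deployment problems when the value is copied directly into the configuration. Use a hex editor (for example HxD) to verify that the thumbprint in the configuration contains only valid characters.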
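The hidden-character check can also be scripted instead of done by eye in a hex editor. The Python sketch below reports every character in a pasted thumbprint that is not a hexadecimal digit and returns a cleaned value. It assumes a SHA-1 thumbprint of 40 hex digits; the function names and the sample thumbprint are hypothetical, and the invisible left-to-right mark (U+200E) used in the sample is only an assumed example of such a character, since the answer does not say which characters are involved.

```python
import re
import unicodedata

HEX_DIGITS = "0123456789abcdefABCDEF"
# Assumption: a SHA-1 certificate thumbprint, i.e. 40 hexadecimal digits.
HEX_THUMBPRINT = re.compile(r"[0-9A-Fa-f]{40}")


def hidden_characters(thumbprint: str) -> list:
    """Return (index, code point, Unicode name) for every non-hex character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(thumbprint)
        if ch not in HEX_DIGITS
    ]


def clean_thumbprint(thumbprint: str) -> str:
    """Strip everything except hex digits, then verify the remaining length."""
    cleaned = re.sub(r"[^0-9A-Fa-f]", "", thumbprint)
    if not HEX_THUMBPRINT.fullmatch(cleaned):
        raise ValueError(f"expected 40 hex digits, got {len(cleaned)}")
    return cleaned.upper()


# Hypothetical thumbprint with an invisible character pasted in front of it.
SAMPLE = "\u200e" + "0123456789ABCDEF0123456789ABCDEF01234567"

if __name__ == "__main__":
    for index, code_point, name in hidden_characters(SAMPLE):
        print(f"position {index}: {code_point} ({name})")
    print(clean_thumbprint(SAMPLE))
```

Running the check on each thumbprint value before deployment is cheaper than waiting for the installer timeout, and the reported code points show exactly what was pasted in.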
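If this does not give you enough information to resolve the problem, run the log collector tool from Tools\Microsoft.Azure.ServiceFabric.WindowsServer.SupportPackage.zip, which is included in the Standalone package, then upload the collected logs to a storage location of your choice to share with our team. You can mail the link to sfsa@microsoft.com and we can help you investigate.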
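【Answer 2】: For the cluster/server/reverseProxy certificates, 1) permission to load their private keys needs to be ACLed to "NETWORK SERVICE", and 2) their CA certificates need to be added to TrustedRoot.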