How to Restore a Failed Azure Service Fabric Cluster

Things happen, especially with certificate expiration dates :). It recently happened to me, and I got the task of restoring the failed cluster. This article is dedicated to the recovery of a Windows-based Azure Service Fabric (ASF) cluster with stateless workloads.

The fastest possible way of recovery is to create a new cluster and deploy stateless services there ☺, but…


TL;DR: You will get a step-by-step action plan to restore a failed stateless Windows-based Azure Service Fabric cluster with a new primary certificate. After cluster recovery you can proceed with Key Vault integration and secondary certificate configuration.

It turns out that ASF can survive with expired management certificates for a while. But because these certificates (one primary and one secondary) secure cluster communications (the upgrade service and management access), the next ASF upgrade rollover will fail spectacularly.

While the Microsoft docs are good, some points are clearly missing, so I decided to fill the gap. There are two official ways to solve this issue, automated and semi-manual, and neither worked for me.

The first lesson is to create the ASF cluster with a certificate declared by common name. And don’t use the Azure CLI command that creates the cluster with an auto-generated certificate in Azure Key Vault.

The second lesson is to automate certificate expiration checks with a PowerShell script. You should also set the configuration parameter AcceptExpiredPinnedClusterCertificate to true, so the ASF cluster keeps working even with an expired certificate.

Warning! Don’t do this for stateful services.
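
The expiration check from the second lesson could look roughly like the sketch below. It is a minimal illustration, not a production script: the function name Get-CertificateExpiryStatus, the 30-day warning threshold and the LocalMachine\My store in the usage comment are all assumptions.

```powershell
# Hypothetical helper: classify a certificate by how close it is to expiry.
function Get-CertificateExpiryStatus {
    param(
        [Parameter(Mandatory)] [datetime] $NotAfter,   # certificate expiration date
        [datetime] $Now = (Get-Date),                  # injectable for testing
        [int] $WarningDays = 30                        # assumed threshold
    )

    # Whole days left; negative means the certificate has already expired.
    $daysLeft = [int][math]::Floor(($NotAfter - $Now).TotalDays)

    if ($daysLeft -lt 0)                { $state = 'Expired' }
    elseif ($daysLeft -le $WarningDays) { $state = 'ExpiringSoon' }
    else                                { $state = 'Ok' }

    [pscustomobject]@{ DaysLeft = $daysLeft; State = $state }
}

# Usage on a Windows machine (the Cert: drive is Windows-only):
# Get-ChildItem Cert:\LocalMachine\My | ForEach-Object {
#     $s = Get-CertificateExpiryStatus -NotAfter $_.NotAfter
#     [pscustomobject]@{ Thumbprint = $_.Thumbprint; Subject = $_.Subject
#                        State = $s.State; DaysLeft = $s.DaysLeft }
# }
```

Scheduling something like this and alerting on any state other than Ok is one way to catch the problem before the next cluster upgrade does.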


Finding the root cause.

But first, let’s dig into diagnosing the symptoms. The first clue that there is an issue is the following Azure Portal view: a failed cluster that shows zero nodes and zero applications.

Failed Service Fabric cluster in Azure Portal.

The following steps are needed to detect the root cause.

  • Open the cluster resource group and navigate to the virtual machine scale set with the seed nodes. Usually, the first node is named nt1vm_0.
  • Open nt1vm_0 and download the RDP connection file via the Connect button. The sample path to a node is your-cluster\nt1vm | Instances\nt1vm_0.
Click this button in the VM portal blade.
  • Connect to the machine with the cluster VM admin login and password. In case of connection issues, double-check that the ASF load balancer has the required ports open.
  • If the VM is extremely laggy and has high CPU usage, open Task Manager and set the priority of the process that consumes the most CPU to Idle.
  • Open the server event log, navigate to the Administrative Events section and search for ASF errors.
Windows server event log with Administrative events
  • Find the Service Fabric status and log files, which are located at the following paths.
"C:\WindowsAzure\Logs\AggregateStatus\aggregatestatus.json"
"C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\[extension version]\CommandExecution.log"
  • Service Fabric might be stuck in a death loop, trying to re-install the cluster on this VM. The main symptom is high CPU consumption by the Service Fabric installer and Bootstrap agent system services.
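
The manual checks above can also be gathered in one pass from a PowerShell prompt on the node. The sketch below is only an illustration: the function name Get-FabricNodeSnapshot is made up, and the wildcard service filter is an assumption chosen to avoid hard-coding exact service names. It makes no assumptions about the structure of aggregatestatus.json and simply parses whatever the file contains.

```powershell
# Hypothetical helper: collect service state, CPU-heavy processes and the
# aggregate status file in one object. Intended to run on the Windows node.
function Get-FabricNodeSnapshot {
    param(
        [string] $StatusFile = 'C:\WindowsAzure\Logs\AggregateStatus\aggregatestatus.json',
        [int] $TopProcesses = 5
    )

    # Assumption: the Service Fabric services match these wildcards.
    $services = Get-Service -Name 'Fabric*', 'ServiceFabric*' -ErrorAction SilentlyContinue |
        Select-Object -Property Name, DisplayName, Status

    # Processes with the most accumulated CPU time; a death loop usually shows up here.
    $hot = Get-Process |
        Sort-Object -Property CPU -Descending |
        Select-Object -Property Name, Id, CPU -First $TopProcesses

    # Parse the status file only if it exists.
    $status = if (Test-Path -LiteralPath $StatusFile) {
        Get-Content -LiteralPath $StatusFile -Raw | ConvertFrom-Json
    } else {
        $null
    }

    [pscustomobject]@{
        Services        = $services
        HotProcesses    = $hot
        AggregateStatus = $status
    }
}

# Usage on the node:
# $snapshot = Get-FabricNodeSnapshot
# $snapshot.Services | Format-Table
# $snapshot.HotProcesses | Format-Table
```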

The short plan.

I will start with a short plan without screenshots and continue with a detailed one in the next section. I will highlight in bold those sections that are missing from the official Microsoft GitHub documentation.

My suggestion is to fix seed node zero first, test whether it works, and then proceed to the rest.

  • Fix the “death loop” by opening Task Manager, sorting processes by CPU usage, changing the top process priority to Idle and then stopping it immediately.
  • The step above from the official documentation didn’t work for me, so I had to re-allocate the seed nodes, one at a time. The process takes about 15 minutes, plus another 20 minutes for the ASF Windows Server setup.
  • Stop the following services on the target machine: FabricInstallerService, FabricHostService, ServiceFabricNodeBootstrapAgent.
  • Install the new cluster certificate to the virtual machine root.
  • Change the certificate thumbprint in the file D:\SvcFab\_sys_0\Fabric\ClusterManifest.current.xml
  • Copy this file to a new folder, C:\ClusterFix; the command below reads it from there as clusterManifest.xml.
  • Replace the certificate thumbprints in the file D:\SvcFab\_sys_0\Fabric\Fabric.Data\InfrastructureManifest.xml
  • Run the following PowerShell command in a PowerShell command prompt (not in PowerShell ISE).
New-ServiceFabricNodeConfiguration -FabricDataRoot "D:\SvcFab" -FabricLogRoot "D:\SvcFab\Log" -ClusterManifestPath "C:\ClusterFix\clusterManifest.xml" -InfrastructureManifestPath "D:\SvcFab\_sys_0\Fabric\Fabric.Data\InfrastructureManifest.xml"
  • Open the file D:\SvcFab\_sys_0\Fabric\Fabric.Package.current.xml and look for the ManifestVersion attribute; it points to the folder with the active cluster configuration.
  • In my case it was the following folder: D:\SvcFab\_sys_0\Fabric\Fabric.Config.4.131473098266979018
  • Remove the read-only attribute from the folder.
  • Open Settings.xml inside this folder and replace all certificate thumbprint entries with the new ones.
  • Start Service Fabric Host and Agent from Windows Services.
  • Open Azure Resource Explorer, find the cluster, click Edit and then update the resource to reset the ASF cluster status in Azure Portal.
  • Connect to the ASF management portal at the cluster URL, using the correct certificate.

https://your-cluster.northeurope.cloudapp.azure.com:19080/Explorer
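
The thumbprint replacement steps in the plan above touch several files by hand, which makes it easy to miss an entry. A minimal sketch of how the edit could be scripted follows. The function name Update-Thumbprint and its parameters are hypothetical, and the sketch assumes the files are UTF-8 text; it keeps a backup copy and clears the read-only attribute before writing.

```powershell
# Hypothetical helper: back up a file, clear its read-only flag and replace
# every occurrence of the old thumbprint (case-insensitive) with the new one.
# Returns the number of replaced occurrences.
function Update-Thumbprint {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $OldThumbprint,
        [Parameter(Mandatory)] [string] $NewThumbprint,
        [Parameter(Mandatory)] [string] $BackupFolder
    )

    # Keep an untouched copy before editing anything.
    New-Item -ItemType Directory -Path $BackupFolder -Force | Out-Null
    Copy-Item -LiteralPath $Path -Destination $BackupFolder -Force

    # Clear the read-only flag so the file can be overwritten.
    Set-ItemProperty -LiteralPath $Path -Name IsReadOnly -Value $false

    # Assumption: the file is UTF-8 encoded.
    $text    = Get-Content -LiteralPath $Path -Raw -Encoding UTF8
    $pattern = [regex]::Escape($OldThumbprint)
    $count   = [regex]::Matches($text, $pattern, 'IgnoreCase').Count
    $updated = [regex]::Replace($text, $pattern, $NewThumbprint, 'IgnoreCase')

    Set-Content -LiteralPath $Path -Value $updated -Encoding UTF8 -NoNewline
    $count
}

# Usage with a path from the plan (thumbprints are placeholders):
# Update-Thumbprint -Path 'D:\SvcFab\_sys_0\Fabric\ClusterManifest.current.xml' `
#     -OldThumbprint '<old thumbprint>' -NewThumbprint '<new thumbprint>' `
#     -BackupFolder 'C:\ClusterFix\Backup'
```

A returned count of zero is a useful warning sign: either the file was already updated or the old thumbprint was typed incorrectly.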


Detailed action plan.

Now let’s dive deep into the details.

  • Fix the “death loop” by opening Task Manager, sorting processes by CPU usage, changing the top process priority to Idle and then stopping it completely. To kill the process, get its PID and kill it with the following CMD commands (replace 5364 with the PID that sc queryex reports).
sc queryex FabricHostSvc
taskkill /pid 5364 /f
  • Re-allocate the seed nodes, one at a time. The process takes about 15 minutes, plus up to 20 minutes for the ASF setup.
Re-image button
If you don’t see the Service Fabric host running and see the following trace error, leave the VM alone for 20 minutes.
  • Connect to the target VM via RDP and check if the following services are running. Then stop them all.
    FabricInstallerService, FabricHostService, ServiceFabricNodeBootstrapAgent.
Stop installer. Then Fabric host.
Stop Node bootstrap Upgrade and then Bootstrap agent.
  • Install the new cluster certificate to the virtual machine root.
  • Change the certificate thumbprint in the file D:\SvcFab\_sys_0\Fabric\ClusterManifest.current.xml
  • Copy this file to the new folder C:\ClusterFix; the command below reads it from there as clusterManifest.xml.
  • Replace the certificate thumbprint entries with the new ones in D:\SvcFab\_sys_0\Fabric\Fabric.Data\InfrastructureManifest.xml
  • Run the following PowerShell command in a PowerShell command prompt (not in PowerShell ISE). The command can fail because of an old PowerShell version or a read-only attribute on the destination infrastructure manifest file.
New-ServiceFabricNodeConfiguration -FabricDataRoot "D:\SvcFab" -FabricLogRoot "D:\SvcFab\Log" -ClusterManifestPath "C:\ClusterFix\clusterManifest.xml" -InfrastructureManifestPath "D:\SvcFab\_sys_0\Fabric\Fabric.Data\InfrastructureManifest.xml"
Navigate to the destination folder and remove the Read-only attribute.

Install the latest PowerShell version with this script:

iex "& { $(irm https://aka.ms/install-powershell.ps1) } -UseMSI"
$PSVersionTable.PSVersion
  • Open the file D:\SvcFab\_sys_0\Fabric\Fabric.Package.current.xml and look for the ManifestVersion attribute; it points to the folder with the active cluster configuration. Be careful to choose the correct folder to edit.
  • In my case it was the following folder: D:\SvcFab\_sys_0\Fabric\Fabric.Config.4.131473098266979018
  • Open Settings.xml inside this folder and replace all certificate thumbprint entries with the new ones.
  • Start Service Fabric Host and Agent from Windows Services:
net start FabricHostSvc 
net start ServiceFabricNodeBootstrapAgent
  • Test the connection after 10 minutes with the following PowerShell script. Make sure the certificate is installed on your machine.
$ClusterName = "your-cluster.northeurope.cloudapp.azure.com:19000"
$CertThumbprint = "AAAAAAAAAAAAAAAAAAAAAAA"
Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 `
    -X509Credential `
    -ServerCertThumbprint $CertThumbprint `
    -FindType FindByThumbprint `
    -FindValue $CertThumbprint `
    -StoreLocation CurrentUser `
    -StoreName My
  • Open Azure Resource Explorer, find the cluster, click Edit and then update the resource to reset the ASF cluster status in Azure Portal.
  • Connect to the ASF management portal at the cluster URL, using the correct certificate.

https://your-cluster.northeurope.cloudapp.azure.com:19080/Explorer
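
Because the thumbprint lives in several files, it can help to confirm that no XML file under the Fabric data root still carries the old value before starting the services again. The sketch below is one way to do that; the function name Find-ThumbprintReference is hypothetical, and limiting the scan to *.xml files is an assumption based on the files named in the steps above.

```powershell
# Hypothetical helper: list XML files under a folder that still contain the
# given thumbprint, with the number of matching lines per file.
function Find-ThumbprintReference {
    param(
        [Parameter(Mandatory)] [string] $Root,
        [Parameter(Mandatory)] [string] $Thumbprint
    )

    # Select-String is case-insensitive by default; -SimpleMatch treats the
    # thumbprint as plain text rather than a regular expression.
    Get-ChildItem -Path $Root -Recurse -File -Filter '*.xml' |
        Select-String -Pattern $Thumbprint -SimpleMatch |
        Group-Object -Property Path |
        Select-Object @{ Name = 'File'; Expression = { $_.Name } },
                      @{ Name = 'MatchingLines'; Expression = { $_.Count } }
}

# Usage on the node (the thumbprint is a placeholder):
# Find-ThumbprintReference -Root 'D:\SvcFab\_sys_0\Fabric' -Thumbprint '<old thumbprint>'
```

An empty result suggests that every XML file under that folder has been updated. Any file that is listed deserves a second look before the node is restarted.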


The summary.

So, the easiest possible solution is to create a copy of the cluster and forget about recovery, and the next most optimistic scenario is automated recovery with a PowerShell script. And if you have an extremely unlucky case like mine, then this article has reached its goal.

I know it’s a boring tutorial, but it can save your day :).

That’s it, thanks for reading. Cheers!


About the Author:

Stas Lebedenko is a seasoned developer and Microsoft Azure MVP from Ukraine, driving solution architecture changes and writing code for @SigmaSoftware. He is a speaker and contributor at local .NET and Azure community user groups and conferences.

Reference:

Lebedenko, S. (2020). How to restore a failed Azure Service Fabric cluster. Available at: https://itnext.io/azure-service-fabric-cluster-recovery-33886f1bf44f [Accessed: 17th May 2020].
