Disaster Recovery – Always On Availability Group Failover Problem

I'm testing a PowerShell disaster recovery script and trying to fail over to an off-site asynchronous AG DR replica with this command:

ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;

On many clusters it works well, but on a few others it usually (though not always) fails with the following error:

"Failed to move a Windows Server Failover Clustering (WSFC) group to the local node (Error code 5023). The WSFC service may not be running or may not be accessible in its current state, or the specified cluster group or node handle is invalid. For information about this error code, see "System Error Codes" in the Windows Development documentation. Failed to designate the local availability replica of availability group 'MyAG' as the primary replica. The operation encountered SQL Server error 41018 and has been terminated. Check the preceding error and the SQL Server error log for more details about the error and corrective actions."

We have to force quorum because we are simulating a situation in which the DR asynchronous replica cannot communicate with the other replicas: https://docs.microsoft.com/en-us/sql/sql-server/failover-clusters/windows/force-a-wsfc-cluster-to-start-without-quorum?view=sql-server-2017
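
For context, the quorum-forcing step follows the general pattern from that article. This is only a minimal sketch, assuming the FailoverClusters module is installed; the node name DRNODE01 is a hypothetical placeholder, not one of our real servers:

```powershell
Import-Module FailoverClusters

# Hypothetical name of the DR node that hosts the asynchronous replica.
$node = 'DRNODE01'

# Stop the cluster service on the DR node, then restart it with forced quorum.
Stop-ClusterNode -Name $node
Start-ClusterNode -Name $node -FixQuorum

# Make sure the node carries a quorum vote after the forced start.
(Get-ClusterNode -Name $node).NodeWeight = 1

# Show which nodes are up and how the votes are assigned.
Get-ClusterNode | Format-Table -Property Name, State, NodeWeight -AutoSize
```

The forced failover statement is only issued after the cluster service on the DR node is back up in forced-quorum mode.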

Comparing the clusters where the failover works against those where it does not, using Get-Cluster and Get-ClusterGroup, reveals no major differences.
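
The comparison can be done along these lines. This is a rough sketch rather than my exact script; the cluster names CLUSTER-OK and CLUSTER-FAIL and the helper Get-PropertyTable are hypothetical and exist only for illustration:

```powershell
Import-Module FailoverClusters

# Hypothetical cluster names: one where the failover works, one where it fails.
$good = 'CLUSTER-OK'
$bad  = 'CLUSTER-FAIL'

# Flatten an object's properties into Name/Value pairs so they can be diffed.
function Get-PropertyTable {
    param($InputObject)
    $InputObject.PSObject.Properties | ForEach-Object {
        [pscustomobject]@{ Name = $_.Name; Value = "$($_.Value)" }
    }
}

$goodProps = Get-PropertyTable (Get-Cluster -Name $good)
$badProps  = Get-PropertyTable (Get-Cluster -Name $bad)

# List only the cluster properties whose values differ between the two clusters.
Compare-Object $goodProps $badProps -Property Name, Value | Sort-Object Name

# Show group state and ownership side by side for a manual check.
foreach ($cluster in $good, $bad) {
    Get-ClusterGroup -Cluster $cluster |
        Select-Object @{ n = 'Cluster'; e = { $cluster } }, Name, State, OwnerNode
}
```

Identity properties such as the cluster name and ID will always show up as differences, so those rows can be ignored.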

This failover has worked on several clusters, but it fails 99% of the time on two of them. It has succeeded on both of those clusters, but only rarely. One common cause I found while searching is missing permissions for NT AUTHORITY\SYSTEM (see https://dataginger.com/2014/10/28/sql-server-failed-to-bring-availability-group-availability-group-name-online/), which I have verified.

I ran a trace and see the same errors:

Failed to move a Windows Server Failover Clustering (WSFC) group to
the local node (Error code 5023). The WSFC service may not be running
or may not be accessible in its current state, or the specified
cluster group or node handle is invalid. For information about this
error code, see "System Error Codes" in the Windows Development
documentation.

Failed to designate the local availability replica of availability
group 'MyAG' as the primary replica. The operation encountered SQL
Server error 41018 and has been terminated. Check the preceding error
and the SQL Server error log for more details about the error and
corrective actions.

If nothing turns up, I may have to open a ticket with Microsoft support.

Thoughts?

Thank you for your help!