We have been seeing what appear to be reboots on our SQL Managed Instances (Standard, Failover Group). One node (the current primary, USEast) has apparently been rebooting for the last 24 hours, with anywhere from 30 minutes to 2 hours between cycles. This has been slowing our system down. Thanks to retry logic, users only occasionally see a connection error, but our various business processes are suffering more.
The only place I have found to detect these is the SQL Server Logs (SSMS \ Management \ SQL Server Logs), by reading through the entries there. On one of the restarts, the logs appeared to clear, and now I only have the two most recent entries listed.
I don't see anything relevant in sys.dm_operation_status. The @@VERSION output has changed since early September, when I last checked, and currently shows:
Microsoft SQL Azure (RTM) - 12.0.2000.8 Sep 18 2021 19:01:34 Copyright (C) 2019 Microsoft Corporation
The last time this happened to us to this extent (we are used to one or two reboots periodically) was a few years ago, when it was rebooting every 5-10 minutes. In that event, we initiated a failover to our other side, and it took several hours to complete. At that time I didn't have the query to see FOG health (hardened LSN, etc.), but I do now, and it's showing green across the board.
We are currently planning a failover in a few hours (as soon as the business can tolerate it), and have our 3rd party interface between MSFT and us (yay...) opening an urgent ticket so we can get better insight.
What I really wanted to ask is whether there are more places to look for clues about what's happening. Any insight into what's going on and why it's rebooting so frequently?
UPDATE - We failed over to our secondary and it did so without issue. However, as soon as we failed over, the continuous "RECOVERY" messages started appearing. The "RECOVERY" messages were appearing in the logs on the other side as well prior to failing over. DBCC CHECKALLOC/CHECKCATALOG on all databases showed no issues before failover.
MSFT has a ticket open on this issue and if I hear back, I will update here. Still welcome any insights from this community though. =)
UPDATE - The recovery messages still appear, although we have not seen any sign of the restarts that were hurting us earlier. The recovery messages are in this form:
RECOVERY (873c7092-f355-4b10-806b-7f2ec74855c7, 10): XactRM::PrepareLocalXact, Preparing particpant xact, XdesId = -839198875 3
The message repeats for each of our databases in turn.
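In case it helps anyone tally these, here is a minimal Python sketch that parses lines of the shape quoted above from an exported log and counts them per GUID. It is only a sketch: I am assuming the GUID identifies a database (which would fit the messages cycling through all of them), I don't know what the second value in the parentheses means (called `tag` here), and the names `parse_recovery` and `count_by_guid` are made up for illustration.

```python
import re
from collections import Counter

# Matches the message shape quoted above. Field names are my own labels;
# "tag" is the second value in the parentheses, whose meaning is unknown.
RECOVERY_RE = re.compile(
    r"RECOVERY \((?P<guid>[0-9a-fA-F-]{36}), (?P<tag>\d+)\): "
    r"(?P<source>[\w:]+), (?P<text>.*?), XdesId = (?P<xdes>-?\d+)"
)


def parse_recovery(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = RECOVERY_RE.search(line)
    return match.groupdict() if match else None


def count_by_guid(lines):
    """Count RECOVERY messages per GUID (assumed to be one GUID per database)."""
    counts = Counter()
    for line in lines:
        record = parse_recovery(line)
        if record is not None:
            counts[record["guid"]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "RECOVERY (873c7092-f355-4b10-806b-7f2ec74855c7, 10): "
        "XactRM::PrepareLocalXact, Preparing particpant xact, "
        "XdesId = -839198875 3"
    )
    print(parse_recovery(sample))
    print(count_by_guid([sample, "some unrelated log line", sample]))
```

Feeding it the exported log lines should show whether every database is represented and whether any one of them dominates.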