SQL Server 2016 SP2 Standard Edition
4 sockets with 6 cores per socket totaling 24 cores
128 GB RAM
Windows Server 2012 R2 Data Center
So, we had an incident about a month or so ago that required us to restore the VM and SQL on top of it. We use a Cloud Provider that does host some major companies across the globe, so it’s not a Mom and Pop shop.
So, the cloud provider had to perform what they called a BMR (Bare Metal Restore) of the VM from the last image taken, which was only a couple days old. This restored the OS and SQL Server Install, etc… Then, they recreated the Disk Mounts that were attached to the VM and restored the files, folders and permissions with a file level restore—not database files.
Once they were finished with this, we used our latest SQL Server backups and restored all the system databases first, then all the user databases. We confirmed all SQL Server settings were the same as before the Incident such MAXDOP, CTFP, Max Memory, Users and Server level roles/permissions, sql agent jobs, etc… and all were identical. When the VM was first brought up, I noticed they had 12 sockets with 2 cores each, so I had them change to 4 sockets with 6 cores each, as the Standard licensing got me many years ago and its the first thing I check when taking a VM.
We immediately noticed a lot of latency with the queries and SQL Agent jobs that run on this box, which was across the board and not just one query or job…all were running twice as long. WOrking with the cloud provider asking them to confirm the VM config was identical from old to new VM, they found a few things and fixed them.
We’re now at the point where everything is running maybe 60 to 70% as fast as it used to across the board and are at a stale mate it seems. Simple queries are trying to go parallel and when they do, they run exponentially longer…one example query, ran for 10 seconds and when we dropped the maxdop from 8 down to 4, less than 1 second…and MAXDOP of 7, 6 or 5 also yielded the 10 second run time and only when we got down to 4 did it go seral and fly. We are noticing out of the 24 cores, it seems that only 8 of them are every really ever being used by SQL Server. We told our cloud provider about the excessive CX waits and they just came back and said we were on a Tier 1 storage before and we are now on Tier 2 because the previous Tier was mislabeled or something. Tier 2 has an expected latency of <=20ms while Tier 1 has an expected latency of <=10ms.
Anyways, does anyone really think that the storage could be trickling down and causing the excessive CX waits? We’ve asked for the Phsyical NUMA config on the old VM several times and havent gotten an answer yet and the new VM only seems to have the default (1). SQL Server created 3 Soft NUMA nodes, so was wondering if anyone else thought Physical NUMA would be a more likely culprit?
Again, only thing different is VM and possibly tiered disk storage…SQL Server was restored with all same setting and all same queries are all running twice as long if not longer. Unfortunately, we dont have visibility into the VM config or VMWare Tools or anything, so are asking for a side by side comparison be done.