It was February 2022.
We had just finished installing and configuring our Dell PowerStore 500T SAN — the next step in modernizing our infrastructure.
I started migrating workloads from the old Dell Compellent SAN to the new PowerStore. What used to take minutes suddenly took hours. The system crawled. Something wasn’t right.
At first, I assumed it was storage or network related — maybe throughput, maybe zoning. But then our DBA mentioned something interesting:
“We’re seeing noise in the database. Failover events between the AG nodes are spiking.”
That was the clue.
The issue wasn’t in the network — it was deeper.
I reached out to Brent Ozar, a SQL Server consultant I’d followed for years. Within a day, Brent sent back a detailed analysis that completely reframed the problem:
- TempDB was undersized and constantly shrinking/growing.
- Instant File Initialization wasn’t enabled.
- Memory allocation was far below best practice.
- SQL Server was four years behind in patching.
I spent the next few days implementing his six recommendations:
✅ Enabled Instant File Initialization
✅ Increased memory to 64 GB on both nodes
✅ Grew TempDB from 250 GB to 750 GB
✅ Patched SQL Server 2016 to SP3
✅ Reconfigured max memory settings
✅ Rebalanced the AG nodes
And just like that — performance returned to normal. No additional consulting costs. No new hardware. Just analysis, execution, and a little persistence.
What started as a latency issue turned into a lesson in causality over correlation.
Sometimes the problem you’re chasing isn’t the one you need to fix — and the right data (and a few good people) can make all the difference.
Skills: Root Cause Analysis • IT Infrastructure Analytics • SQL Optimization • Resiliency Engineering • Causality over Correlation