Thursday, April 27, 2017

SQL Server 2016 checktable - sometimes takes longer with warm cache than cold?!?

First a disclaimer - I've never run these tests on any version prior to SQL Server 2016.  So I am not at all claiming this is a new phenomenon in SQL Server 2016.  It's just an interesting behavior that surfaced while I was testing SQL Server 2016.

The system is a 6 vcpu, 48 GB RAM instance of SQL Server 2016 SP1.

I've been running through a lot of checkdb/checktable scenarios lately, and I've been trying to keep up with blogging whenever I find something interesting.

This particular case is interesting because it defies expectations.

Here's my test, including instrumentation. The test subject is that lonely checktable command nestled in there :-) I use tablock because, without tablock (or without running against an explicit snapshot), each run of checktable creates a new internal snapshot - and then there's no way to compare warm vs cold database cache.
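To illustrate the difference, here's a quick sketch of the two forms against my test table (just the two commands, nothing fancy):

-- without tablock (and outside an explicit snapshot), every run gets a brand new internal snapshot
dbcc checktable ('sasq_list');
-- with tablock, no internal snapshot is created - locks are taken instead -
-- so warm vs cold database cache runs can actually be compared
dbcc checktable ('sasq_list') with tablock;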

So... it's pretty verbose, sorry about that. But I'm stuffing @@io_busy, virtual file stats, wait stats, spinlock stats and latch stats into temp tables before and after the checktable so they can be compared. I've got the dbcc dropcleanbuffers in there because I ran this scenario many, many, many times to verify I wasn't being thrown off by something. But my results were stable and reproducible.

So I ran the test below with cold database cache. (My database was idle; you might have to issue a checkpoint before the dropcleanbuffers to ensure the cache really is cold if issuing checktable against a table that's recently been edited.)
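For example, something like this ahead of the script (just a sketch; the checkpoint only matters if the table was recently modified):

checkpoint;              -- write any dirty pages for this database to disk
dbcc dropcleanbuffers;   -- then drop the clean buffers so the database cache really is cold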


/* set nocount on;
   dbcc dropcleanbuffers;   -- uncommented for the cold cache runs, left commented for warm cache
*/
DECLARE @iobusy BIGINT;
-- clean up temp tables from any previous run
DROP TABLE IF EXISTS #wstats_1;
DROP TABLE IF EXISTS #spins_1;
DROP TABLE IF EXISTS #latch_1;
DROP TABLE IF EXISTS #wstats_2;
DROP TABLE IF EXISTS #spins_2;
DROP TABLE IF EXISTS #latch_2;
DROP TABLE IF EXISTS #virtfile_1;
DROP TABLE IF EXISTS #virtfile_2;
DROP TABLE IF EXISTS #io_busy;
-- "before" snapshots: session waits, spinlocks, latches, virtual file stats
SELECT * into #wstats_1
FROM sys.dm_exec_session_wait_stats where session_id = @@spid;
SELECT * into #spins_1
FROM sys.dm_os_spinlock_stats where spins > 0;
SELECT * into #latch_1
FROM sys.dm_os_latch_stats ls where ls.waiting_requests_count > 0;
SELECT [file_id], num_of_reads, num_of_bytes_read, io_stall_read_ms INTO #virtfile_1
FROM sys.dm_io_virtual_file_stats(db_id(),null);
-- @@io_busy is in ticks; multiplying by @@timeticks converts to microseconds
set @iobusy = @@timeticks * @@io_busy;
set statistics time, IO on;
dbcc checktable ('sasq_list') with tablock;   -- the test subject
set statistics time, IO off;
-- "after" snapshots for comparison
SELECT @@timeticks * @@io_busy - @iobusy AS io_busy_microseconds INTO #io_busy;
SELECT [file_id], num_of_reads, num_of_bytes_read, io_stall_read_ms INTO #virtfile_2
FROM sys.dm_io_virtual_file_stats(db_id(),null);
SELECT * into #latch_2
FROM sys.dm_os_latch_stats ls where ls.waiting_requests_count > 0;
SELECT * into #spins_2
FROM sys.dm_os_spinlock_stats where spins > 0;
SELECT * into #wstats_2
FROM sys.dm_exec_session_wait_stats where session_id = @@spid;

OK.  Well - here's the output from the checktable and statistics time.


I've got all those temp tables for instrumentation - let's cash the check 😀


-- io_busy delta: microseconds of CPU spent issuing IO during the checktable
SELECT * FROM #io_busy;

-- virtual file stats delta: reads issued during the checktable
SELECT vf2.[file_id], vf2.num_of_reads - vf1.num_of_reads AS num_of_reads, 
vf2.num_of_bytes_read - vf1.num_of_bytes_read AS num_of_bytes_read, 
vf2.io_stall_read_ms - vf1.io_stall_read_ms AS io_stall_read_ms
FROM #virtfile_1 vf1
JOIN #virtfile_2 vf2 ON vf1.[file_id] = vf2.[file_id]
WHERE vf2.num_of_reads > vf1.num_of_reads;

-- session wait stats delta
select ws2.wait_type, ws2.waiting_tasks_count - ws1.waiting_tasks_count as wait_count,
ws2.wait_time_ms - ws1.wait_time_ms wait_ms, ws2.signal_wait_time_ms - ws1.signal_wait_time_ms signal_ms
from #wstats_2 ws2 join #wstats_1 ws1 on ws1.wait_type = ws2.wait_type
where ws2.waiting_tasks_count - ws1.waiting_tasks_count > 160;

-- spinlock delta
select ws2.name, ws2.spins - ws1.spins as spin_count,
ws2.collisions - ws1.collisions collision_count
from #spins_2 ws2 join #spins_1 ws1 on ws1.name = ws2.name
where ws2.collisions - ws1.collisions > 10;

-- latch wait delta
select ws2.latch_class, ws2.waiting_requests_count - ws1.waiting_requests_count as latch_count,
ws2.wait_time_ms - ws1.wait_time_ms latch_ms
from #latch_2 ws2 join #latch_1 ws1 on ws1.latch_class = ws2.latch_class
where ws2.waiting_requests_count - ws1.waiting_requests_count > 0;

The results...


All right.  Let's test again - this time without dropping clean buffers, so we can compare behavior with warm cache.

Here's the output from checktable and statistics time.


Huh.  Almost 11.5 seconds of elapsed time with warm cache, vs just over 6.2 seconds of elapsed time with cold cache.  That's unexpected.

Let's see what the temp tables caught - maybe they'll shed some light.

Well.  No reads according to virtual file stats - that's what we expected.  No latch waits.  (That varied a little among test runs, but there was never a significant number of latch waits.)  The spinlocks are uninteresting... I was kind of hoping to find a spinlock-related reason for the warm cache checktable taking longer.  No dice.



So what's going on?  Well - first of all, take a look at CPU ms between the cold cache and warm cache runs.  34534 for cold cache, 33907 for warm cache.  Even though the cold cache run finished substantially faster, it took 627 ms more CPU time to complete.  That's less than 2% variance.  But it's stable in the context of this test.  I ran it over and over, and although the times shifted around a bit, the cold cache run was always faster but required around 500 more CPU ms to complete.

On the one hand, that makes sense: the cold cache run issued 8103 read IOs according to virtual file stats, and @@io_busy says it spent 125 milliseconds of CPU time just issuing those IO commands.
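(For reference, the #io_busy temp table from the script above holds that delta in microseconds; a trivial conversion puts it back in milliseconds:)

-- the cold cache run's delta was roughly 125,000 microseconds, i.e. ~125 ms
select io_busy_microseconds / 1000.0 as io_busy_ms from #io_busy;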

 But it still finished faster!

As far as I can tell, the secret is in the wait stats - cxpacket in particular.  Below are the cold cache wait stats again.



And below are the warm cache wait stats again.  The warm cache checktable - which I would have expected to finish faster - somehow incurred almost 3 times as much cxpacket wait time as the cold cache checktable - and over 3 times as much signal wait time for the cxpacket wait type.


The warm cache checktable had less work to do than the cold cache checktable.  And, indeed, it did less work from an accumulated CPU time standpoint.  But somehow, it didn't divvy up its work as well among the parallel workers.  Due to a significant increase in cxpacket wait time and cxpacket signal wait time over the cold cache operation, it finished in almost twice the time!
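If you want to zero in on just that wait type, the same before-and-after temp tables can be filtered directly - this is nothing more than the wait stats diff query from above with a cxpacket filter:

select ws2.wait_type, ws2.waiting_tasks_count - ws1.waiting_tasks_count as wait_count,
ws2.wait_time_ms - ws1.wait_time_ms wait_ms, ws2.signal_wait_time_ms - ws1.signal_wait_time_ms signal_ms
from #wstats_2 ws2 join #wstats_1 ws1 on ws1.wait_type = ws2.wait_type
where ws2.wait_type = 'CXPACKET';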

This is one reason it's extremely important to look not just at work, and not just at work + waits... not even just at work + waits + management.  Appropriate scheduling and distribution of work among resources is also important to keep everyone happy when they want extreme performance.


