Friday, January 30, 2015

More fun with SQL Server & perfmon: Will trace flag 9024 help?







In the graph below, there is at least some relationship between buffer page lookups/sec (logical IO) and CPU utilization in the shaded area.  Before the shaded area begins, though, CPU utilization is extremely high while the logical IO rate is extremely low.  Whenever I see that, I try to rule out spinlock conditions - and if there are known applicable spinlock conditions, I try to mitigate them (this is one of the reasons I try to learn about as many trace flags as I can).
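
My usual first check is a before-and-after look at sys.dm_os_spinlock_stats.  This is a minimal sketch - I'm assuming the spinlocks of interest here are LOGCACHE_ACCESS (transaction log cache) and LOCK_HASH (which covers the database lock taken at login); the numbers are cumulative since instance startup, so two snapshots over an interval are needed to see what's actually spinning during the problem window.

-- snapshot spinlock activity; take two samples and compare the deltas
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN (N'LOGCACHE_ACCESS', N'LOCK_HASH')
ORDER BY spins DESC;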




During the shaded time period, notice that log flush bytes/sec is much higher than it was previously, as is buffer lookups/sec.



Log flushes/sec trends with log flush bytes/sec throughout the time range.
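
The same log flush counters that perfmon reads are visible from inside SQL Server via sys.dm_os_performance_counters.  A minimal sketch - note the object_name prefix differs for a named instance, and these are cumulative values, so two samples and a delta are needed to get actual per-second rates.

-- cumulative log flush counters per database; delta two samples for rates
SELECT instance_name AS database_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%:Databases%'
  AND counter_name IN (N'Log Flushes/sec', N'Log Bytes Flushed/sec')
ORDER BY instance_name, counter_name;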





Because query concurrency and parallel worker concurrency are both much lower during the shaded, productive area than in the preceding high CPU/low logical IO time period, I definitely think a spinlock issue could be involved.



I've got two spinlock issues in mind:
1. The global database lock on login, relieved by startup trace flag 1236.  Although I think trace flag 1236 will be somewhat helpful, in this particular case I don't think the login rate is high enough to trigger spinlock contention at this extreme level.
    https://support.microsoft.com/kb/2926217
2. Transaction log pool global access, relieved by startup trace flag 9024.  That seems like a good suspect.  This system is a 4 vCPU VM, with SQL Server seeing a single NUMA node.  So trace flag 9024, which promotes log cache spinlock access from global to node level, won't help by itself.  But with trace flag 8048 also enabled, log cache spinlock access gets promoted all the way to CPU level.  *That* may help.  There's not a lot of information about trace flag 9024, the conditions under which relief is expected, or the symptoms that indicate it may be helpful.  The kb article doesn't even indicate that it is a startup trace flag yet.  But I'm hopeful it'll help in this case.  I'll update this blog post with what I learn.  (A quick check for which of these trace flags are active is sketched just after this list.)
    http://support.microsoft.com/kb/2809338
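
For reference, this is how I check which of these trace flags are already active on an instance - a minimal sketch, nothing more.  Enabling them as startup trace flags means adding -T1236, -T8048, and -T9024 to the SQL Server startup parameters (via Configuration Manager) and restarting the service.

-- report global status of the trace flags discussed above
DBCC TRACESTATUS (1236, 8048, 9024, -1);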

Friday, January 16, 2015

HP-UX 11iv3/11.31 disk congestion control/throttling

HP-UX 11.31 includes some new parameters available via scsimgr.  These parameters are for disk congestion control, based on a maximum queue depth for the target port.

This document includes details on how the congestion control works.
HP-UX 11i v3 Congestion Control Management for Storage
http://www.whitepapercompany.com/pdfs/HP_Tech_WP.pdf

On my production systems, I mainly just want to ensure it's disabled :-)

If this lil awk returns a bunch of nuthin' except target path instance numbers, it's not enabled.

sasquatch dimaggio:/home/sasquatch> for tgtp in `ioscan -kfNCtgtpath | awk 'NR>2 {print $2}'`; do echo $tgtp":\c"; scsimgr -p get_attr -C tgtpath -I $tgtp -a tport_qdepth_enable -a tport_max_qdepth -a tport_qd_dyn ; done
0:::
1:::
2:::

Tuesday, January 6, 2015

SQLServer:Wait Statistics(Waits started per second)\Log buffer waits - bogus values?

There are 128 in-memory transaction log buffers for each SQL Server database.  If none of them are free (a situation typically only encountered with high write latency and high concurrency of writes to the transaction log), threads in active transactions will accrue wait time for transaction log buffers.

In perfmon, this is accounted for with "SQLServer:Wait Statistics(*)\Log buffer waits".  To gauge whether transaction log write performance is a bottleneck for high data rate ETL, I typically use the "Waits started per second" instance.  Sometimes I use "Waits started per second" in conjunction with "Average wait time (ms)".
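
The same counter can also be pulled from inside SQL Server, which is sometimes handy for cross-checking what perfmon reports.  A minimal sketch - again, the object_name prefix varies for named instances, and the "per second" instances are cumulative values that need two samples and a delta.

-- the 'Log buffer waits' counter and its statistic instances
SELECT counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%:Wait Statistics%'
  AND counter_name = N'Log buffer waits';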

Some values are obviously bogus - various flavors of 1.84E+19, such as 18446744073709500000, are among my favorites.

But today, among the useful values in the range from 0 to 301 on a 6 vcpu system, I see an abundance of these values, too.
38894
40103
40108
40111
40404
40711
40714
40736
40737
40738
40739

Anyone else seeing bogus values in the SQL Server wait-related perfmon numbers?  Is there something clever to eliminate these bogus values - maybe ensuring that 64-bit perfmon is launched, or some other trick?

******
5761 values collected, in 15 second increments.  This was the distribution.  I'll look through several more days' worth of data.  The 1.84E+19 value is pretty easy to discard - even more so if it always reports 0 as an average wait time.  All of the bogus values from 38894 to 40739 report 49 ms as the average wait time.  Hmmm... so I may be able to discard values in a range with a particular reported wait time (a rough filter along those lines is sketched after the table below).


Average Wait Time (ms)   Waits Started in 15 Seconds   Occurrences
0 0 3488
0 1 1
0 1.84467E+19 722
14 4 1
19 24 1
21 60 1
22 3 1
24 13 1
24 301 1
25 8 1
28 14 1
29 3 1
29 4 1
32 152 1
34 3 1
34 12 1
34 25 1
35 6 1
37 4 1
38 4 1
45 7 1
46 8 1
49 38894 718
49 40103 11
49 40108 2
49 40111 3
49 40404 1
49 40711 2
49 40714 103
49 40736 18
49 40737 126
49 40738 7
49 40739 525
53 5 1
55 3 1
56 47 1
58 4 1
62 14 1
63 2 1
67 9 1
71 1 1
106 85 1
115 17 1
119 1 1
134 1 1
166 4 1
177 2 1
184 4 1
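
If I do end up screening these out programmatically, the filter I have in mind looks something like this.  Just a sketch - #wait_samples is a hypothetical staging table holding one row per 15-second collection, with the counter values as collected.

-- hypothetical staging table: one row per 15-second perfmon sample
CREATE TABLE #wait_samples
(
    sample_time   datetime2     NOT NULL,
    avg_wait_ms   int           NOT NULL,
    waits_started decimal(20,0) NOT NULL  -- wide enough to hold the 1.84E+19 junk
);

SELECT sample_time, avg_wait_ms, waits_started
FROM #wait_samples
WHERE waits_started < 1000000000000000000          -- discard the 1.84E+19 flavors
  AND NOT (avg_wait_ms = 49                        -- discard the suspect cluster
           AND waits_started BETWEEN 38894 AND 40739);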


When I asked #sqlhelp about this, Paul Randal indicated (rightly so) that there are more reliable methods of gathering this information - such as extended events and the SQL Server wait DMV.  I use the wait DMV pretty extensively, too - usually in 5 minute increments.  But I also have a need for something very lightweight, minimally invasive (particularly requiring low privileges), and easily deployable across numerous systems - so I'll still pursue this a bit.
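
For completeness, the wait stats DMV check I lean on looks roughly like this.  A minimal sketch - LOGBUFFER is the wait type behind the "Log buffer waits" counter, and I usually pull WRITELOG alongside it for comparison; values are cumulative since startup, so I diff samples over my usual 5 minute interval.

-- cumulative log-related waits; diff two samples over the collection interval
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'LOGBUFFER', N'WRITELOG');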