Feb 28

My favourite wait state – ASYNC_NETWORK_IO

Sometimes it does feel that a problem is a database problem until proven otherwise.  When this occurs I have had to put a lot of time and effort into demonstrating the hows and whys.  One of the easiest (and most common) scenarios is related to ASYNC_NETWORK_IO waits.

Simply put, ASYNC_NETWORK_IO waits occur when SQL Server is waiting on the client to consume the output that it has ‘thrown’ down the wire.  SQL Server cannot run any faster; it has done the work required and is now waiting on the client to say that it is done with the data.

Naturally there can be many reasons why the client is not consuming the results fast enough: a slow client application, network bandwidth saturation, and result sets that are too wide or too long are the most common. In this blog I would like to show you how I go about diagnosing and demonstrating these issues.

Firstly, let’s make a query execute 10 times faster with no effort, seriously no effort, a single tick box.  This tick box is “Discard results after execution” in SSMS.

[Image: the "Discard results after execution" tick box in SSMS Query Options]

SSMS is surprisingly slow at painting the text data on the screen, in fact it seems woefully slow.  With this tick box enabled SQL Server does exactly the same work as without: it throws the data down the wire as normal and waits for the client to respond, again, as normal.  SSMS will still read all the bytes off the wire, but it will not show you the data.  This enables you to run queries at the speed that SQL Server can run at, rather than be crippled by the speed of the client application. Yes, SSMS is not SQL Server, it is just another client application.

So let’s test this theory with the simplest query possible, “SELECT * FROM Sales.SalesOrderHeader”. Here is the Profiler output from running that query 5 times:

[Image: Profiler output of five executions without "Discard results after execution"]

An average of roughly 2.4 seconds.

Now, let’s turn on “Discard results after execution”, this is the only thing I have changed, and re-run the query 5 times and…

[Image: Profiler output of five executions with "Discard results after execution" enabled]

Running at ~240ms, a 10 times speed improvement; reads and CPU are in the same ballpark, the only difference is that SQL Server hasn’t been throttled by SSMS having to present the result set to the user.  If you are trying to performance tune a query, are you taking this into account?  Maybe the problem is not where you think it is.

So why is the ASYNC_NETWORK_IO wait my favourite?  Well, at the global level we can now find out how much of an effect this is having.
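As a minimal sketch of how you might measure this yourself (assuming you have the rights to clear instance-wide wait statistics, and remembering that the clear affects everyone on the server), you can reset sys.dm_os_wait_stats, run the workload, and then look at just this wait.  Paul Randal’s script layers filtering and percentages over the same DMV.

-- Reset the cumulative wait statistics (instance-wide, so think twice on a shared server)
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
GO
-- Run the workload, then see how long SQL Server spent waiting on the client
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
  FROM sys.dm_os_wait_stats
 WHERE wait_type = 'ASYNC_NETWORK_IO';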

If I clear the waits and use Paul Randal’s “Tell me where it hurts” query when running without discard, SQL Server has ‘waited’ 10.44 seconds on ASYNC_NETWORK_IO.

[Image: wait stats output showing 10.44 seconds of ASYNC_NETWORK_IO]

With discard on:

Nothing, no rows returned.  There were no waits.

Clearly this demonstrates that client application performance is a serious consideration in our scenario here.

Incidentally, where you are executing a query from will make a hell of a difference.  I was once involved in a performance issue where the DBAs could not reconcile the duration shown in a Profiler trace against the application with the same query executing in SSMS.  The big differentiator was that SSMS was being run on the server itself (RDP’d in <facepalm>); the query was fine.  If SSMS had been run over the saturated network connection that the applications were using, then the same ‘performance problem’ would have been witnessed.  A ‘fair test’ is something we should always be looking to achieve.

SSMS has another often underused feature (well, it has loads 🙂): Client Statistics.

This is enabled by this icon in SSMS.

[Image: the Client Statistics button in SSMS]

What this enables you to do is compare different executions over a number of different attributes.

[Image: Client Statistics comparing several executions]

As you can see, my “Wait time on server replies”, the inverse, if you like, of ASYNC_NETWORK_IO, is 3 milliseconds.

What I particularly like is the “Bytes received from server” metric.  It is self-evident what that means, but in my experience of performance tuning it is a question rarely asked, and here we have an exact answer to it.  With this simple number we can now ask, “Well, how long would it take to copy a same-sized file from server to client?”  When the result set size is in gigabytes, this simple question will result in a fairly rapid re-architecting of a system.
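To put rough numbers on it (purely illustrative, not taken from the trace above): if “Bytes received from server” reports 5 GB and the effective throughput between server and client is around 100 MB/s, the raw transfer alone is in the region of 50 seconds before the application has processed a single row.  No amount of index tuning on the server will claw that time back.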

In conclusion, ASYNC_NETWORK_IO is my favourite wait state as it means (fairly mendaciously 🙂) it’s not a SQL Server problem; SQL Server has done its work.  Now, let’s talk about doing it right.

If you are interested in more real-world stories and hints and tips like this then you’ll hear them in my precon at SQLSaturday Exeter on 11th March 2016, be great to see you there.

Aug 15

Please please tell me now…

…Is there something I should know, sang Duran Duran many moons ago, when I was young and before I knew that query optimization was even a thing.  But it is quite a handy little line to keep in mind when you are tuning queries: giving SQL Server a little bit more explicit information about your data can go a long way to removing the work that is done to guarantee something about the data.

The classic case for this is adding constraints, for example:

select * from Production.Product where DaysToManufacture < 0

Produces this plan:

[Image: execution plan showing a constant scan instead of a table read]
SQL Server doesn’t need to read any of the table data.  This is because there is a constraint on the table guaranteeing that there are no rows with DaysToManufacture < 0, and therefore SQL Server can safely replace the table read with a constant scan.
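If you want to see the constraint the optimizer is leaning on here, the catalog views will show it; a quick sketch (the exact constraint name in your copy of AdventureWorks may differ):

-- List the check constraints defined on Production.Product and their definitions
SELECT cc.name, cc.definition
  FROM sys.check_constraints AS cc
 WHERE cc.parent_object_id = OBJECT_ID('Production.Product');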

So how else can this be exploited?

Well, I’m working on a large data warehouse project and one of the issues we have had is poor query performance against the data warehouse when processing the cubes.  Imagine we are using this query:

select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
       FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
       factInternetSales.ProductKey,UnitPrice 
  from dimCustomer
  join factInternetSales
    on dimCustomer.CustomerKey = factInternetSales.CustomerKey

in a DSV.  Now, SSAS has a high propensity (I don’t know if it always does this) to wrap queries in a subselect and use DISTINCT, so our query would become something like this:

Select distinct *
from (
   select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
          FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
          factInternetSales.ProductKey,UnitPrice 
     from dimCustomer
     join factInternetSales
       on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data

The plan hasn’t changed; both look like this:

[Image: execution plan, identical for both queries, with no extra operators for the DISTINCT]

That is because the primary keys of both tables are being returned, and that guarantees the DISTINCT.  If either key column is removed from the select, then SQL Server will have to sort to satisfy the DISTINCT.

Select distinct *
from (
    select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
           /* FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, */
           factInternetSales.ProductKey,UnitPrice 
      from dimCustomer
      join factInternetSales
        on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data
option(maxdop 1) -- Inhibit parallelism for simplicity

[Image: execution plan with a Sort operator added to satisfy the DISTINCT]

Yes, we are asking SQL Server for two different sets of data, and they are not logically equivalent queries, but sometimes the data is implicitly unique and SQL Server still has to do work to guarantee that uniqueness.

Let us further imagine that unknown members are dealt with by setting the customer key to -1 in factInternetSales, BUT the member -1 itself is not in the dimCustomer table and is dealt with by a view.  This is quite a bad practice, but that never stops this sort of stuff occurring in the real world, so bear with me.

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
union
Select -1,null,null,'Unknown'


The first thing to note is that UNION is used.  UNION guarantees a unique set of rows, and therefore SQL Server will have to employ extra work to do that, even on a simple select * from vwCustomer.  The recommendation would be to use UNION ALL; we implicitly know that the data will be unique, so we can save SQL Server the bother.

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
union all
Select -1,null,null,'Unknown'

Now we plumb that into our DSV…

select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
       FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
       factInternetSales.ProductKey,UnitPrice 
  from vwCustomer dimCustomer
  join factInternetSales
    on dimCustomer.CustomerKey = factInternetSales.CustomerKey

Which, if we run it in isolation, is just peachy. However, let us wrap it in DISTINCT as SSAS would:

Select distinct *
from (
   select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
          FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
          factInternetSales.ProductKey,UnitPrice 
     from vwCustomer dimCustomer
     join factInternetSales
       on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data
option(maxdop 1)

[Image: execution plan with a Sort operator over the view]

Oh, we now have a sort.  This will be REALLY heavy on a regular DW load; ordering millions of rows is not quick.

So, what is our solution?  Let’s take our implied knowledge of the data and make it explicit to SQL Server.

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
 where CustomerKey > 0
union all
Select -1,null,null,'Unknown'

To all intents and purposes, adding CustomerKey > 0 does nothing; there are no rows that fail the predicate, so nothing will be filtered, and that is exactly the point.  This is implied knowledge, and we have now explicitly given SQL Server that same knowledge.
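If you want to reassure yourself (and your reviewers) that the extra predicate really is a no-op for this data, a quick count is enough, something like:

-- Should return 0: no customer rows are excluded by the new predicate
SELECT COUNT(*) AS FilteredOutRows
  FROM dimCustomer
 WHERE CustomerKey <= 0;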

So what happens when we use that view ?

[Image: execution plan with no Sort operator]

Great, no sorting 🙂

So that deals with this as a one-off.  Let’s let SQL Server know that this holds for all queries against the table, not just for this view.

alter table dimCustomer add Constraint ChkSk check(CustomerKey>0)
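One caveat worth adding (my note, not part of the original demo): the optimizer will only use a check constraint for this kind of simplification if it is trusted, so if it ever gets re-enabled WITH NOCHECK after a bulk load it stops helping.  A quick way to check:

-- is_not_trusted must be 0 for the optimizer to rely on the constraint
SELECT name, is_not_trusted, is_disabled
  FROM sys.check_constraints
 WHERE name = 'ChkSk';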

We are now guaranteeing that all CustomerKeys are positive, so if we back up a bit and use the view definition of:

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
union all
Select -1,null,null,'Unknown'

And then run:

Select distinct *
from (
   select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
          FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
          factInternetSales.ProductKey,UnitPrice 
     from vwCustomer dimCustomer
     join factInternetSales
       on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data
option(maxdop 1)

Once again, the DISTINCT is explicitly guaranteed by the schema, and no extra operations are required by SQL Server to prove it 🙂

So, next time you are query tuning, remember that adding a little extra (and sometimes seemingly superfluous) code can go a long way.

To paraphrase a bit

Please please tell me now, is there something you can say, to make that operator go away…

Apr 12

Reasons why your plans suck: No 56,536

I have been working with SQL Server for more years than I really care to mention and like to think that I am now quite comfortable with the internals, specifically the Query Optimizer. Every so often, though, a new problem gets thrown into the mix, just to keep me on my toes (or so it would seem).

Plans go “wrong” for a multitude of reasons, but they mostly boil down to a poor estimation. Poor estimations are generally caused by inadequate statistics, which is why SQL Server puts so much effort into efficiently creating and maintaining them. So, on my dev box I had a plan that went “wrong”: a 3-second process turned into a 3-hour process. A bit of investigation quickly turned up the culprit, a relatively simple select statement. No worries, it’s probably due to an old cached plan being used. Here is the plan:

[Image: execution plan with an estimated row count of one for Application]

In reality, Application had hundreds of thousands of rows, not one.

So I added WITH(RECOMPILE); that should sort that one. But no, on the next run, the same thing happened with the same statement. Curious! Running the statement in isolation caused a “good” plan to be built; running the full process, a “bad” plan was built. So, why a bad estimation? Bad stats? In fact the plan was showing “missing stats” as a warning.

[Image: plan warning reporting missing statistics]

Right, so it’s not bad stats, it’s no stats. Why would there be no stats? There are not too many reasons. Asynchronous stats processing could be a culprit; it’s a large table, so stats could potentially be built async. But no, that option is disabled.
Possibilities are narrowing down now, but I thought I would take a look at the internal information that is not normally available within the execution plan. For this you will need the undocumented trace flag 8666: turn it on using DBCC TRACEON(8666) before retrieving the plan.
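As a rough sketch of that workflow (an undocumented flag, so keep it to a dev box and turn it off afterwards):

-- Enable the undocumented trace flag for this session
DBCC TRACEON(8666);
GO
-- Re-run the problem statement here with 'Include Actual Execution Plan' enabled,
-- then look for the extra internal information elements in the plan XML
GO
DBCC TRACEOFF(8666);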
So inside the plan, I saw this:

[Image: internal plan information showing the statistics failed to build]

The stats failed to build. Not “not found” or “non-existent”, the stats actually failed to build. Running Profiler with the error events, StmtStarting and StmtEnding, we can see this activity happening. If you are not aware, stats are built using the STATMAN function, and this can be seen in a Profiler trace just like any user-initiated action. So, run the process and…

[Image: Profiler trace showing the STATMAN statement failing]

There it is: the stats failed to build, so the estimation was wrong and I got a bad plan. Bingo. That’s the effect; what’s the cause of the failed stats build?

There are a few reasons why stats fail to build, mostly edge-case scenarios such as deadlocks. But remember that I said this was a dev box? A small dev box indeed; in this case the stats build failed due to memory pressure. How can I be certain of that? Well, playing the hunch card, I saw an extended event, “query_no_cqscan_due_to_memory_limitation”. This, amongst a few others, sounded relevant, and indeed when running the process through, that event does fire when the statistics statements fail under memory pressure.
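For reference, a rough sketch of the kind of Extended Events session I used to catch it; this assumes the event is exposed as sqlserver.query_no_cqscan_due_to_memory_limitation on your build (it is a debug-channel event, so check sys.dm_xe_objects first):

-- Capture statements that hit this memory limitation, along with their SQL text
CREATE EVENT SESSION [stats_memory_pressure] ON SERVER
ADD EVENT sqlserver.query_no_cqscan_due_to_memory_limitation
    (ACTION (sqlserver.sql_text, sqlserver.session_id))
ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION [stats_memory_pressure] ON SERVER STATE = START;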

[Image: Extended Events output showing query_no_cqscan_due_to_memory_limitation firing]

Sometimes I am in awe of the amount of work that has gone into SQL Server’s query processing engine, and this is one of those times. Ideally I would have liked the problem to be signposted more clearly, maybe in the error log, but beggars can’t be choosers.

Jan 03

Reading a snapshot – are there performance implications?

My present role is to look at the performance of a large data warehouse load.  This is right up my alley, and I have been playing with all manner of performance tuning fun: indexing, execution plans, filtered statistics etc. All good stuff.

However, during QA testing some of my suggestions actually made things worse.  Not a little worse, but a whole lotta worse.  Big surprise.  Going back to my dev box I confirmed that the TSQL changes should have made things better, but there was one significant difference (other than the data load).  To ensure transactional integrity a database snapshot is used; this means that the load process works on a static view of the data whilst the OLTP system can carry on doing its work.  In the QA testing we, by definition, ran the whole process, whereas when I was dev’ing I read from the database, not the snapshot.  Interesting.

When I switched to using a snapshot, things did indeed go slower.  Let’s try this against AdventureWorks:


CREATE DATABASE AdventureWorks2012ss ON
( NAME = AdventureWorks2012_Data, FILENAME = 
'D:\DATA\AdventureWorks2012_data.ss' )
AS SNAPSHOT OF AdventureWorks2012;
GO
set statistics io on
GO
DBCC DROPCLEANBUFFERS
GO
use AdventureWorks2012
GO
select count(SOH.Comment),count(SOD.ProductID) from sales.SalesOrderHeader SOH
inner loop join sales.SalesOrderDetail SOD
on SOH.SalesOrderID = SOD.SalesOrderID
GO
use AdventureWorks2012ss
GO
set statistics io on
go
DBCC DROPCLEANBUFFERS
GO
select count(SOH.Comment),count(SOD.ProductID) from sales.SalesOrderHeader SOH
inner loop join sales.SalesOrderDetail SOD
on SOH.SalesOrderID = SOD.SalesOrderID
GO

OK, so that is a fairly nonsensical query, and forcing the loop join is there to demonstrate the issue, but the plans are exactly the same. Running those and tracing in Profiler shows the performance difference:
[Image: Profiler trace comparing durations against the database and the snapshot]

Reading the database takes ~1.0-1.1 seconds, reading the snapshot ~1.5-1.6 seconds, a sizeable difference. Other noteworthy things here are that the snapshot file and the database file exist on the same physical drive, and that drive is a slow USB drive. Testing against my laptop’s SSD, try as I might, demonstrated no measurable difference.

There is, however, one difference in the output of the above scripts: physical reads.
[Image: STATISTICS IO output showing the physical read counts for both runs]

150 against the database, 1150 against the snapshot. This sounds to me like a likely cause of the slowdown.

When reading a snapshot, the sparse file has to be read first, and then the database is read if the page has not yet been changed, a double whammy if you will. Also, this happens on a page-by-page basis, which is why I forced the nested loop join in the query. Using Process Monitor from the Sysinternals toolset we can see this activity in action…

Firstly, reading the database:

[Image: Process Monitor showing 65,536-byte reads against the database file]

Nice large reads of 65,536 bytes each. Now the snapshot…

[Image: Process Monitor showing 8,192-byte reads against the snapshot]

Lots of reads of 8,192 bytes; this is the activity that I believe is slowing my queries down. I asked Microsoft for an opinion on what is actually happening here, and the response was:

When we do a large read against a COW sparse file, pages that have not been updated in the primary will not be in the sparse file and we will have to read from the primary’s file. This is done on a page by page basis so can result in many reads against the primary for a single read in the snapshot DB.

If the pages have been pushed over to the snapshot’s COW file then they can be read in a large read.

Essentially, we’ll only do large reads against pages that have been pushed into the snapshot.

The broken up reads against the primary are issued in parallel, so while it is more IOs it is not necessarily a longer time to do those IOs.

Sparse file reads themselves are not likely very optimal since they are by definition fragmented, so OS will likely end up breaking up those reads anyway.

So the activity that I am seeing is natural and normal and that explanation does highlight the physical differences that are necessitated when reading a snapshot.

So, back to the problem at hand. To fix the queries that were running slow in the real examples I used the FORCESCAN hint, which prevents the nested loop operation and enables SQL Server to do larger reads against the database file for those unchanged pages in the snapshot. This did have a dramatic effect.
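Transplanted onto the AdventureWorks demo above (the real queries are not shown here, so treat this purely as a sketch; FORCESCAN requires SQL Server 2008 R2 SP1 or later):

-- Force scan access so unchanged pages in the snapshot can be read in large chunks
select count(SOH.Comment),count(SOD.ProductID)
  from sales.SalesOrderHeader SOH with (forcescan)
  join sales.SalesOrderDetail SOD with (forcescan)
    on SOH.SalesOrderID = SOD.SalesOrderID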

Have you seen similar activity before? Is this only an issue for slow or overloaded IO systems? I’d love to know what you have witnessed.