Mar 13

A first look at SQL Server 2016 – STRING_SPLIT

I've been developing T-SQL for about 20 years now, and almost since day one there has been functionality that, IMO, has been missing.  That is grouped string concatenation (i.e. forming a delimited string) and string split (breaking up a delimited string).  When I say missing, I really mean some form of succinct functionality that can do it, and not some form of hideous kludge that can cause a religious argument between various members of the community about why their method is better than anyone else's 🙂
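
To be concrete about the kludges: the best known trick for the concatenation side is abusing FOR XML PATH, along these lines (sys.objects here is just a convenient source of rows):

-- Grouped string concatenation, pre-2016 style: FOR XML PATH('') glues
-- the rows together, then STUFF strips the leading comma.
SELECT STUFF((SELECT ',' + name
              FROM sys.objects
              FOR XML PATH('')), 1, 1, '');

It works, right up until the data contains an XML-special character such as &, at which point the entitised output starts the aforementioned religious argument.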

My hopes were not high for SQL Server 2016 fixing this either when this article https://blogs.msdn.microsoft.com/sqlserverstorageengine/2015/09/08/passing-arrays-to-t-sql-procedures-as-json/ was published.  I've never been convinced by complex (i.e. XML) datatypes within the relational database; there are better ways if that is what you have to do.  But, <sarcasm> yippee </sarcasm>, with JSON baked in, here is another method of string splitting rather than doing it properly.  Honestly, it really does make me want to cry.
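
For reference, the JSON route that article points at looks something like this (OPENJSON is also new in 2016 and needs compatibility level 130):

-- 'Splitting' a list by treating it as a JSON array
SELECT value
FROM OPENJSON('["one","two","three"]');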

But, with the release of SQL Server 2016 RC0, snuck into the release notes was this:

[Screenshot: the release notes entry announcing STRING_SPLIT]

Woohoo, a STRING_SPLIT function at last; let's break out the bunting.

So, let's have a quick investigation into how to use it and whether there are any caveats around it.

I won't repeat the official documentation https://msdn.microsoft.com/en-gb/library/mt684588.aspx but, to recap, using it is really simple: STRING_SPLIT(string, delimiter), where string is the delimited string to split and delimiter is the character to split it on.  Notably, the delimiter can only be a single character; that's not a deal breaker by any means, but it could be an annoyance.
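
As a minimal example (note that STRING_SPLIT requires the database to be at compatibility level 130):

-- Returns one row per element, in a single column named value
SELECT value
FROM STRING_SPLIT('one,two,three', ',');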

Right, I'm now assuming that you are familiar with the official docs, so is there anything else to know about this function before we go forward and immediately replace all our previous kludges with it?

Firstly, anyone that knows me will know that I'm interested in estimations (and the wider optimization area), so the first thing is: how many rows are estimated to be returned from the TVF?

Here is a script to test this:

CREATE TABLE Strings
(
    String varchar(max)
);
GO
INSERT INTO Strings VALUES ('one,two,three');
GO
SELECT *
FROM Strings
CROSS APPLY STRING_SPLIT(Strings.String, ',');

The plan is as you would expect from a CLR TVF; in fact, I suspect that this is just a CLR TVF implementation, but I'm not going to state that as such 🙂

[Screenshot: execution plan with a Table Valued Function operator]

And the detail for the TVF?

[Screenshot: properties of the Table Valued Function operator]

50, the estimate of the rows output is 50.  After a number of tests this hasn't changed.  In line with similar TVF functionality, this isn't surprising news, but I was curious as to what the number was.

Secondly, as I'm suspecting that this is just CLR, I've found in the past that the optimizer plays it 'safe': because TVFs are opaque to the optimizer, Halloween protection is implemented to protect against the possibility that the same data is being read as is being updated.

Does this happen here? The test code is:

CREATE TABLE Strings
(
    String varchar(max)
);
GO
CREATE TABLE Dest
(
    String varchar(max)
);
GO
INSERT INTO Strings VALUES ('one,two,three');
GO
INSERT INTO Dest (String)
SELECT value
FROM Strings
CROSS APPLY STRING_SPLIT(Strings.String, ',');

Notice that I'm inserting into a different table.

And….

[Screenshot: execution plan with a Table Spool between the TVF and the insert]

Awesome, a table spool.  So the same does apply here: we have Halloween protection, and on large result sets that spooling is going to hurt.

If you need to insert the result somewhere, then SELECT INTO won't need this, as the engine knows you are creating a new table and so there is no chance of a double update.  But if you SELECT INTO a table and then INSERT those rows into the table you intended in the first place, that is effectively the same thing as a spool, so it is more than likely not worth the effort.
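
As a sketch of that first option (the #Split name is just illustrative):

-- SELECT ... INTO creates a brand new table, so there is no chance of
-- reading and updating the same data, and hence no spool.
SELECT value
INTO #Split
FROM Strings
CROSS APPLY STRING_SPLIT(Strings.String, ',');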

Finally, to finish this blog off, here is a query with a parallel plan:

SELECT sod.SalesOrderDetailID, sod.ModifiedDate, p.Style
FROM Sales.SalesOrderDetail sod
INNER JOIN Production.Product p ON sod.ProductID = p.ProductID
--CROSS APPLY STRING_SPLIT(p.Style, ',')
ORDER BY p.Style;

Here is a query with a serial plan:

SELECT sod.SalesOrderDetailID, sod.ModifiedDate, p.Style
FROM Sales.SalesOrderDetail sod
INNER JOIN Production.Product p ON sod.ProductID = p.ProductID
CROSS APPLY STRING_SPLIT(p.Style, ',')
ORDER BY p.Style;

Right <sigh>, so it inhibits a parallel plan being built.  Sheesh.  Also, FYI, there is no NonParallelPlanReason specified in the plan.
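
If that serial plan hurts, one possible workaround (a sketch, untested here) is to materialise the split into a temp table in its own statement, leaving the expensive join and sort free to go parallel:

-- The serial-only split runs on its own...
SELECT p.ProductID, ss.value AS Style
INTO #Styles
FROM Production.Product p
CROSS APPLY STRING_SPLIT(p.Style, ',') ss;

-- ...so the big join and sort remain eligible for parallelism.
SELECT sod.SalesOrderDetailID, sod.ModifiedDate, s.Style
FROM Sales.SalesOrderDetail sod
INNER JOIN #Styles s ON sod.ProductID = s.ProductID
ORDER BY s.Style;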

That'll do for now, but I will be running some performance tests soon to see if it is worth the trouble.

UPDATE 2016-03-21: for performance tests, Aaron Bertrand has done a comprehensive write-up.

Feb 28

My favourite wait state – ASYNC_NETWORK_IO

Sometimes it does feel that a problem is a database problem until proven otherwise.  When this occurs, I have had to put a lot of time and effort into demonstrating the hows and whys.  One of the easiest (and most common) scenarios of this is related to ASYNC_NETWORK_IO waits.

Simply put, ASYNC_NETWORK_IO waits occur when SQL Server is waiting on the client to consume the output that it has 'thrown' down the wire.  SQL Server cannot run any faster; it has done the work required and is now waiting on the client to say that it is done with the data.

Naturally there can be many reasons why the client is not consuming the results fast enough; a slow client application, network bandwidth saturation, and too-wide or too-long result sets are the most common.  In this blog I would like to show you how I go about diagnosing and demonstrating these issues.
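
The raw cumulative figure is always available in the usual wait stats DMV:

-- Cumulative ASYNC_NETWORK_IO waits since the counters were last cleared
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'ASYNC_NETWORK_IO';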

Firstly, let's make a query execute 10 times faster with no effort, seriously no effort, just a single tick box.  That tick box is "Discard results after execution" in SSMS.

[Screenshot: the "Discard results after execution" option in SSMS query options]

SSMS is surprisingly slow at painting the text data on the screen; in fact, it seems woefully slow.  With this tick box, SQL Server does exactly the same work as before: it throws the data down the wire as normal and waits for the client to respond, as normal.  SSMS will still read all the bytes off the wire; it just will not show you the data.  This will enable you to run queries at the speed SQL Server can run at, without being crippled by the client application speed.  Yes, SSMS is not SQL Server but a client application.

So let's test this theory with the simplest query possible, "SELECT * FROM Sales.SalesOrderHeader".  Here is the Profiler output running that query 5 times:

[Screenshot: Profiler output for five executions without discard]

An average of roughly 2.4 seconds.

Now let's turn on "Discard results after execution", which is the only thing I have changed, and re-run the query 5 times and…

[Screenshot: Profiler output for five executions with discard enabled]

Running at ~240ms, a 10 times speed improvement.  Reads and CPU are in the same ballpark; the only difference is that SQL Server hasn't been throttled by SSMS having to present the result set to the user.  If you are trying to performance tune a query, are you taking this into account?  Maybe the problem is not where you think it is.

So why is the ASYNC_NETWORK_IO wait my favourite?  Well, at the global level we can now find out how much of an effect this is having.

If I clear the waits and use Paul Randal's "Tell me where it hurts" query after running without discard, SQL Server has 'waited' 10.44 seconds on ASYNC_NETWORK_IO.
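
For completeness, 'clearing the waits' means resetting the cumulative counters between runs:

-- Reset the wait stats DMV counters
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);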

[Screenshot: wait stats output showing 10.44 seconds of ASYNC_NETWORK_IO]

With discard on:

Nothing, no rows returned.  There were no waits.

Clearly this demonstrates that client application performance is a serious consideration in our scenario here.

Incidentally, where you are executing a query from will make a hell of a difference.  I was once involved in a performance issue where the DBAs could not reconcile the duration shown in a Profiler trace against the application with the same query executing in SSMS.  The big differentiator was that SSMS was being run on the server itself (RDP'd into it <facepalm>); the query was fine.  If SSMS had been run over the saturated network connection that the applications were using, then the same 'performance problem' would have been witnessed.  A 'fair test' is something we should always be looking to achieve.

SSMS has another often underused feature (well, it has loads 🙂): Client Statistics.

This is enabled by this icon in SSMS.

[Screenshot: the Client Statistics icon on the SSMS toolbar]

This will enable you to compare different executions across a number of different attributes:

[Screenshot: the Client Statistics comparison grid]

As you can see, my "Wait time on server replies" (the inverse, if you like, of ASYNC_NETWORK_IO) is 3 milliseconds.

What I particularly like here is the "Bytes Received from server" metric.  What it means is self-evident, but in my experience it is a question rarely asked in performance tuning, and here we have an exact answer.  With this simple number we can now ask, "Well, how long would it take to copy a file of the same size from server to client?"  A quick sanity check: 1GB is 8 gigabits, so even at a theoretical 1Gbps that is ~8 seconds on the wire, and over a shared 100Mbps link it is 80+ seconds.  When the result set size is in gigabytes, this simple question will result in a fairly rapid re-architecting of a system.

In conclusion, ASYNC_NETWORK_IO is my favourite wait state as it means (fairly mendaciously 🙂) it's not a SQL Server problem; it has done its work.  Now, let's talk about doing it right.

If you are interested in more real-world stories and hints and tips like this, then you'll hear them in my precon at SQLSaturday Exeter on 11th March 2016; it would be great to see you there.

Feb 14

SQL Saturday 496 Exeter – PreCon Agenda

Hi

Unfortunately, due to matters outside of my control, SQL Saturday 502 in Edinburgh has been cancelled.  I would like to thank the team, though, for selecting my pre-con.  Let me know if you would have liked to attend; if it is a viable option, I would be more than happy to still give the session at some other time.

Also, after a few requests, I thought it best to publish a fuller agenda for the precon at Exeter.

Agenda AM

Plan and optimizer basics.  We start the day with an overview of the optimizer, touching on many different aspects of it: the functionality involved, why it can sometimes do 'crazy' things, and the remediation of such things.  Overall it will look like this:

  • Physical and Logical Operators
  • Things the optimizer can do to help you
  • Things you can do to help the optimizer
  • Why schema definition is important
  • Implied knowledge vs explicit knowledge

Estimation and Statistics.  We will start building on the knowledge previously gained to understand how SQL Server 'costs' a plan.  The basis of this is statistics, and we will drill down into the estimation algorithms and the 'guesswork' and assumptions that come with them.

  • The metrics used to cost plans
  • What does estimation mean?
  • What are statistics and how are they used?
  • What effect do constants and variables have?
  • Stored procedure effect
  • Remediation for when the assumptions made are wrong.

All being well, lunch 🙂

Agenda PM

SQL2014/16 – The new cardinality estimator.  The cardinality estimator had a significant rebuild in SQL Server 2014; this section will look at those changes and their effects.  These are sometimes subtle and sometimes not.  Significantly, though, if you are running SQL Server 2014 you will need to understand why your resource usage has changed compared to previous versions.

  • Cardinality Estimator was rewritten in SQL2014
  • Many changes that have subtle and not-so subtle effects
  • Incrementing statistics
    • Statistics at the partition level
  • SQL2016 – Query Store
    • Alerting you to plan changes

Query Optimizer Internals.  Finally, we will get further into the inner workings of the optimizer than you ever thought possible.  If you wish to know why one candidate plan has been preferred over another, then this is where to look.  This is a deep dive session exposing inner workings and how to interpret some extremely verbose logging.

  • Deep dive using documented and undocumented trace flags
  • Gaining further insight into the optimizer operations
  • Examining Stages, The Memo, Trees and Rules

If this sounds like a fun way to spend the day, I would love your company 🙂

http://sqlsatexeter.azurewebsites.net/?page_id=5561 contains all the signup details.

Jan 17

SQL From South West to Scotland : SQLSaturdays #496 & #502

I'm happy, honoured and humbled to be able to say that I will be presenting pre-cons at both SQL Saturday 496 in Exeter and SQL Saturday 502 in Edinburgh.

SQL Saturday 496 Exeter is on Friday 11th March and Saturday 12th March

The event home page is here and signup for my class is:

http://sqlsatexeter.azurewebsites.net/?page_id=5561

SQL Saturday 502 Edinburgh is on Friday 10th and Saturday 11th June

The event home page is here and signup for my class is:

https://www.eventbrite.com/e/sql-server-2016-masterclass-query-processing-and-internals-tickets-20699726439

Both will follow the same format and general material, so there's no need to worry that you'll miss out on something if you can only attend one and not the other, though I would suspect that distance would be a better differentiator 🙂  We will be looking at the Query Optimizer and delving into its internals to find out how it builds a plan, what its limits and limitations are, and how to recognize that we have a 'Bad Plan'.

Hope that you can make one or the other.