Aug 22

Who did what to my database and when…

One of the most popular questions on forums, SO etc. is, “How can I find out who dropped a table, truncated a table, dropped a procedure…?”. I’m sure we have all been there: something changes (maybe schema or data) and we have no way of telling who did it and when.

SQL auditing can get you there, and I’m not suggesting you shouldn’t use that, but what if it is not set up for monitoring the objects you are interested in?

If you look back through my posts, I have been quite interested in ScriptDom and TSQL parsing,  so if you have a trace of all the actions that have taken place over a time period, you can parse the statements to find the activity.

Drawing inspiration (cough, stealing) from Ed Elliott’s dacpac explorer, I have created a git repo for a simple parser that is created using T4 templates. T4 templates are an ideal fit here as the full language domain can be exploded out automatically, and then you as a developer can add in your code to cherry-pick the parts you are interested in.

At this time the project has hooks to report on SchemaObjectName objects, so any time an object is referenced, be that in a CREATE, SELECT, DROP, MERGE etc., it will fall into the function OnSchemaObjectName; there is also OnDropTableStatement, which is hit when a DROP TABLE is used.

This is not intended to be a full end-to-end solution, not least as the SQL to be parsed could be coming from any number of sources, but if you have the skills to run it as is, you probably have enough to tailor it for your requirements.

As ever,  let me know what you think and any comments gratefully received.

The GitHub repo is: https://github.com/davebally/TSQLParse

 

Jun 17

Why is creating Excel sheets from SSIS so hard?

If there is one process that should be simpler than it is out of the box, it is creating Excel spreadsheets from SSIS. Over the years I’ve tried doing it a number of ways: the built-in component, Interop, OLE DB etc. all suck to one degree or another, being either unreliable, too slow or simply unusable.

A Twitter conversation a) proved I wasn’t alone and b) pointed me in the direction of EPPlus.

Over on SSC there is already a post on using EPPlus with SSIS, some of which, such as putting EPPlus into the GAC, is still relevant for this post.

However, right now, I have a big love of BIML; simply put, I think this is what SSIS should have been in the first place, and I personally find all the pointing and clicking a real time sink. Additionally, in BIML, once you have written a package to do something, i.e. a simple dataflow, it’s a snap to repeat that over 10, 20, 50, 100 or 1000s of tables. But the real time saver for me is when you need to re-architect, i.e. turn sequential dataflows into a parallel dataflow. It’s only really a case of changing where you iterate in your BIML code.

Anyway, I’ve combined these two pain points to create a BIML routine that uses EPPlus to output multi-sheeted Excel spreadsheets reliably and fast.

At the moment it’s very basic: take SQL statements and output the data to an Excel file. In time I hope to add some metadata to start ‘getting the crayons out’ and make the output look a bit prettier.

Code is on GitHub at https://github.com/davebally/BIML-SSIS-Excel-Output , hope this is of use to someone.

 

 

Mar 24

.Net 4.6.1 and MultiSubnetFailover

I’m not a big clustering/HA expert, but I obviously understand the need for such things to exist 🙂

However, the standard SQL tooling, SSMS etc., seems to go out of its way to make the MultiSubnetFailover option as inaccessible and convoluted as possible. Quite why this option even exists and is not the default escapes me; clients should be entirely agnostic of the server they are talking to.

After going through this pain again, this time deploying SSIS packages to an AG, I stumbled over this MSDN blog:

https://blogs.msdn.microsoft.com/alwaysonpro/2015/12/01/improved-multisubnet-listener-behavior-with-newly-released-sql-client-provider-in-net-4-6-1/

So .NET 4.6.1 does default to MultiSubNetFailover = True, which is great.

There is a rather cryptic note on the end though

Note that this isn’t what ended up being shipped.  Instead a new default property TransparentNetworkIPResolution was introduced to SqlConnection.

I’ve asked for some clarification on what that exactly means (I’m unable to find any other reference to that property), but in my testing the effect seems to be the same :S

So, if you have MultiSubnetFailover=True pain, give .NET 4.6.1 a whirl; it should resolve that.
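For older clients the option still has to be switched on explicitly in the connection string; a minimal sketch (the listener and database names here are placeholders, not from this post) looks something like:

Server=tcp:MyAgListener,1433;Database=MyDatabase;Integrated Security=SSPI;MultiSubnetFailover=True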

 

Mar 20

A journey to a movie recommender in U-SQL

I haven’t gotten too involved in the whole “Big Data” movement; I’ve kept an arm’s-length watch and played around with a few of those technologies but never really jumped in and got down and dirty. However, with the public preview of Azure Data Lake, I thought that it was probably time to get past the simple parrot-style examples and tutorials and try something from scratch.

What is compelling for me about ADL over other technologies is that I don’t have to learn a platform. I never have to set up, commission or provision any hardware. As a data guy, to even get to creating and setting up a data storage area in other tech there are any number of hoops to be jumped through, software to be installed and VMs to be built. With ADL all I have to do is give my card details to Microsoft (YMMV on whether that is a good or bad thing) and I can start playing.

In a more practical sense, ADL also provides you with massive scale (in my pay-as-you-go subscription I’m limited to 50 AUs, an AU being, I believe, effectively 2 cores; many more are available if required) but none of the hardship of managing the machines. Not only do you have access to that power, you also have control over where to use it. If you only have a single job that requires 2000 processors then you only have to pay for them while that job is running; for every other job you can pay for just the 20 that are needed.

The integration with Visual Studio is neat, and there is now no longer any need to submit jobs to Azure simply to try them out, as a local emulator is used instead. The PowerShell functionality also looks to be fairly wide-ranging and will help in automating tasks.

What follows is an overview of my experiments, which I have published to a GitHub repo. The “Examples” folder contains what I would term “simple learnings” and “Full Scripts” contains scripts that to a lesser or greater extent do something “useful”. I’m also not suggesting that anything here is “best practice” or that method A performs better than method B; I simply do not have the required size of data to make that call. My aim was to learn the language.

TLDR: Check out the script MovieLens09-CosineSimilarityFromCSVWithMax.usql for a U-SQL movie recommender.

The Movie Recommender is somewhat of the “Hello World” of a Big Data system and is also an interesting subject that everyone can relate to, so here are my attempts in exploring Azure Data Lake and Data Store with this final goal in mind.

The first matter is to find a decent data set, and for this I’ll be using the MovieLens 1m dataset. This presents an interesting challenge in simply consuming the data, as it is not in a CSV format that ADL can use directly. ADL does support the use of “custom extractors”, so that was the logical place to start, and I had a few interesting experiments defining an extractor using a few different possibilities, i.e. using a ‘hardcoded’ string output or using the metadata of the EXTRACT statement itself. This is done by simply interrogating the output.Schema array:

var s = output.Schema[x];
if (s.Type.Name.Equals("String", StringComparison.InvariantCultureIgnoreCase))
{
   output.Set(x, strings[x].ToString());
}
if (s.Type.Name.Equals("Int32", StringComparison.InvariantCultureIgnoreCase))
{
   try {
      output.Set(x, Int32.Parse(strings[x].ToString()));
   }
   catch
   {
   output.Set(x, -1);
   }
}

Finally, though, I considered an extractor a bit ‘cumbersome’ and opaque for my purposes here, but in a productionised system I can certainly see the use of one, especially if you are consuming more complex data types.

Instead I reverted to a string-split method, again useful for understanding how to call the integrated .NET functionality, as this required setting up a string array inline to pass in the two-character separator used by the dataset. This necessitates a three-part operation: read the file as a series of long strings (@UserLine), split on ‘::’ into an array (@UserSplit), and reform into a properly formed, typed and named dataset (@Users).

@UserLine =
EXTRACT UserLine string
FROM @USERSInp
USING Extractors.Tsv( rowDelimiter: "\n");
 
@UserSplit =
SELECT new SQL.ARRAY<string>(UserLine.Split(new string[]{"::"},StringSplitOptions.None)) AS UserArr
FROM @UserLine;
 
@Users =
SELECT int.Parse(UserArr[0]) AS UserID,
       UserArr[1] AS Gender,
       int.Parse(UserArr[2])  AS Age,
       int.Parse(UserArr[3]) AS Occupation,
       UserArr[4] AS Zip
FROM @UserSplit;

This is functionality that has been long overdue in T-SQL, and it is good that we now have STRING_SPLIT to do the splitting; however, joining a string back together is still not (well) supported.
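For comparison, the usual T-SQL workaround for grouped concatenation is the FOR XML PATH trick; a minimal sketch, assuming hypothetical dbo.Users and dbo.Ratings tables rather than anything from the repo, would be something like:

SELECT u.UserID,
       STUFF((SELECT ':' + CAST(r.MovieID AS varchar(10))
              FROM dbo.Ratings AS r
              WHERE r.UserID = u.UserID
              FOR XML PATH('')), 1, 1, '') AS MovieList   -- STUFF strips the leading ':'
FROM dbo.Users AS u;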

In U-SQL it’s a snap: first use ARRAY_AGG and then use .NET (string.Join) to form the string.

@UserWithMovies =
SELECT u.*,ARRAY_AGG(r.MovieID) AS MoviesArr
FROM @Users AS u
     JOIN
     @Ratings AS r
    ON u.UserID == r.UserID
GROUP BY u.UserID,
         u.Gender,
         u.Age,
         u.Occupation,
         u.Zip;
 
@outer =
SELECT UserID,MoviesArr.Count AS MoviesCount,string.Join(":",MoviesArr) AS MovieList
FROM @UserWithMovies;

For the maths behind the recommender itself I chose a cosine similarity to match those people who are most similar, as it is good enough for my purposes and simple enough to understand.  I also match people on Age and Occupation and filter out ‘exact’ similarities.
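For reference, the standard cosine similarity formula between two users’ rating vectors A and B (general background rather than anything lifted from the script) is:

\[ \mathrm{sim}(A,B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}} \]

The closer the value is to 1, the more similar the two users’ rating patterns.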

I won’t repeat the full code here verbatim, but I’ve added comments, and if you have an ADL account you should simply be able to download the MovieLens dataset, cut and paste the script and run it.

Please take a look over the other scripts in the Repo, they may not be as well commented but the intention should be clear.

In terms of cost, how much did it cost me to get this far? Peanuts: £1.50 was my bill. The next step is to take the 20m dataset and repeat this exercise; hopefully with that size of data I might see some measurable differences between using flat CSVs, tables etc. and experiment with the best way to increase vertices.

Mar 13

A first look at SQL Server 2016 – STRING_SPLIT

I’ve been developing T-SQL for about 20 years now, and almost since day one there has been functionality that, IMO, has been missing. That is grouped string concatenation (i.e. forming a delimited string) and string split (breaking up a delimited string). When I say missing, I really mean some form of succinct functionality that can do it, and not some form of hideous kludge that can cause a religious argument between various members of the community about why their method is better than anyone else’s 🙂

My hopes were not high for SQL Server 2016 fixing this either when this article https://blogs.msdn.microsoft.com/sqlserverstorageengine/2015/09/08/passing-arrays-to-t-sql-procedures-as-json/ was published. I’ve never been convinced by complex (i.e. XML) datatypes within the relational database; there are better ways if that is what you have to do, but, <sarcasm> yippee </sarcasm>, with JSON baked in here is another method of string splitting rather than doing it properly. Honestly, it really does make me want to cry.

But, with the release of SQL Server 2016 RC0, snuck into the release notes was this: [screenshot of the release notes entry]

Woohoo, a STRING_SPLIT function, finally, at last; let’s break out the bunting.

So, let’s have a quick investigation into how to use it and whether there are any caveats around it.

I won’t repeat the official documentation https://msdn.microsoft.com/en-gb/library/mt684588.aspx but to recap, using it is really simple: STRING_SPLIT(string, delimiter), where string is the delimited string to split and delimiter is the character to split it on. Notably, the delimiter can only be a single character; that’s not a deal breaker by any means, but it could be an annoyance.
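A minimal usage example (the literal values are mine, purely for illustration):

SELECT value
FROM STRING_SPLIT('one,two,three', ',');
-- returns three rows, 'one', 'two' and 'three', in a column named value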

Right, I’m now assuming that you are familiar with the official docs, so is there anything else to know about this function before we go forward and immediately replace all our previous kludges with it?

Firstly, anyone that knows me will know that I’m interested in estimations (and the wider optimization area), so the first thing is: how many rows are estimated to be returned from the TVF?

Here is a script to test this :

Create Table Strings
(
String varchar(max)
 
)
go
insert into Strings values('one,two,three')
go
Select *
from  Strings
Cross Apply string_split(strings.String,',')

The plan is as you would expect from a CLR TVF; in fact I suspect that this is just a CLR TVF implementation, but I’m not gonna state that as such 🙂

[screenshot: execution plan]

and the detail for the TVF ?

[screenshot: TVF operator details]

50; the estimate of the rows output is 50. After a number of tests this hasn’t changed. In line with similar TVF functionality, this isn’t surprising news, but I was curious as to what the number was.

Secondly, as I suspect that this is just CLR: I’ve found in the past that the optimizer plays it ‘safe’, and as TVFs are opaque to the optimizer, Halloween protection is implemented to protect against the possibility that the same data is being read as is being updated.

Does this happen here ? Test code is:

Create Table Strings
(
String varchar(max)
 
)
go
Create Table Dest
(
String varchar(max)
 
)
go
insert into Strings values('one,two,three')
go
Insert into Dest(String)
Select Value
from  Strings
Cross Apply string_split(strings.String,',')

Notice that I’m inserting into a different table.

And….

[screenshot: execution plan showing a table spool]

Awesome, a table spool; so the same does apply here, we have Halloween protection, and on large result sets that spooling is going to hurt.

If you need to insert the result somewhere, then SELECT INTO won’t need this, as it knows you are creating a new table so there is no chance of a double update. But if you SELECT INTO a table and then INSERT those rows into the table you intended in the first place, that is effectively the same thing as a spool, so more than likely not worth the effort.
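A sketch of that SELECT INTO variant, reusing the tables from the test script above (the #Staging name is just an example):

SELECT Value
INTO #Staging                          -- new table, so no spool is needed for this step
FROM Strings
CROSS APPLY STRING_SPLIT(Strings.String, ',');

INSERT INTO Dest (String)              -- the second hop reintroduces a similar cost, as noted above
SELECT Value FROM #Staging;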

Finally to finish this blog off, here is a query with a parallel plan :

SELECT sod.SalesOrderDetailID,sod.ModifiedDate,p.Style
FROM Sales.SalesOrderDetail sod
INNER JOIN Production.Product p ON sod.ProductID = p.ProductID
--cross apply string_split(style,',')
ORDER BY Style

Here is a query with a serial plan :

SELECT sod.SalesOrderDetailID,sod.ModifiedDate,p.Style
FROM Sales.SalesOrderDetail sod
INNER JOIN Production.Product p ON sod.ProductID = p.ProductID
cross apply string_split(style,',')
ORDER BY Style

Right <sigh>, so it inhibits a parallel plan being built, sheesh. Also, FYI, there is no NonParallelPlanReason specified in the plan.

That’ll do for now, but I will be running some performance tests soon to see if it is worth the trouble.

UPDATE 2016-03-21: For performance tests, Aaron Bertrand has done a comprehensive write-up.

Feb 28

My favourite wait state – ASYNC_NETWORK_IO

Sometimes it does feel that a problem is a database problem until proven otherwise. When this does occur, I have had to put a lot of time and effort into demonstrating the hows and whys. One of the easiest (and most common) scenarios of this is related to ASYNC_NETWORK_IO waits.

Simply put, ASYNC_NETWORK_IO waits occur when SQL Server is waiting on the client to consume the output that it has ‘thrown’ down the wire. SQL Server cannot run any faster; it has done the work required and is now waiting on the client to say that it is done with the data.

Naturally there can be many reasons why the client is not consuming the results fast enough; a slow client application, network bandwidth saturation and too-wide or too-long result sets are the most common, and in this blog I would like to show you how I go about diagnosing and demonstrating these issues.

Firstly, let’s make a query execute 10 times faster with no effort, seriously no effort: a single tick box. This tick box is “Discard results after execution” in SSMS.

[screenshot: SSMS query options showing “Discard results after execution”]

SSMS is surprisingly slow at painting the text data on the screen; in fact it seems woefully slow. With this tick box on, SQL Server does exactly the same work as without: it throws the data down the wire as normal and waits for the client to respond, again as normal. SSMS, however, will read all the bytes off the wire but will not show you the data. This enables you to run queries at the speed that SQL Server can run at, rather than being crippled by the client application speed. Yes, SSMS is not SQL Server but a client application.

So let’s test this theory with the simplest query possible, “SELECT * FROM Sales.SalesOrderHeader”. Here is the Profiler output running that query 5 times:

[screenshot: Profiler output, five executions]

An average of roughly 2.4 seconds.

Now, let’s turn on “Discard results after execution” (this is the only thing I have changed) and re-run the query 5 times, and…

[screenshot: Profiler output with discard enabled]

Running at ~240ms, a 10 times speed improvement; reads and CPU are in the same ballpark, and the only difference is that SQL Server hasn’t been throttled by SSMS having to present the result set to the user. If you are trying to performance tune a query, are you taking this into account? Maybe the problem is not where you think it is.

So why is the ASYNC_NETWORK_IO wait my favourite? Well, at the global level we can now find out how much of an effect this is having.
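If you don’t have a favourite waits script to hand, a minimal check of this specific wait straight from sys.dm_os_wait_stats looks something like this (my own sketch, not the query referenced below):

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms / 1000.0 AS wait_time_seconds
FROM sys.dm_os_wait_stats
WHERE wait_type = N'ASYNC_NETWORK_IO';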

If I clear the waits and use Paul Randal’s “Tell me where it hurts” query, then when running without discard SQL Server has ‘waited’ on ASYNC_NETWORK_IO for 10.44 seconds.

[screenshot: wait stats showing 10.44 seconds of ASYNC_NETWORK_IO]

With discard on:

Nothing, no rows returned.  There were no waits.

Clearly this demonstrates that client application performance is a serious consideration in our scenario here.

Incidentally, where you are executing a query from will make a hell of a difference. I was once involved in a performance issue where the DBAs could not reconcile the duration shown in a Profiler trace for the application with the same query executing in SSMS. The big differentiator was that SSMS was being run on the server itself (RDP’d into <facepalm>); the query was fine, and if SSMS had been run over the saturated network connection that the applications were running over then the same ‘performance problem’ would have been witnessed. A ‘fair test’ is something we should always be looking to achieve.

SSMS has another often underused feature (well, it has loads 🙂): Client Statistics.

This is enabled by this icon in SSMS.

[screenshot: Include Client Statistics button in SSMS]

What this enables you to do is compare different executions over a number of different attributes:

[screenshot: Client Statistics comparison]

As you can see, my “Wait time on server replies” (the inverse, if you like, of ASYNC_NETWORK_IO) is 3 milliseconds.

Why I particularly like this is the “Bytes received from server” metric. It is self-evident what that means, but in my experience it is a question rarely asked in performance tuning, and here we have an exact answer to it. With this simple number we can now ask, “Well, how long would it take to copy a same-sized file from server to client?” When the result set size is in GBs, this simple question will result in a fairly rapid re-architecting of a system.

In conclusion, ASYNC_NETWORK_IO is my favourite wait state as it means (fairly mendaciously 🙂) it’s not a SQL Server problem; SQL Server has done its work. Now, let’s talk about doing it right.

If you are interested in more real-world stories and hints and tips like this then you’ll hear them in my precon at SQLSaturday Exeter on 11th March 2016; it would be great to see you there.

Feb 14

SQL Saturday 496 Exeter – PreCon Agenda

Hi

Unfortunately, due to matters outside of my control, SQL Saturday 502 in Edinburgh has been cancelled; I would like to thank the team, though, for selecting my pre-con. Let me know if you would have liked to attend and, if it is a viable option, I would be more than happy to still give the session at some other time.

Also, after a few requests, I thought it best to publish a fuller agenda for the precon at Exeter.

Agenda AM

Plan and optimizer basics.  We start the day with an overview of the optimizer and touch on many different aspects of it: the functionality involved, why it can sometimes do ‘crazy’ things, and remediation of such things. Overall it will look like this:

  • Physical and Logical Operators
  • Things the optimizer can do to help you
  • Things you can do to help the optimizer
  • Why schema definition is important
  • Implied knowledge vs explicit knowledge

Estimation and Statistics.  We will build on the knowledge previously gained to understand how SQL Server ‘costs’ a plan. The basis of this is statistics, and we will drill down into the algorithms behind estimation and the ‘guesswork’ and assumptions that come with that.

  • The metrics used to cost plans
  • What does estimation mean ?
  • What are statistics and how are they used ?
  • What effect do constants and variables have?
  • Stored procedure effect
  • Remediation for when the assumptions made are wrong.

All being well, lunch 🙂

Agenda PM

SQL2014/16 – The new cardinality estimator.  The cardinality estimator had a significant rebuild in SQL Server 2014, and this section will look at those changes and their effects. These are sometimes subtle and sometimes not; significantly though, if you are running SQL Server 2014 you will need to understand why your resource usage has changed compared to previous versions.

  • Cardinality Estimator was rewritten in SQL2014
  • Many changes that have subtle and not-so subtle effects
  • Incrementing statistics
    • Statistics at the partition level
  • SQL2016 – Query Store
    • Alerting you to plan changes

Query Optimizer Internals.  Finally we will get further into the inner workings of the optimizer than you ever thought possible. If you wish to know why one candidate plan has been preferred over another then this is where to look. This is a deep-dive session exposing inner workings and how to interpret some extremely verbose logging.

  • Deep dive using documented and undocumented trace flags
  • Gaining further insight into the optimizer operations
  • Examining Stages, The Memo, Trees and Rules

If this sounds like a fun way to spend the day, I would love your company 🙂

http://sqlsatexeter.azurewebsites.net/?page_id=5561 contains all the signup details.

Mar 17

Increasing ColumnStore compression ratio for archival and “Cold storage”

At my present client, part of the task I have to undertake is to re-engineer the “Cold Storage”. This is ~20 terabytes of compressed CSVs going back over 15 years or so. Presently these are held on aging and creaking Centera servers that are way past end of life. The solution probably made sense 10 years ago (or whenever it was designed) but now it doesn’t, and the client’s first thought was to throw it all into Oracle RAC with page compression. That seems like a vaguely sensible (if expensive) way to handle the data. But maybe there is an even better way, and right now I would say that columnstore compression is that better way.

As a really simple POC I took some pseudo-real sample data and threw it into a SQL Server table. The data size was approximately 300MB. After adding page compression, that size came down to ~42MB. That is roughly in line with the expected estimates (and, I would hazard a guess, with what could be achieved by Oracle page compression). So how does columnstore compare? Adding a clustered columnstore index took the data size down to 2,128KB. Yes, from 300MB to 2MB; that is amazing. But wait, it gets better: using the COLUMNSTORE_ARCHIVE setting, which applies a variant of LZH compression to (presumably) each rowgroup, the data size falls even further to 464KB. From 300MB to less than 1MB; now that is compression for you.
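For reference, archival compression is applied with a rebuild on a table that already has a clustered columnstore index; a minimal sketch (the table name is a placeholder, not from the POC) would be:

ALTER TABLE dbo.ColdStorage REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);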

The other side of the coin is query speed (this is arguably less important at the client), something I have yet to measure and compare but will do when I get my dirty hands on the real data.

However, the achievable compression can be increased even further with a little bit of thought and jigging of the data. Simplistically, columnstore is all about repeating data over a column; the metadata says “In column X I have the value Y and that happens for the next Z rows”. My data has some large, almost sequential columns, transaction reference, batch transaction reference etc., stored as large-ish chars of 16, 20 and 32 bytes in length. These are fairly unique throughout the data, and as such columnstore compression has to specify the full value on each and every row. What would be the impact if we only held the part of each reference that was changing? I.e. the whole 32 characters are held, but only the last one would be changing, as these are effectively “strings that look like integers”.

A quick bit of guesstimation, for simplicity’s sake splitting the column into two: one of 31 chars and one of 1.

10 rows × 32 bytes = 320 bytes

Against

1 row × 31 bytes + 10 × 1 byte = 41 bytes

Interesting, that works out at around 13% of the original size. Let’s push this idea further and create a million-row table:

drop table CCSTest1
go
Create Table CCSTest1
(
Col1 char(23) not null
)
go
with cteRown
as
(
Select top(1000) ROW_NUMBER() over( order by (select null)) as r
from sys.columns a cross apply sys.columns b
)
,
ctePad
as
(
Select REPLICATE('0',4-len(cast(r as varchar(4))))+cast(r as varchar(4)) as PadR
from cteRown
)
insert into CCSTest1
select r1.PadR+'000000000000000'+r2.PadR
from ctePad r1
cross join ctePad r2
 
go
Create clustered index CCSTest1idx on CCSTest1(Col1)

Using sp_spaceused we can see the data size is 31,632KB.
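For reference, that figure is the data column reported by something along these lines:

EXEC sp_spaceused 'CCSTest1';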

Adding page compression

ALTER TABLE CCSTest1 REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

This brings the data size down to 13,120KB, not too shabby.

Let us now switch to a clustered columnstore:

CREATE CLUSTERED COLUMNSTORE INDEX CCSTest1idx ON CCSTest1 with (drop_existing = on)

This brings the data size down even further to 2,864KB, quite neat in itself even before we change the underlying data structure. Note that adding the COLUMNSTORE_ARCHIVE option has no effect on the data size here; this is due to the dictionary that is used by default to add an extra level of compression to varchar data.

Let’s now split Col1 into smaller (and, more importantly for columnstore, repeatable and repeating) segments.

drop table CCSTest2
go
Create Table CCSTest2
(
Col1 char(19) not null,
Col2 char(4) not null
)
go
with cteRown
as
(
Select top(1000) ROW_NUMBER() over( order by (select null)) as r
from sys.columns a cross apply sys.columns b
)
,
ctePad
as
(
Select REPLICATE('0',4-len(cast(r as varchar(4))))+cast(r as varchar(4)) as PadR
from cteRown
)
insert into CCSTest2(Col1,Col2)
select r1.PadR+'000000000000000',r2.PadR
from ctePad r1
cross join ctePad r2
go
Create clustered index CCSTest2idx on CCSTest2(Col1,Col2)

The data size here is 31,632KB, no real difference from before.

Now add the columnstore index

CREATE CLUSTERED COLUMNSTORE INDEX CCSTest2idx ON CCSTest2 with (drop_existing = on)

The data size is now 1,392KB, less than half of the earlier columnstore size, and this is certainly a practice that I shall be employing to further increase the potential compression of data. Yes, query speed will be ever so slightly degraded, but this is perfectly acceptable in my scenario.

If you are using columnstore compression, there are still things you have to do and hoops to jump through to achieve the ‘optimal’ compression ratio. Consider splitting columns into multiple parts to increase the ‘repeatability’ of data over sequential rows, and ensure that a clustered index is built upon the ‘raw’ data prior to adding the columnstore, to define what that ‘sequential’ order is. See also “Data Loading performance considerations with Clustered Columnstore indexes“, which, though more concerned with throughput, does contain information on compression ratios.

Mar 08

SQLBits, Incremental Statistics and Demo Fail

So another great SQLBits is done and dusted, many thanks to the team, helpers and sponsors for putting on such a great event. It’s always a pleasure and honour to be able to present here and to get to spend time with a great crowd of intelligent SQL Geeks.

I have uploaded the slides and scripts from both of my sessions: Friday’s session “Statistics and Estimations” is available here http://1drv.ms/1C0BMva and Saturday’s “2014 Cardinality estimation” session is available here http://1drv.ms/1FwA7er .

During the 2014 session I suffered a massive demo fail and, as ever, realised my mistake as soon as I had left the room. I was in the middle of what I think is an interesting scenario with incremental statistics, so I thought it was worth adding the details of the point I was *trying* to make to my blog.

Incremental statistics are a great new addition; simply put, each partition in a table now has its own set of statistics. These are aggregated together to form the table statistics for estimation purposes. I have seen posts that sort of suggest that more than 200 histogram steps are now used for estimation; I haven’t found any evidence for this, and I’ll pick this up later in the post.

For most partitioned tables incremental statistics are great, as in reality most tables that are partitioned will only have updates to the ‘last’ partition. Most partitioning strategies are on something that is continually incrementing (like a datetime or an ID), so only the last partition is touched. Using incremental statistics means that SQL Server now does not have to look at the previous partitions (that have not been changed) to form statistics, a nice little cost saving. It’s almost a no-brainer; I can’t really see a downside. Ben Nevarez covers the basics in more detail here.

However, for those tables that have updates to multiple partitions, there are some nuances that you will need to be aware of, and this is the scenario I had with the demo fail.

As ever, the best way of demonstrating the subtleties is by example, so let’s create a table, define the partitioning strategy and add some data.

drop database PartitionTest
go
Create Database PartitionTest
go
USE PartitionTest;
GO
-- Adds four new filegroups to the PartitionTest database
ALTER DATABASE PartitionTest
ADD FILEGROUP test1fg;
GO
ALTER DATABASE PartitionTest
ADD FILEGROUP test2fg;
GO
ALTER DATABASE PartitionTest
ADD FILEGROUP test3fg;
GO
ALTER DATABASE PartitionTest
ADD FILEGROUP test4fg;
go
 
-- Adds one file for each filegroup.
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test1dat1,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t1dat1.ndf',SIZE = 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test1fg;
 
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test2dat2,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t2dat2.ndf',SIZE = 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test2fg;
GO
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test3dat3,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t3dat3.ndf',SIZE= 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test3fg;
GO
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test4dat4,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t4dat4.ndf',SIZE = 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test4fg;
GO
-- Creates a partition function called myRangePF1 that will partition a table into four partitions
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (100000, 200000, 300000) ;
GO
-- Creates a partition scheme called myRangePS1 that applies myRangePF1 to the four filegroups created above
CREATE PARTITION SCHEME myRangePS1
AS PARTITION myRangePF1
TO (test1fg, test2fg, test3fg, test4fg) ;
GO
-- Creates a partitioned table called PartitionTable that uses myRangePS1 to partition col1
drop TABLE PartitionTable
go
CREATE TABLE PartitionTable (col1 int identity primary KEY, col2 int)
ON myRangePS1 (col1) ;
GO
-- Can be a bit slow so....
ALTER DATABASE PartitionTest SET DELAYED_DURABILITY = FORCED;
go
/* Populate with 10* SalesOrderHeader*/
insert into PartitionTable(col2)
select CurrencyRateID
from AdventureWorks2012.sales.SalesOrderHeader
go 10

All being well, you should now have multiple partitions populated with some data; confirm that with:

select partition_number,rows from sys.partitions where OBJECT_ID=OBJECT_ID('PartitionTable')

Now to add incremental statistics to the table we need to simply:

create statistics StatPartitionTable
on PartitionTable(Col2)
with incremental = on;
go

What we are now interested in is when a table has its stats updated and when a partition has its stats updated. Naturally, partition stats updates happen as part of the table stats update and cannot happen as an independent process.

To start, let’s update all the stats, but first run Profiler with the SP:StmtCompleted and SQL:StmtCompleted events (obviously filtered to the SPID that you are working in 🙂); with this on we can monitor the STATMAN commands that are used internally by SQL Server and the partitions that are being sampled.

With that running, update the stats with:

update statistics PartitionTable(statpartitiontable)

In the profiler session you will see 4 STATMAN calls:

Looking at the TextData for each one you will see that there is one per partition indicated by the usual partition function filter of

WHERE $PARTITION.myRangePF1([col1]) = X

X being the partition number.

Let’s update some rows in a single partition:

update PartitionTable
set col2=1
where col1<=20500

This is 20% of the rows + 500, enough to cause SQL Server to need to update the table statistics; in case you are wondering, TF 2371 is still relevant and will use the same dynamic drop-off as in versions prior to 2014.

You can also monitor the number of modifications using

select * From sys.dm_db_stats_properties(object_id('partitiontable'),2)

If we now run a statement that requires statistics on the table such as

select count(*) from PartitionTable where col2=877
option(recompile)

In our profiler session we will see that indeed only one of the partitions has been updated and the modification count in dm_db_stats_properties has been reset to zero.

Quite neat; why bother to sample the others when there have been no changes?

Let’s complicate things a bit more and update two partitions:

update PartitionTable
set col2=407
where col1<=25000
go
update PartitionTable
set col2=407
where col1>100000 and col1<=125000
go

Modification counter now equals 50,000 and after executing

select count(*) from PartitionTable where col2=877 option(recompile)

the counter gets reset to zero and both partitions’ STATMAN calls will be seen in Profiler.

Now, here is the scenario that I wanted to show and had a demo fail in my session:

update PartitionTable
set col2=2225
where col1<=15000
go
update PartitionTable
set col2=2225
where col1>100000 and col1<=113000
go

Looking at the modification counter there are 28,000 updates, so the question is: will the stats now be updated?

Run the count(*) and find out 🙂

The modification counter will equal zero, but there have been no STATMAN calls. Why is this? Well, although the table has had 20% + 500 changes, the partitions are also required to have a certain level of updates too. I apologise for the use of ‘certain level’ here, but I haven’t been able to find what the exact formula is; from my observations, roughly 20% of a partition’s rows need to be updated to force that partition’s stats to update.

So what has happened to the update counter? Well, it is still tracked at the partition level; this can be seen in the internal version of the stats properties DMV.

select * From sys.dm_db_stats_properties_internal(object_id('partitiontable'),2)

Partitions one and two have 15k and 13k updates respectively and will continue to increment as updates happen. This does muddy the waters of the functionality a bit and TBH in my mind makes it very hard to come up with a solid usage recommendation.

Overall I think that if you have a partitioned table then you will want to enable incremental statistics, but just be aware that the individual partitions may not be updating in the manner that you are expecting.

Let us return to the >200-step histogram issue too. To my mind, if it were possible for SQL Server to use the histograms at the partition level, then it would show when filtering the table using the partition function, i.e.

$PARTITION.myRangePF1([col1])

However if we completely mangle our test data with

update PartitionTable set col2 = $PARTITION.myRangePF1([col1])

and then

Select count(*) from PartitionTable where col2=1 and $PARTITION.myRangePF1([col1])=1 option(recompile)
go
Select count(*) from PartitionTable where col2=1 and $PARTITION.myRangePF1([col1])=2 option(recompile)

We do get a different estimation, but let’s dive into this a bit further with the cardinality estimator extended event.

For partition 1 it looks like this:

For partition 2 it looks like this

So the selectivity of col2 is the same across both partitions (0.318), but the selectivity of the partition function is different, hence the different estimation. Maybe I am over-simplifying things here, but as I say, if there could be a difference then I would have expected it to show here.

Nov 02

Non use of persisted data – Part Three

A long time ago I looked at some reasons as to why persisted computed columns ( and therefore indexes on those columns) are not used (Part1, Part2).
Recently I was drawn into an issue where the same thing was happening, a persisted computed column had been created with an index, but neither were being used when expected.
Using AdventureWorks, let’s re-create this scenario:

create table ccTest
(
id integer primary key,
surname varchar(50) not null,
cSum as checksum(isnull(surname,' ')) persisted not null)
go
create index idxcsum on ccTest(cSum)
go
 
 
insert into ccTest(Id,surname)
Select BusinessEntityID,LastName from Person.Person
go

Now, if we want to find all those surnames with a specific checksum (yes, I know, checksum is bad), it would be reasonable to:

select * from ccTest where cSum=0

and assume that the index idxcsum is used to retrieve the required data. Looking at the plan, it is obvious that this does not happen.

[screenshot: plan showing a clustered index scan]

The clustered index is scanned and the nonclustered index is ignored. There are cases where this is a good thing to happen, for instance to avoid a key/bookmark lookup when that would be an expensive operation. Here, though, there is an estimate of ~16 rows, not enough for that to be the case.
Maybe we need to force the use of the index?

select * from ccTest with(index=idxcsum) where cSum=0

This uses the index, but not in the manner we need:

[screenshot: forced index plan with index scan, compute scalar, filter and key lookup]

The index is scanned, all the rows are read from the index, and the key lookup previously mentioned has entered the equation.
So what is really happening here?
Well, in the second example there are some very graphical clues: the Compute Scalar and the Filter.
What is the Compute Scalar computing? Let’s look at the ‘defined values’ in its properties:
[[AdventureWorks2012].[dbo].[ccTest].cSum] = Scalar Operator(checksum([AdventureWorks2012].[dbo].[ccTest].[surname]))
So the persisted data is being recomputed. Looking at the Filter, the predicate is:
[AdventureWorks2012].[dbo].[ccTest].[cSum]=(0)
That computed value is then being filtered upon. But this is the persisted data; why is the index not used? Well, that’s because the computation isn’t logically the same. But how can this be?
There are two things that need to be appreciated here:

  1. SQL Server puts a lot of energy into simplifying queries and operations. Redundant branches are pruned and contradictions removed.
  2. Computed columns are expanded back to the base definition before being optimized.

In our computed column definition we have a redundant code branch. ISNULL is used to guard against the Surname column being NULL; however, the value in Surname can never be NULL as there is a NOT NULL constraint on it, therefore when the column definition is expanded it is simplified and this redundancy removed. You can see this in the ‘defined values’ above: there is no ISNULL in the computation. So the optimizer cannot now match this computation back to the persisted computed column, as the definition is not the same. It would be nice to see a NOEXPAND-style hint to prevent this happening, as exists for indexed views.
Now we know that the resolution is relatively straightforward:

create table ccTest2
(
id integer primary key,
surname varchar(50) not null,
cSum as checksum(surname) persisted not null)
go
create index idxcsum on ccTest2(cSum)
go
 
 
insert into ccTest2(Id,surname)
Select BusinessEntityID,LastName from Person.Person
go

And then if we:

select * from ccTest2 where ccTest2.cSum=0

We will see the index used as expected in the first place

[screenshot: plan using an index seek on idxcsum]

So, the takeaway is that if you are using persisted computed columns, ensure that the definitions have no redundancy, and then the indexes should be used.
NB: Curiously, there is something else, or more accurately another level of indirection, here. If you use the query:

select * from ccTest where CHECKSUM(surname) = 0

You may have expected it to match to the persisted computed column, as it would against ccTest2, but it doesn’t.