Mar 23

Getting out of the ground with TSQL Smells.

 

At SQLBits I had a number of conversations with people about TSQL Smells, my open source project for static code analysis of SQL code. The general opinion was that although the concept was sound, the process of getting it running (and developing it further) was not documented. I'm sure that if you have sufficient .Net skills you could muddle through on your own, but since it will mainly be TSQL devs who have an interest here, I thought I would take the time to draw up some ‘Arse and Elbow’ documentation to get you started.

So let's go:

  1. Grab Visual Studio 2013. Community edition is fine operationally (I'm not a lawyer, so check the licensing terms).
  2. After that has installed, update SSDT: Tools -> Extensions and Updates -> Updates -> Product Updates.
  3. Install both the 32Bit and 64Bit versions of the DACFX framework; at the time of writing the latest is downloadable from http://www.microsoft.com/en-us/download/confirmation.aspx?id=45886
  4. Ensure that the SSDT extensions directory ( C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\Extensions\Microsoft\SQLDB\DAC\120\Extensions ) is writable (I use Everyone / Full Control).

That should be the tooling side of the equation sorted.

Now clone (I won't go into what this means, a number of tutorials are available on GitHub) the TSQL Smells repo on GitHub: http://github.com/davebally/TSQL-Smells

In terms of software this should now all be done.

Run up Visual Studio, load the TSQLSmells solution and "Rebuild Solution".

If that all goes according to plan, you will now be able to run the unit tests.

And if you are even luckier, they will all pass.

This means that the rules are installed correctly and we can proceed onwards.

The unit test cases reference SQL files that are held in the TSQLSmellsTest project; you can test further by loading this project and selecting the "Run Code Analysis" option.

You are now in a position to load your own projects and run the code analysis against them.

Finally, included in the solution is RunSCAAnalysis; this generates a small exe that can run the static code analysis rules against a database or dacpac.

Command line options for that are -S server, -o outputfile, and one of -d database or -f filename. It would be my intention that this forms part of your automated build/deploy framework (if you are not using SSDT) and breaks the build on a "Smell".
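By way of illustration only (the option letters are from above, but the exact executable name and the values here are placeholders, so check the project output for the precise form), an invocation might look something like:

RunSCAAnalysis.exe -S MyBuildServer -o SmellReport.xml -d MyDatabase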

I hope that this blog entry goes some way towards starting you on your way to better TSQL code quality, and for my part, I will be giving it some love in the near future.


Update 16-apr-2016

I've had some comments that the project is not working, so I've now updated the GitHub project to work with VS2015 / SQL 2014. To do this I built a fresh Azure VM with VS2015 installed, and I did run into some problems.

If you are having build problems:

These will mostly be due to missing references, and because of the multiple packages involved it is hard to give absolute instructions.

Missing DAC references: You need to find and install the latest (or appropriate) DACFX packages (32Bit is essential, 64Bit is optional for a runtime environment). The SSDT team blog is the best source of reference: https://blogs.msdn.microsoft.com/ssdt/. After install you may need to delete the reference in the project, then add a reference, browse to the DACFX folder ( C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\Extensions\Microsoft\SQLDB\DAC\130\ ) and add the DLLs in.

If you are not using VS2015 & SQL 2014 then you will probably have to look in different version number folders.

Missing TransactSql.ScriptDom reference: This is installed as part of SQL Server, not Visual Studio, so if you do not have SQL Server installed you will need to find the "Transact-SQL Language Service" in the SQL Server Feature Pack: https://www.microsoft.com/en-gb/download/details.aspx?id=42295. The package is tsqllanguageservice.msi.

Again, delete the references and re-add them by browsing to the folder C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies. As before, the 130 version number will differ for older (or newer) versions.

It builds but the tests fail:

This will probably be due to the output being placed in the incorrect folder. SSDT looks in the folder C:\Program Files\Microsoft SQL Server\X\Dac\Bin\Extensions\, where X is the SQL version being targeted. Make sure that the folder exists for your specific setup.

I wish that there were a way to get past all the versioning and streamline this, but as yet there isn't.

 

 

Mar 17

Increasing ColumnStore compression ratio for archival and “Cold storage”

At my present client, part of the task I have to undertake is to re-engineer the "Cold Storage". This is ~20 terabytes of compressed CSVs going back over 15 years or so. Presently these are held on aging and creaking Centera servers that are way past end of life. The solution probably made sense 10 years ago (or whenever it was designed), but now it doesn't, and the client's first thought was to throw it all into Oracle RAC with page compression. That seems like a vaguely sensible (if expensive) way to handle the data. But maybe there is an even better way, and right now I would say that ColumnStore compression is that better way.

As a really simple POC I took some pseudo-real sample data and threw it into a SQL Server table. Data size was approximately 300MB. After adding page compression, that size came down to ~42MB. That is roughly in line with the expected estimates (and, I would hazard a guess, with what could be achieved by Oracle page compression). So how does columnstore compare? Adding a clustered columnstore index took the data size down to 2,128KB. Yes, from 300MB to 2MB; that is amazing. But wait, it gets better: using the COLUMNSTORE_ARCHIVE setting, which applies a variant of LZH compression to (presumably) each rowgroup, the data size falls even further to 464KB. From 300MB to less than 1MB; now that is compression for you.
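For reference, a minimal sketch of the two steps involved (table and index names here are placeholders; the archival option is the COLUMNSTORE_ARCHIVE value of DATA_COMPRESSION):

CREATE CLUSTERED COLUMNSTORE INDEX ccsi_ColdStore ON dbo.ColdStore;
go
-- Squeeze the rowgroups further for rarely-read archival data
ALTER INDEX ccsi_ColdStore ON dbo.ColdStore
REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);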

The other side of the coin is query speed (arguably less important at this client), something I am yet to measure and compare, but I will do so when I get my dirty hands on the real data.

However, the achievable compression can be increased even further with a little bit of thought and jigging of the data. Simplistically, columnstore is all about repeating data over a column; the metadata says "in column X I have the value Y and that happens for the next Z rows". My data has some large, almost sequential columns (transaction reference, batch transaction reference etc.) stored as large-ish chars of 16, 20 and 32 bytes in length. These are fairly unique throughout the data and as such columnstore compression has to specify the full value on each and every row. What would be the impact if we only held the part of each reference that was actually changing? That is, the whole 32 characters are still held, but only the last character is changing, as these are effectively "strings that look like integers".

A quick bit of guesstimation, for simplicity's sake splitting the column into two parts, one of 31 chars and one of 1 char, over 10 rows that differ only in that final character:

10 rows * 32 bytes = 320 bytes

against

(1 row * 31 bytes) + (10 rows * 1 byte) = 41 bytes

Interesting: that is roughly 13% of the original size. Let's push this idea further and create a million row table:

drop table CCSTest1
go
Create Table CCSTest1
(
Col1 char(23) not null
)
go
with cteRown
as
(
Select top(1000) ROW_NUMBER() over( order by (select null)) as r
from sys.columns a cross apply sys.columns b
)
,
ctePad
as
(
Select REPLICATE('0',4-len(cast(r as varchar(4))))+cast(r as varchar(4)) as PadR
from cteRown
)
insert into CCSTest1
select r1.PadR+'000000000000000'+r2.PadR
from ctePad r1
cross join ctePad r2
 
go
Create clustered index CCSTest1idx on CCSTest1(Col1)

Using sp_spaceused we can see the data size is 31,632KB.
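(To check for yourself, something as simple as the line below will do; the figures quoted here are presumably the 'data' column of its output.)

exec sp_spaceused 'CCSTest1';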

Adding page compression

ALTER TABLE CCSTest1 REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

Brings the data size down to 13,120KB, not too shabby.

Let us now switch to a clustered columnstore:

CREATE CLUSTERED COLUMNSTORE INDEX CCSTest1idx ON CCSTest1 with (drop_existing = on)

This brings the data storage down even further to 2,864KB, quite neat in itself even before we change the underlying data structure. Note that adding the COLUMNSTORE_ARCHIVE option has no effect on the data size here; this is due to the dictionary that is used by default to add an extra level of compression to varchar data.

Let's now split Col1 into smaller (and, more importantly for columnstore, repeatable and repeating) segments.

drop table CCSTest2
go
Create Table CCSTest2
(
Col1 char(19) not null,
Col2 char(4) not null
)
go
with cteRown
as
(
Select top(1000) ROW_NUMBER() over( order by (select null)) as r
from sys.columns a cross apply sys.columns b
)
,
ctePad
as
(
Select REPLICATE('0',4-len(cast(r as varchar(4))))+cast(r as varchar(4)) as PadR
from cteRown
)
insert into CCSTest2(Col1,Col2)
select r1.PadR+'000000000000000',r2.PadR
from ctePad r1
cross join ctePad r2
go
Create clustered index CCSTest2idx on CCSTest2(Col1,Col2)

Datasize here is 31,632KB, no real difference from before.

Now add the columnstore index

CREATE CLUSTERED COLUMNSTORE INDEX CCSTest2idx ON CCSTest2 with (drop_existing = on)

The data size is now 1,392KB, less than half of the 2,864KB we achieved without splitting the column, and certainly a practice that I shall be employing to further increase the potential compression of data. Yes, query speed will be ever so slightly degraded, but that is perfectly acceptable in my scenario.

If you are using columnstore compression, there are still things you have to do and hoops to jump through to achieve the 'optimal' compression ratio. Consider splitting columns into multiple parts to increase the 'repeatability' of data over sequential rows, and ensure that a clustered index is built upon the 'raw' data prior to adding the columnstore, to define what that 'sequential' order is. See also "Data Loading performance considerations with Clustered Columnstore indexes", which, though more concerned with throughput, does contain information on compression ratios.
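If you want to see where the space is actually going while experimenting, a quick sketch against the rowgroup metadata (sys.column_store_row_groups, available from SQL Server 2014) shows the size and row count of each compressed rowgroup:

select object_name(object_id) as table_name,
       partition_number,
       row_group_id,
       state_description,
       total_rows,
       size_in_bytes
from sys.column_store_row_groups
where object_id = object_id('CCSTest2')
order by partition_number, row_group_id;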

Mar 08

SQLBits, Incremental Statistics and Demo Fail

So another great SQLBits is done and dusted, many thanks to the team, helpers and sponsors for putting on such a great event. It’s always a pleasure and honour to be able to present here and to get to spend time with a great crowd of intelligent SQL Geeks.

I have uploaded the slides and scripts from both of my sessions, Fridays session “Statistics and Estimations” are available here http://1drv.ms/1C0BMva and Saturdays “2014 Cardinality estimation” session are available here http://1drv.ms/1FwA7er .

During the 2014 session I suffered a massive demo fail and, as ever, realised my mistake as soon as I had left the room. I was in the middle of what I think is an interesting scenario with incremental statistics, so I thought it was worth adding the details of the point I was *trying* to make to my blog.

Incremental statistics are a great new addition; simply put, each partition in a table now has its own set of statistics. These are aggregated together to form the table statistics for estimation purposes. I have seen posts that sort of suggest that more than 200 histogram steps are now used for estimation; I haven't found any evidence for this, and I'll pick this up later in the post.

For most partitioned tables incremental statistics are great, as in reality most tables that are partitioned will only have updates to the 'last' partition. Most partitioning strategies are on something that is continually incrementing (like a date-time or an ID), so only the last partition is touched. Using incremental statistics means that SQL Server no longer has to look at the previous partitions (that have not been changed) to form statistics, a nice little cost saving. It's almost a no-brainer; I can't really see a downside. Ben Nevarez covers the basics in more detail here.

However, for those tables that have updates to multiple partitions, there are some nuances that you will need to be aware of, and this is the scenario where I had the demo fail.

As ever, the best way of demonstrating the subtleties is by example, so let's create a table, define the partitioning strategy and add some data.

drop database PartitionTest
go
Create Database PartitionTest
go
USE PartitionTest;
GO
-- Adds four new filegroups to the PartitionTest database
ALTER DATABASE PartitionTest
ADD FILEGROUP test1fg;
GO
ALTER DATABASE PartitionTest
ADD FILEGROUP test2fg;
GO
ALTER DATABASE PartitionTest
ADD FILEGROUP test3fg;
GO
ALTER DATABASE PartitionTest
ADD FILEGROUP test4fg;
go
 
-- Adds one file for each filegroup.
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test1dat1,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t1dat1.ndf',SIZE = 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test1fg;
 
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test2dat2,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t2dat2.ndf',SIZE = 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test2fg;
GO
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test3dat3,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t3dat3.ndf',SIZE= 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test3fg;
GO
ALTER DATABASE PartitionTest
ADD FILE
(
NAME = test4dat4,FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\t4dat4.ndf',SIZE = 5MB,MAXSIZE = 100MB,FILEGROWTH = 5MB
)
TO FILEGROUP test4fg;
GO
-- Creates a partition function called myRangePF1 that will partition a table into four partitions
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (100000, 200000, 300000) ;
GO
-- Creates a partition scheme called myRangePS1 that applies myRangePF1 to the four filegroups created above
CREATE PARTITION SCHEME myRangePS1
AS PARTITION myRangePF1
TO (test1fg, test2fg, test3fg, test4fg) ;
GO
-- Creates a partitioned table called PartitionTable that uses myRangePS1 to partition col1
drop TABLE PartitionTable
go
CREATE TABLE PartitionTable (col1 int identity primary KEY, col2 int)
ON myRangePS1 (col1) ;
GO
-- Can be a bit slow so....
ALTER DATABASE PartitionTest SET DELAYED_DURABILITY = FORCED;
go
/* Populate with 10* SalesOrderHeader*/
insert into PartitionTable(col2)
select CurrencyRateID
from AdventureWorks2012.sales.SalesOrderHeader
go 10

All being well, you should now have multiple partitions populated with some data; confirm that with:

select partition_number,rows from sys.partitions where OBJECT_ID=OBJECT_ID('PartitionTable')

Now to add incremental statistics to the table we need to simply:

create statistics StatPartitionTable
on PartitionTable(Col2)
with incremental = on;
go
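As a quick sanity check (nothing more than a sketch), sys.stats will confirm that the statistic really is flagged as incremental:

select name, is_incremental, auto_created
from sys.stats
where object_id = object_id('PartitionTable');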

What we are now interested in is when a table has its stats updated and when a partition has its stats updated. Naturally, partition stats updates happen as part of a table stats update and can't happen as an independent process.

To start, let's update all the stats, but first run Profiler with the SP:StmtCompleted and SQL:StmtCompleted events (obviously filtered to the SPID that you are working in 🙂). With this on we can monitor the STATMAN commands that are used internally by SQL Server and the partitions that are being sampled.

With that running, update the stats with :

update statistics PartitionTable(statpartitiontable)

In the profiler session you will see 4 STATMAN calls:

Looking at the TextData for each one you will see that there is one per partition indicated by the usual partition function filter of

WHERE $PARTITION.myRangePF1([col1]) = X

X being the partition number.

Let's update some rows in a single partition:

update PartitionTable
set col2=1
where col1<=20500

This is 20% of the rows + 500, enough to cause SQL Server to need to update the table statistics. In case you are wondering, TF 2371 is still relevant and will use the same dynamic drop-off in the threshold as versions prior to 2014.
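(If you do want that dynamic threshold behaviour, the trace flag is enabled instance-wide in the usual way; a sketch only, so test before using it in anger.)

DBCC TRACEON (2371, -1);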

You can also monitor the number of modifications using

select * From sys.dm_db_stats_properties(object_id('partitiontable'),2)

If we now run a statement that requires statistics on the table such as

select count(*) from PartitionTable where col2=877
option(recompile)

In our profiler session we will see that indeed only one of the partitions has been updated and the modification count in dm_db_stats_properties has been reset to zero.

Quite neat; why bother to sample the others when there have been no changes?

Let's complicate things a bit more and update two partitions:

update PartitionTable
set col2=407
where col1<=25000
go
update PartitionTable
set col2=407
where col1>100000 and col1<=125000
go

Modification counter now equals 50,000 and after executing

select count(*) from PartitionTable where col2=877 option(recompile)

the counter gets reset to zero and both partition STATMAN calls will be seen in Profiler.

Now, here is the scenario that I wanted to show and had a demo fail in my session:

update PartitionTable
set col2=2225
where col1<=15000
go
update PartitionTable
set col2=2225
where col1>100000 and col1<=113000
go

Looking at the modification counter there are 28,000 updates, so the question is: will the stats now be updated?

Run the count(*) and find out 🙂

The modification counter will equal zero, but there have been no STATMAN calls. Why is this? Well, although the table has had its 20% + 500 changes, the partitions are also required to have a certain level of updates too. I apologise for the use of 'certain level' here, but I haven't been able to find what the exact formula is; from my observations, roughly 20% of a partition's rows need to be updated to force a stats update for that partition.

So what has happened to the update counter? Well, it's still there at the partition level; this can be seen in the internal version of the stats properties DMV.

select * From sys.dm_db_stats_properties_internal(object_id('partitiontable'),2)

Partitions one and two have 15k and 13k updates respectively, and these will continue to increment as updates happen. This does muddy the waters of the functionality a bit and, to be honest, makes it very hard in my mind to come up with a solid usage recommendation.

Overall I think that if you have a partitioned table then you will want to enable incremental statistics, but just be aware that the individual partition statistics may not be updating in the manner that you are expecting.

Let us return to the > 200 step histogram issue too. To my mind, if it were possible for SQL Server to use the histograms at the partition level, then it would show when filtering the table using the partition function, i.e.

$PARTITION.myRangePF1([col1])

However if we completely mangle our test data with

update PartitionTable set col2 = $PARTITION.myRangePF1([col1])

and then

Select count(*) from PartitionTable where col2=1 and $PARTITION.myRangePF1([col1])=1 option(recompile)
go
Select count(*) from PartitionTable where col2=1 and $PARTITION.myRangePF1([col1])=2 option(recompile)

We do get a different estimation, but let's dive into this a bit further with the cardinality estimator extended event.

For partition 1 it looks like this:

For partition 2 it looks like this

So the selectivity of col2 is the same across both partitions (0.318), but the selectivity of the partition function is different, hence the different estimation. Maybe I am oversimplifying things here, but as I say, if the partition-level histograms could make a difference then I would have expected it to show here.

Nov 02

Non use of persisted data – Part Three

A long time ago I looked at some reasons why persisted computed columns (and therefore indexes on those columns) are not used (Part1, Part2).
Recently I was drawn into an issue where the same thing was happening: a persisted computed column had been created with an index, but neither was being used when expected.
Using AdventureWorks, let's re-create this scenario:

create table ccTest
(
id integer primary key,
surname varchar(50) not null,
cSum as checksum(isnull(surname,' ')) persisted not null)
go
create index idxcsum on ccTest(cSum)
go
 
 
insert into ccTest(Id,surname)
Select BusinessEntityID,LastName from Person.Person
go

Now, if we want to find all those surnames with a specific checksum (yes, I know, checksum is bad), it would be reasonable to:

select * from ccTest where cSum=0

and assume that the index idxcsum is used to retrieve the required data. Looking at the plan, it is obvious that this does not happen.

ixscan1

The clustered index is scanned and the non-clustered index is ignored. There are cases where this is a good thing to happen, for instance to avoid a key/bookmark lookup when that would be an expensive operation. Here, though, there is an estimate of ~16 rows, not enough for that to be the case.
Maybe we need to force the use of the index?

select * from ccTest with(index=idxcsum) where cSum=0

This uses the index, but not in the manner we need :

ixforced

The index is scanned, all the rows are read from the index, and the key lookup previously mentioned has entered the equation.
So what is really happening here?
Well, in the second example there are some very graphical clues: the Compute Scalar and the Filter.
What is the Compute Scalar computing? Let's look at the 'Defined Values' in its properties:
[[AdventureWorks2012].[dbo].[ccTest].cSum] = Scalar Operator(checksum([AdventureWorks2012].[dbo].[ccTest].[surname]))
So the persisted data is being recomputed. Looking at the Filter, the predicate is:
[AdventureWorks2012].[dbo].[ccTest].[cSum]=(0)
That computed value is then being filtered upon. But this is the persisted data, so why is the index not used? Well, that's because the computation isn't logically the same. But how can this be?
There are two things that need to be appreciated here:

  1. SQL Server puts a lot of energy into simplifying queries and operations. Redundant branches are pruned and contradictions removed.
  2. Computed columns are expanded back to the base definition before being optimized.

In our computed column definition we have a redundant code branch. ISNULL is used to guard against the surname column being NULL; however, the value in surname can never be NULL as there is a NOT NULL constraint on it, therefore when the column definition is expanded it is simplified and this redundancy is removed. You can see this in the 'Defined Values' above: there is no ISNULL in the computation. So the optimizer cannot now rematch this computation back to the persisted computed column, as the definition is not the same. It would be nice to see the NOEXPAND hint usable to prevent this happening, as it can be for indexed views.
Now we know this, the resolution is relatively straightforward:

create table ccTest2
(
id integer primary key,
surname varchar(50) not null,
cSum as checksum(surname) persisted not null)
go
create index idxcsum on ccTest2(cSum)
go
 
 
insert into ccTest2(Id,surname)
Select BusinessEntityID,LastName from Person.Person
go

And then if we :

select * from ccTest2 where ccTest2.cSum=0

We will see the index used as expected in the first place

ixgood

So, the takeaway is that if you are using persisted computed columns, ensure that the definitions have no redundancy and then the indexes should be used.
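A simple way to audit for this sort of redundancy is to read the stored definitions back out of the catalog; a sketch along these lines (names from the examples above) makes the difference between the two tables obvious:

select object_name(object_id) as table_name,
       name as column_name,
       definition,
       is_persisted
from sys.computed_columns
where object_id in (object_id('ccTest'), object_id('ccTest2'));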
NB: Curiously, there is something else, or more accurately another level of indirection, here. If you use the query:

select * from ccTest where CHECKSUM(surname) = 0

You may have expected it to match to the persisted computed column, as it would against ccTest2, but it doesn’t.

Aug 15

Please please tell me now…

…Is there something I should know, sang Duran Duran many moons ago, when I was young and before I knew that query optimization was even a thing. But it is quite a handy little line to keep in mind when you are tuning queries: giving SQL Server a little bit more explicit information about your data can go a long way towards removing the work it does to guarantee something about the data.

The classic case for this is adding constraints, for example :

select * from Production.Product where DaysToManufacture < 0

Produces the plan of :
(plan screenshot 1)
SQL Server doesn’t need to read any of the table data.  This is because there is a constraint on the table guaranteeing that there are no rows <0 and therefore SQL Server can safely replace that table read with a constant scan.
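You can see the constraint the optimizer is leaning on; a quick sketch (the constraint name in AdventureWorks may differ, hence the filter on the definition), noting that only trusted constraints can be used this way:

select name, definition, is_not_trusted
from sys.check_constraints
where parent_object_id = object_id('Production.Product')
  and definition like '%DaysToManufacture%';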

So how else can this be exploited?

Well, I'm working on a large data warehouse project and one of the issues we have had is poor query performance against the data warehouse when processing the cubes. Imagine we are using this query:

select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
       FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
       factInternetSales.ProductKey,UnitPrice 
  from dimCustomer
  join factInternetSales
    on dimCustomer.CustomerKey = factInternetSales.CustomerKey

in a DSV. Now, SSAS has a high propensity (I don't know if it always does this) to wrap queries in a subselect and use DISTINCT, so our query would become something like this:

Select distinct *
from (
   select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
          FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
          factInternetSales.ProductKey,UnitPrice from dimCustomer
     join factInternetSales
       on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data

The plan hasn't changed; both look like this:

(plan screenshot 2)

That is because the primary keys of both tables are being returned, and that guarantees the DISTINCT; if either key column is removed from the select then SQL Server will have to sort to make the DISTINCT.

Select distinct *
from (
    select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
           /* FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, */
           factInternetSales.ProductKey,UnitPrice 
      from dimCustomer
      join factInternetSales
        on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data
option(maxdop 1) -- Inhibit parallelism for simplicity

(plan screenshot 3)

Yes, we are asking SQL Server for two different sets of data, and they are not logically equivalent queries, but sometimes the data is implicitly unique and SQL Server still has to do work to guarantee that uniqueness.

Let us further imagine that unknown members are dealt with by setting the customer key to -1 in factInternetSales, BUT the member -1 itself is not in the dimCustomer table and is instead supplied by a view. This is quite a bad practice, but that never stops this sort of stuff occurring in the real world, so bear with me…

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
union
Select -1,null,null,'Unknown'

 

The first thing to note is that UNION is used. UNION guarantees a unique set of rows, and therefore extra work will have to be done by SQL Server even on a simple select * from vwCustomer. The recommendation would be to use UNION ALL; we implicitly know that the data will be unique, so we can save SQL Server the bother.

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
union all
Select -1,null,null,'Unknown'

Now we plumb that into our DSV…

select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
       FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
       factInternetSales.ProductKey,UnitPrice from vwCustomer dimCustomer
  join factInternetSales
    on dimCustomer.CustomerKey = factInternetSales.CustomerKey

Which, if we run it in isolation, is just peachy. However, let us wrap it in DISTINCT as SSAS would:

Select distinct *
from (
   select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
          FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
          factInternetSales.ProductKey,UnitPrice from vwCustomer dimCustomer
     join factInternetSales
       on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data
option(maxdop 1)

(plan screenshot 4)
Oh, we now have a sort. This will be REALLY heavy on a regular DW load; ordering millions of rows is not quick.

So, what is our solution? Let's tell SQL Server what our implied knowledge of the data is, to make it explicit.

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
 where CustomerKey >0
union all
Select -1,null,null,'Unknown'

To all intents and purposes, adding CustomerKey > 0 does nothing: there are no rows it will filter out, and that is the exact point. This was implied knowledge; we have now explicitly given SQL Server that same knowledge.

So what happens when we use that view ?

(plan screenshot 5)

Great no sorting 🙂

So that deals with this as a one-off. Let's let SQL Server know that this holds for all queries of the table, not just for this view.

alter table dimCustomer add Constraint ChkSk check(CustomerKey>0)

We are now guaranteeing that all CustomerKeys are positive, so if we back up a bit and use the view definition of :

Drop view vwCustomer
go
Create View vwCustomer
as
Select CustomerKey,Title,FirstName,LastName from dimCustomer
union all
Select -1,null,null,'Unknown'

And then run :

Select distinct *
from (
   select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName,
          FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber,
          factInternetSales.ProductKey,UnitPrice 
     from vwCustomer dimCustomer
     join factInternetSales
       on dimCustomer.CustomerKey = factInternetSales.CustomerKey
) as Data
option(maxdop 1)

Once again, the DISTINCT is explicitly guaranteed by the schema, and no extra operations are required by SQL Server to prove it 🙂

So, next time you are query tuning, remember that adding a little extra (and sometimes seemingly superfluous) code can go a long way.

To paraphrase a bit

Please please tell me now, is there something you can say, to make that operator go away…

Apr 12

Reasons why your plans suck : No 56,536

I have been working with SQL Server for more years than I really care to mention and like to think that I am now quite comfortable with the internals, specifically the Query Optimizer. Every so often, though, a new problem gets thrown into the mix, just to keep me on my toes (or so it would seem).

Plans go "wrong" for a multitude of reasons, but they mostly boil down to a poor estimation. Poor estimations are generally caused by inadequate statistics, which is why SQL Server puts so much effort into efficiently creating and maintaining them. So, on my dev box I had a plan that went "wrong": a 3-second process turned into a 3-hour process, and a quick bit of investigation turned up the culprit, a relatively simple select statement. No worries, it's probably due to an old cached plan being used. Here is the plan:

plan

In reality, Application had hundreds of thousands of rows, not one.

So I added WITH(RECOMPILE); that should sort that one. But no, on the next run, the same thing happened with the same statement. Curious! Running the statement in isolation caused a "good" plan to be built; running the full process, a "bad" plan was built. So, why a bad estimation? Bad stats? In fact the plan was showing "missing stats" as a warning.

missing

Right, so it's not bad stats, it's no stats. Why would there be no stats? There are not too many reasons: async stats processing could be a culprit, as it's a large table so stats could potentially be built asynchronously. But no, that option is disabled.
The possibilities are narrowing down now, but I thought that I would take a look at the internal info that is not normally available within the execution plan. For this you will need the undocumented trace flag 8666; turn this on using DBCC TRACEON(8666) before retrieving the plan.
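(For completeness, the sequence is nothing more than the lines below; the flag is undocumented, so dev boxes only.)

DBCC TRACEON(8666);
-- capture the actual execution plan for the problem statement,
-- then inspect the plan XML for the extra internal information
DBCC TRACEOFF(8666);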
So inside the plan, I saw this :

failed

The stats failed to build; not "not found" or "non-existent", the stats actually failed to build. Running Profiler with the error events, StmtStarting and StmtEnding, we can see this activity happening. If you are not aware, stats are built using the STATMAN function; this can be seen using a Profiler trace just as any user-initiated action can. So, run the process and…

profiler

There it is, a failed stats build, so the estimation was wrong and I got a bad plan. Bingo. That's the effect; what's the cause of the failed stats build?

There are a few reasons why stats fail to build, mostly edge-case scenarios such as deadlocks, but remember that I said this was a dev box? A small dev box indeed: in this case the stats build failed due to memory pressure. How can I be certain of that? Well, playing the hunch card, I saw an extended event, "query_no_cqscan_due_to_memory_limitation"; this, amongst a few others, sounded relevant, and indeed when running the process through, that event does fire when stats fail to build under memory pressure.
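If you want to catch the same condition yourself, a minimal Extended Events sketch is enough (the event name is as observed above; the package prefix and the ring_buffer target are assumptions on my part):

CREATE EVENT SESSION NoCqScanMemory ON SERVER
ADD EVENT sqlserver.query_no_cqscan_due_to_memory_limitation
ADD TARGET package0.ring_buffer;
go
ALTER EVENT SESSION NoCqScanMemory ON SERVER STATE = START;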

cqscan

Sometimes I am in awe of the amount of work that has gone into SQL Server's query processing engine, and this is one of those times. Ideally I would have liked the problem to be signposted more clearly, maybe in the error log, but beggars can't be choosers.

Mar 29

T4 Support in SSDT

Sometimes a piece of technology passes you by completely; T4 templating happens to have done that for me. On Tuesday night, at SQL Supper, Geoff Clark (a UK MCM) (t) did a fantastic presentation on end-to-end data warehouse loading, a portion of which was on T4 templating. The question arose about SSDT and T4 templating and, as a coincidence, literally the next day the SSDT team put out the March 2014 release. Having my interest already piqued, I thought I'd have a play 🙂

So… let's add a TSQL template

sqltt

 

That will now add a base TSQL template into your project; however, there is a slight snag, it's broken 🙁

broken

This is confirmed and will be fixed in the next release as stated in this msdn forum thread. Bugger!!!

However, until then, all is not lost: although the DacFX portion is broken, T4 templating still works, so a more 'generic' solution is available. The basis of this is remarkably similar to TSQL Smells: iterate through a project, adding the .sql files into a model, and examine the DOM for interesting 'stuff'.

After a bit of playing around (I am very much a T4 noob), this is the script I came up with:

<#@ template language="C#" debug="true" hostspecific="true" #>
<#@ assembly name="Microsoft.VisualStudio.Shell.Interop.8.0" #>
<#@ assembly name="EnvDTE" #>
<#@ assembly name="EnvDTE80" #>
<#@ assembly name="VSLangProj" #>
<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\120\SDK\Assemblies\Microsoft.SqlServer.TransactSql.ScriptDom.dll" #>
<#@ assembly name="C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft\SQLDB\DAC\120\Microsoft.SqlServer.Dac.dll" #>
<#@ assembly name="C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft\SQLDB\DAC\120\Microsoft.SqlServer.Dac.Extensions.dll" #>
<#@ import namespace="Microsoft.VisualStudio.Shell.Interop" #>
<#@ import namespace="EnvDTE" #>
<#@ import namespace="EnvDTE80" #>
<#@ import namespace="Microsoft.VisualStudio.TextTemplating" #>
<#@ import namespace="Microsoft.SqlServer.Dac" #>
<#@ import namespace="Microsoft.SqlServer.Dac.Model" #>
<#@ import namespace="System.IO" #>
<#@ import namespace="System.Collections.Generic" #>
<#@ output extension=".sql" #>
 
-- Dynamic File generated by db
<#
    var hostServiceProvider = (IServiceProvider)this.Host;
    var dte = (DTE)hostServiceProvider.GetService(typeof(DTE));
 
    using (TSqlModel model = new TSqlModel(SqlServerVersion.Sql110, new TSqlModelOptions { }))
    {
        foreach(Project project in dte.Solution)
        {
            IterateThroughProject(project.ProjectItems,model);
        }
 
        List<TSqlObject> allTables = GetAllTables(model);
        foreach (var table in allTables)
        {
#>
--				Table <#= table.Name.Parts[0] #>.<#= table.Name.Parts[1] #>
<#
 
        }
    }
 
#>
-- File Done
<#+
 
    public List<TSqlObject> GetAllTables(TSqlModel model)
    {
        List<TSqlObject> allTables = new List<TSqlObject>();
 
        var tables = model.GetObjects(DacQueryScopes.All, ModelSchema.Table);
        if (tables != null)
        {
            allTables.AddRange(tables);
        }
        return allTables;
    }
 
 
 
    private void IterateThroughProject(ProjectItems PrjItems,TSqlModel model)
    {
        foreach(ProjectItem PrjItem in  PrjItems)
        {
 
            if(PrjItem.Name.EndsWith(".tt", StringComparison.OrdinalIgnoreCase)){ // Dont Load the files we want to build
                continue;
 
            }
            if(PrjItem.ProjectItems!=null)
            {
                IterateThroughProject(PrjItem.ProjectItems,model);
            }
            if(//PrjItem.Object.GetType().ToString() == "Microsoft.VisualStudio.Data.Tools.Package.Project.DatabaseFileNode" && 
                PrjItem.Name.EndsWith(".sql", StringComparison.OrdinalIgnoreCase))
            {
#>
--				This is a sql file and will be processed
--				<#= PrjItem.FileNames[0] #>
<#+
                if (!PrjItem.Saved)
                {
                    PrjItem.Save();
                }
                StreamReader Reader = new StreamReader(PrjItem.FileNames[0]);
 
                string Script = Reader.ReadToEnd();
                model.AddObjects(Script);
            }
        }
    }
 
#>

This should all be fairly self-explanatory: we iterate through the project in the member function IterateThroughProject, adding any found .sql files to the model, then use the model.GetObjects member to find all tables in the model and iterate over that list, printing schema and table name.

I have seen a few clunky TSQL generation routines written in TSQL, but I think porting those into T4 templating and having them built directly into a dacpac will be a big boon.

Have fun

Mar 13

Whats This Command Bar ?

One of the challenges I faced while creating my TSQL Smells Visual Studio (SSDT) add-in was trying to find out exactly what the names of the Command Bar objects internal to Visual Studio are.

Most of the advice on the internet is fairly haphazard, i.e. guess / trial and error, but there is a simpler way.

Here http://1drv.ms/1fsR3n3 is a C# project that will add the Command Bar name into every command bar. After dropping that into your add-ins directory you should see something like this:

cmdbar

The code itself is pretty simple :

public void OnConnection(object application, ext_ConnectMode connectMode, object addInInst, ref Array custom)
{
    _applicationObject = (DTE2)application;
    _addInInstance = (AddIn)addInInst;
    if(connectMode == ext_ConnectMode.ext_cm_UISetup)
    {
        object []contextGUIDS = new object[] { };
        string toolsMenuName = "Tools";
 
        Commands2 commands = (Commands2)_applicationObject.Commands;
        Microsoft.VisualStudio.CommandBars.CommandBars Bars = (Microsoft.VisualStudio.CommandBars.CommandBars)_applicationObject.CommandBars;
 
       for(int x=1;x<Bars.Count;x++){
           Microsoft.VisualStudio.CommandBars.CommandBar Cb = Bars[x];
           try
           {
                Command command = commands.AddNamedCommand2(_addInInstance,"Addin"+x.ToString(),
                    "This CmdBar is ["+Cb.Name+"]" , "Executes the command for My Addin",
                    true, 59, ref contextGUIDS,
                    (int)vsCommandStatus.vsCommandStatusSupported +
                    (int)vsCommandStatus.vsCommandStatusEnabled,
                    (int)vsCommandStyle.vsCommandStylePictAndText,
                    vsCommandControlType.vsCommandControlTypeButton);
 
                if ((command != null))
                {
                     CommandBarControl ctrl =
                    (CommandBarControl)command.AddControl(Cb, 1);
 
                    ctrl.TooltipText = "Executes the command for MyAddin";
                }
            }
            catch
            {
            }
        }
    }
}
 
public void QueryStatus(string commandName, vsCommandStatusTextWanted neededText, ref vsCommandStatus status, ref object commandText)
{
    if(neededText == vsCommandStatusTextWanted.vsCommandStatusTextWantedNone)
    {
        if(commandName.StartsWith("WhatsThisCommandBar.Connect.Addin"))
        {
             status = (vsCommandStatus)vsCommandStatus.vsCommandStatusSupported|vsCommandStatus.vsCommandStatusEnabled;
             return;
        }
    }
}

Hope this helps someone struggling to find the correct command bar name.

Mar 04

SSDT – Cut n Paste substituting CMDVARs

Another thing that slows me down with SSDT is testing stored procedures that use CMDVARs.

Consider this proc :

Prc1

If I wish to test that in SSMS, I can cut and paste it in, then use SQLCMD mode and add in :setvar statements, or simply replace $(CmdSwitch) and $(Otherdb) with the appropriate values. OK for one or two values, but what if there are lots? What if there were a simpler way? Well, now there is:

Introducing “CmdVar Cut”

cmdvarcut

Selecting this option will place the text with the values substituted directly into your clipboard ready for pasting to SSMS.  As with Fast Deploy, the values are defined in Tools->Options->TSQL Smells->Deploy CMDVars.

The download for this is available from here http://dataidol.com/davebally/ssdt-tsql-smells-add-in/

Have fun

 

 

 

Mar 01

SSDT – Fast deploy

Let me state first of all that I really do like SSDT; integration with a source control system of your choice and all its extra features make its use a real boon.

However……..

On my past few assignments SSDT has been heavily used, and deployment is done via build/deploy PowerShell scripts. This is all very well and good and makes for predictable, repeatable deployments, but it does slow you down rather a lot: you make a change to a SQL artefact, then build, which takes a minute or two, then deploy, which once again takes a minute or two. Over the course of a day this adds up to rather a lot of time. It also means that mostly development is done in SSMS and then cut and pasted into SSDT on completion, which really negates a lot of the advantages of SSDT and makes it no more useful than a historic system using SourceSafe as source control.

 

This got me to thinking that there should be a faster way to deploy during development, so I have now extended TSQL Smells to do exactly that. Right-clicking on a file (or folder) will now show you the "Fast Deploy" option.

FastDeploy

 

 

This will deploy the selected scripts directly to your selected SQL host without the kerfuffle of a full build-deploy cycle. User feedback is given in the Output -> TSQL Smells window; if nothing appears to be happening, this should tell you why.

CMDVars are supported in Tools -> Options -> TSQL Smells -> Deploy CMDVars; just enter them as key-value pairs. These are kept in the registry so they will persist over SSDT sessions. You also set the database to connect to here.

You can download the latest version from here : http://1drv.ms/MEOp6H

Install instructions etc are available here : http://dataidol.com/davebally/ssdt-tsql-smells-add-in/

Let me know how you get on : parser@clearskysql.co.uk