My present role is to look at the performance of a large data warehouse load. This is right up my alley, and I have been playing with all manner of performance tuning fun: indexing, execution plans, filtered statistics, etc. All good stuff.
However, during QA testing some of my suggestions actually made things worse. Not a little worse, but a whole lotta worse. Big surprise. Going back to my dev box, I confirmed that the T-SQL changes should have made things better, but there was one significant difference (other than the data load). To ensure transactional integrity, the load process works against a database snapshot: it reads a static view of the data whilst the OLTP system carries on doing its work. The QA testing, by definition, ran the whole process; when I was dev'ing, I read from the database, not the snapshot. Interesting.
When I switched to using a snapshot, things did indeed go slower. Let's try this against AdventureWorks:
```sql
CREATE DATABASE AdventureWorks2012ss
ON ( NAME = AdventureWorks2012_Data,
     FILENAME = 'D:\DATA\AdventureWorks2012_data.ss' )
AS SNAPSHOT OF AdventureWorks2012;
GO
SET STATISTICS IO ON
GO
DBCC DROPCLEANBUFFERS
GO
USE AdventureWorks2012
GO
SELECT COUNT(SOH.Comment), COUNT(SOD.ProductID)
FROM sales.SalesOrderHeader SOH
INNER LOOP JOIN sales.SalesOrderDetail SOD
    ON SOH.SalesOrderID = SOD.SalesOrderID
GO
USE AdventureWorks2012ss
GO
SET STATISTICS IO ON
GO
DBCC DROPCLEANBUFFERS
GO
SELECT COUNT(SOH.Comment), COUNT(SOD.ProductID)
FROM sales.SalesOrderHeader SOH
INNER LOOP JOIN sales.SalesOrderDetail SOD
    ON SOH.SalesOrderID = SOD.SalesOrderID
GO
```
OK, so that is a fairly nonsensical query, and the loop join is forced purely to demonstrate the issue, but the plans are exactly the same. Running both and tracing in Profiler shows the performance difference:
Reading the database took roughly 1.0 to 1.1 seconds; reading the snapshot took roughly 1.5 to 1.6 seconds. A sizeable difference. Also noteworthy: the snapshot file and the database file exist on the same physical drive, and that drive is a slow USB drive. Testing against my laptop's SSD, try as I might, demonstrated no measurable difference.
The read counts tell the story: roughly 150 reads against the database, 1,150 against the snapshot. This sounds to me like a likely cause of the slowdown.
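Beyond a Profiler trace, the file-level I/O counters make the same comparison easy. This is an illustrative sketch, assuming both databases live on the server used above; `size_on_disk_bytes` and the read counters come from the documented `sys.dm_io_virtual_file_stats` DMV. Capture the counters before and after each test query and diff them:

```sql
-- Compare physical read activity against the source database and its
-- snapshot. Run once before and once after each test query, then diff.
SELECT DB_NAME(vfs.database_id)                           AS database_name,
       mf.name                                            AS file_name,
       vfs.num_of_reads,
       vfs.num_of_bytes_read,
       -- Average read size: ~8 KB reads against the snapshot's unchanged
       -- pages, larger reads against the database file itself.
       vfs.num_of_bytes_read / NULLIF(vfs.num_of_reads, 0) AS avg_bytes_per_read
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
WHERE DB_NAME(vfs.database_id) IN ('AdventureWorks2012', 'AdventureWorks2012ss');
```

Note that the DMV counters are cumulative since the database came online, which is why the before/after diff matters.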
When reading a snapshot, the sparse file has to be read first, and then the database is read if the page has not yet been changed; a double whammy, if you will. Also, this happens on a page-by-page basis, which is why I forced the nested loop join in the query. Using Process Monitor from the Sysinternals tool set, we can see this activity in action…
Firstly reading the database:
Nice large reads of 65,536 bytes each. Now the snapshot…
Lots of reads of 8,192 bytes; this is the activity that I believe is slowing my queries down. I asked Microsoft for an opinion on what is actually happening here, and the response was:
When we do a large read against a COW sparse file, pages that have not been updated in the primary will not be in the sparse file and we will have to read from the primary’s file. This is done on a page by page basis so can result in many reads against the primary for a single read in the snapshot DB.
If the pages have been pushed over to the snapshot’s COW file then they can be read in a large read.
Essentially, we’ll only do large reads against pages that have been pushed into the snapshot.
The broken up reads against the primary are issued in parallel, so while it is more IOs it is not necessarily a longer time to do those IOs.
Sparse file reads themselves are not likely very optimal since they are by definition fragmented, so OS will likely end up breaking up those reads anyway.
So the activity that I am seeing is natural and normal, and that explanation does highlight the physical differences inherent in reading a snapshot.
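One consequence of the copy-on-write behaviour described above is that the snapshot's sparse file grows as pages are pushed into it. A quick way to see how much has actually been copied across, using the documented `size_on_disk_bytes` column (which for sparse files reports the actual on-disk allocation rather than the logical file size):

```sql
-- How much of the snapshot's sparse file has actually been allocated,
-- i.e. how many pages have been copied over from the source database.
SELECT DB_NAME(database_id) AS database_name,
       size_on_disk_bytes
FROM sys.dm_io_virtual_file_stats(DB_ID('AdventureWorks2012ss'), NULL);
```

The larger this gets, the more of the workload's reads can be satisfied from the sparse file in large reads, per the explanation above.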
So, back to the problem at hand. To fix the queries that were running slowly in the real examples, I used the FORCESCAN hint, which prevents the nested loop seek activity and enables SQL Server to do larger page reads against the database file for those unchanged pages in the snapshot. This did have a dramatic effect.
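Applied to the demo query above (the real load queries aren't shown here, so this is an illustrative sketch), FORCESCAN is a table hint, so it goes in a WITH clause on each table:

```sql
-- Force scans so the optimizer reads the tables in large sequential
-- reads instead of page-at-a-time seeks through the snapshot.
SELECT COUNT(SOH.Comment), COUNT(SOD.ProductID)
FROM sales.SalesOrderHeader AS SOH WITH (FORCESCAN)
INNER JOIN sales.SalesOrderDetail AS SOD WITH (FORCESCAN)
    ON SOH.SalesOrderID = SOD.SalesOrderID;
```

Note FORCESCAN requires SQL Server 2008 R2 SP1 or later, and like any hint it trades optimizer flexibility for predictability, so it's worth re-testing whenever the data volumes change.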
Have you seen similar activity before? Is this only an issue for slow or overloaded IO systems? I'd love to know what you have witnessed.