Hekaton is a really interesting technology, but it is a world away from the functionality that we know and love. The SQL team have done a great job of disguising this departure by integrating it inside the SQL Server engine, but nonetheless it is a different beast entirely. Although the ultimate aim, I would imagine, is a seamless integration where the user (and developer) is not really concerned with the underlying storage technology, there will be many real-world issues if the differences are not fully understood.
Hekaton stores data in hash buckets; this is a fundamental tenet. A hash is simply a function applied to some key data, and the bucket is where the related row is stored.
For example: if our hash function were X%5, then our buckets for the values 1 through 10 would be populated thusly:
As % is the modulo operator (divide and return the remainder), 9%5 = 4. The hash function used by SQL Server is much more complicated than this and, like the hash function used in a hash join, will differ depending on the data being hashed.
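The toy bucketing above can be sketched in a few lines of Python. This is purely illustrative of the concept, using the X%5 hash from the text; it is not how SQL Server's real hash function works.

```python
# Distribute the values 1..10 into 5 buckets using the toy hash x % 5.
buckets = {b: [] for b in range(5)}

for x in range(1, 11):
    buckets[x % 5].append(x)   # the hash value decides which bucket holds the row

for b, rows in sorted(buckets.items()):
    print(b, rows)
# bucket 4, for example, holds 4 and 9 (since 9 % 5 = 4)
```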
So this is quite interesting, since when we define a Hekaton table we need to specify upfront the number of buckets that we think (perhaps even assume) we need:
CREATE TABLE BucketDemo
(
    col1 INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1025),
    col2 INT NOT NULL,
    col3 INT NOT NULL,
    INDEX [idx]  NONCLUSTERED HASH ([col2]) WITH (BUCKET_COUNT = 1025),
    INDEX [idx2] NONCLUSTERED HASH ([col3]) WITH (BUCKET_COUNT = 1025)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
Now let's populate it with some data:
INSERT INTO BucketDemo
SELECT TOP(8000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       CAST(CAST(NEWID() AS VARBINARY) AS INTEGER),
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 10
  FROM sys.columns a
 CROSS JOIN sys.columns b
So what is the bucket usage for the BucketDemo table now?
Well, a number of new DMVs have been added to SQL Server 2014, as you may well expect; the relevant one here is sys.dm_db_xtp_hash_index_stats.
SELECT *
  FROM sys.dm_db_xtp_hash_index_stats
 WHERE object_id = OBJECT_ID('BucketDemo')
The first thing to note is an obvious buglet: the index names are cartesian-joined to the indexes. This will be fixed from CTP2 (if such a thing will exist) onwards. For your reference, here is sys.indexes:
So, looking at sys.dm_db_xtp_hash_index_stats what information can we glean?
Let's start with the primary key index “PK__BucketDe__35…”, index_id 4.
Total_bucket_count is 2048, which is interesting in itself: we asked for a bucket count of 1025. By experimentation it is easy to see that the requested bucket count is rounded up to the next power of 2, not the exact number requested. Empty_bucket_count is, quite obviously, the number of buckets that have no data contained within them; for the primary key we have 0, i.e. there are no empty buckets. Next up, avg_chain_len is the average chain length (how many rows are contained within a bucket) across all buckets, and max_chain_len is the maximum chain length.
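The rounding behaviour observed above is simple to sketch. This is an illustrative helper, not SQL Server's actual code, but it reproduces what the experiment shows: 1025 requested buckets become 2048.

```python
# Round a requested bucket count up to the next power of two,
# matching the behaviour observed in sys.dm_db_xtp_hash_index_stats.
def next_power_of_two(n: int) -> int:
    p = 1
    while p < n:
        p *= 2
    return p

print(next_power_of_two(1025))   # 2048, as seen for BUCKET_COUNT = 1025
```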
For index_id 3 we can see 49 unused buckets and a fairly short max_chain_len, but for index_id 2, 2038 buckets are empty, i.e. only 10 buckets are used, and avg_chain_len and max_chain_len are both 800. This isn't too surprising: the data contained within will always be in the range 0-9, as the formula used to calculate it is x%10. So I have vastly overestimated the number of buckets required to support this index. I specified 1025, which was rounded up to 2048, but given the data there is no chance of any of the other buckets ever being used.
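A quick simulation shows why idx2 ends up this way. Here Python's built-in hash stands in for SQL Server's hash function (an assumption; the real function is different), but the outcome is forced either way: a key that only ever takes the values 0-9 can only ever land in 10 distinct buckets, no matter how many buckets the index has.

```python
# Simulate 8000 rows whose key is row_number % 10 hashing into 2048 buckets.
BUCKETS = 2048
ROWS = 8000

chains = [0] * BUCKETS
for n in range(1, ROWS + 1):
    key = n % 10                        # the column only holds values 0..9
    chains[hash(key) % BUCKETS] += 1    # stand-in for SQL Server's hash

used = sum(1 for c in chains if c > 0)
print(used, max(chains))   # 10 buckets used, max chain length 800
```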
To my mind, correctly estimating the bucket count required for your production data will make or break your system. Underestimate and you will be 'scanning' through longer chains of rows; overestimate and too much memory will be allocated. It's only 8 bytes per bucket if unused, but that could add up to a fair amount.
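A back-of-the-envelope calculation makes the memory cost concrete. This sketch assumes the 8-bytes-per-bucket figure and the round-up-to-a-power-of-two behaviour described above; at small bucket counts the overhead is trivial, but at the counts a large table might ask for it is not.

```python
# Estimate bucket-array memory for a requested BUCKET_COUNT,
# assuming 8 bytes per bucket and rounding up to the next power of two.
BYTES_PER_BUCKET = 8

def bucket_memory_bytes(requested: int) -> int:
    actual = 1
    while actual < requested:
        actual *= 2                      # round up to the next power of two
    return actual * BYTES_PER_BUCKET

for requested in (1025, 1_000_000, 100_000_000):
    print(requested, bucket_memory_bytes(requested))
# 1025 rounds to 2048 buckets (16 KB); 100 million rounds to ~134 million
# buckets, which is a whole gigabyte of bucket array alone.
```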