What is Hashing; using Modulus to partition data

Hopefully this post goes some way towards helping the reader understand what hashing and hash indexes are, and why row chains are needed in the hash indexes of In-Memory Tables (Hekaton) in SQL Server 2014.

Purpose of hashing?

Hashing can be used to index character data: instead of building an index directly on a varchar(50) column, for example, the data values can be hashed down to smaller smallint (2 byte) or integer (4 byte) values and the hash indexed instead.
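
As a minimal sketch of the idea (the temporary table and column names are made up for illustration), a 4 byte CHECKSUM of the character column can be indexed instead of the wide column itself:

create table #customer (
    email varchar(50) not null,
    email_hash as checksum( email ) persisted     --     4 byte hash of the 50 byte column
)
create index ix_customer_email_hash on #customer ( email_hash )

--     seek on the narrow hash, then re-check the wide column to weed out collisions
select *
from #customer
where email_hash = checksum( 'tony@example.com' )
  and email = 'tony@example.com'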

It’s also used in the distributed database and table partitioning worlds to get a deterministic place to hold your data; for example, given 10 database servers each holding a segment of the data, a hash function such as GetPartitionHash( @transaction_number ) can be used to return a value from 1 to 10 identifying the server that owns the row.
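
GetPartitionHash is not a built-in function; a minimal sketch of such a function, assuming a numeric transaction number and the 10 servers above, might look like this:

create function dbo.GetPartitionHash ( @transaction_number int )
returns int
as
begin
    --     deterministically map the transaction number onto servers 1 to 10
    return ( abs( @transaction_number ) % 10 ) + 1
end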

How to hash

Cryptographic hash algorithms such as SHA1 and MD4 form a non-reversible hash of the data; for example, HashBytes( 'sha1', '092231232' ) gives the 20 byte value 0xF0D8007C5FC2E54A75AAD004747BFA1A3D6EDCF6. The original data cannot be determined from that value, however repeating the HashBytes call over the same input will always yield the same result.
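
Repeating the call demonstrates that determinism:

print hashbytes( 'sha1', '092231232' )     --     0xF0D8007C5FC2E54A75AAD004747BFA1A3D6EDCF6
print hashbytes( 'sha1', '092231232' )     --     exactly the same 20 bytes again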

Algorithms such as SHA1 and MD4 give a uniform (or even) hash, that is to say, the distribution of values across the hash array looks random; successive calls to the hash function do not return values clumped together, which would concentrate rows into a few buckets and so create hash collisions.

At the end of the day, the hash is simply a number, a number that we can play with.

Buckets within the Hash Array (the Index)

A hash index is a one-dimensional array of pointers to the actual data; in Hekaton that array is called a Mapping Table, and there is one Mapping Table per index. The position within the bounds of the array (in Hekaton each position is called a bucket) is determined by the hash, i.e. hash_index_array( MyHashFunction( @data_to_hash ) ).

The array element (the bucket) contains a pointer to the actual data row; in Hekaton this points at the row in memory, though conceivably it could also point at a page on stable storage.
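
Purely as a conceptual illustration (this is not how Hekaton physically implements its Mapping Table), the structure can be pictured as one element per bucket, each holding the head of a row chain:

--     conceptual only: one element per bucket
create table #mapping_table (
    bucket int not null primary key,     --     position in the hash array
    row_pointer binary(8) null           --     address of the first row in the chain
)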

On calling the hash function, the range between the lower and upper bounds of the return value may be quite large. For example, if the hashing function returned a smallint there would be 65,536 possible values; if each hash bucket (element in the array) was 8 bytes that would equate to 512 KiB. But what if the hashing function returned an integer? That would be 2^32 possible values – around 32 GiB of buckets – which is a considerable amount of memory. The engineering solution needs to allow the size of the hash array to be chosen and the hash value made to fit within that range; that fitting can be achieved using the modulus (remainder) of dividing the hash value by the number of available array positions (discussed later).

Example

Hash_index_array is an array of 10000 elements starting from 0.

print checksum( 'sql server 2014 rocks' )

returns: 778782938

print 778782938 % 10000

returns: 2938

The array position is therefore 2938, or to put it another way, the natural data 'sql server 2014 rocks' hashes to the value 778782938 which is then reduced down to 2938. That bucket in the hash array points to the actual data row.

Hash Collisions

A collision occurs when the hash function returns the same value for two or more unique pieces of data.

Example

The two statements below yield the same hash value of -1963883182.

print checksum( 'D0AA9CC5-8C15-488C-BB5F-B5C5CEE0D5C2' )

print checksum( 'EF18B8FF-93D4-4814-8E47-922DA1E2B657' )


The rate of collision depends on the quality of the hash function and on the range of hash values relative to the total number of rows in the data set: the smaller the hash array, the more likely collisions become.

When a collision occurs you get a row chain against that hash value.

Row chain example

Consider a table with 10 rows where the key column has the unique values 0 through 9 (see SQL below), stored as a tinyint, i.e. 1 byte; how can the data be indexed in such a way that not all 10 rows need to be scanned to find a value, but also so that the index entry is stored in less space – 2 or 3 bits?

Data to Index

select rn
from (
    select ROW_NUMBER() over( order by object_id ) - 1 as rn             --     start at 0
    from sys.objects
    ) as d
where d.rn < 10
order by rn

Take the data value (or the hash of the data value) and the number of available positions in the array (buckets); using the SQL Server modulus operator % the remainder can be captured, giving the position in the array.

print 0 % 5
print 2 % 5
print 4 % 5
print 5 % 5
print 7 % 5
print 9 % 5

0
2
4
0
2
4

One consideration when using modulus is that if data is inserted sequentially, i.e. x = x + 1, the buckets are visited in a strict repeating cycle, i.e. 0 – 4, 0 – 4, 0 – 4 and so on, rather than in a random order such as 1, 4, 2, 1, 4, 0.
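
The cycling is easy to see with a quick query over a handful of sequential keys (5 buckets, matching the prints above):

--     sequential keys visit the buckets in strict order: 0,1,2,3,4,0,1,...
select n, n % 5 as bucket
from ( values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9) ) as t( n )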

Hashing into buckets (elements of an array)

Two bits can store the values 0 – 3 giving four unique values – our index will be contained in an array with 4 elements; the 10 unique values need to be reduced to 4.

select rn as v,
    cast( ceiling( rn % 2 ) as smallint ) as [2],
    cast( ceiling( rn % 3 ) as smallint ) as [3],
    cast( ceiling( rn % 4 ) as smallint ) as [4]
from (
    select ROW_NUMBER() over( order by object_id ) as rn          --     start at 1
    from sys.objects
    ) as d
where d.rn <= 10
order by rn
Val    [% 2]  [% 3]  [% 4]
1      1      1      1
2      0      2      2
3      1      0      3
4      0      1      0
5      1      2      1
6      0      0      2
7      1      1      3
8      0      2      0
9      1      0      1
10     0      1      2

The table above shows the number of unique hash index values available to address the original data; [3] and [4] show that a fit is achievable, but only by mapping two or more rows to the same hash bucket – and a bucket can only point to one place, hence the need for a row chain.

For any given hash lookup value, three rows will be returned – see the example below, which looks up the value 5:

select rn as v,
    cast( ceiling( rn % 2 ) as smallint ) as [2],
    cast( ceiling( rn % 3 ) as smallint ) as [3],
    cast( ceiling( rn % 4 ) as smallint ) as [4]
into #hash_index
from (
    select ROW_NUMBER() over( order by object_id ) as rn          --     start at 1
    from sys.objects
    ) as d
where d.rn <= 10
order by rn
 
declare @lookup_value int = 5
 
select *
from #hash_index
where [3] = cast( ceiling( @lookup_value % 3 ) as smallint )

Our index saves 7 rows from having to be checked – so there is a performance boost – but it gives us 3 rows and not the single row required. The query below shows the two steps: 1) use the hash index to locate a range of rows and 2) pinpoint the value we are after.

declare @lookup_value int = 5
 
select *
from #hash_index
where [3] = cast( ceiling( @lookup_value % 3 ) as smallint )  --     Value to look up
  and [v] = @lookup_value
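
Grouping the same #hash_index table by the bucket value shows the length of each row chain – 10 values squeezed into 3 buckets gives chains of 3 or 4 rows:

select [3] as bucket, count(*) as chain_length
from #hash_index
group by [3]
order by bucket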

Hashing strings and wide binary/composite keys

The HashBytes function can be used with one of the many supported hash algorithms (see example below); it is worth noting that those algorithms can be CPU intensive.

print hashbytes( 'sha1', N'tony rogerson' )

Note: character data will give a different hash depending on whether it is Unicode (nvarchar rather than varchar) and on its collation.
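
For example, the varchar and nvarchar forms of the same string produce different hashes because the underlying bytes differ:

print hashbytes( 'sha1', 'tony rogerson' )     --     varchar: one byte per character
print hashbytes( 'sha1', N'tony rogerson' )    --     nvarchar: two bytes per character, so a different hash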

The example below shows 3 numeric columns being concatenated and a SHA1 hash generated over the result.

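--     the three columns form a composite key; each is cast to a fixed
--     width binary and the concatenated result is hashed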
declare @col1 int = 56,
    @col2 int = 11213,
    @col3 tinyint = 15
 
print hashbytes( 'sha1',
                 cast( cast( @col1 as binary(4) ) +
                       cast( @col2 as binary(4) ) +
                       cast( @col3 as binary(4) ) as char(12) )
               )

Hash evenness

There are a fixed number of buckets in the array, and the return from the hash function resolves to one of those buckets – enter "hash collisions". Consider a hash array of 1,000 buckets into which 1,000 data values are hashed: if all of those values hashed to just 10 buckets the index would be pretty useless. That is what is meant by evenness – the ideal is that the 1,000 data values hash into all 1,000 buckets, one bucket per data value; when all 1,000 hash into just 10 buckets, each of those 10 buckets becomes a container for 100 rows!
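
A minimal sketch for measuring evenness yourself (1,000 buckets and 1,000 GUID values, mirroring the example above):

declare @i int = 0
create table #bucket_counts ( bucket int not null )

while @i < 1000
begin
    --     hash a fresh GUID and reduce it to one of 1,000 buckets
    insert #bucket_counts values ( abs( checksum( newid() ) % 1000 ) )
    set @i = @i + 1
end

--     how many buckets were used, and how deep is the worst row chain?
select count( distinct bucket ) as buckets_used,
       max( chain_length ) as deepest_chain
from ( select bucket, count(*) as chain_length
       from #bucket_counts
       group by bucket ) as d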

The same hash function can yield different evenness depending on the variety of the data values – think GUID compared to an incrementing integer. The two charts below demonstrate that: the CHECKSUM over NEWID() looks random across the hash array; it isn’t though, it’s just that the bytes within the 16 byte GUID vary a lot more than those of the incrementing integer.

[Chart: HashingIndexCHECKSUM250 – distribution of hashed values across the hash array]

Inserting 500K rows into an array of 1K buckets shows that over time the distribution across the hash array balances out; in the case of the incrementing integer it is exactly equal, i.e. 500,000 / 1,000 = 500 rows per bucket.


[Chart: HashingIndexCHECKSUM500K – distribution across the hash array after 500K inserted rows]

Hopefully this introduction to hashing gives you some ideas around the usefulness of the technique and sets you up for my next post, which will cover Hekaton hash and range indexes.