Reducing SQL Server IO and Access Times using Bloom Filters – Part 1 (Concepts)

Given a 10 million row table with a GUID as a primary key, we have a 50,000 row table that we want to look up against it to see if we have any matching rows and, for those matching rows, aggregate the data. Let's assume that 50% of the lookup rows have a corresponding match, so 25,000 rows are the expected join result.
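For reference, here is a minimal sketch of the sort of schema assumed in this example; the data types, and anything beyond the GUID primary key plus the segment and hashmod_md4 columns referenced in the query below, are my own assumptions for illustration.

CREATE TABLE dbo.big_table
(
    pk          uniqueidentifier NOT NULL PRIMARY KEY,   -- GUID primary key, ~10 million rows
    segment     int              NOT NULL,               -- grouping column used in the query below
    hashmod_md4 int              NOT NULL                -- type assumed for illustration
);

CREATE TABLE dbo.big_table_lookup
(
    pk uniqueidentifier NOT NULL PRIMARY KEY              -- ~50,000 rows to probe with
);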

The simple SQL is as follows:

SELECT bt.segment, COUNT(*) as cnt, COUNT(DISTINCT bt.hashmod_md4) as hshcnt
FROM big_table_lookup AS lk
	INNER JOIN big_table bt ON bt.pk = lk.pk
GROUP BY bt.segment
ORDER BY bt.segment

Without a 3,000 word blog post on how the various SQL Server join types work, they basically all have two things in common. First, on invocation of the join they require the real data: for instance, given the GUID B667B973-2D77-412C-BFA1-85204524AC82, the B+Tree intermediate nodes will only get you so far in determining whether the value exists; it usually requires access to the leaf level to determine if the row actually exists. Second, getting the real data from the leaf of the index is relatively expensive. If the B+Tree depth is 4 levels (root, intermediate-1, intermediate-2 and leaf), that's 4 x 8,192 byte pages that need to be read. True, the root and most of the intermediate levels are likely to be in main memory depending on what else is going on in the buffer pool, but it's still 32 KiB per row unless read-ahead can kick in and read some contiguous storage, in which case it will be a lot less: for example root + intermediate-1 + intermediate-2 + leaf + leaf + leaf + leaf + leaf + leaf, negating some of the intermediate level IOs (which are in memory anyway!).
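To put a rough worst-case number on that (my own back-of-the-envelope figure, ignoring caching and read-ahead): 50,000 lookups x 4 pages x 8 KiB per page = 50,000 x 32 KiB, which is roughly 1.5 GiB of page accesses for a join that ultimately returns only 25,000 rows.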

Wouldn’t it be good if we could have an index that filters out the rows we don’t want: the rows that we know for certain are not in the base table? Given the example earlier, we have to do that expensive leaf level lookup (be it nested loop, merge or hash) against the big_table base table for all 50K lookup rows. If we could filter out the 25K rows that are definitely not there and only look up the 25K rows that might be there, we’d be quids in and save some resources. Well we can – using a Bloom Filter.

This is the first of seven posts I’ll make on the subject in the coming weeks; there is a lot to cover and I’ll do it in depth.

Post #1 (Concepts)
Post #2 (Basics of the method in SQL Server)
Post #3 (Inserting Data)
Post #4 (Querying Data)
Post #5 (Initial Query Analysis)
Post #6 (Bloom Filters to replace Non-Clustered Indexing)
Post #7 (Conclusion)

Bloom Filters

Burton H. Bloom laid out the method in the 1970 Communications of the ACM: http://dl.acm.org/citation.cfm?id=362692. It’s a simple but effective method of indexing large amounts of data through a combination of hashing and a bit array; the method is used to determine whether an element is definitely not in a set of data or possibly in it. False positives (when the Bloom Filter reports an element exists but it doesn’t) happen when a looked-up value hashes to positions that have all already been set by other data, so on look up in the Bloom Filter those bits are found set – there is an algorithm to determine the likely error rate.

The diagram below describes how the method works. First the data is hashed; the general recommendation is not to use cryptographic hashes because they are compute intensive. However, because I can use the SQL Server HashBytes function to give me a 16 byte MD5 hash, and knowing MD5 has an even distribution, I can slice the hash into four 4 byte integers; each integer denotes a position in the bit array to set to True (i.e. the value 1).

The % @range gets the modulo (the remainder) of dividing the hash by the maximum number of bits in the Bit Array; it makes sure the hash is kept within the bounds of the available Bit Array. For example, a binary(8000) array contains 8,000 bytes * 8 bits = 64,000 bits, so the modulus used would be 64,000, i.e. 12312323 % 64000 returns 24323.

[Diagram: hashing a value into positions in the bit array]
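To make that concrete, here is a minimal T-SQL sketch of the slicing and modulo step described above; the variable names are mine, and this is purely an illustration of the idea, not the fn_bloom_filter_check_varchar function used later in the series.

DECLARE @value varchar(36) = 'B667B973-2D77-412C-BFA1-85204524AC82';
DECLARE @range int = 8000 * 8;                      -- 64,000 bits in a binary(8000) bit array
DECLARE @hash  binary(16) = HASHBYTES('MD5', @value);

-- Slice the 16 byte MD5 into four 4 byte chunks; casting via bigint keeps the values positive,
-- then % @range keeps each position inside the bounds of the bit array
SELECT CAST(SUBSTRING(@hash,  1, 4) AS bigint) % @range AS bit_position_1,
       CAST(SUBSTRING(@hash,  5, 4) AS bigint) % @range AS bit_position_2,
       CAST(SUBSTRING(@hash,  9, 4) AS bigint) % @range AS bit_position_3,
       CAST(SUBSTRING(@hash, 13, 4) AS bigint) % @range AS bit_position_4;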

A collision occurs when the four hashes map to four cells in the bit array that have already been set to 1. This is how False Positives occur.

Collisions and False Positives

One of the downsides of a Bloom Filter is that you need to fix the size of the Bit Array from the outset. The more data added to the Bloom Filter, the more bits are set; the two charts below show the difference between a Bit Array of 100 segments of 64,000 bits = 6,400,000 bits and one of 200 segments of 64,000 bits = 12,800,000 bits. Note: segments are not part of a general Bloom Filter implementation; in the next post, on implementation, I’ll cover why in SQL Server you need to segment the bit array into 64,000 bit chunks.

The figures are drawn from a test: at the end of each insert block a query is made over all rows in the target table, using a new GUID to look up for existence; if the Bloom Filter reports that it exists then it is counted as a false positive (GUIDs are unique, so each time you invoke newid() you get a new, unique value).

select @i, count(*), @start_check
from big_table
	cross join ( select newid() as newguid ) as c
where dbo.fn_bloom_filter_check_varchar( cast( newguid as char(36) ), 
                                         8000 * 8, 
                                         500 ) = 'Y'

[Chart: false positives with 100 segments of 64,000 bits]

[Chart: false positives with 200 segments of 64,000 bits]

A negative look up saves a traversal of the B+Tree: the base table does not need to be accessed for those rows, as they are excluded from the join by the pre-filtering performed using the Bloom Filter. A False Positive simply means that exclusion does not take place and the normal join process of the key look up in the base table is performed, so for table joins there is no change in the result.

The fewer the false positives, the fewer joins need to take place where the lookup row does not exist in the base table. There therefore needs to be a balance between the size of the bit array (and thus the cost of using the Bloom Filter) and the number of False Positives (the effectiveness of the Bloom Filter at excluding rows).

Updating the Bloom Filter

Can you update a Bloom Filter once it is built? Short answer: no, you can’t; longer answer: yes, you can, but it’s not trivial.

Deletes can be accommodated by storing counts. So far I’ve said we store a 1 (a single bit) per hash position on insert; remember, the same hash value may be produced for different inputs, which is why we use multiple hash functions. Using different hash algorithms against the same piece of data will yield different hashes (experiment using HashBytes and try out MD2, MD4, MD5 and SHA1). Instead of using a single bit we can use a group of bits as a counter: increment when a hash requires a cell to be set and decrement when a delete is required. But, again, we need to decide up front how large each counter needs to be.
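As a rough sketch of that counting idea (purely illustrative: I’m using a table of counters here rather than a packed bit array, and the object and variable names are mine, not part of the implementation covered later in this series):

-- One row per cell; pre-populate a single 64,000 cell segment with zeroed counters
CREATE TABLE dbo.bloom_counter_cells
(
    cell_position int      NOT NULL PRIMARY KEY,
    cell_count    smallint NOT NULL DEFAULT 0
);

INSERT dbo.bloom_counter_cells ( cell_position, cell_count )
SELECT TOP (64000) ROW_NUMBER() OVER ( ORDER BY (SELECT NULL) ) - 1, 0
FROM sys.all_objects AS a
	CROSS JOIN sys.all_objects AS b;

DECLARE @pos1 int = 24323, @pos2 int = 101, @pos3 int = 58999, @pos4 int = 7;  -- example hash positions

-- Insert an element: increment the counter of each cell it hashes to
UPDATE dbo.bloom_counter_cells
SET cell_count = cell_count + 1
WHERE cell_position IN ( @pos1, @pos2, @pos3, @pos4 );

-- Delete the element later: decrement the same counters;
-- a cell counts as "set" for membership checks while cell_count > 0
UPDATE dbo.bloom_counter_cells
SET cell_count = cell_count - 1
WHERE cell_position IN ( @pos1, @pos2, @pos3, @pos4 )
	AND cell_count > 0;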

In this article series I’m looking at Bloom Filters purely in an Analytical/Reporting setting.

It should also be remembered that creating the Bloom Filter in the first place requires only a single scan of the original data, without the ugly sorting etc. that building a B+Tree requires.

Summary

A Bloom Filter is simply an array of bits (a Bit Array). We take our input data, hash it a few times to get a number of values within the bounds of the bit array, and then set those array cells to 1. Querying the Bloom Filter works the same way, except that instead of setting the bits to 1 we check that all the bits for the hashes are set to 1.

By its very nature, False Positives can occur because we can get repeated hash values; but False Negatives can never occur, which is the beauty of this method.

The wider the bit array, the lower the likelihood of collisions and false positives; likewise, the more hashes used (the more cells used to store a single element), the lower the likelihood of collisions and false positives. However, the more hashes you use, the faster the bit array fills up.

I strongly recommend reading the original paper, which describes the maths: how to calculate the array size, the probabilities of false positives, etc.
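For quick reference, the standard textbook approximation (not quoted from the paper itself) is:

false positive rate ≈ (1 - e^(-k*n/m))^k
optimal number of hashes k ≈ (m / n) * ln 2

where m is the number of bits in the array, n is the number of elements inserted and k is the number of hashes per element.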

Edits

From @AndrewJacksonZA, a really great link to a well-researched list of hash algorithms: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed