Reducing SQL Server IO and Access Times using Bloom Filters – Part 2 (Basics of the method in SQL Server) 2

Part 1 addressed Bloom Filter Concepts, if you haven’t already done so its important to start there.

In this post I will show the basics of how we set and query the bit array that holds our Bloom Filter structure.

Step 1 – Hash the target Data element (key)

Multiple hash functions are used over your data element to reduce the impact of collisions, although from a compute perspective it’s not recommended I’m going to use the MD4 encryption hash from SQL Server’s HashBytes – simply because it’s easier for people to understand the technique, also as you’ve probably noticed that’s only one hash function, the reality is you’d probably use a number of simpler but distinct hash functions.

The example assumes a composite key made up of two GUID’s (yes folk – its a great technique to help overcome the randomness of GUID’s and using GUID’s as keys).

1
2
3
4
5
6
7
8
9
10
declare @comp_key binary(32);
set @comp_key = cast( newid() as binary(16) );
print '/// Composite key'
print @comp_key;
set @comp_key = substring( @comp_key, 1, 16 ) + cast( newid() as binary(16) );
print @comp_key;
 
declare @hash binary(16) = hashbytes( 'md4', @comp_key );
print '/// Hash'
print @hash;

Output:

1
2
3
4
5
/// Composite key
0x281914F70DF33F4ABBEF9099ECBCAA5700000000000000000000000000000000
0x281914F70DF33F4ABBEF9099ECBCAA57EA2AF42A7A53664183A7086EA40038D0
/// Hash
0x8284955F8CE1D14859A0B64FF69913A2

Bitwise operator Fundamentals

The binary data type is used, the length indicates the number of bytes, for example binary(16) means we want to store 16 bytes of data, in the example above 32 bytes is required because GUID’s (uniqueidentifier) are 16 bytes in length.

The SUBSTRING function can be used to extract specific byte ranges from the column/variable, just like varchar, the “+” operator is used to concatenate the two binary variables together (just like varchar), we are using binary in the example which is fixed length hence the reason SUBSTRING( @comp_key, 1, 16) is used.

The HashBytes function takes the 32 byte variable (the two concatenated guids i.e. the composite key) and generates a 16 byte hash.

Step 2 – Flip the bits in the Bloom Filter

MD4 produces an evenly distributed hash meaning we can chop the 16 bytes up and still get the randomness and uniqueness abilities we seek to help reduce collisions.

At this point SQL Server T-SQL fails us, there is no easy and efficient way to flip a bit to 1 in a binary column/variable that is above 8 bytes (bigint), the bitwise or i.e. | only work when one side is a numeric data type like bigint and the other is binary. A CLR user defined scalar function is used to flip the bit, the function is designed to take up to binary(8000), anything larger would require the use of SQLFacet (see lines 10 and 12 ).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
using System;
using System.Collections;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
 
public partial class UserDefinedFunctions
{
//    [return: SqlFacet(MaxSize = -1)]
    [Microsoft.SqlServer.Server.SqlFunction]
//    public static SqlBinary flip_bit([SqlFacet(MaxSize = -1)]SqlBinary target, SqlInt32 pos)
    public static SqlBinary flip_bit(SqlBinary target, SqlInt32 pos)
    {
        BitArray ba = new BitArray(target.Value);
        ba.Set(pos.Value, true ); 
 
        byte[] by_result = new byte[target.Length];
        ba.CopyTo(by_result, 0);
 
        return new SqlBinary(by_result);
    }
}

Taking the single MD4 hash, break it into 3 parts each part 4 bytes, the Bloom Filter will be held in a single 8000 byte variable thus giving 8000 * 8 = 64,000 bits to work with. Therefore, to limit the hash within that range take the modulo of the hash part divided by 64,000.

For each hash part set the required bit to 1.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
declare @pos1	int = cast( substring( @hash, 1, 4 ) as int ),
		@pos2	int = cast( substring( @hash, 5, 4 ) as int ),
		@pos3	int = cast( substring( @hash, 9, 4 ) as int )
 
print '/// Bloom Filter Position'
print @pos1
print @pos2
print @pos3
 
set @pos1 = abs( @pos1 )
set @pos2 = abs( @pos2 )
set @pos3 = abs( @pos3 )
 
print '/// Bloom Filter Positive Position'
print @pos1
print @pos2
print @pos3
 
set @pos1 = @pos1 % ( 8000 * 8 )
set @pos2 = @pos2 % ( 8000 * 8 )
set @pos3 = @pos3 % ( 8000 * 8 )
 
print '/// Bloom Filter BitArray Position'
print @pos1
print @pos2
print @pos3
 
declare @bin binary(8000) = cast( replicate( char(0), 8000 ) as binary(8000) );
 
set @bin = [dbo].[flip_bit](@bin, @pos1);
set @bin = [dbo].[flip_bit](@bin, @pos2);
set @bin = [dbo].[flip_bit](@bin, @pos3);

Yep – it really is that easy; you are now done, you have your data in the Bloom Filter.

Step 3 – Query the Bloom Filter

Given a piece of data to check against the Bloom Filter, follow the same procedure above to the point before flipping the bits; given we now have the array positions to query check if all cells for hash1, 2 and 3 above are set to 1, if they are not, then your data definitely does not exist in the data set, if they are all 1 then it probably exists.

Again the problem here is manipulating the 8000 bytes within SQL Server, all we have are three numbers (the hash modulo) that point to the relevant array cells we need to check.

We need to extract the byte where the hash modulo lands on, that will give us a tinyint, on that tinyint we can then do some bitwise checks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
declare @byte_pos1 int = ( @pos1 / 8 ) + 1
declare @byte1     tinyint = substring( @bin, @byte_pos1, 1 )
declare @byte_pos2 int = ( @pos2 / 8 ) + 1
declare @byte2     tinyint = substring( @bin, @byte_pos2, 1 )
declare @byte_pos3 int = ( @pos3 / 8 ) + 1
declare @byte3     tinyint = substring( @bin, @byte_pos3, 1 )
 
print '/// Bloom Filter retrieve'
print @byte_pos1
print @byte1
print @byte_pos2
print @byte2
print @byte_pos3
print @byte3
print power( 2, @pos1 - ( ( @pos1 / 8 ) * 8 ) )
print power( 2, @pos2 - ( ( @pos2 / 8 ) * 8 ) )
print power( 2, @pos3 - ( ( @pos3 / 8 ) * 8 ) )
if @byte1 & power( 2, @pos1 - ( ( @pos1 / 8 ) * 8 ) ) = @byte1
 and @byte2 & power( 2, @pos2 - ( ( @pos2 / 8 ) * 8 ) ) = @byte2
 and @byte3 & power( 2, @pos3 - ( ( @pos3 / 8 ) * 8 ) ) = @byte3
	print 'positive'

Line 15 is simply working out what position the bit is within the byte according to the start of the byte boundary and what the hash is (it’s the difference between the two), knowing that we can simply square it and then do a bitwise &.

Summary

That is pretty much all there is to the technique, the next posts will show how to implement it against real data.