Reducing SQL Server IO and Access Times using Bloom Filters – Part 3 (Inserting Data)

Part 2 (Basics of the method in SQL Server) explained how to get data into a Bloom Filter structure; that structure now needs persisting. This post explains how a Bloom Filter can be stored in a SQL Server database – I assume you have read Part 1 and Part 2 and understand the hashing and the bitwise operations I’m doing.

Remember from Part 1 and Part 2 that a Bloom Filter is a bit array: a cell in the bit array is set to 1 where the array index is the modulus of a hash function over the target data – we do that a number of times, e.g. bitarray(hash1), bitarray(hash2) and bitarray(hash3).

From bit array to bit dimension

Benchmarks were conducted using CLR with varbinary(max), storing the Bloom Filter in a single row/column bit array – performance was extremely poor, so that method was abandoned (the details are out of scope for this post). It was found that good performance could be had using binary(8000), which allows 64,000 bits per row; however, that is too small for indexing a reasonably sized data structure.

In order to increase the available bits, multiple 8,000 byte segments are used. The segment key is defined as smallint, with a range of 0 to the maximum you require; the segment is calculated from the modulus of the original hash of the data, so every bit position for a given data item resides in the same segment – this is for performance and possible future parallelism of the query lookup. Cells are therefore addressed as BloomFilter( segment, {hash1-bit-cell, hash2-bit-cell ….., hashx-bit-cell} ).
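
As a concrete illustration, the cell address for a given key is derived like this (the same MD5 slicing used by the build script later in this post, assuming the 750 segment filter initialised below):

declare @hash binary(16) = hashbytes( 'md5', cast( newid() as char(36) ) );

--	Three bit-cell positions within the 64,000 bit segment (0 to 63,999)
declare @hash1 int = abs( cast( substring( @hash, 1, 4 ) as int ) ) % ( 8000 * 8 );
declare @hash2 int = abs( cast( substring( @hash, 5, 4 ) as int ) ) % ( 8000 * 8 );
declare @hash3 int = abs( cast( substring( @hash, 9, 4 ) as int ) ) % ( 8000 * 8 );

--	The segment the key lives in (0 to 749 for a 750 segment filter)
declare @segment smallint = abs( cast( substring( @hash, 13, 2 ) as smallint ) ) % 750;

select @segment as segment, @hash1 as hash1, @hash2 as hash2, @hash3 as hash3;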

Further refinement of the data can be performed by adding other dimensions; this allows the Bloom Filter to be expandable. For example, should the key contain data that is temporal (year, month) or continuous (an incremental account number), then the Bloom Filter bit dimensions would be (year, segment, bit-cell) or (account-range, segment, bit-cell) – but more on how to do that in a later post on advanced Bloom Filtering.

It is recommended that the bloom_filter table resides on a dedicated filegroup, allowing the data to be held in a contiguous state and reducing its overall footprint in the buffer pool – where, in practice, it will likely reside because of the frequency of access.
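
The create table statement below places the table on a filegroup named BLOOM_FILTER; a sketch of creating it (the database name, file name, path and sizes are placeholders – adjust for your environment):

alter database MyDatabase add filegroup BLOOM_FILTER;

alter database MyDatabase
	add file (	name = 'bloom_filter_01',
			filename = 'D:\Data\bloom_filter_01.ndf',	-- placeholder path
			size = 64MB,
			filegrowth = 64MB )
	to filegroup BLOOM_FILTER;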

Multiple Bloom Filters may be used over the same table; each filter addresses a specific filtering requirement – for example, a filter covering only active orders rather than the entire table.
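
Using the bloom_filter table defined in the next section, an additional filter is simply extra rows under a different filter_name – a sketch, with 'active_orders' as a hypothetical filter name:

--	Clone the segment layout of the 'all' filter as a fresh, empty filter
insert bloom_filter ( filter_name, segment, bloom_filter )
	select 'active_orders', segment, convert( binary(8000), replicate( char(0), 8000 ) )
	from bloom_filter
	where filter_name = 'all';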

Initialise Bloom Filter

The SQL below shows how to initialise a Bloom Filter of 750 segments (0 to 749), giving a capacity of 48,000,000 bits.

create table bloom_filter (
	filter_name 	 varchar(20)      not null,
	segment		 smallint	  not null, 
             primary key clustered ( filter_name, segment ),
	bloom_filter     binary(8000)     not null
	) on BLOOM_FILTER;
 
--
--	Initialise Bloom Filter
--
set nocount on;
 
truncate table bloom_filter; 
 
declare @bloom_filter binary(8000) = convert( binary(8000), 
                                              replicate( char(0), 8000 ) );
 
insert bloom_filter ( filter_name, segment, bloom_filter ) 
        values( 'all', 0, @bloom_filter );
 
declare @i smallint = 1;
while @i <= 749		--	segments 1 to 749; with segment 0 above that gives 750 segments in total
begin
	insert bloom_filter ( filter_name, segment, bloom_filter )
		values( 'all', @i, @bloom_filter );

	set @i = @i + 1;

end
 
select sum( len(bloom_filter) ) / 1024 / 1024. as MiB, count(*) as segments
from bloom_filter;

A quick check of the index statistics to make sure there is little if any fragmentation:

select index_level, 
       avg_fragmentation_in_percent, 
       fragment_count, 
       page_count, 
       min_record_size_in_bytes 
from sys.dm_db_index_physical_stats( db_id(), 
                                     object_id( 'bloom_filter' ), 
                                     null, 
                                     null, 
                                     'detailed' )
order by index_level desc

Build Bloom Filter

The SQL below builds an example “big” base table complete with Bloom Filter – it will be used in further posts in this series.

In order to run the SQL below you will need the CLR function from Part 2 of this series.

Important note: the segment modulus must be the same as the number of segments you built in bloom_filter, otherwise the bit-array addressing will not work.
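
One way to stop the build and lookup code drifting apart is to hold the segment count in a single scalar function and reference it wherever the modulus is taken – a sketch (dbo.bloom_segment_count is a hypothetical helper, not part of the scripts below):

create function dbo.bloom_segment_count()
returns smallint
as
begin
	return 750;	--	must match the number of segments initialised above
end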

create table big_table (
	pk uniqueidentifier not null primary key clustered,
	pk_id bigint not null identity unique,
	col1 bigint not null,
	col2 varchar(1024) not null,
	entry_date datetime not null default( getdate() ),
	hash1	int not null,
	hash2	int not null,
	hash3	int not null,
	segment	smallint not null
	);
 
--
--	Populate test data
--
 
set nocount on;
 
declare @i int = 0;
declare @guid uniqueidentifier;
declare @hash binary(16);
declare @hash1 int,
	@hash2 int,
	@hash3 int,
	@segment smallint;
 
declare @bloom_filter binary(8000);
 
while @i < 10000000		--	10,000,000 rows
begin
	set @guid = newid();
	set @hash = hashbytes( 'md5', cast( @guid as char(36) ) );
 
	set @hash1 = abs( cast( substring( @hash, 1, 4 ) as int ) ) % ( 8000 * 8 );
	set @hash2 = abs( cast( substring( @hash, 5, 4 ) as int ) ) % ( 8000 * 8 );
	set @hash3 = abs( cast( substring( @hash, 9, 4 ) as int ) ) % ( 8000 * 8 );
 
	set @segment = abs( cast( substring( @hash, 13, 2 ) as smallint ) ) % 750;
 
	set @bloom_filter = ( select bloom_filter 
			      from bloom_filter 
			      where segment = @segment
                                and filter_name = 'all' );
 
	insert big_table ( pk, col1, col2
			 , hash1, hash2, hash3
			 , segment )
		values( @guid, cast( 10000000 * rand() as bigint ) % 10000000
			, getdate()	--	col2 takes getdate() implicitly converted to varchar (test data only)
			, @hash1, @hash2, @hash3
			, @segment );
 
	set @bloom_filter = [dbo].[flip_bit](@bloom_filter, @hash1);
	set @bloom_filter = [dbo].[flip_bit](@bloom_filter, @hash2);
	set @bloom_filter = [dbo].[flip_bit](@bloom_filter, @hash3);
 
	update bloom_filter set bloom_filter = @bloom_filter
	where segment = @segment
	  and filter_name = 'all';
 
	set @i = @i + 1;
 
end

To facilitate consistency and code maintainability I’d recommend using an INSERT trigger to keep the filter consistent. The Bloom Filter table may become a hot spot; however, one of the advantages of segmenting the filter is that each segment fits on its own database page, so the insert distribution will likely not cause excessive blocking (oldies like me will remember the technique – it dates from before we had row level locking).
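
A minimal sketch of such a trigger (the trigger name is hypothetical; it assumes the base table carries the hash1/hash2/hash3/segment columns as in the example above, plus the dbo.flip_bit CLR function from Part 2). One caveat: an UPDATE joined to the inserted pseudo-table applies only one source row per target row, so a multi-row insert landing several rows in the same segment would need iterative handling:

create trigger trg_big_table_bloom_insert on big_table
after insert
as
begin
	set nocount on;

	--	Set the three hash bits for each inserted row in its segment
	update bf
	set bloom_filter = dbo.flip_bit(
				dbo.flip_bit(
					dbo.flip_bit( bf.bloom_filter, i.hash1 ),
					i.hash2 ),
				i.hash3 )
	from bloom_filter bf
		inner join inserted i on i.segment = bf.segment
	where bf.filter_name = 'all';
end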

It’s worth pointing out there is an overhead on INSERT to populate the Bloom Filter – as there is for any other index. Having said that, my focus for this implementation of a Bloom Filter is an analytical/reporting context.

To be clear, it’s not mandatory to store hash1, hash2, hash3 and segment in the base table; in the example above they are there simply to help illustrate and test the method.
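
If you prefer not to store them, the same values can be exposed as computed columns over the key – a sketch (hash1_computed is a hypothetical column name; add PERSISTED if you want the value materialised and indexable):

alter table big_table
	add hash1_computed as
		( abs( cast( substring( hashbytes( 'md5', cast( pk as char(36) ) ), 1, 4 ) as int ) ) % ( 8000 * 8 ) );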

I’m deliberately using GUIDs as keys because I believe databases need to be of a distributed design instead of the legacy single-database scale-up model. After populating the table remember to REORGANIZE the index (see below) or, if you really want to be awkward, REBUILD – there is a difference: REBUILD rebuilds the index and fixes fragmentation in the intermediate and root levels as well, whereas REORGANIZE only sorts out the leaf level.

ALTER INDEX ALL ON big_table REORGANIZE

Once you have run the script above you are ready to play.

Test

The querying technique, when to use and further considerations will be covered in the next post in this article series.

In the meantime, let’s check the Bloom Filter has built properly and test the performance. A lookup table will be used, made up of 5,000 random rows from big_table (ones we know exist) as well as 100,000 rows we know do not exist. Remember – the purpose of a Bloom Filter is to pre-filter data.

select pk, 
	   hash1, 
	   power( 2, ( d.hash1 - ( ( ( d.hash1 / 8 ) * 8 ) ) ) ) as hash1_p2,
	   hash2, 
	   power( 2, ( d.hash2 - ( ( ( d.hash2 / 8 ) * 8 ) ) ) ) as hash2_p2,
	   hash3, 
	   power( 2, ( d.hash3 - ( ( ( d.hash3 / 8 ) * 8 ) ) ) ) as hash3_p2,
	   segment
into big_table_lookuptest_precalc
from (  
	select pk,
			hash1 = abs( cast( substring( pk_hash, 1, 4 ) as int ) ) % ( 8000 * 8 ),
			hash2 = abs( cast( substring( pk_hash, 5, 4 ) as int ) ) % ( 8000 * 8 ),
			hash3 = abs( cast( substring( pk_hash, 9, 4 ) as int ) ) % ( 8000 * 8 ),
			segment = abs( cast( substring( pk_hash, 13, 2 ) as smallint ) ) % 750
	from (  select pk, hashbytes( 'md5', cast( pk as char(36) ) ) as pk_hash
			from (	select top 5000 pk		--	Ones that exist
					from big_table
					order by newid() ) as n
			 ) as d
		) as d
union all
select pk, 
	   hash1, 
	   power( 2, ( d.hash1 - ( ( ( d.hash1 / 8 ) * 8 ) ) ) ) as hash1_p2,
	   hash2, 
	   power( 2, ( d.hash2 - ( ( ( d.hash2 / 8 ) * 8 ) ) ) ) as hash2_p2,
	   hash3, 
	   power( 2, ( d.hash3 - ( ( ( d.hash3 / 8 ) * 8 ) ) ) ) as hash3_p2,
	   segment
from (
	select pk,
		   hash1 = abs( cast( substring( pk_hash, 1, 4 ) as int ) ) % ( 8000 * 8 ),
		   hash2 = abs( cast( substring( pk_hash, 5, 4 ) as int ) ) % ( 8000 * 8 ),
		   hash3 = abs( cast( substring( pk_hash, 9, 4 ) as int ) ) % ( 8000 * 8 ),
		   segment = abs( cast( substring( pk_hash, 13, 2 ) as smallint ) ) % 750
	from (  select pk, hashbytes( 'md5', cast( pk as char(36) ) ) as pk_hash
			from (	select top 100000 newid() as pk		--	Ones that don't
					from big_table ) as n
			 ) as d
		) as d
 
create unique clustered index cldx on big_table_lookuptest_precalc ( pk )
create unique index nc on big_table_lookuptest_precalc ( segment, pk ) 
							include( hash1, hash2, hash3 )
go

First, test the performance using SQL Server’s out of the box strategies (below). On my laptop (using a 5,400 rpm internal disk) the SQL Server version takes 32 seconds – run the script a few times to get an average, as the first invocation will likely take longer because of the checkpoint.

checkpoint
dbcc dropcleanbuffers
dbcc freeproccache
go
 
declare @start datetime = getdate()
 
select count(*) as cnt, 
	   avg( cast( c.hash1 as bigint ) ),
	   avg( cast( c.hash2 as bigint ) ),
	   avg( cast( c.hash3 as bigint ) )
from big_table_lookuptest_precalc c
where pk in ( select pk from big_table bt )
 
print datediff( second, @start, getdate() )
go

Now the bit you’ve been waiting for: how we query using the Bloom Filter. The query below is equivalent to the SQL query above – yes, at first glance it may scare the hell out of you, but it’s actually pretty straightforward – I will save the detailed explanation for the post on querying. For completeness, the query takes 15 seconds – half the time of the out of the box SQL Server one.

checkpoint
dbcc dropcleanbuffers
dbcc freeproccache
go
declare @start datetime = getdate()
 
select count(*) as cnt, 
	   avg( cast( c.hash1 as bigint ) ),
	   avg( cast( c.hash2 as bigint ) ),
	   avg( cast( c.hash3 as bigint ) )
from big_table_lookuptest_precalc c
where pk in (
		select c.pk
		from bloom_filter bf
			inner HASH join big_table_lookuptest_precalc c on bf.segment = c.segment
		where (   cast( substring( bf.bloom_filter, ( c.hash1 / 8 ) + 1, 1 ) as tinyint )
						& power( 2, ( c.hash1 - ( ( ( c.hash1 / 8 ) * 8 ) ) ) ) = power( 2, ( c.hash1 - ( ( ( c.hash1 / 8 ) * 8 ) ) ) )
 
				  and cast( substring( bf.bloom_filter, ( c.hash2 / 8 ) + 1, 1 ) as tinyint ) 
						& power( 2, ( c.hash2 - ( ( ( c.hash2 / 8 ) * 8 ) ) ) ) = power( 2, ( c.hash2 - ( ( ( c.hash2 / 8 ) * 8 ) ) ) )
 
				  and cast( substring( bf.bloom_filter, ( c.hash3 / 8 ) + 1, 1 ) as tinyint )
						& power( 2, ( c.hash3 - ( ( ( c.hash3 / 8 ) * 8 ) ) ) ) = power( 2, ( c.hash3 - ( ( ( c.hash3 / 8 ) * 8 ) ) ) )
			)
		  and c.pk in ( select pk from big_table )
	 ) 
 
print datediff( second, @start, getdate() )

Insert Collisions

A hash has varying uniqueness depending on the algorithm and the size of the resulting hash; by uniqueness I mean the number of rows before you get a hash you’ve already seen. For example, using CHECKSUM you will get a hash repeat (collision) quite quickly – in tests on AdventureWorks, within 19.5K rows.

We address the bit array from the hash value – taking the full hash modulo the size of the bit array – if you don’t understand that then please go back and read Parts 1 and 2.

On inserting into the bit array we use multiple bit-array addresses; in the example above we use 3 hash values in the range 0 to 63,999. If all three cells in the bit array are already set to 1 then we have a collision.

Collisions on INSERT will lead to false positives on SELECT. A false positive is detrimental to query performance because it causes more rows to be joined to the big base table.

I will cover this more in the post on querying.

Summary

The size of the bit array (i.e. the number of segments) is critical to the performance gains you’ll receive – the fewer false positives, the better the performance, because the pre-filter removes more rows from the join.

Because the point of the Bloom Filter is join pre-filtering, performance improves the fewer rows you expect to get back; correspondingly, if the Bloom Filter step fails to filter out enough rows, your query will actually run slower. There is a balance between the size of the bit array and the number of rows expected through the pre-filter.
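
For reference, the standard estimate of a Bloom Filter’s false-positive rate with m bits, k hash functions and n inserted items (treating the segmented array as one flat bit array) is:

p \approx \left( 1 - e^{-kn/m} \right)^{k}

For the example above – m = 48,000,000 bits, k = 3 and n = 10,000,000 rows – that gives roughly (1 - e^{-0.625})^{3} \approx 0.10, i.e. around 10% of non-matching rows would still pass the pre-filter.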

I will cover deletes in a later post – deletes are possible but impractical.