Increasing ColumnStore compression ratio for archival and “Cold storage”

At my present client, part of the task I have to under take is to re-engineer the “Cold Storage”. This is ~20Terabytes of compressed CSVs going back over 15 years or so. Presently these are held on aging and creaking centera servers that are way past end of life. The solution probably made sense 10 years ago ( or whenever it was designed ) but now it doesn’t and the clients first thought was to throw it all into Orace RAC with page compression. Seems like a vaguely sensible ( if expensive) way to handle the data. But, maybe there is an even better way and right now I would say that ColumnStore compression is that better way.

As a really simple POC I took some pseudo-real sample data and threw it into a SQL Server table. Data size was approx ~300MB. After adding page compression, that size came down to ~42MB. That is roughly inline with the expected estimates ( and I would hazard a guess what could be achieved by Oracle page compression). So how does columnstore compare ? Adding a clustered columnstore index took datasize own to 2,128KB, Yes from 300MB to 2MB that is amazing. But wait, it gets better using the COLUMNSTORE_ARCHIVAL setting that applies a variant of the LZH compression to (presumably) each rowgroup the datasize falls even further to 464KB. From 300MB to less that 1MB now, that is compression for you.

The other side of the coin is query-speed ( this is arguably less important at the client ) and something I am yet to measure and compare but will do when I get my dirty hands on the real data.

However, the achievable compression can be increased even further with a little bit of thought and jigging of the data. Simplistically columnstore is all about repeating data over a column, the meta data says “In Column X I have the value Y and that happens for the next Z rows”. My data has some large almost sequential columna, transaction reference, batch transaction reference etc.. stored as large-ish chars of 16,20, and 32 bytes in length. These are fairly unique throughout the data and as such column store compression has to specify the full value on each and every row. What would be the impact if we only held that part of each reference that was changing ?, i.e. the whole 32 characters are held but only the last one would be changing as these are effectively “Strings that look like integers”.

A quick bit of guestimation, for simplicities sake splitting the column into 2 one of 31 chars and one of 1.

10 rows *32 bytes = 320

Against

1 row *31 bytes + 10 of 1 = 41 bytes

Interesting, ~15% compression. Lets push this idea further and create a million row table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
drop table CCSTest1
go
Create Table CCSTest1
(
Col1 char(23) not null
)
go
with cteRown
as
(
Select top(1000) ROW_NUMBER() over( order by (select null)) as r
from sys.columns a cross apply sys.columns b
)
,
ctePad
as
(
Select REPLICATE('0',4-len(cast(r as varchar(4))))+cast(r as varchar(4)) as PadR
from cteRown
)
insert into CCSTest1
select r1.PadR+'000000000000000'+r2.PadR
from ctePad r1
cross join ctePad r2
 
go
Create clustered index CCSTest1idx on CCSTest1(Col1)

Using sp_spaceused we can see the data size is 31,632KB.

Adding page compression

1
ALTER TABLE CCSTest1 REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

Brings the datasize down to 13,120KB not to shabby.

Let us now switch to a clustered columstore

1
CREATE CLUSTERED COLUMNSTORE INDEX CCSTest1idx ON CCSTest1 with (drop_existing = on)

This bring the data storage even further down to 2,864KB, quite neat in itself even before we change the underlying data structure. Note that adding the COLUMNSTORE_ARCHIVAL option has no effect on the datasize, this is due to the dictionary that is used by default to add an extra level of compression to varchar data.

Lets now split Col1 into smaller ( and more importantly for columnstore, repeatable and repeating) segments.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
drop table CCSTest2
go
Create Table CCSTest2
(
Col1 char(19) not null,
Col2 char(4) not null
)
go
with cteRown
as
(
Select top(1000) ROW_NUMBER() over( order by (select null)) as r
from sys.columns a cross apply sys.columns b
)
,
ctePad
as
(
Select REPLICATE('0',4-len(cast(r as varchar(4))))+cast(r as varchar(4)) as PadR
from cteRown
)
insert into CCSTest2(Col1,Col2)
select r1.PadR+'000000000000000',r2.PadR
from ctePad r1
cross join ctePad r2
go
Create clustered index CCSTest2idx on CCSTest2(Col1,Col2)

Datasize here is 31,632KB, no real difference from before.

Now add the columnstore index

1
CREATE CLUSTERED COLUMNSTORE INDEX CCSTest2idx ON CCSTest2 with (drop_existing = on)

The datasize is now 1,392KB less than half of the original and certainly a practice that I shall be employing to further increase the potential compression on data. Yes, query speed will be ever so slightly degraded but this is perfectly acceptable in my scenario.

If you are using columnstore compression, to achieve the ‘optimal’ compression ratio there are still things you have to do and hoops to jump through. Consider splitting columns into multiple parts to increase the ‘repeatability’ of data over sequential rows and ensure that a clustered index is built upon the ‘raw’ data prior to adding a columnstore to define what that ‘sequential’ order is. See also “Data Loading performance considerations with Clustered Columnstore indexes“, though more concerned with throughput, does contain information of compression ratio.

2 thoughts on “Increasing ColumnStore compression ratio for archival and “Cold storage”

  1. The second important consideration is performance. Obviously, decompression adds an overhead, which you would like to avoid on the large scope. For example, it is the bad idea to have a query that scans large amount of data and performs decompression on every row to evaluate the predicate against one of compressed attributes. For example, query shown below would be highly inefficient.

    • Indeed yes, performance should always be a consideration. However in this case the balance of performance against storage space is more than acceptable. This is for “Cold Storage” ie archived data that needs to be accessible but not required to be instantly available.

Leave a Reply

Your email address will not be published. Required fields are marked *