January 28, 2015
Entity Attribute Value (EAV) data model meets SQL Server Clustered Columnstore for dramatic performance improvement (Part #1) 2
<sarcasm>Want to have complete flexibility adding/removing attributes without DDL, store all your data in a single database table and still have performance then read on….</>
It’s often the complaint of developers they can’t easily adjust the data model and add more attributes to it (columns), rightly or wrongly it happens and it’s a legitimate requirement.
Ordinarily data like this would be stored in a NoSQL database such as Dynamo or Riak (Key/Value store) or a Document database such as MongoDB or CouchDB using JSON/BSON.
In the relational database space we’d model this using an Entity Attribute Value (EAV) design, however because of legacy indexing structure (B+Tree’s) indexing is far from useful, frankly it’s next to useless.
Example EAV table
create table eav_table_nosurrogate ( ObjectType sysname not null, ObjectKeyID int not null, Attribute sysname not null, AttributeValueNumber decimal(28, 2 ) null, AttributeValueString varchar(4096) null, AttributeValueDatetime datetime null, check ( AttributeValueNumber is not null or AttributeValueDatetime is not null or AttributeValueString is not null ) );
In the above table ObjectType defines the entity being modelled, Attribute is self-explanatory, you have a separate column per data type you are modelling – basically you’d not store an integer in a varchar column.
As an example consider we are modelling a single entity Person, which has 4 attributes – First Name, Surname, Date of Birth and Dependants, we have 10,000,000 People to store, ordinarily that would be 10m rows, but in EAV we hold it as one row per person per attribute so 4 x 10,000,000 = 40,000,000 rows.
The column ObjectType contains 1 distinct value in 40m rows (Person), Attribute 4 distinct values in 40m rows (First Name, Surname etc.) and ObjectKeyID has 10m distinct values in 40m rows – some really good compression if only we could store the data in column rather than row format!
Ah but, introduced in SQL Server 2014 is a feature “Columnstore Indexing” which gives us the ability to organise a complete table with column compression, on top of that the data supports inserts, deletes and updates!
Traditional Indexing Performance Issues
By default SQL Server stores data on 8KiB pages in row format – each distinct row follows the next on the page, compression exists at a column level (within the row) or at the database page level for instance when you are using Transparent Data Encryption.
B+Tree indexing in the form of Clustered / Non-Clustered index is the mainstay of the product, they have been around since the product was originally developed, the lack of uniqueness of the values from the columns you’d index on and the row explosion you get because of storing an attribute as a row causes a deeper index tree (more intermediate levels) and as such a more work needs to be done in the index tree, so to seek to a specific row might cost 4 or more page reads. Also, because there is so much duplication of data the whole table becomes extremely large and as such you end up traversing more data through the buffer pool (and therefore the memory bus and processor cores) not to mention the physical IO if the data isn’t in the buffer pool.
The diagram below gives you a very simplistic view of what happens when you use a non-clustered index and require columns from the base table that don’t exist in the non-clustered index. The reality is that not all row seeks will incur 8 page reads because you get scans etc. and it depends what you are doing, but, it’s illustrative of the problem indexing on columns with low cardinality i.e. low number of distinct values.
SQL Server 2014 Columnstore
Column orientated storage is not a new technique, like B+Tree’s it goes back decades, what is new though is its implementation within the relational database engine. The technique was introduced in SQL Server 2012 as non-clustered Columnstore index’s which was a read only structure across a normal row orientated storage table, with SQL Server 2014 we got clustered Columnstore which holds the entire table as column orientated storage with no other index structures allowed.
Looking back at the EAV table defined earlier, assuming we are storing a number then the columns below would be used:
RID ObjectType ObjectKeyID Attribute AttributeValue
1 Person 1 Age 44
2 Person 1 Surname Blogs
3 Person 2 Age 44
4 Person 2 Surname Smith
5 Person 3 Age 50
6 Person 3 Surname Blogs
In the row orientated store we’d literally hold the above 6 rows in the database, we’d likely normalise ObjectType and Attribute out into separate tables in order to reduce the data type down (surrogate key of smallint).
When storing using column orientated storage we arrange the data as columns thus allowing compression.
Person (1), Person (2), Person (3), Person (4), Person (5), Person (6) | 1 (1), 1 (2), 2 (3), 2 (4), 3 (5), 3 ( 6) | Age (1), Surname (2), Age (3), Surname (4), Age (5), Surname (6) | 44 (1), Blogs (2), 44 (3), Smith (4), 50 (5), Blogs (6)
The data can now be compressed to:
Person (1, 2, 3, 4, 5, 6) | 1 (1, 2 ), 2 (3, 4), 3 (5, 6) | Age (1, 3, 5), Surname(2, 4, 6) | 44 (1, 3), 50 (1), Blogs (2, 6), Smith (4)
For a very full explanation of how Clustered Column store works see http://www.nikoport.com/columnstore/
The above has been an introduction to the technique, in Part 2 etc. I’ll go through performance considerations, what to monitor, watch etc.