How Hekaton (XTP) achieves Durability for “Memory Optimised” Tables 2

“Memory Optimised” tables reside entirely in memory; INSERT, UPDATE, DELETE and SELECT operations against them are all significantly quicker. But just how is it possible to achieve durability for memory optimised tables while still delivering that significantly higher throughput?

As with traditional on-storage tables, durability is achieved once the data has been written and hardened off to the transaction log – there is no difference there: both memory optimised and traditional on-storage tables use the transaction log as the means of making data changes durable. There is, however, a very big difference in how and when that happens.

Memory optimised tables can be defined with DURABILITY = SCHEMA_AND_DATA (durable) or DURABILITY = SCHEMA_ONLY (non-durable); SCHEMA_ONLY tables perform no IO to physical storage at all.
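For example, a non-durable staging table is declared like this – a minimal sketch (the table and column names are hypothetical, and it assumes a database that already has a MEMORY_OPTIMIZED_DATA filegroup, like the one created later in this post):

create table staging_schema_only (
	id	int not null
		primary key nonclustered hash with ( bucket_count = 131072 ),
	payload varchar(100) not null
	) WITH ( MEMORY_OPTIMIZED = ON,
	         DURABILITY = SCHEMA_ONLY );	-- the schema survives a restart, the data does not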

Both durable and non-durable tables can participate in a transaction; that is to say you can do a BEGIN TRAN, make a load of changes and then either COMMIT or ROLLBACK. They are not like table variables, for instance, which are not transactional – a point the sketch below illustrates.
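As a quick illustration of that difference (a sketch, not part of the examples later in the post): a table variable keeps its rows across a ROLLBACK, whereas rows written to a memory optimised table in the same pattern are undone.

declare @tv table ( id int not null );

begin tran;
	insert @tv ( id ) values ( 1 );
rollback tran;

-- returns 1: the insert into the table variable was NOT undone by the rollback
select count(*) as rows_still_present from @tv;

-- the same BEGIN TRAN / INSERT / ROLLBACK against a memory optimised table
-- (durable or SCHEMA_ONLY) leaves the table empty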

Data in a memory optimised table is only written to physical storage on COMMIT TRANSACTION – don’t worry, I’ll show how and why shortly.

To add a bit of spice into the dish of confusion, in this article I’m assuming you are not using DELAYED_DURABILITY, which I cover in another post here: http://dataidol.com/tonyrogerson/2014/01/08/throughput-improvement-through-delayed-durability-on-commit-tran-from-sql-server-2014/.
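If you want to double-check that assumption on your own instance, the database-level setting is visible in sys.databases (shown here for the demo database created later in this post):

select name, delayed_durability_desc	-- DISABLED means every commit is fully durable
from sys.databases
where name = N'xtp_logdemo';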

What is Durability?

The SQL Server relational database is said to be ACID (Atomicity, Consistency, Isolation and Durability) compliant. ACID is associated with transactional processing rather than the relational model or relational databases per se; it means any transaction performed by the database follows a standard set of properties. We are only really interested in Durability for this article.

Durability means that the database can always be brought back to a consistent state in which any committed transactions are reflected and any un-committed or rolled back changes are returned to their original values.

What is a Transaction?

A transaction can be auto-commit or manual-commit. An auto-commit transaction is simply a single statement executed without being wrapped inside a transaction, for example an INSERT without explicitly creating a transaction; conversely, a manual commit is where a transaction is started and statements are executed within it until we either roll back or commit the transaction.

Example: auto-commit:

-- When @@TRANCOUNT = 0

UPDATE mytable SET trans_amount = 0

Example: manual commit:

BEGIN TRAN

UPDATE mytable SET trans_amount = 0

COMMIT TRAN

Traditional “on-storage” tables and the Transaction Log

Consider the scenario: you have 8GiB of memory and a 500GiB table, and you want to update a column that touches all the rows – for argument’s sake the whole 500GiB needs to be updated. The auto-commit statement is shown below:

UPDATE mytable SET trans_amount = trans_amount / 10

The table will not fit in memory – 500GiB does not fit into 8GiB – so SQL Server has to update the table in pieces: bring rows in, update them, write rows out. While doing this it must be able to recover the data to the last committed values before the current transaction in case the server crashes or the statement fails.

SQL Server uses Write Ahead Logging (WAL). For an UPDATE, the rows are changed in the buffer pool (memory) and the old and new values are written to the transaction log (LDF). As we crawl through the 500GiB table (read, update and write), the lazywriter process writes the dirty pages (the changed data) to the data file (MDF) whenever new pages need to be brought from storage into the buffer pool. You therefore end up with data pages in the MDF containing the new values before the transaction commits – hence the requirement for WAL, and the requirement to hold the old values in the transaction log, giving us the ability to re-apply the old values on rollback / undo.

This process must be synchronous – a thread performs an IO request and cannot continue until the data is hardened off onto storage.

As you may start to gather, this process causes a significant amount of IO: not only do you have to log the old and new rows, you have to do it in such a way that the log records are hardened off – and synchronous writes slow things down a lot!
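One rough way to see the cost of those synchronous log writes (a sketch – the counters are cumulative since the instance started) is the WRITELOG wait type, which accumulates while threads wait for log records to be hardened:

select wait_type, waiting_tasks_count, wait_time_ms
from sys.dm_os_wait_stats
where wait_type = N'WRITELOG';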

Basically we carry a huge overhead to cater for the ability to update data that may not entirely fit in memory.

The example code below shows what happens on the transaction log during a transaction inserting a number of rows:

First create a database that has an area for storing memory optimised tables, because we are going to do a like-for-like comparison between on-storage and memory optimised tables:

CREATE DATABASE xtp_logdemo ON  
	PRIMARY 
		( NAME = N'xtp_logdemo_Data', 
		  FILENAME = N'c:\SQLDATA\xtp_logdemo_Data.mdf'
		   ), 
 	FILEGROUP [xtp_logdemo_mod] 
		CONTAINS MEMORY_OPTIMIZED_DATA
			( NAME = N'xtp_logdemo1_mod', 
			  FILENAME = N'c:\SQLDATA\inmem\xtp_logdemo1_mod' , 
			  MAXSIZE = 4GB)
	LOG ON 
		( NAME = N'xtp_logdemo_Log', 
		  FILENAME = N'c:\SQLDATA\xtp_logdemo_log.ldf',
		  SIZE = 512MB
		   );
go
 
use xtp_logdemo;
go
 
--
--	Set recovery on database so we can keep logs
--
ALTER DATABASE xtp_logdemo SET RECOVERY FULL;
 
BACKUP DATABASE xtp_logdemo TO DISK = N'C:\temp\xtp_logdemo.bak' with init;
BACKUP LOG xtp_logdemo TO DISK = N'C:\temp\xtp_logdemo.trn' with init;
go

Now we have the database set up, create a normal table; we then checkpoint and back up the log because we aren’t interested in the log records for the table definition. As an aside, for memory optimised tables defined with the SCHEMA_ONLY option the schema itself – the definition – is durable; it’s just the data in the table that isn’t!
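You can check which tables are memory optimised and which durability they were created with from sys.tables:

select name, is_memory_optimized, durability_desc	-- SCHEMA_AND_DATA or SCHEMA_ONLY
from sys.tables;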

A quick note about fn_dblog and, later, fn_dblog_xtp: these are table-valued functions that expose the transaction log. The transaction log is a live thing and other processes may be writing to it, so make sure you are the only person doing anything against the database. I have also found I need to give it a few seconds before I can truly find the end of the log – that allows background activity to settle down. Basically, if you see something odd, redo the example.
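If you would rather not copy the LSN by hand, a small convenience sketch of my own (using the same '0x' + [Current LSN] convention as the '{end of the log}' placeholders in the scripts below) pauses for background activity to settle, captures the end of the log and reads from there:

waitfor delay '00:00:05';	-- let background log activity settle

declare @end_of_log nvarchar(46);

select top 1 @end_of_log = N'0x' + [Current LSN]
from fn_dblog( null, null )
order by [Current LSN] desc;

select [Current LSN], Operation, AllocUnitName
from fn_dblog( @end_of_log, null )
order by [Current LSN];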

------------
----	on-storage table
---	
create table logtest_onstorage (
	id	int not null,
	changecol int not null,
	bigrow_expander char(1024) not null,
 
	primary key nonclustered ( id )
	);
 
--
--	Insert a row so we aren't logging page allocations
--	
insert logtest_onstorage( id, changecol, bigrow_expander )
	values( 0, 1, replicate( 'b', 1024 ) );
 
checkpoint;
 
--
--	Trim out the log
--
BACKUP LOG xtp_logdemo TO DISK = N'C:\temp\xtp_logdemo.trn' with init;
go
 
--
--	Find end of the log
--
select top 1 *
from fn_dblog( null, null )
order by [Current LSN] desc
 
select * 
from fn_dblog( '0x{end of the log}', null )
order by [Current LSN]
 
 
--
--	Insert some rows in a single transaction
--
SET NOCOUNT ON;
 
BEGIN TRAN;
 
	declare @i int = 1;
	while @i <= 5
	begin;
 
		insert logtest_onstorage( id, changecol, bigrow_expander )
			values( @i, 1, replicate( 'b', 1024 ) );
 
		set @i = @i + 1;
 
	end;
 
 
	PRINT @@TRANCOUNT;
 
--
--	Anything while still inside the transaction?
--
select [Current LSN], Operation, AllocUnitName, [Transaction Name]
		, case when operation in ( 'LOP_BEGIN_XACT', 'LOP_COMMIT_XACT' ) then operation else '' end as xact_marker
from fn_dblog( '0x{end of the log}', null )
order by [current lsn] 
go

The above has inserted a number of rows within a single “manual” transaction which we have not yet committed, yet what you will find is that a lot of data has already been written to the transaction log – this is Write Ahead Logging (WAL) in action. There are log records for all sorts of things and not just your inserted data – see the result below. Remember – we have NOT committed the transaction yet!

[Figure: fn_dblog output showing the log records already written while the transaction is still open]

Progress the example further and see what happens when we commit the transaction:

COMMIT TRAN
go
 
--
--	Anything additional?
--
select [Current LSN], Operation, AllocUnitName, [Transaction Name]
		, case when operation in ( 'LOP_BEGIN_XACT', 'LOP_COMMIT_XACT' ) then operation else '' end as xact_marker
from fn_dblog( '0x{end of the log}', null )
order by [current lsn] 
go

Not a lot! A single log record is written to the transaction log (LOP_COMMIT_XACT). It’s an important record though – it marks the transaction as complete, so if the database required recovery the data in your transaction would be applied; had the LOP_COMMIT_XACT not been written, your data would be undone!

But hopefully you can see (and please play) that WAL is just that – you write ahead (to the log) before writing to the data file, and those writes happen as you invoke the statements.

“Memory Optimised” tables and the Transaction Log

WAL is not the same for memory optimised tables. Sure, there is still write ahead logging, because data is written to the log before it reaches the checkpoint file pairs (the data files), but unlike on-storage tables, for memory optimised tables everything must fit in memory: the current versions, any old versions still required by other active connections because of MVCC (Multi-Version Concurrency Control), and old versions that are no longer required but have not yet been garbage collected out of memory.
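You can watch how much memory those row versions are consuming with the in-memory OLTP DMVs – a quick sketch:

select object_name( object_id ) as table_name,
       memory_used_by_table_kb,
       memory_allocated_for_table_kb
from sys.dm_db_xtp_table_memory_stats;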

There is absolutely zero write activity to physical storage until a COMMIT happens; everything then happens on that COMMIT – yes, the COMMIT does all the work, not the individual statements as with WAL for on-storage tables.

Why no write activity until COMMIT? There is no point! Should the statement fail, we simply go back to the previous version, which is still in memory; should the database go offline, when it comes back online and goes through recovery the current (i.e. last committed) data will simply be read from physical storage into memory.

The example code below demonstrates that there is no write activity to the transaction log on INSERT or ROLLBACK, but on COMMIT everything is written out – try the code with an on-storage table and reflect on the difference!

create table logtest_mot (
	id	int not null,
	changecol int not null,
	bigrow_expander char(1024) not null,
 
	primary key nonclustered hash ( id ) with ( bucket_count = 131072 )
	)  WITH ( MEMORY_OPTIMIZED=ON, 
			  DURABILITY=SCHEMA_AND_DATA );
 
--
--	Insert some data so the CFP are created and only the log records
--	we are interested in are logged.
--
insert logtest_mot( id, changecol, bigrow_expander )
	values( 0, 1, replicate( 'b', 1024 ) );
 
--	Write a checkpoint record
--	(closes under construction CFP and marks Active)
CHECKPOINT;		
 
--	Trim out the log
 
BACKUP LOG xtp_logdemo TO DISK = N'C:\temp\xtp_logdemo.trn' with init;
 
--
--	Find end of the log
--
select top 1 *
from fn_dblog_xtp( null, null )
order by [Current LSN] desc
 
select * 
from fn_dblog_xtp( '0x{end of the log}', null )
order by [Current LSN]
 
 
 
--
--	Insert some rows in a single transaction
--
SET NOCOUNT ON;
 
BEGIN TRAN;
 
	declare @i int = 1;
	while @i <= 5
	begin;
 
		--	WITH (SNAPSHOT) is needed to access a memory optimised table inside
		--	an explicit transaction (unless MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT is ON)
		insert logtest_mot with ( snapshot ) ( id, changecol, bigrow_expander )
			values( @i, 1, replicate( 'b', 1024 ) );
 
		set @i = @i + 1;
 
	end;
 
	PRINT @@TRANCOUNT;
 
 
--
--	Anything because of INSERT?
--
select [Current LSN], Operation, AllocUnitName, [Transaction Name]
		, case when operation in ( 'LOP_BEGIN_XACT', 'LOP_COMMIT_XACT' ) 
					then operation 
			   else '' 
		  end as xact_marker
from fn_dblog_xtp( '0x{end of the log}', null )
order by [current lsn] 
 
-- Try #1
ROLLBACK
 
--
--	Anything because of ROLLBACK?
--
select [Current LSN], Operation, AllocUnitName, [Transaction Name]
		, case when operation in ( 'LOP_BEGIN_XACT', 'LOP_COMMIT_XACT' ) 
					then operation 
			   else '' 
		  end as xact_marker
from fn_dblog_xtp( '0x{end of the log}', null )
order by [current lsn] 
 
 
-- Try #2 (re-run the BEGIN TRAN and the INSERT's)
COMMIT TRAN
 
--
--	Anything because of COMMIT TRAN?
--
select [Current LSN], Operation, AllocUnitName, [Transaction Name]
		, case when operation in ( 'LOP_BEGIN_XACT', 'LOP_COMMIT_XACT' ) 
					then operation 
			   else '' 
		  end as xact_marker
from fn_dblog_xtp( '0x{end of the log}', null )
order by [current lsn] 
 
go

You will see that nothing is written to the transaction log until you actually COMMIT the transaction; all the work is done on the COMMIT, which is in stark contrast to how on-storage tables behave – with Write Ahead Logging all the writes are done as the statements execute and the actual COMMIT just writes a COMMIT marker into the transaction log.

Look how little is written to the log compared with the on-storage table – oh, but hang on, that’s more! I’ve highlighted what is relevant to what we have just done; the extra records appear because garbage collection runs on each active connection after that connection has done its work for the user. The particular thing I want you to see here is that for the memory optimised table there are no index entries logged, whereas for the on-storage table you get a log record for each data row and for each row of the non-clustered index that is also on the table.

[Figure: fn_dblog_xtp output at COMMIT, with the log records relevant to the example highlighted]
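To put numbers against the difference, you can aggregate the log records by operation – a sketch you can run after each COMMIT for both the on-storage and memory optimised examples:

select Operation, count(*) as log_records, sum( [Log Record Length] ) as bytes_logged
from fn_dblog( null, null )
group by Operation
order by bytes_logged desc;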

Basically, when manipulating data in memory optimised tables the COMMIT TRAN may take some time depending on what you’ve been doing. You might think that is a bit odd, but it gives the significant benefit that rows can be bulked together and written down to storage using sequential writes. As a side comment, the COMMIT TRAN also enforces the SNAPSHOT isolation requirement: if another transaction has modified rows you have modified in your transaction, but your snapshot is from before the other transaction’s modifications, then your writes will fail and your transaction will be rolled back.

Checkpoint File Pairs (CFP)

You have committed your transaction and the data has been written to the transaction log – that is the only place it currently sits – so your data is now durable, and that is really where this particular post ends.

I want to mention CFP because at this point your data exists only in the transaction log. It will recover fine – don’t worry – but recovery won’t be as quick, because your data currently sits in the transaction log mixed in with log records for both memory optimised and traditional on-storage tables; we need a CHECKPOINT to occur.

Another post will follow, but basically the offline checkpoint worker reads the data from the transaction log and does the actual write into the CFP; that will only happen on a manual CHECKPOINT or once 512MiB of data has been written to the transaction log, whichever comes sooner.
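You can see the checkpoint file pairs and their states with sys.dm_db_xtp_checkpoint_files – a quick sketch (try running it before and after issuing a manual CHECKPOINT to watch the states change):

select file_type_desc, state_desc, count(*) as file_count
from sys.dm_db_xtp_checkpoint_files
group by file_type_desc, state_desc
order by file_type_desc, state_desc;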

Summary

If you have read this far then well done, ping me a tweet @tonyrogerson and I’ll personally congratulate you :) – probably ask you if you have any questions too!

For memory optimised tables there are no physical writes to storage until the COMMIT TRANSACTION happens; individual statements do not write to the transaction log as they do for normal tables – and everything needs to fit in memory.

The MVCC post is on my todo list, but basically you can burn up memory really quickly if multiple updates are going on alongside long running transactions.