Partition switching needs to be smarter

A lot has been said already on how to do partition switching in SQL Server. I would suggest reading this blog post by Cathrine Wilhelmsen, which is my go-to whenever I forget the syntax for partition switching.

Partition switching needs to be smarter. Or at least, it needs to not give me an error that is a false negative.

See the example below. The SQL Server version is SQL Server 2017 CU15.

DROP TABLE IF EXISTS PartitionSwitchTarget;
DROP TABLE IF EXISTS PartitionSwitchSource;
GO
IF EXISTS ( SELECT 1
              FROM sys.partition_schemes
              WHERE name = N'PartitionSwitchScheme' )
BEGIN
  DROP PARTITION SCHEME PartitionSwitchScheme;
END;
GO
IF EXISTS ( SELECT 1
              FROM sys.partition_functions
              WHERE name = N'PartitionSwitchFunction' )
BEGIN
  DROP PARTITION FUNCTION PartitionSwitchFunction;
END;
GO

CREATE PARTITION FUNCTION PartitionSwitchFunction( bigint )
  AS RANGE RIGHT FOR VALUES( 1,
                             100,
                             1000 );
GO
CREATE PARTITION SCHEME PartitionSwitchScheme
  AS PARTITION PartitionSwitchFunction
  ALL TO ( [PRIMARY] );
GO

CREATE TABLE PartitionSwitchSource
(
  Id bigint
) ON PartitionSwitchScheme (Id);
GO
CREATE TABLE PartitionSwitchTarget
(
  Id bigint
);
GO


ALTER TABLE dbo.PartitionSwitchTarget
ADD CONSTRAINT MyCheckConstraint CHECK( Id >= 1
                                   AND Id <= 99
                                   AND Id IS NOT NULL );

GO
ALTER TABLE dbo.PartitionSwitchSource 
  SWITCH PARTITION 2 
  TO dbo.PartitionSwitchTarget

This is a RANGE RIGHT bigint partition function. This means that the left endpoints are included in each range and the right endpoints are excluded. When I run this on my SQL Server instance, I get this message:
Msg 4972, Level 16, State 1, Line 47
ALTER TABLE SWITCH statement failed. Check constraints or partition function of source table 'TestDb.dbo.PartitionSwitchSource' allows values that are not allowed by check constraints or partition function on target table 'TestDb.dbo.PartitionSwitchTarget'.

Since this is a bigint partition function, the “less than or equal to 99” predicate is logically equivalent to the “strictly less than 100” predicate. In fact, if I change the check constraint above to say “< 100”, the partition switch works without issue.
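
For example, something like this (a sketch against the demo objects above; the target table is still empty because the switch failed) lets the switch go through:

-- Recreate the check constraint with a strictly-less-than predicate
ALTER TABLE dbo.PartitionSwitchTarget
  DROP CONSTRAINT MyCheckConstraint;
GO
ALTER TABLE dbo.PartitionSwitchTarget
ADD CONSTRAINT MyCheckConstraint CHECK( Id >= 1
                                   AND Id < 100
                                   AND Id IS NOT NULL );
GO
-- The switch now succeeds
ALTER TABLE dbo.PartitionSwitchSource
  SWITCH PARTITION 2
  TO dbo.PartitionSwitchTarget;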

The SQL Server optimizer also has problems with these types of filters. This blog post by Paul White has a pretty good write up about it:

“The SQL Server 2008 (and later) optimizers do not quite get the internal logic right when an interval references, but excludes, a boundary value belonging to a different partition. The optimizer incorrectly thinks that multiple partitions will be accessed, and concludes that it cannot use the single-partition optimization for MIN and MAX aggregates.”

Indeed, it goes beyond MIN/MAX aggregates. Look at this query against the same partitioned table from above:

SELECT COUNT( * )
  FROM PartitionSwitchSource
  WHERE Id >= 1
    AND Id < 100;

The optimizer accesses two partitions:

[Screenshot: query plan showing 2 partitions accessed]

If I change the filter to “less than or equal to 99”, only 1 partition is accessed:

SELECT COUNT( * )
  FROM PartitionSwitchSource
  WHERE Id >= 1
    AND Id <= 99;

[Screenshot: query plan showing 1 partition accessed]

If you’re a developer like me and you are trying to write code to return the “appropriate” filter for a table that has partitions, it can be tricky. On the one hand, I apparently must use “strictly less than 100” so that I can do a partition switch later. On the other hand, I have to use “less than or equal to 99” so that SQL Server won’t access the partition to the right when it isn’t necessary to do so.

I have had success writing the filter like this:

  WHERE Id >= 1
    AND ( Id < 100
          OR Id <= 100 - 1 );

This filter in the check constraint allows for partition switching as desired. It also convinces the optimizer that it is OK to only look at 1 partition. For now, this may be the workaround.
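
To show the whole thing in one place, here is a sketch (again using the demo objects above) that puts the combined predicate into the check constraint, performs the switch, and then uses the same predicate as a query filter:

ALTER TABLE dbo.PartitionSwitchTarget
  DROP CONSTRAINT MyCheckConstraint;
GO
-- "Id < 100" satisfies the partition switch validation;
-- "Id <= 100 - 1" keeps the optimizer to a single partition
ALTER TABLE dbo.PartitionSwitchTarget
ADD CONSTRAINT MyCheckConstraint CHECK( Id >= 1
                                   AND ( Id < 100
                                         OR Id <= 100 - 1 )
                                   AND Id IS NOT NULL );
GO
ALTER TABLE dbo.PartitionSwitchSource
  SWITCH PARTITION 2
  TO dbo.PartitionSwitchTarget;
GO
-- The same combined predicate as a WHERE clause still touches only one partition
SELECT COUNT( * )
  FROM PartitionSwitchSource
  WHERE Id >= 1
    AND ( Id < 100
          OR Id <= 100 - 1 );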

Deadlock anatomy – transactionname

I want to have a series of blog posts where I talk about different parts of the deadlock XML report and when they might matter. There is not a lot of documentation about how to deal with the deadlock XML. There definitely are a few great tools out there for visualizing and diagnosing deadlocks. Even so, I don’t think there’s any one tool out there right now that handles every single case. I very frequently find myself looking at the XML when I need to diagnose and fix a deadlock.

Today I’ll talk briefly about the transactionname attribute. This is an attribute of the “process” element. Normally this doesn’t matter too much. Some typical descriptors you see here (that actually describe what’s happening):
DROPOBJ
SELECT
UPDATE
COND WITH QUERY
INSERT

Usually the query that follows in the execution stack uses that construct. This is why it doesn’t usually matter very much: you will understand the deadlock better by looking at the execution stack anyway. The execution stack may reveal that you are dealing with a DROP TABLE statement (transactionname=”DROPOBJ”) or an IF EXISTS ( SELECT […] ) construct (transactionname=”COND WITH QUERY”). The important thing is that the value accurately describes what the process is actually doing.

I have seen some interesting ones:
SQL Diagnostic Manager Collection Service
MdView

In both cases, the transactionname came from another vendor’s product. Once I realized this, the rest of the deadlock report was easy to understand. Be aware of the applications that are running on your system. If you are encountering some interesting deadlocks that don’t seem to make sense or are hard to diagnose, take a look at the transactionname and see if it makes sense in the context of the deadlock report.
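
If you want to scan recent deadlocks for unusual transaction names, here is a sketch that reads deadlock reports from the default system_health Extended Events session (assuming it is running and the deadlock is still in its ring buffer) and lists the transactionname and client application of each process:

;WITH DeadlockReports AS
(
  -- Pull every xml_deadlock_report event out of the system_health ring buffer
  SELECT XEventData.XEvent.query( '(data/value/deadlock)[1]' ) AS DeadlockGraph
    FROM ( SELECT CAST( st.target_data AS xml ) AS TargetData
             FROM sys.dm_xe_session_targets st
               INNER JOIN sys.dm_xe_sessions s
                 ON s.address = st.event_session_address
             WHERE s.name = N'system_health'
               AND st.target_name = N'ring_buffer' ) AS RingBuffer
      CROSS APPLY RingBuffer.TargetData.nodes(
        'RingBufferTarget/event[@name="xml_deadlock_report"]' ) AS XEventData ( XEvent )
)
-- List the transactionname and client application for each process in each deadlock
SELECT ProcessList.Process.value( '@id', 'nvarchar(50)' ) AS process_id,
       ProcessList.Process.value( '@transactionname', 'nvarchar(200)' ) AS transactionname,
       ProcessList.Process.value( '@clientapp', 'nvarchar(200)' ) AS client_app
  FROM DeadlockReports Reports
    CROSS APPLY Reports.DeadlockGraph.nodes( 'deadlock/process-list/process' ) AS ProcessList ( Process );

Depending on your setup, you may prefer to read the system_health event files instead; the idea is the same.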

TABLOCK is not always the table lock you were expecting

To take advantage of minimal logging and parallel insert (into heaps or clustered columnstore indexes), you have probably seen that you need to write your insert statements like this:

INSERT INTO DestTable WITH( TABLOCK )
( [Column list ... ] )
SELECT [...]

If you are moving a lot of data in batches across multiple sessions, you might expect that it is OK to run statements like this concurrently with each other because they will simply serialize: they all need a table-level exclusive (X) lock to run.

That is not the case. SQL Server will attempt to get an IX lock first, then convert that lock to a BU, or “bulk update” lock. This can cause problems when multiple statements targeting the same destination table run at the same time, even though the BU lock mode is compatible with itself.

If you are playing along at home, you can try this.
Create a target table:

DROP TABLE IF EXISTS dbo.BulkUpdateDestinationHeap;
CREATE TABLE dbo.BulkUpdateDestinationHeap
( Id bigint );
GO

In one session, get a lock on the table and hold it. We do this simply so we can set up other sessions to insert into the table.

BEGIN TRANSACTION;
SELECT Id
  FROM dbo.BulkUpdateDestinationHeap WITH ( TABLOCK, HOLDLOCK );

In two separate sessions, try to load some records into the table using the code pattern mentioned above:

INSERT INTO dbo.BulkUpdateDestinationHeap WITH ( TABLOCK )
( Id )
  SELECT TOP 1000000
         ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) )
    FROM master..spt_values t1
      CROSS JOIN master..spt_values t2;

In a fourth session, you can see what the existing locks and lock requests look like by looking at the sys.dm_tran_locks DMV. Here’s a query that will just return the relevant columns:

SELECT Locks.request_mode,
       Locks.resource_type,
       Tables.name resource_name,
       Locks.request_status,
       Locks.request_session_id
  FROM sys.dm_tran_locks Locks
    INNER JOIN sys.tables Tables
      ON Locks.resource_associated_entity_id = Tables.object_id
  WHERE Locks.resource_type <> 'DATABASE' -- ignore DATABASE lock
  ORDER BY Locks.request_session_id,
           Locks.request_mode;

This is what I see:
[Screenshot: sys.dm_tran_locks output showing the granted S lock and two waiting IX lock requests]

Everything seems OK for now. Since IX locks are not compatible with S locks, both IX lock requests are waiting. However, since an IX lock is compatible with itself, both lock requests are granted as soon as the transaction in the first session commits or rolls back. When that happens, we see these lock requests:
[Screenshot: sys.dm_tran_locks output showing both IX locks granted and two waiting BU conversion requests]

If you wait a few seconds, one of the sessions will eventually be chosen as a deadlock victim by the deadlock monitor and killed. This happens because each session needs to convert its IX lock to a BU lock to continue. BU locks are compatible with one another, but the BU lock mode is not compatible with IX. The BU lock request from session 53 is blocked by the currently held IX lock in session 63. Similarly, the BU lock request from session 63 is blocked by the currently held IX lock in session 53.

The XML deadlock report tells the same story in the resource list:

 <resource-list>
  <objectlock lockPartition="0"
              objid="658101385"
              subresource="FULL"
              dbid="5"
              objectname="LockDemos.dbo.BulkUpdateDestinationHeap"
              id="lock26db1093b00"
              mode="IX"
              associatedObjectId="658101385">
   <owner-list>
    <owner id="process26db55648c8"
           mode="IX" />
    <owner id="process26db55648c8"
           mode="BU"
           requestType="convert" />
   </owner-list>
   <waiter-list>
    <waiter id="process26dc4057468"
            mode="BU"
            requestType="convert" />
   </waiter-list>
  </objectlock>
  <objectlock lockPartition="0"
              objid="658101385"
              subresource="FULL"
              dbid="5"
              objectname="LockDemos.dbo.BulkUpdateDestinationHeap"
              id="lock26db1093b00"
              mode="IX"
              associatedObjectId="658101385">
   <owner-list>
    <owner id="process26dc4057468"
           mode="IX" />
    <owner id="process26dc4057468"
           mode="BU"
           requestType="convert" />
   </owner-list>
   <waiter-list>
    <waiter id="process26db55648c8"
            mode="BU"
            requestType="convert" />
   </waiter-list>
  </objectlock>
 </resource-list>

As for workarounds, depending on your workload you might consider using TABLOCKX instead of TABLOCK. When TABLOCKX is used, SQL Server requests and waits for an X lock on the object instead of an IX lock that it will convert to a BU lock. We don’t encounter the deadlock and both sessions can insert data into the target table. The sessions block each other and are forced to serialize, but there is no deadlock when we use TABLOCKX. You can run through the same demo above using TABLOCKX instead of TABLOCK to confirm.

We still get parallel insert when using TABLOCKX. This can be checked by looking at the query plan.
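
One way to check (a sketch, using the same demo table) is to capture the actual execution plan for the insert and look at whether the Table Insert operator itself runs in parallel:

-- Capture the actual plan for the TABLOCKX insert
SET STATISTICS XML ON;

INSERT INTO dbo.BulkUpdateDestinationHeap WITH ( TABLOCKX )
( Id )
  SELECT TOP 1000000
         ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) )
    FROM master..spt_values t1
      CROSS JOIN master..spt_values t2;

SET STATISTICS XML OFF;

In the plan XML, the RelOp for the Table Insert should have Parallel="true" when the insert itself went parallel; if only the SELECT side of the plan shows parallelism, the insert ran serially. Whether you actually get a parallel plan depends on the usual requirements (costing, the cost threshold for parallelism, and so on).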

I wanted to check how many rows were logged to the transaction log to make sure I was still getting minimal logging. The function fn_dblog is undocumented but I have seen people use it for this purpose. I saw exactly the same number of rows logged whether I was using TABLOCK or TABLOCKX. I get 6280 rows for both queries (most times, it does not seem to be deterministic) when using the simple recovery model and 9191 rows when using the full recovery model. This is evidence to me that we still get minimal logging too. Here’s the code I used to test that:

/*
Compare how many rows are written to the transaction log
when using TABLOCK v. using TABLOCKX
*/

DECLARE @MaxLsn nvarchar(25);

-- truncate the table
TRUNCATE TABLE dbo.BulkUpdateDestinationHeap;

-- get max LSN
SELECT @MaxLsn = N'0x' + MAX( [Current LSN] )
  FROM sys.fn_dblog( NULL, NULL );

-- use tablock to insert rows
INSERT INTO dbo.BulkUpdateDestinationHeap WITH ( TABLOCK )
( Id )
  SELECT TOP 1000000
         ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
    FROM master..spt_values t1
      CROSS JOIN master..spt_values t2;

-- find out how many rows that was
SELECT COUNT( * ),
       'Using TABLOCK'
  FROM fn_dblog( @MaxLsn, NULL );

-- truncate again
TRUNCATE TABLE dbo.BulkUpdateDestinationHeap;

-- get max LSN
SELECT @MaxLsn = N'0x' + MAX( [Current LSN] )
  FROM sys.fn_dblog( NULL, NULL );

-- use tablockx to insert rows
INSERT INTO dbo.BulkUpdateDestinationHeap WITH ( TABLOCKX )
( Id )
  SELECT TOP 1000000
         ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
    FROM master..spt_values t1
      CROSS JOIN master..spt_values t2;

-- find out how many rows that was
SELECT COUNT( * ),
       'Using TABLOCKX'
  FROM fn_dblog( @MaxLsn, NULL );

The lesson here isn’t to use TABLOCKX instead of TABLOCK. The lesson is that when you specify a “table lock”, you aren’t guaranteeing the lock mode; you are only specifying the resource that the locks will be taken on.