Understanding How Transactions are Applied in Transactional Replication in SQL Server 2005
Occasionally in the replication newsgroups questions come up regarding transactions in transactional replication. People want to know what happens if a transaction on the publisher fails - does it still get applied at the subscriber? Or what if it succeeds on the publisher and fails on the subscriber - does it always/ever roll back? Then there are questions regarding transaction size. If an update statement affects, say, 100000 rows on the publisher, this will be logged as 100000 updates in the transaction log. In some such cases the log reader agent may not be able to cope, and even if the log reader succeeds, there may be similar problems for the distribution agent. Users are interested in the possible workarounds for these situations. There is only a little information in BOL to help clarify and answer these types of question, so I have tied my various findings together into an article which will hopefully help out.
Replication of Simple Transactions
We start with a simple case. Suppose we add 4 records in a
transaction at the publisher.
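A minimal sketch of such a transaction (the table name and values are illustrative only):

```sql
-- A single explicit transaction inserting 4 records at the publisher.
-- dbo.TestTable is a hypothetical published article.
BEGIN TRAN
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (1, 'Command1')
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (2, 'Command2')
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (3, 'Command3')
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (4, 'Command4')
COMMIT TRAN

-- On the distributor, the replicated commands can be inspected with:
-- EXEC distribution.dbo.sp_browsereplcmds
```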
In the output of sp_browsereplcmds, the constant xact_seqno identifies the transaction, and the incrementing command_id indicates the position of each command in the batch. So, transactions on the publisher are applied as transactions on the subscriber.
Failing Transactions on the Publisher
What happens when a transaction fails at the publisher - does it still get replicated? Suppose we perform the same transaction as above, but "Command4" fails. On the publisher there will be an error message, the transaction is rolled back because of the xact_abort setting, and no rows are added. Using sp_browsereplcmds, none of these commands is visible, and none of them reaches the subscriber, showing that only committed transactions are replicated to the subscriber.
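A sketch of the failing case (the table is hypothetical; a duplicate key on the last insert is used here to force the error):

```sql
SET XACT_ABORT ON
BEGIN TRAN
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (1, 'Command1')
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (2, 'Command2')
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (3, 'Command3')
INSERT INTO dbo.TestTable (ID, SomeText) VALUES (3, 'Command4') -- duplicate PK: fails
COMMIT TRAN
-- With XACT_ABORT ON, the whole transaction rolls back on the error,
-- so nothing is committed to the log and nothing is replicated.
```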
Failing Transactions on the Subscriber
What if the transaction completes at the publisher and subsequently fails at the subscriber? This could occur because we haven't been vigilant, and someone has been allowed to edit the subscriber data (yes - it happens :)). In this case the distribution agent fails and raises an error message. The failure itself is expected, but the built-in error handler has a hard-coded exception command: "if @@trancount > 0 rollback tran" (which is actually independent of the xact_abort setting on the publisher), meaning that transactions failing at the subscriber are automatically rolled back. So, until the problem is fixed, the error will prevent the distribution agent from applying any separate part of the transaction - the ACID properties are maintained. Note that this is the default behaviour, which may be overridden (see later).
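When the distribution agent fails like this, the pending commands can be examined in the distribution database to see what it is trying to apply (a sketch; the xact_seqno values below are placeholders taken from the agent's error message):

```sql
-- List the replicated commands held in the distribution database.
USE distribution
EXEC sp_browsereplcmds

-- Optionally narrow the output to the failing transaction, e.g.:
-- EXEC sp_browsereplcmds
--     @xact_seqno_start = '0x0000001900001926000800000000',  -- placeholder
--     @xact_seqno_end   = '0x0000001900001926000800000000'   -- placeholder
```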
Large Transactions which cause Log Reader Agent Timeouts
The next thing to look at is the issue of large
transactions. Remember that all statements are implicitly run as transactions, so
although a TSQL statement might look innocuous at first glance - e.g. a
simple update statement - if it updates 100000 rows, this will be logged as 100000 individual updates. If the log reader agent cannot process such a batch within its allotted time, it fails with:
"The process could not execute 'sp_replcmds' on 'xxxxxx'."
(a) increase the QueryTimeout. This is
the usual solution. It won't decrease the latency of a big transaction, but it
is often sufficient on its own to get things working again.
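QueryTimeout is a command-line parameter of the log reader agent, so one way to raise it is to edit the agent's job step or profile (a sketch; the server and database names and the timeout value are placeholders):

```
-- Log reader agent job step command line with the timeout (in seconds) raised.
-- All names below are placeholders.
-Publisher [MYSERVER] -PublisherDB [MyPublishedDB] -Distributor [MYSERVER]
-DistributorSecurityMode 1 -Continuous -QueryTimeout 7200
```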
Large Transactions which cause Distribution Agent Timeouts or Slowdowns
Just as with the log reader agent we can use the QueryTimeout parameter,
and increasing its value will be the first option to try. In the case of the
distribution agent, though, we have 2 extra parameters: CommitBatchSize, which is
"the number of transactions to be issued to the Subscriber before a COMMIT
statement is issued. The default is 100.", and CommitBatchThreshold,
which is "the number of replication commands to be issued to the Subscriber
before a COMMIT statement is issued. The default is 1000." - both definitions
coming from BOL.
It seems to me that the usefulness of these 2 parameters relates to cases involving blocking issues at the subscriber. Smaller batches mean that fewer locks are held while each separate batch is applied at the subscriber. If there were no subscriber access other than that of the distribution agent, the transaction would overall take longer to apply if we used small batches. However, in the presence of other processes needing to perform reads or updates of committed data at the subscriber (i.e. I'm not referring to dirty reads), blocking issues might mean that smaller batches work more quickly overall. Clearly you'd need to really understand the usage profile of the subscriber to determine if these parameters are going to improve throughput and concurrency. In practice I've never used these parameters, but I know they're there if required.
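For reference, these are distribution agent command-line parameters, so a sketch of an adjusted agent command line might look like this (all names and values are placeholders, chosen only to illustrate smaller, more frequent commits):

```
-- Distribution agent command line fragment (placeholders throughout).
-- Smaller batches commit more often at the subscriber, releasing locks sooner.
-Subscriber [MYSERVER] -SubscriberDB [MySubDB]
-CommitBatchSize 10 -CommitBatchThreshold 100 -QueryTimeout 7200
```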
What if there are multiple publications?
Simon Sabin recently asked me a question regarding multiple publications. For the sake of simplicity, consider the case where there are 2 publications, each of which holds a different table. On the publisher these tables exist in the same database, and likewise on the subscriber. So, if a transaction affects these 2 tables on the publisher, is this replicated as a transaction on the subscriber? Interesting! I tested this and found that the answer is "it depends" :). Basically, if you have set up the subscriptions to share a distribution agent, the transaction will be honoured. If there are independent distribution agents, the transactional boundary is ignored - each agent applies only the part of the transaction referring to the table in its own publication, separately from the other.
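Whether subscriptions share a distribution agent is determined when the publications are created - via the @independent_agent parameter of sp_addpublication (a sketch; the publication name is a placeholder and the other required parameters are omitted):

```sql
-- Run at the publisher when creating each of the 2 publications.
-- @independent_agent = 'false' means subscriptions in the same subscriber
-- database use a shared distribution agent, so a transaction spanning
-- both publications is applied as one transaction at the subscriber.
EXEC sp_addpublication
    @publication = N'Pub1',            -- placeholder name
    @independent_agent = N'false'
    -- other parameters omitted for brevity
```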
Hopefully this article helps clarify how transactions are applied in transactional replication, and how this knowledge can be used to optimise the system.
Paul Ibison, Copyright © 2013