Home  |  Articles  |  Scripts  |  Consultancy  |  FAQs  |  Links  |  SSIS

 

A Big Merge Project On SQL 2005 - a few bugs/issues discovered
article by Paul Ibison

Introduction

I haven't written anything for a while - I was deep into a project to upgrade the Global Foreign Exchange (GFX) system for an investment bank in London. This system controls the FX work for PayPal and EBay so there was a slight bit of pressure :) - it took a year and a lot of late nights but it is now all done and I now have a bit more time to do some proper, interesting work - replication! So, I've recently been doing a few days of replication consultancy here and there but then an old client wanted me to implement a filtered merge setup to replicate to their shipping fleet. We learned all sorts of things that might be useful to others so I'll tell you all the gory details below....

Topology

The requirement was to have a central office replicating inventory data bi-directionally to a fleet of ships. The inventory data includes all objects on a ship, from cranes to toilet rolls. These objects might be purchased independently or centrally and can be transferred from a ship to another or to a shipyard for repairs. There could be conflicts in the data, and objects moving from one ship to another should also re-parent all the child records e.g. the object history should be visible on the ship after you move the object. Merge replication allows for conflict resolution and partition realignment so this was selected to be used as the ideal technology. This client already had a good experience of using merge on SQL 2000 so this was an easy choice to make. We went for the model of a separate publication per vessel sitting at head office, rather than use dynamic filtering. This has an equivalent performance and allows for a more granular control of publication settings (or so we thought - see later!!!). So, the setup is shown below:

I've only show one table and 2 vessels to make the diagram clear but there were 8 articles in total, which were joined together and 20 "vessels" (including warehouses). So - all seems pretty straightforward yes?

Issues

(1) merge publications which share an article

Each table was replicated as an article in a publication to a vessel. We had a separate publication per vessel. This means that the publications can be changed independently. Or does it? In our case the filter is a shortened form of the vessel name and ensures the vessel gets its own data. Each publication is to a vessel and hence has a different filter. A particular filter was incorrectly set on one of the publications as the vessel name had changed, so the filter needed changing for that publication. Changing a filter will cause a reinitailization to the relevant subscriptions and this is documented in BOL and is totally expected. Painful, but bearable as it only affects a single vessel because of our clever setup ;). However this is simply not the case!!!! We ran into this not-fixed bug. Basically if you change a filter clause on an article in one publication, and that article also exists in other publications, all these publications and all their subscriptions will be forced to reinitialize!!. In our case this caused us to incur an extra day of downtime. Initialization to a fleet of vessels over satellite comms is not fun! If a ship is turning or is in a storm, the comms are lost entirely, and even when the comms are up, PING returns a response time of 900ms - it's slow and it's painful!

Lesson learned - if you need to change a publication filter that is linked to other publications in the above manner via an article, drop the subscriptions to the particular publication, and then drop the publication itself. Clean out the subscriber's database using sp_removedbreplication (see below for why!). Recreate the publication with the new filter clause and then add the subscription, and initialize.

(2) adding a new subscriber, which existed in a previous incarnation as a merge subscriber

At some point we needed to reinitialize our vessel's publications to reset an optimisation setting (see #3 for details). The publisher/distributor at head office was disabled, causing all publications and all subscriptions to be dropped. The final replication metadata details were removed using sp_removedbreplication on all databases involved. However this latter part was done incorrectly (not by me!) and some of the vessel databases were not included. No big matter - this is not mentioned as a pre-requisite before setting up replication anyway and is probably just Paul being overcautious - yes? Well not always! If you don't run sp_removedbreplication on a database that was originally set up as a merge subscriber you might find some metadata tables already existing - eg sysmergesubscriptions.

In our case when we re-added the new subscriptions we got the error message "The merge process could not update the list of subscriptions". This is because of a violation of a unique key constraint on the sysmergesubscriptions table. Sure enough, old entries existed there. They also now existed throughout our brand new "clean" topology in the sysmergesubscriptions table ie the merge replication metadata had itself propagated through our system. Remember that the setup was redone from scratch so these entries didn't originally exist on the publisher - they definitely were the result of the new subscriber being added. I have seen this problem once before at a bank when the setup was a SQL 2008 publisher and a mixture of SQL 2000 and SQL 2008 subscribers all using nosync initializations. When at the bank I assumed that it was related to the nosync methodology and the fact that the sysmergesubscriptions table on the subscribers will always pre-exist in this case and be there to pollute the system. However in the case above with the shipping company I am on SQL 2005 and we use using automatic initialization. So, it is not related to nosync at all, and nothing to do with a mixed environment either. But the common thread is the pre-existence of the syssubscriptions table. There are a few posts out there mentioning others having the same problem, but none of them offer a complete explanation or a supportable solution. Hacking the data out of the replication metadata tables on all subscribers and the publishers might get things going but this will obviously lead us to an unsupported state going forwards. It might also be the case that rogue lineage data will hang around and cause rogue conflicts - also seen this before.

Lesson learned - the conclusion in this case was to take it all down and set it up from scratch, ensuring sp_removedbreplication was run everywhere correctly. This procedure doesn't error if it fails so after that ensure "select * from sysmergesubscriptions" always returns an error (because the table no longer exists and can therefore not pollute the system). Another day lost in the shipping project :(.

(3) strange, "impossible" conflicts arising!

Once the system was set up and left running a day or so I set off back to the UK, and while at the airport got a call about conflicts occurring. These conflicts were between the publisher and the subscriber.

As an aside, conflicts always appear to be this way - a merge-replication-gripe of mine that the actual 2 conflicting subscribers are not reported and it always seems that the publisher is conflicting with a subscriber, even when this is impossible as the publisher is not updated by an external application!

Anyway, once back in the UK I took a look. In this case I used sp_showlineage to see who had really changed the data to conflict with the vessel. Guess what - it really was the publisher! It seems that the ordering of the merge processing code is at fault here. The update on the subscriber is conflicting with the delete that the merge agent has decided to do to repartition the data. Really the update should propagate and the delete then propagate but NOT be recorded as a change in the merge system. In other words the lineage of the row is incorrectly updated on the publisher before the update is processed. Lizet's excellent ("Brangelina") article blogs about this particular issue. As it is lineage-related, and also related to partitions, what are our options? Basically we need to recreate the publications with @use_partition_groups set to false. Using sp_changemergepublication wasn't enough to make the change operational in my test or live environment (the value was set to -1 in sysmergepublications rather than 0 in the case of a newly set up publication, which may be relevant). Another reinitialization if you didn't set things up this way at the start :(.

Lesson learned - when using filters and joins together I'd turn off the default @use_partition_groups "optimisation" - it has not been fixed by now in SQL 2005 SP3 so this might change, and if it does I'll add an extra comment but for now this default setting should definitely come with a severe health warning!

(4) Foreign keys and NOT FOR REPLICATION - again

I know - this old chestnut. Previously I'd also thought that the NFR setting on FKs was not especially important. I considered it to be relevant for circular relationships in ER diagrams and as such something that is rare, if ever seen at all. However I now see that this can cause an invisible data loss if the schemas are not totally in sync. In my recent work on the shipping databases, I had a post-snapshot script to add a FK to a filtered article on the subscriber. As it turns out this FK was not on the publisher but should have been. So, what are the ramifications of this? Well, the ordering of application of merge changes is governed by the presence of DRI on the publisher. Merge would want to insert a PK record and then a FK one, and if we were doing deletes, we would delete a FK record and then a PK one. What if there isn't a FK there on the publisher when the publication is created? In that case the processing order during a synchronization is pot luck and depends on the order you added the article to the publication. Fair enough - merge can't read our minds for this! However if there is a FK on the subscriber could there be a violation during a delete operation if the PK record is deleted first? Yes - this is exactly what happens! If data moves partition off a subscriber because of the filter clause, a delete is sent to the subscriber, and if the order is incorrect we see an error when the merge agent runs - yes? Well, actually no - the deletes are retried several times in a merge session and then the merge agent completes without error and never retries the deletes! The message you'll see in replication monitor at the end of the synchronization won't mention this problem at all. The details exist solely in the msmerge_history table and you can only identify that there was an issue there.

Lesson Learned - ensure that the DRI exists on the publisher, and set the FKs to NFR as a belt-and-braces approach.

5) Does merge process multiple updates of a single row in any order?

We all know about generations being used to group together changes, and the article processing order and how this is determined but here I am not concerned with any of that :). What I am referring to is multiple updates of a single row on a subscriber with the merge agent not synchronizing - does merge bother about any of this? The reflex response is that there is no issue here - when it does synchronize, the merge agent takes a copy of the row at this point in time and applies it to the publisher. If 1000 updates have been made to the row at the subscriber, still only the single end state of the row is what matters and this is applied. All agreed? After all merge doesn't use the transaction log so how can it have a concept of sequential changes? Well, I found 2 cases where this could be called into question!

(a) programming objects - stored procedures, triggers etc.

Consider if you make a change to a trigger and the table is replicated. One neat thing is that as we are on SQL 2005/8 and have the @replicate_ddl set to true (the default), the changes made to the trigger on the publisher will go down to the subscribers automatically. Neat. In our case we got the dreaded "cannot resolve collation conflict for equal to operation" message after a trigger change was made on the publisher and the merge agent then failed when syncing with a subscriber, causing data changes to back up. The trigger code was at fault so I updated the trigger code on the publisher to fix the issue (by specifying the COLLATE clause). However even though the trigger code on the publisher was now correct vis-a-vis the subscriber, the merge agent still failed to synchronize! Why? Well, I turned to logging to an output file to see what was happening and I found that the old trigger change was still being applied, rather than my modified trigger code. Then I poked around a bit more and found that all the DDL changes we make are queued up sequentially in the table sysmergeschemachange. I was tempted to do a direct edit on the problem rows here but somehow managed to restrain myself :). One other thing I noticed in this table is that the changes to the trigger code are wrapped up in an "IF EXISTS" bracket. So, removing the trigger on the subscriber meant that the merge agent could now run. In my case I didn't want the trigger anyway on the subscriber (long story) so that was fine. If you use this as a solution and you really do need the trigger on the subscriber then you could always add it back in as long as you follow the workaround and disable the MSmerge_tr_altertrigger database ddl trigger first and re-enable afterwards. I haven't yet checked but would assume that the same principle applies to other DDL changes.

Lesson Learned - .

Be sure that your DDL changes will work on all subscribers or you will have issues reversing the changes!

(b) updates to a single row involving: an update to the filter column and another update to a different column

As stated above, only the final state of the row is used at synchronization, so these time-based changes made at the subscriber will effectively get applied in one go at the publisher - yes? Actually this is not entirely true. The sequence of updates is not reproduced at the publisher but the updates to a single row are NOT always done in one go. In fact even if there was a single update on the subscriber this can be replicated as more than one update at the publisher - sounds bizarre I know. Remember that 2 columns are updated at the subscriber - one update on the filter column and another on any other column. After this at the publisher during synchronization the merge agent makes an update to the filter column and subsequently a second update 20ms or so later is made to update the other columns in the row. This generally doesn't make a lot of odds - 20ms is a really short time. However if you have triggers on this table which fire during replication this can be (and is to us!) a huge problem. Auditing changes made to the filter column become a huge pain in the rear, as the initial audited row from the Inserted table in the trigger is actually a version of the row that never existed on the subscriber, and only by combining the data from the 2 audit records can we get a clear picture.

Lesson Learned - .

If you are using filters, be especially careful with triggers that fire during the replication process - they may not return full details of the changed row.

Conclusions

Well the project has just about completed now. Merge is implemented and inventory data is successfully propagating around the fleet. Plenty of painful lessons learned though. I wonder why there is such a lack of official documentation around some of these things, especially the data loss issues with partition groups. I also wonder how many DBAs have things set up this way and haven't yet realised! Overall I'm still a big fan of merge replication in SQL 2005/8 - especially the repartitioning aspects with filters - but sadly my impression is that it currently seems less robust compared to SQL 2000 and there are more caveats to be aware of when setting it up. Hopefully all these things will get cleaned up as the new code matures.