A Big Merge
Project On SQL
2005 - a few bugs/issues discovered
article by Paul Ibison
Introduction
I haven't written anything for a while - I was deep into a project to upgrade
the Global Foreign Exchange (GFX) system for an investment bank in London. This
system controls the FX work for PayPal and EBay so there was a slight bit of
pressure :) - it took a year and a lot of late nights but it is now all done and
I now have a bit more time to do some proper, interesting work - replication!
So, I've recently been doing a few days of replication consultancy here and there but
then an old client wanted me to implement a filtered merge setup to replicate to
their shipping fleet. We learned all sorts of things that might be useful to
others so I'll tell you all the gory details below....
Topology
The requirement was to have a central office replicating inventory data
bi-directionally to a fleet of ships. The inventory data includes all objects on
a ship, from cranes to toilet rolls. These objects might be purchased
independently or centrally and can be transferred from a ship to another or to a
shipyard for repairs. There could be conflicts in the data, and objects moving
from one ship to another should also re-parent all the child records e.g. the
object history should be visible on the ship after you move the object. Merge
replication allows for conflict resolution and partition realignment so this was
selected to be used as the ideal technology. This client already had a good
experience of using merge on SQL 2000 so this was an easy choice to make. We
went for the model of a separate publication per vessel sitting at head office,
rather than use dynamic filtering. This has an equivalent performance and allows
for a more granular control of publication settings (or so we thought - see
later!!!). So, the setup is shown below:

I've only show one table and 2 vessels to make the diagram clear but there
were 8 articles in total, which were joined together and 20 "vessels" (including
warehouses). So - all seems pretty straightforward yes?
Issues
(1) merge publications which share an article
Each table was replicated as an article in a publication to a vessel. We had
a separate publication per vessel. This means
that the publications can be changed independently. Or does it? In our case the
filter is a shortened form of the vessel name and ensures the vessel gets its
own data. Each publication is to a vessel and hence has a different filter. A
particular filter was incorrectly set on one of the publications as the vessel
name had changed, so the filter needed changing for that publication. Changing a
filter will cause a reinitailization to the relevant subscriptions and this is
documented in BOL and is totally expected. Painful, but bearable as it only affects a single
vessel because of our clever setup ;). However this is simply not the case!!!!
We ran into
this not-fixed bug. Basically if you change a filter clause on an article in
one publication, and that article also exists in other publications,
all these publications and all their subscriptions will
be forced to reinitialize!!. In our case this caused us to incur an
extra day of downtime. Initialization to a fleet of vessels over satellite comms
is not fun! If a ship is turning or is in a storm, the comms are lost entirely,
and even when the comms are up, PING returns a response time of 900ms - it's
slow and it's painful!
Lesson learned - if you need to change a publication filter
that is linked to other publications in the above manner via an article, drop
the subscriptions to the particular publication, and then drop the publication
itself. Clean out the subscriber's database using sp_removedbreplication (see
below for why!). Recreate the publication with the new filter clause and
then add the subscription, and initialize.
(2) adding a new subscriber, which existed in a previous incarnation as a
merge subscriber
At some point we needed to reinitialize our vessel's publications to reset an
optimisation setting (see #3 for details). The publisher/distributor at head
office was disabled, causing all publications and all subscriptions to be
dropped. The final replication metadata details were removed using
sp_removedbreplication on all databases involved. However this latter part was
done incorrectly (not by me!) and some of the vessel databases were not
included. No big matter - this is not mentioned as a pre-requisite before
setting up replication anyway and is probably just Paul being overcautious -
yes? Well not always! If you don't run sp_removedbreplication on a database that
was originally set up as a merge subscriber you might find some metadata tables
already existing - eg sysmergesubscriptions.
In our case when we re-added the new subscriptions we got the error message "The
merge process could not update the list of subscriptions". This is
because of a violation of a unique key constraint on the sysmergesubscriptions
table. Sure enough, old entries existed there. They also now existed throughout our brand new "clean"
topology in the sysmergesubscriptions table ie the merge replication metadata
had itself propagated through our system. Remember that the setup was redone
from scratch so these entries didn't originally exist on the publisher - they
definitely were the result of the new subscriber being added. I have
seen this problem once before at a bank when the setup was a SQL 2008 publisher
and a mixture of SQL 2000 and SQL 2008 subscribers all using nosync
initializations. When at the bank I assumed that it was related to the nosync
methodology and the fact that the sysmergesubscriptions table on the subscribers
will always pre-exist in this case and be there to pollute the system. However
in the case above with the shipping company I am on SQL 2005 and we use using
automatic initialization. So, it is not related to nosync at all, and nothing to
do with a mixed environment either. But the common thread is the pre-existence
of the syssubscriptions table. There are a few posts out there mentioning
others having the same problem, but none of them offer a complete explanation or
a supportable solution. Hacking the data out of the replication metadata tables
on all subscribers and the publishers might get things going but this will
obviously lead us to an unsupported state going forwards. It might also be the
case that rogue lineage data will hang around and cause rogue conflicts - also
seen this before.
Lesson learned - the conclusion in this case was to take it all
down and set it up from scratch, ensuring sp_removedbreplication was run
everywhere correctly. This procedure doesn't error if it fails so after that
ensure "select * from sysmergesubscriptions" always
returns an error (because the table no longer exists and can therefore not
pollute the system). Another day lost in the shipping project :(.
(3) strange, "impossible" conflicts arising!
Once the system was set up and left running a day or so I set off back to the
UK, and while at the airport got a call about conflicts occurring. These
conflicts were between the publisher and the subscriber.
As an aside, conflicts always appear to be this way - a
merge-replication-gripe of mine that the actual 2 conflicting subscribers are
not reported and it always seems that the publisher is conflicting with a
subscriber, even when this is impossible as the publisher is not updated by an
external application!
Anyway, once back in the UK I took a look. In this case I used sp_showlineage to see who had really changed the data to conflict with
the vessel. Guess what - it really was the publisher! It seems that the ordering
of the merge processing code is at fault here. The update on the subscriber is
conflicting with the delete that the merge agent has decided to do to
repartition the data. Really the update should propagate and the delete then
propagate but NOT be recorded as a change in the merge system. In other words
the lineage of the row is incorrectly updated on the publisher before the update
is processed. Lizet's excellent ("Brangelina") article blogs about this particular issue. As it
is lineage-related, and also related to partitions, what are our options?
Basically we need to recreate the publications with @use_partition_groups set to
false. Using sp_changemergepublication wasn't enough to make the change
operational in my test or live environment (the value was set to -1 in
sysmergepublications rather than 0 in the case of a newly set up publication,
which may be relevant). Another reinitialization if you didn't set things up
this way at
the start :(.
Lesson learned - when using filters and joins together I'd turn
off the default @use_partition_groups "optimisation" - it has not been fixed by
now in SQL 2005 SP3 so this might change, and if it does I'll add an extra comment but
for now this default setting should definitely come with a severe health warning!
(4) Foreign keys and NOT FOR REPLICATION - again
I know - this old chestnut. Previously I'd also thought that the NFR setting on FKs was not especially
important. I considered it to be relevant for circular relationships in ER
diagrams and as such something that is rare, if ever seen at all. However I now
see that this can cause an invisible data loss if the schemas are not totally in sync. In my recent
work on the shipping databases, I had a post-snapshot script to add a FK to a
filtered article on the subscriber. As it turns out this FK was not on the
publisher but should have been. So, what are the ramifications of this? Well,
the ordering of application of merge changes is governed by the presence of DRI
on the publisher. Merge would want to insert a PK record and then a FK one, and if
we were doing deletes, we would delete a FK record and then a PK one. What if
there isn't a FK there on the publisher when the publication is created? In that case
the processing order during a synchronization is pot luck and depends on the order you
added the article to the publication. Fair enough - merge can't read our minds
for this! However if there is a FK on
the subscriber could there be a violation during a delete operation if the PK record is deleted first? Yes
- this is exactly what happens! If data moves partition off a subscriber because
of the filter clause, a delete is sent to the
subscriber, and if the order is incorrect we see an error when the merge agent
runs - yes? Well, actually no - the deletes are retried several times in a merge
session and then the merge agent completes without error and never retries the
deletes! The message you'll see in replication monitor at the end of the
synchronization won't mention this problem at all. The details exist solely in the
msmerge_history table and you can only identify that there was an issue there.
Lesson Learned - ensure that the DRI exists on the publisher,
and set the FKs to NFR as a belt-and-braces approach.
5) Does merge process multiple updates of a single row in any order?
We all know about
generations
being used to group together changes, and the
article processing order
and how this is determined but here I am not concerned with any of that :). What
I am referring to is multiple updates of a single row on a subscriber with the
merge agent not synchronizing - does merge bother about any of this? The reflex
response is that there is no issue here - when it does synchronize, the merge
agent takes a copy of the row at this point in time and applies it to the
publisher. If 1000 updates have been made to the row at the subscriber, still
only the single end state of the row is what matters and this is applied. All
agreed? After all merge doesn't use the transaction log so how can it have a
concept of sequential changes? Well, I found 2 cases where this could be called
into question!
(a) programming objects - stored procedures, triggers etc.
Consider if you make a change to a trigger and the table is replicated. One
neat thing is that as we are on SQL 2005/8 and have the @replicate_ddl set to
true (the default), the changes made to the trigger on the publisher will go
down to the subscribers automatically. Neat. In our case we got the dreaded "cannot
resolve collation conflict for equal to operation" message after
a trigger change was made on the publisher and the merge agent then failed when
syncing with a subscriber, causing data changes to back up. The trigger code was
at fault so I updated the trigger code on the publisher to fix the issue (by
specifying the COLLATE clause). However even though the trigger code on the
publisher was now correct vis-a-vis the subscriber, the merge agent still failed
to synchronize! Why? Well, I turned to
logging to an output file
to see what was happening and I found that the old trigger change was still
being applied, rather than my modified trigger code. Then I poked around a bit
more and found that all the DDL changes we make are queued up sequentially in
the table sysmergeschemachange. I was tempted to do a direct edit on the
problem rows here but somehow managed to restrain myself :). One other thing I
noticed in this table is that the changes to the trigger code are wrapped up in
an "IF EXISTS" bracket. So, removing the trigger on the subscriber meant that
the merge agent could now run. In my case I didn't want the trigger anyway on
the subscriber (long story) so that was fine. If you use this as a solution and
you really do need the trigger on the subscriber then you could always add it
back in as long as you follow
the workaround and
disable the
MSmerge_tr_altertrigger database ddl trigger first and
re-enable afterwards. I haven't yet checked but would assume that the same
principle applies to other DDL changes.
Lesson Learned - .
Be sure that your DDL changes will work on all subscribers or you will have
issues reversing the changes! (b) updates to a single row involving: an
update to the filter column and another update to a different column
As stated above, only the final state of the row is used at synchronization,
so these time-based changes made at the subscriber will effectively get applied
in one go at the publisher - yes? Actually this is not entirely true. The
sequence of updates is not reproduced at the publisher but the updates to a
single row are NOT always done in one go. In fact even if there was a single
update on the subscriber this can be replicated as more than one update at the
publisher - sounds bizarre I know. Remember that 2 columns are updated at the
subscriber - one update on the filter column and another on any other column.
After this at the publisher during synchronization the merge agent makes an
update to the filter column and subsequently a second update 20ms or so later is
made to update the other columns in the row. This generally doesn't make a lot
of odds - 20ms is a really short time. However if you have triggers on this
table which fire during replication this can be (and is to us!) a huge problem.
Auditing changes made to the filter column become a huge pain in the rear, as
the initial audited row from the Inserted table in the trigger is
actually a version of the row that never existed on the subscriber, and only by
combining the data from the 2 audit records can we get a clear picture.
Lesson Learned - .
If you are using filters, be especially careful with triggers that fire
during the replication process - they may not return full details of the changed
row. Conclusions
Well the project has just about completed now. Merge is implemented and
inventory data is successfully propagating around the fleet. Plenty of painful
lessons learned though. I wonder why there is such a lack of official
documentation around some of these things, especially the data loss issues with
partition groups. I also wonder how many DBAs have things set up this way and
haven't yet realised! Overall I'm still a big fan of merge replication in SQL
2005/8 - especially the repartitioning aspects with filters - but sadly my
impression is that it currently seems less robust compared to SQL 2000 and there
are more caveats to be aware of when setting it up. Hopefully all these things
will get cleaned up as the new code matures. |