Home  |  Articles  |  Scripts  |  Consultancy  |  FAQs  |  Links  |  SSIS

 

How to Optimize Database Snapshots In Parameterized Merge Publications in SQL Server 2005

Background

In a typical merge publication, each subscriber is essentially equivalent. When we initialize the subscribers they all get the same data, and afterwards if all the subscribers were fully synchronized they would again all have identical data. I said that this is the case for "typical" merge replication, because there is also the possibility of partitioning the data according to each particular subscriber. In SQL Server 2000 this was referred to as "dynamic filters" while in SQL Server 2005 this is called "parameterized filters".

So, using parameterized filters we can send different partitions of data to each subscriber, and still manage it all using one single publication. This type of scenario is often referred to on the Microsoft exams where the publication is a central, head-office system, and the subscribers are satellite shops which can control their own data and for the sake of confidentiality they have no visibility of the data of other shops. From what I've seen, it's not at all uncommon. This article explains how to set it all up and in particular how to optimize the snapshot generation and massively reduce the initialization time.

Adding the Filter

All you need to have in order to set up data partitions is a valid WHERE clause in order to split the data according to the subscribers. You create a merge publication as per usual and in the wizard select the option to add a filter. For this particular example, I used a single table with one column called ComputerName. In the table there are 4 million rows like this:

"IBM"
"Compaq"
"Acer"
"Dell"
"IBM"
"Compaq"
"Acer"
"Dell"
..... etc

The parameterized filter is shown in the wizard screen below ("Filter statement:"):

For parameterized filters, the partition must be defined using either the HOST_NAME() function or the SUSER_SNAME() function. Don't be put off by having to use these functions if they seem to be completely unrelated to your own data - a relevant business value for the function is hard-coded for the subscriber when the subscription is created ie the actual value of HOST_NAME() is not itself used. Once the partition clause is defined against one table, publication join filters can be used to propagate the filtering down to the other child tables.

Once the publication is set up we can add subscriptions as per normal, but this could be disastrous in terms of initialization time. You see, by default, parameterized filtered publications will rely on filtered INSERTs from the Publisher to apply data to the subscriber as part of the initialization, and this is very, very slow! Dynamic snapshots are the alternative method and they provide the performance advantage of using SQL bulk copy program (bcp) files to apply data to a specific Subscriber.

Now for the clever bit. Instead of creating a whole snapshot and having each subscriber doing the filtering and the consequential slow separate insert statements etc, we pre-create the snapshots - one for each subscriber - which have just the data required for that particular subscriber.

Using the Data Partitions Tab

To pre-create the partitioned snapshots you use the "Data Partitions" page of the Publication Properties dialog box, after having already completed an initial (normal) snapshot.

In the figure above you can see that I have already entered 2 values for the HOST_NAME() function - 'IBM' and 'Compaq'.

Using this screen you can select to have the snapshots created on a schedule or manually. Generally speaking, you will know what the partitions are, and when you are about to initialize them you'll want to create the snapshots manually by selecting the partition in the property grid, and then clicking the hyperlink "Generate the selected snapshots now". If this process is very time-consuming and/or you want to avoid a hit on the production system, you could schedule the snapshot creation for the night before the day of the initialization. You can even have the partitions created by the subscribers themselves using the checkbox at the bottom of the form, although in most cases you'll know what the partitions are before starting in which case this isn't required.

After the manual snapshot generation, you can look at the distribution working folder (the one with the snapshots in it). There should be the original snapshot and there should also be a separate folder for the partition's BCP files. The folder name will be the value for the HOST_NAME() hard-coded value with a partition number suffix. In the figure below we see the separate BCP directory called "Compaq_1" corresponding to the manually generated snapshot from the Data Partitions screen shown above. In this folder the data BCP files now contain just the data required for "Compaq".

Note that a lot of this was already available in SQL Server 2000. There have been some improvements though. In SQL Server 2005 we can set it all up using the GUI rather than by scripts, which is one advantage. Another improvement is that the jobs can be initiated from the GUI. Also, the subscription no longer needs to be pointed at the snapshot location manually. If you're wondering how the merge agent in SQL Server 2005 "knows" where the snapshot actually is, then the data is stored in the following tables:

SELECT * FROM dbo.MSmerge_partition_groups
SELECT * FROM dbo.MSmerge_dynamic_snapshots

Time Benefits Illustrated

So, now you know how to set it up, what are the benefits?

Remember in the simple table I had added 4 million rows - 1 million for each computer name. The times taken initialization of the "Compaq" partition are shown below:

Normal Initialization
 

snapshot 7mins 58secs
merge 9mins 58secs

Partitioned Snapshot
 

snapshot 28secs
merge 33secs

So, the overall time for initialization has reduced from 18 mins to 1 min. Hopefully this article will encourage you to investigate the use of dynamic snapshots!