SSIS: How to handle updates

Handling updates in SQL Server Integration Services isn’t straight forward. If you have developed an ETL (Extract Transform Load) process for a Data Warehouse, then you may have faced a situation where you needed to update existing records. This could be the case when you are implementing Type 1 SCDs. The situation isn’t particular for a Data Warehouse ETL; in routine SSIS tasks, there can be situations when you want to insert a record if it does not exist, or else update the existing record. The problem is that SQL Server Integration Services does not provide direct support for updating records based on some criteria. This post will have a brief look over various solutions:

Ole DB Command

The most basic solution could be to use an Ole DB Command transformation for every input row. For this, our dataflow may look something like.

The most annoying thing in this approach is that Ole DB Command does not support named parameters and we need to write our update queries using “?” like this:

Update Dim_myDimTable
 Set Property1 = ?,
 Property2 = ?
 ....
 Where BusinessKey = ?

This could get tedious to maintain if we have lot of columns in our table since we need to map the ? parameters in the exact order they appeared in the update query.

Also, this method uses a single update query for every matched record and so is not a good choice if the records to be updated are higher in number. Also, when using this approach, we need to make sure that we only update the rows that are changed (the sample dataflow above does not depict this).

Slowly Changing Dimension transformation

This is a nice built-in method that starts a wizard and finally generates a dataflow similar to the one we saw in the first approach if we choose options in the wizard that represent Type 1 SCD with changing attributes. So this option could be regarded as a more maintainable way of approach 1 since it generates all the related data flow components automatically. However, since this approach also updates the records one by one so the performance may degrade if we have a lot of rows to update. To use this option, we need to drag a Slowly Changing Dimension transformation to our data flow and follow the wizard accordingly.

Staging Update Table

This approach gives greater scalablity, maintainablity and flexibility. This option requires us to create a staging table in the database and fill that table with the rows that that already exist in the target table. After that, we can simply write an update query with a join condition to update all the records in one-go. Here’s the dataflow sketch for this option:

After the above dataflow, we will need to create an SQL Script control flow component to write our query and clears the staging update table. This can be done using an sql command like this:

--update dimension rows
 Update D
  Set
   Property1=U.Property1,
   Property2=U.Property2
   ...
  From Dim_myDimTable D
  Inner Join Temp_myDimUpdateTable U
   on D.BusinessKey = U.BusinessKey
 
  --empty update table
  Truncate table Temp_myDimUpdateTable

Of course, this solution is an overhead if the records to be updated are small in number since it requires an staging table and is a two-pass solution.

Merge Statement

SQL Server 2008 came up with a new Merge command that can automatically insert or update records in the destination table in a single statement. This option is more described in this MSDN BOL article: Using MERGE in Integration Services Packages

That’s all I learned when I wrote an ETL for a data warehouse. I hope that post helps others that are going to write SSIS packages as well. Please have your thoughts propagated using the comments section below.

SSIS: Comparing two versions of same package

SQL Server Integration Services yields a very bitter experience when comparing two versions of a package to see the differences. The problem is that the package XML is arbitrarily reordered when we try to save a package in the Visual Studio designer. Well, luckily, there’s a free and very nice solution to this problem: BIDS helper, a Visual Studio add-on that leverages BI development using SQL Server 2005/2008. SmartDiff is a part of BIDs helper and it compares two packages after pre-processing the package definition files and producing a uniform layout for comparison.

To demonstrate, let me show the result of two versions of one of my SSIS packages. I did not modify anything and just saved the same package twice. Here’s a summary for an ordinary diff that shows how the VS designer messes up the two versions:

Summary of ordinary diff

Using SmartDiff, here’s the result:

Summary after SmartDiff

Looks great, only two differences. Here’s the actual output from SmartDiff: Click to enlarge the image:

Actual Diff using SmartDiff

That’s really cool. BIDs helper is an add-on every BI developer must have. It contains many other features other than SmartDiff. Have a try here.

SSIS: How to use a query containing table variables for an OLE DB Source

This one consumed a lot of my brain energies yesterday. I was working on some Sql Server Integration Services task and needed to create a complex query containing table variables. The query was working fine in SQL Management Studio and the SSIS designer was even previewing the data correctly. But when the task was actually executed, the query did not return any rows. I posted a question on TechNet forums and got an answer from Charles Talleyrand here. Actually, I needed a SET NOCOUNT ON statement. Let me explain using an example:

Say, I want to create a table variable (not a temp table), populate it and then use the result set as an input for an OLE DB Source. Here’s a sample query for this:


declare @mytable table
(
col1 int,
col2 varchar(20)
)

insert @mytable values (1, 'one')
insert @mytable values (2, 'two')
insert @mytable values (3, 'three')

select * from @mytable

The SSIS designer will correctly identify the column names from the above query and even display the data if you click on the Preview button. But when the task is executed, no rows will be returned. I think the output of insert statements: i.e. 1 row(s) affected was the source of problem and so a SET NOCOUNT ON prevented such interferences. Hooray!! So if you are ever working on some SSIS Data Transfer Task and want to use table variables inside your query, make sure you do not forget the SET NOCOUNT ON statement.