Handling updates in SQL Server Integration Services isn’t straight forward. If you have developed an ETL (Extract Transform Load) process for a Data Warehouse, then you may have faced a situation where you needed to update existing records. This could be the case when you are implementing Type 1 SCDs. The situation isn’t particular for a Data Warehouse ETL; in routine SSIS tasks, there can be situations when you want to insert a record if it does not exist, or else update the existing record. The problem is that SQL Server Integration Services does not provide direct support for updating records based on some criteria. This post will have a brief look over various solutions:
Ole DB Command
The most basic solution could be to use an Ole DB Command transformation for every input row. For this, our dataflow may look something like.
The most annoying thing in this approach is that Ole DB Command does not support named parameters and we need to write our update queries using “?” like this:
Update Dim_myDimTable Set Property1 = ?, Property2 = ? .... Where BusinessKey = ?
This could get tedious to maintain if we have lot of columns in our table since we need to map the ? parameters in the exact order they appeared in the update query.
Also, this method uses a single update query for every matched record and so is not a good choice if the records to be updated are higher in number. Also, when using this approach, we need to make sure that we only update the rows that are changed (the sample dataflow above does not depict this).
Slowly Changing Dimension transformation
This is a nice built-in method that starts a wizard and finally generates a dataflow similar to the one we saw in the first approach if we choose options in the wizard that represent Type 1 SCD with changing attributes. So this option could be regarded as a more maintainable way of approach 1 since it generates all the related data flow components automatically. However, since this approach also updates the records one by one so the performance may degrade if we have a lot of rows to update. To use this option, we need to drag a Slowly Changing Dimension transformation to our data flow and follow the wizard accordingly.
Staging Update Table
This approach gives greater scalablity, maintainablity and flexibility. This option requires us to create a staging table in the database and fill that table with the rows that that already exist in the target table. After that, we can simply write an update query with a join condition to update all the records in one-go. Here’s the dataflow sketch for this option:
After the above dataflow, we will need to create an SQL Script control flow component to write our query and clears the staging update table. This can be done using an sql command like this:
--update dimension rows Update D Set Property1=U.Property1, Property2=U.Property2 ... From Dim_myDimTable D Inner Join Temp_myDimUpdateTable U on D.BusinessKey = U.BusinessKey --empty update table Truncate table Temp_myDimUpdateTable
Of course, this solution is an overhead if the records to be updated are small in number since it requires an staging table and is a two-pass solution.
SQL Server 2008 came up with a new Merge command that can automatically insert or update records in the destination table in a single statement. This option is more described in this MSDN BOL article: Using MERGE in Integration Services Packages
That’s all I learned when I wrote an ETL for a data warehouse. I hope that post helps others that are going to write SSIS packages as well. Please have your thoughts propagated using the comments section below.