SSIS: How to handle updates

Handling updates in SQL Server Integration Services isn’t straight forward. If you have developed an ETL (Extract Transform Load) process for a Data Warehouse, then you may have faced a situation where you needed to update existing records. This could be the case when you are implementing Type 1 SCDs. The situation isn’t particular for a Data Warehouse ETL; in routine SSIS tasks, there can be situations when you want to insert a record if it does not exist, or else update the existing record. The problem is that SQL Server Integration Services does not provide direct support for updating records based on some criteria. This post will have a brief look over various solutions:

Ole DB Command

The most basic solution could be to use an Ole DB Command transformation for every input row. For this, our dataflow may look something like.

The most annoying thing in this approach is that Ole DB Command does not support named parameters and we need to write our update queries using “?” like this:

Update Dim_myDimTable
 Set Property1 = ?,
 Property2 = ?
 ....
 Where BusinessKey = ?

This could get tedious to maintain if we have lot of columns in our table since we need to map the ? parameters in the exact order they appeared in the update query.

Also, this method uses a single update query for every matched record and so is not a good choice if the records to be updated are higher in number. Also, when using this approach, we need to make sure that we only update the rows that are changed (the sample dataflow above does not depict this).

Slowly Changing Dimension transformation

This is a nice built-in method that starts a wizard and finally generates a dataflow similar to the one we saw in the first approach if we choose options in the wizard that represent Type 1 SCD with changing attributes. So this option could be regarded as a more maintainable way of approach 1 since it generates all the related data flow components automatically. However, since this approach also updates the records one by one so the performance may degrade if we have a lot of rows to update. To use this option, we need to drag a Slowly Changing Dimension transformation to our data flow and follow the wizard accordingly.

Staging Update Table

This approach gives greater scalablity, maintainablity and flexibility. This option requires us to create a staging table in the database and fill that table with the rows that that already exist in the target table. After that, we can simply write an update query with a join condition to update all the records in one-go. Here’s the dataflow sketch for this option:

After the above dataflow, we will need to create an SQL Script control flow component to write our query and clears the staging update table. This can be done using an sql command like this:

--update dimension rows
 Update D
  Set
   Property1=U.Property1,
   Property2=U.Property2
   ...
  From Dim_myDimTable D
  Inner Join Temp_myDimUpdateTable U
   on D.BusinessKey = U.BusinessKey
 
  --empty update table
  Truncate table Temp_myDimUpdateTable

Of course, this solution is an overhead if the records to be updated are small in number since it requires an staging table and is a two-pass solution.

Merge Statement

SQL Server 2008 came up with a new Merge command that can automatically insert or update records in the destination table in a single statement. This option is more described in this MSDN BOL article: Using MERGE in Integration Services Packages

That’s all I learned when I wrote an ETL for a data warehouse. I hope that post helps others that are going to write SSIS packages as well. Please have your thoughts propagated using the comments section below.

Advertisements

T-SQL: Using common table expressions (CTE) to generate sequences

One of the best enhancements in T-SQL with SQL Server 2005 was Common Table Expressions(CTEs). CTEs are very helpful in writing more readable and manageable queries. The good things don’t end here; self-referencing CTEs are a very powerful method of implementing recursion in SQL queries. In this post, I will present a few examples of generating sequences using CTEs.
The following statements create a number sequence from 1 to 1000.

--define start and end limits
Declare @start int, @end int
Select @start=1, @end=1000

;With NumberSequence( Number ) as
(
    Select @start as Number
        union all
    Select Number + 1
        from NumberSequence
        where Number < @end
)

--select result
Select * From NumberSequence Option (MaxRecursion 1000)

And the output is:
Number sequence

Notice that when selecting records from the CTE, we are setting a value for MaxRecursion Option. This one is important since the maximum recursion level for a CTE is set to 100 by default, and so any sequence with more than 100 rows will generate an error. We can turn the recursion limit off by using MaxRecursion 0, but this is not a good idea since it opens the possibility of infinite recursion in case of a small error in CTE.

Here’s another example that creates date sequence from 2009-03-01 to 2009-04-10.

--define start and end limits
Declare @todate datetime, @fromdate datetime
Select @fromdate='2009-03-01', @todate='2009-04-10'

;With DateSequence( Date ) as
(
    Select @fromdate as Date
        union all
    Select dateadd(day, 1, Date)
    	from DateSequence
    	where Date < @todate
)

--select result
Select * from DateSequence option (MaxRecursion 1000)

This one generates an output similar to:
Date Sequence

Notice, that this example, that generates date sequences can also be used to create Time Dimension for a Data Warehouse. If the granularity of time dimension is not too fine, CTEs give a nice alternative to loops. Here’s an example:

--define limits
Declare @todate datetime, @fromdate datetime
set @fromdate = '2009-01-01'
set @todate = '2009-12-31'

;With DateSequence( [Date] ) as
(
	Select @fromdate as [Date]
		union all
	Select dateadd(day, 1, [Date])
		from DateSequence
		where Date < @todate
)

--select result
Select
	CONVERT(VARCHAR,[Date],112) as ID,
	[Date] as [Date],
	DATEPART(DAY,[Date]) as [Day],
	CASE
		 WHEN DATEPART(DAY,[Date]) = 1 THEN CAST(DATEPART(DAY,[Date]) AS VARCHAR) + 'st'
		 WHEN DATEPART(DAY,[Date]) = 2 THEN CAST(DATEPART(DAY,[Date]) AS VARCHAR) + 'nd'
		 WHEN DATEPART(DAY,[Date]) = 3 THEN CAST(DATEPART(DAY,[Date]) AS VARCHAR) + 'rd'
		 ELSE CAST(DATEPART(DAY,[Date]) AS VARCHAR) + 'th'
	END as [DaySuffix],
	DATENAME(dw, [Date]) as [DayOfWeek],
	DATEPART(DAYOFYEAR,[Date]) as [DayOfYear],
	DATEPART(WEEK,[Date]) as [WeekOfYear],
	DATEPART(WEEK,[Date]) + 1 - DATEPART(WEEK,CAST(DATEPART(MONTH,[Date]) AS VARCHAR) + '/1/' + CAST(DATEPART(YEAR,[Date]) AS VARCHAR)) as [WeekOfMonth],
	DATEPART(MONTH,[Date]) as [Month],
	DATENAME(MONTH,[Date]) as [MonthName],
	DATEPART(QUARTER,[Date]) as [Quarter],
	CASE DATEPART(QUARTER,[Date])
		WHEN 1 THEN 'First'
		WHEN 2 THEN 'Second'
		WHEN 3 THEN 'Third'
		WHEN 4 THEN 'Fourth'
	END as [QuarterName],
	DATEPART(YEAR,[Date]) as [Year]
from DateSequence option (MaxRecursion 10000)

And here’s the beautiful output of the above query to be used as a time dimension (click to enlarge the image):
Time Dimension