LINQ: How to get the latest/last record with a group by clause

Sometimes, we need to select a record that has the maximum or minimum value from a group. Some examples are:

  • We have a PersonOrders table and we want to find the last Order for each Person
  • We have a UserOperations table and we need to find theĀ first Operation for each User
  • We have a Persons table and we need to find the person with maximum salary for each Department

Such scenarios are beautifully handled by TSQL’s Ranking_functions over partition by clauses. Here’s a typical example that uses Row_Number() function to assign a rank to each row per Person.

;With PersonOrderWithRank 
as
(
	Select *, Rnk = ROW_NUMBER() over (partition by PersonID order by OrderDate desc)
	from PersonOrders
)

Select * 
from PersonOrderWithRank
where Rnk=1

In LINQ, similar result can be achieved by using the let keyword. Here’s an example:

from p in PersonOrders
//where conditions or joins with other tables to be included here
group p by p.PersonID into grp
let MaxOrderDatePerPerson = grp.Max ( g=>g.OrderDate )

from p in grp
where p.OrderDate == MaxOrderDatePerPerson
select p

Another more compact method would be to retrieve the first record in the grouping like this:

from p in PersonOrders
//where conditions or joins with other tables to be included here
group p by p.PersonID into grp
select grp.OrderByDescending(g=>g.OrderDate).First()

The above LINQ approaches can be used for both LINQ To SQL as well as Entity Framework. Although the SQL and the LINQ approaches I described above are not exactly identical (since I used Row_Number() and not Rank() or Dense_Rank() ) but the purpose of this post is to provide an starting point to write similar queries using LINQ.

Advertisements

SSIS: How to handle updates

Handling updates in SQL Server Integration Services isn’t straight forward. If you have developed an ETL (Extract Transform Load) process for a Data Warehouse, then you may have faced a situation where you needed to update existing records. This could be the case when you are implementing Type 1 SCDs. The situation isn’t particular for a Data Warehouse ETL; in routine SSIS tasks, there can be situations when you want to insert a record if it does not exist, or else update the existing record. The problem is that SQL Server Integration Services does not provide direct support for updating records based on some criteria. This post will have a brief look over various solutions:

Ole DB Command

The most basic solution could be to use an Ole DB Command transformation for every input row. For this, our dataflow may look something like.

The most annoying thing in this approach is that Ole DB Command does not support named parameters and we need to write our update queries using “?” like this:

Update Dim_myDimTable
 Set Property1 = ?,
 Property2 = ?
 ....
 Where BusinessKey = ?

This could get tedious to maintain if we have lot of columns in our table since we need to map the ? parameters in the exact order they appeared in the update query.

Also, this method uses a single update query for every matched record and so is not a good choice if the records to be updated are higher in number. Also, when using this approach, we need to make sure that we only update the rows that are changed (the sample dataflow above does not depict this).

Slowly Changing Dimension transformation

This is a nice built-in method that starts a wizard and finally generates a dataflow similar to the one we saw in the first approach if we choose options in the wizard that represent Type 1 SCD with changing attributes. So this option could be regarded as a more maintainable way of approach 1 since it generates all the related data flow components automatically. However, since this approach also updates the records one by one so the performance may degrade if we have a lot of rows to update. To use this option, we need to drag a Slowly Changing Dimension transformation to our data flow and follow the wizard accordingly.

Staging Update Table

This approach gives greater scalablity, maintainablity and flexibility. This option requires us to create a staging table in the database and fill that table with the rows that that already exist in the target table. After that, we can simply write an update query with a join condition to update all the records in one-go. Here’s the dataflow sketch for this option:

After the above dataflow, we will need to create an SQL Script control flow component to write our query and clears the staging update table. This can be doneĀ using an sql command like this:

--update dimension rows
 Update D
  Set
   Property1=U.Property1,
   Property2=U.Property2
   ...
  From Dim_myDimTable D
  Inner Join Temp_myDimUpdateTable U
   on D.BusinessKey = U.BusinessKey
 
  --empty update table
  Truncate table Temp_myDimUpdateTable

Of course, this solution is an overhead if the records to be updated are small in number since it requires an staging table and is a two-pass solution.

Merge Statement

SQL Server 2008 came up with a new Merge command that can automatically insert or update records in the destination table in a single statement. This option is more described in this MSDN BOL article: Using MERGE in Integration Services Packages

That’s all I learned when I wrote an ETL for a data warehouse. I hope that post helps others that are going to write SSIS packages as well. Please have your thoughts propagated using the comments section below.