Delta detection, source data extraction and smooth sourcing of input data from OLTP and/or Relational data warehouse is a part of the overall ETL process. Though the performance optimization boundary is outside the regime of SSIS, but these are the fuelling stations of SSIS. OLTP and relational data warehouse are the two main data stations from which SSIS fetches and/or loads data, and relational data warehouse (i.e. data marts / ODS, whatever be the case) is often used to read as well as write data, and optimizing these relational sources can contribute to performance optimization of SSIS packages. In case if you are using SQL Server 2008 platform, some of those that can be effectively used for optimizing data access especially by SSIS packages are as below:
1) Change Data Capture (CDC) and/or Change Tracking: I would not go into the details of the difference between Change Data Capture and Change Tracking. But this is one of the effective means of delta detection. Actually more than performance, it provides a controlled and automated infrastructure for delta detection. Depending upon the kind of algorithm followed in the master ETL package that handles delta detection and extraction, CDC / Change Tracking can be used. A whitepaper that explain in depth methodology of integrating CDC with SSIS can be downloaded from here.
2) MERGE statement: MERGE is one of the smart means to off load processing of data for Slowly Changing Dimensions. A nice brief article from kimballgroup explaining the same concept can be read from here.
3) Filtered Indexes: Mostly everyone is aware of two types of indexes in SQL Server - Clustered and Non-Clustered. But Filtered Index is a new advancement and is a very useful weapon to optimize OLTP as well as relational data warehouse for optimizing data access without impacting the overall system or impacting only the system for which SSIS needs a helping hand from the relational sources for faster data access. Again I would not go into the details of filtered indexes, and a thorough section that explains the same can be read on MSDN from here.
Mostly we talk of how to optimize SSIS package performance, but the bigger picture is how do we optimize ETL process where SSIS is the driving ETL tool, and these pointers can provide some direction in an effort to optimize SSIS package performance, though indirectly.
2 comments:
Hi Siddharth,
Did you test Change Data Capture/Change Tracking yourself?
After some tests I can say that it's likely based on triggers mechanism.
I.e. time taken on 1 million inserts into table with trigger is same as time on inserts into table with change tracking.
Although, storage is optimizied compared to if we create table and trigger on it which captures data to another table.
I have worked with CDC and SSIS, but not with change tracking. I have not done a load test over several million records, but over a few thousand records, the performance is quite satisfactory. And in a D/W, when you are load delta, I do not expect many million records everytime. It again depends upon design.
Triggers are generally placed on OLTP for business logic / app related req. I do not feel using trigger is a proper option from performance or design considerations. Also when data is bulk loaded, triggers might get bypassed.
I am not much confident on the use of change tracking vs triggers, but CDC definitely wins over both.
Post a Comment