In my last post I presented the Data Warehouse reference architecture that I use to explain design principles and ETL loading strategies for each layer. I will mainly focus on SSIS as ETL technology. So let’s first start with the Staging area and the Archive.
Staging Area – The Purpose
As I already pointed out the staging layer supports 1:1 copy of source system extraction. The main goal is to bring the data as fast and as easy as possible from the sources to the SQL Server also in order to minimize the source system interaction. Data in the staging area is temporary or semi temporary and can be deleted after all data is loaded into the CDW and the archive.
Staging Area – Design Principles
The following list of design rules apply for the Staging area:
- 1:1 copy of the source system
- Same naming conventions and data types as the source system
- No transformations
- No aggregations
- No quality checks
- 1 Staging area per source system
Let me explain some of the principles in more detail. The first five principles focus all on the idea of bringing the data as easy and as fast as possible into the staging layer of the Data Warehouse in order to achieve a smooth connectivity to our sources. We also want to keep the extraction workload as short as possible in order to minimize the load of out source systems. An ERP system like SAP ERP is typically mission critical in a customer enterprise environment so we want to minimize the data extraction influence on the daily business. Therefore we always prefer a delta extraction process, if the source system is able to deliver changes only.
The point of having one staging are per source system is because of workload and data separation. This design depends a little bit on the flavor of the architect and also on the different numbers and complexities of the source systems. Of course it is also possible to separate source systems in the staging area by defining different schemas per source system in large enterprise environments a dedicated database per source is more common.
Staging Area – SSIS Loading Pattern
Because we have no further logic to implement during data source extraction the SSIS packages look very simple.
We use a source system adapter (the one that fits best for the specific source system connectivity), in this case an OleDB source, extract the data, add some audit info like CreatedDate or LoadID within a Derived Column task and store the data in our staging table, done.
This looks very easy, isn’t it? And exactly this was the goal for staging area data loads.
Archive – The Purpose
The archive is our “life insurance”. Why? Because it keeps the full data history. I hear often the claim that the Data Warehouse has to track the complete history. So let’s be precisely about “what is data history” first. There are two types of history tracking in a Data Warehouse:
- Fact Data Changes: This is the easy one. Because every fact record has a link to a certain point in time our facts are history tracked automatically.
- Master Data Changes: This is the difficult one and is also know as slowly changing dimensions (SCD). In this case we track attribute changes in our dimension like phone number or address information.
- When I say I suggest not to track the full history within the Data Warehouse I mean not to track all attribute changes that occur just in advance. Based on my experience only 10% of history tracked dimension attributes are used by business users. So why should I then track the additional 90%, if nobody use them?
- Most of the time I hear the answer: “Because we don’t know what will be used in the future.”
- But do you really want to pay the price for that? And the price would be lot’s of between joins in your Data Warehouse and between joins are very cost extensive operations. I will dig into that topic in my post about the Core Data Warehouse design.
- To make it short, we have the full history (fact and master data) in our archive and are always able to create a new SCD2 table in our Core Data Warehouse when ever needed.
Archive – Design Principles
The archive follows the same design principles as the Staging area plus some additional principles:
- 1:1 copy of the staging area
- Same naming conventions and data types as the staging area
- No transformations
- No aggregations
- No quality checks
- 1 Archive per Staging area
- Use Compression and Partitioning
- Apply delta detection if necessary
If the source systems are not able to deliver data changes only and instead give you a full extract of their data every time I would add some logic before loading that data to the archive. In case of full extracts your archive will grow very rapidly so you need to add a delta detection logic between staging and archive and load only data changes into the archive. I also would apply table compression and partitioning. Partitioning in case you need to load a certain extract again from the archive to your Data Warehouse.
Archive – SSIS Loading Pattern
Because we have no further logic to implement during the SSIS packages look very simple.
We use an OleDB adapter to connect to our Staging database, extract the data, add some audit info like CreatedDate or LoadID within a Derived Column task and store the data in our Archive table, done.
Again a very easy and straightforward approach.
I hope I made my design principles clear and explained it detailed enough. If not, shoot me a comment and let me know. Within my next post I will explain the same principles and Loading pattern for the Core Data Warehouse.
I hope I made my design principles clear and explained it detailed enough. If not, shoot me a comment and let me know.
Within my next post I will explain the same principles and Loading pattern for the Core Data Warehouse.