Parallel Data Warehouse – POC lessons learned Part 3

The first part of the PDW lessons learned series gave an introduction about the goals of the POC, a customer introduction, an overview of the requirements and the test cases. Within the second part I gave some more insights on how to migrate the existing code base I used for the POC (relational structures, stored procedures, SSIS packages) in order to work with PDW. The last post will concentrate on the results and a summary of our findings.

ETL Performance

As shown in my series “SSIS Design for Performance and Scale” a typically designed SSIS package with only 1 stream in the dataflow will give us a write performance of around 25 MB/s. Because data load performance is a critical issue in the current customer environment one of the original SSIS packages looked like this:

image

What we can see here are 6 sequence container that are executed as 6 threads in parallel, every container handles a subset of the data. But there are several disadvantages with that kind of design:

  • Complex parallel loads of partitions (partition by quarter)
  • Manual parallelism within package design
  • Multiple implementation of business logic per thread

With PDW we wanted to see how much the engine itself can help us in order to parallelize data loads. So we decided to remove all parallelism from the package so that it looks like this:

image

This gives us the following advantages:

  • Parallelism of load processes is handled by PDW (let’s see)
  • Much simpler SSIS Design -> less implementation effort
  • Business logic exists only once -> simpler and faster maintainability

So far so good, but what about the performance numbers? Let’s have a quick look at the 2 scenarios we have tested. We used the Landing Zone as host for Integration Services and executed all packages on this server.

image

Scenario 1

In the original customer scenario the SSIS package loads data from on table to another table within the same database. The whole database is stored in PDW.

Scenario 2

Because we had some trouble with the read performance we decided to put a backup of the database to the landing zone and load the data from the landing zone to the PDW nodes.

Performance Results

The following matrix shows the performance numbers we got during our tests.

image

The column “Production Duration” shows the reference value from the current customer environment. This is our baseline number.

Now we can see 2 interesting behaviors. The first scenario is 2x slower, but the second one up to 4x faster. So what is the reason for these big differences in performance?

  • Read data from PDW is slow, around 8 MB/s (OleDB driver) –> Scenario 1
  • Load data into PDW is fast, around 45 MB/s –> Scenario 2
  • PDW automatically distributes loading processes

Important: The product group is aware of the slow OleDB driver and is working hard to make it faster. We have already seen a much better performance on a later Appliance Update.

Summary

I always hear people saying that SSIS is not the right tool for data loading scenarios with PDW. It’s maybe not the fastest option, but we have seen nevertheless good performance measures. Just to remember the easiest SSIS package on a normal SQL Server gives us a write performance of 25 MB/s. With PDW we get 45MB/s, so performance is nearly doubled. If this performance is good enough for you, SSIS is still a good choice and of course additionally you can apply the same parallelism techniques for further scale. And sometimes you don’t have the option to migrate all existing SSIS packages to a T-SQL based ETL.

My feeling is, that there is much more possible in case of performance, if the driver becomes faster and more efficient in the future.

Very often also the BCP tool (bulk copy utility) is recommended for data loading. This is a very fast option if you get your source data as flat file. But when you first need to create flat files before you can use BCP you better use SSIS for direct connectivity.

Relational Query Performance

Now let’s have a look at the relational query part. We got 3 reference queries from the customer that we used for our tests. In order to make the queries work with the PDW database we only had to change the schema name (because only the “dbo” schema is currently supported) and that’s it. And the performance? Awesome, as you can see here:

image

The column “Production Duration” shows the reference value from the current customer environment. This is our baseline number.

All queries are up to 8 times faster. This is the real power that PDW offers us. A highly scalable SQL engine with lot’s of query power. So no surprises here, it works as we had expect it.

Summary

So what do we learn from this POC. I’d like to address the most important things we have seen again in a short bullet list.

Usability

einfachheit

  • PDW is not (yet) a normal SQL Server. So you need to understand the architecture of a MPP system and why it works differently to a SMP system
  • It’s not a no brainer. You need to think about, why would you need to buy a MPP System and which problem should it solve. It’s not a “one size fit’s all” thing.
  • Plan for migration effort. There are features in SQL Server that are currently not supported in PDW. Work closely with the Microsoft team to understand your specific case and there are already a lot of good design pattern you can use to overcome the missing features.

Performance

PowerRanger

  • The relational query engine is fast. You have seen the results, but you can also write statements that do not well perform. Think about data distribution and other techniques and avoid data shuffling.
  • Plan your ETL. As we have see we get good parallelism support for SSIS, although the read problem has to be solved. So the system helps us with built-in parallelism support and we can keep our SSIS packages much simpler for a longer time.
  • Think about ELT.  I think in a MPP environment the SSIS engine itself becomes the bottleneck. We take out the data from a highly parallelized engine pump it through a buffer oriented SSIS engine and put it back to our highly parallelized engine. This works like a funnel which scales only to a certain point. If you can start from scratch or optimize certain packages try to use SSIS to bring the data into PDW and do the transformations within the engine.

You are free to draw your own conclusion out of the presented results or you can also contact we for further questions or feedback. Don’t hesitate, just let me know.

References

14 thoughts on “Parallel Data Warehouse – POC lessons learned Part 3

  1. Rolf says:

    Thanks, good post. Good to see a cross platform compariosn. Not sure if I missed it but for like-for-like comparison of relational query performance, can you pls advise the spec of the “Production” system and if the DB used features like page compression or (if SQL 2012) column store indexes?

  2. Hi Rolf, thanks for your comment. The production system is a SMP box with 24 Cores, 256GB RAM and a storage system with a read throughput of 2GB/s. We currently don’t use compression and it’s based on a SQL Server 2008 R2 installation, so no columnstore index.

  3. I’m still learning from you, while I’m improving myself. I definitely liked reading everything that is posted on your site.Keep the stories coming. I loved it!

  4. Raj says:

    This is helping me a lot to compare it with other Appliance. Question: Do you know if some of these missing features or limitations are resolved in SQL 2012 PDW

    • Hi Raj,

      yes there are a lot of enhancements in the SQL 2012 version of PDW, especially in case of performance. We have now a fast OleDB driver, also for reading data from PDW. I did currently a POC with the new version and will make the results public very soon.

  5. Hi,

    You mentioned “If you can start from scratch or optimize certain packages try to use SSIS to bring the data into PDW and do the transformations within the engine.”

    We are on a mission to build a new system from scratch on PDW 2012. Do you suggest we use SSIS to bring the data into a set of staging tables within the PDW and then use T-SQL (stored procedures) to transform the data into the facts & dimensions?

    • Hi,

      you can get a decent performance also with SSIS. During our tests we were able to load data with an easy SSIS package with 180 MB/s. This has been tested on a full rack, but PDW was not the bottleneck. Compared to that we did the same logic with T-SQL based ETL with around 750 MB/s. So T-SQL was 6x faster than SSIS in our Scenario. So if 180 MB/s write performance is good enough, then SSIS is still fine. But if you need more, you should go with T-SQL. So maybe a hybrid solution is the right way. Everything which is critical can be implemented in T-SQL.

      What do you think?

      • Yes that’s what we have been told too. But my question is if you use T-SQL for your transformations, then how do you go about your logging, capturing erroneous records, handling rejects in lookups, column derivations, data conversions, etc which is quite easy to do in SSIS?

      • You are right, those things are easier with SSIS. You can still use SSIS to execute your T-SQL loads and use at least logging functionality of SSIS.
        All the rest needs to be done in T-SQL. So if you have more experiences with SSIS, then start with SSIS and use T-SQL where you need more speed. But as I already said you can get around 200 – 250 MB/s write throughput with SSIS, which should be good enough for most requirements.

    • There is still a gap between T-SQL functionality of SQL Server 2012 and PDW. But the product group is working hard to close this gap. We can expect 2-3 so called Appliance Updates every year that further functionality to PDW and the T-SQL capabilities.

    • Hi,
      yes we tested ROLAP and in combination with PDW this is mostly the recommended solution. MOLAP processing was also much faster on top of PDW, but I think ROLAP is more interesting for customers.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s