Design SSIS for Performance and Scale – Parallelism strategies, Part 1

After looking at our baseline numbers, I will now explain the parallelism strategies we can use with SSIS. For this purpose I introduce my scale map.

Scale map

[Figure: scale map]

The map is divided into partition strategies and insert strategies. I will start with functionally partitioned tables, since a single partition behaves similarly to a non-partitioned table.

Direct inserts into single partition

The focus of this design pattern is to achieve maximum performance when loading into a single partition / table. The target table is partitioned by month and I load with multiple streams directly into one partition (heap and clustered index). The design looks like this:

[Figure: SSIS package design with parallel streams]
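The monthly partitioning of the target table can be sketched as follows; the function, scheme and table names are hypothetical, and the boundary values are just examples:

```sql
-- Hypothetical monthly partitioning setup for the target table.
-- RANGE RIGHT: each boundary value is the first day of a partition.
CREATE PARTITION FUNCTION pfMonth (date)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

CREATE PARTITION SCHEME psMonth
AS PARTITION pfMonth ALL TO ([PRIMARY]);

-- Target table placed on the partition scheme; columns follow the
-- TPC-H LINEITEM naming used in this article.
CREATE TABLE dbo.LineItem
(
    L_ORDERKEY  bigint NOT NULL,
    L_SHIPDATE  date   NOT NULL
    -- further LINEITEM columns ...
) ON psMonth (L_SHIPDATE);
```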

The data is split equally into multiple streams within the data flow, and I tested 4, 8 and 16 streams in parallel. In order to split the data I use a modulo function within the “Conditional Split” transformation:

DAY(L_SHIPDATE) % 4 == 0

If you need more than 4 destinations you have to add more data flows to your package, because the data source is limited to ~200 MB/s. A good balance between source and destinations is therefore 1:4.
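For four streams the Conditional Split needs one expression per output; a sketch of the complete set in SSIS expression syntax (the output names are hypothetical) could look like this:

```
DAY(L_SHIPDATE) % 4 == 0    // output: Stream0
DAY(L_SHIPDATE) % 4 == 1    // output: Stream1
DAY(L_SHIPDATE) % 4 == 2    // output: Stream2
DAY(L_SHIPDATE) % 4 == 3    // output: Stream3
```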

Let’s have a look at the performance numbers:

[Figure: performance numbers]

As you can see, I got a nearly linear performance scale when loading into a heap table, with no fragmentation (blue line). What’s interesting is that even for a clustered index table the scale is linear with no fragmentation, as long as the operation is minimally logged (orange line)! The speed difference is the sort that I have to pay for the clustered index.

If the load is not minimally logged, we see no performance increase and the fragmentation keeps growing.
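Whether a bulk load is minimally logged depends on the recovery model and the load options; as a rough sketch (the database name is hypothetical, and the exact behavior varies between SQL Server versions):

```sql
-- Minimal logging requires the SIMPLE or BULK_LOGGED recovery model.
ALTER DATABASE SsisLoadTest SET RECOVERY BULK_LOGGED;

-- In the OLE DB Destination use "Table or view - fast load" and enable
-- "Table lock", which adds the TABLOCK hint to the bulk insert.
-- For inserts into a table with a clustered index, trace flag 610 can
-- additionally enable minimal logging (SQL Server 2008 and later).
DBCC TRACEON (610, -1);
```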

Choose the right distribution key

In order to create multiple data streams I use a modulo function, as mentioned before. But be careful which column you choose for data distribution: in case of a clustered index table, the distribution column has to match the clustered index column. You can see the performance differences below:

[Figure: performance differences between distribution keys]

Why? Because if the data distribution is not aligned with the clustered index, the parallel bulk inserts hit the same pages and block each other. You can see this behavior in the following picture, with lots of resource locks on the same pages.

[Figure: resource locks on the same pages]
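The page-level blocking can also be observed live while the load runs; a minimal sketch using the `sys.dm_tran_locks` DMV:

```sql
-- Show sessions competing for the same pages during the parallel load.
SELECT request_session_id,
       resource_type,
       resource_description,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE resource_type = 'PAGE'
ORDER BY resource_description;
```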

Sorting on a heap

While investigating the execution plans I also noticed a sort for a heap load.

[Figure: execution plan with sort operator]

This sort appears because the heap table is partitioned, and there is nothing we can do about it; even pre-sorting the data by partition key does not help.

Conclusion

To summarize, the most important things to remember:

  • Loading with multiple streams into a heap table is the fastest option
  • Loading into a clustered index is acceptable if it is minimally logged and ranges are not overlapping (hard to achieve).
  • The SSIS data source is limited to ~200 MB/s
  • A good balance of sources & destinations is 1:4. If you need more, create more dataflows.
  • Choose the right data distribution key

Partition switch – strategy to avoid sort

[Figure: partition switch pattern]

The next pattern I want to take a closer look at is the partition switch. The test case itself looks like this:

  • Target table is partitioned by month
  • Load with 1 stream into a staging table
  • Switch the staging table into the target heap or clustered index table
  • Load 1, 8 or 16 months in parallel

The corresponding SSIS package looks like this. Every data flow is responsible for one month of data, and the T-SQL tasks handle the partition switch commands.
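The switch command executed by the T-SQL task can be sketched like this; the table names are hypothetical, and the staging table must match the target table’s schema, indexes, constraints and filegroup:

```sql
-- Switch the loaded staging table into the January partition of the
-- target table. $PARTITION resolves the partition number from a value.
ALTER TABLE dbo.Stage_LineItem_201501
SWITCH TO dbo.LineItem
PARTITION $PARTITION.pfMonth('2015-01-01');
```

The switch itself is a metadata-only operation, which is why it adds almost nothing to the measured load time.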

[Figure: SSIS package with one data flow per month]

So let’s see what performance we get out of this pattern. The two pictures below show the overall duration in seconds and the write performance in rows per second. The time I measured was the load time from the beginning until the data was available in the partitioned table, which means that the switch operation is included.

[Figure: overall duration and write performance]

The blue line shows the performance of the heap table load and the red one that of the clustered index load. Just for comparison, the blue dotted line shows the performance of a heap load with direct inserts. I compare one month of data to a direct insert with 1 stream, 8 months of data to 8 parallel streams, and so on.

Conclusion

Compared to the first design pattern, this is what we learn from this one:

  • As we can see, the partition switch is faster than direct inserts, because we now avoid the sort by partition key
  • The clustered index load is a bit slower, because we still have to pay the price for the sort by the clustered index key
  • We have a nearly linear scale

Summary

This article introduced the design patterns for non-partitioned and functionally partitioned tables. We have seen how to apply these techniques in SSIS, what the advantages and disadvantages are, and how they perform.

But what if loading one month of data is not fast enough? I will take up this question in my next post, where we will look at hash partitioning and how this technique can help us scale further.
