The Table Difference Component is an SSIS Data Flow Component designed to find differences between two tables and, in particular, to simplify the management of slowly changing dimensions. The main advantage over the Microsoft standard SCD component is the speed and ease of use.
This component was tested with 5.5 million records and 25 comparison columns on a dual core AMD machine with 4GB of RAM and a standard SATA disk. The comparison process was complete in 7 minutes. The complexity of the algorithm is linear, so you can expect 10 million records to be processed in approximately 15 minutes with the same hardware.
One question arising very often in Data Warehouse programming is "What are the differences between two tables?" You have several million customers and need to decide what changes should be made. The standard SSIS toolkit provides the SCD component for handling this process, but it is of no practical use when you have to handle more than a couple thousand records. The Table Difference Component is not limited by the number of records to be processed.
This component has two inputs: one for "old data" and one for "new data". The old and new rows are compared and the result is sent to different outputs:
- Unchanged rows (same in both old and new)
- Deleted rows (appear in old but not in new)
- New rows (appear in new but not in old)
- Updated rows (appear in both old and new, but there are changes)
The diagram below illustrates the component functionality:
Occasionally, you may have a requirement to perform different updates for different columns. The component includes an option to tag each input column with an identifier. Then, for each unique identifier, a separate update output can be created and maintained. Using this capability, you can tag the historical attributes with "10" and tag the other columns with "20" and decide what kind of operation to perform for the different updates, thereby providing maximum flexibility.
The component inputs MUST be sorted and have a collection of fields (keys) that let the component decide when two rows from the different inputs represent the same row. This is easily accomplished in SQL with a simple “order by” and a convenient index. Usually, slowly changing dimensions maintain an index by the business key, so the sorting requirement is easily accomplished and does not represent a problem.
The component has two inputs:
- Old Data Flow - attached to source returning existing rows.
- New Data Flow - attached to source returning new rows.
Each input has a buffer where all incoming rows are directed and maintained. A separate thread starts as soon as data from both buffers is available and checks for any differences between them. The rows are sent to the appropriate output and then removed from the buffers. The component does not need to maintain all the data in memory. The data is removed as soon as it has been compared with the corresponding row. For some obscure reason, SSIS allocates more processing time to one of the input data flows, compared to the other. This results in additional memory consumption by the component. For this reason, if you need to compare more than half a million records, we recommend you use the COZYROC Flow Synchronization component in your data flow.
The component has three standard outputs for New, Deleted and Unchanged rows. For each unique user-defined UpdateID identifier, a separate Updated row output is created. The outputs related to the UpdateIDs are named "UpdateID", followed by the identifier (e.g.UpdateID10).
You can easily change the name of the Update output to something more descriptive like "Updated historical attribute" and we encourage you to do so. The collection of output fields is computed by the component based upon the intersection of the two inputs. If one column appears in only one input, it is not considered. If it appears in both, then it will be processed.
The component analyzes the old and new columns. If the column name and type match, it is added to the list of available columns to manage. The inputs must be sorted. Their sort columns must be marked as key fields and the sort order must be specified. All other non-key columns are assigned a standard Update ID 10 and are managed as comparable columns unless you specify otherwise.
Select the input column processing type. This parameter has the options listed in the following table.
|Key Field||Use this column to determine which rows from the old and new input tables represent the same data. The inputs must be sorted by this column.|
|Compare||The column from the new and old rows will be compared to detect any difference.|
|Use New||The column will be copied from the New Data Flow input to the output without comparing the contents.|
|Use Old||The column will be copied from the Old Data Flow input to the output without comparing the contents.|
Specify the "Key Field" column order. This is the order in the query's "Order By" clause. The component does not know the correct order, so it is up to you to provide this information.
Specify the column update identifier. For each unique identifier, a separate update output is created. The update identifier with the lowest value has the highest priority. For example, assume we assign the AccountNumber column the update identifier of 10 and we assign the AddressLine1 column the update identifier of 20. If AccountNumber is different, the row will be directed to the update output 10, even if AddressLine1 also contains a difference.
Specify a name and description for each desired output. If you are not interested in Unchanged output, then you can deselect it and avoid warnings for unused columns.
Specify output column options.
Select the output for which you want to edit column selection and/or names.
Select or deselect the output column. Deselecting unused columns removes warnings and increases the performance of the component.
Specify a custom output column name.
Specify string comparison options.
Specify the culture used to perform the comparison. If not specified, the component will use the culture of the running task.
Indicate that the case should be ignored for the comparison.
Indicate that the Kana type should be ignored for the comparison. The Kana type refers to Japanese hiragana and katakana characters, which represent phonetic sounds in the Japanese language.
Indicate that the character width should be ignored for the comparison.
Indicate that non-spacing combining characters, such as diacritics, should be ignored for the comparison.
Indicate that symbols, such as white-space characters, punctuation, currency symbols, the percent sign, mathematical symbols, the ampersand, and so on, should be ignored for the comparison.
Indicate that the string sort algorithm should be used, where the hyphen and the apostrophe, as well as other non-alphanumeric symbols, come before alphanumeric characters.
This tab will contain all unused input columns. By checking the warnings you can see if the component is working with all the desired columns.
- Hints for upgrading from SSIS 2005 to 2008 by Shannon.
- Fixed: Component incorrectly transformed NULL value input into empty string, when input column type was DT_TEXT or DT_NTEXT (Thank you, Michael).
- Fixed: Component failed with "COM object that has been separated from its underlying RCW cannot be used." error when processing input columns of type DT_TEXT or DT_NTEXT (Thank you, Michael).
- Fixed: Improved setup dialog loading speed, when using many columns.
- Fixed: Component was dependent on .NET 3.5 framework after the new string comparison options were introduced. The component now requires only .NET 2.0 framework.
- New: Component now includes more options for string comparison (Thank you, Shane).
- Fixed: Failed to process DT_DBTIME2 column types in SQL Server 2008 (Thank you, Ethan).
- New: Component introduced.