First introduced in the 1930s as wallpaper cleaner, Play-Doh is now used by children to make all manner of things, and they’re limited only by their imaginations, manual dexterity, and the amount of Play-Doh available. Similarly, when you code transformations—the T in ETL (extraction, transformation, and loading)—you are limited only by your imagination and the data available. SQL Server Integration Services (SSIS) provides powerful tools for transforming the raw Play-Doh of data into useful, meaningful tools to help a business thrive. This article aims to demonstrate a few of the many ways you can handle data transformation tasks in SSIS.
You can download the code and sample files discussed in this article by clicking here.
Transformation Basics
Two types of transformation can occur in SSIS. First is a simple data type transformation, such as extracting data from a comma-delimited ASCII file (where the data is stored as an ASCII string) and transforming it into an integer. Second is changing the format of data, such as taking a flattened, comma-delimited file and storing the data more efficiently in third normal form.
In both cases, you are changing the form of the data but not its substance, in the same way the form of the Play-Doh is changed without changing its substance. If your hypothetical ASCII file data indicates that a company sold five widgets, you don't change the data to say that four or six were sold—you simply change the data type from an ASCII string to an integer. Another example would be that when you obtain derived data, such as the total sales for a day, you don't change the underlying facts but rather draw conclusions from them. You can also use the transformation step to ensure data fits predetermined business rules. If it doesn't, you typically want to raise an error or create exception reports so that any problems can be analyzed and dealt with appropriately.
Transformation can raise some rather profound philosophical questions, if you are of such a mind. For example, it raises questions of form and substance similar to those raised in Aristotle's Metaphysics or the writings of Marcus Aurelius and of Ignatius of Antioch on transubstantiation. The substance of the data—what it is in its essence—remains the same while its form—the data types—may vary. (Who says those philosophy credits and a Christian Brothers education were wasted? But I digress…)
It almost goes without saying that there are all manners of business rule that can or should be enforced during the ETL process. For example, dollar values of a sale not involving return items shouldn't be negative numbers, so you would have a rule (e.g., enforced by a constraint) that a monetary value associated with a sale must be greater than zero. You might have a rule that a sale should never involve a product the company doesn't actually sell, and you could enforce this rule with a foreign key constraint comparing the SKU in a product table with the SKUs involved in any given sale. If the constraint throws an error, then we would handle it in a WTF (Where's This From?) error routine. For more information about the rules you should enforce in your SSIS environment, see "SSIS Logging Best Practices," www.sqlmag.com, InstantDoc ID 103213.
Why do you need to transform data? The most obvious reason is that data stored and optimized for use in one system will rarely match the storage and optimization needs in a second system that's optimized to meet other, usually radically different, needs. Because you won't have direct mappings between source and targets in the vast majority of data transfer situations, you must consider various options. You must consider where you should perform the transformation. You could, for example, simply dump an entire large file into a staging table, then transform it using SQL while the data is being moved to the "permanent" tables. You could also use various tools, such as SSIS, to transform and move data to the application side. Or you could combine these efforts. Specialty ETL tools are available for various needs or industries, such as Microsoft's Amalga for medical systems integration. (Such specialized tools are beyond the scope of this article.)
Use Case
With this background, let's consider a hypothetical set of business requirements. Big Box Stores owns and operates diverse retail chains that include huge Big Box warehouse stores, large retail operations, etc. The stores operate in the United States., Canada, Mexico, and Western Europe, and the brands have modified the POS systems to meet local tax and regulatory requirements.
The loss prevention department has noticed that some employees are helping themselves to five-finger discounts. The staff members use various ruses to take money from cash registers, obtain goods without paying for them, or otherwise embezzle money or steal goods from Big Box. These activities typically unfold over periods of several days or weeks. For example, employees will make purchases using the employee discount at the store where they work, then return the product for full price at another store where they aren’t known. Alternatively, they might have accomplices return the goods to the employee for a full refund.
The various methods used to steal from Big Box fall into these recognized patterns, and a good deal of this theft can be uncovered by analyzing patterns of sales transactions. Standard ETL techniques will be used to import data concerning the stores, products, and employees to a database used to analyze these patterns and detect employee theft.
You have been tasked with building a system that will import comma-delimited files exported by the POS systems into a SQL Server database that will then perform the analysis. Data concerning each sale will be sent from the POS systems.