Executive Summary: The Data Profiling task is an incredibly useful, yet often overlooked tool in Microsoft SQL Server 2008. This article explains how to configure and use both the Data Profiling task and the Data Profile Viewer (DPV) to examine and assess your data, then view the results in SQL Server Integration Services (SSIS).

I’ve noticed a distressing lack of coverage of SQL Server 2008 Integration Services’ (SSIS) new data profiling feature, even though data profiling capabilities are a significant feature for data warehouse developers. Data profiling tools have been available for some time, but until now, SQL Server hasn’t natively had the capability. The relative lack of coverage of SQL Server 2008’s data profiling capability truly makes it a hidden treasure.

Data profiling is the process of examining source system data and collecting various statistics for data quality and data integration assessment. It has been a frequently requested item for SSIS—or at least the larger SQL Server business intelligence (BI) platform— for some time now. Microsoft is competing heavily in the BI industry against other BI platform providers (many of which have now consolidated) that have offered data-profiling capabilities for some time. This article shows how to set up the Data Profiling task, designate profiles, and build and view profile output in the new Data Profile Viewer utility.

Available Profiles
The SSIS Data Profiling task contains a total of eight data profiles that you can use to assess the data quality of source systems. Five of the profiles analyze individual columns in a table or view. The remaining three profiles analyze multiple columns or relationships between columns, tables, or views.

At this time, it isn’t known if we’ll be able to create custom data profiles in the release-to-manufacturing (RTM) build of SQL Server 2008.

Some of the available profiles can only profile a particular data type. For example, if you designate a column-length profile on an integer-based column, you’ll receive a design-time warning on the Data Profiling task. If you attempt to execute either an individual Data Profiling task or a package that involves an incorrectly configured Data Profiling task, you’ll get a dialog box informing you of the problem and the execution process will halt.

Column-based profiles. The Column Length Distribution profile informs you of the various lengths for the values in a string-based column and the percentage (relative to the other distinct string lengths) those lengths represent. The Column Length Distribution profile supports columns with character data types. You can use the Column Pattern profile to determine the percentage of values in a column that match a particular profile. The Column Pattern profile supports character data types. If you need to review a column’s NULL count, you can leverage the Column Null Ratio profile, which provides the count and percentage of NULL values in a given column. Because the Column Null Ratio profile is looking only for NULLs, it can analyze a column with any data type.

Next up is the Column Statistics profile, which outputs a few basic statistics for a given column based on the data type. For numeric data types, the profile returns minimum, maximum, mean, and standard deviation values. For datetime data types, the profile returns the minimum and maximum values; the new Date and Time data types aren’t supported. The Column Value Distribution profile informs you of the distinct values in a given column and the percentages those unique values represent. The Column Value Distribution profile supports columns with numeric, character, and datetime data types.

Multicolumn, multitable profiles. There are three profile types in this category. The Candidate Key profile tells you a given column’s or combined columns’ key strength, which is a column’s (or group of columns’) ability to uniquely identify individual records within a given table. The Functional Dependency profile tells you the extent to which values in one column (i.e., dependent column) depend on another column’s (i.e., determinant column’s) values. For example, ZIP code is highly dependent on state. Finally, there is the Value Inclusion profile, which informs you of the potential for a column’s or combined columns’ ability to serve as a foreign key that could enforce a data integrity relationship between two tables. Value Inclusion supports columns with the integer, character, and datetime data types.

Configuring the Data Profiling Task
As with other SSIS tasks, the Data Profiling task is registered and installed during SQL Server 2008 setup. The task’s DLL is located at C:\Program Files\Microsoft SQL Server\100\DTS\Tasks. The Data Profiling task is a SSIS feature; consequently, to use it you must first start up Business Intelligence Development Studio (BIDS). Once BIDS has been launched you will need to either create a new or open an existing SSIS project. Once you open a SSIS package and display the toolbox you will notice a new item called Data Profiling Task. Drag and drop an instance of the Data Profiling Task from the Microsoft Visual Studio toolbox onto the package’s Control Flow designer, as shown in Figure 1.

Note that SQL Server 2008 Data Profiling can analyze only SQL Server 2000 and later systems, and you must use the ADO .NET Connection Manager.

Before you configure the Data Profiling task, you’ll need to create an ADO.NET connection manager to the source system to be profiled. With the Data Profiling task on the designer surface, right-click the task and select Edit. In the editor, you can optionally designate a timeout setting to dictate the length in seconds the task can run before timing out. Next are the destination settings—this is where usage comes into play. There are two primary uses of SQL Server Data Profiling:

• performing ad-hoc formal source system assessments
• performing recurring data quality checks as part of a larger extraction, transformation, and loading (ETL) process

If you’re performing ad-hoc assessments, simply designate a file destination for your profile’s output. However, if you want to perform recurring data profiling as part of an ETL (or other) process, you’ll want to load the output of your profiles into SSIS variables. If you specify a variable destination, simply select or create a new SSIS variable. If you select a file destination, you’ll create a new File Connection Manager and— optionally—specify whether to overwrite an existing file.

Now, to designate profiles, you can either select Profile Requests in the left pane of the editor or click the Quick Profile button in the General section of the editor. The Quick Profile route lets you select multiple profiles quickly that have default settings automatically applied.

As Figure 2 shows, the Profile Requests section has a Profile Type drop-down list from which you can select one of the eight available profiles. You can optionally provide a non-default Request ID for the profile and designate the profile’s settings in the Request Properties section. Note that you can also filter the selected profiles by expanding the View drop–down menu at the top of the dialog box. If you select a profile type in the View drop-down list, the Request Properties section displays the selected instances of the chosen profile type. After you designate the profiles to be used and their settings, click OK to complete the data profiling configuration process. Notice also that the editor’s treeview pane has a section called Expressions. As with the other SSIS tasks, you can optionally assign values for the Data Profiling task properties at runtime. For example, if you wanted to supply a single Data Profiling task with different configurations at runtime, you could apply an expression to the task’s ProfileInputXML expression.

Continue on Page 2

   Prev. page   [1] 2     next page
 
 

ADS BY GOOGLE