| Executive Summary: The Data Profiling task is an incredibly useful, yet often overlooked tool in Microsoft SQL Server 2008. This article explains how to configure and use both the Data Profiling task and the Data Profile Viewer (DPV) to examine and assess your data, then view the results in SQL Server Integration Services (SSIS). |
I’ve noticed a distressing lack of coverage of SQL
Server 2008 Integration Services’ (SSIS) new
data profiling feature, even though data profiling
capabilities are a significant feature for data warehouse
developers. Data profiling tools have been available for
some time, but until now, SQL Server hasn’t natively
had the capability. The relative lack of coverage of SQL
Server 2008’s data profiling capability truly makes it a
hidden treasure.
Data profiling is the process of examining source
system data and collecting various statistics for data
quality and data integration assessment. It has been
a frequently requested item for SSIS—or at least the
larger SQL Server business intelligence (BI) platform—
for some time now. Microsoft is competing heavily in
the BI industry against other BI platform providers
(many of which have now consolidated) that have
offered data-profiling capabilities for some time. This
article shows how to set up the Data Profiling task,
designate profiles, and build and view profile output in
the new Data Profile Viewer utility.
Available Profiles
The SSIS Data Profiling task contains a total of
eight data profiles that you can use to assess the data
quality of source systems. Five of the profiles analyze
individual columns in a table or view. The remaining
three profiles analyze multiple columns or relationships
between columns, tables, or views.
At this time, it isn’t known if we’ll be able to create
custom data profiles in the release-to-manufacturing
(RTM) build of SQL Server 2008.
Some of the available profiles can only profile a
particular data type. For example, if you designate a
column-length profile on an integer-based column,
you’ll receive a design-time warning on the Data
Profiling task. If you attempt to execute either an individual
Data Profiling task or a package that involves
an incorrectly configured Data Profiling task, you’ll
get a dialog box informing you of the problem and the
execution process will halt.
Column-based profiles. The Column Length Distribution
profile informs you of the various lengths for
the values in a string-based column and the percentage
(relative to the other distinct string lengths) those
lengths represent. The Column Length Distribution
profile supports columns with character data types.
You can use the Column Pattern profile to determine
the percentage of values in a column that match a
particular profile. The Column Pattern profile supports
character data types. If you need to review a column’s
NULL count, you can leverage the Column Null Ratio
profile, which provides the count and percentage of
NULL values in a given column. Because the Column
Null Ratio profile is looking only for NULLs, it can
analyze a column with any data type.
Next up is the Column Statistics profile, which
outputs a few basic statistics for a given column based
on the data type. For numeric data types, the profile
returns minimum, maximum, mean, and standard
deviation values. For datetime data types, the profile
returns the minimum and maximum values; the new
Date and Time data types aren’t supported. The
Column Value Distribution profile informs you of the
distinct values in a given column and the percentages
those unique values represent. The Column Value
Distribution profile supports columns with numeric,
character, and datetime data types.
Multicolumn, multitable profiles. There are three
profile types in this category. The Candidate Key profile
tells you a given column’s or combined columns’
key strength, which is a column’s (or group of columns’)
ability to uniquely identify individual records
within a given table. The Functional Dependency profile
tells you the extent to which values in one column
(i.e., dependent column) depend on another column’s
(i.e., determinant column’s) values. For example, ZIP
code is highly dependent on state. Finally, there is the Value Inclusion profile, which informs you of the
potential for a column’s or combined columns’ ability
to serve as a foreign key that could enforce a data
integrity relationship between two tables. Value Inclusion
supports columns with the integer, character, and
datetime data types.
Configuring the Data Profiling Task
As with other SSIS tasks, the Data Profiling task is registered
and installed during SQL Server 2008 setup. The
task’s DLL is located at C:\Program Files\Microsoft
SQL Server\100\DTS\Tasks. The Data Profiling task
is a SSIS feature; consequently, to use it you must first
start up Business Intelligence Development Studio
(BIDS). Once BIDS has been launched you will need
to either create a new or open an existing SSIS project. Once you open a SSIS package and display
the toolbox you will notice a new item called
Data Profiling Task. Drag and drop an
instance of the Data Profiling Task from the
Microsoft Visual Studio toolbox onto the
package’s Control Flow designer, as shown
in Figure 1.
Note that SQL Server 2008 Data Profiling
can analyze only SQL Server 2000 and
later systems, and you must use the ADO
.NET Connection Manager.
Before you configure the Data Profiling
task, you’ll need to create an ADO.NET
connection manager to the source system to
be profiled. With the Data Profiling task on
the designer surface, right-click the task and
select Edit. In the editor, you can optionally
designate a timeout setting to dictate the
length in seconds the task can run before timing out.
Next are the destination settings—this is where usage
comes into play. There are two primary uses of SQL
Server Data Profiling:
• performing ad-hoc formal source system
assessments
• performing recurring data quality checks
as part of a larger extraction, transformation,
and loading (ETL) process
If you’re performing ad-hoc assessments,
simply designate a file destination for your
profile’s output. However, if you want to
perform recurring data profiling as part of an
ETL (or other) process, you’ll want to load
the output of your profiles into SSIS variables.
If you specify a variable destination,
simply select or create a new SSIS variable.
If you select a file destination, you’ll create
a new File Connection Manager and—
optionally—specify whether to overwrite an
existing file.
Now, to designate profiles, you can either
select Profile Requests in the left pane of the
editor or click the Quick Profile button in
the General section of the editor. The Quick
Profile route lets you select multiple profiles
quickly that have default settings automatically
applied.
As Figure 2 shows, the Profile Requests section has
a Profile Type drop-down list from which you can select
one of the eight available profiles. You can optionally
provide a non-default Request ID for the profile and
designate the profile’s settings in the Request Properties
section. Note that you can also filter the selected profiles
by expanding the View drop–down menu at the top of
the dialog box. If you select a profile type in the View
drop-down list, the Request Properties section displays the selected instances of the chosen profile type. After
you designate the profiles to be used and their settings,
click OK to complete the data profiling configuration
process. Notice also that the editor’s treeview pane has a
section called Expressions. As with the other SSIS tasks,
you can optionally assign values for the Data Profiling
task properties at runtime. For example, if you wanted
to supply a single Data Profiling task with different configurations
at runtime, you could apply an expression to
the task’s ProfileInputXML expression.
Continue on Page 2
Prev. page  
[1]
2
next page