When someone asks you to describe the data in your organization, a given system, or a particular database, how often do you answer in terms of storage size, growth rate, the number of tables, or the number of rows? These characteristics are certainly important for designing and maintaining an efficient and supportable solution, but I propose that instead of answering only in terms of quantity or volume, your response should also include a characterization of the data's quality. After all, it's not the amount of data that matters: it's the data's quality that's important. An organization can have the largest and fastest growing database in the world, but if it can't trust or quantify the quality of its data, the data is useless.
The term "smart data validation" describes the concept of setting up the database and processing environment to enforce validation automatically. When you're designing, implementing, and maintaining your own databases, you should consider several important aspects of database design: data ownership and stewardship; data definitions; data modeling; data normalization; nulls and data types; and data cleanup and integrity. As you read through this article's overview of these aspects, one caveat you should remember is that there are no absolutes when talking about design and implementation options. Instead, you should always consider the features, advantages, and tradeoffs of each aspect as they relate to your particular environment. If you keep this rule in mind as you read, you'll get lots of ideas for increasing your data's quality by implementing smart data-validation strategies.
Basic Concepts: Definition, Ownership, and Stewardship
Before digging into the technical details of data validation, we must begin by discussing three crucial factors that affect data quality: data definition, data ownership, and data stewardship. Failure to address these factors can lead to an inability to maintain the data's consistency and validity over time. Definition describes, in business terms, the meaning of the data.Your definitions should include not only a brief textual description of the data, but also any limiting constraints such as valid values or relationships to other data. These definitions form the basis of any rules used during data validation. (I discuss data definitions in more detail in a moment.)
Ownership refers to the people responsible for making decisions affecting the data. Some of the most important decisions that database owners make include validation of the data's definition and approving the data's acceptable
validity level (which I define later in the article).Stewardship refers to the group responsible for maintaining the data. This group includes the application developers responsible for any systems that acquire and maintain the data, the DBA responsible for storing and ensuring the data's availability, and any users who consume, create, or enter the data.
This article primarily focuses on validation as it relates to data definition and stewardship, but if you haven't established ownership, you have no place to turn for answers when you're trying to establish what data is valid or invalid. So as you begin to implement smart data-validation strategies in your own organization, start first by establishing database ownership and stewardship because they lay the foundation for all the design and implementation topics that follow.
When you keep the concepts of data definition, ownership, and stewardship in mind, defining data validation isn't as straightforward as you might expect.The definition of what constitutes "valid data" varies from one organization to another. For example, a financial organization such as a bank, credit card company, or brokerage might have stricter validation rules for its data than a government agency such as the United States Census Bureau. Furthermore, the definition of "valid data" often varies within an organization from one system to another. In a bank, the transactional checking and savings account systems might require a state of "absolute correctness," in which account balances are accurate to the penny and available to customers 24 × 7. Within the same banking organization, analytical warehousing systems might be more tolerant because account-balance information for sales pipeline analysis is required only weekly and reflects average daily balances rounded to the nearest $1,000.
Defining data validity is often a case of defining a reasonable tolerance level rather than an absolute rule for correctness. Therefore, validity levels aren't one-size-fits-all. Instead, each organization must create a validity definition that meets its particular needs and the demands of the system under consideration. I give some examples of determining "absolute correctness" in a discussion about maintaining domain integrity later in this article. Unfortunately, examples of "validation by trending" are beyond this article's scope. Now however, we must dig deeper into the idea of data definition.
The First Step: Data Definition
As I mentioned, the data's definition forms the basis for any rules you use for subsequent validation. In other words, you can't validate what you can't define. So the first step in establishing smart data validation is to acquire and verify definitions for the data.
In a perfect world, the definitions are already captured and published as part of your data repository's dictionary or as part of the system's requirements and specification documentation. However, in many organizations, the definitions are often undocumented, buried deep within the code base, and long forgotten by a staff that has since moved on in support of other systems. If you're in such a situation, consider starting a dictionary by defining the following attributes for the data entities and the items within each entity: Name, Type (Entity or Attribute), Description, Owner, ExampleValues,ValidValues, Uniqueness, an indicator of whether the value is optional or required, and a description of any known constraints. In addition, try to capture any data relationships. For examples of what the definitions of a Customer, Account, and an attribute of each in a banking organization's account management system might look like, see Web Figure 1.
After establishing data definitions, the next step in applying smart data-validation strategies is moving from requirements to design.You need to use data-modeling techniques to maintain the validity of data and incorporate appropriate metric-gathering functionality into the system's various operational processes. These processes might include Extraction, Transformation, and Loading (ETL) designs that you can use to measure changes in data volume and capturing metrics during transactional operations to assist with validation through trending.
Prev. page  
[1]
2
3
next page