You can represent data values differently in two ways. The first is a structural difference. For example, you can represent a name as either one string or two strings (containing first and last names). Luckily, comparison tools can identify many of these structural representation differences during the schema-level comparison. Merging such data requires the DBA to specify ad-hoc conversion routinesfor example, one that splits a name into two parts (first and last name) or joins the first and last names to form a full name.
The second type of difference occurs when the interpretation of the values is different, as in Fahrenheit or Celsius. You can store such a value by using a floating-point data type such as real. In this case, automatic tools can't detect any differences between the schemata of two tables that contain such data. If you manually identify this difference, you need to provide a conversion function that changes the temperature value from one interpretation to the other. Because automatic comparison tools can't detect semantic-heterogeneity problems, you need to manually create these conversion functions and add them to the synchronization or migration scripts.
Unfortunately, you can't always create a conversion function that works in both directions without data loss. If your database stores temperature values as integers, for example, converting a Fahrenheit value to Celsius and back can result in a different temperature value than the original. For example, 101F = 38.3333C, which rounds to 38C; however, 38C = 100.4F, which rounds to 100F, not 101F.
Sometimes you can't provide such conversion functions at all. A string concatenation with unknown boundaries doesn't have an inverse. You can join the two strings "ab" and "cdef" to make "abcdef," but you can't reverse-engineer that to get the value of the original two strings if you have no information about where to split "abcdef."
Solutions to Common Problems
One way of handling semantic heterogeneity is to use strong data typing and associate the semantics of the data with the data type. Most database management systems (DBMSs) allow subtyping, making it possible to define a new data type for a temperature in Celsius, for example.
In SQL Server 2000, you can create user-defined data types (UDTs) by using the sp_addtype stored procedure:
EXEC sp_addtype CtemperatureInt,
int, 'NOT NULL'
EXEC sp_addtype FtemperatureInt,
int, 'NOT NULL'
Unfortunately, SQL Server automatically converts values of the above two data types when necessary; thus, you don't achieve strong typing by creating a UDT. However, the difference in the data type shows up during schema comparison. And SQL Server 2005 promises a more flexible data-type support that could help to achieve strong typing.
Using UDTs adds implicit documentation to your schemata by providing a more descriptive and specific name for the data. It also helps to achieve consistency in whether a field is nullable. A disadvantage of using such derived types is that you might have difficulty figuring out the base data type when you're looking at the name of a derived type. You can solve this problem by suffixing the type's name with the base type, as in the UDT code example above.
You can also identify some interpretation differences by looking at the data. Body temperature, for example, should have a range between certain sensible limits. A body temperature value of 100 C° is certainly unacceptable in a database of medical records. In such cases, you can set up database integrity constraints. Constraints can help to detect when incorrect data is inserted into a table as a result of a table merge.
To set up a check constraint for a temperature column whose type is CtemperatureInt, you can include the following code in the table declaration:
temperature Ctemperature CHECK
(temperature >10 AND
temperature <43)
Inserting an incorrect value would result in an error, such as the one Figure 4 shows. Often, a schema is already given, such as during migration from legacy applications. Adding temporary check constraints can help to find problems in the migration scripts.
An alternative to checks are triggers that consider the value of inserted or modified fields, then abort the transaction when they encounter incorrect values. You could replace the preceding check constraint for single-row inserts with a trigger like the one Listing 1 shows. Check constraints generally perform better when most of the inserts are successful, and triggers perform better when most of the inserts contain invalid values.
SQL Server 2005 will introduce data-definition language (DDL) triggers. SQL Server will be able to execute such DDL triggers automatically when database schemata change. This will give DBAs the means to enforce basic schema-related rules on the database. Such rules might include mandatory documentation, the use of certain data types, and the mandatory use of check constraints.
In many cases, not even check or trigger constraints help, since the values of two interpretations can have overlapping value ranges. Two examples are the differences in floor numbering between the UK and the United States and representing money from different currencies. If you don't use derived data types to capture this disaccord of the data semantics, a good documentation of the database schemata is the last resort for a DBA. Automatic tools don't detect these conflicts, so unless you know that two databases use the same schema and are maintained in exactly the same way, you should double-check the merging script your database comparison tool provides and contrast it with the databases' documentation.
No Easy Solutions
Semantic heterogeneity leads to complex problems that often don't have simple solutions. DBAs and database developers need to recognize when semantic heterogeneity occurs and what actions they can take to prevent, detect, and fix resulting problems. A good resource for more information about this topic is "Managing Semantic Heterogeneity with Production Rules and Persistent Queues" by Dr. Stefano Ceri and Dr. Jennifer Widom. A PDF of the article is available for download at http://dbpubs.stanford.edu:8090/pub/1993-17.
End of Article
Prev. page
1
[2]
next page -->