Semantic heterogeneity might sound like a meaningless bit of tech jargon, but it actually describes a subtle set of problems confronting many developers and DBAs. Anyone who's ever had to combine data sets from two different databases (and what DBA or database developer hasn't?) has probably encountered a semantic heterogeneity: data that doesn't match in format and therefore is difficult to merge. Combining databases when temperature is recorded in Celsius in one database and Fahrenheit in the other, for example, can result in significant data corruption unless you deal with it in advance. Recovering from such a data corruption is at best extremely time-consumingat worst, it's a logical impossibility. This article examines the four types of semantic heterogeneity, looks at where they occur in database comparisons, and shows different ways to handle them.
Semantic Heterogeneity Types
Semantic heterogeneity is a complicated term for the phenomenon of disagreement about the meaning or interpretation of the same or related data. Sources for semantic heterogeneity can include differences in data-definition constructs, differences in object representations, and system-level differences in the way that atomic data (e.g., byte order for multibyte data, such as an integer) is stored in two systems.
Dr. Stefano Ceri and Dr. Jennifer Widom, two leading database scientists, identify four kinds of semantic heterogeneity, as Table 1 shows: naming conflicts, domain conflicts, metadata conflicts, and structural conflicts. Let's look at each briefly, using example databases A and B in Figure 1 for illustration.
Naming conflicts occur when two databases use different names to identify the same real-world entities. For example, one database schema might store an employee identifier in a field named Employee ID, but in another database schema, a field containing the same data could be called Employee_Number.
Domain conflicts occur when different databases use different values to represent the same concepts. In the central company database, an employee's identifier might be as123, for example, but the department database might list her as number 17. The temperature representation example, in which one database uses Fahrenheit and another uses Celsius, falls into this heterogeneity category. For some examples of the common types of domain conflicts, see Table 2.
Metadata conflicts occur when the same concepts are represented in one database at the schema level and in another database at the data level. Historical salary information about an employee, for example, can be stored either as part of a table or as separate rows. For example, in database A in Figure 1, the last two salary values are stored as attributes; database B shows them as rows in a separate table.
Structural conflicts occur when different data organizations are used. In database B, a name is represented as one character string, whereas database A stores the name as two strings, one for the surname and one for all the other names. This kind of semantic heterogeneity also includes type conflicts, when different data types represent the same concept. For example, you can represent temperature as either a floating point or an integer type.
Where the Problems Occur
So where do such semantic heterogeneities occur when you're comparing databases? To answer this question, you need to consider the two levels at which databases are comparedthe schema level and the data level.
Schema-level comparison. A schema-level comparison matches table names of one database with table names of the other. The matching tables are checked to ascertain whether their schemata are the samethat is, do they contain the same number of columns, and do the column names and domains match? This process also compares the constraints on the databases. The result of the schema-level comparison shows the DBA any differences between the compared databases at a structural level. It can show that a particular table that exists in two databases includes a certain column in one and not the other, as Figure 2 shows, or that the data type of a certain table column has changed. You can use these schema differences to synchronize the schemata of the two databases.
Comparison toolssuch as Embarcadero's Change Manager, SQL Compare from Red Gate Software, DB SynchroComp from E-Dule, or AdeptSQL's Diffcan quickly identify the tables whose equivalents are missing from one of the databases you want to examine. But comparison tools generally consider only the name of the table, so they can miss renamed tables. The reason the comparison tool doesn't try to match tables automatically is that, although it's possible to guess the name of the equivalent table in certain circumstances, the guess might not be reliable. Thus, tables that are structurally identical yet represent different concepts might be synchronized, resulting in corrupted data.The last thing a DBA wants to do is automatically synchronize data between two random tables whose schema happen to be highly similar.
During schema-level comparison, your comparison tool can identify many semantic heterogeneity problems from the categories of naming conflicts, metadata conflicts, and structural conflicts. However, most of the domain conflicts can remain undetected. You can address these conflicts at the next database-comparison level.
Data-level comparison. Data-level comparison for one table in each database classifies rows into one of three categories:
- The row exists in the first database and has no match in the second database.
- The row exists only in the second database.
- The row exists in both databases and either it's identical or parts of the row are differentthat is, some columns contain different values.
Note that this third classification is based on the assumption that the compared rows can be uniquely identified within a table, usually by a primary key. Generally, commercial data-comparison tools require the presence of a primary key on tables whose data you want to compare.
You can use the differences the comparison tool identified to synchronize the databases at the data level. Figure 3 shows an example of such identified differences. But is the data or the database schemata correct after such a synchronization?