Fourth, many times these email messages are sent to groups. It’s an ironclad law of corporate behavior that when everyone is responsible, no one is responsible. So nothing will happen until someone with a "C" in their title (as in CIO) gets a call asking why there’s no data. This isn’t where you want to be.
Ideally, there will be a simple interface with an easy to understand "red light, green light" approach to these errors. That application will direct the operations staff to detailed messages that will help them. Unfortunately, we typically get rather cryptic error messages back from our applications. Often, we have stumbled across these problems while in the development cycle or handing off to operations. We should be making it a practice to decipher these messages for the operations staff.
For example, I recently completed a project that used a web service as a source for a data mart. Towards the end of development, I would often get the following exception in my code: An unknown error has occurred. Please contact the application administrator for further information. This is hardly the most informative error message ever created. However, we noticed that invariably this error would arise because the web service server was down. It seemed to me to be a matter of common courtesy, if nothing else, to pass this knowledge to the poor soul who would be getting it at 2 AM, when this data load was scheduled to run. In this case, some simple code wrote the additional details out to a text file, and the path to the text file was then put in the message logged in the Windows Event logs, as shown in Listing 1.
It should also be noted that in this application the inner exception messages contain a good deal of information that could help solve the problem. This is a simple method call that can be abstracted to a class file and called in whatever way you typically call a method. In our case, for example, we simply called the method from the catch portion of a try/catch block in a VB Script. You can also incorporate this method into a CLR stored procedure and execute it from a try/catch block in a SQL statement, or you might call it from the packages on error event handler.
Where you raise such errors will depend on where the monitoring system looks for the application status. I have yet to find a monitoring tool that didn’t look at the Windows Event logs for errors, so logging errors to the event logs would be a good place to start. Again, the code to do so is quite simple using the EventLog class as Listing 2 shows.
In this case, we simply note that an error has occurred, raise it to the Application event log, and direct the user to a text file where they can find the details they need to diagnose the problem. In short, this need not be complex, but it does need to provide the operations staff with the data they need to do their jobs. The following are three best practices for providing operations staff with the necessary information:
- Operational staff needs a simple UI to monitor ETL status (NOT EMAIL!).
- Provide error details somewhere. It’s easy to simply write these details to a text file and provide the path to this file via the Windows event logs or the monitoring tool.
- Have some courtesy. If you can provide more details then the exception thrown by the application, do so.
WTF Logging
Every ETL project invariably comes across data in unexpected formats or is otherwise a knuckleball. For example, you’re expecting integers and get "two" instead of "2." None of these problems will rise to the level where an application can or should fail. Instead, these represent business rules or data quality issues that should be reported back to users so that the inconsistencies might be addressed. Data from source systems can, for example, violate business rules or constraints on the target system. Every database professional has, at some point in their career, come across a table that was created using poor design practices or coding practices. Such practices result in critical data held in nullable fields. If a more carefully designed BI system attempts to enforce the NOT NULL business rule, an insert or update error will inevitably arise when someone attempts to insert a null value carried erroneously on the source system. This error should not rise to such a level that an entire ETL process fails. Instead, the single row causing the error should be caught and the error handled gracefully through the WTF logging process.
Data from the source system can violate rules of common sense or industry standards, and a bug in the source system can produce illogical data. For example, in one application, the cumulative production of an oil well would reset to 0 from time to time on the source system because of a source system bug. So, when calculating the average production of a well, we tried to use a simple formula of
(Cumulative Volume Now – Cumulative Volume Then)/elapsed time
With a cumulative volume reset to zero, we would get a negative number as a result—implying the company was putting oil back in the ground. At the price at the time, about $140/barrel, this seemed unlikely.
These examples illustrate different types of data quality issues, each of which might be handled in different ways using SQL Server Integration Services (SSIS). Those issues can broadly be classified as the following:
- Data from the source system fails based on some constraint or business rule on the target system.
- The source system consistently violates either an industry standard or a corporate standard.
- A bug in the source system produces unexpected or illogical data.
There are several ways to handle these issues as they arise. The first, and my preference, is to work with the folks running the source systems to resolve the problems. This usually involves a dope slap, accompanied with some variation of "@#$, what are you thinking?!" Unfortunately this method, while effective, violates most HR policies. Your fallback here should be more political negotiations around field mappings, particularly where you want to keep your data mappings within some reasonable semblance of logic and industry standards.