Jake Ashcraft, a systems engineer at Progressive Software Solutions in Albany, Oregon, contacted me about a problem to which he could find no set-based solution, just a solution that uses cursors and temporary tables. The problem is representative of a larger group of problems that deal with identifying groups of elements, or identifying subsets within sets. Let's look at the problem, then explore four set-based solutions that approach the problem from different angles. Thanks to Dieter Nötha true SQL wizardfor providing one of the solutions I present. As always, I encourage you to try your hand at solving the problem before looking at the solutions.
The Problem: Identifying Emission-Level Groups
The problem relates to a gas turbinedriven power station, where the emission levels of the turbine exhaust gases are monitored by the US Environmental Protection Agency (EPA). Gas emission levels are measured periodically and recorded in a table called Samples. You can run the script that Listing 1 shows to create the Samples table and populate it with some test data. The Samples table includes two columns: sample_ts, which is the sample's timestamp, and NOx, which is the emission level recorded.
The EPA needs to know when the level of NOx emissions from the gas turbines equals or exceeds 70. For each breach of this limit, a report needs to display the exact date and time when the level first exceeded 70 and when it returned to a value of less than 70. The report must also display the maximum NOx level during the intervening period. Table 1 shows the desired results, assuming the Samples table contains the test data that Listing 1 provides. Note that the elapsed time should be returned in the format hh:mm:ss.
You can use Figure 1, page 14, to help you visualize the desired results. The X axis shows the sample timestamps, and the Y axis shows the emission levels. The red elements in the graph highlight the groups of samples where the emission levels exceeded the prescribed value. The green symbols represent the maximum emission levels in each group. The challenge: From each group of interest, return the start timestamp, end timestamp, elapsed time, and maximum level measured.
Solution 1: Using Views
For the first solution, I took a modular approachusing views to break the solution into steps. First, I used the code that Listing 2, page 14, shows to create a view called VCurPrev that returns, for each sample timestamp, both the current NOx value and the previous one. The subquery returns the NOx value of the most recent sample of all the samples that were taken before the outer query's sample. So each sample retrieves the previous sample's NOx value. A SELECT * query against the VCurPrev view returns the output that Table 2 shows.
The second step is to create a view called VStartEnd that returns each group's start and end timestamps, as the code in Listing 3 shows. The query in the view joins an instance of VCurPrev called V1 to another instance of VCurPrev called V2. V1 represents starts of groups, and V2 represents ends of groups. The JOIN condition matches each row in V1 with all rows in V2 that have a more recent timestamp. The filter leaves only the rows that have a current value in V1 that's greater than or equal to 70 and a previous value in V1 that is less than 70 or is NULL. The filter also verifies that the current value in V2 is less than 70 and the previous value in V2 is greater than or equal to 70. In other words, the current value in V1 marks the start of a group, and the current value in V2 marks the end of a group. The SELECT list returns the timestamp of the start of a group; each start of a group might have several matching ends, but the SELECT list returns only the end that has the minimum timestamp. A SELECT * query against the VStartEnd view returns the output that Table 3 shows: three matching periods and their start and end times.
The third and final step in this solution is to write a query that joins the VStartEnd view with the Samples table. The code in Listing 4 matches each group in the VStartEnd view with all samples in the Samples table that fall between the group's starttime and endtime. For each group, the query uses the DATEDIFF() function to calculate the elapsed time in seconds between starttime and endtime, then calculates the maximum NOx value.
Note that the query returns the elapsed time only in seconds. If you want to format the elapsed time as hh:mm:ss, you need to write an expression that extracts the hours, minutes, and seconds; converts them to character strings; and concatenates them. You can write a user-defined function (UDF) that accepts a value in seconds as an argument and returns a formatted string. Running the script in Listing 5 creates the dbo.fn_fmtelapsed() UDF. The function uses integer division and modulo operations to extract the hours, minutes, and seconds. Then, it prefixes each part with a 0 if the part contains only one digit. After creating the UDF, you can execute the query that Listing 6 shows to get the desired results that Table 1 shows. This solution is good because it's based on the step-by-step modular approach. Such a solution is usually less prone to errors and easier to maintain than a solution that uses one gigantic query. On the other hand, this solution isn't optimal in terms of performance, as I explain later.