• subscribe
May 01, 2002 12:00 AM

Statistical Process Control

SQL Server Pro
InstantDoc ID #24814
Downloads
24814.zip

I might have been satisfied and set the problem aside at this point. But as a mathematician, when I see two digits such as 4 and 7, I think m and n. I try to generalize problems, and I'm not satisfied with a solution that works only in a few specific cases.

On the surface, this appears to be a difficult problem. The two solutions I've mentioned—storing rank orders and testing all subsequences—become intractable for sequences that contain more than about 15 to 20 measurements. Winterbottom's sequences of 7 measurements had 5040 possible rank orders and 35 subsequences of 4 measurements. But if you look at sequences of 15 measurements and examine subsequences that contain just 8 measurements, the number of rank orders increases to 15! (15 factorial), or more than 1012, and the number of subsequences increases to more than 6000. The numbers become ridiculously large for longer sequences.

You can sometimes solve problems like this—in which you have far more cases than you can possibly check individually—by using dynamic programming techniques. A dynamic programming solution keeps track of partial results—the longest increasing subsequences up to each point in the original sequence of measurements—and uses these partial results to build the final solution. Such algorithms work well in this particular case, but you can do better with an entirely different approach that happens to translate very nicely into T-SQL. Because an analyst might need to consider any number of measurements at the same time, I developed a query to find the longest increasing subsequence within a sequence of any length. The code in Listing 2 creates tables containing the data that Figure 1 shows, if you want to try your hand at devising a solution.

Mathematics to the Rescue
Mathematicians have a remarkable habit of developing solutions to practical problems before those problems arise. For example, the mathematics that underlies present-day cryptography was developed long before cryptography. And the inverse problem that must be solved for CAT scanning to work had been stated and solved decades before the first CAT scan. So, not surprisingly, when I searched mathematics literature, I discovered an efficient algorithm for finding the longest increasing subsequences in a given sequence.

Mathematical research in this area is concerned with finding the probability distribution of the longest increasing subsequence of a random permutation of the first N positive integers. For someone who is mathematically inclined, the problem is intriguing and contains many surprising connections.

If you have a sequence S of length n, how can you find the largest value of m for which there exists an increasing subsequence of S having length m? Figure 2 shows the algorithm I developed to find this value; the algorithm is adapted and simplified from an algorithm in David Aldous and Persi Diaconis's paper "Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem," Bulletin of the American Mathematical Society 36 (1999): 413-432. (Patience is the British name for the card game called solitaire in the United States, so I chose the name Solitaire for a table in the T-SQL solution.)

Note that whereas the final size of L is the length of a longest increasing subsequence, L doesn't necessarily contain the elements of a longest increasing subsequence. Table 1 shows the contents of the list L after each iteration of the algorithm when it's applied to the Uptown readings.

T-SQL, Finally
Translating this algorithm into T-SQL isn't difficult. The list L becomes a small table called Solitaire. Finding the smallest number in Solitaire that's greater than or equal to a given number is an ideal task for T-SQL. Because numbers are added to the list only if they're larger than all other numbers in the list, I added an identity column called pos to Solitaire to make finding that smallest number even simpler. The identity column keeps the elements of L in order from smallest to largest, so the smallest element is first in order by the identity column pos. You might balk at using a cursor to perform the algorithm's loop, but I found no set-based alternative. Listing 3 shows the algorithm, which assumes the tables that the code in Listing 2 creates.

After solving the problem of finding sequences of seven measurements that contain an increasing subsequence of at least four measurements, I wanted to find a practical solution for arbitrary values of m and n. I not only found a solution, I found a nice bit of mathematics that I hadn't previously encountered. The general application to a practical problem was a pleasant added bonus.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here