Many types of problems require you to analyze samples of data in sequences, looking for valuable information in data organized according to some order. For example, you might want to analyze sales figures over a time series, stock rates against a series of prices, or test results against a series of ratings. Most SQL Server 2000 procedures that analyze samples of data over a sequence can be very slow because they use cursors and set-based techniques. Now, SQL Server 2005 provides recursive common table expressions (CTEs) that greatly speed up analysis. (For an overview of recursive CTEs, see "Get in the Loop with CTEs,"April 2004, InstantDoc ID 42072.)
I'll show you some fast and powerful analysis techniques that use recursive CTEs. DNA sequencing, which involves analyzing patterns in sequences, provides the scenario for our examples. But first, I want to thank Rona Kisilevitz from Intel at Kiryat Gat, Israel, who helped me with the biological background information needed for these scenarios.
Don't forget to check out the sidebar "The Logical Puzzle," page 22, for the solution to last month's logical puzzle.This month's puzzle presents a game-show scenario in which you have to guess which curtain hides the prize.
DNA Sequencing
DNA sequencing is the process of determining the order of the pairs of chemical building blocks that make up a particular chromosome. DNA consists of sequences of four building blocks, or bases: adenosine (A), cytosine (C), guanine (G), and thymine (T). In their analyses, scientists look for certain patterns within mapped regions of a DNA string. A DNA string typically consists of mostly meaningless stretches of DNA known as junk DNA. For example, human DNA, made up of some 3 billion base pairs, is about 97 percent junk DNA.
That's quite a long sequence to search through. By locating certain patterns, scientists can identify the existence and placement of genes—small portions of a DNA string that control and organize an organism's vital functions. One way to identify genes within a DNA string is to look for known patterns or regions called promoters, which can indicate a gene's starting point. For example, a promoter called TATA Box has the sequence pattern TATAAA, the promoter GC Box has the sequence GGGCGG, and the promoter CCAAT Box has the sequence CCAAT. Finding the patterns that identify these promoters is a key technique in DNA sequencing.
Finding Patterns in a Sequence
To demonstrate how to use T-SQL to find patterns in a sequence, we'll create a solution that locates promoters within a given DNA sequence. First, run the code in Web Listing 1 (http://www.sqlmag.com, InstantDoc ID 48470) to create the tables called DNA, Promoters, and Patterns and populate them with some sample data.
The DNA table contains a small, imaginary DNA sequence; the Promoters table contains the IDs and names of promoters; and the Patterns table contains the patterns, or sequences of bases, that make up each promoter.We want to retrieve every promoter's starting point within the sample DNA sequence.
Typically, scientists analyze fairly small, isolated regions of about 1000 base pairs in the full DNA string. In some cases, though, there might be a need to analyze a larger string (up to the full length of the DNA string). If you want to test your solutions against large input, run the code in Web Listing 2 to populate the DNA table with 1 million rows containing random codes. Note, though, that I use the smaller set of sample data from Web Listing 1 for my examples.
Solution 1: A Normalized Representation
Listing 1 shows the first solution, which uses a normalized representation of the data created by running the code in Web Listing 1. The recursive CTE DNASeq isolates all promoter sequences within the DNA sequence. Callout A in Listing 1 shows the anchor member that joins the DNA and Patterns tables and matches all rows from the DNA table that contain a base that matches the first base of one of the promoters contained in the Patterns table. For each promoter, the query returns all bases that might be the promoter's first element within the DNA sequence. For each such possible starting point, the query returns the promoter's ID (pid), the DNA's subsequence starting point (DNAstart), the position within the subsequence (n), and the letter of the base (A, C, G, or T).
When the query finds a base that might be a promoter's starting point, the recursive member in callout B checks the next position in the DNA sequence for a base that matches the second base in the promoter. Recursion continues until the promoter's entire sequence is found or a mismatch occurs.The recursive member joins the previous result set (DNASeq) with the Patterns table (PNxt), then isolates the next row in the promoter's sequence:
PNxt.pid = Cur.pid AND PNxt.n =
Cur.n + 1
The query then joins the next row in the promoter's sequence to the next row in the DNA sequence (DNxt) and checks whether the bases match:
DNxt.n = DNAstart + PNxt.n - 1
AND DNxt.base = PNxt.base
The recursive CTE DNASeq returns all DNA subsequences that have the same pattern of bases as a promoter. Of course, you need to keep only those subsequences that fully match a promoter's sequence.To eliminate partial matches, we use another CTE, PromoLen, which callout C shows. PromoLen returns the length of each promoter (ln). In the outer query, we join the result of the recursive CTE DNASeq with PromoLen, isolating the end points of the DNA subsequences that have the same base pattern and length as a promoter. Then, to return the promoter names, we add a join to the Promoters table, as callout D shows. By default, recursive calls are limited to 100 repetitions. The MAXRECURSION option 0 removes that limitation. Figure 1 shows the promoters' starting points that Solution 1 returns.
For short DNA sequences of about 1000 base pairs, this solution runs in a fraction of a second. For larger sequences with 1 million base pairs and more than 1000 occurrences of promoters, the solution runs in less than 30 seconds.
Prev. page  
[1]
2
next page