SideBar    The Logical Puzzle
DOWNLOAD THE CODE:
Download the Code 48470.zip

Solution 2: A Denormalized Representation
The second solution is based on a denormalized representation of both the promoters and the DNA sequence. We store each promoter as a single string (e.g.,TATAAA) and store the entire DNA sequence as a string in a VARCHAR(MAX) data type that can hold up to 2 billion base pairs.

First, run the code shown in Web Listing 3 to create the PromotersD and DNAD tables and populate them with the sample data in denormalized format. Web Listing 3 populates the DNA string in the DNAD table by querying the normalized form of the data stored in the DNA table created by Web Listing 1. An INSERT statement invokes a query with the FOR XML PATH option, which concatenates the letters into a single string. (If you want to instead populate the DNA table with the large DNA string, run the code in Web Listing 2 before you run the code in Web Listing 3.)

Listing 2 shows Solution 2, which is simpler and faster than Solution 1's normalized representation. In Solution 2, the anchor member joins the PromotersD and DNAD tables with a join condition that uses the CHARINDEX function at callout A to look for the first occurrence of a promoter within the DNA string. When a promoter is found, the recursive member (PromCTE) looks for the next occurrence of the same promoter, until it finds no matches.That's all there is to it. Running Solution 2 against the large DNA string generated by Web Listing 2 returns the result in 2 seconds, about 15 times faster than Solution 1.

If you originally stored your data in denormalized form and later have a solution that needs normalized data, you can easily convert denormalized data into normalized form. Split each string into the characters of the individual base names and run the code in Web Listing 4 to create and populate the Nums auxiliary table. Then join the denormalized table with Nums. Listing 3 shows the code snippet that normalizes the denormalized promoters (PromotersD table). As you can see, the join condition uses n <= LEN(string), and the SUBSTRING function in the SELECT list extracts the individual base-name characters. Using a similar process, you can normalize the denormalized DNA strings in the DNAD table.

Fast and Flexible, or Faster and Specialized
I've shown you two ways to identify patterns in sequenced elements. Both solutions demonstrate the power of recursive CTEs. You can use Solution 1, which relies on normalized data, in a similar way with other types of sequenced elements, such as time series. Solution 2, although dramatically simpler and much faster, is also more specialized and limited. Solution 2 assumes that samples are fixed in length, that you can represent the samples in a denormalized form as a string, and that you can use string functions such as CHARINDEX to look for occurrences of a substring. In cases where you're working with variable-length samples, Solution 1's generic normalized representation will do the job.

Itzik Ben-Gan (itzik@solidqualitylearning.com), a mentor at Solid Quality Learning, teaches, lectures, and consults internationally. He manages the Israeli SQL Server Users Group, is a SQL Server MVP, and is a coauthor of Advanced Transact-SQL for SQL Server 2000 (Apress).

End of Article

Prev. page     1 [2]     next page -->



You must log on before posting a comment.

If you don't have a username & password, please register now.

 
 

ADS BY GOOGLE