As you can see from the above examples, the structure of FT_EmailSubject allows complex searching, including nearness matches (one word appearing within n words of another word), OR searches, AND searches, and stemming. The only limitations on searching are the sophistication of the query generator and the requirement of one join of the FT_EmailSubject table for each word in the search phrase. Our implementation limited our searches to the five longest words of the original user-entered search string; this provided a good balance between performance and narrow search results.
Notice that all text queries to our table require the DISTINCT keyword. Without DISTINCT, if a word appears in the subject field twice, the MessageID could appear twice in the result set. For a variation that removes the sorting overhead of DISTINCT, see the Web sidebar "Improving Retrieval Performance." (To access the Web sidebar, go to http://tsqlsolutions.com and enter InstantDoc ID 27363.)
In our experiments, we found that indexing the FT_EmailSubject table worked best when we put a clustered index on the Word, MessageID, and WordPosition columns and a nonclustered index on the MessageID, Word, and WordPosition columns. As always, you'll need to perform your own index tuning for your environment.
Full-Text Search vs. Custom Programming
In comparing the full-text search component and our custom solution, we found that, for our project, the full-text search component's biggest advantage was lower development cost. After implementing our custom solution, we discovered that our solution had far lower administrative costs. Besides our concerns about the full-text search component's reliability based on our own experience and newsgroup postings we also found that performance tuning, replication, and disaster recovery were all more difficult (or impossible) with the full-text search component than with our custom solution.
One area in which the full-text search component performed better than our solution was in the RANK feature. This feature provides a numeric appropriateness value for each result that a query returns. This value isn't well documented, but it's a function of the number of times search words appear in the string returned from the search. The full-text search component provides this value automatically, but in a custom solution, you'd have to provide a way to calculate the value. However, because our Web site's search results are often sorted alphabetically, this limitation in the custom approach was insignificant for our project. Table 1, page 6, shows how the full-text search component compares with our custom implementation in several performance categories.
Home-Grown Possibilities
This article gives you just a taste of what you can do with a custom search solution. The indexes recommended in this article use a LIKE clause to let users search for the first few letters of a word efficiently (e.g., a search for magic also returns magical and magician). And you can tweak the word parsing to improve similarity searching by removing duplicate sequential characters and playing with word-break characters. For example, a search for Reily could return O'Reilly and Reilly. Finally, you can easily change the entire approach to perform SOUNDEX searching: Simply replace the Word field with a SOUNDEX field and store SOUNDEX(Word) instead of Word in the population steps. The SOUNDEX value could be stored in a separate table, in the same table as the original word, or even as an indexed computed column or as part of an indexed view.
In our real-world project, we first implemented the full-text search component to get to market, then transitioned to a custom solution similar to the example outlined in this article. For our project, our custom solution improved query performance by 10 times over full-text search. Overall, the total cost of the custom implementation (development plus administration) was much cheaper than the ongoing administrative costs of the full-text search component. Although developing custom solutions can be more expensive than using an existing product, the reliability and scalability concerns associated with the full-text search component might make it worthwhile to seriously consider a home-grown implementation.