Another feature that the full-text search component provides is stemming, in which slight variations of words could produce valid results. For example, if a user searches for the word running, he might find run, runs, or runner. The problem here is that the exact rules for stemming are an internal mystery of the full-text search component, and the service doesn't provide a method for viewing or modifying those rules. Consequently, a search for the word magical might retrieve magic; then again, it might not. You can discover the stemming rules only by experimenting. The lack of control over this feature meant that my development team ruled it out in our implementation.
Finally, the full-text search component requires the developer to preprocess the search string to remove noise words before using it in a query. You can see this need by sending a string of only one noise word to the full-text search component. The query returns the message
Server: Msg 7619, Level 16, State 1, Line 1
Execution of a full-text operation failed. A clause of the query contained only ignored words.
You get this message even if valid non-noise words are in the query string. A query string of the and ledger in which the is a noise word produced the message above on SQL Server 2000 Service Pack 2 (SP2) despite the fact that ledger isn't a noise word. This quirk causes the components that generate the SQL query (stored procedures or middleware) to use additional parsing and logic to avoid this error.
Doing It Yourself: Custom Programming
Because of the concerns our research raised, we decided to create a custom solution for our full-text searching needs. The design and implementation were surprisingly straightforward, and the administrative ease, scalability, and flexibility more than compensated for the additional development cost. (After designing our custom solution, we tested it on SQL Server 2000 SP2 and SQL Server 7.0 SP3.)
Any custom search solution requires three things: a place to physically store the full-text data, a mechanism to store the full-text data, and a way to query the data. In our solution, we fulfilled all three requirements through SQL Server functionality.
I can best explain our solution by giving an example. Suppose you have a table named Email that contains email messages. You create this table by using the definition that Listing 1 shows. The table has MessageID, subject, and body fields. The subject field must be full-text searchable. You store the full-text data by creating one table in a SQL Server database for each field that's full-text searchable. For this example, let's name the new table FT_EmailSubject. This table has a specific format: a row identifier that relates back to the original table, a word column, and a position identifier. You can create the FT_EmailSubject table by using the code in Listing 2.
Conceptually, a custom search index is a list of words that appear in a field and the position of each word in the string. Collecting this data, then, is simply a matter of enumerating through each row in Email, extracting all the words in the subject field of each row, and storing those words and their positions in the FT_EmailSubject table. To initially populate FT_EmailSubject, you could accomplish these steps by using a cursor; or you could parse the subject fields offline, store them in a file, then bulk-load the file into the production FT_EmailSubject table.
The method you choose for updating FT_EmailSubject depends on how data is inserted into the Email table and how often the table is updated. If you have a stored procedure to store a new row in Email, you could modify that stored procedure to simultaneously parse the subject and add the required rows to the FT_EmailSubject table. In our Web-site search implementation, we performed all inserts, updates, and deletes to the base table by using stored procedures. We simply modified each INSERT and UPDATE stored procedure to call a user-defined function (UDF) that returned the list of words in the searchable field. We then stored these words in the custom search table.
The structure of FT_EmailSubject provides a great deal of search flexibility. For example, the query in Listing 3 finds all messages that have the phrase last week in the subject. Similarly, the query in Listing 4 finds all messages that have the words last or week anywhere in the subject. The query in Listing 5 finds the word last appearing as many as five words before the word week. And, to return strings that contain week, weekend, weekday, or weekly, we simply use the LIKE clause with a trailing wildcard. This method is still efficient because the query optimizer can use the existing indexes on the Word column with a trailing wildcard. In effect, the LIKE clause with a trailing wildcard is a stemming query in which the stem word is week, as Listing 6 shows.
Prev. page
1
[2]
3
next page