SideBar    Improving Retrieval Performance
DOWNLOAD THE CODE:
Download the Code 27195.zip

Periodically, you need to search a database for specific words or phrases in strings. Users of a music Web site might need to search for CD titles; a bookselling site might offer searches by title or author. Queries that use the LIKE keyword are notoriously slow for this type of search because typical database indexes can't help in searching for words that appear in the middle of a string — such indexes help only in searching from the beginning of a string. To address the need for better text searching, Microsoft included the full-text search component in SQL Server 7.0. Built on the Microsoft Search Service, the full-text search component uses the same engine that Microsoft Index Server and Microsoft Exchange use to perform word and phrase searching. The full-text search component is integrated with SQL Server through a set of T-SQL extensions (CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE). DBAs can administer full-text search through Enterprise Manager and system stored procedures.

However, the full-text search component might not meet your needs as a search solution. My development team experienced some significant limitations with the component, so for a recent project, we decided to develop our own text-search solution. Let's examine the considerations that prompted my team to look at custom solutions; then, let's look at a home-grown alternative to the full-text search component and compare it objectively to a full-text search­based solution.

Microsoft Search vs. Custom Programming
On a recent project, I was the head of a team responsible for the design and implementation of word and phrase searching for a high-volume, music-related Web site. The Web site let users search for specific words or phrases in six database fields, including song names, artists, and CD titles. In general, the strings were less than 200 characters, and the number of strings ranged from 10,000 to 1.2 million, depending on the table and field. We had three possibilities for implementing the search functionality: Microsoft's full-text search component, a third-party search solution, or a custom search solution. We looked at the Yahoo! search engine and Inktomi's search engine as possible third-party solutions, but either one would have cost more than $100,000 in licenses plus the cost of additional hardware. In our case, justifying the cost of development and administration was easier than asking for money for capital purchases, so we decided to research the full-text search component and custom solutions. The project had the following basic requirements:

  • Scalability — Ideally, if the search features became unacceptably slow because of high volume, we wanted to be able to add another search server without much administration and with no development overhead.
  • Intuitive behavior — We wanted users to be able to search for single letters, numbers, or the first few characters of a word and get reasonable results.
  • Low administrative cost — We needed the search to perform well in production, be highly available, and be reliable with minimal administration.

Usually, I'm not one for reinventing the wheel. However, in researching the full-text search component, I came across several concerns that turned my thoughts towards a custom implementation. My first concern popped up when I read David Jones's SQL Server Magazine article "Build a Better Search Engine," July 2000, InstantDoc ID 8828. Jones says, "In a full-text catalog, you can run only eight searches at one time. If you begin a ninth search, Microsoft Search Service will hold it in a queue." Although I hadn't personally verified this restriction, the mention of such a limit caused me to worry about potential performance and scalability problems with the full-text search component.

More importantly, though, the full-text search component isn't integrated with SQL Server storage and consequently requires its own backup and restore strategy. The Microsoft Search Service stores its full-text indexes in a file directory and keeps the index files open while the service is running. This latter detail means that you have to stop the service to back up the indexes. From a DBA's perspective, this design is a major complication to a disaster-recovery plan. The DBA must back up and restore the full-text search files separately from the database they refer to — or rebuild them from scratch.

Also, scaling out the search functionality to multiple servers is difficult because you can't easily share or replicate the index files. And you can't use simple SQL Server replication to transfer the required search data to other servers. Instead, you must copy the full-text search files to multiple servers (a practice that Microsoft documentation doesn't describe) or you must replicate the original data and each server must perform its own index repopulations.

The full-text search component is also inflexible; it doesn't give developers many options for customization. For example, the component has specific rules for dealing with noise words. Noise words are words that SQL Server ignores in queries that use full-text search. The is a noise word; a search often ignores the because it's not useful in narrowing search results. You can find noise word files in the <SQL data directory>\MSSQL\FTDATA\SQLServer\Config folder. The English noise word file is noise.enu. The full-text search component uses the noise-word files for all full-text indexes and consequently requires that all noise words be identical for all tables and columns that are indexed. In our project, this requirement was a problem because we had to index a band named The The. We had to treat the as a valid word in all our indexes, which led to bigger indexes and slightly poorer performance.

   Prev. page   [1] 2 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

Would be interested in finding further details on this type of implementation.

rsheehan

Article Rating 4 out of 5

 
 

ADS BY GOOGLE