« MySQL Replication Up and Running (and I'm Impressed) | Main | Solaris MySQL Install Package »

February 4, 2004

5.8 Days Building UMLS Index with DBIx::FullTextSearch

Today I'm creating an indexed version of the UMLS (Unified Medical Language System) database for use in our system. It gives our users a uniform vocabulary of keywords (with definitions) to associate with a document. Documents have more credibility if the associated keywords have been chosen from the UMLS.

We use MySQL-backed DBIx::FullTextSearch for indexing our "stuff". It drives a variety of searches on our site. The module provides a nice interface to create, build, modify, delete and search against indexes.

In essense:
1) create an FTS index, specifying numerous parameters about how to index the data
2) build index by giving FTS the id of the item and the text to index
3) search the FTS index by search string, FTS returns either an array of ids or a hash with id key and frequency of string as value

I created a umls index and am indexing each concept with the concept_id and it's definition. The version of UMLS I have contains ~500,000 concepts (it has doubled since we installed it a few years back).

It takes FTS about 1 second to index each of the items, which means I'll have a complete index in 5.8 days. That means that when we get an updated version we'll be looking at 12 days to generate the index. Good thing it doesn't change that often.

Update: Since some of the wait time is for the CPU parsing through the text to index I broke the ULMS into three chunks and am running all concurrently. Each process continues to work at about one definition per second, which cuts the total time for indexing down to 46 hours.

I also dropped some of the MySQL indexes (at Jeremy's suggestion) which seem to speed up the process. My measurement is as far from scientific as you can get ("one one-thousand, two one-thousand, three one-thousand, four . . .")

Posted by mike at February 4, 2004 5:03 PM