« MySQL Replication Up and Running (and I'm Impressed) | Main | Solaris MySQL Install Package »

February 4, 2004

5.8 Days Building UMLS Index with DBIx::FullTextSearch

Today I'm creating an indexed version of the UMLS (Unified Medical Language System) database for use in our system. It gives our users a uniform vocabulary of keywords (with definitions) to associate with a document. Documents have more credibility if the associated keywords have been chosen from the UMLS.

We use MySQL-backed DBIx::FullTextSearch for indexing our "stuff". It drives a variety of searches on our site. The module provides a nice interface to create, build, modify, delete and search against indexes.

In essense:
1) create an FTS index, specifying numerous parameters about how to index the data
2) build index by giving FTS the id of the item and the text to index
3) search the FTS index by search string, FTS returns either an array of ids or a hash with id key and frequency of string as value

I created a umls index and am indexing each concept with the concept_id and it's definition. The version of UMLS I have contains ~500,000 concepts (it has doubled since we installed it a few years back).

It takes FTS about 1 second to index each of the items, which means I'll have a complete index in 5.8 days. That means that when we get an updated version we'll be looking at 12 days to generate the index. Good thing it doesn't change that often.

Update: Since some of the wait time is for the CPU parsing through the text to index I broke the ULMS into three chunks and am running all concurrently. Each process continues to work at about one definition per second, which cuts the total time for indexing down to 46 hours.

I also dropped some of the MySQL indexes (at Jeremy's suggestion) which seem to speed up the process. My measurement is as far from scientific as you can get ("one one-thousand, two one-thousand, three one-thousand, four . . .")

Posted by mike at February 4, 2004 5:03 PM

Hard Drive Recovery Group offers hard disk data recovery services for RAID, laptops and servers. Complete clean room and hard drive repair service.

Trackback Pings

TrackBack URL for this entry:
http://mike.kruckenberg.com/mt/mt-tb.cgi/492

Comments

It'd probably be a lot faster to drop the indexes, add all the data, and then re-enable (or re-add) the indexes.

Posted by: Jeremy Zawodny at February 4, 2004 11:37 PM

Also, make sure you key_buffer is as large as you can afford on the box.

Posted by: Jeremy Zawodny at February 5, 2004 11:44 PM