« Solaris MySQL Install Package | Main | Going to MySQL 2004 »

February 6, 2004

UMLS Indexing Reduced to 16 Hours (from 5.8 days)

I've done a bit of work and dramatically decreased the time it takes us to create a MySQL-based DBIx::FullTextSearch index of the UMLS database.

I was checking on a previous attempt to index the UMLS and noticed it had stopped, hung on an insert from one of the processes. I checked the FTS parameters and apparently FTS created a "max_doc_id" of ~300,000, limiting how many items could be indexed. Obviously no good when the UMLS is almost 500,000 entries. I upped that to 1.5 million to start, but then decided to just take it out altogether.

I also noticed that the "word_length" was set to 30 characters, which means some of the long medical terminology like:

will not get indexed. I changed that to 60, the length of this word and the longest term that appears in the UMLS without hyphens, spaces or slashes.

Resetting the word length meant restarting the indexing, and losing the past 2 days of indexing. I took the opportunity to look closely at the Perl indexing script I wrote (almost three years ago), UMLS data, MySQL indexes and FTS parameters and tweaked a number of things. Figured it was worth a few hours of work if I could shave time. I ran 100 entries into the index after each tweak to judge if it helped or hurt.

I was able to dramatically improve the speed of the indexing. I'm running two scripts now, each doing ~250,000 definitions. Combined, in the first 10 minutes 5277 entries have been run through, 8.7/second. That's quite an improvement (again, thanks to Jeremy for the hints).

I'm glad we can now get the data in faster, especially since I learned the UMLS releases a new index every 6 months and we're *supposed* to be updating at each release.

Unfortunately the UMLS indexing was a distraction so now I've got to figure out how to get caught up on other things.

Posted by mike at February 6, 2004 2:45 PM