February 25, 2004

Rackspace it Is

The hosting debate is over: we are keeping the server (for one particular client) at rackspace.

I was able to get rackspace to give us a break on the machine, I guess because we've been a customer of theirs for a few years now. The prices aren't listed anymore on their server options, but a week or so ago a comparable machine was around $850. We're getting an AMD 1.8GHz with 1G RAM, 2x40G and 30G monthly bandwidth for $275/month. The specs aren't terribly impressive, but the cost includes unlimited rackspace fanatical phone and ticket support, 1 hour hardware replacement (monthly fee waived if it takes longer), 100% network uptime (price break for any downtime), and Red Hat Enterprise Linux (with patches etc). Apparently they've won some awards for their service, and are 30+ months without any downtime. Not sure how impressive that is.

To the people paying for the site, the extra cost for the support is worth the money. They didn't bat an eye at $275/month, and were pretty happy that it didn't have to be $850.

Ahhh, I love a fresh server.

Posted by mike at 9:39 PM

February 24, 2004

Apache, mod_ssl, mod_perl Build Script

Today I started compiling and building a new package for Apache 1.3.29 (with mod_ssl and mod_perl). The method I had used previously was a good starting point, but I wanted to explore some additional configure arguments. The short of it is that I got really tired of backing out of the build each time I wanted to try something new: deleting the directory, re-untarring the source etc. It's not so bad when you're working with one package, but because of the way Apache, mod_ssl and mod_perl interact you have to do it for all three, each time. Very annoying.

Shell scripting to the rescue. It started with something really small, just to wipe the dirs and untar the files, but I figured why not let the script do everything. I hope it will save time in the long run when I need to build another version and I've forgotten all the details.

The script requires modification to set the location of the tarfiles, version of Apache, mod_ssl and mod_perl as well as paths for openssl, existing SSL certificate and Apache install dir.
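
The wipe-and-untar step that started the script can be sketched roughly as follows. This is a minimal illustration in Python rather than the shell script itself; the function name and the example paths are hypothetical, and it assumes each tarball unpacks into a single top-level directory.

```python
import shutil
import tarfile
from pathlib import Path


def fresh_sources(tarballs, build_dir):
    """Remove any previously extracted source tree and re-extract each tarball.

    Assumes the conventional layout of one top-level directory per tarball.
    Returns the list of freshly extracted source directories.
    """
    build_dir = Path(build_dir)
    build_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for tarball in map(Path, tarballs):
        with tarfile.open(tarball) as archive:
            # The first member name gives the top-level directory, e.g. "apache_1.3.29".
            top_level = archive.getnames()[0].split("/")[0]
            source_dir = build_dir / top_level
            if source_dir.exists():
                shutil.rmtree(source_dir)  # back out of the previous build attempt
            archive.extractall(build_dir)
        extracted.append(source_dir)
    return extracted


# Example call (hypothetical paths, one tarball per package):
# fresh_sources(["/src/apache.tar.gz", "/src/mod_ssl.tar.gz", "/src/mod_perl.tar.gz"], "/src/build")
```

Starting every attempt from pristine sources for all three packages is the point: a configure experiment never inherits leftovers from the previous one.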

Posted by mike at 6:26 PM

February 23, 2004

Spell Checking Web Forms

I've had a copy of the jspell pricing breakdown on my desk for over a year now, with periodic inquiries as to when we'll be able to offer spell checking of web forms to the users. I'm not terribly fond of Java or JavaScript-based spell-checkers, but when I run into spell checking on a site I like to try it and see how the interface is done.

I was posting to a Tufts forum the other day when I noticed an option to check the spelling of the post. I hadn't ever seen a server-side checker, so I was intrigued and impressed when the response came back from the server. The forum is driven by webcrossing, which is a commercial product. I can't find information on what they are using, so I guess it's developed in-house.

Wondering what else was available for server-side checking, I poked around a bit and found Lingua::Ispell, an interface to ispell. There's also Text::Aspell, an interface to aspell. According to this dude, aspell is much improved over ispell.

I also stumbled onto an implementation of Text::Aspell which might be worth a look. Maybe spell checking won't be too far off.

Posted by mike at 6:24 PM

February 21, 2004

The Time has Come to (finally) get a Digital Camera

It's finally time to get a digital camera. I have resisted for the longest time, I think primarily because I'm entrenched in the SLR world and can't let go. I did some photography, developing and printing (in the school's and a friend's darkroom) in high school and appreciate the mechanics and chemistry of exposing the silver halides to light through the camera lens and shutter. I'm not so attached to the development and printing process anymore. I suppose it's a bit like the person who continues to love records.

We're not getting rid of the SLR, but have decided that if we get a digital camera that is small and light it will fill a different role, being able to come along at times when the larger SLR isn't desirable. Small size is most important, image quality second.

I've done casual research over the past year just to be aware of options and prices. I recently went into a camera store to get my hands on some of them and came away thinking that the two finalists will be the Canon s400 and the Pentax Optio S4. So much about these cameras is the same . . . but there are a few differences.

Pros for the Optio S4:
- close focus (2.4 inches)

Cons for the Optio S4:
- longest shutter speed is 4 seconds
- smaller CCD (1/2.5")

Pros for the s400:
- longest shutter speed is 15 seconds
- larger CCD (1/1.8")

Cons for the s400:
- less focus range (closest is 4 inches)
- LCD is 1.5"

I notice that Pentax recently announced the Optio S4i, which is due in April. Hmmm . . . I doubt it will be worth the wait.

I'm hoping to have the camera ordered in a week so I'll have it for some trips coming up in March and April.

Update
In response to Pete's comment I found an excellent site with detailed timing on camera functions.

The s400 takes 2.1 seconds to boot up and be ready for a photo and 0.7 to 1.4 seconds to focus, depending on the position of the lens and whether 5-point autofocus is on (it is faster with single-point autofocus). Once the camera has focused it takes 0.1 seconds or less for the shutter to fire. The review hints at using the pre-shot focus by pressing the shutter release halfway down before you want to take the shot, shaving the focus time off the shot. The shot-to-shot time is 1.5 seconds; apparently Canon's buffering is done pretty well, so you can take another photo before the last one is written to disk. The time to get the camera on and a shot taken is at fastest 3.5 seconds. These tests were run while the camera was set to Super-Fine mode (2272x1704).

The Optio S takes 2.3 seconds to boot up and 0.5 to 1 second to focus (same dependencies as the s400). As with the s400, it takes 0.1 seconds to take the photo once in focus. The shot-to-shot time is 2.8 to 4.4 seconds depending on the image quality setting. A bit slower than the s400.

The reviews have a good selection of photos taken by each camera. It seems that the Canon takes a better picture. Better color and cleaner.

I notice now on the Canon site that there is an s410 and an s500 . . . there are some minor enhancements, but I can't see how they're worth the extra money (and there isn't a comprehensive review yet).

Posted by mike at 10:59 PM

February 19, 2004

mysql_config not Displaying Options Correctly

As indicated in my recent post about compiling MySQL I'm using a recommended set of C compiler flags:

-O3 -fno-omit-frame-pointer -mcpu=v8 -Wa,-xarch=v8plusa

MySQL compiles and runs fine, but when I attempted to compile DBD::mysql I got a complaint that the CFLAGS aren't right. Running mysql_config --cflags shows (notice the missing -Wa):

-O3 -fno-omit-frame-pointer -mcpu=v8 ,-xarch=v8plusa

If I override the Makefile cflags (which default to the output of mysql_config) and set them manually, DBD::mysql has no complaint.

So do I take the time to see what exactly is going on with mysql_config? Most likely not now . . . maybe the next time it comes up.

Update
I couldn't resist, went poking around. mysql_config has the cflags stored correctly in a variable $cflags. However, before printing it the string is passed through a regex which cleans it up. According to the comments in mysql_config:

# Remove some options that a client doesn't have to care about

The guilty part of the regex looks like

s;-W[-A-Za-z]*;;g

Now what? I'm not sure I understand the purpose of the removal of options.
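
To see why the flag gets mangled, here is the same substitution applied to the flags from this post, written with Python's re module instead of sed. The pattern matches "-W" followed by letters and hyphens, which catches warning flags such as -Wall but also the "-Wa" at the start of "-Wa,-xarch=v8plusa", stopping at the comma. The second function is only my guess at one possible safer approach (drop a flag only when the whole whitespace-separated token matches); it is not the fix MySQL committed.

```python
import re

# The flags as stored in mysql_config's $cflags (taken from the post).
cflags = "-O3 -fno-omit-frame-pointer -mcpu=v8 -Wa,-xarch=v8plusa"

# Python equivalent of the sed expression s;-W[-A-Za-z]*;;g
cleaned = re.sub(r"-W[-A-Za-z]*", "", cflags)
print(cleaned)  # the "-Wa" is gone but ",-xarch=v8plusa" is left behind


def strip_warning_flags(flags):
    """Hypothetical alternative: drop a token only if the entire token is a -W flag."""
    kept = [tok for tok in flags.split() if not re.fullmatch(r"-W[-A-Za-z]*", tok)]
    return " ".join(kept)


print(strip_warning_flags("-Wall " + cflags))  # -Wall removed, -Wa,-xarch=v8plusa kept
```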

Update
Submitted a bug to MySQL.

Update
I'm impressed: less than 6 hours after filing the bug I got a confirmation that the fix has been committed and will be in the next release (4.0.19).

Posted by mike at 9:17 PM

Build MySQL (4.0.18) from Source

The release of MySQL 4.0.18 gave me reason to revisit my recent packaging of MySQL (yeah, it's only been a few days since I built it). There were two things bothering me about the package.

  1. I failed to specify a user_id for the mysql user account created in the package, meaning it would be assigned the next default number, and I'd rather have a specific id assigned on each machine. Easy to fix in the preinstall script.
  2. I used the precompiled binary of MySQL for Solaris, which is not a bad thing, but I at least wanted to look into what I might gain from compiling it myself. I have compiled in the past, but I guess the MySQL folks got to me with this line:
    For maximum stability and performance, we recommend that you use the binaries we provide.

As expected, the MySQL docs have good information about configure options and how compiling and linking affect performance. The documentation also contains a listing of the compile options used when creating the prebuilt binaries.

It appears that I can add a few things to the compiler flags used to make the prebuilt binary to further optimize MySQL (specifically for UltraSPARC machines).

The compiler args and configure options I settled on after reading through the docs:

CC=gcc \
CFLAGS="-O3 -fno-omit-frame-pointer -mcpu=v8 -Wa,-xarch=v8plusa" \
CXX=gcc \
CXXFLAGS="-O3 -fno-omit-frame-pointer -felide-constructors -fno-exceptions -fno-rtti -mcpu=v8 -Wa,-xarch=v8plusa" \
./configure --prefix=/usr/local/mysql-4.0.18 \
  --with-extra-charsets=complex \
  --enable-thread-safe-client \
  --enable-local-infile \
  --enable-assembler \
  --with-named-curses-libs=-lcurses \
  --disable-shared \
  --with-mysqld-user=mysql \
  --without-isam \
  --with-named-z-libs=no

. . . waiting for make and make test to finish.

Posted by mike at 2:59 PM

February 17, 2004

Rackspace vs. ServerBeach

I have a server at rackspace, got it on some kind of deal a few years back where you could get an older machine for $150/month. A great deal at the time, but the machine can no longer handle the usage on the site.

The lowest priced machine on rackspace now is $270/month for an AMD 1.3 GHz with 512 MB RAM. We're hoping we can get into something with 1G of RAM, but to get that at rackspace is ~$1000/month. I have liked the rackspace fanatical support; they are good. You can always get a person on the phone to help solve an issue. They also have a good record for uptime. Is it worth it?

I've looked at serverbeach many times. For $150/month we could get a better machine. The uptime assurance is less (40 mins/month of downtime) and they only have email support. Moving from rackspace to serverbeach seems like a step into a high-risk situation, but it's hard to ignore the savings.

I'm attempting to work with rackspace on getting pricing on a machine configured for our needs . . . haven't heard back yet. Maybe there will be some middle ground where we can get just what we need for a reasonable price.

Update
Rackspace has agreed to give us a beefier machine for $275/month. It's not as astronomical as what was on their site, but still not cheap. The thing I love about rackspace is their technical competence and alertness. They don't ignore their machines, and are constantly pushing out information and packages about security issues. If they are working to keep me informed about changes I should be making on the machine, it's less for me to worry about.

I looked seriously at colocation, an alternative to having a dedicated machine. It appears that to pay for the space and bandwidth it's around the same price as a dedicated machine. You do get more control of the machine, but still are relying on the datacenter for connectivity. At this point, for this client, I think I'd rather have someone else responsible for the hardware and be watching out for the machine.

Posted by mike at 9:16 PM

February 13, 2004

Drawing the Line Between Usability and Security

The origins of our application at Tufts go back to a graduate student sitting at his home office hammering out a core set of libraries and interfaces. The core libraries are well thought out and have worked to drive the application and all its new development for over 5 years. The problem we face is where the line was drawn between security and usability.

A premise of the application is "every change in the database must be done using a user-established database handle." The idea is that anyone needing to make changes to the data obtains a MySQL-level account. We don't store that password anywhere other than in the mysql.user table, and there are no MySQL accounts with permission to make table changes other than those of specific users.

When a change is performed, the user enters their username and password, the application connects to MySQL with that username and runs the appropriate SQL. That's pretty good assurance that, barring sharing of passwords or a cracked system, the change was made by the user. This prevents the "user leaves the computer on and goes to lunch and someone else sits down and makes changes" scenario. Not a bad security measure to have in place.

However, there is a problem. Users are beyond annoyed at having to enter their password for every change, particularly when doing something like adjusting the order of images on a page, where the page has 100 images and with each adjustment the user is entering the password.

My conclusion is that the line between usability and security was drawn too closely on the side of security. We're realizing that we have to trust the initial authentication and the session, allowing the user to perform actions as themselves without continually proving it's really them. The change in the application is pretty simple: we create a set of handles and let the application handle the permissions. Developers' acceptance of less strict security measures has been much harder to come by.
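
A rough sketch of what "let the application handle the permissions" could look like: the session established at login carries the user's permissions, and each change is checked against them before it is sent through a shared handle. Every name here (Session, apply_change, the permission strings, the example SQL) is hypothetical and only illustrates the idea; it is not our application's code.

```python
class PermissionDenied(Exception):
    """Raised when the session's user may not perform the requested change."""


class Session:
    """What the application remembers after the initial authentication."""

    def __init__(self, username, permissions):
        self.username = username
        self.permissions = frozenset(permissions)


def apply_change(session, permission, sql, execute):
    """Run sql through a shared handle if the logged-in user holds the permission.

    `execute` stands in for a method on an application-owned database handle.
    """
    if permission not in session.permissions:
        raise PermissionDenied(f"{session.username} may not {permission}")
    return execute(sql)


# Example: a list's append method stands in for the shared handle.
statements = []
alice = Session("alice", {"reorder_images"})
apply_change(alice, "reorder_images", "UPDATE page_image SET position = 2 WHERE id = 7", statements.append)
```

The tradeoff is the one described above: the database no longer proves who made the change, so the application's session handling has to be trusted to do that.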

Posted by mike at 3:03 PM

February 9, 2004

OSCON Call for Proposals Ends

O'Reilly Open Source Convention. I may or may not have put in a proposal idea (have to wait and see).

Last year I presented and was both horrified (before the presentation) and delighted (after the presentation). It changed the tone of the conference; I didn't enjoy it as much as 2002 because I had this pending presentation looming. On the other hand it was fulfilling to share some of the stuff we'd been working on and see that there was some genuine interest.

Either way, I'm looking forward to OSCON 2004.

Posted by mike at 10:25 PM

February 8, 2004

How to Spend ThinkGeek Gift Certificate

Got a ThinkGeek gift certificate from Pete who says "the experience of having to choose something is as good as what you get". Seems to be true.

Am debating between a work toy or a shirt. The thing about the work toy is that it would be fun, could be enjoyed by others, and might even aid in stress release or creativity, but I'm sure I will play with it for a while and then never touch it again. I could probably get pretty good mileage from these marble magnets.

If I go the shirt route I'm looking for something subtle. I would get more use (as I tend to have a pretty short clothes rotation).

Maybe I should just get both. I see now that I can get the marble set elsewhere for less so maybe the shirt route is best for the certificate.

Will sleep on it . . .

Update: I decided to go with the geek work shirt, to be worn only when in a non-geek environment.

Posted by mike at 11:08 PM

February 6, 2004

Going to MySQL 2004

My wish has been granted.

Back in December I said that as I saw the MySQL 2004 conference proceedings unfold I'd strengthen my resolve to get there. That's not really what happened; the conference sessions aren't posted yet. The primary reason to push for it was having spent a good chunk of time these past two weeks on MySQL-related activities. I have spent more time in the MySQL docs this week than in a long time.

Jan 28 - Set up MySQL replication
Jan 29 - Attended presentation from Tufts USG going over highlights from their weeklong MySQL training
Jan 30 - Fiddled with replication performance
Feb 3 - Restored 2001 dump of MySQL data for user to poke through
Feb 4 - Set up UMLS indexing process
Feb 5 - Built MySQL package for Solaris
Feb 6 - Refactored UMLS indexing

I had once debated about whether to go to MySQL Admin training or the MySQL Conference. Having listened to the USG account of the training I decided the conference was probably more what we're looking for. We have several people on our team who have been using MySQL for years and are quite versed in the syntax of administration. Even if one of us isn't an expert at a certain task, the MySQL docs not only give good details on the syntax but in most cases give good reasons.

What we're interested in is how people are using MySQL, best practices, in-depth looks at other people's experiences and what's coming in the future. I'm hoping the tone and focus of the conference this year is similar to last year, because what I really want is a repeat of last year's conference that I didn't attend.

Posted by mike at 5:39 PM

UMLS Indexing Reduced to 16 Hours (from 5.8 days)

I've done a bit of work and dramatically decreased the time it takes us to create a MySQL-based DBIx::FullTextSearch index of the UMLS database.

I was checking on a previous attempt to index the UMLS and noticed it had stopped, hung on an insert from one of the processes. I checked the FTS parameters and apparently FTS created a "max_doc_id" of ~300,000, limiting how many items could be indexed. Obviously no good when the UMLS is almost 500,000 entries. I upped that to 1.5 million to start, but then decided to just take it out altogether.

I also noticed that the "word_length" was set to 30 characters, which means some of the long medical terminology like:

diamminecyclohexanoaminotrismethylenephosphonatoplatinum(II)

will not get indexed. I changed that to 60, the length of this word, which is the longest term that appears in the UMLS without hyphens, spaces or slashes.
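
For what it's worth, the length check is easy to reproduce. The snippet below is a small Python sketch, not the Perl I actually used; the helper name is made up, and splitting on whitespace, hyphens and slashes is my reading of "without hyphens, spaces or slashes".

```python
import re

term = "diamminecyclohexanoaminotrismethylenephosphonatoplatinum(II)"


def longest_token(texts):
    """Return the longest run of characters not broken by whitespace, hyphens or slashes."""
    tokens = (tok for text in texts for tok in re.split(r"[\s\-/]+", text) if tok)
    return max(tokens, key=len, default="")


print(len(term))        # length of the example term
print(len(term) <= 30)  # does it fit the old word_length of 30?
print(len(term) <= 60)  # does it fit the new word_length of 60?
```

Running longest_token over all the definitions is one way to pick a word_length that does not silently drop terms.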

Resetting the word length meant restarting the indexing, and losing the past 2 days of indexing. I took the opportunity to look closely at the Perl indexing script I wrote (almost three years ago), UMLS data, MySQL indexes and FTS parameters and tweaked a number of things. Figured it was worth a few hours of work if I could shave time. I ran 100 entries into the index after each tweak to judge if it helped or hurt.

I was able to dramatically improve the speed of the indexing. I'm running two scripts now, each doing ~250,000 definitions. Combined, in the first 10 minutes 5277 entries have been run through, 8.7/second. That's quite an improvement (again, thanks to Jeremy for the hints).
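
As a rough check on the 16-hour figure, here is the arithmetic in Python. The only inputs are numbers from this post (5277 entries in the first 10 minutes, and a UMLS of roughly 500,000 entries); the steady-rate assumption is mine.

```python
entries_done = 5277
elapsed_seconds = 10 * 60

# Combined throughput of the two scripts, in entries per second.
rate = entries_done / elapsed_seconds

# Approximate size of the UMLS; assumes the rate holds for the whole run.
total_entries = 500_000
estimated_hours = total_entries / rate / 3600

print(f"{rate:.2f} entries/second, about {estimated_hours:.1f} hours in total")
```

That works out to a little under 8.8 entries per second and just under 16 hours, which is where the number in the title comes from.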

I'm glad we can now get the data in faster, especially since I learned the UMLS releases a new index every 6 months and we're *supposed* to be updating at each release.

Unfortunately the UMLS indexing was a distraction so now I've got to figure out how to get caught up on other things.

Posted by mike at 2:45 PM

February 5, 2004

Solaris MySQL Install Package

Building a new Solaris package for MySQL (4.0.17) install today. A good exercise in determining where to draw the line between a package that helps by setting up configuration and one that does too much and makes more work.

1) create mysql user and mysql group
2) install binaries, benchmark tools, docs etc in /usr/local/mysql-4.0.17
3) symlink /usr/local/mysql to /usr/local/mysql-4.0.17
4) install /etc/my.cnf with commented configuration for 4 different environments (dev, test, prod-slave and prod)
5) install /etc/init.d/mysql (startup script)
6) symlink startup script into appropriate rc1.d and rc2.d directories
7) run configure script to uncomment appropriate options in my.cnf based on machine name (mapped to an environment)
8) create datadir (on hardware RAID array)
9) install mysql database files into datadir for core permissions (contains only 1 root user entry, connection only allowed from localhost and with correct password).
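
Step 7 is the least obvious one, so here is a minimal sketch of the idea in Python. The tagging convention ("#dev:", "#prod:" and so on in front of each commented option), the hostnames and the function names are all hypothetical; the real configure script and my.cnf layout may differ.

```python
# Hypothetical mapping of machine names to environments.
HOST_ENVIRONMENTS = {
    "db-dev1": "dev",
    "db-test1": "test",
    "db-slave1": "prod-slave",
    "db-prod1": "prod",
}

SAMPLE_MY_CNF = """\
[mysqld]
#prod:log-bin
#prod:server-id = 1
#prod-slave:server-id = 2
"""


def enable_environment(cnf_text, environment):
    """Uncomment the options tagged for `environment`; leave every other line alone."""
    prefix = f"#{environment}:"
    lines = []
    for line in cnf_text.splitlines():
        if line.startswith(prefix):
            line = line[len(prefix):].lstrip()
        lines.append(line)
    return "\n".join(lines) + "\n"


def configure_for_host(cnf_text, hostname):
    """Look up the environment for a machine name and enable its options."""
    return enable_environment(cnf_text, HOST_ENVIRONMENTS[hostname])


print(configure_for_host(SAMPLE_MY_CNF, "db-slave1"))
```

Keeping all four environments in one commented file means the package ships a single my.cnf, and the machine name decides which options go live.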

I initially had the package start up the database, but that seemed like overkill. Most likely I will want to drop in some files from a backup before getting up and running. Package removal will step backward, undoing each step.

Feels good to have a well-tested and complete package install, that I know will get everything set up properly (for us) on new or rebuilt machines.

Posted by mike at 5:38 PM

February 4, 2004

5.8 Days Building UMLS Index with DBIx::FullTextSearch

Today I'm creating an indexed version of the UMLS (Unified Medical Language System) database for use in our system. It gives our users a uniform vocabulary of keywords (with definitions) to associate with a document. Documents have more credibility if the associated keywords have been chosen from the UMLS.

We use MySQL-backed DBIx::FullTextSearch for indexing our "stuff". It drives a variety of searches on our site. The module provides a nice interface to create, build, modify, delete and search against indexes.

In essence:
1) create an FTS index, specifying numerous parameters about how to index the data
2) build index by giving FTS the id of the item and the text to index
3) search the FTS index by search string, FTS returns either an array of ids or a hash with id key and frequency of string as value
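
The three steps above can be illustrated with a toy inverted index. To be clear, this is a Python sketch of the concept only: the class and method names are made up and are not the DBIx::FullTextSearch API, and a real FTS index lives in MySQL tables rather than in memory.

```python
import re
from collections import Counter, defaultdict


class ToyIndex:
    """In-memory stand-in for a full-text index keyed by item id."""

    def __init__(self, word_length=30):
        # Step 1: create the index with its parameters (only one is modeled here).
        self.word_length = word_length
        self.postings = defaultdict(Counter)  # word -> {item id: frequency}

    def index_document(self, item_id, text):
        # Step 2: feed the index an id and the text to index.
        for word in re.findall(r"\w+", text.lower()):
            if len(word) <= self.word_length:  # over-long words are skipped
                self.postings[word][item_id] += 1

    def search(self, word):
        # Step 3a: return a list of matching ids.
        return sorted(self.postings.get(word.lower(), {}))

    def search_with_frequency(self, word):
        # Step 3b: return a mapping of id to how often the word occurs.
        return dict(self.postings.get(word.lower(), {}))


index = ToyIndex()
index.index_document("C1", "Heart attack: damage to the heart muscle")
index.index_document("C2", "Heart rate")
print(index.search("heart"))
print(index.search_with_frequency("heart"))
```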

I created a umls index and am indexing each concept with the concept_id and its definition. The version of UMLS I have contains ~500,000 concepts (it has doubled since we installed it a few years back).

It takes FTS about 1 second to index each of the items, which means I'll have a complete index in 5.8 days. That means that when we get an updated version we'll be looking at 12 days to generate the index. Good thing it doesn't change that often.

Update: Since some of the wait time is for the CPU parsing through the text to index, I broke the UMLS into three chunks and am running all three concurrently. Each process continues to work at about one definition per second, which cuts the total time for indexing down to 46 hours.
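
The 5.8-day and 46-hour figures are simple arithmetic, shown here in Python for anyone who wants to plug in their own numbers. It assumes roughly 500,000 concepts, one second per concept, and that the three processes run fully in parallel without slowing each other down.

```python
concepts = 500_000          # approximate size of this UMLS release
seconds_per_concept = 1.0   # observed indexing speed of a single process

# One process working through everything.
serial_days = concepts * seconds_per_concept / 86_400

# Three processes, each taking a third of the concepts at the same speed.
processes = 3
parallel_hours = concepts * seconds_per_concept / processes / 3_600

print(f"serial: {serial_days:.1f} days, three-way split: {parallel_hours:.0f} hours")
```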

I also dropped some of the MySQL indexes (at Jeremy's suggestion), which seems to speed up the process. My measurement is as far from scientific as you can get ("one one-thousand, two one-thousand, three one-thousand, four . . .")

Posted by mike at 5:03 PM