
August 29, 2006

Battling with Complex Regular Expression Problem

I've had this problem on the burner (moving between front and back) for a few days now. Have been letting it simmer while working on other things to see if there's some brilliant light that goes on as to how to best solve it. Not yet, although I'm close.

We have history messages stored in a database that need to be translated into a variety of languages. They are a controlled set of statements that share some common sentence fragments but also have specific pieces of information. Something like:

Jim Johnson approved the invoice for new computers on 01/01/2006.
The schedule request was rejected by Fred Alewife.

The statements aren't going to be changed, so my job is to find a way to have the code grab the statements, figure out what pieces should be translated and what should not, and mark them accordingly to run through the translator.

The problem is that in statements like the ones above certain pieces of the data should be kept from translation, and I don't know for sure where they are in the sentence. The solution that has been proposed (and I'm attempting to implement) is to create a set of templates, based on the known formats for the messages. The template(s) would be used when looking at the sentence to determine which pieces of the sentence to translate and which to preserve.

So I created a simple template markup, which looks like this for the two statements above:

{{user}} approved the {{object}} for {{project}} on {{date}}
The {{object}} was {{status}} by {{user}}

And let's say that the final data that needs to be sent to the translator is something like this for the two statements:

{{Jim Johnson}} approved the {{invoice}} for {{new computers}} on {{01/01/2006}}
The {{schedule request}} was {{rejected}} by {{Fred Alewife}}

You get the idea: you find a matching template and then use the template to help designate words or phrases in the sentence to preserve them from being translated.

The part that tries to find a matching template is simple. I use a regular expression to replace each {{.+}} placeholder with a .+? and turn the template into something like:

$match_statement = '.+? approved the .+? for .+? on .+?';
if ($sentence =~ /$match_statement/) {
...
}
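
For illustration, here is a minimal sketch of how that template-to-pattern step might look. The helper name template_to_regex is made up for this example, and escaping the literal text with quotemeta is my assumption (it guards against a template containing regex metacharacters); it is not necessarily how the real code does it.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a template such as
# "{{user}} approved the {{object}}" into a regex source string.
sub template_to_regex {
    my ($template) = @_;
    my $regex = '';
    # The capturing group in split keeps the placeholders in the list.
    for my $piece (split /(\{\{.+?\}\})/, $template) {
        if ($piece =~ /^\{\{.+\}\}$/) {
            $regex .= '.+?';               # placeholder: match any text, lazily
        } else {
            $regex .= quotemeta($piece);   # literal text: escape metacharacters
        }
    }
    return $regex;
}

my $template        = '{{user}} approved the {{object}} for {{project}} on {{date}}';
my $match_statement = template_to_regex($template);
my $sentence        = 'Jim Johnson approved the invoice for new computers on 01/01/2006.';

print "template matches\n" if $sentence =~ /^$match_statement/;
```

Anchoring with ^ is also an assumption on my part; it keeps a template from matching somewhere in the middle of a longer statement.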

Once I have a template that matches the sentence it's a little more tricky. I can't do a piece-by-piece replacement because you have to consider the entire statement to figure out what words match up to the template.

So what I've resorted to is building a replacement regular expression. I have a piece of code that works through the template and finds the parts where word preservation is required. In the end I end up with an array of statements that I can join together to form something like:

$match = '(.+?)( approved the )(.+?)( for )(.+?)( on )(.+?)';
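
A rough sketch of building that grouped pattern from a template, under a few assumptions: the helper name template_to_groups is hypothetical, literal text is escaped with quotemeta, and the helper also returns the numbers of the groups that hold data to preserve, so later code doesn't have to guess which groups they are.

```perl
use strict;
use warnings;

# Hypothetical helper: build a fully grouped pattern from a template and
# report which capture groups (1-based) hold data to preserve.
sub template_to_groups {
    my ($template) = @_;
    my (@groups, @preserve);
    for my $piece (split /(\{\{.+?\}\})/, $template) {
        next if $piece eq '';    # split yields '' before a leading placeholder
        if ($piece =~ /^\{\{.+\}\}$/) {
            push @groups, '(.+?)';
            push @preserve, scalar @groups;    # this group's number
        } else {
            push @groups, '(' . quotemeta($piece) . ')';
        }
    }
    return (join('', @groups), \@preserve);
}

my ($match, $preserve) =
    template_to_groups('The {{object}} was {{status}} by {{user}}');
print "$match\n";
print "preserve groups: @$preserve\n";    # preserve groups: 2 4 6
```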

The idea was to create the match portion of the regular expression and then use it to do a replacement like this:

$sentence =~ s/$match/$1$2$3$4$5$6$7/;

Unfortunately I don't always know how many variables there will be to match, and more importantly, that doesn't put in the necessary markup, it needs to be more like:

$sentence =~ s/$match/{{$1}}$2{{$3}}$4{{$5}}$6{{$7}}/;

So it seems that the replacement part of the regular expression needs to be dynamically built because you don't know exactly where the preserved words will appear. That's where I'm at, attempting to build a variable to stick in the statement. Perl doesn't seem to like having a string variable filled with regex variables.
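
One possible way around the dynamic replacement string, offered as a sketch rather than a finished solution: a match in list context returns every capture no matter how many there are, so the marked-up sentence can be assembled with ordinary code instead of $1, $2 and so on. The function name mark_sentence is made up for this example. The ^ and $ anchors and the optional trailing period are assumptions added here; without an anchor at the end, the final lazy (.+?) would capture only a single character.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap the odd-numbered capture groups in {{ }}.
# Assumes the pattern alternates preserved and literal groups and
# starts with a preserved one, as in the "approved" example.
sub mark_sentence {
    my ($sentence, $match) = @_;
    # In list context the match returns all captures as a list.
    my @parts = $sentence =~ /^$match\.?$/ or return;
    my $marked = '';
    for my $i (0 .. $#parts) {
        # Array index 0 is group 1, so even indexes are preserved data.
        $marked .= $i % 2 == 0 ? '{{' . $parts[$i] . '}}' : $parts[$i];
    }
    return $marked;
}

my $match    = '(.+?)( approved the )(.+?)( for )(.+?)( on )(.+?)';
my $sentence = 'Jim Johnson approved the invoice for new computers on 01/01/2006.';
print mark_sentence($sentence, $match), "\n";
# {{Jim Johnson}} approved the {{invoice}} for {{new computers}} on {{01/01/2006}}
```

A template that starts with literal text (like the "was rejected by" one) would need to know which groups to wrap rather than assuming the odd ones, and the optional period means the sentence's final period is dropped from the output.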

While I'm close with this approach, I do continue to wonder if I should be looking at this from a completely different angle. Perhaps there's a pattern to solving this that I'm not seeing.

Putting it on the back burner again for a little bit to see if something emerges.

Posted by mike at August 29, 2006 7:43 AM