Tuesday, June 9, 2009

Regular Expressions... why you don't have 2 problems.

There is a long-standing saying in the programming world that goes something like this:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
There is a great article discussing this statement at Jeffrey Friedl's blog (all the way back in 2006!) where he traces the source of that statement and the reasoning behind it.

Getting back to the point at hand, regular expressions are a tool, as Jeffrey notes. They are a very useful tool that can create headaches for the simple reason that they are hard to read.

Consider the following regular expression:
s/\n\n(\n)+/\n\n/g;

In the world of regular expressions, it's a pretty simple regular expression. What does it do? It compresses extra newlines into two. So if you have ten newlines in a row, it will reduce it to two.
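To make that behavior concrete, here is a minimal sketch of the same substitution in Python rather than Perl. The function name squeeze_newlines and the sample strings are made up for illustration; only the pattern comes from the post.

```python
import re

# Python equivalent of the Perl substitution s/\n\n(\n)+/\n\n/g.
# squeeze_newlines is a hypothetical helper name, not from the original post.
def squeeze_newlines(text):
    # Any run of three or more consecutive "\n" characters becomes exactly two.
    return re.sub(r"\n\n(\n)+", "\n\n", text)

ten_newlines = "a" + "\n" * 10 + "b"
print(repr(squeeze_newlines(ten_newlines)))  # 'a\n\nb'
print(repr(squeeze_newlines("a\n\nb")))      # already two newlines, so unchanged
```

Note that the pattern only matches bare "\n" characters sitting directly next to each other.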

But it wasn't working correctly for me... and that is one of the reasons why people often repeat that dreadful statement. I now have a problem with my regular expression, as well as my original problem. 2 problems!

In the end, I turned to StackOverflow to see if I could get some help. Sure enough, some enterprising people quickly came up with a far better answer than I had:
s/(\r?\n[ \t]*){2,}/\n\n/g;

So what's the difference? It uses the {M,N} quantifier, where M is the minimum number of repetitions of the preceding element and N is the maximum. Leaving N out, as in {2,}, means there is no upper limit. The \r? matches zero or one carriage return (the same as {0,1}), so \r\n line endings are handled as well as plain \n. The [ \t]* matches any number of spaces or tabs after each newline. This cleans up any extraneous whitespace that might be hanging around our newlines.
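As a rough illustration of the difference, the sketch below runs both patterns in Python over two sample inputs. The post does not say which input tripped up the first pattern, so the samples (carriage returns in one, stray spaces and tabs in the other) are assumptions, and the helper names are hypothetical.

```python
import re

# Hypothetical helper names; only the two patterns come from the post.
def squeeze_original(text):
    # First attempt: only collapses bare "\n" characters that are adjacent.
    return re.sub(r"\n\n(\n)+", "\n\n", text)

def squeeze_improved(text):
    # Improved version: each repetition is an optional "\r", a "\n", and any
    # trailing spaces or tabs; two or more repetitions collapse to "\n\n".
    return re.sub(r"(\r?\n[ \t]*){2,}", "\n\n", text)

# Assumed sample inputs, chosen to show where the two patterns differ.
crlf_text = "a\r\n\r\n\r\n\r\nb"   # line endings with carriage returns
padded_text = "a\n  \n\t\n\nb"     # "blank" lines that contain spaces or tabs

for sample in (crlf_text, padded_text):
    print(repr(squeeze_original(sample)), "->", repr(squeeze_improved(sample)))
```

On both samples the first pattern leaves the text untouched, while the second reduces it to 'a\n\nb'. One side effect worth knowing: because [ \t]* also follows the last matched newline, any indentation at the start of the next non-blank line is removed too, and \r\n pairs inside a matched run come out as plain \n.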

Much simpler, and it does the job quite well. So now I have zero problems!

Regular expressions are certainly a very useful tool, but they can trip you up. A better quote would be "Regular Expressions, Use At Own Risk"... but then again, that can be said about any tool.
