Thursday, June 18, 2009

Wikis, Wikitext, HTML and regular expressions

Once again, regular expressions come in to help me. We have recently had a bit of a push at our company to become friendlier about sharing information. In other words, a lot of people are tired of having to track down one or two subject matter experts to get the information they need to do their jobs. They want that information to be easier to share.

Well, there's a technology out there that allows for group-think... and it's called Wiki software. You may have heard of it from a little site called Wikipedia... Or, if you haven't, here's the idea: information is put up on the internet where anyone and everyone can edit it. The hope is that the group will flesh out that information, adding, removing and editing until it reaches the ultimate goal of perfect information. Of course, you will never actually get to perfect information, but that's the end goal. And, as Wikipedia has shown, it works fairly well... as long as you have more people willing to do good than harm (that, and some good moderation tools to deal with those who don't).

In a corporate setting, a Wiki seems like it would be ideal, wouldn't it? Most people probably don't want to do harm (especially if their jobs are on the line), so you get rid of the biggest pain for public wikis: malicious users. Of course, you have a much bigger problem... buy-in.

The problem with a corporate wiki is that, if you happen to be using MediaWiki (and probably other Wiki software too), you need to learn its markup language. MediaWiki's includes such great ideas as using two single quotes ('') to denote italics, and three (''') to denote bold... makes sense, doesn't it? I will grant the language that lists are quite easy to use (* creates a list item, ** creates a sublist item, easy as pie!). But then there's the table syntax, which is hard to remember and harder to use. Lovely. So now you've got to teach people who don't even know what a markup language is how to use one to edit WikiText...
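For anyone who hasn't seen it, here's a small sample of the MediaWiki markup in question, including that table syntax:

```
''italics''
'''bold'''
* a list item
** a sublist item
{| class="wikitable"
! Header 1 !! Header 2
|-
| cell 1 || cell 2
|}
```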

... or you can use the wonderfully named FckEditor. Seriously, could they not have come up with a better name? Imagine going to your boss and saying we should use the FckEditor. I'm sure that will go over well. One of the issues I have with open source software is that the people who name things are idiots... It's like GIMP, which I understand is supposed to be a clever acronym (GNU Image Manipulation Program), but ends up sounding, well, gimp...

Anyways, back on track. Buy-in is difficult not only because of the steep learning curve, but also because it's a culture shift. You want people to record their thoughts and experiences in the Wiki. Some may feel threatened that their position as keeper of certain knowledge will be challenged; others may just not feel comfortable writing things down on an internal wiki. Those can be tough barriers to adoption... but they aren't the point of this post.

Getting back to WikiText. We have a set of documentation, written (and I use the term loosely) in HTML and stored on our intranet. It's pretty useful, but it's hard to update and often out of date. So one of the original goals for our Wiki working group was to move that documentation into the Wiki and provide the seeds of information sharing.

Well, that task turned out to be much tougher than I expected, partly because of the way WikiText is structured and partly because of the way HTML is structured. The real fun comes from the fact that some of the documentation was written using Word, some with FrontPage, and very little of it by people familiar with HTML. What this means is that we have lots of extra markup that serves very little purpose and tends to confuse the tools out there that convert HTML to WikiText. I tried using the Perl WikiConverter module, but gave up when it removed too much formatting.

So, I turned to my favourite combination of tools: Perl and regular expressions.

After several runs, reruns, and more runs, I got some code that I was happy with. Unlike WikiConverter, it doesn't convert HTML tables to Wiki tables, it leaves <pre> tags in, and it strips a lot of crap that we don't need. The script also removes a number of characters that can cause issues.
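Our actual script is Perl and internal, but a minimal sketch of the regex-replacement approach looks like this in Python (the tag list and rules here are hypothetical examples, not our real rule set):

```python
import re

def html_to_wikitext(html):
    # Each rule is (pattern, replacement), applied in order.
    # Hypothetical rules for illustration only.
    rules = [
        (r"</?(?:i|em)>", "''"),                    # italics -> ''
        (r"</?(?:b|strong)>", "'''"),               # bold -> '''
        (r"<h2>(.*?)</h2>", r"== \1 =="),           # level-2 headings
        (r"\s*<li[^>]*>\s*", "\n* "),               # list items -> * lines
        (r"</li>", ""),                             # drop closing tags
        (r"</?(?:ul|ol|p|font|span)[^>]*>", ""),    # strip container cruft
    ]
    for pattern, replacement in rules:
        html = re.sub(pattern, replacement, html, flags=re.IGNORECASE | re.DOTALL)
    return html

print(html_to_wikitext("<b>bold</b> and <i>italics</i>"))
# '''bold''' and ''italics''
```

The nice part of this approach is that tags you don't write a rule for (like <pre>) pass through untouched.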

You see, in our macro language, ! is used to start a macro. But in WikiText, ! is used for tables. Similarly, * is used for comments... but that's also WikiText's marker for a list item. Clearly a bunch of changes needed to be made. This is where things got a bit uncomfortable, because WikiText is implemented using regular expressions, and regular expressions, when pitted against HTML, suffer from some fundamental issues.

Consider the following case, which is valid HTML: <div> <div> </div> </div>. When matching pairs with a regular expression like /<div>.*?<\/div>/, the first <div> gets paired with the first </div>, which is incorrect in terms of HTML nesting. That can make some issues tough to deal with.
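A quick demonstration in Python (same idea as the Perl pattern) shows the mis-pairing:

```python
import re

html = "<div> outer <div> inner </div> </div>"

# The non-greedy pattern stops at the FIRST closing tag it finds,
# pairing the outer <div> with the inner </div>.
m = re.search(r"<div>.*?</div>", html)
print(m.group(0))  # <div> outer <div> inner </div>  -- the outer close is orphaned
```

Matching properly nested tags requires counting open/close pairs, which classic regular expressions can't do on their own.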

Thankfully, most of the HTML in our documentation was very simple, so I was able to get away with using just regular expressions to replace HTML tags with WikiText.

Of course, after all was said and done, an HTML parser might have worked better, because the documentation (of course) had a hodgepodge of bad tags throughout it: <b> tags with no closing tags, tags that are just odd (what is <os>?) and various other oddities. But, for the most part, the script worked well. The documentation has now been imported into our Wiki.
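For comparison, here's what the parser route can look like. Python's standard-library html.parser (Perl's HTML::Parser fills the same role) shrugs off exactly this kind of malformed markup; the sample input below is made up for illustration:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects start tags; tolerates unclosed and unknown tags."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

p = TagCollector()
# An unclosed <b> and a mystery <os> tag don't faze the parser
p.feed("<b>never closed <os>what is this?</os>")
print(p.tags)  # ['b', 'os']
```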

Now, to get people to buy in...

Tuesday, June 16, 2009

Balsillie, Bettman, the NHL and Coyotes

By some reports, Balsillie is done: the judge ruled against him, it's over, no team in Hamilton. Now, take a look at the MakeItSeven website. Well, that clears things up... right?

While I think a team in the Hamilton area would do well, it's hard to see Balsillie getting his way. Unfortunately, that means Bettman might get his. I realize that Bettman has done some good things for the game as a business, but as a sport, he's probably done more to hurt it than any previous commissioner. The game is all about the money now. The NHL worries about the league's finances and keeping teams happy; there seems to be much less care for the fans. And if I hear Bettman say one more time that they don't run out on cities, I may hunt him down and beat him. Can you believe that ignorance?

Let's look at this a bit more objectively, if we can.

Hamilton, population 692,000 (metropolitan). Not a large city in comparison to Phoenix's 4.2 million (metropolitan). But now let's compare, as regions, Southwestern Ontario with Arizona. Arizona's population is 6.5 million people. Southwestern Ontario? If you exclude Toronto, roughly 3.2 million. Add Toronto, and we are close to 10 million (roughly 90% of Ontario's population). So the markets are close to the same size when you consider that Southwestern Ontario would also include the Leafs (who, it must be said, sell an ungodly number of tickets at expensive prices and have diehard fans... kind of like the Yankees, but less successful on the field of sport).

So, population-wise, it's pretty even... right? Except that Phoenix can't maintain the minimum number of ticket sales to qualify for money from the rest of the NHL's teams. 6.5 million people can't fill an arena a few times over the year? Well, I guess part of the problem is playing ice hockey in the desert; it's not really a sport that fits the region's weather. Some mention that Buffalo might also be an impediment, but most of the Canadians who go to Buffalo games are from the region right around Buffalo. I don't see that changing much, though it's hard to predict what would happen to the Sabres. I think the Sabres have far more to worry about than the Leafs in this regard.

Team-wise? Losing $300 million (the number thrown around by current owner Moyes) in 10 years is pretty impressive... or ridiculous. Yes, I realize that the city of Glendale helped build an arena for the team and (apparently) has a lease running a ridiculous 25 years... but I'm almost wondering if the team wouldn't still make more money moving to Hamilton and paying out the lease... and then subleasing the arena to whomever they want.

So, there are some good arguments for bringing the team to Hamilton. Canada has always been a haven for ice hockey, and Southwestern Ontario, I feel, could easily support another team.

Will it happen? I highly doubt it. Bettman is, if nothing else, a control freak, and I doubt he would like having someone like Balsillie, who's a bit of a maverick, on the league's board of governors. Still, I can't see why the league's owners wouldn't want a man with deep pockets ($2 billion+ net worth) and a passionate love of hockey in their group.

Maybe it's because too many of them don't care about the sport, just the business?

Tuesday, June 9, 2009

Regular Expressions... why you don't have 2 problems.

There is a long standing statement in the programming world that goes something like this:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
There is a great article discussing this statement on Jeffrey Friedl's blog (all the way back in 2006!), where he traces the source of the statement and the reasoning behind it.

Getting back to the point at hand, regular expressions are a tool, as Jeffrey notes. They are a very useful tool that can create headaches for the simple reason that they are hard to read.

Consider the following regular expression:
s/\n\n(\n)+/\n\n/g;

In the world of regular expressions, it's a pretty simple one. What does it do? It collapses runs of three or more newlines down to two. So if you have ten newlines in a row, it will reduce them to two.

But it wasn't working correctly for me... and that is one of the reasons why people so often repeat that dreadful statement. I now had a problem with my regular expression, as well as my original problem. Two problems!

In the end, I turned to StackOverflow to see if I could get some help. Sure enough, some enterprising people quickly came up with a far better answer than I had:
s/(\r?\n[ \t]*){2,}/\n\n/g;

So what's the difference? It uses the regex quantifier {M,N}, where M specifies the minimum number of matches of the preceding element and N the maximum ({2,} means two or more, with no upper bound, in our case). The \r? matches zero or one carriage return characters, and [ \t]* matches any number of spaces and tabs. Together, these clean up any extraneous whitespace that might be hanging around our newlines, including the Windows-style \r\n line endings that the original pattern never accounted for.
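The difference is easy to see in Python; the input below, with Windows line endings and stray whitespace on the blank lines, is my guess at the kind of text that was breaking the original pattern:

```python
import re

text = "para one\n\r\n \t\r\npara two"  # blank lines polluted with \r, spaces and tabs

# The original pattern needs three bare \n in a row, so it matches nothing here
assert re.sub(r"\n\n(\n)+", "\n\n", text) == text

# The improved pattern treats each \r?\n plus trailing spaces/tabs as one newline,
# and collapses two or more of them into a single blank line
cleaned = re.sub(r"(\r?\n[ \t]*){2,}", "\n\n", text)
print(repr(cleaned))  # 'para one\n\npara two'
```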

Much simpler, and it does the job quite well. So now I have zero problems!

Regular expressions are certainly a very useful tool, but they can trip you up. A better quote would be "Regular Expressions: Use At Your Own Risk"... but then again, that can be said about any tool.