Thursday, June 18, 2009

Wikis, Wikitext, HTML and regular expressions

Once again, regular expressions come in to help me. We have recently had a bit of a push at our company to become friendlier to information sharing. In other words, a lot of people are tired of having to track down one or two subject matter experts to get the information they need to do their jobs. They want to be able to share that information more easily.

Well, there's a technology out there that allows for group-think... and it's called Wiki software. You may have heard of it from a small little site called Wikipedia... Or, if you haven't, here's the idea: information is put up on the internet where anyone and everyone can edit it. The idea is that the group will flesh out that information, adding, removing and editing the information until it reaches that ultimate goal of being perfect information. Of course, you will never get to the point of perfect information, but that's the end goal. And, as Wikipedia has shown, it works fairly well... as long as you have more people who are willing to do good than bad (that, and some good moderation tools to get rid of those who don't do good).

In a corporate setting, it seems like a Wiki would be ideal, wouldn't it? Most people probably don't want to do bad (especially if their jobs are on the line), so you get rid of the biggest pain for a public wiki: malicious users. Of course, you have a much bigger problem... buy-in.

The problem with a corporate wiki is that, if you happen to be using MediaWiki (and probably other types of Wiki software), you need to learn its Wiki markup language. In MediaWiki, it includes some great ideas like using two single quotes ('') to denote italics, and three to denote bold (''')... makes sense, doesn't it? I will give the language credit for lists, which are quite easy to use (* creates a list item, ** creates a sublist item, easy as pie!). But then they came up with a syntax for tables that is hard to remember and use. Lovely. So now you've got to teach people who don't even know what a markup language is how to use one to edit WikiText...

... or you can use the wonderfully named FCKeditor. Seriously, could they have not come up with a better name? Imagine going to your boss and saying we should use the FCKeditor. I'm sure that will go over well. One of the issues I have with open source software is that people who name things are idiots... It's like GIMP, which I understand is supposed to be a clever acronym (GNU Image Manipulation Program), but ends up sounding, well, gimp...

Anyways, back on track. Buy-in is difficult not only because of the steep learning curve, but also because it's a culture shift. You want people to record their thoughts and experiences in the Wiki. Some may feel threatened that their position as keeper of certain knowledge will be challenged; others may just not feel comfortable writing stuff on an internal wiki. Those can be tough barriers to adoption... but those aren't the point of this post.

Getting back to WikiText. We have a set of documentation, written (and I use the term written loosely) in HTML and stored on our intranet. It's pretty useful, but suffers from being both hard to update and often out of date. Fixing that was set as one of the original goals for our Wiki working group: move the documentation into the Wiki so we can provide the seeds of information sharing.

Well, that task turned out to be much tougher than I expected, partly because of the way WikiText is structured and partly because of the way HTML is structured. The real fun comes from the fact that some of the documentation was written using Word, some with FrontPage and very little by people who are familiar with HTML. What this means is that we have lots of extra markup that serves very little purpose and tends to confuse some of the tools out there that convert HTML to WikiText. I tried using the Perl module HTML::WikiConverter, but gave up when it removed too much formatting.

So, I turned to one of my favourite tools, Perl and regular expressions.

After several runs and reruns, and more runs, I got some code that I was happy with. Unlike WikiConverter, it doesn't convert HTML tables to Wiki tables, it leaves <pre> tags in, and it strips a lot of crap that we don't need. One of the things the script also does is remove a lot of characters that can cause issues.

You see, in our macro language, ! is used to start a macro. But in WikiText, a ! is used for tables. Similarly, * is used for comments... but that's also the WikiText definition for a list. Clearly a bunch of changes needed to be made. This is where things got a bit uncomfortable, because WikiText is implemented using regular expressions, and regular expressions have some issues when pointed at HTML.
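One way to keep those characters from being read as WikiText is to wrap them in MediaWiki's <nowiki> tags. This is only a sketch in Python (not the Perl script described in this post), and it assumes the troublesome ! and * sit at the start of a line; the function name escape_leading_markup is made up for illustration.

```python
import re

# Assumption: only a run of '*' or '!' at the very start of a line is a problem.
LEADING_SPECIAL = re.compile(r"^([*!]+)", re.MULTILINE)


def escape_leading_markup(text: str) -> str:
    """Wrap line-leading '*' and '!' in <nowiki> so MediaWiki shows them literally."""
    return LEADING_SPECIAL.sub(r"<nowiki>\1</nowiki>", text)


if __name__ == "__main__":
    sample = "* a macro comment\n!MACRO arg\nplain * text"
    print(escape_leading_markup(sample))
```

Characters in the middle of a line are left alone, since WikiText only treats * as a list marker at the start of a line.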

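Consider the following case, which is valid HTML: <div> <div> </div> </div>. When matching pairs with regular expressions, a non-greedy pattern like /<div>.*?<\/div>/ would match the first <div> with the first </div>, which isn't correct in terms of HTML. Tightening it to /<div>[^<]*<\/div>/ only ever matches the innermost pair, and a greedy .* simply grabs everything from the first <div> to the last </div>. That can make some issues tough to deal with.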
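Here is a quick way to see what different patterns actually do with that nested example. It is a small Python demonstration (Python's re module behaves like Perl's regular expressions for these patterns), not code from the conversion script.

```python
import re

html = "<div> <div> </div> </div>"

# Non-greedy: stops at the first </div>, pairing it with the outer <div>.
lazy = re.search(r"<div>.*?</div>", html)

# No '<' allowed between the tags: only the innermost pair can match.
inner = re.search(r"<div>[^<]*</div>", html)

# Greedy: runs from the first <div> to the last </div>.
greedy = re.search(r"<div>.*</div>", html)

if __name__ == "__main__":
    for name, match in (("lazy", lazy), ("inner", inner), ("greedy", greedy)):
        print(f"{name:6} -> {match.group()!r}")
```

None of the three can pair tags by nesting depth in general, which is the underlying limitation.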

Thankfully, most of the HTML in our documentation was very simple, so I was able to get away with using just regular expressions to replace HTML tags with WikiText.
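To give a flavour of what tag-by-tag replacement looks like, here is a minimal sketch in Python rather than Perl. The rule list is illustrative and covers only a handful of tags; the names RULES and html_to_wikitext are hypothetical, and it assumes simple, non-nested markup like the documentation described above.

```python
import re


def _heading(match: re.Match) -> str:
    # <h2>Title</h2> becomes "== Title ==" (one '=' per heading level).
    bars = "=" * int(match.group(1))
    return f"{bars} {match.group(2).strip()} {bars}\n"


# (pattern, replacement) pairs, applied in order.
RULES = [
    # Drop wrapper tags that Word and FrontPage like to scatter around.
    (re.compile(r"</?(?:span|font|o:p)\b[^>]*>", re.I), ""),
    (re.compile(r"</?(?:b|strong)\b[^>]*>", re.I), "'''"),
    (re.compile(r"</?(?:i|em)\b[^>]*>", re.I), "''"),
    (re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", re.I | re.S), _heading),
    (re.compile(r"<li\b[^>]*>", re.I), "* "),
    (re.compile(r"</li\s*>", re.I), "\n"),
    (re.compile(r"</?ul\b[^>]*>", re.I), ""),
]


def html_to_wikitext(html: str) -> str:
    """Apply each rule in turn; tags without a rule (such as <pre>) pass through."""
    text = html
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(html_to_wikitext("<h2>Setup</h2><ul><li><b>one</b></li><li>two</li></ul>"))
```

The \b after each tag name keeps <b> from also matching <br> or <body>, and <i> from matching <img>.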

Of course, after all was said and done, using an HTML parser might have worked better, because the documentation (of course) had a hodgepodge of bad tags throughout it: <b> tags with no closing tags, tags that are just odd (what is <os>?) and various other things. But, for the most part, it worked well. The documentation has now been imported into our Wiki.
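For comparison, this is roughly what the parser route could look like, using Python's standard html.parser module. It is a sketch under assumptions, not something I ran against the real documentation: it handles only bold and italics, drops every other tag while keeping its text, and closes any formatting the source left open. The names WikiTextBuilder and convert are made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: only bold and italic are converted in this sketch.
MARKUP = {"b": "'''", "strong": "'''", "i": "''", "em": "''"}


class WikiTextBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # formatting tags opened but not yet closed

    def handle_starttag(self, tag, attrs):
        if tag in MARKUP:
            self.out.append(MARKUP[tag])
            self.open_tags.append(tag)
        # Any other tag (<os>, <span>, ...) is dropped; its text is kept.

    def handle_endtag(self, tag):
        # Ignore stray closing tags that were never opened.
        if tag in MARKUP and tag in self.open_tags:
            self.out.append(MARKUP[tag])
            self.open_tags.remove(tag)

    def handle_data(self, data):
        self.out.append(data)

    def result(self) -> str:
        # Close whatever formatting the source never closed.
        closing = "".join(MARKUP[t] for t in reversed(self.open_tags))
        return "".join(self.out) + closing


def convert(html: str) -> str:
    builder = WikiTextBuilder()
    builder.feed(html)
    builder.close()
    return builder.result()


if __name__ == "__main__":
    print(convert("<b>bold <os>text</os>"))
```

Because the parser reports tags as events, tracking which ones are still open is a few lines of bookkeeping instead of another round of pattern tweaking.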

Now, to get people to buy in...
