Wednesday, November 25, 2009

Javascript, Arrays and more than 1 million HTML tags

So, we have a set of code that does this:
var classElements = [];    // results collected here
var els = node.getElementsByTagName("*");
var elsLen = els.length;
var pattern = new RegExp('(^|\\s)' + searchClass + '(\\s|$)');
for (var i = 0, j = 0; i < elsLen; i++) {
    if (pattern.test(els[i].className)) {
        classElements[j] = els[i];
        j++;
    }
}
It's a pretty simple for loop that walks every element on the page and collects the ones belonging to a given class. We use it in our printing code to help work around some bad HTML in reports (report HTML can contain client-generated content... it gets ugly fast).

We recently had a support call come in reporting an error on very large files. The first file had 57MB of HTML. Yes, 57 megabytes. Sigh. Looking at the source code, we found 10,425 occurrences of </div> without a corresponding opening tag. Removing those made the code work, so I thought we had solved the issue: it was just bad HTML.

The next day it came back again, this time on a 75MB report, after they had fixed their templates so there were no closing divs without matching opening divs. What?

The error being returned was the fantastic "null object" error, squarely on the "classElements[j] = els[i];" line. Say what?

We use the Web Browser Control (IE) in our application, so I opened it up in IE to see the issue. Some further investigation led me to the fact that it would fail when accessing the element at array position 1,000,000! Checking the element in position 999,999 was fine.

So I reworked the code to look like this:
for (var key in els) {
    // index into the collection via the enumerated key, skipping anything
    // (like length) that isn't an element with a className
    var el = els[key];
    if (el && el.className && pattern.test(el.className)) {
        classElements[j] = el;
        j++;
    }
}
And now it works. Apparently indirect references past 1 million work in IE.

We did suggest they clean up their HTML (tables inside of tables inside of tables? Must have been FrontPage) to help reduce the size of the reports. The 75MB report had 1.2 million tag elements!

Thursday, June 18, 2009

Wikis, Wikitext, HTML and regular expressions

Once again, regular expressions come in to help me. We have recently had a bit of a push at our company to become more information-sharing friendly. In other words, a lot of people are tired of trying to talk to one or two subject matter experts to get the information they require to do their jobs. They want to be able to share that information more easily.

Well, there's a technology out there that allows for group-think... and it's called Wiki software. You may have heard of it from a small little site called Wikipedia... Or, if you haven't, here's the idea: information is put up on the internet where anyone and everyone can edit it. The idea is that the group will flesh out that information, adding, removing and editing the information until it reaches that ultimate goal of being perfect information. Of course, you will never get to the point of perfect information, but that's the end goal. And, as Wikipedia has shown, it works fairly well... as long as you have more people who are willing to do good than bad (that, and some good moderation tools to get rid of those who don't do good).

In a corporate setting, it seems like a Wiki would be ideal, wouldn't it? Most people probably don't want to do bad (especially if their jobs are on the line), and therefore you get rid of the biggest pain for a public wiki: malicious users. Of course, you have a much bigger problem... buy-in.

The problem with a corporate wiki is that, if you happen to be using MediaWiki (and probably other types of Wiki software), you need to learn its Wiki markup language. MediaWiki's markup includes some great ideas like using two single quotes ('') to denote italics and three (''') to denote bold... makes sense, doesn't it? I will grant the language that lists are quite easy to use (* creates a list item, ** creates a sublist item, easy as pie!). But then there's the table syntax, which is hard to remember and use. Lovely. So now you've got to teach people who don't even know what a markup language is how to use one to edit WikiText...

... or you can use the wonderfully named FckEditor. Seriously, could they have not come up with a better name? Imagine going to your boss and saying we should use the FckEditor. I'm sure that will go over well. One of the issues I have with open source software is that people who name things are idiots... It's like GIMP, which I understand is supposed to be a clever acronym (GNU Image Manipulation Program), but ends up sounding, well, gimp...

Anyways, back on track. Buy-in is difficult not only because of the learning curve, but also because it's a culture shift. You want people to record their thoughts and experiences in the Wiki. Some may feel threatened that their position as keeper of certain knowledge will be challenged; others may just not feel comfortable writing stuff on an internal wiki. Those can be tough barriers to adoption... but they aren't the point of this post.

Getting back to WikiText. We have a set of documentation, written (and I use the term written loosely) in HTML and stored on our intranet. It's pretty useful, but suffers from being both hard to update and often out of date. Moving that documentation into the Wiki was set as one of the original goals for our Wiki working group, so we could provide the seeds of information sharing.

Well, that task turned out to be much tougher than I expected, partly because of the way WikiText is structured and partly because of the way HTML is structured. The larger problem is that some of the documentation was written using Word, some with FrontPage, and very little of it by people who are familiar with HTML. What this means is that we have lots of extra markup that serves very little purpose and tends to confuse some of the tools out there that convert HTML to WikiText. I tried using the Perl WikiConverter module, but gave up when it removed too much formatting.

So, I turned to a couple of my favourite tools: Perl and regular expressions.

After several runs and reruns, and more runs, I got some code that I was happy with. Unlike WikiConverter, it doesn't convert HTML tables to Wiki tables, it leaves <pre> tags in, and it strips a lot of crap that we don't need. One of the things the script also does is remove a lot of characters that can cause issues.

You see, in our macro language, ! is used to start a macro. But in WikiText, ! is used for tables. Similarly, * is used for comments... but that's also how WikiText defines a list item. Clearly a bunch of changes needed to be made. This is where things got a bit uncomfortable, because I was doing the conversion with regular expressions, which have some well-known issues when pointed at HTML.

Consider the following case, which is valid HTML: <div> <div> </div> </div>. When matching pairs with a regular expression like /<div>.*?<\/div>/, the first <div> gets paired with the first </div>, which isn't correct in terms of HTML nesting. That can make some issues tough to deal with.
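A quick Perl illustration of the mis-pairing (a made-up snippet, not from the conversion script):

my $html = '<div> outer <div> inner </div> </div>';
# the lazy match pairs the outer <div> with the inner </div>
if ($html =~ /<div>(.*?)<\/div>/s) {
    print "matched: '$1'\n";    # prints: matched: ' outer <div> inner '
}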

Thankfully, most of the HTML in our documentation was very simple, so I was able to get away with using just regular expressions to replace HTML tags with WikiText.
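To give a flavour of the approach, here is a minimal sketch of the kinds of substitutions involved. This is not the actual script (which handled far more tags and far more cleanup); the tag list and the <nowiki> escaping below are simplified, illustrative choices:

#!/usr/bin/perl
use strict;
use warnings;

local $/;       # slurp the whole file
my $html = <>;

# Escape line-leading characters that WikiText would misinterpret (our macro
# examples are full of ! and *) before generating any wiki markup of our own.
$html =~ s/^([*!#])/<nowiki>$1<\/nowiki>/gm;

$html =~ s/<\/?(?:font|span|o:p)[^>]*>//gi;         # strip Word/FrontPage junk
$html =~ s/<b>(.*?)<\/b>/'''$1'''/gis;              # bold    -> '''bold'''
$html =~ s/<i>(.*?)<\/i>/''$1''/gis;                # italics -> ''italics''
$html =~ s/<li[^>]*>\s*(.*?)\s*<\/li>/* $1\n/gis;   # list items -> * item
$html =~ s/<\/?[uo]l[^>]*>\s*//gi;                  # drop the list wrappers
$html =~ s/<br\s*\/?>/\n/gi;                        # line breaks
$html =~ s/<p[^>]*>//gi;
$html =~ s/<\/p>/\n\n/gi;                           # close of paragraph -> blank line

print $html;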

Of course, after all was said and done, using an HTML parser might have worked better, because the documentation (of course) had a hodgepodge of bad tags throughout it: <b> tags with no closing tags, tags that are just odd (what is <os>?) and various other things. But, for the most part, it worked well. The documentation has now been imported into our Wiki.

Now, to get people to buy in...

Tuesday, June 16, 2009

Balsillie, Bettman, the NHL and Coyotes

By some reports, Balsillie is done, the judge ruled against him, it's over. No team in Hamilton. Now, take a look at the MakeItSeven website. Well, that clears things up... right?

While I think a team in the Hamilton area would do well, it's hard to see Balsillie getting his way. Unfortunately, that means Bettman might get his way. I realize that Bettman has done some good things for the game as a business, but as a sport, he's probably done more to hurt it than any previous commissioner. The game is all about the money now. The NHL worries about the league's finances and keeping teams happy. There seems to be much less care for the fans. And if I hear Bettman say one more time that they don't run out on cities, I may hunt him down and beat him. Can you believe that ignorance?

Let's look at this a bit more objectively, if we can.

Hamilton, population 692,000 (metropolitan). Not a large city in comparison to Phoenix's 4.2 million (metropolitan). But now let's consider, as a comparison, southwestern Ontario versus Arizona. Arizona's population is 6.5 million people. Southwestern Ontario? If you exclude Toronto, roughly 3.2 million. Add Toronto, and we are close to hitting 10 million (aka 90% of Ontario's population). So the markets are close to the same size when you consider that southwestern Ontario would also include the Leafs (which, it must be said, sell an ungodly number of tickets at expensive prices and have diehard fans... kind of like the Yankees, but less successful on the field of sport).

So population-wise, it's pretty even... right? Except that Phoenix can't maintain the minimum number of ticket sales needed to qualify for money from the rest of the NHL's teams. 6.5 million people can't fill an arena a few times over the year? Well, I guess part of the problem is playing ice hockey in the desert. It's not really a sport that fits the region's weather. Some mention that Buffalo also might be an impediment, but most of the Canadians who go to Buffalo games are from the region right around Buffalo. I don't see that changing much, though it's hard to predict what would happen to the Sabres. I think the Sabres have far more to worry about than the Leafs do in this regard.

Team-wise? Losing $300 million (the number thrown around by current owner Moyes) in 10 years is pretty impressive... or ridiculous. Yes, I realize that the city of Glendale helped build an arena for the team and (apparently) has a lease for a ridiculous 25 years... but I'm almost wondering if the team still wouldn't make more money moving to Hamilton and paying out the lease... and then subleasing the arena to whomever they want.

So, there are some good arguments for bringing the team to Hamilton. Canada has always been a haven for ice hockey. Southwestern Ontario, I feel, could easily support another team in the region.

Will it happen? I highly doubt it. Bettman is, if nothing else, a control freak, and I doubt he would like having someone like Balsillie, who's a bit of a maverick, on the league's board of governors. However, I can't see why the league's owners wouldn't want a man with deep pockets ($2 billion+ net worth) and a passionate love of hockey in their group.

Maybe it's because too many of them don't care about the sport, just the business?

Tuesday, June 9, 2009

Regular Expressions... why you don't have 2 problems.

There is a long-standing statement in the programming world that goes something like this:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
There is a great article discussing this statement at Jeffrey Friedl's blog (all the way back in 2006!) where he traces the source of that statement and the reasoning behind it.

Getting back to the point at hand, regular expressions are a tool, as Jeffrey notes. They are a very useful tool that can create headaches for the simple reason that they are hard to read.

Consider the following regular expression:
s/\n\n(\n)+/\n\n/g;

In the world of regular expressions, it's a pretty simple one. What does it do? It compresses runs of newlines down to two. So if you have ten newlines in a row, it will reduce them to two.
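On clean input it does exactly that (a tiny, made-up check):

my $s = "a" . ("\n" x 10) . "b";
$s =~ s/\n\n(\n)+/\n\n/g;
print $s;    # prints a, a blank line, then b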

But it wasn't working correctly for me... and that is one of the reasons why people often repeat that dreadful statement. I now had a problem with my regular expression as well as my original problem. Two problems!

In the end, I turned to StackOverflow to see if I could get some help. Sure enough, some enterprising people quickly came up with a far better answer than I had:
s/(\r?\n[ \t]*){2,}/\n\n/g;

So what's the difference? It uses the {M,N} quantifier, where M specifies the minimum number of matches of the preceding group and N the maximum (omitted here, so unbounded). The \r? matches zero or one carriage returns, and the [ \t]* matches any number of spaces or tabs. That is presumably why my original pattern failed: the blank lines in my files weren't actually empty, they contained carriage returns and stray spaces or tabs, so \n\n(\n)+ never matched. The new pattern cleans up any extraneous whitespace that might be hanging around our newlines.
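A quick way to see the difference (a minimal, made-up test case, not from the actual documents):

#!/usr/bin/perl
use strict;
use warnings;

# "blank" lines that actually contain carriage returns and stray whitespace
my $text = "line one\r\n   \r\n\t\r\nline two\n";

(my $naive  = $text) =~ s/\n\n(\n)+/\n\n/g;          # never matches this input
(my $better = $text) =~ s/(\r?\n[ \t]*){2,}/\n\n/g;  # collapses the whole run

print "naive : ", ($naive eq $text ? "unchanged" : "changed"), "\n";
print "better: $better";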

Much simpler, and it does the job quite well. So now I have zero problems!

Regular expressions are certainly a very useful tool, but they can have trip-ups. A better quote would be "Regular Expressions: Use At Own Risk"... but then again, that can be said about any tool.

Friday, May 22, 2009

Speed Improvements... and Profiling.

It's incredibly important to profile before you try to do any optimization. The reasons for this should be obvious, but I've caught myself trying to optimize without profiling first. Bad idea. Sometimes you make things worse. It's also important to profile after every single change and not change multiple things at once. That lets you quickly determine whether a change was good or bad.

MapX and RowValues

MapX is a third-party tool that displays geographical data. It is available from MapInfo (now owned by Pitney Bowes) and is quite useful. It also hasn't been updated in about 7 years, since they have a new product called MapXtreme (what's so extreme about it?). Which means we are once again stuck trying to pull performance out of a legacy component that, well, isn't going to get any faster on its own.

I made an interesting discovery a few weeks ago regarding accessing datasets. A dataset, in MapX terms, is a set of data associated with geographical data. While the map data (lat,long) is stored in a map file, the data is stored in a proprietary data format with a link between the data and the map object. There are times when we need to pull lots of data to display it to the user (an info tool, for example, so you can query what an object is on the map).

To access the data, you can refer to the dataset directly using the object's feature key. Something like:
strValue = dataset(featureKey, field)
This works, but once you need to query more than one field for a row (or feature key), it makes more sense to use a RowValues object, which is easy enough:
Set rowValues = dataset.RowValues(featureKey)
strValue = rowValues(field).Value
And there we go. For one value, the timing is similar. Once you request several fields from the same row, however, it gets much faster: the more fields you read from a row, the bigger the speed increase (economies of scale!).

Visual Basic 6 and Combo Boxes

Another fun thing discovered this week is that Visual Basic 6 is slow when adding lots of items to a combo box. How much is lots? Try 12,000 unique strings! So, off we go trying to find a better way. And, sure enough, there is a fairly simple way, as long as you don't mind getting your hands dirty with the Windows messaging API.

So we go from:
combobox.AddItem "this string"
To:
SendMessage combobox.hWnd, CB_ADDSTRING, 0, ByVal "this string"
(This assumes the usual user32 Declare for SendMessage and a Const CB_ADDSTRING = &H143.) Simple enough, for a 50% time reduction on 12,000 items... from 1 second to half a second! Very good stuff.

MapX and Display

Sometimes measured speed doesn't mean that much to a user if it appears the system is hung and not doing anything at all.

Interestingly enough, MapX's map control has a setting called RedrawInterval. This controls how long the control waits between map redraws while it is generating the data to be displayed. A lower setting means the map will be redrawn more frequently; a higher value means the map won't update until it is done or the time has run out. When viewing a small section of a map, this doesn't make much difference.

What we discovered, though, is that when a user is zoomed out to a 10 km radius worth of mapping data, the map takes a while to redisplay for big cities. This makes sense, since there is a lot of road data to display for a large city... but, as it turns out, our RedrawInterval was set to 300, which translates to 3 seconds (who works in 1/100ths of a second?).

Setting it to the minimum of 10 (so 0.1 seconds) actually results in a perceived speed increase. Why? As it turns out, MapX draws each layer in sequence internally before displaying them. Since we have about a dozen layers (County, City, Railroads, Roads, Highways, etc.), this means that updates to the screen start happening sooner. Overall, the time taken to completely redraw the screen is longer (not by much: for larger cities it is about 2-3% slower on average, and it's negligible for smaller cities and towns), but, to the user, it appears to have drawn faster since they can see updates happening.

Wednesday, May 20, 2009

Printing from a Service...

We've been working on adding the ability to create PDF files on the server side of our application. This sounds much easier than it is. Our server base is IBM UniVerse (how many people can say they are familiar with that?), an older multivalued database system. On top of being a multivalued database, it is also, essentially, a runtime environment.

To complicate this, the items that need to be printed are HTML documents, with some custom JavaScript that needs to run before the file can be printed (mostly to correct HTML mistakes made by people who have no clue about HTML -- but that's another topic for another day). So that means we need to run it through Internet Explorer. But wait, the complications are still not done! We use MeadCo's ScriptX to do the actual printing, since printing through IE's own engine is both complex and unreliable.

So, add all this together to try printing from a single source location.

Several "easy" solutions were tried first, some killed by the boss (due to too much interaction required from users, most of whom are quite unskilled with computers), some killed by their technical limitations (IBM UniVerse offers the ability to shell out to DOS, but this lacks a Windows context, so IE doesn't run properly -- i.e., no JavaScript can be fired off).

So we came up with several proposals, of which one was approved by management for moving forward: build a Windows service that could print these PDF documents by running a small program that encapsulates the IE engine, thus providing a Windows context for the printing.

Seemed so simple then. From that point on, things went all over the place.

Getting IE to run with a Windows context was easy enough. Getting our ScriptX-related JavaScript to run took a little more effort, but was manageable. Then came the brick wall. Sure, it works fine on XP running from the service account. What about Server 2k3? Or Vista?

Sure enough, changes to the way the service account runs starting in 2k3 meant that our solution no longer worked that well... Services in 2k3/Vista run in session 0, same as XP, but that session is now isolated from users. In other words, only certain resources are available to the service account. And, sure enough, printers aren't one of them.

MSDN notes that the printing namespace isn't available from Windows services (or at least not supported) [1]. I'm obviously not the first person who's run into issues with printing from a service (StackOverflow questions: 1, 2, 3).

So, for now, we've gotten around it by creating a local account that has the printer available to it. This is far from an ideal solution, and I continue to look for one that lets me see printers from a 2k3/Vista Windows service...

Friday, May 8, 2009

Gary Bettman and the NHL

I will be the first to admit that Gary Bettman has done some great things to help grow the NHL. The numbers sort of speak for themselves ($440 million in 1993 to $2.2 billion in 2007).

That said, I wish he would get lost.

While I would love to see another team come to Southwestern Ontario, I realize that as long as Bettman is there, it won't happen.

But my biggest complaint, by far, is the following line from Bettman:
“We generally try to avoid relocating franchises unless you absolutely have to,” he said. “We think when a franchise is in trouble, you try and fix the problems. That’s what we did in Pittsburgh and Ottawa and Buffalo prior to our work stoppage. That’s what we did when the perception was that five out of the six Canadian franchises around the turn of the century were in trouble. We fixed the problems. We don’t run out on cities.”

We don't run out on cities?

Would you like to tell that to the cities that have lost franchises since you became the commissioner? I'm willing to let Minnesota (1993) slide, since they were already leaving when you got into the league and they now have a team back.

But what about Quebec City (1995)? Winnipeg (1996)? Hartford (1997)? Three great cities where hockey was appreciated. But I guess they don't fall into big business.

On top of that, let's take a look at the four franchises Bettman let into the NHL:

Nashville (1998): Barely alive, constantly looking for investors, can't sell out an arena.
Atlanta (1999): While more financially stable than Nashville, they haven't been able to convince anybody to want to play for them.
Columbus Blue Jackets (2000): Although it's in Columbus, it is doing okay as a team. Not super financially stable, but doing reasonably well in comparison to the last two.
Minnesota (2000): Oh look, hockey is back in a hockey town! After the last one was moved for purely business reasons, this team is doing well, with great fans. Unfortunately, they have trouble convincing people to play in Minnesota.

And that's not even looking at where he relocated teams to. Carolina? Do they need a hockey team? Or what about the currently floundering Phoenix Coyotes? How many times does the rest of the league need to bail a team out (to the tune of $35 million!) before they realize they could move it to a better market?

But this is Bettman, he doesn't give up on cities... Unless you're a Canadian or Northern US team, that is.

Tuesday, April 21, 2009

Die Rogers Die

I'll give you two guesses as to which channel the Devils-Canes game is on tonight.

Friday, April 17, 2009

VB6, Collections and a mini-Rant

I enjoy where I work a lot. There are great people, an interesting and relaxed work environment and some interesting challenges.

But every once in a while, things turn a little sour... mostly when dealing with Visual Basic 6. To give any non-programmers a rundown, VB6 is a crappy, slimmed-down programming language that, thanks to its ease of creating interfaces, became quite popular. But VB6 has a lot of limitations that just don't exist in other languages. Some of these are easier to get around than others... As an example, one of our applications has many issues with focus and how VB6 handles a change of focus from other apps to itself. Or, rather, how it doesn't handle it nicely. Not pretty.

Today, I'm trying to add some functionality that should be quite easy. I want to use a dictionary to store some data. A dictionary stores data using a key to reference a value (similar to a real-life dictionary that uses a word to refer to a definition). Dictionaries are quite nice in that it is easy to look something up: we just ask the dictionary for "ephemeral", for example, and it returns its definition, "Short-lived; existing or continuing for a short time only" (Dictionary.com's word of the day). Dictionaries are simple, have fast lookups (key for what I need) and store anything. VB6 has a container that is very similar to a dictionary, but it's called a Collection.

The problem with Collections is that they don't let you store user-defined structs. A struct is just a very simple way of combining some related pieces of data. VB6 has a different way of doing this, called a class, and classes work fine with Collections. So why am I complaining? Because I can place a struct inside the current code file without having to add another one (each class in VB6 needs to have its own file). I don't mind adding new files, but for something as minor as holding three items (that's right, just three), I now have to add an entire new file with a grand total of four lines of code in it. Four lines of code. Talk about a waste of time and space.

And don't get me started on classes in VB6. They are a perversion of the programming term class. But that's a rant for another day.

Wednesday, April 15, 2009

Rogers and TSN2, or the lack thereof.

So, it's playoff time. That means, hopefully, actually getting to see Devils games on TV... or at least the possibility of it. Or it would, if Rogers carried TSN2. Which, of course, it doesn't. What the hell? Stupid Rogers. So I get to miss the first game of the Devils-Hurricanes series. Figures.

I'm not sure why I pay $90/month for cable. So I can't watch the things I want when I want? Television really needs to get its act together. It's rough when I can go online and watch shows in as good (or better) quality than the broadcasts and at any time I want.

I'd be shocked if Rogers ever met any of my expectations. It's sad when a company does such a poor job that even coming within a mile of expectations would be considered a good thing. Of course, the government and CRTC need to get as much blame, since they don't believe that cable companies should face any competition in any given area. But then, how can a company that has previously given so generously be expected to face a free market? Maybe we should just nationalize them. I'm sure the government could do a better job at providing cable for me than Rogers... Heck, a trained monkey could probably do a better job.

Tuesday, April 14, 2009

Here Come the Playoffs

So the playoffs are about to start, with some interesting matchups. More precisely, some downright fantastic matchups in terms of skill (Pittsburgh vs Philadelphia), history (Montreal vs Boston) and a first-time playoff team (Columbus).

So who's going to win the cup? Well, the Devils, of course... or at least I hope they do. They are in for a tough battle, facing a Carolina team that seems to have been revived since Erik Cole rejoined the team. Here's a guy who could barely score in Edmonton, yet averages a point a game in Carolina with Eric Staal. A large portion of this matchup will depend on which Devils team shows up. The team from the first 70 or so games, or the team that kinda flopped around their last ten games of the season. Sure, they finished with winning four of their last five. But they lost six straight before that, with Brodeur looking a little shaky and the rest of the team forgetting to show up completely (how do you allow 20 shots in one period? What are we, the Leafs?).

Zach Parise finished with a great season and I hope he gets nominated for the Hart. I doubt he will win, since he doesn't have the name recognition of Ovechkin or Malkin. But it would be nice to see him get the nomination. He's had an incredible season, five goals short of fifty. The Devils have never had a fifty-goal scorer, nor a hundred-point player, and Parise came close to being the first in both categories this year. Zajac also had a coming-out party this year, becoming a top-line center between Parise and Langenbrunner, who also had a fantastic season. The first line reminds me of the Elias - Sykora - Arnott line that fired so strongly in the 2000 season, when the Devils won the Cup. Elias finished fairly strongly, and Gionta did okay this year, but he still hasn't lived up to his 48-goal season from a couple of years ago. As for the rest of the team? Clarkson has added a lot of spunk to the team, Shanahan has been a fantastic, cheap pickup... Rolston and Holik have been okay, but not what we needed them to be, especially at $4 million for Rolston.

On defence, Paul Martin had a good season and continues to anchor the defence. Oduya is starting to produce a little bit on the blue line, and Mottau had a great season, both offensively and defensively. White continues to be a solid presence on the blueline to anchor the team.

Another set of disappointments this season were Pandolfo and Madden. Madden seemed to have lost his edge at the beginning of the season, with bad turnovers. Both finished in the negative this year for +/-, which is shocking for two players who have previously been nominated for the Selke trophy.

In goal, Clemmensen did a fantastic job filling in for Brodeur. You can't ask for more when your primary goalie (and a workhorse like Brodeur) gets injured. The team finished with a team-record 52 wins.

Overall, a good season. Now, if we can carry that into the post season and defeat the Carolina Hurricanes...