Friday, April 16, 2010

Byte Order

We have some software that interacts with third party software. Recently we've been working on a public/private key encryption scheme that lets them send us a small amount of data (encrypted with a public key) which we decrypt with a private key only we know. Pretty standard stuff. We're using RSA with 2048-bit keys. RSA has been around for, well, ever in computer terms. It's a pretty big standard.

We use OpenSSL to do all of our decryption (using rsautl, the -decrypt option and a secret private key). To save on data being passed around, we only push out the modulus to them (these are devices operating on networks where bandwidth == expensive) and use the standard 65537 exponent to make life easy for both of us.
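For reference, the decryption side boils down to something along these lines (the file names here are just placeholders):
openssl rsautl -decrypt -inkey private.pem -in payload.bin -out payload.txt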

So far so good, right?

We started doing some testing with them. We couldn't decrypt anything they sent us. We gave them a test private key and modulus that we could both use for testing. They couldn't decrypt what they were encrypting either. Wait, what? They created their own private key and modulus, and used it. Everything worked fine for them. They sent it to us. I try to encrypt with the modulus and decrypt with the given private key. No go. Now I'm really confused.

So I look at the private key with OpenSSL and extract the modulus from it. I compare the two moduli and they don't match. Wait, they don't match? Then I look a little closer... they are reversed (well, almost... since I'm looking at hex strings, it's the byte pairs that are reversed).

It appears that on Windows, CryptoAPI stores these values least significant byte first (little-endian), while OpenSSL expects most significant byte first (big-endian), as you may be able to guess from the title. But reversing the key bytes alone doesn't work; I still can't decrypt their data. Then we discover that not only do we have to reverse the modulus, we also have to reverse the encrypted output they send us.
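The fix on our side amounts to reversing the byte pairs of the hex strings before they go anywhere near OpenSSL. A minimal Perl sketch of the idea (the helper name is mine, not part of our actual code):
# Reverse the byte order of a value expressed as a hex string,
# e.g. "0A0B0C" becomes "0C0B0A".
sub reverse_hex_bytes {
    my ($hex) = @_;
    return join '', reverse( $hex =~ /../g );
}
# The same reversal applies to both the modulus we hand out and the
# ciphertext that comes back from the CryptoAPI side.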

At least it all works now, but byte ordering sucks.

Wednesday, November 25, 2009

Javascript, Arrays and more than 1 million HTML tags

So, we have a set of code that does this:
// node and searchClass come from the surrounding code
var classElements = [];
var els = node.getElementsByTagName("*");
var elsLen = els.length;
var pattern = new RegExp('(^|\\s)' + searchClass + '(\\s|$)');
for (var i = 0, j = 0; i < elsLen; i++) {
    if ( pattern.test(els[i].className) ) {
        classElements[j] = els[i];
        j++;
    }
}
It's a pretty simple for loop that gets all elements on a page and finds the ones that belong to a given class. We use it in our printing code to help work around some bad HTML in reports (report HTML can contain client-generated content... it gets ugly fast).

We recently had a support call come in reporting an error on very large files. The first file had 57MB of HTML. Yes, 57 megabytes. Sigh. Looking at the source, we found 10,425 occurrences of </div> without a corresponding opening tag. Removing those made the code work, so I thought we had solved the issue: it was just bad HTML.

The next day it came back again, this time on a 75MB report, after they had fixed their templates so they no longer emitted closing divs without opening divs. What?

The error being returned was the fantastic "null object" error, squarely on the "classElements[j] = els[i];" line. Say what?

We use the Web Browser Control (IE) in our application, so I opened it up in IE to see the issue. Some further investigation led me to the fact that it would fail when accessing the element at array position 1,000,000! Checking the element in position 999,999 was fine.

So I reworked the code to look like this:
for (var key in els) {
    // index into the collection via the enumerated key rather than a counter
    var clsItem = els[key];
    if ( pattern.test(clsItem.className) ) {
        classElements[j] = clsItem;
        j++;
    }
}
And now it works. Apparently walking the collection with for...in and indexing through the enumerated keys works past the 1 million mark in IE, where direct numeric indexing did not.

We did suggest they clean up their HTML (tables inside of tables inside of tables? Must have been FrontPage) to help reduce the size of the reports. The 75MB report had 1.2 million tag elements!

Thursday, June 18, 2009

Wikis, Wikitext, HTML and regular expressions

Once again, regular expressions come in to help me. We have recently had a bit of a push at our company to become more information-sharing friendly. In other words, a lot of people are tired of trying to talk to one or two subject matter experts to get the information they need to do their jobs. They want to be able to share that information more easily.

Well, there's a technology out there that allows for group-think... and it's called Wiki software. You may have heard of it from a small little site called Wikipedia... Or, if you haven't, here's the idea: information is put up on the internet where anyone and everyone can edit it. The idea is that the group will flesh out that information, adding, removing and editing the information until it reaches that ultimate goal of being perfect information. Of course, you will never get to the point of perfect information, but that's the end goal. And, as Wikipedia has shown, it works fairly well... as long as you have more people who are willing to do good than bad (that, and some good moderation tools to get rid of those who don't do good).

In a corporate setting, it seems like a Wiki would be ideal, wouldn't it? Most people probably don't want to do bad (especially if their jobs are on the line), so you get rid of the biggest pain for a public wiki: malicious users. Of course, you have a much bigger problem... buy-in.

The problem with a corporate wiki is that, if you happen to be using MediaWiki (and probably other Wiki software), you need to learn its markup language. In MediaWiki, that includes some great ideas like using two single quotes ('') to denote italics and three (''') to denote bold... makes sense, doesn't it? I will grant the language that lists are quite easy to use (* creates a list item, ** creates a sublist item, easy as pie!). But then there's the table syntax, which is hard to remember and use. Lovely. So now you've got to teach people who don't even know what a markup language is how to use one to edit WikiText...

... or you can use the wonderfully named FckEditor. Seriously, could they have not come up with a better name? Imagine going to your boss and saying we should use the FckEditor. I'm sure that will go over well. One of the issues I have with open source software is that people who name things are idiots... It's like GIMP, which I understand is supposed to be a clever acronym (GNU Image Manipulation Program), but ends up sounding, well, gimp...

Anyways, back on track. Buy-in is difficult not only because of the learning curve, but also because it's a culture shift. You want people to record their thoughts and experiences in the Wiki. Some may feel threatened that their position as keeper of certain knowledge will be challenged; others may just not feel comfortable writing things up on an internal wiki. Those can be tough barriers to adoption... but they aren't the point of this post.

Getting back to WikiText. We have a set of documentation, written (and I use the term loosely) in HTML and stored on our intranet. It's pretty useful, but suffers from being both hard to update and often out of date. Moving that documentation into the Wiki was set as one of the original goals for our Wiki working group, so we could provide the seeds of information sharing.

Well, that task turned out to be much tougher than I expected, partly because of the way WikiText is structured and partly because of the way our HTML is structured. The larger problem is that some of the documentation was written using Word, some with FrontPage, and very little of it by people who are familiar with HTML. What this means is that we have lots of extra markup that serves very little purpose and tends to confuse the tools out there that convert HTML to WikiText. I tried the HTML::WikiConverter Perl module, but gave up when it removed too much formatting.

So, I turned to one of my favourite tools: Perl and regular expressions.

After several runs, and reruns, and more runs, I got some code that I was happy with. Unlike WikiConverter, it doesn't convert HTML tables to Wiki tables, it leaves <pre> tags in, and it strips a lot of markup that we don't need. The script also removes a number of characters that can cause issues.

You see, in our macro language, ! is used to start a macro, but in WikiText a ! is used for tables. Similarly, * is used for comments... which is also the WikiText marker for a list item. Clearly a bunch of substitutions needed to be made, and this is where things got a bit uncomfortable: the conversion is done with regular expressions, which, compared against HTML's nested structure, have some well-known shortcomings.

Consider the following case, which is valid HTML: <div> <div> </div> </div>. A regular expression like /<div>[^<]*<\/div>/ can only pair the inner <div> with the nearest </div>; there is no way for a plain regular expression to match the outer <div> with its corresponding closing tag, because that requires tracking nesting depth. That can make some issues tough to deal with.

Thankfully, most of the HTML in our documentation was very simple, so I was able to get away with using just regular expressions to replace HTML tags with WikiText.
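To give a flavour of the approach (these particular substitutions are illustrative, not lifted from the actual script), the conversions look something like this:
# A few representative HTML-to-WikiText substitutions, assuming the
# whole page has been slurped into $html. Illustrative only.
$html =~ s/<b>(.*?)<\/b>/'''$1'''/gis;          # bold
$html =~ s/<i>(.*?)<\/i>/''$1''/gis;            # italics
$html =~ s/<li>\s*(.*?)\s*<\/li>/* $1\n/gis;    # list items
$html =~ s/<\/?(?:font|span)[^>]*>//gi;         # strip markup that serves no purpose
The rest (escaping the ! and * characters our macro language needs, stripping Word's extra attributes, and so on) follows the same pattern.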

Of course, after all was said and done, using an HTML parser might have worked better, because the documentation (of course) had a hodgepodge of bad tags throughout it: <b> tags with no closing tags, tags that are just odd (what is <os>?), and various other things. But, for the most part, it worked well. The documentation has now been imported into our Wiki.

Now, to get people to buy in...

Tuesday, June 16, 2009

Balsillie, Bettman, the NHL and Coyotes

By some reports, Balsillie is done, the judge ruled against him, it's over. No team in Hamilton. Now, take a look at the MakeItSeven website. Well, that clears things up... right?

While I think a team in the Hamilton area would do well, it's hard to see Balsillie getting his way. Unfortunately, that means Bettman might get his. I realize that Bettman has done some good things for the game as a business, but as a sport, he's probably done more to hurt it than any previous commissioner. The game is all about the money now. The NHL worries about the league's finances and keeping teams happy. There seems to be much less caring about the fans. And if I hear Bettman say one more time that they don't run out on cities, I may hunt him down and beat him. Can you believe that ignorance?

Let's look at this a bit more objectively, if we can.

Hamilton, population 692,000 (metropolitan). Not a large city in comparison to Phoenix's 4.2 million (metropolitan). But now let's compare, as regions, southwestern Ontario with Arizona. Arizona's population is 6.5 million people. Southwestern Ontario? Excluding Toronto, roughly 3.2 million. Add Toronto, and we are close to hitting 10 million (most of Ontario's population). So the markets are close to the same size, once you consider that southwestern Ontario would also include the Leafs (who, it must be said, sell an ungodly number of tickets at expensive prices and have diehard fans... kind of like the Yankees, but less successful on the field of play).

So population-wise, it's pretty even... right? Except that Phoenix can't maintain the minimum ticket sales needed to qualify for revenue-sharing money from the rest of the NHL's teams. 6.5 million people can't fill an arena a few times over the year? Well, I guess part of the problem is playing ice hockey in the desert; it's not really a sport that fits the region's weather. Some mention that Buffalo might also be an impediment, but most of the Canadians who attend Buffalo games come from the region right around Buffalo. I don't see that changing much, though it's hard to predict what would happen to the Sabres. I think the Sabres have far more to worry about here than the Leafs do.

Team-wise? Losing $300 million (the number thrown around by current owner Moyes) in 10 years is pretty impressive... or ridiculous. Yes, I realize that the city of Glendale helped build an arena for the team and (apparently) has a lease running a ridiculous 25 years... but I'm almost wondering whether the team still wouldn't come out ahead by moving to Hamilton, paying out the lease... and then subleasing the arena to whomever they want.

So, there are some good arguments for bringing the team to Hamilton. Canada has always been a haven for ice hockey, and southwestern Ontario, I feel, could easily support another team in the region.

Will it happen? I highly doubt it. Bettman is, if nothing else, a control freak, and I doubt he would like having someone like Balsillie, who's a bit of a maverick, on the league's board of governors. However, I can't see why the league's owners wouldn't want a man with deep pockets ($2 billion+ net worth) and a passionate love of hockey in their group.

Maybe it's because too many of them don't care about the sport, just the business?

Tuesday, June 9, 2009

Regular Expressions... why you don't have 2 problems.

There is a long standing statement in the programming world that goes something like this:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
There is a great article discussing this statement at Jeffrey Friedl's blog (all the way back in 2006!) where he traces the source of that statement and the reasoning behind it.

Getting back to the point at hand, regular expressions are a tool, as Jeffrey notes. They are a very useful tool that can create headaches for the simple reason that they are hard to read.

Consider the following regular expression:
s/\n\n(\n)+/\n\n/g;

In the world of regular expressions, it's a pretty simple one. What does it do? It compresses any run of three or more newlines down to two. So if you have ten newlines in a row, it reduces them to two.

But it wasn't working correctly for me... and that is one of the reasons why people often repeat that dreadful statement. I now have a problem with my regular expression, as well as my original problem. Two problems!

In the end, I turned to StackOverflow to see if I could get some help. Sure enough, some enterprising people quickly came up with a far better answer than I had:
s/(\r?\n[ \t]*){2,}/\n\n/g;

So what's the difference? It uses the {M,N} quantifier, where M specifies the minimum number of matches of the preceding group and N the maximum (left out here, so there's no upper limit). The \r? matches zero or one carriage return characters, and the [ \t]* matches any number of spaces and tabs. In other words, it treats a Windows-style \r\n the same as a plain \n and cleans up any stray whitespace hanging around the newlines, which is presumably why my original version failed on this input.
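A quick sanity check in Perl (the sample string is just made up to exercise the pattern):
# Blank lines with carriage returns and trailing spaces get collapsed too.
my $text = "alpha\r\n \r\n\r\nbeta\n\n\n\ngamma\n";
$text =~ s/(\r?\n[ \t]*){2,}/\n\n/g;
# $text is now "alpha\n\nbeta\n\ngamma\n"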

Much simpler, and it does the job quite well. So now I have zero problems!

Regular expressions are certainly a very useful tool, but they come with trip-ups. A better quote would be "Regular Expressions, Use At Own Risk"... but then again, that can be said about any tool.

Friday, May 22, 2009

Speed Improvements... and Profiling.

It's incredibly important to profile before you try to do any optimization. The reasons should be obvious, but I've caught myself trying to optimize without profiling first. Bad idea. Sometimes you make things worse. It's also important to profile after every single change rather than changing multiple things at once; that lets you quickly determine whether a change was good or bad.

MapX and RowValues

MapX is a 3rd party tool that displays geographical data. It is available from MapInfo (now owned by Pitney Bowes) and is quite useful. It also hasn't been updated in about 7 years, since they have a newer product called MapXtreme (what's so extreme about it?). Which means that, once again, we are stuck trying to pull performance out of a legacy application that, well, isn't going to get any faster on its own.

I made an interesting discovery a few weeks ago regarding accessing datasets. A dataset, in MapX terms, is a set of data associated with geographical objects. While the map data (lat/long) is stored in a map file, the attribute data is stored in a proprietary format with a link between the data and the map object. There are times when we need to pull a lot of data to display to the user (an info tool, for example, so you can query what an object on the map is).

To access the data, you can refer to the dataset directly using the object's feature key. Something like:
strValue = dataset(featureKey, fieldName)
This works, but once you need to query a row (or feature key) more than once, it makes more sense to use a RowValues object, which is easy enough:
Set rowValues = dataset.RowValues(featureKey)
strValue = rowValues(fieldName).Value
strValue2 = rowValues(otherFieldName).Value  ' subsequent fields come from the same RowValues object
And there we go. For one value the timing is similar; once you do several requests against the same row, however, it gets much faster. The more fields you read from a row, the bigger the speed increase (economy of scale!).

Visual Basic 6 and Combo Boxes

Another fun discovery this week: Visual Basic 6 is slow when adding lots of items to a combo box. How many is lots? Try 12,000 unique strings! So, off we go trying to find a better way. And, sure enough, there is a fairly simple one, as long as you don't mind getting your hands dirty with the Windows messaging API.

So we go from:
combobox.AddItem "this string"
To the SendMessage equivalent, with the API declaration and message constant defined at module level:
Private Declare Function SendMessage Lib "user32" Alias "SendMessageA" (ByVal hWnd As Long, ByVal wMsg As Long, ByVal wParam As Long, lParam As Any) As Long
Private Const CB_ADDSTRING As Long = &H143
SendMessage combobox.hWnd, CB_ADDSTRING, 0, ByVal "this string"
Simple enough, for a 50% time decrease for 12,000 items... from 1 second to 1/2 a second! Very good stuff.

MapX and Display

Sometimes measured speed doesn't mean much to a user if it appears the system is hung and not doing anything at all.

Interestingly enough, MapX's map control has a setting called RedrawInterval, which controls how long the control waits between map redraws while it is generating the data to be displayed. A lower setting means the map will be redrawn more frequently; a higher value means the map won't update until it is done or the time has run out. When viewing a small section of a map, this doesn't make much difference.

What we discovered, though, is that when a user is zoomed out to a radius of 10 km worth of mapping data, the map takes a while to redisplay for big cities. This makes sense, since there is a lot of road data to display for a large city... but, as it turns out, our RedrawInterval was set to 300, which translates to 3 seconds (who works in 1/100ths of a second?).

Setting it to the minimum of 10 (so 0.1 seconds) actually results in a perceived speed increase. Why? As it turns out, MapX draws each layer in sequence internally before displaying them. Since we have about a dozen layers (County, City, Railroads, Roads, Highways, etc...), this means that updates to the screen start appearing sooner. Overall, the time taken to completely redraw the screen is longer (not by much: for larger cities it is about 2-3% slower on average, and it's negligible for smaller cities and towns), but to the user it appears to have drawn faster, since they can see updates happening.

Wednesday, May 20, 2009

Printing from a Service...

We've been working on adding the ability to create PDF files on the server side of our application. This sounds much easier than it is. Our server base is IBM UniVerse (how many people can say they are familiar with that?), an older multivalued database system. On top of being a multivalued database, it is also, essentially, a runtime environment.

To complicate things, the items that need to be printed are HTML documents, with some custom JavaScript that needs to run before the file can be printed (mostly to correct faulty HTML written by people who have no clue about HTML -- but that's another topic for another day). So that means we need to run it through Internet Explorer. But wait, the complications are still not done! We use MeadCo's ScriptX to do the actual printing, since printing through IE's own engine is both complex and unreliable.

So, add all this together and try to print from a single source location.

Several "easy" solutions were tried first, some killed by the boss (due to too much interaction required from users, most of whom are quite unskilled with computers), some killed due to their technical limitations (IBM UniVerse offers the ability to shell out to DOS, but this lacks a windows context, so IE doesn't run properly -- ie: no javascript can be fired off).

So we came up with several proposals, of which one was approved by management: build a Windows service that could print these PDF documents by running a small program that encapsulates the IE engine, thus providing a Windows context for the printing.

Seemed so simple then. From that point on, things went all over the place.

Getting IE to run within a Windows context was easy enough. Getting our ScriptX-related JavaScript to run took a little more effort, but was manageable. Then came the brick wall. Sure, it works fine on XP running under the service account. What about Server 2k3? Or Vista?

Sure enough, changes to the way the service account runs starting in 2k3 meant that our solution no longer worked that well... Services in 2k3/Vista run in session 0, same as XP; however, that session is now isolated from users. In other words, only certain resources are available to the service account, and sure enough, printers aren't one of them.

MSDN notes that the printing namespace isn't available from Windows services (or at least isn't supported) [1]. I'm obviously not the first person who's run into issues with printing from a service (StackOverflow questions: 1, 2, 3).

So, for now, we've worked around it by creating a local account that has the printer available to it. This is far from ideal, and I continue to look for an approach that lets me see printers from a 2k3/Vista Windows service...