Wednesday, October 8, 2008

Performance of IndexFindElementEx Command

We have done some experiments with the IndexFindElementEx command that compare its performance with the simpler IndexFindElement command. We were looking into this because we noted that some Regular Expression searches seemed to be taking a rather long time.

The performance difference arises from the difference between searching the DOM dynamically for a simple string match on an attribute and doing the same search for a Regular Expression (RegEx) match. The RegEx search turns out to be a great deal more work. Our results appear to indicate that, for a typical complex web page (think of www.cnn.com), the regular expression match process takes about 8-12 times as long as the simple string match. This is the result of having to do so much more work.

For example, a typical CNN page has ~2000 DOM elements, so a search of every element for a simple string uses about 2000 "string match" calls (all done in local memory). These matches require that the string match from the left-hand side (from the beginning) of the named property. Because the regular expression search has to start at each possible character in each DOM value, you wind up with quite a large number of searches. Fortunately, the work is all done in memory, so the performance is very good!
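To make the difference in attempt counts concrete, here is a minimal Python sketch. It is only a model of the two strategies described above, not the actual implementation of either command: the function names, the sample attribute values, and the pattern are all made up for illustration. The simple match makes one left-anchored comparison per value, while the regular expression search tries every starting character until it finds a match.

```python
import re

# Hypothetical model only: counts match attempts for the two strategies.
# It does not reproduce how IndexFindElement or IndexFindElementEx work internally.

def anchored_attempts(values, needle):
    """Simple string match: one left-anchored comparison per DOM value."""
    attempts = 0
    hits = 0
    for value in values:
        attempts += 1
        if value.startswith(needle):
            hits += 1
    return attempts, hits

def unanchored_attempts(values, pattern):
    """Regex search: try the pattern at each start position until it matches."""
    compiled = re.compile(pattern)
    attempts = 0
    hits = 0
    for value in values:
        for start in range(len(value)):
            attempts += 1
            if compiled.match(value, start):
                hits += 1
                break
    return attempts, hits

# Made-up attribute values standing in for DOM property values.
sample_values = ["headline-top", "nav-link", "story-headline"]

print(anchored_attempts(sample_values, "headline"))
print(unanchored_attempts(sample_values, r"head.*"))
```

Even on three short values the regular expression search makes several times as many attempts as the anchored match, and the gap grows with the length of each value.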

On the other hand, if you search for a simple string (one without any regular expression characters) using the IndexFindElementEx command, the process needs to examine each DOM element once for each character in the string, to see if it might be a match. As you can guess, this can mean far fewer searches, and the performance is very quick.

For example, one regular expression search we ran on CNN made 325,000 attempted matches within the DOM, whereas only about 6,100 were needed for a regular (simple) string match. That's ~50 times as much work for the regular expression matching, depending on the lengths of the strings involved. So it is reasonable that when a regular string match on CNN.com takes ~0.2 seconds, the full regular expression match will take ~10.0 seconds. Faster machines will show reduced search times. See DOM Analysis Performance Benchmarks.
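The arithmetic behind that estimate can be checked with a few lines of Python. This is a rough sketch that assumes search time scales linearly with the number of attempted matches, which is a simplification; the variable names are ours.

```python
# Figures quoted above for one search on CNN.
regex_attempts = 325_000
simple_attempts = 6_100
simple_seconds = 0.2

# Assumption: elapsed time is proportional to the number of attempted matches.
ratio = regex_attempts / simple_attempts
estimated_regex_seconds = simple_seconds * ratio

print(f"work ratio: {ratio:.1f}x")
print(f"estimated regex time: {estimated_regex_seconds:.1f} s")
```

The ratio comes out slightly above 50, which is consistent with the rounded ~10 second figure.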
