Filtering the Noise
It would be nice if we could filter out some of this noise quickly. In theory, we could eliminate HTML and XML tags by loading the data into an XmlDocument or HtmlDocument object and then requesting just the document's text, but problems arise if the data doesn't exactly fit the requirements of the object.
In general, I've found that the risks outweigh the benefits for the following reasons. An XmlDocument object would work for our well-crafted router configuration page, but if a page isn't a well-formed XML document, parsing will fail. An HtmlDocument object would give us a nice way to filter out unneeded content, but it carries a significant risk of breaking our script. If a page has odd data--including some content that renders without errors in IE--an HtmlDocument object might produce a blocking GUI error box. Furthermore, scripts in a page might still run in the virtual document and cause odd behavior.
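To make the well-formedness problem concrete, here is a minimal sketch. It assumes that MSXML's DOMDocument (created with the ProgID MSXML2.DOMDocument) stands in for the "XmlDocument" object mentioned above, and the page fragment is made up for illustration. Because the <br> tag is never closed, the fragment is ordinary HTML but not well-formed XML, so loading fails:

```vbscript
Option Explicit

Dim doc, page
' Assumption: MSXML's DOMDocument plays the role of the "XmlDocument" object
Set doc = CreateObject("MSXML2.DOMDocument")
doc.async = False   ' parse synchronously so the result is available at once

' Hypothetical fragment: <br> is never closed, so this is not well-formed XML
page = "<html><body>Internet Address: 192.0.2.10<br></body></html>"

If doc.loadXML(page) Then
    ' Only reached when the whole string parses as XML
    WScript.Echo doc.documentElement.text
Else
    ' parseError describes why the parser rejected the data
    WScript.Echo "Parse failed: " & doc.parseError.reason
End If
```

A script that relies on this approach has to handle the failure branch for every page that isn't strict XML, which is most Web pages.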
The best solution is to use regular expressions. We could find the public IP address with a single short regular expression, but that would be a bit like using an XmlDocument object: cheating, because it shows what conveniently works in this case but is useless for most other tasks. It's better to use regular expressions to clean up generic Web-like data to the point at which a much simpler regular expression, or even a basic string search, will let us find the information we want quickly.
We start by creating an instance of the VBScript regular expression engine. Because we want to work with patterns that might appear many times, on many lines throughout our data, we then turn on multiline and global support:
Dim rx
Set rx = New RegExp
rx.Multiline = True
rx.Global = True
Our first step toward isolating our target information is to remove all HTML tags from the data. Because HTML and XML tags always begin and end with angle brackets (<>), we can easily identify them by using the following pattern:
rx.Pattern = "<[^>]+>"
If you use regular expressions frequently, you'll recognize that this pattern looks for strings that begin with < and end with >, but the portion of the expression within the angle brackets (i.e., [^>]+) might be puzzling. When the caret (^) is the first character inside the character-set markers [ and ], it means match any character not in the set. The plus (+) character, which is more familiar to "regex" users, means match one or more of the preceding item. So [^>]+ means match one or more characters that aren't >. Thus, the entire <[^>]+> pattern matches any sequence that begins with <, ends with >, and has at least one character, none of them >, in between.
Once we've identified the tags, all we need to do to remove them is to use a regular expression replacement, substituting an empty string for each tag:
data = rx.Replace(data, _
vbNullString)
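As a quick check of the pattern, the following sketch runs the same replacement on a small fragment. The fragment and the address in it (taken from a range reserved for documentation) are made up for illustration and don't come from a real router page:

```vbscript
Option Explicit

Dim rx, sample, stripped
Set rx = New RegExp
rx.Global = True          ' replace every match, not just the first one
rx.Multiline = True
rx.Pattern = "<[^>]+>"    ' <, then one or more characters that aren't >, then >

' Hypothetical fragment with no white space between the tags
sample = "<tr><td class=""label"">Internet Address:</td><td>192.0.2.10</td></tr>"

stripped = rx.Replace(sample, vbNullString)
WScript.Echo stripped     ' Internet Address:192.0.2.10
```

Note that the text from the two table cells runs together here because the fragment has no white space between its tags. Also note that the pattern treats anything between < and > as a tag, so a stray > inside an attribute value, or a < used in a page's script code, can leave fragments behind.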
If we peek at a typical Web page's data after such a replacement, we generally find that a lot of data remains, but most of it is white space: a mixture of tabs, standard spaces, and line terminations. We'll also frequently see special HTML character entities. Although these aren't significant in the case of the router Web page, one of them is worth cleaning up to make searching easier: the nonbreaking-space entity, typically encoded in Web pages as &nbsp;. Although we could use a regular expression for this replacement as well, it's just as easy to use VBScript's Replace function. We can replace each of these entities with a single space by using the following line:
' Start at position 1, replace all occurrences (-1), compare as text (1 = vbTextCompare)
data = Replace(data, "&nbsp;", " ", 1, -1, 1)
The last step in our generic cleanup is to condense every sequence of white space in our data into a single space so that our search string doesn't need to be concerned about how many or what type of nonprinting characters separate specific words. To translate white space to single spaces, we specify a new regular expression pattern that matches one or more white-space characters of any type:
rx.Pattern = "\s+"
Then we substitute a single space for every occurrence we encounter:
data = rx.Replace(data, " ")
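Taken together, the three cleanup steps can be sketched as one function. The name StripTagsAndSpace and the sample input are hypothetical and used only for illustration; this isn't the code from Listing 1:

```vbscript
Option Explicit

' Hypothetical helper that applies the three cleanup steps in order
Function StripTagsAndSpace(ByVal content)
    Dim rx
    Set rx = New RegExp
    rx.Global = True
    rx.Multiline = True

    ' 1. Remove anything that looks like an HTML or XML tag
    rx.Pattern = "<[^>]+>"
    content = rx.Replace(content, vbNullString)

    ' 2. Turn nonbreaking-space entities into ordinary spaces
    content = Replace(content, "&nbsp;", " ", 1, -1, vbTextCompare)

    ' 3. Collapse each run of white space into a single space
    rx.Pattern = "\s+"
    StripTagsAndSpace = rx.Replace(content, " ")
End Function

Dim raw
' Made-up input mixing tags, a line break, a tab, and two entities
raw = "<p>Internet" & vbCrLf & vbTab & "Address:&nbsp;&nbsp; 192.0.2.10</p>" & vbCrLf
WScript.Echo "[" & StripTagsAndSpace(raw) & "]"   ' [Internet Address: 192.0.2.10 ]
```

The order matters: the entities are converted before the white space is collapsed, so a mix of entities and ordinary spaces still ends up as one space. The trailing line break becomes a trailing space, which is visible inside the brackets in the output.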
The Crucial Information
Most of the remaining work will be specific to the particular Web page from which we're extracting information, but we can do it with only two or three lines of code in most cases. Because most network monitoring or management tasks involve repeatedly accessing only a handful of different pages, a minor time investment in creating a simple extraction routine will pay off quite well.
Figure 1 shows a subset of data from my router configuration page after removing HTML tags and excess white space. This isn't the full page--I trimmed material to make the extract short--but the trick we'll use to get our information will work fine with long data sets too. The trick is to look at the data immediately before what we're after and pick out the shortest bit of it that's always on the page but only appears once.
For example, we're after the Internet address: the obviously bogus value 256.261.381.125. A colon and space immediately precede the address, but that set of characters isn't unique in the data. If we reach further back, we get "Address: ", but that set of characters isn't unique either. However, "Internet Address: " is unique. So we split the data at "Internet Address: ", which gives us an array of two strings, one containing the text before "Internet Address: " and one containing the text after. We need just the second string, which has an index of 1, so we do this:
data = Split(data, "Internet Address: ")(1)
Now our job is simple. The first character after the information we want is a space, and because our IP address should never have a space, we can split the data at the space. We keep the first string of the two-string array that Split returns (the first string has an index of 0), like this:
data = Split(data, " ")(0)
and are left with the value 256.261.381.125.
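One caveat: if the marker text isn't on the page, Split returns a one-element array, and asking for index 1 raises a "Subscript out of range" error. The following sketch wraps the two Split calls with a check for that case. The function name ExtractBetween, its parameter names, and the sample data are hypothetical, and this isn't the GetSubString function from Listing 1:

```vbscript
Option Explicit

' Hypothetical helper: returns the text between leadText and trailText,
' or an empty string when leadText isn't found
Function ExtractBetween(ByVal content, ByVal leadText, ByVal trailText)
    Dim parts, tail
    ExtractBetween = vbNullString

    parts = Split(content, leadText)
    If UBound(parts) < 1 Then Exit Function   ' leading marker not found

    tail = parts(1)                           ' text after the first marker
    If Len(tail) = 0 Then Exit Function       ' nothing follows the marker

    ' Keep everything up to the first occurrence of the trailing marker
    ExtractBetween = Split(tail, trailText)(0)
End Function

Dim cleaned
' Made-up cleaned data; 192.0.2.10 is from a range reserved for documentation
cleaned = "Status: Connected Internet Address: 192.0.2.10 Subnet Mask: 255.255.255.0"
WScript.Echo ExtractBetween(cleaned, "Internet Address: ", " ")      ' 192.0.2.10
WScript.Echo "[" & ExtractBetween(cleaned, "Gateway: ", " ") & "]"   ' []
```

Returning an empty string lets the calling code decide what to do when a page's layout changes, instead of stopping the script with a runtime error.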
To see the power of this approach for repeated use, look at Listing 1. I've turned each of the major technical steps--retrieving the page, removing tags and extra white space, and isolating the information substring we want--into a separate function, all shown in callout B. With the functions implemented, extracting the IP address takes just the three lines of code in callout A.
If you read through the code, you'll notice that I use VBScript's Trim function in the GetSubString function even though I didn't use it in this example. Trim simply removes spaces at the beginning and end of text. I use it in the function so you don't need to be precise about including trailing and leading spaces in strings you use to extract data.
The functions also make it easier to adapt the code to your own situation. If you already know how to extract the material from a page using regular expressions and want to do it your own way, you can use just the GetWebXml function to retrieve data and then process it as you wish. If you want more work done for you but have special needs such as extracting multiple strings, you can use the CleanTaggedText function and ignore the GetSubString function. Finally, if you definitely need just one item from a page and have consistent unique text before and after the item, you can use all three functions as I've done to get the information you need.