Download the Code 47538.zip

One task that Windows administrators need to accomplish occasionally is extracting specific information from an HTML document. The document might be a local file, a status page on a LAN-attached network device, a Web-accessible database report, or any one of a thousand other types of pages, but in every case, we face two problems with using data from these sources. The first problem is connecting to the Web page and reading the data the page contains. If a page isn't a static file accessible through a Windows share or file system somewhere on the managed network, we can't use standard tools such as the Scripting.FileSystemObject object to read it. We might even need to supply a username and password to the device serving the Web page. Once we solve that problem, we have an even larger one: How do we reliably extract the snippet of information we need from virtually unreadable raw HTML?

We can resolve both problems by using standard components available on any Windows Script Host (WSH)-capable workstation and a little bit of thinking. I'll demonstrate the general process by walking through a demo script that fetches a Web page from a DSL router and extracts the router's public IP address. I'll then distill the process into three generic functions we can use to extract specific information from a wide variety of pages.

Note that in my discussion, the words information and data each have a particular meaning. When I use information, I mean the particular thing we want to find out--in this case, the public IP address of the router. When I use data, I'm referring to the raw page material that contains the information as well as extraneous material we need to remove.

Retrieving the Data
Out of the box, Windows 2000 and newer OSs have multiple components that we can use to retrieve Web data. Earlier 32-bit versions of Windows typically do as well if they've had patches and feature packs installed since 1999. The most obvious component for retrieving data from a Web page, and the one I personally don't use, is Microsoft Internet Explorer (IE). Although IE has many uses in scripts, it's designed for the interactive display of material. IE has problems when used in nongraphical sessions, might throw errors that block progress, and potentially has side effects when used against arbitrary remote content. IE is also slow because it automatically retrieves and displays additional content such as embedded images.

The component I typically use to retrieve data over an HTTP connection is Microsoft's XMLHTTP requester in msxml.dll. We begin the process by creating a reference to the requester like this:

Dim xml
Set xml = CreateObject("Microsoft.XMLHTTP")

Using the XMLHTTP requester consists of three steps: opening a connection, sending the request, and retrieving the response. For this component, opening the connection also means specifying the connection details. The full form of the open method for Microsoft.XMLHTTP looks like this, with optional arguments shown in square brackets:

open(method, url, [async], [user], [pass])

The method argument is a string specifying the type of request we're making. For HTTP connections, this is typically "GET". The url argument is also a string, and should be a complete legal URL if we're requesting remote content.

The url argument works with local file content as well, and in those cases, just the full path to the file will work--there's no need to add a file:// prefix to the file path.

In this case, we just use the URL we see in the browser when looking at the remote configuration page we want. The particular router I'm working with, a HomePortal 1800HG DSL router used in many homes and small offices for Internet access, shows its public IP address on the page http://10.1.1.1/?PAGE=B01, so this will be the URL we specify.

The optional async argument tells the requester whether it should wait for the response (synchronous behavior--a False value) or continue on to the next line of code immediately after the request is sent (asynchronous behavior--a True value). If not specified, the value defaults to True, which isn't what we want. When we send our request, we want our script to wait for a response because the next task processes the results. So we need to specify False.
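
To illustrate why the synchronous form is simpler, here is a minimal sketch of what a script would have to do if async were left at True: poll the requester's readyState property until it reports 4 (complete). The URL is the router page discussed above; the 100-millisecond sleep and the limit of 100 attempts are arbitrary assumptions, not values from the article's code.

```vbscript
Option Explicit
Dim xml, tries
Set xml = CreateObject("Microsoft.XMLHTTP")

' Asynchronous request: open and send return immediately.
xml.open "GET", "http://10.1.1.1/?PAGE=B01", True
xml.send

' The script has to wait on its own until the response is complete.
tries = 0
Do While xml.readyState <> 4 And tries < 100
    WScript.Sleep 100   ' arbitrary polling interval (assumption)
    tries = tries + 1
Loop

If xml.readyState = 4 Then
    WScript.Echo Len(xml.responseText) & " characters received"
Else
    WScript.Echo "Gave up waiting for a response"
End If
```

Passing False for async removes the need for the loop entirely, because send does not return until the response has arrived.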

The next two arguments are extremely useful if you need to access restricted resources--just be careful about hard-coding them in a script. If you need to supply a username and password to access a resource, you can specify these as the user and pass arguments so that your request doesn't return an authentication error. You don't need to specify them if the resource doesn't need authentication, or you can specify an empty string for the arguments by using vbNullString or a set of empty double quotes ("").
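
As a sketch of one way to avoid hard-coding credentials, the script below takes the username and password from the command line and passes them as the user and pass arguments. The script name getpage.vbs is hypothetical, and whether a given device actually asks for HTTP authentication is an assumption you would need to check.

```vbscript
Option Explicit
Dim xml, user, pass

' Hypothetical usage: cscript getpage.vbs <username> <password>
If WScript.Arguments.Count < 2 Then
    WScript.Echo "Usage: cscript getpage.vbs <username> <password>"
    WScript.Quit 1
End If
user = WScript.Arguments(0)
pass = WScript.Arguments(1)

Set xml = CreateObject("Microsoft.XMLHTTP")
' The fourth and fifth arguments carry the credentials.
xml.open "GET", "http://10.1.1.1/?PAGE=B01", False, user, pass
xml.send

' A status of 401 would indicate that the credentials were rejected.
WScript.Echo "HTTP status: " & xml.status
```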

We now have all the information necessary to open the connection, so we can add the following line of script to our code:

xml.open "GET", _
    "http://10.1.1.1/?PAGE=B01", _
    False

Although we've configured and opened the connection, we haven't actually made a request yet. To do that, we call the send method:

xml.send

As soon as the send is completed, we can get the page data back by reading the requester's responseText property:

data = xml.responseText

We now have the entire page available as data in our script. Although my explanation is lengthy, the actual code required is only four lines. As you'll see when I wrap the code up in a function at the end of the article, we can call it with a single line of code for repeated use.
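
As an interim sketch of that idea, the steps so far could be wrapped roughly as follows. The function name GetPageData and the status check are my assumptions for illustration; they are not necessarily what the downloadable code uses.

```vbscript
Option Explicit

' Hypothetical helper: returns the body of the page at url as a string,
' or an empty string if the server did not answer with HTTP status 200.
Function GetPageData(url)
    Dim xml
    Set xml = CreateObject("Microsoft.XMLHTTP")
    xml.open "GET", url, False   ' False = wait for the response
    xml.send
    If xml.status = 200 Then
        GetPageData = xml.responseText
    Else
        GetPageData = ""
    End If
End Function

Dim data
data = GetPageData("http://10.1.1.1/?PAGE=B01")
WScript.Echo Len(data) & " characters retrieved"
```

The status check assumes an HTTP request; for a local file path the requester may not report 200, so the test would need to be relaxed in that case.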

Now we have our data, but we still have some work to do. We're after one IP address. When we look at the router configuration page in a browser, we see about a dozen lines of text containing about five actual bits of hard data, including the IP address shown as follows:

Internet Address: www.xxx.yyy.zzz

The data returned from our request, however, includes a lot of "noise" that has to be removed before we have what we want. There are more than 200 lines of text and roughly 1,250 words, and we need to find what is basically a single word in this mass.
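
To give a feel for the size of the problem, here is one possible approach sketched with VBScript's built-in RegExp object. It is not necessarily the technique used in the rest of the article. The sample string is a made-up fragment standing in for the router's real HTML, and the pattern assumes that the label "Internet Address:" is followed only by tags and white space before the address.

```vbscript
Option Explicit
Dim sample, re, matches

' Made-up fragment standing in for the router's raw HTML (assumption).
sample = "<tr><td><b>Internet Address:</b></td>" & _
         "<td>192.0.2.10</td></tr>"

Set re = New RegExp
re.IgnoreCase = True
' Label, then any run of tags or white space, then a dotted-quad address.
re.Pattern = "Internet Address:(?:\s|<[^>]*>)*(\d{1,3}(?:\.\d{1,3}){3})"

Set matches = re.Execute(sample)
If matches.Count > 0 Then
    ' SubMatches(0) holds the first captured group: the address itself.
    WScript.Echo matches(0).SubMatches(0)
Else
    WScript.Echo "Address not found"
End If
```

The pattern checks only the shape of the address (four groups of one to three digits); it does not verify that each group is 255 or less.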
