Roy T. Fielding,
Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web.


4. Automated Traversal as a Maintenance Solution

Given that a means for automating the traversal process is desired, we need to define the requirements and limitations of such a solution. The primary requirement is that it improve the existing maintenance process by reducing the detrimental effects of human inattentiveness, duplication of effort, and distributed document ownership.

Manual traversal is both time-consuming and boring. Current WWW browsers are designed for the normal viewing process: they make no distinction between old documents and those that have recently changed, nor do they show the user a document's last-modification and expiration dates. In addition, their only method for testing a link is to actually request and transfer the document contents. This is so inefficient (particularly for sites with slow network connections) that many document owners avoid testing links at all. Even when applied repeatedly (as is required for consistent maintenance), manual traversal fails because no human being can remain consistently attentive during a repetitive, time-consuming, and boring process.

Fortunately, these are the characteristics for which automation is most effective. An automated traversal program can test a link without transferring the document contents by using the HEAD request method rather than the GET method used by browsers [HTTP]. Provided that document metainformation is available in the response headers, such a program can also check for special conditions that would interest the infostructure owner, such as a recent Last-Modified date or an approaching Expires date. Furthermore, the program can restrict its focus to the web's structure and not be distracted by the contents of each document.
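As an illustration only, the following Python sketch shows how such a test might look. It is not MOMspider's code: the function names `head` and `special_conditions`, the seven-day thresholds, and the condition labels are all assumptions made for this example. The first function issues a HEAD request so that no document body is transferred; the second inspects the returned metainformation.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import urllib.error
import urllib.request


def head(url, timeout=10):
    """Send a HEAD request and return (status, headers); no body is transferred."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, dict(err.headers.items())
    except OSError:
        # No usable answer at all (DNS failure, refused connection, timeout).
        return None, {}


def _http_date(value):
    """Parse an HTTP date header value; return None if absent or malformed."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def special_conditions(headers, now, recent=timedelta(days=7), soon=timedelta(days=7)):
    """Return labels for conditions an owner may care about (thresholds are assumed)."""
    fields = {name.lower(): value for name, value in headers.items()}
    notes = []
    modified = _http_date(fields.get("last-modified"))
    if modified is not None and now - modified <= recent:
        notes.append("recently modified")
    expires = _http_date(fields.get("expires"))
    if expires is not None:
        if expires <= now:
            notes.append("expired")
        elif expires - now <= soon:
            notes.append("expiring soon")
    return notes
```

A caller would combine the two, for example `status, headers = head(url)` followed by `special_conditions(headers, datetime.now(timezone.utc))`. A status of `None` or an error code marks the link as a candidate for human attention.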

With manual traversal, duplication of effort occurs because different infostructure owners don't see the results of others' traversals. World-Wide Web infostructures are encouraged to overlap (i.e., to reuse documents created for other infostructures). For example, most sites reference the "What's New With NCSA Mosaic" document maintained at NCSA. If the owner of each infostructure independently checked each link with a HEAD request, the result would be a great deal of duplication, wasted network bandwidth, and an unnecessary load on the document servers. An automated traversal program should therefore be required to handle multiple infostructures, possibly maintained by different owners, and share its testing information across them.
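One simple way to share testing information is to remember each link's result for the duration of a run, so that a document referenced by several infostructures is tested only once. The sketch below assumes this approach; the class name `SharedLinkTester`, the function `check_infostructures`, and the example URLs are hypothetical and are not taken from MOMspider.

```python
class SharedLinkTester:
    """Test each distinct URL at most once and reuse the result afterwards."""

    def __init__(self, probe):
        # probe is any callable that takes a URL and returns a test result,
        # for example a function that issues a HEAD request.
        self._probe = probe
        self._results = {}

    def test(self, url):
        if url not in self._results:
            self._results[url] = self._probe(url)
        return self._results[url]


def check_infostructures(infostructures, tester):
    """Map each owner to the results for that owner's links.

    infostructures maps an owner name to the list of URLs that owner's
    infostructure references; overlapping URLs hit the network only once.
    """
    return {
        owner: {url: tester.test(url) for url in links}
        for owner, links in infostructures.items()
    }


if __name__ == "__main__":
    calls = []

    def fake_probe(url):
        # Stand-in for a real network test so the example runs offline.
        calls.append(url)
        return 200

    shared = "http://example.org/whats-new.html"
    report = check_infostructures(
        {"alice": [shared, "http://example.org/a.html"],
         "bob": [shared, "http://example.org/b.html"]},
        SharedLinkTester(fake_probe),
    )
    print(len(calls), "requests for", sum(len(r) for r in report.values()), "links")
```

In this example four links are reported across the two owners, but the shared document is probed only once, so three requests are made in total.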

Sharing maintenance information can also be beneficial in reducing the problem of distributed document ownership. Since the program is performing traversals for multiple owners, it needs to place the results where all can gain access. The best place for such information is on the Web itself, in the form of HTML index documents generated for each infostructure. In this way, the document owners can make use of shared maintenance information even when they are not located at the site where the program is executed. It also allows a single site to perform the maintenance traversals for many others.
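To make the idea of a generated index document concrete, the following sketch renders one infostructure's test results as a small HTML page. The page layout, the function name `render_index`, and the shape of the `results` mapping are assumptions made for illustration; they do not describe the index format that MOMspider actually produces.

```python
from html import escape


def render_index(name, results):
    """Render an HTML index page for one infostructure.

    name is a human-readable infostructure name; results maps each tested
    URL to its outcome (here, simply a status code).
    """
    items = "\n".join(
        '<li><a href="{0}">{0}</a>: {1}</li>'.format(escape(url), escape(str(status)))
        for url, status in sorted(results.items())
    )
    title = "Maintenance index: " + escape(name)
    return (
        "<html>\n<head><title>{0}</title></head>\n"
        "<body>\n<h1>{0}</h1>\n<ul>\n{1}\n</ul>\n</body>\n</html>\n"
    ).format(title, items)
```

Because the output is an ordinary HTML document, it can be written to a location served by any Web server, where owners at other sites can read it with their usual browser.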

Unfortunately, no automated traversal program can completely solve the maintenance problem. A program cannot tell when a document's contents are changed such that they no longer represent the intentions of a given infostructure. Nor can a program, once it has discovered a broken link, determine why that link is broken or how to fix it. These tasks must still be performed by human maintainers. However, a traversal program can greatly ease the process by alerting the human maintainer and explicitly pointing to those documents that have changed and links that are broken.

Clearly, an automated traversal program would be useful for easing the maintenance of hypertext infostructures. We have developed the Multi-Owner Maintenance spider (MOMspider) for this purpose. MOMspider is a web-wandering robot that, given a list of instructions that details what infostructures to traverse, whom to notify about problems, and where to put the resulting maintenance information, will traverse each infostructure and fulfill all of the requirements listed above. The remainder of this paper will focus on the design of MOMspider, its capabilities and limitations, and proposed enhancements to HTML and HTTP that would further increase its usefulness.



Roy Fielding <fielding@ics.uci.edu>
Department of Information and Computer Science
University of California, Irvine, CA 92717-3425
Last modified: Wed Jun 15 06:32:06 1994