Roy T. Fielding,
Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web.


5. MOMspider Design

The design of MOMspider focuses on fulfilling the requirements of multi-owner maintenance while at the same time minimizing its effect on World-Wide Web servers and network bandwidth. Because the MOMspider client is oriented toward maintenance issues in general, it also attempts to maximize the benefit to information providers while respecting any limits they may place on wandering robots.

5.1 Functionality

MOMspider gets its instructions by reading a text file that contains a list of options and tasks to be performed (an example instruction file is provided in Appendix A). Each task is intended to describe a specific infostructure so that it can be encompassed by the traversal process. A task instruction includes the traversal type, an infostructure name (for later reference), the "Top" URL at which to start traversing, the location for placing the indexed output, an e-mail address that corresponds to the owner of that infostructure, and a set of options that determine what identified maintenance issues justify sending an e-mail message.
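As a rough illustration of the fields a task carries, the record below models one task in memory. It is a sketch only: the class name, field names, and validation are assumptions made for this example and say nothing about the actual instruction-file syntax, which is shown in Appendix A.

```python
from dataclasses import dataclass, field

# The three traversal types named in the text.
TRAVERSAL_TYPES = ("Site", "Tree", "Owner")


@dataclass
class Task:
    """Hypothetical in-memory form of one task instruction."""
    traversal_type: str          # "Site", "Tree", or "Owner"
    name: str                    # infostructure name, used for later reference
    top_url: str                 # the "Top" URL at which traversal starts
    index_file: str              # where the HTML index is written
    email_address: str           # owner of the infostructure
    # Which kinds of maintenance issue justify an e-mail message.
    # The option names stored here are placeholders, not MOMspider's own.
    email_options: set = field(default_factory=set)

    def __post_init__(self):
        if self.traversal_type not in TRAVERSAL_TYPES:
            raise ValueError(f"unknown traversal type: {self.traversal_type!r}")
```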

For each task, MOMspider traverses the web, in breadth-first order, from the specified top document down to each leaf node. A leaf node is defined to be any information object which is not of document-type HTML (and thus cannot contain any further links) or which is outside the given infostructure. MOMspider determines the boundaries of an infostructure according to the task's traversal type: Site, Tree, or Owner. Site traversal specifies that any URL which points to a site (the pairing of hostname/IP address and port) other than that of the top document is considered a leaf node. Tree traversal specifies that any document not at or below the "level" of the top document is considered a leaf node, where level is determined by the pathname in the URL. Owner traversal specifies that any document beyond the top which does not contain an "Owner:" metainformation header equal to the infostructure name is considered a leaf node.
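The traversal and the three leaf rules described above can be sketched as follows. This is a minimal illustration, not MOMspider's code: `fetch` is a hypothetical callable standing in for the network layer, the default port of 80 is assumed, and reading "level" as the directory of the top document on the same site is one interpretation of the Tree rule.

```python
from collections import deque
import posixpath
from urllib.parse import urlsplit


def same_site(url, top):
    """Site rule: same hostname and port as the top document (port 80 assumed)."""
    a, b = urlsplit(url), urlsplit(top)
    return (a.hostname, a.port or 80) == (b.hostname, b.port or 80)


def in_tree(url, top):
    """Tree rule (assumed reading): same site, path at or below the top's directory."""
    if not same_site(url, top):
        return False
    top_dir = posixpath.dirname(urlsplit(top).path or "/").rstrip("/") + "/"
    return (urlsplit(url).path or "/").startswith(top_dir)


def is_leaf(url, info, top, traversal_type, name):
    """Decide whether traversal stops at this node."""
    if not info["is_html"]:
        return True               # non-HTML objects cannot contain links
    if url == top:
        return False              # the top document is always expanded
    if traversal_type == "Site":
        return not same_site(url, top)
    if traversal_type == "Tree":
        return not in_tree(url, top)
    if traversal_type == "Owner":
        return info["owner"] != name
    raise ValueError(f"unknown traversal type: {traversal_type!r}")


def traverse(top, traversal_type, name, fetch):
    """Breadth-first walk from `top`; returns URLs in the order they were tested.

    `fetch(url)` is hypothetical and returns a dict with the keys
    'is_html' (bool), 'owner' (str or None), and 'links' (list of URLs).
    """
    order, seen, queue = [], {top}, deque([top])
    while queue:
        url = queue.popleft()
        info = fetch(url)
        order.append(url)
        if is_leaf(url, info, top, traversal_type, name):
            continue              # leaf nodes are tested but not expanded
        for link in info["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Because leaf nodes are still tested, a link that leaves the infostructure is checked for breakage even though nothing beyond it is followed.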

The maintenance information produced by each task is formatted as an HTML index and output to the file specified in the task instructions (an example of which is provided in Appendix B). The index contains the following maintenance information:

MOMspider looks for four types of document change which may be of interest to the owner:

  1. referenced objects which have redirected URLs (moved documents);
  2. referenced objects which cannot be accessed (broken links);
  3. referenced objects with recent modification dates; and,
  4. owned objects with expiration dates near to the current date.

Each interesting item is placed in the closing cross-reference table and, if the corresponding option is requested, enclosed in a single e-mail message and posted to the owner at the task's completion.
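The four checks listed above might be expressed as a single classification step. The sketch below is illustrative only: the function name, the issue labels, the mapping of status codes to "redirected" and "broken", and the seven-day window for "recent" and "near" are all assumptions, not values taken from MOMspider.

```python
from datetime import datetime, timedelta


def classify(status, last_modified, expires, owned, now,
             window=timedelta(days=7)):
    """Return the maintenance issues raised by one tested object.

    status        -- HTTP status code of the response
    last_modified -- datetime from the Last-Modified header, or None
    expires       -- datetime from the Expires header, or None
    owned         -- True if the object belongs to the infostructure
    window        -- assumed threshold for "recent" and "near"
    """
    issues = []
    if 300 <= status < 400:
        issues.append("redirected")      # moved document
    elif status >= 400:
        issues.append("broken")          # link cannot be accessed
    if last_modified is not None and timedelta(0) <= now - last_modified <= window:
        issues.append("changed")         # recently modified
    if owned and expires is not None and expires - now <= window:
        issues.append("expiring")        # expiration date is near (or past)
    return issues


def mail_worthy(issues, email_options):
    """Keep only the issues whose option was requested for the task."""
    return [issue for issue in issues if issue in email_options]
```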

5.2 Efficient Use of Network Resources

A key design constraint for MOMspider is efficiency, particularly with regard to network bandwidth usage. It would be irresponsible to develop a maintenance robot which overly taxed the limited resources of networks like the Internet. Therefore, MOMspider minimizes the load on network bandwidth by using the HEAD request for testing links, keeping track of nodes that have already been tested, grouping multiple tasks within a single execution, and allowing the user to restrict the traversal of certain URLs.
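Keeping track of nodes that have already been tested amounts to memoizing the link test. The following is a minimal sketch under stated assumptions: `head` is a hypothetical callable that issues one HEAD request and returns a status code, and dropping the fragment before the lookup is a choice made for this example.

```python
from urllib.parse import urldefrag


def make_link_tester(head):
    """Wrap `head(url) -> status` so each URL is requested at most once.

    Sharing one tester across every task in a run is what lets grouped
    tasks reuse each other's results.
    """
    tested = {}                       # URL -> status of the earlier test

    def test(url):
        url, _fragment = urldefrag(url)   # "#anchor" names the same object
        if url not in tested:
            tested[url] = head(url)       # the only place a request is made
        return tested[url]

    return test
```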

Aside from the restrictions described above regarding the task's traversal type, MOMspider also enables the user to specify any URL prefixes which must always be avoided or leafed. These URL prefixes are listed in the systemwide or user avoid files (an example of which is provided in Appendix C). Each entry in the file includes the action (Avoid or Leaf), the URL prefix on which to apply that action, and an optional expiration date for the entry. This allows the user to completely avoid documents for which maintenance is not a concern or which could trap an unsuspecting spider (some forms of computational hypertext can have that effect).
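An avoid-list lookup of the kind described above could look like the sketch below. It models only the three fields named in the text; the class and function names are hypothetical, the on-disk format (Appendix C) is not reproduced, and the first-match rule is an assumption. The sketch also assumes that Avoid means the URL is never requested, while Leaf means it is tested but not traversed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AvoidEntry:
    action: str                    # "Avoid" or "Leaf"
    prefix: str                    # URL prefix the action applies to
    expires: Optional[datetime]    # None means the entry never expires


def lookup(url, entries, now):
    """Return "Avoid", "Leaf", or None for `url` (first live match wins)."""
    for entry in entries:
        if entry.expires is not None and entry.expires <= now:
            continue               # expired entries no longer apply
        if url.startswith(entry.prefix):
            return entry.action
    return None
```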

5.3 Being Friendly to Service Providers

A second design constraint for MOMspider is that it minimize its impact on information providers (destination servers) while at the same time maximizing the indirect benefits they receive from the traversal process. All HTTP requests are similar to:

    HEAD /path HTTP/1.0
    User-Agent: MOMspider/0.1
    From: user@machine.sub.dom.ain
    Referer: http://www.site.edu/current/document.html

This allows server maintainers to properly recognize the source of the request and, if necessary, place restrictions upon a particular spider. It also provides them with useful information, including how to contact the person running the spider and what document contains the reference being tested.
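For readers who want to reproduce a request of this shape, the sketch below builds one with Python's standard library. The function name and its parameters are hypothetical; the header values are the ones shown in the example request above.

```python
from urllib.request import Request


def build_head_request(url, referer, contact, agent="MOMspider/0.1"):
    """Build (but do not send) a HEAD request that identifies the robot.

    url     -- the link being tested
    referer -- the document that contains the link
    contact -- e-mail address of the person running the spider
    """
    return Request(
        url,
        method="HEAD",             # headers only; the body is never transferred
        headers={
            "User-Agent": agent,
            "From": contact,
            "Referer": referer,
        },
    )
```

The request can be passed to `urllib.request.urlopen`. Note that `urlopen` follows redirects by default, so detecting moved documents would require a custom redirect handler.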

As an additional precaution, MOMspider periodically looks for and obeys any restrictions found in a site's /robots.txt document as per the standard proposed by Martijn Koster [Koster94a]. Before any link is tested, the destination site is looked up in a table of recently accessed sites (the definition of "recently" can be set by the user). If it is not found, that site's /robots.txt document is requested and parsed for restrictions to be placed on MOMspider robots. Any such restrictions are added to the user's avoid list and the site is added to the site table, both with expiration dates indicating when the site must be checked again. Although this allows a discrepancy to exist between the restrictions applied and the contents of a recently changed /robots.txt document, it is necessary to avoid a condition where the site checks cause a greater load on the server than would the maintenance requests alone. An example sites file is provided in Appendix D.
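The site-table check can be sketched as a small cache with expiration. This is an illustration under assumptions, not MOMspider's mechanism: it keeps a parsed rule set per site using Python's `urllib.robotparser` rather than copying restrictions into the avoid list, `fetch_robots` is a hypothetical callable that returns the text of a site's /robots.txt, and the one-day lifetime and default port of 80 are placeholders.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class SiteTable:
    """Remember each site's robots.txt rules until they expire."""

    def __init__(self, fetch_robots, max_age=timedelta(days=1), agent="MOMspider"):
        self.fetch_robots = fetch_robots   # hypothetical: site -> robots.txt text
        self.max_age = max_age             # the user's definition of "recently"
        self.agent = agent
        self.sites = {}                    # (host, port) -> (parser, expiry)

    def allowed(self, url, now):
        """Return True if the robot may request `url` at time `now`."""
        parts = urlsplit(url)
        site = (parts.hostname, parts.port or 80)
        entry = self.sites.get(site)
        if entry is None or entry[1] <= now:
            # Unknown or stale site: fetch and parse its robots.txt once.
            parser = RobotFileParser()
            parser.parse(self.fetch_robots(site).splitlines())
            entry = (parser, now + self.max_age)
            self.sites[site] = entry
        return entry[0].can_fetch(self.agent, url)
```

Within the expiry period, any number of link tests against the same site costs only one extra request, which is the trade-off the paragraph above describes.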



Roy Fielding <fielding@ics.uci.edu>
Department of Information and Computer Science
University of California, Irvine, CA 92717-3425
Last modified: Wed Jun 15 12:03:41 1994