Roy T. Fielding,
Maintaining Distributed Hypertext Infostructures:
Welcome to MOMspider's Web.
The design of MOMspider focuses on fulfilling the requirements of multi-owner maintenance while at the same time minimizing its effect on World-Wide Web servers and network bandwidth. Because the MOMspider client is oriented toward maintenance issues in general, it also attempts to maximize the benefit to information providers while respecting any limits they may place on wandering robots.
MOMspider gets its instructions by reading a text file that contains a list of options and tasks to be performed (an example instruction file is provided in Appendix A). Each task is intended to describe a specific infostructure so that it can be encompassed by the traversal process. A task instruction includes the traversal type, an infostructure name (for later reference), the "Top" URL at which to start traversing, the location for placing the indexed output, an e-mail address that corresponds to the owner of that infostructure, and a set of options that determine what identified maintenance issues justify sending an e-mail message.
For each task, MOMspider traverses the web, in breadth-first order, from the specified top document down to each leaf node. A leaf node is defined to be any information object which is not of document-type HTML (and thus cannot contain any further links) or which is outside the given infostructure. MOMspider determines the boundaries of an infostructure according to the task's traversal type: Site, Tree, or Owner. Site traversal specifies that any URL which points to a site (the pairing of hostname/IP address and port) other than that of the top document is considered a leaf node. Tree traversal specifies that any document not at or below the "level" of the top document is considered a leaf node, where level is determined by the pathname in the URL. Owner traversal specifies that any document beyond the top which does not contain an "Owner:" metainformation header equal to the infostructure name is considered a leaf node.
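The three boundary rules can be sketched as a single leaf test. This is a minimal illustration rather than MOMspider's own code: the function names, the `is_html`, `owner`, and `name` parameters, and the reading of "level" as the top URL with its final path segment removed are all assumptions.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def site_of(url):
    """Return the (hostname, port) pair that identifies a site."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.hostname, port)


def tree_prefix(top_url):
    """Return the top URL with its final path segment removed (assumed meaning of "level")."""
    parts = urlsplit(top_url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def is_leaf(url, top_url, traversal, is_html=True, owner=None, name=None):
    """Decide whether url is a leaf node for the given traversal type.

    owner is the value of the document's "Owner:" header (None if absent);
    name is the infostructure name from the task instruction.
    """
    if not is_html:
        return True  # non-HTML objects cannot contain further links
    if traversal == "Site":
        return site_of(url) != site_of(top_url)
    if traversal == "Tree":
        return not url.startswith(tree_prefix(top_url))
    if traversal == "Owner":
        return url != top_url and owner != name
    raise ValueError(f"unknown traversal type: {traversal}")
```

A real implementation would also need to normalize URLs (case of the hostname, default ports, relative paths) before applying the Tree prefix comparison.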
The maintenance information produced by each task is formatted as an HTML index and output to the file specified in the task instructions (an example of which is provided in Appendix B). The index contains the following maintenance information:
MOMspider looks for four types of document change which may be of interest to the owner:
Each interesting item is placed in the closing cross-reference table and, if the corresponding option is requested, enclosed in a single e-mail message and posted to the owner at the task's completion.
A key design constraint for MOMspider is efficiency, particularly with regard to network bandwidth usage. It would be irresponsible to develop a maintenance robot which overly taxed the limited resources of networks like the Internet. Therefore, MOMspider minimizes the load on network bandwidth by using the HEAD request for testing links, keeping track of nodes that have already been tested, grouping multiple tasks within a single execution, and allowing the user to restrict the traversal of certain URLs.
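A breadth-first traversal that tests each node only once can be sketched as follows. The `get_links` and `is_leaf` callables are hypothetical stand-ins for fetching a document's links and for the boundary test; the sketch performs no network access and is not MOMspider's actual implementation.

```python
from collections import deque


def traverse(top, get_links, is_leaf):
    """Visit nodes breadth-first from top, testing each URL at most once.

    get_links(url) returns the URLs referenced by a document;
    is_leaf(url) returns True when the node must not be traversed further.
    Returns the URLs in the order they were tested.
    """
    tested_order = []
    seen = {top}           # every URL already queued or tested
    queue = deque([top])
    while queue:
        url = queue.popleft()
        tested_order.append(url)
        if is_leaf(url):
            continue       # leaf nodes are tested but never expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return tested_order
```

Sharing the `seen` set across several tasks in one execution is one way the grouping of tasks could avoid repeated requests for the same URL.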
Aside from the restrictions described above regarding the task's traversal type, MOMspider also enables the user to specify any URL prefixes which must always be avoided or leafed. These URL prefixes are listed in the systemwide or user avoid files (an example of which is provided in Appendix C). Each entry in the file includes the action (Avoid or Leaf), the URL prefix on which to apply that action, and an optional expiration date for the entry. This allows the user to completely avoid documents for which maintenance is not a concern or which could trap an unsuspecting spider (some forms of computational hypertext can have that effect).
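The lookup against the avoid list might look like the sketch below. The `AvoidEntry` class and `action_for` function are hypothetical names, and the on-disk syntax of the avoid file (shown in Appendix C) is deliberately not modeled here; only the three fields named above are assumed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AvoidEntry:
    action: str                         # "Avoid" or "Leaf"
    prefix: str                         # URL prefix the action applies to
    expires: Optional[datetime] = None  # None means the entry never expires


def action_for(url, entries, now):
    """Return the action of the first unexpired entry whose prefix matches url, else None."""
    for entry in entries:
        if entry.expires is not None and entry.expires <= now:
            continue  # expired entries are ignored
        if url.startswith(entry.prefix):
            return entry.action
    return None
```

Returning the first match is an assumption; an implementation could equally prefer the longest matching prefix.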
A second design constraint for MOMspider is that it minimize its impact on information providers (destination servers) while at the same time maximizing the indirect benefits they receive from the traversal process. All HTTP requests are similar to:
HEAD /path HTTP/1.0
User-Agent: MOMspider/0.1
From: user@machine.sub.dom.ain
Referer: http://www.site.edu/current/document.html
This allows server maintainers to properly recognize the source of the request and, if necessary, place restrictions upon a particular spider. It also provides them useful information, including how to contact the person running the spider and what document contains the reference being tested.
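As an illustration, a request carrying the same three identifying headers can be built with Python's standard library. This is a sketch only: MOMspider itself does not use this library, and the `build_head_request` name and its parameters are assumptions. The request is constructed but not sent.

```python
from urllib.request import Request


def build_head_request(url, from_addr, referer, user_agent="MOMspider/0.1"):
    """Build (without sending) a HEAD request that identifies the robot and its operator."""
    return Request(
        url,
        method="HEAD",
        headers={
            "User-Agent": user_agent,  # which robot is asking
            "From": from_addr,         # who runs the robot
            "Referer": referer,        # document containing the link under test
        },
    )
```

Passing the result to `urllib.request.urlopen` would issue the request; only the response headers come back, which is what keeps link testing cheap.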
As an additional precaution, MOMspider periodically looks for and obeys any restrictions found in a site's /robots.txt document as per the standard proposed by Martijn Koster [Koster94a].
Before any link is tested, the destination site is looked up in a table of recently accessed sites (the definition of "recently" can be set by the user). If it is not found, that site's /robots.txt document is requested and parsed for restrictions to be placed on MOMspider robots. Any such restrictions are added to the user's avoid list and the site is added to the site table, both with expiration dates indicating when the site must be checked again. Although this opens the possibility of a discrepancy between the restrictions applied and the contents of a recently changed /robots.txt document, it is necessary to avoid a condition where the site checks cause a greater load on the server than would the maintenance requests alone. An example sites file is provided in Appendix D.
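The expiring site table can be sketched with Python's standard robots.txt parser. Several things here are assumptions made for illustration: the `SiteTable` class, the injected `fetch_robots` callable (which stands in for the HTTP request), and the choice to keep the parsed rules in the table itself rather than copying them into the avoid list as MOMspider does.

```python
import time
from urllib.robotparser import RobotFileParser


class SiteTable:
    """Cache of per-site robots.txt rules, each with an expiration time."""

    def __init__(self, fetch_robots, max_age_seconds):
        self.fetch_robots = fetch_robots  # hypothetical callable: site -> robots.txt text
        self.max_age = max_age_seconds    # the user's definition of "recently"
        self.entries = {}                 # site -> (expires_at, parser)

    def allowed(self, site, url, agent="MOMspider", now=None):
        """Return True if agent may request url, refetching rules only when expired."""
        now = time.time() if now is None else now
        entry = self.entries.get(site)
        if entry is None or entry[0] <= now:
            parser = RobotFileParser()
            parser.parse(self.fetch_robots(site).splitlines())
            entry = (now + self.max_age, parser)
            self.entries[site] = entry
        return entry[1].can_fetch(agent, url)
```

Until an entry expires, no further /robots.txt requests are made for that site, which is the trade-off described above: the cached rules may briefly lag behind a changed file, but the checks never outnumber the maintenance requests.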