Roy T. Fielding,
Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web.


6. The Need for Visible Metainformation

MOMspider needs some method for obtaining the owner, modification date, and expiration date of maintained documents. For efficiency reasons, this metainformation must be obtainable from the headers sent by a server in response to a HEAD request on a document [HTTP], so that the document body need not be transferred. Although most HTTP servers currently transmit the modification date, no mechanism exists for authors to define arbitrary metainformation in a form that the server can recognize and include in its response headers.
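As an illustration of what a maintenance tool would do with such headers, the following is a minimal sketch in Python (not MOMspider's own code). The function names `extract_metainfo` and `head_metainfo` are hypothetical, and the `Owner` header is an assumption taken from the META example later in this section; it is not a header that servers are known to send today.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def extract_metainfo(headers):
    """Pick the owner, modification date, and expiration date out of a
    mapping of response headers. Names are matched case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def parse_date(name):
        value = lowered.get(name)
        if value is None:
            return None
        try:
            return parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None  # malformed date: treat it as absent

    return {
        # "Owner" is assumed here; it follows the META example in this section.
        "owner": lowered.get("owner"),
        "last_modified": parse_date("last-modified"),
        "expires": parse_date("expires"),
    }


def head_metainfo(url, timeout=10):
    """Issue a HEAD request and return the metainformation found in the
    response headers. The document body is not transferred."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return extract_metainfo(dict(response.headers.items()))
```

A field that the server does not supply simply comes back as `None`, so a caller can proceed with whatever subset of the metainformation is visible.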

MOMspider needs visible metainformation in order to keep the traversal process within the bounds of an identified infostructure. Without it, the spider must rely on the site and pathname components of the URL. Although this is sufficient for most maintenance tasks, it does not allow distributed infostructures to be encompassed within a single task (and thus within a single generated index).
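The URL-based fallback can be sketched as a simple test on the site and pathname components. This is only an illustration in Python under assumed names: `within_infostructure`, the host `www.example.org`, and the `/project/` prefix are all hypothetical.

```python
from urllib.parse import urlsplit


def within_infostructure(url, site, path_prefix):
    """Return True if `url` is served by `site` and its pathname
    begins with `path_prefix`."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host == site.lower() and path.startswith(path_prefix)


# A document under the prefix on the named site is inside the bounds.
print(within_infostructure(
    "http://www.example.org/project/docs/intro.html",
    "www.example.org", "/project/"))

# A document on a second site is outside the bounds, even if it
# logically belongs to the same infostructure.
print(within_infostructure(
    "http://mirror.example.net/project/docs/intro.html",
    "www.example.org", "/project/"))
```

The second call shows the limitation described above: a test on the URL alone cannot recognize that a document held on another site is part of the same infostructure.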

Given that a means for providing arbitrary metainformation to a server is desirable, there are three possible mechanisms for doing so:

  1. The metainformation is stored external to the document such that it can be retrieved separately in response to a request. The storage may be in the form of server configuration tables or as individual documents which mirror those that are served.
  2. Both the metainformation and the document are wrapped within a container object which identifies and provides that information to the server based upon the request method.
  3. The metainformation is embedded within the document such that it can be identified and parsed by the server when the request is made.

The first and second solutions are more efficient for the server and are applicable to both HTML and non-HTML documents. However, storing the metainformation separately from the document, as in the first solution, introduces the further maintenance problem of keeping the two consistent.

Although the second solution is a much cleaner abstraction, it is unworkable given the nature of most existing HTTP servers. Much of the useful information on the World-Wide Web serves a dual purpose, being both an object to be served remotely and a file that is used locally. Placing an additional encapsulation on the document would reduce its usefulness on filesystems which do not recognize that encapsulation. However, such a solution would be ideal for object-based servers and for filesystems where resource encapsulation is the norm.

The third solution is less efficient for the server (due to the overhead of parsing the document) but is much more flexible and easier for distributed authors to maintain. Furthermore, embedding the metainformation would allow clients to make use of it even when it is not being extracted by the server. Unfortunately, it is not possible to embed such information in binary, compressed, encrypted, or other fixed-format files. However, since MOMspider does not need to obtain the owner information from non-HTML documents, embedding the metainformation will be the preferred solution for now.

For this purpose, the META element has been proposed as an addition to the Hypertext Markup Language [HTML, Raggett94]. Each maintained HTML file would include optional META elements within the HEAD part of the document, such as the following:

   <META http-equiv="Owner"   content="AnyOwnerAlias">
   <META http-equiv="Expires" content="Fri, 01 Apr 1994 00:00:00 GMT">

Unfortunately, this does not solve the problem of getting HTTP servers to provide the parsing necessary to produce the actual headers. It is likely that this will only occur once it becomes clear how useful that information can be. In the meantime, MOMspider has been designed so that it does not depend on that information, yet can make full use of it when it does become available.
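The parsing that a server (or, failing that, a client) would need to perform is modest. The following is a minimal sketch in Python using the standard `html.parser` module; the class and function names are hypothetical, and it assumes that only META elements appearing before the end of the HEAD should contribute headers.

```python
from html.parser import HTMLParser


class MetaHeaderParser(HTMLParser):
    """Collect (name, content) pairs from META http-equiv elements
    found in the HEAD part of an HTML document."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._past_head = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "body":
            self._past_head = True
        elif tag == "meta" and not self._past_head:
            attributes = dict(attrs)
            name = attributes.get("http-equiv")
            content = attributes.get("content")
            if name and content is not None:
                self.headers.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self._past_head = True


def meta_headers(html_text):
    """Return the header fields that the document's META elements request."""
    parser = MetaHeaderParser()
    parser.feed(html_text)
    parser.close()
    return parser.headers


document = """<HTML><HEAD><TITLE>Example</TITLE>
<META http-equiv="Owner"   content="AnyOwnerAlias">
<META http-equiv="Expires" content="Fri, 01 Apr 1994 00:00:00 GMT">
</HEAD><BODY>Text</BODY></HTML>"""

for name, content in meta_headers(document):
    print(f"{name}: {content}")
```

Each collected pair corresponds to one header line that a cooperating server could emit in response to a HEAD request. A client that retrieves the full document can apply the same extraction itself when the server does not.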


Roy Fielding <fielding@ics.uci.edu>
Department of Information and Computer Science
University of California, Irvine, CA 92717-3425
Last modified: Wed Jun 15 10:54:00 1994