Re: Help on getting the Last_modification date

Bob Worthy (bworthy@briseis.worthy.com)
Tue, 29 Feb 2000 08:36:24 -0700 (MST)


Weiguo, yes, I've had the same problem with my robot. I'm sure Randal
Schwartz is right about the missing data. It sounds like you and I
are doing much the same thing.

The problem is actually worse than simply not having the last modified
date available. Many sites lie about the last modified date, in the sense
that the date they report is always the date-time the page was requested
from the site. I went to a scheme of trying to detect page changes by
comparing the size of the page with its size when I last requested it.
After all, when you fetch a page you can count the bytes yourself. But
this often fails too, because many, many pages carry useless dynamic
elements that, for instance, display the current date on the page. Since
this can change the length of the page virtually every day, the page
length is also a useless indicator of whether a page has changed.
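
To make the size check concrete, here is a minimal sketch in Python. It is
only an illustration, not the code my robot runs: the function name
size_changed and the two sample pages are made up for the example.

```python
def size_changed(previous_size: int, body: bytes) -> bool:
    """Report a change when the byte count differs from the last fetch.

    previous_size is the length recorded on the previous visit
    (hypothetical bookkeeping; how it is stored is up to the robot).
    """
    return len(body) != previous_size


# Two fetches of the "same" page. Only the embedded date differs,
# but the date strings have different lengths.
yesterday = b"<p>Today is 29 February 2000</p><p>Same article text.</p>"
today = b"<p>Today is 1 March 2000</p><p>Same article text.</p>"

# The size check flags a change even though the article is untouched.
false_alarm = size_changed(len(yesterday), today)
```

The check is cheap because it needs nothing from the server, but as the
example shows, an embedded date is enough to trigger it.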

On the plus side, it is usually easy to tell when a site is playing
these (to me anyway) stupid games. Those pages I simply treat as
unchanged for some default period. It isn't a good solution, but it keeps
the cheating sites from pushing their pages to the front every time.
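
One way this could be sketched in Python, assuming the "game" shows up as a
Last-Modified header that matches the response's Date header to within a
few seconds. That heuristic, the 5-second tolerance, the 7-day default
period, and both function names are assumptions chosen for illustration.
It also assumes both headers are well-formed HTTP dates in GMT.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

TOLERANCE = timedelta(seconds=5)   # assumed slack between the two headers
DEFAULT_HOLD = timedelta(days=7)   # assumed "unchanged" period


def looks_request_stamped(last_modified: str, date_header: str) -> bool:
    """True if Last-Modified is essentially the time of the request."""
    modified = parsedate_to_datetime(last_modified)
    served = parsedate_to_datetime(date_header)
    return abs(served - modified) <= TOLERANCE


def treat_as_unchanged(first_seen: datetime, now: datetime) -> bool:
    """True while a suspect page is still inside its default hold period."""
    return now - first_seen < DEFAULT_HOLD


# A page whose Last-Modified trails the Date header by one second is suspect.
suspect = looks_request_stamped(
    "Tue, 29 Feb 2000 15:36:24 GMT",
    "Tue, 29 Feb 2000 15:36:25 GMT",
)
```

A robot would call treat_as_unchanged only for pages flagged as suspect,
and trust the header as usual for everything else.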

If anyone has better ideas I'd LOVE to hear them.

I guess this is a bit off-topic, and I apologize for that.


On 29 Feb 2000, Randal L. Schwartz wrote:

> >>>>> "Weiguo" == Weiguo Fan <wfan@umich.edu> writes:
> 
> Weiguo> I am trying to get the last_modified_date for all the
> Weiguo> urls. But I found not every web server supports this header
> Weiguo> information. Is there any way that I can calculate or get the
> Weiguo> modification date?
> 
> If the information is not in the response, it doesn't have a "last
> modification date".  Period.  You can't compute something that doesn't
> exist.  Or rather, any such computation would be a lie and misleading.
> 
> You're probably seeing a bunch of dynamic pages, since dynamic pages
> with SSI tend not to have a last-modified, since the last-modified is
> always "whenever you just asked for it".  Most of the pages on
> www.stonehenge.com are like that, for example.  "last-modified" makes
> sense only for static data.
> 
> If you intend to call "if-modified-since" on refetching those URLs,
> you're likely never to get a "304" response, so not knowing the
> last-modified really doesn't make any difference.
> 
> -- 
> Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
> <merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
> See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
> 

-------------------------------
Bob Worthy
bworthy@worthy.com
406 443 5219  USA Work

Visit our new searchable SAR index at
   http://www.worthy.com/sarsearch/
-------------------------------