Re: LWP::Parallel Features
Marc Langheinrich (marclang@cs.washington.edu)
Sun, 8 Aug 1999 10:15:04 +0900
Charles,
I'm CC'ing this reply to the libwww list since others might have similar
questions.
On Thu, 05 Aug 1999, Charles Michael wrote:
> 1- Is it possible to display to the user which link is currently being
> scanned (so it reports which links are being scanned as they are scanned,
> so users don't have to wait until the end to see results)
You can use the "on_connect", "on_failure" and "on_return" methods to
provide some high-level feedback about which site you're connecting to,
which ones fail, and which ones finish successfully. If you want more
detail about each individual piece of data that comes in, register a
callback with each request; it will be called whenever data arrives.
However, you'll then need to store the data in the HTTP::Response object
yourself. See the file "t/Testscript.pl" for an example (in "sub
handle_answer").
> 2- One minor issue with timeouts. I'm doing a test with 10 ftp links and
> I set the timeout to 5 seconds and setup the max_hosts to 10, so in
> theory it should take about 5 seconds to either scan or timeout. Seems
> the scripts takes around 25-30 seconds to scan. Just wondering if you may
> have any ideas? (note: those sites which did timeout did return timeout
> error codes).
The way timeouts work is that these 5 seconds are the maximum time PUA
(LWP::Parallel::UserAgent) will wait for any data to come back. If you
have, say, 3 requests to 3 sites and a timeout of 5 seconds, it could very
well be that your script runs for an hour, if at least one of the sites
continuously returns at least some data within each 5-second interval
(imagine you're downloading a very big file). Even if the other two sites
never return any data, PUA will continue reading as long as a single site
returns some. Only when the responsive site finally closes the connection
will PUA notice that the remaining two sites aren't sending any data within
a 5-second interval and close them as "timed out".
While it makes sense that you don't cut a connection after the timeout has
elapsed if you are still reading data from it, it might be desirable that
sites that do not send any information for more than <timeout> seconds
would be automatically set to "timed out", even if other connections keep
sending data. However, this would mean that PUA would have to check _all_
open connections after _every_ read to see if any of them had been inactive
for more than <timeout> seconds.
Currently, PUA uses a much simpler mechanism: wait up to <timeout> seconds
for incoming data. If some is available, read it and start waiting again
for <timeout> seconds, and so on. If PUA has waited for <timeout> seconds
and _no_ data is available for reading on _any_ connection, set _all_
current connections to "timed out" and start processing the next set of
requests.
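To make the current behavior concrete, here is a rough simulation in
Python. It is not PUA's actual Perl code: the function name, the event
format and the "all connections open at t = 0" assumption are made up for
illustration. It only models the single shared idle timer described above.

```python
def simulate_global_timeout(events, connections, timeout):
    """Simulate one shared idle timer for all connections.

    events: iterable of (time, connection, kind); kind is "data" or "close".
    Returns {connection: (status, time)}.
    """
    open_conns = set(connections)
    result = {}
    last_activity = 0  # assumption: every connection is opened at t = 0

    for when, conn, kind in sorted(events):
        # Stop once nothing is open, or once nobody sent anything for
        # longer than <timeout> seconds.
        if not open_conns or when - last_activity > timeout:
            break
        if conn not in open_conns:
            continue
        # Any data on any connection restarts the shared timer.
        last_activity = when
        if kind == "close":
            open_conns.discard(conn)
            result[conn] = ("done", when)

    # Everything still open is flagged together, after the final idle wait.
    for conn in open_conns:
        result[conn] = ("timed out", last_activity + timeout)
    return result


# Site A trickles data every 4 seconds and closes at t = 20;
# sites B and C never answer.
events = [(4, "A", "data"), (8, "A", "data"), (12, "A", "data"),
          (16, "A", "data"), (20, "A", "close")]
outcome = simulate_global_timeout(events, ["A", "B", "C"], timeout=5)
for conn in sorted(outcome):
    print(conn, outcome[conn])
```

In this run A finishes at t = 20, and B and C are only marked "timed out"
at t = 25, even though neither of them ever sent a byte.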
I can see that the "check all open connections after every read" approach
would allow PUA to detect unresponsive connections much sooner and close
them right away to make room for new connections, resulting in a faster
overall scan if you have lots of unresponsive sites. However, this would
require a lot more bookkeeping for each request, and a higher overhead
during reading (since we would have to loop continuously over all requests).
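For comparison, the "check all open connections after every read" idea
could look roughly like the following Python simulation. Again, this is
only a sketch under assumptions, not something PUA implements: the function
name and event format are hypothetical, and all connections are assumed to
open at t = 0. The extra bookkeeping is the per-connection "last seen"
table and the loop over it on every read.

```python
def simulate_per_connection_timeout(events, connections, timeout):
    """Simulate a separate idle timer for every connection.

    events: iterable of (time, connection, kind); kind is "data" or "close".
    Returns {connection: (status, time)}.
    """
    # Bookkeeping: when did we last hear from each open connection?
    last_seen = {conn: 0 for conn in connections}
    result = {}

    def expire(now):
        # The extra overhead: walk over all open connections on each read.
        for conn, seen in list(last_seen.items()):
            if now - seen > timeout:
                # Record the deadline; a real loop would notice it only
                # at its next wake-up.
                result[conn] = ("timed out", seen + timeout)
                del last_seen[conn]

    for when, conn, kind in sorted(events):
        expire(when)
        if conn not in last_seen:
            continue  # already closed or timed out
        if kind == "close":
            result[conn] = ("done", when)
            del last_seen[conn]
        else:
            last_seen[conn] = when

    # Whatever is still open after the last event times out on its own clock.
    for conn, seen in last_seen.items():
        result[conn] = ("timed out", seen + timeout)
    return result


# Same scenario: A trickles data until t = 20, B and C stay silent.
events = [(4, "A", "data"), (8, "A", "data"), (12, "A", "data"),
          (16, "A", "data"), (20, "A", "close")]
outcome = simulate_per_connection_timeout(events, ["A", "B", "C"], timeout=5)
for conn in sorted(outcome):
    print(conn, outcome[conn])
```

Here B and C are given up on at t = 5, which would free their slots for
new requests while A is still downloading.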
Do people need this kind of behavior?
-m
--
Marc Langheinrich
marclang@cs.washington.edu