You could use some sort of heuristic: guess an average success rate that accounts for link reliability and server reliability, and use it to estimate how many URLs you need to attempt in order to end up with a target number of successfully crawled pages.
So, assuming an 80% success rate, you need roughly 13 total attempts to expect 10 successes (10 / 0.8 = 12.5, rounded up). Suppose we have already visited 4 of them; then you can stop adding to the queue once it reaches a size of 9. Obviously, if your guess is on the low side, you run out of pages to try before reaching the target.
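For example, here is a minimal sketch of that estimate in Python (the function name and parameters are made up for illustration):

```python
import math

def queue_budget(target_successes, visited_so_far, success_rate=0.8):
    """Estimate how many more URLs to allow into the queue before we
    stop enqueuing new links.

    success_rate is a guess at the fraction of queued URLs that will
    crawl successfully (links that resolve, servers that respond).
    """
    # Total attempts needed so that target_successes of them are expected to succeed.
    total_needed = math.ceil(target_successes / success_rate)
    # Subtract the budget already spent on pages we visited.
    return max(total_needed - visited_so_far, 0)

# With an 80% success rate, 10 desired pages, and 4 already visited,
# this gives 13 - 4 = 9 more URLs to queue.
print(queue_budget(10, 4))  # 9
```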
In practice, it takes far less memory to store a list of URLs than it does to store, say, the index, so this is not really a problem for us unless we try to crawl a HUGE number of pages.
In that case, you could use a class that stores and retrieves its contents from a file on the hard drive, or from a structured database running locally or on another machine.
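As a rough illustration of that idea, here is a sketch of a queue class backed by a local SQLite file; all names are hypothetical, and a real crawler would also want deduplication and error handling:

```python
import sqlite3

class DiskBackedQueue:
    """A FIFO queue of URLs kept in a SQLite file on disk instead of in
    memory; the same idea extends to a database on another machine."""

    def __init__(self, path="crawl_queue.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT)"
        )
        self.db.commit()

    def push(self, url):
        # Append a URL to the end of the queue.
        self.db.execute("INSERT INTO queue (url) VALUES (?)", (url,))
        self.db.commit()

    def pop(self):
        # Remove and return the oldest URL, or None if the queue is empty.
        row = self.db.execute(
            "SELECT id, url FROM queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]

# Usage:
# queue = DiskBackedQueue()
# queue.push("http://example.com")
# next_url = queue.pop()
```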