Current Forum: Homework 5 - Part 3
Date: Sat Nov 17 2001 1:59 am
Author: Bortz, Andrew S. <abortz@andrew.cmu.edu>
Subject: Re: Saving and Restoring from disk takes very long

I get exactly 229,353 bytes for my index for a 100-page crawl starting at http://www.cmu.edu. That does not include a graph structure, although one wouldn't take up much space. It does get the full 100 pages (as in, not ending early). I personally don't see 229K as being particularly small, either; in fact, there are yet more ways I could optimize the data output.
5MB (or even 2MB) _does_ seem to me to be a bit excessive when you figure the _raw_ data being read in is 359,194 bytes on this particular crawl: 5MB is roughly fourteen times that, and even 2MB is more than five times it. We throw out a lot of the raw HTML, so a data structure that takes that much more space than the original pages seems grossly inefficient.
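
For what it's worth, here is a minimal sketch of the kind of flat binary layout that keeps an index in this size range. It is not my actual format, and it assumes the simplest possible index (a term -> document-ID postings map); the class and method names are invented for the example. Written out directly with DataOutputStream, each term costs a 2-byte length plus its UTF bytes, and each posting costs 4 bytes, so the file size tracks the vocabulary rather than blowing up to megabytes.

import java.io.*;
import java.util.*;

// Sketch only: assumes the index is just a term -> list-of-doc-IDs map;
// the names IndexWriter, save, and postings are made up for illustration.
public class IndexWriter {
    public static void save(Map<String, List<Integer>> postings, File out)
            throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(out)))) {
            dos.writeInt(postings.size());        // number of distinct terms
            for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
                dos.writeUTF(e.getKey());         // the term (2-byte length + UTF bytes)
                List<Integer> docs = e.getValue();
                dos.writeInt(docs.size());        // length of this postings list
                for (int docId : docs) {
                    dos.writeInt(docId);          // 4 bytes per document ID
                }
            }
        }
    }
}

Restoring is just the mirror image with DataInputStream, reading the same fields back in the same order, which is also quick.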