If context has a trailing "/", the original code works. The problems only occur when that "/" is missing. To get around this, I basically used your method, except that I also check that String h does not begin with a "/". If context has no trailing "/" and h has no leading "/", I concatenate context + "/" + h. Otherwise, I use the code that was originally there.
Without the check on h, you could get output like:
http://www.cs.cmu.edu/~petel//someotherpage.html.
Also (this I'm not as sure of), let's say a link like this is on the page:
<a href="/index.html">, which should point to http://www.cs.cmu.edu/index.html. Without the check, we would instead spider the wrong URL, http://www.cs.cmu.edu/~petel//index.html.
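Roughly, the check described above looks like the sketch below. This is not the actual patch: the class and method names (LinkJoiner, join) are made up for illustration, and since the original code isn't shown here, java.net.URI.resolve stands in for "the code that was originally there" as an assumption.

```java
import java.net.URI;

class LinkJoiner {
    // Hypothetical helper; "context" and "h" follow the names used in the post.
    static String join(String context, String h) {
        boolean contextHasTrailingSlash = context.endsWith("/");
        boolean hHasLeadingSlash = h.startsWith("/");

        // Only when BOTH slashes are missing do we add one ourselves.
        if (!contextHasTrailingSlash && !hHasLeadingSlash) {
            return context + "/" + h;
        }

        // Assumption: stand-in for the original code path, which is not shown.
        // URI.resolve handles a root-relative h such as "/index.html".
        return URI.create(context).resolve(h).toString();
    }

    public static void main(String[] args) {
        // Missing trailing "/" on context, no leading "/" on h: concatenate.
        System.out.println(join("http://www.cs.cmu.edu/~petel", "someotherpage.html"));
        // h starts with "/": fall through to the stand-in for the original code.
        System.out.println(join("http://www.cs.cmu.edu/~petel", "/index.html"));
    }
}
```

With these inputs the first call produces a single "/" between ~petel and the page name, and the second resolves against the host root instead of ending up under ~petel.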
This doesn't take into account the situation when the page has a <base href> tag set, though. For that, I think you'd have to modify the state table?
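For the <base href> case, once the parser has the value, picking the context could look something like this. It only sketches which URL to resolve links against; it says nothing about how the state table would need to change to extract the tag. BaseHref and effectiveContext are hypothetical names, and baseHref is assumed to be null when the page has no such tag.

```java
import java.net.URI;

class BaseHref {
    // Hypothetical helper: returns the URL that links on the page
    // should be resolved against.
    static String effectiveContext(String pageUrl, String baseHref) {
        // No <base href> on the page: keep using the page's own URL.
        if (baseHref == null || baseHref.isEmpty()) {
            return pageUrl;
        }
        // The base href may itself be relative, so resolve it against
        // the page URL first; an absolute base href is returned unchanged.
        return URI.create(pageUrl).resolve(baseHref).toString();
    }
}
```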
BTW, thanks for posting this. I never would've noticed this problem if I hadn't seen this post.