When I ran WebReader on http://web.mit.edu/academics.html, it output these bad links (I'm using a similar method to Brian's, i.e., testing each URL against HttpTokenizer):
http://web.mit.edu/academics.html/academics.html
http://web.mit.edu/academics.html/news.html
http://web.mit.edu/academics.html/research.html
...

These are supposed to be:

http://web.mit.edu/academics.html
http://web.mit.edu/news.html
http://web.mit.edu/research.html
...
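The pattern in the bad links (the relative name tacked onto the end of the full page URL) looks like the links are being built by string concatenation, base + "/" + link, instead of being resolved against the page's URL. I haven't looked at WebReader's link-building code, so this is just a sketch of the fix I'd try, using the two-argument java.net.URL constructor, which applies the standard relative-resolution rules (the resolve method and test URLs here are my own, not part of WebReader):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolver {
    // Resolve a link found on a page against the page's own URL.
    // Naive concatenation (pageUrl + "/" + link) produces URLs like
    // http://web.mit.edu/academics.html/news.html; the two-argument
    // URL constructor instead replaces the last path segment, as a
    // browser would.
    public static String resolve(String pageUrl, String link)
            throws MalformedURLException {
        return new URL(new URL(pageUrl), link).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // -> http://web.mit.edu/news.html
        System.out.println(
            resolve("http://web.mit.edu/academics.html", "news.html"));
    }
}
```

If that is indeed the bug, swapping the concatenation for something like resolve() should also handle absolute links and "../" paths for free.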
HttpTokenizer still parses those bad links without complaint. For example, running it on http://web.mit.edu/academics.html/academics.html (which actually returns a 404 page, as the tokens show) gives:

word: 404  word: not  word: found  word: not  word: found  word: the
word: requested  word: url  word: academics  word: html  word: academics
word: html  word: was  word: not  word: found  word: on  word: this
word: server  word: mit  word: web  word: server  word: apache  word: ssl
word: 1  num: 3.6  word: mark  word: 1  num: 3.0  word: server  word: at
word: web  word: mit  word: edu  word: port  num: 80.0
35 total page elements retrieved.
Can someone else test these pages and confirm whether HttpTokenizer is the cause? Any ideas on how to get rid of these bad links?