When I ran WebReader on http://web.mit.edu/academics.html, it output these bad links (I'm using a similar method to Brian's, i.e., testing each URL against HttpTokenizer):
http://web.mit.edu/academics.html/academics.html
http://web.mit.edu/academics.html/news.html
http://web.mit.edu/academics.html/research.html
...

These are supposed to be:

http://web.mit.edu/academics.html
http://web.mit.edu/news.html
http://web.mit.edu/research.html
...
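The pattern in the bad links (the relative name tacked onto the end of the full page URL) looks like the links are being built by string concatenation, base + "/" + link, instead of being resolved against the page's URL. I haven't looked at WebReader's link-building code, so this is just a sketch of the fix I'd try, using the two-argument java.net.URL constructor, which applies the standard relative-resolution rules (the resolve method and test URLs here are my own, not part of WebReader):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolver {
    // Resolve a link found on a page against the page's own URL.
    // Naive concatenation (pageUrl + "/" + link) produces URLs like
    // http://web.mit.edu/academics.html/news.html; the two-argument
    // URL constructor instead replaces the last path segment, as a
    // browser would.
    public static String resolve(String pageUrl, String link)
            throws MalformedURLException {
        return new URL(new URL(pageUrl), link).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // -> http://web.mit.edu/news.html
        System.out.println(
            resolve("http://web.mit.edu/academics.html", "news.html"));
    }
}
```

If that is indeed the bug, swapping the concatenation for something like resolve() should also handle absolute links and "../" paths for free.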
HttpTokenizer still parses those bad links without complaint. For example, running it on http://web.mit.edu/academics.html/academics.html (which actually returns a 404 page, as the tokens show) gives:

word: 404  word: not  word: found  word: not  word: found  word: the
word: requested  word: url  word: academics  word: html  word: academics
word: html  word: was  word: not  word: found  word: on  word: this
word: server  word: mit  word: web  word: server  word: apache  word: ssl
word: 1  num: 3.6  word: mark  word: 1  num: 3.0  word: server  word: at
word: web  word: mit  word: edu  word: port  num: 80.0
35 total page elements retrieved.
Can someone else test these pages and confirm whether HttpTokenizer is the cause? Any ideas on how to get rid of these bad links?