If I have a page that links to google.com, but I spell it http://www.GoOglE.coM in the link, I will still get to google.com. In fact, my ugly spelling gets replaced with the "real" spelling of http://www.google.com in the address bar, because host names are not case sensitive. However, since our engine will put a cookie at http://www.GoOglE.coM, when I later see a link for http://www.google.com I will visit the page again. The solution is not as simple as making every URL lowercase: after the domain, where I am actually addressing the file system of the server, the pages and directory structure are case sensitive in most cases, depending on what server they are running. The problem goes deeper as well. If I am computing the indegree and some site references Google as http://www.GoOglE.coM, that link will not be counted toward the indegree of http://www.google.com, even though it is the same page.
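A minimal sketch of the partial fix hinted at above, assuming the engine can rewrite each URL before it places a cookie or counts a link: lowercase only the scheme and host, and leave the path alone. The names `normalize_url`, `links`, and `indegree` are hypothetical and not part of our engine. This only handles the case problem; it does not decide whether http://google.com and http://www.google.com are the same site.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase the scheme and host; keep path, query and fragment as-is."""
    parts = urlsplit(url)
    # Split off any "user:password@" prefix, which is case sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


# Hypothetical links found while crawling (illustrative data only).
links = [
    "http://www.GoOglE.coM",
    "http://www.google.com",
    "http://www.google.com/Docs/Index.html",
]

# Count indegree per normalized URL instead of per raw spelling.
indegree = Counter(normalize_url(link) for link in links)
```

With this, both spellings of the home page are counted under one key, while `/Docs/Index.html` keeps its capital letters because the server may treat them as significant.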
My question is whether it is acceptable to assume that everyone uses the same address to access the same HTML documents and servers. (That would allow http://www.GoOglE.com to be "different" from http://www.google.com and from http://google.com.)