August 29, 2004
I talked in a previous post about Unbreakable Links -- that is, stating every URL in terms of a Google search rather than an absolute address. Great concept, but how do you determine which words on a web page are most likely to generate a unique search result? Well, wonder no more:
Behold the Incredible LinkTron5000 (tm)!
As you might imagine, this involves quite a bit of Google abuse -- all of which is pre-cached for performance. Well, mostly pre-cached. If you have a page with a lot of words that I can't find in a dictionary, the LinkTron will take a little while to process it.
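The core idea can be sketched quickly: rank the words on a page by how common they are, keep the rarest few, and turn those into a Google query. This is only a minimal illustration under my own assumptions -- the tiny rank table and the query format are made up, not the LinkTron's actual data or algorithm:

```python
from urllib.parse import quote_plus

# Hypothetical frequency ranks: lower rank = more common word.
# A real implementation would load a large pre-cached frequency list.
COMMON_RANKS = {"the": 1, "of": 2, "and": 3, "linktron": 50000, "blogoscoped": 60000}

def unbreakable_link(words, n=4):
    """Build a Google search URL from the n rarest words on the page."""
    # Words missing from the frequency list are treated as maximally rare.
    rarest = sorted(set(words), key=lambda w: -COMMON_RANKS.get(w, 10**6))[:n]
    return "https://www.google.com/search?q=" + quote_plus(" ".join(rarest))
```

As long as the page keeps those rare words, the search keeps resolving to it even if the page moves.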
When researching this project, I found an invaluable source of information at Philipp Lenssen's Google Blogoscoped. For instance, this frequency distribution for the 26,000 most used words online. There's also a cool word frequency colorizer which visually depicts the "uniqueness" of a target URL.
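A frequency list like that 26,000-word distribution makes a simple "uniqueness" score possible, much like the colorizer visualizes: the further down the list a word falls, the more likely it is to pin down a unique result. A rough sketch (the scoring scale is my own, not the colorizer's):

```python
def uniqueness(word, ranks, list_size=26000):
    """Score a word from 0.0 (most common) to 1.0 (not in the top-26,000 list at all)."""
    rank = ranks.get(word.lower())
    if rank is None:
        # Off-list words are the best anchors for a unique search.
        return 1.0
    return rank / list_size
```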
Posted by Jeff Atwood
- LinkTron wasn't prefixing http:// in front of URLs entered into the textbox. Now it is.
- temporarily disable gzip page retrieval (bug)
- fixed and re-enabled GZIP support
- added two additional options
- changed default to dictionary only (faster)
- warn non-English readers
- better caching
- gracefully handle HTTP errors (timeout, dns resolution, etc)
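The gzip and error-handling items above might look something like this in outline -- a sketch only, assuming Python's standard `urllib`, with illustrative timeout values; the actual LinkTron was not written this way:

```python
import gzip
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=10):
    """Fetch a page, accepting gzip, returning (text, error) instead of raising."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            data = resp.read()
            if resp.headers.get("Content-Encoding") == "gzip":
                data = gzip.decompress(data)
            return data.decode("utf-8", errors="replace"), None
    except (urllib.error.URLError, socket.timeout) as err:
        # DNS resolution failures, timeouts, and HTTP errors all land here,
        # so the caller can degrade gracefully instead of crashing.
        return None, str(err)
```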
I was waiting for this so badly... but now I am kind of disappointed, because it does not seem to work right. I tried two simple random URLs that came to mind, and it didn't link to the right page.
The ones I tried:
- faster rejection of words less than 5 chars
- fixed small alternate dictionary bug
- updated links to current blog entries
- reject pages with less than 20 words of plaintext
- found a much more sophisticated HTML regex replacement (http://concepts.waetech.com/unclosed_tags/) that can deal with HTML tags that include "" as an element.
- force the use of non-dictionary words if the # of unique words on the page is less than 60.
- now stores plaintext as a continuous space-delimited string (for Markov chain generation, eventually)
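A space-delimited plaintext string is exactly the input a word-level Markov chain wants. A minimal sketch of what that "eventually" might look like -- purely illustrative, not the LinkTron's code:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=8, seed=0):
    """Walk the chain from a starting word to produce pseudo-random text."""
    random.seed(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)
```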
You are right. It works fine if you don't use it for sites that change their content a lot.
I think it would be really cool to have a LinkTron web service, so everybody could start using unbreakable links on their web sites by parsing dynamic content for anchors and replacing them with unbreakable anchors!
- implemented phrase counting*
- show processing time in milliseconds
- allow url= querystring param
* I am unclear how to use the data generated from the phrase frequency count... suggestions? This is much, much more complicated than a word frequency count.
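Phrase counting presumably means counting n-word sequences (n-grams) rather than single words; that is my reading, not a statement of how the LinkTron does it. A bigram-flavored sketch:

```python
from collections import Counter

def phrase_counts(text, n=2):
    """Count every n-word phrase in a space-delimited plaintext string."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

A rare multi-word phrase is an even stronger search anchor than a rare word, which may be one answer to the "how do I use this data" question.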
- deny non-English domain suffixes *
- fix Int32 overflow due to Google doubling index size
- fix Deflate bug
- incorporate latest shared libraries
- flush alternate (user website generated) dictionary
* sorry, results will always suck due to the use of an English dictionary... and there was too much abuse.
Ironic: 404 error on the linktron5k link