I talked in a previous post about Unbreakable Links: that is, stating every URL in terms of a Google search rather than an absolute address. Great concept, but how do you determine which words on a web page are most likely to generate a unique search result? Well, wonder no more:
Behold the Incredible LinkTron5000 (tm)!
As you might imagine, this involves quite a bit of Google abuse, all of which is pre-cached for performance. Well, mostly pre-cached. If you have a page with a lot of words that I can't find in a dictionary, the LinkTron will take a little while to process it.
When researching this project, I found an invaluable source of information at Philipp Lenssen's Google Blogoscoped. For instance, there's this frequency distribution for the 26,000 most used words online. There's also a cool word frequency colorizer which visually depicts the "uniqueness" of a target URL.
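To make the idea concrete, here is a minimal Python sketch of the general approach, not the LinkTron's actual code. It assumes a word frequency table is already available (the `WORD_COUNTS` dictionary and its numbers are made up for illustration), picks the rarest dictionary words on a page, and turns them into a Google "I'm Feeling Lucky" URL using the `btnI=1` parameter. The function names, the five-word limit, and the five-character minimum are assumptions chosen to match the examples and changelog in the comments.

```python
import re
from urllib.parse import urlencode

# Hypothetical frequency table: word -> how often it appears online.
# The numbers are invented purely for illustration.
WORD_COUNTS = {
    "hobbyists": 100,
    "novices": 200,
    "lightweight": 300,
    "enjoy": 5000,
    "tools": 9000,
}


def pick_rare_words(text, counts, how_many=5, min_length=5):
    """Return the rarest known words in `text`, rarest first."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Keep only words that are long enough and present in the table.
    candidates = [w for w in words if len(w) >= min_length and w in counts]
    # Sort by frequency; break ties alphabetically so results are stable.
    candidates.sort(key=lambda w: (counts[w], w))
    return candidates[:how_many]


def unbreakable_link(words):
    """Build an "I'm Feeling Lucky" search URL from the chosen words."""
    query = urlencode({"q": " ".join(words), "btnI": "1"})
    return "http://www.google.com/search?" + query


if __name__ == "__main__":
    page = "Hobbyists and novices enjoy lightweight tools"
    print(unbreakable_link(pick_rare_words(page, WORD_COUNTS, how_many=3)))
```

Words missing from the table are simply skipped here; a real tool would have to look those up somewhere, which is presumably where the slow, uncached path comes from.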
Posted by Jeff Atwood
- LinkTron wasn't prefixing http:// in front of URLs entered into the textbox. Now it is.
Jeff Atwood on August 30, 2004 09:28 AM
- temporarily disable gzip page retrieval (bug)
Jeff Atwood on August 30, 2004 09:37 AM
- fixed and re-enabled GZIP support
- added two additional options
- changed default to dictionary only (faster)
- warn non-english readers
- better caching
- gracefully handle HTTP errors (timeout, DNS resolution, etc)
I was waiting for this sooo bad...but now I am kind of disappointed, because it does not seem to work right. I tried two simple random URLs that came to my mind and it didn't link to the right page.
The ones I tried:
msdn.microsoft.com/express
http://www.franklins.net/dotnetrocks
Well, you linked to the front page of a dynamic website ( dotnetrocks ), which is going to change. For example, if you linked to the front page of this blog, two weeks from now the content would be totally different! Not a good idea.
Try generating keywords from PERMALINKS (e.g. individual articles) rather than front pages. I think you will be very happy with the result.
Also the 2nd one works fine ( http://msdn.microsoft.com/express )
I get..
http://www.google.com/search?q=hobbyists+novices+complements+lightweight+enthusiasts&btnI=1
Which brings me directly to that page!
Jeff Atwood on August 31, 2004 05:22 PM
- faster rejection of words less than 5 chars
- fixed small alternate dictionary bug
- updated links to current blog entries
- reject pages with less than 20 words of plaintext
- found a much more sophisticated HTML regex replacement ( http://concepts.waetech.com/unclosed_tags/ ) that can deal with HTML tags that include ">" as an element.
- force the use of non-dictionary words if the # of unique words on page is less than 60.
- now stores plaintext as continuous space-delimited string (for Markov chain generation, eventually)
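The filtering rules in this changelog can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the LinkTron's implementation: the tag-stripping regex is a simple stand-in for the one linked above (it only skips over quoted attribute values that contain ">"), and the function names and the `dictionary` argument are hypothetical.

```python
import re

# Matches an HTML tag, skipping over quoted attribute values so that a ">"
# inside quotes does not end the tag early. A stand-in, not the linked regex.
TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")


def to_plaintext(html):
    """Strip tags and collapse the rest into one space-delimited string."""
    return " ".join(TAG.sub(" ", html).split())


def candidate_words(plaintext, dictionary, min_page_words=20,
                    min_unique=60, min_length=5):
    """Apply the changelog's thresholds and return the usable words."""
    words = re.findall(r"[a-z]+", plaintext.lower())
    if len(words) < min_page_words:
        raise ValueError("page has too little plaintext")
    unique = set(words)
    # Short words are rejected outright.
    long_enough = {w for w in unique if len(w) >= min_length}
    if len(unique) < min_unique:
        # Too few unique words: allow non-dictionary words as well.
        return long_enough
    return {w for w in long_enough if w in dictionary}
```

For example, `to_plaintext('<p>Hello <a title="a > b">world</a></p>')` yields `"Hello world"`, where a naive `<[^>]*>` pattern would leave part of the attribute behind.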
Jeff Atwood on September 1, 2004 09:41 PM
Hi Jeff,
you are right. It works fine, if you don't use it for sites that change their content a lot.
I think it would be really cool to have a LinkTron web service, so everybody can start using unbreakable links on their web sites by parsing dynamic content for anchors and replacing them with unbreakable anchors!!!
Hermann Klinke on September 3, 2004 12:25 PM
- implemented phrase counting*
- show processing time in milliseconds
- allow url= querystring param
* I am unclear how to use the data generated from the phrase frequency count. Suggestions? This is much, much more complicated than a word frequency count.
- deny non-English domain suffixes *
- fix Int32 overflow due to Google doubling index size
- fix Deflate bug
- incorporate latest shared libraries
- flush alternate (user website generated) dictionary
* sorry, results will always suck due to use of English dictionary, and there was too much abuse.
Jeff Atwood on January 11, 2005 01:52 AM
http://www.google.com/search?q=gotchas+unbreakable+irreversible+kidney+generates&btnI=1
This is what I got for http://www.codinghorror.com/linktron5k/Default.aspx
The LinkTron is kind of funny
"gotchas unbreakable irreversible kidney generates"
Is this the newest superhero?
Gotcha, The Irreversible, Unbreakable, Kidney generator!
| Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |