I've become a big fan of Twitter. My philosophy is, when in doubt, make it public, and Twitter is essentially public instant messaging. This suits me fine. Well, when Twitter is actually up and running, at least. Its bouts of frequent downtime are legendary, even today.
(I was going to put a screenshot of one of my favorite Twitter messages here, but as I write this Twitter is down. Again. No, I'm not kidding. OK, it's back up.)
One of the design constraints of Twitter is that every message is limited to 140 characters. You quickly learn to embrace and live within those constraints, but if you like to post URLs in your Twitter messages like I do, those 140 characters become very dear. That's probably why Twitter automatically converts any URLs over about 30 characters to short URLs using the TinyUrl service.
For instance, let's say I wanted to make a shortened URL version of Steve McConnell's blog.
http://forums.construx.com/blogs/stevemcc/default.aspx
It's not a particularly long URL, but every character matters when it comes to Twitter. I found a list of common URL shortening services, so let's see how they compare:
Looks like the best we can do is 3 characters to represent the URL, along with a mandatory 16 characters for the protocol, domain name (everyone drops the leading "www"), and slashes. That's a total of 19 characters, a nice improvement over the 54 characters that make up the original URL. But using an URL shortening and redirection service isn't without pitfalls of its own.
Despite all the potential problems, URL shortening services are still useful in the right circumstances. For example, sending out very long hyperlinks in email is always risky; you never know when the email clients will insert line breaks in the links and render them unclickable. Not to mention mobile devices, where space is always at a premium.
I often wonder why Google doesn't offer an URL redirection service, as they already keep an index of every URL in the world. The idea of Google disappearing tomorrow, or having availability problems, is far less likely than the seemingly random people and companies who operate these URL redirection services-- often for no visible income.
But what really struck me about these services is how they're a perfect embodiment of a classical computer science concept-- the hash table:
In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that is used to index into an array to locate the desired location ("bucket") where the values should be.
It doesn't get more fundamental than the keys and values of our beloved hash tables. But some of the services use an absurdly small number of characters as keys-- 1YU, 808, cuy -- to represent the entire Steve McConnell blog URL. Thinking about how they did that leads you to some interesting solutions. For instance, let's compare the result of applying traditional hash functions to Steve's blog URL:
| Adler32 | 399014e3 |
| CRC32 | 78aa9d1a |
| MD2 | 286c50c2db4fcad77adb4edeb3a937b2 |
| MD4 | 387ac3f6aae7956c4fab176271bb4518 |
| MD5 | f061a171dfc30635462850684f98b886 |
| SHA-1 | 3c93b6d332091b2970fb660d644d0ba3d756e322 |
Even the shortest hash function, the 32-bit CRC, is a bit too long for this usage. That's 4 bytes which will be at least five ASCII characters. To get a shorter URL, you'd have to switch to a 16-bit CRC. If you're clever about how you turn those 16 bits into printable characters, you just might be able to fit those 2 bytes into three ASCII characters.
But is a 16 bit hash enough to represent every URL in the universe? Rich Skrenta helps us out with a little hash math:
Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2n/2 items, or 2^64 for MD5.
2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:
264 18,446,744,073,709,551,616 237 100,000,000,000
For a 16-bit hash, our 2n/2 is a whopping 256; for a 32-bit hash it'd be 65,536. It's pretty clear that URL shortening services can't rely on traditional hashing techniques, at least not if they want to produce competitively small URLs.
My guess is the aggressive URL shortening services are doing a simple iteration across every permutation of the available characters they have as the URLs come in. Each new URL gets a unique three character combination until no more are left. How many URLs would that take? Let's say each character is simple alphanumeric, case sensitive A-Z, a-z, 0-9. You can do somewhat better because more ASCII characters than that are valid in URLs, but let's stick with this for the sake of argument. That's 26 + 26 + 10 or 62 possibilities per character. So with a three character URL, we can represent...
62 * 62 * 62 = 238,328
... about 250,000 unique three-character short URLs. Beyond that, they'd be forced to move to four character representations. Assuming, of course, that the old URLs never expire.
| [advertisement] Axosoft's OnTime 2007 allows software development teams to collaborate and ship software on time. It manages projects hierarchically, tracking defects, requirements, tasks, and help desk incidents in one place. Hosted or installed. Windows or Web. Free SDK and Free single-user license. |
Posted by Jeff Atwood View blog reactions
« How Not To Write a Technical Book, Epilogue Programming Games, Analyzing Games »
Quick note - love your site and your techclectic topics...
on this one, though, I think you drove off into the weeds:
"Some of the services use an absurdly small number of characters in their hash keys-- 1YU, 808, cuy-- to represent the entire Steve McConnell blog URL."
The string to the right of the URL, for instance the "1YU" part, is just a simple rolling string index; the next hit on that site might return 1YV, then 1YW, 1YX.
One good way to do this would be to keep two tables, one with the rolling string index and the sequence of the request, and a second table to hold the source URL. The second table wouldn't need an search index at all.
Table 1
Key Request
--- -----
1YV 1313
1YW 1314
1YX 1315
Table 2
Record URL
---- ------------------------------
1313 http://www.codinghorror.com/blog/archives/etcetcetc
1314 http://www.cnn.com/etcetcetc
1315 http://www.thedailywtf.com/stories/etcetcetc
The hash discussion was interesting on it's own, though.
tinywords on August 22, 2007 12:08 PMI think you are right. In theory it should pick any key below the most recent one, and it seems to work on the few tests that I have made.
Try these randomly chosen URLs:
http://qurl.net/A
http://tinyurl.com/y
This is also a fun method of testing how popular they are (unless they increase the index to seem more important). TinyURL has been used a lot it seems
Cool page: http://tinyurl.com/404 There is a box saying "The storage meter is history."
So basically it is "404 - storage meter not found"
+1 to tinywords
You're way overcomplicating things. The URL redirectors with short keys -- and by the way you are missing the Carl Franklin favorite shrinkster -- are just primary keys to a database table.
PWills on August 22, 2007 12:40 PMWhy would you even need 2 tables? It seems to me that one would work just fine - and retrieval would be faster because joins wouldn't be required.
I think your way of approaching it though is probably how they would do it - the hash seems like a bit of overkill for a simple problem.
James on August 22, 2007 12:45 PMAlso, Base64 is commonly used (adding '/' and '+' to your 62 alphanumerics) to shorten text strings. You get a pagefile with 64 glyphs, which is nice because 64 is a power of 2. Note that a GUID encoded in Base64 is 22 characters long instead of the usual 32-28 characters.
With Base64, you need five (case-sensitive!) glyphs to key one billion elements.
It's also interesting to note that only one of the services listed (qurl) is case-sensitive. The rest redirect no matter how you capitalize the letters. This is IMO a Good Thing, as casual users shouldn't be expected to differentiate between capitalization in URIs.
PWills on August 22, 2007 12:52 PMIndeed. After those three are used up, it will bump up to a fourth letter. Given the 26+26+10 that it appears they are using, the method isn't so complicated after all.
Kyle on August 22, 2007 01:07 PM> The string to the right of the URL, for instance the "1YU" part, is just a simple rolling string index; the next hit on that site might return 1YV, then 1YW, 1YX.
It helps if you read the entire post, because that's exactly what I described in the last two paragraphs. :)
> +1 to tinywords
Actually I count it as -1; I subtract one point for using two tables when one would suffice, and I subtract another point for not reading my post.
Oof. Browsing this list, you can see why these sort of redirection services are quickly banned at a lot of sites. Unscrupulous people create a lot of spammy redirects.. :(
> Note that a GUID encoded in Base64 is 22 characters long instead of the usual 32-28 characters.
And in ASCII85, it's only 20 characters..
http://www.codinghorror.com/blog/archives/000409.html
Jeff Atwood on August 22, 2007 01:20 PMYou're not considering that they might huffman compress their urls. If they did an analysis on the most common characters in urls, they could generate a table to compress any new url to tiny-ify. Then they could base64 the resulting string to probably get pretty tiny urls.
George Santos on August 22, 2007 01:29 PMSomething strange happens on my work station.
When I click on
http://tinyurl.com/404
My email from Yahoo opens !!!
I don't take any drug.
Luc Martineau on August 22, 2007 01:29 PMYour favorite saying is incorrect. "Taken with a grain of salt" means there's not much meat there.. so little that you'll only need a single grain of salt to season it.
So your favorite saying is actually saying the opposite of what you thought it said.
Sean on August 22, 2007 01:33 PMAlthough not directly stated even in this article's last two paragraphs, a more interesting aspect of tinyurl's URL cache is its exceptionally fast speed in reverse look-up of URL's previously cached.
Yes, tinyurl stores new URLs sequentially in a "simple rolling string index". This technique alone would be simple & always fast since it only needs to referense/deference an index but would end up producing millions or billions of duplicate URL entries.
However, tinyurl also performs reverse-lookup & returns the original index for any URL previously cached. And it does this very, very quickly. Therefore, although tinyurl adds new strings sequentially, it also avoids duplicates by returning previously cached URL links.
I just re-tested this using tinyurl links from emails 5-6 years old. These links immediately resolve to their original tinyurl link without creating a new tinyurl link.
Ian Johns on August 22, 2007 01:36 PMOne URL shortening service I like is http://urltea.com
They support appending the short url with a descriptor.
For example:
http://urltea.com/1agx?404-pages
Gordon Fischer on August 22, 2007 01:39 PM> Therefore, although tinyurl adds new strings sequentially, it also avoids duplicates by returning previously cached URL links.
I didn't mention this because I thought it was fairly obvious. I assumed every URL shortening services does duplication checking, as the power law of popularity still applies to URLS..
Jeff Atwood on August 22, 2007 01:44 PMSo what you are talking about here is a second-level DNS lookup table where not just the domain name is converted to a tight integer field (a domain name of N characters is reduced to 4 bytes) but also the rest of the URL.
And while you don't worry about the birthday paradox, that formula calculates the number of instances you need in order to have probability = 0.5 (or 50% chance). Try it again with the probability equal to 0.0001 (1/100th of a percent), about the error level that I would be happy with when requesting page A and actually getting page B. It takes a lot less URLs. MD5 makes a great hash when looking up sessions on a computer or verifying passwords, but logging every URL in the universe? Yikes!
And tinyurl has a nice function that lets you give out a "preview" url with a few more characters that shows the person where they are going before they get there.
Of course, it is only a matter of time before these services are infested with popup ads, doubleclick cookies and google ads.
It's a symmetric key, not a hash.
Hashes are meant to be one-way encryption that, in all common encryptions (MD5, SHA-1, etc.), create a hash of a uniform length irrelavent to the length of the input.
qURL (at least) gives you the key to decrypt, though their cipher, and get the original URL. In this case, the length of the key varies with the length of the input text.
http://qurl.net/dH -> http://google.com/
jLl on August 22, 2007 01:58 PM[correction]
It SEEMS more like a symmetric key then a hash.
The other problem with the really short ones, of course, is that as they get more popular, the links get longer!
It would be foolish to have links expire; it'd make things far too easy for spammers.
Robert Synnott on August 22, 2007 02:19 PMIf I were to implement one of these things I wouldn't even bother with a "hash". I'd just have a single database table with an integer auto increment primary key and the corresponding url. When you tinyify a URL, you just scan the table for that url, grab the ID. If it isn't there, insert it.
Then just take the integer ID and convert it into "base 38" with 0-9, a-z, _, and +, maybe more. Incoming URLs are converted back to the int, looked up and forwarded to the corresponding URL.
Of course, with a true hash, you could convert to your int and scan the table on the primary key rather than a character field, which arguably would be a bit faster. But you generally would sacrifice your URL length to do that.
Jon on August 22, 2007 02:28 PMAm I missing something here? Why would you consider using a hash?
Since a hash is one-way, there would be no value in using it. It's not like the server can convert the hash back into the actual URL. How is this "Hashes in practice"?
Kevin on August 22, 2007 02:30 PM+1 to tinywords, to counter Jeff's -1
--
With respect, Jeff, I think you missed the mark on this post.
You said: "But what really struck me about these services is how they're a perfect embodiment of a classical computer science concept--the hash table"
Rich said: "by transforming the key using a HASH FUNCTION into a hash"
These services take a long URL, store them in a database, and then provide a key to that table. The key is computed based on the quantity of data in the table. It is NOT computer through a reproducible algorithm operating only on the URL input. Therefore what we're talking about is not a 'Hash Table', it's just a 'Table'.
(Reminds me of the old quip "What do they call Chinese Food in China? Food, of course.")
Or, said another way:
Database table != Hash table
Wow, Jeff. You either need to lay off the Mountain Dew, get more sleep, or spend more time programming and less time blogging.
Here's a shorter version of how URL shortening services work:
They assign an incrementing integer to each new URL. This integer is usually encoded in the URL in some higher base (often 36 or 64). THE END
Joel on August 22, 2007 03:02 PM> It is NOT computer through a reproducible algorithm operating only on the URL input. Therefore what we're talking about is not a 'Hash Table', it's just a 'Table'.
That's fair.
To generate VERY VERY small keys (eg, "808", "cyu", "etc"), hash functions clearly won't work. Not enough bits, too many potential collisions. Clearly the short URL generators with tiny keys are not using hashes. It's impossible.
But I think it is eminently possible to generate a typical 40-64 bit hash using an existing hash algorithm based on the URL, which would resolve down to 8-10 characters.
It might even be possible to write a custom hash that did better on URLs.
But in the end, the simplest answer is probably the correct one-- iterate through all possible combinations, growing the string as necessary when you exhaust the key space.
So I re-award tinywords 0.5 and subtract 0.5 from myself. :)
Jeff Atwood on August 22, 2007 03:04 PMI have only used url shortening services for the most obtuse of URLs. Generally speaking very complicated QStrings like to send my mom a link to a product at Costco or Amazon (sometimes they get really long).
While I love the succinct nature of shortened communication in text messages and twitter, I hope it doesn't lead to a greater down of language as a whole. However, you have to admit that some thing http://sentenc.es/ is very sexy because it forces people to get to the point and saves time.
ian on August 22, 2007 03:07 PM> sending out very long hyperlinks in email is always risky; you never know when
> the email clients will insert line breaks in the links and render them unclickable
... which is why we need to pry those broken, ancient mailreaders from the 80-column terminal curmudgeons, and convince them that email is stopped being an 80 column, ASCII format 20 years ago, and if they insist on treating it as such, they have no right to complain when the rest of us want to move on and send formatted text with flowing word wrapping, reliable quoting, etc. (i.e. html-email)
A system that recycled their links wouldn't be of much use. :(
Bryan Price on August 22, 2007 04:05 PMBut the web interface for Twitter still won't let you enter urls if the resulting text BEFORE shortening the URL will be longer than 140 chars. It's a design flaw. It should check the resulting string AFTER shortening the URL.
Heck, why even use the URL at all? Why not convert any URLs found to a short URL and represent the link with the word "link"? Then calculate the total length of the string and check to see if it <140 chars?
"I've become a big fan of Twitter. "
hmmm, quit a reversal from this comment? ;)
http://www.lazycoder.com/weblog/index.php/archives/2007/03/07/twitter-when-users-use-your-app-creatively/#comment-59420
I don't have time to look thru all the previous comments so forgive me if this has been said... But maybe they just use base 36 (or similar) id's (like reddit). It probably doesn't give as many uniques as hashing, though.
1> 36#zzz.
46655
But as far as I know tinyurl and others expire their unused links periodically.
Sheesh, this guy got up on the wrong side of the bed:
>Wow, Jeff. You either need to lay off the Mountain Dew,
>get more sleep, or spend more time programming and
>less time blogging.
Even if the ultimate answer to the question of how URL shorteners work is a simple one, the process whereby the design was reached is worthy of study; more generally, any opportunity to be inspired by a real-world example into a consideration of theory is worth taking. Jeff's original post reminds me of Borges' Pierre Menard, who writes a "technical article on the possibility of improving the game of chess, eliminating one of the rook’s pawns. Menard proposes, recommends, discusses and finally rejects this innovation."
Alex Chamberlain on August 22, 2007 04:14 PMDisclosure: I'm the author of the next service.
You forgot ItGo.es (http://itgo.es/). For the shortened URL version of Steve McConnell's blog you should type http://itgo.es/10x in your browser, which is shorter than all you are speaking about. But you can also add a description with '?', for example, http://itgo.es/10x?blog
I use Base 36 with a really simple database to create the shortened URLs.
Raúl on August 22, 2007 04:42 PMAlex,
I would agree with you if this was actually "the process whereby the design is reached." I may be wrong, but I doubt the developers of these URL shortening sites even considered using a hash function. The thought process was probably more along the lines of, "I know, I'll store the URL in a table and use the autoincremented primary key as the new URL's path. Done!"
Joel on August 22, 2007 04:53 PM> I doubt the developers of these URL shortening sites even considered using a hash function
Well, it's the very first thing I thought of.
It's a superior solution from a technical perspective, IMO, because you wouldn't have to do duplicate checking. This is what hashes were born to do. Am I wrong?
As you can see in my post, I determined that the resulting length of most hashes makes this a prohibitive line of thought. I still think a 40 to 64 bit hash would totally work, but there's absolutely *no* way you could get a hash-based approach down to 3 characters.
Jeff Atwood on August 22, 2007 05:00 PMwow. when I stumbled across tinyurl I thought that some idiot had just come up with this idea and made a bunch of money. I knew that I could program the CGI for it in about 3 seconds, and my only programming background comes from "Python for dummies". I never thought it would take that much thought. I guess I need to give more credit
brodie on August 22, 2007 05:03 PMImagine a url shortening service a bit like mailinator does for mail.
You request the shortened url, it is "active" for a period of time and then it goes away, able to be reused after a period of time.
How many of you expect a shortened url to be good for a week? Month? Year?
url123.com sounds similar to http://notlong.com/ , which is my current favorite. It's not as short as some of your examples, but you do get to assign a (hopefully-)descriptive nickname to the abbreviation.
Peter Hosey on August 22, 2007 05:12 PMA project for hosting redirected url's yourself:
http://lilurl.sourceforge.net/
An elegant redirection service:
http://xhref.com/
My current favourite shortening service is http://linkpot.net/, which has a big list of short English words and just picks the next one off the list for each URL you shorten.
The idea is that linkpots are easy to type into mobile phones or to read out loud to someone, and are no harder to use on a computer than tinyurl.
Some kind of "hash the URL, take the first n bits, use them as an index to the list of words" scheme was suggested, but just popping a stack of words works equally well.
Will Thompson on August 22, 2007 05:36 PMFor TwitterLit posts on Twitter I've been using Tweetl, which has the advantage of offering stats. (To find the stats for any URL, e.g., http://tweetl.com/000, add an s: http://s.tweetl.com/000.)
Debra Hamel on August 22, 2007 05:54 PMSorry about the above link. I honestly was picking that example out of the blue. I didn't intend it to be a link to one of my own sites.
Debra Hamel on August 22, 2007 05:56 PMI'm with you Jeff...my first thought would be a hash. After all, who wants to lug around a DB solution for something as trivial as string translation? Using a DB seems like one of those "only tool is a hammer" type inclinations.
Steve Jackson on August 22, 2007 06:03 PMI'm really surprised no one has mentioned Lore Sjoberg's EvilURL! http://evilurl.com/ A great tongue-in-cheek URL shortener for those who get a giggle from swear words. It even gives you a preview of what your short-url is going to be before it saves it, so if you don't like it, just try again!
nickf on August 22, 2007 07:28 PM> It's a superior solution from a technical perspective, IMO, because you wouldn't have to do duplicate checking.
Sure, you don't have to worry about duplicate entries for the same URL, but now you need to handle the rare collision.
> After all, who wants to lug around a DB solution for something as trivial as string translation?
SQLite could handle this application easily. If you consider using SQLite "lugging around a DB solution," well, sorry.
Joel on August 22, 2007 08:40 PM>> Therefore, although tinyurl adds new strings sequentially, it also avoids duplicates by returning previously cached URL links ... And it does this very, very quickly.
> I didn't mention this because I thought it was fairly obvious. I assumed every URL shortening services does duplication checking, as the power law of popularity still applies to URLS..
Discussing hash algorithms which could avoid duplication for URL shortening services, I was more trying to emphasize that the reverse-lookup is performed very, very quickly. So regardless of entering in a huge URL of 100s of characters (like Google/Yahoo/MapQuest/eBay/etc. links), the reverse-lookup on previously entered URL cache is very, very fast. That's what I find impressive. Maybe I shouldn't be impressed, but I am. That's the part of the scheme that I find most interesting -- not that duplication is avoided, but just the speed of the reverse-lookup.
And I've noticed we've been doing a little dance today over two topics -- I hope we've both got Petzold's sense of humor. :)
Ian Johns on August 22, 2007 09:21 PMTinyURL used an incrementing table - now no longer does so but I suggest it's still not a hash. From wikipedia: someone counted until they saw the latest url was tinyurl.com/dicj , then added the next one so http://tinyurl.com/dick became a link to Dick Cheney's page - now replaced with an apology.
In that respect: what's the shortest base and protocol you could get? http:// is four letters. You could do shorter by providing wap: or ftp: links and tinyurl.com is 10 chars. I think x.yy where x is one letter and y is the tld (comes in 2 letters minimum) is the shortest. Are there 2-letter protocols? Probably none that most browsers understand, but ftp://g.to/xxx saves you some more letters. www.g.to/xxx even more when you make sure your client understands links starting with www and without protocol:// - you'd beat anything but one-letter-protocols. Sadly, webbrowsers don't undertand that...
IIVQ on August 23, 2007 12:01 AMFollowing up on George Santos' comment about compression, Jeff, another interesting way to think about this problem is compression. We want the URLs small, so any system that did try to use an algorithm instead of a simple lookup table would want small and unique, not regular and random. The problem with hashing, as you say is that the point is to not have conflicts. This means a much larger bit space than necessary to obtain that (birthday). With compression, the space isn't random or uniform, it's carefully tailored to the specific type and frequency of the data.
The example URL might be broken down like this by the compression algorithm:
"http://forums." (symbol 1, common)
"construx" (symbol 2 uncommon)
".com/" (symbol 3, very common)
"blogs/" (symbol 4, common)
"stevemcc" (symbol 5, uncommon)
"/default.aspx" (symbol 6, very common)
the very common strings will be high up on the Huffman tree and therefore can be represented by fewer bits. the uncommon ones will be higher and require more bits. ".com/" is probably the most common symbol thus may need only 1 bit to encode. While "construx" is an uncommon symbol and might need many more (perhaps 16).
Overall, these 6 symbols could likely be encoded in 40-50 bits, or 7-9 mixed case alphanumerics. This is obviously more characters than the simple lookup table you mentioned (which is really a degenerative case of compression anyway), but it has advantages in privacy (the host doesn't have to store a list of all URLs), and redundancy (the host can publish its encoding table and anyone who has it can then reconstruct any address without need for a database, even a browser plug-in)
Jeff said: It's a superior solution from a technical perspective, IMO, because you wouldn't have to do duplicate checking. This is what hashes were born to do. Am I wrong?
I say: Yes, you are wrong; you're replacing duplicate checking (something you only have to do once, when you request a new URL) with hash collisions - essentially breaking the algorithm you're using.
The normal use of hashing is to cut down the search space (a kind of bucket sort) or to provide a secondary check(sum) that what you have is what you thought you should have (e.g. MD5 on zip files, essentially checking for tampering or corruption) - it's not to provide a "unique" key.
The chances of meaningfully tampering with an MD5-checksummed file without altering its MD5 key are vanishingly small, but still greater than zero; in much the same way, the chances of generating the same MD5 (or any other kind of hash) key from two (or more) arbitrary strings (e.g. URLs) is vanishingly small, but still greater than zero.
Dino on August 23, 2007 12:58 AMSomeone mentioned the possibility of "temporary" shortened links. This is the plan for the future of http://www.linkpot.net (also mentioned previously) - by default all links will be temporary and expire after a fixed limit.
mrben on August 23, 2007 01:33 AMA couple of years ago I was annoyed enough by the total lack of context in links generated by all the other shortening services, that I created my own http://lnk.nu that keeps some context (specifically, the domain name and file extension, if any). It's not the shortest link you'll ever see, but it sure is helpful to see where the link is likely to go:
http://lnk.nu/codinghorror.com/fl2.html
Greg Hewgill on August 23, 2007 01:35 AM"My guess is the aggressive URL shortening services are doing a simple iteration across every permutation of the available characters they have as the URLs come in."
Hey, why didn't you just check this by obtaining two different URLs in short succession, before bothering us about it?
Tomer Chachamu on August 23, 2007 01:37 AMWe've written a linking service, but we don't use a database, or a hashtable. We simply use a mutable array. When we add a new URL to the array we base 36 the index and append that to the end of the shortened URL. When a request is received the end of the URL is converted back into a number which is the index of the URL in the array.
Steve on August 23, 2007 02:34 AMJust a quickie: I take it you've seen the http://hugeurl.com/ ?
Matt Gibson on August 23, 2007 04:46 AMTo help stop email clients from chopping up links in email messages surround the links with <link>
Diego on August 23, 2007 05:32 AMHashes don't fix the "if the service goes under, the link is ueseless" issue. how about a simple compressing algorithm? Certainly creates bigger URLs than these services, but at least it's reversible.
LKM on August 23, 2007 05:41 AMyou can notice it if u use "urltea" shortening, if u send many urls in a close time span, you get similar 3 letter permutations.
ruben llibre on August 23, 2007 06:13 AMIf you could register the shortest domain possible (the shortest one I know is http://o2.ie/) you can have up to 6 letters after the domain and it would be the same length as the shortest URL you posted in your article (12 characters). If you could have A-Za-z0-9, that would give you 62^6 possible combinations, or 56,800,235,584 combinations.
My maths is probably wrong :P
Peter on August 23, 2007 06:18 AMI feel sorry for you Jeff, some people who read this blog are stoopid...
Ollie on August 23, 2007 06:35 AMDoing some quick back of the envelope math, it appears that converting the md5 hash from 128bit hex representation to as base 36 alphanumeric (single case) representation would yield a consistent 4-5 character string.
As stated by Rich Skrenta in the post, md5 yields a possible 2^64 websites and more than enough for the foreseeable future.
Joel Bushart on August 23, 2007 06:44 AMWhen tinyurl first started out, you could enter a url of arbitrary size. This allowed you to basically use their service to store any type of data. You just had to write your own conversion methods.
Too bad they had to set a limit on the url size.
Joe Beam on August 23, 2007 07:00 AMYeah, if it were me creating that kind of site, I'd be plopping every link in the database (being careful not to duplicate, of course) and just letting an auto-increment field do the work of assigning a unique ID. They may be using some ASCII conversion to mask the totally numeric key.
Mattkins on August 23, 2007 08:00 AMI use the mURL Firefox extension.
TomatoQueen on August 23, 2007 08:05 AMI'm still trying to figure out Twitter and the likes. It all seems like communication for communications sake. Or maybe I'm just getting old...
We've discussed the topic a bit:
http://www.morningtoast.com/index.php/2007/07/communicating-is-now-out-of-control/
I've always thought that these sites just used primary keys, and then recycled keys that no one was clicking on?
My understanding is that tinyurl et all don't work as permanent links... if no one is accessing that url then it will eventually be lost.
but I could be on crack. I just tried a tinyurl from 1 year ago and it still points to the same place.
engtech @ IDT on August 23, 2007 10:52 AMJoel: your math seems wrong.
a-zA-Z0-9 is 62 characters, which means you can get 6 bits of data in a single character.
128bit md5 divided by 6 says it would take 21 characters to represent md5 in a-zA-Z0-9 which isn't much savings over the 32 hex digits it would normally take.
Sean on August 23, 2007 12:06 PM@Jeff
"Am I Wrong"
Yes, you are. A hash is a one-way function. That means that you can't take the result and calculate the original value from it.
If I take the string "Hello world" and apply, say, the MD5 hash on it to produce the hash "f061a...." There is no function that will accept the string "f061a..." and spit out "Hello world".
Collisions are not the issue. They /would/ be the issue if hashes were two-way. Since they are not, what you would have to do to figure out which url your hash value corresponded to would be to try hashing random urls until you found one that produced the correct hash value. Which is, of course, a completely ridiculous thing to try to do.
I appreciate your blog a lot but on this one I believe you are totally wrong. You are talking about compression, not hashing.
> * http://qurl.net/1YU
> * http://rurl.org/808
> * http://jtty.com/cuy
> * http://elfurl.com/li4na
> * http://shurl.org/pHbnD
> * http://shrinkster.com/s9y
> * http://tinyurl.com/yvvtag
> * http://clipurl.com/?PAP269
> * http://shorl.com/dihyfradiduba
Wow, no test of snipurl? Of all the various services, that is my favorite. It allows you to create your own key, and also has a JavaScript applet that you can use by placing it on your shortcut bar. Click on the shortcut button, and it automatically calls up snipurl, creates the shortcut of your current page, and copies it into your clipboard. A one step process. Snipurl gave this:
five characters, but we should also count the characters of the webpage too.
> * http://qurl.net/1YU - 19 characters
> * http://rurl.org/808 - 19 characters
> * http://jtty.com/cuy - 19 characters
> * http://elfurl.com/li4na - 23 characters
> * http://shurl.org/pHbnD - 22 characters
> * http://shrinkster.com/s9y - 25 characters
> * http://tinyurl.com/yvvtag - 25 characters
> * http://clipurl.com/?PAP269 - 26 characters
> * http://shorl.com/dihyfradiduba - 30 characters
>
> * http://snurl.com/1pvef - 21 characters
Not too shabby. (Snipurl.com can be abbreviated as Snurl.com to save a couple of characters).
David on August 23, 2007 03:08 PMNo mention of http://rubyurl.com either?
Dude, RubyURL is totally the best designed URLshortner on the web!
(I know, because I designed it)
Chris Griffin on August 23, 2007 04:28 PMWow, is everyone and their mom designing URL shortening services these days?
Joshua Bryant on August 23, 2007 05:22 PMHugeURL.com is funny :>
I once wrote a tinyurl-hugeurl-loop-script. You only see the urls redirecting in Safari.
http://rafb.net/p/cxAPGG68.html
Robert Gerlach on August 23, 2007 05:26 PMGreat post Jeff, very interesting and though it seems a lot of people are missing the point (esp. the ones claiming you're "totally wrong" like Mats Helander above) I think you're right on. It makes perfect sense to me.
Justin on August 23, 2007 05:35 PMSeeing how much spam is on these url-shortening services, it would be interesting to see what would happen if one tapped into something like Akismet or used some other kind of spam-filtering technology.
Tony on August 23, 2007 06:22 PMYou left out http://xrl.us/4gie
That's shorter than anything else you posted.
Randal L. Schwartz on August 23, 2007 07:02 PMI like shorl. It has not the shortest urls, because they use some sort of syllable approach, but it is just fun to pronounce the hashed urls. Almost dadaistic.
Tyler on August 23, 2007 07:48 PMI'll summarize the article for those in a hurry: URL shortening services could do something really dumb that produces too-long urls that sometimes collide. They would do this by utterly misusing a common computer science concept. Instead, they do something completely obvious, which sill be a surprise to only one person on earth: the author of this post.
Mr. Magoo on August 23, 2007 09:54 PM"The URL no longer contains any hints whatsoever as to the content of the URL. It's completely opaque. The only way to find out what's behind that hyperlink is to actually click on it. This is not a great user experience for the person doing the clicking."
Not sure about the other services, but TinyURL allows you to use http://preview.tinyurl.com/key which will redirect you through TinyURL's site to show you the URL you will be taken to.
In other news, Twitter's system is a bit flawed at the moment. Even though it automatically mini-fies a typed URL, it still counts the full URL as part of the 140 characters, resulting in a tweet that is shorter than 140 characters once it goes through the server. It means you either need to have a short url to begin with, or you need to mini-fy it manually [both options defeat the purpose of having the service there in the first place.] I'd like to see it automagically count URLs as four characters, and then replace it with the world "link" once it goes through the server.
andrew harrison on August 23, 2007 11:52 PM> you could do something really dumb that produces too-long urls that sometimes collide
I don't think it's dumb, personally. I'm not discounting the KISS solution, but it's something worth thinking about.
For example, the idea of using some form of two-way URL specific compression on the URLs, as others have suggested, is quite clever.
TinyURL.com says on their homepage that they've minified 42 million URLs so far.
That means we'd need at least a 56-bit hash, which shouldn't have significant collisions (based on the birthday paradox rule of thumb) until 2^28 or 268 million URLs were in play. To be very safe, we could go with a 60-bit hash, which would allow 2^30 or 1.07 billion URLs before collisions were a problem.
If we can encode a 128-bit GUID in 20 characters..
http://www.codinghorror.com/blog/archives/000409.html
Then we can probably encode a 60 bit hash in 10 characters.
Jeff Atwood on August 24, 2007 01:14 AMSpeaking of URL compression:
http://anres.cpe.ku.ac.th/pub/url-compression-ncsec.pdf
--
The prototype of this [URL compression] algorithm has been tested on a Pentium III 800 MHz machine, with 512 MB RAM. The test data set contains about 1.3 million unique URLs. The average size of the URLs is about 55 bytes. Size of a compressed URL is about 28 bytes on average, including all of the overhead, or about 20 bytes when the TreeNode is discarded. So the actual compressed data size, excluding all overhead is only 10 bytes. This yields about 50% reduction in size for web spiders’ purpose with an ability to store and retrieve the URLs on the fly. It yields about 64% size reduction for search engines, where only the ability to retrieve the URLs is required.
--
Really an interesting approach, and if anything, more of an argument that the large search engines should provide a URL compression service. They certainly have the data set to make the compression work..
Jeff Atwood on August 24, 2007 01:30 AMI'm definitely with you in that storing a fixed-length representation of a potentially infinite string sounds like just the job for a hash function (collisions aside), and that it is a rather elegant solution. However, I'm curious: how would you then quickly retrieve the URL for a given hash (which, after all, one could argue is the primary function of these URL shortening services)?
PS: I'm somewhat amused by the various commenters who leave "I'm really surprised that you missed out this service" messages. That's not the point of the article, people.
Andrew Ho on August 24, 2007 01:43 AM> If we can encode a 128-bit GUID in 20 characters..
> http://www.codinghorror.com/blog/archives/000409.html
> Then we can probably encode a 60 bit hash in 10 characters.
Really, don't do that. Having worked on a well-known link-shortening site, I can tell you that it's no fun at all fielding several emails a day from people who have entered the link incorrectly. These shortened links tend to find their way into print, where people re-type them. Characters get missed, '1's become 'I's, '0's become 'O's and so on. We once had a major national newspaper mis-transcribe a link this way.
We were using base36, so just A-Z, 0-9, case-insensitive, and people still had trouble. If you go using every symbol on the keyboard you're in a world of hurt (especially given your example string, "[Rb*hlkkXVW+q4s(YSF0" would get URL-encoded as "%5BRb*hlkkXVW+q4s(YSF0"). If I did it again, I'd use Doug Crockford's base32 encoding (http://www.crockford.com/wrmg/base32.html), which introduces redundancy to mitigate collisions caused by transcription errors.
Besides, as many here have pointed out, trying to hash a URL is completely pointless. URL shorteners just need to stick the URL in a database table, ASCII encode the primary key (and if you value your users' privacy, salt the result and then use that as the lookup, to prevent guessability of links at the expense of creating slightly longer links).
Finally, how in God's name would you go about decoding a one way hash back into a URL string anyway?
phl on August 24, 2007 02:11 AM> I can tell you that it's no fun at all fielding several emails a day from people who have entered the link incorrectly. These shortened links tend to find their way into print, where people re-type them. Characters get missed, '1's become 'I's, '0's become 'O's and so on. We once had a major national newspaper mis-transcribe a link this way.
Interesting. Then maybe the way to go is to use actual WORDS as the lookup value, rather than random-looking blobs of text. Will Thompson pointed out http://linkpot.net/, which apparently does this.
Anyone remember the old, old AOL password scheme? Two words joined by a dash? Like "JOKE-ORDER", "BAR-COLD", etcetera? Combining two common words might work. According to some quick preliminary research we can define "common words" at about 15,000 for most English speakers.. so that's 15,000 * 15,000 = 225,000,000 possible combinations.
Others have also mentioned some of these services allow you to explicitly name the short URLs, if the name isn't taken. Then you could use short words that provided hints (or not) about the content.. but I imagine these would get taken VERY quickly, just like the ongoing domain name land-grab.
> how in God's name would you go about decoding a one way hash back into a URL string anyway?
We still need the Table part of the HashTable; instead of an incrementing index, the hash becomes a GUID-like primary key. The more I think about this, and the more comments I read, the more I think ..
1) we should be exploring some type of URL compression, as in the paper I referenced above.
2) Google, MSN, and Yahoo should be the people providing the link-shortening service, not random agencies and individuals on the internet.
Jeff Atwood on August 24, 2007 03:16 AM> No mention of http://rubyurl.com either?
>
> Dude, RubyURL is totally the best designed URLshortner on the web!
>
> (I know, because I designed it)
Status: 500 Internal Server Error Content-Type: text/html
We're sorry, but something went wrong.
We've been notified about this issue and we'll take a look at it shortly.
Don Morris on August 24, 2007 08:45 AMI think you're incorrectly ignoring the problem of collisions. You shouldn't be thinking about avoiding a "likely" collision. You should be thinking about avoiding them entirely. Even if there is a 1 in 100 chance of collision at your expected volume, it is much too high. The statistical likelihood of something is totally irrelevant once I send gramma a link to pictures of the kids and she winds up on "hot asian babes exposed."
So in this application, collisions aren't just an annoyance, they're absolutely unacceptable. Hence, while hashing might seem like a good idea at first, their well-known "not unique" property should steer you away from them very early.
That's not to say you shouldn't use a smart hash internally to make duplication checks fast and easy. That, I believe, would be very smart.
Mr. Magoo on August 24, 2007 08:50 AMSome comments mention that certain services allow tagging on a descriptor like "example.com/xyz?blog". You could do much the same thing with any service, though, by using a hash (sic!): "example.com/xyz#blog".
Henrik N on August 24, 2007 12:42 PMNew to Coding Horror. While the algorithms behind URL shortening services might have been mistaken by Jeff at the start, it's pretty nice to see so many ideas have come to not only figure out how these sites work but how to make them better. That AOL-style naming convention would prove very usable. Granted, I feel dirty even commenting on AOL =P
In all seriousness, Mr. Magoo's point of a non-zero likelyhood of a collision occuring is valid.
CS Gordon on August 24, 2007 02:24 PM@ Sean
> Your favorite saying is incorrect. "Taken with a grain of salt"
> means there's not much meat there.. so little that you'll only
> need a single grain of salt to season it.
>
> So your favorite saying is actually saying the opposite of what
> you thought it said.
Jeff's interpretation, not yours is what people mean when they use the phrase. The whole little meat, grain of salt thing sounds like folk etymology.
When someone says you should take something with "a grain of salt" you are being asked to use a measure of skepticism. When Jeff said you should take it with a salt lick, he was suggesting that you'd need considerably more than a healthy measure of skepticism, in fact, so much skepticism that you should be very distrustful whenever statistics are brought in. That's a bit overstated, you should probably take Jeff's advice with a grain of salt.
huxley on August 24, 2007 06:07 PMThere's also the swedish shortening service called x, www.x.se, which produces as short url's as http://x.se/ab.
Benjamin Bergh on August 25, 2007 10:19 AMAnd if you think that short URLs are far too easy... why not make an ugly one: http://maul.ubermutant.net/
@Jeff,
Me:
"Yes you are" (Wrong)
You: (In a reply to another poster)
"We still need the Table part of the HashTable"
My apologies, I completely misunderstood you. It was the part in your post where you mentioned the problem with getting url-s back if the service was lost that threw me into thinking you were going for a table-less solution. Now it is obvious you didn't need any pointing out that hashes are one-way.
However, as those who unuderstood what you meant have pointed out, the scheme of using a hash to produce primary keys has (apart from the collision issues) no real advantages over using just an autoincreaser.
/Mats
Mats Helander on August 27, 2007 10:47 AMTiny url used to sequentially increase the url they returned until people started exploiting this fact.
dman on September 21, 2007 06:41 PM
This is far too complicated a solution. Here's a much simpler, compact, and fast method to get there:
1. Don't use the URL as a shortening target; hashes are good on a one-way basis (text to hash) but not the best way to access a database for the other way around. Instead, use an integer index to the row in a database (or array, dictionary, or whatever you please) as the value to express.
2. The next step is simply to express that number in base X; if you only accept characters A to Z, then express this number as a base 26 number using letters instead of numbers; lowercase, uppercase and numbers ? Base 62.
This way, you always benefit from the full density of the expressive range, without collision testing, hash searches, or sparse use of this valuable resource.
Guys take a look at
I created for you all
I am the developer of the site.
Just checked it to believe it. None better than this.
Manoj Paul on November 16, 2007 01:16 AMhttp://vividurl.com/DashDefyTab
hmm, this does seem easier to remember
or not :)
Homey on November 27, 2007 07:46 AMQurl.net uses (or rather, used) base62 and an auto_incrementing integer; arbitrary alphabet/base encoding is all of about 30 lines of code. Records have a SHA1 hash of the URL used for dupe checking (since the URL itself can be up to 64k long; you can encode a fair bit in a data: URL, and several people have).
Ultimately I turned qurl.net off; there aren't many users, and indeed assholes mostly used it to support their spam operations. No thanks.
Maybe I'll bring it back with a "report this link as spam" button to discourage them. Somehow I'm not convinced spammers would notice...
Thomas Hurst on February 2, 2008 09:31 AMPersonally, I use base 65. A-Z, a-z, 0-9, _, -, and . on http://www.inethdd.com/shortenURL.php
I put all of those characters into an array, then shuffle it, select a random number of them between 1 and 12 and create the hash if it doesn't already exist. Otherwise it runs again.
That gives me 12^65 possibilities. More than I would ever need and a lot simpler than some suggestions here.
Travis on April 2, 2008 02:45 PM±¨¹ØÔ±ÐÅϢƽ̨,[url=http://www.chinanet2000.com]±¨¹ØÔ±[/url]²©¿Í,н®±¨¹ØÔ±ÍøÕ¾,¿¦ÄÉ˹ºþÂí±³ÆïÂí±¨¹ØÔ±·þÎñ±±´÷ºÓ²úÆ·¿â¡¢[url=http://www.cnbeidaihe.com]±±´÷ºÓ[/url]ÐÂÎÅ¡¢±±´÷ºÓ¼¼ÊõÎÄÕ¡¢±±´÷ºÓÆóÒµ¡£ÕæÕý±±´÷ºÓÉÌΪÄúÌṩ±±´÷ºÓÐÅÏ¢
Öйú[url=http://www.zmbg.net]±±´÷ºÓ[/url]ÈËÍøÊÇ×ÛºÏÐԵı±´÷ºÓÐÐÒµÍøÕ¾,°üÀ¨¶Ô±±´÷ºÓý½é¡¢±±´÷ºÓ´´Òâ¡¢±±´÷ºÓÊг¡µÈ±±´÷ºÓÎÄÕºͱ±´÷ºÓÌÖÂÛÇø±±´÷ºÓÂÃÓιÜÀíÂÛÎÄ£º[url=http://www.cnbeidaihe.com]±±´÷ºÓÂÃÓÎ[/url]Òµ¿É³ÖÐø·¢Õ¹µÈÂÛÎÄ. ... ´óÁ¦·¢Õ¹±±´÷ºÓÂÃÓξ¼ÃÔöÇ¿×ۺϹúÁ¦[3221×Ö,Ãâ·ÑÂÛÎÄ]ÏÖ´ú±±´÷ºÓÂÃÓοª·¢±±´÷ºÓÂÃÓÎ,×ÔÖúÐÐ,°üº¬ÓÐ[url=http://www.zmbg.net]±±´÷ºÓÂÃÓÎ[/url]ÐÂÎÅ,±±´÷ºÓÂÃÓ뱨,±±´÷ºÓÂÃÓξ°µãµ¼ÓÎ,¹ú¼Ò±±´÷ºÓÂÃÓεØÀí,±±´÷ºÓÂÃÓÎÌìÆøÔ¤±¨,±±´÷ºÓÂÃÓεç×ÓÔÓÖ¾,ýÌå,,±±´÷ºÓÂÃÓÎÂÛ̳,×ÔÓÎ×ÔÖúµÄ±±´÷ºÓÂÃÓÎÍøÕ¾ÓÑÇéÁ´½Ó, ×î°®ËÄ´¨±±¾©°ì¹«¼Ò¾ßÍø, ÖÐÇàÔÚÏß[url=http://www.bj91fu.com]±±¾©°ì¹«¼Ò¾ß[/url], ±È±ÈÎ÷±±¾©°ì¹«¼Ò¾ß, Ðdz½ÔÚÏß±±¾©°ì¹«¼Ò¾ßƵµÀ, °Û¼Îͨ¸ß¶û·ò, Öйú±±¾©°ì¹«¼Ò¾ßÐÂÎÅÍø,
ÉϺ£ÔËÊä
ÉϺ£ÔËÊä on May 14, 2008 09:37 PM| Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |