One of the most beloved of all data structures in computer science is the hash table.
A hash table is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that is used to index into an array to locate the desired location ("bucket") where the values should be.
Key-value pairs are quite common in real world data, and hashtables are both reasonably efficient in storage and quite fast at lookups, offering O(1) performance in most cases. That's why hashtables are the go-to data structure for many programmers. It may not be the optimal choice, but unlike so many things in computer science, it's rarely a bad choice.
But hash tables do have one crucial weakness: they are only as good as the hash function driving them. As we add each new item to the hashtable, we compute a hash value from the key for that item, and drop the item in the bucket represented by that hash value. So how many buckets do we need? Let's consider the extremes:
Reality, of course, lies somewhere in between these two extremes. The choice of hash function is critical, so you don't end up with a bucket shortage. As you place more and more items in each bucket (ie, "collisions") you edge closer to the slow O(n) end of the performance spectrum.
There's something magical about these hash functions that drive the hashtable. The idea of the hash as a unique digital fingerprint for every chunk of data in the entire world is a fascinating one. It's a fingerprint that cleverly fits into a mere 32 bits of storage, yet is somehow able to uniquely identify any set of data ever created.
Of course, this is a lie, for several reasons. Let's start with the most obvious one. Consider all possible values of a 32-bit hash function:
232 ~= 4.3 billion
The current population of the earth is about 6.6 billion people. If we were to apply a perfect 32-bit hash function to the DNA of every man, woman, and child on the planet, we could not guarantee uniqueness-- we simply don't have enough possible hash values to represent them all!
This is known as the pigeonhole principle. It's not complicated. If you try to put 6 pigeons in 5 holes, one will inevitably be left out in the cold.
You'll definitely want to use a large enough hash value so you can avoid the pigeonhole principle. How much you care about this depends on how many things you're planning to store in your hashtable, naturally.
The other reason hashes can fail as digital fingerprints is because collisions are a lot more likely than most people realize. The birthday paradox illustrates how quickly you can run into collision problems for small hash values. I distinctly remember the birthday paradox from my college calculus class, and I'll pose you the same question our TA asked us:
In a typical classroom of 30 students, what are the odds that two of the students will have the same birthday?
Don't read any further until you've taken a guess. What's your answer?
Everyone has completely unique DNA, but shares one of 365* possible birthdays with the rest of us. Birthdays are effectively a tiny 365 value hash function. Using such a small hash value, there's a 50% chance of two people sharing the same birthday after a mere 23 people. With the 30 students in our hypothetical classroom, the odds of two students having a shared birthday rise to 70%. The statistics don't lie: when the question was posed in that classroom so many years ago, there were in fact two students who shared the same birthday.
A rule of thumb for estimating the number of values you need to enter in a hashtable before you have a 50 percent chance of an existing collision is to take the square root of 1.4 times the number of possible hash values.
SQRT(1.4 * 365) = 23 SQRT(1.4 * 232) = 77,543
When using a 32-bit hash value, we have a 50% chance that a collision exists after about 77 thousand entries-- a pretty far cry from the 4 billion possible values we could store in that 32-bit value. This is not a big deal for a hashtable; so what if a few of our buckets have more than one item? But it's a huge problem if you're relying on the hash as a unique digital fingerprint.
The hashing functions behind our precious hashtables may be a lie. But they're a convenient lie. They work. Just keep the pigeonhole principle and the birthday paradox in mind as you're using them, and you'll do fine.
* No, let's forget leap years for now. And other variables like birth patterns. Yes, I know this is how programmers think. Imagine how much it would suck to have one birthday every four years, though. Ouch.
Joseph said:
"With a sufficiently large number of twins or triplets, we might be able to store the DNA of everyone in the planet in a single hashtable."
..or a few nukes.
Dave on December 7, 2007 1:39 AMShameless plug: http://blogs.msdn.com/dcook/archive/2007/09/09/hashes-and-tables-and-primes-oh-my.aspx
Modeling the number of collisions in a hash table is somewhat more involved than might be obvious. You end up using Poisson distributions and other techniques that you thought you'd never use again after finishing your college statistics class. See http://burtleburtle.net/bob/hash/birthday.html#half for a bit more information.
Here are a few numbers that I worked out (take them with a grain of salt -- I'm a code monkey, not a statistician). If you need to hash 100 items, what results can you expect from various sized hash tables? (Note that this assumes some kind of chaining for collision resolution; otherwise things would get hairy after 100% load.) Cross your fingers that the table comes out OK in the blog :)
Load Expected Expected
Factor Buckets Collisions # of Probes
50% 200.0 21.3 1.14
60% 166.7 24.8 1.16
70% 142.9 28.0 1.20
80% 125.0 31.2 1.23
90% 111.1 34.1 1.26
100% 100.0 36.8 1.29
110% 90.9 39.4 1.32
120% 83.3 41.8 1.36
130% 76.9 44.0 1.39
140% 71.4 46.2 1.43
150% 66.7 48.2 1.47
160% 62.5 50.1 1.50
170% 58.8 51.9 1.54
180% 55.6 53.6 1.58
190% 52.6 55.2 1.62
200% 50.0 56.7 1.66
210% 47.6 58.1 1.70
220% 45.5 59.5 1.74
230% 43.5 60.7 1.78
240% 41.7 61.9 1.82
250% 40.0 63.1 1.87
260% 38.5 64.1 1.91
270% 37.0 65.1 1.95
280% 35.7 66.1 2.00
290% 34.5 66.9 2.04
300% 33.3 67.8 2.09
@Telos: "Heh. I was just thinking that seasonal variations would have an effect when I read the last line." I was wondering if there were any twins in the class so I could be snarky and say 100%..."
Well... to be honest, you'd lose if you had been snarky. I went to school with a pair of twins - who had different birthdays. One was born at 11:50PM or so, the other didn't show up till 12:05AM the next day.
There was also a recent news item - during the end of Daylight Savings Time, a pair of twins showed up and the one born second is listed as "older" - the first one showed up at 2:50AM, then the clock popped back an hour. So the second showed up at 2:10AM....
Valdis Kletnieks on December 7, 2007 2:12 AMAs a leap-year baby, it has its down-sides. Many people don't know which day to celebrate it on, but a few of the more happy-go-lucky will just congratulate me on both days! And that is just plain fun. :)
--
Danny.
For 32bit hashes you don't have 50% collision rate at 77k entries. That is, the probability of collision on inserting another entry, what one might think of as the collision rate, is 0.002%. But there is a 50% probability that within those 77k entries there is at least one collision.
Ants Aasma on December 7, 2007 3:46 AMHeh. I was just thinking that seasonal variations would have an effect when I read the last line.
Tom on December 7, 2007 3:46 AM"For 32bit hashes you don't have 50% collision rate at 77k entries."
Imagine if you did, and it kept increasing. That'd be one hellishly unbalanced hash.
Jeff, I love your blog, so please forgive me for being the language police: you need to stop mixing up "e.g." and "i.e.". If you can't keep the difference straight just stick to english - "for example" and "that is" are perfectly serviceable.
Anonymous English Police on December 7, 2007 3:50 AMAnts/Tom -- thanks. I corrected the text. What's the math behind estimating the actual collision rate?
Jeff Atwood on December 7, 2007 3:57 AMWith a sufficiently large number of twins or triplets, we might be able to store the DNA of everyone in the planet in a single hashtable.
Jeff,
maybe you missed the other major point of hash functions. Besides trying to avoid collisions at all, most good implementations are designed to avoid collisions for _similar_inputs_, i.e. given two inputs of same size even a single bit difference will change the hash significantly.
So the chance of a collision might be unexpectedly high per se, but the probability of two similar inputs yielding the same hash value are next to 0 (provided that you use a good hash function). It's exactly this property that makes hash functions suitable for data integrity checks.
Nikola on December 7, 2007 4:33 AMIf the hash function is good and you have an uniform distribution of hash values for your input data, then the collision rate is the amount of entries divided by size of table. If the hash function is not good then the math is more complicated and dependent on the hash.
Imagine for instance a 256 entry hash table of bytestrings where the hash function is the sum of bytes in the string modulo 256. Now if you will store strings composed of only @ characters (ascii 64) in there, all your hashes will be either 0,64,128 or 192. That means that there is nearly 50% chance of collision when there are only 2 entries in the table.
Ants Aasma on December 7, 2007 4:44 AMJeff, the math behind the calculation is pretty straightforward.
For n available slots and m occupying items,
Prob. of no collision on first insertion = 1
Prob. of no collision on 2nd insertion = ( n - 1 ) / n
Prob. of no collision on 3rd insertion = ( n - 2 ) / n
...
Prob. of no collision on mth insertion = ( n - ( m - 1 ) ) / n
So the probability of no collision on ANY of m insertions is the product of the above values, which works out to
( n - 1 )! / ( ( n - m )! * n^( m - 1 ) )
and conversely, the likelihood of at least one collision is just 1 minus the above value.
John Pirie on December 7, 2007 4:51 AMIn the worst case with one bucket - the performance of a well-written hash function shouldn't be O(n). The elements should be stored in a sorted array to permit a binary search so that your performance is at most log(n). No sense taking a good scheme like a hash and messing it up with a linear search for multiple values per bucket.
Dan on December 7, 2007 4:59 AMThe examples of e.g. in the quoted text are indeed correct. What he is more likely referring to is Jeff's actual usage, which has probably been missed because it is not punctuated.
'As you place more and more items in each bucket (eg, "collisions")' should be "As you place more and more items in each bucket (i.e., "collisions")".
It is not 'As you place more and more items in each bucket (for example, "collisions")' but probably more correctly 'As you place more and more items in each bucket (in other words, "collisions")'
[ICR] on December 7, 2007 5:12 AMIt is not true that every person has completely unique DNA. In fact, identical twins (and triplets, quadruplets, etc.) share identical DNA.
Will on December 7, 2007 5:17 AMApologies for defending the Language Police, but I believe it was the following that he was referring to:
"As you place more and more items in each bucket (eg, "collisions") you edge closer"
Back on topic, not so long ago, the architects of our companies product didn't understand what the hashCode function was for, and we had a fair few classes going around with hashCode being set to a constant value...
How about the following "HashTable" implementation? http://worsethanfailure.com/Articles/It-Had-Too-Many-Functions.aspx
;-)
Ant on December 7, 2007 5:21 AMI'm not sure if this is a regional thing or not, but you are really abusing big-O notation. Statements like "offering O(1) performance in most cases" and "best-case performance: an O(1) lookup." don't make any sense. The big-O performance of a general hash table is exactly the same as the data structure you use to handle collisions.
Your two extremes aren't extremes either. It is highly likely that you will have more buckets than you expect to have items.
I kind of agree with your main point, but from what I've found, computer science purists tend to ignore hash tables (for theoretical stuff) because they are hard to analyze, and their asymptotic performance characteristics are exactly the same as the collision resolving structure. And typically no one seems to care about actual performance :)
One of the other interesting things for hash functions is the randomized ones in perl and other languages, which change the hash functions every time your program runs to prevent malicious users from forcing lots of collisions :)
Andy on December 7, 2007 5:25 AMThere are other problems with hashes.microsoft says: "Hash tables that use dynamically allocated linked lists can degrade performance. By extension, hash tables that use dynamically allocated linked lists to store their contents might perform substantially worse. In fact, in the final analysis, a simple linear search through an array might actually be faster (depending on the circumstances). Array-based hash tables (so-called closed hashing) is an often-overlooked implementation which frequently has superior performance."
offler on December 7, 2007 5:32 AM"Heh. I was just thinking that seasonal variations would have an effect when I read the last line."
I was wondering if there were any twins in the class so I could be snarky and say 100%...
Telos on December 7, 2007 5:34 AMI'm a java programmer and programming without hash tables would have been so boring!
Thanks for the post!
Laila on December 7, 2007 5:35 AMHey Now Jeff,
I heard Frosty say Happy Birthday. I've always like the way you've quoted with the different color background to distinguish the difference. I thought of the B-Day puzzle when you posted http://www.codinghorror.com/blog/archives/000951.html, 23 .
Codding Horror Fan,
Catto
The birthday paradox is a paradox because most people suck at statistics and combinatorics. They see "chance" not in terms of tries, but in terms of patterns - if the wheel in the casino came up on red 4 times in a row, you can split the groups in those who fanatically believe it has to come up with red a 5th time to complete the pattern, or that it has to come up with black because there's no way the pattern can be completed.
Rob Janssen on December 7, 2007 6:18 AMI can think of lots of business cases with hashtables 70000. Most databases use hashtables to perform large join operations. These can very quickly require millions of values. The point about hashtables is that it scales very well. Thus if you are joining tables with millions of rows together, hashtables are the ideal method.
Martin Ritchie on December 7, 2007 6:30 AM
In the worst case with one bucket - the performance of a well-written hash function shouldn't be O(n). The elements should be stored in a sorted array to permit a binary search so that your performance is at most log(n). No sense taking a good scheme like a hash and messing it up with a linear search for multiple values per bucket.
No sense taking a bad scheme like a 90+% filled hash table and pimp it from O(n) to O(log n) where it should have been O(1) at the first place.
When I had to write my own hash table in college, we were required to make it double in size once it reached 50% capacity. We also used Rehashing when you encountered a collision.
Although, the hash function that we used was pretty simple. I think we were just storing integers, so we took the (int % hash_table_size) to get the placement.
Sean on December 7, 2007 6:49 AMIf you're really worried about hashtable performance you need to look at the type of hashtable you're using. There are two kinds I see used:
1) Hashtable is just an array of objects. If a collision occurs you walk down the table until you find an empty bucket and you use that one.
2) Hashtable is an array of collections of objects. When you have a collision you simply add the object to that bucket's collection.
Unfortunately, the first one is the one I see used the most (at least when people implement their own hash tables. I don't know what the built in implementations in C#, Java, etc use). This one is the easiest to write, but requires that your hashtable's array be at least as large as the set of items you're using, and has worse performance. A collision causes the new object to go into a different bucket, effectively adding to the number of hash values that have collisions.
The second one can actually retain near constant performance while saving memory. The hashtable can be much smaller than the actual data set's size, and as long as the hash function creates an even distribution then your lookups could still be constant time (as opposed to the first option, which would be linear).
For example. say you have N objects to hash, and a hashtable of size N/2. Assuming you have a good hash function, each bucket will only have 2 objects in it, so the lookup becomes O(1) + O(2) = O(1). Abstracted out, if your size is N/M your lookup becomes O(M), and you can keep M small very easily without having to rehash your table nearly as often as you would if you were using the first method (in which case any time your data set increased above the size of the hash table you'd have to rehash).
Probably a pointless discussion as most languages have good built-in hashtables, but I felt compelled to post this for some reason. :)
Andy Herrman on December 7, 2007 6:59 AMThe only way to achieve O(1) performance for a get operations is iff the hashing function is effectively perfect and the hash table can expand if it becomes sufficiently loaded. The former is up to you. The later is usually up to using a decent hash table implementation.
For hash functions, unless you've studied hashing, don't write them yourself. Let the IDE generate them, or use a third party utility class like HashcodeBuilder (Java) to do this. This goes for equals methods as well because unless you write decent unit tests, you will screw it up.
As for the implementation, Java has HashMap, which supports expanding with a load factor. I'm pretty sure C# and the C++ STL has the same thing. Feel free to write a hash table implementation yourself, but do it as an academic exercise rather than creating something for production code. This wheel has been studied extensively and is near perfected for general usage.
Nicholaus Shupe on December 7, 2007 7:05 AMI have very little programming knowledge.
However, I'm pretty well versed in Japanese.
That's either not a calendar, or it's crazy year this year. Someone please educate me...
Ira Knight on December 7, 2007 8:02 AMJeff,
If we're talking about storing every person's DNA and running out of hash addresses, I think we'd first need to talk about running out of space in general before running out of buckets.
At around 3 billion bases in length, with a base taking on one of four values (so lets make that 2 bits), we need 3,000,000,000 * 2 / 8 = 750 million bytes.
We're out of space after just a few people, never mind 6 billion of them.
Just being sort of argumentative this morning I guess. =)
Sammy Larbi on December 7, 2007 8:16 AMjeff. i can read that chinese calendar! it's lunar too.
Jin on December 7, 2007 8:21 AMVery interesting. What are the mathematics for determining the probability of 2 students sharing a birthday?
Val on December 7, 2007 8:32 AMHi Val-- there are lots of sites that explain the Birthday Paradox math in more detail. Here's one:
http://betterexplained.com/articles/understanding-the-birthday-paradox/
Jeff Atwood on December 7, 2007 8:56 AMI think the pigeonhole principle is just a little bit different than what you stated (but I think it's an important distinction):
"If you try to put 6 pigeons in 5 holes, one of the holes will contain two pigeons."
(This also better illustrates the point you're trying to make WRT hash tables.)
Manuzhai on December 7, 2007 8:58 AMNote that the O(1) claim is not correct, in reality it's O(size of key in bits). There are other data structures storing key-value-pairs (commonly called "finite maps") like tries or patricia trees that have the same asymptotic complexity, even in the worst case (hash tables are average case only).
Anonymus on December 7, 2007 9:17 AMI use to work on a computer system called Cado. It was back in the early 1980s. The standard system had 48K of memory on an 8085A process, and could support up to four users. This was done by leaving out unnecessary things like a file allocation table.
You defined a file by telling the system what drive and track the file started on, and how long the key for each record was, and how many records you plan to store, and the average size of the record. The system would calculate everything out, and give you the ending drive and track for the file. It was your job to make sure you didn't allocate another file on the same area.
Cado files contained records, and each record had a "key" associated with it (although not necessarily an unique key. For example, a key could be a customer number, and a customer might have multiple records.) The hash function looked at the key and calculated the track the record was on. The entire track was read into an 8Kb system buffer, and the key sectors (could be between 2 to 6 sectors on each track depending upon the estimated average length of the record) were searched for the key. This gave the offset on the track where the record was located. The record was then passed to that user's read buffer where the programmer could manipulate it.
Considering that you only had 48K or memory, slow 8" floppy drives, and the processor was running at a fast 3Mhz, it was a pretty fast system. Thanks to the hash look up, could pull off the information almost instantly.
The biggest problem was file sizing due to the way the hashing worked. A file took up a fixed number of tracks on a disk -- even if it was empty. Store more records than planned, and you had to resize the file. This meant redefining the file elsewhere on the system, and having the system copy the old file -- record by record -- to the new location. This could take hours. The other issue is that the file had to occupy contiguous tracks. That meant as you moved files around, you'd end up with empty tracks here and there. Every once in a while, we'd have to take the whole system in and redo all the files to consolidate tracks. In bigger sites, this could take days.
Of course, as I mentioned before, there was no file allocation table. We wrote down on a piece of paper where each file was located. One mistake, and we'd lose the customer data. No pressure, no pressure.
David on December 7, 2007 9:39 AMAndy -- the trick about analyzing hash tables is that you have to define "average case." This depends a lot on your inputs! English dictionary words are much different than, say, Esperanto dictionary words. My field of numerical analysis has a similar challenge when defining "average matrices" as test inputs: for example, (0,1) uniformly distributed random matrices are _not_ characteristic of problems that many people would like to solve.
Don't forget the useful trick of perfect hashing (and even minimal perfect hashing), which is handy for building read-only fast lookup tables at compile time. GNU Gperf is one of many tools that lets you do this:
http://www.gnu.org/software/gperf/
Mark Hoemmen on December 7, 2007 9:43 AMHi Jeff, thanks for the mention! The 30-second summary:
With 23 people there are 23*22/2 or 253 unique comparisons to make. The chance of 1 comparison being unique 364/365. The chance of every comparison being unique (i.e. no overlap) is (364/365)^253 = 0.4995.
So, with 23 people there's about a 50% chance of having at 1 or more collisions. There's a few subtleties in there but that's the general approach.
Kalid on December 7, 2007 9:49 AMMy professor once said that it's good to have the magical number as a prime one. Many hash functions uses division and that may screw up with a non prime one.
One of the most basic hash functions, but yet somewhat effective is to use the mod operator (% in C) something like this:
#define magical_number 13; /*a prime number*/
#define hash(key) key % magical_number;
And for collisions, you can easily implement a dynamic list for every possible return in the hash function. It's a pretty efficient way to make a dynamic list more efficient at lookup. And besides, collisions won't make it much more slow.
Hoffmann on December 7, 2007 9:50 AM*cough*
http://www.codinghorror.com/blog/archives/000834.html
Point number 2.
Amusingly, if I use your search box to search for either "pictures" or "photos", it returns *every single post you've ever made*. I think that's because the search engine is indexing some invisible text in the template for every page, but I'm not sure.
Anonymous Cowherd on December 7, 2007 9:56 AMIn defense of the English Police, he/she probably meant the "eg" in this sentence: "As you place more and more items in each bucket (eg, "collisions")"
Ah, fair enough. I fixed that.
If we're talking about storing every person's DNA and running out of hash addresses, I think we'd first need to talk about running out of space in general before running out of buckets.
Actually, we aren't storing the DNA, we're just storing the hash. With enough bits in the hash (128?) this would work fine as a fingerprint; it'd be equivalent in uniqueness to the DNA, at least in theory.
@Cowherd -- the picture is there to show the visual symmetry between a set of physical pigeonholes, and a calendar. What is a calendar, after all, except 365 arbitrary pigeonholes?
Jeff Atwood on December 7, 2007 10:08 AMYou can derive the 1.2 imes the square root of n approximation by using the binomial coefficient form and invoking Stirling's approximation.
You can see how bad Feb. 29 birthdays can be by seeing The Pirates of Penzance.
Eric Jablow on December 7, 2007 10:51 AMThere's a visualization of the birthday problem at http://demonstrations.wolfram.com/TheBirthdayProblem/ which shows probabilities for both a pair and a trio of "collisions."
Joel on December 7, 2007 10:59 AMI wrote a little about this a few months ago (http://blog.perfectapi.com/2007/10/hashtable-of-hashtables.html). When you have a scenario where you will likely encounter hash-key duplicates, then a simple solution is to have a *second* (or third, 4th etc) hash key.
So if you need a 320-bit key to avoid collisions, you can use 10x32-bit unique hash keys (or even 9x32 + 4x8). The only challenge remaining is finding a bunch of unique hashes (MD5, SHA-something, CRC, etc).
Actually, we aren't storing the DNA, we're just storing the hash. With enough bits in the hash (128?) this would work fine as a fingerprint; it'd be equivalent in uniqueness to the DNA, at least in theory.
It would be equivalent to the uniqueness of DNA _in practice_. Hashes rarely display the same uniqueness characteristics as their ranges in theory, because otherwise using the hash would be pointless.
Once again, theoretical DNA uniqueness: 1 in (number of nucleotide combinations per base)^(number of bases in genome), theoretical 128-bit hash uniqueness: 1 in 2^128. To say that 2^128 is smaller is a great understatement.
James A. on December 7, 2007 10:59 AMEven if you're one in a million, there's still more than 8 of you in New York City alone.
Steve Bush on December 7, 2007 11:08 AMWhen talking about performance, people often forget that hash tables have some limitations.
For example, when facing hash collisions (which lower performance), many would naturally assume that the best answer would be a larger hash table. After all, less collisions means more O(1) accesses.
However, your array has to be as large as the hash key space.
Meaning if your keys cover, for example, the entire range of 24 bits, you need an array of the size of 2^24 items (be they pointers, lists, or just the actual stored objects).
Even if you insert only 100 objects into the table, a good hash function will probably spread them evenly across the table.
A hash table requires AND WASTES a lot more memory than other data structures if you don't fill up the table. More memory means more page faults, and no cache coherence.
In one story I heard, when porting a compiler to a system that has a lot more memory, the compiler team increased the hash table used for symbol names so that more symbols could be inserted with less collisions.
To their surprise, the performance dropped significantly, even though the programs weren't using that many symbols. When profiling, they discovered that the performance dropped due to large paging demands.
Hash tables are fun :)
As for collision handling, there are some good options: linear probing, quadratic probing, double hashing, etc. I prefer chaining using linked lists or trees, but to each his own.
jread on December 8, 2007 1:53 AM@Ravi: Ah dammit, missed the one in the actual text (no periods, and the comment included them). So I only spotted the two in the Wikipedia quote. Score one for easy quoting. I'm not sure that it should be "i.e." there either though, because "i.e." is used to give the explanation of a word, not the word itself. So it should probably be:
[With more and more "collisions" (ie, placing items into a bucket that already contains items)]
or just remove the i.e. completely.
And knowing is half the battle. Of course, the *other* half is to stop correcting poor Jeff's grammar for the purposes of showing off one's intelligence and knowledge of English grammar...
First, I see a heck of a lot of people advocating a single hashmap implementation with regard to collisions (linked list, tree, sorted array, linear search, exponential seatch, rehashing). This is like argueing that a Vector, Linked List, unsorted array, sorted array, etc is always the best option in every case. The analysis isn't trivial like in the later case but the choice of what kind of hash map depends on how it is used: number of insertions, memory costs, retrieval frequency, etc. Kudos to K for touching on some easily missed considerations.
Good catch on the O(1) abuse Andy; I hadn't considered the formal definition and it appears that noone else here has either. I'd still contend however that the colloquial use employed with regard to hashmaps still has some utility. The problem with the formal definition with regard to hashmaps is that it assumes arbitrarily large N independant of any context where-as any hashmap analysis requires size of N to be considered.
Yes, sure some theorists may avoid hashmaps because of the difficulty of analyzing them in absolutes but these are probably the same purists who would have never accepted TCP because it is too empirically based and difficult to model with absolute certainty using simple logical constructs. Theory has fantastic instructional and research-oriented uses but it should never be used to the detriment of practical application. In the end, it is usable applications that are the true goal of all computer science theories. This isn't meant to be computer philosophy after all.
JB on December 8, 2007 11:03 AMIf you try to put 6 pigeons in 5 holes, *at least* one of the holes will contain *at least* two pigeons.
I have a small rant about big-oh abuses and hash tables here: http://hnr.dnsalias.net/wp/?p=29 if anyone is interested...
Henry on December 10, 2007 11:18 AMNow 2 to the power 32 approximately equals 4.3 billion?
No longer 4.0 billion?
Thank you. We needed that.
The hash function mentioned before:
f(x) = (ax + b mod p) mod 2^n
is really a good function, but it's not called perfect hashing (since perfect hashing would never yield collisions and can only be achieved if the input values are known beforehand), it's called universal hashing.
http://en.wikipedia.org/wiki/Universal_hashing
Khi on December 11, 2007 5:05 AMThe hash function mentioned before:
f(x) = (ax + b mod p) mod 2^n
Note that if p divides 2^n this formula loses some of its value. If p and 2^n are coprime then I believe this produces a repeating sequence of length 2^n with no repetitions between cycles.
I suspect 2^n is there because of the optimisation x mod 2^n = x (2^n - 1). The "ax + b mod p" is the important part... and has uses other than hashing (e.g. generating a non-repeating sequence of p numbers).
Jheriko on December 18, 2007 9:07 AMThere's other performance considerations as well - how long does the hash take to compute? And how is its 'hash performance' affected by restricting its values to the index table's size?
I've found that a C++ standard library 'map' container can outperform a 'hash_map' container (that's part of the TR1 version of the standard library) when you have strings as keys. I think the reason for this is that the string values were reasonably evenly distributed across (in this case) the alphabet, so most of the map lookups only compared one or two characters before failing, which isn't going to take too long. And I suspect that the ability of the hash function to produce unique values may have been adversely affected by reducing it to fit in the index table size, leading to multiple resultant key comparisons. I don't know - all I know is that the application performed noticeably better with a 'map' vs. a 'hash_map'. Noticably enough that I decided to measure it.
Stuart Dootson on February 6, 2010 10:14 PMJeff,
Something that might interest you is that you can actually define an "optimal" hash function that provably has no bias.
f(x) = (ax + b mod p) mod 2^n
p is a prime larger than any message you want to hash. a and b are random natural numbers between zero and p and unique for each hash operation. n is the size of the hash in bits.
There is no way with this scheme that the choice of x can bias the result of f(x).
While this is a perfect hashing algorithm, if x is larger than just a few bytes, this calculation starts to have a serious performance penalty
This scheme is the hashing equivalent of the one-time pad and can be used to provide provably secure authentication.
Simon.
Simon Johnson on February 6, 2010 10:14 PMI haven't encountered a situation where 70000 entries are stored in a hashtable yet. It boggles my imagination to think of a business rule that requires such.
It could be possible if it was a simulation. Are there any swarm or repast users out there that can confirm this?
Jon Abaca on February 6, 2010 10:14 PMUh, guys, the .NET hash table doesn't use chaining. It's a moot point whether you'd use a linked list or a sorted list (and if you're going to go the memory-bloat route, why not use something even more efficient like another hash table or a btree?).
Almost every real-world hash table uses some form of re-hashing, and keeps its load factor below 50%. I know you CS types love to argue about this stuff, but the problem's already been solved for 99% of plausible real-world scenarios. And if it doesn't solve yours, then you probably need a more consistently-efficient data structure, like a tree or graph.
Aaron G on February 6, 2010 10:14 PMAwesome description. I thought you mis-stated a couple things, but they all turned out to be my mistakes. Thanks for fixing my head!
Will - Twins don't have identidal DNA, nor your cells. Twins started with very close DNA like your cells, but variation is guaranteed because copying isn't perfect, and DNA gets damaged.
Jon Abaca - Special hash implementations let us go WAY beyond 70k, which is useful in estimating DNA backgrounds and variation.
The bit-depth you need for hashing DNA depends on the k-mer length you're looking at, so it varies from application to application. You can store all possible 2-mers with a 3-bit address at a load factor of 1, but this is a highly specialized and useless hashing function.
Nobody hashes complete human genomes because they haven't been sequenced yet. You could however store fingerprints or patterns of k-mers that appear to identify people uniquely.
Your bit-depth doesn't need to model the possible DNA-space exhaustively, instead it needs to be matched to the sample space that you're trying to identify uniquely.
(better late than never...) There was a comment above something to the effect that if the distribution of the hash was not uniform (unbiased) then the math gets a whole lot messier and depends on the hash. But as I was diagraming the calculation for probability of at least one collision it occurred to me that the math must already be messy (not so clean as the simple factorial calculation of the series)...
Think about it... if you have 10 slots total and 5 are occupied, you would calculate a probability of a collision on the next insert at 50%. No problem there - we stated clearly that there are 5 occupied slots. BUT if we wanted to calculate the probability of at least one collision so far, the straight math gives you 70%... but only if you make the paradoxical assumption that there are 5 occupied slots - in which case you beat the odds because you had no collisions. If you in fact have a collision along the way, then the number of open slots is greater than you are basing the calculations on. See? If along the way of inserting 5 items into a 10 slot hashtable I in fact have a collision (and odds are good I do), then after 5 insertions I have 6 open slots, not 5. That make the calculation for the next insertion and the additive probability so far change (quite a bit for such a small hashtable).
Someone please tell me how to reconcile this so that some "real world" numbers come out of this. The worst case example of this is that you never have a collision until you hit m = n + 1 at which point the calculation is moot because every insert from then on must collide... but up until then you could have been beating the odds even at probabilities approaching 1... but that isn't realistic. So the pure math does not reflect reality.
bh
Benjamin Huygir on September 29, 2011 3:47 PMThe comments to this entry are closed.
|
|
Traffic Stats |