In my previous post on shuffling, I glossed over something very important. The very first thing that came to mind for a shuffle algorithm is this:
for (int i = 0; i < cards.Length; i++)
{
int n = rand.Next(cards.Length);
Swap(ref cards[i], ref cards[n]);
}
It's a nice, simple solution to the shuffling problem:
At first blush, this seems like a perfectly reasonable way to shuffle. It's simple, it's straightforward, and the output looks correct. It's the very definition of a naïve algorithm:
A naïve algorithm is a very simple solution to a problem. It is meant to describe a suboptimal algorithm compared to a "clever" (but less simple) algorithm. Naïve algorithms usually consume larger amounts of resources (time, space, memory accesses, ...), but are simple to devise and implement.An example of a naïve algorithm is bubble sort, which is only a few lines long and easy to understand, but has a O(n2) time complexity. A more "clever" algorithm is quicksort, which, although being considerably more complicated than bubble sort, has a O(n log n) average complexity.
But there's a deep, dark problem with this naïve shuffling algorithm, a problem that most programmers won't see. Do you see it? Heck, I had the problem explained to me and I still didn't see it.
Watch what happens when I use this naïve shuffling algorithm to shuffle a three-card deck 600,000 times. There are 3! or 6 possible combinations in that deck. If the shuffle is working properly, we should see each combination represented around 100,000 times.
As you can see, 231, 213, and 132 are over-represented, and the other three possibilities are under-represented. The naïve shuffle algorithm is biased and fundamentally broken. Moreover, the bias isn't immediately obvious; you'd have to shuffle at least a few thousand times to see real statistical evidence that things aren't working correctly. It's a subtle thing.
Usually, naïve algorithms aren't wrong -- just oversimplified and inefficient. The danger, in this case, is rather severe. A casual programmer would implement the naïve shuffle, run it a few times, see reasonably correct results, and move on to other things. Once it gets checked in, this code is a landmine waiting to explode.
Let's take a look at the correct Knuth-Fisher-Yates shuffle algorithm.
for (int i = cards.Length - 1; i > 0; i--)
{
int n = rand.Next(i + 1);
Swap(ref cards[i], ref cards[n]);
}
Do you see the difference? I missed it the first time. Compare the swaps for a 3 card deck:
| Naïve shuffle | Knuth-Fisher-Yates shuffle |
rand.Next(3); rand.Next(3); rand.Next(3); |
rand.Next(3); rand.Next(2); |
The naive shuffle results in 33 (27) possible deck combinations. That's odd, because the mathematics tell us that there are really only 3! or 6 possible combinations of a 3 card deck. In the KFY shuffle, we start with an initial order, swap from the third position with any of the three cards, then swap again from the second position with the remaining two cards.
The KFY shuffle produces exactly 3 * 2 = 6 combinations, as pictured above. Based on your experience shuffling physical cards, you might think the more shuffling that goes on, the more random the deck becomes. But more shuffling results in worse, not better, results. That's where the naïve algorithm goes horribly wrong. Let's compare all possible permutations of a 3 card deck for each algorithm:
| Naïve shuffle | Knuth-Fisher-Yates shuffle | ||||||||||||
|
|
You can plainly see how some of the deck combinations appear unevenly in the 27 results of the naïve algorithm. Stated mathematically, 27 is not evenly divisible by six.
Enough theory. Let's see more results. How about a four card deck, shuffled 600,000 times?
600,000 divided by 24 is 25,000; that's almost exactly what we see right down the line for every possible combination of cards with the KFY shuffle algorithm. The naïve algorithm, in comparison, is all over the map.
It gets worse with larger decks. Here's the same comparison for a six card deck.
With a 6 card deck, the differences between the two algorithms grow even larger. The math, yet again, explains why.
6! = 720 66 = 46,656
With 46,656 paths to only 720 real world outputs, it's inevitable that some of those paths will be severely over-represented or under-represented in the output. And are they ever. If you shipped a real card game with a naïve shuffle, you'd have some serious exploits to deal with.
I know this may seem like remedial math to some of you, but I found this result strikingly counterintuitive. I had a very difficult time understanding why the naïve shuffle algorithm, which is barely different from the KFY shuffle algorithm, produces such terribly incorrect results in practice. It's a minute difference in the code, but a profound difference in results. Tracing through all the permutations and processing the statistics helped me understand why it was wrong.
Naïve implementations are often preferred to complex ones. Simplicity is a virtue. It's better to be simple, slow, and understandable than complex, fast, and difficult to grasp. Or at least it usually is. Sometimes, as we've seen here with our shuffling example, the simplicity of the naïve implementation can mislead. It is possible for the code to be both simple and wrong. I suppose the real lesson lies in testing. No matter how simple your code may be, there's no substitute for testing it to make sure it's actually doing what you think it is.
Simple explanation through mathematical induction.
The singleton case with 1 card is perfectly shuffled (unbiased).
Assuming shuffling is unbiased with the n-1th case, let's try shuffling n items.
Well we know that for the nth item, we want a 1/n probability of having any card be the nth card. So, we pick a random card from the n cards and let it be the nth card.
Then we attempt to shuffle the n-1 remaining cards, but we know by our inductive hypothesis, that shuffling n-1 cards via same method is also unbiased and this hypothesis is also true for the base case of 1 card. So, it's true for all n >= 1.
Nice one. I too found it slightly hard to grasp, but you do a good job of explaining the logic behind it!
Jean Azzopardi on December 8, 2007 4:19 AMBefore everyone starts doing a maths demo, lets stick to the point.
The Danger of Navet
Is not so much that we all want to shuffle a pack of cards, but the story is educational. How many people would ever really test the shuffling of the pack prior to shipping product ? Of course if the project was open source eventualy the error would be corrected. If the project is closed source, the probablity is that the mistake would never ever be spotted.
Now since Open versus Closed is still debated . . . I have made the choice to focus on that singular aspect. To be fair, testing is not answered by either methodology.
David Ginger on December 8, 2007 4:27 AMQuick question... because I'm not a C# programmer.
In the example, you have "rand.Next(cards.Length)"... so how does that translate to "rand.Next(2)" in a 3 card deck?
Shouldn't it be "rand.Next(3)"... which will give a random number of 0, 1 or 2?
Ref for rand.Next:
http://shepherdsbush.dcs.hull.ac.uk/wiki/index.php/Random_Numbers
I work in an industry where randomness is our bread and butter. Every application we create is rigorously tested, and then simulated with well over several billion iterations. Finally, if it doesn't pass the chi-square test, we start all over again. A Chi-Squared test basically checks if we are too random, since thats not natural either :)
no-fun on December 8, 2007 4:38 AMGreat post. Simple and direct to the point of bad (or rather nave) programming. This is definitely the kind of geeky stuff that we "alpha" programmers truly love. Keep up the great work!
Leonardo Cassarani on December 8, 2007 5:08 AMI have always done my array shuffle the later way you mentioned above only now after reading your post I realized that it has a name, Knuth-Fisher-Yates. I always think how to construct an algorithm using a model related to what we were taught in mathematics. In this case, combinatorial arrangement of numbers gives the best model. Just think as an event where one number will be chosen each at a time to fill in one position, and the rest is just how good you manipulate it to fit your programming language limitation and syntax.
Wendy on December 8, 2007 5:12 AMJust wanted to pipe in that python's random.shuffle() does use the Fisher-Yates algorithm.
k on December 8, 2007 5:17 AMAwesome post, been watching your blog for a few weeks now and it's my first comment. Keep up this amazing work.
I've got the same question as Craig Francis, how does "rand.Next(cards.Length)" translate to "rand.Next(2)" in a 3 card deck?
Also, those graphics you made are very very good, which library or module did you use to output them?
Rodrigo Araujo on December 8, 2007 5:38 AMFunny, I usually don't do well with complex math issues, but I understood this problem easily. Maybe it's because I ran into something similar many many years back when I tried to generate a random starfield with assembler and had to write my own RNG (which turned out to be surprisingly complicated).
J. Stoever on December 8, 2007 5:51 AMIt's still hard for me to understand, but your post helped me a lot. Thanks Jeff :)
Does anyone know of some algorithm that gets for example 10 random items (without duplicates) out of a list of for example 100 items?
Interesting article! When I read your shuffling post, I though 'how would I solve it?', my first thought was creating a new array, and for each position generate a random number, which would be the index of the card to place there. If that index already was taken, you should redo the random number pulling till you get a unique index.
I forgot to test it till I saw this post. Was I naive? Well, it's definitely not fast hehe :) I wrote a very bare-bones implementation and it somewhat fair but not as fair as I'd hoped:
Results:
312: 155360
132: 155582
123: 176399
213: 167727
321: 178951
231: 165981
Randcounters:
0: 989745
1: 994988
2: 1015270
the randcounters show the # of times that index was pulled at first.
So I guess my algo was also naive. I went back to check if I could rework it a bit to make it less naive, but that turns out to be very hard for a random naive algo like this. I think that's the pitfall we should all look for, no matter if you're an expert or a beginner. So lesson learned: use algo's that are already proven to be not naive ;)
With a Naive algorithm, a proof of correctness should be easy to construct.
If you can't construct a proof instantly in your head, then don't trust the algorithm.
Simon.
Simon Johnson on December 8, 2007 6:21 AMThis flawed shuffling algorithm was used earlier at online poker sites:
http://www.cigital.com/papers/download/developer_gambling.php
http://en.wikipedia.org/wiki/Shuffling
momo on December 8, 2007 6:58 AMPython is amazing... the very same test program that I made for your previous article passes this test too, since random.shuffle() is well written.
Only three lines of code:
import random
deck = [1,2,3,4]
random.shuffle(deck)
You should write something about well designed programming languages, I think python is one the bests ones out there.
Gabriel Patio on December 8, 2007 7:15 AMI would explain the problem with the naive algorithm differently. It builds an n-ary tree with n levels and so n^n leaves, each of which is a possible outcome. Since there are always more leaves than distinct permutations, and the number of leaves is not (always? ever?) evenly divisible by the number of permutations, some permutations occur as outcomes more often than others and the naive algorithm can't be unbiased.
Stephan on December 8, 2007 7:48 AMNice post. One comment about the implementation - i think it's probably more efficient to implement a pre-decrement rather than a post-decrement.
The post decrement you have implemented evaluates to True following the end of the iteration with i=1, and then performs a pointless swap of the first item of the array with itself.
if you implement a pre-decrement the loop statement will evaluate to false and that final, pointless swap wont take place.
joe on December 8, 2007 7:54 AMThat's not a "nave" algorithm, that is a WRONG algorithm!
A nave algorithm gets the job done. It is just suboptimal and simple. This does not get the job done. And the correct algorithm is also pretty nave, as the problem is not complex enough to call for anything but a nave algorithm.
Anonymous on December 8, 2007 8:02 AMJeff, great post, but I beg to differ with your conclusion, or at least the way you express it. You wrap up with, "I suppose the real lesson lies in testing." The *real* lesson is not to assume that just because you wrote the code, you know what it's doing.
Testing is a wonderful tool, but for too many people that translates into "unit testing". The problem with unit tests is that, usually, the person writing the test is the same person who wrote the code. Such tests will often suffer from the same biases and blind spots as went into the original code. Those unit tests give developers a false sense of security. The only way to really know what's going on is to manually step through the code, either on paper or, if you must, in the debugger. Only then will you really know what the code is doing.
And knowing, as they say, is half the battle. ;)
mtVessel on December 8, 2007 8:23 AMI wrote about the same thing a while ago: http://szeryf.wordpress.com/2007/06/19/a-simple-shuffle-that-proved-not-so-simple-after-all/
szeryf on December 8, 2007 8:26 AMGreat post Jeff, your picture perfect presentation helped in understanding it clearly.
Kiran Ayyagari on December 8, 2007 9:29 AMOne other follow-up here. Don't assume rand() is very random either. The standard ANSI C library implementation of rand() generates an extremely poor distribution of numbers from any seed. Generate a distribution of say 10 million numbers and you'll see groups of some numbers coming up way more often than they should.
Jeff on December 8, 2007 9:38 AMAs someone pointed out on reddit:
That's not a "nave" algorithm, that is a wrong algorithm!
If the person writing the algorithm doesn't know math, what's the point ...
Kevin on December 8, 2007 10:00 AM> ... I found this result strikingly counterintuitive
I suspect you are being overly-specific and actually mean - 'I didn't expect this result'.
Isaac Gouy on December 8, 2007 10:01 AMThere's a problem with your table containing the possible outputs of the two algorithms. You're missing a tag somewhere, and it's messing up my livejournal friends page (I use LJ for tracking feeds). I'd appreciate it if you could fix that when you get a chance. Thanks!
Also, very interesting article! When you think about it, the bias is trivially visible in the context of the pigeonhole principle; if there's 27 possible outputs but only 6 holes to put those outputs in, then there's obviously going to be some over-represented results (at least 3, in fact, since 27%6=3)
Asmor on December 8, 2007 10:36 AMI've noticed that the majority of comments on this article seem to fall into two categories for me: those that get the point of the article, and those that think that the incorrectness of the first algorithm is mathematically obvious and wonder what all the fuss is about. I'll give this latter category the benefit of the doubt and assume that they are not just trying to make themselves look smarter than the rest of the herd. If there are developers out there who really are wondering why the example given seems like such a legitimate trap to the rest of us, I'd like to remind you of two things.
First, I've found that one of the most common (although not universal) weaknesses in developers without a formal computer science education is in the practical application of complexity theory and formal proofs to everyday code. Programmers (such as myself) who fall into this category need all the reminders we can get.
Second, if your main reaction to the given example is that it's wrongness is obvious, then you've missed the point of the article, which Mr. Atwood so kindly emphasized in boldface for us. Slightly rephrased, it is possible for code to appear both simple and correct, while still being subtly wrong. If you would never stumble over the example given, be assured that there's a different one waiting around the corner just for you.
Frank Davis on December 8, 2007 11:20 AMIn the naive algorithm an array element can change its value many times but in the KFY it can change it only once! Thats the difference. Thats why the naive algorithm leaks!
Nikos on December 8, 2007 12:02 PM"Does anyone know of some algorithm that gets for example 10 random items (without duplicates) out of a list of for example 100 items?"
The first approach I'd try would be to make a Set, fill it with all items, and for 1..10 pick one of the items out (at random), remove it from the set, and store it in your combination. This might work well enough for you (is simple and unlikely to house unfortunate bugs), or it might yield insufficient performance.
A more performant approach would be to devise a combination machine (an engine which will give you an entire series of combinations for a particular 'n' and 'r'), pick an index at random, and have it produce the particular combination at that index. Explaining precisely how to do this would be beyond the scope of this text box, but suffice to say that a combination machine is fairly easy to write once you recognize the pattern in the series is based off diagonals of Pascal's triangle. This approach would yield your 10 random items in very short order, although it's really *more* efficient if you want to produce multiple such sets as the setup involved highly outweighs the per-combination cost.
A related approach would short-circuit the CM/index approach but use the base materials (pick an index in the position-0 series, then using that value pick an index in the corresponding position-1 series, etc), which would slightly increase the per-combination cost in favor of practically eliminating the setup cost.
The *WRONG* approach would be to pick one in the list at random, then another after it at random, etc, as this would unduly weight the later-in-list items.
None of these, by my thinking are "naive". I'm not sure what Jeff's definition of "naive" is, but it seems to be "missing a fundamental point of logic" aka "wrong". So, I guess only the last would be "naive". The other two certainly each have their place. Always favor straight-forwardness and simplicity over complexity. But, if performance is a problem then it is likely that you will need to increase the complexity (and thus raise the bar on any engineer who can step into that code to debug it) to achieve your goals.
Tom Dibble on December 8, 2007 12:18 PM> I'm not sure what Jeff's definition of "naive" is, but it seems to be "missing a fundamental point of logic" aka "wrong"
The definition is specified in the first paragraph:
--
A nave algorithm is a very simple solution to a problem. It is meant to describe a suboptimal algorithm compared to a "clever" (but less simple) algorithm. Nave algorithms usually consume larger amounts of resources (time, space, memory accesses, ...), but are simple to devise and implement.
--
It shouldn't be wrong, just inefficient.
> In the naive algorithm an array element can change its value many times but in the KFY it can change it only once! Thats the difference. Thats why the naive algorithm leaks!
A red herring. This is true of the KFY algorithm as well; the "swap with yourself" step is valid in each iteration.
> Funny, I usually don't do well with complex math issues, but I understood this problem easily.
Thank you-- neither do I! Not all of us can be Wesner Moise, after all.. :)
<a href="http://wesnerm.blogs.com/net_undocumented/2007/11/triple-nine.html">http://wesnerm.blogs.com/net_undocumented/2007/11/triple-nine.html</a>
Jeff Atwood on December 8, 2007 1:40 PMExcellent article, Jeff.
If anyone needs to test their PRNGs, get a copy of the George Marsaglia CD. Prof. Marsaglia researched randomness at FL State.
Here's one link to a mirror site:
http://i.cs.hku.hk/~diehard/cdrom/
===================
If anyone has a VB/VB.Net application using the intrinsic VB PRNG functions/statements, you should read my analysis of its weaknesses.
http://www.15seconds.com/issue/051110.htm
Microsoft finally acknowledged the weaknesses in its VB PRNG code by adding warnings to the online VB.Net code documentation.
Mark Hutchinson on December 8, 2007 1:52 PM> In the naive algorithm an array element can change its value many times but in the KFY it can change it only once! Thats the difference. Thats why the naive algorithm leaks!
A red herring. This is true of the KFY algorithm as well; the "swap with yourself" step is valid in each iteration.
----------
KFY: [n] is set, [n-1] is set, [n-2] is set, ...
naive: if [n] is set its still possible during iteration to be swapped many more times.
Surely the real moral of the story here is never use your own algorithm? We all know we should never write our own sort(), as the standard library will contain one that's far more efficient and correct than anything we could think of. Surely that logic extends here? Is it not the case that when we need an algorithm to search, shuffle or sort, we should be copying them out of textbooks instead of navely assuming we can do just as well as theoretical computer scientists?
I'm not saying that we shouldn't make an attempt to understand the algorithms we use; obviously not -- you should never write code you don't understand. But the academics covered this ground a long time ago; let's not ignore their work.
David House on December 8, 2007 3:13 PMInteresting post. As David House, I conclude it as "know thou libraries".
Casper Bang on December 8, 2007 3:33 PMstd::random_shuffle for the win :)
The proper algorithm must have been known since at least the sixties, so it would be strange if anything else was used in libraries.
Johan on December 8, 2007 3:37 PMI think folks praising Python for not screwing this up have rather low expectations.
Josh on December 8, 2007 4:49 PMI wonder what PHPs shuffle() might use.
mario on December 8, 2007 5:12 PMWhile I recognize the truth in the article, like you I found it incredibly counter-intuitive. I thought for sure that only choosing values of n within the range "i" hasn't been assigned to yet would yield some sort of bias- Being that some cards wouldn't be moved as many times as others (since whatever got moved to cards[i] in each iteration would stay there), I thought for sure that there would be some way to predict, based on the order of the starting deck, at least the probability of each card being closer to the front or the back of the deck.
Thanks for the explanation! I like when being wrong is so fascinating.
Alex Lucas on December 8, 2007 5:14 PMNeat example, thanks! I found myself shaking my head the other day at some code that thought it could improve on Java's random (not that somebody else couldn't) but in fact generated alarmingly short repeating patterns in very short order. Now I find others shaking their heads at the shuffle algorithm I've used for years. My shop is extremely strong on business knowledge, but apparently weak at the CS and math skills needed to spot something like this!
Stan on December 8, 2007 5:31 PMThis article assumes that rand has a perfectly random distribution. In a lot of environments, I rather doubt that it does. I have implemented the 'naive' version of this algorithm when I just needed something scrambled in a hurry and didn't have time to research the proper algorithm. As a one-off, it is a perfectly serviceable solution and more than good enough given the vagaries of the various random number generators out there.
Not that I'm arguing against correctness. But there is also engineering practicality.
Todd Blanchard on December 8, 2007 5:39 PM@alvin:
Responding to a request for an algorithm to choose k < n elements randomly from a list of n things, I can think of two algorithms offhand, which are essentially variants of each other.
If your list is short, you can simply shuffle it and take the first k elements of the shuffled list. This is what I'd use, provided it performed well enough for the application.
The second algorithm is something like this, assuming the function random(n) returns an integer r such that 0 <= r <= n-1 and that lists/arrays are indexed from 0:
let L be a copy of the list of things.
let 'results' be the name of the list of results, initially set to be the empty list.
n := length (L)
repeat k times:
selected = L [ random (n) ]
append selected to results
delete selected from L
n -= 1
end repeat
return results
This is a bit more efficient than shuffling the entire list just to return a subset of them, but, I suspect not so much so that it'll frequently make a big difference. As I said before, if the simple algorithm performs well enough, go with that.
a python programmer on December 8, 2007 7:30 PMYou're naive, alright. Why are you writing a supposedly informational blog?
David Mills on December 8, 2007 9:21 PMrand.Next(2);
rand.Next(2);
rand.Next(2);
I think that should read
rand.Next(3);
rand.Next(3);
rand.Next(3);
The kfy algorithm is itself a bit naive. Depending.
If you are trying to shuffle a 52 card deck the number of permutations is 52! which is nearly 10**68 (10 to the 68th, or more exactly, 80658175170943878571660636856403766975289505440883277824000000000000. If you are seeding your random number generator with a 32 bit number, you are limited to 2 **32 or about 4.2 billion permutations (4.294967296 X 10**9).
That's 10**68 vs 10**9. Not good enough. Ruby, among other systems, uses a rand function that is a Mersenne Twister with a period of around 10 ** 6002, which is plenty large enough. So it depends on your PRNG.
The various algorithms are all short and prettier in Ruby:
<code>
# Knuth's pretty good algorithm for shuffling an array.
# From his text, I think. Or maybe from Programming Pearls.
def knuth1(ary)
0.upto(ary.length - 2) { |n| # n is in [0..MaxSubscript - 1],
# inclusive
r = n + rand((ary.length - n)); # r is [n..MaxSubscript], incl
ary[n], ary[r] = ary[r], ary[n]; # swap [n] and [r]
}
end
# Jeff Atwoods presentation of the Knuth-Fisher-Yates
# shuffling algorithm, in Ruby
def kfy_shuffle(ary) # I think this is equivalent to knuth1,
# above, but it's easier to present it
# than prove it.
(ary.length - 1).downto(0) { |i| # i is in
# [0..MaxSubscript], incl.
n = rand(i + 1) # n is in [0..i], inclusive
ary[n], ary[i] = ary[i], ary[n] # swap elements [n] and [i]
}
end
def sort_shuffle(ary) # Sort by random keys using key-sort
# (aka Shwartzian transform)
ary = ary.sort_by{|w| rand()} # Is this a better shuffle,
# than the kfy algorithm...
end
</code>
You could use the sort-shuffle before or after one of the others without compromizing randomness. Although it wouldn't guarantee a seeing each permutation once before seeing any duplicates. Going through each permutation exactly once before repeating makes the last few shuffles very predictable, anyway. For shuffling decks in a gambling situation you want the most unpredictable, where previous permutations have are independent of what permutation will come up next.
Is there any way to put properly formatted code into comments on this site? My pretty code isn't pretty any more.
JFred on December 9, 2007 2:19 AM@alvin
well, "obviously" picking 10 unique numbers out of 100 is exactly the same as this shuffle problem, just that you only care about the first 10 (which also means you can stop after 10 steps).
@Tom Dibble
"The *WRONG* approach would be to pick one in the list at random, then another after it at random, etc, as this would unduly weight the later-in-list items."
Actually no (unless I misunderstood what you meant).
In the first step the probability of selecting a specific item is 1/n. The probability of selecting a specific one in the second step is ((n-1)/n)*(1/(n-1)) = 1/n as well - the first part is the probability of not having it picked in the first step, the second is the probability of picking it now.
In addition this is exactly the same thing that it is done with the "Knuth-Fisher-Yates shuffle algorithm" (never heard that name before).
Btw. one shocking thing about random numbers:
Assume you have a random number generator that gives you equally distributed random numbers in the range [1..n] but you want equally distributed numbers in the range [1..m] where m < n and m does not divide n.
Then there is no deterministic algorithm that can do this mapping, and basically the only way is to test if your random number is > m and if yes try again (you can optimize this a bit if m < n/2).
Which is an algorithm that can not be proven to terminate if you use truly random numbers (though in practice that is not really a problem except in hard-realtime applications)...
This and other things are IMO nicely explained at http://java.sun.com/javase/6/docs/api/java/util/Random.html (see esp. the description of "public int nextInt(int n)").
Arg. This ate my smaller sign and half the following text.
The pre-last paragraph should have been:
Assume you have a random number generator that gives you equally distributed random numbers in the range [1..n] but you want equally distributed numbers in the range [1..m] where m smaller than n and m does not divide n.
Then there is no deterministic algorithm to do that, basically you have to check if that random number you got is larger than m and if yes try again (you can optimize this a bit if m is smaller than n/2).
N.B.:
All the major languages, and most of the minor ones, have crypto packages you can use to get better random numbers than the default rand function. I assumed you all knew this but then I guessed not.
JFred
I've worked in online gaming and shuffling is a moot point. You dont shuffle you simply draw each card (or whatever) out as you need it using a *truly random* number from the remaining population.
But good example.
Jan Bannister on December 9, 2007 6:44 AM@Jan Bannister
That is exactly what the correct shuffling algorithm actually does, just that it keeps the drawn cards and the remaining population in the same array, the one in the front and the one in the end.
It ends when the remaining population is empty.
Never, ever read algorithm statistics before your morning coffee.
This is why you test, and re-test before product launch. At least this is evidence that my geek nature to complicate things and avoid the overly simple has some merit - on occasion...
Dave on December 9, 2007 6:58 AMNave algorithms are usually backed by nave test cases. In a proper test case, there should be multiple assertions that prove the same thing given different circumstances.
Jon Abaca on December 9, 2007 7:52 AMFor all non C# programmers saying rand.Next(2) should be rand.Next(3) listen :
Arrays in c# are zero (0) based, that is they store values according to the real number system (0,1,2,3 ....) not the counting number system (1,2,3,4 ...)thus the length of an array in c# will be n-1 where n is the number of elements in the array.Comprende? :)
@Reimar:
"@Tom Dibble
"The *WRONG* approach would be to pick one in the list at random, then another after it at random, etc, as this would unduly weight the later-in-list items."
Actually no (unless I misunderstood what you meant)."
You did, but mainly because I wrote it poorly :)
I meant to describe an approach I'd fixed before, in which a programmer searching for a random combination of "r" items in a group of "n" would, essentially:
until ( found one )
... int minindex = 0;
... for ( i from 1..r )
... ... minindex = random # between minindex and 'n'
... ... add item at 'minindex' to combination
... ... minindex++;
...
Obviously, that's a bit pseudo-codish, but the main thing is that the "next index" is alwasy a (random number of indexes) AFTER the "last index". So, on average, the first index chosen skips half the list and eliminates that first half from consideration ...
In any case, my point was that someone felt that this approach was either elegant (which I can't imagine they really convinced themselves of that ... any time you have an "until this works" loop you've pretty much written off "elegant") or simple to understand despite possible performance problems, which I suppose might fit Jeff's definition of "naive", but definitely fits my definition of "WRONG".
@Jeff:
"A nave algorithm is a very simple solution to a problem. It is meant to describe a suboptimal algorithm compared to a "clever" (but less simple) algorithm. Nave algorithms usually consume larger amounts of resources (time, space, memory accesses, ...), but are simple to devise and implement.
--
It shouldn't be wrong, just inefficient."
Well, the article spent a lot of time discussing what happens when something is "wrong", not "correct yet naive". I suppose that's the base of confusion here in the comments.
I'd suggest there are two orthogonal issues here: "wrong" versus "right" and "simple" versus "complex". It is just as incorrect to conflate "simple" with "wrong" as it is to conflate "simple" with "right". "Naive" appears to be a synonym for "simple" (as opposed to "simple AND wrong"), but I'm not sure I'd agree with the connotations of that. Sure, "simple" isn't always the best answer. The vast majority of the time, though, it is.
Using the shuffle approach, and pretending there isn't a perfectly-simple-yet-properly-complex implementation of "shuffle" in pretty much any modern language, let's examine some options here:
Simple, Wrong: The "exchange items with bug" approach listed in this article.
Simple, Right: Assign a random number to each item, sort according to that random number. Doesn't perform well, but works. And is easy to explain to the next guy.
Elegant, Right: Use the published "exchange" algorithm. Performs well, works, but is hard to explain to the next guy why each line is precisely how it is without saying "just read the reference material!"
There's a place where performance matters above maintenance, and there's a place where the reverse is true. The trick for us as developers is figuring out which place (or where in the continuum between the two) we are in, and thus which approach to the problem makes the most sense for us.
In my previous post (in response to the question about pulling one of the 100C10 combinations) I gave two solutions, one of which was simple and easy to understand, the other of which would take several pages with diagrams to explain (but performs and scales significantly better than the simpler solution).
Which is the right one to use? I'd say you always start with the simple one, then move on to more complex implementations only when you've identified that performance is an issue. Moving forward, the difference is often between "anyone working on the application can touch this algorithm" and "any change to this algorithm needs to be looked at by YOU, yes, YOU, the original developer". The former tends to foster sanity far more than the latter.
"Naive" tends to connote an incorrect choice. "Simple" is not always, and in fact is not even often, an incorrect choice.
gogole: unless someone changed something while I wasn't looking, array.Length gives the number of elements in the array, not the index of the array. So, if an array has 3 elements, array.Length is 3, not 2. As written, Jeff's "naive" code should run through:
rand.Next(3);
rand.Next(3);
rand.Next(3);
while the "non-naive" code should go through:
rand.Next(3);
rand.Next(2);
I fixed the off-by-one summary of the rand.Next() calls. I also fixed the missing table cell tag. Sorry about these errors!
Jeff Atwood on December 9, 2007 1:01 PMJeff, mind sharing what program you used for those graphs? In particular the one for the six-card deck shuffle?
Computer Guru on December 9, 2007 2:21 PMGah, hate to double-post, but anyway:
This is what I was trying (and miserably failing to) point out in the previous post:
>Both end-results are as "random" as each other, but the second
>method gets you a more "equal" randomness - which is something that
>counts in cryptography.
Computer Guru - I don't know but you might be interested in http://code.google.com/apis/chart/
[ICR] on December 9, 2007 2:34 PMDavid A. Lessnau : Oh!My bad.thanks for the correction.
gogole on December 9, 2007 3:57 PM
I completely dissagree with the statement that more
shuffling yields worse results with the naive algorithm.
Going with the 3 card example:
If you stick another loop outside the naive algorithm and
run the 600,000 shuffle simulation again you get nice, neat
100,000 bin sizes for all 6 combinations. I just ran it a
bunch of times - no bullshit.
for (int x=0; x < cards.Length; x++)
for (int i = 0; i < cards.Length; i++)
{
int n = rand.Next(cards.Length);
Swap(ref cards[i], ref cards[n]);
}
This is much less efficient, but more representative
of the way a naive code would have implemented the
shuffle in the first place.
Luthos:
I think you're correct, more shuffling with the naive algorithm will lead to more evenly distributed cards. But no-one is going to shuffle cards.Length^2 times as you have in your code, that's way too expensive.
Regarding randomness... I noticed that the "Enter the word" image stays the same more than 3 times consecutively (I have no idea when did it begin). Aren't that image supposed to be changed every time I load the page?
cheong on December 9, 2007 7:26 PMGreat post, Jeff. I found the explanation very concise and logical. It's always interesting to see how such a small difference in approach can affect an algorithm.
Oh, and talking about Naive. When I saw the list permutations each algorithm produced, I was immediately thinking of adding code to keep track of each permutation created by the Naive algorithm and ignoring a permutation if it has already been produced. You'll end up with 6 perms, like the Knuth shuffle, but goodbye efficiency (in terms of time and space).
The thing I like about shuffling is that it is the opposite of sorting, yet you'd be able to change a sorting algorithm in a shuffle algorithm by editing only a few lines.
Albert on December 9, 2007 7:37 PM"It's possible for code to be both simple and wrong". Sorry, but I can't understand the central point here. If complexity and correctness are not correlated, is that surprising?
Ole Eichhorn on December 10, 2007 1:14 AMRather than using the swap based algorithm I went with:
1. Start with a deck as a stack
2. Create an array of items
3. Take the top card from the deck (Pop)
4. Insert it into a random array index
5. If the array index was not available, discard the random and try again
6. Repeat until all cards in deck gone
7. Load that array back into a new deck
Results on 6 cards 600,000 itterations:
0|99777
1|100086
2|100058
3|99879
4|100163
5|100037
Not bad. And on a 6,000,000 sample:
*** RAND CHECK ***
0|1000514
1|1000070
2|999574
3|999222
4|999939
5|1000681
Seems like a pretty easy way to do it, doubt it's as fast as the KFY way due to getting randoms in the array which are already taken.
Simon on December 10, 2007 1:53 AMSorry, that was 3 cards, 0-5 representing each of the possible combinations.
Simon on December 10, 2007 2:02 AMComparison on 10,000 shuffles in Milliseconds:
MyShuffle (very similar to PokerStars): 5984.375
KFY Shuffle: 1312.5
Suppose performance isn't the key thing for online poker rooms.
Simon on December 10, 2007 2:30 AMThe particular difficulty here is that checking the results of the algorithm for this seemingly simple problem is not obvious.
Like checking that an encryption algorithm is secure...
"The nave shuffle algorithm is biased and fundamentally broken."
I can't imagine that standard shuffling by hand fairs better than this algorithm. There's a big difference between "not the Platonic ideal" and "fundamentally broken".
Which is not to take anything away from the important point illustrated here: "It's a minute difference in the code, but a profound difference in results."
(...although "profound" may be a bit strong...)
Robert Fisher on December 10, 2007 5:55 AMsmart pointers are C++ objects that simulate simple pointers by implementing operator-> and the unary operator*.
snooker on December 10, 2007 5:55 AMHey Now Jeff,
This post really made me think. Very interesting & everyone has to agree testing sure is important.
Coding Horror Fan,
Catto
Why shuffle the deck when you can draw random cards? :)
Your uncle Bob on December 10, 2007 7:08 AM"It is possible for the code to be both simple and wrong."
Thanks for finally coming out with it. This is what I've been trying to tell people for years.
Unfortunately, there are some people here insisting that the wrong-ness of the naive algorithm here is simply common sense, and even if that's true, that is precisely the kind of arrogance that leads to these mistakes. ("Your algorithm falls outside my definition of 'simple', therefore, I still say simple is better.")
Oh well, I think you still got through to a few of them. Hopefully.
Aaron G on December 10, 2007 8:10 AM"For every complex problem, there is a solution that is simple, elegant, and wrong."
Stephen Collings on December 10, 2007 8:12 AMOnly time I implemented a shuffle, I split the desk into two lists, and merged, by randomly taking from either list. One one list was gone, the remainder of the other was appended. Maybe it's biased, but only as biased as a human doing a 'normal' shuffle. Think of it as a simulation of what normally happens when a bunch of guys light up cigars and sit around a table, not what happens in a fancy casino.
Rich Wilson on December 10, 2007 8:29 AMBTW, as a side topic, procedure for getting non-modulo bias 32bit number from 0 - N.
private int _rndIndex(int items)
{
// Uses 32bit int as the random, so we have to handle negative
// and positive values. i.e. -2,147,483,648 to +2,147,483,647
// Note: We need to -1 from the max as we will have 2 fulls set of mod posibilities
// mod = 0 at the minus end, mod = 0 at 0 and mod = 0 at the positive end which is one too many mod= 0's
byte[] randomNum = new byte[4];
// Calculate what the maximum acceptable random is to avoid modulus bias
int minNum = (Int32.MinValue - (Int32.MinValue % items));
int maxNum = (Int32.MaxValue - (Int32.MaxValue % items)) - 1;
// Keep going until we get a number between min and max
int rTemp = 0;
do
{
cry.GetBytes(randomNum); // cry is a RngCryptoProvider
rTemp = BitConverter.ToInt32(randomNum, 0);
}
while (rTemp < minNum || rTemp > maxNum);
// Get the mod, this will never return = items, i.e. if items is 52 it would only return -51 to 51
int r = (rTemp % items);
// Always positive version for use as index
return (r < 0) ? r * -1 : r;
// Note: this is not bais against 0. Mod of a negative number will return a negative number or 0
// Mod of a positive number will return a positive number or 0. There is twice a much chance of getting 0 as getting
// any specific positive or negative number. So taking negatives and positives as both positives removes the bias.
}
Note the int32 output from the rng is just the resulting random number, the RNGCryptoProvider thingy isn't public spec so not sure how many bits of internal state it has, but I would assume it's alot.
Simon on December 10, 2007 9:22 AMThis is a excellent example of H L Mencken's Law: To every problem there is a simple obvious solution that just does not work.
I mentioned the KFY algorithm in a short paper in "Software -- Practice and Experience", way back in 1977 (Vol 7, No 2, pp 201-203). I couldn't find it in Knuth at the time and so failed to give credit. My version extracted a random combination of items from an array and shuffles the array at the same time. I've used it in half-a-dozen languages and many programs.
RJBotting on December 10, 2007 9:42 AMMinor point:
> Arrays in c# are zero (0) based, that is they store values according to the real number system (0,1,2,3 ....) not the counting number system (1,2,3,4 ...)
Though that is the common way to conceptualize this feature of C/C++/C#/etc, I actually prefer to think of the so-called 'index' as actually 'the distance from the beginning of the array'. a[0] is not the 'zeroth' item, it is the item zero away from the first item, and hence the first item.
(Good article, Jeff!)
Eric "purple guy" Lippert on December 10, 2007 9:44 AMBeing pragmatic, if you want to shuffle you use python's random.shuffle(cards) and if you want to draw n cards you use python's random.sample(cards, n). No need to reinvent any wheels... unless for some very very special applications maybe.
r on December 10, 2007 11:36 AMInteresting article.
I did come up with the "good" algorithm after some thinking.
But didn't do a frequency test. Reminds me to do it next time I do something that's "random". The only right way to check if your numbers are actually random.
Carra on December 10, 2007 12:44 PMit is interesting that most people cannot see why the naive shuffle algorithm is wrong - but I was watching one of the TED talks (cant remember the link anymore) about statistics, and I remember it saying that statistics is one of the most unintuitive things (probability is part of statistics), and that most people dont get them right.
its not the case that the problem with a naive algorithm is it could be wrong - any algorithm can be wrong, naive or not. to program properly, one must also understand wot he/she is doing - including the maths behind it. if all one is making is just a simple business data entry app, there would be no real need to have any deep understanding of mathematics behind the program, but anything more complex, such as rendering, scientific calculation, games etc etc, the programmer needs to be *really* educated in their domain, and thus, these things cannot be done by the average "script kiddy" who picks up code from the net and patches them together. that is what i would called naive.
Chii on December 10, 2007 4:12 PMCode that's simple and wrong can be found in testing. Code that's complex and wrong, all too often, cannot have effective tests written for it, IMHO.
Grace on December 10, 2007 4:25 PM................:)
danceka on December 10, 2007 9:56 PMthey dont monitor the connection localy... they monitor it on their server
snooker on December 11, 2007 1:03 AMSweet post.
Rhysc on December 11, 2007 5:36 AMJeff,
I second in the query as to which app you used to generate your graphs. Care to share?
Jake Heidt on December 11, 2007 7:54 AMThanks for the great post Jeff! It's an eye opener. It brilliantly shows that even the simplest bit of code may require quite advanced testing to detect subtle flaws.
Why do you call the first, biased algorithm, "naive"? My reaction when reading the code was "oh, never thought of that one!" To me it actually looked less intuitive than the KFY algorithm. Each iteration of KFY brings one random card into place, then leaves it there. The "naive" one messes with its partial shuffling, just like combing a giraffe.
Arthur N. on December 11, 2007 9:11 AMI'm surprised that the focus of this article is on testing, rather than proving, the incorrectness of the so-called "naive" algorithm.
If you're going to be designing critical, non-trivial algorithms for your software, testing alone isn't enough because it's impossible to test it for all input values. It's also easy to accidently miss important boundary conditions.
KG on December 11, 2007 9:33 AMFunny that no one has mentioned the Faro shuffle, which is a very quick shuffling method:
http://en.wikipedia.org/wiki/Faro_shuffle
Wow, yet again you post an excellent post! This has got to be my favorite blog, and I check it every other day. Keep the programming posts coming :D
Ihoss on December 11, 2007 3:54 PMSome people questioned whether rand.Next(cards.Length) is correct. I've seen fragments of answers scattered throughout the comments, but not one complete answer.
Simply, the Length property returns the size of the array. The Next(int) method returns a random integer greater-than-or-equal to 0, and less than the argument you specify. So for the three-card shuffle, rand.Next(cards.Length) will return 0, 1, or 2. Since C# arrays are zero-based, this is exactly right.
Anon E Moose on December 11, 2007 4:34 PMYou just wanted to play with your graph chart generator. Admit it.
Michael on December 12, 2007 2:22 PM>Is it not the case that when we need an algorithm to search, shuffle or sort, we should be >copying them out of textbooks instead of navely assuming we can do just as well as >theoretical computer scientists?
It is not. What you should be doing is learning the algorithm out of the textbook. The ability to understand and implement algorithms devised by "theoretical computer scientists" is one measure of a good programmer. If you copy the algorithm blindly, you may implement it, but I claim you'll probably never know whether you implemented it correctly or well.
codeguy on December 13, 2007 1:43 PMIn JavaScript, I usually do this:
function shuffle(arr) {
return arr.sort(function () { return Math.sin(Math.random() * 360); });
}
It's not efficient or secure, but hey, it's fun!
Tim Cooijmans on December 14, 2007 5:02 AM"If you copy the algorithm blindly, you may implement it, but I claim you'll probably never know whether you implemented it correctly or well."
More importantly, it is unlikely the text book will address exactly the problem before you. You need to understand how your requirements are different and modify things accordingly and be able to be sure that your modified algorithm is correct and scales well on the axes required.
Robert Fisher on December 14, 2007 8:57 AMCome on, I want to see the graphical comparison with a 52-card deck! Ya left me hanging.
Gobot the Ultimate Destroyer on December 15, 2007 2:10 AMShuffling playlists and slideshows is a similar problem, but more complex and with different requirements, like NOT being mathematically random.
http://feeds.feedburner.com/~r/blogspot/smartbear/~3/202195450/not-in-cards.html
Jason Cohen on December 18, 2007 5:50 AMThere is a strange non-random behavior of the VB RANDOMIZE statement. My alternative shuffle I've used:
' Initialize the deck
ReDim Deck(UBound(a))
For n = 0 To UBound(a)
Deck(n) = n + 1
Next
Randomize
For n = UBound(a) To 1 Step -1
nSelect = Int(Rnd() * (n + 1))
' assign to the random array value
a(n) = Deck(nSelect)
' replace array entry with unused entry
Deck(nSelect) = Deck(n)
Next
a(0) = Deck(0)
Typically results in:
123: 166818
132: 166768
213: 166347
231: 166712
321: 166508
312: 166787
But moving the RANDOMIZE statement into the FOR loop results in:
123: 161243
132: 180510
213: 172909
231: 174002
321: 159293
312: 152043
with 132 consistently the most frequent and 321 & 312 consistently the least frequent.
I've read the article and it is very interesting.
I thought i've made that error (i've just started my study as programmer) shuffling a deck but..
I didn't use That naive algorithm i used another one..
I premit it is SLOW and STUPID, not at all smart and logical as yours but...
that's the java code
public static void shuffle(int[] a){//using an array of int..
Random ran = new Random();//i use 2 random because with just
Random ran2 = new Random();//one isn't AT ALL randomness
for (int i = 0;i<1000;i++){//swaping a lot of times(1000) 2 int, 1000 is a big number compared to 3, but it works "fine" even with 52
int kas =ran.nextInt(a.length);
int kas2 = ran2.nextInt(a.length);
swap(a,kas,kas2);
}
}
I try to calculate mine error with a deck of 3 and shuffling 6 million time.
Ok after 5 minute (yeah with a computer) of calculus (when I said that is slow I meant SLOW) my avarage error of 100/6 % is 0.027 (let me say: nothing).
And the KFY is (using a java method with Random class) 0.45% (let me say: a lot). I know that probably the 0.45 is caused by a lack of randomness in the Random class, but what I want to say is that my algorithm (ok naive, is a nice description) is PERFECTLY WORKING. Ok ok is true i didn't try with greater deck.i'll do. but that result speaks clearly to me! :D
Thank you and Bye
http://www.sgi.com/tech/stl/random_shuffle.html
This algorithm is described in section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming. Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981). Knuth credits Moses and Oakford (1963) and Durstenfeld (1964). Note that there are N! ways of arranging a sequence of N elements. Random_shuffle yields uniformly distributed results; that is, the probability of any particular ordering is 1/N!. The reason this comment is important is that there are a number of algorithms that seem at first sight to implement random shuffling of a sequence, but that do not in fact produce a uniform distribution over the N! possible orderings. That is, it's easy to get random shuffle wrong.
Alessandro Gentilini on December 24, 2007 12:24 PMI have a severe problem with the concept that a specific set of, for example, cards would have equal draws for each card after 6,000 or a jillion-trillion cycles of drawings. That's not random. Random has inequalities all thruout and in the final count. To use equal times drawn as the test of randomness is really missing the mark.
choim on January 26, 2008 11:20 PMGoing to sleep just now I realized what a stupid comment I made: because a deck of cards IS a specific set the only good test of random draws is exact equality of times drawn; every card will be drawn exactly one time per each time the deck is drawn. Sorry... I knew I was out of my league here. I was thinking big random whereas the discussion here is about little random. Go th Stumbleupon and you'll I commented that I may have embarassed myself. Glad I wasn't flippant.
choim on January 27, 2008 1:39 AMYES! I finally see the difference between the two algorithms! =D
Matthew on December 11, 2008 6:28 PMI actually stumbled upon this problem while writing a card shuffling algoithm, but noticed it in a different way.
This problem is easier to understand if you imagine the following real-world scenario:
Imagine that each card in a deck has a number value associated with it (Say, the card on top is 1, the next card down is 2, etc) and you decide to shuffle a single real world deck by asking someone for a random number between 1 and 52. You could then find the card in the deck with the corresponding number value and place it in your new "shuffled" deck.
This works great until the person goes "out of bounds" so to speak by guessing the number 52 after his 16th guess. Theres no longer 52 cards to choose from so you can't draw the 52nd card any longer. It makes sense conceptually to limit the range as you remove cards from the deck, so that the person can guess legitimate card numbers as you go.
It's the same scenario when using the random number generator. You need to limit the range so that only cards that are left in the stack can be pulled out and added to the shuffled pile.
Jason on January 5, 2009 3:30 PMI realize this is an old article and I might be beating a dead horse here, but I'd like to draw another concrete analogy.
Suppose we wanted a single random number from 1-5 inclusive, and all we had was a 6-sided die.
In the naive algorithm, if we rolled a 1, 2, 3, 4, or 5, everything would be hunky-dory. But we might roll a 6 and then just call it a 1 -- of course, in this case, the numbers 2-5 each have a 1/6 chance of being our random number, but the number 1 has a (1/6 + 1/6) chance of being our random number. We have no choice but to throw away the 6 if we want an even probability in 1-5.
On the other hand, if we asked one of our D&D-playing friends for a 20-sided die, we could map 4 numbers on the die to 1 of the random numbers we want; i.e., 1, 6, 11, or 16 would be counted as a 1, and 2, 7, 12, 17 would be counted as a 2, etc. Just something to keep in mind if the uniform random number you need is in a range of a power of 2 and you can request a certain number of random bits...
Mark Rushakoff on June 11, 2009 1:58 PMA couple of comments:
If you histogram the position of the first card in a large deck after an almost-shuffle, you'll get a much more dramatic picture showing the almost-shuffle is wrong. The first card will end up near the beginning way too much of the time.
I once found someone using the almost-shuffle in a large card-game experiment. At that point, they had spent several weeks running
half a million bridge hands on a small cluster. They had to start over. Good times.
In answer to an earlier query, if you want to pick just m cards randomly from a deck of n, it suffices to just run KFY for m steps and take the cards you've selected. This is provably right, and seems like it must be performance-optimal.
PO8 on July 16, 2009 10:23 AM| Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |