In my previous post on shuffling, I glossed over something very important. The very first thing that came to mind for a shuffle algorithm is this:
for (int i = 0; i < cards.Length; i++)
{
int n = rand.Next(cards.Length);
Swap(ref cards[i], ref cards[n]);
}
It's a nice, simple solution to the shuffling problem:
At first blush, this seems like a perfectly reasonable way to shuffle. It's simple, it's straightforward, and the output looks correct. It's the very definition of a naïve algorithm:
A naïve algorithm is a very simple solution to a problem. It is meant to describe a suboptimal algorithm compared to a "clever" (but less simple) algorithm. Naïve algorithms usually consume larger amounts of resources (time, space, memory accesses, ...), but are simple to devise and implement.An example of a naïve algorithm is bubble sort, which is only a few lines long and easy to understand, but has a O(n2) time complexity. A more "clever" algorithm is quicksort, which, although being considerably more complicated than bubble sort, has a O(n log n) average complexity.
But there's a deep, dark problem with this naïve shuffling algorithm, a problem that most programmers won't see. Do you see it? Heck, I had the problem explained to me and I still didn't see it.
Watch what happens when I use this naïve shuffling algorithm to shuffle a three-card deck 600,000 times. There are 3! or 6 possible combinations in that deck. If the shuffle is working properly, we should see each combination represented around 100,000 times.
As you can see, 231, 213, and 132 are over-represented, and the other three possibilities are under-represented. The naïve shuffle algorithm is biased and fundamentally broken. Moreover, the bias isn't immediately obvious; you'd have to shuffle at least a few thousand times to see real statistical evidence that things aren't working correctly. It's a subtle thing.
Usually, naïve algorithms aren't wrong -- just oversimplified and inefficient. The danger, in this case, is rather severe. A casual programmer would implement the naïve shuffle, run it a few times, see reasonably correct results, and move on to other things. Once it gets checked in, this code is a landmine waiting to explode.
Let's take a look at the correct Knuth-Fisher-Yates shuffle algorithm.
for (int i = cards.Length - 1; i > 0; i--)
{
int n = rand.Next(i + 1);
Swap(ref cards[i], ref cards[n]);
}
Do you see the difference? I missed it the first time. Compare the swaps for a 3 card deck:
| Naïve shuffle | Knuth-Fisher-Yates shuffle |
rand.Next(3); rand.Next(3); rand.Next(3); |
rand.Next(3); rand.Next(2); |
The naive shuffle results in 33 (27) possible deck combinations. That's odd, because the mathematics tell us that there are really only 3! or 6 possible combinations of a 3 card deck. In the KFY shuffle, we start with an initial order, swap from the third position with any of the three cards, then swap again from the second position with the remaining two cards.
The KFY shuffle produces exactly 3 * 2 = 6 combinations, as pictured above. Based on your experience shuffling physical cards, you might think the more shuffling that goes on, the more random the deck becomes. But more shuffling results in worse, not better, results. That's where the naïve algorithm goes horribly wrong. Let's compare all possible permutations of a 3 card deck for each algorithm:
| Naïve shuffle | Knuth-Fisher-Yates shuffle | ||||||||||||
|
|
You can plainly see how some of the deck combinations appear unevenly in the 27 results of the naïve algorithm. Stated mathematically, 27 is not evenly divisible by six.
Enough theory. Let's see more results. How about a four card deck, shuffled 600,000 times?
600,000 divided by 24 is 25,000; that's almost exactly what we see right down the line for every possible combination of cards with the KFY shuffle algorithm. The naïve algorithm, in comparison, is all over the map.
It gets worse with larger decks. Here's the same comparison for a six card deck.
With a 6 card deck, the differences between the two algorithms grow even larger. The math, yet again, explains why.
6! = 720 66 = 46,656
With 46,656 paths to only 720 real world outputs, it's inevitable that some of those paths will be severely over-represented or under-represented in the output. And are they ever. If you shipped a real card game with a naïve shuffle, you'd have some serious exploits to deal with.
I know this may seem like remedial math to some of you, but I found this result strikingly counterintuitive. I had a very difficult time understanding why the naïve shuffle algorithm, which is barely different from the KFY shuffle algorithm, produces such terribly incorrect results in practice. It's a minute difference in the code, but a profound difference in results. Tracing through all the permutations and processing the statistics helped me understand why it was wrong.
Naïve implementations are often preferred to complex ones. Simplicity is a virtue. It's better to be simple, slow, and understandable than complex, fast, and difficult to grasp. Or at least it usually is. Sometimes, as we've seen here with our shuffling example, the simplicity of the naïve implementation can mislead. It is possible for the code to be both simple and wrong. I suppose the real lesson lies in testing. No matter how simple your code may be, there's no substitute for testing it to make sure it's actually doing what you think it is.
With a Naive algorithm, a proof of correctness should be easy to construct.
If you can't construct a proof instantly in your head, then don't trust the algorithm.
Simon.
Simon Johnson on February 6, 2010 10:13 PMExcellent article, Jeff.
If anyone needs to test their PRNGs, get a copy of the George Marsaglia CD. Prof. Marsaglia researched randomness at FL State.
Here's one link to a mirror site:
http://i.cs.hku.hk/~diehard/cdrom/
===================
If anyone has a VB/VB.Net application using the intrinsic VB PRNG functions/statements, you should read my analysis of its weaknesses.
http://www.15seconds.com/issue/051110.htm
Microsoft finally acknowledged the weaknesses in its VB PRNG code by adding warnings to the online VB.Net code documentation.
Mark Hutchinson on February 6, 2010 10:13 PM@alvin:
Responding to a request for an algorithm to choose k n elements randomly from a list of n things, I can think of two algorithms offhand, which are essentially variants of each other.
If your list is short, you can simply shuffle it and take the first k elements of the shuffled list. This is what I'd use, provided it performed well enough for the application.
The second algorithm is something like this, assuming the function random(n) returns an integer r such that 0 = r = n-1 and that lists/arrays are indexed from 0:
let L be a copy of the list of things.
let 'results' be the name of the list of results, initially set to be the empty list.
n := length (L)
repeat k times:
selected = L [ random (n) ]
append selected to results
delete selected from L
n -= 1
end repeat
return results
This is a bit more efficient than shuffling the entire list just to return a subset of them, but, I suspect not so much so that it'll frequently make a big difference. As I said before, if the simple algorithm performs well enough, go with that.
a python programmer on February 6, 2010 10:13 PMNave algorithms are usually backed by nave test cases. In a proper test case, there should be multiple assertions that prove the same thing given different circumstances.
Jon Abaca on February 6, 2010 10:13 PM"It is possible for the code to be both simple and wrong."
Thanks for finally coming out with it. This is what I've been trying to tell people for years.
Unfortunately, there are some people here insisting that the wrong-ness of the naive algorithm here is simply common sense, and even if that's true, that is precisely the kind of arrogance that leads to these mistakes. ("Your algorithm falls outside my definition of 'simple', therefore, I still say simple is better.")
Oh well, I think you still got through to a few of them. Hopefully.
Aaron G on February 6, 2010 10:13 PMOnly time I implemented a shuffle, I split the desk into two lists, and merged, by randomly taking from either list. One one list was gone, the remainder of the other was appended. Maybe it's biased, but only as biased as a human doing a 'normal' shuffle. Think of it as a simulation of what normally happens when a bunch of guys light up cigars and sit around a table, not what happens in a fancy casino.
Rich Wilson on February 6, 2010 10:13 PMSome people questioned whether rand.Next(cards.Length) is correct. I've seen fragments of answers scattered throughout the comments, but not one complete answer.
Simply, the Length property returns the size of the array. The Next(int) method returns a random integer greater-than-or-equal to 0, and less than the argument you specify. So for the three-card shuffle, rand.Next(cards.Length) will return 0, 1, or 2. Since C# arrays are zero-based, this is exactly right.
Anon E Moose on February 6, 2010 10:13 PMShuffling playlists and slideshows is a similar problem, but more complex and with different requirements, like NOT being mathematically random.
http://feeds.feedburner.com/~r/blogspot/smartbear/~3/202195450/not-in-cards.html
Jason Cohen on February 6, 2010 10:13 PMI actually stumbled upon this problem while writing a card shuffling algoithm, but noticed it in a different way.
This problem is easier to understand if you imagine the following real-world scenario:
Imagine that each card in a deck has a number value associated with it (Say, the card on top is 1, the next card down is 2, etc) and you decide to shuffle a single real world deck by asking someone for a random number between 1 and 52. You could then find the card in the deck with the corresponding number value and place it in your new shuffled deck.
This works great until the person goes out of bounds so to speak by guessing the number 52 after his 16th guess. Theres no longer 52 cards to choose from so you can't draw the 52nd card any longer. It makes sense conceptually to limit the range as you remove cards from the deck, so that the person can guess legitimate card numbers as you go.
It's the same scenario when using the random number generator. You need to limit the range so that only cards that are left in the stack can be pulled out and added to the shuffled pile.
Jason on February 6, 2010 10:13 PMI was looking up a way to select a random subset from a set (well I came up with a solution and I wanted to see if it had a name). KFY is the way to do random shuffles, but should it be used for selecting, say 3 random items from 50k by shuffling then selecting the first 3? no, it's overkill. I can see there is an algorithm earlier in the posts where you select a random number and if it's already picked then try again. Also I can see someone did this:
function shuffle(arr) {
return arr.sort(function () { return Math.sin(Math.random() * 360); });
}
in javascript. This is wrong. If anyone is using a sorting method to get a "random sort" please stop. You get biases based on the implementation of the sort method. That function is the very definition of a naive algorithm!
anyway I thought I'd put my algorithm here for people to see, it's hard to explain how it works without a white-board but does pick a random subset of a set. If you increase the subset to be the size of the original set then you end up with the whole set randomized so you can also use it for a shuffle:
//num is size of subset
//arr is an array of your set
function pickRandom(num,arr){
var ret = [];
var l = arr.length;
for(var i=0;i
ret.push(Math.floor(Math.random()*(l-i)));
}
for(i=num-1;i>=0;i--){
for(var n=i-1;n>=0;n--){
if(ret[i]>=ret[n])
ret[i]++;
}
ret[i] = arr[ret[i]];
}
return ret;
}
The comments to this entry are closed.
|
|
Traffic Stats |