An Intuitive Explanation of Bayesian Reasoning is an extraordinary piece on Bayes' theorem that starts with this simple puzzle:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
This simple puzzle is not all that simple in practice. Only 15% of doctors, when presented with this situation, come up with the correct answer.
Can you come up with the correct answer -- without resorting to Google, the comments to this post, or reading the answer provided in the article?
If so, congratulations. You're a natural initiate of the Bayesian Conspiracy. For the rest of us, Bayes' Theorem is a bit more difficult to grasp:
While there are a few existing online explanations of Bayes' Theorem, my experience with trying to introduce people to Bayesian reasoning is that the existing online explanations are too abstract. Bayesian reasoning is very counterintuitive. People do not employ Bayesian reasoning intuitively, find it very difficult to learn Bayesian reasoning when tutored, and rapidly forget Bayesian methods once the tutoring is over. This holds equally true for novice students and highly trained professionals in a field. Bayesian reasoning is apparently one of those things which, like quantum mechanics or the Wason Selection Test, is inherently difficult for humans to grasp with our built-in mental faculties.
In computer science, it's easy to demonstrate the immense power of Bayes' theorem: it's the basis for almost all spam filters in use today. Bayesian email filtering was first publicized by Paul Graham's A Plan for Spam in mid-2002. Most programmers know about Bayesian filtering now; it's the primary weapon in any modern spam-fighting toolkit.
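As a rough illustration of how Bayes' theorem drives such a filter, here is a minimal Python sketch of a naive Bayes word scorer. It is not Paul Graham's exact formula; the function names `train` and `spam_probability`, the add-one smoothing, and the toy messages are all assumptions made for this example.

```python
import math
from collections import Counter


def train(messages):
    """Count word occurrences per class from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """P(spam | words) via Bayes' theorem, treating words as independent."""
    vocab = set(counts[True]) | set(counts[False])
    spam_words = sum(counts[True].values())
    ham_words = sum(counts[False].values())
    # Prior odds of spam; assumes at least one message of each class.
    log_odds = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        # Add-one smoothing so an unseen word cannot zero out the product.
        p_word_spam = (counts[True][word] + 1) / (spam_words + len(vocab))
        p_word_ham = (counts[False][word] + 1) / (ham_words + len(vocab))
        log_odds += math.log(p_word_spam / p_word_ham)
    return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    # Hypothetical training data, far too small for real use.
    data = [
        ("cheap pills buy now", True),
        ("cheap watches now", True),
        ("meeting agenda for monday", False),
        ("lunch on monday", False),
    ]
    counts, totals = train(data)
    print(spam_probability("cheap pills", counts, totals))
    print(spam_probability("monday meeting", counts, totals))
```

Working in log-odds keeps the product of many small probabilities from underflowing, which is a common design choice rather than anything specific to the filters discussed here.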
What you may not know, however, is that there's something even more effective than Bayesian spam filtering. It's eloquently described in William Yerazunis' presentation The Spam Filtering Plateau at 99.9% Accuracy and How to Get Past It (also available in pdf paper form). And it's been implemented as the CRM114 Discriminator for years. That technique is Markovian spam filtering:
How to change a Bayesian spam filter to a Markovian spam filter:
- Change the feature generator from single words to spanning multiple words
- Change the weighting so that longer features have more weight (i.e., longer features generate local probabilities closer to 0.0 and 1.0)
- The 2^2n weighting means that the weights are 1, 4, 16, 64, 256, ... for span lengths of 1, 2, 3, 4, 5 ... words
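The first and third items in the list above can be sketched in a few lines of Python, under simplifying assumptions: features are contiguous runs of 1 to 5 words, and a run of n words gets weight 4^(n-1), which reproduces the 1, 4, 16, 64, 256 sequence quoted above. The name `markov_features` and the `max_span` parameter are made up for this illustration, and this is not CRM114's actual implementation.

```python
def markov_features(text, max_span=5):
    """Return (feature, weight) pairs for every run of 1..max_span consecutive words."""
    words = text.lower().split()
    features = []
    for start in range(len(words)):
        for span in range(1, max_span + 1):
            if start + span > len(words):
                break  # the run would extend past the end of the text
            phrase = " ".join(words[start:start + span])
            # Assumed weighting: 4 ** (span - 1) yields 1, 4, 16, 64, 256.
            features.append((phrase, 4 ** (span - 1)))
    return features


if __name__ == "__main__":
    for phrase, weight in markov_features("buy cheap pills now"):
        print(weight, phrase)
```

A single-word Bayesian filter corresponds to `max_span=1`, where every feature has weight 1.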
In other words, where Bayesian filters examine the relationship between individual words, Markovian filters expand the scope to examine the relationship between words and phrases. It's a tweak, but a significant one that amplifies the accuracy of the already uncannily accurate Bayes' theorem.
But the true power of Bayes' theorem extends far beyond merely discriminating between spam and non-spam. As the CRM114 documentation notes, you can use these powerful statistical models to discriminate between... well, just about anything:
Spam is the big target with CRM114, but it's not a specialized Email-only tool. CRM114 has been used to sort web pages, resumes, blog entries, log files, and lots of other things. Accuracy can be as high as 99.9%. In other words, CRM114 learns, and it learns fast.
Now perhaps you can understand why some people are so excited about Bayes' theorem.
Maybe you see Bayes' theorem, and you understand the theorem, and you can use the theorem, but you can't understand why your friends and/or research colleagues seem to think it's the secret of the universe. Maybe your friends are all wearing Bayes' theorem T-shirts, and you're feeling left out. Maybe you're a girl looking for a boyfriend, but the boy you're interested in refuses to date anyone who "isn't Bayesian". What matters is that Bayes is cool, and if you don't know Bayes, you aren't cool.
Why does a mathematical concept generate this strange enthusiasm in its students? What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case? What is the secret that the adherents of Bayes know? What is the light that they have seen?
It's not intuitive for most people, but look a little more closely, and I think you, too, will become an initiate of the Bayesian conspiracy.
Sounds great to me
Tom on May 1, 2007 12:28 AM
OK, what do you make the probability?
I make it a little over 8.4%.
Why is that wrong? It seems simple enough to me.
Given no Google, comments, or reading the article, I turned to the next best thing: my MATH Probability 3310 class notes. In my second test review sheet (written by me), I found this amazing gem:
Bayes Formula: P(E) = P(EF) + P(EF^c)
Given that I don't know what the letter 'c' denotes, I came up with the answer of 125.2%. This is why a Computer Science degree is important for any good programmer. And why I need to learn to take better notes.
David Sokol on May 1, 2007 1:00 AM
No, I have no idea how Bayes' Rule works, so here's a go with high school math and common sense:
Of routinely screened women:
99% don't have it, BUT 99 * .096 = 9.504% will test positive anyways
1% do have it, BUT ONLY 1 * .8 = .8% will test positive
So, of those routinely screened women:
.8 / (9.504 + .8) = .0776
There's a 7.76% chance that she's actually got it. Less than 10% true positive! Yikes!
This seems way low. Am I wrong to take the 1% occurrence rate into account?
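The arithmetic in this comment can be checked with a short Python sketch. The function name `posterior` and its parameter names are chosen for this example only; the three numbers come straight from the puzzle. The 1% rate enters as the prior, and it is what pulls the result so far below 80%.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) by Bayes' theorem."""
    true_pos = prior * true_positive_rate          # has it and tests positive
    false_pos = (1 - prior) * false_positive_rate  # does not have it, tests positive anyway
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    p = posterior(prior=0.01, true_positive_rate=0.80, false_positive_rate=0.096)
    print(f"{p:.4f}")  # prints 0.0776
```

Raising `prior` while leaving the two test rates alone shows how strongly the answer depends on the base rate.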
From 10+-year-old memories:
Draw up a 2x2 table. The columns represent testing positive and testing negative; the rows represent having cancer and not. We will work with probabilities in each cell, although using cardinalities from a population of, say, 10000 will yield whole numbers throughout, which some may find easier.
Statement 1 tells us that the row totals are in the proportion of 1:99.
Statement 2 tells us that in row 1, the columns are in the proportion of 80:20.
We know that row 1 sums to 0.01, so row 1 is 0.008 | 0.002.
Statement 3 tells us that in row 2, the columns are in the proportion of 9.6:(100-9.6).
We know that row 2 sums to 0.99, so row 2 is 0.09504 | 0.89496.
We have the complete table. And the figure we are asked for is, given we are in column 1, what is the probability we are in row 1. This is (row 1, column 1) / [(row 1, column 1) + (row 2, column 1)] (*), or
0.008 / (0.008 + 0.09504)
= 0.008 / 0.10304
= 25/322 ~= 7.76 %
(*) This is Bayes' Theorem, right? P(A | B) = P(A & B) / P(B) ?
Larry Lard on May 1, 2007 1:38 AM
Isn't most of the first statement a red herring?
"9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening."
From the 1st sentence 9.6% positives are in error, so 90.4% are correct.
So a woman with a positive has a 90.4% chance of having cancer.
The first two sentences are irrelevant once the result of the mammography is known.
Keith on May 1, 2007 1:47 AM
There's an error in the first statement, repeated from the link: http://www.yudkowsky.net/bayes/bayes.html
They go on to give alternative versions, and the questions are different.
"A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?"
to
"If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?"
The last statement fundamentally asks a different question. Bayes' theorem applies to the second statement but not the first.
Keith (again) on May 1, 2007 1:54 AM
I made it the correct 7.8% without reading the solution, but I did have to look up the Bayes formula. Still, it isn't that tough.
Ben on May 1, 2007 1:58 AM
P(True Positive) / (P(True Positive) + P(False Positive))
0.8 / (0.8 + 0.096)
= 89.3%
Of course the calculation is of no importance (as always people here are distracted by the examples given =P ) - the important finding to me is that the general public has a hard time dealing with sets, logic, probability, or other non-arithmetic maths.
I forget how many times I convince people that all types of games in casino have negative expected value.
Kevin on May 1, 2007 2:16 AM
Keith, your reasoning is wrong - 9.6% of women without breast cancer get a positive result anyway. That does not imply that 90.4% of positives are correct, but rather that 90.4% of women without breast cancer get a negative - very different things.
Kevin, your reasoning is wrong, since the chance of a true positive given a random person isn't 0.8, but rather 0.01*0.8, and the chance of a false positive is 0.99*0.096.
The end result then is (0.01*0.8) / (0.01*0.8 + 0.99*0.096) =~ 0.078. A woman with a positive test result has only a 7.8% chance of actually having breast cancer.
Remco Gerlich on May 1, 2007 2:25 AM
Okay, now I've read the article, and it turns out I got it about right (7.76%)... although the article really is painfully slow. Given a nice separation of the relevant facts, like in the stated problem, this should be pretty straightforward.
So, if 15% of doctors get the correct answer, 99% of people don't believe that doctor when given the (correct) extraordinarily low figure, and 99% of people DO believe the doctor when he gives the wrong (much higher) figure, how likely are you as a patient to understand the correct information concerning your condition when you get a positive result on your test? This doesn't bode well...
Chris Moorhouse on May 1, 2007 2:36 AM
I tentatively suggest 7.76% (or a shade over).
I did study this at university, but it was so long ago that I think the Rev. Bayes took the lecture himself. I definitely remember a formula along the lines of David's, and I too have no idea what "c" denotes. However, I worked it out as follows:
1% of women have breast cancer
of these, 80% will have positive tests
= 0.8%
99% of women do not have breast cancer
of these, 9.6% will have positive tests
= 9.504%
So:
(adding these mutually exclusive groups): 10.304% of women will have positive tests.
of these 10.304%, the 0.8% group are the ones with breast cancer,
so if you are in the 10.304% who received the positive test, your probability of having cancer is
0.8%/10.304% =~ 7.76%
I humbly await correction.
I got 0.8333 but I was working without paper and dropped a clanger.
Reasoning went:
Say there's 1000 women. 10 have breast cancer, so 8 have positive test results.
990 of them don't have breast cancer, so 990 * 0.096 = 95 have positive tests results.
Then, like an idiot, I divided 8 by 95 and got .083 instead of dividing 8 by 103 and getting 0.077, so out of a 1000 women who get a positive result from a breast cancer screening test, 77 of them, on average, actually have breast cancer.
I'm pretty confident that, if the original problem had been couched in whole number terms, more people would get the right answer. Probabilities expressed as percentages can obscure at least as much as they reveal, at least, in matters of public health like breast cancer screening.
Piers Cawley on May 1, 2007 3:42 AM
Yep, 7.76%. On reading the first wording of the question I couldn't really figure it out, but then as soon as I read the words '10 out of 1000 women...' it clicked - I won't repeat the working above.
I find it really surprising how many intelligent people have trouble with probability questions (stuff like this, and the Monty Hall/game show problem), even after sitting down with pen and paper. They just refuse to believe numerical reasoning over their own intuition.
James on May 1, 2007 3:43 AM
Remco: You've proved to me that I shouldn't read this blog and do math during office hours. Now just read the articles and thanks for correcting me!
Interesting blog, as always.
Kevin on May 1, 2007 4:18 AM
Chris Moorhouse: Am I wrong to take the 1% occurrence rate into account?
I think the important question is: "Where do you get a 1% occurrence rate from?"
The only 1% was for number of people who get tested...
Telos on May 1, 2007 5:07 AM
I guess I'm dense. The first sentence says 1% of women who participate in screening HAVE BREAST CANCER. It goes on to blah blah blah about false positives, then asks the probability that the woman who participates has breast cancer. It doesn't ask what is the probability that her positive mammogram is really cancer, vs a false positive. It asks the probability that she has cancer, which is stated in the first sentence as being 1%.
Perhaps, this is why I did not do as well in some classes as others.
Do I win the doofus prize?
--dang
Woh, it's fizzbuzz all over again.
I'm surprised there was no mention of the Monty Hall Problem. I think it's a good example of Bayes theorem.
CR on May 1, 2007 5:24 AM
Isn't the answer: .01 probability (1%)?
Seems to me the first sentence gives the answer, and all the stuff about positive mammographies is irrelevant.
(But, the Bayesian discussion after the question probably indicates I'm wrong.)
I thought I was going mad thinking I was the only person to get to the answer of 1% until I got to the bottom of the comments. The answer is in the question:
"1% of women at age forty who participate in routine screening have breast cancer...<snip>...What is the probability that she actually has breast cancer?"
Heck, it's the first word! Is the problem that of indirection to make you think that the numbers in the middle actually mean something?
Paul on May 1, 2007 6:02 AM
You have to pay attention to the entire question:
"A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?"
It is not asking for the probability that a woman (who is routinely tested) has breast cancer.
It asks for the probability that a woman who tests positive (in the routine testing) has breast cancer.
David Janke on May 1, 2007 6:06 AM
Wow! It's amazing the number of people who comment on here with wild numbers and theories, without actually taking ten seconds to look at the linked article.
Hint: if an article states that most people get this wrong, then please check you're not one of them before posting a reply! It's like "FizzBuzz" all over again!
Graham Stewart on May 1, 2007 6:12 AM
10% of people who read this blog post replies. Of those 10%, 100% have to enter a captcha word. Of these, 100% have to enter the word "orange". What is the chance that you typed "orange" in order to reply to this post?
Will Sullivan on May 1, 2007 6:20 AM
@Will
I didn't have to type the word you describe. You did it for me.
Craig on May 1, 2007 6:41 AM
I wrote a statistical spam filter two years before Paul Graham's. Worked pretty well, and some people converted it into an open source project.
http://www-cse.ucsd.edu/~wkerney/spamfilter.README
http://www-cse.ucsd.edu/~wkerney/spamfilter.tar.gz
"Bayesian Conspiracy"? Please. Conditional probability is covered in every lower division probability class. It's probably the first actual interesting thing you learn in probability... but it's not hard to understand.
wkerney on May 1, 2007 6:54 AM
"A woman in this age group had a positive mammography in a routine screening."
It can't be 1% because you know the test result.
joe beam on May 1, 2007 6:54 AM
I keep getting spam that defeats Bayesian filtering. However, I don't understand how the spammers think anyone is actually going to read the spam messages buried in the middle of hundreds of random words and phrases.
Obviously I can't train my filter on these messages as they are primarily filled with non-spam content. Doing so would just train my filter to mark all content as spam.
Sending every message with a $ to the spam account helps but sometimes some e-mail from grandma ends up there too. I can live with it, but it is extremely annoying to have to delete the messages which are so obviously spam.
gerrr on May 1, 2007 7:12 AM
@ wkerney: clearly it is hard to understand, or so many people wouldn't get it wrong. I think it's mainly that people get all confused thinking about this nebulous concept of probability, rather than assuming some whole number of events and working from there.
@ half the others in this thread: the whole point of this type of problem is conditional probability - the way in which additional information alters what you know about something. Yes, the probability of any arbitrary woman having cancer is 1%, independent of whether she's been tested. The question is asking you about this particular woman, who you know more about - you know the result of her test, and that affects what you know about her probability.
I figure most of the people that read this blog are software developers, which makes it doubly surprising that some don't get this. I don't mean to sound pompous - I needed a little help on my way to the answer - but the fact that some folks can't follow the working at all troubles me. FizzBuzz indeed.
James on May 1, 2007 7:41 AM
1%? Oh dear, people. Please apply logic. If all the stuff about positive and negative mammograms was irrelevant, then why would anybody bother HAVING one done? The fact that the result was positive must have some implication for the chance of having breast cancer, otherwise the test would be useless.
Nchantim on May 1, 2007 7:45 AM
Well, given the first statement, you might conclude that if she's tested, the chance of having breast cancer is 1%. And that IS true, IF we don't know the results from the test. That's actually not the same as saying that 1% of 40-year-old women have breast cancer. It's saying 1% of those who are screened routinely have breast cancer.
Nchantim on May 1, 2007 7:51 AM
Hmmm. I got around half way down that huge page and got confused. There was a problem on eggs and pearls that it does not give the answer to and I don't know how to answer. Unhelpful. The reuse of almost exactly the same example problems doesn't help either.
Asd on May 1, 2007 8:01 AM
Remember: per the article, only 15% of doctors get this question right. If the question is rephrased in different ways, their accuracy goes up.
http://www.yudkowsky.net/bayes/bayes.html
Let me be the first to say that I'm one of the 85%. I *don't* find this intuitive, and I absolutely would have gotten the question wrong. It's difficult for me to see past the accuracy of the individual breast cancer test, which is 80%.
I would also expect, knowing what I know about human nature, that most of the commenters will get the question right-- people who aren't sure of their answer aren't likely to make it permanent in a comment box, either. So allow me to compliment those of you who got it wrong in a comment: at least you're honest. :)
Jeff Atwood on May 1, 2007 8:02 AM
Here's my solution, and a comment on the implications in this application.
In the following, let C denote cancer, M+ denote a positive mammogram, and C^ and M+^ denote not having cancer and not having a positive mammogram, respectively.
The probability that a woman in the age range has breast cancer is P(C) = 0.01.
The probability that a woman's mammogram is positive given that she has breast cancer is P(M+ | C) = 0.8.
The probability that a woman's mammogram is positive given that she doesn't have breast cancer is P(M+ | C^) = 0.096.
We are required to find the probability that a woman has breast cancer given her mammogram is positive, P(C | M+).
From conditional probability:
P(M+ | C)P(C) + P(M+ | C^)P(C^) = P(M+)
so P(M+) = (.8 * .01)+(.096*(1-.01)) = 0.1030
Bayes' theorem states that:
P(M+ | C) = P(C | M+)P(M+) / P(C)
We re-arrange to find:
P(C | M+) = P(M+ | C)P(C) / P(M+) = .0777.
So, the probability of a woman having breast cancer given that she has a positive mammogram is just under 10%.
Assuming these probabilities are correct, one might be tempted to say that this is unacceptable. However, it is worth thinking whether it would be preferable for fewer positive mammograms to be reported (given it's hard to know which ones are true positives and which are false). Doing so would result in fewer (false) scares---but would also result in more cancers being missed. It is also worth asking whether a better method for detecting breast cancer even exists (hint: not yet).
CR on May 1, 2007 8:08 AM
I think it is easier to frame the question if you start with a set diagram that plots the populations with the various attributes properly.
Note that the initial 2 populations (with and without cancer) are mutually exclusive, but other sets (like who take the tests) overlap with both populations.
It becomes even better if you try drawing them roughly in proportion, e.g. making the cancer set smaller.
Krishna Kumar on May 1, 2007 8:39 AM
One thing you can do is simulate the situation described, e.g.:
(defun random-woman ()
  "Return (status . result) for one randomly screened woman."
  (if (< (random 100.0) 1)
      ;; One percent of women have cancer
      (cons 'cancer
            (if (< (random 100.0) 80)
                ;; Of whom 80% test positive
                'positive
                'negative))
      ;; of the others
      (cons 'well
            (if (< (random 100.0) 9.6)
                ;; 9.6% false positives
                'positive
                'negative))))

(loop repeat 100
      for (status . result) = (random-woman)
      ;; AND keeps both COUNT clauses under the WHEN test; without it the
      ;; second COUNT would also include women with cancer who tested negative
      when (eql result 'positive)
        count (eql status 'well) into false-positive
        and count (eql status 'cancer) into true-positive
      finally (format t "~&With ~D true positives and ~D false positives ~$% of positives are true positives."
                      true-positive
                      false-positive
                      (* 100 (/ true-positive
                                (+ false-positive true-positive)))))
With 2 true positives and 9 false positives 18.18% of positives are true positives.
This kind of Monte Carlo approach is hopeless if you want an accurate
answer, but it offers you something else instead.
Think of a typical primary care doctor. After he has ordered 100 mammograms and got the results of follow-up investigations, how do things appear? Running the simulation a few times:
With 2 true positives and 5 false positives 28.57% of positives are true positives.
With 0 true positives and 12 false positives 0.00% of positives are true positives.
With 0 true positives and 11 false positives 0.00% of positives are true positives.
With 2 true positives and 10 false positives 16.67% of positives are true positives.
With 1 true positives and 5 false positives 16.67% of positives are true positives.
With 2 true positives and 18 false positives 10.00% of positives are true positives.
With 1 true positives and 10 false positives 9.09% of positives are true positives.
The results are all over the shop. So my guess is that doctors working in primary care have varied experiences, with some seeing mammograms actually detecting cancers and others seeing only false positives.
Alan Crowe on May 1, 2007 8:40 AM
I decided to use Excel, and it worked!
I mocked up a study with 1000 women being screened:
10 women actually have cancer (1% of 1000)
8 of the women who have cancer test positive (80% of 10)
95 women show false positives (9.6% of the 990 cancer-free women)
103 women test positive (8 + 95)
The 8 true positives are 7.76% of the 103 total positives. (8/103 = 0.0776)
For some reason I was uncomfortable solving this in a purely abstract manner and sticking to the problem description of sample sets of women really helped me reason through this.
Daniel Pritchett on May 1, 2007 8:43 AM
I'm also a bit creeped out that only 46% of the doctors tested got this simplified version of the problem right:
100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammography. 950 out of 9,900 women without breast cancer will also get a positive mammography. If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?
Guess doctors don't have to be math whizzes, do they?
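The whole-number framing quoted in this comment translates directly into code. The sketch below, with the illustrative name `natural_frequencies`, converts the puzzle's rates into rounded head counts for a population of 10,000 and then divides true positives by all positives. Rounding to whole women is an assumption made to match the wording of the simplified question (950 rather than 950.4 false positives).

```python
def natural_frequencies(population, prior, true_positive_rate, false_positive_rate):
    """Convert rates into whole-number counts for a given population."""
    with_cancer = round(population * prior)
    without_cancer = population - with_cancer
    true_positives = round(with_cancer * true_positive_rate)
    false_positives = round(without_cancer * false_positive_rate)
    return with_cancer, true_positives, false_positives


if __name__ == "__main__":
    sick, true_pos, false_pos = natural_frequencies(10_000, 0.01, 0.80, 0.096)
    print(sick, true_pos, false_pos)  # prints 100 80 950
    fraction = true_pos / (true_pos + false_pos)
    print(f"{fraction:.1%}")  # prints 7.8%
```

With counts in hand, the only remaining step is 80 / (80 + 950), which is the form in which more of the doctors got the answer right.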
i think we are missing the point here that 0% of men have breast cancer. based on the comments, only about 1 in 70+ people realize this :D
Darren Kopp on May 1, 2007 9:00 AM
The author references heavily the work of Daniel Kahneman and Amos Tversky, especially their book Judgement Under Uncertainty, which is an outstanding reference for those interested in how humans make (often fallible) decisions.
The most accessible introduction to their work, IMHO, was an article written in Discover Magazine in 1985 titled "Decisions, Decisions" (see footnote below). It would be well worth the time to print this article from the microfiche files in some academic library. The closest online reference of their work I've found is at:
http://www.hss.caltech.edu/~camerer/Ec101/JudgementUncertainty.pdf
Kevin McKean, "Decisions, Decisions." Discover, June 1985 pp. 22-31
J.D. on May 1, 2007 9:03 AM
When I approach these kinds of problems I often find it conceptually easier to deal with actual numbers rather than percentages. It's the same calculations in the end, just easier for my brain to reason about.
So given 1000 women who go in for testing, 10 (1%) will have breast cancer on average, which means that 990 do not have breast cancer. Of the 10 who have cancer, 8 (80%) will test positive. Of the 990 who are cancer free, 95.04 (9.6%) will test positive. So in total, 103.04 women will test positive, but only 8 of those actually have cancer. 8/103.04 ~ 7.76%
Many probability problems are relatively simple when it comes to the actual calculations. Often times the hard part is finding the right calculation to do. You need to carefully look at exactly what each percentage is really saying. It's easy to get stuck on the 80% or 9.6% figure without realizing they aren't directly dividing the groups you're trying to reason about.
Mike Pavone on May 1, 2007 9:04 AM
Setting aside all the Bayesian chatter, I am wondering if the Hidden Markov Model is similar to what Apple does with their Mail application (see http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html). The MacDevCenter article talks about using vector space -- the combination of words found together cluster into volumes of high-dimensional space. The HMM might do something similar (although I suspect CRM114 is order-sensitive, whereas Mail isn't).
Still, even though Apple Mail is pretty good, I'd hardly call it better than 98%. I have to write special rules for image spam, and it doesn't seem to look at tags.
Dan Neuman on May 1, 2007 9:06 AM
Thanks for everyone's explanations.
@Darren: men do get breast cancer, but in much smaller percentages. I vaguely recall seeing a news special about it, with one of the comments being that there aren't the support groups for men that there are for women. A quick web search can find medical information about it. I can't vouch for the support groups.
--dang
Daniel Dang Griffith on May 1, 2007 9:11 AM
The real problem is that both Bayes and Markov are theorems within mathematical statistics (what real statisticians call statistics; not at all what baseball junkies mean when they use the word). Normally, statistical inference lies on the bedrock of independence (whether the practitioner realizes it or not; econometricians widely/wildly ignore the requirement). Bayes says phooey.
The main point being, to quote Allen Holub in a negative way, it doesn't pay to know a little bit about SQL or mathematical statistics. Or relational databases. Or anything else that requires real thinking.
Probability that any randomly selected woman will test positive:
P(+) = P(+|bc)P(bc)+P(+|~bc)P(~bc) = .8*.01+.096*.99 = .10304
Now apply Bayes Rule:
P(bc|+) = P(+|bc)P(bc)/P(+) = .8*.01/.10304 = .07764
This is silly. The term "Bayesian" has not a damn thing to do with words, single words, sentences, etc. It's a simple method dealing with prior, likelihood, and posterior probabilities and methods of determining parameters from data. Markovian simply means that the infinite past's influence on the present is minimal. So, whatever relation to spam filtering this has needs to PICK NEW WORDS. Stop confusing terms, it just leads to confusion. I know 'cause I've been guilty of it myself before.
The answer is NOT 7.7764%!
Read for content, people. The lady already has taken a mammogram, so that 1% deal doesn't apply. She's also got a positive result, so the question is: given a positive result, what's the probability she has it? And the answer is 80%, because that's exactly what it tells you in the second sentence.
Brad on May 1, 2007 11:14 AM
Am I wrong or what?
1st statement: 1% HAS CANCER, 99% WITHOUT CANCER.
2nd statement: 80% who HAS CANCER gets positive results. This leads to:
(a) 1% * 80% - has cancer with positive.
(b) 1% * 20% - has cancer without positive.
3rd statement: 9.6% of women WITHOUT CANCER gets positive:
(c) 99% * 9.6% - no cancer, but positive
(d) 99% * 90.4% - no cancer, negative.
She got positive. That means that she is in (a) or in (c).
From all participants 10.304% (0.01*0.8 + 0.99*0.096) GETS POSITIVE results.
From all participants 8% HAS CANCER and GETS POSITIVE.
So 10304/8000 or 77.63% +/-0.005% who GOT POSITIVE result HAS CANCER.
Domas on May 1, 2007 11:16 AM
Brad,
Don't get so upset, and practice your reading comprehension. 80% is the probability that she got a positive result, given that she has cancer. We are interested in the reverse: the probability that she has cancer, given that she got a positive result. The two are not the same.
Jason on May 1, 2007 11:19 AM
Read for content, indeed. I misread the first sentence, thought it said 1% of women get tested. Whoops!
Brad on May 1, 2007 11:19 AM
Here's my solution (I just finished a statistics course at Washington State University, go Cougs!)
The problem statement tells us that the probability that a screened woman has cancer is 1%. P(cancer) = P(c) = 0.01
Conditional probability comes in here, and we apply the notation of P(A|B) = the probability that A is true given that B is true.
It tells us that the probability that a woman will get a positive result given that she has cancer is 80%. P(positive|cancer) = P(p|c) = 0.8
It tells us that the probability that a woman will get a positive result given that she does not have cancer is 9.6%. P(positive|NOT cancer) = P(p|NOT c) = 0.096
Our knowledge of conditional probability tells us that P(A|B) = P(A and B) / P(B). That is, the probability of A being true given that B is true is equal to the probability that A and B are true, divided by the probability that B is true. Drawing a Venn diagram can help to clarify this.
We can now solve for P(positive AND cancer) = P(p|c)/P(c) = 0.008
P(NOT cancer) = 1 - P(cancer) = 0.99
P(positive AND not cancer) = P(p|NOT c)/P(NOT c) = 0.09504
P(positive) = P(p AND c) + P(p AND NOT c) = 0.10304
Finally, P(c|p) = P(c AND p)/P(p) = .0776
Given this data, the probability that any positive result corresponds to a woman with cancer is 7.76%
shoez on May 1, 2007 12:00 PM
Oops, I made a couple of typos. These lines are correct:
We can now solve for P(positive AND cancer) = P(p|c)*P(c) = 0.008
P(positive AND not cancer) = P(p|NOT c)*P(NOT c) = 0.09504
@Dan Neuman:
As far as I know, the CRM114 uses a combination of techniques (including HMMs) to catch spam.
With regard to hidden Markov models, the number of words the model "remembers" depends on its order: a first-order HMM takes into account only the last word it saw when deciding which state to move to next. A second-order HMM takes into account the last two states it was in.
Most models I've seen are first-order, because the computational and memory cost grows exponentially with the order of the HMM.
If you're curious about HMMs, a great resource is Durbin (et al.)’s "Biological Sequence Analysis", or Rabiner's classic tutorial "A tutorial on hidden Markov models and selected applications in speech recognition".
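To give a rough feel for the cost growth mentioned above, the sketch below counts the entries in a dense transition table for a Markov model of a given order. It assumes a plain dense table and an arbitrary 20 states; `transition_entries` is a hypothetical helper, and real implementations (including CRM114) may store things quite differently.

```python
def transition_entries(num_states: int, order: int) -> int:
    """Entries in a dense transition table for an order-k Markov model.

    There are num_states**order possible histories, and each history
    needs one probability per possible next state.
    """
    return num_states ** (order + 1)

# Assumed example: 20 states, orders 1 through 3.
for order in (1, 2, 3):
    print(order, transition_entries(20, order))
```

Each extra step of history multiplies the table size by the number of states, which is why first-order models are the common choice.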
Edward Ocampo-Gooding on May 1, 2007 12:35 PM
I'd be embarrassed to display my pathetic attempts at solving this - thanks for the links. I can scarcely remember this concept from a stats class, but I doubt I really understood it even then.
It looks very interesting - hopefully I'll become a co-conspirator soon.
David H. on May 1, 2007 12:43 PM
IANAD (I'm not a doctor) but rather a math teacher, and I actually presented Bayes' formula the other day, so it wasn't too hard to get the 7,76%.
Inspired by Alan Crowe, here is a little Scheme program that simulates a number of trials. After 1000000 trials I got 0.077307 as the result.
(define (experiment)
  (if (<= (random) 0.01)
      ; breast cancer
      (if (<= (random) 0.80)
          'cancer-pos
          'cancer-neg)
      ; well
      (if (<= (random) 0.096)
          'well-pos
          'well-neg)))
; counts the cancer cases among n positive results;
; negative results are discarded and the experiment is redrawn
(define (trials n)
  (if (= n 0)
      0
      (let ([result (experiment)])
        (case result
          [(cancer-pos) (+ 1 (trials (sub1 n)))]
          [(well-pos) (trials (sub1 n))]
          [else (trials n)]))))
(let ([m 1000000])
  (/ (trials m)
     (* 1.0 m)))
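For readers who don't have a Scheme interpreter handy, the same simulation idea can be sketched in Python. This is a loose translation under a few assumptions: the function name `simulate`, the fixed seed, and the smaller sample of 50,000 positive results are choices made here, not part of the original program.

```python
import random

def simulate(n_positive, prior=0.01, sens=0.80, fpr=0.096, seed=0):
    """Draw screened women until n_positive positive tests are seen and
    return the fraction of those positives who actually have cancer."""
    rng = random.Random(seed)
    seen = 0
    with_cancer = 0
    while seen < n_positive:
        has_cancer = rng.random() < prior
        positive = rng.random() < (sens if has_cancer else fpr)
        if positive:               # negative results are simply discarded
            seen += 1
            with_cancer += has_cancer
    return with_cancer / n_positive

estimate = simulate(50_000)
print(estimate)  # should land near the exact value of about 0.0776
```

Like the Scheme version, it conditions on a positive result by throwing away the negatives, so the ratio it reports is an estimate of P(cancer | positive).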
7.76%
Eric Falcao on May 1, 2007 1:34 PM
Got the reasoning right, then had trouble multiplying numbers on paper.
As you can guess, I have a degree in math, and working on another in stats...
Marc on May 1, 2007 2:01 PM
I took a swing at this and emailed my results to a good friend who is a medical doctor and statistician and who does research on the statistical effectiveness of test screenings. She kindly sent me back a spreadsheet showing a 2x2 grid filled in with sample numbers. Boy was I ever far off the mark.
I suggest that anyone here who has trouble with this email a friend who happens to be a medical doctor / statistician / researcher.
Hmm, seems I've done this slightly wrong. But landed in the right ball park.
1% of the screened women have cancer
80% of those get a positive test
So of all tested women 0.8% will have a positive test and have cancer
On top of this 9.6% of the women without cancer will get a positive test
so 0.8+9.6 = 10.4% of all tested women will get a positive result
of these 0.8/10.4 = 7.7% will have cancer
By the way, isn't this about the Base-Rate Fallacy?
http://www.fallacyfiles.org/baserate.html
John Nilsson on May 1, 2007 2:33 PM
I eyeballed this in about 15 seconds.
1000 patients. 1% cancer rate = 10 have cancer, 990 don't.
Of the 10, 8 test positive, 2 don't.
Of the 990, ~10% false positive rate means ~90 test positive, ~900 don't.
Of those testing positive, 8 / ~90 gives a chance just shy of 10%.
Adam on May 1, 2007 3:31 PM
You're all wrong. If the original group is 10k women, 1030.4 will test positive, and 100 will have cancer, so the answer is about 1 in 10.
Ernie Bornheimer on May 1, 2007 3:32 PM
It is frightening that only 15% of doctors get this right. But what does "get this right" mean? One can just look at two statements - 1% of women have cancer, and 9.8% of results are positive - and instantly see that fewer than 1 in 10 of these test results are a valid positive, without doing any real math. For a doctor, this is probably enough, but EVERY doctor needs to see this right away.
AIDS is another example of this phenomenon. A false positive AIDS result occurs up to 0.0007% of the time (7 in 1,000,000). Given 23,000 heterosexual AIDS cases in the US, a purely heterosexual, non-drug-using male with a positive test has a 1 in 20 chance of being AIDS free. But this number falls depending on the person's Bayesian priors - has he been a monk for the past 40 years? Has he visited prostitutes? Is he white? Has he been in prison?
Ignatius Gorgonzola on May 1, 2007 4:02 PM
This is just high school data management. Let B represent having breast cancer, !B not having breast cancer, P testing positive, !P testing negative. I filled in the known values, and since each set of branches must add up to 1, the other branch is just 1 minus the known one. For example, you know that the chance of having breast cancer is 0.01 (B), so the chance of not having it is just 1-0.01=0.99 (!B).
Now to just draw a tree diagram with each level representing a different stage (sorry for the bad ASCII art):
               / \
              /   \
             /     \
            /       \
       0.01/         \0.99
          B           !B
      0.8/ \0.2  .096/ \0.904
        /   \       /   \
       P     !P    P     !P
Since probabilities are: (possibility) / (all possibilities)
Since we are given that the person has tested positive it is: (B&P) / P
(I'm guessing this is the theory but I never learned this with a name)
Then to get B&P just do 0.01*0.8=0.008
Then to get just P you add B&P and !B&P =0.01*0.8+0.99*0.096=0.10304
Therefore the probability is just 0.008/0.10304 which is about 7.76%
Although this is a very mathy question, I would expect doctors to know how to figure something like this out, because it matters when telling a patient the actual chances of having a disease after testing positive or negative for it, especially if the test is not very accurate and a positive result should be followed by another test.
chris on May 1, 2007 5:04 PM
Wooo! I got it right... but only because I took a probability theory course a year ago, and AI this semester.
Tom on May 1, 2007 6:42 PM
Really, the only thing you need to remember is this:
P(A | B) = P(A and B) / P(B)
In words, that's: the probability of A given B is equal to the probability of A and B divided by the probability of B.
Given:
P(p | c) = 0.80 (probability of positive result given they have cancer)
P(p | ~c) = 0.096 (probability of positive result given they DON'T have cancer)
P(c) = 0.01 (probability they have cancer)
Goal:
P(c | p) (probability they have cancer given a positive result)
Work:
1) Need P(c | p):
Same as P(c and p) / P(p)
2) Need P(c and p):
Know: P(p | c) = 0.80 = P(p and c) / P(c) and that P(c) = 0.01
P(p and c) = 0.80 * P(c) = 0.80 * 0.01 = 0.008
3) Need P(p):
Same as P(p and c) + P(p and ~c)
4) Have P(p and c), need P(p and ~c)
Know: P(p and ~c) / P(~c) = P(p | ~c) = 0.096 and that P(~c) = 1 - P(c) = 0.99
P(p and ~c) = P(p | ~c) * P(~c) = 0.096 *0.99 = 0.09504
Back to #3: P(p) = P(p and c) + P(p and ~c) = 0.008 + 0.09504 = 0.10304
Back to #1: P(c | p) = P(c and p) / P(p) = 0.008 / 0.10304 = 0.07764
There's your answer: 7.76%
Tom on May 1, 2007 7:02 PM
Isn't the correct answer: "the probability cannot be determined"?
People resort to the equations before they understand the question:
1) 1% of women at age forty who participate in routine screening have breast cancer.
- 1% of only age 40 women (stat with rate specific to age)
2) 80% of women with breast cancer will get positive mammographies.
- no age variable
3) 9.6% of women without breast cancer will also get positive mammographies.
- no age variable
A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
From the information present it seems you cannot calculate a percent chance of having breast cancer (by applying the 2nd two stats) since the 2nd two stats cannot be rationally related to the first stat. The 2nd two stats should not be allowed to skew the first stat because they are only applicable on their own. Therefore, the 2nd two stats don't change the first stat and the most reasonable reading you could give her is a 1% probability that she will get breast cancer.
James M. on May 1, 2007 9:08 PM
The key here isn't Bayes' amazing work. The entire answer, and how close Bayes' logic gets, depends on how accurately you can approximate the probabilities.
99% of the work goes into determining the probabilities and the confidence intervals. It's an amazing formula, but being off by even 0.5% can lead you to the wrong conclusion, possibly by a large factor (1:10 vs. 1:7). After that, it's just arithmetic.
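As a rough check on how sensitive the result is to the inputs, the sketch below recomputes the answer for three different base rates. The 0.5% and 1.5% priors are illustrative assumptions, not figures from the post, and `posterior_given_positive` is a hypothetical helper name.

```python
def posterior_given_positive(prior, sens=0.80, fpr=0.096):
    """P(cancer | positive) for a given base rate, via Bayes' theorem."""
    return sens * prior / (sens * prior + fpr * (1.0 - prior))

# The middle value is the puzzle's 1%; the other two are assumed for comparison.
for prior in (0.005, 0.010, 0.015):
    print(f"prior={prior:.3f} -> posterior={posterior_given_positive(prior):.4f}")
```

With these assumed inputs the answer moves from roughly 4% to roughly 11%, so a half-point error in the base rate does change the conclusion noticeably.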
Scott on May 1, 2007 9:30 PM
(BTW, I didn't even remember the formula so I didn't even try)
Scott on May 1, 2007 9:31 PM
James M., the two stats that do not specify an age group must apply to ALL women, thus do apply to women aged 40, so do apply to this case.
James B on May 2, 2007 12:36 AM
Everyone should have got the right answer, and if they did not, it should be because they could not be a#sed to work it out; after all, the readers are all computer programmers . . .
Well management probably did better than most, because they would give the answer as "about 1%"
David Ginger on May 2, 2007 3:19 AM
I like numbers and, above all, examples:
our case: 1000 women
following probability we can think
10 have BC
990 don't have BC
if all of them take a M (mammography)
it can be P (positive) or N (negative)
of the 10 with BC we have
- 8 have BC and M is P
- 2 have BC and M is N
of the 990 without BC we have
- 950 don't have BC and M is N
- 40 don't have BC M is P
if we are in the case of a Positive M then the
probability to have BC is 8/(40+8) i.e.
between 16 and 17%
I like number and above all examples
(but I'm home sick with flu and fever
and I have an excuse for the bad
calculations in the previous post!) :)
our case: 1000 women
following probability we can think
10 have BC
990 don't have BC
if all of them take a M (mammography)
it can be P (positive) or N (negative)
of the 10 with BC we have
- 8 have BC and M is P
- 2 have BC and M is N
of the 990 without BC we have
- 895 don't have BC and M is N
- 95 don't have BC M is P
if we are in the case of a Positive M then the
probability to have BC is 8/(8 + 95) i.e.
between 7 and 8%
Oh the irony. I just tried to post a comment here asking how Bayesian filtering works, using a medication for floppy junk and a medication for hair loss as two examples, and the comment posting program told me that "Your comment could not be submitted due to questionable content: " and then listed the floppy junk medication as the reason.
Bayesian filtering at work.
Matt on May 2, 2007 5:58 AM
I'm not sure I would give an answer of 7.76 percent to any woman. Saying "You probably don't have cancer" wouldn't go over that well. I would recommend more (different) tests or a redo of the mammography test.
I think it is interesting that the number is showing how reliable the test is. At only 7.76% probability, a mammography alone is not a reliable test for breast cancer in and of itself. You would want a much more reliable test or corroborating data before issuing any diagnosis. I believe that this is what Bayes' Theorem is showing here.
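To see what a repeat test might add, the sketch below feeds the result of the first positive test back in as the prior for a second one. It rests on a strong assumption that is not in the post: that the two tests are independent given the woman's true status and have the same error rates. In practice a repeat mammography may well share the causes of a false positive. The name `update_on_positive` is hypothetical.

```python
def update_on_positive(prior, sens=0.80, fpr=0.096):
    """One Bayesian update after a positive test result."""
    return sens * prior / (sens * prior + fpr * (1.0 - prior))

belief = 0.01  # base rate from the puzzle
for test_number in (1, 2):
    # ASSUMPTION: each test is independent given the true status.
    belief = update_on_positive(belief)
    print(test_number, round(belief, 3))
```

Under that independence assumption, a second positive result would lift the probability from about 8% to about 41%, which is why a follow-up test is a reasonable recommendation.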
Jon Raynor on May 2, 2007 9:32 AM
If you can read the question, it's pretty obvious that the number of false positives must be high. I have to say, as far as statistical problems go, this one's pretty easy.
Greg Bowers on May 2, 2007 4:19 PM
I didn't realize this was new. I learned this in college stats in 1977.
Pjay on May 2, 2007 6:35 PM
Hmm, how's this:
P(A | B) = P(A ^ B) / P(B)
and
P(A ^ B) = P(B | A) P(A)
so that
P(A | B) = P(B | A) P(A) / P(B)
In other words, even if A is a cause of B, we can consider them as correlated variables, and deduce the probability of A given B from knowing the probability of B given A, the probability of A (the cause) all by itself, and the probability of B (the effect) all by itself.
There has been some argument (see Wikipedia), however, about assigning the "a priori" probabilities P(B | A). Is that valid, considering we are measuring P (A | B ) ?
Greg Magarshak on May 2, 2007 10:20 PM
@Jon Raynor:
Actually, the "gentle introduction" article mentions this. The "positive" result is a low-occurrence but "weak" piece of evidence, whereas a "negative" result is the typical case, and very "strong" evidence. The point of the test isn't really finding out who DOES have cancer, it's finding out who DOESN'T. On that score, the test mentioned in the question is quite accurate; false negatives are very rare.
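The claim about negative results can be checked with the same numbers. This is a minimal sketch using only the three figures from the puzzle; `p_cancer_given_negative` is a hypothetical helper name.

```python
def p_cancer_given_negative(prior=0.01, sens=0.80, fpr=0.096):
    """P(cancer | negative test), the chance a negative result is a miss."""
    missed = (1.0 - sens) * prior            # P(negative AND cancer)
    cleared = (1.0 - fpr) * (1.0 - prior)    # P(negative AND no cancer)
    return missed / (missed + cleared)

print(round(p_cancer_given_negative(), 4))  # 0.0022
```

So with the puzzle's figures, a negative result leaves roughly a 0.2% chance of cancer, compared with about 7.8% after a positive one.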
Chris Moorhouse on May 3, 2007 12:47 AM
This is amazing. Either I'm completely daft, or you are all falling for a rather cheap and very old trick -- extra information that doesn't matter and is simply presented to confuse you.
The answer is *NOT* something with 7. The answer is clearly 100-9.6, which should be 90.4%. That's the chance she really has cancer. That's all the math you have to do. Really. I'll show you why:
--------------------------
1% of women at age forty who participate in routine screening have breast cancer.
Doesn't matter. While interesting information, it doesn't apply to our case - we are working with a limited subgroup, the "got a positive mammography" group. This information would be interesting for us if we didn't already have her mammography result, which makes this information void. New, more exact information about a person OVERRIDES old information. For example:
What are the chances that I am male? Your first piece of information is that 50% of the population is male (don't nitpick, I know it's not exactly 50%). Now, your second piece of information is that I have a weiner, and that 99.999% (made up number) of the people with a weiner are male. Now, what are the chances that I'm male? 50%, 99.999% or 50% * 99.999%? Clearly, the 50% information gets discarded because we now have a better number. They don't get merged somehow.
--------------------------
80% of women with breast cancer will get positive mammographies.
Unimportant information: We already know we got a positive mammography. We don't need to care about how many people didn't get one.
--------------------------
9.6% of women without breast cancer will also get positive mammographies.
This is the ONLY relevant data. We ONLY know for sure that we had a positive mammography, and we now know that 9.6% of those are false. Hence, in the remaining 90.4% cases, it's correct. And that's our number.
I assume the "riddle" contains some error, because that is a really easy answer (despite none of you getting it), and while I know nothing about this Bayesian stuff, I doubt it's about silly riddles.
[twisti] on May 3, 2007 1:15 PM
Ah, how embarrassing. Five minutes after posting I find why this looked so easy to me, and why you all got it wrong (or rather didn't). I guess my initial assumption, that I was daft, was the correct one.
I read:
9.6% of women without breast cancer will also get positive mammographies.
and understood:
9.6% of women who got positive mammographies will have no breast cancer.
Good thing I got here late and didn't make a fool of myself on the first page ;)
[twisti] on May 3, 2007 1:22 PM
Woo, got the correct 7.7% chance without reading anything. I guess being married to a statistician gives knowledge by osmosis.
I solved the problem by building a matrix of true+, false-, false+ and true- before I noticed all I cared about were the true+ and false+ portions. From there it is a simple ratio. If this is Bayes, I'm a Bayes intuitive.
So where's my prize money?
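The matrix approach described above can be sketched as follows. The population of 10,000 and the helper name `confusion_table` are assumptions made for illustration; only the three percentages come from the puzzle.

```python
def confusion_table(population=10_000, prior=0.01, sens=0.80, fpr=0.096):
    """Expected counts for the four cells of the 2x2 test/disease table."""
    sick = population * prior
    well = population - sick
    return {
        "true+": sick * sens,            # cancer, positive test
        "false-": sick * (1.0 - sens),   # cancer, negative test
        "false+": well * fpr,            # no cancer, positive test
        "true-": well * (1.0 - fpr),     # no cancer, negative test
    }

table = confusion_table()
for name, count in table.items():
    print(f"{name:7s}{count:10.1f}")

# Only the two "positive" cells matter once we know the test came back positive.
ratio = table["true+"] / (table["true+"] + table["false+"])
print(round(ratio, 4))  # 0.0776
```

The two negative cells drop out of the final ratio, which is the simplification the comment describes.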
Wesley Shephard on May 4, 2007 9:18 AM
I was lucky enough to get this, but only because I made a table to work it out.
For me, the "intuitive" step is realizing how easily false positives can skew the results. I used to work in anti-virus/anti-spyware, so false positives are an issue we think about a lot -- it may have biased me towards looking for similar issues in any test :)
I've written up a detailed explanation here, with the table, in case it helps anyone.
http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
Appreciate the post.
Kalid on May 6, 2007 7:27 PM
Suppose there are 10000 women.
So:
1) group A of 10000*0.01 = 100 women have the cancer
2) group B of 10000*0.99 = 9900 women have no cancer
3) group C = (group A)*0.8 = 80 women have the cancer and positive mammographies
4) group D = (group B)*0.096 = 950.4 women have no cancer but positive mammographies
Since the woman gets a positive mammography, she should be in the union of group C and group D. Then the chance of having cancer is:
(group C)/((group C)+(group D)) = 80/(80+950.4) = 0.07764
What a small chance. :)
Sean on June 7, 2007 7:45 AM
I like the other poster's response,
If 9.6% of the positive results are false, then 90.4% are correct. Since she did the exam and it came up positive she has 90.4% chance that she has cancer.
That is the problem with the modern test, they are too good.
"If 9.6% of the positive results are false,"
Alex, you have failed miserably at reading comprehension. Have a nice day.
Andy on October 25, 2008 9:17 PM
Seriously, when I read the question, I thought that there must be some kind of red herring, or something like that. Then, after reading the comments, I took my pen and started solving the problem as I used to in high school.
I seriously can't believe that I used Bayes' theorem all along in high school without even knowing the formula! It's all common sense. When you see numbers and you are asked about probability, ALWAYS grab a pen, a paper and a calculator. BTW, the actual probability that a person with a positive mammography has cancer can't be lower than 10%, right?
Magnus on December 31, 2008 8:35 AM
It's been a while since I did statistics, and I was only mediocre at it, but I'm going to give it a try. I didn't read everything above, it was just too much, but none of the solutions I did read stated what my calculations did. Of course this means there is a higher likelihood I am wrong, but who gives a smeg. This is the intarweb.
I did use google, but I knew what I had to do: find P(B|A) while knowing P(A|B), P(B) and P(A) and didn't remember the formula. If I had my formula book available, I would have used it so I don't consider this cheating.
On to my calculations.
Chance of having breast cancer
P(C) = 1%
Chance of receiving positive result, having breast cancer
P(T|C) = 80% (read as T given C)
Chance of receiving a false positive
P(T|^C) = 9,6%
Chance of receiving positive result, both having and not
P(T) = P(T|C) + P(T|^C) = 89,6%
Question stated: "A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?"
So what is probability of cancer, given a positive result? : P(C|T)
Bayes' theorem says P(B|A) = P(A|B)*(P(B)/P(A)), so for my notation I'm using P(C|T) = P(T|C)*(P(C)/P(T)) which gives me
P(C|T) = 80%*(1%/89,6%) = 0,89%
Yes. Having been given a positive mammogram result there is a 0,89% chance that one actually does have breast cancer.
It sounds unintuitive to me, but both my statistics book said so at the time and Jeff said so himself in this article.
Robert Græsdal on January 12, 2009 5:27 AM
That's what I get for posting the second before I have to run out the door and not having the time to think. Of course it cannot be 0,89%, less than the stated chance of having cancer! I retract my solution and am officially embarrassed.
Robert Græsdal on January 12, 2009 7:24 AM
Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.