Bayesian probability theory

In recent decades there has been substantial interest in another perspective on probability (an alternative philosophical view). This view argues that when we analyze data from an experiment or from the world around us we should take into account prior knowledge, even if weak or informal (e.g. stated as beliefs), about the probability of the result before the data were seen, i.e. prior to the experiment. The posterior probability, that is the probability of the result conditional on having seen the data, should be modified by this prior knowledge. Prior knowledge might be based on detailed prior data, such as knowledge of the medical risks associated with lifestyle factors such as smoking (objective priors, see further Berger, 2006 [BER1]), or on beliefs based on extensive experience, such as risk assessments for industrial accidents or the likely risks associated with the release of a dangerous prisoner from jail (subjective priors).

The need for such a perspective came to the fore in the 1960s and 1970s, as decision-making processes became more complex and weaknesses in the frequentist approach became apparent. For example, it is extremely difficult to adopt a frequentist approach to many risk-taking situations, since there are simply no past data on which to base a frequentist analysis, yet there is a wealth of knowledge and experience, perhaps from related fields and situations. Examples include the decision to build more nuclear power stations, or the choice of the most appropriate responses to evidence of global warming. Furthermore, many statisticians adopting the Bayesian philosophy regard the question of prior knowledge and belief as a fundamental issue, and consider this approach the only sound scientific methodology to adopt, despite its often rather subjective nature.

These ideas can be expressed in statistical terms using Bayes' formula for conditional probability, which we introduced in the previous subsection, in a slightly different form:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$

In this formulation H is a hypothesis and D is the data, P(H) is the probability that the hypothesis H is true before seeing the data (i.e. the prior probability), and P(H|D) is the probability that the hypothesis H is true after seeing the data (i.e. the posterior probability). This leaves two other elements: P(D|H), which is the probability of seeing the data D given that the hypothesis H is true (this is called the likelihood), and P(D), which is the prior probability of seeing the data D under all mutually exclusive hypotheses. P(D) is sometimes called the marginal probability and acts as a normalizing factor. A Bayesian statistician will typically evaluate a range of hypotheses or competing models rather than seeking to accept or reject a null hypothesis, as is the case in traditional hypothesis testing.

In practice various forms of Bayes' formula are used, but each can be summarized as: conditional probability = unconditional probability x predictive power. Thus in the formula above, the probability that the hypothesis is true after having seen the data, P(H|D), equals the unconditional probability P(H) times the predictive power, P(D|H)/P(D), which measures how likely we are to see the data assuming the hypothesis is true, relative to how likely we are to see the data overall.
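As a minimal sketch of this relationship, assuming a set of mutually exclusive and exhaustive hypotheses, the following Python fragment computes posterior probabilities and confirms that each equals the prior multiplied by the predictive power. The function name `posteriors` and the three-coin example are hypothetical and are introduced here purely for illustration.

```python
def posteriors(priors, likelihoods):
    """Return P(H_i | D) for mutually exclusive, exhaustive hypotheses.

    priors[i] is P(H_i) and likelihoods[i] is P(D | H_i).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    # Numerators of Bayes' formula: P(D | H_i) * P(H_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # P(D): the normalizing factor, summed over all hypotheses
    marginal = sum(joint)
    if marginal == 0:
        raise ValueError("the data have zero probability under every hypothesis")
    return [j / marginal for j in joint]


# Hypothetical example: three competing hypotheses about a coin
# (biased towards tails, fair, biased towards heads), equally likely a priori.
priors = [1 / 3, 1 / 3, 1 / 3]
p_heads = [0.25, 0.5, 0.75]

# Data D: a single toss that lands heads, so P(D | H_i) = p_heads[i]
post = posteriors(priors, p_heads)

# Predictive power of each hypothesis: P(D | H_i) / P(D)
marginal = sum(p * l for p, l in zip(priors, p_heads))
predictive_power = [l / marginal for l in p_heads]

for prior, power, posterior in zip(priors, predictive_power, post):
    print(f"prior={prior:.3f}  predictive power={power:.2f}  posterior={posterior:.3f}")
```

A hypothesis whose predictive power exceeds 1 gains probability once the data are seen, and one whose predictive power is below 1 loses it.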

Although we have derived Bayes' expression in the preceding subsection, this concept can be rather difficult to follow, so a tangible example, based on an explanation by Eliezer Yudkowsky [YUD1], may help. Here is Yudkowsky's example:

"80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman aged 40 had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?"

The answer is not 80%. Looking carefully at the question and the information provided, we see that a vital piece of information is missing: the probability that a woman aged 40 will have breast cancer. This is the prior, P(H), without which we cannot answer the question correctly. We shall assume this to be around 1%, so P(H)=0.01. We have also been given the 80% figure, which is P(D|H)=0.8, so the numerator of our expression is 0.01x0.8=0.008. We now need to compute the denominator, P(D), which must include both the true positives (which is just the numerator value) and the false positives, since we need the ratio of the true positives to the sum of the true positives and the false positives. Thus P(D) is the sum of two separate elements: (i) the true positives, 0.008, the same element that already appears in the numerator; plus (ii) the false positives, which is the probability that a woman does not have breast cancer, P(~H)=0.99, multiplied by the probability that such a woman receives a positive result, P(D|~H), which has been given as 9.6%. Thus we have P(~H)xP(D|~H), or 0.99x0.096 (roughly 0.095), so the value for P(D) is 0.008+0.095=0.103, giving the ratio as 0.008/0.103, or 80/1030=7.8%. This is more informative than not conducting the mammography and relying on the prior alone, which would give 1%, so the test procedure is useful, but clearly could be better.

The situation in which we are given the probability for both the true positives and the false positives (i.e. all the mutually exclusive alternative hypotheses, Hi) is sometimes described as the Second Form of Bayes' Theorem. In our example, H1=H and H2=~H. The denominator in this case, and more generally, can thus be written as:

$$P(D) = \sum_i P(D \mid H_i)\,P(H_i)$$

In some ways it is simpler to start with a distinct number, for example 10,000 women aged 40. Of these we are told 1%, i.e. 100, will have breast cancer. The mammography screening process will give a positive result in 80% of these cases, so 80 women who do have breast cancer will be correctly identified. We also know that 9900 women in this sample do not have breast cancer, but 9.6% of these will be incorrectly given a positive reading, so there will be around 950 women incorrectly given a positive reading. The total number of positive readings is thus 80+950=1030, of which 80 are true positives, and this is the 7.8% figure we have just produced. In examples such as this, an understanding of the prior probabilities, of both true and false positives, informs our calculation of the chance that the woman in question actually has breast cancer after receiving a positive result on the mammography test, and it is not 80%, or anything like it, but just under 8%. Note that if the percentage of false positives were higher, the figure of 8% would drop rapidly, indicating that the test did not add to our diagnostic abilities, whilst if false positives were very rare, a positive result would mean that the probability of cancer in this case was very high. As Yudkowsky explains:

"Evidence is always the result of the differential between the two conditional probabilities. Strong evidence is not the product of a very high probability that H leads to D, but the product of a very low probability that 'not H' could have led to D ....Given some phenomenon H that we want to investigate, and an observation D that is evidence about H — for example, H is breast cancer and D is a positive mammography — Bayes' Theorem tells us how we should update our probability of H, given the new evidence D"
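The mammography calculation, and its sensitivity to the false positive rate, can be checked with a short Python sketch. The figures of 1%, 80% and 9.6% are those used in the example; the function name `posterior_positive` and the alternative false positive rates in the loop are assumptions chosen only to illustrate the point made in the quotation.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """P(H | D): probability of cancer given a positive mammography."""
    true_pos = sensitivity * prior                   # P(D|H) x P(H)
    false_pos = false_positive_rate * (1 - prior)    # P(D|~H) x P(~H)
    return true_pos / (true_pos + false_pos)


# Probabilities from the worked example
base = posterior_positive(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)
print(f"P(cancer | positive) = {base:.3f}")

# The same calculation expressed as counts in a sample of 10,000 women
n = 10_000
with_cancer = n * 0.01                          # about 100 women
true_positives = with_cancer * 0.80             # about 80 women
false_positives = (n - with_cancer) * 0.096     # about 950 women
by_counts = true_positives / (true_positives + false_positives)
print(f"From counts          = {by_counts:.3f}")

# Illustrative (assumed) false positive rates: the posterior falls as the
# false positive rate rises and climbs steeply as it approaches zero.
sweep = {fpr: posterior_positive(0.01, 0.80, fpr) for fpr in (0.30, 0.096, 0.01, 0.001)}
for fpr, post in sweep.items():
    print(f"false positive rate {fpr:>5.3f} -> posterior {post:.3f}")
```

Holding the 80% detection rate fixed and changing only the false positive rate moves the posterior from a few percent to close to 90%, which is the sense in which strong evidence depends on a low probability that 'not H' could have led to D.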

Although the above discussion and statement of formulas has been provided in terms of discrete probabilities, the same formulation can be extended to the continuous case, using probability density functions. The standard formulation is as follows:

$$g(\lambda \mid x) = \frac{f(x \mid \lambda)\,g(\lambda)}{\int f(x \mid \lambda)\,g(\lambda)\,d\lambda}$$

where: λ is a parameter (or parameters) of the distribution, f is the probability model for the observed data x given the parameter(s) λ. The function g(λ) is the prior distribution model for λ and g(λ|x) is the posterior distribution for λ given x. The important point to note here is that, unlike typical frequentist models, the parameter(s) λ are assumed to have a probability distribution rather than being a fixed value to be estimated.
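The continuous form can be approximated numerically by evaluating the product of the probability model f(x|λ) and the prior g(λ) on a grid of λ values and dividing by a numerical estimate of its integral. The sketch below assumes a hypothetical dataset of 7 successes in 10 binomial trials, with λ as the success probability and a uniform prior on (0, 1); these choices, and the function name `grid_posterior`, are illustrative only.

```python
from math import comb


def grid_posterior(prior_density, likelihood, grid):
    """Approximate the posterior density g(lambda | x) on an evenly spaced grid."""
    step = grid[1] - grid[0]
    # Numerator: f(x | lambda) * g(lambda) at each grid point
    unnormalized = [likelihood(lam) * prior_density(lam) for lam in grid]
    # Denominator: Riemann-sum approximation of the integral over lambda
    evidence = sum(unnormalized) * step
    return [u / evidence for u in unnormalized]


# Hypothetical data x: k successes in n trials
k, n = 7, 10


def uniform_prior(lam):
    """g(lambda): every success probability in (0, 1) equally plausible."""
    return 1.0


def binomial_likelihood(lam):
    """f(x | lambda): probability of k successes in n trials."""
    return comb(n, k) * lam**k * (1 - lam) ** (n - k)


# Midpoints of 2000 equal-width cells covering (0, 1)
cells = 2000
grid = [(i + 0.5) / cells for i in range(cells)]
step = grid[1] - grid[0]

post = grid_posterior(uniform_prior, binomial_likelihood, grid)

# The posterior is a full distribution for lambda; summarize it by its mean
post_mean = sum(lam * p for lam, p in zip(grid, post)) * step
print(f"posterior mean of lambda = {post_mean:.4f}")
```

For this particular prior and model the posterior is known in closed form as a Beta(k+1, n-k+1) distribution with mean (k+1)/(n+2), which provides a check on the grid approximation.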

As we have seen, Bayes' theorem applies and is valid in both frequentist and Bayesian probability models, but the two philosophies differ markedly in how the various elements are obtained and what they mean: for frequentists the formula applies to random events, whereas for Bayesians it is much broader and does not require an assumption of random events, but does require rational and evidence-based selection of probabilities. Researchers from the STEM disciplines (science, technology, engineering and mathematics) and social scientists from many different disciplines have increasingly adopted this latter view of probability, which does not require the same dependence on random processes, but still utilizes many of the core ideas of classical and frequentist statistics. There are clearly pros and cons of adopting each approach.

The table below lists some of these issues in connection with their application in reliability testing — examining, for example, the mean time between failures (MTBF) for items of equipment. This commentary has been produced by the US National Institute of Standards and Technology (NIST) and to some extent reflects the caution with which engineers view Bayesian methods. This contrasts with some other disciplines, including medical research, decision analysis and psychology, where the value and advantages of Bayesian thinking have been widely accepted, aided by the development of algorithms and software tools that specifically support this approach. Most traditional software packages, such as SPSS and Minitab, have little or no support for Bayesian methods, but there are many contributed packages to the R-Statistics project that provide Bayesian analysis, and specialized tools such as WinBUGS are widely used, especially in medical statistics.

Bayesian Paradigm: Advantages and Disadvantages (Engineering perspective)

Advantages:

Uses prior information and knowledge, which "makes sense"

If the prior information is encouraging, less new testing may be needed to confirm a desired MTBF at a given confidence, i.e. possibly saving time and cost

Confidence intervals are really intervals for the (random) MTBF, sometimes called "credibility intervals"

Disadvantages:

Prior information may not be accurate, generating misleading conclusions

Way of inputting prior information (choice of prior) may not be correct

Customers may not accept validity of prior data or engineering judgments

There is no one "correct way" of inputting prior information and different approaches can give different results

Results aren't objective and don't stand by themselves


Bayesian statistics is often described as encompassing a strong subjective element. For those who regard probability as being broadly subjective, rather than objective, probabilistic methods are seen as a necessary and useful toolset as opposed to some form of scientific 'truth' that can be constructed from a set of axioms into a solid theory. Within these perspectives a range of arguments is put forward for the use of statistical methods, such as:

"The world is governed by precise laws but the interactions between all the elements are so complex that these must be examined using probability theory"

"The world is governed, at least in part, by immutable chance processes"

"The world is unknowable in detail and we ultimately must use probabilistic arguments to disguise our ignorance"

"Probability theory is a convenient means of representation and analysis of aggregate datasets (and perhaps better than any other)"

There are many aspects to the various forms of objective and subjective approaches to statistics, and we recommend reading Joyce (2003) [JOY1] for a fuller discussion and extensive list of references on the subject, and Efron (1986) [EFR1] together with the comments on Efron's paper, to gauge some of the arguments and counter-arguments that have beset this issue.


[BER1] Berger J (2006) The case for objective Bayesian analysis. Bayesian Analysis, 1(3), 385-402

[EFR1] Efron B (1986) Why Isn't Everyone a Bayesian? The American Statistician, 40(1), 1

[JOY1] Joyce J (2003) Bayes' Theorem. Stanford Encyclopedia of Philosophy

[YUD1] Yudkowsky E S (2003) An Intuitive Explanation of Bayesian Reasoning