In the frequentist model, P is called a probability function, where P(X) means the probability of an event, X, provided the following three axioms apply:
1. P(X)≥0 and real, for all X, i.e. probabilities are non-negative
2. P(X)=1 if X is certain to occur, i.e. the probability of a certain event is defined as 1, and
3. P(X OR Y) = P(X)+P(Y) if X and Y are mutually exclusive (i.e. cannot both occur), or more generally P(X OR Y OR Z ...) = P(X)+P(Y)+P(Z)+... for mutually exclusive events X, Y, Z, .... This is the additive axiom
From these it follows that
•  P(X)+P(~X)=1, or P(~X)=1−P(X), which in the Binomial example we have seen as q=1−p 
•  P(X)∈[0,1], i.e. the probability of an event has a value in the range 0 to 1 
•  P(X AND Y) = P(X)P(Y) if X and Y are independent, or more generally P(X AND Y AND Z ...) = P(X)P(Y)P(Z)... for independent events X, Y, Z, .... This is known as the multiplicative rule. 
Finally,
P(X OR Y) = P(X)+P(Y)−P(X AND Y) if X and Y are not mutually exclusive (see diagram below); this can be written in set notation as the union of the subsets X and Y:
P(X ∪ Y)=P(X)+P(Y)−P(X ∩ Y)
Point set sample space, S, intersection and union
This last example and comment highlight the fact that so far we have omitted to place this analysis in the context of point sets (i.e. the notion of sets familiar from Set Theory). The events X, Y etc. should be regarded as collections of sample points selected from a finite or infinite set, which is known as the sample space, S. If the set S is finite and contains n equally likely points, and the set X contains nx points, then P(X)=nx/n. Using this set-theoretic notation, axiom 2 above should really be written as P(S)=1. Likewise, we can visualize P(X OR Y) by using a simple Venn diagram, where the two sets representing X and Y partially overlap, as illustrated above. Note that if X and Y are mutually exclusive, the two sets do not overlap, so X ∩ Y=∅, the Null Set, and thus P(X ∩ Y)=0.
Returning to the example of throwing a single die, the sample space S={1,2,3,4,5,6} and n=6. Let X be the event "throw is less than 5" and Y be the event "throw is greater than 3", then P(X)=4/6 and P(Y)=3/6, P(X AND Y)=1/6 because both sets contain the point {4}, and thus P(X OR Y)=4/6+3/6−1/6=6/6 in this example.
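The die example can be checked with a few lines of code. The following is a minimal sketch in Python, assuming a fair die so that every sample point is equally likely; the helper name prob is purely illustrative and does not come from the text.

```python
from fractions import Fraction

# Sample space for one throw of a fair six-sided die (equally likely points)
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) = n_event / n for a finite sample space of equally likely points."""
    return Fraction(len(event & S), len(S))

X = {s for s in S if s < 5}   # event "throw is less than 5"
Y = {s for s in S if s > 3}   # event "throw is greater than 3"

p_and = prob(X & Y)                    # intersection is the single point {4}
p_or = prob(X) + prob(Y) - p_and       # addition rule for overlapping events

print(prob(X), prob(Y), p_and, p_or)   # fractions are shown in lowest terms

# The addition rule agrees with counting the union directly
assert p_or == prob(X | Y)
# Complement rule: P(~X) = 1 - P(X)
assert prob(S - X) == 1 - prob(X)
```

Exact fractions are used rather than floating point so that the equalities hold without rounding error.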
If we have a sample space, S, the set of points and outcomes is not necessarily defined numerically. For example, the set of results from tossing a coin twice would be TT, TH, HT and HH. It is often convenient to define a function, x, to map each outcome to a number on the real line. In this example we might use the mapping x(TT)=2, x(TH)=1, x(HT)=1, and x(HH)=0, i.e. the number of tails observed. The function x can be seen as a univariate mapping from the sample space, S, to the real line and is referred to as a random variable. This concept can be extended to the bivariate case using two real-valued functions, X and Y, say. The pair (X,Y) is then described as a two-dimensional random variable, and the variables (whether or not they are independent) are described as being jointly distributed.
A simple way to illustrate this, and associated concepts, is to examine a two-way frequency table, i.e. a cross-tabulation of two variables, X and Y, with cell entries being counts or frequencies. In the example table below the relationship between the heights of fathers and their sons is shown, based on data collected by Pearson and Lee and published in 1903 [PEA1]. The father's stature can be regarded as our first variable, X, and the son's as our second variable, Y. The entries in the cells are the frequencies recorded, based on a total of 1078 family record cards completed and analyzed. Fractional data in some cells are believed to reflect the way in which the data were reported when there was more than one son. The entries in the body of the table show how X and Y jointly vary and, as noted above, this is known as the joint distribution of X and Y. The right-hand margin of the table shows the row totals, or marginal distribution with respect to the Y variable (the sons), whilst the lower margin totals provide the marginal distribution with respect to X (the fathers).
If we identify individual frequencies, f, by indexing on their row, Y=y, and column, X=x, we have the marginal distribution with respect to X as the sum over the rows, and the marginal distribution with respect to Y as the sum over the columns:
f(X=x) = Σ_y f(Y=y ∩ X=x) and f(Y=y) = Σ_x f(Y=y ∩ X=x)
If we divided all the table entries by the overall total, 1078, the entries would be in the form of probability estimates, and we can write:
P(X=x) = Σ_y P(Y=y ∩ X=x) and P(Y=y) = Σ_x P(Y=y ∩ X=x)
So, for example, P(Y>70.5)=24.3% can be obtained directly from the marginal distribution.
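As a rough check on this figure, the sketch below recomputes it in Python from the row totals of the table that follows (the TOTAL column, i.e. the marginal frequencies for the sons). The variable names are illustrative only; the numbers are copied from the table.

```python
# Row totals (marginal frequencies for the sons, Y), copied from the TOTAL
# column of the table. Row i covers the band starting at 59.5 + (i - 1) inches.
row_totals = [2, 1.5, 3.5, 20.5, 38.5, 61.5, 89.5, 148, 173.5, 149.5,
              128, 108, 63, 42, 29, 8.5, 4, 4, 3, 0.5]

n = sum(row_totals)                       # overall total of record cards
lower_bounds = [59.5 + i for i in range(len(row_totals))]

# Marginal probability estimates P(Y=y): each row total divided by n
p_y = [f / n for f in row_totals]

# P(Y > 70.5): add up the marginal probabilities of the bands from 70.5 upwards
p_tall = sum(p for lo, p in zip(lower_bounds, p_y) if lo >= 70.5)

print(f"n = {n:g}, P(Y>70.5) = {p_tall:.1%}")
```

Because only the margin is needed, none of the individual cell entries enter the calculation.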
Fathers' and sons' stature data, after Pearson and Lee (1903)
Columns: father's stature (inches), X. Rows: son's stature (inches), Y. Cell entries are frequencies.

Column (X)          1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17  TOTAL
Father's from    58.5  59.5  60.5  61.5  62.5  63.5  64.5  65.5  66.5  67.5  68.5  69.5  70.5  71.5  72.5  73.5  74.5
Father's to      59.5  60.5  61.5  62.5  63.5  64.5  65.5  66.5  67.5  68.5  69.5  70.5  71.5  72.5  73.5  74.5  75.5
Row  From    To  (son's stature, Y)
  1  59.5  60.5     0     0     0     0   0.5   0.5     1     0     0     0     0     0     0     0     0     0     0      2
  2  60.5  61.5     0     0     0     0   0.5     0     0     0     1     0     0     0     0     0     0     0     0    1.5
  3  61.5  62.5     0  0.25  0.25     0   0.5     1  0.25  0.25   0.5   0.5     0     0     0     0     0     0     0    3.5
  4  62.5  63.5     0  0.25  0.25  2.25  2.25     2     4     5  2.75  1.25     0  0.25  0.25     0     0     0     0   20.5
  5  63.5  64.5     1     0   1.5  3.75     3  4.25     8  9.25     3  1.25   1.5  0.75  1.25     0     0     0     0   38.5
  6  64.5  65.5     2     1   0.5     2  3.25   9.5  13.5 10.75   7.5   5.5   3.5   2.5     0     0     0     0     0   61.5
  7  65.5  66.5     0   0.5     1  2.25  5.25   9.5    10 16.75  17.5    16  5.25     2   2.5     1     0     0     0   89.5
  8  66.5  67.5     0   1.5     2  4.75   3.5 13.75 19.75  26.5 25.75  19.5  12.5 13.75  3.25   0.5     1     0     0    148
  9  67.5  68.5     0     0   1.5     2   7.5    10 10.25 24.25  31.5  23.5  29.5 13.25   8.5   9.5  2.25     0     0  173.5
 10  68.5  69.5     0     0     1     0  5.25     5 12.75 18.25    16    24    29  21.5    10   3.5  2.25     0     1  149.5
 11  69.5  70.5     0     0     0     0     1   2.5  5.75 18.75 11.75  19.5  22.5  19.5  14.5  6.25   3.5   1.5     1    128
 12  70.5  71.5     0     0     0     0     0  3.25     5  8.75 10.75    19 14.75 20.75 10.75     8     5     1     1    108
 13  71.5  72.5     0     0     0     0     0  0.25     3  1.25     7  7.75 10.75 11.25    10   8.5  2.75   0.5     0     63
 14  72.5  73.5     0     0     0     0     0     0  0.75  0.75   2.5   7.5   6.5     6   7.5  6.25  3.25   0.5   0.5     42
 15  73.5  74.5     0     0     0     0     1     0   1.5   1.5     0  5.25  2.25   2.5   6.5  3.25  3.25     0     2     29
 16  74.5  75.5     0     0     0     0     0     0     0     0     0     1     2     0   2.5  0.75  1.75   0.5     0    8.5
 17  75.5  76.5     0     0     0     0     0     0     0     0     0  1.25  0.25     0   0.5     1     1     0     0      4
 18  76.5  77.5     0     0     0     0     0     0     0     0     0  1.25  0.25     1     0     0   1.5     0     0      4
 19  77.5  78.5     0     0     0     0     0     0     0     0     0     0     1     1     0  0.25  0.75     0     0      3
 20  78.5  79.5     0     0     0     0     0     0     0     0     0     0     0     0     0  0.25  0.25     0     0    0.5
TOTAL               3   3.5     8    17  33.5  61.5  95.5   142 137.5   154 141.5   116    78    49  28.5     4   5.5   1078
If we now look at column 7 (fathers between 64.5 and 65.5 inches tall), we have the particular frequency distribution f(Y | X=7), where the vertical bar means "given that", or "conditional upon", i.e. the frequency distribution of Y conditional upon X=7. Likewise, we can identify a particular value in this distribution, for example row 7, and we have f(Y=7 | X=7)=10. If we divide this value by the column total we have 10/95.5 = 10.5%, which is the estimated probability of a son being between 65.5 and 66.5 inches tall GIVEN THAT his father was between 64.5 and 65.5 inches tall.
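The conditional distribution for column 7 can be sketched in Python as follows. The 20 frequencies are copied from column 7 of the table (rows 1 to 20); the variable names are illustrative assumptions rather than anything defined in the text.

```python
# Column 7 of the table (fathers from 64.5 to 65.5 inches): frequencies for
# the 20 bands of son's stature, rows 1..20, copied from the table.
col7 = [1, 0, 0.25, 4, 8, 13.5, 10, 19.75, 10.25, 12.75,
        5.75, 5, 3, 0.75, 1.5, 0, 0, 0, 0, 0]

col_total = sum(col7)                 # marginal frequency f(X=7)

# Conditional distribution of Y given X=7: each cell divided by the column total
p_y_given_x7 = [f / col_total for f in col7]

row = 7                               # son between 65.5 and 66.5 inches
print(f"f(X=7) = {col_total:g}")
print(f"P(Y={row} | X=7) = {p_y_given_x7[row - 1]:.1%}")

# A conditional distribution is itself a probability distribution
assert abs(sum(p_y_given_x7) - 1.0) < 1e-12
```

Dividing by the column total, rather than the overall total of 1078, is what makes the column sum to 1.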
Now, if all entries in this table were divided by the overall total, 1078, we would obtain the estimated joint probability P(Y=7 ∩ X=7)=10/1078=0.93%. The probability estimates in column 7 would not sum to 1; they would sum to 95.5/1078=8.86%. But this is the marginal probability P(X=7). Thus there is a simple relationship between these results:
P(Y=7 | X=7)=P(Y=7 ∩ X=7)/P(X=7) = 0.0093/0.0886 = 10.5%, which is the result we found earlier. If we perform the same calculation, but for X given Y, we have the same numerator, 0.0093, but the denominator is P(Y=7)=89.5/1078=0.0830, so P(X=7 | Y=7)=11.2%. More generally we can write, for P(X)≠0:
P(Y=y | X=x) = P(Y=y ∩ X=x)/P(X=x)
and by symmetry, with P(Y)≠0:
P(X=x | Y=y) = P(X=x ∩ Y=y)/P(Y=y)
Both these expressions contain the equivalent intersection expressions: P(X=x ∩ Y=y) and P(Y=y ∩ X=x). We can eliminate these by substitution in the two equations to give:
P(Y=y | X=x) = P(X=x | Y=y)P(Y=y)/P(X=x)
or more simply:
P(Y | X) = P(X | Y)P(Y)/P(X)
This expression is the conditional probability of Y given X, and is known as Bayes' Theorem after its originator, the Revd. Thomas Bayes. Note that this differs from the conditional probability of X given Y, as the numerical example earlier demonstrated. Indeed, it shows that the two are only identical if the ratio P(Y)/P(X)=1. Note also that care is required if P(X) is very small or 0. Although at first sight this formula appears quite innocent, it has had a major impact on both the approach to statistical analysis adopted by many modern researchers and the development of some very specific Bayesian techniques.
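The numerical example can be used to confirm that the two conditional probabilities are linked in this way. The sketch below, in Python, takes the cell frequency and the two marginal totals quoted in the text and recovers P(Y=7 | X=7) from P(X=7 | Y=7); the variable names are illustrative only.

```python
# Frequencies quoted in the text for row 7 / column 7 of the table
n = 1078.0         # overall total
f_joint = 10.0     # cell entry for Y=7 and X=7
f_x7 = 95.5        # column 7 total (fathers)
f_y7 = 89.5        # row 7 total (sons)

p_x = f_x7 / n                      # marginal P(X=7)
p_y = f_y7 / n                      # marginal P(Y=7)
p_x_given_y = f_joint / f_y7        # P(X=7 | Y=7)

# Bayes' Theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x

print(f"P(X=7 | Y=7) = {p_x_given_y:.1%}")
print(f"P(Y=7 | X=7) = {p_y_given_x:.1%}")

# Same answer as dividing the cell entry by the column total directly
assert abs(p_y_given_x - f_joint / f_x7) < 1e-12
```

The two conditional probabilities differ here because P(Y=7)/P(X=7) = 89.5/95.5, which is not 1.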
Clearly if it makes no difference what the value of X is, then
P(Y | X) = P(Y)
and we say that Y is independent of X. This notation is used for the discrete case, but similar notation (and essentially the same rules) applies for the continuous case. In the discrete case we talk about a probability function, whereas in the continuous case we refer to the probability density. Unfortunately these terms are not used consistently by authors, and both forms are often used interchangeably.
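These probability relationships can be extended to multiple partitions of the sample space. If we let the sample space, S, be comprised of k mutually exclusive partitions, such that S=Y1∪Y2∪Y3 ...∪Yk, then, summing over i=1,...,k:
P(X) = Σ_i P(X ∩ Yi) = Σ_i P(X | Yi)P(Yi)
and Bayes' Theorem generalizes to:
P(Yi | X) = P(X | Yi)P(Yi) / Σ_j P(X | Yj)P(Yj)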
We are now in a position to examine this expression and its implications in a little more detail.
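As a small illustration of the partitioned form, the stature table can be reused: the 20 bands of son's stature form mutually exclusive partitions of the sample space, and X is taken to be the event "father is in column 7". The Python sketch below assumes this reading; the data are copied from the table and the variable names (prior, likelihood, evidence, posterior) are illustrative labels, not terms used in the text.

```python
# Row totals (sons, Y) and column 7 entries (fathers 64.5 to 65.5 inches),
# both copied from the stature table, rows 1..20.
row_totals = [2, 1.5, 3.5, 20.5, 38.5, 61.5, 89.5, 148, 173.5, 149.5,
              128, 108, 63, 42, 29, 8.5, 4, 4, 3, 0.5]
col7 = [1, 0, 0.25, 4, 8, 13.5, 10, 19.75, 10.25, 12.75,
        5.75, 5, 3, 0.75, 1.5, 0, 0, 0, 0, 0]

n = sum(row_totals)

prior = [t / n for t in row_totals]                      # P(Y_i)
likelihood = [f / t for f, t in zip(col7, row_totals)]   # P(X=7 | Y_i)

# Sum over the partitions: P(X=7) = sum_i P(X=7 | Y_i) P(Y_i)
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Generalized Bayes' Theorem: P(Y_i | X=7) for every partition i
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

print(f"P(X=7) = {evidence:.2%}")
print(f"P(Y=7 | X=7) = {posterior[6]:.1%}")

assert abs(evidence - 95.5 / n) < 1e-12       # matches the column 7 margin
assert abs(sum(posterior) - 1.0) < 1e-12      # posteriors sum to 1
```

The denominator recovers the marginal probability P(X=7) found earlier, so the result for row 7 agrees with the direct calculation 10/95.5.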