## Historical context

Statistics is a relatively young discipline; for discussions of its history see Stigler (1986, [STI1]) and Newman (1960, [NEW1]). Much of the foundation work for the subject has been developed in the last 150 years, although its beginnings date back to the 13th century and involve the expansion of the series $(p+q)^n$, for $n=0,1,2,\ldots$ The coefficients of this 'binomial' expansion were found to exhibit a well-defined pattern (illustrated below) known as Pascal's triangle. Each coefficient can be obtained as the sum of the two immediately above it in the diagram, as indicated.

Coefficients of the Binomial expansion
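The additive rule just described (each coefficient is the sum of the two immediately above it) can be sketched in a few lines of Python. This is a minimal illustration only; the function name `pascal_rows` is our own choice and not part of any library.

```python
def pascal_rows(n_max):
    """Return rows 0..n_max of Pascal's triangle as lists of integers."""
    rows = [[1]]
    for _ in range(n_max):
        prev = rows[-1]
        # Each interior entry is the sum of the two entries immediately above it.
        interior = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + interior + [1])
    return rows


# Print the coefficients of (p+q)^n for n = 0..5, roughly centred.
for row in pascal_rows(5):
    print(" ".join(str(c) for c in row).center(20))
```

The last row printed is `1 5 10 10 5 1`, the coefficients of the expansion for n=5.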

Pascal used this observation to produce a formula for the coefficients, which he noted was the same as the formula for the number of different combinations (or arrangements) of r events from a set of n (r=0,1,...,n), usually denoted:

$$^nC_r \quad \text{or} \quad \binom{n}{r}$$

This formula is typically expanded as:

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$

Hence with n=5, and noting that 0! is defined as 1, we have for r=[0,1,2,3,4,5] the values [1,5,10,10,5,1] as per Pascal's triangle, above. What this formula for the coefficients says, for example, is that there are 5 different ways of arranging one p and four q's. These arrangements, or possible different combinations, are:

pqqqq, qpqqq, qqpqq, qqqpq, and qqqqp

and exactly the same is true if we took one q and four p's. There is only one possible arrangement of all p's or all q's, but there are 10 possible combinations or sequences if there are 2 of one and 3 of the other. The possible different combinations are:

ppqqq, qppqq, qqppq, qqqpp, pqpqq, pqqpq, pqqqp, qpqpq, qpqqp, qqpqp
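The counts above can be checked by brute-force enumeration. The sketch below, which assumes nothing beyond the Python standard library, chooses which r of the n positions hold a p and builds the corresponding sequence; the helper name `arrangements` is illustrative only.

```python
from itertools import combinations
from math import comb


def arrangements(n, r):
    """List every sequence of length n containing r p's and (n - r) q's."""
    result = []
    # Choose which r of the n positions hold a 'p'; the rest hold a 'q'.
    for p_positions in combinations(range(n), r):
        result.append("".join("p" if i in p_positions else "q" for i in range(n)))
    return result


for r in range(6):
    seqs = arrangements(5, r)
    # The number of sequences found should equal n! / (r! (n-r)!).
    print(r, len(seqs), comb(5, r))
```

For n=5 the enumeration yields 1, 5, 10, 10, 5 and 1 sequences for r=0,...,5, matching the formula.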

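In these examples we count the distinct sequences of p's and q's, which is the same as choosing which r of the n positions hold a p; the order in which those positions are chosen is irrelevant, so these are combinations. If the order of selection is important, the number of arrangements is greater (by a factor of r!) and the formula simplifies to the number of permutations:

$$^nP_r = \frac{n!}{(n-r)!}$$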

Assuming (p+q)=1 then clearly $(p+q)^n=1$. Jakob Bernoulli's theorem (published in 1713, after his death) states that if p is the probability of a single event occurring (e.g. a 2 being the result when a six-sided die is thrown), and q=1-p is the probability of it not occurring (e.g. the die showing any value other than 2), then the probability of the event occurring at least m times in n trials is the sum of all the terms of $(p+q)^n$ that contain $p^r$ with r≥m, i.e.

$$P(\text{at least } m \text{ occurrences}) = \sum_{r=m}^{n} \binom{n}{r} p^r q^{n-r}$$

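So, if a die is thrown 5 times, the probability of a 2 occurring exactly r times (r=0,1,...,5) is given by the corresponding term of the binomial expansion with p=1/6 and q=1-p=5/6:

$$\left(\tfrac{1}{6}+\tfrac{5}{6}\right)^5 = \sum_{r=0}^{5} \binom{5}{r} \left(\tfrac{1}{6}\right)^r \left(\tfrac{5}{6}\right)^{5-r}$$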

which in this case give us the set of probabilities (to 3dp): 0.402, 0.402, 0.161, 0.032, 0.003, 0.000. So the chance of throwing at least one "2" from 5 throws of an unbiased die is the sum of all the terms from r=1 to 5 (i.e. m=1), roughly 60% (59.8%), and the chance of all 5 throws turning up as a 2 is almost zero. Notice that we could also have computed this result more directly as 1 minus the probability of no twos, which is $1-(1/6)^0(5/6)^5 = 1-0.402 = 0.598$, the same result as above.
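The arithmetic above is easy to reproduce. The following is a minimal sketch using only `math.comb` from the Python standard library; the function names `binomial_pmf` and `prob_at_least` are our own and are chosen purely for illustration.

```python
from math import comb


def binomial_pmf(n, p):
    """Return [P(exactly r occurrences) for r = 0..n] in n independent trials."""
    q = 1 - p
    return [comb(n, r) * p**r * q ** (n - r) for r in range(n + 1)]


def prob_at_least(m, n, p):
    """Sum the binomial terms with r >= m."""
    return sum(binomial_pmf(n, p)[m:])


probs = binomial_pmf(5, 1 / 6)
print([round(x, 3) for x in probs])          # terms for r = 0..5
print(round(prob_at_least(1, 5, 1 / 6), 3))  # at least one 2 in 5 throws
print(round(1 - (5 / 6) ** 5, 3))            # the same, via the complement
```

The first line printed is `[0.402, 0.402, 0.161, 0.032, 0.003, 0.0]`, and both of the remaining lines print `0.598`.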

This kind of computation, which is based on an a priori understanding of a problem in which the various outcomes are equally likely, works well in certain fields, such as games of chance (roulette, card games, dice games), but is not readily generalized to more complex and familiar problems. In most cases we do not know the exact chance of a particular event occurring, but we can obtain an estimate of it, assuming we have a fairly large and representative sample of data. For example, if we collate data over a number of years on the age at which males and females die in a particular city, then we might use this information to estimate the probability that a woman of age 45 resident in that location will die within the next 12 months. This a posteriori calculation of probability is exactly the kind of approach that forms the basis for what are known as mortality tables, which are used by the life insurance industry to guide the setting of insurance premiums. Statisticians involved in this particular field are called actuaries, and their principal task is to analyze collected data on all manner of events in order to produce probability estimates for a range of outcomes, on which insurance premiums are then based. The collected data are typically called statistics, here being the plural form. The term statistics in the singular refers to the science of how best to collect and analyze such data.

Returning to the games of chance examples above, we could approach the problem of determining the probability that at least one 2 is thrown from 5 separate throws of the die by conducting an experiment or trial. First, we could simply throw a die 5 times and count the number of times (if any) a 2 was the uppermost face. However, this would be a very small trial of just one set of throws. If we conducted many more trials, perhaps 1000 or more, we would get a better picture of the pattern of events. More specifically, we could make a chart of the observed frequency of each type of event, where the possible events are: zero 2s, one 2, two 2s and so on up to five 2s. In practice, throwing a six-sided die a very large number of times and counting the frequency with which each value appears is time-consuming and error-prone. The physical die is unlikely to be perfect, in the sense that differences in the shape of its corners and surfaces may make some faces very slightly more likely to appear uppermost than others; as time goes on the die will wear, which could affect the results; the way the die is thrown and the surface onto which it is thrown may affect the results; and we may make errors in counting, especially if the process continues for a very long time. There are thus many reasons for arguing that a physical approach is unlikely to work well.

As an alternative we can use a simple computer program with a random number generator to simulate the throwing of a six-sided die. Modern random number generators are extremely good, in that their randomness has been the subject of an enormous amount of testing and research; there will still be a very slight bias using this approach, but it is safe to ignore this at present. In the table below we have run a simple simulation by generating a random integer between 1 and 6 a total of 100,000 times. Given that we expect each value to occur with a probability of 1/6, we would expect each value to appear approximately 16667 times. We can see that in this trial the largest absolute difference between the simulated or observed frequency, fo, and the a priori or expected frequency, fe, is 203, which is around 1.2% of the expected frequency.

| Face | Frequency | \|Observed-Expected\| |
|---|---|---|
| 1 | 16741 | 74 |
| 2 | 16870 | 203 |
| 3 | 16617 | 50 |
| 4 | 16635 | 32 |
| 5 | 16547 | 120 |
| 6 | 16589 | 78 |

This difference is either simply a matter of chance, or perhaps the result of imperfections in the random number algorithm or in the simulation program. Some of this uncertainty can be removed by repeating the trial many times or using a larger number of tests in a single trial, and by checking the process using different software on different computers with different architectures. In our case we increased the trial run size to 1 million, and found that the largest percentage difference was 0.35%, suggesting that the random number generator and algorithm being used were indeed broadly unbiased, and also illustrating the so-called "Law of large numbers" or "Golden theorem", also due to Bernoulli. Essentially this law states that as the sample size is increased (towards infinity), the sample average tends to the true 'population' average. In the example of rolling a die, the possible values are 1,2,...,6, the average of which is 3.5, so the long-term average from a large number of trials should approach 3.5 arbitrarily closely. There are actually two variants of this law commonly recognized, the so-called Weak Law and the Strong Law, although the differences between these variants are quite subtle. Essentially, the Weak Law says that for a large enough sample the sample average is very likely to lie within any chosen small margin of the true average, but it still allows that margin to be exceeded an unlimited (possibly infinite) number of times as sampling continues, whilst the Strong Law says that, with probability 1, the margin is exceeded only a finite number of times.
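A simulation of this kind might look like the following sketch. It is not the program used to produce the figures quoted above: it assumes Python's standard `random` module as the generator, the seed is an arbitrary choice made for repeatability, and the function names are illustrative. The exact numbers it prints will therefore differ from those in the text.

```python
import random


def simulate_rolls(n_rolls, seed=12345):
    """Simulate n_rolls throws of a fair six-sided die; return counts per face."""
    rng = random.Random(seed)  # fixed, arbitrary seed so the run is repeatable
    counts = {face: 0 for face in range(1, 7)}
    for _ in range(n_rolls):
        counts[rng.randint(1, 6)] += 1
    return counts


def summarize(counts):
    """Return (largest |observed - expected| as a % of expected, sample mean)."""
    n = sum(counts.values())
    expected = n / 6
    max_pct = max(abs(c - expected) for c in counts.values()) / expected * 100
    mean = sum(face * c for face, c in counts.items()) / n
    return max_pct, mean


# Larger runs should show smaller percentage deviations and a mean nearer 3.5.
for n in (100, 10_000, 1_000_000):
    max_pct, mean = summarize(simulate_rolls(n))
    print(f"n={n:>9,}  largest deviation={max_pct:5.2f}%  mean={mean:.4f}")
```

As the number of rolls grows, the largest percentage deviation should tend to shrink and the sample mean should settle close to 3.5, in line with the Law of large numbers.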

This example has not directly told us how likely we are to see one or more 2s when the die is thrown five times. To estimate this we have to simulate batches of 5 throws at a time, and count the proportion of these batches that have one or more 2s thrown. We again compute 100,000 trials, each of which involves 5 throws (so 0.5 million throws in total), and we find the following results from a sequence of such runs: 59753, 59767, 59806,... each of which is very close to the expected value based on the percentage we derived earlier, more precisely 59812 (59.812%). In general it is unnecessary to compute such probabilities manually or programmatically for well-known distributions such as the Binomial, since almost all statistical software packages will perform the computation for you. For example, the Excel function BINOMDIST() could be used. Until relatively recently statistical tables, laboriously calculated by hand or with the aid of mechanical calculators, were the principal means of comparing observed results with standard distributions. Although this is no longer necessary, the use of tables can be a quick and simple procedure, and we have therefore included a number of these in the Resources topic, Distribution tables section, of this Handbook.
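The batch experiment described above can be sketched as follows. Again this is an illustrative reconstruction rather than the original program: the generator is Python's standard `random` module, the seeds are arbitrary, and `count_batches_with_target` is a hypothetical helper name, so the counts printed will not match those quoted in the text exactly.

```python
import random


def count_batches_with_target(n_batches, throws_per_batch=5, target=2, seed=2024):
    """Count batches of throws in which the target face appears at least once."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    hits = 0
    for _ in range(n_batches):
        batch = [rng.randint(1, 6) for _ in range(throws_per_batch)]
        if target in batch:
            hits += 1
    return hits


n_batches = 100_000
exact = 1 - (5 / 6) ** 5  # a priori probability of at least one 2 in 5 throws
for seed in (1, 2, 3):
    hits = count_batches_with_target(n_batches, seed=seed)
    print(f"seed={seed}: {hits} of {n_batches} batches "
          f"({hits / n_batches:.3%}); exact value {exact:.3%}")
```

Each run should give a count within a few hundred of the a priori value of 59812 per 100,000 batches.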

A number of observations are worth making about the above example. First, although we are conducting a series of trials, and using the observed data to produce our probability estimates, the values we obtain vary. So there is a distribution of results, most of which are very close to our expected (true) value, but in a smaller number of cases the results we obtain might, by chance, be rather more divergent from the expected frequency. This pattern of divergence could be studied, and the proportion of trials that diverged from the expected value by more than 1%, 2% etc. could be plotted. We could then compare an observed result, say one that diverged by 7% from that expected, and ask "how likely is it that this difference is due to chance?". For example, if there was less than one chance in 20 (5%) of such a large divergence, we might decide the observed value was probably not a simple result of chance but more likely that some other factor was causing the observed variation. From the Law of Large Numbers we now know that the size of our sample or trial is important — smaller samples diverge more (in relative, not absolute, terms) than larger samples, so this kind of analysis must take into account sample size. Many real-world situations involve modest sized samples and trials, which may or may not be truly representative of the populations from which they are drawn. The subject of statistics provides specific techniques for addressing such questions, by drawing upon experiments and mathematical analyses that have examined a large range of commonly occurring questions and datasets.
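The idea of studying the pattern of divergence can be illustrated by repeating the whole experiment many times and recording how far each result lies from the expected value. The sketch below is one possible way to do this, under our own assumptions: the experiment sizes, the number of repetitions, the seed and the function names are all illustrative choices, not values taken from the text.

```python
import random


def relative_divergences(n_experiments, batches_per_experiment, seed=99):
    """For each experiment, return the % divergence between the observed
    proportion of 5-throw batches containing a 2 and the a priori value."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    expected = 1 - (5 / 6) ** 5
    divergences = []
    for _ in range(n_experiments):
        hits = 0
        for _ in range(batches_per_experiment):
            if 2 in [rng.randint(1, 6) for _ in range(5)]:
                hits += 1
        observed = hits / batches_per_experiment
        divergences.append(abs(observed - expected) / expected * 100)
    return divergences


def share_exceeding(divergences, threshold_pct):
    """Proportion of experiments diverging by more than threshold_pct percent."""
    return sum(d > threshold_pct for d in divergences) / len(divergences)


# Compare small and large experiments: smaller samples diverge more often.
for size in (100, 1_000):
    divs = relative_divergences(n_experiments=200, batches_per_experiment=size)
    shares = ", ".join(f">{t}%: {share_exceeding(divs, t):.3f}" for t in (1, 2, 7))
    print(f"batches per experiment={size}: {shares}")
```

With the smaller experiment size a sizeable share of runs should diverge by more than 7%, whereas with the larger size such a divergence should be rare, which is the sample-size effect described above.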

A second observation about this example is that we have been able to compare our trials with a well-defined and known 'true value', which is not generally the situation encountered. In most cases we have to rely more heavily on the data and an understanding of similar experiments, in order to obtain some idea of the level of uncertainty or error associated with our findings.

A third, less obvious, observation is that if our trials, experiments and/or computer simulations are in some way biased, incorrectly specified or incomplete, our results will also be of dubious value. In general it is quite difficult to be certain that such factors have not affected the observed results, and therefore great care is needed when designing experiments or producing simulations.

Finally, it is important to recognize that a high proportion of datasets are not obtained from well-defined and controlled experiments, but are observations made and/or collections of data obtained, by third parties, often government agencies, with a whole host of known and unknown issues relating to their quality and how representative they are. Similarly, much data is collected on human populations and their behavior, whether this be medical research data, social surveys, analysis of purchasing behavior or voting intentions. Such datasets are, almost by definition, simply observations on samples from a population taken at a particular point in time, in which the sampling units (individual people) are not fully understood or 'controlled' and can only loosely be regarded as members of a well-defined 'population'.

With the explosion in the availability of scientific data during the latter part of the 18th century and early 19th century, notably in the fields of navigation, geodesy and astronomy, efforts were made to identify associations and patterns that could be used to simplify the datasets. The aim was to minimize the error associated with large numbers of observations by examining the degree to which they fitted a simple model, such as a straight line or simple curve, and then to predict the behavior of the variables or system under examination based on this approximation. One of the first and perhaps most notable of these efforts was the discovery of the method of Least Squares, which Gauss reputedly devised at the age of 18. This method was independently discovered and developed by a number of other scientists, notably Legendre, and applied in a variety of different fields. In the case of statistical analysis, least squares is most commonly encountered in connection with linear and non-linear regression, but it was originally devised simply as the 'best' means of fitting an analytic curve (or straight line) to a set of data, in particular measurements of astronomical orbits.

During the latter part of the 19th century and the first half of the 20th century major developments were made in many areas of statistics. A number of these are discussed in greater detail in the sections which follow, but of particular note is the work of a series of scientists and mathematicians working at University College London (UCL). This commenced in the 1860s with the research of the scientist Sir Francis Galton (a relation of Charles Darwin), who was investigating whether characteristics of the human population appeared to be acquired or inherited, and if inherited, whether humankind could be altered (improved) by selective breeding (a highly controversial scientific discipline, known as Eugenics). The complexity of this task led Galton to develop the concepts of correlation and regression, which were subsequently developed by Karl Pearson and refined by his student, G Udny Yule, who delivered an influential series of annual lectures on statistics at UCL which became the foundation of his famous book, An Introduction to the Theory of Statistics [YUL1], first published in 1911. Another student of Pearson at UCL was a young chemist, William Gosset, who worked for the brewing business, Guinness. He is best known for his work on testing data that have been obtained from relatively small samples. Owing to restrictions imposed by his employers on publishing his work under his own name, he used the pseudonym "Student", from which the well-known "Student's t-test" and the t-distribution arise. Also joining UCL for 10 years as Professor of Eugenics was R A Fisher, perhaps the most important and influential statistician of the 20th century. Fisher's contributions were many, but he is perhaps most famous for his work on the Design of Experiments [FIS1], a field which is central to the conduct of controlled experiments such as agricultural and medical trials.
Also at UCL, but working in a different field, psychology, Charles Spearman was responsible for the introduction of a number of statistical techniques including Rank Correlation and Factor Analysis. And lastly, but not least, two eminent statisticians: Austin Bradford Hill, whose work we discuss in the section on statistics in medical research, attended Pearson's lectures at UCL and drew on many of the ideas presented in developing his formative work on the application of statistics to medical research; and George Box, developer of much of the subject we now refer to as industrial statistics. Aspects of his work are included in our discussion of the Design of Experiments, especially factorial designs.

Substantial changes to the conduct of statistical analysis have come with the rise of computers, automated monitoring and tracking technologies (e.g. GPS, smartcard systems etc.) and the Internet. Computers have removed the need for statistical tables and, to a large extent, the need to be able to recall and compute many of the complex expressions used in statistical analysis. They have also enabled very large volumes of data to be stored and analyzed, which itself presents a whole new set of challenges and opportunities. To meet some of these, John Tukey and others developed the concept of Exploratory Data Analysis, or "EDA", which can be described as a set of visualization tools and exploratory methods designed to help researchers understand large and complex datasets, picking out significant features and feature combinations for further study. This field has become one of the most active areas of research and development in recent years, spreading well beyond the confines of the statistical fraternity, with new techniques such as Data Mining, 3D visualizations, Exploratory Spatio-Temporal Data Analysis (ESTDA) and a whole host of other procedures becoming widely used. A further, equally important impact of computational power is one we have already glimpsed in our discussion of games of chance: it is possible to use computers to undertake large-scale simulations for a range of purposes, amongst the most important of which is the generation of pseudo-probability distributions for problems for which closed mathematical solutions are not possible, or where the complexity of the constraints or environmental factors makes simulation and/or randomization approaches the only viable option.

References

[FIS1] Fisher R A (1935) The Design of Experiments. Oliver & Boyd, London

[NEW1] Newman J R (1960) The World of Mathematics. Vol 3, Part VIII Statistics and the Design of Experiments. Oliver & Boyd, London

[STI1] Stigler S M (1986) The History of Statistics. Harvard University Press, Cambridge, Mass.

[YUL1] Yule G U, Kendall M G (1950) Introduction to the Theory of Statistics. 14th edition, Charles Griffin & Co, London

MacTutor: The MacTutor History of Mathematics Archive. University of St Andrews, UK: https://www-history.mcs.st-and.ac.uk/

Mathworld: Weisstein E W "Weak Law of Large Numbers" and "Strong Law of Large Numbers": https://mathworld.wolfram.com/WeakLawofLargeNumbers.html

Wikipedia: History of statistics: https://en.wikipedia.org/wiki/History_of_statistics