A great deal has been written about the misuse of statistics by pressure groups and politicians, by pollsters and advertising campaigns, by the media (newspapers, magazines, television, and now the Internet), and even by statisticians and scientists. In some instances the misuse has been simply lack of awareness of the kinds of problems that may be encountered, in others carelessness or lack of caution and review, whilst on occasion the misuse is deliberate. One reason for this has been the growth of so-called evidence-based policy making: using research results to guide and justify political, economic and social decision-making. Whilst carefully designed, peer-reviewed and repeatable research does provide a strong foundation for decision-making, weak research or selective presentation of results can have profoundly damaging consequences. In this section we provide guidance on the kinds of problems that may be encountered, and comment on how some of these can be avoided or minimized. The main categories of misuse can be summarized as:
In the subsections of this topic we discuss each of these categories in turn.
Where data is obtained as the result of some form of trial, experiment or survey, careful design can help avoid many (but not all) of the problems identified in the first category (see also Design of Experiments and Bias). This is of particular importance in medical research, and for this reason we have included a separate subsection focusing on this particular application area and the kinds of problems and issues that are encountered.
A simple example, which occurs only too frequently, is the presentation and interpretation of data where some data items are omitted. A much reported example of this concerned the analysis of the failure of O-rings on the US space shuttle in 1986. NASA staff and their contractors examined the pattern of failures of O-rings during launches against temperature just prior to the ill-fated shuttle launch on January 28 1986. They concluded that the data showed no apparent relationship between the number of failures and temperature, but as we now know, the low temperature overnight did result in a failure of these components (see graph below) with catastrophic results. What the analysts failed to consider were all those launches that had 0 failures. All the launches with no failures occurred when the ambient temperature at the launch site was much higher, as highlighted in the diagram (see also, the Space Shuttle dataset and example in the R library, vcd).
In a rather different context, highlighted in January 2010 by BBC journalist Michael Blastland (see also our Recommended Reading topic, [BLA1]), reports of declining teenage pregnancy rates in Orkney, off the north coast of Scotland, were shown to be highly misleading. Blastland showed two graphs. The first appears to show a halving of the teenage pregnancy rate between 1994 and 2006, following an intensive programme of education and support:
However, the reports omitted data for the intervening years, and as we know from stock market and many other types of data, rates of change depend very heavily on your start and end date. The data in this case is actually quite cyclical, and choosing 2006 rather than, say 2007, provides a completely misleading picture, as the graph below demonstrates.
In many instances misuse is not deliberate, but leads to biased results and conclusions that cannot be relied upon and the consequences can be serious.
Our final example concerns the question of independent sampling. On 2nd February 2010 a UK national newspaper, the Daily Mail, reported the story of a woman who had bought a box of 6 eggs and found that every one contained a double yolk. The newspaper argued that because roughly 1 egg in a thousand has a double yolk, the chance of every egg in a box being double-yolked was one in a quintillion (1 in 10^18). This was clearly a crazy statement: it assumed that the occurrence of double yolks in a box of eggs represented a set of independent events, and that it was therefore valid to multiply 1/1000 by itself 6 times. In fact the events are in no way independent, for a whole variety of reasons. One respondent to a discussion about this example pointed out that most eggs are boxed in large sorting and packing warehouses, and in some cases eggs are checked against a strong light source to see if they contain a double yolk. If they do, they are put to one side and the staff often take these home for their own use, but if there are too many they are simply boxed up, resulting in boxes of double-yolk eggs.
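The arithmetic behind the newspaper's figure, and the way it changes once independence is dropped, can be sketched in a few lines of Python. The 1-in-1000 rate is the one quoted in the report; the share of boxes filled with set-aside double-yolk eggs (share_sorted_boxes) is a purely hypothetical value chosen for illustration, not a measured quantity.

```python
from fractions import Fraction

# Rate quoted in the newspaper report: roughly 1 egg in 1000 is double-yolked.
p_double = Fraction(1, 1000)
box_size = 6

# Under the (invalid) independence assumption the six probabilities multiply.
p_box_independent = p_double ** box_size  # 1 in 10^18

# Hypothetical alternative: suppose a packing station boxes up its surplus of
# set-aside double-yolk eggs, so that 1 box in 100,000 is filled entirely with
# them. This share is an illustrative assumption only.
share_sorted_boxes = Fraction(1, 100_000)

# A randomly bought box is all double-yolk if it is one of the sorted boxes,
# or if it is an ordinary box that happens to contain six double-yolk eggs.
# (The small reduction in double-yolk eggs left for ordinary boxes is ignored.)
p_box_with_sorting = (share_sorted_boxes
                      + (1 - share_sorted_boxes) * p_box_independent)

print(f"independent eggs:  1 in {float(1 / p_box_independent):.3g}")
print(f"with sorted boxes: 1 in {float(1 / p_box_with_sorting):.3g}")
```

Under this assumed sorting practice the chance of buying such a box is driven almost entirely by how often sorted boxes reach the shelves, not by the product of six small probabilities.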
This is probably the most common reason for 'statistics' and statistical analysis falling short of acceptable standards. Problems typically relate to inadequacies in sampling, i.e. in the initial design of the data collection, selection or extraction process. This results in the sample, from which inferences about the population are made, being biased or simply inadequate. The following list includes some of the main situations which lead to such problems:
•datasets and sample sizes — there are many situations where the dataset or sample size analyzed is simply too small to address the questions being posed, or is not large enough for use with the proposed statistical technique, or is used in a misleading fashion. Smaller sample sizes are also more prone to bias from missing data and non-responses in surveys and similar research exercises. For example, when examining the incidence of particular diseases recorded in different census districts (or hospital catchment areas etc.) we might find that for some diseases recorded cases were quite low in rural districts (<10), but were much higher in urban districts (>100). Does this mean the disease is more likely to occur amongst urban dwellers? Not necessarily, as there are more urban dwellers. To remove the effect of differences in the population-at-risk we might decide to compute the incidence (or rate) of the disease per 1000 population in each district (perhaps stratified by age and sex). Because of the relatively low population-at-risk in the rural area this might then show the risk appears much higher in the rural areas. Is the risk really higher or is the result a reflection of the relatively small numbers reported? Is reporting of cases for this disease the same in rural and urban areas, or is there some differential in recording perhaps due to differences in the quality of health care available or for social reasons? For a rare disease, a reported 25% increase year-on-year in the incidence of a particular type of cancer in the rural district might simply be the result of an increase of a single new reported case. It is also important to be aware that small samples tend to be much more variable in relative terms than large samples. This can result in errors in reasoning, as we discuss later in this section (see also: Sampling and sample size). Large sample sizes are also no guarantee of the quality or lack of bias in the data. 
One very early failure of a large dataset was the US Literary Digest's postal poll on the 1936 US presidential election, which received roughly 2.4 million returns. With the aim of achieving as large a sample as possible, the magazine sought datasets that contained the names and addresses of millions of adults; these primarily comprised vehicle registration lists and telephone directories. In total, over 10 million letters were posted. However, despite receiving an impressive number of responses, the poll incorrectly predicted that Landon would beat Roosevelt. The data sources are now understood to have produced biased samples that over-represented adults of higher socio-economic status, because the rates of both automobile and telephone ownership were much lower amongst poorer adults at the time.
•clustered sampling — this issue relates to the collection of data in a manner that is known in advance to be biased, but is not subsequently adjusted for this bias. Examples include the deliberate decision to over-sample minority social groups because of expected lower response rates or due to a need to focus on some characteristic of these groups which is of particular interest — see, for example, the discussion of this issue by Brogan (1998, [BRO1]). A second example applies where the only available data is known to be clustered (in space and/or time) — for example, in order to obtain estimates of the levels of trace elements in groundwater it is often only possible to take samples from existing wells and river courses, which are often spatially clustered. If the samples taken are not subsequently weight-adjusted (or de-clustered) results may be biased because some groups or areas are sampled more than others
•self-selection and pre-screening — this is a widespread group of problems in sampling and the subsequent reporting of events. Surveys that invite respondents to participate rather than randomly selecting individuals and ensuring that the resulting survey sample is truly representative are especially common. For example, surveys that rely on opting in, such as those placed in magazines, or via the Internet, provide a set of data from those who read the publication or view the Internet site, which is a first category of selection, and from this set the individuals who choose to respond are then self-selecting. This group may represent those with a particular viewpoint, those with strong views (so greater polarization of responses) or simply those who have the time and inclination to respond. Likewise, a survey on lifestyle in the population at large that advertises for participants in a range of lifestyle magazines, or in fitness studios and sports clubs, is likely to result in a significantly biased sample of respondents
•exclusions — the process of research design and/or sampling may inadvertently or deliberately exclude certain groups or datasets. An example is the use of telephone interviewing, which effectively pre-selects respondents by telephone ownership. If the proportion of exclusions is very small (e.g. in this example, the current proportion of people with telephones in a given country may be very high) this may not be a significant issue. A different category of exclusion is prevalent where some data is easier to collect than others. For example, suppose one wishes to obtain samples of bacteria in the soil of a study region. Areas which are very inaccessible may be under-sampled or omitted altogether whilst other areas may be over-sampled. In a different context, surveys of individuals may find that obtaining an ethnically representative sample is very difficult, perhaps for social or language reasons, resulting in under-representation or exclusion of certain groups — groups such as the disabled or very young or very old are often inadvertently excluded from samples for this reason. Limitations of time and/or budget are often factors that constrain the extent and quality of data collection and hence relevant and important data may be excluded for reasons of necessity or expediency. Data may also be deliberately or inadvertently excluded as being probably an error or outlier. In May 1985 the existence of the huge 'ozone hole' over the Antarctic (depleted levels of ozone at high altitudes) was documented by research published in Nature magazine: "NASA soon discovered that the spring-time 'ozone hole' had been covered up by a computer-program designed to discard sudden, large drops in ozone concentrations as 'errors'. The Nimbus-7 data was re-run without the filter-program and evidence of the Ozone-hole was seen as far back as 1976." (source: NASA)
The ozone hole over Antarctica, November 2009
Darker/Blue zone indicates ozone level <220 Dobson units; source: NASA https://ozonewatch.gsfc.nasa.gov
•exclusions, continued — in an extremely thorough UK study of cancer incidence over 30 years amongst children in the vicinity of high-voltage overhead transmission lines, the authors, Draper et al. (2005, [DRA1]), appeared to cover every possible factor and issue. However, on examining their research, unstated questions (exclusions from the research) soon become apparent. No active participation from patients or their families was involved, and homes were not visited to measure actual levels of electromagnetic (EM) radiation. This raises several questions: is home address at birth (which the authors used) an appropriate and sufficiently accurate measure? (the authors did not include duration at the address, or where the children went to nursery etc.); is vertical as well as horizontal proximity to high voltage lines of importance? (they only considered horizontal distance); is proximity to pylons carrying insulators and junction equipment, rather than just the lines, an important factor? (they omitted this issue altogether)
•pre-conceptions — researchers in scientific and social research frequently bring to their work a particular research focus, their own experience and possibly the current norms or paradigms of their discipline or of society at large. This may result in inadvertent use of techniques or survey questions that influence the outcome of the research. A common problem is that the wording of questions may lead the respondent to respond in a particular manner. Pre-conceptions may also easily lead to weak or incorrect reasoning from the data to conclusions
•data trawling — with large multi-variate datasets there is a high probability that statistically significant findings can be discovered somewhere in the data — brute-force processing of datasets looking for significant results that relate to a particular area of research interest, with or without explicit pre-conceptions, will often succeed but may well be entirely spurious. Techniques such as data-mining, cluster-hunting and factor analysis may all be 'misused' in this way
•temporal or spatial effects — the temporal or spatial sequence or arrangement of samples may be of critical importance, for many reasons. Examples of temporal effects include: dependence of test results on previous tests (e.g. in wine tasting); the temporal context of research, where responses to questions on a particular topic may be very different if that topic has had a very high profile in the news in the immediate past (e.g. personal safety, terrorism, heart disease from too much salt in the diet, attitudes to eating beef following the BSE/vCJD scare etc.), which affects both the nature and the absolute levels (assigned values) of responses; temporal effects can also be observed in data collected as a sequence using research staff whose accuracy and attention diminish over time (for example in repeated recording of counts in microscopy, repeated digitization of data points, or repeated asking of questions of interviewees). Examples of spatial effects include: location dependence (for example social groupings in specific areas, types of building, membership of organizations etc.); local correlation of results due to water, materials or other flows (e.g. contaminant levels in soil samples at various locations may be related to each other due to groundwater or other localized effects). Missing data (unsampled or lost data) in the temporal and spatial domains are also very common, especially with automated monitoring equipment that may fail for brief or extended periods (e.g. the NASA satellite monitoring data for high-atmosphere ozone levels from 1978 onwards was not available for much of 1995 due to technical problems)
•over- and under-scoring — the responses individuals provide to questions or tasks often show a distinct bias. When asked to state how confident the respondent is in the answer they have given, almost always the confidence level is over-stated, typically by 10-20% based on the relative frequency of correct responses. In some cultures diligence in completing surveys is taken much more seriously than in others. In one instance the present author achieved a greater than 100% response rate to a one page questionnaire asking respondents to list their activities on a given day — in principle impossible, but in fact many respondents photocopied the questionnaire and completed multiple sheets, even though this was not requested. Such response patterns are the exception — under-reporting is far more prevalent. In some instances the errors can be detected, for example by independent measurement or using a separate survey methodology. For example, when asked to record each telephone call made and its duration, respondents typically under-record the number of calls but over-score the duration, often rounding up to 1- or 5-minute multiples. The product of this particular over-scoring of duration and under-scoring of instances is generally close to the call hours (traffic, or Erlangs) measured using automatic call monitoring equipment, so the effect in terms of traffic estimation tends to cancel out in this case
•deliberate bias — the introduction of bias by judicious selection, combination, arrangement and/or reporting of data (which may have been extremely carefully collected) is an important and serious area of misuse. Examples include: deliberate omission of data that does not fit the preconceptions of the researcher, or the conclusions they are seeking; omission or adjustment of data without stating that this has been done (such adjustment may be acceptable practice in some instances, for example exclusion of outliers on the grounds that they appear to be recording errors, but should always be made explicit); and non-reporting of non-significant results. Examples in the temporal domain include reporting results for selected time periods, or against selected 'base years', to suggest large changes that may not be of any significance. Examples in the spatial domain include re-arranging the set of zones for which reporting is being carried out to increase or decrease a particular level of a variable or correlation. This has a particular historical context in politics, where the practice has become known as Gerrymandering, and in spatial analysis, where the question has been studied in detail and is known as the Modifiable Areal Unit Problem, or MAUP (see further: Statistics and Statistical Analysis)
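The data trawling point in the list above can be made concrete with a short calculation. The sketch below assumes that every test is carried out at the conventional 5% significance level, that no real effect exists, and that the tests are independent of one another; tests run on a single dataset are rarely independent, so the figures are a rough guide only. The function name is our own and is not taken from any library.

```python
def familywise_false_positive_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one 'significant' result when every null is true.

    Assumes the n_tests comparisons are independent of one another.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    rate = familywise_false_positive_rate(n_tests)
    print(f"{n_tests:>3} tests: P(at least one false positive) = {rate:.3f}")
```

Under these assumptions, trawling through 20 unrelated comparisons gives roughly a two-in-three chance of finding at least one spurious 'significant' result.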
To be meaningful a statistical graph or chart should indicate:
•what the scales are
•whether it starts at zero or some other value, and
•how it was calculated, in particular exactly what dataset and time period it is based upon
Without all of these elements the information presented should be viewed with caution (as is clear from our example of teenage pregnancy data in the previous section). Line graphs and histograms that simply show the neighborhood of the top of the diagram are, in most instances, misleading. Similar issues may arise if not all intermediate datapoints are plotted, or if data prior to or after the plotted sequence is not shown but would place the information in a more meaningful and complete context. Likewise, charts that show the change over time from some base date must be viewed with caution, since changing the base date may significantly alter the values, even if the broad pattern remains unchanged. The chart below shows the daily closing price of the NASDAQ 100 stock index from a base value of 100 in late 2002 to late 2009 (7 years of data). Clearly some variation is not visible, since within-day fluctuations are not reported. Non-trading days are omitted, which is entirely valid, so the x-axis is not strictly a time scale but an event sequence, and could easily be numbered 1,2,3... etc. without much loss of interpretation, assuming the start date was known. With a base index of 100 the graph shows a 50% rise over 7 years, but clearly within any given window there are many movements up and down. A 5 year window (base re-computed as 100 for 5 years of data to late 2009) would suggest no change.
NASDAQ 100 stock index history (2002-2009)
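The effect of moving the base date can be illustrated with a small re-basing calculation. The index levels below are invented for illustration and are not actual NASDAQ 100 values; the helper function rebase is likewise our own.

```python
def rebase(values, base_position, base_value=100.0):
    """Rescale a series so that values[base_position] equals base_value."""
    anchor = values[base_position]
    return [base_value * v / anchor for v in values]


# Hypothetical year-end index levels, invented for illustration only
# (they are NOT actual NASDAQ 100 values).
levels = [1000.0, 1400.0, 1500.0, 1600.0, 1750.0, 2050.0, 1200.0, 1500.0]

full_window = rebase(levels, 0)        # base = first observation
short_window = rebase(levels[2:], 0)   # base moved two periods later

print(f"change over full window:  {full_window[-1] - 100:+.0f}%")
print(f"change over short window: {short_window[-1] - 100:+.0f}%")
```

With these invented figures the full window shows a 50% rise whilst the shorter window shows no change at all, even though both describe the same underlying series.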
Similar issues apply to all forms of visualization, and increasingly so as automatic creation of static and dynamic charts, diagrams, classified maps and 3D representations becomes more widespread. Of particular concern is the issue of comparability. Visualizations that may be used to compare data from different sources, datasets, times and/or locations must be directly comparable in both design and scaling, otherwise comparison is almost impossible. This applies both to distinct visualizations and to those that show super-imposed data. For further discussion of visualization issues, please see the Graphics and Visualization topic.
Drawing conclusions from research findings is always a complex process, often subject to debate. The confidence that can be placed on conclusions will depend, in part, on the nature and quality of the data collected and analyzed, and the quality of the reasoning applied to the interpretation of the findings. Certain types of reasoning may appear entirely plausible but on closer examination can be seen as fundamentally flawed. The list below provides a number of commonly encountered problems of this type.
•Correlation versus causation — it is extremely easy to assume that, because there is a close (perhaps highly significant) relationship between two variables, one causes the other. This may occur in many ways and can be quite subtle (obvious examples are much easier to spot). Take the following example: "Girls at single sex schools do better than girls in mixed schools, therefore single-sex schools are better for girls". Based on test results in the UK and in a number of other countries the first part of this statement is well documented, but is the second part, which is a conclusion implying causality, actually correct? Without closer examination it is difficult to know. Further research shows that other factors are at work: (i) single-sex girls' schools are often fee-paying, and wealthier families tend to have children who achieve higher academic results than less well-off families (there may be several reasons for this observed finding); (ii) single-sex girls' schools are often selective, requiring entrance exams and/or interviews, thus filtering out groups who might under-perform or otherwise affect the academic results achieved; (iii) fee-paying schools often have longer days and more intensive teaching than non-fee paying schools. Put more formally, we can say that the fact that X and Y are correlated, or vary together, tells us relatively little about the causal relationship between X and Y. So, if X and Y vary together in some consistent manner, it might be that X causes Y, or that Y causes X, or that some set of third variables, Z, is involved, such that Z causes X and Z causes Y, so that the correlation of X and Y is simply due to their relationship to Z. Establishing causal relationships beyond doubt can be extremely difficult, but is often made easier by careful experimental design, thorough analysis of related factors, and repeated, independent, randomized trials.
Recent examples of this kind of inadequate cause-effect reasoning include: the observation that breast cancer rates are higher in countries that have a high fat content in their diet, and then suggesting that women who eat more fat in their diet are more likely to suffer from breast cancer; or that crime rates are higher in areas of high unemployment, and then stating that it is the unemployed who are responsible for most crimes. The inferences drawn may be valid, and such observations can provide very useful pointers for research, but the data only provides very tenuous support for the claims made. Sets of "guidelines" and a number of special statistical methods have been developed over the last few decades that attempt to provide a formal framework for developing models that seek to pinpoint causal relationships. The formal methods include Rubin Causal Modeling (RCM), Structural Equation Modeling (SEM), and various forms of path modeling. These issues are discussed further in the section below on statistics in medical research
•Misunderstanding of the nature of randomness and chance — there are a number of ways in which natural randomness of events can be misunderstood, leading to incorrect judgments or conclusions. A simple example is misjudging the effect of sample size. Suppose that a large hospital has 40 births per day on average, with 50% of these being boys. A smaller hospital nearby has 10 births/day, also 50% being boys on average. On some days the proportion of boys will be higher, on others lower. Which hospital would you expect to have the most days in a year with at least 60% of births being boys? The answer is the smaller hospital, because its records will exhibit inherently more variability: a change from 5 boys to 6 is sufficient to raise the proportion to 60%, whereas the larger hospital would need at least 4 more boys than the expected 20 (i.e. 24 boys to 16 girls) to produce a 60%+ result, which is less likely to occur. A second example is the assumption that a particular sequence of chance or random events in the past is a guide to events in the future. For instance, the probability that an unbiased coin toss will result in heads is not affected by the fact that perhaps the previous 10 times it has shown up tails. It is probable but not certain that sooner or later the tossed coin will come down heads, but that probability does not change from toss to toss. A similar, and perhaps more disturbing example, is the so-called prosecutor's fallacy. In this instance a prosecutor calls an expert witness who states that a piece of evidence (for example, an extremely rare blood group or condition) provides a link to the accused which would only occur one time in a million. The prosecutor then claims on the basis of this opinion that there is only one chance in a million that the accused is innocent. But we do not know that the accused is guilty (a presumption of guilt is not a satisfactory starting point).
If we assume the accused is innocent, how many other people in the population might also demonstrate such a link? The person accused might be guilty, but additional evidence would be needed before reaching such a conclusion. Readers interested in this particular field should read the free Statistics Guide for Lawyers (PDF) available on the RSS website. This is a highly recommended resource for both lawyers and non-lawyers alike
•Ecological fallacy — this fallacy involves ascribing characteristics to members of a group when only the overall group characteristics are known (special statistical techniques have been devised to address certain problems of this type, for example as discussed in King et al., 2004, [KIN1]). A simple example is the suggestion that most individuals in a given census area earn $50,000 p.a. based on the census return figure for the area in question, whereas there may be no individuals at all in this area matching this description: for example, 50% might earn $25,000 p.a. and 50% $75,000 p.a., or many other such combinations. From the aggregated data alone it is simply not possible to know. The problem of statistical grouping of data, described in the previous section (Statistics and Statistical Analysis, unemployment statistics example), illustrates some of the difficulties encountered when data is aggregated
•Atomistic fallacy — this fallacy involves ascribing characteristics to members of a group based on a potentially unrepresentative sample of members. As such it can be regarded as a central issue in statistical research, often related to sampling that is far too small or unrepresentative to enable such conclusions to be reached
•Misinterpretation of visualizations — there is endless scope for misinterpretation and there are many books on what makes for good and bad visualizations. The work of Edward Tufte (1983, [TUF1]) is amongst the best at providing guidance on what makes for good visualization. The emphasis should always be on clarity of communication, often achieved through simplicity in design and labeling. However, even an apparently simple and clear chart can easily provide scope for confused reporting. For example, the data for the chart below was cited in the Summer 2007 issue of the USA City Journal in an article authored by David Gratzer M.D., in which he stated that the U.S. prostate cancer survival rate is 81.2 percent and the U.K. survival rate is 44.3 percent. This apparently authoritative commentary was then picked up and used by leading US politicians. There are several problems with this interpretation of the graph. First, the data are from 7 years beforehand. Second, reported incidence simply reflects diagnosis rates, which in turn are related to the level of screening for the condition, which at the time was much more common in the USA than the UK. And finally, it is incorrect to deduce survival rates from the raw data on diagnosis and mortality rates. Survival rates require data that tracks each individual from the date of diagnosis through their subsequent lifespan. In broad terms the five-year relative survival rate for men diagnosed in England in 2000-2001 was 71% and by 2022 was 85%. More details on survival rates for prostate cancer over 1, 5 and 10 years can be found at the Cancer Research UK website
Prostate cancer incidence and mortality per 100,000 males per year
source: Anderson and Hussey (2000, [AND1])
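Two of the points on randomness in the list above, the hospital births example and the prosecutor's fallacy, can be checked numerically. The sketch below rests on simplifying assumptions that are ours rather than part of the original examples: each hospital is taken to have exactly 10 or 40 births every day, and for the prosecutor's fallacy the culprit is assumed to be one of a hypothetical population of 10 million equally likely people, with the culprit certain to show the link.

```python
from math import comb


def prob_at_least(k_min: int, n: int, p: float = 0.5) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))


# Hospital example: days on which at least 60% of births are boys.
# Assumption: exactly 10 and 40 births every day (the text gives averages).
p_small = prob_at_least(6, 10)    # 6 or more boys out of 10
p_large = prob_at_least(24, 40)   # 24 or more boys out of 40
print(f"small hospital: {p_small:.3f} of days, about {365 * p_small:.0f} per year")
print(f"large hospital: {p_large:.3f} of days, about {365 * p_large:.0f} per year")


def prob_guilty_given_match(population: int, match_prob: float) -> float:
    """Posterior probability of guilt given a match, by Bayes' theorem.

    Assumptions: the culprit is one of `population` people, each equally
    likely a priori; the culprit matches with certainty; each innocent
    person matches independently with probability match_prob.
    """
    expected_innocent_matches = (population - 1) * match_prob
    return 1.0 / (1.0 + expected_innocent_matches)


# Hypothetical population of 10 million with a 1-in-a-million match rate.
posterior = prob_guilty_given_match(10_000_000, 1e-6)
print(f"P(guilty | match) = {posterior:.3f}")
```

Under these assumptions about ten innocent people would be expected to show the same link, so the match on its own leaves the probability of guilt at roughly 1 in 11, very far from the 999,999 in a million implied by the prosecutor's claim.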
There are occasions when data is deliberately falsified. This may be the work of a rogue individual scientist or group, a commercial enterprise or even a government agency. The case of Prof Hwang Woo Suk, whose published results on human cloning from stem cells were found to be fraudulent and were retracted in 2006, is one of the most famous (see https://en.wikipedia.org/wiki/Hwang_Woo-suk), but there is little doubt that deliberate or semi-deliberate falsification of data is more common than many realize. Deliberate omission of findings that are not statistically significant, or that do not support a particular hypothesis, can be regarded as a form of deliberate falsification and is a well-established problem in academic and medical research. Recent high-profile "fake news" cases highlight how modern media and lack of independent scrutiny can result in falsified material becoming widely circulated.
[BRA2] Bradford Hill A (1965) The Environment and Disease: Association or Causation? Proc. of the Royal Soc. of Medicine, 58, 295-300. A copy of this article is reproduced on Tufte's website: https://www.edwardtufte.com/tufte/hill
[DRA1] Draper G, Vincent T, Kroll M E, Swanson J (2005) Childhood cancer in relation to distance from high voltage power lines in England and Wales: a case-control study. British Medical J., 330, 4 June 2005, 1-5
[TUF1] Tufte E (1983) The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT (2nd edition, 2001). The 2nd edition is also available from Tufte's website: https://www.edwardtufte.com/tufte/books_vdqi
NASA Ozone watch information: https://ozonewatch.gsfc.nasa.gov
NASDAQ 100 Index reference data: https://www.nasdaq.com/market-activity/indexes