Many people would regard statistical analysis as a purely technical exercise involving the application of specialized data collection and analysis techniques, but this perception is both incorrect and misleading. Statistical problems should be viewed within the context of a broad methodological framework, and it is the specific nature of this framework that defines "The Statistical Method". Here we are using the terminology and interpretation of MacKay and Oldford (2000, [MAC1]). They carefully examined the nature of statistical analysis by discussing the problem of determining the speed of light, as conducted in the experiments of A A Michelson in 1879. Although they used research that involved a relatively complicated experiment as their example, the conclusions they draw are much more wide-reaching. Essentially they argue that statistical analysis must involve a broad perspective on the task under consideration, from the initial Problem definition stage (P), through Planning and Data collection stages (P,D) through to Analysis (A) and Conclusions (C). This is similar to the "statistical problem solving cycle" as described in the Probability & Statistics leaflet mentioned in our Suggested Reading section and elsewhere, but widens the scope of this methodology.
The elements of this methodological framework are shown in the PPDAC table below - each is discussed in detail in their paper. MacKay and Oldford note that very often the complexity of the analysis phase is greatly reduced if the totality of a problem is addressed in the manner described. As can be seen, the formal analysis stage comes well down the sequence of steps that are involved in producing good quality statistical research. Absolutely crucial to the entire process is the initial problem definition. Only once this is thoroughly understood by all interested parties can a plan for data collection be devised and the data obtained for subsequent analysis.
PPDAC: The Statistical Method, after MacKay and Oldford (2000)

Problem
Details:
• Units & Target Population (Process)
• Problem Aspect(s) - causative, descriptive, predictive
Example (Michelson):
• Unit: One (measured) light transmission. Population: all such transmissions
• Response variate: the speed of light in each measured transmission
• Explanatory variates: a large number of possible factors that might help explain variations in the measured data (e.g. method used, the measurement process)
• Population attributes: the average speed of light in a vacuum
• Problem aspect: descriptive (seeking an estimate of a specific value)

Plan
Details:
• Study Population (Process) (Units, Variates, Attributes)
• Selecting the response variate(s)
• Dealing with explanatory variates
• Data Collection Protocol
Example (Michelson):
• Study population: The collection of units that could possibly be measured (known as the sampling frame in survey work). Michelson measured the speed of light in air, not in a vacuum - the difference between the study population and the target population is known as the study error
• Response variates: Michelson measured the speed of light indirectly, using distances, rotation speed (of a mirror), timing device (tuning forks) and temperature
• Explanatory variates: There may be a large number. Where possible Michelson tried to fix those factors he was aware of, and measure or vary others to check if they had an effect on his results
• Sampling protocol: The detailed procedure followed for sampling the data - in Michelson's case he made sets of measurements one hour after sunrise and one hour before sunset, on a series of days close to mid-summer. He made 1000 measurements, with some made by an independent observer
• Measuring processes: the equipment, people, and methods used - measurement error, which is the difference between the measured value and the true value, is incurred in this step of the procedure
• Data Collection Protocol: the management and administration (recording etc) of the entire data collection exercise - nowadays this would include data storage and processing considerations

Data
Details:
• Execute the Plan and record all departures
• Data examination for internal consistency
Example (Michelson):
• Execution: Michelson did not record every result, but just the average values for blocks of 10 measurements
• Data monitoring: Tracking data as they are obtained helps identify patterns, temporal drift, outliers etc. Michelson did not explicitly do this
• Data examination: The internal consistency of the data should be checked for unexpected features (e.g. using EDA techniques), but Michelson did not appear to do this
• Data storage: simple tabulated results on paper in this instance

Analysis
Details:
• numerical and graphical
• build, fit, criticize cycle
Example (Michelson):
• Michelson summarized his data in tables and computed the average of his 100 measured velocities in air, and then corrected for the deflection effect that air would have on his results, making a small adjustment for temperature variations in each case
• Formal analysis was limited to analyzing possible sources of error and their maximum impact on the results, in order to obtain an estimate of the velocity of light in a vacuum, +/- the estimated errors

Conclusion
Details:
• plain language, effective
• Limitations of study: discussion of potential errors
Example (Michelson):
• Michelson presented his central finding and provided a full discussion as to possible sources of error and why many factors could be ignored due to the manner in which the plan was made and executed. Despite this, the true value for the speed of light is actually outside the limits of his estimates at the time, even though his mean result was within 0.05% of the correct figure, hence he slightly underestimated the size of the errors affecting his result
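The consequence of recording only block averages, as noted for the Data stage above, can be illustrated with a short sketch. The readings below are synthetic stand-ins drawn from an arbitrary Normal distribution, not Michelson's measurements, and the function name `block_means` is our own. With equal-sized blocks the mean of the block averages matches the overall mean, but the spread of the 100 recorded values is much smaller than the spread of the 1000 individual readings, so variation within blocks can no longer be examined.

```python
import random
import statistics

def block_means(values, block_size):
    """Average consecutive, non-overlapping blocks of `block_size` values."""
    if len(values) % block_size != 0:
        raise ValueError("number of values must be a multiple of block_size")
    return [
        statistics.fmean(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]

# Synthetic stand-in data (NOT Michelson's readings): 1000 draws around an
# arbitrary level of 100.0 with standard deviation 5.0.
rng = random.Random(1879)
readings = [rng.gauss(100.0, 5.0) for _ in range(1000)]

# Keep only the averages of blocks of 10, giving 100 recorded values.
blocks = block_means(readings, block_size=10)

overall_mean = statistics.fmean(readings)
mean_of_blocks = statistics.fmean(blocks)

# The two means agree up to floating-point rounding, but the standard
# deviation of the block averages understates that of the raw readings.
sd_readings = statistics.stdev(readings)
sd_blocks = statistics.stdev(blocks)

print(f"readings: n={len(readings)}, mean={overall_mean:.3f}, sd={sd_readings:.3f}")
print(f"blocks:   n={len(blocks)}, mean={mean_of_blocks:.3f}, sd={sd_blocks:.3f}")
```

Any error estimate based on the recorded block averages alone therefore has to account for the averaging that has already taken place.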
The PPDAC summary table suggests a relatively linear flow from problem definition through to conclusions - this is typically not the case. It is often better to see the process as cyclical, with a series of feedback loops. A summary of a revised PPDAC approach is shown in the diagram below. As can be seen, although the clockwise sequence (1→5) applies as the principal flow, each stage may and often will feed back to the previous stage. In addition, it may well be beneficial to examine the process in the reverse direction, starting with Problem definition and then examining expectations as to the format and structure of the Conclusions (without pre-judging the outcomes!). This procedure then continues, step-by-step, in an anti-clockwise manner (e→a) determining the implications of these expectations for each stage of the process.
PPDAC as an iterative process
We now expand our discussion by examining the components of this revised model in a little more detail:
Problem: Understanding and defining the problem to be studied is often a substantial part of the overall analytical process - clarity at the start is obviously a key factor in determining whether a programme of analysis is a success or a failure. Success here is defined in terms of outcomes (or objectives) rather than methods. And outcomes are typically judged and evaluated by third parties - customers, supervisors, employers - so their active involvement in problem specification and sometimes throughout the entire process is essential. Breaking problems down into key components, and simplifying problems to focus on their essential and most important and relevant components, are often very effective first steps. This not only helps identify many of the issues to be addressed, likely data requirements, tools and procedures, but also can be used within the iterative process of clarifying the customer’s requirements and expectations.
Problems that involve a large number of key components tend to be more complex and take more time than problems which involve a more limited set. This is fairly obvious, but perhaps less obvious is the need to examine the interactions and dependencies between these key components. The greater the number of such interactions and dependencies the more complex the problem will be to address, and as the numbers increase complexity tends to grow exponentially.
Analysis of existing information, traditionally described as “desk research”, is an essential part of this process and far more straightforward now with the advantage of online/Internet-based resources. Obtaining relevant information from the client/sponsor (if any), interested third parties, information gatekeepers and any regulatory authorities, forms a further and fundamental aspect of problem formulation and specification. Box et al. (2005, p13, [BOX1]) suggest a series of questions that should be asked, particularly in the context of conducting experiments or trials, which we list below with minor alterations from their original. As can be seen, the questions echo many of the issues we raise above:
• what is the objective of this investigation?
• who is responsible?
• I am going to describe your problem - is my description correct?
• do you have any past data? and if so, how were these data collected/in what order/on what days/by whom/how?
• do you have any other data like these?
• how does the equipment work/what does it look like/can I see it?
• are there existing sampling, measurement and adjustment protocols?
Plan: Having agreed on the problem definition the next stage is to formulate an approach that has the best possible chance of addressing the problem and achieving answers (outcomes) that meet expectations. Although the PLAN phase is next in the sequence, the iterative nature of the PPDAC process emphasizes the need to define and then re-visit each component. Thus whilst an outline project plan would be defined at this stage, one would have to consider each of the subsequent stages (DATA, ANALYSIS, CONCLUSIONS) before firming up on the detail of the plan. With projects that are more experimental in nature, drawing up the main elements of the PLAN takes place at this stage. With projects for which pre-existing datasets and analysis tools are expected to be used, the PLAN stage is much more an integrated part of the whole PPDAC exercise. The output of the PLAN stage is often formulated as a detailed project plan, with allocation of tasks, resources, times, analysis of critical path(s) and activities, and estimated costs of data, equipment, software tools, manpower, services etc. Frequently project plans are produced with the aid of formal tools, which may be paper-based or software assisted. In many instances this will involve determining all the major tasks or task blocks that need to be carried out, identifying the interconnections between these building blocks (and their sequencing), and then examining how each task block is broken down into sub-elements. This process then translates into an initial programme of work once estimated timings and resources are included, which can then be modified and fine-tuned as an improved understanding of the project is developed. In some instances this will be part of the Planning process itself, where a formal functional specification and/or pilot project forms part of the overall plan. 
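The critical path analysis mentioned above can be sketched in a few lines. The task names, durations and dependencies below are entirely hypothetical, and the function `critical_path` is our own illustration rather than part of any project-planning tool; it assumes Python 3.9 or later for the standard `graphlib` module. The sketch finds the longest chain of dependent tasks, which sets the minimum project duration.

```python
from graphlib import TopologicalSorter

# Hypothetical task list: name -> (duration in days, prerequisite tasks).
tasks = {
    "define_problem":  (5, []),
    "design_sampling": (10, ["define_problem"]),
    "acquire_tools":   (3, ["define_problem"]),
    "collect_data":    (20, ["design_sampling"]),
    "analyse":         (8, ["collect_data", "acquire_tools"]),
    "report":          (4, ["analyse"]),
}

def critical_path(tasks):
    """Return (total duration, task names) for the longest dependency chain."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    finish = {}    # earliest finish time of each task
    previous = {}  # predecessor on the longest chain leading into each task
    for name in sorter.static_order():  # prerequisites are visited first
        duration, deps = tasks[name]
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + duration
        previous[name] = max(deps, key=lambda d: finish[d], default=None)
    # Walk back from the task that finishes last to recover the chain.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = previous[node]
    return total, path[::-1]

total_days, path = critical_path(tasks)
print(total_days, " -> ".join(path))
```

Tasks that are not on the returned chain (here the hypothetical `acquire_tools`) have slack, so delays to them need not delay the project as a whole.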
As with other parts of the PPDAC process, the PLAN stage is not a one-shot static component, but typically includes a process of monitoring and re-evaluation of the plan, such that issues of timeliness, budget, resourcing and quality can be monitored and reported in a well-defined manner. The approach adopted involves consideration of many issues, including:
• the nature of the problem and project - is it purely investigative, or a formal research exercise; is it essentially descriptive, including identification of structures and relationships, or more concerned with processes, in which clearer understanding of causes and effects may be required, especially if predictive models are to be developed and/or prescriptive measures are anticipated as an output?
• does it require commercial costings and/or cost-benefit analysis?
• are particular decision-support tools and procedures needed?
• what level of public involvement and public awareness is involved, if any?
• what particular operational needs and conditions are associated with the exercise?
• what time is available to conduct the research and are there any critical (final or intermediate) deadlines?
• what funds and other resources are available?
• is the project considered technically feasible, what assessable risk is there of failure and how is this affected by problem complexity?
• what are the client (commercial, governmental, academic, personal) expectations?
• are there specifications, standards, quality parameters and/or procedures that must be used (for example to comply with national guidelines)?
• how does the research relate to other studies on the same or similar problems?
• what data components are needed and how will they be obtained (existing sources, collected datasets)?
• are the data to be studied (units) to be selected from the target population, or will the sample be distinct in some way and applied to the population subsequently (in which case, as discussed earlier, one must consider not just sampling error but study error also)?
When deciding upon the design approach and analytical methods/tools it is often important to identify any relevant available datasets, examine their quality, strengths and weaknesses, and carry out exploratory work on subsets or samples in order to clarify the kind of approach that will be both practical and effective. There will always be unknowns at this stage, but the aim should be to minimize these at the earliest opportunity, if necessary by working through the entire process, up to and including drafting the presentation of results based on sample, hypothetical or simulated data.
Data: In research projects that involve experiments, the data are collected within the context of well-defined and (in general) tightly controlled circumstances, with the response and explanatory variates being clearly included in the design of the experiment. In many other instances data are obtained from direct or indirect observation of variates that do not form part of any controlled experiment. And in survey research, although there will be a carefully constructed sample design, the level of direct control over variates is typically very limited. Key datasets are also often provided by or acquired from third parties rather than being produced as part of the research. Analysis is often of these pre-existing datasets, so understanding their quality and provenance is extremely important. It also means that in many instances this phase of the PPDAC process involves selection of one or more existing datasets from those available. In practice not all such datasets will have the same quality, cost, licensing arrangements, availability, completeness, format, timeliness and detail. Compromises have to be made in most instances, with the over-riding guideline being fitness for purpose. If the datasets available are unsuitable for addressing the problem in a satisfactory manner, even if these are the only data that one has to work with, then the problem should either not be tackled or must be re-specified in such a way as to ensure it is possible to provide an acceptable process of analysis leading to worthwhile outcomes. A major issue related to data sourcing is the question of the compatibility of different datasets: in formats and encoding; in temporal, geographic and thematic coverage; in quality and completeness. In general datasets from different sources and/or times will not match precisely, so resolution of mismatches can become a major task in the data phase of any project.
And as part of this process the issue of how and where to store the data arises, which again warrants early consideration, not merely to ensure consistency and retrievability but also for convenient analysis and reporting. Almost by definition no dataset is perfect. All may contain errors, missing values, have a finite resolution, include distortions as a result of modeling the real world with discrete mathematical forms, incorporate measurement errors and uncertainties, and may exhibit deliberate or designed adjustment of data (e.g. for privacy reasons, as part of aggregation procedures).
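A first check on the compatibility and completeness of two datasets can be as simple as comparing their identifying keys and counting missing values. The sketch below is a minimal illustration under stated assumptions: the two small record sets, their field names and the function `compare_datasets` are hypothetical, and missing values are assumed to be coded as `None` or an empty string.

```python
def compare_datasets(left, right, key):
    """Summarise key mismatches and missing values for two lists of dict records."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}

    # Count missing entries (None or empty string) per field across both sets.
    missing = {}
    for row in left + right:
        for field, value in row.items():
            if value is None or value == "":
                missing[field] = missing.get(field, 0) + 1

    return {
        "only_in_left": sorted(left_keys - right_keys),
        "only_in_right": sorted(right_keys - left_keys),
        "in_both": sorted(left_keys & right_keys),
        "missing_by_field": missing,
    }

# Hypothetical records from two sources collected at different times.
survey_2019 = [
    {"region": "A", "population": 1200},
    {"region": "B", "population": None},
    {"region": "C", "population": 950},
]
admin_2021 = [
    {"region": "B", "area_km2": 14.2},
    {"region": "C", "area_km2": ""},
    {"region": "D", "area_km2": 30.1},
]

report = compare_datasets(survey_2019, admin_2021, key="region")
for name, value in report.items():
    print(name, value)
```

A summary of this kind does not resolve mismatches, but it indicates early on how much reconciliation work the data phase is likely to involve.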
Analysis: The Analysis phase can be seen as a multi-part exercise. It commences with the review of data collected and the manipulation of the many inputs to produce consistent and usable data. Exploratory data analysis (EDA), including the production of simple data summaries, tabulations and graphs is typically the first stage of any such analysis. The impact on research of exceptions - rare events, outliers, extreme values, unusual clusters - is extremely important. Exploratory methods, such as examining individual cases and producing box-plots, help to determine whether these observations are valid and important, or require removal from the study set. This phase then extends into more formal study in order to identify patterns of various kinds that help the researcher to develop new ideas and hypotheses regarding form and process. And this in turn may lead on to the use or development of one or more models within a formal build-fit-criticize cycle. Crawley (2007, p339, [CRA1]) provides the following extremely sound advice regarding model selection (echoing a quote attributed to George Box):
"It is as well to remember the following truths about models: all models are wrong; some models are better than others [Box said more useful]; the correct model can never be known with certainty; and the simpler a model the better it is"!
Finally the output of the models and analysis is examined, and where necessary the dataset and data gathering plan is re-visited, working back up the PPDAC model chain, prior to moving on to producing the output from the project and delivering this in the Conclusion stage. The application of a single analytical technique or software tool is often to be avoided unless one is extremely confident of the outcome, or it is the analytical technique or approach itself that is the subject of investigation, or this approach or toolset has been specifically approved for use in such cases. If analysis is not limited to single approaches, and a series of outputs, visualizations, techniques and tests all suggest a similar outcome then confidence in the findings tends to be greatly increased. If such techniques suggest different outcomes the analyst is encouraged to explain the differences, by re-examining the design, the data and/or the analytical techniques and tools applied. Ultimately the original problem definition may have to be reviewed.
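The box-plot screening for exceptional values mentioned at the start of the Analysis stage can be sketched as follows. The sample values are invented for illustration, the function names are our own, and the multiplier of 1.5 times the interquartile range is the usual box-plot convention rather than a rule taken from the sources cited here. Quartiles are computed with the default method of Python's `statistics.quantiles`; other software may use slightly different quartile definitions.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) box-plot fences: quartiles -/+ k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """List the observations lying outside the fences, for closer inspection."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]

# Invented sample with one suspiciously large value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.6]

flagged = flag_outliers(sample)
print("fences:", tukey_fences(sample))
print("flagged for inspection:", flagged)

# Comparing a robust and a non-robust summary shows how much the flagged
# value influences the result.
print("mean:", statistics.fmean(sample), "median:", statistics.median(sample))
```

Flagged observations are candidates for examination, not automatic deletion: whether they are valid and important, or should be removed from the study set, remains a judgement for the analyst.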
Conclusions: The last stage of the PPDAC process is that of reaching conclusions based upon the analyses conducted, and communicating these to others. Note that implementation of findings (e.g. actually proceeding with building a bypass, designating an area as unfit for habitation, or implementing a vaccination programme) does not form part of this model process, but lies beyond its confines.
“The purpose of the Conclusion stage is to report the results of the study in the language of the Problem. Concise numerical summaries and presentation graphics [tabulations, visualizations] should be used to clarify the discussion. Statistical jargon should be avoided. As well, the Conclusion provides an opportunity to discuss the strengths and weaknesses of the Plan, Data and Analysis especially in regards to possible errors that may have arisen” MacKay and Oldford (2000)
For many problems this summary is sufficient. For others the conclusions stage will be the start of additional work: re-visiting the problem and iterating the entire process or parts of the process; a new project; implementing proposals; and/or wider consultation. In Michelson's case, he was aware of several imperfections in his research, and in fact spent the rest of his life conducting a series of further experiments in order to progressively improve the accuracy of his estimate of the true speed of light. A full discussion of this revised PPDAC model in the context of spatial analysis is provided in Chapter 3, "Spatial analysis and the PPDAC model", of de Smith et al. (2009, [DES1]), which is available online.
[BOX1] Box G E P, Hunter J S, Hunter W G (1978, 2005) Statistics for Experimenters: An Introduction to Design, Data Analysis and Model Building. J Wiley & Sons, New York. The second, extended edition was published in 2005
[CRA1] Crawley M J (2007) The R Book. J Wiley & Sons, Chichester, UK
[DES1] de Smith M J, Goodchild M F, Longley P A (2009) Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. 3rd edition, Troubador, Leicester. Available from: http://www.spatialanalysisonline.com/
[MAC1] MacKay R J, Oldford R W (2000) Scientific Method, Statistical Method and the Speed of Light. Statist. Sci., 15, 3, 254-278. Available from: http://projecteuclid.org/euclid.ss/1009212817
Wikipedia, Speed of Light article: http://en.wikipedia.org/wiki/Speed_of_light