# Single factor or one-way ANOVA

Single factor or one-way analysis of variance is one of the simplest and most commonly encountered forms of such analysis. As explained in the introduction to this topic, such analysis is essentially a statistical comparison of mean values. The basic model is of the form:

$$y_{ij} = \mu + T_i + e_{ij}$$

This states that the observed or measured value $y_{ij}$ (observation j in group i) is a linear combination of an overall mean value, μ, plus a treatment or group effect, $T_i$, plus some unexplained random variation, or error, $e_{ij}$. In Example 2 in the previous section there were 4 groups or treatments (i=1,...,4) and 4 replicates for each treatment (j=1,...,4), the overall mean value in both tables was 4.4375, and the estimated treatment effects can be obtained as the difference between each treatment mean value and the overall mean. So treatment 1 has an effect of size 2.75-4.4375=-1.6875, and the first data item in the first table, group 1, which is the value 3, is modeled as the overall mean plus the treatment effect plus a within group variation component or error for that item of 0.25, i.e. 3=4.4375-1.6875+0.25. Note that the sum of the effects is zero, which is always the case for a Type I or fixed effects model, i.e.

$$\sum_i T_i = 0$$

A standard format of ANOVA table is often used to summarize the sums of squares analysis. For the two examples above these can be drawn up as follows:

Table 1

| Source | Sums of squares | Degrees of freedom | Mean squares | F-test, 5% level |
|---|---|---|---|---|
| Factors/treatments | 18.688 | 3 | 6.229167 | 1.5177<3.49 |
| Residual error | 49.250 | 12 | 4.104167 | |
| Totals | 67.938 | 15 | | |

Table 2

| Source | Sums of squares | Degrees of freedom | Mean squares | F-test, 5% level |
|---|---|---|---|---|
| Factors/treatments | 18.688 | 3 | 6.229167 | 14.2381>3.49 |
| Residual error | 5.250 | 12 | 0.4375 | |
| Totals | 23.938 | 15 | | |
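The F-test entries in the two tables are the ratio of the treatment mean square to the residual mean square. As a minimal sketch in Python (the helper name `f_ratio` is illustrative and not from the source), the mean squares quoted in Tables 1 and 2 reproduce the tabulated ratios to within rounding:

```python
def f_ratio(ms_treatment, ms_error):
    """Return the F statistic as the ratio of two mean squares."""
    return ms_treatment / ms_error

# Mean squares as quoted in Tables 1 and 2 (3 and 12 degrees of freedom)
f_table1 = f_ratio(6.229167, 4.104167)
f_table2 = f_ratio(6.229167, 0.4375)

# 3.49 is the 5% critical value quoted in the tables
for label, f_value in (("Table 1", f_table1), ("Table 2", f_table2)):
    verdict = "significant" if f_value > 3.49 else "not significant"
    print(f"{label}: F = {f_value:.4f} ({verdict} at the 5% level)")
```

The treatment mean square is identical in both tables, so the difference in outcome comes entirely from the size of the residual mean square.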

The above examples were introduced purely to explain the basic ideas behind the analysis of variance, i.e. to model the variation in datasets based on the application of different treatments or treatment levels. A more realistic example follows.

Example 3: Bacteria counts in milk shipments

The data in this example relate to a study reported by Hogg and Ledolter (1987, [HOG1]) of bacteria counts in shipments of milk. There were five shipments, and for each shipment bacteria counts were made from 6 randomly selected cartons of milk. The question to be answered is whether some shipments have higher bacteria counts than others. The data are shown below, followed by the ANOVA table produced using the MATLAB anova1() function. The R function aov() will produce a very similar ANOVA table, but without the final row showing the totals, for example using an expression of the form summary(aov(y~bacteria)).

Bacteria count data, shipments 1-5, samples 1-6

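| Sample | Shipment 1 | Shipment 2 | Shipment 3 | Shipment 4 | Shipment 5 |
|---|---|---|---|---|---|
| 1 | 24 | 14 | 11 | 7 | 19 |
| 2 | 15 | 7 | 9 | 7 | 24 |
| 3 | 21 | 12 | 7 | 4 | 19 |
| 4 | 27 | 17 | 13 | 7 | 15 |
| 5 | 33 | 14 | 12 | 12 | 10 |
| 6 | 23 | 16 | 18 | 18 | 20 |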

Bacteria count ANOVA

| Source | Sums of squares | Degrees of freedom | Mean squares | F-test | Prob>F |
|---|---|---|---|---|---|
| Shipments | 803.00 | 4 | 200.75 | 9.01 | 0.0001 |
| Residual error | 557.17 | 25 | 22.287 | | |
| Totals | 1360.17 | 29 | | | |
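The entries in this table can be reproduced directly from the data. The following is a minimal sketch in standard-library Python rather than MATLAB or R; the function name `one_way_anova` and the dictionary keys are illustrative choices, not part of any package mentioned in the text.

```python
# Bacteria counts: one list per shipment (the columns of the data table)
shipments = [
    [24, 15, 21, 27, 33, 23],
    [14, 7, 12, 17, 14, 16],
    [11, 9, 7, 13, 12, 18],
    [7, 7, 4, 7, 12, 18],
    [19, 24, 19, 15, 10, 20],
]

def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom, mean squares and F ratio."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups: squared deviation of each group mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups: squared deviation of each item from its own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f": ms_between / ms_within,
    }

table = one_way_anova(shipments)
for key, value in table.items():
    print(f"{key:>10}: {value:.3f}")
```

Because the group sizes are taken from the data rather than fixed, the same function also handles unequal numbers of replicates.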

The ANOVA summary table shows that the shipments do indeed appear to differ significantly. The critical value for the F-distribution at the 5% level is not provided, but can be obtained using the F-distribution inverse function: finv(0.95,4,25). This yields a critical value of 2.7587, which is much less than the observed value of 9.01. In fact the critical value at the 99.9% level is only around 6.5, so the observed result is extremely unlikely to be a matter of chance. If the F-test had produced a much smaller ratio that was not significant (e.g. F=2.2), then we could not reject the null hypothesis that the mean bacteria counts were equal, and the overall mean would provide the best estimate of the population mean for each shipment.
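The critical values and tail probability quoted above can be checked without a statistics package. The sketch below uses only the Python standard library and evaluates the F distribution through its relationship to the Beta distribution, integrating the Beta density numerically with Simpson's rule. The function names `f_cdf` and `f_quantile` are illustrative, and the approach assumes both degrees of freedom are at least 2. In practice a library routine such as `scipy.stats.f.ppf` would normally be used instead.

```python
import math

def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) for an F(d1, d2) variable, assuming d1 >= 2 and d2 >= 2."""
    if x <= 0:
        return 0.0
    a, b = d1 / 2.0, d2 / 2.0
    z = d1 * x / (d1 * x + d2)  # maps the F value onto the Beta(a, b) scale
    norm = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

    def density(t):
        return norm * t ** (a - 1) * (1.0 - t) ** (b - 1)

    # Composite Simpson's rule on [0, z]; steps must be even
    h = z / steps
    total = density(0.0) + density(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0

def f_quantile(p, d1, d2):
    """Invert f_cdf by bisection to find the value with cumulative probability p."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < p:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, d1, d2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

critical_5pct = f_quantile(0.95, 4, 25)     # counterpart of finv(0.95,4,25)
critical_01pct = f_quantile(0.999, 4, 25)   # the 99.9% level
p_value = 1.0 - f_cdf(9.01, 4, 25)          # upper tail for the observed F
print(f"5% critical value:   {critical_5pct:.4f}")
print(f"0.1% critical value: {critical_01pct:.2f}")
print(f"Prob>F for F=9.01:   {p_value:.5f}")
```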

Graphical displays of the source data are strongly recommended as part of any ANOVA process. Many software packages generate a number of different forms of output as standard, including one or more graphical displays. Box plots are a widely used option and are provided by default in MATLAB. For the bacteria dataset a standard box plot is shown below. The lines at the notches mark the shipment medians and highlight the substantial differences between the datasets.

Bacteria count: box plots of shipment data

As mentioned earlier, many problems encountered are of this type, but variations often occur. For example, where the treatments are random rather than fixed, an estimate of the variance between treatments due to this randomness can be obtained from the difference between the mean square between treatments and the mean square within treatments, divided by the number of replicates or rows in the data.

Another common situation is where one or more data items are missing, resulting in an unequal number of replicates. In general equal numbers of replicates are recommended (required), but unequal numbers can be handled by minor adjustments to the expressions used and may have little effect unless the number of replicates is small (in some special circumstances, if the variation in numbers is proportional to the class frequencies then the theory still applies without further adjustment, but this is not a common situation).

Non-Normality and heteroskedasticity are more of a problem, at least for the statistical significance aspect of the analysis. Tests of both assumptions (Normality and homogeneity of variance) would generally be carried out, although for small numbers of replicates testing for Normality may not be particularly effective. A range of homogeneity tests can be carried out. For example, applying the Fligner-Killeen test in R (a median chi-square test) to the bacteria count data above yields:

Fligner-Killeen: med chi-squared = 1.1394, df = 4, p-value = 0.888

confirming that there is no evidence to suggest the variances are not homogeneous. If necessary, data should be transformed or an alternative non-parametric procedure applied (for example a procedure based on the ranks of the data, such as the Kruskal-Wallis form of ANOVA). Graphical checks may help, either as an alternative or as an accompaniment to variance analyses. For example, in R the plot(aov()) function will generate a set of four charts that help visualize the data and the model fit. These are shown for the bacteria data below.

Graphical ANOVA: Bacteria dataset

The top left chart shows the 5 datasets and looks at the pattern of absolute residual values from the fitted values (which are shipment mean values in this case). The line is a best fit line through the datapoints and ideally should be approximately 0 throughout, which it is in this case. The mean values of the shipments are 23.8, 13.3, 11.7, 9.2 and 17.8, so shipments are arranged in order of size on the horizontal axis of the chart. The circles show the data items, with some shipments having multiple bacteria counts which are the same (e.g. 7). The numbers shown against some residuals are the data items that correspond to some of the largest residuals; thus 24 in the top left refers to record 6 in column 4 (the 24th data item, with value 18 = +8.8 residual).

The second chart is a Normal QQ or Probability plot, and a good fit to the Normal distribution will be indicated by data items close to the dotted straight line.

The third chart is a combination of the first two, in that it plots fitted values against the positive square root of standardized residuals, and again we are looking for a broadly horizontal line and even scatter of the residuals (no systematic trend with size).

The final chart simply plots the standardized residuals against the ranked treatments, again looking for an approximately horizontal best fit line and no residuals that might bias the modeling process. Together these plots confirm (do not reject) the basic hypothesis that the data are approximately Normal, the model fit is reasonable and the variances are homogeneous.
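The residuals behind the first chart are simply each count minus its shipment mean. As a minimal sketch in standard-library Python (variable names are illustrative), the data items can be numbered down each shipment column in turn, which is the ordering implied by the text's reference to item 24 as record 6 in column 4:

```python
# Bacteria counts: one list per shipment (the columns of the data table)
shipments = [
    [24, 15, 21, 27, 33, 23],
    [14, 7, 12, 17, 14, 16],
    [11, 9, 7, 13, 12, 18],
    [7, 7, 4, 7, 12, 18],
    [19, 24, 19, 15, 10, 20],
]

# Residual = observed count minus fitted value (the shipment mean),
# stored in column order so that list position + 1 is the item number
residuals = []
for counts in shipments:
    fitted = sum(counts) / len(counts)
    residuals.extend(c - fitted for c in counts)

# Item numbers (1-based) of the three largest residuals by absolute size
ranked = sorted(range(len(residuals)), key=lambda i: abs(residuals[i]), reverse=True)
largest_items = sorted(i + 1 for i in ranked[:3])

print(f"Residual for item 24: {residuals[23]:+.1f}")
print(f"Items with the largest residuals: {largest_items}")
```

Residuals computed this way sum to zero within each shipment, which is why the best fit line in the first chart is expected to sit close to 0.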

The model itself can be conveniently inspected using the R operation summary.lm(aov()), which provides details of the underlying linear model (lm) of the form shown at the start of this section. However, for essentially categorical data, of the type being considered here, it is the comparison between shipments which is of primary concern, and this has already been shown to be highly significant.
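For the bacteria data the parameters of this model can be written out explicitly. The sketch below, in standard-library Python with illustrative variable names, computes the overall mean and the shipment effects in the sum-to-zero form used at the start of this section. It also shows the equivalent baseline-plus-differences form, which is how R's default treatment contrasts report the coefficients (the first shipment mean as the intercept, the remaining shipments as differences from it).

```python
# Bacteria counts: one list per shipment (the columns of the data table)
shipments = [
    [24, 15, 21, 27, 33, 23],
    [14, 7, 12, 17, 14, 16],
    [11, 9, 7, 13, 12, 18],
    [7, 7, 4, 7, 12, 18],
    [19, 24, 19, 15, 10, 20],
]

values = [x for s in shipments for x in s]
grand_mean = sum(values) / len(values)
means = [sum(s) / len(s) for s in shipments]

# Sum-to-zero form: y_ij = mu + T_i + e_ij, with T_i = shipment mean - overall mean
effects = [m - grand_mean for m in means]

# Baseline form: shipment 1 mean as intercept, other shipments as differences from it
intercept = means[0]
contrasts = [m - intercept for m in means[1:]]

print(f"Overall mean: {grand_mean:.4f}")
print("Effects:", [round(t, 4) for t in effects])
print(f"Intercept (shipment 1 mean): {intercept:.4f}")
print("Differences from shipment 1:", [round(c, 4) for c in contrasts])
```

Both forms give the same fitted values (the shipment means); they differ only in how the five means are split into parameters.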

A further issue that may arise is lack of independence between the treatments or groups. If, for example, the five shipments are in time sequence, serial correlation may be a concern. Likewise, with Example 1 in the previous section, the heights and girths of trees are very likely to be spatially dependent. Computation of temporal or spatial autocorrelation measures should be undertaken if serial dependency is suspected. Where such autocorrelation exists, models that explicitly incorporate this lack of independence can often be applied (e.g. mixed effects models and/or autoregressive models).
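As a simple illustration of such a measure, the sketch below computes a lag-1 autocorrelation coefficient for the five shipment means. It assumes, purely for illustration, that shipments 1 to 5 arrived in that time order, which the source does not state. The function name `lag1_autocorrelation` is hypothetical, and with only five values the result is indicative at best.

```python
def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation: sum of adjacent cross-products of
    deviations from the mean, divided by the sum of squared deviations."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    numerator = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator

# Shipment means, assuming (hypothetically) that shipments are in time order
shipment_means = [143 / 6, 80 / 6, 70 / 6, 55 / 6, 107 / 6]
r1 = lag1_autocorrelation(shipment_means)
print(f"Lag-1 autocorrelation of shipment means: {r1:.4f}")
```

A value near zero, as here, gives no indication of serial dependence, although a longer sequence would be needed before drawing any firm conclusion.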

References

[HOG1] Hogg R V, Ledolter J (1987) Engineering Statistics. MacMillan Publishing Company