# ANOVA

Analysis of variance is a family of techniques that involve separating the total variation of a dataset into component parts in order to identify whether the means of the samples or groups are the same or are significantly different. The use of the term analysis of variance is thus somewhat confusing, but the reason for this will soon become apparent. Typically the procedure separates within group variation from between group variation and examines the ratio of these two as a method of identifying how similar or different the groups are. The between group variation is defined by examining the differences in the mean values of the groups, which is why analysis of variance is, in effect, analysis of the differences between mean values.

The basic idea can be best understood using two examples. The first example illustrates how the variance can be separated into its component parts, whilst the second provides a worked example under two data scenarios for four groups. Having reviewed these examples readers are recommended to study the single-factor or one-way ANOVA section, which contains commentary that applies to many models of this type.

Example 1: Separating the variance into component parts

Consider the diagram below, which shows an oval area divided into two regions, 1 and 2. For example, this could be an island and we might be interested in measuring the heights or girths (circumferences) of a representative sample of trees across the island. Within each region we have some data measurements on a variable X (e.g. tree height in meters, or trunk girth at 1.5 meters above the ground). We can easily calculate the mean value (mean tree height or girth) and the variance of all the data, var(X), and we can also calculate the mean values and variances of the data in each region, var(X1) and var(X2).

[Figure: Analysis of variance model, two regions]

Clearly, since the full dataset X is made up of the data from the two regions, which we write as X=X1+X2, then var(X)=var(X1+X2), or more generally with many regions, var(X)=var(ΣXk). But what if we want to compare regions 1 and 2? For example, the west of the island may be wetter than the east, so trees might be taller on average in region 1. A useful question that helps us make this comparison turns out to be "is var(X), which is var(X1+X2), equal to var(X1)+var(X2)?" or more generally "does var(ΣXk)=Σvar(Xk)?".

We can answer this question by comparing the formulas for the sums of squares (SS) for each case, i.e. computing the variances but ignoring the divisors for the time being. Let there be i=1,2,…,k groups, each with j=1,2,…,n members (e.g. number of trees in our example), so that the number of data items is #{xij}=kn=N.

Let SST be the total sums of squares, SSW be the sums of squares within the regions, and SSB be the sums of squares between the regions. Then we have the three expressions:
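Assuming the standard definitions of these quantities, and writing x̄ for the overall mean and x̄ᵢ for the mean of group i (symbols introduced here for illustration), the three expressions can be written as follows. Each is shown in its defining form and then expanded:

$$
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n}\left(x_{ij}-\bar{x}\right)^2 = \sum_{i=1}^{k}\sum_{j=1}^{n}x_{ij}^2 \;-\; N\bar{x}^2 \\
SSW &= \sum_{i=1}^{k}\sum_{j=1}^{n}\left(x_{ij}-\bar{x}_i\right)^2 = \sum_{i=1}^{k}\sum_{j=1}^{n}x_{ij}^2 \;-\; n\sum_{i=1}^{k}\bar{x}_i^2 \\
SSB &= n\sum_{i=1}^{k}\left(\bar{x}_i-\bar{x}\right)^2 = n\sum_{i=1}^{k}\bar{x}_i^2 \;-\; N\bar{x}^2
\end{aligned}
$$

In the expanded forms the term n Σ x̄ᵢ² appears in SSW with a negative sign and in SSB with a positive sign. The expansion of SSB uses the fact that, with equal group sizes, the group means sum to k x̄.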

The term involving the group means appears in both the second and third expressions, but with opposite signs, so if we add SSB+SSW these terms cancel and we are left with two terms that are identical to those in the formula for the total sum of squares, SST. Thus SST=SSW+SSB. So the total sum of squares (and hence the variance) is actually a combination of two separate components: the within group sum of squares and the between group sum of squares. Thus var(X) is not, in general, equal to var(X1)+var(X2); the missing piece of the puzzle is the between group sum of squares, which measures how different the mean values of the individual groups are.

We will also need the degrees of freedom for each of these separate components if we are to apply a test, such as an F-test. There are k items in the Between regions part (B), so dfB=(k-1). There are n items in each Within group (W), so each group has n-1 df, and there are k groups, so the total dfW=k(n-1)=kn-k=(N-k). Now (N-k)+(k-1)=N-1=dfT. This analysis assumes the number of measurements in each group is the same, but can be generalized to the case where these numbers are different. The statistical test we then apply is simply to compute F=(SSB/dfB)/(SSW/dfW); under the null hypothesis that the group means are equal, F follows an F-distribution with (k-1, N-k) df. We now apply this procedure to sample data to illustrate the process at work.
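The decomposition and the generalization to unequal group sizes can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `sums_of_squares` and the tree heights are hypothetical, and the between group term is weighted by each group's size, which is the usual way of handling unequal groups.

```python
from statistics import fmean


def sums_of_squares(groups):
    """Return (SST, SSW, SSB) for a list of groups of numbers.

    Group sizes may differ: each group's between-group term is
    weighted by the number of items in that group.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    ssw = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ssb = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    return sst, ssw, ssb


# Hypothetical tree heights in meters for two regions of unequal size
# (illustrative values only, not data from the text).
region_1 = [12.0, 15.0, 14.0, 13.0, 16.0]
region_2 = [10.0, 11.0, 9.0]

sst, ssw, ssb = sums_of_squares([region_1, region_2])

k = 2                                  # number of groups
N = len(region_1) + len(region_2)      # total number of data items
df_between, df_within = k - 1, N - k
f_ratio = (ssb / df_between) / (ssw / df_within)

print(f"SST={sst:.3f}  SSW+SSB={ssw + ssb:.3f}  F={f_ratio:.3f}")
```

With equal group sizes the weighted between group term reduces to n times the sum of squared differences between the group means and the overall mean.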

Example 2: Applying simple ANOVA to a four group problem

The table below shows four groups of measurements. Each group includes four measurements. The overall mean value of the data is shown at the foot of column 2 as 4.4375. In column 3 the mean values for each of the four groups are shown. Column 4 shows the calculation for the total sum of squares, SST. Each entry is simply the square of the difference between the data value in column 2 and the overall mean (4.4375).

Table 1

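| Group | Data | Gp means | SST | SSW | SSB | SSW+SSB |
|---|---|---|---|---|---|---|
| 1 | 3 | | 2.066 | 0.063 | | |
| | 2 | | 5.941 | 0.563 | | |
| | 2 | | 5.941 | 0.563 | | |
| | 4 | 2.75 | 0.191 | 1.563 | 2.848 | |
| 2 | 6 | | 2.441 | 2.250 | | |
| | 2 | | 5.941 | 6.250 | | |
| | 3 | | 2.066 | 2.250 | | |
| | 7 | 4.5 | 6.566 | 6.250 | 0.004 | |
| 3 | 2 | | 5.941 | 7.563 | | |
| | 6 | | 2.441 | 1.563 | | |
| | 5 | | 0.316 | 0.063 | | |
| | 6 | 4.75 | 2.441 | 1.563 | 0.098 | |
| 4 | 6 | | 2.441 | 0.063 | | |
| | 9 | | 20.816 | 10.563 | | |
| | 3 | | 2.066 | 7.563 | | |
| | 5 | 5.75 | 0.316 | 0.563 | 1.723 | |
| mean | 4.4375 | | | | | |
| Sums of squares | | | 67.938 | 49.250 | 18.688 | 67.938 |
| df | | | 15 | 12 | 3 | 15 |
| Mean squares (Sums of squares/df) | | | | 4.104167 | 6.229167 | |

F-test: 1.517766

F(3,12:p=0.05): 3.49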

Totaling column 4 we obtain SST=67.938. Column 5 is the within group sum of squares, SSW. Thus the first value, 0.063, is simply the square of the difference between the first data item in group 1 (which is 3) and the group 1 mean (which is 2.75). This is repeated for all 4 items in group 1, and then for group 2 from their mean and so on. The column sum then gives the within groups sum of squares, SSW=49.250. The sixth column provides the between groups sum of squares, which only consists of four entries. The first entry is the square of the difference between the group 1 mean (2.75) and the overall mean (4.4375). The four SSB entries sum to 4.672, and multiplying this by the n=4 items in each group gives the SSB total of 18.688. To the right of this we see the sum of SSB and SSW is 67.938, i.e. the same value as the SST, which is exactly what we would expect from Example 1. We then show the degrees of freedom for each column. The total degrees of freedom is 16-1=15, the within groups is 4 sets of (4-1)=3 df, so 12 in total, and finally 4-1=3 df for the between groups calculation. The mean squares (variance) values for within and between groups are the respective sums of squares divided by their degrees of freedom. As we have seen previously these are chi-square distributed if the population distribution is Normal.

To test whether there is a significant difference between the groups, we compute the ratio: F=(SSB/dfB)/(SSW/dfW) as noted above. Here that is 6.229167/4.104167=1.517766 as shown in the table. We can either look up the probability associated with this specific value, or compare it with the critical value of the F-distribution with parameters 3 and 12 df, and significance level α=0.05, which is 3.49. Because our ratio is well below this figure we cannot reject the null hypothesis, and it is reasonable to assume, on this evidence, that there is no significant difference between the mean values of the four groups. We now examine an apparently minor change to the dataset above that has a surprising result.
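The arithmetic for Table 1 can be checked with a short script. The sketch below uses only the Python standard library; the variable names are illustrative, and the critical value 3.49 is taken from the text rather than computed.

```python
from statistics import fmean

# Data from Table 1: four groups of four measurements.
groups = [
    [3, 2, 2, 4],
    [6, 2, 3, 7],
    [2, 6, 5, 6],
    [6, 9, 3, 5],
]

k = len(groups)        # number of groups
n = len(groups[0])     # items per group (equal group sizes)
N = k * n              # total number of data items

values = [x for g in groups for x in g]
grand_mean = fmean(values)                 # overall mean
group_means = [fmean(g) for g in groups]   # one mean per group

# Total, within group and between group sums of squares.
sst = sum((x - grand_mean) ** 2 for x in values)
ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
ssb = n * sum((m - grand_mean) ** 2 for m in group_means)

# Mean squares: each sum of squares divided by its own df.
ms_within = ssw / (N - k)
ms_between = ssb / (k - 1)
f_ratio = ms_between / ms_within

F_CRITICAL = 3.49      # F(3, 12) at alpha = 0.05, as quoted in the text
reject_null = f_ratio > F_CRITICAL

print(f"SST={sst:.4f}  SSW={ssw:.4f}  SSB={ssb:.4f}")
print(f"F={f_ratio:.6f}  reject null hypothesis: {reject_null}")
```

Note that the between group sum is multiplied by n, the number of items in each group, so that SSW and SSB add up to SST.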

Table 2

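| Group | Data | Gp means | SST | SSW | SSB | SSW+SSB |
|---|---|---|---|---|---|---|
| 1 | 3 | | 2.066 | 0.063 | | |
| | 2 | | 5.941 | 0.563 | | |
| | 2 | | 5.941 | 0.563 | | |
| | 4 | 2.75 | 0.191 | 1.563 | 2.848 | |
| 2 | 5 | | 0.316 | 0.250 | | |
| | 5 | | 0.316 | 0.250 | | |
| | 4 | | 0.191 | 0.250 | | |
| | 4 | 4.5 | 0.191 | 0.250 | 0.004 | |
| 3 | 5 | | 0.316 | 0.063 | | |
| | 5 | | 0.316 | 0.063 | | |
| | 5 | | 0.316 | 0.063 | | |
| | 4 | 4.75 | 0.191 | 0.563 | 0.098 | |
| 4 | 6 | | 2.441 | 0.063 | | |
| | 6 | | 2.441 | 0.063 | | |
| | 6 | | 2.441 | 0.063 | | |
| | 5 | 5.75 | 0.316 | 0.563 | 1.723 | |
| mean | 4.4375 | | | | | |
| Sums of squares | | | 23.938 | 5.250 | 18.688 | 23.938 |
| df | | | 15 | 12 | 3 | 15 |
| Mean squares | | | | 0.4375 | 6.229167 | |

F-test: 14.2381

F(3,12:p=0.05): 3.49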

Table 2 appears to be very similar to Table 1. The data column has the same overall mean value and the group means are the same as before. This means that the SSB column is also unchanged. However, the data values have been altered so that there is far less variation within the groups. The groups are thus much more homogeneous internally, and relative to this internal homogeneity the differences between the groups are much more substantial. This is apparent from the SSW column, which now has a total of only 5.25, producing an F-test statistic which is greatly increased to 14.2381, much greater than the critical level at 5%. Hence the ANOVA approach highlights the variation between groups relative to the variation within groups, and only if the between group variation is large compared with the within group variation does it suggest that the group means are probably different.
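The contrast between the two tables can be reproduced side by side. The following is a sketch under the equal group size assumption used in this section; the helper name `anova_summary` is hypothetical.

```python
from statistics import fmean


def anova_summary(groups):
    """Return (F, SSW, SSB) for equal-sized groups of numbers."""
    k, n = len(groups), len(groups[0])
    grand_mean = fmean([x for g in groups for x in g])
    ssw = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ssb = n * sum((fmean(g) - grand_mean) ** 2 for g in groups)
    f_ratio = (ssb / (k - 1)) / (ssw / (k * n - k))
    return f_ratio, ssw, ssb


# Data columns from Table 1 and Table 2.
table_1 = [[3, 2, 2, 4], [6, 2, 3, 7], [2, 6, 5, 6], [6, 9, 3, 5]]
table_2 = [[3, 2, 2, 4], [5, 5, 4, 4], [5, 5, 5, 4], [6, 6, 6, 5]]

f1, ssw1, ssb1 = anova_summary(table_1)
f2, ssw2, ssb2 = anova_summary(table_2)

# The group means, and hence SSB, are the same in both tables;
# only the within group variation (SSW) differs.
print(f"Table 1: SSW={ssw1:.3f}  SSB={ssb1:.4f}  F={f1:.4f}")
print(f"Table 2: SSW={ssw2:.3f}  SSB={ssb2:.4f}  F={f2:.4f}")
```

Because the numerator of the F ratio is identical for the two tables, the change in the test result comes entirely from the smaller within group mean square in Table 2.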