﻿ Probability distributions > Kernel Density Estimation

# Kernel Density Estimation

Given a sample set of real data values {x1,x2,x3,...xn} we are generally interested in using this sample to draw inferences about the population from which the sample was drawn. Often assumptions are made about the form of the population distribution. These assumptions typically follow some analysis of the sample, including the creation of a histogram of the data, showing the overall shape of the sample distribution and comparing this with the expected form and parameters for particular population distributions, such as the Normal distribution or the t-distribution. The process of producing a histogram converts the point samples to frequency values within fixed interval ranges. If the histogram is standardized by dividing the height (frequency) of each column by the sample size, n, then the total area under the histogram will sum to 1 and the individual histogram bars will represent estimates of the probabilities over the range in question. Each such bar is thus a measure of the proportional density of points within the range selected. Using a rectangular bar to represent the distribution of points over a particular range is a form of localized smoothing — converting point data on the real line into density data over an interval. Overall the result is not smooth, but stepped.

Kernel density estimation (KDE) is a procedure that provides an alternative to the use of histograms as a means of generating frequency distributions. This idea is simplest to understand by looking at the example in the diagrams below. The first diagram shows a set of 5 events (observed values) marked by crosses. They occur at positions 7, 8, 9, 12 and 14 along the line. We could argue that the point density across the entire 20-unit line length is 5/20=0.25 points per unit length and assign a value of 0.25 to each section, as shown on the gray line. This would correspond to a histogram with width 20 and height 0.25 (or height 0.05 if standardized so the total added to 1). We might equally well argue that if we divide the overall line into two halves, the density over the first half should be 0.3 per unit length and over the second, 0.2 per unit length, to reflect the variation in positioning of the 5 events. This is illustrated in the third line, and corresponds to a histogram with two bars, each of width 10 units.

Point data distributed along 20 segment line

There is clearly no single right answer or single method to assign the points to the line’s entire length — the method we choose will depend on the application we are considering. Important observations to notice about this problem include: the length of the line we start with seems to have an important effect on the density values we obtain, and since this may be arbitrary, some method of removing dependence on the line length is desirable; if the line is partitioned into discrete chunks a sudden break in density (a step) occurs where the partition boundaries occur, which is often undesirable; depending on the number of partitions and distribution of points, areas may contain zero density, even if this is not the kind of spread we are seeking or regard as meaningful; the line is assumed to be continuous, and we are assuming that allocation of density values to every part is valid; and finally, if we have too many partitions all sections will only contain values of 1 or 0, which is essentially back to where we started from.

These observations can be dealt with by treating each point in the original set as if it was spread over a range, then adding together overlapping zones and checking that the total adds up to the original value. For example, choosing each point and smoothing it over 5 units in a uniform symmetric manner, we obtain the result shown in the diagram below. The lower rows show the spreading of each of the 5 original points, with the total row showing the sums (densities) assigned to each unit segment. These add up to 5, as they should, and a chart showing this distribution confirms the pattern of spread.

Simple linear (box or uniform) kernel smoothing

This method still leaves us with some difficulties: there are no density values towards the edges of our linear region; density values still jump abruptly from one value to the next; and values are evenly spread around the initial points, whereas it might be more realistic to have a greater weighting of importance towards the center of each point. All of these concerns can be addressed by selecting a well-defined, smooth and optionally unbounded function, known as a kernel, and using this to spread the values. The function often used is a Normal distribution — as we have already seen, this is a bell-shaped curve extending to infinity in each direction, but with a finite (unit) area contained underneath the bell. In the diagram below, for each sample value (7,8,9,12 and 14) we have provided a Normal distribution curve with central value (the mean) at the point in question and with an average spread (standard deviation, or bandwidth, h) of 2 units. We can then add the areas under each of these curves together to obtain a (cumulative) curve, and then divide this curve by 5 to adjust the area under the curve back to 1 giving the lower (red) curve shown. When adjusted in this way the values are often described as an empirical probability density function, and when extended to two dimensions, the resulting surface is described as an empirical probability density surface. We now have a density value for every position along the original line, with smooth transitions between the values, which is exactly what we were trying to achieve.

Univariate Normal kernel smoothing and cumulative densities

There still remain some questions: why should we use the Normal distribution? could we not use almost any unimodal symmetric function, K, with a finite area under it? and why did we choose a value of h=2 units for the bandwidth? The answer to these questions is that the specific selections made are a matter of choice and experience, although in some instances a symmetric distribution with a finite extent (e.g. a box or triangular function) may be regarded as more suitable than one with an infinite possible extent.

The next diagram shows a selection of commonly used functions plotted for the same point set, using the MATLab Statistics Toolbox function ksdensity() — most other software packages provide similar functions. As may be seen from examining the various curves shown, the exact form of the kernel function does not tend to have a major impact on the set of density values assigned across the linear segment (or area in 2D applications). Of much greater impact is the choice of the spread parameter, or bandwidth.

Alternative univariate kernel density functions

Expressing these concepts more formally, univariate KDE can be defined as a method of function estimation, or probability density estimation, from a sample set of real data values {x1,x2,x3,...xn} of the form:

where K() is a kernel function and h is the bandwidth.

The discussion so far addresses problems in one dimension (univariate density estimation). We now extend the process to two dimensions. This is simply a matter of taking the univariate procedures and adding a second dimension (effectively rotating the kernel function about each point). If we were to use the Normal distribution again as our kernel function it would have a two-dimensional bell-shaped form over every point (a symmetric Bivariate Normal).

As before, we place the kernel function over each point in our study region and calculate the value contributed by that point over a finely drawn grid. The grid resolution does not affect the resulting surface form to any great degree, but if possible should be set to be meaningful within the context of the dataset being analyzed, including any known spatial errors or rounding that may have been applied, and making allowance for any areas that should be omitted from the computations (e.g. industrial zones, water, parks etc. when considering residential-based data). Values contributed by all points at every grid intersection or for every grid cell are then computed and added together to give a composite density surface. The resulting grid values may be provided as: (i) relative densities — these provide values in events per unit area (i.e. they are adjusted by the grid size, giving a figure as events per square meter or per hectare); (ii) absolute densities — these provide values in terms of events per grid cell, and hence are not adjusted by cell size. The sum of the values across all cells should equal the number of events used in the analysis; or (iii) probabilities — as per (ii) but divided by the total number of events. Computed density values may then be plotted in 2D (e.g. as density contours) or as a 3D surface (as in the diagram below). In this latter case the kernel density procedure has been applied to a dataset of reported cases of lung cancer in part of Lancashire, England (this dataset is discussed fully in Diggle (1990, [DIG1]) and more recently in Baddeley et al., 2005, [BAD1]). The software used in this instance was Crimestat, with a Normal kernel function and average spread (bandwidth) determined from the point pattern itself (using the mean distance to nearest neighbors).

Kernel density map, Lung Case data, 3D visualization

The details of each of the main kernel functions used in statistical packages are as shown in the table below. The table shows normalized functions, where the intervals or distances dij have been divided by the kernel bandwidth, h, i.e. t=dij/h.

Widely used univariate kernel density functions

 Kernel Formula Comments Normal (or Gaussian) Unbounded, hence defined for all t. The standard kernel in Crimestat; bandwidth h is the standard deviation (and may be fixed or adaptive — see further, below) Quartic (spherical) Bounded. Approximates the Normal. k is a constant (Negative) Exponential Optionally bounded. A is a constant (e.g. A=3/2) and k is a parameter (e.g. k=3). Weights more heavily to the central point than other kernels Triangular (conic) Bounded. Very simple linear decay with distance. Uniform (flat) Bounded. k=a constant. No central weighting so function is like a uniform disk placed over each event point Epanechnikov (paraboloid/quadratic) Bounded; optimal smoothing function for some statistical applications

For further details on these and other functions used for smoothing and density estimation see Bowman and Azzalini (1997 [BOW1]) and Silverman (1986 [SIL1]). Graphs of these functions are shown below, where each has been normalized such that the area under the graph sums to 1. Whether the kernel for a particular event point contributes to the value at a grid point depends on: (i) the type of kernel function (bounded or not); (ii) the parameter, k, if applicable, which may be user defined or determined automatically in some instances; and (iii) and the bandwidth, h, that is selected (a larger bandwidth spreads the influence of event points over a greater distance, but is also more likely to experience edge effects close to the study region boundaries). Event point sets may be weighted resulting in some event points having greater influence than others.

Univariate kernel density functions, unit bandwidth

 A. Constant B. Normal, SD=1 C. Exponential D. Quadratic E. Quartic F. Triangular (or linear)

Bandwidth selection is often more of an art than a science, but it may be subject to formal analysis and estimation, for example by applying kernel density estimation (KDE) procedures to sets of data where actual densities are known. An alternative to fixed bandwidth selection is adaptive selection, whereby the user specifies the selection criterion, for example defining the number of event points to include within an interval of a given width, or a circle centered on each event point and taking the radius of this circle as the bandwidth around that point. A somewhat different form of adaptive density model is provided within the R package "spatstat". In this case a random sample of points (a fraction, f%) is taken, and this is used to create a Voronoi tessellation of the sample point set. The density of the remaining points (1-f%) is then computed for each cell of the tessellation. The process is then replicated n times and the average densities of all replicates computed.

Kernel smoothing, or kernel density estimation methods (KDE methods) of the type described have a variety of applications: probability distribution estimation; exploratory data analysis; point data smoothing; creation of continuous surfaces from point data in order to combine or compare these with other datasets that are continuous; interpolation (although this terminology is confusing and not recommended — Crimestat is amongst a number of packages that use this terminology, which is essentially incorrect); and hot spot detection.

KDE can also be used in visualizing and analyzing temporal patterns, for example crime events at different times of day and/or over different periods, with the objective of understanding and potentially predicting event patterns. KDE can also be applied with more than one point set, for example a set of cases and a set of controls. The output of such “dual” dataset analysis is normally a ratio of the primary set to the secondary set (or ratio of the log transform of the variables), with the objective being to analyze the primary pattern with background effects being removed or minimized. Care must be taken in such cases that the results are not subject to distortion by very low or zero values in the second density surface, and that common bandwidths are used. Levine (2007, Chapter 8 [LEV1]) provides an excellent discussion of the various techniques and options, including alternative methods for bandwidth selection, together with examples from crime analysis, health research, urban development and ecology.

References

[BAD1] Baddeley A, Turner R, Moller J, Hazelton M (2005) Residual analysis for spatial point processes. J. of the Royal Statistical Society, Series B, 67, 617–666

[BOW1] Bowman A W, Azzalini A (1997) Applied smoothing techniques for data analysis: The kernel approach with S-Plus illustrations. Oxford Statistical Science Series, Oxford University Press, Oxford

[DIG1] Diggle P J (1990) A point process modeling approach to raised incidence of a rare phenomenon in the vicinity of a pre-specified point. J. Royal Stat. Soc., A, 153, 349-62

[LEV1] Levine N (2007) CrimeStat: A spatial statistics program for the analysis of crime incident locations (v3.1). N Levine & Associates, Houston, TX, and the National Inst. of Justice, Washington, DC. See: http://www.icpsr.umich.edu/CRIMESTAT

[SIL1] Silverman B W (1986) Density estimation for statistics and data analysis. Chapman and Hall, New York