Perhaps one of the most hotly debated topics of the last 2-3 years has been the question of "Big Data". Much of the discussion has been about the data: huge volumes of 2D, 3D and 4D data are now being collected and stored. So how can they be accessed, and how can we interpret massive datasets in an effective manner? Miller and Goodchild (2014) have argued that considerable care is required when working with Big Data - significant issues arise in terms of the data, in what they identify as the four Vs: the sheer Volume of data; the Velocity of data arrival and the associated timestamps of the data; the Variety of data available and the way in which this is selected (e.g. self-selection); and the Validity of such data. A presentation by Prof Mike Goodchild of some of the key elements of the Big Data debate is provided on the Resources page of the Geospatial Analysis website - this presentation should be viewed alongside the article by Miller and Goodchild, as the latter provides a fuller explanation of the main ideas covered.
In an article published in the Financial Times in March 2014, "Big Data: Are we making a big mistake?", journalist and broadcaster Tim Harford addressed these issues and more, highlighting some of the less obvious problems posed by Big Data. Perhaps primary amongst these is the bias that is found in many such datasets. Such biases may be subtle and difficult to identify, and are often impossible to manage. For example, almost all internet-related Big Data is intrinsically biased in favor of those who have access to and utilize the internet most, with demographic and geographic bias built in. The same applies to specific services, such as Google, Twitter, Facebook, mobile phone networks, opt-in online surveys, opt-in emails - the examples are many and varied, but the problems are much the same as those familiar to statisticians for over a century. Big Data does not imply good data or unbiased data. It also presents other problems: it is all too easy to focus on data exploration and pattern discovery, identifying correlations that may well be spurious - a result of the sheer volume of data and the number of events and variables measured. With enough data and enough comparisons, statistically significant findings are inevitable, but significance alone does not necessarily provide real understanding or identify causal relationships. Of course there are many important and interesting datasets where collection and storage are far more systematic and less subject to bias, recording variables in a direct and consistent manner, with truly representative, 'complete' and 'clean' records that are stored and managed well. Such data tend to be those collected by official agencies that supplement the datasets with metadata and quality assurance information. As Harford concludes: "Big Data has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers - without making the same old statistical mistakes on a grander scale than ever."
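The point about spurious correlations can be illustrated with a small simulation. The sketch below is not drawn from Harford or from Miller and Goodchild; it is a minimal illustration under stated assumptions. It generates a set of mutually independent random variables, tests every pair for correlation at the 5% level, and counts how many pairs appear "significant" even though no real relationship exists. The function names (`pearson`, `count_spurious`), the number of variables and observations, and the use of the Fisher z-transform as an approximate significance test are all choices made for this example.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_vars=40, n_obs=100, alpha=0.05, seed=1):
    """Count 'significant' correlations among mutually independent variables.

    All parameter values are illustrative assumptions, not taken from the text.
    Returns (number of significant pairs, total number of pairs tested).
    """
    rng = random.Random(seed)
    # Every variable is drawn independently, so any correlation is pure chance.
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    # Two-sided critical value of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    pairs = 0
    for x, y in combinations(data, 2):
        pairs += 1
        # Fisher z-transform: atanh(r) * sqrt(n - 3) is approximately
        # standard normal when the true correlation is zero.
        z = math.atanh(pearson(x, y)) * math.sqrt(n_obs - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits, pairs


if __name__ == "__main__":
    hits, pairs = count_spurious()
    print(f"{hits} of {pairs} pairs look significant at the 5% level")
```

With 40 variables there are 780 pairs to compare, so at a 5% significance level one would expect a few dozen "significant" correlations purely by chance. Increasing `n_vars` increases the number of comparisons, and hence the number of spurious findings, which is the sense in which such findings become inevitable as datasets grow.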