Up to this point we have described data analytic issues, showing how there are differences in emphasis between data mining and statistics, despite the considerable overlap. However, data miners must also contend with entirely non-statistical issues. One example is the problem of obtaining the data in the first place. Statisticians tend to view data as a convenient flat table, with cases cross-classified by variables, stored on the computer and simply waiting for analysis. This is fine if the data set is small enough to fit in the computer’s memory, but in many data mining problems this is not possible. Worse, very large data sets are often dispersed across several machines. Perhaps the extreme case arises when analysing data from the World Wide Web, which may exist on many computers around the world. Problems of this kind make the very possibility of extracting a random sample questionable (let alone the possibility of analysing the ‘complete data set’, a concept which may not exist if the data are constantly evolving, as with telephone calls, for example).

When describing data mining techniques, I find it convenient to distinguish between two general classes of tools, according to whether they are aimed at model building or pattern detection. I have already noted the central role of the concept of a model in statistics. In model building one is trying to produce an overall summary of a set of data, to identify and describe the main features of the shape of the distribution. Examples of such ‘global’ models include a cluster analysis partition of a set of data, a regression model for prediction, and a tree-based classification rule. In contrast, in pattern detection, one is seeking to identify small (but nonetheless possibly important) departures from the norm, to detect unusual patterns of behaviour.
Examples include sporadic waveforms in EEG traces, unusual spending patterns in credit card usage (for fraud detection), and objects with patterns of characteristics unlike any others. To many, it is this second exercise which is the essence of ‘data mining’ – an attempt to locate ‘nuggets’ of value amongst the dross. However, the first kind of exercise is just as important.

Note that working with a sample is acceptable when one is concerned with global model building (one will be able to characterise the important features with a sample of a hundred thousand just as effectively as with a sample of ten million, although clearly this depends in part on the size of the features one wants to model). However, the same is not true of pattern detection. Here, selecting only a sample may discard just those few cases one had hoped to detect.

Although statistics is mainly concerned with analysing numerical data, the mixed parentage of data mining means that it also has to contend with other forms of data. In particular, logical data sometimes arise – for example, in searching for patterns composed of conjunctive and disjunctive combinations of elements. Likewise, higher order structures sometimes arise. That is, the elements of the analysis may be images, text chunks, speech signals, or even (as, for example, in meta-analysis) entire scientific studies.
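
The contrast between sampling for global model building and sampling for pattern detection can be illustrated with a small simulation. What follows is only a minimal sketch under stated assumptions: the population is synthetic, the "ordinary" and "rare" distributions are invented for illustration, and the function name `compare_sample_to_full` and all of its parameters are hypothetical rather than taken from any library. The mean stands in for a global summary, and a handful of flagged cases stand in for the unusual patterns one might hope to detect.

```python
import random
import statistics


def compare_sample_to_full(n_total=200_000, n_rare=5, sample_fraction=0.1, seed=0):
    """Compare a global summary and a rare-case count on full data vs. a random sample.

    All distributions and sizes here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Ordinary cases: values around 50 (assumed). Each case is (value, is_rare).
    population = [(rng.gauss(50.0, 10.0), False) for _ in range(n_total - n_rare)]
    # A few anomalous cases: values around 500 (assumed), flagged as rare.
    population += [(rng.gauss(500.0, 10.0), True) for _ in range(n_rare)]
    rng.shuffle(population)

    # Simple random sample without replacement.
    sample_size = int(n_total * sample_fraction)
    sample = rng.sample(population, sample_size)

    return {
        # Global summary: expected to be close in the sample and the full data.
        "full_mean": statistics.fmean(value for value, _ in population),
        "sample_mean": statistics.fmean(value for value, _ in sample),
        # Rare cases: the sample may contain few of them, or none at all.
        "rare_in_full": sum(is_rare for _, is_rare in population),
        "rare_in_sample": sum(is_rare for _, is_rare in sample),
    }


if __name__ == "__main__":
    for key, value in compare_sample_to_full().items():
        print(f"{key}: {value}")
```

With a 10% sample, each rare case has only a one-in-ten chance of being retained, so on average most of them are lost even though the estimated mean barely moves. The exact counts depend on the seed and on the assumed sizes.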
