Data mining is sometimes presented as a one-off exercise. This is a misconception. It is better regarded as an ongoing process, even when the data set is fixed. One examines the data one way, interprets the results, looks more closely at the data from a related perspective, looks at them in yet another way, and so on. The point is that, except in those very rare situations when one knows in advance what sort of pattern is of interest, the essence of data mining is an attempt to discover the unexpected – and the unexpected, by its very nature, can arise in unexpected ways.

Related to the view of data mining as a process is the question of how novel the results are. Many data mining results are only what one would expect – in retrospect. However, the fact that one can explain them after the event does not detract from the value of the data mining exercise in unearthing them. Without this exercise, it is entirely possible that one would never have thought of them. Indeed, it is likely that only those structures for which one can retrospectively formulate a plausible explanation will be valuable. Those which still seem improbable, no matter how one twists and turns the likely causal mechanisms, may well not be real phenomena at all, but simply chance artifacts of the particular data at hand.

There is clear potential, opportunity, and indeed excitement in data mining. The possibilities for making discoveries in large data sets certainly exist, and the number of very large data sets is growing daily. However, this promise should not blind us to the risks. All real data sets (even those collected by entirely automatic processes) have the potential for error, and data sets concerned with human beings (such as transaction and behaviour data) are especially prone to it. This may well mean that most ‘unexpected structures’ discovered in the data are intrinsically uninteresting, being due solely to departures from the ideal process. (Of course, such structures may be interesting for other reasons: if the data have problems which might interfere with the purpose for which they were collected, it is as well to know about them.) Associated with this is the deep issue of how to ensure (or at least provide support for the claim) that any observed patterns are ‘real’, in the sense that they reflect some underlying structure or relationship rather than merely how a particular data set with a random component (for example, a sample) happens to have fallen. Scoring methods may be relevant here, but more research by statisticians and data miners is needed.
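One common way to probe whether an observed pattern might simply reflect how the data happen to have fallen is a permutation test. The sketch below is only an illustration under stated assumptions and is not the scoring approach mentioned above: it takes the "pattern" to be a correlation between two numeric variables, and the function names `pearson` and `permutation_p_value` are hypothetical, chosen for this example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Estimate how often chance alone gives a correlation at least as strong.

    Shuffling ys destroys any real link between the two variables, so the
    shuffled correlations show what pure chance produces on this data set.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Calling `permutation_p_value(xs, ys)` on two columns returns a value near zero when shuffled data rarely match the observed correlation. A small value is weaker evidence than it looks if the pattern was itself selected after searching through many candidates, which is exactly the situation data mining tends to create.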
