INTRODUCTION PREPRINT Data Mining in Economics, Finance and Marketing
Data Mining in Economics, Finance and Marketing

Hans C. Jessen
International Head of Econometrics
Initiative Consulting

Georgios Paliouras
Institute of Informatics and Telecommunications
NCSR "Demokritos"
Data Mining has become a buzzword in industry in recent years. It is something that everyone is talking about but few seem to understand. There are two reasons for this lack of understanding. The first is that Data Mining researchers have very diverse backgrounds, such as machine learning, psychology and statistics. As a result, the research is often based on different methodologies, and the means of communication, e.g. notation, are often unique to a particular research area, which hampers the exchange of ideas and their dissemination to the wider public. The second reason is that the main ideas behind Data Mining often run counter to mainstream statistics and, as many companies interested in Data Mining already employ statisticians, such a change of view can create opposition.
There are many definitions of Data Mining; the one we favour can be summarised as follows:¹ “Data Mining is concerned with secondary data analysis of large databases where the aim is to identify unsuspected relationships of interest or value.”
Classical statistics is mostly based on hypothesis testing. The researcher makes assumptions about the structure of the data and then uses statistical tests to either support or reject these assumptions. The result of such an exercise is that a lot of careful consideration goes into building a model and that the researcher should have a good understanding of the data involved. The drawback is, of course, that the quality of a model becomes dependent on the quality of the researcher: the ability to formulate interesting hypotheses and the experience in handling a given data source.
Data Mining does not have hypothesis testing at its heart, and this is its main difference from classical statistics. Instead, Data Mining aims to find interesting relationships within the data that are of value to the researcher. The most appealing aspect of Data Mining is that it removes the need for the researcher to be an expert in model building and therefore reduces the cost of the analysis. It also offers the possibility that the tools might come up with ideas that the researcher would not have thought of. Although this sounds excellent for applied researchers, there are unfortunately also some drawbacks. Here are two that are of particular interest:
Are the identified relationships of interest? Most Data Mining is used on medium or large-scale databases. With a large enough number of observations, it is all too easy to identify spurious or obvious patterns. Spurious patterns are caused by pure chance and do not relate to the general structure of the data, whereas obvious patterns are relationships in the data caused by data collection procedures or inherent in a particular type of data, e.g. the colder it is, the fewer ice creams are sold. Many examples have been presented where Data Mining techniques have come up with solutions that are trivial for an expert in the field.
Selection bias. Much work has been done on selection bias in statistics, but Data Mining research has largely ignored this issue. One way to think about selection bias is to ask: “How did the people on whom I have data happen to be in the data set in the first place?” One often sees forecasts, say from decision trees, being applied to the whole population without recognising that the data from which the tree was derived was non-random.
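Both drawbacks can be made concrete with a small simulation. The following is a minimal sketch and not part of the original note: the function names, the correlation threshold of 0.2, the sample sizes, the hypothetical "spend" variable and the inclusion rule are all illustrative assumptions. The first function searches many pure-noise variables for a relationship with a pure-noise target, so every relationship it reports is spurious by construction. The second compares the mean of a simulated population with the mean of a database in which high spenders are more likely to be recorded.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def count_spurious(n_obs=100, n_vars=1000, threshold=0.2, seed=0):
    """Count noise variables whose |correlation| with a noise target exceeds threshold.

    Target and candidates are independent by construction, so every hit
    is a spurious pattern caused by pure chance.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(candidate, target)) > threshold:
            hits += 1
    return hits


def selection_bias_demo(n_pop=20000, seed=1):
    """Return (population mean, database mean) for a hypothetical spend variable.

    Assumption for illustration: the chance of appearing in the database
    rises linearly with spend, so inclusion is non-random.
    """
    rng = random.Random(seed)
    population = [rng.gauss(50, 10) for _ in range(n_pop)]

    def inclusion_prob(spend):
        # Clipped to [0, 1]; higher spenders are more likely to be recorded.
        return min(1.0, max(0.0, (spend - 30) / 40))

    database = [s for s in population if rng.random() < inclusion_prob(s)]
    return statistics.fmean(population), statistics.fmean(database)


if __name__ == "__main__":
    print("spurious 'relationships' found:", count_spurious())
    pop_mean, db_mean = selection_bias_demo()
    print("population mean spend:", round(pop_mean, 2))
    print("database mean spend:  ", round(db_mean, 2))
```

Under these assumptions, a search over a thousand unrelated variables still reports a number of "relationships", and a forecast built on the database mean would overstate the spend of the population as a whole.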
Few people in industry doubt that Data Mining is here to stay and that it offers significant improvements over classical analysis when used on large databases. The challenge for the Data Mining research community is to incorporate knowledge from other fields, like econometrics and statistics, in order not to make obvious mistakes

¹ This note draws on ideas from Prof. David J. Hand’s RSS (Royal Statistical Society) presentation: “Data Mining: puff or potential”.