For your problem (a linear model fitted to a small data set), the two approaches are quite similar, and the statistical linear model may be more suitable than data mining.
Their emphases are slightly different. Generally speaking, statistics is built more on probability, distributions, and hypothesis tests; to obtain theoretical guarantees it makes more assumptions and tends toward simpler model forms. Data mining, on the other hand, focuses on optimizing predictive performance on the data set; it puts less emphasis on mathematical assumptions and proofs and tends toward complex structures and model averaging. Usually no data-mining model can be guaranteed by mathematical proof to outperform the others consistently before it is applied to the data, so different models with different parameter settings are tried on the data set and compared by performance (for example, by cross-validation). This is why we see many hypothesis tests used to validate statistical models, while data mining usually compares only the test error (prediction performance).
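As a minimal sketch of what "compare by cross-validation" looks like in practice (using Python and scikit-learn on a synthetic data set; nothing here is specific to your problem):

```python
# Sketch: compare candidate models by cross-validated prediction error,
# not by hypothesis tests. Synthetic data, purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=1.0, size=200)

models = {
    "linear regression": LinearRegression(),
    "ridge regression":  Ridge(alpha=1.0),
    "random forest":     RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    # sklearn returns negative MSE; flip the sign to report MSE.
    scores = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error")
    print(f"{name:20s} CV mean squared error: {scores.mean():.3f}")
```

The model with the lowest cross-validated error is preferred, with no distributional assumptions or p-values involved.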
In the linear model there are several classic assumptions, and we use hypothesis tests, residual plots, and so on to validate the model; this works well for small data. With big data it is difficult to read anything from a cluttered residual plot, and some hypothesis tests are no longer convincing. For example, the Shapiro-Wilk normality test will nearly always reject normality, because with a very large N the test has enough power that even tiny, practically irrelevant departures from normality become statistically significant.
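To illustrate the Shapiro-Wilk point, here is a small simulation sketch (scipy-based, my own illustration; the exact p-values depend on the random seed). The same mild departure from normality tends to pass the test at small N but gets rejected as N grows:

```python
# Sketch: a fixed, mild departure from normality (a t-distribution with 10 df)
# is usually not rejected at small N, but the test's power grows with N.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (50, 500, 5000):
    x = rng.standard_t(df=10, size=n)   # close to normal, but not exactly normal
    stat, p = stats.shapiro(x)
    print(f"N = {n:5d}  Shapiro-Wilk p-value = {p:.4f}")
```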
With big data and computational machine learning, we can use not only more variables but also more complex structures built from the same variables, such as splines and penalized regression. We can also use bootstrap resampling and model averaging (for example, bagging). In these cases the classic statistical hypothesis tests and model-selection criteria such as AIC/BIC are no longer valid, because we cannot even write down a single likelihood or parameter set that is comparable across such different kinds of models. So data mining focuses instead on how to tune the model to get good predictive performance.
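As a sketch of "more complex structure from the same variable, tuned for prediction" (assuming a recent scikit-learn, since SplineTransformer needs version 1.0 or later): expand a single predictor into a spline basis, fit a penalized (lasso) regression, and choose the penalty by cross-validation rather than by a hypothesis test or AIC/BIC.

```python
# Sketch: one variable, expanded into spline basis functions, with a lasso
# penalty whose strength is chosen by cross-validation (no AIC/BIC, no tests).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)   # nonlinear truth
X = x.reshape(-1, 1)

model = make_pipeline(
    SplineTransformer(degree=3, n_knots=10),   # complex structure, same variable
    LassoCV(cv=5),                             # penalty tuned by cross-validation
)
model.fit(X, y)
print("chosen penalty alpha:", model[-1].alpha_)
```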