What is the difference between "statistical modeling" and "data mining"? I have searched the internet, but the distinction is still not clear to me. Is there any overlap? Can they be considered different? Thanks!
Does this thread answer your question? https://stats.stackexchange.com/q/1521 – cdalitz Oct 30 '21 at 11:37
1 Answer
The two terms do not have sharply bounded definitions (much like the term 'sport': what makes a sport a sport?), and something is considered 'statistical modeling' or 'data mining' based on 'family resemblance'. The answer below gives one description, but other reasonable answers may deviate from it.
Statistical model: Peter McCullagh describes this in an article titled "What is a statistical model?" (The Annals of Statistics 2002, Vol. 30, No. 5, 1225–1310):
... a statistical model is a set of probability distributions on the sample space $\mathscr{S}$. A parameterized statistical model is a parameter set $\Theta$ together with a function $P:\Theta \to \mathcal{P}(\mathscr{S}) $, which assigns to each parameter point $\theta \in \Theta$ a probability distribution $P_\theta$ on $\mathscr{S}$. Here $\mathcal{ P}(\mathscr{S} )$ is the set of all probability distributions on $\mathscr{S}$...
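To make the definition concrete, one familiar instance is the i.i.d. normal family for $n$ real-valued observations. This example is chosen here for illustration and is not taken from the article:

$$
\mathscr{S} = \mathbb{R}^n, \qquad \Theta = \mathbb{R} \times (0, \infty), \qquad \theta = (\mu, \sigma^2) \mapsto P_\theta = N(\mu, \sigma^2)^{\otimes n}
$$

Each parameter point $\theta$ picks out one probability distribution on $\mathscr{S}$, and the statistical model is the set $\{P_\theta : \theta \in \Theta\}$.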
Statistical modelling can then be considered the practice of building and/or using a statistical model, that is, a probabilistic description of the observations. Examples are making predictions or drawing inferences with the help of such a model. When we do 'statistical modelling', we model the 'probability distribution' of the observations, so the emphasis is on the probabilistic variation in the data.
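As a rough sketch of what this can look like in practice, the snippet below fits a normal model to simulated observations by maximum likelihood and then uses the fitted distribution to state a 95% central interval. The normal family, the simulated data, and the helper names `fit_normal` and `central_interval` are all assumptions made for illustration; only the Python standard library is used.

```python
import random
from statistics import NormalDist, fmean, pstdev


def fit_normal(observations):
    """Fit N(mu, sigma^2) by maximum likelihood (hypothetical helper)."""
    mu = fmean(observations)
    # The maximum-likelihood estimate of sigma uses the 1/n (population) form.
    sigma = pstdev(observations, mu)
    return NormalDist(mu, sigma)


def central_interval(model, level=0.95):
    """Return the central interval of the fitted distribution."""
    tail = (1.0 - level) / 2.0
    return model.inv_cdf(tail), model.inv_cdf(1.0 - tail)


# Simulated observations; the true mean 10 and standard deviation 2 are assumed.
rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(500)]

model = fit_normal(data)
low, high = central_interval(model)
print(f"fitted mean={model.mean:.2f}, sd={model.stdev:.2f}")
print(f"95% of new observations expected in [{low:.2f}, {high:.2f}]")
```

The point of the sketch is that every statement it produces (the estimates and the interval) is derived from an explicitly stated probability distribution for the observations.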
This contrasts with data mining, where a 'probability distribution' is not necessarily explicitly defined or described. Data mining is the process of finding patterns ('mining') in large data sets and applying them (e.g. in artificial intelligence applications). The algorithms used in data mining do not necessarily take into account any assumptions about the probability distribution of the sampled observations.
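For contrast, here is a minimal sketch of a pattern-finding routine in the data mining spirit: it counts which pairs of items occur together in at least `min_support` transactions. The toy transactions, the threshold, and the function name `frequent_pairs` are hypothetical. Nothing in the code states or assumes a probability distribution for the data.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count co-occurring item pairs and keep those seen often enough."""
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Toy data set, made up for illustration.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["eggs", "milk"],
    ["bread", "butter"],
    ["bread", "butter", "milk"],
]

pairs = frequent_pairs(transactions, min_support=3)
print(pairs)
```

With this toy data, the pairs (bread, butter) and (bread, milk) each appear in 3 of the 5 baskets and are reported as patterns. The result is a purely descriptive count, and turning it into a statement about uncertainty would require adding a statistical model.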
