Data mining uses methods from artificial intelligence in a database context to discover previously unknown patterns. As such, the methods are usually unsupervised. It is closely related to, but not identical to, machine learning. Key tasks of data mining are cluster analysis, outlier detection, and mining of association rules.
Questions tagged [data-mining]
1173 questions
419 votes, 5 answers
How to understand the drawbacks of K-means
K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just apply this algorithm which minimizes the sum of…

KevinKim (6,347 reputation)
219 votes, 13 answers
What is the difference between data mining, statistics, machine learning and AI?
Would it be accurate to say that they are 4 fields attempting to solve very similar problems but with different approaches? What exactly do they have in common and…

Olivier Lalonde (121 reputation)
169 votes, 4 answers
Cohen's kappa in plain English
I am reading a data mining book that mentions the kappa statistic as a means of evaluating the prediction performance of classifiers. However, I just can't understand it. I also checked Wikipedia, but that didn't help either:…

Jack Twain (7,781 reputation)
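For readers skimming this listing, here is a minimal sketch of the statistic the question above asks about. It assumes the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the marginal class frequencies. The function name `cohens_kappa` and the toy labels are hypothetical and are not taken from the question or its answers.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both label sets agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each class, the product of the two marginal
    # frequencies, summed over classes.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data: true classes vs. a classifier's predictions.
actual = ["yes", "yes", "no", "no"]
predicted = ["yes", "no", "no", "no"]
kappa = cohens_kappa(actual, predicted)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.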
139 votes, 9 answers
Obtaining knowledge from a random forest
Random forests are considered to be black boxes, but recently I have been wondering what knowledge can be obtained from a random forest.
The most obvious thing is the importance of the variables; in the simplest variant it can be done just by calculating…

Tomek Tarczynski (3,854 reputation)
90 votes, 7 answers
Euclidean distance is usually not good for sparse data (and more general case)?
I have seen somewhere that classical distances (like Euclidean distance) become weakly discriminant when we have multidimensional and sparse data. Why? Do you have an example of two sparse data vectors where the Euclidean distance does not perform…

shn (2,479 reputation)
80 votes, 9 answers
Skills hard to find in machine learners?
It seems that data mining and machine learning have become so popular that almost every CS student now knows about classifiers, clustering, statistical NLP, etc. So finding data miners does not seem to be hard nowadays.
My question is:
What…

Jack Twain (7,781 reputation)
73 votes, 11 answers
Having a job in data-mining without a PhD
I've been very interested in data-mining and machine-learning for a while, partly because I majored in that area at school, but also because I am truly much more excited trying to solve problems that require a bit more thought than just programming…

Charles Menguy (2,277 reputation)
71 votes, 2 answers
Performance metrics to evaluate unsupervised learning
With respect to unsupervised learning (like clustering), are there any metrics to evaluate performance?

user3125 (2,617 reputation)
65 votes, 2 answers
Why only three partitions? (training, validation, test)
When you are trying to fit models to a large dataset, the common advice is to partition the data into three parts: the training, validation, and test dataset.
This is because the models usually have three "levels" of parameters: the first…

charles.y.zheng (7,346 reputation)
62 votes, 12 answers
Software needed to scrape data from graph
Does anybody have any experience with software (preferably free, preferably open source) that will take an image of data plotted on Cartesian coordinates (a standard, everyday plot) and extract the coordinates of the points plotted on the…

Alex Holcombe (519 reputation)
61 votes, 3 answers
Clustering with K-Means and EM: how are they related?
I have studied algorithms for clustering data (unsupervised learning): EM, and k-means.
I keep reading the following:
"k-means is a variant of EM, with the assumptions that clusters are spherical."
Can somebody explain the above sentence? I do…

Myna (753 reputation)
55 votes, 8 answers
Is sampling relevant in the time of 'big data'?
Or more so, "will it be"? Big Data makes statistics and relevant knowledge all the more important but seems to underplay sampling theory.
I've seen the hype around 'Big Data' and can't help but wonder why I would want to analyze everything?…

PhD (13,429 reputation)
54 votes, 3 answers
Do we have a problem of "pity upvotes"?
I know, this may sound like it is off-topic, but hear me out.
At Stack Overflow and here we get votes on posts; this is all stored in tabular form.
E.g.:
post id  voter id  vote type  datetime
-------  --------  ---------  …

Sam Saffron (619 reputation)
47 votes, 5 answers
Lift measure in data mining
I searched many websites to find out what exactly lift does. The results I found were all about using it in applications, not about the measure itself.
I know about the support and confidence functions. According to Wikipedia, in data mining, lift is a measure of the…

Nickool (625 reputation)
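As a quick illustration of the measure the question above asks about, the sketch below computes lift for an association rule A -> B under the usual definition lift = support(A and B) / (support(A) * support(B)), which equals confidence(A -> B) / support(B). The helper names `support` and `lift` and the toy baskets are hypothetical and are not taken from the question or its answers.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent."""
    s_a = support(transactions, antecedent)
    s_b = support(transactions, consequent)
    if s_a == 0 or s_b == 0:
        raise ValueError("lift is undefined when either side has zero support")
    # Support of the two itemsets occurring together.
    s_ab = support(transactions, set(antecedent) | set(consequent))
    return s_ab / (s_a * s_b)


# Hypothetical toy baskets.
baskets = [
    {"bread", "butter"},
    {"bread"},
    {"butter"},
    {"bread", "butter"},
]
rule_lift = lift(baskets, {"bread"}, {"butter"})
```

A lift of 1 means the two itemsets occur together exactly as often as expected under independence; values above 1 indicate positive association and values below 1 indicate negative association.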
44 votes, 2 answers
How to interpret the output of the summary method for an lm object in R?
I am using sample algae data to understand data mining a bit more. I have used the following commands:
data(algae)
algae <- algae[-manyNAs(algae),]
clean.algae <- knnImputation(algae, k = 10)
lm.a1 <- lm(a1 ~ ., data = clean.algae[,…

godzilla (593 reputation)