Highest Voted 'data-mining' Questions - Statistical Analysis Stack Exchange

419

votes

5 answers

How to understand the drawbacks of K-means

K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just apply this algorithm which minimizes the sum of…

asked Jan 16 '15 at 04:38

KevinKim

6,347
4
21
35

219

votes

13 answers

What is the difference between data mining, statistics, machine learning and AI?

What is the difference between data mining, statistics, machine learning and AI? Would it be accurate to say that they are 4 fields attempting to solve very similar problems but with different approaches? What exactly do they have in common and…

machine-learning data-mining

asked Nov 30 '10 at 11:26

Olivier Lalonde

121
3
3
5

169

votes

4 answers

Cohen's kappa in plain English

I am reading a data mining book and it mentioned the Kappa statistic as a means for evaluating the prediction performance of classifiers. However, I just can't understand this. I also checked Wikipedia, but it didn't help either:…

classification data-mining cohens-kappa

asked Jan 13 '14 at 19:14

Jack Twain

7,781
14
48
74
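As a rough illustration for the Cohen's kappa question above: kappa compares the observed agreement between a classifier's predictions and the true labels with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal plain-Python version; the function name `cohens_kappa` and the example labels are made up for illustration and do not come from the question.

```python
from collections import Counter


def cohens_kappa(actual, predicted):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(actual)

    # Observed agreement: fraction of items where both labelings match.
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n

    # Chance agreement: for each class, the product of its marginal
    # frequencies in the two labelings, summed over classes.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    p_e = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)

    if p_e == 1.0:
        # Both labelings use a single identical class; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels: 3 of 4 predictions match, chance agreement is 0.5.
actual = ["yes", "yes", "no", "no"]
predicted = ["yes", "no", "no", "no"]
print(cohens_kappa(actual, predicted))  # 0.5
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, which is why kappa can be more informative than raw accuracy when the classes are imbalanced.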

139

votes

9 answers

Obtaining knowledge from a random forest

Random forests are considered to be black boxes, but recently I was wondering what knowledge can be obtained from a random forest. The most obvious thing is the importance of the variables; in the simplest variant it can be done just by calculating…

machine-learning data-mining interaction random-forest cart

asked Jan 16 '12 at 11:09

Tomek Tarczynski

3,854
7
29
37

90

votes

7 answers

Euclidean distance is usually not good for sparse data (and more general case)?

I have seen somewhere that classical distances (like Euclidean distance) become weakly discriminant when we have multidimensional and sparse data. Why? Do you have an example of two sparse data vectors where the Euclidean distance does not perform…

machine-learning clustering data-mining sparse euclidean

asked Jun 01 '12 at 13:55

shn

2,479
9
31
38

80

votes

9 answers

Skills hard to find in machine learners?

It seems that data mining and machine learning have become so popular that now almost every CS student knows about classifiers, clustering, statistical NLP, etc. So it seems that finding data miners is not a hard thing nowadays. My question is: What…

machine-learning data-mining

asked Jun 24 '14 at 07:11

Jack Twain

7,781
14
48
74

73

votes

11 answers

Having a job in data-mining without a PhD

I've been very interested in data-mining and machine-learning for a while, partly because I majored in that area at school, but also because I am truly much more excited trying to solve problems that require a bit more thought than just programming…

machine-learning data-mining careers phd

asked May 01 '12 at 23:39

Charles Menguy

2,277
4
15
16

71

votes

2 answers

Performance metrics to evaluate unsupervised learning

With respect to unsupervised learning (such as clustering), are there any metrics to evaluate performance?

machine-learning clustering data-mining unsupervised-learning

asked Dec 09 '13 at 03:00

user3125

2,617
4
25
33

65

votes

2 answers

Why only three partitions? (training, validation, test)

When you are trying to fit models to a large dataset, the common advice is to partition the data into three parts: the training, validation, and test datasets. This is because the models usually have three "levels" of parameters: the first…

machine-learning model-selection data-mining

asked Apr 08 '11 at 14:45

charles.y.zheng

7,346
2
28
32

62

votes

12 answers

Software needed to scrape data from graph

Anybody have any experience with software (preferably free, preferably open source) that will take an image of data plotted on cartesian coordinates (a standard, everyday plot) and extract the coordinates of the points plotted on the…

data-visualization data-mining software

asked Aug 18 '11 at 04:14

Alex Holcombe

519
1
7
9

61

votes

3 answers

Clustering with K-Means and EM: how are they related?

I have studied algorithms for clustering data (unsupervised learning): EM and k-means. I keep reading the following: k-means is a variant of EM, with the assumption that clusters are spherical. Can somebody explain the above sentence? I do…

machine-learning clustering data-mining k-means expectation-maximization

asked Nov 18 '13 at 11:47

Myna

753
1
6
6

55

votes

8 answers

Is sampling relevant in the time of 'big data'?

Or more so, "will it be"? Big Data makes statistics and relevant knowledge all the more important but seems to underplay Sampling Theory. I've seen this hype around 'Big Data' and can't help but wonder "why" I would want to analyze everything?…

sampling data-mining large-data

asked Sep 09 '12 at 19:58

PhD

13,429
19
45
47

54

votes

3 answers

Do we have a problem of "pity upvotes"?

I know, this may sound like it is off-topic, but hear me out. At Stack Overflow and here we get votes on posts, this is all stored in a tabular form. E.g.:

    post id   voter id   vote type   datetime
    -------   --------   ---------   …

time-series hypothesis-testing data-mining markov-process censoring

asked Jun 01 '11 at 01:57

Sam Saffron

619
4
7

47

votes

5 answers

Lift measure in data mining

I searched many websites to find out what exactly lift does. The results I found were all about using it in applications, not about the measure itself. I know about the support and confidence functions. From Wikipedia, in data mining, lift is a measure of the…

data-mining

asked Oct 17 '11 at 14:53

Nickool

625
1
6
7
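To make the lift question above concrete: for an association rule A → B, lift is commonly defined as support(A and B) / (support(A) × support(B)), which equals confidence(A → B) / support(B). The sketch below computes it in plain Python under that definition; the helper names `support` and `lift` and the toy transactions are hypothetical and not taken from the question.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    s_a = support(transactions, antecedent)
    s_b = support(transactions, consequent)
    if s_a == 0 or s_b == 0:
        raise ValueError("lift is undefined when either side never occurs")
    # Ratio of the observed co-occurrence to what independence would predict.
    return joint / (s_a * s_b)


# Hypothetical market-basket data.
baskets = [
    {"bread", "butter"},
    {"bread"},
    {"butter"},
    {"bread", "butter"},
]
print(lift(baskets, {"bread"}, {"butter"}))
```

A lift of 1 means the two itemsets occur together exactly as often as independence would predict; values above 1 suggest a positive association and values below 1 a negative one.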

44

votes

2 answers

How to interpret the output of the summary method for an lm object in R?

I am using sample algae data to understand data mining a bit more. I have used the following commands:

    data(algae)
    algae <- algae[-manyNAs(algae),]
    clean.algae <- knnImputation(algae, k = 10)
    lm.a1 <- lm(a1 ~ ., data = clean.algae[,…

r regression data-mining

asked May 17 '13 at 00:02

godzilla

593
2
6
8
