
I have a simple two-dimensional dataset with columns X1, X2, and [outcome], and I want to try KNN (probably with K around 100 or 1000, though ideally I'd pick K by cross-validation). The problem is that my dataset has a couple million rows.

Are there any preferred packages/approaches for dealing with this sort of thing?

DavidShor
  • You could try "aggregating" the data set, so that you have 4 columns [X1 value] [X2 value] [outcome] [count] where [count] is the number of times the specific triple occurs in those couple of million rows. – probabilityislogic Jul 30 '14 at 21:39
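A hedged sketch of that aggregation, assuming pandas and that the data fits in memory; the file and column names are placeholders, and this only pays off if the (X1, X2) pairs actually repeat (e.g. discretised features):

```python
# Collapse duplicate (X1, X2, outcome) triples into counts.
# "data.csv" and the column names are placeholders for the actual data.
import pandas as pd

df = pd.read_csv("data.csv")  # columns: X1, X2, outcome
agg = (df.groupby(["X1", "X2", "outcome"])
         .size()
         .reset_index(name="count"))
```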

2 Answers


A naive nearest neighbor implementation will have to compute the distances between your test example and every instance in the training set. This $O(n)$ process can be problematic if you have a lot of data.

One solution is to find a more efficient representation of the training data. "Space-partitioning" data structures organize points in a way that makes it possible to search through them efficiently. Using a $k$-d tree, one can find a point's nearest neighbor in $O(\log n)$ time on average instead, which is a substantial speed-up; this works particularly well for low-dimensional data like yours.
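
For example, here is a minimal sketch using scikit-learn's `KNeighborsClassifier` with a k-d tree backend; the random arrays below stand in for your X1, X2, and outcome columns:

```python
# Minimal sketch of KNN backed by a k-d tree, assuming scikit-learn.
# The synthetic X and y stand in for the asker's real data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 2))       # stand-in for the X1, X2 columns
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in outcome

clf = KNeighborsClassifier(n_neighbors=100, algorithm="kd_tree", n_jobs=-1)
clf.fit(X, y)
pred = clf.predict(X[:5])                 # queries go through the tree, not brute force
```

Since you also want cross-validation over K, something like `sklearn.model_selection.GridSearchCV` can wrap this classifier to search over `n_neighbors`, though each fit/predict cycle on millions of rows will still take a while.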

There are also approximate nearest neighbor algorithms, such as locality-sensitive hashing or the best-bin-first algorithm. These are approximations: they don't always find the exact nearest neighbor, but they often find something very close to it (which is probably just as good for classification).
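
These approximate methods are available in off-the-shelf libraries. Here is a hedged sketch using the Annoy library (https://github.com/spotify/annoy), reusing the `X` and `y` arrays from the sketch above; the query point is hypothetical:

```python
# Approximate nearest neighbors with Annoy; assumes X and y as above.
import numpy as np
from annoy import AnnoyIndex

index = AnnoyIndex(2, "euclidean")      # 2 = dimensionality of the data
for i, row in enumerate(X):
    index.add_item(i, row.tolist())
index.build(10)                         # 10 trees; more trees -> better recall

# Classify one query point by majority vote over ~100 approximate neighbors.
neighbor_ids = index.get_nns_by_vector([0.1, -0.2], 100)
pred = np.bincount(y[neighbor_ids]).argmax()
```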

Finally, if you've got a relatively fixed training set, you could compute something like a Voronoi diagram that indicates the neighborhoods around each point. This could then be used as a look-up table for future queries.
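
A hedged sketch of that idea: rather than a literal Voronoi diagram, one can precompute the classifier's decision over a fine 2-D grid (the resolution below is an assumption) and answer future queries with a constant-time table lookup. It reuses `X` and `clf` from the first sketch:

```python
# Rasterize the decision regions onto a grid, then answer queries by lookup.
# Assumes X and the fitted clf from the k-d tree sketch above.
import numpy as np

res = 1000                                     # grid resolution (assumption)
x_edges = np.linspace(X[:, 0].min(), X[:, 0].max(), res)
y_edges = np.linspace(X[:, 1].min(), X[:, 1].max(), res)
gx, gy = np.meshgrid(x_edges, y_edges)
lookup = clf.predict(np.column_stack([gx.ravel(), gy.ravel()])).reshape(res, res)

def query(x1, x2):
    # Map a query point to the nearest grid cell and read off the label.
    i = np.clip(np.searchsorted(y_edges, x2), 0, res - 1)
    j = np.clip(np.searchsorted(x_edges, x1), 0, res - 1)
    return lookup[i, j]
```

Building the table costs one KNN prediction per grid cell up front, but after that each query is a single array indexing operation.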


Matt Krause

Depending on your task and your data, you might not need that many nearest neighbours. Since your data is two-dimensional, I would first plot it to explore what it looks like. Is it linearly separable?

Otherwise, you could try the following: sample a small portion of the data at random, say 10%. Classify your data with that sample and keep only those points which could not be correctly classified.

Then take a new sample from those and repeat the process until you reach a satisfactory accuracy (a sketch of the loop follows below). You may end up needing only a small proportion of the data for kNN.
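
A minimal sketch of this loop, assuming scikit-learn and that numpy arrays `X` (features) and `y` (labels) are already loaded; the 10% rate comes from the answer, while the number of rounds, `n_neighbors`, and the accuracy threshold are assumptions:

```python
# Iteratively grow a small kNN training set from misclassified points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Start from a random 10% subsample (the rate suggested above).
idx = rng.choice(len(X), size=len(X) // 10, replace=False)
train_X, train_y = X[idx], y[idx]

for _ in range(5):                        # assumed cap on refinement rounds
    clf = KNeighborsClassifier(n_neighbors=15, n_jobs=-1).fit(train_X, train_y)
    wrong = clf.predict(X) != y           # points the current model gets wrong
    if 1 - wrong.mean() >= 0.99:          # assumed "satisfactory accuracy"
        break
    # Draw the next sample from the misclassified points and add it in.
    hard = np.flatnonzero(wrong)
    extra = rng.choice(hard, size=min(len(hard), len(X) // 100), replace=False)
    train_X = np.vstack([train_X, X[extra]])
    train_y = np.concatenate([train_y, y[extra]])
```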

jpmuc