3

How do I handle the sparse data problem in unsupervised learning? I'm going to use k-means on the dataset. I have 200 variables, and nearly every column is about 70% zeros. How can I handle this without discarding any columns?

Newbie

1 Answer

3

There are two aspects to this question that I can tease out. One is how to literally handle the data, store it and process it without a loss of information; the second is how to conduct unsupervised learning on such a dataset.

1) Handling the dataset

The dataset you are describing is a sparse dataset. One way to reduce the size of the dataset files without losing any information is to use a sparse file format. For example, an ARFF file can be stored in either dense or sparse format.

From the documentation, the header information is the same between the two formats. A dense representation looks like this:

0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

While a sparse representation of the same data looks like this:

{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
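As a rough illustration of the conversion, here is a minimal Python sketch that turns the two dense rows above into their sparse form. The function name `dense_to_sparse_row` is hypothetical, and the sketch assumes each value is already formatted as it should appear in the file (quoted where needed) and that the string "0" marks an entry to omit. It is not a full ARFF writer.

```python
def dense_to_sparse_row(values):
    """Convert one dense data row to sparse ARFF-style notation.

    `values` is a list of already-formatted strings (assumption).
    Entries equal to "0" are dropped; the rest are written as
    "<0-based attribute index> <value>" pairs inside braces.
    """
    pairs = [f"{index} {value}" for index, value in enumerate(values) if value != "0"]
    return "{" + ", ".join(pairs) + "}"


# The two dense rows from the example above.
dense_rows = [
    ["0", "X", "0", "Y", '"class A"'],
    ["0", "0", "W", "0", '"class B"'],
]

sparse_rows = [dense_to_sparse_row(row) for row in dense_rows]
for line in sparse_rows:
    print(line)
```

With about 70% zeros per column, each sparse row lists only the remaining 30% or so of the entries, and no information is lost because the omitted positions are implied to be zero.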

2) Conducting Unsupervised Learning

Beyond the question of handling the data, there is also the question of whether or not you can get usable results from your process. This is a course in and of itself, but some things to think about include:

A) The curse of dimensionality (see the Encyclopedia of Machine Learning). At high dimensionality, our intuition of the Euclidean distance and what it means for objects to be "similar" or "dissimilar" fails. This is especially relevant if you are using an approach like k-means as your unsupervised learning algorithm;

B) Dimensionality reduction. There are many ways to try to reduce the dimensionality of your data: feature selection, feature construction, and principal component analysis come to mind; and

C) Choice of algorithm. Is there a specific reason you are using k-means as your unsupervised learning algorithm? There are other unsupervised learning approaches which may be a better match for your data's characteristics. I am limited in the number of links I can post, but a number of surveys of clustering algorithms have been published, and which one is most relevant depends on your specific problem.
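To make point A a little more concrete, here is a small sketch using only the Python standard library. It generates synthetic points in which each coordinate is zero with probability 0.7 (mimicking the sparsity described in the question) and measures how spread out the Euclidean distances from one query point are, relative to their mean. The data, the helper names `sparse_point` and `relative_contrast`, and the contrast measure itself are all illustrative assumptions, not something taken from the question's dataset.

```python
import math
import random


def sparse_point(dim, rng, zero_frac=0.7):
    """Random point whose coordinates are 0 with probability zero_frac (assumed)."""
    return [0.0 if rng.random() < zero_frac else rng.random() for _ in range(dim)]


def relative_contrast(dim, n_points=500, seed=0):
    """Spread of distances from one query point, relative to their mean.

    Returns (max - min) / mean of the Euclidean distances between a
    random query point and n_points other random points.
    """
    rng = random.Random(seed)
    query = sparse_point(dim, rng)
    dists = [math.dist(query, sparse_point(dim, rng)) for _ in range(n_points)]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))


# Compare a low-dimensional case with the 200 variables from the question.
contrasts = {dim: relative_contrast(dim) for dim in (2, 20, 200)}
for dim, contrast in contrasts.items():
    print(f"{dim:>3} dimensions: relative contrast = {contrast:.2f}")
```

On synthetic data like this, the contrast at 200 dimensions should come out well below the contrast at 2 dimensions: the nearest and farthest points end up at nearly the same distance, which is why distance-based methods such as k-means can struggle to separate clusters without some form of dimensionality reduction.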

user77876
  • Thank you. Can you suggest any other algorithm for this kind of data? And will dimensionality reduction solve my problem? – Newbie Aug 03 '17 at 10:11