I got some sparse data for the first time and it's quite intimidating. After reading the sklearn preprocessing docs, it seems I should scale it with MaxAbsScaler (the sparseness is important). However, what other recommendations are there for this kind of problem? Is it possible to use PCA or some other kind of feature selection on this kind of data?

1 Answer

Your question is itself relatively sparse, so not much can be said. But generally:

  1. If you have an enormous volume of sparse data, you want to avoid any operation (such as mean-centering) that could disrupt sparsity. This is one reason MaxAbsScaler could be helpful: it can be computed without disrupting sparsity, and applying it also doesn't disrupt sparsity.

  2. MaxAbsScaler scales each feature by its largest absolute value (which is cheap to compute). Likely for numerical reasons, the scipy.sparse authors do not implement standard deviation methods for sparse data types. If you disregard numerical concerns, you could scale by standard deviation instead, as long as you skip the centering step; scikit-learn's StandardScaler accepts sparse input when with_mean=False, or you can write your own sparsity-respecting code.

  3. You don't want to use PCA, since it requires centering your data (destroying sparsity). But you can still access an orthogonal basis representation of your data using sparse SVD.

  4. PCA isn't feature selection; it just rotates the data.

  5. MaxAbsScaler isn't feature selection either; it just scales the data.
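
To make points 1 to 3 concrete, here is a minimal sketch, assuming scikit-learn and SciPy are installed. The toy matrix, its shape, its density, and the choice of 10 components are all made up for illustration; `TruncatedSVD` is used here as one readily available way to do a sparse SVD, not necessarily the only one.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Hypothetical toy data: 1000 rows, 200 columns, about 1% nonzero entries.
X = sparse.random(1000, 200, density=0.01, format="csr", random_state=0)

# MaxAbsScaler divides each column by its largest absolute value,
# so zeros stay zero and the result is still a sparse matrix.
X_scaled = MaxAbsScaler().fit_transform(X)

# Scaling by standard deviation without centering also keeps zeros at zero.
X_std = StandardScaler(with_mean=False).fit_transform(X)

# Truncated SVD operates on the sparse matrix directly, with no centering step.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_scaled)

print(sparse.issparse(X_scaled), X_scaled.nnz == X.nnz)
print(X_reduced.shape)
```

Note that the output of `TruncatedSVD` is a dense array, which is usually fine because it has far fewer columns than the input.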

Sycorax
  • what do you mean by "disrupting sparsity"? – Aksakal Mar 29 '18 at 17:43
  • @Aksakal Suppose you have a vector that's mostly 0, but 1% of elements are nonzero. A sparse representation has roughly 1% the storage costs of the dense representation. Mean-centering the vector requires storing all the values, as none are zero, so there is only a slight difference between dense and sparse storage requirements (the sparse storage cost could even be higher!). Some specialized sparse arithmetic packages let you use a nonzero "implicit value" in place of the implicit zero used by `scipy.sparse`, but this is uncommon, so anything that breaks the "mostly 0" property is generally Bad News. – Sycorax Mar 29 '18 at 17:51
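
The storage argument in the comment above can be checked with a small sketch, assuming NumPy and SciPy. The vector length and density are hypothetical, and the byte counts depend on the value and index dtypes, so treat the numbers as rough.

```python
import numpy as np
from scipy import sparse

# Hypothetical vector: 100,000 entries, about 1% of them nonzero.
x = sparse.random(1, 100_000, density=0.01, format="csr", random_state=0)

# CSR storage holds the nonzero values, their column indices, and a row pointer.
sparse_bytes = x.data.nbytes + x.indices.nbytes + x.indptr.nbytes
dense_bytes = x.toarray().nbytes

# Subtracting the mean turns every former zero into -mean,
# so essentially no entry of the centered vector is zero.
centered = x.toarray() - x.mean()
nonzero_fraction = np.count_nonzero(centered) / centered.size

print(sparse_bytes, dense_bytes, nonzero_fraction)
```

Storing `centered` in a sparse format would need a value and an index for nearly every entry, which is why the sparse cost can exceed the dense one.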