I got some sparse data for the first time and it's quite intimidating. After reading the sklearn preprocessing docs, it seems I should scale it with MaxAbsScaler (the sparseness is important). However, what other recommendations are there for this kind of problem? Is it possible to use PCA or some other kind of feature selection on this kind of data?
-
What do you want to do with this data? What does the data look like (samples)? – Vladislavs Dovgalecs Mar 29 '18 at 17:21
-
What data have you got? What research questions are you asking? What hypotheses do you want to test? What are you trying to find out? – Peter Flom Mar 30 '18 at 11:52
1 Answer
Your question is, itself, relatively sparse, so not much can be said. But generally:

- If you have an enormous volume of sparse data, you want to avoid any operation (such as mean-centering) that could disrupt sparsity. This is one reason `MaxAbsScaler` could be helpful: it can be computed without disrupting sparsity, and applying it doesn't disrupt sparsity either. `MaxAbsScaler` scales by the largest absolute value (which is cheap to compute).
- Likely for numerical reasons, the `scipy.sparse` authors do not implement standard-deviation methods for sparse data types; if you disregard numerical concerns, you could scale by standard deviation (but you'd have to write your own sparsity-respecting code to do so).
- You don't want to use PCA, since it requires centering your data (destroying sparsity). But you can still obtain an orthogonal basis representation of your data using sparse SVD.
- `MaxAbsScaler` isn't feature selection, either; it just scales the data.
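A minimal sketch of the points above, assuming scikit-learn and SciPy are available (the matrix shape, density, and component count are arbitrary, chosen only for illustration): scale a sparse matrix with `MaxAbsScaler` without destroying sparsity, then use `TruncatedSVD`, which works on uncentered sparse input, in place of PCA.

```python
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler
from sklearn.decomposition import TruncatedSVD

# A sparse matrix with ~1% nonzero entries (a stand-in for real data).
X = sparse.random(10_000, 500, density=0.01, format="csr", random_state=0)

# MaxAbsScaler divides each column by its maximum absolute value;
# zeros stay zero, so the result is still sparse.
X_scaled = MaxAbsScaler().fit_transform(X)
print(sparse.issparse(X_scaled), X_scaled.nnz == X.nnz)  # True True

# Truncated SVD gives an orthogonal-basis representation without centering,
# so it can be applied directly to the sparse matrix (unlike PCA).
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_scaled)  # dense (10000, 50) array
print(X_reduced.shape)
```

Because `TruncatedSVD` skips the centering step, its components are not identical to PCA components; that is precisely the trade-off that lets it accept sparse input.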

-
@Aksakal Suppose you have a vector that's mostly 0, but 1% of elements are nonzero. A sparse representation has roughly 1% of the storage cost of the dense representation. Mean-centering the vector requires storing all the values, since none are zero afterwards, so there is only a slight difference between dense and sparse storage requirements (the sparse storage cost could even be higher!). Some specialized sparse arithmetic packages let you use a non-zero value as the "implicit value" (the role zero plays in `scipy.sparse`), but this is uncommon, so anything that breaks the "mostly 0" property is generally Bad News. – Sycorax Mar 29 '18 at 17:51
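A rough sketch of the storage argument in that comment (the shapes are arbitrary, and exact sizes depend on dtype and sparse format):

```python
from scipy import sparse

# ~1% of entries are nonzero.
X = sparse.random(10_000, 1_000, density=0.01, format="csr", random_state=0)

# CSR stores only the nonzero values plus two index arrays.
sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6
dense_mb = X.shape[0] * X.shape[1] * 8 / 1e6  # float64 dense equivalent

print(f"sparse (CSR): {sparse_mb:.1f} MB")  # roughly 1.2 MB
print(f"dense:        {dense_mb:.1f} MB")   # 80 MB

# Mean-centering makes essentially every entry nonzero, so the centered
# matrix can only be stored densely; the sparse advantage disappears.
X_centered = X.toarray() - X.toarray().mean(axis=0)
print(f"nonzero after centering: {(X_centered != 0).mean():.0%}")  # ~100%
```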