
This is more of a conceptual question than a methodological one, I guess.

Let's assume that we have a dataset coming from a questionnaire and that, after some feature scaling, we run a PCA to reduce the dimensionality. The results indicate that we need "a lot of" principal components (say, close to the original number of questions) in order to capture, say, 80% of the variability. What would that indicate? What are the most probable scenarios?

  • The questionnaire needs smarter questions so that more info is captured by fewer questions?
  • The population consists of very complicated individuals?
  • Some questions (let's say the ordinal ones) require more levels?

Edit: Some additional details about the project

The population we are investigating is all the existing customers of a specific platform. Our random sample consists of approximately 4.2K individuals from that population who answered a questionnaire of ~80 questions (behaviour, personality, preferences, etc.). The objective is to i) understand the persona/behavioural groups that exist in our customer database and ii) identify some "golden questions" so that we can classify future users without having to ask them all ~80 questions again. Most of these questions are ordinal and some of them are categorical.

I've already done an initial clustering using PAM with Gower's distance, but I wanted to dig deeper and try a few more approaches. My plan is to run hierarchical k-means clustering after a PCA and then try some SOMs as well. Afterwards, I plan to train a classification model to classify future users.
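For reference, a minimal sketch of the PAM step in Python (the third-party `gower` package and scikit-learn-extra's `KMedoids` are just one possible choice here; any PAM implementation that accepts a precomputed dissimilarity matrix would do):

```python
import pandas as pd
import gower                                 # third-party package for Gower dissimilarity (assumed available)
from sklearn_extra.cluster import KMedoids   # PAM-style k-medoids (assumed available)

# Mixed-type questionnaire responses, one row per respondent (toy stand-in for the real data)
df = pd.DataFrame({
    "q1_ordinal": [-3, -1, 0, 2, 3, 1],
    "q2_category": ["a", "b", "a", "c", "b", "a"],
})

# Pairwise Gower dissimilarities handle the ordinal/categorical mix
dist = gower.gower_matrix(df)

# PAM (k-medoids) on the precomputed dissimilarity matrix
pam = KMedoids(n_clusters=2, metric="precomputed", method="pam", random_state=0).fit(dist)
print(pam.labels_)
```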

When I ran the PCA within each category of questions (I think 8 in total), I saw that in most cases the "best" PC explained close to 12% of the variability, which struck me as low and made me curious. Hence the question.
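For what it's worth, here is the kind of explained-variance check I am talking about, sketched with scikit-learn on synthetic placeholder data (the real, standardized response matrix would go in place of `X`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the real (n_respondents x n_questions) numeric response matrix
rng = np.random.default_rng(0)
X = rng.integers(-3, 4, size=(4200, 80)).astype(float)

X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# How many components are needed to reach 80% of the total variance?
n_80 = int(np.searchsorted(cum_var, 0.80)) + 1
print(f"{n_80} of {X.shape[1]} components needed for 80% of the variance")
```

With independent columns like this synthetic placeholder, close to all 80 components are needed, which is the situation where the items are essentially uncorrelated.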

Vasilis Vasileiou
  • That simply means the items hardly correlate at all. Now, what does that mean to you as a questionnaire designer/user? Should items correlate or not, in your project? – ttnphns Jan 21 '19 at 17:33
  • If you are analyzing covariances, not correlations, uncorrelated data may still give different caliber PCs. See https://stats.stackexchange.com/q/92791/3277 – ttnphns Jan 21 '19 at 17:37
  • Use a nonlinear dimensionality reduction technique. Using PCA is ridiculous when you have a non-Euclidean manifold. – Matthieu Brucher Jan 21 '19 at 17:43
  • @MatthieuBrucher, Can you elaborate? With suggestions, I mean. – Narfanar Jan 21 '19 at 17:53
  • If you have a non-Euclidean manifold - data that cannot be represented as a plane (even with error) - then you should use something else. The typical example is Eigenfaces, which just creates wrong images compared to what can be done with an appropriate tool like autoencoders (for faces), ISOMAP, LLE... – Matthieu Brucher Jan 21 '19 at 17:56
  • @ttnphns questions are meant to capture respondents' behaviours, personalities, preferences etc. Intuitively, my guess would be that at least some of these variables should correlate, and I'm trying to identify the most likely reason why this is not the case. Should we re-design the questionnaire? Or is it just the way it is, and there is nothing I can do about it? – Vasilis Vasileiou Jan 21 '19 at 20:07
  • @MatthieuBrucher I know what you mean. However, imagine that 90% of these variables are ordinal and thus coded from -3 to 3, so it's not completely ridiculous to try PCA. My question is more conceptual, as I said; forget the methodological part for a moment. Even with PCA and Euclidean distances, why do you think this is happening? – Vasilis Vasileiou Jan 21 '19 at 20:10
  • If, as you write, it takes "close to the original number of questions" to account for $80\%$ of the variance, then you likely have a very small number of questions in the first place (depending on what "close to" might mean). Given the vagueness of this assumption, as well as the complete lack of information about the questionnaire or its subjects, you are asking us to supply a great deal of speculation based on exceptionally little information. Could you tell us what your problem *really* is? – whuber Jan 21 '19 at 20:12
  • @whuber makes sense. Let me edit my question and add a bit more context. – Vasilis Vasileiou Jan 21 '19 at 20:14

1 Answer


"After some feature scaling". Maybe that is a bad choice? But you did not give any details...

You may be emphasizing variability in your data where there wasn't any. Assume that on some question all but one user chose the same answer, say 2, while one odd user chose 3. If you naively apply standardization first, all the users who answered 2 will get a small negative value, and the one different user will get a very large positive value. PCA will then preserve this variable largely unchanged - but it never was meaningful.
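A tiny numeric illustration of this (synthetic numbers, just to show the scale of the effect):

```python
import numpy as np

# 4199 respondents answer "2" on some question, one respondent answers "3"
x = np.array([2.0] * 4199 + [3.0])

z = (x - x.mean()) / x.std()
print(z[0])    # each "2" answer: a tiny negative value (about -0.015)
print(z[-1])   # the single "3" answer: roughly +65 standard deviations
```

After standardization this nearly constant question carries one full unit of variance, just like every genuinely informative question, so PCA cannot down-weight it.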

The other issue, in particular with PCA after standardization, is resolution. Your input data only has a few levels, but that doesn't mean the "true" values aren't continuous. It's best to think of the recorded answers as having high uncertainty. If the true values weren't 0 and 1 but rather 0.49 and 0.51 rounded, your variance would be far smaller! By running PCA on such values you pretend they are high-resolution and on a linear scale (and research has shown that even 1-2-3-4-5 questionnaire values are often better not treated as numerical).
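Again a small synthetic illustration of how much rounding alone can inflate the variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" attitudes hovering around 0.5, recorded as 0 or 1
true_vals = rng.uniform(0.49, 0.51, size=10_000)
recorded = np.round(true_vals)

print(true_vals.var())   # about 3e-05
print(recorded.var())    # about 0.25 -- orders of magnitude larger than the true variance
```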

Has QUIT--Anony-Mousse