
I want to use PCA in this kind of situation. I have three variables:

  1. how many times something happened for a user (a positive integer);
  2. the total "power" of all events that happened for a user (a real number, which can be negative);
  3. the fraction of "successful hits" (a real number with 0 < x < 1).

Wikipedia states that "PCA is sensitive to the scaling of the variables."

The problem is that "power" can be measured in various units, and the choice of units will affect the results. I do not see a natural choice of units at the moment.
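To make the concern concrete, here is a minimal sketch using only the first two variables and made-up toy numbers (the data and the helper names `cov` and `first_pc` are hypothetical, not from any library). It computes the leading eigenvector of the 2x2 covariance matrix in closed form, once with "power" in some unit and once with the same values divided by 1000.

```python
from math import sqrt
from statistics import mean

def cov(a, b):
    """Sample covariance (n - 1 denominator) of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def first_pc(a, b):
    """Unit-length leading eigenvector of the 2x2 covariance matrix of (a, b)."""
    saa, sbb, sab = cov(a, a), cov(b, b), cov(a, b)
    if sab == 0:
        raise ValueError("sketch assumes the two variables are correlated")
    # Largest eigenvalue of [[saa, sab], [sab, sbb]].
    lam = (saa + sbb) / 2 + sqrt(((saa - sbb) / 2) ** 2 + sab ** 2)
    # (sab, lam - saa) solves the eigenvector equation when sab != 0.
    vx, vy = sab, lam - saa
    norm = sqrt(vx * vx + vy * vy)
    return vx / norm, vy / norm

# Hypothetical toy data, one entry per user.
counts = [3, 10, 1, 7]
power = [-2.5, 40.0, 0.3, 12.2]          # "power" in some unit
power_other = [p / 1000 for p in power]  # same quantity, unit 1000x larger

pc_units_a = first_pc(counts, power)
pc_units_b = first_pc(counts, power_other)
print(pc_units_a, pc_units_b)
```

With these toy numbers the first component is dominated by "power" in one unit and by the count in the other, even though the underlying data are the same.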

Are there any suggestions on how to scale observations for PCA?

Nick Cox
  • Did you read some advice for a [similar question](http://stats.stackexchange.com/q/12200/3277)? – ttnphns Feb 26 '13 at 15:35

1 Answer


You could first center the data by subtracting the mean of each column from that column, and then rescale the resulting values so that they fall within the interval [-1, 1].
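A minimal sketch of this, reading the rescaling as division by the range of each column (the interpretation given in the comments below). The toy data and the function name `center_and_range_scale` are hypothetical and only serve as illustration.

```python
from statistics import mean

def center_and_range_scale(column):
    """Map x -> (x - mean(x)) / (max(x) - min(x)) for one column of numbers."""
    m = mean(column)
    spread = max(column) - min(column)
    if spread == 0:
        raise ValueError("constant column: range is zero, cannot rescale")
    return [(x - m) / spread for x in column]

# Hypothetical toy data: one list per variable, one entry per user.
counts = [3, 10, 1, 7]               # how many times something happened
power = [-2.5, 40.0, 0.3, 12.2]      # total "power", arbitrary units
hit_rate = [0.20, 0.90, 0.50, 0.40]  # fraction of successful hits

scaled = [center_and_range_scale(col) for col in (counts, power, hit_rate)]
for col in scaled:
    print([round(v, 3) for v in col])
```

Each rescaled column has mean zero and a range of exactly one, so all values fall inside [-1, 1] regardless of the original units.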

jpmuc
  • Thank you for the answer! Yes, mean subtraction is necessary, but what is the appropriate scaling? Should I scale everything to [-1, 1], or only some of the variables? That is not so clear. Choosing a different scaling for different variables gives them different "importance". The appropriate "importance" should depend on the task, but I do not see the right one in my case... – Alexander Chervov Feb 26 '13 at 12:39
  • Although this recommendation might work for some datasets, it is exquisitely sensitive to any outliers that might exist, and so is not a good general procedure. – whuber Feb 26 '13 at 13:41
  • Very true! Otherwise, you could normalize the variance of each component individually to one. – jpmuc Mar 01 '13 at 16:01
  • @AlexanderChervov I meant it as $x \mapsto (x - \operatorname{mean}(x)) / (\max(x) - \min(x))$, so that the variable now lies in the interval $[-1, 1]$. – jpmuc Mar 01 '13 at 16:03
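The alternative mentioned in the comments, normalizing each variable to unit variance, can be sketched as follows. The toy data and the function name `standardize` are hypothetical. Running PCA on variables standardized this way corresponds to PCA on the correlation matrix rather than the covariance matrix.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a column and divide by its sample standard deviation."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("constant column: standard deviation is zero")
    return [(x - m) / s for x in column]

# Hypothetical "power" values, expressed in two different units.
power = [-2.5, 40.0, 0.3, 12.2]
power_other = [p * 1000 for p in power]

z_a = standardize(power)
z_b = standardize(power_other)
print([round(v, 3) for v in z_a])
print([round(v, 3) for v in z_b])
```

Both unit choices give the same standardized values up to floating-point rounding, so the choice of units for "power" no longer matters. Note that the standard deviation is itself affected by outliers, so this does not fully remove the concern raised above.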