
I have data for an entire population (N = 27). I would like to find out which variables (of many) have the greatest effect on one particular dependent variable, and how much of its variance they can explain. The candidate independent variables may also be correlated with one another (some of them certainly are). They are all scale-measured.

What do you think would be the best method to use and what should I pay attention to?

I'm limited to SPSS and rather sketchy statistical knowledge. I have been trying to get somewhere with linear regression, but I fail to achieve anything interpretable, and it crossed my mind that I might need something entirely different, since this is not a sample.

Thanks in advance!

EDIT: This might not be important for the question, but I have all the variables measured 5 times, once for each of 5 separate years. Later on I am planning to examine how the effect of a variable changes over time.

EDIT nr. 2: After the first replies it seems best to describe my database in detail: the 27 cases are 27 European countries, and the dependent variable is the percentage of their population that participated in demonstrations in that year. I also have lots of possible independent variables like GDP, unemployment, happiness, etc. I have all these values for 5 separate years. Basically I'm trying to find something I can write about in my thesis, like "GDP is the biggest factor and it's twice as big as happiness ... in these countries". The reason I'm saying it's the entire population is that I am not planning on drawing general conclusions. I wouldn't be able to do that anyway, as the countries aren't even representative of Europe.

Balazs
  • 1
    I am curious to hear what this population is : ) – Behacad Mar 21 '13 at 21:35
  • Although you might have an entire *population,* it is clear that you are thinking about your data as a *sample* of a phenomenon or process that exhibits variation: this puts you squarely back into the realm of statistical inference. – whuber Mar 21 '13 at 21:40
  • @whuber raises a key question, but I am not so clear as he is on what your answer is: Are you thinking of these 27 cases as a sample or a population? Would you be generalizing to a broader population? – Peter Flom Mar 21 '13 at 22:25
  • 1
    Having data for different years definitely makes a difference: you could then view your data as a sample from all possible years, and there would be variation over time. Having multiple years gives you more data and lets you build more complex models. – Peter Flom Mar 21 '13 at 22:27
  • Behacad: I have just edited the original post to detail it. whuber: I'm thinking of these cases as a population right now. If it doesn't seem like it, that's probably because I'm working with rather slim statistical knowledge. @Peter Flom No, I'm not planning on generalizing the results. I'm thinking of them as a population for now. – Balazs Mar 21 '13 at 23:25
  • To say that these data are a "population" is to say that you wish to limit your inferences to conditions in *those* countries in *those* years. Even then, it is worth entertaining the hypothetical question "suppose the *true* underlying relationships between GDP etc. and demonstration rates were not to change, but we could turn back the clock to the beginning of this study period and watch the world unfold in a parallel universe. To what extent would the *observed* values have been different?" Almost certainly different, because behaviors depend on so many incidental, unknowable things. – whuber Mar 22 '13 at 00:16

3 Answers

1

Sounds like random forests (ensembles of regression trees) are the perfect tool for you: you build a series of trees, then check each variable's importance.

I don't know much about SPSS, but if you are willing to use R (come on... you know you want to!), the caret package can do this with its train() function (specify method = "rf" and importance = TRUE). You can then view the importance of each variable with varImp(), which also gives you more control over how importance is computed. If you want to see what number of variables gives you the best results, you can use rfe() with rfFuncs in its control function.
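For illustration, here is a minimal sketch of that caret route. The data frame, column names, and the choice of 3 predictors are made up; substitute your own file and columns:

```r
# Toy data standing in for the 27-country file; replace `dat` with your own.
library(caret)          # provides train(), varImp(), rfe(), rfeControl()

set.seed(1)
dat <- data.frame(gdp = rnorm(27), unemp = rnorm(27), happy = rnorm(27))
dat$protest <- 0.5 * dat$gdp - 0.3 * dat$unemp + rnorm(27, sd = 0.1)

# Random forest via caret; importance = TRUE is passed on to randomForest().
fit <- train(x = dat[, 1:3], y = dat$protest,
             method = "rf", importance = TRUE)
print(varImp(fit))      # scaled importance of each predictor

# Recursive feature elimination: how many variables give the best fit?
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
sel  <- rfe(x = dat[, 1:3], y = dat$protest, sizes = 1:3, rfeControl = ctrl)
print(sel)
```

With only 27 rows the cross-validation estimates will be noisy, so treat the selected subset as a hint rather than a verdict.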

Caret can be a little difficult to wrap your head around, so if you just want to include all your variables, you can use this simpler (but less flexible) code from the randomForest package (which caret itself calls for method = "rf"):

require(randomForest)
df <- read.csv(file.choose())  # assuming your data is in a csv
# assuming the independent variables are in columns 1 to ?? and the response is in column ?? + 1
rf.fit <- randomForest(x = df[, 1:??], y = df[, ?? + 1], ntree = 500, importance = TRUE)
print(rf.fit$importance)  # importance of the variables
print(rf.fit$rsq)         # pseudo R-squared of the model

I would do this year by year, rather than include the year as a variable. Time will no doubt play a role in the regression, but I think it makes more sense to build a separate model for each year and then look for changes in variable importance over time, though 5 points aren't much to build a powerful time series with. Others might disagree, and I wouldn't mind hearing their opinions.
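That year-by-year idea can be sketched like this; the column names (`year`, `protest`, `gdp`, `unemp`) and the toy panel are assumptions, so adapt them to your file:

```r
# One forest per year, then compare importance side by side across years.
library(randomForest)

set.seed(1)
# Toy panel: 27 countries x 5 years; replace with your own data frame.
dat <- data.frame(year  = rep(2008:2012, each = 27),
                  gdp   = rnorm(135),
                  unemp = rnorm(135))
dat$protest <- 0.5 * dat$gdp - 0.3 * dat$unemp + rnorm(135, sd = 0.1)

imp_by_year <- sapply(split(dat, dat$year), function(d) {
  fit <- randomForest(protest ~ gdp + unemp, data = d,
                      ntree = 500, importance = TRUE)
  importance(fit)[, "%IncMSE"]   # permutation importance for that year
})
print(imp_by_year)               # rows = variables, columns = years
```

Reading across a row of `imp_by_year` then shows how one variable's importance moves over the 5 years.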

If you don't want to use R, I would update your question to flag it for SPSS users who might know how to implement random forests with variable importance there.

TLJ
  • Well I don't really want to get into R right now, but I will try to find something similar in SPSS and try it. Thanks for the tip! – Balazs Mar 22 '13 at 20:55
0

While it might be true that you have all of the possible outcomes of a phenomenon, it is more likely that you actually have specific realizations of that phenomenon. This is essentially the difference between a poker player saying "I was dealt a full house 10 hands in a row" and "I will always be dealt a full house in poker." In the first case, the random process of shuffling and dealing cards gave him a great hand. In the latter case, the set of possible outcomes is restricted so that he can only ever be dealt a full house.

I think it is much more likely that, while you have measured the only manifestations of the variables that exist, those measurements represent only random realizations of the underlying data-generating process. This means you're back in the realm of statistics: making inferences from limited samples.

I hope this helps.

Sycorax
  • I think I understand what you are saying but please reconsider after reading my second edit of the original post where I detail my data exactly. If you mean that the values of the variables in my data have a possible error then yes I get that, but I was thinking I wouldn't care about that for now, I'm just trying to get my head around on how to find the connections. I thought I would consider creating some kind of confidence interval afterwards. – Balazs Mar 21 '13 at 23:35
  • Population parameters (analogous to sample statistics) do not have confidence intervals. If you take the mean of a population, that is 100% for sure the population mean (as long as the input data are accurate). – TLJ Mar 22 '13 at 00:27
  • 1
    This is the essential disagreement that I have with how you've framed your problem. While it's physically true that only a specific number of biological humans engaged in protest in some years, I view that number itself as having been drawn from God's hat, and determined by some combination of causal variables and random chance. – Sycorax Mar 22 '13 at 00:44
  • +1 because I agree. This question also explains it further very well: http://stats.stackexchange.com/questions/2628/statistical-inference-when-the-sample-is-the-population – TLJ Mar 22 '13 at 00:50
  • Yes I understand, thanks for the useful link as well. However I'm not sure what this would mean in practice. I suppose you are saying I should consider the data a sample, but what would significance, etc. mean then? – Balazs Mar 22 '13 at 10:49
  • Significance, confidence intervals and the like would all take on their usual meanings for scenarios in which you are dealing with a sample. You might find it helpful to read an introductory statistics textbook. I found [this one](http://www.amazon.com/Statistical-Methods-Social-Sciences-Edition/dp/0205646417/ref=sr_1_1?ie=UTF8&qid=1363957688&sr=8-1&keywords=Statistical+Methods+for+the+Social+Sciences) useful as an undergrad. – Sycorax Mar 22 '13 at 13:08
-1

You could compute a correlation matrix to find out which of your variables are linked with the dependent variable. Doing this, you will also see the correlations among the predictors themselves.
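In R that is a one-liner; the data frame and column names below are made up for illustration:

```r
# Correlation of each candidate predictor with the response, plus the full
# matrix; toy data stands in for the real 27-country file.
set.seed(1)
dat <- data.frame(protest = rnorm(27), gdp = rnorm(27), unemp = rnorm(27))

cm <- cor(dat, use = "pairwise.complete.obs")
print(round(cm["protest", ], 2))  # each predictor vs. the dependent variable
print(round(cm, 2))               # mutual correlations among all variables
```

As the comment below notes, though, pairwise correlations can be badly misleading when the predictors are correlated with each other, so treat this only as a first look.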

cesko80
  • 1
    This won't do, because mutual correlations can (and often do) mask or even completely alter the apparent relationships. These issues are extensively discussed on this site: look at threads concerning *multiple regression.* – whuber Mar 21 '13 at 21:38
  • I think whuber is right, but I'll still take a look thank you! – Balazs Mar 21 '13 at 23:36