How can I test the relevance of variables on a dependent variable when I have data for an entire population?

Question

I am investigating factors that influence how certain areas have intensified over time in a city. There are 25 such areas, which together make up every area of this type in the city. My study considers only this one city.

I was planning to use Spearman correlations to determine whether there is a relationship between the areas' scores for their degree of intensification between two date ranges and a range of possible explanatory variables related to the nature of the areas at the start of the date range. I was hoping to use the strength of the relationships, laid out in a matrix of correlations, to determine which factors are the most important. However, from reading on this site, it appears that this is not an ideal approach.
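For concreteness, here is a minimal sketch of the correlation ranking I had in mind, in plain Python. The data are made up, and the factor names (`density_start`, `transit_rank`, `lot_size`) are hypothetical placeholders rather than variables from my study.

```python
import random

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up data for 25 areas (assumption: illustrative only).
rng = random.Random(0)
n_areas = 25
score = [rng.gauss(0, 1) for _ in range(n_areas)]
factors = {
    "density_start": [s + rng.gauss(0, 0.5) for s in score],   # related to score
    "transit_rank": [rng.randint(1, 5) for _ in range(n_areas)],  # ordinal, unrelated
    "lot_size": [rng.expovariate(1.0) for _ in range(n_areas)],   # skewed, unrelated
}

# Describe each factor by its rho with the score, strongest first.
rhos = {name: spearman(vals, score) for name, vals in factors.items()}
for name, rho in sorted(rhos.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:15s} rho = {rho:+.2f}")
```

This only describes each pairwise relationship in isolation; it says nothing about how the factors overlap with one another.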

What would be a good approach to address this? I'm just explaining what happened here, and don't intend to imply causation.

My dependent variable is an intensification score. It's continuous and approximately normally distributed, with no skew.

My independent variables are a mix of continuous (many aren't normally distributed) and ordinal. I've got lots of them (approx. 100), but can reduce them to only the most sensible.

Note: I have seen this post: Statistical inference when the sample "is" the population. I'm beginning to understand some of the concepts there, but it's not giving me an answer to my fundamental question above, i.e. how can I do this population-based analysis?

If this is truly a "population based analysis" then your sole objective is to *describe* the data: there are no inferences to make at all. Are you sure about this? If so, then indeed the "relationship" you wish to "determine" is there right in the data and all you have to do is look at it or characterize it. — whuber, Feb 13 '19 at 13:08
Thanks @whuber. Well, yes. It's an evaluation of policy on the 25 centres. I am examining all 25 centres. So I think it's a population. I can look at correlations between my variable of interest and other factors and describe them. But I'd also like a way to determine which variables are having the greatest effect. I understand that this is what regression can tell me, but I'm not sure if it's valid to run a regression model with so few cases. I've done so, and I've got a model that works really nicely with an adjusted R² of 0.901. — MelbourneJoe, Feb 13 '19 at 23:14
You don't need to do any modeling at all. The only thing you need to do is decide how to quantify the effects you are interested in and then report on what those values are. I really doubt this will be appropriate, though, because it is such a limited investigation, and so I would encourage you to rethink the question of what your objectives are and what really should be considered the population of interest. — whuber, Feb 14 '19 at 16:18
@whuber, the research evaluates the application of a specific policy to one city, over 20 years. There is no other population as it's essentially a case study, and nowhere else has the same policy. I'm happy to not do any modelling. But I need to see which of the possible explanatory factors has the strongest relationship to the observed changes. It's entirely descriptive. I thought multiple regression would be a way to do this, but I'm open to other ideas too. — MelbourneJoe, Feb 14 '19 at 23:34
I came across [this question](https://stats.stackexchange.com/questions/56141/explanation-of-minimum-observations-for-multiple-regression) which is exactly the same as what I need to know (you actually replied to it). In the OP's response to the first answer, he succinctly explains what I'm trying to do, but it doesn't seem like this was really addressed by any of the responses (or perhaps I'm just not understanding them?). — MelbourneJoe, Feb 14 '19 at 23:35

score 0 · Answer 1 · answered Feb 19 '19 at 03:55

Since posting here I think I've worked out why regression isn't reliable for what I wanted to do. As I've seen others posting similar questions before, I thought I'd answer this to help them along.

I was playing around with inserting different variables into regression models and seeing which gave nicer adjusted R² values while still passing basic checks such as avoiding collinearity. Eventually I got a model with an adjusted R² of 1! Yep, the model explained it all. Of course, that's nonsense, and at that moment I realised I'd just been selecting input data to match the curve. With only 25 data points, such overfitting is a real problem: even small changes to the data points, or to which variables were included, produced large variations in the model coefficients.
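To illustrate the kind of overfitting I mean, here is a minimal sketch in plain Python. It is not the analysis I actually ran: the outcome and all 100 candidate predictors are pure random noise (an assumption made for the demonstration), and the greedy `forward_select` helper is a hypothetical stand-in for my trial-and-error variable picking.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def adjusted_r2(columns, y):
    """OLS of y on an intercept plus the given columns; returns adjusted R^2."""
    n, p = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

def forward_select(candidates, y, max_vars):
    """Greedily add whichever candidate most improves adjusted R^2."""
    chosen, history = [], []
    for _ in range(max_vars):
        best = max(
            (j for j in range(len(candidates)) if j not in chosen),
            key=lambda j: adjusted_r2([candidates[c] for c in chosen + [j]], y),
        )
        chosen.append(best)
        history.append(adjusted_r2([candidates[c] for c in chosen], y))
    return chosen, history

# 25 "areas", 100 candidate predictors, all unrelated noise by construction.
rng = random.Random(1)
n_areas, n_candidates = 25, 100
y = [rng.gauss(0, 1) for _ in range(n_areas)]
noise = [[rng.gauss(0, 1) for _ in range(n_areas)] for _ in range(n_candidates)]

chosen, history = forward_select(noise, y, max_vars=8)
for step, value in enumerate(history, start=1):
    print(f"{step} predictors: adjusted R^2 = {value:.3f}")
```

Because nothing here is actually related to the outcome, any high adjusted R² that the search reaches comes from the selection itself, which is the trap I fell into.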

So although having the whole population means the results are descriptive and statistical significance isn't particularly relevant, a small population combined with a somewhat exploratory approach will still lead to overfitting, which makes the results of regression unreliable.

Could be wrong, but that's my take on what I observed. Please correct if inaccurate.
