I am currently doing research and recently ran into a few statistical concerns. I am basing this question on the following post.

My question is quite similar to the above-mentioned one but has some important differences. I designed a survey to collect data from participants. The survey had 12 different items, each rated on a scale from 0 ("minor importance") to 10 ("major importance").

Before constructing my independent variables with these 12 items, I decided to get rid of the items that do not display a linear relationship with the dependent variable (the latter being a continuous value between 0 and 100).

After doing that, I was left with 10 items and decided to create 3 independent variables. Each of them is constructed from some underlying items, based on a logical link between the items (i.e. the items referring to volatility are put together, the ones referring to performance are put together, ...).

When constructing the variables, I decided to use the mean of the underlying items, for example performance = (item1 + item2)/2.
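To make the construction concrete, here is a minimal sketch of the averaging step. The response values and the item-to-variable grouping are made up for illustration; only the formula performance = (item1 + item2)/2 comes from the description above.

```python
import numpy as np

# Hypothetical responses: rows are respondents, columns are items scored 0-10.
responses = np.array([
    [7, 9, 3, 4],
    [5, 6, 8, 8],
    [10, 8, 2, 1],
])

# Assumed grouping: columns 0 and 1 form "performance", columns 2 and 3 "volatility".
groups = {"performance": [0, 1], "volatility": [2, 3]}

# Row-wise mean of the items in each group,
# e.g. performance = (item1 + item2) / 2 for every respondent.
composites = {
    name: responses[:, cols].mean(axis=1) for name, cols in groups.items()
}

print(composites["performance"])  # values 8.0, 5.5, 9.0
print(composites["volatility"])   # values 3.5, 8.0, 1.5
```

Using the sum instead of the mean only rescales each composite by the number of items in it, so it involves the same assumption about the scale.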

I need to justify each step from a "mathematical" perspective, with some references. For example, I know there is an unsettled debate about the use of ordinal data in regressions, but fellow researchers argued that my data can be considered interval, especially after taking the mean of the items.

My question is the following: Is this mathematically ok? Also if you can cite work that did more or less the same, or that can mathematically support my steps, that would be greatly appreciated.

Thank you!

Avraham
Cabby
  • It's never "mathematically OK". Sometimes, indeed often, push comes to shove and people do something like what you describe. But it's still wishful thinking. Bear in mind that many universities average grades based on ordinal scores, even while many of their courses (e.g. in Psychology schools) explain that you really shouldn't do that. That aside, I can't see that you have a new question here on a topic that's much discussed already. – Nick Cox Dec 12 '18 at 18:49
  • The previous post is not quite the same question. But if you can't average one ordinal variable (across different observations or cases), then it doesn't make more sense to average two or more (within each observation or case) – Nick Cox Dec 12 '18 at 18:53
  • @NickCox Thank you for your help. I will check this link and come back to you if necessary. I also based my thinking on the following article by Norman G: https://doi.org/10.1007/s10459-010-9222-y I think that I am mixing two different problems. The first is whether averaging 3 items based on a logical link between them is fine. The second is whether that average can be used as an independent variable. – Cabby Dec 13 '18 at 10:56
  • FYI: I had 56 respondents to the survey, 56*10 items = 560 answers – Cabby Dec 13 '18 at 10:58
  • @NickCox Edit: I read the topic and it seems that this debate of ordinal vs interval data is at the heart of my use of the mean. I remember discussing it with my supervisor. He thinks we could treat the data as interval because the values used in my statistical analysis (i.e. 0 to 10) were shown as such to the survey respondents, and even if they could not choose 6.534 as an answer, the mean remains a pretty accurate representation. I might also suggest the use of the median... It thus remains to be seen whether I can use this mean/median as an independent variable – Cabby Dec 13 '18 at 11:14
  • Don't get me wrong: I think the mean is often a pragmatic, sensible choice. But there are not usually deep mathematical or theoretical reasons beyond that. The median will typically be uselessly insensitive in practice, for all that in principle it makes complete sense and is well defined. Suppose your grades are 1 1 1 2 2 2 2 and 2 2 2 2 3 3 3, so the median is 2 in either case. Is that really a better summary of the data than the means? Trimmed means are an interesting direction! – Nick Cox Dec 13 '18 at 11:51
  • @NickCox Thank you for this clarification and all the useful help. Suppose I go on using the average because I believe that, in this context, treating an ordinal scale as interval is acceptable (as many studies already did and as you mentioned in the other post), while being aware of the mathematical concerns that poses. Could I then go ahead and use this average as an independent variable in my regression? I also checked the Gauss-Markov assumptions and everything seems appropriate from this perspective (or is there any other concern/assumption I am missing?) – Cabby Dec 13 '18 at 12:59
  • You could. Naturally the real question is whether you should. My preference is always for using the original variables as predictors and seeing how they work. But the opposite style is common in some fields, to regard the original variables as just way-stations in attempts to define underlying or latent variables. I can't at a distance and with no data in hand act as oracle to tell you the best strategy. Bear in mind that no white magic lets the regression know how you produced your predictors and whether they are valid in any sense. It's just a robot reacting to data. – Nick Cox Dec 13 '18 at 13:37
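The grade example from the comments above can be checked directly. This is a small sketch using Python's standard `statistics` module; the `trimmed_mean` helper and its 20% trimming proportion are illustrative assumptions, not something specified in the thread.

```python
from statistics import mean, median

a = [1, 1, 1, 2, 2, 2, 2]
b = [2, 2, 2, 2, 3, 3, 3]

def trimmed_mean(values, proportion):
    """Mean after dropping a fraction of the values from each end.

    Hypothetical helper: drops int(n * proportion) values per tail.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)

# Both medians are 2, although the two sets of grades clearly differ.
print(median(a), median(b))

# The means separate the two sets: 11/7 versus 17/7.
print(round(mean(a), 3), round(mean(b), 3))

# A 20% trimmed mean drops one value from each end here and still separates them.
print(trimmed_mean(a, 0.2), trimmed_mean(b, 0.2))
```

The medians coincide while the means and trimmed means do not, which is the insensitivity described in the comment.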

1 Answer

Combining the independent variables before any analysis and without prior knowledge is not a good approach.

Using the 10 original independent variables, the following model can be fitted: $$Y=\beta_0 + \sum_{i=1}^{10}\beta_iX_i + \epsilon$$

After fitting this full model, the clearly useless independent variables (for example, those with p > 0.3) can be excluded. Then the $X$s that (1) have close values of $\hat \beta$, verified by testing that their $\beta$s are equal, and (2) can meaningfully be combined may be combined.

Then the new independent variable can be constructed and the new model fitted. This approach is reasonable, and the results will be better than those from a combination based only on the meaning of the items.

This approach needs a large sample, for example, n > 500.
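One way to carry out the equality check in step (1) is a nested-model F-test, sketched below with NumPy only. Everything here is an assumption for illustration: the data are simulated, the coefficients are invented, the `rss` helper is hypothetical, and the item scores are treated as numeric, which is exactly the point disputed in the comments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600  # simulated sample size, chosen to be above the suggested 500

# Simulated 0-10 item scores, treated as numeric predictors.
X = rng.integers(0, 11, size=(n, 10)).astype(float)
true_beta = np.array([2.0, 2.0, 0.5, 0.0, 1.0, 1.0, 0.0, 0.3, 0.3, 0.0])
y = 5.0 + X @ true_beta + rng.normal(0.0, 5.0, size=n)

def rss(design, response):
    """Residual sum of squares of an OLS fit, with an intercept added."""
    A = np.column_stack([np.ones(len(response)), design])
    coef, *_ = np.linalg.lstsq(A, response, rcond=None)
    resid = response - A @ coef
    return float(resid @ resid)

# Full model: all 10 items enter separately.
rss_full = rss(X, y)

# Restricted model: items 1 and 2 are replaced by their mean, which forces
# their coefficients to be equal (beta_1 = beta_2).
combined = X[:, :2].mean(axis=1)
X_restricted = np.column_stack([combined, X[:, 2:]])
rss_restricted = rss(X_restricted, y)

q = 1                    # number of restrictions imposed
df_resid = n - (10 + 1)  # residual degrees of freedom of the full model
f_stat = ((rss_restricted - rss_full) / q) / (rss_full / df_resid)
print(f"F = {f_stat:.3f} on ({q}, {df_resid}) degrees of freedom")
```

A small F statistic relative to the F(q, df_resid) reference distribution gives no evidence against equal coefficients, so replacing the two items by their mean loses little. A large value suggests the items should stay separate.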

user158565
  • But this approach is still treating ordinal variables as if they were measured (interval scale), precisely what is in doubt. – Nick Cox Dec 12 '18 at 18:52
  • @NickCox I think the OP's question is about computing "performance = (item1 + item2)/2" before any analysis. My suggestion is to do something before combining. – user158565 Dec 13 '18 at 04:51
  • Unless you are feeding in a predictor as indicator variables, which is not what your notation implies, you are treating it as on an interval scale. – Nick Cox Dec 13 '18 at 06:59
  • @NickCox Would summing the items, "performance = item1 + item2", solve the combination problem? – Cabby Dec 13 '18 at 11:07
  • That looks like the same question to me, so I don't have a different answer. – Nick Cox Dec 13 '18 at 11:52
  • @NickCox Yes, my bad, sorry. – Cabby Dec 13 '18 at 12:59
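For reference, the indicator-variable alternative mentioned in the comments above can be sketched as follows. The responses are made up, and using score 0 as the reference level is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical responses to a single 0-10 item.
item = np.array([0, 3, 3, 7, 10, 5])

levels = np.arange(11)

# One-hot encode each response, then drop the first column so that
# score 0 acts as the reference level.
indicators = (item[:, None] == levels[None, :]).astype(float)[:, 1:]

print(indicators.shape)  # (6, 10): one column per non-reference score
```

Coded this way, each score gets its own coefficient and no spacing between scores is assumed, but a single 0-10 item can use up to 10 coefficients instead of 1, which matters with only 56 respondents.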