1

I am doing statistical analysis of empirical data using a generalized ordered regression model.

I would like to test for interaction terms.

I have a 3-level categorical IV (coded as 2 dummy variables), which divides my subjects (observations) into groups, and a few continuous IVs.

I am interested in testing interactions between my categorical IV and each of the continuous IVs.

I want to mean-center my continuous IVs for that, to avoid multicollinearity issues.

What mean should I use for that? The overall mean for all observations? Or should I mean-center my variables for each group separately?

(The means of my continuous variables differ between groups.)

Which approach is more correct?
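
To make the two options concrete, here is a minimal sketch in Python with pandas. The data frame, the column names `group` and `x`, and the values are all made up for illustration; they are not my actual data.

```python
import pandas as pd

# Hypothetical toy data: 'group' stands in for the 3-level categorical IV,
# 'x' for one continuous IV. Values are invented for illustration only.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "x": [1.0, 3.0, 4.0, 6.0, 9.0, 13.0],
})

# Option 1: center on the overall (grand) mean of all observations.
df["x_grand"] = df["x"] - df["x"].mean()

# Option 2: center each observation on the mean of its own group.
df["x_within"] = df["x"] - df.groupby("group")["x"].transform("mean")

print(df)
```

With option 1, zero means "at the overall sample mean" for every group. With option 2, zero means "at the mean of the observation's own group", so the between-group differences in `x` are removed from the centered variable.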

Mandarc
  • Richard Williams of University of Notre Dame writes: "If you do center, be consistent throughout, i.e. different sample selections could produce different means, so comparing results produced by different centerings could be deceptive.", which I guess is my answer? (source: https://www3.nd.edu/~rwilliam/stats2/l53.pdf) – Mandarc Apr 03 '18 at 13:17

2 Answers

0

Richard Williams of University of Notre Dame writes:

"If you do center, be consistent throughout, i.e. different sample selections could produce different means, so comparing results produced by different centerings could be deceptive.", which I guess is my answer.

(source: www3.nd.edu/~rwilliam/stats2/l53.pdf)

Mandarc
0

It's perfectly fine to have different means for different groups. What's not fine is to calculate those means with your test set included. As you say, you are trying to test a hypothesis on your data. That test should be done on a held-out test set, which should not be taken into account when you estimate your means.

The mean of a variable is a piece of knowledge about the data, and depending on the application it can be an important one: for example, is a new data point above or below the average? Outliers in particular can shift the average substantially. Therefore, independent of your hypothesis, you should hold out part of your data as a test set, develop your hypothesis (including calculating the average of a variable) on the training set only, and then check whether it holds on the test set. If you calculate the mean on the whole dataset, you implicitly gain knowledge about the part of the data that you should not see while developing your hypothesis.

Another point is that centering your variables does not necessarily remove multicollinearity, as nicely worded here:

Mean-center the predictor variables. Generating polynomial terms (i.e., for $x_{1}, x_{1}^{2}, x_{1}^{3}$, etc.) or interaction terms (i.e., $x_{1}\times x_{2}$, etc.) can cause some multicollinearity if the variable in question has a limited range (e.g., [2,4]). Mean-centering will eliminate this special kind of multicollinearity. However, in general, this has no effect. It can be useful in overcoming problems arising from rounding and other computational steps if a carefully designed computer program is not used.
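
The "special kind" of multicollinearity in the quote can be illustrated with a small simulation. This is only a sketch under assumptions: the two predictors are simulated as independent uniform draws on the limited range [2, 4] from the quote, and the sample size and seed are arbitrary.

```python
import numpy as np

# Simulated predictors on the limited range [2, 4] mentioned in the quote.
# The sample size and seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.uniform(2, 4, n)
x2 = rng.uniform(2, 4, n)

# Correlation between a predictor and the raw interaction term.
raw_corr = np.corrcoef(x1, x1 * x2)[0, 1]

# Mean-center both predictors, then rebuild the interaction term.
x1c = x1 - x1.mean()
x2c = x2 - x2.mean()
centered_corr = np.corrcoef(x1c, x1c * x2c)[0, 1]

print(f"corr(x1, x1*x2) before centering: {raw_corr:.2f}")
print(f"corr(x1, x1*x2) after centering:  {centered_corr:.2f}")
```

In this setup the uncentered interaction term is strongly correlated with its component, while the centered one is close to uncorrelated with it. Centering does nothing for correlation between `x1` and `x2` themselves, which is the "in general, this has no effect" part of the quote.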

adrin
  • I am not sure what you mean when you say I should not calculate means with a test set included, or that the test should be done on a set-aside test set. Could you clarify? As I understand it, it is not even necessary to center around the mean for my population; centering can be done around any value that is of interest. What centering does is set what my intercept will be: the prediction of my DV at the value of the IV that becomes 0 after centering. Do you mean that I should calculate a mean for some subset of my observations and center all my observations around that mean? – Mandarc Apr 03 '18 at 17:26
  • Re your other point: I know that mean centering is not advised as a panacea for multicollinearity; however, it helps with the special kind of multicollinearity that can arise when you create polynomial or interaction terms (which is also said in the quote you included). As I am introducing interaction terms, my situation is exactly this special case where mean centering should be used, and this is why I want to use it. – Mandarc Apr 03 '18 at 17:26
  • see the edit for the mean and test set. – adrin Apr 04 '18 at 08:46
  • Thank you for the edit, it is much clearer now. I think I understand what you mean, but I don't think that a division into a 'training set' and a 'test set' can be applied to my data (I could choose one group as a reference group, but I am not sure that would be theoretically justifiable). What I could do is center around general population norms for the questionnaire I am using. But isn't any value used for centering OK, as long as I take into account what this value *means* when interpreting the results? – Mandarc Apr 04 '18 at 12:51
  • It really depends on what you do after you center them around the mean. I don't have enough information about your use case and analysis to answer that. – adrin Apr 04 '18 at 12:56
  • As mentioned in my original post, I am doing my analysis using a generalized ordered regression model. I want to center my independent variables because I am creating interaction terms, which I want to include in my model. – Mandarc Apr 04 '18 at 12:58
  • From your comments and the way you conceptualise the problem, I can see that you seem to be thinking in terms of learning algorithms, which is not exactly what I am doing here. – Mandarc Apr 04 '18 at 13:00
  • I'd say, as long as you're fitting a model (regression in this case) and feature engineering (feature interactions in this case) to your data, all train/test/validation scenarios apply to your case. Who knows, I may be wrong :D – adrin Apr 04 '18 at 13:03
  • It seems to me that I am talking about a statistical model, and you are talking about a machine learning model. Regression is used in both statistics and machine learning, but with different assumptions and goals depending on which of the two it is used in. You can apply machine learning to analysing data, but that is not what I am trying to do here. Here is a discussion of the differences between the two: https://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning – Mandarc Apr 04 '18 at 13:41