I have the following model:

Reduction_in_clinical_score ~ Baseline_clinical_score + 
    Site_of_data_collection + Treatment_Type + Age + Sex  + ERP

Site of data collection is made up of four levels, treatment type has two levels, and sex has two levels. All other variables are continuous.

I have 88 observations in total.

In MATLAB (using fitlm), I am getting the following warning: "Warning: Regression design matrix is rank deficient to within machine precision."

From what I have gathered online, it seems as though this may be caused by having an inadequate number of observations relative to the number of predictors in my model.

My question, then, is: what would be the next step in this case?

Would it be to remove a predictor (ideally based on theory/literature)?

I ran the same linear regression in SPSS, which provided no warning (the output all looks reasonable).

I should note that I checked the rank of my predictor variables, and it came back as full rank (i.e., 6). I also checked the VIF values in SPSS, and the highest value is ~4.6. However, SPSS also shows Site and Treatment_Group as highly correlated (r = -0.861, p < 0.001). Could this be an issue of multicollinearity between two categorical variables? When I remove one or the other, the issue goes away.

I should also note that there may be a design issue, and I think the problem may stem from it: Treatment 1 data were collected at sites A, B, C and D, whereas Treatment 2 data were collected only at site A.
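
One way to test whether a design like this is rank deficient is to build the dummy-coded columns for site and treatment and compute their rank directly. The following is a minimal sketch, not the actual study data: the cell layouts, the helper names (matrix_rank, design_row) and the choice of site A and treatment 1 as reference levels are all assumptions made for illustration, and the continuous covariates are left out. It compares three cases: the layout as described above, a hypothetical layout in which treatment 1 was never observed at site A, and the described layout with a site-by-treatment interaction added.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank by exact Gaussian elimination (Fractions avoid tolerance issues)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Look for a pivot at or below the current rank position.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

SITES = ["A", "B", "C", "D"]

def design_row(site, treatment, interaction=False):
    """Intercept plus reference-coded dummies (site A, treatment 1 = reference)."""
    site_d = [int(site == s) for s in SITES[1:]]
    treat_d = int(treatment == 2)
    row = [1] + site_d + [treat_d]
    if interaction:
        row += [d * treat_d for d in site_d]
    return row

# Hypothetical cell layouts. One row per occupied (site, treatment) cell is
# enough here, because repeated rows do not change the rank.
described = [("A", 1), ("B", 1), ("C", 1), ("D", 1), ("A", 2)]
confounded = [("B", 1), ("C", 1), ("D", 1), ("A", 2)]  # no treatment 1 at site A

main_described = matrix_rank([design_row(s, t) for s, t in described])
main_confounded = matrix_rank([design_row(s, t) for s, t in confounded])
inter_described = matrix_rank(
    [design_row(s, t, interaction=True) for s, t in described]
)

print(main_described, "of 5 columns")   # 5 of 5 columns
print(main_confounded, "of 5 columns")  # 4 of 5 columns
print(inter_described, "of 8 columns")  # 5 of 8 columns
```

Under these assumptions, the main-effects columns are full rank as long as treatment 1 was observed at site A at least once. They lose a rank if treatment 2 coincides exactly with site A, or if site-by-treatment interaction terms are included, so a cross-tabulation of site by treatment on the real data would show which case applies.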


kjetil b halvorsen
pdhami
  • Do you have fewer observations than predictors? – Ale Nov 05 '20 at 17:29
  • Sorry, I thought I had included that information. I have 88 observations, so no. – pdhami Nov 05 '20 at 17:50
  • Please include new information as an edit to the post, not only in comments. Not everybody reads comments! – kjetil b halvorsen Nov 05 '20 at 17:58
  • Please check these answers: https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it – Ale Nov 05 '20 at 18:23
  • Is "Site Data was Collected" a category? Like, is site 4 equal to site 1 plus site 3, or twice site 2, or are there four sites coded as 1, 2, 3, 4? The reason I'm asking is that you may have fallen into the dummy variable trap and created a rank-deficient matrix. – Sycorax Nov 05 '20 at 22:17
  • Apologies, both site and treatment type are categories. So, as you said, for site there are four sites coded as 1, 2, 3 and 4. The same goes for treatment type: there are 2 treatment types coded as 1 and 2. I've looked at the MATLAB "fitlm" page, and from what I can understand, it should take care of the categorical variables without falling into the dummy variable trap, although I may be wrong. – pdhami Nov 05 '20 at 22:19
  • So if you dummy encode those categories, you have 4 + 2 + 4 = 10 predictors, not six. You've found a rank of 6 < 10, which is rank deficiency. Additionally, an intercept will be collinear with the 4 site dummies and with the 2 treatment type dummies. If you haven't dummy-encoded the categories, then your model is flatly bogus because you're treating site 4 as four times as "site" as site 1. – Sycorax Nov 05 '20 at 22:20
  • Thank you Sycorax. Reading the MATLAB documentation on dummy variable creation (https://www.mathworks.com/help/stats/dummy-indicator-variables.html#mw_65b527ec-2efa-4b62-a04a-f8b96a12de12), it seems as though this should have been done automatically for my 'category' variables. I also edited the post to add a potential design issue: all treatment 2 data was collected from site A. – pdhami Nov 05 '20 at 22:45
  • "all treatment 2 data was collected from site A" -- this is the source of the problem. – Sycorax Nov 05 '20 at 23:01
  • Would there be any remedy for this? Or would I have to choose to drop one of the two variables? – pdhami Nov 05 '20 at 23:24
  • Collect treatment 2 data at all of your sites. – Sycorax Nov 06 '20 at 21:11
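
The dummy variable trap raised in the comments can be illustrated with a small example. This is a sketch on made-up site labels, not the study's data, and the variable names are hypothetical. It shows that a full set of indicator columns always sums to the intercept column, which is an exact linear dependence, and that dropping one reference level removes it.

```python
# Hypothetical site labels for six observations (not the study's data).
sites = ["A", "B", "C", "D", "A", "C"]
levels = ["A", "B", "C", "D"]

# Full dummy coding: one indicator column per level.
full_dummies = [[int(s == level) for level in levels] for s in sites]

# Every row contains exactly one 1, so the indicator columns add up to the
# intercept column of ones: an exact linear dependence.
row_sums_full = [sum(row) for row in full_dummies]

# Reference coding: drop the column for the reference level (here "A").
reference_dummies = [row[1:] for row in full_dummies]
row_sums_reference = [sum(row) for row in reference_dummies]

print(row_sums_full)       # [1, 1, 1, 1, 1, 1]
print(row_sums_reference)  # [0, 1, 1, 1, 0, 1]
```

With reference coding the remaining columns no longer reproduce the intercept, so this particular dependence is gone. Any rank deficiency that persists after that would have to come from the data layout itself, such as the site and treatment pattern discussed above.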
