LInear regression with one categorical data with multiple leves

Question

Lets say you have info about how many times some person came to servis with some product. So you have table like this:

      Count_case                   Product         Person
1              1       ZTE ZXHN H168N V3.1              1
2              2       ZTE ZXHN H267A V1.0              2
3              6       ZTE ZXHN H267A V1.0              3
4              1       ZTE ZXHN H168N V3.1             15
5              2           Kaon-O2SmartBox              8
6              3           Kaon-O2SmartBox              9
7              1         Comtrend VR-3026e              1
8              2       ZTE ZXHN H168N V3.1              5
9              1       ZTE ZXHN H168N V3.1             26

So for example in first row you can see that person 1 came with product ZTE YXHN H168N V3.1 two times. I have more variables but I found out that the Product is the most important variable when I want to predict how many times will the person come = Count_case. So I want to focus only on it and find out which of the level is the most imortant. In other words, with which product is more likely that person will come multiple times.

I used basic box plot to show this but as you can see it's not very good and not say much.

Then I tired basic linear regression:

lm.fit <- lm(Count_case ~  Product, data=dt)
summary(lm.fit)

Call:
lm(formula = Count_case ~ Product, data = final_dat_KCID)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8526 -1.3713 -0.6700  0.5749 29.1474 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       2.37135    0.03677  64.496  < 2e-16 ***
ProductTP-LINK Archer MR200      -0.37135    1.83192  -0.203 0.839364    
ProductComtrend VG-8054u         -0.57135    0.58035  -0.984 0.324889    
ProductComtrend VI-3234eu        -1.37135    2.59047  -0.529 0.596546    
ProductComtrend VR-3026e         -0.23154    0.06490  -3.568 0.000361 ***
ProductComtrend VR-3028e          0.05380    0.06855   0.785 0.432578    
ProductComtrend VR-3031eu         0.08236    0.07218   1.141 0.253859    
ProductHuawei HG520i             -0.16076    0.09906  -1.623 0.104635    
ProductHuawei HG622u             -0.18002    0.07063  -2.549 0.010814 *  
ProductKaon-O2SmartBox            0.29863    0.06968   4.286 1.83e-05 ***
ProductZTE ZXHN H168N V3.1        0.48130    0.06060   7.942 2.08e-15 ***
ProductZTE ZXHN H267A V1.0       -0.15180    0.08202  -1.851 0.064232 .  
ProductZyXEL P-660HN-T3A          0.33346    0.08492   3.927 8.64e-05 ***
ProductZyXEL P-660HW-T3 v2       -0.06951    0.13770  -0.505 0.613705    
ProductZyXEL VMG3312-T20A_8C5973 -0.59357    0.61162  -0.970 0.331817    
ProductZyXEL VMG8924-B30A PP     -0.20820    0.11929  -1.745 0.080934 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.59 on 21816 degrees of freedom
  (9117 observations deleted due to missingness)
Multiple R-squared:  0.008056,  Adjusted R-squared:  0.007374 
F-statistic: 11.81 on 15 and 21816 DF,  p-value: < 2.2e-16

But I know that this output depends on dummy variable (contrast matrix or something like that) so I tried to change reference level.

dt$Product <- relevel(df$Product, ref = 2)
lm.fit1 <- lm(Count_case ~ Product, data=dt)
summary(lm.fit1)

 Call:
lm(formula = Count_case ~ Product, data = final_dat_KCID)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8526 -1.3713 -0.6700  0.5749 29.1474 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                        2.0000     1.8316   1.092    0.275
ProductZyXEL VMG1312-B30B          0.3713     1.8319   0.203    0.839
ProductComtrend VG-8054u          -0.2000     1.9209  -0.104    0.917
ProductComtrend VI-3234eu         -1.0000     3.1723  -0.315    0.753
ProductComtrend VR-3026e           0.1398     1.8323   0.076    0.939
ProductComtrend VR-3028e           0.4251     1.8325   0.232    0.817
ProductComtrend VR-3031eu          0.4537     1.8326   0.248    0.804
ProductHuawei HG520i               0.2106     1.8339   0.115    0.909
ProductHuawei HG622u               0.1913     1.8325   0.104    0.917
ProductKaon-O2SmartBox             0.6700     1.8325   0.366    0.715
ProductZTE ZXHN H168N V3.1         0.8526     1.8322   0.465    0.642
ProductZTE ZXHN H267A V1.0         0.2196     1.8330   0.120    0.905
ProductZyXEL P-660HN-T3A           0.7048     1.8332   0.384    0.701
ProductZyXEL P-660HW-T3 v2         0.3018     1.8364   0.164    0.869
ProductZyXEL VMG3312-T20A_8C5973  -0.2222     1.9306  -0.115    0.908
ProductZyXEL VMG8924-B30A PP       0.1631     1.8351   0.089    0.929

Residual standard error: 2.59 on 21816 degrees of freedom
  (9117 observations deleted due to missingness)
Multiple R-squared:  0.008056,  Adjusted R-squared:  0.007374 
F-statistic: 11.81 on 15 and 21816 DF,  p-value: < 2.2e-16

And now I don't know what to do. How can I say which of Product is really significant for prediction of Count_case?

As was suggested below in comments I tried:

glm.fit <- glm(Count_case ~ -1 + Product, data = dt, family = poisson)
test <- glht(glm.fit, linfct = mcp(Product = "Tukey"))
summary(test)

Fit: glm(formula = Count_case ~ -1 + DEVICE_TYPE_NAME, family = poisson, 
    data = final_dat_KCID)

Linear Hypotheses:
                                                        Estimate Std. Error z value Pr(>|z|)    
TP-LINK Archer MR200 - Kaon-O2SmartBox == 0            -0.288922   0.500196  -0.578   1.0000    
ZyXEL VMG1312-B30B - Kaon-O2SmartBox == 0              -0.118610   0.016750  -7.081    <0.01 ***
Comtrend VG-8054u - Kaon-O2SmartBox == 0               -0.394282   0.167252  -2.357   0.4752    
Comtrend VI-3234eu - Kaon-O2SmartBox == 0              -0.982069   1.000098  -0.982   0.9996    
Comtrend VR-3026e - Kaon-O2SmartBox == 0               -0.221351   0.019869 -11.140    <0.01 ***
Comtrend VR-3028e - Kaon-O2SmartBox == 0               -0.096175   0.020033  -4.801    <0.01 ***
Comtrend VR-3031eu - Kaon-O2SmartBox == 0              -0.084468   0.020735  -4.074    <0.01 ** 
Huawei HG520i - Kaon-O2SmartBox == 0                   -0.188808   0.027677  -6.822    <0.01 ***
Huawei HG622u - Kaon-O2SmartBox == 0                   -0.197561   0.021046  -9.387    <0.01 ***
ZTE ZXHN H168N V3.1 - Kaon-O2SmartBox == 0              0.066178   0.017800   3.718   0.0114 *  
ZTE ZXHN H267A V1.0  - Kaon-O2SmartBox == 0            -0.184764   0.023592  -7.832    <0.01 ***
ZyXEL P-660HN-T3A - Kaon-O2SmartBox == 0                0.012961   0.022770   0.569   1.0000    
ZyXEL P-660HW-T3 v2 - Kaon-O2SmartBox == 0             -0.148361   0.036549  -4.059    <0.01 ** 
ZyXEL VMG3312-T20A_8C5973 - Kaon-O2SmartBox == 0       -0.406705   0.177329  -2.294   0.5240    
ZyXEL VMG8924-B30A PP - Kaon-O2SmartBox == 0           -0.210504   0.032907  -6.397    <0.01 ***
ZyXEL VMG1312-B30B - TP-LINK Archer MR200 == 0          0.170311   0.500085   0.341   1.0000    
Comtrend VG-8054u - TP-LINK Archer MR200 == 0          -0.105361   0.527046  -0.200   1.0000    
Comtrend VI-3234eu - TP-LINK Archer MR200 == 0         -0.693147   1.118034  -0.620   1.0000

1) correlation does not make sense for categorical data, 2) you have count data (frequencies), 3) what exactly are you trying to do? what will "correlation" tell you? It seems to me like you are better off using a linear model and estimating the mean frequency (using t-tests to also check for significance). — user2974951, Oct 02 '18 at 07:33
I'm trying to somehow say that if some person have product X then it's more likely that the person will came to servise. If the mean of how often people with the product X come to servise is bigger than people with product Y, then I would say "ok, if you have product X it's more likely that you will come". But from the boxplots you can see I can't say this. So I'm trying to find another way. — Martina Zapletalová, Oct 02 '18 at 08:38
As I mentioned in the previous comment, a linear model would give you the means and you could do all mean comparisons to figure out which is significant. If you are more interested in the actual probabilities then you would do a logistic regression. — user2974951, Oct 02 '18 at 08:44
Yes, you should look at some resource online how to interpret a linear model, what you are asking about are the contrasts. After you build a linear model you can do post-hoc tests (such as Tukey) for all the comparisons or any other custom contasts. — user2974951, Oct 02 '18 at 08:55
I know some basics about how to interpret linear model. But I've never worked only with one factor variable. I also know what are the contrasts. But I don't know what to do with my data. Some first steps which I should do. — Martina Zapletalová, Oct 02 '18 at 09:10
Add the output of your linear model to your question and ask what is unclear. — user2974951, Oct 02 '18 at 09:13
Ok, I changed my question little bit and added some outputs. — Martina Zapletalová, Oct 02 '18 at 09:29
Hi and welcome to CV. You seem to have a count variable as your response, so unless the counts are very large, which doesn't seem to be the case, you should fit a generalised linear model with `family=poisson` in the first instance., before proceeding: `glm.fit — Robert Long, Oct 02 '18 at 09:41
@RobertLong Thank you I forget about glm. But there is the same problem with significance, when I use different reference level like in lm. — Martina Zapletalová, Oct 02 '18 at 10:03
You probably don't want to fit a model with the intercept removed. See: https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model — Sal Mangiafico, Oct 02 '18 at 14:51
I think the main problem you are experiencing is that you are expecting `summary` to produce a different kind of output than it does. I'll add an answer describing the use of `car::Anova` and `emmeans`, that I think will give you the output you are looking for. — Sal Mangiafico, Oct 02 '18 at 14:53

Sal Mangiafico · Answer 1 · 2018-10-02T14:55:15.787

As has been pointed out in the comments, since you have count data, you will probably want to use a model appropriate for this kind of data. A Poisson model was suggested, though you might consider negative binomial regression to account for overdispersed data. This kind of model can be fit with the glm.nb function in the MASS package.

The problem you are experiencing with the summary output is that you are expecting a different kind of output than summary provides. Instead, you probably want an anova table, such as is produced by library(car); Anova(lm.fit). You might follow this with library(emmeans); pairs(emmeans(lm.fit, ~ Product)) to see which treatments differ from which others.

score 1 · Answer 2 · answered Oct 02 '18 at 19:00

1

From the information you gave, I am very much assuming, that this is from some applied (instead of research) context. In that case I think there are a couple of hints I could give you:

As most people have pointed out, you should use some kind of Poisson model or model for overdispersed data. That will in any case give you a much clearer picture.
P-values (although widely used in research) can be very misleading, and in practice are often not what you need. A P-value refers to the likelihood of the Null-Hypothesis. So all those stars and p-values tell you, that it is unlikely that the two objects in comparison have the same number of defects. In case of the summary output each of the products is compared to the reference, and in the glht output all pairs are compared. So are you really asking "Do product ABC and product XYZ have the same number of defects?". And if that is the question, what would it mean for your interpretation if this question is answered "no" (i.e. there is one or more "*" behind that).
From my experience, a different question that may be relevant in such a context would be "How likely is it, that product XYZ is the worst we have seen". You need a completely different model for that. For example, you could try to estimate the distribution of your counts, and then, based on all distributions estimate how likely a random value from that distribution is higher than the random values from all other distribution. That would give you a ranking of the "most likely" worst products.
If you are just trying to predict (e.g. you want to stock up on relevant parts), you can ignore p-values altogther. Just use the raw estimates for prediction.

So, the most important advice would be to be very clear about the question you are asking and then pick your analysis and the numbers you are interpreting based on that. It may not always be the p-values that tell you most for your question.

answered Oct 02 '18 at 19:00

LiKao

2,329
1
17
25

Thank you. I'm number 3 so yes I'm looking for some way how to say "How likely is it, that product XYZ is the worst we have seen". But I'm not sure I have count data you think I have. My count data are connect to person too so for one Product I have more counts - Person A came with product X 4 times, Person B came with product X 2 times, so for product X I have counts 4 and 2 etc. (see the first table). So I can't just plot histogram where on X axis will be product and on Y axis count of complaint. I can plot for example this count and on Y-axis how many people complain this much. – Martina Zapletalová Oct 03 '18 at 06:02
So I think I don't understant what you mean with "and then, based on all distributions estimate how likely a random value from that distribution is higher than the random values from all other distribution. That would give you a ranking of the "most likely" worst products." How will this give me the answer? What should I estimate? You mean I should estimate density for each product separately? – Martina Zapletalová Oct 03 '18 at 06:02
If so, I'm not sure if it will help. Because it looks like exponencial distirbution for almost all product and I think the mean value will be ~2 for all of them. – Martina Zapletalová Oct 03 '18 at 06:55
Yes, I thought all these data points referenced individual customers and the number of times they brought in their device. Since this is count data, I would have guessed that it would be Poisson distributed. However: 1) Poisson distribution is determined by one parameter, which can be estimated from the mean. So your devices don't really seem much different from that. 2) Poisson distribution makes an assumption of independence. The probability of a second service should be independent of the first. Probably not given here. Not sure what to pick instead. – LiKao Oct 03 '18 at 19:56
For your second question: You could run a simple simulation. Estimate the distribution parameters (if it looks exponential to you, just try that one) for each device. Then repeatedly draw random numbers from each distribution. Count how often each device comes out with the highest number. This does not yet capture uncertainty in the fitted distributions though. Certainty may be different for each device, since you have different number of customers with that device. You could use a Bayesian approach to handle that as well. – LiKao Oct 03 '18 at 19:59

score 0 · Answer 3 · answered Oct 02 '18 at 10:08

0

The linear model returns an intercept term (that is the mean of the reference level, which is the first level of your categorical variable) and all subsequent levels of your categorical variable. The subsequent levels are compared to your reference level, so when you used relevel with 2 you put your second level as your reference, and so all the subsequent comparisons are done on your new reference level, which is why you get different results.

If you are only interested in the mean for every level separately and their respective p-values you would add a -1 term in your model, to remove the intercept term lm(Count_case ~ -1 + Product, data=dt).

After this you could do post-hoc Tukey test to test all the paired comparisons.

answered Oct 02 '18 at 10:08

user2974951

5,700
2
14
27

I used glm instead of lm as @RobertLong suggested in comment above. And then I did `glht(glm.fit, linfct = mcp(Product = "Tukey"))` which gave me p-values so now I can say which means of levels are not same. But I'm not sure about some noticeable value of this output. My factor have 16 levels, so if I say "mean of level A is not same as mean of level D and mean of level C is not same asi mean of level H etc." it's not saying much. (I added output to question.) – Martina Zapletalová Oct 02 '18 at 13:50
1

It is a rare case that you want to remove the intercept from the linear model. [Read some responses here](https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model). – Sal Mangiafico Oct 02 '18 at 14:38
@SalMangiafico I also used glm(Count_case ~ Product, data=dt, family=poisson) with intercept. The output is almost the same. I’m just looking for suggestion of some next steps, of some idea how to say which of product is the most problematic. If there is some way. – Martina Zapletalová Oct 02 '18 at 14:51
@SalMangiafico Since you pointed out another question on CV, you should also indicate that the first example of this answer (https://stats.stackexchange.com/q/32518) is exactly the case here (ANOVA-style model from a regression with a categorial variable with multiple levels) and that this is given as an example, where it may be a good idea to remove the intercept. So the answers to the question you gave, actually indicate that it is OK to remove th eintercept here. – LiKao Oct 02 '18 at 18:29
@LiKao , you're right. I should point out that the answer you highlight has a great potential to mislead the reader. A usual analysis of variance should not be conducted using a model with the intercept removed. I believe the answer's author was pointing out a handy trick to produce parameter estimates, and not suggesting that using a model without an intercept is the usual course when conducting anova. I will make a note there as well. – Sal Mangiafico Oct 02 '18 at 18:55
1

@SalMangiafico I agree that this answer may be misleading. I used the term "ANOVA-style" from the original answer, although in my understanding an "ANOVA-style" model should compare all categories to the mean of all categories. While this can be achieved by 1-hot coding and normalizing the resulting codes and then using an lm without intercept, there are much simpler ways to achieve this. In some cases, it may be interesting to find out which of the levels have an estimate > 0 zero (no intercept). Most likely not in this case though, as I expect only customers who ever came in are in the data. – LiKao Oct 02 '18 at 19:07

LInear regression with one categorical data with multiple leves

3 Answers3

Linked