
All,

I have a dataset that contains more than 45k rows and three columns: category plus two predictors, quantity and sales_value. Below is the dput() of the first 10 and first 50 rows. One thing to consider is that the category column is a factor with 34 different levels. I am using a logistic regression model to predict category based on the other two columns; my model is below. However, I am getting a warning from it. I Googled the warning and found that there might be a linear relationship between my DV (dependent variable) and IVs (independent variables). I am not sure how to deal with this warning. Some posts suggested performing a log transformation, but I am not sure how to do that in my model (a sketch follows the warning below). Being a newbie to R, I would appreciate an explanation of how to deal with this warning.

> dput(droplevels(head(new_df1, 10)))
structure(list(category = structure(c(1L, 5L, 7L, 8L, 9L, 10L, 
2L, 3L, 4L, 6L), .Label = c("", "baking", "canned", "crackers", 
"DELI", "dessert", "MEAT", "NUTRITION", "PASTRY", "PRODUCE"), class = "factor"), 
    quantity = c(5L, 27L, 3L, 1L, 29L, 94L, 70L, 20L, 12L, 122L
    ), sales_value = c(11.6, 86.83, 13.46, 2, 52.4, 133.75, 160.15, 
    38.81, 29.91, 208.75)), row.names = c(NA, 10L), class = "data.frame")
> dput(droplevels(head(new_df1, 50)))
structure(list(category = structure(c(1L, 5L, 20L, 21L, 24L, 
27L, 2L, 3L, 4L, 6L, 7L, 8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 18L, 19L, 22L, 23L, 25L, 26L, 28L, 1L, 5L, 20L, 21L, 24L, 
27L, 2L, 3L, 4L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 16L, 
17L, 18L, 19L, 22L), .Label = c("", "baking", "canned", "crackers", 
"DELI", "dessert", "drinks", "drug", "ethnic", "food", "food add-ons", 
"frozen dessert", "frozen food", "frozen meat", "fruit", "health", 
"household", "instant dinner", "meat", "MEAT", "NUTRITION", "other", 
"packaged foods", "PASTRY", "personal care", "produce", "PRODUCE", 
"seasonal"), class = "factor"), quantity = c(5L, 27L, 3L, 1L, 
29L, 94L, 70L, 20L, 12L, 122L, 81L, 1L, 78L, 82L, 30L, 7L, 1L, 
33L, 5L, 56L, 4L, 66L, 5L, 45L, 37L, 36L, 3L, 1L, 41L, 2L, 18L, 
20L, 115L, 83L, 32L, 24L, 118L, 72L, 2L, 1L, 73L, 92L, 44L, 16L, 
21L, 1L, 57L, 1L, 68L, 14L), sales_value = c(11.6, 86.83, 13.46, 
2, 52.4, 133.75, 160.15, 38.81, 29.91, 208.75, 204.38, 3.99, 
128.27, 193.84, 56.27, 11.75, 1.5, 41.59, 33.51, 140.42, 7, 170.11, 
14.08, 84.93, 111.53, 33.62, 2.07, 2.99, 125.34, 4.45, 46.33, 
42.91, 132.35, 181.04, 51.64, 59.91, 260.86, 189.15, 12.68, 1.09, 
115.18, 210.44, 111.53, 31.4, 25.16, 2.29, 142.57, 2.5, 179.86, 
59.28)), row.names = c(NA, 50L), class = "data.frame")

My model:

fit_glm <- glm(category ~ ., data = new_df1, family = binomial)

Warning:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
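
For reference, the log transformation some posts mention would go inside the model formula. This is only a sketch under the data shown above (both predictors are strictly positive in the sample), with fit_log as a hypothetical name; it does not by itself address the multi-level outcome issue raised in the comments below:

# Sketch: log-transform the two numeric predictors inside the formula
fit_log <- glm(category ~ log(quantity) + log(sales_value),
               data = new_df1, family = binomial)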
  • I tried BayesGlm as suggested by Rasmus Bååth, following the link. However, I am getting the same warning. Any suggestions? –  Nov 23 '18 at 20:26
  • I suggested the link above just based on the warning message, but now that I look at your data, I have to wonder why you think logistic regression is what you should be doing here -- can you tell us what you're thinking regarding your modelling strategy? – duckmayr Nov 23 '18 at 20:31
  • @duckmayr: I would like to predict the probabilities of the different categories based on the quantity and sales value set by the retailer. –  Nov 23 '18 at 20:36
  • Then what you'll likely be wanting is multinomial logistic regression; see [here](https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/) for a quick primer, and the sketch after these comments. – duckmayr Nov 23 '18 at 20:38
  • Try creating contingency tables of your dependent variable and each of your independent variables. If you find any instance where the choice of your independent variables perfectly predicts your dependent variable, that will cause this problem. For example, if all the rows with a quantity of 5 and the same sales_value fall into a single category, this will be a problem. You could try collapsing some categories, or combining sales_values into categories as well. You should also look into multinomial regression as others suggested, or methods like nearest-neighbor techniques; a sketch of the contingency check follows these comments. – StatsStudent Nov 24 '18 at 04:28
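
A minimal sketch of the multinomial approach duckmayr points to, assuming the nnet package is installed (fit_multi is a hypothetical name):

library(nnet)

# One set of coefficients per category level, relative to a baseline level
fit_multi <- multinom(category ~ quantity + sales_value, data = new_df1)
summary(fit_multi)

# Per-row predicted probabilities across all category levels
head(predict(fit_multi, type = "probs"))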
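
And a sketch of the separation check StatsStudent describes, binning each numeric predictor before cross-tabulating (five bins is an arbitrary choice):

# Cells where a single category accounts for all observations suggest separation
with(new_df1, table(category, cut(quantity, breaks = 5)))
with(new_df1, table(category, cut(sales_value, breaks = 5)))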
