I am currently working on a project to try to predict whether an electrical box is broken in a given zipcode, across a range of countries (this is a made-up example just to get a handle on the technique). Because of the size of the dataset and the number of other attribute columns I have, I would like to see whether the zipcode itself can be considered predictive. The idea is that we would then only include the predictive zipcodes (with a margin of error) and reduce the number of columns dramatically.
A zipcode can contain many electrical boxes, and the number varies with the size of the geographical area covered. Each box gets a 1 attached to it if it has failed and a 0 if it has not, so in practice a zipcode can have many 1s and many 0s. There are approximately 50k zipcodes, which, when exploded out to one record per box with its has_failed flag, gives approximately 580k records.
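To make the record layout concrete, here is a tiny sketch (the values are invented; the real data has roughly 50k zipcodes and 580k box-level records):

library(dplyr)

# One row per electrical box: the zipcode it sits in and whether it has failed
records <- data.frame(
  berlin_zip = c("10115", "10115", "10115", "10117", "10117"),
  has_failed = c(1, 0, 0, 1, 1)
)

# Collapsing to counts per zipcode shows the many-1s-and-0s structure
records %>% count(berlin_zip, has_failed)
# -> one row per (zipcode, has_failed) combination with its count n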
I thought a good way to approach this would be to run a logistic regression with the zipcode as the independent variable and has_failed as the dependent variable. Each zipcode would then get both a log-odds estimate and a p-value, and I would only select the zipcodes with a large effect and an acceptable p-value. Conscious of the possibility of Type I errors, I applied a false discovery rate (FDR) adjustment to the results so that I could create a volcano plot, as seen here in Figure 5.4. My problem is that when I use the R package biglm to process the data in chunks, the algorithm fails to converge, which I suspect is down to complete separation in the results. It could also be that I am approaching this implementation completely incorrectly.
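For reference, this is roughly the shape of the chunked fit that fails (a sketch, not my exact code: the chunksize value is a placeholder and mydf stands in for the real 580k-row data frame):

library(biglm)

# Sketch of the chunked logistic fit from the real pipeline (placeholder settings).
# bigglm() reads the data frame in chunks of `chunksize` rows; this is the call
# that fails to converge, which I suspect is down to complete separation.
mod_big <- bigglm(has_failed ~ berlin_zip,
                  data = mydf,
                  family = binomial(),
                  chunksize = 10000)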
Does anyone know of another, more robust way I could approach this? The overall aim is to have a column in my dataset that is 1 or 0 depending on whether the zipcode itself has been shown to be predictive by the method above (given the caveats of interactions being missed, etc.); a sketch of that final flagging step follows the example code below.
Below is some randomly generated fake data to give a working example of what I was attempting.
library(dplyr)
library(broom)
library(ggplot2)
# The zipcodes lose the letters at the end but that's OK for what we are doing
berlin_zip <-
c(10115L, 10117L, 10119L, 10178L, 10179L, 10315L, 10317L, 10318L, 10319L, 10365L, 10367L,
10369L, 10405L, 10407L, 10409L, 10435L, 10437L, 10439L, 10551L, 10553L, 10555L, 10557L,
10559L, 10585L, 10587L, 10589L, 10623L, 10625L, 10627L, 10629L, 10707L, 10709L, 10711L,
10713L, 10715L, 10717L, 10719L, 10777L, 10779L, 10781L, 10783L, 10785L, 10787L, 10789L,
10823L, 10825L, 10827L, 10829L, 11011L, 12043L, 12045L, 12047L, 12049L, 12051L, 12053L,
12055L, 12057L, 12059L, 12099L, 12101L, 12103L, 12105L, 12107L, 12109L, 12157L, 12159L,
12161L, 12163L, 12165L, 12167L, 12169L, 12203L, 12205L, 12207L, 12209L, 12247L, 12249L,
12277L, 12279L, 12305L, 12307L, 12309L, 12347L, 12349L, 12351L, 12353L, 12355L, 12357L,
12359L, 12435L, 12437L, 12439L, 12459L, 12487L, 12489L, 12524L, 12526L, 12527L, 12529L,
12555L, 12557L, 12559L, 12587L, 12589L, 12619L, 12621L, 12623L, 12627L, 12629L, 12679L,
12681L, 12683L, 12685L, 12687L, 12689L, 13051L, 13053L, 13055L, 13057L, 13059L, 13086L,
13088L, 13089L, 13125L, 13127L, 13129L, 13156L, 13158L, 13159L, 13187L, 13189L, 13347L,
13349L, 13351L, 13353L, 13355L, 13357L, 13359L, 13403L, 13405L, 13407L, 13409L, 13435L,
13437L, 13439L, 13465L, 13467L, 13469L, 13503L, 13505L, 13507L, 13509L, 13581L, 13583L,
13585L, 13587L, 13589L, 13591L, 13593L, 13595L, 13597L, 13599L, 13627L, 13629L, 14050L,
14052L, 14053L, 14055L, 14057L, 14059L, 14089L, 14109L, 14129L, 14131L, 14163L, 14165L,
14167L, 14169L, 14193L, 14195L, 14197L, 14199L) %>%
as.character() %>%
sample(., 5000, replace = TRUE)
has_failed <- sample(c(0,1), 5000, replace = TRUE)
mydf <- data.frame(berlin_zip, has_failed)
# Fit the regression. We use a plain lm() fit here for the example;
# in my actual code the fit is done with the biglm package
mod <- lm(has_failed~berlin_zip,data = mydf)
# Grab the coefficients and p-values
# Update the p-values with FDR to account for the potential false positives
mod_coefficents <- broom::tidy(mod) %>%
  mutate(log10_fdr_p = -log10(p.adjust(p.value, method = "fdr")))
mod_coefficents
#> # A tibble: 182 x 6
#> term estimate std.error statistic p.value log10_fdr_p
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.594 0.0883 6.72 1.99e-11 8.44
#> 2 berlin_zip10117 -0.0776 0.126 -0.617 5.38e- 1 0.0483
#> 3 berlin_zip10119 -0.110 0.126 -0.873 3.83e- 1 0.0694
#> 4 berlin_zip10178 0.0729 0.147 0.495 6.20e- 1 0.0483
#> 5 berlin_zip10179 -0.00754 0.128 -0.0589 9.53e- 1 0.0117
#> 6 berlin_zip10315 -0.135 0.135 -1.00 3.16e- 1 0.102
#> 7 berlin_zip10317 -0.0737 0.133 -0.553 5.80e- 1 0.0483
#> 8 berlin_zip10318 -0.149 0.131 -1.14 2.53e- 1 0.102
#> 9 berlin_zip10319 -0.177 0.135 -1.31 1.89e- 1 0.102
#> 10 berlin_zip10365 -0.00754 0.128 -0.0589 9.53e- 1 0.0117
#> # ... with 172 more rows
ggplot(mod_coefficents, aes(x=estimate, y=log10_fdr_p)) +
geom_point(alpha = 0.7)
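And this is roughly how I intended to turn those results into the 0/1 flag described above (a sketch only: the effect-size cut-off of 0.5 and the FDR cut-off of 0.05 are arbitrary placeholders, and stringr::str_remove is just used to strip the "berlin_zip" prefix that tidy() puts on the term names):

library(stringr)

# Placeholder thresholds for "large effect" and "acceptable FDR-adjusted p-value"
effect_cutoff <- 0.5
fdr_cutoff <- 0.05

# Keep only zipcode terms that pass both thresholds
predictive_zips <- mod_coefficents %>%
  filter(term != "(Intercept)") %>%
  mutate(berlin_zip = str_remove(term, "^berlin_zip")) %>%
  filter(abs(estimate) >= effect_cutoff,
         log10_fdr_p >= -log10(fdr_cutoff)) %>%
  pull(berlin_zip)

# Final column: 1 if the record's zipcode was flagged as predictive, 0 otherwise
mydf <- mydf %>%
  mutate(zip_is_predictive = as.integer(berlin_zip %in% predictive_zips))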