I am currently working on a project to try to predict whether an electrical box is broken in a given zipcode, across a range of countries (this is a made-up example just to get a handle on the technique). Because of the size of the dataset and the number of other attribute columns I have, I would like to see whether the zipcode itself can be considered predictive. The idea is that we would then only include the predictive zipcodes (with a margin of error) and reduce the number of columns dramatically.
A zipcode can contain many electrical boxes, and the number varies with the size of the geographical area covered. Each box gets a 1 attached to it if it has failed and a 0 if it has not, so in practice a zipcode can have many 1s and many 0s. There are approximately 50k zipcodes, which, when exploded out to one record per box with its has_failed flag, gives approximately 580k records.
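To make the record layout concrete, here is a tiny sketch (the values are invented; the real data has roughly 50k zipcodes and 580k box-level records):

library(dplyr)

# One row per electrical box: the zipcode it sits in and whether it has failed
records <- data.frame(
  berlin_zip = c("10115", "10115", "10115", "10117", "10117"),
  has_failed = c(1, 0, 0, 1, 1)
)

# Collapsing to counts per zipcode shows the many-1s-and-0s structure
records %>% count(berlin_zip, has_failed)
# -> one row per (zipcode, has_failed) combination with its count n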
I thought a good way to approach this would be to run a logistic regression with the zipcode as the independent variable and has_failed as the dependent variable. Each zipcode would then get both a log-odds estimate and a p-value, and I would only select the zipcodes with a large effect and an acceptable p-value. Conscious of the possibility of Type I errors, I applied a false discovery rate (FDR) adjustment to the results so that I could create a volcano plot, as seen here in Figure 5.4. My problem is that when I use the R package biglm to process the data in chunks, the algorithm fails to converge, which I suspect is down to complete separation in the results. It could also be that I am approaching this implementation completely incorrectly.
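For reference, this is roughly the shape of the chunked fit that fails (a sketch, not my exact code: the chunksize value is a placeholder and mydf stands in for the real 580k-row data frame):

library(biglm)

# Sketch of the chunked logistic fit from the real pipeline (placeholder settings).
# bigglm() reads the data frame in chunks of `chunksize` rows; this is the call
# that fails to converge, which I suspect is down to complete separation.
mod_big <- bigglm(has_failed ~ berlin_zip,
                  data = mydf,
                  family = binomial(),
                  chunksize = 10000)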
Does anyone know of another, more robust way I could approach this? The overall aim is to have a column in my dataset that is 1 or 0 depending on whether the zipcode itself has been shown to be predictive by the method above (given the caveats of interactions being missed, etc.); a sketch of that final flagging step follows the example code below.
Below is some randomly generated fake data to give a working example of what I was attempting.
library(dplyr)
library(broom)
library(ggplot2)
# The zipcodes lose the letters at the end but that's OK for what we are doing
berlin_zip <-
c(10115L, 10117L, 10119L, 10178L, 10179L, 10315L, 10317L, 10318L, 10319L, 10365L, 10367L,
10369L, 10405L, 10407L, 10409L, 10435L, 10437L, 10439L, 10551L, 10553L, 10555L, 10557L,
10559L, 10585L, 10587L, 10589L, 10623L, 10625L, 10627L, 10629L, 10707L, 10709L, 10711L,
10713L, 10715L, 10717L, 10719L, 10777L, 10779L, 10781L, 10783L, 10785L, 10787L, 10789L,
10823L, 10825L, 10827L, 10829L, 11011L, 12043L, 12045L, 12047L, 12049L, 12051L, 12053L,
12055L, 12057L, 12059L, 12099L, 12101L, 12103L, 12105L, 12107L, 12109L, 12157L, 12159L,
12161L, 12163L, 12165L, 12167L, 12169L, 12203L, 12205L, 12207L, 12209L, 12247L, 12249L,
12277L, 12279L, 12305L, 12307L, 12309L, 12347L, 12349L, 12351L, 12353L, 12355L, 12357L,
12359L, 12435L, 12437L, 12439L, 12459L, 12487L, 12489L, 12524L, 12526L, 12527L, 12529L,
12555L, 12557L, 12559L, 12587L, 12589L, 12619L, 12621L, 12623L, 12627L, 12629L, 12679L,
12681L, 12683L, 12685L, 12687L, 12689L, 13051L, 13053L, 13055L, 13057L, 13059L, 13086L,
13088L, 13089L, 13125L, 13127L, 13129L, 13156L, 13158L, 13159L, 13187L, 13189L, 13347L,
13349L, 13351L, 13353L, 13355L, 13357L, 13359L, 13403L, 13405L, 13407L, 13409L, 13435L,
13437L, 13439L, 13465L, 13467L, 13469L, 13503L, 13505L, 13507L, 13509L, 13581L, 13583L,
13585L, 13587L, 13589L, 13591L, 13593L, 13595L, 13597L, 13599L, 13627L, 13629L, 14050L,
14052L, 14053L, 14055L, 14057L, 14059L, 14089L, 14109L, 14129L, 14131L, 14163L, 14165L,
14167L, 14169L, 14193L, 14195L, 14197L, 14199L) %>%
as.character() %>%
sample(., 5000, replace = TRUE)
has_failed <- sample(c(0,1), 5000, replace = TRUE)
mydf <- data.frame(berlin_zip, has_failed)
# Fit the regression. We use a plain lm() fit here for the example;
# in my actual code the fit is done with the biglm package
mod <- lm(has_failed~berlin_zip,data = mydf)
# Grab the coefficients and p-values
# Update the p-values with FDR to account for the potential false positives
mod_coefficents <- broom::tidy(mod) %>%
  mutate(log10_fdr_p = -log10(p.adjust(p.value, method = "fdr")))
mod_coefficents
#> # A tibble: 182 x 6
#> term estimate std.error statistic p.value log10_fdr_p
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.594 0.0883 6.72 1.99e-11 8.44
#> 2 berlin_zip10117 -0.0776 0.126 -0.617 5.38e- 1 0.0483
#> 3 berlin_zip10119 -0.110 0.126 -0.873 3.83e- 1 0.0694
#> 4 berlin_zip10178 0.0729 0.147 0.495 6.20e- 1 0.0483
#> 5 berlin_zip10179 -0.00754 0.128 -0.0589 9.53e- 1 0.0117
#> 6 berlin_zip10315 -0.135 0.135 -1.00 3.16e- 1 0.102
#> 7 berlin_zip10317 -0.0737 0.133 -0.553 5.80e- 1 0.0483
#> 8 berlin_zip10318 -0.149 0.131 -1.14 2.53e- 1 0.102
#> 9 berlin_zip10319 -0.177 0.135 -1.31 1.89e- 1 0.102
#> 10 berlin_zip10365 -0.00754 0.128 -0.0589 9.53e- 1 0.0117
#> # ... with 172 more rows
ggplot(mod_coefficents, aes(x=estimate, y=log10_fdr_p)) +
geom_point(alpha = 0.7)
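And this is roughly how I intended to turn those results into the 0/1 flag described above (a sketch only: the effect-size cut-off of 0.5 and the FDR cut-off of 0.05 are arbitrary placeholders, and stringr::str_remove is just used to strip the "berlin_zip" prefix that tidy() puts on the term names):

library(stringr)

# Placeholder thresholds for "large effect" and "acceptable FDR-adjusted p-value"
effect_cutoff <- 0.5
fdr_cutoff <- 0.05

# Keep only zipcode terms that pass both thresholds
predictive_zips <- mod_coefficents %>%
  filter(term != "(Intercept)") %>%
  mutate(berlin_zip = str_remove(term, "^berlin_zip")) %>%
  filter(abs(estimate) >= effect_cutoff,
         log10_fdr_p >= -log10(fdr_cutoff)) %>%
  pull(berlin_zip)

# Final column: 1 if the record's zipcode was flagged as predictive, 0 otherwise
mydf <- mydf %>%
  mutate(zip_is_predictive = as.integer(berlin_zip %in% predictive_zips))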