I have a GLM that I fit on a set of data. The dataset is split by one of the variables into subsets, and each subset is then used to fit a separate model with the glm function. After glm has run, I export the result to PMML.
The problem is that sometimes some of the categorical variables in the formula have only a single value in a subset, and hence the model cannot run / crashes. The reason I do not want to just kick those variables out of the formula is that they contain valuable information about certain criteria for the model.
E.g. let's say that the model predicts the time it takes to perform a task based on many different variables, and that one of those variables is a categorical variable called country. In one of the chunks that the dataset is split into, the country variable only takes the value "USA". If I kicked country out of the formula here, the information that this model only works for data with country="USA" would disappear from the PMML export. This could potentially lead to mistakes when applying the model afterwards (say new data suddenly has country="France", but the model was only trained with data from "USA"). Is there a way to force the glm model in R to handle this, so that it will still have a coefficient for the country variable, perhaps set to 1 or 0 for the value "USA"?
PS. I understand that the factor would not really be counted in, since there is no variance to measure; however, I want the PMML to include information about the model only being applicable for country="USA".
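To make the problem concrete, here is a minimal sketch that reproduces it. The data and the column names time and effort are made up for illustration; only country comes from my description above. With a factor that has a single level, R typically stops with an error saying that contrasts can be applied only to factors with 2 or more levels.

```r
set.seed(1)

# Made-up toy subset: country has only one level, as in the problematic chunks
d <- data.frame(
  time    = rnorm(20, mean = 10),
  effort  = runif(20),
  country = factor(rep("USA", 20))
)

# Catch the error so the script keeps running and the message can be inspected
res <- tryCatch(
  glm(time ~ effort + country, data = d),
  error = function(e) conditionMessage(e)
)

print(res)
```

If res comes back as a character string rather than a fitted glm object, the fit failed for this subset.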
Edit:
Okay, so my dataset contains 70k observations. My glm formula uses 12 variables plus some interactions between them to predict a 13th, numeric variable. 3 of the 12 input variables are numeric and 9 are factors. When the dataset is split by a 14th variable, it turns into around 500 subsets of 40 to 1000 observations each. 3 of the factor variables are actually booleans (1/0), and those are the ones that often have zero variance. When applying the trained model afterwards to new incoming single cases, it's crucial to know whether those variables were purely 0 or purely 1 in the training data, in order to determine whether the model can be applied to the new observations.
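To illustrate what information I need to keep per subset, here is a rough sketch with made-up data. The helper constant_columns and the columns time, flag and group are hypothetical names, not my real ones. It only detects which boolean columns are constant within each subset and records the single value they take; it does not solve the question of getting that information into the PMML.

```r
# Hypothetical helper: which of the given columns take a single value in df?
constant_columns <- function(df, vars) {
  is_constant <- vapply(vars, function(v) length(unique(df[[v]])) < 2, logical(1))
  vars[is_constant]
}

# Made-up data: flag is always 1 in group "a" and mixed in group "b"
d <- data.frame(
  time  = seq(5, 15, length.out = 60),
  flag  = c(rep(1, 30), rep(0:1, 15)),
  group = rep(c("a", "b"), each = 30)
)

# Split by the grouping variable, as I do with my 14th variable
subsets <- split(d, d$group)

# For each subset, record the constant columns and the single value they take
constant_info <- lapply(subsets, function(s) {
  cols <- constant_columns(s, "flag")
  lapply(s[cols], unique)
})

str(constant_info)
```

Here constant_info has one entry per subset, which is empty when no column is constant. This is the kind of per-model restriction I would like to see reflected in the exported PMML.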