I have a GLM that I fit to a set of data. The dataset is split by one of the variables into subsets, and each subset is then used to train a model with the `glm` function. After `glm` has run, I export the result to PMML.

The problem is that sometimes some of the categorical variables in the formula take only a single value in a subset, and hence the model cannot run (it crashes). The reason I do not want to just kick those variables out of the formula is that they contain valuable information about certain criteria for the model.

E.g. let's say that the model predicts the time it takes to perform a task based on many different variables, and that one of those variables is a categorical variable called country. In one of the chunks that the dataset is split into, the country variable only takes the value "USA". If I kicked country out of the formula here, the information that this model only works for data with country="USA" would disappear from the PMML export. This could potentially lead to mistakes when applying the model afterwards (say the incoming data suddenly has country="France", but the model was only trained with data from "USA"). Is there a way to force the glm model in R to handle this, so that it will still have a coefficient for the country variable, set to maybe 1 or 0 for the value "USA"?

PS. I understand that the factor would not really contribute to the fit, since there is no variance to measure. However, I want the PMML to include the information that the model is only applicable for country="USA".
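To make the failure concrete, here is a minimal sketch on made-up data. The names `dat`, `duration` and `effort` are hypothetical and only serve the illustration. It shows `glm` failing on a one-level factor, followed by one possible workaround: drop the constant predictors from the formula and keep their single value in a separate list. How to carry that list into the PMML export is left open here.

```r
# Toy subset in which the factor `country` has a single level (made-up data).
set.seed(1)
dat <- data.frame(
  duration = rnorm(20, mean = 10),
  effort   = runif(20),
  country  = factor(rep("USA", 20))
)

# Fitting with the constant factor fails while the model matrix is built;
# tryCatch turns the error into its message so the script keeps running.
fit_try <- tryCatch(
  glm(duration ~ effort + country, data = dat),
  error = function(e) conditionMessage(e)
)
print(fit_try)

# Workaround sketch: find constant predictors, remember their value,
# and refit without them.
predictors <- c("effort", "country")
is_const   <- vapply(dat[predictors],
                     function(x) length(unique(x)) < 2, logical(1))
constants  <- lapply(dat[predictors[is_const]],
                     function(x) as.character(unique(x)))

# Assumes at least one predictor is not constant.
fit <- glm(reformulate(predictors[!is_const], response = "duration"),
           data = dat)
```

After this, `constants` holds `country = "USA"` next to a fit that only uses `effort`, so the applicability condition is not lost even though it is not a coefficient.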

edit

Okay, so my dataset contains 70k observations. My glm formula uses 12 variables plus some interactions between them to predict a 13th, numeric variable. 3 of the 12 input variables are numeric and 9 are factors. When the dataset is split by a 14th variable, it turns into around 500 subsets containing from 40 to 1000 observations each. 3 of the factor variables are actually booleans (1/0), and those are the ones that often have zero variance. When applying the trained predictive model afterwards to new incoming single cases, it's crucial to know whether those variables were purely 0 or purely 1 in the training data, in order to determine if the model can be applied to the new incoming observations.
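A rough sketch of that workflow, on made-up data with hypothetical names (`task_type`, `effort`, `flag`, `fit_subset`, `is_applicable`): split by a grouping variable, fit each subset without its constant predictors, and store the constant values next to each fit so that a new observation can be checked before prediction. This is only one way to keep the information and not a built-in `glm` feature.

```r
set.seed(42)
n <- 200
dat <- data.frame(
  task_type = sample(c("plowing", "milking"), n, replace = TRUE),
  effort    = runif(n),
  flag      = factor(sample(c(0, 1), n, replace = TRUE), levels = c(0, 1))
)
dat$flag[dat$task_type == "milking"] <- "1"  # make flag constant in one subset
dat$duration <- 5 + 2 * dat$effort + (dat$flag == "1") + rnorm(n, sd = 0.5)

predictors <- c("effort", "flag")

# Fit one subset: drop constant predictors, but remember their single value.
fit_subset <- function(d) {
  is_const <- vapply(d[predictors],
                     function(x) length(unique(x)) < 2, logical(1))
  list(
    fit       = glm(reformulate(predictors[!is_const], response = "duration"),
                    data = d),
    constants = lapply(d[predictors[is_const]],
                       function(x) as.character(x[1]))
  )
}

models <- lapply(split(dat, dat$task_type), fit_subset)

# TRUE if every predictor that was constant during training has the same
# value in `newrow` (trivially TRUE when nothing was constant).
is_applicable <- function(model, newrow) {
  all(vapply(names(model$constants),
             function(v) as.character(newrow[[v]]) == model$constants[[v]],
             logical(1)))
}

is_applicable(models$milking, data.frame(effort = 0.5, flag = "0"))
is_applicable(models$milking, data.frame(effort = 0.5, flag = "1"))
```

The first call returns `FALSE` and the second `TRUE`, because the milking subset was trained only on `flag = 1`.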

NK1
  • If variables like "country", which sometimes lack variation, have effects that are somewhat stable from dataset to dataset, you can set some default values and then use an offset. Or you can use Bayesian methods ... – kjetil b halvorsen Sep 08 '15 at 07:12
  • Hmm, I'm not sure I understand. Let's say I split the dataset by `task_type` (plowing, sowing, milking etc.). Then in some of the subsets, e.g. for milking, the dataset will only contain data points with `country = "USA"`, whereas the dataset with `task_type = 'plowing'` might contain data for all of "USA", "France" and "Norway". Could you try to explain to me a bit more about how I could handle this? (I'm not an expert.) – NK1 Sep 08 '15 at 07:19
  • You need to give some more information. It seems you have many datasets. Do the data describe similar situations? Do they arrive continually over time, and if so, how? Do you have reason to think the parameters (or at least some of them) have more or less similar values across datasets? If so, you could use multi-level modelling to "borrow strength" across datasets. Then, for the cases with variables that are constant in one dataset, you get information from the other datasets. cont... – kjetil b halvorsen Sep 08 '15 at 07:24
  • .... Or, if datasets arrive sequentially over time, you could use the results from one multilevel analysis of the existing datasets to build a prior for future datasets. – kjetil b halvorsen Sep 08 '15 at 07:25
  • Hi again, I just found this article: https://tgmstat.wordpress.com/2014/03/06/near-zero-variance-predictors/ I think it describes exactly my problem. However, I do not understand the suggestions on how to tackle the problem in the "Try not to throw your data away" section. Is this related to your proposal to use Bayes? – NK1 Sep 08 '15 at 08:03
  • Yes, their solution seems to be "use Bayes", but they leave it at that and say nothing about "how". So, to advance, tell us more about your real situation: how many datasets, how do they arrive, how closely associated are they, what questions do you want to ask of your data, how many variables, how many observations, .... as an edit to the original post! – kjetil b halvorsen Sep 08 '15 at 08:09
  • Edited it now :-) – NK1 Sep 08 '15 at 08:40
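
The offset idea from the comments could look roughly like the sketch below. It assumes, as the comment does, that the effect of the constant variable is reasonably stable across datasets, so that a default value estimated elsewhere can be plugged in. All data and names (`pooled`, `sub`, `flag_offset`, `default_flag_effect`) are made up for illustration.

```r
set.seed(7)

# Pooled data in which `flag` varies. It is used only to obtain a default
# effect (assumption: the effect is roughly the same in every subset).
pooled <- data.frame(effort = runif(300), flag = rbinom(300, 1, 0.5))
pooled$duration <- 5 + 2 * pooled$effort + 1.5 * pooled$flag +
  rnorm(300, sd = 0.5)
default_flag_effect <- unname(
  coef(glm(duration ~ effort + flag, data = pooled))["flag"]
)

# A subset in which flag is always 1, so its effect cannot be estimated here.
sub <- data.frame(effort = runif(50), flag = 1)
sub$duration <- 5 + 2 * sub$effort + 1.5 * sub$flag + rnorm(50, sd = 0.5)

# The flag effect enters with a fixed coefficient through an offset.
sub$flag_offset <- default_flag_effect * sub$flag
fit <- glm(duration ~ effort + offset(flag_offset), data = sub)

# Prediction for a new case whose flag value was never seen in this subset.
new_obs <- data.frame(effort = 0.5, flag = 0)
new_obs$flag_offset <- default_flag_effect * new_obs$flag
predict(fit, newdata = new_obs)
```

With the offset in place, the intercept refers to the `flag = 0` baseline, so the fit can be applied to either flag value. The price is that the prediction for the unseen value rests entirely on the borrowed default effect.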
