
Sorry if this question has already been asked - I will be happy to remove it.

I have a numerical vector of clinical data to analyse, and for various reasons this vector contains two types of values: exactly 10 (an exact integer) if the value for a given patient was normal, and continuous values < 9 or > 11 if not (say, 8.7 or 11.9). How should I treat such a vector in a regression approach (I want to use it as a predictor)? I was thinking about adding random noise to make the values more "uniform", but then it is not clear how to do so.

To clarify: the integer value of 10 makes complete sense - it is the "number of events in the population of cells". If all the cells are normal, we get exactly 10, but it may happen that, say, 62% of cells contain 8 events and 21% contain 9 events (the rest are normal and contain 10 events) - these continuous values are just the average across 100% of the cells. Values are truncated at 9 and 11 because of limitations of the measurement method - the values are already truncated, and it is not possible to recover the untruncated "noisy" measurement.

UPD: following advice from the comments, I add: I am interested in building a model of the relationship between such a "strange" predictor and some clinical outcome (which may be a continuous or a dichotomised variable - it is not me who sets the rules...). I am not interested in deep modelling of the phenomenon - an empirical model with trustworthy results (such as p-values distributed uniformly under H0) would be totally fine.
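
To make the structure concrete, here is a toy illustration of how such a value might arise (Python; purely illustrative - the percentages are the ones from the paragraph above, and the cell count of 100 is made up):

    import numpy as np

    rng = np.random.default_rng(42)

    # A normal patient: every one of the (say) 100 cells carries exactly 10
    # events, so the per-patient average is exactly 10.0
    print(np.full(100, 10).mean())  # 10.0

    # An abnormal patient, as in the example above: 62% of cells with 8 events,
    # 21% with 9, and the remaining 17% normal with 10
    events = rng.choice([8, 9, 10], size=100, p=[0.62, 0.21, 0.17])
    print(events.mean())  # a continuous average around 8.55, i.e. < 9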

German Demidov
  • Definitely do not add noise. Why do you want to change your data? What is your goal? – user2974951 Feb 20 '20 at 09:29
  • Are you seeing any issues as a result of the truncation? I would suspect this shouldn't pose much of a problem. You *could* further truncate to "low" (<10), "medium" (=10) and "high" (>10) but this is just throwing away information. – alan ocallaghan Feb 20 '20 at 09:33
  • @user2974951 The goal is to do a correct statistical analysis - I want to use these values as a predictor in a generalized linear model, and I cannot choose the model - the variable is not properly continuous, and it is not properly integer... – German Demidov Feb 20 '20 at 09:39
  • @alanocallaghan Unfortunately, I am doing a frequentist-like analysis with p-values and I want to do multiple testing (around 100 features distributed in a similar way) - I am afraid that even a small deviation from uniformity under the null hypothesis could be crucial... – German Demidov Feb 20 '20 at 09:40
  • @GermanDemidov In that case build your model and check the usual linear model diagnostics. If it looks good, you probably don't need to modify anything. – user2974951 Feb 20 '20 at 09:40
  • @user2974951 I may notice nothing by eye, but a deviation from the nominal error rate of 0.05 (I use p-values; they are standard in my field) may be quite crucial. I was just thinking that such a problem might have a clear solution, like zero-inflated models (the situation is different, I agree - there the response is zero-inflated, not a predictor as in my case), but maybe there are no clear solutions... – German Demidov Feb 20 '20 at 09:43
  • As suggested, I think you need to just run the models and see if the truncation causes issues. – alan ocallaghan Feb 20 '20 at 09:58
  • A couple of things aren't entirely clear. How are the measurements 'truncated'? - would a true value less than 10 but greater than 9, 9.4 say, be measured as 9, leading to an accumulation of measurements at 9 as well as at 10 (& at 11, by the same process). Second: do you *have* a model for the true values, for which you want to estimate, & conduct inference about, the coefficients, or are you simply asking how these measurements might be represented in a predictive model? – Scortchi - Reinstate Monica Feb 20 '20 at 10:11
  • @Scortchi-ReinstateMonica Sure, the accumulation around 9 is expected; the rounding is probabilistic - the measurement tool has its own noise, and 9.4 may be classified as 9, which is more likely than its being classified as 10. Honestly, I just wanted to put these values into a regression model without modelling them - I am interested in modelling the relationship between the response and the predictors rather than modelling the predictor itself. However, it is clear that having such "artifacts" may break the validity of my model no matter how good the link function is. – German Demidov Feb 20 '20 at 10:20
  • So an empirical model that tries to best predict the response from the measurements you can make, rather than a theoretical model of the relation between the response & the unobserved true values of the predictor from which the measurements arise? (Could you please edit the question?) – Scortchi - Reinstate Monica Feb 20 '20 at 10:50
  • Thanks! Adding indicator variables for "special" values might be useful. Watch out for heteroscedasticity if relevant for the response, & take care with regularisation if you're regularising. – Scortchi - Reinstate Monica Feb 20 '20 at 11:40
  • @Scortchi-ReinstateMonica I really like the idea of adding an indicator variable! Thanks a lot! – German Demidov Feb 20 '20 at 12:14
  • You're welcome. I haven't time to write an answer, but https://stats.stackexchange.com/a/105258/17230, https://stats.stackexchange.com/a/184371/17230, https://stats.stackexchange.com/a/135890/17230, & https://stats.stackexchange.com/a/6565/17230 explain the approach (along with various other answers around the site). If you want to perform model comparison with e.g. a likelihood ratio test then @alanocallaghan's would be the small model & this the big (see the sketch below the comments). – Scortchi - Reinstate Monica Feb 20 '20 at 12:31
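
Since no answer was posted, here is a minimal sketch of the approach from the comments: keep the continuous measurement and add an indicator for the "special" exact value of 10, then compare against the low/=10/high trichotomy with a likelihood-ratio test. Python with statsmodels is my choice here, and the toy data generation and all variable names are assumptions, not anything from the thread:

    import numpy as np
    import scipy.stats as st
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200

    # Toy data: latent per-patient averages, with mid-range values snapped to
    # 9/10/11 to mimic the truncation described in the question
    t = rng.normal(10.0, 1.2, n)
    x = np.where((t > 9.0) & (t < 11.0), np.round(t), t)
    x[rng.random(n) < 0.4] = 10.0  # point mass: patients whose cells are all normal
    y = 0.4 * x + 0.5 * (x == 10.0) + rng.normal(size=n)  # a continuous outcome

    eq10 = (x == 10.0).astype(float)  # indicator for the special exact value
    gt10 = (x > 10.0).astype(float)

    # Small model: the low / =10 / high trichotomy from alan ocallaghan's comment
    X_small = sm.add_constant(np.column_stack([eq10, gt10]))
    # Big model: additionally keeps the continuous measurement, per the
    # indicator-variable suggestion in the later comments
    X_big = sm.add_constant(np.column_stack([x, eq10, gt10]))

    small = sm.OLS(y, X_small).fit()
    big = sm.OLS(y, X_big).fit()

    # Likelihood-ratio test between the nested models (one extra parameter)
    lr = 2.0 * (big.llf - small.llf)
    print("LR p-value:", st.chi2.sf(lr, df=1))

For a dichotomised outcome, sm.GLM(y, X, family=sm.families.Binomial()) can replace sm.OLS and the same likelihood-ratio statistic can be computed from the fitted log-likelihoods; checking the calibration of the resulting p-values on permuted (null) data would address the uniformity-under-H0 concern raised above.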

0 Answers