How to round mean values when imputing missing questionnaire items

Question

When a questionnaire has missing items (answers), it is common to impute them with the mean of the valid items. For example, here are three items (questions), each of which can take one of four values, from 1 to 4 inclusive.

[2, 3, missing] would result in [2, 3, 2.5]
[4, 3, missing] would result in [4, 3, 3.5]

Because a fractional value is not a valid answer, rounding may be needed to give the item one of the allowed values.

Is scientific rounding the correct choice here? Please correct me if I am wrong, but possible synonyms are symmetric rounding, mathematical rounding, round half down, or banker's rounding.

In that case, and if I am interpreting the rounding rule correctly:

[2, 3, missing] would result in [2, 3, 2]
[4, 3, missing] would result in [4, 3, 3]

This question does not take technical problems (e.g. representation of floating point numbers) into account. This is about the best theoretical choice and not how to implement it.

More thoughts

Assume a questionnaire where I have 3 to 5 items with possible values between 1 and 4. Calculating mean() will often result in a *.5 number. Scientific rounding will round all of them down. I would hypothesize that more than half of the roundings will go down, which causes an imbalance.

Because of that I would argue that rounding to the nearest even value would result in a better balance between up and down rounding.
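
To make the comparison concrete, here is a minimal sketch that applies three tie-breaking rules to the two examples from the question. The function names are illustrative and not taken from any library, and "round half down" is assumed to be the rule the question calls scientific rounding. Exact fractions are used so that floating-point representation plays no role.

```python
from fractions import Fraction
import math


def round_half_down(x: Fraction) -> int:
    # Ties (x.5) go to the lower integer.
    return math.ceil(x - Fraction(1, 2))


def round_half_up(x: Fraction) -> int:
    # Ties (x.5) go to the higher integer.
    return math.floor(x + Fraction(1, 2))


def round_half_even(x: Fraction) -> int:
    # Python's round() on a Fraction sends ties to the nearest even integer.
    return round(x)


def impute(valid, rule):
    # Mean of the valid items, rounded with the given rule.
    return rule(Fraction(sum(valid), len(valid)))


rules = (round_half_down, round_half_up, round_half_even)
for valid in ([2, 3], [4, 3]):
    print(valid, [impute(valid, rule) for rule in rules])
# [2, 3] [2, 3, 2]
# [4, 3] [3, 4, 4]
```

With round half down both ties go down, with round half up both go up, and with round half to even one goes down (2.5 to 2) and one goes up (3.5 to 4).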

What do more experienced researchers think about that?

EDIT: This question is not about the right imputation method. The questionnaire author said I have to use this method and no other. The question is about the rounding method only.

I need to see your basis for supporting this type of imputation. Some sort of comparability across items seems to be needed. Multiple imputation may also be needed, not using a simple rule. — Frank Harrell, Jan 14 '22 at 13:23
The manual of the questionnaire says that this "simple rule" can be used: if 2/3 of the items of a subscale have valid values, the missing ones can be imputed by the mean of the valid values. The manual gives no further information about that; it is in German because this is the German, further developed version of the King's Health Questionnaire (KHQ). The original KHQ manual (1995) says nothing about imputation. https://doi.org/10.1055/s-2005-872957 — buhtz, Jan 14 '22 at 13:31
Need more to go on before judging the validity of the method. — Frank Harrell, Jan 14 '22 at 18:18
I do not understand what you need. I am not a statistician. — buhtz, Jan 14 '22 at 18:30
What are the particular questions? Are they interchangeable? What are the specific reasons to trust such a simple imputation? — Frank Harrell, Jan 14 '22 at 19:09

score 2 · Answer 1 · answered Jan 14 '22 at 14:19

Here is a balanced algorithm: On the $i^{th}$ questionnaire, if there are $k_i$ valid responses which sum to $s_i$, then:

round $s_i/k_i$ up if $(i\! \mod k_i) < (s_i\! \mod k_i)$
round $s_i/k_i$ down if $(i\! \mod k_i) \ge (s_i\! \mod k_i)$

In the long run of this algorithm, averages like 2.5 get rounded up half the time and down half the time, while averages like 2.7 get rounded up 70% of the time and rounded down 30% of the time.
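
The rule above can be sketched as follows. This is only an illustration under the stated definitions: the function name `balanced_round` and the example responses are made up, and the questionnaire index `i` is assumed to be a running integer counter.

```python
def balanced_round(i: int, valid: list[int]) -> int:
    """Round the mean of the valid responses on questionnaire i.

    k = number of valid responses, s = their sum.
    Round up when (i mod k) < (s mod k), otherwise round down.
    """
    k = len(valid)
    s = sum(valid)
    floor_mean, remainder = divmod(s, k)  # remainder is s mod k
    return floor_mean + 1 if (i % k) < remainder else floor_mean


# A mean of 2.5 (k = 2, s = 5) over 100 questionnaires:
ups_half = sum(balanced_round(i, [2, 3]) == 3 for i in range(100))
print(ups_half)  # 50

# A mean of 2.7 (k = 10, s = 27) over 100 questionnaires:
responses = [3, 3, 3, 3, 3, 3, 3, 2, 2, 2]
ups_seventy = sum(balanced_round(i, responses) == 3 for i in range(100))
print(ups_seventy)  # 70
```

When the mean is already an integer, `s mod k` is 0, so the condition is never true and the value is returned unchanged.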

ecnmetrician · Answer 2 · 2022-01-14T18:34:29.820

Imputing the average of the other questionnaire items has some shortcomings.

Example: Suppose that you have 21 items that take binary values, 10 of them have a value of one, 10 of them have a value of zero, and one is missing.

If you impute the average you get a value of 0.5. But this is not a feasible data point.
The issue is that there is uncertainty. In addition to the mean you also need to account for the variability of the imputed value.

One way to address this is to use multiple imputation. The idea behind this method is to create multiple draws around the average of the items, using a model that predicts the missing values from the responses to the other items. This is attractive computationally because it is an approach designed to fit the data at hand. Many statistical packages allow you to run standard analyses like regression using multiple imputation (e.g. look at the "mi" commands in Stata).

This approach circumvents the need to choose an arbitrary "rounding" rule, because you directly account for the variability in the observed data points.
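
As a toy illustration of the idea (not a real multiple-imputation procedure), the sketch below replaces the prediction model with a simple assumption: each imputed value is resampled from the respondent's own valid items. The function name `draw_imputations`, the number of draws and the seed are all hypothetical choices. A proper analysis would use a model-based routine from a statistical package and pool the results with the appropriate rules.

```python
import random
from statistics import mean


def draw_imputations(valid, m, seed=0):
    """Draw m candidate values for one missing item.

    Assumption: the missing item is resampled from the valid items,
    standing in for a real prediction model.
    """
    rng = random.Random(seed)
    return [rng.choice(valid) for _ in range(m)]


# The binary example: 10 ones, 10 zeros, one item missing.
valid = [1] * 10 + [0] * 10
draws = draw_imputations(valid, m=5)

# Each completed data set gets its own total score ...
totals = [sum(valid) + d for d in draws]
# ... and the analysis result is averaged over the completed data sets.
pooled_total = mean(totals)
print(draws, totals, pooled_total)
```

Every draw is a feasible value (0 or 1), so no rounding rule is needed, and the spread of `totals` reflects the uncertainty about the missing item.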

score 1 · Answer 3 · answered Jan 14 '22 at 21:23

Why can't you use fractions in the imputation? The answer to that determines which alternative you should choose.

For instance, it might be that the variable is categorical and you can't use numbers in the regression model. In that case, you could impute the categories (you could impute each category once, with a weight that depends on how frequently that category occurs).
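
A minimal sketch of such a weighting, assuming the frequencies are taken from the respondent's own valid items (they could equally be taken across respondents); the function name `weighted_categories` is made up for illustration:

```python
from collections import Counter
from fractions import Fraction


def weighted_categories(observed):
    """Return each observed category with a weight equal to its relative frequency."""
    counts = Counter(observed)
    n = len(observed)
    return {category: Fraction(count, n) for category, count in counts.items()}


weights = weighted_categories([2, 3, 3])
print(weights)  # {2: Fraction(1, 3), 3: Fraction(2, 3)}
```

The missing item would then enter the analysis once per category, each copy carrying its weight, and the weights sum to 1.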

score 1 · Accepted Answer · answered Jan 14 '22 at 22:20

It's commonly done to impute based on the mean of the other items, but that's more because it's simple, easy to do and has some vague logic to it ("Well, if they are all high or low, maybe the missing one would be, too, right?"). However, that does not mean it is a good (never mind the best) option.

Let's imagine you have this data for one person:

Week 0, Question 1: Score 9 out of 10
Week 0, Question 2: Score 8 out of 10
Week 0, Question 3: Score 2 out of 10
Week 4, Question 1: Score 8 out of 10
Week 4, Question 2: Score 9 out of 10
Week 4, Question 3: Score 1 out of 10
Week 8, Question 1: Score 10 out of 10
Week 8, Question 2: Score 9 out of 10
Week 8, Question 3: missing

Would your first thought be to impute question 3 at week 8 as 9.5?

A more obvious approach is, for example, multiple imputation based on a joint latent normal model for ordinal data, which is much better able to reflect how different questions tend to be correlated. If the same person has filled in the questionnaire at multiple assessments, one can even try to capture within-person correlations. This kind of approach is then able to come up with a more sensible answer to our example above.
