
I am doing an analysis based on this paper: http://www.bioline.org.br/pdf?se10001

The author uses Shannon entropy in one part to calculate weights, and his data come without missing values. I recreated this in an alternative version using the weighting portion from the paper, but I could not find anything in the literature stating whether missing values affect the entropy values of the input parameters.

Are there any suggestions for how to handle this?

My data come from stations measuring different water-quality parameters, and at times a station may fail to record a value for one reason or another.

  • Have you considered multiple imputation to deal with the missing data? If the reasons for missing measurements aren't related to what the measurement values would have been, that might both strengthen your study and solve this particular problem. – EdM Jul 11 '16 at 01:47
  • I did use imputation, with the mice package in R. But if there are no measurements for certain days all across the dataset, I can't impute. So even with imputation, I still have some holes in the data. – Tsarevna O. Jul 11 '16 at 14:18
  • When you say that "there are no measurements for certain days all across the dataset," does that mean that _all_ the different types of measurements are missing on those days? Was there anything special about those days that might be related to water quality (for example, large storms that might affect water quality)? Or did the missing days just happen at random? – EdM Jul 11 '16 at 20:00
  • I ran Hawkins's test; the data aren't MCAR, so my only conclusion is MAR. Yes, there is a combination of rows with values and holes where sensors didn't measure, and also days where there is nothing across the entire row. I am not sure why there are gaps like these; I think it is just hiccups from the sensors. The people who took the measurements didn't give any reasons for why the holes are this extensive. – Tsarevna O. Jul 12 '16 at 14:46

1 Answer


I understand that you have already done some multiple imputation, but there are some particular days for which there are no data at all. You might first consider being a bit more aggressive in the imputation, if you think that values from those days are MAR without any relation to the values that would otherwise have been reported. For example, use values from nearby days to help impute, rather than imputing day by day, which I infer from your question was your approach.

If you already have done as much imputation as is reasonable, then you need to consider how you are going to use your results and the implications of the missing data for your use. I know of no particular reference on this, but the principles are pretty clear.

This "entropy" calculation is used to place different weights among indicators of water quality. The values entering the calculation aren't strictly probabilities, but they are non-negative values that sum to 1 so that an entropy-like calculation is possible. The idea is that indicators whose values seldom change much relative to their criteria of being out-of-acceptable ranges have high entropy by this calculation, little information, and thus should be weighted less than other indicators.

The particular values calculated for the entropy of any indicator will of course differ from the "true" value if you have missing data. You have to use your knowledge of the subject matter to determine if the difference is big enough to matter. If the data are MAR, then it seems that you will still end up finding the low-entropy/high-information indicators (those you presumably care the most about) in any event, although the relative weights may differ depending on missing values.

You also could calculate entropy for the complete data that you have, and simulate missing data by repeatedly removing different days at random and recalculating. That should give some idea of how much missing days matter for this calculation.
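A rough sketch of that simulation, using synthetic data in place of your station measurements (the 15% drop rate, the uniform data, and the entropy-weight function are my own assumptions, not from the paper):

```python
import numpy as np

def entropy_weights(X):
    # Rows = days, columns = indicators; assumes strictly positive values.
    n = X.shape[0]
    P = X / X.sum(axis=0)
    e = -(P * np.log(P)).sum(axis=0) / np.log(n)
    d = 1.0 - e
    return d / d.sum()

rng = np.random.default_rng(42)
X = rng.uniform(1.0, 10.0, size=(365, 4))  # a year of 4 hypothetical indicators

w_full = entropy_weights(X)

# Repeatedly drop ~15% of days at random and recompute the weights.
drop_frac, reps = 0.15, 200
deviations = []
for _ in range(reps):
    keep = rng.random(X.shape[0]) > drop_frac
    deviations.append(np.abs(entropy_weights(X[keep]) - w_full).max())

# Typical largest weight shift caused by the missing days.
typical_shift = float(np.mean(deviations))
```

Instead of dropping days completely at random, you could also drop the same days that are actually missing in your data, to mimic your real gap pattern.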

Finally, note that the plug-in estimator used for Shannon entropy is biased. That leads to some question about this use of an entropy-like calculation in this weighting of indicators. Again, the important question is whether these issues are large enough to make a difference for your application.
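To see that bias concretely, here is a small simulation (a made-up four-category distribution and an arbitrary sample size) showing that the plug-in estimate underestimates the true entropy on average:

```python
import numpy as np

rng = np.random.default_rng(1)

# A known four-category distribution and its exact Shannon entropy.
p = np.array([0.4, 0.3, 0.2, 0.1])
true_H = float(-(p * np.log(p)).sum())

# Plug-in estimates from repeated small samples.
n, reps = 20, 2000
estimates = []
for _ in range(reps):
    counts = rng.multinomial(n, p)
    phat = counts[counts > 0] / n   # drop empty categories (0 ln 0 = 0)
    estimates.append(float(-(phat * np.log(phat)).sum()))

# The average plug-in estimate falls below the true entropy:
bias = float(np.mean(estimates)) - true_H   # negative
# The Miller-Madow correction adds (K - 1) / (2 n), with K observed categories.
```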

EdM
  • Thank you for your input. I am not sure how to simulate and recalculate repeatedly; I need a reference or some existing work to get an idea. There is also one more thing from the linked paper: the author summarizes all the values into specific months, so would this also be affected by a biased estimate? The data so far have been giving me the low-entropy/high-information indicators, though years with the most data show more variation than years with more gaps. The data seem to split down the middle with equal bearing for most of the indicators. – Tsarevna O. Jul 13 '16 at 01:03