4

I pointed out a problem with averaging values over time in the comments here: https://www.researchgate.net/publication/344137839_SARS-CoV-2_binds_platelet_ACE2_to_enhance_thrombosis_in_COVID-19/comments

How would a statistician describe the problem with the figure in this publication? (I figure if I knew the lingo for describing the problem it would be easier to find the solution. Also, I find it interesting to be in the situation that I had to resort to describing the problem the way I did there.)

I get the sense that linear regression might be the right tool but I’m too ignorant / inexperienced to know.

To make the question more readable, here is a copy of the figure:

[Figure: an average anomaly]

and here is a copy of my comment: In figure 1G, the average is misleading, because it goes up when participants leave the study, which they generally do with low values. So if we look specifically from the 25th to the 26th, we notice that none of the values jumped, yet the average jumps up. One did increase slightly, but the reason for the jump is that the dark purple case has dropped out. Similarly, if we look specifically from the 27th to the 28th, we notice that the only remaining value drops, yet the average goes up. Again, this is because other cases that had low values the previous day have dropped out. This is just a minor nit and it doesn’t change the conclusions, but a better way to show the trend than the average should be used.

  • Could you please edit your question to include the pertinent part of your comment at ResearchGate, so it is self-contained? I originally thought your question was about the *presentation* of the data, not about the selection bias effect. – Stephan Kolassa Apr 12 '21 at 07:38
  • Done (by you.) Thanks. Adjusting for attrition bias is challenging. Do you think my Apr 18 idea is good? Wondering if I should hand-edit the figure accordingly and write up an answer. – WHO's NoToOldRx4CovidIsMurder Jul 07 '21 at 22:18

2 Answers

2

First, a couple of comments on the figure itself.

One problem is the broken vertical axis. This exaggerates the variability among the 20 bottom cases and artificially creates two groups that are treated differently. It would probably be better to just use a logarithmic vertical axis.

The use of different colors to distinguish different cases is dubious. It could in principle be used to identify cases, but since the colors are not picked up again in other subplots, this identification is not necessary. So the only rationale for the colors would be so we can identify specific trajectories over time. But honestly, there is very little to see here, so the colors only add visual clutter. It would likely be better to just use a light gray so the average stands out.


Now to the statistical problem.

You correctly identify in your comment that the average goes up when people leave the study, presumably with low counts. This is a kind of selection bias: if we only retain people in our observations who are sick (and therefore have a higher platelet count) and drop individuals who get healthy (and therefore have a lower count), then our estimate of average platelet count will be biased high through the selection effect. Specifically, we have a case of attrition bias here.

Unless we could follow people up after leaving, there is probably no really good way around this. It might be helpful to indicate the number of data points at each point in time, e.g. by adding this as another time series with a secondary vertical axis.
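
To make the effect concrete, here is a minimal Python sketch on made-up toy data (the participant labels and values are hypothetical, not taken from the paper). It computes the daily average over whoever is still observed, together with the number of observations per day, which is the quantity suggested above as an extra series.

```python
from statistics import mean

# Hypothetical toy data: one list per participant, one value per day.
# None marks days after the participant has left the study.
trajectories = {
    "A": [10, 10, 10, 10],
    "B": [4, 4, None, None],
    "C": [4, 4, 4, None],
}

def daily_summary(trajectories):
    """Return (day, n_observed, mean_of_observed) for each day with data."""
    n_days = max(len(values) for values in trajectories.values())
    summary = []
    for day in range(n_days):
        observed = [
            values[day]
            for values in trajectories.values()
            if day < len(values) and values[day] is not None
        ]
        if observed:
            summary.append((day, len(observed), mean(observed)))
    return summary

for day, n, avg in daily_summary(trajectories):
    print(f"day {day}: n={n}, mean={avg}")
```

In this toy example no individual value ever changes, yet the daily mean rises from 6 to 7 to 10 as the two low-valued participants drop out. Showing n next to the mean at least lets a reader see that the later averages rest on very few people.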

Stephan Kolassa
  • Thanks. (I also thought the broken axis was odd. Without the colors I don’t think I would’ve been able to identify the issue I identified, but other than that they don’t seem useful.) I’m still thinking there must be ways to make a trend line that shows the trend better than the daily average shown does. – WHO's NoToOldRx4CovidIsMurder Apr 12 '21 at 07:48
  • It's hard to account for selection bias. If we have some additional information about the likely distribution of platelet counts after individuals dropped out, we could randomly sample and calculate averages. (And blur the average line to indicate the increasing uncertainty.) – Stephan Kolassa Apr 12 '21 at 07:53
  • I just had an idea. If we created a line made up of connected segments, where each segment is a line whose slope is the average slope of all the other lines that day, I feel it would better represent the trend. But random novel/wacky ideas aren’t a reasonable way to communicate information in medical literature... – WHO's NoToOldRx4CovidIsMurder Apr 12 '21 at 08:00
0

An average line has the advantage that every reader understands an average. The problem is that every reader will expect the line to be drawn for the same group of people throughout. That is not possible if you want to draw the line from x = 0 to x = 35. However, if you drew the average line only from x = 0 to x = 15, and only for those people who are not censored within [0, 15], you would probably get a good representation of what happens to the majority of people. For x > 15 we do not really need an average: we can see what happens to these 7 or so people without data aggregation.
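
As a rough illustration of this fixed-cohort average, here is a small Python sketch. The data layout (one list per participant, `None` after dropout), the function name `complete_cohort_mean` and the toy values are all assumptions made for the example; the horizon of 15 days would be passed in as `horizon=15` on real data.

```python
from statistics import mean

def complete_cohort_mean(trajectories, horizon):
    """Daily mean for days 0..horizon, using only participants
    who have an observed value on every one of those days."""
    cohort = [
        values
        for values in trajectories.values()
        if len(values) > horizon
        and all(x is not None for x in values[: horizon + 1])
    ]
    if not cohort:
        return []
    return [mean([values[day] for values in cohort])
            for day in range(horizon + 1)]

# Hypothetical toy data, not taken from the paper.
toy = {
    "A": [10, 9, 8, 7],
    "B": [4, 3, None, None],  # leaves after day 1
    "C": [6, 5, 4, None],     # leaves after day 2
}

print(complete_cohort_mean(toy, horizon=2))
```

With `horizon=2`, participant B is excluded from every day, including the days on which B was observed, so the line always describes the same group (here A and C). The price is that early data from people who leave before the horizon is discarded.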

Bernhard
  • There are dropouts on days 3, 5, 9, 10…. Another scheme could be to have UNconnected segments where, again, each segment is a line whose slope is the average slope of all the other lines that day, but starting from the average at the start of each day. – WHO's NoToOldRx4CovidIsMurder Apr 18 '21 at 06:13
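
The two "average slope" schemes from the comments can be sketched in a few lines of Python. This is only an illustration on hypothetical toy data; the function names and the data layout (`None` after dropout) are assumptions. The slope for each day is the mean day-to-day change among participants observed on both that day and the next.

```python
from statistics import mean

def mean_daily_changes(trajectories):
    """Mean change from day d to d+1 among participants observed on both days."""
    n_days = max(len(v) for v in trajectories.values())
    changes = []
    for day in range(n_days - 1):
        diffs = [
            v[day + 1] - v[day]
            for v in trajectories.values()
            if day + 1 < len(v) and v[day] is not None and v[day + 1] is not None
        ]
        changes.append(mean(diffs) if diffs else None)
    return changes

def observed_mean(trajectories, day):
    """Plain average over everyone observed on the given day."""
    return mean([v[day] for v in trajectories.values()
                 if day < len(v) and v[day] is not None])

def connected_trend(trajectories):
    """Start at the day-0 mean and accumulate the mean daily changes."""
    line = [observed_mean(trajectories, 0)]
    for change in mean_daily_changes(trajectories):
        if change is None:
            break
        line.append(line[-1] + change)
    return line

def unconnected_segments(trajectories):
    """One (start, end) pair per day; each segment restarts at that day's mean."""
    segments = []
    for day, change in enumerate(mean_daily_changes(trajectories)):
        if change is not None:
            start = observed_mean(trajectories, day)
            segments.append((start, start + change))
    return segments

# Hypothetical toy data: nobody's value changes, two low cases drop out.
toy = {
    "A": [10, 10, 10, 10],
    "B": [4, 4, None, None],
    "C": [4, 4, 4, None],
}

print(connected_trend(toy))
print(unconnected_segments(toy))
```

On this toy data the plain daily average climbs from 6 to 10, whereas the connected trend stays flat at 6, because every within-person change is zero. The unconnected segments are each flat as well, but they restart at the higher daily averages. Both schemes still implicitly assume that people who drop out would have changed like those who stayed, so they reduce the jump artifact without removing attrition bias.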