61

Below is a daily chart of newly-detected COVID infections in Krasnodar Krai, a region of Russia, from April 29 to May 19. The population of the region is 5.5 million people.

I read about it and wondered: does this relatively smooth dynamic of new cases look okay from a statistical standpoint, or does it look suspicious? Can a curve be so level during an epidemic without any tinkering with the data by the region's authorities? In my home region, Sverdlovsk Oblast, for example, the chart is much more chaotic.

I'm an amateur in statistics, so maybe I'm wrong and this chart is nothing out of the ordinary.

[Chart: daily confirmed COVID-19 cases in Krasnodar Krai, April 29 to May 19]

According to a news report from 18 May 2020, a total of 136,695 tests for COVID-19 had been performed in the region from the start of the epidemic up to that day.

As of 21 May 2020, a total of 2974 infections had been recorded in the region.

P.S. Here's a link I found to a page with better-looking statistics covering a longer period, specifically for Krasnodar Krai. On that page you can hover your cursor over the chart to get the specific numbers for each day. (The title uses the term "daily elicited" number of cases, and the bar caption the "daily confirmed" number of cases):

[Chart: daily confirmed COVID-19 cases in Krasnodar Krai over a longer period, with per-day values on hover]

CopperKettle
    @Tim, I asked CopperKettle to post this here. Even if I hadn't, I think there are meaningful statistical issues that can be discussed here, not just opinions. – gung - Reinstate Monica May 21 '20 at 12:42
  • @ttnphns - by "suspicious" I mean "data tampered or forged on purpose to produce an abnormally level curve". – CopperKettle May 21 '20 at 13:16
  • @ttnphns, an "amateur in statistics" may not be able to clearly state what they think looks weird in technical terms. When *I* look at it, the data certainly look underdispersed to me. – gung - Reinstate Monica May 21 '20 at 13:17
  • 1
    @CopperKettle, Your listed data sum to 1903, if there have been a total of 2974, then there were 1071 prior to April 29. Is that right? – gung - Reinstate Monica May 21 '20 at 13:23
  • 1
    @ttnphns, it's fine to create a new tag (ie, `[manipulation-detection]`), but please create at least an excerpt for it. – gung - Reinstate Monica May 21 '20 at 13:56
  • 1
    @gung-ReinstateMonica, I did not _create_ that tag. It existed on the site. – ttnphns May 21 '20 at 13:58
  • 8
    The fuller red graph is telltale. However, just one note: the bars show the "number of confirmed cases" per day. Well, "confirmed" is not quite the same as "occurred" or even "elicited"; it is a more mediated event than those. One possible mediation is some sort of unfair manipulation. But other variants are also possible, for example factors concerning the availability and scheduling of virus diagnostic procedures. These factors could as well have changed between April and May in the region. As "confirmed" is less immediate than the (approximately Poissonian) "emerged", it could affect the curve. – ttnphns May 21 '20 at 14:58
  • 1
    @SextusEmpiricus, that can be the case. However, there also can be the jam-release effect of the testing "traffic" or even of applications for the testing (sick people who were on lockdown in April massively applied in clinics from the start of May), etc. – ttnphns May 21 '20 at 15:14
  • 10
    Maybe they can only perform 100 tests a day? (This is somewhat in jest, as the proportion of confirmed cases would be too high. However, certain regions do have testing capacity constraints. That was the case even here in the San Francisco area.) – steveo'america May 21 '20 at 16:49
  • 3
    @steveo'america probably it will be more than 100 tests per day, or otherwise nearly all the tested people would have the virus, which you do not see elsewhere. Say, it could be 300 per day, and 1/3 of them are positive. In that case the mean of positive tests per day is 100 and the variance is 66.6 (and standard deviation about 8). That is one way how you can have the underdispersion but it is still not much different from the standard deviation of 10 for a Poisson distribution. Of course there can be more effects that cause underdispersion (e.g. the 'source' of patients is heterogeneous). – Sextus Empiricus May 21 '20 at 17:07
  • Russians must have a "plan", maybe it's 100 new cases daily, so they're hitting it perfectly! – Aksakal May 21 '20 at 19:14
  • 1
    @Arkasal: That is some very Soviet data. – Ben May 21 '20 at 22:36
  • 3
    For interest - [here](https://www.worldometers.info/coronavirus/country/russia/) is the Worldometers version of the data. – Russell McMahon May 21 '20 at 23:12
  • Could someone who knows Russian post a translation of the words on the graphs? – JDL May 22 '20 at 07:26
  • 1
    @JDL Statistics of coronavirus Covid-19 infections in the Krasnodar Krai (territory); graph of diagnosed infections by date; number of confirmed cases per day; zero values indicate lack of data. – David May 22 '20 at 14:29
  • 1
    @CopperKettle Is this what "flattening the curve" means? :) – David May 22 '20 at 14:31
  • Not all regions have suspiciously flat data, see [china](https://www.worldometers.info/coronavirus/country/china/) – David May 22 '20 at 15:05
  • @steveo'america We saw that in China for a while--case growth was consistent for days, one bend upwards in the middle. Obviously it reflected their ability to test, not the disease. – Loren Pechtel May 23 '20 at 20:04
  • 2
    Somewhat related: [Kobak, Shpilkin & Pshenichnikov, "Statistical fingerprints of electoral fraud?" *Significance* 13(4), 20-23, 2016](https://doi.org/10.1111/j.1740-9713.2016.00936.x), also on Russian data. – Stephan Kolassa May 25 '20 at 07:15

6 Answers

70

It is decidedly out of the ordinary.

The reason is that counts like these tend to have Poisson distributions. This implies their inherent variance equals the count. For counts near $100,$ that variance of $100$ means the standard deviations are nearly $10.$ Unless there is extreme serial correlation of the results (which is not biologically or medically plausible), this means the majority of individual values ought to deviate randomly from the underlying hypothesized "true" rate by up to $10$ (above and below) and, in an appreciable number of cases (around a third of them all) should deviate by more than that.
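
To get a feel for the size of these fluctuations, here is a minimal simulation sketch (the constant rate of 100 cases per day is assumed purely for illustration):

set.seed(1)
counts <- rpois(1e5, lambda=100)   # many days of counts at a constant "true" rate of 100
sd(counts)                         # close to sqrt(100) = 10
mean(abs(counts - 100) > 10)       # roughly a third of the days deviate by more than 10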

This is difficult to test in a truly robust manner, but one way would be to overfit the data, attempting to describe them very accurately, and see how large the residuals tend to be. Here, for instance, are two such fits, a lowess smooth and an overfit Poisson GLM:

[Figure: data with lowess smooth (blue) and overfit Poisson GLM fit (black)]

The variance of the residuals for this Generalized Linear Model (GLM) fit (the deviance residuals returned by `residuals`) is only $0.07.$ For other models with (visually) close fits the variance tends to be from $0.05$ to $0.10.$ This is too small.

How can you know? Bootstrap it. I chose a parametric bootstrap in which the data are replaced by independent Poisson values drawn from distributions whose parameters equal the predicted values. Here is one such bootstrapped dataset:

[Figure 2: one parametric-bootstrap dataset with the same smooth and GLM fit]

You can see how much more the individual values fluctuate than before, and by how much.

Doing this $2000$ times produced $2001$ variances (in two or three seconds of computation). Here is their histogram:

[Figure 3: histogram of the bootstrapped residual variances]

The vertical red line marks the value of the variance for the data.

(In a well-fit model, the mean of this histogram should be close to $1.$ The mean is $0.75,$ a little less than $1,$ giving an indication of the degree of overfitting.)

The p-value for this test is the fraction of those $2001$ variances that are equal to or less than the observed variance. Since every bootstrapped variance was larger, the p-value is only $1/2001,$ essentially zero.

I repeated this calculation for other models. In the R code below, the models vary according to the number of knots k and degree d of the spline. In every case the p-value remained at $1/2001.$

This confirms the suspicious look of the data. Indeed, if you hadn't stated that these are counts of cases, I would have guessed they were percentages of something. For percentages near $100$ the variation will be very much less than in this Poisson model and the data would not look so suspicious.


This is the code that produced the first and third figures. (A slight variant produced the second, replacing X by X0 at the beginning.)

# Daily case counts read from the chart (April 29 to May 19)
y <- c(63, 66, 66, 79, 82, 96, 97, 97, 99, 99, 98, 99, 98, 
       99, 95, 97, 99, 92, 95, 94, 93)
X <- data.frame(x=seq_along(y), y=y)

# Deliberately flexible (overfit) Poisson GLM: a spline in time
library(splines)
k <- 6
d <- 4
form <- y ~ bs(x, knots=k, degree=d)
fit <- glm(form, data=X, family="poisson")
X$y.hat <- predict(fit, type="response")

library(ggplot2)
ggplot(X, aes(x,y)) + 
  geom_point() + 
  geom_smooth(span=0.4) + 
  geom_line(aes(x, y.hat), size=1.25) + 
  xlab("Day") + ylab("Count") + 
  ggtitle("Data with Smooth (Blue) and GLM Fit (Black)",
          paste(k, "knots of degree", d))

# Test statistic: variance of the (deviance) residuals
stat <- function(fit) var(residuals(fit))
X0 <- X
set.seed(17)
# Parametric bootstrap: draw independent Poisson counts from the fitted
# means, refit the same model, and recompute the statistic
sim <- replicate(2e3, {
  X0$y <- rpois(nrow(X0), X0$y.hat)
  stat(glm(form, data=X0, family="poisson"))
})

# p-value: fraction of the 2001 variances at or below the observed one
z <- stat(fit)
p <- mean(c(1, sim <= z))
hist(c(z, sim), breaks=25, col="#f0f0f0",
     xlab = "Residual Variance", 
     main=paste("Bootstrapped variances; p =", round(p, log10(length(sim)))))
abline(v = z, col='Red', lwd=2)
whuber
  • 25
    Your answers are always exceptional. I love reading them because I love learning, and I learn a lot from you. Thank you. – EngrStudent May 21 '20 at 13:25
  • 9
    You assume a Poisson distribution, but are we really looking at counts from a Poisson process? Maybe the numbers are 'per thousand' and not counts, or maybe they are a percentage, or scaled such that the maximum equals a hundred (like Google Trends data)? Maybe the numbers are not from a Poisson process and relate to some limit of the process (e.g. lots of these data have gaps on weekends when less data is processed)? The conclusion that these data are 'out of the ordinary' depends on these assumptions. – Sextus Empiricus May 21 '20 at 14:24
  • 3
    @Sextus That's an interesting observation. I am indeed suspicious that the numbers might not be counts. But they're definitely not cases per thousand--that would sum to more cases than people! In any region in Russia, the total of a few thousand looks like it's the right order of magnitude. For these data to survive my analysis, they would have to represent counts at least three times larger than the raw numbers. (I worked this out simply by multiplying `y` by 3 in the code and re-running it, then doing that again with a multiple of 10.) – whuber May 21 '20 at 14:30
  • 9
    BTW, my initial reaction was to focus on weekends because they exhibit no dips at all: that's an extraordinary departure from the reporting habits of many other countries. But, not wishing to speculate about such issues, and wishing not to become embroiled in finer details of time series analysis, I opted for the simpler exploratory approach I have outlined here. – whuber May 21 '20 at 14:32
  • What's the reason you opted for splines with degree 4? I re-run your code with cubic splines and the fit was indeed much worse. – COOLSerdash May 21 '20 at 14:51
  • 2
    @COOL As I explained, there's nothing special about the model. What makes this analysis work is that when we vary the number of knots and degree of the splines, to adjust the degree of overfitting, the result stays the same. I have explored ranges of 2 through 12 for `k` and 3 through 6 for `d`. We could do the same by employing lowess models with varying degrees of tension as well as by many other regression models. – whuber May 21 '20 at 14:55
  • @Sextus I have no idea what you mean by "gaps" and "weekends suddenly gone:" they are present in the graphic in the question and there are no visible gaps. The p-value will be *much* lower than 1/2001 simply by running more bootstrap iterations. Try it! (I just reran the code with $k=d=6$ for 20,000 iterations and now the p-value is at 1/20001, which is as small as it can possibly be for this number of iterations.) – whuber May 21 '20 at 15:33
  • @SextusEmpiricus the numbers are counts – Aksakal May 22 '20 at 00:55
  • @Aksakal I realize by now that the numbers are indeed counts, although I still have doubts about what sort of counting process generated them (it is only an assumption that these counts come from a Poisson process). Maybe it is some batch process, where the cases are reported per 100. Or maybe it is something else. In order to know whether these numbers are suspicious we should not run our models and computations, but instead dig up information about the process that generated the data. – Sextus Empiricus May 22 '20 at 00:58
  • @SextusEmpiricus it could be anything, like this region doesn't want to be worse than the next region, so they look up the average and cap their report – Aksakal May 22 '20 at 00:59
  • @Aksakal a plausible explanation could be that all positive cases found in a local lab are being re-tested in a national lab, and the numbers from *that* lab are being reported. Maybe you could do your answer for the case of Московская область (how did you get the data for the city only?) which has ~1000 cases/day with low dispersion. I would not be surprised if again you find higher dispersion in the sub-regions. – Sextus Empiricus May 22 '20 at 01:23
  • @SextusEmpiricus I scraped the plot that’s in my answer, and the url is there too. – Aksakal May 22 '20 at 01:25
  • Is this really what is usually called *bootstrapping?* I would call this a Monte Carlo sampling of a null model, surrogate, or similar. (Mind that this just about terminology; the analysis seems completely sound to me.) – Wrzlprmft May 22 '20 at 13:26
  • 1
    @Wrzlprmft Yes, it is honest-to-God bootstrapping. There are various flavors. This one is *parametric* in the sense of assuming the data arise as independent realizations of Poisson variables--in effect, an inhomogeneous Poisson process. There is no "null model" or other hypothesis in effect. – whuber May 22 '20 at 13:51
  • @whuber: I asked a [follow-up question](https://stats.stackexchange.com/q/467975/36423) on this. – Wrzlprmft May 22 '20 at 16:34
26

The Krasnodar Krai case is not the only one. Below is a plot for the data from 36 regions (I selected the best examples out of 84) where we either see

  • a similar underdispersion
  • or at least the numbers seem to be reaching a plateau around a 'nice' number (I have drawn lines at 10, 25, 50 and 100, where several regions find their plateau)

[Figure: daily confirmed cases for 36 selected regions, plotted on a square-root scale]

About the scale of this plot: it looks like a logarithmic scale for the y-axis, but it is not. It is a square-root scale. I have chosen it so that a dispersion like that of Poisson-distributed data, $\sigma^2 = \mu$, will look the same for all means. See also: Why is the square root transformation recommended for count data?
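
As a quick illustration of why the square-root scale works here (a minimal sketch, with arbitrary example means):

set.seed(1)
## sd(sqrt(X)) is roughly 0.5 for Poisson X regardless of the mean,
## so Poisson-sized noise has the same visual size everywhere on this axis
sapply(c(10, 25, 50, 100, 400), function(m) sd(sqrt(rpois(1e5, m))))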

In some of these cases the data look clearly underdispersed, if they are supposed to be Poisson distributed. (Whuber showed how to derive a significance value, but I guess it already passes the inter-ocular trauma test. I still share this plot because I found it interesting that there are cases without the underdispersion that nevertheless seem to stick to a plateau. There may be more to it than just underdispersion. And there are cases like nr 15 and nr 22, lower left of the image, which show underdispersion but not the fixed plateau value.)

The underdispersion is indeed odd. But we do not know what sort of process generated these numbers. It is probably not a natural process, and humans are involved. For some reason there seems to be a plateau, or an upper limit. We can only guess what it could be (these data do not tell us much about it, and it is highly speculative to use them to guess what is going on). It could be falsified data, but it could also be some intricate process that generates the data and has an upper limit (e.g. these are reported/registered cases, and possibly the reporting/registration is limited to some fixed number).

### using the following JSON file
### https://github.com/mediazona/data-corona-Russia/blob/master/data.json
library(rjson)
#data <- fromJSON(file = "~/Downloads/data.json")
data <- fromJSON(file = "https://raw.githubusercontent.com/mediazona/data-corona-Russia/master/data.json")

layout(matrix(1:36,4, byrow = TRUE))
par(mar = c(3,3,1,1), mgp = c(1.5,0.5,0))

## computing means and dispersion for last 9 days
means <- rep(0,84)
disp <- rep(0,84)
for (i in 1:84) {
  x <- c(-4:4)
  y <- data[[2]][[i]]$confirmed[73:81]
  means[i] <- mean(y)
  mod <- glm(y ~ x + I(x^2) + I(x^3), family = poisson(link = identity), start = c(2,0,0,0))
  disp[i] <- mod$deviance/mod$df.residual
}

### choosing some interesting cases and ordering them
cases <- c(4,5,11,12,14,15,21,22,23,24,
   26,29,30,31,34,35,37,41,
   42,43,47,48,50,51,53,56,
   58,67,68,71,72,75,77,79,82,83)
cases <- cases[order(means[cases])]

for (i in cases) {
  col = 1
  if (i == 24) {
    col = 2
    bg = "red"
  }
  plot(-100,-100, xlim = c(0,85), ylim = c(0,11), yaxt = "n", xaxt = "n", 
       xlab = "", ylab = "counts", col = col)
  axis(2, at = c(1:10), labels = c(1:10)^2, las = 2)
  axis(1, at = c(1:85), labels = rep("",85), tck = -0.04)
  axis(1, at = c(1,1+31,1+31+30)-1, labels = c("Mar 1", "Apr 1", "May 1"), tck = -0.08)


  for (lev in c(10,25,50,100)) {
    #polygon(c(-10,200,200,-10), sqrt(c(lev-sqrt(lev),lev-sqrt(lev),lev+sqrt(lev),lev+sqrt(lev))),
    #        col = "gray")
    lines(c(-10,200), sqrt(c(lev,lev)), lty = 2) 
  }
  lines(sqrt(data[[2]][[i]]$confirmed), col = col)
  points(sqrt(data[[2]][[i]]$confirmed), bg = "white", col = col, pch = 21, cex=0.7)
  title(paste0(i,": ", data[[2]][[i]]$name), cex.main = 1, col.main = col)
}


### an interesting plot of under/overdispersion and mean of last 9 data points
### one might recognize a cluster with low deviance and mean just below 100
plot(means,disp, log= "xy",
     yaxt = "n", xaxt = "n")
axis(1,las=1,tck=-0.01,cex.axis=1,
     at=c(100*c(1:9),10*c(1:9),1*c(1:9)),labels=rep("",27))
axis(1,las=1,tck=-0.02,cex.axis=1,
     labels=c(1,10,100,1000), at=c(1,10,100,1000))
axis(2,las=1,tck=-0.01,cex.axis=1,
     at=c(10*c(1:9),1*c(1:9),0.1*c(1:9)),labels=rep("",27))
axis(2,las=1,tck=-0.02,cex.axis=1,
     labels=c(1,10,100,1000)/10, at=c(1,10,100,1000)/10)

Maybe this is overinterpreting the data a bit, but anyway here is another interesting graph (also produced by the code above). It compares all 84 regions (except the largest three, which do not fit on the plot) by the mean value of the last 13 days and a dispersion factor from a Poisson-family GLM with a cubic fit. It looks like the regions with underdispersion often sit close to 100 cases per day.

It seems that whatever is causing these suspiciously level values in Krasnodar Krai occurs in multiple regions, and it could be related to some boundary of 100 cases/day. Possibly there is some censoring in the process that generates the data, limiting the values to an upper bound. Whatever the process causing the censored data is, it seems to occur in multiple regions in a similar way and likely has some artificial (human) cause (e.g. some sort of limitation of the laboratory testing in smaller regions).

[Figure: dispersion factor versus mean daily count for the regions]

Sextus Empiricus
  • 3
    Nice answer (+1). – Ben May 22 '20 at 00:48
  • 4
    Good answer. I had wondered if there might be some selection bias - the data certainly looks very unusual, but with so many local statistics being tracked around the globe, it's expected that a small number of regions will have correct data that's statistically improbable due to chance alone, and it's easy to focus on those cases. But the consistent pattern of improbable results indicates this isn't a one-off instance due to chance. – Nuclear Hoagie May 22 '20 at 14:15
  • 2
    @NuclearWang it is also interesting that these curves show that it is not as if some single person is *fabricating* the data (I guess that this goes around in some people's minds). For that to be true, the person who fabricated the data must have had a lot of imagination to make these different curves that all show the same behaviour, but every time in a slightly different way. This does not look to me as if it is being fabricated by a single source. (My guess would be that the positive cases from the regions undergo a second federal lab test, and this test is limited to 100/day) – Sextus Empiricus May 22 '20 at 14:39
  • Correlating cases with population could also be informative. A "boundary" around 100 looks suspicious, but to correct the data this way all local authorities would need to adjust their time series individually. That is a hard job. And the variance/mean seems individually shaped. I guess this is a combination of test inaccuracy, limited hospital beds, bias in sampling patients for testing, and maybe artificial (and less likely) undercounting of cases. All of these were reported on TV and in the news officially, except for the last point. Moreover, people say that the number of cases is overestimated all around Russia. – Alexey Burnakov May 22 '20 at 15:59
  • 1
    @AlexeyBurnakov *” I guess these is a combination of test inaccuracy, limited hospital beds, bias in sampling patient for testing, and maybe artificial (and less possible) case count underestimation. "* Certainly all these aspects are likely around. But I do not see how any of them are a cause for underdispersion (low noise). – Sextus Empiricus May 22 '20 at 16:13
  • I don't understand this either. But I am being careful with saying like "the Fed filters results" or Fed forces regions to filter results. A layman logic about covid stats we tend to have is that it is good for everybody but the people to over (not under) estimate cases. It is a good reason to show efforts to save the people and earn some more rating points, for all levels, from President, to Governor's, to Chief Doctors (more budget). It is just a common sense, not math. – Alexey Burnakov May 22 '20 at 16:23
  • @AlexeyBurnakov it is extremely unlikely to *not* underestimate the cases with these statistics of verified cases (unless it is done intentionally, but I don't follow your reasons why people would want to do that). This is because it is *very difficult* to trace all cases and verify them. So for all regions/countries, in order to estimate the prevalence or the total number of people that have been affected in the past, one eventually needs to use immunological tests on a random sample and extrapolate those. There is only one statistic that is not so difficult to trace, and that is weekly deaths – Sextus Empiricus May 22 '20 at 16:51
  • A hypothesis: there's only one hospital in each krai, it gets 100 tests daily, and refuses to report any cases that haven't been tested, irrespective of how well the symptoms match. – John Dvorak May 22 '20 at 20:05
  • @JohnDvorak, it could be something like that. But I guess that it is more specifically like the hospitals have more testing capabilities themselves (at least some reports state that there is a lot of testing) but the tests that are used for official reporting are limited. Maybe it is only one single lab whose data is used. In this way you get that the testing is not only restricted, but *also* that the probability/fraction of positive cases is high (because of the pre-selection). – Sextus Empiricus May 22 '20 at 20:19
  • For what it's worth, The Economist has a story on death rates through February 2022, "Are some countries faking their covid-19 death counts?": https://www.economist.com/graphic-detail/2022/02/25/are-some-countries-faking-their-covid-19-death-counts – denis Mar 05 '22 at 17:23
22

I will just mention one aspect that I haven't seen mentioned in the other answers. The problem with any analysis that declares this significantly out of the ordinary is that it doesn't take into account that the data were selected because they look strange. I'd assume that the thread opener has seen not only these data but also other data sets of a similar type (maybe not even consciously, in the media, without noticing anything special about them; but I would expect somebody who writes a posting like this to have looked at more of them consciously). The question to address is therefore not whether these data, viewed in isolation, are significantly different from what could be expected, but rather whether, if everything is normal (not meant as in "normally distributed", you know what I mean), any data set like this one, or with a different pattern that would also prompt the thread opener to post here, could be expected among all those they have seen. As we don't know what they have seen, that is pretty hard to assess, unless we come up with a p-value of $10^{-10}$, which would remain significant after adjusting for almost any number of multiple tests.

Another way of testing this would be to make predictions for the future based on what the data show, and then test whether the strange trend goes on with observations that were not part of those that led to picking this data set.

Of course the other answer, which shows that this kind of dodgy pattern also occurs in other regions, provides some reassurance that something meaningful is going on, because then it isn't such a special thing to have picked. The point I want to make, however, is that for whatever analysis is done, selection bias should not be forgotten.

Christian Hennig
  • 6
    This is also related to the [prosecutor's fallacy](https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy#The_Sally_Clark_case). An [example of this](https://arxiv.org/abs/math/0607340) from Dutch courts was my first introduction to Bayesian statistics. Also important is to keep in mind the data collection. Instead of mindlessly applying our equations, we should also carefully consider the process that generates the data. First look at what is going on before applying the models. – Sextus Empiricus May 22 '20 at 12:20
18

Krasnodar

The data for the region are clearly not realistic in terms of their dispersion. Here are data for the city of Krasnodar: the sample mean in May is 34, and the standard deviation is 8.7.

[Chart: daily confirmed cases in the city of Krasnodar]

This is more than a Poisson distribution would suggest, where the standard deviation is the square root of the average, i.e. 5.9. The data are overdispersed, but the sample size is quite small, so it's hard to simply reject a Poisson distribution. The city has a population of nearly 1M people.

However, when we jump to Krasnodar Krai, with a population of 5.5M, the dispersion suddenly collapses. In your plot the new cases average around 100, but the standard deviation is 1-2, where a Poisson model would put it near 10. Why would the capital be overdispersed while the whole region is severely underdispersed? It doesn't make sense to me.

Also, where did all the dispersion from the region's capital go? "It's inconceivable!" (c) to think that the regional incidence is very strongly negatively correlated with that of its capital. Here's a scatter plot of the daily cases in the region outside Krasnodar vs. those in the city of Krasnodar.

[Scatter plot: daily cases in Krasnodar Krai outside the city vs. in the city of Krasnodar]

Source

Chart: https://www.yuga.ru/media/d7/69/photo_2020-05-21_10-54-10__cr75et3.jpg

scraped data: 14 45 37 37 32 25 33 40 47 40 33 38 47 25 37 35 20 25 30 37 43
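
A quick check on those scraped counts (a minimal sketch) reproduces the figures quoted above:

y <- c(14, 45, 37, 37, 32, 25, 33, 40, 47, 40, 33,
       38, 47, 25, 37, 35, 20, 25, 30, 37, 43)   # city of Krasnodar, daily cases
mean(y)   # about 34
sd(y)     # about 8.7, versus sqrt(34) = 5.9 expected under a Poisson model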

Russia

@AlexeyBurnakov pulled the chart for all of Russia:

[Chart: daily new COVID-19 cases for all of Russia (Yandex)]

I scraped the data for May, and it's severely overdispersed. The average is about 10K but the variance is 756K, i.e. a standard deviation of about 870, much higher than a Poisson process would suggest. Hence, the overall Russia data support my claim that the Krasnodar Krai data is abnormal.

9623 10633 10581 10102 10559 11231 10699 10817 11012 11656 10899 10028 9974 10598 9200 9709 8926 9263 8764 8849 8894
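
The same check for the Russia-wide counts listed above (a minimal sketch):

ru <- c(9623, 10633, 10581, 10102, 10559, 11231, 10699, 10817, 11012, 11656,
        10899, 10028, 9974, 10598, 9200, 9709, 8926, 9263, 8764, 8849, 8894)
mean(ru)         # about 10,000
var(ru)          # about 756,000, i.e. a standard deviation near 870
sqrt(mean(ru))   # about 100, the standard deviation a Poisson model would imply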

Source

https://yandex.ru/covid19/stat?utm_source=main_title&geoId=225

Aksakal
  • 9
    Interesting analysis (+1), but it's not really inconceivable that you could get negative correlation. If some of the people showing signs of illness are transported to the capital for testing/treatment (or vice versa), that would induce negative correlation between the incidence in the two places, wouldn't it? (I'm not saying this is what is happening; just that there are "conceivable" possibilities that could explain negative correlation here.) – Ben May 22 '20 at 02:07
  • I find this an interesting approach and wonder whether the Moscow suburbs region (~1000 cases/day) may have similar correlations. If I ever find time then I am gonna scrape the data https://www.google.com/search?q="Балашиха"+covid+site:https://covid.mz.mosreg.ru and perform pca to find correlations and see whether sub regions add up to a multiple of 100. – Sextus Empiricus May 22 '20 at 02:11
12

So I think these are the data:

 month day new delta tens ones
     4  29  63    NA    6    3
     4  30  66     3    6    6
     5   1  65    -1    6    5
     5   2  79    14    7    9
     5   3  82     3    8    2
     5   4  96    14    9    6
     5   5  97     1    9    7
     5   6  97     0    9    7
     5   7  99     2    9    9
     5   8  99     0    9    9
     5   9  98    -1    9    8
     5  10  99     1    9    9
     5  11  98    -1    9    8
     5  12  99     1    9    9
     5  13  96    -3    9    6
     5  14  97     1    9    7
     5  15  99     2    9    9
     5  16  92    -7    9    2
     5  17  95     3    9    5
     5  18  94    -1    9    4
     5  19  93    -1    9    3

One of the fun, introductory elements of forensic accounting is Benford's law.

When I look at the frequencies of the ones digits and the tens digits, I get this:

 Ones count rate
    1     0  0.0
    2     2  9.5
    3     2  9.5
    4     1  4.8
    5     2  9.5
    6     3 14.3
    7     3 14.3
    8     2  9.5
    9     6 28.6

 Tens count rate
    1     0  0.0
    2     0  0.0
    3     0  0.0
    4     0  0.0
    5     0  0.0
    6     3 14.3
    7     1  4.8
    8     1  4.8
    9    16 76.2

I notice a very strong preponderance of "6" and "9" in the data.

If the ones-place (second) digits were distributed according to Benford's rules, the digits 6 and 9 should occur roughly 9.7% and 8.5% of the time, respectively, rather than the roughly 14% and 29% observed here.
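
Here is a small sketch of the digit tabulation, plus a chi-square test of the ones digits against a uniform null (a rough stand-in for the nearly uniform Benford second-digit frequencies); note that with only 21 observations the test has little power:

# daily counts from the table above
new <- c(63, 66, 65, 79, 82, 96, 97, 97, 99, 99, 98,
         99, 98, 99, 96, 97, 99, 92, 95, 94, 93)
ones <- factor(new %% 10, levels = 0:9)   # ones digits, keeping digits that never occur
table(ones)
tens <- factor(new %/% 10, levels = 0:9)  # tens digits
table(tens)
# goodness-of-fit test against a uniform distribution over 0-9;
# p comes out around 0.17, so the ones-digit pattern alone is not decisive
chisq.test(table(ones))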

gung - Reinstate Monica
EngrStudent
  • 23
    Thinking of Benford's Law is good, but it's not applicable. The reason is that Benford's Law can be expected to hold only when data range over several orders of magnitude. Here, their initial digits obviously will be concentrated around 9 and 1 even when the data reflect honest reporting of values that tend to lie between 90 and 199. Thus, Benford's Law (by itself) is useless for distinguishing honest data from fake data in this example. – whuber May 21 '20 at 13:33
  • 2
    If this is how Benford's Law worked, then you could show that _any_ dataset with a small standard deviation is fake by displaying it in an (un)appropriately large base. – BlueRaja - Danny Pflughoeft May 21 '20 at 22:46
  • 1
    @BlueRaja-DannyPflughoeft , If I wanted to be (much less) hand-waving then I would use the sample size to make some decent bounds. Right now I have a mean, and half the time you are above it and half below it: mean target can be much worse for career than 95% CI window. – EngrStudent May 22 '20 at 10:25
  • Just in a non-statistical sense, the prevalence of 9's in both the ones and the tens implies that they are trying to make even these counts seem smaller than they are, e.g. 'it's only about 10 cases' (19) or 'it's not even a hundred yet' (99), which is a well-known trick to make something seem less, e.g. the 99/95-cent trick – B-K May 23 '20 at 04:02
  • 1
    @Bob The problem with that reasoning is that if the true rate during this period were close to 100, about a third of the time we would see counts in the 90's and half the time we would see them in the 100's, so observing a preponderance of 9's and 1's among initial digits does not discriminate random, independent behavior from behavior that looks unusual or suspicious. Benford's Law is neither applicable nor useful in this circumstance. – whuber May 24 '20 at 16:48
  • 2
    The criticism of the 'tens' is a fair point, but EngrStudent is also showing that there is a discrepancy in the 'ones' (where the zero value is also missing, which makes the discrepancy larger). However, for the ones the 'problem' is that the test is not very powerful: for these small numbers (with large variance) a chi-square test only gives a p-value around 0.17, so it is not so special to see these discrepancies. Example: run the following R code `chisq.test(c(0,0,2,2,1,2,3,3,2,6))` – Sextus Empiricus May 25 '20 at 08:39
5

Interesting points from everyone. Let me contradict some.

1) Why Poisson? The case-generation process is intrinsically interdependent, being a pandemic interaction between the ill and the healthy, so case occurrence in a time interval may be affected by occurrences in previous intervals. The dependency may be complicated, but strong.

UPDATE (as of May 23rd)

1.1) Imagine the physics of the process.

  • a) A person is healthy ->
  • b) they get infected by a COVID-positive person ->
  • c) they feel sick and go to a hospital ->
  • d) they get screened, very likely after waiting in line or for a timetable slot ->
  • e) the lab processes the tests and determines the new positives ->
  • f) a report goes to a ministry and gets summarized in a daily report.

I would like to insist again, after the long discussion and the downvotes I received, that when you see the stage (f) reports, you should understand that the events occurred as a function of many human interactions, and importantly they were accumulated while passing a "bottleneck": a person's own time to visit a doctor, the doctor's appointment timetable, or laboratory processing limits. All of these make it non-Poissonian, as we don't use the Poisson distribution for events that wait in a queue. I think it is mostly about lab tests, which are done by humans who work at an average capacity and cannot process too many per day. It is also possible that the final reporting stage accumulates information in buckets of a sort.

My point is that it is not Poisson, or a generalization of it. It is "Poisson with queueing and data accumulation over time periods". I don't see 100% evidence of "Soviet-style data manipulation". It could just be batches of pre-processed data accumulated up to the report.

2) For the Krasnodar region the daily mean seems to be non-stationary. It is not good at all to approach these data from a Poisson view, or at least one should take only the stationary part.

These points concern violations of two major Poisson distribution assumptions.

3) Why 100 tests per day? It is official information that in Russia (and I am in Russia, reading the news constantly) 7.5 million tests have been made so far, with about 330,000 cases confirmed (as of May 22nd). The proportion of positives is less than 5%. With that, you should expect at least 2,000 tests per day to have been allowed. This could be real, as tests are scarce and expensive items, and not only in Krasnodar, Russia, or Europe; it is the same everywhere. @Aksakal

[Chart: Russia-wide COVID-19 testing and case totals (Yandex)]

(source: https://yandex.ru/covid19/stat?utm_source=main_title&geoId=225)
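
For scale, here is a minimal sketch of what daily positive counts from a fixed testing capacity would look like under a simple binomial model (the 2,000 tests per day and 5% positivity rate are illustrative assumptions taken from the figures above); this is the point debated in the comments below:

set.seed(1)
positives <- rbinom(21, size = 2000, prob = 0.05)   # 21 hypothetical days
mean(positives)   # about 100
sd(positives)     # about sqrt(2000 * 0.05 * 0.95), i.e. roughly 10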

4) Why ever would you think these are "Soviet data"? Look at the World data for new covid cases. It is extremely low-variance if you think it must be Poisson (a sum of Poissons is a Poisson). Is the World "Soviet" (I guess you mean lying?) then? @Ben - Reinstate Monica

[Chart: daily new COVID-19 cases worldwide (Yandex)]

(source: https://yandex.ru/covid19/stat?utm_source=main_title&geoId=225)

So, it seems to me that applying statistics in the case of a pandemic is a dangerous thing. Lots of assumptions of all kinds must be true in order to conclude what has been concluded.

UPDATE

To address the point about the world data under/overdispersion,

library(data.table)
library(magrittr)

dat <- read.csv(url('https://covid.ourworldindata.org/data/owid-covid-data.csv'))

setDT(dat)

dt <- 
    dat[location == 'World', sum(new_cases), date] %>%
    .[, date:= as.Date(date)] %>% 
    .[date >= '2020-04-01'] %>% 
    setorder(date)

min(dt$V1)

max(dt$V1)

mean(dt$V1)

var(dt$V1)

var(dt$V1) / mean(dt$V1) # huge overdispersion, indeed

plot(dt$V1,type='l')

acf(dt$V1)

I took the data from April 1st until today (as a more stationary, plateau phase).

[Plot: daily new cases worldwide since April 1, 2020]

The calculation showed that the variance-to-mean ratio is 1083. This is huge overdispersion. My naked-eye analysis was wrong.

There is significant weekly autocorrelation present.

[Plot: autocorrelation function of the daily worldwide counts]

This can be one of the reasons for the higher variance, but is it enough? And why is there a daily pattern? Is it still a Poisson process, or lying statistics worldwide?

Alexey Burnakov
  • I don't know that the world is Soviet, but I do know that modern politicians are filtered for two skillsets: stage appeal (good con-man) and fund raising (good sell-out). I don't know that the Poisson process actually captures the physics of the phenomena. I don't see contact tracing on the social graph, viral load, or any of that. – EngrStudent May 22 '20 at 10:29
  • 6
    The point is that the data is under-dispersed. Even despite your point (1) and (2) one should expect that the variance of the noise in the data should be close to the mean of the data (or larger/overdispersed). This is also obvious from the plot of the curves where we see the odd *drastic* decrease in noise in May. (3) *"With this, you should expect at least 2,000 tests per day allowed"* what do you mean by that? (4) The world data has no low variance. It ranges from 80k to 100k. So roughly a coefficient of variation of some 10%. That is *overdispersion* not *underdispersion*. – Sextus Empiricus May 22 '20 at 10:46
  • 2
    1) and 2). I don't see why underdispersion should be mentioned if you are not sure these are Poisson data. That was the point. 3) I mean there are on average 5 out of 100 people who were covid-positives after taking tests, so 100 positives mean 100 * 20 tests on average... That can really be a huge number of tests for a small region like Krasnodar and the test number can be limited to 2000 by budget constraints of lack of medical workers. 4) Let me add some research to my answer, you may be right. – Alexey Burnakov May 22 '20 at 10:50
  • 2
    @EngrStudent, I would never like to see or get engaged in data politicizing on this website that I like. Not to mention that in the Soviet Union the statistics and economic science was very sophisticated. On your other two comments, intuitively, the data generation process is dependent, and data that I saw was always strange, non random. – Alexey Burnakov May 22 '20 at 11:09
  • 1
    @AlexeyBurnakov - I would weep were that to happen. I was taught partial differential equations by Basil Nikolaenko. He managed 2 teams at NASA, one American calculator drivers and another immigrant Russian pencil users and he said when the immigrant group came to him with something their stuff was always right. I respect Russian mathematics greatly. I don't know about Russian economics either way though. – EngrStudent May 22 '20 at 11:14
  • 1
    Daily pattern might come from several things: work-week vs. weekend worker activity/recreation, doctor hours(typically not weekend), reporting updates timing, lab/facilities operating hours – EngrStudent May 22 '20 at 11:17
  • @EngrStudent, I agree with this. But it makes the data look strange, violating what we know about true distributions. Flattening the case counts could also cause underdispersion, and it also makes the data strange. And I agree that it can be a manipulation, but it can also be due to the lack of a medical workforce (a huge problem, doctors have been working extra shifts everywhere), and to the budget restricting the number of tests made. – Alexey Burnakov May 22 '20 at 11:22
  • 1
    According to yesterday's news, Krasnodar region (1) is still set to open the tourism season in July (the region is a major seaside resort); (2) the isolation regime will be considerably relaxed from tomorrow. These facts ought to be taken into account, because the authorities have been starting some activities to meet the plans. These actions might, but do not necessarily, imply some sort of falsification of the numbers. They would, however, imply a definitely non-Poissonian process of "daily confirmed cases". – ttnphns May 22 '20 at 11:23
  • @ttnphns, yes, true. And other tourist-dependent regions also relax carantine regime, like Turkey, Italy and may be some more. – Alexey Burnakov May 22 '20 at 11:30
  • 1
    @AlexeyBurnakov - the "Diamond Princess" data is nearly pristine. The demographics are a little older. (https://www.nature.com/articles/d41586-020-00885-w) The challenge then is a dynamic system model that transforms that non-cylic phenomenology to the complicated stuff we see. – EngrStudent May 22 '20 at 11:32
  • 2
    @AlexeyBurnakov if you have each day 2000 tests out of which each test has a 5% probability to be positive, then you have something like a binomial distributed value (with $n = 2000$ and $p=0.05$) for which the expectation value, $np$ and variance $np(1-p)$ are still very close (it explains why you may get on average 100 tests, but not why you get 100 *with so little variation*) .... – Sextus Empiricus May 22 '20 at 11:54
  • .....For most situations with countdata we should expect that the variance and mean are roughly equal. Only when you have something like a binomial distributed variable with large value for $p$, then this is not the case. (I imagine this could be the case here when the reports are based on second opinion tests from a central lab where there is some limited number of testing capacity) – Sextus Empiricus May 22 '20 at 11:54
  • 6
    *Why Poisson? Cases generation process is intristically interdependent as a pandemic interaction between ill and healthy* – Sure, the Poisson process is a rough assumption, but, when it comes to investigating underdispersion, it is a benign one. Most interdependency mechanisms such as superspreaders, weekends, weather would *increase* the dispersion in comparison to a Poisson process. I cannot think of any epidemiological mechanism that would decrease the dispersion. … – Wrzlprmft May 22 '20 at 13:19
  • 1
    There may be some dispersity-lowering mechanisms on the reporting level, but that means that the numbers are actually not reflecting reality and thus suspicion is justified. Moreover, as elaborated by @SextusEmpiricus, even limited testing capacities cannot explain this. The only thing I can think of is a bottleneck in the handling of reports, e.g., the office can at most handle 99 reports a day. But in that case, the data is indeed pretty useless. – Wrzlprmft May 22 '20 at 13:20
  • 1
    *"but that means that the numbers are actually not reflecting reality and thus suspicion is justified."* we can already expect that the numbers do not reflect reality without the observation of underdispersion. The entire world goes crazy about these figures that are daily reported and overly dispersed among the many different media, while they are not that much accurate (many countries have limited testing capabilities). – Sextus Empiricus May 22 '20 at 13:28
  • @SextusEmpiricus, the point about binomial distribution makes sense. Then, yes, low variance observed is strange as well. But bear in mind that I referred to a country-wide figure. We don't have daily test counts reported publicly, and by region. It may be that the proportion would fluctuate more if exact test numbers were given. – Alexey Burnakov May 22 '20 at 13:37
  • 1
    @AlexeyBurnakov I do not understand what you mean. What I got from your text is that you meant to say that the figure of 100 positive cases/day stems from something like 2000 tests/day. This indeed may explain why you have a plateau value. But... it does *not* explain why you have such little variation in the numbers. If your tests are limited to, say 2000, and if the expectation value is 100 then you should still expect a standard deviation around roughly 10. The data is heavily underdispersed if it comes from a binomial distribution with low $p$. (but if $p$ is large then it makes sense). – Sextus Empiricus May 22 '20 at 13:42
  • *"But bear in mind that I referred to a country-wide figure."* what does that mean in relation to my comment about the binomial distribution still having variance and expectation value being approximately the same? – Sextus Empiricus May 22 '20 at 13:45
  • 4
    @SextusEmpiricus: My point is that there are plenty of mechanisms that explain overdispersion. This does not automatically invalidate the data. Of course, one should not get overexcited over a sudden jump from one day to the next, but when you account for such effects and look at a proper moving average, the data can still have some value. By contrast, all mechanisms leading to underdispersion I can think of also lead to completely useless data. – Wrzlprmft May 22 '20 at 13:46
  • *"It may be that the proportion would fluctuate more if exact test numbers were given."* The numbers that we are currently looking at are not exact test numbers and not numbers that are daily updated? – Sextus Empiricus May 22 '20 at 13:47
  • @SextusEmpiricus, why? It is easy. We don't know how many tests ($n$) were done each day in the Krasnodar region. This info is absent. We only know that in whole country the proportion of positives ($k$) to tests is about 0.05. If we knew daily stats not only on positives, but also on tests, we could legitimately try Binomial. That's what I have just wrote. – Alexey Burnakov May 22 '20 at 13:48
  • @Wrzlprmft I am not so much worried about overdispersion. It is more that the figures are heavily underreporting the true number of cases. It is not unthinkable that the degree of underreporting may change in time (the curve for China shows this clearly with a sudden bump when the test protocol was changed). So the curve will show patterns that partly reflect how we test and report. It is like using a very bad thermometer that is not showing the accurate temperature and neither consistent. It is the worse case of [the four options](https://en.wikipedia.org/wiki/Accuracy_and_precision). – Sextus Empiricus May 22 '20 at 13:52
  • 1
    @AlexeyBurnakov we do not need to know the exact numbers in the binomial case. It could be n=2000 or n=500, it doesn't matter. If $p$ is small (or equivalently $n$ large) then the variance and expected value are approximately equal (in fact you could approximate the binomial data with a Poisson distribution https://en.wikipedia.org/wiki/Poisson_limit_theorem). Only if you have some weird situation that p is very high >0.9 does the ratio noise/signal makes sense. I mentioned before a situation how this could happen. – Sextus Empiricus May 22 '20 at 13:55
  • 1
    Note that for a binomial distribution we have: $$\text{mean} = np$$ $$\text{variance} = np(1-p)$$ and $$\frac{\text{variance}}{\text{mean}} = 1-p \, \underbrace{\approx 1}_{\llap{\text{if $p$ }}\rlap{\text{close to 0}}}$$ So if $p$ is small (approximately 5% as you say) then it doesn't matter much what it is exactly and variance/mean ~ 1. – Sextus Empiricus May 22 '20 at 14:01
  • @SextusEmpiricus, I understand. I cannot completely agree this is applicable here. Binomial experiments imply we do $n$ trials lots of times, right? The number of experiments are the number of days. If, indeed, we knew, $n$ is equal each time (without even knowing $n$), then, I agree, we couldn't go without bias. But we don't know if $n$ is equal. Do you see this is logical? HOWEVER, even if $n$ is not known and striclty speaking Binomial is also misleading, I can imagine that varying $n$ is not likely to produce low-variance results, it should, instead, increase the variance. So, I agree. – Alexey Burnakov May 22 '20 at 14:34
  • 1
    @AlexeyBurnakov what we know is that if these data are binomial distributed with a small value for $p$, then we should not observe the noise/signal ratio that we observe. Sure the number $n$ might not be equal from day to day (and so is the number $p$ not equal from day to day). But the variations that may occur in $n$ and $p$ are not gonna be of the kind that smoothen the data. So let's get back (after long discussion) to the point 3 in your post. You suggest that the number of tests is somehow limited, but that does not explain the low signal/noise ratio. – Sextus Empiricus May 22 '20 at 14:45
  • @SextusEmpiricus. I see now that limiting the number of tests is *unlikely* to flatten the data. It is hard to imagine that, for example, $p$ is a function of the number of tests $n$. Yes, agreed. Then the source of the low var/mean can be a data manipulation, but I don't know what kind. It could be just "dispersing" counts more evenly over time, or worse. Thank you for the discussion. – Alexey Burnakov May 22 '20 at 15:00
  • 1
    @AlexeyBurnakov in a comment under my answer I explain why I do not believe that it is some kind of intentional data manipulation of fabrication. Or at least the manipulation is not done by a single person. For that to be true the different regions look too much different in the way that they are fabricated. What I imagine is that it could be some sort of procedural limitation for the regions that turns this into binomial distributed data with high $p$. For instance, the regionally observed positive cases are being double checked, and the double checking is done in daily batches of fixed size – Sextus Empiricus May 22 '20 at 15:13
  • @AlexeyBurnakov, on #3, I think your plot for new cases in all Russia is not inconsistent with Poisson type of process. It shows 10k new cases daily, so the dispersion would be around 100, and that seems to be the case if you look at the fluctuation of daily new cases – Aksakal May 22 '20 at 16:30
  • @Aksakal, I didn't measure variance or st.deviation for this plot. It wasn't the reason I posted it. It was to show that positive cases and tests are different processes. About 4% of tests resulted in cases. You just mentioned "tests". – Alexey Burnakov May 22 '20 at 16:36
  • 1
    @AlexeyBurnakov, take a look at my updated answer. I scraped your Russia data, and it's over dispersed, the variance daily is very large. Kransodar krai data is "managed" one way or another – Aksakal May 22 '20 at 16:46
  • @Aksakal. I see, good point. By the way, upper in comments we already started treating the data as Binomial because case counts are fractions of tests made – Alexey Burnakov May 22 '20 at 16:52
  • 1
    @AlexeyBurnakov, that yandex page shows me 8M tests and 326K infected, i.e. 4% incidence rate. So, Poisson should be a fairly Ok approximation – Aksakal May 22 '20 at 17:11
  • @Aksakal, It is good to know, I was not familiar with this correlation of distributions. – Alexey Burnakov May 22 '20 at 17:20
  • 1
    @Aksakal you assume that these numbers relate to the 8M tests and Binomial distribution with 4% incidence rate, but that may not need to be the case. The data have very little meta-information provided telling how the data is gathered. It can also be that the numbers relate to a secondary test which has some limit for the different regions (like around 100) and the region's are sending only their positive cases for second tests making the incidence rate very high. – Sextus Empiricus May 22 '20 at 17:38
  • 1
    @SextusEmpiricus, that's all fair points, we don't know much about the actual data gathering process – Aksakal May 22 '20 at 17:47
  • @EngrStudent, "I respect Russian mathematics greatly. I don't know about Russian economics either way though". I was tired yesterday, sorry. On Russian math, recall the names: Markov, Chebyshev, Kolmogorov (probability), Lyapunov, Arnold (general math), Lobachevskiy (geometry), Keldysh. They are all around. On economic science, you may have heard of Leontyev (a Nobel winner). And more who are not so well known. They were genuinely insightful, but, alas, sometimes the politicians made them miserable, which can be a source of the bias. – Alexey Burnakov May 23 '20 at 15:10
  • @Aksakal, I added mor argumentation why I think these data are not Poisson in nature. Bullet in my answer 1.1) – Alexey Burnakov May 23 '20 at 15:22
  • It’s clearly not poisson but it’s not the point. The point is that the dispersion is too small. – Aksakal May 23 '20 at 17:11
  • Re "Soviet-style manipulations:" a search of this Web page shows *you* are the sole person even referring to such a claim! I think most, if not all, of the posters and readers on this page understand the limitations of statistical analysis and wouldn't presume that an unrealistic-looking dataset necessarily indicates there was skullduggery at work. Your arguments about non-Poissonness really don't hit home, because ultimately *the virus* determines who gets sick and when; and that is going to be close to Poisson. This is the basic process driving everything else. – whuber May 23 '20 at 17:17
  • 1
    @whuber 'soviet-style manipulations' is a response to 'soviet data'. That latter one is a characterisation that was not started by Alexey. – Sextus Empiricus May 23 '20 at 17:23
  • The question about the Poissonness of the data is sort of also a question about whether or not these data are supposed to relate to what 'the virus determines' (the alternative is that the data reflects measurement and reporting capabilities, and this is a likely scenario if you compare the different countries with enormous heterogeneity in approaches and figures). None of these statistics are realistic (independent from dispersion) and all of them require some clear descriptions of limitations. Except Iceland, which tests extremely a lot, all these data are just tips of the virus-icebergs. – Sextus Empiricus May 23 '20 at 17:32
  • 2
    @whuber, sir, I did it for one purpose only. "@Arkasal: That is some very Soviet data. – Ben - Reinstate Monica yesterday " The response to this comment under the Question. No other purposes. – Alexey Burnakov May 23 '20 at 17:41
  • 2
    @Alexey Thank you for the explanation. – whuber May 23 '20 at 17:52
  • 1
    @whuber, is there such a thing as "soviet data"? I think Soviets were always manipulating statistics. Whether post Soviet countries keep this tradition it is a question to me. Almost everyone who I know and still lives over there would assert that it's still the case. I don't have the first hand experience though with recent stats. I highly suspect any and all COVID related data from the region at least through April. At the moment it is probably impossible to hide the spread – Aksakal Jun 17 '20 at 17:47
  • 1
    @Aksakal I'm not the one to address that comment to. I have already protested that "Soviet" is not an adjective I am using. – whuber Jun 17 '20 at 17:50