I am relatively new to statistics but am very keen to learn the best approaches with working examples as I prefer to learn this way. We are clearly living in an unprecedented time with COVID-19 and I felt this might be a suitable subject to apply myself to.

The problem

My end goal is to identify Health Boards which are outliers in terms of COVID-19 mortality rate, i.e. COVID-19 deaths relative to the total number of patients diagnosed with COVID-19 in beds the health board manages. I define an outlier as a health board whose COVID-19 mortality rate cannot be attributed to chance variability alone.

Datasets

  1. The total number of patients in hospitals diagnosed with COVID-19 within each health board.
  2. The total number of deaths in patients diagnosed with COVID-19 in hospital within each health board.
  3. The total number of people in the tested population diagnosed with COVID-19 located in the health board.

Methodology

I have initially created a funnel plot, as suggested in "The Art of Statistics: Learning from Data" by David Spiegelhalter. I have plotted the mortality rate (COVID-19 deaths per patient diagnosed with COVID-19 in hospital) against the total number of patients diagnosed with COVID-19 in hospital. I have used 95% and 99.8% control limits, and several health boards fall outside them as "outliers". Part of this is presumably due to higher infection rates in particular health board areas, which is where the 3rd dataset comes in. This is also where I need guidance...
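The control limits described above can be sketched as follows. This is a minimal illustration, not the exact construction from the book: it assumes a normal approximation to the binomial, takes the pooled rate across all boards as the target, and uses made-up counts. The names `funnel_limits`, `classify`, and `boards` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def funnel_limits(p0, n, coverage):
    """Normal-approximation control limits for a proportion with denominator n."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.96 for 95%, 3.09 for 99.8%
    half_width = z * sqrt(p0 * (1 - p0) / n)
    return max(0.0, p0 - half_width), min(1.0, p0 + half_width)

# Hypothetical counts (not real data): board -> (deaths, patients)
boards = {"A": (40, 200), "B": (30, 150), "C": (150, 400), "D": (15, 60)}

# Assumption: the target rate is the pooled rate over all boards.
p0 = sum(d for d, _ in boards.values()) / sum(n for _, n in boards.values())

def classify(deaths, patients):
    """Say which funnel band a board's observed rate falls in."""
    rate = deaths / patients
    lo95, hi95 = funnel_limits(p0, patients, 0.95)
    lo998, hi998 = funnel_limits(p0, patients, 0.998)
    if not lo998 <= rate <= hi998:
        return "outside 99.8%"
    if not lo95 <= rate <= hi95:
        return "outside 95%"
    return "inside"

flags = {name: classify(d, n) for name, (d, n) in boards.items()}
for name, flag in flags.items():
    print(name, flag)
```

For boards with few patients the normal approximation is rough, so exact binomial limits may be preferable there.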

Question

I have performed multiple linear regression with datasets 1 and 3 as the independent variables and the total number of deaths in patients diagnosed with COVID-19 in hospital within each health board (dataset 2) as the dependent variable. This shows a promising correlation with small p-values. Is using the residuals a suitable method for identifying significant outliers, or is this the wrong step for answering my question?
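To make the residual idea concrete, here is a minimal sketch of the regression described above: deaths regressed on hospital patients and diagnosed cases by ordinary least squares, with each residual divided by the residual standard deviation. The data are made up, and `solve` and `ols_standardised_residuals` are hypothetical helper names. The standardisation is deliberately simple and ignores leverage.

```python
from math import sqrt

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols_standardised_residuals(rows, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, residual / s).

    Assumes the predictors are not collinear and that there are more
    observations than coefficients.
    """
    X = [[1.0, *row] for row in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)             # normal equations
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    s = sqrt(sum(e * e for e in resid) / (len(y) - p))  # residual standard deviation
    return beta, [e / s for e in resid]

# Hypothetical data for eight health boards (not real figures).
patients = [200, 150, 400, 60, 320, 90, 260, 510]          # dataset 1
cases = [1500, 900, 3000, 700, 2600, 1100, 1800, 4100]     # dataset 3
deaths = [40, 30, 150, 15, 70, 18, 52, 120]                # dataset 2

beta, z_resid = ols_standardised_residuals(list(zip(patients, cases)), deaths)
for board, z in enumerate(z_resid):
    print(board, round(z, 2))
```

One caveat with this approach: for count data the variance tends to grow with the size of the board, so raw least-squares residuals from large boards are naturally larger. A binomial or Poisson regression may suit the data better.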

AWGIS
  • I find your outliers difficult to understand intuitively. You're saying that an outlier is an observation *"which cannot be associated with chance variability alone"*. The difficulty I have with that is that 'chance variability' is always defined in a limited way, i.e. 'chance' is defined by means of an imperfect model. If that model is not very accurate, then 'outliers' are often not exceptional and more a reflection of error in the model used to describe the 'chance variability'. So it is a bit confusing what you are trying to achieve with detecting outliers. Could you explain the underlying goal... – Sextus Empiricus Jun 01 '20 at 08:53
  • ...an example of these ponderings: Say that you use homogeneously distributed death rates according to a simple model, say binomially distributed death counts (death rate is *constant*). But the death rate can vary a lot depending on uncontrolled factors. So you're likely gonna find outliers on both ends (more/fewer deaths than expected). The situation is that your model might assume a certain (small) degree of dispersion that may not be realistic. With this uncertainty, the term *outlier* becomes a bit difficult. So my question is: **With respect to *what (null-)model* do you wish to test and why?** – Sextus Empiricus Jun 01 '20 at 09:00
  • @SextusEmpiricus thank you. I guess there are two parts; the initial null model being that there is no linear relationship between variables 1 and 3 and the dependent variable 2. The tests show that this is not true and that there is such a relationship. From there I then want to infer which health boards appear to have other variables at play, giving lower or higher values for variable 2 than expected. At this stage I am not bothered what those other variables might be. Sorry if this hasn't answered your questions – AWGIS Jun 01 '20 at 11:06
  • My comment was on one side mostly something pedantic about 'outlier', but also it is about the question being unclear on what you want to do/achieve. You seem to be looking for health boards with a relatively high death rate, which can be done by simply making an ordered table with the death rates (or with a funnel plot like you did, in which case you compare z-scores with respect to a binomial distribution model). But the reason *why* you want to do this will be of influence on how to answer the question. It is not so clear what further step you want to take. – Sextus Empiricus Jun 01 '20 at 11:16
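The two points raised in these comments, z-scores relative to a binomial model and a dispersion that may be larger than that model assumes, can be sketched together. This is only an illustration with made-up counts. The dispersion estimate (sum of squared z-scores divided by the number of boards minus one) and the rescaling by its square root are one common adjustment chosen here as an assumption, not something proposed in the thread. The function names are hypothetical.

```python
from math import sqrt

def binomial_z_scores(counts, p0=None):
    """z-score of each board's death rate under a constant-rate binomial model.

    counts: list of (deaths, patients). If p0 is None, the pooled rate is used.
    """
    if p0 is None:
        p0 = sum(d for d, _ in counts) / sum(n for _, n in counts)
    return [(d / n - p0) / sqrt(p0 * (1 - p0) / n) for d, n in counts]

def dispersion_factor(z):
    """Pearson-style dispersion estimate; close to 1 if the binomial model fits."""
    return sum(v * v for v in z) / (len(z) - 1)

# Hypothetical counts (not real data): (deaths, patients) per health board
counts = [(40, 200), (30, 150), (150, 400), (15, 60)]

z = binomial_z_scores(counts)
phi = dispersion_factor(z)

# If the rates vary more than the binomial model allows (phi > 1),
# shrink the z-scores so that "outlier" is judged against the wider spread.
adjusted = [v / sqrt(phi) for v in z] if phi > 1 else z

print(round(phi, 2), [round(v, 2) for v in adjusted])
```

With only a handful of boards the dispersion estimate is itself very noisy, and a single extreme board can inflate it enough to hide itself, so the adjusted values should be read with caution.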

0 Answers