
Rajkomar, A., Oren, E., Chen, K., Dai, A.M., Hajaj, N., Liu, P.J., Liu, X., Sun, M., Sundberg, P., Yee, H. and Zhang, K., 2018. Scalable and accurate deep learning for electronic health records. arXiv preprint arXiv:1801.07860. https://arxiv.org/pdf/1801.07860.pdf

From the paper's "Model Evaluation and Statistical Analysis" section: "Patients were randomly split into development (80%), validation (10%) and test (10%) sets. Model accuracy is reported on the test set, and 1000 bootstrapped samples were used to calculate 95% confidence intervals. To prevent overfitting, the test set remained unused (and hidden) until final evaluation."
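For reference, I assume the "1000 bootstrapped samples" means a percentile bootstrap of a test-set metric, roughly along these lines (my own sketch with AUROC as an example metric; the function name and details are not from the paper):

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    # Percentile-bootstrap CI for AUROC on a fixed test set (illustrative only)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n, stats = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample test cases with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                              # skip resamples containing a single class
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])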

The study involved only 216,221 hospitalizations from 114,003 patients, so relying on a single test set, especially with some highly imbalanced outcomes, does not seem to make sense. Would nested cross-validation, or bootstrapping the full dataset, not have been vastly superior, given that some of the roughly 14,000 diagnoses they were predicting may have had (guessing) only about 200 cases in the whole dataset ($p$ = 0.001)? I think the test set contains about 20,000 hospitalizations.

import numpy as np

# Number of positive cases per diagnosis in a 20,000-patient test set, with p = 0.001
obs = [np.random.binomial(n=20000, p=0.001) for _ in range(14000)]
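Run this way, the simulated counts centre on about 20 positives per diagnosis, with a 95% range of roughly 11 to 29 (these figures follow from my guessed incidence above, not from the paper); the histogram below shows the spread.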

[Histogram of the simulated per-diagnosis positive-case counts]

A hold-out might therefore capture only around 20 positive cases for some diseases (assuming most are rare). Further, bootstrapping the held-out test set does not really help to detect this variation, because every bootstrap replicate is drawn from the same handful of positives.
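To make this concrete, here is a quick sketch (my own numbers, assuming roughly 20 positives in the hold-out and a true sensitivity of 0.8) of how noisy a metric estimated from so few positive cases is:

import numpy as np

rng = np.random.default_rng(0)
n_pos, true_sens = 20, 0.8                      # assumed positives in the hold-out and true sensitivity
est_sens = rng.binomial(n=n_pos, p=true_sens, size=10_000) / n_pos
print(np.percentile(est_sens, [2.5, 97.5]))     # roughly [0.60, 0.95]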

Generally, for their predictive models, I imagine Google uses a single hold-out because their sample sizes are so large, but for these rare diseases this might be a real issue, and many diseases are rare.

Edit: the paper has now been published (fast-tracked): https://www.nature.com/articles/s41746-018-0029-1

sjw
  • Arxiv isn't peer reviewed! – generic_user Feb 25 '18 at 22:17
  • @generic_user agree, but this paper has already gotten so much exposure – sjw Feb 25 '18 at 22:22
  • Is their dataset public? Try to overturn their result. Is it private? Nobody should trust it. – generic_user Feb 25 '18 at 22:45
  • Unfortunately, neither the dataset nor the code appears to be public. Maybe they are waiting until publication. – sjw Feb 25 '18 at 22:50
  • I think the real tragedy is that they buried their (terrible) calibration results in the appendix. If I'm told at admission that my probability of dying is 50%, I would expect to die every other time, not 80% of the time... – M Turgeon Feb 28 '18 at 20:35
  • The dataset consists of medical records from patients who did not give explicit consent to this study. Although the records are "de-identified" and human-studies review boards at the hospitals allowed these investigators to perform these analyses, as provided in US regulations, those review boards might be hard pressed to allow public release of the data. The fear would be breaching patient confidentiality--might not such powerful machine-learning approaches, together with other publicly available information, somehow allow identification of the patients and thus their medical records? – EdM Feb 28 '18 at 21:33
  • @EdM and perhaps the models as well: Carlini, Nicholas, et al. "The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets." arXiv preprint arXiv:1802.08232 (2018), https://arxiv.org/abs/1802.08232 ... generally, though, some of these datasets are shared (e.g. the MIMIC Critical Care Database, https://mimic.physionet.org) – sjw Feb 28 '18 at 22:39
  • "*so much exposure*"? There are 35 co-authors... I fully accept that having 35 people reading a paper counts as "*much exposure*"... More seriously, yes, it is a Google Research paper written in collaboration with two of the biggest US medical schools, so it is bound to get exposure. That does not mean it is jaw-droppingly good; for example, I see no baseline model. Any ML paper claiming to be the new sliced bread of ML without a comparison against existing standard high-performance solutions (e.g. gradient boosting) is somewhat fishy. – usεr11852 Mar 01 '18 at 00:01

1 Answer


I think that with over 100,000 patients, small sample sizes are not a consideration. Both traditional split-sample validation and more complicated resampling techniques should lead to approximately the same model development and validation measures.
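A quick simulation sketch (my own illustration with synthetic data and a simple logistic model, assuming a reasonably common binary outcome) shows the two approaches agreeing closely at this sample size:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: n = 100,000 patients, 10 covariates, a non-rare binary outcome
rng = np.random.default_rng(0)
n, p = 100_000, 10
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ rng.normal(size=p) - 2.0))))

model = LogisticRegression(max_iter=1000)

# Single 10% hold-out
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
holdout_auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# 10-fold cross-validation on the full data
cv_auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()

print(holdout_auc, cv_auc)   # the two estimates typically agree closely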

AdamO
  • If the outcome is relatively common, then I agree. But what if it's an outcome with very low incidence? – sjw Mar 13 '18 at 23:24