
I am trying to calculate the ROC AUC for a dataset where I can't fit the predictions and labels in memory (tens to hundreds of billions of samples). Is there a way to calculate the AUC in a distributed way, or at least to approximate it?

I know that to calculate the AUC you need to sweep over all possible thresholds to compute the FPR and TPR, and then calculate the area under that curve.

Is there a way to do this in a distributed fashion? The only thing I can think of is to calculate the AUC for the different batches and then average the different AUCs.
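
As a quick sanity check on that idea, here is a minimal sketch on synthetic data (assuming NumPy and scikit-learn, neither of which is mentioned above): each batch's AUC is invariant to shifting that batch's scores, but the pooled ranking is not, so the mean of per-batch AUCs can drift away from the pooled AUC when the machines' scores are not calibrated against each other.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_batch(n, shift=0.0):
    """Synthetic labels and scores; `shift` mimics a per-machine calibration offset."""
    labels = rng.integers(0, 2, size=n)
    scores = labels * 0.5 + rng.normal(0.0, 0.5, size=n) + shift
    return labels, scores

# Two "machines" with 1,000 labelled predictions each; the second one's
# scores are shifted, which leaves its own AUC unchanged.
b1 = make_batch(1_000)
b2 = make_batch(1_000, shift=1.0)

labels = np.concatenate([b1[0], b2[0]])
scores = np.concatenate([b1[1], b2[1]])

mean_batch_auc = (roc_auc_score(*b1) + roc_auc_score(*b2)) / 2
pooled_auc = roc_auc_score(labels, scores)

print(f"mean of per-batch AUCs: {mean_batch_auc:.3f}")
print(f"pooled AUC:             {pooled_auc:.3f}")  # noticeably lower here
```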

  • How long is that vector you can't fit into memory? – Firebug Feb 02 '18 at 15:28
  • I can fit the vector, but I don't have all the samples on the same machine. For example: I have 1k samples on one machine and 1k samples on another machine. Is calculating 2 distinct AUCs and then averaging them correct? – Andrea Bergonzo Feb 02 '18 at 15:30
  • It's not. Why can't you simply merge the scores into a single vector? – Firebug Feb 02 '18 at 15:36
  • Each prediction is a double of 8 bytes and each label is just a boolean of 1 bit. So if I have 1 billion samples I'll need about 8 GB just for the scores, right? I was wondering if there is a technique that doesn't rely on having a lot of memory available. – Andrea Bergonzo Feb 02 '18 at 15:42
  • You most probably don't need all 1 billion scores to get reasonable precision on the AUC estimate. – Firebug Feb 02 '18 at 15:45
  • Unlike training the model itself, the ROC doesn't benefit that much from bigger datasets. With even only 1 million randomly sampled data points you'll have an estimate of the AUC accurate to several decimal places. The real problem here is that your "labels are probabilities". I have no idea how you want to perform a ROC analysis with that...? – Calimo Feb 03 '18 at 13:56
  • Yeah, sorry, probabilities as predictions but labels are binary. – Andrea Bergonzo Feb 03 '18 at 21:09
  • @Sycorax If you have tens/hundreds of billions of samples you cannot just buy more RAM. The best way to solve it is sampling, as Firebug and Calimo mentioned. – Andrea Bergonzo Jul 30 '18 at 12:35
  • @AndreaBergonzo It wasn't clear to me that you had 100s of billions of samples; your comment about memory consumption suggests that you have 1 billion samples, which would of course consume only about 8 GB of RAM. Perhaps you could edit your question to clarify that? – Sycorax Jul 30 '18 at 15:21

1 Answer

  • The comments note that sampling will give very precise ROC AUC estimates, even if you only use a fraction of your billions of observations. Working out how tight the confidence interval is for your specific problem should be simple enough: load as many randomly-sampled prediction/label pairs as you can into memory and compute the interval (a sketch follows after this list). If the interval is too wide for your purposes, you'll need more RAM or an out-of-memory method. Luckily, there are simple ways to do either.

  • You can also do the ROC AUC computation without loading all of the data into memory. Suppose you have a flat file with two columns: predicted value and label. Sort the file by the predicted value in descending order (for example with the sort command-line utility), then use your preferred method of iterating over a file (command-line utilities, Bash scripting, Python, whatever) to implement this algorithm (a single-pass sketch also follows after this list): https://stats.stackexchange.com/a/146259/22311

  • AWS will let you rent a machine with 384 GB of RAM for less than \$5/hr. In the comments, you write that 1 billion predicted values take about 8 GB of memory, so you'll be able to load roughly 48 billion samples into memory for less than \$5/hr. The X1 series is more expensive, but scales to nearly 2000 GB of RAM for less than \$14/hr. (All prices as of this writing.)
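
A minimal sketch of the sampling route, assuming scikit-learn and NumPy; the data below is simulated as a stand-in for a uniform random subsample of your real prediction/label pairs, and the sample size and bootstrap replicate count are arbitrary placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Placeholder: pretend these are 100,000 prediction/label pairs drawn
# uniformly at random from the full dataset (e.g. sampled on each machine
# and then merged).
labels = rng.integers(0, 2, size=100_000)
scores = labels * 0.4 + rng.normal(0.0, 0.5, size=100_000)

auc = roc_auc_score(labels, scores)

# Bootstrap the subsample to see how tight the interval already is.
boot = []
for _ in range(200):
    idx = rng.integers(0, len(labels), size=len(labels))
    boot.append(roc_auc_score(labels[idx], scores[idx]))

low, high = np.percentile(boot, [2.5, 97.5])
print(f"AUC ≈ {auc:.4f}, 95% bootstrap CI [{low:.4f}, {high:.4f}]")
```

And a minimal sketch of a single-pass computation over a pre-sorted file (one way to do it; it need not match the linked answer's exact presentation). It assumes a comma-separated layout with the score first and a 0/1 label second, already sorted by score in descending order; the file names and the `streaming_auc` helper are illustrative, and exact score ties are ignored for brevity (they would need a half-credit correction).

```python
import csv

def streaming_auc(path):
    """AUC from a "score,label" file sorted by score in DESCENDING order.

    Reads one line at a time, so memory use is constant. Counts, for each
    negative, how many positives are ranked above it; the AUC is that count
    divided by n_pos * n_neg.
    """
    n_pos = n_neg = 0
    pairs_above = 0  # (positive ranked above negative) pairs seen so far
    with open(path) as f:
        for _score, label in csv.reader(f):
            if label == "1":
                n_pos += 1
            else:
                n_neg += 1
                pairs_above += n_pos  # every positive seen so far outranks this negative
    return pairs_above / (n_pos * n_neg)

# Placeholder preprocessing with GNU sort (numeric, descending on column 1):
#   sort -t, -k1,1gr predictions.csv > predictions_sorted.csv
# print(streaming_auc("predictions_sorted.csv"))
```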

Sycorax