
In some recent machine learning papers (e.g. MobileNetV2), ReLU6, defined as

$\mathrm{ReLU6}(x)=\min(\max(0,x),6)$

is used instead of the regular ReLU non-linearity.

Doesn't such a function suffer from the same vanishing-gradient problem as sigmoid functions? My assumption is that batch normalization might be the reason this works, but if that were true, the same argument would also apply to (scaled) sigmoid functions.
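To make the concern concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from the paper) of ReLU6 and its derivative. Note that the gradient is zero not only for negative inputs, as with plain ReLU, but also for inputs above 6, which is the two-sided saturation the question is about.

```python
import numpy as np

def relu6(x):
    # ReLU6 as defined above: clip activations to the range [0, 6].
    return np.minimum(np.maximum(0.0, x), 6.0)

def relu6_grad(x):
    # Derivative of ReLU6 w.r.t. its input: 1 on (0, 6), 0 elsewhere.
    return ((x > 0) & (x < 6)).astype(float)

x = np.array([-2.0, 0.5, 3.0, 7.0, 10.0])
print(relu6(x))       # [0.  0.5 3.  6.  6. ]
print(relu6_grad(x))  # [0. 1. 1. 0. 0.]  <- gradient also vanishes for x > 6
```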

Ash
  • Please, do share the background of the magic number 6. Thank you – Jim Dec 06 '18 at 18:43
  • @Jim To be honest I don't know. They say in the linked paper that this has to do with limiting accuracy loss when using low-precision arithmetic, but I haven't found anything precise on it (most of the justification seems to be experimental, although http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf seems to link it to something a bit more theoretical that I haven't had the time to go through). Regardless, the gradient problem is what I find more puzzling at the moment... – Ash Dec 06 '18 at 18:54
  • 3
    Batch normalized would also work for sigmoids. The issue is that sigmoids are 10x slower to compute compared to relus. So the speed advantage is considerable. – Alex R. Dec 06 '18 at 22:48

0 Answers