In some recent machine learning papers (e.g. MobileNetV2), ReLU6, defined as
$\mathrm{ReLU6}(x)=\min(\max(0,x),6),$
is used instead of regular ReLU non-linearities.
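For concreteness, here is a minimal NumPy sketch of the function and its derivative (`relu6` and `relu6_grad` are just illustrative names); the derivative is 1 on the interval (0, 6) and 0 everywhere else:

```python
import numpy as np

def relu6(x):
    # Clamp activations to the range [0, 6].
    return np.minimum(np.maximum(0.0, x), 6.0)

def relu6_grad(x):
    # Piecewise-constant derivative: 1 on the open interval (0, 6),
    # 0 for x < 0 (like plain ReLU) and also 0 for x > 6.
    return ((x > 0) & (x < 6)).astype(float)

xs = np.array([-1.0, 3.0, 7.0])
print(relu6(xs))       # [0. 3. 6.]
print(relu6_grad(xs))  # [0. 1. 0.]
```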
Doesn't such a function suffer from the same vanishing gradient problem as sigmoid functions, since its gradient is zero for inputs above 6? My assumption is that batch normalization might be what makes this work, but if that were true, the same argument would also apply to (scaled) sigmoid functions.
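To make that comparison concrete, here is a rough sketch of the sigmoid derivative (`sigmoid_grad` is just an illustrative name), which is at most 0.25 and decays toward 0 for large |x|, whereas the ReLU6 derivative above is exactly 1 anywhere inside (0, 6):

```python
import numpy as np

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x));
    # it peaks at 0.25 at x = 0 and decays toward 0 as |x| grows.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

xs = np.array([-8.0, 0.0, 3.0, 8.0])
print(sigmoid_grad(xs))  # approx [0.0003, 0.25, 0.045, 0.0003]
```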