Am I missing obvious problems with my model

Question

I am using Keras to train a CNN for single-label image classification. The model is trained on synthesized data and applied to real-world images. After a significant amount of trial and error I arrived at the model shown in the code below. It gives me the best results on my test data, with an accuracy of about 80%.

While this is not satisfying, I am running out of ideas for how to improve the model itself. Since I am new to this kind of machine learning, I want to make sure I am not missing the obvious, and I would be glad if anyone could review my model and tell me whether I am making a rookie mistake somewhere:

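import keras
from keras.models import Sequential
from keras.layers import (Activation, BatchNormalization, Conv2D, Dense,
                          Dropout, Flatten, MaxPooling2D, SpatialDropout2D)

# input_shape and num_classes are defined elsewhere
model = Sequential([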
  BatchNormalization(input_shape=input_shape),
  Conv2D(8, kernel_size=3, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  Conv2D(16, kernel_size=3, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  Conv2D(32, kernel_size=5, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  Conv2D(64, kernel_size=5, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  MaxPooling2D(pool_size=2),

  Conv2D(32, kernel_size=3, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  Conv2D(64, kernel_size=5, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  Conv2D(64, kernel_size=5, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  MaxPooling2D(pool_size = 2),

  Conv2D(32, kernel_size=3, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  Conv2D(64, kernel_size=5, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  Conv2D(128, kernel_size=5, padding='same'),
  BatchNormalization(),
  Activation('relu'),
  SpatialDropout2D(0.1),

  MaxPooling2D(pool_size = 2),

  Flatten(),
  Dense(1024),
  BatchNormalization(),
  Activation('relu'),
  Dropout(0.5),

  Dense(1024),
  BatchNormalization(),
  Activation('relu'),
  Dropout(0.5),

  Dense(num_classes),
  BatchNormalization(),
  Activation('softmax')
])

model.compile(loss=keras.losses.sparse_categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=[keras.metrics.sparse_categorical_accuracy])

Welcome to cross validated! I'm afraid it won't be possible to give any sensible suggestions without knowing far more details about a) your application and b) the simulation that generates your training data - you may be limited by the quality of your training simulation. a) is a general finding: good models require not only statistical/machine learning knowledge but also domain knowledge including knowledge of the data generation processes. — cbeleites unhappy with SX, Jan 15 '19 at 10:35
It is at least slightly weird to reduce the number of channels after a max-pooling layer. I think it is typical to double the number of channels instead, in order to not throw away more information than you already have thrown away through pooling. — shimao, Jan 19 '19 at 15:07

score 1 · Answer 1 · answered Jan 15 '19 at 08:52

Is there a reason you are building your own model from scratch rather than using a transfer learning approach?

For example, you could take advantage of something like Keras Applications: https://keras.io/applications/

The example to look at is the fine-tuning one reproduced below, which strips the last layer and fine-tunes on your data. Alternatively, if you do not have much data, using the 'Extract Features' example and then training a simple dense model or another non-linear ML model may be more appropriate.

There are some worked examples in this notebook, though I am not 100% sure how up to date the code is for recent Keras releases: https://github.com/cauldnz/mlopenhack/blob/master/notebooks/Challenge5.ipynb

from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing import image
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K

# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)

# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- let's say we have 200 classes
predictions = Dense(200, activation='softmax')(x)

# this is the model we will train
model = Model(inputs=base_model.input, outputs=predictions)

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = False

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# train the model on the new data for a few epochs
model.fit_generator(...)

# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.

# let's visualize layer names and layer indices to see how many layers
# we should freeze:
for i, layer in enumerate(base_model.layers):
   print(i, layer.name)

# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 249 layers and unfreeze the rest:
for layer in model.layers[:249]:
   layer.trainable = False
for layer in model.layers[249:]:
   layer.trainable = True

# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy')

# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model.fit_generator(...)

Thanks Chris, yes, there are reasons not to use a pretrained model. First, it's a learning endeavor to understand how to build the model; second, I am classifying images in specific industrial settings for which no pretrained models exist. — Herr von Wurst, Jan 15 '19 at 09:00
The 'learning exercise' is great. But, I wouldn't get too hung up on there not being a pretrained model. Unless you have a very large labelled dataset you're likely to get great results from fine-tuning or feature bottle-necking. Can you advise a bit more on your dataset? Number of examples? label balance? — Chris J.T. Auld, Jan 15 '19 at 09:05

DeltaIV · Answer 2 · 2019-01-19T15:16:24.807

You ask "if I am making a rookie mistake somewhere". By far the biggest rookie mistake you're making is not providing us with a sample dataset. If your data are proprietary, build a synthetic dataset or use an open one (there are open datasets for industrial applications too). Too time-consuming? Tough luck. Most results/best practices in machine learning depend on the dataset, so without seeing it, we cannot help you much. Actually, I'm not even sure this is a suitable question for this site.

At the very least, tell us more about the problem: image classification (I guess)? How many classes? Balanced? Unbalanced? What are the class frequencies in the dataset? Show us the ROC curve (yes, you can plot ROC curves for multiclass problems too).
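To make the class-balance questions concrete, here is a minimal sketch of how the class frequencies and the trivial "always predict the most frequent class" baseline could be computed. The label list is made up for illustration and the helper names are hypothetical; nothing here comes from the asker's actual data.

```python
from collections import Counter


def class_frequencies(labels):
    """Return {class: fraction of samples} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_frequencies(labels).values())


# Hypothetical labels (an assumption): 3 classes, skewed towards class 0.
labels = [0] * 80 + [1] * 15 + [2] * 5

freqs = class_frequencies(labels)
baseline = majority_baseline(labels)

for cls, frac in freqs.items():
    print(f"class {cls}: {frac:.2%}")
print(f"majority-class baseline accuracy: {baseline:.2%}")
```

With a skew like the one assumed here, an accuracy of 80% would be no better than the trivial baseline, which is why the frequencies matter when judging the reported number.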

Having said that, here are a few things that stick out:

You say that 80% accuracy "is not satisfying". Have you tried estimating the Bayes error rate? It may be that, with your dataset and the features you're using, it's impossible to increase the accuracy significantly above 80%. What is the human accuracy for this task, on a random sample? PS the human performing the task must not already know the right labels, of course (thus you may not be the right candidate for this "human test").
The architecture seems to be some weird variation on VGG-19. I guess you know already that this is not even remotely optimal for image classification: if not, read about Inception, ResNet, ResNeXt or NASNet.
BatchNorm tends not to play nicely with Dropout. Are you sure that you need both of them for all the convolutional layers? The fact that you have low dropout rates for all layers except the FC ones may indicate that by eliminating some (possibly all) of the dropout layers after the convolutional layers, you'll get statistically insignificant differences in results.
I'd use RMSProp or Adam (AdamW if you were using weight decay, but you're not) rather than Adadelta. But I bet it won't make a big difference. What may make a difference, instead, is using SGD with Nesterov momentum and a good learning rate schedule.
You don't show us the model.fit, so it's impossible to know if you made some silly mistakes with minibatches and/or validation.
You shouldn't monitor accuracy anyway. I know you're an industrial Data Scientist and accuracy is the only metric your stakeholders/internal customers understand, but that's your problem and you have to figure out a solution to that. It doesn't change the fact that monitoring validation accuracy is a crappy way to perform model selection/tune hyperparameters. You should monitor validation loss instead, or you'll be in for a nasty surprise when you apply your model to data never seen before (thus, not the data you used to select the current architecture). Have a look at

Why is accuracy not the best measure for assessing classification models?

Good accuracy despite high loss value

How is it possible that validation loss is increasing while validation accuracy is increasing as well
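A small sketch of why accuracy and loss can disagree, using made-up predicted probabilities for a two-class problem (the numbers and helper names are assumptions chosen for illustration). Both hypothetical models get the same samples right, so their accuracy is identical, but one of them is badly overconfident on its single mistake and is punished for it by the cross-entropy loss.

```python
import math


def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class is the true label."""
    hits = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=lambda i: p[i])
        hits += int(predicted == y)
    return hits / len(labels)


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


labels = [0, 1, 0, 1]

# Model A: cautious, barely right on three samples, barely wrong on the last.
probs_a = [[0.55, 0.45], [0.45, 0.55], [0.55, 0.45], [0.55, 0.45]]
# Model B: confident on three samples, confidently wrong on the last.
probs_b = [[0.95, 0.05], [0.05, 0.95], [0.95, 0.05], [0.99, 0.01]]

acc_a, acc_b = accuracy(probs_a, labels), accuracy(probs_b, labels)
loss_a, loss_b = cross_entropy(probs_a, labels), cross_entropy(probs_b, labels)

print(f"model A: accuracy={acc_a:.2f}, loss={loss_a:.3f}")
print(f"model B: accuracy={acc_b:.2f}, loss={loss_b:.3f}")
```

Accuracy only sees the argmax, so it cannot tell these two models apart; the loss uses the full predicted probabilities and can.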

"The model is being trained on synthesized data and applied to real world images". This has the potential to be a HUGE issue: distribution shift/label shift is a real thing, and neural networks trained on the MNIST handwritten digits dataset don't even learn to recognize handwritten digits (except those in the MNIST test dataset). Thus I would be very afraid of deploying in production a model trained on data which don't come from the actual data distribution.
Finally, why do you halve the number of channels after maxpooling? Usually you either double or keep constant. See the VGG-1X architectures here.
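As a rough back-of-the-envelope illustration of the channel question, the sketch below counts activations before and after a 2x2 max-pooling layer followed by a 'same'-padded convolution. The 64x64 feature map size is an assumption (the question does not state the input shape), the helper names are hypothetical, and the raw activation count is only a crude proxy for how much information a layer can carry.

```python
def activation_count(height, width, channels):
    """Number of activations in a feature map of the given shape."""
    return height * width * channels


def after_pool_and_conv(height, width, out_channels, pool=2):
    """Activations after pooling (spatial size divided by `pool`) followed by
    a 'same'-padded convolution with `out_channels` filters."""
    return activation_count(height // pool, width // pool, out_channels)


# Assumed shape of the feature map just before the first MaxPooling2D:
# 64 x 64 spatial positions, 64 channels (the channel count is from the question).
h, w, c = 64, 64, 64
before = activation_count(h, w, c)

halved = after_pool_and_conv(h, w, 32)    # as in the question: 64 -> 32 channels
kept = after_pool_and_conv(h, w, 64)      # keep the channel count constant
doubled = after_pool_and_conv(h, w, 128)  # VGG-style doubling

print("reduction when halving channels: ", before // halved)   # 8
print("reduction when keeping channels: ", before // kept)     # 4
print("reduction when doubling channels:", before // doubled)  # 2
```

Under these assumptions, halving the channels right after pooling shrinks the representation by a factor of 8 in one step, whereas doubling limits the reduction to a factor of 2.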

Thank you very much for your straightforward and very detailed answer. I will look at your links and try to improve on the question. As you have guessed correctly the dataset is proprietary, so my hands are tied at giving out detailed info at the moment. — Herr von Wurst, Jan 19 '19 at 21:27
@HerrvonWurst well, can you at least answer my questions? "image classification (I guess)? How many classes? Balanced? Unbalanced? Which are the frequencies in the dataset? Show us the ROC curve". — DeltaIV, Jan 19 '19 at 23:23
