I have 33 classes (33 different objects), and I need to recognize each object from any viewpoint. For example, a packet of potato chips looks different from different views (as shown in the attached images).
I trained all views of an object as a single class, even though they look quite different. I then augmented with random rotations and added noise, so each object has about 600 training images.
With 33 classes, each covering multiple views, that gives about 20,000 training images in total.
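For reference, the rotation-plus-noise augmentation I described can be sketched roughly like this (a minimal NumPy-only sketch; the `augment` function name and the noise level are just my own choices, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, n_aug=4):
    """Create n_aug augmented copies of an image:
    a random 90-degree rotation plus additive Gaussian noise."""
    out = []
    for _ in range(n_aug):
        # rotate by a random multiple of 90 degrees
        rotated = np.rot90(image, k=int(rng.integers(0, 4)))
        # add Gaussian noise (std of 10 intensity levels, chosen arbitrarily)
        noisy = rotated.astype(np.float64) + rng.normal(0.0, 10.0, size=rotated.shape)
        out.append(np.clip(noisy, 0, 255).astype(np.uint8))
    return out

# toy 64x64 grayscale "image" standing in for a real photo
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
aug = augment(img)
print(len(aug), aug[0].shape, aug[0].dtype)  # 4 (64, 64) uint8
```

In practice I do roughly this (with arbitrary-angle rotation as well) to go from the raw photos to the ~600 images per object.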
When I train with SSD, detection is very poor. I am currently training Faster R-CNN instead; training is still running, but I notice the loss is not converging well.
How can I improve detection? Is it OK to train the different views of an object as a single class?
Or should I separate the images by view and train a separate model for each view?