I have a series of short strings, each describing some item (one item per string). The people who write these strings can get pretty creative when it comes to spelling. For each string, I also have the label of the true object the string refers to:
```
string          -> true label
--------------------------------
lemon           -> lemon
banana          -> banana
strawberry      -> strawberry
lemmonn         -> lemon
llemon          -> lemon
stawberry       -> strawberry
ba nana         -> banana
yellow fruit    -> banana
small red fruit -> strawberry
```
There are millions of such examples, and I'm trying to train a convolutional network to identify the true label given the string. The main issue is that it's hard to make the CNN translation-invariant, i.e. to make it recognize a token regardless of where it appears in the string.
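For context, the strings reach the Conv1D layers as fixed-length sequences of character ids, roughly along these lines (a minimal sketch; the vocabulary and maximum length are placeholders, not my exact settings):

```python
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyz 0123456789"
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding
MAX_LEN = 40

def encode(s: str) -> np.ndarray:
    """Map a string to a fixed-length sequence of character ids."""
    ids = [CHAR_TO_ID.get(c, 0) for c in s.lower()[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

encode("raw banana")  # -> [18, 1, 23, 27, 2, 1, 14, 1, 14, 1, 0, ..., 0]
```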
Here's an example of when it becomes problematic:
- `banana -> banana` appears 120k times in the training dataset
- `raw strawberry -> strawberry` appears 80k times
- `raw banana -> banana` almost never appears (1 or 2 times only)
Now comes the problem: when trying to predict the string `raw banana`, the network outputs `strawberry`. It looks like the CNN has learned that something starting with `raw` usually corresponds to `strawberry`, and there aren't enough contradictory examples (especially ones containing `banana`) to challenge this.

My question is: how do I make the CNN learn that `banana`, even when it is clearly spelled right after `raw`, is more indicative of a banana than of a strawberry? More generally, how do I make the CNN learn that `banana` is representative of a banana even when it is not at the very beginning of the string?
I've tried prefixing the input strings with random junk strings of variable length, so that a portion of the training data becomes `f89jbanana -> banana` or `ah2qo banana -> banana`, but it doesn't seem to have much effect and the problem remains.
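Concretely, the augmentation looked roughly like this (a simplified sketch; the alphabet, prefix lengths and augmentation rate here are placeholders, not my exact settings):

```python
import random
import string

def add_random_prefix(s: str, max_len: int = 5) -> str:
    """Prepend random junk of variable length,
    e.g. 'banana' -> 'f89jbanana' or 'banana' -> 'ah2qo banana'."""
    n = random.randint(1, max_len)
    junk = "".join(random.choices(string.ascii_lowercase + string.digits, k=n))
    sep = " " if random.random() < 0.5 else ""  # sometimes separate with a space
    return junk + sep + s

# Placeholder data: (string, true label) pairs
train_pairs = [("banana", "banana"), ("stawberry", "strawberry")]

# Augment roughly a third of the pairs; the label stays unchanged
augmented = [(add_random_prefix(s), label)
             for s, label in train_pairs
             if random.random() < 0.3]
```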
Note on the structure of the CNN:
The CNN I'm using is made of 3 parallel Conv1D/BatchNorm/ReLU blocks with convolution kernels of size 2, 3 and 4 respectively. Their outputs are concatenated and passed through several further convolutional steps with AveragePooling1D in between, finishing with a couple of dense layers.
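In Keras-style code, the architecture looks roughly like this (a minimal sketch; the embedding size, filter counts, layer depths and number of classes are placeholders, not my exact values):

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 40    # character vocabulary size (placeholder)
MAX_LEN = 40       # padded input length (placeholder)
NUM_CLASSES = 100  # number of true labels (placeholder)

inp = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, 16)(inp)

# 3 parallel Conv1D/BatchNorm/ReLU blocks with kernel sizes 2, 3 and 4
branches = []
for k in (2, 3, 4):
    b = layers.Conv1D(64, kernel_size=k, padding="same")(x)
    b = layers.BatchNormalization()(b)
    b = layers.ReLU()(b)
    branches.append(b)
x = layers.Concatenate()(branches)

# Several convolutional steps with AveragePooling1D in between
for _ in range(2):
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.AveragePooling1D(pool_size=2)(x)

# A couple of dense layers at the end
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```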