I have a series of short strings, each describing some item (one item per string). The people who write these strings can get pretty creative when it comes to spelling. For each string, I also have the label of the true object the string refers to:
```
string          -> true label
--------------------------------
lemon           -> lemon
banana          -> banana
strawberry      -> strawberry
lemmonn         -> lemon
llemon          -> lemon
stawberry       -> strawberry
ba nana         -> banana
yellow fruit    -> banana
small red fruit -> strawberry
```
There are millions of such examples, and I'm trying to train a convolutional network to identify the true label given the string. The main issue is that it's hard to make the CNN translation-invariant, i.e. to make it recognize a token regardless of where it appears in the string.
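For context, the strings reach the Conv1D layers as fixed-length sequences of character ids, roughly along these lines (a minimal sketch; the vocabulary and maximum length are placeholders, not my exact settings):

```python
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyz 0123456789"
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding
MAX_LEN = 40

def encode(s: str) -> np.ndarray:
    """Map a string to a fixed-length sequence of character ids."""
    ids = [CHAR_TO_ID.get(c, 0) for c in s.lower()[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

encode("raw banana")  # -> [18, 1, 23, 27, 2, 1, 14, 1, 14, 1, 0, ..., 0]
```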
Here's an example of when it becomes problematic:
- `banana -> banana` appears 120k times in the training dataset
- `raw strawberry -> strawberry` appears 80k times
- `raw banana -> banana` almost never appears (1 or 2 times only)
Now comes the problem: when trying to predict the string `raw banana`, the network outputs `strawberry`. It looks like the CNN has learned that something starting with `raw` usually corresponds to `strawberry`, and there aren't enough contradictory examples (especially ones containing `banana`) to challenge this.

My question is: how do I make the CNN learn that `banana`, even when it is clearly spelled right after `raw`, is more indicative of a banana than of a strawberry? More generally, how do I make the CNN learn that `banana` is representative of a banana even when it is not at the very beginning of the string?
I've tried prefixing the input strings with random junk strings of variable length, so that a portion of the training data becomes `f89jbanana -> banana` or `ah2qo banana -> banana`, but it doesn't seem to have much effect and the problem remains.
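Concretely, the augmentation looked roughly like this (a simplified sketch; the alphabet, prefix lengths and augmentation rate here are placeholders, not my exact settings):

```python
import random
import string

def add_random_prefix(s: str, max_len: int = 5) -> str:
    """Prepend random junk of variable length,
    e.g. 'banana' -> 'f89jbanana' or 'banana' -> 'ah2qo banana'."""
    n = random.randint(1, max_len)
    junk = "".join(random.choices(string.ascii_lowercase + string.digits, k=n))
    sep = " " if random.random() < 0.5 else ""  # sometimes separate with a space
    return junk + sep + s

# Placeholder data: (string, true label) pairs
train_pairs = [("banana", "banana"), ("stawberry", "strawberry")]

# Augment roughly a third of the pairs; the label stays unchanged
augmented = [(add_random_prefix(s), label)
             for s, label in train_pairs
             if random.random() < 0.3]
```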
Note on the structure of the CNN:
The CNN I'm using is made of 3 parallel Conv1D/BatchNorm/ReLU blocks with convolution kernels of size 2, 3 and 4 respectively. Their outputs are concatenated and passed through several further convolutional steps with AveragePooling1D in between, finishing with a couple of dense layers.
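In Keras-style code, the architecture looks roughly like this (a minimal sketch; the embedding size, filter counts, layer depths and number of classes are placeholders, not my exact values):

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 40    # character vocabulary size (placeholder)
MAX_LEN = 40       # padded input length (placeholder)
NUM_CLASSES = 100  # number of true labels (placeholder)

inp = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, 16)(inp)

# 3 parallel Conv1D/BatchNorm/ReLU blocks with kernel sizes 2, 3 and 4
branches = []
for k in (2, 3, 4):
    b = layers.Conv1D(64, kernel_size=k, padding="same")(x)
    b = layers.BatchNormalization()(b)
    b = layers.ReLU()(b)
    branches.append(b)
x = layers.Concatenate()(branches)

# Several convolutional steps with AveragePooling1D in between
for _ in range(2):
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.AveragePooling1D(pool_size=2)(x)

# A couple of dense layers at the end
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```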