Apparently they did find 22 such collisions in their data.
What they do is first divide the words into $n$-grams and then one-hot encode them into a vector. This is not described explicitly in the paper, but it may be guessed from the context that each position in the vector marks the occurrence (coded as one) or absence (coded as zero) of a particular $n$-gram in the word. That is the reason why they observed $10,306$ unique $n$-grams (i.e. vector dimensions) for the $40\text{k}$ word set and $30,621$ for the $500\text{k}$ word set. Notice that $30,621^{1/3} = 31.28$ and $10,306^{1/3} = 21.76$ (for three-grams), while the number of possible three-grams built from the Latin characters, "-", and "#" is $28^3=21,952$; non-standard characters like "æ" or "ö" may also appear, so the length of the vectors is simply the number of unique $n$-grams observed in the data. Of course, language is not built by combining letters at random, so not all combinations will appear, nor will they be equally popular; hence the larger the collection of words, the more tokens we'll observe.
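To make this concrete, here is a minimal sketch of the encoding as I understand it (the function names and the toy word list are mine, not from the paper; I use "#" as the word-boundary marker):

```python
from itertools import chain

def letter_ngrams(word, n=3):
    """Split a word into letter n-grams after padding it with '#' boundary markers."""
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def encode(words, n=3):
    """Binary bag-of-letter-n-grams: one vector position per n-gram observed in the data."""
    vocab = sorted(set(chain.from_iterable(letter_ngrams(w, n) for w in words)))
    index = {g: i for i, g in enumerate(vocab)}
    vectors = {}
    for w in words:
        vec = [0] * len(vocab)
        for g in letter_ngrams(w, n):
            vec[index[g]] = 1  # presence only: order and counts are discarded
        vectors[w] = vec
    return vocab, vectors

vocab, vectors = encode(["good", "goods", "banana"])
print(len(vocab))       # vector length = number of unique 3-grams seen in this tiny "corpus"
print(vectors["good"])  # binary vector with a 1 at every 3-gram of "#good#"
```

Run on the full $500\text{k}$ word list, `len(vocab)` is presumably what corresponds to the $30,621$ figure above.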
What this also means is that neither the order nor the number of times the $n$-grams appear is accounted for. For example, "aaa" and "aaaa" both contain only the 3-grams "#aa", "aaa", and "aa#", so both would be encoded as the same vector (see the sketch below). As you can see from the paper, such cases are very rare, so it would be hard to come up with a more realistic example; at least no such example immediately comes to my mind. I skimmed through the paper but didn't find what data they used, but you could always preprocess the data and check the duplicates by hand to verify what they were.
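A tiny sketch of that collision (my own helper, following the "#"-padding convention above):

```python
def trigram_set(word):
    """Set of letter 3-grams of '#word#' -- what the binary vector effectively records."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

print(trigram_set("aaa"))                         # the three 3-grams '#aa', 'aaa', 'aa#' (in some order)
print(trigram_set("aaa") == trigram_set("aaaa"))  # True: same set, hence the same vector
```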
Still, the tl;dr is that collisions should be a rare case for human language. Of course, this does not have to be the case for all kinds of sequences. For example, if you encoded DNA sequences like this, I'd imagine there would be a lot of collisions, since they consist of only four nucleobases (A, G, C, and T), so there is a much smaller number of possible $n$-grams among them.
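For instance, here is a quick simulation with random strings (not real genomic data, and the length of 300 is my arbitrary choice); with only $4^3 = 64$ possible trigrams, long sequences tend to contain all of them, so their presence/absence vectors collide constantly:

```python
import random

random.seed(0)

def trigram_set(seq):
    """Presence/absence encoding: just the set of 3-grams occurring in the sequence."""
    return frozenset(seq[i:i + 3] for i in range(len(seq) - 2))

# 10,000 random DNA sequences of length 300 over the four-letter alphabet
seqs = ["".join(random.choices("ACGT", k=300)) for _ in range(10_000)]
print(len(set(seqs)))                       # essentially all 10,000 sequences are distinct
print(len({trigram_set(s) for s in seqs}))  # far fewer distinct encodings, i.e. many collisions
```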