Why is BLAKE2 faster than chacha20?

Question

ChaCha20 is essentially a hash function that maps 512-bit strings to other 512-bit strings, which are in turn XORed with the plaintext to create the ciphertext. Of the 512 input bits, 128 are used for the "expand 32-byte k" constant, 256 for the key, 64 for the nonce, and the remaining 64 for the counter.
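To make the layout concrete, here is a minimal sketch of how the 512-bit input block can be assembled as sixteen little-endian 32-bit words. It assumes the original ChaCha layout with a 64-bit counter and a 64-bit nonce, in which the counter words come before the nonce words; the function name `chacha20_initial_state` is made up for illustration and does not come from any library. It only builds the input, it does not run the rounds.

```python
import struct

# 16-byte constant that fills the first four 32-bit words of the state.
SIGMA = b"expand 32-byte k"


def chacha20_initial_state(key: bytes, counter: int, nonce: bytes) -> list:
    """Return the 16-word input block (hypothetical helper, layout only)."""
    if len(key) != 32:
        raise ValueError("key must be 32 bytes (256 bits)")
    if len(nonce) != 8:
        raise ValueError("nonce must be 8 bytes (64 bits)")
    # 16 (constant) + 32 (key) + 8 (counter) + 8 (nonce) = 64 bytes = 512 bits
    block = SIGMA + key + struct.pack("<Q", counter) + nonce
    # Interpret the 64 bytes as sixteen little-endian 32-bit words.
    return list(struct.unpack("<16I", block))


state = chacha20_initial_state(bytes(32), counter=1, nonce=bytes(8))
print([hex(word) for word in state[:4]])
```

The first four words are the constant, words 4 to 11 hold the key, words 12 and 13 the counter, and words 14 and 15 the nonce.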

I noticed in the eBASH/eBASC benchmarks here and here that, depending on the architecture, ChaCha20's speed for encrypting a 512-bit message ranges (at least on the 64-bit architectures) from slightly faster to significantly slower than BLAKE2b. This assumes BLAKE2b is used in a mode similar to ChaCha20's: a 512-bit input consisting of a constant, key, nonce, and counter is hashed, and the result is XORed with the message. BLAKE2b is a "real" cryptographic, collision-resistant hash function based on ChaCha that, given a message of any (reasonable) size, produces a 512-bit output. BLAKE2b has 12 rounds (each of which, if I understand correctly, is equal to two ChaCha rounds).
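The BLAKE2b mode described here could be sketched roughly as follows, using Python's standard `hashlib.blake2b`. This is only an illustration of the construction in the question, not a vetted cipher: the function names, the reuse of the "expand 32-byte k" constant, and the field order are all assumptions made for the example.

```python
import hashlib
import struct

# Reused only to mirror the ChaCha20 input layout; an arbitrary choice here.
CONSTANT = b"expand 32-byte k"


def blake2b_keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """Hash a 64-byte (constant, key, nonce, counter) input to 64 bytes."""
    if len(key) != 32 or len(nonce) != 8:
        raise ValueError("expected a 32-byte key and an 8-byte nonce")
    block_input = CONSTANT + key + nonce + struct.pack("<Q", counter)
    return hashlib.blake2b(block_input, digest_size=64).digest()


def blake2b_ctr_xor(key: bytes, nonce: bytes, message: bytes) -> bytes:
    """XOR the message with successive 64-byte keystream blocks."""
    out = bytearray()
    for offset in range(0, len(message), 64):
        keystream = blake2b_keystream_block(key, nonce, offset // 64)
        chunk = message[offset:offset + 64]
        out.extend(m ^ k for m, k in zip(chunk, keystream))
    return bytes(out)


key, nonce = bytes(32), bytes(8)
ciphertext = blake2b_ctr_xor(key, nonce, b"attack at dawn")
# Applying the same keystream again recovers the plaintext.
assert blake2b_ctr_xor(key, nonce, ciphertext) == b"attack at dawn"
```

Each 64-byte keystream block costs one hash invocation on a 64-byte input, which is why the 64-byte row of a hash benchmark is the relevant figure for this mode.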

My questions:

How can BLAKE2b be faster on some architectures than ChaCha20 even though it does more work?
I think that BLAKE2b uses 64-bit words rather than the 32-bit words that ChaCha20 uses; is this one of the causes?
Also, would there be any reason not to prefer BLAKE2b (in what is essentially a CTR mode) on such an architecture?

Since the eBASH/eBASC benchmarks contain a lot of information, could you please specify some platforms where BLAKE2b is faster than ChaCha20? — Raoul722, Aug 02 '19 at 10:20
I just browsed your two benchmarks. Even though I did not compare every architecture, I tried to get a representative view and I've yet to find an architecture where blake2b is faster than chacha20. I think your question is moot because it is based on false premises. — A. Hersean, Aug 05 '19 at 08:21
@Raoul722 Consider the first one: amd64; CannonLake (60663); 2018 Intel Core i3-8121U; 2 x 2200MHz. Cycles/byte for 64 bytes for chacha20 is 8.78 8.88 9.19 while for blake2b it is 6.09 6.16 6.22. The second one too. ppc64; 2017 IBM POWER9 DD2.1; 32 x 3126MHz. 26.03 26.03 26.80 for chacha20 while we have 19.91 19.91 20.67 for blake2b. — Astolfo, Aug 06 '19 at 15:06
@forest I have not. As far as I know BLAKE2s has a 256-bit output, which means that it will be able to encrypt only half the data that BLAKE2b or chacha20 can (per invocation). So I assumed that it would be too slow to compare with either of these. — Astolfo, Aug 06 '19 at 15:08
In your comment you stated that on the "2018 Intel Core i3-8121U" blake2b is faster than chacha20. However, on average on this architecture, chacha20 uses 0.57 cycles/byte (for messages over 4 KB, which are most messages) whereas blake2b uses 6.16 cycles/byte (when processing input of the same length as its output, which is the case when used in CTR mode even for long messages). Thus, in this case chacha20 is 10.8 times faster than blake2b. Chacha20 has an overhead to initialize the cipher; this makes it "slow" on very small messages, and this is documented. — A. Hersean, Aug 07 '19 at 13:36
@A.Hersean I do not believe that taking anything but the 64-byte benchmark would be fair, as the chacha benchmark calculates (chacha(a) xor m[0]) || (chacha(b) xor m[1]) || ... while the blake2b benchmark calculates blake2b(a || b || ...). Also, I am not aware of any initialization overhead for chacha; would you mind pointing me in the right direction? Wouldn't said overhead also apply to blake2b, as it uses chacha? — Astolfo, Aug 08 '19 at 09:43
With an input of 64 bytes, blake2b in CTR mode does not compute `blake2b(a || b || ...)` as you suggest but `blake2b(counter) XOR m[i]`. The XOR is a very small overhead not taken into account in the benchmark. As for the overhead for small messages, I can't find the source I had in mind, but by looking at the [source code](https://github.com/jedisct1/libsodium/blob/master/src/libsodium/crypto_stream/chacha20/ref/chacha20_ref.c), chacha20 has a constant overhead in initializing (the key and IV) and in cleaning up the memory afterwards. — A. Hersean, Aug 09 '19 at 08:55
@A.Hersean The fact that blake2b in CTR mode does not compute `blake2b(a || b || ...)` is exactly my point. After all, in eBASH blake2 is benchmarked in the `blake2b(a || b || ...)` form. That being said, I think that my initial premise is faulty, as I assumed that "64 bytes" includes things like the constant, nonce, etc. (although blake2b is still faster for the 8-byte case). I believe that the best way to compare blake2b and chacha20 for encryption would be to fork the benchmark to include blake2b as a cipher and run it myself. As for the overhead, blake2b also has one in the form of the IV. — Astolfo, Aug 09 '19 at 14:38
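
The two benchmark shapes debated in these comments can be compared by counting compression-function calls. The sketch below assumes unkeyed BLAKE2b (128-byte blocks, at most 64 bytes of output) and a CTR-style input that fits in a single block; the function names are made up for illustration, and the counts say nothing about actual cycle costs.

```python
import math

BLAKE2B_BLOCK_BYTES = 128       # input bytes consumed per compression call
BLAKE2B_MAX_DIGEST_BYTES = 64   # largest output of one BLAKE2b invocation


def compressions_for_hash(message_len: int) -> int:
    """Compression calls to hash one message: blake2b(a || b || ...)."""
    # Even an empty message takes one compression call.
    return max(1, math.ceil(message_len / BLAKE2B_BLOCK_BYTES))


def compressions_for_ctr(message_len: int) -> int:
    """Compression calls to produce keystream: blake2b(counter) XOR m[i]."""
    # Each 64-byte keystream block hashes a short input, so one call each.
    return math.ceil(message_len / BLAKE2B_MAX_DIGEST_BYTES)


for n in (64, 576, 4096):
    print(n, compressions_for_hash(n), compressions_for_ctr(n))
```

For a 64-byte message both shapes need a single call, but for longer messages the CTR-style use needs about twice as many calls as plain hashing, so long-message hash figures understate its cost.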
