#1 performance: Use AVX2 operations for some of the math.

Closed
opened 1 year ago by yawning · 2 comments

The reference implementation has AVX2 optimized implementations of:

  • Centered Binomial Distribution (eta = 4).
  • The number theoretic transform.
  • The inverse number theoretic transform.
  • Pointwise multiply and accumulate.
  • The matrix generation

Casually profiling the Go benchmarks results in something like:

  flat  flat%   sum%        cum   cum%
 9.82s 27.24% 27.24%      9.82s 27.24%  golang.org/x/crypto/sha3.keccakF1600
 5.54s 15.37% 42.61%      5.54s 15.37%  git.schwanenlied.me/yawning/kyber%2egit.ntt
 3.96s 10.98% 53.59%      3.98s 11.04%  git.schwanenlied.me/yawning/kyber%2egit.invntt
 3.04s  8.43% 62.02%      3.04s  8.43%  git.schwanenlied.me/yawning/kyber%2egit.(*poly).pointwiseAcc
 1.50s  4.16% 66.19%      8.61s 23.88%  git.schwanenlied.me/yawning/kyber%2egit.genMatrix
 1.38s  3.83% 70.01%      2.05s  5.69%  git.schwanenlied.me/yawning/kyber%2egit.(*poly).cbd
The reference implementation has AVX2 optimized implementations of: * Centered Binomial Distribution (eta = 4). * The number theoretic transform. * The inverse number theoretic transform. * Pointwise multiply and accumulate. * The matrix generation Casually profiling the Go benchmarks results in something like: flat flat% sum% cum cum% 9.82s 27.24% 27.24% 9.82s 27.24% golang.org/x/crypto/sha3.keccakF1600 5.54s 15.37% 42.61% 5.54s 15.37% git.schwanenlied.me/yawning/kyber%2egit.ntt 3.96s 10.98% 53.59% 3.98s 11.04% git.schwanenlied.me/yawning/kyber%2egit.invntt 3.04s 8.43% 62.02% 3.04s 8.43% git.schwanenlied.me/yawning/kyber%2egit.(*poly).pointwiseAcc 1.50s 4.16% 66.19% 8.61s 23.88% git.schwanenlied.me/yawning/kyber%2egit.genMatrix 1.38s 3.83% 70.01% 2.05s 5.69% git.schwanenlied.me/yawning/kyber%2egit.(*poly).cbd
Yawning Angel referenced this issue from a commit 1 year ago
Yawning Angel commented 1 year ago
Owner

Casually profiling the Go benchmarks (All AVX2) now results in something like:

  flat  flat%   sum%        cum   cum%
17.72s 38.08% 38.08%     17.72s 38.08%  golang.org/x/crypto/sha3.keccakF1600
 2.86s  6.15% 44.23%      4.49s  9.65%  git.schwanenlied.me/yawning/kyber%2egit.(*poly).cbd
 2.29s  4.92% 49.15%     14.51s 31.18%  git.schwanenlied.me/yawning/kyber%2egit.genMatrix
 1.99s  4.28% 53.43%      3.38s  7.26%  runtime.scanobject
 1.63s  3.50% 56.93%      1.63s  3.50%  git.schwanenlied.me/yawning/kyber%2egit.loadLittleEndian

My feeling is to hold off on throwing more assembly at this until I see what tweaks happen to the algorithm in response to the public feedback. While not as fast as the upstream avx2 implementation, the optimizations now make it quite fast.

The next logical thing to do would be the Centered Binomial Distribution.

Casually profiling the Go benchmarks (All AVX2) now results in something like: flat flat% sum% cum cum% 17.72s 38.08% 38.08% 17.72s 38.08% golang.org/x/crypto/sha3.keccakF1600 2.86s 6.15% 44.23% 4.49s 9.65% git.schwanenlied.me/yawning/kyber%2egit.(*poly).cbd 2.29s 4.92% 49.15% 14.51s 31.18% git.schwanenlied.me/yawning/kyber%2egit.genMatrix 1.99s 4.28% 53.43% 3.38s 7.26% runtime.scanobject 1.63s 3.50% 56.93% 1.63s 3.50% git.schwanenlied.me/yawning/kyber%2egit.loadLittleEndian My feeling is to hold off on throwing more assembly at this until I see what tweaks happen to the algorithm in response to the public feedback. While not as fast as the upstream `avx2` implementation, the optimizations now make it quite fast. The next logical thing to do would be the Centered Binomial Distribution.
Yawning Angel referenced this issue from a commit 1 year ago
Yawning Angel commented 1 year ago
Owner

It is possible to improve this further by doing the following, in decreasing order of gain (judging by profiler output):

  • Multi-lane SHAKE (group getNoise calls, faster genMatrix).
  • Non-eta=4 Centered Binomial Distribution.
  • Vectorized rejection sampling.

I will hold off on these for the following reasons:

  • Including another SHA3 implementation that's used instead of the one currently imported from golang.org/x/crypto/sha3 feels like a bad idea.
  • eta=4 is the designer recommended parameter set.
  • Out of the genMatrix runtime, approximately 75% of it is consumed by squeezing blocks out of SHAKE.

Basically, "it's good enough, especially Kyber-768, and further improvements likely will involve considerable amounts of pain".

It is possible to improve this further by doing the following, in decreasing order of gain (judging by profiler output): * Multi-lane SHAKE (group `getNoise` calls, faster `genMatrix`). * Non-`eta=4` Centered Binomial Distribution. * Vectorized rejection sampling. I will hold off on these for the following reasons: * Including another SHA3 implementation that's used instead of the one currently imported from `golang.org/x/crypto/sha3` feels like a bad idea. * `eta=4` is the designer recommended parameter set. * Out of the `genMatrix` runtime, approximately 75% of it is consumed by squeezing blocks out of SHAKE. Basically, "it's good enough, especially Kyber-768, and further improvements likely will involve considerable amounts of pain".
Sign in to join this conversation.
No Label
No Milestone
No assignee
1 Participants
Loading...
Cancel
Save
There is no content yet.