Most of the work more recent than what is described below is available in the latest supercop (20190910). One exception is a Chacha20 implementation using the current (0.8.0) draft specification of the V extension to the RISC-V architecture. This version uses builtin functions developed as part of the compiler work in the European Processor Initiative project. Source code can be found here. More information on the intrinsics and compiler can be found here and here.
The archive crypto_stream_chacha20_dolbeau_riscv-v.tgz was last modified on 2020-04-19 (change the way VL scalability is used).
I've written implementations of some stream and AEAD ciphers using C + intrinsics. I find intrinsics more readable than raw assembly, and much easier to write. Also, modern compilers do a lot of things very well, and I like to take advantage of that. The implementations are mostly targeted at the Intel C compiler, ICC, but should also work with GCC. Stream algorithms include salsa20 and chacha20 (both using SSE/AVX/AVX2/AVX-512) and AES-256 in counter mode (using AES-NI). You can find the archive here. AEAD algorithms are AES-256 in GCM mode (using AES-NI & PCLMULQDQ) and HS1-SIV with the -hi parameters (using AVX2, to a specification slightly newer than the reference code in supercop-20140622; updated code was made available in supercop-20140907). You can find the archive here. There is now also a patch against the official supercop release.
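To give a flavour of the C + intrinsics style, here is a minimal sketch of a ChaCha quarter-round written with SSE2 intrinsics. It is not taken from the archives: the function name chacha_quarterround_x4 and the ROTL32_EPI32 macro are hypothetical, and the layout (four independent (a, b, c, d) tuples, one per 32-bit lane) is only one possible way to vectorize the cipher. It assumes an x86 compiler that provides <emmintrin.h>.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */

/* Hypothetical helper: rotate each 32-bit lane left by n bits (0 < n < 32). */
#define ROTL32_EPI32(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))

/*
 * Hypothetical name: one ChaCha quarter-round applied to four independent
 * (a, b, c, d) tuples at once, one tuple per 32-bit lane.
 */
static inline void chacha_quarterround_x4(__m128i *a, __m128i *b,
                                          __m128i *c, __m128i *d)
{
    *a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = ROTL32_EPI32(*d, 16);
    *c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32_EPI32(*b, 12);
    *a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = ROTL32_EPI32(*d, 8);
    *c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32_EPI32(*b, 7);
}
```

Broadcasting the quarter-round test vector from RFC 7539, section 2.1.1 (a = 0x11111111, b = 0x01020304, c = 0x9b8d6f43, d = 0x01234567) into all lanes should yield a = 0xea2a92f4, b = 0xcb1cf8ce, c = 0x4581472e, d = 0x5881c4bb in every lane.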
Note that the interfaces and directory hierarchy are designed to fit in the supercop benchmark. A version of these codes is included in the 20140905 version of supercop, later updated in the 20140910 version. The published results of the benchmark (as of 2014/09/10) do not include compilation with ICC. Results are available for all algorithms.
Beware: those implementations are purely designed for speed on recent Intel architectures (mostly Haswell and newer), and ARMv8 (64-bit) with the crypto extension. They were not verified to be resistant to side-channel attacks. It's probably safer to stick to reputable libraries for your cryptographic needs. Prof. Dan Boneh makes a very compelling argument during his excellent course over at Coursera.
Differences from the version in supercop-20141124 include updating HS1-SIV to v2 of the specifications (unfortunately, the name was not changed and is still v1). As of supercop-20160717, my HS1-SIV implementation is properly labelled v2.
The archive crypto_stream-intrinsics.tgz was last modified on 2016-05-04.
The archive crypto_aead-intrinsics.tgz was last modified on 2016-05-04.
The archive crypto_core-intrinsics.tgz was last modified on 2016-05-04.
The patch supercop_20141124_patch_20160504.patch was last modified on 2016-05-04.
The 20160504 version also contains some crypto_core algorithms, and aes256gcmv1 for ARMv8+crypto.
To test all the algorithms, you will need a CPU supporting AVX2, AES and PCLMUL for x86_64, and the crypto extension for ARMv8. Some algorithms will run on less than that, but this has not been extensively tested. Tested compilers (beyond what is tested in the official supercop results) include, among others:
icc -m64 -march=native -mtune=native -O3 -fomit-frame-pointer
gcc -m64 -march=native -mtune=native -O3 -fomit-frame-pointer
gcc-4.9.2 -m64 -march=native -mtune=native -O3 -fomit-frame-pointer
[results seem better than with 4.7.2]
gcc-5.1.0 -m64 -march=native -mtune=native -O3 -fomit-frame-pointer
clang -march=x86-64 -mcpu=core-avx2 -mavx2 -maes -mpclmul -O3 -fomit-frame-pointer
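Before running the full test suite on x86, it can help to check that the CPU actually reports the required features. The following is a minimal sketch, not part of the archives: the function name cpu_has_required_features is hypothetical, and it assumes a GCC- or Clang-compatible compiler providing the __builtin_cpu_supports builtin. On any other target it conservatively returns false.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if the running x86 CPU reports
 * AVX2, AES-NI and PCLMULQDQ support. Assumes the GCC/Clang builtins;
 * on other targets or compilers it simply returns false.
 */
static bool cpu_has_required_features(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();  /* harmless if already initialised */
    return __builtin_cpu_supports("avx2") &&
           __builtin_cpu_supports("aes") &&
           __builtin_cpu_supports("pclmul");
#else
    return false;
#endif
}
```

A benchmark driver could call this once at startup and skip the AVX2, AES-NI and PCLMULQDQ code paths when it returns false. The ARMv8 crypto extension needs a different, platform-specific check that is not shown here.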
I've written a hybrid AES-256-GCM implementation in CUDA and NEON for the Jetson TK1 platform (based on the Tegra K1 SoC). The implementation includes a large family of AES kernels in CUDA.
There is also support for GCM using PCLMULQDQ on x86-64 CPUs (now with a faster unrolled-by-8 version).
My first results are described in a paper titled An hybrid AES-256-GCM implementation for NEON CPU & CUDA GPU. Full code for all the evaluated implementations and the tests are available here.
The paper aes_gcm_gpu.pdf was last modified on 2014-11-05.
The archive aes_gcm_gpu-20141214.tgz was last modified on 2014-12-14.
There is also a fairly straightforward implementation of Chacha20 in CUDA.
The archive chacha_gpu-20141129.tgz was last modified on 2014-11-29.
This is an experiment in coding stream algorithms (chacha20 and AES-256 in CTR mode) on the Adapteva Parallella. The blocks are generated on the Epiphany chip and brought back to the Cortex A9 for XORing with the message.
The archive epi_crypto-20160827.tgz was last modified on 2016-08-27.
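The host-side step described above, XORing the generated keystream blocks with the message, can be sketched as follows. This is an illustration only, not code from the archive: the function name xor_keystream is hypothetical, and the keystream is assumed to be already available in host memory (the transfer from the Epiphany chip is not shown).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: out[i] = msg[i] ^ keystream[i] for len bytes.
 * out may be the same buffer as msg (in-place encryption or decryption).
 */
static void xor_keystream(uint8_t *out, const uint8_t *msg,
                          const uint8_t *keystream, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = msg[i] ^ keystream[i];
}
```

Because XOR is its own inverse, applying the same keystream a second time recovers the original message, so the same routine serves for both encryption and decryption.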