Cryptographic Performance on the
2nd Generation Intel® Core™ processor family
However, for simplicity, we use fixed-size small buffers without schedulers in
this paper, as we mainly focus on the performance improvements of the core
algorithms.
We did not include other algorithms in this study, as we believe the current
set is representative and forms the basis of the vast majority of cryptographic
protocols used today. Although other private key ciphers such as ARC4 and
DES are still used today, we anticipate the new AES algorithm will continue to
replace them in the future.
Multi-Buffer via SIMD
The Intel® 64 and IA-32 instruction set architectures have two distinct
instruction subsets: general purpose instructions and Single Instruction
Multiple Data (SIMD) instructions [1]. The SIMD instructions include, and are
best known as, Intel® SSE (Streaming SIMD Extensions). These instructions
can be used to implement fast hashing for MD5, SHA1 and SHA256. As these
algorithms are defined on 32-bit words and perform a sequence of arithmetic
and logical operations, they are amenable to a SIMD approach where we can
process 4 independent buffers concurrently.
The 2nd Generation Intel® Core™ processor family improves the performance
of SIMD instructions with the introduction of the Intel® AVX (Intel®
Advanced Vector Extensions) instruction set [5], resulting in much faster
multi-buffer hashing. Although we process the same number of buffers as
with SSE, we gain some efficiency from the more flexible register operand
definition of the AVX instruction set. Both our SSE and AVX implementations
operate on 128-bit data, which consists of 4 data elements from independent
parallel buffers.
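The lane layout described above can be sketched in portable C. This is only a model under stated assumptions: the vec4x32 type, the toy_step mixing function, the TOY_INIT constant and all function names are hypothetical and are not MD5, SHA1 or SHA256, and the inner lane loops stand in for what a single 128-bit SSE or AVX instruction would do on all four lanes at once.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4                 /* four 32-bit words fill one 128-bit register */
#define TOY_INIT 0x01234567u    /* arbitrary starting state (assumption) */

/* Portable model of a 128-bit vector: lane i holds data from buffer i. */
typedef struct {
    uint32_t lane[LANES];
} vec4x32;

static uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << r) | (x >> (32u - r)); /* valid for 0 < r < 32 */
}

/* Hypothetical hash-like step on one 32-bit word (NOT a real hash). */
static uint32_t toy_step(uint32_t state, uint32_t word) {
    return rotl32(state + word, 7) ^ word;
}

/* Scalar reference: process a single buffer of nwords 32-bit words. */
uint32_t toy_hash_scalar(const uint32_t *buf, size_t nwords) {
    uint32_t state = TOY_INIT;
    for (size_t w = 0; w < nwords; w++)
        state = toy_step(state, buf[w]);
    return state;
}

/* Multi-buffer version: four equal-length buffers advance in lockstep. */
void toy_hash_x4(const uint32_t *bufs[LANES], size_t nwords,
                 uint32_t digest[LANES]) {
    vec4x32 state;
    for (int i = 0; i < LANES; i++)
        state.lane[i] = TOY_INIT;

    for (size_t w = 0; w < nwords; w++) {
        vec4x32 word;
        /* Gather word w of every buffer into one vector (a transpose). */
        for (int i = 0; i < LANES; i++)
            word.lane[i] = bufs[i][w];
        /* Same add, rotate and xor applied to every lane. With SIMD,
         * each of these would be one packed instruction, not a loop. */
        for (int i = 0; i < LANES; i++)
            state.lane[i] = toy_step(state.lane[i], word.lane[i]);
    }

    for (int i = 0; i < LANES; i++)
        digest[i] = state.lane[i];
}
```

Calling toy_hash_x4 on four buffers should give the same four results as calling toy_hash_scalar on each buffer in turn. The sketch assumes all four buffers have the same length, in line with the fixed-size buffers used in this paper.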
Multi-Buffer via Data Dependency Hiding
The Intel® AES-NI new instructions have been defined as a set of SSE
instructions, but are not SIMD. In particular, the AESENC instruction, which
performs one round of AES encryption, has a latency of several cycles. In some
modes, such as counter mode or CBC (Cipher Block Chaining) decrypt, the block
cipher operations are independent of one another, so one can implement the
algorithm such that multiple blocks of the same buffer are processed in
parallel. In CBC-encrypt, however, one cannot start encrypting a block until
the previous block has been encrypted. CBC-encrypt therefore requires a serial
implementation, where performance is limited by the latency rather than by the
throughput.
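The difference between the two CBC directions can be made concrete with a minimal sketch. It assumes a hypothetical 64-bit toy block cipher (toy_encrypt_block and toy_decrypt_block) in place of AES, purely to keep the example portable. Only the chaining structure is meant to match CBC, and none of these names come from the paper or from any library.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t rotl64(uint64_t x, unsigned r) {
    return (x << r) | (x >> (64u - r)); /* valid for 0 < r < 64 */
}

static uint64_t rotr64(uint64_t x, unsigned r) {
    return (x >> r) | (x << (64u - r)); /* valid for 0 < r < 64 */
}

/* Hypothetical invertible toy cipher standing in for AES (NOT secure). */
uint64_t toy_encrypt_block(uint64_t x, uint64_t key) {
    return rotl64(x ^ key, 13) + key;
}

uint64_t toy_decrypt_block(uint64_t y, uint64_t key) {
    return rotr64(y - key, 13) ^ key;
}

/* CBC-encrypt: block i cannot start until ciphertext i-1 exists,
 * so the loop is a serial dependency chain bounded by latency. */
void cbc_encrypt(const uint64_t *pt, uint64_t *ct, size_t n,
                 uint64_t iv, uint64_t key) {
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        prev = toy_encrypt_block(pt[i] ^ prev, key);
        ct[i] = prev;
    }
}

/* CBC-decrypt: every block cipher call depends only on ciphertext that
 * is already available, so the iterations are independent and several
 * blocks of the same buffer could be in flight at once. */
void cbc_decrypt(const uint64_t *ct, uint64_t *pt, size_t n,
                 uint64_t iv, uint64_t key) {
    for (size_t i = 0; i < n; i++) {
        uint64_t prev = (i == 0) ? iv : ct[i - 1];
        pt[i] = toy_decrypt_block(ct[i], key) ^ prev;
    }
}
```

In cbc_encrypt the variable prev carries a value from one iteration into the next, which is the dependency that forces serial execution. In cbc_decrypt no such carried value exists, so the loop body could be unrolled across several blocks.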
However, if we can encrypt multiple independent buffers in parallel, we can
break the data dependencies and get ideal performance limited only by the
throughput for CBC-encrypt. The 2nd Generation Intel® Core™ processor
family improves the throughput performance of the AES-NI instructions; to
realize this peak performance, we process 8 buffers in parallel. For the Intel