
 

Cryptographic Performance on the 2nd Generation Intel® Core™ processor family

 

 


  

 

 

However, for simplicity, we use fixed-size small buffers without schedulers in this paper, as we mainly focus on the performance improvements of the core algorithms.

We did not include other algorithms in this study, as we believe the current set is representative and forms the basis of the vast majority of cryptographic protocols used today. Although other private key ciphers such as ARC4 and DES are still used today, we anticipate that the newer AES algorithm will continue to replace them.

Multi-Buffer via SIMD 

The Intel® 64 and IA-32 instruction set architectures have two distinct instruction subsets: general purpose instructions and Single Instruction Multiple Data (SIMD) instructions [1]. The SIMD instructions include, and are best known as, Intel® SSE (Streaming SIMD Extensions). These instructions can be used to implement fast hashing for MD5, SHA1 and SHA256. Because these algorithms are defined on 32-bit words and perform a sequence of arithmetic and logical operations, they are amenable to a SIMD approach in which we process 4 independent buffers concurrently.
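The 4-buffer layout can be illustrated with a minimal Python sketch. It is a scalar model, not the SSE implementation measured in this paper: each Python list of four integers stands in for one 128-bit register holding one 32-bit word from each buffer, and the names `md5_x4`, `pad_single_block`, `add` and `rotl` are hypothetical. To stay short it assumes every message fits in a single MD5 block (at most 55 bytes).

```python
import math
import struct

MASK = 0xFFFFFFFF
LANES = 4  # one 128-bit register holds four 32-bit words

# Standard MD5 per-step constants and rotate amounts (RFC 1321).
K = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]
S = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
     [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)

def rotl(v, s):
    """Rotate every 32-bit lane left by s bits."""
    return [((x << s) | (x >> (32 - s))) & MASK for x in v]

def add(*vs):
    """Lane-wise addition modulo 2**32 of any number of vectors."""
    return [sum(xs) & MASK for xs in zip(*vs)]

def pad_single_block(msg):
    """MD5 padding for a message of at most 55 bytes (exactly one block)."""
    assert len(msg) <= 55
    padded = (msg + b"\x80" + b"\x00" * (55 - len(msg))
              + struct.pack("<Q", 8 * len(msg)))
    return struct.unpack("<16I", padded)

def md5_x4(msgs):
    """Hash four short messages at once; each list is one 4-lane vector."""
    assert len(msgs) == LANES
    blocks = [pad_single_block(m) for m in msgs]
    # Transpose: M[j] holds word j of all four buffers, one per lane.
    M = [[blk[j] for blk in blocks] for j in range(16)]
    init = [[0x67452301] * LANES, [0xEFCDAB89] * LANES,
            [0x98BADCFE] * LANES, [0x10325476] * LANES]
    a, b, c, d = init
    for i in range(64):
        if i < 16:
            f = [(x & y) | (~x & z) for x, y, z in zip(b, c, d)]
            g = i
        elif i < 32:
            f = [(z & x) | (~z & y) for x, y, z in zip(b, c, d)]
            g = (5 * i + 1) % 16
        elif i < 48:
            f = [x ^ y ^ z for x, y, z in zip(b, c, d)]
            g = (3 * i + 5) % 16
        else:
            f = [y ^ (x | (~z & MASK)) for x, y, z in zip(b, c, d)]
            g = (7 * i) % 16
        # One "instruction" per line: the same operation on all four lanes.
        t = add(a, f, [K[i]] * LANES, M[g])
        a, d, c, b = d, c, b, add(b, rotl(t, S[i]))
    state = [add(v, v0) for v, v0 in zip((a, b, c, d), init)]
    # Un-transpose: lane k of each state vector belongs to buffer k.
    return [struct.pack("<4I", *(vec[k] for vec in state)).hex()
            for k in range(LANES)]
```

Calling `md5_x4([b"", b"a", b"abc", b"message digest"])` returns four hex digests, one per lane, which can be compared against `hashlib.md5` for each message. The control flow never depends on the data, which is why every step maps onto a single vector operation across the four buffers.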

The 2nd Generation Intel® Core™ processor family improves the performance of SIMD instructions with the introduction of the Intel® AVX (Intel® Advanced Vector Extensions) instruction set [5], resulting in much faster multi-buffer hashing. Although we process the same number of buffers as with SSE, we gain some performance efficiency due to the more flexible register operand definition of the AVX instruction set, whose encoding allows a destination register distinct from the source registers. Both our SSE and AVX implementations operate on 128-bit data, which consists of 4 data elements from independent parallel buffers.

Multi-Buffer via Data Dependency Hiding 

Intel® AES-NI (AES New Instructions) is defined as a set of SSE instructions, but these instructions are not SIMD. In particular, the AESENC instruction, which performs one round of AES encryption, has a latency of several cycles. In some modes, such as counter mode or CBC (Cipher Block Chaining) decrypt, one can implement the algorithm so that multiple blocks of the same buffer are processed in parallel, which hides this latency. In CBC-encrypt, however, one cannot start encrypting a block until the previous block has been encrypted. CBC-encrypt therefore requires a serial implementation, where performance is limited by the latency rather than by the throughput.
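The dependency chain can be modeled with a minimal Python sketch. Everything in it is an illustrative assumption: `toy_round` is a hypothetical stand-in for one AESENC round (it is not AES and provides no security), and the function names, the lock-step schedule and the cost model are ours rather than the paper's. The sketch contrasts the serial chain of a single CBC-encrypt buffer with stepping several independent buffers one round at a time, the remedy the next paragraph describes.

```python
ROUNDS = 10   # AES-128 uses 10 rounds; the round function below is NOT AES
BLOCK = 16    # block size in bytes

def toy_round(state, round_key):
    """Hypothetical stand-in for one AESENC round: rotate, add, XOR key."""
    rotated = state[1:] + state[:1]
    return bytes(((s + 0x3B) & 0xFF) ^ k for s, k in zip(rotated, round_key))

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt_serial(blocks, iv, round_keys):
    """One buffer: block i cannot start until ciphertext i-1 is complete."""
    out, prev = [], iv
    for p in blocks:
        state = xor(p, prev)
        for rk in round_keys:       # each round waits on the previous one
            state = toy_round(state, rk)
        out.append(state)
        prev = state
    return out

def cbc_encrypt_multibuffer(buffers, ivs, round_keys):
    """Several equal-length buffers in lock step, one round each per step."""
    n = len(buffers)
    outs, prevs = [[] for _ in range(n)], list(ivs)
    for i in range(len(buffers[0])):
        states = [xor(buffers[b][i], prevs[b]) for b in range(n)]
        for rk in round_keys:
            # These n round operations are mutually independent, so a
            # pipelined unit could issue them back to back instead of
            # idling while it waits out the latency of a single one.
            states = [toy_round(s, rk) for s in states]
        for b in range(n):
            outs[b].append(states[b])
        prevs = states
    return outs

def cycles_per_block(n_buffers, latency, rounds=ROUNDS):
    """Idealized model: one round issued per cycle, result after `latency`."""
    return rounds * max(latency, n_buffers) / n_buffers
```

Both functions produce the same ciphertext for each buffer; only the order of the round operations differs. In the idealized cost model, with a purely hypothetical latency of 4 cycles (not a measured figure), one buffer costs 40 cycles per block, while 4 or more interleaved buffers reach the throughput bound of 10 cycles per block.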

However, if we can encrypt multiple independent buffers in parallel, we can break the data dependencies and get ideal performance for CBC-encrypt, limited only by the throughput. The 2nd Generation Intel® Core™ processor family improves the throughput performance of the AES-NI instructions; to realize this peak performance, we process 8 buffers in parallel. For the Intel


Page 1: ...Performance on the 2nd Generation Intel Core processor family January 2011 White Paper Vinodh Gopal Jim Guilford Wajdi Feghali Erdinc Ozturk Gil Wolrich Kirk Yap Sean Gulley Martin Dixon IA Architect...
