
Optimizing for SIMD Integer Applications


Figure 4-7 pmovmskb Instruction Example

Example 4-10 pmovmskb Instruction Code

; Input:
;    source value
; Output:
;    32-bit register containing the byte mask in the lower
;    eight bits
;
movq     mm0, [edi]
pmovmskb eax, mm0
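As a rough illustration of what the two instructions above compute, the following portable C sketch models the 64-bit form of pmovmskb in scalar code. The function name byte_mask_u64 is a hypothetical helper introduced here, not something from the manual, and the loop is only a behavioral model: it does not reflect how the instruction is implemented or how fast it runs.

```c
#include <stdint.h>

/* Hypothetical helper (not from the manual): scalar model of the
 * 64-bit pmovmskb. Bit i of the result is the most significant bit
 * of byte i of src; bits 31..8 of the result are zero. */
static uint32_t byte_mask_u64(uint64_t src)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++) {
        /* The top bit of byte i sits at bit position 8*i + 7. */
        uint32_t top_bit = (uint32_t)((src >> (8 * i + 7)) & 1u);
        mask |= top_bit << i;
    }
    return mask;
}
```

For instance, a source value of 0x80007F80017F00FF has the top bit set in bytes 7, 4 and 0, so the model returns 0x91. This corresponds to the value eax would hold after the pmovmskb in Example 4-10 if [edi] contained that quadword.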

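[Figure 4-7 diagram (OM15165): the most significant bit of each byte of the 64-bit MM register (bits 63, 55, 47, 39, 31, 23, 15 and 7) is copied into bits 7..0 of the 32-bit register R32; bits 31..8 of R32 are set to 0.]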

Summary of Contents for ARCHITECTURE IA-32

Page 1: ...IA 32 Intel Architecture Optimization Reference Manual Order Number 248966 013US April 2006...

Page 2: ...ny software that may be provided in association with this document Except as permitted by such license no part of this document may be reproduced stored in a retrieval system or trans mitted in any fo...

Page 3: ...Goals of Intel NetBurst Microarchitecture 1 8 Overview of the Intel NetBurst Microarchitecture Pipeline 1 9 The Front End 1 11 The Out of order Core 1 12 Retirement 1 12 Front End Pipeline Detail 1 13...

Page 4: ...Chapter 2 General Optimization Guidelines Tuning to Achieve Optimum Performance 2 1 Tuning to Prevent Known Coding Pitfalls 2 2 General Practices and Coding Guidelines 2 3 Use Available Performance T...

Page 5: ...eon Processors 2 45 Aliasing Cases in the Pentium M Processor 2 46 Mixing Code and Data 2 47 Self modifying Code 2 47 Write Combining 2 48 Locality Enhancement 2 50 Minimizing Bus Latency 2 52 Non Tem...

Page 6: ...ng Point SIMD Operands 2 88 Prolog Sequences 2 90 Code Sequences that Operate on Memory Operands 2 90 Instruction Scheduling 2 91 Latencies and Resource Constraints 2 91 Spill Scheduling 2 92 Scheduli...

Page 7: ...for 128 bit data 3 24 Compiler Supported Alignment 3 24 Improving Memory Utilization 3 27 Data Structure Layout 3 27 Strip Mining 3 32 Loop Blocking 3 34 Instruction Selection 3 37 SIMD Optimizations...

Page 8: ...ord 4 31 Complex Multiply by a Constant 4 32 Packed 32 32 Multiply 4 33 Packed 64 bit Add Subtract 4 33 128 bit Shifts 4 33 Memory Optimizations 4 34 Partial Memory Accesses 4 35 Supplemental Techniqu...

Page 9: ...etch Coding Guidelines 6 2 Hardware Prefetching of Data 6 4 Prefetch and Cacheability Instructions 6 5 Prefetch 6 6 Software Data Prefetch 6 6 The Prefetch Instructions Pentium 4 Processor Implementat...

Page 10: ...Decoder Implementation 6 46 Optimizing Memory Copy Routines 6 46 TLB Priming 6 47 Using the 8 byte Streaming Stores and Software Prefetch 6 48 Using 16 byte Streaming Stores and Hardware Prefetch 6 5...

Page 11: ...Cache Misses 7 36 Use Full Write Transactions to Achieve Higher Data Rate 7 37 Memory Optimization 7 38 Cache Blocking Technique 7 38 Shared Memory Optimization 7 39 Minimize Sharing of Data between P...

Page 12: ...ep Technology 9 12 Enabling Intel Enhanced Deeper Sleep 9 14 Multi Core Considerations 9 15 Enhanced Intel SpeedStep Technology 9 15 Thread Migration Considerations 9 16 Multi core Considerations for...

Page 13: ...rmance Metrics B 1 Pentium 4 Processor Specific Terminology B 2 Bogus Non bogus Retire B 2 Bus Ratio B 2 Replay B 3 Assist B 3 Tagging B 3 Counting Clocks B 4 Non Halted Clockticks B 5 Non Sleep Clock...

Page 14: ...4 Latency and Throughput with Register Operands C 6 Table Footnotes C 19 Latency and Throughput with Memory Operands C 20 Appendix DStack Alignment Stack Frames D 1 Aligned esp Based Stack Frames D 4...

Page 15: ...Code 2 36 Example 2 15 Two Examples to Avoid the Non forwarding Situation in Example 2 14 2 36 Example 2 13 A Non forwarding Example of Large Load After Small Store 2 36 Example 2 16 Large and Small...

Page 16: ...op Blocking 3 35 Example 3 21 Emulation of Conditional Moves 3 37 Example 4 1 Resetting the Register between __m64 and FP Data Types 4 5 Example 4 2 Unsigned Unpack Instructions 4 7 Example 4 3 Signed...

Page 17: ...s and shuffle Instructions 5 15 Example 5 7 Deswizzling Data 64 bit Integer SIMD Data 5 16 Example 5 8 Using MMX Technology Code for Copying or Shuffling 5 18 Example 5 9 Horizontal Add Using movhlps...

Page 18: ...ithout Sharing a Cache Line 7 32 Example 7 8 Batched Implementation of the Producer Consumer Threads 7 41 Example 7 9 Adding an Offset to the Stack Pointer of Three Threads 7 45 Example 7 10 Adding a...

Page 19: ...ler Performance Trade offs 3 13 Figure 3 3 Loop Blocking Access Pattern 3 36 Figure 4 2 Interleaved Pack with Saturation 4 9 Figure 4 1 PACKSSDW mm mm mm64 Instruction Example 4 9 Figure 4 4 Result of...

Page 20: ...State Transitions 9 3 Figure 9 2 Active Time Versus Halted Time of a Processor 9 4 Figure 9 3 Application of C states to Idle Time 9 6 Figure 9 4 Profiles of Coarse Task Scheduling and Power Consumpt...

Page 21: ...to Strip mining Code 6 39 Table 6 2 Relative Performance of Memory Copy Routines 6 52 Table 6 3 Deterministic Cache Parameters Leaf 6 54 Table 7 1 Properties of Synchronization Objects 7 21 Table B 1...

Page 22: ...xxii Table C 5 Streaming SIMD Extension 64 bit Integer Instructions C 14 Table C 7 IA 32 x87 Floating point Instructions C 16 Table C 8 IA 32 General Purpose Instructions C 17...

Page 23: ...itecture Volume 2A Instruction Set Reference A M Volume 2B Instruction Set Reference N Z and Volume 3 System Programmer s Guide When developing and optimizing software applications to achieve a high l...

Page 24: ...n your applications On the Pentium 4 Intel Xeon and Pentium M processors this tool can monitor an application through a selection of performance monitoring events and analyze the performance event dat...

Page 25: ...tBurst microarchitecture and Pentium M processor microarchitecture Chapter 3 Coding for SIMD Architectures Describes techniques and concepts for using the SIMD integer and SIMD floating point instruct...

Page 26: ...ols for analyzing and enhancing application performance without having to write assembly code Appendix B Intel Pentium 4 Processor Performance Metrics Provides information that can be gathered using P...

Page 27: ...Programmer s Guide doc number 253668 Intel Processor Identification with the CPUID Instruction doc number 241618 Developing Multi threaded Applications A Platform Consistent Approach available at http...

Page 28: ...t THIS TYPE STYLE Indicates a value for example TRUE CONST1 or a variable for example A B or register names MMO through MM7 l indicates lowercase letter L in examples 1 is the number 1 in examples O i...

Page 29: ...Technology Intel EM64T Intel processors supporting Hyper Threading HT Technology1 Multi core architecture supported in Intel Core Duo Intel Pentium D processors and Pentium processor Extreme Edition2...

Page 30: ...t of eight 64 bit registers called MMX registers see Figure 1 2 The Pentium III processor extended the SIMD computation model with the introduction of the Streaming SIMD Extensions SSE SSE allows SIMD...

Page 31: ...ble precision floating point data elements and 128 bit packed integers There are 144 instructions in SSE2 that operate on two packed double precision floating point data elements or on 16 packed byte...

Page 32: ...Point Arithmetic They are accessible from all IA 32 execution modes protected mode real address mode and Virtual 8086 mode SSE SSE2 and MMX technologies are architectural extensions in the IA 32 Inte...

Page 33: ...with Streaming SIMD Extensions 2 SSE3 Chapter 12 Programming with Streaming SIMD Extensions 3 SSE3 Summary of SIMD Technologies MMX Technology MMX Technology introduced 64 bit MMX registers support fo...

Page 34: ...nverting between new and existing data types extended support for data shuffling extended support for cacheability and memory ordering operations SSE2 instructions are useful for 3D graphics video dec...

Page 35: ...mode enables a 64 bit operating system to run applications written to access 64 bit linear address space In the 64 bit mode of Intel EM64T software may access 64 bit flat linear addressing 8 additiona...

Page 36: ...ency and Throughput Intel NetBurst microarchitecture is designed to achieve high performance for integer and floating point computations at high clock rates It supports the following features hyper pi...

Page 37: ...kes to execute each individual instruction is not always deterministic Chapter 2 General Optimization Guidelines lists optimizations to use and situations to avoid The chapter also gives a sense of re...

Page 38: ...l program order and that the proper architectural states are updated Figure 1 3 illustrates a diagram of the major functional blocks associated with the Intel NetBurst microarchitecture pipeline The f...

Page 39: ...hat are sources of delay the time required to decode instructions fetched from the target wasted decode bandwidth due to branches or a branch target in the middle of a cache line Instructions are fetc...

Page 40: ...r operations executing in parallel or by the execution of ops queued up in a buffer The core is designed to facilitate parallel execution It can dispatch up to six ops per cycle through the issue port...

Page 41: ...tory Figure 1 3 illustrates the paths that are most frequently executing inside the Intel NetBurst microarchitecture an execution loop that interacts with multilevel cache hierarchy and the system bus...

Page 42: ...croarchitecture has a single decoder that decodes instructions at the maximum rate of one instruction per clock Some complex instructions must enlist the help of the microcode ROM The decoder operatio...

Page 43: ...es accurately and to reduce the cost of taken branches These include the ability to dynamically predict the direction and target of branches based on an instruction s linear address using the branch t...

Page 44: ...Return Stack that can predict return addresses for a series of procedure calls This increases the benefit of unrolling loops containing function calls It also mitigates the need to put certain proced...

Page 45: ...order IA 32 instructions to preserve available parallelism by minimizing long dependence chains and covering long instruction latencies order instructions so that their operands are ready and their co...

Page 46: ...either one floating point move op a floating point stack move floating point exchange or floating point store data or one arithmetic logical unit ALU op arithmetic logic branch or store data In the se...

Page 47: ...ared between instructions and data Figure 1 4 Execution Units and Ports in the Out Of Order Core OM15151 ALU 0 Double Speed Port 0 ADD SUB Logic Store Data Branches FP Move FP Store Data FXCH ALU 1 Do...

Page 48: ...rd level cache miss initiates a transaction across the system bus A bus write transaction writes 64 bytes to cacheable memory or separate 8 byte chunks if the destination is not cacheable A bus read t...

Page 49: ...ntrolled prefetch is enabled using the four prefetch instructions PREFETCHh introduced with SSE The software prefetch is not intended for prefetching code Using it can incur significant penalties on a...

Page 50: ...processor and the use of too many prefetches can limit their effectiveness Examples of this include prefetching data in a loop for a reference outside the loop and prefetching in a basic block that is...

Page 51: ...sfy a condition that the linear address distance between these cache misses is within a threshold value The threshold value depends on the processor implementation of the microarchitecture see Table 1...

Page 52: ...by not exceeding the memory issue bandwidth and buffer resources provided by the processor Up to one load and one store may be issued for each cycle from a memory port reservation station In order to...

Page 53: ...ntil a write to memory and or cache is complete Writes are generally not on the critical path for dependence chains so it is often beneficial to delay writes for more efficient use of memory access bu...

Page 54: ...or microarchitecture contains three sections in order issue front end out of order superscalar execution core in order retirement unit Intel Pentium M processor microarchitecture supports a high speed...

Page 55: ...ure 1 5 The Front End The Intel Pentium M processor uses a pipeline depth that enables high performance and low power consumption It s shorter than that of the Intel NetBurst microarchitecture The Int...

Page 56: ...oding an instruction with four or fewer ops The remaining two decoders each decode a one op instruction in each clock cycle The front end can issue multiple ops per cycle in original program order to...

Page 57: ...ro op This holds for integer floating point and MMX technology loads and for most kinds of successive execution operations Note that SSE Loads can not be fused Data Prefetching The Intel Pentium M pro...

Page 58: ...ands are ready and resources are available Each cycle the core may dispatch up to five ops through the issue ports Table 1 2 Trigger Threshold and CPUID Signatures for IA 32 Processor Families Trigger...

Page 59: ...wo cores in an Intel Core Duo processor to minimize bus traffic between two cores accessing a single copy of cached data It allows an Intel Core Solo processor or when one of the two cores in an Intel...

Page 60: ...and memory have single micro op flows comparable to X87 flows Many packed instructions are fused to reduce its micro op flow from four to two micro ops Eliminating decoder restrictions Intel Core Sol...

Page 61: ...em at lower priority than normal cache miss requests If bus queue is in high demand hardware prefetch requests may be ignored or cancelled to service bus traffic required by demand cache misses and ot...

Page 62: ...ell suited for multiprocessor systems to provide an additional performance boost in throughput when compared to traditional MP systems Figure 1 6 shows a typical bus based symmetric multiprocessor SMP...

Page 63: ...to the fact that operating systems and user programs can schedule processes or threads to execute simultaneously on the logical processors in each physical processor the ability to use on chip execut...

Page 64: ...tably the memory type range registers MTRRs and the performance monitoring resources For a complete list of the architecture state and exceptions see the IA 32 Intel Architecture Software Developer s...

Page 65: ...ers partitioning also provided an easier implementation to maintain memory ordering for each logical processor and detect memory ordering violations Shared Resources Most resources in a physical proce...

Page 66: ...e stage include cache misses branch mispredictions and instruction dependencies Front End Pipeline The execution trace cache is shared between two logical processors Execution trace cache access is ar...

Page 67: ...processor by alternating between the two logical processors If one logical processor is not ready to retire any instructions then all retirement bandwidth is dedicated to the other logical processor O...

Page 68: ...rocessors in a physical package Each logical processor has a separate execution core including first level cache and a smart second level cache The second level cache is shared between two logical pro...

Page 69: ...nterface Bus Interface Pentium D Processor Caches Caches System Bus Architectual State Execution Engine Local APIC Local APIC Execution Engine Architectual State Bus Interface Bus Interface Local APIC...

Page 70: ...nded model and model fields See Table 1 4 Shared Cache in Intel Core Duo Processors The Intel Core Duo processor has two symmetric cores that share the second level cache and a single bus interface se...

Page 71: ...to wait before the same operation can start again The latency of a bus transaction is exposed in some of these operations as indicated by entries containing bus transaction On Intel Core Duo processo...

Page 72: ...in response time of these cache misses For store operation reading for ownership must be completed before the data is written to the first level data cache and the line is marked as modified Reading...

Page 73: ...ssed in Chapter 7 This chapter explains the optimization techniques both for those who use the Intel C or Fortran Compiler and for those who use other compilers The Intel compiler which generates code...

Page 74: ...ache line splits Table 2 1 lists coding pitfalls that cause performance degradation in some Pentium 4 and Intel Xeon processor implementations For every issue Table 2 1 references a section in this do...

Page 75: ...or the common performance features of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture The coding practices recommended under each heading and the bullets under each...

Page 76: ...raffic locality memory traffic characteristics etc Characterize the performance gain Optimize Performance Across Processor Generations Use a cpuid dispatch strategy to deliver optimum performance for...

Page 77: ...sible eliminate branches Avoid indirect calls Optimize Memory Access Observe store forwarding constraints Ensure proper data alignment to prevent data split across cache line boundary This includes st...

Page 78: ...registers rounding modes between more than two values Use efficient conversions such as those that implicitly include a rounding mode in order to avoid changing control status registers Take advantage...

Page 79: ...16 bit registers because accessing them requires a shift operation internally Use xor and pxor instructions to clear registers and break dependencies for integer operations also use xorps and xorpd t...

Page 80: ...sures how frequently such instances occur across all application domains with the frequency marked as H high M medium L low These rules are very approximate They can vary depending on coding style app...

Page 81: ...n and instruction scheduling Refer to the Intel C Intrinsics Reference section of the Intel C Compiler User s Guide C class libraries Refer to the Intel C Class Libraries for SIMD Operations Reference...

Page 82: ...cial for most programs If a performance problem is root caused to a poor choice on the part of the compiler using different switches or compiling the targeted module with a different compiler may be t...

Page 83: ...chapter include descriptions of the VTune analyzer events that provide measurable data of performance gain achieved by following recommendations Refer to the VTune analyzer online help for instructio...

Page 84: ...sed shifts rotates integer multiplies and moves from memory with sign extension are longer than before Use care when using the lea instruction See the section Use of the lea Instruction for recommenda...

Page 85: ...ntify the processor generation and integrate processor specific instructions such as SSE2 instructions into the source code The Intel C Compiler supports the integration of different versions of the c...

Page 86: ...t IA 32 processor families offer hardware multi threading support in two forms dual core technology and Hyper Threading Technology Future trend for IA 32 processors will continue to improve in the dir...

Page 87: ...n wait loops Inline functions and pair up calls and returns Unroll as necessary so that repeatedly executed loops have sixteen or fewer iterations unless this causes an excessive code size increase Se...

Page 88: ...g both paths of a conditional branch In addition converting conditional branches to cmovs or setcc trades of control flow dependence for data dependence and restricts the capability of the out of orde...

Page 89: ...the cmov and fcmov instructions Example 2 3 shows changing a test and branch instruction sequence using cmov and eliminating a branch If the test sets the equal flag the value in ebx will be moved to...

Page 90: ...the Pentium 4 processor may suffer a severe penalty when exiting the loop because the processor may detect a possible memory order violation Inserting the pause instruction significantly reduces the...

Page 91: ...branches in processors based on the Intel NetBurst microarchitecture are predicted using the following static prediction algorithm Predict backward conditional branches to be taken This rule is suita...

Page 92: ...t and make the fall through code following a conditional branch be the unlikely target for a branch with a backward target Example 2 5 illustrates the static branch prediction algorithm The body of an...

Page 93: ...n JC Begin in Example 2 7 segment is a conditional forward branch It is not in the BTB the first time through but the static predictor will predict the branch to fall through The static prediction alg...

Page 94: ...od of exceeding the stack depth in a manner that will impact performance is very low Assembly Compiler Coding Rule 4 MH impact MH generality Near calls must be matched with near returns and far calls...

Page 95: ...all to a jump This will save the call return overhead as well as an entry in the return stack buffer Assembly Compiler Coding Rule 10 M impact L generality Do not put more than four branches in a 16 b...

Page 96: ...or calls through pointers can jump to an arbitrary number of locations If the code sequence is such that the target destination of a branch goes to the same address most of the time then the BTB will...

Page 97: ...f the direction of the branch under consideration Example 2 8 shows a simple example of the correlation between a target of a preceding conditional branch with a target of an indirect branch Correlati...

Page 98: ...s This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path Unrolling exposes the code to various other optimizations...

Page 99: ...f the unrolled loop is 16 or less the branch predictor should be able to correctly predict branches in the loop body that alternate direction Assembly Compiler Coding Rule 13 H impact M generality Unr...

Page 100: ...on separate pages using conditional move instructions to eliminate branches generating code that is consistent with the static branch prediction algorithm inlining where appropriate unrolling if the...

Page 101: ...entium 4 processor Alignment Alignment of data concerns all kinds of variables dynamically allocated members of a data structure global or local variables parameters passed on the stack Misaligned dat...

Page 102: ...A 64 byte or greater data structure or array should be aligned so that its base address is a multiple of 64 Sorting data in decreasing size order is one heuristic for assisting with natural alignment...

Page 103: ...ment of branch targets will improve decoder throughput Example 2 11 Code That Causes Cache Line Split mov esi 029e70feh mov edi 05be5260h Blockmove mov eax DWORD PTR esi mov ebx DWORD PTR esi 4 mov DW...

Page 104: ...eral examples of coding pitfalls that cause store forwarding stalls and solutions to these pitfalls are discussed in detail in the Store to Load Forwarding Restriction on Size and Alignment section Th...

Page 105: ...a significant latency in forwarding Passing floating point argument in preferably XMM registers should save this long latency operation Parameter passing conventions may limit the choice of which par...

Page 106: ...rt point and therefore the same alignment as the store data Assembly Compiler Coding Rule 19 H impact M generality The data of a load which is forwarded from a store must be completely contained withi...

Page 107: ...ead and register copies as needed Example 2 12 contains several store forwarding situations when small loads follow large stores The first three load operations illustrate the situations described in...

Page 108: ...d mov EAX EBP blocked The first 4 small store can be consolidated into a single DWORD store to prevent this non forwarding situation Example 2 14 A Non forwarding Situation in Compiler Generated Code...

Page 109: ...or example when bytes or words are stored and then words or doublewords are read from the same area of memory In the second case Example 2 16 B there is a series of small loads after a large store to...

Page 110: ...l cases where data is passed through memory where the store may need to be separated from the load spills save and restore registers in a stack frame parameter passing global and volatile variables ty...

Page 111: ...ctures and arrays to minimize the amount of memory wasted by padding However compilers might not have this freedom The C programming language for example specifies the order in which structure element...

Page 112: ...d into several arrays to achieve better packing as shown in Example 2 19 The efficiency of such optimizations depends on usage patterns If the elements of the structure are all accessed together but t...

Page 113: ...so have an impact on DRAM page access efficiency An alternative hybrid_struct_of_array blends the two approaches In this case only 2 separate address streams are generated and referenced 1 for hybrid_...

Page 114: ...For example if a language only supports 8 bit 16 bit 32 bit and 64 bit data quantities but never uses 80 bit data quantities the language can require the stack to always be aligned on a 64 bit boundar...

Page 115: ...the same set of each way in a cache can cause a capacity issue There are aliasing conditions that apply to specific microarchitectures Note that first level cache lines are 64 bytes Thus the least si...

Page 116: ...ss of first level cache misses for more than 4 simultaneous competing memory references to addresses with 2KB modulus On Pentium 4 and Intel Xeon processors with CPUID signature of family encoding 15...

Page 117: ...rence load or store which is under way then the second reference cannot begin until the first one is kicked out of the cache On Pentium 4 and Intel Xeon processors with CPUID signature of family encod...

Page 118: ...first level cache working set Avoid having a store followed by a non dependent load with addresses that differ by a multiple of 4 KB When declaring multiple arrays that are referenced with the same i...

Page 119: ...re cases a performance problem may be noted due to executing data on a code page as instructions The condition where this is very likely to happen is when execution is following an indirect branch tha...

Page 120: ...condition and should be avoided where possible Avoid the condition by introducing indirect branches and using data tables on data not code pages via register indirect calls Write Combining Write comb...

Page 121: ...to writeback memory into sepa rate phases can assure that the write combining buffers can fill before getting evicted by other write traffic Eliminating partial write transac tions has been found to h...

Page 122: ...is typically that access cost from an outer sub system may be somewhere between 3 10X more expensive than accessing data from the immediate inner level in the cache memory hierarchy assuming similar...

Page 123: ...ocality enhancing techniques Using the lock prefix heavily can incur large delays when accessing memory irrespective of whether the data is in the cache or in system memory User Source Coding Rule 6 H...

Page 124: ...sactions into read phases and write phases can help performance Note however that the order of read and write operations on the bus are not the same as they appear in the program Bus latency of fetchi...

Page 125: ...ead for ownership of a cache line and writes The data transfer rate for bus write transactions is higher if 64 bytes are written out to the bus at a time Typically bus writes to Writeback WB type memo...

Page 126: ...r ecx eax xmm0 movntps XMMWORD ptr ecx eax 16 xmm0 movntps XMMWORD ptr ecx eax 32 xmm0 movntps XMMWORD ptr ecx eax 48 xmm0 64 bytes is written in one bus transaction add eax STRIDESIZE cmp eax edx jl...

Page 127: ...using the prefetchnta instruction fetches 128 bytes into one way of the second level cache The Pentium M processor also provides a hardware prefetcher for data It can track 12 separate streams in the...

Page 128: ...ll also have an impact For a detailed description of how to use prefetching see Chapter 6 Optimizing Cache Usage User Source Coding Rule 9 M impact H generality Enable the prefetch generation in your...

Page 129: ...ode such as code to handle error conditions out of that sequence See Prefetching section on how to optimize for the instruction prefetcher Assembly Compiler Coding Rule 29 M impact H generality All br...

Page 130: ...race cache bandwidth or instruction latency Focus on optimizing the problem area For example adding prefetch instructions will not help if the bus is already saturated If trace cache bandwidth is the...

Page 131: ...store forwarding delays over some compiler implementations This avoids changing the rounding mode User Source Coding Rule 14 M impact ML generality Break dependence chains where possible Removing data...

Page 132: ...conditions such as arithmetic overflow arithmetic underflow denormalized operand Refer to Chapter 4 of the IA 32 Intel Architecture Software Developer s Manual Volume 1 for the definition of overflow...

Page 133: ...eptions Scale the range of operands results to reduce as much as possible the number of arithmetic overflow underflow situations Keep intermediate results on the x87 FPU register stack until the final...

Page 134: ...that can be encountered in FTZ mode are the ones specified as constants read only The DAZ mode is provided to handle denormal source operands efficiently when running an SSE application When the DAZ...

Page 135: ...olution to this problem is to choose two constant FCW values take advantage of the optimization of the FLDCW instruction to alternate between only these two constant FCW values and devise some means t...

Page 136: ...olved For x87 floating point the fist instruction uses the rounding mode represented in the floating point control word FCW The rounding mode is generally round to nearest therefore many compiler writ...

Page 137: ...se the algorithm in Example 2 23 to avoid synchronization issues the overhead of the fldcw instruction and having to change the rounding mode The provided example suffers from a store forwarding probl...

Page 138: ...mov edx ecx 4 high dword of integer mov eax ecx low dword of integer test eax eax je integer_QnaN_or_zero arg_is_not_integer_QnaN fsubp st 1 st TOS d round d st 1 st 1 st pop ST test edx edx what s s...

Page 139: ...rd is set to Single Precision the floating point divider can complete a single precision computation much faster than either a double precision computation or an extended double precision computation...

Page 140: ...to expose more independent instructions to the hardware scheduler An fxch instruction may be required to effectively increase the register name space so that more operands can be simultaneously live N...

Page 141: ...results will not underflow Therefore subsequent computation will not face the performance penalty of handling denormal input operands For example in the case of 3D applications with low lighting leve...

Page 142: ...he overhead associated with the management of the X87 register stack Scalar SSE SSE2 Performance on Intel Core Solo and Intel Core Duo Processors On Intel Core Solo and Intel Core Duo processors the c...

Page 143: ...ation occurs when mixing single precision and double precision code On Pentium 4 processors using cvtss2sd has performance penalty relative to the alternative sequence xorps xmm1 xmm1 movss xmm1 xmm2...

Page 144: ...with using separate instructions Assembly Compiler Coding Rule 36 M impact L generality Try to use 32 bit operands rather than 16 bit operands for fild However do not do so at the expense of introduc...

Page 145: ...nd fewer ops Use optimized sequences for clearing and comparing registers Enhance register availability Avoid prefixes especially more than one prefix Assembly Compiler Coding Rule 37 M impact H gener...

Page 146: ...can also be used as a multiple operand addition instruction for example lea ecx eax ebx 4 a Using lea in this way may avoid register usage by not tying up registers for operands of arithmetic instruct...

Page 147: ...responding to family 15 and model encoding of 0 1 or 2 The latency of a sequence of adds will be shorter for left shifts of three or less Fixed and variable shifts have the same latency The rotate by...

Page 148: ...generate the same number of µops If AX or EAX are known to be positive replace these instructions with xor dx dx or xor edx edx Operand Sizes and Partial Register Accesses The Pentium 4 processor Penti...

Page 149: ...op with limited parallelism the resulting optimization can yield several percent performance improvement Example 2 24 Dependencies Caused by Referencing Partial Registers 1 add ah bh 2 add al 3 instru...

Page 150: ...d If multiple prefixes or a prefix that changes the size of an immediate or displacement cannot be avoided schedule them behind instructions that stall the pipe for some other reason Assembly Compiler...

Page 151: ...mon However the technique will not work if the compare is for greater than less than greater than or equal and so on or if the values in eax or ebx are to be used in another operation where sign exten...

Page 152: ...eax 0xffff is encoded as 35 FF FF 00 00 in 32 bit code while overriding the default operand size with 0x66 results in the instruction xor ax 0xffff which is encoded as 66 35 FF FF Use of an LCP cause...

Page 153: ...odr m byte consider adjusting code alignment by adding a nop before the instruction to avoid the opcode byte aligning on byte 14 from the beginning of an instruction fetch line REP Prefix and Data Mov...

Page 154: ...iteration will in general reduce overhead and improve throughput Sometimes this may involve a comparison of the relative overhead of an iterative loop structure versus using REP prefix for iteration...

Page 155: ...le application situation When applying general heuristics to the design of general purpose high performance library routines the following guidelines are useful when optimizing arbitrary size of c...

Page 156: ...e address alignment is amortized over many iterations b an iterative approach using the instruction with largest data granularity where the overhead for SIMD feature detection iteration overhead prolo...

Page 157: ...unt Size and 4 Byte Aligned Destination A C example of Memset Equivalent Implementation Using REP STOSD void memset void dst int c size_t size char d char dst size_t i for i 0 i size i d char c push e...

Page 158: ...de high performance in the situations described above However using a REP prefix with string scan instructions scasb scasw scasd scasq or compare instructions cmpsb cmpsw cmpsd cmpsq is not recommende...

Page 159: ...break a false dependence chain resulting from re use of registers In contexts where the condition codes must be preserved move 0 into the register instead This requires more code space than using xor...

Page 160: ...Thus the operation can be tested directly by a jcc instruction The notable exceptions are mov and lea In these cases use test Assembly Compiler Coding Rule 53 ML impact M generality Eliminate unnecess...

Page 161: ...s into a single register This same functionality can be obtained using movsd xmmreg1 mem movsd xmmreg2 mem 8 unpcklpd xmmreg1 xmmreg2 which uses fewer µops and can be packed into the trace cache more e...

Page 162: ...mory operands can improve performance Instructions of the form OP REG MEM can reduce register pressure by taking advantage of scratch registers that are not available to the compiler Assembly Compiler...

Page 163: ...esult Assembly Compiler Coding Rule 60 M impact M generality When an address of a store is unknown subsequent loads cannot be scheduled to execute out of order ahead of the store limiting the out of o...

Page 164: ...aced in the dependence chain but there would also be a data not ready stall of the load costing further cycles Assembly Compiler Coding Rule 62 H impact MH generality For small loops placing loop inva...

Page 165: ...decoders can each decode one macroinstruction per clock cycle assuming the instruction is one micro op up to seven bytes in length Instructions composed of more than four micro ops take multiple cycl...

Page 166: ...te SIMD code avoid global pointers avoid global variables These may be less of a problem if all modules are compiled simultaneously and whole program optimization is used User Source Coding Rule 17 H...

Page 167: ...d for the following operations 1 byte xchg EAX EAX 2 byte mov reg reg 3 byte lea reg 0 reg 8 bit displacement 6 byte lea reg 0 reg 32 bit displacement These are all true NOPs having no effect on the s...

Page 168: ...o summarize the rules and suggestions specified in this chapter be reminded that coding recommendations are ranked in importance according to these two criteria Local impact referred to earlier as imp...

Page 169: ...entium M processors and within a sector of 128 bytes on Pentium 4 and Intel Xeon processors 2 42 User Source Coding Rule 4 H impact ML generality Consider using a special memory allocation library to...

Page 170: ...n range to avoid denormal values underflows 2 58 User Source Coding Rule 12 M impact ML generality Do not use double precision unless necessary Set the precision control PC field in the x87 FPU contro...

Page 171: ...e of data in an earlier iteration happens lexically after the load of that data in a future iteration something which is called a lexically backward dependence 2 85 User Source Coding Rule 19 M impact...

Page 172: ...ates a mismatch in calls and returns 2 21 Assembly Compiler Coding Rule 5 MH impact MH generality Selectively inline a function where doing so decreases code size or if the function is small and the c...

Page 173: ...ently executed and that have a predictable number of iterations to reduce the number of iterations to 16 or fewer unless this increases code size so that the working set no longer fits in the trace ca...

Page 174: ...rs as in register allocation and for parameter passing to minimize the likelihood and impact of store forwarding problems Try not to store forward data generated from a long latency instruction e g mu...

Page 175: ...B pages or on separate aligned 1 KB subpages 2 47 Assembly Compiler Coding Rule 28 H impact L generality If an inner loop writes to more than four arrays four distinct cache lines apply loop fission t...

Page 176: ...36 M impact L generality Try to use 32 bit operands rather than 16 bit operands for fild However do not do so at the expense of introducing a store forwarding problem by writing the two halves of the...

Page 177: ...d of partial registers For moves this can be accomplished with 32 bit moves or by using movzx 2 76 Assembly Compiler Coding Rule 47 M impact M generality Try to use zero extension or operate on 32 bit...

Page 178: ...1 xmmreg2 instruction instead 2 80 Assembly Compiler Coding Rule 53 ML impact L generality Instead of using movupd xmmreg1 mem for a unaligned 128 bit load use movsd xmmreg1 mem movsd xmmreg2 mem 8 un...

Page 179: ...loading the memory adding two registers and storing the result 2 82 Assembly Compiler Coding Rule 58 M impact M generality When an address of a store is unknown subsequent loads cannot be scheduled t...

Page 180: ...hat is not resident in the trace cache If a performance problem is clearly due to this problem try moving the data elsewhere or inserting an illegal opcode or a pause instruction immediately following...

Page 181: ...lopment of advanced multimedia signal processing and modeling applications To take advantage of the performance opportunities presented by these new capabilities do the following Ensure that the proce...

Page 182: ...ctive for programs that may be executed on different machines 3 Create a fat binary that includes multiple versions of routines versions that use SIMD technology and versions that do not Check for SIM...

Page 183: ...tate introduced by SSE for your application to properly function To check whether your system supports SSE follow these steps 1 Check that your processor supports the cpuid instruction 2 Check the fea...

Page 184: ...swer See Example 3 3 Example 3 2 Identification of SSE with cpuid identify existence of cpuid instruction identify signature is genuine intel mov eax 1 request for feature flags cpuid 0Fh 0A2h cpuid i...

Page 185: ...shows how to find the SSE2 feature bit bit 26 in the cpuid feature flags SSE2 requires the same support from the operating system as SSE To find out whether the operating system supports SSE2 execute...

Page 186: ...nts for SSE To check whether your system supports the x87 and SIMD instructions of SSE3 follow these steps 1 Check that your processor has the cpuid instruction 2 Check the ECX feature bit 0 of cpuid...
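
The feature-bit checks described in the excerpts above can be sketched in C. This is a minimal sketch, assuming a GCC- or Clang-style toolchain that ships `<cpuid.h>` with the `__get_cpuid` helper; the enum and the function name `cpu_has_feature` are hypothetical. It tests only the processor feature bits (SSE is EDX bit 25, SSE2 is EDX bit 26, SSE3 is ECX bit 0 of leaf 1) and does not perform the separate operating-system support check that the text also calls for.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper header; assumed available */
#endif

/* Hypothetical names, used only for this illustration. */
enum simd_feature { FEATURE_SSE, FEATURE_SSE2, FEATURE_SSE3 };

/* Returns 1 if the feature bit is set, 0 if it is clear,
 * and -1 if CPUID leaf 1 cannot be queried on this target. */
int cpu_has_feature(enum simd_feature f)
{
    assert(f == FEATURE_SSE || f == FEATURE_SSE2 || f == FEATURE_SSE3);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;                       /* leaf 1 not supported */
    switch (f) {
    case FEATURE_SSE:  return (int)((edx >> 25) & 1u);
    case FEATURE_SSE2: return (int)((edx >> 26) & 1u);
    case FEATURE_SSE3: return (int)(ecx & 1u);
    }
    return -1;
#else
    return -1;                           /* not an IA-32 or x86-64 build */
#endif
}
```

A caller would typically run this check once at startup and select a SIMD or non-SIMD code path based on the result.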

Page 187: ...wer See Example 3 7 Detecting the availability of MONITOR and MWAIT instructions can be done using a code sequence similar to Example 3 6 the availability of MONITOR and MWAIT is indicated by bit 3 of...

Page 188: ...treaming SIMD Extensions 3 2 Is this code integer or floating point 3 What integer word size or floating point precision is needed 4 What coding techniques should I use 5 What guidelines do I need to...

Page 189: ...Integer Range or Precision If possible re arrange data for SIMD efficiency Integer Change to use SIMD Integer Yes Change to use Single Precision Can convert to Single precision Yes No No Align data s...

Page 190: ...cation Performance Tools The VTune analyzer provides a hotspots view of a specific module to help you identify sections in your code that take the most CPU time and that have potential performance pro...

Page 191: ...D technologies can be time consuming and difficult Likely candidates for conversion are applications that are highly computation intensive such as the following speech compression algorithms and filte...

Page 192: ...alar code into code that can execute in parallel taking advantage of the SIMD architecture parallelism This section discusses the coding techniques available for an application to make use of the SIMD...

Page 193: ...sive to write and maintain Performance objectives can be met by taking advantage of the different SIMD technologies using high level languages as well as assembly The new C C language extensions desig...

Page 194: ...allows a simple replacement of the code with Streaming SIMD Extensions For the optimal use of the Streaming SIMD Extensions that need data alignment on the 16 byte boundary all examples in this chapt...

Page 195: ...ovide the access to the ISA functionality using C C style coding instead of assembly language Intel has defined three sets of intrinsic functions that are implemented in the Intel C Compiler to suppor...

Page 196: ...ription of the intrinsics and their use refer to the Intel C Compiler User s Guide Example 3 10 shows the loop from Example 3 8 using intrinsics The intrinsics map one to one with actual Streaming SIM...

Page 197: ...eveloper s Manual Volumes 2A 2B Classes A set of C classes has been defined and available in Intel C Compiler to provide both a higher level abstraction and more flexibility for programming with MMX t...

Page 198: ...mechanism by which loops such as in Example 3 8 can be automatically vectorized or converted into Streaming SIMD Extensions code The compiler uses similar techniques to those used by a programmer to...

Page 199: ...memory to which the pointers point In other words the pointer for which it is used provides the only means of accessing the memory in question in the scope in which the pointers live Without the rest...

Page 200: ...for improving data access such as padding organizing data elements into arrays etc are described below SSE3 provides a special purpose instruction LDDQU that can avoid cache line splits it is discussed...

Page 201: ...nce for accesses which span multiple cache lines The following declaration allows you to vectorize the scaling operation and further improve the alignment of the data access patterns short ptx N pty N...

Page 202: ...or 128 bit SIMD Technologies For best performance the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16 byte 16B boundaries Unaligned data can...

Page 203: ...sions and SSE2 see Appendix D Stack Alignment Data Alignment for MMX Technology Many compilers enable alignment of variables using controls This aligns the variables bit lengths to the appropriate bou...

Page 204: ...e is keep the data 16 byte aligned Such alignment will also work for MMX technology code even though MMX technology only requires 8 byte alignment The following discussion and examples describe alignm...

Page 205: ...100 objects of type __m128 or F32vec4 In the code below the construction of the F32vec4 object x will occur with aligned data void foo F32vec4 x __m128 buffer Without the declaration of __declspec ali...

Page 206: ...names m and f can be used as immediate member names of my__m128 Note that __declspec align has no effect when applied to a class struct or union member in either C or C Alignment by Using __m64 or dou...

Page 207: ...tions see Chapter 6 Optimizing Cache Usage Data Structure Layout For certain algorithms like 3D transformations and lighting there are two basic ways of arranging the vertex data The traditional metho...

Page 208: ...ertices int b NumOfVertices int c NumOfVertices VerticesList VerticesList Vertices Example 3 16 AoS and SoA Code Samples The dot product of an array of vectors Array and a fixed vector Fixed is a comm...
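
To illustrate the two layouts named in the excerpt above, here is a minimal scalar C sketch of the same dot product written against an array of structures (AoS) and a structure of arrays (SoA). The type names, `NUM_VERTICES`, and the function names are hypothetical and are not taken from the manual's example. The point is only the memory layout: in the SoA version, consecutive iterations read consecutive floats, which is the arrangement that maps naturally onto packed SIMD loads.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VERTICES 4   /* hypothetical vertex count for the sketch */

/* Array of structures: the x, y, z of one vertex are adjacent. */
typedef struct { float x, y, z; } VertexAoS;

/* Structure of arrays: all x values are contiguous, then all y, then all z. */
typedef struct {
    float x[NUM_VERTICES];
    float y[NUM_VERTICES];
    float z[NUM_VERTICES];
} VerticesSoA;

/* Dot product of every vertex with a fixed vector, AoS layout. */
void dot_aos(const VertexAoS *array, VertexAoS fixed, float *out)
{
    assert(array != NULL && out != NULL);
    for (size_t i = 0; i < NUM_VERTICES; i++)
        out[i] = array[i].x * fixed.x + array[i].y * fixed.y
               + array[i].z * fixed.z;
}

/* Same computation, SoA layout: each input stream is unit-stride. */
void dot_soa(const VerticesSoA *array, VertexAoS fixed, float *out)
{
    assert(array != NULL && out != NULL);
    for (size_t i = 0; i < NUM_VERTICES; i++)
        out[i] = array->x[i] * fixed.x + array->y[i] * fixed.y
               + array->z[i] * fixed.z;
}
```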

Page 209: ...ures are generated see Chapters 4 and 5 for specific examples of swizzling code Performing the swizzle dynamically is usually better than using AoS addps xmm1 xmm0 xmm0 DC DC DC x0 xF z0 zF movaps xmm...

Page 210: ...tions that consume SIMD execution slots but produce only a single scalar result as shown by the many don t care DC slots in Example 3 16 Use of the SoA format for data structures can also lead to more...

Page 211: ...iables a b c would also be used together but not at the same time as x y z This hybrid SoA approach ensures data is organized to enable more efficient vertical SIMD computation simpler less address gener...

Page 212: ...forms the loop structure twofold It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm It reduces the number of iterations of th...

Page 213: ...he cache then the coordinates for v i that were cached during Transform v i will be evicted from the cache by the time we do Lighting v i This means that v i will have to be fetched from main memory a...

Page 214: ...ossible This technique transforms the memory domain of a given problem into smaller chunks rather than sequentially traversing through the entire memory domain Each chunk should be small enough to fit...

Page 215: ...es Due to the limitation of cache capacity this line will be evicted due to conflict misses before the inner loop reaches the end For the next iteration of the outer loop another cache miss will be ge...

Page 216: ...0 0 7 will be completely consumed by the first iteration of the outer loop Consequently B 0 0 7 will only experience one cache miss after applying loop blocking optimization in lieu of eight misses f...
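
The loop blocking idea in the excerpts above can be sketched as follows. This is an illustrative sketch, not the manual's own listing: the matrix size `MAX`, the block edge `BLOCK_SIZE`, and the function names are assumptions. The kernel adds the transpose of `B` to `A`, so the unblocked version walks `B` column-wise, while the blocked version works on one small tile at a time so that the lines of `B` it touches are reused before they can be evicted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX 64         /* hypothetical matrix dimension */
#define BLOCK_SIZE 8   /* hypothetical tile edge */

static_assert(MAX % BLOCK_SIZE == 0, "tile edge must divide the dimension");

/* Unblocked: B is traversed column-wise in the inner loop. */
void add_transpose_naive(float A[MAX][MAX], float B[MAX][MAX])
{
    for (size_t i = 0; i < MAX; i++)
        for (size_t j = 0; j < MAX; j++)
            A[i][j] += B[j][i];
}

/* Blocked: the same element updates, grouped into BLOCK_SIZE x BLOCK_SIZE tiles. */
void add_transpose_blocked(float A[MAX][MAX], float B[MAX][MAX])
{
    for (size_t ii = 0; ii < MAX; ii += BLOCK_SIZE)
        for (size_t jj = 0; jj < MAX; jj += BLOCK_SIZE)
            for (size_t i = ii; i < ii + BLOCK_SIZE; i++)
                for (size_t j = jj; j < jj + BLOCK_SIZE; j++)
                    A[i][j] += B[j][i];
}
```

Both functions perform exactly one update per element, so they produce identical results; only the traversal order differs. A suitable tile size depends on the cache size of the target processor and is best chosen by measurement.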

Page 217: ...ection The following section gives some guidelines for choosing instructions to complete a task One barrier to SIMD computation can be the existence of data dependent branches Conditional moves can be...

Page 218: ...zing SIMD code targeting Intel Core Solo and Intel Core Duo processors The register register variant of the following instructions has improved performance on Intel Core Solo and Intel Core Duo proces...

Page 219: ...ectly is to use a profiler that measures the application while it is running on a system VTune analyzer can help you determine where to make changes in your application to improve performance Using th...

Page 221: ...rred to as SIMD integer instructions Unless otherwise noted the following sequences are written for the 64 bit integer registers Note that they can easily be adapted to use the 128 bit SIMD integer fo...

Page 222: ...ns to ensure that the input operands in XMM registers contain properly defined data type to match the instruction Code sequences containing cross typed usage will produce the same result across differ...

Page 223: ...unrelated to the x87 FP MMX registers The emms instruction is not needed to transition to or from SIMD floating point operations or 128 bit SIMD operations Using the EMMS Instruction When generating 6...

Page 224: ...g point and 64 bit SIMD integer instructions follow these steps 1 Always call the emms instruction at the end of 64 bit SIMD integer code when the code transitions to x87 floating point code 2 Insert...

Page 225: ...ge Further you must be aware that your code generates an MMX instruction which uses the MMX registers with the Intel C Compiler in the following situations when using a 64 bit SIMD integer intrinsic f...

Page 226: ...wer performance than using 16 byte aligned references Refer to Stack and Data Alignment in Chapter 3 for more information Data Movement Coding Techniques In general better performance can be achieved...

Page 227: ...ssumes the source is a packed word 16 bit data type Example 4 2 Unsigned Unpack Instructions Input MM0 source value MM7 0 a local variable can be used instead of the register MM7 if desired Output MM0...

Page 228: ...ck Code Input MM0 source value Output MM0 two sign extended 32 bit doublewords from the two low end words MM1 two sign extended 32 bit doublewords from the two high end words movq MM1 MM0 copy source...

Page 229: ...operation The two signed doublewords are used as source operands and the result is interleaved signed words The pack instructions can be performed with or without saturation as needed Figure 4 1 PACK...

Page 230: ...ion of the MMX instruction set can be found in the Intel Architecture MMX Technology Programmer s Reference Manual order number 243007 Interleaved Pack without Saturation Example 4 5 is similar to Exa...

Page 231: ...djacent elements of a packed word data type in source2 and place this value in the high 32 bits of the results One of the destination registers will have the combination illustrated in Figure 4 3 Exam...

Page 232: ...in a non interleaved way The goal is to use the instruction which unpacks doublewords to a quadword instead of using the instruction which unpacks words to doublewords Figure 4 3 Result of Non Interl...

Page 233: ...interleaved Way Input MM0 packed word source value MM1 packed word source value Output MM0 contains the two low end words of the original sources non interleaved MM2 contains the two high end words o...

Page 234: ...of the immediate constant Insertion is done in such a way that the three other words from the destination register are left untouched see Figure 4 6 and Example 4 8 Figure 4 5 pextrw Instruction Examp...

Page 235: ...ntent and break the dependence chain by either using the pxor instruction or loading the register See the Clearing Registers section in Chapter 2 Figure 4 6 pinsrw Instruction Example 4 8 pinsrw Instr...
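
As a rough scalar model of the word extract and insert operations described above, the following C sketch treats a 64-bit register as four 16-bit words. The function names are hypothetical; the code models only the data movement (word selection by the two low bits of the immediate, with the other three destination words left untouched on insert) and says nothing about instruction timing.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a word extract: the two low bits of imm select a 16-bit word,
 * which is returned zero-extended in a 32-bit value. */
uint32_t extract_word(uint64_t src, unsigned imm)
{
    unsigned shift = 16u * (imm & 3u);
    return (uint32_t)((src >> shift) & 0xFFFFu);
}

/* Model of a word insert: the low word of src replaces the word of dst
 * selected by imm; the other three words of dst are unchanged. */
uint64_t insert_word(uint64_t dst, uint32_t src, unsigned imm)
{
    unsigned shift = 16u * (imm & 3u);
    uint64_t mask = (uint64_t)0xFFFFu << shift;
    return (dst & ~mask) | ((uint64_t)(src & 0xFFFFu) << shift);
}
```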

Page 236: ...XMM registers it produces a 16 bit mask zeroing out the upper 16 bits in the destination register The 64 bit version is shown in Figure 4 7 and Example 4 10 Example 4 9 Repeated pinsrw Instruction Co...

Page 237: ...ure 4 7 pmovmskb Instruction Example Example 4 10 pmovmskb Instruction Code Input source value Output 32 bit register containing the byte mask in the lower eight bits movq mm0 edi pmovmskb eax mm0 OM1...
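
The byte-mask operation shown above can be modeled in portable C. This is a scalar sketch of the 64-bit form only, with a hypothetical function name: the most significant bit of each of the eight source bytes is gathered into the low eight bits of a 32-bit result, and the remaining result bits are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a byte-mask move on a 64-bit source:
 * result bit i is the most significant bit of source byte i. */
uint32_t byte_mask(uint64_t src)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint32_t msb = (uint32_t)((src >> (8u * i + 7u)) & 1u);
        mask |= msb << i;
    }
    return mask;
}
```

A common use of such a mask is to turn the result of a packed byte compare into a scalar value that ordinary integer code can test or branch on.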

Page 238: ...he immediate value encode the source for destination word 0 in MMX register 15 0 and so on as shown in the table Bits 7 and 6 encode for word 3 in MMX register 63 48 Similarly the 2 bit encoding repre...
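
The immediate encoding described above can be made concrete with a scalar C model of a four-word shuffle. The function name is hypothetical. Bits 1:0 of the immediate choose the source word for destination word 0, bits 3:2 for word 1, bits 5:4 for word 2, and bits 7:6 for word 3.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a word shuffle on a 64-bit operand. */
uint64_t shuffle_words(uint64_t src, unsigned imm8)
{
    uint64_t dst = 0;
    assert(imm8 <= 0xFFu);
    for (unsigned i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2u * i)) & 3u;          /* 2-bit selector */
        uint64_t word = (src >> (16u * sel)) & 0xFFFFu;  /* chosen source word */
        dst |= word << (16u * i);
    }
    return dst;
}
```

With this model, an immediate of `0xE4` (binary 11 10 01 00) leaves the operand unchanged, `0x1B` reverses the four words, and `0x00` broadcasts word 0 to all four positions.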

Page 239: ...urce to any double word field in the 128 bit result using an 8 bit immediate operand No more than 3 instructions using pshuflw pshufhw pshufd are required to implement some common data shuffling opera...

Page 240: ...register The high low order 64 bits of the source operands are ignored Example 4 13 Swap Using 3 Instructions Goal Swap the values in word 6 and word 1 Instruction Result 7 6 5 4 3 2 1 0 PSHUFD 3 0 1...

Page 241: ...de conversion of single precision data to from double word integer data Also conversions between double precision data and double word integer data have been added Generating Constants The SIMD intege...

Page 242: ...ld MM1 32 n two instructions above generate the signed constant 2^n - 1 in every packed word or packed dword field pcmpeq MM1 MM1 psllw MM1 n pslld MM1 n two instructions above generate the signed consta...
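
The arithmetic behind these constant-generation sequences can be checked with a scalar C model of a single 16-bit lane. The function names are hypothetical. Starting from an all-ones word (the result of comparing a register with itself for equality), a logical right shift by 16 - n leaves the n low bits set, which is 2^n - 1, and a left shift by n clears the n low bits, which reads as -2^n when the word is interpreted as signed. The final conversion to `int16_t` assumes the usual two's complement behavior.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones word shifted right by (16 - n): the n low bits set, i.e. 2^n - 1. */
uint16_t low_ones(unsigned n)
{
    assert(n <= 16);
    return (uint16_t)(0xFFFFu >> (16u - n));
}

/* All-ones word shifted left by n: as a signed word this is -(2^n). */
int16_t all_ones_shifted_left(unsigned n)
{
    assert(n <= 15);
    return (int16_t)(uint16_t)(0xFFFFu << n);
}
```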

Page 243: ...operands and subtracts them with UNSIGNED saturation This support exists only for packed bytes and packed words not for packed doublewords This example will not work if the operands are signed Note th...

Page 244: ...sorting technique that uses the fact that B = xor(A, xor(A, B)) and xor(A, A) = 0 Thus in a packed data type having some elements being xor(A, B) and some being 0 you could xor such an operand with A and receive...

Page 245: ...M0 create a mask of 0s and xor(A, B) elements Where A > B there will be a value xor(A, B) and where A <= B there will be 0 pxor MM4 MM2 minima xor(A, swap mask) pxor MM1 MM2 maxima xor(B, swap mask) psubw MM1 MM4...
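
The xor-based sorting step above can be followed on a single signed 16-bit lane with this scalar C sketch. The function name is hypothetical. It builds a swap mask that equals xor(A, B) where A is greater than B and 0 elsewhere, then xors that mask into A to obtain the minimum and into B to obtain the maximum, with no branch on the data in the swap itself.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free min/max of two signed words using a swap mask. */
void minmax_xor(int16_t a, int16_t b, int16_t *min_out, int16_t *max_out)
{
    assert(min_out != 0 && max_out != 0);
    uint16_t ua = (uint16_t)a;
    uint16_t ub = (uint16_t)b;
    /* Models a packed signed compare: all ones where a > b, else zero. */
    uint16_t greater = (uint16_t)-(a > b);
    /* xor(A, B) where A > B, 0 where A <= B. */
    uint16_t swap = (uint16_t)((ua ^ ub) & greater);
    *min_out = (int16_t)(ua ^ swap);   /* B if A > B, else A */
    *max_out = (int16_t)(ub ^ swap);   /* A if A > B, else B */
}
```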

Page 246: ...ues For simplicity we use the following constants corresponding constants are used in case the operation is done on byte values packed_max equals 0x7fff7fff7fff7fff packed_min equals 0x800080008000800...

Page 247: ...ipping to a Signed Range of Words high low Input MM0 signed source operands Output MM0 signed words clipped to the signed range high low pminsw MM0 packed_high pmaxsw MM0 packed_low Example 4 20 Clipp...
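
The two-instruction clip shown above (a packed signed minimum against the upper bound followed by a packed signed maximum against the lower bound) corresponds to the following scalar C sketch for one 16-bit lane. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clip a signed word to [low, high]: minimum with high, then maximum with low. */
int16_t clip_word(int16_t x, int16_t low, int16_t high)
{
    assert(low <= high);
    int16_t t = (x < high) ? x : high;   /* models the signed minimum step */
    return (t > low) ? t : low;          /* models the signed maximum step */
}
```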

Page 248: ...he second instruction psubssw MM0 0xffff high low in the three step algorithm Example 4 21 is executed a negative number is subtracted The result of this subtraction causes the values in MM0 to be inc...

Page 249: ...our signed words in either two SIMD registers or one SIMD register and a memory location The pminsw instruction returns the minimum between the four signed words in either two SIMD registers or one SI...

Page 250: ...he pmulhuw and pmulhw instruction multiplies the unsigned signed words in the destination operand with the unsigned signed words in the source operand The high order 16 bits of the 32 bit intermediate...
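
The high-word multiply described above can be modeled per lane in C. This is a scalar sketch with hypothetical function names: each pair of 16-bit operands is multiplied into a 32-bit intermediate, and only the high-order 16 bits are kept. The signed version extracts the high half through an unsigned shift so that it does not rely on how the compiler shifts negative values, and its final conversion to `int16_t` assumes two's complement.

```c
#include <assert.h>
#include <stdint.h>

/* High 16 bits of the unsigned 32-bit product. */
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * (uint32_t)b;
    return (uint16_t)(product >> 16);
}

/* High 16 bits of the signed 32-bit product. */
int16_t mulhi_s16(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    assert(product >= INT32_MIN);        /* always true: the product fits in 32 bits */
    return (int16_t)(uint16_t)((uint32_t)product >> 16);
}
```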

Page 251: ...ce operand to the unsigned data elements of the destination register along with a carry in The results of the addition are then each independently shifted to the right by one bit position The high ord...
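
The packed average described above (add the two unsigned elements together with a carry-in of one, then shift the sum right by one bit) can be written for a single byte lane as the following scalar C sketch. The function name is hypothetical. The sum is formed in a wider type so that the ninth bit is not lost before the shift, which matches the description of the high-order bit being filled from the carry of the addition.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded average of two unsigned bytes: (a + b + 1) >> 1. */
uint8_t avg_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b + 1u;   /* at most 511, needs 9 bits */
    assert(sum <= 511u);
    return (uint8_t)(sum >> 1);
}
```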

Page 252: ...that the 64 bit MMX registers are being used Let the input data be Dr and Di where Dr is real component of the data and Di is imaginary component of the data Format the constant complex coefficients i...

Page 253: ...within each 64 bit chunk from the two sources the 64 bit result from each computation is written to the destination register Like the integer ADD SUB instruction PADDQ PSUBQ can operate on either unsi...

Page 254: ...t loads a 64 bit or 128 bit operand for example movq MM0 m64 the register memory form of any SIMD integer instruction that operates on a quadword or double quadword memory operand for example pmaddw M...

Page 255: ...words are stored and then words or doublewords are read from the same area of memory When you change the code sequence as shown in Example 4 25 the processor can access the data without delay Example...

Page 256: ...when doublewords or words are stored and then words or bytes are read from the same area of memory When you change the code sequence as shown in Example 4 27 the processor can access the data without...

Page 257: ...this situation is when each line in a video frame is averaged by shifting horizontally half a pixel Example 4 28 shows a common operation in video processing that loads data from memory address not a...

Page 258: ...ation dependent one or more LDDQU is designed for programming usage of loading data from memory without storing modified data back to the same address Thus the usage of LDDQU should be restricted to s...

Page 259: ...less the same from a memory bandwidth perspective However using many smaller loads consumes more microarchitectural resources than fewer larger stores Consuming too many of these resources can cause t...

Page 260: ...iciency of the bus transactions By aligning the stores to the size of the stores you eliminate the possibility of crossing a cache line boundary and the stores will not be split into separate transact...

Page 261: ...he 64 bit integer counterpart Extension of the pshufw instruction shuffle word across 64 bit integer operand across a full 128 bit operand is emulated by a combination of the following instructions ps...

Page 262: ...tion latency XMM registers can provide twice the space to store data for in flight execution Wider XMM registers can facilitate loop unrolling or in reducing loop overhead by halving the number of loo...

Page 263: ...ating point code containing SIMD floating point instructions Generally it is important to understand and balance port utilization to create efficient SIMD floating point code The basic rules and sugge...

Page 264: ...oint instructions to achieve optimum performance gain requires programmers to consider several issues In general when choosing candidates for optimization look for code segments that are computational...

Page 265: ...64 bit SIMD integer code Scalar Floating point Code There are SIMD floating point instructions that operate only on the least significant operand in the SIMD register These instructions are known as s...

Page 266: ...tion Data Arrangement Because the SSE and SSE2 incorporate a SIMD architecture arranging the data to fully use the SIMD registers produces optimum performance This implies contiguous data for processi...

Page 267: ...ng of parallel data elements i e the destination of each element is the result of a common arithmetic operation of the input operands in the same vertical position This is shown in the diagram below T...

Page 268: ...it is common to use only a subset of the vector components for example in 3D graphics the W component is sometimes ignored This means that for single vector operations 1 of 4 computation slots is not...

Page 269: ...arithmetic operation Vertical computation takes advantage of the inherent parallelism in 3D geometry processing of vertices It assigns the computation of four vertices to the four compute slots of th...

Page 270: ...Operation Example 5 1 Pseudocode for Horizontal xyz AoS Computation mulps x x y y z z movaps reg reg move since next steps overwrite shufps get b a d c from a b c d addps get a b a b c d c d movaps r...

Page 271: ...the SIMD registers This operation is referred to as swizzling operation and the reverse operation is referred to as deswizzling Data Swizzling Swizzling data from one format to another may be required...

Page 272: ...shuffle the yyyy by using another shuffle The zzzz is derived the same way but only requires one shuffle Example 5 3 illustrates the swizzle function Example 5 3 Swizzling Data typedef struct _VERTEX_...

Page 273: ...m0 0xDD xmm6 y1 y2 y3 y4 Y movlps xmm2 ecx 8 xmm2 w1 z1 movhps xmm2 ecx 24 xmm2 w2 z2 w1 z1 movlps xmm1 ecx 40 xmm1 w3 z3 movhps xmm1 ecx 56 xmm1 w4 z4 w3 z3 movaps xmm0 xmm2 xmm0 w1 z1 w1 z1 shufps x...

Page 274: ..._mm_loadh_pi x __m64 stride char in y _mm_loadl_pi y __m64 2 stride char in y _mm_loadh_pi y __m64 3 stride char in tmp _mm_shuffle_ps x y _MM_SHUFFLE 2 0 2 0 y _mm_shuffle_ps x y _MM_SHUFFLE 3 1 3 1...

Page 275: ...registers The same situation can occur for the above movhps movlps shufps sequence Since each movhps movlps instruction bypasses part of the destination register the instruction cannot execute until t...

Page 276: ...mple 5 5 illustrates the deswizzle function Example 5 5 Deswizzling Single Precision SIMD Data void deswizzle_asm Vertex_soa in Vertex_aos out __asm mov ecx in load structure addresses mov edx out mov...

Page 277: ...cklps xmm5 xmm4 xmm5 z1 w1 z2 w2 unpckhps xmm0 xmm4 xmm0 z3 w3 z4 w4 movlps edx 8 xmm5 v1 x1 y1 z1 w1 movhps edx 24 xmm5 v2 x2 y2 z2 w2 movlps edx 40 xmm0 v3 x3 y3 z3 w3 movhps edx 56 xmm0 v4 x4 y4 z4...

Page 278: ...mm2 xmm7 0xDD xmm2 r4 g4 b4 a4 shufps xmm1 xmm3 0x88 xmm4 r1 g1 b1 a1 shufps xmm5 xmm3 0x88 xmm5 r2 g2 b2 a2 shufps xmm6 xmm7 0xDD xmm6 r3 g3 b3 a3 movaps edx xmm4 v1 r1 g1 b1 a1 movaps edx 16 xmm5 v2...

Page 279: ...zled into AoS layout uv for the graphic cards to process you can use either the SSE or MMX technology code Using the MMX instructions allows you to conserve XMM registers for other computational tasks...

Page 280: ...r parts of each register This allows you to use a vertical add With the resulting partial horizontal summation full summation follows easily Figure 5 3 schematically presents horizontal add using movh...

Page 281: ...2 A3 A4 xmm0 movaps xmm1 ecx 16 load B1 B2 B3 B4 xmm1 movaps xmm2 ecx 32 load C1 C2 C3 C4 xmm2 movaps xmm3 ecx 48 load D1 D2 D3 D4 xmm3 continued A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4 A1 A3...

Page 282: ...2 B4 movaps xmm4 xmm2 movlhps xmm2 xmm3 xmm2 C1 C2 D1 D2 movhlps xmm3 xmm4 xmm3 C3 C4 D3 D4 addps xmm3 xmm2 xmm3 C1 C3 C2 C4 D1 D3 D2 D4 movaps xmm6 xmm3 xmm6 C1 C3 C2 C4 D1 D3 D2 D4 shufps xmm3 xmm5...

Page 283: ...4 __m128 tmm0 tmm1 tmm2 tmm3 tmm4 tmm5 tmm6 Temporary variables tmm0 _mm_load_ps in x tmm0 A1 A2 A3 A4 tmm1 _mm_load_ps in y tmm1 B1 B2 B3 B4 tmm2 _mm_load_ps in z tmm2 C1 C2 C3 C4 tmm3 _mm_load_ps in...

Page 284: ...tions in Chapter 2 SIMD Floating point Programming Using SSE3 SSE3 enhances SSE and SSE2 with 9 instructions targeted for SIMD floating point programming In contrast to many SSE and SSE2 instructions...

Page 285: ...of complex numbers For example a complex number can be stored in a structure consisting of its real and imaginary part This naturally leads to the use of an array of structure Example 5 11 demonstrate...

Page 286: ...perform complex arithmetic on double precision complex numbers In the case of double precision complex arithmetic multiplication or division is done on one pair of complex numbers at a time Example...

Page 287: ...d0 b0c0 shufps xmm1 xmm1 b1 reorder the real and imaginary parts c1 d1 c0 d0 movsldup xmm2 Src1 load the real parts into the destination a1 a1 a0 a0 mulps xmm2 xmm1 temp results a1c1 a1d1 a0c0 a0d0 ad...

Page 288: ...stored as AOS The use of HADDPS adds more flexibility to use SIMD instructions and eliminates the need to insert data swizzling and deswizzling code sequences Example 5 13 Calculating Dot Products fr...

Page 289: ...um M processor depends on several factors Generally code that is decoder bound and or has a mixture of integer and packed floating point instructions can expect significant gain Code that is limited b...

Page 290: ...uctions on Intel Core Solo and Intel Core Duo processors This is because scalar SSE2 instructions can be dispatched through two ports and executed using two separate floating point units Packed horizo...

Page 291: ...ed can be fetched from the processor caches or if memory traffic can take advantage of hardware prefetching effectively Standard techniques to bring data into the processor before it is needed involve...

Page 292: ...hardware prefetcher s ability to prefetch data that are accessed in a regular pattern with access strides that are substantially smaller than half of the trigger distance of the hardware prefetch see...

Page 293: ...etchnta Balance single pass versus multi pass execution An algorithm can use single or multi pass execution defined as follows single pass or unlayered execution passes a single data element through a...

Page 294: ...est performance software prefetch instructions must be interspersed with other computational instructions in the instruction sequence rather than clustered together Hardware Prefetching of Data The Pe...

Page 295: ...ams in the backward direction The hardware prefetcher of Intel Core Solo processor can track 16 forward streams and 4 backward streams On the Intel Core Duo processor the hardware prefetcher in each c...

Page 296: ...s pattern to suit the automatic hardware prefetch mechanism Software Data Prefetch The prefetch instruction can hide the latency of data access in performance critical sections of application code by...

Page 297: ...g cache pollution and by using the caches and memory efficiently This is particularly important for applications that share critical system resources such as the memory bus See an example in the Video...

Page 298: ...ium 4 processor prefetcht1 Identical to prefetcht0 prefetcht2 Identical to prefetcht0 Prefetch and Load Instructions The Pentium 4 processor has a decoupled execution and memory architecture that allo...

Page 299: ...of the advantages may change in the future In addition there are cases where a prefetch instruction will not perform the data prefetch These include the prefetch causes a DTLB Data Translation Lookasi...

Page 300: ...cacheable and not write allocating stored data is written around the cache and will not generate a read for ownership bus request for the corresponding cache line Fencing Because streaming stores are...

Page 301: ...memory type for the region is retained If the programmer specifies the weakly ordered uncacheable memory type of Write Combining WC then the non temporal store and the region have the same semantics a...

Page 302: ...Appropriate use of synchronization and a fencing operation see The fence Instructions later in this chapter must be performed for producer consumer usage models Fencing ensures that all system agents...

Page 303: ...etween processors Within a single processor system the CPU can also re read the same memory location and be assured of coherence that is a single consistent view of this memory location the same is tr...

Page 304: ...eaming store Streaming Store Instruction Descriptions The movntq movntdq non temporal store of packed integer in an MMX technology or Streaming SIMD Extensions register instructions store data from a...

Page 305: ...nce store fence instruction makes it possible for every store instruction that precedes the sfence instruction in program order to be globally visible before any store instruction that follows the sfe...
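
The ordering guarantee described in the Page 305 excerpt can be illustrated with a minimal C sketch. It assumes an SSE-capable x86 compiler, 16-byte-aligned buffers, and an element count that is a multiple of four; the function name stream_copy_floats is hypothetical and does not come from the manual. On other targets the sketch falls back to ordinary stores.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_sfence */

/* Copy n floats using non-temporal (streaming) stores, then fence.
 * Assumptions: dst and src are 16-byte aligned, n is a multiple of 4. */
void stream_copy_floats(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    }
    /* Streaming stores are weakly ordered; sfence makes them globally
     * visible before any store that follows in program order. */
    _mm_sfence();
}
#else
/* Fallback for targets without SSE: ordinary stores, no fence issued. */
void stream_copy_floats(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
#endif
```

A producer would typically call such a routine before setting a "data ready" flag, so that a consumer observing the flag also observes the copied data.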

Page 306: ...e instruction provides a means of segregating certain load instructions from other loads The mfence Instruction The mfence memory fence instruction makes it possible for every load and store instructi...

Page 307: ...ssion checking and faults associated with a byte load clflush is an unordered operation with respect to other memory traffic including other clflush instructions Software should use a mfence memory fe...

Page 308: ...odes in the cache hierarchy The software controlled prefetch is not intended for prefetching code Using it can incur significant penalties on a multiprocessor system when code is shared Software prefe...

Page 309: ...automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes Start up penalty before hardware pr...

Page 310: ...l array to alter the concentration of small stride cache misses at the expense of large stride cache misses to take advantage of the automatic hardware prefetcher Example of Effective Latency Reductio...

Page 311: ...ister char p char next Populating pArray for circular pointer chasing with constant access stride p char p loads a value pointing to next load p char pArray for i 0 i aperture i stride p char pArray i...

Page 312: ...gures show two separate pipelines an execution pipeline and a memory pipeline front side bus Since the Pentium 4 processor similarly to the Pentium II and Pentium III processors completely decouples t...

Page 313: ...s Latency and Execution Without Prefetch. Figure 6-3 Memory Access Latency and Execution With Prefetch [diagram labels omitted]...

Page 314: ...in the pipelines and thus the best possible performance can be achieved Prefetching is useful for inner loops that have heavy computations or are close to the boundary between being compute bound and...

Page 315: ...d Since prefetch distance is not a well defined metric for this discussion we define a new term prefetch scheduling distance PSD which is represented by the number of iterations For large loops prefet...

Page 316: ...3D vertices in strip format is used as an example A strip contains a list of vertices whose predefined vertex order forms contiguous triangles It can be easily observed that the memory pipe is de pipe...

Page 317: ...prefetch insertion the loads from the first iteration of an inner loop can miss the cache and stall the execution pipeline waiting for data returned thus degrading the performance In the code of Exam...

Page 318: ...suffers any memory access latency penalty assuming the computation time is larger than the memory latency Inserting a prefetch of the first data element needed prior to entering the nested loop comput...

Page 319: ...number of prefetches required Figure 6 4 presents a code example which implements prefetch and unrolls the loop to remove the redundant prefetch instructions whose prefetch addresses hit the previous...

Page 320: ...tore streams Each load and store stream accesses one 128 byte cache line each per iteration 2 The amount of computation per loop This is varied by increasing the number of dependent arithmetic operati...

Page 321: ...ive loop latency; % of Bus Utilized; series: 16, 32, 64, 128, none; Bus Utilization; One load and one store stream [chart axis values omitted]...

Page 322: ...ed together If possible they should also be placed apart from loads This improves the instruction level parallelism and reduces the potential instruction resource stalls In addition this mixing reduce...

Page 323: ...x+18128]
prefetchnta [ebx+19128]
prefetchnta [ebx+20128]
movps xmm1, [ebx]
addps xmm2, [ebx+3000]
mulps xmm3, [ebx+4000]
addps xmm1, [ebx+1000]
addps xmm2, [ebx+3016]
mulps xmm1, [ebx+2000]
mulps xmm1, xmm2
add ebx, 128
cmp e...

Page 324: ...mensions can be applied for a better memory performance If an application uses a large data set that can be reused across multiple passes of a loop it will benefit from strip mining data sets larger t...

Page 325: ...by pass m 1 requiring data re fetch into the first level cache and perhaps the second level cache if a later pass reuses the data If both data sets fit into the second level cache load operations in p...

Page 326: ...l cache pollution Use prefetchnta if the data is only touched once during the entire execution pass in order to minimize cache pollution in the higher level caches This provides instant availability a...

Page 327: ...data access patterns of a 3D geometry engine first without strip mining and then incorporating strip mining Note that 4 wide SIMD instructions of Pentium III processor can process 4 vertices per ever...

Page 328: ...y of second level cache during the strip mined transformation loop and reused in the lighting loop Keeping data in the cache reduces both bus traffic and the number of prefetches used Example 6 8 Data...

Page 329: ...iple times and some of the read once memory references An example of the situations of read once memory references can be illustrated with a matrix or image transpose reading from a column first orien...

Page 330: ...t level cache Maximizing the width of each tile for memory read Example 6 9 Using HW Prefetch to Improve Read Once Memory Traffic a Un optimized image transpose dest and src represent two dimensional...

Page 331: ...ten easier to use when implementing a general purpose API where the choice of code paths that can be taken depends on the specific combination of features selected by the application for example for 3...

Page 332: ...n be better suited to applications which limit the number of features that may be used at a given time A single pass approach can reduce the amount of data copying that can occur with a multi pass eng...

Page 333: ...isturbing the cache hierarchy To manage which data structures remain in the cache and which are transient Detailed implementations of these usage models are covered in the following sections Non tempo...

Page 334: ...ocessor Although the SWWC method requires explicit instructions for performing temporary writes and reads this ensures that the transaction on the front side bus causes line transaction rather than se...

Page 335: ...prefetchnta instruction brings data into only one way of the second level cache thus reducing pollution of the second level cache If the data brought directly to second level cache is not re used then...

Page 336: ...lication does not gain performance significantly from having data ready from prefetches it can improve from more efficient use of the second level cache and memory Such design reduces the encoder s de...

Page 337: ...improve performance of the translation of a virtual memory address to a physical memory address by providing fast access to page table entries If memory pages are accessed and the page table entry is...

Page 338: ...e the cost of the loop 2 Loads the data into an xmm register using the _mm_load_ps intrinsic 3 Transfers the 8 byte data to a different memory location via the _mm_stream intrinsics bypassing the cach...

Page 339: ...py 128 byte per loop
for (j = kk; j < kk + NUMPERPAGE; j += 16) {
  _mm_stream_ps((float *)&b[j], _mm_load_ps((float *)&a[j]));
  _mm_stream_ps((float *)&b[j + 2], _mm_load_ps((float *)&a[j + 2]));
  _mm_stream_ps((float *)&b[j + 4], _mm_load_ps((float *)&a[j + 4]));
  _mm...

Page 340: ...se the implementation of streaming stores on Pentium 4 processor writes data directly to memory maintaining cache coherency Using 16 byte Streaming Stores and Hardware Prefetch An alternate technique...

Page 341: ...m1, [esi+ecx+16]
movdqa xmm2, [esi+ecx+32]
movdqa xmm3, [esi+ecx+48]
movdqa xmm4, [esi+ecx+64]
movdqa xmm5, [esi+ecx+16+64]
movdqa xmm6, [esi+ecx+32+64]
movdqa xmm7, [esi+ecx+48+64]
movntdq [edi+ecx], xmm0
movntdq [edi+ecx+16...

Page 342: ...or A comparison of the two coding techniques discussed above and two un optimized techniques is shown in Table 6 2 add esi ecx add edi ecx sub edx ecx jnz main_loop sfence Table 6 2 Relative Performan...

Page 343: ...tores. Increases in bus speed are the primary contributor to throughput improvements. The technique shown in Example 6-12 will likely take advantage of the faster bus speed in the platform more efficient...

Page 344: ...and single core processors. Table 6-3 Deterministic Cache Parameters Leaf
Bit Location | Name | Meaning
EAX[4:0] | Cache Type | 0 = Null (no more caches); 1 = Data Cache; 2 = Instruction Cache; 3 = Unified Cache; 4-31 = Reserv...

Page 345: ...r execution environments including processors supporting Hyper Threading Technology HT or multi core processors The basic technique is to place an upper limit of the blocksize to be less than the size...

Page 346: ...oftware should not assume that the coherency line size is the prefetch stride If this field is zero then software should assume a default size of 64 bytes is the prefetch stride Software will use the...

Page 347: ...package and or logical processor per core Therefore there are some aspects of multi threading optimization that apply across MP multi core and Hyper Threading Technology There are also some specific...

Page 348: ...mount of parallelism in the control flow of the workload Two common usage models are multithreaded applications multitasking using single threaded applications Multithreading When an application emplo...

Page 349: ...e threads on an MP system with N physical processors over single-threaded execution can be expressed as (equation missing from this excerpt), where P is the fraction of workload that can be parallelized and O represents the overhead of m...
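
The expression itself did not survive in the Page 349 excerpt. A form of Amdahl's law that is consistent with the stated definitions of P, N, and O is sketched below. The left-hand label and the exact placement of the overhead term are assumptions, not text recovered from the manual.

```latex
\[
\frac{T_{\text{sequential}}}{T_{\text{parallel}}}
  = \frac{1}{(1 - P) + \dfrac{P}{N} + O}
\]
% Worked example under these assumptions: P = 0.5, N = 2, O = 0 gives
% 1 / (0.5 + 0.25 + 0) = 1.33, i.e. about a 33% improvement.
```

Under this reading, any non-zero overhead O lowers the achievable improvement, and the serial fraction (1 - P) bounds it no matter how large N becomes.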

Page 350: ...rmance scaling In addition to maximizing the parallelism of control flows interaction between threads in the form of thread synchronization and imbalance of task scheduling can also impact overall pro...

Page 351: ...tasking workload however bus activities and cache access patterns are likely to affect the scaling of the throughput Running two copies of the same application or same suite of applications in a lock...

Page 352: ...by large degrees of parallelism or minimal dependencies in the following areas workload thread interaction hardware utilization The key to maximizing workload parallelism is to identify multiple tasks...

Page 353: ...ere threads can be created to handle the multiplication of half of matrix with the multiplier matrix Domain Decomposition is a programming model based on creating identical or similar threads to proce...

Page 354: ...essor supporting Hyper Threading Technology Specialized Programming Models Intel Core Duo processor offers a second level cache shared by two processor cores in the same physical package This provides...

Page 355: ...of single threaded execution of a sequence of task units where each task unit either the producer or consumer executes serially shown in Figure 7 2 In the equivalent scenario under multi threaded exec...

Page 356: ...ion overhead The decimal number in the parenthesis represents a buffer index On an Intel Core Duo processor the producer thread can store data in the second level cache to allow the consumer thread to...

Page 357: ...same buffer As each task completes one thread signals to the other thread notifying its Example 7 2 Basic Structure of Implementing Producer Consumer Threads a Basic structure of a producer thread fu...

Page 358: ...the first or second level cache of the same core the consumer can access them without incurring bus traffic The scheduling of the interlaced producer consumer model is shown in Figure 7 4 Example 7 3...

Page 359: ...ffer index unsigned int iter_num workamount 1 unsigned int i 0 iter_num master workamount 1 if master master thread starts the first iteration produce buffs mode count Signal sigp 1 mode 1 notify the...

Page 360: ...parallelism in applications OpenMP supports directive based processing This uses special preprocessors or modified compilers to interpret parallelism expressed in Fortran comments or C C pragmas Bene...

Page 361: ...ind threading errors e g data races stalls and deadlocks and reduce the amount of time spent debugging threaded applications Intel Thread Checker product is an Intel VTune Performance Analyzer plug in...

Page 362: ...aling due to Hyper Threading Technology Techniques that apply to only one environment are noted Key Practices of Thread Synchronization Key practices for minimizing the cost of thread synchronization...

Page 363: ...rk Excessive use of software prefetches can significantly and unnecessarily increase bus utilization if used inappropriately Consider using overlapping multiple back to back memory reads to improve ef...

Page 364: ...ssors that support Hyper Threading Technology are Avoid Excessive Loop Unrolling to ensure the Trace Cache is operating efficiently Optimize code size to improve locality of Trace Cache and increase d...

Page 365: ...cation and threading domain The purpose of including high medium and low impact ranking with each recommendation is to provide a relative indicator as to the degree of performance gain that can be exp...

Page 366: ...tation of MONITOR and MWAIT, these instructions are available to the operating system so that it can optimize thread synchronization in different areas. For example, an operating system can use...

Page 367: ...es of three categories of synchronization objects available to multi threaded applications Table 7 1 Properties of Synchronization Objects Characteristics Operating System Synchronization Objects Ligh...

Page 368: ...er with each read request being allocated a buffer resource On detection of a write by a worker thread to a load that is in progress Miscellaneous Some objects provide intra process synchronization an...

Page 369: ...tel Core Solo and Intel Core Duo processors However the penalty on these processors is small compared with penalties suffered on the Pentium 4 and Intel Xeon processors There the performance penalty f...

Page 370: ...ng thread and the worker thread do _asm pause Ensure this loop is de pipelined i e preventing more than one load request to sync_var to be outstanding avoiding performance penalty when the worker thre...

Page 371: ...loop is shown in Example 7 4 b The PAUSE instruction is compatible with all IA 32 processors On IA 32 processors prior to Intel NetBurst microarchitecture the PAUSE instruction is essentially a NOP in...
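
A spin-wait loop that issues PAUSE on each iteration might look like the following C sketch. It is not the manual's Example 7-4. The names spin_wait_until and CPU_RELAX are hypothetical, C11 atomics are an assumed substitute for the inline assembly in the original, and the PAUSE intrinsic is used only when the compiler targets SSE2-capable x86.

```c
#include <stdatomic.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* _mm_pause */
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0)   /* assumption: no PAUSE equivalent used here */
#endif

/* Spin until *sync_var equals expected.
 * Returns the number of loop iterations spent waiting. */
unsigned long spin_wait_until(atomic_int *sync_var, int expected)
{
    unsigned long spins = 0;
    while (atomic_load_explicit(sync_var, memory_order_acquire) != expected) {
        CPU_RELAX();  /* hint to the processor that this is a spin-wait loop */
        spins++;
    }
    return spins;
}
```

For waits that are not expected to end quickly, a thread-blocking OS API is usually the better choice than extending a loop like this one.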

Page 372: ...red by two threads there is no need to use a lock Synchronization for Longer Periods When using a spin wait loop not expected to be released quickly an application should follow these guidelines Keep...

Page 373: ...rs Furthermore using a thread blocking API also benefits from the system idle loop optimization that OS implements using the HLT instruction User Source Coding Rule 23 H impact M generality Use a thre...

Page 374: ...of thread synchronization Because the control thread still behaves like a fast spinning loop the only runnable worker thread must share execution resources with the spin wait loop if both are running...

Page 375: ...to release the processor incorrectly. It experiences a performance penalty if the only worker thread and the control thread run on the same physical processor package. Only one worker thread is runnin...

Page 376: ...Microarchitecture is when thread private data and a thread synchronization variable are located within the line size boundary 64 bytes or sector boundary 128 bytes When one thread modifies the synchr...

Page 377: ...should be allocated to reside alone in a 128 byte region and aligned to a 128 byte boundary Example 7 6 shows a way to minimize the bus traffic required to maintain cache coherency in MP systems This...
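
One way to give a synchronization variable its own 128-byte region, aligned to a 128-byte boundary, is sketched below in C11. The type name padded_sync_t, the macro SYNC_REGION_BYTES, and the helper is_region_aligned are hypothetical and are not taken from the manual's Example 7-6.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SYNC_REGION_BYTES 128   /* sector size named in the text */

/* The synchronization variable sits alone in a 128-byte, 128-byte-aligned
 * region, so thread-private data cannot share its line or sector. */
typedef struct {
    alignas(SYNC_REGION_BYTES) volatile int sync_var;
    char pad[SYNC_REGION_BYTES - sizeof(int)];
} padded_sync_t;

_Static_assert(sizeof(padded_sync_t) == SYNC_REGION_BYTES,
               "sync variable must occupy exactly one 128-byte region");

/* Returns 1 if p lies on a 128-byte boundary, 0 otherwise. */
int is_region_aligned(const void *p)
{
    return ((uintptr_t)p % SYNC_REGION_BYTES) == 0;
}
```

Because the structure's size equals its alignment, every element of an array of padded_sync_t also starts on its own 128-byte boundary.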

Page 378: ...variables of different types in data structures because the layout that compilers give to data variables might be different than their placement in the source code When each thread needs to use its ow...

Page 379: ...rformance impact due to data traffic fetched from memory depends on the characteristics of the workload and the degree of software optimization on memory access locality enhancements implemented in the s...

Page 380: ...n fetches from system memory User Source Coding Rule 27 M impact H generality Improve data and code locality to conserve bus command bandwidth Using a compiler that supports profiler guided optimizati...

Page 381: ...ly one thread is using the second level cache and or bus then it is expected to get the maximum benefit of the cache and bus systems because the other core does not interfere with the progress of the...

Page 382: ...processor core It also consumes critical execution resources and can result in stalled execution The guidelines for using software prefetch instructions are described in Chapter 2 The techniques of us...

Page 383: ...Reduction with H W Prefetch in Chapter 6 User Source Coding Rule 30 M impact M generality Consider adjusting the sequencing of memory references such that the distribution of distances of successive c...

Page 384: ...imiza tion Efficient operation of caches needs to address the following cache blocking shared memory optimization eliminating 64 K Aliased data accesses preventing excessive evictions in first level c...

Page 385: ...e This technique can also be applied to single threaded applications that will be used as part of a multitasking workload User Source Coding Rule 32 H impact H generality Use cache blocking to improve...
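
Cache blocking can be sketched in C as running every pass over one block of data before moving to the next block, so that later passes find the block still in cache. The two passes below are hypothetical stand-ins for real work, and the block size is left to the caller. Choosing it relative to the cache size is the tuning step the text refers to.

```c
#include <stddef.h>

/* Two hypothetical passes standing in for real per-element work. */
static void pass_scale(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) v[i] *= 2.0f;
}

static void pass_offset(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) v[i] += 1.0f;
}

/* Unblocked: each pass walks the whole array, so a large array is
 * evicted from the cache between passes. */
void process_unblocked(float *v, size_t n)
{
    pass_scale(v, n);
    pass_offset(v, n);
}

/* Blocked: both passes run on one block before the next block starts.
 * block_elems is the number of elements per block (0 means one block). */
void process_blocked(float *v, size_t n, size_t block_elems)
{
    if (block_elems == 0) block_elems = n ? n : 1;
    for (size_t start = 0; start < n; start += block_elems) {
        size_t len = (n - start < block_elems) ? (n - start) : block_elems;
        pass_scale(v + start, len);
        pass_offset(v + start, len);
    }
}
```

Both functions compute the same result. Only the order of memory accesses differs.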

Page 386: ...d level cache. On an Intel Core Duo processor, when the work buffers are small enough to fit within the first-level cache, re-ordering of producer and consumer tasks is necessary to achieve optimal...

Page 387: ...1 for mode1 0 mode1 batchsize mode1 produce buffs mode1 count while iter_num Signal signal1 1 produce buffs mode1 count placeholder function WaitForSignal end1 mode1 if mode1 batchsize mode1 0 void co...

Page 388: ...ule 34 H impact H generality Minimize data access patterns that are offset by multiples of 64 KB in each thread The presence of 64 KB aliased data access can be detected using Pentium 4 processor perf...

Page 389: ...improves the temporal locality of the first level data cache Multithreaded applications need to prevent unnecessary evictions in the first level data cache when Multiple threads within an application...

Page 390: ...in an application is to add an equal amount of stack offset each time a new thread is created in a thread pool. Example 7-9 shows a code fragment that implements per-thread stack offset for three thr...

Page 391: ...r thread stack offset to create three threads Code for the thread function Stack accesses within descendant functions do_foo1 do_foo2 are less likely to cause data cache evictions because of the stack...

Page 392: ...on is to allow each application instance to add a suitable linear address offset for its stack. Once this offset is added at start-up, a buffer of linear addresses is established even when two copies of the...

Page 393: ...hat is also a multiple of the reference offset or 128 bytes Usually this per instance pseudo random offset can be less than 7 KB Example 7 10 provides a code fragment for adjusting the stack pointer i...

Page 394: ...rocessor For dual core processors where the second level unified cache is shared by two processor cores e g Intel Core Duo processor multi threaded software should consider the increase in code workin...

Page 395: ...sors multithreaded applications should improve code locality of frequently executed sections and target one half of the size of Trace Cache for each application thread when considering code size optim...

Page 396: ...cation for thread scheduling is represented by a bit in an OS construct commonly referred to as an affinity mask. Software can use an affinity mask to control the binding of a software thread to a spe...

Page 397: ...stemAffinity is a bitmask of all the processors started by the OS Use OS specific APIs to obtain it ThreadAffinityMask is used to affinitize the topology enumeration thread to each processor using OS...

Page 398: ...o manage thread affinity binding is shown in Example 7 12 The example shows an implementation of building a lookup table so that the sequence of thread scheduling is mapped to an array of affinity mas...

Page 399: ...logy The 3 level hierarchy and relationships between the initial APIC_ID and affinity mask can also be used to manage such a topology Example 7 13 illustrates the steps of discovering sibling logical...

Page 400: ...ecutively to different core while j NumStartedLPs Determine the first LP in each core if apic_conf j smt This is the first LP in a core supporting HT LuT index apic_conf j affinitymask j Now the we ha...

Page 401: ...ular boundary of those logical processors sharing the target level cache may coincide with core boundary or above core boundary ThreadAffinityMask 1 ProcessorNum 0 while ThreadAffinityMask 0 ThreadAff...

Page 402: ...mask of logical processors sharing the same target level cache these are logical processors with the same Cache_ID The algorithm below assumes there is symmetry across the modular boundary of target c...

Page 403: ...CacheProcessorMask i ProcessorMask Break Found in existing bucket skip to next iteration if i CacheNum Cache_ID did not match any bucket start new bucket CacheIDBucket i CacheID ProcessorNum CachePro...

Page 404: ...ache topology is available In general optimizing the building blocks of a multi threaded application can start from an individual thread The guidelines discussed in Chapters 2 through 6 largely apply...

Page 405: ...g shared resource effectively can deliver positive benefit in processor scaling if the workload does saturate the critical resource in contention Tuning Suggestion 5 M Impact L Generality Schedule thr...

Page 406: ...suming about one third of peak retirement bandwidth Significant portions of the issue port bandwidth are left unused Thus optimizing single thread performance usually can be complementary with optimiz...

Page 407: ...machine clear condition occurs all instructions that are in flight at various stages of processing in the pipeline must be resolved and then they are either retired or cancelled While the pipeline is...

Page 408: ...of both threads count toward the limit of four write combining buffers For example if an inner loop that writes to three separate areas of memory per iteration is run by two threads simultaneously th...

Page 409: ...Instructions When The Data Size Is 32 Bits 64 bit mode makes 16 general purpose 64 bit registers available to applications If application data size is 32 bits however there is no need to use 64 bit re...

Page 410: ...x is necessary Using 8 additional registers can prevent the compiler from needing to spill values onto the stack Note that the potential increase in code size due to the REX prefix can increase cache...

Page 411: ...is 32 bits the upper 32 bits must be zeroed Zeroing the upper 32 bits requires an extra uop and is less optimal than sign extending to 64 bits While sign extending to 64 bits makes the instruction one...

Page 412: ...r 64 Bit Mode Use 64 Bit Registers Instead of Two 32 Bit Registers for 64 Bit Arithmetic Legacy 32 bit mode offers the ability to support extended precision integer arithmetic such as 64 bit arithmeti...

Page 413: ...t require a 64-bit result. To add two 64-bit numbers in 32-bit legacy mode, the add instruction followed by the adc instruction is used. For example, to add two 64-bit variables (X and Y), the following fou...
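
The add-then-add-with-carry sequence can be modelled in C by splitting each 64-bit value into two 32-bit halves. This is an illustrative sketch, not the manual's four-instruction listing. The type u64_pair and both function names are hypothetical.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as in 32-bit legacy mode. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Add two 64-bit values using only 32-bit arithmetic. */
u64_pair add64_with_carry(u64_pair x, u64_pair y)
{
    u64_pair r;
    r.lo = x.lo + y.lo;               /* like ADD: wraps modulo 2^32 */
    uint32_t carry = (r.lo < x.lo);   /* carry out of the low half */
    r.hi = x.hi + y.hi + carry;       /* like ADC: adds the carry in */
    return r;
}

/* With 64-bit registers the same sum is a single addition. */
uint64_t add64_native(uint64_t x, uint64_t y)
{
    return x + y;
}
```

The single-operation form is what the 64-bit-mode guidance recommends whenever a 64-bit result is needed.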

Page 414: ...result in a microcode flow from the microcode ROM and takes longer to execute. In most cases, the 32-bit versions of CVTSI2SS and CVTSI2SD are sufficient. Assembly/Compiler Coding Rule: Use the 32-bit vers...

Page 415: ...s referred to as static power ACPI 3 0 ACPI stands for Advanced Configuration and Power Interface provides a standard that enables intelligent power management and consumption It does this by allowing...

Page 416: ...ut of higher numbered C states have longer latency Mobile Usage Scenarios In mobile usage models heavy loads occur in bursts while working on battery power Most productivity web and streaming workload...

Page 417: ...the OS sets the processor to a lower frequency P1 3 The processor decreases frequency and processor utilization increases to the most effective level 80 90 of the highest possible frequency The same a...

Page 418: ...n and the processor enters a halted state in which it waits for the next interrupt The interrupt may be a timer interrupt that comes every 10 ms or an interrupt that signals an event As shown in the i...

Page 419: ...wer savings over the C1 state The main improvements are provided at the platform level C3 This level provides greater power savings than C1 or C2 In C3 the processor stops clock generating and snoopin...

Page 420: ...d Intel Core Duo processors provide additional processor-specific C-states and associated sub C-states that can be mapped to ACPI C3 state type. The processor-specific C-states and sub C-states are ac...

Page 421: ...life and adapt for mobile computing usage Adopt a power management scheme to provide just enough not the highest performance to achieve desired features or experiences Avoid using spin loops Reduce t...

Page 422: ...f logging activities Consolidating disk operations over time to prevent unnecessary spin up of the hard drive Reducing the amount or quality of visual animations Turning off or significantly reducing...

Page 423: ...ble. For example: When an application needs to wait for more than a few milliseconds, it should avoid using spin loops and use the Windows synchronization APIs, such as WaitForSingleObject. When an immedia...

Page 424: ...mpact performance and may provide additional power conservation Read ahead from CD DVD data and cache it in memory or hard disk to allow the DVD drive to stop spinning Switch off unused devices When d...

Page 425: ...he necessary information for an application to react appropriately Here are some examples of an application reaction to sleep mode transitions Saving state data prior to the sleep transition and resto...

Page 426: ...ssor needs to stay in P0 state (highest frequency) to deliver highest performance for 0.5 seconds and may then go to sleep for 0.5 seconds. The demand pattern then alternates. Thus the processor demand...

Page 427: ...be met by 50 performance at 100 duty cycle Because the slope of frequency scaling efficiency of most workloads will be less than one reducing the core frequency to 50 can achieve more than 50 of the...

Page 428: ...power consumption by moving to lower power C states Generally the longer a processor stays idle OS power management policy directs the processor into deeper low power C states After an application rea...

Page 429: ...The situation is significantly improved in the Intel Core Solo processor compared to previous generations of the Pentium M processors but polling will likely prevent the processor from entering into h...

Page 430: ...e physical processor to run at 60 of its maximum frequency However it may be possible to divide work equally between threads so that each of them require 50 of execution cycles As a result both cores...

Page 431: ...he issue of OS lacking multi core aware P state coordination policy Upgrade to an OS with multi core aware P state coordination policy Some newer OS releases may include multi core aware P state coord...

Page 432: ...serve battery life in multithreaded applications, a multi-threaded application should synchronize threads to work simultaneously and sleep simultaneously using OS synchronization primitives. By keeping...

Page 433: ...ution_Cycle and Unhalted_Ref_Cycles Changing sections of serialized code to execute into two parallel threads enables coordinated thread synchronization to achieve better power savings Although Serial...

Page 435: ...atures such as profile guided optimizations and other advanced optimizations This includes vectorization for MMX technology the Streaming SIMD Extensions SSE Streaming SIMD Extensions 2 SSE2 and the S...

Page 436: ...rtran Compiler is source compatible with Compaq Visual Fortran and also plugs into the Microsoft NET IDE In Linux environment the Intel C Compilers are binary compatible with the corresponding version...

Page 437: ...izations most of which are effective only in conjunction with processor specific optimizations described below All the command line options are described in the Intel C Compiler User s Guide Further a...

Page 438: ...s Qax extensions generates code specialized to processors which support the specified extensions but also generates generic IA 32 code The generic code usually executes slower than the specialized ve...

Page 439: ...to the Qx M K W B P or Qax M K W B P switches the compiler provides the following vectorization control switch options Qvec_report n Controls the vectorizer s diagnostic levels where n is either 0 1...

Page 440: ...pendencies This is activated by the compiler switch Qparallel Inline Expansion of Library Functions Oi Oi The compiler inlines a number of standard C C and math library functions by default This usual...

Page 441: ...ogram from your source code and special code from the compiler Each time this instrumented code is executed the compiler generates a dynamic information file When you compile a second time the dynamic...

Page 442: ...tel C Compiler User s Guide Intel VTune Performance Analyzer The Intel VTune Performance Analyzer is a powerful software profiling tool for Microsoft Windows and Linux The VTune analyzer helps you und...

Page 443: ...ter the sampling activity completes the VTune analyzer displays the data by process thread software module function or line of source There are two methods for generating samples Time based sampling a...

Page 444: ...the microprocessor as it executes software Some of the events that can be used to trigger sampling include clockticks cache misses and branch mispredictions The VTune analyzer indicates where micro a...

Page 445: ...effective throughput of instructions or μops at the retirement stage. For example, UPC measures μop throughput at the retirement stage, which can be compared to the peak retirement bandwidth of the microar...

Page 446: ...access pattern with very poor data parallelism and will fully expose memory latency whenever cache misses occur. Large Stride Inefficiency: Large-stride data accesses are much less efficient than smal...

Page 447: ...trumentation is the process of modifying a function so that performance data can be captured when the function is executed Instrumentation does not change the functionality of the program However it c...

Page 448: ...level performance bottlenecks It periodically polls software and hardware performance counters The performance counter data can help you understand how your application is impacting the performance of...

Page 449: ...sors Intel Integrated Performance Primitives for Linux and Windows IPP is a cross platform software library which provides a range of library functions for multimedia audio codecs video codecs for exa...

Page 450: ...d throughout this manual Examples include architecture specific tuning such as loop unrolling instruction pairing and scheduling and memory management with explicit and implicit data prefetching and c...

Page 451: ...word 4 double words single precision 4 single precision floating point and double precision 2 double precision floating point When a register is updated the new value appears in red The corresponding...

Page 452: ...tomatically locates threading errors As your program runs the Intel Thread Checker monitors memory accesses and other events and automatically detects situations which could cause unpredictable thread...

Page 453: ...execution etc Mountains of data are collapsed into relevant summaries sorted to identify parallel regions or loops that require attention Its intuitive color coded displays make it easy to assess you...

Page 454: ...e Training at http developer intel com software college CourseCatalog asp CatID web based For key algorithms and their optimization examples for the Pentium 4 processor refer to the application notes...

Page 455: ...ors Pentium 4 Processor Performance Metrics: The descriptions of the Intel Pentium 4 processor performance metrics use terminology that is specific to the Intel NetBurst microarchitecture and to the i...

Page 456: ...that were scheduled to execute along the mispredicted path must be cancelled. These instructions and μops are referred to as bogus instructions and bogus μops. A number of Pentium 4 processor performance...

Page 457: ...al with some event the machine takes an assist One example of such situation is an underflow condition in the input operands of a floating point operation The hardware must internally modify the forma...

Page 458: ...erformance monitoring related counters to be powered down The processor is asleep either as a result of being halted for a while or as part of a power management scheme Note that there are different l...

Page 459: ...abled Nominal CPI (timestamp counter ticks / instructions retired) measures the CPI over the entire duration of the program, including those periods the machine is halted while waiting for I/O. The distinct...
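
The nominal CPI ratio named in this snippet can be sketched in C. This is only an illustration: the function name `nominal_cpi` and the idea of passing in two raw counter readings taken over the same interval are assumptions made here, not something the manual prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the manual):
   nominal CPI = timestamp counter ticks / instructions retired.
   Both counts are assumed to cover the same measurement interval. */
static double nominal_cpi(uint64_t tsc_ticks, uint64_t instructions_retired)
{
    assert(instructions_retired > 0);   /* guard against division by zero */
    return (double)tsc_ticks / (double)instructions_retired;
}
```

For instance, 3000 timestamp ticks over 1000 retired instructions gives a nominal CPI of 3.0. Because the timestamp counter also advances while the machine is halted, this value includes idle periods.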

Page 460: ...e Note that this overrides any qualification e g by CPL specified in the ESCR Enable counting in the CCCR for that counter by setting the enable bit The counts produced by the Non halted and Non sleep...

Page 461: ...dependent period if there is no work for the processor to do Time Stamp Counter The time stamp counter increments whenever the sleep pin is not asserted or when the clock signal on the system bus is...

Page 462: ...eful metric for determining front end performance The metric that is most analogous to an instruction cache miss is a trace cache miss An unsuccessful lookup of the trace cache colloquially a miss is...

Page 463: ...level cache misses and writebacks also called core references result in references to the 2nd level cache The Bus Sequence Queue BSQ holds requests from the processor core or prefetcher that are to b...

Page 464: ...re B 1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus [diagram labels: Chip Set, System Memory, 1st Level Data Cache, 3rd Level Cache, FSB_IOQ, BSQ, Unified 2nd Level Cache]...

Page 465: ...t these bus and memory metrics are derived from The granularities of core references are listed below according to the performance monitoring events that are docu mented in Appendix A of the IA 32 Int...

Page 466: ...unt each partial individually Different transaction sizes The allocations of non partial programmatic load requests get a count of one per 128 bytes in the BSQ on current implementations and a count o...

Page 467: ..._cache_reference. The metrics are: 2nd Level Cache Read Misses, 2nd Level Cache Read References, 3rd Level Cache Read Misses, 3rd Level Cache Read References, 2nd Level Cache Reads Hit Shared, 2nd Level Cac...

Page 468: ...counted once per 64 byte line the size of a core reference This makes the event counts for read misses appear to have a 2 times overcounting with respect to read and write RFO hits and write RFO miss...

Page 469: ...nce adjacent sector prefetches have lower priority than demand fetches there is a high probability on a heavily utilized system that the adjacent sector prefetch will have to wait until the next bus a...

Page 470: ...t Hyper Threading Technology the performance counters and associated model specific registers MSRs are extended to support Hyper Threading Technology A subset of the performance monitoring events allo...

Page 471: ...cs and Tagging Mechanisms For event names that appear in this column refer to the IA 32 Intel Architecture Software Developer s Manual Volumes 3A 3B Column 4 specifies the event mask bit that is neede...

Page 472: ...structions Retired Non bogus IA 32 instructions executed to completion May count more than once for some instructions with complex uop flow and were interrupted before retirement The count may vary de...

Page 473: ...g Replay_event set the following replay tag Tagged_mispred_ branch NBOGUS Mispredicted Branches Retired Mispredicted branch instructions executed to completion This stat is often used in a per instruc...

Page 474: ...INDIRECT Mispredicted calls All Mispredicted indirect calls retired_branch_type CALL Mispredicted conditionals The number of mispredicted branches that are conditional jumps retired_mispred_ branch_t...

Page 475: ...delivering traces associated with logical processor 0 regardless of the operating modes of the TDE for traces associated with logical processor 1 If a physical processor supports only one logical pro...

Page 476: ...pplicable only if a physical processor supports Hyper-Threading Technology and has two logical processors per package. TC_deliver_mode SS BS IS Logical Processor N In Deliver Mode Fraction of all non...

Page 477: ...rocessor 0 TC_deliver_mode BB BS BI Logical Processor 1 Build Mode The number of cycles that the trace and delivery engine TDE is building traces associated with logical processor 1 regardless of the...

Page 478: ...tc_ms_xfer CISC Speculative TC Built Uops The number of speculative uops originating when the TC is in build mode uop_queue_writes FROM_TC_BUILD Speculative TC Delivered Uops The number of speculativ...

Page 479: ...dercount when loads are spaced apart Replay_event set the following replay tag 2ndL_cache_load_ miss_retired NBOGUS DTLB Load Misses Retired The number of retired load ops that experienced DTLB misses...

Page 480: ...lit Load Replays The number of load references to data that spanned two cache lines Memory_complete LSC Split Loads Retired The number of retired load ops that spanned two cache lines Replay_event set...

Page 481: ...e number of 2nd level cache read references loads and RFOs Beware of granularity differences BSQ_cache_reference RD_2ndL_HITS RD_2ndL_HITE RD_2ndL_HITM RD_2ndL_MISS 3rd Level Cache Read Misses2 The nu...

Page 482: ...y differences BSQ_cache_reference RD_2ndL_HITM 2nd Level Cache Reads Hit Exclusive The number of 2nd level cache read references loads and RFOs that hit the cache line in exclusive state Beware of gra...

Page 483: ...ys Retired The number of retired load ops that experienced replays related to the MOB Replay_event set the following replay tag MOB_load_replay_ retired NBOGUS Loads Retired The number of retired load...

Page 484: ...liased A high count of this metric when there is no significant contribution due to write combining buffer full condition may indicate such a situation WC_buffer WCB_EVICTS WCB Full Evictions The numb...

Page 485: ...el 2 2 Enable edge filtering6 in the CCCR Non prefetch Bus Accesses from the Processor The number of all bus transactions that were allocated in the IO Queue from this processor excluding prefetched s...

Page 486: ...ata Ready Bus_ratio 100 Non Sleep Clockticks Reads from the Processor The number of all read includes RFOs transactions on the bus that were allocated in IO Queue from this processor includes prefetch...

Page 487: ...2 2 Enable edge filtering6 in the CCCR Reads Non prefetch from the Processor The number of all read transactions includes RFOs but excludes prefetches on the bus that originated from this processor B...

Page 488: ...g6 in the CCCR All UC from the Processor The number of UC Uncacheable memory transactions on the bus that originated from this processor User Note Beware of granularity issues e g a store of dqword to...

Page 489: ...CH CPUID model 2 2 Enable edge filtering6 in the CCCR Bus Accesses Underway from the processor7 This is an accrued sum of the durations of all bus transactions by this processor Divide by Bus Accesses...

Page 490: ...PREFETCH CPUID model 2 Non prefetch Reads Underway from the processor7 This is an accrued sum of the durations of read includes RFOs but excludes prefetches transac tions that originate from this proc...

Page 491: ...qA0 ALL_READ ALL_WRITE MEM_UC OWN CPUID model 2 All WC Underway from the processor7 This is an accrued sum of the durations of all WC transactions by this processor Divide by All WC from the processor...

Page 492: ...M_WC MEM_UC OWN CPUID model 2 Bus Accesses Underway from All Agents7 This is an accrued sum of the durations of entries by all agents on the bus Divide by Bus Accesses from All Agents to get bus reque...
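
The "Underway" metrics are described as accrued sums of durations that are divided by the matching transaction count. A minimal C sketch of that division follows. The function name and the choice to return 0.0 when no transactions were counted are assumptions for illustration, and the result is in whatever clock units the underway counter accumulates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the manual): average duration per bus
   transaction = accrued "underway" sum / number of transactions.
   Returns 0.0 when no transactions were observed (an assumption). */
static double average_underway_duration(uint64_t underway_sum,
                                        uint64_t transaction_count)
{
    if (transaction_count == 0) {
        return 0.0;
    }
    double avg = (double)underway_sum / (double)transaction_count;
    assert(avg >= 0.0);   /* both inputs are unsigned, so this always holds */
    return avg;
}
```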

Page 493: ...om DWord operands BSQ_allocation 1 REQ_TYPE1 REQ_LEN0 MEM_TYPE0 REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR Writes WB Full BSQ The number of writeback evicted from cache transactions to WB type...

Page 494: ...E REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR UC Reads Chunk BSQ The number of 8 byte aligned UC read transactions User note Read requests associated with 16 byte operands may under count BSQ_al...

Page 495: ...unk BSQ The number of 8 byte aligned IO port read transactions BSQ_allocation 1 REQ_LEN0 REQ_ORD_TYPE REQ_IO_TYPE REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR IO Writes Chunk BSQ The number of IO...

Page 496: ...e_entries 1 REQ_TYPE0 REQ_TYPE1 REQ_LEN0 REQ_LEN1 MEM_TYPE1 MEM_TYPE2 REQ_CACHE_TYPE REQ_DEM_TYPE UC Reads Chunk Underway BSQ 8 This is an accrued sum of the durations of UC read transactions Divide b...

Page 497: ...LEN0 MEM_TYPE0 REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR Characterization Metrics x87 Input Assists The number of occurrences of x87 input operands needing assistance to handle an exception co...

Page 498: ...ution_event set this execution tag Scalar_SP_retired NONBOGUS0 Scalar DP Retired3 Non bogus scalar double precision instructions retired Execution_event set this execution tag Scalar_DP_retired NONBOG...

Page 499: ...Resources non standard5 The duration of stalls due to lack of store buffers Resource_stall SBFULL Stalls of Store Buffer Resources non standard5 The number of allocation stalls due to lack of store bu...

Page 500: ...fying a split_load_retired tag in addition to programming the replay_event to count at retirement This section describes three sets of tags that are used in conjunction with three at retirement counti...

Page 501: ..._ retired Bit 2 BIT 24 BIT 25 Bit 1 None NBOGUS DTLB_all_miss_ retired Bit 2 BIT 24 BIT 25 Bit 0 Bit 1 None NBOGUS Tagged_mispred_ branch Bit 15 Bit 16 Bit 24 Bit 25 Bit 4 None NBOGUS MOB_load_ replay...

Page 502: ...ming an upstream ESCR to select event mask with its TagUop and TagValue bit fields The event mask for the downstream ESCR is specified in column 4 The event names referenced in column 4 can be found i...

Page 503: ...tired Set the ALL bit in the event mask and the TagUop bit in the ESCR of scalar_SP_uop 1 NBOGUS0 Scalar_DP_retired Set the ALL bit in the event mask and the TagUop bit in the ESCR of scalar_DP_uop 1...

Page 504: ...e is not a sufficient number of MSRs for simultaneous counting of the same metric on both logical processors. In both cases, it is also possible to program the relevant ESCR for a performance metric that...

Page 505: ...ance metrics related to the trace cache that are exceptions to the three categories above They are Logical Processor 0 Deliver Mode Logical Processor 1 Deliver Mode Logical Processor 0 Build Mode Logi...

Page 506: ...ll conditionals Mispredicted returns Mispredicted indirect branches Mispredicted calls Mispredicted conditionals TC and Front End Metrics Trace Cache Misses ITLB Misses TC to ROM Transfers TC Flushes...

Page 507: ...red Stores Retired DTLB Store Misses Retired DTLB Load and Store Misses Retired 2nd Level Cache Read Misses 2nd Level Cache Read References 3rd Level Cache Read Misses 3rd Level Cache Read References...

Page 508: ...way from the processor1 All UC Underway from the processor1 All WC Underway from the processor1 Bus Writes Underway from the processor1 Bus Accesses Underway from All Agents1 Write WC Full BSQ 1 Write...

Page 509: ...alled Cycles of Store Buffer Resources Stalls of Store Buffer Resources 1 Parallel counting is not supported due to ESCR restrictions Table B 7 Metrics That Are Independent of Logical Processors Gener...

Page 510: ...by more than one core and some performance events provide an event mask or unit mask that allows qualification at the physical processor boundary or at bus agent boundary Some events allow qualificat...

Page 511: ...each core. The above is also applicable when the core specificity sub-field (bits 15:14 of IA32_PERFEVTSELx MSR) within an event mask is programmed with 11B. The result reported by performance counter...

Page 512: ...from the L1 or prefetches to the L2 cache were issued Unhalted_Core_Cycles event number 3C unit mask 00H This event counts the smallest unit of time recognized by an active core In many operating syst...

Page 513: ...These snoops are done through the DCU store port Frequent DCU snoops may conflict with stores to the DCU and this may increase store latency and impact performance Bus_Not_In_Use event number 7DH uni...

Page 515: ...d scheduling Definitions the definitions for the primary information presented in the tables in section Latency and Throughput Latency and Throughput of Pentium 4 and Intel Xeon processors the listing...

Page 516: ...s to aid in selecting the sequence of instructions which minimizes dependency chain latency and to arrange instructions in an order which assists the hardware in processing instructions efficiently wh...

Page 517: ...esource conflicts Interleaving instructions so that they don t compete for the same port or execution unit can increase throughput For example alternating PADDQ and PMULUDQ each have a throughput of o...

Page 518: ...execute the ops for each instruction This information is provided only for IA 32 instructions that are decoded into no more than 4 ops ops for instructions that decode into more than 4 ops are supplie...

Page 519: ...tional information that is beyond the scope of this manual Comparisons of latency and throughput data between the Pentium 4 processor and the Pentium M processor can be misleading because one cycle in...

Page 520: ...for CPUID signature 0xF2n and 0xF3n. The notation 0xF2n represents the hex value of the lower 12 bits of the EAX register reported by CPUID instruction with input value of EAX = 1. F indicates the family...
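
The 0xF2n notation can be illustrated with a short C sketch that splits the lower 12 bits of the EAX value returned by CPUID (input EAX = 1) into family, model and stepping fields. The type and function names are hypothetical. The EAX value is passed in as a plain integer rather than read with the CPUID instruction, and the extended family and model fields in the upper bits are deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three 4-bit fields in the lower 12 bits. */
typedef struct {
    unsigned family;    /* bits 11:8 */
    unsigned model;     /* bits 7:4  */
    unsigned stepping;  /* bits 3:0  (the "n" in 0xF2n) */
} cpu_signature;

static cpu_signature decode_signature(uint32_t eax)
{
    cpu_signature s;
    s.stepping = eax & 0xFu;
    s.model    = (eax >> 4) & 0xFu;
    s.family   = (eax >> 8) & 0xFu;
    return s;
}

/* True when the signature has the given family and model, any stepping. */
static int signature_matches(uint32_t eax, unsigned family, unsigned model)
{
    assert(family <= 0xFu && model <= 0xFu);   /* only 4-bit fields compared */
    cpu_signature s = decode_signature(eax);
    return s.family == family && s.model == model;
}
```

With this sketch, a value such as 0x00000F24 decodes to family 0xF, model 2, stepping 4, so it matches the 0xF2n column but not the 0xF3n column.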

Page 521: ...xmm 6 6 1 1 1 1 FP_MOVE MOVDQU xmm xmm 6 6 1 1 1 1 FP_MOVE MOVDQ2Q mm xmm 8 8 1 2 2 1 FP_MOVE MMX_ALU MOVQ2DQ xmm mm 8 8 1 2 2 1 FP_MOVE MMX_SHFT MOVQ xmm xmm 2 2 1 2 2 1 MMX_SHFT PACKSSWB PACKSSDW PA...

Page 522: ...P_MUL PMULUDQ xmm xmm 9 8 6 2 2 2 4 FP_MUL POR xmm xmm 2 2 1 2 2 1 MMX_ALU PSADBW xmm xmm 4 4 5 2 2 2 4 MMX_ALU PSHUFD xmm xmm imm8 4 4 2 1 2 2 2 MMX_SHFT PSHUFHW xmm xmm imm8 2 2 1 2 2 1 MMX_SHFT PSH...

Page 523: ...xmm 2 2 1 2 2 1 MMX_ALU See Table Footnotes Table C 3 Streaming SIMD Extension 2 Double precision Floating point Instructions Instruction Latency1 Throughput Execution Unit2 CPUID 0F3n 0F2n 0x69n 0F3...

Page 524: ...MX_ALU CVTPS2PD3 xmm xmm 3 2 2 1 2 3 FP_ADD MMX_SHFT MMX_ALU CVTSD2SI r32 xmm 9 8 2 2 FP_ADD FP_MISC CVTSD2SS3 xmm xmm 17 16 4 2 4 1 FP_ADD MMX_SHFT CVTSI2SD3 xmm r32 16 15 4 2 3 1 FP_ADD MMX_SHFT MMX...

Page 525: ...LPD xmm xmm 7 6 2 2 FP_MUL MULSD xmm xmm 7 6 2 2 FP_MUL ORPD3 xmm xmm 4 4 2 2 MMX_ALU SHUFPD3 xmm xmm imm8 6 6 2 2 MMX_SHFT SQRTPD xmm xmm 70 69 58 57 70 69 114 FP_DIV SQRTSD xmm xmm 39 38 58 39 38 57...

Page 526: ...FP_ADD COMISS xmm xmm 7 6 1 2 2 1 FP_ADD FP_ MISC CVTPI2PS xmm mm 12 11 3 2 4 1 MMX_ALU FP_ ADD MMX_ SHFT CVTPS2PI mm xmm 8 7 3 2 2 1 FP_ADD MMX_ ALU CVTSI2SS3 xmm r32 12 11 4 2 2 2 FP_ADD MMX_ SHFT M...

Page 527: ...2 MMX_MISC RSQRTSS3 xmm xmm 6 6 4 4 1 MMX_MISC MMX_SHFT SHUFPS3 xmm xmm imm8 6 6 2 2 2 2 MMX_SHFT SQRTPS xmm xmm 40 39 29 28 40 39 58 FP_DIV SQRTSS xmm xmm 32 23 30 32 23 29 FP_DIV SUBPS xmm xmm 5 4 4...

Page 528: ...2 1 FP_MISC PMULHUW3 mm mm 9 8 1 1 FP_MUL PSADBW mm mm 4 4 5 1 1 2 MMX_ALU PSHUFW mm mm imm8 2 2 1 1 1 1 MMX_SHFT See Table Footnotes Table C 6 MMX Technology 64 bit Instructions Instruction Latency1...

Page 529: ...RAD mm mm imm8 2 2 1 1 MMX_SHFT PSRLQ PSRLW PSRLD mm mm imm8 2 2 1 1 MMX_SHFT PSUBB PSUBW PSUBD mm mm 2 2 1 1 MMX_ALU PSUBSB PSUBSW PSU BUSB PSUBUSW mm mm 2 2 1 1 MMX_ALU PUNPCKHBW PUNPCK HWD PUNPCKHD...

Page 530: ...OM 3 2 1 1 FP_MISC FCHS 3 2 1 1 FP_MISC FDIV Single Precision 30 23 30 23 FP_DIV FDIV Double Precision 40 38 40 38 FP_DIV FDIV Extended Precision 44 43 44 43 FP_DIV FSQRT SP 30 23 30 23 FP_DIV FSQRT D...

Page 531: ...0F2n 0x69n 0F2n ADC SBB reg reg 8 8 3 3 ADC SBB reg imm 8 6 2 2 ALU ADD SUB 1 0 5 0 5 0 5 ALU AND OR XOR 1 0 5 0 5 0 5 ALU BSF BSR 16 8 2 4 BSWAP 1 7 0 5 1 ALU BTC BTR BTS 8 9 1 CLI 26 CMP TEST 1 0 5...

Page 532: ...ALU PUSH 1 5 1 MEM_STORE ALU RCL RCR reg 18 6 4 1 1 ROL ROR 1 4 0 5 1 RET 8 1 MEM_LOAD ALU SAHF 1 0 5 0 5 0 5 ALU SAL SAR SHL SHR 1 4 1 0 5 1 SCAS 4 1 5 ALU MEM_ LOAD SETcc 5 1 5 ALU STI 36 STOSB 5 2...

Page 533: ...nd ports in the out of order core Note the following The FP_EXECUTE unit is actually a cluster of execution units roughly consisting of seven separate execution units The FP_ADD unit handles x87 and S...

Page 534: ...ted more slowly This applies to the Pentium 4 and Intel Xeon processors Latency and Throughput with Memory Operands The discussion of this section applies to the Intel Pentium 4 and Intel Xeon process...

Page 535: ...Throughput of these instructions with load operation remains the same with the register to register flavor of the instructions Floating point MMX technology Streaming SIMD Extensions and Streaming SIM...

Page 537: ...points to the base of the frame for the function and from which all data are referenced via appropriate offsets The convention on IA 32 is to use the esp register as the stack frame pointer for norma...

Page 538: ...versions of the Intel C Compiler for Win32 Systems have attempted to provide 8 byte aligned stack frames by dynamically adjusting the stack frame pointer in the prologue of main and preserving 8 byte...

Page 539: ...requirement of 4 but it calls function G at many call sites and in a loop If G s alignment requirement is 16 then by promoting F s alignment requirement to 16 and making all calls to G go to its alig...

Page 540: ...must be used. When using this type of frame, the sum of the sizes of the return address, saved registers, local variables, register spill slots, and parameter space must be a multiple of 16 bytes. This cause...
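
The multiple-of-16 requirement amounts to rounding the total frame size up to the next 16-byte boundary. The following C sketch shows that arithmetic only. The helper names are hypothetical, and it makes no claim about how the Intel compiler computes or lays out its frames.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round a byte count up to the next multiple of 16. */
static size_t round_up_16(size_t bytes)
{
    assert(bytes <= (size_t)-1 - 15u);        /* avoid wrap-around */
    return (bytes + 15u) & ~(size_t)15u;
}

/* Padding needed so that the summed frame components reach a multiple of 16. */
static size_t frame_padding(size_t frame_bytes)
{
    return round_up_16(frame_bytes) - frame_bytes;
}
```

For example, components totalling 44 bytes would need 4 bytes of padding to reach 48.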

Page 541: ...Note A
push ebx
mov ebx, esp
sub esp, 0x00000008
and esp, 0xfffffff0
add esp, 0x00000008
jmp common
foo.aligned:
push ebx
mov ebx, esp
common: // See Note B
push edx
sub esp, 20
j = k;
mov edx, [ebx + 8]
mov [esp + 16], edx...

Page 542: ...he return address the previous ebp the exception handling record the local variables and the spill area must be a multiple of 16 bytes In addition the parameter passing space must be a multiple of 16...

Page 543: ...er add jmp common foo aligned push ebx esp is 8 mod 16 after push mov ebx esp common push ebp this slot will be used for duplicate return pt push ebp esp is 0 mod 16 after push rtn ebx ebp ebp mov ebp...

Page 544: ...o 5 add esp 4 normal call sequence to unaligned entry mov esp 5 call foo for stdcall callee cleans up stack foo aligned 5 add esp 16 aligned entry this should be a multiple of 16 mov esp 5 call foo al...

Page 545: ...e 16 byte alignment then that function will not have any alignment code in it That is the compiler will not use ebx to point to the argument block and it will not have alternate entry points because t...

Page 546: ...ored relative to ebx in the function s epilog For additional information on the use of ebx in inline assembly code and other related issues see relevant application notes in the Intel Architecture Per...

Page 547: ...uation the mathematical model of the calculation Simplified Equation A simplified equation to compute PSD is as follows where psd is prefetch scheduling distance Nlookup is the number of clocks for lo...

Page 548: ...mathematics discussed are as follows: psd = prefetch scheduling distance (measured in number of iterations); il = iteration latency; Tc = computation latency per iteration with prefetch caches; Tl = memory leadoff...

Page 549: ...needed to execute this loop with actual run-time memory footprint. Tc can be determined by computing the critical path latency of the code dependency graph. This work is quite arduous without help fro...

Page 550: ...e that three cache lines are accessed per iteration and four chunks of data are returned per iteration for each cache line Also assume these 3 accesses are pipelined in memory subsystem Based on these...

Page 551: ...To determine the proper prefetch scheduling distance follow these steps and formulae Optimize Tc as much as possible Use the following set of formulae to calculate the proper prefetch scheduling dista...

Page 552: ...e bus is idle during the computation portion of the loop The memory access latencies could be hidden behind execution if data could be fetched earlier during the bus idle time Further analyzing Figure...

Page 553: ...ory pipeline. With an ideal placement of the data prefetching, the iteration latency should be bound by either execution latency or memory latency, that is, il = maximum(Tc, Tb). Compute Bound (Case: Tc >= Tl + Tb) Fi...

Page 554: ...computation latency, which means the memory accesses are executed in background and their latencies are completely hidden. Compute Bound (Case: Tl + Tb > Tc > Tb): Now consider the next case by first examining Fi...

Page 555: ...umed in iteration i 2 Figure E 4 represents the case when the leadoff latency plus data transfer latency is greater than the compute latency which is greater than the data transfer latency The followi...

Page 556: ...refetch scheduling distance or prefetch iteration distance for the case when memory throughput latency is greater than the compute latency. Apparently, the iteration latency is dominated by the memory th...
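
The three cases in these snippets can be summarised in a short C sketch. It encodes the steady-state iteration latency il as the larger of Tc and Tb, and classifies a loop by comparing Tc against Tb and Tl + Tb. The enum and function names are hypothetical, all latencies are assumed to be in the same clock units, and the handling of exact equality at the case boundaries is an assumption.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical labels for the three cases described in the text. */
typedef enum {
    COMPUTE_BOUND_FULLY_HIDDEN,   /* Tc >= Tl + Tb                 */
    COMPUTE_BOUND_PARTIAL,        /* Tl + Tb > Tc > Tb             */
    MEMORY_THROUGHPUT_BOUND       /* Tb >= Tc (assumed boundary)   */
} bound_case;

/* il = maximum(Tc, Tb), assuming prefetches are placed ideally. */
static unsigned iteration_latency(unsigned tc, unsigned tb)
{
    return tc > tb ? tc : tb;
}

static bound_case classify_loop(unsigned tc, unsigned tl, unsigned tb)
{
    assert(tl <= UINT_MAX - tb);              /* keep tl + tb from overflowing */
    if (tc >= tl + tb) {
        return COMPUTE_BOUND_FULLY_HIDDEN;
    }
    if (tc > tb) {
        return COMPUTE_BOUND_PARTIAL;
    }
    return MEMORY_THROUGHPUT_BOUND;
}
```

As an illustration with made-up numbers, Tl = 60 and Tb = 40 with Tc = 80 falls in the middle case, and the iteration latency is 80 clocks. With Tc = 30 the loop is bound by memory throughput and the iteration latency is 40 clocks.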

Page 557: ...aphics driver moving data from writeback memory to write combining memory belongs to this category where performance advantage from prefetch instructions will be marginal Example As an example of the...

Page 558: ...n of Tc the computation latency The steady state iteration latency il is either memory bound or compute bound depending on Tc if prefetches are scheduled effectively The graph in example 2 of accesses...

Page 559: ...tion gap in computing the prefetch scheduling distance The transaction gap Tg must be factored into the burst cycles Tb for the calculation of prefetch scheduling distance The following relationship s...

Page 561: ...pplication performance tools A 1 Arrays Aligning 2 39 automatic processor dispatch support A 4 automatic vectorization 3 18 3 19 B battery life 9 1 9 7 OS APIs 9 8 performance options 9 8 Branch Predi...

Page 562: ...6 44 _mm_stream 6 2 6 44 compiler plug in A 2 compiler supported alignment 3 24 complex instructions 2 74 computation latency E 8 computation intensive code 3 11 compute bound E 7 E 8 converting code...

Page 563: ...et 7 44 placement of shared synchronization variable 7 31 prevent false sharing of data 7 30 preventing excessive evictions in first level data cache 7 43 shared memory optimization 7 39 synchronizati...

Page 564: ...7 OS APIs 9 8 C4 state 9 6 CD DVD WLAN WiFi 9 10 C states 9 1 9 4 deep sleep transitions 9 11 deeper sleep 9 6 9 14 OS changes processor frequency 9 2 OS synchronization APIs 9 9 overview 9 1 performa...

Page 565: ...39 Performance and Usage Models Multithreading 7 2 Performance and Usage Models 7 2 Performance Library Suite A 14 optimizations A 16 PEXTRW instruction 4 13 PGO See profile guided optimization PINSRW...

Page 566: ...n algorithm 2 20 static power 9 1 static prediction 2 19 static prediction algorithm 2 19 streaming stores coherent requests 6 13 non coherent requests 6 13 strip mining 3 32 3 34 strip mining 6 37 6...

Page 567: ...rni 14 Brno 61600 Czech Rep Denmark Intel Corp Soelodden 13 Maaloev DK2760 Denmark Germany Intel Corp Sandstrasse 4 Aichner 86551 Germany Intel Corp Dr Weyerstrasse 2 Juelich 52428 Germany Intel Corp...

Page 568: ...9465 Counselors Row Suite 200 Indianapolis IN 46240 USA Fax 317 805 4939 Massachusetts Intel Corp 125 Nagog Park Acton MA 01720 USA Fax 978 266 3867 Intel Corp 59 Composit Way suite 202 Lowell MA 018...
