
IA-32 Intel® Architecture Optimization


General Compiler Recommendations

A compiler that has been extensively tuned for the target microarchitecture
can be expected to match or outperform hand-coding in the general case.
However, if particular performance problems are noted with the compiled
code, some compilers (like the Intel C++ and Fortran Compilers) allow the
coder to insert intrinsics or inline assembly in order to exert greater
control over what code is generated. If inline assembly is used, the user
should verify that the code generated to integrate the inline assembly is
of good quality and yields good overall performance.

Default compiler switches are targeted for the common case. An 
optimization may be made to the compiler default if it is beneficial for 
most programs. If a performance problem is root-caused to a poor 
choice on the part of the compiler, using different switches or compiling 
the targeted module with a different compiler may be the solution.

VTune Performance Analyzer

Where performance is a critical concern, use performance monitoring 
hardware and software tools to tune your application and its interaction 
with the hardware. IA-32 processors have counters which can be used to 
monitor a large number of performance-related events for each 
microarchitecture. The counters also provide information that helps 
resolve the coding pitfalls.

The VTune Performance Analyzer allows engineers to use these counters 
to obtain two kinds of tuning feedback:

indication of a performance improvement gained by using a specific 
coding recommendation or microarchitectural feature,

information on whether a change in the program has improved or 
degraded performance with respect to a particular metric.

IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006

Страница 2: ...ny software that may be provided in association with this document Except as permitted by such license no part of this document may be reproduced stored in a retrieval system or trans mitted in any fo...

Страница 3: ...Goals of Intel NetBurst Microarchitecture 1 8 Overview of the Intel NetBurst Microarchitecture Pipeline 1 9 The Front End 1 11 The Out of order Core 1 12 Retirement 1 12 Front End Pipeline Detail 1 13...

Страница 4: ...Chapter 2 General Optimization Guidelines Tuning to Achieve Optimum Performance 2 1 Tuning to Prevent Known Coding Pitfalls 2 2 General Practices and Coding Guidelines 2 3 Use Available Performance T...

Страница 5: ...eon Processors 2 45 Aliasing Cases in the Pentium M Processor 2 46 Mixing Code and Data 2 47 Self modifying Code 2 47 Write Combining 2 48 Locality Enhancement 2 50 Minimizing Bus Latency 2 52 Non Tem...

Страница 6: ...ng Point SIMD Operands 2 88 Prolog Sequences 2 90 Code Sequences that Operate on Memory Operands 2 90 Instruction Scheduling 2 91 Latencies and Resource Constraints 2 91 Spill Scheduling 2 92 Scheduli...

Страница 7: ...for 128 bit data 3 24 Compiler Supported Alignment 3 24 Improving Memory Utilization 3 27 Data Structure Layout 3 27 Strip Mining 3 32 Loop Blocking 3 34 Instruction Selection 3 37 SIMD Optimizations...

Страница 8: ...ord 4 31 Complex Multiply by a Constant 4 32 Packed 32 32 Multiply 4 33 Packed 64 bit Add Subtract 4 33 128 bit Shifts 4 33 Memory Optimizations 4 34 Partial Memory Accesses 4 35 Supplemental Techniqu...

Страница 9: ...etch Coding Guidelines 6 2 Hardware Prefetching of Data 6 4 Prefetch and Cacheability Instructions 6 5 Prefetch 6 6 Software Data Prefetch 6 6 The Prefetch Instructions Pentium 4 Processor Implementat...

Страница 10: ...Decoder Implementation 6 46 Optimizing Memory Copy Routines 6 46 TLB Priming 6 47 Using the 8 byte Streaming Stores and Software Prefetch 6 48 Using 16 byte Streaming Stores and Hardware Prefetch 6 5...

Страница 11: ...Cache Misses 7 36 Use Full Write Transactions to Achieve Higher Data Rate 7 37 Memory Optimization 7 38 Cache Blocking Technique 7 38 Shared Memory Optimization 7 39 Minimize Sharing of Data between P...

Страница 12: ...ep Technology 9 12 Enabling Intel Enhanced Deeper Sleep 9 14 Multi Core Considerations 9 15 Enhanced Intel SpeedStep Technology 9 15 Thread Migration Considerations 9 16 Multi core Considerations for...

Страница 13: ...rmance Metrics B 1 Pentium 4 Processor Specific Terminology B 2 Bogus Non bogus Retire B 2 Bus Ratio B 2 Replay B 3 Assist B 3 Tagging B 3 Counting Clocks B 4 Non Halted Clockticks B 5 Non Sleep Clock...

Страница 14: ...4 Latency and Throughput with Register Operands C 6 Table Footnotes C 19 Latency and Throughput with Memory Operands C 20 Appendix DStack Alignment Stack Frames D 1 Aligned esp Based Stack Frames D 4...

Страница 15: ...Code 2 36 Example 2 15 Two Examples to Avoid the Non forwarding Situation in Example 2 14 2 36 Example 2 13 A Non forwarding Example of Large Load After Small Store 2 36 Example 2 16 Large and Small...

Страница 16: ...op Blocking 3 35 Example 3 21 Emulation of Conditional Moves 3 37 Example 4 1 Resetting the Register between __m64 and FP Data Types 4 5 Example 4 2 Unsigned Unpack Instructions 4 7 Example 4 3 Signed...

Страница 17: ...s and shuffle Instructions 5 15 Example 5 7 Deswizzling Data 64 bit Integer SIMD Data 5 16 Example 5 8 Using MMX Technology Code for Copying or Shuffling 5 18 Example 5 9 Horizontal Add Using movhlps...

Страница 18: ...ithout Sharing a Cache Line 7 32 Example 7 8 Batched Implementation of the Producer Consumer Threads 7 41 Example 7 9 Adding an Offset to the Stack Pointer of Three Threads 7 45 Example 7 10 Adding a...

Страница 19: ...ler Performance Trade offs 3 13 Figure 3 3 Loop Blocking Access Pattern 3 36 Figure 4 2 Interleaved Pack with Saturation 4 9 Figure 4 1 PACKSSDW mm mm mm64 Instruction Example 4 9 Figure 4 4 Result of...

Страница 20: ...State Transitions 9 3 Figure 9 2 Active Time Versus Halted Time of a Processor 9 4 Figure 9 3 Application of C states to Idle Time 9 6 Figure 9 4 Profiles of Coarse Task Scheduling and Power Consumpt...

Страница 21: ...to Strip mining Code 6 39 Table 6 2 Relative Performance of Memory Copy Routines 6 52 Table 6 3 Deterministic Cache Parameters Leaf 6 54 Table 7 1 Properties of Synchronization Objects 7 21 Table B 1...

Страница 22: ...xxii Table C 5 Streaming SIMD Extension 64 bit Integer Instructions C 14 Table C 7 IA 32 x87 Floating point Instructions C 16 Table C 8 IA 32 General Purpose Instructions C 17...

Страница 23: ...itecture Volume 2A Instruction Set Reference A M Volume 2B Instruction Set Reference N Z and Volume 3 System Programmer s Guide When developing and optimizing software applications to achieve a high l...

Страница 24: ...n your applications On the Pentium 4 Intel Xeon and Pentium M processors this tool can monitor an application through a selection of performance monitoring events and analyze the performance event dat...

Страница 25: ...tBurst microarchitecture and Pentium M processor microarchitecture Chapter 3 Coding for SIMD Architectures Describes techniques and concepts for using the SIMD integer and SIMD floating point instruct...

Страница 26: ...ols for analyzing and enhancing application performance without having to write assembly code Appendix B Intel Pentium 4 Processor Performance Metrics Provides information that can be gathered using P...

Страница 27: ...Programmer s Guide doc number 253668 Intel Processor Identification with the CPUID Instruction doc number 241618 Developing Multi threaded Applications A Platform Consistent Approach available at http...

Страница 28: ...t THIS TYPE STYLE Indicates a value for example TRUE CONST1 or a variable for example A B or register names MMO through MM7 l indicates lowercase letter L in examples 1 is the number 1 in examples O i...

Страница 29: ...Technology Intel EM64T Intel processors supporting Hyper Threading HT Technology1 Multi core architecture supported in Intel Core Duo Intel Pentium D processors and Pentium processor Extreme Edition2...

Страница 30: ...t of eight 64 bit registers called MMX registers see Figure 1 2 The Pentium III processor extended the SIMD computation model with the introduction of the Streaming SIMD Extensions SSE SSE allows SIMD...

Страница 31: ...ble precision floating point data elements and 128 bit packed integers There are 144 instructions in SSE2 that operate on two packed double precision floating point data elements or on 16 packed byte...

Страница 32: ...Point Arithmetic They are accessible from all IA 32 execution modes protected mode real address mode and Virtual 8086 mode SSE SSE2 and MMX technologies are architectural extensions in the IA 32 Inte...

Страница 33: ...with Streaming SIMD Extensions 2 SSE3 Chapter 12 Programming with Streaming SIMD Extensions 3 SSE3 Summary of SIMD Technologies MMX Technology MMX Technology introduced 64 bit MMX registers support fo...

Страница 34: ...nverting between new and existing data types extended support for data shuffling extended support for cacheability and memory ordering operations SSE2 instructions are useful for 3D graphics video dec...

Страница 35: ...mode enables a 64 bit operating system to run applications written to access 64 bit linear address space In the 64 bit mode of Intel EM64T software may access 64 bit flat linear addressing 8 additiona...

Страница 36: ...ency and Throughput Intel NetBurst microarchitecture is designed to achieve high performance for integer and floating point computations at high clock rates It supports the following features hyper pi...

Страница 37: ...kes to execute each individual instruction is not always deterministic Chapter 2 General Optimization Guidelines lists optimizations to use and situations to avoid The chapter also gives a sense of re...

Страница 38: ...l program order and that the proper architectural states are updated Figure 1 3 illustrates a diagram of the major functional blocks associated with the Intel NetBurst microarchitecture pipeline The f...

Страница 39: ...hat are sources of delay the time required to decode instructions fetched from the target wasted decode bandwidth due to branches or a branch target in the middle of a cache line Instructions are fetc...

Страница 40: ...r operations executing in parallel or by the execution of ops queued up in a buffer The core is designed to facilitate parallel execution It can dispatch up to six ops per cycle through the issue port...

Страница 41: ...tory Figure 1 3 illustrates the paths that are most frequently executing inside the Intel NetBurst microarchitecture an execution loop that interacts with multilevel cache hierarchy and the system bus...

Страница 42: ...croarchitecture has a single decoder that decodes instructions at the maximum rate of one instruction per clock Some complex instructions must enlist the help of the microcode ROM The decoder operatio...

Страница 43: ...es accurately and to reduce the cost of taken branches These include the ability to dynamically predict the direction and target of branches based on an instruction s linear address using the branch t...

Страница 44: ...Return Stack that can predict return addresses for a series of procedure calls This increases the benefit of unrolling loops containing function calls It also mitigates the need to put certain proced...

Страница 45: ...order IA 32 instructions to preserve available parallelism by minimizing long dependence chains and covering long instruction latencies order instructions so that their operands are ready and their co...

Страница 46: ...either one floating point move op a floating point stack move floating point exchange or floating point store data or one arithmetic logical unit ALU op arithmetic logic branch or store data In the se...

Страница 47: ...ared between instructions and data Figure 1 4 Execution Units and Ports in the Out Of Order Core OM15151 ALU 0 Double Speed Port 0 ADD SUB Logic Store Data Branches FP Move FP Store Data FXCH ALU 1 Do...

Страница 48: ...rd level cache miss initiates a transaction across the system bus A bus write transaction writes 64 bytes to cacheable memory or separate 8 byte chunks if the destination is not cacheable A bus read t...

Страница 49: ...ntrolled prefetch is enabled using the four prefetch instructions PREFETCHh introduced with SSE The software prefetch is not intended for prefetching code Using it can incur significant penalties on a...

Страница 50: ...processor and the use of too many prefetches can limit their effectiveness Examples of this include prefetching data in a loop for a reference outside the loop and prefetching in a basic block that is...

Страница 51: ...sfy a condition that the linear address distance between these cache misses is within a threshold value The threshold value depends on the processor implementation of the microarchitecture see Table 1...

Страница 52: ...by not exceeding the memory issue bandwidth and buffer resources provided by the processor Up to one load and one store may be issued for each cycle from a memory port reservation station In order to...

Страница 53: ...ntil a write to memory and or cache is complete Writes are generally not on the critical path for dependence chains so it is often beneficial to delay writes for more efficient use of memory access bu...

Страница 54: ...or microarchitecture contains three sections in order issue front end out of order superscalar execution core in order retirement unit Intel Pentium M processor microarchitecture supports a high speed...

Страница 55: ...ure 1 5 The Front End The Intel Pentium M processor uses a pipeline depth that enables high performance and low power consumption It s shorter than that of the Intel NetBurst microarchitecture The Int...

Страница 56: ...oding an instruction with four or fewer ops The remaining two decoders each decode a one op instruction in each clock cycle The front end can issue multiple ops per cycle in original program order to...

Страница 57: ...ro op This holds for integer floating point and MMX technology loads and for most kinds of successive execution operations Note that SSE Loads can not be fused Data Prefetching The Intel Pentium M pro...

Страница 58: ...ands are ready and resources are available Each cycle the core may dispatch up to five ops through the issue ports Table 1 2 Trigger Threshold and CPUID Signatures for IA 32 Processor Families Trigger...

Страница 59: ...wo cores in an Intel Core Duo processor to minimize bus traffic between two cores accessing a single copy of cached data It allows an Intel Core Solo processor or when one of the two cores in an Intel...

Страница 60: ...and memory have single micro op flows comparable to X87 flows Many packed instructions are fused to reduce its micro op flow from four to two micro ops Eliminating decoder restrictions Intel Core Sol...

Страница 61: ...em at lower priority than normal cache miss requests If bus queue is in high demand hardware prefetch requests may be ignored or cancelled to service bus traffic required by demand cache misses and ot...

Страница 62: ...ell suited for multiprocessor systems to provide an additional performance boost in throughput when compared to traditional MP systems Figure 1 6 shows a typical bus based symmetric multiprocessor SMP...

Страница 63: ...to the fact that operating systems and user programs can schedule processes or threads to execute simultaneously on the logical processors in each physical processor the ability to use on chip execut...

Страница 64: ...tably the memory type range registers MTRRs and the performance monitoring resources For a complete list of the architecture state and exceptions see the IA 32 Intel Architecture Software Developer s...

Страница 65: ...ers partitioning also provided an easier implementation to maintain memory ordering for each logical processor and detect memory ordering violations Shared Resources Most resources in a physical proce...

Страница 66: ...e stage include cache misses branch mispredictions and instruction dependencies Front End Pipeline The execution trace cache is shared between two logical processors Execution trace cache access is ar...

Страница 67: ...processor by alternating between the two logical processors If one logical processor is not ready to retire any instructions then all retirement bandwidth is dedicated to the other logical processor O...

Страница 68: ...rocessors in a physical package Each logical processor has a separate execution core including first level cache and a smart second level cache The second level cache is shared between two logical pro...

Страница 69: ...nterface Bus Interface Pentium D Processor Caches Caches System Bus Architectual State Execution Engine Local APIC Local APIC Execution Engine Architectual State Bus Interface Bus Interface Local APIC...

Страница 70: ...nded model and model fields See Table 1 4 Shared Cache in Intel Core Duo Processors The Intel Core Duo processor has two symmetric cores that share the second level cache and a single bus interface se...

Страница 71: ...to wait before the same operation can start again The latency of a bus transaction is exposed in some of these operations as indicated by entries containing bus transaction On Intel Core Duo processo...

Страница 72: ...in response time of these cache misses For store operation reading for ownership must be completed before the data is written to the first level data cache and the line is marked as modified Reading...

Страница 73: ...ssed in Chapter 7 This chapter explains the optimization techniques both for those who use the Intel C or Fortran Compiler and for those who use other compilers The Intel compiler which generates code...

Страница 74: ...ache line splits Table 2 1 lists coding pitfalls that cause performance degradation in some Pentium 4 and Intel Xeon processor implementations For every issue Table 2 1 references a section in this do...

Страница 75: ...or the common performance features of the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture The coding practices recommended under each heading and the bullets under each...

Страница 76: ...raffic locality memory traffic characteristics etc Characterize the performance gain Optimize Performance Across Processor Generations Use a cpuid dispatch strategy to deliver optimum performance for...

Страница 77: ...sible eliminate branches Avoid indirect calls Optimize Memory Access Observe store forwarding constraints Ensure proper data alignment to prevent data split across cache line boundary This includes st...

Страница 78: ...registers rounding modes between more than two values Use efficient conversions such as those that implicitly include a rounding mode in order to avoid changing control status registers Take advantage...

Страница 79: ...16 bit registers because accessing them requires a shift operation internally Use xor and pxor instructions to clear registers and break dependencies for integer operations also use xorps and xorpd t...

Страница 80: ...sures how frequently such instances occur across all application domains with the frequency marked as H high M medium L low These rules are very approximate They can vary depending on coding style app...

Страница 81: ...n and instruction scheduling Refer to the Intel C Intrinsics Reference section of the Intel C Compiler User s Guide C class libraries Refer to the Intel C Class Libraries for SIMD Operations Reference...

Страница 82: ...cial for most programs If a performance problem is root caused to a poor choice on the part of the compiler using different switches or compiling the targeted module with a different compiler may be t...

Страница 83: ...chapter include descriptions of the VTune analyzer events that provide measurable data of performance gain achieved by following recommendations Refer to the VTune analyzer online help for instructio...

Страница 84: ...sed shifts rotates integer multiplies and moves from memory with sign extension are longer than before Use care when using the lea instruction See the section Use of the lea Instruction for recommenda...

Страница 85: ...ntify the processor generation and integrate processor specific instructions such as SSE2 instructions into the source code The Intel C Compiler supports the integration of different versions of the c...

Страница 86: ...t IA 32 processor families offer hardware multi threading support in two forms dual core technology and Hyper Threading Technology Future trend for IA 32 processors will continue to improve in the dir...

Страница 87: ...n wait loops Inline functions and pair up calls and returns Unroll as necessary so that repeatedly executed loops have sixteen or fewer iterations unless this causes an excessive code size increase Se...

Страница 88: ...g both paths of a conditional branch In addition converting conditional branches to cmovs or setcc trades of control flow dependence for data dependence and restricts the capability of the out of orde...

Страница 89: ...the cmov and fcmov instructions Example 2 3 shows changing a test and branch instruction sequence using cmov and eliminating a branch If the test sets the equal flag the value in ebx will be moved to...

Страница 90: ...the Pentium 4 processor may suffer a severe penalty when exiting the loop because the processor may detect a possible memory order violation Inserting the pause instruction significantly reduces the...

Страница 91: ...branches in processors based on the Intel NetBurst microarchitecture are predicted using the following static prediction algorithm Predict backward conditional branches to be taken This rule is suita...

Страница 92: ...t and make the fall through code following a conditional branch be the unlikely target for a branch with a backward target Example 2 5 illustrates the static branch prediction algorithm The body of an...

Страница 93: ...n JC Begin in Example 2 7 segment is a conditional forward branch It is not in the BTB the first time through but the static predictor will predict the branch to fall through The static prediction alg...

Страница 94: ...od of exceeding the stack depth in a manner that will impact performance is very low Assembly Compiler Coding Rule 4 MH impact MH generality Near calls must be matched with near returns and far calls...

Страница 95: ...all to a jump This will save the call return overhead as well as an entry in the return stack buffer Assembly Compiler Coding Rule 10 M impact L generality Do not put more than four branches in a 16 b...

Страница 96: ...or calls through pointers can jump to an arbitrary number of locations If the code sequence is such that the target destination of a branch goes to the same address most of the time then the BTB will...

Страница 97: ...f the direction of the branch under consideration Example 2 8 shows a simple example of the correlation between a target of a preceding conditional branch with a target of an indirect branch Correlati...

Страница 98: ...s This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path Unrolling exposes the code to various other optimizations...

Страница 99: ...f the unrolled loop is 16 or less the branch predictor should be able to correctly predict branches in the loop body that alternate direction Assembly Compiler Coding Rule 13 H impact M generality Unr...

Страница 100: ...on separate pages using conditional move instructions to eliminate branches generating code that is consistent with the static branch prediction algorithm inlining where appropriate unrolling if the...

Страница 101: ...entium 4 processor Alignment Alignment of data concerns all kinds of variables dynamically allocated members of a data structure global or local variables parameters passed on the stack Misaligned dat...

Страница 102: ...A 64 byte or greater data structure or array should be aligned so that its base address is a multiple of 64 Sorting data in decreasing size order is one heuristic for assisting with natural alignment...

Страница 103: ...ment of branch targets will improve decoder throughput Example 2 11 Code That Causes Cache Line Split mov esi 029e70feh mov edi 05be5260h Blockmove mov eax DWORD PTR esi mov ebx DWORD PTR esi 4 mov DW...

Страница 104: ...eral examples of coding pitfalls that cause store forwarding stalls and solutions to these pitfalls are discussed in detail in the Store to Load Forwarding Restriction on Size and Alignment section Th...

Страница 105: ...a significant latency in forwarding Passing floating point argument in preferably XMM registers should save this long latency operation Parameter passing conventions may limit the choice of which par...

Страница 106: ...rt point and therefore the same alignment as the store data Assembly Compiler Coding Rule 19 H impact M generality The data of a load which is forwarded from a store must be completely contained withi...

Страница 107: ...ead and register copies as needed Example 2 12 contains several store forwarding situations when small loads follow large stores The first three load operations illustrate the situations described in...

Страница 108: ...d mov EAX EBP blocked The first 4 small store can be consolidated into a single DWORD store to prevent this non forwarding situation Example 2 14 A Non forwarding Situation in Compiler Generated Code...

Страница 109: ...or example when bytes or words are stored and then words or doublewords are read from the same area of memory In the second case Example 2 16 B there is a series of small loads after a large store to...

Страница 110: ...l cases where data is passed through memory where the store may need to be separated from the load spills save and restore registers in a stack frame parameter passing global and volatile variables ty...

Страница 111: ...ctures and arrays to minimize the amount of memory wasted by padding However compilers might not have this freedom The C programming language for example specifies the order in which structure element...

Страница 112: ...d into several arrays to achieve better packing as shown in Example 2 19 The efficiency of such optimizations depends on usage patterns If the elements of the structure are all accessed together but t...

Страница 113: ...so have an impact on DRAM page access efficiency An alternative hybrid_struct_of_array blends the two approaches In this case only 2 separate address streams are generated and referenced 1 for hybrid_...

Страница 114: ...For example if a language only supports 8 bit 16 bit 32 bit and 64 bit data quantities but never uses 80 bit data quantities the language can require the stack to always be aligned on a 64 bit boundar...

Страница 115: ...the same set of each way in a cache can cause a capacity issue There are aliasing conditions that apply to specific microarchitectures Note that first level cache lines are 64 bytes Thus the least si...

Страница 116: ...ss of first level cache misses for more than 4 simultaneous competing memory references to addresses with 2KB modulus On Pentium 4 and Intel Xeon processors with CPUID signature of family encoding 15...

Страница 117: ...rence load or store which is under way then the second reference cannot begin until the first one is kicked out of the cache On Pentium 4 and Intel Xeon processors with CPUID signature of family encod...

Страница 118: ...first level cache working set Avoid having a store followed by a non dependent load with addresses that differ by a multiple of 4 KB When declaring multiple arrays that are referenced with the same i...

Страница 119: ...re cases a performance problem may be noted due to executing data on a code page as instructions The condition where this is very likely to happen is when execution is following an indirect branch tha...

Страница 120: ...condition and should be avoided where possible Avoid the condition by introducing indirect branches and using data tables on data not code pages via register indirect calls Write Combining Write comb...

Страница 121: ...to writeback memory into sepa rate phases can assure that the write combining buffers can fill before getting evicted by other write traffic Eliminating partial write transac tions has been found to h...

Страница 122: ...is typically that access cost from an outer sub system may be somewhere between 3 10X more expensive than accessing data from the immediate inner level in the cache memory hierarchy assuming similar...

Страница 123: ...ocality enhancing techniques Using the lock prefix heavily can incur large delays when accessing memory irrespective of whether the data is in the cache or in system memory User Source Coding Rule 6 H...

Страница 124: ...sactions into read phases and write phases can help performance Note however that the order of read and write operations on the bus are not the same as they appear in the program Bus latency of fetchi...

Страница 125: ...ead for ownership of a cache line and writes The data transfer rate for bus write transactions is higher if 64 bytes are written out to the bus at a time Typically bus writes to Writeback WB type memo...

Страница 126: ...r ecx eax xmm0 movntps XMMWORD ptr ecx eax 16 xmm0 movntps XMMWORD ptr ecx eax 32 xmm0 movntps XMMWORD ptr ecx eax 48 xmm0 64 bytes is written in one bus transaction add eax STRIDESIZE cmp eax edx jl...

Страница 127: ...using the prefetchnta instruction fetches 128 bytes into one way of the second level cache The Pentium M processor also provides a hardware prefetcher for data It can track 12 separate streams in the...

Страница 128: ...ll also have an impact For a detailed description of how to use prefetching see Chapter 6 Optimizing Cache Usage User Source Coding Rule 9 M impact H generality Enable the prefetch generation in your...

Страница 129: ...ode such as code to handle error conditions out of that sequence See Prefetching section on how to optimize for the instruction prefetcher Assembly Compiler Coding Rule 29 M impact H generality All br...

Страница 130: ...race cache bandwidth or instruction latency Focus on optimizing the problem area For example adding prefetch instructions will not help if the bus is already saturated If trace cache bandwidth is the...

Страница 131: ...store forwarding delays over some compiler implementations This avoids changing the rounding mode User Source Coding Rule 14 M impact ML generality Break dependence chains where possible Removing data...

Страница 132: ...conditions such as arithmetic overflow arithmetic underflow denormalized operand Refer to Chapter 4 of the IA 32 Intel Architecture Software Developer s Manual Volume 1 for the definition of overflow...

Страница 133: ...eptions Scale the range of operands results to reduce as much as possible the number of arithmetic overflow underflow situations Keep intermediate results on the x87 FPU register stack until the final...

Страница 134: ...that can be encountered in FTZ mode are the ones specified as constants read only The DAZ mode is provided to handle denormal source operands efficiently when running an SSE application When the DAZ...

Страница 135: ...olution to this problem is to choose two constant FCW values take advantage of the optimization of the FLDCW instruction to alternate between only these two constant FCW values and devise some means t...

Страница 136: ...olved For x87 floating point the fist instruction uses the rounding mode represented in the floating point control word FCW The rounding mode is generally round to nearest therefore many compiler writ...

Страница 137: ...se the algorithm in Example 2 23 to avoid synchronization issues the overhead of the fldcw instruction and having to change the rounding mode The provided example suffers from a store forwarding probl...

Страница 138: ...mov edx ecx 4 high dword of integer mov eax ecx low dword of integer test eax eax je integer_QnaN_or_zero arg_is_not_integer_QnaN fsubp st 1 st TOS d round d st 1 st 1 st pop ST test edx edx what s s...

Страница 139: ...rd is set to Single Precision the floating point divider can complete a single precision computation much faster than either a double precision computation or an extended double precision computation...

Страница 140: ...to expose more independent instructions to the hardware scheduler An fxch instruction may be required to effectively increase the register name space so that more operands can be simultaneously live N...

Страница 141: ...results will not underflow Therefore subsequent computation will not face the performance penalty of handling denormal input operands For example in the case of 3D applications with low lighting leve...

Страница 142: ...he overhead associated with the management of the X87 register stack Scalar SSE SSE2 Performance on Intel Core Solo and Intel Core Duo Processors On Intel Core Solo and Intel Core Duo processors the c...

Страница 143: ...ation occurs when mixing single precision and double precision code On Pentium 4 processors using cvtss2sd has a performance penalty relative to the alternative sequence xorps xmm1 xmm1 movss xmm1 xmm2...

Страница 144: ...with using separate instructions Assembly Compiler Coding Rule 36 M impact L generality Try to use 32 bit operands rather than 16 bit operands for fild However do not do so at the expense of introduc...

Страница 145: ...nd fewer ops Use optimized sequences for clearing and comparing registers Enhance register availability Avoid prefixes especially more than one prefix Assembly Compiler Coding Rule 37 M impact H gener...

Страница 146: ...can also be used as a multiple operand addition instruction for example lea ecx eax ebx 4 a Using lea in this way may avoid register usage by not tying up registers for operands of arithmetic instruct...

Страница 147: ...responding to family 15 and model encoding of 0 1 or 2 The latency of a sequence of adds will be shorter for left shifts of three or less Fixed and variable shifts have the same latency The rotate by...

Страница 148: ...generate the same number of ops If AX or EAX are known to be positive replace these instructions with xor dx dx or xor edx edx Operand Sizes and Partial Register Accesses The Pentium 4 processor Penti...

Страница 149: ...op with limited parallelism the resulting optimization can yield several percent performance improvement Example 2 24 Dependencies Caused by Referencing Partial Registers 1 add ah bh 2 add al 3 instru...

Страница 150: ...d If multiple prefixes or a prefix that changes the size of an immediate or displacement cannot be avoided schedule them behind instructions that stall the pipe for some other reason Assembly Compiler...

Страница 151: ...mon However the technique will not work if the compare is for greater than less than greater than or equal and so on or if the values in eax or ebx are to be used in another operation where sign exten...

Страница 152: ...eax 0xffff is encoded as 35 FF FF 00 00 in 32 bit code while overriding the default operand size with 0x66 results in the instruction xor ax 0xffff which is encoded as 66 35 FF FF Use of an LCP cause...

Страница 153: ...odr m byte consider adjusting code alignment by adding a nop before the instruction to avoid the opcode byte aligning on byte 14 from the beginning of an instruction fetch line REP Prefix and Data Mov...

Страница 154: ...iteration will in general reduce overhead and improve throughput Sometimes this may involve a comparison of the relative overhead of an iterative loop structure versus using REP prefix for iteration...

Страница 155: ...le application situation When applying general heuristics to the design of general purpose high performance library routines the following guidelines are useful when optimizing arbitrary size of c...

Страница 156: ...e address alignment is amortized over many iterations b an iterative approach using the instruction with largest data granularity where the overhead for SIMD feature detection iteration overhead prolo...

Страница 157: ...unt Size and 4 Byte Aligned Destination A C example of Memset Equivalent Implementation Using REP STOSD void memset void dst int c size_t size char d char dst size_t i for i 0 i size i d char c push e...
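
The byte loop in the flattened C fragment above can be contrasted with a fill that writes one doubleword per iteration, which is the granularity REP STOSD works at. This is a sketch under stated assumptions: `fill_bytes` and `fill_dwords` are hypothetical names, and the 4-byte `memcpy` is used only as a portable way to express a 32-bit store; whether a compiler lowers it to a single store is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-granular fill, one store per byte. */
void fill_bytes(void *dst, int c, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    for (size_t i = 0; i < size; i++)
        d[i] = (unsigned char)c;
}

/* Doubleword-granular fill: replicate the byte into a 32-bit pattern,
 * write 4 bytes per iteration, then finish the tail byte by byte. */
void fill_dwords(void *dst, int c, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    uint32_t pattern = 0x01010101u * (unsigned char)c;
    size_t i = 0;
    for (; i + 4 <= size; i += 4)
        memcpy(d + i, &pattern, 4);
    for (; i < size; i++)
        d[i] = (unsigned char)c;
}
```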

Страница 158: ...de high performance in the situations described above However using a REP prefix with string scan instructions scasb scasw scasd scasq or compare instructions cmpsb cmpsw cmpsd cmpsq is not recommende...

Страница 159: ...break a false dependence chain resulting from re use of registers In contexts where the condition codes must be preserved move 0 into the register instead This requires more code space than using xor...

Страница 160: ...Thus the operation can be tested directly by a jcc instruction The notable exceptions are mov and lea In these cases use test Assembly Compiler Coding Rule 53 ML impact M generality Eliminate unnecess...

Страница 161: ...s into a single register This same functionality can be obtained using movsd xmmreg1 mem movsd xmmreg2 mem 8 unpcklpd xmmreg1 xmmreg2 which uses fewer ops and can be packed into the trace cache more e...

Страница 162: ...mory operands can improve performance Instructions of the form OP REG MEM can reduce register pressure by taking advantage of scratch registers that are not available to the compiler Assembly Compiler...

Страница 163: ...esult Assembly Compiler Coding Rule 60 M impact M generality When an address of a store is unknown subsequent loads cannot be scheduled to execute out of order ahead of the store limiting the out of o...

Страница 164: ...aced in the dependence chain but there would also be a data not ready stall of the load costing further cycles Assembly Compiler Coding Rule 62 H impact MH generality For small loops placing loop inva...

Страница 165: ...decoders can each decode one macroinstruction per clock cycle assuming the instruction is one micro op up to seven bytes in length Instructions composed of more than four micro ops take multiple cycl...

Страница 166: ...te SIMD code avoid global pointers avoid global variables These may be less of a problem if all modules are compiled simultaneously and whole program optimization is used User Source Coding Rule 17 H...

Страница 167: ...d for the following operations 1 byte xchg EAX EAX 2 byte mov reg reg 3 byte lea reg 0 reg 8 bit displacement 6 byte lea reg 0 reg 32 bit displacement These are all true NOPs having no effect on the s...

Страница 168: ...o summarize the rules and suggestions specified in this chapter be reminded that coding recommendations are ranked in importance according to these two criteria Local impact referred to earlier as imp...

Страница 169: ...entium M processors and within a sector of 128 bytes on Pentium 4 and Intel Xeon processors 2 42 User Source Coding Rule 4 H impact ML generality Consider using a special memory allocation library to...

Страница 170: ...n range to avoid denormal values underflows 2 58 User Source Coding Rule 12 M impact ML generality Do not use double precision unless necessary Set the precision control PC field in the x87 FPU contro...

Страница 171: ...e of data in an earlier iteration happens lexically after the load of that data in a future iteration something which is called a lexically backward dependence 2 85 User Source Coding Rule 19 M impact...

Страница 172: ...ates a mismatch in calls and returns 2 21 Assembly Compiler Coding Rule 5 MH impact MH generality Selectively inline a function where doing so decreases code size or if the function is small and the c...

Страница 173: ...ently executed and that have a predictable number of iterations to reduce the number of iterations to 16 or fewer unless this increases code size so that the working set no longer fits in the trace ca...

Страница 174: ...rs as in register allocation and for parameter passing to minimize the likelihood and impact of store forwarding problems Try not to store forward data generated from a long latency instruction e g mu...

Страница 175: ...B pages or on separate aligned 1 KB subpages 2 47 Assembly Compiler Coding Rule 28 H impact L generality If an inner loop writes to more than four arrays four distinct cache lines apply loop fission t...

Страница 176: ...36 M impact L generality Try to use 32 bit operands rather than 16 bit operands for fild However do not do so at the expense of introducing a store forwarding problem by writing the two halves of the...

Страница 177: ...d of partial registers For moves this can be accomplished with 32 bit moves or by using movzx 2 76 Assembly Compiler Coding Rule 47 M impact M generality Try to use zero extension or operate on 32 bit...

Страница 178: ...1 xmmreg2 instruction instead 2 80 Assembly Compiler Coding Rule 53 ML impact L generality Instead of using movupd xmmreg1 mem for a unaligned 128 bit load use movsd xmmreg1 mem movsd xmmreg2 mem 8 un...

Страница 179: ...loading the memory adding two registers and storing the result 2 82 Assembly Compiler Coding Rule 58 M impact M generality When an address of a store is unknown subsequent loads cannot be scheduled t...

Страница 180: ...hat is not resident in the trace cache If a performance problem is clearly due to this problem try moving the data elsewhere or inserting an illegal opcode or a pause instruction immediately following...

Страница 181: ...lopment of advanced multimedia signal processing and modeling applications To take advantage of the performance opportunities presented by these new capabilities do the following Ensure that the proce...

Страница 182: ...ctive for programs that may be executed on different machines 3 Create a fat binary that includes multiple versions of routines versions that use SIMD technology and versions that do not Check for SIM...

Страница 183: ...tate introduced by SSE for your application to properly function To check whether your system supports SSE follow these steps 1 Check that your processor supports the cpuid instruction 2 Check the fea...

Страница 184: ...swer See Example 3 3 Example 3 2 Identification of SSE with cpuid identify existence of cpuid instruction identify signature is genuine intel mov eax 1 request for feature flags cpuid 0Fh 0A2h cpuid i...

Страница 185: ...shows how to find the SSE2 feature bit bit 26 in the cpuid feature flags SSE2 requires the same support from the operating system as SSE To find out whether the operating system supports SSE2 execute...

Страница 186: ...nts for SSE To check whether your system supports the x87 and SIMD instructions of SSE3 follow these steps 1 Check that your processor has the cpuid instruction 2 Check the ECX feature bit 0 of cpuid...

Страница 187: ...wer See Example 3 7 Detecting the availability of MONITOR and MWAIT instructions can be done using a code sequence similar to Example 3 6 the availability of MONITOR and MWAIT is indicated by bit 3 of...
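
The feature-bit checks described on the preceding pages can be summarized in C. This is a sketch only: it assumes the caller has already executed cpuid with EAX = 1 and passes in the resulting EDX and ECX values, and the struct and function names are hypothetical. The bit positions used are SSE (EDX bit 25), SSE2 (EDX bit 26), SSE3 (ECX bit 0) and MONITOR/MWAIT (ECX bit 3). Operating system support still has to be verified separately, as the text describes.

```c
#include <stdint.h>

/* Hypothetical container for the decoded feature flags. */
struct simd_features {
    int sse;      /* EDX bit 25 */
    int sse2;     /* EDX bit 26 */
    int sse3;     /* ECX bit 0  */
    int monitor;  /* ECX bit 3: MONITOR and MWAIT */
};

/* Decode the feature flags returned by cpuid with EAX = 1.
 * Obtaining edx and ecx is left to the caller. */
struct simd_features decode_leaf1(uint32_t edx, uint32_t ecx)
{
    struct simd_features f;
    f.sse     = (int)((edx >> 25) & 1u);
    f.sse2    = (int)((edx >> 26) & 1u);
    f.sse3    = (int)(ecx & 1u);
    f.monitor = (int)((ecx >> 3) & 1u);
    return f;
}
```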

Страница 188: ...treaming SIMD Extensions 3 2 Is this code integer or floating point 3 What integer word size or floating point precision is needed 4 What coding techniques should I use 5 What guidelines do I need to...

Страница 189: ...Integer Range or Precision If possible re arrange data for SIMD efficiency Integer Change to use SIMD Integer Yes Change to use Single Precision Can convert to Single precision Yes No No Align data s...

Страница 190: ...cation Performance Tools The VTune analyzer provides a hotspots view of a specific module to help you identify sections in your code that take the most CPU time and that have potential performance pro...

Страница 191: ...D technologies can be time consuming and difficult Likely candidates for conversion are applications that are highly computation intensive such as the following speech compression algorithms and filte...

Страница 192: ...alar code into code that can execute in parallel taking advantage of the SIMD architecture parallelism This section discusses the coding techniques available for an application to make use of the SIMD...

Страница 193: ...sive to write and maintain Performance objectives can be met by taking advantage of the different SIMD technologies using high level languages as well as assembly The new C C language extensions desig...

Страница 194: ...allows a simple replacement of the code with Streaming SIMD Extensions For the optimal use of the Streaming SIMD Extensions that need data alignment on the 16 byte boundary all examples in this chapt...

Страница 195: ...ovide the access to the ISA functionality using C C style coding instead of assembly language Intel has defined three sets of intrinsic functions that are implemented in the Intel C Compiler to suppor...

Страница 196: ...ription of the intrinsics and their use refer to the Intel C Compiler User s Guide Example 3 10 shows the loop from Example 3 8 using intrinsics The intrinsics map one to one with actual Streaming SIM...

Страница 197: ...eveloper s Manual Volumes 2A 2B Classes A set of C++ classes has been defined and is available in the Intel C++ Compiler to provide both a higher level abstraction and more flexibility for programming with MMX t...

Страница 198: ...mechanism by which loops such as in Example 3 8 can be automatically vectorized or converted into Streaming SIMD Extensions code The compiler uses similar techniques to those used by a programmer to...

Страница 199: ...memory to which the pointers point In other words the pointer for which it is used provides the only means of accessing the memory in question in the scope in which the pointers live Without the rest...
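
A short sketch of the restrict qualifier in standard C99 may help here. The function name and signature are hypothetical. The qualifier is a promise by the programmer that the three arrays do not overlap; it is not checked, and passing overlapping arrays gives undefined behavior.

```c
#include <stddef.h>

/* With restrict, the compiler may assume that a store to dst[i]
 * cannot change a[] or b[], which removes an obstacle to
 * vectorizing the loop. */
void scale_add(float *restrict dst,
               const float *restrict a,
               const float *restrict b,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * k + b[i];
}
```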

Страница 200: ...for improving data access such as padding organizing data elements into arrays etc are described below SSE3 provides a special purpose instruction LDDQU that can avoid cache line splits; it is discussed...

Страница 201: ...nce for accesses which span multiple cache lines The following declaration allows you to vectorize the scaling operation and further improve the alignment of the data access patterns short ptx N pty N...

Страница 202: ...or 128 bit SIMD Technologies For best performance the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16 byte 16B boundaries Unaligned data can...

Страница 203: ...sions and SSE2 see Appendix D Stack Alignment Data Alignment for MMX Technology Many compilers enable alignment of variables using controls This aligns the variables bit lengths to the appropriate bou...

Страница 204: ...e is keep the data 16 byte aligned Such alignment will also work for MMX technology code even though MMX technology only requires 8 byte alignment The following discussion and examples describe alignm...

Страница 205: ...100 objects of type __m128 or F32vec4 In the code below the construction of the F32vec4 object x will occur with aligned data void foo F32vec4 x __m128 buffer Without the declaration of __declspec ali...
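
The manual's examples use the compiler-specific __declspec(align(16)). As a portable illustration only, the same 16-byte alignment can be requested with the C11 `alignas` specifier. This swaps in a standard C mechanism and is not the manual's method; the buffer size of 400 floats (100 groups of four) and the helper name are assumptions.

```c
#include <stdalign.h>
#include <stdint.h>

/* Room for 100 groups of four floats, aligned to a 16-byte boundary
 * so that aligned 128-bit loads and stores are legal on it. */
alignas(16) static float buffer[400];

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```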

Страница 206: ...names m and f can be used as immediate member names of my__m128 Note that __declspec align has no effect when applied to a class struct or union member in either C or C Alignment by Using __m64 or dou...

Страница 207: ...tions see Chapter 6 Optimizing Cache Usage Data Structure Layout For certain algorithms like 3D transformations and lighting there are two basic ways of arranging the vertex data The traditional metho...

Страница 208: ...ertices int b NumOfVertices int c NumOfVertices VerticesList VerticesList Vertices Example 3 16 AoS and SoA Code Samples The dot product of an array of vectors Array and a fixed vector Fixed is a comm...
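
The two layouts and the dot product against a fixed vector can be sketched in scalar C as follows. The type names, the constant `NUM_VERTICES` and its value are placeholders, not the manual's declarations. In the SoA version each inner operation touches consecutive elements of one array, which is the access pattern that maps onto vertical SIMD computation.

```c
#include <stddef.h>

#define NUM_VERTICES 8   /* hypothetical size */

/* Array of structures: x, y, z of one vertex are adjacent. */
typedef struct { float x, y, z; } VertexAoS;

/* Structure of arrays: all x values are adjacent, then all y, then all z. */
typedef struct {
    float x[NUM_VERTICES];
    float y[NUM_VERTICES];
    float z[NUM_VERTICES];
} VerticesSoA;

void dot_aos(const VertexAoS *v, const float fixed[3], float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = v[i].x * fixed[0] + v[i].y * fixed[1] + v[i].z * fixed[2];
}

void dot_soa(const VerticesSoA *v, const float fixed[3], float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = v->x[i] * fixed[0] + v->y[i] * fixed[1] + v->z[i] * fixed[2];
}
```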

Страница 209: ...ures are generated see Chapters 4 and 5 for specific examples of swizzling code Performing the swizzle dynamically is usually better than using AoS addps xmm1 xmm0 xmm0 DC DC DC x0 xF z0 zF movaps xmm...

Страница 210: ...tions that consume SIMD execution slots but produce only a single scalar result as shown by the many don t care DC slots in Example 3 16 Use of the SoA format for data structures can also lead to more...

Страница 211: ...iables a b c would also be used together but not at the same time as x y z This hybrid SoA approach ensures data is organized to enable more efficient vertical SIMD computation simpler less address gener...

Страница 212: ...forms the loop structure twofold It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm It reduces the number of iterations of th...

Страница 213: ...he cache then the coordinates for v i that were cached during Transform v i will be evicted from the cache by the time we do Lighting v i This means that v i will have to be fetched from main memory a...
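
The difference between the two-pass loop and its strip-mined form can be sketched as follows. The `transform` and `lighting` bodies are trivial stand-ins and `STRIP_SIZE` is a hypothetical value; in practice the strip would be sized so that its data stays in the cache between the two passes.

```c
#include <stddef.h>

#define STRIP_SIZE 4   /* hypothetical; size it to fit the cache */

/* Stand-ins for the real per-vertex work. */
static void transform(float *v) { *v = *v * 2.0f; }
static void lighting(float *v)  { *v = *v + 1.0f; }

/* Original form: by the time lighting runs, early elements may
 * already have been evicted from the cache. */
void two_pass(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) transform(&v[i]);
    for (size_t i = 0; i < n; i++) lighting(&v[i]);
}

/* Strip-mined form: both passes run over one strip before moving on,
 * so each strip is reused while it is still cached. */
void strip_mined(float *v, size_t n)
{
    for (size_t start = 0; start < n; start += STRIP_SIZE) {
        size_t end = (start + STRIP_SIZE < n) ? start + STRIP_SIZE : n;
        for (size_t i = start; i < end; i++) transform(&v[i]);
        for (size_t i = start; i < end; i++) lighting(&v[i]);
    }
}
```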

Страница 214: ...ossible This technique transforms the memory domain of a given problem into smaller chunks rather than sequentially traversing through the entire memory domain Each chunk should be small enough to fit...

Страница 215: ...es Due to the limitation of cache capacity this line will be evicted due to conflict misses before the inner loop reaches the end For the next iteration of the outer loop another cache miss will be ge...

Страница 216: ...0 0 7 will be completely consumed by the first iteration of the outer loop Consequently B 0 0 7 will only experience one cache miss after applying loop blocking optimization in lieu of eight misses f...
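
Loop blocking for an update of the form A[i][j] = A[i][j] + B[j][i] can be sketched as below. The matrix size `N` and block size `BLOCK` are assumptions chosen for illustration, and `N` is assumed to be a multiple of `BLOCK`. Within one block, the eight elements of each cache line of B are consumed before the line can be evicted.

```c
#define N     16   /* hypothetical matrix dimension */
#define BLOCK 8    /* hypothetical block size; N must be a multiple */

/* Unblocked: B is walked column-wise, touching a new cache line on
 * every inner iteration. */
void add_transposed(float A[N][N], float B[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i][j] + B[j][i];
}

/* Blocked: the same work is done in BLOCK x BLOCK tiles so that the
 * lines of A and B touched by a tile are reused while cached. */
void add_transposed_blocked(float A[N][N], float B[N][N])
{
    for (int i = 0; i < N; i += BLOCK)
        for (int j = 0; j < N; j += BLOCK)
            for (int ii = i; ii < i + BLOCK; ii++)
                for (int jj = j; jj < j + BLOCK; jj++)
                    A[ii][jj] = A[ii][jj] + B[jj][ii];
}
```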

Страница 217: ...ection The following section gives some guidelines for choosing instructions to complete a task One barrier to SIMD computation can be the existence of data dependent branches Conditional moves can be...

Страница 218: ...zing SIMD code targeting Intel Core Solo and Intel Core Duo processors The register register variant of the following instructions has improved performance on Intel Core Solo and Intel Core Duo proces...

Страница 219: ...ectly is to use a profiler that measures the application while it is running on a system VTune analyzer can help you determine where to make changes in your application to improve performance Using th...

Страница 221: ...rred to as SIMD integer instructions Unless otherwise noted the following sequences are written for the 64 bit integer registers Note that they can easily be adapted to use the 128 bit SIMD integer fo...

Страница 222: ...ns to ensure that the input operands in XMM registers contain properly defined data type to match the instruction Code sequences containing cross typed usage will produce the same result across differ...

Страница 223: ...unrelated to the x87 FP MMX registers The emms instruction is not needed to transition to or from SIMD floating point operations or 128 bit SIMD operations Using the EMMS Instruction When generating 6...

Страница 224: ...g point and 64 bit SIMD integer instructions follow these steps 1 Always call the emms instruction at the end of 64 bit SIMD integer code when the code transitions to x87 floating point code 2 Insert...

Страница 225: ...ge Further you must be aware that your code generates an MMX instruction which uses the MMX registers with the Intel C Compiler in the following situations when using a 64 bit SIMD integer intrinsic f...

Страница 226: ...wer performance than using 16 byte aligned references Refer to Stack and Data Alignment in Chapter 3 for more information Data Movement Coding Techniques In general better performance can be achieved...

Страница 227: ...ssumes the source is a packed word 16 bit data type Example 4 2 Unsigned Unpack Instructions Input MM0 source value MM7 0 a local variable can be used instead of the register MM7 if desired Output MM0...

Страница 228: ...ck Code Input MM0 source value Output MM0 two sign extended 32 bit doublewords from the two low end words MM1 two sign extended 32 bit doublewords from the two high end words movq MM1 MM0 copy source...

Страница 229: ...operation The two signed doublewords are used as source operands and the result is interleaved signed words The pack instructions can be performed with or without saturation as needed Figure 4 1 PACK...

Страница 230: ...ion of the MMX instruction set can be found in the Intel Architecture MMX Technology Programmer s Reference Manual order number 243007 Interleaved Pack without Saturation Example 4 5 is similar to Exa...

Страница 231: ...djacent elements of a packed word data type in source2 and place this value in the high 32 bits of the results One of the destination registers will have the combination illustrated in Figure 4 3 Exam...

Страница 232: ...in a non interleaved way The goal is to use the instruction which unpacks doublewords to a quadword instead of using the instruction which unpacks words to doublewords Figure 4 3 Result of Non Interl...

Страница 233: ...interleaved Way Input MM0 packed word source value MM1 packed word source value Output MM0 contains the two low end words of the original sources non interleaved MM2 contains the two high end words o...

Страница 234: ...of the immediate constant Insertion is done in such a way that the three other words from the destination register are left untouched see Figure 4 6 and Example 4 8 Figure 4 5 pextrw Instruction Examp...

Страница 235: ...ntent and break the dependence chain by either using the pxor instruction or loading the register See the Clearing Registers section in Chapter 2 Figure 4 6 pinsrw Instruction Example 4 8 pinsrw Instr...

Страница 236: ...XMM registers it produces a 16 bit mask zeroing out the upper 16 bits in the destination register The 64 bit version is shown in Figure 4 7 and Example 4 10 Example 4 9 Repeated pinsrw Instruction Co...

Страница 237: ...ure 4 7 pmovmskb Instruction Example Example 4 10 pmovmskb Instruction Code Input source value Output 32 bit register containing the byte mask in the lower eight bits movq mm0 edi pmovmskb eax mm0 OM1...
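
The effect of the 64-bit pmovmskb can be modeled in scalar C. This is an emulation for illustration, not an intrinsic; the function name is hypothetical. Bit i of the result is the most significant bit of source byte i, and the upper bits of the result are zero.

```c
#include <stdint.h>

/* Scalar model of pmovmskb on a 64-bit operand: gather the sign bit
 * of each of the eight bytes into the low eight bits of the result. */
unsigned movemask_bytes(const uint8_t src[8])
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (unsigned)(src[i] >> 7) << i;
    return mask;
}
```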

Страница 238: ...he immediate value encode the source for destination word 0 in MMX register 15 0 and so on as shown in the table Bits 7 and 6 encode for word 3 in MMX register 63 48 Similarly the 2 bit encoding repre...
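
The immediate encoding can be made concrete with a scalar model of the word shuffle. The function below is an illustrative emulation with a hypothetical name: bits 1:0 of the immediate select the source word for destination word 0, bits 3:2 for word 1, bits 5:4 for word 2, and bits 7:6 for word 3.

```c
#include <stdint.h>

/* Scalar model of a 4-word shuffle controlled by an 8-bit immediate.
 * A temporary is used so that dst and src may be the same array. */
void shuffle_words(uint16_t dst[4], const uint16_t src[4], uint8_t imm)
{
    uint16_t tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

For example, an immediate of 0x1B (binary 00 01 10 11) reverses the four words, and 0x00 broadcasts word 0 to all positions.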

Страница 239: ...urce to any double word field in the 128 bit result using an 8 bit immediate operand No more than 3 instructions using pshuflw pshufhw pshufd are required to implement some common data shuffling opera...

Страница 240: ...register The high low order 64 bits of the source operands are ignored Example 4 13 Swap Using 3 Instructions Goal Swap the values in word 6 and word 1 Instruction Result 7 6 5 4 3 2 1 0 PSHUFD 3 0 1...

Страница 241: ...de conversion of single precision data to from double word integer data Also conversions between double precision data and double word integer data have been added Generating Constants The SIMD intege...

Страница 242: ...ld MM1 32 n two instructions above generate the signed constant 2n 1 in every packed word or packed dword field pcmpeq MM1 MM1 psllw MM1 n pslld MM1 n two instructions above generate the signed consta...
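
The arithmetic behind these two-instruction sequences can be checked on a single 16-bit element. The sketch below models only one word lane; the function names are hypothetical. Starting from all ones (the result of comparing a register with itself), a logical right shift by 16 - n leaves 2^n - 1, and a left shift by n leaves the bit pattern of -2^n. The final conversion to `int16_t` assumes the usual two's complement representation.

```c
#include <stdint.h>

/* All ones shifted right by (16 - n): yields 2^n - 1, for n in 1..16. */
uint16_t low_ones(unsigned n)
{
    return (uint16_t)(0xFFFFu >> (16 - n));
}

/* All ones shifted left by n: yields -2^n as a signed word, n in 0..15. */
int16_t neg_pow2(unsigned n)
{
    return (int16_t)(uint16_t)(0xFFFFu << n);
}
```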

Страница 243: ...operands and subtracts them with UNSIGNED saturation This support exists only for packed bytes and packed words not for packed doublewords This example will not work if the operands are signed Note th...
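
The unsigned absolute difference relies on the fact that, with unsigned saturation, one of the two subtractions a - b and b - a is clamped to zero, so OR-ing the two results leaves the magnitude of the difference. A scalar model for one byte lane, with hypothetical function names:

```c
#include <stdint.h>

/* Unsigned saturating subtract: results below zero clamp to 0. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : (uint8_t)0;
}

/* |a - b| for unsigned bytes: one of the two terms is always 0. */
uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}
```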

Страница 244: ...sorting technique that uses the fact that B xor A xor A B and A xor A 0 Thus in a packed data type having some elements being xor A B and some being 0 you could xor such an operand with A and receive...

Страница 245: ...M0 create a mask of 0s and xor A B elements Where A B there will be a value xor A B and where A B there will be 0 pxor MM4 MM2 minima xor A swap mask pxor MM1 MM2 maxima xor B swap mask psubw MM1 MM4...
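
The xor-based sort can be followed on one signed word. This is a scalar sketch with hypothetical names: the comparison mask is all ones where A > B, the swap mask is then either xor(A, B) or 0, and xor-ing it into A and B yields the minimum and maximum without a branch. The conversion back to `int16_t` assumes two's complement representation.

```c
#include <stdint.h>

/* Branch-free ordering of one pair of signed words using
 * B = A xor (A xor B) and A xor 0 = A. */
void sort_pair_i16(int16_t a, int16_t b, int16_t *min, int16_t *max)
{
    uint16_t gt   = (uint16_t)-(a > b);          /* 0xFFFF where a > b */
    uint16_t swap = (uint16_t)(((uint16_t)a ^ (uint16_t)b) & gt);
    *min = (int16_t)((uint16_t)a ^ swap);        /* b if a > b, else a */
    *max = (int16_t)((uint16_t)b ^ swap);        /* a if a > b, else b */
}
```

Subtracting the minimum from the maximum then gives the absolute difference, as in the last step of the instruction sequence.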

Страница 246: ...ues For simplicity we use the following constants corresponding constants are used in case the operation is done on byte values packed_max equals 0x7fff7fff7fff7fff packed_min equals 0x800080008000800...

Страница 247: ...ipping to a Signed Range of Words high low Input MM0 signed source operands Output MM0 signed words clipped to the signed range high low pminsw MM0 packed_high pmaxsw MM0 packed_low Example 4 20 Clipp...

Страница 248: ...he second instruction psubssw MM0 0xffff high low in the three step algorithm Example 4 21 is executed a negative number is subtracted The result of this subtraction causes the values in MM0 to be inc...

Страница 249: ...our signed words in either two SIMD registers or one SIMD register and a memory location The pminsw instruction returns the minimum between the four signed words in either two SIMD registers or one SI...

Страница 250: ...he pmulhuw and pmulhw instructions multiply the unsigned signed words in the destination operand with the unsigned signed words in the source operand The high order 16 bits of the 32 bit intermediate...

Страница 251: ...ce operand to the unsigned data elements of the destination register along with a carry in The results of the addition are then each independently shifted to the right by one bit position The high ord...
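
The packed average described here amounts to (a + b + 1) >> 1 computed in a wider intermediate, so the carry-in rounds halves upward and the sum cannot overflow. A scalar model of one unsigned byte lane, with a hypothetical function name:

```c
#include <stdint.h>

/* Rounded average of two unsigned bytes: add with a carry-in of 1
 * in a wider type, then shift right by one bit. */
uint8_t avg_round_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}
```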

Страница 252: ...that the 64 bit MMX registers are being used Let the input data be Dr and Di where Dr is real component of the data and Di is imaginary component of the data Format the constant complex coefficients i...

Страница 253: ...within each 64 bit chunk from the two sources the 64 bit result from each computation is written to the destination register Like the integer ADD SUB instruction PADDQ PSUBQ can operate on either unsi...

Страница 254: ...t loads a 64 bit or 128 bit operand for example movq MM0 m64 the register memory form of any SIMD integer instruction that operates on a quadword or double quadword memory operand for example pmaddw M...

Страница 255: ...words are stored and then words or doublewords are read from the same area of memory When you change the code sequence as shown in Example 4 25 the processor can access the data without delay Example...

Страница 256: ...when doublewords or words are stored and then words or bytes are read from the same area of memory When you change the code sequence as shown in Example 4 27 the processor can access the data without...

Страница 257: ...this situation is when each line in a video frame is averaged by shifting horizontally half a pixel Example 4 28 shows a common operation in video processing that loads data from memory address not a...

Страница 258: ...ation dependent one or more LDDQU is designed for programming usage of loading data from memory without storing modified data back to the same address Thus the usage of LDDQU should be restricted to s...

Страница 259: ...less the same from a memory bandwidth perspective However using many smaller loads consumes more microarchitectural resources than fewer larger loads Consuming too many of these resources can cause t...

Страница 260: ...iciency of the bus transactions By aligning the stores to the size of the stores you eliminate the possibility of crossing a cache line boundary and the stores will not be split into separate transact...

Страница 261: ...he 64 bit integer counterpart Extension of the pshufw instruction shuffle word across 64 bit integer operand across a full 128 bit operand is emulated by a combination of the following instructions ps...

Страница 262: ...tion latency XMM registers can provide twice the space to store data for in flight execution Wider XMM registers can facilitate loop unrolling or in reducing loop overhead by halving the number of loo...

Страница 263: ...ating point code containing SIMD floating point instructions Generally it is important to understand and balance port utilization to create efficient SIMD floating point code The basic rules and sugge...

Страница 264: ...oint instructions to achieve optimum performance gain requires programmers to consider several issues In general when choosing candidates for optimization look for code segments that are computational...

Страница 265: ...64 bit SIMD integer code Scalar Floating point Code There are SIMD floating point instructions that operate only on the least significant operand in the SIMD register These instructions are known as s...

Страница 266: ...tion Data Arrangement Because the SSE and SSE2 incorporate a SIMD architecture arranging the data to fully use the SIMD registers produces optimum performance This implies contiguous data for processi...

Страница 267: ...ng of parallel data elements i e the destination of each element is the result of a common arithmetic operation of the input operands in the same vertical position This is shown in the diagram below T...

Страница 268: ...it is common to use only a subset of the vector components for example in 3D graphics the W component is sometimes ignored This means that for single vector operations 1 of 4 computation slots is not...

Страница 269: ...arithmetic operation Vertical computation takes advantage of the inherent parallelism in 3D geometry processing of vertices It assigns the computation of four vertices to the four compute slots of th...

Страница 270: ...Operation Example 5 1 Pseudocode for Horizontal xyz AoS Computation mulps x x y y z z movaps reg reg move since next steps overwrite shufps get b a d c from a b c d addps get a b a b c d c d movaps r...

Страница 271: ...the SIMD registers This operation is referred to as swizzling operation and the reverse operation is referred to as deswizzling Data Swizzling Swizzling data from one format to another may be required...

Страница 272: ...shuffle the yyyy by using another shuffle The zzzz is derived the same way but only requires one shuffle Example 5 3 illustrates the swizzle function Example 5 3 Swizzling Data typedef struct _VERTEX_...

Страница 273: ...m0 0xDD xmm6 y1 y2 y3 y4 Y movlps xmm2 ecx 8 xmm2 w1 z1 movhps xmm2 ecx 24 xmm2 w2 z2 u1 z1 movlps xmm1 ecx 40 xmm1 s3 z3 movhps xmm1 ecx 56 xmm1 w4 z4 w3 z3 movaps xmm0 xmm2 xmm0 w1 z1 w1 z1 shufps x...

Страница 274: ..._mm_loadh_pi x __m64 stride char in y _mm_loadl_pi y __m64 2 stride char in y _mm_loadh_pi y __m64 3 stride char in tmp _mm_shuffle_ps x y _MM_SHUFFLE 2 0 2 0 y _mm_shuffle_ps x y _MM_SHUFFLE 3 1 3 1...

Страница 275: ...registers The same situation can occur for the above movhps movlps shufps sequence Since each movhps movlps instruction bypasses part of the destination register the instruction cannot execute until t...

Страница 276: ...mple 5 5 illustrates the deswizzle function Example 5 5 Deswizzling Single Precision SIMD Data void deswizzle_asm Vertex_soa in Vertex_aos out __asm mov ecx in load structure addresses mov edx out mov...

Страница 277: ...cklps xmm5 xmm4 xmm5 z1 w1 z2 w2 unpckhps xmm0 xmm4 xmm0 z3 w3 z4 w4 movlps edx 8 xmm5 v1 x1 y1 z1 w1 movhps edx 24 xmm5 v2 x2 y2 z2 w2 movlps edx 40 xmm0 v3 x3 y3 z3 w3 movhps edx 56 xmm0 v4 x4 y4 z4...

Страница 278: ...mm2 xmm7 0xDD xmm2 r4 g4 b4 a4 shufps xmm1 xmm3 0x88 xmm4 r1 g1 b1 a1 shufps xmm5 xmm3 0x88 xmm5 r2 g2 b2 a2 shufps xmm6 xmm7 0xDD xmm6 r3 g3 b3 a3 movaps edx xmm4 v1 r1 g1 b1 a1 movaps edx 16 xmm5 v2...

Страница 279: ...zled into AoS layout uv for the graphic cards to process you can use either the SSE or MMX technology code Using the MMX instructions allows you to conserve XMM registers for other computational tasks...

Страница 280: ...r parts of each register This allows you to use a vertical add With the resulting partial horizontal summation full summation follows easily Figure 5 3 schematically presents horizontal add using movh...

Страница 281: ...2 A3 A4 xmm0 movaps xmm1 ecx 16 load B1 B2 B3 B4 xmm1 movaps xmm2 ecx 32 load C1 C2 C3 C4 xmm2 movaps xmm3 ecx 48 load D1 D2 D3 D4 xmm3 continued A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4 A1 A3...

Страница 282: ...2 B4 movaps xmm4 xmm2 movlhps xmm2 xmm3 xmm2 C1 C2 D1 D2 movhlps xmm3 xmm4 xmm3 C3 C4 D3 D4 addps xmm3 xmm2 xmm3 C1 C3 C2 C4 D1 D3 D2 D4 movaps xmm6 xmm3 xmm6 C1 C3 C2 C4 D1 D3 D2 D4 shufps xmm3 xmm5...

Страница 283: ...4 __m128 tmm0 tmm1 tmm2 tmm3 tmm4 tmm5 tmm6 Temporary variables tmm0 _mm_load_ps in x tmm0 A1 A2 A3 A4 tmm1 _mm_load_ps in y tmm1 B1 B2 B3 B4 tmm2 _mm_load_ps in z tmm2 C1 C2 C3 C4 tmm3 _mm_load_ps in...

Страница 284: ...tions in Chapter 2 SIMD Floating point Programming Using SSE3 SSE3 enhances SSE and SSE2 with 9 instructions targeted for SIMD floating point programming In contrast to many SSE and SSE2 instructions...

Страница 285: ...of complex numbers. For example, a complex number can be stored in a structure consisting of its real and imaginary parts. This naturally leads to the use of an array of structures. Example 5-11 demonstrate...
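
For reference, the array-of-structures layout mentioned above and the arithmetic that a SIMD version has to reproduce can be sketched in scalar C. The type name `complex_f` and the function `complex_mul_aos` are hypothetical; this is a plain reference loop, not the SSE3 listing from the manual.

```c
#include <stddef.h>

/* Array-of-structures element: one complex number per struct. */
typedef struct {
    float re;  /* real part */
    float im;  /* imaginary part */
} complex_f;

/* Hypothetical scalar reference: out[i] = x[i] * y[i] for n elements,
 * using (a + ib)(c + id) = (ac - bd) + i(ad + bc). */
void complex_mul_aos(const complex_f *x, const complex_f *y,
                     complex_f *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float a = x[i].re, b = x[i].im;
        float c = y[i].re, d = y[i].im;
        out[i].re = a * c - b * d;
        out[i].im = a * d + b * c;
    }
}
```

A scalar loop like this is useful as a correctness baseline when validating a vectorized implementation.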

Страница 286: ...perform complex arithmetic on double-precision complex numbers. In the case of double-precision complex arithmetic, multiplication or division is done on one pair of complex numbers at a time. Example...

Страница 287: ...d0 b0c0 shufps xmm1 xmm1 b1 reorder the real and imaginary parts c1 d1 c0 d0 movsldup xmm2 Src1 load the real parts into the destination a1 a1 a0 a0 mulps xmm2 xmm1 temp results a1c1 a1d1 a0c0 a0d0 ad...

Страница 288: ...stored as AOS. The use of HADDPS adds more flexibility to use SIMD instructions and eliminates the need to insert data swizzling and deswizzling code sequences. Example 5-13 Calculating Dot Products fr...

Страница 289: ...um M processor depends on several factors Generally code that is decoder bound and or has a mixture of integer and packed floating point instructions can expect significant gain Code that is limited b...

Страница 290: ...uctions on Intel Core Solo and Intel Core Duo processors This is because scalar SSE2 instructions can be dispatched through two ports and executed using two separate floating point units Packed horizo...

Страница 291: ...ed can be fetched from the processor caches or if memory traffic can take advantage of hardware prefetching effectively Standard techniques to bring data into the processor before it is needed involve...

Страница 292: ...hardware prefetcher's ability to prefetch data that are accessed in a regular pattern with access strides that are substantially smaller than half of the trigger distance of the hardware prefetch, see...

Страница 293: ...etchnta Balance single pass versus multi pass execution An algorithm can use single or multi pass execution defined as follows single pass or unlayered execution passes a single data element through a...

Страница 294: ...est performance software prefetch instructions must be interspersed with other computational instructions in the instruction sequence rather than clustered together Hardware Prefetching of Data The Pe...

Страница 295: ...ams in the backward direction The hardware prefetcher of Intel Core Solo processor can track 16 forward streams and 4 backward streams On the Intel Core Duo processor the hardware prefetcher in each c...

Страница 296: ...s pattern to suit the automatic hardware prefetch mechanism Software Data Prefetch The prefetch instruction can hide the latency of data access in performance critical sections of application code by...

Страница 297: ...g cache pollution and by using the caches and memory efficiently This is particularly important for applications that share critical system resources such as the memory bus See an example in the Video...

Страница 298: ...ium 4 processor prefetcht1 Identical to prefetcht0 prefetcht2 Identical to prefetcht0 Prefetch and Load Instructions The Pentium 4 processor has a decoupled execution and memory architecture that allo...

Страница 299: ...of the advantages may change in the future In addition there are cases where a prefetch instruction will not perform the data prefetch These include the prefetch causes a DTLB Data Translation Lookasi...

Страница 300: ...cacheable and not write allocating stored data is written around the cache and will not generate a read for ownership bus request for the corresponding cache line Fencing Because streaming stores are...

Страница 301: ...memory type for the region is retained If the programmer specifies the weakly ordered uncacheable memory type of Write Combining WC then the non temporal store and the region have the same semantics a...

Страница 302: ...Appropriate use of synchronization and a fencing operation see The fence Instructions later in this chapter must be performed for producer consumer usage models Fencing ensures that all system agents...

Страница 303: ...etween processors Within a single processor system the CPU can also re read the same memory location and be assured of coherence that is a single consistent view of this memory location the same is tr...

Страница 304: ...eaming store Streaming Store Instruction Descriptions The movntq movntdq non temporal store of packed integer in an MMX technology or Streaming SIMD Extensions register instructions store data from a...

Страница 305: ...nce store fence instruction makes it possible for every store instruction that precedes the sfence instruction in program order to be globally visible before any store instruction that follows the sfe...

Страница 306: ...e instruction provides a means of segregating certain load instructions from other loads The mfence Instruction The mfence memory fence instruction makes it possible for every load and store instructi...

Страница 307: ...ssion checking and faults associated with a byte load clflush is an unordered operation with respect to other memory traffic including other clflush instructions Software should use a mfence memory fe...

Страница 308: ...odes in the cache hierarchy The software controlled prefetch is not intended for prefetching code Using it can incur significant penalties on a multiprocessor system when code is shared Software prefe...

Страница 309: ...automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes Start up penalty before hardware pr...

Страница 310: ...l array to alter the concentration of small stride cache misses at the expense of large stride cache misses to take advantage of the automatic hardware prefetcher Example of Effective Latency Reductio...

Страница 311: ...ister char p char next Populating pArray for circular pointer chasing with constant access stride p char p loads a value pointing to next load p char pArray for i 0 i aperture i stride p char pArray i...
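
The listing fragment above populates an array for circular pointer chasing with a constant access stride. A self-contained sketch of that setup follows. The function names `build_pointer_chain` and `chain_offset_after`, and the parameter checks, are assumptions made for illustration and do not come from the manual.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocates `aperture` bytes and stores, at every
 * multiple of `stride`, a pointer to the next slot; the last slot points
 * back to the first. Returns NULL on bad parameters or allocation failure.
 * The caller releases the buffer with free(). */
char *build_pointer_chain(size_t aperture, size_t stride)
{
    if (stride < sizeof(char *) || stride % sizeof(char *) != 0 ||
        aperture < stride) {
        return NULL;
    }
    char *buf = malloc(aperture);
    if (buf == NULL) {
        return NULL;
    }
    size_t slots = aperture / stride;
    for (size_t i = 0; i < slots; i++) {
        char *next = buf + ((i + 1) % slots) * stride;
        *(char **)(void *)(buf + i * stride) = next;
    }
    return buf;
}

/* Follows the chain `steps` times (each load depends on the previous one)
 * and returns the byte offset of the slot reached. */
size_t chain_offset_after(char *start, size_t steps)
{
    char *p = start;
    while (steps-- > 0) {
        p = *(char **)(void *)p;
    }
    return (size_t)(p - start);
}
```

Because every load address comes from the previous load, the traversal time per step approximates the access latency for the chosen stride and aperture.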

Страница 312: ...gures show two separate pipelines an execution pipeline and a memory pipeline front side bus Since the Pentium 4 processor similarly to the Pentium II and Pentium III processors completely decouples t...

Страница 313: ...s Latency and Execution Without Prefetch. Figure 6-3 Memory Access Latency and Execution With Prefetch [figure labels: execution units idle, memory latency, issue loads, time]...

Страница 314: ...in the pipelines and thus the best possible performance can be achieved Prefetching is useful for inner loops that have heavy computations or are close to the boundary between being compute bound and...

Страница 315: ...d Since prefetch distance is not a well defined metric for this discussion we define a new term prefetch scheduling distance PSD which is represented by the number of iterations For large loops prefet...

Страница 316: ...3D vertices in strip format is used as an example A strip contains a list of vertices whose predefined vertex order forms contiguous triangles It can be easily observed that the memory pipe is de pipe...

Страница 317: ...prefetch insertion the loads from the first iteration of an inner loop can miss the cache and stall the execution pipeline waiting for data returned thus degrading the performance In the code of Exam...

Страница 318: ...suffers any memory access latency penalty assuming the computation time is larger than the memory latency Inserting a prefetch of the first data element needed prior to entering the nested loop comput...

Страница 319: ...number of prefetches required Figure 6 4 presents a code example which implements prefetch and unrolls the loop to remove the redundant prefetch instructions whose prefetch addresses hit the previous...
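
The idea of unrolling a loop so that only one prefetch is issued per cache line can be sketched as follows. This sketch assumes a 64-byte line (16 floats), a prefetch distance of four lines, and the GCC/Clang builtin `__builtin_prefetch` in place of a `prefetchnta` instruction; all three are illustrative assumptions, and the function name `sum_with_prefetch` is hypothetical.

```c
#include <stddef.h>

#define FLOATS_PER_LINE 16                     /* assumed 64-byte cache line */
#define PREFETCH_AHEAD  (4 * FLOATS_PER_LINE)  /* assumed prefetch distance  */

/* Hypothetical example: sums n floats, issuing one prefetch per cache
 * line instead of one per element. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Main loop: one iteration consumes one full cache line. */
    for (; i + FLOATS_PER_LINE <= n; i += FLOATS_PER_LINE) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0 (read), locality = 0 (non-temporal hint) */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 0);
        }
        for (size_t k = 0; k < FLOATS_PER_LINE; k++) {
            sum += data[i + k];
        }
    }

    /* Remainder: fewer than one line of elements left. */
    for (; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

Without the unrolling, a prefetch inside a per-element loop would target the same line sixteen times, and the fifteen redundant prefetches would only consume issue bandwidth.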

Страница 320: ...tore streams Each load and store stream accesses one 128 byte cache line each per iteration 2 The amount of computation per loop This is varied by increasing the number of dependent arithmetic operati...

Страница 321: ...ive loop latency [chart residue: axis "% of Bus Utilized"; series 16, 32, 64, 128, none; panel "Bus Utilization, One load and one store stream"]...

Страница 322: ...ed together If possible they should also be placed apart from loads This improves the instruction level parallelism and reduces the potential instruction resource stalls In addition this mixing reduce...

Страница 323: ...x 18128 prefetchnta ebx 19128 prefetchnta ebx 20128 movps xmm1 ebx addps xmm2 ebx 3000 mulps xmm3 ebx 4000 addps xmm1 ebx 1000 addps xmm2 ebx 3016 mulps xmm1 ebx 2000 mulps xmm1 xmm2 add ebx 128 cmp e...

Страница 324: ...mensions can be applied for a better memory performance If an application uses a large data set that can be reused across multiple passes of a loop it will benefit from strip mining data sets larger t...
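
Strip mining for a data set that is reused across multiple passes can be sketched as below. The two passes (`transform_pass`, `lighting_pass`) are placeholder computations, and the function name `process_strip_mined` and the `strip` parameter are assumptions for illustration. In practice the strip length would be chosen so that one strip fits in the targeted cache level.

```c
#include <stddef.h>

/* Placeholder first pass: stands in for a vertex transformation. */
static void transform_pass(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        v[i] = v[i] * 2.0f;
    }
}

/* Placeholder second pass: stands in for a lighting computation. */
static void lighting_pass(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        v[i] = v[i] + 1.0f;
    }
}

/* Hypothetical driver: runs both passes strip by strip so that the data
 * touched by the first pass is still cached when the second pass runs. */
void process_strip_mined(float *v, size_t n, size_t strip)
{
    if (strip == 0) {
        strip = 1;  /* avoid an infinite loop on a zero strip length */
    }
    for (size_t start = 0; start < n; start += strip) {
        size_t len = (n - start < strip) ? (n - start) : strip;
        transform_pass(v + start, len);
        lighting_pass(v + start, len);
    }
}
```

The result is identical to running each pass over the whole array; only the order of memory accesses changes.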

Страница 325: ...by pass m 1 requiring data re fetch into the first level cache and perhaps the second level cache if a later pass reuses the data If both data sets fit into the second level cache load operations in p...

Страница 326: ...l cache pollution Use prefetchnta if the data is only touched once during the entire execution pass in order to minimize cache pollution in the higher level caches This provides instant availability a...

Страница 327: ...data access patterns of a 3D geometry engine first without strip mining and then incorporating strip mining Note that 4 wide SIMD instructions of Pentium III processor can process 4 vertices per ever...

Страница 328: ...y of second level cache during the strip mined transformation loop and reused in the lighting loop Keeping data in the cache reduces both bus traffic and the number of prefetches used Example 6 8 Data...

Страница 329: ...iple times and some of the read once memory references An example of the situations of read once memory references can be illustrated with a matrix or image transpose reading from a column first orien...

Страница 330: ...t level cache Maximizing the width of each tile for memory read Example 6 9 Using HW Prefetch to Improve Read Once Memory Traffic a Un optimized image transpose dest and src represent two dimensional...
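
A tiled image transpose along the lines discussed above can be sketched in portable C. The function name `transpose_tiled`, the row-major layout, and the `tile` parameter are assumptions for illustration; the manual's own example and its tile dimensions are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical example: transposes a rows x cols row-major image `src`
 * into a cols x rows row-major image `dst`, one tile at a time, so that
 * the lines read from `src` are reused before they are evicted. */
void transpose_tiled(const unsigned char *src, unsigned char *dst,
                     size_t rows, size_t cols, size_t tile)
{
    if (tile == 0) {
        tile = 1;
    }
    for (size_t r0 = 0; r0 < rows; r0 += tile) {
        size_t r1 = (rows - r0 < tile) ? rows : r0 + tile;
        for (size_t c0 = 0; c0 < cols; c0 += tile) {
            size_t c1 = (cols - c0 < tile) ? cols : c0 + tile;
            /* Within a tile, write dst row by row (sequential stores). */
            for (size_t c = c0; c < c1; c++) {
                for (size_t r = r0; r < r1; r++) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
}
```

Making the tile wider along the read direction increases the amount of each fetched line that is consumed before moving on.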

Страница 331: ...ten easier to use when implementing a general purpose API where the choice of code paths that can be taken depends on the specific combination of features selected by the application for example for 3...

Страница 332: ...n be better suited to applications which limit the number of features that may be used at a given time A single pass approach can reduce the amount of data copying that can occur with a multi pass eng...

Страница 333: ...isturbing the cache hierarchy To manage which data structures remain in the cache and which are transient Detailed implementations of these usage models are covered in the following sections Non tempo...

Страница 334: ...ocessor Although the SWWC method requires explicit instructions for performing temporary writes and reads this ensures that the transaction on the front side bus causes line transaction rather than se...

Страница 335: ...prefetchnta instruction brings data into only one way of the second level cache thus reducing pollution of the second level cache If the data brought directly to second level cache is not re used then...

Страница 336: ...lication does not gain performance significantly from having data ready from prefetches it can improve from more efficient use of the second level cache and memory Such design reduces the encoder s de...

Страница 337: ...improve performance of the translation of a virtual memory address to a physical memory address by providing fast access to page table entries If memory pages are accessed and the page table entry is...

Страница 338: ...e the cost of the loop 2 Loads the data into an xmm register using the _mm_load_ps intrinsic 3 Transfers the 8 byte data to a different memory location via the _mm_stream intrinsics bypassing the cach...

Страница 339: ...py 128 byte per loop for j kk j kk NUMPERPAGE j 16 _mm_stream_ps float b j _mm_load_ps float a j _mm_stream_ps float b j 2 _mm_load_ps float a j 2 _mm_stream_ps float b j 4 _mm_load_ps float a j 4 _mm...

Страница 340: ...se the implementation of streaming stores on Pentium 4 processor writes data directly to memory maintaining cache coherency Using 16 byte Streaming Stores and Hardware Prefetch An alternate technique...

Страница 341: ...m1 esi ecx 16 movdqa xmm2 esi ecx 32 movdqa xmm3 esi ecx 48 movdqa xmm4 esi ecx 64 movdqa xmm5 esi ecx 16 64 movdqa xmm6 esi ecx 32 64 movdqa xmm7 esi ecx 48 64 movntdq edi ecx xmm0 movntdq edi ecx 16...

Страница 342: ...or A comparison of the two coding techniques discussed above and two un optimized techniques is shown in Table 6 2 add esi ecx add edi ecx sub edx ecx jnz main_loop sfence Table 6 2 Relative Performan...

Страница 343: ...tores. Increases in bus speed are the primary contributor to throughput improvements. The technique shown in Example 6-12 will likely take advantage of the faster bus speed in the platform more efficient...

Страница 344: ...and single core processors Table 6 3 Deterministic Cache Parameters Leaf Bit Location Name Meaning EAX 4 0 Cache Type 0 Null No more caches 1 Data Cache 2 Instruction Cache 3 Unified Cache 4 31 Reserv...
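
The Cache Type field listed in Table 6-3 occupies bits 4:0 of EAX. Decoding it from a raw register value can be sketched as follows. The enum and function names are hypothetical, and the sketch takes the EAX value as a parameter rather than executing CPUID itself.

```c
#include <stdint.h>

/* Cache Type encodings from the table: bits 4:0 of EAX. */
enum cache_type {
    CACHE_TYPE_NULL        = 0,  /* no more caches */
    CACHE_TYPE_DATA        = 1,
    CACHE_TYPE_INSTRUCTION = 2,
    CACHE_TYPE_UNIFIED     = 3,
    CACHE_TYPE_RESERVED    = 4   /* stands for all reserved values 4..31 */
};

/* Hypothetical helper: extracts the Cache Type field from a raw EAX
 * value returned by the deterministic cache parameters leaf. */
enum cache_type cache_type_from_eax(uint32_t eax)
{
    uint32_t field = eax & 0x1Fu;  /* keep bits 4:0 */
    if (field >= 4u) {
        return CACHE_TYPE_RESERVED;
    }
    return (enum cache_type)field;
}
```

Enumeration code would typically stop iterating over sub-leaves once this function reports `CACHE_TYPE_NULL`.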

Страница 345: ...r execution environments, including processors supporting Hyper-Threading Technology (HT) or multi-core processors. The basic technique is to place an upper limit on the blocksize so that it is less than the size...
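
One way to express such an upper limit in code is sketched below. The sentence above is cut off in this excerpt, so the divisor used here (the number of logical processors sharing the target cache level) is an assumption, as is the function name `blocksize_upper_limit`.

```c
#include <stddef.h>

/* Hypothetical helper: returns an upper bound on the block size, assuming
 * the target cache level is divided evenly among the logical processors
 * that share it. A real block size should stay below this bound. */
size_t blocksize_upper_limit(size_t cache_bytes, unsigned sharing_processors)
{
    if (sharing_processors == 0) {
        sharing_processors = 1;  /* treat unknown sharing as unshared */
    }
    return cache_bytes / sharing_processors;
}
```

The cache size and sharing count would normally come from the deterministic cache parameters rather than from hard-coded constants.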

Страница 346: ...oftware should not assume that the coherency line size is the prefetch stride If this field is zero then software should assume a default size of 64 bytes is the prefetch stride Software will use the...

Страница 347: ...package and or logical processor per core Therefore there are some aspects of multi threading optimization that apply across MP multi core and Hyper Threading Technology There are also some specific...

Страница 348: ...mount of parallelism in the control flow of the workload Two common usage models are multithreaded applications multitasking using single threaded applications Multithreading When an application emplo...

Страница 349: ...e threads on an MP system with N physical processors over single-threaded execution can be expressed as where P is the fraction of workload that can be parallelized and O represents the overhead of m...
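
The expression itself is missing from this excerpt. Given the variables named in the sentence (N processors, parallel fraction P, overhead O), it is presumably Amdahl's law extended with an overhead term. The form below is offered under that assumption and should be checked against the original page:

```latex
\text{RelativeResponse}
  = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}
  = \frac{1}{(1 - P) + \dfrac{P}{N} + O}
```

As a worked example under the same assumption, P = 0.8, N = 2 and O = 0 give 1 / (0.2 + 0.4), or about 1.67 times the single-threaded performance. Any nonzero overhead O lowers that figure.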

Страница 350: ...rmance scaling In addition to maximizing the parallelism of control flows interaction between threads in the form of thread synchronization and imbalance of task scheduling can also impact overall pro...

Страница 351: ...tasking workload however bus activities and cache access patterns are likely to affect the scaling of the throughput Running two copies of the same application or same suite of applications in a lock...

Страница 352: ...by large degrees of parallelism or minimal dependencies in the following areas workload thread interaction hardware utilization The key to maximizing workload parallelism is to identify multiple tasks...

Страница 353: ...ere threads can be created to handle the multiplication of half of matrix with the multiplier matrix Domain Decomposition is a programming model based on creating identical or similar threads to proce...

Страница 354: ...essor supporting Hyper Threading Technology Specialized Programming Models Intel Core Duo processor offers a second level cache shared by two processor cores in the same physical package This provides...

Страница 355: ...of single threaded execution of a sequence of task units where each task unit either the producer or consumer executes serially shown in Figure 7 2 In the equivalent scenario under multi threaded exec...

Страница 356: ...ion overhead The decimal number in the parenthesis represents a buffer index On an Intel Core Duo processor the producer thread can store data in the second level cache to allow the consumer thread to...

Страница 357: ...same buffer As each task completes one thread signals to the other thread notifying its Example 7 2 Basic Structure of Implementing Producer Consumer Threads a Basic structure of a producer thread fu...

Страница 358: ...the first or second level cache of the same core the consumer can access them without incurring bus traffic The scheduling of the interlaced producer consumer model is shown in Figure 7 4 Example 7 3...

Страница 359: ...ffer index unsigned int iter_num workamount 1 unsigned int i 0 iter_num master workamount 1 if master master thread starts the first iteration produce buffs mode count Signal sigp 1 mode 1 notify the...

Страница 360: ...parallelism in applications OpenMP supports directive based processing This uses special preprocessors or modified compilers to interpret parallelism expressed in Fortran comments or C C pragmas Bene...

Страница 361: ...ind threading errors e g data races stalls and deadlocks and reduce the amount of time spent debugging threaded applications Intel Thread Checker product is an Intel VTune Performance Analyzer plug in...

Страница 362: ...aling due to Hyper Threading Technology Techniques that apply to only one environment are noted Key Practices of Thread Synchronization Key practices for minimizing the cost of thread synchronization...

Страница 363: ...rk Excessive use of software prefetches can significantly and unnecessarily increase bus utilization if used inappropriately Consider using overlapping multiple back to back memory reads to improve ef...

Страница 364: ...ssors that support Hyper Threading Technology are Avoid Excessive Loop Unrolling to ensure the Trace Cache is operating efficiently Optimize code size to improve locality of Trace Cache and increase d...

Страница 365: ...cation and threading domain The purpose of including high medium and low impact ranking with each recommendation is to provide a relative indicator as to the degree of performance gain that can be exp...

Страница 366: ...tation of MONITOR and MWAIT, these instructions are available to the operating system so that the operating system can optimize thread synchronization in different areas. For example, an operating system can use...

Страница 367: ...es of three categories of synchronization objects available to multi threaded applications Table 7 1 Properties of Synchronization Objects Characteristics Operating System Synchronization Objects Ligh...

Страница 368: ...er with each read request being allocated a buffer resource On detection of a write by a worker thread to a load that is in progress Miscellaneous Some objects provide intra process synchronization an...

Страница 369: ...tel Core Solo and Intel Core Duo processors However the penalty on these processors is small compared with penalties suffered on the Pentium 4 and Intel Xeon processors There the performance penalty f...

Страница 370: ...ng thread and the worker thread do _asm pause Ensure this loop is de pipelined i e preventing more than one load request to sync_var to be outstanding avoiding performance penalty when the worker thre...
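
A spin-wait loop with PAUSE can also be written with C11 atomics and the `_mm_pause` intrinsic instead of inline assembly. The sketch below is an assumption-laden illustration: the function name `spin_wait`, the bounded spin count, and the no-op fallback on non-x86 targets are not from the manual.

```c
#include <stdatomic.h>

#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* assumed no-op fallback elsewhere */
#endif

/* Hypothetical helper: spins until *sync_var becomes nonzero or until
 * max_spins iterations have elapsed. Returns 1 if the variable was seen
 * nonzero, 0 if the spin budget ran out. */
int spin_wait(atomic_int *sync_var, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(sync_var, memory_order_acquire) != 0) {
            return 1;
        }
        cpu_relax();  /* one PAUSE per iteration of the wait loop */
    }
    return 0;
}
```

Bounding the spin count lets the caller fall back to a thread-blocking API when the wait turns out to be long.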

Страница 371: ...loop is shown in Example 7 4 b The PAUSE instruction is compatible with all IA 32 processors On IA 32 processors prior to Intel NetBurst microarchitecture the PAUSE instruction is essentially a NOP in...

Страница 372: ...red by two threads there is no need to use a lock Synchronization for Longer Periods When using a spin wait loop not expected to be released quickly an application should follow these guidelines Keep...

Страница 373: ...rs Furthermore using a thread blocking API also benefits from the system idle loop optimization that OS implements using the HLT instruction User Source Coding Rule 23 H impact M generality Use a thre...

Страница 374: ...of thread synchronization Because the control thread still behaves like a fast spinning loop the only runnable worker thread must share execution resources with the spin wait loop if both are running...

Страница 375: ...to release the processor incorrectly. It experiences a performance penalty if the only worker thread and the control thread run on the same physical processor package. Only one worker thread is runnin...

Страница 376: ...Microarchitecture is when thread private data and a thread synchronization variable are located within the line size boundary 64 bytes or sector boundary 128 bytes When one thread modifies the synchr...

Страница 377: ...should be allocated to reside alone in a 128 byte region and aligned to a 128 byte boundary Example 7 6 shows a way to minimize the bus traffic required to maintain cache coherency in MP systems This...
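
Placing a synchronization variable alone in a 128-byte region aligned to a 128-byte boundary can be sketched with C11 alignment support. The struct name `sync_slot` and the helper `starts_on_128_byte_boundary` are hypothetical; the manual's own Example 7-6 may use a different construction.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SYNC_REGION_BYTES 128

/* Hypothetical layout: the synchronization variable occupies the start of
 * a 128-byte, 128-byte-aligned region; the padding keeps any other data
 * out of that region. */
struct sync_slot {
    alignas(SYNC_REGION_BYTES) volatile int sync_var;
    char pad[SYNC_REGION_BYTES - sizeof(int)];
};

/* Returns 1 if the address lies on a 128-byte boundary, 0 otherwise. */
int starts_on_128_byte_boundary(const void *p)
{
    return ((uintptr_t)p % SYNC_REGION_BYTES) == 0;
}
```

An array of `struct sync_slot` keeps each element in its own region, so a write to one variable does not invalidate the line or sector holding another.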

Страница 378: ...variables of different types in data structures because the layout that compilers give to data variables might be different than their placement in the source code When each thread needs to use its ow...

Страница 379: ...rformance impact due to data traffic fetched from memory depends on the characteristics of the workload and the degree of software optimization on memory access locality enhancements implemented in the s...

Страница 380: ...n fetches from system memory User Source Coding Rule 27 M impact H generality Improve data and code locality to conserve bus command bandwidth Using a compiler that supports profiler guided optimizati...

Страница 381: ...ly one thread is using the second level cache and or bus then it is expected to get the maximum benefit of the cache and bus systems because the other core does not interfere with the progress of the...

Страница 382: ...processor core It also consumes critical execution resources and can result in stalled execution The guidelines for using software prefetch instructions are described in Chapter 2 The techniques of us...

Страница 383: ...Reduction with H W Prefetch in Chapter 6 User Source Coding Rule 30 M impact M generality Consider adjusting the sequencing of memory references such that the distribution of distances of successive c...

Страница 384: ...imiza tion Efficient operation of caches needs to address the following cache blocking shared memory optimization eliminating 64 K Aliased data accesses preventing excessive evictions in first level c...

Страница 385: ...e This technique can also be applied to single threaded applications that will be used as part of a multitasking workload User Source Coding Rule 32 H impact H generality Use cache blocking to improve...

Страница 386: ...d level cache. On an Intel Core Duo processor, and when the work buffers are small enough to fit within the first-level cache, re-ordering of producer and consumer tasks is necessary to achieve optimal...

Страница 387: ...1 for mode1 0 mode1 batchsize mode1 produce buffs mode1 count while iter_num Signal signal1 1 produce buffs mode1 count placeholder function WaitForSignal end1 mode1 if mode1 batchsize mode1 0 void co...

Страница 388: ...ule 34 H impact H generality Minimize data access patterns that are offset by multiples of 64 KB in each thread The presence of 64 KB aliased data access can be detected using Pentium 4 processor perf...

Страница 389: ...improves the temporal locality of the first level data cache Multithreaded applications need to prevent unnecessary evictions in the first level data cache when Multiple threads within an application...

Страница 390: ...in an application is to add an equal amount of stack offset each time a new thread is created in a thread pool. Example 7-9 shows a code fragment that implements per-thread stack offset for three thr...

Страница 391: ...r thread stack offset to create three threads Code for the thread function Stack accesses within descendant functions do_foo1 do_foo2 are less likely to cause data cache evictions because of the stack...

Страница 392: ...on is to allow application instance to add a suitable linear address offset for its stack Once this offset is added at start up a buffer of linear addresses is established even when two copies of the...

Страница 393: ...hat is also a multiple of the reference offset or 128 bytes Usually this per instance pseudo random offset can be less than 7 KB Example 7 10 provides a code fragment for adjusting the stack pointer i...

Страница 394: ...rocessor For dual core processors where the second level unified cache is shared by two processor cores e g Intel Core Duo processor multi threaded software should consider the increase in code workin...

Страница 395: ...sors multithreaded applications should improve code locality of frequently executed sections and target one half of the size of Trace Cache for each application thread when considering code size optim...

Страница 396: ...cation for thread scheduling is represented by a bit in an OS construct commonly referred to as an affinity mask. Software can use an affinity mask to control the binding of a software thread to a spe...
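
Since each logical processor corresponds to one bit of the affinity mask, walking a system affinity mask reduces to shifting a single-bit mask across it. The sketch below assumes a 64-bit mask; the function name `list_processors` is hypothetical. Obtaining the mask and binding a thread require OS-specific APIs, which are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the index of every processor whose bit is
 * set in system_affinity into out[] (at most max_out entries) and returns
 * the number of entries written. */
size_t list_processors(uint64_t system_affinity, unsigned *out, size_t max_out)
{
    size_t count = 0;
    unsigned proc = 0;

    /* mask becomes 0 after shifting past bit 63, which ends the loop. */
    for (uint64_t mask = 1; mask != 0 && mask <= system_affinity;
         mask <<= 1, proc++) {
        if ((mask & system_affinity) != 0 && count < max_out) {
            out[count++] = proc;
        }
    }
    return count;
}
```

Each single-bit `mask` value visited in the loop is also the affinity mask that would bind a thread to that one logical processor.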

Страница 397: ...stemAffinity is a bitmask of all the processors started by the OS Use OS specific APIs to obtain it ThreadAffinityMask is used to affinitize the topology enumeration thread to each processor using OS...

Страница 398: ...o manage thread affinity binding is shown in Example 7 12 The example shows an implementation of building a lookup table so that the sequence of thread scheduling is mapped to an array of affinity mas...

Страница 399: ...logy The 3 level hierarchy and relationships between the initial APIC_ID and affinity mask can also be used to manage such a topology Example 7 13 illustrates the steps of discovering sibling logical...

Страница 400: ...ecutively to different core while j NumStartedLPs Determine the first LP in each core if apic_conf j smt This is the first LP in a core supporting HT LuT index apic_conf j affinitymask j Now the we ha...

Страница 401: ...ular boundary of those logical processors sharing the target level cache may coincide with core boundary or above core boundary ThreadAffinityMask 1 ProcessorNum 0 while ThreadAffinityMask 0 ThreadAff...

Страница 402: ...mask of logical processors sharing the same target level cache these are logical processors with the same Cache_ID The algorithm below assumes there is symmetry across the modular boundary of target c...

Страница 403: ...CacheProcessorMask i ProcessorMask Break Found in existing bucket skip to next iteration if i CacheNum Cache_ID did not match any bucket start new bucket CacheIDBucket i CacheID ProcessorNum CachePro...
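
The bucket-matching logic in the fragment above (look for an existing bucket with the same Cache_ID, otherwise start a new one) can be written as a self-contained routine. The struct `cache_bucket`, the function `group_by_cache_id`, and the convention that processor p maps to bit p of the mask are assumptions for illustration. The per-processor cache IDs are taken as input instead of being derived from APIC IDs.

```c
#include <stddef.h>
#include <stdint.h>

/* One bucket per distinct cache ID, with a mask of sharing processors. */
struct cache_bucket {
    uint32_t cache_id;
    uint64_t processor_mask;
};

/* Hypothetical helper: groups up to 64 logical processors by cache ID.
 * cache_ids[p] is the cache ID of processor p. Returns the number of
 * buckets filled in (at most max_buckets). */
size_t group_by_cache_id(const uint32_t *cache_ids, size_t num_procs,
                         struct cache_bucket *buckets, size_t max_buckets)
{
    size_t count = 0;

    for (size_t p = 0; p < num_procs && p < 64; p++) {
        uint64_t bit = (uint64_t)1 << p;
        size_t i;

        /* Found in an existing bucket: add this processor to it. */
        for (i = 0; i < count; i++) {
            if (buckets[i].cache_id == cache_ids[p]) {
                buckets[i].processor_mask |= bit;
                break;
            }
        }

        /* Cache ID did not match any bucket: start a new bucket. */
        if (i == count) {
            if (count == max_buckets) {
                break;  /* out of space; report what was collected */
            }
            buckets[count].cache_id = cache_ids[p];
            buckets[count].processor_mask = bit;
            count++;
        }
    }
    return count;
}
```

Each resulting `processor_mask` can serve as an affinity mask for threads that should share the same cache.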

Страница 404: ...ache topology is available In general optimizing the building blocks of a multi threaded application can start from an individual thread The guidelines discussed in Chapters 2 through 6 largely apply...

Страница 405: ...g shared resource effectively can deliver positive benefit in processor scaling if the workload does saturate the critical resource in contention Tuning Suggestion 5 M Impact L Generality Schedule thr...

Страница 406: ...suming about one third of peak retirement bandwidth Significant portions of the issue port bandwidth are left unused Thus optimizing single thread performance usually can be complementary with optimiz...

Страница 407: ...machine clear condition occurs all instructions that are in flight at various stages of processing in the pipeline must be resolved and then they are either retired or cancelled While the pipeline is...

Страница 408: ...of both threads count toward the limit of four write combining buffers For example if an inner loop that writes to three separate areas of memory per iteration is run by two threads simultaneously th...

Страница 409: ...Instructions When The Data Size Is 32 Bits 64 bit mode makes 16 general purpose 64 bit registers available to applications If application data size is 32 bits however there is no need to use 64 bit re...

Страница 410: ...x is necessary Using 8 additional registers can prevent the compiler from needing to spill values onto the stack Note that the potential increase in code size due to the REX prefix can increase cache...

Страница 411: ...is 32 bits the upper 32 bits must be zeroed Zeroing the upper 32 bits requires an extra uop and is less optimal than sign extending to 64 bits While sign extending to 64 bits makes the instruction one...

Страница 412: ...r 64 Bit Mode Use 64 Bit Registers Instead of Two 32 Bit Registers for 64 Bit Arithmetic Legacy 32 bit mode offers the ability to support extended precision integer arithmetic such as 64 bit arithmeti...

Страница 413: ...t require a 64-bit result. To add two 64-bit numbers in 32-bit legacy mode, the add instruction followed by the adc instruction is used. For example, to add two 64-bit variables X and Y, the following fou...
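
What the add-then-add-with-carry sequence computes can be modeled in C with two 32-bit halves, next to the single 64-bit addition that 64-bit mode makes possible. The struct `u64_halves` and both function names are hypothetical; a compiler targeting 32-bit mode would generate the two-instruction sequence on its own for a `uint64_t` addition.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as in 32-bit legacy mode. */
struct u64_halves {
    uint32_t lo;
    uint32_t hi;
};

/* Models the two-step addition: add the low halves, then add the high
 * halves plus the carry out of the low addition. */
struct u64_halves add64_via_32(struct u64_halves x, struct u64_halves y)
{
    struct u64_halves r;
    r.lo = x.lo + y.lo;                      /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < x.lo) ? 1u : 0u; /* carry out of the low half */
    r.hi = x.hi + y.hi + carry;
    return r;
}

/* The same operation with native 64-bit arithmetic: a single addition. */
uint64_t add64_native(uint64_t x, uint64_t y)
{
    return x + y;
}
```

Both functions produce the same 64-bit result; the native form simply needs one arithmetic operation instead of two dependent ones.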

Страница 414: ...result in a microcode flow from the microcode ROM and takes longer to execute. In most cases, the 32-bit versions of CVTSI2SS and CVTSI2SD are sufficient. Assembly/Compiler Coding Rule: Use the 32-bit vers...

Страница 415: ...s referred to as static power ACPI 3 0 ACPI stands for Advanced Configuration and Power Interface provides a standard that enables intelligent power management and consumption It does this by allowing...

Страница 416: ...ut of higher numbered C states have longer latency Mobile Usage Scenarios In mobile usage models heavy loads occur in bursts while working on battery power Most productivity web and streaming workload...

Страница 417: ...the OS sets the processor to a lower frequency (P1). 3. The processor decreases frequency and processor utilization increases to the most effective level, 80-90% of the highest possible frequency. The same a...

Страница 418: ...n and the processor enters a halted state in which it waits for the next interrupt The interrupt may be a timer interrupt that comes every 10 ms or an interrupt that signals an event As shown in the i...

Страница 419: ...wer savings over the C1 state The main improvements are provided at the platform level C3 This level provides greater power savings than C1 or C2 In C3 the processor stops clock generating and snoopin...

Страница 420: ...d Intel Core Duo processors provide additional processor-specific C-states and associated sub C-states that can be mapped to ACPI C3 state type. The processor-specific C-states and sub C-states are ac...

Страница 421: ...life and adapt for mobile computing usage Adopt a power management scheme to provide just enough not the highest performance to achieve desired features or experiences Avoid using spin loops Reduce t...

Страница 422: ...f logging activities Consolidating disk operations over time to prevent unnecessary spin up of the hard drive Reducing the amount or quality of visual animations Turning off or significantly reducing...

Страница 423: ...ble. For example: When an application needs to wait for more than a few milliseconds, it should avoid using spin loops and use the Windows synchronization APIs, such as WaitForSingleObject. When an immedia...

Страница 424: ...mpact performance and may provide additional power conservation Read ahead from CD DVD data and cache it in memory or hard disk to allow the DVD drive to stop spinning Switch off unused devices When d...

Страница 425: ...he necessary information for an application to react appropriately Here are some examples of an application reaction to sleep mode transitions Saving state data prior to the sleep transition and resto...

Страница 426: ...ssor needs to stay in P0 state (highest frequency to deliver highest performance) for 0.5 seconds and may then go to sleep for 0.5 seconds. The demand pattern then alternates. Thus the processor demand...

Страница 427: ...be met by 50% performance at 100% duty cycle. Because the slope of frequency scaling efficiency of most workloads will be less than one, reducing the core frequency to 50% can achieve more than 50% of the...

Page 428: ...power consumption by moving to lower-power C-states. Generally, the longer a processor stays idle, OS power management policy directs the processor into deeper low-power C-states. After an application rea...

Page 429: ...The situation is significantly improved in the Intel Core Solo processor compared to previous generations of the Pentium M processors, but polling will likely prevent the processor from entering into h...

Page 430: ...e physical processor to run at 60% of its maximum frequency. However, it may be possible to divide work equally between threads so that each of them requires 50% of execution cycles. As a result, both cores...

Page 431: ...he issue of OS lacking multi-core aware P-state coordination policy: Upgrade to an OS with multi-core aware P-state coordination policy. Some newer OS releases may include multi-core aware P-state coord...

Page 432: ...serve battery life in multithreaded applications, a multi-threaded application should synchronize threads to work simultaneously and sleep simultaneously using OS synchronization primitives. By keeping...

Page 433: ...ution_Cycle and Unhalted_Ref_Cycles. Changing sections of serialized code to execute into two parallel threads enables coordinated thread synchronization to achieve better power savings. Although Serial...

Page 434: ...IA-32 Intel Architecture Optimization 9-20...

Page 435: ...atures such as profile-guided optimizations and other advanced optimizations. This includes vectorization for MMX technology, the Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and the S...

Page 436: ...rtran Compiler is source compatible with Compaq Visual Fortran, and also plugs into the Microsoft .NET IDE. In Linux environment, the Intel C++ Compilers are binary compatible with the corresponding version...

Page 437: ...izations, most of which are effective only in conjunction with processor-specific optimizations described below. All the command-line options are described in the Intel C++ Compiler User's Guide. Further a...

Page 438: ...s Qax extensions generates code specialized to processors which support the specified extensions, but also generates generic IA-32 code. The generic code usually executes slower than the specialized ve...

Page 439: ...to the Qx M K W B P or Qax M K W B P switches, the compiler provides the following vectorization control switch options: Qvec_report n: Controls the vectorizer's diagnostic levels, where n is either 0, 1...

Page 440: ...pendencies. This is activated by the compiler switch Qparallel. Inline Expansion of Library Functions (Oi, Oi-): The compiler inlines a number of standard C, C++, and math library functions by default. This usual...

Page 441: ...ogram from your source code and special code from the compiler. Each time this instrumented code is executed, the compiler generates a dynamic information file. When you compile a second time, the dynamic...

Page 442: ...tel C++ Compiler User's Guide. Intel VTune Performance Analyzer: The Intel VTune Performance Analyzer is a powerful software-profiling tool for Microsoft Windows and Linux. The VTune analyzer helps you und...

Page 443: ...ter the sampling activity completes, the VTune analyzer displays the data by process, thread, software module, function, or line of source. There are two methods for generating samples: Time-based sampling a...

Page 444: ...the microprocessor as it executes software. Some of the events that can be used to trigger sampling include clockticks, cache misses, and branch mispredictions. The VTune analyzer indicates where micro a...

Page 445: ...effective throughput of instructions (or μops) at the retirement stage. For example, UPC measures μops throughput at the retirement stage, which can be compared to the peak retirement bandwidth of the microar...

Page 446: ...access pattern with very poor data parallelism and will fully expose memory latency whenever cache misses occur. Large Stride Inefficiency: Large-stride data accesses are much less efficient than smal...

Page 447: ...trumentation is the process of modifying a function so that performance data can be captured when the function is executed. Instrumentation does not change the functionality of the program. However, it c...

Page 448: ...level performance bottlenecks. It periodically polls software and hardware performance counters. The performance counter data can help you understand how your application is impacting the performance of...

Page 449: ...sors. Intel Integrated Performance Primitives for Linux and Windows: IPP is a cross-platform software library which provides a range of library functions for multimedia, audio codecs, video codecs (for exa...

Page 450: ...d throughout this manual. Examples include architecture-specific tuning such as loop unrolling, instruction pairing and scheduling; and memory management with explicit and implicit data prefetching and c...

Page 451: ...word (4 double words), single precision (4 single-precision floating point), and double precision (2 double-precision floating point). When a register is updated, the new value appears in red. The corresponding...

Page 452: ...tomatically locates threading errors. As your program runs, the Intel Thread Checker monitors memory accesses and other events and automatically detects situations which could cause unpredictable thread...

Page 453: ...execution, etc. Mountains of data are collapsed into relevant summaries, sorted to identify parallel regions or loops that require attention. Its intuitive, color-coded displays make it easy to assess you...

Page 454: ...e Training at http://developer.intel.com/software/college/CourseCatalog.asp?CatID=web-based. For key algorithms and their optimization examples for the Pentium 4 processor, refer to the application notes...

Page 455: ...ors. Pentium 4 Processor Performance Metrics: The descriptions of the Intel Pentium 4 processor performance metrics use terminology that is specific to the Intel NetBurst microarchitecture and to the i...

Page 456: ...that were scheduled to execute along the mispredicted path must be cancelled. These instructions and μops are referred to as bogus instructions and bogus μops. A number of Pentium 4 processor performance...

Page 457: ...al with some event, the machine takes an assist. One example of such situation is an underflow condition in the input operands of a floating-point operation. The hardware must internally modify the forma...

Page 458: ...erformance monitoring related counters to be powered down. The processor is asleep either as a result of being halted for a while or as part of a power-management scheme. Note that there are different l...

Page 459: ...abled. Nominal CPI (timestamp counter ticks / instructions retired) measures the CPI over the entire duration of the program, including those periods the machine is halted while waiting for I/O. The distinct...
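
The Nominal CPI ratio described in the excerpt above can be written as a small helper. This is a minimal sketch: the function name `nominal_cpi` and the choice to return 0.0 when no instructions retired are assumptions made for illustration, and the two counts are expected to come from whatever tool or counter interface is in use.

```c
/* Nominal CPI = timestamp counter ticks / instructions retired.
 * Both inputs are raw counts collected over the same interval.
 * Because the time stamp counter keeps running while the machine is
 * halted waiting for I/O, this ratio covers the whole program duration. */
double nominal_cpi(unsigned long long tsc_ticks,
                   unsigned long long instructions_retired)
{
    if (instructions_retired == 0) {
        return 0.0;  /* assumption: report 0 rather than divide by zero */
    }
    return (double)tsc_ticks / (double)instructions_retired;
}
```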

Page 460: ...e. Note that this overrides any qualification (e.g. by CPL) specified in the ESCR. Enable counting in the CCCR for that counter by setting the enable bit. The counts produced by the Non-halted and Non-sleep...

Page 461: ...dependent period if there is no work for the processor to do. Time Stamp Counter: The time stamp counter increments whenever the sleep pin is not asserted or when the clock signal on the system bus is...

Page 462: ...eful metric for determining front-end performance. The metric that is most analogous to an instruction cache miss is a trace cache miss. An unsuccessful lookup of the trace cache (colloquially, a miss) is...

Page 463: ...level cache misses and writebacks (also called core references) result in references to the 2nd-level cache. The Bus Sequence Queue (BSQ) holds requests from the processor core or prefetcher that are to b...

Page 464: ...re B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus (figure; block labels: Chip Set, System Memory, 1st Level Data Cache, 3rd Level Cache, FSB_IOQ, BSQ, Unified 2nd Level Cache)...

Page 465: ...t these bus and memory metrics are derived from. The granularities of core references are listed below, according to the performance monitoring events that are documented in Appendix A of the IA-32 Int...

Page 466: ...unt each partial individually. Different transaction sizes: The allocations of non-partial programmatic load requests get a count of one per 128 bytes in the BSQ on current implementations, and a count o...

Page 467: ..._cache_reference. The metrics are: 2nd Level Cache Read Misses, 2nd Level Cache Read References, 3rd Level Cache Read Misses, 3rd Level Cache Read References, 2nd Level Cache Reads Hit Shared, 2nd Level Cac...

Page 468: ...counted once per 64-byte line (the size of a core reference). This makes the event counts for read misses appear to have a 2-times overcounting with respect to read and write (RFO) hits and write (RFO) miss...

Page 469: ...nce adjacent-sector prefetches have lower priority than demand fetches, there is a high probability on a heavily utilized system that the adjacent-sector prefetch will have to wait until the next bus a...

Page 470: ...t Hyper-Threading Technology, the performance counters and associated model-specific registers (MSRs) are extended to support Hyper-Threading Technology. A subset of the performance monitoring events allo...

Page 471: ...cs and Tagging Mechanisms. For event names that appear in this column, refer to the IA-32 Intel Architecture Software Developer's Manual, Volumes 3A & 3B. Column 4 specifies the event mask bit that is neede...

Page 472: ...structions Retired: Non-bogus IA-32 instructions executed to completion. May count more than once for some instructions with complex uop flow and were interrupted before retirement. The count may vary de...

Page 473: ...g Replay_event; set the following replay tag: Tagged_mispred_branch NBOGUS. Mispredicted Branches Retired: Mispredicted branch instructions executed to completion. This stat is often used in a per-instruc...

Page 474: ...INDIRECT. Mispredicted calls: All Mispredicted indirect calls. retired_branch_type CALL. Mispredicted conditionals: The number of mispredicted branches that are conditional jumps. retired_mispred_branch_t...

Page 475: ...delivering traces associated with logical processor 0, regardless of the operating modes of the TDE for traces associated with logical processor 1. If a physical processor supports only one logical pro...

Page 476: ...pplicable only if a physical processor supports Hyper-Threading Technology and has two logical processors per package. TC_deliver_mode SS, BS, IS. Logical Processor N In Deliver Mode: Fraction of all non...

Page 477: ...rocessor 0. TC_deliver_mode BB, BS, BI. Logical Processor 1 Build Mode: The number of cycles that the trace and delivery engine (TDE) is building traces associated with logical processor 1, regardless of the...

Page 478: ...tc_ms_xfer CISC. Speculative TC-Built Uops: The number of speculative uops originating when the TC is in build mode. uop_queue_writes FROM_TC_BUILD. Speculative TC-Delivered Uops: The number of speculativ...

Page 479: ...dercount when loads are spaced apart. Replay_event; set the following replay tag: 2ndL_cache_load_miss_retired NBOGUS. DTLB Load Misses Retired: The number of retired load μops that experienced DTLB misses...

Page 480: ...lit Load Replays: The number of load references to data that spanned two cache lines. Memory_complete LSC. Split Loads Retired: The number of retired load μops that spanned two cache lines. Replay_event; set...

Page 481: ...e number of 2nd-level cache read references (loads and RFOs). Beware of granularity differences. BSQ_cache_reference RD_2ndL_HITS, RD_2ndL_HITE, RD_2ndL_HITM, RD_2ndL_MISS. 3rd Level Cache Read Misses2: The nu...

Page 482: ...y differences. BSQ_cache_reference RD_2ndL_HITM. 2nd Level Cache Reads Hit Exclusive: The number of 2nd-level cache read references (loads and RFOs) that hit the cache line in exclusive state. Beware of gra...

Page 483: ...ys Retired: The number of retired load μops that experienced replays related to the MOB. Replay_event; set the following replay tag: MOB_load_replay_retired NBOGUS. Loads Retired: The number of retired load...

Page 484: ...liased. A high count of this metric when there is no significant contribution due to write combining buffer full condition may indicate such a situation. WC_buffer WCB_EVICTS. WCB Full Evictions: The numb...

Page 485: ...el 2 2 Enable edge filtering6 in the CCCR. Non-prefetch Bus Accesses from the Processor: The number of all bus transactions that were allocated in the IO Queue from this processor, excluding prefetched s...

Page 486: ...ata Ready Bus_ratio 100 Non Sleep Clockticks. Reads from the Processor: The number of all read (includes RFOs) transactions on the bus that were allocated in IO Queue from this processor (includes prefetch...

Page 487: ...2 2 Enable edge filtering6 in the CCCR. Reads Non-prefetch from the Processor: The number of all read transactions (includes RFOs but excludes prefetches) on the bus that originated from this processor. B...

Page 488: ...g6 in the CCCR. All UC from the Processor: The number of UC (Uncacheable) memory transactions on the bus that originated from this processor. User Note: Beware of granularity issues, e.g. a store of dqword to...

Page 489: ...CH CPUID model 2 2 Enable edge filtering6 in the CCCR. Bus Accesses Underway from the processor7: This is an accrued sum of the durations of all bus transactions by this processor. Divide by Bus Accesses...

Page 490: ...PREFETCH CPUID model 2. Non-prefetch Reads Underway from the processor7: This is an accrued sum of the durations of read (includes RFOs but excludes prefetches) transactions that originate from this proc...

Page 491: ...qA0 ALL_READ ALL_WRITE MEM_UC OWN CPUID model 2. All WC Underway from the processor7: This is an accrued sum of the durations of all WC transactions by this processor. Divide by All WC from the processor...

Page 492: ...M_WC MEM_UC OWN CPUID model 2. Bus Accesses Underway from All Agents7: This is an accrued sum of the durations of entries by all agents on the bus. Divide by Bus Accesses from All Agents to get bus reque...
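
The "Underway" metrics above are accrued sums of durations, and each description says to divide the sum by the matching transaction count. A minimal sketch of that division follows; the function name `average_duration` is hypothetical, and the unit of the result is whatever clock the underlying event accrues in, which the truncated excerpts do not state.

```c
/* Average duration per transaction, computed from an "Underway" metric
 * (accrued sum of durations) and the matching count metric, for example
 * "Bus Accesses Underway from All Agents" divided by
 * "Bus Accesses from All Agents". */
double average_duration(unsigned long long underway_sum,
                        unsigned long long transaction_count)
{
    if (transaction_count == 0) {
        return 0.0;  /* assumption: no transactions means nothing to average */
    }
    return (double)underway_sum / (double)transaction_count;
}
```

Pairing each "Underway" metric with its own count (processor with processor, all agents with all agents) matters here, since mixing them would average over the wrong population of transactions.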

Page 493: ...om DWord operands. BSQ_allocation 1 REQ_TYPE1 REQ_LEN0 MEM_TYPE0 REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR. Writes WB Full BSQ: The number of writeback (evicted from cache) transactions to WB type...

Page 494: ...E REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR. UC Reads Chunk BSQ: The number of 8-byte aligned UC read transactions. User note: Read requests associated with 16-byte operands may under-count. BSQ_al...

Page 495: ...unk BSQ: The number of 8-byte aligned IO port read transactions. BSQ_allocation 1 REQ_LEN0 REQ_ORD_TYPE REQ_IO_TYPE REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR. IO Writes Chunk BSQ: The number of IO...

Page 496: ...e_entries 1 REQ_TYPE0 REQ_TYPE1 REQ_LEN0 REQ_LEN1 MEM_TYPE1 MEM_TYPE2 REQ_CACHE_TYPE REQ_DEM_TYPE. UC Reads Chunk Underway BSQ 8: This is an accrued sum of the durations of UC read transactions. Divide b...

Page 497: ...LEN0 MEM_TYPE0 REQ_DEM_TYPE 2 Enable edge filtering6 in the CCCR. Characterization Metrics. x87 Input Assists: The number of occurrences of x87 input operands needing assistance to handle an exception co...

Page 498: ...ution_event; set this execution tag: Scalar_SP_retired NONBOGUS0. Scalar DP Retired3: Non-bogus scalar double-precision instructions retired. Execution_event; set this execution tag: Scalar_DP_retired NONBOG...

Page 499: ...Resources (non-standard5): The duration of stalls due to lack of store buffers. Resource_stall SBFULL. Stalls of Store Buffer Resources (non-standard5): The number of allocation stalls due to lack of store bu...

Page 500: ...fying a split_load_retired tag, in addition to programming the replay_event to count at retirement. This section describes three sets of tags that are used in conjunction with three at-retirement counti...

Page 501: ..._retired Bit 2 BIT 24 BIT 25 Bit 1 None NBOGUS DTLB_all_miss_retired Bit 2 BIT 24 BIT 25 Bit 0 Bit 1 None NBOGUS Tagged_mispred_branch Bit 15 Bit 16 Bit 24 Bit 25 Bit 4 None NBOGUS MOB_load_replay...

Page 502: ...ming an upstream ESCR to select event mask with its TagUop and TagValue bit fields. The event mask for the downstream ESCR is specified in column 4. The event names referenced in column 4 can be found i...

Page 503: ...tired Set the ALL bit in the event mask and the TagUop bit in the ESCR of scalar_SP_uop 1 NBOGUS0 Scalar_DP_retired Set the ALL bit in the event mask and the TagUop bit in the ESCR of scalar_DP_uop 1...

Page 504: ...e is not sufficient number of MSRs for simultaneous counting of the same metric on both logical processors. In both cases, it is also possible to program the relevant ESCR for a performance metric that...

Page 505: ...ance metrics related to the trace cache that are exceptions to the three categories above. They are: Logical Processor 0 Deliver Mode, Logical Processor 1 Deliver Mode, Logical Processor 0 Build Mode, Logi...

Page 506: ...ll conditionals Mispredicted returns Mispredicted indirect branches Mispredicted calls Mispredicted conditionals TC and Front End Metrics Trace Cache Misses ITLB Misses TC to ROM Transfers TC Flushes...

Page 507: ...red Stores Retired DTLB Store Misses Retired DTLB Load and Store Misses Retired 2nd Level Cache Read Misses 2nd Level Cache Read References 3rd Level Cache Read Misses 3rd Level Cache Read References...

Page 508: ...way from the processor1 All UC Underway from the processor1 All WC Underway from the processor1 Bus Writes Underway from the processor1 Bus Accesses Underway from All Agents1 Write WC Full BSQ 1 Write...

Page 509: ...alled Cycles of Store Buffer Resources Stalls of Store Buffer Resources 1 Parallel counting is not supported due to ESCR restrictions. Table B-7 Metrics That Are Independent of Logical Processors Gener...

Page 510: ...by more than one core, and some performance events provide an event mask (or unit mask) that allows qualification at the physical processor boundary or at bus agent boundary. Some events allow qualificat...

Page 511: ...each core. The above is also applicable when the core specificity sub-field (bits 15:14 of IA32_PERFEVTSELx MSR) within an event mask is programmed with 11B. The result reported by performance counter...

Page 512: ...from the L1 or prefetches to the L2 cache were issued. Unhalted_Core_Cycles (event number 3C, unit mask 00H): This event counts the smallest unit of time recognized by an active core. In many operating syst...

Page 513: ...These snoops are done through the DCU store port. Frequent DCU snoops may conflict with stores to the DCU, and this may increase store latency and impact performance. Bus_Not_In_Use (event number 7DH, uni...

Page 514: ...IA-32 Intel Architecture Optimization B-60...

Page 515: ...d scheduling. Definitions: the definitions for the primary information presented in the tables in section Latency and Throughput. Latency and Throughput of Pentium 4 and Intel Xeon processors: the listing...

Page 516: ...s to aid in selecting the sequence of instructions which minimizes dependency chain latency, and to arrange instructions in an order which assists the hardware in processing instructions efficiently wh...

Page 517: ...esource conflicts. Interleaving instructions so that they don't compete for the same port or execution unit can increase throughput. For example, alternating PADDQ and PMULUDQ, each have a throughput of o...

Page 518: ...execute the μops for each instruction. This information is provided only for IA-32 instructions that are decoded into no more than 4 μops. μops for instructions that decode into more than 4 μops are supplie...

Page 519: ...tional information that is beyond the scope of this manual. Comparisons of latency and throughput data between the Pentium 4 processor and the Pentium M processor can be misleading, because one cycle in...

Page 520: ...for CPUID signature 0xF2n and 0xF3n. The notation 0xF2n represents the hex value of the lower 12 bits of the EAX register reported by CPUID instruction with input value of EAX = 1. F indicates the family...
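
The signature notation above can be made concrete with a small decoder. This is a sketch that only looks at the lower 12 bits of EAX, as the excerpt describes (family in bits 11:8, model in bits 7:4, stepping in bits 3:0); the type and function names are hypothetical, and the EAX value is assumed to have been obtained elsewhere by executing CPUID with EAX = 1.

```c
#include <stdint.h>

/* Fields of the 12-bit CPUID signature (lower 12 bits of EAX, leaf 1). */
typedef struct {
    unsigned family;    /* bits 11:8, e.g. 0xF */
    unsigned model;     /* bits 7:4,  e.g. 2 in 0xF2n */
    unsigned stepping;  /* bits 3:0,  the "n" in 0xF2n */
} cpu_signature;

cpu_signature decode_signature(uint32_t eax)
{
    uint32_t sig = eax & 0xFFFu;  /* keep only the lower 12 bits */
    cpu_signature s;
    s.family = (sig >> 8) & 0xFu;
    s.model = (sig >> 4) & 0xFu;
    s.stepping = sig & 0xFu;
    return s;
}

/* Returns 1 if eax matches a signature pattern such as 0xF2n
 * (family 0xF, model 2, any stepping), otherwise 0. */
int signature_matches(uint32_t eax, unsigned family, unsigned model)
{
    cpu_signature s = decode_signature(eax);
    return s.family == family && s.model == model;
}
```

With this sketch, a value such as 0x00000F27 decodes to family 0xF, model 2, stepping 7, so `signature_matches(0x00000F27, 0xF, 2)` selects the 0xF2n column of the tables. The extended family and model fields in the upper bits of EAX are deliberately ignored because the notation in the text covers only the lower 12 bits.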

Page 521: ...xmm 6 6 1 1 1 1 FP_MOVE MOVDQU xmm xmm 6 6 1 1 1 1 FP_MOVE MOVDQ2Q mm xmm 8 8 1 2 2 1 FP_MOVE MMX_ALU MOVQ2DQ xmm mm 8 8 1 2 2 1 FP_MOVE MMX_SHFT MOVQ xmm xmm 2 2 1 2 2 1 MMX_SHFT PACKSSWB PACKSSDW PA...

Page 522: ...P_MUL PMULUDQ xmm xmm 9 8 6 2 2 2 4 FP_MUL POR xmm xmm 2 2 1 2 2 1 MMX_ALU PSADBW xmm xmm 4 4 5 2 2 2 4 MMX_ALU PSHUFD xmm xmm imm8 4 4 2 1 2 2 2 MMX_SHFT PSHUFHW xmm xmm imm8 2 2 1 2 2 1 MMX_SHFT PSH...

Page 523: ...xmm 2 2 1 2 2 1 MMX_ALU See Table Footnotes. Table C-3 Streaming SIMD Extension 2 Double-precision Floating-point Instructions Instruction Latency1 Throughput Execution Unit2 CPUID 0F3n 0F2n 0x69n 0F3...

Page 524: ...MX_ALU CVTPS2PD3 xmm xmm 3 2 2 1 2 3 FP_ADD MMX_SHFT MMX_ALU CVTSD2SI r32 xmm 9 8 2 2 FP_ADD FP_MISC CVTSD2SS3 xmm xmm 17 16 4 2 4 1 FP_ADD MMX_SHFT CVTSI2SD3 xmm r32 16 15 4 2 3 1 FP_ADD MMX_SHFT MMX...

Page 525: ...LPD xmm xmm 7 6 2 2 FP_MUL MULSD xmm xmm 7 6 2 2 FP_MUL ORPD3 xmm xmm 4 4 2 2 MMX_ALU SHUFPD3 xmm xmm imm8 6 6 2 2 MMX_SHFT SQRTPD xmm xmm 70 69 58 57 70 69 114 FP_DIV SQRTSD xmm xmm 39 38 58 39 38 57...

Page 526: ...FP_ADD COMISS xmm xmm 7 6 1 2 2 1 FP_ADD FP_MISC CVTPI2PS xmm mm 12 11 3 2 4 1 MMX_ALU FP_ADD MMX_SHFT CVTPS2PI mm xmm 8 7 3 2 2 1 FP_ADD MMX_ALU CVTSI2SS3 xmm r32 12 11 4 2 2 2 FP_ADD MMX_SHFT M...

Page 527: ...2 MMX_MISC RSQRTSS3 xmm xmm 6 6 4 4 1 MMX_MISC MMX_SHFT SHUFPS3 xmm xmm imm8 6 6 2 2 2 2 MMX_SHFT SQRTPS xmm xmm 40 39 29 28 40 39 58 FP_DIV SQRTSS xmm xmm 32 23 30 32 23 29 FP_DIV SUBPS xmm xmm 5 4 4...

Page 528: ...2 1 FP_MISC PMULHUW3 mm mm 9 8 1 1 FP_MUL PSADBW mm mm 4 4 5 1 1 2 MMX_ALU PSHUFW mm mm imm8 2 2 1 1 1 1 MMX_SHFT See Table Footnotes. Table C-6 MMX Technology 64-bit Instructions Instruction Latency1...

Page 529: ...RAD mm mm imm8 2 2 1 1 MMX_SHFT PSRLQ PSRLW PSRLD mm mm imm8 2 2 1 1 MMX_SHFT PSUBB PSUBW PSUBD mm mm 2 2 1 1 MMX_ALU PSUBSB PSUBSW PSUBUSB PSUBUSW mm mm 2 2 1 1 MMX_ALU PUNPCKHBW PUNPCKHWD PUNPCKHD...

Page 530: ...OM 3 2 1 1 FP_MISC FCHS 3 2 1 1 FP_MISC FDIV Single Precision 30 23 30 23 FP_DIV FDIV Double Precision 40 38 40 38 FP_DIV FDIV Extended Precision 44 43 44 43 FP_DIV FSQRT SP 30 23 30 23 FP_DIV FSQRT D...

Page 531: ...0F2n 0x69n 0F2n ADC SBB reg reg 8 8 3 3 ADC SBB reg imm 8 6 2 2 ALU ADD SUB 1 0 5 0 5 0 5 ALU AND OR XOR 1 0 5 0 5 0 5 ALU BSF BSR 16 8 2 4 BSWAP 1 7 0 5 1 ALU BTC BTR BTS 8 9 1 CLI 26 CMP TEST 1 0 5...

Page 532: ...ALU PUSH 1 5 1 MEM_STORE ALU RCL RCR reg 18 6 4 1 1 ROL ROR 1 4 0 5 1 RET 8 1 MEM_LOAD ALU SAHF 1 0 5 0 5 0 5 ALU SAL SAR SHL SHR 1 4 1 0 5 1 SCAS 4 1 5 ALU MEM_LOAD SETcc 5 1 5 ALU STI 36 STOSB 5 2...

Page 533: ...nd ports in the out-of-order core. Note the following: The FP_EXECUTE unit is actually a cluster of execution units, roughly consisting of seven separate execution units. The FP_ADD unit handles x87 and S...

Page 534: ...ted more slowly. This applies to the Pentium 4 and Intel Xeon processors. Latency and Throughput with Memory Operands: The discussion of this section applies to the Intel Pentium 4 and Intel Xeon process...

Page 535: ...Throughput of these instructions with load operation remains the same with the register-to-register flavor of the instructions. Floating-point, MMX technology, Streaming SIMD Extensions, and Streaming SIM...

Page 536: ...IA-32 Intel Architecture Optimization C-22...

Page 537: ...points to the base of the frame for the function and from which all data are referenced via appropriate offsets. The convention on IA-32 is to use the esp register as the stack frame pointer for norma...

Page 538: ...versions of the Intel C++ Compiler for Win32 Systems have attempted to provide 8-byte-aligned stack frames by dynamically adjusting the stack frame pointer in the prologue of main and preserving 8-byte...

Page 539: ...requirement of 4, but it calls function G at many call sites, and in a loop. If G's alignment requirement is 16, then by promoting F's alignment requirement to 16, and making all calls to G go to its alig...

Page 540: ...must be used. When using this type of frame, the sum of the sizes of the return address, saved registers, local variables, register spill slots, and parameter space must be a multiple of 16 bytes. This cause...
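
The multiple-of-16 rule above amounts to a padding calculation. The sketch below is illustrative only: the function name `frame_padding` and its parameter breakdown are assumptions that mirror the five components listed in the excerpt, and where a compiler would actually place the padding inside the frame is not stated in the excerpt.

```c
#include <stddef.h>

/* Number of padding bytes needed so that the total frame size
 * (return address + saved registers + local variables + register
 * spill slots + parameter space) is a multiple of 16 bytes. */
size_t frame_padding(size_t return_address, size_t saved_registers,
                     size_t local_variables, size_t spill_slots,
                     size_t parameter_space)
{
    size_t total = return_address + saved_registers + local_variables +
                   spill_slots + parameter_space;
    /* The outer modulo makes an already aligned total yield 0, not 16. */
    return (16 - total % 16) % 16;
}
```

For example, a 4-byte return address, 8 bytes of saved registers, 20 bytes of locals, no spill slots, and 4 bytes of parameter space total 36 bytes, so 12 bytes of padding bring the frame to 48.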

Page 541: ...Note A push ebx mov ebx esp sub esp 0x00000008 and esp 0xfffffff0 add esp 0x00000008 jmp common foo aligned push ebx mov ebx esp common See Note B push edx sub esp 20 j k mov edx ebx 8 mov esp 16 edx...
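
The `sub esp` / `and esp` / `add esp` sequence in the listing above can be modeled with plain integer arithmetic to see what it does to the stack pointer. This is a sketch that treats esp as a 32-bit value; the function name `unaligned_entry_adjust` is hypothetical, and the reading that the sequence leaves esp congruent to 8 modulo 16 follows from the arithmetic rather than from any statement in the truncated excerpt.

```c
#include <stdint.h>

/* Models the three instructions executed at the unaligned entry point:
 *   sub esp, 0x00000008
 *   and esp, 0xfffffff0
 *   add esp, 0x00000008
 * The result never exceeds the input (the stack only grows downward)
 * and is always congruent to 8 modulo 16. */
uint32_t unaligned_entry_adjust(uint32_t esp)
{
    esp -= 0x00000008u;  /* make room so that rounding down is safe */
    esp &= 0xfffffff0u;  /* round down to a 16-byte boundary */
    esp += 0x00000008u;  /* step back up to 8 mod 16 */
    return esp;
}
```

An input of 0x1004 yields 0xFF8, and an input that is already 8 modulo 16, such as 0x1008, is returned unchanged, so both entry paths can reach the common code with the same stack alignment.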

Page 542: ...he return address, the previous ebp, the exception handling record, the local variables, and the spill area must be a multiple of 16 bytes. In addition, the parameter passing space must be a multiple of 16...

Page 543: ...er add jmp common foo aligned push ebx esp is 8 mod 16 after push mov ebx esp common push ebp this slot will be used for duplicate return pt push ebp esp is 0 mod 16 after push rtn ebx ebp ebp mov ebp...

Page 544: ...o 5 add esp 4 normal call sequence to unaligned entry mov esp 5 call foo for stdcall callee cleans up stack foo aligned 5 add esp 16 aligned entry this should be a multiple of 16 mov esp 5 call foo al...

Page 545: ...e 16-byte alignment, then that function will not have any alignment code in it. That is, the compiler will not use ebx to point to the argument block and it will not have alternate entry points, because t...

Page 546: ...ored relative to ebx in the function's epilog. For additional information on the use of ebx in inline assembly code and other related issues, see relevant application notes in the Intel Architecture Per...

Page 547: ...uation the mathematical model of the calculation. Simplified Equation: A simplified equation to compute PSD is as follows: where psd is prefetch scheduling distance, Nlookup is the number of clocks for lo...

Page 548: ...mathematics discussed are as follows: psd: prefetch scheduling distance (measured in number of iterations); il: iteration latency; Tc: computation latency per iteration with prefetch caches; Tl: memory leadoff...

Page 549: ...needed to execute this loop with actual run-time memory footprint. Tc can be determined by computing the critical path latency of the code dependency graph. This work is quite arduous without help fro...

Page 550: ...e that three cache lines are accessed per iteration and four chunks of data are returned per iteration for each cache line. Also assume these 3 accesses are pipelined in memory subsystem. Based on these...

Page 551: ...To determine the proper prefetch scheduling distance, follow these steps and formulae: Optimize Tc as much as possible. Use the following set of formulae to calculate the proper prefetch scheduling dista...

Page 552: ...e bus is idle during the computation portion of the loop. The memory access latencies could be hidden behind execution if data could be fetched earlier during the bus idle time. Further analyzing Figure...

Page 553: ...ory pipeline. With an ideal placement of the data prefetching, the iteration latency should be either bound by execution latency or memory latency, that is: il = maximum(Tc, Tb). Compute Bound (Case: Tc >= Tl + Tb): Fi...

Page 554: ...computation latency, which means the memory accesses are executed in background and their latencies are completely hidden. Compute Bound (Case: Tl + Tb > Tc > Tb): Now consider the next case by first examining Fi...

Page 555: ...umed in iteration i+2. Figure E-4 represents the case when the leadoff latency plus data transfer latency is greater than the compute latency, which is greater than the data transfer latency. The followi...

Page 556: ...refetch scheduling distance (or prefetch iteration distance) for the case when memory throughput latency is greater than the compute latency. Apparently, the iteration latency is dominated by the memory th...
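
The three cases walked through above, together with the relation il = maximum(Tc, Tb), can be sketched as a small classifier. The enum and function names are hypothetical, the treatment of exact equality at the case boundaries is an assumption, and the sketch deliberately stops short of computing the prefetch scheduling distance itself because the formulae for it are not reproduced in these excerpts.

```c
/* Which regime a loop falls into, given (in clocks per iteration)
 * tc = computation latency, tl = memory leadoff latency,
 * tb = data transfer (burst) latency. */
typedef enum {
    COMPUTE_BOUND_HIDDEN,   /* Tc >= Tl + Tb: memory latency fully hidden */
    COMPUTE_BOUND_OVERLAP,  /* Tl + Tb > Tc > Tb */
    MEMORY_BOUND            /* Tb >= Tc: memory throughput dominates */
} prefetch_case;

prefetch_case classify_loop(double tc, double tl, double tb)
{
    if (tc >= tl + tb) {
        return COMPUTE_BOUND_HIDDEN;
    }
    if (tc > tb) {
        return COMPUTE_BOUND_OVERLAP;
    }
    return MEMORY_BOUND;
}

/* Steady-state iteration latency with ideally placed prefetches:
 * il = maximum(Tc, Tb). */
double iteration_latency(double tc, double tb)
{
    return tc > tb ? tc : tb;
}
```

With Tl = 30 and Tb = 20, a loop with Tc = 100 is compute bound with latency fully hidden, Tc = 40 falls into the overlap case with il = 40, and Tc = 10 is memory bound with il = 20.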

Page 557: ...aphics driver moving data from writeback memory to write-combining memory belongs to this category, where performance advantage from prefetch instructions will be marginal. Example: As an example of the...

Page 558: ...n of Tc, the computation latency. The steady-state iteration latency (il) is either memory bound or compute bound, depending on Tc, if prefetches are scheduled effectively. The graph in example 2 of accesses...

Page 559: ...tion gap in computing the prefetch scheduling distance. The transaction gap, Tg, must be factored into the burst cycles, Tb, for the calculation of prefetch scheduling distance. The following relationship s...

Page 560: ...IA-32 Intel Architecture Optimization E-14...

Page 561: ...pplication performance tools A 1 Arrays Aligning 2 39 automatic processor dispatch support A 4 automatic vectorization 3 18 3 19 B battery life 9 1 9 7 OS APIs 9 8 performance options 9 8 Branch Predi...

Page 562: ...6 44 _mm_stream 6 2 6 44 compiler plug in A 2 compiler supported alignment 3 24 complex instructions 2 74 computation latency E 8 computation intensive code 3 11 compute bound E 7 E 8 converting code...

Page 563: ...et 7 44 placement of shared synchronization variable 7 31 prevent false sharing of data 7 30 preventing excessive evictions in first level data cache 7 43 shared memory optimization 7 39 synchronizati...

Page 564: ...7 OS APIs 9 8 C4 state 9 6 CD DVD WLAN WiFi 9 10 C states 9 1 9 4 deep sleep transitions 9 11 deeper sleep 9 6 9 14 OS changes processor frequency 9 2 OS synchronization APIs 9 9 overview 9 1 performa...

Page 565: ...39 Performance and Usage Models Multithreading 7 2 Performance and Usage Models 7 2 Performance Library Suite A 14 optimizations A 16 PEXTRW instruction 4 13 PGO See profile guided optimization PINSRW...

Page 566: ...n algorithm 2 20 static power 9 1 static prediction 2 19 static prediction algorithm 2 19 streaming stores coherent requests 6 13 non coherent requests 6 13 strip mining 3 32 3 34 strip mining 6 37 6...

Page 567: ...rni 14 Brno 61600 Czech Rep Denmark Intel Corp Soelodden 13 Maaloev DK2760 Denmark Germany Intel Corp Sandstrasse 4 Aichner 86551 Germany Intel Corp Dr Weyerstrasse 2 Juelich 52428 Germany Intel Corp...

Page 568: ...9465 Counselors Row Suite 200 Indianapolis IN 46240 USA Fax 317 805 4939 Massachusetts Intel Corp 125 Nagog Park Acton MA 01720 USA Fax 978 266 3867 Intel Corp 59 Composit Way suite 202 Lowell MA 018...
