Use the 3DNow!™ PREFETCH and PREFETCHW Instructions

22007E/0—November 1999

AMD Athlon™ Processor x86 Code Optimization 

PREFETCH/W versus PREFETCHNTA/T0/T1/T2

The PREFETCHNTA/T0/T1/T2 instructions in the MMX
extensions are processor implementation dependent. To
maintain compatibility with the 25 million AMD-K6®-2 and
AMD-K6-III processors already sold, use the 3DNow!
PREFETCH/W instructions instead of the various prefetch
flavors in the new MMX extensions.

PREFETCHW Usage

Code that intends to modify the cache line brought in through
prefetching should use the PREFETCHW instruction. While
PREFETCHW works the same as a PREFETCH on the
AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a
hint to the AMD Athlon processor of an intent to modify the
cache line. The AMD Athlon processor will mark the cache line
being brought in by PREFETCHW as Modified. Using
PREFETCHW can save an additional 15-25 cycles compared to
a PREFETCH and the subsequent cache state change caused by
a write to the prefetched cache line.
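
In C source, the same intent can be expressed with a compiler
hint. The sketch below assumes GCC or Clang, whose
__builtin_prefetch(addr, rw, locality) built-in accepts rw = 1
for a prefetch in anticipation of a write. Whether the compiler
actually emits PREFETCHW depends on the target options. The
function name scale_in_place and the distance PREFETCH_AHEAD
are hypothetical and untuned; they do not come from this guide.

```c
#include <stddef.h>

/* Assumed prefetch distance in elements; illustrative, not tuned. */
#define PREFETCH_AHEAD 64

/* Hypothetical helper: multiplies every element of data by factor,
   hinting ahead of time that each cache line will be modified. */
void scale_in_place(double *data, size_t n, double factor)
{
    for (size_t i = 0; i < n; i++) {
        /* Roughly once per 64-byte line (8 doubles), request a line
           that is about to be written. rw = 1 signals write intent. */
        if ((i % 8) == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 1, 3);
        }
        data[i] *= factor;
    }
}
```

The prefetch is only a hint, so the result is the same with or
without it; only the timing may differ.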

Multiple Prefetches

Programmers can initiate multiple outstanding prefetches on
the AMD Athlon processor. While the AMD-K6-2 and
AMD-K6-III processors can have only one outstanding prefetch,
the AMD Athlon processor can have up to six outstanding
prefetches. When all six buffers are filled by various memory
read requests, the processor simply ignores any new
prefetch requests until a buffer frees up. Multiple prefetch
requests are essentially handled in order, so prefetch first
the data that is needed first.
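
As a rough C-level illustration of this ordering rule, the sketch
below issues one prefetch per array and places the prefetches for
the data that is read first ahead of the one for the data that is
written. It assumes the GCC/Clang __builtin_prefetch built-in.
The function name multiply_arrays and the distance
PREFETCH_AHEAD are hypothetical and untuned.

```c
#include <stddef.h>

/* Assumed prefetch distance in elements; illustrative, not tuned. */
#define PREFETCH_AHEAD 64

/* Hypothetical helper: computes a[i] = b[i] * c[i] for n elements. */
void multiply_arrays(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Roughly once per 64-byte line (8 doubles), issue three
           prefetches, which stays within six outstanding requests. */
        if ((i % 8) == 0 && i + PREFETCH_AHEAD < n) {
            /* The sources are read before the destination is written,
               so their prefetches are issued first. */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
            __builtin_prefetch(&c[i + PREFETCH_AHEAD], 0, 3);
            /* Destination: rw = 1 signals the intent to modify. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3);
        }
        a[i] = b[i] * c[i];
    }
}
```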

The example below shows how to initiate multiple prefetches
when traversing more than one array.

Example (Multiple Prefetches):  

.CODE
.K3D

; original C code
;
; #define LARGE_NUM 65536
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;    array_a[i] = array_b[i] * array_c[i];
; }

Содержание Athlon Processor x86

Страница 1: ...AMD Athlon Processor x86 Code Optimization Guide TM...

Страница 2: ...ations and product descriptions at any time without notice No license whether express implied arising by estoppel or otherwise to any intellectual property rights is granted by this publication Except...

Страница 3: ...nstructions 9 Group II Optimizations Secondary Optimizations 9 Load Execute Instruction Usage 9 Take Advantage of Write Combining 10 Use 3DNow Instructions 10 Avoid Branches Dependent on Random Data 1...

Страница 4: ...Base Type Size 28 Accelerating Floating Point Divides and Square Roots 29 Avoid Unnecessary Integer Division 31 Copy Frequently De referenced Pointer Arguments to Local Variables 31 4 Instruction Dec...

Страница 5: ...Cache Line 50 Store to Load Forwarding Restrictions 51 Store to Load Forwarding Pitfalls True Dependencies 51 Summary of Store to Load Forwarding Pitfalls to Avoid 54 Stack Alignment Considerations 5...

Страница 6: ...8 Integer Optimizations 77 Replace Divides with Multiplies 77 Multiplication by Reciprocal Division Utility 77 Unsigned Division by Multiplication of Constant 78 Signed Division by Multiplication of...

Страница 7: ...103 Check Argument Range of Trigonometric Instructions Efficiently 103 Take Advantage of the FSINCOS Instruction 105 10 3DNow and MMX Optimizations 107 Use 3DNow Instructions 107 Use FEMMS Instructio...

Страница 8: ...119 Optimized Matrix Multiplication 119 Efficient 3D Clipping Code Computation Using 3DNow Instructions 122 Use 3DNow PAVGUSB for MPEG 2 Motion Compensation 123 Stream of Packed Unsigned Bytes 125 Com...

Страница 9: ...g Point Pipeline Stages 146 Execution Unit Resources 148 Terminology 148 Integer Pipeline Operations 149 Floating Point Pipeline Operations 150 Load Store Pipeline Operations 151 Code Sample Analysis...

Страница 10: ...toring Software 168 Monitoring Counter Overflow 169 Appendix E Programming the MTRR and PAT 171 Introduction 171 Memory Type Range Register MTRR Mechanism 171 Page Attribute Table PAT 177 Appendix F I...

Страница 11: ...6 Fetch Scan Align Decode Pipeline Stages 142 Figure 7 Integer Execution Pipeline 144 Figure 8 Integer Pipeline Stages 144 Figure 9 Floating Point Unit Block Diagram 146 Figure 10 Floating Point Pipe...

Страница 12: ...xii List of Figures AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Страница 13: ...Generation Rules 159 Table 11 Performance Monitoring Counters 164 Table 12 Memory Type Encodings 174 Table 13 Standard MTRR Types and Properties 176 Table 14 PATi 3 Bit Encodings 178 Table 15 Effecti...

Страница 14: ...rocessor x86 Code Optimization 22007E 0 November 1999 Table 29 VectorPath Integer Instructions 231 Table 30 VectorPath MMX Instructions 234 Table 31 VectorPath MMX Extensions 234 Table 32 VectorPath F...

Страница 15: ...FFREEP Macro to Pop One Register from the FPU Stack on page 98 Further clarification of Minimize Floating Point to Integer Conversions on page 100 Added the optimization Check Argument Range of Trigon...

Страница 16: ...xvi Revision History AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Страница 17: ...AMD Athlon processor programmers should write software that includes specific code optimization techniques About this Document This document contains information to assist programmers in creating opti...

Страница 18: ...makes efficient use of the large L1 caches and high bandwidth buses of the AMD Athlon processor Chapter 6 Branch Optimizations Describes optimizations that improves branch prediction and minimizes bra...

Страница 19: ...ists the x86 instructions that are DirectPath and VectorPath instructions AMD Athlon Processor Family The AMD Athlon processor family uses state of the art decoupled decode execution design techniques...

Страница 20: ...ive execution Three way integer execution Three way address generation Three way floating point execution 3DNow technology and MMX single instruction multiple data SIMD instruction extensions Super da...

Страница 21: ...for high performance floating point vector operations which can replace x87 instructions and enhance the performance of 3D graphics and other floating point intensive applications Because the 3DNow a...

Страница 22: ...the AMD Athlon processor to achieve maximum performance Due to the more flexible pipeline control and aggressive out of order execution the AMD Athlon processor is not as sensitive to instruction sele...

Страница 23: ...hould follow these critical guidelines closely The optimizations in Group I are as follows Memory Size and Alignment Issues Avoid memory size mismatches Align data where possible Use the 3DNow PREFETC...

Страница 24: ...ze mismatches when instructions operate on the same data For instructions that store and reload the same data keep operands aligned and keep the loads stores of each operand the same size Align Data W...

Страница 25: ...code and execute efficiently by minimizing the number of operations per x86 instruction Three DirectPath instructions can be decoded in parallel Using VectorPath instructions will block DirectPath ins...

Страница 26: ...ementation of Write Combining on page 155 for more details Use 3DNow Instructions Unless accuracy requirements dictate otherwise perform floating point computations using the 3DNow instructions instea...

Страница 27: ...ze of previous processors Code and data should not be shared in the same 64 byte cache line especially if the data ever becomes modified In order to maintain cache coherency the AMD Athlon processor m...

Страница 28: ...12 Group II Optimizations Secondary Optimizations AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Страница 29: ...ating point variables and expressions are of type float Pay special attention to floating point constants These require a suffix of F or f for example 3 14f in order to be of type float otherwise they...

Страница 30: ...while others are faster for signed types Integer to floating point conversion using integers larger than 16 bit is faster with signed types as the x86 FPU provides instructions for converting signed i...

Страница 31: ...writes through a pointer can write to any place in memory This includes storage allocated to other variables creating the issue of aliasing i e the same block of memory is accessible in more than one...

Страница 32: ...de it is therefore possible that pointer style code will be compiled into machine code that is faster than that generated from equivalent array style code It is advisable to check the performance afte...

Страница 33: ...eferred typedef struct float x y z w VERTEX typedef struct float m 4 4 MATRIX void XForm float res const float v const float m int numverts int i const VERTEX vv VERTEX v const MATRIX mm MATRIX m VERT...

Страница 34: ...4 transform matrix M r 0 M 0 0 V 0 M 1 0 V 1 M 2 0 V 2 M 3 0 V 3 r 1 M 0 1 V 0 M 1 1 V 1 M 2 1 V 2 M 3 1 V 3 r 2 M 0 2 V 0 M 1 2 V 1 M 2 2 V 2 M 3 2 V 3 r 3 M 0 3 V 0 M 1 3 V 1 M 2 3 V 2 M 3 3 v 3 Avo...

Страница 35: ...refore recommended that the programmer remove the dependency manually e g by introducing a temporary variable that can be kept in a register This can result in a significant performance increase The f...

Страница 36: ...s of conditional branches The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition and can have a significant impact on performance It is unf...

Страница 37: ...integers as this will allow the switch to be translated as a jump table Example 1 Avoid int days_in_month short_months normal_months long_months switch days_in_month case 28 case 29 short_months brea...

Страница 38: ...rmance of inner loops it is beneficial to reduce redundant constant calculations i e loop invariant calculations However this idea can be extended to invariant control structures The first case is tha...

Страница 39: ...imple switch statement Example 2 for i if CONSTANT0 DoWork0 i does not affect CONSTANT0 or CONSTANT1 else DoWork1 i does not affect CONSTANT0 or CONSTANT1 if CONSTANT1 DoWork2 i does not affect CONSTA...

Страница 40: ...method for combining the input constants gets more complicated but will be worth it for the performance benefit However the number of inner loops can also substantially increase If the number of inne...

Страница 41: ...cit Parallelism into Code Where possible long dependency chains should be broken into several independent dependency chains which can then be executed in parallel exploiting the pipeline execution uni...

Страница 42: ...t adder Each stage of the floating point adder is occupied on every clock cycle ensuring maximal sustained utilization Explicitly Extract Common Subexpressions In certain situations C compilers are un...

Страница 43: ...better alignment for structures In addition to improve the alignment of structure members some compilers might allocate structure elements in an order that differs from the order in which they are de...

Страница 44: ...Component Considerations on page 55 for a different perspective Sort Local Variables According to Base Type Size When a compiler allocates local variables in the same order in which they are declared...

Страница 45: ...a much longer latency than other floating point operations even though the AMD Athlon processor provides significant acceleration of these two operations In some codes these operations occur so often...

Страница 46: ...ere it creates little overhead such as outside a computation intensive loop Otherwise the overhead created by the function calls outweighs the benefit from reducing the latencies of divide and square...

Страница 47: ...i j k Example 2 Preferred int i j k l m i j k Copy Frequently De referenced Pointer Arguments to Local Variables Avoid frequently de referencing pointer arguments inside a function Since the compiler...

Страница 48: ...Avoid assumes pointers are different and q r void isqrt unsigned long a unsigned long q unsigned long r q a if a 0 while q r a q q q r 1 r a q q Example 2 Preferred assumes pointers are different and...

Страница 49: ...ne selects for decode up to three x86 instructions from the instruction byte queue All instructions x86 x87 3DNow and MMX are classified into two types of decodes DirectPath and VectorPath see DirectP...

Страница 50: ...ctions in the AMD Athlon processor Assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions See Appendix F Instruction Dispatch and Execution Resource...

Страница 51: ...FMUL ST ST 1 Example 2 Preferred FLD QWORD PTR TEST1 FMUL QWORD PTR TEST2 Avoid Load Execute Floating Point Instructions with Integer Operands Do not use load execute floating point instructions with...

Страница 52: ...while preventing I cache space in branch intensive code Use Short Instruction Lengths Assemblers and compilers should generate the tightest code possible to optimize use of the I cache and increase a...

Страница 53: ...rent state of the remainder of the register Therefore the dependency hardware can potentially force a false dependency on the most recent instruction that writes to any part of the register Example 1...

Страница 54: ...ions while SHLD is a VectorPath instruction SHR and LEA preserves decode bandwidth as it potentially enables the decoding of a third DirectPath instruction Example 1 Avoid SHLD REG1 REG2 1 Preferred S...

Страница 55: ...can be executed it should take up as few execution resources as possible not diminish decode density and not modify any processor state other than advancing EIP A one byte padding can easily be achiev...

Страница 56: ...neutral code filler of up to nine bytes Note When used as a filler instruction REP REPNE prefixes can be used in conjunction only with NOPs REP REPNE has undefined behavior when used with instruction...

Страница 57: ...used by instructions in the vicinity of the neutral code filler Note that certain instructions use registers implicitly For example PUSH POP CALL and RET all make implicit use of the ESP register The...

Страница 58: ...esi 00 NOP4_EDI TEXTEQU DB 08Dh 074h 026h 000h lea edi edi 00 NOP4_ESP TEXTEQU DB 08Dh 07Ch 027h 000h lea esp esp 00 lea eax eax 00 nop NOP5_EAX TEXTEQU DB 08Dh 044h 020h 000h 090h lea ebx ebx 00 nop...

Страница 59: ...B 08Dh 034h 035h 0 0 0 0 lea edi edi 1 00000000 NOP7_EDI TEXTEQU DB 08Dh 03Ch 03Dh 0 0 0 0 lea ebp ebp 1 00000000 NOP7_EBP TEXTEQU DB 08Dh 02Ch 02Dh 0 0 0 0 lea eax eax 1 00000000 nop NOP8_EAX TEXTEQU...

Страница 60: ...44 Code Padding Using Neutral Code Fillers AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Страница 61: ...nment Issues Avoid Memory Size Mismatches Avoid memory size mismatches when instructions operate on the same data For instructions that store and reload the same data keep operands aligned and keep th...

Страница 62: ...e likelihood of encountering a store to load forwarding pitfall For a more detailed discussion of store to load forwarding issues see Store to Load Forwarding Restrictions on page 51 Use the 3DNow PRE...

Страница 63: ...brought in by PREFETCHW as Modified Using PREFETCHW can save an additional 15 25 cycles compared to a PREFETCH and the subsequent cache state change caused by a write to the prefetched cache line Mul...

Страница 64: ...ZE 16 b i 2 FMUL QWORD PTR ECX ECX 8 ARR_SIZE 16 b i 2 c i 2 FSTP QWORD PTR EAX ECX 8 ARR_SIZE 16 a i 2 i 2 c i 2 FLD QWORD PTR EDX ECX 8 ARR_SIZE 24 b i 3 FMUL QWORD PTR ECX ECX 8 ARR_SIZE 24 b i 3 c...

Страница 65: ...er to cut down on loop overhead Determining Prefetch Distance Given the latency of a typical AMD Athlon processor system and expected processor speeds the following formula should be used to determine...

Страница 66: ...ain coherency between the separate instruction and data caches The AMD Athlon processor has a cache line size of 64 bytes which is twice the size of previous processors Programmers must be aware that...

Страница 67: ...eorder buffer The implication of this restriction is that all instructions in the reorder buffer up to and including the store must complete and retire out of the reorder buffer before the load can co...

Страница 68: ...R EAX doubleword load cannot forward upper byte from store buffer Example 2 Avoid MOV EAX 10h MOV BYTE PTR EAX 3 BL byte store MOV ECX DWORD PTR EAX doubleword load cannot forward upper byte from stor...

Страница 69: ...rd boundary etc A common case of misaligned store data forwarding involves the passing of misaligned quadword floating point data on the doubleword aligned integer stack Avoid the type of code shown i...

Страница 70: ...rd or byte stores avoid loading data from anywhere in the same doubleword of memory other than the identical start addresses of the stores Stack Alignment Considerations Make sure the stack is suitabl...

Страница 71: ...Variables on Quadword Aligned Addresses Align variables of type TBYTE on quadword aligned addresses In order to make an array of TBYTE variables that are aligned array elements are 16 bytes apart In...

Страница 72: ...al variables according to their base type size and allocate variables with larger base type size ahead of those with smaller base type size Assuming the first variable allocated is naturally aligned a...

Страница 73: ...these are difficult to predict For example a piece of code receives a random stream of characters A through Z and branches if the character is before M in the collating sequence Data dependent branche...

Страница 74: ...AX MOV Z EAX save min X Y Blended AMD K6 and AMD Athlon Processor Code Example 3 Signed integer ABS function X labs X MOV ECX X load value MOV EBX ECX save value SAR ECX 31 x 0 0xffffffff 0 XOR EBX EC...

Страница 75: ...a 0 MOV a EAX store new offset Example 7 Integer Signum Function C Code int a s if a s 0 else if a 0 s 1 else s 1 Assembly Code MOV EAX a load a CDQ t a 0 0xffffffff 0 CMP EDX EAX a 0 CF NC ADC EDX 0...

Страница 76: ...The principal tools for this are the following instructions PCMPGT PFCMPGT PFCMPGE PFMIN PFMAX PAND PANDN POR PXOR Muxing Constructs The most important construct to avoiding branches in 3DNow and MMX...

Страница 77: ...ions for scalar code because the advantage of 3DNow instructions lies in their SIMDness These examples are meant to demonstrate general techniques for translating source code with branches into branch...

Страница 78: ...e z PFRCPIT1 MM0 MM2 1 z step PFRCPIT2 MM0 MM2 1 z final PFMIN MM0 MM1 z z 1 z 1 z Example 3 C code float x z r res z fabs x if z 0 575 res r else res PI 2 2 r 3DNow code in MM0 x MM1 r out MM0 res MO...

Страница 79: ...efine PI 3 14159265358979323 float x z r res 0 r PI 4 z abs x if z 1 res r else res PI 2 r 3DNow code in MM0 x MM1 r out MM1 res MOVQ MM5 mabs mask to clear sign bit MOVQ MM6 one 1 0 PAND MM0 MM5 z ab...

Страница 80: ...1 y MM2 x out MM0 res MOVQ MM7 sgn mask to extract sign bit MOVQ MM6 sgn mask to extract sign bit MOVQ MM5 mabs mask to clear sign bit PAND MM7 MM2 xs sign x PAND MM1 MM5 ya abs y PAND MM2 MM5 xa abs...

Страница 81: ...uction in the AMD Athlon processor requires eight cycles to execute Use the preferred code shown below Example 1 Avoid LOOP LABEL Example 2 Preferred DEC ECX JNZ LABEL Avoid Far Control Transfer Instr...

Страница 82: ...due to the danger of overflowing the return address stack Convert end recursive functions to iterative code An end recursive function is when the function call to itself is at the end of the code Exa...

Страница 83: ...possibly having a different latency The AMD Athlon processor has flexible scheduling but for absolute maximum performance schedule instructions especially FPU and 3DNow instructions according to thei...

Страница 84: ...nown at compile time The loop body once unrolled is less than 100 instructions which is approximately 400 bytes of code Partial Loop Unrolling Partial loop unrolling can increase register pressure whi...

Страница 85: ...e However the pipelined floating point adder allows one add every cycle In the following code the loop is partially unrolled by a factor of two which creates potential endcases that must be handled ou...

Страница 86: ...l case the loop count starts at some lower bound lo increases by some fixed positive increment inc for each iteration of the loop and may not exceed some upper bound hi The following example shows how...

Страница 87: ...t high speed due to the use of prediction mechanisms However there is still overhead due to passing function arguments through memory which creates STLF store to load forwarding dependencies Some comp...

Страница 88: ...body For large functions the benefits of reduced function call overhead gives diminishing returns Therefore a function that results in the insertion of more than 500 machine instructions at the call...

Страница 89: ...X and MOVSX instructions to zero extend and sign extend byte size and word size operands to doubleword length For example typical code for zero extension creates a superset dependency when the zero ex...

Страница 90: ...2 Preferred int a MAXSIZE b MAXSIZE c MAXSIZE i for i 0 i MAXSIZE i c i a i b i MOV ECX MAXSIZE 1 initialize loop counter add_loop MOV EAX ECX 4 a get element a MOV EDX ECX 4 b get element b ADD EAX...

Страница 91: ...esses are used in the displacement portion of the address and biasing is accomplished at compile time by simply modifying the displacement Example 3 Preferred int a MAXSIZE b MAXSIZE c MAXSIZE i for i...

Страница 92: ...76 Push Memory Data Carefully AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Страница 93: ...sion delivers only one bit of quotient per cycle 22 47 cycles signed 17 41 cycles unsigned the equivalent code is much faster The user can follow the examples in this chapter that illustrate the use o...

Страница 94: ...ut to output the code to a file Unsigned Division by Multiplication of Constant Algorithm Divisors 1 d 231 Odd d The following code shows an unsigned division using a constant value multiplier In d di...

Страница 95: ...0 1 Simpler Code for Restricted Dividend Integer division by a constant can be made faster if the range of the dividend is limited which removes a shift associated with most divisors For example for a...

Страница 96: ...EAX dividend OUT EAX quotient CDQ Sign extend into EDX AND EDX 2 n 1 Mask correction use divisor 1 ADD EAX EDX Apply correction if necessary SAR EAX n Perform right shift by log2 divisor Signed Divis...

Страница 97: ...ly unit the replacement code may provide better throughput The following code samples are designed such that the original source also receives the final result Other sequences are possible if the resu...

Страница 98: ...ADD REG1 REG2 by 15 MOV REG2 REG1 2 cycles SHL REG1 4 SUB REG1 REG2 by 16 SHL REG1 4 1 cycle by 17 MOV REG2 REG1 2 cycles SHL REG1 4 ADD REG1 REG2 by 18 ADD REG1 REG1 3 cycles LEA REG1 REG1 8 REG1 by...

Страница 99: ...ns to do integer only work especially if the function already uses 3DNow or MMX code Using MMX instructions relieves register pressure on the integer registers As long as data is simply loaded stored...

Страница 100: ...icantly However these types are less commonly found The user should use the formula and round up to the nearest integer value to determine the latency Guidelines for Repeated String Instructions To he...

Страница 101: ...misaligned For REP STOS make the destination aligned Inline REP String with Low Counts Expand REP string instructions into equivalent sequences of simple x86 instructions if the repeat count is const...

Страница 102: ...fts are best handled by inline code Multiplies divides and remainders are less common operations and should usually be implemented as subroutines If these subroutines are used often the programmer sho...

Страница 103: ...EDX by 32 bits rshift_done Example 6 Multiplication _llmul computes the low order half of the product of its arguments two 64 bit integers INPUT ESP 8 ESP 4 multiplicand ESP 16 ESP 12 multiplier OUTPU...

Страница 104: ...hi 0 quotient in EDX EAX POP EBX restore EBX as per calling convention RET done return to caller two_divs MOV ECX EAX save dividend_lo in ECX MOV EAX EDX get dividend_hi XOR EDX EDX zero extend it int...

Страница 105: ...EBX restore EBX as per calling convention RET done return to caller _ulldiv ENDP Example 8 Remainder _ullrem divides two unsigned 64 bit integers and returns the remainder INPUT ESP 8 ESP 4 dividend...

Страница 106: ...er of remaining shifts SHRD EBX EDI CL scale down divisor and dividend such SHRD EAX EDX CL that divisor is less than 2 32 SHR EDX CL i e fits in EBX ROL EDI 1 restore original divisor EDI ESI DIV EBX...

Страница 107: ...hms consist of the following steps Step 1 Partition the integer into groups of two bits Compute the population count for each 2 bit group and store the result in the 2 bit group This calls for the fol...

Страница 108: ...YYYkkkkZZZZ The WWWW XXXX YYYY and ZZZZ values are the interesting sums with each at most 1000b or 8 decimal Step 4 The four 4 bit sums can now be rapidly accumulated by means of a multiply with a mag...

Страница 109: ...ctor The utility udiv exe was compiled using the code shown in this section The following code derives the multiplier value used when performing integer division by constants The code works for unsign...

Страница 110: ...ong U32 U32 d l s m a r U64 m_low m_high j k U32 log2 U32 i U32 t 0 i i 1 while i i i 1 t return t Generate m s for algorithm 0 Based on Granlund T Montgomery P L Division by Invariant Integers using...

Страница 111: ...d m r d 1 1 U32 m_low U32 m_low 1 a 1 Reduce multiplier shift factor for either algorithm to smallest possible while m 1 m m 1 s Signed Derivation for Algorithm Multiplier and Shift Factor The utilit...

Страница 112: ...2 log2 U32 i U32 t 0 i i 1 while i i i 1 t return t U32 d l s m a U64 m_low m_high j k Determine algorithm a multiplier m and shift count s for 32 bit signed integer division Based on Granlund T Montg...

Страница 113: ...leword boundaries and quadwords on quadword boundaries Misaligned memory accesses reduce the available memory bandwidth Use Multiplies Rather than Divides If accuracy requirements allow floating point...

Страница 114: ...low Note that the FFREEP instruction although insufficiently documented in the past is supported by all 32 bit x86 processors The opcode bytes for FFREEP ST i are listed in Table 22 on page 212 FFREEP...

Страница 115: ...aneously by explicitly switching execution between them Although the AMD Athlon processor FPU has a deep scheduler which in most cases can extract sufficient parallelism from existing code long depend...

Страница 116: ...such code as quickly as possible In most situations the above code is therefore the fastest way to perform floating point to integer conversion and the conversion is compliant both with programming la...

Страница 117: ...ually the case Example 2 Potentially faster FLD QWORD PTR X load double to be converted FST DWORD PTR TX store X because sign X is needed FIST DWORD PTR I store rndint x as default result FISUB DWORD...

Страница 118: ...rsion that is not compliant with existing programming language standards but is IEEE 754 compliant perform the conversion using the rounding mode that is currently in effect usually round to nearest e...

Страница 119: ...tions by putting the shared source operand at the top of the stack For example using the function func x y x z Example 1 Avoid FLD Z FLD Y FLD X FADD ST ST 2 FXCH ST 1 FMUL ST ST 2 CALL FUNC FSTP ST 0...

Страница 120: ...truction and take appropriate action if it is set indicating an argument out of range Example 1 Avoid FLD QWORD PTR x argument FSIN compute sine FSTSW AX store FPU status word to AX TEST AX 0400h is t...

Страница 121: ...an argument also needs to compute the cosine of that same argument In such cases the FSINCOS instruction should be used to compute both trigonometric functions concurrently which is faster than using...

Страница 122: ...106 Take Advantage of the FSINCOS Instruction AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Страница 123: ...m floating point computations using the 3DNow instructions instead of x87 instructions The SIMD nature of 3DNow achieves twice the number of FLOPs that are achieved through x87 instructions 3DNow inst...

Страница 124: ...uctions for Fast Division 3DNow instructions can be used to compute a very fast highly accurate reciprocal or quotient Optimized 14 Bit Precision Divide This divide operation executes with a total lat...

Страница 125: ...it precision approximation of 1 b using just one PFRCP instruction A full 24 bit precision reciprocal can then be quickly computed from this approximation using a Newton Raphson algorithm The general...

Страница 126: ...ur less cycles than the square root operation Example MOVD MM0 MEM 0 a PFRSQRT MM1 MM0 1 sqrt a 1 sqrt a approximate PUNPCKLDQ MM0 MM0 a a MMX instr PFMUL MM0 MM1 sqrt a sqrt a Optimized 24 Bit Precis...

Страница 127: ...square root value is X3 In the AMD Athlon processor 3DNow implementation the estimate contains the correct round to nearest value for approximately 87 of all arguments The remaining arguments differ f...

Страница 128: ...ction Set Manual order 22466 for more usage information Blended Code Otherwise for blended code which needs to run well on AMD K6 and AMD Athlon family processors the following code is recommended Exa...

Страница 129: ...to accomplish the same task in blended code that achieves good performance on the AMD Athlon processor as well as on the AMD K6 family processors that support 3DNow technology Example 1 AMD Athlon sp...

Page 130: ...the PCMP has a latency of two cycles while the PFCMP has a latency of four cycles In addition to the shorter latency PCMP can be issued to either the FADD or the FMUL pipe while PFCMP is restricted to...

Page 131: ...aligned blocks of data In cases where memory blocks are not quadword aligned additional code is required to handle end cases as needed AMD K6 and AMD Athlon Processor Blended Code The following exampl...

Page 132: ...1 eax 32 movq edx 48 mm2 movq mm2 eax 24 movq edx 40 mm0 movq mm0 eax 16 movq edx 32 mm1 movq mm1 eax 8 movq edx 24 mm2 movq edx 16 mm0 dec ecx movq edx 8 mm1 jnz xfer femms block fill destination QWO...

Page 133: ...Athlon processor specific code where the destination of the block copy is in cacheable space but no immediate data re use of the data at the destination is expected Example 2 block copy source and des...

Page 134: ...e MMX PXOR to Clear All Bits in an MMX Register To clear all the bits in an MMX register to zero use PXOR MMreg MMreg Note that PXOR MMreg MMreg is dependent on previous writes to MMreg Therefore usin...

Page 135: ...d and properly aligned memory location However loading the data from memory runs the risk of cache misses Cases where MOVQ is superior to PCMPEQD are therefore rare and PCMPEQD should be used in gener...

Page 136: ...m 3 0 res y v x m 0 1 v y m 1 1 v z m 2 1 v w m 3 1 res z v x m 0 2 v y m 1 2 v z m 2 2 v w m 3 2 res w v x m 0 3 v y m 1 3 v z m 2 3 v w m 3 3 define M00 0 define M01 4 define M02 8 define M03 12 de...

Page 137: ...VQ MM2 QWORD PTR EAX M22 m 2 3 m 2 2 PFMUL MM0 MM1 v z m 2 1 v z m 2 0 PFADD MM3 MM4 v x m 0 1 v y m 1 1 v x m 0 0 v y m 1 0 MOVQ MM4 QWORD PTR EAX M30 m 3 1 m 3 0 PFMUL MM2 MM1 v z m 2 3 v z m 2 2 PF...

Page 138: ...de indicates whether the vertex is outside the frustum with regard to a specific clip plane Examination of the clip code for a vertex and clipping if the clip code is non zero The following example sh...

Page 139: ...GHT BEFORE MOVQ MM1 MM2 BELOW ABOVE BEHIND LEFT RIGHT BEFORE PUNPCKHDQ MM2 MM2 BELOW ABOVE BEHIND BELOW ABOVE BEHIND POR MM2 MM1 zclip yclip xclip clip code Use 3DNow PAVGUSB for MPEG 2 Motion Compens...

Page 140: ...ORD3 0xfefefefe POR MM2 MM3 calculate adjustment PSRLQ MM0 1 MM0 QWORD1 0xfefefefe 2 PSRLQ MM1 1 MM1 QWORD3 0xfefefefe 2 PAND MM2 MM6 PADDB MM0 MM1 MM0 QWORD1 2 QWORD3 2 w o adjustment PADDB MM0 MM2 a...
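
The averaging fragment on this page halves each operand, adds the halves, and then adds an adjustment taken from the low bits. A minimal scalar sketch in C of why that works, assuming the target is the rounded-up byte average (a + b + 1) >> 1 that PAVGUSB computes per byte. The function names are hypothetical and are not taken from the manual. The 0xfefefefe mask in the MMX code exists only because PSRLQ shifts the whole quadword, so a per-byte C version does not need it.

```c
#include <stdint.h>

/* Reference form: rounded-up average of two unsigned bytes,
   computed in a wider type so the sum cannot overflow. */
uint8_t avg_round_up(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Byte-wide form: halve each operand, add the halves, then add an
   adjustment of 1 when either operand has its low bit set. */
uint8_t avg_halves_plus_adjust(uint8_t a, uint8_t b)
{
    uint8_t adjust = (uint8_t)((a | b) & 1u);
    return (uint8_t)((a >> 1) + (b >> 1) + adjust);
}

/* Exhaustively compare both forms; returns 1 when they agree
   for every pair of byte values, 0 otherwise. */
int avg_forms_agree(void)
{
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            if (avg_round_up((uint8_t)a, (uint8_t)b) !=
                avg_halves_plus_adjust((uint8_t)a, (uint8_t)b)) {
                return 0;
            }
        }
    }
    return 1;
}
```

The second form never exceeds 127 + 127 + 1 = 255, which is why it can be carried out entirely in byte lanes.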

Page 141: ...D2 PAVGUSB MM0 EDI QWORD1 QWORD3 2 with adjustment PAVGUSB MM1 EDI 8 QWORD2 QWORD4 2 with adjustment ADD EAX EDX MOVQ EDI MM0 MOVQ EDI 8 MM1 ADD EDI EBX LOOP L1 Stream of Packed Unsigned Bytes The fol...

Page 142: ...wo element vectors v real v imag one can see the need for swapping the elements of src1 to perform the multiplies for result imag and the need for a mixed positive negative accumulation to complete th...

Page 143: ...ium Pro processors either improve the performance of the AMD Athlon processor or are not required and have a neutral effect usually due to fewer coding restrictions with the AMD Athlon processor Short...

Page 144: ...in memory This technique avoids the comparatively long latencies for accessing memory Stack Allocation When allocating space for local variables and or outgoing parameters within a procedure adjust th...

Page 145: ...re of the AMD Athlon processor is the industry standard x86 instruction set The term microarchitecture refers to the design techniques used in the processor to reach the target cost performance and fu...

Page 146: ...n the AMD Athlon processor enables higher processor core performance and promotes straightforward extendibility for future designs Superscalar Processor The AMD Athlon processor is an aggressive out o...

Page 147: ...y using the bus interface unit BIU The instruction cache generates fetches on the naturally aligned 64 bytes containing the instructions and the next sequential line of 64 bytes a prefetch The princip...

Page 148: ...uction In addition the predecode logic detects code branches such as CALLs RETURNs and short unconditional JMPs When a branch is detected predecoding begins at the target of the branch Branch Predicti...

Page 149: ...to determine which type of basic decode should occur DirectPath or VectorPath DirectPath Decoder DirectPath instructions can be decoded directly into a MacroOP and subsequently into one or two OPs in...

Page 150: ...ght MacroOPs whether integer or floating point for maximum instruction throughput The ICU can simultaneously dispatch multiple MacroOPs from the reorder buffer to both the integer and floating point s...

Page 151: ...r execution pipeline is organized to match the three MacroOP dispatch pipes in the ICU as shown in Figure 2 on page 135 MacroOPs are broken down into OPs in the schedulers OPs issue when their operand...

Page 152: ...at is attached to the pipeline at pipe 0 See Figure 2 on page 135 Multiplies always issue to integer pipe 0 and the issue logic creates results bus bubbles for the multiplier in integer pipes 0 and 1...

Page 153: ...superscalar x87 3DNow and MMX operations The first of the three pipes is generally known as the adder pipe FADD and it contains 3DNow add MMX ALU shifter and floating point add execution units The se...

Page 154: ...ues a 12 entry queue for L1 cache load and store accesses and a 32 entry queue for L2 cache or system memory load and store accesses The 12 entry queue can request a maximum of two L1 cache loads and...

Page 155: ...Combining See Appendix C Implementation of Write Combining on page 155 for detailed information about write combining AMD Athlon System Bus The AMD Athlon system bus is a high speed bus that consists...

Page 157: ...the floating point pipeline manages all x87 3DNow and MMX instructions This appendix describes the operation and functionality of these pipelines Fetch and Decode Pipeline Stages Figure 5 on page 142...

Page 158: ...the DirectPath decodes the common x86 instructions it also contains VectorPath instruction data which allows it to maintain dispatch order at the end of cycle 5 Figure 6 Fetch Scan Align Decode Pipel...

Page 159: ...M and SIB bytes for each instruction and sends the accumulated prefix information to EDEC Cycle 4 VectorPath MEROM In the microcode engine ROM MEROM pipeline stage the entry point generated in the pre...

Page 160: ...aches or system memory There are three integer pipes associated with the three IEUs Figure 7 Integer Execution Pipeline Figure 7 and Figure 8 show the integer execution resources and the pipeline stag...

Page 161: ...If addresses must be calculated to access data necessary to complete the operation the OP proceeds to the next stages ADDGEN and DCACC Cycle 9 ADDGEN In the address generation ADDGEN pipeline stage t...

Page 162: ...of the dataflow through the FPU Figure 9 Floating Point Unit Block Diagram The floating point pipeline stages 7 15 are shown in Figure 10 and described in the following sections Note that the floating...

Page 163: ...schedules up to three MacroOPs per cycle from the 36 entry FPU scheduler to the FREG pipeline stage to read register operands MacroOPs are sent when their operands and or tags are obtained Cycle 11 FR...

Page 164: ...a register results Produced by load or register instructions Address register results Produced by LEA or PUSH instructions Examples The following examples illustrate the operand and result definitions...

Page 165: ...tion early decodes in the VectorPath and requires three OPs an address generation operation for the indirect address a data load from memory and a compare to CX using an IEU The final JZ instruction i...

Page 166: ...common floating point instructions The MMX PFACC instruction is DirectPath decodeable and generates a single MacroOP targeted for the arithmetic operation execution pipeline in the floating point log...

Page 167: ...are waiting to enter the cache subsystem Loads and stores are allocated into LS1 entries at dispatch time in program order and are required by LS1 to probe the data cache in program order The AGUs can...

Page 168: ...DirectPath DP The following nomenclature is used to describe the current location of a particular operation D Dispatch stage Allocate in ICU reservation stations load store LS1 queue I Issue stage Sc...

Page 169: ...ions and therefore dispatches alone in pipe 0 The multiply latency is four cycles 2 The simple INC operation is paired with instructions 3 and 4 The INC executes in IEU0 in cycle 4 3 The MOV executes...

Page 170: ...che access 3 The load execute instruction accesses the data cache in tandem with instruction 2 After the load portion completes the subtraction is executed in cycle 6 in IEU2 4 The shift operation exe...

Page 171: ...y as either writeback WB write protected WP writethrough WT uncacheable UC or write combining WC Defining the memory type for a range of memory as WC or WT allows the processor to conditionally combin...

Page 172: ...cycles that target locations within the address range of a write buffer The AMD Athlon processor combines multiple memory write cycles to a 64 byte buffer whenever the memory address is within a WC o...

Page 173: ...mprove system performance the AMD Athlon processor aggressively combines multiple memory write cycles of any data size that address locations within a 64 byte write buffer that is aligned to a cache l...
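
The description above says writes are combined when they address locations within one 64-byte write buffer that is aligned to a cache line. A minimal sketch in C of just that address test, assuming 32-bit physical addresses. The helper names are hypothetical, and the sketch deliberately ignores the memory-type requirement and the events that close a buffer.

```c
#include <stdint.h>

/* Size of one write-combining buffer in bytes, per the text above. */
#define WC_BUFFER_BYTES 64u

/* Base address of the 64-byte-aligned buffer that covers addr. */
uint32_t wc_buffer_base(uint32_t addr)
{
    return addr & ~(WC_BUFFER_BYTES - 1u);
}

/* Returns 1 when two write addresses fall in the same aligned
   64-byte buffer and are therefore candidates for combining. */
int wc_same_buffer(uint32_t a, uint32_t b)
{
    return wc_buffer_base(a) == wc_buffer_base(b);
}
```

For example, writes to 0x1000 and 0x103F share a buffer, while a write to 0x1040 starts the next one.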

Page 174: ...thin a lock can be combined Uncacheable Read A UC read closes write combining A WC read closes combining only if a cache block address match occurs between the WC read and a write in the write buffer...

Page 175: ...the eight quadwords are valid If this case is true do not proceed to the next rule 2 If all longwords are either full 4 bytes valid or empty 0 bytes valid a Write Longword system command is issued for...

Page 177: ...ounting events a counter is incremented each time a specified event takes place or a specified number of events takes place When measuring duration a counter counts the number of processor clocks that...

Page 178: ...3h The PerfEvtSel 3 0 MSRs shown in Figure 11 control the operation of the performance monitoring counters with one register used to set up each counter These MSRs specify the events to be counted how...

Page 179: ...ows software to measure not only the fraction of time spent in a particular state but also the average length of time spent in such a state for example the time spent waiting for an interrupt to be se...
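
The two measurements mentioned above reduce to simple arithmetic on counter readings. A minimal sketch in C, assuming software has already obtained three values: the clocks counted while the state was active, the number of times the state was entered, and the total clocks elapsed. The function and parameter names are hypothetical, and how each value is collected is outside this sketch.

```c
#include <stdint.h>

/* Fraction of the measurement interval spent in the state.
   Returns 0.0 when the interval is empty. */
double fraction_in_state(uint64_t state_clocks, uint64_t total_clocks)
{
    if (total_clocks == 0) {
        return 0.0;
    }
    return (double)state_clocks / (double)total_clocks;
}

/* Average number of clocks per visit to the state.
   Returns 0.0 when the state was never entered. */
double average_state_clocks(uint64_t state_clocks, uint64_t entries)
{
    if (entries == 0) {
        return 0.0;
    }
    return (double)state_clocks / (double)entries;
}
```

With 1000 clocks in the state over 4 entries in a 4000-clock interval, the fraction is 0.25 and the average visit lasts 250 clocks.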

Page 180: ...ES Segment register loads 21h LS Stores to active instruction stream 40h DC Data cache accesses 41h DC Data cache misses 42h DC xxx1_xxxxb Modified M xxxx_1xxxb Owner O xxxx_x1xxb Exclusive E xxxx_xx...

Page 181: ...ECC errors detected corrected 75h BU bits 15 12 reserved xxxx_1xxxb I invalidates D xxxx_x1xxb I invalidates I xxxx_xx1xb D invalidates D xxxx_xxx1b D invalidates I Internal cache line invalidates 76...

Page 182: ...C2h FR Retired branches conditional unconditional exceptions interrupts C3h FR Retired branches mispredicted C4h FR Retired taken branches C5h FR Retired taken branches mispredicted C6h FR Retired far...

Page 183: ...he RDPMC instruction operation is performed Only the operating system executing at privilege level 0 can directly manipulate the performance counters using the RDMSR and WRMSR instructions A secure op...

Page 184: ...by clearing the enable counters flag or by clearing all the bits in the PerfEvtSel 3 0 MSRs Event and Time Stamp Monitoring Software For applications to use the performance monitoring counters and ti...

Page 185: ...D Athlon processor provides the option of generating a debug interrupt when a performance monitoring counter overflows This mechanism is enabled by setting the interrupt enable flag in one of the Perf...

Page 186: ...An event monitor application utility or another application program can read the collected performance information of the...

Page 187: ...e of the MTRRs is to provide system software with the ability to manage the memory mapping of the hardware Both the BIOS software and operating systems utilize this capability The AMD Athlon processor...

Page 188: ...there is one fixed address MTRR The fixed address ranges all exist in the first 1 Mbyte There are eight variable address ranges above 1 Mbytes Each is programmed to a specific memory starting address...

Page 189: ...Figure 12 MTRR Mapping of Physical Memory 0 FFFFFFFFh 512 Kbytes 256 Kbytes 256 Kbytes 8 Fixed Ranges 64 Kbytes each 64 Fixed Ranges 4 Kbytes each 16 Fixed Ranges 1...

Page 190: ...Uncacheable Uncacheable for reads or writes Cannot be combined Must be non speculative for reads or writes 01h WC Write Combining Uncacheable for reads or writes Can be combined Can be speculative for...

Page 191: ...ed when clear and all of physical memory is mapped as uncacheable memory reset state 0 FE Fixed range MTRRs are enabled when set All MTRRs are disabled when clear When the fixed range MTRRs are enable...

Page 192: ...d behavior However testing shows that these processors decompose these large pages into 4 Kbyte pages When a large page 2 Mbytes 4 Mbytes mapping covers a region that contains more than one memory typ...

Page 193: ...the flexibility of the page tables It provides the operating systems and applications to determine the desired memory type for optimal performance PAT support is detected in the feature flags bit 16 o...

Page 194: ...enabled or when the PDE doesn't describe a large page In the latter case the PATi bit for a PTE bit 7 corresponds to the page size bit in a PDE Therefore the OS should only use PA0 3 when setting the...

Page 195: ...UC MTRR WC x WC WT WB WT WT UC UC WC CD WP CD WP WB WP WP UC UC MTRR WC WT CD WB WB WB UC UC WC WC WT WT WP WP Notes 1 UC MTRR indicates that the UC attribute came from the MTRRs and that the processo...

Page 196: ...Table 16 Final Output Memory Types Input Memory Type Output Memory Type Note RdMem WrMem Effective MType forceCD 5 AMD 751 RdMem WrMem MemType UC UC 1 CD CD 1 WC WC 1 WT WT 1 WP WP 1 WB WB CD 1 2 UC U...

Page 197: ...ince cached IO lines cannot be copied back to IO the processor forces WB to WT to prevent cached IO from going dirty 5 ForceCD The memory type is forced CD due to 1 CR0 CD 1 2 memory type is for the I...

Page 198: ...FF A0000 A3FFF MTRR_fix16K_A0000 C7000 C7FFF C6000 C6FFF C5000 C5FFF C4000 C4FFF C3000 C3FFF C2000 C2FFF C1000 C1FFF C0000 C0FFF MTRR_fix4K_C0000 CF000 CFFFF CE000 CEFFF CD000 CDFFF CC000 CCFFF CB000...

Page 199: ...ss range is power of 2 sized and aligned The range of supported sizes is from 2^12 to 2^36 in powers of 2 The AMD Athlon processor does not implement A[35:32] Figure 16 MTRRphysBasen Register Format Note...
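
The size and alignment rule stated above can be checked mechanically before a variable-range MTRR pair is programmed. A minimal sketch in C, assuming the base and size are given in bytes. The function name is hypothetical, and the check covers only the power-of-2 size, the 2^12 to 2^36 limits, and base alignment to the size; it does not model register encoding or the unimplemented upper address bits.

```c
#include <stdint.h>

/* Returns 1 when (base, size) satisfies the variable-range rules
   described above: size is a power of 2 between 2^12 and 2^36
   bytes, and base is aligned to a multiple of size. */
int mtrr_range_ok(uint64_t base, uint64_t size)
{
    const uint64_t min_size = 1ull << 12;
    const uint64_t max_size = 1ull << 36;

    if (size < min_size || size > max_size) {
        return 0;               /* outside the supported sizes */
    }
    if ((size & (size - 1)) != 0) {
        return 0;               /* not a power of 2 */
    }
    return (base & (size - 1)) == 0;   /* base aligned to size */
}
```

A 1-Mbyte range at 0x100000 passes, while the same size at 0x180000 fails the alignment test and a 12-Kbyte range fails the power-of-2 test.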

Page 200: ...area not mapped by the mask value is set to the default type Discontinuous ranges should not be used The range that is mapped by the variable range MTRR register pair must meet the following range siz...

Page 201: ...ormat on page 183 201h MTRR Mask0 See MTRRphysMaskn Register Format on page 184 202h MTRR Base1 203h MTRR Mask1 204h MTRR Base2 205h MTRR Mask2 206h MTRR Base3 207h MTRR Mask3 208h MTRR Base4 209h MTR...

Page 203: ...ates the instruction mnemonic and operand types with the following notations reg8 byte integer register defined by instruction byte s or bits 5 4 and 3 of the modR M byte mreg8 byte integer register d...

Page 204: ...e used by the instruction The modR M byte defines the instruction as register or memory form If mod bits 7 and 6 are documented as mm memory form mm can only be 10b 01b or 00b The fifth column lists t...
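
The notation above describes the ModR/M byte by bit position: mod in bits 7 and 6, a register or opcode-extension field in bits 5, 4, and 3, and the remaining field in bits 2 through 0. A minimal sketch in C that splits a byte into those fields and tests for the register form (mod = 11b). The type and function names are hypothetical.

```c
#include <stdint.h>

/* The three fields of a ModR/M byte. */
typedef struct {
    unsigned mod;   /* bits 7-6: 11b = register form, else memory form */
    unsigned reg;   /* bits 5-3: register or opcode extension */
    unsigned rm;    /* bits 2-0: register or memory operand selector */
} modrm_fields;

/* Split a ModR/M byte into its mod, reg, and r/m fields. */
modrm_fields modrm_decode(uint8_t byte)
{
    modrm_fields f;
    f.mod = (byte >> 6) & 0x3u;
    f.reg = (byte >> 3) & 0x7u;
    f.rm  = byte & 0x7u;
    return f;
}

/* Returns 1 for the register form (mod = 11b), 0 for the memory
   forms, where mod is 00b, 01b, or 10b. */
int modrm_is_register_form(uint8_t byte)
{
    return modrm_decode(byte).mod == 0x3u;
}
```

For instance, the byte D0h decodes as 11 010 000, a register form whose middle field is 010b, matching the "11 010 xxx" pattern used in the tables.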

Page 205: ...3h 11 010 xxx DirectPath ADC mem16 32 imm8 sign extended 83h mm 010 xxx DirectPath ADD mreg8 reg8 00h 11 xxx xxx DirectPath ADD mem8 reg8 00h mm xxx xxx DirectPath ADD mreg16 32 reg16 32 01h 11 xxx xx...

Page 206: ...ded 83h 11 100 xxx DirectPath AND mem16 32 imm8 sign extended 83h mm 100 xxx DirectPath ARPL mreg16 reg16 63h 11 xxx xxx VectorPath ARPL mem16 reg16 63h mm xxx xxx VectorPath BOUND 62h VectorPath BSF...

Page 207: ...CALL full pointer 9Ah VectorPath CALL near imm16 32 E8h VectorPath CALL mem16 16 32 FFh 11 011 xxx VectorPath CALL near mreg32 indirect FFh 11 010 xxx VectorPath CALL near mem32 indirect FFh mm 010 x...

Page 208: ...NO reg16 32 mem16 32 0Fh 41h mm xxx xxx DirectPath CMOVNP CMOVPO reg16 32 reg16 32 0Fh 4Bh 11 xxx xxx DirectPath CMOVNP CMOVPO reg16 32 mem16 32 0Fh 4Bh mm xxx xxx DirectPath CMOVNS reg16 32 reg16 32...

Page 209: ...reg8 0Fh B0h mm xxx xxx VectorPath CMPXCHG mreg16 32 reg16 32 0Fh B1h 11 xxx xxx VectorPath CMPXCHG mem16 32 reg16 32 0Fh B1h mm xxx xxx VectorPath CMPXCHG8B mem64 0Fh C7h mm xxx xxx VectorPath CPUID...

Page 210: ...8 signed 6Bh 11 xxx xxx VectorPath IMUL reg16 32 mem16 32 imm8 signed 6Bh mm xxx xxx VectorPath IMUL AX AL mreg8 F6h 11 101 xxx VectorPath IMUL AX AL mem8 F6h mm 101 xxx VectorPath IMUL EDX EAX EAX mr...

Page 211: ...rt disp8 78h DirectPath JNS short disp8 79h DirectPath JP JPE short disp8 7Ah DirectPath JNP JPO short disp8 7Bh DirectPath JL JNGE short disp8 7Ch DirectPath JNL JGE short disp8 7Dh DirectPath JLE JN...

Page 212: ...AHF 9Fh VectorPath LAR reg16 32 mreg16 32 0Fh 02h 11 xxx xxx VectorPath LAR reg16 32 mem16 32 0Fh 02h mm xxx xxx VectorPath LDS reg16 32 mem32 48 C5h mm xxx xxx VectorPath LEA reg16 mem16 32 8Dh mm xx...

Page 213: ...h MOV reg8 mem8 8Ah mm xxx xxx DirectPath MOV reg16 32 mreg16 32 8Bh 11 xxx xxx DirectPath MOV reg16 32 mem16 32 8Bh mm xxx xxx DirectPath MOV mreg16 segment reg 8Ch 11 xxx xxx VectorPath MOV mem16 se...

Page 214: ...xxx DirectPath MOVSX reg32 mreg16 0Fh BFh 11 xxx xxx DirectPath MOVSX reg32 mem16 0Fh BFh mm xxx xxx DirectPath MOVZX reg16 32 mreg8 0Fh B6h 11 xxx xxx DirectPath MOVZX reg16 32 mem8 0Fh B6h mm xxx xx...

Page 215: ...Ch DirectPath OR EAX imm16 32 0Dh DirectPath OR mreg8 imm8 80h 11 001 xxx DirectPath OR mem8 imm8 80h mm 001 xxx DirectPath OR mreg16 32 imm16 32 81h 11 001 xxx DirectPath OR mem16 32 imm16 32 81h mm...

Page 216: ...PUSH DS 1Eh VectorPath PUSH EAX 50h DirectPath PUSH ECX 51h DirectPath PUSH EDX 52h DirectPath PUSH EBX 53h DirectPath PUSH ESP 54h DirectPath PUSH EBP 55h DirectPath PUSH ESI 56h DirectPath PUSH EDI...

Page 217: ...xx DirectPath RCR mem8 1 D0h mm 011 xxx DirectPath RCR mreg16 32 1 D1h 11 011 xxx DirectPath RCR mem16 32 1 D1h mm 011 xxx DirectPath RCR mreg8 CL D2h 11 011 xxx DirectPath RCR mem8 CL D2h mm 011 xxx...

Page 218: ...xx DirectPath ROR mreg8 CL D2h 11 001 xxx DirectPath ROR mem8 CL D2h mm 001 xxx DirectPath ROR mreg16 32 CL D3h 11 001 xxx DirectPath ROR mem16 32 CL D3h mm 001 xxx DirectPath SAHF 9Eh VectorPath SAR...

Page 219: ...SB AL mem8 AEh VectorPath SCASW AX mem16 AFh VectorPath SCASD EAX mem32 AFh VectorPath SETO mreg8 0Fh 90h 11 xxx xxx DirectPath SETO mem8 0Fh 90h mm xxx xxx DirectPath SETNO mreg8 0Fh 91h 11 xxx xxx D...

Page 220: ...h SETG SETNLE mreg8 0Fh 9Fh 11 xxx xxx DirectPath SETG SETNLE mem8 0Fh 9Fh mm xxx xxx DirectPath SGDT mem48 0Fh 01h mm 000 xxx VectorPath SIDT mem48 0Fh 01h mm 001 xxx VectorPath SHL SAL mreg8 imm8 C0...

Page 221: ...SHRD mreg16 32 reg16 32 imm8 0Fh ACh 11 xxx xxx VectorPath SHRD mem16 32 reg16 32 imm8 0Fh ACh mm xxx xxx VectorPath SHRD mreg16 32 reg16 32 CL 0Fh ADh 11 xxx xxx VectorPath SHRD mem16 32 reg16 32 CL...

Page 222: ...05h VectorPath SYSENTER 0Fh 34h VectorPath SYSEXIT 0Fh 35h VectorPath SYSRET 0Fh 07h VectorPath TEST mreg8 reg8 84h 11 xxx xxx DirectPath TEST mem8 reg8 84h mm xxx xxx DirectPath TEST mreg16 32 reg16...

Page 223: ...h VectorPath XCHG EAX EDI 97h VectorPath XLAT D7h VectorPath XOR mreg8 reg8 30h 11 xxx xxx DirectPath XOR mem8 reg8 30h mm xxx xxx DirectPath XOR mreg16 32 reg16 32 31h 11 xxx xxx DirectPath XOR mem16...

Page 224: ...FADD FMUL PACKUSWB mmreg1 mmreg2 0Fh 67h 11 xxx xxx DirectPath FADD FMUL PACKUSWB mmreg mem64 0Fh 67h mm xxx xxx DirectPath FADD FMUL PADDB mmreg1 mmreg2 0Fh FCh 11 xxx xxx DirectPath FADD FMUL PADDB...

Page 225: ...PMADDWD mmreg1 mmreg2 0Fh F5h 11 xxx xxx DirectPath FMUL PMADDWD mmreg mem64 0Fh F5h mm xxx xxx DirectPath FMUL PMULHW mmreg1 mmreg2 0Fh E5h 11 xxx xxx DirectPath FMUL PMULHW mmreg mem64 0Fh E5h mm x...

Page 226: ...ctPath FADD FMUL PSUBB mmreg1 mmreg2 0Fh F8h 11 xxx xxx DirectPath FADD FMUL PSUBB mmreg mem64 0Fh F8h mm xxx xxx DirectPath FADD FMUL PSUBD mmreg1 mmreg2 0Fh FAh 11 xxx xxx DirectPath FADD FMUL PSUBD...

Page 227: ...odR M Byte Decode Type FPU Pipe s Notes MASKMOVQ mmreg1 mmreg2 0Fh F7h VectorPath FADD FMUL FSTORE MOVNTQ mem64 mmreg 0Fh E7h DirectPath FSTORE PAVGB mmreg1 mmreg2 0Fh E0h 11 xxx xxx DirectPath FADD F...

Page 228: ...mem8 0Fh 18h DirectPath 1 PREFETCHT2 mem8 0Fh 18h DirectPath 1 SFENCE 0Fh AEh VectorPath Table 22 Floating Point Instructions Instruction Mnemonic First Byte Second Byte ModR M Byte Decode Type FPU Pi...

Page 229: ...D FCOMP mem64real DCh mm 011 xxx DirectPath FADD FCOMPP DEh D9h 11 011 001 DirectPath FADD FCOS D9h FFh VectorPath FDECSTP D9h F6h DirectPath FADD FMUL FSTORE FDIV ST ST i D8h 11 110 xxx DirectPath FM...

Page 230: ...m 001 xxx VectorPath FINCSTP D9h F7h DirectPath FADD FMUL FSTORE FINIT DBh E3h VectorPath FIST mem16int DFh mm 010 xxx DirectPath FSTORE FIST mem32int DBh mm 010 xxx DirectPath FSTORE FISTP mem16int D...

Page 231: ...h 11 001 xxx DirectPath FMUL 1 FNOP D9h D0h DirectPath FADD FMUL FSTORE FPTAN D9h F2h VectorPath FPATAN D9h F3h VectorPath FPREM D9h F8h DirectPath FMUL FPREM1 D9h F5h DirectPath FMUL FRNDINT D9h FCh...

Page 232: ...1 FSUBP ST ST i DEh 11 101 xxx DirectPath FADD 1 FSUBR mem32real D8h mm 101 xxx DirectPath FADD FSUBR mem64real DCh mm 101 xxx DirectPath FADD FSUBR ST ST i D8h 11 100 xxx DirectPath FADD 1 FSUBR ST...

Page 233: ...mmreg2 0Fh 0Fh A0h 11 xxx xxx DirectPath FADD PFCMPGT mmreg mem64 0Fh 0Fh A0h mm xxx xxx DirectPath FADD PFMAX mmreg1 mmreg2 0Fh 0Fh A4h 11 xxx xxx DirectPath FADD PFMAX mmreg mem64 0Fh 0Fh A4h mm xxx...

Page 234: ...ruction Mnemonic Prefix Byte s imm8 ModR M Byte Decode Type FPU Pipe s Note Notes 1 For the PREFETCH and PREFETCHW instructions the mem8 value refers to an address in the 64 byte line that will be pre...

Page 235: ...udes register register op memory as well as register register op register forms of instructions DirectPath Instructions The following tables contain DirectPath instructions which should be used in the...

Page 236: ...imm8 ADD mreg16 32 imm16 32 ADD mem16 32 imm16 32 ADD mreg16 32 imm8 sign extended ADD mem16 32 imm8 sign extended AND mreg8 reg8 AND mem8 reg8 AND mreg16 32 reg16 32 AND mem16 32 reg16 32 AND reg8 mr...

Page 237: ...NS reg16 32 mem16 32 CMOVO reg16 32 reg16 32 CMOVO reg16 32 mem16 32 CMOVP CMOVPE reg16 32 reg16 32 CMOVP CMOVPE reg16 32 mem16 32 CMOVS reg16 32 reg16 32 CMOVS reg16 32 mem16 32 CMP mreg8 reg8 CMP me...

Page 238: ...16 32 JNL JGE near disp16 32 JLE JNG near disp16 32 JNLE JG near disp16 32 JMP near disp16 32 direct JMP far disp32 48 direct JMP disp8 short Table 25 DirectPath Integer Instructions Continued Instruc...

Page 239: ...imm16 32 OR mreg8 imm8 OR mem8 imm8 OR mreg16 32 imm16 32 OR mem16 32 imm16 32 OR mreg16 32 imm8 sign extended OR mem16 32 imm8 sign extended Table 25 DirectPath Integer Instructions Continued Instru...

Page 240: ...16 32 SBB reg8 mreg8 SBB reg8 mem8 Table 25 DirectPath Integer Instructions Continued Instruction Mnemonic SBB reg16 32 mreg16 32 SBB reg16 32 mem16 32 SBB AL imm8 SBB EAX imm16 32 SBB mreg8 imm8 SBB...

Page 241: ...g16 32 CL SHR mem16 32 CL STC SUB mreg8 reg8 Table 25 DirectPath Integer Instructions Continued Instruction Mnemonic SUB mem8 reg8 SUB mreg16 32 reg16 32 SUB mem16 32 reg16 32 SUB reg8 mreg8 SUB reg8...

Page 242: ...XOR reg16 32 mem16 32 XOR AL imm8 XOR EAX imm16 32 XOR mreg8 imm8 XOR mem8 imm8 XOR mreg16 32 imm16 32 XOR mem16 32 imm16 32 XOR mreg16 32 imm8 sign extended XOR mem16 32 imm8 sign ext...

Page 243: ...em64 PAND mmreg1 mmreg2 PAND mmreg mem64 PANDN mmreg1 mmreg2 PANDN mmreg mem64 PCMPEQB mmreg1 mmreg2 PCMPEQB mmreg mem64 PCMPEQD mmreg1 mmreg2 PCMPEQD mmreg mem64 PCMPEQW mmreg1 mmreg2 PCMPEQW mmreg m...

Page 244: ...reg mem64 PUNPCKLBW mmreg1 mmreg2 PUNPCKLBW mmreg mem64 PUNPCKLDQ mmreg1 mmreg2 PUNPCKLDQ mmreg mem64 PUNPCKLWD mmreg1 mmreg2 PUNPCKLWD mmreg mem64 PXOR mmreg1 mmreg2 Table 26 DirectPath MMX Instructi...

Page 245: ...IVR ST i ST FDIVR mem32real FDIVR mem64real FDIVRP ST i ST FFREE ST i FFREEP ST i FILD mem16int FILD mem32int FILD mem64int FIMUL mem32int FIMUL mem16int FINCSTP FIST mem16int FIST mem32int FISTP mem1...

Page 246: ...FSUB ST i ST FSUBP ST ST i FSUBR mem32real FSUBR mem64real FSUBR ST ST i FSUBR ST i ST FSUBRP ST i ST FTST FUCOM FUCOMP FUCOMPP FWAIT FXCH Table 28 DirectPat...

Page 247: ...reg16 32 mreg16 32 BSF reg16 32 mem16 32 BSR reg16 32 mreg16 32 BSR reg16 32 mem16 32 BT mem16 32 reg16 32 BTC mreg16 32 reg16 32 BTC mem16 32 reg16 32 BTC mreg16 32 imm8 BTC mem16 32 imm8 BTR mreg16...

Page 248: ...ect JMP far mem32 indirect JMP far mreg32 indirect LAHF LAR reg16 32 mreg16 32 LAR reg16 32 mem16 32 LDS reg16 32 mem32 48 Table 29 VectorPath Integer Instructions Continued Instruction Mnemonic LEA r...

Page 249: ...Integer Instructions Continued Instruction Mnemonic RCL mem8 imm8 RCL mem16 32 imm8 RCL mem8 CL RCL mem16 32 CL RCR mem8 imm8 RCR mem16 32 imm8 RCR mem8 CL RCR mem16 32 CL RDMSR RDPMC RDTSC RET near i...

Page 250: ...g16 32 XCHG reg8 mreg8 XCHG reg8 mem8 XCHG reg16 32 mreg16 32 XCHG reg16 32 mem16 32 XCHG EAX ECX XCHG EAX EDX XCHG EAX EBX XCHG EAX ESP XCHG EAX EBP XCHG EAX ESI XCHG EAX EDI XLAT Table 29 VectorPath...

Page 251: ...OM mem32int FICOM mem16int FICOMP mem32int FICOMP mem16int FIDIV mem32int FIDIV mem16int FIDIVR mem32int FIDIVR mem16int FIMUL mem32int FIMUL mem16int FINIT FISUB mem32int FISUB mem16int FISUBR mem32i...

Page 253: ...siderations 27 55 Cache 4 64 Byte Cache Line 11 50 Cache and Memory Optimizations 45 CALL and RETURN Instructions 59 Code Padding Using Neutral Code Fillers 39 Code Sample Analysis 152 Complex Number...

Page 254: ...teger Only Work 83 MOVQ Instruction 85 PAND to Find Absolute Value in 3DNow Code 119 PCMP Instead of 3DNow PFCMP 114 PCMPEQD to Set an MMX Register 119 PMADDWD Instruction 111 PREFETCHNTA T0 T1 T2 Ins...

Page 255: ...T TBYTE Variables 55 Trigonometric Instructions 103 V VectorPath Decoder 133 VectorPath Instructions 231 W Write Combining 10 50 139 155 157 159 X x86 Optimizati...
