Unrolling Loops

22007E/0—November 1999

AMD Athlon™ Processor x86 Code Optimization

Without Loop Unrolling:

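        MOV   ECX, MAX_LENGTH      ; ECX = element count (assumed >= 1)
        MOV   EAX, OFFSET A        ; EAX -> current element of A
        MOV   EBX, OFFSET B        ; EBX -> current element of B

$add_loop:
        FLD   QWORD PTR [EAX]      ; push A[i] onto the FPU stack
        FADD  QWORD PTR [EBX]      ; ST(0) = A[i] + B[i]
        FSTP  QWORD PTR [EAX]      ; A[i] = ST(0), pop
        ADD   EAX, 8               ; advance to next double in A
        ADD   EBX, 8               ; advance to next double in B
        DEC   ECX                  ; one element fewer to process
        JNZ   $add_loop            ; repeat until ECX reaches zero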

The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so the
loop cannot execute faster than three iterations in seven cycles,
or 3/7 floating-point adds per cycle. The pipelined
floating-point adder, however, allows one add every cycle, so
loop overhead rather than the adder limits throughput. In the
following code, the loop is partially unrolled by a factor of
two, which creates potential end cases that must be handled
outside the loop:

With Partial Loop Unrolling:

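        MOV   ECX, MAX_LENGTH      ; element count (assumed >= 2)
        MOV   EAX, OFFSET A        ; EAX -> current element of A
        MOV   EBX, OFFSET B        ; EBX -> current element of B
        SHR   ECX, 1               ; ECX = MAX_LENGTH / 2, CF = low bit
        JNC   $add_loop            ; even count: no end case to handle
        FLD   QWORD PTR [EAX]      ; odd count: add one element up front
        FADD  QWORD PTR [EBX]
        FSTP  QWORD PTR [EAX]
        ADD   EAX, 8
        ADD   EBX, 8

$add_loop:
        FLD   QWORD PTR [EAX]      ; first element of the pair
        FADD  QWORD PTR [EBX]
        FSTP  QWORD PTR [EAX]
        FLD   QWORD PTR [EAX+8]    ; second element of the pair
        FADD  QWORD PTR [EBX+8]
        FSTP  QWORD PTR [EAX+8]
        ADD   EAX, 16              ; advance both pointers by two doubles
        ADD   EBX, 16
        DEC   ECX                  ; one pair fewer to process
        JNZ   $add_loop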
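
For readers who think in C, the same unroll-by-two structure can be sketched as follows. This is an illustrative sketch, not code from the manual: the function names add_arrays and add_arrays_unrolled2 and the use of size_t are assumptions made for the example. The odd element is peeled off before the loop, mirroring the SHR/JNC end-case handling in the assembly listing.

```c
#include <stddef.h>

/* Straightforward version: one add per iteration, as in the
   "Without Loop Unrolling" listing. (Hypothetical helper name.) */
void add_arrays(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by a factor of two. (Hypothetical helper name.)
   If the element count is odd, one element is handled before the
   loop so that the loop body can always process a full pair. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    if (n & 1) {                /* odd count: handle the end case first */
        a[0] += b[0];
        i = 1;
    }

    for (; i < n; i += 2) {     /* remaining count is even */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}
```

Unlike the assembly listing, which appears to assume MAX_LENGTH is at least 2, this sketch also tolerates lengths of 0 and 1. A compiler may or may not unroll the simple version on its own, so it is worth measuring both forms before committing to the manual unroll.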

Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes

Summary of Contents for Athlon Processor x86

Page 1: ...AMD Athlon Processor x86 Code Optimization Guide TM...

Page 2: ...ations and product descriptions at any time without notice No license whether express implied arising by estoppel or otherwise to any intellectual property rights is granted by this publication Except...

Page 3: ...nstructions 9 Group II Optimizations Secondary Optimizations 9 Load Execute Instruction Usage 9 Take Advantage of Write Combining 10 Use 3DNow Instructions 10 Avoid Branches Dependent on Random Data 1...

Page 4: ...Base Type Size 28 Accelerating Floating Point Divides and Square Roots 29 Avoid Unnecessary Integer Division 31 Copy Frequently De referenced Pointer Arguments to Local Variables 31 4 Instruction Dec...

Page 5: ...Cache Line 50 Store to Load Forwarding Restrictions 51 Store to Load Forwarding Pitfalls True Dependencies 51 Summary of Store to Load Forwarding Pitfalls to Avoid 54 Stack Alignment Considerations 5...

Page 6: ...8 Integer Optimizations 77 Replace Divides with Multiplies 77 Multiplication by Reciprocal Division Utility 77 Unsigned Division by Multiplication of Constant 78 Signed Division by Multiplication of...

Page 7: ...103 Check Argument Range of Trigonometric Instructions Efficiently 103 Take Advantage of the FSINCOS Instruction 105 10 3DNow and MMX Optimizations 107 Use 3DNow Instructions 107 Use FEMMS Instructio...

Page 8: ...119 Optimized Matrix Multiplication 119 Efficient 3D Clipping Code Computation Using 3DNow Instructions 122 Use 3DNow PAVGUSB for MPEG 2 Motion Compensation 123 Stream of Packed Unsigned Bytes 125 Com...

Page 9: ...g Point Pipeline Stages 146 Execution Unit Resources 148 Terminology 148 Integer Pipeline Operations 149 Floating Point Pipeline Operations 150 Load Store Pipeline Operations 151 Code Sample Analysis...

Page 10: ...toring Software 168 Monitoring Counter Overflow 169 Appendix E Programming the MTRR and PAT 171 Introduction 171 Memory Type Range Register MTRR Mechanism 171 Page Attribute Table PAT 177 Appendix F I...

Page 11: ...6 Fetch Scan Align Decode Pipeline Stages 142 Figure 7 Integer Execution Pipeline 144 Figure 8 Integer Pipeline Stages 144 Figure 9 Floating Point Unit Block Diagram 146 Figure 10 Floating Point Pipe...

Page 12: ...xii List of Figures AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Page 13: ...Generation Rules 159 Table 11 Performance Monitoring Counters 164 Table 12 Memory Type Encodings 174 Table 13 Standard MTRR Types and Properties 176 Table 14 PATi 3 Bit Encodings 178 Table 15 Effecti...

Page 14: ...rocessor x86 Code Optimization 22007E 0 November 1999 Table 29 VectorPath Integer Instructions 231 Table 30 VectorPath MMX Instructions 234 Table 31 VectorPath MMX Extensions 234 Table 32 VectorPath F...

Page 15: ...FFREEP Macro to Pop One Register from the FPU Stack on page 98 Further clarification of Minimize Floating Point to Integer Conversions on page 100 Added the optimization Check Argument Range of Trigon...

Page 16: ...xvi Revision History AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Page 17: ...AMD Athlon processor programmers should write software that includes specific code optimization techniques About this Document This document contains information to assist programmers in creating opti...

Page 18: ...makes efficient use of the large L1 caches and high bandwidth buses of the AMD Athlon processor Chapter 6 Branch Optimizations Describes optimizations that improves branch prediction and minimizes bra...

Page 19: ...ists the x86 instructions that are DirectPath and VectorPath instructions AMD Athlon Processor Family The AMD Athlon processor family uses state of the art decoupled decode execution design techniques...

Page 20: ...ive execution Three way integer execution Three way address generation Three way floating point execution 3DNow technology and MMX single instruction multiple data SIMD instruction extensions Super da...

Page 21: ...for high performance floating point vector operations which can replace x87 instructions and enhance the performance of 3D graphics and other floating point intensive applications Because the 3DNow a...

Page 22: ...the AMD Athlon processor to achieve maximum performance Due to the more flexible pipeline control and aggressive out of order execution the AMD Athlon processor is not as sensitive to instruction sele...

Page 23: ...hould follow these critical guidelines closely The optimizations in Group I are as follows Memory Size and Alignment Issues Avoid memory size mismatches Align data where possible Use the 3DNow PREFETC...

Page 24: ...ze mismatches when instructions operate on the same data For instructions that store and reload the same data keep operands aligned and keep the loads stores of each operand the same size Align Data W...

Page 25: ...code and execute efficiently by minimizing the number of operations per x86 instruction Three DirectPath instructions can be decoded in parallel Using VectorPath instructions will block DirectPath ins...

Page 26: ...ementation of Write Combining on page 155 for more details Use 3DNow Instructions Unless accuracy requirements dictate otherwise perform floating point computations using the 3DNow instructions instea...

Page 27: ...ze of previous processors Code and data should not be shared in the same 64 byte cache line especially if the data ever becomes modified In order to maintain cache coherency the AMD Athlon processor m...

Page 28: ...12 Group II Optimizations Secondary Optimizations AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Page 29: ...ating point variables and expressions are of type float Pay special attention to floating point constants These require a suffix of F or f for example 3 14f in order to be of type float otherwise they...

Page 30: ...while others are faster for signed types Integer to floating point conversion using integers larger than 16 bit is faster with signed types as the x86 FPU provides instructions for converting signed i...

Page 31: ...writes through a pointer can write to any place in memory This includes storage allocated to other variables creating the issue of aliasing i e the same block of memory is accessible in more than one...

Page 32: ...de it is therefore possible that pointer style code will be compiled into machine code that is faster than that generated from equivalent array style code It is advisable to check the performance afte...

Page 33: ...eferred typedef struct float x y z w VERTEX typedef struct float m 4 4 MATRIX void XForm float res const float v const float m int numverts int i const VERTEX vv VERTEX v const MATRIX mm MATRIX m VERT...

Page 34: ...4 transform matrix M r 0 M 0 0 V 0 M 1 0 V 1 M 2 0 V 2 M 3 0 V 3 r 1 M 0 1 V 0 M 1 1 V 1 M 2 1 V 2 M 3 1 V 3 r 2 M 0 2 V 0 M 1 2 V 1 M 2 2 V 2 M 3 2 V 3 r 3 M 0 3 V 0 M 1 3 V 1 M 2 3 V 2 M 3 3 v 3 Avo...

Page 35: ...refore recommended that the programmer remove the dependency manually e g by introducing a temporary variable that can be kept in a register This can result in a significant performance increase The f...

Page 36: ...s of conditional branches The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition and can have a significant impact on performance It is unf...

Page 37: ...integers as this will allow the switch to be translated as a jump table Example 1 Avoid int days_in_month short_months normal_months long_months switch days_in_month case 28 case 29 short_months brea...

Page 38: ...rmance of inner loops it is beneficial to reduce redundant constant calculations i e loop invariant calculations However this idea can be extended to invariant control structures The first case is tha...

Page 39: ...imple switch statement Example 2 for i if CONSTANT0 DoWork0 i does not affect CONSTANT0 or CONSTANT1 else DoWork1 i does not affect CONSTANT0 or CONSTANT1 if CONSTANT1 DoWork2 i does not affect CONSTA...

Page 40: ...method for combining the input constants gets more complicated but will be worth it for the performance benefit However the number of inner loops can also substantially increase If the number of inne...

Page 41: ...cit Parallelism into Code Where possible long dependency chains should be broken into several independent dependency chains which can then be executed in parallel exploiting the pipeline execution uni...

Page 42: ...t adder Each stage of the floating point adder is occupied on every clock cycle ensuring maximal sustained utilization Explicitly Extract Common Subexpressions In certain situations C compilers are un...

Page 43: ...better alignment for structures In addition to improve the alignment of structure members some compilers might allocate structure elements in an order that differs from the order in which they are de...

Page 44: ...Component Considerations on page 55 for a different perspective Sort Local Variables According to Base Type Size When a compiler allocates local variables in the same order in which they are declared...

Page 45: ...a much longer latency than other floating point operations even though the AMD Athlon processor provides significant acceleration of these two operations In some codes these operations occur so often...

Page 46: ...ere it creates little overhead such as outside a computation intensive loop Otherwise the overhead created by the function calls outweighs the benefit from reducing the latencies of divide and square...

Page 47: ...i j k Example 2 Preferred int i j k l m i j k Copy Frequently De referenced Pointer Arguments to Local Variables Avoid frequently de referencing pointer arguments inside a function Since the compiler...

Page 48: ...Avoid assumes pointers are different and q r void isqrt unsigned long a unsigned long q unsigned long r q a if a 0 while q r a q q q r 1 r a q q Example 2 Preferred assumes pointers are different and...

Page 49: ...ne selects for decode up to three x86 instructions from the instruction byte queue All instructions x86 x87 3DNow and MMX are classified into two types of decodes DirectPath and VectorPath see DirectP...

Page 50: ...ctions in the AMD Athlon processor Assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions See Appendix F Instruction Dispatch and Execution Resource...

Page 51: ...FMUL ST ST 1 Example 2 Preferred FLD QWORD PTR TEST1 FMUL QWORD PTR TEST2 Avoid Load Execute Floating Point Instructions with Integer Operands Do not use load execute floating point instructions with...

Page 52: ...while preventing I cache space in branch intensive code Use Short Instruction Lengths Assemblers and compilers should generate the tightest code possible to optimize use of the I cache and increase a...

Page 53: ...rent state of the remainder of the register Therefore the dependency hardware can potentially force a false dependency on the most recent instruction that writes to any part of the register Example 1...

Page 54: ...ions while SHLD is a VectorPath instruction SHR and LEA preserves decode bandwidth as it potentially enables the decoding of a third DirectPath instruction Example 1 Avoid SHLD REG1 REG2 1 Preferred S...

Page 55: ...can be executed it should take up as few execution resources as possible not diminish decode density and not modify any processor state other than advancing EIP A one byte padding can easily be achiev...

Page 56: ...neutral code filler of up to nine bytes Note When used as a filler instruction REP REPNE prefixes can be used in conjunction only with NOPs REP REPNE has undefined behavior when used with instruction...

Page 57: ...used by instructions in the vicinity of the neutral code filler Note that certain instructions use registers implicitly For example PUSH POP CALL and RET all make implicit use of the ESP register The...

Page 58: ...esi 00 NOP4_EDI TEXTEQU DB 08Dh 074h 026h 000h lea edi edi 00 NOP4_ESP TEXTEQU DB 08Dh 07Ch 027h 000h lea esp esp 00 lea eax eax 00 nop NOP5_EAX TEXTEQU DB 08Dh 044h 020h 000h 090h lea ebx ebx 00 nop...

Page 59: ...B 08Dh 034h 035h 0 0 0 0 lea edi edi 1 00000000 NOP7_EDI TEXTEQU DB 08Dh 03Ch 03Dh 0 0 0 0 lea ebp ebp 1 00000000 NOP7_EBP TEXTEQU DB 08Dh 02Ch 02Dh 0 0 0 0 lea eax eax 1 00000000 nop NOP8_EAX TEXTEQU...

Page 60: ...44 Code Padding Using Neutral Code Fillers AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Page 61: ...nment Issues Avoid Memory Size Mismatches Avoid memory size mismatches when instructions operate on the same data For instructions that store and reload the same data keep operands aligned and keep th...

Page 62: ...e likelihood of encountering a store to load forwarding pitfall For a more detailed discussion of store to load forwarding issues see Store to Load Forwarding Restrictions on page 51 Use the 3DNow PRE...

Page 63: ...brought in by PREFETCHW as Modified Using PREFETCHW can save an additional 15 25 cycles compared to a PREFETCH and the subsequent cache state change caused by a write to the prefetched cache line Mul...

Page 64: ...ZE 16 b i 2 FMUL QWORD PTR ECX ECX 8 ARR_SIZE 16 b i 2 c i 2 FSTP QWORD PTR EAX ECX 8 ARR_SIZE 16 a i 2 i 2 c i 2 FLD QWORD PTR EDX ECX 8 ARR_SIZE 24 b i 3 FMUL QWORD PTR ECX ECX 8 ARR_SIZE 24 b i 3 c...

Page 65: ...er to cut down on loop overhead Determining Prefetch Distance Given the latency of a typical AMD Athlon processor system and expected processor speeds the following formula should be used to determine...

Page 66: ...ain coherency between the separate instruction and data caches The AMD Athlon processor has a cache line size of 64 bytes which is twice the size of previous processors Programmers must be aware that...

Page 67: ...eorder buffer The implication of this restriction is that all instructions in the reorder buffer up to and including the store must complete and retire out of the reorder buffer before the load can co...

Page 68: ...R EAX doubleword load cannot forward upper byte from store buffer Example 2 Avoid MOV EAX 10h MOV BYTE PTR EAX 3 BL byte store MOV ECX DWORD PTR EAX doubleword load cannot forward upper byte from stor...

Page 69: ...rd boundary etc A common case of misaligned store data forwarding involves the passing of misaligned quadword floating point data on the doubleword aligned integer stack Avoid the type of code shown i...

Page 70: ...rd or byte stores avoid loading data from anywhere in the same doubleword of memory other than the identical start addresses of the stores Stack Alignment Considerations Make sure the stack is suitabl...

Page 71: ...Variables on Quadword Aligned Addresses Align variables of type TBYTE on quadword aligned addresses In order to make an array of TBYTE variables that are aligned array elements are 16 bytes apart In...

Page 72: ...al variables according to their base type size and allocate variables with larger base type size ahead of those with smaller base type size Assuming the first variable allocated is naturally aligned a...

Page 73: ...these are difficult to predict For example a piece of code receives a random stream of characters A through Z and branches if the character is before M in the collating sequence Data dependent branche...

Page 74: ...AX MOV Z EAX save min X Y Blended AMD K6 and AMD Athlon Processor Code Example 3 Signed integer ABS function X labs X MOV ECX X load value MOV EBX ECX save value SAR ECX 31 x 0 0xffffffff 0 XOR EBX EC...

Page 75: ...a 0 MOV a EAX store new offset Example 7 Integer Signum Function C Code int a s if a s 0 else if a 0 s 1 else s 1 Assembly Code MOV EAX a load a CDQ t a 0 0xffffffff 0 CMP EDX EAX a 0 CF NC ADC EDX 0...

Page 76: ...The principal tools for this are the following instructions PCMPGT PFCMPGT PFCMPGE PFMIN PFMAX PAND PANDN POR PXOR Muxing Constructs The most important construct to avoiding branches in 3DNow and MMX...

Page 77: ...ions for scalar code because the advantage of 3DNow instructions lies in their SIMDness These examples are meant to demonstrate general techniques for translating source code with branches into branch...

Page 78: ...e z PFRCPIT1 MM0 MM2 1 z step PFRCPIT2 MM0 MM2 1 z final PFMIN MM0 MM1 z z 1 z 1 z Example 3 C code float x z r res z fabs x if z 0 575 res r else res PI 2 2 r 3DNow code in MM0 x MM1 r out MM0 res MO...

Page 79: ...efine PI 3 14159265358979323 float x z r res 0 r PI 4 z abs x if z 1 res r else res PI 2 r 3DNow code in MM0 x MM1 r out MM1 res MOVQ MM5 mabs mask to clear sign bit MOVQ MM6 one 1 0 PAND MM0 MM5 z ab...

Page 80: ...1 y MM2 x out MM0 res MOVQ MM7 sgn mask to extract sign bit MOVQ MM6 sgn mask to extract sign bit MOVQ MM5 mabs mask to clear sign bit PAND MM7 MM2 xs sign x PAND MM1 MM5 ya abs y PAND MM2 MM5 xa abs...

Page 81: ...uction in the AMD Athlon processor requires eight cycles to execute Use the preferred code shown below Example 1 Avoid LOOP LABEL Example 2 Preferred DEC ECX JNZ LABEL Avoid Far Control Transfer Instr...

Page 82: ...due to the danger of overflowing the return address stack Convert end recursive functions to iterative code An end recursive function is when the function call to itself is at the end of the code Exa...

Page 83: ...possibly having a different latency The AMD Athlon processor has flexible scheduling but for absolute maximum performance schedule instructions especially FPU and 3DNow instructions according to thei...

Page 84: ...nown at compile time The loop body once unrolled is less than 100 instructions which is approximately 400 bytes of code Partial Loop Unrolling Partial loop unrolling can increase register pressure whi...

Page 85: ...e However the pipelined floating point adder allows one add every cycle In the following code the loop is partially unrolled by a factor of two which creates potential endcases that must be handled ou...

Page 86: ...l case the loop count starts at some lower bound lo increases by some fixed positive increment inc for each iteration of the loop and may not exceed some upper bound hi The following example shows how...

Page 87: ...t high speed due to the use of prediction mechanisms However there is still overhead due to passing function arguments through memory which creates STLF store to load forwarding dependencies Some comp...

Page 88: ...body For large functions the benefits of reduced function call overhead gives diminishing returns Therefore a function that results in the insertion of more than 500 machine instructions at the call...

Page 89: ...X and MOVSX instructions to zero extend and sign extend byte size and word size operands to doubleword length For example typical code for zero extension creates a superset dependency when the zero ex...

Page 90: ...2 Preferred int a MAXSIZE b MAXSIZE c MAXSIZE i for i 0 i MAXSIZE i c i a i b i MOV ECX MAXSIZE 1 initialize loop counter add_loop MOV EAX ECX 4 a get element a MOV EDX ECX 4 b get element b ADD EAX...

Page 91: ...esses are used in the displacement portion of the address and biasing is accomplished at compile time by simply modifying the displacement Example 3 Preferred int a MAXSIZE b MAXSIZE c MAXSIZE i for i...

Page 92: ...76 Push Memory Data Carefully AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Page 93: ...sion delivers only one bit of quotient per cycle 22 47 cycles signed 17 41 cycles unsigned the equivalent code is much faster The user can follow the examples in this chapter that illustrate the use o...

Page 94: ...ut to output the code to a file Unsigned Division by Multiplication of Constant Algorithm Divisors 1 d 231 Odd d The following code shows an unsigned division using a constant value multiplier In d di...

Page 95: ...0 1 Simpler Code for Restricted Dividend Integer division by a constant can be made faster if the range of the dividend is limited which removes a shift associated with most divisors For example for a...

Page 96: ...EAX dividend OUT EAX quotient CDQ Sign extend into EDX AND EDX 2 n 1 Mask correction use divisor 1 ADD EAX EDX Apply correction if necessary SAR EAX n Perform right shift by log2 divisor Signed Divis...

Page 97: ...ly unit the replacement code may provide better throughput The following code samples are designed such that the original source also receives the final result Other sequences are possible if the resu...

Page 98: ...ADD REG1 REG2 by 15 MOV REG2 REG1 2 cycles SHL REG1 4 SUB REG1 REG2 by 16 SHL REG1 4 1 cycle by 17 MOV REG2 REG1 2 cycles SHL REG1 4 ADD REG1 REG2 by 18 ADD REG1 REG1 3 cycles LEA REG1 REG1 8 REG1 by...

Page 99: ...ns to do integer only work especially if the function already uses 3DNow or MMX code Using MMX instructions relieves register pressure on the integer registers As long as data is simply loaded stored...

Page 100: ...icantly However these types are less commonly found The user should use the formula and round up to the nearest integer value to determine the latency Guidelines for Repeated String Instructions To he...

Page 101: ...misaligned For REP STOS make the destination aligned Inline REP String with Low Counts Expand REP string instructions into equivalent sequences of simple x86 instructions if the repeat count is const...

Page 102: ...fts are best handled by inline code Multiplies divides and remainders are less common operations and should usually be implemented as subroutines If these subroutines are used often the programmer sho...

Page 103: ...EDX by 32 bits rshift_done Example 6 Multiplication _llmul computes the low order half of the product of its arguments two 64 bit integers INPUT ESP 8 ESP 4 multiplicand ESP 16 ESP 12 multiplier OUTPU...

Page 104: ...hi 0 quotient in EDX EAX POP EBX restore EBX as per calling convention RET done return to caller two_divs MOV ECX EAX save dividend_lo in ECX MOV EAX EDX get dividend_hi XOR EDX EDX zero extend it int...

Page 105: ...EBX restore EBX as per calling convention RET done return to caller _ulldiv ENDP Example 8 Remainder _ullrem divides two unsigned 64 bit integers and returns the remainder INPUT ESP 8 ESP 4 dividend...

Page 106: ...er of remaining shifts SHRD EBX EDI CL scale down divisor and dividend such SHRD EAX EDX CL that divisor is less than 2 32 SHR EDX CL i e fits in EBX ROL EDI 1 restore original divisor EDI ESI DIV EBX...

Page 107: ...hms consist of the following steps Step 1 Partition the integer into groups of two bits Compute the population count for each 2 bit group and store the result in the 2 bit group This calls for the fol...

Page 108: ...YYYkkkkZZZZ The WWWW XXXX YYYY and ZZZZ values are the interesting sums with each at most 1000b or 8 decimal Step 4 The four 4 bit sums can now be rapidly accumulated by means of a multiply with a mag...

Page 109: ...ctor The utility udiv exe was compiled using the code shown in this section The following code derives the multiplier value used when performing integer division by constants The code works for unsign...

Page 110: ...ong U32 U32 d l s m a r U64 m_low m_high j k U32 log2 U32 i U32 t 0 i i 1 while i i i 1 t return t Generate m s for algorithm 0 Based on Granlund T Montgomery P L Division by Invariant Integers using...

Page 111: ...d m r d 1 1 U32 m_low U32 m_low 1 a 1 Reduce multiplier shift factor for either algorithm to smallest possible while m 1 m m 1 s Signed Derivation for Algorithm Multiplier and Shift Factor The utilit...

Page 112: ...2 log2 U32 i U32 t 0 i i 1 while i i i 1 t return t U32 d l s m a U64 m_low m_high j k Determine algorithm a multiplier m and shift count s for 32 bit signed integer division Based on Granlund T Montg...

Page 113: ...leword boundaries and quadwords on quadword boundaries Misaligned memory accesses reduce the available memory bandwidth Use Multiplies Rather than Divides If accuracy requirements allow floating point...

Page 114: ...low Note that the FFREEP instruction although insufficiently documented in the past is supported by all 32 bit x86 processors The opcode bytes for FFREEP ST i are listed in Table 22 on page 212 FFREEP...

Page 115: ...aneously by explicitly switching execution between them Although the AMD Athlon processor FPU has a deep scheduler which in most cases can extract sufficient parallelism from existing code long depend...

Page 116: ...such code as quickly as possible In most situations the above code is therefore the fastest way to perform floating point to integer conversion and the conversion is compliant both with programming la...

Page 117: ...ually the case Example 2 Potentially faster FLD QWORD PTR X load double to be converted FST DWORD PTR TX store X because sign X is needed FIST DWORD PTR I store rndint x as default result FISUB DWORD...

Page 118: ...rsion that is not compliant with existing programming language standards but is IEEE 754 compliant perform the conversion using the rounding mode that is currently in effect usually round to nearest e...

Page 119: ...tions by putting the shared source operand at the top of the stack For example using the function func x y x z Example 1 Avoid FLD Z FLD Y FLD X FADD ST ST 2 FXCH ST 1 FMUL ST ST 2 CALL FUNC FSTP ST 0...

Page 120: ...truction and take appropriate action if it is set indicating an argument out of range Example 1 Avoid FLD QWORD PTR x argument FSIN compute sine FSTSW AX store FPU status word to AX TEST AX 0400h is t...

Page 121: ...an argument also needs to compute the cosine of that same argument In such cases the FSINCOS instruction should be used to compute both trigonometric functions concurrently which is faster than using...

Page 122: ...106 Take Advantage of the FSINCOS Instruction AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999...

Page 123: ...m floating point computations using the 3DNow instructions instead of x87 instructions The SIMD nature of 3DNow achieves twice the number of FLOPs that are achieved through x87 instructions 3DNow inst...

Page 124: ...uctions for Fast Division 3DNow instructions can be used to compute a very fast highly accurate reciprocal or quotient Optimized 14 Bit Precision Divide This divide operation executes with a total lat...

Page 125: ...it precision approximation of 1 b using just one PFRCP instruction A full 24 bit precision reciprocal can then be quickly computed from this approximation using a Newton Raphson algorithm The general...

Page 126: ...ur less cycles than the square root operation Example MOVD MM0 MEM 0 a PFRSQRT MM1 MM0 1 sqrt a 1 sqrt a approximate PUNPCKLDQ MM0 MM0 a a MMX instr PFMUL MM0 MM1 sqrt a sqrt a Optimized 24 Bit Precis...

Page 127: ...square root value is X3 In the AMD Athlon processor 3DNow implementation the estimate contains the correct round to nearest value for approximately 87 of all arguments The remaining arguments differ f...

Page 128: ...ction Set Manual order 22466 for more usage information Blended Code Otherwise for blended code which needs to run well on AMD K6 and AMD Athlon family processors the following code is recommended Exa...

Page 129: ...to accomplish the same task in blended code that achieves good performance on the AMD Athlon processor as well as on the AMD K6 family processors that support 3DNow technology Example 1 AMD Athlon sp...

Page 130: ...the PCMP has a latency of two cycles while the PFCMP has a latency of four cycles In addition to the shorter latency PCMP can be issued to either the FADD or the FMUL pipe while PFCMP is restricted to...

Page 131: ...aligned blocks of data In cases where memory blocks are not quadword aligned additional code is required to handle end cases as needed AMD K6 and AMD Athlon Processor Blended Code The following exampl...

Page 132: ...1 eax 32 movq edx 48 mm2 movq mm2 eax 24 movq edx 40 mm0 movq mm0 eax 16 movq edx 32 mm1 movq mm1 eax 8 movq edx 24 mm2 movq edx 16 mm0 dec ecx movq edx 8 mm1 jnz xfer femms block fill destination QWO...

Page 133: ...Athlon processor specific code where the destination of the block copy is in cacheable space but no immediate data re use of the data at the destination is expected Example 2 block copy source and des...

Page 134: ...e MMX PXOR to Clear All Bits in an MMX Register To clear all the bits in an MMX register to zero use PXOR MMreg MMreg Note that PXOR MMreg MMreg is dependent on previous writes to MMreg Therefore usin...

Page 135: ...d and properly aligned memory location However loading the data from memory runs the risk of cache misses Cases where MOVQ is superior to PCMPEQD are therefore rare and PCMPEQD should be used in gener...

Page 136: ...m 3 0 res y v x m 0 1 v y m 1 1 v z m 2 1 v w m 3 1 res z v x m 0 2 v y m 1 2 v z m 2 2 v w m 3 2 res w v x m 0 3 v y m 1 3 v z m 2 3 v w m 3 3 define M00 0 define M01 4 define M02 8 define M03 12 de...

Page 137: ...VQ MM2 QWORD PTR EAX M22 m 2 3 m 2 2 PFMUL MM0 MM1 v z m 2 1 v z m 2 0 PFADD MM3 MM4 v x m 0 1 v y m 1 1 v x m 0 0 v y m 1 0 MOVQ MM4 QWORD PTR EAX M30 m 3 1 m 3 0 PFMUL MM2 MM1 v z m 2 3 v z m 2 2 PF...

Page 138: ...de indicates whether the vertex is outside the frustum with regard to a specific clip plane Examination of the clip code for a vertex and clipping if the clip code is non zero The following example sh...

Page 139: ...GHT BEFORE MOVQ MM1 MM2 BELOW ABOVE BEHIND LEFT RIGHT BEFORE PUNPCKHDQ MM2 MM2 BELOW ABOVE BEHIND BELOW ABOVE BEHIND POR MM2 MM1 zclip yclip xclip clip code Use 3DNow PAVGUSB for MPEG 2 Motion Compens...

Page 140: ...ORD3 0xfefefefe POR MM2 MM3 calculate adjustment PSRLQ MM0 1 MM0 QWORD1 0xfefefefe 2 PSRLQ MM1 1 MM1 QWORD3 0xfefefefe 2 PAND MM2 MM6 PADDB MM0 MM1 MM0 QWORD1 2 QWORD3 2 w o adjustment PADDB MM0 MM2 a...

Page 141: ...D2 PAVGUSB MM0 EDI QWORD1 QWORD3 2 with adjustment PAVGUSB MM1 EDI 8 QWORD2 QWORD4 2 with adjustment ADD EAX EDX MOVQ EDI MM0 MOVQ EDI 8 MM1 ADD EDI EBX LOOP L1 Stream of Packed Unsigned Bytes The fol...

Page 142: ...wo element vectors v real v imag one can see the need for swapping the elements of src1 to perform the multiplies for result imag and the need for a mixed positive negative accumulation to complete th...

Page 143: ...ium Pro processors either improve the performance of the AMD Athlon processor or are not required and have a neutral effect, usually due to fewer coding restrictions with the AMD Athlon processor. Short...

Page 144: ...in memory. This technique avoids the comparatively long latencies for accessing memory. Stack Allocation: When allocating space for local variables and/or outgoing parameters within a procedure, adjust th...

Page 145: ...re of the AMD Athlon processor is the industry-standard x86 instruction set. The term microarchitecture refers to the design techniques used in the processor to reach the target cost, performance, and fu...

Page 146: ...n the AMD Athlon processor enables higher processor core performance and promotes straightforward extendibility for future designs. Superscalar Processor: The AMD Athlon processor is an aggressive, out-o...

Page 147: ...y using the bus interface unit (BIU). The instruction cache generates fetches on the naturally aligned 64 bytes containing the instructions and the next sequential line of 64 bytes (a prefetch). The princip...

Page 148: ...uction. In addition, the predecode logic detects code branches such as CALLs, RETURNs, and short unconditional JMPs. When a branch is detected, predecoding begins at the target of the branch. Branch Predicti...

Page 149: ...to determine which type of basic decode should occur: DirectPath or VectorPath. DirectPath Decoder: DirectPath instructions can be decoded directly into a MacroOP, and subsequently into one or two OPs in...

Page 150: ...ght MacroOPs, whether integer or floating-point, for maximum instruction throughput. The ICU can simultaneously dispatch multiple MacroOPs from the reorder buffer to both the integer and floating-point s...

Page 151: ...r execution pipeline is organized to match the three MacroOP dispatch pipes in the ICU, as shown in Figure 2 on page 135. MacroOPs are broken down into OPs in the schedulers. OPs issue when their operand...

Page 152: ...at is attached to the pipeline at pipe 0. See Figure 2 on page 135. Multiplies always issue to integer pipe 0, and the issue logic creates results bus bubbles for the multiplier in integer pipes 0 and 1...

Page 153: ...superscalar x87, 3DNow!, and MMX operations. The first of the three pipes is generally known as the adder pipe (FADD), and it contains 3DNow! add, MMX ALU/shifter, and floating-point add execution units. The se...

Page 154: ...ues: a 12-entry queue for L1 cache load and store accesses and a 32-entry queue for L2 cache or system memory load and store accesses. The 12-entry queue can request a maximum of two L1 cache loads and...

Page 155: ...Combining. See Appendix C, "Implementation of Write Combining" on page 155, for detailed information about write combining. AMD Athlon System Bus: The AMD Athlon system bus is a high-speed bus that consists...

Page 157: ...the floating-point pipeline manages all x87, 3DNow!, and MMX instructions. This appendix describes the operation and functionality of these pipelines. Fetch and Decode Pipeline Stages: Figure 5 on page 142...

Page 158: ...the DirectPath decodes the common x86 instructions, it also contains VectorPath instruction data, which allows it to maintain dispatch order at the end of cycle 5. Figure 6. Fetch/Scan/Align/Decode Pipel...

Page 159: ...M and SIB bytes for each instruction and sends the accumulated prefix information to EDEC. Cycle 4 (VectorPath), MEROM: In the microcode engine ROM (MEROM) pipeline stage, the entry point generated in the pre...

Page 160: ...aches or system memory. There are three integer pipes associated with the three IEUs. Figure 7. Integer Execution Pipeline. Figure 7 and Figure 8 show the integer execution resources and the pipeline stag...

Page 161: ...If addresses must be calculated to access data necessary to complete the operation, the OP proceeds to the next stages, ADDGEN and DCACC. Cycle 9, ADDGEN: In the address generation (ADDGEN) pipeline stage, t...

Page 162: ...of the dataflow through the FPU. Figure 9. Floating-Point Unit Block Diagram. The floating-point pipeline stages 7-15 are shown in Figure 10 and described in the following sections. Note that the floating...

Page 163: ...schedules up to three MacroOPs per cycle from the 36-entry FPU scheduler to the FREG pipeline stage to read register operands. MacroOPs are sent when their operands and/or tags are obtained. Cycle 11, FR...

Page 164: ...a register results: Produced by load or register instructions. Address register results: Produced by LEA or PUSH instructions. Examples: The following examples illustrate the operand and result definitions...

Page 165: ...tion early decodes in the VectorPath and requires three OPs: an address generation operation for the indirect address, a data load from memory, and a compare to CX using an IEU. The final JZ instruction i...

Page 166: ...common floating-point instructions. The MMX PFACC instruction is DirectPath decodeable and generates a single MacroOP targeted for the arithmetic operation execution pipeline in the floating-point log...

Page 167: ...are waiting to enter the cache subsystem. Loads and stores are allocated into LS1 entries at dispatch time in program order, and are required by LS1 to probe the data cache in program order. The AGUs can...

Page 168: ...DirectPath DP The following nomenclature is used to describe the current location of a particular operation D Dispatch stage Allocate in ICU reservation stations load store LS1 queue I Issue stage Sc...

Page 169: ...ions and therefore dispatches alone in pipe 0. The multiply latency is four cycles. 2. The simple INC operation is paired with instructions 3 and 4. The INC executes in IEU0 in cycle 4. 3. The MOV executes...

Page 170: ...che access. 3. The load-execute instruction accesses the data cache in tandem with instruction 2. After the load portion completes, the subtraction is executed in cycle 6 in IEU2. 4. The shift operation exe...

Page 171: ...y as either writeback (WB), write-protected (WP), writethrough (WT), uncacheable (UC), or write-combining (WC). Defining the memory type for a range of memory as WC or WT allows the processor to conditionally combin...

Page 172: ...cycles that target locations within the address range of a write buffer. The AMD Athlon processor combines multiple memory-write cycles to a 64-byte buffer whenever the memory address is within a WC o...

Page 173: ...mprove system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte write buffer that is aligned to a cache l...
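
Whether two writes can land in the same write-combining buffer depends, at minimum, on both addresses falling inside the same aligned 64-byte block. The C sketch below shows only that address arithmetic. It does not model memory types or the events that close a buffer, and the names `WC_BUFFER_BYTES`, `wc_buffer_base`, and `same_wc_buffer` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define WC_BUFFER_BYTES 64u   /* buffer size stated in the text */

/* Start address of the aligned 64-byte block that contains addr. */
uint32_t wc_buffer_base(uint32_t addr)
{
    return addr & ~(WC_BUFFER_BYTES - 1u);
}

/* True when both addresses fall inside the same aligned 64-byte block. */
bool same_wc_buffer(uint32_t a, uint32_t b)
{
    return wc_buffer_base(a) == wc_buffer_base(b);
}
```

Under this sketch, addresses 0x1000 and 0x103F share a block, while 0x103F and 0x1040 do not.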

Page 174: ...thin a lock can be combined. Uncacheable Read: A UC read closes write combining. A WC read closes combining only if a cache block address match occurs between the WC read and a write in the write buffer...

Page 175: ...the eight quadwords are valid. If this case is true, do not proceed to the next rule. 2. If all longwords are either full (4 bytes valid) or empty (0 bytes valid), a Write-Longword system command is issued for...
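
The longword condition in rule 2 can be expressed as a check on a per-byte valid mask. The C sketch below assumes a hypothetical 64-bit mask in which bit i is set when byte i of the 64-byte buffer holds valid data. It covers only the "every longword is full or empty" test visible in this preview and does not attempt the other rules, whose text is cut off here.

```c
#include <stdbool.h>
#include <stdint.h>

/* byte_valid: hypothetical mask, bit i set when byte i of the buffer is valid.
   A longword covers 4 bytes, so each hex digit of the mask is one longword.
   Returns true when every longword is either full (0xF) or empty (0x0). */
bool all_longwords_full_or_empty(uint64_t byte_valid)
{
    for (int lw = 0; lw < 16; ++lw) {
        unsigned nibble = (unsigned)((byte_valid >> (4 * lw)) & 0xFu);
        if (nibble != 0x0u && nibble != 0xFu) {
            return false;   /* partially valid longword */
        }
    }
    return true;
}
```

A mask such as 0x00000000F0FF0000 passes the check, while a mask of 0x3 (two valid bytes in the first longword) does not.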

Page 177: ...ounting events, a counter is incremented each time a specified event takes place or a specified number of events takes place. When measuring duration, a counter counts the number of processor clocks that...

Page 178: ...3h. The PerfEvtSel[3:0] MSRs, shown in Figure 11, control the operation of the performance-monitoring counters, with one register used to set up each counter. These MSRs specify the events to be counted, how...

Page 179: ...ows software to measure not only the fraction of time spent in a particular state, but also the average length of time spent in such a state (for example, the time spent waiting for an interrupt to be se...

Page 180: ...ES Segment register loads 21h LS Stores to active instruction stream 40h DC Data cache accesses 41h DC Data cache misses 42h DC xxx1_xxxxb Modified M xxxx_1xxxb Owner O xxxx_x1xxb Exclusive E xxxx_xx...

Page 181: ...ECC errors detected corrected 75h BU bits 15 12 reserved xxxx_1xxxb I invalidates D xxxx_x1xxb I invalidates I xxxx_xx1xb D invalidates D xxxx_xxx1b D invalidates I Internal cache line invalidates 76...

Page 182: ...C2h FR Retired branches conditional unconditional exceptions interrupts C3h FR Retired branches mispredicted C4h FR Retired taken branches C5h FR Retired taken branches mispredicted C6h FR Retired far...

Page 183: ...he RDPMC instruction operation is performed. Only the operating system, executing at privilege level 0, can directly manipulate the performance counters using the RDMSR and WRMSR instructions. A secure op...

Page 184: ...by clearing the enable counters flag or by clearing all the bits in the PerfEvtSel[3:0] MSRs. Event and Time-Stamp Monitoring Software: For applications to use the performance-monitoring counters and ti...

Page 185: ...D Athlon processor provides the option of generating a debug interrupt when a performance-monitoring counter overflows. This mechanism is enabled by setting the interrupt enable flag in one of the Perf...

Page 186: ...er Overflow. An event monitor application utility or another application program can read the collected performance information of the...

Page 187: ...e of the MTRRs is to provide system software with the ability to manage the memory mapping of the hardware. Both the BIOS software and operating systems utilize this capability. The AMD Athlon processor...

Page 188: ...there is one fixed-address MTRR. The fixed-address ranges all exist in the first 1 Mbyte. There are eight variable-address ranges above 1 Mbytes. Each is programmed to a specific memory starting address...

Page 189: ...Figure 12. MTRR Mapping of Physical Memory (0 to FFFFFFFFh): 512 Kbytes, 256 Kbytes, 256 Kbytes; 8 Fixed Ranges (64 Kbytes each), 64 Fixed Ranges (4 Kbytes each), 16 Fixed Ranges (1...
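
The fixed-range layout in Figure 12 can be read as a lookup from a physical address below 1 Mbyte to one of 88 fixed ranges. The C sketch below assumes the conventional x86 ordering, which is consistent with the register names MTRR_fix16K_A0000 and MTRR_fix4K_C0000 listed for page 198: eight 64-Kbyte ranges from 0, sixteen 16-Kbyte ranges from 80000h, and sixty-four 4-Kbyte ranges from C0000h. The function name and the flat 0..87 index are hypothetical conveniences, not part of the manual.

```c
#include <stdint.h>

/* Maps a physical address to a fixed-range index, or -1 if the address
   lies at or above 1 Mbyte (outside the fixed-range MTRRs).
   Index numbering (0..87) is an illustrative convention only. */
int fixed_mtrr_range_index(uint32_t addr)
{
    if (addr < 0x80000u) {                         /* 8 ranges of 64 Kbytes  */
        return (int)(addr >> 16);
    }
    if (addr < 0xC0000u) {                         /* 16 ranges of 16 Kbytes */
        return 8 + (int)((addr - 0x80000u) >> 14);
    }
    if (addr < 0x100000u) {                        /* 64 ranges of 4 Kbytes  */
        return 24 + (int)((addr - 0xC0000u) >> 12);
    }
    return -1;
}
```

The three regions account for 512 + 256 + 256 Kbytes, which is the full first megabyte.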

Page 190: ...Uncacheable: Uncacheable for reads or writes. Cannot be combined. Must be non-speculative for reads or writes. 01h, WC (Write Combining): Uncacheable for reads or writes. Can be combined. Can be speculative for...

Page 191: ...ed when clear, and all of physical memory is mapped as uncacheable memory (reset state = 0). FE: Fixed-range MTRRs are enabled when set. All MTRRs are disabled when clear. When the fixed-range MTRRs are enable...

Page 192: ...d behavior. However, testing shows that these processors decompose these large pages into 4-Kbyte pages. When a large page (2 Mbytes/4 Mbytes) mapping covers a region that contains more than one memory typ...

Page 193: ...the flexibility of the page tables. It allows operating systems and applications to determine the desired memory type for optimal performance. PAT support is detected in the feature flags (bit 16) o...

Page 194: ...enabled or when the PDE doesn't describe a large page. In the latter case, the PATi bit for a PTE (bit 7) corresponds to the page size bit in a PDE. Therefore, the OS should only use PA0-3 when setting the...

Page 195: ...UC MTRR WC x WC WT WB WT WT UC UC WC CD WP CD WP WB WP WP UC UC MTRR WC WT CD WB WB WB UC UC WC WC WT WT WP WP Notes 1 UC MTRR indicates that the UC attribute came from the MTRRs and that the processo...

Page 196: ...Table 16 Final Output Memory Types Input Memory Type Output Memory Type Note RdMem WrMem Effective MType forceCD 5 AMD 751 RdMem WrMem MemType UC UC 1 CD CD 1 WC WC 1 WT WT 1 WP WP 1 WB WB CD 1 2 UC U...

Page 197: ...ince cached IO lines cannot be copied back to IO the processor forces WB to WT to prevent cached IO from going dirty 5 ForceCD The memory type is forced CD due to 1 CR0 CD 1 2 memory type is for the I...

Page 198: ...FF A0000 A3FFF MTRR_fix16K_A0000 C7000 C7FFF C6000 C6FFF C5000 C5FFF C4000 C4FFF C3000 C3FFF C2000 C2FFF C1000 C1FFF C0000 C0FFF MTRR_fix4K_C0000 CF000C FFFF CE000 CEFFF CD000 CDFFF CC000 CCFFF CB000...

Page 199: ...ss range is power-of-2 sized and aligned. The range of supported sizes is from 2^12 to 2^36 in powers of 2. The AMD Athlon processor does not implement A[35:32]. Figure 16. MTRRphysBasen Register Format. Note...

Page 200: ...area not mapped by the mask value is set to the default type. Discontinuous ranges should not be used. The range that is mapped by the variable-range MTRR register pair must meet the following range siz...
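
The size and alignment constraints on a variable-range MTRR (a power-of-2 size between 2^12 and 2^36 bytes, with the base aligned to that size) can be checked before any register is programmed. The C sketch below is a validity test only. The function names are hypothetical, and it does not encode the MTRRphysBasen or MTRRphysMaskn register formats, which this preview does not show in full.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when x has exactly one bit set. */
bool is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Checks the constraints stated in the text: the size is a power of 2,
   lies between 2^12 and 2^36 bytes, and the base is aligned to the size. */
bool mtrr_range_is_valid(uint64_t base, uint64_t size)
{
    const uint64_t min_size = 1ull << 12;
    const uint64_t max_size = 1ull << 36;

    if (!is_power_of_two(size)) {
        return false;
    }
    if (size < min_size || size > max_size) {
        return false;
    }
    return (base & (size - 1)) == 0;
}
```

A 1-Mbyte range based at 100000h passes, whereas a 2-Mbyte range at the same base fails the alignment test and would have to be split or moved.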

Page 201: ...ormat on page 183 201h MTRR Mask0 See MTRRphysMaskn Register Format on page 184 202h MTRR Base1 203h MTRR Mask1 204h MTRR Base2 205h MTRR Mask2 206h MTRR Base3 207h MTRR Mask3 208h MTRR Base4 209h MTR...

Page 203: ...ates the instruction mnemonic and operand types with the following notations: reg8: byte integer register defined by instruction byte(s) or bits 5, 4, and 3 of the modR/M byte; mreg8: byte integer register d...

Page 204: ...e used by the instruction. The modR/M byte defines the instruction as register or memory form. If mod bits 7 and 6 are documented as mm (memory form), mm can only be 10b, 01b, or 00b. The fifth column lists t...
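
The modR/M columns in the instruction tables that follow (for example `11 010 xxx` or `mm xxx xxx`) are the three bit fields of a single byte. The C sketch below splits a modR/M byte into those fields and reports whether it selects the register form. The struct and function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* The three fields of a modR/M byte. Hypothetical struct for illustration. */
typedef struct {
    unsigned mod;   /* bits 7-6: 11b = register form, 00b/01b/10b = memory form */
    unsigned reg;   /* bits 5-3: register or opcode extension */
    unsigned rm;    /* bits 2-0: register or memory operand selector */
} ModRM;

ModRM decode_modrm(uint8_t byte)
{
    ModRM f;
    f.mod = (byte >> 6) & 0x3u;
    f.reg = (byte >> 3) & 0x7u;
    f.rm  = byte & 0x7u;
    return f;
}

/* Register form corresponds to the "11" entries in the tables. */
bool modrm_is_register_form(uint8_t byte)
{
    return ((byte >> 6) & 0x3u) == 0x3u;
}
```

For instance, the byte D0h decodes to mod = 11b, reg = 010b, rm = 000b, which matches a table entry written as `11 010 xxx`.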

Page 205: ...3h 11 010 xxx DirectPath ADC mem16 32 imm8 sign extended 83h mm 010 xxx DirectPath ADD mreg8 reg8 00h 11 xxx xxx DirectPath ADD mem8 reg8 00h mm xxx xxx DirectPath ADD mreg16 32 reg16 32 01h 11 xxx xx...

Page 206: ...ded 83h 11 100 xxx DirectPath AND mem16 32 imm8 sign extended 83h mm 100 xxx DirectPath ARPL mreg16 reg16 63h 11 xxx xxx VectorPath ARPL mem16 reg16 63h mm xxx xxx VectorPath BOUND 62h VectorPath BSF...

Page 207: ...CALL full pointer 9Ah VectorPath CALL near imm16 32 E8h VectorPath CALL mem16 16 32 FFh 11 011 xxx VectorPath CALL near mreg32 indirect FFh 11 010 xxx VectorPath CALL near mem32 indirect FFh mm 010 x...

Page 208: ...NO reg16 32 mem16 32 0Fh 41h mm xxx xxx DirectPath CMOVNP CMOVPO reg16 32 reg16 32 0Fh 4Bh 11 xxx xxx DirectPath CMOVNP CMOVPO reg16 32 mem16 32 0Fh 4Bh mm xxx xxx DirectPath CMOVNS reg16 32 reg16 32...

Page 209: ...reg8 0Fh B0h mm xxx xxx VectorPath CMPXCHG mreg16 32 reg16 32 0Fh B1h 11 xxx xxx VectorPath CMPXCHG mem16 32 reg16 32 0Fh B1h mm xxx xxx VectorPath CMPXCHG8B mem64 0Fh C7h mm xxx xxx VectorPath CPUID...

Page 210: ...8 signed 6Bh 11 xxx xxx VectorPath IMUL reg16 32 mem16 32 imm8 signed 6Bh mm xxx xxx VectorPath IMUL AX AL mreg8 F6h 11 101 xxx VectorPath IMUL AX AL mem8 F6h mm 101 xxx VectorPath IMUL EDX EAX EAX mr...

Page 211: ...rt disp8 78h DirectPath JNS short disp8 79h DirectPath JP JPE short disp8 7Ah DirectPath JNP JPO short disp8 7Bh DirectPath JL JNGE short disp8 7Ch DirectPath JNL JGE short disp8 7Dh DirectPath JLE JN...

Page 212: ...AHF 9Fh VectorPath LAR reg16 32 mreg16 32 0Fh 02h 11 xxx xxx VectorPath LAR reg16 32 mem16 32 0Fh 02h mm xxx xxx VectorPath LDS reg16 32 mem32 48 C5h mm xxx xxx VectorPath LEA reg16 mem16 32 8Dh mm xx...

Page 213: ...h MOV reg8 mem8 8Ah mm xxx xxx DirectPath MOV reg16 32 mreg16 32 8Bh 11 xxx xxx DirectPath MOV reg16 32 mem16 32 8Bh mm xxx xxx DirectPath MOV mreg16 segment reg 8Ch 11 xxx xxx VectorPath MOV mem16 se...

Page 214: ...xxx DirectPath MOVSX reg32 mreg16 0Fh BFh 11 xxx xxx DirectPath MOVSX reg32 mem16 0Fh BFh mm xxx xxx DirectPath MOVZX reg16 32 mreg8 0Fh B6h 11 xxx xxx DirectPath MOVZX reg16 32 mem8 0Fh B6h mm xxx xx...

Page 215: ...Ch DirectPath OR EAX imm16 32 0Dh DirectPath OR mreg8 imm8 80h 11 001 xxx DirectPath OR mem8 imm8 80h mm 001 xxx DirectPath OR mreg16 32 imm16 32 81h 11 001 xxx DirectPath OR mem16 32 imm16 32 81h mm...

Page 216: ...PUSH DS 1Eh VectorPath PUSH EAX 50h DirectPath PUSH ECX 51h DirectPath PUSH EDX 52h DirectPath PUSH EBX 53h DirectPath PUSH ESP 54h DirectPath PUSH EBP 55h DirectPath PUSH ESI 56h DirectPath PUSH EDI...

Page 217: ...xx DirectPath RCR mem8 1 D0h mm 011 xxx DirectPath RCR mreg16 32 1 D1h 11 011 xxx DirectPath RCR mem16 32 1 D1h mm 011 xxx DirectPath RCR mreg8 CL D2h 11 011 xxx DirectPath RCR mem8 CL D2h mm 011 xxx...

Page 218: ...xx DirectPath ROR mreg8 CL D2h 11 001 xxx DirectPath ROR mem8 CL D2h mm 001 xxx DirectPath ROR mreg16 32 CL D3h 11 001 xxx DirectPath ROR mem16 32 CL D3h mm 001 xxx DirectPath SAHF 9Eh VectorPath SAR...

Page 219: ...SB AL mem8 AEh VectorPath SCASW AX mem16 AFh VectorPath SCASD EAX mem32 AFh VectorPath SETO mreg8 0Fh 90h 11 xxx xxx DirectPath SETO mem8 0Fh 90h mm xxx xxx DirectPath SETNO mreg8 0Fh 91h 11 xxx xxx D...

Page 220: ...h SETG SETNLE mreg8 0Fh 9Fh 11 xxx xxx DirectPath SETG SETNLE mem8 0Fh 9Fh mm xxx xxx DirectPath SGDT mem48 0Fh 01h mm 000 xxx VectorPath SIDT mem48 0Fh 01h mm 001 xxx VectorPath SHL SAL mreg8 imm8 C0...

Page 221: ...SHRD mreg16 32 reg16 32 imm8 0Fh ACh 11 xxx xxx VectorPath SHRD mem16 32 reg16 32 imm8 0Fh ACh mm xxx xxx VectorPath SHRD mreg16 32 reg16 32 CL 0Fh ADh 11 xxx xxx VectorPath SHRD mem16 32 reg16 32 CL...

Page 222: ...05h VectorPath SYSENTER 0Fh 34h VectorPath SYSEXIT 0Fh 35h VectorPath SYSRET 0Fh 07h VectorPath TEST mreg8 reg8 84h 11 xxx xxx DirectPath TEST mem8 reg8 84h mm xxx xxx DirectPath TEST mreg16 32 reg16...

Page 223: ...h VectorPath XCHG EAX EDI 97h VectorPath XLAT D7h VectorPath XOR mreg8 reg8 30h 11 xxx xxx DirectPath XOR mem8 reg8 30h mm xxx xxx DirectPath XOR mreg16 32 reg16 32 31h 11 xxx xxx DirectPath XOR mem16...

Page 224: ...FADD FMUL PACKUSWB mmreg1 mmreg2 0Fh 67h 11 xxx xxx DirectPath FADD FMUL PACKUSWB mmreg mem64 0Fh 67h mm xxx xxx DirectPath FADD FMUL PADDB mmreg1 mmreg2 0Fh FCh 11 xxx xxx DirectPath FADD FMUL PADDB...

Page 225: ...PMADDWD mmreg1 mmreg2 0Fh F5h 11 xxx xxx DirectPath FMUL PMADDWD mmreg mem64 0Fh F5h mm xxx xxx DirectPath FMUL PMULHW mmreg1 mmreg2 0Fh E5h 11 xxx xxx DirectPath FMUL PMULHW mmreg mem64 0Fh E5h mm x...

Page 226: ...ctPath FADD FMUL PSUBB mmreg1 mmreg2 0Fh F8h 11 xxx xxx DirectPath FADD FMUL PSUBB mmreg mem64 0Fh F8h mm xxx xxx DirectPath FADD FMUL PSUBD mmreg1 mmreg2 0Fh FAh 11 xxx xxx DirectPath FADD FMUL PSUBD...

Page 227: ...odR M Byte Decode Type FPU Pipe s Notes MASKMOVQ mmreg1 mmreg2 0Fh F7h VectorPath FADD FMUL FSTORE MOVNTQ mem64 mmreg 0Fh E7h DirectPath FSTORE PAVGB mmreg1 mmreg2 0Fh E0h 11 xxx xxx DirectPath FADD F...

Page 228: ...mem8 0Fh 18h DirectPath 1 PREFETCHT2 mem8 0Fh 18h DirectPath 1 SFENCE 0Fh AEh VectorPath Table 22 Floating Point Instructions Instruction Mnemonic First Byte Second Byte ModR M Byte Decode Type FPU Pi...

Page 229: ...D FCOMP mem64real DCh mm 011 xxx DirectPath FADD FCOMPP DEh D9h 11 011 001 DirectPath FADD FCOS D9h FFh VectorPath FDECSTP D9h F6h DirectPath FADD FMUL FSTORE FDIV ST ST i D8h 11 110 xxx DirectPath FM...

Page 230: ...m 001 xxx VectorPath FINCSTP D9h F7h DirectPath FADD FMUL FSTORE FINIT DBh E3h VectorPath FIST mem16int DFh mm 010 xxx DirectPath FSTORE FIST mem32int DBh mm 010 xxx DirectPath FSTORE FISTP mem16int D...

Page 231: ...h 11 001 xxx DirectPath FMUL 1 FNOP D9h D0h DirectPath FADD FMUL FSTORE FPTAN D9h F2h VectorPath FPATAN D9h F3h VectorPath FPREM D9h F8h DirectPath FMUL FPREM1 D9h F5h DirectPath FMUL FRNDINT D9h FCh...

Page 232: ...1 FSUBP ST ST i DEh 11 101 xxx DirectPath FADD 1 FSUBR mem32real D8h mm 101 xxx DirectPath FADD FSUBR mem64real DCh mm 101 xxx DirectPath FADD FSUBR ST ST i D8h 11 100 xxx DirectPath FADD 1 FSUBR ST...

Page 233: ...mmreg2 0Fh 0Fh A0h 11 xxx xxx DirectPath FADD PFCMPGT mmreg mem64 0Fh 0Fh A0h mm xxx xxx DirectPath FADD PFMAX mmreg1 mmreg2 0Fh 0Fh A4h 11 xxx xxx DirectPath FADD PFMAX mmreg mem64 0Fh 0Fh A4h mm xxx...

Page 234: ...ruction Mnemonic | Prefix Byte(s) | imm8 | ModR/M Byte | Decode Type | FPU Pipe(s) | Note. Notes: 1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line that will be pre...

Page 235: ...udes register register op memory as well as register register op register forms of instructions DirectPath Instructions The following tables contain DirectPath instructions which should be used in the...

Page 236: ...imm8 ADD mreg16 32 imm16 32 ADD mem16 32 imm16 32 ADD mreg16 32 imm8 sign extended ADD mem16 32 imm8 sign extended AND mreg8 reg8 AND mem8 reg8 AND mreg16 32 reg16 32 AND mem16 32 reg16 32 AND reg8 mr...

Page 237: ...NS reg16 32 mem16 32 CMOVO reg16 32 reg16 32 CMOVO reg16 32 mem16 32 CMOVP CMOVPE reg16 32 reg16 32 CMOVP CMOVPE reg16 32 mem16 32 CMOVS reg16 32 reg16 32 CMOVS reg16 32 mem16 32 CMP mreg8 reg8 CMP me...

Page 238: ...16 32 JNL JGE near disp16 32 JLE JNG near disp16 32 JNLE JG near disp16 32 JMP near disp16 32 direct JMP far disp32 48 direct JMP disp8 short Table 25 DirectPath Integer Instructions Continued Instruc...

Page 239: ...imm16 32 OR mreg8 imm8 OR mem8 imm8 OR mreg16 32 imm16 32 OR mem16 32 imm16 32 OR mreg16 32 imm8 sign extended OR mem16 32 imm8 sign extended Table 25 DirectPath Integer Instructions Continued Instru...

Page 240: ...16 32 SBB reg8 mreg8 SBB reg8 mem8 Table 25 DirectPath Integer Instructions Continued Instruction Mnemonic SBB reg16 32 mreg16 32 SBB reg16 32 mem16 32 SBB AL imm8 SBB EAX imm16 32 SBB mreg8 imm8 SBB...

Page 241: ...g16 32 CL SHR mem16 32 CL STC SUB mreg8 reg8 Table 25 DirectPath Integer Instructions Continued Instruction Mnemonic SUB mem8 reg8 SUB mreg16 32 reg16 32 SUB mem16 32 reg16 32 SUB reg8 mreg8 SUB reg8...

Page 242: ...0 November 1999 XOR reg16 32 mem16 32 XOR AL imm8 XOR EAX imm16 32 XOR mreg8 imm8 XOR mem8 imm8 XOR mreg16 32 imm16 32 XOR mem16 32 imm16 32 XOR mreg16 32 imm8 sign extended XOR mem16 32 imm8 sign ext...

Page 243: ...em64 PAND mmreg1 mmreg2 PAND mmreg mem64 PANDN mmreg1 mmreg2 PANDN mmreg mem64 PCMPEQB mmreg1 mmreg2 PCMPEQB mmreg mem64 PCMPEQD mmreg1 mmreg2 PCMPEQD mmreg mem64 PCMPEQW mmreg1 mmreg2 PCMPEQW mmreg m...

Page 244: ...reg mem64 PUNPCKLBW mmreg1 mmreg2 PUNPCKLBW mmreg mem64 PUNPCKLDQ mmreg1 mmreg2 PUNPCKLDQ mmreg mem64 PUNPCKLWD mmreg1 mmreg2 PUNPCKLWD mmreg mem64 PXOR mmreg1 mmreg2 Table 26 DirectPath MMX Instructi...

Page 245: ...IVR ST i ST FDIVR mem32real FDIVR mem64real FDIVRP ST i ST FFREE ST i FFREEP ST i FILD mem16int FILD mem32int FILD mem64int FIMUL mem32int FIMUL mem16int FINCSTP FIST mem16int FIST mem32int FISTP mem1...

Page 246: ...6 Code Optimization 22007E 0 November 1999 FSUB ST i ST FSUBP ST ST i FSUBR mem32real FSUBR mem64real FSUBR ST ST i FSUBR ST i ST FSUBRP ST i ST FTST FUCOM FUCOMP FUCOMPP FWAIT FXCH Table 28 DirectPat...

Page 247: ...reg16 32 mreg16 32 BSF reg16 32 mem16 32 BSR reg16 32 mreg16 32 BSR reg16 32 mem16 32 BT mem16 32 reg16 32 BTC mreg16 32 reg16 32 BTC mem16 32 reg16 32 BTC mreg16 32 imm8 BTC mem16 32 imm8 BTR mreg16...

Page 248: ...ect JMP far mem32 indirect JMP far mreg32 indirect LAHF LAR reg16 32 mreg16 32 LAR reg16 32 mem16 32 LDS reg16 32 mem32 48 Table 29 VectorPath Integer Instructions Continued Instruction Mnemonic LEA r...

Page 249: ...Integer Instructions Continued Instruction Mnemonic RCL mem8 imm8 RCL mem16 32 imm8 RCL mem8 CL RCL mem16 32 CL RCR mem8 imm8 RCR mem16 32 imm8 RCR mem8 CL RCR mem16 32 CL RDMSR RDPMC RDTSC RET near i...

Page 250: ...g16 32 XCHG reg8 mreg8 XCHG reg8 mem8 XCHG reg16 32 mreg16 32 XCHG reg16 32 mem16 32 XCHG EAX ECX XCHG EAX EDX XCHG EAX EBX XCHG EAX ESP XCHG EAX EBP XCHG EAX ESI XCHG EAX EDI XLAT Table 29 VectorPath...

Page 251: ...OM mem32int FICOM mem16int FICOMP mem32int FICOMP mem16int FIDIV mem32int FIDIV mem16int FIDIVR mem32int FIDIVR mem16int FIMUL mem32int FIMUL mem16int FINIT FISUB mem32int FISUB mem16int FISUBR mem32i...

Page 253: ...siderations 27 55 Cache 4 64 Byte Cache Line 11 50 Cache and Memory Optimizations 45 CALL and RETURN Instructions 59 Code Padding Using Neutral Code Fillers 39 Code Sample Analysis 152 Complex Number...

Page 254: ...teger Only Work 83 MOVQ Instruction 85 PAND to Find Absolute Value in 3DNow Code 119 PCMP Instead of 3DNow PFCMP 114 PCMPEQD to Set an MMX Register 119 PMADDWD Instruction 111 PREFETCHNTA T0 T1 T2 Ins...

Page 255: ...Athlon Processor x86 Code Optimization T TBYTE Variables 55 Trigonometric Instructions 103 V VectorPath Decoder 133 VectorPath Instructions 231 W Write Combining 10 50 139 155 157 159 X x86 Optimizati...
