Intel® PXA27x Processor Family Optimization Guide

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL’S TERMS 
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS 
OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO 
FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER 
INTELLECTUAL PROPERTY RIGHT. 

Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the 
presented subject matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by 
estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights. 

Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for 
future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The Intel® PXA27x Processor Family may contain design defects or errors known as errata which may cause the product to deviate from published 
specifications. Current characterized errata are available on request.

MPEG is an international standard for video compression/decompression promoted by ISO. Implementations of MPEG CODECs, or MPEG enabled 
platforms may require licenses from various entities, including Intel Corporation. 

This document and the software described in it are furnished under license and may only be used or copied in accordance with the terms of the 
license. The information in this document is furnished for informational use only, is subject to change without notice, and should not be construed as a 
commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this 
document or any software that may be provided in association with this document. Except as permitted by such license, no part of this document may 
be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation. 

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an ordering number and are referenced in this document, or other Intel literature may be obtained by calling 
1-800-548-4725 or by visiting Intel's website at http://www.intel.com.

Copyright © Intel Corporation, 2004

AlertVIEW, i960, AnyPoint, AppChoice, BoardWatch, BunnyPeople, CablePort, Celeron, Chips, Commerce Cart, CT Connect, CT Media, Dialogic, 
DM3, EtherExpress, ETOX, FlashFile, GatherRound, i386, i486, iCat, iCOMP, Insight960, InstantIP, Intel, Intel logo, Intel386, Intel486, Intel740, 
IntelDX2, IntelDX4, IntelSX2, Intel ChatPad, Intel Create&Share, Intel Dot.Station, Intel GigaBlade, Intel InBusiness, Intel Inside, Intel Inside logo, Intel 
NetBurst, Intel NetStructure, Intel Play, Intel Play logo, Intel Pocket Concert, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel TeamStation, 
Intel WebOutfitter, Intel Xeon, Intel XScale, Itanium, JobAnalyst, LANDesk, LanRover, MCS, MMX, MMX logo, NetPort, NetportExpress, Optimizer 
logo, OverDrive, Paragon, PC Dads, PC Parents, Pentium, Pentium II Xeon, Pentium III Xeon, Performance at Your Command, ProShare, 
RemoteExpress, Screamline, Shiva, SmartDie, Solutions960, Sound Mark, StorageExpress, The Computer Inside, The Journey Inside, This Way In, 
TokenExpress, Trillium, Vivonic, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and 
other countries.

*Other names and brands may be claimed as the property of others.

Summary of Contents for PXA270

Page 1: ...Order Number: 280004-001. Intel PXA27x Processor Family Optimization Guide, April 2004...

Page 2: ...ed in accordance with the terms of the license The information in this document is furnished for informational use only is subject to change without notice and should not be construed as a commitment...

Page 3: ...le Microarchitecture Pipeline 2 1 2 2 1 General Pipeline Characteristics 2 1 2 2 1 1 Pipeline Organization 2 1 2 2 1 2 Out of Order Completion 2 2 2 2 1 3 Use of Bypassing 2 2 2 2 2 Instruction Flow T...

Page 4: ...n the Internal SRAM 3 6 3 3 2 3 Creating Scratch RAM in Data Cache 3 7 3 3 2 4 Reducing Memory Page Thrashing 3 7 3 3 2 5 Using Mini Data Cache 3 8 3 3 2 6 Reducing Cache Conflicts Pollution and Press...

Page 5: ...nd MSR Instructions 4 17 4 3 1 11 Scheduling Coprocessor 15 Instructions 4 18 4 3 2 Instruction Scheduling for Intel Wireless MMX Technology 4 18 4 3 2 1 Increasing Load Throughput on Intel Wireless M...

Page 6: ...and C Level Optimization 5 1 5 1 1 Efficient Usage of Preloading 5 1 5 1 1 1 Preload Considerations 5 1 5 1 1 2 Preload Loop Limitations 5 3 5 1 1 3 Coding Technique with Preload 5 4 5 1 2 Array Merg...

Page 7: ...ion Guidelines A 2 Glossary Glossary 1 Figures 1 1 PXA27x Processor Block Diagram 1 3 2 1 Intel XScale Microarchitecture RISC Superpipeline 2 1 2 2 Intel Wireless MMX Technology Pipeline Threads and r...

Page 8: ...truction Timings 4 41 4 12 Load and Store Multiple Instruction Timings 4 41 4 13 Semaphore Instruction Timings 4 42 4 14 CP15 Register Access Instruction Timings 4 42 4 15 CP14 Register Access Instruc...

Page 9: ...Revision History (Date, Revision, Description): April 2004, 001, Initial release...

Page 11: ...l Chapter 4 Intel XScale Microarchitecture Intel Wireless MMX Technology Optimization discusses how to optimize software mostly at the assembly programming level to take advantage of the Intel XScale...

Page 12: ...PCA processors help to redefine what a mobile device can do to meet many of the performance demands of Enterprise class wireless computing and feature hungry technology consumers Targeted at wireless...

Page 13: ...power OEMs to develop smaller more cost effective handheld devices with long battery life with the performance to run rich multimedia applications Or the microarchitecture could be surrounded by high...

Page 14: ...s permissions D cache attributes 4 entry Fill and Pend buffers promote core efficiency by allowing hit under miss operation with data caches Performance monitoring unit furnishes two 32 bit event coun...

Page 15: ...External Memory Controller The PXA27x processor supports a memory controller for external memory which can access SDRAM up to 100 MHz at 1 8 Volts Flash memories Synchronous ROM SRAM Variable latency...

Page 16: ...performance DMA controller supporting memory to memory transfers peripheral to memory and memory to peripheral device transfers It has support for 32 channels and up to 63 peripheral devices The cont...

Page 17: ...ay be programmed as an output, an input, or as bidirectional for certain alternate functions. 1.2.6 Wireless Intel SpeedStep Technology: Wireless Intel SpeedStep technology advances the capabilities of In...

Page 18: ...nts added to this core Memory map and register locations are backward compatible with the previous Intel XScale Microarchitecture hand held products The Intel Wireless MMX technology instruction set i...

Page 19: ...on preload aborts Access control to other coprocessors Enhanced set of supported cache control options A branch target buffer for dynamic branch prediction Performance monitoring unit Software debug s...

Page 21: ...d for the Intel XScale Microarchitecture using the techniques presented in this document 2 2 Intel XScale Microarchitecture Pipeline This section provides a brief description of the structure and beha...

Page 22: ...ependencies between instructions A register dependency occurs when a previous MAC or load instruction is about to modify a register value that has not returned to the register file Core bypassing allo...

Page 23: ...id pipeline stalls The following sections provide more detail on the nature of the pipeline and ways of preventing stalls 2 2 3 Main Execution Pipeline 2 2 3 1 F1 F2 Instruction Fetch Pipestages The j...

Page 24: ...revious instruction is about to modify a register value that has not been returned to the RFU and the current instruction needs to access that same register If no dependencies exist the RFU selects th...

Page 25: ...write performance by the use of write coalescing Coalescing is combining a new store operation with an existing store operation already resident in the write buffer The new store is placed in the sam...

Page 26: ...rce operands Results are completed N cycles later where N is dependent on the operand size and returned to the register file For more information on MAC instruction latencies refer to Section 4 8 Inst...

Page 27: ...e with the remainder of the decoding being completed in the RF stage However it is worth noting that the register address decoding is fully completed in the ID stage because the register file needs to...

Page 28: ...hitecture detects exceptions and flushes in the X2 pipe stage Intel Wireless MMX Technology also flushes all the pipeline stages 2 3 1 5 XWB Stage The XWB stage is the last stage of the X pipeline whe...

Page 29: ...des a virtual address that is used to access the data cache There is no logic inside the Intel Wireless MMX Technology in the D1 pipe stage 2 3 3 2 D2 Stage The D2 stage is where load data is returned...

Page 31: ...MHz turbo mode using only a 200 MHz run mode frequency The clock frequency combination should be chosen to fit the target application mix Possible frequency selections are listed in the clocks and pow...

Page 32: ...fer strength should be set to the lowest possible setting (minimum drive strength) that still allows for reliable memory system performance. This will minimize the power usage of the external memory bus...

Page 33: ...y the ARM architecture This behavior is detailed in Table 3 3 If the X bit for a descriptor is one the C and B bits behave differently as shown in Table 3 4 The load and store buffer behavior in Intel...

Page 34: ...much faster than external memory Executing non cached instructions severely curtails the processor s performance so it is important to do everything possible to minimize cache misses 1 0 Mini data cac...

Page 35: ...into the instruction cache Once locked into the instruction cache the code is always available for fast execution Another reason for locking critical code into cache is that with the round robin repla...

Page 36: ...es cause an additional read from the memory during a write miss Subsequent read and write performance may be improved by more frequent cache hits Most of the regular data and the stack for application...

Page 37: ...64 bytes each that are allocated to the on chip RAM and assume that the address of arr1 is 0 address of arr2 is 1024 and the address of arr3 is 2048 All three arrays are within the same sets set0 and...

Page 38: ...rashes the cache The mini data cache could also be used to keep frequently used tables cached The advantage of keeping these in the minicache is two fold First the data thrashing in the main cache doe...

Page 39: ...the external memory This scheme may free up some internal memory space for OS and user applications Depending on the user profile the internal memory can be used for different purposes 3 4 1 LCD Fram...

Page 40: ...i data caches. Data preloading allows hiding of memory transfer latency while the processor continues to execute instructions. The preload is important to compiler and assembly code because judicious u...

Page 41: ...bandwidth requirements The formula for each plane is Length and width are the number of lines per panel and pixels per line respectively Refresh rate is in frames per second BPP is bits per pixel in...
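The snippet above names the inputs of the per-plane bandwidth formula, but the formula itself did not survive extraction. As a rough sketch only, assuming the bandwidth is simply the product of those four quantities converted from bits to bytes, the estimate could be computed as follows. The function name and the bits-to-bytes conversion are assumptions for illustration and are not taken from the guide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: estimated frame-buffer fetch bandwidth for one LCD
 * plane, in bytes per second.
 *   lines           - number of lines per panel ("length")
 *   pixels_per_line - number of pixels per line ("width")
 *   refresh_hz      - refresh rate in frames per second
 *   bpp             - bits per pixel
 * Assumption: bandwidth = lines * pixels_per_line * refresh_hz * bpp / 8.
 */
uint64_t lcd_plane_bytes_per_sec(uint32_t lines, uint32_t pixels_per_line,
                                 uint32_t refresh_hz, uint32_t bpp)
{
    assert(bpp > 0);
    /* Widen before multiplying so large panels do not overflow 32 bits. */
    uint64_t bits = (uint64_t)lines * pixels_per_line * refresh_hz * bpp;
    return bits / 8;
}
```

Under this assumption, a 320-line by 240-pixel plane at 60 frames per second and 16 bits per pixel would need about 9.2 MB/s.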

Page 42: ...this problem. The LCD controller has an internal buffering mechanism to minimize the impact of fluctuations in the bandwidths. The maximum latency the LCD controller can tolerate for its 32-byte burst...

Page 43: ...me sum multiplied by the percent of time the overlay is enabled After estimating the total accesses for the base plane and all overlays employed place the frame buffers for the planes with the highest...

Page 44: ...region If this region is set to noncached but bufferable graphics performance improvements can be achieved The noncached but bufferable mode X 0 C 0 B 1 improves write performance by allowing the cons...

Page 45: ...rks is necessary in order to begin tuning the arbiter settings 3 5 2 2 Determining the Optimal Weights for Clients The weights are decided based on the real time RT deadline1 bandwidth BW requirements...

Page 46: ...commended that the OS and applications use this feature to park the bus where it results in the best performance for the current task While most applications have the highest performance with the bus...

Page 47: ...ons and generally improves efficiency This can be disabled and requires active transactions complete before another transaction starts Please refer to the DMA Programmed I O Control Status register de...

Page 49: ...XScale Microarchitecture instructions to modify the condition codes makes a wide array of optimizations possible 4 2 1 Conditional Instructions and Loop Control The Intel XScale Microarchitecture ins...

Page 50: ...ck for the loop exit condition: for (i = 0; i < 10; i++) { do something; } If the loop were rewritten as follows, the code generated avoids using the compare instruction to check for the loop exit condition: for (i = 9; i...
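A minimal C sketch of the count-down idea described above. The function name and the summing body are placeholders for illustration. Whether a given compiler actually folds the exit test into a flag-setting subtract depends on the compiler and its options, so treat this as a pattern to try and measure, not a guaranteed saving.

```c
#include <assert.h>

/*
 * Hypothetical example: sum n elements with a loop that counts down and
 * terminates on a comparison against zero, which lets the decrement itself
 * produce the condition codes used by the loop branch.
 */
int sum_count_down(const int *v, int n)
{
    assert(n >= 0);
    int total = 0;
    for (int i = n - 1; i >= 0; i--) {
        total += v[i];
    }
    return total;
}
```

The count-down form only applies when the loop body does not depend on visiting elements in ascending order.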

Page 51: ...erage the code above takes 5 5 cycles to execute Using the Intel XScale Microarchitecture to execute instructions conditionally the code generated for the preceding if else statement is cmp r0 10 movg...

Page 52: ...se branches instead of conditional instructions:
    cmp r0, #0
    bne L1
    add r0, r0, #1
    add r1, r1, #1
    add r2, r2, #1
    add r3, r3, #1
    add r4, r4, #1
    b L2
    L1:
    sub r0, r0, #1
    sub r1, r1, #1
    sub r2, r2, #1
    sub r3, r3, #1
    sub r4, r4, #1
    L2:
The C...

Page 53: ...shortcut evaluation feature The use of conditional instructions in this fashion improves performance by minimizing the number of branches thereby minimizing the penalties caused by branch mispredicti...

Page 54: ...d DES Triple DES T DES Hashing functions SHA This approach helps other application such as network packet parsing and voice stream parsing 4 2 4 Optimizing the Use of Immediate Values Use the Intel XS...

Page 55: ...y (2^n + 1): add r0, r0, r0, LSL #n. Multiplication by an integer constant expressed as (2^n + 1) * 2^m can be optimized. Multiplication of r0 by an integer constant that can be expressed as (2^n + 1) * 2^m: add r0, r0, r0, LSL #n followed by mov r0, r0...
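The shift-and-add pattern above can be checked in C. This is a sketch under the assumption that the constant has the form (2^n + 1) * 2^m; the function name is hypothetical, and the comments only indicate which step corresponds to the add with a shifted operand and which to the final shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: multiply x by (2^n + 1) * 2^m using one shifted add
 * and one shift, with no multiply instruction. Arithmetic wraps modulo 2^32,
 * as it would in a 32-bit register.
 */
uint32_t mul_shift_add(uint32_t x, unsigned n, unsigned m)
{
    assert(n < 32 && m < 32);
    uint32_t t = x + (x << n);  /* x * (2^n + 1): add with shifted operand */
    return t << m;              /* scale the result by 2^m */
}
```

For example, multiplying by 10 uses n = 2 and m = 1, since 10 = (2^2 + 1) * 2^1.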

Page 56: ...r0 to the value contained in r1 and make r0 point to the previous word str r1 r0 4 Decrement the contents of r0 to make it point to the previous word and set the contents of the word pointed to the v...

Page 57: ...0 addne r4 r5 4 subeq r4 r5 4 ldr r0 r4 cmp r0 10 This example rewrites this code to make it run faster at the expense of increasing code size cmp r1 0 ldrne r0 r5 4 ldreq r0 r5 4 addne r4 r5 4 subeq...

Page 58: ...lue in register r6 is not used after this It is possible to move the ADD and the LDR instructions before the SUB instruction so that the contents of register R6 are allowed to spill and restore from t...

Page 59: ...he number of loads that are outstanding Use the number of outstanding loads to improve performance of the PXA27x processor 4 3 1 2 Increasing Load Throughput Increasing load throughput for data demand...

Page 60: ...PXA27x processor set by the page table attributes combines multiple stores going to the same half of the cache line into a single memory transaction This approach increases the bus efficiency and thro...

Page 61: ...be aligned on an 8 byte boundary The specified register must be even r0 r2 Using LDRD STRD instead of LDM STM to do the same thing is more efficient because LDRD STRD issues in only one or two clock c...

Page 62: ...yte boundary Achieve this using the following LDM instructions r0 contains the address of the value being copied r1 contains the address of the destination location ldm r0 r2 r3 ldm r1 r4 r5 adds r0 r...

Page 63: ...ata processing instructions incur a two cycle issue penalty and a two cycle result penalty when the shifter operand is shifted rotated by a register or the shifter operand is a register The next instr...

Page 64: ...r0 0 Refer to Section 4 8 Instruction Latencies for Intel XScale Microarchitecture for more information on instruction latencies for various multiply instructions The multiply instructions should be s...

Page 65: ...lty due to the three cycle result latency for the second destination register mra r6 r7 acc0 mov r1 r7 mov r0 r6 add r2 r2 1 Rearrange the code to prevent the stall mra r6 r7 acc0 add r2 r2 1 mov r0 r...

Page 66: ...reless MMX Technology The constraints on issuing load transactions with Intel XScale Microarchitecture also hold with Intel Wireless MMX Technology The considerations reviewed using the Intel XScale M...

Page 67: ...r instructions WLDRD wR0 r2 8 WZERO wR15 WLDRD wR1 r4 8 SUBS r3 r3 8 WLDRD wR3 r4 8 Always try to interleave additional operations between the load instruction and the instruction which will first use...

Page 68: ...D instruction in the following example executes with no stalls WMACS wR14 wR1 wR2 ADD R1 R2 R3 Refer to Section 4 8 Instruction Latencies for Intel XScale Microarchitecture for more information on ins...

Page 69: ...on the data Compute intensive processing In the following sections we illustrate how the rules for writing fast sequences of Intel MMX Technology instructions on Intel Wireless MMX Technology can be...

Page 70: ...written to illustrate that 4 taps are computed for each loop iteration for i 0 i N i s0 0 for j 0 j T 4 j 4 s0 a j x i j s0 a j 1 x i j 1 s0 a j 2 x i j 2 s0 a j 3 x i j 3 y i round s0 The direct asse...
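A scalar C sketch of the four-taps-per-iteration structure described above, computing one output sample. The operators in the original listing were lost in extraction, so the indexing convention x[j] (relative to the start of the current window), the loop bound, and the omission of the final rounding step are assumptions made for illustration. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: one FIR output sample with the inner loop unrolled so
 * that four taps are accumulated per iteration.
 *   a    - filter coefficients, 'taps' entries
 *   x    - input window starting at the current sample, 'taps' entries
 *   taps - number of taps; must be a multiple of 4
 */
int32_t fir_sample(const int16_t *a, const int16_t *x, size_t taps)
{
    assert(taps % 4 == 0);
    int32_t s0 = 0;
    for (size_t j = 0; j < taps; j += 4) {
        s0 += (int32_t)a[j]     * x[j];
        s0 += (int32_t)a[j + 1] * x[j + 1];
        s0 += (int32_t)a[j + 2] * x[j + 2];
        s0 += (int32_t)a[j + 3] * x[j + 3];
    }
    return s0;
}
```

Grouping four 16-bit multiply-accumulates per iteration mirrors the amount of work a single 64-bit SIMD multiply-accumulate can cover, which is why the scalar code is written this way before being mapped to assembly.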

Page 71: ...t from 9 cycles for every four taps to 9 cycles for every eight taps This corresponds to a throughput of 1 125 cycles per tap represents a 2X throughput improvement It is useful to define a metric to...

Page 72: ...ned to four 64 bit Intel Wireless MMX Technology registers In order to obtain near ideal throughput the inner loop is unrolled to provide for eight taps for each of the four output samples per loops i...

Page 73: ...variations are possible 4 4 3 Data Alignment Techniques The exploitation of the data parallelism present in multimedia algorithms is accomplished by executing the same operation on different elements...

Page 74: ...since algorithm mapping to Intel Wireless MMX Technology may be significantly accelerated The Intel MMX Technology target pipeline and architecture is different than Intel Wireless MMX Technology and...

Page 75: ...t have MOV instructions associated with the destructive register behavior to improve throughput The following is an example of Intel MMX Technology to Intel Wireless MMX Technology instruction mapping...

Page 76: ...l MMX Technology Instructions Input wR0 Source Value Input mm0 Source Value mm7 0 WUNPCKELU wR1 wR0 MOVQ mm1 mm0 WUNPCKEHU wR2 wR0 PUNPCKLWD mm0 mm7 PUNPCKHWD mm1 mm WPACK h w US PACKUS wb dw WAND PAN...

Page 77: ...ines can benefit greatly by being optimized for the Intel XScale Microarchitecture The following string and memory manipulation routines are good candidates to be tuned for the Intel XScale Microarchi...

Page 78: ...s current iteration It also uses LDRD and groups the STRs together to coalesce 4 6 2 Case Study 2 Optimizing Memory Fill Graphics applications use fill routines Most of the personal data assistant PDA...

Page 79: ...r0 4 writing out as words str r4 r0 4 instead of bytes or half words str r4 r0 4 achieves optimum performance str r4 r0 4 str r4 r0 4 str r4 r0 4 str r4 r0 4 str r4 r0 4 str r4 r0 4 str r4 r0 4 str r4...

Page 80: ...processing However if the end user views the output in portrait mode a portrait to landscape conversion needs to occur each time the frame buffer writes to the display The display driver usually impl...

Page 81: ...ed as a single write request str r11 r10 4 Write Coalesce the two stores str r12 r10 4 This can be exploited by either unrolling the C loop or by explicitly inlining multiple stores which can be combi...

Page 82: ...r2 ADD r2 r2 r3 Adding stride BNE LOOP 4 7 Intel Performance Primitives Users who want to take full advantage of many of the optimizations in this guide are likely to use these techniques Write hand o...

Page 83: ...e multimedia CODECs Video ITU H 263 decoder ISO IEC 14496 2 MPEG 4 decoder Audio ISO IEC 11172 3 and 13818 3 MPEG 1 2 Layer 3 MP3 decoder Speech ITU T G 723 1 CODEC and ETSI GSM AMR codec Image ISO IE...

Page 84: ...ediction correct prediction is assumed Minimum Result Latency This represents the required minimum cycle is the distance from the issue clock of the current instruction to the issue clock of the first...

Page 85: ...le 4 2 Latency Example Cycle Issue Executing 0 umlal 1st cycle 1 umlal 2nd cycle umlal 2 add umlal 3 sub stalled umlal add 4 sub stalled umlal 5 sub umlal 6 mov sub 7 mov Table 4 3 Branch Instruction...

Page 86: ...te by Register Or shifter operand is RRX Minimum Issue Latency Minimum Result Latency Minimum Issue Latency Minimum Result Latency ADC 1 1 2 2 ADD 1 1 2 2 AND 1 1 2 2 BIC 1 1 2 2 CMN 1 1 2 2 CMP 1 1 2...

Page 87: ...or Rs 31 15 0x1FFFF 0 1 2 1 1 2 2 2 Rs 31 27 0x00 or Rs 31 27 0x1F 0 1 3 2 1 3 3 3 all others 0 1 4 3 1 4 4 4 SMLAL Rs 31 15 0x00000 or Rs 31 15 0x1FFFF 0 2 RdLo 2 RdHi 3 2 1 3 3 3 Rs 31 27 0x00 or R...

Page 88: ...Latency Throughput MIA Rs 31 15 0x0000 or Rs 31 15 0xFFFF 1 1 1 Rs 31 27 0x0 or Rs 31 27 0xF 1 2 2 all others 1 3 3 MIAxy N A 1 1 1 MIAPH N A 1 2 2 Table 4 8 Implicit Accumulator Access Instruction Ti...

Page 89: ...writeback of base LDRH 1 3 for load data 1 for writeback of base LDRSB 1 3 for load data 1 for writeback of base LDRSH 1 3 for load data 1 for writeback of base LDRT 1 3 for load data 1 for writeback...

Page 90: ...struction Minimum Issue Latency Minimum Result Latency MRC 4 4 MCR 2 N A MRC to R15 is unpredictable MRC and MCR to CP0 and CP1 is described in the Intel Wireless MMX Technology section Table 4 15 CP1...

Page 91: ...B instructions to ARM instructions can be found in the ARM Architecture Reference Manual 4 9 Instruction Latencies for Intel Wireless MMX Technology The issue cycle and result latency of all the PXA27...

Page 92: ...1 1 WALIGNR 1 1 WSHUF 1 1 TANDC 1 1 TORC 1 1 TEXTRC 1 1 TEXTRM 1 2 TMCR 1 3 TMCRR 1 1 TMRC 1 2 TMRRC 1 3 TMOVMSK 1 2 TINSTR 1 1 TBCST 1 1 WLDR BHW to main regfile 1 4 3 WLDRW to control regfile 1 4 W...

Page 93: ...late the result when certain qualifiers are specified This list describes the data hazards for the PXA27x processor 1 0 implementation When saturation is specified for WADD or WSUB the result latency...

Page 94: ...ction note the resource may still be processing the previous instruction further down its internal pipeline A delay of one clock cycle indicates that the resource is available immediately to the next...

Page 95: ...he multiply resource These delays for are shown below in Table 4 21 For example if a TMIA instruction is followed by a TMIAph class3 instruction then the TMIAph sees a resource availability of 2 cycle...

Page 96: ...ntly full due to a sequence of memory transactions the following instruction must wait for space in the buffer The resource availability delay in this case is two cycles This is summarized in Table 4...

Page 97: ...For optimum performance the MAC unit in the core should not be used adjacent to TMRC instructions as they both share the route back to the core register file 4 10 2 5 Multiple Pipelines The WSAD TMIA...

Page 99: ...ase where a linked list or recursive data structure is terminated by a NULL pointer. Preloading the NULL pointer does not cause a fault. The preload instructions (PLD) can be inserted by the compiler duri...

Page 100: ...ere to fetch the data The number of iterations to preload ahead is referred to as the preload scheduling distance PSD For the Intel XScale Microarchitecture this can be calculated as Where Nlinexfer T...

Page 101: ...racteristics that limit value of adding preloads are discussed below 5 1 1 2 1 Preload Limitations Throughput bound vs Latency bound The worst case is a loop which is bounded by the memory throughput...

Page 102: ...allow the memory bus traffic to flow freely and to minimize the number of necessary preloads Section 5 1 1 3 discusses code optimization for preloading 5 1 1 3 Coding Technique with Preload Since prel...

Page 103: ...arrays as much as possible. while (p) { prefetch(p->next); do_something(p->data); p = p->next; } Recursive data structure traversal is another construct where preloading can be applied. This is similar to linked list t...
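The linked-list traversal above can be written as compilable C. This is a sketch, not the guide's code: the node layout, the PREFETCH macro, and the summing body are assumptions for illustration. It uses the GCC/Clang builtin __builtin_prefetch where available and falls back to a no-op elsewhere. Whether the builtin is lowered to a PLD instruction depends on the compiler and target options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration. */
struct node {
    int data;
    struct node *next;
};

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)  /* cache hint only; never faults */
#else
#define PREFETCH(p) ((void)(p))            /* no-op on other compilers */
#endif

/* Walk the list, requesting the next node while the current one is processed. */
long sum_list(const struct node *p)
{
    long total = 0;
    while (p) {
        assert(p->next != p);   /* a self-loop would never terminate */
        PREFETCH(p->next);      /* may be NULL at the tail; harmless as a hint */
        total += p->data;       /* stands in for do_something(p->data) */
        p = p->next;
    }
    return total;
}
```

The prefetch only helps when the work done per node is long enough to overlap with the memory access for the next node.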

Page 104: ...n later chapters 5 1 2 Array Merging Stride the way data structures are walked through can affect the temporal quality of the data and reduce or increase cache conflicts Intel XScale Microarchitecture...

Page 105: ...ear2Date401KDed float Year2DateOtherDed In the data structure shown above the fields Year2DatePay Year2DateTax Year2Date401KDed and Year2DateOtherDed are likely to change with each pay check The remai...

Page 106: ...k 100 k for j2 0 j 100 j for k2 0 k 100 k j j1 100 j2 k k1 100 k2 C j k A i k B j i 5 1 4 Loop Interchange As previously mentioned the sequence in which data is accessed affects cache thrashing Usual...

Page 107: ...c i for i 0 i NMAX i prefetch D i 1 c i 1 A i 1 D i A i c i The second loop reuses the data elements A i and c i Fusing the loops together produces for i 0 i NMAX i prefetch D i 1 A i 1 c i 1 b i 1 a...

Page 108: ...down an arbitrary-size loop into small unrolled blocks, some loop overhead can be avoided. For example, it is unlikely that a compiler will unroll this code: void f(int nTotalIterations) { for (i = 0; i < nTotalI...
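One way to break an arbitrary-size loop into small unrolled blocks is sketched below. The block size of four, the function name, and the summing body are assumptions chosen for illustration; the guide's own example is truncated above and may differ.

```c
#include <assert.h>

/*
 * Hypothetical example: process n elements in unrolled blocks of four, then
 * finish the remaining 0..3 elements in a short cleanup loop. The loop
 * counter update and branch are paid once per block instead of once per
 * element.
 */
long sum_blocked(const int *v, int n)
{
    assert(n >= 0);
    long total = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {  /* unrolled block of four iterations */
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++) {          /* leftover iterations */
        total += v[i];
    }
    return total;
}
```

A larger block size reduces loop overhead further but increases code size, so the block should stay small enough for the hot code to remain in the instruction cache.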

Page 109: ...he benefit of this technique Again performance may potentially decline if the instructions within the unrolled block do not fit in the instruction cache Ensure that all inline functions inline procedu...

Page 110: ...ngle conditional is met By breaking the switch into two or more levels the worst case lookup is dramatically reduced Using a switch statement with 16 case statements to jump to 16 other switch stateme...

Page 111: ...ligned on a cache line 12 bytes then the prefetch would have to be placed on tdata i 1 id If the structure is not sized to a multiple of the cache line size then the preload address must be advanced a...

Page 112: ...locality However local variables should also be kept to a minimum so that fewer register values must be pushed and subsequently popped from the stack Also loops run much more efficiently if all the d...

Page 113: ...Passing by pointer or reference is highly preferred over passing by value. Passing by value should only be used when there is a compelling reason to do so. Small dat...

Page 115: ...s that may help to reduce power consumption consumed primarily by the Intel XScale core 6 2 1 Code Optimization for Power Consumption In most cases optimizing the operating system OS or application fo...

Page 116: ...re placed in a low power mode where state is retained but no activity is allowed some of the internal power domains see the Intel PXA270 Processor Electrical Mechanical and Thermal Specification and t...

Page 117: ...ntel Speedstep Technology There are some additional considerations and additions required by applications in order to take advantage of the power manager but these additions were minimal For details a...

Page 118: ...of power. 6.2.4.1 Fast Bus Mode: The system bus frequency can be doubled through the use of CLKCFG[B] (refer to the CLKCFG Bit Definitions table in the Intel PXA27x Processor Family Developer's Manual). Whe...

Page 119: ...r the PXA27x processor external memory bus have programmable strength settings This feature allows for simple software based control of the output driver impedance for the external memory bus Use thes...

Page 120: ...power savings using 1 8 V SDRAM compared to 2 5 V or 3 3 V SDRAM If no other devices in the system use 1 8 V then users must consider the power savings compared to the extra components and board real...

Page 121: ...al inputs USBCP and USBCN to a logic high 6 3 7 5 Sleep Mode For lowest power consumption in sleep mode Disable and ground VCC_SRAM VCC_Core and VCC_PLL Configure all possible IO pins as outputs and d...

Page 123: ...ame buffer non cached but bufferable Use write back caches if possible Optimize assembly code based on the suggestions presented in this guide Enable the branch target buffer Configure non cacheable m...

Page 124: ...est possible run and turbo mode frequencies Higher run and turbo mode frequencies consume more power Optimize system for desired power and performance Consider performing a frequency change sequence t...

Page 125: ...signals that are converted into a format that allows them to carry data Cellular phones and other wireless devices use analog in geographic areas with insufficient digital networks ARM V5te An ARM ar...

Page 126: ...ssed in bytes per second BTB Branch Target Buffer BTS Base Transmitter Station Buffer Storage used to compensate for a difference in data rates or time of occurrence of events when transmitting data f...

Page 127: ...guring Software Software resident on the host software that is responsible for configuring a USB device This may be a system configuration or software specific to the device Control Endpoint A pair of...

Page 128: ...USB device This software may or may not also be responsible for configuring the device for use DMA Direct Memory Access Downstream The direction of data flow from the host or away from the host A dow...

Page 129: ...erpreted by a packet receiver as an EOP FDD The Mobile Station transmits on one frequency the Base Station transmits on another frequency FDM Frequency Division Multiplexing Each Mobile station transm...

Page 130: ...DML uses hypertext transfer protocol HTTP to display text versions of web pages on wireless devices Unlike WML HDML is not based on XML HDML does not allow scripts while WML uses a variant of JavaScri...

Page 131: ...en itself on the host and an endpoint of a device in an appropriate direction. IrDA: Infrared Data Association. IRP: See I/O Request Packet. IRQ: See Interrupt Request. ISI: Inter-signal interference. D...

Page 132: ...allows requests to be reliably identified and communicated Microframe A 125 microsecond time base established on high speed buses MMC Multimedia Card small form factor memory and I O card MMX Technolo...

Page 133: ...d Network: Networks that transfer packets of data. PCMCIA: Personal Computer Memory Card International Association (PC Card). PCS: Personal Communications Services. An alternative to cellular, PCS works like cellu...

Page 134: ...Frequency Device These devices use radio frequencies to transmit data One typical use is for bar code scanning of products in a warehouse or distribution center and sending that information to an ERP...

Page 135: ...int per unit time SIMD Single Instruction Multiple Data a parallel processing architecture Smart Phone A combination of a mobile phone and a PDA which allow users to communicate as well as perform tas...

Page 136: ...ster clock There is a fixed relation between Fsi and Fso System Programming Interface SPI A defined interface to services provided by system software TC Temperature Cycling TDD Time Division Duplexing...

Page 137: ...er Transmitter serial port Universal Serial Bus Driver USBD The host resident software entity responsible for providing common services to clients that are manipulating one or more functions on one or...

Page 138: ...technology integrates the high performance of Intel MMX™ technology and the integer functions from Streaming SIMD Extensions (SSE) to the Intel XScale™ microarchitecture. Intel Wireless MMX technology...

Page 139: ...nal Instructions and Loop Control 1 Coprocessor Interface Pipeline 49 Count Leading Zeros Instruction Timings 42 CP14 Register Access Instruction Timings 42 CP15 and CP14 Coprocessor Instructions 42 C...

Page 140: ...ion Mapping 27 Intel Wireless MMX Technology Pipeline 7 Intel Wireless MMX Technology Pipeline Threads and relation with Intel XScale Microarchitecture Pipeline 7 Interleaved Pack with Saturation Exa...

Page 141: ...load scheduling distance PSD 2 Processor Internal Communications 5 Program Flow and Branch Instructions 2 PXA27x Processor Block Diagram 3 PXA27x processor Mapping to Intel Wireless MMX Technology and...

Page 142: ...DMA 17 Use of Bypassing 2 Using Mini Data Cache 8 V Voltage and Regulators 6 W Weight for Core 16 Weight for DMA 15 Weight for LCD 15 Wireless Intel Speedstep technology 7 Wireless Intel Speedstep Te...
