
  ATI CTM Guide v. 1.01

 © 2006 Advanced Micro Devices, Inc.

4 CTM Units 

2.1.1 The ATI Data Parallel Processor Array

The ATI Data Parallel Processor (DPP) Array comprises one or more processors, each a programmable unit that can 
execute a series of instructions.

Each processor in the array is directed by the Processor Execution Unit (PE; see Section 2.1.2). If a processor is idle,
the PE may request that it execute a program. It does so by passing to the processor an identifier pair (i, j), where i
and j are non-negative (range-limited) integers, as well as its conditional value. Upon receiving the identifier pair and
conditional value, the processor informs the PE that it is busy, resets its program counter to zero, and begins program
execution. The processor remains busy until its internal program counter reaches the end of the program, as specified
in the Application Binary Interface. After the processor executes the instruction at the end-of-program address, the
processor halts and informs the PE that it is again idle.
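
The idle/busy handshake described above can be sketched as a small state machine. This is an illustrative model only: the class and field names are hypothetical, `program_length` stands in for the end-of-program address defined by the Application Binary Interface, and execution is assumed to be strictly sequential (no flow control).

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Toy model of one DPP processor (names are illustrative, not from CTM)."""
    program_length: int              # assumed: number of instructions, at least 1
    busy: bool = False
    pc: int = 0
    index_pair: tuple = (0, 0)
    cond_value: int = 0
    trace: list = field(default_factory=list)

    def dispatch(self, i, j, cond_value):
        """Model the PE handing an identifier pair and conditional value to an idle processor."""
        if self.busy:
            raise RuntimeError("the PE only dispatches work to an idle processor")
        if i < 0 or j < 0:
            raise ValueError("identifier pairs are non-negative")
        self.index_pair = (i, j)
        self.cond_value = cond_value
        self.busy = True             # processor reports busy
        self.pc = 0                  # program counter is reset to zero
        self.trace = []

    def step(self):
        """Execute one instruction; go idle after the end-of-program instruction."""
        if not self.busy:
            return
        self.trace.append(self.pc)   # stand-in for actually executing the instruction
        if self.pc == self.program_length - 1:
            self.busy = False        # processor halts and reports idle
        else:
            self.pc += 1

    def run(self):
        """Step until the processor reports idle again."""
        while self.busy:
            self.step()
```

A processor built with `Processor(program_length=3)` becomes busy on `dispatch(2, 5, 1)` and returns to idle once `run()` has executed the instruction at the last address.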

Instructions for a program, as well as constants, inputs, and outputs to which the program may refer, are stored in
memory, and read or written through the Memory Controller Unit (MC; see Section 2.1.4). Conceptually, each processor
maintains a separate interface to the MC. This interface consists of two non-negative (range-limited) integer indices
(x, y) and a field identifying the type of memory access the processor is requesting: a read of a program instruction,
floating-point constant, integer constant, boolean constant, or input; or an output write.

The (x, y) pair is determined differently for each type of memory access that a processor may request. For instructions,
(x, y) is equal to (pc, 0), where pc is the current program counter. The index pair for each of the constants is (c, 0),
where c is the index specified by the program instruction requesting the constant. The index pair and identifier for
inputs are specified by the program instruction requesting an input value. The index pair for an output is always the
pair (i, j) assigned to the processor by the PE, while the identifier for the output is specified in the requesting
program instruction.
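
As a rough illustration of these rules, the following sketch maps each access type to the (x, y) pair a processor would present to the MC. The enum and function names are hypothetical and exist only for this example; they are not part of CTM.

```python
from enum import Enum


class Access(Enum):
    """Memory access types a processor may request (labels are illustrative)."""
    INSTRUCTION = "instruction"
    FLOAT_CONST = "float_const"
    INT_CONST = "int_const"
    BOOL_CONST = "bool_const"
    INPUT_READ = "input_read"
    OUTPUT_WRITE = "output_write"


CONSTANT_ACCESSES = {Access.FLOAT_CONST, Access.INT_CONST, Access.BOOL_CONST}


def request_indices(access, *, pc=None, const_index=None, input_xy=None, assigned_ij=None):
    """Return the (x, y) pair for a memory request of the given type."""
    if access is Access.INSTRUCTION:
        return (pc, 0)                # indexed by the current program counter
    if access in CONSTANT_ACCESSES:
        return (const_index, 0)       # index named by the requesting instruction
    if access is Access.INPUT_READ:
        return tuple(input_xy)        # pair chosen by the requesting instruction
    if access is Access.OUTPUT_WRITE:
        return tuple(assigned_ij)     # always the (i, j) assigned by the PE
    raise ValueError(f"unknown access type: {access!r}")
```

Note that only input reads let the program choose both indices freely; output writes are pinned to the processor's assigned (i, j).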

If conditional output is enabled, output write requests by a processor are conditionally generated, based on a value
returned by the Conditional Output Unit (CO; see Section 2.1.3). The processor sends a conditional value (v) and its
(i, j) index pair to the CO, and the CO then performs a conditional test based on the value and index pair. If the test
passes, the processor dispatches its output write requests to the MC; otherwise no output write requests are generated.
The conditional value, v, depends on the program that is currently being executed. The value may be specified directly
in an instruction in the program, or it may equal the conditional value sent to the processor by the PE. If conditional
output is disabled, the processor behaves as if the conditional output test always passes. Conditional output is enabled
by setting the condition location for the processors with the set_cond_loc command (see page 22).
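
The gating behaviour can be sketched as follows. This is a minimal model under stated assumptions: `co_test` is a hypothetical callable standing in for the CO (the actual test is defined in Section 2.1.3), and `outputs` is an illustrative mapping from output identifier to the value the program wants to write.

```python
def output_writes(ij, v, outputs, cond_enabled, co_test):
    """Return the write requests (output_id, (i, j), value) sent to the MC.

    ij           -- the (i, j) pair assigned to the processor by the PE
    v            -- the conditional value for this program execution
    outputs      -- dict of output identifier -> value (illustrative)
    cond_enabled -- whether conditional output is enabled
    co_test      -- hypothetical stand-in for the CO: co_test(v, ij) -> bool
    """
    if cond_enabled and not co_test(v, ij):
        return []                     # test failed: no write requests at all
    # Test passed, or conditional output disabled (treated as always passing).
    return [(output_id, ij, value) for output_id, value in outputs.items()]
```

For instance, with a stand-in test such as `lambda v, ij: v < buffer[ij]`, a processor whose test fails produces an empty list, while the same call with `cond_enabled=False` produces one request per output.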

All processors refer to the same instructions and constants, but may index different input, output, and conditional data. 
Thus, if multiple processors are working simultaneously, CTM exports a SIMD programming model. Individual 
processors, however, may or may not execute in SIMD lock-step in a particular CTM implementation; the behavior 
of individual processors relative to other processors is unspecified.

2.1.2 Processor Execution Unit

The Processor Execution Unit (PE) interprets commands from a command buffer, forwarding them to other units in CTM
if necessary. Under normal operation, the PE consumes commands as fast as it can process them or pass them along.
If, however, the PE receives a wait_for_idle command (see page 11), it stops reading commands until all processors
in the processor array are idle. Once the processors are idle, the PE again starts to read commands, beginning with the
one following the wait_for_idle command.
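
A minimal sketch of this command loop is shown below. It is a software analogy, not the hardware behaviour: commands are modelled as (name, params) tuples with the real parameter encodings omitted, and `processors_idle` and `handle` are hypothetical callbacks introduced for the example.

```python
def run_pe(commands, processors_idle, handle):
    """Consume commands in order, stalling on wait_for_idle.

    commands        -- iterable of (name, params) tuples (encodings omitted)
    processors_idle -- hypothetical callback: True once every processor is idle
    handle          -- hypothetical callback that processes or forwards a command
    """
    for name, params in commands:
        if name == "wait_for_idle":
            # Stop reading further commands until the whole array is idle.
            # Real hardware stalls; this model simply polls.
            while not processors_idle():
                pass
            continue                  # resume with the command that follows
        handle(name, params)
```

The point of the sketch is ordering: nothing after a wait_for_idle is handled until the idle check succeeds, while all other commands are consumed back to back.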

In addition to parsing the command buffer, the PE is responsible for distributing work to the processors in the
processor array. The PE's distribution of work is based on the rectangular domain $D \subset \mathbb{Z}^2$, with
$D = \{ (i, j) \mid i_0 \le i \le i_1,\ j_0 \le j \le j_1 \}$. The parameters $i_0$, $j_0$, $i_1$, and $j_1$ are
specified to the PE through the set_domain command (see page 10).
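
To make the inclusive bounds concrete, the following sketch enumerates the index pairs of a domain. The function name is hypothetical, and the enumeration order shown is arbitrary; it says nothing about how the PE actually walks the domain.

```python
def domain_pairs(i0, j0, i1, j1):
    """List every (i, j) with i0 <= i <= i1 and j0 <= j <= j1 (bounds inclusive)."""
    return [(i, j) for j in range(j0, j1 + 1) for i in range(i0, i1 + 1)]
```

Because both bounds are inclusive, a non-empty domain contains (i1 - i0 + 1) * (j1 - j0 + 1) index pairs; for example, i0 = 1, j0 = 2, i1 = 3, j1 = 3 gives 3 * 2 = 6 pairs.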

When the PE receives a start_program command (see page 11), it begins allocating work for the processors. If
conditional program execution is disabled, the PE schedules the processors to run the current program once for each
index pair (i, j) in the current domain. The specific partitioning of work among the processors and the order in which
the index pairs are scheduled is unspecified. As described in Section 2.1.1, the PE sends a corresponding index pair
and its conditional value to an idle processor in order to execute the program for that index pair. The result of the
entire computation is as if the program were executed in SIMD across all index pairs.
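
The "unspecified order, same result" guarantee can be illustrated with a small simulation for the case where conditional program execution is disabled. Everything here is an assumption made for the example: `program` is a hypothetical pure function of (i, j, cond_value), and the shuffle merely stands in for whatever scheduling order an implementation chooses.

```python
import random


def start_program(program, pairs, cond_value=0, seed=None):
    """Run `program` once per index pair, in an arbitrary order.

    program    -- hypothetical callable: program(i, j, cond_value) -> output value
    pairs      -- the index pairs of the current domain
    cond_value -- conditional value passed along with each pair
    seed       -- controls the (arbitrary) scheduling order in this model
    """
    order = list(pairs)
    random.Random(seed).shuffle(order)   # scheduling order is unspecified
    outputs = {}
    for i, j in order:
        # Each execution writes only to its own (i, j), so order cannot matter.
        outputs[(i, j)] = program(i, j, cond_value)
    return outputs
```

Running this with two different seeds yields identical output dictionaries, which mirrors the statement that the result is as if the program were executed in SIMD across all index pairs.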
