
Improved Dual Issue 

Special function units (SFUs) in the SMs compute transcendental math functions, 
perform attribute interpolation (interpolating pixel attributes from a primitive’s 
vertex attributes), and execute floating-point MUL instructions. The individual 
streaming processing cores (SPs) of GeForce GTX 200 GPUs can now perform near 
full-speed dual-issue of multiply-add operations (MADs) and MULs (3 flops per SP 
per clock): the SP’s MAD unit performs a MUL and an ADD per clock, and the SFU 
performs another MUL in the same clock. Optimized and directed tests measure 
around 93-94% of this peak rate.  
The entire streaming processor array (SPA) of a GeForce GTX 200 GPU delivers 
nearly one teraflop of peak, single-precision, IEEE 754 floating-point performance. 
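
The "nearly one teraflop" figure can be reproduced with simple arithmetic. The 
sketch below is illustrative only: the 3 flops per SP per clock and the 93-94% 
efficiency come from the text above, while the SP count (10 TPCs x 3 SMs x 8 SPs) 
and the 1.296 GHz shader clock are assumptions not stated in this section. 

```python
# Back-of-the-envelope peak single-precision throughput.
# ASSUMPTIONS (not stated in this section): 8 SPs per SM and a
# 1.296 GHz shader clock. The function name is hypothetical.

def peak_gflops(num_sps: int, flops_per_sp_per_clock: int, clock_ghz: float) -> float:
    """Peak throughput in gigaflops = cores x flops/clock x clock (GHz)."""
    return num_sps * flops_per_sp_per_clock * clock_ghz


NUM_SPS = 10 * 3 * 8       # TPCs x SMs per TPC x SPs per SM (assumed layout)
FLOPS_PER_SP = 3           # MAD (MUL + ADD) on the SP plus one MUL on the SFU
SHADER_CLOCK_GHZ = 1.296   # assumed shader clock

sp_peak = peak_gflops(NUM_SPS, FLOPS_PER_SP, SHADER_CLOCK_GHZ)

# Directed tests reportedly reach about 93-94% of the dual-issue peak.
measured_low = 0.93 * sp_peak
measured_high = 0.94 * sp_peak

print(f"peak: {sp_peak:.1f} GFLOPS")
print(f"93-94% of peak: {measured_low:.1f} to {measured_high:.1f} GFLOPS")
```

Under these assumptions the peak comes out a little below 1000 gigaflops, which 
is consistent with the "nearly one teraflop" wording. 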

Double Precision Support 

An important addition to the GeForce GTX 200 GPU architecture is support for 
double-precision (64-bit) floating-point computation. This benefits high-end 
scientific, engineering, and financial computing applications, and any other 
computational task that requires highly accurate results. Each SM incorporates 
one double-precision 64-bit floating-point math unit, for a total of 30 
double-precision 64-bit processing cores.  
The double-precision unit performs a fused MAD, a high-precision implementation 
of the MAD instruction that is fully compliant with the IEEE 754R floating-point 
specification. The overall double-precision performance of all 10 TPCs of a 
GeForce GTX 280 GPU is up to 78 gigaflops, roughly equivalent to an eight-core 
Xeon CPU. 
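
To illustrate why a fused MAD is described as high precision, the sketch below 
contrasts an unfused multiply-add (two roundings) with a fused one (a single 
rounding of the exact result). It is a software emulation using exact rational 
arithmetic, not a description of the hardware; the function names are 
hypothetical. It also reproduces the 78 gigaflops figure, assuming a 1.296 GHz 
shader clock that this section does not state. 

```python
from fractions import Fraction


def unfused_mad(a: float, b: float, c: float) -> float:
    """a*b is rounded to double, then the sum is rounded again."""
    return a * b + c


def fused_mad(a: float, b: float, c: float) -> float:
    """Emulated fused MAD: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding to the nearest double


# Inputs chosen so the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 before the addition in the unfused case.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("unfused:", unfused_mad(a, b, c))
print("fused:  ", fused_mad(a, b, c))

# Peak double-precision rate: 30 units x 2 flops (fused MAD) x clock.
DP_UNITS = 30
FLOPS_PER_FUSED_MAD = 2
SHADER_CLOCK_GHZ = 1.296  # ASSUMPTION: clock not given in this section
dp_peak = DP_UNITS * FLOPS_PER_FUSED_MAD * SHADER_CLOCK_GHZ
print(f"double-precision peak: {dp_peak:.2f} GFLOPS")
```

In this example the unfused version loses the small term entirely, while the 
fused version keeps it because the intermediate product is never rounded. 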

Improved Texturing Performance 

The eight TPCs of the GeForce 8800 GTX allowed for 64 pixels per clock of 
texture filtering, 32 pixels per clock of texture addressing, 32 pixels per clock of 2× 
anisotropic bilinear filtering (8-bit integer), or 32 bilinear-filtered pixels per clock 
(8-bit integer or 16-bit floating point). Subsequent GeForce 8 and 9 Series GPUs 
balanced texture addressing and filtering rates.  

For example, the GeForce 9800 GTX can address and filter 64 pixels 
per clock, supporting 64 bilinear-filtered pixels per clock (8-bit integer) 
or 32 bilinear-filtered pixels per clock (16-bit floating point). 

GeForce GTX 200 GPUs also provide balanced texture addressing and filtering. 
Each of the 10 TPCs includes a dual-quad texture unit capable of addressing and 
filtering eight bilinear pixels per clock, four 2:1 anisotropic-filtered pixels per 
clock, or four FP16 bilinear-filtered pixels per clock. Total bilinear texture 
addressing and filtering capability for an entire high-end GeForce GTX 200 GPU is 
therefore 80 pixels per clock.  
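
The chip-wide texture rates follow directly from the per-TPC figures. The short 
sketch below uses only numbers quoted in the text; the dictionary keys and the 
helper name are hypothetical labels chosen for illustration. 

```python
# Per-TPC texture rates for GeForce GTX 200, in pixels per clock (from the text).
PIXELS_PER_CLOCK_PER_TPC = {
    "bilinear_int8": 8,
    "aniso_2to1": 4,
    "bilinear_fp16": 4,
}
NUM_TPCS = 10


def chip_rate(per_tpc: int, num_tpcs: int = NUM_TPCS) -> int:
    """Whole-chip pixels per clock, assuming every TPC contributes equally."""
    return per_tpc * num_tpcs


gtx200 = {mode: chip_rate(rate) for mode, rate in PIXELS_PER_CLOCK_PER_TPC.items()}

# GeForce 9800 GTX figures quoted earlier in this section.
geforce_9800_gtx = {"bilinear_int8": 64, "bilinear_fp16": 32}

# Per-clock ratio only; this ignores any difference in clock frequency.
per_clock_ratio = {
    mode: gtx200[mode] / geforce_9800_gtx[mode] for mode in geforce_9800_gtx
}

print(gtx200)
print(per_clock_ratio)
```

These are per-clock peak rates only; actual throughput also depends on clock 
frequency and on how close the scheduler gets to the theoretical peak. 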
GeForce GTX 200 GPUs employ a more efficient scheduler, allowing the chips to 
attain close to theoretical peak performance in texture filtering. In real-world 
measurements, texture filtering on GeForce GTX 200 GPUs is 22% more efficient 
than on the GeForce 9 Series. 
 
 

 

 

May 2008  |  TB-04044-001_v01 


 
