
GeForce GTX 980 Whitepaper

 

GM204 HARDWARE ARCHITECTURE IN-DEPTH

 

 

 

and power that had to be spent to manage data transfer in the more complex datapath organization 
used by Kepler. 

 

Compared to Kepler, the SMM’s memory hierarchy has also changed. Rather than implementing a 
combined shared memory/L1 cache block as in Kepler SMX, Maxwell SMM units feature a 96KB 
dedicated shared memory, while the L1 caching function has been moved to be shared with the texture 
caching function. 

 

As a result of these changes, each Maxwell CUDA core is able to deliver roughly 1.4x the performance 
of a Kepler CUDA core, and 2x the performance per watt. At the SM level, with 33% 
fewer total cores per SM, but 1.4x performance per core, each Maxwell SMM can deliver total per-SM 
performance similar to Kepler’s SMX, and the area savings from this more efficient architecture enabled 
us to double the total SM count compared to GK104. 
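The per-SM arithmetic above can be checked with a short sketch. The 1.4x per-core figure and the 2x SM count come from the text; the per-SM core counts (192 for a Kepler SMX, 128 for a Maxwell SMM) are assumed from the published specifications rather than stated on this page, and the function names are illustrative only. Clock speed differences are ignored.

```python
# Rough throughput arithmetic for Kepler SMX vs. Maxwell SMM.
# Assumption (not stated on this page): 192 cores per SMX, 128 cores per SMM.
KEPLER_CORES_PER_SM = 192
MAXWELL_CORES_PER_SM = 128

PER_CORE_SPEEDUP = 1.4  # from the text: ~1.4x performance per core
SM_COUNT_RATIO = 2.0    # from the text: SM count doubled vs. GK104


def core_reduction(old_cores: int, new_cores: int) -> float:
    """Fraction of cores removed per SM."""
    return 1.0 - new_cores / old_cores


def relative_sm_throughput(old_cores: int, new_cores: int, speedup: float) -> float:
    """Per-SM throughput of the new design relative to the old one."""
    return (new_cores / old_cores) * speedup


reduction = core_reduction(KEPLER_CORES_PER_SM, MAXWELL_CORES_PER_SM)
per_sm = relative_sm_throughput(
    KEPLER_CORES_PER_SM, MAXWELL_CORES_PER_SM, PER_CORE_SPEEDUP
)
chip_level = per_sm * SM_COUNT_RATIO

print(f"Cores removed per SM:      {reduction:.0%}")
print(f"Relative per-SM throughput: {per_sm:.2f}x")
print(f"Relative chip throughput:   {chip_level:.2f}x (at equal clocks)")
```

Under these assumptions an SMM lands at roughly 0.93x of an SMX, which matches the "similar" per-SM performance claimed above, and doubling the SM count gives a little under 1.9x at equal clocks.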

PolyMorph Engine 3.0 

Tessellation was one of DirectX 11’s key features and will play a bigger role in the future as the next 
generation of games is designed to use more tessellation. With the addition of more SMs in GM204, 
GTX 980 also benefits from 2x the PolyMorph Engines compared to GTX 680. As a result, performance 
on geometry-heavy workloads is roughly doubled, and architectural improvements within the PolyMorph 
Engine allow GTX 980 to achieve up to a 3x performance improvement at high tessellation expansion factors. 

 

 

 

 

Figure: Microsoft SubD11 SDK Test. Performance (vertical axis, 0.0 to 4.0) versus Expansion Factor (horizontal axis, 1 to 31) for GTX 980 and GTX 680.
