GeForce GTX 980 Whitepaper
GM204 HARDWARE ARCHITECTURE IN-DEPTH
GM204 Memory Subsystem
In GM204, one ROP partition contains 16 ROP units (compared to eight ROP units per partition in Kepler);
each ROP can process a single color sample. With four ROP partitions, a full GM204 has 64 ROPs, twice
the 32 ROPs of its predecessor, GK104, which dramatically improves ROP throughput.
GM204 has a 256-bit memory interface with 7Gbps GDDR5 memory, the fastest in the industry. GM204
also features a unified 2048KB L2 cache that is shared across the GPU. In addition, GM204 significantly
enhances our memory compression implementation.
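The interface figures above imply a peak DRAM bandwidth that is easy to check by hand. The following is a minimal sketch of that arithmetic, assuming the 7Gbps rate applies to each of the 256 data pins; the function name is hypothetical and chosen only for illustration.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak DRAM bandwidth in GB/s for a bus where every pin runs at data_rate_gbps."""
    # Total gigabits per second across the bus, divided by 8 bits per byte.
    return bus_width_bits * data_rate_gbps / 8


# GM204: 256-bit interface, 7Gbps GDDR5
gm204_bandwidth = peak_bandwidth_gb_per_s(256, 7.0)
print(gm204_bandwidth)
```

For the values quoted here this works out to 224 GB/s, which is the raw bandwidth that the compression techniques described next are meant to stretch further.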
To reduce DRAM bandwidth demands, NVIDIA GPUs make use of lossless compression techniques as
data is written out to memory. The bandwidth savings from this compression are realized a second time
when clients such as the Texture Unit later read the data. As illustrated in the preceding figure, our
compression engine has multiple layers of compression algorithms. Any block going out to memory will
first be examined to see if 4x2 pixel regions within the block are constant, in which case the data will be
compressed 8:1 (i.e., from 256B to 32B of data, for 32b color). If that fails, but 2x2 pixel regions are
constant, we will compress the data 4:1.
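The order of these checks can be sketched in software as follows. This is only an illustration of the decision logic, not the hardware implementation: it assumes a block of 8x8 pixels (64 pixels at 32b each gives the 256B quoted above), and the function names are hypothetical.

```python
def regions_constant(block, region_w, region_h):
    """Return True if every region_w x region_h tile of the block holds one pixel value."""
    height, width = len(block), len(block[0])
    for y0 in range(0, height, region_h):
        for x0 in range(0, width, region_w):
            first = block[y0][x0]
            for y in range(y0, y0 + region_h):
                for x in range(x0, x0 + region_w):
                    if block[y][x] != first:
                        return False
    return True


def choose_constant_mode(block):
    """Try the 8:1 mode (4x2 regions) first, then the 4:1 mode (2x2 regions)."""
    if regions_constant(block, 4, 2):
        return "8:1"  # 8 regions x 4B = 32B instead of 256B
    if regions_constant(block, 2, 2):
        return "4:1"  # 16 regions x 4B = 64B instead of 256B
    return None  # neither constant mode applies; fall through to other modes


# Assumed 8x8 block layout; pixel values are arbitrary integers for illustration.
flat_block = [[7] * 8 for _ in range(8)]
tiled_block = [[(x // 2) + 4 * (y // 2) for x in range(8)] for y in range(8)]
noisy_block = [[x + 8 * y for x in range(8)] for y in range(8)]
print(choose_constant_mode(flat_block), choose_constant_mode(tiled_block),
      choose_constant_mode(noisy_block))
```

A block that fails both checks, such as `noisy_block`, is left for the remaining compression modes or is stored uncompressed.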
These modes are very effective for AA surfaces, but less so for 1xAA rendering. Therefore, starting in
Fermi we also implemented support for a “delta color compression” mode. In this mode, we calculate
the difference between each pixel in the block and its neighbor, and then try to pack these difference
values together using the minimum number of bits. For example, if pixel A’s red value is 253 (8 bits) and
pixel B’s red value is 250 (also 8 bits), the difference is 3, which can be represented in only 2 bits.
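The idea behind the example can be sketched for a single color channel as shown below. This is a simplified illustration, not the hardware algorithm: the choice of the left-hand pixel as the neighbor, the fixed-width packing with one sign bit, and all function names are assumptions made for this sketch.

```python
def bits_for_magnitude(a: int, b: int) -> int:
    """Bits needed to store the absolute difference between two channel values."""
    return abs(a - b).bit_length()


def delta_encode_row(values):
    """Keep the first value as an anchor; store each later value as a delta from its left neighbor."""
    anchor = values[0]
    deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
    return anchor, deltas


def packed_bits(values, channel_bits=8):
    """Estimate the packed size: a full-width anchor plus fixed-width signed deltas."""
    anchor, deltas = delta_encode_row(values)
    if not deltas:
        return channel_bits
    # Assumed packing: one sign bit plus enough magnitude bits for the largest delta.
    width = 1 + max(abs(d) for d in deltas).bit_length()
    return channel_bits + width * len(deltas)


red_values = [253, 250, 251, 252]
print(bits_for_magnitude(253, 250))   # magnitude bits for pixels A and B
print(delta_encode_row(red_values))
print(packed_bits(red_values), "bits packed vs", 8 * len(red_values), "bits raw")
```

Smoothly varying pixels produce small deltas and therefore pack tightly, while sharp edges or noise produce large deltas and leave little to gain.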