A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor
branches are resolved. However, speculative loads cannot cause page faults. Reordering loads with respect to each
other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to
different addresses can enable more parallelism, allowing the machine to execute more operations as soon as their
inputs are ready. Writes to memory are always carried out in program order to maintain program correctness.
A cache miss for a load does not prevent other loads from issuing and completing. The Pentium 4 processor supports
up to four outstanding load misses that can be serviced either by on-chip caches or by memory.
Store buffers improve performance by allowing the processor to continue executing instructions without having to
wait until a write to memory and/or cache is complete. Writes are generally not on the critical path for dependence
chains, so it is often beneficial to delay writes for more efficient use of memory-access bus cycles.
Store Forwarding
Loads can be moved before stores that occurred earlier in the program if they are not predicted to load from the
same linear address. If they do read from the same linear address, they have to wait for the store’s data to become
available. However, with store forwarding, they do not have to wait for the store to write to the memory hierarchy
and retire. The data from the store can be forwarded directly to the load, as long as the following conditions are met:
• Sequence: the data to be forwarded to the load has been generated by a programmatically earlier store that has already executed.
• Size: the bytes loaded must be a subset of the bytes stored (the two sets may be identical).
• Alignment: the store cannot wrap around a cache line boundary, and the linear address of the load must be the same as that of the store.
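
The three conditions above can be expressed as a small predicate. The following is a minimal sketch in Python, not a description of the hardware logic: the names `Access`, `crosses_cache_line`, and `can_forward` are hypothetical, and the 64-byte cache line size is an assumption that the text above does not state.

```python
from dataclasses import dataclass

# Assumption: 64-byte cache lines. The surrounding text does not give the line size.
CACHE_LINE_BYTES = 64


@dataclass(frozen=True)
class Access:
    """A memory access described by its linear address and its size in bytes."""
    address: int
    size: int


def crosses_cache_line(access: Access, line_bytes: int = CACHE_LINE_BYTES) -> bool:
    """Return True if the access touches more than one cache line."""
    first_line = access.address // line_bytes
    last_line = (access.address + access.size - 1) // line_bytes
    return first_line != last_line


def can_forward(store: Access, load: Access, store_executed: bool) -> bool:
    """Model the forwarding conditions for a programmatically earlier store."""
    # Sequence: the earlier store must already have executed.
    sequence_ok = store_executed
    # Size: every byte loaded must be among the bytes stored.
    size_ok = (load.address >= store.address
               and load.address + load.size <= store.address + store.size)
    # Alignment: same linear address, and the store stays within one cache line.
    alignment_ok = (load.address == store.address
                    and not crosses_cache_line(store))
    return sequence_ok and size_ok and alignment_ok


# A 4-byte store followed by a 1-byte load from the same address can forward.
narrow_load_ok = can_forward(Access(0x1000, 4), Access(0x1000, 1), True)
# A 2-byte store cannot supply all the bytes of a 4-byte load.
wide_load_ok = can_forward(Access(0x1000, 2), Access(0x1000, 4), True)
# A load from inside the stored bytes, but at a different address, fails the alignment rule.
offset_load_ok = can_forward(Access(0x1000, 4), Access(0x1002, 1), True)
# An 8-byte store at offset 60 of a 64-byte line wraps into the next line.
split_store_ok = can_forward(Access(0x103C, 8), Access(0x103C, 4), True)
```

Under these assumptions only the first case satisfies all three conditions. The practical reading is that a wide store followed by a narrower load from the same starting address is the pattern that forwards, while a wide load assembled from narrower stores is not.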