Chapter 4. Continuous availability and manageability
In other cases, the POWER Hypervisor notifies the owning partition that the page must be
deallocated. Where possible, the operating system moves any data currently contained in that
memory area to another memory area and removes the pages associated with this error from
its memory map, no longer addressing these pages. The operating system performs memory
page deallocation without any user intervention, and the process is transparent to users and applications.
The POWER Hypervisor maintains a list of pages marked for deallocation during the current
platform initial program load (IPL). During a partition IPL, the partition receives a list of all the
bad pages in its address space. In addition, if memory is dynamically added to a partition
(through a dynamic LPAR operation), the POWER Hypervisor warns the operating system
when memory pages are included that need to be deallocated.
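The bookkeeping described above can be illustrated with a small model. This is only a sketch under stated assumptions: the class names, the use of integer page numbers, and the method names are hypothetical and do not correspond to any real POWER Hypervisor interface.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical partition: a name plus the page numbers it owns."""
    name: str
    pages: set = field(default_factory=set)


@dataclass
class HypervisorModel:
    """Toy model of the list of pages marked for deallocation since platform IPL."""
    bad_pages: set = field(default_factory=set)

    def mark_bad(self, page: int) -> None:
        # Record a page that must no longer be used.
        self.bad_pages.add(page)

    def bad_pages_for(self, partition: Partition) -> list:
        # List a partition would receive during its own IPL.
        return sorted(self.bad_pages & partition.pages)

    def add_memory(self, partition: Partition, new_pages: set) -> list:
        # Dynamic LPAR add: report which of the new pages need deallocation.
        partition.pages |= new_pages
        return sorted(self.bad_pages & new_pages)


hyp = HypervisorModel()
hyp.mark_bad(7)
hyp.mark_bad(42)
lpar = Partition("lpar1", {1, 7, 8})
startup_list = hyp.bad_pages_for(lpar)       # bad pages known at partition IPL
added_list = hyp.add_memory(lpar, {41, 42})  # warning issued on dynamic add
```

In this model the partition learns about page 7 when it starts and about page 42 only when the memory containing it is added dynamically.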
Finally, if an uncorrectable error in memory is discovered, the logical memory block
associated with the address with the uncorrectable error is marked for deallocation by the
POWER Hypervisor. This deallocation will take effect on a partition reboot if the logical
memory block is assigned to an active partition at the time of the fault.
In addition, the system will deallocate the entire memory group associated with the error on
all subsequent system reboots until the memory is repaired. This precaution is intended to
guard against future uncorrectable errors while waiting for parts replacement.
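The deferred deallocation of a logical memory block can be sketched as follows. The block size, the class, and its methods are assumptions made for illustration. In particular, treating a block that is not assigned to an active partition as deallocated immediately is an assumption of this sketch, not a statement from the text.

```python
from dataclasses import dataclass, field

# Assumed logical memory block size; the real value is a system setting.
LMB_SIZE = 256 * 1024 * 1024


@dataclass
class LmbTracker:
    """Toy tracker for logical memory blocks (LMBs) hit by uncorrectable errors."""
    assigned: dict = field(default_factory=dict)   # LMB index -> active partition name
    pending: set = field(default_factory=set)      # marked, waiting for a partition reboot
    deallocated: set = field(default_factory=set)  # no longer in use

    def report_uncorrectable(self, address: int) -> int:
        lmb = address // LMB_SIZE
        if lmb in self.assigned:
            # In use by an active partition: takes effect at that partition's reboot.
            self.pending.add(lmb)
        else:
            # Assumption of this sketch: an unassigned block is removed at once.
            self.deallocated.add(lmb)
        return lmb

    def reboot_partition(self, name: str) -> set:
        # Apply every pending deallocation that belongs to the rebooting partition.
        removed = {lmb for lmb in self.pending if self.assigned.get(lmb) == name}
        for lmb in removed:
            del self.assigned[lmb]
        self.pending -= removed
        self.deallocated |= removed
        return removed


tracker = LmbTracker(assigned={0: "lpar1", 1: "lpar2"})
faulty_lmb = tracker.report_uncorrectable(LMB_SIZE + 5)
```

Rebooting a partition that does not own the marked block changes nothing. Only the reboot of the owning partition moves the block from pending to deallocated.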
Memory persistent deallocation
Defective memory discovered at boot time is automatically switched off. If the service
processor detects a memory fault at boot time, it marks the affected memory as bad so that it
is not used on subsequent reboots.
If the service processor identifies faulty memory in a server that includes CoD memory, the
POWER Hypervisor attempts to replace the faulty memory with available CoD memory. Faulty
resources are marked as deallocated, and working resources are included in the active
memory space. Because these activities reduce the amount of CoD memory available for
future use, repair of the faulty memory must be scheduled as soon as convenient.
Upon reboot, if not enough memory is available to meet minimum partition requirements, the
POWER Hypervisor will reduce the capacity of one or more partitions.
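The arithmetic behind the CoD substitution described above can be made concrete with a short sketch. The function name and the gigabyte figures are hypothetical. The policy by which partitions are reduced after a shortfall is not described in the text and is not modeled here.

```python
def replace_faulty_memory(active_gb: int, faulty_gb: int, cod_spare_gb: int) -> tuple:
    """Return (active memory, remaining CoD spare) after a toy CoD substitution."""
    # Only as much faulty memory can be replaced as there is spare CoD memory.
    substituted = min(faulty_gb, cod_spare_gb)
    active_after = active_gb - faulty_gb + substituted
    spare_after = cod_spare_gb - substituted
    return active_after, spare_after


# Enough spare CoD memory: active capacity is preserved, but the reserve shrinks.
full_cover = replace_faulty_memory(active_gb=256, faulty_gb=32, cod_spare_gb=64)

# Too little spare CoD memory: active capacity drops and the reserve is exhausted.
partial_cover = replace_faulty_memory(active_gb=256, faulty_gb=32, cod_spare_gb=16)
```

The first case shows why repair should not be postponed indefinitely: capacity is unchanged, yet half of the CoD reserve is gone. The second case corresponds to the situation in which partition capacity might have to be reduced at the next reboot.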
Depending on the configuration of the system, the HMC Service Focal Point™, OS
Service Focal Point, or service processor will receive a notification of the failed component,
and will trigger a service call.
4.2.4 Cache protection
These processor-based systems are designed with cache protection mechanisms,
including cache line delete in both L2 and L3 arrays, processor instruction retry and alternate
processor recovery protection on L1-I and L1-D, and redundant “repair” bits in L1-I, L1-D, and
L2 caches, as well as in L2 and L3 directories.
L1 instruction and data array protection
The processor instruction and data caches are protected against intermittent
errors using processor instruction retry and against permanent errors by alternate processor
recovery, both mentioned previously. L1 cache is divided into sets. The processor can
deallocate all but one set before doing a processor instruction retry.
In addition, faults in the Segment Lookaside Buffer (SLB) array are recoverable by the
POWER Hypervisor. The SLB is used in the core to perform address translation calculations.
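The rule that the processor can deallocate all but one L1 set can be illustrated with a minimal model. The class, its method, and the set count of four are hypothetical choices for this sketch and do not reflect the actual cache organization.

```python
class L1CacheModel:
    """Toy model of an L1 cache whose faulty sets can be taken out of service."""

    def __init__(self, num_sets: int = 4):  # set count is an assumed value
        self.active = set(range(num_sets))

    def deallocate_set(self, index: int) -> bool:
        """Deallocate a faulty set, refusing to remove the last active one."""
        if index not in self.active or len(self.active) == 1:
            return False
        self.active.remove(index)
        return True


cache = L1CacheModel(num_sets=4)
results = [cache.deallocate_set(i) for i in range(4)]
```

In this model the first three requests succeed and the fourth is refused, so one set always remains in service.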