
Chapter 4. Continuous availability and manageability
4.1.2 Placement of components
Packaging is designed to deliver both high performance and high reliability. For example, the
reliability of electronic components is directly related to their thermal environment, that is,
large decreases in component reliability directly correlate with relatively small increases in
temperature; POWER6 processor-based systems are carefully packaged to ensure adequate
cooling. Critical system components such as the POWER6 processor chips are positioned on
printed circuit cards so they receive cool air during operation. In addition, POWER6
processor-based systems are built with redundant, variable-speed fans that can automatically
increase output to compensate for increased heat in the central electronic complex (CEC).
IBM's POWER6 chip was designed to save energy and cooling costs. Innovations include:
- A dramatic improvement in the way instructions are executed inside the chip. Performance
  was increased by keeping the number of pipeline stages constant while making each stage
  faster, removing unnecessary work, and doing more in parallel. As a result, either
  execution time is cut in half or energy consumption is reduced.
- Separating circuits that cannot support low-voltage operation onto their own power supply
  rails, dramatically reducing power for the rest of the chip.
- Voltage/frequency slewing, which enables the chip to lower electricity consumption by up
  to 50%, with minimal performance impact.
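
The effect of voltage/frequency slewing can be illustrated with the standard first-order CMOS
relation for dynamic power, P ≈ C·V²·f. The sketch below is not taken from IBM documentation:
the function name and the operating point (85% voltage, 70% frequency) are hypothetical, chosen
only to show how a modest reduction in both quantities compounds. Leakage power is ignored.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Return dynamic power relative to nominal, using P ~ C * V^2 * f.

    The switched capacitance C is assumed unchanged, so it cancels out.
    Static (leakage) power is ignored in this first-order model.
    """
    if voltage_scale <= 0 or frequency_scale <= 0:
        raise ValueError("scale factors must be positive")
    return voltage_scale ** 2 * frequency_scale


# Hypothetical operating point: 85% of nominal voltage, 70% of nominal frequency.
ratio = relative_dynamic_power(0.85, 0.70)
print(f"dynamic power is {ratio:.0%} of nominal")  # prints "dynamic power is 51% of nominal"
```

Because power depends on the square of the voltage, lowering voltage together with frequency
saves proportionally more power than it costs in clock rate under this model.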
Innovative and pioneering techniques allow the POWER6 chip to turn off its processor clocks
when there is no useful work to be done, then turn them back on when needed, reducing both
system power consumption and cooling requirements. Power saving is also realized when the
memory is not fully utilized, as power to parts of the memory not being utilized is dynamically
turned off and then turned back on when needed. When coupled with other RAS
improvements, these features can deliver a significant improvement in overall system
availability.
4.1.3 Continuous field monitoring
Aided by the IBM First Failure Data Capture (FFDC) methodology and the associated error
reporting strategy, product engineering builds an accurate profile of the types of failures that
might occur, and initiates programs to enable corrective actions. Product engineering also
continually analyzes critical system faults, testing to determine if system firmware and
maintenance procedures and tools are effectively handling and recording faults as designed.
See section 4.3.3, “Detecting errors” on page 152.
A system designed with the FFDC methodology includes an extensive array of error checkers
and fault isolation registers (FIRs) to detect, isolate, and identify faulty conditions in a server.
This type of automated error capture and identification is especially useful in providing a quick
recovery from unscheduled hardware outages. This data provides a basis for failure analysis
of the component, and can also be used to improve the reliability of the part and as the
starting point for design improvements in future systems.
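
As a rough illustration of how error checkers and a fault isolation register fit together, the
following is a minimal software sketch, not a description of the actual hardware. The class,
its methods, and the checker names are all hypothetical. It assumes one register bit per
checker and records which checker reported first, which is the idea behind first failure
data capture.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultIsolationRegister:
    """Toy model of a FIR: one bit per (hypothetical) hardware error checker."""
    checker_names: List[str]
    bits: int = 0
    first_failure: Optional[str] = None  # latched on the first reported error only

    def report(self, checker: str) -> None:
        """Set the bit for a checker; remember it if it is the first failure."""
        index = self.checker_names.index(checker)  # raises ValueError if unknown
        if self.bits == 0:
            self.first_failure = checker
        self.bits |= 1 << index

    def active_checkers(self) -> List[str]:
        """Return the names of all checkers whose bit is currently set."""
        return [
            name
            for i, name in enumerate(self.checker_names)
            if (self.bits >> i) & 1
        ]


# Hypothetical checker names, for illustration only.
fir = FaultIsolationRegister(["l2_parity", "memory_ecc", "fabric_bus_crc"])
fir.report("memory_ecc")   # the original fault
fir.report("l2_parity")    # a later, possibly secondary, fault
print(fir.first_failure)      # memory_ecc
print(fir.active_checkers())  # ['l2_parity', 'memory_ecc']
```

Latching the first failure separately from the accumulated bits is what lets later analysis
distinguish the originating fault from errors that followed it.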
IBM RAS engineers use specially designed logic circuitry to create faults that can be detected
and stored in FIR bits, simulating internal chip failures. This technique, called error injection,
is used to validate server RAS features and diagnostic functions in a variety of operating
conditions (power-on, boot, and operational run-time phases). IBM traditionally classifies
hardware error events in the following ways:
- Repair actions (RA) are related to the industry-standard definition of mean time between
  failures (MTBF). An RA is any hardware event that requires service on a system. Repair