RAS Design Philosophy
Achieving mainframe-class continuous operation through the pursuit of
reliability and availability within a single server
[Figure: Mainframe-class RAS Features. Labels: Clustering; Dependable Server Technology; Improved system availability; Improved reliability and availability as a standalone server]
Continuous operation through failures
Redundant components, error prediction, and error
correction allow for continuous operation
Minimized spread of failures
Technology to minimize the effects of hardware failures on
the system. Reduction of performance degradation and
multi-node shutdown
Smooth recovery after failures
Ability to replace failed components without
shutting down operations
Reliability and availability on an open server are generally
achieved through clustering. However, clustering comes with a
price tag. To keep costs to a minimum, the Express5800/1000
series servers were designed to achieve a high level of
reliability and availability within a single server.
The Express5800/1000 series server’s powerful RAS features
were developed through the pursuit of dependable server
technology.
Three goals were set: continuous operation through failures,
minimized spread of failures, and smooth recovery after
failures. These goals led to the implementation of technologies
such as memory mirroring, increased redundancy of intricate
components, and modularization. Through these technologies,
a mainframe level of continuous operation was achieved.
[Figure: Reliability, Availability, and Serviceability compared across the mainframe level, the conventional open server level, and the PC server level]
Components: Center plane, Chipset, Clock, Core I/O, PCI card, Memory, CPU, L3 cache, Power, HDD
RAS features:
No chipset on the center plane
ECC protection of main data paths
Intricate error detection of the high-speed interconnects
Partial chipset degradation / Dynamic recovery
Hot Pluggable*4 (six components)
Duplexed*1
16 processor domain segmentation*2
Core I/O Relief
ECC protection
SDDC Memory
Memory Mirroring*1
Intel® Cache Safe Technology*3
N+1 Redundant
Two independent power sources
Software RAID
Hardware RAID
*1 Available only on the 1320Xf/1160Xf
*2 Available only on the 1320Xf
*3 Intel® technology designed to avoid cache-based failures
*4 Replacement of failed component without shutting down other partitions.
The Dual-Core Intel® Itanium® processor MCA (Machine Check Architecture)
The framework for hardware, firmware and OS error handling
The Dual-Core Intel® Itanium® processor, designed for high-end
enterprise servers, not only excels in performance but is also
rich in RAS features. At the core of the processor’s RAS
feature set is the error handling framework, called MCA.
MCA provides a three-stage error handling mechanism: hardware,
firmware, and operating system. In the first stage, the CPU and
chipset attempt to handle errors through ECC (Error Correcting
Code) and parity protection. If the hardware cannot handle the
error, it is passed to the second stage, where the firmware
attempts to resolve the issue. In the third stage, if the first
two stages cannot handle the error, the operating system runs
recovery procedures based on the error report and error log
that it received. In the event of a critical error, the system
automatically resets to significantly reduce the possibility of
a system failure.
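The three-stage escalation described above can be illustrated with a short sketch. This is only a toy model of the control flow, not Intel's MCA interface: the severity categories, the `MachineError` type, and the stage functions are all hypothetical names introduced for illustration, and a real platform classifies and reports errors in far more detail.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Severity(Enum):
    """Hypothetical error classes, invented for this sketch."""
    CORRECTABLE = 1     # fixable by ECC or parity protection in hardware
    PLATFORM = 2        # needs firmware assistance
    OS_RECOVERABLE = 3  # needs operating system recovery procedures
    CRITICAL = 4        # no stage can recover; the system resets


@dataclass(frozen=True)
class MachineError:
    source: str         # component that reported the error, e.g. "memory"
    severity: Severity


# Each stage returns True if it handled the error, False to escalate.
def hardware_stage(error: MachineError) -> bool:
    return error.severity is Severity.CORRECTABLE


def firmware_stage(error: MachineError) -> bool:
    return error.severity is Severity.PLATFORM


def os_stage(error: MachineError) -> bool:
    return error.severity is Severity.OS_RECOVERABLE


# Stages are tried in the order given in the text.
STAGES: List[Tuple[str, Callable[[MachineError], bool]]] = [
    ("hardware", hardware_stage),
    ("firmware", firmware_stage),
    ("operating system", os_stage),
]


def handle_error(error: MachineError) -> str:
    """Return the name of the stage that handled the error, or "reset"."""
    for name, stage in STAGES:
        if stage(error):
            return name
    # No stage could handle the error: model the automatic system reset.
    return "reset"


if __name__ == "__main__":
    for severity in Severity:
        outcome = handle_error(MachineError("memory", severity))
        print(severity.name, "->", outcome)
```

Keeping the stages in an ordered list mirrors the text: an error only reaches a later stage after every earlier stage has declined it, and the reset is the fallback when all three decline.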
[Figure: MCA error handling layers]
Application Layer
Operating System: The OS logs the error, and then starts the recovery process
Firmware: Seamlessly handles the error
Hardware: CPU and chipset ECC and parity protection
The firmware and OS aid in the correction of complex platform errors to restore the system
Error details are logged, and then a report flow is defined for the OS
Detects and corrects a wide range of hardware errors for main data structures