RAS Design Philosophy

Realization of mainframe-class continuous operation through the pursuit of reliability and availability in a single server construct

Figure: Dependable Server Technology. Figure labels: Mainframe-class RAS Features; Clustering; Improved system availability; Improved reliability and availability as a standalone server.

Continuous operations through failures: Redundant components, error prediction, and error correction allow for continuous operation.

Minimized spread of failures: Technology to minimize the effects of hardware failures on the system. Reduction of performance degradation and multi-node shutdown.

Smooth recovery after failures: Ability to replace failed components without shutting down operations.

Reliability and availability on an open server are generally 
achieved through clustering.  However, clustering adds cost.  
To keep costs to a minimum, the Express5800/1000 series 
servers were designed to achieve a high level of reliability 
and availability within a single server.

The Express5800/1000 series server’s powerful RAS features 
were developed through the pursuit of dependable server 
technology.

Three goals were set: continuous operation through failures, 
minimized spread of failures, and smooth recovery after 
failures.  These goals led to the implementation of technologies 
such as memory mirroring, increased redundancy of intricate 
components, and modularization.  Through these technologies, 
a mainframe level of continuous operation was achieved.
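
As a rough illustration of the idea behind memory mirroring, the sketch below keeps two copies of every write and serves reads from the second copy when the first is marked bad. It is a toy model in Python, not NEC's implementation: the class name, the method names, and the explicit fault-injection step are all hypothetical and exist only for this example.

```python
class MirroredMemory:
    """Toy model of memory mirroring (hypothetical, for illustration only)."""

    def __init__(self, size):
        self.primary = [0] * size   # first copy of the data
        self.mirror = [0] * size    # second, identical copy
        self.failed = set()         # primary addresses marked uncorrectable

    def write(self, addr, value):
        # Every write is applied to both copies.
        self.primary[addr] = value
        self.mirror[addr] = value

    def inject_primary_fault(self, addr):
        # Simulate an uncorrectable error in the primary copy.
        self.failed.add(addr)
        self.primary[addr] = None

    def read(self, addr):
        # Serve the read from the mirror if the primary copy is bad.
        if addr in self.failed:
            return self.mirror[addr]
        return self.primary[addr]


mem = MirroredMemory(8)
mem.write(3, 42)
mem.inject_primary_fault(3)
value_after_fault = mem.read(3)   # still served, from the mirror copy
```

The point of the sketch is only that a fault in one copy does not interrupt the reader, which is the continuous-operation goal described above.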

Figure: RAS levels compared: mainframe level, conventional open server level, PC server level.

Table: RAS features by component. Columns: Reliability, Availability, Serviceability.

Components: Center plane, Chipset, Clock, Core I/O, PCI card, Memory, CPU, L3 cache, Power, HDD.

Features shown:
No chipset on the center plane
ECC protection of main data paths
Intricate error detection of the high-speed interconnects
Partial chipset degradation / Dynamic recovery
Hot Pluggable *4 (shown for six components)
Duplexed *1
16 processor domain segmentation *2
Core I/O Relief
ECC protection
SDDC Memory
Memory Mirroring *1
Intel® Cache Safe Technology *3
N+1 Redundant
Two independent power sources
Software RAID
Hardware RAID

*1 Available only on the 1320Xf/1160Xf
*2 Available only on the 1320Xf
*3 Intel® technology designed to avoid cache-based failures
*4 Replacement of failed component without shutting down other partitions.

The Dual-Core Intel® Itanium® processor MCA (Machine Check Architecture)

The framework for hardware, firmware and OS error handling

The Dual-Core Intel® Itanium® processor, designed for high-end 
enterprise servers, not only excels in performance but is also 
rich in RAS features. At the core of the processor’s RAS 
feature set is the error handling framework called MCA.

MCA provides a three-stage error handling mechanism: hardware, 
firmware, and operating system. In the first stage, the CPU and 
chipset attempt to handle errors through ECC (Error Correcting 
Code) and parity protection. If the error cannot be handled by 
the hardware, it is passed to the second stage, where the 
firmware attempts to resolve the issue.  In the third stage, if the 
error cannot be handled by the first two stages, the operating 
system runs recovery procedures based on the error report 
and error log it receives. In the event of a critical error, 
the system will automatically reset, to significantly reduce the 
possibility of a system failure.
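
The escalation order described above can be sketched in a few lines of Python. This is a conceptual model only: the `Severity` values, the stage functions, and `handle_error` are hypothetical names invented for illustration, and real MCA handling takes place in processor hardware, firmware, and the OS kernel rather than in application code.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical categories, chosen only to drive the example.
    CORRECTABLE = 1      # fixable by ECC/parity in hardware
    PLATFORM = 2         # needs firmware assistance
    OS_RECOVERABLE = 3   # needs an OS recovery procedure
    CRITICAL = 4         # no stage can recover


def hardware_stage(severity):
    return severity is Severity.CORRECTABLE


def firmware_stage(severity):
    return severity is Severity.PLATFORM


def os_stage(severity):
    return severity is Severity.OS_RECOVERABLE


def handle_error(severity):
    """Try each stage in order; return (outcome, stages_tried)."""
    stages = (
        ("hardware", hardware_stage),
        ("firmware", firmware_stage),
        ("os", os_stage),
    )
    tried = []
    for name, stage in stages:
        tried.append(name)
        if stage(severity):
            return name, tried
    # No stage could handle the error: model the automatic reset.
    return "reset", tried


outcome, tried = handle_error(Severity.PLATFORM)
# outcome is "firmware"; tried is ["hardware", "firmware"]
```

Each stage sees the error only if the stage before it could not resolve it, and the reset path is reached only when all three stages decline.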

Figure: MCA error handling layers.

Application Layer

Operating System: The OS logs the error, and then starts the recovery process.

Firmware: Seamlessly handles the error.

Hardware: CPU and chipset ECC and parity protection.

Figure notes:
The firmware and OS aid in the correction of complex platform errors to restore the system.
Error details are logged, and then a report flow is defined for the OS.
Detects and corrects a wide range of hardware errors for main data structures.
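
The hardware stage relies on ECC and parity protection. To show the difference between the two, the sketch below uses a single even-parity bit, which can detect a flipped bit but not locate it, and a textbook Hamming(7,4) code, which can locate and correct one flipped bit. This is a teaching example under stated assumptions: the function names are hypothetical, and the source does not say which codes the CPU and chipset actually use.

```python
def even_parity(bits):
    """Return the parity bit that makes the total number of 1s even."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits into 7 bits (positions 1..7 = p1 p2 d1 p4 d2 d3 d4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit; return (data_bits, error_position)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1                        # simulate a single-bit fault at position 5
recovered, position = hamming74_decode(stored)
```

After the simulated fault, `recovered` equals the original `data` and `position` identifies the flipped bit, whereas a parity check on the same word would only report that something changed.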

Source document: NEC Express5800/1000 Technology Guide, Vol. 1. Powered by the Dual-Core Intel Itanium Processor. NEC Express5800/1000 Series: 1320Xf, 1160Xf, 1080Rf.
