
Health Check Messages

Advanced System Diagnostics and Troubleshooting Guide


The following examples describe how these values apply to a BlackDiamond 6808:

On a BlackDiamond 6808, if more than six fast-path errors are detected within one 20-second 
window, a message is inserted into the system log. If this pattern recurs three times within three 
windows, the system health check subsystem takes the action specified in the 
configure sys-health-check command.

If fewer than six fast-path errors are detected within a single 20-second window, there is no 
threshold violation, so no message is inserted in the system log.

If more than six fast-path errors are detected within a single 20-second window, but no fast-path 
errors are detected in other 20-second windows, an error message is inserted in the system log for 
the fast-path window threshold violation, but no system health check action is taken.
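The three cases above can be sketched in a few lines of Python. This is only an illustration of the windowing logic as this page describes it, not ExtremeWare code: the constant and function names are hypothetical, the values come from the BlackDiamond 6808 example, and the sketch reads "three times within three windows" as three bad windows among the three most recent ones.

```python
from collections import deque

# Values from the BlackDiamond 6808 example; all names here are hypothetical.
FAST_PATH_ERROR_THRESHOLD = 6  # a window is "bad" when it holds MORE than this many errors
BAD_WINDOW_LIMIT = 3           # bad windows required before the health check acts
WINDOW_HISTORY = 3             # number of most recent 20-second windows considered


def evaluate_windows(errors_per_window):
    """Return one (log_message, take_action) pair per 20-second window.

    log_message is True when the window exceeds the error threshold.
    take_action is True when the current window is bad and the recent
    history holds BAD_WINDOW_LIMIT bad windows (assumed interpretation).
    """
    history = deque(maxlen=WINDOW_HISTORY)
    results = []
    for count in errors_per_window:
        bad = count > FAST_PATH_ERROR_THRESHOLD
        history.append(bad)
        take_action = bad and sum(history) >= BAD_WINDOW_LIMIT
        results.append((bad, take_action))
    return results


if __name__ == "__main__":
    # One isolated bad window: a log message, but no health check action.
    print(evaluate_windows([2, 7, 0, 0]))
    # Three bad windows in a row: the third one triggers the configured action.
    print(evaluate_windows([7, 8, 9]))
```

The sketch treats a window with exactly six errors as good, because the text only says "more than six"; adjust the comparison if the platform counts the threshold value itself as a violation.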

NOTE

The state of the interpreting and reporting subsystem is configurable (enabled/disabled), as are the 
values associated with the slow- and fast-path thresholds and bad window counters. However, these 
attributes are currently accessible only under instruction from Extreme Networks TAC personnel. 
The default settings for these attributes have been found to work effectively under a broad range of 
networking conditions and should not require changes.

Health Check Messages

As stated earlier, ExtremeWare maintains five types of system health check error counters, divided into 
two categories: three slow path counters and two fast path counters.

Slow path counters:

CPU packet error—Data (control or learning) packet processed by the CPU and found to be 
corrupted (a passive health check).

CPU diagnostics error—CPU health check (an active health check)

Backplane diagnostics error—EDP diagnostics packets (an active health check)

Fast path counters:

Internal MAC checksum errors

External MAC checksum errors
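For readers who script against log output, the five counters and their two categories can be modeled as a small lookup structure. The following is a sketch only: the class, member, and field names are hypothetical, and the active/passive field is left empty for the fast path counters because this section does not classify them.

```python
from enum import Enum


class Path(Enum):
    SLOW = "slow path"
    FAST = "fast path"


class HealthCounter(Enum):
    # (label as listed in the guide, path category, kind of health check)
    CPU_PACKET_ERROR = ("CPU packet error", Path.SLOW, "passive")
    CPU_DIAGNOSTICS_ERROR = ("CPU diagnostics error", Path.SLOW, "active")
    BACKPLANE_DIAGNOSTICS_ERROR = ("Backplane diagnostics error", Path.SLOW, "active")
    # The text does not say whether the fast path counters are active or passive.
    INTERNAL_MAC_CHECKSUM_ERROR = ("Internal MAC checksum errors", Path.FAST, None)
    EXTERNAL_MAC_CHECKSUM_ERROR = ("External MAC checksum errors", Path.FAST, None)

    def __init__(self, label, path, check_kind):
        self.label = label
        self.path = path
        self.check_kind = check_kind


def counters_on(path):
    """Return the counters that belong to the given path category."""
    return [counter for counter in HealthCounter if counter.path is path]


if __name__ == "__main__":
    for path in Path:
        labels = [counter.label for counter in counters_on(path)]
        print(f"{path.value}: {labels}")
```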

Each of these system health check counters has an associated system log message type, to help focus 
attention during troubleshooting. These message types are reported in the system log according to the 
level of threat to the system. The message levels are:

Alert messages

Corrective action messages

Alert Messages

These messages are inserted into the system log when the configured default error threshold is exceeded 
within a given 20-second sampling window. When a threshold is exceeded, that window is marked as a 
“bad” window and the interpreting and reporting subsystem inserts an error message into the system 
log indicating the primary reason why the window was marked bad.

Contents: ExtremeWare Version 7.8

Page 1: ...85 Monroe Street Santa Clara California 95051 888 257 3000 http://www.extremenetworks.com Advanced System Diagnostics and Troubleshooting Guide ExtremeWare Software Version 7.8 Published May 2008 Part n...
