Express5800/A2040b, A2020b, A2010b, A1040b

Machine Check Monitoring Service 

User's Guide 

(Release 1.5) 

May 2015 

NEC Corporation 

 

© 2015 NEC Corporation 

855-900937 

Summary of Contents for Express5800/A1040b

Page 2: ...ered trademarks of Red Hat, Inc. in the United States and other countries. Intel and its logo are registered trademarks of Intel Corporation in the United States and other countries. Emulex and LightPulse are registered trademarks of Emulex Corporation. Broadcom, NetXtreme, Ethernet@Wirespeed, LiveLink, and Smart Load Balancing are trademarks of Broadcom Corporation and/or its associated company in the United S...

Page 3: ...3.1 Installation 6; 3.1.1 Installing acpi_call 6; 3.1.2 Installing capmonitor 8; 3.1.3 Installing mcemonitor 9; 3.2 Upgrade 10; 3.2.1 Upgrading acpi_call 10; 3.2.2 Upgrading capmonitor 11; 3.2.3 Upgrading mcemonitor 12; 3.3 Configuration 13; 3.3.1 capmonitor configuration file 13; 3.3.2 mcemonitor configuration file 13; 3.3.3 Disabling CMCI 14; 3.3.4 Disabling kdump restart on udev triggered by logical proces...

Page 4: ...sages output from capmonitor 24; 6.1.3 On-screen messages output from acpi_call 26; 6.1.4 Other on-screen messages 27; 6.2 syslog Messages 27; 6.3 Operation Log Messages 28; 6.3.1 Operation log messages output from mcemonitor 28; 6.3.2 Operation log messages output from capmonitor 35; 7 Restrictions and Precautions 39; 7.1 Manual Onlining CPU being Core Offlined 39; 7.2 cpuspeed Error Message Output at OS ...

Page 5: ...s spare CPU automatically (Core Online) after Core Offline completes. The Offline and Online operations are performed in cooperation with the kernel on the Linux server. Machine Check Monitoring Service is composed of firmware and software on the Linux server. The software includes mcemonitor (Machine Check Monitoring Service) and capmonitor (Capacity Monitoring Service). Note: Refer to Capacity Optimization (COPT) User's Gui...

Page 6: ...are configuration. MCE (Machine Check Exceptions): Hardware error detected by the CPU. CMC (Corrected Machine Check): Correctable error detected by the CPU. CPU socket: A single Intel Xeon processor. One CPU socket can have several cores. With Express5800/A2040, up to 4 CPU sockets can be installed in the server. CPU core: The core portion of a CPU that performs arithmetic and other processing. One or more cores can exi...

Page 7: ...ue, the Machine Check Monitoring Service degrades the CPU or memory page online (Core Offline, Page Offline). In addition, if the server uses an OS that supports the Core Online feature and a spare CPU is installed in the server, the Machine Check Monitoring Service adds the spare CPU automatically (Core Online) after Core Offline. Thus, performance deterioration can be prevented. Note: Refer to Capacity Optimization...

Page 8: ...ng Service. The functional drawing of Machine Check Monitoring Service and its associated components is shown below. [Figure 2-2: Functional drawing. Components shown: hardware (CPU, memory, fault), firmware, kernel, acpi_call, mcemonitor, capmonitor, mcemonitor log, capmonitor log, syslog.] ...

Page 9: ...fline feature. mcemonitor notifies the firmware of the result of CPU Offline. When CPU Offline succeeds and the server has a spare CPU, the spare CPU is added automatically (Core Online feature). Note: For details of Core Online, refer to Capacity Optimization (COPT) User's Guide. CPU fault information and the result of CPU Offline can be confirmed with the mcemonitor command. See 5.1 Show CPU/Memory Status for details of...

Page 10: ..._call 100 Starting acpi_call driver OK. 4. Confirm that the acpi_call RPM package of Machine Check Monitoring Service is installed correctly. The following is displayed when installation completes successfully: rpm -qa | grep acpicall mcl acpicall 2 4 3 01 2 6 32 504 23 4 el6 x86_64. 5. Check whether the acpi_call driver started normally. If the following 3 acpi_call are displayed, the acpi_call driver started normally...

Page 11: ...low-missing. Sample configuration of /etc/sysconfig/kdump: MKDUMPRD_ARGS="--allow-missing". With this configuration, the following warning may appear when the kdump service is started. This message indicates that the external module was not incorporated, and it is not a problem. WARNING: No module xxx found for kernel 2.6.32-504.23.4.el6.x86_64, continuing anyway (xxx represents the external module name) ...

Page 12: ...icall is needed by mcl capmonitor 2 4 2 12 el6 x86_64. 4. Confirm that the capmonitor RPM package of Machine Check Monitoring Service is installed correctly. The following is displayed when installation completes successfully: rpm -qa | grep capmonitor mcl capmonitor 2 4 2 12 el6 x86_64. 5. Check whether capmonitor started normally. If the following is displayed, capmonitor started normally: ps aux | grep monitor ...

Page 13: ...icall is needed by mcl mcemonitor1 2 4 2 02 el6 x86_64. 4. Confirm that the mcemonitor RPM package of Machine Check Monitoring Service is installed correctly. The following is displayed when installation completes successfully: rpm -qa | grep mcemonitor mcl mcemonitor1 2 4 2 02 el6 x86_64. 5. Check whether mcemonitor started normally. If the following is displayed, mcemonitor started normally: ps aux | grep monito...
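The three confirmation steps for acpi_call, capmonitor, and mcemonitor can be combined into a single check. The following is a minimal sketch, not part of the product: the function name missing_packages is hypothetical, and the only things taken from this guide are the grep keywords acpicall, capmonitor, and mcemonitor. It reads a package list on standard input and prints each keyword that does not appear in it.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Machine Check Monitoring Service).
# Reads a package list on stdin, e.g. the output of `rpm -qa`, and prints
# every keyword given as an argument that matches no line of that list.
missing_packages() {
    list=$(cat)                      # keep the whole list so it can be searched repeatedly
    for key in "$@"; do
        # -F: treat the keyword as a fixed string, -q: report only via exit status
        printf '%s\n' "$list" | grep -q -F -- "$key" || echo "$key"
    done
}

# Example on an installed system (prints nothing when all three are present):
#   rpm -qa | missing_packages acpicall capmonitor mcemonitor
```

An empty result only shows that the packages are registered in the RPM database. Whether the daemons are running still has to be confirmed with ps as described in step 5.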

Page 14: ...the following website: http://www.58support.nec.co.jp/global/download/index.html 4. Upgrade the acpi_call RPM package of Machine Check Monitoring Service using the rpm command: rpm -Uvh mcl acpicall 2 4 3 02 2 6 32 504 23 4 el6 x86_64 rpm Preparing 100 1 mcl acpi_call 100 Starting acpi_call driver OK. 5. Confirm that the acpi_call RPM package of Machine Check Monitoring Service is upgraded correctly. The following is ...

Page 15: ...pt nec capmonitor capmonitor Stopping capmonitor OK 1 mcl capmonitor 100 Starting capmonitor daemon OK. If capmonitor.conf was changed, the following message is displayed. The message can be safely ignored because your capmonitor.conf configuration is preserved; capmonitor.conf.rpmnew is the default capmonitor.conf file. warning: /opt/nec/capmonitor/conf/capmonitor.conf created as /opt/nec/cap...

Page 16: ...pt nec mcemonitor mcemonitor Stopping mcemonitor OK 1 mcl mcemonitor1 100 Starting mcemonitor daemon OK. If mcemonitor.conf was changed, the following message is displayed. The message can be safely ignored because your mcemonitor.conf configuration is preserved; mcemonitor.conf.rpmnew is the default mcemonitor.conf file. warning: /opt/nec/mcemonitor/conf/mcemonitor.conf created as /opt/nec/mc...

Page 17: ...te: For details of the capmonitor configuration file, refer to Capacity Optimization (COPT) User's Guide. 3.3.2 mcemonitor configuration file. The mcemonitor configuration file (/opt/nec/mcemonitor/conf/mcemonitor.conf) is used for configuration related to CPU Core Offline and Memory Page Offline. Modify this file according to the description below. mcemonitor.conf: vi /opt/nec/mcemonitor/conf/mcemonitor.conf Config file ...

Page 18: ...h notifies the operating system of the detected correctable error may cause a system panic. To change the error detection mode from interrupt mode to polling mode, add mce=no_cmci to the kernel line in /boot/efi/EFI/redhat/grub.conf. The system must be rebooted after the configuration file is modified. title Red Hat Enterprise Linux Server (2.6.32-504.23.4.el6.x86_64) root (hd0,0) kernel /vmlinuz-2 ...
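After the reboot, it can be useful to confirm that the option actually reached the running kernel. The following is a minimal sketch under two assumptions: the option is spelled mce=no_cmci, as in the Linux kernel's x86 boot options, and the running kernel's command line is read from /proc/cmdline. The function name cmc_mode is hypothetical and not part of Machine Check Monitoring Service.

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel command line by the presence of
# the mce=no_cmci option. Prints "polling" when the option is present as a
# whole word, otherwise "interrupt" (the mode this guide describes as the
# one being changed from).
cmc_mode() {
    # Pad with spaces so that only a complete, space-delimited word matches.
    case " $1 " in
        *" mce=no_cmci "*) echo "polling" ;;
        *)                 echo "interrupt" ;;
    esac
}

# Example on a running system:
#   cmc_mode "$(cat /proc/cmdline)"
```

The check only inspects the command line. It does not verify how the kernel is actually handling corrected machine checks.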

Page 19: ... capmonitor script cpu offline d. Script file name / Description / How to install script file. 03kdump.sh: Script that restarts the kdump daemon as needed so that a crash dump can be collected after Core Offline. Copy from opt nec capmonitor script to opt nec capmonitor script cpu offline d. XX.sh: User script. XX: Specify the execution order with a 2-digit decimal number; scripts run in ascending numeric order. Arbitrary character str...
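To preview the order in which scripts with a 2-digit prefix would run, the directory can be listed in sorted order. The following is a minimal sketch: the function name list_offline_scripts is hypothetical, the directory is passed in as a parameter rather than hard-coded, and it assumes that "ascending order of the 2-digit prefix" is equivalent to a plain byte-order sort of the file names. How capmonitor itself orders the scripts is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: print the names of scripts in a directory whose names
# start with two decimal digits and end in .sh, in ascending order.
list_offline_scripts() {
    dir=$1
    for f in "$dir"/[0-9][0-9]*.sh; do
        [ -f "$f" ] || continue      # skip the unexpanded pattern when nothing matches
        basename "$f"
    done | LC_ALL=C sort             # byte-order sort, so 03... comes before 10...
}

# Example (the directory argument is a placeholder for the script directory):
#   list_offline_scripts /path/to/script/directory
```

Files without a 2-digit prefix are ignored by this sketch, which matches the naming rule stated above for user scripts.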

Page 20: ...le mcemonitor and capmonitor have not been uninstalled, the following message is output and uninstallation of acpi_call fails: rpm -e mcl acpicall 2 4 3 01 2 6 32 504 23 4 el6 x86_64 error: Failed dependencies: mcl acpicall is needed by mcl capmonitor 2 4 2 12 el6 x86_64; mcl acpicall is needed by mcl mcemonitor1 2 4 2 02 el6 x86_64. 3. Confirm that the acpi_call RPM package of Machine Check Monitoring Servic...

Page 21: ...rrectly. If capmonitor is not displayed, capmonitor is stopped correctly: ps aux | grep monitor. 3.4.3 Uninstalling mcemonitor. 1. Log in to the target machine as the root user. 2. Uninstall the mcemonitor RPM package of Machine Check Monitoring Service using the rpm command: rpm -e mcl mcemonitor1 2 4 2 02 el6 x86_64 3871 opt nec mcemonitor mcemonitor Stopping mcemonitor OK Starting mcelog daemon. 3. Confirm that mcemoni...

Page 22: ...3; Tue Feb 19 21:03:39 2013 SOCKETID 0; Tue Feb 19 21:03:39 2013 APICID 14; Tue Feb 19 21:03:39 2013 MCGCAP 0x5000c20; Tue Feb 19 21:03:39 2013; Tue Feb 19 21:03:39 2013 Offlining CPU 7 due to corrected error threshold; Tue Feb 19 21:03:39 2013 Offlining CPU 22 due to corrected error threshold; Tue Feb 19 21:03:39 2013 Offlining CPU 7 succeeded; Tue Feb 19 21:03:39 2013 Offlining CPU 22 succeeded; var opt ...
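The "Offlining CPU N succeeded" entries in the log excerpt above can be extracted to get a quick list of logical CPUs that were offlined. The following is a minimal sketch: the function name offlined_cpus is hypothetical, and it relies only on the phrase "Offlining CPU N succeeded" shown in the excerpt, not on the exact timestamp layout, which is an assumption in any reconstruction of this log.

```shell
#!/bin/sh
# Hypothetical helper: read mcemonitor log text on stdin and print the
# logical CPU numbers reported as successfully offlined, one per line,
# in ascending numeric order without duplicates.
offlined_cpus() {
    sed -n 's/.*Offlining CPU \([0-9][0-9]*\) succeeded.*/\1/p' | sort -n -u
}

# Example (LOGFILE is a placeholder for the mcemonitor log path):
#   offlined_cpus < "$LOGFILE"
```

Lines that announce an offline attempt ("due to corrected error threshold") are deliberately not matched, so a CPU appears only when the log reports success.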

Page 23: ...x mcemonitor version mcemonitor client client core client page Description CPU fault information and offline state of CPU Memory page can be confirmed by mcemonitor command Option version Shows version information of mcemonitor command client Shows CPU fault information and offline state of CPU Memory page client core Shows CPU fault information and offline state of CPU client page Shows offline s...

Page 24: ...ted errors 1 total uncorrected errors 0 total CPU4 core1 corrected errors 10 total uncorrected errors 0 total CPU4 core2 corrected errors 10 total uncorrected errors 0 total CPU1 uncore corrected errors 1 total uncorrected errors 0 total Per CPU status corrected error over threshold CPU4 core1 sys devices system cpu5 offline failed sys devices system cpu15 offline CPU4 core2 sys devices system cpu...

Page 25: ...et number x corey Indicates CPU core number y corrected errors Shows number of occurrence of correctable errors x total Indicates that errors occurred x times uncorrected errors Shows number of occurrence of uncorrectable errors CPU1 uncore Fault information of CPU Uncore Per CPU status corrected error over threshold Shows result of CPU Offline sys devices system cpu5 offline failed Indicates that...

Page 26: ...emonitor. Run the command again. Run opt nec mcemonitor mcemonitor client again. 6. Cannot reopen logfile var opt nec mcemonitor mcemonitor found a continuable error. mcemonitor will continue to be run safely. Failed to reopen the log file, but mcemonitor continues operation. No action is needed. 7. Usage: mcemonitor client: display core page status; mcemonitor client core: display co...

Page 27: ...as not found. opt nec acpicall proc acpi mcecall acpi_mcecall.ko was not found, thus failed to start mcemonitor. Reinstall mcemonitor. 23. opt nec mcemonitor mcemonitor was not found. opt nec mcemonitor mcemonitor was not found, thus failed to start mcemonitor. Reinstall mcemonitor. 24. opt nec mcemonitor mcemonitorcmd was not found. opt nec mcemonitor mcemonitorcmd was not found, thus failed to start mcem...

Page 28: ...safely. Failed to reopen the log file, but capmonitor continues operation. No action is needed. 7. Usage: capmonitor client addtime: display cpu core hot add processing time. Shows usage of capmonitor. 8. capmonitor Version x x: Shows the capmonitor version. 9. out of memory: capmonitor exited due to a system error. capmonitor will be restarted by cron. An error occurred on system related fu...

Page 29: ...capmonitor. 23. opt nec capmonitor capmonitorcmd was not found. opt nec capmonitor capmonitorcmd was not found, thus failed to start capmonitor. Reinstall capmonitor. 24. var opt nec was not found. var opt nec was not found, thus failed to start capmonitor. Reinstall capmonitor. 25. Unknown capmonitor mode xx Valid daemon. Unknown mode. Only daemon mode is valid. Specify daemon for CAPMONITOR_MODE of etc rc d ...

Page 30: ...proc acpi mcecall acpi_mcecall.ko was not found. Reinstall acpi_call. 4. opt nec acpicall proc acpi capcall acpi_capcall.ko was not found. Failed to load acpi_capcall.ko because opt nec acpicall proc acpi capcall acpi_capcall.ko was not found. Reinstall acpi_call. 5. opt nec acpicall proc acpi clpcall acpi_clpcall.ko was not found. Failed to load acpi_clpcall.ko because opt nec acpicall proc acpi clpcall...

Page 31: ... of active cores exceeded the number of core license. The number of active cores exceeded the number of core licenses. Offline CPUs so that the number of CPUs becomes less than the number of licenses. 2. cpuspeed: Disabling ondemand cpu frequency scaling governor. cpuspeed is stopped. This message is output when a CPU is onlined. 3. cpuspeed: Enabling ondemand cpu frequency scaling governor. cpuspeed is started. O...

Page 32: ...atically by cron 6 Error 1016 error type mcemonitor exited due to a system error mcemonitor will be restarted by cron An error occurred on system related function and mcemonitor exited mcemonitor is restarted by cron Restart mcemonitor automatically by cron 7 Error 1017 error cause mcemonitor exited due to a system error mcemonitor will be restarted by cron An error occurred on system related func...

Page 33: ...nitor is restarted by cron Restart mcemonitor automatically by cron 20 Error 1041 error cause mcemonitor exited due to a system error mcemonitor will be restarted by cron An error occurred on system related function and mcemonitor exited mcemonitor is restarted by cron Restart mcemonitor automatically by cron 21 Error 1042 error cause mcemonitor exited due to a system error mcemonitor will be rest...

Page 34: ... related function but mcemonitor continue operation mcemonitor continues running No action is needed 32 Error 5010 error type mcemonitor exited due to a system error mcemonitor will be restarted by cron An error occurred on system related function and mcemonitor exited mcemonitor is restarted by cron Restart mcemonitor automatically by cron 33 Error 5011 error cause mcemonitor exited due to a syst...

Page 35: ...or type mcemonitor exited due to a system error mcemonitor will be restarted by cron An error occurred on system related function and mcemonitor exited mcemonitor is restarted by cron Restart mcemonitor automatically by cron 46 Error 6003 error type mcemonitor exited due to a system error mcemonitor will be restarted by cron An error occurred on system related function and mcemonitor exited mcemon...

Page 36: ...function, but mcemonitor continues operation. mcemonitor continues running. No action is needed. 57. warning: xxxx bytes ignored in each record, consider an update. mcemonitor cannot analyze mcelog due to an inconsistency in the log format. mcemonitor needs to be updated; the mce structure in the Linux kernel may have changed. Update mcemonitor: install the latest version of mcemonitor. 58. Cannot open pidfile error cause m...

Page 37: ...ontinuable error. mcemonitor will continue to be run safely. An error occurred on a system related function, but mcemonitor continues operation. mcemonitor continues running. No action is needed. 68. Offlining CPU xx due to corrected error threshold: CPU xx is offlined because corrected errors exceeded the threshold. 69. Not offlining CPU 0 because of kernel running on CPU 0: CPU 0 cannot be offlined because the kernel is run...

Page 38: ...tion and mcemonitor exited mcemonitor is restarted by cron Restart mcemonitor automatically by cron 87 Error 1051 error cause mcemonitor exited due to a system error An error occurred on system related function during stop phase mcemonitor exited Restart mcemonitor then stop mcemonitor 88 Error 1052 error cause mcemonitor exited due to a system error An error occurred on system related function du...

Page 39: ...timeout value of capmonitor conf to less than cpu hotadd online timeout value and restart capmonitor 5 Error 1104 error type capmonitor exited due to a system error capmonitor will be restarted by cron An error occurred on system related function and capmonitor terminated capmonitor is restarted by cron mcemonitor restarts automatically by cron 6 Error 1105 error cause capmonitor exited due to a s...

Page 40: ...s needed 17 Error 5103 error cause capmonitor exited due to a system error capmonitor will be restarted by cron An error occurred on system related function and capmonitor terminated capmonitor is restarted by cron mcemonitor restarts automatically by cron 18 Error 5104 error cause capmonitor exited due to a system error capmonitor will be restarted by cron An error occurred on system related func...

Page 41: ...Error 5117 error cause capmonitor exited due to a system error capmonitor will be restarted by cron An error occurred on system related function and capmonitor terminated capmonitor is restarted by cron mcemonitor restarts automatically by cron 33 Error 5118 error cause capmonitor found a continuable error capmonitor will continue to be run safely An error occurred on system related function but c...

Page 42: ...nd capmonitor terminated capmonitor is restarted by cron mcemonitor restarts automatically by cron 48 cannot set FD_CLOEXEC flag on fd error cause capmonitor exited due to a system error capmonitor will be restarted by cron An error occurred on system related function and capmonitor terminated capmonitor is restarted by cron mcemonitor restarts automatically by cron 49 poll table overflow capmonit...

Page 43: ...stem cpu cpuX online. 7.2 cpuspeed Error Message Output at OS Shutdown. An error message from cpuspeed may be displayed when the OS is shut down. However, this does not affect system operation. When correctable errors exceed the threshold value, the failed CPU core is offlined. If the OS is shut down after CPU Core Offline, the cpuspeed message is output. This indicates that the cpuspeed daemon failed to execute end proces...
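For the restriction in 7.1, the current state of a logical CPU can be read before any manual online operation is attempted. The following is a minimal sketch, assuming the standard Linux sysfs layout /sys/devices/system/cpu/cpuX/online, where 1 means online and 0 means offline. The function name cpu_online_state is hypothetical, and the sysfs root is passed as a parameter so the function can be tried against a test directory.

```shell
#!/bin/sh
# Hypothetical helper: report the state of logical CPU <n> as recorded in
# <sysfs_root>/cpu<n>/online. Prints "online", "offline", or "unknown"
# (when the file is missing or unreadable, as is common for cpu0).
cpu_online_state() {
    sysfs_root=$1                    # normally /sys/devices/system/cpu
    cpu=$2
    file="$sysfs_root/cpu$cpu/online"
    if [ ! -r "$file" ]; then
        echo "unknown"
    elif [ "$(cat "$file")" = "1" ]; then
        echo "online"
    else
        echo "offline"
    fi
}

# Example on a running system:
#   cpu_online_state /sys/devices/system/cpu 7
```

The sketch only reads the state. It never writes to the online file, so it does not itself perform the manual onlining that section 7.1 places restrictions on.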

Page 44: ... in any form without the prior written permission of NEC Corporation. Express5800/A2040b, A2020b, A2010b, A1040b Machine Check Monitoring Service User's Guide (Release 1.5). NEC Corporation, 7-1 Shiba 5-Chome, Minato-Ku, Tokyo 108-8001, Japan. TEL: 03-3454-1111 (Main phone number) ...