IBM eServer pSeries

High Performance Switch 

 

Tuning and Debug Guide 

 

Version 1.0 

April 2005

IBM Systems and Technology Group 

Cluster Performance Department 

Poughkeepsie, NY 

Summary of Contents for @server pSeries

Page 2: ... single LPAR 15
  3.7 Amount of memory available 15
  3.8 Debug settings in the AIX 5L kernel 16
  4.0 Daemon configuration 16
  4.1 RSCT daemons 16
  4.2 LoadLeveler daemons 17
  4.2.1 Reducing the number of daemons running 17
  4.2.2 Reducing daemon communication and placing daemons on a switch 17
  4.2.3 Reducing logging 17
  4.3 Settings for AIX 5L threads 18
  4.4 AIX 5L mail, spool, and sync daemons 18
  4.5 Placeme...

Page 3: ...rface 26
  5.12.3 Packets dropped because of a hardware problem on an endpoint 27
  5.12.4 Packets dropped in the switch hardware 28
  5.13 MP_INFOLEVEL 28
  5.14 LAPI_DEBUG_COMM_TIMEOUT 29
  5.15 LAPI_DEBUG_PERF 29
  5.16 AIX 5L trace for daemon activity 30
  6.0 Conclusions and summary 30
  7.0 Additional reading 30
  7.1 HPS documentation 30
  7.2 MPI documentation 31
  7.3 AIX 5L performance guides 31
  7.4 IBM Redbo...

Page 4: ...ubsystems. The second section deals with tuning AIX 5L and its components for optimal performance of the HPS system. The third section deals with tuning various system daemons in both AIX 5L and cluster environments to prevent impact on high-performance parallel applications. The final section deals with debugging performance problems on the HPS. Before debugging a performance problem in the HPS, revie...

Page 5: ... the corresponding receive has already been posted. If the receive has not been posted, the transport incurs an extra copy cost on the target because data is staged through the early arrival buffers. However, the overall time to send a small message might still be less in eager mode. Well-designed MPI applications often try to post each MPI_RECV before the message is expected, but because tasks of a par...

Page 6: ... MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT
You can improve application performance by allowing a task that is sending a message shorter than the eager limit to return the send buffer to the application before the message has reached its destination, rather than forcing the sending task to wait until the data has actually reached the receiving task and the acknowledgement has been returned. To allow i...

Page 7: ...se your application to be interrupted when new packets arrive, which could be helpful if a receiving MPI task is likely to be in the middle of a long numerical computation at the time when data from a remote blocking send arrives.
2.2 MPI-IO
The most effective use of MPI-IO is when an application takes advantage of file views and collective operations to read or write a file in which data for each t...

Page 8: ... in partitions of up to 16MB each. Each increase in the buffer that crosses a 16MB boundary allocates an additional partition. If you are running a pSeries 655 system with two HPS links, allocate two partitions (32MB) of buffer space. If you are running a p690 system with eight HPS links, set the buffer size to 128MB. If you are running in an LPAR and have a different number of links, scale the buffer siz...
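
The two sizing examples above (two links, 32MB; eight links, 128MB) suggest one 16MB partition per HPS link. The following is a minimal sketch of that scaling rule, not a definitive procedure: the per-link ratio is extrapolated from those two examples, and the function names hps_buf_mb and hps_buf_bytes are hypothetical helpers that do not come from the guide.

```shell
#!/bin/sh
# Sketch only: assumes one 16 MB partition per HPS link, extrapolated from
# the two examples in the text (2 links -> 32 MB, 8 links -> 128 MB).
PARTITION_MB=16

# hps_buf_mb LINKS: print the suggested buffer size in MB for LINKS links.
hps_buf_mb() {
    echo $(( $1 * PARTITION_MB ))
}

# hps_buf_bytes LINKS: print the same size in bytes.
hps_buf_bytes() {
    echo $(( $1 * PARTITION_MB * 1024 * 1024 ))
}

hps_buf_mb 2   # pSeries 655 with two links
hps_buf_mb 8   # p690 with eight links
```

For an LPAR with a different link count, call the helper with that count and check the result against the guidance in the full guide before applying it.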

Page 9: ...tivity might have problems identifying which interfaces are down when multiple interfaces are on the same IP subnet. The IP subsystem has several variables that impact IP performance over HPS. The following table contains recommended initial settings used for TCP/IP. For more information about these variables, see the AIX 5L manuals listed in section 7.0.
  Parameter        Setting
  sb_max           1310720
  tcp_sendspace ...

Page 10: ...centage
  10737 file pages
  0.0 compressed percentage
  0 compressed pages
  0.0 numclient percentage
If the system tends to move towards a high numperm level, here are a couple of approaches to address performance concerns. Use vmo tunables to tune page replacement. By decreasing the maxperm percentage and maxclient percentage, you can try to force page replacement to steal permanent and client pages before...

Page 11: ... 26628 20173 0 26628
  190899  work  mbuf pool                   15793 15793     0 15793
  20002   work  page table area              7858   168  7690  7858
  0       work  kernel segment               6394  3327   953  6394
  70b07   work  other kernel segments   Y    4096  4096     0  4096
  c0b0c   work  other kernel segments   Y    4096  4096     0  4096
  1b00bb  work  vmm software hat             4096  4096     0  4096
  a09aa   work  loader segment               3074     0     0  3074
Memory overhead associated with HPS communication buffers al...

Page 12: ...ry of the vmstat statistics, enter the following command:
  vmstat -l
The first part of the output reports the number of CPUs and the amount of usable physical memory.
  System Configuration: lcpu=32 mem=157696MB
  kthr  memory            page              faults         cpu          large-page
  r b   avm      fre      re pi po fr sr cy in   sy   cs   us sy id wa  alp flp
  3 1   35953577 5188383  0  0  0  0  0  0  3334 2080 176  1  0  99 0   213 7787
The output is grouped into five c...

Page 13: ...visory mode (recommended for high-performance computing applications). You can enable the application for TLPs by using the loader flag, by means of the ldedit command, or by using the environment variable at run time. The ldedit command enables the application for TLPs in the advisory mode:
  ldedit -blpdata <executable path name>
You can use -bnolpdata to turn TLPs off. The -blpdata loader flag on the ld co...

Page 14: ...l size 16777216. To change the number of windows, use the chgsni command. To set the Large Page option, use one of the following vmo commands:
  vmo -r -o v_pinshm=1 -o lgpg_size=16777216 -o lgpg_regions=<number of TLP required>
  dsh -vn <node name> "echo y | vmo -r -o v_pinshm=1 -o lgpg_size=16777216 -o lgpg_regions=<number of TLP required>"
If you use the dsh command, which is provided by CSM, you must use the echo command beca...
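
The lgpg_regions value in the commands above is a count of 16MB large pages. As a rough sketch, assuming you already know how many bytes of memory should be backed by large pages (the excerpt does not say how to determine that figure), the count can be obtained by rounding up to whole pages. The helper name lgpg_regions_for is hypothetical and is not part of AIX.

```shell
#!/bin/sh
# Sketch only: LGPG_SIZE matches the lgpg_size value in the text (16 MB).
LGPG_SIZE=16777216

# lgpg_regions_for BYTES: print how many 16 MB large pages cover BYTES,
# rounding up so that a partial page still gets a whole region.
lgpg_regions_for() {
    echo $(( ($1 + LGPG_SIZE - 1) / LGPG_SIZE ))
}

lgpg_regions_for 1073741824   # 1 GB to be backed by large pages
```

The printed count is what you would substitute for the "number of TLP required" placeholder.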

Page 15: ..._dump -r traces (discussed later) report this with the following messages:
  02 1099351956s 0831707096ns 0x000c sn_setup_if_env large page check failed lgpg_size 0x1000000 lgpg_cnt
  02 1099351956s 0831708002ns 0x000d sn_setup_if_env large page check failed num_pages 0x4 lgpg_numfrb
3.6 Memory affinity for a single LPAR
If you are running with one big LPAR containing all processors on a p690 machine, you n...

Page 16: ...changes to the kernel settings, run the following command and then reboot:
  bosboot -a
4.0 Daemon configuration
Several daemons on AIX 5L and the HPS can impact performance. These daemons run periodically to monitor the system, but can interfere with the performance of parallel applications. If there are as many MPI tasks as CPUs, then when these daemons run they must temporarily take a CPU away from a task. Th...

Page 17: ...ng jobs or POE.
On LoadL_config:
  LOCAL_CONFIG = $(tilde)/LoadL_config.local.$(hostname)
On LoadL_config.local.plainnode:
  SCHEDD_RUNS_HERE = False
On LoadL_config.local.scheddnode:
  SCHEDD_RUNS_HERE = True
On LoadL_admin, for the schedd node to make public:
  node_name.xxx.xxx.xxx: type = machine
      alias = node_name1.xxx.xxx.xxx node_name2.xxx.xxx.xxx
      schedd_host = true
4.2.2 Reducing daemon communication and placing daemons on a s...

Page 18: ...mons on a running system, use the following commands:
  stopsrc -s sendmail
  stopsrc -s qdaemon
  stopsrc -s writesrv
You can also change how often the syncd daemon for the file system runs. In the /sbin/rc.boot file, increase the number of seconds between syncd calls from the default value of 60 to something higher. Here is an example:
  nohup /usr/sbin/syncd 300 > /dev/null 2>&1 &
You also need ...

Page 19: ...r_debug setting
The driver_debug setting is used to increase the amount of information collected by the HPS device drivers. Leave this setting at its default value unless you are directed to change it by IBM service.
5.1.2 ip_trc_lvl setting
The ip_trc_lvl setting is used to change the amount of data collected by the IP driver. Leave this setting at its default value unless you are directed to change ...

Page 20: ...ical Real Memory    Maximum Memory Available
  64GB                61.5GB
  128GB               120GB
  256GB               240GB
  512GB               495GB
5.5 Deconfigured L3 cache
The p690 and p655 systems can continue running if parts of the hardware fail. However, this can lead to unexpectedly lower performance on a long-running job. One of the degradations observed has been the deconfiguration of the L3 cache. To check for this condition, run the following comma...

Page 21: ...C errors in the FNM_Recover log:
  grep -i evtsum FNM_Recov.log | grep -i crc
In general, if Service Focal Point is working properly, you should not need to check the low-level FNM logs such as the FNM_Recov file. However, for completeness, these are additional FNM logs on the HMC:
  FNM_Comm.log
  FNM_Ice.log
  FNM_Init.log
  FNM_Route.log
Another debug command you can run on the HMC is lsswtopol -n 1 -p PLANE_NUMBER F...

Page 22: ... Window ID Network ID Device Name MP_EUIDEVICE Window Instances MP_INSTANCES Striping Setup Protocols in Use MP_MSG_API Effective Libpath LIBPATH Current Directory 64 Bit Mode Threaded Library Requested Thread Scope AIXTHREAD_SCOPE Thread Stack Allocation MP_THREAD_STACKSIZE Bytes CPU Use MP_CPU_USE Adapter Use MP_ADAPTER_USE Clock Source MP_CLOCK_SOURCE Priority Class MP_PRIORITY Connection Timeo...

Page 23: ... Copy Size (MP_COPY_SEND_BUF_SIZE)
  User Script Name (MP_PRINTENV)
  Size of User Script Output
5.11 MP_STATISTICS
If MP_STATISTICS is set to yes, statistics are collected. However, these statistics are written only when a call is made to mp_statistics_write, which takes a pointer to a file descriptor as its sole argument. These statistics can be zeroed out with a call to mp_statistics_zero. This can be used w...

Page 24: ...times dropped at one of the endpoints of the packet transfer. In this case, you should be able to run AIX 5L commands to see some evidence on the endpoint that dropped the packet. For example, run /usr/sni/sni.snap -l <adapter_number> to get the correct endpoint data. This is best taken both before and after re-creating the problem. The sni.snap command creates a new archive in /var/adm/sni/snaps. For example: /usr/sni...

Page 25: ...12756be 19355326
  ndd_recvintr_msw  0x00000000  0
  ndd_recvintr_lsw  0x00000000  0
  ndd_ierrors       0x00000000  0
  ndd_opackets_msw  0x00000000  0
  ndd_opackets_lsw  0x00022595  140693
  ndd_obytes_msw    0x00000000  0
  ndd_obytes_lsw    0x01172099  18292889
  ndd_xmitintr_msw  0x00000000  0
  ndd_xmitintr_lsw  0x00000000  0
  ndd_oerrors       0x00000000  0
  ndd_nobufs        0x00000000  0
  ndd_xmitque_max   0x00000000  0
  ndd_xmitque_ovf   0x00000000  0
  ndd...
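
Several counters in the dump above appear as _msw/_lsw pairs. A reasonable reading, though the excerpt does not state it, is that these are the most and least significant 32-bit words of one 64-bit counter. Under that assumption, the hypothetical helper below (ndd_combine is not an AIX command) rebuilds the full value. It also assumes that your shell evaluates arithmetic in at least 64 bits.

```shell
#!/bin/sh
# Sketch only: assumes *_msw / *_lsw are the upper and lower 32-bit words
# of one 64-bit counter, which the excerpt does not state explicitly.

# ndd_combine MSW LSW: print the combined counter in decimal.
# Both arguments are hex strings with a 0x prefix, as printed in the dump.
ndd_combine() {
    echo $(( ($1 << 32) | $2 ))
}

ndd_combine 0x00000000 0x01172099   # ndd_obytes pair -> 18292889
```

With the most significant word at zero, the result matches the decimal value that the dump already prints next to ndd_obytes_lsw. The helper matters only once a counter grows past 32 bits.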

Page 26: ...P BROADCAST RUNNING SIMPLEX NOECHO BPF IFBUFMGT CANTCHANGE init 00000000 output 075AE498 start 00000000 done 00000000 ioctl 075AE4B0 reset 00000000 watchdog 00000000 ipackets 00026549 ierrors 00000000 opackets 00022778 oerrors 00000000 collisions 00000000 next 075AF2C8 type 00000038 addrlen 00000000 hdrlen 00000018 index 00000003 ibytes 0127BA9E obytes 0117B005 imcasts 00000000 omcasts 00000000 iq...

Page 27: ...re problem on an endpoint
To check for dropped packets at the HMC, check /var/adm/sni/sni_errpt_capture. Each hardware event has an entry. If you don't have the register mappings for error bits, check whether the errors are recoverable (non-MP-Fatal) or MP-Fatal. MP-Fatal errors take longer to recover from and could be associated with more drops. The following is an example of a Recoverable/Non-MP-Fatal entr...

Page 28: ...se settings
  INFO: 0031-364 Contacting LoadLeveler to set and query information for interactive job
  INFO: 0031-380 LoadLeveler step ID is test_mach1.customer.com.2507.0
  INFO: 0031-118 Host test_mach1.customer.com requested for task 0
  INFO: 0031-118 Host test_mach2.customer.com requested for task 1
  INFO: 0031-119 Host test_mach1.customer.com allocated for task 0
  INFO: 0031-120 Host address 10.10.10.1 allo...

Page 29: ...LAPI_DEBUG_PERF flag to yes:
  export LAPI_DEBUG_PERF=yes
The following additional information is sent to standard error in the job output:
  _retransmit_pkt_cnt
  Tot_retrans_pkt_cnt
  LAPI Tot_retrans_pkt_cnt
  Shared Tot_retrans_pkt_cnt
Be aware that some retransmissions in the initialization stage are normal. Here is a simple Perl script (count_drops) to count the number of lost packets. When LAPI_DEBUG_PERF ...

Page 30: ...start pprof cpu
You will find all these files in the PWD at the time you run it.
  tprof -c -A all -x sleep XX
XX is the time for your trace. Look at sleep.prof; you will find this file in the PWD at the time you run it.
6.0 Conclusions and summary
Peak performance of HPS systems depends on properly tuning the HPS and on correctly setting application shell variables and AIX 5L tunables. Because there are ma...

Page 31: ...4-01
  Parallel Environment for AIX 5L V4.1.1 MPI Programming Guide, SA22-7945-01
  Parallel Environment for AIX 5L V4.1.1 MPI Subroutine Reference, SA22-7946-01
7.3 AIX 5L performance guides
  AIX 5L Version 5.2 Performance Management Guide, SC23-4876-00
  AIX 5L Version 5.2 Performance Tools Guide and Reference, SC23-4859-03
7.4 IBM Redbooks
  AIX 5L Performance Tools Handbook, SG24-6039-01
7.5 POWER4
  POWER4 P...

Page 32: ...tries, or both. A full list of U.S. trademarks owned by IBM may be found at http://www.ibm.com/legal/copytrade.shtml. Other company, product, and service names may be trademarks or service marks of others. IBM hardware products are manufactured from new parts, or new and used parts. Regardless, our warranty terms apply. Copying or downloading the images contained in this document is expressly prohibited withou...