
POWER7 and POWER7+ Optimization and Tuning Guide

Although the PAM document is the official source for any SAP-related platform support, the SAP Community Network topic, Supported Platforms/PARS for SAP Business Objects (http://www.sdn.sap.com/irj/boc/articles?rid=/webcontent/uuid/e01d4e05-6ea5-2c10-1bb6-a8904ca76411), provides a good overview of SAP BusinessObjects releases and supported platforms.

Sizing for optimum performance

Adequate system sizing is essential for successfully running an SBOP BI solution. Therefore, SAP includes SBOP BI in its Quick Sizer tool. Based on the number of users who plan to use the SBOP BI applications, the Quick Sizer tool provides an SAP Application Performance Standard (SAPS) number and the memory that is necessary to run the applications. IBM can then provide the correct system configuration based on these numbers. The Quick Sizer tool is available at http://service.sap.com/quicksizer (registration required).
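
The selection step that follows a Quick Sizer run can be sketched as below. This is a minimal illustration only: the `Config` type, the `pick_config` function, the catalog entries, their SAPS ratings and memory sizes, and the `headroom` factor are all hypothetical placeholders, not Quick Sizer output or IBM sizing data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """A candidate system configuration (all values are hypothetical)."""
    name: str
    saps: int        # rated SAPS capacity of the configuration
    memory_gb: int   # installed memory in GB


def pick_config(candidates: Sequence[Config],
                required_saps: int,
                required_memory_gb: int,
                headroom: float = 1.0) -> Optional[Config]:
    """Return the smallest candidate that meets both requirements.

    `headroom` scales the SAPS requirement (1.2 means 20% spare capacity).
    Returns None when no candidate is large enough.
    """
    target_saps = required_saps * headroom
    fitting = [c for c in candidates
               if c.saps >= target_saps and c.memory_gb >= required_memory_gb]
    # "Smallest" here means lowest SAPS rating, then lowest memory.
    return min(fitting, key=lambda c: (c.saps, c.memory_gb), default=None)


# Placeholder catalog; real ratings come from the vendor, not from this sketch.
CATALOG = [
    Config("small", saps=10_000, memory_gb=64),
    Config("medium", saps=25_000, memory_gb=128),
    Config("large", saps=60_000, memory_gb=512),
]

if __name__ == "__main__":
    choice = pick_config(CATALOG, required_saps=18_000, required_memory_gb=96,
                         headroom=1.2)
    print(choice.name if choice else "no configuration fits")
```

The point of the sketch is that both Quick Sizer outputs act as independent constraints: a configuration with enough SAPS but too little memory (or the reverse) is rejected.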

Also, the SAP BusinessObjects BI 4 - Companion Guide is available on the SAP Quick Sizer landing page at http://service.sap.com/quicksizer (registration required). To download the guide, open the web page and click Sizing Guidelines → Solutions & Platforms → SAP BusinessObjects. This guide explains the sizing process in detail using a sizing example.

Landscape design

For general considerations about how to design an SBOP BI landscape, see the Master Guide (registration required), available at:

http://service.sap.com/~sapidb/011000358700001237052010E/xi4_master_en.pdf

Six reference architecture landscapes are available. They demonstrate how to implement an SBOP BI solution, depending on the type of enterprise resource planning (ERP) and data warehouse solution that you are using.

The architecture documents are in the Getting Started with SAP BusinessObjects BI Solutions topic on the SAP Community Network (http://scn.sap.com/docs/DOC-27193) in the Implementation and Upgrade section.

Summary of Contents for Power System POWER7 Series

Page 1: ...John MacMillan Sudhir Maddali K Madhusudanan Bruce Mealey Steve Munroe Francis P O Connell Sergio Reyes Raul Silvera Randy Swanberg Brian Twichell Brian F Veale Julian Wang Yaakov Yaari Discover simp...

Page 3: ...International Technical Support Organization POWER7 and POWER7+ Optimization and Tuning Guide November 2012 SG24-8079-00...

Page 4: ...cted by GSA ADP Schedule Contract with IBM Corp. First Edition (November 2012). This edition pertains to Power Systems servers based on POWER7 and POWER7+ processor-based technology. Specific software level...

Page 5: ...POWER7 features 25 2 3 1 Page sizes 4 KB 64 KB 16 MB and 16 GB 25 2 3 2 Cache sharing 29 2 3 3 SMT priorities 35 2 3 4 Storage synchronization sync lwsync lwarx stwcx and eieio 37 2 3 5 Vector Scalar...

Page 6: ...6 2 1 Common prerequisites 109 6 2 2 XL compiler family 110 6 2 3 GCC compiler family 112 6 3 IBM Feedback Directed Program Restructuring 114 6 3 1 Introduction 114 6 3 2 FDPR supported environments...

Page 7: ...performance tooling 143 8 6 1 High level investigation 143 8 6 2 Low level investigation 144 8 7 Conclusion 144 8 8 Related publications 144 Chapter 9 WebSphere Application Server 147 9 1 IBM WebSpher...

Page 8: ...Oracle 11gR2 preferred practices for AIX V6 1 and AIX V7 1 on Power Systems 188 Migrating Sybase ASE to POWER7 193 Implementing Sybase IQ to POWER7 195 Environment variables 195 Special consideration...

Page 9: ...o not in any manner serve as an endorsement of those websites The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk IBM may u...

Page 10: ...M Micro Partitioning Power Architecture POWER Hypervisor Power Systems Power Systems Software POWER6 POWER7 Systems POWER7 POWER7 PowerLinux PowerPC PowerVM POWER PureSystems Rational Redbooks Redbook...

Page 11: ...who are responsible for performing migration and implementation activities on IBM POWER7 based servers which includes system administrators system architects network administrators information archite...

Page 12: ...th IBM STG Systems Solution Development Her responsibilities include end to end system and software design performance from application middleware and operating system to hardware Her fields of expert...

Page 13: ...as written extensively about IBM POWER performance and Java performance Francis P O Connell is a member of the IBM Systems and Technology Group in Austin Texas He is a Senior Technical Staff Member in...

Page 14: ...in Computer Science from the University of Oklahoma where he was a Graduate Assistance in Areas of National Need GAANN Fellow and a lecturer teaching courses in Computer Science and Electrical and Co...

Page 15: ...Sanjay Ruprell IT Specialist Foster City California Bruce P Semple STG Power and z Systems Architecture Gaithersburg Maryland Maneesh Sharma POWER Sales Representative Englewood Cliffs New Jersey Wol...

Page 16: ...o IBM Corporation International Technical Support Organization Dept HYTD Mail Station P099 2455 South Road Poughkeepsie NY 12601 5400 Stay connected to IBM Redbooks Find us on Facebook http www facebo...

Page 17: ...nd tuning on IBM POWER7 and IBM POWER7 This chapter describes the optimization and tuning of the IBM POWER7 system It covers the the following topics Introduction Outline of this guide Conventions tha...

Page 18: ...given for the POWER7 and POWER7 processors the general guidance is applicable to the IBM POWER6 POWER5 and even to earlier processors This guide is directed at personnel who are responsible for perfo...

Page 19: ...provide full support and usage of POWER7 technologies and systems Linux is based on community efforts that are focused not only on the Linux kernel but also all of the complementary packages tools too...

Page 20: ...esign are making it more important than ever to consider analyzing and working to improve application performance In the past two of the ways in which newer processor chips delivered higher performanc...

Page 21: ...ories 1 Lightweight tuning and optimization guidelines Lightweight tuning covers simple prescriptive steps for tuning application performance on POWER7 These simple steps can be carried out without de...

Page 22: ...cs that are shared among processor chips in the same family each generation of processor chip has unique performance characteristics Optimizing code for POWER7 requires that you set up a test bed on a...

Page 23: ...ormance testing initially should be done in a non virtualized environment to minimize the factors that affect performance Ensure that the LPAR is running an up to date version of the operating system...

Page 24: ...t link executable image such as one produced by static compilers and applies additional optimizations FDPR is another tool that can be considered for optimizing applications that are based on an execu...

Page 25: ...Not a Number NaN signed zeros infinities floating point expression reorganization or setting the errno variable The new Ofast GCC option includes O3 and ffast math and might include other options in...

Page 26: ...ery large Java heap is being used On Power Systems the Xcodecache option often delivers a small improvement in performance especially in a large Java application This option specifies the size of each...

Page 27: ...the pool front end Larger allocations are handled with good scalability by the multiheap malloc A simple example of specifying the pool and multiheap combination is by using the environment variable s...

Page 28: ...64k bdatapsize 64k bstackpsize 64k executable 3 Using linker options at build time cc btextpsize 64k bdatapsize 64k bstackpsize 64k ld btextpsize 64k bdatapsize 64k bstackpsize 64k All of these mechan...

Page 29: ...ployment guidelines This section discusses deployment guidelines as they relate to virtualized and non virtualized environments and the effect of partition size and affinity on deployments Virtualized...

Page 30: ...apter 2 The POWER7 processor on page 21 This short description provides some background to help understand two important performance issues that are known as affinity effects Cache affinity The hardwa...

Page 31: ...ions are started first It is a preferred practice to start higher priority partitions first so that there is a better opportunity for them to obtain good affinity characteristics in their core and mem...

Page 32: ...n about ASO the MEMORY_AFFINITY environment variable the execrset command and related environment variables and commands see Chapter 4 AIX on page 67 The same forced affinity can be established on Lin...

Page 33: ...va methods are compiled to binary code With considerably small partitions there might be a long warm up period before reaching steady state performance where a 0 05 LPAR cannot get additional cycles f...

Page 34: ...8 routines This situation is associated with a locking issue This locking might ultimately arise at the system level as seen with malloc locking issues on AIX or at the application level in Java code...

Page 35: ...lts with a GUI The GUI presents information about thread states and has powerful features to drill down to see call chains The WAIT tool results combine many of the features of a time based profile a...

Page 36: ...20 POWER7 and POWER7 Optimization and Tuning Guide...

Page 37: ...R7 processor This chapter introduces the POWER7 processor and describes some of the technical details and features of this product It covers the the following topics Introduction to the POWER7 process...

Page 38: ...e particular POWER7 system Figure 2 1 The POWER7 processor chip Each core is a 64 bit implementation of the IBM Power ISA Version 2 06 Revision B and has the following features Multi threaded design c...

Page 39: ...e There are no new instructions in POWER7 over POWER7 The differences in POWER7 are Manufactured with 32 nm technology A 10 MB L3 cache per core On chip encryption accelerators On chip compression acc...

Page 40: ...e and thread boundaries can improve application scaling Details about operating system binding facilities are available in 4 1 AIX and system libraries on page 68 and include Affinity topology binding...

Page 41: ...tion Lookaside Buffer TLB Misses A single large page that is being constantly referenced remains in memory This feature eliminates the possibility of several small pages often being swapped out Unhind...

Page 42: ...y based on processor chip type AIX V6 1 supports segments with two page sizes 4 KB and 64 KB By default processes use these variable page size segments This configuration is overridden by the existing...

Page 43: ...4 KB page size The PSPA can be set for the whole system by using the vmm_default_pspa vmo tunable or for a specific process by using the vm_pattr system call 8 In addition to 4 KB and 64 KB page sizes...

Page 44: ...4K bstackpsize 64K sub1 o sub2 o The ldedit command can be used to set these page size options in an existing executable command ldedit btextpsize 4K bdatapsize 64K bstackpsize 64K mpsize out We can s...

Page 45: ...ou can use to scale the hardware to many nodes of processor chips and memory One advantage is that systems can be used for multiple workloads and workloads that are large However these characteristics...

Page 46: ...a condition to be satisfied and then being resumed on a different core Any application data that is in the cache local to the original core is no longer in the local cache because the application thre...

Page 47: ...mory Table 2 6 shows the cache sizes and related geometry information for POWER7 Figure 2 2 POWER7 chip and local memory17 Table 2 6 POWER7 storage hierarchy18 17 Ibid Cache POWER7 POWER7 L1 i cache C...

Page 48: ...ur CPU power from invisibly going down the drain 20 it is also important to carefully assess the impact of this strategy especially when applied to systems where there are a high number of CPU cores a...

Page 49: ...y done by the POWER7 hardware and is configurable as described in 2 3 7 Data prefetching using d cache instructions and the Data Streams Control Register DSCR on page 46 Alignment of data Processors a...

Page 50: ...rning data can end up being routed through busses that connect multiple chips and memory which have particular bandwidth and latency characteristics The goal for scaling across multiple cores then is...

Page 51: ...read_set_smt_priority system call in AIX The result can be boosted performance for the sibling SMT threads on the same processor core Concepts and benefits The POWER processor architecture uses SMT to...

Page 52: ...ified the SMT thread priority see APIs on page 37 Where to use SMT thread priority can be used to improve the performance of a workload by lowering the SMT thread priority that is being used on an SMT...

Page 53: ...iate data consistency because of their inherent heavyweight nature Concepts and benefits The Power Architecture defines a storage model that provides weak ordering of storage accesses The order in whi...

Page 54: ...provides an order for storage accesses caused by load store dcbz eciwx and ecowx instructions 42 Where to use Care must be taken when you use synchronization mechanisms in any processor architecture b...

Page 55: ...oint Operations conforming to the IEEE 754 Standard for Floating Point Arithmetic The introduction of VSX in to the Power Architecture increases the parallelism by providing Single Instruction Multipl...

Page 56: ...Environment of Power ISA v2 06 a white paper from Power org available at https www power org documentation whats new in the server environment of power isa v2 06 registration required FPR0 VSR0 FPR1...

Page 57: ...r data types and the size and possible values for each type Table 2 9 Vector data types Type Interpretation of content Range of values vector unsigned char 16 unsigned char 0 255 vector signed char 16...

Page 58: ...or literal or any expression that has the same vector type For example 55 vector unsigned int v1 vector unsigned int v2 vector unsigned int 10 XL only not GCC v1 v2 The number of values in a braced in...

Page 59: ...O3 For Fortran xlf qarch pwr7 qtune pwr7 O3 qhot qsimd gfortran mcpu power7 mtune power7 O3 Using Engineering and Scientific Subroutine ESSL libraries with vectorization support Select routines have v...

Page 60: ...t in performance because it is usually implemented in software IBM POWER6 and POWER7 processor based systems provide hardware support for DFP arithmetic The POWER6 and POWER7 microprocessor cores incl...

Page 61: ...e Toolchain compiler and run time The Advance Toolchain runtime libraries can also be integrated with recent XL V9 compilers for DFP exploitation The latest Advance Toolchain compiler and run times ca...

Page 62: ...your applications are using DFP There are two AIX commands that are used for monitoring hpmstat for monitoring the whole system hpmcount for monitoring a single program The PM_DFU_FIN DFU instruction...

Page 63: ...bt and dcbtst instructions provide hints about a sequence of accesses to data elements or indicate the expected use Such a sequence is called a data stream and a dcbt or dcbtst instruction in which TH...

Page 64: ...ater but can also degrade performance by expending bandwidth on cache lines that are not later referenced or by displacing cache lines that are later referenced by the program Similarly setting DPFD t...

Page 65: ...returns the values in the output buffer struct dscr_properties defined in sys machine h DSCR_SET_DEFAULT Sets a 64 bit DSCR value in a buffer pointed to by buf_p as the operating system default Retur...

Page 66: ...ause the system call writes the new value both in the process context and in the DSCR When a thread runs dcsr_ctl to change the prefetch depth for the process the new value is written into the AIX pro...

Page 67: ...haracteristic Performance can be improved by disabling hardware prefetching in these cases by running the following command dscrctl n s 1 This system partition wide disabling is only appropriate if it...

Page 68: ...r to the following sections Section 3 1 Program Priority Registers Section 3 2 or Instruction Section 4 3 4 Program Priority Register Section 4 4 3 OR Instruction Section 5 3 4 Program Priority Regist...

Page 69: ...htm splat Command found at http publib boulder ibm com infocenter aix v7r1 index jsp topic com ibm aix cmds doc aixcmds5 splat htm trace Daemon found at http publib boulder ibm com infocenter aix v7r...

Page 70: ...54 POWER7 and POWER7 Optimization and Tuning Guide...

Page 71: ...ved 55 Chapter 3 The POWER Hypervisor This chapter introduces the POWER7 Hypervisor and describes some of the technical details for this product It covers the the following topics Introduction to the...

Page 72: ...ng tool that takes virtualization impacts into consideration such as the IBM Workload Estimator to estimate capacity for each partition One of the goals of virtualization is maximizing usage This usag...

Page 73: ...inter partition communication VLANs option that is used for higher network performance Shared Ethernet versus host Ethernet Virtual disk I O Virtual small computer system interface vSCSI N_Port ID Vir...

Page 74: ...if the partition is defined to run in a specific virtual shared processor pool the number of virtual processors ought not to exceed the maximum that is defined for the specific virtual shared processo...

Page 75: ...virtual processors Matching entitlement of an LPAR close to its average usage for better performance The aggregate entitlement minimum or wanted processor capacity of all LPARs in a system is a facto...

Page 76: ...ld value of 49 the primary thread of a core is used before unfolding another virtual processor to consume another core from the shared pool on POWER7 Systems If free cores are available in the shared...

Page 77: ...ut also makes the table become sparse which results in the following situations A dense page table tends to help with better cache affinity because of reloads Less memory that is consumed by the hyper...

Page 78: ...mains that are assigned to the LPAR Setting lpar_placement 0 is the default setting and follows the existing rules when SPPL is set to MAX How to determine if an LPAR is contained within a domain From...

Page 79: ...y so that the performance impact of resource addition or deletion is minimal Planning for growth helps alleviate the fragmentation that is caused by DLPAR operations Knowing the LPARs that must grow o...

Page 80: ...y on demand as business needs for compute capacity grows Therefore a Power System might not have all resources that are licensed which poses a challenge to allocate both cores and memory from a local...

Page 81: ...le called the Dynamic Platform Optimizer This optimizer automates the manual steps to improve resource placement For more information visit the following Web site and select the Doc type Word document...

Page 82: ...Guide Virtual I O VIO and Virtualization found at http www ibm com developerworks wikis display virtualization VIO Virtualization Best Practice found at http www ibm com developerworks wikis display...

Page 83: ...This chapter describes the optimization and tuning of a POWER7 processor based server running the AIX operating system It covers the the following topics AIX and system libraries AIX Active System Opt...

Page 84: ...are Default allocator The default allocator is selected when the MALLOCTYPE environment variable is unset This setting maintains a consistent performance even in a worst case scenario but might not b...

Page 85: ...ts This suboption is similar to the built in bucket allocator of the Watson allocator However with this option you can have fine grained control over the number of buckets number of blocks per bucket...

Page 86: ...d is scalable for small allocations while multiheap ensures scalability for larger and less frequent allocations 8 If you notice high memory usage in the application process even after you run free th...

Page 87: ...For more information about this topic see 4 4 Related publications on page 94 File system performance benefits AIX Enhanced Journaled File System JFS2 is the default file system for 64 bit kernel envi...

Page 88: ...h improves performance because I O operations and applications processing can run simultaneously Many applications such as databases and file servers take advantage of the ability to overlap processin...

Page 89: ...address space The mmap subroutine provides a unique object address for each process that maps to an object The software accomplishes this task by providing each process with a unique virtual address w...

Page 90: ...point registers by the ABI the high 32 bits of all fixed point registers are treated as volatile or undefined by the ABI The 32 bit ABI preserves only 32 bit fixed point context across subroutine lin...

Page 91: ...ituation is in contrast to the manual usage of the affinity APIs documented in this section Processor affinity bindprocessor Processor affinity is the probability of dispatching of a thread to the log...

Page 92: ...plications receive a handle to the RSET The RSET handle datatype rsethandle_t in sys rset h is then used in RSET APIs to manipulate or attach the RSET Summary of RSET commands Here is a summary of the...

Page 93: ...n about an RSET rs_getrad Get resource allocation domain information from an input RSET rs_numrads Returns the number of system resource allocation domains at the specified system detail level that ha...

Page 94: ...n POWER7 Systems Enhanced Affinity extends the AIX existing memory affinity support AIX V6 1 technology level 6100 05 contains AIX Enhanced Affinity support Enhanced Affinity status is determined duri...

Page 95: ...ory by using the sra_detach function new Hybrid thread and core AIX provides facilities to customize simultaneous multi threading SMT characteristics of CPUs running within a partition The features re...

Page 96: ...o take advantage of this hybrid mode are Asymmetric workload where the performance of one thread serializes an entire workload For example one master thread dispatches work to many subordinate threads...

Page 97: ...all to thread_wait is not blocked but returns with success immediately Multiple posts to the same thread without an intervening wait by the specified thread counts only as a single post The posting re...

Page 98: ...e software package must be split into shareable and non shareable files Shareable files such as executable code and message catalogs must be installed into the shared global file systems that are read...

Page 99: ...are described in this section there are no application visible changes or awareness required AIX encrypted file system EFS Integrated with the AIX Journaled File System JFS2 is the ability to create...

Page 100: ...ptimizer ASO and Dynamic System Optimizer DSO attempt to address the optimization of both the operating system and server autonomously 4 2 1 Concepts DSO is built on the Active System Optimizer ASO fr...

Page 101: ...ystem Optimization strategies Two optimization strategies are provided with ASO Cache affinity optimization Memory affinity optimization DSO adds two more optimizations to the ASO framework Large page...

Page 102: ...affinity ASO analyzes the cache access patterns that are based on information from the kernel and the PMU to identify potential improvements in cache affinity by moving threads of workloads closer tog...

Page 103: ...optimization the workload must pass certain minimum criteria as described in this section Ideal workloads Workload characteristics for each optimization are Cache affinity optimization and memory affi...

Page 104: ...e placed normally CPU usage The CPU usage of the workload should be above 0 1 cores Workload age Workloads must be at least 10 seconds of age to be considered for cache affinity and 5 minutes of age f...

Page 105: ...Memory prefetch optimization 30 minutes 4 2 4 The asoo command The ASO framework is off by default in an AIX installation The asoo command must be used to enable the ASO framework The command syntax...

Page 106: ...erSaver mode on the HMC causes virtual processor management in a dedicated environment to be enabled forcing cache and memory affinity optimizations to be disabled Enabling active memory sharing disab...

Page 107: ...aso log This file lists major ASO events such as when it is enabled or disabled or when it hibernates It also contains a basic audit trail of optimizations that are performed to workloads Example 4 1...

Page 108: ...rred practices that are applicable to all Power Systems generations AIX preferred practices that are applicable to POWER7 POWER7 mid range and high end High Impact or Pervasive advisory 4 3 1 AIX pref...

Page 109: ...1 For more information about this topic see 4 4 Related publications on page 94 4 3 3 POWER7 mid range and high end High Impact or Pervasive advisory IBM maintains a strong focus on the quality and r...

Page 110: ...ibm com infocenter aix v7r1 topic com ibm aix baseadmn do c baseadmndita excluseprocrecset htm execrset command found at http publib boulder ibm com infocenter aix v7r1 topic com ibm aix cmds doc ai...

Page 111: ...ory index html thread_post Subroutine found at http publib boulder ibm com infocenter aix v7r1 index jsp topic com ibm aix basetechref doc basetrf2 thread_post htm thread_post_many Subroutine found at...

Page 112: ...96 POWER7 and POWER7 Optimization and Tuning Guide...

Page 113: ...7 Chapter 5 Linux This chapter describes the optimization and tuning of the POWER7 processor based server running the Linux operating system It covers the following topics Linux and system libraries L...

Page 114: ...s are enabled to run on small Power Micro Partitioning partitions through the broad range of IBM Power offerings from low cost PowerLinux servers and Flex System nodes up through the largest IBM Power...

Page 115: ...cpu and mtune compiler flags might be the best option For example mcpu power7 allows the compiler to use all the new instructions such as the Vector Scalar Extended category The mcpu power7 option als...

Page 116: ...instructions or POWER6 no vector double instructions machines You can optimize all three Power platforms if you build and install your application and libraries correctly by completing the following...

Page 117: ...e best used with the libhugetlbfs API Large segments can be used to back shared memory malloc storage and main program text and data segments incorporating large pages for shared library text or data...

Page 118: ...default malloc implementation uses trylock techniques to detect contentions between POSIX threads and then tries to assign each thread its own arena This action works well when the same thread frees s...

Page 119: ...e html Massif a heap profiler available at http valgrind org docs manual ms manual html For more details about memory management tools see Empirical performance analysis using the IBM SDK for PowerLin...

Page 120: ...commands export TCMALLOC_MEMFS_MALLOC_PATH libhugetlbfs export HUGETLB_ELFMAP RW export HUGETLB_MORECORE yes Where TCMALLOC_MEMFS_MALLOC_PATH libhugetlbfs defines the libhugetlbfs mount point HUGETLB...

Page 121: ...uild and improves overall performance Previously the TOC mfull toc defaulted to a single instruction access form that restricts the total size of the TOC to 64 KB This configuration can cause large pr...

Page 122: ...ific logical processors The setaffinity API allows processes and threads to have affinity to specific logical processors The number and numbering of logical processors is a product of the number of pr...

Page 123: ...or C C and Fortran This chapter describes the optimization and tuning of the POWER7 processor based server using compilers and tools It covers the following topics Compiler versions and optimization l...

Page 124: ...dvantage of the more advanced compiler optimization For numerical or compute intensive codes the XL compiler options O3 or qhot O3 enable loop transformations which improve program performance by rest...

Page 125: ...analysis and transformations improve runtime performance by changing the translation of the program source into assembly code Changes in these translations might cause the application to behave differ...

Page 126: ...This situation can be useful for older code that is written without following these rules The options to request this optimization are qalias noansi for C C and qalias nostd for Fortran High order tra...

Page 127: ...nk step can increase significantly Optimization that is based on Profile Directed Feedback Profile based optimization allows the compiler to collect information about the program behavior and use that...

Page 128: ...ions include fpeel loops funroll loops ftree vectorize fvect cost model mcmodel medium Specifying the mveclibabi mass option and linking to the MASS libraries enables more loops for ftree vectorize Th...

Page 129: ...ormation in the resulting object file Then at application link time the linker can collect all the objects with additional information and pass them back to the compiler GCC for whole program IPA and...

Page 130: ...the executable binary file of a program by collecting information about the behavior of the program while the program is used for a typical workload and then creates a new version of the program that...

Page 131: ...n optimized version of the input 6 3 2 FDPR supported environments FDPR is available on the following platforms AIX and Power Systems Part of the AIX 5L V5 operating system and higher for both 32 bit...

Page 132: ...roof Run man fdpr for more information about this wrapper Special input and output files FDPR has a number of options that control input and output files One option that controls the input files is ig...

Page 133: ...This information might also be interspersed with warning and debugging messages Use the quiet q option to avoid progress information To limit the warning information use the warning l w l option 6 3...

Page 134: ...e or absolute location where it was created in the instrumentation step or where specified originally by fdir Use the FDPR_PROF_NAME environment variable to specify the profile file name if the profil...

Page 135: ...en the basic blocks are mostly not taken This configuration makes instruction prefetching more efficient Chains are terminated when the heat that is execution count goes below a certain threshold rela...

Page 136: ...it de virtualizes the virtual method calls by calling the actual targets directly The optimized code compares the address of the function descriptor which is used for the indirect call against the add...

Page 137: ...termine the data that is contained in each register at each point in the function and whether this value is used later The function optimizations are killed regs kr A register is considered killed at...

Page 138: ...ream The most common place is following a function call in code Because the call might have modified the TOC anchor register R2 the compiler inserts a load instruction that resets R2 to its correct va...

Page 139: ...ational found at http www ibm com rational cafe community ccpp FDPR Post Link Optimization for Linux on Power found at https www ibm com developerworks mydeveloperworks groups service html communi tyv...

Page 140: ...124 POWER7 and POWER7 Optimization and Tuning Guide...

Page 141: ...describes the optimization and tuning of Java based applications that are running on a POWER7 processor based server It covers the following topics Java levels 32 bit versus 64 bit Java Memory and pa...

Page 142: ...R7 supports prefetch instructions for transient data which is needed but this data must be evacuated from the CPU caches with priority which results in more efficient usage of CPU caches and leads to...

Page 143: ...age of medium (64 KB) and large (16 MB) page sizes that are supported by the current AIX versions and POWER processors. Using medium or large pages instead of the default 4 KB page size can improve applica...

Page 144: ...ministrator can add this capability by running chuser: chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user_id>. On Linux, 1 GB of 16 MB pages are configured by running echo: echo 64 > /proc/sys/vm/nr_hu...
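
The value 64 in the Linux command above is the target memory divided by the page size. A minimal sketch of that arithmetic follows; the helper name `huge_pages_needed` is hypothetical and is not part of any tool that the guide describes.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def huge_pages_needed(total_bytes: int, page_bytes: int) -> int:
    """Return how many pages of page_bytes cover total_bytes, rounding up."""
    if total_bytes < 0 or page_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and page_bytes must be > 0")
    # Ceiling division without floating point.
    return (total_bytes + page_bytes - 1) // page_bytes


if __name__ == "__main__":
    # 1 GB backed by 16 MB pages, as in the passage above.
    print(huge_pages_needed(1 * GIB, 16 * MIB))
```

Rounding up matters when the target is not an exact multiple of the page size: one extra page is reserved rather than leaving part of the region on smaller pages.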

Page 145: ...ompiler has a cap on how much memory it can allocate at run time to store compiled code, and for most applications the default cap is more than sufficient. However, certain programs, especially those p...

Page 146: ...independent of any running JVM and persists until it is deleted. A shared cache can contain: bootstrap classes; application classes; metadata that describes the classes; ahead-of-time (AOT) compiled code. 7.4...

Page 147: ...y default, up to 25% of the heap is dedicated to the new space. The division between the new space and the old space can be controlled with the -Xmn option, which specifies the size of the new space; the re...
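
The split between new space and old space described above can be sketched as simple arithmetic. This is an illustration only: the function name `split_heap` is hypothetical, sizes are in MB, and the 25% figure is treated as the upper bound that applies when no explicit new-space size (the -Xmn value) is given.

```python
from typing import Optional, Tuple


def split_heap(heap_mb: int, xmn_mb: Optional[int] = None) -> Tuple[int, int]:
    """Return (new_space_mb, old_space_mb) for a heap of heap_mb megabytes.

    xmn_mb models an explicit -Xmn setting; when it is None, the new space
    is assumed to take its default cap of 25% of the heap.
    """
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    new_mb = heap_mb // 4 if xmn_mb is None else xmn_mb
    if not 0 < new_mb < heap_mb:
        raise ValueError("new space must be positive and smaller than the heap")
    # Whatever is not new space is left for the old space.
    return new_mb, heap_mb - new_mb


if __name__ == "__main__":
    print(split_heap(4096))        # default cap
    print(split_heap(4096, 512))   # explicit new-space size
```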

Page 148: ...ds of the application can lower the overall GC impact. If an application requires more flexibility than can be achieved with a constant-sized heap, it might be beneficial to tune the sizing parameters f...

Page 149: ...fault SMT modes. Table 7-3: SMT mode on POWER7 is dependent upon AIX and compatibility mode. Most applications benefit from SMT. However, some applications do not scale with an increased number of logical...

Page 150: ...T threads that belong to one core. Create an RSET with eight logical CPUs by selecting eight SMT threads that belong to two cores. The smtctl command can be used to determine which logical CPUs belong t...
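
Choosing the logical CPUs for such an RSET can be sketched as follows. The sketch assumes that logical CPUs are numbered contiguously per core (core 0 owns logical CPUs 0 to 3 in SMT4 mode, and so on); that numbering is an assumption for illustration, and the real mapping should be confirmed with smtctl. The function name `logical_cpus_for_cores` is hypothetical.

```python
from typing import Iterable, List


def logical_cpus_for_cores(cores: Iterable[int], smt_threads: int = 4) -> List[int]:
    """List the logical CPU IDs that belong to the given cores.

    Assumes contiguous numbering: core c owns logical CPUs
    c * smt_threads through c * smt_threads + smt_threads - 1.
    """
    if smt_threads <= 0:
        raise ValueError("smt_threads must be positive")
    return [core * smt_threads + thread
            for core in cores
            for thread in range(smt_threads)]


if __name__ == "__main__":
    # Eight logical CPUs drawn from two cores in SMT4 mode.
    print(logical_cpus_for_cores([0, 1]))
```

Selecting whole cores, rather than scattering threads across many cores, keeps the threads in the set sharing the same core-level caches.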

Page 151: ...ses, the cost of acquiring the monitor can be reduced by using the -XlockReservation option. With this option, it is assumed that the last thread to acquire the monitor is also likely to be the next threa...

Page 152: ...concurrentlevel<number> option, which specifies the ratio between the amount of heap that is allocated and the amount of heap marked. The default value is 8. The number of low-priority mark threads can be set with the...

Page 153: ...ocessor-based server running IBM DB2. It covers the following topics: DB2 and the POWER7 processor; Taking advantage of the POWER7 processor; Capitalizing on the compilers and optimization tools for POWER...

Page 154: ...on POWER7 Systems through the DB2 registry variable DB2_RESOURCE_POLICY. In general, this variable defines which operating system resources are available for DB2 databases, or assigns specific resources...

Page 155: ...er for large pages support by running vmo: vmo -r -o lgpg_size=LargePageSize -o lgpg_regions=LargePages. LargePageSize is the size in bytes of the hardware-supported large pages, and LargePages specifies th...

Page 156: ...ons. On the AIX platform, whole-program analysis (IPA) and profile-based optimizations (PDF) compiler options are used to optimize DB2 using a set of customer-representative workloads. This technique produce...

Page 157: ...k for available memory in the system when instance_memory is set to automatic. DB2 also supports the PowerVM Live Partition Mobility (LPM) feature when virtual I/O is configured. LPM allows an active data...

Page 158: ...tems requires many individual function calls, typically as many as the number of EDUs being woken up. 8.5.2 File systems: DB2 uses most of the advanced features within the AIX file systems. These features...

Page 159: ...vers configuration parameter. By default, this parameter is automatically tuned during database startup. For more information about how to monitor and tune AIO for DB2, see Best Practices for DB2 on AIX 6...

Page 160: ...le is a system profiling tool, similar in nature to tprof, that is popular on the Linux platform. OProfile uses hardware counters to provide functional-level profiling in both the kernel and user space. L...

Page 161: ...SOURCE_POLICY and DB2_LARGE_PAGE_MEM: http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=/com.ibm.db2.luw.admin.regvars.doc/doc/r0005665.html DB2 Virtualization, SG24-7805. DECFLOAT: The data typ...

Page 163: ...ght IBM Corp. 2012. All rights reserved. Chapter 9. WebSphere Application Server. This chapter describes the optimization and tuning of the POWER7 processor-based server running WebSphere Application S...

Page 164: ...rations. Table 9-1: Installation considerations. 9.1.2 Deployment: When you start the WebSphere Application Server, there is an option to bind the Java processors to specific CPU processor cores to circumv...

Page 165: ...WER hardware (POWER5 or POWER6), you might experience scalability issues, because the default SMT mode on POWER7 is SMT4, but on POWER5 and POWER6 the default is SMT and SMT2 mode. As some of these applicat...

Page 166: ...nce of WebSphere Application Server on POWER7 Systems. For an example of using the taskset and numactl commands in a Linux environment, see "Partition sizes and affinity" on page 14. More information about...

Page 167: ...pendix A. Analyzing malloc usage under AIX. This appendix describes the optimization and tuning of the POWER7 processor-based server by using the AIX malloc subroutine. It covers the following topics: Int...

Page 168: ...resented here presents a basic view. How to collect malloc usage information: To discover the distribution of allocation sizes, set the following environment variable: export MALLOCOPTIONS=buckets,bucket_...

Page 169: ...hows a sample output. Example A-2: Sample output from the malloc subroutine. (dbx) malloc The following options are enabled: Implementation Algorithm: Default Allocator (Yorktown); Malloc Log Stack Depth: 4; Stat...

Page 171: ...l performance analysis. This appendix describes the optimization and tuning of the POWER7 processor-based server from the perspective of performance tooling and empirical performance analysis. It covers...

Page 172: ...in "Expert system advisors" on page 156. The fourth advisor is part of the IBM Rational Developer for Power Systems Software. It is a component of an integrated development environment (IDE), which provides...

Page 173: ...plains why a particular topic was monitored, and provides a definition of the performance metric or setting. 2. Why is it Important? This report entry explains why the topic is relevant and how it impacts...

Page 174: ...capture from the VIOS Advisor. Figure B-2: The VIOS Advisor. Virtualization Performance Advisor: The Virtualization Performance Advisor provides guidance for various aspects of an LPAR, both dedicated and...

Page 175: ...ser in determining the best possible configuration. The LPAR Performance Advisor can be found at: https://www.ibm.com/developerworks/wikis/display/WikiPtype/PowerVM+Virtualization+performance+advisor Figu...

Page 176: ...and the user's expertise level. Figure B-4 is a snapshot of Java and WebSphere Application Server recommendations from a sample run, indicating the best JVM optimization and WebSphere Application Server...

Page 177: ...s generated by the compiler allows this data to be matched back to the original source code. XLC compilers can generate XML report files that provide information about optimizations that were performed...

Page 178: ...s. Here, the fundamental measure of performance is throughput: the number of transactions that are run over a period with an acceptable response time. Other applications are more batch-oriented, where few...

Page 179: ...flag instructs tprof to report CPU time in the number of ticks (that is, samples) instead of percentages. The -x sleep 10 argument instructs tprof to collect profiling data during the running of the sleep...

Page 180: ...76 606265 1 1 0 0 0 /usr/bin/trcstop 245976 606263 1 1 0 0 0 swapper 0 3 1 1 0 0 0 rmcd 155876 348337 1 1 0 0 0 Total 7504 5865 1637 2 0. Total Samples = 7504. Total Elapsed Time = 18.76s. Example B-3 from th...
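
Tick counts like those in the Total row above convert to percentages by dividing each count by the total number of samples. The sketch below assumes that the four figures after the total (5865, 1637, 2, 0) are the kernel, user, shared, and other columns; those labels do not appear in this excerpt, so treat them as an assumption. The function name `ticks_to_percent` is hypothetical.

```python
from typing import Dict


def ticks_to_percent(ticks: Dict[str, int], total_ticks: int) -> Dict[str, float]:
    """Convert per-category tick (sample) counts into percentages of the total."""
    if total_ticks <= 0:
        raise ValueError("total_ticks must be positive")
    return {name: 100.0 * count / total_ticks for name, count in ticks.items()}


# Figures from the Total row of the sample report; column names are assumed.
SAMPLE_TICKS = {"kernel": 5865, "user": 1637, "shared": 2, "other": 0}
TOTAL_SAMPLES = 7504

if __name__ == "__main__":
    for name, pct in ticks_to_percent(SAMPLE_TICKS, TOTAL_SAMPLES).items():
        print(f"{name:>6}: {pct:6.2f}%")
```

The four categories add up to the reported total of 7504 samples, which is a quick consistency check on the column reading.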

Page 181: ...es the system call was run during the monitoring interval. Total Time: Amount of CPU time, in milliseconds, consumed in running the system call. sys time: Percentage of overall CPU capacity that is spent in...

Page 182: ...tal Time Avg Time Min Time Max Time SVC (Address) (msec) (msec) (msec) (msec) 492 157.0663 0.3192 0.0032 0.6596 listio64(516ea40) 494 3.3656 0.0068 0.0002 0.0163 GetMultipleCompletionStatus(549a6a8) 12 0.0238 0.0...

Page 183: ...xample of a mutex report from splat is shown in Example B-7. Example B-7: Mutex report from splat. [pthread MUTEX] ADDRESS: 00000000F0154CD0 Parent Thread: 0000000000000001 creation time: 26.232305 Pid: 18396...

Page 184: ...or suspended. Percent Held: This field contains the following subfields. Real CPU: The percentage of the cumulative processor time the lock was held by a running thread. Real Elapsed: The percentage of the...

Page 185: ...ommand runs. To detect alignment issues that are handled by microcode, run hpmcount to collect data for group 38. An example is provided in Example B-8. Example B-8: Example of the results of the hpmcount...

Page 186: ...stat.htm Finding emulation issues: Over the 20-year evolution of the Power instruction set, a few instructions were removed. Instead of trapping programs that run these instructions, AIX emulates them in...

Page 187: ...cts performance data on a system-wide basis, rather than just for the execution of a command. Further documentation about hpmcount and hpmstat can be found at: http://publib.boulder.ibm.com/infocenter/aix...

Page 188: ...e. OProfile can be run directly as a command-line tool or under the IBM SDK for PowerLinux. The OProfile tools can monitor the whole system (LPAR), including all the tasks and the kernel. This action requir...

Page 189: ...n the kernel. Reserve POSIX pthread_spinlock and sched_yield for applications that have exclusive use of the system and with carefully designed thread affinity (assigning specific threads to specific co...

Page 190: ...roduct owner to provide a build using higher optimization. Alternatively, for open source library packages, you can build your own optimized binary version of those packages. Deeper empirical analysis: If...

Page 191: ...shows a nested set of twisties, starting with the event (cycles by default), then program, library, function, and source line (within function). The developer drills down by opening the twisties in the profile...

Page 192: ...esulting FDPR journal is used to drive the SCA analysis. Running FDPR and retrieving the journal can be automated by clicking Profile As > Profile with Source Code Advisor. Java (either AIX or Linux): Foc...

Page 193: ...er using graphs and other diagrams, allowing trends and totals to be easily and quickly recognized. The graphs can be used to determine the minimum and maximum heap usage, growth and shrink rates over ti...

Page 194: ...it is more common to profile Java after a warm-up period, so that JIT compilation activity has generally completed. To profile after a warm-up, start Java and wait an appropriate interval until steady s...

Page 195: ...create lock contention A new Double Num B new Double Num C new Double Num initialize calculate d C 0 doubleValue return d use the calculated values public static void initialize Initialize A and B fo...

Page 196: ...ple B-13 as ticks in the libj9jit24.so helper routine jitMonitorEntry, in the AIX pthreads library libpthreads.a, and in the AIX kernel routine _check_lock. This Java program clearly has excessive lock co...

Page 197: ...ation levels by the JIT compiler. Most of the ticks appear in the final, highly optimized version of the doWork()D method, into which the initialize()V and calculate()V methods are inlined by the JIT compil...

Page 198: ...on # Make sure we start from scratch: opcontrol --shutdown # Load the OProfile module if required and make the OProfile driver interface available: opcontrol --init # Clear out data from current session: opcontro...

Page 199: ...r routine analysis" on page 177. Thread state analysis: Multi-threaded Java applications, especially applications that are running on top of WebSphere Application Server, often have many threads that might...

Page 200: ...more information. Manuals, animated demonstrations, and sample reports are also available on the WAIT website. For more information about WAIT, go to: http://wait.researchlabs.ibm.com This site also has sampl...

Page 201: ...tions. This appendix describes the optimization and tuning of the POWER7 processor-based server running third-party applications. It covers the following topics: Migrating Oracle to POWER7; Migrating Syba...

Page 202: ...nstallation information, AIX patches, tuning, implementation suggestions, and so on. Oracle 11gR2 Readme/Release Notes: http://docs.oracle.com/cd/E11882_01/relnotes.112/e10853/toc.htm Oracle Database Release...

Page 203: ...nsf/WebIndex/WP101176 11gR2 planning and implementation. Oracle DB RAC 10gR2 on IBM AIX: Tips and Considerations, http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP101089 10gR2 planning and imp...

Page 204: ...cture and Tuning on AIX v2.20, found at: http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100883 The following sections describe memory, CPU, I/O, network, and miscellaneous settings. In addition, th...

Page 205: ...OCK_SGA = TRUE. The AIX parameters to enable pinned memory and 16 MB large pages are: vmo -p -o v_pinshm=1 (allows pinned memory and requires reboot); vmo -p -o lgpg_size=16777216 -o lgpg_regions=number_of_large_...

Page 206: ...s that are sensitive to a degree of parallelism might change behavior because of the migration to POWER7. Review the PARALLEL_MAX_SERVERS parameter after migration, but set DB_WRITER_PROCESSES to the...

Page 207: ...with no options. Redo logs: Create the logs with agblksize of 512 and mount with no options. With SETALL, use direct I/O for Redo logs. Control files: Create the files with agblksize of 512 and mount with n...

Page 208: ...p://docs.oracle.com/cd/E18283_01/server.112/e16102/asminst.htm#CHDBBIBF Network: This section outlines the minimum values that are applicable to network configurations. Kernel configurations: The following...

Page 209: ...unning iostat -Dl might indicate a need to increase queue depth. max_transfer might need to be adjusted upward, depending on the largest I/O requested by Oracle; a typical starting point for Oracle on AIX...

Page 210: ...wer Systems, available at: http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102105 The subtitle of this paper explains its purpose. It presents the case for POWER7 as the optimal platform for Sy...

Page 211: ...AD_SCOPE=S (same as AIXTHREAD_MNRATIO=1:1): Firm suggestion. Setting AIXTHREAD_SCOPE=S means that user threads created with default attributes are placed into system-wide contention scope. If a user thread...

Page 212: ...o 60 * min(iqnumbercpus, 4) + 50 * (iqnumbercpus - 4) + 2 * (numConnections + 2) + 1. On a 4-core, single-threaded system with -gm (number of connections) set to 20, that is: iqmt = 60 * 4 + 50 * (4 - 4) + 2 * (20 + 2) + 1 = 285. However, if SMT4 mo...
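
The thread-count expression above can be checked with a few lines of code. This sketch reads the flattened expression as 60 * min(cpus, 4) + 50 * (cpus - 4) + 2 * (connections + 2) + 1, a reading that reproduces the value 285 quoted for four CPUs and 20 connections. The function name `iqmt` mirrors the option name but is otherwise hypothetical, and the term 50 * (cpus - 4) is kept exactly as written, so it goes negative for fewer than four CPUs.

```python
def iqmt(num_cpus: int, num_connections: int) -> int:
    """Evaluate the thread-count expression from the passage above.

    Implemented as written: no clamping is applied to the (num_cpus - 4) term.
    """
    if num_cpus <= 0 or num_connections < 0:
        raise ValueError("num_cpus must be positive and num_connections >= 0")
    return (60 * min(num_cpus, 4)
            + 50 * (num_cpus - 4)
            + 2 * (num_connections + 2)
            + 1)


if __name__ == "__main__":
    # Four CPUs and 20 connections, the worked case in the passage.
    print(iqmt(4, 20))
```

Because the operating system reports each SMT thread as a logical CPU, the same expression grows quickly when the CPU count is taken from logical rather than physical cores.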

Page 213: ...ur results suggest that IQ 15.2 can attain dedicated-level performance with a 1:4 entitlement to virtual processor ratio, providing that the other configuration suggestions are followed. Migrating SAS t...

Page 214: ...user process. If the maximum number of processes for a single user exceeds 2000, increase the value of maxuproc to prevent SAS processes from abnormal shutdown or delay. Increase the maxuproc setting by...

Page 215: ...file systems and SAS data file systems on physically separate disks. Use multiple storage server controllers to further separate and isolate the I/O traffic between SAS temporary and data spaces. Use mu...

Page 216: ...isks in the array. For example: strip size x number of disks = stripe size. The AIX LVM stripe size that you can select from the smit lv create panel is the single strip size (not stripe). It is the size of d...
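
The strip and stripe relationship above is a single multiplication, but the two terms are easy to confuse. A minimal sketch with hypothetical values (a 64 KB strip across 8 disks) follows; the function name `stripe_size_kb` is an illustration and not part of any AIX tool.

```python
def stripe_size_kb(strip_kb: int, disks: int) -> int:
    """Full stripe size: the per-disk strip size times the number of disks."""
    if strip_kb <= 0 or disks <= 0:
        raise ValueError("strip_kb and disks must be positive")
    return strip_kb * disks


if __name__ == "__main__":
    # Hypothetical array: 64 KB written to each of 8 disks per stripe.
    print(stripe_size_kb(64, 8))
```

Since the value selected in the LVM panel is the per-disk strip, the full stripe must be derived this way when matching it against the application's I/O size.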

Page 217: ...Os that cannot be serviced in the disk queues go into the single wait queue of the dpo device. The benefit of this situation is that the dpo device provides fault-tolerant error handling. This situation...

Page 218: ...SDD DPO qdepth_enable: lsattr -El dpo displays the current value. Run datapath to change the parameters if at SDD 1.6 or greater. Otherwise, run chdev. For example: datapath set qdepth disable. Available doc...

Page 219: ...om many supported database systems, including text or multi-dimensional online analytical processing (OLAP) systems. You can publish in many different formats to many different publishing systems. Platform...

Page 220: ...Quick Sizer tool is available at http://service.sap.com/quicksizer (registration required). Also, the SAP BusinessObjects BI 4 - Companion Guide is available on the SAP Quick Sizer landing page at http://servic...

Page 224: ...ent types of code that runs under the IBM AIX and Linux operating systems, focusing on the more pervasive performance opportunities that are identified and how to capitalize on them. The technical infor...
