IBM AIX Continuous Availability Features
Octavian Lascu
Shawn Bodily
Matti Harvala
Anil K Singh
DoYoung Song
Frans Van Den Berg
ibm.com/redbooks Redpaper
International Technical Support Organization
April 2008
REDP-4367-00
Note: Before using this information and the product it supports, read the information in Notices on
page vii.
This edition applies to Version 6, Release 1, of AIX Operating System (product number 5765-G62).
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
The team that wrote this paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Business continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Disaster recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 High availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Continuous operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Continuous availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6.1 Reliability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6.2 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6.3 Serviceability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 First Failure Data Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 IBM AIX continuous availability strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM Redpaper describes the continuous availability features of IBM AIX Version 6,
Release 1. It also addresses and defines the terms Reliability, Availability, and Serviceability
(RAS) as used in an IT infrastructure. It touches on the global availability picture for an IT
environment in order to better clarify and explain how AIX can improve that availability. The
paper is intended for AIX specialists, whether customers, business partners, or IBM
personnel, who are responsible for server availability.
The team that wrote this paper
Octavian Lascu is a Project Leader associated with the ITSO, Poughkeepsie Center. He
writes extensively and teaches IBM classes worldwide on all areas of IBM System p and
Linux clusters. His areas of expertise include High Performance Computing, Blue Gene
and Clusters. Before joining the ITSO, Octavian worked in IBM Global Services Romania as a
software and hardware Services Manager. He holds a Master's Degree in Electronic
Engineering from the Polytechnical Institute in Bucharest, and is also an IBM Certified
Advanced Technical Expert in AIX/PSSP/HACMP. He has worked for IBM since 1992.
Shawn Bodily is a Certified Consulting IT Specialist with Advanced Technical Support (ATS)
Americas based in Dallas, TX, and has worked for IBM for nine years. With 12 years of
experience in AIX, his area of expertise for the last 10 years has been HACMP. Shawn has
written and presented extensively on high availability for System p. He has co-authored two
IBM Redbooks publications.
Matti Harvala is an Advisory IT Specialist in Finland, and has worked for IBM for more than
eight years. Matti has 10 years of experience supporting UNIX systems, and holds a
Bachelor of Science degree in Information Technology Engineering. His areas of expertise
include AIX systems support, AIX Advanced Power Virtualization, NIM and DS 4000 family
disk subsystems.
Anil K Singh is a Senior Software Engineer in India. He has six years of experience in testing
and development, including more than three years in AIX. He holds a Bachelor of
DoYoung Song is a Consulting IT specialist for pre-sales technical support in the IBM
Technical Sales Support group in Seoul, and has worked for IBM for 17 years. He currently
works as an IT Systems Architect. DoYoung has 15 years of experience in AIX, RS/6000,
and IBM System p. His areas of expertise include technologies and solutions on AIX and
pSeries, and System p high-end server systems, supporting IBM sales, IBM Business
Partners, and clients with IT infrastructure and pre-sales consulting.
Frans Van Den Berg is an IT Specialist in the United Kingdom, and has worked at IBM for
more than seven years. Frans has 11 years of experience working with UNIX. His areas of
expertise include a number of key operating systems, backup and recovery, security, databases and software packages, SAN, storage, and hardware.
Michael Lyons
IBM Austin
Susan Schreitmueller
IBM Dallas
Jay Kruemcke
IBM Austin
Michael S Wilcox
IBM Tulsa
Maria R Ward
IBM Austin
Saurabh Sharma
IBM Austin
Thierry Fauck
IBM France
Bernhard Buehler
IBM Germany
Donald Stence
IBM Austin
James Moody
IBM Austin
Grégoire Pichon
BULL AIX development, France
Jim Shaffer
IBM Austin, TX
Shajith Chandran
IBM India
Bruce Mealey
IBM Austin TX
Larry Brenner
IBM Austin TX
Mark Rogers
IBM Austin TX
Bruno Blanchard
IBM France
Steve Edwards
IBM UK
Brad Gough
IBM Australia
Hans Mozes
IBM Germany
The IBM Redbooks publication team for IBM AIX V6.1 Differences Guide
Rosa Fernandez
IBM France
Roman Aleksic
Zürcher Kantonalbank, Switzerland
Ismael Castillo
IBM Austin, TX
Armin Röll
IBM Germany
Nobuhiko Watanabe
IBM Japan Systems Engineering
Scott Vetter
International Technical Support Organization, Austin Center
Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you
will develop a network of contacts in IBM development labs, and increase your productivity
and marketability.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an e-mail to:
[email protected]
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Introduction
This chapter addresses some of the common concepts and defines the associated terms
used in IT infrastructure and generally referred to as Reliability, Availability, and Serviceability
(RAS). Although this paper covers the AIX V6, R1 operating system continuous availability
features, it also introduces the global availability picture for an IT environment in order to
better identify and understand what AIX brings to overall IT environment availability.
In addition, IBM virtualization technologies¹, available in the IBM System p and System i product families, enable individual servers to run dozens or even hundreds of mission-critical applications.
Today's enterprises can no longer afford planned or unplanned system outages. Even a few
minutes of application downtime can result in financial losses, eroded customer confidence,
damage to brand image, and public relations problems.
To better control and manage their IT infrastructure, enterprises have concentrated their IT
operations in large (and on demand) data centers. These data centers must be resilient
enough to handle the ups and downs of the global market, and must manage changes and
threats with consistent availability, security and privacy, both around the clock and around the
world. Most of the solutions are based on an integration of operating system clustering
software, storage, and networking.
Today's businesses require that IT systems be self-detecting, self-healing, and support 24x7x365 operations. More and more IT systems are adopting fault tolerance through techniques such as redundancy and memory failure detection and correction, to achieve a high level of reliability, availability, and serviceability.
The following sections discuss the concepts of continuous availability features in more detail.
¹ Virtualization features available for IBM System p and System i servers may depend on the type of hardware used, and may be subject to separate licensing.
1.2 Business continuity
Business continuity is implemented through a plan that follows a strategy defined according to the needs of the business. A total Business Continuity Plan has a much larger focus and includes items such as a crisis management plan, business impact analysis, human resources management, business recovery plan procedures, documentation, and so on.
Clarification:
Disaster recovery is only one component of an overall business continuity plan.
Business continuity planning forms the first level of planning before disaster recovery
comes in to the plan.
1.3 Disaster recovery
Disaster recovery (DR) is a coordinated activity to enable the recovery of IT and business
systems in the event of disaster. A DR plan covers both the hardware and software required
to run critical business applications and the associated processes, and to (functionally)
recover a complete site. The DR for IT operations employs additional equipment (in a
physically different location) and the use of automatic or manual actions and methods to
recover some or all of the affected business processes.
1.4 High availability
High availability is the ability (and associated processes) to provide access to applications regardless of hardware, software, or system management issues. This is
achieved through greatly reducing, or masking, planned downtime. Planned downtime often
includes hardware upgrades, repairs, software updates, backups, testing, and development.
High availability solutions should eliminate single points of failure (SPOFs) through
appropriate design, planning, selection of hardware, configuration of software, and carefully
controlled change management discipline. High availability is fault resilience, but not fault
tolerance.
1.5 Continuous operations
Continuous operations is an attribute of IT environments and systems which allows them to
continuously operate and mask planned outages from end users. Continuous operations
employs non-disruptive hardware, software, configuration and administrative changes.
Unplanned downtime is an unexpected outage and often is the result of administrator error,
application software failure, operating system faults, hardware faults, or environmental
disasters.
Most of today's solutions are based on an integration of the operating system with clustering
software, storage, and networking. When a failure is detected, the integrated solution will
trigger an event that will perform a predefined set of tasks required to reactivate the operating
system, storage, network, and in many cases, the application on another set of servers and
storage. This kind of functionality is defined as IT continuous availability.
1.6 Continuous availability
The main goal in protecting an IT environment is to achieve continuous availability; that is,
having no end-user observed downtime. Continuous availability is a collective term for those
characteristics of a product which make it:
Capable of performing its intended functions under stated conditions for a stated period of
time (reliability)
Ready to perform its function whenever requested (availability)
Able to quickly determine the cause of an error and to provide a solution to eliminate the
effects of the error (serviceability)
AIX continuous availability encompasses all tools and techniques implemented at the
operating system level that contribute to overall system availability. Figure 1-1 on page 5
illustrates AIX continuous availability.
Figure 1-1 Positioning AIX continuous availability (the diagram shows AIX continuous availability features within the operating system, built on the hardware, supporting high availability and continuous operation, and framed by disaster recovery, change management, security policies, and continuous availability)
1.6.1 Reliability
From a server hardware perspective, reliability is a collection of technologies (such as chipkill
memory error detection/correction, dynamic configuration and so on) that enhance system
reliability by identifying specific hardware errors and isolating the failing components.
Built-in system failure recovery methods enable cluster nodes to recover, without falling over
to a backup node, when problems have been detected by a component within a node in the
cluster. Built-in system failure recovery should be transparent and achieved without the loss
or corruption of data. It should also be much faster compared to system or application failover
recovery (failover to a backup server and recover). And because the workload does not shift
from this node to another, no other node's performance or operation should be affected.
Built-in system recovery should cover applications (monitoring and restart), disks, disk
adapters, LAN adapters, power supplies (battery backups) and fans.
From a software perspective, reliability is the capability of a program to perform its intended
functions under specified conditions for a defined period of time. Software reliability is
achieved mainly in two ways: infrequent failures (built-in software reliability), and extensive
recovery capabilities (self healing - availability).
IBM's fundamental focus on software quality is the primary driver of improvements in reducing
the rate of software failures. As for recovery features, IBM-developed operating systems have
historically mandated recovery processing in both the mainline program and in separate
recovery routines as part of basic program design.
As IBM System p systems become larger, more and more customers expect mainframe levels
of reliability. For some customers, this expectation derives from their prior experience with
mainframe systems which were downsized to UNIX servers. For others, this is simply a
consequence of having systems that support more users.
The cost associated with an outage grows every year, therefore avoiding outages becomes
increasingly important. This leads to new design requirements for all AIX-related software.
For all operating system or application errors, recovery must be attempted. When an error
occurs, it is not valid to simply give up and terminate processing. Instead, the operating
system or application must at least try to keep the component affected by the error up and
running. If that is not possible, the operating system or application should make every effort to
capture the error data and automate system restart as quickly as possible.
The amount of effort put into the recovery should, of course, be proportional to the impact of a
failure and the reasonableness of trying again. If actual recovery is not feasible, then the
impact of the error should be reduced to the minimum appropriate level.
Today, many customers require that recovery processing be subject to a time limit and have
concluded that rapid termination with quick restart or takeover by another application or
system is preferable to delayed success. However, takeover strategies rely on redundancy
that becomes more and more expensive as systems get larger, and in most cases the main
reason for quick termination is to begin a lengthy takeover process as soon as possible.
Thus, the focus is now shifting back towards core reliability, and that means quality and
recovery features.
1.6.2 Availability
Today's systems have hot plug capabilities for many subcomponents, from processors to
input/output cards to memory. Also, clustering techniques, reconfigurable input/output data
paths, mirrored disks, and hot swappable hardware should help to achieve a significant level
of system availability.
From a software perspective, availability is the capability of a program to perform its function
whenever it is needed. Availability is a basic customer requirement. Customers require a
stable degree of certainty, and also require that schedules and user needs are met.
Availability gauges the percentage of time a system or program can be used by the customer
for productive use. Availability is determined by the number of interruptions and the duration
of the interruptions, and depends on characteristics and capabilities which include:
The ability to change program or operating system parameters without rebuilding the
kernel and restarting the system
The ability to configure new devices without restarting the system
The ability to install new software or update existing software without restarting the system
The ability to monitor system resources and programs and cleanup or recover resources
when failures occur
The ability to maintain data integrity in spite of errors
The AIX operating system includes many availability characteristics and capabilities from
which your overall environment will benefit.
1.6.3 Serviceability
Focus on serviceability is shifting from providing customer support remotely through
conventional methods, such as phone and e-mail, to automated system problem reporting
and correction, without user (or system administrator) intervention.
Hot swapping capabilities of some hardware components enhance the serviceability aspect. A service processor with advanced diagnostic and administrative tools further enhances system serviceability. A System p server's service processor can call home with a service report, providing detailed information for IBM service to act upon. This automation not only speeds problem reporting, but also reduces the need for manual intervention by the user or system administrator.
On the software side, serviceability is the ability to diagnose and correct or recover from an
error when it occurs. The most significant serviceability capabilities and enablers in AIX are
referred to as the software service aids. The primary software service aids are error logging,
system dump, and tracing.
With the advent of next-generation UNIX servers from IBM, many hardware reliability, availability, and serviceability features, such as memory error detection, LPARs, and hardware sensors, have been implemented. These features are supported by the relevant software in AIX. These abilities continue to establish AIX as the best UNIX operating system.
1.7 First Failure Data Capture
Integrated hardware error detection and fault isolation is a key component of the System p
and System i platform design strategy. It is for this reason that in 1997, IBM introduced First
Failure Data Capture (FFDC) for IBM POWER servers. FFDC plays a critical role in
delivering servers that can self-diagnose and self-heal. The system effectively traps hardware
errors at system run time.
FFDC is a technique which ensures that when a fault is detected in a system (through error
checkers or other types of detection methods), the root cause of the fault will be captured
without the need to recreate the problem or run any sort of extended tracing or diagnostics
program. For the vast majority of faults, an effective FFDC design means that the root cause
can also be detected automatically without servicer intervention. The pertinent error data
related to the fault is captured and saved for further analysis.
In hardware, FFDC data is collected in fault isolation registers based on the first event that
had occurred. FFDC check stations are carefully positioned within the server logic and data
paths to ensure that potential errors can be quickly identified and accurately tracked to an
individual Field Replaceable Unit (FRU).
This proactive diagnostic strategy is a significant improvement over less accurate reboot and
diagnose service approaches. Using projections based on IBM internal tracking information,
it is possible to predict that high impact outages would occur two to three times more
frequently without an FFDC capability.
In fact, without some type of pervasive method for problem diagnosis, even simple problems
that occur intermittently can cause serious and prolonged outages. By using this proactive
diagnostic approach, IBM no longer has to rely on an intermittent reboot and retry error
detection strategy, but instead knows with some certainty which part is having problems.
This architecture is also the basis for IBM predictive failure analysis, because the Service
Processor can now count and log intermittent component errors and can deallocate or take
other corrective actions when an error threshold is reached.
IBM has tried to enhance FFDC features such that in most cases, failures in AIX will not result
in recreate requests, also known as Second Failure Data Capture (SFDC), from AIX support
to customers in order to solve the problem. In AIX, this service functionality focuses on
gathering sufficient information upon a failure to allow for complete diagnosis without
requiring failure reproduction. For example, Lightweight Memory Trace (LMT) support
introduced with AIX V5.3 ML3 represents a significant advance in AIX first failure data capture
capabilities, and provides service personnel with a powerful and valuable tool for diagnosing
problems.
The Run-Time Error Checking (RTEC) facility provides service personnel with a method to
manipulate debug capabilities that are already built into product binaries. RTEC provides
service personnel with powerful first failure data capture and second failure data capture
(SFDC) error detection features. This SFDC service functionality focuses on tools to enhance
serviceability data gathering after an initial failure. The basic RTEC framework has been
introduced in AIX V5.3 TL3, and extended with additional features in subsequent AIX
releases.
1.8 IBM AIX continuous availability strategies
IBM has made AIX robust with respect to continuous availability characteristics, and this robustness makes IBM UNIX servers the best in the market. IBM's AIX continuous availability strategy has the following characteristics:
Reduce the frequency and severity of AIX system outages, planned and unplanned
Improve serviceability by enhancing AIX failure data capture tools
Provide enhancements to debug and problem analysis tools
Ensure that all necessary information involving unplanned outages is provided, to correct
the problem with minimal customer effort
Use of mainframe hardware features for operating system continuous availability brought
to System p hardware
Provide key error detection capabilities through hardware-assist
Exploit other System p hardware aspects to continue transition to stay-up designs
Use of stay-up designs for continuous availability
Maintain operating system availability in the face of errors while minimizing application
impacts
Use of sophisticated and granular operating system error detection and recovery
capabilities
Maintain a strong tie between serviceability and availability
Provide problem diagnosis from data captured at first failure without the need for further
disruption
Provide service aids that are non-disruptive to the customer environment
This paper explores and explains the continuous availability features and enhancements available in AIX V5.3, as well as the new features in AIX V6.1. The goal is to provide a concise and precise description of all the enhancements (both those directly visible to users and those that are not), complete with working command scenarios and the background information required to understand each topic.
The paper is intended for AIX specialists, whether customers, business partners, or IBM
personnel, who are responsible for server availability.
Chapter 2. AIX continuous availability features
Today's IT industries can no longer afford system outages, whether planned or unplanned.
Even a few minutes of application downtime can cause significant financial losses, erode
client confidence, damage brand image, and create public relations problems.
The primary role of an operating system is to manage the physical resources of a computer
system to optimize the performance of its applications. In addition, an operating system
needs to handle changes in the amount of physical resources allocated to it in a smooth
fashion and without any downtime. Endowing a computing system with this self-management
feature often translates to the implementation of self-protecting, self-healing, self-optimizing,
and self-configuring facilities and features.
Customers are looking for autonomic computing: the ability of components and operating systems to adapt smoothly to changes in their environment. Some of the most prominent physical resources managed by an operating system are processors, physical memory, and I/O devices; how a system deals with the loss of any of these resources is an important factor in making a continuously available operating system. At the same time, the need to add and remove resources, and to maintain systems with little or no impact to the application or database environment (and hence the business), is another important consideration.
The original CPU Guard feature predicts the failure of a running CPU by monitoring certain
types of transient errors and dynamically takes the CPU offline, but it does not provide a
substitute CPU, so that a customer is left with less computing power. Additionally, the older
feature will not allow an SMP system to operate with fewer than two processors.
The Dynamic CPU Guard feature, introduced in AIX 5.2, is an improved and dynamic version
of the original CPU Guard that was available in earlier AIX versions. The key differences are
that it utilizes DLPAR technologies and allows the operating system to function with only one
processor. This feature, beginning with AIX 5.2, is enabled by default. Example 2-1 shows
how to check this attribute.
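As a minimal sketch of what Example 2-1 illustrates, the cpuguard attribute of the sys0 device can be queried with the lsattr command; a value of enable indicates that Dynamic CPU Guard is active:
lsattr -El sys0 -a cpuguard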
If this feature is disabled, you can enable it by executing the chdev command as follows:
chdev -l sys0 -a cpuguard=enable
The excessive interrupt disablement detection feature employs a kernel profiling approach to detect disabled code that runs for too
long. The basic idea is to take advantage of the regularly scheduled clock ticks that
generally occur every 10 milliseconds, using them to approximately measure continuously
disabled stretches of CPU time individually on each logical processor in the configuration.
This approach will alert you to partially disabled code sequences by logging one or more hits
within the offending code. It will alert you to fully disabled code sequences by logging the
i_enable that terminates them.
You can turn excessive interrupt disablement detection off and on, respectively, by changing the error-checking state of the proc.disa RAS component:
errctrl -c proc.disa errcheckoff
errctrl -c proc.disa errcheckon
Note that the preceding commands only affect the current boot. In AIX 6.1, the -P flag is
introduced so that the setting can be changed persistently across reboots, for example:
errctrl -c proc.disa -P errcheckoff
In other cases, the POWER Hypervisor notifies the owning partition that the page should be
deallocated. Where possible, the operating system moves any data currently contained in
that memory area to another memory area and removes the pages associated with this error
from its memory map, no longer addressing these pages. The operating system performs
memory page deallocation without any user intervention and is transparent to end users and
applications.
Additional detailed information about memory page deallocation is available at the following
site:
http://www-05.ibm.com/cz/power6/files/zpravy/WhitePaper_POWER6_availabilty.PDF
The SRC was designed to minimize the need for operator intervention. It provides a mechanism to control subsystem processes by using a common command line and the C interface. This mechanism includes the following (representative commands are shown after the list):
Consistent user interface for start, stop, and status inquiries
Logging of the abnormal termination of subsystems
Notification program called at the abnormal system termination of related processes
Tracing of a subsystem, a group of subsystems, or a subserver
Support for control of operations on a remote system
Refreshing of a subsystem (such as after a configuration data change)
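As an illustration of these capabilities, the following commands show typical SRC operations; the syslogd subsystem is used here only as an example, and any subsystem reported by lssrc -a can be substituted:
lssrc -a               # list all subsystems defined to the SRC and their status
lssrc -s syslogd       # query the status of a single subsystem
stopsrc -s syslogd     # stop the subsystem
startsrc -s syslogd    # start the subsystem
refresh -s syslogd     # request the subsystem to reread its configuration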
Dynamic reconfiguration is the ability of the system to adapt to changes in the hardware and
firmware configuration while it is still running. PCI Hot Plug Support for PCI Adapters is a
specific subset of the dynamic reconfiguration function that provides the capability of adding,
removing, and replacing PCI adapter cards while the host system is running and without
interrupting other adapters in the system. You can also display information about PCI hot plug
slots.
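For example, the lsslot command can be used to display the PCI hot plug slots and the adapters that occupy them:
lsslot -c pci          # list hot plug PCI slots and their characteristics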
You can insert a new PCI hot plug adapter into an available PCI slot while the operating
system is running. This can be another adapter of the same type that is currently installed, or
of a different type of PCI adapter. New resources are made available to the operating system
and applications without having to restart the operating system. The PCI hot plug manager
interface can be accessed by executing the following command:
smitty devdrpci
Additional detailed information about PCI hot plug management is available in AIX 5L System
Management Guide: Operating System and Devices, SC23-5204, which is downloadable
from the following site:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.baseadmn/doc/baseadmndita/baseadmndita.pdf
RMC provides a single monitoring and management infrastructure for standalone servers
(single operating system image), RSCT peer domains (where the infrastructure is used by the
configuration resource manager), and management domains (where the infrastructure is
used by the Hardware Management Console (HMC) and Cluster Systems Management
(CSM)).
RMC is part of standard AIX V5 and V6 installation, and provides comprehensive monitoring
when configured and activated. The idea behind the RSCT/RMC implementation is to provide
a high availability infrastructure for managing resources in a standalone system, as well as in
a cluster (peer domain or management domain).
RMC can be configured to monitor any event that may occur on your system, and you can
provide a response program (script or binary). For example, if a particular file system is
always filling up, you can configure the RMC to raise an event when the file system grows to
a specified utilization threshold. Your response program (script) might increase the size of the
file system or archive old data, but after the user-specified response script is executed and
the condition recovers (that is, file system utilization falls below a specified reset value), then
the event is cleared and RMC returns to monitor mode.
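As a hedged sketch of such a setup (the condition name, response name, thresholds, and the /usr/local/bin/cleanup_fs.sh script are illustrative assumptions, not predefined objects), the RSCT commands mkcondition, mkresponse, and startcondresp can be combined as follows:
# Raise an event when a monitored file system exceeds 90% utilization; rearm below 85%
mkcondition -r IBM.FileSystem -e "PercentTotUsed > 90" -E "PercentTotUsed < 85" \
  -d "File system almost full" "FS space used"
# Define a response that runs a local cleanup script (hypothetical path)
mkresponse -n "Clean up file system" -s /usr/local/bin/cleanup_fs.sh "FS cleanup"
# Link the condition to the response and start monitoring
startcondresp "FS space used" "FS cleanup"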
A resource manager is a process that maps resource and resource class abstractions into
actual calls and commands for one or more specific types of resources. A resource manager
runs as a standalone daemon and contains definitions of all resource classes that the
resource manager supports. These definitions include a description of all attributes, actions,
and other characteristics of a resource class. RSCT provides a core set of resource managers for managing base resources on single systems and across clusters; these resource managers provide low-level instrumentation and control, or act as a foundation for management applications.
Topology Services and Group Services are often referred to as hats and hags, after the names of their daemons: the high availability Topology Services daemon (hatsd) and the Group Services daemon (hagsd).
While redundancy can be built into the VIOS itself with the use of standard AIX tools like
multipath I/O (MPIO) and AIX Logical Volume Manager (LVM) RAID Options for storage
devices, and Ethernet link aggregation for network devices, the Virtual I/O Server must be
available with respect to the client. Planned outages (such as software updates) and
unplanned outages (such as hardware outages) challenge 24x7 availability. In case of a
crash of the Virtual I/O Server, the client partitions will see I/O errors and not be able to
access the adapters and devices that are hosted by the Virtual I/O Server.
The Virtual I/O Server itself can be made redundant by running a second instance in another
partition. When running two instances of the Virtual I/O Server, you can use LVM mirroring,
multipath I/O, Ethernet link aggregation, or multipath routing with dead gateway detection in
the client partition to provide highly available access to virtual resources hosted in separate
Virtual I/O Server partitions. Many configurations are possible; they depend on the available
hardware resources as well as on your requirements.
For example, with the availability of MPIO on the client, each VIOS can present a virtual SCSI device that is physically connected to the same physical disk. This achieves redundancy for the VIOS itself and for any adapter, switch, or device that is used between the VIOS and the disk. With the use of logical volume mirroring on the client, each VIOS can present a virtual SCSI device that is physically connected to a different disk; the two devices are then mirrored on the client using normal AIX LVM mirroring.
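As a sketch, assuming the client partition sees the two virtual SCSI disks as hdisk0 and hdisk1 (the device names are assumptions for this example), rootvg can be mirrored across the two Virtual I/O Servers with standard AIX LVM commands:
extendvg rootvg hdisk1             # add the second virtual SCSI disk to rootvg
mirrorvg rootvg hdisk1             # mirror all logical volumes onto the new disk
bosboot -ad /dev/hdisk1            # create a boot image on the new disk
bootlist -m normal hdisk0 hdisk1   # allow booting from either disk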
As an example of network high availability, Shared Ethernet Adapter (SEA) failover offers
Ethernet redundancy to the client at the virtual level. The client gets one standard virtual
Ethernet adapter hosted by two VIO servers. The two Virtual I/O servers use a control
channel to determine which of them is supplying the Ethernet service to the client. Through
this active monitoring between the two VIOS, failure of either will result in the remaining VIOS
taking control of the Ethernet service for the client. The client has no special protocol or
software configured, and uses the virtual Ethernet adapter as though it was hosted by only
one VIOS.
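As a hedged sketch of how such a configuration is typically created on each Virtual I/O Server (the adapter names ent0, ent2, and ent3, and the default VLAN ID, are assumptions that depend on the actual partition profile), the Shared Ethernet Adapter with failover can be defined with the mkvdev command:
mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1 -attr ha_mode=auto ctl_chan=ent3
The second Virtual I/O Server is configured in the same way; the trunk priority assigned to the virtual Ethernet adapters in the partition profiles determines which VIOS acts as the primary.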
On these servers, when an uncorrectable error is identified at one of the many checkers
strategically deployed throughout the system's central electronic complex, the detecting
hardware modifies the ECC word associated with the data, creating a special ECC code. This
code indicates that an uncorrectable error has been identified at the data source and that the
data in the standard ECC word is no longer valid. The check hardware also signals the
Service Processor and identifies the source of the error. The Service Processor then takes
appropriate action to handle the error. This technique is called Special Uncorrectable Error
(SUE) handling.
Simply detecting an error does not automatically cause termination of a system or partition. In
many cases, an uncorrectable error will cause generation of a synchronous machine check
interrupt. The machine check interrupt occurs when a processor tries to load the bad data.
The firmware provides a pointer to the instruction that referred to the corrupt data, the system
continues to operate normally, and the hardware observes the use of the data.
The system is designed to mitigate the problem using a number of approaches. For example,
if the data is never actually used but is simply overwritten, then the error condition can safely
be voided and the system will continue to operate normally.
For AIX V5.2 or greater, if the data is actually referenced for use by a process, then the
operating system is informed of the error. The operating system will terminate only the
specific user process associated with the corrupt data.
Specifically, if an SUE occurs inside one of the copyin() and copyout() family of kernel
services, these functions will return an error code and allow the system to continue operating
(in contrast, on a POWER4 or POWER5 system, AIX would crash). The new SUE feature
integrates the kernel mode handling of SUEs with the FRR recovery framework.
Kernel recovery
Kernel recovery in AIX V6.1 is disabled by default. This is because the set of errors that can
be recovered is limited in AIX V6.1, and kernel recovery, when enabled, requires an extra 4 K
page of memory per thread. To enable, disable, or show kernel recovery state, use the SMIT
path Problem Determination -> Kernel Recovery, or use the smitty krecovery command.
You can show the current and next boot states, and also enable or disable the kernel
recovery framework at the next boot. In order for the change to become fully active, you must
run the /usr/sbin/bosboot command after changing the kernel recovery state, and then
reboot the operating system.
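As a minimal sketch, enabling kernel recovery persistently involves the SMIT panel followed by the standard boot image and reboot commands:
smitty krecovery                # enable the kernel recovery framework for the next boot
bosboot -ad /dev/ipldevice      # rebuild the boot image so the change takes effect
shutdown -Fr                    # reboot the operating system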
During a kernel recovery action, the system might pause for a short time, generally less than
two seconds. The following actions occur immediately after a kernel recovery action:
1. The system console displays the message saying that a kernel error recovery action has
occurred.
2. AIX adds an entry into the error log.
3. AIX may generate a live dump.
4. You can send the error log data and live dump data to IBM for service (similar to sending
data from a full system termination).
Note: Some functions might be lost after a kernel recovery, but the operating system
remains in a stable state. If necessary, shut down and restart your system to restore the
lost functions.
The basic RTEC framework is introduced in AIX V5.3 TL3, and has now been extended with
additional features. RTEC features include the Consistency Checker and Xmalloc Debug
features. Features are generally tunable with the errctrl command.
Some features also have attributes or commands specific to a given subsystem, such as the
sodebug command associated with new socket debugging capabilities. The enhanced socket
debugging facilities are described in the AIX publications, which can be found online at the
following site:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp
Kernel stack overflow problems can be difficult to service. AIX V5.3 TL5 introduces an asynchronous
run-time error checking capability to examine if certain kernel stacks have overflowed. The
default action upon overflow detection is to log an entry in the AIX error log. The stack
overflow run-time error checking feature is controlled by the ml.stack_overflow component.
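Following the same errctrl pattern shown earlier for the proc.disa component, stack overflow checking can be turned off and on for the current boot as sketched below; the checking levels supported by this component should be verified against the errctrl documentation for your AIX level:
errctrl -c ml.stack_overflow errcheckoff    # disable stack overflow run-time checking
errctrl -c ml.stack_overflow errcheckon     # re-enable stack overflow run-time checking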
AIX V6.1 improves kernel stack overflow detection so that some stacks are guarded with a
synchronous overflow detection capability. Additionally, when the recovery framework is
enabled, some kernel stack overflows that previously were fatal are now fully recoverable.
When a page is paged out, a checksum will be computed on the data in the page and saved
in a pinned array associated with the paging device. If and when it is paged back in, a new
checksum will be computed on the data that is read in from paging space and compared to
the value in the array. If the values do not match, the kernel will log an error and halt (if the
error occurred in system memory), or send an exception to the application (if it occurred in
user memory).
Paging space verification can be enabled or disabled, on a per-paging space basis, by using
the mkps and chps commands. The details of these commands can be found in their
corresponding AIX man pages.
Memory overlays and addressing errors are among the most difficult problems to diagnose
and service. The problem is compounded by growing software size and increased
complexity. Under AIX, a large global address space is shared among a variety of software
components. This creates a serviceability issue for both applications and the AIX kernel.
The AIX 64-bit kernel makes extensive use of a large flat address space by design. This is
important in order to avoid costly MMU operations on POWER processors. Although this
design does produce a significant performance advantage, it also adds reliability, availability
and serviceability (RAS) costs. Large 64-bit applications, such as DB2, use a global
address space for similar reasons and also face issues with memory overlays.
Storage keys were introduced in PowerPC architecture to provide memory isolation, while
still permitting software to maintain a flat address space. The concept was adopted from the
System z and IBM 390 systems. Storage keys allow an address space to be assigned
context-specific protection. Access to the memory regions can be limited to prevent, and
catch, illegal storage references.
A new CPU facility, Authority Mask Register (AMR), has been added to define the key set that
the CPU has access to. The AMR is implemented as bit pairs vector indexed by key number,
with distinct bits to control read and write access for each key. The key protection is in
addition to the existing page protection bits. For any load or store process, the CPU retrieves
the memory key assigned to the targeted page during the translation process. The key
number is used to select the bit pair in the AMR that defines if an access is permitted.
A data storage interrupt occurs when this check fails. The AMR is a per-context register that
can be updated efficiently. The TLB/ERAT contains storage key values for each virtual page.
This allows AMR updates to be efficient, since they do not require TLB/ERAT invalidation.
The APIs that manage hardware keys in user mode refer to the functionality as user keys.
User key support is primarily being provided as a reliability, availability and serviceability
(RAS) feature for applications. The first major application software to implement user keys is
DB2. In DB2, user keys are used for two purposes. Their primary purpose is to protect the
DB2 core from errors in user-defined functions (UDFs). The second use is as a debug tool to
prevent and diagnose internal memory overlay errors. But this functionality is available to any
application.
DB2 provides a UDF facility where customers can add extra code to the database. There are
two modes that UDFs can run under, fenced and unfenced, as explained here:
In fenced mode, UDFs are isolated from the database by execution under a separate
process. Shared memory is used to communicate between the database and UDF
process. Fenced mode does have a significant performance penalty, because a context
switch is required to execute the UDF.
An unfenced mode is also provided, where the UDF is loaded directly into the DB2
address space. Unfenced mode greatly improves performance, but introduces a
significant RAS exposure.
Although DB2 recommends fenced mode, many customers use unfenced mode for improved
performance. The use of user keys can provide significant isolation between the database and
UDFs with low overhead.
User keys work with application programs. They are a virtualization of the PowerPC storage
key hardware. User keys can be added and removed from a user space AMR, and a single
user key can be assigned to an application's memory pages. Management and abstraction of
user keys is left to application developers. The storage protection keys application
programming interface (API) for user space applications is available in AIX V5.3 TL6 and is
supported on all IBM System p POWER6 processor-based servers running this technology
level.
Kernel keys are added to AIX as an important Reliability, Availability, and Serviceability (RAS)
function. They provide a Reliability function by limiting the damage that one software
component can do to other parts of the system. They will prevent kernel extensions from
damaging core kernel components, and provide isolation between kernel extension classes.
Kernel keys will also help to provide significant Availability function by helping prevent error
propagation, and this will be a key feature as AIX starts to implement kernel error recovery
handlers. Serviceability is enhanced by detecting memory addressing errors closer to their
origin. Kernel keys allow many random overlays to be detected when the error occurs, rather
than when the corrupted memory is used.
With kernel key support, the AIX kernel introduces the concept of kernel domains and private
memory access. Kernel domains are component data groups that are created to segregate
sections of the kernel and kernel extensions from each other. Hardware protection of kernel
memory domains is provided and enforced. Also, global storage heaps are separated and
protected. This keeps heap corruption errors within kernel domains. There are also private
memory keys that allow memory objects to be accessed only by authorized components.
Besides the Reliability, Availability and Serviceability benefits, private memory keys are a tool
to enforce data encapsulation.
The AIX V5.3 TL3 package introduced these new First Failure Data Capture (FFDC)
capabilities. The set of FFDC features is further expanded in AIX V5.3 TL5 and AIX V6.1.
These features are described in the sections that follow, and include:
Lightweight Memory Trace (LMT)
Run-Time Error Checking (RTEC)
Component Trace (CT)
Live Dump
These features are enabled by default at levels that provide valuable FFDC information with
minimal performance impacts. The advanced FFDC features can be individually manipulated.
Additionally, a SMIT dialog has been provided as a convenient way to persistently (across
reboots) disable or enable the features through a single command. To enable or disable all
four advanced FFDC features, enter the following command:
smitty ffdc
This specifies whether the advanced memory tracing, live dump, and error checking facilities
are enabled or disabled. Note that disabling these features reduces system Reliability,
Availability, and Serviceability.
Note: You must run the /usr/sbin/bosboot command after changing the state of the
Advanced First Failure Data Capture Features, and then reboot the operating system in
order for the changes to become fully active. Some changes will not fully take effect until
the next boot.
AIX V6.1 also provides additional features to reduce the size of dumps and the time needed
to create a dump. It is possible to control which components participate in the system dump. It
may be desirable to exclude some components from the system dump in order to decrease
the dump size.
Note: As of AIX V6.1, traditional system dumps are always compressed to further reduce
size.
The dumpctrl command obtains information about which components are registered for a
system dump. If a problem is being diagnosed, and multiple system dumps are needed,
components that are not needed can be excluded from system dumps until the problem is
solved. When the problem is solved, the system administrator should again enable all system
dump components.
AIX V6.1 also provides a live dump capability that allows failure data to be dumped without taking down the entire system. A live dump will most likely involve just a few system components. For example, prior to AIX V6.1, if an inconsistency was detected by a kernel component such as a device driver, the usual approach was to bring down and dump the entire system.
System dumps can now be copied to DVD media. You can also use DVD as a primary or
secondary dump device. Note that the snap command can use a DVD as source, as well as
an output device.
Important: In AIX V6.1, traditional dump remains the default method for performing a
system dump in all configurations.
Firmware-assisted system dump should only be used if you are directed to do so by IBM
Service during problem determination.
Selective memory dump is a firmware-assisted system dump that is triggered by (or uses) the
AIX instance. Full memory dump is a firmware-assisted system dump that dumps all partition
memory without any interaction with the AIX instance that is failing. Both selective-memory
dump and traditional system dump require interaction with the failing AIX instance to
complete the dump.
If all conditions for a firmware-assisted system dump are validated, AIX reserves a scratch
area and performs the firmware-assisted system dump configuration operations. The scratch
area is not released unless the administrator explicitly reconfigures a traditional system
dump configuration. Verification is not performed when a dynamic reconfiguration operation
modifies the memory size.
AIX can switch from firmware-assisted system dump to traditional system dump at dump time
because traditional system dump is always initialized. There are two cases of traditional
system dump: a user-specified traditional system dump configuration, and a traditional
system dump that is initialized just in case AIX cannot start a firmware-assisted system dump.
AIX can be configured to choose the type of dump between firmware-assisted system dump
and traditional system dump. When the configuration of the dump type is changed from
firmware-assisted system dump to traditional system dump, the new configuration is effective
immediately. When the configuration of the dump type is changed from traditional system
dump to firmware-assisted system dump, the new configuration is only effective after a
reboot.
When firmware-assisted system dump is supported by the platform and by AIX, and is
activated by the administrator, selective memory dump is the default dump configuration. Full
memory dump is not allowed by default. In case of firmware-assisted system dump
configuration, the administrator can configure AIX to allow a full memory dump. If allowed but
not required, the full memory dump is only performed when AIX cannot initiate a selective
memory dump.
The administrator can configure AIX to require a full memory dump even if AIX can initiate a
selective memory dump. In both cases, where full memory dump is either allowed or required,
the administrator can start a full memory dump from AIX or from the HMC menus.
Live dumps can be initiated by software programs or by users with root user authority.
Software programs use live dumps as part of recovery actions, or when the runtime
error-checking value for the error disposition is ERROR_LIVE_DUMP. If you have root user
authority, you can initiate live dumps when a subsystem does not respond or behaves
erroneously. For more information about how to initiate and manage live dumps, see the
livedumpstart and dumpctrl commands in AIX V6.1 command reference manuals, which
are downloadable at the following site:
http://publib.boulder.ibm.com/infocenter/pseries/v6r1/index.jsp
Unlike system dumps, which are written to a dedicated dump device, live dumps are written to
the file system. By default, live dumps are placed in the /var/adm/ras/livedump directory. The
directory can be changed by using the dumpctrl command.
In AIX V6.1, only serialized live dumps are available. A serialized live dump causes a system
to be frozen or suspended when data is being dumped. The freeze is done by stopping all
processors, except the processor running the dump. When the system is frozen, the data is
copied to the live dump heap in pinned kernel memory. The data is then written to the file
system, but only after the system is unfrozen. Live dump usually freezes the system for no
more than 100 ms.
The heapsz attribute (heap size) can be set to zero (0), meaning that at dump initialization time, the system calculates the live dump heap size based on the amount of real memory: the heap is 1/64 of the size of real memory or 16 MB, whichever is smaller.
Duplicate live dumps that occur rapidly are eliminated to prevent system overload and to save
file system space. Eliminating duplicate dumps requires periodic (once every 5 minutes)
scans of the live dump repository through a cron job. Duplicate elimination can be stopped via
the dumpctrl command.
Each live dump has a data priority. A live dump of info priority is for informational purposes,
and a live dump of critical priority is used to debug a problem. Info priority dumps can be
deleted to make room for critical priority dumps.
You can enable or disable all live dumps by using the dumpctrl ldmpon/ldmpoff command, or
by using the SMIT fastpath:
smitty livedump
Note: Live dumps are compressed and must be uncompressed with the dmpumcompress
command.
A new optimized compressed dump format has been introduced in AIX V5.3. The dump file
extension for this new format is .BZ. In this new compressed dump file, the blocks are
compressed and unordered; this unordered feature allows multiple processors to dump
parallel sub-areas of the system. Parallel dumps are produced automatically when supported
by the AIX release.
Important: The new file format for parallel dump is not readable with uncompress and zcat
commands.
In order to increase dump reliability, a new -S checking option, to be used with the -L option
for the statistical information on the most recent dump, is also added to the sysdumpdev
command. The -S option scans a specific dump device to determine whether it contains a
valid compressed dump.
sysdumpdev -L -S <Device>
The dump must be from an AIX release with parallel dump support. This flag can be used
only with the -L flag. Additional options can be found and modified in SMIT via smitty dump.
2.3.7 Minidump
The minidump, introduced in AIX V5.3 TL3, is a small compressed dump that is stored to
NVRAM when the system crashes or a dump is initiated, and is then written to the error log on
reboot. It can be used to see some of the system state and do some debugging if a full dump
is not available. It can also be used to obtain a quick snapshot of a crash without having to
transfer the entire dump from the crashed system.
Minidumps will show up as error log entries with a label of MINIDUMP_LOG and a description of
COMPRESSED MINIMAL DUMP. To filter the minidump entries in the error log, you can use the
errpt command with the -J flag (errpt [-a] -J MINIDUMP_LOG). Minidumps in the error log
can be extracted, decompressed, and formatted using the mdmprpt command, as shown:
mdmprpt [-l seq_no] [-i filename] [-r]
To check the actual trace subsystem properties, use the trcctl command, as shown in
Example 2-2.
lpar15root:/root#trcctl
To list all trace event groups, you can use the following command:
trcevgrp -l
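As a brief sketch (the output file names are only illustrative), a basic system trace session can be captured and then formatted with trcrpt as follows:
trace -a -o /tmp/trace.raw                 # start tracing asynchronously into a raw log file
sleep 10                                   # let the workload of interest run for a short interval
trcstop                                    # stop the trace session
trcrpt -o /tmp/trace.txt /tmp/trace.raw    # format the raw trace into a readable report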
There are several trace-related tools which are used for various operating system
components:
CPU monitoring tprof, curt, splat, trace, trcrpt
Memory monitoring trace, trcrpt
I/O subsystem trace, trcrpt
Network iptrace, ipreport, trace, trcrpt
Processes and threads tprof, pprof, trace, trcrpt
Additional details about these trace options are available in IBM AIX Version 6.1 Differences
Guide, SC27-7559.
POSIX trace
AIX Version 6 implements the POSIX trace system, which supports tracing of user
applications via a standardized set of interfaces. The POSIX tracing facilities allow a process
to select a set of trace event types, activate a trace stream of the selected trace events as
they occur in the flow of execution, and retrieve the recorded trace events. Like system trace,
POSIX trace is also dependent upon precompiled-in trace hooks in the application being
instrumented.
Additional details about these trace options are available in IBM AIX Version 6.1 Differences
Guide, SC27-7559.
Iptrace
The iptrace daemon provides interface-level packet tracing for Internet protocols. This
daemon records Internet packets received from configured interfaces. Command flags
provide a filter so that the daemon traces only packets that meet specific criteria. Packets are
traced only between the local host on which the iptrace daemon is invoked and the remote
host.
If the iptrace process was started from a command line without the System Resource
Controller (SRC), it must be stopped with the kill -15 command. The kernel extension
loaded by the iptrace daemon remains active in memory if iptrace is stopped any other way.
The LogFile parameter specifies the name of a file to which the results of the iptrace
command are sent. To format this file, run the ipreport command.
The ipreport command may display the message TRACING DROPPED xxxx PACKETS. This
count of dropped packets indicates only the number of packets that the iptrace command
was unable to grab because of a large packet whose size exceeded the socket-receive buffer
size. The message does not mean that the packets are being dropped by the system.
Component Trace (CT) provides system trace information for specific system components. This information
allows service personnel to access component state information through either in-memory
trace buffers or through traditional AIX system trace. Component Trace is enabled by default.
Lightweight memory trace (LMT) provides system trace information for First Failure Data Capture (FFDC). It is a constant
kernel trace mechanism that records software events occurring during system operation. The
system activates LMT at initialization, then tracing runs continuously. Recorded events are
saved into per-processor memory trace buffers. There are two memory trace buffers for each
processor: one to record common events, and one to record rare events. The memory trace
buffers can be extracted from system dumps or accessed on a live system by service personnel.
The trace records look like traditional AIX system trace records. The extracted memory trace
buffers can be viewed with the trcrpt command, with formatting as defined in the /etc/trcfmt
file.
For further details about LMT, refer to 3.2, Lightweight memory trace on page 57.
2.3.11 ProbeVue
AIX V6.1 provides a new dynamic tracing facility that can help to debug complex system or
application code. This dynamic tracing facility is introduced via a new tracing command,
probevue, that allows a developer or system administrator to dynamically insert trace probe
points in existing code without having to recompile the code. ProbeVue is described in detail
in 3.8, ProbeVue on page 111. To show or change the ProbeVue configuration, use the
following command:
smitty probevue
The purpose of error logging is to collect and record data related to a failure so that it can be
subsequently analyzed to determine the cause of the problem. The information recorded in
the error log enables the customer and the service provider to rapidly isolate problems,
retrieve failure data, and take corrective action.
Error logging is automatically started by the rc.boot script during system initialization. Error
logging is automatically stopped by the shutdown script during system shutdown.
The error logging process begins when the AIX operating system module detects an error.
The error-detecting segment of code then sends error information to either the errsave kernel
service or the errlog application subroutine, where the information is, in turn, written to the
/dev/error special file. The errlast kernel service preserves the last error record in NVRAM;
therefore, in the event of a system crash, the last logged error is not lost. This process then
adds a time stamp to the collected data.
The errdemon daemon constantly checks the /dev/error file for new entries, and when new
data is written, the daemon conducts a series of operations. Before an entry is written to the
error log, the errdemon daemon compares the label sent by the kernel or application code to
the contents of the error record template repository. If the label matches an item in the
repository, the daemon collects additional data from other parts of the system.
To create an entry in the error log, the errdemon daemon retrieves the appropriate template
from the repository, the resource name of the unit that detected the error, and detailed data.
Also, if the error signifies a hardware-related problem and hardware Vital Product Data (VPD)
exists, the daemon retrieves the VPD from the Object Data Manager (ODM).
When you access the error log, either through SMIT or by using the errpt command, the
error log is formatted according to the error template in the error template repository and
presented in either a summary or detailed report.
Most entries in the error log are attributable to hardware and software problems, but
informational messages can also be logged. The errlogger command allows the system
administrator to record messages of up to 1024 bytes in the error log.
Whenever you perform a maintenance activity, such as clearing entries from the error log,
replacing hardware, or applying a software fix, it is good practice to record this activity in the
system error log; here is an example:
errlogger system hard disk '(hdisk0)' replaced.
The alog command works with log files that are specified on the command line, or with logs
that are defined in the alog configuration database. Logs that are defined in the alog
configuration database are identified by LogType. The File, Size, and Verbosity attributes for
each defined LogType are stored in the alog configuration database with the LogType. The
alog facility is used to log boot time messages, install logs, and so on.
You can use the alog -L command to display all log files that are defined for your system.
The resulting list contains all logs that are viewable with the alog command. Information
saved in BOS installation log files might help you determine the cause of installation
problems. To view BOS installation log files, enter:
alog -o -f bosinstlog
All boot messages are collected in a boot log file, because at boot time there is no console
available. Boot information is usually collected in /var/adm/ras/bootlog. It is good practice to
check the bootlog file when you are investigating boot problems. The file will contain output
generated by the cfgmgr command and rc.boot.
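For example, to display the boot log by its log type, you can enter:
alog -o -t boot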
Alternatively, you can use the SMIT fastpath smitty alog menu shown in Example 2-3.
Alog
[Entry Fields]
* +--------------------------------------------------------------------------+
| Alog TYPE |
| |
| Move cursor to desired item and press Enter. |
| |
| boot |
| bosinst |
| nim |
| console |
| cfg |
| lvmcfg |
| lvmt |
| dumpsymp |
| |
| F1=Help F2=Refresh F3=Cancel |
F1| F8=Image F10=Exit Enter=Do |
F5| /=Find n=Find Next |
F9+--------------------------------------------------------------------------+
2.3.14 syslog
The syslogd daemon logs the system messages from different software components (kernel,
daemon processes, and system applications). This daemon uses a configuration file to
determine where to send a system message, depending on the message's priority level and
the facility that generated it. By default, syslogd reads the default configuration file
/etc/syslog.conf, but by using the -f flag when starting syslogd, you can specify an alternate
configuration file.
The syslogd daemon reads a datagram socket and sends each message line to a destination
described by the /etc/syslog.conf configuration file. The syslogd daemon reads the
configuration file when it is activated and when it receives a hang-up signal.
Each message is one line. A message can contain a priority code marked by a digit enclosed
in angle braces (< >) at the beginning of the line. Messages longer than 900 bytes may be
truncated.
The /usr/include/sys/syslog.h include file defines the facility and priority codes used by the
configuration file. Locally-written applications use the definitions contained in the syslog.h file
to log messages using the syslogd daemon. For example, a configuration file that contains
the following line is often used when a daemon process causes a problem:
daemon.debug /tmp/syslog.debug
This line indicates that the daemon facility should be monitored: all messages with the
priority level debug and higher are written to the file /tmp/syslog.debug.
The daemon process that causes problems (in our example, the inetd) is started with option
-d to provide debug information. This debug information is collected by the syslogd daemon,
which writes the information to the log file /tmp/syslog.debug.
Logging can also improve security on your system. You can increase system logging by
adding entries to the syslog.conf file and by editing inetd.conf to enable additional logging in
each of the daemons, as required.
bianca:/etc/#vi syslog.conf
*.info /var/adm/syslog/syslog.log
*.alert /var/adm/syslog/syslog.log
*.notice /var/adm/syslog/syslog.log
*.warning /var/adm/syslog/syslog.log
*.err /var/adm/syslog/syslog.log
*.crit /var/adm/syslog/syslog.log rotate time 1d files 9
The last line in Example 2-5 indicates that syslog rotates this log file daily and keeps at most
nine files. A file designated for storing syslog messages must exist; otherwise, the logging will
not start. Remember to refresh syslog after any changes are made to the syslog configuration
file.
Extra logging on demons controlled by inetd.conf can be configured as shown in Example 2-6
(ftpd is shown).
bianca:/etc/#vi inetd.conf
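As a sketch only, an ftpd entry in /etc/inetd.conf with additional logging enabled might look like the following; the -l flag, which logs each ftp session through syslog, is an assumption here:
# assumed entry: the -l flag enables ftp session logging through syslog
ftp     stream  tcp6    nowait  root    /usr/sbin/ftpd    ftpd -l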
Concurrent AIX Update enables activation and deactivation of IBM fixes to the kernel and
kernel extensions. It accomplishes this by adding new capabilities to the interim fix packaging
and installation tools, the system loader, and to the system process component.
Note: The ability to apply or remove a fix without the requirement of a reboot is limited to
Concurrent AIX Updates. Technological restrictions prevent some fixes from being made
available as a Concurrent AIX Update. In such cases, those fixes may be made available
as an interim fix (that is, traditional ifix).
Traditional interim fixes for the kernel or kernel extensions still require a reboot of the
operating system for both activation and removal.
Performing live updates on an operating system is a complicated task, and it places stringent
demands on the operating system. There are many different approaches available for
patching the operating system kernel. Concurrent AIX Update uses a method of functional
redirection within the in-memory image of the operating system to accomplish patching-in of
corrected code. After a fix for a problem has been determined, the corrected code is built,
packaged, and tested according to a new process for Concurrent AIX Update. It is then
provided to the customer, using the existing interim fix package format.
The package will contain one or more object files, and their corresponding executable
modules. Patch object files have numerous restrictions, including (but not limited to) that no
non-local data be modified, and that changes to multiple functions are only permitted if they
Defective functions that are to be patched have their first instruction saved aside and then
subsequently replaced with a branch. The branch serves to redirect calls to a defective
function to the corrected (patched) version of that function. To maintain system coherency,
instruction replacements are collectively performed under special operation of the system.
After patching is successfully completed, the system is fully operational with corrected code,
and no reboot is required. To remove a Concurrent AIX Update from the system, the
saved-aside instructions are simply restored. Again, no reboot is required.
This method is suitable to the majority of kernel and kernel extension code, including interrupt
handlers, locking code, and possibly even the concurrent update mechanism itself.
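As a sketch, a Concurrent AIX Update delivered as an interim fix package can be installed, listed, and removed with the emgr command; the package file name and fix label shown here are hypothetical:
emgr -e /tmp/IZ12345.070909.epkg.Z    # install and activate the concurrent update package
emgr -l                               # list installed interim fixes and their labels
emgr -r -L IZ12345                    # remove and deactivate the fix by label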
Two new commands, lscore and chcore, have been introduced to check the settings for the
corefile creation and change them, respectively. SMIT support has also been added; the
fastpath is smitty corepath.
For AIX 5.3, the chcore command can change core file creation parameters, as shown:
chcore [ -R registry ] [ -c {on|off|default} ] [ -p {on|off|default} ] [ -l {path|default} ]
[ -n {on|off|default} ] [ username | -d ]
-c {on|off|default} Setting for core compression.
-d Changes the default setting for the system.
-l path Directory path for stored corefiles.
-n {on|off|default} Setting for core naming.
-p {on|off|default} Setting for core location.
-R registry Specifies the loadable I&A module.
New features have been added to control core files that will avoid key file systems being filled
up by core files generated by faulty programs. AIX allows users to compress the core file and
specify its name and destination directory.
The chcore command, as shown in Example 2-7, can be used to control core file parameters.
lpar15root:/root#chcore -c on -p on -l /coredumps
lpar15root:/root#lscore
In Example 2-7 on page 35, we have created a file system to store core files generated on the
local system. This file system is mounted in the /coredumps directory. Next we created a core
file for a program (sleep 5000), as shown in Example 2-8. We send the program to the
background, so we get the process id, then we kill the program with Abort (-6) flag, and
observe the core file.
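The steps described above can be summarized in the following sketch, assuming the /coredumps file system is already created and mounted:
sleep 5000 &          # start a long-running test program in the background
kill -6 $!            # send signal 6 (SIGABRT) to force a core dump
ls -l /coredumps      # the core file is written to the configured directory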
This section covers the following network-related utilities currently available in AIX:
Virtual IP address support
Multipath IP routing
Dead gateway detection
EtherChannel
IEEE 802.3ad Link Aggregation
2-Port Adapter-based Ethernet Failover
Shared Ethernet Failover
Additional details and configuration information on these topics can be found in AIX 5L
Differences Guide Version 5.2 Edition, SG24-5765, AIX 5L Differences Guide Version 5.3
Edition, SG24-7463, and IBM AIX Version 6.1 Differences Guide, SC27-7559.
To overcome this, support for virtual IP addresses (VIPA) on both IPv4 and IPv6 was
introduced in AIX V5.1. VIPA allows the application to bind to a system-wide level virtual IP
address, as opposed to a single network interface. VIPA is a virtual device often utilizing
several network interfaces. VIPA can often mask underlying network interface failures by
re-routing automatically to a different one. This allows continued connectivity and is
transparent to the application and processes. VIPA also supports load balancing of traffic
across the available connections.
Another advantage of choosing a virtual device (as opposed to defining aliases to real
network interfaces) is that a virtual device can be brought up or down separately without
having any effect on the real interfaces of a system. Furthermore, it is not possible to change
the address of an alias (aliases can only be added and deleted), but the address of a virtual
interface can be changed at any time.
Since its initial introduction in AIX V5.1, VIPA has been enhanced to make it friendlier, from a
network administration perspective. It has also been enhanced so that failovers are
completed faster, thus further improving availability.
Also, previous AIX releases did not provide any mechanism to associate a specific interface
with a route. When there were multiple interfaces on the same subnet, the same outgoing
interface for all destinations accessible through that network was always chosen.
In order to configure a system for network traffic load balancing, it is desirable to have
multiple routes so that the network subsystem routes network traffic to the same network
segment by using different interfaces.
With the multipath routing feature introduced in AIX V5.1, routes no longer need to have a different
destination, netmask, or group ID list. If there are several routes that equally qualify as a route
to a destination, AIX will use a cyclic multiplexing mechanism (round-robin) to choose
between them. The benefit of this feature is two-fold:
It enables load balancing between two or more gateways.
The feasibility of load balancing between two or more interfaces on the same network can
be realized. The administrator would simply add several routes to the local network, one
through each interface, as sketched below.
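A minimal sketch follows; the network addresses and gateways are hypothetical. AIX alternates between the two equally qualified routes in round-robin fashion:
route add -net 192.168.10.0 -netmask 255.255.255.0 10.1.1.1    # first route to the subnet
route add -net 192.168.10.0 -netmask 255.255.255.0 10.1.2.1    # second, equally qualified route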
AIX releases prior to AIX V5.1 do not permit you to configure multiple routes to the same
destination. If a route's first-hop gateway failed to provide the required routing function, AIX
continued to try to use the broken route and address the dysfunctional gateway. Even if there
was another path to the destination which would have offered an alternative route, AIX did not
have any means to identify and switch to the alternate route unless a change to the kernel
routing table was explicitly initiated, either manually or by running a routing protocol program,
such as gated or routed. Gateways on a network run routing protocols and communicate with
one another. If one gateway goes down, the other gateways will detect it, and adjust their
routing tables to use alternate routes (only the hosts continue to try to use the dead gateway).
The DGD feature in AIX V5.1 enables host systems to sense and isolate a dysfunctional
gateway and adjust the routing table to make use of an alternate gateway without the aid of a
running routing protocol program.
The passive DGD mechanism depends on the protocols Transmission Control Protocol
(TCP) and Address Resolution Protocol (ARP), which provide information about the state of
the relevant gateways. If the protocols in use are unable to give feedback about the state of a
gateway, a host will never know that a gateway is down and no action will be taken.
Passive dead gateway detection has low overhead and is recommended for use on any
network that has redundant gateways. However, passive DGD is done on a best-effort basis
only.
Active DGD will ping gateways periodically, and if a gateway is found to be down, the routing
table is changed to use alternate routes to bypass the dysfunctional gateway. Active dead
gateway detection will be off by default and it is recommended to be used only on machines
that provide critical services and have high availability requirements. Because active DGD
imposes some extra network traffic, network sizing and performance issues have to receive
careful consideration. This applies especially to environments with a large number of
machines connected to a single network.
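A minimal sketch of enabling DGD follows; the gateway address is hypothetical, and the -active_dgd route flag is assumed to be available at your AIX level:
no -p -o passive_dgd=1                     # turn on passive dead gateway detection system-wide
route add -active_dgd default 10.1.1.1     # add a default route with active DGD enabled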
2.4.4 EtherChannel
EtherChannel is a network interface aggregation technology that allows you to produce a
single large pipe by combining the bandwidth of multiple Ethernet adapters. In AIX V5.1, the
EtherChannel allows for multiple adapters to be aggregated into one virtual adapter, which
the system treats as a normal Ethernet adapter. The IP layer sees the adapters as a single
interface with a shared MAC and IP address. The aggregated adapters can be a combination
of any supported Ethernet adapter, although they must be connected to a switch that
supports EtherChannel. All connections must be full-duplex and there must be a
point-to-point connection between the two EtherChannel-enabled endpoints.
If an adapter in the EtherChannel goes down, then traffic is transparently rerouted. Incoming
packets are accepted over any of the interfaces available. The switch can choose how to
distribute its inbound packets over the EtherChannel according to its own implementation,
which in some installations is user-configurable. If all adapters in the channel fail, then the
channel is unable to transmit or receive packets.
Starting in AIX V5.2, there are two policies for outbound traffic, standard and round robin:
The standard policy is the default; this policy allocates the adapter to use on the basis of
the hash of the destination IP addresses.
The round robin policy allocates a packet to each adapter on a round robin basis in a
constant loop.
AIX V5.2 also introduced the concept of configuring a backup adapter to the EtherChannel.
The backup adapter's purpose is to take over the IP and MAC address of the channel in the
event of a complete channel failure, which is constituted by failure of all adapters defined to
the channel. It is only possible to have one backup adapter configured per EtherChannel.
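EtherChannels are normally created through the smitty etherchannel fastpath. As a command-line sketch (the adapter names are hypothetical), an aggregation of ent0 and ent1 with ent2 as the backup adapter might be created as follows:
mkdev -c adapter -s pseudo -t ibm_ech \
      -a adapter_names=ent0,ent1 -a backup_adapter=ent2    # creates a new EtherChannel pseudo device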
For example, ent0 and ent1 can be aggregated into an IEEE 802.3ad Link Aggregation called
ent3; interface ent3 would then be configured with an IP address. The system considers
these aggregated adapters as one adapter. Therefore, IP is configured over them as over any
single Ethernet adapter. The link remains available if one of the underlying physical adapters
loses connectivity.
Like EtherChannel, IEEE 802.3ad requires support in the switch. Unlike EtherChannel,
however, the switch does not need to be configured manually to know which ports belong to
the same aggregation.
While this concept is not new by today's standards, it represents yet another ability that
contributes to overall system availability. This ability can be used in conjunction with both PCI
hot plug management (see 2.1.9, PCI hot plug management on page 15), and hot spare
disks (see Hot spare disks in a volume group on page 43).
For details about system backup and recovery, refer to the AIX Documentation Web page:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.install/doc/insgdrf/backup_intro.htm
This feature also allows the ability to multi-boot from different AIX level images that could also
contain different versions or updates to an application. This may be useful for periods of
testing, or to even help create a test environment. This feature, originally introduced in AIX
4.1.4.0, utilizes the alt_disk_install command. Starting in AIX V5.3, this feature has been
expanded into three commands:
alt_disk_copy
alt_disk_mksysb
alt_rootvg_op
These commands now offer additional granular options and greater flexibility than ever
before.
Additional details about these commands are available in AIX 5L Version 5.3 Commands
Reference, Volume 1, a-c, SC23-4888. This publication is also available online at the
following site:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.cmds/doc/aixcmds1/aixcmds1.pdf
The multibos utility allows the administrator to access, install, maintain, update, and
customize the standby instance of BOS (during setup or in subsequent customization
operations).
The file systems /, /usr, /var, /opt, and /home, along with the boot logical volume, must exist
privately in each instance of BOS. The administrator has the ability to share or keep private
all other data in the rootvg. As a general rule, shared data should be limited to file systems
and logical volumes containing data not affected by an upgrade or modification of private
data.
When updating the non-running BOS instance, it is best to first update the running BOS
instance with the latest available version of multibos (which is in the bos.rte.bosinst fileset).
Additional details on the multibos utility are available in the man pages and in AIX 5L Version
5.3 Commands Reference, Volume 3, a-c, SC23-4890. This publication is also available
online at the following site:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.cmds/doc/aixcmds3/aixcmds3.pdf
For details and examples about how to use NIM to enhance your system availability, refer to
the IBM Redbooks publication NIM from A to Z in AIX 5L, SG24-7296.
For non-SAN based storage environments, it is quite common to utilize AIX LVM's ability to
mirror data. This is especially true for the operating system disks (rootvg). By mirroring
rootvg, AIX can continue operating in the event of a rootvg disk failure.
However, although LVM mirroring does increase storage availability, it is not intended to be a
substitute for system backup. Additional detailed information about LVM mirroring is available
in the mirrorvg man page and in AIX 5L Version 5.3 System Management Concepts:
Operating System and Devices Management, SC23-5204.
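A typical sequence for mirroring rootvg to a second disk is sketched below; hdisk1 is an assumed disk name:
extendvg rootvg hdisk1              # add the second disk to rootvg
mirrorvg rootvg hdisk1              # create a mirror copy of all rootvg logical volumes
bosboot -ad /dev/hdisk1             # create a boot image on the new disk
bootlist -m normal hdisk0 hdisk1    # allow booting from either disk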
Automatic synchronization is a recovery mechanism that will only be attempted after the LVM
device driver logs LVM_SA_STALEPP in the errpt. A partition that becomes stale through any
other path (for example, mklvcopy) will not be automatically resynced. This is always used in
conjunction with the hot spare disk in a volume group feature.
To change the volume group characteristics, you can use the smitty chvg SMIT fastpath, or
you can use the following commands (these are examples):
/usr/sbin/chvg -s'y' fransvg
/usr/sbin/chvg -h hotsparepolicy -s syncpolicy volumegroup
In order to use this space, the disk must grow in size by dynamically adding additional
physical partitions (PP). AIX V5.2 introduced support of dynamic volume expansion by
updating the chvg command to include a new -g flag. More detailed information about this
topic is available in the man page for chvg and in AIX 5L Differences Guide Version 5.2
Edition, SG24-5765.
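For example, after a disk backing the volume group has been grown, the volume group can be told to examine its disks for additional space (datavg is a hypothetical volume group name):
chvg -g datavg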
The design of LVM also allows for accidental duplication of volume group and logical volume
names. If the volume group or logical volume names being imported already exist on the new
machine, then LVM will generate a distinct volume group or logical volume name.
When a volume group is created onto a single disk, it initially has two VGDA/VGSA areas
residing on the disk. If a volume group consists of two disks, one disk still has two
VGDA/VGSA areas, but the other disk has one VGDA/VGSA. When the volume group is
made up of three or more disks, then each disk is allocated just one VGDA/VGSA.
A quorum is lost when enough disks and their VGDA/VGSA areas are unreachable such that
51% (a majority) of VGDA/VGSA areas no longer exists. In a two-disk volume group, if the
disk with only one VGDA/VGSA is lost, a quorum still exists because two of the three
VGDA/VGSA areas still are reachable. If the disk with two VGDA/VGSA areas is lost, this
statement is no longer true. The more disks that make up a volume group, the lower the
chances of quorum being lost when one disk fails.
When a quorum is lost, the volume group varies itself off so that the disks are no longer
accessible by the Logical Volume Manager (LVM). This prevents further disk I/O to that
volume group so that data is not lost or assumed to be written when physical problems occur.
Additionally, as a result of the vary off, the user is notified in the error log that a hardware
error has occurred and service must be performed.
There are cases when it is desirable to continue operating the volume group even though a
quorum is lost. In these cases, quorum checking may be turned off for the volume group. This
type of volume group is referred to as a nonquorum volume group. The most common case
for a nonquorum volume group is when the logical volumes have been mirrored.
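For example, quorum checking can be turned off with the chvg command; the volume group name is hypothetical:
chvg -Qn datavg    # disable quorum checking for this volume group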
When a disk is lost, the data is not lost if a copy of the logical volume resides on a disk that is
not disabled and can be accessed. However, there can be instances in nonquorum volume
groups, mirrored or nonmirrored, when the data (including copies) resides on the disk or disks
that have become unavailable. In those instances, the data may not be accessible even
though the volume group continues to be varied on.
GLVM is synchronous data replication of physical volumes to remote physical volumes. The
combination of both remote physical volumes with local physical volumes form
geographically mirrored volume groups. These are managed by LVM very much like
ordinary volume groups. A high level example of a GLVM environment is shown in Figure 2-1.
Figure 2-1 A GLVM environment: an LVM mirrored volume group with mirror copy 1 at Site A and mirror
copy 2 at Site B
Additional GLVM details, including installing and configuring, are available in the white paper
Using the Geographic LVM in AIX 5L, which can be found online at the following site:
http://www-03.ibm.com/systems/p/os/aix/whitepapers/pdf/aix_glvm.pdf
GLVM by itself does not provide automated recovery in the event of a catastrophic site failure.
However, it is a fully integrated and supported option in IBM's premier disaster recovery
software solution for System p systems running AIX called HACMP/XD. More information
about HACMP/XD options, including GLVM, is available in Implementing High Availability
Cluster Multi-Processing (HACMP) Cookbook, SG24-6769.
Although GLVM does not have technical distance limitations for support, there are several
factors which dictate what is realistically achievable. These factors include, but are not limited
to, network bandwidth, I/O rates, and latency between the sites. More detailed information
about this topic is available in the white paper Optimizing Mirroring Performance using
HACMP/XD for Geographic LVM, which can be found online at the following site:
http://www-304.ibm.com/jct03004c/systems/p/software/whitepapers/hacmp_xd_glvm.pdf
Each JFS2 resides on a separate logical volume. The operating system mounts JFS2 during
initialization. This multiple file system configuration is useful for system management
functions such as backup, restore, and repair. It isolates a part of the file tree to allow system
administrators to work on a particular part of the file tree.
AIX V5.3 introduced the ability to shrink a JFS2 filesystem dynamically by allowing the chfs
command to recognize a minus sign (-) in the size attribute as a request to reduce the
filesystem by the amount specified. More detailed information about shrinking a filesystem is
available in AIX 5L Differences Guide Version 5.3 Edition, SG24-7463.
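For example, to shrink a file system by 1 GB (the mount point is hypothetical):
chfs -a size=-1G /testfs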
MPIO supports standard AIX commands to be used to administer the MPIO devices.
A path control module (PCM) provides the path management functions. An MPIO-capable
device driver can control more than one type of target device. A PCM can support one or
more specific devices. Therefore, one device driver can be interfaced to multiple PCMs that
control the I/O across the paths to each of the target devices.
The AIX PCM has a health-check capability that can be used for the following tasks:
Check the paths and determine which paths are currently usable for sending I/O.
Enable a path that was previously marked failed because of a temporary path fault (for
example, when a cable to a device was removed and then reconnected).
Check currently unused paths that would be used if a failover occurred (for example, when
the algorithm attribute value is failover, the health check can test the alternate paths).
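A brief sketch of inspecting the paths and health-check attributes of an MPIO device follows; hdisk2 is a hypothetical device name:
lspath -l hdisk2                   # list the paths to the device and their states
lsattr -El hdisk2 | grep hcheck    # display the health-check interval and mode attributes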
AIX MPIO supports multiple I/O routing policies, thereby increasing the system administrator's
control over MPIO reliability and performance. Detailed information on MPIO including path
management is available in AIX 5L System Management Guide: Operating System and
Devices Management, SC23-5204, which is also online at the following site:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.baseadmn/doc/baseadmndita/baseadmndita.pdf
Dynamic tracking allows such changes to be performed without bringing the devices offline.
Although this support is an integral component in contributing to overall system availability,
devices that are only accessible by one path can still be affected during these changes.
Clearly, a single-path environment does not provide maximum availability.
Dynamic tracking of fibre channel devices is controlled by a new fscsi device attribute, dyntrk.
The default setting for this attribute is no. To enable this feature, the fscsi attribute must be
set to yes, as shown in Example 2-9.
Important: If child devices exist and are in the available state, this command will fail.
These devices must either be removed or put in the defined state for successful
execution.
In addition, dynamic tracking is often used in conjunction with fast I/O fail. Additional
information about requirements and restrictions of fast I/O failure for fibre channel devices is
available in AIX 5L Version 5.3 Kernel Extensions and Device Support Programming
Concepts, SC23-4900.
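As a sketch, both dynamic tracking and fast I/O failure are enabled on the fscsi device with the chdev command; fscsi0 is a hypothetical adapter instance, and its child devices must be removed or placed in the defined state first, as noted in the warning above:
chdev -l fscsi0 -a dyntrk=yes                 # enable dynamic tracking of fibre channel devices
chdev -l fscsi0 -a fc_err_recov=fast_fail     # enable fast I/O failure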
This tool tracks and captures service information, hardware error logs, and performance
information. It automatically reports hardware error information to IBM support as long as the
system is under an IBM maintenance agreement or within the IBM warranty period. Service
information and performance information reporting do not require an IBM maintenance
agreement, nor do they need to be within the IBM warranty period.
Previous Electronic Service Agent products were unique to the platform or operating system
on which they were designed to run. Because the Electronic Service Agent products were
platform-specific, each product had to be installed and administered separately.
In contrast, Electronic Service Agent 6.1 installs on platforms running different operating
systems. ESA 6.1 offers a consistent interface to reduce the burden of administering a
network with different platforms and operating systems. Your network can have some clients
running the Electronic Service Agent 6.1 product and other clients running the previous
Electronic Service Agent product.
If you have a mixed network of clients running Electronic Service Agent 6.1 and previous
Electronic Service Agent products, you need to refer to the information specific to each
Electronic Service Agent product for instructions on installing and administering that product.
To access Electronic Service Agent user guides, go to the Electronic Services Web site and
select Electronic Service Agent from the left navigation. In the contents pane, select
Reference Guides > Select a platform > Select an Operating System or Software.
You can write shell scripts to perform data reduction on the command output, warn of
performance problems, or record data on the status of a system when a problem is occurring.
For example, a shell script can test whether the CPU idle percentage is zero (0), indicating a
saturated condition, and execute another shell script when that condition occurs.
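A minimal sketch of such a script follows; the position of the idle column in the vmstat output and the name of the follow-up script are assumptions:
#!/usr/bin/ksh
# sample the CPU statistics and extract the idle percentage (assumed to be column 16)
idle=$(vmstat 1 2 | tail -1 | awk '{print $16}')
if [ "$idle" -eq 0 ]; then
    # the CPU appears saturated; run a hypothetical data-collection script
    /usr/local/bin/cpu_saturated.sh
fi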
The topas command extracts and displays statistics from the system with a default interval of
two seconds. The command offers the following alternate screens:
Overall system statistics
List of busiest processes
WLM statistics
List of hot physical disks
Logical partition display
Cross-Partition View (AIX V5.3 ML3 and higher)
SMIT panels are available for easier configuration and setup of the topas recording function
and report generation; use this command:
smitty topas
Important: Keep in mind that the incorrect use of commands to change or tune the AIX
kernel can cause performance degradation or operating system failure.
Before modifying any tunable parameter, you should first carefully read about all of the
parameter's characteristics in the Tunable Parameters section of the product
documentation in order to fully understand the parameter's purpose.
Then ensure that the Diagnosis and Tuning sections for this parameter actually apply to
your situation, and that changing the value of this parameter could help improve the
performance of your system. If the Diagnosis and Tuning sections both contain only N/A, it
is recommended that you do not change the parameter unless you are specifically directed
to do so by IBM Software Support.
All six tuning commands (schedo, vmo, no, nfso, ioo, and raso) use a common syntax
and are available to directly manipulate the tunable parameter values. SMIT panels and
Web-based System Manager plug-ins are also available. These all provide options for
displaying, modifying, saving, and resetting current and next boot values for all the kernel
tuning parameters. To start the SMIT panels that manage AIX kernel tuning parameters, use
the SMIT fast path smitty tuning.
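For example, a network tunable can be displayed and then changed for the current and next boot with the common syntax; the tunable and value are only illustrative:
no -o tcp_sendspace               # display the current value of the tunable
no -p -o tcp_sendspace=262144     # set it now and record it for the next boot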
You can make permanent kernel-tuning changes without having to edit any rc files. This is
achieved by centralizing the reboot values for all tunable parameters in the
/etc/tunables/nextboot stanza file. When a system is rebooted, the values in the
/etc/tunables/nextboot file are automatically applied. For more information, refer to IBM
The specified flag determines whether the raso command sets or displays a parameter. The
-o flag can be used to display the current value of a parameter, or to set a new value for a
parameter.
As with all AIX tuning parameters, changing a raso parameter may impact the performance or
reliability of your AIX LPAR or server; refer to IBM System p5 Approaches to 24x7 Availability
Including AIX 5L, for more information about this topic, which is available at the following site:
http://www.redbooks.ibm.com/redbooks/pdfs/sg247196.pdf
We recommend that you do not change the parameter unless you are specifically directed to
do so by IBM Software Support.
2.7 Security
The security features in AIX also contribute to system availability.
Note: The partition to be moved must use only virtual devices. The virtual disks must be
LUNs on external SAN storage. The LUNs must be accessible to the VIO Server on each
system.
For more information about this topic, refer to the IBM Redbooks publication IBM System p
Live Partition Mobility, SG24-7460.
Note: Live application mobility is a completely separate technology from live partition
mobility. However, they can coexist on a system that can match the prerequisites of both.
For live application mobility, each participating logical partition and machine must be configured
the same way for the workload partition. This includes:
The same file systems needed by the application via NFS V3 or NFS V4.
Similar network functionalities (usually the same subnet, with routing implications).
Enough space to handle the data created during the relocation process. (Estimating the
size is very difficult since it would include the virtual application size, some socket
information, and possible file descriptor information.)
The same operating system level and technology maintenance level (TL).
Note: In addition to operating system requirements, if not on shared file systems (NFS),
then application binaries must be at the same level (identical).
For more information about workload partitions and live application mobility, refer to the IBM
Redbooks publication Introduction to Workload Partition Management in IBM AIX Version 6,
SG24-7431.
A RAS component hierarchy is used by some features. This divides the system into a
resource hierarchy, and allows individual RAS commands to be directed to very specific parts
of the system. The RAS features that exploit the RAS component hierarchy are runtime
checking, component trace, and component dump. This grouping hierarchy is illustrated in
Figure 3-1.
Figure 3-1 RAS component hierarchy
These features are enabled by default at levels that provide valuable First Failure Data
Capture information with minimal performance impact. To enable or disable all four advanced
First Failure Data Capture features, enter the following command:
/usr/lib/ras/ffdcctrl -o ffdc=enabled -o bosboot=no
[Entry Fields]
Advanced First Failure Data Capture Features [enabled] +
Run bosboot automatically [no] +
For more information about FFDC, refer to IBM eServer p5 590 and 595 System Handbook,
SG24-9119.
There are two memory trace buffers for each processor: one to record common events, and
one to record rare events. These buffers can be extracted from system dumps or accessed
on a live system by service personnel using the mtrcsave command. The extracted memory
trace buffers can be viewed by using the trcrpt command, with formatting as defined in the
/etc/trcfmt file.
LMT has been carefully implemented such that it has negligible performance impacts. The
impact on the throughput of a kernel-intensive benchmark is just 1%, and is much less for
typical user workloads. LMT requires the consumption of a small amount of pinned kernel
memory. The default amount of memory required for the trace buffers is calculated based on
factors that influence trace record retention.
Lightweight memory trace differs from traditional AIX system trace in several ways. First, it is
more efficient. Second, it is enabled by default, and has been explicitly tuned as a First
Failure Data Capture mechanism. Unlike traditional AIX system trace, you cannot selectively
record only certain AIX trace hook ids with LMT. With LMT, you either record all LMT-enabled
hooks, or you record none.
Traditional AIX system trace also provides options that allow you to automatically write the
trace information to a disk-based file (such as /var/adm/ras/trcfile). Lightweight memory trace
provides no such option to automatically write the trace entries to disk when the memory
trace buffer fills. When an LMT memory trace buffer fills, it wraps, meaning that the oldest
trace record is overwritten.
The value of LMT derives from being able to view some history of what the system was doing
prior to reaching the point where a failure is detected. As previously mentioned, each CPU
has a memory trace buffer for common events, and a smaller memory trace buffer for rare
events.
The intent is for the common buffer to have a 1- to 2-second retention (in other words, have
enough space to record events occurring during the last 1 to 2 seconds, without wrapping).
The rare buffer should have at least an hour's retention. This depends on workload, and on
where developers place trace hook calls in the AIX kernel source and which parameters they
trace.
You can enable lightweight memory trace by using the following command (to disable it, set
mtrc_enabled=0 instead):
/usr/bin/raso -r -o mtrc_enabled=1
Note: In either case, the boot image must be rebuilt (the bosboot command needs to be
run), and the change does not take effect until the next reboot.
There are several factors that may reduce the amount of memory automatically used. The
behavior differs slightly between the 32-bit (unix_mp) and 64-bit (unix_64) kernels. For the
64-bit kernel, the default calculation is limited such that no more than 1/128 of system
memory can be used by LMT, and no more than 256 MB by a single processor.
The 32-bit kernel uses the same default memory buffer size calculations, but further restricts
the total memory allocated for LMT (all processors combined) to 16 MB. Table 3-1 presents
some example LMT memory consumption.
Table 3-1 Example LMT memory consumption
POWER3 (375 MHz CPU), 1 CPU, 1 GB of memory: 8 MB of total LMT memory with the 64-bit kernel, 8 MB with the 32-bit kernel
POWER3 (375 MHz CPU), 2 CPUs, 4 GB of memory: 16 MB of total LMT memory with the 64-bit kernel, 16 MB with the 32-bit kernel
To determine the total amount of memory (in bytes) being used by LMT, enter the following
shell command:
echo mtrc | kdb | grep mt_total_memory
The 64-bit kernel resizes the LMT trace buffers in response to dynamic reconfiguration events
(for POWER4 and above systems). The 32-bit kernel does not resize; it will continue to use
the buffer sizes calculated during system initialization.
Note: For either kernel, in the rare case that there is insufficient pinned memory to allocate
an LMT buffer when a CPU is being added, the CPU allocation will fail. This can be
identified by a CPU_ALLOC_ABORTED entry in the AIX error log, with detailed data
showing an Abort Cause of 0000 0008 (LMT) and Abort Data of 0000 0000 0000 000C
(ENOMEM).
For example, to change the per cpu rare buffer size to sixteen 4 K pages, for this boot as well
as future boots, you would enter:
raso -p -o mtrc_rarebufsize=16
For more information about the memory trace buffer size tunables, refer to raso command
documentation.
Note: Internally, LMT tracing is temporarily suspended during any 64-bit kernel buffer
resize operation.
For the 32-bit kernel, the options are limited to accepting the default (automatically
calculated) buffer sizes, or disabling LMT (to completely avoid buffer allocation).
The LMT memory trace buffers are included in an AIX system dump. You can manipulate
them similarly to traditional AIX system trace buffers. The easiest method is to use the
trcdead command to extract the LMT buffers from the dump. The new -M parameter extracts
the buffers into files in the LMT log directory. The default LMT log directory is
/var/adm/ras/mtrcdir.
If the dump has compression set to on, the dump file needs to be uncompressed with the
dmpuncompress command before running kdb, as shown here:
dmpuncompress dump_file_copy.BZ
For example, to extract LMT buffers from a dump image called dump_file_copy, you can use:
trcdead -M dump_file_copy
Each buffer is extracted into a unique file, with a control file for each buffer type. This is
similar to the per CPU trace option in traditional AIX system trace. As an example, executing
the previous command on a dump of a two-processor system would result in the creation of
the following files:
ls /var/adm/ras/mtrcdir
mtrccommon mtrccommon-1 mtrcrare-0
mtrccommon-0 mtrcrare mtrcrare-1
The new -M parameter of the trcrpt command can then be used to format the contents of
the extracted files. The trcrpt command allows you to look at common file and rare files
separately. Beginning with AIX V5.3 TL5, both common and rare files can be merged
together chronologically.
Continuing the previous example, to view the LMT files that were extracted from the dumpfile,
you can use:
trcrpt -M common
and
trcrpt -M rare
These commands produce large amounts of output, so we recommend that you use the -o
option to save the output to a file.
Other trcrpt parameters can be used in conjunction with the -M flag to qualify the displayed
contents. As one example, you could use the following command to display only VMM trace
event group hook ids that occurred on CPU 1:
trcrpt -D vmm -C 1 -M common
Example 3-2 shows the output of the trcrpt command displaying a VMM trace:
Using the trcrpt command is the easiest and most flexible way to view lightweight memory
trace records.
Note: The data collected with the snap command is designed to be used by IBM service
personnel.
To create a pax image of all files contained in the /tmp/ibmsupt directory, use:
snap -c
We recommend that you run the snap commands in the following order:
1. Remove old snap command output from /tmp/ibmsupt:
snap -r
2. Gather all new system configuration information:
snap -a
3. Create pax images of all files contained in the /tmp/ibmsupt directory:
snap -c
The snap command has several flags that you can use to gather relevant system information.
For examples, refer to the snap man pages:
man snap
A minidump is a small, compressed dump that is stored to NVRAM when the system crashes
or a dump is initiated. It is written to the error log on reboot. The minidump can be used to
observe the system state and perform some debugging if a full dump is not available. It can
also be used to obtain a quick snapshot of a crash without having to transfer the entire dump
from the crashed system.
Minidumps will show up as error log entries with a label of MINIDUMP_LOG and a description
of COMPRESSED MINIMAL DUMP.
To view only the minidump entries in the error log, enter this command (see also
Example 3-3):
errpt -J MINIDUMP_LOG
When given a sequence number (seqno), the mdmprpt command will format the minidump
with the given sequence number. Otherwise, it will format the most recent minidump in the
error log, if any. It reads the corresponding minidump entry from the error log, and
uncompresses the detail data. It then formats the data into a human-readable output which is
sent to stdout.
The formatted output will start with the symptom string and other header information. Next it
will display the last error log entry, the dump error information if any, and then each CPU, with
an integrated stack (the local variables for each frame, if any, interleaved with the function
name+offset output).
The mdmprpt command will use the uncompressed header of the minidump to correctly
decompress and format the remainder. In the case of a user-initiated dump, the data
gathered will be spread evenly across all CPUs, because the failure point (if any) is unknown.
The size of NVRAM is extremely limited, so the more CPUs on the system, the less data that
will be gathered per CPU. In the case of a dump after a crash, the CPU that crashed will use
up most of the space, and any that remains will be split among the remaining CPUs.
Note: The mdmprpt command requires a raw error log file; it cannot handle output from the
errpt command.
Example 3-3 on page 62 displays partial output from a minidump. The -g flag tells snap to
gather the error log file.
# errpt -J MINIDUMP_LOG
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
F48137AC 1002144807 U O minidump COMPRESSED MINIMAL DUMP
Symptom Information:
Crash Location: [000000000042E80C] unknown
Component: COMP Exception Type: 267
The dumpctrl command can be used to manage system dumps and live dumps. The command
also has several flags and options; refer to the dumpctrl man pages (man dumpctrl) for more
information.
The component dump allows the user, subsystem, or system administrator to request that
one or more components be dumped. When combined with the live dump functionality,
component dump allows components to be dumped, without bringing down the entire system.
When combined with the system dump functionality, component dump allows you to limit the
size of the system dump or disaster dump.
Note that the main users of these functions are IBM service personnel. Examples of the live
dump and component dump functions are provided here so that you can perform these
dumps if requested.
In our case, our test system had the following dump-aware components:
Inter-process communication (ipc)
The ipc component currently has the following subcomponents: semaphores data (sem);
shared memory segments data (shm); message queues data (msg).
Virtual memory manager (vmm)
The vmm component currently has the following subcomponents: frame set data (frs);
memory pool data (memp); paging device data (pdt); system control data (pfhdata); page
size data (pst); interval data (vmint); kernel data (vmker); resource pool data (vmpool).
Enhanced journaled file system (jfs2)
The jfs2 component currently has the following subcomponents: file system data; log data.
Logical file system (lfs)
The lfs component currently has the following subcomponents: file system data; pile data;
General Purpose Allocator Interface (gpai) data (gpai is a collector of fifo, vfs, and vnode
data).
To see component-specific live dump attributes and system dump attributes on the system,
you can use the dumpctrl -qc command; sample output is shown in Example 3-4.
# dumpctrl -qc
-----------------------------------------------+------+-----------+------------
| Have | Live Dump | System Dump
Component Name |Alias | /level | /level
-----------------------------------------------+------+-----------+------------
dump
.livedump | YES | OFF/3 | ON/3
errlg | NO | OFF/3 | ON/3
ipc
.msg | YES | ON/3 | OFF/3
.sem | YES | ON/3 | OFF/3
.shm | YES | ON/3 | OFF/3
jfs2 | NO | ON/3 | OFF/3
filesystem
__1
.inode | NO | ON/3 | OFF/3
.snapshot | NO | ON/3 | OFF/3
_admin_9
.inode | NO | ON/3 | OFF/3
.snapshot | NO | ON/3 | OFF/3
_home_8
.inode | NO | ON/3 | OFF/3
.snapshot | NO | ON/3 | OFF/3
_opt_11
.inode | NO | ON/3 | OFF/3
.snapshot | NO | ON/3 | OFF/3
_tmp_5
.inode | NO | ON/3 | OFF/3
.snapshot | NO | ON/3 | OFF/3
_usr_2
.inode | NO | ON/3 | OFF/3
.snapshot | NO | ON/3 | OFF/3
_var_4
.inode | NO | ON/3 | OFF/3
.snapshot | NO | ON/3 | OFF/3
log
.A_3 | NO | ON/3 | OFF/3
You can access the SMIT main menu of Component/Live Dump by using the smitty
livedump command. The menu is shown in Example 3-5.
For example, if you are requested to run a live dump of the virtual memory manager's paging
device data (vmm.pdt), you would use the smitty livedump menu and the Start a Live Dump
submenu. The menu is shown in Example 3-6.
[Entry Fields]
Component Path Names
Component Logical Alias Names [@pdt]
Component Type/Subtype Identifiers []
Pseudo-Component Identifiers []
Error Code []
Symptom String [any-description]
File Name Prefix []
Dump Priority [critical] +
Generate an Error Log Entry [yes] +
The fields of the Start a Live Dump menu have the following meanings:
Component Path Names: Can be found in the output of dumpctrl. For instance, ipc.msg is a
component path name. Not needed if any of Alias/Type/Pseudo-component is specified.
Placing an at (@) sign in front of the component specifies that it is the failing component (to
replace the nocomp in the file name).
Component Logical Alias Names: Can be found in the output of dumpctrl for those
components where Have Alias is listed as YES. For instance, msg is a valid alias. Not needed
if any of Path Name/Type/Pseudo-component is specified. Placing an at (@) sign before the
alias specifies it as the failing component (to replace the nocomp in the file name).
Component Type/Subtype Identifiers: Specify types/subtypes from
/usr/include/sys/rasbase.h. Not all types or subtypes map to valid livedump components.
Pseudo-Component Identifiers: Built-in components not present in dumpctrl -qc. A list of
them can be obtained by selecting the help for this topic. They require parameters to be
specified with the identifiers. Details about the parameters are provided by an error message
if you do not specify the parameters. Not needed if any of Path Name/Alias/Type is specified.
Symptom String: Mandatory string that describes the nature of the live dump.
Dump Priority: Choice of critical or info. Critical dumps may delete info dumps if there is not
enough room in the file system. Critical dumps can also be held in kernel memory.
Generate an Error Log Entry: Whether or not to generate an error log entry when the live
dump completes.
Force this Dump: Applies to whether or not duplicate elimination can potentially abort this
live dump.
Dump Subcomponents: For any components specified by Path Name/Alias/Type, whether or
not to dump their subcomponents. Subcomponents appear under and to the right of other
components in dumpctrl -qc output. (For instance, ipc.sem and ipc.msg are subcomponents
of ipc.)
Dump Parent Components: For any components specified by Path Name/Alias/Type,
whether or not to dump their parent components. (For instance, the parent of ipc.sem is ipc.)
Obtain a Size Estimate: Do not take a dump, just get an estimate for the compressed size.
(No dump is taken.)
Show Parameters for Components: If a component has help related to it, show that help
information.
Before the dump file can be analyzed, it needs to be uncompressed with the dmpuncompress
command. Output from a dmpuncompress command is shown in Example 3-8.
# ls -l /var/adm/ras/livedump/pdt.200710111501.00.DZ
-rw------- 1 root system 4768 Oct 11 10:24
/var/adm/ras/livedump/pdt.200710111501.00.DZ
# dmpuncompress /var/adm/ras/livedump/pdt.200710111501.00.DZ
-- replaced with /var/adm/ras/livedump/pdt.200710111501.00
#
You can use the kdb command to analyze a live dump file. Output from a kdb command is
shown in Example 3-9.
# kdb /var/adm/ras/livedump/pdt.200710111501.00
/var/adm/ras/livedump/pdt.200710111501.00 mapped from @ 700000000000000 to @
70000000000cd88
Preserving 1675375 bytes of symbol table [/unix]
Component Names:
1) dump.livedump.header [3 entries]
2) dump.livedump.sysconfig [3 entries]
3) vmm.pdt [14 entries]
4) vmm.pdt.pdt8B [2 entries]
5) vmm.pdt.pdt8A [2 entries]
6) vmm.pdt.pdt89 [2 entries]
7) vmm.pdt.pdt88 [2 entries]
8) vmm.pdt.pdt87 [2 entries]
9) vmm.pdt.pdt86 [2 entries]
10) vmm.pdt.pdt81 [2 entries]
11) vmm.pdt.pdt82 [2 entries]
12) vmm.pdt.pdt85 [2 entries]
13) vmm.pdt.pdt84 [2 entries]
14) vmm.pdt.pdt83 [2 entries]
15) vmm.pdt.pdt80 [2 entries]
16) vmm.pdt.pdt0 [2 entries]
17) cu [2 entries]
Live Dump Header:
Passes: 1
Type: LDT_FORCE
Concurrent AIX Update    An update to the in-memory image of the base kernel and/or kernel extensions that is immediately active, or subsequently inactive, without a reboot. Synonyms: hot patch, live update, online update.
Control file    A text file that contains all the information required by the emgr command to perform any update, whether concurrent or not.
Timestamp    The timestamp recorded within the XCOFF header of the module targeted for updating. It is used to validate that the ifix was created for the proper module level.
Description file    A text file derived from the emgr control file that contains only the information relevant to a Concurrent AIX Update. The kpatch command requires this file to apply a Concurrent AIX Update.
ifix    Interim fix. A package delivered to the customer for temporary fixes. Synonym: efix.
Kernel extension    Used to refer to both kernel extensions and device drivers.
Concurrent updates are packaged by using the epkg command. The epkg command also has
multiple flags; for additional information about the flags refer to the man pages (use the man
epkg command). You can also obtain limited help information by simply entering the epkg
command.
EFIX Management
From the List EFIXES and Related Information menu, you can select an emergency fix (EFIX)
to be listed by the EFIX label, EFIX ID, or virtually unique ID (VUID). The menu is shown in
Example 3-12.
[Entry Fields]
EFIX Label [ALL] +
From the Install EFIX Packages menu, you can specify the location of the emergency fix
(EFIX) package to be installed or previewed. The EFIX package should be a file created with
the epkg command. The menu is shown in Example 3-13.
Important: Concurrent update package files will have a name ending with *.epkg.Z. These
files are compressed and should not be uncompressed before installation.
[Entry Fields]
LOCATION of EFIX Package [] /
-OR-
LOCATION of Package List File [] /
COMMAND STATUS
[TOP]
+-----------------------------------------------------------------------------+
Efix Manager Initialization
+-----------------------------------------------------------------------------+
Initializing log /var/adm/ras/emgr.log ...
Efix package file is: /usr/emgrdata/efixdata/beta_patch.070909.epkg.Z
MD5 generating command is /usr/bin/csum
MD5 checksum is a7710363ad00a203247a9a7266f81583
Accessing efix metadata ...
Processing efix label "beta_patch" ...
Verifying efix control file ...
+-----------------------------------------------------------------------------+
Installp Prerequisite Verification
+-----------------------------------------------------------------------------+
No prerequisites specified.
Building file-to-package list ...
+-----------------------------------------------------------------------------+
Efix Attributes
+-----------------------------------------------------------------------------+
LABEL: beta_patch
PACKAGING DATE: Sun Sep 9 03:21:13 CDT 2007
ABSTRACT: CU ifix for cu_kext
PACKAGER VERSION: 6
VUID: 00C6CC3C4C00090903091307
REBOOT REQUIRED: no
BUILD BOOT IMAGE: no
PRE-REQUISITES: no
SUPERSEDE: no
PACKAGE LOCKS: no
E2E PREREQS: no
FIX TESTED: no
ALTERNATE PATH: None
EFIX FILES: 1
Install Scripts:
PRE_INSTALL: no
POST_INSTALL: no
PRE_REMOVE: no
POST_REMOVE: no
File Number: 1
LOCATION: /usr/lib/drivers/cu_kext
FILE TYPE: Concurrent Update
INSTALLER: installp (new)
SIZE: 8
ACL: root:system:755
CKSUM: 60788
+-----------------------------------------------------------------------------+
Efix Description
+-----------------------------------------------------------------------------+
CU ifix - test ifix for /usr/lib/drivers/cu_kext
+-----------------------------------------------------------------------------+
Efix Lock Management
+-----------------------------------------------------------------------------+
Checking locks for file /usr/lib/drivers/cu_kext ...
+-----------------------------------------------------------------------------+
Space Requirements
+-----------------------------------------------------------------------------+
Checking space requirements ...
+-----------------------------------------------------------------------------+
Efix Installation Setup
+-----------------------------------------------------------------------------+
Unpacking efix package file ...
Initializing efix installation ...
+-----------------------------------------------------------------------------+
Efix State
+-----------------------------------------------------------------------------+
Setting efix state to: INSTALLING
+-----------------------------------------------------------------------------+
Efix File Installation
+-----------------------------------------------------------------------------+
Installing all efix files:
Installing efix file #1 (File: /usr/lib/drivers/cu_kext) ...
+-----------------------------------------------------------------------------+
Package Locking
+-----------------------------------------------------------------------------+
Processing package locking for all files.
File 1: no lockable package for this file.
+-----------------------------------------------------------------------------+
Reboot Processing
+-----------------------------------------------------------------------------+
Reboot is not required by this efix package.
+-----------------------------------------------------------------------------+
Efix State
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
Operation Summary
+-----------------------------------------------------------------------------+
Log file is /var/adm/ras/emgr.log
With the Remove Installed EFIXES menu, you can select an emergency fix to be
removed. The menu is shown in Example 3-15.
[Entry Fields]
EFIX Label [] +
-OR-
LOCATION of EFIX List File [] /
We also tested the removal of the efix. Part of the output from the test is shown in
Example 3-16.
COMMAND STATUS
[TOP]
+-----------------------------------------------------------------------------+
Efix Manager Initialization
+-----------------------------------------------------------------------------+
Initializing log /var/adm/ras/emgr.log ...
Accessing efix metadata ...
Processing efix label "beta_patch" ...
+-----------------------------------------------------------------------------+
Efix Attributes
+-----------------------------------------------------------------------------+
From the Check Installed EFIXES menu, you can select the fixes to be checked. The menu is
shown in Example 3-17.
[Entry Fields]
EFIX Label [] +
-OR-
LOCATION of EFIX List File [] /
This section explains storage protection keys and describes how kernel programmers and
application programmers use them. The storage protection key concept was adopted from
the z/OS and System/390 environments, so AIX can confidently rely on this feature and its robustness.
Software might use keys to protect multiple classes of data individually and to control access
to data on a per-context basis. This differs from the older page protection mechanism, which
is global in nature.
Storage keys are available in both kernel-mode and user-mode application binary interfaces
(ABIs). In kernel-mode ABIs, storage key support is known as kernel keys; in user-mode ABIs, it is known as user keys.
A memory protection domain generally uses multiple storage protection keys to achieve
additional protection. AIX Version 6.1 divides the system into four memory protection
domains, as described here:
Kernel public This term refers to kernel data that is available without restriction to
the kernel and its extensions, such as stack, bss, data, and areas
allocated from the kernel or pinned storage heaps.
Kernel private This term refers to data that is largely private within the AIX kernel
proper, such as the structures representing a process.
Kernel extension This term refers to data that is used primarily by kernel extensions,
such as file system buf structures.
User This term refers to data in an application address space that might be
using key protection to control access to its own data.
One purpose of the various domains (except for kernel public) is to protect data in a domain
from coding accidents in another domain. To a limited extent, you can also protect data within
a domain from other subcomponents of that domain. When coding storage protection into a
kernel extension, you can achieve some or all of the following RAS benefits:
Protect data in user space from accidental overlay by your extension
Respect private user key protection used by an application to protect its own private data
Protect kernel private data from accidental overlay by your extension
Protect your private kernel extension data from accidental overlay by the kernel, by other
kernel extensions, and even by subcomponents of your own kernel extension
Note: Storage protection keys are not meant to be used as a security mechanism. Keys
are used following a set of voluntary protocols by which cooperating subsystem designers
can better detect, and subsequently repair, programming errors.
Design considerations
The AIX 64-bit kernel makes extensive use of a large flat address space by design. This
produces a significant performance advantage, but also adds Reliability, Availability, and
Serviceability (RAS) costs.
Storage keys were introduced into PowerPC architecture to provide memory isolation while
still permitting software to maintain a flat address space. Large 64-bit applications, such as
DB2, use a global address space for similar reasons and also face issues with memory
overlays.
A new CPU special purpose register, known as the Authority Mask Register (AMR), has been
added to define the keyset that the CPU has access to. The AMR is implemented as a vector of
bit pairs indexed by key number, with distinct bits to control read and write access for
each key. Key protection is in addition to the existing page protection mechanism.
The AMR is a per-context register that can be updated efficiently. The TLB/ERAT contains
storage key values for each virtual page. This allows AMR updates to be efficient, because
they do not require TLB/ERAT invalidation. The POWER hardware enables a mechanism
that software can use to efficiently change storage accessibility.
Ideally, each storage key would correspond to a hardware key. However, due to the limited
number of hardware keys with current Power Architecture, more than one kernel key is
frequently mapped to a given hardware key. This key mapping or level of indirection may
change in the future as architecture supports more hardware keys.
Another advantage that indirection provides is that key assignments can be changed on a
system to provide an exclusive software-to-hardware mapping for a select kernel key. This is
an important feature for testing fine granularity keys. It could also be used as an SFDC tool.
Kernel keys provide a formalized and abstracted API to map kernel memory classes to a
limited number of hardware storage keys.
For each kernel component, data object (virtual pages) accessibility is determined. Then
component entry points and exit points may be wrapped with protection gates. Protection
gates change the AMR as components are entered and exited, thus controlling what storage
can be accessed by a component.
At configuration time, the module initializes its kernel keyset to contain the keys required for
its runtime execution. The kernel keyset is then converted to a hardware keyset. When the
module is entered, the protection gates set the AMR to the required access authority
efficiently by using a hardware keyset computed at configuration time.
User keys work in application programs. They are a virtualization of the PowerPC storage
hardware keys. Access rights can be added to and removed from a user-space AMR, and a
user key can be assigned, as appropriate, to an application's memory pages. Management
and abstraction of user keys is left to application developers.
Note: You may want to disable kernel keys if one of your kernel extensions is causing key
protection errors but you need to be able to run the system even though the error has not
been fixed.
The current storage key setting for our test system is shown in Example 3-20.
You can change the next boot storage key setting by using the second option of the SMIT
menu shown in Example 3-19.
Note: Although you may be able to use the command line for checking and altering storage
key settings, this is not supported for direct use. Only SMIT menus are supported.
With kernel key support, the AIX kernel introduces kernel domains and private memory
access. Kernel domains are component data groups that are created to segregate sections of
the kernel and kernel extensions from each other. Hardware protection of kernel memory
domains is provided and enforced. Also, global storage heaps are separated and protected.
This keeps heap corruption errors within kernel domains.
There are also private memory keys that allow memory objects to be accessed only by
authorized components. In addition to RAS benefits, private memory keys are a tool to
enforce data encapsulation. There is a static kernel key-to-storage key mapping function set
by the kernel at boot time. This mapping function is dependent on the number of storage keys
that are present in the system.
Note: Kernel keys are not intended to provide a security function. There is no infrastructure
provided to authorize access to memory. The goal is to detect and prevent accidental
memory overlays. The data protection kernel keys provide can be circumvented, but this is
by design. Kernel code, with the appropriate protection gate, can still access any memory
for compatibility reasons.
Analogy
To understand the concept of kernel storage keys, consider a simple analogy. Assume there
is a large house (the kernel) with many rooms (components and device drivers) and many
members (kernel processes and kernel execution path), and each member has keys only for
a few other selected rooms in addition to its own room (keyset).
Therefore, members having a key for a room are considered safe to be allowed inside. Every
time a member wants to enter a room, the member needs to see whether its keyset contains
a key for that room.
If the member does not have the corresponding key, it can either create a key that permits it
to enter the room (that is, add the key to its keyset), or it can try to enter without a key. If the
member tries to enter without a key, an alarm trips (causing a DSI/kernel crash) and everything
comes to a halt, because one member (a component or kernel execution path) tried to intrude
into an unauthorized room.
Kernel keys
The kernel's data is classified into kernel keys according to intended use. A kernel key is a
software key that allows the kernel to create data protection classes, regardless of the
number of hardware keys available. A kernel keyset is a representation of a collection of
kernel keys and the desired read or write access to them. Remember that several kernel keys
can map to the same hardware key.
KKEY_PUBLIC This kernel key is always necessary for access to a program's stack, bss,
and data regions. Data allocated from the pinned_heap and the
kernel_heap is also public.
KKEY_BLOCK_DEV This kernel key is required for block device drivers. Their buf structs must
be either public or in this key.
KKEY_COMMO This kernel key is required for communication drivers. CDLI structures must
be either public or in this key.
KKEY_NETM This kernel key is required for network and other drivers to reference
memory allocated by net_malloc.
KKEY_DMA This kernel key is required for DMA information (DMA handles and EEH
handles).
KKEY_TRB This kernel key is required for timer services (struct trb).
KKEY_FILE_SYSTEM This kernel key is required to access vnodes and gnodes (vnop callers).
Note: Table 3-4 shows a preliminary list of kernel keys. As kernel keys are added to
components, additional kernel keys will be defined.
Kernel keysets
Note: Not all keys in the kernel key list are currently enforced. However, it is always safe to
include them in a keyset.
Because the full list of keys might evolve over time, the only safe way to pick up the set of
keys necessary for a typical kernel extension is to use one of the predefined kernel keysets,
as shown in the following list.
KKEYSET_KERNEXT The minimal set of keys needed by a kernel extension.
KKEYSET_COMMO Keys needed for a communications or network driver.
KKEYSET_BLOCK Keys needed for a block device driver.
KKEYSET_GRAPHICS Keys needed for a graphics device driver.
KKEYSET_USB Keys needed for a USB device driver.
See sys/skeys.h for a complete list of the predefined kernel keysets. These keysets provide
read and write access to the data protected by their keys. If you want simply read access to
keys, those sets are named by appending _READ (as in KKEYSET_KERNEXT_READ).
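The following minimal sketch illustrates how such a keyset might be built at configuration time. The kkeyset_* services and sys/skeys.h are described later in this section; the variable names and the KA_RW access flag are illustrative assumptions, and error handling is reduced to a simple return.

#include <sys/types.h>
#include <sys/skeys.h>          /* kernel key and keyset definitions */

static kkeyset_t my_kkeyset;    /* abstract kernel keyset            */
static hkeyset_t my_hkeyset;    /* hardware keyset used by the gates */

/* Called once from the config entry point, before any protection gate
 * that relies on my_hkeyset is executed.
 */
static int
init_my_keysets(void)
{
        kerrno_t rc;

        if ((rc = kkeyset_create(&my_kkeyset)) != 0)
                return -1;

        /* Start from the minimal keys every kernel extension needs. */
        if ((rc = kkeyset_add_set(my_kkeyset, KKEYSET_KERNEXT)) != 0)
                return -1;

        /* This extension also calls vnode operations, so it needs
         * KKEY_FILE_SYSTEM. KA_RW (read/write access) is assumed here.
         */
        if ((rc = kkeyset_add_key(my_kkeyset, KKEY_FILE_SYSTEM, KA_RW)) != 0)
                return -1;

        /* Convert the kernel keyset to the AMR value used by gates. */
        if ((rc = kkeyset_to_hkeyset(my_kkeyset, &my_hkeyset)) != 0)
                return -1;

        return 0;
}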
For all supported hardware configurations, the ability exists to force one kernel key to be
mapped exclusively to one hardware key. Kernel mappings can be expected to become more
fine-grained as the number of available hardware keys increases. These mappings will be
updated as new keys are defined.
The hardware key number (a small integer) is associated with a virtual page through the page
translation mechanism. The AMR special purpose register is part of the executing context;
see Figure 3-2. The 32-bit-pair keyset identifies the keys that the kernel has at any point of
execution. If a keyset does not have the key to access a page, it will result in a DSI/system
crash.
[Figure 3-2 shows kernel keys (KKEY_UPRIVATE1, KKEY_UPUBLIC, KKEY_PUBLIC, KKEY_BLOCK_DEV, KKEY_FILE_SYSTEM, KKEY_FILE_DATA, KKEY_COMMO, KKEY_NETM, KKEY_USB, KKEY_GRAPHICS, KKEY_DMA, KKEY_TRB, KKEY_IOMAP, KKEY_LDR, KKEY_LFS, KKEY_J2, KKEY_PRIVATE1-32, and kernel-internal keys such as KKEY_LDATALOC, KKEY_VMM_PMAP, and KKEY_KER) being mapped from the 64-bit keyset to the hardware keys KEY0-KEY31 held in the AMR, with each PFT entry recording the hardware key of its page.]
Figure 3-2 PFT entry with AMR for a typical kernel process/execution path on P6 hardware
For P6 hardware, there are only eight available hardware keys, so each keyset will be
mapped to a 16-bit AMR. Each bit-pair in AMR may have more than one key mapped to it.
For example, if key 4 is set in AMR, that means at least one of KKEY_COMMO,
KKEY_NETM, KKEY_USB, and KKEY_GRAPHICS has been added to the hardware keyset.
Two base kernel domains are provided: hardware key 6 is used for critical kernel functions,
and hardware key 7 is used for all other base kernel keys. Hardware key 5 is used for kernel
extension private data keys, and hardware keys 3 to 5 are used for kernel extension domains.
Two hardware keys are dedicated to user mode keys; KKEY_UPRIVATE1 is allocated by default for
potential user mode use.
It is the kernel's responsibility to ensure that legacy code continues to function as it did on
prior AIX releases and on hardware without storage key support, even though such code
might access kernel private data.
When a key-unsafe function is called, the kernel must, in effect, transparently insert special
glue code into the call stack between the calling function and the key-unsafe called function.
This is done automatically, but it is worth understanding the mechanism because the inserted
glue code is visible in stack callback traces.
When legacy code is called, either directly by calling an exported function or indirectly by
using a function pointer, the kernel must:
1. Save the caller's current key access rights (held in the AMR).
2. Save the caller's link register (LR).
3. Replace the current AMR value with one granting broad data access rights.
4. Proceed to call the key-unsafe function, with the link register set, so that the called
function returns to the next step.
5. Restore the original caller's AMR and LR values.
6. Return to the original caller.
The AMR update must be performed transparently, thus the new AMR stack had to be
developed. The new resource is also called a context stack. The current context stack pointer
is maintained in the active kmstsave structure, which holds the machine state for the thread
or interrupt context. Use the mst kernel debugger command to display this information. The
context stack is automatically pinned for key-unsafe kernel processes. The setjmpx and
longjmpx kernel services maintain the AMR and the context stack pointer.
When a context stack frame needs to be logically inserted between standard stack frames,
the affected function (actually, the function's traceback table) is flagged with an indicator. The
debugger recognizes this and is able to provide you with a complete display for the stack
trace. The inserted routine is named hkey_legacy_gate. A similar mechanism is applied at
many of the exported entry points into the kernel, where you might observe the use of
kernel_add_gate and kernel_replace_gate.
This processing adds overhead when an exported key-unsafe function is called, but only
when the called function is external to the calling module. Exported functions are represented
by function descriptors that are modified by the loader to enable the AMR changing service to
front-end exported services. Intramodule calls do not rely on function descriptors for direct
calls, and thus are not affected.
All indirect function pointer calls in a key-aware kernel extension go through special
kernel-resident glue code that performs the automatic AMR manipulations as described. If
you call out this way to key-unsafe functions, the glue code recognizes the situation and takes
care of it for you. Hence, a key-aware kernel extension must be compiled with the
-q noinlglue option for glue code.
Your initialization or configuration entry point cannot start off with a protection gate whose
underlying hardware keyset it must first compute. Only after setting up the necessary
hardware keysets can you implement your protection gates.
The computation of these keysets should be done only once (for example, when the first
instance of an adapter is created). These are global resources used by all instances of the
driver. Until you can use your new protection gates, you must be sure to reference only data
that is unprotected, such as your stack, bss, and data regions.
If this is particularly difficult for some reason, you can statically initialize your hardware keyset
to HKEYSET_GLOBAL. That initial value allows your protection gates to work even before
you have constructed your kernel and hardware keysets, although they would grant the code
following them global access to memory until after the hardware keysets have been properly
initialized. If your extension accepts and queues data for future asynchronous access, you
might also need to use HKEYSET_GLOBAL, but only if this data is allowed to be arbitrarily
key-protected by your callers. Use of the global keyset should be strictly minimized.
If you want to be certain that a hardware keyset is not used unexpectedly, statically initialize it
to HKEYSET_INVALID. A replace gate with this hardware keyset would revoke all access to
memory and cause a DSI almost immediately.
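The following fragment is a minimal sketch of this static initialization pattern. The names are illustrative; my_hkeyset would later be computed with kkeyset_to_hkeyset() in the configuration entry point, and any gate executed before that point would cause an immediate DSI rather than silently run with global access.

#include <sys/skeys.h>

/* Deliberately invalid until the configuration code computes the real
 * hardware keyset; any premature use of the gate below causes a DSI.
 */
static hkeyset_t my_hkeyset = HKEYSET_INVALID;

int
my_entry_point(void)
{
        hkeyset_t saved;

        saved = hkeyset_replace(my_hkeyset);   /* replace gate */
        /* ... key-protected work ... */
        hkeyset_restore(saved);                /* give the caller's keys back */
        return 0;
}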
Your protection gates protect kernel data and the data of other modules from many of the
accidental overlays that might originate in your extension. It should not be necessary to
change any of the logic of your module to become key safe. But your module's own data
remains unprotected. The next step is to protect your kernel extension.
Making a kernel extension fully key-protected adds more steps to the port. You now must also
follow these steps:
1. Analyze your private data, and decide which of your structures can be key-protected.
You might decide that your internal data objects can be partitioned into multiple classes,
according to the internal subsystems that reference them, and use more than one private
key to achieve this.
2. Consider that data allocated for you by special services might require you to hold specific
keys.
3. Construct hardware keysets as necessary for your protection gates.
4. Consider using read-only access rights for extra protection. For example, you might switch
to read-only access for private data being made available to an untrusted function.
5. Allocate one or more private kernel keys to protect your private data.
6. Construct a heap (preferably, or use another method for allocating storage) protected by
each kernel key you allocate, and substitute this heap (or heaps) consistently in your
existing xmalloc and xmfree calls.
When substituting, pay particular attention that you replace use of kernel_heap with a
pageable heap, and the use of pinned_heap with a pinned one. Also be careful to always
free allocated storage back to the heap from which it was allocated. You can use malloc
and free as a shorthand for xmalloc and xmfree from the kernel_heap, so be sure to also
check for these.
7. Understand the key requirements of the services you call. Some services might only work
if you pass them public data.
You need to collect individual global variables into a single structure that can be xmalloced
with key protection. Only the pointer to the structure and the hardware keyset necessary to
access the structure need to remain public.
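As a sketch of this approach (following the heap_create()/xmalloc() pattern used elsewhere in this chapter; the heapaddr_t handle type, the header names, and the way the private key my_private_key was allocated are assumptions), formerly global variables can be gathered into one structure allocated from a key-protected heap, with only the pointer and the hardware keyset left public:

#include <sys/types.h>
#include <sys/malloc.h>        /* xmalloc(), heap_create() (assumed header) */
#include <sys/skeys.h>

/* Formerly separate global variables, collected into one structure. */
struct my_globals {
        int  open_count;
        long stats[8];
};

static struct my_globals *my_gp;     /* the pointer itself stays public   */
static heapaddr_t         my_heap;   /* key-protected heap (type assumed) */

static int
alloc_protected_globals(kkey_t my_private_key)
{
        heapattr_t heapattr;

        bzero(&heapattr, sizeof(heapattr));
        heapattr.hpa_eyec        = EYEC_HEAPATTR;
        heapattr.hpa_version     = HPA_VERSION;
        heapattr.hpa_flags       = HPA_PINNED | HPA_SHARED;
        heapattr.hpa_debug_level = HPA_DEFAULT_DEBUG;
        heapattr.hpa_kkey        = my_private_key;   /* protect the heap */

        if (heap_create(&heapattr, &my_heap) != 0)
                return -1;

        /* Only code holding my_private_key in its keyset can dereference
         * my_gp after this allocation (16-byte alignment requested).
         */
        my_gp = xmalloc(sizeof(struct my_globals), 4, my_heap);
        return (my_gp == NULL) ? -1 : 0;
}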
The private key or keys you allocate share hardware keys with other kernel keys, and
perhaps even with each other. This affects the granularity of protection that you can achieve,
but it does not affect how you design and write your code. So, write your code with only its
kernel keys in mind, and do not concern yourself with mapping kernel keys to hardware keys
for testing purposes. Using multiple keys requires additional protection gates, which might not
be justifiable in performance-sensitive areas.
If you make a kernel extension key-aware, you must add explicit protection gates, typically at
all entry and exit points of your module. The gates exist to ensure that your module has
access to the data it requires, and does not have access to data it does not require.
Without a gate at (or near) an entry point, code would run with whichever keys the caller
happened to hold. This is something that should not be left to chance. Part of making a kernel
extension key-aware is deciding explicitly, at each entry point, which keys it runs with.
If a directly called service is not passed any parameters pointing to data that might be in an
arbitrary key, a replace gate should be used in preference to an add gate, because it is a
stronger form of protection. Generally, calls within your module do not need gates, unless you
want to change protection domains within your module as part of a multiple keys component
design.
Protection gates might be placed anywhere in your program flow, but it is often simplest to
identify and place gates at all the externally visible entry points into your module. However,
there is one common exception: you can defer the gate briefly while taking advantage of your
caller's keys to copy potentially private data being passed into public storage, and then switch
to your own keyset with a replace gate.
This technique yields stronger storage protection than the simpler add gate at the entry point.
When using this approach, you must restore the caller's keys before public data can be
copied back through a parameter pointer. If you need both the caller's keys and your own
keys simultaneously, you must use an add gate.
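The following sketch shows the deferred-gate pattern just described. The names (my_req, my_hkeyset, do_the_work()) are illustrative; the point is that the possibly key-protected argument is copied while the caller's keys are still active, the module then switches to its own keyset with a replace gate, and the caller's keys are restored before results are copied back through the caller's pointer.

#include <sys/types.h>
#include <sys/skeys.h>

struct my_req { int cmd; long arg; };        /* illustrative request structure */

extern hkeyset_t my_hkeyset;                 /* computed at configuration time */

static int do_the_work(struct my_req *r) { return r->cmd; }   /* placeholder */

int
my_ioctl_entry(struct my_req *caller_arg)
{
        struct my_req req;       /* local copy on the (public) stack */
        hkeyset_t     callers_keys;
        int           rc;

        /* Still running with the caller's keys: copy the possibly
         * key-protected argument before switching keysets.
         */
        req = *caller_arg;

        /* Replace gate: from here on only this module's keys apply. */
        callers_keys = hkeyset_replace(my_hkeyset);

        rc = do_the_work(&req);

        /* Restore the caller's keys before copying results back
         * through the caller's pointer.
         */
        hkeyset_restore(callers_keys);
        *caller_arg = req;

        return rc;
}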
To identify the entry points of your kernel extension, be sure to consider the following typical
entry points:
Device switch table callbacks, such as:
open
close
read
write
ioctl
strategy
select
print
You generally need protection gates only to set up access rights to non-parameter private
data that your function references. It is the responsibility of called programs to ensure the
ability to reference any of their own private data. When your caller's access rights are known
to be sufficient, protection gates are not needed.
Memory allocation
Issues related to heap changes and associated APIs are beyond the scope of this document.
For information about this topic, refer to the AIX Information Center for 6.1 at the following
site:
http://publib.boulder.ibm.com/infocenter/pseries/v6r1/index.jsp
A configuration entry point can execute an explicit protection gate after its keysets are
initialized. Future calls to other extension interfaces use protection gates (if the developer
has coded them).
In these two cases, the kernel is required to load the legacy keyset before calling an
initialization function contained in a key-unsafe extension.
Note: The kernel must be built using bosboot -aD to include the kernel debugger. Without
this, you will not see the kernel printfs, and the dsi will not pop you into kdb, but will just
take a dump.
When system call kkey_test() is called with parameter=0, it tries to access private heap with
KKEY_VMM in its protection gate (as shown in Example 3-28 on page 96).
When system call kkey_test() is called with parameter>0, it tries to access private heap
without KKEY_VMM in its protection gate (as shown in Example 3-29 on page 96).
service: service.c
$(CC) -o service service.c
The kernel extension will create a private heap protected by key KKEY_VMM. Then kernel
extension will try to access it with and without KKEY_VMM in its keyset.
When trying access without KKEY_VMM, it should cause a DSI with exception
EXCEPT_SKEY; see Example 3-22.
if ((rc=kkeyset_to_hkeyset(myset, &hwset))!=0){
printf("kkeyset_to_hkeyset() failed: rc=%lx\n",rc);
return -1;
}
if ((rc=kkeyset_to_hkeyset(myset, &hwset))!=0){
printf("kkeyset_to_hkeyset() failed: rc=%lx\n",rc);
return -1;
}
printf("hardware keyset after KKEY_VMM=%016lx\n",hwset);
/*
* Create a heap protected by the key KKEY_VMM
*/
bzero(&heapattr, sizeof(heapattr));
heapattr.hpa_eyec=EYEC_HEAPATTR;
heapattr.hpa_version=HPA_VERSION;
heapattr.hpa_flags=HPA_PINNED|HPA_SHARED;
heapattr.hpa_debug_level=HPA_DEFAULT_DEBUG;
/*
* The heap will be protected by key==heapattr.hpa_kkey=KKEY_VMM
* So other extensions/components should have KKEY_VMM in their keyset in
* order to access it.
*/
heapattr.hpa_kkey=kkey;
if ((rc=heap_create(&heapattr, &my_heap))!=0 ){
printf("heap_create() failed\n");
return -1;
}
/*
* Add the current keyset={KKEYSET_KERNEXT, KKEY_VMM} to the current kernel
* extension/system call. This is done with the help of a protection
* gate. If you do not do this, you will not be able to access the private heap
* created in line#75, because that heap is protected by key KKEY_VMM
*/
oldhwset=hkeyset_replace(hwset);
/*
* Assign a page from our private kernel heap which is protected by
* keyset={KKEYSET_KERNEXT, KKEY_VMM}.
*/
caddr_t page=xmalloc(4096, 12, my_heap);
if (page==NULL){
printf("xmalloc() failed");
return -1;
}
/* Test KA_READ access on the heap. Since we have the key in our keyset (from
* line#52), we will be able to access it. If we did not, it would have
* caused a DSI or machine crash.
*/
Sample code for a tool used for loading and unloading the kernel extension is shown in
Example 3-25.
Example 3-25 Tool for loading and unloading kernel extension: service.c
#include <sys/types.h>
#include <sys/sysconfig.h>
#include <errno.h>
#define ADD_LIBPATH() \
strcpy(g_binpath,*(argv+2)); \
g_cfg_load.path=g_binpath; \
if (argc<4) \
g_cfg_load.libpath=NULL; \
else{ \
strcpy(g_libpath,*(argv+3)); \
g_cfg_load.libpath=g_libpath;\
} \
g_cfg_load.kmid = 0;
if (!(strcmp("--load",*(argv+1)))){
ADD_LIBPATH();
/*
* Load the kernel extension
*/
if (sysconfig(SYS_SINGLELOAD, &g_cfg_load, sizeof(struct cfg_load))==CONF_SUCC){
printf("[%s] loaded successfully, kernel mid=[%d]\n",g_binpath,g_cfg_load.kmid);
Try to load the kernel extension and call it first with an appropriate protection gate (one that
includes KKEY_VMM), and then with a protection gate that does not include KKEY_VMM.
To view the output of the kernel extension (system call), you need console access. All output
will go to the hardware console when you use printf() in kernel mode. Compile the program
using the command shown in Example 3-26.
Run myprog without any argument, which means it will run the kernel extension (syscall) with
the appropriate protection gate {{KKEYSET_KERNEXT}+KKEY_VMM}. Because the private
heap of the kernel extension has been protected by KKEY_VMM, you can access it and the
system call will return without any DSI/crash.
Execute the user-level program that will call the kernel system call with protection gate
enabled with KKEY_VMM, as shown in Example 3-28.
Now execute the user-level program that will call the kernel system call with a protection gate
without KKEY_VMM key in its keyset (see Example 3-29 on page 96). This will cause the
kernel to crash. Running myprog with argument>2 will do that.
Because the protection gate does not include KKEY_VMM, which protects the private heap of the
kernel extension, a DSI will result and the kernel will crash. Keep in mind that exceptions of
type EXCEPT_SKEY cannot be caught with setjmpx(), so the kernel programmer cannot catch
EXCEPT_SKEY.
Note: There are kernel services provided for catching storage-key exceptions. However,
their use is not recommended and they are provided only for testing purposes.
When you are debugging in kernel mode (see Example 3-30), you can see the value of the
AMR (the current hardware keyset, or protection gate), the current process that caused the
exception, and the exception type. Refer to 3.7.8, Kernel debugger commands on page 108
for more information.
KDB(6)> dr amr
amr : F00F000000000000
hkey 2 RW PUBLIC
hkey 3 RW BLOCK_DEV LVM RAMDISK FILE_SYSTEM NFS CACHEFS AUTOFS KRB5 ...
hkey 4 RW COMMO NETM IPSEC USB GRAPHICS
hkey 5 RW DMA PCI VDEV TRB IOMAP PRIVATE1 PRIVATE17 PRIVATE9 PRIVATE25 ...
KDB(6)> dk 1 @r4
Protection for sid 4000006000000, pno 00C00134, raddr 81C7A000
Hardware Storage Key... 7
Page Protect key....... 0
No-Execute............. 0
KDB(6)> p
SLOT NAME STATE PID PPID ADSPACE CL #THS
NAME....... myprog
STATE...... stat :07 .... xstat :0000
FLAGS...... flag :00200001 LOAD EXECED
........... flag2 :00000001 64BIT
........... flag3 :00000000
........... atomic :00000000
........... secflag:0001 ROOT
LINKS...... child :0000000000000000
........... siblings :0000000000000000
........... uidinfo :000000000256BA78
........... ganchor :F10008070100AC00 <pvproc+100AC00>
THREAD..... threadlist :F10008071080B100 <pvthread+80B100>
DISPATCH... synch :FFFFFFFFFFFFFFFF
AACCT...... projid :00000000 ........... sprojid :00000000
........... subproj :0000000000000000
........... file id :0000000000000000 0000000000000000 00000000
........... kcid :00000000
........... flags :0000
(6)> more (^C to quit) ?
KDB(6)> mst
Machine State Save Area
iar : F1000000908E34C8 msr : 8000000000009032 cr : 80243422
lr : F1000000908E34BC ctr : 00000000007CD2A0 xer : 20000001
mq : FFFFFFFF asr : FFFFFFFFFFFFFFFF amr : F00F000000000000
r0 : F1000000908E34BC r1 : F00000002FF47480 r2 : F1000000908E4228
The following APIs are available for managing kernel and hardware keysets:
kerrno_t kkeyset_create(kkeyset_t *set)
This creates a kernel keyset.
kerrno_t kkeyset_delete(kkeyset_t set)
This deletes a kernel keyset.
kerrno_t kkeyset_add_key(kkeyset_t set, kkey_t key, unsigned long flags)
This adds a kernel key to a kernel keyset.
kerrno_t kkeyset_add_set(kkeyset_t set, kkeyset_t addset)
This adds a kernel keyset to an existing kernel keyset.
kerrno_t kkeyset_remove_key(kkeyset_t set, kkey_t key, unsigned long flags)
This removes a kernel key- from an existing kernel keyset.
kerrno_t kkeyset_remove_keyset(kkeyset_t set, kkeyset_t removeset)
This removes members of one kernel keyset from an existing kernel
keyset.
kerrno_t kkeyset_to_hkeyset(kkeyset_t kkeyset, hkeyset_t *hkey)
This computes the hardware key (AMR value) that provides memory
access specified by the inputted kernel keyset.
hkeyset_t hkeyset_add(hkeyset_t keyset)
This updates the protection domain by adding the hardware-keyset
specified by keyset to the currently addressable hardware-keyset. The
previous hardware-keyset is returned.
hkeyset_t hkeyset_replace(hkeyset_t keyset)
This updates the protection domain by loading the set specified by
keyset as the currently addressable storage set. The previous
hardware-keyset is returned.
void hkeyset_restore(hkeyset_t keyset)
This updates the protection domain by loading the set specified by
keyset as the currently addressable storage set. No return value is
provided by this function. Because this service is slightly more efficient
than hkeyset_replace(), it can be used to load a hardware-keyset
when the previous keyset does not need to be restored.
hkeyset_t hkeyset_get()
This reads the current hardware-key-set without altering it.
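As a brief sketch of how these services pair up (the names are illustrative, and my_hkeyset is assumed to have been computed earlier with kkeyset_to_hkeyset()), an add gate grants this module's keys in addition to whatever the caller already holds, and the saved value is restored on exit:

#include <sys/skeys.h>

extern hkeyset_t my_hkeyset;    /* computed earlier from a kernel keyset */

void
my_callback(void)
{
        /* Add gate: keep the caller's access rights and add our own.
         * Useful when both the caller's data and ours must be touched.
         */
        hkeyset_t saved = hkeyset_add(my_hkeyset);

        /* ... work that touches caller-keyed and module-keyed data ... */

        hkeyset_restore(saved);  /* return to exactly the caller's keys */
}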
Hardware keysets can also be statically assigned several predefined values. This is often
useful to deal with use of a hardware keyset before a component can initialize it.
HKEYSET_INVALID
This keyset is invalid. When used, it will cause a storage-key
exception on the next data reference.
HKEYSET_GLOBAL
This keyset allows access to all kernel keys. It is implemented as an
all zeroes AMR value.
To understand how a user key is useful, take a brief look at DB2. DB2 provides a UDF facility
where customers can add extra code to the database. There are two modes UDFs can run in:
fenced and unfenced, as explained here:
Fenced mode
UDFs are isolated from the database by execution under a separate process. Shared
memory is used to communicate between the database and UDF process. Fenced mode
has a significant performance penalty because a context switch is required to execute the
UDF.
Unfenced mode
The UDF is loaded directly into the DB2 address space. Unfenced mode greatly improves
performance, but introduces a significant RAS exposure.
Although DB2 recommends using fenced mode, many customers use unfenced mode for
performance reasons. It is expected that only private user data can be protected with user
keys. There are still exposures in system data, such as library data that is likely to remain
unprotected.
Hardware keys are separated into a user mode pool and a kernel mode pool for several
reasons. First, an important feature of kernel keys is to prevent accidental kernel references
to user data. If the same hardware key is used for both kernel and user data, then kernel
components that run with that hardware key can store to user space. This is avoided.
Separating the hardware keys simplifies user memory access services such as copyin().
Because the hardware keys are separated, the settings for kernel mode access and user
mode access can be contained in a single AMR. This avoids a costly requirement to alter the
AMR between source and destination buffer accesses.
When user keys are disabled, sysconf(_SC_AIX_UKEYS) returns zero (0), indicating that the
feature is not available. Applications that discover the feature is not available should not call
other user key-related services. These services fail if called when user keys are disabled.
Support for user keys is provided for both 32-bit and 64-bit APIs. In 32-bit mode, compiler
support for long long is required to use user keys. User key APIs will be provided in an AIX
V5.3 update, as well as in AIX V6.1.
Applications that use these APIs have an operating system load time requisite. To avoid this
requisite, it is recommended that the application conditionally load a wrapper module that
makes reference to the new APIs. The wrapper module is only loaded when it is determined
that user keys are available; that is, when the configured number of user keys is greater than
zero.
The UKEY values are an abstraction of storage keys. These key values are the same across
all applications. For example, if one process sets a shared page to UKEY_PRIVATE1, all
processes need UKEY_PRIVATE1 authority to access that page.
The sysconf() service can be used to determine if user keys are available without load time
dependencies. Applications must use ukey_enable() to enable user keys before user key
APIs can be used. All user memory pages are initialized to be in UKEY_PUBLIC. Applications
have the option to alter the user key for specific data pages that should not be publicly
accessible. User keys may not be altered on mapped files. The application must have write
authority to shared memory to alter the user key.
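The following user-space sketch puts these rules together for a single shared memory page. ukey_enable(), ukeyset_init(), ukeyset_add_key(), UK_RW, and UKEY_PRIVATE1 appear in the example later in this section; the sys/ukeys.h header name, the ukey_protect(address, length, key) call used to change the page's key, and the UKA_ADD_KEYS command passed to ukeyset_activate() are assumptions about the user key API, and error checking is minimal.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/ukeys.h>               /* user key API (assumed header name) */

int
main(void)
{
        size_t    psize = 4096;      /* one page; keys apply to whole pages */
        int       shmid;
        char     *page;
        ukeyset_t set;

        if (ukey_enable() == -1) {   /* returns the number of user keys, or -1 */
                perror("ukey_enable");
                exit(1);
        }

        shmid = shmget(IPC_PRIVATE, psize, IPC_CREAT | 0600);
        page  = shmat(shmid, NULL, 0);

        /* Move the page from UKEY_PUBLIC to UKEY_PRIVATE1.
         * ukey_protect(address, length, key) is an assumed signature.
         */
        if (ukey_protect(page, psize, UKEY_PRIVATE1) != 0) {
                perror("ukey_protect");
                exit(1);
        }

        /* Grant this thread read/write rights to UKEY_PRIVATE1.
         * UKA_ADD_KEYS is an assumed command flag for ukeyset_activate().
         */
        ukeyset_init(&set, 0);
        ukeyset_add_key(&set, UKEY_PRIVATE1, UK_RW);
        ukeyset_activate(set, UKA_ADD_KEYS);

        strcpy(page, "private data");   /* allowed: the key is now in our AMR */
        return 0;
}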
The kernel manages its own AMR when user keys are in use. When the kernel performs
loads or stores on behalf of an application, it respects the user mode AMR that was active
when the request was initiated. The user key values are shared among threads in a
multithreaded process, but a user mode AMR is maintained per thread. Kernel context
switches preserve the AMR. The ukey_enable() system call and pthread_create() prevent
threaded applications from running in M:N mode with user keys enabled.
The user mode AMR is inherited by fork(), and it is reset to its default by exec(). The default
user mode value enables only UKEY_PUBLIC (read and write access). A system call,
ukeyset_activate() is available to modify the user mode AMR. Applications cannot disable
access to UKEY_PUBLIC. Preventing this key from being disabled allows memory that is
unknown to an application to always be accessible. For example, the TOC or data used by
an external key-unsafe library is normally set to UKEY_PUBLIC.
The ucontext_t structure is extended to allow the virtualized user mode AMR to be saved and
restored. The sigcontext structure is not changed. The jmp_buf structure is not extended to
contain an AMR, so callers of setjmp(), _setjmp(), and sig_setjmp() must perform explicit
AMR management. A ukey_setjmp() API is provided that is a front-end to setjmp() and
manages the user mode AMR.
The user mode AMR is reset to contain only UKEY_PUBLIC when signals are delivered and
the interrupted AMR is saved in the ucontext_t structure. Signal handlers that access storage
that is not mapped UKEY_PUBLIC are responsible for establishing their user mode AMR.
Note: User keys are provided as a RAS feature. They are only intended to prevent
accidental accesses to memory and are not intended to provide data security.
Example of user keys for two processes using a shared memory object
To demonstrate how user keys can be used, we developed a short C program. This program
creates two processes sharing a memory page, and that page will have a private key
assigned to it.
The parent process will have UKEY_PRIVATE1 with UK_RW access. The child process will have
UKEY_PRIVATE1 with UK_READ, UK_WRITE, or UK_RW access, depending on the iteration of the
program. We will see how the child process behaves when it is given each type of access to the
shared page. We also need to ensure that shared memory is allocated as a multiple of the page
size, because a page is the granularity level for protection keys.
Table 3-5 lists the sequence of operations for child and parent processes with respect to time.
Time  Child process                        Parent process
2     Sleeping                             Sleeping
3     READ1(); WRITE1("ABCD"); sleep(2);   Sleeping
4     Sleeping                             READ1(); WRITE1("ABCD"); sleep(2);
5     Sleeping                             Sleeping
ukey_t pkey=UKEY_PRIVATE1;
ukey_t ckey=UKEY_PRIVATE1;
ukeyset_t pkeyset;
ukeyset_t ckeyset;
ukeyset_t oldset;
int nkeys;
if ((nkeys=ukey_enable())==-1){
perror("main():ukey_enable(): USERKEY not enabled");
exit(1);
}
assert(nkeys>=2);
if (rc=ukeyset_init(&pkeyset,0)){
perror("main():ukeyset_init(pkeyset)");
exit(1);
}
if (rc=ukeyset_init(&ckeyset,0)){
perror("main():ukeyset_init(ckeyset)");
exit(1);
}
if (rc=ukeyset_add_key(&pkeyset, pkey, UK_RW)){
perror("main():ukeyset_add_key(pkeyset, pkey,UK_RW)");
exit(1);
}
if (!strcmp(*(argv+1),"write")){
if (rc=ukeyset_add_key(&ckeyset, ckey, UK_WRITE)){
perror("main():ukeyset_add_key(ckeyset, ckey,UK_WRITE)");
exit(1);
}
}else{
We gave the child process write (no-read) access on the shared memory segment, and
executed it as shown in Example 3-32.
Notice that a segmentation fault occurred at strlen(), which was trying to read shared memory
and calculate its size. Because read permission was not provided, it caused SIGSEGV.
Next, we gave the child process read (no-write) access to the shared memory segment and
executed the code as shown in Example 3-33.
Notice that a segmentation fault occurred at the strcpy() function, which was trying to write to
shared memory. The strlen() function did not fail this time, because we gave read access to
the shared page. However, the child process does not have write access to the page, so
strcpy() caused SIGSEGV.
We executed the program by giving read and write access, as shown in Example 3-34 on
page 106.
Example 3-34 Program gets read and write access to shared segment
# ./ukey1 readwrite
parent:pagesize=4096
child:data=0x30000000
child :READ1 =[]
child :WRITE1=[abcd]
parent:READ1 =[abcd]
parent:WRITE1=[ABCD]
child :READ2 =[abcdABCD]
child :WRITE2=[efgh]
Because the child process was able to read and write the content of the shared object, no
core file is produced this time.
kkeymap Displays the available hardware keys and the kernel keys that map to
each.
kkeymap <decimal kernel key number> Displays the mapping of the specified kernel key to a hardware key (-1
indicates that the kernel key is not mapped).
hkeymap <decimal hardware key number> Displays all the kernel keys that map to the specified hardware key.
kkeyset <address of kkeyset_t> Displays the kernel key access rights represented by a kernel keyset.
The operand of this command is the address of the pointer to the
opaque kernel keyset, not the kernel keyset structure itself.
hkeyset <64 bit hex value> Displays the hardware key accesses represented by this value if used
in the AMR, and a sampling of the kernel keys involved.
dr amr Displays the current AMR and the access rights it represents.
dk 1 <eaddr> Displays the hardware key of the resident virtual page containing
eaddr.
mst Displays the AMR and context stack values. A storage key protection
exception is indicated in excp_type as DSISR_SKEY.
pft Displays the hardware key (labeled hkey) value for the page.
pte Displays the hardware key (labeled sk) value for the page.
scb Displays the hardware key default set for the segment.
heap Displays the kernel and hardware keys associated with an xmalloc
heap.
In the following examples we show storage key features using the kernel debugger (kdb).
Note: In Example 3-35, the kernel key number (kkey) is that of the first key in the row.
There is no guarantee that the other keys in the row are numbered sequentially.
Example 3-36 shows the Authority Mask Register (AMR) value for the current Machine Status
Table (MST).
Example 3-37 shows the hardware keys for a page frame table (PFT) entry.
> in use
> on scb list
> valid (h/w)
> referenced (pft/pvt/pte): 0/0/0
> modified (pft/pvt/pte): 1/0/0
base psx.................. 01 (64K) soft psx.................. 00 ( 4K)
owning vmpool............. 00000000 owning mempool............ 00000000
owning frameset........... 00000002
source page number............ 0100
dev index in PDT.............. 0000
next page sidlist. 0000000000002C10 prev page sidlist. 0000000000001010
next page aux......... 000000000000 prev page aux......... 000000000000
waitlist.......... 0000000000000000 logage.................... 00000000
nonfifo i/o............... 00000000 next i/o fr list...... 000000000000
If you make a key-safe extension by simply adding the minimal entry and exit point protection
gates, it might actually run slightly faster than it otherwise would on a keys-enabled system,
because explicit gates do not use the context stack. You must trade off granularity of
protection against overhead as you move into the key-protected realm, however. For
example, adding protection gates within a loop for precise access control to some private
object might result in unacceptable overhead. Try to avoid such situations, where possible, in
the framework of your specific key-protected design.
3.8 ProbeVue
Introduced in AIX V6.1, ProbeVue is a facility that enables dynamic tracing data collection. A
tracing facility is dynamic because it is able to gather execution data from applications without
any modification of their binaries or their source codes. The term dynamic refers to the
capability to insert trace points at run-time without the need to prepare the source code in
advance. Inserting specific tracing calls and defining specific tracing events into the source
code would require you to recompile the software and generate a new executable, which is
referred to as a static tracing facility.
Currently there are no standards in the area of dynamic tracing. POSIX has defined a tracing
standard for static tracing software only as described in Chapter 1 of the IBM Redbooks
publication IBM AIX Version 6.1 Differences Guide, SG24-7559. So, no compatibility between
ProbeVue and other UNIX dynamic tracing facilities can be expected until a standard is
established.
However, this general statement is currently evolving due to the recent advances in hardware
capabilities and software engineering, such as:
The processing and memory capabilities of high-end servers, together with the associated
storage technologies, have led to huge systems being put into production.
Dedicated solutions developed by system integrators (for example, based on ERP
software) implement numerous middleware and application layers, and have also led
to huge production systems.
Software is now mostly multithreaded and running on many processors. Thus, two runs
can behave differently, depending on the order of thread execution: multithreaded
applications are generally non-deterministic. Erroneous behaviors are more difficult to
reproduce and debug for such software.
The ProbeVue dynamic tracing facility provides a way to investigate problems on production
systems. ProbeVue captures execution data without installing dedicated instrumented
versions of applications or kernels that require a service interruption for application restart or
server reboot.
Additionally, ProbeVue helps you to find the root cause of errors that may occur on
long-running jobs where unexpected accumulated data, queue overflows and other defects of
the application or kernel are revealed only after many days or months of execution.
For these reasons, ProbeVue is a complementary tracing tool to the static tracing methods,
adding new and innovative tracing capabilities to running production systems.
ProbeVue can be used for performance analysis as well as for debugging problems. It is
designed to be safe to run on production systems and provides protection against errors in
the instrumentation code.
This section defines some of the terminology used. The subsequent sections introduce Vue,
the programming language used by ProbeVue, and the probevue command that is used to
start a tracing session.
Distinguishing probes by probe type imposes a structure on the wide variety of probe points.
So, ProbeVue requires a probe manager to be associated with each probe type.
Probe manager This is the software code that defines and provides a set of probe
points of the same probe type (for example, the system calls probe
manager).
In short, a Vue script tells ProbeVue where to trace, when to trace, and what to trace.
It is recommended that Vue scripts have a file suffix of .e to distinguish them from other file
types, although this is not a requirement.
The ProbeVue session stays active until a <Ctrl-C> is typed on the terminal or an exit action
is executed from within the Vue script.
Each invocation of the probevue command activates a separate dynamic tracing session.
Multiple tracing sessions may be active at one time, but each session presents only the trace
data that is captured in that session.
Running the probevue command is considered a privileged operation, and privileges are
required for non-root users who wish to initiate a dynamic tracing session. For a detailed
description of the probevue command, refer to AIX Version 6.1 Commands Reference,
Volume 4, SC23-5246.
For a detailed description of the probevctrl command, refer to AIX Version 6.1 Commands
Reference, Volume 4, SC23-5246.
@@BEGIN
{
printf("Hello World\n");
exit();
}
The following Hello World program prints "Hello World" when <Ctrl-C> is typed on the keyboard:
#!/usr/bin/probevue
@@END
{
printf("Hello World\n");
}
The format for a probe specification is probe-type specific. The probe specification is a tuple
of ordered list of fields separated by colons. It has the following general format:
@@<probetype>:<probetype field1 >:...:<probetype fieldn >:<location>
Action block
The action block identifies the set of actions to be performed when a thread hits the probe
point. Supported actions are not restricted to the capturing and formatting of trace data; the
full power of Vue can be employed.
Unlike procedures in procedural languages, an action block in Vue does not have an output
or a return value. And it does not have inherent support for a set of input parameters. On the
other hand, the context data at the point where a probe is entered can be accessed within the
action block to regulate the actions to be performed.
Predicate
Predicates should be used when execution of clauses at probe points must be performed
conditionally.
For example, this is a predicate indicating that probe points should be executed for process
ID = 1678:
when ( __pid == 1678 )
Probe manager
The probe manager is an essential component of dynamic tracing. Probe managers are the
providers of the probe points that can be instrumented by ProbeVue.
Probe managers generally support a set of probe points that belong to some common domain
and share some common feature or attribute that distinguishes them from other probe points.
Probe points are useful at points where control flow changes significantly, at points of state
change, or at other similar points of significant interest. Probe managers are careful to select
probe points only in locations that are safe to instrument.
Note: The uft probe manager requires the process ID for the process to be traced and
the complete function name of the function at whose entry point the probe is to be
placed. Further, the uft probe manager currently requires that the third field be set to an
asterisk (*) to indicate that the function name is to be searched in any of the modules
loaded into the process address space, including the main executable and shared
modules.
Vue functions
Unlike programs written in C or FORTRAN or another native language, scripts written in Vue
do not have access to the routines provided by the AIX system libraries or any user libraries.
However, Vue supports its own special library of functions useful for dynamic tracing
programs. Functions include:
Tracing-specific functions
get_function Returns the name of the function that encloses the current probe
timestamp Returns the current time stamp
diff_time Finds the difference between two time stamps
List functions
list Instantiate a list variable
append Append a new item to list
sum, max, min, avg, count
Aggregation functions that can be applied on a list variable
C-library functions
atoi, strstr Standard string functions
Miscellaneous functions
exit Terminates the tracing program
get_userstring Read string from user memory
For additional information, see the article ProbeVue: Extended Users Guide Specification
at:
http://www.ibm.com/developerworks/aix/library/au-probevue/
4. We execute the pvue.e script with the probevue command passing the process ID to be
traced as parameter:
# probevue ./pvue.e 262272
Built-in variables may be used in the predicate section of a Vue clause. To see __pid, refer to
Example 3-39 on page 119.
String
The string data type is a representation of string literals. The user specifies the length of the
string type. Here is an example of a string declaration:
String s[25]
The following operators are supported for the string data types:
"+", "=", "==", "!=", ">", ">=", "<" and "<=".
Vue supports several functions that return a string data type. It automatically converts
between a string data type and C-style character data types (char * or char[]) as needed.
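The following sketch (the variable name and literal values are purely illustrative) shows the
"+" operator and the automatic conversion to a C-style string when printing:
String s[25];
@@BEGIN
{
   s = "continuous" + " availability";   /* "+" concatenates two strings */
   printf("s = %s\n", s);                /* converted to char * for %s   */
}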
List
List is used to collect a set of integral-type values. It is an abstract data type and cannot be
used directly with the standard unary or binary operators. Instead, Vue supports the following
operations for the list type:
A constructor function, list() to create a list variable.
A concatenation function, append() to add an item to the list.
The "=" operator that allows a list to be assigned to another.
A set of aggregation functions that operate on a list variable and return a scalar (integer)
value like sum(), avg(), min(), max(), and so on.
Example 3-41 on page 123 uses the list variable lst. Prototypes and a detailed explanation
of the List data type can be found in Chapter 4, Dynamic Tracing, in AIX Version 6.1 General
Programming Concepts: Writing and Debugging Programs, SC23-5259.
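The following minimal sketch (with made-up values; the %lld format assumes the aggregation
functions return 64-bit integers, and lst is used as a global list variable as in Example 3-41)
illustrates the constructor, append(), and two of the aggregation functions:
@@BEGIN
{
   lst = list();                       /* create the list               */
   append(lst, 10);
   append(lst, 32);
   printf("count=%lld sum=%lld\n", count(lst), sum(lst));
   exit();                             /* terminate the tracing program */
}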
Symbolic constants
Vue has predefined symbolic constants which can be used, such as NULL. The following C
keywords are also reserved in Vue: do, float, for, if, long, return, short, signed, sizeof, static,
struct, switch, typedef, union, unsigned, void, and while.
Elements of shell
Vue translates exported shell variables (specified by the $ prefix) and positional parameters
into their real values during the initial phase of compilation. So $1, $2, and so on, will be
replaced with their corresponding values by ProbeVue.
When assigning a general environment variable to a string, you need to make sure that its
value starts and ends with an escaped double quotation mark (\"). For instance, the
environment variable VAR=abcdef will result in an error when assigned to a string; defining it
as VAR=\"abcdef\" is the proper way of using it in a Vue script.
Restriction: The special shell parameters like $$, $@, and so on are not supported in
Vue. However, they can be obtained by other predefined built-in variables.
Example 3-41 illustrates the use of various variable types, and also uses kernel variables.
The comments provided in the Vue script explain the scope of various variables.
/*
* Strings are by default Global variables
*/
String global_var[4096];
/*
* Global variables are accessible throughout the scope of Vue file in any clause
*/
__global global_var1;
__kernel long lbolt;
/*
* Thread variables are like globals, but are instantiated per traced thread the first
* time it executes an action block that uses the variable
*/
__thread int thread_var;
/*
* Built-in variables are not supposed to be defined. They are by default
* available to the clauses where it makes sense to use them. They are: __rv, __arg1,
* __pid, and so on. __rv (the return value for system calls) makes sense only when a
* system call is returning, and hence is available only for the syscall()->exit() action
* block, while __arg1, __arg2, and __arg3 are accessible at syscall()->entry(). As the
* system call read() accepts only 3 arguments, only __arg1, __arg2, and __arg3 are
* valid for the read()->entry() action block.
* __pid can be accessed in any clause, as the current process ID is valid everywhere
*/
int read(int fd, void *buf, int n);
@@BEGIN
{
global_var1=0;
lst=list();
printf("lbolt=%lld\n",lbolt);
}
@@uft:$1:*:printf:entry
{
global_var1=global_var1+1;
append(lst,1);
}
/*
* Valid built-in variables : __pid, __arg1, __arg2, __arg3
* __rv is not valid here; it is valid only inside read()->exit()
*/
@@syscall:*:read:entry
when ( __pid == $1 )
{
/*
 * Automatic variables are not accessible outside their own clause.
 */
thread_var=__arg1;
global_var=get_userstring(__arg2,15);
global_var1=__arg3;
printf("At read()->entry():\n");
printf("\tfile descriptor ====>%d\n", thread_var);
printf("\tfile context (15 bytes)====>%s\n", global_var);
printf("\tMAX buffer size ====>%d\n", global_var1);
}
Example 3-42 displays the C file that will be traced using the Vue script.
XMDBG will catch errors that might otherwise result in system outages, such as traps, Data
Storage Interrupts (DSIs), and hangs. Typical errors include freeing memory that is not
allocated, allocating memory without freeing it (memory leak), using memory before
initializing it, and writing to freed storage.
Note: This section is meant only for support personnel or for system administrators under
the supervision of support personnel. End customer use without supervision is not
recommended.
Example 3-1 on page 57 shows how to use the smit menu command smit ffdc to enable
RTEC. Alternatively, you can use the following commands:
To turn off error checking for xmalloc:
errctrl -c alloc.xmdbg errcheckoff
To turn error checking for xmalloc back on, at the previously set checking level (or at the
default level if no previous checking level exists):
errctrl -c alloc.xmdbg errcheckon
To display the current RTEC level for xmalloc and its subcomponents, execute the following
command:
errctrl -c alloc -q -r
# errctrl -c alloc -q -r
---------------------------------------------+-------+-------+-------+--------
                                              | Have  |ErrChk |LowSev |MedSev
Component name                                | alias | /level| Disp  | Disp
---------------------------------------------+-------+-------+-------+--------
alloc
.heap0                                        | NO    | ON /0 |    48 |    64
.xmdbg                                        | NO    | ON /9 |    64 |    80
Example 3-44 shows the RTEC level for alloc and its subcomponents. Note that alloc.xmdbg
is set to ERRCHECK_MAXIMAL (which is explained in more detail in 3.9.3, Run-time error
checking (RTEC) levels for XMDBG (alloc.xmdbg) on page 127). In this example,
alloc.heap0 has no error checking enabled.
To display the current RTEC level for any subcomponent of xmalloc, execute:
errctrl -c alloc.<subcomponent> -q -r
For example, the command errctrl -c alloc.xmdbg -q -r will show the RTEC level for
alloc.xmdbg.
The frequency of various xmalloc debug tunables can be viewed by using the kdb
subcommand xm -Q. Example 3-45 on page 128 shows the frequencies for various tunables.
Minimal checking is the default checking level on version 5. The frequency that appears next
to each tunable is proportional to the frequency base (1024). From the example, you can see
that the Ruin All Data technique will be applied 5 times out of every 1024 (0x400) calls to
xmalloc(), which is about 0.5% of the time. Also, 16-byte allocations will be promoted about
10 times out of every 1024 calls to xmalloc(), which is about 1% of the time.
(0)> xm -Q
XMDBG data structure @ 00000000025426F0
Debug State: Enabled
Frequency Base: 00000400
Tunable Frequency
Allocation Record 00000033
Ruin All Data 00000005
Trailer non-fragments 00000005
Trailer in fragments 00000005
Redzone Page 00000005
VMM Check 0000000A
Deferred Free Settings
Fragments 00000005
Non-fragments 00000005
Promotions 00000066
Page Promotion
Frag size Frequency
[00010] 0000000A
[00020] 0000000A
[00040] 0000000A
[00080] 0000000A
[00100] 0000000A
[00200] 0000000A
[00400] 0000000A
[00800] 0000000A
[01000] 0000000A
[02000] 0000000A
[04000] 0000000A
[08000] 0000000A
In Example 3-46 on page 128, frequencies for xmalloc tunables at normal level are shown. A
trailer will be added to a fragment about 51 (0x33) times out of every 1024 times a fragment
is allocated (about 5%). The deferred free technique will be applied to page promotions about
153 (0x99) times out of every 1024 (0x400) times a fragment is promoted, which is about
15% of the time.
(0)> xm -Q
XMDBG data structure @ 00000000025426F0
Debug State: Enabled
Page Promotion
Frag size Frequency
[00010] 0000000D
[00020] 0000000D
[00040] 0000000D
[00080] 0000000D
[00100] 0000000D
[00200] 0000000D
[00400] 0000000D
[00800] 0000000D
[01000] 0000000D
[02000] 0000000D
[04000] 0000000D
[08000] 0000000D
Example 3-47 shows frequencies for various tunables of alloc.xmdbg at level errcheckdetail.
For instance, Allocation Records are kept on every call to xmalloc() (0x400 out of 0x400
calls). 0x80 byte Fragments are promoted 0x200 out of every 0x400 times the 0x80 byte
fragment is allocated (50%).
(0)> xm -Q
XMDBG data structure @ 00000000025426F0
Debug State: Enabled
Frequency Base: 00000400
Tunable Frequency
Allocation Record 00000400
Ruin All Data 00000200
Trailer non-fragments 00000066
Trailer in fragments 00000200
Redzone Page 00000266
VMM Check 00000266
Deferred Free Settings
Fragments 00000066
Page Promotion
Frag size Frequency
[00010] 00000200
[00020] 00000200
[00040] 00000200
[00080] 00000200
[00100] 00000200
[00200] 00000200
[00400] 00000200
[00800] 00000200
[01000] 00000200
[02000] 00000200
[04000] 00000200
[08000] 00000200
Example 3-48 shows frequencies for various tunables of alloc.xmdbg at the highest RTEC
level.
Page Promotion
Frag size Frequency
[00010] 00000400
[00020] 00000400
[00040] 00000400
[00080] 00000400
[00100] 00000400
[00200] 00000400
[00400] 00000400
Note: The bosdebug -M command sets all the frequencies for alloc.xmdbg at maximal level
except for promotion settings which are all set to zero (0). A reboot is required in order for
bosdebug to take effect.
In AIX V6.1, the user can set the probability of a check being performed by specifying the
frequency of a tunable as a number between 0 and 1024. This is the number of times out of
the base frequency (1024) that the technique is to be applied by xmalloc. For example, to
request 50%, the user specifies a frequency of 512.
Frequencies can be input as decimal or hexadecimal numbers, so 50% can be specified as
0x200. As a convenient alternative, the frequency can be expressed as a percentage: the user
specifies a number between 0 and 100 followed by the percent (%) sign.
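For example, the following three commands are equivalent ways of requesting a 50%
frequency for the ruin_all tunable (described later in this section), expressed as a decimal
number, a hexadecimal number, and a percentage:
errctrl -c alloc.xmdbg ruin_all=512
errctrl -c alloc.xmdbg ruin_all=0x200
errctrl -c alloc.xmdbg ruin_all=50%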
The following sections detail the RTEC tunables for the alloc.xmdbg component.
Allocation record
The allocation record contains a three-level stack trace-back of the xmalloc() and xmfree()
callers, as well as other debug information about the allocated memory. The presence of a
record is a minimum requirement for xmalloc run-time error checking.
errctrl -c alloc.xmdbg alloc_record=<frequency>
Ruin storage
This option sets the frequency at which xmalloc() will return storage that is filled with a ruin
pattern. This helps catch errors with un-initialized storage, because a caller with bugs is more
likely to crash when using the ruined storage. Note that xmalloc() does not perform any
explicit checks when this technique is employed. The ruined data will contain 0x66 in every
allocated byte on allocation, and 0x77 in every previously allocated byte after being freed.
errctrl -c alloc.xmdbg ruin_all=<frequency>
Note: The tunable small_trailer did not exist on 5.3, because all trailers were controlled
with the single tunable known as alloc_trailer.
The error disposition can be made more severe by changing the disposition of medium
severity errors as follows:
errctrl -c alloc.xmdbg medsevdisposition=sysdump
Be aware, however, that overwrites to the trailers and other medium severity errors will cause
a system crash if the severity disposition is changed to be more severe.
Trailers are checked at fragment free time for consistency. The error disposition can be
affected for these checks just as it is for the small_trailer option. Trailers and redzones can be
used together to ensure that overruns are detected. Trailers are not used if the requested size
is exactly a multiple of the page size. Overwrites can still be detected by using the redzone
option.
errctrl -c alloc.xmdbg large_trailer=<frequency>
This provides isolation for the returned memory and catches users that overrun buffers. When
used in conjunction with the df_promote option, this also helps catch references to freed
memory. This option uses substantially more memory than other options.
Sizes that are greater than 2 K are still promoted in the sense that an extra redzone page is
constructed for them.
Note: The page size of the heap passed to xmalloc() makes no difference. If the heap
normally contains 64 K pages (kernel_heap or pinned_heap on a machine that supports a
64 K kernel heap page size), then the returned memory of a promoted allocation will still be
backed by 4 K pages.
These promoted allocations come from a region that has a 4 K page size, to avoid using an
entire 64 K page as a redzone.
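The exact command syntax for the promote option does not appear in this excerpt; as an
assumption modeled on the AIX V5.3 doublepage_promote syntax quoted in the note that
follows, it would take a fragment size and a frequency:
errctrl -c alloc.xmdbg promote=<size>,<frequency>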
Note: In AIX V5.3, this feature did not provide a redzone page, and the doublepage_promote
option always caused the freeing of the fragment to be deferred. To provide a redzone page,
5.3 used:
errctrl -c alloc.xmdbg doublepage_promote=<size>,<frequency>
In AIX V6.1, this option is still provided, but its function is identical to the promote option.
This option affects the freeing of promoted fragments. It sets the frequency with which the
freeing of a promoted fragment is deferred. Page promotion (that is, the promote option) and
df_promote are designed to be used together.
Be aware that there is a difference between the def_free_frag option and the df_promote
option. The options are similar, but with def_free_frag, the freeing of every fragment on a
page is deferred together. This implies that the number of pages used by these two
techniques is substantially different:
The df_promote option constructs one fragment per page (with an additional redzone
page).
The def_free_frag option constructs multiple fragments per page (with no redzone).
Note: The options def_free_frag, promote_all, and df_promote do not exist in AIX 5.3.
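As a sketch only (the tunable names df_promote and def_free_frag come from this section,
but the value syntax is an assumption), the two options would be set with commands of the
following form:
errctrl -c alloc.xmdbg df_promote=<frequency>
errctrl -c alloc.xmdbg def_free_frag=<frequency>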
When a page is returned that has a lower than expected pin count, has the wrong page
protection settings, or has the wrong hardware storage key associated with it, the system will
crash.
errctrl -c alloc.xmdbg vmmcheck=<frequency>
For example, alloc.heap0 is a separate component that controls the heap used by the loader,
and it uses a different percentage than the kernel_heap, which is controlled by alloc.xmdbg.
Component level heaps created by heap_create() can be registered separately, and can be
given different percentages. Refer to 3.9.7, Heap registration for individual debug control on
page 137 for information about the individual heap registration mechanism.
errctrl -c alloc.xmdbg memleak_pct=<percentage>
Tip: This tunable requires the user to make a judgment about how much storage should be
consumed before a leak should be suspected. Users who do not have that information
should not use the command. The default percentage is 100% (1024/1024).
Memory leak errors are classified as LOW_SEVERITY errors and the default disposition is to
ignore them. The error disposition for low severity errors can be modified to log an error or to
cause a system crash. This tunable can be seen in KDB by using the xm -Q command. The
field Ratio of memory to declare a memory leak shows the memory leak percentage.
This tunable can be seen in KDB by using the xm -Q command. The field Outstanding memory
allocations to declare a memory leak shows the memory leak count.
This tunable can be seen in KDB by using the xm -Q command. The field Minimum allocation
size to force a record for shows the threshold for large-allocation record keeping.
Deferral count
This tunable is the total number of pages that are deferred before xmalloc() recycles deferred
storage back to a heap. Deferring the freeing of storage for a very long time can result in
fragmented heaps that result in allocation failures for large requests. Xmalloc supports setting
this option to -1 which causes xmalloc() to defer reallocation as long as possible. This means
the heap is exhausted before memory is recycled. On AIX V6.1, the default value is 0x4000
deferrals.
errctrl -c alloc.xmdbg deferred_count=<num>
This tunable can be seen in KDB by using the xm -Q command. The field Deferred page
reclamation count shows the deferral count.
Note: The alloc.xmdbg and alloc.heap0 components and their potential child components
support a variety of tunables that can be changed as a group. Use the errctrl command
with the errcheckminimal, errchecknormal, errcheckdetail, and errchecklevel
subcommands.
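For example, the following command (a sketch that follows the subcommand form described
in the note above) sets the alloc.xmdbg component to the detailed error checking level:
errctrl -c alloc.xmdbg errcheckdetail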
Changes to the alloc.xmdbg component apply to the kernel_heap, pinned_heap, and all
heaps created by kernel subsystems via the heap_create() subroutine. alloc.xmdbg is the
default xmalloc-related component, and other components are mentioned only for
completeness.
The alloc.heap0 component applies to the loader-specific heap that appears in the kernel
segment.
Example 3-49 shows the total usage information about the kernel heap, and then information
that includes three levels of stack trace. Only partial output is shown.
(0)> xm -L 3 -u
Storage area................... F100060010000000..F100060800000000
............(34091302912 bytes, 520192 pages)
Primary heap allocated size.... 351797248 (5368 Kbytes)
Alternate heap allocated size.. 56557568 (863 Kbytes)
Max_req_size....... 1000000000000000 Min_req_size....... 0100000000000000
Size Count Allocated from
---------------------------------------------------------------------------
Max_req_size....... 0100000000000000 Min_req_size....... 0000000010000000
Size Count Allocated from
---------------------------------------------------------------------------
Max_req_size....... 0000000010000000 Min_req_size....... 0000000001000000
Size Count Allocated from
---------------------------------------------------------------------------
000000000C000000 1 4664B8 .nlcInit+0000D4
62AF30 .lfsinit+00018C
78B6E4 .main+000120
Max_req_size....... 0000000001000000 Min_req_size....... 0000000000100000
Size Count Allocated from
---------------------------------------------------------------------------
00000000108DD000 32 658B38 .geninit+000174
45C9C0 .icache_init+00009C
45D254 .inoinit+000770
0000000000900000 3 5906B4 .allocbufs+0000B0
5920CC .vm_mountx+0000C8
592BE4 .vm_mount+000060
0000000000800000 1 786334 .devsw_init+000050
78B6E4 .main+000120
34629C .start1+0000B8
:
:
The command xm -H @<heap_name> -u will show the total memory usage of the heap
<heap_name>.
Here, <addr> is the address of the heap, which must be found from KDB. In Example 3-49 on
page 136, my_heap is an individual heap and is valid for registration under alloc.xmdbg. To
find its address in kernel space, execute dd my_heap in KDB.
After the heap is registered under alloc.xmdbg, it should be visible under its component
hierarchy. Execute the following command to see it:
errctrl -q -c alloc.xmdbg -r
The new child component represents the heap my_heap and can be controlled individually
by using pass-through commands. For further information about this topic, refer to 3.9.4,
XMDBG tunables affected by error check level on page 131, and to 3.9.5, XMDBG tunables
not affected by error check level on page 134.
Note that persistence (the -P flag of the errctrl command) is not supported with this
mechanism, because this new subcomponent will not exist at the next system boot.
iptrace        5.2
ProbeVue       6.1
EtherChannel   5.1
MPIO           5.2
Topas          5.2
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this paper.
Other publications
These publications are also relevant as further information sources:
AIX 5L Version 5.3 Commands Reference, Volume 1, a-c, SC23-4888
AIX 5L Version 5.3 Kernel Extensions and Device Support Programming Concepts,
SC23-4900
AIX 5L System Management Guide: Operating System and Devices, SC23-5204
AIX Version 6.1 Commands Reference, Volume 4, SC23-5246
AIX Version 6.1 General Programming Concepts: Writing and Debugging Programs,
SC23-5259
Reliable Scalable Cluster Technology: Administration Guide, SA22-7889
Learn about AIX V6.1 and POWER6 advanced availability features
View sample programs that exploit storage protection keys
Harden your AIX system

This IBM Redpaper describes the continuous availability features of AIX Version 6, Release 1.
It also addresses and defines the terms Reliability, Availability, and Serviceability (RAS) as
used in an IT infrastructure. It touches on the global availability picture for an IT environment
in order to better clarify and explain how AIX can improve that availability. The paper is
intended for AIX specialists, whether customers, business partners, or IBM personnel, who
are responsible for server availability.

A key goal of AIX development is to improve overall system serviceability by developing
problem determination tools and techniques that have minimal impact on a live system; this
document explains the new debugging tools and techniques, as well as the kernel facilities
that work in conjunction with new hardware, that can help you provide continuous availability
for your AIX systems.

The paper provides a broad description of the advanced continuous availability tools and
features on AIX that help to capture software problems at the moment they appear, with no
need to recreate the failure. In addition to software problems, the AIX kernel works closely
with advanced hardware features to identify and isolate failing hardware and replace
hardware components dynamically without bringing down the system. The tools discussed
include Dynamic Trace, Lightweight Memory Trace, Component Trace, Live dump and
Component dump, Storage protection keys (kernel and user), Live Kernel update, and
xmalloc debug.

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION
BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE
IBM Redbooks are developed by the IBM International Technical Support Organization.
Experts from IBM, Customers and Partners from around the world create timely technical
information based on realistic scenarios. Specific recommendations are provided to help you
implement IT solutions more effectively in your environment.

For more information:
ibm.com/redbooks

REDP-4367-00