[Milberg09] Chapter 11. Disk I/O: Monitoring

Stefen 2010-09-13

展開全文

Chapter 11. Disk I/O: Monitoring

This chapter provides an overview of the AIX-specific tools (sar, nmon, and topas) available to monitor disk I/O activity. These commands let you quickly troubleshoot a performance problem and capture data for historical trending and analysis. Don't expect to see iostat here. That Unix utility lets you quickly determine whether there is an imbalanced I/O load between your physical disks and adapters. But unless you decide to write your own scripting tools using iostat, it won't help you with long-term trending and capturing data.

11.1. sar

The sar command, whose syntax is given in Chapter 8 , is one of those older, generic Unix tools that have been improved over the years. Although I generally prefer to use more specific AIX tools, such as nmon and topas, sar provides strong information with respect to disk I/O. Let's run a typical sar command to examine I/O activity:

# sar -d 1 2

AIX newdev 3 5     06/04/
System Configuration: lcpu=4 disk=5
07:11:16     device  %busy    avque    r+w/s   blks/s   avwait     avser
07:11:17     hdisk1      0      0.0        0        0      0.0       0.0
             hdisk0     29      0.0      129       85      0.0       0.0
             hdisk3      0      0.0        0        0      0.0       0.0

Here's a breakdown of the column headings:

%busy — Portion of time the device was busy servicing transfer requests
avque —Number of requests waiting to be sent to disk (as of AIX 5.3)
r+w/s — Number of read or write transfers to or from a device (in 512-byte units)
avwait — Average wait time per request (in milliseconds)
avserv — Average service time per request (in milliseconds)

You want to be wary of any disk that approaches 100 percent utilization or shows a large number of queue requests waiting for disk. Although the sample output shows some activity, we have no real I/O problems because no waiting for I/O is occurring. We should continue to monitor this system to ensure other disks in addition to hdisk0 are being used.

Where sar differs from iostat is in its ability to capture data for long-term analysis and trending using its system activity data collector (sadc) utility. Usually turned off in cron, the sadc utility lets you capture data for historical trending and analysis.

Here's how this works. As delivered by default on AIX systems, two shell scripts, /usr/lib/sa/sa1 and /usr/lib/sa/sa2, which are normally commented out provide daily reports on the activity of the system. The sar command actually calls the sadc routine to access system data. The following example shows how the shell scripts are usually kicked off from cron:

# crontab -l | grep sa1

0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &

11.2. topas

What about something a little more user-friendly? Did you say topas? The topas command is a nice performance-monitoring tool that you can use for a number of purposes, including monitoring your disk subsystem.

Let's take a look at the topas output from a disk perspective:

Code View: Scroll / Show All

Topas output for host - Testhost               Events/Queues     FILE/TTY
Mon May  7 07:33:38 2007   Interval:  2        Cswitch     500   Readch        487
                                               Syscall    1298   Writech       943
Kernel    0.5   |#                          }  Reads         2   Rawin           0
User      0.5   |#                          |  Writes        1   Ttyout        459
Wait      0.0   |                           |  Forks         0   Igets           0
Idle     99.0   |###########################|  Execs         0   Namei          25
                                               Runqueue    0.0   Dirblk          0
Network  KBPS  I-Pack  O-Pack   KB-In  KB-Out  Waitqueue   0.0
en1       0.6     1.0     1.0     0.1    0.5
lo0       0.1     1.0     1.0     0.0    0.0   PAGING            MEMORY
                                               Faults        1   Real,MB      4095
Disk    Busy%    KBPS     TPS KB-Read KB-Writ  Steals        0   % Comp       13.8
hdisk0    0.0     0.0     0.0     0.0     0.0  PgspIn        0   % Noncomp    87.1
hdisk1    0.0     0.0     0.0     0.0     0.0  PgspOut       0   % Client      0.5
hdisk3    0.0     0.0     0.0     0.0     0.0  PageIn        0
cd0       0.0     0.0     0.0     0.0     0.0  PageOut       0   PAGING SPACE
hdisk2    0.0     0.0     0.0     0.0     0.0  Sios          0   Size,MB      4096
                                                                 % Used        0.5
Name           PID  CPU%  PgSp Owner           NFS (calls/sec)   % Free       99.4
X            15256  0.8    2.5 root            ServerV2       0
topas        22320  0.2    1.5 root            ClientV2       0    Press:
syncd        15016  0.0    0.6 root            ServerV3       0    "h" for help
lrud          9030  0.0    0.0 root            ClientV3       0    "q" to quit
gil          10320  0.0    0.1 root
i4llmd       12434  0.0    1.1 root
prngd        19154  0.0    0.2 root
rpc.lock     26878  0.0    0.0 root
nfsd         28238  0.0    0.0 root
tcl          17906  0.0    0.8 root
i4lmd        25352  0.0    1.3 root
dtwm         22752  0.0    1.9 rds
xmgc          9804  0.0    0.0 root
dtsessio     20700  0.0    1.8 rds
init             1  0.0    0.7 root
vmstat       37288  0.0    0.2 root
dtfile       20444  0.0    1.7 rds
cron         27720  0.0    0.4 root
rshell       33334  0.0    0.8 user
netm         10062  0.0    0.0 root

No I/O activity at all is going on here. Besides the physical disk, pay close attention to the "Wait" information (in the CPU section up top), which also helps you determine whether the system is I/O-bound. If you see high numbers here, you can then use other tools, such as filemon, fileplace, lslv, or lsof, to help you figure out which processes, adapters, or file systems are causing your bottlenecks.

The topas command is useful for quickly troubleshooting an issue when you want a little more than iostat can provide. In a sense, topas is a graphical mix of iostat and vmstat, although recent improvements now provide the ability to capture data for historical analysis. These improvements, introduced in AIX 5.3, no doubt were made because of the popularity of nmon.

While nmon provides a front end similar to topas, it is much more useful in terms of long-term trending and analysis. Further, as you learned in Chapter 5, nmon gives system administrators the ability to output data to an Excel spreadsheet for presentation in graphical charts (tailor-made for senior management and functional teams) that clearly illustrate bottlenecks. The nmon analyzer tool provides the hooks into nmon. (Figure 5.1 in Chapter 5 shows some sample output from the nmon analyzer.) With respect to disk I/O, nmon reports the following data: disk I/O rates, data transfers, read/write ratios, and disk adapter statistics.

Here is one small example of where nmon really shines. Let's say you want to know which processes are hogging most of the disk I/O, and you want to be able to correlate that activity with the actual disk to clearly illustrate I/O per process. nmon usage helps you here more than any other tool. To perform this task with nmon, use the –t option; set your timing and then sort by I/O channel.

How do you use nmon to capture data and import it into the analyzer? Use the open-source sudo command and run nmon for three hours, taking a snapshot every 30 seconds:

# sudo nmon -f -t -r test1 -s 30 -c 180

Next, sort the created output file:

# sort -A testsystem_yymmdd.nmon > testsystem_yymmdd.csv

Then FTP the .csv file to your PC, start the nmon analyzer spreadsheet (enabling macros), and click on Analyze nmon data. The nmon command also helps track the configuration of asynchronous I/O servers.

11.3. Logical Volume Monitoring

Say that a ticket has just been opened up with the service desk that relates to slow performance on some database server. You suspect there might be an I/O issue, so you start with iostat. iostat, the equivalent of using vmstat for virtual memory, is arguably the most effective way to get a first glance at what is happening with your I/O subsystem. Let's run iostat, in this case once a second:

Code View: Scroll / Show All

# iostat 1

System configuration: lcpu=4 disk=4
tty:      tin          tout   avg-cpu:  % user    % sys     % idle     % iowait
          0.0         392.0                5.2      5.5       88.3          1.1
Disks:      % tm_act      Kbps     tps     Kb_read      Kb_wrtn
hdisk1           0.5      19.5     1.4    53437739     21482563
hdisk0           0.7      29.7     3.0    93086751     21482563
hdisk4           1.7     278.2     6.2    238584732   832883320
hdisk3           2.1     294.3     8.0    300653060   832883320

The command reports the following information:

% tm_act — Percentage of time that the physical disk was active, or the total time of disk request
Kbps — Amount of data (in kilobytes per second) transferred to the drive
tps — Number of transfers per second issued to the physical disk
Kb_read — Total data (in kilobytes) from the measured interval that is read from the physical volumes
Kb_wrtn — Amount of data (kilobytes) from the measured interval that is written to the physical volumes

You need to watch % tm_act very carefully because if this utilization exceeds roughly 60 to 70 percent, that usually indicates that processes are starting to wait for I/O. This might be your first clue of impending I/O problems. Moving data to less busy drives can obviously help ease this burden. Generally speaking, the more drives your data hits, the better.

Just like anything else, too much of a good thing can also be bad, and you also have to make sure you don't have too many drives hitting any one adapter. One way to determine whether an adapter is saturated is to sum the Kbps amounts for all disks attached to one adapter. The total should be below the disk adapter's throughput rating, usually less than 70 percent. Using the –a flag with iostat helps you drill down further to examine adapter utilization. In the following output, there clearly are no bottlenecks:

# iostat -a

Adapter:                   Kbps      tps    Kb_read    Kb_wrtn
scsi0                      0.0       0.0          0          0

Paths/Disk:       % tm_act     Kbps     tps    Kb_read    Kb_wrtn
hdisk1_Path0          37.0     89.0     0.0          0          0
hdisk0_Path0          67.0     47.0     0.0          0          0
hdisk4_Path0           0.0      0.0     0.0          0          0
hdisk3_Path0           0.0      0.0     0.0          0          0

Adapter:                   Kbps      tps    Kb_read    Kb_wrtn
ide0                        0.0      0.0          0          0

Paths/Disk:       % tm_act     Kbps     tps    Kb_read    Kb_wrtn
cd0                    0.0      0.0     0.0          0          0

11.4. AIX LVM Commands

We examined disk placement earlier, and I stressed the importance of architecting your systems correctly from the beginning. Unfortunately, you don't always have that option. As system administrators, we sometimes inherit systems that must be fixed. Let's look at a sample layout of the logical volumes on disks to determine whether we need to change definitions or rearrange data. We'll examine a volume group and find the logical volumes that are a part of it.

The lsvg command provides volume group information:

# lsvg -l data2

Data2vg

LV NAME          TYPE   LPs   PPs   PVs    LV STATE      MOUNT POINT
data2lv           jfs   128   256     2    open/syncd    /data
loglv00        jfslog     1     2     2    open/syncd    N/
appdatalv         jfs   128   256     2    open/syncd    /appdata

Now, let's use lslv, which provides information about logical volumes:

Code View: Scroll / Show All

# lslv data2lv

LOGICAL VOLUME: data2lv                             VOLUME GROUP: data2vg
LV IDENTIFIER:  0003a0ec00004c00000000fb076f3f41.1  PERMISSION:   read/write
VG STATE:           active/complete                 LV STATE:     opened/syncd
TYPE:               jfs                             WRITE VERIFY: off
MAX LPs:            512                             PP SIZE:      64 megabyte(s)
COPIES:             2                               SCHED POLICY: parallel
LPs:                128                             PPs:           256
STALE PPs:          0                               BB POLICY:    relocatable
INTER-POLICY:       minimum                         RELOCATABLE:  yes
INTRA-POLICY:       center                          UPPER BOUND:  32
MOUNT POINT:        /data                           LABEL:        /data

This view provides a detailed description of the logical volume attributes. What do we have here? The intra-policy is at the center, which normally is the best policy for I/O-intensive logical volumes. As you recall from an earlier discussion, there are exceptions to this rule. Unfortunately, you've just hit one of them. Because Mirror Write Consistency Check (MWCC) is on, the volume would have been better served if it were placed on the edge.

Let's look at its inter-policy. The inter-policy is minimum, which is usually the best policy if availability matters more than performance. Further, there are twice as many physical partitions as logical partitions, which signifies that you are mirroring your systems. In this case, let's assume you were told that raw performance was the most important objective, so the logical volume wasn't configured to reflect the reality of how the volume is being used. Further, if you are mirroring the system and using an external storage array, the situation would even be worse, because you're already providing mirroring at the hardware layer, which is actually more effective than using AIX mirroring.

The lslv command's –l (lowercase L) flag lists all the physical volumes associated with the logical volumes and shows the distribution for each logical volume:

# lslv -l data2lv

data2lv:/data2
PV                COPIES        IN BAND       DISTRIBUTION
hdisk2            128:000:000   100%          000:108:020:000:000
hdisk3            128:000:000   100%          000:108:020:000:000

With this detail, you can determine that 100 percent of the physical partitions on the disk are allocated to this logical volume. The distribution section of the output shows the actual number of physical partitions within each physical volume. From here, you can detail the volume's intra-disk policy.

Let's drill down even further, using the -p flag:

# lspv -p hdisk2

hdisk2:

PP RANGE  STATE   REGION         LV ID            TYPE       MOUNT POINT
  1-108   free    outer edge
109-109   used    outer edge     loglv00          jfslog     N/A
110-217   used    outer middle   data2lv          jfs        /data2
218-237   used    center         appdatalv        jfs        /appdata
238-325   used    center         testdatalv       jfs        /testdata
326-365   used    inner middle   stagingdatalv    jfs        /staging
366-433   free    inner middle
434-542   free    inner edge

The preceding view shows you what is free on the physical volume, what has been used, and which partitions are used where. The order of the fields is as follows: edge, middle, center, inner-middle, inner-edge. The sample report shows that most of the data is in the middle and some is at the center. This is a nice view.

You can do a lot with lsvg and lslv; run a man on these commands to find out more about them.

One of the best tools for looking at LVM use is lvmstat. Because the lvmstat view is not enabled by default, you need to enable it before running the tool:

# lvmstat -v data2vg -e

The following command takes a snapshot of Logical Volume Manager information every second for 10 intervals:

# lvmstat -v data2vg 1 10

The resulting output shows the most utilized logical volumes on your system since you started the data collection tool:

# lvmstat -v data2vg

Logical Volume      % iocnt     Kb_read     Kb_wrtn       Kbps
  appdatalv          306653    47493022      383822      103.2
  loglv00                34           0        3340        2.8
  data2lv               453      234543      234343       89.3

This detail is very helpful when drilling down to the logical volume layer in tuning your systems:

% iocnt — Number of read and write requests
Kb_read — Total data (in kilobytes) from your measured interval that is read
Kb_wrtn — Total data (in kilobytes) from your measured interval that is written
Kbps — Amount of data transferred (in kilobytes per second)

Be sure to review the documentation for all the commands discussed here before adding them to your repertoire.

11.5. filemon and fileplace

This section introduces two important I/O tools, filemon and fileplace, and discusses how you can use them during systems administration each day.

11.6. filemon

filemon [-d] [-i Trace_File -n Gennames_File] [-o File] [-O Levels]
    [-P] [-T n] [-u] [-v]

The filemon command uses a trace facility to report on the I/O activity of physical and logical storage, including your actual files. The I/O activity monitored is based on the time interval specified when running the trace. The command reports on all layers of file system utilization, including the LVM, virtual memory, and physical disk layers. Run without any flags, filemon executes in the background while application programs or system commands are being run and monitored.

The trace starts automatically and runs until it is stopped. At that time, the command generates an I/O activity report and exits. It can also process a trace file that has been recorded by the trace facility. You can then generate reports from this file. Because reports generated to standard output usually scroll past your screen, I advise using the –o option to write the output to a file:

# filemon -o dbmon.out -O all

Run trcstop command to signal end of trace.
Sun Aug 19 17:47:34 200

System: AIX 5.3 Node: lpar29p682e_pub Machine: 00CED82E4C00

# trcstop
[filemon: Reporting started]

# [filemon: Reporting completed]

[filemon: 73.906 secs in measured interval]

When we check out the file, here is what we see:

Code View: Scroll / Show All

Sun Aug 19 17:50:45 2007
System: AIX 5.3 Node: lpar29p682e_pub Machine: 00CED82E4C00
Cpu utilization:  68.2%
Cpu allocation:   77.1%
130582780 events were lost. Reported data may have inconsistencies or errors.
Most Active Files
------------------------------------------------------------------------
  #MBs  #opns   #rds   #wrs  file                    volume:inode
   .
   .
   .

Look for long seek times because they can result in decreased application performance. By examining the read and write sequence counts in detail, you can further determine whether the access is sequential or random. This information helps you when it is time to do I/O tuning. The sample output clearly illustrates that there is no I/O bottleneck to speak of in this case.

The filemon command provides a tremendous amount of detail; to be honest, I've found it gives too much information at times. Further, using filemon can impose a large performance hit. I don't typically like to recommend performance tools that impose such a substantial overhead, so I'll reiterate that although filemon certainly has a purpose, you need to be very careful when using it.

11.7. fileplace

fileplace [ {-l|-p} [-i] [-v] ] File | [-m LogicalVolumeName]

The fileplace command reports the placement of a file's blocks within a file system. The command is commonly used to examine and assess the efficiency of a file's placement on disk. For what purposes do you use it? One reason would be to help determine whether some of your heavily used files are substantially fragmented.

The fileplace command can also help you identify the physical volume with the highest utilization and determine whether the drive or I/O adapter is causing the bottleneck. Let's look at an example of a frequently accessed file:

Code View: Scroll / Show All

# fileplace -pv dbfile

File: dbfile  Size: 5374622 bytes  Vol: /dev/hd4
Blk Size: 4096  Frag Size: 4096  Nfrags: 1313
Inode: 21  Mode: -rw-r--r--  Owner: root  Group: system

Physical Addresses (mirror copy 1)                           Logical Extent
----------------------------------                           -----------------
02134816-02134943  hdisk0  128 frags   524288 Bytes,   9.7%  00004352-00004479
02135680-02136864  hdisk0  1185 frags  4853760 Bytes, 90.3%  00005216-00006400

1313 frags over space of 2049 frags:   space efficiency = 64.1%
2 extents out of 1313 possible:   sequentiality = 99.9%

You should be interested in space efficiency and sequentiality here. Higher space efficiency means files are less fragmented and provide better sequential file access. A higher sequentiality tells you that the files are more contiguously allocated, which is also better for sequential file access. In the example, space efficiency could be better, while sequentiality is quite high.

If space and sequentiality are too low, you might want to consider file system reorganization. You would do this with the reorgvg command, which can improve logical volume utilization and efficiency.