iolatency

This commit is contained in:
Brendan Gregg
2014-07-21 19:29:00 -07:00
parent f08575bd6b
commit b3dd8a599f
5 changed files with 454 additions and 0 deletions

View File

@@ -16,6 +16,7 @@ Using perf_events:
Using ftrace:
- [iosnoop](iosnoop): trace disk I/O with details including latency. [Examples](examples/iosnoop_example.txt).
- [iolatency](iolatency): summarize disk I/O latency as a histogram. [Examples](examples/iolatency_example.txt).
- [execsnoop](execsnoop): trace process exec() with command line argument details. [Examples](examples/execsnoop_example.txt).
- [opensnoop](opensnoop): trace open() syscalls showing filenames. [Examples](examples/opensnoop_example.txt).
- kernel/[funccount](kernel/funccount): count kernel functions that match a string. [Examples](examples/funccount_example.txt).

1
bin/iolatency Symbolic link
View File

@@ -0,0 +1 @@
../iolatency

View File

@@ -0,0 +1,57 @@
Demonstrations of iolatency, the Linux ftrace version.
Here's a busy system doing over 4k disk IOPS:
# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 4381 |######################################|
1 -> 2 : 9 |# |
2 -> 4 : 5 |# |
4 -> 8 : 0 | |
8 -> 16 : 1 |# |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 4053 |######################################|
1 -> 2 : 18 |# |
2 -> 4 : 9 |# |
4 -> 8 : 2 |# |
8 -> 16 : 1 |# |
16 -> 32 : 1 |# |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 4658 |######################################|
1 -> 2 : 9 |# |
2 -> 4 : 2 |# |
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 4298 |######################################|
1 -> 2 : 17 |# |
2 -> 4 : 10 |# |
4 -> 8 : 1 |# |
8 -> 16 : 1 |# |
^C
Ending tracing...
Disk I/O latency is usually between 0 and 1 milliseconds, as this system uses
SSDs. There are occasional outliers, up to the 16->32 ms range.
Identifying outliers like these is difficult from iostat(1) alone, which at
the same time reported:
# iostat 1
[...]
avg-cpu: %user %nice %system %iowait %steal %idle
0.53 0.00 1.05 46.84 0.53 51.05
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdap1 0.00 0.00 0.00 28.00 0.00 112.00 8.00 0.02 0.71 0.00 0.71 0.29 0.80
xvdb 0.00 0.00 2134.00 0.00 18768.00 0.00 17.59 0.51 0.24 0.24 0.00 0.23 50.00
xvdc 0.00 0.00 2088.00 0.00 18504.00 0.00 17.72 0.47 0.22 0.22 0.00 0.22 46.40
md0 0.00 0.00 4222.00 0.00 37256.00 0.00 17.65 0.00 0.00 0.00 0.00 0.00 0.00
I/O latency ("await") averages 0.24 and 0.22 ms for our busy disks, but this
output doesn't show that occasionally is much higher.
To get more information on these I/O, try the iosnoop(8) tool.

287
iolatency Executable file
View File

@@ -0,0 +1,287 @@
#!/bin/bash
#
# iolatency - summarize block device I/O latency as a histogram.
# Written using Linux ftrace.
#
# This shows the distribution of latency, allowing modes and latency outliers
# to be identified and studied.
#
# USAGE: ./iolatency [-hT] [-d device] [-i iotype] [interval [count]]
#
# REQUIREMENTS: FTRACE CONFIG and block:block_rq_* tracepoints, which you may
# already have on recent kernels.
#
# OVERHEAD: block device I/O issue and completion events are traced and buffered
# in-kernel, then processed and summarized in user space. There may be
# measurable overhead with this approach, relative to the block device IOPS.
#
# This was written as a proof of concept for ftrace.
#
# From perf-tools: https://github.com/brendangregg/perf-tools
#
# COPYRIGHT: Copyright (c) 2014 Brendan Gregg.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
# (http://www.gnu.org/copyleft/gpl.html)
#
# 20-Jul-2014 Brendan Gregg Created this.
### default variables
tracing=/sys/kernel/debug/tracing
flock=/var/tmp/.ftrace-lock
bufsize_kb=4096
opt_device=0; device=; opt_iotype=0; iotype=; opt_timestamp=0
opt_interval=0; interval=1; opt_count=0; count=0
trap ':' INT QUIT TERM PIPE # sends execution to end tracing section
function usage {
cat <<-END >&2
USAGE: iolatency [-hT] [-d device] [-i iotype] [interval [count]]
-d device # device string (eg, "202,1)
-i iotype # match type (eg, '*R*' for all reads)
-T # timestamp on output
-h # this usage message
interval # summary interval, seconds (default 1)
count # number of summaries
eg,
iolatency # summarize latency every second
iolatency 5 2 # 2 x 5 second summaries
iolatency -i '*R*' # trace reads
iolatency -d 202,1 # trace device 202,1 only
END
exit
}
function warn {
if ! eval "$@"; then
echo >&2 "WARNING: command failed \"$@\""
fi
}
function end {
# disable tracing
echo 2>/dev/null
echo "Ending tracing..." 2>/dev/null
cd $tracing
warn "echo 0 > events/block/block_rq_issue/enable"
warn "echo 0 > events/block/block_rq_complete/enable"
if (( opt_device || opt_iotype )); then
warn "echo 0 > events/block/block_rq_issue/filter"
warn "echo 0 > events/block/block_rq_complete/filter"
fi
warn "echo > trace"
(( wroteflock )) && warn "rm $flock"
}
function die {
echo >&2 "$@"
exit 1
}
function edie {
# die with a quiet end()
echo >&2 "$@"
exec >/dev/null 2>&1
end
exit 1
}
### process options
while getopts d:hi:T opt
do
case $opt in
d) opt_device=1; device=$OPTARG ;;
i) opt_iotype=1; iotype=$OPTARG ;;
T) opt_timestamp=1 ;;
h|?) usage ;;
esac
done
shift $(( $OPTIND - 1 ))
if (( $# )); then
opt_interval=1
interval=$1
shift
fi
if (( $# )); then
opt_count=1
count=$1
fi
if (( opt_device )); then
major=${device%,*}
minor=${device#*,}
dev=$(( (major << 20) + minor ))
fi
### select awk
[[ -x /usr/bin/mawk ]] && awk='mawk -W interactive' || awk=awk
### check permissions
cd $tracing || die "ERROR: accessing tracing. Root user? Kernel has FTRACE?
debugfs mounted? (mount -t debugfs debugfs /sys/kernel/debug)"
### ftrace lock
[[ -e $flock ]] && die "ERROR: ftrace may be in use by PID $(cat $flock) $flock"
echo $$ > $flock || die "ERROR: unable to write $flock."
wroteflock=1
### setup and begin tracing
warn "echo nop > current_tracer"
warn "echo $bufsize_kb > buffer_size_kb"
filter=
if (( opt_iotype )); then
filter="rwbs ~ \"$iotype\""
fi
if (( opt_device )); then
[[ "$filter" != "" ]] && filter="$filter && "
filter="${filter}dev == $dev"
fi
if (( opt_iotype || opt_device )); then
echo $filter
if ! echo "$filter" > events/block/block_rq_issue/filter || \
! echo "$filter" > events/block/block_rq_complete/filter
then
edie "ERROR: setting -d or -t filter. Exiting."
fi
fi
if ! echo 1 > events/block/block_rq_issue/enable || \
! echo 1 > events/block/block_rq_complete/enable; then
edie "ERROR: enabling block I/O tracepoints. Exiting."
fi
etext=
(( !opt_count )) && etext=" Ctrl-C to end."
echo "Tracing block I/O. Output every $interval seconds.$etext"
#
# Determine output format. It may be one of the following (newest first):
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# TASK-PID CPU# TIMESTAMP FUNCTION
# To differentiate between them, the number of header fields is counted,
# and an offset set, to skip the extra column when needed.
#
offset=$($awk 'BEGIN { o = 0; }
$1 == "#" && $2 ~ /TASK/ && NF == 6 { o = 1; }
$2 ~ /TASK/ { print o; exit }' trace)
### print trace buffer
warn "echo > trace"
i=0
while (( !opt_count || (i < count) )); do
(( i++ ))
sleep $interval
# snapshots were added in 3.10
if [[ -x snapshot ]]; then
echo 1 > snapshot
echo > trace
cat snapshot
else
cat trace
echo > trace
fi
(( opt_timestamp )) && printf "time %(%H:%M:%S)T:\n" -1
echo "tick"
done | \
$awk -v o=$offset -v opt_timestamp=$opt_timestamp '
function star(sval, smax, swidth) {
stars = ""
if (smax == 0) return ""
for (si = 0; si < (swidth * sval / smax); si++) {
stars = stars "#"
}
return stars
}
BEGIN { max_i = 0 }
# common fields
$1 != "#" {
time = $(3+o); sub(":", "", time)
dev = $(5+o)
}
# block I/O request
$1 != "#" && $0 ~ /rq_issue/ {
#
# example: (fields1..4+o) 202,1 W 0 () 12862264 + 8 [tar]
# The cmd field "()" might contain multiple words (hex),
# hence stepping from the right (NF-3).
#
loc = $(NF-3)
starts[dev, loc] = time
next
}
# block I/O completion
$1 != "#" && $0 ~ /rq_complete/ {
#
# example: (fields1..4+o) 202,1 W () 12862256 + 8 [0]
#
dir = $(6+o)
loc = $(NF-3)
if (starts[dev, loc] > 0) {
latency_ms = 1000 * (time - starts[dev, loc])
i = 0
for (ms = 1; latency_ms > ms; ms *= 2) { i++ }
hist[i]++
if (i > max_i)
max_i = i
delete starts[dev, loc]
}
next
}
# timestamp
$1 == "time" {
lasttime = $2
}
# print summary
$1 == "tick" {
print ""
if (opt_timestamp)
print lasttime
# find max value
max_v = 0
for (i = 0; i <= max_i; i++) {
if (hist[i] > max_v)
max_v = hist[i]
}
# print histogram
printf "%8s .. %-8s: %-8s |%-38s|\n", ">=(ms)", "<(ms)",
"I/O", "Distribution"
ms = 1
from = 0
for (i = 0; i <= max_i; i++) {
printf "%8d -> %-8d: %-8d |%-38s|\n", from, ms,
hist[i], star(hist[i], max_v, 38)
from = ms
ms *= 2
}
fflush()
delete hist
delete starts # invalid if events missed between snapshots
max_i = 0
}
$0 ~ /LOST.*EVENTS/ { print "WARNING: " $0 > "/dev/stderr" }
'
### end tracing
end

108
man/man8/iolatency.8 Normal file
View File

@@ -0,0 +1,108 @@
.TH iolatency 8 "2014-07-12" "USER COMMANDS"
.SH NAME
iolatency \- summarize block device I/O latency as a histogram. Uses Linux ftrace.
.SH SYNOPSIS
.B iolatency
[\-hT] [\-d device] [\-i iotype] [interval [count]]
.SH DESCRIPTION
This shows the distribution of latency, allowing modes and latency outliers
to be identified and studied. For more details of block device I/O, use
iosnoop(8).
This is a proof of concept tool using ftrace, and involves user space
processing and related overheads. See the OVERHEAD section.
NOTE: Due to the way trace buffers are switched per interval, there is the
possibility of losing a small number of I/O (usually less than 1%). The
summary therefore shows the general distribution, but may be slightly
incomplete. If 100% of I/O must be studied, use iosnoop(8) and post-process.
Also note that I/O may be missed when the trace buffer is full: see the
interval section in OPTIONS.
Since this uses ftrace, only the root user can use this tool.
.SH REQUIREMENTS
FTRACE CONFIG, and the tracepoints block:block_rq_issue and
block:block_rq_complete, which you may already have enabled and available on
recent Linux kernels. And awk.
.SH OPTIONS
.TP
\-d device
Only show I/O issued by this device. (eg, "202,1"). This matches the DEV
column in the iolatency output, and is filtered in-kernel.
.TP
\-i iotype
Only show I/O issued that matches this I/O type. This matches the TYPE column
in the iolatency output, and wildcards ("*") can be used at the beginning or
end (only). Eg, "*R*" matches all reads. This is filtered in-kernel.
.TP
\-h
Print usage message.
.TP
\-T
Include timestamps with each summary output.
.TP
interval
Interval between summary histograms, in seconds.
During the interval, trace output will be buffered in-kernel, which is then
read and processed for the summary. This buffer has a fixed size per-CPU (see
/sys/kernel/debug/tracing/buffer_size_kb). If you think events are missing,
try increasing that size (the bufsize_kb setting in iolatency). With the
default setting (4 Mbytes), I'd expect this to happen around 50k I/O per
summary.
.TP
count
Number of summaries to print.
.SH EXAMPLES
.TP
Default output, print a summary of block I/O latency every 1 second:
#
.B iolatency
.TP
Print 5 x 1 second summaries:
#
.B iolatency 1 5
.TP
Trace reads only:
#
.B iolatency \-i '*R*'
.TP
Trace I/O issued to device 202,1 only:
#
.B iolatency \-d 202,1
.SH FIELDS
.TP
>=(ms)
Latency was greater than or equal-to this value, in milliseconds.
.TP
<(ms)
Latency was less than this value, in milliseconds.
.TP
I/O
Number of block device I/O in this latency range, during the interval.
.TP
Distribution
ASCII histogram representation of the I/O column.
.SH OVERHEAD
Block device I/O issue and completion events are traced and buffered
in-kernel, then processed and summarized in user space. There may be
measurable overhead with this approach, relative to the block device IOPS.
The overhead may be acceptable in many situations. If it isn't, this tool
can be reimplemented in C, or using a different tracer (eg, perf_events,
SystemTap, ktap.)
.SH SOURCE
This is from the perf-tools collection.
.IP
https://github.com/brendangregg/perf-tools
.PP
Also look under the examples directory for a text file containing example
usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
iosnoop(8), iostat(1)