diff --git a/README.md b/README.md index 1673f9b..d662f4a 100644 --- a/README.md +++ b/README.md @@ -21,6 +21,7 @@ Using ftrace: - [opensnoop](opensnoop): trace open() syscalls showing filenames. [Examples](examples/opensnoop_example.txt). - kernel/[funccount](kernel/funccount): count kernel functions that match a string. [Examples](examples/funccount_example.txt). - kernel/[functrace](kernel/functrace): trace kernel functions that match a string. [Examples](examples/functrace_example.txt). +- kernel/[kprobe](kernel/kprobe): trace a given kprobe definition. [Examples](examples/kprobe_example.txt). - tools/[reset-ftrace](tools/reset-ftrace): reset ftrace state if needed. [Examples](examples/reset-ftrace_example.txt). ## Prerequisites diff --git a/bin/kprobe b/bin/kprobe new file mode 120000 index 0000000..eb1b532 --- /dev/null +++ b/bin/kprobe @@ -0,0 +1 @@ +../kernel/kprobe \ No newline at end of file diff --git a/examples/kprobe_example.txt b/examples/kprobe_example.txt new file mode 100644 index 0000000..e407156 --- /dev/null +++ b/examples/kprobe_example.txt @@ -0,0 +1,168 @@ +Demonstrations of kprobe, the Linux ftrace version. + +This traces do_sys_open() returns, showing the return value: + +# ./kprobe 'r:myopen do_sys_open $retval' +Tracing kprobe myopen. Ctrl-C to end. + kprobe-26386 [001] d... 6593278.858754: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x3 + <...>-26387 [001] d... 6593278.860043: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x3 + <...>-26387 [001] d... 6593278.860064: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x3 + <...>-26387 [001] d... 6593278.860433: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x3 + <...>-26387 [001] d... 6593278.860521: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x3 + supervise-1685 [001] d... 6593279.178806: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1689 [001] d... 6593279.228756: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1689 [001] d... 6593279.229106: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1688 [000] d... 6593279.229501: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1695 [000] d... 6593279.229944: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1685 [001] d... 6593279.230104: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1687 [001] d... 6593279.230293: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1699 [000] d... 6593279.230381: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1692 [000] d... 6593279.230825: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1698 [000] d... 6593279.230915: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1698 [000] d... 6593279.231277: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 + supervise-1690 [000] d... 6593279.231703: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) arg1=0x9 +^C +Ending tracing... + +The syntax for the kprobe definition is the same as can be written to the +/sys/kernel/debug/tracing/kprobe_events file, and is documented in the Linux +kernel source under Documentation/trace/kprobetrace.txt. + +This example probe was named "myopen" (which is arbitrary). You can also provide +arbitrary names for arguments. Eg: + +# ./kprobe 'r:myopen do_sys_open rval=$retval' +Tracing kprobe myopen. Ctrl-C to end. + kprobe-27454 [001] d... 6593356.250019: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x3 + <...>-27455 [001] d... 6593356.251280: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x3 + <...>-27455 [001] d... 6593356.251301: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x3 + <...>-27455 [001] d... 6593356.251672: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x3 + <...>-27455 [001] d... 6593356.251769: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x3 + supervise-1689 [000] d... 6593356.859758: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x9 + supervise-1689 [000] d... 6593356.860143: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x9 + supervise-1696 [000] d... 6593356.862682: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x9 + supervise-1685 [001] d... 6593356.862684: myopen: (SyS_open+0x1e/0x20 <- do_sys_open) rval=0x9 +[...] + +That's a bit better. + + +Tracing the open() mode: + +# ./kprobe 'p:myopen do_sys_open mode=%cx:u16' +Tracing kprobe myopen. Ctrl-C to end. + kprobe-29572 [001] d... 6593503.353923: myopen: (do_sys_open+0x0/0x220) mode=0x1 + kprobe-29572 [001] d... 6593503.353945: myopen: (do_sys_open+0x0/0x220) mode=0x0 + kprobe-29572 [001] d... 6593503.354307: myopen: (do_sys_open+0x0/0x220) mode=0x5c00 + kprobe-29572 [001] d... 6593503.354401: myopen: (do_sys_open+0x0/0x220) mode=0x0 + supervise-1689 [000] d... 6593503.944125: myopen: (do_sys_open+0x0/0x220) mode=0x1a4 + supervise-1688 [001] d... 6593503.944125: myopen: (do_sys_open+0x0/0x220) mode=0x1a4 + supervise-1688 [001] d... 6593503.944606: myopen: (do_sys_open+0x0/0x220) mode=0x1a4 + supervise-1689 [000] d... 6593503.944606: myopen: (do_sys_open+0x0/0x220) mode=0x1a4 + supervise-1698 [000] d... 6593503.944728: myopen: (do_sys_open+0x0/0x220) mode=0x1a4 + supervise-1698 [000] d... 6593503.945077: myopen: (do_sys_open+0x0/0x220) mode=0x1a4 +[...] + +Here I guessed that the mode was in register %cx, and cast it as a 16-bit +unsigned integer (":u16"). Your platform and kernel may be different, and the +mode may be in a different register. If fiddling with such registers becomes too +painful or unreliable for you, consider installing kernel debuginfo and using +the named variables with perf_events "perf probe". + + +Tracing the open() filename: + +# ./kprobe 'p:myopen do_sys_open filename=+0(%si):string' +Tracing kprobe myopen. Ctrl-C to end. + kprobe-32369 [001] d... 6593706.999728: myopen: (do_sys_open+0x0/0x220) filename="/etc/ld.so.cache" + kprobe-32369 [001] d... 6593706.999748: myopen: (do_sys_open+0x0/0x220) filename="/lib/x86_64-linux-gnu/libc.so.6" + kprobe-32369 [001] d... 6593707.000092: myopen: (do_sys_open+0x0/0x220) filename="/usr/lib/locale/locale-archive" + kprobe-32369 [001] d... 6593707.000176: myopen: (do_sys_open+0x0/0x220) filename="trace_pipe" + supervise-1699 [000] d... 6593707.254970: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1689 [001] d... 6593707.254970: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1689 [001] d... 6593707.255432: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1699 [000] d... 6593707.255432: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1695 [001] d... 6593707.258805: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" +[...] + +As mentioned previously, the %si register may be different on your platform. +In this example, I cast it as a string. + + +The -H option will show the headings: + +# ./kprobe -H 'p:myopen do_sys_open filename=+0(%si):string' +Tracing kprobe myopen. Ctrl-C to end. +# tracer: nop +# +# entries-in-buffer/entries-written: 4/4 #P:2 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + kprobe-1460 [001] d... 6593814.070876: myopen: (do_sys_open+0x0/0x220) filename="/etc/ld.so.cache" + kprobe-1460 [001] d... 6593814.070898: myopen: (do_sys_open+0x0/0x220) filename="/lib/x86_64-linux-gnu/libc.so.6" + kprobe-1460 [001] d... 6593814.071256: myopen: (do_sys_open+0x0/0x220) filename="/usr/lib/locale/locale-archive" + kprobe-1460 [001] d... 6593814.071355: myopen: (do_sys_open+0x0/0x220) filename="trace" + kprobe-1460 [001] d... 6593814.070876: myopen: (do_sys_open+0x0/0x220) filename="/etc/ld.so.cache" +[...] + + +Specifying a duration will buffer in-kernel (reducing overhead), and write at +the end. Here's tracing for 10 seconds, and writing to the "out" file: + +# ./kprobe -d 10 'p:myopen do_sys_open filename=+0(%si):string' > out + + +You can match on a single PID only: + +# ./kprobe -p 1696 'p:myopen do_sys_open filename=+0(%si):string' +Tracing kprobe myopen. Ctrl-C to end. + supervise-1696 [001] d... 6593773.677033: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1696 [001] d... 6593773.677332: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1696 [001] d... 6593774.697144: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1696 [001] d... 6593774.697675: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1696 [001] d... 6593775.717986: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" + supervise-1696 [001] d... 6593775.718499: myopen: (do_sys_open+0x0/0x220) filename="supervise/status.new" +^C +Ending tracing... + +This will only show events when that PID is on-CPU. + + +The -v option will show you the available variables you can use in custom +filters: + +# ./kprobe -v 'p:myopen do_sys_open filename=+0(%si):string' +name: myopen +ID: 1443 +format: + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; + field:int common_pid; offset:4; size:4; signed:1; + + field:unsigned long __probe_ip; offset:8; size:8; signed:0; + field:__data_loc char[] filename; offset:16; size:4; signed:1; + +print fmt: "(%lx) filename=\"%s\"", REC->__probe_ip, __get_str(filename) + + +Tracing filenames that end in "stat", by adding a filter: + +# ./kprobe 'p:myopen do_sys_open filename=+0(%si):string' 'filename ~ "*stat"' +Tracing kprobe myopen. Ctrl-C to end. + postgres-1172 [000] d... 6594028.787166: myopen: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat" + postgres-1172 [001] d... 6594028.797410: myopen: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat" + postgres-1172 [001] d... 6594028.797467: myopen: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat" + postgres-4443 [001] d... 6594028.800908: myopen: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat" + postgres-4443 [000] d... 6594028.811237: myopen: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat" + postgres-4443 [000] d... 6594028.811290: myopen: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat" +^C +Ending tracing... + +This filtering is done in-kernel context. diff --git a/kernel/kprobe b/kernel/kprobe new file mode 100755 index 0000000..5b25cfe --- /dev/null +++ b/kernel/kprobe @@ -0,0 +1,196 @@ +#!/bin/bash +# +# kprobe - trace a given kprobe definition. Kernel dynamic tracing. +# Written using Linux ftrace. +# +# This will create, trace, then destroy a given kprobe definition. See +# Documentation/trace/kprobetrace.txt in the Linux kernel source for the +# syntax of a kprobe definition. +# +# USAGE: ./kprobe [-hv] [-d secs] [-p pid] kprobe_definition [filter] +# +# Run "kprobe -h" for full usage. +# +# I wrote this because I kept testing different custom kprobes at the command +# line, and wanted a way to automate the steps. +# +# REQUIREMENTS: FTRACE and KPROBE CONFIG, which you may already have on recent +# kernel versions. +# +# From perf-tools: https://github.com/brendangregg/perf-tools +# +# See the kprobe(8) man page (in perf-tools) for more info. +# +# COPYRIGHT: Copyright (c) 2014 Brendan Gregg. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License +# as published by the Free Software Foundation; either version 2 +# of the License, or (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software Foundation, +# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. +# +# (http://www.gnu.org/copyleft/gpl.html) +# +# 22-Jul-2014 Brendan Gregg Created this. + +### default variables +tracing=/sys/kernel/debug/tracing +flock=/var/tmp/.ftrace-lock; wroteflock=0 +opt_duration=0; duration=; opt_pid=0; pid=; opt_filter=0; filter= +opt_view=0; opt_headers=0; dmesg=2 +trap ':' INT QUIT TERM PIPE # sends execution to end tracing section + +function usage { + cat <<-END >&2 + USAGE: kprobe [-hHv] [-d secs] [-p PID] kprobe_definition [filter] + -d seconds # trace duration, and use buffers + -p PID # PID to match on I/O issue + -v # view format file (don't trace) + -H # include column headers + -h # this usage message + + Note that these examples may need modification to match your kernel + version's function names and platform's register usage. + eg, + kprobe 'r:myopen do_sys_open \$retval' + # trace open() return value + kprobe 'p:myopen do_sys_open mode=%cx:u16' + # trace open() file mode + kprobe 'p:myopen do_sys_open filename=+0(%si):string' + # trace open() with filename +END + exit +} + +function warn { + if ! eval "$@"; then + echo >&2 "WARNING: command failed \"$@\"" + fi +} + +function end { + # disable tracing + echo 2>/dev/null + echo "Ending tracing..." 2>/dev/null + cd $tracing + warn "echo 0 > events/kprobes/$kname/enable" + if (( opt_filter )); then + warn "echo 0 > events/kprobes/$kname/filter" + fi + warn "echo -:$kname >> kprobe_events" + warn "echo > trace" + (( wroteflock )) && warn "rm $flock" +} + +function die { + echo >&2 "$@" + exit 1 +} + +function edie { + # die with a quiet end() + echo >&2 "$@" + exec >/dev/null 2>&1 + end + exit 1 +} + +### process options +while getopts d:hHp:v opt +do + case $opt in + d) opt_duration=1; duration=$OPTARG ;; + p) opt_pid=1; pid=$OPTARG ;; + H) opt_headers=1 ;; + v) opt_view=1 ;; + h|?) usage ;; + esac +done +shift $(( $OPTIND - 1 )) +(( $# )) || usage +kprobe=$1 +shift +if (( $# )); then + opt_filter=1 + filter=$1 +fi + +### option logic +(( opt_pid && opt_filter )) && die "ERROR: use either -p or -f." +(( opt_duration && opt_view )) && die "ERROR: use either -d or -v." +if (( opt_pid )); then + # convert to filter + opt_filter=1 + filter="common_pid == $pid" +fi +if [[ "$kprobe" != p:* && "$kprobe" != r:* ]]; then + echo >&2 "ERROR: invalid kprobe definition (should start with p: or r:)" + usage +fi +kname=${kprobe#*:} +kname=${kname%% *} +if (( !opt_view )); then + if (( opt_duration )); then + echo "Tracing kprobe $kname for $duration seconds (buffered)..." + else + echo "Tracing kprobe $kname. Ctrl-C to end." + fi +fi + +### check permissions +cd $tracing || die "ERROR: accessing tracing. Root user? Kernel has FTRACE? + debugfs mounted? (mount -t debugfs debugfs /sys/kernel/debug)" + +### ftrace lock +[[ -e $flock ]] && die "ERROR: ftrace may be in use by PID $(cat $flock) $flock" +echo $$ > $flock || die "ERROR: unable to write $flock." +wroteflock=1 + +### setup and begin tracing +echo nop > current_tracer +if ! echo $kprobe >> kprobe_events; then + echo >&2 "ERROR: adding kprobe \"$kprobe\"." + if (( dmesg )); then + echo >&2 "Last $dmesg dmesg entries (might contain reason):" + dmesg | tail -$dmesg | sed 's/^/ /' + fi + edie "Exiting." +fi +if (( opt_view )); then + cat events/kprobes/$kname/format + edie "" +fi +if (( opt_filter )); then + if ! echo "$filter" > events/kprobes/$kname/filter; then + edie "ERROR: setting filter or -p. Exiting." + fi +fi +if ! echo 1 > events/kprobes/$kname/enable; then + edie "ERROR: enabling kprobe $kname. Exiting." +fi + +### print trace buffer +warn "echo > trace" +if (( opt_duration )); then + sleep $duration + if (( opt_headers )); then + cat trace + else + grep -v '^#' trace + fi +else + # trace_pipe lack headers, so fetch them from trace + (( opt_headers )) && cat trace + cat trace_pipe +fi + +### end tracing +end diff --git a/man/man8/kprobe.8 b/man/man8/kprobe.8 new file mode 100644 index 0000000..4811a3b --- /dev/null +++ b/man/man8/kprobe.8 @@ -0,0 +1,126 @@ +.TH kprobe 8 "2014-07-20" "USER COMMANDS" +.SH NAME +kprobe \- trace a given kprobe definition. Kernel dynamic tracing. Uses Linux ftrace. +.SH SYNOPSIS +.B kprobe +[\-hv] [\-d secs] [\-p PIDs] kprobe_definition [filter] +.SH DESCRIPTION +This will create, trace, then destroy a given kprobe. See +Documentation/trace/kprobetrace.txt in the Linux kernel source for the +syntax of a kprobe definition. + +WARNING: This uses kernel dynamic tracing, which can cause cause kernel panics +or freezes. Test, and know what you are doing, before use. + +Also beware of feedback loops: tracing tcp functions over an ssh session, +or writing ext4 functions to an ext4 file system. For the former, tcp +trace data could be redirected to a file (as in the usage message). For +the latter, trace to the screen or a different file system. + +Since this uses ftrace, only the root user can use this tool. +.SH REQUIREMENTS +FTRACE and KPROBES CONFIG, which you may already have enabled and available on +recent kernels. +.SH OPTIONS +.TP +\-d seconds +Set the duration of tracing, in seconds. Trace output will be buffered and +printed at the end. This also reduces overheads by buffering in-kernel, +instead of printing events as they occur. + +The ftrace buffer has a fixed size per-CPU (see +/sys/kernel/debug/tracing/buffer_size_kb). If you think events are missing, +try increasing that size. +.TP +\-h +Print usage message. +.TP +\-H +Print column headers. +.TP +\-v +Show the kprobe format file only (do not trace), identifying possible variables +for use in a custom filter. +.TP +\-p PID +Only trace kernel functions when this process ID is on-CPU. +.TP +kprobe_definition +A full kprobe definition, as documented by Documentation/trace/kprobetrace.txt +in the Linux kernel source. See the EXAMPLES section. +.TP +filter +An ftrace filter definition. +.SH EXAMPLES +These examples may need modification to match your kernel version's function +names and platform's register usage. If using platform specific registers +becomes too painful in practice, consider a kernel debuginfo-based tracer, +which can trace variables names instead. For example, perf_events. +.TP +Trace do_sys_open() return values: +# +.B kprobe 'r:myopen do_sys_open $retval' +.TP +Trace do_sys_open() file mode: +# +.B kprobe 'p:myopen do_sys_open mode=%cx:u16' +.TP +Trace do_sys_open() file mode for PID 81: +# +.B kprobe -p 81 'p:myopen do_sys_open mode=%cx:u16' +.TP +Trace do_sys_open() with filename string: +# +.B kprobe 'p:myopen do_sys_open filename=+0(%si):string' +.TP +Trace do_sys_open() for filenames ending in "stat" +# +.B kprobe 'p:myopen do_sys_open fn=+0(%si):string' 'fn ~ "*stat"' +.SH FIELDS +The output format depends on the kernel version, and headings can be printed +using \-H. The format is the same as the ftrace function trace format, described +in the kernel source under Documentation/trace/ftrace.txt. + +Typical fields are: +.TP +TASK-PID +The process name (which could include dashes), a dash, and the process ID. +.TP +CPU# +The CPU ID, in brackets. +.TP +|||| +Kernel state flags. For example, on Linux 3.16 these are for irqs-off, +need-resched, hardirq/softirq, and preempt-depth. +.TP +TIMESTAMP +Time of event, in seconds. +.TP +FUNCTION +Kernel function name. +.SH OVERHEAD +This can generate a lot of trace data quickly, depending on the +frequency of the traced events. Such data will cause performance overheads. +This also works without buffering by default, printing function events +as they happen (uses trace_pipe), context switching and consuming CPU to do +so. If needed, you can try the "\-d secs" option, which buffers events +instead, reducing overhead. If you think the buffer option is losing events, +try increasing the buffer size (buffer_size_kb). + +It's a good idea to use funccount(8) first, which is lower overhead, to +help you select which functions you may want to trace using kprobe(8). +.SH SOURCE +This is from the perf-tools collection: +.IP +https://github.com/brendangregg/perf-tools +.PP +Also look under the examples directory for a text file containing example +usage, output, and commentary for this tool. +.SH OS +Linux +.SH STABILITY +Unstable - in development. +.SH AUTHOR +Brendan Gregg +.SH SEE ALSO +functrace(8), funccount(8)