diff --git a/README.md b/README.md index 0e809a3..cf6b2a8 100644 --- a/README.md +++ b/README.md @@ -25,6 +25,7 @@ Using ftrace: - [execsnoop](execsnoop): trace process exec() with command line argument details. [Examples](examples/execsnoop_example.txt). - [opensnoop](opensnoop): trace open() syscalls showing filenames. [Examples](examples/opensnoop_example.txt). - [killsnoop](killsnoop): trace kill() signals showing process and signal details. [Examples](examples/killsnoop_example.txt). +- fs/[cachestat](fs/cachestat): basic cache hit/miss statistics for the Linux page cache. [Examples](examples/cachestat_example.txt). - net/[tcpretrans](net/tcpretrans): show TCP retransmits, with address and other details. [Examples](examples/tcpretrans_example.txt). - system/[tpoint](system/tpoint): trace a given tracepoint. [Examples](examples/tpoint_example.txt). - kernel/[funccount](kernel/funccount): count kernel function calls, matching a string with wildcards. [Examples](examples/funccount_example.txt). diff --git a/bin/cachestat b/bin/cachestat new file mode 120000 index 0000000..a1d4940 --- /dev/null +++ b/bin/cachestat @@ -0,0 +1 @@ +../fs/cachestat \ No newline at end of file diff --git a/examples/cachestat_example.txt b/examples/cachestat_example.txt new file mode 100644 index 0000000..47ba106 --- /dev/null +++ b/examples/cachestat_example.txt @@ -0,0 +1,58 @@ +Demonstrations of cachestat, the Linux ftrace version. + + +Here is some sample output showing file system cache statistics, followed by +the workload that caused it: + +# ./cachestat -t +Counting cache functions... Output every 1 seconds. +TIME HITS MISSES DIRTIES RATIO BUFFERS_MB CACHE_MB +08:28:57 415 0 0 100.0% 1 191 +08:28:58 411 0 0 100.0% 1 191 +08:28:59 362 97 0 78.9% 0 8 +08:29:00 411 0 0 100.0% 0 9 +08:29:01 775 20489 0 3.6% 0 89 +08:29:02 411 0 0 100.0% 0 89 +08:29:03 6069 0 0 100.0% 0 89 +08:29:04 15249 0 0 100.0% 0 89 +08:29:05 411 0 0 100.0% 0 89 +08:29:06 411 0 0 100.0% 0 89 +08:29:07 411 0 3 100.0% 0 89 +[...] + +I used the -t option to include the TIME column, to make describing the output +easier. + +The workload was: + +# echo 1 > /proc/sys/vm/drop_caches; sleep 2; cksum 80m; sleep 2; cksum 80m + +At 8:28:58, the page cache was dropped by the first command, which can be seen +by the drop in size for "CACHE_MB" (page cache size) from 191 Mbytes to 8. +After a 2 second sleep, a cksum command was issued at 8:29:01, for an 80 Mbyte +file (called "80m"), which caused a total of ~20,400 misses ("MISSES" column), +and the page cache size to grow by 80 Mbytes. The hit ratio during this dropped +to 3.6%. Finally, after another 2 second sleep, at 8:29:03 the cksum command +was run a second time, this time hitting entirely from cache. + +Instrumenting all file system cache accesses does cost some overhead, and this +tool might slow your target system by 2% or so. Test before use if this is a +concern. + +This tool also uses dynamic tracing, and is tied to Linux kernel implementation +details. If it doesn't work for you, it probably needs fixing. + + +Use -h to print the USAGE message: + +# ./cachestat -h +USAGE: cachestat [-Dht] [interval] + -D # print debug counters + -h # this usage message + -t # include timestamp + interval # output interval in secs (default 1) + eg, + cachestat # show stats every second + cachestat 5 # show stats every 5 seconds + +See the man page and example file for more info. diff --git a/fs/cachestat b/fs/cachestat new file mode 100755 index 0000000..8df2fd7 --- /dev/null +++ b/fs/cachestat @@ -0,0 +1,167 @@ +#!/bin/bash +# +# cachestat - show Linux page cache hit/miss statistics. +# Uses Linux ftrace. +# +# This is a proof of concept using Linux ftrace capabilities on older kernels, +# and works by using function profiling for in-kernel counters. Specifically, +# four kernel functions are traced: +# +# mark_page_accessed() for measuring cache accesses +# mark_buffer_dirty() for measuring cache writes +# add_to_page_cache_lru() for measuring page additions +# account_page_dirtied() for measuring page dirties +# +# It is possible that these functions have been renamed (or are different +# logically) for your kernel version, and this script will not work as-is. +# This script was written on Linux 3.13. This script is a sandcastle: the +# kernel may wash some away, and you'll need to rebuild. +# +# USAGE: cachestat [-Dht] [interval] +# eg, +# cachestat 5 # show stats every 5 seconds +# +# Run "cachestat -h" for full usage. +# +# WARNING: This uses dynamic tracing of kernel functions, and could cause +# kernel panics or freezes. Test, and know what you are doing, before use. +# It also traces cache activity, which can be frequent, and cost some overhead. +# The statistics should be treated as best-effort: there may be some error +# margin depending on unusual workload types. +# +# REQUIREMENTS: CONFIG_FUNCTION_PROFILER, awk. +# +# From perf-tools: https://github.com/brendangregg/perf-tools +# +# COPYRIGHT: Copyright (c) 2014 Brendan Gregg. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License +# as published by the Free Software Foundation; either version 2 +# of the License, or (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software Foundation, +# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. +# +# (http://www.gnu.org/copyleft/gpl.html) +# +# 28-Dec-2014 Brendan Gregg Created this. + +### default variables +tracing=/sys/kernel/debug/tracing +interval=1; opt_timestamp=0; opt_debug=0 +trap 'quit=1' INT QUIT TERM PIPE HUP # sends execution to end tracing section + +function usage { + cat <<-END >&2 + USAGE: cachestat [-Dht] [interval] + -D # print debug counters + -h # this usage message + -t # include timestamp + interval # output interval in secs (default 1) + eg, + cachestat # show stats every second + cachestat 5 # show stats every 5 seconds + + See the man page and example file for more info. +END + exit +} + +function warn { + if ! eval "$@"; then + echo >&2 "WARNING: command failed \"$@\"" + fi +} + +function die { + echo >&2 "$@" + exit 1 +} + +### process options +while getopts Dht opt +do + case $opt in + D) opt_debug=1 ;; + t) opt_timestamp=1 ;; + h|?) usage ;; + esac +done +shift $(( $OPTIND - 1 )) + +### option logic +if (( $# )); then + interval=$1 +fi +echo "Counting cache functions... Output every $interval seconds." + +### check permissions +cd $tracing || die "ERROR: accessing tracing. Root user? Kernel has FTRACE? + debugfs mounted? (mount -t debugfs debugfs /sys/kernel/debug)" + +### enable tracing +sysctl -q kernel.ftrace_enabled=1 # doesn't set exit status +printf "mark_page_accessed\nmark_buffer_dirty\nadd_to_page_cache_lru\naccount_page_dirtied\n" > set_ftrace_filter || \ + die "ERROR: tracing these four kernel functions: mark_page_accessed,"\ + "mark_buffer_dirty, add_to_page_cache_lru and account_page_dirtied (unknown kernel version?). Exiting." +warn "echo nop > current_tracer" +if ! echo 1 > function_profile_enabled; then + echo > set_ftrace_filter + die "ERROR: enabling function profiling. Have CONFIG_FUNCTION_PROFILER? Exiting." +fi + +(( opt_timestamp )) && printf "%-8s " TIME +printf "%8s %8s %8s %8s %12s %10s" HITS MISSES DIRTIES RATIO "BUFFERS_MB" "CACHE_MB" +(( opt_debug )) && printf " DEBUG" +echo + +### summarize +quit=0; secs=0 +while (( !quit && (!opt_duration || secs < duration) )); do + (( secs += interval )) + echo 0 > function_profile_enabled + echo 1 > function_profile_enabled + sleep $interval + + (( opt_timestamp )) && printf "%(%H:%M:%S)T " -1 + + # cat both meminfo and trace stats, and let awk pick them apart + cat /proc/meminfo trace_stat/function* | awk -v debug=$opt_debug ' + # match meminfo stats: + $1 == "Buffers:" && $3 == "kB" { buffers_mb = $2 / 1024 } + $1 == "Cached:" && $3 == "kB" { cached_mb = $2 / 1024 } + # identify and save trace counts: + $2 ~ /[0-9]/ && $3 != "kB" { a[$1] += $2 } + END { + mpa = a["mark_page_accessed"] + mbd = a["mark_buffer_dirty"] + apcl = a["add_to_page_cache_lru"] + apd = a["account_page_dirtied"] + + total = mpa - mbd + misses = apcl - apd + if (misses < 0) + misses = 0 + hits = total - misses + + ratio = 100 * hits / total + printf "%8d %8d %8d %7.1f%% %12.0f %10.0f", hits, misses, mbd, + ratio, buffers_mb, cached_mb + if (debug) + printf " (%d %d %d %d)", mpa, mbd, apcl, apd + printf "\n" + }' +done + +### end tracing +echo 2>/dev/null +echo "Ending tracing..." 2>/dev/null +warn "echo 0 > function_profile_enabled" +warn "echo > set_ftrace_filter" diff --git a/man/man8/cachestat.8 b/man/man8/cachestat.8 new file mode 100644 index 0000000..0133441 --- /dev/null +++ b/man/man8/cachestat.8 @@ -0,0 +1,111 @@ +.TH cachestat 8 "2014-12-28" "USER COMMANDS" +.SH NAME +cachestat \- Measure page cache hits/misses. Uses Linux ftrace. +.SH SYNOPSIS +.B cachestat +[\-Dht] [interval] +.SH DESCRIPTION +This tool provides basic cache hit/miss statistics for the Linux page cache. + +Its current implementation uses Linux ftrace dynamic function profiling to +create custom in-kernel counters, which is a workaround until such counters +can be built-in to the kernel. Specifically, four kernel functions are counted: +.IP +mark_page_accessed() for measuring cache accesses +.IP +mark_buffer_dirty() for measuring cache writes +.IP +add_to_page_cache_lru() for measuring page additions +.IP +account_page_dirtied() for measuring page dirties +.PP +It is possible that these functions have been renamed (or are different +logically) for your kernel version, and this script will not work as-is. +This was written for a Linux 3.13 kernel, and tested on a few others versions. +This script is a sandcastle: the kernel may wash some away, and you'll +need to rebuild. + +This program's implementation can be improved in the future when other +kernel capabilities are made available. If you need a more reliable tool now, +then consider other tracing alternatives (eg, SystemTap). This tool is really +a proof of concept to see what ftrace can currently do. + +WARNING: This uses dynamic tracing of kernel functions, and could cause +kernel panics or freezes. Test, and know what you are doing, before use. +It also traces cache activity, which can be frequent, and cost some overhead. +The statistics should be treated as best-effort: there may be some error +margin depending on unusual workload types. + +Since this uses ftrace, only the root user can use this tool. +.SH REQUIREMENTS +CONFIG_FUNCTION_PROFILER, which you may already have enabled and available on +recent kernels, and awk. +.SH OPTIONS +.TP +\-D +Include extra fields for debug purposes (see script). +.TP +\-h +Print usage message. +.TP +\-t +Include timestamps in units of seconds. +.TP +interval +Output interval in seconds. Default is 1. +.SH EXAMPLES +.TP +Show per-second page cache statistics: +# +.B cachestat +.SH FIELDS +.TP +TIME +Time, in HH:MM:SS. +.TP +HITS +Number of page cache hits (reads). Each hit is for one memory page (the size +depends on your processor architecture; commonly 4 Kbytes). Since this tool +outputs at a timed interval, this field indicates the cache hit rate. +.TP +MISSES +Number of page cache misses (reads from storage I/O). Each miss is for one +memory page. Cache misses should be causing disk I/O. Run iostat(1) for +correlation (although the miss count and size by the time disk I/O is issued +can differ due to I/O subsystem merging). +.TP +DIRTIES +Number of times a page in the page cache was written to and thus "dirtied". +The same page may be counted multiple times per interval, if it is written +to multiple times. This field gives an indication of how much cache churn there +is, caused by applications writing data. +.TP +RATIO +The ratio of cache hits to total cache accesses (hits + misses), as a +percentage. +.TP +BUFFERS_MB +Size of the buffer cache, for disk I/O. From /proc/meminfo. +.TP +CACHED_MB +Size of the page cache, for file system I/O. From /proc/meminfo. +.SH OVERHEAD +This tool currently uses ftrace function profiling, which provides efficient +in-kernel counters. However, the functions profiled are executed frequently, +so the overheads can add up. Test and measure before use. My own testing +showed around a 2% loss in application performance while this tool was running. +.SH SOURCE +This is from the perf-tools collection. +.IP +https://github.com/brendangregg/perf-tools +.PP +Also look under the examples directory for a text file containing example +usage, output, and commentary for this tool. +.SH OS +Linux +.SH STABILITY +Unstable - in development. +.SH AUTHOR +Brendan Gregg +.SH SEE ALSO +iostat(1), iosnoop(8)