Date: Fri, 12 Sep 2008 16:48:22 -0700
From: David Wolfskill
To: performance@freebsd.org
Subject: Using sysctl(1) to gather resource consumption data

At $work, I've been trying to gather information on "interesting
patterns" of resource consumption during moderately long-running
(5 - 8 hour) tasks; the hosts in question usually run FreeBSD 6.2,
though there's an occasional 6.x that's more recent, as
well as a bit of 7-STABLE.

I wanted to have a low impact on the system being measured (of course),
and I was unwilling to require that a system being measured have any
software installed on it other than base FreeBSD.  (Yes, that means I
didn't assume Perl, though in practice in this environment, each host
has it.)  I also wanted the data to be transferred reasonably securely,
even if part of that transit was over facilities over which I had no
control.  (Some of the machines being measured happen to be on a
continent other than the one where I am.)

So I cobbled up a Perl script to run on a data-gathering machine (that
one was mine, so I could require that it had any software I wanted on
it); it acts (if you will) as a "shepherd," watching over child
processes, one of which is created for each host to be measured.

A given child process copies over a shell script to the remote machine,
then redirects STDOUT to append to a file on the data-gathering machine,
and exec()s ssh(1), telling it to run the shell script on the remote
machine.

The shell script fabricates a string (depending on the arguments with
which it was invoked), then sits in a loop:

* eval the string
* sleep for the amount of time remaining

indefinitely.  (In practice, the usual nominal time between successive
eval()s is 5 minutes.  I have recently been doing some experiments at a
10-second interval.)

Periodically, back on the data-gathering machine, a couple of different
things happen:

* The "shepherd" script wakes up and checks the mtime on the file for
  each per-host process (to see if it's been updated "sufficiently
  recently").  Actually, it first checks the file that lists the hosts
  to watch; if its mtime has changed, it's re-read, and the list of
  hosts is modified as appropriate.  Anyway, if a given per-host file is
  "too old," the corresponding child process is killed.  Then the script
  runs through the list of hosts that should be checked, creating a
  per-host process for each one for which that's necessary.
  There's a fair amount of detail I'm eliding (such as limited
  exponential backoff for unresponsive hosts).  In practice, this runs
  every 2 minutes at the moment.

* There's a cron(8)-initiated make(1) process that runs, reading the
  files created by the per-host processes and writing to a
  corresponding RRD.  (I cobbled up a Perl script to do this.)

While I tried to externalize a fair amount of this -- e.g., the list of
sysctl(1) OIDs to use is read from an external file -- it turns out that
certain types of change are a bit ... painful.  In particular, adding a
new "data source" to the RRD qualifies (as "painful").

I recently modified the scripts involved to allow them to also be used
to gather per-NIC statistics (via invocation of "netstat -nibf inet").
I'm about to implement that change over the weekend, so it occurred to
me that this might be a good time to add some more sysctl(1) OIDs.

So I'm asking for suggestions -- ideally, for OIDs that are fairly
easily parseable.  (I started out limited to OIDs that were presented
as a single numeric value per line, then figured out how to handle
kern.cp_time (which is an ordered quintuple); later I figured out how to
cope with vm.loadavg (which is an ordered triplet ... surrounded by
curly braces).  I don't currently have logic to cope with anything more
complicated than those.)
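
For anyone wondering what "fairly easily parseable" means here, the
three output shapes mentioned above (a single number, the kern.cp_time
quintuple, and the brace-wrapped vm.loadavg triplet) can be handled
along the following lines.  This is only a rough sketch in Python, not
the Perl script described in this message; it assumes sysctl(1) output
of the form "name: value ...", the function names are hypothetical, and
the sample values are made up for illustration.

```python
def to_number(text):
    """Return an int when the field looks like one, else a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)  # raises ValueError for non-numeric fields


def parse_sysctl_line(line):
    """Split one 'name: value ...' line into (name, [numbers]).

    Handles a single value, a space-separated tuple (kern.cp_time),
    and a tuple wrapped in curly braces (vm.loadavg).
    """
    name, sep, value = line.partition(":")
    if not sep:
        raise ValueError("not a 'name: value' line: %r" % line)
    # Curly braces carry no information of their own; drop them.
    fields = value.replace("{", " ").replace("}", " ").split()
    if not fields:
        raise ValueError("no value found in: %r" % line)
    return name.strip(), [to_number(f) for f in fields]


# Made-up sample in the three shapes discussed above.
sample = """\
kern.openfiles: 312
kern.cp_time: 104729 12 30517 2210 9876543
vm.loadavg: { 0.42 0.37 0.30 }
"""

for line in sample.splitlines():
    name, values = parse_sysctl_line(line)
    print(name, values)
```

Anything that is not purely numeric after the colon raises ValueError,
which is one way to notice early that an OID is "more complicated" than
the parser can cope with.
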
Here's a list of the OIDs I'm currently using:

debug.dir_entry
debug.direct_blk_ptrs
debug.numcache
debug.numcachehv
debug.numneg
debug.to_avg_depth
debug.to_avg_gcalls
debug.to_avg_mpcalls
hw.usermem
kern.cp_time
kern.ipc.max_datalen
kern.ipc.max_hdr
kern.ipc.maxsockbuf
kern.ipc.msgmax
kern.ipc.msgmnb
kern.ipc.msgmni
kern.ipc.msgtql
kern.ipc.nmbclusters
kern.ipc.nmbjumbo16
kern.ipc.nmbjumbo9
kern.ipc.nmbjumbop
kern.ipc.nsfbufs
kern.ipc.nsfbufspeak
kern.ipc.nsfbufsused
kern.ipc.numopensockets
kern.ipc.pipekva
kern.ipc.pipes
kern.kstack_pages
kern.malloc_count
kern.maxfiles
kern.maxusers
kern.nselcoll
kern.openfiles
net.isr.count
net.isr.deferred
net.isr.directed
net.isr.drop
net.isr.queued
vfs.bufdefragcnt
vfs.buffreekvacnt
vfs.bufmallocspace
vfs.bufreusecnt
vfs.bufspace
vfs.cache.dotdothits
vfs.cache.dothits
vfs.cache.numcache
vfs.cache.numcalls
vfs.cache.numchecks
vfs.cache.numfullpathcalls
vfs.cache.numfullpathfail1
vfs.cache.numfullpathfail2
vfs.cache.numfullpathfail4
vfs.cache.numfullpathfound
vfs.cache.nummiss
vfs.cache.nummisszap
vfs.cache.numneg
vfs.cache.numneghits
vfs.cache.numnegzaps
vfs.cache.numposhits
vfs.cache.numposzaps
vfs.dirtybufferflushes
vfs.dirtybufthresh
vfs.flushwithdeps
vfs.freevnodes
vfs.getnewbufcalls
vfs.getnewbufrestarts
vfs.hibufspace
vfs.hidirtybuffers
vfs.hirunningspace
vfs.lobufspace
vfs.lodirtybuffers
vfs.lorunningspace
vfs.maxbufspace
vfs.maxmallocbufspace
vfs.nfs.downdelayinitial
vfs.nfs.downdelayinterval
vfs.nfs.realign_count
vfs.nfs.realign_test
vfs.nfs.reconnects
vfs.nfs4.access_cache_timeout
vfs.numdirtybuffers
vfs.numfreebuffers
vfs.numvnodes
vfs.read_max
vfs.reassignbufcalls
vfs.wantfreevnodes
vfs.write_behind
vm.loadavg
vm.stats.misc.cnt_prezero
vm.stats.misc.zero_page_count
vm.stats.sys.v_intr
vm.stats.sys.v_soft
vm.stats.sys.v_swtch
vm.stats.sys.v_syscall
vm.stats.sys.v_trap
vm.stats.vm.v_active_count
vm.stats.vm.v_cow_faults
vm.stats.vm.v_cow_optim
vm.stats.vm.v_forkpages
vm.stats.vm.v_forks
vm.stats.vm.v_free_count
vm.stats.vm.v_inactive_count
vm.stats.vm.v_intrans
vm.stats.vm.v_kthreads
vm.stats.vm.v_ozfod
vm.stats.vm.v_pdpages
vm.stats.vm.v_pdwakeups
vm.stats.vm.v_pfree
vm.stats.vm.v_reactivated
vm.stats.vm.v_rforks
vm.stats.vm.v_swapin
vm.stats.vm.v_swapout
vm.stats.vm.v_swappgsin
vm.stats.vm.v_swappgsout
vm.stats.vm.v_tfree
vm.stats.vm.v_vforkpages
vm.stats.vm.v_vforks
vm.stats.vm.v_vm_faults
vm.stats.vm.v_vnodein
vm.stats.vm.v_vnodeout
vm.stats.vm.v_vnodepgsin
vm.stats.vm.v_vnodepgsout
vm.stats.vm.v_wire_count
vm.stats.vm.v_zfod
vm.swap_idle_threshold1
vm.swap_idle_threshold2

I admit that I don't know what several of those actually mean: I
figured I'd capture what I can, then try to make sense of it.  It's very
easy to ignore data that I've captured, but don't need; it's a little
harder to take appropriate corrective action if I determine that there
was some information I should have captured, but didn't. :-}

Still, if something's in there that's just silly, I wouldn't mind
knowing about it. :-)

Thanks!

Peace,
david
-- 
David H. Wolfskill				david@catwhisker.org
Depriving a girl or boy of an opportunity for education is evil.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.