From: fdc@cliwe.ping.de (Frank D. Cringle)
Date: Sun, 20 Aug 95 16:19:55 +0200
Message-Id: <9508201419.AA00108@cliwe.ping.de>
To: questions@freebsd.org
Subject: Monitoring system performance
Sender: questions-owner@freebsd.org

The domain ping.de is a non-profit club run by enthusiasts on their own time. It provides internet connectivity, mail, and news to over 400 sites using a couple of FreeBSD systems. As with many other providers, the explosive growth in demand regularly exposes performance bottlenecks which are then dealt with reactively by hardware expansion or software reconfiguration.

I am interested in monitoring the systems so that we can react more quickly to problems, preferably by predicting and avoiding them before they affect service. I would love to be told that a suitable package exists and is available at URL:ftp.whatever, but in case none exists I would like to start a discussion by setting out my idea of what would be useful. Comments are very welcome, including any that point out some completely different approach which may not have occurred to me.

First off, I am looking for something that continually gathers statistics for later analysis. Tools such as top, systat, and the recently announced xperfmon++ are oriented towards online display of the current situation.
I believe we need to look back at the behaviour of the system over recent 24-hour periods and also look at trends over preceding weeks and months.

I have been experimenting with vmstat and iostat, starting them with cron at midnight and collecting samples every minute through to the next midnight. The results are then mangled by a perl script to produce input for a graphical display program (e.g. xmgr, xvgr or gnuplot). This gives a good overview of how the system has been performing for those variables that are provided by the two programs.

There are some problems with this approach, however. The output formats of vmstat and iostat are still more oriented towards display than programmed interpretation. Indeed, the vmstat in FreeBSD 2 [sccsid vmstat.c 8.1 (Berkeley) 6/6/93] is harder to interpret than that in earlier versions [sccsid vmstat.c 5.31 (Berkeley)]. The printf formats previously included a space between all columns and now don't, so that columns coalesce if a number overflows its expected width. More important, not all statistics gathered by the kernel are available via vmstat/iostat, and some of the numbers are converted to time-averages, a job I would prefer to handle in a separate analysis. Also, a complete picture of system performance should include network loading. I have not got to grips with running netstat and interpreting its output.

What I would like to see is a generalised statistics gathering program, with the potential to sample all the various counters maintained by the system at regular intervals, and that outputs the results in an easily parsable format. The program would be told how many samples to take and how often to take them on the command line (like vmstat). It could also be told which counters to sample (default all) on the command line or in a parameter file.
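
To illustrate why a separator between columns matters for an "easily parsable" format, here is a small Python sketch of the column-coalescing problem mentioned above. The format strings and the 4-character width are stand-ins chosen for the example; they are not the ones vmstat actually uses, and the helper names format_packed and format_spaced are made up.

```python
# Illustration only: the column width and values are invented for this
# example and do not reproduce vmstat's real printf formats.

def format_packed(values, width=4):
    """Fixed-width columns with no separator between them."""
    return "".join("%*d" % (width, v) for v in values)

def format_spaced(values, width=4):
    """Same columns, but always with one space between them."""
    return " ".join("%*d" % (width, v) for v in values)

normal = [12, 345, 6]        # every value fits in its column
overflow = [12, 34567, 6]    # the middle value is wider than its column

# While all values fit, splitting on whitespace recovers the columns.
print(format_packed(normal).split())     # ['12', '345', '6']

# Once a value overflows, it runs into its left-hand neighbour and
# two columns become one unrecoverable number.
print(format_packed(overflow).split())   # ['1234567', '6']

# With an explicit separator the columns survive the overflow;
# only the visual alignment suffers.
print(format_spaced(overflow).split())   # ['12', '34567', '6']
```

A script that splits each line on whitespace copes with the spaced form regardless of how large the numbers get, which is the property wanted for offline analysis.
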
The output file would be in ASCII and the lines (following a header) would be space-separated lists of numbers representing the delta of each counter value with respect to the previous sample (delta rather than absolute value to reduce file size). The header would be one line per counter giving its absolute value at the start of sampling and a short name or title describing the variable, with an empty line separating the header from the actual samples. The order of header lines would correspond to the order of columns in the subsequent sample lines. The first counter would typically be "time in seconds since 1970".

So, why don't you just go ahead and write the program, Frank? Well, I do not have easy access to a FreeBSD system or to the sources. I have warm and fuzzy feelings towards FreeBSD (and Linux), but my home is a PC-free zone, so I just observe from afar. I have ppp access to a shell account on our club's systems, but I don't think it would be appropriate for me to mess with suid-kmem programs on them. Also, people who are intimately familiar with the kernel and I/O and networking code are better placed to ferret out all the potentially interesting numbers that could be made available.

The goal up to here is to provide raw, unadulterated numbers that can be analysed offline, e.g. using perl and gnuplot. Those plots should provide a better basis for deciding whether to buy more memory or a higher bandwidth network connection or whatever.

Really keen developers could write code to produce HTML displays on the fly out of the statistics files. Then inquisitive users (like me :-) could satisfy their curiosity about why service is so slow today by following a link on their provider's WWW home page.

--
Frank Cringle      | fdc@cliwe.ping.de
voice + fax        | +49 2304 467101
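
The output format proposed in the message (a header of absolute starting values, an empty line, then rows of deltas) can be sketched in Python roughly as follows. This is only one possible reading of the description: putting the value before the name on each header line is an assumption, the function names write_samples and read_samples are made up, and the counter names and numbers other than the time column are invented for the example.

```python
import io
import sys

def write_samples(out, names, snapshots):
    """Write absolute snapshots in the proposed header-plus-deltas layout.

    names:     one short title per counter
    snapshots: list of lists of absolute counter values, one list per sample
    """
    first = snapshots[0]
    # Header: one line per counter, "<start value> <name>" (assumed order).
    for name, value in zip(names, first):
        out.write("%d %s\n" % (value, name))
    out.write("\n")  # empty line separates the header from the samples
    prev = first
    for snap in snapshots[1:]:
        deltas = [cur - old for cur, old in zip(snap, prev)]
        out.write(" ".join(str(d) for d in deltas) + "\n")
        prev = snap

def read_samples(inp):
    """Rebuild the counter names and absolute values from such a file."""
    names, current = [], []
    for line in inp:
        line = line.rstrip("\n")
        if not line:  # the empty line ends the header
            break
        value, name = line.split(None, 1)
        names.append(name)
        current.append(int(value))
    snapshots = [list(current)]
    for line in inp:  # the remaining lines hold the deltas
        if not line.strip():
            continue
        deltas = [int(field) for field in line.split()]
        current = [c + d for c, d in zip(current, deltas)]
        snapshots.append(current)
    return names, snapshots

# Invented example data: three samples taken one minute apart.
names = ["time in seconds since 1970", "interrupts", "context switches"]
snapshots = [
    [808900000, 1000, 50],
    [808900060, 1450, 75],
    [808900120, 1900, 80],
]

write_samples(sys.stdout, names, snapshots)

# Round trip through an in-memory file to check nothing is lost.
buf = io.StringIO()
write_samples(buf, names, snapshots)
buf.seek(0)
assert read_samples(buf) == (names, snapshots)
```

Because the header carries the starting values, an analysis script can recover the absolute counters by running sums, or use the deltas directly as per-interval rates.
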