From owner-freebsd-questions  Sun Aug 20 07:28:45 1995
From: fdc@cliwe.ping.de (Frank D. Cringle)
Date: Sun, 20 Aug 95 16:19:55 +0200
Message-Id: <9508201419.AA00108@cliwe.ping.de>
To: questions@freebsd.org
Subject: Monitoring system performance

The domain ping.de is a non-profit club run by enthusiasts on their own time.
It provides internet connectivity, mail, and news to over 400 sites using a
couple of FreeBSD systems.  As with many other providers, the explosive growth
in demand regularly exposes performance bottlenecks which are then dealt with
reactively by hardware expansion or software reconfiguration.

I am interested in monitoring the systems so that we can react more quickly to
problems, preferably by predicting and avoiding them before they affect
service.  I would love to be told that a suitable package exists and is
available at URL:ftp.whatever, but in case none exists I would like to start a
discussion by setting out my idea of what would be useful.  Comments are very
welcome, including any that point out some completely different approach which
may not have occurred to me.

First off, I am looking for something that continually gathers statistics for
later analysis.  Tools such as top, systat, and the recently announced
xperfmon++ are oriented towards online display of the current situation.  I
believe we need to look back at the behaviour of the system over recent 24-hour
periods and also look at trends over preceding weeks and months.

I have been experimenting with vmstat and iostat, starting them with cron at
midnight and collecting samples every minute through to the next midnight.
The results are then mangled by a Perl script to produce input for a graphical
display program (e.g. xmgr, xvgr or gnuplot).  This gives a good overview of
how the system has been performing for those variables that are provided by
the two programs.
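
As a rough illustration of the post-processing step just described, here is a
minimal sketch in Python (standing in for the Perl script, which is not shown
here).  It assumes a log in which every sample is a line of whitespace-separated
integers and anything else is a heading to be skipped.  The function names, the
column index and the 60-second interval are assumptions made for the example,
not properties of vmstat or iostat.

```python
def parse_ints(line):
    """Return the integers on a line, or None if it is not a pure sample line."""
    fields = line.split()
    if not fields:
        return None
    try:
        return [int(f) for f in fields]
    except ValueError:
        return None  # a heading or other non-numeric line


def to_plot_data(lines, column, interval=60):
    """Build 'elapsed-seconds value' lines for one column of the log.

    `interval` is the assumed sampling period in seconds; `column` is a
    zero-based index chosen by the caller.
    """
    out = []
    n = 0
    for line in lines:
        row = parse_ints(line)
        if row is None or column >= len(row):
            continue
        out.append("%d %d" % (n * interval, row[column]))
        n += 1
    return out


# Made-up sample resembling a vmstat-style log (not real output).
sample_log = [
    " procs   memory",
    " r b w   avm   fre",
    " 1 0 0  5000   300",
    " 0 0 0  5100   250",
]
print("\n".join(to_plot_data(sample_log, 4)))
```

The two-column result can be saved to a file and plotted with gnuplot's plot
command in the usual way.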

There are some problems with this approach, however.  The output formats of
vmstat and iostat are still more oriented towards display than programmed
interpretation.  Indeed, the vmstat in FreeBSD 2 [sccsid vmstat.c 8.1
(Berkeley) 6/6/93] is harder to interpret than that in earlier versions
[sccsid vmstat.c 5.31 (Berkeley)].  The printf formats previously included a
space between all columns and now don't, so that columns coalesce if a number
overflows its expected width.  More important, not all statistics gathered by
the kernel are available via vmstat/iostat and some of the numbers are
converted to time-averages, a job I would prefer to handle in a separate
analysis.  Also, a complete picture of system performance should include
network loading.  I have not got to grips with running netstat and
interpreting its output.
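
The column problem mentioned above is easy to reproduce.  The sketch below
uses two invented printf-style formats (they are not the ones in vmstat.c) to
show how fields run together once a value outgrows its width when no
separating space is printed, and how a single space keeps the line parsable.

```python
def fields(fmt, values):
    """Format the values as one output line, then split it as a parser would."""
    return (fmt % values).split()


values = (2, 0, 1234567)      # the last value overflows a 6-character field

tight = "%3d%3d%6d"           # hypothetical format without separators
spaced = "%3d %3d %6d"        # the same widths with a space between columns

print(repr(tight % values), fields(tight, values))
print(repr(spaced % values), fields(spaced, values))
```

With the tight format the second and third numbers merge into one token, so a
script splitting on whitespace sees two fields instead of three.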

What I would like to see is a generalised statistics gathering program, with
the potential to sample all the various counters maintained by the system at
regular intervals, and that outputs the results in an easily parsable format.
The program would be told how many samples to take and how often to take them
on the command line (like vmstat).  It could also be told which counters to
sample (default all) on the command line or in a parameter file.  The output
file would be in ASCII and the lines (following a header) would be
space-separated lists of numbers representing the delta of each counter value
with respect to the previous sample (delta rather than absolute value to
reduce file size).  The header would be one line per counter giving its absolute
value at the start of sampling and a short name or title describing the
variable, with an empty line separating the header from the actual samples.
The order of header lines would correspond to the order of columns in the
subsequent sample lines.  The first counter would typically be "time in
seconds since 1970".
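
To make the proposed file layout concrete, here is a small Python sketch that
writes and re-reads it.  It is only one reading of the description above: it
assumes each header line is "value name", that the header values count as the
first sample, and that every later line holds deltas against the line before
it.  The function names and the "interrupts" counter are invented for the
example, and no real kernel counters are read.

```python
def format_log(names, samples):
    """Render absolute counter samples in the proposed header-plus-deltas layout.

    `samples` is a list of equal-length lists of absolute counter values;
    the first one supplies the header values.
    """
    first = samples[0]
    lines = ["%d %s" % (value, name) for value, name in zip(first, names)]
    lines.append("")  # empty line separating header from samples
    prev = first
    for cur in samples[1:]:
        lines.append(" ".join(str(c - p) for c, p in zip(cur, prev)))
        prev = cur
    return "\n".join(lines) + "\n"


def parse_log(text):
    """Recover the counter names and absolute values from the layout above."""
    header, _, body = text.partition("\n\n")
    names, current = [], []
    for line in header.splitlines():
        value, name = line.split(None, 1)  # the name may contain spaces
        current.append(int(value))
        names.append(name)
    samples = [list(current)]
    for line in body.splitlines():
        if not line.strip():
            continue
        deltas = [int(f) for f in line.split()]
        current = [c + d for c, d in zip(current, deltas)]
        samples.append(current)
    return names, samples


names = ["time in seconds since 1970", "interrupts"]
samples = [[808900000, 1000], [808900060, 1450], [808900120, 1475]]
text = format_log(names, samples)
print(text, end="")
```

Because the header carries the absolute starting values, summing the deltas
restores every absolute value, so nothing is lost by storing only differences.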

So, why don't you just go ahead and write the program, Frank?  Well, I do not
have easy access to a FreeBSD system or to the sources.  I have warm and fuzzy
feelings towards FreeBSD (and Linux), but my home is a PC-free zone, so I just
observe from afar.  I have PPP access to a shell account on our club's
systems, but I don't think it would be appropriate for me to mess with
suid-kmem programs on them.  Also, people who are intimately familiar with the
kernel, I/O, and networking code are better placed to ferret out all the
potentially interesting numbers that could be made available.


The goal up to here is to provide raw, unadulterated numbers that can be
analysed offline, e.g. using Perl and gnuplot.  Those plots should provide a
better basis for deciding whether to buy more memory, a higher-bandwidth
network connection, or whatever.
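
One example of such offline analysis, sketched in Python under stated
assumptions: each sample row is a list of deltas whose first column is the
elapsed time in seconds (the delta of the "time since 1970" counter), and the
wanted result is a per-second rate for one counter, written as "elapsed rate"
pairs that gnuplot can plot.  The function name and the sample numbers are
made up for illustration.

```python
def plot_lines(delta_rows, column, time_col=0):
    """Turn rows of counter deltas into 'elapsed-seconds rate-per-second' lines."""
    out = []
    elapsed = 0
    for row in delta_rows:
        dt = row[time_col]
        if dt <= 0:
            continue  # no usable elapsed time for this sample, skip it
        elapsed += dt
        out.append("%d %.3f" % (elapsed, row[column] / dt))
    return out


# Invented deltas: [seconds elapsed, counter delta]
rows = [[60, 450], [60, 30], [0, 5], [120, 60]]
print("\n".join(plot_lines(rows, 1)))
```

Keeping the averaging in the analysis step, rather than in the sampler, means
the same raw file can later be re-read with a different averaging window.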

Really keen developers could write code to produce HTML displays on the fly
out of the statistics files.  Then inquisitive users (like me :-) could
satisfy their curiosity about why service is so slow today by following a link
on their provider's WWW home page.

-- 
Frank Cringle                      | fdc@cliwe.ping.de
voice + fax                        | +49 2304 467101