From owner-freebsd-current Mon Apr 15 03:00:48 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3)
	id DAA05149 for current-outgoing; Mon, 15 Apr 1996 03:00:48 -0700 (PDT)
Received: from bunyip.cc.uq.oz.au (pp@bunyip.cc.uq.oz.au [130.102.2.1])
	by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id DAA05124
	for ; Mon, 15 Apr 1996 03:00:27 -0700 (PDT)
Received: from bunyip.cc.uq.oz.au by bunyip.cc.uq.oz.au
	id <23176-0@bunyip.cc.uq.oz.au>; Mon, 15 Apr 1996 20:00:21 +1000
Received: from orion.devetir.qld.gov.au by pandora.devetir.qld.gov.au
	(8.6.10/DEVETIR-E0.3a) with ESMTP id SAA00868
	for ; Mon, 15 Apr 1996 18:54:25 +1000
Received: from localhost by orion.devetir.qld.gov.au (8.6.10/DEVETIR-0.3)
	id SAA14153; Mon, 15 Apr 1996 18:56:00 +1000
Message-Id: <199604150856.SAA14153@orion.devetir.qld.gov.au>
To: freebsd-current@freebsd.org
cc: syssgm@devetir.qld.gov.au
Subject: Re: Just how stable is current
Date: Mon, 15 Apr 1996 18:55:59 +1000
From: Stephen McKay
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Ollivier Robert thinks:

>It seems that J Wunsch said:
>> > Yes, I know that this is a bad question to ask, but....
>>
>> Mine's from the Easter weekend, and i can't complain.
>
>Mine is from Tuesday and is running fine. -CURRENT has been very stable
>for me for at least 3 weeks (if not more).

Not all of us are happy campers. I have a -current kernel from January 9
which works well for me, and I have had various problems with all kernels
built since. My hardware is modest: a 16MHz 386SX with 4MB of RAM, NFS
for all source and object files, and vnconfig swap + real swap totalling
16MB.

I have 3 problems:

1) NFS problem:

My January 9 kernel will work properly as an NFS client with any server
using the 8KB maximum transfer size over UDP. More recent kernels won't.
I get severe performance degradation that I assume comes from lots of
retries and timeouts, even though I can't find them in nfsstat. Many
processes hang for long periods in sbwait, nfsrcvlk and similar network
states. OK, overruns are a common problem with PC network cards,
especially in slow machines. However, setting the maximum transfer size
to 1KB does not cure the problem (or maybe it just moves the problem
elsewhere). Switching to TCP transport produced a total cure, but TCP
mounts are not available on all servers.
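For concreteness, the experiments above correspond roughly to client
mounts like these (server name and paths are invented for illustration;
-r/-w set the NFS read/write sizes and -T selects TCP transport -- see
mount_nfs(8) rather than trusting this sketch):

    # default 8KB transfers over UDP (the problem case)
    mount_nfs server:/usr/src /usr/src

    # clamp transfers to 1KB over UDP (no cure here)
    mount_nfs -r 1024 -w 1024 server:/usr/src /usr/src

    # TCP transport (the total cure, where the server supports it)
    mount_nfs -T server:/usr/src /usr/src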
2) Processes with negative resident size:

On Friday, I started a "make all" of -current and snapped this (some
boring processes deleted):

UID   PID  PPID CPU PRI NI  VSZ  RSS WCHAN  STAT TT       TIME COMMAND
  0     0     0   0 -18  0    0    0 sched  DLs  ??    0:03.33 (swapper)
  0     1     0  12  10  0  392    0 wait   IWs  ??    0:00.33 /sbin/init --
  0     2     0  75 -18  0    0   12 psleep DL   ??   11:54.65 (pagedaemon)
  0     3     0  32  28  0    0   12 psleep DL   ??    2:52.65 (vmdaemon)
  0     4     0   5  29  0    0   12 update DL   ??    0:14.13 (update)
...
  0  2177  2176   9  10  5  340   -4 wait   IWN  p0    0:02.70 make
  0  2179  2177  38  10  5  452    0 wait   IWN  p0    0:00.36 /bin/sh -ec for entry in include lib bin games gnu libexec sbin
  0  2190  2179  75  10  5  308   -4 wait   IWN  p0    0:02.29 make all DIRPRFX
  0  2192  2190 107  10  5  452   -4 wait   IWN  p0    0:00.33 /bin/sh -ec for entry in csu/i386 libc libcompat libcom_err libc
  0  2195  2192  32  10  5 2840    8 wait   IWN  p0    1:12.30 make all DIRPRFX
  0  2233  2195 135  10  5  216   16 wait   IWN  p0    0:00.99 cc -O2 -DLIBC_RCS -DSYSLIBC_RCS -D__DBINTERFACE_PRIVATE -DPOSIX_
  0  2238  2233 109  65  5  848 1004 -      RN   p0    0:17.92 /usr/libexec/cc1 /tmp/cc002233.i -quiet -dumpbase bt_open.c -O2
  0   147     1  48   3  0  156   -4 ttyin  IWs+ v0    0:00.49 /usr/libexec/getty Pc ttyv0

RSS < 0 may be a cosmetic flaw, or it may be seriously buggering the VM
system. I don't know yet, but I'm valiantly struggling through the VM
code. :-)

3) Madly spinning processes:

This morning the scene was:

UID   PID  PPID CPU PRI NI  VSZ  RSS WCHAN  STAT TT       TIME COMMAND
  0  4796  4399 131  10  5  308   -4 wait   IWN  ??    0:01.85 make all DIRPRFX
  0  4798  4796  87  10  5  452   -4 wait   IWN  ??    0:00.72 /bin/sh -ec for entry in as awk bc cc cpio cvs dc dialog diff di
  0  4990  4798 135  10  5  312   -4 wait   IWN  ??    0:01.98 make all DIRPRFX
  0  4992  4990 149  10  5  452   -4 wait   IWN  ??    0:00.39 /bin/sh -ec for entry in libgroff libdriver libbib groff troff n
  0  5011  4992 210  90  5  344   20 -      RN   ?? 3509:56.22 make all DIR

All but one process had reasonable amounts of time accrued. Some even
had normal resident memory. :-)

vmstat -s revealed (sorry, I don't know what's irrelevant here):

  3010564 cpu context switches
 69486232 device interrupts
  2658782 software interrupts
371029200 traps
  1002815 system calls
    86889 swap pager pageins
   195866 swap pager pages paged in
    57630 swap pager pageouts
    82118 swap pager pages paged out
   115789 vnode pager pageins
   238148 vnode pager pages paged in
        0 vnode pager pageouts
        0 vnode pager pages paged out
    41415 page daemon wakeups
 27543608 pages examined by the page daemon
    15642 pages reactivated
   158113 copy-on-write faults
   262888 zero fill pages zeroed
      253 intransit blocking page faults
367919662 total VM faults taken
   514357 pages freed
    39851 pages freed by daemon
   368305 pages freed by exiting processes
      286 pages active
       68 pages inactive
        9 pages in VM cache
      313 pages wired down
       13 pages free
     4096 bytes per page
   550001 total name lookups
          cache hits (77% pos + 2% neg) system 2% per-directory
          deletions 0%, falsehits 4%, toolong 0%

367919662 VM faults over 2.5 days equates to about 1700 per second (the
arithmetic is in the P.S.). This is far in excess of what the machine
can fetch from disk, so they can only be "soft" faults (where the pages
really are there, but the VM system was hoping you didn't need them any
more and was going to free them soon), or some total failure to provide
the needed page at all, causing make to fault again immediately on
returning to user mode. That make process has only 5 resident pages (or
is it 6 :-)), but lots of memory was available for my shell, telnetd,
etc. when I logged in. It isn't lack of real memory that caused this.

Now, for the final twist before the audience can return to the
comfortable normalcy of their own lives: I stopped the whole process
group with SIGSTOP (the commands are sketched in the P.P.S.), and noted
that all processes went from RSS -4 to 8, presumably because the u area
had faulted in. I waited all day (just because I had real work :-)), and
found that the problem make process was eventually reduced to 8KB, like
the others. Then I restarted them with SIGCONT, and blow me down if they
didn't just up and carry on like nothing had happened. The problem make
exited (presumably after finishing successfully), and the compilation is
proceeding normally as I write.

Thanks to all who have bothered to read this far. I shall be consulting
the special texts of the masters (sys/vm/*.[hc]) for enlightenment, but
expect to be beaten to the answer by more knowledgeable persons.

Stephen.
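P.S. The fault rate arithmetic, for anyone checking along at home: 2.5
days is 216000 seconds, and bc(1) (truncating at its default scale)
agrees with the rough figure above:

    $ echo '367919662 / (2.5 * 86400)' | bc
    1703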
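P.P.S. A sketch of the stop/restart experiment, for anyone who wants to
repeat it. The process group ID is an assumption (taken here to be the
top make's PID, 4796); check it with ps -j first, and see kill(1) for
the -s and -- syntax:

    ps -j | grep make        # find the process group ID of the stuck build
    kill -s STOP -- -4796    # SIGSTOP every process in the group
    ps -axl | grep make      # RSS climbs from -4 to 8 as the u areas fault in
    kill -s CONT -- -4796    # SIGCONT; the build carries on where it left off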