From owner-freebsd-hackers Tue Aug 24 11:56:45 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11])
    by hub.freebsd.org (Postfix) with SMTP id 3F84A14C81
    for ; Tue, 24 Aug 1999 11:56:37 -0700 (PDT)
    (envelope-from dwmalone@maths.tcd.ie)
Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP
    id ; 24 Aug 1999 19:55:02 +0100 (BST)
To: Matthew Dillon
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: vm_fault: pager read error on NFS filesystems.
In-reply-to: Your message of "Tue, 24 Aug 1999 09:58:08 PDT."
    <199908241658.JAA17289@apollo.backplane.com>
X-Request-Do:
Date: Tue, 24 Aug 1999 19:55:01 +0100
From: David Malone
Message-ID: <199908241955.aa23162@salmon.maths.tcd.ie>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> :    1) Stop vm_fault logging so much stuff.
> :    2) Change sendsig to check if catching SIGBUS or SIGSEGV
> :       will cause a SIGBUS or SIGSEGV. If it will, send the process
> :       a SIGKILL.
>
>     Well, we can't do #2 - that would make us incompatible with
>     the API.

I don't see how #2 could break the API - all a process in this state
can do is spin trying to serve SIGBUSes. I think HPUX may KILL
processes in this state.

Yep, on HPUX 10.10 when I run my test program (included at the bottom
of this mail) I get:

    Pid 2717 was killed due to failure in writing the signal context.

Solaris 2.6 gives a bus error. AIX and Digital Unix have the same
behavior as FreeBSD. Linux either does what FreeBSD does or core
dumps, depending on the address of the SIGBUS handler I give it. I'd
like to see what NT's POSIX subsystem does...

BTW - don't run the test program on a -CURRENT machine unless you've
recompiled the kernel today. There was a bug which allowed processes
to catch SIGKILL, and it results in the process being unkillable.

>     It would be fairly easy to add a sysctl to control VM
>     related logging.  The sysctl would default to 1.

Sounds like it would be a step in the right direction (a rough sketch
of what such a knob might look like is at the bottom of this mail).
Someone pointed out to me that NFS will sometimes send a SIGKILL if it
notices the executable has changed - maybe we need a more complete
implementation of this?

>     panics on the client or server?  If you run a kernel with DDB

We have a dump from a 3.2 machine which died trying to core dump an
executable which I'd truncated over NFS (client). I'll include a
backtrace for that too. The program had a global array of 4MB of ints.
It first sleeps, then I truncated the executable on the NFS server,
and then it tries to write to the whole array of ints - the client
went *BOOM* straight away. (A sketch of that test program is at the
bottom of this mail as well.)

>     init should definitely not stop reaping zombies. If the

You betcha - the two machines affected became unusable fairly quickly
'cos of per-user process limits once init broke. Init got stuck in
vmopar while doing a wait on a 3.2-STABLE machine. We haven't managed
to do this reproducibly, but the first stuck zombie was a process
which had suffered from the "vm_fault: pager read error" problem and
which someone had kill -9'ed.

We do have a trace from a 4.0 machine with a normal process stuck in
vmopar. I've tacked that on the bottom too. The problems may be
similar. I don't understand why waiting on a zombie would require the
text of the program at all - but I haven't looked at the code yet.

>     init binary itself runs over NFS and you have updated it,
>     you have no choice but to reboot.

The init binary was run locally. If you want to get access to a kernel
and either of these vmcores I can arrange it.

    David.
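==========================================================================
Rough sketch of the truncated-executable test described above. This is a
reconstruction of the idea rather than the exact program that produced the
3.2 panic below, so treat the details (sleep length, fill value) as
illustrative only.

#include <stdio.h>
#include <unistd.h>

#define NINTS   (4 * 1024 * 1024 / sizeof(int))

/* A global array of 4MB of ints. */
int big[NINTS];

int
main(void)
{
        size_t i;

        /* Sleep first, leaving time to truncate the binary on the server. */
        printf("sleeping - truncate the executable on the NFS server now\n");
        fflush(stdout);
        sleep(60);

        /* Then try to write to the whole array of ints. */
        for (i = 0; i < NINTS; i++)
                big[i] = 1;

        return 0;
}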
==========================================================================
Test program for looping SIGBUS

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>

int
main(int argc, char **argv)
{
        int i;
        struct sigaction action, oa;

        /* Install a handler at a bogus address for every signal we can. */
        action.sa_handler = (void *)0xffffffff;
        sigemptyset(&action.sa_mask);
        action.sa_flags = 0;

        for (i = 0; i < 32; i++)
                if (sigaction(i, &action, &oa) != 0)
                        perror("");
                else
                        printf("%d %x\n", i, oa.sa_flags);

        /* Now fault - delivering the signal to the bogus handler just
           faults again. */
        *((int *)NULL) = 0;
        exit(0);
}

==========================================================================
3.2-STABLE machine panicking on truncated NFS executable

#29 0xc014fc9d in panic (
    fmt=0xc0239a4c "vm_page_unwire: invalid wire count: %d\n")
    at ../../kern/kern_shutdown.c:446
#30 0xc01ead6f in vm_page_unwire (m=0xc05d165c, activate=0)
    at ../../vm/vm_page.c:1328
#31 0xc016dc3e in vfs_vmio_release (bp=0xc377aaa8) at ../../kern/vfs_bio.c:828
#32 0xc016e18b in getnewbuf (vp=0xc761b280, blkno=409, slpflag=256,
    slptimeo=0, size=8192, maxsize=8192) at ../../kern/vfs_bio.c:1107
#33 0xc016e9c8 in getblk (vp=0xc761b280, blkno=409, size=8192, slpflag=256,
    slptimeo=0) at ../../kern/vfs_bio.c:1511
#34 0xc0195a61 in nfs_getcacheblk (vp=0xc761b280, bn=409, size=8192,
    p=0xc75574a0) at ../../nfs/nfs_bio.c:905
#35 0xc0195627 in nfs_write (ap=0xc76c9df0) at ../../nfs/nfs_bio.c:770
#36 0xc017a1c9 in vn_rdwr (rw=UIO_WRITE, vp=0xc761b280,
    base=0x8049000 "\177ELF\001\001\001", len=4194304, offset=4096,
    segflg=UIO_USERSPACE, ioflg=9, cred=0xc115a680, aresid=0x0,
    p=0xc75574a0) at vnode_if.h:331
#37 0xc01404ae in elf_coredump (p=0xc75574a0) at ../../kern/imgact_elf.c:742
#38 0xc0150eb4 in sigexit (p=0xc75574a0, signum=11)
    at ../../kern/kern_sig.c:1241
#39 0xc0150d67 in postsig (signum=11) at ../../kern/kern_sig.c:1158
#40 0xc02079c0 in trap (frame={tf_es = 39, tf_ds = 39, tf_edi = -1077945804,
    tf_esi = 1, tf_ebp = -1077945848, tf_isp = -949182492,
    tf_ebx = -1077945796, tf_edx = -1077946684, tf_ecx = -1077946684,
    tf_eax = 16064, tf_trapno = 12, tf_err = 16064, tf_eip = 134513835,
    tf_cs = 31, tf_eflags = 66195, tf_esp = -1077945852, tf_ss = 39})
    at ../../i386/i386/trap.c:167

==========================================================================
4.0 machine with process stuck in vmopar

(kgdb) proc 5536
(kgdb) bt
#0  mi_switch () at ../../kern/kern_synch.c:827
#1  0xc013c191 in tsleep (ident=0xc059d140, priority=4,
    wmesg=0xc0250b91 "vmopar", timo=0) at ../../kern/kern_synch.c:443
#2  0xc01e2caf in vm_object_page_remove (object=0xc87baf3c, start=0,
    end=978, clean_only=0) at ../../vm/vm_page.h:536
#3  0xc01e7449 in vnode_pager_setsize (vp=0xc8760980, nsize=0)
    at ../../vm/vnode_pager.c:285
#4  0xc01ad117 in nfs_loadattrcache (vpp=0xc86aedbc, mdp=0xc86aedc8,
    dposp=0xc86aedcc, vaper=0x0) at ../../nfs/nfs_subs.c:1383
#5  0xc01b5f80 in nfs_readrpc (vp=0xc8760980, uiop=0xc86aee30,
    cred=0xc0ac6580) at ../../nfs/nfs_vnops.c:1086
#6  0xc018def5 in nfs_getpages (ap=0xc86aee6c) at ../../nfs/nfs_bio.c:154
#7  0xc01e79fe in vnode_pager_getpages (object=0xc87baf3c, m=0xc86aef00,
    count=1, reqpage=0) at vnode_if.h:1067
#8  0xc01dc158 in vm_fault (map=0xc7c5ff80, vaddr=134512640,
    fault_type=1 '\001', fault_flags=0) at ../../vm/vm_pager.h:130
#9  0xc0214a14 in trap_pfault (frame=0xc86aefa8, usermode=1, eva=134513640)
    at ../../i386/i386/trap.c:781
#10 0xc02145a3 in trap (frame={tf_fs = 47, tf_es = 47, tf_ds = 47,
    tf_edi = -1077945600, tf_esi = -1077945608, tf_ebp = -1077945652,
    tf_isp = -932515884, tf_ebx = 1, tf_edx = 1, tf_ecx = 0, tf_eax = 10,
    tf_trapno = 12, tf_err = 4, tf_eip = 134513640, tf_cs = 31,
    tf_eflags = 66118, tf_esp = -1077945652, tf_ss = 47})
    at ../../i386/i386/trap.c:349
#11 0x80483e8 in ?? ()
#12 0x804837d in ?? ()
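==========================================================================
Rough sketch of the sysctl knob for vm_fault logging mentioned above. This
is just an illustration of the idea, not a tested patch - the knob name and
placement are made up.

/* Somewhere in vm/vm_fault.c (or wherever the message is printed): */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/sysctl.h>

static int vm_fault_log_errors = 1;     /* default to 1, as suggested */
SYSCTL_INT(_vm, OID_AUTO, fault_log_errors, CTLFLAG_RW,
    &vm_fault_log_errors, 0, "Log pager read errors from vm_fault");

        /* ... and guard the existing printf with it: */
        if (vm_fault_log_errors)
                printf("vm_fault: pager read error, pid %d (%s)\n",
                    curproc->p_pid, curproc->p_comm);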