From owner-freebsd-current@FreeBSD.ORG  Wed Sep 15 06:39:52 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: by hub.freebsd.org (Postfix, from userid 758)
	id 902E716A4CF; Wed, 15 Sep 2004 06:39:52 +0000 (GMT)
Date: Wed, 15 Sep 2004 06:39:52 +0000
From: Kris Kennaway <kris@FreeBSD.org>
To: Ken Smith <kensmith@cse.Buffalo.EDU>
Message-ID: <20040915063952.GB63279@hub.freebsd.org>
References: <200407151424.i6FEOdoq060881@fledge.watson.org>
	<20040715220447.GA32888@xor.obsecurity.org>
	<20040902155947.GA12006@electra.cse.Buffalo.EDU>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20040902155947.GA12006@electra.cse.Buffalo.EDU>
User-Agent: Mutt/1.4.1i
cc: re@FreeBSD.org
cc: current@FreeBSD.org
cc: Kris Kennaway <kris@obsecurity.org>
Subject: Re: 5.3-RELEASE TODO

On Thu, Sep 02, 2004 at 11:59:47AM -0400, Ken Smith wrote:
> On Thu, Jul 15, 2004 at 03:04:47PM -0700, Kris Kennaway wrote:
> 
> > These are the bugs I'm currently tracking (those I can remember right
> > now, at least)

All of these issues except for the last one seem to be resolved for me
now.  I haven't tested the last one (memory tuning on 4GB machines)
because I have tuned my kernel configs to avoid the problem, but I can
remove those changes and see if the problems persist.

I am now seeing a couple of other problems:

* softupdates stack overflow (previously reported; I've now hit this
on two machines).  I might be able to hack around it by increasing
KSTACK_PAGES, but that doesn't help others.  phk could not think of
any way to fix the unboundedness of the dependency chains, and kirk
replied saying he's on vacation.

* I had an apparent scheduler hang tonight (4BSD): the only process
that is running has a trace including sched_switch, and nothing else
apart from the idle tasks is running or runnable.  I'll try to post
more details tomorrow.

* There may be a problem with swapping: I had an extremely weird
sequence of errors (binaries aborting, spurious "missing
/libexec/ld-elf.so.1") on pointyhat at around the time it started
swapping.  I don't know if swapping was the cause or another symptom
of some other problem.  I'll try to reproduce on another machine.

* I was able to break to KDB a few times on pointyhat to try to
diagnose this problem, but eventually it hung trying to enter KDB.
Hangs on entering KDB happen fairly frequently (on SMP machines?).

I think there are some other bugs I'm forgetting right now.

Kris

> > * SMP is unusable for me because of the following frequent panic
> > (actually a panic and another kernel printf interleaved).  Here is the
> > untangled version:
> > 
> > panic: APIC: Previous IPI is stu c k
> >                                 p m a
> >  _ l a z y f i x :   s p
> > u c p u i d  =    0 ;
> >  n   f o r   5 0 0 0 0 0 0 0
> > c D e b u g g e r ( " p a n i
> > 
> > jhb says:
> > 
> > > Seems the two CPUs are deadlocked waiting on each other.  The first sent a
> > > pmap_lazyfixup IPI to the second but the second has interrupts disabled as it
> > > is trying to send an IPI as well.
> > 
> > He suggested a patch, but it did not fix the problem.
> 
> Was this fixed with the IPI patches done before BETA2?
> 
> > * linprocfs 
> > 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; apic id = 00
> > fault virtual address   = 0x8
> > fault code              = supervisor read, page not present
> > instruction pointer     = 0x8:0xc04e1870
> > stack pointer           = 0x10:0xf11e6b50
> > frame pointer           = 0x10:0xf11e6b6c
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, def32 1, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL = 0
> > current process         = 23938 (mtree)
> > kernel: type 12 trap, code=0
> > Stopped at      pfs_getattr+0x130:      movl    0x8(%eax),%eax
> > db> trace
> > pfs_getattr(f11e6b78,c06fda00,cf397b2c,f11e6b98,d23e8a80) at pfs_getattr+0x130
> > vn_stat(cf397b2c,f11e6c80,d23e8a80,0,c5eb0c60) at vn_stat+0x4f
> > lstat(c5eb0c60,f11e6d14,2,2,297) at lstat+0x6a
> > syscall(2f,2f,2f,805a200,805a248) at syscall+0x217
> > Xint0x80_syscall() at Xint0x80_syscall+0x1f
> > --- syscall (190, FreeBSD ELF32, lstat), eip = 0x280ac664, esp = 0xbfbf7594, ebp = 0xbfbf7620 ---
> > 
> > dosirak# addr2line -e kernel.debug 0xc04e1870
> > /usr/src/sys/i386/compile/DOSIRAK/../../../fs/pseudofs/pseudofs_vnops.c:200
> > 
> > [...]
> >         if (pvd->pvd_pid != NO_PID) {
> >                 if ((proc = pfind(pvd->pvd_pid)) == NULL)
> >                         PFS_RETURN (ENOENT);
> > -->             vap->va_uid = proc->p_ucred->cr_ruid;
> > 
> > rwatson has a patch that works around this particular null pointer
> > deref, but the underlying cause is not addressed.
> 
> A patch to pseudofs_vnops.c was made that checks to make sure what pfind()
> returned was "usable".  Did that solve this problem?  Looks like that
> patch went in after you reported this because it's immediately above
> line 200 you show above.
> 
> > * ULE has lots of problems (poor performance on HTT, unable to disable
> > HTT, incorrect load average reporting on SMP machines, ...).  Should
> > be turned off until an active maintainer is found.
> 
> re@ is discussing this now; it looks likely we will shift to 4BSD soon.
> 
> > * ---
> > Fatal trap 12: page fault while in kernel mode
> > fault virtual address   = 0x104
> > fault code              = supervisor read, page not present
> > instruction pointer     = 0x8:0xc058a8cf
> > stack pointer           = 0x10:0xdcb34cc4
> > frame pointer           = 0x10:0xdcb34cec
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, def32 1, gran 1
> > processor eflags        = resume, IOPL = 0
> > current process         = 50 (schedcpu)
> > trap number             = 12
> > panic: page fault
> > 
> > syncing disks, buffers remaining... panic: mi_switch: switch in a critical section
> > 
> > addr2line says the panic was in kern/sched_4bsd.c:327
> > 
> >                                 /*
> >                                  * The kse slptimes are not touched in wakeup
> >                                  * because the thread may not HAVE a KSE.
> >                                  */
> >                                 if (ke->ke_state == KES_ONRUNQ) {
> >                                         awake = 1;
> >                                         ke->ke_flags &= ~KEF_DIDRUN;
> > --->                            } else if ((ke->ke_state == KES_THREAD) &&
> >                                     (TD_IS_RUNNING(ke->ke_thread))) {
> >                                         awake = 1;
> > 
> > gdb -k got confused and couldn't make anything out of the backtrace.
> 
> The code you quote above hasn't changed recently, but a few KSE-related
> fixes have gone in lately if I recall correctly.  Is this one still
> biting you?
> 
> > * Machines with 4GB RAM do not auto-tune kernel memory parameters
> > optimally and easily panic under load with a panic message that does
> > not at least give instructions on what may be wrong and how to fix it.
> 
> Work was done on that fairly recently.  Do you know offhand if that fixed
> what you were seeing?
> 
> Thanks...
> 
> -- 
> 						Ken Smith
> - From there to here, from here to      |       kensmith@cse.buffalo.edu
>   there, funny things are everywhere.   |
>                       - Theodore Geisel |

-- 
--
In God we Trust -- all others must submit an X.509 certificate.
    -- Charles Forsythe <forsythe@alum.mit.edu>