Date: Wed, 15 Sep 2004 06:39:52 +0000
From: Kris Kennaway
To: Ken Smith
Cc: re@FreeBSD.org, current@FreeBSD.org, Kris Kennaway
Subject: Re: 5.3-RELEASE TODO
Message-ID: <20040915063952.GB63279@hub.freebsd.org>
In-Reply-To: <20040902155947.GA12006@electra.cse.Buffalo.EDU>
List-Id: Discussions about the use of FreeBSD-current

On Thu, Sep 02, 2004 at 11:59:47AM -0400, Ken Smith wrote:
> On Thu, Jul 15, 2004 at 03:04:47PM -0700, Kris Kennaway wrote:
>
> > These are the bugs I'm currently tracking (those I can remember right
> > now, at least)

All of these issues except for the last one seem to be resolved for me
now. I haven't tested the last one (memory tuning on 4GB machines)
because I have tuned my kernel configs to avoid the problem, but I can
remove those changes and see if the problems persist.

I am now seeing a couple of other problems:

* softupdates stack overflow (previously reported; I've now hit this
  on two machines). I might be able to hack around it by increasing
  KSTACK_PAGES, but that doesn't help others. phk could not think of
  any way to fix the unboundedness of the dependency chains, and kirk
  replied saying he's on vacation.

* I had an apparent scheduler hang tonight (4BSD): the only process
  that is running has a trace including sched_switch, and nothing else
  apart from the idle tasks is running or runnable. I'll try to post
  more details tomorrow.

* There may be a problem with swapping: I had an extremely weird
  sequence of errors (binaries aborting, spurious "missing
  /libexec/ld-elf.so.1") on pointyhat at around the time it started
  swapping. I don't know if swapping was the cause or another symptom
  of some other problem. I'll try to reproduce on another machine.

* I was able to break to KDB a few times on pointyhat to try and
  diagnose this problem, but eventually it hung trying to enter KDB.
  This happens with fairly high frequency (on SMP machines?)

I think there are some other bugs I'm forgetting right now.

Kris

> > * SMP is unusable for me because of the following frequent panic
> > (actually a panic and another kernel printf interleaved). Here is the
> > untangled version:
> >
> > panic: APIC: Previous IPI is stu c k
> > p m a
> > _ l a z y f i x : s p
> > u c p u i d = 0 ;
> > n f o r 5 0 0 0 0 0 0 0
> > c D e b u g g e r ( " p a n i
> >
> > jhb says:
> >
> > > Seems the two CPUs are deadlocked waiting on each other. The first sent a
> > > pmap_lazyfixup IPI to the second but the second has interrupts disabled as it
> > > is trying to send an IPI as well.
> >
> > He suggested a patch, but it did not fix the problem.
>
> Was this fixed with the IPI patches done before BETA2?
>
> > * linprocfs
> >
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; apic id = 00
> > fault virtual address   = 0x8
> > fault code              = supervisor read, page not present
> > instruction pointer     = 0x8:0xc04e1870
> > stack pointer           = 0x10:0xf11e6b50
> > frame pointer           = 0x10:0xf11e6b6c
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, def32 1, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL = 0
> > current process         = 23938 (mtree)
> > kernel: type 12 trap, code=0
> > Stopped at      pfs_getattr+0x130:      movl    0x8(%eax),%eax
> > db> trace
> > pfs_getattr(f11e6b78,c06fda00,cf397b2c,f11e6b98,d23e8a80) at pfs_getattr+0x130
> > vn_stat(cf397b2c,f11e6c80,d23e8a80,0,c5eb0c60) at vn_stat+0x4f
> > lstat(c5eb0c60,f11e6d14,2,2,297) at lstat+0x6a
> > syscall(2f,2f,2f,805a200,805a248) at syscall+0x217
> > Xint0x80_syscall() at Xint0x80_syscall+0x1f
> > --- syscall (190, FreeBSD ELF32, lstat), eip = 0x280ac664, esp = 0xbfbf7594, ebp = 0xbfbf7620 ---
> >
> > dosirak# addr2line -e kernel.debug 0xc04e1870
> > /usr/src/sys/i386/compile/DOSIRAK/../../../fs/pseudofs/pseudofs_vnops.c:200
> >
> >         [...]
> >         if (pvd->pvd_pid != NO_PID) {
> >                 if ((proc = pfind(pvd->pvd_pid)) == NULL)
> >                         PFS_RETURN (ENOENT);
> > -->             vap->va_uid = proc->p_ucred->cr_ruid;
> >
> > rwatson has a patch that works around this particular null pointer
> > deref, but the underlying cause is not addressed.
>
> A patch to pseudofs_vnops.c was made that checks to make sure what pfind()
> returned was "usable". Did that solve this problem? Looks like that
> patch went in after you reported this because it's immediately above
> line 200 you show above.
>
> > * ULE has lots of problems (poor performance on HTT, unable to disable
> > HTT, incorrect load average reporting on SMP machines, ...). Should
> > be turned off until an active maintainer is found.
>
> re@ is discussing this now, it looks likely we will shift to 4BSD soon.
>
> > * ---
> > Fatal trap 12: page fault while in kernel mode
> > fault virtual address   = 0x104
> > fault code              = supervisor read, page not present
> > instruction pointer     = 0x8:0xc058a8cf
> > stack pointer           = 0x10:0xdcb34cc4
> > frame pointer           = 0x10:0xdcb34cec
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, def32 1, gran 1
> > processor eflags        = resume, IOPL = 0
> > current process         = 50 (schedcpu)
> > trap number             = 12
> > panic: page fault
> >
> > syncing disks, buffers remaining... panic: mi_switch: switch in a critical section
> >
> > addr2line says the panic was in kern/sched_4bsd.c:327
> >
> >                 /*
> >                  * The kse slptimes are not touched in wakeup
> >                  * because the thread may not HAVE a KSE.
> >                  */
> >                 if (ke->ke_state == KES_ONRUNQ) {
> >                         awake = 1;
> >                         ke->ke_flags &= ~KEF_DIDRUN;
> > --->            } else if ((ke->ke_state == KES_THREAD) &&
> >                             (TD_IS_RUNNING(ke->ke_thread))) {
> >                         awake = 1;
> >
> > gdb -k got confused and couldn't make anything out of the backtrace.
>
> The code you quote above hasn't changed recently but a few kse related
> fixes have gone in recently if I recall correctly. Is this one still
> biting you?
>
> > * Machines with 4GB RAM do not auto-tune kernel memory parameters
> > optimally and easily panic under load with a panic message that does
> > not at least give instructions on what may be wrong and how to fix it.
>
> Work was done on that recently-ish, do you know off hand if that fixed
> what you were seeing?
>
> Thanks...
>
> --
>                                                 Ken Smith
> - From there to here, from here to      |       kensmith@cse.buffalo.edu
>   there, funny things are everywhere.   |
>                       - Theodore Geisel |

--
-- In God we Trust -- all others must submit an X.509 certificate.
    -- Charles Forsythe