Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 6 Dec 2006 12:09:39 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-stable@freebsd.org
Cc:        Gleb Smirnoff <glebius@freebsd.org>
Subject:   Re: RELENG_6 panic under heavy load
Message-ID:  <200612061209.40253.jhb@freebsd.org>
In-Reply-To: <20061116160900.GQ32700@FreeBSD.org>
References:  <20061113084430.GE59604@dimma.mow.oilspace.com> <20061116111525.GO32700@FreeBSD.org> <20061116160900.GQ32700@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thursday 16 November 2006 11:09, Gleb Smirnoff wrote:
> On Thu, Nov 16, 2006 at 02:15:25PM +0300, Gleb Smirnoff wrote:
> T> On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote:
> T> T>   I wonder why UMA was suspected to be the problem. Dima gave
> T> T> me access to the core. Here are more details from the trace:
> 
> And even more:
> 
> (kgdb) thread 133
> [Switching to thread 133 (Thread 100147)]#0  sched_switch (td=0xd745c900, 
newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980
> 980             sched_lock.mtx_lock = (uintptr_t)td;
> (kgdb) frame 9
> #9  0xd07a6e16 in syscall (frame=
>       {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 
134997504, tf_esi = 134998528, tf_ebp = -813707944, tf_isp = -170046108, 
tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072, tf_eax = 1, tf_trapno = 
0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646, tf_esp 
= -813707972, tf_ss = 59})
>     at /usr/src/sys/i386/i386/trap.c:1034
> 1034            userret(td, &frame, sticks);
> (kgdb) p *callp
> $92 = {sy_narg = 65539, sy_call = 0xd0630550 <poll>, sy_auevent = 43012}
> 
> (kgdb) set $poll = (struct thread *)0xd745c900
> (kgdb) set $fork = (struct thread *)0xd59aad80
> 
> (kgdb) p $poll->td_state
> $93 = TDS_INHIBITED
> (kgdb) p $poll->td_inhibitors
> $94 = 1 == TDI_SUSPENDED
> (kgdb) p/x $poll->td_flags
> $96 = 0x1010c01	== TDF_BORROWING | TDF_BOUNDARY | TDF_ASTPENDING | 
TDF_NEEDRESCHED | TDF_SCHED0
> (kgdb) p $fork->td_state
> $97 = TDS_INHIBITED
> (kgdb) p $fork->td_inhibitors
> $98 = 8 == TDI_LOCK
> (kgdb) p/x $fork->td_flags     
> $99 = 0x1000000 == TDF_SCHED0
> 
> Not everything clear yet, but looks like:
> 
> 1) $fork thread obtains proc lock
> 2) $poll thread blocks on proc lock
> 3) $fork thread has suspended the $poll thread in thread_single()
> 4) $fork thread temporarily unlocks proc lock (line 821) and is
>    preempted by $poll thread
> 5) $poll thread obtains proc lock, and starts doing its poll job
> 6) $fork thread blocks on proc lock, and is added to its turnstile
> 7) $poll thread drops the proc lock, but isn't preempted by $fork
> 8) $poll thread exits and is preempted by $fork
> 
> ...) and here is something difficult to understand, when $poll tries to
> make $fork runnable, while $fork is trying to put itself in the turnstile
> that is owned by $poll

Hmm.  I'm guessing the problem is the $poll thread is suspended (not exited) 
while holding the proc lock?  That would appear to be the problem.  That 
thread can't run again to release the lock.  Ah, yes, I see the bug.  
Something like this should fix it:

Index: kern_thread.c
===================================================================
RCS file: /usr/cvs/src/sys/kern/kern_thread.c,v
retrieving revision 1.216.2.6
diff -u -r1.216.2.6 kern_thread.c
--- kern_thread.c       2 Sep 2006 17:29:57 -0000       1.216.2.6
+++ kern_thread.c       6 Dec 2006 17:06:26 -0000
@@ -969,7 +969,9 @@
        TAILQ_REMOVE(&p->p_suspended, td, td_runq);
        TD_CLR_SUSPENDED(td);
        p->p_suspcount--;
+       critical_enter();
        setrunnable(td);
+       critical_exit();
 }

 /*

What this does is force setrunnable() to be in a nested critical section so we 
won't preempt during setrunnable() until either the caller of 
thread_unsuspend_one() eventually releases sched_lock, or, in the case you 
ran into, the thread does a PROC_UNLOCK() and calls mi_switch().

-- 
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200612061209.40253.jhb>