Date:      Thu, 29 Sep 2005 21:51:48 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Rob Watt <rob@hudson-trading.com>
Cc:        freebsd-hackers@FreeBSD.org, mikep@hudson-trading.com, freebsd-amd64@FreeBSD.org, Jason Carroll <jason@hudson-trading.com>
Subject:   Re: freebsd-5.4-stable panics
Message-ID:  <20050929212738.A34322@fledge.watson.org>
In-Reply-To: <20050929160945.A65402@daemon.mistermishap.net>
References:  <da4a53d805092310237d732554@mail.gmail.com>  <20050925115912.H11229@fledge.watson.org> <20050927140535.G50334@daemon.mistermishap.net> <20050927203128.S61419@fledge.watson.org> <cf6c78405092714227722d534@mail.gmail.com> <20050927222624.R34322@fledge.watson.org> <20050928134724.P56436@daemon.mistermishap.net> <20050929185538.R61419@fledge.watson.org> <20050929160945.A65402@daemon.mistermishap.net>

On Thu, 29 Sep 2005, Rob Watt wrote:

> On Thu, 29 Sep 2005, Robert Watson wrote:
>
>> Could you dump the contents of *td and *td->td_proc for me?  I'm quite
>> interested to know what the value in td->td_proc->p_state is, among other
>> things.  If I could also have you generate a dump of the KSE group
>> structures in td->td_proc->p_ksegrps and the threads in
>> td->td_proc->p_threads.
>
> I've attached a file with many of the values you have asked for. We 
> looked at some of the threads referenced by td->td_proc->p_threads, but 
> we weren't sure we were walking the list correctly. Do you have any tips 
> for walking those thread lists?
>
>> Could you tell me if the program named by p->p_comm is linked against a 
>> threading library?  If it's a custom app, you may already know, and if 
>> not, you can run ldd on the application to see what it is linked 
>> against.
>
> The program named by p->p_comm is linked against the pthreads library.

This seems to be enough information to at least track this down a bit:
td_ksegrp is NULL, rather than a corrupt value, which suggests that the
thread is incompletely initialized.  Other hints that this is the case
are that td_critnest is 1 (as it is set when the thread is allocated), and
the state is TDS_INACTIVE.  Some other fields are set though, such as
td_oncpu, which is normally initialized to NOCPU.

> (kgdb) p *td
> $1 = {td_proc = 0xffffff004aa9f000, td_ksegrp = 0x0, td_plist = {tqe_next = 0xffffff00b4798000,
>     tqe_prev = 0xffffff00a97ae010}, td_kglist = {tqe_next = 0xffffff00b4798000,
>     tqe_prev = 0xffffff00a97ae020}, td_slpq = {tqe_next = 0x0, tqe_prev = 0xffffff001fac7c10}, td_lockq = {
>     tqe_next = 0xffffff00a97ae000, tqe_prev = 0xffffffffb6797a70}, td_runq = {tqe_next = 0x0,
>     tqe_prev = 0xffffffff80608180}, td_selq = {tqh_first = 0x0, tqh_last = 0xffffff00633112c0},
>   td_sleepqueue = 0xffffff00382b0400, td_turnstile = 0xffffff00c1712900, td_umtxq = 0xffffff00d1207080,
>   td_tid = 100253, td_flags = 16777216, td_inhibitors = 0, td_pflags = 128, td_dupfd = 0, td_wchan = 0x0,
>   td_wmesg = 0x0, td_lastcpu = 2 '\002', td_oncpu = 2 '\002', td_owepreempt = 0 '\0', td_locks = 0,
>   td_blocked = 0x0, td_ithd = 0x0, td_lockname = 0x0, td_contested = {lh_first = 0x0}, td_sleeplocks = 0x0,
>   td_intr_nesting_level = 0, td_pinned = 0, td_mailbox = 0x0, td_ucred = 0xffffff00ad18f200,
>   td_standin = 0x0, td_upcall = 0x0, td_sticks = 0, td_uuticks = 0, td_usticks = 0, td_intrval = 0,
>   td_oldsigmask = {__bits = {0, 0, 0, 0}}, td_sigmask = {__bits = {4294967295, 4294967295, 4294967295,
>       4294967295}}, td_siglist = {__bits = {0, 0, 0, 0}}, td_generation = 14, td_sigstk = {ss_sp = 0x0,
>     ss_size = 0, ss_flags = 0}, td_kflags = 0, td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0,
>   td_base_pri = 182 '\uffff', td_priority = 182 '\uffff', td_pcb = 0xffffffffb68dcd10, td_state = TDS_INACTIVE,
>   td_retval = {1, 29309280}, td_slpcallout = {c_links = {sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0,
>         tqe_prev = 0xffffff001fac7d80}}, c_time = 55907602, c_arg = 0xffffff0063311260,
>     c_func = 0xffffffff802e32a0 <sleepq_timeout>, c_mtx = 0x0, c_flags = 16}, td_frame = 0xffffffffb68dcc40,
>   td_kstack_obj = 0xffffff0087f93d20, td_kstack = 18446744072477315072, td_kstack_pages = 4,
>   td_altkstack_obj = 0x0, td_altkstack = 0, td_altkstack_pages = 0, td_critnest = 1, td_md = {
>     md_spinlock_count = 1, md_saved_flags = 582}, td_sched = 0xffffff0063311488}
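
On the question of walking the thread lists: the per-process list is a
queue(3) TAILQ, so the walk starts at p_threads.tqh_first and follows
td_plist.tqe_next until it reaches NULL.  The following is only a
userland sketch of that walk, not kernel code: the structures are
hypothetical, cut-down stand-ins that keep just the linkage fields, and
count_threads() is an illustrative helper, not a kernel function.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct thread {
	TAILQ_ENTRY(thread) td_plist;	/* linkage on the per-process list */
	int td_tid;			/* thread id, as in the dump */
};

struct proc {
	TAILQ_HEAD(, thread) p_threads;	/* head of the thread list */
};

/*
 * Walk the list by hand, the same way one would in the debugger:
 * start at the head's tqh_first and follow each thread's
 * td_plist.tqe_next until it is NULL.
 */
static int
count_threads(struct proc *p)
{
	struct thread *td;
	int n = 0;

	for (td = p->p_threads.tqh_first; td != NULL;
	    td = td->td_plist.tqe_next)
		n++;
	return (n);
}
```

In kgdb the same walk amounts to printing
*td->td_proc->p_threads.tqh_first and then repeatedly printing the
td_plist.tqe_next pointer of the previous result.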

I'm not familiar with the internals of the thread and KSE life cycle here,
so I think we'll need to look to those more familiar with this to
understand which of two things may be going on:

(1) Is td_ksegrp != NULL an invariant for a connected thread, one that
     kern_proc relies on but that the thread code fails to maintain
     safely?

(2) Is td_ksegrp sometimes legitimately left NULL as part of the thread
     life cycle, so that kern_proc incorrectly assumes that it is never
     NULL when hooked up to a thread?

This suggests a possible work-around: simply test td_ksegrp for NULL in
kern_proc to avoid the panic, while we try to resolve whether an
invariant is being violated (or incorrectly assumed), which might require
some serious thinking and a non-trivial solution.  Something like the
following might work in the meantime:

Index: kern_proc.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.231
diff -u -r1.231 kern_proc.c
--- kern_proc.c	27 Sep 2005 18:03:15 -0000	1.231
+++ kern_proc.c	29 Sep 2005 20:50:33 -0000
@@ -882,6 +882,8 @@
 	} else {
 		_PHOLD(p);
 		FOREACH_THREAD_IN_PROC(p, td) {
+			if (td->td_ksegrp == NULL)
+				continue;
 			fill_kinfo_thread(td, &kinfo_proc);
 			PROC_UNLOCK(p);
 			error = SYSCTL_OUT(req, (caddr_t)&kinfo_proc,
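
To make the effect of the guard concrete, here is a small userland model
of the patched loop.  It is a sketch under stated assumptions rather than
the kernel code: the structures are hypothetical stand-ins,
FOREACH_THREAD_IN_PROC is re-defined locally on top of queue(3), and
export_threads() is an illustrative helper standing in for the
fill_kinfo_thread()/SYSCTL_OUT() step.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins; only the fields the loop touches are kept. */
struct ksegrp {
	int kg_placeholder;
};

struct thread {
	TAILQ_ENTRY(thread) td_plist;	/* linkage on the per-process list */
	struct ksegrp *td_ksegrp;	/* NULL for a half-initialized thread */
	int td_tid;
};

struct proc {
	TAILQ_HEAD(, thread) p_threads;
};

/* Local re-definition for this model, built on queue(3). */
#define	FOREACH_THREAD_IN_PROC(p, td)					\
	TAILQ_FOREACH((td), &(p)->p_threads, td_plist)

/*
 * Copy out the tid of every thread that has a KSE group, skipping any
 * thread whose td_ksegrp is still NULL.  Returns the number stored.
 */
static int
export_threads(struct proc *p, int *tids, int max)
{
	struct thread *td;
	int n = 0;

	FOREACH_THREAD_IN_PROC(p, td) {
		if (td->td_ksegrp == NULL)
			continue;	/* the work-around: ignore it */
		if (n < max)
			tids[n++] = td->td_tid;
	}
	return (n);
}
```

Note that the guard only hides the incomplete thread from the export; it
does not explain why such a thread is on the list in the first place.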

I'm going to forward your e-mail to the threads@ list and see if
anyone there wants to talk some more about this.  If you don't mind
testing the above patch to see if this is a workable work-around, we may
want to think about getting it committed in the meantime.

Thanks,

Robert N M Watson


