Date:      Tue, 16 Jun 2009 19:03:34 +0400
From:      pluknet <pluknet@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 6.2 sporadically locks up
Message-ID:  <a31046fc0906160803n284604bcs741e6b038079ed12@mail.gmail.com>
In-Reply-To: <200906160830.29721.jhb@freebsd.org>
References:  <a31046fc0906160323s3e4ec60bxb585bb29f9f3a02a@mail.gmail.com> <200906160830.29721.jhb@freebsd.org>

2009/6/16 John Baldwin <jhb@freebsd.org>:
> On Tuesday 16 June 2009 6:23:47 am pluknet wrote:
>> Hi all.
>>
>> This is one of livelocks we have on a weekly basis.
>> Yes, we do still use ULE scheduler on 6.2 and not moved to 7 yet.
>> Any thought?
>>
>> db> ps
>>  pid  ppid  pgrp   uid   state   wmesg     wchan    cmd
>> 70304 69700 69670     0  R                           sh
>> 70303 70292 93818  3572  RL      CPU 2               chrsh
>> 70302 70294 93818  3572  R                           crond
>> 70299 93818 93818     0  R       CPU 1               crond
>> 70298 93818 93818     0  R                           crond
>> 70294 93818 93818  3572  S       piperd   0xd1d8d330 crond
>> 70292 93818 93818  3572  R                           crond
>> 70284 70279 70040 10229  S       biord    0xdbe2e4e8 perl5.8.8
>> 70283 70278 93818 10229  SL      biord    0xdbd70710 exim-4.63-0
>> 70279 70040 70040 10229  S       wait     0xc9005860 sh
>> 70278 69996 93818 10229  S       wait     0xcaf4ac90 sh
>> 70191  4680  4680  9738  S       select   0xc0a12944 httpd
>> 70190  4796  4796 10008  R                           httpd
>> 70188  5043  5043 30532  RL                          httpd
>> 70043 69999 70043  3572  Ss      select   0xc0a12944 wget
>> 70042 70000 70042  3572  Ss      select   0xc0a12944 wget
>> 70041 70001 70041  3572  Ss      select   0xc0a12944 wget
>> 70040 69996 70040 10229  Ss      piperd   0xca35e990 perl5.8.8
>> 70039 70002 70039  3572  Ss      select   0xc0a12944 wget
>
> This is not a full listing so one cannot assume it is a deadlock.

Ok, usually that listing doesn't show anything interesting in this
sort of lockup.
I'll share a full ps output next time (sure, rather soon).

>
>> db> show lockchain Giant
>> thread -3420549 (pid 434, ) ??? (0xc099cb0c)
>
> You would use 'show lock' or perhaps 'show turnstile' with specific lock
> variables.  'show lockchain' needs a TID or PID.

Ok.
As for turnstile, it showed nothing at all, hence omitted.

>
>> db> show allpcpu
>> cpuid        = 0
>> curthread    = 0xc7cfec80: pid 18 "swi4: clock sio"
>>
>> cpuid        = 1
>> curthread    = 0xc99f9960: pid 70299 "crond"
>>
>> cpuid        = 2
>> curthread    = 0xc99f9af0: pid 70303 "chrsh"
>>
>> cpuid        = 3
>> curthread    = 0xd087d320: pid 69700 "sh"
>>
>> cpuid        = 4
>> curthread    = 0xc98f84b0: pid 69604 "httpd"
>>
>> cpuid        = 5
>> curthread    = 0xcaebe190: pid 69598 "httpd"
>>
>> cpuid        = 6
>> curthread    = 0xc7cfe960: pid 27 "irq17: bce1 aacu0"
>>
>> cpuid        = 7
>> curthread    = 0xc837fe10: pid 69711 "arcconf"
>
> This is far more useful output than the truncated 'ps'.  From this, all of the
> CPUs are busy (in at least some deadlocks, all the CPUs would be idle
> instead).  There are several deadlocks fixed since 6.2 that I am aware of,
> but this doesn't look like any of those.  I'm not sure why you aren't getting
> useful stack traces of running threads.

I'll do that next time. I thought it would be similar to the bt PID
output, so I simply didn't include it.

As for allpcpu, I often see the same picture: one CPU runs the "irq17:
bce1 aacu0" thread and another one runs arcconf. I wonder if that might
be a source of bad locking or races.
The arcconf utility uses an ioctl that goes into aac/aacu(4) internals.
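
To check how often that pairing shows up across several lockups, the
'show allpcpu' output could be summarized mechanically. Below is a minimal
sketch in Python, assuming the DDB output has been captured as plain text
(for example from a serial console log). The function names parse_allpcpu
and cpus_running are hypothetical helpers for illustration, not part of
DDB or any FreeBSD tool, and the sample data is taken from the last two
entries of the listing quoted above.

```python
import re

# One "cpuid" line of 'show allpcpu' output, e.g. "cpuid        = 6"
CPUID_RE = re.compile(r'cpuid\s*=\s*(\d+)')
# One "curthread" line, e.g.
#   curthread    = 0xc99f9960: pid 70299 "crond"
CURTHREAD_RE = re.compile(
    r'curthread\s*=\s*(0x[0-9a-fA-F]+):\s*pid\s+(\d+)\s+"([^"]*)"')


def parse_allpcpu(text):
    """Return {cpuid: (thread_addr, pid, name)} parsed from the text."""
    cpus = {}
    current = None
    for line in text.splitlines():
        # Drop mail quoting ("> ", ">> ") so pasted list posts parse too.
        line = line.lstrip("> ").strip()
        m = CPUID_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = CURTHREAD_RE.match(line)
        if m and current is not None:
            cpus[current] = (m.group(1), int(m.group(2)), m.group(3))
    return cpus


def cpus_running(cpus, *needles):
    """Return sorted cpuids whose current thread name contains any needle."""
    return sorted(cpu for cpu, (_, _, name) in cpus.items()
                  if any(n in name for n in needles))


SAMPLE = """\
cpuid        = 6
curthread    = 0xc7cfe960: pid 27 "irq17: bce1 aacu0"

cpuid        = 7
curthread    = 0xc837fe10: pid 69711 "arcconf"
"""

if __name__ == "__main__":
    cpus = parse_allpcpu(SAMPLE)
    # CPUs currently running the aac interrupt thread or arcconf
    print(cpus_running(cpus, "aacu0", "arcconf"))  # prints [6, 7]
```

This only shows that the two threads were on CPUs at the same moment; it
says nothing about which locks they hold, so it is a way to spot the
pattern, not to confirm a race.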

> Perhaps DDB in 6.2 doesn't know to
> look in stoppcbs[].  Hmm, looks like 6.2 only does that if you are using
> KDB_STOP_NMI.  Are you using that kernel option?  If not, you probably want
> to.

No, I'm not. Will that add any noticeable overhead on a running system?

>
> --
> John Baldwin
>

Thank you.

-- 
wbr,
pluknet
