Date:      Wed, 17 Aug 2011 14:04:47 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        freebsd-stable@FreeBSD.org
Subject:   Re: panic: spin lock held too long (RELENG_8 from today)
Message-ID:  <20110817210446.GA49737@icarus.home.lan>
In-Reply-To: <20110817175201.GB1973@libertas.local.camdensoftware.com>
References:  <20110707082027.GX48734@deviant.kiev.zoral.com.ua> <4E159959.2070401@sentex.net> <4E15A08C.6090407@sentex.net> <20110818.023832.373949045518579359.hrs@allbsd.org> <20110817175201.GB1973@libertas.local.camdensoftware.com>

On Wed, Aug 17, 2011 at 10:52:01AM -0700, Chip Camden wrote:
> Quoth Hiroki Sato on Thursday, 18 August 2011:
> > Hi,
> > 
> > Mike Tancsa <mike@sentex.net> wrote
> >   in <4E15A08C.6090407@sentex.net>:
> > 
> > mi> On 7/7/2011 7:32 AM, Mike Tancsa wrote:
> > mi> > On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> > mi> >>
> > mi> >> BTW, we had a similar panic, "spinlock held too long", the spinlock
> > mi> >> is the sched lock N, on busy 8-core box recently upgraded to the
> > mi> >> stable/8. Unfortunately, machine hung dumping core, so the stack trace
> > mi> >> for the owner thread was not available.
> > mi> >>
> > mi> >> I was unable to make any conclusion from the data that was present.
> > mi> >> If the situation is reproducable, you coulld try to revert r221937. This
> > mi> >> is pure speculation, though.
> > mi> >
> > mi> > Another crash just now after 5 hrs of uptime. I will try to revert r221937
> > mi> > unless there is any extra debugging you want me to add to the kernel
> > mi> > instead?
> > 
> >  I am also suffering from a reproducible panic on an 8-STABLE box, an
> >  NFS server with heavy I/O load.  I could not get a kernel dump
> >  because this panic locked up the machine just after it occurred, but
> >  according to the stack trace it was the same as the one posted.
> >  Switching back to an 8.2R kernel prevents this panic.
> > 
> >  Any progress on the investigation?
> > 
> > --
> > spin lock 0xffffffff80cb46c0 (sched lock 0) held by 0xffffff01900458c0 (tid 100489) too long
> > panic: spin lock held too long
> > cpuid = 1
> > KDB: stack backtrace:
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> > kdb_backtrace() at kdb_backtrace+0x37
> > panic() at panic+0x187
> > _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
> > _mtx_lock_spin() at _mtx_lock_spin+0x9e
> > sched_add() at sched_add+0x117
> > setrunnable() at setrunnable+0x78
> > sleepq_signal() at sleepq_signal+0x7a
> > cv_signal() at cv_signal+0x3b
> > xprt_active() at xprt_active+0xe3
> > svc_vc_soupcall() at svc_vc_soupcall+0xc
> > sowakeup() at sowakeup+0x69
> > tcp_do_segment() at tcp_do_segment+0x25e7
> > tcp_input() at tcp_input+0xcdd
> > ip_input() at ip_input+0xac
> > netisr_dispatch_src() at netisr_dispatch_src+0x7e
> > ether_demux() at ether_demux+0x14d
> > ether_input() at ether_input+0x17d
> > em_rxeof() at em_rxeof+0x1ca
> > em_handle_que() at em_handle_que+0x5b
> > taskqueue_run_locked() at taskqueue_run_locked+0x85
> > taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
> > fork_exit() at fork_exit+0x11f
> > fork_trampoline() at fork_trampoline+0xe
> > --
> > 
> > -- Hiroki
> 
> 
> I'm also getting similar panics on 8.2-STABLE.  Locks up everything and I
> have to power off.  Once, I happened to be looking at the console when it
> happened and copied down the following:
> 
> Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
> panic: sleeping thread
> cpuid=1

No idea, might be relevant to the thread.

> Another time I got:
> 
> lock order reversal:
> 1st 0xffffff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
> 2nd 0xffffff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
> 
> I didn't copy down the traceback.

"snaplk" refers to UFS snapshots.  The above must have been typed in
manually as well, due to some typos in filenames as well.
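
A quick sanity check, if you want to confirm whether snapshots are in
play: dump -L and mksnap_ffs conventionally put snapshot files in the
.snap directory at the root of each filesystem.  The mount points below
are only examples -- substitute your own:

  ls -l /usr/.snap /var/.snap

Anything listed there is an active UFS snapshot.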

Either this is a different problem, or, if everyone in this thread is
using UFS snapshots (dump -L, mksnap_ffs, etc.) when the panic hits,
then I recommend people stop using UFS snapshots.  I've ranted about
their unreliability in the past (years upon years ago -- still seems
valid) and about just how badly they can "wedge" a system.  This is one
of the many (MANY!) reasons we use rsnapshot/rsync instead; the atime
clobbering is the only downside of that approach.
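
For the curious, the kind of thing we run instead of dump -L is just a
plain rsync over the network -- the paths and hostname here are only
placeholders, adjust to taste:

  rsync -aH --delete /home/ backuphost:/backups/home/

rsnapshot wraps essentially the same rsync call in a rotation scheme
(hourly/daily/weekly trees linked with hard links), and neither of them
touches UFS snapshots at all.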

I don't see what this has to do with "heavy WAN I/O" unless you're doing
something like dump-over-ssh, in which case see the above paragraph.
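
By "dump-over-ssh" I mean the classic pattern below (hostname and paths
are placeholders); note the -L, which is exactly what triggers UFS
snapshot creation before the dump starts:

  dump -0Lauf - /usr | ssh backuphost "cat > /backups/usr.dump"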

> These panics seem to hit when I'm doing heavy WAN I/O.  I can go for
> about a day without one as long as I stay away from the web or even chat.
> Last night this system copied a backup of 35GB over the local network
> without failing, but as soon as I hopped onto Firefox this morning, down
> she went.  I don't know if that's coincidence or useful data.
> 
> I didn't get to say "Thanks" to Eitan Adler for attempting to help me
> with this on Monday night.  Thanks, Eitan!

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
