From: Palle Girgensohn
Date: Tue, 22 Sep 2015 18:46:55 +0200
To: Julien Charbon
Cc: Konstantin Belousov, freebsd-net@freebsd.org, Hans Petter Selasky
Subject: Re: Kernel panics in tcp_twclose
Message-Id: <73856F2B-3E70-483C-9988-C84E798CEB44@FreeBSD.org>
In-Reply-To: <3721F099-F45D-4DCD-8AB3-84D1ABC44145@FreeBSD.org>
References: <26B0FF93-8AE3-4514-BDA1-B966230AAB65@FreeBSD.org> <55FC1809.3070903@freebsd.org> <20150918160605.GN67105@kib.kiev.ua> <55FFBE01.6060706@freebsd.org> <3721F099-F45D-4DCD-8AB3-84D1ABC44145@FreeBSD.org>
List-Id: Networking and TCP/IP with FreeBSD

Hi all,

> On 21 Sep 2015, at 15:53, Palle Girgensohn wrote:
>
>> On 21 Sep 2015, at 10:21, Julien Charbon wrote:
>>
>> Hi Konstantin, Hi Palle,
>>
>> On 18/09/15 18:06, Konstantin Belousov wrote:
>>> On Fri, Sep 18, 2015 at 03:56:25PM +0200, Julien Charbon wrote:
>>>> Hi Palle,
>>>>
>>>> On 18/09/15 11:12, Palle Girgensohn wrote:
>>>>> We see daily panics on our production systems (web server, apache
>>>>> running MPM event, openjdk8. Kernel with VIMAGE. Jails using netgraph
>>>>> interfaces [not epair]).
>>>>>
>>>>> The problem started after the summer. Normal port upgrades seem to
>>>>> be the only difference. The problem occurs with 10.2-p2 kernel as
>>>>> well as 10.1-p4 and 10.1-p15.
>>>>>
>>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203175
>>>>>
>>>>> Any ideas?
>>>>
>>>> Thanks for your detailed report. I am not aware of any tcp_twclose()
>>>> related issues (without VIMAGE) since FreeBSD 10.0 (does not mean there
>>>> are none). A few interesting facts (at least for me):
>>>>
>>>> - Your crash happens when unlocking an inp exclusive lock with INP_WUNLOCK()
>>>>
>>>> - Something is already wrong before calling turnstile_broadcast() as it
>>>> is called with ts = NULL:
>>> In the kernel without witness this is a 99%-sure indication of an attempt to
>>> unlock a lock that is not owned.
>>
>> Thanks, this is useful. So far I did not find any path where
>> tcp_twclose() can call INP_WUNLOCK without having the exclusive lock
>> held, which makes this issue interesting.
>>
>>>> I won't go too far here as I am not expert enough in VIMAGE, but one
>>>> question anyway:
>>>>
>>>> - Can you correlate this kernel panic to a particular event? Like for
>>>> example a VIMAGE/VNET jail destruction.
>>>>
>>>> I will test that on my side on a 10.2 machine.
>>
>> I did not find any issues while testing 10.2 + VIMAGE on my side. Thus
>> Palle what I would suggest:
>>
>> - First, test with stable/10 to see if by chance this issue has already
>> been fixed in stable branch.
>>
>> - Second, if issue is still in stable/10, compile 10.2 kernel with
>> these options:
>>
>> options DDB
>> options DEADLKRES
>> options INVARIANTS
>> options INVARIANT_SUPPORT
>> options WITNESS
>> options WITNESS_SKIPSPIN
>>
>> To see where the original fault is coming from.
>
> Hi,
>
> We just had two crashes within 15 minutes using 10.2 with these two added:
>
> https://svnweb.freebsd.org/changeset/base/287261
>
> https://svnweb.freebsd.org/changeset/base/287780
>
> We don't always get a core dump, but the second time, we did.
>
> very similar stack trace, but not identical:
>
> (kgdb) #0 doadump (textdump=) at pcpu.h:219
> #1 0xffffffff80949a82 in kern_reboot (howto=260)
>     at /usr/src/sys/kern/kern_shutdown.c:451
> #2 0xffffffff80949e65 in vpanic (fmt=,
>     ap=) at /usr/src/sys/kern/kern_shutdown.c:758
> #3 0xffffffff80949cf3 in panic (fmt=0x0)
>     at /usr/src/sys/kern/kern_shutdown.c:687
> #4 0xffffffff80d5d0bb in trap_fatal (frame=,
>     eva=) at /usr/src/sys/amd64/amd64/trap.c:851
> #5 0xffffffff80d5d3bd in trap_pfault (frame=0xfffffe1760bc1840,
>     usermode=) at /usr/src/sys/amd64/amd64/trap.c:674
> #6 0xffffffff80d5ca5a in trap (frame=0xfffffe1760bc1840)
>     at /usr/src/sys/amd64/amd64/trap.c:440
> #7 0xffffffff80d42dd2 in calltrap ()
>     at /usr/src/sys/amd64/amd64/exception.S:236
> #8 0xffffffff8099861c in turnstile_broadcast (ts=0x0, queue=1)
>     at /usr/src/sys/kern/subr_turnstile.c:838
> #9 0xffffffff80948100 in __rw_wunlock_hard (c=0xfffff811c43487a0, tid=1,
>     file=0x1, line=1)
>     at /usr/src/sys/kern/kern_rwlock.c:988
> #10 0xffffffff80b067c4 in tcp_twclose (tw=,
>     reuse=) at /usr/src/sys/netinet/tcp_timewait.c:540
> #11 0xffffffff80b06e0b in tcp_tw_2msl_scan (reuse=0)
>     at /usr/src/sys/netinet/tcp_timewait.c:748
> #12 0xffffffff80b04b0e in tcp_slowtimo ()
>     at /usr/src/sys/netinet/tcp_timer.c:198
> #13 0xffffffff809b7a04 in pfslowtimo (arg=0x0)
>     at /usr/src/sys/kern/uipc_domain.c:508
> #14 0xffffffff8095f91b in softclock_call_cc (c=0xffffffff81620bf0,
>     cc=0xffffffff8169dc00, direct=0) at /usr/src/sys/kern/kern_timeout.c:685
> #15 0xffffffff8095fd44 in softclock (arg=0xffffffff8169dc00)
>     at /usr/src/sys/kern/kern_timeout.c:814
> #16 0xffffffff8091592b in intr_event_execute_handlers (
>     p=, ie=0xfffff801102e0d00)
>     at /usr/src/sys/kern/kern_intr.c:1264
> #17 0xffffffff80915d76 in ithread_loop (arg=0xfffff801102adee0)
>     at /usr/src/sys/kern/kern_intr.c:1277
> #18 0xffffffff8091347a in fork_exit (
>     callout=0xffffffff80915ce0, arg=0xfffff801102adee0,
>     frame=0xfffffe1760bc1c00) at /usr/src/sys/kern/kern_fork.c:1018
> #19 0xffffffff80d4330e in fork_trampoline ()
>     at /usr/src/sys/amd64/amd64/exception.S:611
> #20 0x0000000000000000 in ?? ()
>
> I'll try stable/10 now. Would you suggest a "clean" stable/10, or could 287621 and 287780 help?
>
> I'll add the suggested debugging options right away.
>
> Palle

I have a new core dump from ^/stable/10 with:

options DDB
options DEADLKRES
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN

What can I do with the core dump? "corrupt stack"...

(kgdb) #0 doadump (textdump=1) at pcpu.h:219
#1 0xffffffff8094b337 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:451
#2 0xffffffff8094b845 in vpanic (fmt=, ap=)
    at /usr/src/sys/kern/kern_shutdown.c:758
#3 0xffffffff8094b6d9 in kassert_panic (fmt=)
    at /usr/src/sys/kern/kern_shutdown.c:646
#4 0xffffffff80b1ee59 in tcp_usr_detach (so=)
    at /usr/src/sys/netinet/tcp_usrreq.c:202
#5 0xffffffff809cd291 in sofree (so=0xfffff801dd302000)
    at /usr/src/sys/kern/uipc_socket.c:747
#6 0xffffffff809cdb00 in soclose (so=)
    at /usr/src/sys/kern/uipc_socket.c:849
#7 0xffffffff808fe659 in _fdrop (fp=0xfffff802a593db40, td=0x0) at file.h:343
#8 0xffffffff80901092 in closef (fp=0xfffff802a593db40,
    td=0xfffff80eebc894a0) at /usr/src/sys/kern/kern_descrip.c:2338
#9 0xffffffff808feb5d in closefp (fdp=0xfffff80b20cce000,
    fd=, fp=0xfffff802a593db40, td=0xfffff80eebc894a0,
    holdleaders=) at /usr/src/sys/kern/kern_descrip.c:1194
#10 0xffffffff80d7bc3a in amd64_syscall (td=0xfffff80eebc894a0, traced=0)
    at subr_syscall.c:134
#11 0xffffffff80d5f1db in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:396
#12 0x0000000801c8d94a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language: auto; currently minimal
(kgdb)

Thanks,
Palle
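
A side note on the remark quoted earlier in the thread, that without WITNESS a NULL turnstile at unlock time almost certainly means an attempt to unlock a lock that is not owned: the toy model below is one way to picture it. It is a userland Python sketch, not kernel code. ToyRWLock, toy_wunlock and the two exception classes are hypothetical names, and the assumption that the slow unlock path looks up a wait queue and expects to find one is a simplification of what the first backtrace suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


class NotOwnerError(Exception):
    """Raised by the ownership check (stand-in for a debug-kernel assertion)."""


class NullWaitQueueError(Exception):
    """Stand-in for the NULL dereference seen with ts = NULL."""


@dataclass
class ToyRWLock:
    # Hypothetical model: owner is a thread id, or None when unlocked.
    owner: Optional[int] = None
    # Stand-in for a turnstile: None means no wait queue exists.
    waiters: Optional[List[int]] = None


def toy_wunlock(lock: ToyRWLock, tid: int, invariants: bool) -> str:
    """Release a write lock; returns which path was taken."""
    if invariants and lock.owner != tid:
        # With assertions compiled in, the bad unlock is caught here,
        # close to the real bug.
        raise NotOwnerError(f"thread {tid} does not own the lock")
    if lock.owner == tid and lock.waiters is None:
        # Fast path: we own the lock and nobody is waiting.
        lock.owner = None
        return "fast"
    # Slow path: assumed to run only when waiters exist.
    if lock.waiters is None:
        # Without the ownership check, a non-owner falls through to here
        # and finds no wait queue at all.
        raise NullWaitQueueError("no wait queue for this lock")
    lock.waiters.clear()  # wake everybody
    lock.owner = None
    return "slow"
```

Calling toy_wunlock(lock, 2, invariants=False) on a lock owned by thread 1 ends in NullWaitQueueError, far from the code that caused the problem, while invariants=True reports NotOwnerError at the unlock itself. That is the reason for rebuilding with the debugging options before reading too much into the first trace.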