From owner-freebsd-net@freebsd.org Thu Jul 14 21:02:26 2016 Return-Path: Delivered-To: freebsd-net@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3BA37B994A0 for ; Thu, 14 Jul 2016 21:02:26 +0000 (UTC) (envelope-from ler@lerctr.org) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 15FEC12B1 for ; Thu, 14 Jul 2016 21:02:26 +0000 (UTC) (envelope-from ler@lerctr.org) Received: by mailman.ysv.freebsd.org (Postfix) id 11B32B9949A; Thu, 14 Jul 2016 21:02:26 +0000 (UTC) Delivered-To: net@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 10E56B99498; Thu, 14 Jul 2016 21:02:26 +0000 (UTC) (envelope-from ler@lerctr.org) Received: from thebighonker.lerctr.org (thebighonker.lerctr.org [IPv6:2001:470:1f0f:3ad:223:7dff:fe9e:6e8a]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "thebighonker.lerctr.org", Issuer "COMODO RSA Domain Validation Secure Server CA" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id C86CA12AD; Thu, 14 Jul 2016 21:02:25 +0000 (UTC) (envelope-from ler@lerctr.org) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lerctr.org; s=lerami; h=Message-ID:References:In-Reply-To:Subject:Cc:To:From:Date: Content-Transfer-Encoding:Content-Type:MIME-Version:Sender:Reply-To: Content-ID:Content-Description:Resent-Date:Resent-From:Resent-Sender: Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=0bUhy+HwcKimMN98IUmKSSIcpZtLF9apKdNSGskKiaQ=; b=KUrqvK4oH79Df8KvoqdREXT3se 5bKHrKpOv3+Q1yM4MaF9lVu5k4aGicSC2/9qgPGkP7xP5HEeWD44o3pEjTgmpQISlmrdCuBhvUCbm Hx9UNecJ08tS7W+cFqUaYyonX0eyJ7Bauq2buYYOh1tEbRWE1oQVfhIOQV287FVyVRQk=; Received: from thebighonker.lerctr.org ([2001:470:1f0f:3ad:223:7dff:fe9e:6e8a]:12473 helo=webmail.lerctr.org) by thebighonker.lerctr.org with esmtpsa (TLSv1.2:ECDHE-RSA-AES128-GCM-SHA256:128) (Exim 4.87 (FreeBSD)) (envelope-from ) id 1bNnm5-000HmE-51; Thu, 14 Jul 2016 16:02:25 -0500 Received: from proxy.na.alcatel-lucent.com ([135.245.48.74]) by webmail.lerctr.org with HTTP (HTTP/1.1 POST); Thu, 14 Jul 2016 16:02:25 -0500 MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Thu, 14 Jul 2016 16:02:25 -0500 From: Larry Rosenman To: Julien Charbon Cc: Gleb Smirnoff , rrs@freebsd.org, hselasky@freebsd.org, net@freebsd.org, current@freebsd.org, owner-freebsd-current@freebsd.org Subject: Re: panic with tcp timers In-Reply-To: References: <20160617045319.GE1076@FreeBSD.org> <1f28844b-b4ea-b544-3892-811f2be327b9@freebsd.org> <20160620073917.GI1076@FreeBSD.org> <1d18d0e2-3e42-cb26-928c-2989d0751884@freebsd.org> Message-ID: X-Sender: ler@lerctr.org User-Agent: Roundcube Webmail/1.2.0 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Jul 2016 21:02:26 -0000 On 2016-07-14 12:01, Julien Charbon wrote: > Hi, > > On 6/20/16 11:55 AM, Julien Charbon wrote: >> On 6/20/16 9:39 AM, Gleb Smirnoff wrote: >>> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: >>> J> > Comparing stable/10 and head, I see two changes that could >>> J> > affect that: >>> J> > >>> J> > - callout_async_drain >>> J> > - switch to READ lock for inp info in tcp timers >>> J> > >>> J> > That's why you are in To, Julien and Hans :) >>> J> > >>> J> > We continue investigating, and I will keep you updated. >>> J> > However, any help is welcome. I can share cores. >>> >>> Now, spending some time with cores and adding a bunch of >>> extra CTRs, I have a sequence of events that lead to the >>> panic. In short, the bug is in the callout system. It seems >>> to be not relevant to the callout_async_drain, at least for >>> now. The transition to READ lock unmasked the problem, that's >>> why NetflixBSD 10 doesn't panic. >>> >>> The panic requires heavy contention on the TCP info lock. >>> >>> [CPU 1] the callout fires, tcp_timer_keep entered >>> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo); >>> [CPU 2] schedules the callout >>> [CPU 2] tcp_discardcb called >>> [CPU 2] callout successfully canceled >>> [CPU 2] tcpcb freed >>> [CPU 1] unblocks... panic >>> >>> When the lock was WLOCK, all contenders were resumed in a >>> sequence they came to the lock. Now, that they are readers, >>> once the lock is released, readers are resumed in a "random" >>> order, and this allows tcp_discardcb to go before the old >>> running callout, and this unmasks the panic. >> >> Highly interesting. I should be able to reproduce that (will be >> useful >> for testing the corresponding fix). > > Finally, I was able to reproduce it (without glebius fix). The trick > was to really lower TCP keep timer expiration: > > $ sysctl -a | grep tcp.keep > net.inet.tcp.keepidle: 7200000 > net.inet.tcp.keepintvl: 75000 > net.inet.tcp.keepinit: 75000 > net.inet.tcp.keepcnt: 8 > $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl > net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10" > Password: > net.inet.tcp.keepidle: 7200000 -> 10 > net.inet.tcp.keepintvl: 75000 -> 50 > net.inet.tcp.keepinit: 75000 -> 10 > > Note: It will certainly close all your ssh connections to the tested > server. > > Now I will test in order: > > #1. glebius fix > https://svnweb.freebsd.org/base?view=revision&revision=302350 > > #2. rss extra fix > https://reviews.freebsd.org/D7135 > > #3. rrs TCP Timer cleanup > https://reviews.freebsd.org/D7136 > > My panic for reference: > > Fatal trap 9: general protection fault while in kernel mode > cpuid = 10; apic id = 28 > [root@atlas-dl360-4 ~]# instruction pointer = > 0x20:0xffffffff80c346f1 > stack pointer = 0x28:0xfffffe1f29b848b0 > frame pointer = 0x28:0xfffffe1f29b848e0 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL = 0 > current process = 12 (swi4: clock (4)) > trap number = 9 > panic: general protection fault > cpuid = 10 > KDB: stack backtrace: > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame > 0xfffffe1f29b844a0 > vpanic() at vpanic+0x182/frame 0xfffffe1f29b84520 > panic() at panic+0x43/frame 0xfffffe1f29b84580 > trap_fatal() at trap_fatal+0x351/frame 0xfffffe1f29b845e0 > trap() at trap+0x820/frame 0xfffffe1f29b847f0 > calltrap() at calltrap+0x8/frame 0xfffffe1f29b847f0 > --- trap 0x9, rip = 0xffffffff80c346f1, rsp = 0xfffffe1f29b848c0, rbp = > 0xfffffe1f29b848e0 --- > tcp_timer_keep() at tcp_timer_keep+0x51/frame 0xfffffe1f29b848e0 > softclock_call_cc() at softclock_call_cc+0x19c/frame 0xfffffe1f29b849c0 > softclock() at softclock+0x47/frame 0xfffffe1f29b849e0 > intr_event_execute_handlers() at intr_event_execute_handlers+0x96/frame > 0xfffffe1f29b84a20 > ithread_loop() at ithread_loop+0xa6/frame 0xfffffe1f29b84a70 > fork_exit() at fork_exit+0x84/frame 0xfffffe1f29b84ab0 > fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe1f29b84ab0 > --- trap 0, rip = 0, rsp = 0, rbp = 0 --- > > -- > Julien please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884 -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 17716 Limpia Crk, Round Rock, TX 78664-7281