Message-ID: <4EFA3127.20705@FreeBSD.org>
Date: Tue, 27 Dec 2011 12:57:11 -0800
From: Maxim Sobolev
Organization: Sippy Software, Inc.
To: "Bjoern A. Zeeb"
Cc: freebsd-net@freebsd.org, Robert Watson, Jack Vogel
References: <4EB804D2.2090101@FreeBSD.org> <4EB86276.6080801@sippysoft.com> <4EB86866.9060102@sippysoft.com>
Subject: Re: Panic in the udp_input() under heavy load

So it's actually happening:

Nov  8 21:38:02 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff05e5798bd0
Nov 13 03:34:49 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff02e5b05930
Nov 30 04:18:11 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff03b2d2e000
Nov 30 20:24:12 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff03a35e33f0
Nov 30 22:03:20 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff03a6349690
Dec  5 03:33:01 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff02e0c9e930
Dec  9 06:02:06 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff038a4fea80

I'd love to try the socket closure locking patch that Robert
suggested, but I'm kinda loaded right now. Robert, would it be too much
to ask for a version of the patch that applies to the latest 8-STABLE
for a test? I'd give it a spin on 2-3 production boxes.

And yes, those servers do a lot of socket ops per second, I'd say on
the order of hundreds if not thousands per second.

-Maxim

On 11/7/2011 3:25 PM, Bjoern A. Zeeb wrote:
> On Mon, 7 Nov 2011, Maxim Sobolev wrote:
>
>> On 11/7/2011 2:57 PM, Maxim Sobolev wrote:
>>> On 11/7/2011 10:24 AM, Bjoern A. Zeeb wrote:
>>>> Unlikely; the inp is properly locked there and the udp info attach
>>>> better still be valid there; your problem is most likely elsewhere;
>>>> try to see if you have other threads and see what they do at the same
>>>> time, etc. You would need to race with udp_detach(); you also want
>>>> to make sure that the inp still looks sane from either ddb or a dump
>>>> and we are not talking about random memory corruption here.
>>>
>>> Well, as you can see from the trace it points pretty strongly to that
>>> piece of code. And as I said this panic is completely reproducible,
>>> we've seen it at least 5 times to date in exactly this location.
>>> Unfortunately the trace is rather long so we could not capture it in
>>> full before, until we switched to the 80x50 mode.
>>>
>>> If it were memory corruption it would be just a random fault, while
>>> here we have it failing at this point reliably.
>>>
>>> Unfortunately the panic happens in the driver thread context (I
>>> believe), so the KDB/dump is not working. After panicking the machine
>>> just hangs there. Keyboard is not working and I need to do a hard reset.
>>>
>>> Is there any other explanation that you can think of? Is it possible for
>>> some other portion of the code (i.e. network driver, DMA engine etc) to
>>> trash this structure by writing something out of bounds? Or something
>>> along those lines?
>>
>> OK, I've put the following catch to prove the case:
>>
>>         up = intoudpcb(inp);
>>         if (up == NULL) {
>>                 printf("BZZT! Something is terribly wrong, up == NULL!\n");
>>                 INP_RUNLOCK(inp);
>>                 goto badunlocked;
>>         }
>>         if (up->u_tun_func == NULL) {
>>
>> I am going to give it a spin on the two busiest boxes and see if I can
>> log anything.
>
> Now if you are clever you'd also log the inp there as the above will
> only prove the case that something is wrong but still not help us in
> any way to figure out what.
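
For reference, here is a minimal userland sketch of the catch with the inp pointer logged as well, which is the form the log lines at the top of this message appear to come from. It is an illustration only, not the actual diff: the struct layouts are stripped-down stand-ins and check_udpcb() is a hypothetical helper so that the fragment compiles outside the kernel. In udp_input() itself the failure branch would also do INP_RUNLOCK(inp) and "goto badunlocked", as in the quoted snippet.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct udpcb {
	void *u_tun_func;	/* tunneling callback, unused in this sketch */
};

struct inpcb {
	void *inp_ppcb;		/* per-protocol control block, NULL if gone */
};

/* Same idea as the kernel's intoudpcb(): fetch the udpcb off the inp. */
#define	intoudpcb(ip)	((struct udpcb *)(ip)->inp_ppcb)

/*
 * Hypothetical helper: returns 0 if the udpcb is present, -1 if the
 * packet should be dropped. On failure it logs the inp address so the
 * structure can be looked up later from ddb or a dump.
 */
static int
check_udpcb(struct inpcb *inp)
{
	struct udpcb *up;

	up = intoudpcb(inp);
	if (up == NULL) {
		printf("BZZT! Something is terribly wrong, up == NULL! "
		    "inp = %p\n", (void *)inp);
		return (-1);
	}
	return (0);
}
```

Logging only the address is cheap enough for the hot path, but it says nothing about the state of the inp at the time; printing a few of its fields in the same printf would probably be the next step.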