From owner-freebsd-stable@FreeBSD.ORG  Thu Jan 21 06:19:14 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 559AD1065670
	for <freebsd-stable@freebsd.org>; Thu, 21 Jan 2010 06:19:14 +0000 (UTC)
	(envelope-from erik@malcolm.berkeley.edu)
Received: from malcolm.berkeley.edu (malcolm.Berkeley.EDU
	[IPv6:2607:f140:ffff:ffff::239])
	by mx1.freebsd.org (Postfix) with ESMTP id 339028FC24
	for <freebsd-stable@freebsd.org>; Thu, 21 Jan 2010 06:19:14 +0000 (UTC)
Received: from malcolm.berkeley.edu (localhost [127.0.0.1])
	by malcolm.berkeley.edu (8.14.3/8.13.8m1) with ESMTP id o0L6JDak097071
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 20 Jan 2010 22:19:13 -0800 (PST)
	(envelope-from erik@malcolm.berkeley.edu)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.95.3 at malcolm.berkeley.edu
Received: (from erik@localhost)
	by malcolm.berkeley.edu (8.14.3/8.13.3/Submit) id o0L6JDUX097070;
	Wed, 20 Jan 2010 22:19:13 -0800 (PST) (envelope-from erik)
Date: Wed, 20 Jan 2010 22:19:13 -0800
From: Erik Klavon <erikk@berkeley.edu>
To: Pyun YongHyeon <pyunyh@gmail.com>
Message-ID: <20100121061912.GA96603@malcolm.berkeley.edu>
References: <20100114014719.GA11284@malcolm.berkeley.edu>
	<20100114020640.GT1228@michelle.cdnetworks.com>
	<20100114232618.GA27380@malcolm.berkeley.edu>
	<20100120231251.GA85328@malcolm.berkeley.edu>
	<20100120234208.GL6201@michelle.cdnetworks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100120234208.GL6201@michelle.cdnetworks.com>
User-Agent: Mutt/1.4.2.3i
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.3
	(malcolm.berkeley.edu [127.0.0.1]);
	Wed, 20 Jan 2010 22:19:14 -0800 (PST)
Cc: freebsd-stable@freebsd.org
Subject: Re: bge panic in 8.0
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 21 Jan 2010 06:19:14 -0000

On Wed, Jan 20, 2010 at 03:42:08PM -0800, Pyun YongHyeon wrote:
> On Wed, Jan 20, 2010 at 03:12:51PM -0800, Erik Klavon wrote:
> > On Thu, Jan 14, 2010 at 03:26:18PM -0800, Erik Klavon wrote:
> > > On Wed, Jan 13, 2010 at 06:06:40PM -0800, Pyun YongHyeon wrote:
> > > > On Wed, Jan 13, 2010 at 05:47:19PM -0800, Erik Klavon wrote:
> > > > > One of my amd64 machines running 8.0p1 acting as a NAT system for many
> > > > > network clients dropped into kdb today. tr indicates a problem in
> > > > > bge.
> > > > > 
> > > > > Tracing pid 12 tid 100033 td 0xffffff0001687000
> > > > > pmap_kextract() at pmap_kextract+0x4e
> > > > > bus_dmamap_load() at bus_dmamap_load+0xab
> > > > > bge_newbuf_std() at bge_newbuf_std+0xcc
> > > > > bge_rxeof() at bge_rxeof+0x36a
> > > > > bge_intr() at bge_intr+0x1c0
> > > > > intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
> > > > > ithread_loop() at ithread_loop+0x8e
> > > > > fork_exit() at fork_exit+0x118
> > > > > fork_trampoline() at fork_trampoline+0xe
> > > > > --- trap 0, rip = 0, rsp = 0xffffff8074c01d30, rbp = 0 ---
> > > > > 
> > > > > I haven't been able to find a PR that matches this particular trace.
> > > > > 
> > > > > Pyun recently MFCd to stable (hence my post to this list) some changes
> > > > > to bge that involve functions in the above trace and according to the
> > > > > commit log (r201685) may address a kernel panic. Is there any
> > > > > indication in the above trace that this is the type of panic the
> > > > > commit attempts to address? I don't have a core dump for this
> > > > > panic. This machine has been unstable on 8, so I may be able to get a
> > > > > core dump in the future. If there is other information you'd like me
> > > > > to gather, please let me know.
> > > > 
> > > > Yes, the code in the trace above was rewritten to address
> > > > bus_dma(9) issues. It would be great if you could try the latest
> > > > bge(4) from stable/8 and let me know how it goes on your box. I
> > > > think downloading if_bge.c and if_bgereg.h from stable/8 and
> > > > rebuilding bge(4) should be enough to run it on 8.0-RELEASE.
> > > 
> > > Great, I will try this out on a test machine today. If it holds up
> > > under testing, I will put it into production. These crashes can happen
> > > weeks after a machine boots, so I won't know if the problem is solved
> > > for some time. Thanks for your help,
> > 
> > I didn't run into any problems while testing. I started running bge(4)
> > from stable in production this morning. I had three kernel panics in a
> > couple hours; here's an example
> > 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; apic id = 00
> > fault virtual address   = 0x18
> > fault code              = supervisor read data, page not present
> > instruction pointer     = 0x20:0xffffffff805ccf17
> > stack pointer           = 0x28:0xffffff800004f830
> > frame pointer           = 0x28:0xffffff800004f890
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0 pres 1, long 1, def32 0, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL = 0
> > current process         = 13 (ng_queue0) 
> > [thread pid 13 tid 100009 ]
> > Stopped at      m_copym+0x37:   movl    0x18(%r12),%eax
> > 
> > db> tr
> > Tracing pid 13 tid 100009 td 0xffffff000189aab0
> > m_copym() at m_copym+0x37
> > ip_fragment() at ip_fragment+0x131
> > ip_output() at ip_output+0xeec
> > ip_forward() at ip_forward+0x16a
> > ip_input() at ip_input+0x57d
> > ng_ipfw_rcvdata() at ng_ipfw_rcvdata+0xb9
> > ng_apply_item() at ng_apply_item+0x220
> > ngthread() at ngthread+0x16b
> > fork_exit() at fork_exit+0x118
> > fork_trampoline() at fork_trampoline+0xe
> > --- trap 0, rip = 0, rsp = 0xffffff800004fd30, rbp = 0 ---
> > 
> > I tried the kdb command 'panic' to dump core, but this command only
> > produced further faults. After the third panic related to m_copym, I
> > reverted to the previous version of bge(4) from 8.0p1. A couple of
> > hours have passed without these panics repeating while running the
> > previous version of bge(4).
> > 
> 
> I guess this is a NULL pointer dereference in m_copym(9). I also
> see you're using netgraph(4). Can you run the server without
> netgraph(4) in your configuration and see whether this makes any
> difference? I'm not familiar with netgraph(4), but other developers
> can comment on this.
> Another way to narrow down the cause would be to try other
> controllers and see whether you can reproduce the issue. But I think
> the above panic is not related to bge(4).
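
The NULL pointer guess above can be checked against the trap output:
the faulting instruction reads 0x18(%r12) and the fault virtual address
is 0x18, so the base register must have held 0. Below is a minimal
sketch of that arithmetic. The MbufHdrSketch layout is an assumption (a
hypothetical mirror of the first mbuf header fields on amd64, with
pointers modelled as 8-byte integers), not something taken from the
trace, and the helper names are made up for illustration.

```python
import ctypes


class MbufHdrSketch(ctypes.Structure):
    # ASSUMPTION: hypothetical mirror of the leading mbuf header fields
    # on amd64; verify against sys/mbuf.h for the running kernel.
    _fields_ = [
        ("mh_next", ctypes.c_uint64),     # pointer, modelled as 8 bytes
        ("mh_nextpkt", ctypes.c_uint64),  # pointer, modelled as 8 bytes
        ("mh_data", ctypes.c_uint64),     # pointer, modelled as 8 bytes
        ("mh_len", ctypes.c_int32),
        ("mh_flags", ctypes.c_int32),
    ]


def implied_base(fault_addr, displacement):
    """Base register value implied by a disp(%reg) access that faulted."""
    return fault_addr - displacement


def field_at(struct_type, offset):
    """Name of the field starting at `offset`, or None if there is none."""
    for name, _ctype in struct_type._fields_:
        if getattr(struct_type, name).offset == offset:
            return name
    return None


# m_copym+0x37: movl 0x18(%r12),%eax with fault virtual address 0x18
base = implied_base(0x18, 0x18)
print(hex(base), field_at(MbufHdrSketch, 0x18))
```

The same subtraction applied to the bge_rxeof trap quoted below
(movq %r15,0x28(%r14), fault virtual address 0x28) also gives a base of
0. Which structure %r14 pointed at is not something the trace shows.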

netgraph(4) is key to what this system does; removing it isn't an
option for us. Using a different controller on the current hardware
isn't feasible. We're looking into different hardware for further
testing, and we plan to include both bge(4) and non-bge
controllers.

> > There is a long-open PR, 89070, that looks to be related to the above
> > panic. I don't have any proof that these panics resulted from the
> > newer version of bge(4). I haven't seen kernel panics such as these on
> > any of the other machines with this same configuration.
> > 
> > I have seen a kernel panic on systems running 8.0p1 with a different
> > stack trace than the one I posted previously that also appears to be
> > related to bge(4).
> > 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 1; apic id = 01
> > fault virtual address   = 0x28
> > fault code              = supervisor write data, page not present
> > instruction pointer     = 0x20:0xffffffff802cdf0e
> > stack pointer           = 0x28:0xffffff8074c1ab10
> > frame pointer           = 0x28:0xffffff8074c1ab70
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL = 0
> > current process         = 12 (irq25: bge1)
> > [thread pid 12 tid 100034 ]
> > Stopped at      bge_rxeof+0x1be:        movq    %r15,0x28(%r14)
> > 
> > db> trace
> > Tracing pid 12 tid 100034 td 0xffffff0001680ab0
> > bge_rxeof() at bge_rxeof+0x1be
> > bge_intr() at bge_intr+0x1c0
> > intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
> > ithread_loop() at ithread_loop+0x8e
> > fork_exit() at fork_exit+0x118
> > fork_trampoline() at fork_trampoline+0xe
> 
> I think this is a real bug in bge(4), and I believe it was fixed in
> stable.

Thanks for confirming this. So far I haven't been able to reproduce
these problems in testing; they only show up in production.

Erik