Date:      Mon, 08 Jan 2007 10:58:55 -0500
From:      Sven Willenberger <sven@dmv.com>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        stable@freebsd.org, freebsd-amd64@freebsd.org
Subject:   Re: Panic in 6.2-PRERELEASE with bge on amd64
Message-ID:  <1168271935.23549.10.camel@lanshark.dmv.com>
In-Reply-To: <20070108154433.C75042@delplex.bde.org>
References:  <1168211205.22629.6.camel@lanshark.dmv.com> <20070108154433.C75042@delplex.bde.org>

On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:
> On Sun, 7 Jan 2007, Sven Willenberger wrote:
> 
> > I am starting a new thread on this as what I had assumed was a panic in
> > nfsd turns out to be an issue with the bge driver. This is an amd64 box,
> > dual processor (SMP kernel) that happens to be running nfsd. About every
> > 3-5 days the kernel panics and I have finally managed to get a core
> > dump.
> > The system: FreeBSD 6.2-PRERELEASE #8: Tue Jan  2 10:57:39 EST 2007
> 
> Like most NIC drivers, bge unlocks and re-locks around its call to
> ether_input() in its interrupt handler.  This isn't very safe, and it
> certainly causes panics for bge.  I often see it panic when bringing
> the interface down and up while input is arriving, on a non-SMP non-amd64
> (actually i386) non-6.x (actually -current) system.  Bringing the
> interface down is probably the worst case.  It creates a null pointer
> for bge_intr() to follow.
> 
> > The short and dirty of the dump:
> > ...
> > --- trap 0xc, rip = 0xffffffff801d5f17, rsp = 0xffffffffb371ab50, rbp = 0xffffffffb371aba0 ---
> > bge_rxeof() at bge_rxeof+0x3b7
> 
> What is the instruction here?

I will do my best to ferret out the information you need. For the
bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:

0xffffffff801d5f17 <bge_rxeof+951>:     mov    %r15,0x28(%r14)

For the bge_intr() at bge_intr+0x1c8 line, the instruction is:

0xffffffff801db818 <bge_intr+456>:      mov    %rbx,%rdi

> 
> > bge_intr() at bge_intr+0x1c8
> > ithread_loop() at ithread_loop+0x14c
> > fork_exit() at fork_exit+0xbb
> > fork_trampoline() at fork_trampoline+0xe
> > --- trap 0, rip = 0, rsp = 0xffffffffb371ad00, rbp = 0 ---
> 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 1; apic id = 01
> > fault virtual address   = 0x28
> 
> Looks like a null pointer panic anyway.  I guess the instruction is
> movl to/from 0x28(%reg) where %reg is a null pointer.
> 

From the above lines, %r14 is apparently null then.

> > ...
> > #8  0xffffffff801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707
> 
> What is the statement here?  It presumably follows a null pointer and only
> the expression for the pointer is interesting.  xsc is already null but
> that is probably a bug in gdb, or the result of excessive optimization.
> Compiling kernels with -O2 has little effect except to break debugging.
> 

the block of code from if_bge.c:

   2705         if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
   2706                 /* Check RX return ring producer/consumer. */
   2707                 bge_rxeof(sc);
   2708
   2709                 /* Check TX ring producer/consumer. */
   2710                 bge_txeof(sc);
   2711         }

By default -O2 is passed to CC (I don't use any custom make flags; I only
define CPUTYPE in my /etc/make.conf).

> I rarely use gdb on kernels and haven't looked closely enough using ddb
> to see where the null pointer for the panic on down/up came from.
> 
> BTW, the sbdrop panic in -current isn't bge-only or SMP-only.  I saw
> it once for sk on a non-SMP system.  It rarely happens for non-SMP
> (much more rarely than the panic in bge_intr()).  Under -current, on
> an SMP amd64 system with bge, it happens almost every time on close
> of the socket for a ttcp server if input is arriving at the time of
> the close.  I haven't seen it for 6.x.
> 
> Bruce

The short of it is that this interface sees pretty much non-stop traffic,
as this is a mailserver (final destination): mail is constantly being
delivered to it (direct disk access) and retrieved from it (remote
machine(s) with NFS-mounted mail spools). If a momentary down of the
interface is enough to completely panic the driver and then the kernel,
this hardly seems "robust" if, in fact, this is what is happening. So
the question arises as to what would be causing the down/up of the
interface; I could start looking at the cable, the switch it's connected
to and ... any other ideas? (I don't have watchdog enabled or anything
like that, for example).

Sven