Date:      Tue, 9 Jan 2007 12:50:51 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Sven Willenberger <sven@dmv.com>
Cc:        stable@FreeBSD.org, freebsd-amd64@FreeBSD.org
Subject:   Re: Panic in 6.2-PRERELEASE with bge on amd64
Message-ID:  <20070109124826.M79616@delplex.bde.org>
In-Reply-To: <1168271935.23549.10.camel@lanshark.dmv.com>
References:  <1168211205.22629.6.camel@lanshark.dmv.com>  <20070108154433.C75042@delplex.bde.org> <1168271935.23549.10.camel@lanshark.dmv.com>

On Mon, 8 Jan 2007, Sven Willenberger wrote:

> On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:
>> On Sun, 7 Jan 2007, Sven Willenberger wrote:

>>> The short and dirty of the dump:
>>> ...
>>> --- trap 0xc, rip = 0xffffffff801d5f17, rsp = 0xffffffffb371ab50, rbp = 0xffffffffb371aba0 ---
>>> bge_rxeof() at bge_rxeof+0x3b7
>>
>> What is the instruction here?
>
> I will do my best to ferret out the information you need. For the
> bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:
>
> 0xffffffff801d5f17 <bge_rxeof+951>:     mov    %r15,0x28(%r14)
> ...
>> Looks like a null pointer panic anyway.  I guess the instruction is
>> movl to/from 0x28(%reg) where %reg is a null pointer.
>>
>
> from the above lines, apparently %r14 is null then.

Yes.  It's a bit surprising that the access is a write.
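
To illustrate why a fault on "mov %r15,0x28(%r14)" points at a null
%r14: a store to a struct field through a null base pointer touches
an address equal to the field's offset.  The following is only a
sketch.  struct toy_buf and its fields are hypothetical and are not
the real mbuf layout; they are padded so that the target field lands
at offset 0x28.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout (an assumption, NOT the real struct mbuf):
 * five 8-byte fields followed by the field of interest, so that the
 * field of interest sits at byte offset 0x28.
 */
struct toy_buf {
	uint64_t pad[5];	/* bytes 0x00..0x27 */
	uint64_t target;	/* byte 0x28 */
};

/* Sanity check that the toy layout matches the offset discussed. */
void
toy_check_layout(void)
{
	assert(offsetof(struct toy_buf, target) == 0x28);
}

/*
 * Address that a store to base->target would touch, computed without
 * dereferencing anything.
 */
uintptr_t
toy_fault_addr(uintptr_t base)
{
	return (base + offsetof(struct toy_buf, target));
}
```

With a null base, toy_fault_addr(0) gives 0x28, which is the kind of
small fault address that a page-fault trap on such an instruction
would report.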

>>> ...
>>> #8  0xffffffff801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707
>>
>> What is the statement here?  It presumably follows a null pointer and only
>> the expression for the pointer is interesting.  xsc is already null but
>> that is probably a bug in gdb, or the result of excessive optimization.
>> Compiling kernels with -O2 has little effect except to break debugging.
>
> the block of code from if_bge.c:
>
>   2705         if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
>   2706                 /* Check RX return ring producer/consumer. */
>   2707                 bge_rxeof(sc);
>   2708
>   2709                 /* Check TX ring producer/consumer. */
>   2710                 bge_txeof(sc);
>   2711         }

Oops.  I should have asked for the statement in bge_rxeof().

> By default -O2 is passed to CC (I don't use any custom make flags other
> than defining CPUTYPE in my /etc/make.conf).

-O2 is unfortunately the default for COPTFLAGS for most arches in
sys/conf/kern.pre.mk.  All of my machines and most FreeBSD cluster
machines override this default in /etc/make.conf.

With the override overridden for RELENG_6 amd64, gcc inlines bge_rxeof(),
so your environment must be a little different to get even the above
info.  I think gdb can show the correct line numbers but not the call
frames (since there is no call).  ddb and the kernel stack trace can
only show the call frames for actual calls.

With -O1, I couldn't find any instruction similar to the mov to the
null pointer + 0x28.  0x28 is a popular offset in mbufs.

> The short of it is that this interface sees pretty much non-stop traffic
> as this is a mailserver (final destination) and is constantly being
> delivered to (direct disk access) and mail being retrieved (remote
> machine(s) with nfs mounted mail spools). If a momentary down of the
> interface is enough to completely panic the driver and then the kernel,
> this hardly seems "robust" if, in fact, this is what is happening. So
> the question arises as to what would be causing the down/up of the
> interface; I could start looking at the cable, the switch it's connected
> to and ... any other ideas? (I don't have watchdog enabled or anything
> like that, for example).

I don't think down/up can occur in normal operation, since it takes ioctls
or a watchdog timeout to do it.  Maybe some ioctls other than a full
down/up can cause problems... bge_init() is called for the following
ioctls:
- mtu changes
- some near down/up (possibly only these)
Suspend/resume and of course detach/attach do much the same things as
down/up.

BTW, I added some sysctls and found it annoying to have to do down/up
to make the sysctls take effect.  Sysctls in several other NIC drivers
require the same, since doing a full reinitialization is easiest.
Since I am tuning using sysctls, I got used to doing down/up too much.

Similarly for the mtu ioctl.  I think a full reinitialization is used
for mtu changes mainly in cases where the change switches on/off support
for jumbo buffers.  Then there is a lot of buffer reallocation to be
done, and interfaces have to be stopped to ensure that the buffers
being deallocated are not in use, etc.
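
A rough userland sketch of that stop/reallocate/restart sequence
follows.  Everything in it is hypothetical: struct toy_softc, the
toy_* functions and the buffer sizes are invented for illustration
and are not taken from if_bge.c.  It also reinitializes only when the
change crosses the standard/jumbo boundary, which is an assumption
made for the example and not a description of what bge does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_STD_MTU	1500	/* assumed boundary for jumbo frames */
#define TOY_STD_BUFSZ	2048	/* assumed standard buffer size */
#define TOY_JUMBO_BUFSZ	9216	/* assumed jumbo buffer size */

/* Hypothetical per-interface state. */
struct toy_softc {
	bool	running;	/* interface is processing packets */
	int	mtu;
	size_t	bufsz;		/* size of the current receive buffer */
	void	*rxbuf;		/* stands in for a whole buffer ring */
	int	reinit_count;	/* number of full reinitializations */
};

/* Buffer size needed for a given mtu. */
size_t
toy_bufsz_for(int mtu)
{
	return (mtu > TOY_STD_MTU ? TOY_JUMBO_BUFSZ : TOY_STD_BUFSZ);
}

/* Stop the interface so that nothing uses the buffers any more. */
void
toy_stop(struct toy_softc *sc)
{
	sc->running = false;
}

/* Full reinitialization: stop, free, reallocate, restart. */
int
toy_init(struct toy_softc *sc)
{
	toy_stop(sc);
	free(sc->rxbuf);		/* safe: the interface is stopped */
	sc->bufsz = toy_bufsz_for(sc->mtu);
	sc->rxbuf = malloc(sc->bufsz);
	if (sc->rxbuf == NULL)
		return (-1);		/* interface stays stopped */
	sc->running = true;
	sc->reinit_count++;
	return (0);
}

/* Change the mtu; reinitialize only if the buffer class changes. */
int
toy_set_mtu(struct toy_softc *sc, int mtu)
{
	bool switches = toy_bufsz_for(mtu) != toy_bufsz_for(sc->mtu);

	sc->mtu = mtu;
	if (sc->running && switches)
		return (toy_init(sc));
	return (0);
}
```

In this sketch, going from an mtu of 1500 to 1400 leaves the buffers
alone, while going to 9000 stops the interface and replaces the
buffer before restarting.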

Bruce


