Date:      Thu, 6 Aug 2009 00:17:11 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Navdeep Parhar <nparhar@gmail.com>
Cc:        freebsd-current@freebsd.org, jeff@FreeBSD.org, "Bjoern A. Zeeb" <bz@FreeBSD.org>, kib@FreeBSD.org, Navdeep Parhar <np@FreeBSD.org>, lstewart@FreeBSD.org
Subject:   Re: reproducible panic in netisr
Message-ID:  <alpine.BSF.2.00.0908060011490.59996@fledge.watson.org>
In-Reply-To: <20090805063417.GA10969@doormat.home>
References:  <20090804225806.GA54680@hub.freebsd.org> <20090805054115.O93661@maildrop.int.zabbadoz.net> <20090805063417.GA10969@doormat.home>

On Tue, 4 Aug 2009, Navdeep Parhar wrote:

>>> This occurs on today's HEAD + some unrelated patches.  That makes it 
>>> 8.0BETA2+ code.  I haven't tried older builds.
>>
>> We have finally been able to reproduce this ourselves yesterday and
>
> Well, it happens every single time on all of my amd64 machines. After I'd 
> already sent my email I noticed that the netisr mutex has an odd address 
> (pun intended :-))
>
> m=0xffffffff8144d867

Heh, indeed.  We just spotted the same result here.  In this case the 
misalignment makes mtx_lock span a cache line boundary, so the read of it is 
non-atomic and we get a fractional pointer.  The panic follows shortly 
afterwards, when that value is dereferenced as if it were a valid thread 
pointer.
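
As a rough illustration of the alignment problem (a sketch, not kernel code), 
the check below tests whether a word-sized field at a given address straddles 
a cache line.  The 64-byte line size and the 24-byte offset of mtx_lock within 
struct mtx are assumptions made for this example, not values taken from the 
report; is_aligned() and spans_cache_line() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on typical amd64 parts. */
#define CACHE_LINE_SIZE		64u
/* Assumption: mtx_lock follows a 24-byte struct lock_object. */
#define ASSUMED_MTX_LOCK_OFFSET	24u

/* True if addr is a multiple of align (align must be a power of two). */
static bool
is_aligned(uint64_t addr, uint64_t align)
{
	return ((addr & (align - 1)) == 0);
}

/* True if the byte range [addr, addr + len) touches two cache lines. */
static bool
spans_cache_line(uint64_t addr, size_t len)
{
	uint64_t first = addr / CACHE_LINE_SIZE;
	uint64_t last = (addr + len - 1) / CACHE_LINE_SIZE;

	return (first != last);
}
```

Under those assumptions, the reported m=0xffffffff8144d867 puts mtx_lock at 
0xffffffff8144d87f, so the 8-byte lock word would begin on the last byte of 
one cache line and continue into the next.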

> It's a bit unusual for the mutex struct to start at a completely unaligned 
> address.  I hope things are better on sparc64 etc., not everyone is as 
> forgiving as amd64.

amd64 isn't as forgiving either, it turns out. :-)

> The mutex led me to some DPCPU stuff that I didn't quite get.
>
> (kgdb) p/x dpcpu_off
> $2 = {0x8407d7, 0xffffff807f4037d7, 0x0 <repeats 30 times>}
> (kgdb) p dpcpu
> $3 = (void *) 0xffffff8000010000
> (kgdb) p &__start_set_pcpu
> $4 = (uintptr_t **) 0xffffffff80c0c829
> (kgdb) p/x 0xffffff8000010000 - 0xffffffff80c0c829
> $5 = 0xffffff807f4037d7
>
> It's not clear why we prefer to store offsets from DPCPU_START, instead of 
> the base address of the dpcpu area directly.  On amd64, the dpcpu area for 
> cpu 0 is above kernbase (immediately after kernbase + thread0's stack). 
> For the other CPUs it's below kernbase.  This makes the pointer arithmetic 
> that calculates offsets more "interesting."
>
> Why have a dpcpu_off[] instead of a dpcpu_base[]?

Each field in DPCPU is named with respect to the start of a "master" dpcpu 
copy, which holds the static initialization.  This makes the per-CPU name:

    (&master_name_for_variable - DPCPU_START) + per-cpu-base

What Jeff has done is factor out the DPCPU_START subtraction, since it's a 
constant subtraction across all DPCPU use, and do it once when calculating 
dpcpu_off.  This should all be fine; the question is why we're losing the 
alignment during linking of the kernel.  netisr is linked into the base 
kernel, so I guess it's some problem with the way the linker set is being laid 
out at compile-time.  I expect we may have a similar issue with the run-time 
allocation of DPCPU space as well.
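
To make the arithmetic concrete, here is a minimal sketch using the addresses 
from the kgdb session quoted above.  dpcpu_offset() and dpcpu_addr() are 
hypothetical helper names for illustration, not the kernel's macros, and plain 
uint64_t stands in for the pointer types so the wraparound is well defined.

```c
#include <stdint.h>

/*
 * Offset stored once per CPU: per-CPU base minus the start of the
 * master copy.  Unsigned arithmetic wraps modulo 2^64, so this works
 * even when the per-CPU base is below the master copy.
 */
static uint64_t
dpcpu_offset(uint64_t percpu_base, uint64_t set_start)
{
	return (percpu_base - set_start);
}

/*
 * Per-CPU address of a variable: the address of its master copy plus
 * the precomputed offset, i.e. (master - start) + base, regrouped.
 */
static uint64_t
dpcpu_addr(uint64_t master_addr, uint64_t off)
{
	return (master_addr + off);
}
```

With dpcpu = 0xffffff8000010000 and &__start_set_pcpu = 0xffffffff80c0c829, 
dpcpu_offset() yields 0xffffff807f4037d7, which matches dpcpu_off[1].  Adding 
dpcpu_off[0] = 0x8407d7 to the set start gives 0xffffffff8144d000, so if the 
mutex above belongs to that CPU's area, its master copy would sit at 
0xffffffff80c0d090.  That address is aligned, but its distance from the 
odd-addressed set start (0x867) is not, which would be consistent with the 
alignment being lost where the linker set begins.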

Robert N M Watson
Computer Laboratory
University of Cambridge
