From owner-freebsd-current@FreeBSD.ORG Wed Aug  5 23:17:12 2009
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 71FAB106564A;
	Wed, 5 Aug 2009 23:17:12 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 496038FC08;
	Wed, 5 Aug 2009 23:17:12 +0000 (UTC)
Received: from fledge.watson.org (fledge.watson.org [65.122.17.41])
	by cyrus.watson.org (Postfix) with ESMTPS id D888746B03;
	Wed, 5 Aug 2009 19:17:11 -0400 (EDT)
Date: Thu, 6 Aug 2009 00:17:11 +0100 (BST)
From: Robert Watson
X-X-Sender: robert@fledge.watson.org
To: Navdeep Parhar
In-Reply-To: <20090805063417.GA10969@doormat.home>
References: <20090804225806.GA54680@hub.freebsd.org>
	<20090805054115.O93661@maildrop.int.zabbadoz.net>
	<20090805063417.GA10969@doormat.home>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-current@freebsd.org, jeff@FreeBSD.org, "Bjoern A. Zeeb",
	kib@FreeBSD.org, Navdeep Parhar, lstewart@FreeBSD.org
Subject: Re: reproducible panic in netisr
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Wed, 05 Aug 2009 23:17:12 -0000

On Tue, 4 Aug 2009, Navdeep Parhar wrote:

>>> This occurs on today's HEAD + some unrelated patches.  That makes it
>>> 8.0BETA2+ code.  I haven't tried older builds.
>>
>> We have finally been able to reproduce this ourselves yesterday and
>
> Well, it happens every single time on all of my amd64 machines.  After I'd
> already sent my email I noticed that the netisr mutex has an odd address
> (pun intended :-))
>
> m=0xffffffff8144d867

Heh, indeed.
We just spotted the same result here.  In this case the misalignment leaves
mtx_lock spanning a cache line boundary, so the read of it is non-atomic and
we get a fractional pointer.  The panic follows shortly afterwards, when that
value is dereferenced and turns out not to be a valid thread pointer.

> It's a bit unusual for the mutex struct to start at a completely unaligned
> address.  I hope things are better on sparc64 etc., not everyone is as
> forgiving as amd64.

amd64 isn't as forgiving either, it turns out. :-)

> The mutex led me to some DPCPU stuff that I didn't quite get.
>
> (kgdb) p/x dpcpu_off
> $2 = {0x8407d7, 0xffffff807f4037d7, 0x0 }
> (kgdb) p dpcpu
> $3 = (void *) 0xffffff8000010000
> (kgdb) p &__start_set_pcpu
> $4 = (uintptr_t **) 0xffffffff80c0c829
> (kgdb) p/x 0xffffff8000010000 - 0xffffffff80c0c829
> $5 = 0xffffff807f4037d7
>
> It's not clear why we prefer to store offsets from DPCPU_START, instead of
> the base address of the dpcpu area directly.  On amd64, the dpcpu area for
> cpu 0 is above kernbase (immediately after kernbase + thread0's stack).
> For the other CPUs it's below kernbase.  This makes the pointer arithmetic
> that calculates offsets more "interesting."
>
> Why have a dpcpu_off[] instead of a dpcpu_base[]?

Each field in DPCPU is named with respect to the start of a "master" dpcpu
copy, which holds the static initialization.  This makes the per-CPU name:

   (&master_name_for_variable - DPCPU_START) + per-cpu-base

What Jeff has done is factor out the DPCPU_START subtraction, since it's a
constant subtraction across all DPCPU use, and do it once when calculating
dpcpu_off.

This should all be fine; the question is why we're losing the alignment
during linking of the kernel.  netisr is linked into the base kernel, so I
guess it's some problem with the way the linker set is being laid out at
compile-time.  I expect we may have a similar issue with the run-time
allocation of DPCPU space as well.

Robert N M Watson
Computer Laboratory
University of Cambridge
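
For anyone following along, the offset arithmetic described above can be
sketched in userland C roughly as follows.  This is only an illustration,
not the kernel's code: the helper names dpcpu_offset(), dpcpu_addr() and
misalignment() are made up for the example, and the two constants are just
the values copied from the kgdb session quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the quoted kgdb session. */
#define START_SET_PCPU	UINT64_C(0xffffffff80c0c829)	/* &__start_set_pcpu */
#define DPCPU_BASE	UINT64_C(0xffffff8000010000)	/* value of dpcpu */

/*
 * Hypothetical helper: the constant subtraction, done once per CPU.
 * The result wraps modulo 2^64 when the per-CPU area is below the set.
 */
static uint64_t
dpcpu_offset(uint64_t percpu_base, uint64_t start)
{
	return (percpu_base - start);
}

/*
 * Hypothetical helper: per-CPU address of a variable, given its address
 * in the master copy and the precomputed offset for that CPU.
 */
static uint64_t
dpcpu_addr(uint64_t master_addr, uint64_t off)
{
	return (master_addr + off);
}

/*
 * Hypothetical helper: number of bytes by which addr misses a
 * power-of-two alignment (0 means aligned).
 */
static uint64_t
misalignment(uint64_t addr, uint64_t align)
{
	return (addr & (align - 1));
}
```

With these values, dpcpu_offset(DPCPU_BASE, START_SET_PCPU) reproduces the
0xffffff807f4037d7 shown as $5, and misalignment(0xffffffff8144d867, 8)
comes out as 7 for the reported mutex address.  Adding the offset to a
master address gives the same result as the unfactored expression, because
the arithmetic is modulo 2^64 in both cases.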