Date: Thu, 6 Aug 2009 15:11:26 +0100 (BST)
From: Robert Watson
To: Larry Rosenman
Cc: jeff@FreeBSD.org, "Bjoern A. Zeeb", freebsd-current@freebsd.org, kib@FreeBSD.org, Navdeep Parhar, lstewart@FreeBSD.org
Subject: Re: reproducible panic in netisr
References: <20090804225806.GA54680@hub.freebsd.org> <20090805054115.O93661@maildrop.int.zabbadoz.net> <20090805063417.GA10969@doormat.home>
List-Id: Discussions about the use of FreeBSD-current

On Thu, 6 Aug 2009, Larry Rosenman wrote:

> On Thu, 6 Aug 2009, Robert Watson wrote:
>
>> On Tue, 4 Aug 2009, Navdeep Parhar wrote:
>>
>>>>> This occurs on today's HEAD + some unrelated patches.  That makes it
>>>>> 8.0BETA2+ code.  I haven't tried older builds.
>>>>
>>>> We have finally been able to reproduce this ourselves yesterday and
>>>
>>> Well, it happens every single time on all of my amd64 machines.  After I'd
>>> already sent my email I noticed that the netisr mutex has an odd address
>>> (pun intended :-))
>>>
>>> m=0xffffffff8144d867
>>
>> Heh, indeed.  We just spotted the same result here.  In this case it's
>> causing a panic because it leads to a non-atomic read due to mtx_lock
>> spanning a cache line boundary, followed shortly by a panic because it's
>> not a valid thread pointer when it's dereferenced, as we get a fractional
>> pointer.
> [snip]
>
> Do we have an ETA for a testable patch?

RSN, I'm afraid.  We can eliminate the effect by reverting the use of DPCPU in
netisr.c (basically reverting to pre-r195019 of netisr.c).  The interesting
question is where the problem originates -- is gcc/ld/etc not laying out the
elf section properly, or are the MD parts not providing an aligned base?
There are also probably issues in the DPCPU handling of modules along similar
lines, but first things first.

We'll be adding assertions of alignment to the various lock init functions to
catch this happening explicitly in the future.  There are probably one or two
other places where we have very strong alignment requirements on i386/amd64,
such as the td_ucred pointer that we check for change on system calls/traps to
see if we need to refresh the thread's credential from the process credential.

Robert N M Watson
Computer Laboratory
University of Cambridge