From: Scott Long
Date: Thu, 16 Apr 2009 15:12:41 -0600
To: John Baldwin
Cc: Alexey Shuvaev, current@freebsd.org
Subject: Re: [PATCH] Possible fix to recent data corruption on HEAD since USB2
Message-ID: <49E79F49.6000606@samsco.org>
In-Reply-To: <200904161558.56919.jhb@freebsd.org>

John Baldwin wrote:
> On Thursday 16 April 2009 2:47:38 pm Alexey Shuvaev wrote:
>> On Thu, Apr 16, 2009 at 01:36:18PM -0400, John Baldwin wrote:
>>> Due to some good sleuthing by avg@,
>>> there is a patch that might fix the recent
>>> reports of data corruption on current.  It would explain some of the recent
>>> reports where a file that was read would have missing gaps of bytes.  The
>>> problem is with the BUS_DMA_KEEP_PG_OFFSET changes to bus_dma.  When a bounce
>>> page was used by USB2, the changes to bus_dma would actually change the
>>> starting virtual and physical addresses of the bounce page.  When the bounce
>>> page was no longer needed it was left in this bogus state.  Later if another
>>> device used the same bounce page for DMA it would use the wrong offset and
>>> address.  The issue there is if the second device was doing a full page of
>>> I/O.  In that case the DMA from the device would actually spill over into the
>>> next page which could in theory be used by another DMA request.  It could
>>> also break alignment assumptions (since the previous PG_OFFSET may not be
>>> aligned and the bus_dma code assumes bounce pages for the !PG_OFFSET case are
>>> page aligned).  The quick fix is to always restore the bounce page to the
>>> normal state when a PG_OFFSET DMA request is finished.  I'd actually prefer
>>> not ever touching the page's starting addresses, but those changes would be
>>> more invasive I believe.
>>>
>>> http://www.FreeBSD.org/~jhb/patches/dma_sg.patch
>>>
>> Am I right that the hardware prerequisite in order to observe these problems
>> is amd64 + 4Gb or more of RAM?
>
> Well, i386 with PAE would do it as well.  Basically, you need USB + one other
> device that uses bounce pages and the other device ends up with corruption.
>
>> Is it possible to fabricate some (artificial) test case to stress this
>> particular situation (interleaved use of bounce pages by USB and some other
>> device (?HDD?))?
>
> I haven't constructed one though it might be possible to do so.
>
>> Asking because as I understand the data corruption is silent
>> and the affected consumer (of bounce pages) should have some mechanism
>> of detecting this (e.g. zfs' CRCs).
>> In my case stress testing an unpatched system till UFS filesystems are dead
>> is no fun...
>
> Understood.  I know some other folks are going to test this and if there is
> early success that may make the risk easier to take.
>

I have pretty high confidence that John and Andriy found the problem and
fixed it with this patch.  It'll be good to get it tested, but I think
that the risk to testers will be pretty low.

Scott
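
The failure mode described in the quoted text can be modelled in a few
lines of userland C.  This is only a sketch under stated assumptions: it
is not dma_sg.patch and not the bus_dma code.  Every name in it
(sketch_bounce_page, sketch_page_acquire, sketch_page_release,
sketch_page_overrun, sketch_scenario), the 4096-byte page size, and the
addresses are hypothetical and chosen for illustration.  It models a
bounce page whose start addresses are shifted by a keep-offset request,
then reused by a second request doing a full page of I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1u)

/* Hypothetical stand-in for a bounce page; not the kernel's structure. */
struct sketch_bounce_page {
	uintptr_t vaddr;	/* start of usable region, virtual */
	uintptr_t busaddr;	/* start of usable region, bus/physical */
};

/*
 * Hand the page to a request.  A keep-offset request shifts both start
 * addresses by the in-page offset of the client buffer.
 */
void
sketch_page_acquire(struct sketch_bounce_page *bp, uintptr_t client_addr,
    bool keep_offset)
{
	if (keep_offset) {
		uintptr_t off = client_addr & SKETCH_PAGE_MASK;

		bp->vaddr |= off;
		bp->busaddr |= off;
	}
}

/*
 * Return the page to the pool.  With restore == false the shifted start
 * addresses are left behind (the bug); with restore == true the page is
 * put back into its page-aligned state (the quick fix, as described).
 */
void
sketch_page_release(struct sketch_bounce_page *bp, bool keep_offset,
    bool restore)
{
	if (keep_offset && restore) {
		bp->vaddr &= ~(uintptr_t)SKETCH_PAGE_MASK;
		bp->busaddr &= ~(uintptr_t)SKETCH_PAGE_MASK;
	}
}

/* Bytes by which a DMA of len bytes runs past the end of the page. */
size_t
sketch_page_overrun(const struct sketch_bounce_page *bp, size_t len)
{
	size_t room;

	assert(len <= SKETCH_PAGE_SIZE);
	room = SKETCH_PAGE_SIZE - (size_t)(bp->busaddr & SKETCH_PAGE_MASK);
	return (len > room ? len - room : 0);
}

/*
 * First user: keep-offset request with in-page offset 0x234.
 * Second user: ordinary request doing a full page of I/O.
 * Returns how many bytes the second user's DMA spills into the next page.
 */
size_t
sketch_scenario(bool restore)
{
	struct sketch_bounce_page bp = { .vaddr = 0x10000, .busaddr = 0x20000 };

	sketch_page_acquire(&bp, 0x1234, true);
	sketch_page_release(&bp, true, restore);
	sketch_page_acquire(&bp, 0x5000, false);
	return (sketch_page_overrun(&bp, SKETCH_PAGE_SIZE));
}
```

Calling sketch_scenario(false) reports a spill equal to the stale offset,
while sketch_scenario(true) reports none.  The model also makes the
alignment point visible: without the restore, the second user sees a
start address that is no longer page aligned.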