From owner-freebsd-current@FreeBSD.ORG Fri Apr 17 10:36:40 2009
Date: Fri, 17 Apr 2009 06:36:34 -0400
From: Damian Gerow
To: freebsd-current@freebsd.org
Message-ID: <20090417103634.GD1186@plebeian.afflictions.org>
References: <200904161336.18557.jhb@freebsd.org> <20090416184738.GA60409@wep4035.physik.uni-wuerzburg.de> <200904161558.56919.jhb@freebsd.org> <49E79F49.6000606@samsco.org>
In-Reply-To: <49E79F49.6000606@samsco.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.5.19 (2009-01-05)
Subject: Re: [PATCH] Possible fix to recent data corruption on HEAD since USB2
List-Id: Discussions about the use of FreeBSD-current

Scott Long wrote:
: John Baldwin wrote:
: > On Thursday 16 April 2009 2:47:38 pm Alexey Shuvaev wrote:
: >> On Thu, Apr 16, 2009 at 01:36:18PM -0400, John Baldwin wrote:
: >>> Due to some good sleuthing by avg@,
: >>> there is a patch that might fix the recent
: >>> reports of data corruption on current. It would explain some of the recent
: >>> reports where a file that was read would have missing gaps of bytes. The
: >>> problem is with the BUS_DMA_KEEP_PG_OFFSET changes to bus_dma. When a bounce
: >>> page was used by USB2, the changes to bus_dma would actually change the
: >>> starting virtual and physical addresses of the bounce page. When the bounce
: >>> page was no longer needed it was left in this bogus state. Later if another
: >>> device used the same bounce page for DMA it would use the wrong offset and
: >>> address. The issue there is if the second device was doing a full page of
: >>> I/O. In that case the DMA from the device would actually spill over into the
: >>> next page which could in theory be used by another DMA request. It could
: >>> also break alignment assumptions (since the previous PG_OFFSET may not be
: >>> aligned and the bus_dma code assumes bounce pages for the !PG_OFFSET case are
: >>> page aligned). The quick fix is to always restore the bounce page to the
: >>> normal state when a PG_OFFSET DMA request is finished. I'd actually prefer
: >>> not ever touching the page's starting addresses, but those changes would be
: >>> more invasive I believe.
: >>>
: >>> http://www.FreeBSD.org/~jhb/patches/dma_sg.patch
: >>>
: >> Am I right that hardware prerequisite in order to observe these problems
: >> is amd64 + 4Gb or more of RAM?
: >
: > Well, i386 with PAE would do it as well. Basically, you need USB + one other
: > device that use bounce pages and the other device ends up with corruption.
: >
: >> Is it possible to fabricate some (artificial) test case to stress this
: >> particular situation (interleaved use of bounce pages by USB and some other
: >> device (?HDD?))?
: >
: > I haven't constructed one though it might be possible to do so.
: >
: >> Asking because as I understand the data corruption is silent
: >> and affected consumer (of bounce pages) should have some mechanism
: >> of detecting this (e.g. zfs' CRCs).
: >> In my case stress testing unpatched system till UFS filesystems are dead
: >> is no fun...
: >
: > Understood. I know some other folks are going to test this and if there is
: > early success that may make the risk easier to take.
: >
:
: I have pretty high confidence that John and Andriy found the problem and
: fixed it with this patch. It'll be good to get it tested, but I think
: that the risk to tester will be pretty low.

Having been running the patch for sixteen hours now, I can safely say that
it fixes my issues.

  - Damian
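
P.S. For anyone who wants to see the failure mode John describes without
risking a filesystem, here is a toy userland model in Python. It is only a
sketch of the description quoted above, not the bus_dma code and not the
patch: the BouncePage class, the function names and the addresses are all
made up for illustration. It shows a bounce page whose address is shifted for
a keep-page-offset request, then reused by a second device doing a full page
of I/O.

```python
PAGE_SIZE = 4096
PAGE_MASK = PAGE_SIZE - 1


class BouncePage:
    """Hypothetical stand-in for a bounce page; not a kernel structure."""

    def __init__(self, busaddr):
        assert busaddr & PAGE_MASK == 0, "bounce pages start page aligned"
        self.base = busaddr      # true start of the page, kept only for checking
        self.busaddr = busaddr   # address handed to whichever device uses the page


def map_keep_offset(page, client_addr):
    """Model a keep-page-offset request: shift the page's address so it
    carries the same in-page offset as the client's buffer."""
    page.busaddr |= client_addr & PAGE_MASK
    return page.busaddr


def release_buggy(page):
    """Model the old behaviour: the shifted address is left in place."""


def release_fixed(page):
    """Model the quick fix: restore the page to its aligned state."""
    page.busaddr &= ~PAGE_MASK


def spill_bytes(page, length):
    """Bytes a DMA of `length` would write past the end of the real page."""
    end = page.busaddr + length
    return max(0, end - (page.base + PAGE_SIZE))


def simulate(release):
    page = BouncePage(0x10000)           # made-up bus address
    map_keep_offset(page, 0x7F3A1234)    # USB buffer at in-page offset 0x234
    release(page)                        # USB transfer finishes
    return spill_bytes(page, PAGE_SIZE)  # second device does a full page of I/O


if __name__ == "__main__":
    print("spill without restore:", simulate(release_buggy))
    print("spill with restore:   ", simulate(release_fixed))
```

In this model the second device writes 0x234 (564) bytes into the following
page when the offset is left behind, and nothing once the page is restored on
release. That matches the spill-over John describes, but it says nothing
about how often the real interleaving happens on a given machine.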