From owner-freebsd-current@FreeBSD.ORG  Fri Apr 17 10:36:40 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3727D1065677
	for <freebsd-current@freebsd.org>; Fri, 17 Apr 2009 10:36:40 +0000 (UTC)
	(envelope-from dgerow@afflictions.org)
Received: from relay2-v.mail.gandi.net (relay2-v.mail.gandi.net
	[217.70.178.76])
	by mx1.freebsd.org (Postfix) with ESMTP id BCEA28FC16
	for <freebsd-current@freebsd.org>; Fri, 17 Apr 2009 10:36:39 +0000 (UTC)
	(envelope-from dgerow@afflictions.org)
Received: from plebeian.afflictions.org
	(CPE0021296fd1ec-CM0019475d4056.cpe.net.cable.rogers.com
	[99.241.164.229])
	by relay2-v.mail.gandi.net (Postfix) with ESMTP id 4E356135EB
	for <freebsd-current@freebsd.org>;
	Fri, 17 Apr 2009 12:36:38 +0200 (CEST)
Received: by plebeian.afflictions.org (Postfix, from userid 1001)
	id 17BF631F0; Fri, 17 Apr 2009 06:36:35 -0400 (EDT)
Date: Fri, 17 Apr 2009 06:36:34 -0400
From: Damian Gerow <dgerow@afflictions.org>
To: freebsd-current@freebsd.org
Message-ID: <20090417103634.GD1186@plebeian.afflictions.org>
References: <200904161336.18557.jhb@freebsd.org>
	<20090416184738.GA60409@wep4035.physik.uni-wuerzburg.de>
	<200904161558.56919.jhb@freebsd.org> <49E79F49.6000606@samsco.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <49E79F49.6000606@samsco.org>
User-Agent: Mutt/1.5.19 (2009-01-05)
Subject: Re: [PATCH] Possible fix to recent data corruption on HEAD since
	USB2
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 17 Apr 2009 10:36:40 -0000

Scott Long wrote:
: John Baldwin wrote:
: > On Thursday 16 April 2009 2:47:38 pm Alexey Shuvaev wrote:
: >> On Thu, Apr 16, 2009 at 01:36:18PM -0400, John Baldwin wrote:
: >>> Due to some good sleuthing by avg@,
: >>> there is a patch that might fix the recent 
: >>> reports of data corruption on current.  It would explain some of the recent 
: >>> reports where a file that was read would have missing gaps of bytes.  The 
: >>> problem is with the BUS_DMA_KEEP_PG_OFFSET changes to bus_dma.  When a bounce 
: >>> page was used by USB2, the changes to bus_dma would actually change the 
: >>> starting virtual and physical addresses of the bounce page.  When the bounce 
: >>> page was no longer needed it was left in this bogus state.  Later if another 
: >>> device used the same bounce page for DMA it would use the wrong offset and 
: >>> address.  The issue there is if the second device was doing a full page of 
: >>> I/O.  In that case the DMA from the device would actually spill over into the 
: >>> next page which could in theory be used by another DMA request.  It could 
: >>> also break alignment assumptions (since the previous PG_OFFSET may not be 
: >>> aligned and the bus_dma code assumes bounce pages for the !PG_OFFSET case are 
: >>> page aligned).  The quick fix is to always restore the bounce page to the 
: >>> normal state when a PG_OFFSET DMA request is finished.   I'd actually prefer 
: >>> not ever touching the page's starting addresses, but those changes would be 
: >>> more invasive I believe.
: >>>
: >>> http://www.FreeBSD.org/~jhb/patches/dma_sg.patch
: >>>
: >> Am I right that the hardware prerequisite for observing these problems
: >> is amd64 + 4GB or more of RAM?
: > 
: > Well, i386 with PAE would do it as well.  Basically, you need USB + one other
: > device that uses bounce pages, and the other device ends up with corruption.
: > 
: >> Is it possible to fabricate some (artificial) test case to stress this
: >> particular situation (interleaved use of bounce pages by USB and some other
: >> device (?HDD?))?
: > 
: > I haven't constructed one though it might be possible to do so.
: > 
: >> Asking because, as I understand it, the data corruption is silent
: >> and the affected consumer (of bounce pages) should have some mechanism
: >> for detecting it (e.g. zfs' CRCs).
: >> In my case, stress testing an unpatched system until the UFS filesystems
: >> are dead is no fun...
: > 
: > Understood.  I know some other folks are going to test this and if there is
: > early success that may make the risk easier to take.
: > 
: 
: I have pretty high confidence that John and Andriy found the problem and
: fixed it with this patch.  It'll be good to get it tested, but I think
: that the risk to testers will be pretty low.

Having run the patch for sixteen hours now, I can safely say that
it fixes my issues.
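
For anyone trying to picture the failure John describes, here is a rough
user-space sketch of it.  This is not the bus_dma code and not the patch:
struct bounce_page, bp_acquire(), bp_release(), bp_overrun() and simulate()
are hypothetical names made up for illustration, and the addresses are
arbitrary.  It only models the sequence "keep-offset request shifts the
page start, page is released without being restored, next device does a
full page of I/O".

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define PAGE_MASK (PAGE_SIZE - 1)

/* Hypothetical stand-in for a bounce page; not the kernel's structure. */
struct bounce_page {
	uintptr_t base;		/* page-aligned start, never modified */
	uintptr_t addr;		/* start address handed to the device */
};

/* A keep-offset request shifts the start to match the client's offset. */
static uintptr_t
bp_acquire(struct bounce_page *bp, uintptr_t client_addr, int keep_offset)
{
	if (keep_offset)
		bp->addr |= client_addr & PAGE_MASK;
	return (bp->addr);
}

/* restore == 0 models the bug; restore != 0 models the quick fix. */
static void
bp_release(struct bounce_page *bp, int restore)
{
	if (restore)
		bp->addr = bp->base;
}

/* Bytes by which a len-byte transfer at addr runs past the page end. */
static size_t
bp_overrun(const struct bounce_page *bp, uintptr_t addr, size_t len)
{
	uintptr_t end = addr + len;
	uintptr_t limit = bp->base + PAGE_SIZE;

	return (end > limit ? (size_t)(end - limit) : 0);
}

/* A keep-offset request, a release, then a full-page request. */
static size_t
simulate(int restore)
{
	struct bounce_page bp = { 0x100000, 0x100000 };
	uintptr_t a;

	(void)bp_acquire(&bp, 0x7654a230, 1);	/* client offset is 0x230 */
	bp_release(&bp, restore);
	a = bp_acquire(&bp, 0x20000000, 0);	/* second device, full page */
	return (bp_overrun(&bp, a, PAGE_SIZE));
}
```

In this model simulate(0) reports a 0x230-byte spill into the next page,
and simulate(1) reports none.  The start address seen by the second device
is also no longer page aligned in the unrestored case, which is the
alignment problem John mentions.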

  - Damian