Date:      Mon, 11 Jul 2016 07:32:57 -0600
From:      Ian Lepore <ian@freebsd.org>
To:        Karl Denninger <karl@denninger.net>, freebsd-stable@freebsd.org
Subject:   Re: Not-so stable if you take a CAM error....
Message-ID:  <1468243977.72182.118.camel@freebsd.org>
In-Reply-To: <6e9c07e1-12a6-a7cd-f775-6b0fe5a706bc@denninger.net>
References:  <2b0c454b-c1a0-4b5b-e778-bf0939e90ae1@denninger.net> <op.ykfe1fvbkndu52@ronaldradial.radialsg.local> <6e9c07e1-12a6-a7cd-f775-6b0fe5a706bc@denninger.net>

On Mon, 2016-07-11 at 06:30 -0500, Karl Denninger wrote:
> On 7/11/2016 02:57, Ronald Klop wrote:
> > On Mon, 11 Jul 2016 02:54:38 +0200, Karl Denninger
> > <karl@denninger.net> wrote:
> > 
> > > Got a (nasty) surprise this afternoon on my sandbox machine.
> > > 
> > > I was updating some Raspberry Pi2 machines which involved taking
> > > the sd
> > > card out, sticking it in an adapter and plugging it into the
> > > sandbox,
> > > then mounting the partition and using rsync.
> > > 
> > > Unfortunately one of the cards was, unknown to me, bad and
> > > returned a
> > > write error during the update.
> > > 
> > > The machine panic'd immediately after the CAM write error popped
> > > up.
> > > 
> > > I was quite surprised by this, since (1) the SD card was (of
> > > course)
> > > mounted as a UFS filesystem; it shows up as a CAM device, (2) the
> > > machine itself is running off a ZFS root on a normal host-adapter
> > > and
> > > thus there is no commingling of the buffer cache and (3) there
> > > were no
> > > images being run from (can't, wrong architecture!) nor any system
> > > I/O
> > > (e.g. pagefile) going to the SD card.
> > > 
> > > I certainly understand that under some circumstances (maybe even
> > > most
> > > circumstances) taking a hard I/O error to a system device is
> > > going to
> > > hose you and a panic() is arguably "least astonishment" when the
> > > price
> > > of being wrong might be a corrupted system file or worse (e.g.
> > > corrupted
> > > paged-out RSS, etc.)  But I didn't expect a panic on a failed
> > > write to
> > > a device that is mounted and being used purely for data.
> > > 
> > > I don't have a crash dump but can almost-certainly reproduce this
> > > if
> > > it's something that shouldn't happen and thus merits
> > > investigation.
> > > 
> > 
> > Hi,
> > 
> > I understand you are surprised by this. I don't think it is the way
> > it
> > should work.
> > Is there _any_ debugging information for people to use and try to
> > help
> > you? Like which FreeBSD version are you running? Which FreeBSD
> > version
> > was used to create the UFS fs? Does it use softupdates (SU) or also
> > journaling (SU+J)?
> > Maybe some output of dmesg? Or type of SD-card and reader. Other
> > people might have similar problems with similar hardware.
> > 
> > Regards,
> > Ronald.
> > 
> FreeBSD 11.0-BETA1 #0 r302489: Sat Jul  9 10:15:24 CDT 2016    
> karl@NewFS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP
> 
> and
> 
> FreeBSD 11.0-BETA1 #0 r302526: Sun Jul 10 10:39:31 CDT 2016    
> karl@NewFS.denninger.net:/pics/CrossBuild/obj/arm.armv6/pics/CrossBuild/src/sys/RPI2
> 
> Both blew up in the same way when stimulated with same I/O error.
> 
> The filesystem in question does have softupdates enabled (the RPI
> images
> have it turned on by default) but no journaling.  It's not
> card/reader
> dependent nor architecture dependent; when it occurred the first time
> I
> stuck the card and reader into one of my Pis and attempted to update
> it
> there (thinking that perhaps my sandbox machine's USB port was wonky)
> and it blew up the Pi2 in the exact same way.
> 
> This isn't (obviously, given both Intel-style and ARM machines being
> involved) architecture dependent.
> 
> It's been a good long while since I took an actual hard I/O error
> that
> was 'visible' at the OS level (I've had plenty of disks die on ZFS
> over the last few years but no "double failures" on a mirror or
> similar, and on my servers I haven't had a UFS-based system for a
> while).  This
> definitely looks like some sort of regression in the code; I've run
> FreeBSD for a hell of a long time and have had plenty of instances
> where
> disks have failed without having the machine go out from under me.
> 

Unfortunately, this is "just the way it works".  A hard I/O error while
writing to a UFS filesystem with softupdates enabled will cause a
panic, because the softupdates code doesn't handle that sort of
failure, and the failure means that filesystem integrity is lost.  The
code has no idea how important the data is to the functioning of the
system, so it has no basis on which to decide whether to panic or not.

-- Ian



