Date:      Tue, 30 Aug 2011 16:43:23 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Kevin Oberman <kob6558@gmail.com>
Cc:        freebsd-stable@freebsd.org, David Magda <dmagda@ee.ryerson.ca>, dart@es.net
Subject:   Re: Unable to shutdown
Message-ID:  <20110830234323.GA88936@icarus.home.lan>
In-Reply-To: <CAN6yY1upYKW9e3mmED11pTXk%2BVO1KduPy-boWTr7m9S42jUKWw@mail.gmail.com>
References:  <CAN6yY1s3x1ojxh-Dx9Ht=L8M4frohLXcMLNgz%2BzgtBCDodBdsg@mail.gmail.com> <uh78vqd9u8e.fsf@P142.sics.se> <4E5BF15F.9070601@es.net> <CAN6yY1u6ZshVZT2DwaQ2Et7Y1JvNA8q%2BFj5os4SmK4=7=Z77vg@mail.gmail.com> <f0ffdf9eccf14f42ee24f0982bb0fc4b.squirrel@webmail.ee.ryerson.ca> <20110830214832.GA87354@icarus.home.lan> <CAN6yY1upYKW9e3mmED11pTXk%2BVO1KduPy-boWTr7m9S42jUKWw@mail.gmail.com>

On Tue, Aug 30, 2011 at 04:10:13PM -0700, Kevin Oberman wrote:
> On Tue, Aug 30, 2011 at 2:48 PM, Jeremy Chadwick
> <freebsd@jdc.parodius.com> wrote:
> > On Tue, Aug 30, 2011 at 01:29:02PM -0400, David Magda wrote:
> >> On Tue, August 30, 2011 11:50, Kevin Oberman wrote:
> >> [...]
> >> > The more I look at this, the more it seems to me that it is an issue
> >> > with the Seagate drive and not a FreeBSD issue. Probably a bug that is
> >> > never triggered on Windows, so is largely unnoticed. I suspect Windows
> >> > probably orders the commands in a subtly different order.
> >> [...]
> >>
> >> Or not the drive per se, but the USB-to-IDE/SATA chipset.
> >>
> >> A while back on the OpenSolaris zfs-discuss list there was an issue where
> >> USB drives would have corrupt ZFS pools if a drive was yanked without a
> >> 'zpool export' being run. Even though ZFS is supposed to always be
> >> consistent on-disk (because it's transactional), this wasn't happening.
> >>
> >> It turned out that the chipset had a list of particular SATA commands
> >> that it allowed through to the drive, and all others were simply
> >> answered with "OK", regardless of what actual actions needed to be
> >> taken. One of the SATA commands that was NOT whitelisted was the 'cache
> >> flush' command--which ZFS needs to make sure that its data structures
> >> are written in the proper order.
> >>
> >> Turns out the drive and its firmware were fine and doing things properly,
> >> it's just that the necessary commands weren't getting to it because of the
> >> USB adaptor's chipset.
> >
> > I don't think that advice is applicable in this situation.  Here's why:
> >
> > Kevin's original description indicates that when the drive (or enclosure
> > translation ASIC for that matter) is in standby, when the system is shut
> > down, the drive/ASIC never spins back up on I/O (flushing all I/O
> > buffers to disk).
> >
> > If he issues "ls" commands or similar userland-induced I/O to the drive
> > prior to shutting the system down, the drive/ASIC spins up normally.
> >
> > Here's Kevin's original quote:
> >
> >>> The drive is "green" and spins down when idle.  If an attempt is made
> >>> to shut down the system while the drive is spun down, the system goes
> >>> through the usual shutdown including flushing all buffers out to disk,
> >>> but when the final disk access is made to mark the file systems as
> >>> clean, the drive never spins up and the system hangs until it is
> >>> powered down. I've found no way to avoid this other than to remember to
> >>> access the disk and cause it to spin up before shutting down.
> >>>
> >>> If I attempt to unmount the file systems when the drive is shut down,
> >>> the same thing happens, but I can recover as the second file system
> >>> is still mounted and an ls(1) to that file system will cause the disk
> >>> to spin up and everything is fine.
> >
> > So the question is what's "unique" about flushing all I/O buffers to
> > disk during shutdown compared to issuing standard I/O in userland.  I
> > can speculate all day as to what the cause is, but it's highly unlikely
> > that the USB-to-SATA controller ASIC is causing the problem.
> 
> You are perhaps assuming a bit too much. Since I know that a disk read
> or write WILL spin up the drive, I can only assume that the msdosfs is
> not finding anything to flush, so is not writing. I see the full
> "flushing all buffers" countdown and it always runs successfully to
> zero. This, without the drive spinning up. This begs at least the
> question of whether the drive is receiving any writes or whether the
> "writes" are simply being cached by the drive to save energy. I
> suspect that the drive only spins up when enough of its write cache is
> filled.

If there's "nothing to flush", then why is the kernel indefinitely
looping (finally giving up, and it usually prints something when it
encounters that condition) when trying to flush buffers when the drive
is spun down?  What exactly is it trying to flush if there's "nothing to
flush"?

Let me ask you this: can you stop using msdosfs on said USB device and
instead use UFS2 and see if the problem disappears?  This is in no way a
permanent solution.  If this workaround fixes the problem, then I'm
inclined to believe msdosfs is to blame.  There has been a lot of
discussion of this driver in the kernel as of late, and the general
opinion of it is that it's crummy.

And here's another thought: what if the issue is limited, somehow, to
just writes?  Meaning, could the kernel issue a "false" read to the
device (for some random LBA, even LBA 0 for all I care) and then proceed
with its write/flushing?  I wonder if that would cause the drive to spin
up first.  That would be a "quirk" in my opinion.
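
To illustrate the idea, here's a rough sketch in Python (not kernel code,
and not something FreeBSD does today as far as I know).  READ DMA (0xc8),
WRITE DMA (0xca) and FLUSH CACHE (0xe7) are real ATA opcodes; the function
name, the quirk flag and the (opcode, lba) tuples are made up purely for
illustration:

```python
# Real ATA opcodes; everything else in this sketch is hypothetical.
READ_DMA = 0xC8      # 28-bit LBA read
WRITE_DMA = 0xCA     # 28-bit LBA write
FLUSH_CACHE = 0xE7   # takes no LBA


def with_wakeup_read(commands, quirk_read_first=True, wake_lba=0):
    """Prepend a throwaway read so the drive sees a read before any write.

    commands is a list of (opcode, lba) tuples.  If the quirk is off, the
    list is empty, or the first command is already a read, it is returned
    unchanged (as a copy).
    """
    if not quirk_read_first or not commands:
        return list(commands)
    if commands[0][0] == READ_DMA:
        return list(commands)
    return [(READ_DMA, wake_lba)] + list(commands)
```

If a drive only misbehaves without the leading read, that would point at
a write-specific quirk rather than at the flush itself.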

There's also the possibility the USB stack on FreeBSD is doing something
really stupid... man, I don't even want to go down that road.  Hans
should be able to help determine if that's the case, but not using
msdosfs as a test would be a good start.

> In that case, the "flush cache" might actually be what is issued, but
> I can't claim any certainty about that. I'm not willing to completely
> clear the USB-SATA chip as the culprit.

I'm pretty certain FLUSH CACHE or -EXT is what's used when the kernel is
shutting down.  You ABSOLUTELY want all pending disk I/O (writes in
particular) written to the platters/media on the disk before the
machine reboots, otherwise you're hoping the drive does it before it
gets re-initialised during POST or when an option ROM (AHCI) starts.

So I'm pretty sure the kernel is iterating over whatever cache buffers
there are for I/O (I don't know what this is called technically) and
issuing WRITE DMA or -EXT and either waiting for a non-error response
from the device or issuing it blindly followed by a FLUSH CACHE or -EXT
(either once per write or at the very end).
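
Here's a minimal sketch of that sequence in Python.  To be clear, this is
my guess at the shape of it, not what the FreeBSD kernel actually does.
The opcodes are the real ATA ones (the same ones quoted further down);
the function name, its parameters and the (opcode, lba) tuples are
hypothetical:

```python
# Real ATA opcodes; the rest of this sketch is hypothetical.
WRITE_DMA = 0xCA        # 28-bit LBA
WRITE_DMA_EXT = 0x35    # 48-bit LBA
FLUSH_CACHE = 0xE7
FLUSH_CACHE_EXT = 0xEA

MAX_LBA28 = (1 << 28) - 1


def shutdown_sequence(dirty_lbas, use_lba48=False, flush_each_write=False):
    """Return the (opcode, lba) commands to issue for pending writes.

    dirty_lbas lists the LBAs that still have unwritten data.  The flush
    commands take no LBA, so their lba field is None.
    """
    write = WRITE_DMA_EXT if use_lba48 else WRITE_DMA
    flush = FLUSH_CACHE_EXT if use_lba48 else FLUSH_CACHE
    commands = []
    for lba in dirty_lbas:
        if lba > MAX_LBA28 and not use_lba48:
            raise ValueError("LBA %d needs 48-bit addressing" % lba)
        commands.append((write, lba))
        if flush_each_write:
            # Variant 1: flush after every single write.
            commands.append((flush, None))
    if not flush_each_write:
        # Variant 2: one flush at the very end.
        commands.append((flush, None))
    return commands
```

Either way the first thing the device sees is a write, which is a media
access.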

> > Furthermore, Windows doesn't have "special disk/enclosure drivers" for
> > such drives, so there's nothing "unique" Windows would be sending across
> > the wire, ATA-protocol-wise, that would explain why Windows works and
> > FreeBSD doesn't.  At least that's my opinion.
> 
> This is not always quite true, but it is true for the general case. (I
> know some WD enclosures do install a custom driver.)

It's true 99% of the time.  I use Windows XP exclusively on my
workstations and make use of USB-class storage devices (hard disks, CF,
microSD) quite often.  There are no drivers involved, but just like with
FreeBSD there are potential device quirks.

The only way to find out what Windows is doing in this situation is to
make use of a hardware ATA protocol analyser (one would need to buy one
(expensive) and disassemble the drive and stick the analyser between the
USB/SATA ASIC and the drive).  Fun project?  Not really.

> > With ATA/SATA, the FLUSH CACHE (0xe7) and -EXT (0xea) (for 48-bit LBAs)
> > commands are separate from WRITE DMA (0xca) and -EXT (0x35) (for 48-bit
> > LBAs).  Both FLUSH CACHE commands do not take LBAs in their input CDB.
> > To "flush buffers to disk" I imagine what the kernel should be doing is
> > issuing WRITE commands followed by FLUSH CACHE.  The WRITEs should be
> > "waking" the drive up.
> 
> Should they? As I pointed out above, that is not necessarily the case.

"It depends".  If the drive is in "sleep", then no.  If "standby", then
yes.  There is no ATA protocol "wakeup" command, just for the record.

What needs to happen here is that those wanting to participate in this
ATA protocol discussion *NEED* to familiarise themselves with the
ATA8-ACS specification.  Please PLEASE **PLEASE** take the time to do
this before questioning.

http://www.t13.org/Documents/UploadedDocuments/docs2007/D1699r4a-ATA8-ACS.pdf

Section 4.18.3 contains a flow-chart diagram that is difficult to
understand, so I'll summarise:

PM0 state = ACTIVE state -- spun up and ready to handle any I/O of any kind

PM1 state = IDLE state -- this does not mean "the drive is sitting there
idle doing nothing".  There is an ATA IDLE command that can be used to
tell the drive to go into a "lower-power" state.

PM2 state = STANDBY state -- this equates to "camcontrol standby".  This
is what people here are describing as "the drive has spun down".  Or,
well, I sure hope that's what people are describing, because "sleep" is
not the same thing as "standby".

PM3 state = SLEEP state -- this equates to "camcontrol sleep".  It's
permanent until the entire bus is reset or the physical device is
power-cycled (which one works varies from device to device).

So with those definitions, you can see quite clearly the documentation
states what should happen when transitioning from one state to another.
Specifically this is the one that matters (PM2 --> PM0 state):

Transition PM2:PM0: When a media access is required, the device shall
make a transition to the PM0:Active mode.

Now as for drives which may be in IDLE mode (I'm not sure if FreeBSD
makes use of that mode automatically or not), it's the same thing:

Transition PM1:PM0: When a media access is required, the device shall
make a transition to the PM0:Active mode.

So that answers the question: any I/O (read or write) to the device
should spin the drive up.  If you have an enclosure or an ASIC that is
screwing this up (I highly doubt it, and this is not the same problem as
what David was describing!), then it's in violation of the ATA protocol.
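
For those who prefer code to flow-charts, here's a small Python sketch of
just the transitions summarised above.  The enum and function names are
mine, not the spec's, and this only models "what happens on a media
access", nothing else in the diagram:

```python
from enum import Enum


class PowerMode(Enum):
    PM0_ACTIVE = 0
    PM1_IDLE = 1
    PM2_STANDBY = 2
    PM3_SLEEP = 3


def after_media_access(mode):
    """Mode a spec-compliant device should be in once I/O needs the media."""
    if mode is PowerMode.PM3_SLEEP:
        # Sleep is only left via a bus reset or power cycle; ordinary
        # reads and writes do not wake the drive.
        return PowerMode.PM3_SLEEP
    # Transitions PM1:PM0 and PM2:PM0; a drive already in PM0 stays there.
    return PowerMode.PM0_ACTIVE
```

So after_media_access(PowerMode.PM2_STANDBY) gives PM0_ACTIVE, which is
the case that matters in this thread.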

> > But wait, there's more.
> >
> > I want to point out to people that "sleep" and "standby" are two very
> > different things (they're separate ATA commands too).  So if you're
> > using "camcontrol sleep" you probably should be using "camcontrol
> > standby".  The man page is quite clear about the repercussions of the
> > former (and in the latter case I can imagine I/O to the drive failing or
> > simply timing out given that a bus reset is not performed during
> > shutdown TMK).
> 
> This is a very interesting point. Note that when this happens, whether
> at shutdown or when unmounting the file system, it hangs forever. There
> is no timeout.
> 
> I should also make one oddity completely clear, just in case my initial
> report failed to do so. I have two msdosfs file systems on the disk
> (along with an encrypted UFS system which is not normally mounted). I
> can dismount one file system. It no longer shows up as mounted, but the
> drive DOES NOT SPIN UP. Only when I attempt to unmount the second FS
> does that unmount hang. And, since the system is running normally and
> the drive is still mounted, I can issue a command to read from the disk
> and it spins up. (I actually use tcsh command completion to do this by
> typing "ls /media/MUSIC/Ctrl-D". The terminal window freezes at that
> point for several seconds until the disk is spun up and ready and then
> completes the operation.) Both disks are then unmounted and the system
> is clear.
> 
> Does anyone know what the very last operations of unmount are? Things
> that are AFTER the system has been removed from all system tables? I'm
> guessing it is just to mark the system as clean (single block write)
> and flush the cache. I'm guessing that the write is not going to fill
> cache to the point of triggering a spin-up, so the system THINKS the
> first drive is unmounted, but something is still not complete.

This is really starting to sound like idiocy within the msdosfs driver.
That's just my opinion at this point.  As for what happens during device
unmount, I believe it's handled per-device (per-layer) as well as
per-filesystem.  Kirk McKusick might have some insight into this --
filesystems aren't something I'm really well-versed in.

Sorry for sounding crass, but I really grow tired of people "blaming
hardware" willy-nilly when in my experience most of these wonky problems
turn out to be bugs/issues in FreeBSD.  Anyone who thinks this OS is
infallible is smoking some serious crack.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
