FreeBSD Mail Archives

Date:      Sun, 27 Sep 1998 19:10:02 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        gibbs@plutotech.com (Justin T. Gibbs)
Cc:        tlambert@primenet.com, ken@plutotech.com, freebsd-alpha@FreeBSD.ORG, gibbs@plutotech.com, imp@plutotech.com
Subject:   Re: one other thing...
Message-ID:  <199809271910.MAA25729@usr05.primenet.com>
In-Reply-To: <199809262307.RAA14553@pluto.plutotech.com> from "Justin T. Gibbs" at Sep 26, 98 05:00:38 pm

> >Be careful with this.  This is intentional.  Enabling caching will
> >make the disk respond that it has committed the write to stable
> >storage, when in fact it has not.
> 
> Really?  I didn't know that the write cache had that effect.  Fascinating.

If the write operation returns, and the little magnetic domains don't
contain your data yet, then this is generally the effect you can
expect.


> >If you do not get power fail notification of some kind, then there
> >is no way to guarantee that your disk can be recovered to the state
> >that it was supposed to be in (as opposed to merely being recovered
> >to a consistent state).
> 
> Although I've never seen this happen on any of the disks I have here,
> yes, it could happen.  The chances of it happening, due to the small
> size of the cache on most disks, the fact that most drives will commit
> the write to non-volatile storage as soon as possible, etc. make your
> chances pretty good.  Certainly better than if you were running async
> mounts.

But worse than if you weren't enabling write caching.

BTW, this probability argument is the same argument that EXT2FS
advocates use to justify that FS's behaviour...


> >It is much better to turn *off* write caching, and use soft updates
> >(which also, technically, does its own write caching), rather than
> >enabling it on the drive.
> 
> My systems don't usually panic.  Why?  I don't use soft updates, yet.

Or CAM?  And you've fixed the three known VM bugs?  And you never
run your system out of swap?  And you aren't using NFS?  And...
And...


> Soft Updates is not a replacement for the on disk cache.  The two
> serve very different purposes.  One reduces the number of writes
> to the device, the other reduces the number of writes committed by
> the device to the disk and reduces latency for any device writes that
> the OS believes are necessary.

Soft update reduces the number of writes to the device.  And because
it does implicit write gathering, there is little or no room for the
disk to further optimize this under the cover.
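
To illustrate what "write gathering" means here, the following is a toy
sketch only; it is not the soft updates implementation and it ignores
dependency ordering entirely.  The names gather_buf, gather_write and
gather_flush are hypothetical.  Repeated updates to the same block are
merged in memory, so the device sees one write per dirty block:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 512
#define MAX_DIRTY  8

struct dirty_block {
    uint32_t blkno;
    uint8_t  data[BLOCK_SIZE];
};

struct gather_buf {
    struct dirty_block blocks[MAX_DIRTY];
    int count;          /* number of distinct dirty blocks queued */
    int device_writes;  /* total writes "issued" to the device so far */
};

/*
 * Queue an update.  A later update to an already-dirty block replaces
 * the earlier contents instead of queueing a second write.
 * Returns 0 on success, -1 if the buffer is full.
 */
int gather_write(struct gather_buf *g, uint32_t blkno, const uint8_t *data)
{
    assert(g != NULL && data != NULL);
    for (int i = 0; i < g->count; i++) {
        if (g->blocks[i].blkno == blkno) {
            memcpy(g->blocks[i].data, data, BLOCK_SIZE);
            return 0;
        }
    }
    if (g->count == MAX_DIRTY)
        return -1;
    g->blocks[g->count].blkno = blkno;
    memcpy(g->blocks[g->count].data, data, BLOCK_SIZE);
    g->count++;
    return 0;
}

/*
 * Flush: one device write per dirty block.  The counter stands in for
 * real I/O.  Returns the number of writes issued by this flush.
 */
int gather_flush(struct gather_buf *g)
{
    assert(g != NULL);
    int n = g->count;
    g->device_writes += n;
    g->count = 0;
    return n;
}
```

With the gathering already done above the driver, the drive's own cache
has few redundant writes left to absorb.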

A well written OS will be better able to utilize memory in a fashion
suitable to the OS than some disk drive manufacturer building disks
with the general expectation of a VFAT32 FS.

As to the latency argument, yeah, it reduces latency.  So does mounting
async, and so does a caching controller and so does noatime, and so
does taking the fsync() calls out of the database's two stage commit
routine, and... and...

The bottom line: I can make it go as fast as you want, if it doesn't
have to be correct.  Faster even...


> >The drive, in doing caching, may reorder these operations, such
> >that the index is written out, but the new record is not.
> 
> This all depends on how you setup the drive.  You can tell it not
> to re-order writes (FSW bit in the caching control page).

This helps little.  Unless the write is committed to stable storage
on a device block basis under OS control, there are still race
windows inherent in the sector order reversal.  If the drive
believes it is about to write a run of contiguous sectors, it
will *still* reorder the writes.
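
For reference, a minimal sketch of checking the two bits under
discussion in a caching mode page that has already been fetched from
the drive.  The bit positions (WCE in byte 2, FSW in byte 12) are taken
from the SCSI block commands caching mode page layout as I understand
it; verify them against the drive's documentation.  The function names
are hypothetical, and fetching the page with MODE SENSE is not shown:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHING_PAGE_CODE 0x08  /* caching mode page */
#define CACHING_WCE       0x04  /* byte 2, bit 2: write cache enable */
#define CACHING_FSW       0x80  /* byte 12, bit 7: force sequential write */

/*
 * Returns 1 if the write cache is enabled, 0 if it is disabled, and
 * -1 if the buffer is too short or is not a caching page.
 */
int caching_page_wce(const uint8_t *page, size_t len)
{
    assert(page != NULL);
    if (len < 3 || (page[0] & 0x3f) != CACHING_PAGE_CODE)
        return -1;
    return (page[2] & CACHING_WCE) ? 1 : 0;
}

/*
 * Returns 1 if FSW is set, 0 if it is clear, and -1 if the buffer is
 * too short to contain byte 12 or is not a caching page.
 */
int caching_page_fsw(const uint8_t *page, size_t len)
{
    assert(page != NULL);
    if (len < 13 || (page[0] & 0x3f) != CACHING_PAGE_CODE)
        return -1;
    return (page[12] & CACHING_FSW) ? 1 : 0;
}
```

Note that a set FSW bit only tells you what the drive was asked to do;
it says nothing about when a given sector actually reaches the media.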

The correct way to achieve lower latency is to increase concurrency
-- but only between unrelated operations.

The appropriate technology for this is multiple outstanding commands;
tagged command queueing, in other words.


> If I was really worried, however, I'd have the box on a UPS.

This protects against crashes as a result of a power loss, but
not those resulting from a memory overcommit architecture, nor
those resulting from kernel bugs.

Frankly, we are arguing on different axes; you are discussing
"safe enough", while I'm discussing "reliable".


> >The
> >normal way you guarantee ordering in an application is to fsync()
> >the record file before writing the index.  The fsync() is not
> >supposed to return until the drive states the data has been
> >committed to stable storage.  With write caching on, the drive lies.
> 
> This is an interface issue, not a cache issue.  If the kernel told the
> disk driver to sync the cache, it could.  This is what the Synchronize
> Cache command is all about.

But it doesn't, so you can't turn caching on and maintain data
integrity guarantees; only data integrity probabilities.


> >PS: If you turn this on, you might as well mount the drive async,
> >too, since we are only talking about "how the data can not be
> >trusted", as opposed to "if the data can not be trusted" (it can't).
> 
> You are assuming that the OS will never panic.  I don't use async
> mounts because I expect the OS to occasionally crash.  I worry
> about power outages too, but they are something I can easily control
> with a UPS.

Actually, I am assuming that the OS *will* panic.  If you discount
everything including power failures, then you get the first part
of my "PS:".  If you discount everything *but* power failures,
e.g., by appeal to a UPS, then you get the second part (since as
long as the write occurs before a bus reset telling the device to
forget everything and initialize itself, a cached write will
still occur -- note that this is a real danger in a panic situation).


> >If you get power fail notification, then you can use async and
> >drive level write caching in relative safety (ie: as safe as
> >possible, given the possibility of the system crashing for some
> >reason other than power failure).
> 
> exactly.
> 
> So why did you feel the need to sermonize again?  Ken and I are
> well aware of how SCSI devices work and the effects of setting
> these parameters.  We did write a SCSI layer, you know...

Because someone suggested doing something that I felt was bad
advice, and so long as that bad advice was in the record, I
felt it necessary to note, also for the record, why the advice
was bad.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-alpha" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199809271910.MAA25729>
