Date:      Fri, 27 Jun 2003 23:53:30 -0400
From:      Bill Moran <wmoran@potentialtech.com>
To:        John Ekins <john.ekins@brightview.com>
Cc:        questions@freebsd.org
Subject:   Re: Softupdates: df, du, sync and fsck  [quite long]
Message-ID:  <3EFD113A.3060402@potentialtech.com>
In-Reply-To: <20030627220033.5586e86b.john.ekins@brightview.com>
References:  <20030627220033.5586e86b.john.ekins@brightview.com>

John Ekins wrote:
> Hello,
> 
> I've a couple of questions about soft updates. I've Googled heavily on this but
> not really found a satisfactory answer. The story:
> 
> I'm running numerous FreeBSD 4.7 SMP machines as primary MX machines. The mail
> is not stored on the FreeBSD machines but on NetApps via NFS. However the mail is
> temporarily spooled on the FreeBSD machines during normal MTA handling and passing
> to an anti-virus scanner. I have one large partition /var on each machine where
> basically all the work and temporary/transient files for the MTA and AV scanner
> take place.
> 
> These machines are heavily utilised, running quite "hot" with a load average of
> anything from 2 to 8. Many thousands of temporary files are thus created and
> deleted a minute. I have no problem with this as nearly all email is delivered in
> under 1 minute whatever. 
> 
> I notice that after a while the amount of free space as shown by df considerably
> varies from a du on /var. I'm aware of why this happens with soft updates, but
> that's not the whole story. If I turn off incoming email on a machine, the space
> does not seem to sync back to what it should be.  No matter how long I turn off
> the MTA, the space is simply not returned, and df/du show differences of about
> 5:1. Nothing else is writing/holding open files on that partition (even turned
> off syslog, cron, etc. and checked using lsof). In comparison, if, for example, on
> my normal desktop machine I create a 500MB file, then delete it, the space shortly
> afterwards is returned to me when I run df. The only way I've been able to recover
> this space to what it should be is to reboot the machine.

I don't know what's wrong, but does unmounting and remounting the partition reclaim
the lost space?
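
One way to put numbers on that question is to record the df/du gap before and
after the remount. The following is only a rough sketch: the function names
(used_kb_df, used_kb_du, report_gap) are made up for illustration, and it
assumes the mount point path contains no spaces.

```shell
#!/bin/sh
# Sketch only. used_kb_df, used_kb_du and report_gap are hypothetical
# helper names, not existing tools.

# KB in use according to the filesystem's own accounting (what df shows).
# The "Used" column is the 4th field from the end of df's last output line,
# assuming the mount point path contains no spaces.
used_kb_df() {
    df -k "$1" | tail -n 1 | awk '{ print $(NF-3) }'
}

# KB reachable by walking the directory tree (what du shows),
# without crossing into other mounted filesystems (-x).
used_kb_du() {
    du -skx "$1" 2>/dev/null | awk '{ print $1 }'
}

# Print both figures and the difference for one mount point.
report_gap() {
    df_kb=$(used_kb_df "$1")
    du_kb=$(used_kb_du "$1")
    echo "$1: df=${df_kb}K du=${du_kb}K gap=$((df_kb - du_kb))K"
}
```

With the MTA and scanner stopped, something like "report_gap /var", then
"umount /var && mount /var", then "report_gap /var" again would show whether
the remount gives the space back. If the umount fails because the filesystem
is busy, something still has files open there, which is useful to know too.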

> As an example, here is a snippet from the console from when I rebooted an affected
> machine:
> 
>   boot() called on cpu#2
>   Waiting (max 60 seconds) for system process `vnlru' to stop...stopped
>   Waiting (max 60 seconds) for system process `bufdaemon' to stop...stopped
>   Waiting (max 60 seconds) for system process `syncer' to stop...timed out
> 
>   syncing disks... 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 
>   giving up on 22 buffers
>   Uptime: 27d23h1m27s
>   Rebooting...
> 
> As you can see the file system is unable to sync. When the machine reboots it
> literally takes hours to fsck the /var partition (only about 15GB). And the fsck
> output is full of messages like this:
> 
>   UNEXPECTED SOFT UPDATE INCONSISTENCY

Well, this sure isn't good.

> Now, is there a problem here with soft updates "losing track" of what is going on
> on this busy partition? It would appear to be so as quietening the machine does
> not lead to a proper sync. Secondly, why does the fsck take such an inordinate
> amount of time for a smallish partition?

If there are a LOT of inodes with problems, it could easily take a while to fix.  Also,
if you run fsck without specifying a filesystem to fix, it checks every filesystem
listed in /etc/fstab.  So even if the problem is on /var, it might spend a long time
checking /usr as well.  You can work around this by calling fsck with the filesystem
to check.
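
As an illustration of limiting the check to one filesystem, here is a minimal
sketch. The wrapper name check_one_fs and the RUN variable are invented for
this example; by default it only prints the command so it can be reviewed
first.

```shell
#!/bin/sh
# Sketch only. check_one_fs is a hypothetical wrapper name.
# By default it only prints the command; set RUN=1 to actually execute it.
check_one_fs() {
    fs=$1    # mount point or device of the one filesystem to check
    if [ "${RUN:-0}" = 1 ]; then
        fsck "$fs"
    else
        echo "would run: fsck $fs"
    fi
}
```

For example, "check_one_fs /var" prints the command, and "RUN=1 check_one_fs /var"
runs it. The filesystem should not be mounted read-write while fsck repairs it,
so this belongs in single-user mode or after unmounting /var.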

> I really like the performance benefits of soft updates, but it seems that I'm
> going to have to turn it off on /var because of the problems that eventually
> occur.

If these are production boxes, I'd recommend turning it off until you resolve the
problem.

> If anyone has some advice I'd be grateful.

I don't know if this would qualify as "advice", but since nobody else seems to have
any suggestions, I figured I'd throw my thoughts in.

Are you using ATA or SCSI drives?  Does issuing a manual "sync" once you've stopped
the spooling process help any?  Are these all identical mobos ... possibly a BIOS
update available?  These aren't IBM ATA drives are they?  I had one of those give
me grief for months (if you look in the archives, you should be able to find details
on which drives caused problems).  Have you tried updating one of the machines to
4.8 to see if the problem has been fixed?

Like I said, not good advice, just some ideas for you.
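
For the manual "sync" test, a small loop along these lines would show whether
the used space on the partition moves at all after spooling has stopped. It is
only a sketch: sync_and_watch is a made-up name, the defaults of 5 rounds and
30 seconds are arbitrary, and it assumes the mount point path has no spaces.

```shell
#!/bin/sh
# Sketch only. sync_and_watch is a hypothetical helper name.
# Usage: sync_and_watch MOUNTPOINT [ROUNDS] [DELAY_SECONDS]
sync_and_watch() {
    mp=$1
    rounds=${2:-5}     # how many times to sync and sample (arbitrary default)
    delay=${3:-30}     # seconds to wait after each sync (arbitrary default)
    i=0
    while [ "$i" -lt "$rounds" ]; do
        sync
        sleep "$delay"
        # "Used" is the 4th field from the end of df's last output line.
        used=$(df -k "$mp" | tail -n 1 | awk '{ print $(NF-3) }')
        echo "round $((i + 1)): ${used}K used on $mp"
        i=$((i + 1))
    done
}
```

Running "sync_and_watch /var" with the MTA stopped and seeing the same figure
every round would suggest that sync alone is not releasing the space.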

-- 
Bill Moran
Potential Technologies
http://www.potentialtech.com


