Date:      Sun, 22 Mar 2009 04:31:56 -0500
From:      Scott Lambert <lambert@lambertfam.org>
To:        FreeBSD-stable <freebsd-stable@freebsd.org>
Subject:   Re: Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
Message-ID:  <20090322093156.GE80292@sysmon.tcworks.net>
In-Reply-To: <20090320194157.GB80292@sysmon.tcworks.net>
References:  <20090320194157.GB80292@sysmon.tcworks.net>

On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
> I have a previously stable machine, other than a one-time panic in
> soft-updates which I could never reproduce, running RELENG_7 from July
> 23, 2008.
> 
> Starting update: Wed Jul 23 01:29:47 CDT 2008
> Finished update: Wed Jul 23 01:31:13 CDT 2008
> 
> I had the userquota option in the fstab for /home, but I did not yet
> have anything in /etc/rc.conf to enable them.  I have been running an
> unmodified GENERIC kernel config.
> 
> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
> 
> It runs a few jails, using ezjail.  Two of them were image-based jails,
> 1GB and 2GB.  There is also one non-image file jail.  The jails live in
> /home/ezjails.
> 
> I added another image-based jail, 3GB image, on March 12th.
> 
> I added this machine to our AMANDA setup on March 13, 2009.  
> 
> Things seemed to be okay until the 19th.  On the 19th, during the dump
> of /home, things gradually started to hang.  Nagios paged me about
> services not responding.  
> 
> I did not find any explanation for it.  The disks were idle according to
> systat -vm.  I was able to grep the log files on /var for a while, and
> then I could no longer do anything with it.
> 
> I eventually had to go to the office and power cycle it.  I tried C-A-D
> first, but shutdown timed out after 30 seconds.
> 
> Just to make sure it wasn't something that had since been fixed, I
> updated to RELENG_7 as of Mar 19th.
> 
> Starting update: Thu Mar 19 03:40:41 CDT 2009
> Finished update: Thu Mar 19 03:48:45 CDT 2009
> 
> I rebooted to the new kernel and installed the world just after midnight
> on the 20th.  I started getting paged by Nagios again at 3:40am.  
> 
> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
> as things began to circle the drain.  That was about 30 minutes after
> the dump attempt had been started by AMANDA.  There were many processes
> waiting in state D.  This time I did a reboot -n -q and the box rebooted
> but was still fscking when I got to the office.
> 
> # ls -l /home/.snap
> -r--------   1 root  operator  117285093376 Mar 20 03:18 dump_snapshot
> 
> # df /home
> Filesystem            Size    Used   Avail Capacity  Mounted on
> /dev/mirror/gm0s1g    106G     11G     86G    11%    /home
> 
> I removed userquota from the fstab entry for /home and rebooted, just
> to be sure.  The last dangerous combination I remember for snapshots
> involved quotas.  Am I even in the danger zone for quotas
> without having them compiled into the kernel?
> 
> It looks like removing the .snap directory should be enough to prevent
> any future snapshots during the backup process.  Does that sound like a
> reasonable workaround?  It would at least remove one variable from the
> troubleshooting process.
> 
> Any other suggestions?
> 
> Thank you for any help you may be able to provide,

It did it to me again tonight.  I was unable to get in to look at
anything, so I just pushed the power button.  It gave me the same
"shutdown timed out after 30 seconds" message.

So, I tuned the /home file system to disable soft-updates.  I also
removed the .snap directory.
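
For reference, the two changes above could be sketched roughly as follows.
This is an illustration under assumptions, not a record of what was actually
typed: the device and mount point are taken from the quoted mount output,
the use of tunefs(8) is presumed from "tuned", and plan() is a hypothetical
helper that only prints the steps so they can be reviewed first.

```sh
#!/bin/sh
# Sketch only. DEV and MNT come from the quoted mount output; plan() is a
# hypothetical helper that prints the steps instead of running them.
DEV=/dev/mirror/gm0s1g
MNT=/home

plan() {
    # tunefs cannot modify a file system that is mounted read-write,
    # so the sequence unmounts it first.
    echo "umount $MNT"
    echo "tunefs -n disable $DEV"   # -n disable turns soft updates off
    echo "tunefs -p $DEV"           # -p prints the current settings to confirm
    echo "mount $MNT"
    echo "rmdir $MNT/.snap"         # only succeeds if the directory is empty
}

plan            # review the steps
# plan | sh -x  # assumption: run as root, with nothing holding /home open
```

Printing the steps rather than executing them keeps the sketch harmless on a
machine where /home is busy; the commented-out last line shows one way to
run them once the list looks right.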

I would appreciate any suggestions...
 
-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert@lambertfam.org