Date:      Fri, 20 Mar 2009 14:41:57 -0500
From:      Scott Lambert <lambert@lambertfam.org>
To:        FreeBSD-STABLE <freebsd-stable@freebsd.org>
Subject:   Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
Message-ID:  <20090320194157.GB80292@sysmon.tcworks.net>

I have a previously stable machine (apart from a one-time panic in
soft-updates that I could never reproduce) running RELENG_7 from July
23, 2008.

Starting update: Wed Jul 23 01:29:47 CDT 2008
Finished update: Wed Jul 23 01:31:13 CDT 2008

I had the userquota option in the fstab for /home, but I did not yet
have anything in /etc/rc.conf to enable quotas.  I have been running an
unmodified GENERIC kernel config.

/dev/mirror/gm0s1g on /home (ufs, local, soft-updates)

It runs a few jails, using ezjail.  Two of them are image-based jails,
1GB and 2GB.  There is also one jail that is not image-based.  The jails
live in /home/ezjails.

I added another image-based jail, with a 3GB image, on March 12th.

I added this machine to our AMANDA setup on March 13, 2009.  

Things seemed to be okay until the 19th.  On the 19th, during the dump
of /home, things gradually started to hang.  Nagios paged me about
services not responding.  

I did not find any explanation for it.  The disks were idle according to
systat -vm.  I was able to grep the log files on /var for a while, and
then I could no longer do anything with the machine.

I eventually had to go to the office and power cycle it.  I tried C-A-D
first, but shutdown timed out after 30 seconds.

Just to make sure it wasn't something that had since been fixed, I
updated to RELENG_7 as of Mar 19th.

Starting update: Thu Mar 19 03:40:41 CDT 2009
Finished update: Thu Mar 19 03:48:45 CDT 2009

I rebooted to the new kernel and installed the world just after midnight
on the 20th.  I started getting paged by Nagios again at 3:40am.  

I noticed that mksnap_ffs was running on /home (CPU time used: 0:00.77)
as things began to circle the drain.  That was about 30 minutes after
the dump attempt had been started by AMANDA.  There were many processes
waiting in state D.  This time I did a reboot -n -q, and the box
rebooted but was still fscking when I got to the office.
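
For anyone trying to capture the same symptom, here is a minimal sketch
of how the processes stuck in state D could be listed together with
their wait channels.  The helper name list_d_state is hypothetical; the
ps keywords (pid, state, wchan, command) are standard ps(1) keywords.
It only filters ps output and does not diagnose the cause.

```shell
# Hypothetical helper: keep the ps header line plus any process whose
# state column starts with "D" (uninterruptible wait, usually disk I/O).
list_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on a live system (pid, state, wait channel, command):
#   ps -axo pid,state,wchan,command | list_d_state
```

The wchan column is the useful part: if most of the hung processes
share one wait channel, that narrows down where they are blocked.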

# ls -l /home/.snap
-r--------   1 root  operator  117285093376 Mar 20 03:18 dump_snapshot

# df /home
Filesystem            Size    Used   Avail Capacity  Mounted on
/dev/mirror/gm0s1g    106G     11G     86G    11%    /home
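
The two listings above are in different units, so a quick conversion
helps when comparing them.  This is a minimal sketch; bytes_to_gib is a
hypothetical helper, and the only input is the byte count from the ls
output.  It assumes, as is usual for UFS snapshots, that the length ls
reports is the apparent size of the snapshot file and not the space it
actually occupies.

```shell
# Hypothetical helper: convert a byte count to GiB with one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Apparent size of /home/.snap/dump_snapshot from the ls output above.
bytes_to_gib 117285093376   # prints 109.2
```

So the snapshot file's apparent size is roughly that of the whole 106G
file system, not of the 11G in use.  Running du -h on the snapshot file
would show how many blocks it really consumes.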

I removed userquota from the fstab entry for /home and rebooted, just
to be sure.  The last dangerous combination I remember involving
snapshots was snapshots together with quotas.  Am I even in the danger
zone for quotas without having them compiled into the kernel?

It looks like removing the .snap directory should be enough to prevent
any future snapshots during the backup process.  Does that sound like a
reasonable workaround?  It would at least remove one variable from the
troubleshooting process.

Any other suggestions?

Thank you for any help you may be able to provide,

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert@lambertfam.org



