Date:      Wed, 12 Nov 2008 20:42:00 -0800
From:      Jeremy Chadwick <koitsu@FreeBSD.org>
To:        Tim Bishop <tim-lists@bishnet.net>
Cc:        Kostik Belousov <kostikbel@gmail.com>, Tim Bishop <tim@bishnet.net>, freebsd-stable@freebsd.org
Subject:   Re: System deadlock when using mksnap_ffs
Message-ID:  <20081113044200.GA10419@icarus.home.lan>
In-Reply-To: <20081113004102.GD24360@carrick.bishnet.net>
References:  <20081112175826.GD26195@carrick.bishnet.net> <20081112194735.GK47073@deviant.kiev.zoral.com.ua> <20081113004102.GD24360@carrick.bishnet.net>

On Thu, Nov 13, 2008 at 12:41:02AM +0000, Tim Bishop wrote:
> On Wed, Nov 12, 2008 at 09:47:35PM +0200, Kostik Belousov wrote:
> > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > I've been playing around with snapshots lately but I've got a problem on
> > > one of my servers running 7-STABLE amd64:
> > > 
> > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 tdb@paladin:/usr/obj/usr/src/sys/PALADIN  amd64
> > > 
> > > I run the mksnap_ffs command to take the snapshot and some time later
> > > the system completely freezes up:
> > > 
> > > paladin# cd /u2/.snap/
> > > paladin# mksnap_ffs /u2 test.1
> > > 
> > > It only happens on this one filesystem, though, which might be to do
> > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > also backed by a hardware RAID system, although a smaller filesystem on
> > > the same RAID has no issues.
> > > 
> > > Filesystem  1K-blocks       Used     Avail Capacity  Mounted on
> > > /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
> > > 
> > > To clarify "completely freezes up": unresponsive to all services over
> > > the network, except ping. On the console I can switch between the ttys,
> > > but none of them respond. The only way out is to hit the reset button.
> > 
> > You need to provide information described in the
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
> > and especially
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> 
> Ok, I've done that, and removed the patch that seemed to fix things.
> 
> The first thing I notice after doing this on the console is that I can
> still ctrl+t the process:
> 
> load: 0.14  cmd: mksnap_ffs 2603 [newbuf] 0.00u 10.75s 0% 1160k
> 
> But the top and ps I left running on other ttys have all stopped
> responding.

Then in my book, the patch didn't fix anything.  :-)  The system is
still "deadlocking"; snapshot generation **should not** wedge the system
hard like this.

Also, during my own testing, I am always able to use Ctrl-T to get
SIGINFO from the running process (mksnap_ffs).  That behaviour does not
change for me.

The rest of the information below is good -- but I'm confused about
something: is there anyone out there who can use mksnap_ffs on a
filesystem (/usr is a good test source) and NOT experience this
deadlocking problem?  Literally *every* FreeBSD box I have root access
to suffers from this problem, so I'm a little baffled why we end-users
need to keep providing debugging output when it should be as easy as pie
for a developer to run "dump -0 -L -a -f /path/fs.dump /usr" and watch
their system wedge.

Also, a fellow on freebsd-fs just mentioned he's having this exact problem:

http://lists.freebsd.org/pipermail/freebsd-fs/2008-November/005324.html

> Also the following kernel message came out:
> 
> Expensive timeout(9) function: 0xffffffff802ce380(0xffffff000677ca50) 0.006121001 s
>
> There is also still some disk I/O.
> 
> Dropping to ddb worked, but I don't have a serial console so I can't
> paste the output.
> 
> ps shows mksnap_ffs in newbuf, as we already saw. A trace of mksnap_ffs
> looks like this:
> 
> Tracing pid 2603 tid 100214 td 0xffffff0006a0e370
> sched_switch() at sched_switch+0x2a1
> mi_switch() at mi_switch+0x233
> sleepq_switch() at sleepq_switch+0xe9
> sleepq_wait() at sleepq_wait+0x44
> _sleep() at _sleep+0x351
> getnewbuf() at getnewbuf+0x2e1
> getblk() at getblk+0x30d
> setup_allocindir_phase2() at setup_allocindir_phase2+0x338
> softdep_setup_allocindir_page() at softdep_setup_allocindir_page+0xa7
> ffs_balloc_ufs2() at ffs_balloc_ufs2+0x121e
> ffs_snapshot() at ffs_snapshot+0xc52
> ffs_mount() at ffs_mount+0x735
> vfs_donmount() at vfs_donmount+0xeb5
> kernel_mount() at kernel_mount+0xa1
> ffs_cmount() at ffs_cmount+0x92
> mount() at mount+0x1cc
> syscall() at syscall+0x1f6
> Xfast_syscall() at Xfast_syscall+0xab
> --- syscall (21, FreeBSD ELF64, mount), rip = 0x80068636c, rsp = 0x7fffffffe518, rbp = 0x8008447a0 ---
> 
> show pcpu shows cpuid 3 (quad core machine) in thread "swi6: Giant taskq".
> All the other cpus are idle.
> 
> show locks shows:
> 
> exclusive sleep mutex Giant r = 0 (0xffffffff806ae040) locked @ /usr/src/sys/kern/kern_intr.c:1087
> 
> There are two other locks shown by show all locks, one for sshd and one
> for mysqld, both in kern/uipc_sockbuf.c.
> 
> show lockedvnods shows mksnap_ffs has a lock on da0s1a with ffs_vget at
> the top of the stack.
> 
> Sorry for any typos. I'll sort out a serial cable if more is needed :-)
> 
> Tim.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



