Date:      Sun, 23 Aug 2015 15:22:07 -0600
From:      "Justin T. Gibbs" <gibbs@scsiguy.com>
To:        d@delphij.net
Cc:        Don Lewis <truckman@FreeBSD.org>, freebsd-fs@FreeBSD.org, "Justin T. Gibbs" <gibbs@freebsd.org>, George Wilson <george.wilson@delphix.com>
Subject:   Re: solaris assert: avl_is_empty(&dn->dn_dbufs) panic
Message-ID:  <D9D6BA1D-30F3-4417-A3DA-E0253E6EA5A2@scsiguy.com>
In-Reply-To: <55D7700B.3080207@delphij.net>
References:  <201508211748.t7LHmo96096088@gw.catspoiler.org> <55D7700B.3080207@delphij.net>

Hi,

I'll need a little time to fully reload the context for these changes.
However, reintroducing a blocking loop is not the right fix - it was a
hack in the original code. :-) My hunch is that removing the assert is
safe, but it would be nice to have a core dump to better understand why
the list isn't empty.
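
For reference, the assertion that fires is the empty-tree check right
after dnode_evict_dbufs() in dnode_sync_free() (inlined into
dnode_sync() in the backtrace below). Paraphrasing from memory rather
than quoting the exact FreeBSD source:

	dnode_undirty_dbufs(&dn->dn_dirty_records[txgoff]);
	dnode_evict_dbufs(dn);
	/* dnode_sync.c:495 - trips if anything is still on the tree */
	ASSERT(avl_is_empty(&dn->dn_dbufs));

Dropping the ASSERT would let the sync path carry on while any
remaining DB_EVICTING dbufs finish draining via dbuf_rele_and_unlock(),
which is why my hunch is that it's safe - but I'd still like the core
dump to confirm.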

--
Justin

> On Aug 21, 2015, at 12:38 PM, Xin Li <delphij@delphij.net> wrote:
>
> Hi,
>
> A quick glance at the changes suggests that Justin's changeset may be
> related.  The reasoning is here:
>
> https://reviews.csiden.org/r/131/
>
> Related Illumos ticket:
>
> https://www.illumos.org/issues/5056
>
> In dnode_evict_dbufs(), remove multiple passes over dn->dn_dbufs.
> This is possible now that objset eviction is asynchronously
> completed in a different context once dbuf eviction completes.
>
> In the case of objset eviction, any dbufs held by children will
> be evicted via dbuf_rele_and_unlock() once their refcounts go
> to zero.  Even when objset eviction is not active, the ordering
> of the avl tree guarantees that children will be released before
> parents, allowing the parent's refcounts to naturally drop to
> zero before they are inspected in this single loop.
>
> ====
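
(For anyone reading along without the review open: the single pass is
shaped roughly like the sketch below. This is a simplification - the
committed code threads a marker node through the tree so the walk can
resume in place after dbuf_clear() drops dn_dbufs_mtx; here I just
restart the walk instead.)

	dmu_buf_impl_t *db, *db_next;

	mutex_enter(&dn->dn_dbufs_mtx);
	for (db = avl_first(&dn->dn_dbufs); db != NULL; db = db_next) {
		mutex_enter(&db->db_mtx);
		if (db->db_state != DB_EVICTING &&
		    refcount_is_zero(&db->db_holds)) {
			/* Evict now; children sort before their parents. */
			dbuf_clear(db);		/* drops db_mtx itself */
			db_next = avl_first(&dn->dn_dbufs);
		} else {
			/* Held or already draining; skip it. */
			db_next = AVL_NEXT(&dn->dn_dbufs, db);
			mutex_exit(&db->db_mtx);
		}
	}
	mutex_exit(&dn->dn_dbufs_mtx);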
>
> So, upon return from dnode_evict_dbufs(), there could be some
> DB_EVICTING buffers on the AVL tree pending release, which breaks
> the invariant.
>
> Should we restore the loop where we yield briefly with the lock
> released, then reacquire and recheck?
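
(Concretely, I read that as something like the following hypothetical
sketch - i.e. the blocking retry behavior I'd rather not bring back:)

	for (;;) {
		dnode_evict_dbufs(dn);
		mutex_enter(&dn->dn_dbufs_mtx);
		if (avl_is_empty(&dn->dn_dbufs)) {
			mutex_exit(&dn->dn_dbufs_mtx);
			break;
		}
		mutex_exit(&dn->dn_dbufs_mtx);
		/* Brief yield so in-flight DB_EVICTING dbufs can drain. */
		delay(1);
	}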
>
> Cheers,
>
> On 08/21/15 10:48, Don Lewis wrote:
>> On 21 Aug, Don Lewis wrote:
>>> On 21 Aug, Don Lewis wrote:
>>>> I just started getting this panic:
>>>>
>>>> solaris assert: avl_is_empty(&dn->dn_dbufs), file:
>>>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c,
>>>> line 495
>>>>
>>>> System info:
>>>> FreeBSD zipper.catspoiler.org 11.0-CURRENT FreeBSD 11.0-CURRENT #25 r286923: Wed Aug 19 09:28:53 PDT 2015     dl@zipper.catspoiler.org:/usr/obj/usr/src/sys/GENERIC  amd64
>>>>
>>>> My zfs pool has one mirrored vdev.  Scrub doesn't find any problems.
>>>>
>>>> %zpool status
>>>>  pool: zroot
>>>> state: ONLINE
>>>>  scan: scrub repaired 0 in 2h58m with 0 errors on Fri Aug 21 00:44:52 2015
>>>> config:
>>>>
>>>> 	NAME        STATE     READ WRITE CKSUM
>>>> 	zroot       ONLINE       0     0     0
>>>> 	  mirror-0  ONLINE       0     0     0
>>>> 	    ada0p3  ONLINE       0     0     0
>>>> 	    ada1p3  ONLINE       0     0     0
>>>>
>>>> This panic is reproducible and happens every time I use poudriere to
>>>> build ports using my 9.3-RELEASE amd64 jail and occurs at the end of the
>>>> poudriere run when it is unmounting filesystems.
>>>>
>>>> [00:10:43] ====>> Stopping 4 builders
>>>> 93amd64-default-job-01: removed
>>>> 93amd64-default-job-01-n: removed
>>>> 93amd64-default-job-02: removed
>>>> 93amd64-default-job-02-n: removed
>>>> 93amd64-default-job-03: removed
>>>> 93amd64-default-job-03-n: removed
>>>> 93amd64-default-job-04: removed
>>>> 93amd64-default-job-04-n: removed
>>>> [00:10:46] ====>> Creating pkgng repository
>>>> Creating repository in /tmp/packages: 100%
>>>> Packing files for repository: 100%
>>>> [00:10:55] ====>> Committing packages to repository
>>>> [00:10:55] ====>> Removing old packages
>>>> [00:10:55] ====>> Built ports: devel/py-pymtbl net/sie-nmsg net/p5-Net-Nmsg net/axa
>>>> [93amd64-default] [2015-08-21_00h47m41s] [committing:] Queued: 4  Built: 4  Failed: 0  Skipped: 0  Ignored: 0  Tobuild: 0   Time: 00:10:53
>>>> [00:10:55] ====>> Logs: /var/poudriere/data/logs/bulk/93amd64-default/2015-08-21_00h47m41s
>>>> [00:10:55] ====>> Cleaning up
>>>> 93amd64-default: removed
>>>> 93amd64-default-n: removed
>>>> [00:10:55] ====>> Umounting file systems
>>>> Write failed: Broken pipe
>>>>
>>>> Prior to that, I ran poudriere a number of times with a 10.2-STABLE
>>>> amd64 jail without incident.
>>>>
>>>> I've kicked off a bunch of poudriere runs for other jails and
>>>> will check on it in the morning.
>>>
>>> Died the same way after building ports on the first jail,
>>> 10.1-RELEASE amd64.
>>>
>>> Since there have been some zfs commits since r286923, I upgraded to
>>> r286998 this morning and tried again with no better luck.  I got the
>>> same panic again.
>>>
>>> This machine has mirrored swap, and even though I've done what
>>> gmirror(8) says to do in order to capture crash dumps, I've had no luck
>>> with that.  The dump is getting written, but savecore is unable to find
>>> it.
>>
>> Not sure what is happening with savecore during boot, but I was able to
>> run it manually and collect the crash dump.
>>
>>
>> Unread portion of the kernel message buffer:
>> panic: solaris assert: avl_is_empty(&dn->dn_dbufs), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c, line: 495
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0859e4e4e0
>> vpanic() at vpanic+0x189/frame 0xfffffe0859e4e560
>> panic() at panic+0x43/frame 0xfffffe0859e4e5c0
>> assfail() at assfail+0x1a/frame 0xfffffe0859e4e5d0
>> dnode_sync() at dnode_sync+0x6c8/frame 0xfffffe0859e4e6b0
>> dmu_objset_sync_dnodes() at dmu_objset_sync_dnodes+0x2b/frame 0xfffffe0859e4e6e0
>> dmu_objset_sync() at dmu_objset_sync+0x29e/frame 0xfffffe0859e4e7b0
>> dsl_pool_sync() at dsl_pool_sync+0x348/frame 0xfffffe0859e4e820
>> spa_sync() at spa_sync+0x442/frame 0xfffffe0859e4e910
>> txg_sync_thread() at txg_sync_thread+0x23d/frame 0xfffffe0859e4e9f0
>> fork_exit() at fork_exit+0x84/frame 0xfffffe0859e4ea30
>> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0859e4ea30
>> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>> KDB: enter: panic
>>
>>
>>
>> (kgdb) bt
>> #0  doadump (textdump=0) at pcpu.h:221
>> #1  0xffffffff8037bb86 in db_fncall (dummy1=<value optimized out>,
>>    dummy2=<value optimized out>, dummy3=<value optimized out>,
>>    dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:568
>> #2  0xffffffff8037b941 in db_command (cmd_table=0x0)
>>    at /usr/src/sys/ddb/db_command.c:440
>> #3  0xffffffff8037b5d4 in db_command_loop ()
>>    at /usr/src/sys/ddb/db_command.c:493
>> #4  0xffffffff8037e18b in db_trap (type=<value optimized out>, code=0)
>>    at /usr/src/sys/ddb/db_main.c:251
>> #5  0xffffffff80a5b294 in kdb_trap (type=3, code=0, tf=<value optimized out>)
>>    at /usr/src/sys/kern/subr_kdb.c:654
>> #6  0xffffffff80e6a4b1 in trap (frame=0xfffffe0859e4e410)
>>    at /usr/src/sys/amd64/amd64/trap.c:540
>> #7  0xffffffff80e49f22 in calltrap ()
>>    at /usr/src/sys/amd64/amd64/exception.S:235
>> #8  0xffffffff80a5a96e in kdb_enter (why=0xffffffff81379010 "panic",
>>    msg=0xffffffff80a60b60 "UH\211\ufffdAWAVATSH\203\ufffdPI\211\ufffdA\211\ufffdH\213\004%Py\ufffd\201H\211E\ufffd\201<%\ufffd\210\ufffd\201") at cpufunc.h:63
>> #9  0xffffffff80a1e2c9 in vpanic (fmt=<value optimized out>,
>>    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:619
>> #10 0xffffffff80a1e333 in panic (fmt=0xffffffff81aafa90 "\004")
>>    at /usr/src/sys/kern/kern_shutdown.c:557
>> #11 0xffffffff8240922a in assfail (a=<value optimized out>,
>>    f=<value optimized out>, l=<value optimized out>)
>>    at /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:81
>> #12 0xffffffff820d4f78 in dnode_sync (dn=0xfffff8040b72d3d0,
>>    tx=0xfffff8001598ec00)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c:495
>> #13 0xffffffff820c922b in dmu_objset_sync_dnodes (list=0xfffff80007712b90,
>>    newlist=<value optimized out>, tx=<value optimized out>)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1045
>> #14 0xffffffff820c8ede in dmu_objset_sync (os=0xfffff80007712800,
>>    pio=<value optimized out>, tx=0xfffff8001598ec00)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1163
>> #15 0xffffffff820e8e78 in dsl_pool_sync (dp=0xfffff80015676000, txg=2660975)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:536
>> #16 0xffffffff8210dca2 in spa_sync (spa=0xfffffe00089c6000, txg=2660975)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:6641
>> #17 0xffffffff8211843d in txg_sync_thread (arg=0xfffff80015676000)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:517
>> #18 0xffffffff809e47c4 in fork_exit (
>>    callout=0xffffffff82118200 <txg_sync_thread>, arg=0xfffff80015676000,
>>    frame=0xfffffe0859e4ea40) at /usr/src/sys/kern/kern_fork.c:1006
>> #19 0xffffffff80e4a45e in fork_trampoline ()
>>    at /usr/src/sys/amd64/amd64/exception.S:610
>> #20 0x0000000000000000 in ?? ()
>> Current language:  auto; currently minimal
>>
>> _______________________________________________
>> freebsd-fs@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>>
>
>
> -- 
> Xin LI <delphij@delphij.net>    https://www.delphij.net/
> FreeBSD - The Power to Serve!           Live free or die
>



