Date:      Mon, 13 Oct 2014 01:54:05 +0100
From:      "Steven Hartland" <killing@multiplay.co.uk>
To:        "Mark Martinec" <Mark.Martinec+freebsd@ijs.si>, <freebsd-stable@freebsd.org>, <freebsd-fs@freebsd.org>
Subject:   Re: zfs pool import hangs on [tx->tx_sync_done_cv]
Message-ID:  <A5BA41116A7F4B23A9C9E469C4146B99@multiplay.co.uk>
References:  <54372173.1010100@ijs.si> <644FA8299BF848E599B82D2C2C298EA7@multiplay.co.uk> <54372EBA.1000908@ijs.si> <DE7DD7A94E9B4F1FBB3AFF57EDB47C67@multiplay.co.uk> <543731F3.8090701@ijs.si> <543AE740.7000808@ijs.si>

I have reproduced the issue with your image. It seems
like the import is waiting for a transaction group to complete.

It looks like it's waiting on the zio from dsl_pool_sync_mos,
which for some reason never returns.

The pool does import happily read-only with:
zpool import -N -f -o readonly=on sys1boot
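
For reference, the full sequence I used to reproduce it was essentially
the one from your mail (the md unit number will of course vary):

  # xz -d sys1boot.img.xz
  # mdconfig -f sys1boot.img
  # zpool import -f sys1boot     <- hangs in [tx->tx_sync_done_cv]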

 print dp->dp_tx
$3 = {
  tx_cpu = 0xfffffe0004a05000,
  tx_sync_lock = {
    lock_object = {
      lo_name = 0xffffffff815f2941 "tx->tx_sync_lock",
      lo_flags = 40960000,
      lo_data = 0,
      lo_witness = 0x0
    },
    sx_lock = 1
  },
  tx_open_txg = 11733519,
  tx_quiesced_txg = 0,
  tx_syncing_txg = 11733518,
  tx_synced_txg = 0,
  tx_open_time = 120562502889,
  tx_sync_txg_waiting = 11733518,
  tx_quiesce_txg_waiting = 11733519,
  tx_sync_more_cv = {
    cv_description = 0xffffffff815f2953 "tx->tx_sync_more_cv",
    cv_waiters = 0
  },
  tx_sync_done_cv = {
    cv_description = 0xffffffff815f2968 "tx->tx_sync_done_cv",
    cv_waiters = 1
  },
  tx_quiesce_more_cv = {
    cv_description = 0xffffffff815f297d "tx->tx_quiesce_more_cv",
    cv_waiters = 1
  },
  tx_quiesce_done_cv = {
    cv_description = 0xffffffff815f2995 "tx->tx_quiesce_done_cv",
    cv_waiters = 0
  },
  tx_timeout_cv = {
    cv_description = 0x0,
    cv_waiters = 0
  },
  tx_exit_cv = {
    cv_description = 0xffffffff815f29ad "tx->tx_exit_cv",
    cv_waiters = 0
  },
  tx_threads = 2 '\002',
  tx_exiting = 0 '\0',
  tx_sync_thread = 0xfffff80041505490,
  tx_quiesce_thread = 0xfffff80041509920,
  tx_commit_cb_taskq = 0x0
}
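
(The dump above shows txg 11733518 still syncing with tx_synced_txg = 0,
i.e. the very first sync attempted during the import never completes.
For anyone wanting to poke at this themselves, roughly how such state can
be inspected with kgdb on the live kernel - just a sketch, exact thread
and frame numbers will differ:)

  # kgdb /boot/kernel/kernel /dev/mem
  (kgdb) info threads            # locate the pool's txg_sync_thread
  (kgdb) thread 512              # Thread 100338 in the traces below
  (kgdb) frame 6                 # dsl_pool_sync(), where dp is in scope
  (kgdb) print dp->dp_tx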

Relevant threads:

#0  sched_switch (td=0xfffff800446bc000, newtd=<value optimized out>, flags=<value optimized out>) at 
/usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>) at 
/usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064dc23 in sleepq_wait (wchan=0xfffff80021f4b220, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff805b7c1a in _cv_wait (cvp=0xfffff80021f4b220, lock=0xfffff80021f4b1b8) at /usr/src/sys/kern/kern_condvar.c:139
#5  0xffffffff8151ec4b in txg_wait_synced (dp=0xfffff80021f4b000, txg=<value optimized out>) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:631
#6  0xffffffff81510937 in spa_load (spa=0xfffffe0003fe1000, state=SPA_LOAD_IMPORT, type=<value optimized out>, mosconfig=<value 
optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:2777
#7  0xffffffff8150e5df in spa_load_best (spa=0xfffffe0003fe1000, state=SPA_LOAD_IMPORT, mosconfig=1, max_request=<value optimized 
out>, rewind_flags=1) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:2873
#8  0xffffffff8150e023 in spa_import (pool=0xfffffe0002e2a000 "sys1boot", config=0xfffff80044f24ba0, props=0x0, flags=<value 
optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:4207
#9  0xffffffff81566d78 in zfs_ioc_pool_import (zc=0xfffffe0002e2a000) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:1594
#10 0xffffffff81563c4f in zfsdev_ioctl (dev=<value optimized out>, zcmd=<value optimized out>, arg=<value optimized out>, 
flag=<value optimized out>, td=<value optimized out>) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:6158
#11 0xffffffff805299eb in devfs_ioctl_f (fp=0xfffff8004f020aa0, com=3222821378, data=0xfffff80038123e60, cred=<value optimized out>, 
td=0xfffff800446bc000) at /usr/src/sys/fs/devfs/devfs_vnops.c:757
#12 0xffffffff8066201b in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>) at file.h:320
#13 0xffffffff80661d9c in sys_ioctl (td=0xfffff800446bc000, uap=0xfffffe011e49fa40) at /usr/src/sys/kern/sys_generic.c:702
#14 0xffffffff8088defa in amd64_syscall (td=0xfffff800446bc000, traced=0) at subr_syscall.c:134
#15 0xffffffff80870ecb in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:391


Thread 509 (Thread 100533):
#0  sched_switch (td=0xfffff800446bf000, newtd=<value optimized out>, flags=<value optimized out>) at 
/usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>) at 
/usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064e323 in sleepq_timedwait (wchan=0xfffff80044e47210, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:652
#4  0xffffffff805b84f0 in _cv_timedwait_sbt (cvp=0xfffff80044e47210, lock=0xfffff80044e471b8, sbt=<value optimized out>, pr=<value 
optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/kern_condvar.c:325
#5  0xffffffff8151f72e in txg_thread_wait (tx=<value optimized out>, cpr=0xfffffe011dc76a00, cv=0xfffff80044e47210, time=<value 
optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:249
#6  0xffffffff8151e930 in txg_sync_thread (arg=0xfffff80044e47000) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:483
#7  0xffffffff805d5bd4 in fork_exit (callout=0xffffffff8151e480 <txg_sync_thread>, arg=0xfffff80044e47000, frame=0xfffffe011dc76ac0) 
at /usr/src/sys/kern/kern_fork.c:996
#8  0xffffffff8087111e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#9  0x0000000000000000 in ?? ()

Thread 508 (Thread 100532):
#0  sched_switch (td=0xfffff80048204920, newtd=<value optimized out>, flags=<value optimized out>) at 
/usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>) at 
/usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064dc23 in sleepq_wait (wchan=0xfffff80044e47230, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff805b7c1a in _cv_wait (cvp=0xfffff80044e47230, lock=0xfffff80044e471b8) at /usr/src/sys/kern/kern_condvar.c:139
#5  0xffffffff8151f73b in txg_thread_wait (tx=<value optimized out>, cpr=0xfffffe011dc6da00, cv=<value optimized out>, time=<value 
optimized out>) at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:251
#6  0xffffffff8151e450 in txg_quiesce_thread (arg=0xfffff80044e47000) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:556
#7  0xffffffff805d5bd4 in fork_exit (callout=0xffffffff8151e030 <txg_quiesce_thread>, arg=0xfffff80044e47000, 
frame=0xfffffe011dc6dac0) at /usr/src/sys/kern/kern_fork.c:996
#8  0xffffffff8087111e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#9  0x0000000000000000 in ?? ()


Thread 512 (Thread 100338):
#0  sched_switch (td=0xfffff80041505490, newtd=<value optimized out>, flags=<value optimized out>) at 
/usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806110d9 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff8064ddc2 in sleepq_switch (wchan=<value optimized out>, pri=<value optimized out>) at 
/usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff8064dc23 in sleepq_wait (wchan=0xfffff80044acf320, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff805b7c1a in _cv_wait (cvp=0xfffff80044acf320, lock=0xfffff80044acf300) at /usr/src/sys/kern/kern_condvar.c:139
#5  0xffffffff81547c9b in zio_wait (zio=<value optimized out>) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1442
#6  0xffffffff814f056c in dsl_pool_sync (dp=0xfffff80021f4b000, txg=11733518) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:531
#7  0xffffffff81514030 in spa_sync (spa=0xfffffe0003fe1000, txg=11733518) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:6604
#8  0xffffffff8151e6cd in txg_sync_thread (arg=0xfffff80021f4b000) at 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:518
#9  0xffffffff805d5bd4 in fork_exit (callout=0xffffffff8151e480 <txg_sync_thread>, arg=0xfffff80021f4b000, frame=0xfffffe011e6fdac0) 
at /usr/src/sys/kern/kern_fork.c:996
#10 0xffffffff8087111e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#11 0x0000000000000000 in ?? ()
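
To narrow down where that zio gets stuck, one could watch the zio pipeline
with dtrace while the import sits in tx->tx_sync_done_cv. Just a sketch,
assuming the fbt provider and CTF type info for zfs.ko are usable on the
box (I haven't verified this exact one-liner here):

  # count zio completions for the hung pool; if nothing ever shows up,
  # the MOS sync zio (or one of its children) never finishes its pipeline
  dtrace -n 'fbt::zio_done:entry
      /stringof(args[0]->io_spa->spa_name) == "sys1boot"/
      { @[args[0]->io_type] = count(); }'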


----- Original Message ----- 
From: "Mark Martinec" <Mark.Martinec+freebsd@ijs.si>
To: <freebsd-stable@freebsd.org>; <freebsd-fs@freebsd.org>
Sent: Sunday, October 12, 2014 9:40 PM
Subject: Re: zfs pool import hangs on [tx->tx_sync_done_cv]


> I have made available an image copy (dd) of one of my 4 GiB
> partitions (compressed down to a 384 MiB file), holding a small
> BSD root pool that can no longer be imported
> on 10.1-RC1 or 10.1-BETA3, as described in my first
> posting (below):
>
>   http://www.ijs.si/usr/mark/bsd/
>
> I would appreciate it if someone could confirm that such a pool
> (one of several I have with this symptom) causes
> 'zpool import' to hang on 10.1 or 10-STABLE:
>
> - download, xz -d sys1boot.img.xz
>   # mdconfig -f sys1boot.img
>   # zpool import sys1boot
>
> ... and advise on a solution.
>
> Considering that 'zdb -e -cc' is happy and there was no
> other prior trouble with these pools, except for an upgrade
> to 10.1-BETA3/-RC1 from 10-STABLE as of ca. late September,
> it is my belief that these pools are still healthy, just
> non-importable. I'm reluctant to upgrade any other system
> from 10.0 to 10.1 without finding out what went wrong here.
>
>   Mark
>
>
> On 10/10/2014 03:02, Steven Hartland wrote:
> > Sorry to be a pain but could you attach the full output or link it
> > somewhere as mail has messed up the formatting :(
>
> Now at
>   http://www.ijs.si/usr/mark/bsd/procstat-kka-2074.txt
>
> On 2014-10-10 Mark Martinec wrote:
>> In short, after upgrading to 10.1-BETA3 or -RC1 I ended up with several
>> zfs pools that can no longer be imported. The zpool import command
>> (with no arguments) does show all available pools, but trying to
>> import one just hangs and the command cannot be aborted, although
>> the rest of the system is still alive and fine:
>>
>> # zpool import -f <mypool>
>>
>> Typing ^T just shows an idle process, waiting on [tx->tx_sync_done_cv]:
>>
>> load: 0.20  cmd: zpool 939 [tx->tx_sync_done_cv] 5723.65r 0.01u 0.02s 0% 8220k
>> load: 0.16  cmd: zpool 939 [tx->tx_sync_done_cv] 5735.73r 0.01u 0.02s 0% 8220k
>> load: 0.14  cmd: zpool 939 [tx->tx_sync_done_cv] 5741.83r 0.01u 0.02s 0% 8220k
>> load: 0.13  cmd: zpool 939 [tx->tx_sync_done_cv] 5749.16r 0.01u 0.02s 0% 8220k
>>
>> ps shows (on a system re-booted to a LiveCD running FreeBSD-10.1-RC1):
>>
>>   PID    TID COMM             TDNAME           CPU  PRI STATE WCHAN
>>   939 100632 zpool            -                  5  122 sleep tx->tx_s
>> UID PID PPID CPU PRI NI    VSZ  RSS MWCHAN   STAT TT     TIME COMMAND
>>   0 939  801   0  22  0 107732 8236 tx->tx_s D+   v0  0:00.04
>>   zpool import -f -o cachefile=/tmp/zpool.cache -R /tmp/sys0boot sys0boot
>>
>>   NWCHAN
>>   fffff8007b0f2a20
>>
>> # procstat -kk 939
>>
>>   PID    TID COMM             TDNAME KSTACK
>>   939 100632 zpool            -                mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d txg_wait_synced+0x85 
>> spa_load+0x1cd1 spa_load_best+0x6f spa_import+0x1ff zfs_ioc_pool_import+0x137 zfsdev_ioctl+0x6f0 devfs_ioctl_f+0x114 
>> kern_ioctl+0x255 sys_ioctl+0x13c amd64_syscall+0x351 Xfast_syscall+0xfb
>>
>>
>> Background story: the system where this happened was being kept
>> to a fairly recent 10-STABLE. The last upgrade was very close to
>> a BETA3 release. There are a couple of zfs pools there, one on a
>> mirrored pair of SSDs mostly holding the OS, one with a mirrored
>> pair of large spindles, and three more small ones (4 GiB each),
>> mostly for boot redundancy or testing - these small ones are on
>> old smallish disks. These disks are different, and attached to
>> different SATA controllers (LSI and onboard Intel). Pools were
>> mostly kept up to date with the most recent zpool feature set
>> through their lifetime (some starting their life with 9.0, some
>> with 10.0).
>>
>> About two weeks ago, after a reboot to a 10-STABLE of the day,
>> the small pools became unavailable, but the two regular large
>> pools were still normal. At first I didn't pay much attention
>> to that, as these pools were on oldish disks and nonessential
>> for normal operation, and I blamed potentially flaky hardware.
>>
>> Today I needed to do a reboot (for an unrelated reason), and the
>> machine was no longer able to mount the boot pool.
>> My first instinct was that the disks were malfunctioning - but ...
>>
>> Booting it to a FreeBSD-10.1-RC1 LiveCD was successful.
>> A smartmontools disk test shows no problems. dd is able to read the
>> whole partitions of each problematic pool. And most importantly,
>> running 'zdb -e -cc' on each (non-imported) pool churned
>> normally and steadily, produced a stats report at the end,
>> and reported no errors.
>>
>> As proof that the disks are fine, I sacrificed one of the broken
>> 4 GiB GPT partitions holding one of the problematic pools, and
>> did a fresh 10.1-RC1 install on it from a distribution ISO DVD.
>> The installation went fine, and the system does boot and run
>> fine from the newly installed OS. Trying to import one of the
>> remaining old pools hangs the import command as before.
>>
>> As a final proof, I copied (with dd) one of the broken 4 GiB
>> partitions to a file on another system (running 10.1-BETA3,
>> which did not suffer from this problem), made a memory disk
>> out of this file, then ran 'zpool import' on this pool - and it hangs
>> there too! So hardware was not a problem - either these partitions
>> are truly broken (even though zdb -cc says they are fine),
>> or the new OS is somehow no longer able to import them.
>>
>> Please advise.
>>
>> I have a copy of the 4 GiB partition as a 400 MB compressed
>> file available, if somebody would be willing to play with it.
>>
>> I also have a ktrace of the 'zpool import' command. Its last
>> actions before it hangs are:
>>
>>    939 zpool    RET   madvise 0
>>    939 zpool    CALL  madvise(0x80604e000,0x1000,MADV_FREE)
>>    939 zpool    RET   madvise 0
>>    939 zpool    CALL  close(0x6)
>>    939 zpool    RET   close 0
>>    939 zpool    CALL  ioctl(0x3,0xc0185a05,0x7fffffffbf00)
>>    939 zpool    RET   ioctl -1 errno 2 No such file or directory
>>    939 zpool    CALL  madvise(0x802c71000,0x10000,MADV_FREE)
>>    939 zpool    RET   madvise 0
>>    939 zpool    CALL  madvise(0x802ca5000,0x1000,MADV_FREE)
>>    939 zpool    RET   madvise 0
>>    939 zpool    CALL  ioctl(0x3,0xc0185a06,0x7fffffffbf60)
>>    939 zpool    RET   ioctl 0
>>    939 zpool    CALL  ioctl(0x3,0xc0185a06,0x7fffffffbf60)
>>    939 zpool    RET   ioctl 0
>>    939 zpool    CALL  stat(0x802c380e0,0x7fffffffbc58)
>>    939 zpool    NAMI  "/tmp"
>>    939 zpool    STRU  struct stat {dev=273, ino=2, mode=041777, nlink=8, uid=0, gid=0, rdev=96, atime=1412866648, 
>> stime=1412871393, ctime=1412871393, birthtime=1412866648, size=512, blksize=32768, blocks=8, flags=0x0 }
>>    939 zpool    RET   stat 0
>>    939 zpool    CALL  ioctl(0x3,0xc0185a02,0x7fffffffbc60)
>>
>>
>> Mark
>