Date: Tue, 17 Mar 2009 18:04:52 -0500
From: Kevin Day <toasty@dragondata.com>
To: Kevin Day <toasty@dragondata.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: zio->io_cv deadlock
Message-ID: <FEB164C7-00AE-4B49-8832-2AF002D6F499@dragondata.com>
In-Reply-To: <8E12CEFC-25DE-4B82-97BD-7ED717650089@dragondata.com>
References: <8E12CEFC-25DE-4B82-97BD-7ED717650089@dragondata.com>
I've got a test environment where I can make this deadlock happen within 3-4 hours of use now. This is from -CURRENT as of yesterday. This server isn't trying to use ZFS on root, so when it hangs I'm not quite as bad off. Here's the ps output:

USER    PID %CPU %MEM    VSZ    RSS  TT  STAT STARTED       TIME COMMAND           UID PPID CPU PRI NI MWCHAN  SL  RE PAGEIN LIM TSIZ
root    477  0.0  0.0   2180    656  ??  Is   9:34AM     0:00.00 /sbin/devd          0    1   0  44  0 select 127 127      0   -  336
root    593  0.0  0.0   5780   1444  ??  Is   9:35AM     0:00.05 /usr/sbin/syslog    0    1   0  44  0 select  21 127      0   -   36
root    811  0.0  0.0  24872   4172  ??  Is   9:35AM     0:00.00 /usr/sbin/sshd      0    1   0  44  0 select 127 127      0   -  220
root    837  0.0  0.0   6836   1532  ??  Is   9:35AM     0:00.07 /usr/sbin/cron -    0    1   0  44  0 nanslp  27 127      0   -   36
root    974  0.0  0.0   8232   2484  ??  Ss   9:38AM     1:19.01 screen              0    1   0  44  0 select   0 127      6   -  292
root   1612  0.0  0.0  38852   7856  ??  Ss   2:49PM     0:00.16 sshd: root@pts/0    0  811   0  44  0 select   0 127      0   -  220
root   1617  0.0  0.0  10188   2792   0  Is   2:49PM     0:00.01 -csh (csh)          0 1612   0  47  0 pause  127 127      0   -  304
root   1622  0.0  0.0   8232   2132   0  S+   2:49PM     0:00.03 screen -x           0 1617   0  44  0 pause    1 127      0   -  292
root    975  0.0  0.0  10188   2704   1  Is   9:38AM     0:00.01 /bin/csh            0  974   0  49  0 pause  127 127      0   -  304
root    980  0.0  1.1 794196 766248   1  D+   9:38AM    66:50.20 rsync -ravH root    0  975   0  69  0 zio->i 127 127      1   -  344
root    982  0.0  0.2 181844 142444   1  I+   9:38AM    69:32.28 rsync -ravH root    0  980   0  44  0 select  57 127      0   -  344
root    983  0.0  0.0  10188   2788   2  Ss   9:38AM     0:00.01 /bin/csh            0  974   0  44  0 pause    0 127      0   -  304
root      1  0.0  0.0   2176    596  ??  ILs  9:34AM     0:00.01 /sbin/init --       0    0   0  44  0 wait   127 127      8   -  604
root    827  0.0  0.0  10796   3800  ??  Ss   9:35AM     0:00.50 sendmail: accept    0    1   0  44  0 select   3 127      1   -  628
smmsp   831  0.0  0.0  10796   3844  ??  Is   9:35AM     0:00.01 sendmail: Queue    25    1   0  44  0 pause  127 127      0   -  628
root    887  0.0  0.0   5776   1224  v0  Is+  9:35AM     0:00.01 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      0   -   20
root    888  0.0  0.0   5776   1224  v1  Is+  9:35AM     0:00.00 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      0   -   20
root    889  0.0  0.0   5776   1224  v2  Is+  9:35AM     0:00.01 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      0   -   20
root    890  0.0  0.0   5776   1224  v3  Is+  9:35AM     0:00.00 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      2   -   20
root    891  0.0  0.0   5776   1224  v4  Is+  9:35AM     0:00.01 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      0   -   20
root    892  0.0  0.0   5776   1224  v5  Is+  9:35AM     0:00.01 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      0   -   20
root    893  0.0  0.0   5776   1224  v6  Is+  9:35AM     0:00.01 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      0   -   20
root    894  0.0  0.0   5776   1224  v7  Is+  9:35AM     0:00.01 /usr/libexec/get    0    1   0  76  0 ttyin  127 127      0   -   20
root    981  0.0  0.0  27668  11220   1  S+   9:38AM   172:25.63 ssh -l root 216.    0  980   0  44  0 select   1 127      0   -  120
root   2112  0.0  0.0   6892   1416   2  R+   6:48PM     0:00.00 ps auxwwlv          0  983   0  44  0 -      127   0      0   -   28
root      0  0.0  0.0      0    128  ??  DLs  9:34AM    46:56.19 [kernel]            0    0   0 -68  0 -      127 127      0   -    0
root      2  0.0  0.0      0     16  ??  DL   9:34AM     0:01.23 [g_event]           0    0   0  -8  0 -        0 127      0   -    0
root      3  0.0  0.0      0     16  ??  DL   9:34AM     0:26.74 [g_up]              0    0   0  -8  0 -        0 127      0   -    0
root      4  0.0  0.0      0     16  ??  DL   9:34AM     5:51.49 [g_down]            0    0   0  -8  0 -        0 127      0   -    0
root      5  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [xpt_thrd]          0    0   0 -16  0 ccb_sc 127 127      0   -    0
root      6  0.0  0.0      0     16  ??  DL   9:34AM     0:00.14 [fdc0]              0    0   0 -16  0 -        0 127      0   -    0
root      7  0.0  0.0      0     24  ??  DL   9:34AM     0:00.00 [sctp_iterator]     0    0   0 -16  0 waitin 127 127      0   -    0
root      8  0.0  0.0      0     16  ??  DL   9:34AM     0:00.03 [pagedaemon]        0    0   0 -16  0 psleep   4 127      0   -    0
root      9  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [vmdaemon]          0    0   0 -16  0 psleep 127 127      0   -    0
root     10  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [audit]             0    0   0 -16  0 audit_ 127 127      0   -    0
root     11 800.0 0.0      0    128  ??  RL   9:34AM  3958:43.82 [idle]              0    0   0 171  0 -      127 127      0   -    0
root     12  0.0  0.0      0    400  ??  WL   9:34AM     6:03.01 [intr]              0    0   0 -60  0 -      127 127      0   -    0
root     13  0.0  0.0      0     16  ??  DL   9:34AM     0:44.77 [yarrow]            0    0   0  44  0 -        0 127      0   -    0
root     14  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [usbus0]            0    0   0 -64  0 wmsg   127 127      0   -    0
root     15  0.0  0.0      0     16  ??  DL   9:34AM     0:00.92 [usbus0]            0    0   0 -68  0 wmsg     2 127      0   -    0
root     16  0.0  0.0      0     16  ??  DL   9:34AM     0:00.56 [usbus0]            0    0   0 -68  0 wmsg     2 127      0   -    0
root     17  0.0  0.0      0     16  ??  DL   9:34AM     0:00.55 [usbus0]            0    0   0 -64  0 wmsg     2 127      0   -    0
root     18  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [usbus1]            0    0   0 -64  0 wmsg   127 127      0   -    0
root     19  0.0  0.0      0     16  ??  DL   9:34AM     0:00.92 [usbus1]            0    0   0 -68  0 wmsg     2 127      0   -    0
root     20  0.0  0.0      0     16  ??  DL   9:34AM     0:00.52 [usbus1]            0    0   0 -68  0 wmsg     2 127      0   -    0
root     21  0.0  0.0      0     16  ??  DL   9:34AM     0:00.60 [usbus1]            0    0   0 -64  0 wmsg     2 127      0   -    0
root     22  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [pagezero]          0    0   0  76  0 pgzero 127 127      0   -    0
root     23  0.0  0.0      0     16  ??  DL   9:34AM     0:00.14 [bufdaemon]         0    0   0 -16  0 psleep   0 127      0   -    0
root     24  0.0  0.0      0     16  ??  DL   9:34AM     0:24.78 [syncer]            0    0   0  44  0 zfsvfs 127 127      0   -    0
root     25  0.0  0.0      0     16  ??  DL   9:34AM     0:02.17 [vnlru]             0    0   0  44  0 vlruwt   0 127      0   -    0
root     26  0.0  0.0      0     16  ??  DL   9:34AM     0:00.21 [softdepflush]      0    0   0 -16  0 sdflus   0 127      0   -    0
root     88  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root     89  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root     90  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root     91  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root     92  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root     93  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root     94  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root     95  0.0  0.0      0     16  ??  DL   9:34AM     0:00.00 [system_taskq]      0    0   0 -16  0 tq->tq 127 127      0   -    0
root    104  0.0  0.0      0     16  ??  DL   9:34AM     5:11.71 [arc_reclaim_thr    0    0   0  44  0 arc_re   0 127      0   -    0
root    105  0.0  0.0      0     16  ??  DL   9:34AM     0:00.15 [l2arc_feed_thre    0    0   0 -16  0 l2arc_   0 127      0   -    0
root    936  0.0  0.0      0     16  ??  DL   9:37AM     0:00.55 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    937  0.0  0.0      0     16  ??  DL   9:37AM     0:00.17 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    938  0.0  0.0      0     16  ??  DL   9:37AM     0:00.00 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    939  0.0  0.0      0     16  ??  DL   9:37AM     0:08.89 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    940  0.0  0.0      0     16  ??  DL   9:37AM     0:31.71 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    941  0.0  0.0      0     16  ??  DL   9:37AM     0:08.91 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    942  0.0  0.0      0     16  ??  DL   9:37AM     0:08.76 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    943  0.0  0.0      0     16  ??  DL   9:37AM     0:08.79 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    944  0.0  0.0      0     16  ??  DL   9:37AM     0:09.21 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    945  0.0  0.0      0     16  ??  DL   9:37AM     0:08.83 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    946  0.0  0.0      0     16  ??  DL   9:37AM     0:08.88 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    947  0.0  0.0      0     16  ??  DL   9:37AM     1:34.28 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    948  0.0  0.0      0     16  ??  DL   9:37AM     1:34.28 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    949  0.0  0.0      0     16  ??  DL   9:37AM     1:34.25 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    950  0.0  0.0      0     16  ??  DL   9:37AM     1:34.30 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    951  0.0  0.0      0     16  ??  DL   9:37AM     1:34.74 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    952  0.0  0.0      0     16  ??  DL   9:37AM     1:34.38 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    953  0.0  0.0      0     16  ??  DL   9:37AM     1:34.36 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    954  0.0  0.0      0     16  ??  DL   9:37AM     1:34.14 [spa_zio]           0    0   0  44  0 tq->tq 127 127      0   -    0
root    955  0.0  0.0      0     16  ??  DL   9:37AM     2:46.35 [spa_zio]           0    0   0  44  0 zfsvfs 127 127      0   -    0
root    956  0.0  0.0      0     16  ??  DL   9:37AM     0:00.00 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    957  0.0  0.0      0     16  ??  DL   9:37AM     0:00.00 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    958  0.0  0.0      0     16  ??  DL   9:37AM     0:00.00 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    959  0.0  0.0      0     16  ??  DL   9:37AM     0:00.00 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    960  0.0  0.0      0     16  ??  DL   9:37AM     0:00.00 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    961  0.0  0.0      0     16  ??  DL   9:37AM     0:00.11 [spa_zio]           0    0   0 -16  0 tq->tq 127 127      0   -    0
root    962  0.0  0.0      0     16  ??  DL   9:37AM     0:24.24 [vdev:worker da1    0    0   0  44  0 vgeom: 127 127      0   -    0
root    963  0.0  0.0      0     16  ??  DL   9:37AM     0:00.21 [txg_thread_ente    0    0   0 -16  0 tx->tx 127 127      0   -    0
root    964  0.0  0.0      0     28  ??  DL   9:37AM     1:19.07 [txg_thread_ente    0    0   0  44  0 tx->tx 127 127      0   -    0
root    965  0.0  0.0      0     16  ??  DL   9:37AM     0:12.34 [zil_clean]         0    0   0 -16  0 tq->tq 127 127      0   -    0

Note that syncer and one spa_zio are stuck in zfsvfs, and my rsync process is frozen in zio->io_cv. zpool makes everything look okay:

cs03# zpool iostat -v
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
z            549G  13.0T     14    133   709K  12.9M
  da1        549G  13.0T     14    133   709K  12.9M
----------  -----  -----  -----  -----  -----  -----

cs03# zpool status -v
  pool: z
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        z           ONLINE       0     0     0
          da1       ONLINE       0     0     0

errors: No known data errors

ZFS related sysctls:

vfs.zfs.arc_meta_limit: 26214400
vfs.zfs.arc_meta_used: 70401408
vfs.zfs.mdcomp_disable: 0
vfs.zfs.arc_min: 140928537
vfs.zfs.arc_max: 104857600
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 0
vfs.zfs.recover: 0
vfs.zfs.txg.synctime: 5
vfs.zfs.txg.timeout: 30
vfs.zfs.scrub_limit: 10
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 10485760
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 6
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 35
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_disable: 0
vfs.zfs.version.zpl: 3
vfs.zfs.version.vdev_boot: 1
vfs.zfs.version.spa: 13
vfs.zfs.version.dmu_backup_stream: 1
vfs.zfs.version.dmu_backup_header: 2
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
kstat.zfs.misc.arcstats.hits: 57819155
kstat.zfs.misc.arcstats.misses: 5590858
kstat.zfs.misc.arcstats.demand_data_hits: 63981
kstat.zfs.misc.arcstats.demand_data_misses: 635
kstat.zfs.misc.arcstats.demand_metadata_hits: 47525277
kstat.zfs.misc.arcstats.demand_metadata_misses: 4114419
kstat.zfs.misc.arcstats.prefetch_data_hits: 19665
kstat.zfs.misc.arcstats.prefetch_data_misses: 809
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 10210232
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 1474995
kstat.zfs.misc.arcstats.mru_hits: 11018143
kstat.zfs.misc.arcstats.mru_ghost_hits: 937056
kstat.zfs.misc.arcstats.mfu_hits: 37133185
kstat.zfs.misc.arcstats.mfu_ghost_hits: 3428689
kstat.zfs.misc.arcstats.deleted: 7789751
kstat.zfs.misc.arcstats.recycle_miss: 6302888
kstat.zfs.misc.arcstats.mutex_miss: 9406
kstat.zfs.misc.arcstats.evict_skip: 418277598
kstat.zfs.misc.arcstats.hash_elements: 9872
kstat.zfs.misc.arcstats.hash_elements_max: 124517
kstat.zfs.misc.arcstats.hash_collisions: 112246
kstat.zfs.misc.arcstats.hash_chains: 96
kstat.zfs.misc.arcstats.hash_chain_max: 3
kstat.zfs.misc.arcstats.p: 92080153
kstat.zfs.misc.arcstats.c: 140928537
kstat.zfs.misc.arcstats.c_min: 140928537
kstat.zfs.misc.arcstats.c_max: 104857600
kstat.zfs.misc.arcstats.size: 141645696
kstat.zfs.misc.arcstats.hdr_size: 2617264
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_hdr_miss: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 0
kstat.zfs.misc.vdev_cache_stats.delegations: 671323
kstat.zfs.misc.vdev_cache_stats.hits: 4416731
kstat.zfs.misc.vdev_cache_stats.misses: 349266

There does seem to be something stuck in the syncer: vfs.worklist_len is 11 and never goes down or moves at all, but that doesn't tell me much. Next time this happens, is there anything else I should look at?

-- Kevin

On Feb 8, 2009, at 10:59 PM, Kevin Day wrote:

>
> I'm playing with a -CURRENT install from a couple of weeks ago.
> Everything seems okay for a few days, then eventually every process
> ends up stuck in zio->io_cv. If I go to the console, it's responsive
> until I try logging in, then login is stuck in zio->io_cv as well.
> Ctrl-Alt-Esc drops me into ddb, but then ddb hangs instantly.
>
> Nothing on the console or syslog before it hangs.
>
> Anyone seen anything similar?
>
> -- Kevin
>
>
> Possibly relevant info:
>
> 8 core Opteron
> 64GB RAM
>
> da1 at twa0 bus 0 target 0 lun 1
> da1: <AMCC 9650SE-12M DISK 3.08> Fixed Direct Access SCSI-5 device
> da1: 100.000MB/s transfers
> da1: 4678158MB (9580867585 512 byte sectors: 255H 63S/T 596381C)
>
> server5# zpool list
> NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
> z     4.44T  1.19T  3.25T    26%  ONLINE  -
>
> server5# zpool status -v
>   pool: z
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         z           ONLINE       0     0     0
>           da1       ONLINE       0     0     0
>
> errors: No known data errors
>
> server5# cat /boot/loader.conf
> vm.kmem_size_max="2048M"
> vm.kmem_size="2048M"
> vfs.zfs.arc_max="100M"
> zfs_load="YES"
> vfs.root.mountfrom="zfs:z"
>
> (tried lowering arc_max, didn't seem to help)
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
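[Editor's note: for the "anything else I should look at?" question above, a capture sequence along these lines would record each thread's wait channel and kernel stack before the box wedges completely. This is a sketch, not from the original thread: it assumes the base-system ps, procstat, and sysctl tools on this -CURRENT build are still usable from the console at hang time, and the output paths are arbitrary.]

```shell
# Hypothetical hang-time capture: run from the console while it still responds.
# Assumes FreeBSD base tools; adjust paths as needed.
ps -axHl > /var/tmp/hang.ps          # per-thread state, incl. MWCHAN column
procstat -ka > /var/tmp/hang.stacks  # kernel stack of every thread (if available)
sysctl vfs.worklist_len kstat.zfs > /var/tmp/hang.sysctl
```

Comparing the MWCHAN column and kernel stacks across two snapshots a few minutes apart would show whether the zio->io_cv and zfsvfs sleepers are making any progress at all, or are hard-deadlocked.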