Date: Sat, 21 Aug 2010 23:04:35 +0100
From: Tim Bishop
To: freebsd-stable@FreeBSD.org
Subject: 8.1R ZFS almost locking up system

I've had a problem on a FreeBSD 8.1R system for a few weeks. It seems
that ZFS gets into an almost unresponsive state. Last time it happened
(two weeks ago) I couldn't even log in, although the system was up.
This time I could manage a reboot, but I couldn't stop any applications
(they were likely hanging on I/O).

Here are some details I collected prior to the reboot. First, the zpool
output, along with iostat and gstat for the disks:

# zpool status
  pool: pool0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ad4s3   ONLINE       0     0     0
            ad6s3   ONLINE       0     0     0

errors: No known data errors

# zpool iostat -v 5
...
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool0        117G  16.7G    248    114   865K   269K
  mirror     117G  16.7G    248    114   865K   269K
    ad4s3       -      -     43     56  2.47M   269K
    ad6s3       -      -     39     56  2.41M   269K
----------  -----  -----  -----  -----  -----  -----

# gstat
...
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1     48     48   3042    9.8      0      0    0.0   47.6| ad4
    0     38     38   2406   10.5      0      0    0.0   39.5| ad6
    0      0      0      0    0.0      0      0    0.0    0.0| ad4s1
    0      0      0      0    0.0      0      0    0.0    0.0| ad4s2
    1     48     48   3042    9.8      0      0    0.0   47.6| ad4s3
    0      0      0      0    0.0      0      0    0.0    0.0| ad6s1
    0      0      0      0    0.0      0      0    0.0    0.0| ad6s2
    0     38     38   2406   11.8      0      0    0.0   44.4| ad6s3

I've seen this pattern before when I've had poor ZFS performance:
there's more I/O on the disks than on the pool itself. The system isn't
particularly busy, though.
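Next time I'll try to capture the pool-level and disk-level numbers
side by side over a longer period, so the mismatch is easier to see.
Something roughly like this is what I have in mind; it's only a sketch
and the interval, sample count and log path are placeholders, not
something I ran this time:

#!/bin/sh
# Sketch: log pool-level and disk-level I/O together so the two views
# can be compared over time. Interval/count/path are only examples.
LOG=/var/tmp/zfs-io.log
while :; do
        date >> "$LOG"
        # pool-level view: two 10-second samples (the first is the
        # average since boot, so the second one is the interesting one)
        zpool iostat -v pool0 10 2 >> "$LOG"
        # disk-level view: extended iostat for the two mirror members
        iostat -x ad4 ad6 10 2 >> "$LOG"
done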
A few items from top, including zfskern:

  PID USERNAME   THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
    5 root         4  -8    -     0K    60K zio->i  0  54:38  3.47% zfskern
91775 70           1  44    0 53040K 31144K tx->tx  1   2:11  0.00% postgres
39661 tdb          1  44    0 55776K 32968K tx->tx  0   0:39  0.00% mutt
14828 root         1  47    0 14636K  1572K tx->tx  1   0:03  0.00% zfs
11188 root         1  51    0 14636K  1572K tx->tx  0   0:03  0.00% zfs

At some point during this process my zfs snapshots have been failing
to complete:

root     5  0.8  0.0     0    60  ??  DL    7Aug10  54:43.83 [zfskern]
root  8265  0.0  0.0 14636  1528  ??  D    10:00AM   0:03.12 zfs snapshot -r pool0@2010-08-21_10:00:01--1d
root 11188  0.0  0.1 14636  1572  ??  D    11:00AM   0:02.93 zfs snapshot -r pool0@2010-08-21_11:00:01--1d
root 14828  0.0  0.1 14636  1572  ??  D    12:00PM   0:03.04 zfs snapshot -r pool0@2010-08-21_12:00:00--1d
root 17862  0.0  0.1 14636  1572  ??  D     1:00PM   0:01.96 zfs snapshot -r pool0@2010-08-21_13:00:01--1d
root 20986  0.0  0.1 14636  1572  ??  D     2:00PM   0:02.07 zfs snapshot -r pool0@2010-08-21_14:00:01--1d

It all seems to point at ZFS becoming almost unresponsive. It's been
exactly two weeks since this last happened, and therefore since the
last reboot, so it'll be interesting to see whether it happens again
after the same interval.

I noticed this sysctl mentioned in a few other ZFS-related messages:

vfs.worklist_len: 15

I have attached (hopefully all of) the ZFS-related sysctl output.

Finally, the reboot log:

Aug 21 22:13:06 server kernel:
Aug 21 22:13:06 server reboot: rebooted by tdb
Aug 21 22:19:47 server kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
Aug 21 22:19:47 server kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...
Aug 21 22:19:48 server kernel: done
Aug 21 22:19:48 server kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
Aug 21 22:20:03 server kernel:
Aug 21 22:20:03 server kernel: Syncing disks, vnodes remaining...14
Aug 21 22:20:48 server kernel: timed out
Aug 21 22:21:55 server kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...
Aug 21 22:22:39 server kernel: 1
Aug 21 22:22:55 server kernel: timed out
Aug 21 22:22:55 server kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...

I've undoubtedly missed some important information, so please let me
know if there's anything more useful I can collect next time (I'm
quite sure it'll happen again).

Thanks,

Tim.
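P.S. One thing I could grab next time, before rebooting, is the kernel
stack of the stuck processes. A rough sketch of what I have in mind is
below (assuming procstat -kk works on this kernel; the pids are just
the examples from the ps output above, and the output path is made up):

#!/bin/sh
# Sketch: dump kernel thread stacks and wait channels for zfskern and
# the hung "zfs snapshot" processes before the next reboot.
OUT=/var/tmp/zfs-hang-$(date +%Y%m%d-%H%M%S).txt
for pid in 5 8265 11188 14828; do          # example pids from ps above
        echo "==== pid $pid ====" >> "$OUT"
        procstat -kk $pid >> "$OUT" 2>&1   # kernel thread stacks
done
ps -axl >> "$OUT"                          # MWCHAN column shows what each is waiting on
sysctl vfs.worklist_len >> "$OUT"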
-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984

[Attachment: sysctl-zfs-out]

vfs.zfs.l2c_only_size: 0
vfs.zfs.mfu_ghost_data_lsize: 40245248
vfs.zfs.mfu_ghost_metadata_lsize: 87331328
vfs.zfs.mfu_ghost_size: 127576576
vfs.zfs.mfu_data_lsize: 99885056
vfs.zfs.mfu_metadata_lsize: 146944
vfs.zfs.mfu_size: 101330432
vfs.zfs.mru_ghost_data_lsize: 181200896
vfs.zfs.mru_ghost_metadata_lsize: 25819648
vfs.zfs.mru_ghost_size: 207020544
vfs.zfs.mru_data_lsize: 351579136
vfs.zfs.mru_metadata_lsize: 0
vfs.zfs.mru_size: 419203072
vfs.zfs.anon_data_lsize: 0
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_size: 16982016
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 0
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.arc_meta_limit: 162276480
vfs.zfs.arc_meta_used: 167108904
vfs.zfs.mdcomp_disable: 0
vfs.zfs.arc_min: 81138240
vfs.zfs.arc_max: 649105920
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 1
vfs.zfs.check_hostid: 1
vfs.zfs.recover: 0
vfs.zfs.txg.write_limit_override: 0
vfs.zfs.txg.synctime: 5
vfs.zfs.txg.timeout: 30
vfs.zfs.scrub_limit: 10
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 10485760
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 6
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 35
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_disable: 0
vfs.zfs.zio.use_uma: 0
vfs.zfs.version.zpl: 3
vfs.zfs.version.vdev_boot: 1
vfs.zfs.version.spa: 14
vfs.zfs.version.dmu_backup_stream: 1
vfs.zfs.version.dmu_backup_header: 2
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
kstat.zfs.misc.zfetchstats.hits: 0
kstat.zfs.misc.zfetchstats.misses: 0
kstat.zfs.misc.zfetchstats.colinear_hits: 0
kstat.zfs.misc.zfetchstats.colinear_misses: 0
kstat.zfs.misc.zfetchstats.stride_hits: 0
kstat.zfs.misc.zfetchstats.stride_misses: 0
kstat.zfs.misc.zfetchstats.reclaim_successes: 0
kstat.zfs.misc.zfetchstats.reclaim_failures: 0
kstat.zfs.misc.zfetchstats.streams_resets: 0
kstat.zfs.misc.zfetchstats.streams_noresets: 0
kstat.zfs.misc.zfetchstats.bogus_streams: 0
kstat.zfs.misc.arcstats.hits: 268465432
kstat.zfs.misc.arcstats.misses: 68288918
kstat.zfs.misc.arcstats.demand_data_hits: 96018646
kstat.zfs.misc.arcstats.demand_data_misses: 29346057
kstat.zfs.misc.arcstats.demand_metadata_hits: 172436107
kstat.zfs.misc.arcstats.demand_metadata_misses: 38917800
kstat.zfs.misc.arcstats.prefetch_data_hits: 0
kstat.zfs.misc.arcstats.prefetch_data_misses: 0
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 10679
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 25061
kstat.zfs.misc.arcstats.mru_hits: 50140760
kstat.zfs.misc.arcstats.mru_ghost_hits: 3740247
kstat.zfs.misc.arcstats.mfu_hits: 218317059
kstat.zfs.misc.arcstats.mfu_ghost_hits: 25731748
kstat.zfs.misc.arcstats.allocated: 74130663
kstat.zfs.misc.arcstats.deleted: 40615278
kstat.zfs.misc.arcstats.stolen: 31781579
kstat.zfs.misc.arcstats.recycle_miss: 26118118
kstat.zfs.misc.arcstats.mutex_miss: 69832
kstat.zfs.misc.arcstats.evict_skip: 315025656
kstat.zfs.misc.arcstats.evict_l2_cached: 0
kstat.zfs.misc.arcstats.evict_l2_eligible: 1987022682112
kstat.zfs.misc.arcstats.evict_l2_ineligible: 63680512
kstat.zfs.misc.arcstats.hash_elements: 44157
kstat.zfs.misc.arcstats.hash_elements_max: 206842
kstat.zfs.misc.arcstats.hash_collisions: 24859197
kstat.zfs.misc.arcstats.hash_chains: 9616
kstat.zfs.misc.arcstats.hash_chain_max: 22
kstat.zfs.misc.arcstats.p: 40774359
kstat.zfs.misc.arcstats.c: 628821360
kstat.zfs.misc.arcstats.c_min: 81138240
kstat.zfs.misc.arcstats.c_max: 649105920
kstat.zfs.misc.arcstats.size: 628128200
kstat.zfs.misc.arcstats.hdr_size: 10035104
kstat.zfs.misc.arcstats.data_size: 537511936
kstat.zfs.misc.arcstats.other_size: 80581160
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_read_bytes: 0
kstat.zfs.misc.arcstats.l2_write_bytes: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_hdr_miss: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 4236474
kstat.zfs.misc.arcstats.l2_write_trylock_fail: 0
kstat.zfs.misc.arcstats.l2_write_passed_headroom: 0
kstat.zfs.misc.arcstats.l2_write_spa_mismatch: 0
kstat.zfs.misc.arcstats.l2_write_in_l2: 0
kstat.zfs.misc.arcstats.l2_write_io_in_progress: 0
kstat.zfs.misc.arcstats.l2_write_not_cacheable: 62188
kstat.zfs.misc.arcstats.l2_write_full: 0
kstat.zfs.misc.arcstats.l2_write_buffer_iter: 0
kstat.zfs.misc.arcstats.l2_write_pios: 0
kstat.zfs.misc.arcstats.l2_write_buffer_bytes_scanned: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_iter: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_null_iter: 0
kstat.zfs.misc.vdev_cache_stats.delegations: 16993
kstat.zfs.misc.vdev_cache_stats.hits: 27480802
kstat.zfs.misc.vdev_cache_stats.misses: 10351041