Date: Wed, 3 Oct 2012 03:27:38 +0000
From: John
To: FreeBSD-FS
Message-ID: <20121003032738.GA42140@FreeBSD.org>
Subject: ZFS/istgt lockup

Hi Folks,

I've been chasing a problem that I'm not quite sure originates on the
BSD side, but the system shouldn't lock up and require a power cycle to
reboot.

The config: I have a BSD system running 9.1-RC handing out a 36TB volume
to a Linux RHEL 6.1 system. The RHEL 6.1 system is doing heavy I/O &
number crunching. Many hours into the job stream the kernel becomes
quite unhappy:

kernel: __ratelimit: 27665 callbacks suppressed
kernel: swapper: page allocation failure. order:1, mode:0x4020
kernel: Pid: 0, comm: swapper Tainted: G ---------------- T 2.6.32-131.0.15.el6.x86_64 #1
kernel: Call Trace:
kernel: [] ? __alloc_pages_nodemask+0x716/0x8b0
kernel: [] ? alloc_pages_current+0xaa/0x110
kernel: [] ? refill_fl+0x3d5/0x4a0 [cxgb3]
kernel: [] ? napi_frags_finish+0x6d/0xb0
kernel: [] ? process_responses+0x653/0x1450 [cxgb3]
kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160
kernel: [] ? napi_rx_handler+0x3c/0x90 [cxgb3]
kernel: [] ? net_rx_action+0x103/0x2f0
kernel: [] ? __do_softirq+0xb7/0x1e0
kernel: [] ? handle_IRQ_event+0xf6/0x170
kernel: [] ? call_softirq+0x1c/0x30
kernel: [] ? do_softirq+0x65/0xa0
kernel: [] ? irq_exit+0x85/0x90
kernel: [] ? do_IRQ+0x75/0xf0
kernel: [] ? ret_from_intr+0x0/0x11
kernel: [] ? native_safe_halt+0xb/0x10
kernel: [] ? ftrace_raw_event_power_start+0x16/0x20
kernel: [] ? default_idle+0x4d/0xb0
kernel: [] ? cpu_idle+0xb6/0x110
kernel: [] ? start_secondary+0x202/0x245

On the BSD side, the istgt daemon appears to see that one of the
connection threads is down and attempts to restart it. At this point,
the istgt process size starts to grow.

USER   PID %CPU %MEM     VSZ    RSS  TT  STAT STARTED     TIME COMMAND
root  1224  0.0  0.4 8041092 405472  v0- DL    4:59PM 15:28.72 /usr/local/bin/istgt
root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM 63:18.34 /usr/local/bin/istgt
root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM 61:13.80 /usr/local/bin/istgt
root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM  0:00.00 /usr/local/bin/istgt

There are more than 1400 threads reported. Also of interest, netstat shows:

tcp4   0   0  10.59.6.12.5010   10.59.25.113.54076   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33345   CLOSED
tcp4   0   0  10.59.6.12.5010   10.59.25.113.54074   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33343   CLOSED
tcp4   0   0  10.59.6.12.5010   10.59.25.113.54072   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33341   CLOSED
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33339   CLOSED
tcp4   0   0  10.59.6.12.5010   10.59.25.113.54070   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.53806   CLOSE_WAIT

There are more than 1400 sockets in the CLOSE* state. What would prevent
these sockets from cleaning up in a reasonable timeframe? Both sides of
the MPIO connection appear to be attempting reconnects.

An attempt to gracefully kill istgt fails. A kill -9 does not clean
things up either.
A procstat -kk 1224 after the kill -9 shows:

 PID    TID COMM   TDNAME            KSTACK
1224 100959 istgt  sigthread         mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e zfs_dirent_lock+0x4ff zfs_dirlook+0x69 zfs_lookup+0x26b zfs_freebsd_lookup+0x81 vfs_cache_lookup+0xf8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 vn_open_cred+0x3cb
1224 100960 istgt  luthread #1       mi_switch+0x186 sleepq_wait+0x42 _sleep+0x376 bwait+0x64 physio+0x246 devfs_write_f+0x8d dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
1224 103533 istgt  sendthread #1493  mi_switch+0x186 thread_suspend_switch+0xc9 thread_single+0x1b2 exit1+0x72 sigexit+0x7c postsig+0x3a4 ast+0x26c doreti_ast+0x1f

An attempt to forcefully export the pool hangs also. A procstat shows:

 PID    TID COMM   TDNAME            KSTACK
4427 100991 zpool  -                 mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 dbuf_read+0x30b dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e dsl_dir_open_spa+0x121 dsl_dataset_hold+0x3b dmu_objset_hold+0x23 zfs_ioc_objset_stats+0x2b zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x546 Xfast_syscall+0xf7

If anyone has any ideas, please let me know. I know I've left a lot of
config information out in an attempt to keep the email shorter.

Random comments:

- This happens with or without multipathd enabled on the Linux client.
- If I catch the istgt daemon while it's creating threads and kill it,
  the system will not lock up.
- I see no errors in the istgt log file. One of my next things to try is
  to enable all debugging... The amount of debugging data captured is
  quite large :-( (a lighter-weight snapshot loop is sketched at the end
  of this mail).
- I am using Chelsio 10G cards on both client/server, which have been
  rock solid in all other cases.

Thoughts welcome!

Thanks,
John
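
P.S. Before turning on full istgt debugging, something like the loop
below is what I have in mind for periodically capturing state while the
job stream runs, so there's output to correlate against the RHEL
client's logs when things wedge. Just a rough, untested sketch -- the
pid (1224), iSCSI port (5010), output directory, and 60-second interval
are specific to my setup and only meant as placeholders:

#!/bin/sh
# Periodically snapshot istgt thread, socket, and kernel-stack state.
PID=1224                          # istgt pid on this box (placeholder)
OUT=/var/tmp/istgt-snapshots      # where to drop the snapshots
mkdir -p $OUT
while :; do
    TS=$(date +%Y%m%d-%H%M%S)
    {
        echo "== ps (threads) =="
        ps -H -p $PID -o user,pid,%cpu,%mem,vsz,rss,stat,time,command
        echo "== netstat (iscsi port 5010) =="
        netstat -an | grep '\.5010 '
        echo "== procstat (kernel stacks) =="
        procstat -kk $PID
    } > $OUT/istgt-$TS.txt 2>&1
    sleep 60
done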