Date: Wed, 3 Oct 2012 03:27:38 +0000
From: John
To: FreeBSD-FS
Message-ID: <20121003032738.GA42140@FreeBSD.org>
Subject: ZFS/istgt lockup

Hi Folks,

I've been chasing a problem that I'm not quite sure originates on the
BSD side, but the system shouldn't lock up and require a power cycle to
reboot.

The config: I have a BSD system running 9.1-RC handing out a 36TB volume
to a Linux RHEL 6.1 system. The RHEL 6.1 system is doing heavy I/O &
number crunching. Many hours into the job stream the kernel becomes
quite unhappy:

kernel: __ratelimit: 27665 callbacks suppressed
kernel: swapper: page allocation failure. order:1, mode:0x4020
kernel: Pid: 0, comm: swapper Tainted: G ---------------- T 2.6.32-131.0.15.el6.x86_64 #1
kernel: Call Trace:
kernel: [] ? __alloc_pages_nodemask+0x716/0x8b0
kernel: [] ? alloc_pages_current+0xaa/0x110
kernel: [] ? refill_fl+0x3d5/0x4a0 [cxgb3]
kernel: [] ? napi_frags_finish+0x6d/0xb0
kernel: [] ? process_responses+0x653/0x1450 [cxgb3]
kernel: [] ? ring_buffer_lock_reserve+0xa2/0x160
kernel: [] ? napi_rx_handler+0x3c/0x90 [cxgb3]
kernel: [] ? net_rx_action+0x103/0x2f0
kernel: [] ? __do_softirq+0xb7/0x1e0
kernel: [] ? handle_IRQ_event+0xf6/0x170
kernel: [] ? call_softirq+0x1c/0x30
kernel: [] ? do_softirq+0x65/0xa0
kernel: [] ? irq_exit+0x85/0x90
kernel: [] ? do_IRQ+0x75/0xf0
kernel: [] ? ret_from_intr+0x0/0x11
kernel: [] ? native_safe_halt+0xb/0x10
kernel: [] ? ftrace_raw_event_power_start+0x16/0x20
kernel: [] ? default_idle+0x4d/0xb0
kernel: [] ? cpu_idle+0xb6/0x110
kernel: [] ? start_secondary+0x202/0x245

On the BSD side, the istgt daemon appears to see that one of the
connection threads is down and attempts to restart it. At this point,
the istgt process size starts to grow.

USER   PID %CPU %MEM     VSZ    RSS  TT  STAT STARTED     TIME COMMAND
root  1224  0.0  0.4 8041092 405472  v0- DL    4:59PM 15:28.72 /usr/local/bin/istgt
root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM 63:18.34 /usr/local/bin/istgt
root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM 61:13.80 /usr/local/bin/istgt
root  1224  0.0  0.4 8041092 405472  v0- IL    4:59PM  0:00.00 /usr/local/bin/istgt

There are more than 1400 threads reported. Also of interest, netstat shows:

tcp4   0   0  10.59.6.12.5010   10.59.25.113.54076   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33345   CLOSED
tcp4   0   0  10.59.6.12.5010   10.59.25.113.54074   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33343   CLOSED
tcp4   0   0  10.59.6.12.5010   10.59.25.113.54072   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33341   CLOSED
tcp4   0   0  10.60.6.12.5010   10.60.25.113.33339   CLOSED
tcp4   0   0  10.59.6.12.5010   10.59.25.113.54070   CLOSE_WAIT
tcp4   0   0  10.60.6.12.5010   10.60.25.113.53806   CLOSE_WAIT

There are more than 1400 sockets in the CLOSE* state. What would prevent
these sockets from cleaning up in a reasonable timeframe? Both sides of
the MPIO connection appear to be attempting reconnects.

An attempt to gracefully kill istgt fails. A kill -9 does not clean
things up either.
A procstat -kk 1224 after the kill -9 shows:

 PID    TID COMM   TDNAME            KSTACK
1224 100959 istgt  sigthread         mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e zfs_dirent_lock+0x4ff zfs_dirlook+0x69 zfs_lookup+0x26b zfs_freebsd_lookup+0x81 vfs_cache_lookup+0xf8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 vn_open_cred+0x3cb
1224 100960 istgt  luthread #1       mi_switch+0x186 sleepq_wait+0x42 _sleep+0x376 bwait+0x64 physio+0x246 devfs_write_f+0x8d dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
1224 103533 istgt  sendthread #1493  mi_switch+0x186 thread_suspend_switch+0xc9 thread_single+0x1b2 exit1+0x72 sigexit+0x7c postsig+0x3a4 ast+0x26c doreti_ast+0x1f

An attempt to forcefully export the pool hangs also. A procstat shows:

 PID    TID COMM   TDNAME            KSTACK
4427 100991 zpool  -                 mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 dbuf_read+0x30b dmu_buf_hold+0xe0 zap_lockdir+0x58 zap_lookup_norm+0x45 zap_lookup+0x2e dsl_dir_open_spa+0x121 dsl_dataset_hold+0x3b dmu_objset_hold+0x23 zfs_ioc_objset_stats+0x2b zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x546 Xfast_syscall+0xf7

If anyone has any ideas, please let me know. I know I've left a lot of
config information out in an attempt to keep the email shorter.

Random comments:

- This happens with or without multipathd enabled on the Linux client.
- If I catch the istgt daemon while it's creating threads and kill it,
  the system will not lock up.
- I see no errors in the istgt log file. One of my next things to try is
  to enable all debugging... The amount of debugging data captured is
  quite large :-( (a lighter-weight snapshot loop is sketched at the end
  of this mail).
- I am using Chelsio 10G cards on both client/server, which have been
  rock solid in all other cases.

Thoughts welcome!

Thanks,
John
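
P.S. Before turning on full istgt debugging, something like the loop
below is what I have in mind for periodically capturing state while the
job stream runs, so there's output to correlate against the RHEL
client's logs when things wedge. Just a rough, untested sketch -- the
pid (1224), iSCSI port (5010), output directory, and 60-second interval
are specific to my setup and only meant as placeholders:

#!/bin/sh
# Periodically snapshot istgt thread, socket, and kernel-stack state.
PID=1224                          # istgt pid on this box (placeholder)
OUT=/var/tmp/istgt-snapshots      # where to drop the snapshots
mkdir -p $OUT
while :; do
    TS=$(date +%Y%m%d-%H%M%S)
    {
        echo "== ps (threads) =="
        ps -H -p $PID -o user,pid,%cpu,%mem,vsz,rss,stat,time,command
        echo "== netstat (iscsi port 5010) =="
        netstat -an | grep '\.5010 '
        echo "== procstat (kernel stacks) =="
        procstat -kk $PID
    } > $OUT/istgt-$TS.txt 2>&1
    sleep 60
done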