Date: Sun, 18 Jul 2010 20:34:24 -0700
From: Jeremy Chadwick
To: Mike Tancsa
Cc: freebsd-stable@freebsd.org
Subject: Re: deadlock or bad disk ? RELENG_8
Message-ID: <20100719033424.GA92607@icarus.home.lan>
In-Reply-To: <201007190301.o6J31Hs1045607@lava.sentex.ca>
References: <201007182108.o6IL88eG043887@lava.sentex.ca>
 <20100718211415.GA84127@icarus.home.lan>
 <201007182142.o6ILgDQW044046@lava.sentex.ca>
 <20100719023419.GA91006@icarus.home.lan>
 <201007190301.o6J31Hs1045607@lava.sentex.ca>

On Sun, Jul 18, 2010 at 11:01:03PM -0400, Mike Tancsa wrote:
> At 10:34 PM 7/18/2010, Jeremy Chadwick wrote:
> >On Sun, Jul 18, 2010 at 05:42:14PM -0400, Mike Tancsa wrote:
> >> At 05:14 PM 7/18/2010, Jeremy Chadwick wrote:
> >>
> >> >Where exactly is your swap partition?
> >>
> >> On one of the areca raidsets.
> >>
> >> # swapctl -l
> >> Device:          1024-blocks     Used:
> >> /dev/da0s1b         10485760       108
> >
> >So is da0 actually a RAID volume "behind the scenes" on the Areca
> >controller?  How many disks are involved in that set?
>
> yes, da0 is a RAID volume with 4 disks behind the scenes.

Okay, so can you get full SMART statistics for all 4 of those disks?
The adjusted/calculated values for SMART thresholds won't be helpful
here; we'll need the actual raw SMART data.  I hope the Areca CLI can
provide that.
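If the Areca CLI can't hand over the raw attributes, smartmontools
might be able to -- newer builds have an Areca passthrough
(-d areca,N).  A rough sketch, untested here, assuming the controller
shows up as /dev/arcmsr0 and the four members sit in slots 1-4
(adjust both to match your box):

smartctl -a -d areca,1 /dev/arcmsr0
smartctl -a -d areca,2 /dev/arcmsr0
smartctl -a -d areca,3 /dev/arcmsr0
smartctl -a -d areca,4 /dev/arcmsr0

The raw values for attributes like Reallocated_Sector_Ct,
Current_Pending_Sector, and UDMA_CRC_Error_Count are the ones worth
watching.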
Also, I'm willing to bet that the da0 "volume" and the da1 "volume"
actually share the same physical disks on the Areca controller.  Is
that correct?  If so, think about what would happen if heavy I/O
happened on both da0 and da1 at the same time.  I talk about this a
bit more below.

> >Well, the thread I linked you stated that the problem has to do with a
> >controller or disk "taking too long".  I have no idea what the threshold
> >is.  I suppose it could also indicate that your system is (possibly)
> >running low on resources (RAM); I would imagine swap_pager would get
> >called if a processes needed to be offloaded to swap.  So maybe this is
> >a system tuning thing more than a hardware thing.
>
> Prior to someone rebooting it, it had been stuck in this state for a
> good 90min.  Apart from upgrading to a later RELENG_8 to get the
> security patches, the machine had been running a few versions of
> RELENG_8 doing the same workloads every week without issue.

Then I would say you'd need to roll back kernel+world to a previous
date and try to figure out when the issue began, if that is indeed the
case.

> /boot/loader.conf has
> ahci_load="YES"
> siis_load="YES"
>
> sysctl.conf has
>
> net.inet.tcp.recvbuf_max=16777216
> net.inet.tcp.recvspace=131072
> net.inet.tcp.sendbuf_max=16777216
> net.inet.tcp.sendspace=32768
> net.inet.udp.recvspace=65536
> kern.ipc.somaxconn=1024
> kern.ipc.maxsockbuf=4194304
> net.inet.ip.redirect=0
> net.inet.ip.intr_queue_maxlen=4096
> net.route.netisr_maxqlen=1024
> kern.ipc.nmbclusters=131072

None of these, to my knowledge, would affect what you're seeing;
they're all network-related.

> I do track some basic mem stats via rrd.  Looking at the graphs upto
> that period, nothing unusual was happening

sysctl vm.stats.vm | grep swap

Here's another post basically reiterating the same thing: that the
controller the swap slice is on (in your case a 4-disk RAID array) is
taking too long to respond.

http://groups.google.com/group/mailing.freebsd.stable/browse_thread/thread/2e7faeeaca719c52/cdcd4601ce1b90c5

I have no idea where the timeout values are in the kernel.  I do see
these two entries in sysctl that look to be of interest though.  You
might try adjusting these (not sure if they're sysctls or loader.conf
tunables only):

vm.swap_idle_threshold2: 10
vm.swap_idle_threshold1: 2

Descriptions:

vm.swap_idle_threshold2: Time before a process will be swapped out
vm.swap_idle_threshold1: Guaranteed swapped in time for a process

I want to point out that the actual amount of data being swapped out
is fairly small -- note the "size" fields in the swap_pager kernel
messages.  There doesn't necessarily have to be a shortage of memory
to cause a swapout (case in point, see above).

It would also help if you could provide timestamps of those messages;
are they all happening at once, or gradually over time?  If over time,
do they all happen around the same time every day, etc.?  You see
where I'm going with this.

Workaround recommendation: put swap directly on a device and not as
part of a 4-disk RAID volume (regardless of what type of RAID) and see
if the problem goes away.  I realise that probably isn't plausible in
your situation (since you'd then be dedicating an entire disk to just
swap).  Others may have other advice.  You mention in a later mail
that the ada[0-3] disks make up a ZFS pool of some sort.  You might
try splitting ada0 into two slices, one for swap and the other used
as a pool member.

Again: I don't think this is necessarily a bad disk problem.  The
only way you'd be able to determine that would be to monitor, on a
per-disk basis, the I/O response time of each disk member on the
Areca.  If the CLI tools provide this, awesome.  Otherwise you'll
probably need to involve Areca Support.  Remember: CAM thinks da0 and
da1 are actually individual disks, and lacks knowledge of them being
associated with a 4-disk RAID volume behind the scenes.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |