Date: Sun, 18 Jul 2010 20:34:24 -0700
From: Jeremy Chadwick
To: Mike Tancsa
Cc: freebsd-stable@freebsd.org
Subject: Re: deadlock or bad disk ? RELENG_8
Message-ID: <20100719033424.GA92607@icarus.home.lan>
In-Reply-To: <201007190301.o6J31Hs1045607@lava.sentex.ca>
References: <201007182108.o6IL88eG043887@lava.sentex.ca>
 <20100718211415.GA84127@icarus.home.lan>
 <201007182142.o6ILgDQW044046@lava.sentex.ca>
 <20100719023419.GA91006@icarus.home.lan>
 <201007190301.o6J31Hs1045607@lava.sentex.ca>

On Sun, Jul 18, 2010 at 11:01:03PM -0400, Mike Tancsa wrote:
> At 10:34 PM 7/18/2010, Jeremy Chadwick wrote:
> >On Sun, Jul 18, 2010 at 05:42:14PM -0400, Mike Tancsa wrote:
> >> At 05:14 PM 7/18/2010, Jeremy Chadwick wrote:
> >>
> >> >Where exactly is your swap partition?
> >>
> >> On one of the areca raidsets.
> >>
> >> # swapctl -l
> >> Device:          1024-blocks     Used:
> >> /dev/da0s1b         10485760       108
> >
> >So is da0 actually a RAID volume "behind the scenes" on the Areca
> >controller?  How many disks are involved in that set?
>
> yes, da0 is a RAID volume with 4 disks behind the scenes.

Okay, so can you get full SMART statistics for all 4 of those disks?
The adjusted/calculated values for SMART thresholds won't be helpful
here; we'll need the actual raw SMART data.  I hope the Areca CLI can
provide that.
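If the Areca CLI can't hand over the raw attributes, smartmontools
might be able to -- newer builds have an Areca passthrough
(-d areca,N).  A rough sketch, untested here, assuming the controller
shows up as /dev/arcmsr0 and the four members sit in slots 1-4
(adjust both to match your box):

smartctl -a -d areca,1 /dev/arcmsr0
smartctl -a -d areca,2 /dev/arcmsr0
smartctl -a -d areca,3 /dev/arcmsr0
smartctl -a -d areca,4 /dev/arcmsr0

The raw values for attributes like Reallocated_Sector_Ct,
Current_Pending_Sector, and UDMA_CRC_Error_Count are the ones worth
watching.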
Also, I'm willing to bet that the da0 "volume" and the da1 "volume"
actually share the same physical disks on the Areca controller.  Is
that correct?  If so, think about what would happen if heavy I/O
happened on both da0 and da1 at the same time.  I talk about this a
bit more below.

> >Well, the thread I linked you stated that the problem has to do with a
> >controller or disk "taking too long".  I have no idea what the threshold
> >is.  I suppose it could also indicate that your system is (possibly)
> >running low on resources (RAM); I would imagine swap_pager would get
> >called if a processes needed to be offloaded to swap.  So maybe this is
> >a system tuning thing more than a hardware thing.
>
> Prior to someone rebooting it, it had been stuck in this state for a
> good 90min.  Apart from upgrading to a later RELENG_8 to get the
> security patches, the machine had been running a few versions of
> RELENG_8 doing the same workloads every week without issue.

Then I would say you'd need to roll back kernel+world to a previous
date and try to figure out when the issue began, if that is indeed the
case.

> /boot/loader.conf has
> ahci_load="YES"
> siis_load="YES"
>
> sysctl.conf has
>
> net.inet.tcp.recvbuf_max=16777216
> net.inet.tcp.recvspace=131072
> net.inet.tcp.sendbuf_max=16777216
> net.inet.tcp.sendspace=32768
> net.inet.udp.recvspace=65536
> kern.ipc.somaxconn=1024
> kern.ipc.maxsockbuf=4194304
> net.inet.ip.redirect=0
> net.inet.ip.intr_queue_maxlen=4096
> net.route.netisr_maxqlen=1024
> kern.ipc.nmbclusters=131072

None of these, to my knowledge, would affect what you're seeing;
they're all network-related.

> I do track some basic mem stats via rrd.  Looking at the graphs upto
> that period, nothing unusual was happening

sysctl vm.stats.vm | grep swap

Here's another post basically reiterating the same thing: that the
controller the swap slice is on (in your case a 4-disk RAID array) is
taking too long to respond.

http://groups.google.com/group/mailing.freebsd.stable/browse_thread/thread/2e7faeeaca719c52/cdcd4601ce1b90c5

I have no idea where the timeout values are in the kernel.  I do see
these two entries in sysctl that look to be of interest though.  You
might try adjusting these (not sure if they're sysctls or loader.conf
tunables only):

vm.swap_idle_threshold2: 10
vm.swap_idle_threshold1: 2

Descriptions:

vm.swap_idle_threshold2: Time before a process will be swapped out
vm.swap_idle_threshold1: Guaranteed swapped in time for a process

I want to point out that the actual amount of data being swapped out
is fairly small -- note the "size" fields in the swap_pager kernel
messages.  There doesn't necessarily have to be a shortage of memory
to cause a swapout (case in point, see above).

It would also help if you could provide timestamps of those messages;
are they all happening at once, or gradually over time?  If over time,
do they all happen around the same time every day, etc.?  You see
where I'm going with this.

Workaround recommendation: put swap directly on a device and not as
part of a 4-disk RAID volume (regardless of what type of RAID) and see
if the problem goes away.  I realise that probably isn't plausible in
your situation (since you'd then be dedicating an entire disk to just
swap).  Others may have other advice.  You mention in a later mail
that the ada[0-3] disks make up a ZFS pool of some sort.  You might
try splitting ada0 into two slices, one for swap and the other used
as a pool member.

Again: I don't think this is necessarily a bad disk problem.  The
only way you'd be able to determine that would be to monitor, on a
per-disk basis, the I/O response time of each disk member on the
Areca.  If the CLI tools provide this, awesome.  Otherwise you'll
probably need to involve Areca Support.  Remember: CAM thinks da0 and
da1 are actually individual disks, and lacks knowledge of them being
associated with a 4-disk RAID volume behind the scenes.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |