From: Dennis Glatting <freebsd@pki2.com>
To: Paul Kraus
Cc: Tijl Coosemans, freebsd-questions@freebsd.org
Subject: Re: More than 32 CPUs under 8.4-P
Date: Sun, 19 May 2013 18:45:35 -0700
Message-ID: <1369014335.16472.60.camel@btw.pki2.com>

On Sun, 2013-05-19 at 16:28 -0400, Paul Kraus wrote:
> On May 19, 2013, at 11:51 AM, Dennis Glatting wrote:
>
> > ZFS hangs on multi-socket systems (Tyan, Supermicro) under 9.1. ZFS
> > does not hang under 8.4. This (and one other 4-socket) is a
> > production system.
>
> Can you be more specific? I have been running 9.0 and 9.1 systems,
> multi-CPU and all ZFS, with no (CPU-related*) issues.
>

I have (down to) ten FreeBSD/ZFS systems. Five of them have more than
one socket populated, all with AMD 6200-series CPUs. Two of the
multi-socket systems are simply workstations and don't do much file
I/O, so I have yet to see them fault. The remaining three perform
significant I/O on 1-8TB files (simultaneously), including sorting,
compression, backup, etc. (ZFS compression is enabled on some datasets,
as is dedup on a few minor datasets). I also do iSCSI and NFS from one
of those systems.

Simply put, if I run 9.1 on those three busy systems, ZFS will
eventually hang under load (within ten hours to a few days), whereas it
does not under 8.3/8.4. Two of those systems are 4x16 cores, one is
2x16, and two are 2x8 cores.

Multiple simultaneous pbzip2 runs on individual 2-5TB ASCII files
generally cause a hang within 10-20 hours. "Hang" means the system is
alive and on the network but disk I/O has stopped. Run any command and
it will not run (no output, no return to the command prompt); the only
exceptions are statically linked executables on a memory volume. This
includes "reboot," which never actually reboots. The volumes where the
work is performed are typically 12-33TB RAIDz2 volumes.
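In case it helps anyone reproduce this, the triggering load is basically
several pbzip2 jobs run at once, each against its own multi-terabyte
ASCII file on one of the RAIDz2 pools. Roughly like the following (the
paths are made up for illustration; the real files are 2-5TB):

    # kick off a few parallel compression jobs on the same pool
    pbzip2 -v /disk-1/work/run-a.txt &
    pbzip2 -v /disk-1/work/run-b.txt &
    pbzip2 -v /disk-1/work/run-c.txt &
    wait   # somewhere in the 10-20 hour range the box stops doing disk I/O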
For example, here is one of those pools:

root@mc:~ # zpool list disk-1
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
disk-1  16.2T  5.86T  10.4T    36%  1.32x  ONLINE  -

root@mc:~ # zpool status disk-1
  pool: disk-1
 state: ONLINE
  scan: scrub repaired 0 in 21h53m with 0 errors on Mon Apr 29 01:52:55 2013
config:

        NAME        STATE     READ WRITE CKSUM
        disk-1      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
        cache
          da0       ONLINE       0     0     0

errors: No known data errors

> * I say no CPU-related issues because I have run into SATA timeout
> issues with an external SATA enclosure with 4 drives (I know, SATA
> port expanders are evil, but it is my best option here). Sometimes the
> zpool hangs hard, sometimes it just becomes unresponsive for a while.
> My "fix", such as it is, is to tune the ZFS per-vdev queue depth as
> follows:
>
> vfs.zfs.vdev.min_pending="3"
> vfs.zfs.vdev.max_pending="5"
>

I've not tried those. Currently, these are mine:

vfs.zfs.write_limit_override="1G"
vfs.zfs.arc_max="8G"
vfs.zfs.txg.timeout=15
vfs.zfs.cache_flush_disable=1

# Recommended from the net
# April, 2013
vfs.zfs.l2arc_norw=0            # Default is 1
vfs.zfs.l2arc_feed_again=0      # Default is 1
vfs.zfs.l2arc_noprefetch=0      # Default is 0
vfs.zfs.l2arc_feed_min_ms=1000  # Default is 200

> The defaults are 5 and 10 respectively, and when I run with those I
> have the timeout issues, but only under very heavy I/O load. I only
> generate such load when migrating large amounts of data, which
> thankfully does not happen all that often.
>

Two days ago, when the 9.1 system hung, I was able to run a static
procstat, and it inadvertently(?) printed on the console that da0 wasn't
responsive. Unfortunately I didn't have a static camcontrol ready, so I
was unable to query the device. That said, according to the criteria at
https://wiki.freebsd.org/AvgZfsDeadlockDebug that hang isn't a true ZFS
problem, yet hung it was.

I have since (today) updated the firmware of most of the devices in
that system, and it is currently running some tasks. Most of the disks
in that system are Seagate. The un-updated devices are three WD disks
(the RAID1 OS pair and a swap disk), which I haven't updated because I
haven't been able to figure out WD's firmware download process, and an
SSD where the manufacturer indicates the firmware difference is minor,
though I plan to go back and flash it anyway.

If my 4x16 system ever finishes its current work I will update its
devices' firmware too, but it is an 8.4-P system and doesn't give me any
trouble. Another 4x16 system gave me ZFS trouble under 9.1, but since I
downgraded it to 8.4-P it has been rock solid for the past 22 days,
often under heavy load.
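(Aside: to avoid being caught without a static camcontrol again, staging
tools on a memory-backed filesystem before the load starts seems worth
doing. Something along these lines should work; the mount point, size,
and tool list here are just illustrative, and I haven't double-checked
exactly which binaries /rescue carries on 9.1:

    # create a small swap-backed memory filesystem and copy static tools onto it
    mdmfs -s 128m md /static
    cp /rescue/camcontrol /rescue/ps /rescue/sysctl /static/

Then "/static/camcontrol devlist" and friends can at least be attempted
even when the pools are wedged.)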