From owner-freebsd-fs@FreeBSD.ORG  Wed Jun  9 15:12:48 2010
Date: Wed, 9 Jun 2010 11:28:52 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
X-X-Sender: rmacklem@muncher.cs.uoguelph.ca
To: Anders Nordby <anders@FreeBSD.org>
In-Reply-To: <20100609122517.GA16231@fupp.net>
Message-ID: <Pine.GSO.4.63.1006091119410.23896@muncher.cs.uoguelph.ca>
References: <20100608083649.GA77452@fupp.net>
	<Pine.GSO.4.63.1006081946040.8742@muncher.cs.uoguelph.ca>
	<20100609122517.GA16231@fupp.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Odd network issues on ZFS based NFS server


On Wed, 9 Jun 2010, Anders Nordby wrote:

>
> Thanks. The only thing that (temporarily) solves this issue so far is
> rebooting, which helps only for a day or so. I have tried different
> NICs, replacing the physical server, replacing cables, changing and
> resetting switch ports. But it did not help, so I think this is a
> software problem. I will try zio_use_uma = 0 I think, and then try to
> limit vfs.zfs.arc_max to 100 MB or so.
>

When you tried a different NIC, was it a different type (i.e. a different
chipset that uses a different device driver)? I suggested that not
because I thought the hardware was broken, but because I thought the
problem might be related to the network interface's device driver, and
switching to a different driver would confirm or rule out that possibility.

> On the ZFS+NFS server while having these issues:
>
> root@unixfile:~# netstat -m
> 1293/4602/5895 mbufs in use (current/cache/total)
> 1109/3619/4728/65536 mbuf clusters in use (current/cache/total/max)
> 257/1023 mbuf+clusters out of packet secondary zone in use (current/cache)
> 0/104/104/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
> 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
> 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
> 2541K/8804K/11345K bytes allocated to network (current/cache/total)
> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 0/0/0 sfbufs in use (current/peak/max)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
>
> Packet loss seen from my workstation:
>
> anders@noname:~$ ping unixfile
> PING unixfile.aftenposten.no (192.168.120.33) 56(84) bytes of data.
> 64 bytes from unixfile.aftenposten.no (192.168.120.33): icmp_seq=1 ttl=63 time=0.230 ms
> 64 bytes from unixfile.aftenposten.no (192.168.120.33): icmp_seq=3 ttl=63 time=0.262 ms
> 64 bytes from unixfile.aftenposten.no (192.168.120.33): icmp_seq=5 ttl=63 time=0.272 ms
> 64 bytes from unixfile.aftenposten.no (192.168.120.33): icmp_seq=6 ttl=63 time=0.203 ms
> 64 bytes from unixfile.aftenposten.no (192.168.120.33): icmp_seq=7 ttl=63 time=0.306 ms
> 64 bytes from unixfile.aftenposten.no (192.168.120.33): icmp_seq=9 ttl=63 time=0.309 ms

Well, it doesn't seem to be mbuf exhaustion, since all of the "denied"
counts are 0. (I don't know what "out of packet secondary zone" means;
I'll have to look at that.) If it doesn't even handle pings reliably, it
seems really hosed. Have you done a "vmstat 5" + "ps axlH" (or similar)
to try to see what it's doing? ("top" and "netstat" might also help.)
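
In case it helps to check this repeatedly while the problem is happening,
here is a rough Python sketch of the two checks made above by eye: which
icmp_seq numbers never got a reply, and whether any "denied" counter in
"netstat -m" is non-zero. The function names are made up for illustration,
and the parsing assumes output shaped like the text quoted above; it is
not a general parser for either tool.

```python
import re


def missing_icmp_seqs(ping_output):
    """Return icmp_seq numbers absent between the first and last reply seen."""
    seen = {int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


def denied_requests(netstat_m_output):
    """Sum the counters on the 'requests for ... denied' lines of netstat -m."""
    total = 0
    for line in netstat_m_output.splitlines():
        # Tolerate mail quoting ("> ") in front of each line.
        line = line.lstrip("> ").strip()
        m = re.match(r"([\d/]+) requests for .* denied", line)
        if m:
            total += sum(int(n) for n in m.group(1).split("/"))
    return total


# Sequence numbers taken from the ping output quoted above.
replies = "icmp_seq=1 icmp_seq=3 icmp_seq=5 icmp_seq=6 icmp_seq=7 icmp_seq=9"
print(missing_icmp_seqs(replies))  # prints [2, 4, 8]
```

Note that this only finds gaps between the first and last reply, so replies
lost at the very start or end of a run are not counted.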

If you can figure out where it's spinning its wheels, that might
at least give us a hint w.r.t. the problem.
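
One way to summarize a saved "ps axlH" listing is to count how many
threads are sitting on each wait channel. The sketch below is only an
illustration: the function name is hypothetical, it assumes the listing
has a header row containing a MWCHAN column, and it assumes no column
before MWCHAN is ever blank, so that splitting on whitespace lines the
fields up with the header.

```python
from collections import Counter


def tally_wait_channels(ps_output, column="MWCHAN"):
    """Count threads per wait channel in saved 'ps axlH'-style output."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return Counter()
    # Locate the column by name rather than by a fixed position.
    idx = lines[0].split().index(column)  # raises ValueError if absent
    counts = Counter()
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) > idx:
            counts[fields[idx]] += 1
    return counts
```

Calling tally_wait_channels(text).most_common(10) on a listing captured
while the server is misbehaving shows the wait channels with the most
threads on them, which may point at where it is stuck.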

Good luck with it, rick