Date: Wed, 29 Jan 2014 22:56:29 -0500 (EST)
From: Rick Macklem
To: FreeBSD Net
Subject: 64K NFS I/O generates a 34-mbuf list for TCP which breaks TSO
List-Id: Networking and TCP/IP with FreeBSD

For some time, I've been seeing reports of NFS-related issues that get
resolved by the user either disabling TSO or reducing the rsize/wsize
to 32K.
I now think I know why this is happening, although the evidence is just
coming in. (I have no hardware/software that does TSO, so I never see
these problems during testing.)

A 64K NFS read reply, readdir reply, or write request results in the
krpc handing the TCP socket an mbuf list with 34 entries via sosend().
Now, I am really rusty w.r.t. TCP, but it looks like this will result
in a TCP/IP header + 34 data mbufs being handed to the network device
driver, if if_hw_tsomax has the default setting of 65535 (the maximum
IP datagram size). At a glance, many drivers use a scatter/gather list
of around 32 elements for transmission. If the mbuf list doesn't fit in
this scatter/gather list (which looks to me like it will be the case),
then the driver calls either m_defrag() or m_collapse() to try to fix
the problem.

This seems like a serious problem to me:
1 - If m_collapse()/m_defrag() fails, the transmit doesn't happen and
    things wedge until a TCP retransmit timeout gets things going
    again. It looks like m_defrag() is less likely to fail, but it
    generates a lot of overhead; m_collapse() is less overhead, but
    seems less likely to succeed. (Since m_defrag() is called with
    M_NOWAIT, it can fail in that extreme case. I'm not sure if it
    will fail otherwise?)

So, how to fix this?
1 - Change NFS to use 4K clusters for these 64K reads/writes, reducing
    the mbuf list from 34 -> 18. Preliminary patches for this are
    being tested.
    --> However, this seems to be more of a work-around than a fix.
2 - As soon as a driver needs to call m_defrag() or m_collapse()
    because the length of the TSO transmit mbuf list is too long,
    reduce if_hw_tsomax by a significant amount to try to get
    tcp_output() to generate shorter mbuf lists. Not great, but at
    least better than calling m_defrag()/m_collapse() over and over
    and over again.
    --> As a starting point, instrumenting the device drivers so that
        counts of # of calls to m_defrag()/m_collapse(), and counts of
        failed calls, would help to confirm how serious this problem
        is.
3 - ??? Any ideas from folks familiar with TSO and these drivers?

rick

ps: Until this gets resolved, please tell anyone with serious NFS
performance/reliability issues to try either disabling TSO or doing
client mounts with "-o rsize=32768,wsize=32768". I'm not sure how many
believe me when I tell them, but at least I now have a theory as to why
it can help a lot.