From owner-freebsd-net@FreeBSD.ORG Fri Jan 31 06:18:34 2014
Date: Fri, 31 Jan 2014 01:18:31 -0500 (EST)
Message-Id: <201401310618.s0V6IVJv027167@hergotha.csail.mit.edu>
From: wollman@freebsd.org
To: j.david.lists@gmail.com
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
References: <87942875.478893.1391121843834.JavaMail.root@uoguelph.ca>
List-Id: Networking and TCP/IP with FreeBSD

J David writes:

>The process of TCP segmentation, whether offloaded or not, is
>performed on a single TCP packet.  It operates by reusing that
>packet's header over and over for each segment with slight
>modifications.  Consequently the maximum size that can be offloaded
>is the maximum size that can be segmented: one packet.

This is almost entirely wrong in its description of the non-offload
case.  A segment is a PDU at the transport layer.  In normal
operation, TCP figures out how much it can send, constructs a header,
and copies an mbuf chain referencing one segment's worth of data out
of the socket's transmit buffer.  tcp_output() repeats this process
(possibly using the same mbuf cluster multiple times, if it's larger
than the receiver's or the path's maximum segment size) until it
either runs out of stuff to send, or runs out of transmit window to
send into.

THAT IS WHY TSO IS A WIN: as you describe, the packet headers are
mostly identical, and (if the transmit window allows) it's much
cheaper to build the header and do the DMA setup once, then let the
NIC take over from there, rather than having to DMA a different (but
nearly identical) header for every individual segment.
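To make the shape of that loop concrete, here is a toy userland
sketch (not kernel code; the constants, struct, and variable names
are all invented for illustration): one nearly identical header is
built per MSS-sized chunk, and the loop stops when either the data or
the peer's window runs out.

/*
 * Toy userland sketch of tcp_output()'s non-offload loop.  Each pass
 * takes at most one MSS of data from a pretend socket buffer, stamps
 * a header whose only interesting change is the sequence number, and
 * stops when the data or the peer's receive window is exhausted.
 */
#include <stdint.h>
#include <stdio.h>

#define MSS	1448			/* assumed maximum segment size */

struct toy_hdr {
	uint32_t seq;			/* the field that actually varies */
	uint16_t len;
};

int
main(void)
{
	size_t avail = 65536;		/* pretend NFS queued a 64k write */
	size_t wnd = 48 * 1024;		/* peer's advertised window */
	uint32_t snd_nxt = 1000;	/* next sequence number to send */
	size_t off = 0, sent = 0;
	int nseg = 0;

	while (off < avail && sent < wnd) {
		size_t len = avail - off;
		if (len > MSS)
			len = MSS;
		if (len > wnd - sent)
			len = wnd - sent;

		/* kernel: shallow-copy one segment's worth of clusters,
		 * prepend headers, and hand the result to IP */
		struct toy_hdr th = { .seq = snd_nxt, .len = (uint16_t)len };
		(void)th;

		snd_nxt += (uint32_t)len;
		off += len;
		sent += len;
		nseg++;
	}
	printf("%d nearly identical headers built for %zu bytes;\n",
	    nseg, sent);
	printf("with TSO the host builds one and the NIC does the rest\n");
	return (0);
}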
>NFS is not sending packets to the TCP stack, it is sending stream
>data.  With TCP_NODELAY it should be possible to engineer a one send
>= one packet correlation, but that's true if and only if that send
>is less than the max packet size.

Yes and no.  NFS constructs a chain of mbufs and calls the socket's
sosend() routine.  This ultimately results in a call to tcp_output(),
and in the normal case where there is no data awaiting transmission,
that mbuf chain will be shallow-copied (bumping all the mbuf cluster
reference counts) up to the limit of what the transmit window allows,
and Ethernet, IP, and TCP headers will be prepended (possibly in a
separate mbuf).  The whole mess is then passed on to the hardware for
offload, if it fits.  RPC responses will only get smushed together if
tcp_output() wasn't able to schedule the transmit immediately, and if
the network is working properly, that will only happen if there's
more than one client-side-receive-window's worth of data to be
transmitted.

This shallow-copy behavior, by the way, is why the drivers need
m_defrag() rather than m_collapse() (see the sketch below):
M_WRITABLE is never true for clusters coming out of tcp_output(),
because the reference count will never be less than 2 (one for the
socket buffer and at least one for the interface's transmit queue,
depending on how many segments include some data from the cluster).
But it's also part of why having a "gigantic" cluster (e.g., 128k)
would be a big win for NFS.

-GAWollman
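A minimal sketch of that driver-side pattern, assuming a typical
busdma transmit path; the sc->tx_tag, txb->map, segs/nsegs names and
the surrounding encap routine are placeholders, not any particular
driver's code:

/*
 * If the chain maps to more DMA segments than the hardware can take,
 * m_defrag() copies the data into fresh clusters, which works even
 * though M_WRITABLE is false on the originals; m_collapse() would try
 * to compact in place and cannot, because those clusters are still
 * referenced by the socket buffer (refcount >= 2).
 */
error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txb->map, m_head,
    segs, &nsegs, BUS_DMA_NOWAIT);
if (error == EFBIG) {
	struct mbuf *m;

	m = m_defrag(m_head, M_NOWAIT);
	if (m == NULL) {
		m_freem(m_head);
		return (ENOBUFS);	/* caller drops the packet */
	}
	m_head = m;
	error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txb->map, m_head,
	    segs, &nsegs, BUS_DMA_NOWAIT);
}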