From: Sriram Gorti <gsriram@gmail.com>
To: freebsd-net@freebsd.org
Date: Fri, 1 Oct 2010 15:31:29 +0530
Subject: Question on TCP reassembly counter

Hi,

The following is an observation from testing our XLR/XLS network driver with 16 concurrent instances of netperf on FreeBSD-CURRENT. Based on this observation, I have a question on which I hope to get some understanding here.

When running 16 concurrent netperf instances (each for about 20 seconds), we found that after some number of runs performance degraded badly (almost by a factor of 5), and all subsequent runs stayed that way. We started debugging this from the TCP side, as other driver tests were doing fine for comparably long durations on the same board and software. netstat indicated the following:

$ netstat -s -f inet -p tcp | grep discarded
        0 discarded for bad checksums
        0 discarded for bad header offset fields
        0 discarded because packet too short
        7318 discarded due to memory problems

We then traced the "discarded due to memory problems" to the following counter:

$ sysctl -a net.inet.tcp.reass
net.inet.tcp.reass.overflows: 7318
net.inet.tcp.reass.maxqlen: 48
net.inet.tcp.reass.cursegments: 1594   <--- corresponds to the V_tcp_reass_qsize variable
net.inet.tcp.reass.maxsegments: 1600

Our guess for why reassembly is needed at all (in this low-packet-loss test setup) was the lack of per-flow classification in the driver, which causes it to spray incoming packets across the 16 h/w cpus instead of sending all packets of a flow to the same cpu. While we work on addressing this driver limitation, we debugged further to see how/why V_tcp_reass_qsize grew (expecting that the out-of-order segment count should have dropped to zero at the end of each run).
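(For illustration of what "per-flow classification" means here: hash the TCP/IP 4-tuple of each incoming segment so every packet of a connection is steered to the same CPU/queue and stays in order. A minimal software sketch follows; all names are hypothetical and this is not taken from the XLR/XLS driver, which would normally use the NIC's RSS/Toeplitz hash instead.)

#include <stdint.h>

/*
 * Illustrative only: pick a receive CPU from the TCP/IP 4-tuple so that all
 * segments of one flow are processed on the same CPU.  Hypothetical names;
 * a production driver would use the hardware RSS hash rather than this.
 */
static inline uint32_t
flow_to_cpu(uint32_t src_ip, uint32_t dst_ip, uint16_t src_port,
    uint16_t dst_port, uint32_t ncpus)
{
        uint32_t h;

        h  = src_ip ^ dst_ip;
        h ^= ((uint32_t)src_port << 16) | dst_port;
        h ^= h >> 16;           /* fold high bits into the low bits */
        h *= 0x9e3779b1;        /* arbitrary odd constant to spread bits */
        h ^= h >> 13;

        return (h % ncpus);     /* same flow -> same CPU, every time */
}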
It was seen that this counter was actually growing from the initial runs onward, but the performance degradation only showed up once it got close to maxsegments. We then also looked at vmstat to see how many of the reassembly segments were being lost, but no segments were lost. We could not reconcile "no lost segments" with "growth of this counter across test runs".

$ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
net.inet.tcp.reass.overflows: 0
net.inet.tcp.reass.maxqlen: 48
net.inet.tcp.reass.cursegments: 147
net.inet.tcp.reass.maxsegments: 1600
ITEM                     SIZE     LIMIT      USED      FREE       REQ  FAIL  SLEEP
mbuf_packet:              256,        0,     4096,     3200,  5653833,    0,     0
mbuf:                     256,        0,        1,     2048,  4766910,    0,     0
mbuf_cluster:            2048,    25600,     7296,        6,     7297,    0,     0
mbuf_jumbo_page:         4096,    12800,        0,        0,        0,    0,     0
mbuf_jumbo_9k:           9216,     6400,        0,        0,        0,    0,     0
mbuf_jumbo_16k:         16384,     3200,        0,        0,        0,    0,     0
mbuf_ext_refcnt:            4,        0,        0,        0,        0,    0,     0
tcpreass:                  20,     1690,        0,      845,  1757074,    0,     0

In view of these observations, my question is: is it possible for the V_tcp_reass_qsize variable to be updated unsafely on SMP? (The particular flavor of XLS used in the test had 4 cores with 4 h/w threads per core.) I see that the tcp_reass function assumes some lock is held, but I am not sure whether it is the per-socket lock or the global TCP lock.

Any inputs on what I missed are most welcome.

Thanks,

Sriram Gorti
Netlogic Microsystems
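(To make the SMP question concrete, here is a minimal sketch, with hypothetical names and not the actual tcp_reass() code, of how a plain counter update can lose increments when two CPUs race, and how FreeBSD's atomic_add_int(9) avoids that:)

#include <sys/types.h>
#include <machine/atomic.h>     /* atomic_add_int() and friends */

static u_int reass_qsize;       /* stand-in for V_tcp_reass_qsize */

/*
 * Not SMP-safe unless every caller holds one common lock: the increment is
 * a load/add/store sequence, so two CPUs updating concurrently can read the
 * same old value and one update is lost, letting the counter drift away
 * from the real number of queued segments over many runs.
 */
static void
qsize_inc_racy(void)
{
        reass_qsize++;
}

/*
 * SMP-safe regardless of which locks the callers hold: the add is performed
 * as a single atomic read-modify-write on the shared memory location.
 */
static void
qsize_inc_atomic(void)
{
        atomic_add_int(&reass_qsize, 1);
}

(Whether such concurrent updates can actually happen in tcp_reass() depends on whether its callers serialize on a common lock; a per-socket lock would not be common across connections, which is exactly the question raised above.)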