From owner-freebsd-bugs  Mon Mar 16 08:10:13 1998
Return-Path: <owner-freebsd-bugs@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id IAA10495
          for freebsd-bugs-outgoing; Mon, 16 Mar 1998 08:10:13 -0800 (PST)
          (envelope-from owner-freebsd-bugs@FreeBSD.ORG)
Received: (from gnats@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id IAA10458;
          Mon, 16 Mar 1998 08:10:04 -0800 (PST)
          (envelope-from gnats)
Received: from brookfield.ans.net (brookfield-ef0.brookfield.ans.net [204.148.1.20])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id IAA09221
          for <FreeBSD-gnats-submit@freebsd.org>; Mon, 16 Mar 1998 08:05:19 -0800 (PST)
          (envelope-from curtis@brookfield.ans.net)
Received: (from curtis@localhost)
	by brookfield.ans.net (8.8.5/8.8.5) id LAA02255;
	Mon, 16 Mar 1998 11:05:16 -0500 (EST)
Message-Id: <199803161605.LAA02255@brookfield.ans.net>
Date: Mon, 16 Mar 1998 11:05:16 -0500 (EST)
From: Curtis Villamizar <curtis@brookfield.ans.net>
Reply-To: curtis@brookfield.ans.net
To: FreeBSD-gnats-submit@FreeBSD.ORG
X-Send-Pr-Version: 3.2
Subject: kern/6032: poor TCP performance using FDDI over long delay path
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


>Number:         6032
>Category:       kern
>Synopsis:       poor TCP performance using FDDI over long delay path
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Mon Mar 16 08:10:01 PST 1998
>Last-Modified:
>Originator:     curtis@ans.net
>Organization:
ANS Communications
>Release:        FreeBSD 2.2.5-RELEASE i386
>Environment:

FreeBSD is being used here as a TCP load for performance testing of
network equipment.  The problems encountered here will be seen by high
performance applications run over the Internet and over long delay
paths such as satellite.

The simplest setup is two PCs with DEC PCI FDDI cards and a network
with a long RTT between them.

[Note: this was run on 2.2.5 but the bug report was filled in on a
2.2.1 system in case send-pr reports 2.2.1.]

>Description:

Change the window size to a large value (for example 128KB).  Expect
to get 40-80 Mb/s.  Instead FreeBSD yields about 1 MB/s.  BSDI and
other BSD or *ix flavors yield 40-80 Mb/s as expected.

A big part of the problem is the function tcp_mss in tcp_input.c which
sets the window size back to a small value.

>How-To-Repeat:

Run ttcp or netperf over long delay path with 128KB window.  Source
for ttcp is freely available (source on request if you don't have it).
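
For readers without ttcp at hand, the window request itself can be
sketched as below.  This is a minimal illustration, not code from ttcp
or netperf: the helper names set_tcp_window and get_sockbuf are
hypothetical, and only the standard setsockopt/getsockopt calls with
SO_SNDBUF and SO_RCVBUF are assumed.  Windows above 64 KB also depend
on RFC 1323 window scaling being enabled on both ends.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Request the same send and receive buffer size on a socket.
 * Call this before connect() or listen() so the size is in place
 * when the TCP handshake happens.
 * Returns 0 on success, -1 if either setsockopt() call fails.
 * (Hypothetical helper, for illustration only.)
 */
int
set_tcp_window(int sock, int bytes)
{
	if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	return 0;
}

/*
 * Read back the size the kernel reports for SO_SNDBUF or SO_RCVBUF.
 * Returns the size in bytes, or -1 on error.
 * (Hypothetical helper, for illustration only.)
 */
int
get_sockbuf(int sock, int optname)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(sock, SOL_SOCKET, optname, &val, &len) < 0)
		return -1;
	return val;
}
```

Calling set_tcp_window(s, 128 * 1024) on a fresh TCP socket and then
reading the sizes back after the connection is established is one way
to see whether the kernel kept the requested value or replaced it.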

>Fix:

The email message below sums it up.  I never did look into the reason
why increasing MCLSHIFT would result in an unusable kernel so I'm
sending this in as one bug report.  If I get a chance I'll look at the
MCLSHIFT problem and also try to figure out why setting NMBCLUSTERS to
2048 or above was a problem.  You can do whatever you want with these
patches.  This is simply a performance issue.


Subject: FreeBSD performance problem solved
Date: Thu, 19 Feb 1998 22:23:07 -0500
From: Curtis Villamizar <curtis@brookfield.ans.net>


The FreeBSD performance problem we had run into previously has now
been solved.  It may not be the best way for the general FreeBSD
audience but it is completely solved for our purposes.

The executive summary is:

    - the kernel no longer resets the window size back to a small
      value for no apparent reason (see below)
    - we now can use just under a 1MB window (about the same as BSDI)
    - some kernel tuning (page buffer size, number of clusters) was
      done to make FDDI MTU work slightly faster
    - we get 20 Mb/s with 192 KB window and 70 msec RTT
    - we get 77 Mb/s with 896 KB window and 70 msec RTT (6.7 sec transfer)
    - we get 88 Mb/s with 896 KB window and 70 msec RTT (47 sec transfer)
    - we get 89 Mb/s with 896 KB window and 70 msec RTT (184 sec transfer)
    - these are slightly better than the BSDI figures (I think? Bill?)

The 2GB transfer in just over 3 minutes is getting quite close to FDDI
line rate.
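
The figures above can be checked against the usual window/RTT bound:
with a fixed window, TCP can have at most one window of data in flight
per round trip.  The sketch below is only that arithmetic; the function
names are made up for illustration, and Mb/s is taken as 10^6 bits per
second.

```c
/*
 * Upper bound on throughput, in Mb/s, for a fixed window:
 * at most one window of data per round trip.
 */
double
window_limit_mbps(double window_bytes, double rtt_sec)
{
	return window_bytes * 8.0 / rtt_sec / 1e6;
}

/*
 * Amount of data moved, in decimal gigabytes, at a given rate
 * (Mb/s) sustained for a given number of seconds.
 */
double
transfer_gbytes(double mbps, double seconds)
{
	return mbps * 1e6 * seconds / 8.0 / 1e9;
}
```

With a 70 msec RTT, a 192 KB window gives a bound of about 22.5 Mb/s,
close to the 20 Mb/s measured.  An 896 KB window gives a bound of about
105 Mb/s, which is above the 100 Mb/s FDDI line rate, so the link rather
than the window becomes the limit.  89 Mb/s sustained for 184 seconds
works out to roughly 2.05 GB.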

The gory details are listed below.  I'll be sending separate bug
reports to the FreeBSD team on the tcp_mss issue and the inability to
change MCLSHIFT or increase NMBCLUSTERS to 2048.

Curtis


All the kernel stuff is in /sys which is really a symbolic link to
/usr/src/sys.  Some of the key directories are netinet where all the
ip, udp, and tcp code is, kern where all the socket code is, vm where
the virtual memory code is, and sys where system header files are.

The main culprit was the function tcp_mss in tcp_input.c.  This
function is called when a TCP SYN or SYN ACK arrives.  Its purpose in
life is to adjust the initial MSS and when doing so also adjust the
buffer size if appropriate.  One of the new "features" of tcp_mss is
that it now looks up the route that would be used for the socket
return path and unconditionally resets the send and recv buffer size if
there is a sendspace or recvspace parameter on the route even if the
buffer sizes had been set by a setsockopt.  When I found this in the
code my first reaction was to not touch the source and just explicitly
set the sendspace or recvspace on the route to 10/8.  This effort was
foiled by the fact that tcp_mss seems to have picked up the wrong
route.  I then decided to get rid of the problem for good and just
change the code so it will only increase the buffer sizes according to
the route, but never decrease them.
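
The intended behavior can be shown in isolation with a small user-space
sketch.  It is not the kernel code: the function names are made up, the
conditional compilation around the route lookup is ignored, and so is
the later rounding to a multiple of the MSS and clamping to sb_max.
Here route_pipe stands for the route's rmx_sendpipe or rmx_recvpipe
value (0 if unset) and sb_hiwat for the size already set on the socket.

```c
/*
 * Buffer size choice before the change: a nonzero per-route value
 * always wins, even if it is smaller than what setsockopt() asked for.
 */
unsigned long
bufsize_before(unsigned long route_pipe, unsigned long sb_hiwat)
{
	return route_pipe != 0 ? route_pipe : sb_hiwat;
}

/*
 * Buffer size choice after the change: the route value may raise
 * the buffer size but never lowers it below the socket's own size.
 */
unsigned long
bufsize_after(unsigned long route_pipe, unsigned long sb_hiwat)
{
	unsigned long bufsize = bufsize_before(route_pipe, sb_hiwat);

	if (bufsize < sb_hiwat)
		bufsize = sb_hiwat;
	return bufsize;
}
```

For example, with a hypothetical route value of 16384 and a socket set
to 131072, bufsize_before returns 16384 while bufsize_after returns
131072.  With a route value of 262144 both return 262144.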

The patch is:

*** tcp_input.c.orig	Thu Feb 19 21:56:49 1998
--- tcp_input.c	Thu Feb 19 21:56:14 1998
***************
*** 2075,2080 ****
--- 2075,2082 ----
  	if ((bufsize = rt->rt_rmx.rmx_sendpipe) == 0)
  #endif
  		bufsize = so->so_snd.sb_hiwat;
+ 	if (bufsize < so->so_snd.sb_hiwat)
+ 	  bufsize = so->so_snd.sb_hiwat;
  	if (bufsize < mss)
  		mss = bufsize;
  	else {
***************
*** 2089,2094 ****
--- 2091,2098 ----
  	if ((bufsize = rt->rt_rmx.rmx_recvpipe) == 0)
  #endif
  		bufsize = so->so_rcv.sb_hiwat;
+ 	if (bufsize < so->so_rcv.sb_hiwat)
+ 	  bufsize = so->so_rcv.sb_hiwat;
  	if (bufsize > mss) {
  		bufsize = roundup(bufsize, mss);
  		if (bufsize > sb_max)

Another change is the change to SB_MAX (which can also be changed with
sysctl).

*** sys/socketvar.h.orig	Thu Feb 19 22:00:24 1998
--- sys/socketvar.h	Tue Feb  3 21:30:31 1998
***************
*** 90,96 ****
  		short	sb_flags;	/* flags, see below */
  		short	sb_timeo;	/* timeout for read/write */
  	} so_rcv, so_snd;
! #define	SB_MAX		(256*1024)	/* default for max chars in sockbuf */
  #define	SB_LOCK		0x01		/* lock on data queue */
  #define	SB_WANT		0x02		/* someone is waiting to lock */
  #define	SB_WAIT		0x04		/* someone is waiting for data/space */
--- 90,96 ----
  		short	sb_flags;	/* flags, see below */
  		short	sb_timeo;	/* timeout for read/write */
  	} so_rcv, so_snd;
! #define	SB_MAX		(1024*1024)	/* default for max chars in sockbuf */
  #define	SB_LOCK		0x01		/* lock on data queue */
  #define	SB_WANT		0x02		/* someone is waiting to lock */
  #define	SB_WAIT		0x04		/* someone is waiting for data/space */
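
A rough sketch of why the usable window ends up just under 1MB rather
than exactly SB_MAX.  It assumes the 4.4BSD-style check in sbreserve(),
which rejects a request larger than sb_max * MCLBYTES / (MSIZE +
MCLBYTES), and it assumes MSIZE is 128 and MCLBYTES is 2048.  Neither
assumption has been checked against the 2.2.5 source here, and the
function name is made up for illustration.

```c
/*
 * Largest socket buffer reservation allowed for a given sb_max,
 * assuming the 4.4BSD-style sbreserve() check.  The arithmetic is
 * done in unsigned long long so that sb_max * mclbytes cannot
 * overflow a 32-bit integer.
 */
unsigned long long
sockbuf_limit(unsigned long long sb_max, unsigned long long msize,
    unsigned long long mclbytes)
{
	return sb_max * mclbytes / (msize + mclbytes);
}
```

Under those assumptions an SB_MAX of 256*1024 allows 246723 bytes, which
is enough for a 192 KB window but not for anything much larger, and an
SB_MAX of 1024*1024 allows 986895 bytes, which is enough for an 896 KB
window.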

The change to the page size makes a full MTU packet fit within a page
and allows the kernel code to do less copying.

*** vm/vm_param.h.orig	Thu Feb 19 22:00:38 1998
--- vm/vm_param.h	Tue Feb  3 23:04:46 1998
***************
*** 77,83 ****
   *	The machine independent pages are refered to as PAGES.  A page
   *	is some number of hardware pages, depending on the target machine.
   */
! #define DEFAULT_PAGE_SIZE	4096
  
  #if 0
  
--- 77,83 ----
   *	The machine independent pages are refered to as PAGES.  A page
   *	is some number of hardware pages, depending on the target machine.
   */
! #define DEFAULT_PAGE_SIZE	8192
  
  #if 0
  
One other thing that needs to be done is changing the total number of
mbuf clusters allocated to the kernel.  This can be done in the config
file.  Neither BSDI nor FreeBSD would take a very large number for no
apparent reason.  I added the following to the config file for the
testnet kernel (i386/conf/testnet-pc).

  options NMBCLUSTERS=1024

We could increase this to something over 1024.  At the POC lab it
would not take 2048.  This is sort of odd since that would have only been
4 MB dedicated to clusters on a 64 MB machine.  This could be a
magical power of two boundary for some other reason that I wasn't able
to locate in the source code.  I was never successful in increasing
the cluster size from 2048 to 8192 (increase MCLSHIFT from 11 to 13).
Again, there are dependencies on the relative size of some things in
the kernel that aren't documented (and might be regarded as bugs).
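
The memory arithmetic behind these numbers is sketched below.  The
function names are made up for illustration.  The 4352-byte FDDI MTU
used in the example afterwards is an assumption taken from the usual IP
over FDDI value, not something stated in this report.

```c
/* Size in bytes of one mbuf cluster for a given MCLSHIFT. */
unsigned long
cluster_bytes(unsigned int mclshift)
{
	return 1UL << mclshift;
}

/* Total memory, in bytes, taken by the whole cluster pool. */
unsigned long
cluster_pool_bytes(unsigned long nmbclusters, unsigned int mclshift)
{
	return nmbclusters * cluster_bytes(mclshift);
}

/* Clusters needed to hold one packet of pkt_bytes, rounding up. */
unsigned long
clusters_per_packet(unsigned long pkt_bytes, unsigned int mclshift)
{
	unsigned long cb = cluster_bytes(mclshift);

	return (pkt_bytes + cb - 1) / cb;
}
```

With MCLSHIFT at 11 a cluster is 2048 bytes, so NMBCLUSTERS=1024 is 2 MB
and NMBCLUSTERS=2048 is the 4 MB mentioned above.  A 4352-byte packet
needs three 2048-byte clusters but would fit in a single 8192-byte
cluster with MCLSHIFT at 13.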

Increasing NMBCLUSTERS to 2048 or more or increasing MCLSHIFT from 11
to 13 will have to be exercises for a later date.  These are tuning
beyond what we really need.

Fooling with these latter optimizations gave us unusable kernels in the
POC lab so I didn't want to play with this unless I was within walking
distance of the reset button and had a console and keyboard.
>Audit-Trail:
>Unformatted:
