Date:      Mon, 14 Jul 2008 22:34:46 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        FreeBSD Net <freebsd-net@FreeBSD.org>, Andre Oppermann <andre@FreeBSD.org>, Ingo Flaschberger <if@xip.at>, Paul <paul@gtcomm.net>
Subject:   Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Message-ID:  <20080714212912.D885@besplex.bde.org>
In-Reply-To: <20080707142018.U63144@fledge.watson.org>
References:  <4867420D.7090406@gtcomm.net> <486A7E45.3030902@gtcomm.net> <486A8F24.5010000@gtcomm.net> <486A9A0E.6060308@elischer.org> <486B41D5.3060609@gtcomm.net> <alpine.LFD.1.10.0807021052041.557@filebunker.xip.at> <486B4F11.6040906@gtcomm.net> <alpine.LFD.1.10.0807021155280.557@filebunker.xip.at> <486BC7F5.5070604@gtcomm.net> <20080703160540.W6369@delplex.bde.org> <486C7F93.7010308@gtcomm.net> <20080703195521.O6973@delplex.bde.org> <486D35A0.4000302@gtcomm.net> <alpine.LFD.1.10.0807041106591.19613@filebunker.xip.at> <486DF1A3.9000409@gtcomm.net> <alpine.LFD.1.10.0807041303490.20760@filebunker.xip.at> <486E65E6.3060301@gtcomm.net> <alpine.LFD.1.10.0807052356130.2145@filebunker.xip.at> <4871DB8E.5070903@freebsd.org> <20080707191918.B4703@besplex.bde.org> <4871FB66.1060406@freebsd.org> <20080707213356.G7572@besplex.bde.org> <20080707134036.S63144@fledge.watson.org> <20080707224659.B7844@besplex.bde.org> <20080707142018.U63144@fledge.watson.org>

On Mon, 7 Jul 2008, Robert Watson wrote:

> On Mon, 7 Jul 2008, Bruce Evans wrote:
>
>>> (1) sendto() to a specific address and port on a socket that has been
>>>     bound to INADDR_ANY and a specific port.
>>>
>>> (2) sendto() on a specific address and port on a socket that has been
>>>     bound to a specific IP address (not INADDR_ANY) and a specific port.
>>>
>>> (3) send() on a socket that has been connect()'d to a specific IP address
>>>     and a specific port, and bound to INADDR_ANY and a specific port.
>>>
>>> (4) send() on a socket that has been connect()'d to a specific IP address
>>>     and a specific port, and bound to a specific IP address (not INADDR_ANY)
>>>     and a specific port.
>>>
>>> The last of these should really be quite a bit faster than the first of
>>> these, but I'd be interested in seeing specific measurements for each if
>>> that's possible!
>> 
>> Not sure if I understand networking well enough to set these up quickly. 
>> Does netrate use one of (3) or (4) now?
>
> (3) and (4) are effectively the same thing, I think, since connect(2) should 
> force the selection of a source IP address, but I think it's not a bad idea 
> to confirm that. :-)
>
> The structure of the desired micro-benchmark here is basically:
> ...
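
For reference, the socket setup in the four cases amounts to roughly this
(untested sketch; the addresses and port are made up; netblast takes the
real ones from the command line):

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <string.h>

	/*
	 * do_bind picks a specific local IP (cases (2) and (4)); the local
	 * port is bound in all four cases.  do_connect selects send()
	 * instead of sendto() (cases (3) and (4)).
	 */
	static int
	setup(int do_bind, int do_connect)
	{
		struct sockaddr_in lsin, rsin;
		int s;

		if ((s = socket(PF_INET, SOCK_DGRAM, 0)) < 0)
			return (-1);

		memset(&lsin, 0, sizeof(lsin));
		lsin.sin_len = sizeof(lsin);
		lsin.sin_family = AF_INET;
		lsin.sin_port = htons(7777);
		lsin.sin_addr.s_addr = do_bind ? inet_addr("10.0.0.1") :
		    htonl(INADDR_ANY);
		if (bind(s, (struct sockaddr *)&lsin, sizeof(lsin)) < 0)
			return (-1);

		memset(&rsin, 0, sizeof(rsin));
		rsin.sin_len = sizeof(rsin);
		rsin.sin_family = AF_INET;
		rsin.sin_port = htons(7777);
		rsin.sin_addr.s_addr = inet_addr("10.0.0.2");
		if (do_connect &&
		    connect(s, (struct sockaddr *)&rsin, sizeof(rsin)) < 0)
			return (-1);

		/* sendto(s, ..., &rsin) for (1)/(2); send(s, ...) for (3)/(4) */
		return (s);
	}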

I hacked netblast.c to do this:

% --- /usr/src/tools/tools/netrate/netblast/netblast.c	Fri Dec 16 17:02:44 2005
% +++ netblast.c	Mon Jul 14 21:26:52 2008
% @@ -44,9 +44,11 @@
%  {
% 
% -	fprintf(stderr, "netblast [ip] [port] [payloadsize] [duration]\n");
% -	exit(-1);
% +	fprintf(stderr, "netblast ip port payloadsize duration bind connect\n");
% +	exit(1);
%  }
% 
% +static int	gconnected;
%  static int	global_stop_flag;
% +static struct sockaddr_in *gsin;
% 
%  static void
% @@ -116,6 +118,13 @@
%  			counter++;
%  		}
% -		if (send(s, packet, packet_len, 0) < 0)
% +		if (gconnected && send(s, packet, packet_len, 0) < 0) {
%  			send_errors++;
% +			usleep(1000);
% +		}
% +		if (!gconnected && sendto(s, packet, packet_len, 0,
% +		    (struct sockaddr *)gsin, sizeof(*gsin)) < 0) {
% +			send_errors++;
% +			usleep(1000);
% +		}
%  		send_calls++;
%  	}
% @@ -146,9 +155,10 @@
%  	struct sockaddr_in sin;
%  	char *dummy, *packet;
% -	int s;
% +	int bind_desired, connect_desired, s;
% 
% -	if (argc != 5)
% +	if (argc != 7)
%  		usage();
% 
% +	gsin = &sin;
%  	bzero(&sin, sizeof(sin));
%  	sin.sin_len = sizeof(sin);
% @@ -176,4 +186,7 @@
%  		usage();
% 
% +	bind_desired = (strcmp(argv[5], "b") == 0);
% +	connect_desired = (strcmp(argv[6], "c") == 0);
% +
%  	packet = malloc(payloadsize);
%  	if (packet == NULL) {
% @@ -189,7 +202,19 @@
%  	}
% 
% -	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
% -		perror("connect");
% -		return (-1);
% + 	if (bind_desired) {
% +		struct sockaddr_in osin;
% +
% +		osin = sin;
% +		if (inet_aton("0", &sin.sin_addr) == 0)
% +			perror("inet_aton(0)");
% + 		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% + 			err(-1, "bind");
% +		sin = osin;
% + 	}
% +
% + 	if (connect_desired) {
% + 		if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% + 			err(-1, "connect");
% +		gconnected = 1;
%  	}
%
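
With this, the combinations are selected from the command line, e.g.
(receiver address, payload size and duration made up; anything other than
"b"/"c" in the last two arguments disables the bind/connect):

	netblast 10.0.0.2 7777 18 30 - -	# sendto() on an unbound socket
	netblast 10.0.0.2 7777 18 30 b -	# sendto() after a local bind()
	netblast 10.0.0.2 7777 18 30 - c	# send() after connect()
	netblast 10.0.0.2 7777 18 30 b c	# send() after bind() and connect()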

This also fixes some bugs in usage() (bogus [] around non-optional args and
bogus exit code) and adds a sleep after send failure.  Without the sleep,
netblast distorts the measurements by taking 100% CPU.  This depends on
kernel queues having enough buffering to not run dry during the sleep
time (rounded up to a tick boundary).  I use ifq_maxlen =
DRIVER_TX_RING_CNT + imax(2 * tick / 4, 10000) = 10512 for DRIVER = bge
and HZ = 100.  This is actually wrong now.  The magic 2 is to round up to
a tick boundary and the magic 4 is for bge taking a minimum of 4 usec per
packet on old hardware, but bge actually takes about 1.5 usec on the test
hardware and I'd like it to take 0.66 usec.  The queues rarely run dry in
practice, but running dry just a few times for a few msec each would
explain some anomalies.  Old SGI ttcp uses a select timeout of 18 msec here.
nttcp and netsend use more sophisticated methods that don't work unless HZ
is too large.  It's just impossible for a program to schedule its sleeps
with a fine enough resolution to ensure waking up before the queue runs
dry, unless HZ is too large or the queue is too large.  select() for
writing doesn't work for the queue part of socket i/o.
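
Spelling out the arithmetic behind that sizing (my numbers; assuming the
sleep can span up to 2 ticks = 20000 usec at HZ = 100 and that bge's tx
ring holds DRIVER_TX_RING_CNT = 512 descriptors):

	packets needed to cover one sleep = 2 * tick / (usec per packet)

	4.0  usec/packet (old hardware):  20000 / 4.0  =  5000   (10000 floor is plenty)
	1.5  usec/packet (this bge):      20000 / 1.5  ~ 13300   (10512 can run dry)
	0.66 usec/packet (wanted):        20000 / 0.66 ~ 30300   (far too small)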

Results:
~5.2 sendto (1):  630 kpps   98% CPU  11   cm/p (cache misses/packet (min))
-cur sendto:      590 kpps  100% CPU  10   cm/p (July 8 -current)
             (2):  no significant difference - see below
~5.2 send   (3):  620 kpps   75% CPU   9.5 cm/p
-cur send:        520 kpps   60% CPU   8   cm/p
             (4):  no significant difference - see below

send() has lower CPU overheads as expected.  For some reason, send() gets
lower throughput than sendto().  I think the reason is just that the
queue runs dry due to the lower CPU overhead making it possible for
the userland sender to outrun the hardware -- userland sees more ENOBUFS
and sleeps more often, so it sometimes sleeps too long due to my
out-of-date hack for increasing the queue length.  For some reason, this
affects -current much more than ~5.2 (the bge drivers in each have lots of
modifications which are supposed to be equivalent here); probably for the
same reason.  sendto() still has 5-10% higher overhead in -current than in
~5.2 and runs out of CPU.  It also runs out of CPU under ~5.2 when testing
ttcp.

> If you look at the design of the higher performance UDP applications, they 
> will generally bind a specific IP (perhaps every IP on the host with its own 
> socket), and if they do sustained communication to a specific endpoint they 
> will use connect(2) rather than providing an address for each send(2) system 
> call to the kernel.

I couldn't see any effect from binding.  I'm only testing sending, and it
doesn't seem to be possible to bind to anything except local addresses
(0.0.0.0, the NIC's address, and 127.0.0.1), but these seem to be equivalent
(with no extra work for translation on every packet?) and seem to be used
by default anyway.  In the above, sin.sin_addr has to be set to the
receiver's IP from the command line (else it defaults to a local address),
and the patch temporarily sets it back to 0.0.0.0 so as to use the same
sin for the local bind().
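
One way to confirm which local address the kernel actually picked after the
bind()/connect() (untested fragment; s is the netblast socket):

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <stdio.h>

	static void
	show_local_addr(int s)
	{
		struct sockaddr_in lsin;
		socklen_t len = sizeof(lsin);

		if (getsockname(s, (struct sockaddr *)&lsin, &len) < 0) {
			perror("getsockname");
			return;
		}
		printf("local %s:%u\n", inet_ntoa(lsin.sin_addr),
		    ntohs(lsin.sin_port));
	}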

> udp_output(2) makes the trade-offs there fairly clear: with the most recent 
> rev, the optimal case is one connect(2) has been called, allowing a single 
> inpcb read lock and no global data structure access, vs. an application 
> calling sendto(2) for each system call and the local binding remaining 
> INADDR_ANY.  Middle ground applications, such as named(8) will force a local 
> binding using bind(2), but then still have to pass an address to each 
> sendto(2).  In the future, this case will be further optimized in our code by 
> using a global read lock rather than a global write lock: we have to check 
> for collisions, but we don't actually have to reserve the new 4-tuple for the 
> UDP socket as it's an ephemeral association rather than a connect(2).

The July 8 -current should have this rev.  Note that I'm not testing
SMP or stressing locking, or nontrivial routing tables, or forwarding,
and don't plan to.  UP with a direct connection is hard enough, and short
enough of CPU, to understand and make efficient.  Locking barely
shows up in older tests, only partly because it is mostly inline.

Bruce


