Date:      Thu, 26 Feb 1998 23:26:20 +0100
From:      sthaug@nethelp.no
To:        mike@smith.net.au
Cc:        hackers@FreeBSD.ORG
Subject:   Re: "Best" Fast Ethernet Card 
Message-ID:  <27484.888531980@verdi.nethelp.no>
In-Reply-To: Your message of "Wed, 25 Feb 1998 18:44:42 -0800"
References:  <199802260244.SAA21962@dingo.cdrom.com>

> > One *great* bonus is it will do IP, TCP and UDP checksums automagically 
> > in hardware!
> 
> Oh great.  This card was designed *explicitly* for Windows systems, 
> where they think it's funny for the network adapter driver to know 
> enough about the protocol layer to manage junk like this.

Probably not. More likely it was simply meant to give lower CPU usage,
given the right modifications to the TCP/IP stack. If you check the new
Gigabit Ethernet cards that are becoming available, you'll find *most*
of them will do IP checksum on-chip.

I've included below a recent Usenet article by Craig Partridge which
explains some of the things that can be done to speed up BSD TCP/IP.
You'll note that he explicitly mentions hardware checksums.

Steinar Haug, Nethelp consulting, sthaug@nethelp.no
----------------------------------------------------------------------
From: craigp@world.std.com (Craig Partridge)
Subject: Re: BSD TCP/IP stack code; performance improvement
Message-ID: <ELLvF4.2F9@world.std.com>
Date: Mon, 22 Dec 1997 19:28:15 GMT

chuckbo@garnet.vnd.tek.com (Chuck Bolz) writes:

>I'm getting ready to "tune" a TCP/IP stack based on 4.3BSD with 
>numerous 4.4BSD enhancements.  I've been testing an echo server
>at 100 Mbps, and preliminary profiling indicates the following
>breakdown: 50% of CPU time in socket code, 40% in TCP/IP code,
>and the remainder in the driver/interrupt stack.  This is a lot
>of code to analyze!

This message gave me an excuse to sit down and write up a little note about
known improvements to TCP/UDP/IP performance that have not yet worked their
way into the standard 4.3/4.4 BSD sources.  It takes the form of a list.

Comments on other known improvements are appreciated -- this list is off
the top of my head and could use enhancement.

Some of these improvements are freely available (for instance, Steve Pink
and I have the sosend(), soreceive() and combined copy/cksum stuff for
x86 processors and ought to get them to the FreeBSD and NetBSD folks).

Craig

Improvement: Replace sosend()
Performance Benefit: 5% (see Pink&Partridge 1994) + enables other improvements

    Sosend() is this horrendously complex bit of code that tries to figure
    out how the lower layer wants its data laid out and then tries to put
    the data being sent in that form.

    In almost every case, the lower layer protocol could do the job faster
    and more simply (faster because it knows its requirements, more simply
    because it doesn't have to test for a whole bunch of cases, and thus
    the code is more compact and has fewer branches).

    Done wrong, this change requires rewriting the send code for all
    protocols.  Done simply, you just add a pr_sosend entry in the
    protosw structure and set it to sosend() unless there's a
    protocol-specific routine.

    NOTE: This change is a pre-requisite for some other performance
    improvements (such as combined checksum/copy) because sosend() is
    where data is copied from user space into the kernel.
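
    The pr_sosend idea above can be sketched roughly as follows.  This is
    a minimal user-space illustration, not the actual BSD code: the
    struct layouts, protosw_init(), socket_send(), udp_sosend_sketch()
    and the SEND_PATH_* return values are all hypothetical names made up
    for the example.

```c
#include <stddef.h>

/* Hypothetical, heavily trimmed stand-ins for the kernel structures. */
struct socket;
typedef int (*sosend_fn)(struct socket *so, const void *buf, size_t len);

struct protosw_sketch {
    const char *pr_name;
    sosend_fn   pr_sosend;   /* NULL until initialised: "no special routine" */
};

struct socket {
    const struct protosw_sketch *so_proto;
};

/* Return values only identify which routine ran (illustration only). */
enum { SEND_PATH_GENERIC = 1, SEND_PATH_PROTOCOL = 2 };

/* Stands in for the existing general-purpose sosend(). */
static int generic_sosend(struct socket *so, const void *buf, size_t len)
{
    (void)so; (void)buf; (void)len;
    return SEND_PATH_GENERIC;
}

/* Stands in for a protocol-specific send routine (assumed, for UDP). */
static int udp_sosend_sketch(struct socket *so, const void *buf, size_t len)
{
    (void)so; (void)buf; (void)len;
    return SEND_PATH_PROTOCOL;
}

/* Protocols without their own routine fall back to the generic one. */
static void protosw_init(struct protosw_sketch *pr)
{
    if (pr->pr_sosend == NULL)
        pr->pr_sosend = generic_sosend;
}

/* The socket layer dispatches through the protocol switch entry. */
static int socket_send(struct socket *so, const void *buf, size_t len)
{
    return so->so_proto->pr_sosend(so, buf, len);
}
```

    With this shape, only protocols that benefit need to supply their own
    routine; everything else keeps using the generic path unchanged.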

Improvement: Replace soreceive()
Performance Benefit: Minor (< 1%) but enables benefits below

    You can simplify soreceive() very slightly by making it
    protocol-specific like sosend().  More important, you enable a bunch
    of improvements in memory handling.

Improvement: Reduce data copies
Performance Benefit: Large (10%-25% -- results vary; see Partridge&Pink 94)

    Currently TCP touches its data 3 times, UDP 2 times, on transmission,
    and similar numbers on receipt.  In both cases, the count should be 1
    (or 0, with hardware assist).

    There are two necessary steps here, both easy.

    The first is to create a kernel copy routine (typically a version
    of uiomove() and copyin()/copyout()) that computes the Internet
    checksum of the data being copied while doing the copy.
    Then use this routine in the protocol-specific sosend() and soreceive()
    to move data in and out of the kernel.  This change reduces UDP to one
    copy and TCP to two copies.

    The second, which reduces TCP to one copy, is to make sure the device
    driver doesn't delete the TCP data when a segment is transmitted, so
    you can point to the same data when retransmitting.

    To get to zero copies, you need hardware checksumming (done when DMAing
    to the interface).

    NOTE: Many of these benefits can also be achieved using Copy-On-Write --
    you mark application buffers COW and then don't have to copy them.  You
    still, however, need to checksum them, so unless there's hardware
    checksum support, you still scan the data once.
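
    A combined copy/checksum routine of the kind described above might
    look roughly like this.  It is a portable user-space sketch under
    stated assumptions, not the tuned kernel routine: copy_and_sum() is a
    hypothetical name, it works on plain memory rather than through
    uiomove()/copyin(), and it returns the folded 16-bit one's-complement
    sum (not yet inverted) so the caller can combine it with the header
    sums.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst and, in the same pass, accumulate the
 * Internet (one's-complement) sum of the data as big-endian 16-bit words.
 * Returns the folded 16-bit sum; the caller inverts it after adding in
 * whatever header fields the protocol requires.
 */
static uint16_t copy_and_sum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;       /* wide accumulator, folded once at the end */

    while (len >= 2) {
        d[0] = s[0];
        d[1] = s[1];
        sum += ((uint32_t)s[0] << 8) | s[1];
        s += 2;
        d += 2;
        len -= 2;
    }
    if (len == 1) {
        /* A trailing odd byte is padded with a zero low byte. */
        d[0] = s[0];
        sum += (uint32_t)s[0] << 8;
    }

    /* Fold the carries back in until the sum fits in 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

    The point of the sketch is only that each byte is loaded once and
    used for both the store and the sum; a real version would work on
    wider words and be written per processor.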

Improvement: Delete IP header checksum call to in_cksum()
Performance Benefit: 2% to 8% (depends on packet size and processor - P&P 94)

    The IP output code calls in_cksum() to compute the IP header
    checksum.  Since the header checksum requires only 14 instructions
    (without any conditionals) to compute, this is silly (you'll burn
    several times 14 instructions calling in_cksum(), plus harm code
    locality).  Better to do the checksum inline in ip_output().  Ditto on
    input in ipintr().
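
    As an illustration of how little work the header checksum is, here is
    a straight-line sketch for the common 20-byte header (no IP options).
    ip_hdr_cksum20() and IP_WORD are hypothetical names, and this
    portable C makes no claim to match the 14-instruction count, which
    depends on the processor and on summing wider words.

```c
#include <stdint.h>

/* Big-endian 16-bit word i of the header bytes h. */
#define IP_WORD(h, i) \
    (((uint32_t)(h)[2 * (i)] << 8) | (uint32_t)(h)[2 * (i) + 1])

/*
 * Checksum of a 20-byte IP header with no options.  The caller zeroes
 * the checksum field (bytes 10-11) before computing a new checksum;
 * run over a header whose checksum is already filled in, the result is
 * 0 if the header is intact.  No loops and no conditionals.
 */
static uint16_t ip_hdr_cksum20(const uint8_t *h)
{
    uint32_t sum = IP_WORD(h, 0) + IP_WORD(h, 1) + IP_WORD(h, 2)
                 + IP_WORD(h, 3) + IP_WORD(h, 4) + IP_WORD(h, 5)
                 + IP_WORD(h, 6) + IP_WORD(h, 7) + IP_WORD(h, 8)
                 + IP_WORD(h, 9);

    /* Ten 16-bit words cannot overflow 32 bits; two folds suffice. */
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)(~sum & 0xffff);
}
```

    Headers that carry options would still need the general routine, so
    a caller would check the header length first.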

Improvement: Delete IP interrupt
Performance Benefit: never measured, estimated to be 20%+

    On the inbound side, the networking code goes through two software
    interrupts, one for IP processing and one for socket processing.

    Given the high cost of doing the interrupt, the IP processing interrupt
    should go away -- IP and partial TCP processing should just be done
    at board interrupt level, then a single interrupt to the socket layer
    should be made to complete TCP processing.  Van Jacobson has done
    preliminary work here but has never gotten it to the point of
    distribution.

Improvement: Get A Better Compiler
Performance Benefit: 10% plus

    There's evidence that compilers that can relocate code segments and
    adjust branches based on actual profiles (so-called profile-based
    optimization) can easily give 10% performance improvements.

    Various folks have also done fancier reworking of binary layouts by
    hand and gotten even better results.  (Work at Arizona and UC, I
    believe.)

Improvement: Fix PCB lookup
Performance Benefit: 5% or more

    Two issues here.  First, the PCB caches don't work well, especially for
    UDP.

    Second, in_pcblookup() is a linear search -- it should be a hash table
    (see McKenney's paper in SIGCOMM '90).
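
    A hashed lookup in place of the linear search could be sketched along
    these lines.  This is a generic chained hash table keyed on the
    connection 4-tuple, not necessarily the scheme from the cited paper;
    every name here (pcb_sketch, pcb_table, pcb_hash() and so on) and the
    table size are hypothetical, and wildcard matching for listening
    sockets is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define PCB_HASH_SIZE 64    /* power of two; arbitrary for this sketch */

/* Trimmed, hypothetical stand-in for a protocol control block. */
struct pcb_sketch {
    uint32_t laddr, faddr;          /* local and foreign addresses */
    uint16_t lport, fport;          /* local and foreign ports */
    struct pcb_sketch *hash_next;   /* chain within one bucket */
};

struct pcb_table {
    struct pcb_sketch *bucket[PCB_HASH_SIZE];
};

/* Mix the 4-tuple down to a bucket index. */
static unsigned pcb_hash(uint32_t laddr, uint16_t lport,
                         uint32_t faddr, uint16_t fport)
{
    uint32_t h = laddr ^ faddr ^ (((uint32_t)lport << 16) | fport);

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (PCB_HASH_SIZE - 1);
}

/* Link a PCB at the head of its bucket's chain. */
static void pcb_insert(struct pcb_table *t, struct pcb_sketch *p)
{
    unsigned i = pcb_hash(p->laddr, p->lport, p->faddr, p->fport);

    p->hash_next = t->bucket[i];
    t->bucket[i] = p;
}

/* Exact-match lookup: walks one short chain instead of every PCB. */
static struct pcb_sketch *pcb_lookup(const struct pcb_table *t,
                                     uint32_t laddr, uint16_t lport,
                                     uint32_t faddr, uint16_t fport)
{
    struct pcb_sketch *p = t->bucket[pcb_hash(laddr, lport, faddr, fport)];

    for (; p != NULL; p = p->hash_next) {
        if (p->laddr == laddr && p->lport == lport &&
            p->faddr == faddr && p->fport == fport)
            return p;
    }
    return NULL;
}
```

    A real version would fall back to a search for wildcard (listening)
    PCBs when the exact match fails.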



