Date:      Fri, 13 Jul 2001 10:11:07 -0400
From:      Leo Bicknell <bicknell@ufp.org>
To:        freebsd-hackers@freebsd.org
Subject:   Network performance roadmap.
Message-ID:  <20010713101107.B9559@ussenterprise.ufp.org>

After talking with a number of people and reading some more papers,
I think I can put together a better road map for what should be
done to increase network performance.  The good news is there are
some immediate bandaids.

In particular, I'd like those who are working on committing network
changes to -current to pay attention here, as I can't commit. :-)

Let's go through some new details:

1) FreeBSD's TCP windows cannot grow large enough to allow for
   optimum performance.  The primary obstacle to raising them is
   that if you do so, the system can run out of MBUFs.  Schemes
   need to be put in place to limit MBUF usage, and to better
   allocate buffers per connection.

2) Windows are currently 16k.  It seems a number of people think
   32k would not cause major issues, and it is in fact in use by
   many other OSes at this time.

There are a few other observations that have been made that are
important.

A) The receive buffers are hardly used.  In fact, data generally
   only sits in a receive buffer for one of two reasons.  First,
   the data has not yet been passed to the application; this amount
   of data is generally very small.  Second, data received out of
   order will sit in the buffer waiting for the missing segment to
   be retransmitted.  It is of course possible that the buffers
   could be completely full in either case, but several research
   papers indicate that receive buffers rarely use much space at all.

B) When the system runs out of MBUFs, really bad things happen.  It
   would be nice to make the system handle MBUF exhaustion more
   gracefully, or avoid it entirely.

C) Many people think TCP_EXTENSIONS="YES" gives them windows > 64k.
   It does, in the sense that it enables the window scale option,
   but it doesn't, in that the socket buffer sizes aren't changed.

From all of this, I propose the following short-term road map:

a - Commit higher socket buffer sizes:

    -current:  64k receive  (based on observation A)
	       32k send     (based on detail 2)

    -stable:   32k receive  (based on detail 2)
	       32k send     (based on detail 2)

    I think this can be done more or less immediately.
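
    For anyone who wants to experiment before anything is committed,
    these map onto runtime sysctl knobs (values in bytes), e.g.:

        # proposed -stable values; new connections pick these up
        sysctl -w net.inet.tcp.recvspace=32768
        sysctl -w net.inet.tcp.sendspace=32768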

b - Allow larger receive windows in some cases.  In -current only,
    if TCP_EXTENSIONS="YES" is configured (turning on the RFC1323
    extensions), change the settings to:

    1M kernel limit  (based on observation C)
    256k receive     (based on observation A, C)
    64k send         (based on observation C)

    Note, 64k send is most likely aggressive given the current MBUF
    problems.  Some later points will address that.  For now, the
    basic assumption is that people configuring TCP_EXTENSIONS are
    clueful people with larger-memory machines who also tune things
    like MAXUSERS up, so they will probably be ok.
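
    Again, for experimentation, the equivalent runtime knobs for
    these settings would be something like:

        # RFC1323 extensions plus the larger proposed limits
        sysctl -w net.inet.tcp.rfc1323=1
        sysctl -w kern.ipc.maxsockbuf=1048576
        sysctl -w net.inet.tcp.recvspace=262144
        sysctl -w net.inet.tcp.sendspace=65536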

c - Prevent MBUF exhaustion.  Today, when you run out of MBUFs, bad
    things start to happen.  It would be nice to prevent that from
    happening, and also to give sysadmins some warning when it is
    about to happen.

    This change sounds easy, but I don't know where in the code to
    start looking.  Basically, there is a bit of code somewhere that
    decides whether a sending TCP process should block or not.  Today
    this code only looks to see if that socket's TCP send buffer is
    full.  What I propose is that it should also check whether less
    than 10% of the MBUFs are free, and if so also block the sender.

    Blocking senders keeps some MBUFs free for receivers (the 10%),
    most likely keeping the system from running out.  What will happen
    is receivers will either read data from the receive buffers, or
    data will "drain" from the send buffers, until enough MBUFs are
    free to unblock the senders.

    I believe a message should be logged when this happens (a la a
    full file system) so admins know they are running low on MBUFs.

    I would think this would be a patch of only a couple of lines to
    the function that decides whether a particular socket should
    block.  Could someone more familiar with the code comment?
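
    To make the idea concrete, here is a minimal userland sketch of
    the proposed check.  The counter functions are made-up stand-ins
    for whatever the allocator actually exposes; in the kernel this
    test would sit next to the existing send-buffer-full check.

        #include <stdio.h>

        /* Stub counters standing in for the allocator's real ones. */
        static unsigned long mbufs_total(void) { return 1024; }
        static unsigned long mbufs_free(void)  { return 64; }

        /* Return 1 if the sending process should block. */
        static int
        sender_should_block(unsigned long sb_space_left)
        {
            if (sb_space_left == 0)                  /* existing rule */
                return (1);
            if (mbufs_free() * 10 < mbufs_total()) { /* proposed 10% rule */
                /* Log a la a full file system so admins get warned. */
                fprintf(stderr, "warning: MBUFs nearly exhausted\n");
                return (1);
            }
            return (0);
        }

        int
        main(void)
        {
            printf("block? %d\n", sender_should_block(8192));
            return (0);
        }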

d - Prevent gross overbuffering on a sender.  In a TCP stream there are
    several interesting variables:

    ---------------------------------------------------------------
    A         B  C  D   e                               F         G

    A = lowest acknowledged segment
    B = highest transmitted segment
    C = A + cwin
    D = A + win
    e = my new variable
    F = A + buffer_in_use
    G = A + max_buffer_size

    Note that the following must always be true:

    A <= B <= min(C, D), B <= F <= G, G - A <= sendspace

    Note, all the capital letter values are readily available, either
    directly tracked in variables, or easily computable from variables
    that are tracked.

    Now, in today's world (for senders), if F < G then we unblock the
    sending process and allow it to put data into the buffer.  This
    means that in general F = G; we always have a full send buffer.
    This is the crux of why we run out of MBUFs when slow clients are
    connected.

    So, I propose a new value, e.  This is the desired buffer limit,
    a position in the stream like the capital-letter values above.
    The first observation is that if the receiver gives us a smaller
    window, there's no reason to buffer much more than that window on
    our side.  That is, e should only be "a little" bigger than D.
    So we need a new constant, SPB - socket process buffer - the
    amount of data we want to buffer in the kernel from a process for
    a socket.  I'll propose 8k as a good starting point.

    This gives us an upper bound for e, D + SPB.

    We also always want to buffer some data.  Even if the window (or
    the other factor I'll talk about next) is 0, we want to buffer
    something.  SPB is a good value here.  We also have to be careful
    not to exceed the hard limits.  This gives us:

    A + SPB <= e <= min(D + SPB, G)

    Now, going back to the code.  When we check whether a process
    should unblock, rather than checking if F < G, we should check if
    F < e, and allow e - F bytes to be read in.  This way, if a
    receiver gives us a window of 16k (which older FreeBSD boxes will
    be doing for quite a while), we buffer 16k + SPB bytes at most.
    Not a bad tradeoff for almost no code!
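
    As a sketch of how little code this is (names are made up, and the
    cwin-based tuning discussed next is ignored; everything is a byte
    offset from A, per the diagram above):

        #include <stdio.h>

        #define SPB 8192UL  /* proposed socket process buffer */

        static unsigned long
        min2(unsigned long x, unsigned long y)
        {
            return (x < y ? x : y);
        }

        /* d_off = D - A (peer's window), g_off = G - A (hard limit),
         * f_off = F - A (bytes currently buffered).  Returns how many
         * more bytes the process may write before blocking. */
        static unsigned long
        bytes_allowed(unsigned long d_off, unsigned long g_off,
                      unsigned long f_off)
        {
            unsigned long e_off = min2(d_off + SPB, g_off);

            if (e_off < SPB)    /* always buffer at least SPB */
                e_off = SPB;
            return (f_off < e_off ? e_off - f_off : 0);
        }

        int
        main(void)
        {
            /* 16k receiver window, 64k hard limit, 4k buffered:
             * we may queue 16k + 8k - 4k = 20k more bytes. */
            printf("%lu\n", bytes_allowed(16384, 65536, 4096));
            return (0);
        }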

    Of course, the drawback here is obvious.  Let's say the receiver
    advertises a large window, say G sized, but is on a slow/congested
    link and can't use that window.  We could be overbuffering again.
    So, we need to look at a second criterion.  We now have a range
    for e; how do we pick within it?

    I'll borrow from PSC.edu's research here.  The buffered amount,
    e - A, should be in the range 2 * cwin to 4 * cwin.  So, every
    time cwin is updated, we look at e - A.  If it's less than
    2 * cwin, we increase it to 2 * cwin (capped at its maximum
    value), and every time it's greater than 4 * cwin, we decrease it
    to 2 * cwin, or its minimum value.

    This is in fact their "autotuning", but without the "fair share"
    component, which so far everyone seems to think is too complicated
    and there should be a better solution.  The good news is this part I
    think is very little code.  We need to track e, so one more
    variable.  I'd venture the code in the block/unblock section is
    probably 4-5 lines, at most, and the code in the cwin update section
    is another 4-5 lines, at most.  If this all became 20 lines of code
    I'd be surprised.
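
    A sketch of the cwin-update half, again with made-up names; e is
    passed and returned as an offset from A, and e_min/e_max are the
    bounds from the previous section:

        /* On each cwin update, nudge the desired buffer target e
         * into the [2*cwin, 4*cwin] band, per PSC's autotuning,
         * then clamp to the hard bounds. */
        unsigned long
        retune_e(unsigned long e, unsigned long cwin,
                 unsigned long e_min, unsigned long e_max)
        {
            if (e < 2 * cwin)
                e = 2 * cwin;       /* grow toward the band */
            else if (e > 4 * cwin)
                e = 2 * cwin;       /* shrink back into the band */

            if (e < e_min)          /* respect the hard bounds */
                e = e_min;
            if (e > e_max)
                e = e_max;
            return (e);
        }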

e - Once we have this better management in place, we can go back to
    new values.  Assuming d is in -current, I'd then like to see:

    kernel max      1M
    sendspace       256k
    recvspace       256k

f - At this point we can look at a "fair share" replacement.  Since
    we have the MBUF warning code from c, we can get some idea of the
    cases where it's needed.  The basic premise is you don't have
    enough MBUFs, and connections need more buffer space, so how do
    you fairly allocate the space that you have?  That's an
    interesting question, but it can wait for some other day. :-)

I would think we could have a, b, and c done by the end of next week,
and d within two weeks, assuming some people familiar with the network
code can help with some pointers.

-- 
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org
