Date:      Mon, 30 Oct 2000 20:47:05 +0100
From:      Andre Oppermann <oppermann@telehouse.ch>
To:        Alfred Perlstein <bright@wintelcom.net>
Cc:        Bosko Milekic <bmilekic@dsuper.net>, freebsd-net@FreeBSD.ORG
Subject:   Re: MP: per-CPU mbuf allocation lists
Message-ID:  <39FDD039.594CED43@telehouse.ch>
References:  <Pine.BSF.4.21.0010301256580.30271-100000@jehovah.technokratis.com> <20001030104457.E22110@fw.wintelcom.net>

I'd like to throw my 2c in here...

First of all, from a tuner's and server administrator's perspective this
all sounds really complex to me. How am I going to tune these things?
What kind of statistics and data do I have to base my tuning on? And so on.

That being said, in my past experience I've learned that sticking to
the KISS principle pays off far more than micro- or hyper-optimizing.

OK, now to the constructive part of my email...

I think the idea of per-CPU mbuf fast lists is very smart, but in the
proposal as it stands I see some shortcomings, and a few points remain
unclear to me.

While reading it I got the feeling that this watermark system would
work really well for the average lightly loaded case, but would fall
short in the highly loaded, high-traffic server case. Let me
explain how I got this feeling:

Let's assume there is an Apache webserver running on the MP machine.
Now this Apache process tends to stick to one CPU (does it? it would
make sense for cache locality). This process generates a ton of
traffic and needs thousands of mbufs. If I keep the watermark
defaults, this is sub-optimal, because CPU1 needs thousands of
mbufs whereas CPU2 is not using many. If I tune the watermarks
to the level CPU1 requires, then I'm wasting an awful lot of
mbufs on the other CPU(s)...

The proposal: let's have the minimum watermark set on a system-wide
level (sysctl), but base the high watermark per CPU, with a sliding
window calculating the average use of mbufs over a certain period of
time (e.g. 10 sec). The initial high watermark would either be min*2
or be set at system initialization to some initial (base) value.
This automagically takes care of process migration between the CPUs.
The system would also be self-adjusting and self-tuning in the common
edge cases, without admin intervention.
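The sliding-window scheme above could be sketched roughly like this. All of the names, the sampling rate, and the min*2 floor are illustrative assumptions, not existing kernel interfaces; the window holds one usage sample per second:

```c
#include <assert.h>
#include <string.h>

#define MBUF_WIN 10              /* 10-second sliding window */

/* Hypothetical per-CPU usage statistics. */
struct cpu_mbuf_stats {
	int samples[MBUF_WIN];   /* mbufs in use, sampled once a second */
	int head;                /* next slot to overwrite */
	int count;               /* valid samples collected so far */
};

static int mbuf_low_watermark = 64;  /* system-wide minimum (sysctl) */

/* Record one usage sample for this CPU. */
static void
mbuf_stats_sample(struct cpu_mbuf_stats *st, int in_use)
{
	st->samples[st->head] = in_use;
	st->head = (st->head + 1) % MBUF_WIN;
	if (st->count < MBUF_WIN)
		st->count++;
}

/* Per-CPU high watermark: sliding-window average, floored at min*2. */
static int
mbuf_high_watermark(const struct cpu_mbuf_stats *st)
{
	int i, sum = 0, avg;

	if (st->count == 0)                   /* initial (base) value */
		return (mbuf_low_watermark * 2);
	for (i = 0; i < st->count; i++)
		sum += st->samples[i];
	avg = sum / st->count;
	return (avg > mbuf_low_watermark * 2 ? avg : mbuf_low_watermark * 2);
}
```

A CPU that keeps thousands of mbufs in flight would thus grow its own high watermark, while an idle CPU's watermark decays back toward the floor within the window.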

Please don't add too much complexity or too many types and sorts of free
lists. Remember what happened to the VM subsystem, how Matt Dillon had
to axe away the complexity that had been added for far too special
situations, and how much that gained us in terms of performance.

-- 
Andre


Alfred Perlstein wrote:
> 
> * Bosko Milekic <bmilekic@dsuper.net> [001030 10:16] wrote:
> >
> >   [cross-posted to freebsd-arch and freebsd-net, please continue
> >   discussion on freebsd-net]
> >
> >   Hello,
> >
> >       I recently wrote an initial "scratch pad" design for per-CPU mbuf
> >   lists (in the MP case). The design consists simply of introducing
> >   these "fast" lists for each CPU and populating them with mbufs on bootup.
> >   Allocations from these lists would not need to be protected with a mutex
> >   as each CPU has its own. The general mmbfree list remains, and remains
> >   protected with a mutex, in case the per-CPU list is empty.
> >       My initial idea was to leave freeing to the general list, and have a
> >   kproc "daemon" periodically populate the "fast" lists. This would have of
> >   course involved the addition of a mutex for each "fast" list as well, in
> >   order to ensure synchronization with the kproc. However, in the great majority of
> >   cases when the kproc would be sleeping, the acquiring of the mutex for
> >   the fast list would be very cheap, as waiting for it would never be an
> >   issue.
> >       Yesterday, Alfred pointed me to the HOARD web page and made several
> >   suggestions... all worthy of my attention.
> >       The changes I have decided to make to the design will make the system
> >   work as follows:
> >
> >   - "Fast" list; a per-CPU mbuf list. They contain "w" (for "watermark")
> >     number of mbufs, typically... more on this below.
> >
> >   - The general (already existing) mmbfree list; mutex protected, global
> >     list, in case the fast list is empty for the given CPU.
> >
> >   - Allocations; all done from "fast" lists. All are very fast, in the
> >     general case. If no mbufs are available, the general mmbfree list's
> >     lock is acquired, and an mbuf is made from there. If no mbuf is
> >     available, even from the general list, we let go of the lock and
> >     allocate a page from mb_map and drop the mbufs onto our fast list, from
> >     which we grab the one we need. If mb_map is starved, then:
> >       (a) if M_NOWAIT, return ENOBUFS
> >       (b) go to sleep, if timeout, return ENOBUFS
> >       (c) not timeout, so got a wakeup, the wakeup was accompanied with the
> >       acquiring of the mmbfree general list. Since we were sleeping, we are
> >       ensured that there is an mbuf waiting for us on the general mmbfree
> >       list, so we grab it and drop the lock (see the "freeing" section on
> >       why we know there's one on mmbfree).
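The allocation order described above (fast list, then the locked general list, then mb_map) might look roughly like this toy sketch. The structures, pool size, and helper names are made up for illustration, the mutex operations are only indicated in comments, and the sleep/timeout leg for the waiting case is elided:

```c
#include <assert.h>
#include <stddef.h>

#define M_NOWAIT 1

/* Toy stand-ins for the real structures; sizes are illustrative. */
struct mbuf { struct mbuf *m_next; };

static struct mbuf pool[8];       /* pretend mb_map backing store */
static int pool_used;

static struct mbuf *fast_head;    /* this CPU's fast list: no mutex */
static struct mbuf *mmbfree_head; /* general list: mutex-protected */

/* Carve "a page" (here: 4 mbufs) from mb_map onto the fast list. */
static int
mb_map_fill_fast(void)
{
	int i;

	for (i = 0; i < 4 && pool_used < 8; i++) {
		pool[pool_used].m_next = fast_head;
		fast_head = &pool[pool_used++];
	}
	return (i > 0);           /* 0 => mb_map starved */
}

/*
 * Fast list first, then the general list, then mb_map; an M_NOWAIT
 * caller gets NULL (ENOBUFS) instead of sleeping.
 */
static struct mbuf *
mbuf_alloc(int how)
{
	struct mbuf *m;

	(void)how;                /* only the M_NOWAIT path is modeled */
	if ((m = fast_head) != NULL) {    /* common case: lock-free */
		fast_head = m->m_next;
		return (m);
	}
	/* mtx_lock(&mmbfree_mtx) would go here in the real thing */
	if ((m = mmbfree_head) != NULL) {
		mmbfree_head = m->m_next;
		return (m);       /* ...and mtx_unlock() here */
	}
	/* mtx_unlock(), then try to grow via mb_map */
	if (mb_map_fill_fast()) {
		m = fast_head;    /* refilled: grab the one we need */
		fast_head = m->m_next;
		return (m);
	}
	return (NULL);    /* starved: ENOBUFS if M_NOWAIT, else sleep */
}
```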
> >
> >    - Freeing; First, if someone is sleeping, we grab the mmbfree global
> >      list mutex and drop the mbuf there, and then issue a wakeup. If nobody
> >      is sleeping, then we proceed as follows:
> 
> I like this idea here, you could do that as a general way of noting that
> there's a shortage and free to the global pool even if you're below the
> low watermark that I discuss below...
> 
> >       (a) if our fast list does not have over "w" mbufs, put the mbuf on
> >       our fast list and then we're done
> >       (b) since our fast list already has "w" mbufs, acquire the mmbfree
> >       mutex and drop the mbuf there.
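The freeing rules (sleepers first, then the fast list up to "w", then the general list) could be sketched as follows. Again, all names are hypothetical stand-ins, the watermark is tiny for illustration, and the real locking and wakeup calls are only marked in comments:

```c
#include <assert.h>
#include <stddef.h>

struct mbuf { struct mbuf *m_next; };

static struct mbuf *fast_head;    /* this CPU's fast list */
static int fast_cnt;
static int fast_watermark = 2;    /* "w", tiny here for illustration */

static struct mbuf *mmbfree_head; /* general list: mutex-protected */
static int mbuf_sleepers;         /* allocators blocked on mmbfree */

static void
mbuf_free(struct mbuf *m)
{
	/*
	 * A sleeper must find an mbuf on mmbfree after its wakeup,
	 * so sleepers are served first, regardless of watermarks.
	 */
	if (mbuf_sleepers > 0) {
		/* mtx_lock(&mmbfree_mtx); ... wakeup(); mtx_unlock(); */
		m->m_next = mmbfree_head;
		mmbfree_head = m;
		return;
	}
	if (fast_cnt < fast_watermark) {  /* (a) keep it local */
		m->m_next = fast_head;
		fast_head = m;
		fast_cnt++;
		return;
	}
	/* (b) fast list already holds "w": overflow to the general list */
	/* mtx_lock/mtx_unlock around this in the real thing */
	m->m_next = mmbfree_head;
	mmbfree_head = m;
}
```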
> 
> You want to free in chunks, see below for suggestions.
> 
> >   Things to note:
> >
> >     - note that if we're out of mbufs on our fast list, and the general
> >       mmbfree list has none available either, and mb_map is starved, even
> >       though there may be free mbufs on other CPUs' fast lists, we will
> >       return ENOBUFS. This behavior will usually be an indication of a
> >       wrongly chosen watermark ("w") and we will have to consider how to
> >       inform our users on how to properly select a watermark. I already
> >       have some ideas for alternate situations/ways of handling this, but
> >       will leave this investigation for later.
> >
> >     - "w" is a tunable watermark. No fast list will ever contain more than
> >       "w" mbufs. This presents a small problem. Consider a situation where
> >       we initially set w = 500; consider we have two CPUs; consider CPU1's
> >       fast list eventually gets 450 mbufs, and CPU2's fast list gets 345.
> >       Consider then that we decide to set w = 200; Even though all
> >       subsequent freeing will be done to the mmbfree list, unless we
> >       eventually go under the 200 mark for our free list, we will likely
> >       end up sitting with > 200 mbufs on each CPU's fast list. The idea I
> >       presently have is to have a kproc "garbage collect" > w mbufs on the
> >       CPUs' fast lists and put them back onto the mmbfree general list, if
> >       it detects that "w" has been lowered.
> >
> >   I'm looking for input. Please feel free to comment with the _specifics_
> >   of the system in mind.
> >
> >   Thanks in advance to Alfred who has already generated input. :-)
> 
> Oops, I think I wasn't clear enough. The idea is to have a low AND a high
> watermark; let's say you have hw (high water) at 500 and lw (low water)
> at 250.  The point being that:
> 
> 1) if you are freeing mbufs and hit the high watermark on your fast list,
>    you free (hw - lw) mbufs from your fast list into the general pool
> 2) if you are allocating mbufs and have 0 on your fast list, you acquire
>    (lw) mbufs from the general pool into your fast list.
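This hysteresis can be shown with plain counters. The hw=500 and lw=250 values come from the mail; everything else (function names, the counter-only modeling that ignores the actual list splicing and locking) is an illustrative assumption:

```c
#include <assert.h>

enum { LW = 250, HW = 500 };   /* low and high watermarks from the mail */

static int fast_cnt;    /* mbufs on this CPU's fast list */
static int global_cnt;  /* mbufs on the general pool */

/* Freeing: crossing hw drains (hw - lw) mbufs to the general pool. */
static void
fast_free_one(void)
{
	if (++fast_cnt >= HW) {
		fast_cnt   -= HW - LW;
		global_cnt += HW - LW;
	}
}

/* Allocating: an empty fast list pulls lw mbufs from the general pool. */
static int
fast_alloc_one(void)
{
	if (fast_cnt == 0) {
		if (global_cnt < LW)
			return (0);    /* would fall back to mb_map here */
		global_cnt -= LW;
		fast_cnt   += LW;
	}
	fast_cnt--;
	return (1);
}
```

Because a drain leaves lw mbufs behind and a refill fetches lw at once, a workload hovering around either watermark moves a batch of 250 mbufs at a time instead of bouncing single mbufs across the global lock.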
> 
> this should avoid a ping-pong effect and at the same time allow the
> last problem you spoke about to be addressed better.
> 
> More tricks that can be done:
> 
> Since you only free from low water to high water, you can avoid a
> linked-list traversal for counting: keep all mbufs below your low
> watermark on a separate fast list along with a count. When you free
> mbufs past your low water, stick them on a separate list and maintain
> a count; when the count on that list becomes greater than hw-lw, you
> can just dump it into the global list with a pointer swap and a bump
> of the count in the global freelist.
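The counted overflow list makes the batch transfer O(1). A sketch, with tiny watermarks and made-up names, keeping a tail pointer so the splice really is a pointer swap rather than a walk (the global-list lock is again elided):

```c
#include <assert.h>
#include <stddef.h>

struct mbuf { struct mbuf *m_next; };

enum { LW = 2, HW = 5 };          /* tiny values for illustration */

static struct mbuf *fast_head;    /* at most LW mbufs, never recounted */
static int fast_cnt;
static struct mbuf *over_head, *over_tail;  /* frees past LW land here */
static int over_cnt;
static struct mbuf *global_head;  /* stands in for the locked mmbfree */
static int global_cnt;

static void
fastlist_free(struct mbuf *m)
{
	if (fast_cnt < LW) {          /* below low water: keep it local */
		m->m_next = fast_head;
		fast_head = m;
		fast_cnt++;
		return;
	}
	m->m_next = over_head;        /* past low water: counted overflow */
	over_head = m;
	if (over_tail == NULL)
		over_tail = m;
	if (++over_cnt >= HW - LW) {  /* full batch: O(1) splice, no walk */
		over_tail->m_next = global_head;
		global_head = over_head;
		global_cnt += over_cnt;
		over_head = over_tail = NULL;
		over_cnt = 0;
	}
}
```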
> 
> You can also keep the mbufs in the global list in a special way so
> that you can do chunk allocations from it, by simply using the
> m_nextpkt field on the mbuf to point to the next "chunk" of mbufs
> hung off of m_next; you can hijack a byte of m_data to keep the count.
> 
> --
> -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> "I have the heart of a child; I keep it in a jar on my desk."
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-net" in the body of the message

