Date: Mon, 30 Oct 2000 13:20:52 -0500 (EST)
From: Bosko Milekic <bmilekic@technokratis.com>
Subject: MP: per-CPU mbuf allocation lists
To: freebsd-net@freebsd.org
Cc: freebsd-arch@freebsd.org

[cross-posted to freebsd-arch and freebsd-net; please continue the
discussion on freebsd-net]

Hello,

I recently wrote an initial "scratch pad" design for per-CPU mbuf lists
(in the MP case). The design consists simply of introducing these
"fast" lists for each CPU and populating them with mbufs at boot time.
Allocations from these lists would not need to be protected with a
mutex, as each CPU has its own. The general mmbfree list remains, and
remains protected by a mutex, for the case where a per-CPU list is
empty.

My initial idea was to leave freeing to the general list and have a
kproc "daemon" periodically repopulate the "fast" lists. That would of
course have required a mutex for each "fast" list as well, to ensure
synchronization with the kproc. However, in the great majority of
cases, with the kproc sleeping, acquiring a fast list's mutex would be
very cheap, as waiting for it would never be an issue.

Yesterday, Alfred pointed me to the HOARD web page and made several
suggestions, all worthy of attention. With the changes I have decided
to make, the system will work as follows:

- "Fast" list: a per-CPU mbuf list. Each typically holds "w" (for
  "watermark") mbufs; more on this below.

- The general (already existing) mmbfree list: a mutex-protected,
  global list, for when the fast list of the given CPU is empty.

- Allocations: all done from the "fast" lists, and very fast in the
  general case (both the allocation and freeing paths are sketched in
  code below). If no mbufs are available on our fast list, the general
  mmbfree list's lock is acquired and an mbuf is taken from there. If
  no mbuf is available even on the general list, we let go of the lock,
  allocate a page from mb_map, and drop the resulting mbufs onto our
  fast list, from which we grab the one we need. If mb_map is starved,
  then:
    (a) if M_NOWAIT, return ENOBUFS;
    (b) otherwise, go to sleep; on timeout, return ENOBUFS;
    (c) no timeout means we got a wakeup, and the wakeup was
        accompanied by the acquisition of the general mmbfree list's
        mutex. Since we were sleeping, we are assured that there is an
        mbuf waiting for us on the general mmbfree list, so we grab it
        and drop the lock (see the "freeing" section for why we know
        there is one on mmbfree).

- Freeing: first, if someone is sleeping, we grab the mmbfree global
  list's mutex, drop the mbuf there, and issue a wakeup. If nobody is
  sleeping, we proceed as follows:
    (a) if our fast list does not yet hold "w" mbufs, put the mbuf on
        our fast list, and we're done;
    (b) otherwise, our fast list already holds "w" mbufs, so acquire
        the mmbfree mutex and drop the mbuf there.
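To make the two paths concrete, here is a minimal sketch, written as a
self-contained user-space pthreads model rather than kernel code: a
pthread mutex and condition variable stand in for the kernel mutex and
sleep/wakeup, malloc() stands in for mb_map, and every name in it
(fast_list, mb_alloc, mb_free, page_alloc_mbufs, watermark_w, NMB_PAGE)
is made up for illustration. Case (b)'s timed sleep is reduced to an
untimed condition wait for brevity.

/*
 * A minimal sketch of the two paths above, as a self-contained
 * user-space pthreads model; every name in it is hypothetical.
 */
#include <pthread.h>
#include <stdlib.h>

struct mbuf { struct mbuf *m_next; };
struct fast_list { struct mbuf *head; int count; }; /* one per CPU */

static struct mbuf *mmbfree;            /* the general, global free list */
static pthread_mutex_t mmbfree_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t mmbfree_cv = PTHREAD_COND_INITIALIZER;
static int nsleepers;                   /* threads sleeping for an mbuf */
static int watermark_w = 200;           /* the tunable "w" */

#define NMB_PAGE 4                      /* mbufs carved from one "page" */

/* Stand-in for carving a page out of mb_map; NULL models starvation. */
static struct mbuf *
page_alloc_mbufs(int *n)
{
    struct mbuf *head = NULL;

    *n = 0;
    for (int i = 0; i < NMB_PAGE; i++) {
        struct mbuf *m = malloc(sizeof(*m));
        if (m == NULL)
            break;
        m->m_next = head;
        head = m;
        (*n)++;
    }
    return (head);
}

struct mbuf *
mb_alloc(struct fast_list *fl, int nowait)
{
    struct mbuf *m;
    int n;

    /* Common case: pop from our own fast list; no lock taken. */
    if ((m = fl->head) != NULL) {
        fl->head = m->m_next;
        fl->count--;
        return (m);
    }
    /* Fast list empty: fall back to the locked global list. */
    pthread_mutex_lock(&mmbfree_mtx);
    if ((m = mmbfree) != NULL) {
        mmbfree = m->m_next;
        pthread_mutex_unlock(&mmbfree_mtx);
        return (m);
    }
    pthread_mutex_unlock(&mmbfree_mtx);
    /* Both empty: get a page worth of mbufs, keep one, stash the rest. */
    if ((m = page_alloc_mbufs(&n)) != NULL) {
        fl->head = m->m_next;
        fl->count += n - 1;
        return (m);
    }
    if (nowait)
        return (NULL);                  /* (a) M_NOWAIT: ENOBUFS */
    /*
     * (b)/(c): sleep until a free lands an mbuf on mmbfree for us; the
     * real design would use a timed sleep and return ENOBUFS on timeout.
     */
    pthread_mutex_lock(&mmbfree_mtx);
    nsleepers++;
    while (mmbfree == NULL)
        pthread_cond_wait(&mmbfree_cv, &mmbfree_mtx);
    nsleepers--;
    m = mmbfree;
    mmbfree = m->m_next;
    pthread_mutex_unlock(&mmbfree_mtx);
    return (m);
}

void
mb_free(struct fast_list *fl, struct mbuf *m)
{
    /* A sleeper must find its mbuf on the global list: drop it there
     * and issue the wakeup. */
    pthread_mutex_lock(&mmbfree_mtx);
    if (nsleepers > 0) {
        m->m_next = mmbfree;
        mmbfree = m;
        pthread_cond_signal(&mmbfree_cv);
        pthread_mutex_unlock(&mmbfree_mtx);
        return;
    }
    pthread_mutex_unlock(&mmbfree_mtx);
    if (fl->count < watermark_w) {      /* (a) under "w": keep it local */
        m->m_next = fl->head;
        fl->head = m;
        fl->count++;
    } else {                            /* (b) at "w": back to mmbfree */
        pthread_mutex_lock(&mmbfree_mtx);
        m->m_next = mmbfree;
        mmbfree = m;
        pthread_mutex_unlock(&mmbfree_mtx);
    }
}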
Things to note:

- If we're out of mbufs on our fast list, the general mmbfree list has
  none available either, and mb_map is starved, we will return ENOBUFS
  even though there may be free mbufs on other CPUs' fast lists. This
  behavior will usually be an indication of a badly chosen watermark
  ("w"), and we will have to consider how to inform our users on how to
  properly select one. I already have some ideas for alternate ways of
  handling this, but will leave that investigation for later.

- "w" is a tunable watermark: no fast list will ever hold more than "w"
  mbufs. This presents a small problem. Say we initially set w = 500 on
  a two-CPU machine; CPU1's fast list eventually grows to 450 mbufs and
  CPU2's to 345; we then lower w to 200. Even though all subsequent
  freeing will go to the mmbfree list, unless a fast list eventually
  drops below the 200 mark, we will likely end up sitting with more
  than 200 mbufs on each CPU's fast list. The idea I presently have is
  a kproc that "garbage collects" the > w mbufs on the CPUs' fast lists
  and puts them back onto the general mmbfree list when it detects that
  "w" has been lowered (a rough sketch follows at the end of this
  message).

I'm looking for input. Please feel free to comment with the _specifics_
of the system in mind. Thanks in advance to Alfred, who has already
provided input. :-)

Cheers,
Bosko Milekic
bmilekic@technokratis.com

P.S.: Most of the beneficial effects of this system will only be seen
once the stack is fully threaded.
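Here is the promised sketch of the collector, continuing the pthreads
model from the earlier sketch (it reuses struct mbuf, struct fast_list,
mmbfree, and friends from there). Note the assumption it smuggles in:
each fast list grows a per-list mutex (fast_mtx) so that a kproc can
safely trim a list it does not own. That mutex is an addition to the
design above, not part of it, and in a real version the owning CPU's
lockless pops would have to honor it as well.

/*
 * Hypothetical "garbage collector" kproc, modeled as a pthread: it
 * periodically trims every fast list down to the current "w" and
 * returns the excess to mmbfree.
 */
#include <unistd.h>

#define NCPU 2
static struct fast_list fast_lists[NCPU];
static pthread_mutex_t fast_mtx[NCPU] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static void *
mb_gc_kproc(void *arg)
{
    (void)arg;
    for (;;) {
        for (int cpu = 0; cpu < NCPU; cpu++) {
            struct fast_list *fl = &fast_lists[cpu];
            struct mbuf *excess = NULL;

            /* Detach everything above the (possibly lowered) "w". */
            pthread_mutex_lock(&fast_mtx[cpu]);
            while (fl->count > watermark_w) {
                struct mbuf *m = fl->head;

                fl->head = m->m_next;
                fl->count--;
                m->m_next = excess;
                excess = m;
            }
            pthread_mutex_unlock(&fast_mtx[cpu]);
            if (excess == NULL)
                continue;
            /* Hand the excess back to the global list and wake any
             * sleepers now that mbufs are available there. */
            pthread_mutex_lock(&mmbfree_mtx);
            while (excess != NULL) {
                struct mbuf *m = excess;

                excess = m->m_next;
                m->m_next = mmbfree;
                mmbfree = m;
            }
            if (nsleepers > 0)
                pthread_cond_broadcast(&mmbfree_cv);
            pthread_mutex_unlock(&mmbfree_mtx);
        }
        sleep(1);       /* scan interval is arbitrary here */
    }
    return (NULL);
}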