Date: Mon, 30 Oct 2000 13:20:52 -0500 (EST)
From: Bosko Milekic <bmilekic@technokratis.com>
Subject: MP: per-CPU mbuf allocation lists
To: freebsd-net@freebsd.org
Cc: freebsd-arch@freebsd.org

[cross-posted to freebsd-arch and freebsd-net; please continue the
discussion on freebsd-net]

Hello,

I recently wrote an initial "scratch pad" design for per-CPU mbuf lists
(in the MP case). The design consists simply of introducing these
"fast" lists for each CPU and populating them with mbufs at boot time.
Allocations from these lists would not need to be protected with a
mutex, as each CPU has its own. The general mmbfree list remains, and
remains protected by a mutex, for the case where a per-CPU list is
empty.

My initial idea was to leave freeing to the general list and have a
kproc "daemon" periodically repopulate the "fast" lists. That would of
course have required a mutex for each "fast" list as well, to ensure
synchronization with the kproc. However, in the great majority of
cases, with the kproc sleeping, acquiring a fast list's mutex would be
very cheap, as waiting for it would never be an issue.

Yesterday, Alfred pointed me to the HOARD web page and made several
suggestions, all worthy of attention. With the changes I have decided
to make, the system will work as follows:

- "Fast" list: a per-CPU mbuf list. Each typically holds "w" (for
  "watermark") mbufs; more on this below.

- The general (already existing) mmbfree list: a mutex-protected,
  global list, for when the fast list of the given CPU is empty.

- Allocations: all done from the "fast" lists, and very fast in the
  general case (both the allocation and freeing paths are sketched in
  code below). If no mbufs are available on our fast list, the general
  mmbfree list's lock is acquired and an mbuf is taken from there. If
  no mbuf is available even on the general list, we let go of the lock,
  allocate a page from mb_map, and drop the resulting mbufs onto our
  fast list, from which we grab the one we need. If mb_map is starved,
  then:
    (a) if M_NOWAIT, return ENOBUFS;
    (b) otherwise, go to sleep; on timeout, return ENOBUFS;
    (c) no timeout means we got a wakeup, and the wakeup was
        accompanied by the acquisition of the general mmbfree list's
        mutex. Since we were sleeping, we are assured that there is an
        mbuf waiting for us on the general mmbfree list, so we grab it
        and drop the lock (see the "freeing" section for why we know
        there is one on mmbfree).

- Freeing: first, if someone is sleeping, we grab the mmbfree global
  list's mutex, drop the mbuf there, and issue a wakeup. If nobody is
  sleeping, we proceed as follows:
    (a) if our fast list does not yet hold "w" mbufs, put the mbuf on
        our fast list, and we're done;
    (b) otherwise, our fast list already holds "w" mbufs, so acquire
        the mmbfree mutex and drop the mbuf there.
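To make the two paths concrete, here is a minimal sketch, written as a
self-contained user-space pthreads model rather than kernel code: a
pthread mutex and condition variable stand in for the kernel mutex and
sleep/wakeup, malloc() stands in for mb_map, and every name in it
(fast_list, mb_alloc, mb_free, page_alloc_mbufs, watermark_w, NMB_PAGE)
is made up for illustration. Case (b)'s timed sleep is reduced to an
untimed condition wait for brevity.

/*
 * A minimal sketch of the two paths above, as a self-contained
 * user-space pthreads model; every name in it is hypothetical.
 */
#include <pthread.h>
#include <stdlib.h>

struct mbuf { struct mbuf *m_next; };
struct fast_list { struct mbuf *head; int count; }; /* one per CPU */

static struct mbuf *mmbfree;            /* the general, global free list */
static pthread_mutex_t mmbfree_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t mmbfree_cv = PTHREAD_COND_INITIALIZER;
static int nsleepers;                   /* threads sleeping for an mbuf */
static int watermark_w = 200;           /* the tunable "w" */

#define NMB_PAGE 4                      /* mbufs carved from one "page" */

/* Stand-in for carving a page out of mb_map; NULL models starvation. */
static struct mbuf *
page_alloc_mbufs(int *n)
{
    struct mbuf *head = NULL;

    *n = 0;
    for (int i = 0; i < NMB_PAGE; i++) {
        struct mbuf *m = malloc(sizeof(*m));
        if (m == NULL)
            break;
        m->m_next = head;
        head = m;
        (*n)++;
    }
    return (head);
}

struct mbuf *
mb_alloc(struct fast_list *fl, int nowait)
{
    struct mbuf *m;
    int n;

    /* Common case: pop from our own fast list; no lock taken. */
    if ((m = fl->head) != NULL) {
        fl->head = m->m_next;
        fl->count--;
        return (m);
    }
    /* Fast list empty: fall back to the locked global list. */
    pthread_mutex_lock(&mmbfree_mtx);
    if ((m = mmbfree) != NULL) {
        mmbfree = m->m_next;
        pthread_mutex_unlock(&mmbfree_mtx);
        return (m);
    }
    pthread_mutex_unlock(&mmbfree_mtx);
    /* Both empty: get a page worth of mbufs, keep one, stash the rest. */
    if ((m = page_alloc_mbufs(&n)) != NULL) {
        fl->head = m->m_next;
        fl->count += n - 1;
        return (m);
    }
    if (nowait)
        return (NULL);                  /* (a) M_NOWAIT: ENOBUFS */
    /*
     * (b)/(c): sleep until a free lands an mbuf on mmbfree for us; the
     * real design would use a timed sleep and return ENOBUFS on timeout.
     */
    pthread_mutex_lock(&mmbfree_mtx);
    nsleepers++;
    while (mmbfree == NULL)
        pthread_cond_wait(&mmbfree_cv, &mmbfree_mtx);
    nsleepers--;
    m = mmbfree;
    mmbfree = m->m_next;
    pthread_mutex_unlock(&mmbfree_mtx);
    return (m);
}

void
mb_free(struct fast_list *fl, struct mbuf *m)
{
    /* A sleeper must find its mbuf on the global list: drop it there
     * and issue the wakeup. */
    pthread_mutex_lock(&mmbfree_mtx);
    if (nsleepers > 0) {
        m->m_next = mmbfree;
        mmbfree = m;
        pthread_cond_signal(&mmbfree_cv);
        pthread_mutex_unlock(&mmbfree_mtx);
        return;
    }
    pthread_mutex_unlock(&mmbfree_mtx);
    if (fl->count < watermark_w) {      /* (a) under "w": keep it local */
        m->m_next = fl->head;
        fl->head = m;
        fl->count++;
    } else {                            /* (b) at "w": back to mmbfree */
        pthread_mutex_lock(&mmbfree_mtx);
        m->m_next = mmbfree;
        mmbfree = m;
        pthread_mutex_unlock(&mmbfree_mtx);
    }
}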
Things to note:

- If we're out of mbufs on our fast list, the general mmbfree list has
  none available either, and mb_map is starved, we will return ENOBUFS
  even though there may be free mbufs on other CPUs' fast lists. This
  behavior will usually be an indication of a badly chosen watermark
  ("w"), and we will have to consider how to inform our users on how to
  properly select one. I already have some ideas for alternate ways of
  handling this, but will leave that investigation for later.

- "w" is a tunable watermark: no fast list will ever hold more than "w"
  mbufs. This presents a small problem. Say we initially set w = 500 on
  a two-CPU machine; CPU1's fast list eventually grows to 450 mbufs and
  CPU2's to 345; we then lower w to 200. Even though all subsequent
  freeing will go to the mmbfree list, unless a fast list eventually
  drops below the 200 mark, we will likely end up sitting with more
  than 200 mbufs on each CPU's fast list. The idea I presently have is
  a kproc that "garbage collects" the > w mbufs on the CPUs' fast lists
  and puts them back onto the general mmbfree list when it detects that
  "w" has been lowered (a rough sketch follows at the end of this
  message).

I'm looking for input. Please feel free to comment with the _specifics_
of the system in mind. Thanks in advance to Alfred, who has already
provided input. :-)

Cheers,
Bosko Milekic
bmilekic@technokratis.com

P.S.: Most of the beneficial effects of this system will only be seen
once the stack is fully threaded.
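Here is the promised sketch of the collector, continuing the pthreads
model from the earlier sketch (it reuses struct mbuf, struct fast_list,
mmbfree, and friends from there). Note the assumption it smuggles in:
each fast list grows a per-list mutex (fast_mtx) so that a kproc can
safely trim a list it does not own. That mutex is an addition to the
design above, not part of it, and in a real version the owning CPU's
lockless pops would have to honor it as well.

/*
 * Hypothetical "garbage collector" kproc, modeled as a pthread: it
 * periodically trims every fast list down to the current "w" and
 * returns the excess to mmbfree.
 */
#include <unistd.h>

#define NCPU 2
static struct fast_list fast_lists[NCPU];
static pthread_mutex_t fast_mtx[NCPU] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static void *
mb_gc_kproc(void *arg)
{
    (void)arg;
    for (;;) {
        for (int cpu = 0; cpu < NCPU; cpu++) {
            struct fast_list *fl = &fast_lists[cpu];
            struct mbuf *excess = NULL;

            /* Detach everything above the (possibly lowered) "w". */
            pthread_mutex_lock(&fast_mtx[cpu]);
            while (fl->count > watermark_w) {
                struct mbuf *m = fl->head;

                fl->head = m->m_next;
                fl->count--;
                m->m_next = excess;
                excess = m;
            }
            pthread_mutex_unlock(&fast_mtx[cpu]);
            if (excess == NULL)
                continue;
            /* Hand the excess back to the global list and wake any
             * sleepers now that mbufs are available there. */
            pthread_mutex_lock(&mmbfree_mtx);
            while (excess != NULL) {
                struct mbuf *m = excess;

                excess = m->m_next;
                m->m_next = mmbfree;
                mmbfree = m;
            }
            if (nsleepers > 0)
                pthread_cond_broadcast(&mmbfree_cv);
            pthread_mutex_unlock(&mmbfree_mtx);
        }
        sleep(1);       /* scan interval is arbitrary here */
    }
    return (NULL);
}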