From owner-freebsd-arch Sun Mar 2 1:10:24 2003
Delivered-To: freebsd-arch@freebsd.org
To: "Alan L. Cox"
Cc: arch@FreeBSD.ORG
Subject: Re: Removal of ENABLE_VFS_IOOPT
From: "Poul-Henning Kamp"
In-Reply-To: Your message of "Sat, 01 Mar 2003 23:48:22 CST." <3E619B26.DF1E4FC7@imimic.com>
Date: Sun, 02 Mar 2003 10:10:18 +0100
Message-ID: <6040.1046596218@critter.freebsd.dk>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

In message <3E619B26.DF1E4FC7@imimic.com>, "Alan L. Cox" writes:
>Before I begin work on vm_object locking, I'd like to remove
>ENABLE_VFS_IOOPT from the kernel sources. ENABLE_VFS_IOOPT was a
>work-in-progress by John Dyson to perform zero-copy file system I/O.
>Unfortunately, it still has some unresolved issues, and no one has taken
>an active interest in fixing them.
>
>For the record, both Matt Dillon and Tor Egge have stated in public or
>private e-mail that they favor removing it.
>Unless I hear an objection, I intend to remove it in a few days.

Kill it!

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Sun Mar 2 8:15: 3 2003
To: arch@freebsd.org
Subject: caddr_t, d_ioctl_t and cdevsw{} in general.
From: "Poul-Henning Kamp"
Date: Sun, 02 Mar 2003 17:14:58 +0100
Message-ID: <8322.1046621698@critter.freebsd.dk>

>des         2003/03/02 07:29:13 PST
>
>  FreeBSD src repository
>
>  Modified files:
>    sys/sys              uio.h
>    sys/kern             kern_subr.c
>  Log:
>  Convert one of our main caddr_t consumers, uiomove(9), to void *.

One of the other ones is cdevsw->d_ioctl(), where I would really
like to make the pointer argument a "void *".

Unfortunately, we have approx 140 statically initialized instances
of struct cdevsw {} in our tree, so any modification to this structure
makes a major mess and flagday, and 's/caddr_t[ ]*/void */' is not
sufficient reason alone.

(The worry that we may break source compatibility with previous
FreeBSD's or other operating systems is likely false economy because
SMPng may change semantics so much that any such compatibility at
best is illusory.)
There are other changes to struct cdevsw{} which are pending: I just
eliminated the d_psize element, for instance, and we probably need to
pass a file descriptor to d_open(), d_close() and possibly d_ioctl()
in the future.

My intent is to postpone any flag-day for as long as I can, but if
I can see it being required, I will attempt to get it over with
before RELENG_5 if at all possible.

But...

It would be perfect to have a versioned API instead of the cdevsw{}.
One way to do this would be to abandon the cdevsw{} and assign the
methods to members in the dev_t directly, something like

    dev = make_dev(...);
    dev->d_open = fooopen;
    dev->d_close = fooclose;
    dev->d_ioctl = fooioctl;
    ...

But this would be wasteful: most devices under a driver use the
same methods, so the indirection through cdevsw{} saves space, and it
would be fairly tedious and repetitive in drivers with more than
one make_dev() call.

I know a lot of people will cry "KOBJ", and while we can avoid the
cache-collision by caching the methods in the dev_t at make_dev()
time, the loss of type-checking in this API worries me a lot: this
is one of the APIs which we have most new-to-FreeBSD people writing
against.

I don't think we can introduce type-safety in KOBJ as it stands.

And besides, we have the same issue over in VOP_* and a few other
places.

I _really_ think it is time to seriously consider using an augmented
'C' dialect where we build in some of these "OOlite" concepts, and
write a preprocessor for it which outputs C.

Any takers ?

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
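
[Editorial sketch, not part of the original message.] The trade-off described in the message above, per-device method pointers versus one shared method table per driver, can be illustrated with a small user-space C sketch. Every name in it (sketch_cdevsw, sketch_dev, sketch_make_dev, SKETCH_CDEVSW_VERSION and the foo_* methods) is a hypothetical stand-in and not the FreeBSD kernel definition. Putting a version field in the shared table is an assumption made here to show one possible way of getting a versioned interface; it is not what the message proposes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real FreeBSD definitions. */
#define SKETCH_CDEVSW_VERSION 1

struct sketch_dev;

/* Typed method signatures, so the compiler checks every driver method. */
typedef int sketch_open_t(struct sketch_dev *dev, int flags);
typedef int sketch_ioctl_t(struct sketch_dev *dev, unsigned long cmd,
    void *data);

/* One shared method table per driver, tagged with an interface version. */
struct sketch_cdevsw {
	int		 d_version;
	sketch_open_t	*d_open;
	sketch_ioctl_t	*d_ioctl;
};

/* Each device carries one pointer to its driver's table, not the methods. */
struct sketch_dev {
	const struct sketch_cdevsw *si_devsw;
	int			    si_unit;
	int			    si_opens;
};

static int
foo_open(struct sketch_dev *dev, int flags)
{
	(void)flags;
	dev->si_opens++;
	return (0);
}

/* Command 1 (made up for this sketch) copies the unit number out. */
static int
foo_ioctl(struct sketch_dev *dev, unsigned long cmd, void *data)
{
	if (cmd != 1 || data == NULL)
		return (-1);
	*(int *)data = dev->si_unit;
	return (0);
}

static const struct sketch_cdevsw foo_cdevsw = {
	.d_version = SKETCH_CDEVSW_VERSION,
	.d_open = foo_open,
	.d_ioctl = foo_ioctl,
};

/* Attach a device to a table; refuse tables built for another version. */
static int
sketch_make_dev(struct sketch_dev *dev, const struct sketch_cdevsw *sw,
    int unit)
{
	assert(dev != NULL && sw != NULL);
	if (sw->d_version != SKETCH_CDEVSW_VERSION)
		return (-1);
	dev->si_devsw = sw;
	dev->si_unit = unit;
	dev->si_opens = 0;
	return (0);
}
```

Two devices created from foo_cdevsw share a single table, so each device costs one pointer rather than one pointer per method, and a call such as dev->si_devsw->d_ioctl(dev, 1, &unit) is still checked against the typedef at compile time.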

From owner-freebsd-arch Sun Mar 2 9:38:21 2003
Message-ID: <3E624133.8FB21AA6@mindspring.com>
Date: Sun, 02 Mar 2003 09:36:51 -0800
From: Terry Lambert
To: "Alan L. Cox"
Cc: arch@freebsd.org
Subject: Re: Removal of ENABLE_VFS_IOOPT
References: <3E619B26.DF1E4FC7@imimic.com>

"Alan L. Cox" wrote:
> Before I begin work on vm_object locking, I'd like to remove
> ENABLE_VFS_IOOPT from the kernel sources. ENABLE_VFS_IOOPT was a
> work-in-progress by John Dyson to perform zero-copy file system I/O.
> Unfortunately, it still has some unresolved issues, and no one has taken
> an active interest in fixing them.
>
> For the record, both Matt Dillon and Tor Egge have stated in public or
> private e-mail that they favor removing it.
> Unless I hear an objection, I intend to remove it in a few days.

Here's an idea... ask John Dyson about it.

-- Terry

From owner-freebsd-arch Sun Mar 2 10:53:33 2003
To: "Alan L. Cox"
Cc: arch@freebsd.org
Subject: Re: Removal of ENABLE_VFS_IOOPT
In-Reply-To: <3E619B26.DF1E4FC7@imimic.com>
Date: Sun, 02 Mar 2003 10:53:32 -0800
From: Peter Wemm
Message-Id: <20030302185332.4D54B2A89E@canning.wemm.org>

"Alan L. Cox" wrote:
> Before I begin work on vm_object locking, I'd like to remove
> ENABLE_VFS_IOOPT from the kernel sources. ENABLE_VFS_IOOPT was a
> work-in-progress by John Dyson to perform zero-copy file system I/O.
> Unfortunately, it still has some unresolved issues, and no one has taken
> an active interest in fixing them.

Hold on a second.. I thought the zero-copy folks fixed this up and it
is required for turning on zero-copy mode etc.

I remember they added code to fix the read() coherency problems.

Cheers,
-Peter
--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5

From owner-freebsd-arch Sun Mar 2 10:58: 4 2003
Date: Sun, 2 Mar 2003 10:58:01 -0800
From: Steve Kargl
To: Terry Lambert
Cc: "Alan L. Cox", arch@FreeBSD.ORG
Subject: Re: Removal of ENABLE_VFS_IOOPT
Message-ID: <20030302185801.GA36138@troutmask.apl.washington.edu>
References: <3E619B26.DF1E4FC7@imimic.com> <3E624133.8FB21AA6@mindspring.com>
In-Reply-To: <3E624133.8FB21AA6@mindspring.com>

On Sun, Mar 02, 2003 at 09:36:51AM -0800, Terry Lambert wrote:
> "Alan L. Cox" wrote:
> > Before I begin work on vm_object locking, I'd like to remove
> > ENABLE_VFS_IOOPT from the kernel sources. ENABLE_VFS_IOOPT was a
> > work-in-progress by John Dyson to perform zero-copy file system I/O.
> > Unfortunately, it still has some unresolved issues, and no one has taken
> > an active interest in fixing them.
> >
> > For the record, both Matt Dillon and Tor Egge have stated in public or
> > private e-mail that they favor removing it.
> > Unless I hear an objection, I intend to remove it in a few days.
>
> Here's an idea... ask John Dyson about it.
>

From John's comments in c.u.b.f.m, I doubt he
follows the kernel development in 5.0 closely enough
to make any recommendation without studying the
code. I further suspect that John would not want
to spend the time required.

-- 
Steve

From owner-freebsd-arch Sun Mar 2 11:13: 5 2003
Message-ID: <3E625750.9319E291@mindspring.com>
Date: Sun, 02 Mar 2003 11:11:12 -0800
From: Terry Lambert
To: Steve Kargl
Cc: "Alan L. Cox", arch@FreeBSD.ORG
Subject: Re: Removal of ENABLE_VFS_IOOPT
References: <3E619B26.DF1E4FC7@imimic.com> <3E624133.8FB21AA6@mindspring.com> <20030302185801.GA36138@troutmask.apl.washington.edu>

Steve Kargl wrote:
> > Here's an idea... ask John Dyson about it.
>
> From John's comments in c.u.b.f.m, I doubt he
> follows the kernel development in 5.0 closely enough
> to make any recommendation without studying the
> code. I further suspect that John would not want
> to spend the time required.

It's a design question, not an implementation question. Alan's
suggestion is that the design be modified because (in his opinion)
the implementation is incomplete.

Though Peter Wemm's comment that it is not incomplete, and that it's
used by the zero copy TCP people, to the extent that they've
maintained the read coherency code, is also salient.

BTW: I think you are wrong about John; I guess you missed his
post to -chat last week?

-- Terry

From owner-freebsd-arch Sun Mar 2 11:44:39 2003
Date: Sun, 2 Mar 2003 11:44:35 -0800
From: Steve Kargl
To: Terry Lambert
Cc: "Alan L. Cox", arch@FreeBSD.ORG
Subject: Re: Removal of ENABLE_VFS_IOOPT
Message-ID: <20030302194435.GA36383@troutmask.apl.washington.edu>
References: <3E619B26.DF1E4FC7@imimic.com> <3E624133.8FB21AA6@mindspring.com> <20030302185801.GA36138@troutmask.apl.washington.edu> <3E625750.9319E291@mindspring.com>
In-Reply-To: <3E625750.9319E291@mindspring.com>

On Sun, Mar 02, 2003 at 11:11:12AM -0800, Terry Lambert wrote:
> Steve Kargl wrote:
> > > Here's an idea... ask John Dyson about it.
> >
> > From John's comments in c.u.b.f.m, I doubt he
> > follows the kernel development in 5.0 closely enough
> > to make any recommendation without studying the
> > code. I further suspect that John would not want
> > to spend the time required.
>
> It's a design question, not an implementation question. Alan's
> suggestion is that the design be modified because (in his opinion)
> the implementation is incomplete.

I suppose Alan's suggestion to remove the code is a modification
to the design. :-)

> BTW: I think you are wrong about John; I guess you missed his
> post to -chat last week?

I don't read -chat. I do read c.u.b.f.m, and John's statements
lead one to conclude he doesn't follow the development closely
enough to make a design decision.

-- 
Steve

From owner-freebsd-arch Sun Mar 2 12:29:34 2003
Message-ID: <3E626997.5005AE71@imimic.com>
Date: Sun, 02 Mar 2003 14:29:11 -0600
From: "Alan L. Cox"
Organization: iMimic Networking, Inc.
To: Peter Wemm
Cc: arch@freebsd.org
Subject: Re: Removal of ENABLE_VFS_IOOPT
References: <20030302185332.4D54B2A89E@canning.wemm.org>

Peter Wemm wrote:
>
> "Alan L. Cox" wrote:
> > Before I begin work on vm_object locking, I'd like to remove
> > ENABLE_VFS_IOOPT from the kernel sources. ENABLE_VFS_IOOPT was a
> > work-in-progress by John Dyson to perform zero-copy file system I/O.
> > Unfortunately, it still has some unresolved issues, and no one has taken
> > an active interest in fixing them.
>
> Hold on a second.. I thought the zero-copy folks fixed this up and it
> is required for turning on zero-copy mode etc.
>
> I remember they added code to fix the read() coherency problems.
>

No, they didn't. The zero-copy sockets code introduced a new page-based
copy-on-write mechanism and a new parameter to uiomoveco()
("disposable") that guaranteed that the new mechanism wouldn't be used
for ENABLE_VFS_IOOPT. In contrast, ENABLE_VFS_IOOPT tried to use the
preexisting object-based copy-on-write mechanism. Except for a few
lines of code in kern_subr.c's userspaceco(), the two mechanisms are
distinct. Specifically, the page flipping logic is totally distinct:
vm_pgmoveco() for zero-copy sockets and vm_uiomove() for
ENABLE_VFS_IOOPT.

Someday, someone could attempt to reimplement ENABLE_VFS_IOOPT using the
page-based mechanism, in which case, the four snippets of code in
ffs_vnops.c could be useful. Aside from that, the code which lives in
vm_map.* and vm_object.* is dead weight.

Alan

From owner-freebsd-arch Sun Mar 2 12:52:53 2003
To: "Alan L. Cox"
Cc: arch@freebsd.org
Subject: Re: Removal of ENABLE_VFS_IOOPT
In-Reply-To: <3E626997.5005AE71@imimic.com>
Date: Sun, 02 Mar 2003 12:52:50 -0800
From: Peter Wemm
Message-Id: <20030302205250.E53D22A89E@canning.wemm.org>

"Alan L. Cox" wrote:
> Peter Wemm wrote:
> >
> > "Alan L. Cox" wrote:
> > > Before I begin work on vm_object locking, I'd like to remove
> > > ENABLE_VFS_IOOPT from the kernel sources. ENABLE_VFS_IOOPT was a
> > > work-in-progress by John Dyson to perform zero-copy file system I/O.
> > > Unfortunately, it still has some unresolved issues, and no one has taken
> > > an active interest in fixing them.
> >
> > Hold on a second.. I thought the zero-copy folks fixed this up and it
> > is required for turning on zero-copy mode etc.
> >
> > I remember they added code to fix the read() coherency problems.
> >
>
> No, they didn't. The zero-copy sockets code introduced a new page-based
> copy-on-write mechanism and a new parameter to uiomoveco()
> ("disposable") that guaranteed that the new mechanism wouldn't be used
> for ENABLE_VFS_IOOPT. In contrast, ENABLE_VFS_IOOPT tried to use the
> preexisting object-based copy-on-write mechanism. Except for a few
> lines of code in kern_subr.c's userspaceco(), the two mechanisms are
> distinct. Specifically, the page flipping logic is totally distinct:
> vm_pgmoveco() for zero-copy sockets and vm_uiomove() for
> ENABLE_VFS_IOOPT.
>
> Someday, someone could attempt to reimplement ENABLE_VFS_IOOPT using the
> page-based mechanism, in which case, the four snippets of code in
> ffs_vnops.c could be useful. Aside from that, the code which lives in
> vm_map.* and vm_object.* is dead weight.

Perhaps my confusion comes from earlier versions of the zero copy work
when it was necessary to turn on ENABLE_VFS_IOOPT.

Personally, I don't think the complexity is worth it. I don't mind
removing it, as long as it is for the right reasons. I thought it had
been fixed, but since it hasn't I'll shut up.

I think we'd get more mileage from optimizing the data copying routines.
There are a whole host of new instructions that came with SSE that we
are not taking advantage of yet. We can probably get a factor of 5
improvement in general purpose data copying speeds with this stuff on
modern cpus.

Cheers,
-Peter
--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5

From owner-freebsd-arch Sun Mar 2 18:47:35 2003
Date: Mon, 3 Mar 2003 04:47:21 +0200
From: Giorgos Keramidas
To: Terry Lambert
Cc: Steve Kargl, "Alan L. Cox", arch@FreeBSD.org
Subject: Re: Removal of ENABLE_VFS_IOOPT
Message-ID: <20030303024721.GC97321@gothmog.gr>
References: <3E619B26.DF1E4FC7@imimic.com> <3E624133.8FB21AA6@mindspring.com> <20030302185801.GA36138@troutmask.apl.washington.edu> <3E625750.9319E291@mindspring.com>
In-Reply-To: <3E625750.9319E291@mindspring.com>

On 2003-03-02 11:11, Terry Lambert wrote:
>Steve Kargl wrote:
>> > Here's an idea... ask John Dyson about it.
>>
>> From John's comments in c.u.b.f.m, I doubt he follows the kernel
>> development in 5.0 closely enough to make any recommendation without
>> studying the code. I further suspect that John would not want to
>> spend the time required.
>
> BTW: I think you are wrong about John; I guess you missed his
> post to -chat last week?

Well, it was a bit older than that, but you're right.

    From: "John S. Dyson"
    Message-Id: <200302151802.NAA25682@dyson.jdyson.com>
    Subject: Just some comments about FreeBSDV5.0
    To: chat@freebsd.org
    Date: Sat, 15 Feb 2003 13:02:43 -0500 (EST)

- Giorgos

From owner-freebsd-arch Sun Mar 2 19:13:58 2003
Date: Sun, 2 Mar 2003 22:06:05 -0500 (EST)
From: Jeff Roberson
To: arch@freebsd.org
Subject: New getblk parameter.
Message-ID: <20030302220232.C56877-100000@mail.chesapeake.net>

I'd like to add a new parameter to getblk called 'flags'. The only flag
I'm currently defining is GB_LOCK_NOWAIT so that it doesn't block trying
to get the block. This is useful in the vfs_cluster code where we want to
include a block in a cluster but only if it isn't currently in use.

I have defined a new function 'getblkf' and put up a #define wrapper for
getblk. It'd be neat to have a getblk() that didn't have the slpflag and
slptimo args since almost nothing uses those and then use getblkf to
supply all possible arguments. I'm not doing that for now though.

I have a patch that does this and makes use of it in vfs_cluster available
at:

http://www.chesapeake.net/~jroberson/cluster.diff

This clears up some other unsafe code in vfs cluster as well.

Comments?

Cheers,
Jeff

From owner-freebsd-arch Sun Mar 2 19:26:29 2003
Message-ID: <3E62CB0D.92E9FF78@mindspring.com>
Date: Sun, 02 Mar 2003 19:25:01 -0800
From: Terry Lambert
To: Jeff Roberson
Cc: arch@freebsd.org
Subject: Re: New getblk parameter.
References: <20030302220232.C56877-100000@mail.chesapeake.net>

Jeff Roberson wrote:
> I'd like to add a new parameter to getblk called 'flags'. The only flag
> I'm currently defining is GB_LOCK_NOWAIT so that it doesn't block trying
> to get the block. This is useful in the vfs_cluster code where we want to
> include a block in a cluster but only if it isn't currently in use.
>
> I have defined a new function 'getblkf' and put up a #define wrapper for
> getblk. It'd be neat to have a getblk() that didn't have the slpflag and
> slptimo args since almost nothing uses those and then use getblkf to
> supply all possible arguments. I'm not doing that for now though.
>
> I have a patch that does this and makes use of it in vfs_cluster available
> at:
>
> http://www.chesapeake.net/~jroberson/cluster.diff
>
> This clears up some other unsafe code in vfs cluster as well.
>
> Comments?

FWIW, I like it; the cleanup that results in kern/vfs_cluster.c
looks nice.

If you are not going to change all the calls to getblk(), it
should probably be a wrapper function, or, minimally, an inline
and a wrapper function.

The reasoning is that it could be called from precompiled
modules, so you want to leave a symbol visible for it, which
defining it to getblkf(..., 0) doesn't do.

-- Terry

From owner-freebsd-arch Sun Mar 2 20:25:50 2003
Date: Sun, 2 Mar 2003 23:25:46 -0500 (EST)
From: Jeff Roberson
To: Terry Lambert
Cc: arch@FreeBSD.ORG
Subject: Re: New getblk parameter.
In-Reply-To: <3E62CB0D.92E9FF78@mindspring.com>
Message-ID: <20030302232340.R84333-100000@mail.chesapeake.net>

On Sun, 2 Mar 2003, Terry Lambert wrote:
> Jeff Roberson wrote:
> > I'd like to add a new parameter to getblk called 'flags'. The only flag
> > I'm currently defining is GB_LOCK_NOWAIT so that it doesn't block trying
> > to get the block. This is useful in the vfs_cluster code where we want to
> > include a block in a cluster but only if it isn't currently in use.
>
> FWIW, I like it; the cleanup that results in kern/vfs_cluster.c
> looks nice.

Me too, thanks.

> If you are not going to change all the calls to getblk(), it
> should probably be a wrapper function, or, minimally, an inline
> and a wrapper function.
>
> The reasoning is that it could be called from precompiled
> modules, so you want to leave a symbol visible for it, which
> defining it to getblkf(..., 0) doesn't do.

Precompiled modules are already going to be broken with the new locking
semantics. I think requiring them to recompile is OK. I intend to bump
the FreeBSD version if this goes in.

I'd sort of like to change all the getblk() calls actually. If no one
strongly objects to that I'll do it.
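
[Editorial sketch, not part of the original message.] The wrapper question debated in this thread can be shown in a few lines of user-space C. The sketch below assumes a toy four-entry buffer cache; the names sketch_buf, sketch_cache, sketch_getblkf, sketch_getblk, sketch_getblk_macro and SKETCH_GB_LOCK_NOWAIT are hypothetical, and the real getblk() takes different arguments. It only contrasts a #define wrapper, which leaves no symbol for old callers to link against, with a real wrapper function, which does.

```c
#include <stddef.h>

/* Hypothetical, simplified buffer; the real struct buf is far larger. */
struct sketch_buf {
	long	b_lblkno;
	int	b_locked;
};

#define SKETCH_GB_LOCK_NOWAIT 0x0001	/* stand-in for GB_LOCK_NOWAIT */

/* Toy cache of four blocks; block 1 starts out locked by someone else. */
static struct sketch_buf sketch_cache[4] = {
	{ 0, 0 }, { 1, 1 }, { 2, 0 }, { 3, 0 }
};

/*
 * Full-argument variant.  With SKETCH_GB_LOCK_NOWAIT set it returns NULL
 * for a locked block instead of waiting for it.
 */
struct sketch_buf *
sketch_getblkf(long blkno, int flags)
{
	struct sketch_buf *bp;

	if (blkno < 0 || blkno >= 4)
		return (NULL);
	bp = &sketch_cache[blkno];
	if (bp->b_locked && (flags & SKETCH_GB_LOCK_NOWAIT) != 0)
		return (NULL);
	/*
	 * Kernel code would sleep here until the owner releases the
	 * buffer; the sketch pretends that wait has already completed.
	 */
	bp->b_locked = 1;
	return (bp);
}

/* Option A: macro wrapper.  No wrapper symbol exists in the object file. */
#define sketch_getblk_macro(blkno) sketch_getblkf((blkno), 0)

/* Option B: function wrapper.  Already-compiled callers can still link. */
struct sketch_buf *
sketch_getblk(long blkno)
{
	return (sketch_getblkf(blkno, 0));
}
```

With option A, code compiled against the old interface has nothing to resolve at link time, which is the concern raised about precompiled modules; option B keeps the symbol at the cost of one extra call. Changing every caller, as proposed in the message, removes the need for either wrapper.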
Cheers, Jeff To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Sun Mar 2 21:12:26 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E3AD737B401 for ; Sun, 2 Mar 2003 21:12:24 -0800 (PST) Received: from puffin.mail.pas.earthlink.net (puffin.mail.pas.earthlink.net [207.217.120.139]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5E7A043FB1 for ; Sun, 2 Mar 2003 21:12:24 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0459.cvx21-bradley.dialup.earthlink.net ([209.179.193.204] helo=mindspring.com) by puffin.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18piFN-00048y-00; Sun, 02 Mar 2003 21:12:18 -0800 Message-ID: <3E62E3B9.32887967@mindspring.com> Date: Sun, 02 Mar 2003 21:10:17 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Jeff Roberson Cc: arch@FreeBSD.ORG Subject: Re: New getblk parameter. References: <20030302232340.R84333-100000@mail.chesapeake.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a47470f5de6aafae7010021353932a0a7d666fa475841a1c7a350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Jeff Roberson wrote: > On Sun, 2 Mar 2003, Terry Lambert wrote: > > If you are not going to change all the calls to getblk(), it > > should probably be a wrapper function, or, minimally, an inline > > and a wrapper function. > > > > The reasoning is that it could be called from precompiled > > modules, so you want to leave a symbol visible for it, which > > defining it to getblkf(..., 0) doesn't do.
> > Precompiled modules are already going to be broken with the new locking > semantics. I think requiring them to recompile is OK. I intend to bump > the FreeBSD version if this goes in. > > I'd sort of like to change all the getblk() calls actually. If no one > strongly objects to that I'll do it. I'd personally prefer that to the getblkf() thing. If you aren't making the change for a reason, like not wanting to change the ABI, that's one thing; but if it's going to trade out symbol space and a name obfuscation, it's best to just change them all. By my count, there's only about 60 of them, and it's a mechanical change that would take less than a minute with a modified cscope that can do parameter addition of the zero, or 5 minutes manually. You should just "go for it" (assuming RE@ OK's it). -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 7:15:21 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id AFA4E37B401; Mon, 3 Mar 2003 07:15:18 -0800 (PST) Received: from grosbein.pp.ru (www2.svzserv.kemerovo.su [213.184.65.86]) by mx1.FreeBSD.org (Postfix) with ESMTP id 21CCB43FBD; Mon, 3 Mar 2003 07:15:08 -0800 (PST) (envelope-from eugen@grosbein.pp.ru) Received: from grosbein.pp.ru (smmsp@localhost [127.0.0.1]) by grosbein.pp.ru (8.12.7/8.12.7) with ESMTP id h23FF0W2001180; Mon, 3 Mar 2003 22:15:00 +0700 (KRAT) (envelope-from eugen@grosbein.pp.ru) Received: (from eugen@localhost) by grosbein.pp.ru (8.12.7/8.12.7/Submit) id h23FBcmB000870; Mon, 3 Mar 2003 22:11:38 +0700 (KRAT) Date: Mon, 3 Mar 2003 22:11:38 +0700 (KRAT) Message-Id: <200303031511.h23FBcmB000870@grosbein.pp.ru> To: FreeBSD-gnats-submit@freebsd.org Subject: [PATCH] The influence of /etc/start_ifname on /etc/rc.firewall is obscure and harmfull From: Eugene Grosbein Reply-To: Eugene Grosbein Cc: 
arch@freebsd.org X-send-pr-version: 3.113 X-GNATS-Notify: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG

>Submitter-Id: current-users
>Originator: Eugene Grosbein
>Organization: Svyaz Service JSC
>Confidential: no
>Synopsis: [PATCH] The influence of /etc/start_ifname on /etc/rc.firewall is obscure and harmfull
>Severity: serious
>Priority: low
>Category: misc
>Class: change-request
>Release: FreeBSD 4.8-PRERELEASE i386
>Environment:
System: FreeBSD grosbein.pp.ru 4.8-PRERELEASE FreeBSD 4.8-PRERELEASE #2: Sat Mar 1 21:20:16 KRAT 2003 eu@grosbein.pp.ru:/usr/local/obj/usr/local/src/sys/DADV i386

>Description:
Revision 1.13 of /etc/rc.firewall, made 5 years ago, introduced the ability to pass firewall_type as $1. This feature is not documented in the rc(8) man page. Meanwhile, /etc/rc.network invokes /etc/rc.firewall using the '.' command, so /etc/rc.firewall inherits $1. rc.network also invokes /etc/start_$ifname using '.'.

An unsuspecting administrator may write start_$ifname so that it sets the positional parameters. /etc/rc.firewall will then pick up $1 and ignore firewall_type from /etc/rc.conf. Most probably this results in a set of rules consisting only of the default rule, 'deny from any to any'. That is dangerous and might be hard to debug and recover from.

>How-To-Repeat:
Try to use an /etc/start_gre script like this one to assist WCCP:

#!/bin/sh

routers="1.2.3.4 5.6.7.8"	# WCCP-compatible gateways
wccp_int="fxp0"			# we try not to hardcode our IP
				# but autosense from /etc/rc.conf
eval set \$interface_$wccp_int	# generally, this is an easy way
my_ip=$2			# to get ip address of interface
				# from /etc/rc.conf
# configure tunnels
for ...

Nothing in the documentation warns against such constructions. Here we end up with $1="inet", and rc.firewall will NOT load the firewall rules if /etc/inet does not exist. That may be harmful.
The same applies to rc.firewall6, but I did not try it.

>Fix:
A decision has to be made: either correct rc(8) to warn administrators, or take some measure to prevent the problem. For example, it is possible to unset the positional parameters before running /etc/rc.firewall. Apply this patch to /etc:

--- rc.network.orig	Mon Mar 3 22:05:32 2003
+++ rc.network	Mon Mar 3 22:00:30 2003
@@ -330,6 +330,7 @@
 	case ${firewall_enable} in
 	[Yy][Ee][Ss])
 		if [ -r "${firewall_script}" ]; then
+			while shift 2>/dev/null; do :; done
 			. "${firewall_script}"
 			echo -n 'Firewall rules loaded, starting divert daemons:'
--- rc.network6.orig	Mon Mar 3 22:10:43 2003
+++ rc.network6	Mon Mar 3 22:10:33 2003
@@ -67,6 +67,7 @@
 	case ${ipv6_firewall_enable} in
 	[Yy][Ee][Ss])
 		if [ -r "${ipv6_firewall_script}" ]; then
+			while shift 2>/dev/null; do :; done
 			. "${ipv6_firewall_script}"
 			echo -n 'IPv6 Firewall rules loaded.'
 		elif [ "`ip6fw l 65535`" = "65535 deny ipv6 from any to any" ]; then

OTOH, one may wrap the invocation of rc.firewall[6] in a function.
Eugene Grosbein
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 10:39:48 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3C00337B401 for ; Mon, 3 Mar 2003 10:39:48 -0800 (PST) Received: from beastie.mckusick.com (beastie.mckusick.com [209.31.233.184]) by mx1.FreeBSD.org (Postfix) with ESMTP id B12D243FDF for ; Mon, 3 Mar 2003 10:39:47 -0800 (PST) (envelope-from mckusick@beastie.mckusick.com) Received: from beastie.mckusick.com (localhost [127.0.0.1]) by beastie.mckusick.com (8.12.3/8.12.3) with ESMTP id h23IafFL085350; Mon, 3 Mar 2003 10:36:41 -0800 (PST) (envelope-from mckusick@beastie.mckusick.com) Message-Id: <200303031836.h23IafFL085350@beastie.mckusick.com> To: Terry Lambert Subject: Re: New getblk parameter. Cc: Jeff Roberson , arch@FreeBSD.ORG In-Reply-To: Your message of "Sun, 02 Mar 2003 21:10:17 PST." <3E62E3B9.32887967@mindspring.com> Date: Mon, 03 Mar 2003 10:36:41 -0800 From: Kirk McKusick Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG I agree that changing all the getblk calls is preferable to adding the getblkf wrapper.
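[Editor's sketch] For readers weighing the alternatives discussed in this thread, the difference between a #define wrapper and a function wrapper can be shown in a few lines of userland C. This is only an illustration: toy_getblkf, toy_getblk_m and toy_getblk are hypothetical names, and the argument lists are reduced to two integers rather than the real getblk() parameters.

```c
#include <assert.h>

/*
 * Hypothetical full-argument function.  It packs its inputs into the
 * return value only so that the result is observable.
 */
int
toy_getblkf(int blkno, int flags)
{
	assert(blkno >= 0 && blkno < 0x10000);
	return ((flags << 16) | blkno);
}

/*
 * Option 1: macro wrapper.  Old source recompiles unchanged, but the
 * object file contains no symbol called toy_getblk_m, so nothing that
 * was compiled earlier can link against it.
 */
#define toy_getblk_m(blkno)	toy_getblkf((blkno), 0)

/*
 * Option 2: function wrapper.  A real symbol is emitted, so function
 * pointers and already-compiled callers still resolve.
 */
int
toy_getblk(int blkno)
{
	return (toy_getblkf(blkno, 0));
}
```

Changing every call site, the route favored in this thread, avoids both wrappers at the cost of requiring all callers to be recompiled.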
Kirk McKusick To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 12:15:11 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5378237B401 for ; Mon, 3 Mar 2003 12:15:10 -0800 (PST) Received: from mail.chesapeake.net (chesapeake.net [205.130.220.14]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7031043FA3 for ; Mon, 3 Mar 2003 12:15:09 -0800 (PST) (envelope-from jroberson@chesapeake.net) Received: from localhost (jroberson@localhost) by mail.chesapeake.net (8.11.6/8.11.6) with ESMTP id h23KF1v79344; Mon, 3 Mar 2003 15:15:01 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Mon, 3 Mar 2003 15:15:01 -0500 (EST) From: Jeff Roberson To: Kirk McKusick Cc: Terry Lambert , Subject: Re: New getblk parameter. In-Reply-To: <200303031836.h23IafFL085350@beastie.mckusick.com> Message-ID: <20030303151319.Q72102-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Ok, what about the substance of the patch? You're ok with this as well? On Mon, 3 Mar 2003, Kirk McKusick wrote: > I agree that changing all the getblk calls is preferable to > adding the getblkf wrapper.
> > Kirk McKusick > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-arch" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 12:24:24 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 086C537B405 for ; Mon, 3 Mar 2003 12:24:23 -0800 (PST) Received: from mail.chesapeake.net (chesapeake.net [205.130.220.14]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0CB5843F3F for ; Mon, 3 Mar 2003 12:24:22 -0800 (PST) (envelope-from jroberson@chesapeake.net) Received: from localhost (jroberson@localhost) by mail.chesapeake.net (8.11.6/8.11.6) with ESMTP id h23KOLG83934 for ; Mon, 3 Mar 2003 15:24:21 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Mon, 3 Mar 2003 15:24:21 -0500 (EST) From: Jeff Roberson To: arch@freebsd.org Subject: vtruncbuf() Message-ID: <20030303151503.N72102-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG vtruncbuf() does a few things that I'm not terribly certain I understand. I'm hoping someone can elaborate on this. Once we have eliminated all bufs that are above the truncation mark we do a sort of inline sync of all indirect blocks. Why do we have to do this sync? Is this required? If so, why don't we just fsync here? Or require the filesystem to do it. There are a few things that bother me about the current code there. Firstly, it makes assumptions about negative blknos. So this scheme doesn't work for filesystems that don't use this method for indexing their metadata. Secondly, it doesn't hold a lock while inspecting B_DELWRI. 
There is also a really weird check to see if the buf's vp matches the vp we're truncating. This doesn't really make sense since we just pulled this buf off of the dirty block lists for this vnode. Comments please? Thanks, Jeff To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 14:44:28 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4989A37B401 for ; Mon, 3 Mar 2003 14:44:26 -0800 (PST) Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) by mx1.FreeBSD.org (Postfix) with ESMTP id C9EAD43FDF for ; Mon, 3 Mar 2003 14:44:25 -0800 (PST) (envelope-from sean@perrin.int.nxad.com) Received: by perrin.int.nxad.com (Postfix, from userid 1001) id 772AF2107C; Mon, 3 Mar 2003 14:44:18 -0800 (PST) Date: Mon, 3 Mar 2003 14:44:18 -0800 From: Sean Chittenden To: freebsd-arch@FreeBSD.ORG Subject: Should sendfile() to return ENOBUFS? Message-ID: <20030303224418.GU79234@perrin.int.nxad.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="1hKfHPzOXWu1rh0v" Content-Disposition: inline User-Agent: Mutt/1.4i X-PGP-Key: finger seanc@FreeBSD.org X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 X-Web-Homepage: http://sean.chittenden.org/ Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG --1hKfHPzOXWu1rh0v Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable I've got a cluster of busy servers and have now exhausted the number of sf_buf's available. top(1) reports that applications using sendfile(2) are spending quite a bit of time in the 'sfbufa' state blocking even though the socket is non-blocking. 
I'd consider this a pretty nice bug and that sendfile(2) should return ENOBUFS instead of blocking on a non-blocking call. Right now if sf_buf_alloc() returns NULL, it is assumed that the call was sent a signal and was interrupted. So I have a two fold question: 1) Should sendfile(2) block on a non-blocking socket when there are no sf_buf's available? I don't think it should. sendfile(2) should return ENOBUFS and let the user land process continue working even though the kernel is constrained for sf_buf's. 2) Will changing the sendfile() call to return ENOBUFs break source compatibility across sendfile() implementations? I'm pretty amazed that other performance mongers haven't run across this problem and noticed this before under high load conditions. If there is a precedent, I'm tempted to suggest that we break ranks because: *) the current behavior is pretty clearly not "the right thing" *) is pretty clearly broken, and *) is causing me no end of headaches (using writev() is a terrible alternative to sendfile()). Comments? 
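[Editor's sketch] The behavior being proposed here can be modeled in userland as a fixed pool with a non-blocking allocator whose failure is turned into an error return. This is a toy, not the kernel's sf_buf code: toy_sf_pool, toy_sf_buf_alloc_nowait, toy_sf_buf_free, toy_send_page_nonblocking and the pool size TOY_NSFBUFS are all hypothetical, and the blocking path (sleeping until a buffer is freed) is deliberately left out. ENOBUFS is used because that is what this message proposes; which errno is appropriate is the open question of the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical pool size, kept tiny so exhaustion is easy to reach. */
#define TOY_NSFBUFS	4

struct toy_sf_pool {
	int in_use;	/* buffers currently handed out */
};

/* Non-blocking allocation: a slot number, or -1 when the pool is empty. */
int
toy_sf_buf_alloc_nowait(struct toy_sf_pool *p)
{
	assert(p != NULL);
	if (p->in_use >= TOY_NSFBUFS)
		return (-1);
	return (p->in_use++);
}

/* Hand one buffer back to the pool. */
void
toy_sf_buf_free(struct toy_sf_pool *p)
{
	assert(p != NULL && p->in_use > 0);
	p->in_use--;
}

/*
 * Send path for one page on a non-blocking socket.  Returns 0 on
 * success or an errno value; it never sleeps waiting for a buffer.
 */
int
toy_send_page_nonblocking(struct toy_sf_pool *p)
{
	if (toy_sf_buf_alloc_nowait(p) < 0)
		return (ENOBUFS);
	/* ... the page would be mapped and queued on the socket here ... */
	return (0);
}
```

In this model the caller stays free to service its other connections and retry once a buffer has been released.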
-sc --=20 Sean Chittenden --1hKfHPzOXWu1rh0v Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Comment: Sean Chittenden iD8DBQE+Y9rC3ZnjH7yEs0ERAl99AJ47H8wu0FBJ3Qqv9shnyMrqjgZEnQCglUXX 6/MGUFEN5qdDAFaq+pJsuxc= =Hmv+ -----END PGP SIGNATURE----- --1hKfHPzOXWu1rh0v-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 15:39:54 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A161837B401 for ; Mon, 3 Mar 2003 15:39:53 -0800 (PST) Received: from mail.chesapeake.net (chesapeake.net [205.130.220.14]) by mx1.FreeBSD.org (Postfix) with ESMTP id BD39443F75 for ; Mon, 3 Mar 2003 15:39:52 -0800 (PST) (envelope-from jroberson@chesapeake.net) Received: from localhost (jroberson@localhost) by mail.chesapeake.net (8.11.6/8.11.6) with ESMTP id h23NdqS73721 for ; Mon, 3 Mar 2003 18:39:52 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Mon, 3 Mar 2003 18:39:52 -0500 (EST) From: Jeff Roberson Cc: arch@FreeBSD.ORG Subject: Re: New getblk parameter. In-Reply-To: <200303031836.h23IafFL085350@beastie.mckusick.com> Message-ID: <20030303183921.D72102-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Ok. I'm going to commit this part and let the cluster patch shake out for another day or two and then commit it. Cheers, Jeff On Mon, 3 Mar 2003, Kirk McKusick wrote: > I agree that changing all the getblk calls is prefereable to > adding the getblkf wrapper. 
> > Kirk McKusick > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-arch" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 16:12:37 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9586C37B401; Mon, 3 Mar 2003 16:12:35 -0800 (PST) Received: from angelica.unixdaemons.com (angelica.unixdaemons.com [209.148.64.135]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2599C43FA3; Mon, 3 Mar 2003 16:12:33 -0800 (PST) (envelope-from hiten@angelica.unixdaemons.com) Received: from angelica.unixdaemons.com (localhost.unixdaemons.com [127.0.0.1]) by angelica.unixdaemons.com (8.12.7/8.12.1) with ESMTP id h240CUjL041476; Mon, 3 Mar 2003 19:12:30 -0500 (EST) Received: (from hiten@localhost) by angelica.unixdaemons.com (8.12.7/8.12.1/Submit) id h240CUOS041475; Mon, 3 Mar 2003 19:12:30 -0500 (EST) (envelope-from hiten) Date: Mon, 3 Mar 2003 19:12:30 -0500 From: Hiten Pandya To: Sean Chittenden Cc: arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? 
Message-ID: <20030304001230.GC36475@unixdaemons.com> References: <20030303224418.GU79234@perrin.int.nxad.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030303224418.GU79234@perrin.int.nxad.com> User-Agent: Mutt/1.4i X-Operating-System: FreeBSD i386 X-Public-Key: http://www.pittgoth.com/~hiten/pubkey.asc X-URL: http://www.unixdaemons.com/~hiten X-PGP: http://pgp.mit.edu:11371/pks/lookup?search=Hiten+Pandya&op=index Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Sean Chittenden (Mon, Mar 03, 2003 at 02:44:18PM -0800) wrote: > I've got a cluster of busy servers and have now exhausted the number > of sf_buf's available. top(1) reports that applications using > sendfile(2) are spending quite a bit of time in the 'sfbufa' state > blocking even though the socket is non-blocking. I'd consider this a > pretty nice bug and that sendfile(2) should return ENOBUFS instead of > blocking on a non-blocking call. Right now if sf_buf_alloc() returns > NULL, it is assumed that the call was sent a signal and was > interrupted. So I have a two fold question: > > 1) Should sendfile(2) block on a non-blocking socket when there are no > sf_buf's available? > > I don't think it should. sendfile(2) should return ENOBUFS and let > the user land process continue working even though the kernel is > constrained for sf_buf's. > > 2) Will changing the sendfile() call to return ENOBUFs break source > compatibility across sendfile() implementations? > > I'm pretty amazed that other performance mongers haven't run across > this problem and noticed this before under high load conditions. 
> > If there is a precedent, I'm tempted to suggest that we break ranks > because: > > *) the current behavior is pretty clearly not "the right thing" > *) is pretty clearly broken, and > *) is causing me no end of headaches (using writev() is a terrible > alternative to sendfile()). FWIW, there are many other parts of the sys/ tree where ENOBUFS or any error code is not returned in the case of mbuf allocation, or something on those lines, best candidate for this is the sys/nfsserver code. (Sorry for being a little offtopic...) Cheers. -- Hiten Pandya (hiten@unixdaemons.com, hiten@uk.FreeBSD.org) http://www.unixdaemons.com/~hiten/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 16:22:28 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C7D3E37B401 for ; Mon, 3 Mar 2003 16:22:26 -0800 (PST) Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5823643FBD for ; Mon, 3 Mar 2003 16:22:26 -0800 (PST) (envelope-from sean@perrin.int.nxad.com) Received: by perrin.int.nxad.com (Postfix, from userid 1001) id 728A82105E; Mon, 3 Mar 2003 16:22:18 -0800 (PST) Date: Mon, 3 Mar 2003 16:22:18 -0800 From: Sean Chittenden To: Hiten Pandya Cc: arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? 
Message-ID: <20030304002218.GY79234@perrin.int.nxad.com> References: <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="iq/fWD14IMVFWBCD" Content-Disposition: inline In-Reply-To: <20030304001230.GC36475@unixdaemons.com> User-Agent: Mutt/1.4i X-PGP-Key: finger seanc@FreeBSD.org X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 X-Web-Homepage: http://sean.chittenden.org/ Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG --iq/fWD14IMVFWBCD Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > FWIW, there are many other parts of the sys/ tree where ENOBUFS or > any error code is not returned in the case of mbuf allocation, or > something on those lines, best candidate for this is the > sys/nfsserver code. Are you suggesting that the parts of the sys/ tree that don't return ENOBUFS should? FWIW, all I'm advocating right now is to get sendfile() fixed up and taken care of just because it's biting me in the arse and making life hell at the moment. 
-sc --=20 Sean Chittenden --iq/fWD14IMVFWBCD Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Comment: Sean Chittenden iD8DBQE+Y/G63ZnjH7yEs0ERAgNmAJ4iwW+aPyS1IYZIIK+MhUPPAUAzTwCeIySQ TkAHPuFVdsKJ22agerK/qBM= =0V0O -----END PGP SIGNATURE----- --iq/fWD14IMVFWBCD-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 18:38:16 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id ADC4837B401 for ; Mon, 3 Mar 2003 18:38:14 -0800 (PST) Received: from puffin.mail.pas.earthlink.net (puffin.mail.pas.earthlink.net [207.217.120.139]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1387643F85 for ; Mon, 3 Mar 2003 18:38:14 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0240.cvx21-bradley.dialup.earthlink.net ([209.179.192.240] helo=mindspring.com) by puffin.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18q2Jb-0002hz-00; Mon, 03 Mar 2003 18:38:00 -0800 Message-ID: <3E641131.431A0BA8@mindspring.com> Date: Mon, 03 Mar 2003 18:36:33 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Sean Chittenden Cc: Hiten Pandya , arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? 
References: <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4281169632e6c678d01bfe2088d3be828a8438e0f32a48e08350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG

Sean Chittenden wrote:
> > FWIW, there are many other parts of the sys/ tree where ENOBUFS or
> > any error code is not returned in the case of mbuf allocation, or
> > something on those lines, best candidate for this is the
> > sys/nfsserver code.
>
> Are you suggesting that the parts of the sys/ tree that don't return
> ENOBUFS should?
>
> FWIW, all I'm advocating right now is to get sendfile() fixed up and
> taken care of just because it's biting me in the arse and making life
> hell at the moment. -sc

sendfile:

	When using a socket marked for non-blocking I/O, sendfile() may
	send fewer bytes than requested. In this case, the number of bytes
	successfully written is returned in *sbytes (if specified), and the
	error EAGAIN is returned.

This seems to indicate several things:

1)	The correct error is EAGAIN, *not* ENOBUFS

2)	You need to be damn sure you can guarantee a correct update
	of *sbytes; I believe this is very difficult in the case in
	question, which is why it blocks

3)	If sbytes is NULL, you should probably block, even on a
	non-blocking call. The reason for this is that there is
	no way for the application to restart without *sbytes

4)	If you get rid of the blocking with (sbytes == NULL), you
	better add a BUGS section to the manual page.
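[Editor's sketch] Points 2 and 3 above turn on how an application restarts after EAGAIN. The toy below models only that contract: a partial send reports its progress through *sbytes and fails with EAGAIN, and the caller advances its offset by that amount before retrying. toy_sock, toy_sendfile and toy_send_all are hypothetical names; FreeBSD's real sendfile(2) takes more arguments (file and socket descriptors, an optional header/trailer structure, flags), and a real caller would wait in select(2) or kqueue(2) before retrying rather than looping immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Toy socket that accepts at most `room` bytes per call. */
struct toy_sock {
	size_t room;	/* bytes accepted before the call "would block" */
	size_t total;	/* bytes accepted so far, for inspection */
};

/*
 * Toy stand-in for the sendfile(2) contract quoted above: on a short
 * send, store the byte count in *sbytes (if given) and fail with EAGAIN.
 */
int
toy_sendfile(struct toy_sock *s, off_t offset, size_t nbytes, off_t *sbytes)
{
	size_t n = nbytes < s->room ? nbytes : s->room;

	(void)offset;		/* the model has no real file behind it */
	s->total += n;
	if (sbytes != NULL)
		*sbytes = (off_t)n;
	if (n < nbytes) {
		errno = EAGAIN;
		return (-1);
	}
	return (0);
}

/*
 * Restart loop.  Returns the number of calls it took, or -1 on an
 * error other than EAGAIN.  Without sbytes the caller could not know
 * where to resume, which is the concern raised in point 3.
 */
int
toy_send_all(struct toy_sock *s, off_t offset, size_t nbytes)
{
	int calls = 0;

	assert(s != NULL && s->room > 0);	/* room == 0 would never finish */
	while (nbytes > 0) {
		off_t sent = 0;
		int rc = toy_sendfile(s, offset, nbytes, &sent);

		calls++;
		offset += sent;
		nbytes -= (size_t)sent;
		if (rc == -1 && errno != EAGAIN)
			return (-1);
	}
	return (calls);
}
```

A caller that passed NULL for sbytes would have to assume that nothing was sent, or give up on the connection, after any EAGAIN.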
Frankly I'm really surprised that you are blocking in this place; it indicates an inability to get a page in the kernel map in the sf zone, which, in turn, indicates that your NSFBUFS is improperly tuned; if you are using sendfile, and tune up your other kernel parameters for your system, don't forget NSFBUFS.

While you could *technically* make sf_buf_alloc() non-blocking, in general this would be a bad idea, given that the one place it's called is in an interior loop that can be the subject of a "goto" (so it's an embedded interior loop) in sendfile() itself. I think it would be very hard to satisfy #2, to allow it to be restartable by the application, in the face of failure, and since *sbytes is not a mandatory parameter, likely your application will end up barfing (e.g. sending partial FTP files or HTML documents down, with no way to recover from a failure, other than closing the client socket, and hoping the client can recover).

In a "flash crowd" case on an HTTP server, this basically means that you will continuously get retries, and the situation will worsen, exponentially, as people retry getting the same page. In the FTP case, or some other protocol without automatic retry on session abandonment, of course, it will be fatal.
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 20: 9:12 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E27D637B401 for ; Mon, 3 Mar 2003 20:09:08 -0800 (PST) Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) by mx1.FreeBSD.org (Postfix) with ESMTP id 35F0E43FAF for ; Mon, 3 Mar 2003 20:09:08 -0800 (PST) (envelope-from sean@perrin.int.nxad.com) Received: by perrin.int.nxad.com (Postfix, from userid 1001) id 85C6D21068; Mon, 3 Mar 2003 20:08:59 -0800 (PST) Date: Mon, 3 Mar 2003 20:08:59 -0800 From: Sean Chittenden To: Terry Lambert Cc: Hiten Pandya , arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? Message-ID: <20030304040859.GB79234@perrin.int.nxad.com> References: <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="8Bx+wEju+vH9ym24" Content-Disposition: inline In-Reply-To: <3E641131.431A0BA8@mindspring.com> User-Agent: Mutt/1.4i X-PGP-Key: finger seanc@FreeBSD.org X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 X-Web-Homepage: http://sean.chittenden.org/ Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG --8Bx+wEju+vH9ym24 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > sendfile: >=20 > When using a socket marked for non-blocking I/O, sendfile() may > send fewer bytes than requested. 
In this case, the number of
> bytes successfully written is returned in *sbytes (if
> specified), and the error EAGAIN is returned.
>
> This seems to indicate several things:
>
> 1) The correct error is EAGAIN, *not* ENOBUFS

EAGAIN/EWOULDBLOCK, I'm inclined to agree...

> 2) You need to be damn sure you can guarantee a correct update
> of *sbytes; I believe this is very difficult in the case in
> question, which is why it blocks

I'm not convinced of this. Have you poked through src/sys/kern/uipc_syscalls.c? It's not that ugly/hard, nothing's impossible with a bit of refactoring.

> 3) If sbytes is NULL, you should probably block, even on a
> non-blocking call. The reason for this is that there is
> no way for the application to restart without *sbytes

This degrades terribly though and if you get a spike in traffic, degradation of performance is critical. Going from a non-blocking application to a blocking call simply because of high use is murderous and is justification in itself enough for me to move away from the really nice zero-copy sockets that sendfile() affords me, back to the sluggish writev() syscall. If a system is busy, it's stuck in an sfbufa state and blocks the server from servicing thousands of connections. The symptoms are common and synonymous with mbuf exhaustion or any other kind of buffer exhaustion... my point is that having this block is the worst way that sendfile() can degrade under high performance.

> 4) If you get rid of the blocking with (sbytes == NULL), you
> better add a BUGS section to the manual page.

There's nothing that says that sbytes can't be set to 0 if errno is EAGAIN, in fact, that's what it does right now.
> Frankly I'm really surprised that you are blocking in this place; it > indicates an inability to get a page in the kernel map in the sf > zone, which, in turn, indicates that your NSFBUFS is improperly > tuned; if you are using sendfile, and tune up your other kernel > parameters for your system, don't forget NSFBUFS. Well, it's set to 65535 at the moment. How much higher you think I should set it? :-] At some point I have to say, "it's high enough and I just need to get the application to degrade gracefully." :-] > While you could *technically* make sf_buf_alloc() non-blocking, in > general this would be a bad idea, given that the one place it's > called is in in interior loop that can be the subject of a "goto" > (so it's an embedded interior loop) in sendfile() itself. I think > it would be very hard to satisfy #2, to allow it to be restartable > by the application, in the face of failure, and since *sbytes is not > a mandatory parameter, likely your application will end up barfing > (e.g. sending partial FTP files or HTML documents down, with no way > to recover from a failure, other than closing the client socket, and > hoping the client can recover). Frankly, if a developer is stupid enough to pass in NULL for sbytes, they get what they deserve. Returning -1 and setting errno to EAGAIN in the event that there aren't any sf_buf's available isn't what I'd call the programming exercise of the decade. :-P > In a "flash crowd" case on an HTTP server, this basically means that > you will continuously get retries, and the situation will worsen, > exponentially, as people retry getting the same page. In the FTP > case, or some other protocol without automatic retry on session > abandonment, of course, it will be fatal. Hrm, let me redefine "fatal" as "changing the behavior of a system call to go from returning in less than 0.001ms, to returning in 2-15s for every connection when trying to make over ~500K sendfile(2) calls a second." 
I'd call that a catastrophic failure to degrade successfully. -sc --=20 Sean Chittenden --8Bx+wEju+vH9ym24 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Comment: Sean Chittenden iD8DBQE+ZCbb3ZnjH7yEs0ERAk3mAKCTIVw1wlkEppN9MlKOvgcjGROfbQCgyjlj ihQpNHXryGSGT/JMcV81SQI= =frrn -----END PGP SIGNATURE----- --8Bx+wEju+vH9ym24-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 21:42:58 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 65F1E37B401 for ; Mon, 3 Mar 2003 21:42:57 -0800 (PST) Received: from bricore.com (adsl-64-168-71-68.dsl.snfc21.pacbell.net [64.168.71.68]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9F26943F3F for ; Mon, 3 Mar 2003 21:42:56 -0800 (PST) (envelope-from lchen@briontech.com) Received: from luoqi (luoqi.bricore.com [192.168.1.63]) by bricore.com (8.12.6/8.12.6) with SMTP id h245gY3r011472; Mon, 3 Mar 2003 21:42:52 -0800 (PST) (envelope-from lchen@briontech.com) From: "Luoqi Chen" To: "Jeff Roberson" , Subject: RE: vtruncbuf() Date: Mon, 3 Mar 2003 21:46:39 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 In-Reply-To: <20030303151503.N72102-100000@mail.chesapeake.net> X-Virus-Scanned: by amavisd-milter (http://amavis.org/) Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG > vtruncbuf() does a few things that I'm not terribly certain I understand. > I'm hoping someone can elaborate on this. 
> > Once we have eliminated all bufs that are above the truncation mark we do > a sort of inline sync of all indirect blocks. Why do we have to do this > sync? Is this required? If so, why don't we just fsync here? Or > require the > filesystem to do it. There are a few things that bother me about the > current code there. > I think the idea was to avoid calling fsync. > Firstly, it makes assumptions about negative blknos. So this scheme > doesn't work for filesystems that don't use this method for indexing > their metadata. The code is a little ufs specific, but should still work for other FS -- it doesn't hurt to write out dirty bufs. > Secondly, it doesn't hold a lock while inspecting > B_DELWRI. > It's intentional, see below... > There is also a really weird check to see if the buf's vp matches the vp > we're truncating. This doesn't really make sense since we just > pulled this > buf off of the dirty block lists for this vnode. > ..., the buf is not locked, remember :) > Comments please? 
> > Thanks, > Jeff > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 22: 7:28 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 88DAE37B401 for ; Mon, 3 Mar 2003 22:07:26 -0800 (PST) Received: from mail.chesapeake.net (chesapeake.net [205.130.220.14]) by mx1.FreeBSD.org (Postfix) with ESMTP id AC63E43F75 for ; Mon, 3 Mar 2003 22:07:25 -0800 (PST) (envelope-from jroberson@chesapeake.net) Received: from localhost (jroberson@localhost) by mail.chesapeake.net (8.11.6/8.11.6) with ESMTP id h24676159798; Tue, 4 Mar 2003 01:07:06 -0500 (EST) (envelope-from jroberson@chesapeake.net) Date: Tue, 4 Mar 2003 01:07:06 -0500 (EST) From: Jeff Roberson To: Luoqi Chen Cc: arch@FreeBSD.ORG Subject: RE: vtruncbuf() In-Reply-To: Message-ID: <20030304010228.P72102-100000@mail.chesapeake.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG > > vtruncbuf() does a few things that I'm not terribly certain I understand. > > I'm hoping someone can elaborate on this. > > > I think the idea was to avoid calling fsync. Why does it need to be synced at all? > > Firstly, it makes assumptions about negative blknos. So this scheme > > doesn't work for filesystems that don't use this method for indexing > > their metadata. > The code is a little ufs specific, but should still work for other FS > -- it doesn't hurt to write out dirty bufs. No, but I'm not sure how it helps either. > > Secondly, it doesn't hold a lock while inspecting > > B_DELWRI. > > > It's intentional, see below... It's an optimization. > > There is also a really weird check to see if the buf's vp matches the vp > > we're truncating. 
This doesn't really make sense since we just > > pulled this > > buf off of the dirty block lists for this vnode. > > > ..., the buf is not locked, remember :) Yes, but you were guaranteed that it wouldn't have migrated to a new vp even in RELENG_4. The whole thing happens at splbio(). In current Giant makes that guarantee and now the vnode interlock does as well. The thing that you aren't guaranteed now is whether or not DELWRI is still valid. You can be certain that UFS won't have negative blocks locked at this point though because the vnode lock is held. So this lock should always succeed anyway. I hadn't thought about that until just now. I may just test with that lock moved up and ignore the issue unless some one gives me some compelling reason to do otherwise. > > Comments please? > > > > Thanks, > > Jeff > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Mar 3 23:17:41 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 41DF837B401 for ; Mon, 3 Mar 2003 23:17:36 -0800 (PST) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6239643FA3 for ; Mon, 3 Mar 2003 23:17:35 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0299.cvx40-bradley.dialup.earthlink.net ([216.244.43.44] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18q6g3-0003Ys-00; Mon, 03 Mar 2003 23:17:28 -0800 Message-ID: <3E6452B4.E87BEC2@mindspring.com> Date: Mon, 03 Mar 2003 23:16:04 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Sean Chittenden Cc: Hiten Pandya , arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? 
References: <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4b39a621f7443ebf29d5b69aacea76e67a8438e0f32a48e08350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Sean Chittenden wrote: > > 2) You need to be damn sure you can guarantee a correct update > > of *sbytes; I believe this is very difficult in the case in > > question, which is why it blocks > > I'm not convinced of this. Have you poked through > src/sys/kern/uipc_syscalls.c? It's not that ugly/hard, nothing's > impossible with a bit of refactoring. I've done this. I've ported the -current sendfile external buffer code to FreeBSD 4.3, and again to FreeBSD 4.4, etc.. I'm rather familiar with it, actually... > > 3) If sbytes is NULL, you should probably block, even on a > > non-blocking call. The reason for this is that there is > > no way for the application to restart without *sbytes > > This degrades terribly though and if you get a spike in traffic, > degradation of performance is critical. Sendfile degrades terribly under traffic spikes, period. One thing sendfile fails to do is honor the so_snd size limits that other things honor, as it goes through its loop. Technically, sendfile should be an async interface so it can lock the so_snd window to the buffers-in-flight. If it did this, it could preallocate the memory at the time it's called, and then reuse it internally until the operation has been completed. Then it could write it's completion status. 
> Going from a non-blocking application to a blocking call simply > because of high use is murderous and is justification in itself > enough for me to move away from the really nice zero-copy sockets > that sendfile() affords me, back to the sluggish writev() syscall. For POP3 and SMTP, and most other RFC822 derived protocols, you end up having to store your files with <CR><LF> line delimiters, instead of <LF>. For FTP, you can only do binary transfers, etc. The sendfile interface is just a bad design, period. That it performs badly under load is just icing on the cake. > If a system is busy, it's stuck in an sfbufa state and blocks the > server from servicing thousands of connections. I understand. > The symptoms are common and synonymous with mbuf exhaustion or any > other kind of buffer exhaustion... my point is that having this > block is the worst way that sendfile() can degrade under > high performance. Dijkstra: preallocate your resources, and you do not have this problem. In this case, set your tunable high enough that even were you to use up all your available buffers, there are NSFBUFS available... and the problem goes away. > > 4) If you get rid of the blocking with (sbytes == NULL), you > > better add a BUGS section to the manual page. > > There's nothing that says that sbytes can't be set to 0 if errno is > EAGAIN, in fact, that's what it does right now. If you send a non-zero amount of data, you need to know exactly what was sent, in order to maintain connection state data pipe coherency between the user space application requesting the send on a connection basis, and the kernel space code that has done a partial send. Given your statement, though, we can say pretty surely that this is HTTP... Any other approach, and your only option to recover your state is to close the connection and make the client retry.
So in the situation where the resources are limited, you end up *increasing* the overall load by, instead of satisfying a client with a single request, converting that into 5 requests, all of which fail to deliver the data to the client. > > Frankly I'm really surprised that you are blocking in this place; it > > indicates an inability to get a page in the kernel map in the sf > > zone, which, in turn, indicates that your NSFBUFS is improperly > > tuned; if you are using sendfile, and tune up your other kernel > > parameters for your system, don't forget NSFBUFS. > > Well, it's set to 65535 at the moment. How much higher you think I > should set it? :-] At some point I have to say, "it's high enough and > I just need to get the application to degrade gracefully." :-] The sendfile interface does not degrade gracefully, period. Even if you dealt with the issue by setting *sbytes correctly in all cases, and returning the right value to use space, you've increased the number of system calls, potentially significantly. So even if you "correct" the behaviour, your degradation is going to be exponential. One potential solution is to go to using KSE's, so that the blocking context is not your whole process. This allows you to write the server as multithreaded. Another is to do what Apache does, and run processes per connection. My recommendation was (and is): get a sufficiently large NSFBUFS in the first place, so you never encounter the situation that results in the non-graceful degradation. > > While you could *technically* make sf_buf_alloc() non-blocking, in > > general this would be a bad idea, given that the one place it's > > called is in in interior loop that can be the subject of a "goto" > > (so it's an embedded interior loop) in sendfile() itself. 
I think > > it would be very hard to satisfy #2, to allow it to be restartable > > by the application, in the face of failure, and since *sbytes is not > > a mandatory parameter, likely your application will end up barfing > > (e.g. sending partial FTP files or HTML documents down, with no way > > to recover from a failure, other than closing the client socket, and > > hoping the client can recover). > > Frankly, if a developer is stupid enough to pass in NULL for sbytes, > they get what they deserve. Returning -1 and setting errno to EAGAIN > in the event that there aren't any sf_buf's available isn't what I'd > call the programming exercise of the decade. :-P Nevertheless, the sendfile interface appears to allow this situation; it is a flaw in the API design. There are two ways to handle it: 1) Any time you call sendfile on a non-blocking fd with (sbytes == NULL), *immediately* return EPARM or a similar error 2) Allow the API to be inconsistent, and then have the OS accept the blame for broken applications, since it permits known broken parameter values > > In a "flash crowd" case on an HTTP server, this basically means that > > you will continuously get retries, and the situation will worsen, > > exponentially, as people retry getting the same page. In the FTP > > case, or some other protocol without automatic retry on session > > abandonment, of course, it will be fatal. > > Hrm, let me redefine "fatal" as "changing the behavior of a system > call to go from returning in less than 0.001ms, to returning in 2-15s > for every connection when trying to make over ~500K sendfile(2) calls > a second." I'd call that a catastrophic failure to degrade > successfully. -sc "Fatal" in this context was intended to imply "the clients do not get their data, and get partial data and closed descriptors, instead, thus breaking the contract between the client and the server". And yeah, either way you look at it, it's a failure to degrade gracefully... 
once again: the easy fix is to not put your system in that position in the first place. A less easy approach would be to maintain a count of active sendfile instances in your application, and queue up requests above some high watermark, rather than making system calls. Another would be to hard limit the number of client connections you allow at once, etc. The least ugly of these (to my mind) is to not overcommit NSFBUFS in the first place by always having at least 1 more than you could ever need, preconfigured into the kernel. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 0:13:44 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2617937B401 for ; Tue, 4 Mar 2003 00:13:38 -0800 (PST) Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) by mx1.FreeBSD.org (Postfix) with ESMTP id 43DE043FA3 for ; Tue, 4 Mar 2003 00:13:37 -0800 (PST) (envelope-from sean@perrin.int.nxad.com) Received: by perrin.int.nxad.com (Postfix, from userid 1001) id 0EFCC21078; Tue, 4 Mar 2003 00:13:27 -0800 (PST) Date: Tue, 4 Mar 2003 00:13:26 -0800 From: Sean Chittenden To: Terry Lambert Cc: Hiten Pandya , arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS?
Message-ID: <20030304081326.GD79234@perrin.int.nxad.com> References: <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com> <3E6452B4.E87BEC2@mindspring.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="kVhvBuyIzNBvw9vr" Content-Disposition: inline In-Reply-To: <3E6452B4.E87BEC2@mindspring.com> User-Agent: Mutt/1.4i X-PGP-Key: finger seanc@FreeBSD.org X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 X-Web-Homepage: http://sean.chittenden.org/ Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG --kVhvBuyIzNBvw9vr Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > > > 2) You need to be damn sure you can guarantee a correct update > > > of *sbytes; I believe this is very difficult in the case in > > > question, which is why it blocks > >=20 > > I'm not convinced of this. Have you poked through > > src/sys/kern/uipc_syscalls.c? It's not that ugly/hard, nothing's > > impossible with a bit of refactoring. >=20 > I've done this. I've ported the -current sendfile external buffer > code to FreeBSD 4.3, and again to FreeBSD 4.4, etc.. I'm rather > familiar with it, actually... Excellent... I know you've done stuff with large numbers of TCP connections in the past so this doesn't really surprise me all that much. Suggestions welcome. > > > 3) If sbytes is NULL, you should probably block, even on a > > > non-blocking call. The reason for this is that there is > > > no way for the application to restart without *sbytes > >=20 > > This degrades terribly though and if you get a spike in traffic, > > degradation of performance is critical. 
>=20 > Sendfile degrades terribly under traffic spikes, period. One thing > sendfile fails to do is honor the so_snd size limits that other > things honor, as it goes through its loop. Much to my dismay and frustration, I'm discovering this... is there a better zero-copy socket file operation that can be used in place of sendfile()? Alfred's mentioned something called kblob a few times but I haven't been able to dig up anything on it other than an old arch@ discussion where it was shot down (unfortunately). > Technically, sendfile should be an async interface so it can lock > the so_snd window to the buffers-in-flight. If it did this, it > could preallocate the memory at the time it's called, and then > reuse it internally until the operation has been completed. Then > it could write it's completion status. I haven't spent more than a few seconds thinking about this, but wouldn't that require more mbufclusters to be in use but idle at any given time than the current implementation? > > Going from a non-blocking application to a blocking call simply > > because of high use is murderous and is justification in itself > > enough for me to move away from the really nice zero-copy sockets > > that sendfile() affords me, back to the sluggish writev() syscall. >=20 > For POP3 and SMTP, and most other RFC822 derived protocols, you end > up having to store your files with line delimiters, instead > of . For FTP, you can only do binary transfers, etc.. The > sendfile interface is just a bad design, period. >=20 > That it performs badly under load is just icing on the cake. I don't quite understand what you're trying to say here. What's the correlation between / and system calls? CR+LF is always read/written as two bytes... I must be missing the point of your comment. > > If a system is busy, it's stuck in an sfbufa state and blocks the > > server from servicing thousands of connections. >=20 > I understand. Groovy: that's a third of the problem, what's the elegant solution? 
> > The symptoms are common and synonymous with mbuf exhaustion or any > > other kind of buffer exhaustion... my point is that having this > > block is the worst way that sendfile() can degrade under > > high performance. >=20 > Djikstra: preallocate your resources, and you do not have this > problem. In this case, set your tunable high enough that even > were you to use up all your available buffers, there are NSFBUFS > available... and the problem goes away. I keep chasing this upper bound and pushing things higher and higher because sendfile() doesn't degrade worth beans... well, that's a hack and not a solution. The TCP stack, VM, and my general setup has scaled quite well. The 1st thing to go, however, is the number of sf_buf's. I'm worried I'm going to run out of KVM here in the near future (and at that point, life basically begins to suck given my RAM requirements are all over the place, 64bit platforms other than the alpha aren't ready for prime time quite yet, and BSD has a hard kernel memory split that isn't dynamic). > > > 4) If you get rid of the blocking with (sbytes =3D=3D NULL), you > > > better add a BUGS section to the manual page. > >=20 > > There's nothing that says that sbytes can't be set to 0 if errno > > is EAGAIN, in fact, that's what it does right now. >=20 > If you send a non-zero amount of data, you need to know exactly what > was sent, in order to maintain connection state data pipe coherency > between the user space application requesting the send on a > connection basis, and the kernel space code that has done a partial > send. ::nods:: That's a given. > Given your statement, though, we can say pretty surely that this is > HTTP... ::nods:: After some processing, I need to send a file as fast and efficient as I can. Moving to sendfile() saved me gobs of CPU cycles and now things hover down below 15% CPU time. > Any other approach, and your only option to recover your state is to > close the connection and make the client retry. 
Agreed, but that's a non-option when trying to deliver a high level of reliability. HTTP doesn't handle that so well. > So in the situation where the resources are limited, you end up > *increasing* the overall load by, instead of satisfying a client > with a single request, converting that into 5 requests, all of which > fail to deliver the data to the client. But 'ya see, I wouldn't mind that at all: I'm not CPU bound and can afford the extra context switches back and forth from the user space. I'd bet dime to dollar that people who use sendfile(2) aren't CPU bound: they're IO/sf_buf bound. Sure having sendfile() return EAGAIN will drive up the number of calls under high load, but I'd rather burn a few more cycles swapping contexts than I would getting stuck in a spin lock waiting for the required number of sf_buf's to become available. If I've got a connection queue of 60K, I want to free up as many connections as I can as fast as I can which makes sleeping the worst thing I can do because the contentions in queue just pile up. A userland spin lock is going to result in a more responsive application than a kernel spin lock since the userland app will loop through the connection queue and free up sf_buf's as data gets sent out over the pipe (something that won't happen when stuck in msleep() in the kernel's spin lock). > > Well, it's set to 65535 at the moment. How much higher you think > > I should set it? :-] At some point I have to say, "it's high > > enough and I just need to get the application to degrade > > gracefully." :-] >=20 > The sendfile interface does not degrade gracefully, period. Even if > you dealt with the issue by setting *sbytes correctly in all cases, > and returning the right value to use space, you've increased the > number of system calls, potentially significantly. So even if you > "correct" the behaviour, your degradation is going to be > exponential. 
::nods:: But as stated above, there are worse things that can be done, most notably, blocking and letting connections pile up. > One potential solution is to go to using KSE's, so that the blocking > context is not your whole process. This allows you to write the > server as multithreaded. Another is to do what Apache does, and run > processes per connection. I'm antsy as hell to convert my apps to use KSE for this very reason, but I'm going to give myself a few more months before I turn the life blood of my business over to KSE. > My recommendation was (and is): get a sufficiently large NSFBUFS in > the first place, so you never encounter the situation that results > in the non-graceful degradation. That's not a solution though, that's a work around/hack. :-] I've hacked/worked around, but I need a solution. Making sendfile(2) "do the right thing(TM)" I thought was the solution (still do). > > Frankly, if a developer is stupid enough to pass in NULL for sbytes, > > they get what they deserve. Returning -1 and setting errno to EAGAIN > > in the event that there aren't any sf_buf's available isn't what I'd > > call the programming exercise of the decade. :-P >=20 > Nevertheless, the sendfile interface appears to allow this > situation; it is a flaw in the API design. There are two ways to > handle it: >=20 > 1) Any time you call sendfile on a non-blocking fd with > (sbytes =3D=3D NULL), *immediately* return EPARM or a > similar error I'm less than wild about that since that breaks POLA with existing code. There's no harm in making something more unknown when it's already unknown. > 2) Allow the API to be inconsistent, and then have the OS > accept the blame for broken applications, since it permits > known broken parameter values I don't follow... how would this fix anything? I don't understand why this would be necessary given what I'd proposed/suggested earlier. 
> > Hrm, let me redefine "fatal" as "changing the behavior of a system > > call to go from returning in less than 0.001ms, to returning in > > 2-15s for every connection when trying to make over ~500K > > sendfile(2) calls a second." I'd call that a catastrophic failure > > to degrade successfully. -sc > > "Fatal" in this context was intended to imply "the clients do not > get their data, and get partial data and closed descriptors, > instead, thus breaking the contract between the client and the > server". > > And yeah, either way you look at it, it's a failure to degrade > gracefully... once again: the easy fix is to not put your system in > that position in the first place. Lol! I wish I had that as an option. Near infinite demand doesn't give me this luxury. > A less easy approach would be to maintain a count of active sendfile > instances in your application, and queue up requests above some high > watermark, rather than making system calls. Another would be to > hard limit the number of client connections you allow at once, etc. > The least ugly of these (to my mind) is to not overcommit NSFBUFS in > the first place by always having at least 1 more than you could ever > need, preconfigured into the kernel. I'd actually thought about having my application do this on the fly and automatically tune itself based on the number of free sf_buf's, but this brings up another problem with sendfile(2): there's no way of determining how many sf_buf's are in use at any given time and on -STABLE, you can't even read the number of sf_buf's allocated (kern.ipc.nsfbufs). :-/ Other suggestions welcome including, "leave sendfile() alone, hack up a new interface."
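For reference, the "count of active sendfile instances with a high watermark" idea quoted above might look something like the following in the application. It is a rough sketch under stated assumptions: struct sf_throttle and its functions are invented names, and since (as noted above) the application cannot see how many sf_buf's are actually in use, high_water is a guessed limit rather than a measured one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application-side throttle; nothing here is a kernel API. */
struct sf_throttle {
    size_t active;      /* sendfile operations currently in flight */
    size_t high_water;  /* stop issuing new calls at this count */
    size_t deferred;    /* requests parked instead of issued */
};

static void sf_throttle_init(struct sf_throttle *t, size_t high_water)
{
    t->active = 0;
    t->high_water = high_water;
    t->deferred = 0;
}

/* Returns 1 if the caller may issue sendfile() now, 0 if the request
 * should be queued in user space instead of entering the kernel. */
static int sf_throttle_enter(struct sf_throttle *t)
{
    if (t->active >= t->high_water) {
        t->deferred++;
        return 0;
    }
    t->active++;
    return 1;
}

/* Call when a transfer finishes. Returns 1 if a parked request should
 * be started now (the freed slot is handed straight to it), else 0. */
static int sf_throttle_leave(struct sf_throttle *t)
{
    assert(t->active > 0);
    t->active--;
    if (t->deferred > 0) {
        t->deferred--;
        t->active++;
        return 1;
    }
    return 0;
}
```

The trade-off is the one already under discussion: requests above the mark wait in a user-space queue the event loop can still service, instead of sleeping in the kernel in the sfbufa state.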
-sc --=20 Sean Chittenden --kVhvBuyIzNBvw9vr Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Comment: Sean Chittenden iD8DBQE+ZGAm3ZnjH7yEs0ERAlZWAJ42VRWSXW7clFjsbduZnqKHI6t5qACgkhly IqynnFEy7FaE58AqQi8omZw= =F8J4 -----END PGP SIGNATURE----- --kVhvBuyIzNBvw9vr-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 2: 3:49 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8C3D537B401; Tue, 4 Mar 2003 02:03:47 -0800 (PST) Received: from park.rambler.ru (park.rambler.ru [81.19.64.101]) by mx1.FreeBSD.org (Postfix) with ESMTP id C4AB243FB1; Tue, 4 Mar 2003 02:03:41 -0800 (PST) (envelope-from is@rambler-co.ru) Received: from is.park.rambler.ru (is.park.rambler.ru [81.19.64.102]) by park.rambler.ru (8.12.6/8.12.6) with ESMTP id h24A3cmF081402; Tue, 4 Mar 2003 13:03:38 +0300 (MSK) Date: Tue, 4 Mar 2003 13:03:38 +0300 (MSK) From: Igor Sysoev X-Sender: is@is To: Sean Chittenden Cc: freebsd-arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? In-Reply-To: <20030303224418.GU79234@perrin.int.nxad.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 3 Mar 2003, Sean Chittenden wrote: > I've got a cluster of busy servers and have now exhausted the number > of sf_buf's available. top(1) reports that applications using > sendfile(2) are spending quite a bit of time in the 'sfbufa' state > blocking even though the socket is non-blocking. I'd consider this a > pretty nice bug and that sendfile(2) should return ENOBUFS instead of > blocking on a non-blocking call. 
Right now if sf_buf_alloc() returns > NULL, it is assumed that the call was sent a signal and was > interrupted. So I have a two fold question: > > 1) Should sendfile(2) block on a non-blocking socket when there are no > sf_buf's available? > > I don't think it should. sendfile(2) should return ENOBUFS and let > the user land process continue working even though the kernel is > constrained for sf_buf's. > > 2) Will changing the sendfile() call to return ENOBUFS break source > compatibility across sendfile() implementations? sendfile() can block on a non-blocking socket in at least two cases: 1) sf_buf exhaustion (as you have described), 2) and reading a file page from the disk. In these cases we could return: 1) EAGAIN for both cases, 2) or ENOBUFS for sf_buf exhaustion and EAGAIN (or a new special error) for reading a file page (after initiating the disk transfer, of course). ENOBUFS (and the new special error) could be enabled via the sendfile() flags parameter, which is currently unused and should be zero, so we would not break source or even binary compatibility. But there is one problem: how to notify user-level code about the readiness of these events, especially with EAGAIN? select()/poll()/kevent() (or kevent() only) should check not only socket readiness but also readiness of the read operation (if it was initiated) and free sf_buf's (if we exhausted them last time).
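To make option 2 concrete, here is a small sketch of how user-level code might tell the two conditions apart. It assumes the proposed (not existing) behaviour where ENOBUFS means sf_buf exhaustion and EAGAIN means the socket would block; enum next_step and classify_result are made-up names. Since nothing currently signals when sf_buf's become free, the STEP_WAIT_BUFFER case would have to be a timer or backoff in practice.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical decision made by the caller after one sendfile() call. */
enum next_step {
    STEP_DONE,         /* everything was sent */
    STEP_WAIT_SOCKET,  /* wait for writability via select/poll/kevent */
    STEP_WAIT_BUFFER,  /* sf_buf's exhausted: no readiness event exists,
                          so back off and retry later */
    STEP_FAIL          /* unrecoverable for this connection */
};

/* rc is the return value of the call, err the errno saved right after. */
static enum next_step classify_result(int rc, int err)
{
    if (rc == 0)
        return STEP_DONE;
    switch (err) {
    case EAGAIN:
        return STEP_WAIT_SOCKET;
    case ENOBUFS:
        return STEP_WAIT_BUFFER;
    default:
        return STEP_FAIL;
    }
}
```

With option 1 (EAGAIN for both cases) the two middle branches collapse into one, and the caller cannot know which kind of readiness to wait for.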
Igor Sysoev http://sysoev.ru From owner-freebsd-arch Tue Mar 4 2:40:42 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3D95137B401; Tue, 4 Mar 2003 02:40:41 -0800 (PST) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id D00A443FDF; Tue, 4 Mar 2003 02:40:39 -0800 (PST) (envelope-from bde@zeta.org.au) Received: from gamplex.bde.org (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3/8.8.7) with ESMTP id VAA31947; Tue, 4 Mar 2003 21:40:26 +1100 Date: Tue, 4 Mar 2003 21:42:18 +1100 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: Terry Lambert Cc: Sean Chittenden , Hiten Pandya , , Subject: Re: Should sendfile() to return ENOBUFS? In-Reply-To: <3E641131.431A0BA8@mindspring.com> Message-ID: <20030304210901.L37414-100000@gamplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Mon, 3 Mar 2003, Terry Lambert wrote: > sendfile: > > When using a socket marked for non-blocking I/O, sendfile() may send > fewer bytes than requested. In this case, the number of bytes > successfully written is returned in *sbytes (if specified), and the error EAGAIN > is returned. > > This seems to indicate several things: > > 1) The correct error is EAGAIN, *not* ENOBUFS I agree. There seem to be more bugs near here: - sendfile(9) can return ENOBUFS for some other resource shortages, but this is undocumented for sendfile(2). - the wait can now be terminated by a signal. Then sendfile(9) returns the undocumented error EINTR. - sendfile(9) is never restarted after it is terminated by a signal.
I don't know if it can be, but restarting is easier for applications to deal with so it should be done if possible. - sendfile(2) has a better way of reporting short counts than write(2). This seems to work for all types of errors, but this is only documented to work for EAGAIN errors. > 2) You need to be damn sure you can guarantee a correct update > of *sbytes; I believe this is very difficult in the case in > question, which is why it blocks No, since it just waits for a buffer and knows how much it already sent. > 3) If sbytes is NULL, you should probably block, even on a > non-blocking call. The reason for this is that there is > no way for the application to restart without *sbytes No, since a NULL sbytes means that the application doesn't care how many bytes got sent if there was an error. Such applications already don't handle other types of error. Bruce To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 10: 2:13 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4759437B43B for ; Tue, 4 Mar 2003 10:02:01 -0800 (PST) Received: from heron.mail.pas.earthlink.net (heron.mail.pas.earthlink.net [207.217.120.189]) by mx1.FreeBSD.org (Postfix) with ESMTP id 37B6F43FA3 for ; Tue, 4 Mar 2003 10:01:59 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0410.cvx21-bradley.dialup.earthlink.net ([209.179.193.155] helo=mindspring.com) by heron.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18qGjZ-0007KQ-00; Tue, 04 Mar 2003 10:01:46 -0800 Message-ID: <3E64E9B8.EDCA54FE@mindspring.com> Date: Tue, 04 Mar 2003 10:00:24 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Sean Chittenden Cc: Hiten Pandya , arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? 
References: <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com> <3E6452B4.E87BEC2@mindspring.com> <20030304081326.GD79234@perrin.int.nxad.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a424144d60a93d476d00dc4a5d766aae5a387f7b89c61deb1d350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Sean Chittenden wrote: > > Sendfile degrades terribly under traffic spikes, period. One thing > > sendfile fails to do is honor the so_snd size limits that other > > things honor, as it goes through its loop. > > Much to my dismay and frustration, I'm discovering this... is there a > better zero-copy socket file operation that can be used in place of > sendfile()? Alfred's mentioned something called kblob a few times but > I haven't been able to dig up anything on it other than an old arch@ > discussion where it was shot down (unfortunately). Personally, I would probably just add a flag to sendfile, and treat *sbytes as an opaque pointer that caused a kevent back on completion of the transmission. You would still need to add queues of blocking (*not* sleep!) contexts, but it could be done rather quickly. This is more expedient than I usually like to be, but, IMO, sendfile() is a lost cause, and spending good resources after bad is not a wise investment. A better solution would be to add a different API to the system.
If your application is in an embedded system, there are even more drastic approaches you can take, like setting PG_U on the interesting kernel pages, and then accessing them directly from user space with a system call interlock, minimally on any allocations or deallocations (this only works if there is no such thing as "someone else's process" running on your system, since it provides an opportunity to corrupt kernel memory from user space), etc. The kblob interface is an interesting animal; Jeffrey Hsu has done some good work in that area, but it's not entirely usable, as it sits. You might want to talk to Jonathan Lemon. IMO, it is probably a lost cause. > > Technically, sendfile should be an async interface so it can lock > > the so_snd window to the buffers-in-flight. If it did this, it > > could preallocate the memory at the time it's called, and then > > reuse it internally until the operation has been completed. Then > > it could write its completion status. > > I haven't spent more than a few seconds thinking about this, but > wouldn't that require more mbufclusters to be in use but idle at any > given time than the current implementation? No. First of all, it reduces the sfbuf requirements considerably, by queueing request descriptors, instead, and satisfying them, as it can. Second, you can control the number of packets "in flight" for each outstanding sendfile request in progress (unlike now), so if you throttle this back to the so_snd size, in fact you will use *fewer* mbuf clusters simultaneously, and you will reduce page thrashing (remember that sendfile uses external mbufs that refer to buffer cache pages via sfbuf mappings). [ ... ] > I don't quite understand what you're trying to say here. What's the > correlation between / and system calls? CR+LF is always > read/written as two bytes... I must be missing the point of your > comment.
It's a tangent that indicates sendfile() is generally inappropriate, unless you also implement the recvfile() to go with it, and use it. The issue is that UNIX text files are not stored in the wire formats for these protocols, so using sendfile() on them is usually inappropriate, unless you change how you store them. Mail servers, especially, break inbound and outbound data between applications, so you'd have to hack them up to store incoming as delimited so that when you sent them out via sendfile(), they were compliant with the protocol standard, on the wire. > > > If a system is busy, it's stuck in an sfbufa state and blocks the > > > server from servicing thousands of connections. > > > > I understand. > > Groovy: that's a third of the problem, what's the elegant solution? You can't have one, in the context of the current sendfile. You need to change your context, if you want to address this issue and get onto the next one, or you can accept the implementation of an administrative limit to keep from banging your head on the design limit, and cut your losses. It really boils down to how much effort you are willing to spend on it, for what return you expect. > > > The symptoms are common and synonymous with mbuf exhaustion or any > > > other kind of buffer exhaustion... my point is that having this > > > block is the worst way that sendfile() can degrade under > > > high performance. > > > > Dijkstra: preallocate your resources, and you do not have this > > problem. In this case, set your tunable high enough that even > > were you to use up all your available buffers, there are NSFBUFS > > available... and the problem goes away. > > I keep chasing this upper bound and pushing things higher and higher > because sendfile() doesn't degrade worth beans... well, that's a hack > and not a solution. No, it's really a "Then don't do that" solution to the old "Doctor, it hurts when I do this" complaint.
Before sendfile(), the answer was to mmap() the data to be sent, and then call write() on it. Doing that guaranteed that you would not have to copy the data from user space to kernel space, because the mapping was already established. That solution can still work, without using sendfile() to get the same performance. The performance "win" of sendfile is the assumption that the entire file will be sent as a result of a single system call. > The TCP stack, VM, and my general setup has > scaled quite well. The 1st thing to go, however, is the number of > sf_buf's. "If it ain't one thing, it's another"... At some point, you have to bound the application to some administrative limit to keep it from hitting the hard limits inherent in the system; you aren't going to be able to address all the hard limits that you're going to run into that happen from e.g. impedance mismatches over APIs like sendfile() that aren't designed to handle them. As someone else pointed out, there are a lot of low overheads in various places in the FreeBSD kernel. If you are a seven foot tall person that wants to walk around without banging your head every 5 feet, then there's a lot of remodelling you are going to need to do to avoid that. If you want to get CS technical, you have found a livelock stall barrier: there are literally thousands of these in the design of FreeBSD, as it stands, and most of them are unlikely to ever get fixed, except in private commercial repositories for FreeBSD-based products. > I'm worried I'm going to run out of KVM here in the near > future (and at that point, life basically begins to suck given my RAM > requirements are all over the place, 64bit platforms other than the > alpha aren't ready for prime time quite yet, and BSD has a hard kernel > memory split that isn't dynamic). Eventually, you will. That's an inevitability. *Why* you run out will depend on your application, and what system characteristics seem important for it, to you.
For me, this is usually "number of connections" or "ability to shed load in order to degrade gracefully", etc. So for me, it usually comes down to number of mbufs stuck in so_snd chains, and I set cluster high enough that I don't hit my head. As far as 64bit, the Alpha can't handle as much physical RAM as the x86 (2G vs. 4G) at this point. The hard kernel memory split will *never* be dynamic. The closest you can ever expect to get is separate process and kernel address spaces, so that the kernel address space and process address space are never simultaneously mapped. Doing that means heroic efforts are required to implement uiomove(), et al. Even so, the kernel memory is generally non-pageable. This is going to mean that you will be able to use up all physical RAM with such a config, but in doing so, you will leave yourself with no physical memory to give to user programs. It's a set of tradeoffs, and at some point, like it or not, you hit your head. Personally, I would probably never get rid of the simultaneous mapping; it's just too useful. For example, it's possible to map RO in user space but RW in kernel space a page that permits you to take no system call overhead for getpid/getgid/getuid/etc. It's also possible to map a page RO in user space that contains the clock structures from the timecounters in kernel space (this is harder, but doable). By doing this, you can have a zero system call overhead "gettimeofday()" function, and guarantee its atomicity by maintaining two regions and pointer-flipping between them, and reading the pointer atomically in user space, which guarantees atomicity on the content references. And so on. These tricks all require that a PG_U bit set on a kernel page makes it visible in user space. [ ... ] > > Any other approach, and your only option to recover your state is to > > close the connection and make the client retry. > > Agreed, but that's a non-option when trying to deliver a high level of > reliability.
HTTP doesn't handle that so well. I look at this as "load shedding after hitting capacity limits"; the failure is going to be no worse than the worst case, in that scenario. The problem you have with your sendfile lockups is actually not that severe, per se. Yes, you stall your user space processing until some of the in-progress sendfile()'s that have happened previously drain out the network interface, so it impacts your ability to accept new connections, but it doesn't damage your ability to service the existing connections. If you turn the problem around, the real problem you have is that you are not rejecting new connections, the moment before you hit this situation. From that perspective, if you were to preallocate everything that would be used for a given sendfile, AND either fail the sendfile() completely ("WOULDBLOCK"), signalling user space to throttle new requests to the interface, OR guarantee it to complete completely, then the problem is also solved. It's just solved a different way. > > So in the situation where the resources are limited, you end up > > *increasing* the overall load by, instead of satisfying a client > > with a single request, converting that into 5 requests, all of which > > fail to deliver the data to the client. > > But 'ya see, I wouldn't mind that at all: I'm not CPU bound and can > afford the extra context switches back and forth from the user space. > I'd bet dime to dollar that people who use sendfile(2) aren't CPU > bound: they're IO/sf_buf bound. Sure having sendfile() return EAGAIN > will drive up the number of calls under high load, but I'd rather burn > a few more cycles swapping contexts than I would getting stuck in a > spin lock waiting for the required number of sf_buf's to become > available. I think they would care very, very much. Here's why. Consider a site with large files to deliver to their customers; each of the customers has a pipe of a given data rate. 
Logic tells us that the data rate *at the customer* is going to dictate how fast the send buffers drain out, which in turn, controls the queue retention time for the resources on your server. Large pipes will drain fastest, and small pipes will take much longer. This is the classic "equal resource requirement, variable time to run" scheduling algorithm problem. If you accept requests at random, then you are guaranteed, at some point, to stall all your fast connections behind slow connections. What about "retroactive RED queueing" as a solution? In other words, based on a calculated figure of merit for your load, you decide to abandon existing connections, with a bias towards abandoning the longest running connections first (maybe you use a Poisson distribution; whatever). Naively, this seems like a solution. Practically, though, it's not. The reason is that of human psychology. People on slow pipes are patient, by definition. This means that they will retry for hours and hours, keeping you clogged up, no matter what. So in an overcapacity situation, you won't escape by dropping "problem" connections (the only way to do that effectively is a QoS negotiation that knows, at connection time, how big the pipe of the connecting client will be). The only answer is to slog through the workload, and RED on new requests. What this comes down to is guaranteeing "fairness". So the conclusion? Having sendfile() return "EAGAIN" is naive, unless you have a means of limiting each sendfile to its *fair share* of sf_buf's. And once again, we are at the point where the sendfile() implementation is inadequate to the task. There's no proportional allocator for resources here; there's not even a simplistic count maintained of the number of sendfile() requests simultaneously in progress. And there *can't* be.
Again, it comes down to the sendfile API: fair sharing requires *knowing* the number of sessions in progress at the same time, which is unknowable to the kernel at this point, because the sendfile() API is just "take this file and queue as many mbufs on so_snd as you can, until you hit the end of the file or until you run out of sf_bufs". The *only* way to address this so that the kernel can *know*, to *fairly* share resources among requesters, is to queue the requests *to the kernel*, and then service them to completion. *Only* then can the kernel perform useful resource arbitration on your behalf. > If I've got a connection queue of 60K, I want to free up as many > connections as I can as fast as I can which makes sleeping the worst > thing I can do because the contentions in queue just pile up. A > userland spin lock is going to result in a more responsive application > than a kernel spin lock since the userland app will loop through the > connection queue and free up sf_buf's as data gets sent out over the > pipe (something that won't happen when stuck in msleep() in the > kernel's spin lock). You should look at the Rice University Scala Server project code; much of it is based on FreeBSD. One of the things they do is put proactive load shedding at the stall barriers, rather than hitting them. They do things like LRP (one of my favorites), but they also do other interesting work in that area. Druschel, Banga, et al., are all very smart guys. You've probably heard of iMimic? I'll give you a hint, though: shortest request first. Blows to hell on asymmetric client data rates, though (IMO). > > The sendfile interface does not degrade gracefully, period. Even if > > you dealt with the issue by setting *sbytes correctly in all cases, > > and returning the right value to user space, you've increased the > > number of system calls, potentially significantly. So even if you > > "correct" the behaviour, your degradation is going to be > > exponential.
> > ::nods:: But as stated above, there are worse things that can be done, > most notably, blocking and letting connections pile up. "Emoticon" time, I guess... ;^). ::sighs:: And then the next bottleneck becomes system call overhead, and the next one after that becomes network I/O, and the next one after that becomes PCI bus bandwidth, etc. The correct thing to do is to *not let connections pile up*: after a certain *very small* overage, drop them on the floor: do not answer their SYN's. Hell, if you have the source code to the firmware for your network card, then don't cause an interrupt for their SYN's *at all*. One of the major pains in the butt for effective load shedding in FreeBSD, as it currently stands, is the SYN cache. The damn thing accepts connections on your behalf by completing three-way handshakes automatically, without giving you the opportunity of doing feedback until *after* the connection is established. > > One potential solution is to go to using KSE's, so that the blocking > > context is not your whole process. This allows you to write the > > server as multithreaded. Another is to do what Apache does, and run > > processes per connection. > > I'm antsy as hell to convert my apps to use KSE for this very reason, > but I'm going to give myself a few more months before I turn the life > blood of my business over to KSE. Personally, I would not do it. Not for networking equipment, where you aren't CPU-bound. Unfortunately, the benefits of KSE are mostly not worth the cost, without SMP, and the cost of SMP is now much higher than it was, back when the work started. Over time, clock multipliers have gone up, and the limitation is all on internal bus bandwidth, and internal data stalls, much more than it's on raw compute cycles. What good does a 3GHz processor do me, if I have to wait for 12 cycles per I/O cycle, if I'm I/O bound? The one thing it really buys you is the ability to program lazy, using threads instead of finite state automatons.
That's OK for some applications, of course, but mostly for ones which are compute bound, since it adds lock contention and cache contention and TLB shootdowns, and protection domain crossings. Probably the one place I'd be willing to eat that is if I had a large Java-based server, where I'm going to be eating compute cycles like crazy in the JVM, and so it makes sense to throw compute cycles at the problem. > > My recommendation was (and is): get a sufficiently large NSFBUFS in > > the first place, so you never encounter the situation that results > > in the non-graceful degradation. > > That's not a solution though, that's a work around/hack. :-] I've > hacked/worked around, but I need a solution. Making sendfile(2) "do > the right thing(TM)" I thought was the solution (still do). Come up with a new API. It needs to: 1) Queue its requests to the kernel, so that the kernel has enough information to make useful decisions 2) Respect the limit on the so_snd depth (minimally; there are reasons for load tuning to make it even more severe, on purpose, to control router queue depths for slow customer pipes) 3) Send a kevent when the file send has been completed 4) Preallocate resources before taking something off the queue Those are the minimum design requirements, from a 50,000 foot view. > > 2) Allow the API to be inconsistent, and then have the OS > > accept the blame for broken applications, since it permits > > known broken parameter values > > I don't follow... how would this fix anything? I don't understand > why this would be necessary given what I'd proposed/suggested earlier. It doesn't fix anything. If you want something fixed, you are back to the option you aren't thrilled with. If you see a third option, you should talk about it. Actually, the first option is suspect, because of the headers and trailers from *hdtr.
Maintaining accurate header and file content indexing for arbitrary length headers, or handling a partial completion on the header/trailer is unrecoverable, even if *sbytes accurately reflects the amount of the file itself that was sent. 8-(. > > And yeah, either way you look at it, it's a failure to degrade > > gracefully... once again: the easy fix is to not put your system in > > that position in the first place. > > Lol! I wish I had that as an option. Near infinite demand doesn't > give me this luxury. Shed load before it takes resources. Seriously. > I'd actually thought about having my application do this on the fly > and automatically tune itself based on the number of free sf_buf's, > but this brings up another problem with sendfile(2): there's no way of > determining how many sf_buf's are in use at any given time and on > -STABLE, you can't even read the number of sf_buf's allocated > (kern.ipc.nsfbufs). :-/ > > Other suggestions welcome including, "leave sendfile() alone, hack up > a new interface." That would be my recommendation. The sendfile() interface has always been an architectural wart. It's there, IMO, to compete with Linux ("Linux has one, we need one"). There are changes that could be made to the implementation details to make it less of an egregious hack, but there's no way to make it a non-hack.
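The four design requirements listed above (queue requests, cap each request at a send window, signal completion, preallocate before dequeueing) can be modelled in a few lines of user-space C. This is only a toy simulation under stated assumptions, not kernel code: `send_queue`, `sq_submit` and `sq_service` are hypothetical names, the buffer pool is a plain counter standing in for sf_bufs, and the `REQ_DONE` state stands in for the kevent a real implementation would post.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REQS 16   /* queue capacity; arbitrary for this sketch */

enum req_state { REQ_QUEUED, REQ_ACTIVE, REQ_DONE };

struct send_req {
    size_t pages_left;     /* file pages still to transmit */
    size_t window;         /* in-flight cap, e.g. so_snd depth in pages */
    enum req_state state;  /* REQ_DONE stands in for a completion kevent */
};

struct send_queue {
    struct send_req reqs[MAX_REQS];  /* kept in submission (FIFO) order */
    size_t nreqs;
    size_t pool_bufs;                /* total buffers (stand-in for NSFBUFS) */
    size_t free_bufs;                /* buffers not reserved by any request */
};

static void
sq_init(struct send_queue *sq, size_t pool_bufs)
{
    sq->nreqs = 0;
    sq->pool_bufs = pool_bufs;
    sq->free_bufs = pool_bufs;
}

/* Queue a request. Returns its index, or -1 so the caller can shed load. */
static int
sq_submit(struct send_queue *sq, size_t pages, size_t window)
{
    if (sq->nreqs == MAX_REQS || window == 0 || window > sq->pool_bufs)
        return -1;
    sq->reqs[sq->nreqs].pages_left = pages;
    sq->reqs[sq->nreqs].window = window;
    sq->reqs[sq->nreqs].state = REQ_QUEUED;
    return (int)sq->nreqs++;
}

/* One service pass. Returns how many requests completed during the pass. */
static size_t
sq_service(struct send_queue *sq)
{
    size_t completed = 0;
    int admit = 1;   /* cleared once the oldest queued request cannot start */

    for (size_t i = 0; i < sq->nreqs; i++) {
        struct send_req *r = &sq->reqs[i];

        if (r->state == REQ_QUEUED && admit) {
            if (sq->free_bufs >= r->window) {
                sq->free_bufs -= r->window;   /* preallocate the whole window */
                r->state = REQ_ACTIVE;
            } else {
                admit = 0;                    /* keep FIFO order, never block */
            }
        }
        if (r->state == REQ_ACTIVE) {
            size_t n = r->pages_left < r->window ? r->pages_left : r->window;
            r->pages_left -= n;               /* "transmit" one window's worth */
            if (r->pages_left == 0) {
                sq->free_bufs += r->window;   /* release the reservation */
                r->state = REQ_DONE;
                completed++;
            }
        }
    }
    assert(sq->free_bufs <= sq->pool_bufs);
    return completed;
}
```

Because a request reserves its whole window before it starts, an active request can always make progress, and a caller that gets -1 from `sq_submit()` learns about overload before any buffers are consumed.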
-- Terry From owner-freebsd-arch Tue Mar 4 11: 1:23 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 74C2737B401 for ; Tue, 4 Mar 2003 11:01:22 -0800 (PST) Received: from park.rambler.ru (park.rambler.ru [81.19.64.101]) by mx1.FreeBSD.org (Postfix) with ESMTP id AA31A43FA3 for ; Tue, 4 Mar 2003 11:01:20 -0800 (PST) (envelope-from is@rambler-co.ru) Received: from is.park.rambler.ru (is.park.rambler.ru [81.19.64.102]) by park.rambler.ru (8.12.6/8.12.6) with ESMTP id h24J1EmF094396; Tue, 4 Mar 2003 22:01:14 +0300 (MSK) Date: Tue, 4 Mar 2003 22:01:13 +0300 (MSK) From: Igor Sysoev X-Sender: is@is To: Terry Lambert Cc: arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? In-Reply-To: <3E64E9B8.EDCA54FE@mindspring.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Tue, 4 Mar 2003, Terry Lambert wrote: >Before sendfile(), the answer was to mmap() the data to be sent, >and then call write() on it. Doing that guaranteed that you would >not have to copy the data from user space to kernel space, because >the mapping was already established. That solution can still work, >without using sendfile() to get the same performance. The performance >"win" of sendfile is the assumption that the entire file will be >sent as a result of a single system call. It seems to me that FreeBSD (at least 4.x) write()s an mmap()ed file to the socket the same way it write()s malloc()ed memory, i.e. it simply copies user data to kernel mbufs. aio_write() on the socket does the same. And sendfile() is the only syscall that sets mbuf.ext_buf to the page without copying.
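For reference, the mmap()+write() call pattern under discussion looks roughly like the following C sketch. The helper name `send_mapped` is hypothetical. The sketch shows only the pattern; per the observation above, on FreeBSD 4.x the write() still copies the mapped data into mbufs, so this should not be read as a zero-copy path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* open(), for callers of send_mapped() */
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send the whole file open on fd to descriptor out using mmap() + write().
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t
send_mapped(int fd, int out)
{
    struct stat st;

    assert(fd >= 0 && out >= 0);
    if (fstat(fd, &st) == -1)
        return -1;
    if (st.st_size == 0)
        return 0;                   /* nothing to map or send */

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    size_t off = 0;
    while (off < len) {
        /* write() may accept fewer bytes than requested; loop until done. */
        ssize_t n = write(out, (const char *)p + off, len - off);
        if (n == -1) {
            munmap(p, len);
            return -1;
        }
        off += (size_t)n;
    }
    munmap(p, len);
    return (ssize_t)off;
}
```

A caller would open the file with open(2) and pass a connected socket (or, for a quick experiment, the write end of a pipe) as `out`. On a non-blocking descriptor the write() can fail with EAGAIN, which this sketch simply reports as an error.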
Igor Sysoev http://sysoev.ru/en/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 11:30: 2 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1F8AC37B401 for ; Tue, 4 Mar 2003 11:30:01 -0800 (PST) Received: from mail26a.sbc-webhosting.com (mail26a.sbc-webhosting.com [216.173.237.36]) by mx1.FreeBSD.org (Postfix) with SMTP id 3578F43FBD for ; Tue, 4 Mar 2003 11:29:58 -0800 (PST) (envelope-from alc@imimic.com) Received: from www.imimic.com (64.143.12.21) by mail26a.sbc-webhosting.com (RS ver 1.0.77vs) with SMTP id 0503806170; Tue, 4 Mar 2003 14:29:31 -0500 (EST) Message-ID: <3E64FEA0.CCA21C7@imimic.com> Date: Tue, 04 Mar 2003 13:29:36 -0600 From: "Alan L. Cox" Organization: iMimic Networking, Inc. X-Mailer: Mozilla 4.8 [en] (X11; U; Linux 2.4.2 i386) X-Accept-Language: en MIME-Version: 1.0 To: arch@freebsd.org, Sean Chittenden Subject: Re: Should sendfile() to return ENOBUFS? Content-Type: text/plain; charset=x-user-defined Content-Transfer-Encoding: 7bit X-Loop-Detect: 1 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Sean, The current sf_buf implementation has a simple problem that could account for your frequent blocking. Let me describe an extreme example that will make it clear. Suppose you have a web server that delivers nothing but a single file of 8 pages, or 32K bytes of data, to its clients. Here's the punchline: If you had 1,000 concurrent requests, you could wind up allocating 8,000 sf_bufs. Given that the main purpose of the sf_buf is simply to provide an in-kernel virtual address for the page, one sf_buf per page should suffice. Sf_bufs are already reference counted. 
So, the principal change would be to add a directory data structure that could answer the question "Does this page already have an allocated sf_buf?" Regards, Alan From owner-freebsd-arch Tue Mar 4 13:51:31 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7D50937B401 for ; Tue, 4 Mar 2003 13:51:29 -0800 (PST) Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) by mx1.FreeBSD.org (Postfix) with ESMTP id ECFAF43FBD for ; Tue, 4 Mar 2003 13:51:28 -0800 (PST) (envelope-from sean@perrin.int.nxad.com) Received: by perrin.int.nxad.com (Postfix, from userid 1001) id DF34D2107B; Tue, 4 Mar 2003 13:51:18 -0800 (PST) Date: Tue, 4 Mar 2003 13:51:18 -0800 From: Sean Chittenden To: "Alan L. Cox" Cc: arch@freebsd.org Subject: Re: Should sendfile() to return ENOBUFS? Message-ID: <20030304215118.GJ79234@perrin.int.nxad.com> References: <3E64FEA0.CCA21C7@imimic.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="D+UG5SQJKkIYNVx0" Content-Disposition: inline In-Reply-To: <3E64FEA0.CCA21C7@imimic.com> User-Agent: Mutt/1.4i X-PGP-Key: finger seanc@FreeBSD.org X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 X-Web-Homepage: http://sean.chittenden.org/ Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG --D+UG5SQJKkIYNVx0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > The current sf_buf implementation has a simple problem that could > account for your frequent blocking. Let me describe an extreme > example that will make it clear.
Suppose you have a web server that > delivers nothing but a single file of 8 pages, or 32K bytes of data, > to its clients. Here's the punchline: If you had 1,000 concurrent > requests, you could wind up allocating 8,000 sf_bufs. Given that > the main purpose of the sf_buf is simply to provide an in-kernel > virtual address for the page, one sf_buf per page should suffice. > Sf_bufs are already reference counted. So, the principal change > would be to add a directory data structure that could answer the > question "Does this page already have an allocated sf_buf?" This is an excellent suggestion and one that I hadn't even thought of, thank you. You're right in assuming that I'm sending out only a few hundred files per server to many thousands of clients, so this would be ideal in terms of performance. The problem that I can see with this is, "what happens when a file changes on disk?" Somehow the page of data needs to be flushed and re-read. For files that are constantly in transit, their ref count would never hit zero so the data sent would never change (in theory, or it would mix/match pages from the new and old file: a problem not encountered with the current sendfile() implementation). Shutting down the server and waiting for the buffers to clear isn't a valid option in my book, and the possibility of out-of-sequence pages makes sendfile() something of a data integrity liability. Since the various page types are all aligned, caching of sf_buf's along with the above directory structure would be quite a bit more efficient for my case, but possibly too efficient. Is there a mechanism for resetting highly used, but changed file pages? There very likely is a way of doing this, but I'm not familiar with it.
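To make the quoted proposal concrete, here is a small user-space C sketch of a page-to-buffer directory with reference counts. It is an illustration under assumptions, not the kernel's sf_buf code: `sfbuf_dir`, `sfbuf_get` and `sfbuf_put` are hypothetical names, a page is identified by an opaque integer key, and a linear scan stands in for whatever hash lookup a real directory would use. It does not address the changed-file question raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSFBUFS 16   /* pool size for this sketch; stand-in for the tunable */

struct sfbuf {
    uintptr_t page;  /* identity of the mapped page (hypothetical key) */
    unsigned  refs;  /* 0 means the slot is free */
};

struct sfbuf_dir {
    struct sfbuf slots[NSFBUFS];
    size_t in_use;   /* slots with refs > 0 */
};

/*
 * Answer "does this page already have an allocated buffer?".
 * If so, share it; otherwise take a free slot. NULL means the pool is
 * exhausted and the caller must decide whether to wait or fail.
 */
static struct sfbuf *
sfbuf_get(struct sfbuf_dir *d, uintptr_t page)
{
    struct sfbuf *free_slot = NULL;

    for (size_t i = 0; i < NSFBUFS; i++) {
        struct sfbuf *sf = &d->slots[i];
        if (sf->refs > 0 && sf->page == page) {
            sf->refs++;              /* page already mapped: reuse it */
            return sf;
        }
        if (sf->refs == 0 && free_slot == NULL)
            free_slot = sf;          /* remember the first free slot */
    }
    if (free_slot == NULL)
        return NULL;
    free_slot->page = page;
    free_slot->refs = 1;
    d->in_use++;
    return free_slot;
}

/* Drop one reference; the slot becomes free when the last user is gone. */
static void
sfbuf_put(struct sfbuf_dir *d, struct sfbuf *sf)
{
    assert(sf->refs > 0);
    if (--sf->refs == 0)
        d->in_use--;
}
```

Replaying the example from the quoted message, 1,000 requests for the same 8 pages leave `in_use` at 8 rather than 8,000, with each slot holding 1,000 references.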
-sc --=20 Sean Chittenden --D+UG5SQJKkIYNVx0 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Comment: Sean Chittenden iD8DBQE+ZR/W3ZnjH7yEs0ERAqlxAKDLFlX/YJ6MzAkV6yA7v7ixO8NgggCg59fT NwRxgLSCeAy12P/8LbmnXYo= =h3y7 -----END PGP SIGNATURE----- --D+UG5SQJKkIYNVx0-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 14:10:18 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D619937B401 for ; Tue, 4 Mar 2003 14:10:16 -0800 (PST) Received: from tesla.distributel.net (nat.MTL.distributel.NET [66.38.181.24]) by mx1.FreeBSD.org (Postfix) with ESMTP id 26FCF43F93 for ; Tue, 4 Mar 2003 14:10:16 -0800 (PST) (envelope-from bmilekic@unixdaemons.com) Received: (from bmilekic@localhost) by tesla.distributel.net (8.11.6/8.11.6) id h24M8b710311; Tue, 4 Mar 2003 17:08:37 -0500 (EST) (envelope-from bmilekic@unixdaemons.com) Date: Tue, 4 Mar 2003 17:08:37 -0500 From: Bosko Milekic To: Sean Chittenden Cc: "Alan L. Cox" , arch@freebsd.org Subject: Re: Should sendfile() to return ENOBUFS? Message-ID: <20030304170837.A10281@unixdaemons.com> References: <3E64FEA0.CCA21C7@imimic.com> <20030304215118.GJ79234@perrin.int.nxad.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030304215118.GJ79234@perrin.int.nxad.com>; from sean@chittenden.org on Tue, Mar 04, 2003 at 01:51:18PM -0800 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Tue, Mar 04, 2003 at 01:51:18PM -0800, Sean Chittenden wrote: > > The current sf_buf implementation has a simple problem that could > > account for your frequent blocking. 
Let me describe an extreme > > example that will make it clear. Suppose you have a web server that > > delivers nothing but a single file of 8 pages, or 32K bytes of data, > > to its clients. Here's the punchline: If you had 1,000 concurrent > > requests, you could wind up allocating 8,000 sf_bufs. Given that > > the main purpose of the sf_buf is simply to provide an in-kernel > > virtual address for the page, one sf_buf per page should suffice. > > Sf_bufs are already reference counted. So, the principal change > > would be to add a directory data structure that could answer the > > question "Does this page already have an allocated sf_buf?" > > This is an excellent suggestion and one that I hadn't even thought of, > thank you. You're right in assuming that I'm sending out only a few > hundred files per server to many thousands of clients, so this would > be ideal in terms of performance. The problem that I can see with > this is, "what happens when a file changes on disk?" Somehow the > page of data needs to be flushed and re-read. For files that are > constantly in transit, their ref count would never hit zero so the > data sent would never change (in theory, or it would mix/match pages > from the new and old file: a problem not encountered with the current > sendfile() implementation). Shutting down the server and waiting for > the buffers to clear isn't a valid option in my book, and the > possibility of out-of-sequence pages makes sendfile() something of a > data integrity liability. > > Since the various page types are all aligned, caching of sf_buf's > along with the above directory structure would be quite a bit more > efficient for my case, but possibly too efficient. Is there a > mechanism for resetting highly used, but changed file pages? There > very likely is a way of doing this, but I'm not familiar with > it. What about only re-using the already allocated page if the timestamp for the last modification matches the currently stored one?
(i.e., store the timestamp in the auxilary structure). I'm not sure this would work in all cases, but it would serve as an OK compromise; or maybe I'm just overlooking something? > -sc > > -- > Sean Chittenden -- Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 15:27:55 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 34A2237B401 for ; Tue, 4 Mar 2003 15:27:53 -0800 (PST) Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id CE0C443FAF for ; Tue, 4 Mar 2003 15:27:51 -0800 (PST) (envelope-from gallatin@cs.duke.edu) Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30]) by duke.cs.duke.edu (8.12.8/8.12.8) with ESMTP id h24NRlG1004856 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Tue, 4 Mar 2003 18:27:50 -0500 (EST) Received: (from gallatin@localhost) by grasshopper.cs.duke.edu (8.11.6/8.9.1) id h24NRgP73552; Tue, 4 Mar 2003 18:27:42 -0500 (EST) (envelope-from gallatin@cs.duke.edu) From: Andrew Gallatin MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <15973.13934.694598.353417@grasshopper.cs.duke.edu> Date: Tue, 4 Mar 2003 18:27:42 -0500 (EST) To: arch@freebsd.org Cc: Sean Chittenden Subject: Re: Should sendfile() to return ENOBUFS? In-Reply-To: <3E64FEA0.CCA21C7@imimic.com> References: <3E64FEA0.CCA21C7@imimic.com> X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Alan L. 
Cox writes: > Sean, > > The current sf_buf implementation has a simple problem that could > account for your frequent blocking. Let me describe an extreme example > that will make it clear. Suppose you have a web server that delivers > nothing but a single file of 8 pages, or 32K bytes of data, to its > clients. Here's the punchline: If you had 1,000 concurrent requests, > you could wind up allocating 8,000 sf_bufs. Given that the main purpose > of the sf_buf is simply to provide an in-kernel virtual address for the > page, one sf_buf per page should suffice. Sf_bufs are already reference > counted. So, the principal change would be to add a directory data > structure that could answer the question "Does this page already have an > allocated sf_buf?" In a reply I previously sent privately to Alan, I suggested: One off-the-cuff idea would be to trade the u_int cow field of a vm_page for a struct sf_buf *sf_buf ptr, and to move the cow field into the sf_buf. That way, the sendfile and zero-copy code could find the relevant sf_buf without doing any additional hashing beyond what they needed to do to find the page. If page->sf_buf == NULL, then an sf_buf is alloc'ed off the list, and page->sf_buf = new_sfbuf. Otherwise, a refcnt is incremented. The vm_fault() code would change so that it first checked for a non-null sf_buf, then checked the cow count in the sf_buf. This increases the size of a vm_page by 4 bytes on a 64-bit platform (or maybe 8, depending on the size of a vm_page), but should not affect the 32-bit platforms. There'd be a 4-byte size increase per sf_buf, but the decrease in the number of sf_buf's in flight should more than make up for the bloat. Alan suggested that once this was done: alc> the next step would be to manage sf_buf's as a sort alc> of "mapping cache". This could reduce the number of TLB shootdowns on alc> SMPs; and on 64-bit architectures we should be using the "mapping of alc> all RAM".
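A minimal user-space sketch of the bookkeeping described above, added for illustration and not taken from the FreeBSD tree. The struct layouts, the free list, and the names sf_buf_init(), sf_buf_get() and sf_buf_put() are assumptions; locking, the actual KVA mapping, and the vm_fault() change are left out:

```c
#include <assert.h>
#include <stddef.h>

struct page;                        /* simplified stand-in for vm_page */

/* Hypothetical, simplified sf_buf: one per mapped page. */
struct sf_buf {
    struct sf_buf *next_free;       /* free-list linkage */
    struct page   *page;            /* page currently mapped, or NULL */
    unsigned       refcnt;          /* number of in-flight users */
    unsigned       cow;             /* cow count, moved here from the page */
};

struct page {
    struct sf_buf *sf_buf;          /* NULL when the page has no sf_buf */
};

static struct sf_buf *sf_freelist;

/* Put n buffers from the caller-supplied array on the free list. */
static void sf_buf_init(struct sf_buf *pool, size_t n)
{
    sf_freelist = NULL;
    for (size_t i = 0; i < n; i++) {
        pool[i].page = NULL;
        pool[i].refcnt = 0;
        pool[i].cow = 0;
        pool[i].next_free = sf_freelist;
        sf_freelist = &pool[i];
    }
}

/* Return the page's sf_buf, taking one off the list only on first use. */
static struct sf_buf *sf_buf_get(struct page *pg)
{
    struct sf_buf *sf = pg->sf_buf;

    if (sf == NULL) {
        sf = sf_freelist;
        if (sf == NULL)
            return NULL;            /* pool exhausted: caller waits or fails */
        sf_freelist = sf->next_free;
        sf->page = pg;
        pg->sf_buf = sf;
    }
    sf->refcnt++;
    return sf;
}

/* Drop one reference; recycle the buffer when the last user is done. */
static void sf_buf_put(struct sf_buf *sf)
{
    assert(sf->refcnt > 0);
    if (--sf->refcnt == 0) {
        sf->page->sf_buf = NULL;
        sf->page = NULL;
        sf->next_free = sf_freelist;
        sf_freelist = sf;
    }
}
```

In this model, 1,000 concurrent senders of the same 8-page file would hold 8 buffers with a reference count of 1,000 each, rather than 8,000 buffers.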
Unfortunately, I don't have any time to implement this, nor does Alan. Is there any interest in this idea? Anybody like it enough to implement it? Drew From owner-freebsd-arch Tue Mar 4 15:36:26 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id AA39837B401 for ; Tue, 4 Mar 2003 15:36:19 -0800 (PST) Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8F54643FBD for ; Tue, 4 Mar 2003 15:36:18 -0800 (PST) (envelope-from sean@perrin.int.nxad.com) Received: by perrin.int.nxad.com (Postfix, from userid 1001) id 708FA21059; Tue, 4 Mar 2003 15:36:08 -0800 (PST) Date: Tue, 4 Mar 2003 15:36:08 -0800 From: Sean Chittenden To: Terry Lambert Cc: Hiten Pandya , arch@FreeBSD.ORG Subject: Re: Should sendfile() to return ENOBUFS? Message-ID: <20030304233608.GK79234@perrin.int.nxad.com> References: <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com> <3E6452B4.E87BEC2@mindspring.com> <20030304081326.GD79234@perrin.int.nxad.com> <3E64E9B8.EDCA54FE@mindspring.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="P6PRkhImOxklJvkF" Content-Disposition: inline In-Reply-To: <3E64E9B8.EDCA54FE@mindspring.com> User-Agent: Mutt/1.4i X-PGP-Key: finger seanc@FreeBSD.org X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 X-Web-Homepage: http://sean.chittenden.org/ Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG --P6PRkhImOxklJvkF Content-Type: text/plain;
charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > Personally, would probably just add a flag to sendfile, Breaking source compatibility shouldn't be an option. > A better solution would be to add a different API to the system. > The kblob interface is an interesting animal; Jeffrey Hsu has done > some good work in that area, but it's not entirely usable, as it > sits. You might want to talk to Jonathan Lemon. IMO, it is > probably a lost cause. Is there a patch or URL that I can read through? kblob + google == KDE in all but two instances. Alfred mentioned that it was good for smaller files so I'm not sure it'd be good for the large multi-MB type things that are getting sent out. > > I haven't spent more than a few seconds thinking about this, but > > wouldn't that require more mbufclusters to be in use but idle at any > > given time than the current implementation? > No. First of all, it reduces the sfbuf requirements considerably, > by queueing request descriptors, instead, and satisfying them, as it > can. Second, you can control the number of packets "in flight" for > each outstanding sendfile request in progress (unlike now), so if > you throttle this back to the so_snd size, in fact you will use > *fewer* mbuf clusters simultaneously, and you will reduce page > thrashing (remember that sendmaile uses external mbufs that refer to > buffer cache pages via sfbuf mappings). s/sendmaile/sendfile/ ? Hrm... this could be interesting... I'll keep this in mind as a way of keeping the rate of transfer constant at the point where all sf_bufs are in use. I'll have to think about this more before I'm 100% convinced that it's the right thing to do. > [ ... ] > > I don't quite understand what you're trying to say here. What's the > > correlation between CR/LF and system calls? CR+LF is always > > read/written as two bytes... I must be missing the point of your > > comment.
> It's a tangent that indicates sendfile() is generally inappropriate, > unless you also implement the recvfile() to go with it, and use it. > The issue is that UNIX text files are not stored in the wire formats > for these protocols, so using sendfile() on them is usually > inappropriate, unless you change how you store them. Mail servers, > especially, break inbound and outbound data between applications, so > you'd have to hack them up to store incoming as CR/LF delimited > so that when you sent them out via sendfile(), they were compliant > with the protocol standard, on the wire. Ah, ok... well, I generally have no sympathy for what are pretty poorly designed protocols or operating systems that are brain dead in their definition of newlines. > > > > If a system is busy, it's stuck in an sfbufa state and blocks the > > > > server from servicing thousands of connections. > > > > > > I understand. > > Groovy: that's a third of the problem, what's the elegant solution? > You can't have one, in the context of the current sendfile. You > need to change your context, if you want to address this issue and > get onto the next one, or you can accept the implementation of an > administrative limit to keep from banging your head on the design > limit, and cut your losses. I think the state of sendfile(2) could be much improved. Once improved, if things still suck, then I'll go about finding/writing a new interface to do what I/others need. > It really boils down to how much effort you are willing to spend on > it, for what return you expect. When load happens, every ounce of grace is worth its weight in gold tenfold over. That's why I switched from MS->Linux->FreeBSD in the 1st place. > > I keep chasing this upper bound and pushing things higher and higher > > because sendfile() doesn't degrade worth beans... well, that's a hack > > and not a solution.
> No, it's really a "Then don't do that" solution to the old "Doctor, > it hurts when I do this" complaint. The next thing the Dr. says is "then don't do it!", and we're back at square one again... this isn't lifting an arm, it's trying to breathe while running: something FreeBSD is generally better than most at. > Before sendfile(), the answer was to mmap() the data to be sent, and > then call write() on it. Doing that guaranteed that you would not > have to copy the data from user space to kernel space, because the > mapping was already established. That solution can still work, > without using sendfile() to get the same performance. The > performance "win" of sendfile is the assumption that the entire file > will be sent as a result of a single system call. You're forgetting the biggest win for busy servers with hundreds/thousands of files: RAM. I used to use mmap() + writev() and because of userland RAM constraints, I moved to using sendfile(). This dramatically improved the state of affairs. > > The TCP stack, VM, and my general setup has scaled quite well. > > The 1st thing to go, however, is the number of sf_buf's. > "If it ain't one thing, it's another"... [...] > As someone else pointed out, there are a lot of low overheads in > various places in the FreeBSD kernel. If you are a seven foot tall > person that wants to walk around without banging your head every 5 > feet, then there's a lot of remodelling you are going to need to do > to avoid that. Interesting point, but if you own a house with 5ft doorways and are 7ft tall, you'll fix the house or move out. Where's your saw and hammer? > If you want to get CS technical, you have found a livelock stall > barrier: there are literally thousands of these in the design of > FreeBSD, as it stands, and most of them are unlikely to ever get > fixed, except in private commercial repositories for FreeBSD-based > products. Again, where's your hammer?
You've got experience in running into doorways; have you ever thought about making them taller for everyone to pass through? > The hard kernel memory split will *never* be dynamic. I know the history and rationale for the current state of things, but *never* say never. ;) [...] > What this comes down to is guaranteeing "fairness". > So the conclusion? > Having sendfile() return "EAGAIN" is naive, unless you have a means > of limiting each sendfile to its *fair share* of sf_buf's. The problem is that not all connections are created equal. Send gobs of traffic overseas via slow last mile pipes and you'll find the problem changing dramatically. EAGAIN will at least get the available connections something and I'll be able to drain some of my load. [...] > The *only* way to address this so that the kernel can *know*, to > *fairly* share resources among requesters, is to queue the requests > *to the kernel*, and then service them to completion. *Only* then > can the kernel perform useful resource arbitration on your behalf. Actually, I was talking to Hiten on IRC about writing a kqueue-inspired file sending state machine. Basically you'd have a kernel daemon that'd broker sending files to clients. Push a file request (either whole file or partial) into a queue and the kernel would send out the file (or file part) as best as it could and once complete (successful or not), would add the completed request to a queue of finished requests that would be classified in one of two states: *) failure (errno, sent this many bytes) *) success (file sent, sent this many bytes). Applications would then only have to manage adding files to the kernel's queue and processing the completion of events from the queue (logging). On a local network, with T/TCP, this would make the basis for a really slick NFS replacement or cache engine, IMHO. And actually, this interface could be fd -> fd and used to replace local copying of files.
> One of the major pains in the butt for effective load shedding in > FreeBSD, as it currently stands, is the SYN cache. The damn thing > accepts connections on your behalf by completing three-way handshakes > automatically, without giving you the opportunity of doing feedback > until *after* the connection is established. I haven't hit that yet, but when I do, you'll hear back from me. > Come up with a new API. It needs to: > 1) Queue its requests to the kernel, so that the kernel has > enough information to make useful decisions Check. See above. > 2) Respect the limit on the so_snd depth (minimally; there are > reasons for load tuning to make it even more severe, on > purpose, to control router queue depths for slow customer > pipes) That'd be something that the kernel send file daemon would do (in theory). > 3) Send a kevent when the file send has been completed See above. > 4) Preallocate resources before taking something off the queue Check. > Those are the minimum design requirements, from a 50,000 foot view. A perk of the above design is that you don't have to constantly make system calls to send out parts of a file (with non-blocking IO + sendfile() + clients connecting at slow rates, this can be substantial).
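A rough user-space model of the queued design discussed in this message, added for illustration only: a request queue, a worker step, and a completion queue reporting errno and bytes sent. Every name here (send_req, send_done, req_push(), done_pop(), service_one(), demo_send()) is hypothetical, the queue depth is arbitrary, and no real kernel daemon or kevent delivery is involved:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define QLEN 8u   /* arbitrary depth; a power of two so the counters may wrap */

struct send_req  { int id; int fd; off_t off; size_t len; };
struct send_done { int id; int error; size_t sent; };  /* error 0 = success, else errno */

struct req_ring  { struct send_req  slot[QLEN]; unsigned head, tail; };
struct done_ring { struct send_done slot[QLEN]; unsigned head, tail; };

/* Application side: queue a request; returns -1 when the queue is full. */
static int req_push(struct req_ring *q, struct send_req r)
{
    if (q->tail - q->head == QLEN)
        return -1;
    q->slot[q->tail++ % QLEN] = r;
    return 0;
}

/* Application side: fetch one completion; returns 0 when none is pending. */
static int done_pop(struct done_ring *q, struct send_done *out)
{
    if (q->head == q->tail)
        return 0;
    *out = q->slot[q->head++ % QLEN];
    return 1;
}

/* Stand-in for the transmission itself; a real worker would honour so_snd. */
typedef int (*send_fn)(const struct send_req *r, size_t *sent);

static int demo_send(const struct send_req *r, size_t *sent)
{
    if (r->fd < 0) {
        *sent = 0;
        return EBADF;
    }
    *sent = r->len;     /* pretend the whole range went out */
    return 0;
}

/* Worker side: complete at most one request; returns 1 if one was serviced. */
static int service_one(struct req_ring *in, struct done_ring *out, send_fn do_send)
{
    struct send_req r;
    struct send_done d;

    assert(do_send != NULL);
    if (in->head == in->tail)
        return 0;                       /* nothing queued */
    if (out->tail - out->head == QLEN)
        return 0;                       /* reserve the completion slot first */
    r = in->slot[in->head++ % QLEN];
    d.id = r.id;
    d.sent = 0;
    d.error = do_send(&r, &d.sent);
    out->slot[out->tail++ % QLEN] = d;
    return 1;
}
```

The worker refuses to dequeue a request unless a completion slot is free, which is one simple reading of "preallocate resources before taking something off the queue".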
-sc --=20 Sean Chittenden --P6PRkhImOxklJvkF Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Comment: Sean Chittenden iD8DBQE+ZTho3ZnjH7yEs0ERAt0DAKC8LYKZ1PNts2W8XZKoeNp0EOQx+QCdG+7E Gbv19H5sq3l7UJ2TayGoaUQ= =u7sJ -----END PGP SIGNATURE----- --P6PRkhImOxklJvkF-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 15:40:37 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3062237B401 for ; Tue, 4 Mar 2003 15:40:35 -0800 (PST) Received: from bricore.com (adsl-64-168-71-68.dsl.snfc21.pacbell.net [64.168.71.68]) by mx1.FreeBSD.org (Postfix) with ESMTP id E2C8F43FA3 for ; Tue, 4 Mar 2003 15:40:33 -0800 (PST) (envelope-from lchen@briontech.com) Received: from luoqi (luoqi.bricore.com [192.168.1.63]) by bricore.com (8.12.6/8.12.6) with SMTP id h24NeS3q016603; Tue, 4 Mar 2003 15:40:28 -0800 (PST) (envelope-from lchen@briontech.com) From: "Luoqi Chen" To: "Jeff Roberson" Cc: Subject: RE: vtruncbuf() Date: Tue, 4 Mar 2003 15:43:25 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) In-Reply-To: <20030304010228.P72102-100000@mail.chesapeake.net> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 Importance: Normal X-Virus-Scanned: by amavisd-milter (http://amavis.org/) Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG > > > vtruncbuf() does a few things that I'm not terribly certain I > understand. > > > I'm hoping someone can elaborate on this. > > > > > I think the idea was to avoid calling fsync. 
> > Why does it need to be synced at all? > There was no metadata sync when vtruncbuf() was first introduced, but after some nasty crashes, John Dyson added the metadata sync code. It has been a long time and I don't remember now how the crashes were triggered (or if I had ever understood the mechanism). But now that I look at the code, it may have something to do with getting rid of softupdate dependencies associated with these indirect blocks (B_DELWRI), before modifying their contents. > > > Firstly, it makes assumptions about negative blknos. So this scheme > > > doesn't work for filesystems that don't use this method for indexing > > > their metadata. > > The code is a little ufs specific, but should still work for other FS > > -- it doesn't hurt to write out dirty bufs. > > No, but I'm not sure how it helps either. > > > > Secondly, it doesn't hold a lock while inspecting > > > B_DELWRI. > > > > > It's intentional, see below... > > It's an optimization. > > > > There is also a really weird check to see if the buf's vp > matches the vp > > > we're truncating. This doesn't really make sense since we just > > > pulled this > > > buf off of the dirty block lists for this vnode. > > > > > ..., the buf is not locked, remember :) > > Yes, but you were guaranteed that it wouldn't have migrated to a new vp > even in RELENG_4. The whole thing happens at splbio(). In current > Giant makes that guarantee and now the vnode interlock does as well. The > thing that you aren't guaranteed now is whether or not DELWRI is still > valid. You can be certain that UFS won't have negative blocks locked at > this point though because the vnode lock is held. So this lock should > always succeed anyway. > I thought you were talking about the nbp->b_vp == vp checks. The check for bp->b_vp == vp indeed looks weird, but the buffer wasn't locked back when the code was first written...
-lq To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 17:24: 7 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D7AA837B401 for ; Tue, 4 Mar 2003 17:24:06 -0800 (PST) Received: from beastie.mckusick.com (beastie.mckusick.com [209.31.233.184]) by mx1.FreeBSD.org (Postfix) with ESMTP id 283B143FD7 for ; Tue, 4 Mar 2003 17:24:06 -0800 (PST) (envelope-from mckusick@beastie.mckusick.com) Received: from beastie.mckusick.com (localhost [127.0.0.1]) by beastie.mckusick.com (8.12.3/8.12.3) with ESMTP id h251NxFL088420; Tue, 4 Mar 2003 17:23:59 -0800 (PST) (envelope-from mckusick@beastie.mckusick.com) Message-Id: <200303050123.h251NxFL088420@beastie.mckusick.com> To: "Jeff Roberson" Subject: Re: vtruncbuf() Cc: "Luoqi Chen" , arch@FreeBSD.ORG In-Reply-To: Your message of "Tue, 04 Mar 2003 15:43:25 PST." Date: Tue, 04 Mar 2003 17:23:59 -0800 From: Kirk McKusick Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG From: "Luoqi Chen" To: "Jeff Roberson" Cc: Subject: RE: vtruncbuf() Date: Tue, 4 Mar 2003 15:43:25 -0800 X-ASK-Info: Whitelist match > There is also a really weird check to see if the buf's vp > matches the vp This check is left from the days before the logical block numbers of metadata (indirect blocks) were stored as negative numbers associated with the file vnode. Before that change, they were stored as actual disk block numbers and associated with the mounted special device vnode. So as to be able to find them easily, they were temporarily moved from the special device vnode to the file vnode dirty list when they were written, then moved back to the special device vnode clean list when they were written. 
Luckily all that nastiness ended when the new logical block scheme got invented, but the check for bp->b_vp == vp never got cleaned out of the code. Your historic footnote for the day. Kirk McKusick From owner-freebsd-arch Tue Mar 4 19: 5:25 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5023437B401 for ; Tue, 4 Mar 2003 19:05:24 -0800 (PST) Received: from smtp1.server.rpi.edu (smtp1.server.rpi.edu [128.113.2.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4E2F743FD7 for ; Tue, 4 Mar 2003 19:05:23 -0800 (PST) (envelope-from drosih@rpi.edu) Received: from [128.113.24.47] (gilead.netel.rpi.edu [128.113.24.47]) by smtp1.server.rpi.edu (8.12.8/8.12.7) with ESMTP id h2535L5I023262; Tue, 4 Mar 2003 22:05:21 -0500 Mime-Version: 1.0 X-Sender: drosih@mail.rpi.edu Message-Id: In-Reply-To: References: <20030210114930.GB90800@melusine.cuivre.fr.eu.org> <200302251255.48219.wes@softweyr.com> Date: Tue, 4 Mar 2003 22:05:20 -0500 To: arch@FreeBSD.ORG From: Garance A Drosihn Subject: Re: NEWSYSLOG changes, signal process groups Cc: Gregory Bond Content-Type: text/plain; charset="us-ascii" ; format="flowed" X-Scanned-By: MIMEDefang 2.28 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG In pr=28435 there was a request that newsyslog know how to signal a process-group, as well as just a single process. I have added a 'U' flag for config-file entries, which indicates that the pid_file contains the id of a process group. The value in that file is expected to be a negative number (which I assume has to be in the negative range of valid process numbers).
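For readers unfamiliar with why a negative value is natural here: kill(2) treats a pid below -1 as "signal every process in process group -pid". The sketch below, added for illustration, shows one way such a pid-file value could be validated and used. It is not the code from the patch; parse_pid_value() and signal_target() are hypothetical names, and the range checks are assumptions:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Parse the text of a pid file. With is_group set, a negative
 * process-group value is expected; otherwise a positive pid.
 * Returns 0 and stores the value to hand to kill(2), or -1 on error.
 */
static int parse_pid_value(const char *text, int is_group, pid_t *out)
{
    char *end;
    long v;

    assert(text != NULL && out != NULL);
    errno = 0;
    v = strtol(text, &end, 10);
    if (errno != 0 || end == text)
        return -1;                      /* overflow or no digits */
    if (*end != '\0' && *end != '\n')
        return -1;                      /* trailing garbage */
    if (v < INT_MIN || v > INT_MAX)
        return -1;                      /* assumes pid_t fits in an int */
    if (is_group) {
        if (v >= -1)
            return -1;                  /* -1 would mean "everything", 0 "my group" */
    } else if (v <= 0) {
        return -1;
    }
    *out = (pid_t)v;
    return 0;
}

/* target > 0 signals one process; target < -1 signals process group -target. */
static int signal_target(pid_t target, int sig)
{
    return kill(target, sig);
}
```

Passing signal 0 to signal_target() performs only the existence and permission check, which is a cheap way to test a parsed value before sending a real signal.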
My patch for this is in: http://people.freebsd.org/~gad/newsyslog/4-sigpgrp.diff First I basically rewrote/reorganized the get_pid() routine, partially to get better error-checking. Then I checked what OpenBSD has in this area, and included a number of ideas from their start_signal() routine (including renaming the routine). Let me know of any feedback for this change. -- Garance Alistair Drosehn = gad@gilead.netel.rpi.edu Senior Systems Programmer or gad@freebsd.org Rensselaer Polytechnic Institute or drosih@rpi.edu To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 20: 0:23 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2788A37B405 for ; Tue, 4 Mar 2003 20:00:22 -0800 (PST) Received: from smtp3.server.rpi.edu (smtp3.server.rpi.edu [128.113.2.3]) by mx1.FreeBSD.org (Postfix) with ESMTP id EF5B643FB1 for ; Tue, 4 Mar 2003 20:00:20 -0800 (PST) (envelope-from drosih@rpi.edu) Received: from [128.113.24.47] (gilead.netel.rpi.edu [128.113.24.47]) by smtp3.server.rpi.edu (8.12.8/8.12.7) with ESMTP id h2540JNh028295; Tue, 4 Mar 2003 23:00:19 -0500 Mime-Version: 1.0 X-Sender: drosih@mail.rpi.edu Message-Id: In-Reply-To: References: <20030210114930.GB90800@melusine.cuivre.fr.eu.org> <200302251255.48219.wes@softweyr.com> Date: Tue, 4 Mar 2003 23:00:18 -0500 To: arch@FreeBSD.ORG From: Garance A Drosihn Subject: Re: NEWSYSLOG changes, signal process groups Cc: Gregory Bond Content-Type: text/plain; charset="us-ascii" ; format="flowed" X-Scanned-By: MIMEDefang 2.28 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG At 10:05 PM -0500 3/4/03, Garance A Drosihn wrote: >I have added a 'U' flag for config-file entries, which indicates >that the pid_file contains 
the id of a process group. I picked 'U' because 'g' was already taken, and 'u' is the only letter in the word 'group' which is not also in 'process'. -- Garance Alistair Drosehn = gad@gilead.netel.rpi.edu Senior Systems Programmer or gad@freebsd.org Rensselaer Polytechnic Institute or drosih@rpi.edu To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Mar 4 21:51:25 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 90B1937B401 for ; Tue, 4 Mar 2003 21:51:24 -0800 (PST) Received: from mail26a.sbc-webhosting.com (mail26a.sbc-webhosting.com [216.173.237.36]) by mx1.FreeBSD.org (Postfix) with SMTP id 8C5AF43F93 for ; Tue, 4 Mar 2003 21:51:21 -0800 (PST) (envelope-from alc@imimic.com) Received: from www.imimic.com (64.143.12.21) by mail26a.sbc-webhosting.com (RS ver 1.0.63s) with SMTP id 047605; Wed, 5 Mar 2003 00:50:48 -0500 (EST) Message-ID: <3E659041.EC63D4E0@imimic.com> Date: Tue, 04 Mar 2003 23:50:57 -0600 From: "Alan L. Cox" Organization: iMimic Networking, Inc. X-Mailer: Mozilla 4.8 [en] (X11; U; Linux 2.4.2 i386) X-Accept-Language: en MIME-Version: 1.0 To: Bosko Milekic Cc: Sean Chittenden , arch@freebsd.org Subject: Re: Should sendfile() to return ENOBUFS? References: <3E64FEA0.CCA21C7@imimic.com> <20030304215118.GJ79234@perrin.int.nxad.com> <20030304170837.A10281@unixdaemons.com> Content-Type: text/plain; charset=x-user-defined Content-Transfer-Encoding: 7bit X-Loop-Detect: 1 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Bosko Milekic wrote: > ... > What about only re-using the already allocated page if the timestamp > for the last modification matches the currently stored one? (i.e., > store the timestamp in the auxilary structure). 
I'm not sure this > would work in all cases, but it would serve as an OK compromise; or > maybe I'm just overlooking something? > I don't see the need for this. The vm_object being used in sendfile() is tied to the file's vnode. Thus, changes to the file will affect the vm_object used by sendfile(). Also, the sf_buf changes that I described have no effect on sendfile()'s data coherence. It remains the same. Alan
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Mar 6 16:50:20 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DA04537B405 for ; Thu, 6 Mar 2003 16:50:03 -0800 (PST) Received: from angelica.unixdaemons.com (angelica.unixdaemons.com [209.148.64.135]) by mx1.FreeBSD.org (Postfix) with ESMTP id EEFF043F3F for ; Thu, 6 Mar 2003 16:50:01 -0800 (PST) (envelope-from hiten@angelica.unixdaemons.com) Received: from angelica.unixdaemons.com (localhost.unixdaemons.com [127.0.0.1]) by angelica.unixdaemons.com (8.12.8/8.12.1) with ESMTP id h270nxmq001185 for ; Thu, 6 Mar 2003 19:49:59 -0500 (EST) Received: (from hiten@localhost) by angelica.unixdaemons.com (8.12.8/8.12.1/Submit) id h270nwp8001180 for arch@FreeBSD.org; Thu, 6 Mar 2003 19:49:58 -0500 (EST) (envelope-from hiten) Date: Thu, 6 Mar 2003 19:49:58 -0500 From: Hiten Pandya To: arch@FreeBSD.org Subject: Using m_getcl() in network and nfs code paths Message-ID: <20030307004958.GA98917@unixdaemons.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="xHFwDpU9dbj6ez1V" Content-Disposition: inline User-Agent: Mutt/1.4i X-Operating-System: FreeBSD i386 X-Public-Key: http://www.pittgoth.com/~hiten/pubkey.asc X-URL: http://www.unixdaemons.com/~hiten X-PGP: http://pgp.mit.edu:11371/pks/lookup?search=Hiten+Pandya&op=index Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG --xHFwDpU9dbj6ez1V Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hi gang. After discussing this with Bosko Milekic and some other people, I have made some changes to net, netinet, and the NFS server and client code to utilize the m_getcl() routine. 
The m_getcl routine does mbuf and cluster allocation in a single shot without dropping the Cache lock, hence reducing lock operations. This could prove beneficial. Comments are welcome. If there are no objections, hopefully, Bosko will commit the patches. Cheers. -- Hiten Pandya (hiten@unixdaemons.com, hiten@uk.FreeBSD.org) http://www.unixdaemons.com/~hiten/ --xHFwDpU9dbj6ez1V Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="bmilekic-prelim.patch" Index: src/sys/kern/uipc_mbuf2.c =================================================================== RCS file: /home/ncvs/src/sys/kern/uipc_mbuf2.c,v retrieving revision 1.18 diff -u -r1.18 uipc_mbuf2.c --- src/sys/kern/uipc_mbuf2.c 19 Feb 2003 05:47:26 -0000 1.18 +++ src/sys/kern/uipc_mbuf2.c 7 Mar 2003 00:31:27 -0000 @@ -230,14 +230,9 @@ * now, we need to do the hard way. don't m_copy as there's no room * on both end. */ - MGET(o, M_DONTWAIT, m->m_type); - if (o && len > MLEN) { - MCLGET(o, M_DONTWAIT); - if ((o->m_flags & M_EXT) == 0) { - m_free(o); - o = NULL; - } - } + o = (len > MLEN) ? + m_getcl(M_DONTWAIT, m->m_type, m->m_flags) : + m_get(M_DONTWAIT, m->m_type); if (!o) { m_freem(m); return NULL; /* ENOBUFS */ Index: kern/uipc_socket2.c =================================================================== RCS file: /home/ncvs/src/sys/kern/uipc_socket2.c,v retrieving revision 1.111 diff -u -r1.111 uipc_socket2.c --- kern/uipc_socket2.c 21 Feb 2003 22:23:40 -0000 1.111 +++ kern/uipc_socket2.c 5 Mar 2003 00:55:21 -0000 @@ -833,15 +833,11 @@ if (CMSG_SPACE((u_int)size) > MCLBYTES) return ((struct mbuf *) NULL); - if ((m = m_get(M_DONTWAIT, MT_CONTROL)) == NULL) - return ((struct mbuf *) NULL); - if (CMSG_SPACE((u_int)size) > MLEN) { - MCLGET(m, M_DONTWAIT); - if ((m->m_flags & M_EXT) == 0) { - m_free(m); - return ((struct mbuf *) NULL); - } - } + m = CMSG_SPACE((u_int)size > MLEN) ? 
+ m_getcl(M_DONTWAIT, MT_CONTROL, 0) : /* Note: !M_PKTHDR */ + m_get(M_DONTWAIT, MT_CONTROL); + if (m == NULL) + return NULL; cp = mtod(m, struct cmsghdr *); m->m_len = 0; KASSERT(CMSG_SPACE((u_int)size) <= M_TRAILINGSPACE(m), Index: net/if_ieee80211subr.c =================================================================== RCS file: /home/ncvs/src/sys/net/if_ieee80211subr.c,v retrieving revision 1.7 diff -u -r1.7 if_ieee80211subr.c --- net/if_ieee80211subr.c 3 Mar 2003 06:09:18 -0000 1.7 +++ net/if_ieee80211subr.c 5 Mar 2003 01:02:34 -0000 @@ -2544,16 +2544,14 @@ if (len > n->m_len - noff) { len = n->m_len - noff; if (len == 0) { - MGET(n->m_next, M_DONTWAIT, n->m_type); + n->m_next = (left >= MINCLSIZE) ? + m_getcl(M_DONTWAIT, n->m_type, M_PKTHDR) : + m_get(M_DONTWAIT, n->m_type); if (n->m_next == NULL) goto fail; n = n->m_next; - n->m_len = MLEN; - if (left >= MINCLSIZE) { - MCLGET(n, M_DONTWAIT); - if (n->m_flags & M_EXT) - n->m_len = n->m_ext.ext_size; - } + n->m_len = (left >= MINCLSIZE) ? + n->m_len = n->m_ext.ext_size : MLEN; noff = 0; continue; } Index: net/if_ppp.c =================================================================== RCS file: /home/ncvs/src/sys/net/if_ppp.c,v retrieving revision 1.89 diff -u -r1.89 if_ppp.c --- net/if_ppp.c 19 Feb 2003 05:47:29 -0000 1.89 +++ net/if_ppp.c 5 Mar 2003 01:02:34 -0000 @@ -1411,13 +1411,14 @@ } /* Copy the PPP and IP headers into a new mbuf. */ - MGETHDR(mp, M_DONTWAIT, MT_DATA); + mp = (hlen + PPP_HDRLEN > MHLEN) ? 
+ m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : + m_gethdr(M_DONTWAIT, MT_DATA); if (mp == NULL) goto bad; mp->m_len = 0; mp->m_next = NULL; if (hlen + PPP_HDRLEN > MHLEN) { - MCLGET(mp, M_DONTWAIT); if (M_TRAILINGSPACE(mp) < hlen + PPP_HDRLEN) { m_freem(mp); goto bad; /* lose if big headers and no clusters */ Index: net/if_sl.c =================================================================== RCS file: /home/ncvs/src/sys/net/if_sl.c,v retrieving revision 1.109 diff -u -r1.109 if_sl.c --- net/if_sl.c 19 Feb 2003 05:47:29 -0000 1.109 +++ net/if_sl.c 5 Mar 2003 01:02:34 -0000 @@ -266,14 +266,7 @@ MALLOC(sc, struct sl_softc *, sizeof(*sc), M_SL, M_WAITOK | M_ZERO); - m = m_gethdr(M_TRYWAIT, MT_DATA); - if (m != NULL) { - MCLGET(m, M_TRYWAIT); - if ((m->m_flags & M_EXT) == 0) { - m_free(m); - m = NULL; - } - } + m = m_getcl(m, M_TRYWAIT, MT_DATA); if (m == NULL) { printf("sl: can't allocate buffer\n"); @@ -791,10 +784,6 @@ { struct mbuf *m, *newm; - MGETHDR(m, M_DONTWAIT, MT_DATA); - if (m == NULL) - return (NULL); - /* * If we have more than MHLEN bytes, it's cheaper to * queue the cluster we just filled & allocate a new one @@ -802,16 +791,13 @@ * allocated above. Note that code in the input routine * guarantees that packet will fit in a cluster. */ + m = (len >= MHLEN) ? + m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : + m_gethdr(M_DONTWAIT, MT_DATA); + if (m == NULL) + return (NULL); + if (len >= MHLEN) { - MCLGET(m, M_DONTWAIT); - if ((m->m_flags & M_EXT) == 0) { - /* - * we couldn't get a cluster - if memory's this - * low, it's time to start dropping packets. 
- */ - (void) m_free(m); - return (NULL); - } /* Swap the new and old clusters */ newm = m; m = sc->sc_mbuf; Index: net/rtsock.c =================================================================== RCS file: /home/ncvs/src/sys/net/rtsock.c,v retrieving revision 1.88 diff -u -r1.88 rtsock.c --- net/rtsock.c 19 Feb 2003 05:47:29 -0000 1.88 +++ net/rtsock.c 5 Mar 2003 01:02:34 -0000 @@ -608,16 +608,12 @@ } if (len > MCLBYTES) panic("rt_msg1"); - m = m_gethdr(M_DONTWAIT, MT_DATA); - if (m && len > MHLEN) { - MCLGET(m, M_DONTWAIT); - if ((m->m_flags & M_EXT) == 0) { - m_free(m); - m = NULL; - } - } + m = (len > MHLEN) ? + m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : + m_gethdr(M_DONTWAIT, MT_DATA); + if (m == 0) - return (m); + return (NULL); m->m_pkthdr.len = m->m_len = len; m->m_pkthdr.rcvif = 0; rtm = mtod(m, struct rt_msghdr *); Index: netatm/port.h =================================================================== RCS file: /home/ncvs/src/sys/netatm/port.h,v retrieving revision 1.15 diff -u -r1.15 port.h --- netatm/port.h 23 Feb 2003 22:26:39 -0000 1.15 +++ netatm/port.h 5 Mar 2003 01:02:35 -0000 @@ -151,14 +151,9 @@ } #define KB_ALLOCEXT(bfr, size, flags, type) { \ if ((size) <= MCLBYTES) { \ - MGET((bfr), (flags), (type)); \ - if ((bfr) != NULL) { \ - MCLGET((bfr), (flags)); \ - if (((bfr)->m_flags & M_EXT) == 0) { \ - m_freem((bfr)); \ - (bfr) = NULL; \ - } \ - } \ + (bfr) = m_getcl((flags), (type), 0); \ + if ((bfr) == NULL) \ + panic("Out of mbufs!"); \ } else \ (bfr) = NULL; \ } Index: netgraph/ng_pppoe.c =================================================================== RCS file: /home/ncvs/src/sys/netgraph/ng_pppoe.c,v retrieving revision 1.58 diff -u -r1.58 ng_pppoe.c --- netgraph/ng_pppoe.c 19 Feb 2003 05:47:31 -0000 1.58 +++ netgraph/ng_pppoe.c 5 Mar 2003 01:02:35 -0000 @@ -723,20 +723,13 @@ printf("pppoe: Session out of memory\n"); LEAVE(ENOMEM); } - MGETHDR(neg->m, M_DONTWAIT, MT_DATA); + neg->m = m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR); if(neg->m == NULL) 
{ - printf("pppoe: Session out of mbufs\n"); + printf("pppoe: Session out of mbufs and cls\n"); FREE(neg, M_NETGRAPH_PPPOE); LEAVE(ENOBUFS); } neg->m->m_pkthdr.rcvif = NULL; - MCLGET(neg->m, M_DONTWAIT); - if ((neg->m->m_flags & M_EXT) == 0) { - printf("pppoe: Session out of mcls\n"); - m_freem(neg->m); - FREE(neg, M_NETGRAPH_PPPOE); - LEAVE(ENOBUFS); - } sp->neg = neg; callout_handle_init( &neg->timeout_handle); neg->m->m_len = sizeof(struct pppoe_full_hdr); Index: netgraph/ng_vjc.c =================================================================== RCS file: /home/ncvs/src/sys/netgraph/ng_vjc.c,v retrieving revision 1.23 diff -u -r1.23 ng_vjc.c --- netgraph/ng_vjc.c 19 Feb 2003 05:47:32 -0000 1.23 +++ netgraph/ng_vjc.c 5 Mar 2003 01:02:35 -0000 @@ -476,7 +476,9 @@ m_adj(m, vjlen); /* Copy the reconstructed TCP/IP headers into a new mbuf */ - MGETHDR(hm, M_DONTWAIT, MT_DATA); + hm = (hlen > MHLEN) ? + m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : + m_gethdr(M_DONTWAIT, MT_DATA); if (hm == NULL) { priv->slc.sls_errorin++; NG_FREE_M(m); @@ -485,16 +487,6 @@ } hm->m_len = 0; hm->m_pkthdr.rcvif = NULL; - if (hlen > MHLEN) { /* unlikely, but can happen */ - MCLGET(hm, M_DONTWAIT); - if ((hm->m_flags & M_EXT) == 0) { - m_freem(hm); - priv->slc.sls_errorin++; - NG_FREE_M(m); - NG_FREE_ITEM(item); - return (ENOBUFS); - } - } bcopy(hdr, mtod(hm, u_char *), hlen); hm->m_len = hlen; Index: netinet/tcp_output.c =================================================================== RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v retrieving revision 1.78 diff -u -r1.78 tcp_output.c --- netinet/tcp_output.c 19 Feb 2003 22:18:05 -0000 1.78 +++ netinet/tcp_output.c 5 Mar 2003 01:02:35 -0000 @@ -604,21 +604,18 @@ m->m_len += hdrlen; m->m_data -= hdrlen; #else +#ifdef INET6 + m = (MHLEN < hdrlen + max_linkhdr) ? 
+ m_getcl(M_DONTWAIT, MT_HEADER, M_PKTHDR) : + m_gethdr(M_DONTWAIT, MT_HEADER); +#else MGETHDR(m, M_DONTWAIT, MT_HEADER); +#endif if (m == NULL) { error = ENOBUFS; goto out; } -#ifdef INET6 - if (MHLEN < hdrlen + max_linkhdr) { - MCLGET(m, M_DONTWAIT); - if ((m->m_flags & M_EXT) == 0) { - m_freem(m); - error = ENOBUFS; - goto out; - } - } -#endif + m->m_data += max_linkhdr; m->m_len = hdrlen; if (len <= MHLEN - hdrlen - max_linkhdr) { Index: netipsec/key.c =================================================================== RCS file: /home/ncvs/src/sys/netipsec/key.c,v retrieving revision 1.5 diff -u -r1.5 key.c --- netipsec/key.c 19 Feb 2003 05:47:36 -0000 1.5 +++ netipsec/key.c 5 Mar 2003 01:02:35 -0000 @@ -2079,14 +2079,10 @@ if (len > MCLBYTES) return key_senderror(so, m, ENOBUFS); - MGETHDR(n, M_DONTWAIT, MT_DATA); - if (n && len > MHLEN) { - MCLGET(n, M_DONTWAIT); - if ((n->m_flags & M_EXT) == 0) { - m_freem(n); - n = NULL; - } - } + n = (len > MHLEN) ? + m_getcl(n, M_DONTWAIT, MT_DATA) : + m_gethdr(M_DONTWAIT, MT_DATA); + if (!n) return key_senderror(so, m, ENOBUFS); Index: nfsclient/krpc_subr.c =================================================================== RCS file: /home/ncvs/src/sys/nfsclient/krpc_subr.c,v retrieving revision 1.22 diff -u -r1.22 krpc_subr.c --- nfsclient/krpc_subr.c 19 Feb 2003 05:47:38 -0000 1.22 +++ nfsclient/krpc_subr.c 4 Mar 2003 22:56:15 -0000 @@ -465,14 +465,12 @@ if (mlen > MCLBYTES) /* If too big, we just can't do it. */ return (NULL); - m = m_get(M_TRYWAIT, MT_DATA); - if (mlen > MLEN) { - MCLGET(m, M_TRYWAIT); - if ((m->m_flags & M_EXT) == 0) { - (void) m_free(m); /* There can be only one. */ - return (NULL); - } - } + m = (mlen > MLEN) ? 
+ m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : + m_get(M_TRYWAIT, MT_DATA); + if (m == NULL) + return (NULL); + xs = mtod(m, struct xdr_string *); m->m_len = mlen; xs->len = txdr_unsigned(len); Index: nfsclient/nfs_socket.c =================================================================== RCS file: /home/ncvs/src/sys/nfsclient/nfs_socket.c,v retrieving revision 1.95 diff -u -r1.95 nfs_socket.c --- nfsclient/nfs_socket.c 2 Mar 2003 16:54:38 -0000 1.95 +++ nfsclient/nfs_socket.c 4 Mar 2003 22:56:15 -0000 @@ -1378,10 +1378,11 @@ ++nfs_realign_test; while ((m = *pm) != NULL) { if ((m->m_len & 0x3) || (mtod(m, intptr_t) & 0x3)) { - MGET(n, M_TRYWAIT, MT_DATA); - if (m->m_len >= MINCLSIZE) { - MCLGET(n, M_TRYWAIT); - } + n = (m->m_len >= MINCLSIZE) ? + m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : + m_get(M_TRYWAIT, MT_DATA); + if (n == NULL) + panic("nfs_realign: Out of mbufs"); n->m_len = 0; break; } Index: nfsclient/nfs_subs.c =================================================================== RCS file: /home/ncvs/src/sys/nfsclient/nfs_subs.c,v retrieving revision 1.117 diff -u -r1.117 nfs_subs.c --- nfsclient/nfs_subs.c 19 Feb 2003 05:47:38 -0000 1.117 +++ nfsclient/nfs_subs.c 4 Mar 2003 22:56:15 -0000 @@ -142,9 +142,11 @@ { struct mbuf *mb; - MGET(mb, M_TRYWAIT, MT_DATA); - if (hsiz >= MINCLSIZE) - MCLGET(mb, M_TRYWAIT); + mb = (hsiz >= MINCLSIZE) ? + m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : + m_get(M_TRYWAIT, MT_DATA); + if (mb == NULL) + return NULL; mb->m_len = 0; return (mb); } @@ -168,10 +170,12 @@ int grpsiz, authsiz; authsiz = nfsm_rndup(auth_len); - MGETHDR(mb, M_TRYWAIT, MT_DATA); - if ((authsiz + 10 * NFSX_UNSIGNED) >= MINCLSIZE) { - MCLGET(mb, M_TRYWAIT); - } else if ((authsiz + 10 * NFSX_UNSIGNED) < MHLEN) { + mb = (authsiz + 10 * NFSX_UNSIGNED >= MINCLSIZE) ? 
+ m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : + m_gethdr(M_TRYWAIT, MT_DATA); + if (mb == NULL) + return NULL; + if ((authsiz + 10 * NFSX_UNSIGNED) < MHLEN) { MH_ALIGN(mb, authsiz + 10 * NFSX_UNSIGNED); } else { MH_ALIGN(mb, 8 * NFSX_UNSIGNED); @@ -271,9 +275,11 @@ while (left > 0) { mlen = M_TRAILINGSPACE(mp); if (mlen == 0) { - MGET(mp, M_TRYWAIT, MT_DATA); - if (clflg) - MCLGET(mp, M_TRYWAIT); + mp = (clflg) ? + m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : + m_get(M_TRYWAIT, MT_DATA); + if (mp == NULL) + return ENOBUFS; mp->m_len = 0; mp2->m_next = mp; mp2 = mp; @@ -349,9 +355,11 @@ } /* Loop around adding mbufs */ while (siz > 0) { - MGET(m1, M_TRYWAIT, MT_DATA); - if (siz > MLEN) - MCLGET(m1, M_TRYWAIT); + m1 = (siz > MLEN) ? + m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : + m_get(M_TRYWAIT, MT_DATA); + if (m1 == NULL) + return ENOBUFS; m1->m_len = NFSMSIZ(m1); m2->m_next = m1; m2 = m1; Index: nfsserver/nfs_serv.c =================================================================== RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v retrieving revision 1.131 diff -u -r1.131 nfs_serv.c --- nfsserver/nfs_serv.c 25 Feb 2003 03:37:47 -0000 1.131 +++ nfsserver/nfs_serv.c 4 Mar 2003 22:56:15 -0000 @@ -656,8 +656,11 @@ len = 0; i = 0; while (len < NFS_MAXPATHLEN) { - MGET(nmp, M_TRYWAIT, MT_DATA); - MCLGET(nmp, M_TRYWAIT); + nmp = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR); + if (nmp == NULL) { + error = ENOBUFS; + return (error); + } nmp->m_len = NFSMSIZ(nmp); if (len == 0) mp3 = mp = nmp; @@ -899,8 +902,9 @@ i++; } if (left > 0) { - MGET(m, M_TRYWAIT, MT_DATA); - MCLGET(m, M_TRYWAIT); + m = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR); + if (m == NULL) + return (ENOBUFS); m->m_len = 0; m2->m_next = m; m2 = m; Index: nfsserver/nfs_srvsock.c =================================================================== RCS file: /home/ncvs/src/sys/nfsserver/nfs_srvsock.c,v retrieving revision 1.83 diff -u -r1.83 nfs_srvsock.c --- nfsserver/nfs_srvsock.c 2 Mar 2003 16:54:39 -0000 1.83 +++ 
nfsserver/nfs_srvsock.c 4 Mar 2003 22:56:15 -0000 @@ -148,18 +148,23 @@ nd->nd_repstat = err; if (err && (nd->nd_flag & ND_NFSV3) == 0) /* XXX recheck */ siz = 0; - MGETHDR(mreq, M_TRYWAIT, MT_DATA); - mb = mreq; /* * If this is a big reply, use a cluster else * try and leave leading space for the lower level headers. */ - mreq->m_len = 6 * NFSX_UNSIGNED; siz += RPC_REPLYSIZ; if ((max_hdr + siz) >= MINCLSIZE) { - MCLGET(mreq, M_TRYWAIT); - } else + mreq = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR); + if (mreq == NULL) + return (NULL); + } else { + mreq = m_gethdr(M_TRYWAIT, MT_DATA); + if (mreq == NULL) + return (NULL); mreq->m_data += min(max_hdr, M_TRAILINGSPACE(mreq)); + } + mreq->m_len = 6 * NFSX_UNSIGNED; + mb = mreq; tl = mtod(mreq, u_int32_t *); bpos = ((caddr_t)tl) + mreq->m_len; *tl++ = txdr_unsigned(nd->nd_retxid); @@ -244,10 +249,11 @@ ++nfs_realign_test; while ((m = *pm) != NULL) { if ((m->m_len & 0x3) || (mtod(m, intptr_t) & 0x3)) { - MGET(n, M_TRYWAIT, MT_DATA); - if (m->m_len >= MINCLSIZE) { - MCLGET(n, M_TRYWAIT); - } + n = (m->m_len >= MINCLSIZE) ? 
+ m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : + m_get(M_TRYWAIT, MT_DATA); + if (n == NULL) + panic("Out of mbufs!\n"); n->m_len = 0; break; } Index: nfsserver/nfs_srvsubs.c =================================================================== RCS file: /home/ncvs/src/sys/nfsserver/nfs_srvsubs.c,v retrieving revision 1.120 diff -u -r1.120 nfs_srvsubs.c --- nfsserver/nfs_srvsubs.c 19 Feb 2003 05:47:39 -0000 1.120 +++ nfsserver/nfs_srvsubs.c 4 Mar 2003 22:56:15 -0000 @@ -1287,8 +1287,9 @@ if (*bp >= *be) { if (*mp == mb) (*mp)->m_len += *bp - bpos; - MGET(nmp, M_TRYWAIT, MT_DATA); - MCLGET(nmp, M_TRYWAIT); + nmp = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR); + if (nmp == NULL) + panic("nfsm_clget_xx: Out of mbufs"); nmp->m_len = NFSMSIZ(nmp); (*mp)->m_next = nmp; *mp = nmp; --xHFwDpU9dbj6ez1V-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Mar 6 21:27: 1 2003 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0A9D837B401 for ; Thu, 6 Mar 2003 21:26:46 -0800 (PST) Received: from xorpc.icir.org (xorpc.icir.org [192.150.187.68]) by mx1.FreeBSD.org (Postfix) with ESMTP id F0E4A43F3F for ; Thu, 6 Mar 2003 21:26:44 -0800 (PST) (envelope-from rizzo@xorpc.icir.org) Received: from xorpc.icir.org (localhost [127.0.0.1]) by xorpc.icir.org (8.12.3/8.12.3) with ESMTP id h275QdAq039019; Thu, 6 Mar 2003 21:26:39 -0800 (PST) (envelope-from rizzo@xorpc.icir.org) Received: (from rizzo@localhost) by xorpc.icir.org (8.12.3/8.12.3/Submit) id h275QdVG039018; Thu, 6 Mar 2003 21:26:39 -0800 (PST) (envelope-from rizzo) Date: Thu, 6 Mar 2003 21:26:38 -0800 From: Luigi Rizzo To: Hiten Pandya Cc: arch@FreeBSD.ORG Subject: Re: Using m_getcl() in network and nfs code paths Message-ID: <20030306212638.A32850@xorpc.icir.org> References: <20030307004958.GA98917@unixdaemons.com> Mime-Version: 1.0 Content-Type: text/plain; 
charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030307004958.GA98917@unixdaemons.com>; from hiten@unixdaemons.com on Thu, Mar 06, 2003 at 07:49:58PM -0500 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On Thu, Mar 06, 2003 at 07:49:58PM -0500, Hiten Pandya wrote: > After discussing this with Bosko Milekic and some other people, I have > made some changes to net, netinet, and the NFS server and client code to > utilize the m_getcl() routine. ... > Comments are welcome. If there are no objections, hopefully, Bosko > will commit the patches. the number of places where the code does m = (want > X) ? m_getcl(...) : m_get(...) makes me wonder if we shouldn't perhaps add a 'desired_size' parameter to m_getcl() so we can have the test made in one place and in a consistent way (i.e., always use the same threshold X instead of MLEN/MINCLSIZE/MHLEN, which i suspect is incorrect somewhere). Also it makes no sense to print a msg on failure -- the allocator already does that. And even less to panic (as in the netatm case). All of the above are bugs in the original code, but given that you are going through it, it would make sense to fix them once and for all. cheers luigi > Cheers. > > -- > Hiten Pandya (hiten@unixdaemons.com, hiten@uk.FreeBSD.org) > http://www.unixdaemons.com/~hiten/ > Index: src/sys/kern/uipc_mbuf2.c > =================================================================== > RCS file: /home/ncvs/src/sys/kern/uipc_mbuf2.c,v > retrieving revision 1.18 > diff -u -r1.18 uipc_mbuf2.c > --- src/sys/kern/uipc_mbuf2.c 19 Feb 2003 05:47:26 -0000 1.18 > +++ src/sys/kern/uipc_mbuf2.c 7 Mar 2003 00:31:27 -0000 > @@ -230,14 +230,9 @@ > * now, we need to do the hard way. don't m_copy as there's no room > * on both end. 
> */ > - MGET(o, M_DONTWAIT, m->m_type); > - if (o && len > MLEN) { > - MCLGET(o, M_DONTWAIT); > - if ((o->m_flags & M_EXT) == 0) { > - m_free(o); > - o = NULL; > - } > - } > + o = (len > MLEN) ? > + m_getcl(M_DONTWAIT, m->m_type, m->m_flags) : > + m_get(M_DONTWAIT, m->m_type); > if (!o) { > m_freem(m); > return NULL; /* ENOBUFS */ > Index: kern/uipc_socket2.c > =================================================================== > RCS file: /home/ncvs/src/sys/kern/uipc_socket2.c,v > retrieving revision 1.111 > diff -u -r1.111 uipc_socket2.c > --- kern/uipc_socket2.c 21 Feb 2003 22:23:40 -0000 1.111 > +++ kern/uipc_socket2.c 5 Mar 2003 00:55:21 -0000 > @@ -833,15 +833,11 @@ > > if (CMSG_SPACE((u_int)size) > MCLBYTES) > return ((struct mbuf *) NULL); > - if ((m = m_get(M_DONTWAIT, MT_CONTROL)) == NULL) > - return ((struct mbuf *) NULL); > - if (CMSG_SPACE((u_int)size) > MLEN) { > - MCLGET(m, M_DONTWAIT); > - if ((m->m_flags & M_EXT) == 0) { > - m_free(m); > - return ((struct mbuf *) NULL); > - } > - } > + m = CMSG_SPACE((u_int)size > MLEN) ? > + m_getcl(M_DONTWAIT, MT_CONTROL, 0) : /* Note: !M_PKTHDR */ > + m_get(M_DONTWAIT, MT_CONTROL); > + if (m == NULL) > + return NULL; > cp = mtod(m, struct cmsghdr *); > m->m_len = 0; > KASSERT(CMSG_SPACE((u_int)size) <= M_TRAILINGSPACE(m), > Index: net/if_ieee80211subr.c > =================================================================== > RCS file: /home/ncvs/src/sys/net/if_ieee80211subr.c,v > retrieving revision 1.7 > diff -u -r1.7 if_ieee80211subr.c > --- net/if_ieee80211subr.c 3 Mar 2003 06:09:18 -0000 1.7 > +++ net/if_ieee80211subr.c 5 Mar 2003 01:02:34 -0000 > @@ -2544,16 +2544,14 @@ > if (len > n->m_len - noff) { > len = n->m_len - noff; > if (len == 0) { > - MGET(n->m_next, M_DONTWAIT, n->m_type); > + n->m_next = (left >= MINCLSIZE) ? 
> + m_getcl(M_DONTWAIT, n->m_type, M_PKTHDR) : > + m_get(M_DONTWAIT, n->m_type); > if (n->m_next == NULL) > goto fail; > n = n->m_next; > - n->m_len = MLEN; > - if (left >= MINCLSIZE) { > - MCLGET(n, M_DONTWAIT); > - if (n->m_flags & M_EXT) > - n->m_len = n->m_ext.ext_size; > - } > + n->m_len = (left >= MINCLSIZE) ? > + n->m_len = n->m_ext.ext_size : MLEN; > noff = 0; > continue; > } > Index: net/if_ppp.c > =================================================================== > RCS file: /home/ncvs/src/sys/net/if_ppp.c,v > retrieving revision 1.89 > diff -u -r1.89 if_ppp.c > --- net/if_ppp.c 19 Feb 2003 05:47:29 -0000 1.89 > +++ net/if_ppp.c 5 Mar 2003 01:02:34 -0000 > @@ -1411,13 +1411,14 @@ > } > > /* Copy the PPP and IP headers into a new mbuf. */ > - MGETHDR(mp, M_DONTWAIT, MT_DATA); > + mp = (hlen + PPP_HDRLEN > MHLEN) ? > + m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : > + m_gethdr(M_DONTWAIT, MT_DATA); > if (mp == NULL) > goto bad; > mp->m_len = 0; > mp->m_next = NULL; > if (hlen + PPP_HDRLEN > MHLEN) { > - MCLGET(mp, M_DONTWAIT); > if (M_TRAILINGSPACE(mp) < hlen + PPP_HDRLEN) { > m_freem(mp); > goto bad; /* lose if big headers and no clusters */ > Index: net/if_sl.c > =================================================================== > RCS file: /home/ncvs/src/sys/net/if_sl.c,v > retrieving revision 1.109 > diff -u -r1.109 if_sl.c > --- net/if_sl.c 19 Feb 2003 05:47:29 -0000 1.109 > +++ net/if_sl.c 5 Mar 2003 01:02:34 -0000 > @@ -266,14 +266,7 @@ > > MALLOC(sc, struct sl_softc *, sizeof(*sc), M_SL, M_WAITOK | M_ZERO); > > - m = m_gethdr(M_TRYWAIT, MT_DATA); > - if (m != NULL) { > - MCLGET(m, M_TRYWAIT); > - if ((m->m_flags & M_EXT) == 0) { > - m_free(m); > - m = NULL; > - } > - } > + m = m_getcl(m, M_TRYWAIT, MT_DATA); > > if (m == NULL) { > printf("sl: can't allocate buffer\n"); > @@ -791,10 +784,6 @@ > { > struct mbuf *m, *newm; > > - MGETHDR(m, M_DONTWAIT, MT_DATA); > - if (m == NULL) > - return (NULL); > - > /* > * If we have more than MHLEN bytes, it's 
cheaper to > * queue the cluster we just filled & allocate a new one > @@ -802,16 +791,13 @@ > * allocated above. Note that code in the input routine > * guarantees that packet will fit in a cluster. > */ > + m = (len >= MHLEN) ? > + m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : > + m_gethdr(M_DONTWAIT, MT_DATA); > + if (m == NULL) > + return (NULL); > + > if (len >= MHLEN) { > - MCLGET(m, M_DONTWAIT); > - if ((m->m_flags & M_EXT) == 0) { > - /* > - * we couldn't get a cluster - if memory's this > - * low, it's time to start dropping packets. > - */ > - (void) m_free(m); > - return (NULL); > - } > /* Swap the new and old clusters */ > newm = m; > m = sc->sc_mbuf; > Index: net/rtsock.c > =================================================================== > RCS file: /home/ncvs/src/sys/net/rtsock.c,v > retrieving revision 1.88 > diff -u -r1.88 rtsock.c > --- net/rtsock.c 19 Feb 2003 05:47:29 -0000 1.88 > +++ net/rtsock.c 5 Mar 2003 01:02:34 -0000 > @@ -608,16 +608,12 @@ > } > if (len > MCLBYTES) > panic("rt_msg1"); > - m = m_gethdr(M_DONTWAIT, MT_DATA); > - if (m && len > MHLEN) { > - MCLGET(m, M_DONTWAIT); > - if ((m->m_flags & M_EXT) == 0) { > - m_free(m); > - m = NULL; > - } > - } > + m = (len > MHLEN) ? 
> + m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : > + m_gethdr(M_DONTWAIT, MT_DATA); > + > if (m == 0) > - return (m); > + return (NULL); > m->m_pkthdr.len = m->m_len = len; > m->m_pkthdr.rcvif = 0; > rtm = mtod(m, struct rt_msghdr *); > Index: netatm/port.h > =================================================================== > RCS file: /home/ncvs/src/sys/netatm/port.h,v > retrieving revision 1.15 > diff -u -r1.15 port.h > --- netatm/port.h 23 Feb 2003 22:26:39 -0000 1.15 > +++ netatm/port.h 5 Mar 2003 01:02:35 -0000 > @@ -151,14 +151,9 @@ > } > #define KB_ALLOCEXT(bfr, size, flags, type) { \ > if ((size) <= MCLBYTES) { \ > - MGET((bfr), (flags), (type)); \ > - if ((bfr) != NULL) { \ > - MCLGET((bfr), (flags)); \ > - if (((bfr)->m_flags & M_EXT) == 0) { \ > - m_freem((bfr)); \ > - (bfr) = NULL; \ > - } \ > - } \ > + (bfr) = m_getcl((flags), (type), 0); \ > + if ((bfr) == NULL) \ > + panic("Out of mbufs!"); \ > } else \ > (bfr) = NULL; \ > } > Index: netgraph/ng_pppoe.c > =================================================================== > RCS file: /home/ncvs/src/sys/netgraph/ng_pppoe.c,v > retrieving revision 1.58 > diff -u -r1.58 ng_pppoe.c > --- netgraph/ng_pppoe.c 19 Feb 2003 05:47:31 -0000 1.58 > +++ netgraph/ng_pppoe.c 5 Mar 2003 01:02:35 -0000 > @@ -723,20 +723,13 @@ > printf("pppoe: Session out of memory\n"); > LEAVE(ENOMEM); > } > - MGETHDR(neg->m, M_DONTWAIT, MT_DATA); > + neg->m = m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR); > if(neg->m == NULL) { > - printf("pppoe: Session out of mbufs\n"); > + printf("pppoe: Session out of mbufs and cls\n"); > FREE(neg, M_NETGRAPH_PPPOE); > LEAVE(ENOBUFS); > } > neg->m->m_pkthdr.rcvif = NULL; > - MCLGET(neg->m, M_DONTWAIT); > - if ((neg->m->m_flags & M_EXT) == 0) { > - printf("pppoe: Session out of mcls\n"); > - m_freem(neg->m); > - FREE(neg, M_NETGRAPH_PPPOE); > - LEAVE(ENOBUFS); > - } > sp->neg = neg; > callout_handle_init( &neg->timeout_handle); > neg->m->m_len = sizeof(struct pppoe_full_hdr); > Index: netgraph/ng_vjc.c > 
=================================================================== > RCS file: /home/ncvs/src/sys/netgraph/ng_vjc.c,v > retrieving revision 1.23 > diff -u -r1.23 ng_vjc.c > --- netgraph/ng_vjc.c 19 Feb 2003 05:47:32 -0000 1.23 > +++ netgraph/ng_vjc.c 5 Mar 2003 01:02:35 -0000 > @@ -476,7 +476,9 @@ > m_adj(m, vjlen); > > /* Copy the reconstructed TCP/IP headers into a new mbuf */ > - MGETHDR(hm, M_DONTWAIT, MT_DATA); > + hm = (hlen > MHLEN) ? > + m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR) : > + m_gethdr(M_DONTWAIT, MT_DATA); > if (hm == NULL) { > priv->slc.sls_errorin++; > NG_FREE_M(m); > @@ -485,16 +487,6 @@ > } > hm->m_len = 0; > hm->m_pkthdr.rcvif = NULL; > - if (hlen > MHLEN) { /* unlikely, but can happen */ > - MCLGET(hm, M_DONTWAIT); > - if ((hm->m_flags & M_EXT) == 0) { > - m_freem(hm); > - priv->slc.sls_errorin++; > - NG_FREE_M(m); > - NG_FREE_ITEM(item); > - return (ENOBUFS); > - } > - } > bcopy(hdr, mtod(hm, u_char *), hlen); > hm->m_len = hlen; > > Index: netinet/tcp_output.c > =================================================================== > RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v > retrieving revision 1.78 > diff -u -r1.78 tcp_output.c > --- netinet/tcp_output.c 19 Feb 2003 22:18:05 -0000 1.78 > +++ netinet/tcp_output.c 5 Mar 2003 01:02:35 -0000 > @@ -604,21 +604,18 @@ > m->m_len += hdrlen; > m->m_data -= hdrlen; > #else > +#ifdef INET6 > + m = (MHLEN < hdrlen + max_linkhdr) ? 
> + m_getcl(M_DONTWAIT, MT_HEADER, M_PKTHDR) : > + m_gethdr(M_DONTWAIT, MT_HEADER); > +#else > MGETHDR(m, M_DONTWAIT, MT_HEADER); > +#endif > if (m == NULL) { > error = ENOBUFS; > goto out; > } > -#ifdef INET6 > - if (MHLEN < hdrlen + max_linkhdr) { > - MCLGET(m, M_DONTWAIT); > - if ((m->m_flags & M_EXT) == 0) { > - m_freem(m); > - error = ENOBUFS; > - goto out; > - } > - } > -#endif > + > m->m_data += max_linkhdr; > m->m_len = hdrlen; > if (len <= MHLEN - hdrlen - max_linkhdr) { > Index: netipsec/key.c > =================================================================== > RCS file: /home/ncvs/src/sys/netipsec/key.c,v > retrieving revision 1.5 > diff -u -r1.5 key.c > --- netipsec/key.c 19 Feb 2003 05:47:36 -0000 1.5 > +++ netipsec/key.c 5 Mar 2003 01:02:35 -0000 > @@ -2079,14 +2079,10 @@ > > if (len > MCLBYTES) > return key_senderror(so, m, ENOBUFS); > - MGETHDR(n, M_DONTWAIT, MT_DATA); > - if (n && len > MHLEN) { > - MCLGET(n, M_DONTWAIT); > - if ((n->m_flags & M_EXT) == 0) { > - m_freem(n); > - n = NULL; > - } > - } > + n = (len > MHLEN) ? > + m_getcl(n, M_DONTWAIT, MT_DATA) : > + m_gethdr(M_DONTWAIT, MT_DATA); > + > if (!n) > return key_senderror(so, m, ENOBUFS); > > Index: nfsclient/krpc_subr.c > =================================================================== > RCS file: /home/ncvs/src/sys/nfsclient/krpc_subr.c,v > retrieving revision 1.22 > diff -u -r1.22 krpc_subr.c > --- nfsclient/krpc_subr.c 19 Feb 2003 05:47:38 -0000 1.22 > +++ nfsclient/krpc_subr.c 4 Mar 2003 22:56:15 -0000 > @@ -465,14 +465,12 @@ > if (mlen > MCLBYTES) /* If too big, we just can't do it. */ > return (NULL); > > - m = m_get(M_TRYWAIT, MT_DATA); > - if (mlen > MLEN) { > - MCLGET(m, M_TRYWAIT); > - if ((m->m_flags & M_EXT) == 0) { > - (void) m_free(m); /* There can be only one. */ > - return (NULL); > - } > - } > + m = (mlen > MLEN) ? 
> + m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : > + m_get(M_TRYWAIT, MT_DATA); > + if (m == NULL) > + return (NULL); > + > xs = mtod(m, struct xdr_string *); > m->m_len = mlen; > xs->len = txdr_unsigned(len); > Index: nfsclient/nfs_socket.c > =================================================================== > RCS file: /home/ncvs/src/sys/nfsclient/nfs_socket.c,v > retrieving revision 1.95 > diff -u -r1.95 nfs_socket.c > --- nfsclient/nfs_socket.c 2 Mar 2003 16:54:38 -0000 1.95 > +++ nfsclient/nfs_socket.c 4 Mar 2003 22:56:15 -0000 > @@ -1378,10 +1378,11 @@ > ++nfs_realign_test; > while ((m = *pm) != NULL) { > if ((m->m_len & 0x3) || (mtod(m, intptr_t) & 0x3)) { > - MGET(n, M_TRYWAIT, MT_DATA); > - if (m->m_len >= MINCLSIZE) { > - MCLGET(n, M_TRYWAIT); > - } > + n = (m->m_len >= MINCLSIZE) ? > + m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : > + m_get(M_TRYWAIT, MT_DATA); > + if (n == NULL) > + panic("nfs_realign: Out of mbufs"); > n->m_len = 0; > break; > } > Index: nfsclient/nfs_subs.c > =================================================================== > RCS file: /home/ncvs/src/sys/nfsclient/nfs_subs.c,v > retrieving revision 1.117 > diff -u -r1.117 nfs_subs.c > --- nfsclient/nfs_subs.c 19 Feb 2003 05:47:38 -0000 1.117 > +++ nfsclient/nfs_subs.c 4 Mar 2003 22:56:15 -0000 > @@ -142,9 +142,11 @@ > { > struct mbuf *mb; > > - MGET(mb, M_TRYWAIT, MT_DATA); > - if (hsiz >= MINCLSIZE) > - MCLGET(mb, M_TRYWAIT); > + mb = (hsiz >= MINCLSIZE) ? > + m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) : > + m_get(M_TRYWAIT, MT_DATA); > + if (mb == NULL) > + return NULL; > mb->m_len = 0; > return (mb); > } > @@ -168,10 +170,12 @@ > int grpsiz, authsiz; > > authsiz = nfsm_rndup(auth_len); > - MGETHDR(mb, M_TRYWAIT, MT_DATA); > - if ((authsiz + 10 * NFSX_UNSIGNED) >= MINCLSIZE) { > - MCLGET(mb, M_TRYWAIT); > - } else if ((authsiz + 10 * NFSX_UNSIGNED) < MHLEN) { > + mb = (authsiz + 10 * NFSX_UNSIGNED >= MINCLSIZE) ? 
> +		m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) :
> +		m_gethdr(M_TRYWAIT, MT_DATA);
> +	if (mb == NULL)
> +		return NULL;
> +	if ((authsiz + 10 * NFSX_UNSIGNED) < MHLEN) {
>  		MH_ALIGN(mb, authsiz + 10 * NFSX_UNSIGNED);
>  	} else {
>  		MH_ALIGN(mb, 8 * NFSX_UNSIGNED);
> @@ -271,9 +275,11 @@
>  	while (left > 0) {
>  		mlen = M_TRAILINGSPACE(mp);
>  		if (mlen == 0) {
> -			MGET(mp, M_TRYWAIT, MT_DATA);
> -			if (clflg)
> -				MCLGET(mp, M_TRYWAIT);
> +			mp = (clflg) ?
> +				m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) :
> +				m_get(M_TRYWAIT, MT_DATA);
> +			if (mp == NULL)
> +				return ENOBUFS;
>  			mp->m_len = 0;
>  			mp2->m_next = mp;
>  			mp2 = mp;
> @@ -349,9 +355,11 @@
>  	}
>  	/* Loop around adding mbufs */
>  	while (siz > 0) {
> -		MGET(m1, M_TRYWAIT, MT_DATA);
> -		if (siz > MLEN)
> -			MCLGET(m1, M_TRYWAIT);
> +		m1 = (siz > MLEN) ?
> +			m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) :
> +			m_get(M_TRYWAIT, MT_DATA);
> +		if (m1 == NULL)
> +			return ENOBUFS;
>  		m1->m_len = NFSMSIZ(m1);
>  		m2->m_next = m1;
>  		m2 = m1;
> Index: nfsserver/nfs_serv.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v
> retrieving revision 1.131
> diff -u -r1.131 nfs_serv.c
> --- nfsserver/nfs_serv.c	25 Feb 2003 03:37:47 -0000	1.131
> +++ nfsserver/nfs_serv.c	4 Mar 2003 22:56:15 -0000
> @@ -656,8 +656,11 @@
>  	len = 0;
>  	i = 0;
>  	while (len < NFS_MAXPATHLEN) {
> -		MGET(nmp, M_TRYWAIT, MT_DATA);
> -		MCLGET(nmp, M_TRYWAIT);
> +		nmp = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR);
> +		if (nmp == NULL) {
> +			error = ENOBUFS;
> +			return (error);
> +		}
>  		nmp->m_len = NFSMSIZ(nmp);
>  		if (len == 0)
>  			mp3 = mp = nmp;
> @@ -899,8 +902,9 @@
>  		i++;
>  	}
>  	if (left > 0) {
> -		MGET(m, M_TRYWAIT, MT_DATA);
> -		MCLGET(m, M_TRYWAIT);
> +		m = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR);
> +		if (m == NULL)
> +			return (ENOBUFS);
>  		m->m_len = 0;
>  		m2->m_next = m;
>  		m2 = m;
> Index: nfsserver/nfs_srvsock.c
> ===================================================================
> RCS file:
/home/ncvs/src/sys/nfsserver/nfs_srvsock.c,v
> retrieving revision 1.83
> diff -u -r1.83 nfs_srvsock.c
> --- nfsserver/nfs_srvsock.c	2 Mar 2003 16:54:39 -0000	1.83
> +++ nfsserver/nfs_srvsock.c	4 Mar 2003 22:56:15 -0000
> @@ -148,18 +148,23 @@
>  	nd->nd_repstat = err;
>  	if (err && (nd->nd_flag & ND_NFSV3) == 0)	/* XXX recheck */
>  		siz = 0;
> -	MGETHDR(mreq, M_TRYWAIT, MT_DATA);
> -	mb = mreq;
>  	/*
>  	 * If this is a big reply, use a cluster else
>  	 * try and leave leading space for the lower level headers.
>  	 */
> -	mreq->m_len = 6 * NFSX_UNSIGNED;
>  	siz += RPC_REPLYSIZ;
>  	if ((max_hdr + siz) >= MINCLSIZE) {
> -		MCLGET(mreq, M_TRYWAIT);
> -	} else
> +		mreq = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR);
> +		if (mreq == NULL)
> +			return (NULL);
> +	} else {
> +		mreq = m_gethdr(M_TRYWAIT, MT_DATA);
> +		if (mreq == NULL)
> +			return (NULL);
>  		mreq->m_data += min(max_hdr, M_TRAILINGSPACE(mreq));
> +	}
> +	mreq->m_len = 6 * NFSX_UNSIGNED;
> +	mb = mreq;
>  	tl = mtod(mreq, u_int32_t *);
>  	bpos = ((caddr_t)tl) + mreq->m_len;
>  	*tl++ = txdr_unsigned(nd->nd_retxid);
> @@ -244,10 +249,11 @@
>  	++nfs_realign_test;
>  	while ((m = *pm) != NULL) {
>  		if ((m->m_len & 0x3) || (mtod(m, intptr_t) & 0x3)) {
> -			MGET(n, M_TRYWAIT, MT_DATA);
> -			if (m->m_len >= MINCLSIZE) {
> -				MCLGET(n, M_TRYWAIT);
> -			}
> +			n = (m->m_len >= MINCLSIZE) ?
> +				m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR) :
> +				m_get(M_TRYWAIT, MT_DATA);
> +			if (n == NULL)
> +				panic("Out of mbufs!\n");
>  			n->m_len = 0;
>  			break;
>  		}
> Index: nfsserver/nfs_srvsubs.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/nfsserver/nfs_srvsubs.c,v
> retrieving revision 1.120
> diff -u -r1.120 nfs_srvsubs.c
> --- nfsserver/nfs_srvsubs.c	19 Feb 2003 05:47:39 -0000	1.120
> +++ nfsserver/nfs_srvsubs.c	4 Mar 2003 22:56:15 -0000
> @@ -1287,8 +1287,9 @@
>  	if (*bp >= *be) {
>  		if (*mp == mb)
>  			(*mp)->m_len += *bp - bpos;
> -		MGET(nmp, M_TRYWAIT, MT_DATA);
> -		MCLGET(nmp, M_TRYWAIT);
> +		nmp = m_getcl(M_TRYWAIT, MT_DATA, M_PKTHDR);
> +		if (nmp == NULL)
> +			panic("nfsm_clget_xx: Out of mbufs");
>  		nmp->m_len = NFSMSIZ(nmp);
>  		(*mp)->m_next = nmp;
>  		*mp = nmp;

From owner-freebsd-arch Fri Mar 7 0: 7: 3 2003
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D8DB437B401 for ; Fri, 7 Mar 2003 00:07:01 -0800 (PST)
Received: from angelica.unixdaemons.com (angelica.unixdaemons.com [209.148.64.135]) by mx1.FreeBSD.org (Postfix) with ESMTP id DC73A43FAF for ; Fri, 7 Mar 2003 00:07:00 -0800 (PST) (envelope-from hiten@angelica.unixdaemons.com)
Received: from angelica.unixdaemons.com (localhost.unixdaemons.com [127.0.0.1]) by angelica.unixdaemons.com (8.12.8/8.12.1) with ESMTP id h2786xmq062444; Fri, 7 Mar 2003 03:06:59 -0500 (EST)
Received: (from hiten@localhost) by angelica.unixdaemons.com (8.12.8/8.12.1/Submit) id h2786xa2062443; Fri, 7 Mar 2003 03:06:59 -0500 (EST) (envelope-from hiten)
Date: Fri, 7 Mar 2003 03:06:59 -0500
From: Hiten Pandya
To: Luigi Rizzo
Cc: Hiten Pandya , arch@FreeBSD.ORG
Subject: Re: Using m_getcl() in network and nfs code paths
Message-ID: <20030307080659.GA60937@unixdaemons.com>
References:
<20030307004958.GA98917@unixdaemons.com> <20030306212638.A32850@xorpc.icir.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030306212638.A32850@xorpc.icir.org>
User-Agent: Mutt/1.4i
X-Operating-System: FreeBSD i386
X-Public-Key: http://www.pittgoth.com/~hiten/pubkey.asc
X-URL: http://www.unixdaemons.com/~hiten
X-PGP: http://pgp.mit.edu:11371/pks/lookup?search=Hiten+Pandya&op=index
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

Luigi Rizzo (Thu, Mar 06, 2003 at 09:26:38PM -0800) wrote:
> the number of places where the code does
>
> 	m = (want > X) ? m_getcl(...) : m_get(...)
>
> makes me wonder if we shouldn't perhaps add a 'desired_size'
> parameter to m_getcl() so we can have the test made in one
> place and in a consistent way (i.e. always use the same
> threshold X instead of MLEN/MINCLSIZE/MHLEN,
> which i suspect is incorrect somewhere).

Can you provide some examples for this? I do not exactly follow.
The size checking is always changing. I have similar changes to
the dev/ and pci/ and netgraph/ tree, and I noticed that it gets
checked against MLEN/MCLBYTES/MHLEN and whatnot... but I do not think
that is a bug, because surely they all represent different
quantities?

> Also it makes no sense to print a msg on failure -- the allocator
> already does that. And even less to panic (as in the netatm
> case).

Right. I avoided this in the first place, because some did complain
that extra verbosity was needed, but now that you have made the case
crystal clear, I will remove them. As for the panic, well, if you
noticed, error checking after mbuf + cluster allocation was screwed
from the start in the NFS code (the client code is sort of better in
this than the server code).
So I was left with no choice but to panic, because there was no way
I could return errors in a void function without considerable changes
to the surrounding code and files as well.

> All of the above are bugs in the original code but given that
> you are going through it it would make sense to fix them once
> for all.

Yes. I will go ahead and fix the printf stuff, but I guess I can't
look at the nfs error checking problems right now. I will put an XXX
comment saying 'better error checking', and then I will get to it
some other fine day. :-)

Thanks for your comments Luigi, very much appreciated.

From owner-freebsd-arch Fri Mar 7 0:25:28 2003
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DD6BD37B401 for ; Fri, 7 Mar 2003 00:25:26 -0800 (PST)
Received: from xorpc.icir.org (xorpc.icir.org [192.150.187.68]) by mx1.FreeBSD.org (Postfix) with ESMTP id BF66B43FAF for ; Fri, 7 Mar 2003 00:25:25 -0800 (PST) (envelope-from rizzo@xorpc.icir.org)
Received: from xorpc.icir.org (localhost [127.0.0.1]) by xorpc.icir.org (8.12.3/8.12.3) with ESMTP id h278PPAq054081; Fri, 7 Mar 2003 00:25:25 -0800 (PST) (envelope-from rizzo@xorpc.icir.org)
Received: (from rizzo@localhost) by xorpc.icir.org (8.12.3/8.12.3/Submit) id h278PP8o054080; Fri, 7 Mar 2003 00:25:25 -0800 (PST) (envelope-from rizzo)
Date: Fri, 7 Mar 2003 00:25:25 -0800
From: Luigi Rizzo
To: Hiten Pandya
Cc: arch@FreeBSD.ORG
Subject: Re: Using m_getcl() in network and nfs code paths
Message-ID: <20030307002525.A50491@xorpc.icir.org>
References: <20030307004958.GA98917@unixdaemons.com> <20030306212638.A32850@xorpc.icir.org> <20030307080659.GA60937@unixdaemons.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
In-Reply-To:
<20030307080659.GA60937@unixdaemons.com>; from hiten@unixdaemons.com on Fri, Mar 07, 2003 at 03:06:59AM -0500
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

the logic of this code

	m = (n > X) ? m_getcl(...) : m_get(...)

is the following: i have n bytes of data, give me a place large
enough to store them. This can be either a single mbuf or an
mbuf+cluster, depending on the size. The threshold (X) is whatever
fits into the mbuf (which varies depending on whether or not this is
a pkthdr mbuf, but again this is easy to tell inside m_getcl because
you are passing the M_PKTHDR flag).

Now, MINCLSIZE/MHLEN are basically the same thing, and MLEN covers
the case for !M_PKTHDR. But my point is that the programmer should
not bother to know which one to use and instead just let the function
do the right thing. Fewer chances for bugs, and smaller code.

	cheers
	luigi

On Fri, Mar 07, 2003 at 03:06:59AM -0500, Hiten Pandya wrote:
> Luigi Rizzo (Thu, Mar 06, 2003 at 09:26:38PM -0800) wrote:
> > the number of places where the code does
> >
> > 	m = (want > X) ? m_getcl(...) : m_get(...)
> >
> > makes me wonder if we shouldn't perhaps add a 'desired_size'
> > parameter to m_getcl() so we can have the test made in one
> > place and in a consistent way (i.e. always use the same
> > threshold X instead of MLEN/MINCLSIZE/MHLEN,
> > which i suspect is incorrect somewhere).
>
> Can you provide some examples for this? I do not exactly follow.
> The size checking is always changing. I have similar changes to
> the dev/ and pci/ and netgraph/ tree, and I noticed that it gets
> checked against MLEN/MCLBYTES/MHLEN and whatnot... but I do not think
> that is a bug, because surely they all represent different
> quantities?
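
The proposal in the message above, moving the size test into the
allocator, might look roughly like the following userland sketch. It is
not kernel code and not the patch under discussion: sk_getm(), struct
sk_mbuf and the SK_* sizes are names and values invented here for
illustration, and malloc() stands in for the mbuf and cluster pools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Illustrative sizes only (assumptions for this sketch). In the kernel
 * the real values come from MLEN, MHLEN and MCLBYTES.
 */
#define SK_MLEN		224	/* payload of a plain mbuf */
#define SK_MHLEN	200	/* payload of a packet-header mbuf */
#define SK_MCLBYTES	2048	/* payload of a cluster */
#define SK_PKTHDR	0x1	/* stands in for M_PKTHDR */

/* Simplified stand-in for struct mbuf. */
struct sk_mbuf {
	int	 flags;		/* SK_PKTHDR or 0 */
	int	 has_cluster;	/* 1 if backed by a cluster */
	size_t	 capacity;	/* bytes available for data */
	size_t	 len;		/* bytes in use */
	char	*data;
};

/*
 * Hypothetical allocator with a desired-size argument: the caller says
 * how many bytes it needs and the threshold test lives in one place.
 * Returns NULL if the request cannot fit in a single cluster or if
 * memory is exhausted.
 */
static struct sk_mbuf *
sk_getm(size_t want, int flags)
{
	struct sk_mbuf *m;
	size_t inline_room;

	if (want > SK_MCLBYTES)
		return (NULL);
	/* The threshold depends only on whether a pkthdr is requested. */
	inline_room = (flags & SK_PKTHDR) ? SK_MHLEN : SK_MLEN;

	m = malloc(sizeof(*m));
	if (m == NULL)
		return (NULL);
	m->flags = flags;
	m->has_cluster = (want > inline_room);
	m->capacity = m->has_cluster ? SK_MCLBYTES : inline_room;
	m->len = 0;
	m->data = malloc(m->capacity);
	if (m->data == NULL) {
		free(m);
		return (NULL);
	}
	return (m);
}

/* Release a buffer obtained from sk_getm(); NULL is accepted. */
static void
sk_freem(struct sk_mbuf *m)
{
	if (m != NULL) {
		free(m->data);
		free(m);
	}
}
```

A caller would then write m = sk_getm(mlen, SK_PKTHDR) followed by a
single NULL check, instead of choosing between MLEN, MHLEN and
MINCLSIZE at every call site.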

From owner-freebsd-arch Fri Mar 7 1:22:59 2003
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B745D37B401 for ; Fri, 7 Mar 2003 01:22:58 -0800 (PST)
Received: from angelica.unixdaemons.com (angelica.unixdaemons.com [209.148.64.135]) by mx1.FreeBSD.org (Postfix) with ESMTP id C357843FA3 for ; Fri, 7 Mar 2003 01:22:57 -0800 (PST) (envelope-from hiten@angelica.unixdaemons.com)
Received: from angelica.unixdaemons.com (localhost.unixdaemons.com [127.0.0.1]) by angelica.unixdaemons.com (8.12.8/8.12.1) with ESMTP id h279Mumq070274; Fri, 7 Mar 2003 04:22:56 -0500 (EST)
Received: (from hiten@localhost) by angelica.unixdaemons.com (8.12.8/8.12.1/Submit) id h279MuQd070273; Fri, 7 Mar 2003 04:22:56 -0500 (EST) (envelope-from hiten)
Date: Fri, 7 Mar 2003 04:22:56 -0500
From: Hiten Pandya
To: Luigi Rizzo
Cc: arch@FreeBSD.ORG
Subject: Re: Using m_getcl() in network and nfs code paths
Message-ID: <20030307092256.GA69971@unixdaemons.com>
References: <20030307004958.GA98917@unixdaemons.com> <20030306212638.A32850@xorpc.icir.org> <20030307080659.GA60937@unixdaemons.com> <20030307002525.A50491@xorpc.icir.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030307002525.A50491@xorpc.icir.org>
User-Agent: Mutt/1.4i
X-Operating-System: FreeBSD i386
X-Public-Key: http://www.pittgoth.com/~hiten/pubkey.asc
X-URL: http://www.unixdaemons.com/~hiten
X-PGP: http://pgp.mit.edu:11371/pks/lookup?search=Hiten+Pandya&op=index
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

Luigi Rizzo (Fri, Mar 07, 2003 at 12:25:25AM -0800) wrote:
> the logic of this code
>
> 	m = (n > X) ? m_getcl(...) : m_get(...)
>
> is the following: i have n bytes of data, give me a place large
> enough to store them. This can be either a single mbuf or an
> mbuf+cluster, depending on the size. The threshold (X) is whatever
> fits into the mbuf (which varies depending on whether or not this is
> a pkthdr mbuf, but again this is easy to tell inside m_getcl because
> you are passing the M_PKTHDR flag).
>
> Now, MINCLSIZE/MHLEN are basically the same thing, and MLEN covers
> the case for !M_PKTHDR. But my point is that the programmer should
> not bother to know which one to use and instead just let the function
> do the right thing. Fewer chances for bugs, and smaller code.

Right. I guess it makes better sense. I will try and come up
with these changes over the weekend, or maybe even today if I
get the time.

Cheers Luigi.

-- Hiten

From owner-freebsd-arch Fri Mar 7 8:26:50 2003
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9B47D37B401 for ; Fri, 7 Mar 2003 08:26:47 -0800 (PST)
Received: from mailsrv.otenet.gr (mailsrv.otenet.gr [195.170.0.5]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8EBB143F85 for ; Fri, 7 Mar 2003 08:26:43 -0800 (PST) (envelope-from keramida@ceid.upatras.gr)
Received: from gothmog.gr (patr530-b220.otenet.gr [212.205.244.228]) by mailsrv.otenet.gr (8.12.8/8.12.8) with ESMTP id h27GQdNJ006106 for ; Fri, 7 Mar 2003 18:26:40 +0200 (EET)
Received: from gothmog.gr (gothmog [127.0.0.1]) by gothmog.gr (8.12.8/8.12.8) with ESMTP id h27GQZMI004816 for ; Fri, 7 Mar 2003 18:26:39 +0200 (EET) (envelope-from keramida@ceid.upatras.gr)
Received: (from giorgos@localhost) by gothmog.gr (8.12.8/8.12.8/Submit) id h27Epd7w003627; Fri, 7 Mar 2003 16:51:39 +0200 (EET) (envelope-from keramida@ceid.upatras.gr)
Date: Fri, 7 Mar 2003 16:51:39 +0200
From: Giorgos Keramidas
To: Hiten Pandya
Cc:
arch@freebsd.org
Subject: Re: Using m_getcl() in network and nfs code paths
Message-ID: <20030307145139.GD2094@gothmog.gr>
References: <20030307004958.GA98917@unixdaemons.com> <20030306212638.A32850@xorpc.icir.org> <20030307080659.GA60937@unixdaemons.com> <20030307002525.A50491@xorpc.icir.org> <20030307092256.GA69971@unixdaemons.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030307092256.GA69971@unixdaemons.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

On 2003-03-07 04:22, Hiten Pandya wrote:
> Luigi Rizzo (Fri, Mar 07, 2003 at 12:25:25AM -0800) wrote:
> > the logic of this code
> >
> > 	m = (n > X) ? m_getcl(...) : m_get(...)
> >
> > is the following: i have n bytes of data, give me a place large
> > enough to store them. This can be either a single mbuf or an
> > mbuf+cluster, depending on the size. The threshold (X) is whatever
> > fits into the mbuf (which varies depending on whether or not this is
> > a pkthdr mbuf, but again this is easy to tell inside m_getcl because
> > you are passing the M_PKTHDR flag).
> >
> > Now, MINCLSIZE/MHLEN are basically the same thing, and MLEN covers
> > the case for !M_PKTHDR. But my point is that the programmer should
> > not bother to know which one to use and instead just let the function
> > do the right thing. Fewer chances for bugs, and smaller code.
>
> Right. I guess it makes better sense. I will try and come up
> with these changes over the weekend, or maybe even today if I
> get the time.
>
> Cheers Luigi.

Can you also convert all those (condition) ? true : false; things to
(cleaner, in my opinion) if {...} else {...} pairs?
It really hurts my eyes trying to read those ?: things, and writing
these as:

	if (n > X)
		m = m_getcl(...);
	else
		m = m_get(...);

has the added benefit that whenever one needs to fix just the code of
the `else' part, the diff output of the changes will be much cleaner
in the future.

This is, of course, more related to style than the real architectural
characteristics of the code. But I just thought I'd mention it, since
I too was struck fairly quickly by the same impression that Luigi
mentioned in his first post.

The argument of rewriting code to make things easier for possible
future changes is, admittedly, a bit weak. But you can certainly
forget about all this if it's too much trouble :)

- Giorgos

From owner-freebsd-arch Sat Mar 8 23: 5:16 2003
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D58C137B439 for ; Sat, 8 Mar 2003 23:05:08 -0800 (PST)
Received: from smtp4.server.rpi.edu (smtp4.server.rpi.edu [128.113.2.4]) by mx1.FreeBSD.org (Postfix) with ESMTP id D9BB3442EC for ; Sat, 8 Mar 2003 21:29:19 -0800 (PST) (envelope-from drosih@rpi.edu)
Received: from [128.113.24.47] (gilead.netel.rpi.edu [128.113.24.47]) by smtp4.server.rpi.edu (8.12.8/8.12.7) with ESMTP id h295TIuF028552 for ; Sun, 9 Mar 2003 00:29:18 -0500
Mime-Version: 1.0
X-Sender: drosih@mail.rpi.edu
Message-Id:
In-Reply-To:
References: <20030210114930.GB90800@melusine.cuivre.fr.eu.org> <200302251255.48219.wes@softweyr.com>
Date: Sun, 9 Mar 2003 00:29:17 -0500
To: arch@FreeBSD.ORG
From: Garance A Drosihn
Subject: Re: NEWSYSLOG changes, -Create option for rc.diskless
Content-Type: text/plain; charset="us-ascii" ; format="flowed"
X-Scanned-By: MIMEDefang 2.28
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

When looking through some /etc/rc files the other day, I noticed
that /etc/rc.diskless2 picks up entries in /etc/newsyslog.conf
and uses 'touch' to create them. With all the other changes I
have made and plan to make, this is not a good idea. In fact, it
was already a bad idea because the files it creates won't get the
right owner, group, or permissions.

I have an update in:
http://people.freebsd.org/~gad/newsyslog/create-opt.diff
which implements a '-C' option for newsyslog. Once this is
committed, /etc/rc.diskless2 should be changed to use it.

The update also adds a 'C' flag for the config file entries,
to match an option that NetBSD has.

Please pay close attention to the new createlog() routine, as
a later update will switch to using that for all logfile
creations. My goal is to eliminate all windows in the creation
of a new log file. This (once I do it) should complete the job
started in revision 1.41 (committed in April 2002).

-- 
Garance Alistair Drosehn            =   gad@gilead.netel.rpi.edu
Senior Systems Programmer           or  gad@freebsd.org
Rensselaer Polytechnic Institute    or  drosih@rpi.edu
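
As a rough illustration of what "no windows" can mean when creating a
log file, here is a minimal C sketch. It is not the createlog() routine
from the patch, which is not shown in this message; sk_create_log() and
its arguments are made up for this example. The idea is to create the
file exclusively with a restrictive mode and then set owner, group and
final permissions through the open descriptor, so the file is never
visible under its name with looser settings than intended.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an empty log file with the given mode,
 * owner and group. Pass (uid_t)-1 or (gid_t)-1 to leave the owner or
 * group unchanged. Returns 0 on success and -1 on failure, with errno
 * set by the call that failed.
 */
static int
sk_create_log(const char *path, mode_t mode, uid_t uid, gid_t gid)
{
	int fd;

	/* O_EXCL: fail rather than reuse a file someone else created. */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd == -1)
		return (-1);

	/* Use the descriptor so the path cannot be swapped under us. */
	if (fchown(fd, uid, gid) == -1 || fchmod(fd, mode) == -1) {
		(void)close(fd);
		(void)unlink(path);
		return (-1);
	}
	return (close(fd));
}
```

Compared with 'touch' followed by separate chown and chmod commands,
the file starts out readable by its creator only, and fchmod() applies
the requested mode regardless of the umask.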