From: Sean Chittenden
To: Terry Lambert
Cc: Hiten Pandya, arch@FreeBSD.ORG
Date: Tue, 4 Mar 2003 15:36:08 -0800
Subject: Re: Should sendfile() to return ENOBUFS?
Message-ID: <20030304233608.GK79234@perrin.int.nxad.com>

> Personally, would probably just add a flag to sendfile,

Breaking source compatibility shouldn't be an option.

> A better solution would be to add a different API to the system.
>
> The kblob interface is an interesting animal; Jeffrey Hsu has done
> some good work in that area, but it's not entirely usable, as it
> sits.  You might want to talk to Jonathan Lemon.  IMO, it is
> probably a lost cause.

Is there a patch or URL that I can read through?  kblob + google == KDE
in all but two instances.  Alfred mentioned that it was good for smaller
files so I'm not sure it'd be good for the large multi-MB type things
that are getting sent out.

> > I haven't spent more than a few seconds thinking about this, but
> > wouldn't that require more mbufclusters to be in use but idle at any
> > given time than the current implementation?
>
> No.  First of all, it reduces the sfbuf requirements considerably,
> by queueing request descriptors, instead, and satisfying them, as it
> can.  Second, you can control the number of packets "in flight" for
> each outstanding sendfile request in progress (unlike now), so if
> you throttle this back to the so_snd size, in fact you will use
> *fewer* mbuf clusters simultaneously, and you will reduce page
> thrashing (remember that sendmaile uses external mbufs that refer to
> buffer cache pages via sfbuf mappings).

s/sendmaile/sendfile/ ?  Hrm... this could be interesting... I'll keep
this in mind as a way of keeping the rate of transfer constant at the
point where all sf_bufs are in use.  I'll have to think about this more
before I'm 100% convinced that it's the right thing to do.

> [ ... ]
> > I don't quite understand what you're trying to say here.  What's the
> > correlation between / and system calls?  CR+LF is always
> > read/written as two bytes... I must be missing the point of your
> > comment.
>
> It's a tangent that indicates sendfile() is generally inappropriate,
> unless you also implement the recvfile() to go with it, and use it.
> The issue is that UNIX text files are not stored in the wire formats
> for these protocols, so using sendfile() on them is usually
> inappropriate, unless you change how you store them.  Mail servers,
> especially, break inbound and outbound data between applications, so
> you'd have to hack them up to store incoming as delimited
> so that when you sent them out via sendfile(), they were compliant
> with the protocol standard, on the wire.

Ah, ok... well, I generally have no sympathy for what are pretty poorly
designed protocols or operating systems that are brain dead in their
definition of newlines.

> > > > If a system is busy, it's stuck in an sfbufa state and blocks the
> > > > server from servicing thousands of connections.
> > >
> > > I understand.
> >
> > Groovy: that's a third of the problem, what's the elegant solution?
>
> You can't have one, in the context of the current sendfile.  You
> need to change your context, if you want to address this issue and
> get onto the next one, or you can accept the implementation of an
> administrative limit to keep from banging your head on the design
> limit, and cut your losses.

I think the state of sendfile(2) could be much improved.  Once improved,
if things still suck, then I'll go about finding/writing a new interface
to do what I/others need.

> It really boils down to how much effort you are willing to spend on
> it, for what return you expect.

When load happens, every ounce of grace is worth its weight in gold
tenfold over.  That's why I switched from MS->Linux->FreeBSD in the
1st place.

> > I keep chasing this upper bound and pushing things higher and higher
> > because sendfile() doesn't degrade worth beans... well, that's a hack
> > and not a solution.
>
> No, it's really a "Then don't do that" solution to the old "Doctor,
> it hurts when I do this" complaint.

The next thing the Dr. says is "then don't do it!", and we're back at
square one again...
this isn't lifting an arm, it's trying to breathe while running:
something FreeBSD is generally better than most at.

> Before sendfile(), the answer was to mmap() the data to be sent, and
> then call write() on it.  Doing that guaranteed that you would not
> have to copy the data from user space to kernel space, because the
> mapping was already established.  That solution can still work,
> without using sendfile() to get the same performance.  The
> performance "win" of sendfile is the assumption that the entire file
> will be sent as a result of a single system call.

You're forgetting the biggest win for busy servers with
hundreds/thousands of files: RAM.  I used to use mmap() + writev() and,
because of userland RAM constraints, I moved to using sendfile().  This
dramatically improved the state of affairs.

> > The TCP stack, VM, and my general setup have scaled quite well.
> > The 1st thing to go, however, is the number of sf_buf's.
>
> "If it ain't one thing, it's another"...

[...]

> As someone else pointed out, there are a lot of low overheads in
> various places in the FreeBSD kernel.  If you are a seven foot tall
> person that wants to walk around without banging your head every 5
> feet, then there's a lot of remodelling you are going to need to do
> to avoid that.

Interesting point, but if you own a house with 5ft doorways and are 7ft
tall, you'll fix the house or move out.  Where's your saw and hammer?

> If you want to get CS technical, you have found a livelock stall
> barrier: there are literally thousands of these in the design of
> FreeBSD, as it stands, and most of them are unlikely to ever get
> fixed, except in private commercial repositories for FreeBSD-based
> products.

Again, where's your hammer?  You've got experience in running into
doorways; have you ever thought about making them taller for everyone
to pass through?

> The hard kernel memory split will *never* be dynamic.

I know the history and rationale for the current state of things, but
*never* say never.  ;)

[...]

> What this comes down to is guaranteeing "fairness".
>
> So the conclusion?
>
> Having sendfile() return "EAGAIN" is naive, unless you have a means
> of limiting each sendfile to its *fair share* of sf_buf's.

The problem is that not all connections are created equal.  Send gobs
of traffic overseas via slow last mile pipes and you'll find the
problem changing dramatically.  EAGAIN will at least get the available
connections something and I'll be able to drain some of my load.

[...]

> The *only* way to address this so that the kernel can *know*, to
> *fairly* share resources among requesters, is to queue the requests
> *to the kernel*, and then service them to completion.  *Only* then
> can the kernel perform useful resource arbitration on your behalf.

Actually, I was talking to Hiten on IRC about writing a kqueue-inspired
file sending state machine.  Basically you'd have a kernel daemon
that'd broker sending files to clients.  Push a file request (either
whole file or partial) into a queue and the kernel would send out the
file (or file part) as best as it could and, once complete (successful
or not), would add the completed request to a queue of finished
requests that would be classified in one of two states:

*) failure (errno, sent this many bytes)
*) success (file sent, sent this many bytes).

Applications would then only have to manage adding files to the
kernel's queue and processing the completion of events from the queue
(logging).  On a local network, with T/TCP, this would make the basis
for a really slick NFS replacement or cache engine, IMHO.  And
actually, this interface could be fd -> fd and used to replace local
copying of files.

> One of the major pains in the butt for effective load shedding in
> FreeBSD, as it currently stands, is the SYN cache.
> The damn thing accepts connections on your behalf by completing
> three-way handshakes automatically, without giving you the
> opportunity of doing feedback until *after* the connection is
> established.

I haven't hit that yet, but when I do, you'll hear back from me.

> Come up with a new API.  It needs to:
>
> 1) Queue its requests to the kernel, so that the kernel has
>    enough information to make useful decisions

Check.  See above.

> 2) Respect the limit on the so_snd depth (minimally; there are
>    reasons for load tuning to make it even more severe, on
>    purpose, to control router queue depths for slow customer
>    pipes)

That'd be something that the kernel send file daemon would do (in
theory).

> 3) Sends a kevent when the file send has been completed

See above.

> 4) Preallocate resources before taking something off the queue

Check.

> Those are the minimum design requirements, from a 50,000 foot view.

A perk of the above design is that you don't have to constantly make
system calls to send out parts of a file (with non-blocking IO +
sendfile() + clients connecting at slow rates, this can be
substantial).

-sc

-- 
Sean Chittenden
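
For readers who want something concrete, here is a rough userland
sketch of the queued file sender discussed in this message: requests go
into one queue, and each finished request comes back on a second queue
marked as success or failure with an errno and a byte count.  It is
only an illustration under loud assumptions.  The names SendRequest,
Completion, file_sender and window are hypothetical and appear nowhere
in FreeBSD; a Python thread stands in for the proposed kernel daemon;
Python's socket.sendfile() stands in for the kernel doing the work; and
completions arrive on a plain queue rather than as kevents.

```python
import os
import queue
import socket
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendRequest:
    """Hypothetical request: send `count` bytes of `path` from `offset`."""
    path: str
    sock: socket.socket          # must be a blocking SOCK_STREAM socket
    offset: int = 0
    count: Optional[int] = None  # None means "up to end of file"


@dataclass
class Completion:
    """Hypothetical completion record: success/failure, errno, bytes sent."""
    request: SendRequest
    ok: bool
    err: int
    bytes_sent: int


def file_sender(requests: queue.Queue, completions: queue.Queue,
                window: int = 64 * 1024) -> None:
    """Service queued requests until a None sentinel is received.

    Each file goes out in pieces of at most `window` bytes, a crude
    stand-in for capping data in flight at the socket send buffer size.
    """
    while True:
        req = requests.get()
        if req is None:
            break
        sent, err = 0, 0
        try:
            with open(req.path, "rb") as f:
                size = os.fstat(f.fileno()).st_size
                avail = max(0, size - req.offset)
                todo = avail if req.count is None else min(req.count, avail)
                while sent < todo:
                    n = req.sock.sendfile(f, offset=req.offset + sent,
                                          count=min(window, todo - sent))
                    if n == 0:   # file shrank underneath us; stop early
                        break
                    sent += n
        except OSError as exc:
            err = exc.errno or -1
        # Report the outcome; the application never loops on sendfile itself.
        completions.put(Completion(req, err == 0, err, sent))
```

An application would start file_sender in a thread, put SendRequest
objects on the request queue as clients ask for files, and read
Completion records off the other queue for logging.  The single worker
thread keeps the sketch short; it serializes transfers, which a real
implementation of the idea would not want to do.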