Date:      Tue, 4 Mar 2003 15:36:08 -0800
From:      Sean Chittenden <sean@chittenden.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        Hiten Pandya <hiten@unixdaemons.com>, arch@FreeBSD.ORG
Subject:   Re: Should sendfile() to return ENOBUFS?
Message-ID:  <20030304233608.GK79234@perrin.int.nxad.com>
In-Reply-To: <3E64E9B8.EDCA54FE@mindspring.com>
References:  <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com> <3E6452B4.E87BEC2@mindspring.com> <20030304081326.GD79234@perrin.int.nxad.com> <3E64E9B8.EDCA54FE@mindspring.com>


> Personally, I would probably just add a flag to sendfile,

Breaking source compatibility shouldn't be an option.

> A better solution would be to add a different API to the system.
>
> The kblob interface is an interesting animal; Jeffrey Hsu has done
> some good work in that area, but it's not entirely usable, as it
> sits.  You might want to talk to Jonathan Lemon.  IMO, it is
> probably a lost cause.

Is there a patch or URL that I can read through?  kblob + google ==
KDE in all but two instances.  Alfred mentioned that it was good for
smaller files so I'm not sure it'd be good for the large multi-MB type
things that are getting sent out.

> > I haven't spent more than a few seconds thinking about this, but
> > wouldn't that require more mbufclusters to be in use but idle at any
> > given time than the current implementation?
>
> No.  First of all, it reduces the sfbuf requirements considerably,
> by queueing request descriptors, instead, and satisfying them, as it
> can.  Second, you can control the number of packets "in flight" for
> each outstanding sendfile request in progress (unlike now), so if
> you throttle this back to the so_snd size, in fact you will use
> *fewer* mbuf clusters simultaneously, and you will reduce page
> thrashing (remember that sendmaile uses external mbufs that refer to
> buffer cache pages via sfbuf mappings).

s/sendmaile/sendfile/ ?

Hrm...  this could be interesting...  I'll keep this in mind as a way
of keeping the rate of transfer constant at the point where all
sf_bufs are in use.  I'll have to think about this more before I'm
100% convinced that it's the right thing to do.

> [ ... ]
> > I don't quite understand what you're trying to say here.  What's the
> > correlation between <CR><LF>/<LF> and system calls? CR+LF is always
> > read/written as two bytes...  I must be missing the point of your
> > comment.
>
> It's a tangent that indicates sendfile() is generally inappropriate,
> unless you also implement the recvfile() to go with it, and use it.
> The issue is that UNIX text files are not stored in the wire formats
> for these protocols, so using sendfile() on them is usually
> inappropriate, unless you change how you store them.  Mail servers,
> especially, break inbound and outbound data between applications, so
> you'd have to hack them up to store incoming as <CR><LF> delimited
> so that when you sent them out via sendfile(), they were compliant
> with the protocol standard, on the wire.

Ah, ok... well, I generally have no sympathy for what are pretty
poorly designed protocols or operating systems that are brain dead in
their definition of newlines.
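
To make the wire-format point concrete, here is a minimal sketch of
the translation a server has to do when a file is stored with bare
<LF> line endings but the protocol wants <CR><LF>.  The function name
lf_to_crlf is hypothetical, not an existing API.  Because every byte
has to pass through userland for this rewrite, such a file cannot be
handed to sendfile() as-is.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src to dst, turning each bare LF into CR LF.  LFs that already
 * follow a CR are left alone.  Returns the number of bytes written to
 * dst, or (size_t)-1 if dst is too small.
 * Hypothetical helper for illustration only.
 */
size_t lf_to_crlf(const char *src, size_t srclen, char *dst, size_t dstcap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < srclen; i++) {
        int bare_lf = (src[i] == '\n' && (i == 0 || src[i - 1] != '\r'));
        size_t need = bare_lf ? 2 : 1;

        /* out never exceeds dstcap, so this subtraction cannot wrap */
        if (dstcap - out < need)
            return (size_t)-1;
        if (bare_lf)
            dst[out++] = '\r';
        dst[out++] = src[i];
    }
    return out;
}
```

Storing the file in wire format up front would make this pass
unnecessary, which is the storage change described in the quoted text.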

> > > > If a system is busy, it's stuck in an sfbufa state and blocks the
> > > > server from servicing thousands of connections.
> > >
> > > I understand.
> >
> > Groovy: that's a third of the problem, what's the elegant solution?
>=20
> You can't have one, in the context of the current sendfile.  You
> need to change your context, if you want to address this issue and
> get onto the next one, or you can accept the implementation of an
> administrative limit to keep from banging your head on the design
> limit, and cut your losses.

I think the state of sendfile(2) could be much improved.  Once
improved, if things still suck, then I'll go about finding/writing a
new interface to do what I/others need.

> It really boils down to how much effort you are willing to spend on
> it, for what return you expect.

When load happens, every ounce of grace is worth its weight in gold
tenfold over.  That's why I switched from MS->Linux->FreeBSD in the
first place.

> > I keep chasing this upper bound and pushing things higher and higher
> > because sendfile() doesn't degrade worth beans... well, that's a hack
> > and not a solution.
>
> No, it's really a "Then don't do that" solution to the old "Doctor,
> it hurts when I do this" complaint.

The next thing the Dr. says is "then don't do it!" and we're back at
square one again...  this isn't lifting an arm, it's trying to breathe
while running: something FreeBSD is generally better than most at.

> Before sendfile(), the answer was to mmap() the data to be sent, and
> then call write() on it.  Doing that guaranteed that you would not
> have to copy the data from user space to kernel space, because the
> mapping was already established.  That solution can still work,
> without using sendfile() to get the same performance.  The
> performance "win" of sendfile is the assumption that the entire file
> will be sent as a result of a single system call.

You're forgetting the biggest win for busy servers with
hundreds/thousands of files: RAM.  I used to use mmap() + writev() and
because of userland RAM constraints, I moved to using sendfile().
This dramatically improved the state of affairs.
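
For readers who have not seen the pre-sendfile() pattern, a minimal
sketch of mmap() + write() follows.  The helper names write_all and
mmap_write_path are hypothetical, and the sketch maps the whole file
in one go, which is an assumption made for brevity.  Each mapping
occupies process address space for as long as it is held, which is
presumably the userland constraint described above when hundreds or
thousands of files are in play.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes, retrying on EINTR and short writes. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += (size_t)n;
    }
    return (ssize_t)off;
}

/*
 * Map the file at path read-only and write() the mapping to out_fd.
 * Returns the number of bytes written, or -1 on error with errno set.
 * Hypothetical helper for illustration only.
 */
ssize_t mmap_write_path(const char *path, int out_fd)
{
    struct stat st;
    ssize_t sent = -1;
    int saved_errno;

    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    if (fstat(fd, &st) == 0) {
        if (st.st_size == 0) {
            sent = 0;               /* mmap() rejects zero-length maps */
        } else {
            size_t len = (size_t)st.st_size;
            void *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
            if (base != MAP_FAILED) {
                sent = write_all(out_fd, base, len);
                saved_errno = errno;
                munmap(base, len);
                errno = saved_errno;
            }
        }
    }
    saved_errno = errno;            /* keep the first error for the caller */
    close(fd);
    errno = saved_errno;
    return sent;
}
```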

> > The TCP stack, VM, and my general setup have scaled quite well.
> > The 1st thing to go, however, is the number of sf_buf's.
>
> "If it ain't one thing, it's another"...

[...]

> As someone else pointed out, there are a lot of low overheads in
> various places in the FreeBSD kernel.  If you are a seven foot tall
> person that wants to walk around without banging your head every 5
> feet, then there's a lot of remodelling you are going to need to do
> to avoid that.

Interesting point, but if you own a house with 5ft doorways and are
7ft tall, you'll fix the house or move out.  Where's your saw and
hammer?

> If you want to get CS technical, you have found a livelock stall
> barrier: there are literally thousands of these in the design of
> FreeBSD, as it stands, and most of them are unlikely to ever get
> fixed, except in private commercial repositories for FreeBSD-based
> products.

<rhetorical>Again, where's your hammer?  You've got experience in
running into doorways, have you ever thought about making them taller
for everyone to pass through there?</rhetorical>

> The hard kernel memory split will *never* be dynamic.

I know the history and rationale for the current state of things, but
*never* say never.  ;)

[...]
> What this comes down to is guaranteeing "fairness".
>
> So the conclusion?
>
> Having sendfile() return "EAGAIN" is naive, unless you have a means
> of limiting each sendfile to its *fair share* of sf_buf's.

The problem is that not all connections are created equal.  Send gobs
of traffic overseas via slow last mile pipes and you'll find the
problem changing dramatically.  EAGAIN will at least get the available
connections something and I'll be able to drain some of my load.

[...]
> The *only* way to address this so that the kernel can *know*, to
> *fairly* share resources among requesters, is to queue the requests
> *to the kernel*, and then service them to completion.  *Only* then
> can the kernel perform useful resource arbitration on your behalf.

Actually, I was talking to Hiten on IRC about writing a kqueue-inspired
file sending state machine.  Basically you'd have a kernel daemon
that'd broker sending files to clients.  Push a file request (either
whole file or partial) into a queue and the kernel would send out the
file (or file part) as best it could.  Once a request completes
(successfully or not), the kernel would add it to a queue of finished
requests, each classified in one of two states:

*) failure (errno, sent this many bytes)
*) success (file sent, sent this many bytes).

Applications would then only have to manage adding files to the
kernel's queue and processing the completion of events from the queue
(logging).  On a local network, with T/TCP, this would make the basis
for a really slick NFS replacement or cache engine, IMHO.  And
actually, this interface could be fd -> fd and used to replace local
copying of files.
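
A rough userland sketch of the request/completion shape described
above, under several assumptions: every name here (send_request,
send_completion, service_request, broker_run) is hypothetical,
nbytes == 0 is taken to mean "send to end of file", and the "broker"
is an ordinary function using pread() and write() rather than a kernel
daemon.  It only illustrates the two completion states and the
fd -> fd copy; it does no queueing across callers or resource
arbitration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

enum send_status { SEND_SUCCESS, SEND_FAILURE };

struct send_request {
    const char *path;   /* file to send */
    int         out_fd; /* destination descriptor (socket, pipe, file) */
    off_t       offset; /* where to start in the file */
    size_t      nbytes; /* how much to send; 0 = to EOF (assumption) */
};

struct send_completion {
    enum send_status status;
    int              error; /* errno on failure, 0 on success */
    size_t           sent;  /* bytes sent in either case */
};

/* Service one request to completion and report how it ended. */
static struct send_completion service_request(const struct send_request *r)
{
    struct send_completion c = { SEND_SUCCESS, 0, 0 };
    char buf[4096];
    int in_fd = open(r->path, O_RDONLY);

    if (in_fd == -1) {
        c.status = SEND_FAILURE;
        c.error = errno;
        return c;
    }
    while (r->nbytes == 0 || c.sent < r->nbytes) {
        size_t want = sizeof buf;
        if (r->nbytes != 0 && r->nbytes - c.sent < want)
            want = r->nbytes - c.sent;

        ssize_t got = pread(in_fd, buf, want, r->offset + (off_t)c.sent);
        if (got == 0)
            break;                      /* end of file */
        if (got == -1) {
            if (errno == EINTR)
                continue;
            c.status = SEND_FAILURE;
            c.error = errno;
            break;
        }
        ssize_t put = write(r->out_fd, buf, (size_t)got);
        if (put == -1) {
            if (errno == EINTR)
                continue;
            c.status = SEND_FAILURE;
            c.error = errno;
            break;
        }
        /* After a short write the next pread() resumes at the new offset. */
        c.sent += (size_t)put;
    }
    close(in_fd);
    return c;
}

/* Drain the request queue in order, filling one completion per request. */
void broker_run(const struct send_request *reqs, size_t n,
                struct send_completion *done)
{
    assert(reqs != NULL && done != NULL);
    for (size_t i = 0; i < n; i++)
        done[i] = service_request(&reqs[i]);
}
```

The application side then reduces to filling in send_request entries
and walking the send_completion array for logging, which mirrors the
division of labour proposed above.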

> One of the major pains in the butt for effective load shedding in
> FreeBSD, as it currently stands, is the SYN cache.  The damn thing
> accepts connections on your behalf by completing three-way handshakes
> automatically, without giving you the opportunity of doing feedback
> until *after* the connection is established.

I haven't hit that yet, but when I do, you'll hear back from me.

> Come up with a new API.  It needs to:
>
> 1)	Queue its requests to the kernel, so that the kernel has
> 	enough information to make useful decisions

Check.  See above.

> 2)	Respect the limit on the so_snd depth (minimally; there are
> 	reasons for load tuning to make it even more severe, on
> 	purpose, to control router queue depths for slow customer
> 	pipes)

That'd be something that the kernel send file daemon would do (in
theory).

> 3)	Send a kevent when the file send has been completed

See above.

> 4)	Preallocate resources before taking something off the queue

Check.

> Those are the minimum design requirements, from a 50,000 foot view.

A perk of the above design is that you don't have to constantly make
system calls to send out parts of a file (with non-blocking IO,
sendfile(), and clients connecting at slow rates, the number of calls
can be substantial).
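
As a small illustration of requirement 2 in the list above, the sketch
below derives a per-request in-flight byte cap from the socket's send
buffer size, optionally tightened by an administrative limit.  The
function name inflight_cap and the admin_limit parameter are
hypothetical; only getsockopt() with SO_SNDBUF is a real interface,
and the value it reports is platform-dependent.  A real kernel-side
implementation would look at so_snd directly rather than going through
getsockopt().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Store in *cap the maximum number of bytes to keep in flight for one
 * request on sock: the send buffer size, reduced to admin_limit when
 * admin_limit is non-zero and smaller.  Returns 0 on success, -1 if
 * the send buffer size cannot be read.
 * Hypothetical helper for illustration only.
 */
int inflight_cap(int sock, size_t admin_limit, size_t *cap)
{
    int sndbuf = 0;
    socklen_t len = sizeof sndbuf;

    assert(cap != NULL);
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == -1)
        return -1;
    if (sndbuf <= 0)
        return -1;

    *cap = (size_t)sndbuf;
    if (admin_limit != 0 && admin_limit < *cap)
        *cap = admin_limit;     /* deliberately stricter than so_snd */
    return 0;
}
```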

-sc

-- 
Sean Chittenden

