Date:      Tue, 4 Mar 2003 00:13:26 -0800
From:      Sean Chittenden <sean@chittenden.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        Hiten Pandya <hiten@unixdaemons.com>, arch@FreeBSD.ORG
Subject:   Re: Should sendfile() to return ENOBUFS?
Message-ID:  <20030304081326.GD79234@perrin.int.nxad.com>
In-Reply-To: <3E6452B4.E87BEC2@mindspring.com>
References:  <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com> <3E6452B4.E87BEC2@mindspring.com>

> > > 2)    You need to be damn sure you can guarantee a correct update
> > >       of *sbytes; I believe this is very difficult in the case in
> > >       question, which is why it blocks
> >
> > I'm not convinced of this.  Have you poked through
> > src/sys/kern/uipc_syscalls.c?  It's not that ugly/hard, nothing's
> > impossible with a bit of refactoring.
>
> I've done this.  I've ported the -current sendfile external buffer
> code to FreeBSD 4.3, and again to FreeBSD 4.4, etc.  I'm rather
> familiar with it, actually...

Excellent... I know you've done stuff with large numbers of TCP
connections in the past so this doesn't really surprise me all that
much.  Suggestions welcome.

> > > 3)    If sbytes is NULL, you should probably block, even on a
> > >       non-blocking call.  The reason for this is that there is
> > >       no way for the application to restart without *sbytes
> >
> > This degrades terribly though and if you get a spike in traffic,
> > degradation of performance is critical.
>
> Sendfile degrades terribly under traffic spikes, period.  One thing
> sendfile fails to do is honor the so_snd size limits that other
> things honor, as it goes through its loop.

Much to my dismay and frustration, I'm discovering this...  is there a
better zero-copy socket file operation that can be used in place of
sendfile()?  Alfred's mentioned something called kblob a few times but
I haven't been able to dig up anything on it other than an old arch@
discussion where it was shot down (unfortunately).

> Technically, sendfile should be an async interface so it can lock
> the so_snd window to the buffers-in-flight.  If it did this, it
> could preallocate the memory at the time it's called, and then
> reuse it internally until the operation has been completed.  Then
> it could write its completion status.

I haven't spent more than a few seconds thinking about this, but
wouldn't that require more mbufclusters to be in use but idle at any
given time than the current implementation?

> > Going from a non-blocking application to a blocking call simply
> > because of high use is murderous and is justification in itself
> > enough for me to move away from the really nice zero-copy sockets
> > that sendfile() affords me, back to the sluggish writev() syscall.
>
> For POP3 and SMTP, and most other RFC822 derived protocols, you end
> up having to store your files with <CR><LF> line delimiters, instead
> of <LF>.  For FTP, you can only do binary transfers, etc.  The
> sendfile interface is just a bad design, period.
>
> That it performs badly under load is just icing on the cake.

I don't quite understand what you're trying to say here.  What's the
correlation between <CR><LF>/<LF> and system calls? CR+LF is always
read/written as two bytes...  I must be missing the point of your
comment.

> > If a system is busy, it's stuck in an sfbufa state and blocks the
> > server from servicing thousands of connections.
>
> I understand.

Groovy: that's a third of the problem, what's the elegant solution?

> > The symptoms are common and synonymous with mbuf exhaustion or any
> > other kind of buffer exhaustion...  my point is that having this
> > block is the worst way that sendfile() can degrade under
> > high load.
>
> Dijkstra: preallocate your resources, and you do not have this
> problem.  In this case, set your tunable high enough that even
> were you to use up all your available buffers, there are NSFBUFS
> available... and the problem goes away.

I keep chasing this upper bound and pushing things higher and higher
because sendfile() doesn't degrade worth beans... well, that's a hack
and not a solution.  The TCP stack, VM, and my general setup have
scaled quite well.  The first thing to go, however, is the number of
sf_buf's.  I'm worried I'm going to run out of KVM here in the near
future (and at that point, life basically begins to suck given my RAM
requirements are all over the place, 64bit platforms other than the
alpha aren't ready for prime time quite yet, and BSD has a hard kernel
memory split that isn't dynamic).

> > > 4)    If you get rid of the blocking with (sbytes == NULL), you
> > >       better add a BUGS section to the manual page.
> >
> > There's nothing that says that sbytes can't be set to 0 if errno
> > is EAGAIN, in fact, that's what it does right now.
>
> If you send a non-zero amount of data, you need to know exactly what
> was sent, in order to maintain connection state data pipe coherency
> between the user space application requesting the send on a
> connection basis, and the kernel space code that has done a partial
> send.

::nods::  That's a given.
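
A minimal sketch of the bookkeeping being discussed, assuming the
application keeps a per-connection record.  The names `struct xfer` and
`xfer_account()` are hypothetical and not part of any FreeBSD API; `sent`
stands for whatever the kernel stored through the sbytes pointer of
sendfile(2).  The point is only that bytes reported on a failed call
still have to be counted before retrying:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-connection transfer state kept by the application. */
struct xfer {
    off_t  offset;     /* next file offset to send from */
    size_t remaining;  /* bytes still owed to the client */
};

enum xfer_status { XFER_DONE, XFER_RETRY, XFER_FAILED };

/*
 * Fold the result of one sendfile-style call into the transfer state.
 * rc and err are the call's return value and errno; sent is the byte
 * count the call reported (assumed to be valid even on failure).
 */
enum xfer_status
xfer_account(struct xfer *x, int rc, int err, off_t sent)
{
    assert(sent >= 0 && (size_t)sent <= x->remaining);

    /* Bytes may have gone out even when the call failed: count them. */
    x->offset += sent;
    x->remaining -= (size_t)sent;

    if (rc != 0 && err != EAGAIN)
        return XFER_FAILED;  /* hard error; state is still exact */
    return x->remaining == 0 ? XFER_DONE : XFER_RETRY;
}
```

With this shape, an EAGAIN after a partial send simply leaves the
record pointing at the first unsent byte, so the next attempt resumes
there instead of resending or dropping the connection.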

> Given your statement, though, we can say pretty surely that this is
> HTTP...

::nods:: After some processing, I need to send a file as quickly and
efficiently as I can.  Moving to sendfile() saved me gobs of CPU cycles
and now things hover down below 15% CPU time.

> Any other approach, and your only option to recover your state is to
> close the connection and make the client retry.

Agreed, but that's a non-option when trying to deliver a high level of
reliability.  HTTP doesn't handle that so well.

> So in the situation where the resources are limited, you end up
> *increasing* the overall load by, instead of satisfying a client
> with a single request, converting that into 5 requests, all of which
> fail to deliver the data to the client.

But 'ya see, I wouldn't mind that at all: I'm not CPU bound and can
afford the extra context switches back and forth from the user space.
I'd bet dime to dollar that people who use sendfile(2) aren't CPU
bound: they're IO/sf_buf bound.  Sure having sendfile() return EAGAIN
will drive up the number of calls under high load, but I'd rather burn
a few more cycles swapping contexts than I would getting stuck in a
spin lock waiting for the required number of sf_buf's to become
available.

If I've got a connection queue of 60K, I want to free up as many
connections as I can as fast as I can which makes sleeping the worst
thing I can do because the connections in the queue just pile up.  A
userland spin lock is going to result in a more responsive application
than a kernel spin lock since the userland app will loop through the
connection queue and free up sf_buf's as data gets sent out over the
pipe (something that won't happen when stuck in msleep() in the
kernel's spin lock).
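
To make the userland loop concrete, here is a rough sketch under
simulation.  Nothing here calls the real sendfile(2): `try_send()` is a
hypothetical stand-in that fails with EAGAIN once a made-up budget
(`sim_budget`, standing in for free sf_buf's) is used up, and
`struct conn` and `service_pass()` are likewise invented for
illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical connection record: bytes still to send. */
struct conn { size_t remaining; };

/* Simulated pool of send capacity; stands in for free sf_buf's. */
size_t sim_budget;

/*
 * Stand-in for a non-blocking sendfile(): sends what the budget allows,
 * or fails with EAGAIN when nothing is available.  Never sleeps.
 */
int
try_send(struct conn *c, size_t *sent)
{
    *sent = 0;
    if (sim_budget == 0) {
        errno = EAGAIN;
        return -1;
    }
    *sent = c->remaining < sim_budget ? c->remaining : sim_budget;
    sim_budget -= *sent;
    c->remaining -= *sent;
    return 0;
}

/*
 * One pass over the queue.  A starved connection is skipped rather than
 * waited on, so the connections behind it still get a turn.  Returns how
 * many connections are still unfinished.  (A real loop would also close
 * the connection on errors other than EAGAIN.)
 */
size_t
service_pass(struct conn *q, size_t n)
{
    size_t pending = 0;

    for (size_t i = 0; i < n; i++) {
        size_t sent;

        if (q[i].remaining == 0)
            continue;                /* already finished */
        assert(q[i].remaining > 0);
        (void)try_send(&q[i], &sent);
        if (q[i].remaining > 0)
            pending++;               /* EAGAIN or partial: retry next pass */
    }
    return pending;
}
```

The caller would run `service_pass()` again once capacity has been
released, which is the behaviour I'm after: the process keeps cycling
through the queue instead of parking on one descriptor.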

> > Well, it's set to 65535 at the moment.  How much higher you think
> > I should set it?  :-] At some point I have to say, "it's high
> > enough and I just need to get the application to degrade
> > gracefully."  :-]
>
> The sendfile interface does not degrade gracefully, period.  Even if
> you dealt with the issue by setting *sbytes correctly in all cases,
> and returning the right value to user space, you've increased the
> number of system calls, potentially significantly.  So even if you
> "correct" the behaviour, your degradation is going to be
> exponential.

::nods:: But as stated above, there are worse things that can be done,
most notably, blocking and letting connections pile up.

> One potential solution is to go to using KSE's, so that the blocking
> context is not your whole process.  This allows you to write the
> server as multithreaded.  Another is to do what Apache does, and run
> processes per connection.

I'm antsy as hell to convert my apps to use KSE for this very reason,
but I'm going to give myself a few more months before I turn the life
blood of my business over to KSE.

> My recommendation was (and is): get a sufficiently large NSFBUFS in
> the first place, so you never encounter the situation that results
> in the non-graceful degradation.

That's not a solution though, that's a work around/hack.  :-] I've
hacked/worked around, but I need a solution.  Making sendfile(2) "do
the right thing(TM)" I thought was the solution (still do).

> > Frankly, if a developer is stupid enough to pass in NULL for sbytes,
> > they get what they deserve.  Returning -1 and setting errno to EAGAIN
> > in the event that there aren't any sf_buf's available isn't what I'd
> > call the programming exercise of the decade.  :-P
>
> Nevertheless, the sendfile interface appears to allow this
> situation; it is a flaw in the API design.  There are two ways to
> handle it:
>
> 1)	Any time you call sendfile on a non-blocking fd with
> 	(sbytes == NULL), *immediately* return EPARM or a
> 	similar error

I'm less than wild about that since that breaks POLA with existing
code.  There's no harm in making something more unknown when it's
already unknown.

> 2)	Allow the API to be inconsistent, and then have the OS
> 	accept the blame for broken applications, since it permits
> 	known broken parameter values

I don't follow...  how would this fix anything?  I don't understand
why this would be necessary given what I'd proposed/suggested earlier.

> > Hrm, let me redefine "fatal" as "changing the behavior of a system
> > call to go from returning in less than 0.001ms, to returning in
> > 2-15s for every connection when trying to make over ~500K
> > sendfile(2) calls a second."  I'd call that a catastrophic failure
> > to degrade successfully.  -sc
>
> "Fatal" in this context was intended to imply "the clients do not
> get their data, and get partial data and closed descriptors,
> instead, thus breaking the contract between the client and the
> server".
>
> And yeah, either way you look at it, it's a failure to degrade
> gracefully... once again: the easy fix is to not put your system in
> that position in the first place.

Lol!  I wish I had that as an option.  Near infinite demand doesn't
give me this luxury.

> A less easy approach would be to maintain a count of active sendfile
> instances in your application, and queue up requests above some high
> watermark, rather than making system calls.  Another would be to
> hard limit the number of client connections you allow at once, etc.
> The least ugly of these (to my mind) is to not overcommit NSFBUFS in
> the first place by always having at least 1 more than you could ever
> need, preconfigured into the kernel.

I'd actually thought about having my application do this on the fly
and automatically tune itself based on the number of free sf_buf's,
but this brings up another problem with sendfile(2): there's no way of
determining how many sf_buf's are in use at any given time and on
-STABLE, you can't even read the number of sf_buf's allocated
(kern.ipc.nsfbufs).  :-/
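
For reference, the application-side high watermark could look roughly
like the following.  This is only a sketch: `struct sf_gate`, its
fields, and both functions are hypothetical, and the ceiling has to be
picked by hand precisely because the kernel won't report sf_buf usage:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application-side gate on concurrent sendfile() calls. */
struct sf_gate {
    size_t active;      /* transfers currently in flight */
    size_t high_water;  /* hand-picked ceiling (an assumption) */
    size_t deferred;    /* requests parked instead of issued */
};

/*
 * Returns true if the caller may issue the system call now.  Otherwise
 * the request is counted as deferred and should be queued in userland.
 */
bool
sf_gate_enter(struct sf_gate *g)
{
    if (g->active >= g->high_water) {
        g->deferred++;
        return false;
    }
    g->active++;
    return true;
}

/*
 * Call when a transfer completes.  Returns true if a parked request
 * should now be dequeued and retried through sf_gate_enter().
 */
bool
sf_gate_leave(struct sf_gate *g)
{
    assert(g->active > 0);
    g->active--;
    if (g->deferred > 0) {
        g->deferred--;
        return true;
    }
    return false;
}
```

The weakness is the one described above: without a way to read the
number of sf_buf's in use, `high_water` is a guess rather than a
measurement.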

Other suggestions welcome including, "leave sendfile() alone, hack up
a new interface."

-sc

-- 
Sean Chittenden

