Date:      Mon, 03 Mar 2003 23:16:04 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Sean Chittenden <sean@chittenden.org>
Cc:        Hiten Pandya <hiten@unixdaemons.com>, arch@FreeBSD.ORG
Subject:   Re: Should sendfile() to return ENOBUFS?
Message-ID:  <3E6452B4.E87BEC2@mindspring.com>
References:  <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com>

Sean Chittenden wrote:
> > 2)    You need to be damn sure you can guarantee a correct update
> >       of *sbytes; I believe this is very difficult in the case in
> >       question, which is why it blocks
> 
> I'm not convinced of this.  Have you poked through
> src/sys/kern/uipc_syscalls.c?  It's not that ugly/hard, nothing's
> impossible with a bit of refactoring.

I've done this.  I've ported the -current sendfile external buffer
code to FreeBSD 4.3, and again to FreeBSD 4.4, etc..  I'm rather
familiar with it, actually...


> > 3)    If sbytes is NULL, you should probably block, even on a
> >       non-blocking call.  The reason for this is that there is
> >       no way for the application to restart without *sbytes
> 
> This degrades terribly though and if you get a spike in traffic,
> degradation of performance is critical.

Sendfile degrades terribly under traffic spikes, period.  One thing
sendfile fails to do is honor the so_snd size limits that other
things honor, as it goes through its loop.

Technically, sendfile should be an async interface so it can lock
the so_snd window to the buffers-in-flight.  If it did this, it
could preallocate the memory at the time it's called, and then
reuse it internally until the operation has been completed.  Then
it could write its completion status.


> Going from a non-blocking application to a blocking call simply
> because of high use is murderous and is justification in itself
> enough for me to move away from the really nice zero-copy sockets
> that sendfile() affords me, back to the sluggish writev() syscall.

For POP3 and SMTP, and most other RFC822 derived protocols, you end
up having to store your files with <CR><LF> line delimiters, instead
of <LF>.  For FTP, you can only do binary transfers, etc..  The
sendfile interface is just a bad design, period.

That it performs badly under load is just icing on the cake.


> If a system is busy, it's stuck in an sfbufa state and blocks the
> server from servicing thousands of connections.

I understand.


> The symptoms are common and synonymous with mbuf exhaustion or any
> other kind of buffer exhaustion...  my point is that having this
> block is the worst way that sendfile() can degrade under
> high performance.

Dijkstra: preallocate your resources, and you do not have this
problem.  In this case, set your tunable high enough that even
were you to use up all your available buffers, there are NSFBUFS
available... and the problem goes away.


> > 4)    If you get rid of the blocking with (sbytes == NULL), you
> >       better add a BUGS section to the manual page.
> 
> There's nothing that says that sbytes can't be set to 0 if errno is
> EAGAIN, in fact, that's what it does right now.

If you send a non-zero amount of data, you need to know exactly
what was sent, in order to maintain connection state data pipe
coherency between the user space application requesting the send
on a connection basis, and the kernel space code that has done a
partial send.

Given your statement, though, we can say pretty surely that this
is HTTP...

Any other approach, and your only option to recover your state is
to close the connection and make the client retry.

So in the situation where the resources are limited, you end up
*increasing* the overall load by, instead of satisfying a client
with a single request, converting that into 5 requests, all of
which fail to deliver the data to the client.
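
The restart bookkeeping described above can be sketched in user space.
This is a minimal illustration, not FreeBSD code: send_fn is a
hypothetical stand-in that follows the reporting contract of the
sendfile(2) sbytes argument (the byte count is stored even when the call
fails with EAGAIN), and fake_send is an invented sender used only to
exercise the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical sender: returns 0 or -1 with errno set, and always stores
 * the number of bytes actually queued in *sbytes. */
typedef int (*send_fn)(off_t offset, size_t nbytes, off_t *sbytes, void *ctx);

/* Per-connection state the application must keep to be able to restart. */
struct xfer {
    off_t  offset;     /* next file offset to send */
    size_t remaining;  /* bytes still owed to the client */
};

/* Returns 1 when finished, 0 when the caller should retry later (EAGAIN),
 * -1 on a hard error.  The state advances by exactly what was reported. */
static int xfer_step(struct xfer *x, send_fn fn, void *ctx)
{
    while (x->remaining > 0) {
        off_t sent = 0;
        int rc = fn(x->offset, x->remaining, &sent, ctx);

        assert(sent >= 0 && (size_t)sent <= x->remaining);
        x->offset += sent;
        x->remaining -= (size_t)sent;

        if (rc == -1)
            return (errno == EAGAIN) ? 0 : -1;
        if (sent == 0)
            break;  /* no error and no progress: treat as end of file */
    }
    return 1;
}

/* Invented test sender: accepts at most `room` bytes, then reports EAGAIN. */
struct fake_sock { size_t room; };

static int fake_send(off_t offset, size_t nbytes, off_t *sbytes, void *ctx)
{
    struct fake_sock *s = ctx;
    size_t n = nbytes < s->room ? nbytes : s->room;

    (void)offset;
    *sbytes = (off_t)n;
    s->room -= n;
    if (n < nbytes) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

Without the reported count there is nothing to add to x->offset after a
partial send, which is the case where closing the connection becomes the
only recovery.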


> > Frankly I'm really surprised that you are blocking in this place; it
> > indicates an inability to get a page in the kernel map in the sf
> > zone, which, in turn, indicates that your NSFBUFS is improperly
> > tuned; if you are using sendfile, and tune up your other kernel
> > parameters for your system, don't forget NSFBUFS.
> 
> Well, it's set to 65535 at the moment.  How much higher you think I
> should set it?  :-] At some point I have to say, "it's high enough and
> I just need to get the application to degrade gracefully."  :-]

The sendfile interface does not degrade gracefully, period.  Even if
you dealt with the issue by setting *sbytes correctly in all cases,
and returning the right value to user space, you've increased the
number of system calls, potentially significantly.  So even if you
"correct" the behaviour, your degradation is going to be exponential.

One potential solution is to go to using KSE's, so that the blocking
context is not your whole process.  This allows you to write the
server as multithreaded.  Another is to do what Apache does, and run
processes per connection.

My recommendation was (and is): get a sufficiently large NSFBUFS in
the first place, so you never encounter the situation that results
in the non-graceful degradation.


> > While you could *technically* make sf_buf_alloc() non-blocking, in
> > general this would be a bad idea, given that the one place it's
> > called is in an interior loop that can be the subject of a "goto"
> > (so it's an embedded interior loop) in sendfile() itself.  I think
> > it would be very hard to satisfy #2, to allow it to be restartable
> > by the application, in the face of failure, and since *sbytes is not
> > a mandatory parameter, likely your application will end up barfing
> > (e.g. sending partial FTP files or HTML documents down, with no way
> > to recover from a failure, other than closing the client socket, and
> > hoping the client can recover).
> 
> Frankly, if a developer is stupid enough to pass in NULL for sbytes,
> they get what they deserve.  Returning -1 and setting errno to EAGAIN
> in the event that there aren't any sf_buf's available isn't what I'd
> call the programming exercise of the decade.  :-P

Nevertheless, the sendfile interface appears to allow this situation;
it is a flaw in the API design.  There are two ways to handle it:

1)	Any time you call sendfile on a non-blocking fd with
	(sbytes == NULL), *immediately* return EPARM or a
	similar error

2)	Allow the API to be inconsistent, and then have the OS
	accept the blame for broken applications, since it permits
	known broken parameter values
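
Option 1 can be illustrated as a user-space check. This is only a sketch
of the argument test, not a kernel patch: check_sbytes_arg is a
hypothetical helper name, and EINVAL is assumed here as the "similar
error".

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/* Returns 0 if the (socket, sbytes) combination is restartable,
 * -1 with errno set otherwise. */
static int check_sbytes_arg(int sock, const off_t *sbytes)
{
    int fl = fcntl(sock, F_GETFL);

    if (fl == -1)
        return -1;              /* errno already set by fcntl, e.g. EBADF */
    if ((fl & O_NONBLOCK) && sbytes == NULL) {
        errno = EINVAL;         /* assumed choice of error code */
        return -1;
    }
    return 0;
}
```

A wrapper would call this before issuing the real system call, so a
non-blocking caller that cannot learn how much was sent is rejected up
front instead of being left with an unrecoverable partial send.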


> > In a "flash crowd" case on an HTTP server, this basically means that
> > you will continuously get retries, and the situation will worsen,
> > exponentially, as people retry getting the same page.  In the FTP
> > case, or some other protocol without automatic retry on session
> > abandonment, of course, it will be fatal.
> 
> Hrm, let me redefine "fatal" as "changing the behavior of a system
> call to go from returning in less than 0.001ms, to returning in 2-15s
> for every connection when trying to make over ~500K sendfile(2) calls
> a second."  I'd call that a catastrophic failure to degrade
> successfully.  -sc

"Fatal" in this context was intended to imply "the clients do not
get their data, and get partial data and closed descriptors, instead,
thus breaking the contract between the client and the server".

And yeah, either way you look at it, it's a failure to degrade
gracefully... once again: the easy fix is to not put your system
in that position in the first place.  A less easy approach would
be to maintain a count of active sendfile instances in your
application, and queue up requests above some high watermark,
rather than making system calls.  Another would be to hard limit
the number of client connections you allow at once, etc..  The
least ugly of these (to my mind) is to not overcommit NSFBUFS in
the first place by always having at least 1 more than you could
ever need, preconfigured into the kernel.
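
The high-watermark approach could look roughly like this in the
application. All names here (sf_gate, gate_enter, gate_leave,
GATE_QUEUE_MAX) are hypothetical, connection ids are assumed to be
non-negative, and the sketch is single-threaded; it only shows the
counting and queueing, not the sendfile calls themselves.

```c
#include <assert.h>
#include <stddef.h>

#define GATE_QUEUE_MAX 1024     /* illustrative bound on deferred requests */

struct sf_gate {
    size_t active;              /* sendfile operations currently in flight */
    size_t high_water;          /* cap, chosen below the sf_buf supply */
    int    queue[GATE_QUEUE_MAX]; /* ring buffer of deferred connection ids */
    size_t head, count;
};

/* Returns 1 if the caller may issue the system call now, 0 if the request
 * was queued, -1 if the queue is full and the request must be refused. */
static int gate_enter(struct sf_gate *g, int conn)
{
    if (g->active < g->high_water) {
        g->active++;
        return 1;
    }
    if (g->count == GATE_QUEUE_MAX)
        return -1;
    g->queue[(g->head + g->count) % GATE_QUEUE_MAX] = conn;
    g->count++;
    return 0;
}

/* Call when an operation completes.  Returns the next deferred connection
 * id to start, or -1 if nothing is waiting. */
static int gate_leave(struct sf_gate *g)
{
    int conn;

    assert(g->active > 0);
    if (g->count == 0) {
        g->active--;
        return -1;
    }
    conn = g->queue[g->head];
    g->head = (g->head + 1) % GATE_QUEUE_MAX;
    g->count--;
    return conn;                /* slot is handed over; active is unchanged */
}
```

Requests above the watermark wait in the application's own queue rather
than inside a blocked system call, so the event loop stays responsive.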

-- Terry
