Date:      Tue, 04 Mar 2003 10:00:24 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Sean Chittenden <sean@chittenden.org>
Cc:        Hiten Pandya <hiten@unixdaemons.com>, arch@FreeBSD.ORG
Subject:   Re: Should sendfile() to return ENOBUFS?
Message-ID:  <3E64E9B8.EDCA54FE@mindspring.com>
References:  <20030303224418.GU79234@perrin.int.nxad.com> <20030304001230.GC36475@unixdaemons.com> <20030304002218.GY79234@perrin.int.nxad.com> <3E641131.431A0BA8@mindspring.com> <20030304040859.GB79234@perrin.int.nxad.com> <3E6452B4.E87BEC2@mindspring.com> <20030304081326.GD79234@perrin.int.nxad.com>

Sean Chittenden wrote:
> > Sendfile degrades terribly under traffic spikes, period.  One thing
> > sendfile fails to do is honor the so_snd size limits that other
> > things honor, as it goes through its loop.
> 
> Much to my dismay and frustration, I'm discovering this...  is there a
> better zero-copy socket file operation that can be used in place of
> sendfile()?  Alfred's mentioned something called kblob a few times but
> I haven't been able to dig up anything on it other than an old arch@
> discussion where it was shot down (unfortunately).

Personally, I would probably just add a flag to sendfile, and
treat *sbytes as an opaque pointer that caused a kevent back
on completion of the transmission.  You would still need to add
queues of blocking (*not* sleeping!) contexts, but it could be
done rather quickly.  This is more expedient than I usually
like to be, but, IMO, sendfile() is a lost cause, and spending
good resources after bad is not a wise investment.
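
As a rough sketch of the caller side of that flag idea (purely
illustrative: SF_KEVENT and its queue-and-notify semantics are
hypothetical, and exist in no sendfile() today):

    /*
     * Hypothetical: a sendfile() flag that queues the transfer and
     * posts a kevent on completion.  SF_KEVENT does not exist; this
     * sketches only the request/completion flow described above.
     */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <err.h>
    #include <errno.h>

    #define SF_KEVENT	0x8000		/* hypothetical flag */

    void
    queue_send(int kq, int filefd, int sock, off_t len)
    {
        struct kevent ev;
        off_t sbytes = 0;

        /* Queue the transfer; the call returns immediately. */
        if (sendfile(filefd, sock, 0, (size_t)len, NULL, &sbytes,
            SF_KEVENT) == -1 && errno != EINPROGRESS)
            err(1, "sendfile");

        /* Later: reap the completion as an event on kq. */
        (void)kevent(kq, NULL, 0, &ev, 1, NULL);
    }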

A better solution would be to add a different API to the system.

If your application is in an embedded system, there are even
more drastic approaches you can take, like setting PG_U on the
interesting kernel pages, and then accessing them directly
from user space with a system call interlock, minimally on any
allocations or deallocations (this only works if there is no
such thing as "someone else's process" running on your system,
since it provides an opportunity to corrupt kernel memory from
user space), etc..

The kblob interface is an interesting animal; Jeffrey Hsu has
done some good work in that area, but it's not entirely usable,
as it sits.  You might want to talk to Jonathan Lemon.  IMO, it
is probably a lost cause.


> > Technically, sendfile should be an async interface so it can lock
> > the so_snd window to the buffers-in-flight.  If it did this, it
> > could preallocate the memory at the time it's called, and then
> > reuse it internally until the operation has been completed.  Then
> > it could write its completion status.
> 
> I haven't spent more than a few seconds thinking about this, but
> wouldn't that require more mbufclusters to be in use but idle at any
> given time than the current implementation?

No.  First of all, it reduces the sfbuf requirements considerably,
by queueing request descriptors, instead, and satisfying them, as
it can.  Second, you can control the number of packets "in flight"
for each outstanding sendfile request in progress (unlike now), so
if you throttle this back to the so_snd size, in fact you will use
*fewer* mbuf clusters simultaneously, and you will reduce page
thrashing (remember that sendfile uses external mbufs that refer
to buffer cache pages via sfbuf mappings).
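
You can approximate that throttling from user space today, by
capping each sendfile() call at the socket's send buffer size; a
sketch (blocking socket assumed; a non-blocking one would need to
poll() before retrying):

    #include <sys/param.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <errno.h>

    /*
     * Sketch: cap each sendfile() chunk at the so_snd size, so one
     * connection never pins more than a send window's worth of
     * sf_bufs/mbuf clusters at a time.
     */
    void
    send_throttled(int filefd, int sock, off_t filesize)
    {
        int sndbuf;
        socklen_t optlen = sizeof(sndbuf);
        off_t off = 0, sbytes;

        getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen);
        while (off < filesize) {
            size_t chunk = MIN((size_t)sndbuf,
                (size_t)(filesize - off));
            sbytes = 0;
            if (sendfile(filefd, sock, off, chunk, NULL,
                &sbytes, 0) == -1 && errno != EAGAIN)
                break;		/* hard error */
            off += sbytes;	/* count partial progress */
        }
    }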


[ ... ]
> I don't quite understand what you're trying to say here.  What's the
> correlation between <CR><LF>/<LF> and system calls? CR+LF is always
> read/written as two bytes...  I must be missing the point of your
> comment.

It's a tangent that indicates sendfile() is generally inappropriate,
unless you also implement the recvfile() to go with it, and use it.
The issue is that UNIX text files are not stored in the wire formats
for these protocols, so using sendfile() on them is usually
inappropriate, unless you change how you store them.  Mail servers,
especially, break inbound and outbound data between applications,
so you'd have to hack them up to store incoming as <CR><LF>
delimited so that when you sent them out via sendfile(), they
were compliant with the protocol standard, on the wire.
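
The conversion is cheap if you do it once, at storage time; a
sketch:

    #include <stdio.h>

    /*
     * Expand bare <LF> to <CR><LF> as a message is spooled, so the
     * stored file already matches the wire format and can later be
     * sent by sendfile() without modification.
     */
    void
    spool_crlf(FILE *in, FILE *spool)
    {
        int c, prev = 0;

        while ((c = getc(in)) != EOF) {
            if (c == '\n' && prev != '\r')
                putc('\r', spool);
            putc(c, spool);
            prev = c;
        }
    }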


> > > If a system is busy, it's stuck in an sfbufa state and blocks the
> > > server from servicing thousands of connections.
> >
> > I understand.
> 
> Groovy: that's a third of the problem, what's the elegant solution?

You can't have one, in the context of the current sendfile.  You
need to change your context, if you want to address this issue and
get onto the next one, or you can accept the implementation of an
administrative limit to keep from banging your head on the design
limit, and cut your losses.

It really boils down to how much effort you are willing to spend on
it, for what return you expect.


> > > The symptoms are common and synonymous with mbuf exhaustion or any
> > > other kind of buffer exhaustion...  my point is that having this
> > > block is the worst way that sendfile() can degrade under
> > > high performance.
> >
> > Dijkstra: preallocate your resources, and you do not have this
> > problem.  In this case, set your tunable high enough that even
> > were you to use up all your available buffers, there are NSFBUFS
> > available... and the problem goes away.
> 
> I keep chasing this upper bound and pushing things higher and higher
> because sendfile() doesn't degrade worth beans... well, that's a hack
> and not a solution.

No, it's really a "Then don't do that" solution to the old "Doctor,
it hurts when I do this" complaint.

Before sendfile(), the answer was to mmap() the data to be sent,
and then call write() on it.  Doing that guaranteed that you would
not have to copy the data from user space to kernel space, because
the mapping was already established.  That solution can still work,
without using sendfile() to get the same performance.  The performance
"win" of sendfile is the assumption that the entire file will be
sent as a result of a single system call.
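
For reference, that older technique looks like this (a sketch; a
production loop would handle partial writes and very large files):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Pre-sendfile() approach: establish the mapping once, then
     * write() from it, so no read() staging copy through a user
     * buffer is ever made.
     */
    ssize_t
    send_mapped(int filefd, int sock)
    {
        struct stat st;
        void *p;
        ssize_t n;

        if (fstat(filefd, &st) == -1)
            return (-1);
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, filefd, 0);
        if (p == MAP_FAILED)
            return (-1);
        n = write(sock, p, st.st_size);	/* may be partial */
        munmap(p, st.st_size);
        return (n);
    }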


> The TCP stack, VM, and my general setup has
> scaled quite well.  The 1st thing to go, however, is the number of
> sf_buf's.

"If it ain't one thing, it's another"...

At some point, you have to bound the application to some
administrative limit to keep it from hitting the hard limits
inherent in the system; you aren't going to be able to address
all the hard limits you're going to run into, which come from
e.g. impedance mismatches over APIs like sendfile() that aren't
designed to handle them.

As someone else pointed out, there are a lot of low overheads
in various places in the FreeBSD kernel.  If you are a seven
foot tall person that wants to walk around without banging
your head every 5 feet, then there's a lot of remodelling you
are going to need to do to avoid that.

If you want to get CS technical, you have found a livelock
stall barrier: there are literally thousands of these in the
design of FreeBSD, as it stands, and most of them are unlikely
to ever get fixed, except in private commercial repositories
for FreeBSD-based products.


> I'm worried I'm going to run out of KVM here in the near
> future (and at that point, life basically begins to suck given my RAM
> requirements are all over the place, 64bit platforms other than the
> alpha aren't ready for prime time quite yet, and BSD has a hard kernel
> memory split that isn't dynamic).

Eventually, you will.  That's an inevitability.

*Why* you run out will depend on your application, and what system
characteristics seem important for it, to you.  For me, this is
usually "number of connections" or "ability to shed load in order
to degrade gracefully", etc..  So for me, it usually comes down to
the number of mbufs stuck in so_snd chains, and I set the cluster
count high enough that I don't hit my head.

As far as 64bit, the Alpha can't handle as much physical RAM as
the x86 (2G vs. 4G) at this point.

The hard kernel memory split will *never* be dynamic.  The closest
you can ever expect to get is separate process and kernel address
spaces, so that the kernel address space and process address space
are never simultaneously mapped.  Doing that means heroic efforts
are required to implement uiomove() et al.  Even so, the kernel
memory is generally non-pageable.  This is going to mean that you
will be able to use up all physical RAM with such a config, but in
doing so, you will leave yourself with no physical memory to give
to user programs.

It's a set of tradeoffs, and at some point, like it or not, you
hit your head.

Personally, I would probably never get rid of the simultaneous
mapping; it's just too useful.  For example, it's possible to
map RO in user space but RW in kernel space a page that permits
you to take no system call overhead for getpid/getgid/getuid/etc..
It's also possible to map a page RO in user space that contains
the clock structures from the timecounters in kernel space (this
is harder, but doable).  By doing this, you can have a zero
system call overhead "gettimeofday()" function, and guarantee
its atomicity by maintaining two regions and pointer-flipping
between them: reading the pointer atomically in user space
guarantees atomicity of the content references.  And so on.
These tricks all require that a PG_U bit set on a kernel page
makes it visible in user space.
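
The read side of the pointer-flip is tiny; a sketch, where the
page layout is hypothetical and the kernel is assumed to fill the
inactive region before flipping the pointer:

    #include <sys/time.h>

    /*
     * Hypothetical layout of a kernel page exported read-only to
     * user space (PG_U set).  The kernel updates the inactive
     * region, then atomically flips "current" to point at it.
     */
    struct shared_clock {
        struct timeval *volatile current;
        struct timeval regions[2];
    };

    /*
     * Zero-system-call gettimeofday(): a single atomic pointer load
     * selects a region the kernel is not currently writing.
     */
    struct timeval
    fast_gettimeofday(struct shared_clock *sc)
    {
        return (*sc->current);
    }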


[ ... ]

> > Any other approach, and your only option to recover your state is to
> > close the connection and make the client retry.
> 
> Agreed, but that's a non-option when trying to deliver a high level of
> reliability.  HTTP doesn't handle that so well.

I look at this as "load shedding after hitting capacity limits";
the failure is going to be no worse than the worst case, in that
scenario.

The problem you have with your sendfile lockups is actually not
that severe, per se.

Yes, you stall your user space processing until some of the
in-progress sendfile()'s that have happened previously drain out
the network interface, so it impacts your ability to accept new
connections, but it doesn't damage your ability to service the
existing connections.  If you turn the problem around, the real
problem you have is that you are not rejecting new connections
the moment before you hit this situation.

From that perspective, if you were to preallocate everything that
would be used for a given sendfile, AND either fail the sendfile()
completely ("WOULDBLOCK"), signalling user space to throttle new
requests to the interface, OR guarantee that it will run to
completion, then the problem is also solved.  It's just solved
a different way.
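
The caller side of that model would be simple (a sketch; SF_NOWAIT
and the preallocate-or-fail semantics are hypothetical):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <errno.h>

    #define SF_NOWAIT	0x4000	/* hypothetical flag; does not exist */

    /*
     * Under a preallocate-or-fail sendfile(), EWOULDBLOCK means the
     * kernel could not reserve resources for the whole transfer:
     * the caller's cue to stop accept()ing until transfers drain.
     */
    int
    try_sendfile(int filefd, int sock, off_t off, size_t len)
    {
        off_t sbytes = 0;

        if (sendfile(filefd, sock, off, len, NULL, &sbytes,
            SF_NOWAIT) == 0)
            return (1);	/* queued in full */
        return (errno == EWOULDBLOCK ? 0 : -1);	/* 0 == throttle */
    }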


> > So in the situation where the resources are limited, you end up
> > *increasing* the overall load by, instead of satisfying a client
> > with a single request, converting that into 5 requests, all of which
> > fail to deliver the data to the client.
> 
> But 'ya see, I wouldn't mind that at all: I'm not CPU bound and can
> afford the extra context switches back and forth from the user space.
> I'd bet dime to dollar that people who use sendfile(2) aren't CPU
> bound: they're IO/sf_buf bound.  Sure having sendfile() return EAGAIN
> will drive up the number of calls under high load, but I'd rather burn
> a few more cycles swapping contexts than I would getting stuck in a
> spin lock waiting for the required number of sf_buf's to become
> available.

I think they would care very, very much.

Here's why.

Consider a site with large files to deliver to their customers;
each of the customers has a pipe of a given data rate.  Logic
tells us that the data rate *at the customer* is going to dictate
how fast the send buffers drain out, which in turn, controls the
queue retention time for the resources on your server.  Large
pipes will drain fastest, and small pipes will take much longer.

This is the classic "equal resource requirement, variable time to
runb" scheduling algorithm problem.  If you accept requests at
random, then you are guaranteed, at some point, to stall all your
fast connections behind slow connections.

What about "retroactive RED queueing" as a solution?  In other
words, based on a calculated figure of merit for your load, you
decide to abandon existing connections, with a bias towards
abandoning the longest running connections first (maybe you use
a Poisson distribution; whatever).

Naively, this seems like a solution.  Practically, though, it's
not.  The reason is human psychology.  People on slow pipes are
patient, by definition.  This means that they will retry
for hours and hours, keeping you clogged up, no matter what.  So
in an overcapacity situation, you won't escape by dropping
"problem" connections (the only way to do that effectively is a
QoS negotiation that knows, at connection time, how big the pipe
of the connecting client will be).

The only answer is to slog through the workload, and RED on new
requests.

What this comes down to is guaranteeing "fairness".

So the conclusion?

Having sendfile() return "EAGAIN" is naive, unless you have a
means of limiting each sendfile to its *fair share* of sf_bufs.

And once again, we are at the point where the sendfile() implementation
is inadequate to the task.  There's no proportional allocator for
resources here; there's not even a simplistic count maintained of
the number of sendfile() requests simultaneously in progress.  And
there *can't* be.

Again, it comes down to the sendfile() API: the number of sessions
in progress at the same time is unknowable to the kernel, at this
point, because the sendfile() API is just "take this file and queue
as many mbufs on so_snd as you can, until you hit the end of the
file or until you run out of sf_bufs".

The *only* way to address this so that the kernel can *know*, to
*fairly* share resources among requesters, is to queue the requests
*to the kernel*, and then service them to completion.  *Only* then
can the kernel perform useful resource arbitration on your behalf.


> If I've got a connection queue of 60K, I want to free up as many
> connections as I can as fast as I can which makes sleeping the worst
> thing I can do because the contentions in queue just pile up.  A
> userland spin lock is going to result in a more responsive application
> than a kernel spin lock since the userland app will loop through the
> connection queue and free up sf_buf's as data gets sent out over the
> pipe (something that won't happen when stuck in msleep() in the
> kernel's spin lock).

You should look at the Rice University Scala Server project code;
much of it is based on FreeBSD.  One of the things they do is put
proactive load shedding at the stall barriers, rather than hitting
them.  They do things like LRP (one of my favorites), but they also
do other interesting work in that area.  Druschel, Banga, et. al.,
are all very smart guys.  You've probably heard of iMimic?

I'll give you a hint, though: shortest request first.  Blows to hell
on asymmetric client data rates, though (IMO).


> > The sendfile interface does not degrade gracefully, period.  Even if
> > you dealt with the issue by setting *sbytes correctly in all cases,
> > and returning the right value to user space, you've increased the
> > number of system calls, potentially significantly.  So even if you
> > "correct" the behaviour, your degradation is going to be
> > exponential.
> 
> ::nods:: But as stated above, there are worse things that can be done,
> most notably, blocking and letting connections pile up.

"Emoticon" time, I guess... ;^).

::sighs:: And then the next bottleneck becomes system call overhead,
and the next one after that becomes network I/O, and the next one
after that becomes PCI bus bandwidth, etc..

The correct thing to do is to *not let connections pile up*: after
a certain *very small* overage, drop them on the floor: do not
answer their SYN's.  Hell, if you have the source code to the
firmware for your network card, then don't cause an interrupt for
their SYN's *at all*.

One of the major pains in the butt for effective load shedding in
FreeBSD, as it currently stands, is the SYN cache.  The damn thing
accepts connections on your behalf by completing three-way handshakes
automatically, without giving you the opportunity to apply
feedback until *after* the connection is established.


> > One potential solution is to go to using KSE's, so that the blocking
> > context is not your whole process.  This allows you to write the
> > server as multithreaded.  Another is to do what Apache does, and run
> > processes per connection.
> 
> I'm antsy as hell to convert my apps to use KSE for this very reason,
> but I'm going to give myself a few more months before I turn the life
> blood of my business over to KSE.

Personally, I would not do it.  Not for networking equipment, where
you aren't CPU-bound.

Unfortunately, the benefits of KSE are mostly not worth the cost,
without SMP, and the cost of SMP is now much higher than it was
back when the work started.  Over time, clock multipliers have
gone up, and the limitation is all on internal bus bandwidth and
internal data stalls, much more than on raw compute cycles.
What good does a 3GHz processor do me, if I have to wait for 12
cycles per I/O cycle, if I'm I/O bound?

The one thing it really buys you is the ability to program lazily,
using threads instead of finite state automatons.  That's OK for
some applications, of course, but mostly for ones which are
compute bound, since it adds lock contention and cache contention
and TLB shootdowns, and protection domain crossings.  Probably
the one place I'd be willing to eat that is if I had a large
Java-based server, where I'm going to be eating compute cycles
like crazy in the JVM, and so it makes sense to throw compute
cycles at the problem.


> > My recommendation was (and is): get a sufficiently large NSFBUFS in
> > the first place, so you never encounter the situation that results
> > in the non-graceful degradation.
> 
> That's not a solution though, that's a work around/hack.  :-] I've
> hacked/worked around, but I need a solution.  Making sendfile(2) "do
> the right thing(TM)" I thought was the solution (still do).

Come up with a new API.  It needs to:

1)	Queue its requests to the kernel, so that the kernel has
	enough information to make useful decisions

2)	Respect the limit on the so_snd depth (minimally; there are
	reasons for load tuning to make it even more severe, on
	purpose, to control router queue depths for slow customer
	pipes)

3)	Send a kevent when the file send has been completed

4)	Preallocate resources before taking something off the queue

Those are the minimum design requirements, from a 50,000 foot view.
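
Purely as a strawman, the interface might look like this (every
name is made up; it exists only to make those four requirements
concrete):

    #include <sys/types.h>

    /*
     * Strawman only; none of these names exist.  One request is
     * queued to the kernel (requirement 1), drained no faster than
     * so_snd allows (2), completed with a kevent carrying sfq_udata
     * (3), and started only once its resources are preallocated (4).
     */
    struct sfq_request {
        int	sfq_filefd;	/* file to send */
        int	sfq_sock;	/* destination socket */
        off_t	sfq_offset;
        size_t	sfq_nbytes;
        void	*sfq_udata;	/* returned in the completion kevent */
    };

    int	sendfile_queue(int kq, struct sfq_request *req);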


> > 2)    Allow the API to be inconsistent, and then have the OS
> >       accept the blame for broken applications, since it permits
> >       known broken parameter values
> 
> I don't follow...  how would this fix anything?  I don't understand
> why this would be necessary given what I'd proposed/suggested earlier.

It doesn't fix anything.  If you want something fixed, you are
back to the option you aren't thrilled with.  If you see a third
option, you should talk about it.

Actually, the first option is suspect, because of the headers
and trailers from *hdtr.  Maintaining accurate header and file
content indexing for arbitrary length headers, or handling a
partial completion on the header/trailer, is unrecoverable, even
if *sbytes accurately reflects the amount of the file itself that
was sent.  8-(.


> > And yeah, either way you look at it, it's a failure to degrade
> > gracefully... once again: the easy fix is to not put your system in
> > that position in the first place.
> 
> Lol!  I wish I had that as an option.  Near infinite demand doesn't
> give me this luxury.

Shed load before it takes resources.  Seriously.


> I'd actually thought about having my application do this on the fly
> and automatically tune itself based on the number of free sf_buf's,
> but this brings up another problem with sendfile(2): there's no way of
> determining how many sf_buf's are in use at any given time and on
> -STABLE, you can't even read the number of sf_buf's allocated
> (kern.ipc.nsfbufs).  :-/
> 
> Other suggestions welcome including, "leave sendfile() alone, hack up
> a new interface."

That would be my recommendation.  The sendfile() interface has
always been an architectural wart.  It's there, IMO, to compete
with Linux ("Linux has one, we need one").  There are changes that
could be made to the implementation details to make it less of an
egregious hack, but there's no way to make it a non-hack.

-- Terry
