From owner-freebsd-chat Wed Oct 25 13:47:28 2000
From: Terry Lambert
Message-Id: <200010252024.NAA29899@usr01.primenet.com>
Subject: Re: kqueue microbenchmark results
To: sim@stormix.com (Simon Kirby)
Date: Wed, 25 Oct 2000 20:24:18 +0000 (GMT)
Cc: jlemon@flugsvamp.com (Jonathan Lemon), dank@alumni.caltech.edu (Dan Kegel),
    chat@FreeBSD.ORG, linux-kernel@vger.kernel.org
In-Reply-To: <20001025112709.A1500@stormix.com> from "Simon Kirby"
    at Oct 25, 2000 11:27:09 AM

> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.  Doing everything in smaller
> chunks would increase overhead (but maybe reduce latencies very slightly
> -- albeit probably not much when using a get_events()-style interface).

Applications that:

o	Want to limit their memory footprint by limiting the amount of
	process VM they consume, and so limit their buffer size to less
	than all the data the stacks might be capable of providing at
	one time

o	Use fixed-size messages, and want to operate on one message at
	a time, without restricting the sender to sending only a single
	message or whole messages at one time (see the sketch below)

o	Want to limit their overall system processing overhead for
	irrelevant/stale data (example: one which implements state
	delta refresh events, such as "Bolo" or "Netrek")

o	Have to implement "leaky bucket" algorithms, where it is
	permissible to drop some data on the floor, and assume it will
	be retransmitted later (e.g. ATM, or user space protocols which
	want to implement QoS guarantees)

o	Need to take advantage of kernel strategies for protection from
	denial of service attacks, without having to redo those
	strategies themselves (particularly flood attacks; this is the
	same reason inetd supports connection rate limiting on behalf
	of the programs it is responsible for starting)

o	Listen on multiple data channels, some of which are more
	important than others (e.g. the Real Networks streaming media
	protocols)

o	Want to evaluate the contents of a security negotiation prior
	to accepting data that was sent using an expired certificate or
	otherwise bogus credentials

There are all sorts of good reasons a programmer would want to trust
the kernel, instead of having to build ring buffers into each and every
program they write to ensure they remember data which is irrelevant to
the processing at hand, or to protect their code against buffer
overflows initiated by trusted applications.
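As a concrete illustration of the fixed-size message case above, here
is a minimal sketch in plain C of a reader that pulls exactly one
message per readiness event and deliberately leaves anything beyond
that in the kernel's socket buffer.  The message size, struct, and
function name are hypothetical, chosen only for illustration; the
descriptor is assumed to be non-blocking and driven by some event
notification mechanism.

/*
 * Consume at most one fixed-size message per call; any additional
 * queued data stays buffered in the kernel until the application is
 * ready for it.
 */
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MSG_SIZE 128                    /* hypothetical fixed message size */

struct msgbuf {
	char	data[MSG_SIZE];
	size_t	have;                   /* bytes accumulated so far */
};

/*
 * Returns 1 when a complete message is in buf->data, 0 if more data
 * is needed, -1 on error or EOF.  The recv() is capped at the
 * remainder of a single message.
 */
int
read_one_message(int fd, struct msgbuf *buf)
{
	ssize_t n;

	n = recv(fd, buf->data + buf->have, MSG_SIZE - buf->have, 0);
	if (n < 0)
		return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
	if (n == 0)
		return -1;              /* peer closed the connection */
	buf->have += (size_t)n;
	if (buf->have < MSG_SIZE)
		return 0;               /* partial message; wait for more */
	buf->have = 0;                  /* reset for the next message */
	return 1;
}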
> Isn't it probably better to keep the kernel implementation as efficient
> as possible so that the majority of applications which will read (and
> write) all data possible can do it as efficiently as possible?  Queueing
> up the events, even as they are in the form received from the kernel, is
> pretty simple for a userspace program to do, and I think it's the best
> place for it.

Reading, yes.  Writing, no.  The buffers they are filling in the kernel
belong to the kernel, not the application, despite what Microsoft tells
you about WSOCK32 programming.

The WSOCK32 model assumes that the networking is implemented in another
user space process, rather than in the kernel.  People who use the
"async" WSOCK32 interface rarely understand the implications, because
they rarely understand how async messages are built using a Windows
data pump, which serializes all requests through the moral equivalent
of a select loop (which is why NT supports async notification on socket
I/O, but other versions of Windows do not [NB: actually, they could,
using an fd=-1 I/O completion port, but the WSOCK32 programmers were a
bit lazy and were also being told to keep performance under that of
NT]).

In any case, it's not just a matter of queueing up kernel events, it's
also a matter of reacting to the events partially instead of
completely: if an event comes in that says you have 1k of data, and you
only read 128 bytes of it, you have to requeue, in LIFO instead of FIFO
order, a modified event with 1k-128 bytes remaining, so that the next
read completes as expected.  Very gross code, which must then be
duplicated in every user space program, and which either requires a
"tail minus one" pointer or requires doubly linking the user space
event queue.

> I know nothing about any other implementations, though, and I'm
> speaking mainly from the experiences I've had with coding daemons
> using select().

Programs which are select-based are usually DFAs (Deterministic Finite
Automata), which operate on non-blocking file descriptors.  This means
that I/O is not interleaved, and so is not necessarily as efficient as
it could be, should an I/O completion ever post sooner after being
checked than the time it takes to complete 50% of an event processing
cycle (the reasons for this involve queueing theory algebra, and are
easier to explain in terms of the relative value of positive and
negative caches in situations where a cache miss results in the need
to perform a linear traversal).

A lot of this can be "magically" alleviated by using POSIX AIO calls in
the underlying implementation, instead of relying on non-blocking I/O
-- even then, don't expect better than a 50% improvement, and be
prepared for it to be closer to 30%-40%.  Expect kernel threads to max
out at less than 50% (and usually closer to 25%-35%), and expect kernel
threads on any non-commercial OS, and on SVR4 systems excluding the
most recent versions of Solaris, to do much worse than that.

Really, the performance of DFAs based on select loops has more to do
with the mean processing time for any arbitrary event than anything
else.  I was able to run logins for an entire lab full of X terminals
with an xdm replacement DFA that was very tightly controlled in its
longest and mean path processing latency, and recover the use of a
SPARCServer that was otherwise burning all its VM and CPU resources on
swapping 32 copies of xdm in and out for no good reason.

The AfterBurner web server (John Sokol, et al.) is still, I believe,
the highest performance HTTP server by a wide margin, beating out
anything else, threaded or not, since it is DFA-based.
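For reference, here is a minimal sketch of the general shape of a
select-based DFA loop as described above (not the xdm replacement or
AfterBurner, just the pattern).  The handler table, handler_t type, and
the convention that a handler swaps its own slot to move a descriptor
to a new state are all hypothetical names for illustration; descriptors
are assumed to already be non-blocking.

/*
 * Skeleton of a select()-driven event loop.  Each active descriptor
 * has a handler; a handler reads as much or as little as it wants,
 * may replace handlers[fd] to change state (this is what makes it a
 * DFA), and returns -1 to ask for the descriptor to be closed.
 */
#include <sys/select.h>
#include <unistd.h>

#define MAX_CONN	FD_SETSIZE

typedef int (*handler_t)(int fd);	/* returns -1 to close the fd */

static handler_t handlers[MAX_CONN];	/* NULL == slot unused */

void
event_loop(void)
{
	fd_set rfds;
	int fd, maxfd;

	for (;;) {
		FD_ZERO(&rfds);
		maxfd = -1;
		for (fd = 0; fd < MAX_CONN; fd++) {
			if (handlers[fd] != NULL) {
				FD_SET(fd, &rfds);
				if (fd > maxfd)
					maxfd = fd;
			}
		}
		if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
			continue;	/* interrupted; retry */
		for (fd = 0; fd <= maxfd; fd++) {
			if (handlers[fd] != NULL && FD_ISSET(fd, &rfds)) {
				/* Advance this descriptor's automaton. */
				if (handlers[fd](fd) < 0) {
					close(fd);
					handlers[fd] = NULL;
				}
			}
		}
	}
}

The important property for the latency argument above is that every
handler must bound its own processing time, since nothing else runs
until it returns.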
Even kqueue and kernel threads are DFAs, in the limit: the question is
on which side of the user/kernel barrier the code lives, not whether or
not it lives somewhere.  All UNIX schedulers (well, all reasonable and
deployed-for-actual-work ones) are DFAs in disguise.

> You mention you wrote a paper discussing this issue... Where could I
> find this?

I'm pretty sure he will give you the paper reference; I personally
don't have the reference handy (sorry).  I think this is more a case of
believing the weight of the literature, rather than an individual
report, so you will want to look at his paper, but also at others in
the field, before you decide to believe anything specific (though I
have no doubt you'll agree with his findings after you do that).
Otherwise, it's just him against Linus, and making a decision based on
one paper or an author's reputation is bad science (and bad
engineering).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message