Date: Sun, 26 Oct 2003 00:09:21 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Robert Watson <rwatson@freebsd.org>
Cc: Marcel Moolenaar <marcel@xcllnt.net>
Subject: Re: Synchronization philosophy (was: Re: FreeBSD mail list etiquette)
Message-ID: <200310260709.h9Q79LEq034660@apollo.backplane.com>
References: <Pine.NEB.3.96L.1031026001054.74063I-100000@fledge.watson.org>
:...
:Moving towards message passing helps to structure the code to avoid
:sharing of this sort, although it's not the only way to motivate that sort
:of change.  I'm a big fan of the change in -CURRENT to use td->td_cred as
:a read-only thread-local credential reference and avoid synchronization on
:the credential reference--it nicely addresses the requirements for
:consistency in the referenced data for the read-only cases (which are the
:vast majority of uses of a credential).

I saw td->td_cred in -current, but while figuring out what to do in
DragonFly I realized that there was something fundamentally wrong with
pulling a ucred out of a proc *OR* a thread.

What's wrong is that the ucred associated with the current process,
current thread, passed process, or passed thread may not be the correct
ucred for a particular operation, especially in DragonFly, where I/O
operations are going to be messages and may be aggregated into a single
thread for a particular target, filesystem, or protocol in the case of
the network.  There is also the potential for writing buggy code... if
the programmer has a thread-based ucred handy he might use it whether it
is the correct ucred or not.

So in DragonFly I kept the ucred in the proc structure but only refer to
it at the highest level.  The ucred is passed as an argument from the
highest level and lower levels don't try to get it out of the proc or
thread.  I've already fixed this for VFS.  I haven't fixed it for the
network stack yet, but it's pretty obvious to me that this is the only
correct way of doing it, and it's actually not that hard to do.

VFS was interesting... the ucred was either being passed directly or
pulled out of the proc structure inappropriately for things like
VOP_WRITE, and indirectly via struct uio's proc/thread pointer.
And then there were a dozen hacks to deal with the case where a flush or
VM paging I/O might not occur from the originating user process, hence
people may notice the existence of a ucred pointer in struct buf.  It
was pretty bad.  I ripped it all out.  No ucred is passed for VOP I/O
any more at all.  The ucred is expected to be established at 'open' of
the file (aka vnode), but once open the driver is expected to be able to
read and write through that vnode without any further ucred info.  And,
in fact, most filesystem drivers completely ignored the 'ucred' argument
for VOP_READ and VOP_WRITE anyway.  Only NFS didn't.  For NFS I just
used the root ucred, but later wound up having to record the ucred at
open to deal with -maproot exports that used something other than root.
But that's a damn sight better than strewing ucred references all over
the generic I/O APIs.

:...
:issues in FreeBSD: for example, even the relatively simple accounting file
:rotation would benefit from queue-like operation to serialize the
:accounting data/event stream and rotation events.  Using locks and
:condition variables to perform serialization as is currently done in the
:accounting code is unwieldy and bug-prone.  However, when moving to
:event/message queuing, you also have to be very careful with data
:ownership and referencing, as well as proper error-handling.  With

The accounting code is a really excellent example.  The solution there
is not to queue to a single serializing thread, but instead to queue to
the filesystem driver, which pulls out the appropriate data (which it
will own in DragonFly's case, hence no synchronization issue) and then
queues to the accounting file or to a thread that manages the accounting
file.  All you would be queueing at the highest level might be a pointer
to the vnode; you would not actually dereference the vnode.

:...
:regardless of whether accounting of that operation succeeds or fails,
:execve() continues).
:The start/stop operation, however, is intended to be synchronous.
:Happily, in the accounting case, all necessary error checking can be
:performed in advance of the handoff to the accounting thread from the
:user thread, but that won't always be the case...

Synchronous ops can be queued as easily as asynchronous ops.  The only
difference is that the caller has to wait for the synchronizing event to
complete.  In DragonFly that would be a message that the caller waits on
for a reply.

In fact, this can be advantageous.  Take filesystem snapshots, for
example.  If all VOPs for a filesystem are queued to a single thread
representing that filesystem (as a simplified view), then you don't need
fancy snapshot support at what would normally be considered the API
boundary.  The message port becomes the API boundary, and all a snapshot
has to do is freeze the incoming messages, wait for the existing ones
(which are undergoing I/O) to complete, then tell the underlying block
device to snapshot.  Bang!  A poor man's snapshot that could be adapted
to virtually any filesystem.

:One of the other risks that has worried me about this approach is that
:explicit locking has some nice benefits from the perspective of
:deadlocking and lock order management: monitoring for deadlocks and lock
:orders is a well-understood topic, and the tools for tracking deadlocks
:and wait conditions, as well as for performing locking and waiting safely,
:are mature.  As with the older BSD sleeping interfaces, such as
:tsleep(), synchronous waits on messages are harder to mechanically track,
:and resulting deadlocks resemble resource deadlocks more than lock
:deadlocks... On the other hand, some forms of tracing may be made easier.
:I've had some pretty nasty experiences trying to track deadlocks between
:cooperating threads due to message waits, and found tools such as WITNESS
:much easier to work with.
I would agree that tracking a message, which might traverse several
threads before being returned, is far harder than breaking into DDB and
doing a backtrace.  On the other hand, if you know the pointer to the
message, adding debugging fields to the message to track who retrieved
or sent it last is quite easy.  You only have to shim lwkt_sendmsg(),
lwkt_domsg(), lwkt_waitmsg(), and so forth.  Then from DDB> you just
'print *msg' from the frame of the process that is blocked waiting on
the message.

Deadlocking is something that a fine-grained model has to deal with a
lot more than a serializing threaded model does.  In the threaded model
one has to contend with far *fewer* high level locks.  In DragonFly my
hope is that the issue becomes even further removed due to the IPI
messaging mechanism, which devolves many things that have to be mutexes
in 5.x into simple critical sections (like spl's) in DragonFly.

:...
:thread making use of thread-local storage to avoid explicit
:synchronization.  Having dealt both with lower level lock/cv primitives
:for event passing, and message passing, I have to say I'm leaning far more
:towards the message passing.  However, it benefits particularly from the
:approach due to its asynchronous nature...

Message passing is far superior for asynchronous operations, when you
don't have to wait, because there are a thousand ways to implement an
asynch messaging backend and a thousand ways to optimize it in ways that
can completely avoid conflicts, cache issues, mutexes, and so forth.
With a locking abstraction you always have to execute the locking and
unlocking primitives, which can be far more expensive.  A messaging
backend, hidden by the messaging abstraction, can optimize for cases
that would be EXTREMELY cumbersome to optimize for with a locking
abstraction.  Take the example below... it's overkill, but it is only an
illustration.
Here the messaging backend can optimize for the same-cpu case, whether
the client allows blocking or not (whether the client used
lwkt_sendmsg() or lwkt_domsg(), basically), and potentially other
optimizations.  Every single optimization is COMPLETELY hidden from the
client that dispatched the message.

(this function is executed in the context of the client, not the target,
but may queue to the target)

    messaging_backend_for_a_particular_port(msg, datastructure)
    {
        if (datastructure->cpu == mycpu) {
            crit_enter();       <<< not even this if interrupts don't
                                <<< mess with the data structure
            do stuff
            crit_exit();
            return(code);
        }
        if (client_wants_to_send_msg_synchronously) {
            ... fine, the client is allowed to block or switch ...
            ... just queue it, but without a reply port ...
            lwkt_switch();      /* do a quick blip blip, see if it's done */
            if ((msg->ms_flags & MSGF_DONE) == 0) {
                ... wasn't a quick blip blip, formally block on the
                    message, wait for wakeup ...
            }
            return(msg->ms_retcode);
        } else {
            ... just queue it ...
            return(EASYNC);
        }
    }

On the other hand, locking versus messaging in the synchronous case is a
lot harder to quantify performance-wise, because theoretically you do
not have the overhead of having to build a message in the synchronous
case if you are using locking.  So even if the message itself implements
locking internally, degenerating into, well, locking, it would still be
slower.  But I would argue that for certain inflection points in the
system, such as VOP_READ() or VOP_WRITE() for example, a messaging
abstraction will be far superior to an API and/or locking abstraction,
because the tiny difference in overhead is well worth all the cool
things you can hide behind the abstraction.  System calls fall under
this category as well.  Encapsulating a system call in a message is
trivial (because DragonFly already does it :-)), and only slightly
slower than the direct syscall mechanism.  But the power one gets out of
that sort of abstraction is phenomenal.
:> For a protocol, a protocol thread will own a PCB, so the PCB will be
:> 'owned' by the cpu the protocol thread is on.  Any manipulation of the
:> PCB must occur on that cpu or otherwise be very carefully managed
:> (e.g. FIFO rindex/windex for the socket buffer and a memory barrier).
:> Our intention is to encapsulate most operations as messages to the
:> protocol thread owning the PCB.
:
:I'll be very interested to see how this ends up working out: even in
:RELENG_4, FreeBSD has sufficient apparent parallelism/preemption in the
:network stack to require synchronization at most levels.  In RELENG_4,
:most upward bound traffic is serialized via the netisr thread, but
:downward traffic from user threads passes through the stack in parallel.
:Do you anticipate handing off control to a netisr-like thread earlier than
:RELENG_4 does in order to get things "into the right thread"?

That's a good point.  The real question here is determined by how one
defines 'parallelism'.  The reality of an N-cpu system is that the only
actual parallelism you have is N, the number of cpus in the system.  So
the only effect that really matters with the serializing effect on
downward-bound network traffic is the cost of the context switch.  In
DragonFly's case that would be a lightweight kernel thread context
switch, which is far less expensive than a heavyweight context switch.
But it *is* still a context switch.

The question in regards to DragonFly is whether the messaging function
can deal with the case where, say, a write() results in queueing an
output packet as efficiently as 4.x would be able to deal with the case
by calling tcp_output() (the messaging function that is executed in the
context of the client, that is).
My take on the situation is that on a 2xCPU system DragonFly could
potentially be faster, because the message that wakes up the protocol
thread can potentially be waking up the protocol thread on a different
cpu and returning immediately to userland on the original cpu, without
userland having to wait for the protocol operation.  On a UP system
DragonFly would be slower along this path due to the extra context
switch, at least under low loads.  Under high loads DragonFly ought to
be around the same speed, because the messages would aggregate and the
number of context switches into the protocol thread would actually go
down relative to the number of write()'s being performed.

It is also possible that DragonFly would not be slower under any
circumstances, especially if the TCP pipe is full, because the protocol
thread will be woken up by the ACK packet coming back up the stack from
the NETIF interrupt (which is effectively a software interrupt), and it
could process the WRITE-wakeup at the same time.  In fact, this could
potentially be faster, because the WRITE going down the stack wouldn't
even bother to run tcp_output(); it would just wake up the protocol
thread (send a wakeup message to it, anyway), which is a NOP if the
protocol thread is already awake.  So a WRITE would degenerate to
appending to the socket buffer and returning to userland, and the
protocol thread would primarily be driven by ACKs coming back over the
wire.  That's a 2-for-1 deal, since ACKs are typically one per every two
data packets.

:One of the conceptual issues we've been wrestling with is the issue of
:ordering of events: from the perspective of performance, guaranteeing the
:weakest possible ordering is desirable.  However, with parallelism in the
:stack, you risk introducing weaker than permitted ordering of packets
:(triggering fast retransmits, etc).  In most situations, it's sufficient
:to maintain source ordering: if two packets are from the same source,
:their ordering should be maintained.
:If they are from separate sources, maintaining order isn't required.
:This raises interesting questions about when you want to defer
:processing, and when to try and "catch up" to maintain performance and
:ordering.

This is a problem if you have multiple protocol threads picking up
random work.  I don't think it is a problem if you partition the PCBs
amongst the protocol threads so that a particular PCB is always handled
by a particular protocol thread.  This is the old 'get a lock, pull work
off the queue, execute it' problem that arises when one is using locks
to parallelize operations on queues.  It reminds me of NFS's nfsiod
processes, which until I fixed it had EXACTLY the problem you describe.
Several nfsiod processes would trip over each other trying to get the
same VNODE lock for read-ahead and write ops.

:> DragonFly is using its LWKT messaging API to abstract blocking versus
:> non-blocking.  In particular, if a client sends a message using an
:> asynch interface it isn't supposed to block, but can return EASYNC if it
:> wound up queueing the message due to not being able to execute it
:> synchronously without blocking.  If a client sends a message using a
:> synchronous messaging interface then the client is telling the
:> messaging subsystem that it is ok to block.
:
:The risk here is not so much that the first level of consumer code for an
:interface can't determine if it will block or not, but that after a few
:levels of abstraction, that information has disappeared (or worse, is
:conditional on context in an opaque way).

Ah, but that assumes that the levels of abstraction are enclosed within
the one (originating) thread.  With a messaging layer, major inflection
points are in different threads, so the problem becomes
compartmentalized.  An operation which MUST be asynch in thread #1 does
not have to be asynch in thread #2, which picks it up.

:> in the system.
:> 90% of the 'not sure if we might block' problem
:> is related to scheduling or memory allocation and neither of those
:> subsystems needs to use extraneous mutexes, so managing the blocking
:> conditions is actually quite easy.
:
:Again, I think it comes down to the fact that memory allocation APIs
:typically offer choices to the consumer: block if the resources aren't
:available, or fail.  My mood swings a bit back and forth as to what the
:ideal strategy would/should be at the lowest level, but I think as you
:move up the stack, the exact semantics at the bottom matter less.  The
:APIs are generally clear, but it's the programming layered on top of it
:that's sloppy (alternatively: at the bottom level API, the behavior is
:well-documented, but as you move up the stack, the behavior typically
:becomes more poorly documented).  While it's probably appropriate to say
:"this is a property of poor programming, or at least, programming done
:against a model that we want no longer to hold true", there's still the
:issue of cleaning up the legacy code...
:
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert@fledge.watson.org      Network Associates Laboratories

In the case of FreeBSD-4.x, and 5.x as well, the problem devolves into
the unknown quantity that locking the kernel_map represents, and the
unknown quantity that obtaining and releasing a mutex represents.  Even
in DragonFly, for allocations, I have to deal with M_NOWAIT and locking
the kernel_map if the slab allocator has no free memory in the per-cpu
globaldata area.  And in DragonFly I have to deal with the equivalent of
mutexes in regard to the IPI messaging.  Even though IPI messaging is
effectively asynchronous, it can *STILL* stall for a bit.  If the IPI
message FIFO between two cpus is full, the IPI messaging code must spin
waiting for it to empty a little, and at the same time process any
incoming IPI messages (to avoid IPI messaging deadlocks between the two
cpus).
This means that any time you send an IPI message you have to assume that
your cpu will also have processed any pending IPI messages sent to it.
This hasn't created any problems yet, because IPI operations are still
serialized within the context of the cpu (so we don't have to worry
about preemption), and since IPI messaging is used to serialize
operations on a cpu anyway, the fact that several IPI messages may
execute when you send a new one does not really interfere with the
abstraction.

					-Matt
					Matthew Dillon
					<dillon@backplane.com>