Date: Fri, 1 Dec 2006 01:30:54 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Robert Watson <rwatson@freebsd.org>
Cc: Ivan Voras <ivoras@fer.hr>, freebsd-arch@freebsd.org
Subject: Re: a proposed callout API
Message-ID: <200612010930.kB19Ushn064003@apollo.backplane.com>
References: <200611292147.kATLll4m048223@apollo.backplane.com> <11606.1164837711@critter.freebsd.dk> <ekmr76$23g$1@sea.gmane.org> <20061201012221.J79653@fledge.watson.org>
:The implications of adopting the model Matt proposes are quite far-reaching:
:callouts don't exist in isolation, but occur in the context of data structures
:and work occurring in many threads. If callouts are pinned to a particular
:...
:Consider the case of TCP timers: a number of TCP timers get regularly
:rescheduled (delack, retransmit, etc). If they can only be manipulated from
:cpu0 (i.e., protected by a synchronization primitive that can't be acquired
:from another CPU -- i.e., critical sections instead of mutexes), how do you
:handle the case where a TCP packet for that connection is processed on
:cpu1 and needs to change the scheduling of the timer? In a strict work/data
:structure pinning model, you would pin the TCP connection to cpu0, and only
:process any data leading to timer changes on that CPU. Alternatively, you
:might pass a message from cpu1 to cpu0 to change the scheduling.

Yes, this is all very true. One could think of it in a more abstract way if that makes things clearer: all the work processing related to a particular TCP connection is accumulated into a single 'hopper'. The hopper is what is serialized, whether by a mutex, by cpu locality, or even simply by thread locality (dedicating a single thread to process a single hopper). This means that all the work which has accumulated in the hopper can be processed while holding a single serializer, instead of having to acquire and release a serializer for each work item within the hopper. That's the gist of it. If you have enough hoppers, statistics takes care of the rest.

There is nothing that says the hoppers have to be pinned to particular cpus; it just makes it easier for other system APIs if they are. For FreeBSD, I think the hopper abstraction might be the way to go. You could then have worker threads running on each cpu (one per cpu) which compete for hoppers with pending work.
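To sketch the idea concretely (all names here -- hopper, hopper_work, hopper_run -- are hypothetical, not an existing FreeBSD or DragonFly API; userland pthreads stand in for a kernel serializer):

```c
#include <stdlib.h>
#include <pthread.h>
#include <sys/queue.h>

struct hopper_work {
	TAILQ_ENTRY(hopper_work) hw_link;
	void (*hw_func)(void *);	/* work item callback */
	void *hw_arg;
};

struct hopper {
	pthread_mutex_t h_serializer;		/* ONE serializer for the whole hopper */
	TAILQ_HEAD(, hopper_work) h_queue;	/* accumulated work items */
};

/*
 * Drain the hopper: acquire the serializer once, then process every
 * accumulated work item, instead of locking and unlocking per item.
 */
static void
hopper_run(struct hopper *h)
{
	struct hopper_work *hw;

	pthread_mutex_lock(&h->h_serializer);
	while ((hw = TAILQ_FIRST(&h->h_queue)) != NULL) {
		TAILQ_REMOVE(&h->h_queue, hw, hw_link);
		hw->hw_func(hw->hw_arg);
		free(hw);
	}
	pthread_mutex_unlock(&h->h_serializer);
}
```

The point is that the lock acquisition cost is amortized across however many work items have piled up, which is exactly where the statistics help: with enough hoppers, contention on any one serializer stays low.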
You can avoid wiring the hoppers to particular cpus (which FreeBSD people seem to dislike considerably) yet still reap the benefits of batch processing.

TCP callout timers are a really good example here, because TCP callout timers are ONLY ever manipulated from within the TCP protocol stack, which means they are only manipulated in the context of a TCP work item (a packet, a timeout, or user-requested work). If you think about it, nearly *all* the manipulation of the TCP callout timers occurs during work item processing, where you already hold the governing serializer. That is the manipulation that needs to become optimal here.

So the question for callouts then becomes: can the serializer used for work item processing be the SAME serializer that the callout API uses to control access to the callout structures? In the DragonFly model the answer is: yes, easily, because the serializer is cpu-localized. In FreeBSD the same thing could be accomplished by implementing a callout wheel for each 'hopper', controlled by the same serializer.

The only real performance issue is how to handle work item events caused by userland read()s or write()s: do you have those operations send a message to the thread managing the hopper, or do you have them obtain the hopper's serializer and enter the TCP stack directly? For FreeBSD I would guess the latter... obtain the hopper's serializer and enter the TCP stack directly. But if you were to implement it you could actually do it both ways and have a sysctl to select which method to use, then look at how that affects performance.

The other main entry point for packets into the TCP stack is from the network interface.
The network interface layer is typically interrupt driven, and just as typically it is not (in my opinion) the best idea to try to call the TCP protocol stack from the network interrupt, as it seriously elongates the code path and enlarges the cache footprint required to run through a network interface's RX ring. The RX ring is likely to contain dozens or even a hundred or more packets bound for a smaller (but still significant) number of TCP connections. Breaking that processing up into two separate loops... getting the packets off the RX ring and placing them in the correct hopper, then processing each hopper's work queue... would yield a far better cache footprint. Again, my opinion.

--

In any case, these methodologies basically exist to remove the need to acquire a serializer so fine-grained that its overhead becomes a serious component of the overhead of the work being serialized. That is *certainly* the case for the callout API. Sans serializer, the callout API is basically one or two TAILQ manipulations, and that is it. You can't get much faster than that. I don't think it is appropriate to try to abstract away the serializer when the serializer becomes such a large component. That's like hiding something you don't like under your bed.

--

Something just came to my attention... are you guys actually using high 'hz' values to govern your current callout API? In particular, the c->c_time field? If that is the case, the size of your callwheel array may be insufficient to hold even short timeouts without wrapping. That could create *serious* performance problems with the callwheel design. And I do mean serious. The entire purpose of having the callwheel is to support the notion that most timeouts will be removed or reset before they actually occur, meaning before the iterator (softclock_handler() in kern_timeout.c) gets to the index.
If you wrap, the iterator may wind up having to skip literally thousands or hundreds of thousands of callout structures during its scan.

So, e.g., a typical callwheel is sized to 16384 or 32768 entries ('print callwheelsize' from kgdb on a live kernel). At 100hz, 32768 entries gives you 327 seconds of range before callout entries start to wrap. At 1000hz, 32768 entries barely gives you 32 seconds of range. All TCP timers except the idle timer are fairly short-lived. The idle timer could be an issue for you. In fact, it could be an issue for us too... that's something I will have a look at in DragonFly.

You could also be hitting another big problem by using too fine-grained a timer/timeout resolution, and that is destroying the natural aggregation of work that occurs with coarse resolutions. It doesn't make much sense to have a handful of callouts at 10ms, 11ms, and 12ms, for example. It would be better to have them all in one slot (say, at 12ms) so they can all be processed in batch. This is particularly true for anything that can be processed with a tight code loop, and the TCP protocol stack certainly applies there. I think Jeffrey Hsu actually counted instruction cycles for TCP processing through the short-cut tests (the optimal/critical path when incoming data packets are in-order, non-overlapping, and such), and once he fixed some of the conditionals the number of instructions required to process a packet was reduced dramatically and certainly fit in the L1 cache.

Something to think about, anyhow. I'll read the paper you referenced. It looks interesting.

						-Matt
					Matthew Dillon
					<dillon@backplane.com>
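P.S. The callwheel range arithmetic is easy to double-check. A quick sketch (helper names are mine, but the numbers match a 32768-entry wheel): range is callwheelsize / hz seconds, and a timeout's slot is its tick count masked by callwheelsize - 1, so a 60-second TCP idle timeout at 1000hz (60000 ticks) has already wrapped nearly twice around a 32768-entry wheel.

```c
/* Seconds of range before c_time values wrap onto already-used slots. */
static int
callwheel_range_sec(int callwheelsize, int hz)
{
	return callwheelsize / hz;
}

/* Slot a timeout lands in; callwheelsize must be a power of two. */
static int
callwheel_slot(int ticks, int callwheelsize)
{
	return ticks & (callwheelsize - 1);
}
```

With callwheelsize = 32768: 327 seconds of range at 100hz, 32 seconds at 1000hz, and 60000 ticks lands in slot 27232, colliding with much shorter timeouts that the softclock iterator then has to skip over.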