Date: Tue, 23 Mar 2010 01:25:54 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: freebsd-arch@freebsd.org, freebsd-current@freebsd.org
Subject: Re: Increasing MAXPHYS
Message-Id: <201003230825.o2N8PsLY032954@apollo.backplane.com>
References: <4BA633A0.2090108@icyb.net.ua> <5754.1269246223@critter.freebsd.dk>
	<20100322233607.GB1767@garage.freebsd.pl> <8D465FFB-0389-4321-84B9-E45292697D26@samsco.org>

:The whole point of the discussion, sans PHK's interlude, is to reduce
:the context switches and indirection, not to increase it.  But if you
:can show decreased latency/higher-iops benefits of increasing it, more
:power to you.  I would think that the results of DFly's experiment with
:parallelism-via-more-queues would serve as a good warning, though.
:
:Scott

Well, I'm not sure which experiment you are referring to, but I'll assume
it's the network threading, which actually works quite well.  The protocol
threads can be matched against the Toeplitz hash function, and in that case
the entire packet stream operates locklessly.  Even without that matching we
still get good benefits from batching (e.g. via ether_input_chain()), which
drops the IPI and per-packet switching overhead to essentially zero.  We
have other issues, but the protocol threads aren't one of them.

In any case, the lesson to learn when batching messages to a thread is that
you don't want the thread to immediately preempt the sender (if it happens
to be on the same cpu), or to generate an instant IPI (if going between
cpus).  That creates a degenerate case where you wind up with a thread
switch on every message, or an excessive messaging-interrupt rate... THAT is
what seriously hurts performance.  The key is to be able to batch multiple
messages per thread switch when under load, and to be able to maintain a
pipeline.

A single user-process test case will always have a bit more latency and can
wind up being inefficient for a variety of other reasons (e.g. whether the
target thread is on the same cpu or not), but that becomes less relevant
when the machine is under load, so it's a self-correcting problem for the
most part.  Once the machine is under load, batching becomes highly
efficient.  That is, latency != cpu cycle cost under load.  When the threads
have enough work to do they can pick up the next message without the cost of
entering a sleep state or needing a wakeup (or needing to generate an actual
IPI interrupt, etc).  Plus you can run lockless and you get excellent cache
locality.
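
To make the batching point concrete, here is a minimal userland-style sketch
of the idea (it is not the actual DragonFly code, and the names struct msgq,
msgq_send() and proto_thread() are made up for illustration): the sender only
wakes the consumer when it has gone idle, and the consumer drains the whole
queue per wakeup, so under load many messages are processed per sleep/wakeup
(or per IPI, in the kernel case).

    #include <pthread.h>
    #include <stddef.h>

    struct msg {
            struct msg      *next;
            /* payload ... */
    };

    struct msgq {
            pthread_mutex_t lock;
            pthread_cond_t  cv;
            struct msg      *head;
            struct msg      **tailp;
            int             running;        /* consumer actively draining */
    };

    static void
    msgq_init(struct msgq *q)
    {
            pthread_mutex_init(&q->lock, NULL);
            pthread_cond_init(&q->cv, NULL);
            q->head = NULL;
            q->tailp = &q->head;
            q->running = 0;
    }

    /* Producer: queue a message, only wake the consumer if it went idle. */
    static void
    msgq_send(struct msgq *q, struct msg *m)
    {
            int need_wakeup;

            m->next = NULL;
            pthread_mutex_lock(&q->lock);
            *q->tailp = m;
            q->tailp = &m->next;
            need_wakeup = !q->running;      /* don't poke a busy consumer */
            pthread_mutex_unlock(&q->lock);
            if (need_wakeup)
                    pthread_cond_signal(&q->cv);
    }

    /* Consumer: sleep only when idle, drain the whole batch per wakeup. */
    static void *
    proto_thread(void *arg)
    {
            struct msgq *q = arg;
            struct msg *batch;

            pthread_mutex_lock(&q->lock);
            for (;;) {
                    while (q->head == NULL) {
                            q->running = 0;
                            pthread_cond_wait(&q->cv, &q->lock);
                    }
                    q->running = 1;
                    batch = q->head;        /* grab the entire batch at once */
                    q->head = NULL;
                    q->tailp = &q->head;
                    pthread_mutex_unlock(&q->lock);

                    while (batch != NULL) { /* no wakeups while work remains */
                            struct msg *m = batch;

                            batch = m->next;
                            /* handle_msg(m); */
                    }
                    pthread_mutex_lock(&q->lock);
            }
            /* NOTREACHED */
            return (NULL);
    }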
So as long as you ensure these optimal operations become the norm under
load, you win.  Getting the threads to pipeline properly and avoid
unnecessary tsleeps and wakeups is the hard part.

--

But with regard to geom, I'd have to agree with you.  You don't want to
pipeline a single N-stage request through N threads.  One thread, sure...
that can be batched to reduce overhead.  N stages through N threads just
creates unnecessary latency, complicates your ability to maintain a
pipeline, and has a multiplicative effect on thread activity that negates
the advantage of having multiple cpus (and destroys cache locality as well).

You could possibly use a different trick, at least for some of the simpler
transformations: replicate the control structures on a per-cpu basis.  If
you do that, you can parallelize independent operations running through the
same set of devices and remove the bottlenecks.  The set of transformations
for a single BIO would be able to run lockless within a single thread, and
the control system as a whole would have one thread per cpu.  (Of course, a
RAID layer would require some rendezvous to deal with contention/conflicts,
but that's easily dealt with.)

That would be my suggestion.  We use that trick for our route tables in
DFly, and also for listen socket PCBs to remove choke points, and for a few
other things like statistics gathering.

					-Matt
					Matthew Dillon
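
As a rough sketch of the per-cpu replication trick, using statistics
gathering as the simplest case (NCPU, struct pcpu_stats and my_cpu_id() are
placeholders for illustration, not an existing kernel API): each cpu updates
only its own replica, so the hot path needs no locks and stays cache-local,
and a reader sums the replicas when it wants the global view.

    #include <stddef.h>
    #include <stdint.h>

    #define NCPU    64              /* placeholder for the real cpu count */

    /* One cache line per cpu so the replicas don't false-share. */
    struct pcpu_stats {
            uint64_t        bytes;
            uint64_t        packets;
    } __attribute__((aligned(64)));

    static struct pcpu_stats pcpu_stat_ary[NCPU];

    /* Placeholder; in the kernel this would be curcpu/mycpuid. */
    static int
    my_cpu_id(void)
    {
            return (0);
    }

    /* Hot path: runs on the owning cpu only, so no lock is needed. */
    static void
    stats_account(size_t len)
    {
            struct pcpu_stats *st = &pcpu_stat_ary[my_cpu_id()];

            st->bytes += len;
            st->packets++;
    }

    /* Cold path: aggregate the per-cpu replicas for reporting. */
    static void
    stats_collect(uint64_t *bytesp, uint64_t *packetsp)
    {
            int i;

            *bytesp = 0;
            *packetsp = 0;
            for (i = 0; i < NCPU; i++) {
                    *bytesp += pcpu_stat_ary[i].bytes;
                    *packetsp += pcpu_stat_ary[i].packets;
            }
    }

The same layout generalizes to the route-table and listen-socket cases:
each cpu indexes into its own replica by cpu id, and only the genuinely
cross-cpu operations (like the RAID rendezvous mentioned above) need any
synchronization.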