From: Alexander Motin
Date: Mon, 06 Feb 2012 09:04:31 +0200
To: freebsd-hackers@freebsd.org
Subject: [RFT][patch] Scheduling for HTT and not only
Message-ID: <4F2F7B7F.40508@FreeBSD.org>

Hi.

I've analyzed scheduler behavior and think I've found the problem with HTT.
SCHED_ULE knows about HTT and does the right thing when it does its load
balancing once a second. Unfortunately, if some other thread gets in the
way, a process can easily be pushed out to another CPU, where it will stay
for another second because of CPU affinity, possibly sharing a physical
core with something else without any need. I've made a patch reworking the
SCHED_ULE affinity code to fix that:

http://people.freebsd.org/~mav/sched.htt.patch

This patch does three things:
 - Disables the strict affinity optimization when HTT is detected, to let
   the more sophisticated code take the load of the other logical core(s)
   into account.
 - Adds affinity support to the sched_lowest() function to prefer the
   specified (last used) CPU, and the CPU groups it belongs to, in case of
   equal load. The previous code always selected the first valid CPU among
   equals, which caused threads to migrate toward lower-numbered CPUs
   without need. (A simplified sketch of this tie-breaking follows the
   list.)
 - If the current CPU group has no CPU where the process can run now at its
   priority, sequentially check the parent CPU groups before doing a global
   search. That should improve affinity for the next cache levels.
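To illustrate the second item, here is a minimal, self-contained sketch of
the tie-breaking idea. It is not the actual patch: the cpu_load structure,
the lowest_cpu() helper and the "prefer" parameter are hypothetical
simplifications standing in for the real sched_lowest()/cpu_group code in
sched_ule.c.

	#include <limits.h>

	struct cpu_load {
		int	cpu;	/* logical CPU id */
		int	load;	/* runnable threads queued on it */
	};

	/*
	 * Return the index of the least loaded CPU.  On a tie, keep the
	 * thread's last-used CPU ("prefer") instead of always taking the
	 * first (lowest-numbered) candidate.
	 */
	static int
	lowest_cpu(const struct cpu_load *cpus, int ncpus, int prefer)
	{
		int best = -1, bestload = INT_MAX;

		for (int i = 0; i < ncpus; i++) {
			if (cpus[i].load < bestload ||
			    (cpus[i].load == bestload &&
			    cpus[i].cpu == prefer)) {
				best = i;
				bestload = cpus[i].load;
			}
		}
		return (best);
	}

Without the "prefer" check, a thread whose last CPU ties with a
lower-numbered CPU is still moved there and loses its cache affinity.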
I've run several different benchmarks to test it, and so far the results
look promising:
 - On an Atom D525 (2 physical cores + HTT) I tested HTTP receive with
   fetch and FTP transmit with ftpd. On receive I got 103MB/s on the
   interface; on transmit somewhat less, about 85MB/s. In both cases the
   scheduler kept the interrupt thread and the application on different
   physical cores. Without the patch, receive speed fluctuated between
   about 80 and 103MB/s, while transmit stayed at about 85MB/s.
 - On the same Atom I tested TCP speed with iperf and got mostly the same
   results:
   - receive to the Atom: 755-765Mbit/s with the patch, 531-765Mbit/s
     without it;
   - transmit from the Atom: 679Mbit/s in both cases.
   I think the fluctuating receive behavior in both tests can be explained
   by some heavy callout handled by the swi4:clock process, invoked on
   receive (seen in top and schedgraph) but not on transmit. It may be
   specific to the Realtek NIC driver.
 - On the same Atom I tested 512-byte reads from an SSD with dd in 1 and
   32 streams. I found no regressions, but no benefits either: with one
   stream there is no congestion, and with multiple streams all cores are
   congested.
 - On a Core i7-2600K (4 physical cores + HTT) I've run more than 20
   `make buildworld`s with different -j values (1, 2, 4, 6, 8, 12, 16) for
   both the original and the patched kernel. I found no performance
   regressions, while for -j4 I got a 10% improvement:

# ministat -w 65 res4A res4B
x res4A
+ res4B
    N           Min           Max        Median           Avg        Stddev
x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
+   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
Difference at 95.0% confidence
        -159.587 ± 51.9496
        -10.0921% ± 3.28524%
        (Student's t, pooled s = 22.9197)

and for -j6 a 3.6% improvement:

# ministat -w 65 res6A res6B
x res6A
+ res6B
    N           Min           Max        Median           Avg        Stddev
x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
+   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
Difference at 95.0% confidence
        -51.1467 ± 20.6211
        -3.66694% ± 1.47842%
        (Student's t, pooled s = 9.09782)

Who wants to do independent testing to verify my results or run some more
interesting benchmarks? :)

PS: Sponsored by iXsystems, Inc.

--
Alexander Motin