From owner-freebsd-smp@FreeBSD.ORG Mon Nov 17 21:13:57 2008
From: John Baldwin
To: "Archimedes Gaviola"
Cc: freebsd-smp@freebsd.org
Date: Mon, 17 Nov 2008 16:09:54 -0500
Subject: Re: CPU affinity with ULE scheduler
Message-Id: <200811171609.54527.jhb@freebsd.org>
In-Reply-To: <42e3d810811170336rf0a0357sf32035e8bd1489e9@mail.gmail.com>
References: <42e3d810811100033w172e90dbl209ecbab640cc24f@mail.gmail.com>
	<42e3d810811170311uddc77daj176bc285722a0c8@mail.gmail.com>
	<42e3d810811170336rf0a0357sf32035e8bd1489e9@mail.gmail.com>

On Monday 17 November 2008 06:36:40 am Archimedes Gaviola wrote:
> On Mon, Nov 17, 2008 at 7:11 PM, Archimedes Gaviola wrote:
> > On Fri, Nov 14, 2008 at 12:28 AM, John Baldwin wrote:
> >> On Thursday 13 November 2008 06:55:01 am Archimedes Gaviola wrote:
> >>> On Wed, Nov 12, 2008 at 1:16 AM, John Baldwin wrote:
> >>> > On Monday 10 November 2008 11:32:55 pm Archimedes Gaviola wrote:
> >>> >> On Tue, Nov 11, 2008 at 6:33 AM, John Baldwin wrote:
> >>> >> > On Monday 10 November 2008 03:33:23 am Archimedes Gaviola wrote:
> >>> >> >> To Whom It May Concern:
> >>> >> >>
> >>> >> >> Can someone explain or share something about the ULE scheduler (the
> >>> >> >> latest version 2, if I'm not mistaken) and how it deals with CPU
> >>> >> >> affinity? Are there any existing benchmarks of this on FreeBSD? I am
> >>> >> >> currently using the 4BSD scheduler, and what I have observed,
> >>> >> >> especially when processing high network load on multiple CPU cores,
> >>> >> >> is that only one CPU is being stressed with network interrupts while
> >>> >> >> the rest are mostly idle. This is an AMD-64 (4x) dual-core IBM system
> >>> >> >> with GigE Broadcom network interface cards (bce0 and bce1). Below is a
> >>> >> >> snapshot of the case.
> >>> >> >
> >>> >> > Interrupts are routed to a single CPU.  Since bce0 and bce1 are both on
> >>> >> > the same interrupt (irq 23), the CPU that interrupt is routed to is
> >>> >> > going to end up handling all the interrupts for bce0 and bce1.  This is
> >>> >> > not something ULE or 4BSD have any control over.
> >>> >> >
> >>> >> > --
> >>> >> > John Baldwin
> >>> >> >
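(For reference: a quick way to confirm the shared interrupt line from
userland is to look at the per-IRQ counters.  A rough sketch, assuming a
stock 6.x/7.x userland, with the counter columns left as placeholders:

  # vmstat -i | grep bce
  irq23: bce0 bce1           <total count>        <rate>

If both NICs show up on the same irqNN line there, they share a single
interrupt thread (the "irq23: bce0 bce1" entry in the top(1) snapshot
below), and that one thread can only run on one CPU at a time.)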
> >>> >>
> >>> >> Hi John,
> >>> >>
> >>> >> I'm sorry for the wrong snapshot.  Here's the right one for my concern.
> >>> >>
> >>> >>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >>> >>    17 root        1 171   52     0K    16K CPU0   0  54:28 95.17% idle: cpu0
> >>> >>    15 root        1 171   52     0K    16K CPU2   2  55:55 93.65% idle: cpu2
> >>> >>    14 root        1 171   52     0K    16K CPU3   3  58:53 93.55% idle: cpu3
> >>> >>    13 root        1 171   52     0K    16K RUN    4  59:14 82.47% idle: cpu4
> >>> >>    12 root        1 171   52     0K    16K RUN    5  55:42 82.23% idle: cpu5
> >>> >>    16 root        1 171   52     0K    16K CPU1   1  58:13 77.78% idle: cpu1
> >>> >>    11 root        1 171   52     0K    16K CPU6   6  54:08 76.17% idle: cpu6
> >>> >>    36 root        1 -68 -187     0K    16K WAIT   7   8:50 65.53% irq23: bce0 bce1
> >>> >>    10 root        1 171   52     0K    16K CPU7   7  48:19 29.79% idle: cpu7
> >>> >>    43 root        1 171   52     0K    16K pgzero 2   0:35  1.51% pagezero
> >>> >>  1372 root       10  20    0 16716K  5764K kserel 6  58:42  0.00% kmd
> >>> >>  4488 root        1  96    0 30676K  4236K select 2   1:51  0.00% sshd
> >>> >>    18 root        1 -32 -151     0K    16K WAIT   0   1:14  0.00% swi4: clock s
> >>> >>    20 root        1 -44 -163     0K    16K WAIT   0   0:30  0.00% swi1: net
> >>> >>   218 root        1  96    0  3852K  1376K select 0   0:23  0.00% syslogd
> >>> >>  2171 root        1  96    0 30676K  4224K select 6   0:19  0.00% sshd
> >>> >>
> >>> >> Actually, I was doing network performance testing on this system with
> >>> >> FreeBSD-6.2 RELEASE and its default 4BSD scheduler.  I used a tool to
> >>> >> generate a large amount of traffic, around 600-700 Mbps, traversing the
> >>> >> FreeBSD system in both directions, meaning both network interfaces are
> >>> >> receiving traffic.  What happened was that the CPU (cpu7) handling
> >>> >> irq 23 for both interfaces reached a high utilization, around 65.53%,
> >>> >> which affects other running applications and services like sshd and
> >>> >> httpd.  They are no longer accessible while the traffic is being pushed
> >>> >> through.  With only one CPU being stressed, I was thinking of moving to
> >>> >> FreeBSD-7.0 RELEASE with the ULE scheduler, because I thought my
> >>> >> problem had something to do with how the scheduler distributes load
> >>> >> across multiple CPU cores, especially when processing network load.
> >>> >> So, if this is more about interrupt handling and not the scheduler, is
> >>> >> there a way we can optimize it?  If everything is still routed to only
> >>> >> one CPU, then to me it is still inefficient.  What handles interrupt
> >>> >> scheduling and CPU binding, so that shared IRQs can be avoided?  Are
> >>> >> there any improvements in FreeBSD-7.0 with regard to interrupt
> >>> >> handling?
> >>> >
> >>> > It depends.  In all likelihood, the interrupts from bce0 and bce1 are both
> >>> > hardwired to the same interrupt pin and so they will always share the same
> >>> > ithread when using the legacy INTx interrupts.  However, bce(4) parts do
> >>> > support MSI, and if you try a newer OS snap (6.3 or later) these devices
> >>> > should use MSI, in which case each NIC would be assigned to a separate
> >>> > CPU.  I would suggest trying 7.0 or a 7.1 release candidate and seeing if
> >>> > it does better.
> >>> >
> >>> > --
> >>> > John Baldwin
> >>> >
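(A note on verifying this after the upgrade: on 7.x, MSI vectors show up as
irq256 and above, while legacy INTx keeps the low irq numbers.  A rough
sketch of how to check and, if needed, force the behaviour; hw.bce.msi_enable
is from memory and worth double-checking against bce(4):

  # vmstat -i | grep bce      # irq256/irq257 here means MSI is in use
  # pciconf -lc               # newer pciconf can list the MSI capability per device

  # in /boot/loader.conf, to fall back to INTx for comparison:
  hw.pci.enable_msi="0"
  # per-driver knob, if bce(4) provides one:
  hw.bce.msi_enable="0"

Each MSI vector gets its own interrupt thread, which is why bce0 and bce1
land on separate CPUs in the snapshot below.)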
> >>>
> >>> Hi John,
> >>>
> >>> I tried the 7.0 release, and each network interface is now allocated
> >>> separately to a different CPU.  Here, MSI is already working.
> >>>
> >>>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME    WCPU COMMAND
> >>>    12 root        1 171 ki31     0K    16K CPU6   6 123:55 100.00% idle: cpu6
> >>>    15 root        1 171 ki31     0K    16K CPU3   3 123:54 100.00% idle: cpu3
> >>>    14 root        1 171 ki31     0K    16K CPU4   4 123:26 100.00% idle: cpu4
> >>>    16 root        1 171 ki31     0K    16K CPU2   2 123:15 100.00% idle: cpu2
> >>>    17 root        1 171 ki31     0K    16K CPU1   1 123:15 100.00% idle: cpu1
> >>>    37 root        1 -68    -     0K    16K CPU7   7   9:09 100.00% irq256: bce0
> >>>    13 root        1 171 ki31     0K    16K CPU5   5 123:49  99.07% idle: cpu5
> >>>    40 root        1 -68    -     0K    16K WAIT   0   4:40  51.17% irq257: bce1
> >>>    18 root        1 171 ki31     0K    16K RUN    0 117:48  49.37% idle: cpu0
> >>>    11 root        1 171 ki31     0K    16K RUN    7 115:25   0.00% idle: cpu7
> >>>    19 root        1 -32    -     0K    16K WAIT   0   0:39   0.00% swi4: clock s
> >>> 14367 root        1  44    0  5176K  3104K select 2   0:01   0.00% dhcpd
> >>>    22 root        1 -16    -     0K    16K -      3   0:01   0.00% yarrow
> >>>    25 root        1 -24    -     0K    16K WAIT   0   0:00   0.00% swi6: Giant t
> >>> 11658 root        1  44    0 32936K  4540K select 1   0:00   0.00% sshd
> >>> 14224 root        1  44    0 32936K  4540K select 5   0:00   0.00% sshd
> >>>    41 root        1 -60    -     0K    16K WAIT   0   0:00   0.00% irq1: atkbd0
> >>>     4 root        1  -8    -     0K    16K -      2   0:00   0.00% g_down
> >>>
> >>> The bce0 interface interrupt (irq256) is stressed out, already taking
> >>> 100% of CPU7, while CPU0 is at around 51.17%.  Any more recommendations?
> >>> Is there anything we can do about optimization with MSI?
> >>
> >> Well, on 7.x you can try turning net.isr.direct off (sysctl).  However, it
> >> seems you are hammering your bce0 interface.  You might want to try using
> >> polling on bce0 and seeing if it keeps up with the traffic better.
> >>
> >> --
> >> John Baldwin
> >>
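(For anyone following along, both suggestions are plain knobs.  A rough
sketch, assuming a 7.x box; whether bce(4) actually supports polling should
be checked against its man page:

  # sysctl net.isr.direct=0   # queue inbound packets to the netisr thread
                              # (swi1: net) instead of running the whole
                              # stack from the NIC's interrupt thread

  # kernel config, then rebuild and reboot:
  options DEVICE_POLLING

  # ifconfig bce0 polling     # enable polling per interface

With net.isr.direct=1, as in the 7.0 snapshot above, inbound packets are
processed to completion in the irq256/irq257 ithreads, which is why those
threads accumulate the CPU time; with it set to 0 that work moves to the
swi1: net thread, as in the next snapshot.)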
> >
> > With net.isr.direct=0, my IBM system shows lower CPU utilization per
> > interface (bce0 and bce1), but swi1: net increases its utilization.
> > Can you explain what is happening here?  What does net.isr.direct do
> > that decreases the CPU utilization of the interfaces?  I really want to
> > know what happens internally from the time packets are received by the
> > interfaces, through the device interrupt, up to the software interrupt
> > level, because I am confused by enabling/disabling net.isr.direct in
> > sysctl.  Is there a tool that can be used to trace this, just to see
> > which part of the kernel internals is the bottleneck, especially when
> > net.isr.direct=1?  By the way, with device polling enabled the system
> > experienced packet errors and the interface throughput was worse, so I
> > avoid using it.
> >
> >   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >    16 root        1 171 ki31     0K    16K CPU10  a  86:06 89.06% idle: cpu10
> >    27 root        1 -44    -     0K    16K CPU1   1  34:37 82.67% swi1: net
> >    52 root        1 -68    -     0K    16K WAIT   b  51:59 59.77% irq32: bce1
> >    15 root        1 171 ki31     0K    16K RUN    b  69:28 43.16% idle: cpu11
> >    25 root        1 171 ki31     0K    16K RUN    1 115:35 24.27% idle: cpu1
> >    51 root        1 -68    -     0K    16K CPU10  a  35:21 13.48% irq31: bce0
> >
> >
> > Regards,
> > Archimedes
> >
>
> One more thing: I observed that when net.isr.direct=1, bce0 is using
> irq256 and bce1 is using irq257, while with net.isr.direct=0, bce0 is now
> using irq31 and bce1 is using irq32.  What makes the difference?

That is not from net.isr.direct.  irq256/257 is when the bce devices are
using MSI.  irq31/32 is when the bce devices are using INTx.

--
John Baldwin
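(A follow-up note for the archives: the irq number itself is the easiest
tell, e.g.

  # vmstat -i | grep bce
  irq256: bce0 ...            # 256 and up: the device attached with an MSI vector
  irq31: bce0 ...             # legacy range: the device is using INTx

Whether MSI is used is decided when the driver attaches, based on driver and
chipset support and the hw.pci.enable_msi loader tunable, not by
net.isr.direct, so if the irq numbers changed between runs, something else
must have changed across a reboot as well.)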