From owner-freebsd-smp@FreeBSD.ORG Mon Nov 17 21:13:50 2008
From: John Baldwin <jhb@freebsd.org>
To: "Archimedes Gaviola"
Cc: freebsd-smp@freebsd.org
Subject: Re: CPU affinity with ULE scheduler
Date: Mon, 17 Nov 2008 16:09:15 -0500
Message-Id: <200811171609.15913.jhb@freebsd.org>
In-Reply-To: <42e3d810811170311uddc77daj176bc285722a0c8@mail.gmail.com>
References: <42e3d810811100033w172e90dbl209ecbab640cc24f@mail.gmail.com>
 <200811131128.55220.jhb@freebsd.org>
 <42e3d810811170311uddc77daj176bc285722a0c8@mail.gmail.com>

On Monday 17 November 2008 06:11:00 am Archimedes Gaviola wrote:
> On Fri, Nov 14, 2008 at 12:28 AM, John Baldwin wrote:
> > On Thursday 13 November 2008 06:55:01 am Archimedes Gaviola wrote:
> >> On Wed, Nov 12, 2008 at 1:16 AM, John Baldwin wrote:
> >> > On Monday 10 November 2008 11:32:55 pm Archimedes Gaviola wrote:
> >> >> On Tue, Nov 11, 2008 at 6:33 AM, John Baldwin wrote:
> >> >> > On Monday 10 November 2008 03:33:23 am Archimedes Gaviola wrote:
> >> >> >> To Whom It May Concern:
> >> >> >>
> >> >> >> Can someone explain or share information about how the ULE scheduler
> >> >> >> (the latest, version 2, if I'm not mistaken) deals with CPU affinity?
> >> >> >> Are there any existing benchmarks of this on FreeBSD? I am currently
> >> >> >> using the 4BSD scheduler, and what I have observed, especially while
> >> >> >> processing heavy network traffic on multiple CPU cores, is that only
> >> >> >> one CPU is stressed with network interrupts while the rest stay
> >> >> >> mostly idle. This is an IBM AMD64 system with four dual-core CPUs and
> >> >> >> GigE Broadcom network interface cards (bce0 and bce1). Below is a
> >> >> >> snapshot of the case.
> >> >> >
> >> >> > Interrupts are routed to a single CPU. Since bce0 and bce1 are both
> >> >> > on the same interrupt (irq 23), the CPU that interrupt is routed to
> >> >> > is going to end up handling all the interrupts for bce0 and bce1.
> >> >> > This is not something ULE or 4BSD have any control over.
> >> >> >
> >> >> > --
> >> >> > John Baldwin
> >> >> >
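A quick way to see that sharing from userland is to read the same interrupt
counters that vmstat -i prints. The little program below is only a sketch; it
assumes the hw.intrnames and hw.intrcnt sysctls that vmstat(8) reads. A shared
line shows up as a single entry such as "irq23: bce0 bce1", while a device with
its own vector gets its own entry.

/*
 * Illustrative sketch only: dump per-interrupt counters by reading the
 * hw.intrnames / hw.intrcnt sysctls (the same data vmstat -i uses).
 * A shared INTx line appears as one entry ("irq23: bce0 bce1");
 * with MSI each device shows up under its own entry.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char *names, *name;
    unsigned long *counts;      /* interrupt counters are u_long on FreeBSD */
    size_t nameslen, countslen, i, ncounts;

    /* First calls with a NULL buffer just ask the kernel for the sizes. */
    if (sysctlbyname("hw.intrnames", NULL, &nameslen, NULL, 0) == -1 ||
        sysctlbyname("hw.intrcnt", NULL, &countslen, NULL, 0) == -1) {
        perror("sysctlbyname");
        return (1);
    }
    if ((names = malloc(nameslen)) == NULL ||
        (counts = malloc(countslen)) == NULL) {
        perror("malloc");
        return (1);
    }
    if (sysctlbyname("hw.intrnames", names, &nameslen, NULL, 0) == -1 ||
        sysctlbyname("hw.intrcnt", counts, &countslen, NULL, 0) == -1) {
        perror("sysctlbyname");
        return (1);
    }

    /* hw.intrnames is a packed list of NUL-terminated strings. */
    ncounts = countslen / sizeof(*counts);
    name = names;
    for (i = 0; i < ncounts && name < names + nameslen; i++) {
        if (*name != '\0')
            printf("%-24s %20lu\n", name, counts[i]);
        name += strlen(name) + 1;
    }
    free(names);
    free(counts);
    return (0);
}

Run it while the traffic generator is going and the counter next to the shared
entry should climb for traffic arriving on either NIC.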
> >> >> Hi John,
> >> >>
> >> >> I'm sorry for the wrong snapshot. Here's the right one with my concern.
> >> >>
> >> >>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >> >>    17 root       1 171   52     0K    16K CPU0   0  54:28 95.17% idle: cpu0
> >> >>    15 root       1 171   52     0K    16K CPU2   2  55:55 93.65% idle: cpu2
> >> >>    14 root       1 171   52     0K    16K CPU3   3  58:53 93.55% idle: cpu3
> >> >>    13 root       1 171   52     0K    16K RUN    4  59:14 82.47% idle: cpu4
> >> >>    12 root       1 171   52     0K    16K RUN    5  55:42 82.23% idle: cpu5
> >> >>    16 root       1 171   52     0K    16K CPU1   1  58:13 77.78% idle: cpu1
> >> >>    11 root       1 171   52     0K    16K CPU6   6  54:08 76.17% idle: cpu6
> >> >>    36 root       1 -68 -187     0K    16K WAIT   7   8:50 65.53% irq23: bce0 bce1
> >> >>    10 root       1 171   52     0K    16K CPU7   7  48:19 29.79% idle: cpu7
> >> >>    43 root       1 171   52     0K    16K pgzero 2   0:35  1.51% pagezero
> >> >>  1372 root      10  20    0 16716K  5764K kserel 6  58:42  0.00% kmd
> >> >>  4488 root       1  96    0 30676K  4236K select 2   1:51  0.00% sshd
> >> >>    18 root       1 -32 -151     0K    16K WAIT   0   1:14  0.00% swi4: clock s
> >> >>    20 root       1 -44 -163     0K    16K WAIT   0   0:30  0.00% swi1: net
> >> >>   218 root       1  96    0  3852K  1376K select 0   0:23  0.00% syslogd
> >> >>  2171 root       1  96    0 30676K  4224K select 6   0:19  0.00% sshd
> >> >>
> >> >> Actually, I was doing network performance testing on this system with
> >> >> FreeBSD 6.2-RELEASE and its default 4BSD scheduler. I used a tool to
> >> >> generate a large amount of traffic, around 600-700 Mbps, traversing the
> >> >> FreeBSD system in both directions, meaning both network interfaces were
> >> >> receiving traffic. What happened was that the CPU (cpu7) handling irq 23
> >> >> for both interfaces consumed a large amount of CPU time, around 65.53%,
> >> >> which affected other running applications and services such as sshd and
> >> >> httpd; the system was no longer accessible while it was being bombarded
> >> >> with traffic. With only one CPU being stressed, I was thinking of moving
> >> >> to FreeBSD 7.0-RELEASE with the ULE scheduler, because I thought my
> >> >> problem had to do with how the scheduler distributes load across
> >> >> multiple CPU cores, especially when processing network load. So, if this
> >> >> is more a matter of interrupt handling than of the scheduler, is there a
> >> >> way to optimize it? If everything is still routed to one CPU, then to me
> >> >> it is still inefficient. What handles the assignment of interrupts to
> >> >> CPUs, and can interrupts be bound to a CPU so that a shared IRQ is
> >> >> avoided? Are there any improvements in interrupt handling in
> >> >> FreeBSD 7.0?
> >> >
> >> > It depends. In all likelihood, the interrupts from bce0 and bce1 are
> >> > both hardwired to the same interrupt pin, so they will always share the
> >> > same ithread when using the legacy INTx interrupts. However, bce(4)
> >> > parts do support MSI, and if you try a newer OS snapshot (6.3 or later)
> >> > these devices should use MSI, in which case each NIC would be assigned
> >> > to a separate CPU. I would suggest trying 7.0 or a 7.1 release candidate
> >> > and seeing if it does better.
> >> >
> >> > --
> >> > John Baldwin
> >> >
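If you do move to MSI, one way to confirm that the interrupt load really
spreads out is to watch the per-CPU time counters that top summarizes. The
sketch below is only illustrative; it assumes the kern.cp_times sysctl
(per-CPU CPUSTATES counters exported by newer kernels; the aggregate
kern.cp_time is the older, always-present form) and simply prints raw
interrupt and idle ticks for each CPU.

/*
 * Illustrative sketch only: print per-CPU interrupt and idle tick counters
 * from the kern.cp_times sysctl (assumed available; the layout is CPUSTATES
 * long values per CPU, as defined in <sys/resource.h>).
 */
#include <sys/types.h>
#include <sys/resource.h>       /* CPUSTATES, CP_INTR, CP_IDLE */
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    long *times;
    size_t len, i, ncpu;

    if (sysctlbyname("kern.cp_times", NULL, &len, NULL, 0) == -1) {
        perror("sysctlbyname(kern.cp_times)");
        return (1);
    }
    if ((times = malloc(len)) == NULL) {
        perror("malloc");
        return (1);
    }
    if (sysctlbyname("kern.cp_times", times, &len, NULL, 0) == -1) {
        perror("sysctlbyname(kern.cp_times)");
        return (1);
    }

    /* CPUSTATES entries (user, nice, sys, intr, idle) per CPU. */
    ncpu = len / (CPUSTATES * sizeof(long));
    for (i = 0; i < ncpu; i++)
        printf("cpu%zu: intr ticks %ld, idle ticks %ld\n", i,
            times[i * CPUSTATES + CP_INTR],
            times[i * CPUSTATES + CP_IDLE]);
    free(times);
    return (0);
}

Sampling it twice and diffing the tick counts gives a rough per-CPU interrupt
load over the interval.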
> >> Hi John,
> >>
> >> I tried the 7.0 release, and each network interface is now allocated its
> >> own interrupt on a separate CPU. MSI is working here.
> >>
> >>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C    TIME    WCPU COMMAND
> >>    12 root       1 171 ki31     0K    16K CPU6   6  123:55 100.00% idle: cpu6
> >>    15 root       1 171 ki31     0K    16K CPU3   3  123:54 100.00% idle: cpu3
> >>    14 root       1 171 ki31     0K    16K CPU4   4  123:26 100.00% idle: cpu4
> >>    16 root       1 171 ki31     0K    16K CPU2   2  123:15 100.00% idle: cpu2
> >>    17 root       1 171 ki31     0K    16K CPU1   1  123:15 100.00% idle: cpu1
> >>    37 root       1 -68    -     0K    16K CPU7   7    9:09 100.00% irq256: bce0
> >>    13 root       1 171 ki31     0K    16K CPU5   5  123:49  99.07% idle: cpu5
> >>    40 root       1 -68    -     0K    16K WAIT   0    4:40  51.17% irq257: bce1
> >>    18 root       1 171 ki31     0K    16K RUN    0  117:48  49.37% idle: cpu0
> >>    11 root       1 171 ki31     0K    16K RUN    7  115:25   0.00% idle: cpu7
> >>    19 root       1 -32    -     0K    16K WAIT   0    0:39   0.00% swi4: clock s
> >> 14367 root       1  44    0  5176K  3104K select 2    0:01   0.00% dhcpd
> >>    22 root       1 -16    -     0K    16K -      3    0:01   0.00% yarrow
> >>    25 root       1 -24    -     0K    16K WAIT   0    0:00   0.00% swi6: Giant t
> >> 11658 root       1  44    0 32936K  4540K select 1    0:00   0.00% sshd
> >> 14224 root       1  44    0 32936K  4540K select 5    0:00   0.00% sshd
> >>    41 root       1 -60    -     0K    16K WAIT   0    0:00   0.00% irq1: atkbd0
> >>     4 root       1  -8    -     0K    16K -      2    0:00   0.00% g_down
> >>
> >> The bce0 interface interrupt (irq256) is getting stressed out: it is
> >> already taking 100% of CPU7, while CPU0 is at around 51.17% for bce1.
> >> Any more recommendations? Is there anything more we can do to optimize
> >> with MSI?
> >
> > Well, on 7.x you can try turning net.isr.direct off (sysctl). However, it
> > seems you are hammering your bce0 interface. You might want to try using
> > polling on bce0 and seeing if it keeps up with the traffic better.
> >
> > --
> > John Baldwin
> >
>
> With net.isr.direct=0, CPU utilization on each interface's interrupt thread
> (bce0 and bce1) goes down on my IBM system, but swi1: net increases its
> utilization. Can you explain what is happening here? What does
> net.isr.direct do that decreases CPU utilization on the interface
> interrupts? I would really like to understand what happens internally as
> packets are received by the interfaces and processed, from the device
> interrupt up to the software interrupt level, because I am confused about
> enabling/disabling net.isr.direct in sysctl. Is there a tool we can use to
> trace this path, so we can see which part of the kernel is the bottleneck,
> especially when net.isr.direct=1? By the way, with device polling enabled
> the system experienced packet errors and the interface throughput was
> worse, so I avoid using it.
>
>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C    TIME   WCPU COMMAND
>    16 root       1 171 ki31     0K    16K CPU10  a   86:06 89.06% idle: cpu10
>    27 root       1 -44    -     0K    16K CPU1   1   34:37 82.67% swi1: net
>    52 root       1 -68    -     0K    16K WAIT   b   51:59 59.77% irq32: bce1
>    15 root       1 171 ki31     0K    16K RUN    b   69:28 43.16% idle: cpu11
>    25 root       1 171 ki31     0K    16K RUN    1  115:35 24.27% idle: cpu1
>    51 root       1 -68    -     0K    16K CPU10  a   35:21 13.48% irq31: bce0

With net.isr.direct=1, the ithread tries to pass the received packets up to
IP/UDP/TCP/socket directly. With net.isr.direct=0, the ithread places received
packets on a queue and sends a signal to 'swi1: net'. The swi thread wakes up,
pulls the packets off the queue, and sends them up to IP/UDP/TCP/socket.
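If you want to flip the knob from a program rather than with sysctl(8),
something like the sketch below should do. It is just an illustration, not
anything shipped with the base system; it only relies on sysctlbyname(3) and
the net.isr.direct OID discussed above (reading works as any user, setting it
requires root).

/*
 * Minimal sketch: read net.isr.direct and optionally change it via
 * sysctlbyname(3).  Equivalent to "sysctl net.isr.direct" and
 * "sysctl net.isr.direct=0|1"; setting the value requires root.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    int cur, newval;
    size_t len = sizeof(cur);

    if (sysctlbyname("net.isr.direct", &cur, &len, NULL, 0) == -1) {
        perror("sysctlbyname(net.isr.direct)");
        return (1);
    }
    printf("net.isr.direct is %d\n", cur);

    if (argc > 1) {
        /* 1 = direct dispatch in the ithread, 0 = queue to swi1: net. */
        newval = atoi(argv[1]);
        if (sysctlbyname("net.isr.direct", NULL, NULL, &newval,
            sizeof(newval)) == -1) {
            perror("sysctlbyname(net.isr.direct)");
            return (1);
        }
        printf("net.isr.direct set to %d\n", newval);
    }
    return (0);
}

Run it with no argument to see the current value, or with 0 or 1 to change it.

--
John Baldwin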