From owner-freebsd-performance@FreeBSD.ORG Thu Dec 14 22:40:09 2006
From: Alan Amesbury <amesbury@umn.edu>
To: freebsd-performance@freebsd.org
Date: Thu, 14 Dec 2006 16:34:45 -0600
Message-ID: <4581D185.7020702@umn.edu>
Subject: Polling tuning and performance

This is a long one, but mainly because I've tried to include notes
about what I've already looked at.  Thanks in advance for taking the
time to read this.

I have a FreeBSD 6.1-RELEASE/amd64 system which routinely needs to
accept traffic at fairly high speeds.
The system is accepting traffic at fairly high rates; 'systat -if'
suggests 428551GB (not a typo, but possibly a display bug in 'systat')
over the past 63 days, or an average rate of a bit over 600Mb/sec.
'time tcpdump ...' tends to back up this assertion:

    amesbury@host % sudo time tcpdump -i bge1 -n -w /dev/null -c 1000000
    tcpdump: WARNING: bge1: no IPv4 address assigned
    tcpdump: listening on bge1, link-type EN10MB (Ethernet), capture size 96 bytes
    1000000 packets captured
    1000395 packets received by filter
    167 packets dropped by kernel
    0.268u 0.153s 0:06.84 5.9% 901+3236k 0+0io 0pf+0w

What I'm aiming for, of course, is zero packet loss.  Realizing that's
probably impossible for this system given its load, I'm trying to do
what I can to minimize loss.

The system is running a somewhat leaner kernel than GENERIC.  Notable
changes include:

* PREEMPTION disabled - /sys/conf/NOTES says this helps with
  interactivity.  I don't care about interactive performance on this
  host.

* COMPAT_FREEBSD4, COMPAT_LINUX32, and COMPAT_43 are removed.  They
  appear to be unneeded.

* SMP is enabled, as this is a dual-core box (not HTT!).

* Many devices are removed, e.g., ncr(4), sym(4), adv(4), and other
  unnecessary block devices; anything relating to cardbus; de(4),
  bce(4), ti(4), wb(4), ed(4), ex(4), lnc(4), and a number of other
  network devices that aren't ever going to be used; etc.

* All wlan(4) and related drivers are gone.

* pf(4), pflog(4), and some of the ALTQ stuff have been added in, but
  are not actively used on this host (at the moment).

* ZERO_COPY_SOCKETS, MAC_BSDEXTENDED, MAC_PARTITION, and MAC are
  enabled.

* Most importantly, HZ=1000, and DEVICE_POLLING and AUTO_EOI_1 are
  included.  (AUTO_EOI_1 was added because /sys/amd64/conf/NOTES says
  this can save a few microseconds on some interrupts.  I'm not
  worried about suspend/resume, but definitely want speed, so it got
  added.)
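As a sanity check on the figures above, the 63-day byte counter and
the tcpdump summary reduce to a rate and a loss fraction with a little
awk arithmetic (assuming systat's "428551GB" means 10^9-byte
gigabytes; the numbers are the ones quoted above, not new
measurements):

```shell
# Average bit rate implied by 428551 GB over 63 days (10^9-byte GB assumed).
awk 'BEGIN { printf "%.0f Mb/s\n", 428551e9 * 8 / (63 * 86400) / 1e6 }'
# -> 630 Mb/s

# Kernel drop rate from the tcpdump run: 167 dropped of 1000395 received.
awk 'BEGIN { printf "%.4f%%\n", 100 * 167 / 1000395 }'
# -> 0.0167%
```

So the counter, buggy-looking or not, is at least consistent with the
"a bit over 600Mb/sec" claim, and the measured loss in that capture
was under two hundredths of a percent.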
As mentioned above, this host is running FreeBSD/amd64, so there's no
need to remove support for I586_CPU, et al; that stuff was never there
in the first place.

Since kern.polling.enable is marked as deprecated in
/sys/kern/kern_poll.c, I'm enabling polling specifically for the
interface receiving the high-volume traffic.  (It is NOT enabled for
the other interface on this system, but traffic loads there are orders
of magnitude lower, so I didn't think it was necessary.)

As mentioned above, I've got HZ set to 1000.  Per
/sys/amd64/conf/NOTES, I'd considered setting it to 2000, but have
discovered previously that FreeBSD's RFC1323 support breaks.  I
documented this on -hackers last year:

http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html

Since I've not seen word on a correction for this being added to
FreeBSD, I've limited HZ to 1000.

After reading polling(4) a couple of times, I set
kern.polling.burst_max to 1000.  The manpage says that "each interface
can receive at most (HZ * burst_max) packets per second", and the
default setting is 150, which is described as "adequate for 100Mbit
network and HZ=1000."  I figured, "Hey, gigabit, how about ten times
the default?" but that's prevented by "#define MAX_POLL_BURST_MAX
1000" in /sys/kern/kern_poll.c.

In theory that might've been good enough, but polling(4) says that
kern.polling.burst is "[the] [m]aximum number of packets grabbed from
each network interface in each timer tick.  This number is dynamically
adjusted by the kernel, according to the programmed user_frac,
burst_max, CPU speed, and system load."  I keep seeing
kern.polling.burst hit a thousand, which leads me to believe that
kern.polling.burst_max needs to be higher.
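The polling(4) arithmetic above can be sketched directly: the
per-interface ceiling is HZ * burst_max packets per second, so the
minimum burst_max for a target packet rate is just pps / HZ (the
numbers below are the ones from this thread, used for illustration):

```shell
# Ceiling imposed by polling: HZ * burst_max packets per second.
awk 'BEGIN { hz = 1000
             print "burst_max=150  ->", hz * 150, "pps"
             print "burst_max=1000 ->", hz * 1000, "pps" }'
# -> burst_max=150  -> 150000 pps
# -> burst_max=1000 -> 1000000 pps

# Minimum burst_max needed to sustain a given input rate without drops.
awk 'BEGIN { pps = 250000; hz = 1000
             printf "need burst_max >= %d\n", (pps + hz - 1) / hz }'
# -> need burst_max >= 250
```

This is why the default of 150 is called adequate for 100Mbit: at
minimum-size Ethernet frames, 100Mbit tops out well under 150 kpps,
while gigabit can in principle approach 1.49 Mpps, beyond even
MAX_POLL_BURST_MAX at HZ=1000.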
For example:

    secs since
      epoch      kern.polling.burst
    ----------   ------------------
    1166133997         1000
    1166134006          550
    1166134015          877
    1166134024         1000
    1166134033         1000
    1166134042         1000
    1166134051         1000
    1166134060         1000
    1166134069         1000
    1166134078         1000

Unfortunately, raising it appears to be possible only through a)
patching /sys/kern/kern_poll.c to allow larger values; or b) setting
HZ to 2000, as indicated in one of the NOTES, which will effectively
hose certain TCP connectivity because of the RFC1323 breakage.  Looked
at another way, both essentially require changes to source code, the
former being fairly obvious, and the latter requiring fixes to the
RFC1323 support.  Either way, I think that's a bit beyond my
abilities; I have NO illusions about my kernel h4cking sk1llz.

Other possibly relevant data points:

* System load hovers right around 1.

* The system has almost zero disk activity.

* With polling off:

  - 'vmstat 5' consistently shows about 13K context switches and
    ~6800 interrupts
  - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286 for
    bge1, and near zero for everything else
  - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

* With polling on, kern.polling.burst_max=150:

  - kern.polling.burst holds at 150
  - 'vmstat 5' shows context switches holding around 2600, with
    interrupts holding around 30K
  - 'vmstat -i' shows a bge1 interrupt rate of 6286 (but the total
    doesn't increase!), other rates stay the same (looks like possible
    display bugs in 'vmstat -i' here!)
  - CPU load holds at 1, but CPU idle time usually stays >95%

* With polling on, kern.polling.burst_max=1000:

  - kern.polling.burst is frequently 1000 and almost always >850
  - 'vmstat 5' shows context switches unchanged, but interrupts are
    150K-190K
  - 'vmstat -i' is unchanged from burst_max=150
  - CPU load and CPU idle time are very similar to burst_max=150

So, with all that in mind..... any ideas for improvement?  Apologies
in advance for missing the obvious.  'dmesg' and kernel config are
attached.
-- 
Alan Amesbury
OIT Security and Assurance
University of Minnesota

[Attachment: SPECIALIZED (kernel configuration)]

machine		amd64
cpu		HAMMER
ident		SPECIALIZED

# To statically compile in device wiring instead of /boot/device.hints
#hints		"GENERIC.hints"		# Default places to look for devices.

makeoptions	DEBUG=-g		# Build kernel with gdb(1) debug symbols

#options 	SCHED_ULE		# ULE scheduler
options 	SCHED_4BSD		# 4BSD scheduler
#options 	PREEMPTION		# Enable kernel thread preemption
options 	INET			# InterNETworking
options 	INET6			# IPv6 communications protocols
options 	FFS			# Berkeley Fast Filesystem
options 	SOFTUPDATES		# Enable FFS soft updates support
options 	UFS_ACL			# Support for access control lists
options 	UFS_DIRHASH		# Improve performance on big directories
options 	MD_ROOT			# MD is a potential root device
options 	NFSCLIENT		# Network Filesystem Client
options 	NFSSERVER		# Network Filesystem Server
options 	NFS_ROOT		# NFS usable as /, requires NFSCLIENT
options 	MSDOSFS			# MSDOS Filesystem
options 	CD9660			# ISO 9660 Filesystem
options 	PROCFS			# Process filesystem (requires PSEUDOFS)
options 	PSEUDOFS		# Pseudo-filesystem framework
options 	GEOM_GPT		# GUID Partition Tables.
options 	COMPAT_IA32		# Compatible with i386 binaries
options 	COMPAT_FREEBSD5		# Compatible with FreeBSD5
options 	SCSI_DELAY=5000		# Delay (in ms) before probing SCSI
options 	KTRACE			# ktrace(1) support
options 	SYSVSHM			# SYSV-style shared memory
options 	SYSVMSG			# SYSV-style message queues
options 	SYSVSEM			# SYSV-style semaphores
options 	_KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options 	KBD_INSTALL_CDEV	# install a CDEV entry in /dev
options 	AHC_REG_PRETTY_PRINT	# Print register bitfields in debug
					# output.  Adds ~128k to driver.
options 	AHD_REG_PRETTY_PRINT	# Print register bitfields in debug
					# output.  Adds ~215k to driver.
options 	ADAPTIVE_GIANT		# Giant mutex is adaptive.
options 	SMP			# Symmetric MultiProcessor Kernel

# Workarounds for some known-to-be-broken chipsets (nVidia nForce3-Pro150)
device		atpic			# 8259A compatability

# Bus support.
device		acpi
device		isa
device		pci
device		mem
device		io

# Floppy drives
device		fdc

# ATA and ATAPI devices
device		ata
device		atadisk			# ATA disk drives
device		ataraid			# ATA RAID drives
device		atapicd			# ATAPI CDROM drives
device		atapifd			# ATAPI floppy drives
device		atapist			# ATAPI tape drives
options 	ATA_STATIC_ID		# Static device numbering

# SCSI Controllers
device		ahc			# AHA2940 and onboard AIC7xxx devices
device		ahd			# AHA39320/29320 and onboard AIC79xx devices
device		amd			# AMD 53C974 (Tekram DC-390(T))
device		isp			# Qlogic family
device		mpt			# LSI-Logic MPT-Fusion

# SCSI peripherals
device		scbus			# SCSI bus (required for SCSI)
device		ch			# SCSI media changers
device		da			# Direct Access (disks)
device		sa			# Sequential Access (tape etc)
device		cd			# CD
device		pass			# Passthrough device (direct SCSI access)
device		ses			# SCSI Environmental Services (and SAF-TE)

# RAID controllers interfaced to the SCSI subsystem
device		amr			# AMI MegaRAID
device		ciss			# Compaq Smart RAID 5*
device		dpt			# DPT Smartcache III, IV - See NOTES for options
device		hptmv			# Highpoint RocketRAID 182x
device		iir			# Intel Integrated RAID
device		ips			# IBM (Adaptec) ServeRAID
device		mly			# Mylex AcceleRAID/eXtremeRAID
device		twa			# 3ware 9000 series PATA/SATA RAID

# RAID controllers
device		aac			# Adaptec FSA RAID
device		aacp			# SCSI passthrough for aac (requires CAM)
device		ida			# Compaq Smart RAID
device		twe			# 3ware ATA RAID

# atkbdc0 controls both the keyboard and the PS/2 mouse
device		atkbdc			# AT keyboard controller
device		atkbd			# AT keyboard
device		psm			# PS/2 mouse

device		vga			# VGA video card driver
device		splash			# Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device		sc

device		agp			# support several AGP chipsets

# Serial (COM) ports
device		sio			# 8250, 16[45]50 based serial ports

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to the sio and/or ppc drivers):
#device		puc

# PCI Ethernet NICs.
device		em			# Intel PRO/1000 adapter Gigabit Ethernet Card
device		ixgb			# Intel PRO/10GbE Ethernet Card
device		txp			# 3Com 3cR990 (``Typhoon'')
device		vx			# 3Com 3c590, 3c595 (``Vortex'')

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device		miibus			# MII bus support
device		bfe			# Broadcom BCM440x 10/100 Ethernet
device		bge			# Broadcom BCM570xx Gigabit Ethernet
device		dc			# DEC/Intel 21143 and various workalikes
device		fxp			# Intel EtherExpress PRO/100B (82557, 82558)
device		lge			# Level 1 LXT1001 gigabit Ethernet
device		nge			# NatSemi DP83820 gigabit Ethernet
device		re			# RealTek 8139C+/8169/8169S/8110S
device		rl			# RealTek 8129/8139
device		sis			# Silicon Integrated Systems SiS 900/SiS 7016
device		sk			# SysKonnect SK-984x & SK-982x gigabit Ethernet
device		tx			# SMC EtherPower II (83c170 ``EPIC'')
device		xl			# 3Com 3c90x (``Boomerang'', ``Cyclone'')

# Pseudo devices.
device		loop			# Network loopback
device		random			# Entropy device
device		ether			# Ethernet support
device		tun			# Packet tunnel.
device		pty			# Pseudo-ttys (telnet etc)
device		md			# Memory "disks"
device		gif			# IPv6 and IPv4 tunneling
device		faith			# IPv6-to-IPv4 relaying (translation)

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device		bpf			# Berkeley packet filter

# USB support
device		uhci			# UHCI PCI->USB interface
device		ohci			# OHCI PCI->USB interface
device		ehci			# EHCI PCI->USB interface (USB 2.0)
device		usb			# USB Bus (required)
#device		udbp			# USB Double Bulk Pipe devices
device		ugen			# Generic
device		uhid			# "Human Interface Devices"
device		ukbd			# Keyboard
device		ulpt			# Printer
device		umass			# Disks/Mass storage - Requires scbus and da
device		ums			# Mouse

# FireWire support
device		firewire		# FireWire bus code
device		sbp			# SCSI over FireWire (Requires scbus and da)
device		fwe			# Ethernet over FireWire (non-standard!)

options 	ALTQ
options 	ALTQ_CBQ
options 	ALTQ_HFSC
options 	ALTQ_PRIQ
options 	ALTQ_NOPCC
device		pf
device		pflog
options 	BRIDGE
options 	ZERO_COPY_SOCKETS
options 	MAC
options 	MAC_BSDEXTENDED
options 	MAC_PARTITION
options 	HZ=1000
options 	SC_HISTORY_SIZE=1000
options 	SC_KERNEL_CONS_ATTR=(FG_YELLOW|BG_BLACK)
options 	SC_KERNEL_CONS_REV_ATTR=(FG_BLACK|BG_RED)
options 	DEVICE_POLLING
options 	AUTO_EOI_1
options 	INCLUDE_CONFIG_FILE

[Attachment: specialized_dmesg.boot]

Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California.  All rights reserved.
FreeBSD 6.1-RELEASE-p10 #1: Thu Oct 12 14:14:54 CDT 2006
    root@specialized:/usr/obj/usr/src/sys/SPECIALIZED
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) D CPU 2.80GHz (2800.11-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0xf44  Stepping = 4
  Features=0xbfebfbff
  Features2=0x641d>
  AMD Features=0x20100800
  Cores per package: 2
real memory  = 4563402752 (4352 MB)
avail memory = 4140404736 (3948 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
Security policy loaded: TrustedBSD MAC/BSD Extended (mac_bsdextended)
Security policy loaded: TrustedBSD MAC/Partition (mac_partition)
ioapic0: Changing APIC ID to 2
ioapic1: Changing APIC ID to 3
ioapic1: WARNING: intbase 32 != expected base 24
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 32-55 on motherboard
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: on acpi0
cpu1: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
pcib2: at device 28.0 on pci0
pci2: on pcib2
pcib3: at device 0.0 on pci2
pci3: on pcib3
pcib4: at device 28.4 on pci0
pci4: on pcib4
bge0: mem 0xfe8f0000-0xfe8fffff irq 16 at device 0.0 on pci4
miibus0: on bge0
brgphy0: on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:15:c5:60:1b:dc
pcib5: at device 28.5 on pci0
pci5: on pcib5
bge1: mem 0xfe6f0000-0xfe6fffff irq 17 at device 0.0 on pci5
miibus1: on bge1
brgphy1: on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge1: Ethernet address: 00:15:c5:60:1b:dd
uhci0: port 0xbce0-0xbcff irq 20 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: port 0xbcc0-0xbcdf irq 21 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: port 0xbca0-0xbcbf irq 22 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0: mem 0xfeb00400-0xfeb007ff irq 20 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
usb3: EHCI version 1.0
usb3: wrong number of companions (7 != 3)
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: on ehci0
usb3: USB revision 2.0
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
pcib6: at device 30.0 on pci0
pci6: on pcib6
pci6: at device 5.0 (no driver attached)
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
ata0: on atapci0
ata1: on atapci0
atapci1: port 0xbc98-0xbc9f,0xbc90-0xbc93,0xbc80-0xbc87,0xbc78-0xbc7b,0xbc60-0xbc6f mem 0xfeb00000-0xfeb003ff irq 20 at device 31.2 on pci0
ata2: on atapci1
ata3: on atapci1
pci0: at device 31.3 (no driver attached)
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
orm0: at iomem 0xc0000-0xc7fff,0xec000-0xeffff on isa0
atkbdc0: at port 0x60,0x64 on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
acd0: CDRW at ata0-master UDMA33
ad4: 152587MB at ata2-master SATA150
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad4s1a
bge0: link state changed to UP
bge1: link state changed to UP

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 15 14:18:04 2006
From: Bruce Evans <bde@zeta.org.au>
To: Alan Amesbury <amesbury@umn.edu>
Cc: freebsd-performance@freebsd.org
Date: Sat, 16 Dec 2006 01:17:56 +1100 (EST)
Message-ID: <20061215232203.C3994@besplex.bde.org>
In-Reply-To: <4581D185.7020702@umn.edu>
Subject: Re: Polling tuning and performance

On Thu, 14 Dec 2006, Alan Amesbury wrote:

> ...
> What I'm aiming for, of course, is zero packet loss.  Realizing that's
> probably impossible for this system given its load, I'm trying to do
> what I can to minimize loss.
> ...
> * PREEMPTION disabled - /sys/conf/NOTES says this helps with
>   interactivity.  I don't care about interactive performance
>   on this host.

It's needed to prevent packet loss without polling.  It probably makes
little difference with polling (if the machine is mostly handling
network traffic, and that only by polling).

> * Most importantly, HZ=1000, and DEVICE_POLLING and
>   AUTO_EOI_1 are included.  (AUTO_EOI_1 was added because
>   /sys/amd64/conf/NOTES says this can save a few microseconds
>   on some interrupts.  I'm not worried about suspend/resume, but
>   definitely want speed, so it got added.

I don't believe in POLLING or HZ=1000, but recently tested them with
bge.  I am unhappy to report that my fine-tuned interrupt handling
still loses to polling by a few percent for efficiency.  I am happy to
report that polling loses to interrupt handling by a lot for
correctness -- polling gives packet loss.  Polling also loses big for
latency, except with idle_poll and the system actually idle, when it
wins a little.

AUTO_EOI_1 has little effect unless the system gets lots of
interrupts, so with most interrupts avoided by using polling it has
little effect.

> As mentioned above, this host is running FreeBSD/amd64, so there's no
> need to remove support for I586_CPU, et al; that stuff was never there
> in the first place.

AUTO_EOI_1 is also only used in non-apic mode, but non-apic mode is
very unusual for amd64, so AUTO_EOI_1 probably has no effect for you.

> As mentioned above, I've got HZ set to 1000.  Per /sys/amd64/conf/NOTES,
> I'd considered setting it to 2000, but have discovered previously that
> FreeBSD's RFC1323 support breaks.  I documented this on -hackers last year:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html

I think there are old PRs about this.  Even 1000 is too large (?).

> Since I've not seen word on a correction for this being added to
> FreeBSD, I've limited HZ to 1000.

HZ = 100 gives interesting behaviour.
Of course, it doesn't work, since polling depends on polling often
enough.  Any particular value of HZ can only give polling often enough
for a very limited range of systems.  1000 is apparently good for
100Mbps and not too bad for 1Gbps, provided the hardware has enough
buffering -- but with enough buffering, polling is not really needed.

> After reading polling(4) a couple times, I set kern.polling.burst_max to
> 1000.  The manpage says that "each interface can receive at most (HZ *
> burst_max) packets per second", and the default setting is 150, which is
> described as "adequate for 100Mbit network and HZ=1000."  I figured,
> "Hey, gigabit, how about ten times the default?" but that's prevented by
> "#define MAX_POLL_BURST_MAX 1000" in /sys/kern/kern_poll.c.

I can (easily) generate only 250 kpps on input and had to increase
kern.polling.burst_max to > 250 to avoid huge packet lossage at this
rate.  It doesn't seem to work right for output, since I can (easily)
generate 340 kpps output, and got that with a burst max of only 150,
which should have allowed only 150 kpps.  Output is faster at the
lowest level (but slower at higher levels), so doing larger bursts of
output might be intentional.  However, output at 340 kpps gives a
system load of 100% on the test machine (which is not very fast or
SMP) no matter how it is done (polling just makes it go 2% faster), so
polling is not doing its main job very well.  Polling's main job is to
prevent network activity from using 100% of the CPU.  Large values of
kern.polling.burst_max are fundamentally incompatible with polling
doing this.  On my test system, a burst max of 1000 combined with HZ =
1000 would just ask the driver alone to use 100% of the CPU doing 1000
kpps through a single device.  "Fortunately", the device can't go that
fast, so plenty of CPU is left.

> In theory that might've been good enough, but polling(4) says that
> kern.polling.burst is "[the] [m]aximum number of packets grabbed from
> each network interface in each timer tick.
> This number is dynamically
> adjusted by the kernel, according to the programmed user_frac,
> burst_max, CPU speed, and system load."  I keep seeing
> kern.polling.burst hit a thousand, which leads me to believe that
> kern.polling.burst_max needs to be higher.
>
> For example:
>
> secs since
>   epoch      kern.polling.burst
> ----------   ------------------
> 1166133997         1000
> ...

Is it really dynamic?  I see 1000's too, but for sending at only 340
kpps.  Almost all bursts should have size 340.  With a max of 150,
burst is 150 too, but 340 kpps are still sent.

> Unfortunately, that appears to be only possible through a) patching
> /sys/kern/kern_poll.c to allow larger values; or b) setting HZ to 2000,
> as indicated in one of the NOTES, which will effectively hose certain
> TCP connectivity because of the RFC1323 breakage.  Looked at another
> way, both essentially require changes to source code, the former being
> fairly obvious, and the latter requiring fixes to the RFC1323 support.
> Either way, I think that's a bit beyond my abilities; I have NO
> illusions about my kernel h4cking sk1llz.

There may be a fix in an old PR.

> Other possibly relevant data points:
>
> * System load hovers right around 1.

Polling in idle eats all the CPU.  Polling in idle is very wasteful
(mainly of power) unless the system can rarely be idle anyway -- but
then polling in idle doesn't help much.

> * The system has almost zero disk activity.
>
> * With polling off:
>
>   - 'vmstat 5' consistently shows about 13K context switches
>     and ~6800 interrupts
>   - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286
>     for bge1, and near zero for everything else
>   - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

These are only small interrupt loads.  bge always generates about 6667
interrupts per second (under all loads except none or tiny) because it
is programmed to use interrupt moderation with a timeout of 150uS, and
some finer details.
This gives behaviour very similar to polling at a frequency of 6667
Hz.  The main differences between this and polling at 1000 Hz are:

- 6667 Hz works better for correctness (lower latency, fewer dropped
  packets for missed polls)
- 6667 Hz has higher overheads (only a few percent)
- interrupts have lower overheads if nothing is happening, so you
  don't actually get them at 6667 Hz
- the polling given by interrupt moderation is dumb.  It doesn't have
  any of the burst max controls, etc. (but could easily).  It doesn't
  interact with other devices (but could uneasily).

bge can easily be reprogrammed to use interrupt moderation with a
timeout of 1000uS, so that interrupt mode works more like polling at
1000 Hz.  This immediately gives the main disadvantage of polling
(latency of 1000uS, unless polling in idle and the system is actually
idle at least once every 1000uS).  bge has internal (buffering) limits
which have similar effects to the burst limit.  The advantages of
polling are not easily gained in this way (especially for rx).

> * With polling on, kern.polling.burst_max=150:
>
>   - kern.polling.burst holds at 150
>   - 'vmstat 5' shows context switches hold around 2600, with
>     interrupts holding around 30K

I think you mean `systat -vmstat 5'.  The interrupt count here is
bogus.  It is mostly for software interrupts that mostly don't do much
because they coalesce with old ones.  Only the ones that cause context
switches are relevant, and there is no counter for those.  Most of the
context switches are to the poll routine (1000 there and 1000 back).

>   - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>     doesn't increase!), other rates stay the same (looks like
>     possible display bugs in 'vmstat -i' here!)

Probably just averaging.

>   - CPU load holds at 1, but CPU idle time usually stays >95%

I saw heavy polling reduce the idle time significantly here.  I think
the CPU idle time can be very biased here under light loads.  The
times shown by top(1) are unbiased.
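The interrupt-rate figures quoted in this thread follow directly from
the moderation timeout: a device that coalesces interrupts for N
microseconds fires at most 10^6/N times per second.  A quick check of
the two timeouts discussed (illustrative arithmetic only; the observed
6286/sec for bge1 is in the same ballpark as the theoretical rate):

```shell
# Interrupt rate implied by bge's moderation timeout: 1e6 us / timeout_us.
awk 'BEGIN { printf "150 us  -> %d intr/s\n", 1e6 / 150
             printf "1000 us -> %d intr/s\n", 1e6 / 1000 }'
# -> 150 us  -> 6666 intr/s
# -> 1000 us -> 1000 intr/s
```

A 1000uS timeout thus gives the same cadence as HZ=1000 polling, which
is exactly why reprogramming the moderation timer makes interrupt mode
behave like polling.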
> * With polling on, kern.polling.burst_max=1000:
>
>   - kern.polling.burst is frequently 1000 and almost always >850
>   - 'vmstat 5' shows context switches unchanged, but interrupts
>     are 150K-190K
>   - 'vmstat -i' unchanged from burst_max=150
>   - CPU load and CPU idle time very similar to burst_max=150
>
> So, with all that in mind..... Any ideas for improvement?  Apologies in
> advance for missing the obvious.  'dmesg' and kernel config are attached.

Sorry, no ideas about tuning polling parameters (I don't know them
well since I don't believe in polling :-).  You apparently have
everything tuned almost as well as possible, and the only
possibilities for future improvements are avoiding the 5% (?) extra
overhead for !polling and the packet loss for polling.  I see the
following packet loss for polling with HZ=1000, burst_max=300,
idle_poll=1:

%%%
            input    (bge0)          output
   packets  errs      bytes  packets  errs      bytes colls
    242999     1   14579940        0     0          0     0
    235496     0   14129760        0     0          0     0
    236930  3261   14215800        0     0          0     0
    237816  3400   14268960        0     0          0     0
    240418  3211   14425080        0     0          0     0
%%%

The packet losses of 3+K always occur when I hit Caps Lock.  This also
happens without polling unless PREEMPTION is configured.  It is caused
by low-quality code for setting the LED for Caps Lock, combined with
thread priorities and/or their scheduling not working right.  In the
interrupt-driven case, the thread priorities are correct (bgeintr >
syscons) and configuring PREEMPTION fixes the scheduling.  In the
polling case, the thread priorities are apparently incorrect.  Polling
probably needs to have its own thread running at the same priority as
bgeintr (> syscons), but I think it mainly uses the network SWI thread
(< syscons).  With idle_poll=1, it also uses its idlepoll thread, but
that has very low priority, so it cannot help in cases like this.  The
code for setting LEDs busy-waits for several mS, which is several
polling periods.  It must be about 13mS to lose 3200 packets when
packets are arriving at 240 kpps.
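The busy-wait estimate above can be back-figured the same way: losing
~3200 packets at 240 kpps implies the LED code blocked input for
roughly 13 ms, i.e. more than a dozen polling periods at HZ=1000
(numbers taken from the loss table above):

```shell
# Blocked time implied by the loss burst: packets_lost / packets_per_second.
awk 'BEGIN { printf "%.1f ms blocked (%.0f polling periods at HZ=1000)\n",
             3200 / 240000 * 1000, 3200 / 240000 * 1000 }'
# -> 13.3 ms blocked (13 polling periods at HZ=1000)
```

Any interrupt handler or driver that busy-waits for a few milliseconds
will cost a similar-sized burst of packets at these rates.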
With a network server you won't be hitting Caps Lock a lot, but you do
have to worry about other low-quality interrupt handlers busy-waiting
for several mS.

The loss of a single packet in the above happens more often than I can
explain:
- with polling, it happens a lot
- without polling but with PREEMPTION, it happens a lot when I press
  Caps Lock, but not otherwise.

The problem might not be packet loss.  bge has separate statistics for
packet loss, but the net layer counts all input errors together.

Bruce

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 16 01:39:02 2006
From: Alan Amesbury <amesbury@umn.edu>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-performance@freebsd.org
Date: Fri, 15 Dec 2006 19:38:53 -0600
Message-ID: <45834E2D.7010901@umn.edu>
In-Reply-To: <20061215232203.C3994@besplex.bde.org>
Subject: Re: Polling tuning and performance
Bruce, thanks for taking time to read and reply.  For brevity, I've
removed my own earlier writings, (usually) annotating what's missing.

Bruce Evans wrote:

[snip - PREEMPTION stuff]

> It's needed to prevent packet loss without polling.  It probably
> makes little difference with polling (if the machine is mostly
> handling network traffic and that only by polling).

I should've noted in my original posting that 'vmstat' also reports
very little activity in the various paging columns; faults, pages
in/out, reclaims, freed, and pages scanned usually sit very close to
or at zero.  Disk operations as reported by 'vmstat' also sit almost
completely at zero.  The (extremely busy) interface is exclusively
incoming traffic, received promiscuously.  Since that's provided
enough clues as to what this box might actually be doing, I'll give
away the secret: It's running snort.  :-)

> I don't believe in POLLING or HZ=1000, but recently tested them with
> bge.  I am unhappy to report that my fine-tuned interrupt handling
> still loses to polling by a few percent for efficiency.  I am happy
> to report that polling loses to interrupt handling by a lot for
> correctness -- polling gives packet loss.  Polling also loses big for
> latency with idle_poll and the system actually idle, when it wins a
> little.

How are you benchmarking this?

> AUTO_EOI_1 has little effect unless the system gets lots of
> interrupts, so with most interrupts avoided by using polling it has
> little effect.
>
>> As mentioned above, this host is running FreeBSD/amd64, so there's
>> no need to remove support for I586_CPU, et al; that stuff was never
>> there in the first place.
>
> AUTO_EOI_1 is also only used in non-apic mode, but non-apic mode is
> very unusual for amd64 so AUTO_EOI_1 probably has no effect for you.

Good to know.  "No effect" is still acceptable.  I just didn't want to
cause "negative effect."
:-)

[snip - Broken FreeBSD RFC1323/PAWS support at high HZ]

> I think there are old PRs about this.  Even 1000 is too large (?).

We noticed it when 'scrub all tcp reassemble' in FreeBSD 6.x's PF
started tossing packets.  The problem (mostly?) went away when we
dropped from HZ=2000 to HZ=1000, so we considered that a marginally
acceptable work-around for this FreeBSD bug.  However, since we a)
have gigabit-connected PF firewalls; b) want to consider following the
advice in NOTES about HZ=2000 for busy firewalls; and c) really prefer
to run off stock FreeBSD source unless absolutely impossible, we're
sort of interested in seeing a fix for RFC1323 get officially applied
to FreeBSD.  About a year ago I pointed out that a patch had been
submitted.  A commit-bit responder acknowledged it, but said he wanted
to do it differently.  Since I'm not really in a position to pay and
don't have a more acceptable patch of my own to submit, I've not
really squawked about it.

>> Since I've not seen word on a correction for this being added to
>> FreeBSD, I've limited HZ to 1000.
>
> HZ = 100 gives interesting behaviour.  Of course, it doesn't work,
> since polling depends on polling often enough.  Any particular value
> of HZ can only give polling often enough for a very limited range of
> systems.  1000 is apparently good for 100Mbps and not too bad for
> 1Gbps, provided the hardware has enough buffering, but with enough
> buffering polling is not really needed.

Well, I'm not exactly tied to polling.  I just tried it as an
alternative and, for at least part of the time, it's performed better
than non-polling.  I'm open to alternatives; I just want as close to
zero loss as possible.

[snip - "I've read polling(4) and it says..."]

> I can (easily) generate only 250 kpps on input and had to increase
> kern.polling.burst_max to > 250 to avoid huge packet lossage at this
> rate.
> It doesn't seem to work right for output, since I can (easily)
> generate 340 kpps output and got that with a burst max of only 150,
> though I should have got only 150 kpps.  Output is faster at the
> lowest level (but slower at higher levels), so doing larger bursts of
> output might be intentional.  However, output at 340 kpps gives a
> system load of 100% on the test machine (which is not very fast or
> SMP), no matter how it is done (polling just makes it go 2% faster),
> so polling is not doing its main job very well.  Polling's main job
> is to prevent network activity from using 100% CPU.  Large values of
> kern.polling.burst_max are fundamentally incompatible with polling
> doing this.  On my test system, a burst max of 1000 combined with HZ
> = 1000 would just ask the driver alone to use 100% of the CPU doing
> 1000 kpps through a single device.  "Fortunately", the device can't
> go that fast, so plenty of CPU is left.

That's for sending, right?  In this case that's not an issue.  I
simply have incoming traffic with MTUs of up to 9216 bytes that I want
to *receive*.  Never mind the fact that bge(4) and the underlying
hardware sucks in that it can't do that (although there's apparently a
WinDOS driver that can do it on the same hardware?!).  Again, my focus
is on sucking in packets as fast as possible with minimal loss.

[snip - watching kern.polling.burst values]

> Is it really dynamic?  I see 1000's too, but for sending at only 340
> kpps.  Almost all bursts should have size 340.  With a max of 150,
> burst is 150 too but 340 kpps are still sent.

I haven't tested sending.  kern.polling.burst tends to hang at
whatever kern.polling.burst_max is set to.

[snip - writing kernel patches exceeds my expertise]

> There may be a fix in an old PR.  I'll look again.

[snip - load hovers at 1]

> Polling in idle eats all the CPU.  Polling in idle is very wasteful
> (mainly of power) unless the system can rarely be idle anyway, but
> then polling in idle doesn't help much.
This system is expected to NEVER be idle... except if it loses power.
:-)

[snip - other system stats]

> These are only small interrupt loads.  bge always generates about
> 6667 interrupts per second (under all loads except none or tiny)
> because it is programmed to use interrupt moderation with a timeout
> of 150uS and some finer details.  This gives behaviour very similar
> to polling at a frequency of 6667 Hz.  The main differences between
> this and polling at 1000 Hz are:
> - 6667 Hz works better for correctness (lower latency, fewer dropped
>   packets for missed polls)
> - 6667 Hz has higher overheads (only a few percent)
> - interrupts have lower overheads if nothing is happening, so you
>   don't actually get them at 6667 Hz
> - the polling given by interrupt moderation is dumb.  It doesn't have
>   any of the burst max controls, etc. (but could easily).  It doesn't
>   interact with other devices (but could uneasily).
>
> bge can easily be reprogrammed to use interrupt moderation with a
> timeout of 1000uS, so interrupt mode works more like polling at
> 1000Hz.  This immediately gives the main disadvantage of polling
> (latency of 1000uS unless polling in idle and the system is actually
> idle at least once every 1000uS).  bge has internal (buffering)
> limits which have similar effects to the burst limit.  The advantages
> of polling are not easily gained in this way (especially for rx).

If I understand you correctly, it sounds like I'd be better off
without polling, particularly if there are *any* buffer limitations in
the Broadcom hardware.  Again, it's not idle; the lowest recorded
packet receive rate I've seen lately is around 40Kpkt/sec.  The lowest
recorded rate was around 16Kpkt/sec.

>> * With polling on, kern.polling.burst_max=150:
>>
>>       - kern.polling.burst holds at 150
>>       - 'vmstat 5' shows context switches hold around 2600, with
>>         interrupts holding around 30K
>
> I think you mean `systat -vmstat 5'.  The interrupt count here is
> bogus.

No, I mean 'vmstat 5'.
I just let it dump a line every five seconds and watch what happens.
Context switches and interrupts are both shown.  The 'systat' version,
in this case, is harder for me to read; it also lacks the scrolling
history of 'vmstat'.  Sample output taken while writing this (note
that the first line is almost always bogus, and sorry if wrap is
borked):

% vmstat 5
 procs      memory      page                    disk     faults       cpu
 r b w     avm    fre   flt  re  pi  po  fr  sr ad4     in   sy   cs us sy id
 2 0 0 1898784 1256124   13   0   0   0  12   0   0    647  291  552  8 15 78
 1 0 0 1898784 1256124    1   0   0   0   0   0   0 183135   97 2432  9  4 87
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 183370  116 2423 11  5 84
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 183455  100 2454  8  5 87
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 170236  105 2437  8  4 88
 0 1 0 1898784 1256124    0   0   0   0   0   0   0 183183  108 2469 10  5 84
^C

Settings:

* Polling enabled on the high traffic interface
* kern.polling.user_frac=20
* kern.polling.burst_max=1000

> It is mostly for software interrupts that mostly don't do much
> because they coalesce with old ones.  Only ones that cause context
> switches are relevant, and there is no counter for those.  Most of
> the context switches are to the poll routine (1000 there and 1000
> back).
>
>> - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>>   doesn't increase!), other rates stay the same (looks like
>>   possible display bugs in 'vmstat -i' here!)
>
> Probably just averaging.

See, I'm not sure about that.  I thought that the whole point of
polling was to avoid interrupts.  Since the total count doesn't
increase for bge1 in 'vmstat -i' output, I interpreted it as a bug.

>> - CPU load holds at 1, but CPU idle time usually stays >95%
>
> I saw heavy polling reduce the idle time significantly here.  I
> think the CPU idle time can be very biased here under light loads.
> The times shown by top(1) are unbiased.

As mentioned before, though, this system is expected to NEVER be idle,
so a fast polling loop shouldn't be a liability.
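For reference, settings like the ones listed above would have been
applied with roughly the following commands.  This is a sketch for
FreeBSD 6.x: it assumes a kernel built with DEVICE_POLLING and
HZ=1000, and bge1 is this particular host's interface name.

```shell
# Sketch: enabling and tuning polling on FreeBSD 6.x (assumes
# 'options DEVICE_POLLING' in the kernel config; bge1 is this host's
# busy interface).
ifconfig bge1 polling                # per-interface polling switch in 6.x
sysctl kern.polling.user_frac=20     # reserve ~20% of each tick for userland
sysctl kern.polling.burst_max=1000   # cap on packets handled per poll
sysctl kern.polling.idle_poll=1      # also poll from the idle loop
sysctl kern.polling.burst            # read back the adaptive burst size
```

These are configuration commands for a 6.x machine, so treat the exact
spelling as approximate; polling(4) on the target release is the
authority.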
[snip - more stats; "room for improvement?"]

> Sorry, no ideas about tuning polling parameters (I don't know them
> well since I don't believe in polling :-).  You apparently have
> everything tuned almost as well as possible, and the only
> possibilities for future improvements are avoiding the 5% (?) extra
> overhead for !polling and the packet loss for polling.
>
> I see the following packet loss for polling with HZ=1000,
> burst_max=300, idle_poll=1:
>
> %%%
>             input    (bge0)           output
>    packets  errs      bytes    packets  errs      bytes colls
>     242999     1   14579940          0     0          0     0
>     235496     0   14129760          0     0          0     0
>     236930  3261   14215800          0     0          0     0
>     237816  3400   14268960          0     0          0     0
>     240418  3211   14425080          0     0          0     0
> %%%

Well, I guess I'm doing OK, then.  With the same settings as above:

amesbury@scoop % netstat -I bge1 -w 5
            input    (bge1)           output
   packets  errs      bytes    packets  errs      bytes colls
    614710     0  513122698          0     0          0     0
    662633     0  556662669          0     0          0     0
    639052     0  530704135          0     0          0     0
    706713     0  576938553          0     0          0     0
    690495     0  554269218          0     0          0     0
    682868     0  560234712          0     0          0     0
    692268     0  562487939          0     0          0     0
    680498     0  549782169          0     0          0     0
^C

Then again, it's after 1830 on a Friday afternoon, so traffic loads
have dropped a bit, so it's quite possible I'm not seeing anything
dropped here because of this relatively lighter load.

> The packet losses of 3+K always occur when I hit Caps Lock.  This
> also happens without polling unless PREEMPTION is configured.  It is
> caused by low-quality code for setting the LED for Caps Lock,
> combined with thread priorities and/or their scheduling not working
> right.  In the interrupt-driven case, the thread priorities are
> correct (bgeintr > syscons) and configuring PREEMPTION fixes the
> scheduling.  In the polling case, the thread priorities are
> apparently incorrect.  Polling probably needs its own thread running
> at the same priority as bgeintr (> syscons), but I think it mainly
> uses the network SWI thread (< syscons).
> With idle_poll=1, it also uses its idlepoll thread, but that has very
> low priority so it cannot help in cases like this.  The code for
> setting LEDs busy-waits for several ms, which is several polling
> periods.  It must be about 13 ms to lose 3200 packets when packets
> are arriving at 240 kpps.
>
> With a network server you won't be hitting Caps Lock a lot, but you
> do have to worry about other low-quality interrupt handlers
> busy-waiting for several ms.
>
> The loss of a single packet in the above happens more often than I
> can explain:
> - with polling, it happens a lot
> - without polling but with PREEMPTION, it happens a lot when I press
>   Caps Lock but not otherwise.
> The problem might not be packet loss.  bge has separate statistics
> for packet loss, but the net layer counts all input errors together.

Fortunately this machine doesn't even have a keyboard attached, so
there'll be no Caps games on it.  :-)

In spite of the momentary 0% loss, do you think switching to an em(4),
sk(4), or other card might help?  The bge(4) interfaces are integrated
PCIe, and I think only PCI-X slots are available.

Again, thanks for the sanity checking and additional information.
--
Alan Amesbury
OIT Security and Assurance
University of Minnesota

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 16 07:11:31 2006
Date: Sat, 16 Dec 2006 18:11:26 +1100 (EST)
From: Bruce Evans
To: Alan Amesbury
Cc: freebsd-performance@freebsd.org
Subject: Re: Polling tuning and performance
Message-ID: <20061216171718.K2901@besplex.bde.org>
In-Reply-To: <45834E2D.7010901@umn.edu>

On Fri, 15 Dec 2006, Alan Amesbury wrote:

> Bruce Evans wrote:
> ...
> The (extremely busy) interface is exclusively incoming traffic,
> received promiscuously.  Since that's provided enough clues as to
> what this box might actually be doing, I'll give away the secret:
> It's running snort.
> :-)
>
>> I don't believe in POLLING or HZ=1000, but recently tested them with
>> bge.  ...
>
> How are you benchmarking this?

Just by blasting packets, usually with ttcp.

> ...
> Well, I'm not exactly tied to polling.  I just tried it as an
> alternative and, for at least part of the time, it's performed better
> than non-polling.  I'm open to alternatives; I just want as close to
> zero loss as possible.

Polling is not working acceptably for me at all.  I'm testing on the
same network and machine that are serving nfs/udp.  Apparently, with
polling there is an i/o error every few seconds even under light
loads, and of course errors are especially bad for nfs/udp (nfs seems
to recover but takes about 1 minute).

> ...
> [snip - "I've read polling(4) and it says..."]
>> I can (easily) generate only 250 kpps on input and had to increase
>> kern.polling.burst_max to > 250 to avoid huge packet lossage at
>> this rate.  It doesn't seem to work right for output, since I can
>> (easily) generate 340 kpps output and got that with a burst max of
>> only 150, though I should have got only 150 kpps.  Output is faster
>> at the lowest level (but slower at higher levels), so doing larger
>> bursts of output might be intentional.  However, output at 340 kpps
>> gives a system load of 100% on the test machine (which is not very
>> fast or SMP), no matter how it is done (polling just makes it go 2%
>> faster), so polling is not doing its main job very well.  Polling's
>> main job is to prevent network activity from using 100% CPU.  Large
>> values of kern.polling.burst_max are fundamentally incompatible
>> with polling doing this.  On my test system, a burst max of 1000
>> combined with HZ = 1000 would just ask the driver alone to use 100%
>> of the CPU doing 1000 kpps through a single device.  "Fortunately",
>> the device can't go that fast, so plenty of CPU is left.
>
> That's for sending, right?  In this case that's not an issue.
> I simply have incoming traffic with MTUs of up to 9216 bytes that I
> want to *receive*.  Never mind the fact that bge(4) and the
> underlying hardware sucks in that it can't do that (although there's
> apparently a WinDOS driver that can do it on the same hardware?!).
> Again, my focus is on sucking in packets as fast as possible with
> minimal loss.

Some bge hardware certainly supports jumbo frames.  Half of mine can,
and the other half is documented not to.

> ...
> If I understand you correctly, it sounds like I'd be better off
> without polling, particularly if there are *any* buffer limitations
> in the Broadcom hardware.  Again, it's not idle; the lowest recorded
> packet receive rate I've seen lately is around 40Kpkt/sec.  The
> lowest recorded rate was around 16Kpkt/sec.

No, you seem to have the fairly specialized but common application
where polling currently works better, except for the problem with
packet loss which we don't completely understand but seems to be
related to thread priorities.

>>> * With polling on, kern.polling.burst_max=150:
>>>
>>>       - kern.polling.burst holds at 150
>>>       - 'vmstat 5' shows context switches hold around 2600, with
>>>         interrupts holding around 30K
>>
>> I think you mean `systat -vmstat 5'.  The interrupt count here is
>> bogus.
>
> No, I mean 'vmstat 5'.  I just let it dump a line every five seconds
> and watch what happens.  Context switches and interrupts are both
> shown.  The 'systat' version, in this case, is harder for me to
> read; it also lacks the scrolling history of 'vmstat'.  Sample
> output taken while writing this (note that the first line is almost
> always bogus and sorry if wrap is borked):

Ah, I forgot that I fixed some interrupt counting only in -current, to
get a useful interrupt count in vmstat.  Software interrupts are still
put in the global interrupt count (but not in the software interrupt
count) in RELENG_6.
This makes them show up in vmstat output, and in many configurations
they dominate the global count, so this count becomes unrelated to the
actual interrupt load.  In -current they are counted as software
interrupts only.  systat -vmstat reports interrupt counts in finer
detail, so it is possible to determine various subcounts by adding or
subtracting the other counts.

>> ...
>>> - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>>>   doesn't increase!), other rates stay the same (looks like
>>>   possible display bugs in 'vmstat -i' here!)
>>
>> Probably just averaging.
>
> See, I'm not sure about that.  I thought that the whole point of
> polling was to avoid interrupts.  Since the total count doesn't
> increase for bge1 in 'vmstat -i' output, I interpreted it as a bug.

It's probably just the bogus software interrupt count.  Apparently,
polling generates 20-30 software interrupts per poll.  I don't know
why it generates so many, but the context switch count shows that most
of them don't generate a context switch, so most of them don't take
much time.

Both software interrupts and hardware interrupts are currently counted
when they are requested, not when they are delivered.  This is dubious
but works out OK for hardware interrupts only.  For hardware
interrupts, even requests have a large overhead, so requests that will
coalesce should be counted somewhere; but for software interrupts,
requests have a low overhead, so the only reason to count requests
that will coalesce is to find and fix callers that make them.  I think
that for hardware interrupts, requests that will coalesce are rare in
practice, since the first request blocks subsequent ones.
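By that accounting, the per-poll software-interrupt request rate on
Alan's host can be estimated from the vmstat sample earlier in the
thread.  This is only a sketch: the ~183K/sec figure is read off that
sample's "in" column, and HZ=1000 polls/sec is assumed.

```shell
# Estimate software-interrupt requests per poll.  In RELENG_6 the SWI
# requests land in vmstat's global interrupt column; ~183000/sec is
# the figure from the vmstat sample earlier in the thread, and
# HZ=1000 gives 1000 polls per second.
ints_per_sec=183000
polls_per_sec=1000
awk -v i="$ints_per_sec" -v p="$polls_per_sec" \
    'BEGIN { printf "~%d SWI requests per poll\n", i / p }'
# prints: ~183 SWI requests per poll
```

That is far above the 20-30 per poll Bruce sees on his test box, which
fits his point that the global count tracks request volume, not real
interrupt load.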
>> I see the following packet loss for polling with HZ=1000,
>> burst_max=300, idle_poll=1:
>>
>> %%%
>>             input    (bge0)           output
>>    packets  errs      bytes    packets  errs      bytes colls
>>     242999     1   14579940          0     0          0     0
>>     235496     0   14129760          0     0          0     0
>>     236930  3261   14215800          0     0          0     0
>>     237816  3400   14268960          0     0          0     0
>>     240418  3211   14425080          0     0          0     0
>> %%%
>
> Well, I guess I'm doing OK, then.  With the same settings as above:
>
> amesbury@scoop % netstat -I bge1 -w 5
>             input    (bge1)           output
>    packets  errs      bytes    packets  errs      bytes colls
>     614710     0  513122698          0     0          0     0
>     662633     0  556662669          0     0          0     0
>     639052     0  530704135          0     0          0     0
>     706713     0  576938553          0     0          0     0
>     690495     0  554269218          0     0          0     0
>     682868     0  560234712          0     0          0     0
>     692268     0  562487939          0     0          0     0
>     680498     0  549782169          0     0          0     0
> ^C

Yes, I used -w 1 so my pps is about twice as much as yours, but I also
use tiny packets so as to get that high rate on low-end hardware, and
that gives a bandwidth that is about 1/8 of yours.

> Then again, it's after 1830 on a Friday afternoon, so traffic loads
> have dropped a bit, so it's quite possible I'm not seeing anything
> dropped here because of this relatively lighter load.

Problems are certainly more likely with higher pps.  140 kpps is quite
small.  I can almost reach that with tiny packets on a 100Mbps
network.

> In spite of the momentary 0% loss, do you think switching to an
> em(4), sk(4), or other card might help?  The bge(4) interfaces are
> integrated PCIe, and I think only PCI-X slots are available.

I believe em is (only slightly?) better but haven't used it.  The bus
matters most unless the card is really stupid.

Bruce
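The ~140 kpps figure follows directly from the interval counts in the
quoted netstat sample; a quick sketch (the packet count and the
5-second interval are taken from that output):

```shell
# Recover packets/sec from a netstat -w interval count.  682868
# packets in one 5-second sample (from the netstat output quoted
# above) lands in the "140 kpps is quite small" ballpark.
pkts=682868
interval_s=5
awk -v n="$pkts" -v t="$interval_s" \
    'BEGIN { printf "~%.0f pps\n", n / t }'
# prints: ~136574 pps
```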