From owner-freebsd-performance@FreeBSD.ORG Thu Dec 14 22:40:09 2006
From: Alan Amesbury <amesbury@umn.edu>
To: freebsd-performance@freebsd.org
Date: Thu, 14 Dec 2006 16:34:45 -0600
Message-ID: <4581D185.7020702@umn.edu>
Subject: Polling tuning and performance

This is a long one, but mainly because I've tried to include notes
about what I've already looked at.  Thanks in advance for taking the
time to read this.

I have a FreeBSD 6.1-RELEASE/amd64 system which routinely needs to
accept traffic at fairly high speeds.
The system is accepting traffic at fairly high rates; 'systat -if'
suggests 428551GB (not a typo, but possibly a display bug in 'systat')
over the past 63 days, or an average rate of a bit over 600Mb/sec.
'time tcpdump ...' tends to back up this assertion:

    amesbury@host % sudo time tcpdump -i bge1 -n -w /dev/null -c 1000000
    tcpdump: WARNING: bge1: no IPv4 address assigned
    tcpdump: listening on bge1, link-type EN10MB (Ethernet), capture size 96 bytes
    1000000 packets captured
    1000395 packets received by filter
    167 packets dropped by kernel
    0.268u 0.153s 0:06.84 5.9% 901+3236k 0+0io 0pf+0w

What I'm aiming for, of course, is zero packet loss.  Realizing that's
probably impossible for this system given its load, I'm trying to do
what I can to minimize loss.

The system is running a somewhat leaner kernel than GENERIC.  Notable
changes include:

* PREEMPTION disabled - /sys/conf/NOTES says this helps with
  interactivity.  I don't care about interactive performance on this
  host.

* COMPAT_FREEBSD4, COMPAT_LINUX32, and COMPAT_43 are removed.  They
  appear to be unneeded.

* SMP is enabled, as this is a dual-core box (not HTT!).

* Many devices are removed, e.g., ncr(4), sym(4), adv(4), and other
  unnecessary block devices; anything relating to cardbus; de(4),
  bce(4), ti(4), wb(4), ed(4), ex(4), lnc(4), and a number of other
  network devices that aren't ever going to be used; etc.

* All wlan(4) and related drivers are gone.

* pf(4), pflog(4), and some of the ALTQ stuff have been added in, but
  are not actively used on this host (at the moment).

* ZERO_COPY_SOCKETS, MAC_BSDEXTENDED, MAC_PARTITION, and MAC are
  enabled.

* Most importantly, HZ=1000, and DEVICE_POLLING and AUTO_EOI_1 are
  included.  (AUTO_EOI_1 was added because /sys/amd64/conf/NOTES says
  this can save a few microseconds on some interrupts.  I'm not
  worried about suspend/resume, but definitely want speed, so it got
  added.)
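As a sanity check on the figures above, the 63-day byte counter and
the tcpdump summary reduce to a rate and a loss fraction with a little
awk arithmetic (assuming systat's "428551GB" means 10^9-byte
gigabytes; the numbers are the ones quoted above, not new
measurements):

```shell
# Average bit rate implied by 428551 GB over 63 days (10^9-byte GB assumed).
awk 'BEGIN { printf "%.0f Mb/s\n", 428551e9 * 8 / (63 * 86400) / 1e6 }'
# -> 630 Mb/s

# Kernel drop rate from the tcpdump run: 167 dropped of 1000395 received.
awk 'BEGIN { printf "%.4f%%\n", 100 * 167 / 1000395 }'
# -> 0.0167%
```

So the counter, buggy-looking or not, is at least consistent with the
"a bit over 600Mb/sec" claim, and the measured loss in that capture
was under two hundredths of a percent.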
As mentioned above, this host is running FreeBSD/amd64, so there's no
need to remove support for I586_CPU, et al; that stuff was never there
in the first place.

Since kern.polling.enable is marked as deprecated in
/sys/kern/kern_poll.c, I'm enabling polling specifically for the
interface receiving the high-volume traffic.  (It is NOT enabled for
the other interface on this system, but traffic loads there are orders
of magnitude lower, so I didn't think it was necessary.)

As mentioned above, I've got HZ set to 1000.  Per
/sys/amd64/conf/NOTES, I'd considered setting it to 2000, but have
discovered previously that FreeBSD's RFC1323 support breaks.  I
documented this on -hackers last year:

http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html

Since I've not seen word on a correction for this being added to
FreeBSD, I've limited HZ to 1000.

After reading polling(4) a couple of times, I set
kern.polling.burst_max to 1000.  The manpage says that "each interface
can receive at most (HZ * burst_max) packets per second", and the
default setting is 150, which is described as "adequate for 100Mbit
network and HZ=1000."  I figured, "Hey, gigabit, how about ten times
the default?" but that's prevented by "#define MAX_POLL_BURST_MAX
1000" in /sys/kern/kern_poll.c.

In theory that might've been good enough, but polling(4) says that
kern.polling.burst is "[the] [m]aximum number of packets grabbed from
each network interface in each timer tick.  This number is dynamically
adjusted by the kernel, according to the programmed user_frac,
burst_max, CPU speed, and system load."  I keep seeing
kern.polling.burst hit a thousand, which leads me to believe that
kern.polling.burst_max needs to be higher.
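The polling(4) arithmetic above can be sketched directly: the
per-interface ceiling is HZ * burst_max packets per second, so the
minimum burst_max for a target packet rate is just pps / HZ (the
numbers below are the ones from this thread, used for illustration):

```shell
# Ceiling imposed by polling: HZ * burst_max packets per second.
awk 'BEGIN { hz = 1000
             print "burst_max=150  ->", hz * 150, "pps"
             print "burst_max=1000 ->", hz * 1000, "pps" }'
# -> burst_max=150  -> 150000 pps
# -> burst_max=1000 -> 1000000 pps

# Minimum burst_max needed to sustain a given input rate without drops.
awk 'BEGIN { pps = 250000; hz = 1000
             printf "need burst_max >= %d\n", (pps + hz - 1) / hz }'
# -> need burst_max >= 250
```

This is why the default of 150 is called adequate for 100Mbit: at
minimum-size Ethernet frames, 100Mbit tops out well under 150 kpps,
while gigabit can in principle approach 1.49 Mpps, beyond even
MAX_POLL_BURST_MAX at HZ=1000.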
For example:

    secs since
      epoch      kern.polling.burst
    ----------   ------------------
    1166133997         1000
    1166134006          550
    1166134015          877
    1166134024         1000
    1166134033         1000
    1166134042         1000
    1166134051         1000
    1166134060         1000
    1166134069         1000
    1166134078         1000

Unfortunately, raising it appears to be possible only through a)
patching /sys/kern/kern_poll.c to allow larger values; or b) setting
HZ to 2000, as indicated in one of the NOTES, which will effectively
hose certain TCP connectivity because of the RFC1323 breakage.  Looked
at another way, both essentially require changes to source code, the
former being fairly obvious, and the latter requiring fixes to the
RFC1323 support.  Either way, I think that's a bit beyond my
abilities; I have NO illusions about my kernel h4cking sk1llz.

Other possibly relevant data points:

* System load hovers right around 1.

* The system has almost zero disk activity.

* With polling off:

  - 'vmstat 5' consistently shows about 13K context switches and
    ~6800 interrupts
  - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286 for
    bge1, and near zero for everything else
  - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

* With polling on, kern.polling.burst_max=150:

  - kern.polling.burst holds at 150
  - 'vmstat 5' shows context switches holding around 2600, with
    interrupts holding around 30K
  - 'vmstat -i' shows a bge1 interrupt rate of 6286 (but the total
    doesn't increase!), other rates stay the same (looks like possible
    display bugs in 'vmstat -i' here!)
  - CPU load holds at 1, but CPU idle time usually stays >95%

* With polling on, kern.polling.burst_max=1000:

  - kern.polling.burst is frequently 1000 and almost always >850
  - 'vmstat 5' shows context switches unchanged, but interrupts are
    150K-190K
  - 'vmstat -i' is unchanged from burst_max=150
  - CPU load and CPU idle time are very similar to burst_max=150

So, with all that in mind..... any ideas for improvement?  Apologies
in advance for missing the obvious.  'dmesg' and kernel config are
attached.
-- 
Alan Amesbury
OIT Security and Assurance
University of Minnesota

[Attachment: SPECIALIZED (kernel configuration)]

machine		amd64
cpu		HAMMER
ident		SPECIALIZED

# To statically compile in device wiring instead of /boot/device.hints
#hints		"GENERIC.hints"		# Default places to look for devices.

makeoptions	DEBUG=-g		# Build kernel with gdb(1) debug symbols

#options 	SCHED_ULE		# ULE scheduler
options 	SCHED_4BSD		# 4BSD scheduler
#options 	PREEMPTION		# Enable kernel thread preemption
options 	INET			# InterNETworking
options 	INET6			# IPv6 communications protocols
options 	FFS			# Berkeley Fast Filesystem
options 	SOFTUPDATES		# Enable FFS soft updates support
options 	UFS_ACL			# Support for access control lists
options 	UFS_DIRHASH		# Improve performance on big directories
options 	MD_ROOT			# MD is a potential root device
options 	NFSCLIENT		# Network Filesystem Client
options 	NFSSERVER		# Network Filesystem Server
options 	NFS_ROOT		# NFS usable as /, requires NFSCLIENT
options 	MSDOSFS			# MSDOS Filesystem
options 	CD9660			# ISO 9660 Filesystem
options 	PROCFS			# Process filesystem (requires PSEUDOFS)
options 	PSEUDOFS		# Pseudo-filesystem framework
options 	GEOM_GPT		# GUID Partition Tables.
options 	COMPAT_IA32		# Compatible with i386 binaries
options 	COMPAT_FREEBSD5		# Compatible with FreeBSD5
options 	SCSI_DELAY=5000		# Delay (in ms) before probing SCSI
options 	KTRACE			# ktrace(1) support
options 	SYSVSHM			# SYSV-style shared memory
options 	SYSVMSG			# SYSV-style message queues
options 	SYSVSEM			# SYSV-style semaphores
options 	_KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options 	KBD_INSTALL_CDEV	# install a CDEV entry in /dev
options 	AHC_REG_PRETTY_PRINT	# Print register bitfields in debug
					# output.  Adds ~128k to driver.
options 	AHD_REG_PRETTY_PRINT	# Print register bitfields in debug
					# output.  Adds ~215k to driver.
options 	ADAPTIVE_GIANT		# Giant mutex is adaptive.
options 	SMP			# Symmetric MultiProcessor Kernel

# Workarounds for some known-to-be-broken chipsets (nVidia nForce3-Pro150)
device		atpic			# 8259A compatability

# Bus support.
device		acpi
device		isa
device		pci
device		mem
device		io

# Floppy drives
device		fdc

# ATA and ATAPI devices
device		ata
device		atadisk			# ATA disk drives
device		ataraid			# ATA RAID drives
device		atapicd			# ATAPI CDROM drives
device		atapifd			# ATAPI floppy drives
device		atapist			# ATAPI tape drives
options 	ATA_STATIC_ID		# Static device numbering

# SCSI Controllers
device		ahc			# AHA2940 and onboard AIC7xxx devices
device		ahd			# AHA39320/29320 and onboard AIC79xx devices
device		amd			# AMD 53C974 (Tekram DC-390(T))
device		isp			# Qlogic family
device		mpt			# LSI-Logic MPT-Fusion

# SCSI peripherals
device		scbus			# SCSI bus (required for SCSI)
device		ch			# SCSI media changers
device		da			# Direct Access (disks)
device		sa			# Sequential Access (tape etc)
device		cd			# CD
device		pass			# Passthrough device (direct SCSI access)
device		ses			# SCSI Environmental Services (and SAF-TE)

# RAID controllers interfaced to the SCSI subsystem
device		amr			# AMI MegaRAID
device		ciss			# Compaq Smart RAID 5*
device		dpt			# DPT Smartcache III, IV - See NOTES for options
device		hptmv			# Highpoint RocketRAID 182x
device		iir			# Intel Integrated RAID
device		ips			# IBM (Adaptec) ServeRAID
device		mly			# Mylex AcceleRAID/eXtremeRAID
device		twa			# 3ware 9000 series PATA/SATA RAID

# RAID controllers
device		aac			# Adaptec FSA RAID
device		aacp			# SCSI passthrough for aac (requires CAM)
device		ida			# Compaq Smart RAID
device		twe			# 3ware ATA RAID

# atkbdc0 controls both the keyboard and the PS/2 mouse
device		atkbdc			# AT keyboard controller
device		atkbd			# AT keyboard
device		psm			# PS/2 mouse

device		vga			# VGA video card driver
device		splash			# Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device		sc

device		agp			# support several AGP chipsets

# Serial (COM) ports
device		sio			# 8250, 16[45]50 based serial ports

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to the sio and/or ppc drivers):
#device		puc

# PCI Ethernet NICs.
device		em			# Intel PRO/1000 adapter Gigabit Ethernet Card
device		ixgb			# Intel PRO/10GbE Ethernet Card
device		txp			# 3Com 3cR990 (``Typhoon'')
device		vx			# 3Com 3c590, 3c595 (``Vortex'')

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device		miibus			# MII bus support
device		bfe			# Broadcom BCM440x 10/100 Ethernet
device		bge			# Broadcom BCM570xx Gigabit Ethernet
device		dc			# DEC/Intel 21143 and various workalikes
device		fxp			# Intel EtherExpress PRO/100B (82557, 82558)
device		lge			# Level 1 LXT1001 gigabit Ethernet
device		nge			# NatSemi DP83820 gigabit Ethernet
device		re			# RealTek 8139C+/8169/8169S/8110S
device		rl			# RealTek 8129/8139
device		sis			# Silicon Integrated Systems SiS 900/SiS 7016
device		sk			# SysKonnect SK-984x & SK-982x gigabit Ethernet
device		tx			# SMC EtherPower II (83c170 ``EPIC'')
device		xl			# 3Com 3c90x (``Boomerang'', ``Cyclone'')

# Pseudo devices.
device		loop			# Network loopback
device		random			# Entropy device
device		ether			# Ethernet support
device		tun			# Packet tunnel.
device		pty			# Pseudo-ttys (telnet etc)
device		md			# Memory "disks"
device		gif			# IPv6 and IPv4 tunneling
device		faith			# IPv6-to-IPv4 relaying (translation)

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device		bpf			# Berkeley packet filter

# USB support
device		uhci			# UHCI PCI->USB interface
device		ohci			# OHCI PCI->USB interface
device		ehci			# EHCI PCI->USB interface (USB 2.0)
device		usb			# USB Bus (required)
#device		udbp			# USB Double Bulk Pipe devices
device		ugen			# Generic
device		uhid			# "Human Interface Devices"
device		ukbd			# Keyboard
device		ulpt			# Printer
device		umass			# Disks/Mass storage - Requires scbus and da
device		ums			# Mouse

# FireWire support
device		firewire		# FireWire bus code
device		sbp			# SCSI over FireWire (Requires scbus and da)
device		fwe			# Ethernet over FireWire (non-standard!)

options 	ALTQ
options 	ALTQ_CBQ
options 	ALTQ_HFSC
options 	ALTQ_PRIQ
options 	ALTQ_NOPCC
device		pf
device		pflog
options 	BRIDGE
options 	ZERO_COPY_SOCKETS
options 	MAC
options 	MAC_BSDEXTENDED
options 	MAC_PARTITION
options 	HZ=1000
options 	SC_HISTORY_SIZE=1000
options 	SC_KERNEL_CONS_ATTR=(FG_YELLOW|BG_BLACK)
options 	SC_KERNEL_CONS_REV_ATTR=(FG_BLACK|BG_RED)
options 	DEVICE_POLLING
options 	AUTO_EOI_1
options 	INCLUDE_CONFIG_FILE

[Attachment: specialized_dmesg.boot]

Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California.  All rights reserved.
FreeBSD 6.1-RELEASE-p10 #1: Thu Oct 12 14:14:54 CDT 2006
    root@specialized:/usr/obj/usr/src/sys/SPECIALIZED
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) D CPU 2.80GHz (2800.11-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0xf44  Stepping = 4
  Features=0xbfebfbff
  Features2=0x641d>
  AMD Features=0x20100800
  Cores per package: 2
real memory  = 4563402752 (4352 MB)
avail memory = 4140404736 (3948 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
Security policy loaded: TrustedBSD MAC/BSD Extended (mac_bsdextended)
Security policy loaded: TrustedBSD MAC/Partition (mac_partition)
ioapic0: Changing APIC ID to 2
ioapic1: Changing APIC ID to 3
ioapic1: WARNING: intbase 32 != expected base 24
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 32-55 on motherboard
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: on acpi0
cpu1: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
pcib2: at device 28.0 on pci0
pci2: on pcib2
pcib3: at device 0.0 on pci2
pci3: on pcib3
pcib4: at device 28.4 on pci0
pci4: on pcib4
bge0: mem 0xfe8f0000-0xfe8fffff irq 16 at device 0.0 on pci4
miibus0: on bge0
brgphy0: on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:15:c5:60:1b:dc
pcib5: at device 28.5 on pci0
pci5: on pcib5
bge1: mem 0xfe6f0000-0xfe6fffff irq 17 at device 0.0 on pci5
miibus1: on bge1
brgphy1: on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge1: Ethernet address: 00:15:c5:60:1b:dd
uhci0: port 0xbce0-0xbcff irq 20 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: port 0xbcc0-0xbcdf irq 21 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: port 0xbca0-0xbcbf irq 22 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0: mem 0xfeb00400-0xfeb007ff irq 20 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
usb3: EHCI version 1.0
usb3: wrong number of companions (7 != 3)
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: on ehci0
usb3: USB revision 2.0
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
pcib6: at device 30.0 on pci0
pci6: on pcib6
pci6: at device 5.0 (no driver attached)
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
ata0: on atapci0
ata1: on atapci0
atapci1: port 0xbc98-0xbc9f,0xbc90-0xbc93,0xbc80-0xbc87,0xbc78-0xbc7b,0xbc60-0xbc6f mem 0xfeb00000-0xfeb003ff irq 20 at device 31.2 on pci0
ata2: on atapci1
ata3: on atapci1
pci0: at device 31.3 (no driver attached)
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
orm0: at iomem 0xc0000-0xc7fff,0xec000-0xeffff on isa0
atkbdc0: at port 0x60,0x64 on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
acd0: CDRW at ata0-master UDMA33
ad4: 152587MB at ata2-master SATA150
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad4s1a
bge0: link state changed to UP
bge1: link state changed to UP

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 15 14:18:04 2006
From: Bruce Evans <bde@zeta.org.au>
To: Alan Amesbury <amesbury@umn.edu>
Cc: freebsd-performance@freebsd.org
Date: Sat, 16 Dec 2006 01:17:56 +1100 (EST)
Message-ID: <20061215232203.C3994@besplex.bde.org>
In-Reply-To: <4581D185.7020702@umn.edu>
Subject: Re: Polling tuning and performance

On Thu, 14 Dec 2006, Alan Amesbury wrote:

> ...
> What I'm aiming for, of course, is zero packet loss.  Realizing that's
> probably impossible for this system given its load, I'm trying to do
> what I can to minimize loss.
> ...
> * PREEMPTION disabled - /sys/conf/NOTES says this helps with
>   interactivity.  I don't care about interactive performance
>   on this host.

It's needed to prevent packet loss without polling.  It probably makes
little difference with polling (if the machine is mostly handling
network traffic, and that only by polling).

> * Most importantly, HZ=1000, and DEVICE_POLLING and
>   AUTO_EOI_1 are included.  (AUTO_EOI_1 was added because
>   /sys/amd64/conf/NOTES says this can save a few microseconds
>   on some interrupts.  I'm not worried about suspend/resume, but
>   definitely want speed, so it got added.

I don't believe in POLLING or HZ=1000, but recently tested them with
bge.  I am unhappy to report that my fine-tuned interrupt handling
still loses to polling by a few percent for efficiency.  I am happy to
report that polling loses to interrupt handling by a lot for
correctness -- polling gives packet loss.  Polling also loses big for
latency, except with idle_poll and the system actually idle, when it
wins a little.

AUTO_EOI_1 has little effect unless the system gets lots of
interrupts, so with most interrupts avoided by using polling it has
little effect.

> As mentioned above, this host is running FreeBSD/amd64, so there's no
> need to remove support for I586_CPU, et al; that stuff was never there
> in the first place.

AUTO_EOI_1 is also only used in non-apic mode, but non-apic mode is
very unusual for amd64, so AUTO_EOI_1 probably has no effect for you.

> As mentioned above, I've got HZ set to 1000.  Per /sys/amd64/conf/NOTES,
> I'd considered setting it to 2000, but have discovered previously that
> FreeBSD's RFC1323 support breaks.  I documented this on -hackers last year:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html

I think there are old PRs about this.  Even 1000 is too large (?).

> Since I've not seen word on a correction for this being added to
> FreeBSD, I've limited HZ to 1000.

HZ = 100 gives interesting behaviour.
Of course, it doesn't work, since polling depends on polling often
enough.  Any particular value of HZ can only give polling often enough
for a very limited range of systems.  1000 is apparently good for
100Mbps and not too bad for 1Gbps, provided the hardware has enough
buffering -- but with enough buffering, polling is not really needed.

> After reading polling(4) a couple times, I set kern.polling.burst_max to
> 1000.  The manpage says that "each interface can receive at most (HZ *
> burst_max) packets per second", and the default setting is 150, which is
> described as "adequate for 100Mbit network and HZ=1000."  I figured,
> "Hey, gigabit, how about ten times the default?" but that's prevented by
> "#define MAX_POLL_BURST_MAX 1000" in /sys/kern/kern_poll.c.

I can (easily) generate only 250 kpps on input and had to increase
kern.polling.burst_max to > 250 to avoid huge packet lossage at this
rate.  It doesn't seem to work right for output, since I can (easily)
generate 340 kpps output, and got that with a burst max of only 150,
which should have allowed only 150 kpps.  Output is faster at the
lowest level (but slower at higher levels), so doing larger bursts of
output might be intentional.  However, output at 340 kpps gives a
system load of 100% on the test machine (which is not very fast or
SMP) no matter how it is done (polling just makes it go 2% faster), so
polling is not doing its main job very well.  Polling's main job is to
prevent network activity from using 100% of the CPU.  Large values of
kern.polling.burst_max are fundamentally incompatible with polling
doing this.  On my test system, a burst max of 1000 combined with HZ =
1000 would just ask the driver alone to use 100% of the CPU doing 1000
kpps through a single device.  "Fortunately", the device can't go that
fast, so plenty of CPU is left.

> In theory that might've been good enough, but polling(4) says that
> kern.polling.burst is "[the] [m]aximum number of packets grabbed from
> each network interface in each timer tick.
> This number is dynamically
> adjusted by the kernel, according to the programmed user_frac,
> burst_max, CPU speed, and system load."  I keep seeing
> kern.polling.burst hit a thousand, which leads me to believe that
> kern.polling.burst_max needs to be higher.
>
> For example:
>
> secs since
>   epoch      kern.polling.burst
> ----------   ------------------
> 1166133997         1000
> ...

Is it really dynamic?  I see 1000's too, but for sending at only 340
kpps.  Almost all bursts should have size 340.  With a max of 150,
burst is 150 too, but 340 kpps are still sent.

> Unfortunately, that appears to be only possible through a) patching
> /sys/kern/kern_poll.c to allow larger values; or b) setting HZ to 2000,
> as indicated in one of the NOTES, which will effectively hose certain
> TCP connectivity because of the RFC1323 breakage.  Looked at another
> way, both essentially require changes to source code, the former being
> fairly obvious, and the latter requiring fixes to the RFC1323 support.
> Either way, I think that's a bit beyond my abilities; I have NO
> illusions about my kernel h4cking sk1llz.

There may be a fix in an old PR.

> Other possibly relevant data points:
>
> * System load hovers right around 1.

Polling in idle eats all the CPU.  Polling in idle is very wasteful
(mainly of power) unless the system can rarely be idle anyway -- but
then polling in idle doesn't help much.

> * The system has almost zero disk activity.
>
> * With polling off:
>
>   - 'vmstat 5' consistently shows about 13K context switches
>     and ~6800 interrupts
>   - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286
>     for bge1, and near zero for everything else
>   - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

These are only small interrupt loads.  bge always generates about 6667
interrupts per second (under all loads except none or tiny) because it
is programmed to use interrupt moderation with a timeout of 150uS, and
some finer details.
This gives behaviour very similar to polling at a frequency of 6667
Hz.  The main differences between this and polling at 1000 Hz are:

- 6667 Hz works better for correctness (lower latency, fewer dropped
  packets for missed polls)
- 6667 Hz has higher overheads (only a few percent)
- interrupts have lower overheads if nothing is happening, so you
  don't actually get them at 6667 Hz
- the polling given by interrupt moderation is dumb.  It doesn't have
  any of the burst max controls, etc. (but could easily).  It doesn't
  interact with other devices (but could uneasily).

bge can easily be reprogrammed to use interrupt moderation with a
timeout of 1000uS, so that interrupt mode works more like polling at
1000 Hz.  This immediately gives the main disadvantage of polling
(latency of 1000uS, unless polling in idle and the system is actually
idle at least once every 1000uS).  bge has internal (buffering) limits
which have similar effects to the burst limit.  The advantages of
polling are not easily gained in this way (especially for rx).

> * With polling on, kern.polling.burst_max=150:
>
>   - kern.polling.burst holds at 150
>   - 'vmstat 5' shows context switches hold around 2600, with
>     interrupts holding around 30K

I think you mean `systat -vmstat 5'.  The interrupt count here is
bogus.  It is mostly for software interrupts that mostly don't do much
because they coalesce with old ones.  Only the ones that cause context
switches are relevant, and there is no counter for those.  Most of the
context switches are to the poll routine (1000 there and 1000 back).

>   - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>     doesn't increase!), other rates stay the same (looks like
>     possible display bugs in 'vmstat -i' here!)

Probably just averaging.

>   - CPU load holds at 1, but CPU idle time usually stays >95%

I saw heavy polling reduce the idle time significantly here.  I think
the CPU idle time can be very biased here under light loads.  The
times shown by top(1) are unbiased.
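The interrupt-rate figures quoted in this thread follow directly from
the moderation timeout: a device that coalesces interrupts for N
microseconds fires at most 10^6/N times per second.  A quick check of
the two timeouts discussed (illustrative arithmetic only; the observed
6286/sec for bge1 is in the same ballpark as the theoretical rate):

```shell
# Interrupt rate implied by bge's moderation timeout: 1e6 us / timeout_us.
awk 'BEGIN { printf "150 us  -> %d intr/s\n", 1e6 / 150
             printf "1000 us -> %d intr/s\n", 1e6 / 1000 }'
# -> 150 us  -> 6666 intr/s
# -> 1000 us -> 1000 intr/s
```

A 1000uS timeout thus gives the same cadence as HZ=1000 polling, which
is exactly why reprogramming the moderation timer makes interrupt mode
behave like polling.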
> * With polling on, kern.polling.burst_max=1000:
>
>   - kern.polling.burst is frequently 1000 and almost always >850
>   - 'vmstat 5' shows context switches unchanged, but interrupts
>     are 150K-190K
>   - 'vmstat -i' unchanged from burst_max=150
>   - CPU load and CPU idle time very similar to burst_max=150
>
> So, with all that in mind..... Any ideas for improvement?  Apologies in
> advance for missing the obvious.  'dmesg' and kernel config are attached.

Sorry, no ideas about tuning polling parameters (I don't know them
well since I don't believe in polling :-).  You apparently have
everything tuned almost as well as possible, and the only
possibilities for future improvements are avoiding the 5% (?) extra
overhead for !polling and the packet loss for polling.  I see the
following packet loss for polling with HZ=1000, burst_max=300,
idle_poll=1:

%%%
            input    (bge0)          output
   packets  errs      bytes  packets  errs      bytes colls
    242999     1   14579940        0     0          0     0
    235496     0   14129760        0     0          0     0
    236930  3261   14215800        0     0          0     0
    237816  3400   14268960        0     0          0     0
    240418  3211   14425080        0     0          0     0
%%%

The packet losses of 3+K always occur when I hit Caps Lock.  This also
happens without polling unless PREEMPTION is configured.  It is caused
by low-quality code for setting the LED for Caps Lock, combined with
thread priorities and/or their scheduling not working right.  In the
interrupt-driven case, the thread priorities are correct (bgeintr >
syscons) and configuring PREEMPTION fixes the scheduling.  In the
polling case, the thread priorities are apparently incorrect.  Polling
probably needs to have its own thread running at the same priority as
bgeintr (> syscons), but I think it mainly uses the network SWI thread
(< syscons).  With idle_poll=1, it also uses its idlepoll thread, but
that has very low priority, so it cannot help in cases like this.  The
code for setting LEDs busy-waits for several mS, which is several
polling periods.  It must be about 13mS to lose 3200 packets when
packets are arriving at 240 kpps.
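The busy-wait estimate above can be back-figured the same way: losing
~3200 packets at 240 kpps implies the LED code blocked input for
roughly 13 ms, i.e. more than a dozen polling periods at HZ=1000
(numbers taken from the loss table above):

```shell
# Blocked time implied by the loss burst: packets_lost / packets_per_second.
awk 'BEGIN { printf "%.1f ms blocked (%.0f polling periods at HZ=1000)\n",
             3200 / 240000 * 1000, 3200 / 240000 * 1000 }'
# -> 13.3 ms blocked (13 polling periods at HZ=1000)
```

Any interrupt handler or driver that busy-waits for a few milliseconds
will cost a similar-sized burst of packets at these rates.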
With a network server you won't be hitting Caps Lock a lot, but you do
have to worry about other low-quality interrupt handlers busy-waiting
for several mS.

The loss of a single packet in the above happens more often than I can
explain:
- with polling, it happens a lot
- without polling but with PREEMPTION, it happens a lot when I press
  Caps Lock, but not otherwise.

The problem might not be packet loss.  bge has separate statistics for
packet loss, but the net layer counts all input errors together.

Bruce

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 16 01:39:02 2006
From: Alan Amesbury <amesbury@umn.edu>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-performance@freebsd.org
Date: Fri, 15 Dec 2006 19:38:53 -0600
Message-ID: <45834E2D.7010901@umn.edu>
In-Reply-To: <20061215232203.C3994@besplex.bde.org>
Subject: Re: Polling tuning and performance
Bruce, thanks for taking time to read and reply.  For brevity, I've
removed my own earlier writings, (usually) annotating what's missing.

Bruce Evans wrote:

[snip - PREEMPTION stuff]

> It's needed to prevent packet loss without polling.  It probably
> makes little difference with polling (if the machine is mostly
> handling network traffic and that only by polling).

I should've noted in my original posting that 'vmstat' also reports
very little activity in the various paging columns; faults, pages
in/out, reclaims, freed, and pages scanned usually sit very close to
or at zero.  Disk operations as reported by 'vmstat' also sit almost
completely at zero.  The (extremely busy) interface is exclusively
incoming traffic, received promiscuously.  Since that's provided
enough clues as to what this box might actually be doing, I'll give
away the secret: It's running snort.  :-)

> I don't believe in POLLING or HZ=1000, but recently tested them with
> bge.  I am unhappy to report that my fine-tuned interrupt handling
> still loses to polling by a few percent for efficiency.  I am happy
> to report that polling loses to interrupt handling by a lot for
> correctness -- polling gives packet loss.  Polling also loses big for
> latency with idle_poll and the system actually idle, when it wins a
> little.

How are you benchmarking this?

> AUTO_EOI_1 has little effect unless the system gets lots of
> interrupts, so with most interrupts avoided by using polling it has
> little effect.
>
>> As mentioned above, this host is running FreeBSD/amd64, so there's
>> no need to remove support for I586_CPU, et al; that stuff was never
>> there in the first place.
>
> AUTO_EOI_1 is also only used in non-apic mode, but non-apic mode is
> very unusual for amd64 so AUTO_EOI_1 probably has no effect for you.

Good to know.  "No effect" is still acceptable.  I just didn't want to
cause "negative effect."
:-)

[snip - Broken FreeBSD RFC1323/PAWS support at high HZ]

> I think there are old PRs about this.  Even 1000 is too large (?).

We noticed it when 'scrub all tcp reassemble' in FreeBSD 6.x's PF
started tossing packets.  The problem (mostly?) went away when we
dropped from HZ=2000 to HZ=1000, so we considered that a marginally
acceptable work-around for this FreeBSD bug.  However, since we a)
have gigabit-connected PF firewalls; b) want to consider following the
advice in NOTES about HZ=2000 for busy firewalls; and c) really prefer
to run off stock FreeBSD source unless absolutely impossible, we're
sort of interested in seeing a fix for RFC1323 get officially applied
to FreeBSD.  About a year ago I pointed out that a patch had been
submitted.  A commit-bit responder acknowledged it, but said he wanted
to do it differently.  Since I'm not really in a position to pay and
don't have a more acceptable patch of my own to submit, I've not
really squawked about it.

>> Since I've not seen word on a correction for this being added to
>> FreeBSD, I've limited HZ to 1000.
>
> HZ = 100 gives interesting behaviour.  Of course, it doesn't work,
> since polling depends on polling often enough.  Any particular value
> of HZ can only give polling often enough for a very limited range of
> systems.  1000 is apparently good for 100Mbps and not too bad for
> 1Gbps, provided the hardware has enough buffering, but with enough
> buffering polling is not really needed.

Well, I'm not exactly tied to polling.  I just tried it as an
alternative and, for at least part of the time, it's performed better
than non-polling.  I'm open to alternatives; I just want as close to
zero loss as possible.

[snip - "I've read polling(4) and it says..."]

> I can (easily) generate only 250 kpps on input and had to increase
> kern.polling.burst_max to > 250 to avoid huge packet lossage at this
> rate.
> It doesn't seem to work right for output, since I can (easily)
> generate 340 kpps output and got that with a burst max of only 150,
> though I should have got only 150 kpps.  Output is faster at the
> lowest level (but slower at higher levels), so doing larger bursts of
> output might be intentional.  However, output at 340 kpps gives a
> system load of 100% on the test machine (which is not very fast or
> SMP), no matter how it is done (polling just makes it go 2% faster),
> so polling is not doing its main job very well.  Polling's main job
> is to prevent network activity from using 100% CPU.  Large values of
> kern.polling.burst_max are fundamentally incompatible with polling
> doing this.  On my test system, a burst max of 1000 combined with HZ
> = 1000 would just ask the driver alone to use 100% of the CPU doing
> 1000 kpps through a single device.  "Fortunately", the device can't
> go that fast, so plenty of CPU is left.

That's for sending, right?  In this case that's not an issue.  I
simply have incoming traffic with MTUs of up to 9216 bytes that I want
to *receive*.  Never mind the fact that bge(4) and the underlying
hardware sucks in that it can't do that (although there's apparently a
WinDOS driver that can do it on the same hardware?!).  Again, my focus
is on sucking in packets as fast as possible with minimal loss.

[snip - watching kern.polling.burst values]

> Is it really dynamic?  I see 1000's too, but for sending at only 340
> kpps.  Almost all bursts should have size 340.  With a max of 150,
> burst is 150 too but 340 kpps are still sent.

I haven't tested sending.  kern.polling.burst tends to hang at
whatever kern.polling.burst_max is set to.

[snip - writing kernel patches exceeds my expertise]

> There may be a fix in an old PR.  I'll look again.

[snip - load hovers at 1]

> Polling in idle eats all the CPU.  Polling in idle is very wasteful
> (mainly of power) unless the system can rarely be idle anyway, but
> then polling in idle doesn't help much.
This system is expected to NEVER be idle... except if it loses power.
:-)

[snip - other system stats]

> These are only small interrupt loads.  bge always generates about
> 6667 interrupts per second (under all loads except none or tiny)
> because it is programmed to use interrupt moderation with a timeout
> of 150uS and some finer details.  This gives behaviour very similar
> to polling at a frequency of 6667 Hz.  The main differences between
> this and polling at 1000 Hz are:
> - 6667 Hz works better for correctness (lower latency, fewer dropped
>   packets for missed polls)
> - 6667 Hz has higher overheads (only a few percent)
> - interrupts have lower overheads if nothing is happening, so you
>   don't actually get them at 6667 Hz
> - the polling given by interrupt moderation is dumb.  It doesn't have
>   any of the burst max controls, etc. (but could easily).  It doesn't
>   interact with other devices (but could uneasily).
>
> bge can easily be reprogrammed to use interrupt moderation with a
> timeout of 1000uS, so interrupt mode works more like polling at
> 1000Hz.  This immediately gives the main disadvantage of polling
> (latency of 1000uS unless polling in idle and the system is actually
> idle at least once every 1000uS).  bge has internal (buffering)
> limits which have similar effects to the burst limit.  The advantages
> of polling are not easily gained in this way (especially for rx).

If I understand you correctly, it sounds like I'd be better off
without polling, particularly if there are *any* buffer limitations in
the Broadcom hardware.  Again, it's not idle; the lowest recorded
packet receive rate I've seen lately is around 40Kpkt/sec.  The lowest
recorded rate was around 16Kpkt/sec.

>> * With polling on, kern.polling.burst_max=150:
>>
>>       - kern.polling.burst holds at 150
>>       - 'vmstat 5' shows context switches hold around 2600, with
>>         interrupts holding around 30K
>
> I think you mean `systat -vmstat 5'.  The interrupt count here is
> bogus.

No, I mean 'vmstat 5'.
I just let it dump a line every five seconds and watch what happens.
Context switches and interrupts are both shown.  The 'systat' version,
in this case, is harder for me to read; it also lacks the scrolling
history of 'vmstat'.  Sample output taken while writing this (note
that the first line is almost always bogus, and sorry if wrap is
borked):

% vmstat 5
 procs      memory      page                    disk     faults       cpu
 r b w     avm    fre   flt  re  pi  po  fr  sr ad4     in   sy   cs us sy id
 2 0 0 1898784 1256124   13   0   0   0  12   0   0    647  291  552  8 15 78
 1 0 0 1898784 1256124    1   0   0   0   0   0   0 183135   97 2432  9  4 87
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 183370  116 2423 11  5 84
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 183455  100 2454  8  5 87
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 170236  105 2437  8  4 88
 0 1 0 1898784 1256124    0   0   0   0   0   0   0 183183  108 2469 10  5 84
^C

Settings:

* Polling enabled on the high traffic interface
* kern.polling.user_frac=20
* kern.polling.burst_max=1000

> It is mostly for software interrupts that mostly don't do much
> because they coalesce with old ones.  Only ones that cause context
> switches are relevant, and there is no counter for those.  Most of
> the context switches are to the poll routine (1000 there and 1000
> back).
>
>> - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>>   doesn't increase!), other rates stay the same (looks like
>>   possible display bugs in 'vmstat -i' here!)
>
> Probably just averaging.

See, I'm not sure about that.  I thought that the whole point of
polling was to avoid interrupts.  Since the total count doesn't
increase for bge1 in 'vmstat -i' output, I interpreted it as a bug.

>> - CPU load holds at 1, but CPU idle time usually stays >95%
>
> I saw heavy polling reduce the idle time significantly here.  I
> think the CPU idle time can be very biased here under light loads.
> The times shown by top(1) are unbiased.

As mentioned before, though, this system is expected to NEVER be idle,
so a fast polling loop shouldn't be a liability.
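For reference, settings like the ones listed above would have been
applied with roughly the following commands.  This is a sketch for
FreeBSD 6.x: it assumes a kernel built with DEVICE_POLLING and
HZ=1000, and bge1 is this particular host's interface name.

```shell
# Sketch: enabling and tuning polling on FreeBSD 6.x (assumes
# 'options DEVICE_POLLING' in the kernel config; bge1 is this host's
# busy interface).
ifconfig bge1 polling                # per-interface polling switch in 6.x
sysctl kern.polling.user_frac=20     # reserve ~20% of each tick for userland
sysctl kern.polling.burst_max=1000   # cap on packets handled per poll
sysctl kern.polling.idle_poll=1      # also poll from the idle loop
sysctl kern.polling.burst            # read back the adaptive burst size
```

These are configuration commands for a 6.x machine, so treat the exact
spelling as approximate; polling(4) on the target release is the
authority.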
[snip - more stats; "room for improvement?"]

> Sorry, no ideas about tuning polling parameters (I don't know them
> well since I don't believe in polling :-).  You apparently have
> everything tuned almost as well as possible, and the only
> possibilities for future improvements are avoiding the 5% (?) extra
> overhead for !polling and the packet loss for polling.
>
> I see the following packet loss for polling with HZ=1000,
> burst_max=300, idle_poll=1:
>
> %%%
>             input    (bge0)           output
>    packets  errs      bytes    packets  errs      bytes colls
>     242999     1   14579940          0     0          0     0
>     235496     0   14129760          0     0          0     0
>     236930  3261   14215800          0     0          0     0
>     237816  3400   14268960          0     0          0     0
>     240418  3211   14425080          0     0          0     0
> %%%

Well, I guess I'm doing OK, then.  With the same settings as above:

amesbury@scoop % netstat -I bge1 -w 5
            input    (bge1)           output
   packets  errs      bytes    packets  errs      bytes colls
    614710     0  513122698          0     0          0     0
    662633     0  556662669          0     0          0     0
    639052     0  530704135          0     0          0     0
    706713     0  576938553          0     0          0     0
    690495     0  554269218          0     0          0     0
    682868     0  560234712          0     0          0     0
    692268     0  562487939          0     0          0     0
    680498     0  549782169          0     0          0     0
^C

Then again, it's after 1830 on a Friday afternoon, so traffic loads
have dropped a bit, so it's quite possible I'm not seeing anything
dropped here because of this relatively lighter load.

> The packet losses of 3+K always occur when I hit Caps Lock.  This
> also happens without polling unless PREEMPTION is configured.  It is
> caused by low-quality code for setting the LED for Caps Lock,
> combined with thread priorities and/or their scheduling not working
> right.  In the interrupt-driven case, the thread priorities are
> correct (bgeintr > syscons) and configuring PREEMPTION fixes the
> scheduling.  In the polling case, the thread priorities are
> apparently incorrect.  Polling probably needs its own thread running
> at the same priority as bgeintr (> syscons), but I think it mainly
> uses the network SWI thread (< syscons).
> With idle_poll=1, it also uses its idlepoll thread, but that has very
> low priority so it cannot help in cases like this.  The code for
> setting LEDs busy-waits for several ms, which is several polling
> periods.  It must be about 13 ms to lose 3200 packets when packets
> are arriving at 240 kpps.
>
> With a network server you won't be hitting Caps Lock a lot, but you
> do have to worry about other low-quality interrupt handlers
> busy-waiting for several ms.
>
> The loss of a single packet in the above happens more often than I
> can explain:
> - with polling, it happens a lot
> - without polling but with PREEMPTION, it happens a lot when I press
>   Caps Lock but not otherwise.
> The problem might not be packet loss.  bge has separate statistics
> for packet loss, but the net layer counts all input errors together.

Fortunately this machine doesn't even have a keyboard attached, so
there'll be no Caps games on it.  :-)

In spite of the momentary 0% loss, do you think switching to an em(4),
sk(4), or other card might help?  The bge(4) interfaces are integrated
PCIe, and I think only PCI-X slots are available.

Again, thanks for the sanity checking and additional information.
--
Alan Amesbury
OIT Security and Assurance
University of Minnesota

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 16 07:11:31 2006
Date: Sat, 16 Dec 2006 18:11:26 +1100 (EST)
From: Bruce Evans
To: Alan Amesbury
Cc: freebsd-performance@freebsd.org
Subject: Re: Polling tuning and performance
Message-ID: <20061216171718.K2901@besplex.bde.org>
In-Reply-To: <45834E2D.7010901@umn.edu>

On Fri, 15 Dec 2006, Alan Amesbury wrote:

> Bruce Evans wrote:
> ...
> The (extremely busy) interface is exclusively incoming traffic,
> received promiscuously.  Since that's provided enough clues as to
> what this box might actually be doing, I'll give away the secret:
> It's running snort.
> :-)
>
>> I don't believe in POLLING or HZ=1000, but recently tested them with
>> bge.  ...
>
> How are you benchmarking this?

Just by blasting packets, usually with ttcp.

> ...
> Well, I'm not exactly tied to polling.  I just tried it as an
> alternative and, for at least part of the time, it's performed better
> than non-polling.  I'm open to alternatives; I just want as close to
> zero loss as possible.

Polling is not working acceptably for me at all.  I'm testing on the
same network and machine that are serving nfs/udp.  Apparently, with
polling there is an i/o error every few seconds even under light
loads, and of course errors are especially bad for nfs/udp (nfs seems
to recover but takes about 1 minute).

> ...
> [snip - "I've read polling(4) and it says..."]
>> I can (easily) generate only 250 kpps on input and had to increase
>> kern.polling.burst_max to > 250 to avoid huge packet lossage at
>> this rate.  It doesn't seem to work right for output, since I can
>> (easily) generate 340 kpps output and got that with a burst max of
>> only 150, though I should have got only 150 kpps.  Output is faster
>> at the lowest level (but slower at higher levels), so doing larger
>> bursts of output might be intentional.  However, output at 340 kpps
>> gives a system load of 100% on the test machine (which is not very
>> fast or SMP), no matter how it is done (polling just makes it go 2%
>> faster), so polling is not doing its main job very well.  Polling's
>> main job is to prevent network activity from using 100% CPU.  Large
>> values of kern.polling.burst_max are fundamentally incompatible
>> with polling doing this.  On my test system, a burst max of 1000
>> combined with HZ = 1000 would just ask the driver alone to use 100%
>> of the CPU doing 1000 kpps through a single device.  "Fortunately",
>> the device can't go that fast, so plenty of CPU is left.
>
> That's for sending, right?  In this case that's not an issue.
> I simply have incoming traffic with MTUs of up to 9216 bytes that I
> want to *receive*.  Never mind the fact that bge(4) and the
> underlying hardware sucks in that it can't do that (although there's
> apparently a WinDOS driver that can do it on the same hardware?!).
> Again, my focus is on sucking in packets as fast as possible with
> minimal loss.

Some bge hardware certainly supports jumbo frames.  Half of mine can,
and the other half is documented not to.

> ...
> If I understand you correctly, it sounds like I'd be better off
> without polling, particularly if there are *any* buffer limitations
> in the Broadcom hardware.  Again, it's not idle; the lowest recorded
> packet receive rate I've seen lately is around 40Kpkt/sec.  The
> lowest recorded rate was around 16Kpkt/sec.

No, you seem to have the fairly specialized but common application
where polling currently works better, except for the problem with
packet loss which we don't completely understand but seems to be
related to thread priorities.

>>> * With polling on, kern.polling.burst_max=150:
>>>
>>>       - kern.polling.burst holds at 150
>>>       - 'vmstat 5' shows context switches hold around 2600, with
>>>         interrupts holding around 30K
>>
>> I think you mean `systat -vmstat 5'.  The interrupt count here is
>> bogus.
>
> No, I mean 'vmstat 5'.  I just let it dump a line every five seconds
> and watch what happens.  Context switches and interrupts are both
> shown.  The 'systat' version, in this case, is harder for me to
> read; it also lacks the scrolling history of 'vmstat'.  Sample
> output taken while writing this (note that the first line is almost
> always bogus and sorry if wrap is borked):

Ah, I forgot that I fixed some interrupt counting only in -current, to
get a useful interrupt count in vmstat.  Software interrupts are still
put in the global interrupt count (but not in the software interrupt
count) in RELENG_6.
This makes them show up in vmstat output, and in many configurations
they dominate the global count, so this count becomes unrelated to the
actual interrupt load.  In -current they are counted as software
interrupts only.  systat -vmstat reports interrupt counts in finer
detail, so it is possible to determine various subcounts by adding or
subtracting the other counts.

>> ...
>>> - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>>>   doesn't increase!), other rates stay the same (looks like
>>>   possible display bugs in 'vmstat -i' here!)
>>
>> Probably just averaging.
>
> See, I'm not sure about that.  I thought that the whole point of
> polling was to avoid interrupts.  Since the total count doesn't
> increase for bge1 in 'vmstat -i' output, I interpreted it as a bug.

It's probably just the bogus software interrupt count.  Apparently,
polling generates 20-30 software interrupts per poll.  I don't know
why it generates so many, but the context switch count shows that most
of them don't generate a context switch, so most of them don't take
much time.

Both software interrupts and hardware interrupts are currently counted
when they are requested, not when they are delivered.  This is dubious
but works out OK for hardware interrupts only.  For hardware
interrupts, even requests have a large overhead, so requests that will
coalesce should be counted somewhere; but for software interrupts,
requests have a low overhead, so the only reason to count requests
that will coalesce is to find and fix callers that make them.  I think
that for hardware interrupts, requests that will coalesce are rare in
practice, since the first request blocks subsequent ones.
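By that accounting, the per-poll software-interrupt request rate on
Alan's host can be estimated from the vmstat sample earlier in the
thread.  This is only a sketch: the ~183K/sec figure is read off that
sample's "in" column, and HZ=1000 polls/sec is assumed.

```shell
# Estimate software-interrupt requests per poll.  In RELENG_6 the SWI
# requests land in vmstat's global interrupt column; ~183000/sec is
# the figure from the vmstat sample earlier in the thread, and
# HZ=1000 gives 1000 polls per second.
ints_per_sec=183000
polls_per_sec=1000
awk -v i="$ints_per_sec" -v p="$polls_per_sec" \
    'BEGIN { printf "~%d SWI requests per poll\n", i / p }'
# prints: ~183 SWI requests per poll
```

That is far above the 20-30 per poll Bruce sees on his test box, which
fits his point that the global count tracks request volume, not real
interrupt load.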
>> I see the following packet loss for polling with HZ=1000,
>> burst_max=300, idle_poll=1:
>>
>> %%%
>>             input    (bge0)           output
>>    packets  errs      bytes    packets  errs      bytes colls
>>     242999     1   14579940          0     0          0     0
>>     235496     0   14129760          0     0          0     0
>>     236930  3261   14215800          0     0          0     0
>>     237816  3400   14268960          0     0          0     0
>>     240418  3211   14425080          0     0          0     0
>> %%%
>
> Well, I guess I'm doing OK, then.  With the same settings as above:
>
> amesbury@scoop % netstat -I bge1 -w 5
>             input    (bge1)           output
>    packets  errs      bytes    packets  errs      bytes colls
>     614710     0  513122698          0     0          0     0
>     662633     0  556662669          0     0          0     0
>     639052     0  530704135          0     0          0     0
>     706713     0  576938553          0     0          0     0
>     690495     0  554269218          0     0          0     0
>     682868     0  560234712          0     0          0     0
>     692268     0  562487939          0     0          0     0
>     680498     0  549782169          0     0          0     0
> ^C

Yes, I used -w 1 so my pps is about twice as much as yours, but I also
use tiny packets so as to get that high rate on low-end hardware, and
that gives a bandwidth that is about 1/8 of yours.

> Then again, it's after 1830 on a Friday afternoon, so traffic loads
> have dropped a bit, so it's quite possible I'm not seeing anything
> dropped here because of this relatively lighter load.

Problems are certainly more likely with higher pps.  140 kpps is quite
small.  I can almost reach that with tiny packets on a 100Mbps
network.

> In spite of the momentary 0% loss, do you think switching to an
> em(4), sk(4), or other card might help?  The bge(4) interfaces are
> integrated PCIe, and I think only PCI-X slots are available.

I believe em is (only slightly?) better but haven't used it.  The bus
matters most unless the card is really stupid.

Bruce
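The ~140 kpps figure follows directly from the interval counts in the
quoted netstat sample; a quick sketch (the packet count and the
5-second interval are taken from that output):

```shell
# Recover packets/sec from a netstat -w interval count.  682868
# packets in one 5-second sample (from the netstat output quoted
# above) lands in the "140 kpps is quite small" ballpark.
pkts=682868
interval_s=5
awk -v n="$pkts" -v t="$interval_s" \
    'BEGIN { printf "~%.0f pps\n", n / t }'
# prints: ~136574 pps
```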