From owner-freebsd-net@FreeBSD.ORG Mon Dec 17 05:21:19 2007 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5CB8916A419 for ; Mon, 17 Dec 2007 05:21:19 +0000 (UTC) (envelope-from maf@eng.oar.net) Received: from sv1.eng.oar.net (sv1.eng.oar.net [192.148.251.86]) by mx1.freebsd.org (Postfix) with SMTP id 291EF13C468 for ; Mon, 17 Dec 2007 05:21:19 +0000 (UTC) (envelope-from maf@eng.oar.net) Received: (qmail 31882 invoked from network); 17 Dec 2007 04:54:38 -0000 Received: from dev1.eng.oar.net (HELO ?127.0.0.1?) (192.148.251.71) by sv1.eng.oar.net with SMTP; 17 Dec 2007 04:54:38 -0000 Mime-Version: 1.0 (Apple Message framework v752.3) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Content-Transfer-Encoding: 7bit From: Mark Fullmer Date: Sun, 16 Dec 2007 23:54:28 -0500 To: freebsd-stable@freebsd.org X-Mailer: Apple Mail (2.752.3) Cc: freebsd-net@FreeBSD.org Subject: Packet loss every 30.999 seconds X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2007 05:21:19 -0000 While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated November 8, 2007 it looks like I've stumbled across a broken driver or kernel routine which stops interrupt processing long enough to severly degrade network performance every 30.99 seconds. Packets appear to make it as far as ether_input() then get lost. Test setup: A - ethernet_switch - B A sends UDP packets to B through an ethernet switch. The interface input packet count and output packet count on the switch match what A is sending and B should be receiving. A UDP receiver running on B sees windows of packet loss with a period of 30.99 seconds. The lost packets are counted based on an incrementing sequence number. On an isolated network the Ipkts counter on B matches what A is sending, but the packets never show up in any of the IP/UDP counters or the program trying to receive them. This behavior can be seen with both em and fxp interfaces. Problem is it only occurs after the receiving host has been up about a day. Reboot, problem clears. GENERIC kernel, nothing more than default daemons running. Behavior seen on three different motherboards so far. It also appears this is not just lost network interrupts. Whatever is spinning in the kernel also impacts syscall latency. An easy way to replicate what I'm seeing is to run gettimeofday() in a tight loop and note when the real time syscall delay exceeds some value (which is dependent on processor speed). As an example on an 3.20GHz CPU a small program will output when the syscall latency is > 5000 usecs. Note the periodic behavior at 30.99 seconds. These big jumps in latency correspond to when packets are being dropped. usecs (epoch) latency diff ------------------------------------------------ 1197861705805078 478199 0 1197861721012298 25926 15207220 1197861726332036 11729 5319738 1197861757331549 11691 30999513 1197861788331266 11878 30999717 1197861819330647 11708 30999381 1197861850330192 11698 30999545 1197861881329733 11667 30999541 1197861900018297 6516 18688564 1197861912329282 11684 12310985 1197861943328849 11699 30999567 1197861974328413 11692 30999564 1197862005328228 11916 30999815 1197862036327598 11684 30999370 1197862067327229 11680 30999631 1197862098326860 11667 30999631 1197862129326559 11704 30999699 1197862160326377 11844 30999818 1197862191325890 11674 30999513 (output from packet loss tester) window_start/window_end is packet counter time_start/time_end is absolute time in usecs. window_diff is # of packets missing The test is run at about 15.5Kpps / 132Mbits/second, certainly a lot less than this hardware is capable of running BSD4.X. :missing window_start=311510, time_start=1197861726332008,window_end=311638, time_end=1197861726332011, window_diff=128, time_diff=3 :missing window_start=794482, time_start=1197861757331505,window_end=794609, time_end=1197861757331509, window_diff=127, time_diff=4 :missing window_start=1277313, time_start=1197861788331245,window_end=1277444, time_end=1197861788331249, window_diff=131, time_diff=4 :missing window_start=1760104, time_start=1197861819330625,window_end=1760232, time_end=1197861819330629, window_diff=128, time_diff=4 :missing window_start=2242789, time_start=1197861850330170,window_end=2242916, time_end=1197861850330174, window_diff=127, time_diff=4 :missing window_start=2725818, time_start=1197861881329712,window_end=2725946, time_end=1197861881329715, window_diff=128, time_diff=3 :missing window_start=3208594, time_start=1197861912329261,window_end=3208722, time_end=1197861912329264, window_diff=128, time_diff=3 :missing window_start=3691395, time_start=1197861943328802,window_end=3691522, time_end=1197861943328805, window_diff=127, time_diff=3 :missing window_start=4173793, time_start=1197861974328369,window_end=4173921, time_end=1197861974328373, window_diff=128, time_diff=4 :missing window_start=4656236, time_start=1197862005328176,window_end=4656367, time_end=1197862005328179, window_diff=131, time_diff=3 :missing window_start=5139197, time_start=1197862036327576,window_end=5139325, time_end=1197862036327580, window_diff=128, time_diff=4 :missing window_start=5621958, time_start=1197862067327208,window_end=5622085, time_end=1197862067327211, window_diff=127, time_diff=3 :missing window_start=6104597, time_start=1197862098326839,window_end=6104725, time_end=1197862098326843, window_diff=128, time_diff=4 :missing window_start=6587241, time_start=1197862129326514,window_end=6587369, time_end=1197862129326534, window_diff=128, time_diff=20 :missing window_start=7070051, time_start=1197862160326368,window_end=7070183, time_end=1197862160326371, window_diff=132, time_diff=3 :missing window_start=7552828, time_start=1197862191325873,window_end=7552954, time_end=1197862191325876, window_diff=126, time_diff=3 :missing window_start=8035434, time_start=1197862222325572,window_end=8035560, time_end=1197862222325576, window_diff=126, time_diff=4 I'm building a more up to date copy of RELENG_6 to make sure I'm not chasing something that's been fixed. As a side note this appears to also be happening on a RELENG_6 build dated Mar 11 2007. Included is the gettimeofday() looper. Run as ./a.out 1 5000, where 5000 will depend on your system speed. This probably won't provide any meaningful results on a loaded system. E-mail me off list for a copy of the packet tester or more diagnostics. #include #include #include #include #include #include #include #include main(int argc, char **argv) { struct timeval tv; struct timezone tz; u_int64_t time_now, time_last, time_mark; int quiet, max; if (argc != 3) errx(1, "Usage: %s ", argv[0]); quiet=atoi(argv[1]); max=atoi(argv[2]); gettimeofday(&tv, &tz); time_last = (u_int64_t)tv.tv_sec * 1000000LL + (u_int64_t)tv.tv_usec; time_mark = 0LL; for (;;) { gettimeofday(&tv, &tz); time_now = (u_int64_t)tv.tv_sec * 1000000LL + (u_int64_t) tv.tv_usec; if (!quiet) { printf("%llu %llu %llu\n", time_now, time_now-time_last, time_now-time_mark); time_mark = time_now; } else { if ((time_now-time_last) > max) { if (time_mark == 0) time_mark = time_now; printf("%llu %llu %llu\n", time_now, time_now-time_last, time_now-time_mark); time_mark = time_now; } } time_last = time_now; } } /* main */