From owner-freebsd-current@FreeBSD.ORG Mon May 12 07:19:07 2003
Date: Mon, 12 May 2003 10:19:00 -0400 (EDT)
From: Robert Watson
X-Sender: robert@fledge.watson.org
To: current@FreeBSD.org, alfred@FreeBSD.org
Subject: rpc.lockd spinning; much breakage

I have three boxes in the current configuration: a build server (cboss)
and two diskless workstations (crash1, crash2).  crash1 and crash2
PXE-boot, load a kernel via NFS, and then run almost entirely out of NFS,
with the exception of local swap and perhaps a local file system for
testing.  The clients frequently "hang" with many processes in the
"lockd" wait state when the daily periodic run kicks in -- here's the ps
from crash2 in such a state:

db> ps
 pid proc addr uid ppid pgrp flag stat wmesg wchan cmd
23872 c1de05a0 c934c000 0 446 446 0000100 norm[SLPQ lockd c1dbc620][SLP] sendmail
23870 c1de0960 c934e000 0 446 446 0000100 norm[SLPQ lockd c1dbc540][SLP] sendmail
23868 c1dbf780 c92f2000 0 446 446 0000100 norm[SLPQ lockd c1dbc020][SLP] sendmail
23866 c1dbf000 c92ee000 0 446 446 0000100 norm[SLPQ lockd c1dbc1e0][SLP] sendmail
23864 c1b72960 c9226000 0 446 446 0000100 norm[SLPQ lockd c1dbc200][SLP] sendmail
23862 c1b725a0 c9224000 0 446 446 0000100 norm[SLPQ lockd c1dbc1a0][SLP] sendmail
23860 c1b73d20 c927e000 0 446 446 0000100 norm[SLPQ lockd c1dbc400][SLP] sendmail
23858 c1b733c0 c922b000 0 446 446 0000100 norm[SLPQ lockd c1dbc640][SLP] sendmail
23856 c1dbe5a0 c92e9000 0 446 446 0000100 norm[SLPQ lockd c1dbc5e0][SLP] sendmail
23854 c1a395a0 c9160000 0 446 446 0000100 norm[SLPQ lockd c1dbc1c0][SLP] sendmail
23852 c1b723c0 c9223000 0 446 446 0000100 norm[SLPQ lockd c1dbc080][SLP] sendmail
23850 c1b37d20 c920a000 0 446 446 0000100 norm[SLPQ lockd c1dbc2c0][SLP] sendmail
23848 c1dbf3c0 c92f0000 0 446 446 0000100 norm[SLPQ lockd c1db8d60][SLP] sendmail
23846 c1a3e960 c91b8000 0 446 446 0000100 norm[SLPQ lockd c1dbc2a0][SLP] sendmail
23845 c1b73780 c927b000 0 23840 23820 0004102 norm[SLPQ lockd c1dbc140][SLP] su
23840 c1dbe780 c92ea000 0 23828 23820 0004002 norm[SLPQ wait c1dbe780][SLP] sh
23832 c1b72000 c9221000 0 23829 23820 0004002 norm[SLPQ piperd c19cc8f0][SLP] mail
23831 c1b73960 c927c000 0 1 23831 0000100 norm[SLPQ lockd c1db8d80][SLP] sendmail
23829 c1b37780 c9207000 0 23820 23820 0000002 norm[SLPQ wait c1b37780][SLP] sh
23828 c1a3eb40 c91b9000 0 23820 23820 0000002 norm[SLPQ wait c1a3eb40][SLP] sh
23821 c1b37960 c9208000 0 23789 23821 0004000 norm[SLPQ piperd c19ccb00][SLP] mail.local
23820 c1b73000 c9229000 0 521 23820 0004002 norm[SLPQ wait c1b73000][SLP] sh
23818 c1dbf1e0 c92ef000 25 1 23575 0004102 norm[SLPQ lockd c1db8dc0][SLP] sendmail
23789 c1dbe000 c92e6000 0 23781 446 0000100 norm[SLPQ lockd c1db8ec0][SLP] sendmail
23781 c1a393c0 c915f000 0 446 446 0000100 norm[SLPQ lockd c1dbc0c0][SLP] sendmail
23627 c1dbeb40 c92ec000 0 23602 23627 0004000 norm[SLPQ piperd c19cde70][SLP] mail.local
23626 c1dbf5a0 c92f1000 0 23601 23626 0004000 norm[SLPQ piperd c19cc6e0][SLP] mail.local
23602 c1dbe3c0 c92e8000 0 23589 446 0000100 norm[SLPQ lockd c1dbca20][SLP] sendmail
23601 c1dbed20 c92ed000 0 23591 446 0000100 norm[SLPQ lockd c1dbca00][SLP] sendmail
23591 c1a3e000 c91b3000 0 446 446 0000100 norm[SLPQ lockd c1dbc6a0][SLP] sendmail
23589 c1dbe1e0 c92e7000 0 446 446 0000100 norm[SLPQ lockd c1dbc760][SLP] sendmail
23529 c1b73b40 c927d000 0 23514 23529 0004000 norm[SLPQ piperd c19cc840][SLP] mail.local
23514 c1b721e0 c9222000 0 23508 446 0000100 norm[SLPQ lockd c1dbc160][SLP] sendmail
23508 c1b731e0 c922a000 0 446 446 0000100 norm[SLPQ lockd c1dbc000][SLP] sendmail
23471 c1b72d20 c9228000 0 23457 23471 0004000 norm[SLPQ piperd c19cdd10][SLP] mail.local
23457 c1b735a0 c922c000 0 23452 446 0000100 norm[SLPQ lockd c1db8da0][SLP] sendmail
23452 c1dbe960 c92eb000 0 446 446 0000100 norm[SLPQ lockd c1db8e80][SLP] sendmail
23324 c1b72780 c9225000 0 23279 23324 0004000 norm[SLPQ piperd c19cd370][SLP] mail.local
23323 c1b37b40 c9209000 0 23276 23323 0004000 norm[SLPQ piperd c19cd9a0][SLP] mail.local
23279 c1a39b40 c9163000 0 1 446 0000100 norm[SLPQ lockd c1dbcb00][SLP] sendmail
23276 c1b72b40 c9227000 0 1 446 0000100 norm[SLPQ lockd c1dbc8e0][SLP] sendmail
521 c1a3ed20 c91ba000 0 515 521 0004002 norm[SLPQ pause c91ba000][SLP] csh
520 c1b36000 c91c0000 425 471 471 0000100 norm[SLPQ accept c19b1c36][SLP] httpd
519 c1b361e0 c91c1000 425 471 471 0000100 norm[SLPQ accept c19b1c36][SLP] httpd
518 c1b363c0 c91c4000 425 471 471 0000100 norm[SLPQ accept c19b1c36][SLP] httpd
517 c1b365a0 c91d7000 425 471 471 0000100 norm[SLPQ accept c19b1c36][SLP] httpd
516 c1b36780 c91d8000 425 471 471 0000100 norm[SLPQ accept c19b1c36][SLP] httpd
515 c1b36960 c91d9000 0 1 515 0004102 norm[SLPQ wait c1b36960][SLP] login
514 c1b36b40 c91da000 0 1 514 0004002 norm[SLPQ ttyin c1a27010][SLP] getty
513 c1b36d20 c91db000 0 1 513 0004002 norm[SLPQ ttyin c1a29010][SLP] getty
512 c1b37000 c91dc000 0 1 512 0004002 norm[SLPQ ttyin c1a28c10][SLP] getty
511 c1b371e0 c9204000 0 1 511 0004002 norm[SLPQ ttyin c1a28810][SLP] getty
510 c1b373c0 c9205000 0 1 510 0004002 norm[SLPQ ttyin c1a27c10][SLP] getty
509 c1b375a0 c9206000 0 1 509 0004002 norm[SLPQ ttyin c1a27810][SLP] getty
508 c1a39780 c9161000 0 1 508 0004002 norm[SLPQ ttyin c1a27410][SLP] getty
507 c19ad780 c8d01000 0 1 507 0004002 norm[SLPQ ttyin c1907010][SLP] getty
488 c19155a0 c8cd1000 0 1 488 0000000 norm[CVQ select c05b65b4][SLP] inetd
471 c1a39960 c9162000 0 1 471 0000000 norm[CVQ select c05b65b4][SLP] httpd
449 c1915960 c8cd3000 25 1 449 0000100 norm[SLPQ pause c8cd3000][SLP] sendmail
446 c1a39000 c914d000 0 1 446 0000100 norm[CVQ select c05b65b4][SLP] sendmail
440 c19add20 c8d04000 0 1 440 0000100 norm[CVQ select c05b65b4][SLP] sshd
430 c1a3e3c0 c91b5000 0 1 430 0000000 norm[CVQ select c05b65b4][SLP] usbd
378 c1a391e0 c915e000 0 372 372 0000100 norm[CVQ select c05b65b4][SLP] rpc.lockd
372 c1a39d20 c91b2000 0 1 372 0000000 norm[CVQ select c05b65b4][SLP] rpc.lockd
370 c1a3e1e0 c91b4000 0 1 370 0000000 norm[CVQ select c05b65b4][SLP] rpc.statd
368 c1a3e5a0 c91b6000 0 362 362 0000000 norm[SLPQ nfsd c1a27e00][SLP] nfsd
367 c1a3e780 c91b7000 0 362 362 0000000 norm[SLPQ nfsd c1a28000][SLP] nfsd
365 c1915780 c8cd2000 0 362 362 0000000 norm[SLPQ nfsd c1a28200][SLP] nfsd
364 c19ad1e0 c8cfe000 0 362 362 0000000 norm[SLPQ nfsd c1a28400][SLP] nfsd
362 c1915b40 c8cfb000 0 1 362 0000000 norm[CVQ select c05b65b4][SLP] nfsd
360 c19ad000 c8cfd000 0 1 360 0000000 norm[CVQ select c05b65b4][SLP] mountd
291 c1915d20 c8cfc000 0 1 291 0000000 norm[CVQ select c05b65b4][SLP] ypbind
276 c19ad5a0 c8d00000 0 1 276 0000000 norm[CVQ select c05b65b4][SLP] rpcbind
210 c19153c0 c8cd0000 0 1 210 0000000 norm[CVQ select c05b65b4][SLP] dhclient
124 c19ad3c0 c8cff000 0 1 124 0000000 norm[SLPQ pause c8cff000][SLP] adjkerntz
44 c19ad960 c8d02000 0 0 0 0000204 norm[SLPQ nfsidl c05bd14c][SLP] nfsiod 3
43 c19adb40 c8d03000 0 0 0 0000204 norm[SLPQ nfsidl c05bd148][SLP] nfsiod 2
42 c189bb40 c8c9a000 0 0 0 0000204 norm[SLPQ nfsidl c05bd144][SLP] nfsiod 1
41 c189bd20 c8c9b000 0 0 0 0000204 norm[SLPQ nfsidl c05bd140][SLP] nfsiod 0
40 c1913000 c8cc6000 0 0 0 0000204 norm[SLPQ vlruwt c1913000][SLP] vnlru
39 c19131e0 c8cc7000 0 0 0 0000204 norm[SLPQ syncer c0591160][SLP] syncer
38 c19133c0 c8cc8000 0 0 0 0000204 norm[SLPQ psleep c05b697c][SLP] bufdaemon
37 c19135a0 c8cc9000 0 0 0 000020c norm[SLPQ pgzero c05be808][SLP] pagezero
9 c1913780 c8cca000 0 0 0 0000204 norm[SLPQ psleep c05be834][SLP] vmdaemon
8 c1913960 c8ccb000 0 0 0 0000204 norm[SLPQ psleep c05be820][SLP] pagedaemon
36 c1913b40 c8ccc000 0 0 0 0000204 new [IWAIT] irq8: rtc
35 c1913d20 c8ccd000 0 0 0 0000204 new [IWAIT] irq0: clk
34 c1915000 c8cce000 0 0 0 0000204 new [IWAIT] irq3: sio1
33 c19151e0 c8ccf000 0 0 0 0000204 new [IWAIT] irq4: sio0
32 c17c85a0 c8c62000 0 0 0 0000204 norm[IWAIT] swi0: tty:sio
31 c17c8780 c8c63000 0 0 0 0000204 norm[IWAIT] irq6: fdc0
30 c17c8960 c8c64000 0 0 0 0000204 new [IWAIT] irq5: fxp3
29 c17c8b40 c8c65000 0 0 0 0000204 new [IWAIT] irq7: fxp2
28 c17c8d20 c8c66000 0 0 0 0000204 norm[SLPQ usbtsk c0588118][SLP] usbtask
27 c189b000 c8c6d000 0 0 0 0000204 norm[SLPQ usbevt c18ef210][SLP] usb0
26 c189b1e0 c8c6e000 0 0 0 0000204 new [IWAIT] irq11: ohci0
25 c189b3c0 c8c6f000 0 0 0 0000204 norm[IWAIT] irq15: ata1
24 c189b5a0 c8c70000 0 0 0 0000204 norm[IWAIT] irq14: ata0
23 c189b780 c8c71000 0 0 0 0000204 norm[IWAIT] irq10: fxp0 fxp1
7 c189b960 c8c72000 0 0 0 0000204 norm[SLPQ actask c06da0cc][SLP] acpi_task2
6 c0b8b1e0 c8670000 0 0 0 0000204 norm[SLPQ actask c06da0cc][SLP] acpi_task1
5 c0b8b3c0 c8671000 0 0 0 0000204 norm[SLPQ actask c06da0cc][SLP] acpi_task0
22 c0b8b5a0 c8672000 0 0 0 0000204 new [IWAIT] irq9: acpi0
21 c0b8b780 c8673000 0 0 0 0000204 new [IWAIT] irq13:
20 c0b8b960 c8674000 0 0 0 0000204 new [IWAIT] swi5:+
19 c0b8bb40 c8675000 0 0 0 0000204 new [IWAIT] swi6: task queue
18 c0b8bd20 c869d000 0 0 0 0000204 new [IWAIT] swi3: cambio
17 c17c8000 c8c5f000 0 0 0 0000204 new [IWAIT] swi2: camnet
16 c17c81e0 c8c60000 0 0 0 0000204 new [IWAIT] swi6: acpitaskq
15 c17c83c0 c8c61000 0 0 0 0000204 norm[SLPQ sleep c057bf60][SLP] random
4 c0b84000 c8619000 0 0 0 0000204 norm[SLPQ g_down c058a678][SLP] g_down
3 c0b841e0 c8668000 0 0 0 0000204 norm[SLPQ g_up c058a674][SLP] g_up
2 c0b843c0 c8669000 0 0 0 0000204 norm[SLPQ g_events c058a66c][SLP] g_event
14 c0b845a0 c866a000 0 0 0 0000204 new [IWAIT] swi4: vm
13 c0b84780 c866b000 0 0 0 000020c norm[IWAIT] swi7: tty:sio clock
12 c0b84960 c866c000 0 0 0 0000204 norm[IWAIT] swi1: net
11 c0b84b40 c866d000 0 0 0 000020c norm[CPU 0] idle
1 c0b84d20 c866e000 0 0 1 0004200 norm[SLPQ wait c0b84d20][SLP] init
10 c0b8b000 c866f000 0 0 0 0000204 norm[CVQ ktrace c058dc34][SLP] ktrace
0 c058a760 c0709000 0 0 0 0000200 norm[SLPQ sched c058a760][SLP] swapper

When the client is in this state, rpc.lockd on crash2 (.11) appears to be
spinning, spewing the following sorts of packets at cboss (.1):

10:13:37.329753 0:0:e2:3c:70:63 0:d0:b7:68:4a:8c 0800 290: 192.168.50.11.49938 > 192.168.50.1.1016: udp 248
                4500 0114 9515 0000 4011 ff66 c0a8 320b
                c0a8 3201 c312 03f8 0100 4bf5 3eba ef11
                0000 0000 0000 0002 0001 86b5 0000 0001
                0000 0007 0000 0001 0000 0054 3ebf a6d0
                0000 0015 6372 6173 6832 2e72 762e 6e61
                696c
10:13:37.330089 0:0:e2:3c:70:63 0:d0:b7:68:4a:8c 0800 250: 192.168.50.11.49938 > 192.168.50.1.1016: udp 208
                4500 00ec 9516 0000 4011 ff8d c0a8 320b
                c0a8 3201 c312 03f8 00d8 2d1b 3eba ef12
                0000 0000 0000 0002 0001 86b5 0000 0001
                0000 0007 0000 0001 0000 002c 3ebf a6d0
                0000 0015 6372 6173 6832 2e72 762e 6e61
                696c
10:13:37.519632 0:0:e2:3c:70:63 0:d0:b7:68:4a:8c 0800 250: 192.168.50.11.49938 > 192.168.50.1.1016: udp 208
                4500 00ec 9517 0000 4011 ff8c c0a8 320b
                c0a8 3201 c312 03f8 00d8 3a03 3eba ef13
                0000 0000 0000 0002 0001 86b5 0000 0001
                0000 0007 0000 0001 0000 002c 3ebf a6d0
                0000 0015 6372 6173 6832 2e72 762e 6e61
                696c
10:13:37.829638 0:0:e2:3c:70:63 0:d0:b7:68:4a:8c 0800 290: 192.168.50.11.49938 > 192.168.50.1.1016: udp 248
                4500 0114 9518 0000 4011 ff63 c0a8 320b
                c0a8 3201 c312 03f8 0100 601b 3eba ef14
                0000 0000 0000 0002 0001 86b5 0000 0001
                0000 0007 0000 0001 0000 0054 3ebf a6d0
                0000 0015 6372 6173 6832 2e72 762e 6e61
                696c

On the server side, rpc.lockd is unusually busy as a result:

  PID USERNAME PRI NICE  SIZE   RES STATE    TIME   WCPU    CPU COMMAND
47547 root     121    0 1592K 1224K RUN    448:30 98.68% 98.68% rpc.lockd

The workload that usually upsets the client is running:

	while (1)
		periodic daily
		periodic weekly
		periodic monthly
	end

The server's rpc.lockd remains extremely busy even after crash2 is
rebooted and the stream of packets is no longer present.  I'm not sure
how to go about debugging these problems, but as things stand I can't get
both crash boxes through their daily events if the client and server are
both very busy (i.e., if they run their daily events at the same time).
I'm going to reboot cboss and the crash boxes and see if that flushes
whatever nasty state is hanging around, but any advice on the debugging
process would be helpful.  Is there a way to get rpc.lockd on the server
to dump its state to a file?

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories
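
A note on inspecting the wedged client: the stuck processes above are all
sleeping on the "lockd" wait channel, i.e. waiting for the local
rpc.lockd to answer a lock request, so the stack of one of them plus the
list of locked vnodes may show exactly what they are blocked on.  A
sketch only, using the ddb(4) commands as documented for 5.x; the exact
syntax is worth double-checking against the installed kernel:

	db> show lockedvnods            (vnodes with held/contended locks)
	db> trace 23872                 (stack of one stuck sendmail, the
	                                 pid taken from the ps listing above)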
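
The dumps above can also be decoded by hand: past the IP and UDP headers,
the RPC call header shows program 0x186b5 (100021, the NLM/nlockmgr
service), version 1, and what looks like procedure 7 (an asynchronous NLM
lock request), retransmitted over and over with a fresh XID each time.  A
quick way to confirm what is registered on UDP port 1016 on the server,
and whether its lockd still answers at all while spinning, is rpcinfo(8);
a sketch, assuming the server is reachable as cboss and the registration
has not changed since the trace was taken:

	# List the server's RPC registrations; expect 100021 (nlockmgr)
	# on the UDP port seen in the trace.
	rpcinfo -p cboss | grep -w 1016

	# Ping the NLM null procedure over UDP to see whether the
	# server's rpc.lockd still responds while it is spinning.
	rpcinfo -u cboss 100021 1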
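
On the state-dump question: the stock rpc.lockd(8) does not appear to
have a dump-state-to-a-file switch, but it does take a -d debug level
that logs its activity via syslog, and the spinning process can be
watched from outside with truss or ktrace, or inspected in place with
gdb.  A server-side sketch; the flag, syslog facility, and paths are from
memory and should be checked against the installed manual pages (the PID
below is simply the one from the top output above):

	# Restart the server's lockd with debug logging; a daemon.debug
	# entry in /etc/syslog.conf may be needed to capture the output.
	killall rpc.lockd
	rpc.lockd -d 10

	# Watch what the spinning daemon is doing at the syscall level.
	truss -o /tmp/lockd.truss -p 47547
	# (or: ktrace -p 47547, then kdump)

	# Attach a debugger to look at its in-memory state directly.
	gdb /usr/sbin/rpc.lockd 47547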