From owner-freebsd-stable Sun Jun 18 4:24:17 2000 Delivered-To: freebsd-stable@freebsd.org Received: from lucifer.bart.nl (lucifer.bart.nl [194.158.168.74]) by hub.freebsd.org (Postfix) with ESMTP id B402537B556; Sun, 18 Jun 2000 04:24:04 -0700 (PDT) (envelope-from asmodai@lucifer.bart.nl) Received: (from asmodai@localhost) by lucifer.bart.nl (8.9.3/8.9.3) id NAA95568; Sun, 18 Jun 2000 13:23:54 +0200 (CEST) (envelope-from asmodai) Date: Sun, 18 Jun 2000 13:23:54 +0200 From: Jeroen Ruigrok van der Werven To: Luoqi Chen Cc: stable@freebsd.org, dillon@freebsd.org Subject: Details [note: this is long] Message-ID: <20000618132353.B95177@lucifer.bart.nl> References: <200006131458.e5DEwJb03585@lor.watermarkgroup.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200006131458.e5DEwJb03585@lor.watermarkgroup.com>; from luoqi@watermarkgroup.com on Tue, Jun 13, 2000 at 10:58:19AM -0400 Organisation: VIA Net.Works The Netherlands Sender: owner-freebsd-stable@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG [still unsure where to sent 4.x related problems, since we only lag behind 5 a few weeks/months.] [Sorry for cc:'ing you directly Matthew, but I thought you might find this interesting] -On [20000613 17:00], Luoqi Chen (luoqi@watermarkgroup.com) wrote: >The best way to diagnose the problem is to work on the live system when >the same symptom occurs (unkillable process), find out which channels >these processes are sleeping on and why they're not waken up (hardware >failure might contribute to it). > >A `ps axl' report would be very helpful. Ok, be afraid: UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND 0 0 0 0 -18 0 0 0 sched DLs ?? 0:04.42 (swapper) 0 1 0 0 10 0 524 72 wait ILs ?? 0:00.70 /sbin/init -- 0 2 0 0 -18 0 0 0 psleep DL ?? 0:31.86 (pagedaemon) 0 3 0 0 18 0 0 0 psleep DL ?? 0:00.00 (vmdaemon) 0 4 0 0 -18 0 0 0 psleep DL ?? 0:02.14 (bufdaemon) 0 5 0 0 18 0 0 0 syncer DL ?? 2:12.06 (syncer) 0 30 1 29 18 0 208 0 pause Is ?? 0:00.00 adjkerntz -i 0 95 1 0 2 0 888 396 select Ss ?? 0:07.50 syslogd -s 1 102 1 0 2 0 904 424 select Is ?? 0:13.25 /usr/sbin/portmap 0 104 1 0 2 0 944 400 select Is ?? 0:09.97 ypbind -s -Sadmin,idun10,bragi10 0 111 1 0 10 0 208 0 nfsidl I ?? 0:05.71 nfsiod -n 4 0 112 1 0 10 0 208 0 nfsidl I ?? 0:00.24 nfsiod -n 4 0 113 1 0 10 0 208 0 nfsidl I ?? 0:00.00 nfsiod -n 4 0 114 1 0 10 0 208 0 nfsidl I ?? 0:00.00 nfsiod -n 4 0 133 1 0 2 0 1024 412 select Is ?? 0:00.01 inetd -wW 0 135 1 0 10 0 940 436 nanslp Is ?? 0:08.99 cron 0 138 1 0 2 0 1828 752 select Is ?? 0:05.86 /usr/sbin/sshd 80 2105 1 0 -4 0 1900 840 nfsvin D ?? 0:00.16 /usr/local/apache/bin/httpd 80 2119 1 0 -4 0 1900 840 nfsvin D ?? 0:00.14 /usr/local/apache/bin/httpd 80 2338 1 0 -4 0 1900 840 nfsvin D ?? 0:00.18 /usr/local/apache/bin/httpd 80 2343 1 0 -18 0 1920 840 vmopar DL ?? 0:00.27 /usr/local/apache/bin/httpd 80 2344 1 0 -4 0 1900 840 nfsvin D ?? 0:00.21 /usr/local/apache/bin/httpd 80 2357 1 0 -4 0 1900 840 nfsvin D ?? 0:00.15 /usr/local/apache/bin/httpd 80 2604 1 0 -4 0 1900 840 nfsvin D ?? 0:00.48 /usr/local/apache/bin/httpd 80 2605 1 0 -18 0 1900 840 vmopar D ?? 0:00.40 /usr/local/apache/bin/httpd 80 2608 1 0 -4 0 1900 840 nfsvin D ?? 0:00.38 /usr/local/apache/bin/httpd 80 2610 1 0 -4 0 1900 840 nfsvin D ?? 0:00.43 /usr/local/apache/bin/httpd 80 2659 1 0 -4 0 1900 840 nfsvin D ?? 0:00.39 /usr/local/apache/bin/httpd 80 2684 1 0 -4 0 1900 840 nfsvin D ?? 0:00.68 /usr/local/apache/bin/httpd 80 2697 1 0 -18 0 1900 840 vmopar D ?? 0:00.12 /usr/local/apache/bin/httpd 80 2729 1 0 -4 0 1900 840 nfsvin D ?? 0:00.08 /usr/local/apache/bin/httpd 80 2774 1 0 -18 0 1900 840 vmopar D ?? 0:00.16 /usr/local/apache/bin/httpd 80 2798 1 0 -4 0 1900 840 nfsvin D ?? 0:00.26 /usr/local/apache/bin/httpd 80 2803 1 0 -4 0 1900 840 nfsvin D ?? 0:00.30 /usr/local/apache/bin/httpd [snip couple 'o hundred of the same types of httpd processes, ps -axl | wc -l gave 1027] 80 99475 1 0 -4 0 1900 840 nfsvin D ?? 0:00.41 /usr/local/apache/bin/httpd 80 99502 1 0 -4 0 1900 840 nfsvin D ?? 0:00.24 /usr/local/apache/bin/httpd 80 99503 1 0 -4 0 1900 840 nfsvin D ?? 0:00.16 /usr/local/apache/bin/httpd 80 99504 1 0 -4 0 1900 840 nfsvin D ?? 0:00.07 /usr/local/apache/bin/httpd 80 99527 1 0 -4 0 1900 840 nfsvin D ?? 0:00.54 /usr/local/apache/bin/httpd 80 99538 1 0 -4 0 1900 840 nfsvin D ?? 0:00.18 /usr/local/apache/bin/httpd 80 99589 1 0 -4 0 1900 840 nfsvin D ?? 0:00.19 /usr/local/apache/bin/httpd 80 99590 1 0 -4 0 1900 840 nfsvin D ?? 0:00.14 /usr/local/apache/bin/httpd 80 99611 1 0 -4 0 1900 840 nfsvin D ?? 0:00.07 /usr/local/apache/bin/httpd 80 99622 1 0 -4 0 1900 840 nfsvin D ?? 0:00.37 /usr/local/apache/bin/httpd 0 7244 7243 0 18 0 476 348 pause Is p0 0:00.29 -csh (csh) 0 7250 7244 0 10 0 1628 1276 wait S p0 0:00.03 bash 0 7253 7250 1 28 0 1540 1404 - R+ p0 0:00.01 ps -axl 0 7254 7250 1 28 0 308 128 - R+ p0 0:00.00 more 0 29318 1 0 2 0 2280 512 select I p0- 0:00.81 /usr/local/sbin/snmpd 0 58721 1 0 3 0 908 276 ttyin Is+ v0 0:00.01 /usr/libexec/getty Pc ttyv0 0 183 1 0 3 0 904 272 ttyin Is+ v1 0:00.01 /usr/libexec/getty Pc ttyv1 0 184 1 0 3 0 904 272 ttyin Is+ v2 0:00.01 /usr/libexec/getty Pc ttyv2 0 185 1 0 3 0 904 272 ttyin Is+ v3 0:00.01 /usr/libexec/getty Pc ttyv3 0 186 1 0 3 0 904 272 ttyin Is+ v4 0:00.01 /usr/libexec/getty Pc ttyv4 0 187 1 0 3 0 904 272 ttyin Is+ v5 0:00.01 /usr/libexec/getty Pc ttyv5 0 188 1 0 3 0 904 272 ttyin Is+ v6 0:00.01 /usr/libexec/getty Pc ttyv6 0 189 1 0 3 0 904 272 ttyin Is+ v7 0:00.01 /usr/libexec/getty Pc ttyv7 >For those unkillable processes, >you might want to report the backtrace for each, here's how to get them, > # gdb -k /kernel /dev/mem > (kgdb) proc > (kgdb) bt Basically each and every httpd process is unkillable. We get our data from a NetApp Filer F720. root@fenrir: /root$ gdb -k /kernel.debug /dev/mem GNU gdb 4.18 Copyright 1998 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-unknown-freebsd"... IdlePTD 2990080 initial pcb at 700f000 panic messages: --- --- #0 mi_switch () at ../../kern/kern_synch.c:858 858 if (switchtime.tv_sec == 0) (kgdb) proc 2105 (kgdb) bt #0 mi_switch () at ../../kern/kern_synch.c:858 858 if (switchtime.tv_sec == 0) #1 0xc0149539 in tsleep (ident=0xd441d20a, priority=18, wmesg=0xc02387fb "nfsvinval", timo=0) at ../../kern/kern_synch.c:467 467 mi_switch(); #2 0xc019a14b in nfs_vinvalbuf (vp=0xd4c0c200, flags=1, cred=0xc2e42180, p=0xd5ccc920, intrflg=1) at ../../nfs/nfs_bio.c:1170 1170 error = tsleep((caddr_t)&np->n_flag, PRIBIO + 2, "nfsvinval", #3 0xc01bcfd6 in nfs_open () #4 0xc017682b in vn_open (ndp=0xd5cddecc, fmode=1, cmode=420) at vnode_if.h:189 189 rc = VCALL(vp, VOFFSET(vop_open), &a); #5 0xc017281d in open (p=0xd5ccc920, uap=0xd5cddf80) at ../../kern/vfs_syscalls.c:994 994 error = vn_open(&nd, flags, cmode); #6 0xc0216459 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 4, tf_esi = 672500536, tf_ebp = -1077937632, tf_isp = -707928108, tf_ebx = 672435076, tf_edx = 672500536, tf_ecx = 15, tf_eax = 5, tf_trapno = 22, tf_err = 2, tf_eip = 672351492, tf_cs = 31, tf_eflags = 659, tf_esp = -1077937676, tf_ss = 47}) at ../../i386/i386/trap.c:1126 1126 error = (*callp->sy_call)(p, args); #7 0xc020b0a6 in Xint0x80_syscall () #8 0x8050c14 in ?? () Another nfsvin: (kgdb) proc 99538 (kgdb) bt #0 mi_switch () at ../../kern/kern_synch.c:858 858 if (switchtime.tv_sec == 0) #1 0xc0149539 in tsleep (ident=0xd441d20a, priority=18, wmesg=0xc02387fb "nfsvinval", timo=0) at ../../kern/kern_synch.c:467 467 mi_switch(); #2 0xc019a14b in nfs_vinvalbuf (vp=0xd4c0c200, flags=1, cred=0xc2747180, p=0xd5ccbc20, intrflg=1) at ../../nfs/nfs_bio.c:1170 1170 error = tsleep((caddr_t)&np->n_flag, PRIBIO + 2, "nfsvinval", #3 0xc01bcfd6 in nfs_open () #4 0xc017682b in vn_open (ndp=0xd5cfcecc, fmode=1, cmode=420) at vnode_if.h:189 189 rc = VCALL(vp, VOFFSET(vop_open), &a); #5 0xc017281d in open (p=0xd5ccbc20, uap=0xd5cfcf80) at ../../kern/vfs_syscalls.c:994 994 error = vn_open(&nd, flags, cmode); #6 0xc0216459 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 4, tf_esi = 672500536, tf_ebp = -1077937632, tf_isp = -707801132, tf_ebx = 672435076, tf_edx = 672500536, tf_ecx = 15, tf_eax = 5, tf_trapno = 7, tf_err = 2, tf_eip = 672351492, tf_cs = 31, tf_eflags = 659, tf_esp = -1077937676, tf_ss = 47}) at ../../i386/i386/trap.c:1126 1126 error = (*callp->sy_call)(p, args); #7 0xc020b0a6 in Xint0x80_syscall () #8 0x8050c14 in ?? () vmopar stuck one: (kgdb) proc 2343 (kgdb) bt #0 mi_switch () at ../../kern/kern_synch.c:858 858 if (switchtime.tv_sec == 0) #1 0xc0149539 in tsleep (ident=0xc05ceec0, priority=4, wmesg=0xc0240471 "vmopar", timo=0) at ../../kern/kern_synch.c:467 467 mi_switch(); #2 0xc01ef947 in vm_object_page_remove () #3 0xc01f39d1 in vnode_pager_setsize () #4 0xc01b7c93 in nfs_loadattrcache () #5 0xc01c0b83 in nfs_readrpc () #6 0xc01987f5 in nfs_getpages (ap=0xd5de7c8c) at ../../nfs/nfs_bio.c:153 #7 0xc01f3f7e in vnode_pager_getpages () #8 0xc01e8e4e in vm_fault () #9 0xc0215e1a in trap_pfault (frame=0xd5de7dd0, usermode=0, eva=672915456) at ../../i386/i386/trap.c:800 800 rv = vm_fault(map, va, ftype, #10 0xc0215a83 in trap (frame={tf_fs = -1072496624, tf_es = -1062797296, tf_ds = 16, tf_edi = -1062680520, tf_esi = 672915455, tf_ebp = -706838980, tf_isp = -706839044, tf_ebx = 2048, tf_edx = 672917447, tf_ecx = 498, tf_eax = -706846720, tf_trapno = 12, tf_err = 0, tf_eip = -1071558811, tf_cs = 8, tf_eflags = 66054, tf_esp = -706838772, tf_ss = -706838828}) at ../../i386/i386/trap.c:426 426 (void) trap_pfault(&frame, FALSE, eva); #11 0xc0214f65 in generic_copyin () #12 0xc016193c in sosend (so=0xd1678480, addr=0x0, uio=0xd5de7f0c, top=0x0, control=0x0, flags=0, p=0xd5da25a0) at ../../kern/uipc_socket.c:567 567 error = uiomove(mtod(m, caddr_t), (int)len, uio); #13 0xc0156c28 in soo_write (fp=0xc27c3680, uio=0xd5de7f0c, cred=0xc2e48b80, flags=0, p=0xd5da25a0) at ../../kern/sys_socket.c:78 78 return so->so_proto->pr_usrreqs->pru_sosend(so, 0, uio, 0, 0, 0, #14 0xc0153b03 in writev (p=0xd5da25a0, uap=0xd5de7f80) at ../../sys/file.h:159 159 error = (*fp->f_ops->fo_write)(fp, uio, cred, flags, p); #15 0xc0216459 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = -1077936692, tf_esi = -1077936700, tf_ebp = -1077937776, tf_isp = -706838572, tf_ebx = 289, tf_edx = 3, tf_ecx = 18407, tf_eax = 121, tf_trapno = 7, tf_err = 2, tf_eip = 672156464, tf_cs = 31, tf_eflags = 642, tf_esp = -1077937820, tf_ss = 47}) at ../../i386/i386/trap.c:1126 1126 error = (*callp->sy_call)(p, args); #16 0xc020b0a6 in Xint0x80_syscall () #17 0x80529a7 in ?? () And another one: (kgdb) proc 2774 (kgdb) bt #0 mi_switch () at ../../kern/kern_synch.c:858 858 if (switchtime.tv_sec == 0) #1 0xc0149539 in tsleep (ident=0xc05ceec0, priority=4, wmesg=0xc0240471 "vmopar", timo=0) at ../../kern/kern_synch.c:467 467 mi_switch(); #2 0xc01ef947 in vm_object_page_remove () #3 0xc01f39d1 in vnode_pager_setsize () #4 0xc01b7c93 in nfs_loadattrcache () #5 0xc01bcb37 in nfs3_access_otw () #6 0xc01bd113 in nfs_getattr () #7 0xc01bf11c in nfs_lookup () #8 0xc016e518 in lookup () #9 0xc016e014 in namei () #10 0xc01738ed in stat (p=0xd5da0ba0, uap=0xd5e1ff80) at ../../kern/vfs_syscalls.c:1698 1698 if ((error = namei(&nd)) != 0) #11 0xc0216459 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = -1077936692, tf_esi = -1077936700, tf_ebp = -1077937456, tf_isp = -706609196, tf_ebx = 1, tf_edx = 135129834, tf_ecx = 39, tf_eax = 188, tf_trapno = 7, tf_err = 2, tf_eip = 672155472, tf_cs = 31, tf_eflags = 518, tf_esp = -1077937516, tf_ss = 47}) at ../../i386/i386/trap.c:1126 1126 error = (*callp->sy_call)(p, args); #12 0xc020b0a6 in Xint0x80_syscall () #13 0x806622a in ?? () Some last logging from /var/log/messages: Jun 17 17:09:21 fenrir /kernel: nfs_getpages: error 70 Jun 17 17:09:21 fenrir /kernel: vm_fault: pager read error, pid 3896 (httpd) Jun 17 17:11:25 fenrir /kernel: nfs_getpages: error 70 Jun 17 17:11:25 fenrir /kernel: vm_fault: pager read error, pid 3880 (httpd) Jun 18 00:05:02 fenrir /kernel: nfs_getpages: error 70 Jun 18 00:05:02 fenrir /kernel: vm_fault: pager read error, pid 5178 (httpd) Jun 18 00:14:40 fenrir /kernel: nfs_getpages: error 70 Jun 18 00:14:40 fenrir /kernel: vm_fault: pager read error, pid 5145 (httpd) Jun 18 00:16:13 fenrir /kernel: nfs_getpages: error 70 Jun 18 00:16:13 fenrir /kernel: vm_fault: pager read error, pid 5302 (httpd) Jun 18 00:21:51 fenrir /kernel: nfs_getpages: error 70 Jun 18 00:21:51 fenrir /kernel: vm_fault: pager read error, pid 5268 (httpd) /etc/fstab: netapp://basic/users /home/users nfs ro,nodev,noexec,nosuid 0 0 Every NFS mount gets mounted like this on the webservers. top reports: last pid: 7353; load averages: 0.14, 0.04, 0.01 up 15+23:59:31 13:16:15 1020 processes:1 running, 1019 sleeping CPU states: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle Mem: 84M Active, 16M Inact, 88M Wired, 14M Cache, 35M Buf, 47M Free Swap: 1024M Total, 174M Used, 850M Free, 16% Inuse *sigh* I just wanted to drop into DDB only to realise it wasn't configured on this particular box. [mental note: go past all boxes and fix this] This is not my week. =( -- Jeroen Ruigrok van der Werven Network- and systemadministrator VIA Net.Works The Netherlands BSD: Technical excellence at its best http://www.via-net-works.nl I succeed him; no one could replace him... To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-stable" in the body of the message