From owner-freebsd-questions Mon Jan 17 15: 1: 1 2000
Delivered-To: freebsd-questions@freebsd.org
Received: from blockhead.mincom.com (blockhead2.mincom.com [203.15.57.33])
    by hub.freebsd.org (Postfix) with ESMTP id 0D7A415035
    for ; Mon, 17 Jan 2000 15:00:40 -0800 (PST)
    (envelope-from philh@mincom.com)
Received: (from uucp@localhost)
    by blockhead.mincom.com (8.9.3/8.9.3) id JAA34832
    for ; Tue, 18 Jan 2000 09:00:29 +1000 (EST)
    (envelope-from philh@mincom.com)
Received: from porthole.mincom.oz.au(172.17.100.2)
    via SMTP by blockhead.mincom.oz.au, id smtpdy34830; Tue Jan 18 09:00:29 2000
Received: (from philh@localhost)
    by porthole.mincom.oz.au (8.9.3/8.8.5) id JAA26214
    for questions@freebsd.org; Tue, 18 Jan 2000 09:00:29 +1000 (EST)
Date: Tue, 18 Jan 2000 09:00:28 +1000
From: Phil Homewood
To: questions@freebsd.org
Subject: Strange lockups/lost response/VMbug, 3.3-STABLE (22Nov1999)
Message-ID: <20000118090028.C28105@mincom.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.95.5i
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

One of our firewall machines here has "locked up" three times in
as many (business) days. I put "locked up" in quotes because it's
not completely wedged; the "top" I have running is still happily
updating; it accepts connections but does nothing with them; and
the serial console is completely unresponsive (not even echoing).

Curiously, and I suspect not coincidentally, the second of our
firewalls suffered what appeared to be the same fate yesterday.

The first machine is only new (commissioned a couple of months
ago), however the other has been in service for over 12 months
without a single glitch, running 3.1-STABLE; it was cvsupped to
3.4-STABLE a couple of days after 3.4 went -RELEASE.

The machines are nearly identical, save for disk size and CPU
(the older machine is a P2/333, the new one is a Celeron 400.)
Both have two disks hanging off ahc controllers, and both have
three 3C905B NICs.

Following is a cut-and-paste of the "top" currently running on
the now-wedged machine:

last pid: 22720; load averages: 4.00, 4.00, 4.00 up 0+22:58:56 08:39:25
138 processes: 1 running, 135 sleeping, 2 zombie
CPU states: 0.0% user, 0.0% nice, 0.4% system, 1.2% interrupt, 98.4% idle
Mem: 23M Active, 62M Inact, 39M Wired, 8345K Buf, 616K Free
Swap: 256M Total, 256M Free

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
20251 uucp     -20    0 23872K 12376K VM pgd  14:46  0.00%  0.00% smtpd
16266 uucp     -20    0 23808K   428K VM pgd  14:30  0.00%  0.00% smtpd
  825 philh     28    0  1248K   660K RUN      2:22  0.00%  0.00% top
  163 bind       2    0  3080K  2284K select   0:30  0.00%  0.00% named
  399 root       2    0  1180K   684K select   0:16  0.00%  0.00% sshd1
  368 root     -18    0  1444K   616K vmwait   0:11  0.00%  0.00% xinetd
  156 root       2    0   828K   472K select   0:10  0.00%  0.00% syslogd
  167 root       2  -12  1044K   612K select   0:09  0.00%  0.00% xntpd
  362 root     -18    0  1156K   668K vmwait   0:08  0.00%  0.00% sshd1
  247 uucp      10    0   792K   488K nanslp   0:05  0.00%  0.00% smtpfwdd
  303 root       2    0   920K   508K accept   0:03  0.00%  0.00% socks5
22710 root     -18    0   972K   616K vmwait   0:03  0.00%  0.00% find
  335 root       2    0   920K   508K accept   0:01  0.00%  0.00% socks5
  322 root       2    0   920K   512K accept   0:01  0.00%  0.00% socks5
    1 root      10    0   496K   116K wait     0:01  0.00%  0.00% init
  296 root       2    0   920K   596K accept   0:00  0.00%  0.00% socks5
  294 root       2    0   920K   596K accept   0:00  0.00%  0.00% socks5

The state of the two smtpd processes, plus a "find" and other
things stuck in vmwait, seems to indicate some VM weirdness to me.
The smtpd processes really shouldn't be using that much memory;
I suspect a stupidly large message is in the process of being
rejected. However the memory stats at the top don't agree with
the process listing...

I just tried suspending the "top" and lost all control of the
terminal. Off to reboot....

OK, after reboot...
last thing in the system logs was:

Jan 18 08:05:30 blocker Socks5[333]: TCP Connection Established: Connect (xxx.xxx.xxx.xxx:xxxx to xxx.xxx.xxx.xxx:xxxx) for user
Jan 18 08:07:11 blocker -- MARK --
Jan 18 08:12:11 blocker -- MARK --
Jan 18 08:17:11 blocker -- MARK --
Jan 18 08:22:11 blocker -- MARK --
Jan 18 08:27:11 blocker -- MARK --
Jan 18 08:32:11 blocker -- MARK --
Jan 18 08:37:11 blocker -- MARK --
Jan 18 08:42:11 blocker -- MARK --

followed by silence until the reboot. (Loss of response seems
to have occurred around 08:05, so syslogd at least was still
working while the machine was unresponsive. Reboot was at 08:46.)

Nothing interesting logged on console, just the usual smtpd
whining about bad ident responses and incomplete spool files
(from the last reboot I guess).

This (or similar) problem did occur in pre-commissioning testing
of the machine, but was put down to hardware, and a replacement
2940 *seemed* to make the hangs go away. Back then, though, we
did get SCSI errors on console (see previous messages by me in
-questions in Oct/Nov 1999).

dmesg.boot from the newer machine:

Copyright (c) 1992-1999 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California. All rights reserved.
FreeBSD 3.3-STABLE #0: Mon Nov 22 14:24:08 EST 1999
    root@blocker.mincom.oz.au:/usr/src/sys/compile/BLOCKER
Timecounter "i8254" frequency 1193182 Hz
Timecounter "TSC" frequency 400911175 Hz
CPU: Celeron (400.91-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0x665 Stepping = 5
  Features=0x183f9ff
real memory = 134217728 (131072K bytes)
avail memory = 127815680 (124820K bytes)
Preloaded elf kernel "kernel" at 0xc026f000.
Pentium Pro MTRR support enabled
Probing for devices on PCI bus 0:
chip0: rev 0x03 on pci0.0.0
chip1: rev 0x03 on pci0.1.0
chip2: rev 0x02 on pci0.7.0
ide_pci0: rev 0x01 on pci0.7.1
chip3: rev 0x02 on pci0.7.3
xl0: <3Com 3c905B-TX Fast Etherlink XL> rev 0x30 int a irq 11 on pci0.14.0
xl0: Ethernet address: 00:10:5a:72:43:a0
xl0: autoneg complete, link status good (half-duplex, 10Mbps)
xl1: <3Com 3c905B-TX Fast Etherlink XL> rev 0x24 int a irq 10 on pci0.18.0
xl1: Ethernet address: 00:10:4b:c5:b6:14
xl1: autoneg complete, link status good (half-duplex, 10Mbps)
ahc0: rev 0x00 int a irq 15 on pci0.19.0
ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
xl2: <3Com 3c905B-TX Fast Etherlink XL> rev 0x24 int a irq 15 on pci0.20.0
xl2: Ethernet address: 00:10:4b:c5:c3:4d
xl2: autoneg complete, link status good (half-duplex, 10Mbps)
Probing for devices on PCI bus 1:
vga0: rev 0x7a on pci1.0.0
Probing for devices on the ISA bus:
sc0 on isa
sc0: VGA color <16 virtual consoles, flags=0x0>
atkbdc0 at 0x60-0x6f on motherboard
atkbd0 irq 1 on isa
sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa
sio0: type 16550A, console
sio1 at 0x2f8-0x2ff irq 3 on isa
sio1: type 16550A
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1.44MB 3.5in
wdc0 at 0x1f0-0x1f7 irq 14 on isa
wdc0: unit 0 (atapi): , removable, dma, iordis
acd0: drive speed 6890KB/sec, 128KB cache
acd0: supported read types: CD-R, CD-RW, CD-DA
acd0: Audio: play, 255 volume levels
acd0: Mechanism: ejectable tray
acd0: Medium: no/blank disc inside, unlocked
ppc0 not found
vga0 at 0x3b0-0x3df maddr 0xa0000 msize 131072 on isa
npx0 on motherboard
npx0: INT 16 interface
IP packet filtering initialized, divert disabled, rule-based forwarding disabled, logging limited to 100 packets/entry by default
Waiting 2 seconds for SCSI devices to settle
cda0 at ahc0 bus 0 target 0 lun 0
da0: Fixed Direct Access SCSI-3 device
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
da1 at ahc0 bus 0 target 1 lun 0
da1: Fixed Direct Access SCSI-3 device
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
hanging root device to da0s1a
WARNING: / was not properly dismounted

Anyone have any ideas, or is there more info I can supply?
(The machine is running headless with console on a terminal
server; I'll try to get a head on it to dump to DDB next time
it happens.)

There's definitely something nasty here. :-(
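
For anyone wanting to check the remark that top's memory summary doesn't agree with the process listing, the figures can be totalled with a short Python sketch. The rows and the "Mem:" line are copied from the paste in this message. The helper names (to_kb, column_totals, mem_summary_kb) are made up for illustration. Only the 17 processes visible in the listing are counted, not all 138, so this shows the arithmetic rather than a full accounting of the machine's memory.

```python
# Rough sketch: total the SIZE and RES columns of the processes shown in
# the pasted "top" output and print them next to top's own "Mem:" summary.
# Assumption: only the 17 visible rows are included (top reported 138).

TOP_ROWS = """\
20251 uucp -20 0 23872K 12376K VM pgd 14:46 0.00% 0.00% smtpd
16266 uucp -20 0 23808K 428K VM pgd 14:30 0.00% 0.00% smtpd
825 philh 28 0 1248K 660K RUN 2:22 0.00% 0.00% top
163 bind 2 0 3080K 2284K select 0:30 0.00% 0.00% named
399 root 2 0 1180K 684K select 0:16 0.00% 0.00% sshd1
368 root -18 0 1444K 616K vmwait 0:11 0.00% 0.00% xinetd
156 root 2 0 828K 472K select 0:10 0.00% 0.00% syslogd
167 root 2 -12 1044K 612K select 0:09 0.00% 0.00% xntpd
362 root -18 0 1156K 668K vmwait 0:08 0.00% 0.00% sshd1
247 uucp 10 0 792K 488K nanslp 0:05 0.00% 0.00% smtpfwdd
303 root 2 0 920K 508K accept 0:03 0.00% 0.00% socks5
22710 root -18 0 972K 616K vmwait 0:03 0.00% 0.00% find
335 root 2 0 920K 508K accept 0:01 0.00% 0.00% socks5
322 root 2 0 920K 512K accept 0:01 0.00% 0.00% socks5
1 root 10 0 496K 116K wait 0:01 0.00% 0.00% init
296 root 2 0 920K 596K accept 0:00 0.00% 0.00% socks5
294 root 2 0 920K 596K accept 0:00 0.00% 0.00% socks5
"""

MEM_LINE = "23M Active, 62M Inact, 39M Wired, 8345K Buf, 616K Free"


def to_kb(field):
    """Convert a top size field such as '23872K' or '23M' to kilobytes."""
    units = {"K": 1, "M": 1024}
    return int(field[:-1]) * units[field[-1]]


def column_totals(rows):
    """Return (total SIZE, total RES) in KB over all non-empty rows."""
    size_kb = 0
    res_kb = 0
    for line in rows.splitlines():
        parts = line.split()
        if not parts:
            continue
        # Fields 0-3 are PID, USERNAME, PRI, NICE; 4 is SIZE, 5 is RES.
        size_kb += to_kb(parts[4])
        res_kb += to_kb(parts[5])
    return size_kb, res_kb


def mem_summary_kb(mem_line):
    """Parse the body of top's 'Mem:' line into a {label: KB} dict."""
    summary = {}
    for item in mem_line.split(","):
        value, label = item.split()
        summary[label] = to_kb(value)
    return summary


if __name__ == "__main__":
    size_kb, res_kb = column_totals(TOP_ROWS)
    print("visible processes: SIZE total %dK, RES total %dK" % (size_kb, res_kb))
    for label, kb in mem_summary_kb(MEM_LINE).items():
        print("top Mem: %-6s %dK" % (label, kb))
```

Feeding it a fuller listing (for example, one captured with top in batch mode and enough rows to cover every process) would make the comparison more meaningful than the partial paste used here.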