From owner-freebsd-questions  Mon Jan 17 15:01:01 2000
Delivered-To: freebsd-questions@freebsd.org
Received: from blockhead.mincom.com (blockhead2.mincom.com [203.15.57.33])
	by hub.freebsd.org (Postfix) with ESMTP id 0D7A415035
	for <questions@freebsd.org>; Mon, 17 Jan 2000 15:00:40 -0800 (PST)
	(envelope-from philh@mincom.com)
Received: (from uucp@localhost)
	by blockhead.mincom.com (8.9.3/8.9.3) id JAA34832
	for <questions@freebsd.org>; Tue, 18 Jan 2000 09:00:29 +1000 (EST)
	(envelope-from philh@mincom.com)
Received: from porthole.mincom.oz.au(172.17.100.2)
 via SMTP by blockhead.mincom.oz.au, id smtpdy34830; Tue Jan 18 09:00:29 2000
Received: (from philh@localhost)
	by porthole.mincom.oz.au (8.9.3/8.8.5) id JAA26214
	for questions@freebsd.org; Tue, 18 Jan 2000 09:00:29 +1000 (EST)
Date: Tue, 18 Jan 2000 09:00:28 +1000
From: Phil Homewood <philh@mincom.com>
To: questions@freebsd.org
Subject: Strange lockups/lost response/VMbug, 3.3-STABLE (22Nov1999)
Message-ID: <20000118090028.C28105@mincom.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.95.5i
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

One of our firewall machines here has "locked up" three times in
as many (business) days.

I put "locked up" in quotes because it's not completely wedged;
the "top" I have running is still happily updating; it accepts
connections but does nothing with them; and the serial console
is completely unresponsive (not even echoing).

Curiously, and I suspect not coincidentally, the second of our
firewalls suffered what appeared to be the same fate yesterday.

The first machine is only new (commissioned a couple of months
ago); however, the other has been in service for over 12 months
without a single glitch, running 3.1-STABLE; it was cvsupped to
3.4-STABLE a couple of days after 3.4 went -RELEASE.

The machines are nearly identical, save for disk size and CPU
(the older machine is a P2/333, the new one is a Celeron 400.)
Both have two disks hanging off ahc controllers, and both have
three 3C905B NICs.

Following is a cut-and-paste of the "top" currently running on
the now-wedged machine:

last pid: 22720;  load averages:  4.00,  4.00,  4.00    up 0+22:58:56  08:39:25
138 processes: 1 running, 135 sleeping, 2 zombie
CPU states:  0.0% user,  0.0% nice,  0.4% system,  1.2% interrupt, 98.4% idle
Mem: 23M Active, 62M Inact, 39M Wired, 8345K Buf, 616K Free
Swap: 256M Total, 256M Free

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
20251 uucp     -20   0 23872K 12376K VM pgd  14:46  0.00%  0.00% smtpd
16266 uucp     -20   0 23808K   428K VM pgd  14:30  0.00%  0.00% smtpd
  825 philh     28   0  1248K   660K RUN      2:22  0.00%  0.00% top
  163 bind       2   0  3080K  2284K select   0:30  0.00%  0.00% named
  399 root       2   0  1180K   684K select   0:16  0.00%  0.00% sshd1
  368 root     -18   0  1444K   616K vmwait   0:11  0.00%  0.00% xinetd
  156 root       2   0   828K   472K select   0:10  0.00%  0.00% syslogd
  167 root       2 -12  1044K   612K select   0:09  0.00%  0.00% xntpd
  362 root     -18   0  1156K   668K vmwait   0:08  0.00%  0.00% sshd1
  247 uucp      10   0   792K   488K nanslp   0:05  0.00%  0.00% smtpfwdd
  303 root       2   0   920K   508K accept   0:03  0.00%  0.00% socks5
22710 root     -18   0   972K   616K vmwait   0:03  0.00%  0.00% find
  335 root       2   0   920K   508K accept   0:01  0.00%  0.00% socks5
  322 root       2   0   920K   512K accept   0:01  0.00%  0.00% socks5
    1 root      10   0   496K   116K wait     0:01  0.00%  0.00% init
  296 root       2   0   920K   596K accept   0:00  0.00%  0.00% socks5
  294 root       2   0   920K   596K accept   0:00  0.00%  0.00% socks5

The state of the two smtpd processes, plus a "find" and other things
stuck in vmwait, seems to indicate some VM weirdness to me. The smtpd
processes really shouldn't be using that much memory; I suspect a
stupidly large message is in the process of being rejected. However,
the memory stats at the top don't agree with the process listing...

I just tried suspending the "top" and lost all control of the terminal.
Off to reboot....

OK, after reboot... last thing in the system logs was:

Jan 18 08:05:30 blocker Socks5[333]: TCP Connection Established: Connect (xxx.xxx.xxx.xxx:xxxx to xxx.xxx.xxx.xxx:xxxx) for user
Jan 18 08:07:11 blocker -- MARK --
Jan 18 08:12:11 blocker -- MARK --
Jan 18 08:17:11 blocker -- MARK --
Jan 18 08:22:11 blocker -- MARK -- 
Jan 18 08:27:11 blocker -- MARK --
Jan 18 08:32:11 blocker -- MARK --
Jan 18 08:37:11 blocker -- MARK --
Jan 18 08:42:11 blocker -- MARK --

followed by silence until the reboot. (Loss of response seems
to have occurred around 08:05, so syslogd at least was still
working while the machine was unresponsive. Reboot was at
08:46.)

Nothing interesting logged on console, just the usual smtpd
whining about bad ident responses and incomplete spool files
(from the last reboot I guess).

This (or a similar) problem did occur in pre-commissioning testing
of the machine, but was put down to hardware, and a replacement
2940 *seemed* to make the hangs go away. Back then, though, we
did get SCSI errors on console (see previous messages by me in
-questions in Oct/Nov 1999).

dmesg.boot from the newer machine:

Copyright (c) 1992-1999 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
	The Regents of the University of California. All rights reserved.
FreeBSD 3.3-STABLE #0: Mon Nov 22 14:24:08 EST 1999
    root@blocker.mincom.oz.au:/usr/src/sys/compile/BLOCKER
Timecounter "i8254"  frequency 1193182 Hz
Timecounter "TSC"  frequency 400911175 Hz
CPU: Celeron (400.91-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x665  Stepping = 5
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
real memory  = 134217728 (131072K bytes)
avail memory = 127815680 (124820K bytes)
Preloaded elf kernel "kernel" at 0xc026f000.
Pentium Pro MTRR support enabled
Probing for devices on PCI bus 0:
chip0: <Intel 82443BX host to PCI bridge> rev 0x03 on pci0.0.0
chip1: <Intel 82443BX host to AGP bridge> rev 0x03 on pci0.1.0
chip2: <Intel 82371AB PCI to ISA bridge> rev 0x02 on pci0.7.0
ide_pci0: <Intel PIIX4 Bus-master IDE controller> rev 0x01 on pci0.7.1
chip3: <Intel 82371AB Power management controller> rev 0x02 on pci0.7.3
xl0: <3Com 3c905B-TX Fast Etherlink XL> rev 0x30 int a irq 11 on pci0.14.0
xl0: Ethernet address: 00:10:5a:72:43:a0
xl0: autoneg complete, link status good (half-duplex, 10Mbps)
xl1: <3Com 3c905B-TX Fast Etherlink XL> rev 0x24 int a irq 10 on pci0.18.0
xl1: Ethernet address: 00:10:4b:c5:b6:14
xl1: autoneg complete, link status good (half-duplex, 10Mbps)
ahc0: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 15 on pci0.19.0
ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
xl2: <3Com 3c905B-TX Fast Etherlink XL> rev 0x24 int a irq 15 on pci0.20.0
xl2: Ethernet address: 00:10:4b:c5:c3:4d
xl2: autoneg complete, link status good (half-duplex, 10Mbps)
Probing for devices on PCI bus 1:
vga0: <ATI model 4757 graphics accelerator> rev 0x7a on pci1.0.0
Probing for devices on the ISA bus:
sc0 on isa
sc0: VGA color <16 virtual consoles, flags=0x0>
atkbdc0 at 0x60-0x6f on motherboard
atkbd0 irq 1 on isa
sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa
sio0: type 16550A, console
sio1 at 0x2f8-0x2ff irq 3 on isa
sio1: type 16550A
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1.44MB 3.5in
wdc0 at 0x1f0-0x1f7 irq 14 on isa
wdc0: unit 0 (atapi): <ATAPI CD-ROM DRIVE 40X MAXIMUM/N0AP>, removable, dma, iordis
acd0: drive speed 6890KB/sec, 128KB cache
acd0: supported read types: CD-R, CD-RW, CD-DA
acd0: Audio: play, 255 volume levels
acd0: Mechanism: ejectable tray
acd0: Medium: no/blank disc inside, unlocked
ppc0 not found
vga0 at 0x3b0-0x3df maddr 0xa0000 msize 131072 on isa
npx0 on motherboard
npx0: INT 16 interface
IP packet filtering initialized, divert disabled, rule-based forwarding disabled, logging limited to 100 packets/entry by default
Waiting 2 seconds for SCSI devices to settle
da0 at ahc0 bus 0 target 0 lun 0
da0: <QUANTUM ATLAS IV 9 WLS 0808> Fixed Direct Access SCSI-3 device 
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
da1 at ahc0 bus 0 target 1 lun 0
da1: <QUANTUM ATLAS IV 9 WLS 0808> Fixed Direct Access SCSI-3 device 
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
changing root device to da0s1a
WARNING: / was not properly dismounted

Anyone have any ideas, or is there more info I can supply? (The
machine is running headless with console on a terminal server;
I'll try to get a head on it to drop to DDB next time it happens.)
There's definitely something nasty here. :-(
-- 
This transmission is for the intended addressee only and is confidential
information. If you have received this transmission in error, please delete
it and notify the sender. The contents of this email are the opinion of the
writer and are not endorsed by Mincom Ltd unless expressly stated otherwise.

