From owner-freebsd-stable Mon Dec 21 20:53:07 1998
Return-Path:
Received: (from majordom@localhost)
        by hub.freebsd.org (8.8.8/8.8.8) id UAA24689
        for freebsd-stable-outgoing; Mon, 21 Dec 1998 20:53:07 -0800 (PST)
        (envelope-from owner-freebsd-stable@FreeBSD.ORG)
Received: from niwa.cri.nz (clam.niwa.cri.nz [131.203.55.1])
        by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA24684
        for ; Mon, 21 Dec 1998 20:53:03 -0800 (PST)
        (envelope-from w.knowles@niwa.cri.nz)
Received: from neptune (neptune.niwa.cri.nz [131.203.56.34])
        by niwa.cri.nz (8.9.1/8.9.1) with ESMTP id RAA23897
        for ; Tue, 22 Dec 1998 17:52:56 +1300 (NZDT)
Date: Tue, 22 Dec 1998 17:52:55 +1300 (NZDT)
From: Wayne Knowles
X-Sender: wdk@neptune.niwa.cri.nz
To: freebsd-stable@FreeBSD.ORG
Subject: Weird VM problem in 2.2.8 (sendmail related?)
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Hi,

Our firewall machine has been experiencing various kernel panics over
the past few months. The problems started to appear when sendmail was
upgraded from 8.8.8 to 8.9.1, which produced several panics with the
following stack traceback (from DDB):

    _pmap_remove_pages+0x107
    _exit1
    _exit
    _syscall
    .. etc ...

The running process varied - 'sed' from a news expire on some occasions
and 'sendmail' on others. This was with FreeBSD 2.2.6, and at the
beginning of the month I upgraded to FreeBSD 2.2.8 hoping that it would
solve the problem - no such luck!

Since then I have enabled neighbor cache queries for our squid proxy
cache, and the machine has crashed twice. I wrote down the details of
the first panic, including a detailed traceback, but I cannot locate
them - it was a VM related call IIRC. The second panic was at a
different address, _vm_page_insert+0x4e; the colleague who wrote it
down didn't know how to do a stack traceback. For both panics the
active process was squid.
Also, I was unable to dump memory to disk: 'call diediedie' and
'call dopanic' didn't work! Since then I have compiled the kernel with
debugging support, removed DDB, and installed a 'strip -d' copy onto
the system... the patterns have now changed!!

Today sendmail started to misbehave. Running the 'mailq' command gave
an Illegal Instruction error. Aha... time to see what is happening.
Killing all copies of sendmail and starting it again gave the same
Illegal Instruction error, but backup copies of sendmail made before
previous upgrades did work. The original sendmail executable that I
built from sources differed from the file in /usr/sbin, indicating the
file had been corrupted. Comparing the 2 versions I get interesting
results:

clam 86% cmp -l sendmail-good sendmail-bad
 23777 303   0
 23781 125 377
 23782 211 377
 23783 345 377
 23785 165   0
 23786  10   0
 23787 350   0
 23788 345   0
 23789 264  54
 23790   4   0
 23792 311   0
 23793 303   0
 23794 123   0
 23795 111   0
 23796 117   0
 23797 107 234
 23798 111 156
 23799 106  21
 23800 103   0
 23801 117 242
 23802 116 102
 23803 106   0
 23804  40   0
 23805 146   0
 23806 141   0
 23807 151   0
 23808 154   0

I renamed the corrupted sendmail, and installed it again - that version
worked. As expected, shortly afterwards (as the memory was reclaimed
for other things) the corrupted file reverted to what it should have
been. If I had rebooted the machine it would have had the same effect.

My analysis of what happened:

 - Sendmail was read OK from disk and started as expected.
 - After a few days of operation, something wrote to the wrong memory
   address and corrupted a block of memory.
 - This area was the VM backing store for a disk file, and every time
   the file was read, it gave back the corrupted result.
 - Killing all copies of sendmail meant that the block could be
   reclaimed for other uses if and when memory was required.
 - At a later stage, when the file was read, it was physically read
   from disk again and was error free.
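
The cmp -l listing can be checked against this page-level explanation with a short script. What follows is only a minimal sketch: it assumes 4096-byte pages (the usual i386 page size) and that cmp -l prints 1-based decimal offsets followed by the two differing bytes in octal. The function names are hypothetical and exist only for this illustration.

```python
PAGE_SIZE = 4096  # assumption: i386 page size on this machine

# The cmp -l listing from the message: offset (1-based, decimal),
# byte in sendmail-good (octal), byte in sendmail-bad (octal).
CMP_OUTPUT = """
23777 303 0
23781 125 377
23782 211 377
23783 345 377
23785 165 0
23786 10 0
23787 350 0
23788 345 0
23789 264 54
23790 4 0
23792 311 0
23793 303 0
23794 123 0
23795 111 0
23796 117 0
23797 107 234
23798 111 156
23799 106 21
23800 103 0
23801 117 242
23802 116 102
23803 106 0
23804 40 0
23805 146 0
23806 141 0
23807 151 0
23808 154 0
"""


def parse_cmp_l(text):
    """Return a list of (offset, good_byte, bad_byte) tuples."""
    rows = []
    for line in text.split("\n"):
        fields = line.split()
        if len(fields) == 3:
            # Offsets are decimal; byte values are octal.
            rows.append((int(fields[0]), int(fields[1], 8), int(fields[2], 8)))
    return rows


def pages_touched(rows, page_size=PAGE_SIZE):
    """Return the sorted zero-based file page indexes containing differences."""
    return sorted({(offset - 1) // page_size for offset, _, _ in rows})


def good_bytes_as_text(rows, first_offset):
    """Render the good file's bytes from first_offset on, '.' if unprintable."""
    return "".join(
        chr(good) if 32 <= good < 127 else "."
        for offset, good, _ in rows
        if offset >= first_offset
    )


if __name__ == "__main__":
    rows = parse_cmp_l(CMP_OUTPUT)
    print("differing bytes:", len(rows))
    print("span:", rows[0][0], "to", rows[-1][0])
    print("file pages touched:", pages_touched(rows))
    print("good bytes from 23794:", good_bytes_as_text(rows, 23794))
```

Under those assumptions, all 27 differing bytes lie between offsets 23777 and 23808, inside a single page of the file (page index 5). The good bytes from offset 23794 onward decode to the ASCII text "SIOGIFCONF fail", which looks like part of a message string in the binary. Both observations are consistent with a small stray write into one cached page rather than damage on disk.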

I have heard several reports of various sendmail problems, which may or
may not have been fixed in -current. Hopefully next time the machine
panics (which might not happen now that the kernel has been rebuilt)
more information can be given... Anybody else seeing this kind of
problem??

The machine has an extreme cross-section of demands from the following
software. Surprisingly, the load never gets above 20%.

    Mail        (sendmail 8.9.1)
    News        (inn 2.1)
    WWW Proxy   (squid 2.1-patch2)
    Nameserver  (bind 8.1.2)
    IPFW

DMESG output if it is of interest:

Copyright (c) 1992-1998 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California. All rights reserved.

FreeBSD 2.2.8-RELEASE #0: Thu Dec 17 11:01:44 NZDT 1998
    wdk@clam.niwa.cri.nz:/usr/src/sys/compile/CLAM
CPU: Pentium/P54C (165.91-MHz 586-class CPU)
  Origin = "GenuineIntel"  Id = 0x52c  Stepping=12
  Features=0x1bf
real memory  = 67108864 (65536K bytes)
avail memory = 64016384 (62516K bytes)
Probing for devices on PCI bus 0:
chip0 rev 2 on pci0:0:0
chip1 rev 2 on pci0:7:0
chip2 rev 2 on pci0:7:1
ahc0 rev 0 int a irq 11 on pci0:8:0
ahc0: aic7880 Single Channel, SCSI Id=7, 16 SCBs
ahc0 waiting for scsi devices to settle
(ahc0:0:0): "DEC DSP3105S T388" type 0 fixed SCSI 2
sd0(ahc0:0:0): Direct-Access 1001MB (2050860 512 byte sectors)
(ahc0:1:0): "DEC DSP3105S T384" type 0 fixed SCSI 2
sd1(ahc0:1:0): Direct-Access 1001MB (2050860 512 byte sectors)
(ahc0:2:0): "DEC RZ28 (C) DEC D41C" type 0 fixed SCSI 2
sd2(ahc0:2:0): Direct-Access 2007MB (4110480 512 byte sectors)
(ahc0:3:0): "SEAGATE ST34572N 0784" type 0 fixed SCSI 2
sd3(ahc0:3:0): Direct-Access 4340MB (8888924 512 byte sectors)
de0 rev 35 int a irq 9 on pci0:9:0
de0: DEC 21040 [10Mb/s] pass 2.3
de0: address 08:00:2b:e4:13:bc
de1 rev 35 int a irq 10 on pci0:10:0
de1: DEC 21040 [10Mb/s] pass 2.3
de1: address 08:00:2b:e2:b8:d5
Probing for devices on the ISA bus:
sc0 at 0x60-0x6f irq 1 on motherboard
sc0: MDA/hercules <16 virtual consoles, flags=0x0>
sio0 at 0x3f8-0x3ff irq 4 on isa
sio0: type 16550A
sio1 at 0x2f8-0x2ff irq 3 on isa
sio1: type 16550A
lpt0 at 0x3bc-0x3c3 irq 7 on isa
lpt0: Interrupt-driven port
lp0: TCP/IP capable interface
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1.44MB 3.5in
npx0 flags 0x1 on motherboard
npx0: INT 16 interface
Intel Pentium F00F detected, installing workaround
IP packet filtering initialized, divert disabled, unlimited logging
de1: enabling 10baseT port
de0: enabling 10baseT port

I would have thought the virtual memory for the corrupted block would
have been mapped read-only, producing a trap when it was written to.
Perhaps this isn't happening, or another process has the same block
mapped for write as well...

Any kernel patches or ideas on where to put debugging hooks will be
appreciated.

Help!

Wayne
--
  _____      Wayne Knowles, Systems Manager
 / o   \/    National Institute of Water & Atmospheric Research Ltd
 \/  v /\    P.O. Box 14-901 Kilbirnie, Wellington, NEW ZEALAND
  `---'      Email: w.knowles@niwa.cri.nz

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message