From owner-freebsd-stable  Mon Dec 21 20:53:07 1998
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id UAA24689
          for freebsd-stable-outgoing; Mon, 21 Dec 1998 20:53:07 -0800 (PST)
          (envelope-from owner-freebsd-stable@FreeBSD.ORG)
Received: from niwa.cri.nz (clam.niwa.cri.nz [131.203.55.1])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA24684
          for <freebsd-stable@freebsd.org>; Mon, 21 Dec 1998 20:53:03 -0800 (PST)
          (envelope-from w.knowles@niwa.cri.nz)
Received: from neptune (neptune.niwa.cri.nz [131.203.56.34])
	by niwa.cri.nz (8.9.1/8.9.1) with ESMTP id RAA23897
	for <freebsd-stable@freebsd.org>; Tue, 22 Dec 1998 17:52:56 +1300 (NZDT)
Date: Tue, 22 Dec 1998 17:52:55 +1300 (NZDT)
From: Wayne Knowles <w.knowles@niwa.cri.nz>
X-Sender: wdk@neptune.niwa.cri.nz
To: freebsd-stable@FreeBSD.ORG
Subject: Weird VM problem in 2.2.8 (sendmail related?)
Message-ID: <Pine.OSF.4.03.9812221533550.21527-100000@neptune.niwa.cri.nz>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


Hi,

Our firewall machine has been suffering various kernel panics over the
past few months.  The problems first appeared when sendmail was upgraded
from 8.8.8 to 8.9.1, after which there were several panics with the
following stack traceback (from DDB):

_pmap_remove_pages+0x107
_exit1
_exit
_syscall
.. etc ...

The running process varied: 'sed' from a news expire on some occasions
and 'sendmail' on others.

This was with FreeBSD 2.2.6, and at the beginning of the month I upgraded
to FreeBSD 2.2.8 hoping that it would solve the problem - no such luck!

Since then I have enabled neighbor cache queries for our squid proxy
cache, and the machine has crashed twice.  I wrote down the details of the
first panic, including a detailed traceback, but I cannot locate my notes;
it was in a VM-related call IIRC.  The second panic was at a different
address, _vm_page_insert+0x4e, written down by a colleague who didn't know
how to do a stack traceback.  For both panics the active process was
squid.

Also, I was unable to dump memory to disk: 'call diediedie' and 'call
dopanic' didn't work!

Since then I have compiled the kernel with debugging support, removed DDB,
and installed a 'strip -d' copy onto the system... and the pattern has now
changed!!

Today sendmail started to misbehave.  Running the 'mailq' command gave an
Illegal Instruction error.  Aha... time to see what is happening.

Killing all copies of sendmail and starting it again gave the same Illegal
Instruction error, but backup copies of sendmail made before previous
upgrades did work.  The sendmail executable that I originally built from
source differed from the file in /usr/sbin, indicating that the installed
file had been corrupted.

Comparing the two versions gives interesting results:

clam 86% cmp -l sendmail-good sendmail-bad
 23777 303   0
 23781 125 377
 23782 211 377
 23783 345 377
 23785 165   0
 23786  10   0
 23787 350   0
 23788 345   0
 23789 264  54
 23790   4   0
 23792 311   0
 23793 303   0
 23794 123   0
 23795 111   0
 23796 117   0
 23797 107 234
 23798 111 156
 23799 106  21
 23800 103   0
 23801 117 242
 23802 116 102
 23803 106   0
 23804  40   0
 23805 146   0
 23806 141   0
 23807 151   0
 23808 154   0
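
For anyone wanting to pick the listing apart, here is a rough sketch (not
part of the original investigation) that parses the 'cmp -l' output above.
It relies on cmp -l printing a 1-based decimal byte offset followed by the
two differing bytes in octal.  The helper names are made up for
illustration, and the 4096-byte page size is an assumption based on the
machine being an i386.

```python
# Hypothetical helpers for reading the `cmp -l` listing above.
# cmp -l prints: 1-based decimal offset, byte from file 1 (octal),
# byte from file 2 (octal).  Here file 1 is sendmail-good.
PAGE_SIZE = 4096  # assumption: i386 page size

CMP_OUTPUT = """\
 23777 303   0
 23781 125 377
 23782 211 377
 23783 345 377
 23785 165   0
 23786  10   0
 23787 350   0
 23788 345   0
 23789 264  54
 23790   4   0
 23792 311   0
 23793 303   0
 23794 123   0
 23795 111   0
 23796 117   0
 23797 107 234
 23798 111 156
 23799 106  21
 23800 103   0
 23801 117 242
 23802 116 102
 23803 106   0
 23804  40   0
 23805 146   0
 23806 141   0
 23807 151   0
 23808 154   0
"""


def parse_cmp_l(text):
    """Return a list of (offset, good_byte, bad_byte) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), int(parts[1], 8), int(parts[2], 8)))
    return rows


def affected_pages(rows, page_size=PAGE_SIZE):
    """0-based file page numbers that contain at least one differing byte."""
    return sorted({(offset - 1) // page_size for offset, _, _ in rows})


def good_text(rows):
    """Bytes from the good file, with non-printable bytes shown as '.'."""
    return "".join(chr(g) if 32 <= g < 127 else "." for _, g, _ in rows)


rows = parse_cmp_l(CMP_OUTPUT)
print("differing bytes:", len(rows))
print("pages touched:  ", affected_pages(rows))
print("good bytes:     ", good_text(rows))
```

Under that page-size assumption, all 27 differing bytes fall inside a
single page of the file, and the good bytes at offsets 23794-23808 decode
to the ASCII text "SIOGIFCONF fail", so the damage runs across a string
constant as well as whatever precedes it.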


I renamed the corrupted sendmail and installed sendmail again; that
version worked.
As expected, shortly afterwards (once the memory had been reclaimed for
other things) the corrupted file reverted to what it should have been.
Rebooting the machine would have had the same effect.

My analysis of what happened:

   Sendmail was read OK from disk and started as expected.

   After a few days of operation, something wrote to the wrong memory
   address and corrupted a block of memory.
   This area was the VM backing store for a disk file, and every time the
   file was read, the corrupted contents were returned.

   Killing all copies of sendmail meant that the block could be reclaimed
   for other uses if and when memory was required.

   At a later stage when the file was read, it was physically read from
   disk again and was error free.
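
If the analysis above is right, a corrupted in-memory copy should be
detectable while it lasts by comparing the installed binary against a
known-good copy kept under a different name.  The following is only a
sketch of that idea: the function name is invented, and the two paths are
supplied on the command line rather than taken from anything on this
system.  It could be run periodically (from cron, say) to catch the
corruption closer to the time it happens.

```python
import sys


def first_difference(good_path, suspect_path, chunk_size=65536):
    """Return the 0-based offset of the first differing byte,
    or None if the two files are identical."""
    offset = 0
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        while True:
            a = good.read(chunk_size)
            b = suspect.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # Common prefix matches, so one file is simply shorter.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files hit EOF with no difference
            offset += len(a)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (paths are examples): check.py sendmail-good /usr/sbin/sendmail
    where = first_difference(sys.argv[1], sys.argv[2])
    if where is None:
        print("files match")
    else:
        # 4096 is the assumed i386 page size.
        print("first difference at byte %d (page %d)" % (where, where // 4096))
```

Note that this reads the suspect file through the normal file interface,
so it sees whatever the kernel currently has cached for it, which is the
point: a mismatch that later disappears by itself would support the
theory above.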

I have heard several reports of various sendmail problems, which may or
may not have been fixed in -current.

Hopefully next time the machine panics (which might not happen now that
the kernel has been rebuilt) more information can be given...   Anybody
else seeing this kind of problem??

The machine handles a wide cross-section of demands from the following
software.  Surprisingly, the load never gets above 20%.

   Mail        (sendmail 8.9.1)
   News        (inn 2.1)
   WWW Proxy   (squid 2.1-patch2)
   Nameserver  (bind 8.1.2)
   IPFW

dmesg output, in case it is of interest:

Copyright (c) 1992-1998 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California.  All rights reserved.

FreeBSD 2.2.8-RELEASE #0: Thu Dec 17 11:01:44 NZDT 1998
    wdk@clam.niwa.cri.nz:/usr/src/sys/compile/CLAM
CPU: Pentium/P54C (165.91-MHz 586-class CPU)
  Origin = "GenuineIntel"  Id = 0x52c  Stepping=12
  Features=0x1bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8>
real memory  = 67108864 (65536K bytes)
avail memory = 64016384 (62516K bytes)
Probing for devices on PCI bus 0:
chip0 <Intel 82437FX PCI cache memory controller> rev 2 on pci0:0:0
chip1 <Intel 82371FB PCI-ISA bridge> rev 2 on pci0:7:0
chip2 <Intel 82371FB IDE interface> rev 2 on pci0:7:1
ahc0 <Adaptec 2940 Ultra SCSI host adapter> rev 0 int a irq 11 on pci0:8:0
ahc0: aic7880 Single Channel, SCSI Id=7, 16 SCBs
ahc0 waiting for scsi devices to settle
(ahc0:0:0): "DEC DSP3105S T388" type 0 fixed SCSI 2
sd0(ahc0:0:0): Direct-Access 1001MB (2050860 512 byte sectors)
(ahc0:1:0): "DEC DSP3105S T384" type 0 fixed SCSI 2
sd1(ahc0:1:0): Direct-Access 1001MB (2050860 512 byte sectors)
(ahc0:2:0): "DEC RZ28     (C) DEC D41C" type 0 fixed SCSI 2
sd2(ahc0:2:0): Direct-Access 2007MB (4110480 512 byte sectors)
(ahc0:3:0): "SEAGATE ST34572N 0784" type 0 fixed SCSI 2
sd3(ahc0:3:0): Direct-Access 4340MB (8888924 512 byte sectors)
de0 <Digital 21040 Ethernet> rev 35 int a irq 9 on pci0:9:0
de0: DEC 21040 [10Mb/s] pass 2.3
de0: address 08:00:2b:e4:13:bc
de1 <Digital 21040 Ethernet> rev 35 int a irq 10 on pci0:10:0
de1: DEC 21040 [10Mb/s] pass 2.3
de1: address 08:00:2b:e2:b8:d5
Probing for devices on the ISA bus:
sc0 at 0x60-0x6f irq 1 on motherboard
sc0: MDA/hercules <16 virtual consoles, flags=0x0>
sio0 at 0x3f8-0x3ff irq 4 on isa
sio0: type 16550A
sio1 at 0x2f8-0x2ff irq 3 on isa
sio1: type 16550A
lpt0 at 0x3bc-0x3c3 irq 7 on isa
lpt0: Interrupt-driven port
lp0: TCP/IP capable interface
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1.44MB 3.5in
npx0 flags 0x1 on motherboard
npx0: INT 16 interface
Intel Pentium F00F detected, installing workaround
IP packet filtering initialized, divert disabled, unlimited logging
de1: enabling 10baseT port
de0: enabling 10baseT port

I would have thought the virtual memory for the corrupted block would
have been mapped read-only, producing a trap when it was written to.
Perhaps this isn't happening, or another process has the same block mapped
for writing as well...

Any kernel patches, or ideas on where to put debugging hooks, would be
appreciated.

Help!

Wayne
--
  _____	   	Wayne Knowles,  Systems Manager
 / o   \/   	National Institute of Water & Atmospheric Research Ltd
 \/  v /\   	P.O. Box 14-901 Kilbirnie, Wellington, NEW ZEALAND
  `---'     	Email:   w.knowles@niwa.cri.nz

