From owner-freebsd-current  Fri Aug 28 01:09:38 1998
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id BAA14881
          for freebsd-current-outgoing; Fri, 28 Aug 1998 01:09:38 -0700 (PDT)
          (envelope-from owner-freebsd-current@FreeBSD.ORG)
Received: from grape.carrier.kiev.ua (grape.carrier.kiev.ua [193.193.193.219])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id BAA14872
          for <current@freebsd.org>; Fri, 28 Aug 1998 01:09:30 -0700 (PDT)
          (envelope-from archer@grape.carrier.kiev.ua)
Received: (from archer@localhost)
	by grape.carrier.kiev.ua (8.9.1/8.8.8) id LAA00123;
	Fri, 28 Aug 1998 11:08:06 +0300 (EEST)
	(envelope-from archer)
Date: Fri, 28 Aug 1998 11:08:06 +0300 (EEST)
From: Alexander Litvin <archer@lucky.net>
Message-Id: <199808280808.LAA00123@grape.carrier.kiev.ua>
To: Archie Cobbs <archie@whistle.com>
Cc: current@FreeBSD.ORG
Subject: Re: encountered possible VM bug ?
X-Newsgroups: grape.freebsd.current
In-Reply-To: <199808272051.NAA27400@bubba.whistle.com>
Organization: Lucky Grape
User-Agent: tin/pre-1.4-980202 (UNIX) (FreeBSD/3.0-CURRENT (i386))
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

In article <199808272051.NAA27400@bubba.whistle.com> you wrote:

>> GW> No, this is the ``daemons dying'' bug which nobody has fixed yet.
>> GW> When the system runs out of swap, some random selection of processes
>> GW> which are in swap get corrupted.  Usually this results in a daemon
>> GW> which dies whenever it fork()s, but sometimes it is manifested as
>> GW> other sorts of corruption.  The message you see from realloc is
>> GW> indicative of a corrupted pointer.
>> 
>> Really? I was under the impression that the problem was only with fork().
>> But now I can confirm that processes get corrupted in different ways.
>> E.g., I now have a specially written dummy daemon running, which I
>> was able to corrupt (by intentionally exhausting swap) in such a way that
>> it forks successfully. Then the child process sleeps (just to give me
>> a chance to attach to it with a debugger), allocates memory, accesses it
>> -- and during all that it doesn't get SIGSEGV. But then it dies when
>> trying to syslog(3). It seems that the corruption is in the mmapped ld.so
>> or libc.so.3.1.
>> 
>> If anybody cares, I can try to provide more details.

AC> At Whistle, we've seen this bug every so often for a long time.
AC> The common elements seem to be:

AC>  1. memory mapping is in use
AC>  2. a fork() is happening or just happened

AC> But #1 and #2 are not necessarily both related to the same process.
AC> This bug has been around for a *long* time, in both 2.x and 3.x.

I saw bash exit with SIGSEGV. It was not trying to fork a job.
It was swapped out; I just hit <Enter>, and it exited with signal 11.
Cron, when it is not segfaulting, sometimes just stops forking cron
jobs -- it simply doesn't try to fork.

AC> Running out of swap may or may not be related, not sure... I think
AC> we've seen this when swap was not an issue. Perhaps running out of
AC> swap amplifies the problem.

AC> It's really hard to pin down, because the panic seems to come a
AC> while after the initial damage is done. We've seen random processes
AC> crashing every time they try to fork(), kernel panics because of
AC> some process being on two different queues at the same time (eg,
AC> sleep and runnable), and other manifestations.

AC> A common manifestation is that a file being written out contains
AC> some random page of memory from some other file -- we think the other
AC> file is a currently mmap'd file.

In my case it seems that the process has some of its pages zeroed.
At least here's the symptom (I still have it running and segfaulting
-- for investigation ;):

root:~/dummy_daemon:grape:> gdb dummy_daemon 29643
[...]
Attaching to program `/usr/home/archer/dummy_daemon/dummy_daemon', process 29643

Reading symbols from /usr/libexec/ld.so...done.
Reading symbols from /usr/lib/aout/libc.so.3.1...done.

Error accessing memory address 0x0: Bad address.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What exactly does that line mean? When I attach to a healthy dummy_daemon,
it does not appear; instead I see:

0x20057c21 in nanosleep ()
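
For anyone who wants to try reproducing this, a probe along the lines of
the dummy daemon described above might look roughly like the sketch below.
This is not the actual dummy_daemon source: the function name probe_once,
its parameters, the 0xa5 fill pattern and the log message are all
assumptions made for illustration. It only uses standard calls (fork,
sleep, malloc, syslog, waitpid), and it does not try to exhaust swap by
itself.

```c
/* Hypothetical sketch of a fork/sleep/allocate/syslog probe. */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <syslog.h>
#include <unistd.h>

/*
 * Fork once.  The child sleeps (to leave time to attach a debugger),
 * allocates `bytes` of memory, writes to every byte, and then logs
 * through syslog(3).
 *
 * Returns 0 if the child exited cleanly, the terminating signal number
 * if the child was killed by a signal, and -1 on fork/wait failure or
 * if the child exited with a nonzero status.
 */
int
probe_once(size_t bytes, unsigned int sleep_seconds)
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1)
		return (-1);

	if (pid == 0) {
		/* Child: sleep, allocate, touch the memory, then log. */
		char *buf;
		volatile char *p;
		size_t i;

		sleep(sleep_seconds);
		buf = malloc(bytes);
		if (buf == NULL)
			_exit(1);
		/* volatile keeps the compiler from dropping the stores */
		p = buf;
		for (i = 0; i < bytes; i++)
			p[i] = (char)0xa5;
		openlog("dummy_daemon", LOG_PID, LOG_DAEMON);
		syslog(LOG_INFO, "child touched %lu bytes",
		    (unsigned long)bytes);
		closelog();
		free(buf);
		_exit(0);
	}

	/* Parent: report how the child ended. */
	if (waitpid(pid, &status, 0) == -1)
		return (-1);
	if (WIFSIGNALED(status))
		return (WTERMSIG(status));
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return (0);
	return (-1);
}
```

Calling something like probe_once(1024 * 1024, 30) in a loop while swap is
being exhausted would give a 30-second window to attach gdb to each child;
a return value of 11 would correspond to the child dying with SIGSEGV.
The size and the sleep interval here are arbitrary.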

AC> Julian and Terry can supply more details.

AC> -Archie

AC> ___________________________________________________________________________
AC> Archie Cobbs   *   Whistle Communications, Inc.  *   http://www.whistle.com
--- 
It's lucky you're going so slowly, because you're going in the wrong
direction.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message