Date:      Fri, 28 Aug 1998 11:08:06 +0300 (EEST)
From:      Alexander Litvin <archer@lucky.net>
To:        Archie Cobbs <archie@whistle.com>
Cc:        current@FreeBSD.ORG
Subject:   Re: encountered possible VM bug ?
Message-ID:  <199808280808.LAA00123@grape.carrier.kiev.ua>
In-Reply-To: <199808272051.NAA27400@bubba.whistle.com>

In article <199808272051.NAA27400@bubba.whistle.com> you wrote:

>> GW> No, this is the ``daemons dying'' bug which nobody has fixed yet.
>> GW> When the system runs out of swap, some random selection of processes
>> GW> which are in swap get corrupted.  Usually this results in a daemon
>> GW> which dies whenever it fork()s, but sometimes it is manifested as
>> GW> other sorts of corruption.  The message you see from realloc is
>> GW> indicative of a corrupted pointer.
>> 
>> Really, I was under the impression that it is a problem just with fork().
>> But now I can confirm that processes get corrupted in different ways.
>> E.g., I now have a specially written dummy daemon running, which I
>> was able to corrupt (by intentionally exhausting swap) in such a way that
>> it successfully forks. Then the child process sleeps (just to give me
>> a chance to attach to it with a debugger), allocates memory, accesses it
>> -- and during all that it doesn't get SIGSEGV. But then it dies when
>> trying to syslog(3). It seems that the corruption is in the mmapped ld.so
>> or libc.so.3.1.
>> 
>> If anybody cares, I may try to give any other details.
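
For concreteness, here is a minimal sketch of the fork-and-child sequence
described above: fork, sleep in the child, allocate and touch memory, then
log via syslog(3). The source of the real dummy daemon is not shown in this
thread, so everything here is an assumption made for illustration: the use
of Python, the names run_child and fork_once, the 1 MiB buffer, and the
4 KiB page stride.

```python
import os
import syslog
import time


def run_child(sleep_seconds=30.0, alloc_bytes=1 << 20):
    """Body of the forked child: sleep, touch fresh memory, then log."""
    # Sleep first so that a debugger can be attached to the child.
    time.sleep(sleep_seconds)
    # Allocate a buffer and write one byte per 4 KiB stride
    # (the page size is an assumption, not taken from the thread).
    buf = bytearray(alloc_bytes)
    for offset in range(0, alloc_bytes, 4096):
        buf[offset] = 0xFF
    # Finally call into syslog(3), the step at which the corrupted child died.
    syslog.openlog("dummy_daemon", syslog.LOG_PID, syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO,
                  "child %d touched %d bytes" % (os.getpid(), alloc_bytes))
    syslog.closelog()


def fork_once(sleep_seconds=30.0, alloc_bytes=1 << 20):
    """Fork one child, wait for it, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        # Child: report failure unless run_child() completes, and never
        # return into the parent's code path.
        code = 1
        try:
            run_child(sleep_seconds, alloc_bytes)
            code = 0
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return status
```

A parent loop could call fork_once() repeatedly and check the result with
os.WIFSIGNALED(status) and os.WTERMSIG(status) to notice a child killed by
signal 11.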

AC> At Whistle, we've seen this bug every so often for a long time.
AC> The common elements seem to be:

AC>  1. memory mapping is in use
AC>  2. a fork() is happening or just happened

AC> But #1 and #2 are not necessarily both related to the same process.
AC> This bug has been around for a *long* time, in both 2.x and 3.x.

I saw bash exit with SIGSEGV. It was not trying to fork a job.
It was swapped out, I just hit <Enter>, and it exited with signal 11.
Cron sometimes seems to just stop forking cron jobs; when it is not
segfaulting, it simply doesn't try to fork.

AC> Running out of swap may or may not be related, not sure... I think
AC> we've seen this when swap was not an issue. Perhaps running out of
AC> swap amplifies the problem.

AC> It's really hard to pin down, because the panic seems to come a
AC> while after the initial damage is done. We've seen random processes
AC> crashing every time they try to fork(), kernel panics because of
AC> some process being on two different queues at the same time (eg,
AC> sleep and runnable), and other manifestations.

AC> A common manifestation is that a file being written out contains
AC> some random page of memory from some other file -- we think the other
AC> file is a currently mmap'd file.

In my case it seems that the process has some of its pages zeroed.
At least here's the symptom (I still have it running and segfaulting
-- for investigation ;):

root:~/dummy_daemon:grape:> gdb dummy_daemon 29643
[...]
Attaching to program `/usr/home/archer/dummy_daemon/dummy_daemon', process 29643

Reading symbols from /usr/libexec/ld.so...done.
Reading symbols from /usr/lib/aout/libc.so.3.1...done.

Error accessing memory address 0x0: Bad address.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What exactly does that line mean? When I attach to a healthy dummy_daemon,
it does not appear; instead I see:

0x20057c21 in nanosleep ()

AC> Julian and Terry can supply more details.

AC> -Archie

AC> ___________________________________________________________________________
AC> Archie Cobbs   *   Whistle Communications, Inc.  *   http://www.whistle.com

--- 
It's lucky you're going so slowly, because you're going in the wrong
direction.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message


