From owner-freebsd-hackers@FreeBSD.ORG  Wed Apr  2 07:13:19 2003
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 10C0E37B404
	for <hackers@freebsd.org>; Wed,  2 Apr 2003 07:13:19 -0800 (PST)
Received: from puffin.mail.pas.earthlink.net (puffin.mail.pas.earthlink.net
	[207.217.120.139])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 478EA43FB1
	for <hackers@freebsd.org>; Wed,  2 Apr 2003 07:13:18 -0800 (PST)
	(envelope-from tlambert2@mindspring.com)
Received: from pool0051.cvx21-bradley.dialup.earthlink.net ([209.179.192.51]
	helo=mindspring.com)
	by puffin.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128)
	(Exim 3.33 #1)	id 190jvO-0006lk-00; Wed, 02 Apr 2003 07:13:14 -0800
Message-ID: <3E8AFD9E.A34213B4@mindspring.com>
Date: Wed, 02 Apr 2003 07:11:26 -0800
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Dmitry Sivachenko <mitya@cavia.pp.ru>
References: <20030402134428.GA43549@fling-wing.demos.su>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
cc: hackers@freebsd.org
Subject: Re: Repeated similar panics on -STABLE
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
X-List-Received-Date: Wed, 02 Apr 2003 15:13:19 -0000

Dmitry Sivachenko wrote:
> We have three machines under relatively high load.  They are running -STABLE
> on the same hardware with 2 processors (and SMP kernel).
> Periodically (approximately once a week) they panic with similar symptoms:

[ ... ]

Panic.

> #18 0xc0162549 in panic (fmt=0xc028e3b9 "%s")
>     at /mnt/se3/releng_4/src/sys/kern/kern_shutdown.c:595
> #19 0xc0251b1a in trap_fatal (frame=0xeb278e04, eva=1558020096)
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:974
> #20 0xc0251775 in trap_pfault (frame=0xeb278e04, usermode=0, eva=1558020096)
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:867
> #21 0xc02512b7 in trap (frame={tf_fs = -1072300008, tf_es = -361627632,
>       tf_ds = 16, tf_edi = -1070989600, tf_esi = -349729108,
>       tf_ebp = -349729176, tf_isp = -349729232, tf_ebx = -1070870564,
>       tf_edx = 1558020096, tf_ecx = 7, tf_eax = 128, tf_trapno = 12,
>       tf_err = 0, tf_eip = -1072309505, tf_cs = 8, tf_eflags = 66054,
>       tf_esp = 0, tf_ss = -349729108})
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:466

Page not present error.
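
For reference, here is a minimal sketch of how the fields gdb printed in
frame #21 can be read.  It assumes the usual i386 page-fault error-code
bits (named PGEX_P, PGEX_W and PGEX_U in the FreeBSD headers) and trap
number 12 for T_PAGEFLT; the helper names themselves are made up for
illustration and are not kernel functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define T_PAGEFLT	12	/* FreeBSD trap number for a page fault */

/* i386 page-fault error-code bits, as carried in tf_err. */
#define PGEX_P	0x01u	/* set: protection violation; clear: page not present */
#define PGEX_W	0x02u	/* set: write access; clear: read access */
#define PGEX_U	0x04u	/* set: fault taken in user mode; clear: kernel mode */

/* gdb prints trap frame fields as signed ints; view one as an address. */
uint32_t
as_address(int32_t field)
{
	return (uint32_t)field;
}

int
pfault_not_present(unsigned err)
{
	return (err & PGEX_P) == 0;
}

int
pfault_is_write(unsigned err)
{
	return (err & PGEX_W) != 0;
}

int
pfault_from_user(unsigned err)
{
	return (err & PGEX_U) != 0;
}

/* Print a one-line summary of a page-fault trap frame. */
void
describe_pfault(int trapno, unsigned err, int32_t eva, int32_t eip)
{
	assert(trapno == T_PAGEFLT);
	printf("%s-mode %s of a %s page, eva=0x%08" PRIx32
	    ", eip=0x%08" PRIx32 "\n",
	    pfault_from_user(err) ? "user" : "kernel",
	    pfault_is_write(err) ? "write" : "read",
	    pfault_not_present(err) ? "not-present" : "protected",
	    as_address(eva), as_address(eip));
}
```

Calling describe_pfault(12, 0, 1558020096, -1072309505) with the values
from the trace reports a kernel-mode read of a not-present page, with
eva=0x5cdd8000 and eip=0xc015daff; the latter is the address gdb shows
for malloc in frame #22.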


> #22 0xc015daff in malloc (size=72, type=0xc029fee0, flags=0)
>     at /mnt/se3/releng_4/src/sys/kern/kern_malloc.c:243

The result of the allocation is not checked for failure at this point
in the source; probably the kbp list had just been refilled, and while
the failing malloc was being called, the list was emptied again.

What this generally means is that KVA was exhausted, and the
caller did not expect that.

To work around: don't exhaust the KVA space; you have probably tuned
some kernel parameter way too high.

To fix: at line 243, you need to check whether va is NULL; if it is,
you need to check M_NOWAIT and, if it is set, return NULL; otherwise
restart the allocation.  This has to be done before the next line,
where "va" is dereferenced.

Maybe something like:

Change:
	va = kbp->kb_next;
	kbp->kb_next = ((struct freelist *)va)->next;

To:

	va = kbp->kb_next;
	if (va == NULL) {
		if (flags & M_NOWAIT) {
			splx(s);
			return ((void *) NULL);
		}
		goto restart;	/* put this label above the "while" */
	}
	kbp->kb_next = ((struct freelist *)va)->next;
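
The same control flow can be tried out in userland.  What follows is
only a rough model of the idea, not the kernel code: "struct bucket"
stands in for the kmembuckets entry, the refill hooks are hypothetical
test doubles, and M_NOWAIT is given the value 0x0001 as an assumption.
The real malloc() would sleep or call into the VM where the model just
retries.

```c
#include <assert.h>
#include <stddef.h>

#define M_NOWAIT	0x0001	/* assumed value: caller cannot wait */

struct freelist {
	struct freelist *next;
};

struct bucket {				/* stand-in for a kmembuckets entry */
	struct freelist *kb_next;	/* head of the free list */
};

/* Hypothetical hook that tries to put chunks on the free list. */
typedef void (*refill_fn)(struct bucket *);

void *
bucket_alloc(struct bucket *kbp, int flags, refill_fn refill)
{
	struct freelist *va;

	assert(kbp != NULL);
restart:
	if (kbp->kb_next == NULL)
		refill(kbp);		/* may still leave the list empty */

	va = kbp->kb_next;
	if (va == NULL) {
		if (flags & M_NOWAIT)
			return NULL;	/* caller asked not to wait */
		goto restart;		/* waiting allowed: try again */
	}
	kbp->kb_next = va->next;	/* safe: va is known to be non-NULL */
	return va;
}

/* Test double: the list is always empty again by the time we look. */
void
refill_never(struct bucket *kbp)
{
	kbp->kb_next = NULL;
}

/* Test double: the first refill is lost, the second one sticks. */
void
refill_flaky(struct bucket *kbp)
{
	static struct freelist chunk;
	static int calls;

	if (++calls < 2)
		return;
	chunk.next = NULL;
	kbp->kb_next = &chunk;
}
```

With refill_never and M_NOWAIT the call returns NULL instead of
dereferencing a NULL pointer; with refill_flaky and flags of 0 the first
pass finds the list empty, restarts, and succeeds on the second pass.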

Working around the problem is easier (IMO): just change your tuning
parameters to avoid running out of KVA.  Probably your mbufs or
mbuf clusters are set way too large for your amount of physical RAM;
remember that, except in very special circumstances, kernel memory
is non-pageable.
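
As a back-of-the-envelope check on the tuning, the address space that
the mbuf settings can claim may be estimated as below.  This is a
sketch under assumptions: MSIZE of 256 bytes and MCLBYTES of 2048 bytes
are the common i386 values, but check your own headers, and the budget
passed to percent_of() is whatever figure you consider available on
your machine, not something the kernel reports under that name.

```c
#include <assert.h>
#include <stdint.h>

#define MSIZE		256u	/* bytes per mbuf (assumed) */
#define MCLBYTES	2048u	/* bytes per mbuf cluster (assumed) */

/* Bytes needed if every configured mbuf and cluster is in use. */
uint64_t
mbuf_kva_bytes(uint64_t nmbufs, uint64_t nmbclusters)
{
	return nmbufs * MSIZE + nmbclusters * MCLBYTES;
}

/* Share of a caller-chosen budget, as a whole percentage. */
unsigned
percent_of(uint64_t bytes, uint64_t budget)
{
	assert(budget != 0);
	return (unsigned)(bytes * 100 / budget);
}
```

For example, 262144 mbufs and 65536 clusters come to 201326592 bytes
(192 MB), which is 75% of a hypothetical 256 MB budget.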
		

> #23 0xc015a3fe in exit1 (p=0xea726820, rv=15)
>     at /mnt/se3/releng_4/src/sys/kern/kern_exit.c:166

It was trying to allocate a "zombie" structure.


> #24 0xc0164011 in sigexit (p=0xea726820, sig=15)
>     at /mnt/se3/releng_4/src/sys/kern/kern_sig.c:1503

This was for a process that someone had sent a SIGTERM (sig=15) to,
in order to kill it.


> #25 0xc0163d9c in postsig (sig=15)
>     at /mnt/se3/releng_4/src/sys/kern/kern_sig.c:1406
> #26 0xc0251fc5 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47,
>       tf_edi = 174, tf_esi = 1049187701, tf_ebp = -1077936960,
>       tf_isp = -349728812, tf_ebx = 1, tf_edx = 3, tf_ecx = -1078002496,
>       tf_eax = 3, tf_trapno = 7, tf_err = 2, tf_eip = 672039098, tf_cs = 31,
>       tf_eflags = 659, tf_esp = -1078069180, tf_ss = 47})
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:174

Looks like you caused a floating point exception, and died when
the exit1 failed to create a zombie structure for the process.

-- Terry