From owner-freebsd-current  Fri Oct 18 17:23: 7 2002
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0318237B404; Fri, 18 Oct 2002 17:23:03 -0700 (PDT)
Received: from flamingo.mail.pas.earthlink.net (flamingo.mail.pas.earthlink.net [207.217.120.232])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 5D03543E97; Fri, 18 Oct 2002 17:23:03 -0700 (PDT)
	(envelope-from tlambert2@mindspring.com)
Received: from pool0128.cvx21-bradley.dialup.earthlink.net ([209.179.192.128] helo=mindspring.com)
	by flamingo.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 182hON-00051I-00; Fri, 18 Oct 2002 17:22:59 -0700
Message-ID: <3DB0A598.C53FD37D@mindspring.com>
Date: Fri, 18 Oct 2002 17:21:44 -0700
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Ben Stuyts <ben@stuyts.nl>
Cc: current@freebsd.org, Jeff Roberson <jroberson@chesapeake.net>,
	Robert Watson <rwatson@freebsd.org>, jeff@freebsd.org,
	Alfred Perlstein <alfred@FreeBSD.ORG>
Subject: Re: [Ugly PATCH] Again: panic kmem_malloc()
References: <4.3.2.7.2.20021018125313.00bb8990@terminus> <4.3.2.7.2.20021019001010.00b89f28@terminus>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-current.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-current>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-current>
X-Loop: FreeBSD.ORG

Ben Stuyts wrote:
> >Almost 5.3M of unswappable physical memory dedicated to semaphores
> >seems like a bit much.
> 
> Yes, and it increases continuously, for example when I fetch new mail (over
> pop) from my windows pc. The pc stores this again on a network drive, so
> both qpopper and smbd are involved. For example, vmstat -m says:
> 
> vmstat -m | grep sem
>            sem155886  2443K   2443K   155886  16,1024,4096
> 
> Now when I do a fetch-mail with Eudora on my pc, the same command says.
> 
> vmstat -m | grep sem
>            sem156178  2448K   2448K   156178  16,1024,4096
> 
> I can repeat this at will, and each time I lose 4-5 KB. qpopper is started
> from inetd, and smbd runs as a daemon. I tried stopping smbd:

None of us has been able to reproduce your problem so far.  Now
that we know you are running qpopper on -current, we could try,
but, frankly, you already have a test environment set up; it
would be a lot of work for us to duplicate it, and even then we
could not be sure we had reproduced the same problem.

Have you checked out your source tree with a date tag, so that
it's possible for everyone else to check out and get the same
source files?  Line number references in tracebacks are pretty
useless, if the lines don't match.


The vmstat numbers alone are no good unless you can identify the
exact number of bytes being consumed, and then identify a kernel
structure used in the semaphore code whose size equals that
number, or divides it evenly, with the number of events equal to
the quotient.
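
As a rough sketch of that arithmetic, the two "vmstat -m" samples
quoted above can be compared mechanically.  This assumes the columns
are type, in-use count, memory in use, high-water mark, request count,
and bucket sizes; the names parse_row and leak_summary are made up for
illustration.

```python
import re

# One malloc-type row in the form quoted in this thread, e.g.
#   "sem155886  2443K   2443K   155886  16,1024,4096"
# ASSUMPTION: the columns are type, in-use count, memory in use,
# high-water mark, request count, and bucket sizes.  The type name and
# the in-use count can run together when the count is wide.
ROW = re.compile(
    r"^\s*(?P<type>\D+?)\s*(?P<inuse>\d+)\s+(?P<memuse>\d+)K\s+"
    r"(?P<highuse>\d+)K\s+(?P<requests>\d+)\s+(?P<sizes>[\d,]+)\s*$"
)


def parse_row(line):
    """Return (type, in-use count, KB in use, bucket sizes) for one row."""
    m = ROW.match(line.lstrip("> "))  # tolerate mail quoting markers
    if m is None:
        raise ValueError("unrecognized vmstat -m row: %r" % line)
    sizes = [int(s) for s in m.group("sizes").split(",")]
    return m.group("type"), int(m.group("inuse")), int(m.group("memuse")), sizes


def leak_summary(before, after):
    """Compare two samples of the same malloc type taken around one event."""
    _, inuse0, kb0, sizes = parse_row(before)
    _, inuse1, kb1, _ = parse_row(after)
    return {
        "new_allocations": inuse1 - inuse0,
        "new_kb": kb1 - kb0,  # vmstat rounds to KB, so this is approximate
        "avg_bytes_per_allocation": kb1 * 1024.0 / inuse1,
        "bucket_sizes": sizes,
    }


before = "sem155886  2443K   2443K   155886  16,1024,4096"
after = "sem156178  2448K   2448K   156178  16,1024,4096"
print(leak_summary(before, after))
```

For the samples quoted above, this reports 292 new allocations and
about 5 KB for one mail fetch, with an average close to 16 bytes per
allocation.  That would suggest looking at whatever the semaphore code
allocates from the 16-byte bucket, but because vmstat rounds to whole
kilobytes, it is a hint and not an exact byte count.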

This is why everyone keeps asking you to run the kernel debugger,
so that you can tell us exactly which code is failing, and why.
It is also why a stack backtrace more detailed than "it contained
a call to sem" is important.

This problem is evidently a memory leak in the semaphore code;
but that does not mean that the crash that results will be in
any way related to where the leak occurs.

In other words, the crash is a secondary effect.


Only by fully understanding the crash will anyone be able to help
you with the root cause.

I understand that it's frustrating to go step by step, when you
think you have isolated the problem to a smaller area, but the
information you gather from outside that area will tell you about
the inside much more clearly than staring at the outside of a
black box where we know the problem lives.

The only alternative to rewriting the black box from scratch, or
grovelling through it with a line-by-line code review (I'm not
interested in doing that; perhaps you could interest the author
of the changes that resulted in the problem) is to find a smoking
gun, and work from that, instead.

If this problem is in the way of you getting work done (one wonders
why you are using -current, if you need to get work done), then my
best suggestion to you is to back out the changes Alfred made, one
by one; when the problem stops, the change you just backed out is
a very small patch that causes it.
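
The back-out procedure can be sketched as a simple loop.  Everything
named here is hypothetical: find_culprit is an illustrative helper,
and problem_present stands in for the manual step of rebuilding the
kernel with the remaining changes and repeating the mail fetch while
watching "vmstat -m".

```python
def find_culprit(changes, problem_present):
    """Back out changes newest first; return the one whose removal fixes it.

    changes: list of change identifiers, oldest first.
    problem_present: callable taking the list of changes still applied and
        returning True if the leak is still observed.  HYPOTHETICAL: in
        practice this means rebuilding the kernel and repeating the mail
        fetch while watching vmstat -m.
    """
    applied = list(changes)
    while applied:
        backed_out = applied.pop()  # remove the most recent remaining change
        if not problem_present(applied):
            return backed_out
    return None  # problem persists with every change backed out


# Illustrative run with made-up change names; "change-2" plays the culprit.
changes = ["change-1", "change-2", "change-3", "change-4"]
print(find_culprit(changes, lambda applied: "change-2" in applied))
```

A result of None means the problem is still there with every listed
change removed, so the cause lies somewhere else.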


> >But without knowing what software you are running, it's hard to say
> >if the number is unreasonable, or not.
> 
> Well, it is really a lightly loaded server, just serving one windows pc
> here at home. Here is a ps, and the only thing that's missing from it is
> the occasional pop session. Also note that this system is not connected to
> the internet, so the http that's running is mostly for my own pleasure (and
> proxy/cache). I do run ppp and uucp every now and then.

Perhaps I wasn't clear.

Without knowing which calls your software makes that trigger the
problem, we cannot create a cut-down test case, in less than 30
lines of C source code, that repeats the problem at will, without
secondary effects.  As it is, you only *suppose* that the qpopper
usage alone is sufficient to cause the problem; even if you are
correct, that's insufficient to identify where the problem is.
It may not even be in the semaphore source code at all; maybe
it's in the kevent code, for unfreed events, etc.

I think you need to go back one email:

| > Just had another panic, same kmem_malloc(). I did a trace but forgot to
| > write the traceback down.
| 
| Wait until the next one, and remember to write it down; preferably,
| obtain a system dump image, so you can examine it with the debugger,
| and make sure that the kernel you are running has a debuggable
| counterpart already there (i.e. you used "config -g" to create the
| kernel you are running).

-- Terry
