FreeBSD Mail Archives

Date:      Mon, 4 Oct 1999 16:10:15 -0700 (PDT)
From:      Doug <Doug@gorean.org>
To:        freebsd-hackers@freebsd.org
Subject:   zalloci/pv_entry problem (Was: Weird sockname errors with -current and apache)
Message-ID:  <Pine.BSF.4.10.9910041553330.36710-100000@dt011n66.san.rr.com>


[Including the whole history here since I haven't received a response on
this yet.]

	Well I FINALLY got one of my crashing CGI machines to drop into
the debugger, and the results were interesting. I'm not a DDB expert, but
I tried to get some relevant info. I think the following is the most
interesting:

db> trace
zalloci(c02698c0,deb59e58,c01f24c3,dd196964,8752000) at zalloci+0x33
get_pv_entry(dd196964,8752000,ffc21d48,0,deb59e90) at get_pv_entry+0x4a
pmap_insert_entry(dd196964,8752000,c08952d0,77cb000) at
pmap_insert_entry+0x1f
pmap_copy(dd196964,de0ec264,80c6000,1d7e000,80c6000) at pmap_copy+0x1a0
vm_map_copy_entry(de0ec200,dd196900,dd31aac8,de780208) at
vm_map_copy_entry+0xdf
vmspace_fork(de0ec200,dd191840,dd191840,bfbfddbc,deb59f30) at
vmspace_fork+0x1d3
vm_fork(de09f080,dd191840,14) at vm_fork+0x2f
fork1(de09f080,14,deb59f48,de09f080,9) at fork1+0x621
fork(de09f080,deb59f80,805b36c,30,bfbfddbc) at fork+0x16
syscall(bfbf002f,bfbf002f,bfbf002f,bfbfddbc,30) at syscall+0x19e
Xint0x80_syscall() at Xint0x80_syscall+0x31

The full output of what I got is available at
http://doug.simplenet.com/DDB1.txt. 

	We'd had some problems with the servers crashing due to an
inadequate number of pv_entry's (due to the huge size of our httpd's), so
I increased options  PMAP_SHPGPERPROC to 800, which solved the crashing
problem. However it seems(?) that this high setting is tickling something
that it ought not to. The changes in the VM system between 8/28 and 9/22
seem to be exacerbating the problem, since the servers are much more
stable now with the older code in spite of this one crash. 
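
For a feel of how much that setting changes the size of the pv_entry pool, here is a back-of-the-envelope sketch. It assumes the i386 pmap code of this era sizes the pool as PMAP_SHPGPERPROC * maxproc plus the number of physical pages, that the default setting is 200, and that maxproc is 20 + 16 * maxusers. All three are assumptions that should be checked against the actual source, and the maxusers value is purely hypothetical. Only the 800 setting and the half gig of RAM come from this thread.

```python
# Rough estimate of the pv_entry limit. The formula and the default of 200
# are assumptions, not confirmed against the kernel source.
PAGE_SIZE = 4096  # i386 page size in bytes


def pv_entry_max(shpgperproc, maxproc, ram_bytes):
    """Assumed sizing rule: shpgperproc * maxproc + physical page count."""
    physical_pages = ram_bytes // PAGE_SIZE
    return shpgperproc * maxproc + physical_pages


ram = 512 * 1024 * 1024       # "half gig of ram" per machine
maxusers = 64                 # hypothetical kernel config value
maxproc = 20 + 16 * maxusers  # assumed derivation of maxproc

for setting in (200, 800):    # assumed default vs. the value used here
    print(setting, pv_entry_max(setting, maxproc, ram))
```

If the assumed formula holds, going from 200 to 800 roughly triples the number of entries the zone allocator is asked to reserve, which may be relevant to the zalloci failure in the trace above.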

	If anyone has suggestions for better DDB commands to use for next
time, and/or any other suggestions on fixing the problem I'm open to them.
I am currently working on taking advantage of apache 1.3.9's new vhost
settings for httpd.conf so that we can reduce the size of our conf files
(and thus the httpd's), but meanwhile it looks like our unusual settings
have uncovered a problem worth fixing. 

Thanks,

Doug
-- 
"Stop it, I'm gettin' misty." 

    - Mel Gibson as Porter, "Payback"

---------- Forwarded message ----------
Date: Fri, 1 Oct 1999 16:06:26 -0700 (PDT)
From: Doug <Doug@gorean.org>
To: freebsd-hackers@freebsd.org
Subject: Weird sockname errors with -current and apache

	No response on -current and I have an update. After moving to
-current source via cvs -D (1999//D99.08.28.21.00.00) the servers are
infinitely more stable, running continuously for three days, and showing
no signs of dying. I get sockname errors once in a great while, and only
one at a time. I get the VM error mentioned below once in a great while,
but I'm fairly certain I've tracked that down to a problem with the miva
processing engine binary itself. I haven't had any of the errant apache 
children not dying errors, and only one of the calcru errors. 

	I have come up with a theory on this and I'd appreciate if someone
could comment on it. We get pre-compiled binaries from the Miva Corp.
people that I'm 99.99% sure are built on a 2.2.x or 3.x machine (waiting
on confirmation now). So what I'm thinking (based mostly on the sockname
error) is that there is a sort of "library creep" happening where small
incompatibilities between the version of the library that the binary is
expecting and the version it's finding are just a bit out of synch. I am
wondering if adding the appropriate compat libraries to these systems
would help, and if so how would I specify that this specific binary use
those libraries as opposed to the ones in /usr/lib?

	Any insights on this would be greatly appreciated. Here are some
details on the binaries, let me know if anything else is needed. 

Thanks,

Doug


 ldd miva
miva:
        libcrypt.so.2 => /usr/lib/libcrypt.so.2 (0x280d3000)
        libc.so.3 => /usr/lib/libc.so.3 (0x280e9000)
        libm.so.2 => /usr/lib/libm.so.2 (0x2816c000)

 file miva
miva: setuid sticky ELF 32-bit LSB executable, Intel 80386, version 1
(FreeBSD), dynamically linked, stripped 

-- 
"Stop it, I'm gettin' misty." 

    - Mel Gibson as Porter, "Payback"

---------- Forwarded message ----------
Date: Wed, 29 Sep 1999 12:28:59 -0700 (PDT)
From: Doug <Doug@gorean.org>
To: freebsd-current@freebsd.org
Subject: Weird sockname errors with -current and apache

Greetings,

	I'm using -current on some web server/CGI processing machines. Yes
I know all about using -current on production stuff, but we need the NFS,
et al fixes due to the heavy NFS client activity on these systems, and
I'm willing to take the good with the bad. I cvsup'ed and built world and
kernel on or about 8/26 and these boxes ran fine for about 26 days. On
9/22 (Wednesday) I cvsup'ed and built world and kernel on one machine in
order to take advantage of Matt's latest round of NFS, etc. fixes. That
box ran well for two days so I updated the rest of them on Friday (9/24)
and took off for a happy weekend. Well, you know what happened: one box
locked up on Saturday, I came in and rebooted it, then the other 4 boxes
locked up on Sunday. *sigh*

	The really annoying thing here is that there isn't ONE clear problem
that I can point to. Also, when the boxes die they wedge solid. No
console, serial or otherwise, and no DDB so I can't find out exactly
what they are doing when they die. I have the DDB_UNATTENDED option in
the kernel because I have the boxes set up to recover themselves on boot
and go back into service (previous to the 26 day uptime, panics were
common). I'm starting to think I should disable that, however as far as
I can see they aren't panic'ing, they are just freezing up; although
they are ping'able. We started out this project with Apache 1.3.6, and
on Sept. 7 we moved to 1.3.9. These are dual PIII 500 machines with a
half gig of ram each. 

	The other annoying thing is that while I was checking the kernel, etc.
logs for signs of problems, it hadn't occurred to me to check the apache
error log. Once I did I noticed that at least some of the symptoms I'm
seeing go back as far as I have logs, even before the blessed 26 day
uptime period. Here is what I've seen.

The first error I can find in any of the logs I have that seems related
to the problem is this from apache's error log:

[Fri Aug 20 10:59:34 1999] [error] (22)Invalid argument: getsockname

Subsequently I've noticed that we get this error a LOT, usually
coinciding with a period of time where the machine is wedged, after
which it sometimes comes back, and sometimes doesn't (i.e., it stays
wedged). When this happens it usually repeats about 15-20 times,
followed by:

Virtual memory exceeded in `new' 

then a NULL character (^@) in the apache log. Those errors are usually
accompanied by a slew of "Premature end of script headers" messages,
apparently related to the CGI process that these web servers run dying off
before it finishes writing out its data. 

	We also have a slew of these errors in the apache logs at various
times (doesn't *seem* to be a correlation with the others, but I'm not
sure) that look like:

[Mon Sep 13 12:51:03 1999] [warn] child process 82600 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83437 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 84136 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83698 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83703 still did not
exit, sending a SIGTERM

Sometimes these happen at the same time, sometimes they don't. When this
one happens we get about 40 of them in a row. 
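
One way to check whether these bursts line up with the getsockname errors is to bucket the apache error log by minute and count each kind of message. The following is only a sketch: it assumes each log entry sits on a single line in the "[timestamp] [level] message" form shown above (the entries quoted in this mail are wrapped), and the function name and pattern labels are made up for illustration.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# "[Mon Sep 13 12:51:03 1999] [warn] message..." as seen in the error log.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<msg>.*)$")

# Hypothetical labels mapped to substrings of the messages quoted above.
PATTERNS = {
    "getsockname": "getsockname",
    "sigterm": "still did not exit",
    "premature": "Premature end of script headers",
}


def bucket_errors(lines):
    """Return {minute: Counter({label: count})} for the known messages."""
    buckets = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # lines without a timestamp are skipped
        try:
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        except ValueError:
            continue  # timestamp not in the expected format
        minute = ts.replace(second=0)
        for label, needle in PATTERNS.items():
            if needle in m.group("msg"):
                buckets[minute][label] += 1
    return buckets


if __name__ == "__main__":
    sample = [
        "[Fri Aug 20 10:59:34 1999] [error] (22)Invalid argument: getsockname",
        "[Mon Sep 13 12:51:03 1999] [warn] child process 82600 still did not"
        " exit, sending a SIGTERM",
    ]
    for minute, counts in sorted(bucket_errors(sample).items()):
        print(minute, dict(counts))
```

Minutes in which more than one label shows up would suggest the errors are related. Note that the "Virtual memory exceeded" lines carry no timestamp, so this sketch skips them.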

	In the system logs the only unusual thing I've seen (and I enable a LOT
of logging) are these messages, which started over this past weekend. 

/kernel: calcru: negative time of 4347162 usec for pid 6806 (httpd) 

Once again, when these come they come in bunches, sometimes with a
positive time value like this one, sometimes with a negative one. I'm
used to seeing calcru messages related to the kernel misjudging the
speed of the processor, but the recently added code that tells you the
speed on SMP systems says that I have CPU: Pentium III (498.75-MHz
686-class CPU), which looks right to me.

	Now, as if the above were not annoying enough, all of these
problems could very well be related to the third party CGI processing
engine (a program called Miva) which we have tracked down some bugs in
before. Of course the machines freezing up is my main concern at this
point, but the errors themselves could be coming from miva. 

	Any suggestions on how to debug this problem further would be
greatly appreciated. I'm going to start up some boxes today that don't
have the DDB_UNATTENDED option enabled to see if they will in fact panic
and drop to the debugger. Beyond that, I'm at a bit of a loss here. 

TIA,

Doug




To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

