Date:      Wed, 31 May 2000 17:51:38 -0400
From:      "David E. Cross" <crossd@cs.rpi.edu>
To:        Guy Helmer <ghelmer@cs.iastate.edu>
Cc:        "David E. Cross" <crossd@cs.rpi.edu>, Matthew Dillon <dillon@apollo.backplane.com>, freebsd-hackers@FreeBSD.ORG, crossd@cs.rpi.edu
Subject:   Re: PR #10971, not dead yet. 
Message-ID:  <200005312151.RAA86135@cs.rpi.edu>
In-Reply-To: Message from Guy Helmer <ghelmer@cs.iastate.edu>  of "Wed, 31 May 2000 15:12:34 CDT." <Pine.HPX.4.05.10005311509440.9820-100000@popeye.cs.iastate.edu> 

> > Alas, this is not something I have been able to reliably reproduce; it seems
> > to trigger itself every so often (and at inconvenient times).  But no
> > matter what I do by myself, it will not trip.
> 
> Is it possibly related to a low-memory situation?  I'm trying to solve a
> problem in cron that sounds similar, and seems to be triggered when the
> machine goes into swapping.  I'm unable to duplicate it myself :-(
> 
> Guy
> 
> Guy Helmer, Ph.D. Candidate, Iowa State University Dept. of Computer Science 
> Research Assistant, Dept. of Computer Science   ---   ghelmer@cs.iastate.edu
> http://www.cs.iastate.edu/~ghelmer

This does not appear to be memory related at all.

In fact, I *think* I just found it...
(bear with me y'all)

When a TCP client requests a yp_all transfer we fork(), and the child then
tries very hard not to handle any request other than the yp_all.  However,
the following comment from readtcp() suggests that the RPC library does an
end run around us:

/*
 * reads data from the tcp connection.
 * any error is fatal and the connection is closed.
 * (And a read of zero bytes is a half closed stream => error.)
 *
 * Note: we have to be careful here not to allow ourselves to become
 * blocked too long in this routine. While we're waiting for data from one
 * client, another client may be trying to connect. To avoid this situation,
 * some code from svc_run() is transplanted here: the select() loop checks
 * all RPC descriptors including the one we want and calls svc_getreqset2()
 * to handle new requests if any are detected.
 */

This is the code that I noted sometimes gets run instead of the main select
loop.  Would it be a good idea for the child ypserv to close not only all of
the DB fds, but also all network fds except the one for the request it is
specifically being asked to handle?  Would it be as easy as stepping through
the fd_set and closing anything that is not the designated connection?
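
To make the question concrete, here is a rough sketch of what I mean.  It is
untested against ypserv itself; close_all_but() is a made-up name, and I am
assuming the child would hand it a copy of the RPC library's descriptor set
(presumably svc_fdset) plus the fd of the connection it was forked for.
Closing the fds this way does nothing about the transport structures the RPC
library still holds for them, so treat it as an illustration only:

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper: close every descriptor that is set in *fds except
 * keep_fd, and clear each closed descriptor from the set.
 * Returns the number of descriptors successfully closed.
 */
int
close_all_but(fd_set *fds, int maxfd, int keep_fd)
{
    int fd, closed = 0;

    assert(fds != NULL);
    for (fd = 0; fd < maxfd; fd++) {
        /* Skip the one connection we were forked for, and unset slots. */
        if (fd == keep_fd || !FD_ISSET(fd, fds))
            continue;
        if (close(fd) == 0)
            closed++;
        FD_CLR(fd, fds);
    }
    return (closed);
}
```

In the child this would be called once, right after fork() returns 0, with
maxfd set to one more than the highest descriptor in the set.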

I am still not sure this is the cause, as all of the database FDs should
already be closed, so even if a child did answer the request it shouldn't
cause trouble for the parent (and I do not see any evidence in the ktrace()
that the child is responding outside of its yp_all request).

Indeed, I have just verified this is the code that causes the segfault in
this case (as indicated by the tell-tale gettimeofday calls that I could
not previously track).  I still have no idea what is causing the trouble
though.  Especially confusing is the following sequence of events:

 41096 ypserv   CALL  fork
 41096 ypserv   RET   fork 62356/0xf394
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  select(0x10,0x8051040,0,0,0xbfbff518)
 41096 ypserv   PSIG  SIGCHLD caught handler=0x804c75c mask=0x0 code=0x0
 41096 ypserv   RET   select -1 errno 4 Interrupted system call
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 62356/0xf394
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 -1 errno 10 No child processes
 41096 ypserv   CALL  sigreturn(0xbfbff328)
 41096 ypserv   RET   sigreturn JUSTRETURN
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  read(0x1c,0x80f3fa0,0xfa0)
 41096 ypserv   GIO   fd 28 read 4000 bytes

Note that select() returned -1 with errno set to 4 (EINTR), yet the process
did not re-enter the select loop; it just started to read data.  Also note
that the 'CALL/RET fork' is followed directly by a gettimeofday().  That
tells me that readtcp(), acting as a bit of svc_run(), is what dispatched to
the yp_all() handler, and that the fork happened there, without the special
handling that is done in the normal yp_svc_run().
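
For comparison, this is roughly what I would expect a select loop to do when
a signal such as SIGCHLD interrupts it.  It is only a sketch: select_retry()
is a made-up name, and I have not confirmed that readtcp() differs from this.
The point is that when select() fails, the fd_set still holds whatever was
copied into it, so testing it with FD_ISSET() after an EINTR would make every
descriptor we asked about look ready:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical wrapper: wait for readability on the descriptors in *want,
 * calling select() again whenever a signal interrupts it.  *ready is only
 * meaningful when the return value is greater than zero.
 * The full timeout is restarted after every interruption.
 */
int
select_retry(const fd_set *want, int nfds, fd_set *ready,
    const struct timeval *timeout)
{
    struct timeval tv;
    int n;

    assert(want != NULL && ready != NULL);
    do {
        /* select() overwrites its arguments, so reload them each pass. */
        *ready = *want;
        if (timeout != NULL)
            tv = *timeout;
        n = select(nfds, ready, NULL, NULL, timeout != NULL ? &tv : NULL);
    } while (n == -1 && errno == EINTR);
    return (n);
}
```

A caller would check FD_ISSET() on the ready set only after this returns a
positive count, and treat 0 as a timeout and -1 as a real error.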

Does this give anyone else any ideas?  This is proving to be a very slow 
battle.

--
David Cross                               | email: crossd@cs.rpi.edu 
Lab Director                              | Rm: 308 Lally Hall
Rensselaer Polytechnic Institute,         | Ph: 518.276.2860            
Department of Computer Science            | Fax: 518.276.4033
I speak only for myself.                  | WinNT:Linux::Linux:FreeBSD

