Date:      Wed, 31 May 2000 13:06:19 -0400
From:      "David E. Cross" <crossd@cs.rpi.edu>
To:        freebsd-hackers@freebsd.org
Subject:   PR #10971, not dead yet.
Message-ID:  <200005311706.NAA78305@cs.rpi.edu>

We still have a problem with PR #10971 here, running a -STABLE as of last
week.  (10971 should have been dead long since.)  It is a difficult problem
to track down, as stack corruption makes debugging files less than useless.
I do, however, have a ktrace of an entire transaction that causes ypserv
to die.  I am in the process of trying to track down why it is dying; it
appears to be a bug in the rpc library itself.  Normally what happens is
the following:

# TCP request comes in, accept().
# yp_all request issued, parent forks.
# child handles request, quits.
# parent is interrupted in its select() call, dispatches to signal handler for
# SIGCHLD
# handler returns.
# parent issues a read?!?  (this is odd, since it doesn't re-enter the select
# loop as the code I have read suggests it should).
# read fails (0 bytes returned)
# it does that a couple of times (probably falling out of loops), and FD is
# closed
# ypserv re-enters the select loop

Under a failure condition the following happens:
# Upon child return parent reads from a DB file into a nonexistent buffer.
# parent seg-faults.

I believe the problem code is "next to" the section of the code where it
select()s and then accept()s if it is a TCP connection... but I cannot find
where this code is.  A grep for 'accept' in both the ypserv and rpc code
returns no useful matches.  Also, it would certainly appear that there
is another select loop besides the one in the canonical ypserv.

Below are the dying moments of the parent process as reported by ktrace.
Ideas?

 41096 ypserv   CALL  fork
 41096 ypserv   RET   fork 62356/0xf394
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  select(0x10,0x8051040,0,0,0xbfbff518)
 41096 ypserv   PSIG  SIGCHLD caught handler=0x804c75c mask=0x0 code=0x0
 41096 ypserv   RET   select -1 errno 4 Interrupted system call
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 62356/0xf394
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 -1 errno 10 No child processes
 41096 ypserv   CALL  sigreturn(0xbfbff328)
 41096 ypserv   RET   sigreturn JUSTRETURN
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  read(0x1c,0x80f3fa0,0xfa0)
 41096 ypserv   GIO   fd 28 read 4000 bytes
 41096 ypserv   RET   read 4000/0xfa0
 41096 ypserv   PSIG  SIGSEGV SIG_DFL
 41096 ypserv   NAMI  "ypserv.core"

Oh, this is true of all systems, not just 4.0-STABLE.  I was hoping the move
to 4.0 might solve the problem, so I wasn't actively trying to debug it before.
--
David Cross                               | email: crossd@cs.rpi.edu 
Lab Director                              | Rm: 308 Lally Hall
Rensselaer Polytechnic Institute,         | Ph: 518.276.2860            
Department of Computer Science            | Fax: 518.276.4033
I speak only for myself.                  | WinNT:Linux::Linux:FreeBSD





