Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 1 Mar 2007 11:48:54 +1030
From:      "Daniel O'Connor" <doconnor@gsoft.com.au>
To:        freebsd-stable@freebsd.org
Subject:   fork wedging (I think)
Message-ID:  <200703011149.02601.doconnor@gsoft.com.au>

next in thread | raw e-mail | index | archive | help
--nextPart6375052.DeoMG3dhqs
Content-Type: text/plain;
  charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

We run a set of meteor and wind radar systems based on FreeBSD and one of t=
hem
I am seeing an odd problem that does not occur very often (once a month
perhaps).

The main process we use for gathering data regularly fork()s children off to
process data (at each 30 minute acquisition cycle). The data is read from a=
n=20
external acquisition chassis using a (slow) custom PCI card. Typical data=20
rates are 1-2Mbyte/sec. It will also fork off processes to send email,
although unfortunately I don't know how to extract a Tcl stack trace from=20
inside gdb :(

The problem seems to occur when the acquisition process (called the Recorde=
r)
tries to fork off a new child. I gathered some information during an earlier
failure (mid January) show below. I tried to get a crash dump today when it
failed again but as I ran sysctl -a the system wedged. I think processes got
stuck in allproc (I pressed ctrl-t on an ls process running another session
opened before I ran sysctl).

  UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
19999  1453     1  44  -8 -5 17056 11380 piperd I<s   ??  144:41.70 /usr/ho=
me/radar/skiymet/libexec/Recorder /usr/home/radar/skiymet/libexec/acquisiti=
on/sks.tcl /usr/home/radar/skiymet/etc/ud3
19999 12293  1453  44  96 -5 17056    16 user m D<L   ??    0:00.00 /usr/ho=
me/radar/skiymet/libexec/Recorder /usr/home/radar/skiymet/libexec/acquisiti=
on/sks.tcl /usr/home/radar/skiymet/etc/ud3

Note that under normal circumstances there would only be one Recorder proce=
ss.

eureka:~>gdb $GSHOME/libexec/Recorder
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain condition=
s.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
(gdb) eureka:~>fg
gdb /usr/home/radar/skiymet/libexec/Recorder

(gdb)
(gdb) attach 1453
Attaching to program: /usr/home/radar/skiymet/libexec/Recorder, process 1453
Reading symbols from /usr/local/lib/libtcl84.so.1...done.
Loaded symbols for /usr/local/lib/libtcl84.so.1
Reading symbols from /lib/libm.so.4...done.
Loaded symbols for /lib/libm.so.4
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /usr/home/radar/skiymet/lib/acquisition//synccheck.plu=
gin...done.
Loaded symbols for /usr/home/radar/skiymet/lib/acquisition//synccheck.plugin
Reading symbols from /usr/home/radar/skiymet/lib/acquisition//integrate.plu=
gin...done.
Loaded symbols for /usr/home/radar/skiymet/lib/acquisition//integrate.plugin
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
0x00000008009c395c in read () from /lib/libc.so.6
(gdb) bt
#0  0x00000008009c395c in read () from /lib/libc.so.6
#1  0x000000080072f77f in TclpCreateProcess () from /usr/local/lib/libtcl84=
=2Eso.1
#2  0x0000000800717d25 in TclCreatePipeline () from /usr/local/lib/libtcl84=
=2Eso.1
#3  0x00000008007186d0 in Tcl_OpenCommandChannel () from /usr/local/lib/lib=
tcl84.so.1
#4  0x0000000800704af8 in Tcl_ExecObjCmd () from /usr/local/lib/libtcl84.so=
=2E1
#5  0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#6  0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84=
=2Eso.1
#7  0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so=
=2E1
#8  0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.=
so.1
#9  0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#10 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84=
=2Eso.1
#11 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so=
=2E1
#12 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.=
so.1
#13 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#14 0x00000008006cb1e4 in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
#15 0x0000000800721511 in Tcl_AfterObjCmd () from /usr/local/lib/libtcl84.s=
o.1
#16 0x0000000800720ccc in Tcl_CreateTimerHandler () from /usr/local/lib/lib=
tcl84.so.1
#17 0x0000000800711fea in Tcl_ServiceEvent () from /usr/local/lib/libtcl84.=
so.1
#18 0x0000000800712357 in Tcl_DoOneEvent () from /usr/local/lib/libtcl84.so=
=2E1
#19 0x000000000041b272 in gsio_read (interp=3D0x58d400, iov=3D0xe14408, iov=
cnt=3D4, timeout=3D81000) at gsio.c:234
#20 0x000000000040a3fb in rawdata_wakeup (interp=3D0x58d400, inqueue=3D0x7f=
ffffff6aa0, outqueue=3D0xdf,
    flags=3D0x7fffffffa584, token=3D0xffffff000713c340) at data.c:179
#21 0x00000000004327b3 in proc_run (interp=3D0x58d400) at procedure.c:528
#22 0x00000000004306b5 in ud_Stream (cd=3D0xd, interp=3D0x58d400, argc=3D2,=
 argv=3D0x0) at procedure.c:157
#23 0x00000008006c840b in TclInvokeStringCommand () from /usr/local/lib/lib=
tcl84.so.1
#24 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#25 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84=
=2Eso.1
#26 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so=
=2E1
#27 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.=
so.1
#28 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#29 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84=
=2Eso.1
#30 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so=
=2E1
#31 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.=
so.1
#32 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#33 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84=
=2Eso.1
#34 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so=
=2E1
#35 0x00000008006cbc86 in Tcl_EvalObjEx () from /usr/local/lib/libtcl84.so.1
#36 0x00000008006ceca2 in Tcl_CatchObjCmd () from /usr/local/lib/libtcl84.s=
o.1
#37 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#38 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84=
=2Eso.1
#39 0x00000008006f485a in Tcl_ExprObj () from /usr/local/lib/libtcl84.so.1
#40 0x00000008006ca3b3 in Tcl_ExprBooleanObj () from /usr/local/lib/libtcl8=
4.so.1
#41 0x00000008006d1979 in Tcl_IfObjCmd () from /usr/local/lib/libtcl84.so.1
#42 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl=
84.so.1
#43 0x00000008006cb1e4 in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
#44 0x000000080070b092 in Tcl_FSEvalFile () from /usr/local/lib/libtcl84.so=
=2E1
#45 0x000000080070ea30 in Tcl_Main () from /usr/local/lib/libtcl84.so.1
#46 0x00000000004281fe in main (argc=3D13, argv=3D0x7fffffff6aa0) at main.c=
:49
(gdb)
#19 0x000000000041b272 in gsio_read (interp=3D0x58d400, iov=3D0xe14408, iov=
cnt=3D4, timeout=3D81000) at gsio.c:234
234                 Tcl_DoOneEvent(TCL_ALL_EVENTS | TCL_DONT_WAIT);
(gdb)
#20 0x000000000040a3fb in rawdata_wakeup (interp=3D0x58d400, inqueue=3D0x7f=
ffffff6aa0, outqueue=3D0xdf,
    flags=3D0x7fffffffa584, token=3D0xffffff000713c340) at data.c:179
179         if (dop->do_ifmod->m.ifmod.if_read(interp, dop->do_iov, DATA_NB=
UFS, timeout))

eureka:~>kill 1453
eureka:~>kill 1453
1453: No such process
eureka:~>ps -axlwww | grep Rec
  UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
19999 12293     1  44  96 -5 17056    16 user m D<L   ??    0:00.00 /usr/ho=
me/radar/skiymet/libexec/Recorder /usr/home/radar/skiymet/libexec/acquisiti=
on/sks.tcl /usr/home/radar/skiymet/etc/ud3
19999 56552 53088   0  96  0   388   272 -      R+    p3    0:00.00 grep Rec
eureka:~>kill 12293
eureka:~>kill -9 12293
eureka:~>kill -9 12293
eureka:~>fstat /dev/gsio0
USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W NAME
radar    Recorder   12293    3 /dev        149 crw-rw-r--   gsio0 rw  /dev/=
gsio0

Interestingly close-on-exec is set on that file descriptor, so it shouldn't=
=20
be keeping the gsio0 device busy like that.

Does anyone have any suggestions for debugging it? I will try and get a cra=
sh
dump the next time it happens although it's going to be difficult if I can't
use sysctl :(

(I don't have console access and the system is in a very remote location ju=
st
to make it extra fun)

The FreeBSD version is 6.1 with some minor mods to add the various custom
drivers we're using.

Anyone have any suggestions for tracking the problem down? Or better yet a
solution :) I can upgrade it to RELENG_6 if it will fix the problem but I'd
rather not do so without a reasonable idea it will improve things.

Thanks.

=2D-=20
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C

--nextPart6375052.DeoMG3dhqs
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (FreeBSD)

iD8DBQBF5ioG5ZPcIHs/zowRAk0UAJwN56HBPsbnfbnfz/ahtP1+ls8uawCbBWDr
G3E2kFexpzj1Cx1ZDn4ftHs=
=5sDk
-----END PGP SIGNATURE-----

--nextPart6375052.DeoMG3dhqs--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200703011149.02601.doconnor>