From owner-freebsd-bugs@FreeBSD.ORG Mon Mar 2 07:40:04 2009 Return-Path: Delivered-To: freebsd-bugs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CC7EA106566C for ; Mon, 2 Mar 2009 07:40:04 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id B87ED8FC08 for ; Mon, 2 Mar 2009 07:40:04 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id n227e4uO094781 for ; Mon, 2 Mar 2009 07:40:04 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.3/8.14.3/Submit) id n227e4lA094780; Mon, 2 Mar 2009 07:40:04 GMT (envelope-from gnats) Date: Mon, 2 Mar 2009 07:40:04 GMT Message-Id: <200903020740.n227e4lA094780@freefall.freebsd.org> To: freebsd-bugs@FreeBSD.org From: Mikolaj Golub Cc: Subject: Re: kern/131290: [hang]: How to completely freeze FreeBSD 7.1 under a non-privileged user X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Mikolaj Golub List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 02 Mar 2009 07:40:05 -0000 The following reply was made to PR kern/131290; it has been noted by GNATS. From: Mikolaj Golub To: freebsd-bugs@FreeBSD.org, bug-followup@FreeBSD.org Cc: Sergey S Subject: Re: kern/131290: [hang]: How to completely freeze FreeBSD 7.1 under a non-privileged user Date: Mon, 02 Mar 2009 09:39:00 +0200 Some of the messages were not attached to the pr and it looks a bit confusing, so I will summarize the previous conversation, which can be found in freebsd-bugs maillist archive. As Sergey wrote in the first message, one can hang the system running the simple erlang script. The system hangs when the following erlang statement is executed net_kernel:start([test_node, shortnames]) Usually it will require to run the script several times to hang the system, so Sergey provided shell scrip to run the erlang program in loop. The problem was reproduced by me and Kenji Rikitake on i386 7.x two core systems (Sergey observed the hang on multicore amd64). To observe the hang erlang should be built with option -enable-smp-support. Without this option the problem is not observed. I compiled the kernel with additional debug and witness options hoping the system would panic instead of hanging but it didn't. Also I was not able to break into debugger during the hang. But after compiling erlang with -enable-smp-support I have noticed that the system panics when running /usr/local/etc/rc.d/ejabberd stop Usually I needed to run the command several times to make the system crash. The backtrace provided erlier was from such crash. And actually I don't know if this crash is related to the initial problem with hang. What does be related is that both problems appear when the erlang is built with -enable-smp-support. I have done a bit deeper analysis of the crash and here are results. At the moment of the crash two threads of beam.smp process look the most interesting. One of the threads is sleeping in poll system call waiting read data for two descriptors (3 and 8): #0 sched_switch (td=0xc52d9cc0, newtd=0xc3f12880, flags=1) at /usr/src/sys/kern/sched_4bsd.c:934 #1 0xc07a25d3 in mi_switch (flags=Variable "flags" is not available. ) at /usr/src/sys/kern/kern_synch.c:440 #2 0xc07cb702 in sleepq_switch (wchan=0xc0cc721c) at /usr/src/sys/kern/subr_sleepqueue.c:497 #3 0xc07cb99f in sleepq_catch_signals (wchan=0xc0cc721c) at /usr/src/sys/kern/subr_sleepqueue.c:417 #4 0xc07cc177 in sleepq_timedwait_sig (wchan=0xc0cc721c) at /usr/src/sys/kern/subr_sleepqueue.c:631 #5 0xc0760f7b in _cv_timedwait_sig (cvp=0xc0cc721c, lock=0xc0cc7204, timo=903) at /usr/src/sys/kern/kern_condvar.c:369 #6 0xc07d51f4 in poll (td=0xc52d9cc0, uap=0xe6988cfc) at /usr/src/sys/kern/sys_generic.c:962 #7 0xc0a9f2d3 in syscall (frame=0xe6988d38) at /usr/src/sys/i386/i386/trap.c:1090 #8 0xc0a85480 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:255 #9 0x00000033 in ?? () (kgdb) p uap $1 = (struct poll_args *) 0xe6988cfc (kgdb) p *uap $2 = {fds_l_ = 0xe6988cfc "", fds = 0x28516500, fds_r_ = 0xe6988d00 "\002", nfds_l_ = 0xe6988d00 "\002", nfds = 2, nfds_r_ = 0xe6988d04 "\206\003", timeout_l_ = 0xe6988d04 "\206\003", timeout = 902, timeout_r_ = 0xe6988d08 "\200Z@("} (kgdb) p bits $4 = (struct pollfd *) 0xe6988b74 (kgdb) p *bits $5 = {fd = 3, events = 65, revents = 0} (kgdb) p *(bits+1) $6 = {fd = 8, events = 65, revents = 0} 65 = 0x40 | 0x1 #define POLLIN 0x0001 /* any readable data available */ #define POLLRDNORM 0x0040 /* non-OOB/URG data available */ The descriptor 3 is one end of a pipe: (kgdb) p *td->td_proc->p_fd.fd_ofiles[3] $25 = {f_list = {le_next = 0xc42c9214, le_prev = 0xc4f6db48}, f_type = 3, f_data = 0xc42cc000, f_flag = 7, f_mtxp = 0xc3edb7f0, f_ops = 0xc0c19660, f_cred = 0xc42b0900, f_count = 1, f_vnode = 0x0, f_offset = 0, f_vnread_flags = 0, f_gcflag = 0, f_msgcount = 0, f_seqcount = 0, f_nextoff = 0, f_label = 0x0, f_cdevpriv = 0x0} (#define DTYPE_PIPE 3 /* pipe */) While the other descriptor (8) is a socket: (kgdb) p *td->td_proc->p_fd.fd_ofiles[8] $30 = {f_list = {le_next = 0xc5013098, le_prev = 0xc4d09098}, f_type = 2, f_data = 0xc4ee34e0, f_flag = 7, f_mtxp = 0xc3edb3e8, f_ops = 0xc0c19840, f_cred = 0xc42b0900, f_count = 1, f_vnode = 0x0, f_offset = 0, f_vnread_flags = 0, f_gcflag = 0, f_msgcount = 0, f_seqcount = 0, f_nextoff = 0, f_label = 0x0, f_cdevpriv = 0x0} (#define DTYPE_SOCKET 2 /* communications endpoint */) The other thread is the thread that actually causes the panic: #0 doadump () at pcpu.h:196 #1 0xc079a07e in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418 #2 0xc079a352 in panic (fmt=Variable "fmt" is not available. ) at /usr/src/sys/kern/kern_shutdown.c:574 #3 0xc0493a07 in db_panic (addr=Could not find the frame base for "db_panic". ) at /usr/src/sys/ddb/db_command.c:446 #4 0xc049440c in db_command (last_cmdp=0xc0c48114, cmd_table=0x0, dopager=1) at /usr/src/sys/ddb/db_command.c:413 #5 0xc049451a in db_command_loop () at /usr/src/sys/ddb/db_command.c:466 #6 0xc0495d0d in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228 #7 0xc07c3866 in kdb_trap (type=3, code=0, tf=0xe6985ac8) at /usr/src/sys/kern/subr_kdb.c:524 #8 0xc0a9fb5b in trap (frame=0xe6985ac8) at /usr/src/sys/i386/i386/trap.c:688 #9 0xc0a8541b in calltrap () at /usr/src/sys/i386/i386/exception.s:159 #10 0xc07c39ea in kdb_enter_why (why=0xc0b42038 "panic", msg=0xc0b42038 "panic") at cpufunc.h:60 #11 0xc079a336 in panic (fmt=0xc0af1d18 "Bad link elm %p next->prev != elm") at /usr/src/sys/kern/kern_shutdown.c:557 #12 0xc07d4b3b in doselwakeup (sip=0xc42cc064, pri=88) at /usr/src/sys/kern/sys_generic.c:1142 #13 0xc07d4c1e in selwakeuppri (sip=0xc42cc064, pri=88) at /usr/src/sys/kern/sys_generic.c:1118 #14 0xc07d9c8e in pipe_write (fp=0xc4f6db48, uio=0xe6985c60, active_cred=0xc42b0900, flags=0, td=0xc52da000) at /usr/src/sys/kern/sys_pipe.c:528 #15 0xc07d6095 in dofilewrite (td=0xc52da000, fd=4, fp=0xc4f6db48, auio=0xe6985c60, offset=-1, flags=0) at file.h:256 #16 0xc07d6318 in kern_writev (td=0xc52da000, fd=4, auio=0xe6985c60) at /usr/src/sys/kern/sys_generic.c:401 #17 0xc07d638f in write (td=0xc52da000, uap=0xe6985cfc) at /usr/src/sys/kern/sys_generic.c:317 #18 0xc0a9f2d3 in syscall (frame=0xe6985d38) at /usr/src/sys/i386/i386/trap.c:1090 #19 0xc0a85480 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:255 #20 0x00000033 in ?? () Previous frame inner to this frame (corrupt stack?) The thread just has written some data to the pipe (fd=4) and called selwakeuppri to wake up the thread sleeping in poll but has found selinfo inconsistent. (kgdb) frame 12 #12 0xc07d4b3b in doselwakeup (sip=0xc42cc064, pri=88) at /usr/src/sys/kern/sys_generic.c:1142 1142 TAILQ_REMOVE(&td->td_selq, sip, si_thrlist); (kgdb) p sip $1 = (struct selinfo *) 0xc42cc064 (kgdb) p *sip $2 = {si_thrlist = {tqe_next = 0xc4ee3530, tqe_prev = 0xc52d9ce0}, si_thread = 0xc52d9cc0, si_note = {kl_list = { slh_first = 0x0}, kl_lock = 0xc0770070 , kl_unlock = 0xc07700a0 , kl_locked = 0xc0773640 , kl_lockarg = 0xc42cc170}, si_flags = 0} (kgdb) p *sip->si_thrlist.tqe_next $3 = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0, si_note = {kl_list = {slh_first = 0x0}, kl_lock = 0xc0770070 , kl_unlock = 0xc07700a0 , kl_locked = 0xc0773640 , kl_lockarg = 0xc4ee3554}, si_flags = 0} Here is how descriptors f_data structures look: (kgdb) p *((struct pipe *) td->td_proc->p_fd.fd_ofiles[3].f_data) $37 = {pipe_buffer = {cnt = 1, in = 1, out = 0, size = 16384, buffer = 0xe1530000 "!"}, pipe_map = {cnt = 0, pos = 0, npages = 0, ms = {0x0 }}, pipe_sel = {si_thrlist = {tqe_next = 0xc4ee3530, tqe_prev = 0xc52d9ce0}, si_thread = 0xc52d9cc0, si_note = {kl_list = {slh_first = 0x0}, kl_lock = 0xc0770070 , kl_unlock = 0xc07700a0 , kl_locked = 0xc0773640 , kl_lockarg = 0xc42cc170}, si_flags = 0}, pipe_atime = { tv_sec = 1234768224, tv_nsec = 0}, pipe_mtime = {tv_sec = 1234768224, tv_nsec = 0}, pipe_ctime = { tv_sec = 1234768224, tv_nsec = 0}, pipe_sigio = 0x0, pipe_peer = 0xc42cc0b8, pipe_pair = 0xc42cc000, pipe_state = 2304, pipe_busy = 0, pipe_present = 1} (kgdb) p *((struct socket *) td->td_proc->p_fd.fd_ofiles[8].f_data) $36 = {so_count = 1, so_type = 1, so_options = 0, so_linger = 0, so_state = 258, so_qstate = 0, so_pcb = 0xc4bb2a8c, so_proto = 0xc0c20fe8, so_head = 0x0, so_incomp = {tqh_first = 0x0, tqh_last = 0xc4ee34fc}, so_comp = { tqh_first = 0x0, tqh_last = 0xc4ee3504}, so_list = {tqe_next = 0x0, tqe_prev = 0x0}, so_qlen = 0, so_incqlen = 0, so_qlimit = 0, so_timeo = 0, so_error = 0, so_sigio = 0x0, so_oobmark = 0, so_aiojobq = {tqh_first = 0x0, tqh_last = 0xc4ee3528}, so_rcv = {sb_sel = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0, si_note = {kl_list = {slh_first = 0x0}, kl_lock = 0xc0770070 , kl_unlock = 0xc07700a0 , kl_locked = 0xc0773640 , kl_lockarg = 0xc4ee3554}, si_flags = 0}, sb_mtx = {lock_object = {lo_name = 0xc0b472c8 "so_rcv", lo_type = 0xc0b472c8 "so_rcv", lo_flags = 16973824, lo_witness_data = {lod_list = {stqe_next = 0xc0c8a8d0}, lod_witness = 0xc0c8a8d0}}, mtx_lock = 4, mtx_recurse = 0}, sb_sx = {lock_object = { lo_name = 0xc0b4aa15 "so_rcv_sx", lo_type = 0xc0b4aa15 "so_rcv_sx", lo_flags = 37421056, lo_witness_data = { lod_list = {stqe_next = 0xc0c87ea0}, lod_witness = 0xc0c87ea0}}, sx_lock = 1, sx_recurse = 0}, sb_state = 0, sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_sndptr = 0x0, sb_sndptroff = 0, sb_cc = 0, sb_hiwat = 71680, sb_mbcnt = 0, sb_mcnt = 0, sb_ccnt = 0, sb_mbmax = 262144, sb_ctl = 0, sb_lowat = 1, sb_timeo = 0, sb_flags = 2048}, so_snd = {sb_sel = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0, si_note = { kl_list = {slh_first = 0x0}, kl_lock = 0xc0770070 , kl_unlock = 0xc07700a0 , kl_locked = 0xc0773640 , kl_lockarg = 0xc4ee35e8}, si_flags = 0}, sb_mtx = {lock_object = {lo_name = 0xc0b472c1 "so_snd", lo_type = 0xc0b472c1 "so_snd", lo_flags = 16973824, lo_witness_data = {lod_list = {stqe_next = 0xc0c8a8f8}, lod_witness = 0xc0c8a8f8}}, mtx_lock = 4, mtx_recurse = 0}, sb_sx = {lock_object = { lo_name = 0xc0b4aa0b "so_snd_sx", lo_type = 0xc0b4aa0b "so_snd_sx", lo_flags = 37421056, lo_witness_data = { lod_list = {stqe_next = 0xc0c87ec8}, lod_witness = 0xc0c87ec8}}, sx_lock = 1, sx_recurse = 0}, sb_state = 0, sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_sndptr = 0x0, sb_sndptroff = 0, sb_cc = 0, sb_hiwat = 43008, sb_mbcnt = 0, sb_mcnt = 0, sb_ccnt = 0, sb_mbmax = 262144, sb_ctl = 0, sb_lowat = 2048, sb_timeo = 0, sb_flags = 2048}, so_upcall = 0, so_upcallarg = 0x0, so_cred = 0xc42b0900, so_label = 0x0, so_peerlabel = 0x0, so_gencnt = 850, so_emuldata = 0x0, so_accf = 0x0, so_fibnum = 0} So so_rcv = {sb_sel = {si_thrlist = {tqe_next = 0x0, tqe_prev = 0x0}, si_thread = 0x0 for the socket descriptor looks wrong. Looking through the kernel code I tried to find the path that could lead to this state but failed and at present I don't have any idea where I should look. Is it possible to have the kernel built with QUEUE_MACRO_DEBUG defined? Setting it to 1 in sys/queue.h makes impossible even to build the kernel. Also I tried to add the line #define QUEUE_MACRO_DEBUG 1 in kern/sys_generic.c (just before the headers). The kernel has built but I get panic during the system boot. -- Mikolaj Golub