From: John Baldwin
To: Kostik Belousov
Cc: Mikhail Teterin, Kris Kennaway, stable@freebsd.org
Date: Wed, 23 Jul 2008 11:23:13 -0400
Subject: Re: "sleeping without queue" ?
Message-Id: <200807231123.14229.jhb@freebsd.org>
In-Reply-To: <20080723120348.GJ17123@deviant.kiev.zoral.com.ua>

On Wednesday 23 July 2008 08:03:48 am Kostik Belousov wrote:
> On Tue, Jul 22, 2008 at 03:59:57PM -0400, Mikhail Teterin wrote:
> > Kostik Belousov wrote:
> > >On Tue, Jul 22, 2008 at 03:26:29PM -0400, Mikhail Teterin wrote:
> > >>Kostik Belousov wrote:
> > >>>Did you switch to the process before doing the backtrace (using the
> > >>>proc command)?
> > >>Ok, thanks. Did not know about this one. Here:
> > >>...
> > >>(kgdb) proc 79759
> > >>(kgdb) bt
> > >>#0 sched_switch (td=0xffffff01286dc000, newtd=0xffffff00010ce000, flags=2)
> > >>   at /var/src/sys/kern/sched_4bsd.c:928
> > >>#1 0x0000000000000000 in ?? ()
> > >>#2 0xffffffff802f1108 in mi_switch (flags=678281216, newtd=0x2)
> > >>   at /var/src/sys/kern/kern_synch.c:442
> > >>#3 0xffffffff80318513 in sleepq_check_timeout ()
> > >>   at /var/src/sys/kern/subr_sleepqueue.c:519
> > >>#4 0xffffffff80318c85 in sleepq_timedwait (wchan=0xffffffff80688408)
> > >>   at /var/src/sys/kern/subr_sleepqueue.c:597
> > >>#5 0xffffffff802f16a2 in _sleep (ident=0xffffffff80688408, lock=0x0,
> > >>   priority=0, wmesg=0xffffffff804f3059 "vmo_de", timo=1)
> > >>   at /var/src/sys/kern/kern_synch.c:224
> > >>#6 0xffffffff8043036b in vm_object_deallocate (object=0xffffff0053024a90)
> > >>   at /var/src/sys/vm/vm_object.c:509
> > >From this frame, please print the object (like p *object) and
> > >likewise print the object that is at the head of the
> > >object->shadow_head list.
> > kgdb /usr/obj/var/src/sys/SILVER-SMP/kernel.debug /dev/mem
> > [GDB will not be able to debug user-mode threads:
> > /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
> > GNU gdb 6.1.1 [FreeBSD]
> > Copyright 2004 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and
> > you are welcome to change it and/or distribute copies of it under
> > certain conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB. Type "show warranty" for
> > details.
> > This GDB was configured as "amd64-marcel-freebsd".
> > There is no member named pathname.
> > Reading symbols from /opt/modules/fuse.ko...done.
> > Loaded symbols for /opt/modules/fuse.ko
> > Reading symbols from /opt/modules/rtc.ko...done.
> > Loaded symbols for /opt/modules/rtc.ko
> > Reading symbols from /boot/kernel/snd_ich.ko...Reading symbols from
> > /boot/kernel/snd_ich.ko.symbols...done.
> > done.
> > Loaded symbols for /boot/kernel/snd_ich.ko
> > Reading symbols from /boot/kernel/msdosfs.ko...Reading symbols from
> > /boot/kernel/msdosfs.ko.symbols...done.
> > done.
> > Loaded symbols for /boot/kernel/msdosfs.ko
> > #0 0x0000000000000000 in ?? ()
> > (kgdb) frame 6
> > Error accessing memory address 0x0: Bad address.
> > (kgdb) pid 79759
> > Undefined command: "pid". Try "help".
> > (kgdb) proc 79759
> > (kgdb) frame 6
> > #6 0xffffffff8043036b in vm_object_deallocate (object=0xffffff0053024a90)
> >    at /var/src/sys/vm/vm_object.c:509
> > 509 pause("vmo_de", 1);
> > (kgdb) p *object
> > $1 = {mtx = {lock_object = {lo_name = 0xffffffff804f21c4 "vm object",
> >       lo_type = 0xffffffff804f3018 "standard object",
> >       lo_flags = 21168128, lo_witness_data = {
> >         lod_list = {stqe_next = 0x0}, lod_witness = 0x0}},
> >     mtx_lock = 4, mtx_recurse = 0},
> >   object_list = {tqe_next = 0xffffff0005018a90,
> >     tqe_prev = 0xffffff00539a6850},
> >   shadow_head = {lh_first = 0xffffff005d3afa90},
> >   shadow_list = {le_next = 0x0, le_prev = 0xffffff005d2cd048},
> >   memq = {tqh_first = 0xffffff007eb9fa58,
> >     tqh_last = 0xffffff007f864820},
> >   root = 0xffffff007ee14d38, size = 427, generation = 66,
> >   ref_count = 2, shadow_count = 1,
> >   type = 0 '\0', flags = 256, pg_color = 0, paging_in_progress = 0,
> >   resident_page_count = 44, backing_object = 0x0,
> >   backing_object_offset = 0,
> >   pager_object_list = {tqe_next = 0x0, tqe_prev = 0x0},
> >   cache = 0x0, handle = 0x0,
> >   un_pager = {vnp = {vnp_size = 576646},
> >     devp = {devp_pglist = {tqh_first = 0x8cc86, tqh_last = 0x0}},
> >     swp = {swp_bcount = 576646}}}
> > (kgdb) p (object->shadow_head)
> > $2 = {lh_first = 0xffffff005d3afa90}
> > (kgdb) p *object->shadow_head.lh_first
> > $3 = {mtx = {lock_object = {lo_name = 0xffffffff804f21c4 "vm object",
> >       lo_type = 0xffffffff804f3018 "standard object",
> >       lo_flags = 21168128, lo_witness_data = {
> >         lod_list = {stqe_next = 0x0}, lod_witness = 0x0}},
> >     mtx_lock = 4, mtx_recurse = 0},
> >   object_list = {tqe_next = 0xffffff0066c32340,
> >     tqe_prev = 0xffffff012f673ac0},
> >   shadow_head = {lh_first = 0x0},
> >   shadow_list = {le_next = 0x0, le_prev = 0xffffff0053024ad0},
> >   memq = {tqh_first = 0xffffff007779f9a0,
> >     tqh_last = 0xffffff0077c04140},
> >   root = 0xffffff0077c04130, size = 387, generation = 3,
> >   ref_count = 1, shadow_count = 0,
> >   type = 0 '\0', flags = 8452, pg_color = 0, paging_in_progress = 0,
> >   resident_page_count = 2, backing_object = 0xffffff0053024a90,
> >   backing_object_offset = 163840,
> >   pager_object_list = {tqe_next = 0x0, tqe_prev = 0x0},
> >   cache = 0x0, handle = 0x0,
> >   un_pager = {vnp = {vnp_size = 365278},
> >     devp = {devp_pglist = {tqh_first = 0x592de, tqh_last = 0x0}},
> >     swp = {swp_bcount = 365278}}}
> >
> > >
> > >Another question is what scheduler do you use ?
> > options SCHED_4BSD # 4BSD scheduler
> > options PREEMPTION # Enable kernel thread preemption
> The state of both the object being destroyed and the object that is
> shadowed looks right to me. Moreover, the shadowed object is not locked:
> the value of mtx_lock is 4. It seems as if the thread missed the wakeup
> owed by pause.
>
> John, could it be that the following commit is supposed to fix the issue ?
>
> r179974 | jhb | 2008-06-24 22:36:33 +0300 (Tue, 24 Jun 2008) | 3 lines
>
> MFC: Change the roundrobin implementation in the 4BSD scheduler to trigger a
> userland preemption directly from hardclock() via sched_clock()

I don't think this would fix the issue. That patch fixed problems where you
had a thread pinned to another CPU that held a lock (typically Giant) that a
callout handler run from softclock needed. This prevented the 'roundrobin'
callout from running. That callout is what forces all the CPUs to do a
context switch (normally this would have forced the pinned thread holding
the lock to eventually run). That bug involves threads on the run queue not
getting to run, even though they may have a higher priority than what is
running now.

I think this case is still a lingering bug in the sleep queue code, present
since the thread lock changes went in. There have been several reports of
it, but I have been unable to figure out how the wakeup is being missed.

> > >>>Also, show the output of ps axl .
> > >> UID   PID  PPID CPU PRI NI VSZ RSS MWCHAN STAT TT    TIME COMMAND
> > >>   0 79759 79758   0  96  0   0  16 -      DE+  p6 0:00,00 /bin/tcsh -fc
> > >>/meow/ports/editors/openoffice.org-3/work/BEB300_m3/solver/300/unxfbsdx.pro/bin/ma
> > >
> > >It makes sense to show the whole ps axl output.
> > See http://aldan.algebra.com/~mi/tmp/ps-axl.txt -- I edited it for
> > privacy a little bit, but process-states are intact.
> > The java-processes in the linuxf have remained unkillable for weeks now
> > -- I even forgot about them. But those are linuxulator problems, whereas
> > the tcsh is native...
> It seems that pid 63930 is problematic too ?
>

-- 
John Baldwin
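
For reference, a note on the "mtx_lock is 4" observation quoted above: the
mtx_lock word holds the owning thread pointer with a few flag bits in its low
bits, and 4 is the "unowned" cookie. The sketch below decodes such a word. The
flag values are an assumption, recalled from a FreeBSD 7-era sys/mutex.h, so
check them against your own tree; describe_mtx_lock is a hypothetical helper
name and not something kgdb provides.

```python
# Flag bits stored in the low bits of mtx_lock.
# ASSUMPTION: values as in a FreeBSD 7-era sys/mutex.h; verify locally.
MTX_RECURSED = 0x1
MTX_CONTESTED = 0x2
MTX_UNOWNED = 0x4
MTX_FLAGMASK = MTX_RECURSED | MTX_CONTESTED | MTX_UNOWNED


def describe_mtx_lock(word):
    """Split a raw mtx_lock value into owner pointer and flag names.

    Hypothetical helper for reading kgdb output; not a kernel or kgdb API.
    """
    # Everything above the flag bits is the owning thread pointer (or 0).
    owner = word & ~MTX_FLAGMASK
    flags = [
        name
        for name, bit in (
            ("MTX_RECURSED", MTX_RECURSED),
            ("MTX_CONTESTED", MTX_CONTESTED),
            ("MTX_UNOWNED", MTX_UNOWNED),
        )
        if word & bit
    ]
    return {"owner_thread": owner, "flags": flags, "held": owner != 0}


if __name__ == "__main__":
    # The value printed for both vm objects in the kgdb session.
    print(describe_mtx_lock(4))
```

Under those assumed flag values, both dumped objects decode to no owner with
only MTX_UNOWNED set, which matches the reading that neither object mutex is
held while the thread sits in pause().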