From: "Arno J. Klaassen"
To: Landon Fuller
Date: 18 Jan 2008 22:13:01 +0100
References: <200711301716.lAUHGEV1064334@repoman.freebsd.org> <90584F61-91FE-446E-978E-FD234553E8FC@threerings.net>
In-Reply-To: <90584F61-91FE-446E-978E-FD234553E8FC@threerings.net>
Cc: nate@yogotech.com, ivo@scito.com, Daniel Eischen, davidxu@freebsd.org, java@freebsd.org, julian@freebsd.org
Subject: Re: cvs commit: src/lib/libkse/thread thr_kern.c
List-Id: Porting Java to FreeBSD

Hello,

>LF On Dec 2, 2007, at 09:31, Arno J. Klaassen wrote:
>LF > For info, the attached patch, which partially reverts the MFC of rev 1.286
>LF > of kern_fork.c, seems to work as well (without the above patch, to
>LF > be clear),
>LF
>LF I just upgraded our 8-core build server from pre-November 6-STABLE to
>LF 6.3-RELEASE, and ran into this issue, causing our fork-heavy builder
>LF processes to lock up regularly.
>LF
>LF Your suggested patch (reverting the 1.286 MFC to
>LF sys/kern/kern_fork.c) allows our builds to run to completion;

Well, all I can say is that the box where I had the problem is a heavily
used production server, and it has been running flawlessly and without
interruption since the end of November with the kern_fork.c partial
revert. It doesn't seem to hurt or disrupt anything else (that I use).

JE> .. the reason it was changed was that the
JE> previous code results in heavily loaded threaded processes that
JE> fork, hanging in indefinite lockups IN THE KERNEL. Eventually
JE> the whole machine would become unusable. In particular when
JE> there is NFS being used, but in other situations too. SO I'm
JE> damned if I do and damned if I don't on this.

Maybe; we (now) almost exclusively use FreeBSD for Java + NFS (with some
vestiges of C[++] resisting). I only got this problem on 2x2 SMP
RELENG_6, not on RELENG[67] UP or 1x2 SMP. I had a 'similar' problem on
2x2 SMP RELENG_7, which was band-aided with rev 1.128 of
lib/libkse/thread/thr_kern.

JE> > please do a ktrace of the program and send that to me
JE> >
JE> Here's my guess as to what is happening:
JE> this is not based on code..
JE>
JE> thread 1 calls the dummy fork(3)
JE> thread 2 calls the dummy fork(3)
JE> thread 1 calls fork(2) (the syscall, from within the dummy fork)
JE> thread 2 calls fork(2) (the real one in the kernel)
JE> thread 1 proceeds
JE> thread 2 blocks on a VM lock until thread 1 completes
JE> kernel duplicates the memory space
JE> thread 1 returns from fork(2)
JE> thread 1 takes out mutex X inside dummy fork(3)
JE> thread 2 proceeds in the kernel on forking.
JE> kernel duplicates the memory space (including mutex X)
JE> thread 2 returns from kernel and looks for mutex X
JE> thread 2 in client tries to take out mutex X inside dummy fork(3) and
JE> waits.
JE> thread 1 releases mutex X
JE> thread 2 proceeds
JE> ================================
JE> in child1 thread1 runs fine.
JE> in child2 thread2 waits for thread 1 to drop the mutex
JE> (there is no thread1)

[ .. alternative .. ]

DE> I suppose it is malloc() that is getting into an inconsistent
DE> state in the child.

I'm not qualified to judge the FreeBSD internals, though both sound
plausible to me in the sense that the thread library does not seem to
matter: the problem is easy to provoke with libpthread on RELENG_6, just
a bit less easy to provoke with libthr on RELENG_6 (see PR 116667 and
166668), harder to provoke with libpthread on RELENG_7 (with the above
band-aid sufficient for me to not be able to reproduce it again), and I
am unable to provoke it with libthr on RELENG_7.

Hope this helps to get a clue.

Best regards,

Arno