From owner-freebsd-stable@FreeBSD.ORG Thu Mar 9 00:26:50 2006
Return-Path:
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 778F416A435
	for ; Thu, 9 Mar 2006 00:26:50 +0000 (GMT)
	(envelope-from miguel@anjos.strangled.net)
Received: from compaq.anjos.strangled.net (87-196-228-141.net.novis.pt [87.196.228.141])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 9403B43D53
	for ; Thu, 9 Mar 2006 00:26:49 +0000 (GMT)
	(envelope-from miguel@anjos.strangled.net)
Received: from compaq.anjos.strangled.net (localhost [127.0.0.1])
	by compaq.anjos.strangled.net (8.13.4/8.13.4) with ESMTP id k290Qj9S002702;
	Thu, 9 Mar 2006 00:26:46 GMT
	(envelope-from miguel@compaq.anjos.strangled.net)
Received: (from miguel@localhost)
	by compaq.anjos.strangled.net (8.13.4/8.13.4/Submit) id k290Qihj002701;
	Thu, 9 Mar 2006 00:26:44 GMT
	(envelope-from miguel)
Date: Thu, 9 Mar 2006 00:26:44 GMT
From: Miguel Lopes Santos Ramos
Message-Id: <200603090026.k290Qihj002701@compaq.anjos.strangled.net>
To: kris@obsecurity.org
In-Reply-To: <20060308224531.GA53611@xor.obsecurity.org>
Cc: kuriyama@imgsrc.co.jp, freebsd-stable@freebsd.org
Subject: Re: rpc.lockd brokenness (2)
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 09 Mar 2006 00:26:50 -0000

> From: Kris Kennaway
> Subject: Re: rpc.lockd brokenness (2)
>
> This is intentional.  It's how pidfile_*() tests whether the process
> is still running.  The intention is that if someone tries to open the
> pidfile again while the first process is still running, the lock
> acquisition will fail and we'll know the other process is still alive,
> and therefore avoid starting a second instance.

No, no, you got me wrong.
The pidfile is left locked after cron has stopped running (with
/etc/rc.d/cron stop). This behaviour must be wrong.

> Your main problem seems to be that you're mounting the same /var via
> NFS from multiple client machines.  This is basically a bad idea to
> begin with because /var expects to be private to each machine (even if
> locking worked as expected, you'd not be able to start cron on more
> than one machine because it would fail as above).  Even if you solved
> this there would be other similar problems.

No, it's the whole filesystem tree for a single client; no one else uses
those files. Hanging a third machine was an accident: I was testing whether
cron.pid was still locked and I thought I had a window on the server...
My single problem is locking. Actually, it worked well before I upgraded
this system to 6-STABLE.
It's just for one laptop whose disk I don't want to partition.

> In fact the diskless boot infrastructure in /etc will set up and use a
> md /var for this purpose.

Actually, they don't advise using an md /var, only /etc.
Anyway, I don't use that, because it's my only diskless machine. I have a
single NFS-mounted / and an md /tmp. Nothing is shared with anyone else,
not even /usr, because it's my only amd64.

> There is a (known) lockd bug here though, which you isolated:
>

So, this really is bin/80389? If so, I can tell Jun Kuriyama that his patch
didn't change it.

> > With /var/run/cron.pid still locked, on the first client, single-user, same
> > initialization sequence
> > # lockf -k -t 1 /var/run/cron.pid echo ok
> > Hangs... always.
>
> which is that lock requests through rpc.lockd cannot be cancelled, so
> they'll hang until the operation succeeds or fails.  In this case
> lockf does a blocking lock request and expects to cancel it with a
> signal after the timer expires, but rpc.lockd doesn't know how to back
> out lock requests so it just hangs forever or until something else
> unlocks the file on the server.
>
> Kris

I am a bit disappointed.
First, this problem didn't cause me trouble before I went to 6-STABLE; now
I must either disable cron or disable locking (which I can't).

And I'm still not completely convinced. That problem, if I understand
correctly, existed before January...

There are three things...
- cron.pid shouldn't be locked after cron has terminated.
  (this interaction was fully saved as
  http://mega.ist.utl.pt/~mlsr/nfs-nofile.bin)
- cron shouldn't hang on startup just because the file is locked, since
  pidfile_open opens it with O_NONBLOCK (unlike lockf).
- cron shouldn't hang in such a way that it is not killable...
  (and shouldn't the open system call in lockf also be interruptible?)

So, I'm led to believe that beyond that issue with rpc.lockd, which, I
understand, is an unresolved problem, there is now another problem, perhaps
with pidfile.c...

Thank you for all your time on this issue.
I'm still going to try to chase it, although I only have the knowledge to
find it if it is in pidfile.c or in cron. I understand too little of the
interaction between the kernel and the rest of NFS to chase it if it is
somewhere else.

Miguel
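
For reference, the behaviour expected in the first two points above can be
sketched on a local filesystem, where rpc.lockd is not involved. This is only
an illustration in Python using flock(2) through the fcntl module; it is not
the code in pidfile.c, and it says nothing about what happens over NFS. The
names try_lock and demo, and the demo.pid path, are made up for the example.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and try an exclusive flock() without blocking.

    Returns the open file descriptor on success, or None if another
    open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail at once instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EWOULDBLOCK, errno.EAGAIN):
            return None
        raise
    return fd


def demo():
    # Hypothetical stand-in for /var/run/cron.pid, on a local filesystem.
    path = os.path.join(tempfile.mkdtemp(), "demo.pid")

    daemon_fd = try_lock(path)      # the "daemon" takes the lock
    while_running = try_lock(path)  # a second instance is refused, no hang

    os.close(daemon_fd)             # the "daemon" exits; its lock is dropped
    after_exit = try_lock(path)     # the lock can now be taken again
    if after_exit is not None:
        os.close(after_exit)

    return (daemon_fd is not None,
            while_running is None,
            after_exit is not None)
```

On a local filesystem demo() should report that the first attempt gets the
lock, the second is refused immediately, and the lock is free again once the
holder has closed the file. What I see over NFS differs on exactly the last
two steps.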