From owner-freebsd-stable@FreeBSD.ORG  Thu Mar  9 00:26:50 2006
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 778F416A435
	for <freebsd-stable@freebsd.org>; Thu,  9 Mar 2006 00:26:50 +0000 (GMT)
	(envelope-from miguel@anjos.strangled.net)
Received: from compaq.anjos.strangled.net (87-196-228-141.net.novis.pt
	[87.196.228.141])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 9403B43D53
	for <freebsd-stable@freebsd.org>; Thu,  9 Mar 2006 00:26:49 +0000 (GMT)
	(envelope-from miguel@anjos.strangled.net)
Received: from compaq.anjos.strangled.net (localhost [127.0.0.1])
	by compaq.anjos.strangled.net (8.13.4/8.13.4) with ESMTP id
	k290Qj9S002702; Thu, 9 Mar 2006 00:26:46 GMT
	(envelope-from miguel@compaq.anjos.strangled.net)
Received: (from miguel@localhost)
	by compaq.anjos.strangled.net (8.13.4/8.13.4/Submit) id k290Qihj002701; 
	Thu, 9 Mar 2006 00:26:44 GMT (envelope-from miguel)
Date: Thu, 9 Mar 2006 00:26:44 GMT
From: Miguel Lopes Santos Ramos <miguel@anjos.strangled.net>
Message-Id: <200603090026.k290Qihj002701@compaq.anjos.strangled.net>
To: kris@obsecurity.org
In-Reply-To: <20060308224531.GA53611@xor.obsecurity.org>
Cc: kuriyama@imgsrc.co.jp, freebsd-stable@freebsd.org
Subject: Re: rpc.lockd brokenness (2)
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 09 Mar 2006 00:26:50 -0000

> From: Kris Kennaway <kris@obsecurity.org>
> Subject: Re: rpc.lockd brokenness (2)
>
> This is intentional.  It's how pidfile_*() tests whether the process
> is still running.  The intention is that if someone tries to open the
> pidfile again while the first process is still running, the lock
> acquisition will fail and we'll know the other process is still alive,
> and therefore avoid starting a second instance.

No, no, you misunderstood me. The pidfile is left locked after cron has
stopped running (with /etc/rc.d/cron stop). That behaviour must be wrong.

> Your main problem seems to be that you're mounting the same /var via
> NFS from multiple client machines.  This is basically a bad idea to
> begin with because /var expects to be private to each machine (even if
> locking worked as expected, you'd not be able to start cron on more
> than one machine because it would fail as above).  Even if you solved
> this there would be other similar problems.

No, it's the whole filesystem tree for a single client; no one else uses
those files. Hanging a third machine was an accident: I was testing
whether cron.pid was still locked, and I thought I had a window on the
server...

My single problem is locking. Actually, it worked well before I upgraded
this system to 6-STABLE. It's just for one laptop whose disk I don't want
to partition.

> In fact the diskless boot infrastructure in /etc will set up and use a
> md /var for this purpose.

Actually, they don't advise using an md /var, only /etc. Anyway, I don't use
that, because it's my only diskless machine. I have a single NFS mounted /
and an md /tmp. Nothing is shared with anyone else, not even /usr,
because it's my only amd64.

> There is a (known) lockd bug here though, which you isolated:
>

So, this really is bin/80389?
If so, I can tell Jun Kuriyama that his patch didn't change it.

> > With /var/run/cron.pid still locked, on the first client, single-user, same
> > initialization sequence
> >         # lockf -k -t 1 /var/run/cron.pid echo ok
> >         Hangs... always.
>
> which is that lock requests through rpc.lockd cannot be cancelled, so
> they'll hang until the operation succeeds or fails.  In this case
> lockf does a blocking lock request and expects to cancel it with a
> signal after the timer expires, but rpc.lockd doesn't know how to back
> out lock requests so it just hangs forever or until something else
> unlocks the file on the server.
>
> Kris

I am a bit disappointed. This problem didn't cause me trouble before
I went to 6-STABLE; now I must either disable cron or disable locking (which
I can't).
And I'm still not completely convinced. That problem, if I understand
correctly, existed before January...

There are three things...
- cron.pid shouldn't be locked after cron terminated. (this interaction was
fully saved as http://mega.ist.utl.pt/~mlsr/nfs-nofile.bin)
- cron shouldn't hang on startup just because the file is locked, since
pidfile_open opens it with O_NONBLOCK (unlike lockf).
- cron shouldn't hang in such a way that it is not killable... (and
shouldn't the open system call in lockf also be interruptible?)
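
To make the second point concrete, here is a minimal sketch of the
non-blocking behaviour I would expect. It is not the code from pidfile.c:
try_lock_pidfile is a hypothetical helper, and it uses flock() with LOCK_NB
as a stand-in for opening the file with O_NONBLOCK, assuming a local
filesystem.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` and try to take an exclusive lock
 * without blocking. Returns the locked descriptor, or -1 with errno
 * set (EWOULDBLOCK when another open descriptor holds the lock).
 */
int try_lock_pidfile(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd == -1)
        return -1;

    /* LOCK_NB: fail immediately instead of waiting for the holder. */
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;                  /* lock is held until fd is closed */
}
```

With this pattern a second instance fails at once with EWOULDBLOCK instead
of hanging, and closing the descriptor (or the process exiting) releases
the lock.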

So, I'm led to believe that beyond that issue with rpc.lockd, which,
I understand, is an unresolved problem, there is now another problem,
perhaps with pidfile.c...

Thank you for all your time on this issue. I'm still going to try to chase
it, although I only have the knowledge to find it if it is in pidfile.c or
in cron. I understand too little of the interaction between the kernel and
the rest of NFS to chase it if it is somewhere else.

Miguel