From owner-freebsd-arch@FreeBSD.ORG  Mon Jun 26 09:31:20 2006
Date: Mon, 26 Jun 2006 19:31:10 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Andrew Reilly <andrew-freebsd@areilly.bpc-users.org>
In-Reply-To: <20060625213605.GA93766@duncan.reilly.home>
Message-ID: <20060626181131.G67741@delplex.bde.org>
References: <20060625011746.GC81052@duncan.reilly.home>
	<20060625013110.GA62237@troutmask.apl.washington.edu>
	<20060625020154.GA89358@gurney.reilly.home>
	<20060626002658.A65226@delplex.bde.org>
	<20060625213605.GA93766@duncan.reilly.home>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Steve Kargl <sgk@troutmask.apl.washington.edu>, freebsd-arch@FreeBSD.org
Subject: Re: What's up with our stdout?

On Mon, 26 Jun 2006, Andrew Reilly wrote:

> On Mon, Jun 26, 2006 at 01:10:38AM +1000, Bruce Evans wrote:
>> This doesn't seem to have anything to do with stdout.  F_SETLKW just
>> seems to be broken on all regular files (and thus is unsupported for
>> all file types).  The above works under the modified version of
>> FreeBSD-5.2 that I use, but it fails with the documented errno EOPNOTSUPP
>> under at least FreeBSD-6.0-STABLE.  Replacing STDOUT_FILENO by fd =
>> open("foo", O_RDWR) gives the same failure.  Replacing F_SETLKW by
>> F_SETLK or F_GETLK gives the same failure.
>
> Thanks for the clarification.
>
> Don't all of the databases rely on fcntl locks?  How can this be
> broken?

The problem seems to be that the file system really doesn't support
locking.  On freefall, both fcntl(2) locking and flock(2) fail in my
home directory but work in /tmp.  My home directory on freefall is
nfs-mounted without nolockd, and also without rpc.lockd or rpc.statd.
nfs without the rpc daemons really doesn't support remote locking, so
it is correct for it to fail, but rpc.lockd is buggy so it is often
not used.  On my own machines I normally avoid nfs-locking using nolockd
(this gives locking that doesn't work (remotely) but claims to work).
On the FreeBSD cluster nfs-locking is apparently normally avoided by
nfs-mounting without nolockd and not starting the rpc daemons (this
gives locking that doesn't work and doesn't claim to work).

Configuring locking for nfs is confusing and poorly documented.
Neither rpc.lockd nor rpc.statd gets started automatically when a file
system is nfs-mounted without nolockd.  This wouldn't be easy to
automate, since the daemons must be started on both the clients and
servers.  mount_nfs(8) doesn't say clearly which daemons must be started
where.  rc.conf(5) says wrongly that rpc_lockd_enable and rpc_statd_enable
only apply to servers.  Starting both daemons on both clients and servers
seems to be needed.  With a file system nfs-remounted without nolockd,
there seem to be ordering or timing requirements for starting them:
starting them manually sometimes gave a useful error message for flock()
attempts when not all were started, but sometimes starting them all
didn't stop flock() from failing, and other times it gave a hung flock().
Killing and restarting rpc.lockd on the client (while leaving the other
daemons running) usually worked to unhang flock() and make it work on the
next try.

A modified version of the NetBSD code to test both flock() and fcntl()
locking:

%%%
#include <sys/file.h>	/* flock(), LOCK_EX, LOCK_UN */

#include <err.h>	/* err(), warnx() */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

struct flock stdout_lock;

int
main(void)
{
	/* Whole-file flock(2) lock and unlock on stdout. */
	if (flock(STDOUT_FILENO, LOCK_EX) == -1)
		err(EXIT_FAILURE, "flock(...LOCK_EX): stdout");
	warnx("flock(...LOCK_EX): succeeded");
	if (flock(STDOUT_FILENO, LOCK_UN) == -1)
		err(EXIT_FAILURE, "flock(...LOCK_UN): stdout");
	warnx("flock(...LOCK_UN): succeeded");

	/* fcntl(2) write lock from offset 0 to end of file (l_len == 0). */
	stdout_lock.l_len = 0;
	stdout_lock.l_start = 0;
	stdout_lock.l_type = F_WRLCK;
	stdout_lock.l_whence = SEEK_SET;
	if (fcntl(STDOUT_FILENO, F_SETLKW, &stdout_lock) == -1)
		err(EXIT_FAILURE, "fcntl(...F_SETLKW): stdout");
	warnx("fcntl(...F_SETLKW): succeeded");
	return (0);
}
%%%

The Minix regression tests showed too many other regressions.  One was
that after mkdir()/open()/rmdir() of a directory, fstat() on the open
unlinked directory gave a wrong link count of 2.  Another was that
after creation of a directory with LINK_MAX links, it was possible to
mkdir() another subdir in the directory, giving 1 more link than the
maximum possible:

     drwxrwxrwx  32768 bde  bde  1048576 Jun 25 16:42 DIR_28/foo/

This is a more serious bug, since ffs uses the test (i_nlink <= 0) in
a couple of places, and since the type of i_nlink is int16_t, a link
count of 32768 is unrepresentable (it overflows to -32768).  This seems
to be caused by either a race in soft updates or using i_nlink where
i_effnlink should be used.  These bugs show up in an nfs-mounted
directory on machines in the FreeBSD cluster.  I think nfs doesn't affect
this, and the underlying file system is ffs2 with soft updates.  Normally
I don't want to see bugs like this, and I use ffs1 without soft updates
on my own machines to avoid seeing new ones.

Bruce