From owner-freebsd-arch@FreeBSD.ORG  Sun Apr 11 00:08:18 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4ADB91065670;
	Sun, 11 Apr 2010 00:08:18 +0000 (UTC)
	(envelope-from yanefbsd@gmail.com)
Received: from mail-qy0-f181.google.com (mail-qy0-f181.google.com
	[209.85.221.181])
	by mx1.freebsd.org (Postfix) with ESMTP id E3B968FC19;
	Sun, 11 Apr 2010 00:08:17 +0000 (UTC)
Received: by qyk11 with SMTP id 11so3798908qyk.13
	for <multiple recipients>; Sat, 10 Apr 2010 17:08:17 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:in-reply-to:references
	:date:received:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=5Rlb3ELeFoZIXf0Yq+k21xUXPbkBC8YasoqH3jnmNXo=;
	b=kG76Xpyr9iONz2D7t7F3RHsvVMa019V2ix3l6kOySdzEfnwU7OfQsZ9ePLg5g5rx7x
	wBhIGZ8IX8BKMpc0dl6qvE4IiUpKsThIU5dy+lCq2Sz7IJENy+z/K9iYmEnOyks5OI01
	Sppo63NoYvAOL2rhOU11jrGy+JMMY/U0Td16Y=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	b=t11cpl0w80/CtsvWbyqLKK3lztDse/AONE1uC0Eo2uK5ZhEzidVgT4sQbC9jyax/cF
	50/h+2QZA+LyjgCB1EJhuCm/qQjb8QfQaRCglI0gAqA91uHuW8J/5q6I9CIFHDOSQCfr
	nxVFSJcf21FwQp57+ez/p6JOEIVmXV4R4SmjY=
MIME-Version: 1.0
Received: by 10.229.28.85 with HTTP; Sat, 10 Apr 2010 17:08:16 -0700 (PDT)
In-Reply-To: <h2y7d6fde3d1004101557r12ba49ffva56a00ea42053c51@mail.gmail.com>
References: <x2j7d6fde3d1004101552u1b60ee9etb8ed15183fc1f26f@mail.gmail.com>
	<h2y7d6fde3d1004101557r12ba49ffva56a00ea42053c51@mail.gmail.com>
Date: Sat, 10 Apr 2010 17:08:16 -0700
Received: by 10.229.14.157 with SMTP id g29mr3204782qca.57.1270944496855; Sat, 
	10 Apr 2010 17:08:16 -0700 (PDT)
Message-ID: <l2z7d6fde3d1004101708o3946d155pfe2f9644daff329c@mail.gmail.com>
From: Garrett Cooper <yanefbsd@gmail.com>
To: arch@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: portmgr@freebsd.org
Subject: Re: [RFC] Remove @owner and @user from package list
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Apr 2010 00:08:18 -0000

On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper <yanefbsd@gmail.com> wrote:
> On Sat, Apr 10, 2010 at 3:52 PM, Garrett Cooper <yanefbsd@gmail.com> wrote:
>> Hi again arch,
>>    When doing some research, it appears that while functionality in
>> theory exists for @owner and @user in the package list, it isn't
>> actually used in the pkg_install code at all, adding unnecessary bloat
>> to package lists;
>>    FWIW this functionality (just like @exec and @unexec) can be
>> implemented via pkg-install or more reliably via an mtree file.
>>    Thoughts?
>
> Nevermind; I was misreading the code.

    Doing some more digging, there are a handful of ports that I don't
have installed that implement this functionality:

@mode ...

$ grep -Ilr @mode /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
databases/phpmyadmin/pkg-plist-chunk
databases/phpmyadmin211/pkg-plist-chunk
devel/libtai/pkg-plist
dns/poweradmin/pkg-plist-chunk
games/columns/pkg-plist
games/falconseye/pkg-plist
games/glasteroids/pkg-plist
games/nethack32/pkg-plist
games/nethack33/pkg-plist
games/nethack34/pkg-plist
games/omega/pkg-plist
games/sol/pkg-plist
games/wanderer/pkg-plist
games/xmines/pkg-plist
games/zangband/pkg-plist
irc/inspircd/pkg-plist
japanese/nethack32/pkg-plist
japanese/nethack34/pkg-plist
japanese/zangband/pkg-plist
net/phpldapadmin/pkg-plist-chunk
net/phpldapadmin098/pkg-plist-chunk
security/cyrus-sasl2-saslauthd/pkg-plist
sysutils/clockspeed/pkg-plist
www/ssserver/pkg-plist

@owner ...

$ grep -Ilr @owner /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
games/omega/pkg-plist
games/sol/pkg-plist
games/zangband/pkg-plist
japanese/zangband/pkg-plist
net/mediatomb/pkg-plist
news/cnews/pkg-plist
news/ifmail/pkg-plist

    Also, I'm not positive, but I think that none of the released
packages use this either, so ultimately this functionality could be
removed without any impact on folks unless a 3rd party has implemented
it outside of FreeBSD. It could be delivered in mtree files or fixed in
the upstream installation Makefiles, and IMO it should not be part of
the package list, as it only obscures precedence, ownership, and
permissions.
    There's also a great deal of overlap involved in package creation
and installation: tar applies permission bits and ownership; mtree is
called next to fix permissions and ownership, if the mtree file
exists; then the @owner and @mode stuff applies a hammer solution
over a series of files. Note that chmod -R and chown -R are called
with @owner and @mode :( :

    if (Mode)
        if (vsystem("cd %s && /bin/chmod -R %s %s", cd_to, Mode, arg))
            warnx("couldn't change modes of '%s' to '%s'", arg, Mode);
    if (Owner && Group) {
        if (vsystem("cd %s && /usr/sbin/chown -R %s:%s %s", cd_to,
                    Owner, Group, arg))
            warnx("couldn't change owner/group of '%s' to '%s:%s'",
                   arg, Owner, Group);
        return;
    }
    if (Owner) {
        if (vsystem("cd %s && /usr/sbin/chown -R %s %s", cd_to, Owner, arg))
            warnx("couldn't change owner of '%s' to '%s'", arg, Owner);
        return;
    } else if (Group)
        if (vsystem("cd %s && /usr/bin/chgrp -R %s %s", cd_to, Group, arg))
            warnx("couldn't change group of '%s' to '%s'", arg, Group);
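
For contrast with the recursive shell calls above, here is a minimal
sketch of what a per-file, non-recursive variant might look like. This
is not pkg_install code: apply_one() is a hypothetical helper, and it
assumes the mode has already been parsed into a mode_t and the
owner/group into numeric ids.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of pkg_install): apply mode and
 * ownership to exactly one path, with no shell and no -R.
 * Pass (uid_t)-1 or (gid_t)-1 to leave that id unchanged, which is
 * what chown(2) itself allows.
 * Returns 0 on success, -1 if either call failed.
 */
static int
apply_one(const char *path, int set_mode, mode_t mode, uid_t uid, gid_t gid)
{
    int rc = 0;

    assert(path != NULL);

    /* Only touch the mode when the caller asked for it. */
    if (set_mode && chmod(path, mode) != 0) {
        perror("chmod");
        rc = -1;
    }

    /* Skip chown(2) entirely when neither id is being changed. */
    if ((uid != (uid_t)-1 || gid != (gid_t)-1) &&
        chown(path, uid, gid) != 0) {
        perror("chown");
        rc = -1;
    }
    return rc;
}
```

Called once per packing-list entry, something like this would only
touch the files the list names, rather than everything underneath a
directory argument.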

Thoughts?
-Garrett

From owner-freebsd-arch@FreeBSD.ORG  Sun Apr 11 00:11:12 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D4DCC1065674;
	Sun, 11 Apr 2010 00:11:12 +0000 (UTC)
	(envelope-from yanefbsd@gmail.com)
Received: from qw-out-2122.google.com (qw-out-2122.google.com [74.125.92.24])
	by mx1.freebsd.org (Postfix) with ESMTP id 789118FC08;
	Sun, 11 Apr 2010 00:11:12 +0000 (UTC)
Received: by qw-out-2122.google.com with SMTP id 5so1529776qwi.7
	for <multiple recipients>; Sat, 10 Apr 2010 17:11:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:in-reply-to:references
	:date:received:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=8gxnK0d7rbLEIRfuE1EVz0l04lWadfi4RG05uAbap5g=;
	b=RWLfBEYiaiOsD4OUXeDSByObwiBV6kdyK0CMULwHfcPnBaHoKxDnH/MSKIEyGaUapN
	PzOnhjvRAuz/jgJnC3GRWP3mode7RciXOU9dmsgonBsp8XibvCbhILKpdxum4p6YxeGN
	rimiP6CTI+0Z1yn9LJNtIMdsX8nDlY/ENKpWE=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	b=FzJX3uI43SKL2LijowK8XQ2HzoYpA9tG0CSU2zb1b91CwY0xi+NSx3iKwF+hsFCRsj
	5xNMT3nmF+PkXaFoEevYuzoje78/w520VFvPmVckV59sKSFSeeFMt7DUfP9GF2qfqyJy
	oCrZ+AjHcxd8bvUWmn5ctUUD9xSTPBMZGZdxk=
MIME-Version: 1.0
Received: by 10.229.28.85 with HTTP; Sat, 10 Apr 2010 17:11:11 -0700 (PDT)
In-Reply-To: <l2z7d6fde3d1004101708o3946d155pfe2f9644daff329c@mail.gmail.com>
References: <x2j7d6fde3d1004101552u1b60ee9etb8ed15183fc1f26f@mail.gmail.com>
	<h2y7d6fde3d1004101557r12ba49ffva56a00ea42053c51@mail.gmail.com>
	<l2z7d6fde3d1004101708o3946d155pfe2f9644daff329c@mail.gmail.com>
Date: Sat, 10 Apr 2010 17:11:11 -0700
Received: by 10.229.226.1 with SMTP id iu1mr3210885qcb.19.1270944671710; Sat, 
	10 Apr 2010 17:11:11 -0700 (PDT)
Message-ID: <s2q7d6fde3d1004101711rd5f0d507r560d45beeeb314e8@mail.gmail.com>
From: Garrett Cooper <yanefbsd@gmail.com>
To: arch@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: portmgr@freebsd.org
Subject: Re: [RFC] Remove @owner and @user from package list
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Apr 2010 00:11:12 -0000

On Sat, Apr 10, 2010 at 5:08 PM, Garrett Cooper <yanefbsd@gmail.com> wrote:
> On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper <yanefbsd@gmail.com> wrote:
>> On Sat, Apr 10, 2010 at 3:52 PM, Garrett Cooper <yanefbsd@gmail.com> wrote:
>>> Hi again arch,
>>>    When doing some research, it appears that while functionality in
>>> theory exists for @owner and @user in the package list, it isn't
>>> actually used in the pkg_install code at all, adding unnecessary bloat
>>> to package lists;
>>>    FWIW this functionality (just like @exec and @unexec) can be
>>> implemented via pkg-install or more reliably via an mtree file.
>>>    Thoughts?
>>
>> Nevermind; I was misreading the code.
>
>    Doing some more digging, there are a handful of ports that I don't
> have installed that implement this functionality:
>
> @mode ...
>
> $ grep -Ilr @mode /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
> databases/phpmyadmin/pkg-plist-chunk
> databases/phpmyadmin211/pkg-plist-chunk
> devel/libtai/pkg-plist
> dns/poweradmin/pkg-plist-chunk
> games/columns/pkg-plist
> games/falconseye/pkg-plist
> games/glasteroids/pkg-plist
> games/nethack32/pkg-plist
> games/nethack33/pkg-plist
> games/nethack34/pkg-plist
> games/omega/pkg-plist
> games/sol/pkg-plist
> games/wanderer/pkg-plist
> games/xmines/pkg-plist
> games/zangband/pkg-plist
> irc/inspircd/pkg-plist
> japanese/nethack32/pkg-plist
> japanese/nethack34/pkg-plist
> japanese/zangband/pkg-plist
> net/phpldapadmin/pkg-plist-chunk
> net/phpldapadmin098/pkg-plist-chunk
> security/cyrus-sasl2-saslauthd/pkg-plist
> sysutils/clockspeed/pkg-plist
> www/ssserver/pkg-plist
>
> @owner ...
>
> $ grep -Ilr @owner /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
> games/omega/pkg-plist
> games/sol/pkg-plist
> games/zangband/pkg-plist
> japanese/zangband/pkg-plist
> net/mediatomb/pkg-plist
> news/cnews/pkg-plist
> news/ifmail/pkg-plist
>
>    Also, I'm not positive, but I think that none of the released
> packages use this either -- so ultimately this functionality could be
> removed without any impact to folks unless there's a 3rd party that
> has implemented this outside of FreeBSD. This functionality could be
> delivered in mtree files, could be fixed with the upstream
> installation Makefiles, and IMO should not be as part of the package
> list, as it only obscures precedence, ownership, and permissions, and
> there's a great deal of overlap involved in package creation and
> installation; tar applies permissions bits and ownership, mtree is
> called next to fix permissions and ownership, if the mtree file
> exists, then the @owner and @mode stuff implements a hammer solution
> over a series of files -- note that chmod -R and chown -R are called
> with @owner and @mode :( :
>
>    if (Mode)
>        if (vsystem("cd %s && /bin/chmod -R %s %s", cd_to, Mode, arg))
>            warnx("couldn't change modes of '%s' to '%s'", arg, Mode);
>    if (Owner && Group) {
>        if (vsystem("cd %s && /usr/sbin/chown -R %s:%s %s", cd_to,
>                    Owner, Group, arg))
>            warnx("couldn't change owner/group of '%s' to '%s:%s'",
>                   arg, Owner, Group);
>        return;
>    }
>    if (Owner) {
>        if (vsystem("cd %s && /usr/sbin/chown -R %s %s", cd_to, Owner, arg))
>            warnx("couldn't change owner of '%s' to '%s'", arg, Owner);
>        return;
>    } else if (Group)
>        if (vsystem("cd %s && /usr/bin/chgrp -R %s %s", cd_to, Group, arg))
>            warnx("couldn't change group of '%s' to '%s'", arg, Group);

Sorry -- forgot @group...

$ grep -Ilr @group /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
biology/p5-bioperl/files/patch-Bio-Root-Build.pm
databases/phpmyadmin/pkg-plist-chunk
databases/phpmyadmin211/pkg-plist-chunk
games/falconseye/pkg-plist
games/omega/pkg-plist
games/sol/pkg-plist
games/wanderer/pkg-plist
games/zangband/pkg-plist
irc/inspircd/pkg-plist
japanese/gawk/files/patch-sec1
japanese/zangband/pkg-plist
lang/tcc/files/texi2pod.pl
mail/sendmail/pkg-plist
mail/vpopmail/pkg-install
mail/vpopmail-devel/pkg-install
math/freemat/pkg-plist
net/freebsd-uucp/pkg-plist
net/mediatomb/pkg-plist
net/phpldapadmin/pkg-plist-chunk
net/phpldapadmin098/pkg-plist-chunk
news/cnews/pkg-plist
news/ifmail/pkg-plist
security/sfs/pkg-plist

Thanks,
-Garrett

From owner-freebsd-arch@FreeBSD.ORG  Sun Apr 11 00:32:21 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6587A1065670;
	Sun, 11 Apr 2010 00:32:21 +0000 (UTC)
	(envelope-from kientzle@freebsd.org)
Received: from monday.kientzle.com (kientzle.com [66.166.149.50])
	by mx1.freebsd.org (Postfix) with ESMTP id 3629A8FC17;
	Sun, 11 Apr 2010 00:32:19 +0000 (UTC)
Received: (from root@localhost)
	by monday.kientzle.com (8.14.3/8.14.3) id o3B0WW8i006407;
	Sun, 11 Apr 2010 00:32:32 GMT (envelope-from kientzle@freebsd.org)
Received: from horton.x.kientzle.com (fw2.kientzle.com [10.123.1.2])
	by kientzle.com with SMTP id n4mtt4gy7dnt8d39dx9k5qyq9e;
	Sun, 11 Apr 2010 00:32:32 +0000 (UTC)
	(envelope-from kientzle@freebsd.org)
Message-ID: <4BC1188F.3060001@freebsd.org>
Date: Sat, 10 Apr 2010 17:32:15 -0700
From: Tim Kientzle <kientzle@freebsd.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US;
	rv:1.8.1.23) Gecko/20100314 SeaMonkey/1.1.18
MIME-Version: 1.0
To: Garrett Cooper <yanefbsd@gmail.com>
References: <x2j7d6fde3d1004101552u1b60ee9etb8ed15183fc1f26f@mail.gmail.com>	<h2y7d6fde3d1004101557r12ba49ffva56a00ea42053c51@mail.gmail.com>
	<l2z7d6fde3d1004101708o3946d155pfe2f9644daff329c@mail.gmail.com>
In-Reply-To: <l2z7d6fde3d1004101708o3946d155pfe2f9644daff329c@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: FreeBSD Arch <arch@freebsd.org>, portmgr@freebsd.org
Subject: Re: [RFC] Remove @owner and @user from package list
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Apr 2010 00:32:21 -0000

Garrett Cooper wrote:
> On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper <yanefbsd@gmail.com> wrote:
>>>    When doing some research, it appears that while functionality in
>>> theory exists for @owner and @user in the package list, it isn't
>>> actually used in the pkg_install code at all, adding unnecessary bloat
>>> to package lists;
> 
>     Doing some more digging, there are a handful of ports that I don't
> have installed that implement this functionality:
> @mode ...
> @owner ...
 > @group ...

I would certainly shed no tears if these went away.

OTOH, I can see a use for them in pkg_create, to
set the mode/owner/group in the resulting tarball.
This would be good when building a package from a
port while running as non-root user.

Of course, we could also do this from the mtree
description at either package creation time (reading
the mtree description and using it to set file properties
in the tarball) or package install time (using the
mtree description to set the final file properties
on disk).

Tim

From owner-freebsd-arch@FreeBSD.ORG  Sun Apr 11 02:56:07 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 42128106566C;
	Sun, 11 Apr 2010 02:56:07 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id C9F968FC1A;
	Sun, 11 Apr 2010 02:56:06 +0000 (UTC)
Received: from c122-106-168-84.carlnfd1.nsw.optusnet.com.au
	(c122-106-168-84.carlnfd1.nsw.optusnet.com.au [122.106.168.84])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o3B2u23m006829
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 11 Apr 2010 12:56:04 +1000
Date: Sun, 11 Apr 2010 12:56:02 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Andriy Gapon <avg@freebsd.org>
In-Reply-To: <4BBF3C5A.7040009@freebsd.org>
Message-ID: <20100411114405.L10562@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Apr 2010 02:56:07 -0000

On Fri, 9 Apr 2010, Andriy Gapon wrote:

> on 09/04/2010 16:53 Rick Macklem said the following:
>>
>>
>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
>>
>>>
>>> Nowadays several questions could be asked about MAXBSIZE.
>>> - Will we have to consider increasing MAXBSIZE?  Provided ever
>>> increasing media
>>> sizes, typical filesystem sizes, typical file sizes (all that
>>> multimedia) and
>>> even media sector sizes.
>>
>> I would certainly like to see a larger MAXBSIZE for NFS. Solaris10
>> currently uses 128K as a default I/O size and allows up to 1Mb.

Er, the maximum size of buffers in the buffer cache is especially
irrelevant for nfs.  It is almost irrelevant for physical disks because
clustering normally increases the bulk transfer size to MAXPHYS.
Clustering takes a lot of CPU but doesn't affect the transfer rate much
unless there is not enough CPU.  It is even less relevant for network
i/o since there is a sort of reverse-clustering -- the buffers get split
up into tiny packets (normally 1500 bytes less some header bytes) at
the hardware level.  Again a lot of CPU is involved doing the (reverse)
clustering, and again this doesn't affect the transfer rate much.
However, 1500 is so tiny that the reverse-clustering ratio of the i/o
size relative to MAXBSIZE (65536/1500) is much larger than the normal
clustering ratio relative to MAXBSIZE (131072/65536) and the extra CPU
is more significant for network i/o.  (These aren't the actual normal
ratios, but the limits of the attainable ones by varying only the
block sizes under the file system's control.)  However2, increasing the
network i/o size can make little difference to this problem -- it can
only increase the already-too-large reverse-clustering ratio, while
possibly reducing other reverse-clustering ratios (the others are for
assembling the nfs buffers from local file system buffers; the local
file system buffers are normally disassembled from pbuf size (MAXPHYS)
to file system size (normally 16K); then conversion to nfs buffers
involves either a sort of clustering or reverse clustering depending
on the relative sizes of the buffers).  There are more gains to be
had from increasing the network i/o size.  tcp allows larger buffers
at intermediate levels but they still get split up at the hardware
level.  Only some networks allow jumbo frames.
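
As a rough worked example of the two ratios mentioned above, here is a
small sketch.  The 1500-byte packet size and MAXBSIZE = 64K are the
figures quoted in the text; MAXPHYS = 128K (131072) is an assumption,
and the names below are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Sizes in bytes.  PKT_BYTES and MAXBSIZE_BYTES are the figures
 * quoted in the text; MAXPHYS_BYTES = 128K is an assumption.
 */
#define PKT_BYTES      1500u
#define MAXBSIZE_BYTES 65536u
#define MAXPHYS_BYTES  131072u

/* How many times the smaller unit fits into the larger one. */
static double
size_ratio(unsigned larger, unsigned smaller)
{
    assert(smaller != 0);
    return (double)larger / (double)smaller;
}

/* Print both limiting ratios side by side. */
static void
print_ratios(void)
{
    printf("reverse-clustering (MAXBSIZE/packet): %.1f\n",
        size_ratio(MAXBSIZE_BYTES, PKT_BYTES));
    printf("clustering (MAXPHYS/MAXBSIZE):        %.1f\n",
        size_ratio(MAXPHYS_BYTES, MAXBSIZE_BYTES));
}
```

Under these assumptions a full-size buffer is split into roughly 44
packets on the wire, while disk clustering merges at most 2 such
buffers into one transfer.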

>> Using
>> larger I/O sizes for NFS is a simpler way to increase bulk data transfer
>> rate than more buffers and more aggressive read-ahead/write-behind.

I'm not sure about that.  Read-ahead and write-behind is already very
aggressive but seems to be not working right.  I use some patches by
Bjorn Groenwald (?) which make it work better for the old nfs implementation
(I haven't tried the experimental one).  The problems seem to be mainly
timing ones.  vfs clustering makes the buffer sizes almost irrelevant for
physical disks, but there are latency problems for the network i/o.
The latency problems seem to be larger for reads than for writes.  I
get best results by using the same size for network buffers as for local
buffers (16K).  This avoids 1 layer of buffer size changing (see above)
and using 16K-buffers avoids buffer kva fragmentation (see below).  I
saw little difference from changing the user buffer size, except small
buffers tend to work better and smallest (512-byte) buffers may have
actually worked best, I think by reducing latencies.

> I have lightly tested this under qemu.
> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
> I removed size > MAXBSIZE check in getblk (see a parallel thread "panic: getblk:
> size(%d) > MAXBSIZE(%d)").

Did you change the other known things that depend on this?  There is the
b_pages limit of MAXPHYS bytes which should be checked for in another
way, and the soft limits for hibufspace and lobufspace which only matter
under load conditions.

> And I bumped MAXPHYS to 1MB.
>
> Some results.
> I got no panics, data was read correctly and system remained stable, which is good.
> But I observed reading process (dd bs=1m on avgfs) spending a lot of time sleeping
> on needsbuffer in getnewbuf.  needsbuffer value was VFS_BIO_NEED_ANY.
> Apparently there was some shortage of free buffers.
> Perhaps some limits/counts were incorrectly auto-tuned.

This is not surprising, since even 64K is 4 times too large to work
well.  Buffer sizes larger than BKVASIZE (16K) always cause
fragmentation of buffer kva.  Recovering from fragmentation always
takes a lot of CPU, and if you are unlucky it will also take a lot of
real time (stalling waiting for free buffer kva).  Buffer sizes larger
than BKVASIZE also reduce the number of available buffers significantly
below the number of buffers configured.  This mainly takes a lot of
CPU to reconstitute buffers.  BKVASIZE being less than MAXBSIZE is a
hack to reduce the amount of kva statically allocated for buffers for
systems that cannot support enough kva to work right (mainly i386's).
It only works well when it is not actually used (when all buffers have
size <= BKVASIZE = 16K, as would be enforced by reducing MAXBSIZE to
BKVASIZE).  This hack and the complications to support it are bogus on
systems that support enough kva to work right.

nfs buffers larger than 16K would exceed BKVASIZE.  This may have been
why nfs buffer sizes of 32K gave negative benefits.
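
To put a number on how larger buffers cut into the buffer count, here
is a deliberately simplified sketch.  It assumes buffer kva is
reserved as (number of buffers * BKVASIZE) bytes and that every buffer
has the same size; that is a model for illustration only, not the
kernel's actual accounting, and usable_buffers() is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model (an assumption, not the kernel's accounting):
 * buffer kva is reserved as nbuf * bkvasize bytes, so if every
 * buffer is bsize bytes, at most (nbuf * bkvasize) / bsize buffers
 * can have kva at once, and never more than nbuf.
 */
static size_t
usable_buffers(size_t nbuf, size_t bkvasize, size_t bsize)
{
    size_t fit;

    assert(bkvasize != 0 && bsize != 0);
    fit = (nbuf * bkvasize) / bsize;
    return fit < nbuf ? fit : nbuf;
}
```

In this model, 1000 configured buffers with BKVASIZE = 16K leave room
for only 500 buffers of 32K or 250 buffers of 64K, before any
fragmentation is taken into account.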

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Sun Apr 11 13:48:56 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D1FC61065678;
	Sun, 11 Apr 2010 13:48:56 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca
	[131.104.91.36])
	by mx1.freebsd.org (Postfix) with ESMTP id 5FEF98FC0C;
	Sun, 11 Apr 2010 13:48:56 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AvsEALJvwUuDaFvH/2dsb2JhbACbRXG2BoUMBA
X-IronPort-AV: E=Sophos;i="4.52,184,1270440000"; d="scan'208";a="72300237"
Received: from danube.cs.uoguelph.ca ([131.104.91.199])
	by esa-annu-pri.mail.uoguelph.ca with ESMTP; 11 Apr 2010 09:48:55 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
	by danube.cs.uoguelph.ca (Postfix) with ESMTP id 4D8FC1084195;
	Sun, 11 Apr 2010 09:48:55 -0400 (EDT)
X-Virus-Scanned: amavisd-new at danube.cs.uoguelph.ca
Received: from danube.cs.uoguelph.ca ([127.0.0.1])
	by localhost (danube.cs.uoguelph.ca [127.0.0.1]) (amavisd-new,
	port 10024)
	with ESMTP id noH9HCNu+mzV; Sun, 11 Apr 2010 09:48:53 -0400 (EDT)
Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.91.102])
	by danube.cs.uoguelph.ca (Postfix) with ESMTP id 9371D1084192;
	Sun, 11 Apr 2010 09:48:53 -0400 (EDT)
Received: from localhost (rmacklem@localhost)
	by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id
	o3BE2d129572; Sun, 11 Apr 2010 10:02:39 -0400 (EDT)
X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing
	-bs
Date: Sun, 11 Apr 2010 10:02:39 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
X-X-Sender: rmacklem@muncher.cs.uoguelph.ca
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20100411114405.L10562@delplex.bde.org>
Message-ID: <Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Apr 2010 13:48:56 -0000



On Sun, 11 Apr 2010, Bruce Evans wrote:

>
> Er, the maximum size of buffers in the buffer cache is especially
> irrelevant for nfs.  It is almost irrelevant for physical disks because
> clustering normally increases the bulk transfer size to MAXPHYS.
> Clustering takes a lot of CPU but doesn't affect the transfer rate much
> unless there is not enough CPU.  It is even less relevant for network
> i/o since there is a sort of reverse-clustering -- the buffers get split
> up into tiny packets (normally 1500 bytes less some header bytes) at
> the hardware level.  Again a lot of CPU is involved doing the (reverse)
> clustering, and again this doesn't affect the transfer rate much.
> However, 1500 is so tiny that the reverse-clustering ratio of the i/o
> size relative to MAXBSIZE (65536/1500) is much larger than the normal
> clustering ratio relative to MAXBSIZE (131072/65536) and the extra CPU
> is more significant for network i/o.  (These aren't the actual normal
> ratios, but the limits of the attainable ones by varying only the
> block sizes under the file system's control.)  However2, increasing the
> network i/o size can make little difference to this problem -- it can
> only increase the already-too-large reverse-clustering ratio, while
> possibly reducing other reverse-clustering ratios (the others are for
> assembling the nfs buffers from local file system buffers; the local
> file system buffers are normally disassembled from pbuf size (MAXPHYS)
> to file system size (normally 16K); then conversion to nfs buffers
> involves either a sort of clustering or reverse clustering depending
> on the relative sizes of the buffers).  There are more gains to be
> had from increasing the network i/o size.  tcp allows larger buffers
> at intermediate levels but they still get split up at the hardware
> level.  Only some networks allow jumbo frames.
>

I've done a simple experiment on Mac OS X 10, where I tried different
sizes for the read and write RPCs plus different amounts of
read-ahead/write-behind, and found that the I/O rate increased linearly
with the RPC size, up to the max allowed by Mac OS X (MAXBSIZE == 128K),
without read-ahead/write-behind. Using read-ahead/write-behind, the
performance didn't increase at all until the RPC read/write size was
reduced. (Solaris10 is using 256K by default and allowing up to 1MB for
read/write RPC size now, so they seem to think that large values work
well?)

When you start using a WAN environment, large read/write RPCs really
help, from what I've seen, since that helps fill the TCP pipe
(bandwidth * latency between client<->server).
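
As a back-of-the-envelope illustration of filling the pipe, here is a
small sketch that computes the bandwidth-delay product and how many
RPCs of a given size would have to be outstanding to cover it.  The
function names and the example link figures are made up for
illustration; nothing here is taken from the NFS code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bandwidth-delay product in bytes, for a link rate in bits/s and a
 * round-trip time in milliseconds.
 */
static uint64_t
bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    /* Divide by 8 bits per byte and by 1000 ms per second. */
    return bits_per_sec * rtt_ms / 8000;
}

/*
 * Number of RPCs of rpc_bytes each that must be in flight to cover
 * bdp bytes, rounded up.
 */
static uint64_t
rpcs_to_fill(uint64_t bdp, uint64_t rpc_bytes)
{
    assert(rpc_bytes != 0);
    return (bdp + rpc_bytes - 1) / rpc_bytes;
}
```

For a hypothetical 100 Mbit/s path with a 50 ms round trip, the pipe
holds 625000 bytes: that takes 20 outstanding 32K RPCs, but only 5 at
128K.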

I care much more about WAN performance than LAN performance w.r.t. this.

I am not sure what you were referring to w.r.t. clustering, but if you
meant that the NFS client can easily do an RPC with a larger I/O size
than the size of the buffer handed it by the buffer cache, I'd like to
hear how that's done? (If not, then a bigger buffer from the buffer
cache is what I need to do a larger I/O size in the RPC.)

Once NFS hands the TCP socket the large RPC, I figure it's up to the
networking to get it on/off the wire, etc. If you are arguing that that
is where there can be major gains, I'll believe you, but it's not my
area of expertise and there's lots of other FreeBSD folks to work on
that. I do believe that being able to do a large read/write RPC is
going to help performance, particularly in the WAN case.

>>> Using
>>> larger I/O sizes for NFS is a simpler way to increase bulk data transfer
>>> rate than more buffers and more aggressive read-ahead/write-behind.
>
> I'm not sure about that.  Read-ahead and write-behind is already very
> aggressive but seems to be not working right.  I use some patches by
> Bjorn Groenwald (?) which make it work better for the old nfs implementation
> (I haven't tried the experimental one).  The problems seem to be mainly
> timing ones.  vfs clustering makes the buffer sizes almost irrelevant for
> physical disks, but there are latency problems for the network i/o.
> The latency problems seem to be larger for reads than for writes.  I
> get best results by using the same size for network buffers as for local
> buffers (16K).  This avoids 1 layer of buffer size changing (see above)
> and using 16K-buffers avoids buffer kva fragmentation (see below).  I
> saw little difference from changing the user buffer size, except small
> buffers tend to work better and smallest (512-byte) buffers may have
> actually worked best, I think by reducing latencies.
>

See above. There is always going to be cases like use over a WAN where
latency is going to be large. That's when large I/O RPCs will win.

I suspect you are focusing on the high bandwidth/low latency LAN, which
is not where I believe that large I/O sized RPCs will make much 
difference.
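
A rough sketch of the arithmetic behind this, assuming only one RPC is
outstanding at a time (no read-ahead/write-behind) and ignoring protocol
overhead. The function names and the model itself are illustrative
assumptions, not anything taken from the NFS code:

```c
/*
 * Approximate bulk throughput (bytes/sec) when every RPC of io_size
 * bytes costs one round trip plus the time needed to push the data
 * through the link.  Assumption: a single RPC in flight, no header or
 * protocol overhead.
 */
static double
rpc_throughput(double io_size, double link_bytes_per_sec, double rtt_sec)
{
	return (io_size / (rtt_sec + io_size / link_bytes_per_sec));
}

/*
 * Bandwidth-delay product: the number of bytes that must be in flight
 * to keep the link busy for one round trip.
 */
static double
bandwidth_delay_product(double link_bytes_per_sec, double rtt_sec)
{
	return (link_bytes_per_sec * rtt_sec);
}
```

With a 100Mbit/s link (12.5e6 bytes/sec) and a 50ms round trip, this model
gives roughly 1.2Mbytes/sec for 64K RPCs and roughly 7.8Mbytes/sec for
1Mbyte RPCs. With a 0.2ms LAN round trip the two sizes differ by under 4%.
Read-ahead and write-behind narrow the WAN gap by keeping more RPCs in
flight, so treat these as bounds for the single-RPC case only.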

Hope this helps clarify what I am looking for, rick

From owner-freebsd-arch@FreeBSD.ORG  Sun Apr 11 14:09:31 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 655FF106566B;
	Sun, 11 Apr 2010 14:09:31 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
	[131.104.91.44])
	by mx1.freebsd.org (Postfix) with ESMTP id F36EC8FC08;
	Sun, 11 Apr 2010 14:09:30 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AvsEABd1wUuDaFvJ/2dsb2JhbACbRXG2CIUMBA
X-IronPort-AV: E=Sophos;i="4.52,185,1270440000"; d="scan'208";a="71859007"
Received: from ganges.cs.uoguelph.ca ([131.104.91.201])
	by esa-jnhn-pri.mail.uoguelph.ca with ESMTP; 11 Apr 2010 10:09:29 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
	by ganges.cs.uoguelph.ca (Postfix) with ESMTP id 32570FB808C;
	Sun, 11 Apr 2010 10:09:29 -0400 (EDT)
X-Virus-Scanned: amavisd-new at ganges.cs.uoguelph.ca
Received: from ganges.cs.uoguelph.ca ([127.0.0.1])
	by localhost (ganges.cs.uoguelph.ca [127.0.0.1]) (amavisd-new,
	port 10024)
	with ESMTP id pqvSclHGgB6O; Sun, 11 Apr 2010 10:09:28 -0400 (EDT)
Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.91.102])
	by ganges.cs.uoguelph.ca (Postfix) with ESMTP id 455FBFB8036;
	Sun, 11 Apr 2010 10:09:28 -0400 (EDT)
Received: from localhost (rmacklem@localhost)
	by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id
	o3BENEH02542; Sun, 11 Apr 2010 10:23:14 -0400 (EDT)
X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing
	-bs
Date: Sun, 11 Apr 2010 10:23:14 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
X-X-Sender: rmacklem@muncher.cs.uoguelph.ca
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20100411114405.L10562@delplex.bde.org>
Message-ID: <Pine.GSO.4.63.1004111019420.1962@muncher.cs.uoguelph.ca>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Apr 2010 14:09:31 -0000



On Sun, 11 Apr 2010, Bruce Evans wrote:

>
> Er, the maximum size of buffers in the buffer cache is especially
> irrelevant for nfs.  It is almost irrelevant for physical disks because
> clustering normally increases the bulk transfer size to MAXPHYS.
> Clustering takes a lot of CPU but doesn't affect the transfer rate much
> unless there is not enough CPU.  It is even less relevant for network
> i/o since there is a sort of reverse-clustering -- the buffers get split
> up into tiny packets (normally 1500 bytes less some header bytes) at
> the hardware level.  Again a lot of CPU is involved doing the (reverse)
> clustering, and again this doesn't affect the transfer rate much.
> However, 1500 is so tiny that the reverse-clustering ratio of the i/o
> size relative to MAXBSIZE (65536/1500) is much larger than the normal
> clustering ratio relative to MAXBSIZE (131072/65536) and the extra CPU
> is more significant for network i/o.  (These aren't the actual normal
> ratios, but the limits of the attainable ones when varying only the
> block sizes under the file system's control.)  However2, increasing the
> network i/o size can make little difference to this problem -- it can
> only increase the already-too-large reverse-clustering ratio, while
> possibly reducing other reverse-clustering ratios (the others are for
> assembling the nfs buffers from local file system buffers; the local
> file system buffers are normally disassembled from pbuf size (MAXPHYS)
> to file system size (normally 16K); then conversion to nfs buffers
> involves either a sort of clustering or reverse clustering depending
> on the relative sizes of the buffers).  There are more gains to be
> had from increasing the network i/o size.  tcp allows larger buffers
> at intermediate levels but they still get split up at the hardware
> level.  Only some networks allow jumbo frames.
>
Oh, and if the 1Mbyte write rpc can somehow hand the data portion
(the 1Mbyte of data) to sosend() as a single 1Mbyte mbuf cluster
referencing (not copied from) the 1Mbyte buffer cache block, so
the data never needs to be copied until it gets to the network
device driver, that would be great. However, this goes way beyond
the increase of MAXBSIZE that I think I need so that the client
can actually do a 1Mbyte write RPC.

Have a good weekend (what's left of it), rick


From owner-freebsd-arch@FreeBSD.ORG  Sun Apr 11 14:20:19 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 59711106564A;
	Sun, 11 Apr 2010 14:20:19 +0000 (UTC) (envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 084B98FC14;
	Sun, 11 Apr 2010 14:20:18 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3BEHVRN091386;
	Sun, 11 Apr 2010 08:17:31 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Sun, 11 Apr 2010 08:17:39 -0600 (MDT)
Message-Id: <20100411.081739.974702306123419358.imp@bsdimp.com>
To: kientzle@FreeBSD.org
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <4BC1188F.3060001@freebsd.org>
References: <h2y7d6fde3d1004101557r12ba49ffva56a00ea42053c51@mail.gmail.com>
	<l2z7d6fde3d1004101708o3946d155pfe2f9644daff329c@mail.gmail.com>
	<4BC1188F.3060001@freebsd.org>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: yanefbsd@gmail.com, portmgr@FreeBSD.org, arch@FreeBSD.org
Subject: Re: [RFC] Remove @owner and @user from package list
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Apr 2010 14:20:19 -0000

In message: <4BC1188F.3060001@freebsd.org>
            Tim Kientzle <kientzle@freebsd.org> writes:
: Garrett Cooper wrote:
: > On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper <yanefbsd@gmail.com>
: > wrote:
: >>>    When doing some research, it appears that while functionality in
: >>> theory exists for @owner and @user in the package list, it isn't
: >>> actually used in the pkg_install code at all, adding unnecessary bloat
: >>> to package lists;
: >     Doing some more digging, there are a handful of ports that I don't
: > have installed that implement this functionality:
: > @mode ...
: > @owner ...
: > @group ...
: 
: I would certainly shed no tears if these went away.
: 
: OTOH, I can see a use for them in pkg_create, to
: set the mode/owner/group in the resulting tarball.
: This would be good when building a package from a
: port while running as non-root user.
: 
: Of course, we could also do this from the mtree
: description at either package creation time (reading
: the mtree description and using it to set file properties
: in the tarball) or package install time (using the
: mtree description to set the final file properties
: on disk).

On the creation side, something like the above would be useful.

makefs supports storing a tree's metadata in an .mtree file.  We could
obviate the need for those keywords if tar could be made to do the
same thing :)

I'm working on an unpriv'd installworld (where the meta data would go
to the .mtree file, and the files would go into a tree owned as the
user building).  Mostly it is a port from NetBSD, but having tar that
would respect this stuff would be great.  Bonus points if the tag in
mtree could be used as a file selector (either additively or
subtractively).

Warner


From owner-freebsd-arch@FreeBSD.ORG  Mon Apr 12 11:06:55 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9A0431065674
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Apr 2010 11:06:55 +0000 (UTC)
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 6DB148FC0C
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Apr 2010 11:06:55 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id o3CB6tad042344
	for <freebsd-arch@FreeBSD.org>; Mon, 12 Apr 2010 11:06:55 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.14.4/8.14.4/Submit) id o3CB6s8O042342
	for freebsd-arch@FreeBSD.org; Mon, 12 Apr 2010 11:06:54 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 12 Apr 2010 11:06:54 GMT
Message-Id: <201004121106.o3CB6s8O042342@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
	owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@FreeBSD.org>
To: freebsd-arch@FreeBSD.org
Cc: 
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Apr 2010 11:06:55 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.


S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.


From owner-freebsd-arch@FreeBSD.ORG  Mon Apr 12 13:56:03 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A009E1065673;
	Mon, 12 Apr 2010 13:56:03 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 6FA3E8FC0A;
	Mon, 12 Apr 2010 13:56:03 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 2121346B6C;
	Mon, 12 Apr 2010 09:56:03 -0400 (EDT)
Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id 1D63D8A01F;
	Mon, 12 Apr 2010 09:56:00 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Mon, 12 Apr 2010 09:31:36 -0400
User-Agent: KMail/1.12.1 (FreeBSD/7.3-CBSD-20100217; KDE/4.3.1; amd64; ; )
References: <x2j7d6fde3d1004101552u1b60ee9etb8ed15183fc1f26f@mail.gmail.com>
	<l2z7d6fde3d1004101708o3946d155pfe2f9644daff329c@mail.gmail.com>
	<4BC1188F.3060001@freebsd.org>
In-Reply-To: <4BC1188F.3060001@freebsd.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201004120931.36907.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Mon, 12 Apr 2010 09:56:00 -0400 (EDT)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-1.8 required=4.2 tests=AWL,BAYES_00 autolearn=ham
	version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: Garrett Cooper <yanefbsd@gmail.com>, Tim Kientzle <kientzle@freebsd.org>,
	portmgr@freebsd.org, FreeBSD Arch <arch@freebsd.org>
Subject: Re: [RFC] Remove @owner and @user from package list
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Apr 2010 13:56:03 -0000

On Saturday 10 April 2010 8:32:15 pm Tim Kientzle wrote:
> Garrett Cooper wrote:
> > On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper <yanefbsd@gmail.com> wrote:
> >>>    When doing some research, it appears that while functionality in
> >>> theory exists for @owner and @user in the package list, it isn't
> >>> actually used in the pkg_install code at all, adding unnecessary bloat
> >>> to package lists;
> > 
> >     Doing some more digging, there are a handful of ports that I don't
> > have installed that implement this functionality:
> > @mode ...
> > @owner ...
> > @group ...
> 
> I would certainly shed no tears if these went away.
> 
> OTOH, I can see a use for them in pkg_create, to
> set the mode/owner/group in the resulting tarball.
> This would be good when building a package from a
> port while running as non-root user.

Yes. I have used this to build 3rd party packages at a previous employer.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Mon Apr 12 16:02:28 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 88C331065674
	for <arch@freebsd.org>; Mon, 12 Apr 2010 16:02:28 +0000 (UTC)
	(envelope-from avg@freebsd.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id CA6088FC13
	for <arch@freebsd.org>; Mon, 12 Apr 2010 16:02:27 +0000 (UTC)
Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua
	[212.40.38.101])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id TAA29562;
	Mon, 12 Apr 2010 19:02:11 +0300 (EEST)
	(envelope-from avg@freebsd.org)
Message-ID: <4BC34402.1050509@freebsd.org>
Date: Mon, 12 Apr 2010 19:02:10 +0300
From: Andriy Gapon <avg@freebsd.org>
User-Agent: Thunderbird 2.0.0.24 (X11/20100319)
MIME-Version: 1.0
To: Bruce Evans <brde@optusnet.com.au>
References: <4BBEE2DD.3090409@freebsd.org>	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>	<4BBF3C5A.7040009@freebsd.org>
	<20100411114405.L10562@delplex.bde.org>
In-Reply-To: <20100411114405.L10562@delplex.bde.org>
X-Enigmail-Version: 0.95.7
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Apr 2010 16:02:28 -0000

on 11/04/2010 05:56 Bruce Evans said the following:
> On Fri, 9 Apr 2010, Andriy Gapon wrote:
[snip]
>> I have lightly tested this under qemu.
>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
>> I removed size > MAXBSIZE check in getblk (see a parallel thread
>> "panic: getblk:
>> size(%d) > MAXBSIZE(%d)").
> 
> Did you change the other known things that depend on this?  There is the
> b_pages limit of MAXPHYS bytes which should be checked for in another
> way

I changed the check the way I described in the parallel thread.

> and the soft limits for hibufspace and lobufspace which only matter
> under load conditions.

And what should these be?
hibufspace and lobufspace seem to be auto-calculated.  One thing that I noticed,
and that was a direct cause of the problem described below, is that the difference
between hibufspace and lobufspace should be at least the maximum block size
allowed in getblk() (perhaps it should be strictly equal to that value?).
So in my case I had to make that difference MAXPHYS.

>> And I bumped MAXPHYS to 1MB.
>>
>> Some results.
>> I got no panics, data was read correctly and system remained stable,
>> which is good.
>> But I observed reading process (dd bs=1m on avgfs) spending a lot of
>> time sleeping
>> on needsbuffer in getnewbuf.  needsbuffer value was VFS_BIO_NEED_ANY.
>> Apparently there was some shortage of free buffers.
>> Perhaps some limits/counts were incorrectly auto-tuned.
> 
> This is not surprising, since even 64K is 4 times too large to work
> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
> fragmentation of buffer kva.  Recovering from fragmentation always
> takes a lot of CPU, and if you are unlucky it will also take a lot of
> real time (stalling waiting for free buffer kva).  Buffer sizes larger
> than BKVASIZE also reduce the number of available buffers significantly
> below the number of buffers configured.  This mainly takes a lot of
> CPU to reconstitute buffers.  BKVASIZE being less than MAXBSIZE is a
> hack to reduce the amount of kva statically allocated for buffers for
> systems that cannot support enough kva to work right (mainly i386's).
> It only works well when it is not actually used (when all buffers have
> size <= BKVASIZE = 16K, as would be enforced by reducing MAXBSIZE to
> BKVASIZE).  This hack and the complications to support it are bogus on
> systems that support enough kva to work right.

So, BKVASIZE is the best read size from the point of view of buffer space usage?
E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read request, but
leads to buffer space map fragmentation, because of size > BKVASIZE.
On the other hand, four sequential reads of BKVASIZE=16K bytes are perfect from
buffer space point of view (no fragmentation potential) but they result in 4 GEOM
I/O requests.
The thing is that a single read requires a single contiguous virtual address space
chunk.  Would it be possible to take the best of both worlds by somehow allowing a
single large I/O request to work with several buffers (with b_kvasize == BKVASIZE)
in an iovec-like style?
Have I just reinvented the bicycle? :)
Probably not, because the answer to my question is probably also 'no (not without
lots of work in lots of places)'.
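
The trade-off between the two read sizes can be put into numbers with a
small sketch. This is a simplified model, not the kernel's accounting: it
assumes buffer kva is sized as nbuf * BKVASIZE and ignores fragmentation
entirely. The helper names are made up for illustration; the 16K and 64K
values are the ones quoted in this thread.

```c
#define BKVASIZE	(16UL * 1024)	/* nominal kva reserved per buffer */
#define MAXBSIZE	(64UL * 1024)	/* largest buffer size discussed here */

/* Number of I/O requests needed to read `total` bytes in `bsize` pieces. */
static unsigned long
io_requests(unsigned long total, unsigned long bsize)
{
	return ((total + bsize - 1) / bsize);
}

/*
 * Simplified model: with kva sized as nbuf * BKVASIZE, buffers larger
 * than BKVASIZE leave room for fewer than nbuf of them.  Fragmentation,
 * which only makes this worse, is not modelled.
 */
static unsigned long
usable_buffers(unsigned long nbuf, unsigned long bsize)
{
	unsigned long by_kva = (nbuf * BKVASIZE) / bsize;

	return (by_kva < nbuf ? by_kva : nbuf);
}
```

Under this model a 64K transfer is 1 request at MAXBSIZE or 4 requests at
BKVASIZE, while 1000 configured buffers shrink to 250 usable ones if every
buffer is 64K.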

I see that breadn() certainly doesn't work that way.  As I understand it, it works
like bread() for one block and also starts something like asynchronous bread()s for
a given number of other blocks.

I am not sure about details of how cluster_read() works, though.
Could you please explain the essence of it?
Thank you!

Perhaps there are other approaches to the fragmentation issue, for
example using some sort of zones for different block sizes.  But all of that adds
complications and takes performance away from the easy cases.
-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG  Mon Apr 12 22:28:47 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C745A106566B
	for <freebsd-arch@freebsd.org>; Mon, 12 Apr 2010 22:28:47 +0000 (UTC)
	(envelope-from delphij@delphij.net)
Received: from tarsier.geekcn.org (tarsier.geekcn.org [IPv6:2001:470:a803::1])
	by mx1.freebsd.org (Postfix) with ESMTP id 710C98FC0C
	for <freebsd-arch@freebsd.org>; Mon, 12 Apr 2010 22:28:47 +0000 (UTC)
Received: from mail.geekcn.org (tarsier.geekcn.org [211.166.10.233])
	by tarsier.geekcn.org (Postfix) with ESMTP id 805B3A561A1;
	Tue, 13 Apr 2010 06:28:46 +0800 (CST)
X-Virus-Scanned: amavisd-new at geekcn.org
Received: from tarsier.geekcn.org ([211.166.10.233])
	by mail.geekcn.org (mail.geekcn.org [211.166.10.233]) (amavisd-new,
	port 10024)
	with LMTP id TZUpelbWeKNT; Tue, 13 Apr 2010 06:28:40 +0800 (CST)
Received: from delta.delphij.net (drawbridge.ixsystems.com [206.40.55.65])
	(using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
	(No client certificate requested)
	by tarsier.geekcn.org (Postfix) with ESMTPSA id 7E005A56199;
	Tue, 13 Apr 2010 06:28:39 +0800 (CST)
DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns;
	h=message-id:date:from:reply-to:organization:user-agent:
	mime-version:to:subject:x-enigmail-version:openpgp:content-type:content-transfer-encoding;
	b=cLlVQu6zh3kqkdAmLLrCuYRufh2md+ZnqofYf6atiYqskPrj0Pnbln/R9OmGMw9It
	zsh88YSptj4Dr6vXTk/SQ==
Message-ID: <4BC39E93.7060906@delphij.net>
Date: Mon, 12 Apr 2010 15:28:35 -0700
From: Xin LI <delphij@delphij.net>
Organization: The Geek China Organization
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US;
	rv:1.9.1.9) Gecko/20100408 Thunderbird/3.0.4 ThunderBrowse/3.2.8.1
MIME-Version: 1.0
To: freebsd-arch@freebsd.org
X-Enigmail-Version: 1.0.1
OpenPGP: id=3FCA37C1;
	url=http://www.delphij.net/delphij.asc
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: d@delphij.net
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Apr 2010 22:28:47 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Is there a sane way to copyout ioctl request when the returning errno !=
0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:

===========
        error = kern_ioctl(td, uap->fd, com, data);

        if (error == 0 && (com & IOC_OUT))
                error = copyout(data, uap->data, (u_int)size);
===========

Is there any objection if I change it to something like:

===========
        saved_error = kern_ioctl(td, uap->fd, com, data);

        if (com & IOC_OUT)
                error = copyout(data, uap->data, (u_int)size);
        if (saved_error)
                error = saved_error;
===========

Cheers,
- -- 
Xin LI <delphij@delphij.net>	http://www.delphij.net/
FreeBSD - The Power to Serve!	       Live free or die
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iQEcBAEBAgAGBQJLw56PAAoJEATO+BI/yjfBuYMIAM7qAVWuWn/noQPzH12W3IoH
TInBLyGjG8tH5z9CPJeXe3X+aVz932KEuE85E6GXBo7zoGf1IWbMk8+LO+Ai+5It
AgxeFrBUn0MUEY4dJPZs89Ag8LCBFvvHOe1eTxw+6sjdSDtFg2OV55F2nrCcPtoG
jIEQtcfhy1H+evihEycoN9uMdTH0XWEcCZVhXKS0R4a3veOp2RUt4I21LhSYdyrx
xairvHNIOp0eBdHf8O2TlwyWzlZpHg3XMO9UM/aZ5uiVeSIsB0nEX3SXGi3o7Rih
DaCTqZpk4L6z1UIUsGEqLl5i6yrbP5LFwNDk9dYbQL3of4SVPofsD9O1hJ3MuIE=
=cMdX
-----END PGP SIGNATURE-----

From owner-freebsd-arch@FreeBSD.ORG  Mon Apr 12 23:49:23 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 61A48106566B
	for <freebsd-arch@freebsd.org>; Mon, 12 Apr 2010 23:49:23 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 553518FC1B
	for <freebsd-arch@freebsd.org>; Mon, 12 Apr 2010 23:49:23 +0000 (UTC)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 857C41A3C86; Mon, 12 Apr 2010 16:33:30 -0700 (PDT)
Date: Mon, 12 Apr 2010 16:33:30 -0700
From: Alfred Perlstein <alfred@freebsd.org>
To: d@delphij.net
Message-ID: <20100412233330.GC19003@elvis.mu.org>
References: <4BC39E93.7060906@delphij.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4BC39E93.7060906@delphij.net>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-arch@freebsd.org
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Apr 2010 23:49:23 -0000

* Xin LI <delphij@delphij.net> [100412 15:28] wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi,
> 
> Is there a sane way to copyout ioctl request when the returning errno !=
> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
> 
> ===========
>         error = kern_ioctl(td, uap->fd, com, data);
> 
>         if (error == 0 && (com & IOC_OUT))
>                 error = copyout(data, uap->data, (u_int)size);
> ===========
> 
> Is there any objection if I change it to something like:
> 
> ===========
>         saved_error = kern_ioctl(td, uap->fd, com, data);
> 
>         if (com & IOC_OUT)
>                 error = copyout(data, uap->data, (u_int)size);
>         if (saved_error)
>                 error = saved_error;
> ===========

Is this for linux compat?

I'm not sure this would work; it might seriously break userland
compat.  Have you looked around/queried what the expected outcome
is from a bad ioctl?  By default the buffer will be zero'd; this
might be unexpected by apps.  (all or nothing)
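
A userland sketch of the difference, for illustration only. The mock_*
functions and the ioctl_current/ioctl_proposed wrappers are hypothetical
stand-ins, not kernel code, and the sketch assumes an output-only request
whose kernel buffer is zeroed before the handler runs, as described above:

```c
#include <errno.h>
#include <string.h>

#define BUF_SIZE 8

/*
 * Hypothetical stand-in for a driver's ioctl handler.  If `fill` is set
 * it writes a result into the kernel buffer; it then returns `err`.
 */
static int
mock_kern_ioctl(char *data, int err, int fill)
{
	if (fill)
		memcpy(data, "partial", BUF_SIZE);	/* 7 chars + NUL */
	return (err);
}

/* Stand-in for copyout(9); in this sketch it always succeeds. */
static int
mock_copyout(const char *kbuf, char *ubuf, size_t size)
{
	memcpy(ubuf, kbuf, size);
	return (0);
}

/* Current behaviour: copy out only when the handler succeeded. */
static int
ioctl_current(char *ubuf, int err, int fill)
{
	char data[BUF_SIZE];
	int error;

	memset(data, 0, sizeof(data));	/* assumed zeroing for out-only */
	error = mock_kern_ioctl(data, err, fill);
	if (error == 0)
		error = mock_copyout(data, ubuf, sizeof(data));
	return (error);
}

/* Proposed behaviour: always copy out, but report the handler's error. */
static int
ioctl_proposed(char *ubuf, int err, int fill)
{
	char data[BUF_SIZE];
	int error, saved_error;

	memset(data, 0, sizeof(data));	/* assumed zeroing for out-only */
	saved_error = mock_kern_ioctl(data, err, fill);
	error = mock_copyout(data, ubuf, sizeof(data));
	if (saved_error)
		error = saved_error;
	return (error);
}
```

If the handler fails after filling in partial results, only the proposed
variant delivers them. If it fails without writing anything, the proposed
variant overwrites the caller's buffer with zeros, where the current code
leaves it untouched.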

-- 
- Alfred Perlstein
.- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250
.- FreeBSD committer

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr 13 00:27:02 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0E128106566C;
	Tue, 13 Apr 2010 00:27:02 +0000 (UTC)
	(envelope-from delphij@delphij.net)
Received: from tarsier.geekcn.org (tarsier.geekcn.org [IPv6:2001:470:a803::1])
	by mx1.freebsd.org (Postfix) with ESMTP id 8E4A18FC08;
	Tue, 13 Apr 2010 00:27:01 +0000 (UTC)
Received: from mail.geekcn.org (tarsier.geekcn.org [211.166.10.233])
	by tarsier.geekcn.org (Postfix) with ESMTP id 9DAD8A563BB;
	Tue, 13 Apr 2010 08:27:00 +0800 (CST)
X-Virus-Scanned: amavisd-new at geekcn.org
Received: from tarsier.geekcn.org ([211.166.10.233])
	by mail.geekcn.org (mail.geekcn.org [211.166.10.233]) (amavisd-new,
	port 10024)
	with LMTP id cTHSnfUMPDeC; Tue, 13 Apr 2010 08:26:54 +0800 (CST)
Received: from delta.delphij.net (drawbridge.ixsystems.com [206.40.55.65])
	(using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
	(No client certificate requested)
	by tarsier.geekcn.org (Postfix) with ESMTPSA id D5467A56357;
	Tue, 13 Apr 2010 08:26:52 +0800 (CST)
DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns;
	h=message-id:date:from:reply-to:organization:user-agent:
	mime-version:to:cc:subject:references:in-reply-to:
	x-enigmail-version:openpgp:content-type:content-transfer-encoding;
	b=NGXXAfTZQKyEqnbzykefQAORUbaecgn1wTUsNWWXvoNTgpqHO0nm0eBbaH5hzz+cJ
	KcpLEkuDDEU7MdJ4DhXGA==
Message-ID: <4BC3BA48.9010009@delphij.net>
Date: Mon, 12 Apr 2010 17:26:48 -0700
From: Xin LI <delphij@delphij.net>
Organization: The Geek China Organization
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US;
	rv:1.9.1.9) Gecko/20100408 Thunderbird/3.0.4 ThunderBrowse/3.2.8.1
MIME-Version: 1.0
To: Alfred Perlstein <alfred@freebsd.org>
References: <4BC39E93.7060906@delphij.net>
	<20100412233330.GC19003@elvis.mu.org>
In-Reply-To: <20100412233330.GC19003@elvis.mu.org>
X-Enigmail-Version: 1.0.1
OpenPGP: id=3FCA37C1;
	url=http://www.delphij.net/delphij.asc
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: d@delphij.net, freebsd-arch@freebsd.org
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: d@delphij.net
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 13 Apr 2010 00:27:02 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2010/04/12 16:33, Alfred Perlstein wrote:
> * Xin LI <delphij@delphij.net> [100412 15:28] wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Hi,
>>
>> Is there a sane way to copyout ioctl request when the returning errno !=
>> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
>>
>> ===========
>>         error = kern_ioctl(td, uap->fd, com, data);
>>
>>         if (error == 0 && (com & IOC_OUT))
>>                 error = copyout(data, uap->data, (u_int)size);
>> ===========
>>
>> Is there any objection if I change it to something like:
>>
>> ===========
>>         saved_error = kern_ioctl(td, uap->fd, com, data);
>>
>>         if (com & IOC_OUT)
>>                 error = copyout(data, uap->data, (u_int)size);
>>         if (saved_error)
>>                 error = saved_error;
>> ===========
> 
> Is this for linux compat?

Do they do this way?  I'm not quite sure :-/

I got a bug report and am thinking about how to fix it.  It seems that we
do not have a generic way of returning an error number while giving some
"hints" about the error at the same time, for the ioctl() call.  Adding
an extra pointer to the request structure seems to be a last resort
and looks ugly.

> I'm not sure this would work, it might seriously break userland
> compat.  Have you looked around/queiried what the expected outcome
> is from a bad ioctl?  By default the buffer will be zero'd this
> might be unexpected by apps.  (all or nothing)

Yes, that's exactly why I'm asking.  My understanding is that normal
usage would be something like:

	if (ioctl(fd, SIOCSOMETHING, &req) < 0) {
		// do something to handle the error
	} else {
		// use data fed back from req
	}

In this case, I think the result would not be affected.  Are there many
(if any) programs that don't bother to check the return value of ioctl()?

Speaking for the userland buffer, for _IOR ioctls, the side effect would
be that userland would see a zeroed out 'req' structure (kernel buffer
gets zeroed out before calling kern_ioctl), or "half-baked" one (the
kernel code may have only written partial data).  For _IOWR ioctls, the
side effect would be that the userland may get half-baked data.

The in-kernel request buffer is always initialized, as it is either
overwritten by copyin() or cleared by bzero(), so I don't think sensitive
data could be leaked, unless the kernel code intentionally copies some
sensitive data to the req buffer, detects an error, and then
scrubs the sensitive data away.
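
To make the difference concrete, here is a minimal userland model of
the two variants.  It is only a sketch, not the code in
sys/kern/sys_generic.c: the MODEL_* flags, struct model_req and both
functions are made-up names, and memcpy()/memset() merely stand in for
copyin(), bzero() and copyout().

```c
#include <errno.h>
#include <string.h>

/* Hypothetical direction flags standing in for IOC_OUT / IOC_IN. */
#define MODEL_IOC_OUT	0x1
#define MODEL_IOC_IN	0x2

/* Hypothetical request structure. */
struct model_req {
	int value;
	int hint;
};

/* Hypothetical handler: writes a hint into the buffer, then fails. */
static int
model_handler(struct model_req *kreq)
{
	kreq->hint = 42;	/* "half-baked" output written before failing */
	return (EINVAL);
}

/*
 * Model of the ioctl path.  copy_on_error == 0 models the current
 * behaviour (copy out only on success); copy_on_error != 0 models the
 * proposed behaviour (copy out even when the handler failed).
 */
static int
model_ioctl(int dir, struct model_req *ureq, int copy_on_error)
{
	struct model_req kbuf;
	int error;

	if (dir & MODEL_IOC_IN)
		memcpy(&kbuf, ureq, sizeof(kbuf));	/* stands in for copyin() */
	else
		memset(&kbuf, 0, sizeof(kbuf));		/* stands in for bzero() */

	error = model_handler(&kbuf);

	if ((dir & MODEL_IOC_OUT) && (error == 0 || copy_on_error))
		memcpy(ureq, &kbuf, sizeof(kbuf));	/* stands in for copyout() */
	return (error);
}
```

With the current behaviour the caller's structure is untouched after a
failure.  With the proposed one, an "out only" request comes back
zeroed except for whatever the handler wrote, and an "in/out" request
comes back as the caller's input plus the handler's partial changes.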

Cheers,
- -- 
Xin LI <delphij@delphij.net>	http://www.delphij.net/
FreeBSD - The Power to Serve!	       Live free or die
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iQEcBAEBAgAGBQJLw7pIAAoJEATO+BI/yjfBXjwH/RaheqNyhY0eECcqC5Gz0ycm
2VOpHoe+oRpwHNDYrlNqILKl815HTjpvyi145IpMPIKvEct2O0i6wGJ3FH7VFQwP
ucZh6Tj3K3yF+OsFw3iAk69aqFhslb/SuZtuAbJAA4DB+H1rUPtEfWs9y8XjmAaS
ZvFTmmP1w1V50I843UJEbY86LqwJGOgGH0mJ6n1mEsLOFyrASrjGajAOb/mEvju4
pLVoaKI9sWGk4QfE9QKol083DuSC/WVbJBFHmzN0K0sNmRfyZofcSIYpWDMkwS4n
Mt2M3b6irwul83EkK+cw1gclmV7lUTslfMGtyLbLahZek3HFDh4oZ5xnctfI1xA=
=1Hn6
-----END PGP SIGNATURE-----

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr 13 01:45:57 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A31A5106564A
	for <freebsd-arch@freebsd.org>; Tue, 13 Apr 2010 01:45:57 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 93CA88FC12
	for <freebsd-arch@freebsd.org>; Tue, 13 Apr 2010 01:45:57 +0000 (UTC)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 3B76B1A3C86; Mon, 12 Apr 2010 18:45:57 -0700 (PDT)
Date: Mon, 12 Apr 2010 18:45:57 -0700
From: Alfred Perlstein <alfred@freebsd.org>
To: d@delphij.net
Message-ID: <20100413014557.GE19003@elvis.mu.org>
References: <4BC39E93.7060906@delphij.net>
	<20100412233330.GC19003@elvis.mu.org>
	<4BC3BA48.9010009@delphij.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4BC3BA48.9010009@delphij.net>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-arch@freebsd.org
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 13 Apr 2010 01:45:57 -0000

* Xin LI <delphij@delphij.net> [100412 17:27] wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 2010/04/12 16:33, Alfred Perlstein wrote:
> > * Xin LI <delphij@delphij.net> [100412 15:28] wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> Hi,
> >>
> >> Is there a sane way to copyout ioctl request when the returning errno !=
> >> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
> >>
> >> ===========
> >>         error = kern_ioctl(td, uap->fd, com, data);
> >>
> >>         if (error == 0 && (com & IOC_OUT))
> >>                 error = copyout(data, uap->data, (u_int)size);
> >> ===========
> >>
> >> Is there any objection if I change it to something like:
> >>
> >> ===========
> >>         saved_error = kern_ioctl(td, uap->fd, com, data);
> >>
> >>         if (com & IOC_OUT)
> >>                 error = copyout(data, uap->data, (u_int)size);
> >>         if (saved_error)
> >>                 error = saved_error;
> >> ===========
> > 
> > Is this for linux compat?
> 
> Do they do this way?  I'm not quite sure :-/
> 
> I got a bug report and am thinking about how to fix it, it seems that we
> do not have a generic way of returning an error number while giving some
> "hints" about the error at the same time, for the ioctl() call.  Adding
> an extra pointer to the request structure seems to be a last-resort way
> and sounds to be ugly.

Why not just have the ioctl return success but have an error code
inside the result, example:

struct yourioctldata {
   int error; // 0 = ok, else errno
   char data[DATASIZE]; // data..
   ...
}
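
Fleshed out a little, this could look like the userland sketch below.
The handler, its arguments and the DATASIZE value are all made up for
illustration; it only stands in for a driver's ioctl routine.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define DATASIZE 16	/* arbitrary size chosen for this sketch */

struct yourioctldata {
	int error;		/* 0 = ok, else an errno value */
	char data[DATASIZE];	/* payload on success, hint on failure */
};

/*
 * Hypothetical handler.  It always returns 0, so the generic ioctl
 * code would copy the structure back out; the real status travels in
 * req->error.
 */
static int
your_handler(struct yourioctldata *req, const char *src)
{
	size_t len = strlen(src);

	if (len >= sizeof(req->data)) {
		req->error = ENAMETOOLONG;
		/* The hint: how many bytes would have been needed. */
		snprintf(req->data, sizeof(req->data), "need %zu", len + 1);
		return (0);
	}
	req->error = 0;
	memcpy(req->data, src, len + 1);	/* includes the NUL */
	return (0);
}
```

The caller then has to check both the return value of ioctl() and
req.error before trusting req.data.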


> 
> > I'm not sure this would work, it might seriously break userland
> > compat.  Have you looked around/queiried what the expected outcome
> > is from a bad ioctl?  By default the buffer will be zero'd this
> > might be unexpected by apps.  (all or nothing)
> 
> Yes that's exactly why I'm asking, my understanding is that for normal
> usages would be something like:
> 
> 	if (ioctl(fd, SIOCSOMETHING, &req) < 0) {
> 		// do something to handle the error
> 	} else {
> 		// use data fed back from req
> 	}
> 
> In this case, I think the result would not be affected.  Is there many
> (if any) programs that don't bother to check return value of ioctl()?
> 
> Speaking for the userland buffer, for _IOR ioctls, the side effect would
> be that userland would see a zeroed out 'req' structure (kernel buffer
> gets zeroed out before calling kern_ioctl), or "half-baked" one (the
> kernel code may have only written partial data).  For _IOWR ioctls, the
> side effect would be that the userland may get half-baked data.
> 
> The in-kernel request buffer is always initialized, as it is either
> overwritten by copyin(), or by bzero() so I don't think sensitive data
> could be leaked, unless the kernel code intentionally copy some
> sensitive data to the req buffer, detect if there is error, and then
> scrub sensitive data away.

I'm not sure and certainly not an authority on this.  It's probably
worth pinging a few of the standards people.

This is interesting, good luck!

-- 
- Alfred Perlstein
.- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250
.- FreeBSD committer

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr 13 13:08:33 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CB9651065670;
	Tue, 13 Apr 2010 13:08:33 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 880EE8FC19;
	Tue, 13 Apr 2010 13:08:33 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id E6E2046BA4;
	Tue, 13 Apr 2010 09:08:32 -0400 (EDT)
Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id C771D8A021;
	Tue, 13 Apr 2010 09:08:28 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org,
 d@delphij.net
Date: Tue, 13 Apr 2010 08:53:16 -0400
User-Agent: KMail/1.12.1 (FreeBSD/7.3-CBSD-20100217; KDE/4.3.1; amd64; ; )
References: <4BC39E93.7060906@delphij.net>
	<20100412233330.GC19003@elvis.mu.org>
	<4BC3BA48.9010009@delphij.net>
In-Reply-To: <4BC3BA48.9010009@delphij.net>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201004130853.16994.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Tue, 13 Apr 2010 09:08:28 -0400 (EDT)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-1.8 required=4.2 tests=AWL,BAYES_00 autolearn=ham
	version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: Alfred Perlstein <alfred@freebsd.org>
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 13 Apr 2010 13:08:34 -0000

On Monday 12 April 2010 8:26:48 pm Xin LI wrote:
> On 2010/04/12 16:33, Alfred Perlstein wrote:
> > * Xin LI <delphij@delphij.net> [100412 15:28] wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> Hi,
> >>
> >> Is there a sane way to copyout ioctl request when the returning errno !=
> >> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
> >>
> >> ===========
> >>         error = kern_ioctl(td, uap->fd, com, data);
> >>
> >>         if (error == 0 && (com & IOC_OUT))
> >>                 error = copyout(data, uap->data, (u_int)size);
> >> ===========
> >>
> >> Is there any objection if I change it to something like:
> >>
> >> ===========
> >>         saved_error = kern_ioctl(td, uap->fd, com, data);
> >>
> >>         if (com & IOC_OUT)
> >>                 error = copyout(data, uap->data, (u_int)size);
> >>         if (saved_error)
> >>                 error = saved_error;
> >> ===========
> >
> > Is this for linux compat?
> 
> Do they do this way?  I'm not quite sure :-/
> 
> I got a bug report and am thinking about how to fix it, it seems that we
> do not have a generic way of returning an error number while giving some
> "hints" about the error at the same time, for the ioctl() call.  Adding
> an extra pointer to the request structure seems to be a last-resort way
> and sounds to be ugly.

Actually, this pattern of embedding an error is quite common.  The mfi(4) and 
mpt(4) pass-thru ioctls to send firmware commands embed the return status of 
any firmware command in the structure that is passed in and out for example.

> > I'm not sure this would work, it might seriously break userland
> > compat.  Have you looked around/queiried what the expected outcome
> > is from a bad ioctl?  By default the buffer will be zero'd this
> > might be unexpected by apps.  (all or nothing)
> 
> Yes that's exactly why I'm asking, my understanding is that for normal
> usages would be something like:
> 
> 	if (ioctl(fd, SIOCSOMETHING, &req) < 0) {
> 		// do something to handle the error
> 	} else {
> 		// use data fed back from req
> 	}
> 
> In this case, I think the result would not be affected.  Is there many
> (if any) programs that don't bother to check return value of ioctl()?
> 
> Speaking for the userland buffer, for _IOR ioctls, the side effect would
> be that userland would see a zeroed out 'req' structure (kernel buffer
> gets zeroed out before calling kern_ioctl), or "half-baked" one (the
> kernel code may have only written partial data).  For _IOWR ioctls, the
> side effect would be that the userland may get half-baked data.

You miss one important variation where the error handling involves adjusting 
the request and retrying (or submitting the same request to a different ioctl 
to handle renumbering conflicts, etc.).  Other APIs such as sysctl(2) and 
setsockopt(2) can leave partial data, but the callers of those APIs expect 
that (and in fact, those APIs return the actual length of data that is copied 
out).  ioctl(2) has not had that behavior, however, and I would find it 
surprising.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Tue Apr 13 19:01:27 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EBD371065679;
	Tue, 13 Apr 2010 19:01:27 +0000 (UTC)
	(envelope-from delphij@delphij.net)
Received: from tarsier.geekcn.org (tarsier.geekcn.org [IPv6:2001:470:a803::1])
	by mx1.freebsd.org (Postfix) with ESMTP id BE1B48FC24;
	Tue, 13 Apr 2010 19:01:26 +0000 (UTC)
Received: from mail.geekcn.org (tarsier.geekcn.org [211.166.10.233])
	by tarsier.geekcn.org (Postfix) with ESMTP id 58CDAA56587;
	Wed, 14 Apr 2010 03:01:25 +0800 (CST)
X-Virus-Scanned: amavisd-new at geekcn.org
Received: from tarsier.geekcn.org ([211.166.10.233])
	by mail.geekcn.org (mail.geekcn.org [211.166.10.233]) (amavisd-new,
	port 10024)
	with LMTP id KkbYGbmQsBul; Wed, 14 Apr 2010 03:01:19 +0800 (CST)
Received: from delta.delphij.net (drawbridge.ixsystems.com [206.40.55.65])
	(using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
	(No client certificate requested)
	by tarsier.geekcn.org (Postfix) with ESMTPSA id 41B2FA5502B;
	Wed, 14 Apr 2010 03:01:18 +0800 (CST)
DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns;
	h=message-id:date:from:reply-to:organization:user-agent:
	mime-version:to:cc:subject:references:in-reply-to:
	x-enigmail-version:openpgp:content-type:content-transfer-encoding;
	b=L/qffVBjyhfSQzVWULPMzhlpceKJ99kIViVTjOxzdbOUC+sOISvr4ieVvryH8f92g
	KxRRtl/bkuW7JMCpG7G1w==
Message-ID: <4BC4BF7A.9090106@delphij.net>
Date: Tue, 13 Apr 2010 12:01:14 -0700
From: Xin LI <delphij@delphij.net>
Organization: The Geek China Organization
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US;
	rv:1.9.1.9) Gecko/20100408 Thunderbird/3.0.4 ThunderBrowse/3.2.8.1
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
References: <4BC39E93.7060906@delphij.net>
	<20100412233330.GC19003@elvis.mu.org>
	<4BC3BA48.9010009@delphij.net> <201004130853.16994.jhb@freebsd.org>
In-Reply-To: <201004130853.16994.jhb@freebsd.org>
X-Enigmail-Version: 1.0.1
OpenPGP: id=3FCA37C1;
	url=http://www.delphij.net/delphij.asc
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: Alfred Perlstein <alfred@freebsd.org>, d@delphij.net,
	freebsd-arch@freebsd.org
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: d@delphij.net
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 13 Apr 2010 19:01:28 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2010/04/13 05:53, John Baldwin wrote:
> On Monday 12 April 2010 8:26:48 pm Xin LI wrote:
>> On 2010/04/12 16:33, Alfred Perlstein wrote:
>>> * Xin LI <delphij@delphij.net> [100412 15:28] wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> Hi,
>>>>
>>>> Is there a sane way to copyout ioctl request when the returning errno !=
>>>> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
>>>>
>>>> ===========
>>>>         error = kern_ioctl(td, uap->fd, com, data);
>>>>
>>>>         if (error == 0 && (com & IOC_OUT))
>>>>                 error = copyout(data, uap->data, (u_int)size);
>>>> ===========
>>>>
>>>> Is there any objection if I change it to something like:
>>>>
>>>> ===========
>>>>         saved_error = kern_ioctl(td, uap->fd, com, data);
>>>>
>>>>         if (com & IOC_OUT)
>>>>                 error = copyout(data, uap->data, (u_int)size);
>>>>         if (saved_error)
>>>>                 error = saved_error;
>>>> ===========
>>>
>>> Is this for linux compat?
>>
>> Do they do this way?  I'm not quite sure :-/
>>
>> I got a bug report and am thinking about how to fix it, it seems that we
>> do not have a generic way of returning an error number while giving some
>> "hints" about the error at the same time, for the ioctl() call.  Adding
>> an extra pointer to the request structure seems to be a last-resort way
>> and sounds to be ugly.
> 
> Actually, this pattern of embedding an error is quite common.  The mfi(4) and 
> mpt(4) pass-thru ioctls to send firmware commands embed the return status of 
> any firmware command in the structure that is passed in and out for example.
> 
>>> I'm not sure this would work, it might seriously break userland
>>> compat.  Have you looked around/queiried what the expected outcome
>>> is from a bad ioctl?  By default the buffer will be zero'd this
>>> might be unexpected by apps.  (all or nothing)
>>
>> Yes that's exactly why I'm asking, my understanding is that for normal
>> usages would be something like:
>>
>> 	if (ioctl(fd, SIOCSOMETHING, &req) < 0) {
>> 		// do something to handle the error
>> 	} else {
>> 		// use data fed back from req
>> 	}
>>
>> In this case, I think the result would not be affected.  Is there many
>> (if any) programs that don't bother to check return value of ioctl()?
>>
>> Speaking for the userland buffer, for _IOR ioctls, the side effect would
>> be that userland would see a zeroed out 'req' structure (kernel buffer
>> gets zeroed out before calling kern_ioctl), or "half-baked" one (the
>> kernel code may have only written partial data).  For _IOWR ioctls, the
>> side effect would be that the userland may get half-baked data.
> 
> You miss one important variation where the error handling involves adjusting 
> the request and retrying (or submitting the same request to a different ioctl 
> to handle renumbering conflicts, etc.).  Other APIs such as sysctl(2) and 
> setsockopt(2) can leave partial data, but the callers of those APIs expect 
> that (and in fact, those APIs return the actual length of data that is copied 
> out).  ioctl(2) has not had that behavior, however, and I would find it 
> surprising.

I see, that's what I am concerned about, thanks for the explanation.

In order to maintain ABI compatibility I now have a patch which changes
the current behavior of SIOCGIFDESCR to set the buffer field to NULL and
return no errno.  The existing code in -HEAD doesn't seem to work when
the field is big :(
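
Roughly, the idea is as follows.  This is a userland model only, not
the patch itself: struct model_buffer and both functions are
hypothetical names, and reporting the required length back to the
caller is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer descriptor passed in and out of the request. */
struct model_buffer {
	size_t length;
	char *buffer;
};

/*
 * Model of the "get description" request: on a short buffer it sets
 * the buffer field to NULL and still returns 0 (no errno).
 */
static int
model_get_descr(const char *descr, struct model_buffer *req)
{
	size_t need = strlen(descr) + 1;	/* room for the NUL */

	if (req->length < need)
		req->buffer = NULL;
	else
		memcpy(req->buffer, descr, need);
	req->length = need;	/* assumption: report the required size */
	return (0);
}

/* Caller side: retry with a larger buffer until the request fits. */
static char *
model_fetch_descr(const char *descr)
{
	struct model_buffer req;
	char *buf = NULL, *nbuf;
	size_t len = 8;

	for (;;) {
		if ((nbuf = realloc(buf, len)) == NULL) {
			free(buf);
			return (NULL);
		}
		buf = nbuf;
		req.length = len;
		req.buffer = buf;
		if (model_get_descr(descr, &req) != 0) {
			free(buf);
			return (NULL);
		}
		if (req.buffer != NULL)
			return (buf);	/* it fit; caller frees */
		len = req.length;	/* too small; grow and retry */
	}
}
```

Since the call succeeds either way, the structure is copied out under
the existing rules and the caller can tell "too small" from a real
failure by looking at the buffer field.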

I will post the patch to -net@ for review once I get it tested.

Cheers,
- -- 
Xin LI <delphij@delphij.net>	http://www.delphij.net/
FreeBSD - The Power to Serve!	       Live free or die
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iQEcBAEBAgAGBQJLxL96AAoJEATO+BI/yjfBkkMIAIVjUmwrfHLl5F+mIlRD+Zpv
hYZVBZaeu3/ymv0Zepo5vhbvJCOWxgdtRnJgoVlkpglZLwVrKkAdfxWp/di5n8xm
O4BMc+BIra6tYnqaxmbCYoigKGoLVhim1n6j2Xld/h0n91ErBDpdrWBdHVbs8uV+
mRFLCPbGzGnEXw68rdbWjXFIDRIe7btTdmyYotaHd5AFaqQw6EM+OAXRG3UqGtm3
92o+9TW2LcTTP9gyresbQGoXvITHXVfSdihhDVfDMCtbaClQ+IFlny0oGqg0DttR
OhnEWDvBgUQD+aADYx2k8YLXziUsQzvTc7WTZuoxdz3LzZVecyQSewiydEhor/U=
=IHjf
-----END PGP SIGNATURE-----

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr 14 03:45:42 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EA9B71065673;
	Wed, 14 Apr 2010 03:45:41 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au
	[211.29.132.185])
	by mx1.freebsd.org (Postfix) with ESMTP id 1D0758FC08;
	Wed, 14 Apr 2010 03:45:40 +0000 (UTC)
Received: from c211-30-173-227.carlnfd1.nsw.optusnet.com.au
	(c211-30-173-227.carlnfd1.nsw.optusnet.com.au [211.30.173.227])
	by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o3E3jNcM027998
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 14 Apr 2010 13:45:24 +1000
Date: Wed, 14 Apr 2010 13:45:23 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: John Baldwin <jhb@FreeBSD.org>
In-Reply-To: <201004130853.16994.jhb@freebsd.org>
Message-ID: <20100414130627.V12547@delplex.bde.org>
References: <4BC39E93.7060906@delphij.net>
	<20100412233330.GC19003@elvis.mu.org>
	<4BC3BA48.9010009@delphij.net> <201004130853.16994.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Alfred Perlstein <alfred@FreeBSD.org>, d@delphij.net,
	freebsd-arch@FreeBSD.org
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Apr 2010 03:45:42 -0000

On Tue, 13 Apr 2010, John Baldwin wrote:

> On Monday 12 April 2010 8:26:48 pm Xin LI wrote:
>> On 2010/04/12 16:33, Alfred Perlstein wrote:
>>> * Xin LI <delphij@delphij.net> [100412 15:28] wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> Hi,
>>>>
>>>> Is there a sane way to copyout ioctl request when the returning errno !=
>>>> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:

No.  You could just do it, but this would be insane since it would
just waste time.

>>>>
>>>> ===========
>>>>         error = kern_ioctl(td, uap->fd, com, data);
>>>>
>>>>         if (error == 0 && (com & IOC_OUT))
>>>>                 error = copyout(data, uap->data, (u_int)size);
>>>> ===========
>>>>
>>>> Is there any objection if I change it to something like:
>>>>
>>>> ===========
>>>>         saved_error = kern_ioctl(td, uap->fd, com, data);
>>>>
>>>>         if (com & IOC_OUT)
>>>>                 error = copyout(data, uap->data, (u_int)size);
>>>>         if (saved_error)
>>>>                 error = saved_error;
>>>> ===========

errno != 0 means that the ioctl failed, so the contents of the output
buffer (output from the kernel) are indeterminate, so only broken
applications would look at them (except merely insane ones could look
at them and not use the results).

> Actually, this pattern of embedding an error is quite common.  The mfi(4) and
> mpt(4) pass-thru ioctls to send firmware commands embed the return status of
> any firmware command in the structure that is passed in and out for example.
>
>>> I'm not sure this would work, it might seriously break userland
>>> compat.  Have you looked around/queiried what the expected outcome
>>> is from a bad ioctl?  By default the buffer will be zero'd this
>>> might be unexpected by apps.  (all or nothing)

Such applications are broken.  The error might occur at any point
in the syscall and apps have no way of telling where.  Errors during
the copyout would cause a partial copy (!(all or nothing) unless partial
is actually nothing).  With a partial copy, the changed bytes could
be anywhere in the copy, depending on the implementation.
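
As a toy illustration of a partial copy, consider the model below.  It
is not the kernel's copyout(); model_copyout and its fault_at
parameter are invented for this sketch, and it copies strictly forward,
which is only one of the orders an implementation might use.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for copyout(): copies byte by byte and
 * "faults" when it reaches offset fault_at, leaving the destination
 * partially modified.
 */
static int
model_copyout(const char *kaddr, char *uaddr, size_t len, size_t fault_at)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (i == fault_at)
			return (EFAULT);	/* bytes before i already changed */
		uaddr[i] = kaddr[i];
	}
	return (0);
}
```

After a failure here the caller sees EFAULT and a buffer whose first
fault_at bytes have changed, with no way to learn how many.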

>> Yes that's exactly why I'm asking, my understanding is that for normal
>> usages would be something like:
>>
>> 	if (ioctl(fd, SIOCSOMETHING, &req) < 0) {

Testing syscalls that return -1 on error using " < 0" is a normal style bug.

>> 		// do something to handle the error
>> 	} else {
>> 		// use data fed back from req
>> 	}
>>
>> In this case, I think the result would not be affected.  Is there many
>> (if any) programs that don't bother to check return value of ioctl()?

Only broken ones.

>> Speaking for the userland buffer, for _IOR ioctls, the side effect would
>> be that userland would see a zeroed out 'req' structure (kernel buffer
>> gets zeroed out before calling kern_ioctl), or "half-baked" one (the
>> kernel code may have only written partial data).  For _IOWR ioctls, the
>> side effect would be that the userland may get half-baked data.

Hmm, the kernel probably depends on the pre-zeroing, so that half-baked data
is not necessarily an error.

> You miss one important variation where the error handling involves adjusting
> the request and retrying (or submitting the same request to a different ioctl
> to handle renumbering conflicts, etc.).  Other APIs such as sysctl(2) and
> setsockopt(2) can leave partial data, but the callers of those APIs expect
> that (and in fact, those APIs return the actual length of data that is copied
> out).  ioctl(2) has not had that behavior, however, and I would find it
> surprising.

Yes, it has no general way of reporting partial success, and doing this for
special ioctls would be complicated.  At the very least you would need to
add special error codes to distinguish normal failure (output buffer
indeterminate) from partial success (some bytes in output buffer valid,
and encode further details of the partialness).

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr 14 04:40:50 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 68820106566C;
	Wed, 14 Apr 2010 04:40:50 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au
	[211.29.132.186])
	by mx1.freebsd.org (Postfix) with ESMTP id F22C68FC18;
	Wed, 14 Apr 2010 04:40:49 +0000 (UTC)
Received: from c211-30-173-227.carlnfd1.nsw.optusnet.com.au
	(c211-30-173-227.carlnfd1.nsw.optusnet.com.au [211.30.173.227])
	by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o3E4ejVQ013943
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 14 Apr 2010 14:40:47 +1000
Date: Wed, 14 Apr 2010 14:40:45 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Rick Macklem <rmacklem@uoguelph.ca>
In-Reply-To: <Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
Message-ID: <20100414135230.U12587@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
	<Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, Andriy Gapon <avg@FreeBSD.org>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Apr 2010 04:40:50 -0000

On Sun, 11 Apr 2010, Rick Macklem wrote:

> On Sun, 11 Apr 2010, Bruce Evans wrote:
>
>> Er, the maximum size of buffers in the buffer cache is especially
>> irrelevant for nfs.  It is almost irrelevant for physical disks because
>> clustering normally increases the bulk transfer size to MAXPHYS.
>> Clustering takes a lot of CPU but doesn't affect the transfer rate much
>> unless there is not enough CPU.  It is even less relevant for network
>> i/o since there is a sort of reverse-clustering -- the buffers get split
>> up into tiny packets (normally 1500 bytes less some header bytes) at
>> the hardware level.  ...
>
> I've done a simple experiment on Mac OS X 10, where I tried different
> sizes for the read and write RPCs plus different amounts of
> read-ahead/write-behind and found the I/O rate increased linearly,
> up to the max allowed by Mac OS X (MAXBSIZE == 128K) without 
> read-ahead/write-behind. Using read-ahead/write-behind the performance
> didn't increase at all, until the RPC read/write size was reduced.
> (Solaris10 is using 256K by default and allowing up to 1Mb for read/write
> RPC size now, so they seem to think that large values work well?)
>
> When you start using a WAN environment, large read/write RPCs really
> help, from what I've seen, since that helps fill the TCP pipe
> (bits * latency between client<->server).
>
> I care much more about WAN performance than LAN performance w.r.t. this.

Indeed, I was only caring about a LAN environment.  Especially with
LANs optimized for latency (50-100 uS), nfs performance is poor for
small files, at least for the old nfs client, mainly due to close to
open consistency defeating caching, but not a problem for bulk transfers.
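
For scale, the "pipe" in the quoted text is bandwidth times round-trip
latency.  A minimal sketch of that arithmetic in C follows; the function
names are made up for illustration, and any link speed or latency fed to
them is an assumption, not a measurement from this thread:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes that must be in flight to keep a link busy:
 * bandwidth (bits/s) times round-trip time (microseconds), in bytes.
 */
static uint64_t
bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_usec)
{
	return (bits_per_sec / 8 * rtt_usec / 1000000);
}

/*
 * Number of outstanding RPCs of rpc_size bytes needed to cover the
 * bandwidth-delay product, rounded up.
 */
static uint64_t
rpcs_in_flight(uint64_t bdp, uint64_t rpc_size)
{
	assert(rpc_size > 0);
	return ((bdp + rpc_size - 1) / rpc_size);
}
```

With a hypothetical 100 Mbit/s link, a 100 uS LAN round trip needs only
about one packet in flight, while a 50 mS WAN round trip needs several
hundred kilobytes, which is where larger RPCs or more read-ahead help.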

> I am not sure what you were referring to w.r.t. clustering, but if you
> meant that the NFS client can easily do an RPC with a larger I/O size
> than the size of the buffer handed it by the buffer cache, I'd like to
> hear how that's done? (If not, then a bigger buffer from the buffer
> cache is what I need to do a larger I/O size in the RPC.)

Clustering is currently only for the local file system, at least for
the old nfs server.  nfs just does a VOP_READ() into its own buffer,
with ioflag set to indicate nfs's idea of sequentialness.  (User reads
are similar except their uio destination is UIO_USERSPACE instead of
UIO_SYSSPACE and their sequentialness is set generically and thus not
so well (but the nfs setting isn't very good either).)  The local file
system then normally does a clustered read into a larger buffer, with
the sequentialness affecting mainly startup (per-file), and virtually
copies the results to the local file system's smaller buffers.  VOP_READ()
completes by physically copying the results to nfs's buffer (using
bcopy() for UIO_SYSSPACE and copyout() for UIO_USERSPACE).  nfs can't
easily get at the larger clustering buffers or even the local file
system's buffers.  It can more easily benefit from larger MAXBSIZE.
There is still the bcopy() to take a lot of CPU and memory bus resources,
but that is insignificant compared with WAN latency.  But as I said in
a related thread, even the current MAXBSIZE is too large to use
routinely, due to buffer cache fragmentation causing significant latency
problems, so any increase in MAXBSIZE and/or routine use of buffers
of that size needs to be accompanied by avoiding the fragmentation.
Note that the fragmentation is avoided for the larger clustering buffers
by allocating them from a different pool.

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr 14 05:03:29 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 915B9106566B;
	Wed, 14 Apr 2010 05:03:29 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id 29F448FC0A;
	Wed, 14 Apr 2010 05:03:28 +0000 (UTC)
Received: from [127.0.0.1] (pooker.samsco.org [168.103.85.57])
	(authenticated bits=0)
	by pooker.samsco.org (8.14.3/8.14.3) with ESMTP id o3E53ECd018824;
	Tue, 13 Apr 2010 23:03:14 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Mime-Version: 1.0 (Apple Message framework v1078)
Content-Type: text/plain; charset=us-ascii
From: Scott Long <scottl@samsco.org>
In-Reply-To: <20100414130627.V12547@delplex.bde.org>
Date: Tue, 13 Apr 2010 23:03:14 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <463B2945-8599-4031-A7A4-E091C69E049F@samsco.org>
References: <4BC39E93.7060906@delphij.net>
	<20100412233330.GC19003@elvis.mu.org>
	<4BC3BA48.9010009@delphij.net> <201004130853.16994.jhb@freebsd.org>
	<20100414130627.V12547@delplex.bde.org>
To: Bruce Evans <brde@optusnet.com.au>
X-Mailer: Apple Mail (2.1078)
X-Spam-Status: No, score=-1.0 required=3.8 tests=ALL_TRUSTED, T_RP_MATCHES_RCVD
	autolearn=unavailable version=3.3.0
X-Spam-Checker-Version: SpamAssassin 3.3.0 (2010-01-18) on pooker.samsco.org
Cc: Alfred Perlstein <alfred@freebsd.org>, d@delphij.net,
	freebsd-arch@freebsd.org
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Apr 2010 05:03:29 -0000

On Apr 13, 2010, at 9:45 PM, Bruce Evans wrote:
> On Tue, 13 Apr 2010, John Baldwin wrote:
>
>> On Monday 12 April 2010 8:26:48 pm Xin LI wrote:
>>> On 2010/04/12 16:33, Alfred Perlstein wrote:
>>>> * Xin LI <delphij@delphij.net> [100412 15:28] wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> Hi,
>>>>>
>>>>> Is there a sane way to copyout ioctl request when the returning errno !=
>>>>> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
>
> No.  You could just do it, but this would be insane since it would
> just waste time.
>
>>>>>
>>>>> ===========
>>>>>       error = kern_ioctl(td, uap->fd, com, data);
>>>>>
>>>>>       if (error == 0 && (com & IOC_OUT))
>>>>>               error = copyout(data, uap->data, (u_int)size);
>>>>> ===========
>>>>>
>>>>> Is there any objection if I change it to something like:
>>>>>
>>>>> ===========
>>>>>       saved_error = kern_ioctl(td, uap->fd, com, data);
>>>>>
>>>>>       if (com & IOC_OUT)
>>>>>               error = copyout(data, uap->data, (u_int)size);
>>>>>       if (saved_error)
>>>>>               error = saved_error;
>>>>> ===========
>
> errno != 0 means that the ioctl failed, so the contents of the output
> buffer (output from the kernel) is indeterminate, so only broken
> applications would look at it (except merely insane ones could look
> at it and not use the results).

More specifically, think of ioctl as a transport mechanism for information.
The errno returned by it is a reflection of the state of the transport, not
the state of the information transported by it.  Layers that use ioctl to
transport their information need to use another mechanism to relay the state
of those layers and the data transported.  errno != 0 means that the ioctl
transport failed, period.  Or in other words, the transport of information
failed.

As John pointed out, if you want the client layers of ioctl to convey their
status, you need to build that status into the messages that are conveyed
over the ioctl, and not overload the ioctl status.  If that means changing
poorly written apps, then that's what it means.  Trying to further overload
the functionality of ioctl with heuristic guesses is only going to lead to
fragility and frustration.
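
A minimal userland sketch of building the status into the message, in C.
Everything here (struct demo_req, its fields, demo_handler()) is
hypothetical and only stands in for the kernel side of an _IOWR request;
it is not an existing FreeBSD interface:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical request layout: the layer's own status travels inside
 * the message, separate from the ioctl return value.
 */
struct demo_req {
	int	dr_status;	/* layer-level result */
	size_t	dr_valid;	/* bytes of dr_buf that are valid */
	char	dr_buf[16];
};

/*
 * Stand-in for a handler.  It returns 0 whenever the request could be
 * transported, so the out data (including dr_status) would be copied
 * back; a non-zero return is reserved for transport-level failure.
 */
static int
demo_handler(struct demo_req *req, const char *src, size_t len)
{
	if (req == NULL)
		return (EINVAL);	/* transport failed, no output */
	if (len > sizeof(req->dr_buf)) {
		/* Partial success is reported in the message itself. */
		memcpy(req->dr_buf, src, sizeof(req->dr_buf));
		req->dr_valid = sizeof(req->dr_buf);
		req->dr_status = EMSGSIZE;
		return (0);
	}
	memcpy(req->dr_buf, src, len);
	req->dr_valid = len;
	req->dr_status = 0;
	return (0);
}
```

The caller checks the return value first and looks at dr_status and
dr_valid only when it is 0, so the existing copyout-on-success rule in
sys_generic.c does not need to change.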

Scott


From owner-freebsd-arch@FreeBSD.ORG  Wed Apr 14 06:38:33 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5A47C106564A;
	Wed, 14 Apr 2010 06:38:33 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au
	[211.29.132.183])
	by mx1.freebsd.org (Postfix) with ESMTP id D0E7F8FC16;
	Wed, 14 Apr 2010 06:38:32 +0000 (UTC)
Received: from c211-30-173-227.carlnfd1.nsw.optusnet.com.au
	(c211-30-173-227.carlnfd1.nsw.optusnet.com.au [211.30.173.227])
	by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o3E6cS6O027715
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 14 Apr 2010 16:38:29 +1000
Date: Wed, 14 Apr 2010 16:38:28 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Andriy Gapon <avg@FreeBSD.org>
In-Reply-To: <4BC34402.1050509@freebsd.org>
Message-ID: <20100414144336.L12587@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
	<4BC34402.1050509@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Apr 2010 06:38:33 -0000

On Mon, 12 Apr 2010, Andriy Gapon wrote:

> on 11/04/2010 05:56 Bruce Evans said the following:
>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
> [snip]
>>> I have lightly tested this under qemu.
>>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
>>> I removed size > MAXBSIZE check in getblk (see a parallel thread
>>> "panic: getblk:
>>> size(%d) > MAXBSIZE(%d)").
>>
>> Did you change the other known things that depend on this?  There is the
>> b_pages limit of MAXPHYS bytes which should be checked for in another
>> way
>
> I changed the check the way I described in the parallel thread.

I didn't notice anything there about checking MAXPHYS instead of MAXBSIZE.
Was an explicit check needed?  (An implicit check would probably have
worked: most clients were limited by the MAXBSIZE check, and the pbuf
client always uses MAXPHYS or DFLTPHYS.)

>> and the soft limits for hibufspace and lobufspace which only matter
>> under load conditions.
>
> And what these should be?
> hibufspace and lobufspace seem to be auto-calculated.  One thing that I noticed
> and that was a direct cause of the problem described below, is that difference
> between hibufspace and lobufspace should be at least the maximum block size
> allowed in getblk() (perhaps it should be strictly equal to that value?).
> So in my case I had to make that difference MAXPHYS.

Hard to say.  They are mostly only heuristics which mostly only matter under
heavy loads.  You can change the defaults using sysctl but it is even harder
to know what changes might be good without knowing the details of the
implementation.

>>> And I bumped MAXPHYS to 1MB.
>>> ...
>>> But I observed reading process (dd bs=1m on avgfs) spending a lot of
>>> time sleeping
>>> on needsbuffer in getnewbuf.  needsbuffer value was VFS_BIO_NEED_ANY.
>>> Apparently there was some shortage of free buffers.
>>> Perhaps some limits/counts were incorrectly auto-tuned.
>>
>> This is not surprising, since even 64K is 4 times too large to work
>> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
>> fragmentation of buffer kva.  ...
>
> So, BKVASIZE is the best read size from the point of view of buffer space usage?

It is the best buffer size, which is almost independent of the best read
size.  First, userland reads will be re-blocked into file-system-block-size
reads...

> E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read requests, but
> leads to buffer space map fragmentation, because of size > BKVASIZE.
> On the other hand, four sequential reads of BKVASIZE=16K bytes are perfect from
> buffer space point of view (no fragmentation potential) but they result in 4 GEOM

Clustering occurs above geom, so geom only sees small requests for small
files, random accesses, and buggy cases for sequential accesses to large
files where the bugs give partial randomness.

E.g., a single 64K read from userland normally gives 4 16K ffs blocks
in the buffer cache.  Clustering turns these into 1 128K block in a
pbuf (64K for the amount read now and 64K for read-ahead; there may
be more read-ahead but it would go in another pbuf).  geom then sees
the 128K (MAXPHYS) block.  Most device drivers still only support i/o's
of size <= DFLTPHYS, but geom confuses the clustering code into producing
clusters larger than its intended maximum of what the device supports
by advertising support for MAXPHYS (v_mount->mnt_iosize_max).  So geom
normally turns the 128K request into 2 64K requests.  Clustering
finishes by converting the 128K request into 8 16K requests (4 for use
now and 4 later for read-ahead).
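
The counts in that example are plain ceiling divisions.  A minimal
sketch in C, assuming the sizes used in this message (16K ffs blocks,
DFLTPHYS = 64K, MAXPHYS = 128K); the DEMO_ names and pieces() are
illustrative and are not kernel identifiers:

```c
#include <assert.h>
#include <stddef.h>

#define	DEMO_FS_BSIZE	(16 * 1024)	/* ffs block size in the example */
#define	DEMO_DFLTPHYS	(64 * 1024)	/* what most drivers accept */
#define	DEMO_MAXPHYS	(128 * 1024)	/* cluster (pbuf) size */

/* Number of pieces of size 'piece' needed to cover 'total' bytes. */
static size_t
pieces(size_t total, size_t piece)
{
	assert(piece > 0);
	return ((total + piece - 1) / piece);
}
```

pieces(64 * 1024, DEMO_FS_BSIZE) gives the 4 ffs blocks for the userland
read, pieces(DEMO_MAXPHYS, DEMO_DFLTPHYS) the 2 requests that geom
issues, and pieces(DEMO_MAXPHYS, DEMO_FS_BSIZE) the 8 buffers that the
cluster is converted back into.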

OTOH, the first block of 4 sequential reads of 16K produces the same
128K block at the geom level, modulo bugs in the read-ahead.  This now
consists of 1 and 7 blocks of normal read and read-ahead, respectively,
instead of 4 and 4.  Then the next 3 blocks are found in the buffer
cache as read-ahead instead of read from the disk (actually, this is
insignificantly different from the first case after ffs splits up the 
64K into 4 times 16K).

So the block size makes almost no difference at the syscall level
(512-blocks take significantly more CPU but improve latency, while
huge blocks take significantly less CPU but significantly unimprove
latency).

The file system block size makes only secondary differences:
- clustering only works to turn small logical i/o's into large physical
   ones when sequential blocks are allocated sequentially, but always
   allocating blocks sequentially is hard to do and using large file
   system blocks reduces the loss when the allocation is not sequential
- large file system blocks also reduce the amount of work that clustering
   has to do to reblock.  This benefit is much smaller than the previous
   one.
- the buffer cache is only designed to handle medium-sized blocks well.
   With 512-blocks, it can only hold 1/32 as much as with 16K-blocks,
   so it will thrash 32 times as much with the former.  Now that the
   thrashing is to VMIO instead of to the disk, this only wastes CPU.
   With any block size larger than BKVASIZE, the buffer cache may become
   fragmented, depending on the combination of block sizes.  Mixed
   combinations are the worst, and the system doesn't do anything to
   try to avoid them.  The worst case is a buffer cache full of 512-blocks,
   with getblk() wanting to allocate a 64K-block.  Then it needs to
   wait for 32 contiguous blocks to become free, or forcibly free some,
   or move some...
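
The fragmentation in the last point can be shown with a toy model in C.
This is only a sketch under the assumption that buffer kva is an array
of equal-sized slots; find_free_run() is a made-up helper and is not
how vfs_bio.c actually manages kva:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: buffer kva as an array of equal-sized slots, where true
 * means the slot is in use.  Returns the index of the first run of
 * 'want' contiguous free slots, or -1 if there is none.
 */
static int
find_free_run(const bool *used, size_t nslots, size_t want)
{
	size_t i, run = 0;

	for (i = 0; i < nslots; i++) {
		run = used[i] ? 0 : run + 1;
		if (want > 0 && run == want)
			return ((int)(i + 1 - want));
	}
	return (-1);
}
```

With every second slot in use, half of the space is free but a request
for 4 contiguous slots still fails, so the caller has to wait for
neighbouring slots to be freed or force some out.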


> I/O requests.
> The thing is that a single read requires a single contiguous virtual address space
> chunk.  Would it be possible to take the best of both worlds by somehow allowing a
> single large I/O request to work with several buffers (with b_kvasize == BKVASIZE)
> in a iovec-like style?
> Have I just reinvented bicycle? :)
> Probably not, because an answer to my question is probably 'not (without lots of
> work in lots of places)' as well.

Separate buffers already partly provide this, and combined with command
queuing in the hardware they provide it completely in perhaps a better
way than can be done in software.

vfs clustering attempts much less but is still complicated.  It mainly wants
to convert buffers that have contiguous disk addresses into a super-buffer
that has contiguous virtual memory and combine this with read-ahead, to
reduce the number of i/o's.  All drives less than 10 years old benefit
only marginally from this, since the same cases that vfs clustering can
handle are also easy for drive clustering, caching and read-ahead/write-
behind (especially the latter) to handle even better, so I occasionally
try turning off vfs clustering to see if it makes a difference;
unfortunately it still seems to help on all drives, including even
reducing total CPU usage despite its own large CPU usage.

> I see that breadn() certainly doesn't work that way.  As I understand, it works
> like bread() for one block plus starts something like 'asynchronous breads()' for
> a given count of other blocks.

Usually breadn() isn't called, but clustering reads to the end of the current
cluster or maybe the next cluster.  breadn() was designed when reading
ahead a single cluster was helpful.  Now, drives read-ahead a whole track
or similar probably hundreds of sectors, so reading ahead a single sector
is almost useless.  It doesn't even reduce the number of i/o's unless it is
clustered with the previous i/o.

> I am not sure about details of how cluster_read() works, though.
> Could you please explain the essence of it?

See above.  It is essentially the old hack of reading ahead a whole
track in software, done in a sophisticated way but with fewer attempts
to satisfy disk geometry timing requirements.  Long ago, everything
was so slow that sequential reads done from userland could not keep
up with even a floppy disk, but sequential i/o's done from near the
driver could, even with i/o's of only 1 sector.  I've only ever seen
this working well for floppy disks.  For hard disks, the i/o's need
to be multi-sector, and needed to be related to the disk geometry
(handle full tracks and don't keep results from intermediate sectors
that are not needed yet iff doing so wouldn't thrash the cache).  Now,
it is unreasonable to try to know the disk geometry, and vfs clustering
doesn't try.  Fortunately, this is not much needed, since newer drives
have their own track caches which, although they don't form a complete
replacement for vfs clustering (see above), reduce the losses
to extra non-physical reads.  Similarly for another problem with vfs:
all buffers and their clustering are file (vnode) based, which almost
forces missing intermediate sectors when reading a file, but a working
device (track or similar) cache in the drive mostly compensates for not having
one in the OS.

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Wed Apr 14 08:10:25 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1B038106566C
	for <arch@freebsd.org>; Wed, 14 Apr 2010 08:10:25 +0000 (UTC)
	(envelope-from avg@freebsd.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 1E5CD8FC14
	for <arch@freebsd.org>; Wed, 14 Apr 2010 08:10:23 +0000 (UTC)
Received: from porto.topspin.kiev.ua (porto-e.starpoint.kiev.ua
	[212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id LAA19688;
	Wed, 14 Apr 2010 11:10:15 +0300 (EEST)
	(envelope-from avg@freebsd.org)
Received: from localhost.topspin.kiev.ua ([127.0.0.1])
	by porto.topspin.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1O1xfq-0000OY-Qt; Wed, 14 Apr 2010 11:10:14 +0300
Message-ID: <4BC57866.50807@freebsd.org>
Date: Wed, 14 Apr 2010 11:10:14 +0300
From: Andriy Gapon <avg@freebsd.org>
User-Agent: Thunderbird 2.0.0.24 (X11/20100321)
MIME-Version: 1.0
To: Bruce Evans <brde@optusnet.com.au>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org>
	<20100411114405.L10562@delplex.bde.org>
	<4BC34402.1050509@freebsd.org>
	<20100414144336.L12587@delplex.bde.org>
In-Reply-To: <20100414144336.L12587@delplex.bde.org>
X-Enigmail-Version: 0.96.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Apr 2010 08:10:25 -0000

on 14/04/2010 09:38 Bruce Evans said the following:
> On Mon, 12 Apr 2010, Andriy Gapon wrote:
> 
>> on 11/04/2010 05:56 Bruce Evans said the following:
>>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
>> [snip]
>>>> I have lightly tested this under qemu.
>>>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
>>>> I removed size > MAXBSIZE check in getblk (see a parallel thread
>>>> "panic: getblk:
>>>> size(%d) > MAXBSIZE(%d)").
>>>
>>> Did you change the other known things that depend on this?  There is the
>>> b_pages limit of MAXPHYS bytes which should be checked for in another
>>> way
>>
>> I changed the check the way I described in the parallel thread.
> 
> I didn't notice anything there about checking MAXPHYS instead of MAXBSIZE.
> Was an explicit check needed?  (An implicit check would probably have
> worked: most clients were limited by the MAXBSIZE check, and the pbuf
> client always uses MAXPHYS or DFLTPHYS.)

I added this:
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -2541,8 +2541,8 @@ getblk(struct vnode * vp, daddr_t blkno, int size, int slpflag, int slptimeo,

 	CTR3(KTR_BUF, "getblk(%p, %ld, %d)", vp, (long)blkno, size);
 	ASSERT_VOP_LOCKED(vp, "getblk");
-	if (size > MAXBSIZE)
-		panic("getblk: size(%d) > MAXBSIZE(%d)\n", size, MAXBSIZE);
+	if (size > MAXPHYS)
+		panic("getblk: size(%d) > MAXPHYS(%d)\n", size, MAXPHYS);

 	bo = &vp->v_bufobj;
 loop:

It wasn't really needed 'by default' but, as I said, I use my own "filesystem"
for testing and in it I do all kinds of nasty things like huge bread()-s.
So I had to add the check to get a nice panic instead of a crash and/or
corruption a little bit later.

>>> and the soft limits for hibufspace and lobufspace which only matter
>>> under load conditions.
>>
>> And what these should be?
>> hibufspace and lobufspace seem to be auto-calculated.  One thing that
>> I noticed
>> and that was a direct cause of the problem described below, is that
>> difference
>> between hibufspace and lobufspace should be at least the maximum block
>> size
>> allowed in getblk() (perhaps it should be strictly equal to that value?).
>> So in my case I had to make that difference MAXPHYS.
> 
> Hard to say.  They are mostly only heuristics which mostly only matter
> under
> heavy loads.  You can change the defaults using sysctl but it is even
> harder
> to know what changes might be good without knowing the details of the
> implementation.

Yes.
I ended up with this change:
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -613,7 +613,7 @@ bufinit(void)
 	 */
 	maxbufspace = (long)nbuf * BKVASIZE;
 	hibufspace = lmax(3 * maxbufspace / 4, maxbufspace - MAXBSIZE * 10);
-	lobufspace = hibufspace - MAXBSIZE;
+	lobufspace = hibufspace - MAXPHYS;

 	lorunningspace = 512 * 1024;
 	hirunningspace = 1024 * 1024;

Otherwise, in a situation where we need a buffer of size 'size' and bufspace <
lobufspace but bufspace + size > hibufspace, the logic in getnewbuf() falls
through the cracks.  The MAXBSIZE => MAXPHYS change reflects that we support
bread()-s as large as that.
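
The condition can be checked in isolation.  A minimal sketch in C,
assuming only the two inequalities stated above; can_jump_over_gap() is
a made-up helper and does not reproduce the real getnewbuf() logic:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True if some bufspace value exists with bufspace < lobufspace and
 * bufspace + size > hibufspace, i.e. if one allocation can jump from
 * below the low mark to above the high mark.
 */
static bool
can_jump_over_gap(long lobufspace, long hibufspace, long size)
{
	assert(lobufspace > 0 && lobufspace <= hibufspace);
	/* The largest bufspace strictly below lo is lo - 1. */
	return (lobufspace - 1 + size > hibufspace);
}
```

This is true exactly when size > hibufspace - lobufspace, which is why
the difference has to be at least the largest size that getblk() allows.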

>>>> And I bumped MAXPHYS to 1MB.
>>>> ...
>>>> But I observed reading process (dd bs=1m on avgfs) spending a lot of
>>>> time sleeping
>>>> on needsbuffer in getnewbuf.  needsbuffer value was VFS_BIO_NEED_ANY.
>>>> Apparently there was some shortage of free buffers.
>>>> Perhaps some limits/counts were incorrectly auto-tuned.
>>>
>>> This is not surprising, since even 64K is 4 times too large to work
>>> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
>>> fragmentation of buffer kva.  ...
>>
>> So, BKVASIZE is the best read size from the point of view of buffer
>> space usage?
> 
> It is the best buffer size, which is almost independent of the best read
> size.  First, userland reads will be re-blocked into file-system-block-size
> reads...

Umm, I meant 'bread() size', sorry for not being explicit.
It doesn't make much sense to talk about userland in this context.

>> E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read
>> requests, but
>> leads to buffer space map fragmentation, because of size > BKVASIZE.
>> On the other hand, four sequential reads of BKVASIZE=16K bytes are
>> perfect from
>> buffer space point of view (no fragmentation potential) but they
>> result in 4 GEOM
> 
> Clustering occurs above geom, so geom only sees small requests for small
> files, random accesses, and buggy cases for sequential accesses to large
> files where the bugs give partial randomness.

Yes, provided that the cluster API is used.
Which is the case for most places in most filesystems, but not 'all in all'.
E.g. my own "filesystem", which I mention only as a joke.
And, e.g. msdosfs_readdir, which bread()-s the whole FAT cluster in one go, and,
as we learned, that could be a lot (if sector size is large).

> E.g., a single 64K read from userland normally gives 4 16K ffs blocks
> in the buffer cache.  Clustering turns these into 1 128K block in a
> pbuf (64K for the amount read now and 128K for read-ahead; there may
> be more read-ahead but it would go in another pbuf).

Indeed.  I missed the fact that cluster I/O uses a different kind of buffers
from a different space.  Those are uniformly sized at MAXPHYS.

> geom then sees
> the 128K (MAXPHYS) block.  Most device drivers still only support i/o's
> of size <= DFLTPHYS, but geom confuses the clustering code into producing
> clusters larger than its intended maximum of what the device supports
> by advertising support for MAXPHYS (v_mount->mnt_iosize_max).

Oh, I missed this.  GEOM always setting si_iosize_max to MAXPHYS seems like a
bug.  Actual hardware/driver capabilities need to be honored.

> So geom
> normally turns the 128K request into 2 64K requests.  Clustering
> finishes by converting the 128K request into 8 16K requests (4 for use
> now and 4 later for read-ahead).
> 
> OTOH, the first block of 4 sequential reads of 16K produces the same
> 128K block at the geom level, modulo bugs in the read-ahead.  This now
> consists 1 and 7 blocks of normal read and read-ahead, respectively,
> instead of 4 and 4.  Then the next 3 blocks are found in the buffer
> cache as read-ahead instead of read from the disk (actually, this is
> insignificantly different from the first case after ffs splits up the
> 64K into 4 times 16K).
> 
> So the block size makes almost no difference at the syscall level
> (512-blocks take significantly more CPU but improve latency, while
> hige blocks take significantly less CPU but significantly unimprove
> latency).
> 
> The file system block size makes only secondary differences:
> - clustering only works to turn small logical i/o's into large physical
>   ones when sequential blocks are allocated sequentially, but always
>   allocating blocks sequentially is hard to do and using large file
>   system blocks reduces the loss when the allocation is not sequential
> - large file system blocks also reduce the amount of work that clustering
>   has to do to reblock.  This benefit is much smaller than the previous
>   one.
> - the buffer cache is only designed to handle medium-sized blocks well.
>   With 512-blocks, it can only hold 1/32 as much as with 16K-blocks,
>   so it will thrash 32 times as much with the former.  Now that the
>   thrashing is to VMIO instead of to the disk, this only wastes CPU.
>   With any block size larger than BKVASIZE, the buffer cache may become
>   fragmented, depending on the combination of block sizes.  Mixed
>   combinations are the worst, and the system doesn't do anything to
>   try to avoid them.  The worst case is a buffer cache full of 512-blocks,
>   with getblk() wanting to allocate a 64K-block.  Then it needs to
>   wait for 32 contiguous blocks to become free, or forcibly free some,
>   or move some...

Agree.

>> I/O requests.
>> The thing is that a single read requires a single contiguous virtual
>> address space
>> chunk.  Would it be possible to take the best of both worlds by
>> somehow allowing a
>> single large I/O request to work with several buffers (with b_kvasize
>> == BKVASIZE)
>> in a iovec-like style?
>> Have I just reinvented bicycle? :)
>> Probably not, because an answer to my question is probably 'not
>> (without lots of
>> work in lots of places)' as well.
> 
> Separate buffers already partly provided, this, and combined with command
> queuing in the hardware they provided it completely in perhaps a better
> way than can be done in software.
> 
> vfs clustering attempts much less but still complicated.  It mainly wants
> to convert buffers that have contiguous disk addresses into a super-buffer
> that has contiguous virtual memory and combine this with read-ahead, to
> reduce the number of i/o's.  All drives less than 10 years old benefit
> only marginally from this, since the same cases that vfs clustering can
> handle are also easy for drive clustering, caching and read-ahead/write-
> behind (especially the latter) to handle even better, so I occasionally
> try turning off vfs clustering to see if it makes a difference;
> unfortunately it still seems to help on all drives, including even
> reducing total CPU usage despite its own large CPU usage.

I think that this mainly tells us that our code doesn't handle non-cluster
I/O optimally.  For example, all calls of breadn() that I see specify only one
read-ahead block.  And all bread*()-s, of course, have fs block size.
So, with e.g. typical 8KB blocks we get lots of GEOM level and hardware level
I/O going back and forth.
While, as you say, the disks may handle that well, it is not optimal for
communication between disks and controllers, and it's definitely bad for the
GEOM and driver layers.
Essentially, what you wrote in the paragraph below :-)

>> I see that breadn() certainly doesn't work that way.  As I understand,
>> it works
>> like bread() for one block plus starts something like 'asynchronous
>> breads()' for
>> a given count of other blocks.
> 
> Usually breadn() isn't called, but clustering reads to the end of the
> current
> cluster or maybe the next cluster.  breadn() was designed when reading
> ahead a single cluster was helpful.  Now, drives read-ahead a whole track
> or similar probably hundreds of sectors, so reading ahead a single sector
> is almost useless.  It doesn't even reduce the number of i/o's unless it is
> clustered with the previous i/o.

Yes.  (I should have read the whole email before starting my reply).
Perhaps it makes sense now to change the breadn() interface and turn it into a
simpler/cheaper version of cluster_read().  I don't really see much point in
passing an array of block numbers and block sizes.
E.g. perhaps something like:
breadn(vp, startblock, blocksize, count, &bp)
and a filesystem must ensure that all the requested blocks are contiguous.
Then breadn() could read and read-ahead those blocks using an optimal block
size, not the fs block size.
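
A rough user-space sketch of that idea, not kernel code: the helper
plan_contiguous_reads(), the struct io_req type and the max_io limit
(standing in for whatever the optimal transfer size would be) are all
hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One device-level transfer; hypothetical type, not a kernel structure. */
struct io_req {
	uint64_t offset;	/* byte offset of the transfer */
	size_t	 length;	/* byte length of the transfer */
};

/*
 * Split `count` contiguous fs blocks of `blocksize` bytes, starting at
 * block `startblock`, into transfers of at most `max_io` bytes each.
 * Returns the number of transfers written to `out` (at most `out_cap`).
 */
size_t
plan_contiguous_reads(uint64_t startblock, size_t blocksize, size_t count,
    size_t max_io, struct io_req *out, size_t out_cap)
{
	uint64_t offset = startblock * blocksize;
	size_t remaining = count * blocksize;
	size_t n = 0;

	/* The caller guarantees contiguity; we only check the sizes. */
	assert(blocksize > 0 && max_io >= blocksize);
	while (remaining > 0 && n < out_cap) {
		size_t len = remaining < max_io ? remaining : max_io;

		out[n].offset = offset;
		out[n].length = len;
		offset += len;
		remaining -= len;
		n++;
	}
	return (n);
}
```

With 8KB blocks and a 128KB limit, 16 contiguous blocks would then go
down as a single transfer instead of 16 separate ones.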

>> I am not sure about details of how cluster_read() works, though.
>> Could you please explain the essence of it?
> 
> See above.  It is essentially the old hack of reading ahead a whole
> track in software, done in a sophisticated way but with fewer attempts
> to satisfy disk geometry timing requirements.  Long ago, everything
> was so slow that sequential reads done from userland could not keep
> up with even a floppy disk, but sequential i/o's done from near the
> driver could, even with i/o's of only 1 sector.  I've only ever seen
> this working well for floppy disks.  For hard disks, the i/o's need
> to be multi-sector, and needed to be related to the disk geometry
> (handle full tracks and don't keep results from intermediate sectors
> that are not needed yet iff doing so wouldn't thrash the cache).  Now,
> it is unreasonable to try to know the disk geometry, and vfs clustering
> doesn't try.  Fortunately, this is not much needed, since newer drives
> have their own track caches which, although they don't form a complete
> replacement for vfs clustering (see above), they reduce the losses
> to extra non-physical reads.  Similarly for another problem with vfs:
> all buffers and their clustering are file (vnode) based, which almost
> forces missing intermediate sectors when reading a file, but a working
> device (track or similar) in the drive mostly compensates for not having
> one in the OS.

Thank you for the explanation.
I mostly wondered how clustering worked with the buffer cache; somehow I was
overlooking the whole pbuf thing and couldn't put all the pieces together.


-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 02:34:33 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 04B401065670;
	Thu, 15 Apr 2010 02:34:33 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca
	[131.104.91.36])
	by mx1.freebsd.org (Postfix) with ESMTP id 94B0C8FC1E;
	Thu, 15 Apr 2010 02:34:32 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AvsEAH8XxkuDaFvK/2dsb2JhbACbW3G+M4UNBA
X-IronPort-AV: E=Sophos;i="4.52,209,1270440000"; d="scan'208";a="72811042"
Received: from fraser.cs.uoguelph.ca ([131.104.91.202])
	by esa-annu-pri.mail.uoguelph.ca with ESMTP; 14 Apr 2010 22:34:31 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
	by fraser.cs.uoguelph.ca (Postfix) with ESMTP id 2A7E7109C31C;
	Wed, 14 Apr 2010 22:34:31 -0400 (EDT)
X-Virus-Scanned: amavisd-new at fraser.cs.uoguelph.ca
Received: from fraser.cs.uoguelph.ca ([127.0.0.1])
	by localhost (fraser.cs.uoguelph.ca [127.0.0.1]) (amavisd-new,
	port 10024)
	with ESMTP id D5JnLDqJsxal; Wed, 14 Apr 2010 22:34:30 -0400 (EDT)
Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.91.102])
	by fraser.cs.uoguelph.ca (Postfix) with ESMTP id 7FE84109C28B;
	Wed, 14 Apr 2010 22:34:30 -0400 (EDT)
Received: from localhost (rmacklem@localhost)
	by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id
	o3F2mNg00557; Wed, 14 Apr 2010 22:48:24 -0400 (EDT)
X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing
	-bs
Date: Wed, 14 Apr 2010 22:48:23 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
X-X-Sender: rmacklem@muncher.cs.uoguelph.ca
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20100414135230.U12587@delplex.bde.org>
Message-ID: <Pine.GSO.4.63.1004142234200.28565@muncher.cs.uoguelph.ca>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
	<Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
	<20100414135230.U12587@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, Andriy Gapon <avg@FreeBSD.org>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 02:34:33 -0000



On Wed, 14 Apr 2010, Bruce Evans wrote:

> On Sun, 11 Apr 2010, Rick Macklem wrote:
>
>> On Sun, 11 Apr 2010, Bruce Evans wrote:
>> 
>>> Er, the maximum size of buffers in the buffer cache is especially
>>> irrelevant for nfs.  It is almost irrelevant for physical disks because
>>> clustering normally increases the bulk transfer size to MAXPHYS.
>>> Clustering takes a lot of CPU but doesn't affect the transfer rate much
>>> unless there is not enough CPU.  It is even less relevant for network
>>> i/o since there is a sort of reverse-clustering -- the buffers get split
>>> up into tiny packets (normally 1500 bytes less some header bytes) at
>>> the hardware level.  ...
>> 
[stuff snipped]
>
> Indeed, I was only caring about a LAN environment.  Especially with
> LANs optimized for latency (50-100 uS), nfs performance is poor for
> small files, at least for the old nfs client, mainly due to close to
> open consistency defeating caching, but not a problem for bulk transfers.
>

And I'll admit I was thinking that for a low latency LAN, a large 
read/write RPC wouldn't have a negative impact, but it sounds like
you've found 16Kb to be optimal for this case.

For NFSv4, if the client has a delegation for the file, it doesn't
have to worry about close/open consistency, so there is some hope w.r.t.
small files for this case.

>
> Clustering is currently only for the local file system, at least for
> the old nfs server.  nfs just does a VOP_READ() into its own buffer,
> with ioflag set to indicate nfs's idea of sequentialness.  (User reads
> are similar except their uio destination is UIO_USERSPACE instead of
> UIO_SYSSPACE and their sequentialness is set generically and thus not
> so well (but the nfs setting isn't very good either).)  The local file
> system then normally does a clustered read into a larger buffer, with
> the sequentialness affecting mainly startup (per-file), and virtually
> copies the results to the local file system's smaller buffers.  VOP_READ()
> completes by physically copying the results to nfs's buffer (using
> bcopy() for UIO_SYSSPACE and copyout() for UIO_USERSPACE).  nfs can't
> easily get at the larger clustering buffers or even the local file
> system's buffers.  It can more easily benefit from larger MAXBSIZE.
> There is still the bcopy() to take a lot of CPU and memory bus resources,
> but that is insignificant compared with WAN latency.  But as I said in
> a related thread, even the current MAXBSIZE is too large to use
> routinely, due to buffer cache fragmentation causing significant latency
> problems, so any increase in MAXBSIZE and/or routine use of buffers
> of that size needs to be accompanied by avoiding the fragmentation.
> Note that the fragmentation is avoided for the larger clustering buffers
> by allocating them from a different pool.
>
Ah, now I know what you were referring to w.r.t. clustering. I haven't
looked at the mechanism used to allocate buffer space in the buffer
cache, so I'll just take your word for it w.r.t. fragmentation. It
sounds like the allocation mechanism needs to be thought about if/when
MAXBSIZE gets increased.

Thanks for your input and I hope I didn't upset you when I jumped on
the "I care about WANs" bandwagon, while basically ignoring the LAN case.

rick


From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 06:41:55 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BC00B106566C
	for <freebsd-arch@freebsd.org>; Thu, 15 Apr 2010 06:41:55 +0000 (UTC)
	(envelope-from pjd@garage.freebsd.pl)
Received: from mail.garage.freebsd.pl (chello089077043238.chello.pl
	[89.77.43.238]) by mx1.freebsd.org (Postfix) with ESMTP id 0A9018FC18
	for <freebsd-arch@freebsd.org>; Thu, 15 Apr 2010 06:41:53 +0000 (UTC)
Received: by mail.garage.freebsd.pl (Postfix, from userid 65534)
	id D420045CAC; Thu, 15 Apr 2010 08:41:51 +0200 (CEST)
Received: from localhost (chello089077043238.chello.pl [89.77.43.238])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mail.garage.freebsd.pl (Postfix) with ESMTP id 786C645685;
	Thu, 15 Apr 2010 08:41:46 +0200 (CEST)
Date: Thu, 15 Apr 2010 08:41:49 +0200
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: d@delphij.net
Message-ID: <20100415064149.GB2252@garage.freebsd.pl>
References: <4BC39E93.7060906@delphij.net>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="+g7M9IMkV8truYOl"
Content-Disposition: inline
In-Reply-To: <4BC39E93.7060906@delphij.net>
User-Agent: Mutt/1.4.2.3i
X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc
X-OS: FreeBSD 9.0-CURRENT i386
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on 
	mail.garage.freebsd.pl
X-Spam-Level: 
X-Spam-Status: No, score=-0.6 required=4.5 tests=BAYES_00,RCVD_IN_SORBS_DUL 
	autolearn=no version=3.0.4
Cc: freebsd-arch@freebsd.org
Subject: Re: _IOWR when errno != 0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 06:41:55 -0000


--+g7M9IMkV8truYOl
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Mon, Apr 12, 2010 at 03:28:35PM -0700, Xin LI wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>=20
> Hi,
>=20
> Is there a sane way to copyout ioctl request when the returning errno !=3D
> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
>=20
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>         error =3D kern_ioctl(td, uap->fd, com, data);
>=20
>         if (error =3D=3D 0 && (com & IOC_OUT))
>                 error =3D copyout(data, uap->data, (u_int)size);
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>=20
> Is there any objection if I change it to something like:
>=20
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>         saved_error =3D kern_ioctl(td, uap->fd, com, data);
>=20
>         if (com & IOC_OUT)
>                 error =3D copyout(data, uap->data, (u_int)size);
>         if (saved_error)
>                 error =3D saved_error;
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

I'd like to note that OpenSolaris does copy data back even if an error
occurs. I needed to change ZFS to return 0 for ioctl(2) and return an
error within the zfs_cmd structure.

I think the FreeBSD way is better, BTW. ioctl(2) can fail for other
reasons; for example, if the data pointer is invalid, we return EFAULT
and are unable to copy data back in that case anyway.

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--+g7M9IMkV8truYOl
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iEYEARECAAYFAkvGtSwACgkQForvXbEpPzTcNACg3Iq+vXbNNUIv2Irudz1D7rE3
gjUAoOUhQ2PkIM0C2u6I2OL2gPLkTnZ/
=y1Ws
-----END PGP SIGNATURE-----

--+g7M9IMkV8truYOl--

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 10:10:24 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3E8D4106566C;
	Thu, 15 Apr 2010 10:10:24 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.159])
	by mx1.freebsd.org (Postfix) with ESMTP id 9C9158FC13;
	Thu, 15 Apr 2010 10:10:22 +0000 (UTC)
Received: by fg-out-1718.google.com with SMTP id l26so1416648fgb.13
	for <multiple recipients>; Thu, 15 Apr 2010 03:10:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:sender:received:date
	:x-google-sender-auth:received:message-id:subject:from:to:cc
	:content-type; bh=hCDk0oD70uEo6LrCGM57pj1oCwP7FyVRtghE+cwavbg=;
	b=Ob/grwGt/NTGbuaGWbyUUXZZYBYkAa9DhNwGPLSSjQZvRfBXyqXZ2SgiCAGG3PJTRZ
	HuGBZ/dSt2KHKfGbDi2z5zDdjNsBiyRY03BwUC0uRCRtZyU+gcLnIBfmMVv1NFzFUcKC
	waufk3YpvB1FIx5N2FeMzBMsKfPphHUQx9EDk=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:date:x-google-sender-auth:message-id:subject
	:from:to:cc:content-type;
	b=cmgimaGkY2J8z8P3Me/yZfixNVoSEzZHHMFWRLwTz64nO32Qt8hCZAVuBcs4GuFOOW
	QqFKKJTJdw/DFYYRYD0Z+fA8H+s1DKLLubi4T4VXEBbFN3L+L8buy2SHJWZOp/FS6LRI
	CShUTr1t3Y2h4p0msHpZifflfH8v83CtEuh4Q=
MIME-Version: 1.0
Sender: asmrookie@gmail.com
Received: by 10.239.164.140 with HTTP; Thu, 15 Apr 2010 03:10:21 -0700 (PDT)
Date: Thu, 15 Apr 2010 12:10:21 +0200
X-Google-Sender-Auth: 7e5271c0cf20c8e6
Received: by 10.239.186.140 with SMTP id g12mr520973hbh.146.1271326221435; 
	Thu, 15 Apr 2010 03:10:21 -0700 (PDT)
Message-ID: <r2r3bbf2fe11004150310w9fa12d12vebd6b7f73cc1c5c0@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: freebsd-arch@freebsd.org
Content-Type: text/plain; charset=UTF-8
Cc: Giovanni Trematerra <giovanni.trematerra@gmail.com>
Subject: [PATCH] Syncer rewriting
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 10:10:24 -0000

With fundamental help from Giovanni Trematerra and Peter Holm, I
rewrote the syncer following plans and discussions that took place
over the last 2 years, started by Jeff's effort during BSDCan
2008 (for a more complete reference you may check:
http://people.freebsd.org/~jeff/bsdcanbuf.pdf ).

Summarizing a bit, the syncer suffers from the following problems:
- Poor scalability: just one thread needs to serve all the different
mounted filesystems
- Poor flexibility: the current syncer is just used to sync dirty
buffers to disk and nothing else, catering only to buffer-cache based
filesystems
- Complex design: in order to DTRT, the syncer needs the help of a syncer
vnode and introduces some complex locking patterns. Additionally, as a
partial mitigation, a separate queue for the !MPSAFE filesystems might
be added
- Poor performance: that is actually more FS specific than anything.
UFS (but I'm not sure if this is the only one), after having synced the
dirty vnodes, does a VFS_SYNC() that actually re-syncs all the
referenced vnodes. That means dirty vnodes will be synced 2 times in
the same timeframe.

The rewrite aims to address all these problems.
The main idea is to offer a simple and opaque layer that interacts
directly with the VFS and that any filesystem may override in order to
offer its own implementation of the syncer. Right now, the layer
lives within the VFS_* methods and the mount structure. More precisely,
it offers 5 virtual functions (VFS_SYNCER_INIT, VFS_SYNCER_DESTROY,
VFS_SYNCER_ATTACH, VFS_SYNCER_DETACH, VFS_SYNCER_SPEEDUP) and an
opaque, private pointer for storing syncer-specific data.
This means that, for a given filesystem, the syncer design need not be
tied to the specific thread/process model used now.
Also, this design may be easily extended in order to support more
features, if needed.

The syncer as we have it now becomes the 'standard one' but
switches to a different model. It becomes per-mount and it then gets
rid of the syncer vnode. This also helps to simplify the locking
within the syncer a lot because now any thread is responsible only
for its own dog-food.
Filesystems specify their own syncer in the vfsops or they receive, by
default, the buffer cache "standard" syncer. Current filesystems not
using the buffer cache, however, may use the VFS_EOPNOTSUPP
specification in order to avoid defining a filesystem syncer at
all.
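
To make the shape of that layer a bit more concrete, here is a minimal
user-space sketch. It is not taken from the patch: struct syncer_ops,
struct mount_sketch, the so_*/mnt_* field names and the std_*/nosup
functions are all hypothetical, and only the five hook names and the
opaque private pointer come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct mount_sketch;

/* Hypothetical table mirroring the five VFS_SYNCER_* hooks. */
struct syncer_ops {
	int (*so_init)(struct mount_sketch *);
	int (*so_destroy)(struct mount_sketch *);
	int (*so_attach)(struct mount_sketch *);
	int (*so_detach)(struct mount_sketch *);
	int (*so_speedup)(struct mount_sketch *);
};

struct mount_sketch {
	const struct syncer_ops *mnt_syncer_ops; /* NULL selects the default */
	void *mnt_syncer_priv;			 /* opaque syncer-specific data */
};

/* Stand-in for the per-mount state of the "standard" syncer. */
struct std_syncer {
	int running;
};

int
std_init(struct mount_sketch *mp)
{
	struct std_syncer *ss = calloc(1, sizeof(*ss));

	if (ss == NULL)
		return (ENOMEM);
	mp->mnt_syncer_priv = ss;
	return (0);
}

int
std_destroy(struct mount_sketch *mp)
{
	free(mp->mnt_syncer_priv);
	mp->mnt_syncer_priv = NULL;
	return (0);
}

int
std_attach(struct mount_sketch *mp)
{
	assert(mp->mnt_syncer_priv != NULL);	/* init must come first */
	((struct std_syncer *)mp->mnt_syncer_priv)->running = 1;
	return (0);
}

int
std_detach(struct mount_sketch *mp)
{
	((struct std_syncer *)mp->mnt_syncer_priv)->running = 0;
	return (0);
}

int
std_speedup(struct mount_sketch *mp)
{
	(void)mp;		/* nothing to hurry up in this sketch */
	return (0);
}

/* For filesystems that do not want a syncer at all. */
int
nosup(struct mount_sketch *mp)
{
	(void)mp;
	return (EOPNOTSUPP);
}

const struct syncer_ops std_syncer_ops = {
	std_init, std_destroy, std_attach, std_detach, std_speedup
};
const struct syncer_ops nosyncer_ops = {
	nosup, nosup, nosup, nosup, nosup
};

/* Pick the filesystem's own hooks, or fall back to the standard syncer. */
const struct syncer_ops *
syncer_ops_for(const struct mount_sketch *mp)
{
	return (mp->mnt_syncer_ops != NULL ?
	    mp->mnt_syncer_ops : &std_syncer_ops);
}
```

Every call goes through syncer_ops_for(), so a filesystem that leaves
the table unset gets the standard behaviour and one that opts out gets
EOPNOTSUPP back from every hook.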

The patch has been tested intensively by trema and pho on a lot of
different workloads and filesystems:
http://www.freebsd.org/~attilio/syncer_beta_0.diff

Sparse notes:
- The performance problem, even if the patch doesn't currently
address it, may now be easily addressed by skipping syncing in
ffs_fsync() for the MNT_LAZY case and having ffs_sync() take care of
it.
- The standard syncer may be further improved by getting rid of the
bufobj. It should actually handle a list of vnodes rather than a list
of bufobjs. However, such optimizations may be done after the patch
is ready to enter the tree.
- The mount interlock now protects the bo_flag & BO_ONWORKLST and the
synclist iterator, thus there is no need to hold the bufobj lock when
accessing them. However, the specifics for checking whether a bufobj is
dirty are still protected by the bufobj lock, thus the insertion path
still needs it too.

Notable things that I would like to receive comments on are mostly
linked to the default syncer:
- I didn't use any form of thread consolidation for threads
automatically assigned by the default syncer. We may have different
opinions and good arguments on it.
- Something we might want to think about is the !SMP case. Maybe
we don't want the multi-thread approach for that case? Should we
revert to the current approach for !SMP?
- Right now the VFS_SYNCER_INIT() and VFS_SYNCER_DESTROY() are used
not only for flexibility but also out of necessity by the default syncer.
Some consumers may want to fill in the workitem queues earlier
than the syncer starts (VFS_SYNCER_ATTACH()) and you may not want to
lose such queued vnodes. This approach is good and also makes it
possible to support mount state updates simply, without losing
information, but it has the disadvantage of allocating structures for
filesystems that may forever be RO.

More testing, reviews and comments are undoubtedly welcome at this point.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 05:43:39 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7956D106566B
	for <freebsd-arch@freebsd.org>; Thu, 15 Apr 2010 05:43:39 +0000 (UTC)
	(envelope-from paul-zimmerman@sbcglobal.net)
Received: from web80804.mail.mud.yahoo.com (web80804.mail.mud.yahoo.com
	[209.191.72.108])
	by mx1.freebsd.org (Postfix) with SMTP id EA9FE8FC15
	for <freebsd-arch@freebsd.org>; Thu, 15 Apr 2010 05:43:38 +0000 (UTC)
Received: (qmail 21632 invoked by uid 60001); 15 Apr 2010 05:16:57 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sbcglobal.net; s=s1024;
	t=1271308617; bh=PCEKahc6hSmdZD44IFGzbQ8sy+Z59MVOsHvLZ83tG5U=;
	h=Message-ID:X-YMail-OSG:Received:X-Mailer:Date:From:Subject:To:Cc:MIME-Version:Content-Type;
	b=vG8c7OZmH18r0lkRHMBLCwbCK6Qe+k3LN14bFih05969Szr5cRurCP0SuI7bC/79x/MhnpdJHcjaVnfp602V637oTSNF2jQTMjfdFPWW2EUUYz39gZ1VdMcb7NLbnwSMUPHcy/OIl7RhUrPxcvtyse/JcEUQKvoxJBWW52abzZU=
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=sbcglobal.net;
	h=Message-ID:X-YMail-OSG:Received:X-Mailer:Date:From:Subject:To:Cc:MIME-Version:Content-Type;
	b=bOnORIeKQXBggHz0Dhe6QCPDQ7+5P6RKU9eOea1VV7cHloRljDlHb1qR34e5vp5MaixEhkhq6jsZGwT760efO3IfdeldFCwv7LTAj8TiCxS30CAgk4SJZ+jTVivtjtSLNvfIfa0gm4MB0d01QAP6rgaFqMG/G7KnRKT4R3N8U9M=;
Message-ID: <493092.21442.qm@web80804.mail.mud.yahoo.com>
X-YMail-OSG: f2v4k68VM1kgqyRe_6Bwoi9uwrTy2klK6I5eZxNaHjZ_hpn
	ojcz.hgHTQnlPFsR6JYVtDyGIaPNqd.iSoPQZGIPPNqzhR3zl8KL_o.eU_kj
	zK2_2zz7KJuru5stHu9PRSYui0jpVwAoDefu9xYUy3oITB71rq58wrroFjBF
	L7fT4ra976aHBgmidIgFSTvSFTX6JRLFOAIUrMOWmKB_tgId.9wB0J6VVc1V
	2hUyrbIwOtkCog1qZFENhamuO451ZO32CjQm4xqdTixOGAHilArpktzkh70f
	NyvgqnCYUcEEnNGZrVK3PyYUz4hZEgEqRkPTMq2cwEKfjyBblCeqgbUDF18Q
	0rlIsefVCKc9a6oMWB2be7_6Iyg--
Received: from [75.52.253.215] by web80804.mail.mud.yahoo.com via HTTP;
	Wed, 14 Apr 2010 22:16:57 PDT
X-Mailer: YahooMailClassic/10.0.8 YahooMailWebService/0.8.100.260964
Date: Wed, 14 Apr 2010 22:16:57 -0700 (PDT)
From: Paul Zimmerman <paul-zimmerman@sbcglobal.net>
To: peterjeremy@acm.org, freebsd-arch@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailman-Approved-At: Thu, 15 Apr 2010 11:06:35 +0000
Cc: bruce@cran.org.uk, ed@80386.nl, scottl@samsco.org,
	matthew.fleming@isilon.com, avg@icyb.net.ua, rwatson@freebsd.org,
	ivoras@freebsd.org, stefan@fafoe.narf.at, max@love2party.net
Subject: Re: likely and unlikely
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 05:43:39 -0000

On 2010-Mar-21 19:52:40 -0600, Peter Jeremy <peterjeremy () acm ! org> wrote:
>I suspect predict_true/predict_false is unlikely to help in most cases.
>
>What would probably be more useful for Atom would be gcc scheduling
>support.  This is available in gcc 4.3 (ie GPL3) but not in gcc 4.2.
>I've had a look at dumping the gcc 4.3 Atom scheduler into my gcc 4.2
>but the infrastructure has changed sufficiently that this would be a
>non-trivial task.  (And since it would not be committable, I don't
>think it's worth my time).  Likewise, implementing scheduling from
>scratch in gcc 4.2 would be a non-trivial task.

Just FYI, the use of likely/unlikely in the Linux kernel is not for branch
prediction. It is a hint to gcc which branch of the if() should be moved
out-of-line. The idea is to reduce the cache footprint of the most
frequently executed code paths.
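
For reference, a minimal sketch of how such hints are conventionally
spelled on top of GCC's __builtin_expect().  The macro definitions
follow the usual convention rather than any particular kernel header,
and sum_bytes() is a made-up example function; whether the compiler
really moves the cold branch out of line depends on the compiler and
optimization level.

```c
#include <stddef.h>

/* Conventional wrappers; !!(x) normalizes any truthy value to 1. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/*
 * Sum a buffer.  The NULL check is marked as the rarely taken path, so
 * the compiler is free to place its code away from the hot loop.
 */
long
sum_bytes(const unsigned char *buf, size_t len)
{
	long total = 0;

	if (unlikely(buf == NULL))
		return (-1);
	for (size_t i = 0; i < len; i++)
		total += buf[i];
	return (total);
}
```

The hint changes only code layout, not results: the function returns
the same values with or without the macros.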

-- 
Paul


From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 14:31:32 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D0558106566C;
	Thu, 15 Apr 2010 14:31:32 +0000 (UTC) (envelope-from des@des.no)
Received: from smtp.des.no (smtp.des.no [194.63.250.102])
	by mx1.freebsd.org (Postfix) with ESMTP id 7F5288FC54;
	Thu, 15 Apr 2010 14:31:32 +0000 (UTC)
Received: from ds4.des.no (des.no [84.49.246.2])
	by smtp.des.no (Postfix) with ESMTP id 0F1871FFC22;
	Thu, 15 Apr 2010 14:31:31 +0000 (UTC)
Received: by ds4.des.no (Postfix, from userid 1001)
	id 3946E844DA; Thu, 15 Apr 2010 16:30:59 +0200 (CEST)
From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= <des@des.no>
To: Attilio Rao <attilio@freebsd.org>
References: <r2r3bbf2fe11004150310w9fa12d12vebd6b7f73cc1c5c0@mail.gmail.com>
Date: Thu, 15 Apr 2010 16:30:58 +0200
In-Reply-To: <r2r3bbf2fe11004150310w9fa12d12vebd6b7f73cc1c5c0@mail.gmail.com>
	(Attilio Rao's message of "Thu, 15 Apr 2010 12:10:21 +0200")
Message-ID: <86sk6waiu5.fsf@ds4.des.no>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.95 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	freebsd-arch@freebsd.org
Subject: Re: [PATCH] Syncer rewriting
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 14:31:32 -0000

Attilio Rao <attilio@freebsd.org> writes:
> With a fundamental aid by Giovanni Trematerra and Peter Holm, I
> rewrote the syncer following plans and discussions happened
> over the last 2 years and started by a Jeff's effort during BSDCan
> 2008 (for a more complete reference you may check:
> http://people.freebsd.org/~jeff/bsdcanbuf.pdf ).

This is great!  The "lemming syncer" we currently have has been a thorn
in our side for years.

DES
--=20
Dag-Erling Sm=C3=B8rgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 14:38:26 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B65CD106566C;
	Thu, 15 Apr 2010 14:38:26 +0000 (UTC)
	(envelope-from daniel.rodrick@gmail.com)
Received: from mail-pw0-f54.google.com (mail-pw0-f54.google.com
	[209.85.160.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 89F808FC0A;
	Thu, 15 Apr 2010 14:38:26 +0000 (UTC)
Received: by pwi9 with SMTP id 9so1175932pwi.13
	for <multiple recipients>; Thu, 15 Apr 2010 07:38:26 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:date:received:message-id
	:subject:from:to:content-type;
	bh=E4GoelLAQblwS/5ZeLMtrtTR1Q2UTbUEPPGFNmTMaqk=;
	b=QADC22lZI7RKVtdDy4T7xOYqrbriNPms//Cv7LiIYserDnH0Gx7cw7WF09d5f25b6M
	QfbE4yfvh5e30H/tlG0rA1oZU24B/M9cjrV2QM1+ZAqdfPzDz2KnkNY0nMYbQTa2IBBy
	8FhfFU/3Tm1IRcFjZ3leVZ2WmqYvmyUSdgWbc=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:date:message-id:subject:from:to:content-type;
	b=HD+LNdyeBNe2Je4fduT3kKpMTv1GSTmvDU55qnXRURWZRCec9QmKfLRM+EPzma41bL
	lFuyxorHHav4ieWcNFAdS5KG+FNSEXfjhjauyLpKzKdj+rRXExTKc4VsDg2S7UTr4NB3
	JyOtLKq4jtroteOqTmNZD6Ybp2vyfw6VHgl9s=
MIME-Version: 1.0
Received: by 10.142.230.18 with HTTP; Thu, 15 Apr 2010 07:38:25 -0700 (PDT)
Date: Thu, 15 Apr 2010 20:08:25 +0530
Received: by 10.142.247.33 with SMTP id u33mr154860wfh.44.1271342305705; Thu, 
	15 Apr 2010 07:38:25 -0700 (PDT)
Message-ID: <p2t292693081004150738je2d9de95ya52cbc5ad9ce82b3@mail.gmail.com>
From: Daniel Rodrick <daniel.rodrick@gmail.com>
To: freebsd-arch@freebsd.org, freebsd-drivers@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1
Cc: 
Subject: Multiple PCI controllers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 14:38:26 -0000

Hello,

Can someone please help me understand how the old FreeBSD kernel
that did NOT have the PCI domains concept (say 6.x) used to deal with
systems that had multiple PCI / PCIe controllers on them, from a bus
numbering point of view? Was there a unified PCI tree - thus each PCI
bus number being unique in the system?

Also, how is this dealt with now?

Dan

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 14:58:26 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 015BF106564A;
	Thu, 15 Apr 2010 14:58:25 +0000 (UTC) (envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 802CF8FC08;
	Thu, 15 Apr 2010 14:58:25 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3FEljvH072965;
	Thu, 15 Apr 2010 08:47:45 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Thu, 15 Apr 2010 08:48:00 -0600 (MDT)
Message-Id: <20100415.084800.714788496340685106.imp@bsdimp.com>
To: daniel.rodrick@gmail.com
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <p2t292693081004150738je2d9de95ya52cbc5ad9ce82b3@mail.gmail.com>
References: <p2t292693081004150738je2d9de95ya52cbc5ad9ce82b3@mail.gmail.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-drivers@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Multiple PCI controllers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 14:58:26 -0000

In message: <p2t292693081004150738je2d9de95ya52cbc5ad9ce82b3@mail.gmail.com>
            Daniel Rodrick <daniel.rodrick@gmail.com> writes:
: Can some one please help me understand how did the old FreeBSD kernel
: that DID not have the PCI domains concept (say 6.x) used to deal with
: systems that had multiple PCI / PCIe controllers on them, from a bus
: numbering point of view? Was there a unified PCI tree - thus each PCI
: bus number being unique in the system?

FreeBSD has handled multiple PCI domains for a very long time.  The
support was added so that the Alpha machines could run FreeBSD.

The bus numbers were whatever the BIOS programmed them to be.  FreeBSD
doesn't program bus numbers at all, except in some very limited cases.

: Also, how is this dealt with now?

The same.  Each host controller will have a pci device tree under it.

Warner

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 18:30:09 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A4D311065674;
	Thu, 15 Apr 2010 18:30:09 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 762878FC1C;
	Thu, 15 Apr 2010 18:30:09 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 0E83D46B81;
	Thu, 15 Apr 2010 14:30:09 -0400 (EDT)
Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id 5B2358A01F;
	Thu, 15 Apr 2010 14:30:08 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Thu, 15 Apr 2010 13:11:18 -0400
User-Agent: KMail/1.12.1 (FreeBSD/7.3-CBSD-20100217; KDE/4.3.1; amd64; ; )
References: <p2t292693081004150738je2d9de95ya52cbc5ad9ce82b3@mail.gmail.com>
In-Reply-To: <p2t292693081004150738je2d9de95ya52cbc5ad9ce82b3@mail.gmail.com>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201004151311.18487.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Thu, 15 Apr 2010 14:30:08 -0400 (EDT)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham
	version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: Daniel Rodrick <daniel.rodrick@gmail.com>, freebsd-drivers@freebsd.org
Subject: Re: Multiple PCI controllers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 18:30:09 -0000

On Thursday 15 April 2010 10:38:25 am Daniel Rodrick wrote:
> Hello,
> 
> Can some one please help me understand how did the old FreeBSD kernel
> that DID not have the PCI domains concept (say 6.x) used to deal with
> systems that had multiple PCI / PCIe controllers on them, from a bus
> numbering point of view? Was there a unified PCI tree - thus each PCI
> bus number being unique in the system?

I think there were not multiple-domain machines that FreeBSD ran on in 
previous releases in general.  Some alpha machines had multiple domains (the 
alpha port referred to them as 'hoses') and the support was incomplete (VGA 
cards had to be in domain 0 for FreeBSD to see them IIRC).  I am not 
personally aware of any x86 machines with multiple domains.  I believe the x86 
port only supports domain 0 currently.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 20:36:35 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3DC701065673
	for <arch@freebsd.org>; Thu, 15 Apr 2010 20:36:35 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id DDD838FC1B
	for <arch@freebsd.org>; Thu, 15 Apr 2010 20:36:34 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3FKVWa8077210
	for <arch@freebsd.org>; Thu, 15 Apr 2010 14:31:32 -0600 (MDT)
	(envelope-from imp@bsdimp.com)
Date: Thu, 15 Apr 2010 14:31:47 -0600 (MDT)
Message-Id: <20100415.143147.69510145118168557.imp@bsdimp.com>
To: arch@freebsd.org
From: "M. Warner Losh" <imp@bsdimp.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: 
Subject: TARGET_BIG_ENDIAN branch collapse
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 20:36:35 -0000

I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN
stuff.  You can find a diff at
http://people.freebsd.org/~imp/tbemd-20100415.diff.

Highlights include:
o Eliminating TARGET_BIG_ENDIAN entirely
o Eliminating the setting of endian flags in sys.mk and bsd.cpu.mk
o Moving from mips to mipseb and mipsel for MACHINE_ARCH. [*]
o Moving from arm to armeb and arm for MACHINE_ARCH. [**]
o Creating MACHINE_CPUARCH which is the set of architectures that's
  supported.  The 'mips' CPUARCH will support MACHINE_ARCH of mipsel,
  mipseb, mips64eb, mips64el, for example.  This means many of the
  places we used to use MACHINE_ARCH we now use MACHINE_CPUARCH.
o Moving to including Makefile.${MACHINE}, Makefile.${MACHINE_ARCH}, or
  Makefile.${MACHINE_CPUARCH}, in that order, to select or deselect
  portions of FreeBSD.  We already did this for places like libc.  I'm
  just generalizing it.
o Some minor tweaks to gcc and binutils to make the build work with
  the new paradigm.
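
To make the MACHINE_ARCH / MACHINE_CPUARCH split and the Makefile lookup
order concrete, here is a minimal sketch in C.  It is not how the build
system does it (that lives in make variables); the table, the function
names cpuarch_for() and makefile_candidates(), and the rule that an
unlisted architecture maps to itself are all assumptions made for
illustration.  Only the architecture names come from the list above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table: MACHINE_ARCH values named in the proposal,
 * paired with the MACHINE_CPUARCH family each would belong to. */
struct arch_map {
	const char *machine_arch;
	const char *machine_cpuarch;
};

static const struct arch_map arch_table[] = {
	{ "mipsel",   "mips" },
	{ "mipseb",   "mips" },
	{ "mips64el", "mips" },
	{ "mips64eb", "mips" },
	{ "arm",      "arm"  },
	{ "armeb",    "arm"  },
};

/* Return the CPU family for a MACHINE_ARCH.  Assumption: an
 * architecture not in the table is its own family. */
const char *
cpuarch_for(const char *machine_arch)
{
	size_t i;

	for (i = 0; i < sizeof(arch_table) / sizeof(arch_table[0]); i++)
		if (strcmp(arch_table[i].machine_arch, machine_arch) == 0)
			return (arch_table[i].machine_cpuarch);
	return (machine_arch);
}

/* Fill out[0..2] with candidate file names, most specific first:
 * Makefile.${MACHINE}, Makefile.${MACHINE_ARCH},
 * Makefile.${MACHINE_CPUARCH}. */
void
makefile_candidates(const char *machine, const char *machine_arch,
    char out[3][64])
{
	snprintf(out[0], 64, "Makefile.%s", machine);
	snprintf(out[1], 64, "Makefile.%s", machine_arch);
	snprintf(out[2], 64, "Makefile.%s", cpuarch_for(machine_arch));
}
```

A caller would try the three names in order and include the first one
that exists.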

Please send me your comments and suggestions.  I plan on starting to
integrate some of these technologies into head soon (as well as
coordinating with Juli Mallett's work to bring new ABIs to MIPS).

This is all orthogonal to MACHINE_CPUTYPE and MACHINE_CPU[***] which will
remain unchanged in FreeBSD.

Comments?

Warner

[*] While I generally don't want to talk about names here, since I've
selected the names used by NetBSD, Linux and binutils/gcc, there may
be some tweaking in the final values as these groups have minor
variations in naming mips which complicates things...

[**] These names are well established and consistent among all the
groups.

[***] NetBSD calls MACHINE_CPUARCH just MACHINE_CPU, but since we're
already using that for something else, I had to diverge.

From owner-freebsd-arch@FreeBSD.ORG  Thu Apr 15 22:37:08 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7E8B81065672
	for <arch@FreeBSD.org>; Thu, 15 Apr 2010 22:37:08 +0000 (UTC)
	(envelope-from glebius@FreeBSD.org)
Received: from cell.glebius.int.ru (glebius.int.ru [81.19.64.117])
	by mx1.freebsd.org (Postfix) with ESMTP id 81E208FC08
	for <arch@FreeBSD.org>; Thu, 15 Apr 2010 22:37:06 +0000 (UTC)
Received: from cell.glebius.int.ru (localhost [127.0.0.1])
	by cell.glebius.int.ru (8.14.3/8.14.3) with ESMTP id o3FMbDw3038866
	for <arch@FreeBSD.org>; Fri, 16 Apr 2010 02:37:13 +0400 (MSD)
	(envelope-from glebius@FreeBSD.org)
Received: (from glebius@localhost)
	by cell.glebius.int.ru (8.14.3/8.14.3/Submit) id o3FMbDRK038865
	for arch@freebsd.org; Fri, 16 Apr 2010 02:37:13 +0400 (MSD)
	(envelope-from glebius@FreeBSD.org)
X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to
	glebius@FreeBSD.org using -f
Date: Fri, 16 Apr 2010 02:37:13 +0400
From: Gleb Smirnoff <glebius@FreeBSD.org>
To: arch@FreeBSD.org
Message-ID: <20100415223713.GF97761@FreeBSD.org>
References: <20100326211706.GI18894@FreeBSD.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=koi8-r
Content-Disposition: inline
In-Reply-To: <20100326211706.GI18894@FreeBSD.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: 
Subject: Re: touch panel support
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Apr 2010 22:37:08 -0000

  Here is my further hacking on getting a touch panel working with FreeBSD.

On Sat, Mar 27, 2010 at 12:17:06AM +0300, Gleb Smirnoff wrote:
T>   And then I've got a problem. Our mouse subsystem is not ready for
T> touch panels. Our mouse(4) protocol does not support mouse driver
T> passing _absolute_ coordinates to the mouse(4) subsystem. It only
T> expects a relative movement of the mouse.  But _absolute_ coordinates
T> are the principal idea of any touch panel.

Well, syscons actually has support for absolute mouse movement. We
just need this tiny patch to get it working.

Index: scmouse.c
===================================================================
--- scmouse.c   (revision 206501)
+++ scmouse.c   (working copy)
@@ -700,6 +700,7 @@
            scp->mouse_xpos = mouse->u.data.x;
            scp->mouse_ypos = mouse->u.data.y;
            set_mouse_pos(scp);
+           goto motion;
            splx(s);
            break;
 
@@ -732,6 +733,7 @@
                cur_scp->mouse_ypos += mouse->u.data.y;
                set_mouse_pos(cur_scp);
            }
+motion:
            f = 0;
            if (mouse->operation == MOUSE_ACTION) {
                f = cur_scp->mouse_buttons ^ mouse->u.data.buttons;

This also requires userland (moused(8)) to put absolute coordinates
into mouse->u.data and issue the MOUSE_MOVEABS command instead of
MOUSE_ACTION.

The patch to moused(8) looks like the following. First, we need to
recognize a mouse protocol that works with absolute coordinates:

@@ -1584,6 +1629,9 @@
        }
     }
 
+    if (rodent.mode.protocol == MOUSE_PROTO_EGALAX)
+       rodent.flags |= AbsoluteXY;
+
     debug("proto params: %02x %02x %02x %02x %d %02x %02x",
        cur_proto[0], cur_proto[1], cur_proto[2], cur_proto[3],
        cur_proto[4], cur_proto[5], cur_proto[6]);
@@ -2170,6 +2218,22 @@
        prev_y = y;
        break;
 
+    case MOUSE_PROTO_EGALAX:           /* eGalax */
+       x = (pBuf[1] << 7) | pBuf[2];
+       y = (pBuf[3] << 7) | pBuf[4];
+
+       act->flags = 0;
+       act->button = 0; /* TODO */
+       if (x != prev_x || y != prev_y) {
+               act->dx = prev_x = x;
+               act->dy = prev_y = y;
+               act->flags |= MOUSE_POSCHANGED;
+       }
+
+       return (act->flags);
+
+       break;
+
     case MOUSE_PROTO_BUS:              /* Bus */
     case MOUSE_PROTO_INPORT:           /* InPort */
        act->button = butmapmsc[(~pBuf[0]) & MOUSE_MSC_BUTTONS];
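
The coordinate decoding in the eGalax case above can be pulled out and
tried on its own.  The sketch below mirrors the two shift-and-or
expressions from the hunk; the struct touch_sample type and the
function name egalax_decode() are made up for illustration, and byte 0
of the packet is left uninterpreted here, just as button handling is
still a TODO in the patch.

```c
#include <stdint.h>

/* Hypothetical holder for one decoded sample (raw panel units). */
struct touch_sample {
	int x;
	int y;
};

/* Combine bytes 1-2 into X and bytes 3-4 into Y, shifting the high
 * byte left by 7 bits exactly as the patch does.  The caller must
 * pass at least 5 bytes. */
struct touch_sample
egalax_decode(const uint8_t buf[5])
{
	struct touch_sample s;

	s.x = (buf[1] << 7) | buf[2];
	s.y = (buf[3] << 7) | buf[4];
	return (s);
}
```

These are still raw panel coordinates, not screen pixels.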

Then we need to pass these absolute coords to mouse(4):

@@ -1295,11 +1335,14 @@
                if (action2.flags & MOUSE_POSCHANGED) {
                    mouse.operation = MOUSE_MOTION_EVENT;
                    mouse.u.data.buttons = action2.button;
-                   if (rodent.flags & ExponentialAcc) {
+                   if (rodent.flags & AbsoluteXY) {
+                       absmove(action2.dx, action2.dy,
+                           &mouse.u.data.x, &mouse.u.data.y);
+                       mouse.operation = MOUSE_MOVEABS;
+                   } else if (rodent.flags & ExponentialAcc) {
                        expoacc(action2.dx, action2.dy,
                            &mouse.u.data.x, &mouse.u.data.y);
-                   }
-                   else {
+                   } else {
                        linacc(action2.dx, action2.dy,
                            &mouse.u.data.x, &mouse.u.data.y);
                    }
@@ -1311,11 +1354,14 @@
            } else {
                mouse.operation = MOUSE_ACTION;
                mouse.u.data.buttons = action2.button;
-               if (rodent.flags & ExponentialAcc) {
+               if (rodent.flags & AbsoluteXY) {
+                   absmove(action2.dx, action2.dy,
+                       &mouse.u.data.x, &mouse.u.data.y);
+                   mouse.operation = MOUSE_MOVEABS;
+               } else if (rodent.flags & ExponentialAcc) {
                    expoacc(action2.dx, action2.dy,
                        &mouse.u.data.x, &mouse.u.data.y);
-               }
-               else {
+               } else {
                    linacc(action2.dx, action2.dy,
                        &mouse.u.data.x, &mouse.u.data.y);
                }

The absmove() function should perform calibration and then assign
the calibrated x from the touch panel to u.data.x, and likewise for y.

Doing calibration in moused(8) is a problem. Userland can't guess or
query the pixel size of the syscons screen. For now I've just hardcoded
my own calibration values into absmove(), and got all of this
working. The stylus moves the mouse pointer flawlessly and correctly
in syscons. :)
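
For readers who want to see what such a hardcoded absmove() might look
like, here is a minimal sketch.  It is not the code from my tree: the
calibration numbers, the struct calib layout, the 640x480 screen size
and the plain linear scaling with clamping are all assumptions chosen
for illustration.  Only the call shape (two raw inputs, two output
pointers) follows the patch above.

```c
/* Hypothetical calibration data.  Real values depend on the panel
 * and on the syscons video mode in use. */
struct calib {
	int raw_min_x, raw_max_x;	/* raw readings at left/right edge */
	int raw_min_y, raw_max_y;	/* raw readings at top/bottom edge */
	int screen_w, screen_h;		/* screen size in pixels */
};

static const struct calib cal = { 100, 1950, 120, 1900, 640, 480 };

/* Map one raw axis reading linearly onto 0 .. size-1, clamping
 * readings that fall outside the calibrated range. */
static int
scale_axis(int raw, int raw_min, int raw_max, int size)
{
	if (raw < raw_min)
		raw = raw_min;
	if (raw > raw_max)
		raw = raw_max;
	return ((raw - raw_min) * (size - 1) / (raw_max - raw_min));
}

/* Convert raw panel coordinates to screen coordinates. */
void
absmove(int raw_x, int raw_y, int *x, int *y)
{
	*x = scale_axis(raw_x, cal.raw_min_x, cal.raw_max_x, cal.screen_w);
	*y = scale_axis(raw_y, cal.raw_min_y, cal.raw_max_y, cal.screen_h);
}
```

The screen size is baked into the table, which is exactly the part
that userland has no good way to learn from syscons.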

But, unfortunately, this is only step zero towards a touchscreen working
in X. Although we now have a working absolute mouse pointer in syscons,
we can't pass it through the sysmouse(4) protocol :(

-- 
Totus tuus, Glebius.

From owner-freebsd-arch@FreeBSD.ORG  Fri Apr 16 01:54:31 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 357431065670;
	Fri, 16 Apr 2010 01:54:31 +0000 (UTC)
	(envelope-from grehan@freebsd.org)
Received: from dommail.onthenet.com.au (dommail.OntheNet.com.au [203.13.70.57])
	by mx1.freebsd.org (Postfix) with ESMTP id 4222D8FC0A;
	Fri, 16 Apr 2010 01:54:29 +0000 (UTC)
Received: from dallas-lxp.hq.netapp.com (c-67-190-167-186.hsd1.co.comcast.net
	[67.190.167.186]) by dommail.onthenet.com.au (MOS 4.1.8-GA)
	with ESMTP id ALT72123 (AUTH peterg@ptree32.com.au);
	Fri, 16 Apr 2010 11:42:42 +1000
Message-ID: <4BC7C08C.2050002@freebsd.org>
Date: Thu, 15 Apr 2010 19:42:36 -0600
From: Peter Grehan <grehan@freebsd.org>
User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228)
MIME-Version: 1.0
To: freebsd-arch@freebsd.org
References: <p2t292693081004150738je2d9de95ya52cbc5ad9ce82b3@mail.gmail.com>
	<201004151311.18487.jhb@freebsd.org>
In-Reply-To: <201004151311.18487.jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Daniel Rodrick <daniel.rodrick@gmail.com>, freebsd-drivers@freebsd.org
Subject: Re: Multiple PCI controllers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 16 Apr 2010 01:54:31 -0000

> I think there were not multiple-domain machines that FreeBSD ran on in 
> previous releases in general.

  Power Macs have up to 3 PCI buses, with each one having bus number 0 
at the host bridge. FreeBSD 6.* was fine with that, except if there was 
a conflict in bus/slot/function. Fortunately it looked like OpenFirmware 
was careful to avoid creating these conflicts when doing bus assignment.
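
  The conflict described above can be illustrated with a small sketch.
The struct and function names below are hypothetical and are not the
kernel's own types; the point is only that two devices behind different
host bridges can share bus/slot/function and become distinguishable
once a domain number is part of the address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PCI address record (illustrative field names). */
struct pci_addr {
	uint32_t domain;	/* one per host bridge / PCI tree */
	uint8_t  bus;
	uint8_t  slot;
	uint8_t  func;
};

/* Comparison without domains: bus/slot/function only. */
bool
same_bsf(const struct pci_addr *a, const struct pci_addr *b)
{
	return (a->bus == b->bus && a->slot == b->slot &&
	    a->func == b->func);
}

/* Comparison with domains: the full address must match. */
bool
same_addr(const struct pci_addr *a, const struct pci_addr *b)
{
	return (a->domain == b->domain && same_bsf(a, b));
}
```

  With one device at domain 0, bus 0, slot 1, function 0 and another at
domain 1 with the same bus/slot/function, same_bsf() reports a match
while same_addr() does not.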

later,

Peter.

From owner-freebsd-arch@FreeBSD.ORG  Fri Apr 16 08:11:46 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E6800106564A
	for <arch@freebsd.org>; Fri, 16 Apr 2010 08:11:46 +0000 (UTC)
	(envelope-from gary.jennejohn@freenet.de)
Received: from mout7.freenet.de (mout7.freenet.de [IPv6:2001:748:100:40::2:9])
	by mx1.freebsd.org (Postfix) with ESMTP id 801718FC13
	for <arch@freebsd.org>; Fri, 16 Apr 2010 08:11:46 +0000 (UTC)
Received: from [195.4.92.15] (helo=5.mx.freenet.de)
	by mout7.freenet.de with esmtpa (ID gary.jennejohn@freenet.de) (port
	25) (Exim 4.72 #3)
	id 1O2geP-0001z0-2X; Fri, 16 Apr 2010 10:11:45 +0200
Received: from p57ae0f02.dip0.t-ipconnect.de ([87.174.15.2]:56989
	helo=ernst.jennejohn.org)
	by 5.mx.freenet.de with esmtpa (ID gary.jennejohn@freenet.de) (port 25)
	(Exim 4.72 #3) id 1O2geO-0004Mr-Ps; Fri, 16 Apr 2010 10:11:45 +0200
Date: Fri, 16 Apr 2010 10:11:44 +0200
From: Gary Jennejohn <gary.jennejohn@freenet.de>
To: "M. Warner Losh" <imp@bsdimp.com>
Message-ID: <20100416101144.68e8beb8@ernst.jennejohn.org>
In-Reply-To: <20100415.143147.69510145118168557.imp@bsdimp.com>
References: <20100415.143147.69510145118168557.imp@bsdimp.com>
X-Mailer: Claws Mail 3.7.5 (GTK+ 2.18.7; amd64-portbld-freebsd9.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org
Subject: Re: TARGET_BIG_ENDIAN branch collapse
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: gary.jennejohn@freenet.de
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 16 Apr 2010 08:11:47 -0000

On Thu, 15 Apr 2010 14:31:47 -0600 (MDT)
"M. Warner Losh" <imp@bsdimp.com> wrote:

> I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN
> stuff.  You can find a diff at
> http://people.freebsd.org/~imp/tbemd-20100415.diff.
> 

fetch http://people.freebsd.org/~imp/tbemd-20100415.diff
fetch: http://people.freebsd.org/~imp/tbemd-20100415.diff: Not Found

--
Gary Jennejohn

From owner-freebsd-arch@FreeBSD.ORG  Fri Apr 16 08:23:05 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 50132106566C;
	Fri, 16 Apr 2010 08:23:05 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 133A88FC1A;
	Fri, 16 Apr 2010 08:23:04 +0000 (UTC)
Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id AE23090193;
	Fri, 16 Apr 2010 08:23:03 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.4/8.14.4) with ESMTP id o3G8N30A029918;
	Fri, 16 Apr 2010 08:23:03 GMT (envelope-from phk@critter.freebsd.dk)
To: Attilio Rao <attilio@freebsd.org>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Thu, 15 Apr 2010 12:10:21 +0200."
	<r2r3bbf2fe11004150310w9fa12d12vebd6b7f73cc1c5c0@mail.gmail.com> 
Date: Fri, 16 Apr 2010 08:23:03 +0000
Message-ID: <29917.1271406183@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	freebsd-arch@freebsd.org
Subject: Re: [PATCH] Syncer rewriting 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 16 Apr 2010 08:23:05 -0000

In message <r2r3bbf2fe11004150310w9fa12d12vebd6b7f73cc1c5c0@mail.gmail.com>,
Attilio Rao writes:

>The syncer, meant as what we have now, becomes the 'standard one' but
>switches to a different model. It becomes per-mount and it then gets
>rid of the syncer vnode. This also helps a lot in simplifying the
>locking within the syncer because now any thread is responsible only
>for its own dog-food.

YeeeeEEEEEHAAAAA!

Go! Go! GO!


>- The standard syncer may be further improved by getting rid of the
>bufobj. It should actually handle a list of vnodes rather than a list
>of bufobj. However, similar optimizations may be done after the patch
>is ready to enter the tree.

That would be the wrong direction: we need the bufobj because for instance
a RAID5 geom module does not have a vnode for the parity data.

If you force the syncer to only work on vnodes, then we need a parallel
mechanism for non-filesystem disk users.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Fri Apr 16 08:36:11 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CD6241065673;
	Fri, 16 Apr 2010 08:36:11 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au
	[211.29.132.189])
	by mx1.freebsd.org (Postfix) with ESMTP id 629D08FC17;
	Fri, 16 Apr 2010 08:36:10 +0000 (UTC)
Received: from c122-106-149-225.carlnfd1.nsw.optusnet.com.au
	(c122-106-149-225.carlnfd1.nsw.optusnet.com.au [122.106.149.225])
	by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o3G8a7AV026419
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Fri, 16 Apr 2010 18:36:08 +1000
Date: Fri, 16 Apr 2010 18:36:07 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Rick Macklem <rmacklem@uoguelph.ca>
In-Reply-To: <Pine.GSO.4.63.1004142234200.28565@muncher.cs.uoguelph.ca>
Message-ID: <20100416181926.F1082@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
	<Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
	<20100414135230.U12587@delplex.bde.org>
	<Pine.GSO.4.63.1004142234200.28565@muncher.cs.uoguelph.ca>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, Andriy Gapon <avg@FreeBSD.org>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 16 Apr 2010 08:36:11 -0000

On Wed, 14 Apr 2010, Rick Macklem wrote:

> On Wed, 14 Apr 2010, Bruce Evans wrote:
> [stuff snipped]
>> 
>> Indeed, I was only caring about a LAN environment.  Especially with
>> LANs optimized for latency (50-100 uS), nfs performance is poor for
>> small files, at least for the old nfs client, mainly due to close to
>> open consistency defeating caching, but not a problem for bulk transfers.
>
> And I'll admit I was thinking that for a low latency LAN, a large read/write 
> RPC wouldn't have a negative impact, but it sounds like
> you've found 16Kb to be optimal for this case.

I'll try to find old benchmark results or repeat the benchmarks.

> For NFSv4, if the client has a delegation for the file, it doesn't
> have to worry about close/open consistency, so there is some hope w.r.t.
> small files for this case.

Do you have benchmarks?  A kernel build (without -j) is a good test.
Due to include bloat and include nesting bloat, a kernel build opens
and closes the same small include files hundreds or thousands of times
each, with O(10^5) includes altogether, so an RPC to read attributes
on each open costs a lot of latency.  nfs on a LAN does well to take
only 10% longer than a local file system, and after disabling
close/open consistency takes only about half as much longer, by reducing
the number of RPCs by about a factor of 2.  The difference should be
even more noticeable on a WAN.  Building with -j reduces the extra
time by not stalling the whole build waiting for each RPC.  I probably
needed it to take only 10% longer.
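
As a rough back-of-the-envelope check, the stall from these RPCs can be
estimated as (number of opens) x (round-trip time).  The sketch below
assumes one attribute RPC per open, fully serialized (no -j), and uses
the 50-100 uS LAN latency quoted above as the round-trip time.  Both
are simplifying assumptions, and the function name rpc_stall_seconds()
is made up for this example.

```c
/* Seconds spent waiting if every open costs one serialized RPC
 * round trip of rtt_us microseconds. */
double
rpc_stall_seconds(long opens, double rtt_us)
{
	return ((double)opens * rtt_us / 1e6);
}
```

With 10^5 opens this gives roughly 5 to 10 seconds of pure waiting for
round trips of 50 to 100 uS, and proportionally more on a WAN.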

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Fri Apr 16 09:42:02 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 20569106564A
	for <arch@freebsd.org>; Fri, 16 Apr 2010 09:42:02 +0000 (UTC)
	(envelope-from pluknet@gmail.com)
Received: from mail-bw0-f214.google.com (mail-bw0-f214.google.com
	[209.85.218.214])
	by mx1.freebsd.org (Postfix) with ESMTP id 9C5808FC16
	for <arch@freebsd.org>; Fri, 16 Apr 2010 09:42:01 +0000 (UTC)
Received: by bwz6 with SMTP id 6so2115519bwz.13
	for <arch@freebsd.org>; Fri, 16 Apr 2010 02:42:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:in-reply-to:references
	:date:received:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=JQBcp03ZKGd/axA2/VCpCJuUYub+08D1NVszQvkEA2c=;
	b=o3WiQFpvPpNCa3MQX12pfKGsmjaz/FCKhw90UMfDuPc9/Io6VQRKUC1B+koOinMKXf
	KroqZw6HSki3ghOTyUWPK2F6VuLucElbdoPX5/iqdjGWxsASVfFSg6rhYNKkdJZuwIAK
	WmjTfdG77jxS1QqUNnKj5z5F17siq7Vq3CHek=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	b=UpQDLvJnVL7jtyeXkTDN+p+9sZ9oYAvrVI1YGdPv9orpm0R+qaIIpH3TW4dIEF4rHP
	0vXXnliBk3Hitq5/fuD1SmHjaV9LPLduy7NKnjTQgUItXK1+qdnLmOiIviYSh/E2jR+z
	PLLyg3FxPNtd6L+mrGngXBxJaOoJ+AuoY9m9k=
MIME-Version: 1.0
Received: by 10.204.47.232 with HTTP; Fri, 16 Apr 2010 02:15:17 -0700 (PDT)
In-Reply-To: <20100416101144.68e8beb8@ernst.jennejohn.org>
References: <20100415.143147.69510145118168557.imp@bsdimp.com>
	<20100416101144.68e8beb8@ernst.jennejohn.org>
Date: Fri, 16 Apr 2010 13:15:17 +0400
Received: by 10.204.32.77 with SMTP id b13mr1457345bkd.113.1271409317186; Fri, 
	16 Apr 2010 02:15:17 -0700 (PDT)
Message-ID: <k2ma31046fc1004160215tccd50e85p5f29a22f20f96cf@mail.gmail.com>
From: pluknet <pluknet@gmail.com>
To: gary.jennejohn@freenet.de
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: arch@freebsd.org
Subject: Re: TARGET_BIG_ENDIAN branch collapse
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 16 Apr 2010 09:42:02 -0000

On 16 April 2010 12:11, Gary Jennejohn <gary.jennejohn@freenet.de> wrote:
> On Thu, 15 Apr 2010 14:31:47 -0600 (MDT)
> "M. Warner Losh" <imp@bsdimp.com> wrote:
>
>> I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN
>> stuff.  You can find a diff at
>> http://people.freebsd.org/~imp/tbemd-20100415.diff.
>>
>
> fetch http://people.freebsd.org/~imp/tbemd-20100415.diff
> fetch: http://people.freebsd.org/~imp/tbemd-20100415.diff: Not Found
>

Whilst there's http://people.freebsd.org/~imp/tbemd.diff

-- 
wbr,
pluknet

From owner-freebsd-arch@FreeBSD.ORG  Fri Apr 16 16:05:39 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 41ACF1065674
	for <arch@FreeBSD.org>; Fri, 16 Apr 2010 16:05:39 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id DCB9A8FC22
	for <arch@FreeBSD.org>; Fri, 16 Apr 2010 16:05:38 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3GG0WKQ090653;
	Fri, 16 Apr 2010 10:00:33 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Fri, 16 Apr 2010 10:00:32 -0600 (MDT)
Message-Id: <20100416.100032.74663955.imp@bsdimp.com>
To: gary.jennejohn@freenet.de
From: Warner Losh <imp@bsdimp.com>
In-Reply-To: <20100416101144.68e8beb8@ernst.jennejohn.org>
References: <20100415.143147.69510145118168557.imp@bsdimp.com>
	<20100416101144.68e8beb8@ernst.jennejohn.org>
X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: arch@FreeBSD.org
Subject: Re: TARGET_BIG_ENDIAN branch collapse
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 16 Apr 2010 16:05:39 -0000

From: Gary Jennejohn <gary.jennejohn@freenet.de>
Subject: Re: TARGET_BIG_ENDIAN branch collapse
Date: Fri, 16 Apr 2010 10:11:44 +0200

> On Thu, 15 Apr 2010 14:31:47 -0600 (MDT)
> "M. Warner Losh" <imp@bsdimp.com> wrote:
> 
> > I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN
> > stuff.  You can find a diff at
> > http://people.freebsd.org/~imp/tbemd-20100415.diff.
> > 
> 
> fetch http://people.freebsd.org/~imp/tbemd-20100415.diff
> fetch: http://people.freebsd.org/~imp/tbemd-20100415.diff: Not Found
> 

should be there now.

Warner

From owner-freebsd-arch@FreeBSD.ORG  Sat Apr 17 02:11:38 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6ABD5106566C;
	Sat, 17 Apr 2010 02:11:38 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca
	[131.104.91.36])
	by mx1.freebsd.org (Postfix) with ESMTP id 003D08FC21;
	Sat, 17 Apr 2010 02:11:37 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AvsEANO1yEuDaFvK/2dsb2JhbACcAHG+VoJcgjIE
X-IronPort-AV: E=Sophos;i="4.52,224,1270440000"; d="scan'208";a="73100956"
Received: from fraser.cs.uoguelph.ca ([131.104.91.202])
	by esa-annu-pri.mail.uoguelph.ca with ESMTP; 16 Apr 2010 22:11:36 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
	by fraser.cs.uoguelph.ca (Postfix) with ESMTP id DFDAF109C35D;
	Fri, 16 Apr 2010 22:11:36 -0400 (EDT)
X-Virus-Scanned: amavisd-new at fraser.cs.uoguelph.ca
Received: from fraser.cs.uoguelph.ca ([127.0.0.1])
	by localhost (fraser.cs.uoguelph.ca [127.0.0.1]) (amavisd-new,
	port 10024)
	with ESMTP id 4wZFFW9NH+hb; Fri, 16 Apr 2010 22:11:36 -0400 (EDT)
Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.91.102])
	by fraser.cs.uoguelph.ca (Postfix) with ESMTP id 66149109C2DF;
	Fri, 16 Apr 2010 22:11:36 -0400 (EDT)
Received: from localhost (rmacklem@localhost)
	by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id
	o3H2PYi05419; Fri, 16 Apr 2010 22:25:35 -0400 (EDT)
X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing
	-bs
Date: Fri, 16 Apr 2010 22:25:34 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
X-X-Sender: rmacklem@muncher.cs.uoguelph.ca
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20100416181926.F1082@delplex.bde.org>
Message-ID: <Pine.GSO.4.63.1004162207590.3055@muncher.cs.uoguelph.ca>
References: <4BBEE2DD.3090409@freebsd.org>
	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
	<Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
	<20100414135230.U12587@delplex.bde.org>
	<Pine.GSO.4.63.1004142234200.28565@muncher.cs.uoguelph.ca>
	<20100416181926.F1082@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, Andriy Gapon <avg@FreeBSD.org>
Subject: Re: (in)appropriate uses for MAXBSIZE
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 17 Apr 2010 02:11:38 -0000



On Fri, 16 Apr 2010, Bruce Evans wrote:

>
> Do you have benchmarks?  A kernel build (without -j) is a good test.
> Due to include bloat and include nesting bloat, a kernel build opens
> and closes the same small include files hundreds or thousands of times
> each, with O(10^5) includes altogether, so an RPC to read attributes
> on each open costs a lot of latency.  nfs on a LAN does well to take
> only 10% longer than a local file system on a LAN and after disabling
> close/open consistency takes only about half as much longer, by reducing
> the number of RPCs by about a factor of 2.  The difference should be
> even more noticeable on a WAN.  Building with -j reduces the extra
> length by not stalling the whole build waiting for each RPC.  I probably
> needed it to take only 10% longer.
>
Well, I certainly wouldn't call these benchmarks, but here are the #s
I currently see. (The two machines involved are VERY slow by today's
hardware standards. One is an 800MHz PIII and the other is a 4-5 year
old cheap laptop with something like a 1.5GHz Celeron CPU.)

The results for something like the Connectathon test suite's read/write
test can be highly variable, depending upon the hardware setup, etc. (I 
suspect that is at least partially based on when the writes get flushed 
during the test run. One thing that I'd like to do someday is have a
read/shared lock on the buffer cache block while a write-back to a
server is happening. It currently is write/exclusive locked, but doesn't
need to be, after the data has been copied into the buffer.)

For the laptop as client:
without delegations:
./test5: read and write
 	wrote 1048576 byte file 10 times in 7.27 seconds (1442019 bytes/sec)
 	read 1048576 byte file 10 times in 0.4  seconds (238101682 bytes/sec)
 	./test5 ok.
with delegations:
./test5: read and write
 	wrote 1048576 byte file 10 times in 1.64 seconds (6358890 bytes/sec)
 	read 1048576 byte file 10 times in 0.70 seconds (14802158 bytes/sec)
 	./test5 ok.

but for the PIII as the client (why does this case run so much better
when there are no delegations?):
without delegations:
./test5: read and write
 	wrote 1048576 byte file 10 times in 1.75 seconds (5961944 bytes/sec)
 	read 1048576 byte file 10 times in 0.7  seconds (131844940 bytes/sec)
 	./test5 ok.
with delegations:
./test5: read and write
 	wrote 1048576 byte file 10 times in 1.39 seconds (7526450 bytes/sec)
 	read 1048576 byte file 10 times in 0.67 seconds (15540698 bytes/sec)
 	./test5 ok.
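
The elapsed times printed by test5 are easier to read next to the rates.
A minimal sketch, assuming only the file size (1048576 bytes) and the
repeat count (10) shown in the output; the helper name is made up for
illustration. It backs the elapsed time out of a reported rate:

```python
FILE_BYTES = 1048576   # size of the test file, from the test5 output
REPEATS = 10           # each pass is repeated 10 times


def implied_seconds(bytes_per_sec):
    """Elapsed time implied by a reported transfer rate."""
    return FILE_BYTES * REPEATS / bytes_per_sec


# Rates copied from the laptop run without delegations.
for label, rate in (("write", 1442019), ("read", 238101682)):
    print(f"{label}: {implied_seconds(rate):.3f} s")
```

The write rate gives back the printed 7.27 seconds, while the read rate
implies roughly 0.044 seconds rather than 0.4, so the fractional part of
the short read times looks like it is printed without zero padding. That
is an inference from the arithmetic, not something the test output states.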

Now, a kernel build with the PIII as client:
without delegations:
Real	User	System
6859	4635	1158
with delegations:
Real	User	System
6491	4634	1105

As you can see, there isn't that much improvement when delegations are
enabled. Part of the problem here is that, for an 800MHz PIII, the
build is CPU bound ("vmstat 5" shows 0->10% idle during the build), so
the speed of the I/O over NFS won't have a lot of effect on it. This
would be more interesting if the client had a much faster CPU.
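
To put a number on "not that much improvement", here is a small sketch
that turns the Real/User/System figures above into relative changes. The
function and variable names are illustrative only.

```python
def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (without delegations, with delegations), copied from the build results
build_times = {
    "Real": (6859, 6491),
    "User": (4635, 4634),
    "System": (1158, 1105),
}

for name, (without, with_deleg) in build_times.items():
    change = percent_reduction(without, with_deleg)
    print(f"{name}: {change:.1f}% lower with delegations")
```

Real and System time drop by only a few percent and User time is
essentially unchanged, which fits the build being CPU bound on this client.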

Not benchmarks, but might give you some idea. (The 2 machines are
running off the same small $50 home router.)

Someday, I'd like to implement aggressive client side caching to a
disk in the client and do a performance evaluation (including
introducing network latency) and see how it all does. I'm getting
close to where I can do that. Maybe this summer.

Have fun with it, rick


From owner-freebsd-arch@FreeBSD.ORG  Sat Apr 17 22:26:42 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CC542106564A
	for <freebsd-arch@freebsd.org>; Sat, 17 Apr 2010 22:26:42 +0000 (UTC)
	(envelope-from kmatthew.macy@gmail.com)
Received: from mail-qy0-f199.google.com (mail-qy0-f199.google.com
	[209.85.221.199])
	by mx1.freebsd.org (Postfix) with ESMTP id 834028FC0C
	for <freebsd-arch@freebsd.org>; Sat, 17 Apr 2010 22:26:42 +0000 (UTC)
Received: by qyk37 with SMTP id 37so3262518qyk.8
	for <freebsd-arch@freebsd.org>; Sat, 17 Apr 2010 15:26:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:sender:reply-to:received:date
	:x-google-sender-auth:received:message-id:subject:from:to:cc
	:content-type; bh=PjnAcQjqGFZ8pYrjRPHPdnklBVDduDRO0krJin4wgH4=;
	b=xS4AhzhTHxmyhDyubfmOERTBU4MGKqax7KJo0eo5iB17MjDmGlHcYVHNgHL2vezaf1
	tEF+dutCfkG0V9n6XZEVC8WexbtWsFo1/dDBRbvj9lMQIiaAZJ3Q4phKOVie2wODDmDu
	izFZJb3VPoibskEYwq4WDHUwvT3Waf58bIIjY=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:reply-to:date:x-google-sender-auth:message-id
	:subject:from:to:cc:content-type;
	b=he+TKMQFqc/sNYE2jrnS6f0FAu6SpRrbPKZ2veoKNen1BmRpkxpeID+dK5Cl/t6cJi
	YKmvxNevpXjYr7+hRVSaedqZA7I3KGjVhqBYcY4T9o1VvLBf1ZR1CoxnVwxXUMtjRBR1
	UbfIwfKE458kGsuofdpFvUdFrR1D2z9xkJAJ0=
MIME-Version: 1.0
Sender: kmatthew.macy@gmail.com
Received: by 10.229.226.6 with HTTP; Sat, 17 Apr 2010 14:55:30 -0700 (PDT)
Date: Sat, 17 Apr 2010 14:55:30 -0700
X-Google-Sender-Auth: 37b9bca1f4275dab
Received: by 10.229.88.72 with SMTP id z8mr2100122qcl.3.1271541330653; Sat, 17 
	Apr 2010 14:55:30 -0700 (PDT)
Message-ID: <z2s82c4140e1004171455tdaaeae6av58c941f8aba0deec@mail.gmail.com>
From: "K. Macy" <kmacy@freebsd.org>
To: freebsd-arch@freebsd.org
Content-Type: multipart/mixed; boundary=0016364eead454b468048475c972
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: jeff@freebsd.org, alc@cs.rice.edu
Subject: Moving forward with vm page lock
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: kmacy@freebsd.org
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 17 Apr 2010 22:26:42 -0000

--0016364eead454b468048475c972
Content-Type: text/plain; charset=ISO-8859-1

Last February Jeff Roberson first shared his vm page lock patch with me.
The general premise is that modification of a vm_page_t is no longer
protected by the global "vm page queue mutex" but is instead protected
by an entry in an array of locks which each vm_page_t is hashed to
by its physical address. This complicates pmap somewhat because it increases
the number of cases where retry logic is required if we need to drop the
pmap lock in order to first acquire the page lock (see pa_tryrelock).
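
A minimal sketch of the idea, not the code from the patch: the array
size, the shift, and every name below are assumptions chosen for
illustration, and Python's threading.Lock stands in for kernel mutexes.
It shows a physical address being hashed to one lock in a fixed array,
and a relock helper that reports when the pmap lock had to be dropped so
the caller knows to revalidate and retry.

```python
import threading

PAGE_SHIFT = 12      # assumed hashing granularity (4 KiB pages)
PA_LOCK_COUNT = 64   # hypothetical number of locks in the array

pa_locks = [threading.Lock() for _ in range(PA_LOCK_COUNT)]


def pa_lock_index(pa):
    """Hash a physical address to a slot in the lock array."""
    return (pa >> PAGE_SHIFT) % PA_LOCK_COUNT


def relock_page(pmap_lock, pa, held):
    """Switch the held page lock to the one covering `pa`.

    The caller holds `pmap_lock`. `held` is the slot currently locked,
    or None. Returns (slot, must_retry); must_retry is True when the
    pmap lock was dropped, so pmap state may have changed meanwhile.
    """
    slot = pa_lock_index(pa)
    if held == slot:
        return slot, False            # already covered by the held lock
    if held is not None:
        pa_locks[held].release()
    if pa_locks[slot].acquire(blocking=False):
        return slot, False            # uncontended: no retry needed
    # Contended: drop the pmap lock, take the page lock first, then
    # retake the pmap lock and tell the caller to retry.
    pmap_lock.release()
    pa_locks[slot].acquire()
    pmap_lock.acquire()
    return slot, True


pmap_lock = threading.Lock()
with pmap_lock:
    held, must_retry = relock_page(pmap_lock, 0x2000, None)
    print(held, must_retry)
    pa_locks[held].release()
```

Because many pages share one slot, two unrelated pages can still contend
for the same lock; the array size trades memory against that contention.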


I've continued refining Jeff's initial page lock patch by resolving
lock ordering issues in vm_pageout, eliminating pv_lock, and eliminating the
need for pmap_collect on amd64. Rather than exposing ourselves to a race
condition by dropping the locks in pmap_collect, I pre-allocate any
necessary pv_entrys before changing any pmap state. This complicated calls
to demote slightly, but that can probably be simplified later. Currently
only amd64 supports this. Other platforms map vm_page_lock(m) to the
vm page queue mutex.


The current version of the patch can be found at:
http://people.freebsd.org/~kmacy/diffs/head_page_lock.diff

I've been refining it in a subversion branch at:
svn://svn.freebsd.org/base/user/kmacy/head_page_lock



On my workloads at a CDN startup I've seen as much as a 50% increase in
lighttpd throughput (3.2Gbps -> 4.8Gbps). At Jeff's request I've
done some basic measurements with buildkernel to demonstrate that,
at least on my hardware, a dual 4-core
"CPU: Intel(R) Xeon(R) CPU L5420  @ 2.50GHz (2500.01-MHz K8-class CPU)"
with 64GB of RAM, there is no performance regression.

I did 2 warm up runs followed by 10 samples of
"time make -j16 buildkernel KERNCONF=GENERIC -DNO_MODULES
-DNO_KERNELCONFIG -DNO_KERNELDEPEND" on a ZFS file system on a twa
based raid device for both with page_lock and without. Wall clock time
is consistently just under a second lower (faster build time) for the
page_lock kernel. The bulk of the time is actually spent in user so it
is more meaningful to compare system times. I attached the logs of the
runs and the two files I fed to ministat.



ministat -c 95 -w 72 base page_lock
x base
+ page_lock
+------------------------------------------------------------------------+
|    +  ++                                                               |
|+   ++ +++  +                                            x    xxxx xxxxx|
|   |__AM__|                                                   |___AM__| |
+------------------------------------------------------------------------+
  N           Min           Max        Median           Avg        Stddev
x  10         47.35         49.09         48.64        48.417    0.53416706
+  10         40.04         41.52         40.98        40.844    0.41494846
Difference at 95.0% confidence
      -7.573 +/- 0.449396
      -15.6412% +/- 0.928179%
      (Student's t, pooled s = 0.478287)
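
For readers who want to check the summary lines, the standard pooled
two-sample Student's t formula reproduces them from the N, Avg and Stddev
columns. This is a sketch under that assumption, not ministat's source;
the critical value 2.101 is the usual two-sided 95% table entry for 18
degrees of freedom, and the function name is made up.

```python
import math

T_CRIT_95_DF18 = 2.101  # two-sided 95% Student's t, 10 + 10 - 2 = 18 d.o.f.


def pooled_difference(n1, mean1, sd1, n2, mean2, sd2, t_crit):
    """Return (difference, confidence half-width, pooled std deviation)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    diff = mean2 - mean1
    half_width = t_crit * pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    return diff, half_width, pooled_sd


# base (x) first, page_lock (+) second, values from the table above
diff, half_width, pooled_sd = pooled_difference(
    10, 48.417, 0.53416706, 10, 40.844, 0.41494846, T_CRIT_95_DF18
)
print(f"difference {diff:.3f} +/- {half_width:.6f}")
print(f"relative   {100 * diff / 48.417:.4f}%")
print(f"pooled s   {pooled_sd:.6f}")
```

The interval around the difference stays well away from zero, which is
why ministat reports the change as significant at 95% confidence.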




ramsan2.lab1# head -2 prof.out
                    debug.lock.prof.stats:
   max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
ramsan2.lab1# sort -nrk 4 prof.out | head
  1592    243918     1768980    12026988      287680      6     41        0   112005 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1065 (sleep mutex:vm page queue free mutex)
  3967    750285     1678130     9447247      276594      6     34        0   104952 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1388 (sleep mutex:vm page queue mutex)
 18234    163969     5417360     9213400      282459     19     32        0     6548 /usr/home/kmacy/head_page_lock/sys/amd64/amd64/pmap.c:3372 (sleep mutex:page lock)
 173094    134890    18226507     8195920       49757    366    164        0      625 /usr/home/kmacy/head_page_lock/sys/kern/vfs_subr.c:2091 (lockmgr:zfs)
   254    167136       38222     5153728        2736     13   1883        0     2333 /usr/home/kmacy/head_page_lock/sys/amd64/amd64/pmap.c:550 (sleep mutex:page lock)
  1160    104774     1624269     4380034      279240      5     15        0   107998 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1508 (sleep mutex:vm page queue free mutex)
  1107     80128     1581048     3377896      274341      5     12        0   100130 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1300 (sleep mutex:vm page queue mutex)
 104802    284128    14712290     2970729      259423     56     11        0     1900 /usr/home/kmacy/head_page_lock/sys/vm/vm_object.c:721 (sleep mutex:page lock)
 84339    158037     1455568     2875384       85147     17     33        0      292 /usr/home/kmacy/head_page_lock/sys/kern/vfs_cache.c:390 (rw:Name Cache)
     9    995901         236     2468160          46      5  53655        0       45 /usr/home/kmacy/head_page_lock/sys/kern/sched_ule.c:2552 (spin mutex:sched lock 4)


Both Giovanni Trematerra and I have run stress2 on it for extended
periods with no problems in evidence.

I'd like to see this go into HEAD by the end of this month. Once
this change has proven to be stable by a wider audience I will
extend it to i386.

Thanks,
Kip

--0016364eead454b468048475c972--

From owner-freebsd-arch@FreeBSD.ORG  Sat Apr 17 22:49:40 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E8E8B1065672;
	Sat, 17 Apr 2010 22:49:40 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id A513D8FC26;
	Sat, 17 Apr 2010 22:49:40 +0000 (UTC)
Received: from [127.0.0.1] (pooker.samsco.org [168.103.85.57])
	(authenticated bits=0)
	by pooker.samsco.org (8.14.3/8.14.3) with ESMTP id o3HMnaBB041181;
	Sat, 17 Apr 2010 16:49:37 -0600 (MDT)
	(envelope-from scottl@samsco.org)
Mime-Version: 1.0 (Apple Message framework v1078)
Content-Type: text/plain; charset=us-ascii
From: Scott Long <scottl@samsco.org>
In-Reply-To: <29917.1271406183@critter.freebsd.dk>
Date: Sat, 17 Apr 2010 16:49:36 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <F335207A-4AE3-4993-8CC7-16CCEE425BC4@samsco.org>
References: <29917.1271406183@critter.freebsd.dk>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
X-Mailer: Apple Mail (2.1078)
X-Spam-Status: No, score=-1.0 required=3.8 tests=ALL_TRUSTED, T_RP_MATCHES_RCVD
	autolearn=unavailable version=3.3.0
X-Spam-Checker-Version: SpamAssassin 3.3.0 (2010-01-18) on pooker.samsco.org
Cc: Attilio Rao <attilio@freebsd.org>,
	Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	freebsd-arch@freebsd.org
Subject: Re: [PATCH] Syncer rewriting 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 17 Apr 2010 22:49:41 -0000

On Apr 16, 2010, at 2:23 AM, Poul-Henning Kamp wrote:
>
>
>> - The standard syncer may be further improved getting rid of the
>> bufobj. It should actually handle a list of vnodes rather than a list
>> of bufobj. However similar optimizations may be done after the patch
>> is ready to enter the tree.
>
> That would be the wrong direction: we need the bufobj because for instance
> a RAID5 geom module does not have a vnode for the parity data.
>
> If you force the syncer to only work on vnodes, then we need a parallel
> mechanism for non-filesystem disk users.

It's been 5-6 (7?) years since you invented the bufobj, but I still
haven't seen anything in GEOM use it as you suggest.  You used to have a
saying about premature optimization...  I'd like to see Attilio's work
move forward despite this.

Scott