From owner-freebsd-arch@FreeBSD.ORG Sun Apr 11 00:08:18 2010
Date: Sat, 10 Apr 2010 17:08:16 -0700
From: Garrett Cooper
To: arch@freebsd.org
Cc: portmgr@freebsd.org
Subject: Re: [RFC] Remove @owner and @user from package list

On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper wrote:
> On Sat, Apr 10, 2010 at 3:52 PM, Garrett Cooper wrote:
>> Hi again arch,
>>    When doing some research, it appears that while functionality in
>> theory exists for @owner and @user in the package list, it isn't
>> actually used in the pkg_install code at all, adding unnecessary bloat
>> to package lists;
>>    FWIW this functionality (just like @exec and @unexec) can be
>> implemented via pkg-install or more reliably via an mtree file.
>>    Thoughts?
>
> Nevermind; I was misreading the code.

Doing some more digging, there are a handful of ports that I don't
have installed that implement this functionality:

@mode ...

$ grep -Ilr @mode /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
databases/phpmyadmin/pkg-plist-chunk
databases/phpmyadmin211/pkg-plist-chunk
devel/libtai/pkg-plist
dns/poweradmin/pkg-plist-chunk
games/columns/pkg-plist
games/falconseye/pkg-plist
games/glasteroids/pkg-plist
games/nethack32/pkg-plist
games/nethack33/pkg-plist
games/nethack34/pkg-plist
games/omega/pkg-plist
games/sol/pkg-plist
games/wanderer/pkg-plist
games/xmines/pkg-plist
games/zangband/pkg-plist
irc/inspircd/pkg-plist
japanese/nethack32/pkg-plist
japanese/nethack34/pkg-plist
japanese/zangband/pkg-plist
net/phpldapadmin/pkg-plist-chunk
net/phpldapadmin098/pkg-plist-chunk
security/cyrus-sasl2-saslauthd/pkg-plist
sysutils/clockspeed/pkg-plist
www/ssserver/pkg-plist

@owner ...

$ grep -Ilr @owner /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
games/omega/pkg-plist
games/sol/pkg-plist
games/zangband/pkg-plist
japanese/zangband/pkg-plist
net/mediatomb/pkg-plist
news/cnews/pkg-plist
news/ifmail/pkg-plist

Also, I'm not positive, but I think that none of the released
packages use this either -- so ultimately this functionality could be
removed without any impact to folks unless there's a 3rd party that
has implemented this outside of FreeBSD. This functionality could be
delivered in mtree files, could be fixed with the upstream
installation Makefiles, and IMO should not be part of the package
list, as it only obscures precedence, ownership, and permissions, and
there's a great deal of overlap involved in package creation and
installation; tar applies permissions bits and ownership, mtree is
called next to fix permissions and ownership, if the mtree file
exists, then the @owner and @mode stuff implements a hammer solution
over a series of files -- note that chmod -R and chown -R are called
with @owner and @mode :( :

    if (Mode)
        if (vsystem("cd %s && /bin/chmod -R %s %s", cd_to, Mode, arg))
            warnx("couldn't change modes of '%s' to '%s'", arg, Mode);
    if (Owner && Group) {
        if (vsystem("cd %s && /usr/sbin/chown -R %s:%s %s", cd_to,
            Owner, Group, arg))
            warnx("couldn't change owner/group of '%s' to '%s:%s'",
                  arg, Owner, Group);
        return;
    }
    if (Owner) {
        if (vsystem("cd %s && /usr/sbin/chown -R %s %s", cd_to, Owner, arg))
            warnx("couldn't change owner of '%s' to '%s'", arg, Owner);
        return;
    } else if (Group)
        if (vsystem("cd %s && /usr/bin/chgrp -R %s %s", cd_to, Group, arg))
            warnx("couldn't change group of '%s' to '%s'", arg, Group);

Thoughts?
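
For comparison with the recursive chmod -R / chown -R calls above, here is a
minimal sketch of applying ownership and mode to exactly one path with
chown(2) and chmod(2). The helper name apply_one is hypothetical and is not
part of pkg_install; it only illustrates what a non-recursive variant might
look like, under the assumption that the caller already has numeric IDs.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from pkg_install): set owner/group and mode on a
 * single path, without recursing into directories.
 * Pass (uid_t)-1 or (gid_t)-1 to leave that ID unchanged, as chown(2) allows.
 * Returns 0 on success, -1 on the first failure.
 */
int
apply_one(const char *path, mode_t mode, uid_t uid, gid_t gid)
{
	assert(path != NULL);

	/* chown(2) may clear set-id bits, so set ownership before the mode. */
	if ((uid != (uid_t)-1 || gid != (gid_t)-1) &&
	    chown(path, uid, gid) == -1) {
		perror("chown");
		return (-1);
	}
	if (chmod(path, mode) == -1) {
		perror("chmod");
		return (-1);
	}
	return (0);
}
```

Unlike the vsystem() calls, this touches only the named path and spawns no
shell, so one packing-list entry cannot change the files below it.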
-Garrett

From owner-freebsd-arch@FreeBSD.ORG Sun Apr 11 00:11:12 2010
Date: Sat, 10 Apr 2010 17:11:11 -0700
From: Garrett Cooper
To: arch@freebsd.org
Cc: portmgr@freebsd.org
Subject: Re: [RFC] Remove @owner and @user from package list

On Sat, Apr 10, 2010 at 5:08 PM, Garrett Cooper wrote:
> On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper wrote:
>> On Sat, Apr 10, 2010 at 3:52 PM, Garrett Cooper wrote:
>>> Hi again arch,
>>>    When doing some research, it appears that while functionality in
>>> theory exists for @owner and @user in the package list, it isn't
>>> actually used in the pkg_install code at all, adding unnecessary bloat
>>> to package lists;
>>>    FWIW this functionality (just like @exec and @unexec) can be
>>> implemented via pkg-install or more reliably via an mtree file.
>>>    Thoughts?
>>
>> Nevermind; I was misreading the code.
>
>    Doing some more digging, there are a handful of ports that I don't
> have installed that implement this functionality:
>
> @mode ...
>
> $ grep -Ilr @mode /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
> databases/phpmyadmin/pkg-plist-chunk
> databases/phpmyadmin211/pkg-plist-chunk
> devel/libtai/pkg-plist
> dns/poweradmin/pkg-plist-chunk
> games/columns/pkg-plist
> games/falconseye/pkg-plist
> games/glasteroids/pkg-plist
> games/nethack32/pkg-plist
> games/nethack33/pkg-plist
> games/nethack34/pkg-plist
> games/omega/pkg-plist
> games/sol/pkg-plist
> games/wanderer/pkg-plist
> games/xmines/pkg-plist
> games/zangband/pkg-plist
> irc/inspircd/pkg-plist
> japanese/nethack32/pkg-plist
> japanese/nethack34/pkg-plist
> japanese/zangband/pkg-plist
> net/phpldapadmin/pkg-plist-chunk
> net/phpldapadmin098/pkg-plist-chunk
> security/cyrus-sasl2-saslauthd/pkg-plist
> sysutils/clockspeed/pkg-plist
> www/ssserver/pkg-plist
>
> @owner ...
>
> $ grep -Ilr @owner /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
> games/omega/pkg-plist
> games/sol/pkg-plist
> games/zangband/pkg-plist
> japanese/zangband/pkg-plist
> net/mediatomb/pkg-plist
> news/cnews/pkg-plist
> news/ifmail/pkg-plist
>
>    Also, I'm not positive, but I think that none of the released
> packages use this either -- so ultimately this functionality could be
> removed without any impact to folks unless there's a 3rd party that
> has implemented this outside of FreeBSD. This functionality could be
> delivered in mtree files, could be fixed with the upstream
> installation Makefiles, and IMO should not be part of the package
> list, as it only obscures precedence, ownership, and permissions, and
> there's a great deal of overlap involved in package creation and
> installation; tar applies permissions bits and ownership, mtree is
> called next to fix permissions and ownership, if the mtree file
> exists, then the @owner and @mode stuff implements a hammer solution
> over a series of files -- note that chmod -R and chown -R are called
> with @owner and @mode :( :
>
>    if (Mode)
>        if (vsystem("cd %s && /bin/chmod -R %s %s", cd_to, Mode, arg))
>            warnx("couldn't change modes of '%s' to '%s'", arg, Mode);
>    if (Owner && Group) {
>        if (vsystem("cd %s && /usr/sbin/chown -R %s:%s %s", cd_to,
>            Owner, Group, arg))
>            warnx("couldn't change owner/group of '%s' to '%s:%s'",
>                  arg, Owner, Group);
>        return;
>    }
>    if (Owner) {
>        if (vsystem("cd %s && /usr/sbin/chown -R %s %s", cd_to, Owner, arg))
>            warnx("couldn't change owner of '%s' to '%s'", arg, Owner);
>        return;
>    } else if (Group)
>        if (vsystem("cd %s && /usr/bin/chgrp -R %s %s", cd_to, Group, arg))
>            warnx("couldn't change group of '%s' to '%s'", arg, Group);

Sorry -- forgot @group...

$ grep -Ilr @group /scratch/freebsd/ports/ | sed 's,/scratch/freebsd/ports/,,g'
biology/p5-bioperl/files/patch-Bio-Root-Build.pm
databases/phpmyadmin/pkg-plist-chunk
databases/phpmyadmin211/pkg-plist-chunk
games/falconseye/pkg-plist
games/omega/pkg-plist
games/sol/pkg-plist
games/wanderer/pkg-plist
games/zangband/pkg-plist
irc/inspircd/pkg-plist
japanese/gawk/files/patch-sec1
japanese/zangband/pkg-plist
lang/tcc/files/texi2pod.pl
mail/sendmail/pkg-plist
mail/vpopmail/pkg-install
mail/vpopmail-devel/pkg-install
math/freemat/pkg-plist
net/freebsd-uucp/pkg-plist
net/mediatomb/pkg-plist
net/phpldapadmin/pkg-plist-chunk
net/phpldapadmin098/pkg-plist-chunk
news/cnews/pkg-plist
news/ifmail/pkg-plist
security/sfs/pkg-plist

Thanks,
-Garrett

From owner-freebsd-arch@FreeBSD.ORG Sun Apr 11 00:32:21 2010
Message-ID: <4BC1188F.3060001@freebsd.org>
Date: Sat, 10 Apr 2010 17:32:15 -0700
From: Tim Kientzle
To: Garrett Cooper
Cc: FreeBSD Arch, portmgr@freebsd.org
Subject: Re: [RFC] Remove @owner and @user from package list

Garrett Cooper wrote:
> On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper wrote:
>>> When doing some research, it appears that while functionality in
>>> theory exists for @owner and @user in the package list, it isn't
>>> actually used in the pkg_install code at all, adding unnecessary bloat
>>> to package lists;
>
> Doing some more digging, there are a handful of ports that I don't
> have installed that implement this functionality:
> @mode ...
> @owner ...
> @group ...

I would certainly shed no tears if these went away.

OTOH, I can see a use for them in pkg_create, to
set the mode/owner/group in the resulting tarball.
This would be good when building a package from a
port while running as non-root user.

Of course, we could also do this from the mtree
description at either package creation time (reading
the mtree description and using it to set file properties
in the tarball) or package install time (using the
mtree description to set the final file properties
on disk).
Tim

From owner-freebsd-arch@FreeBSD.ORG Sun Apr 11 02:56:07 2010
Date: Sun, 11 Apr 2010 12:56:02 +1000 (EST)
From: Bruce Evans
To: Andriy Gapon
Cc: arch@freebsd.org, Rick Macklem
In-Reply-To: <4BBF3C5A.7040009@freebsd.org>
Message-ID: <20100411114405.L10562@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org>
Subject: Re: (in)appropriate uses for MAXBSIZE

On Fri, 9 Apr 2010, Andriy Gapon wrote:

> on 09/04/2010 16:53 Rick Macklem said the following:
>>
>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
>>
>>> Nowadays several questions could be asked about MAXBSIZE.
>>> - Will we have to consider increasing MAXBSIZE? Provided ever
>>> increasing media sizes, typical filesystem sizes, typical file sizes
>>> (all that multimedia) and even media sector sizes.
>>
>> I would certainly like to see a larger MAXBSIZE for NFS. Solaris10
>> currently uses 128K as a default I/O size and allows up to 1Mb.

Er, the maximum size of buffers in the buffer cache is especially
irrelevant for nfs. It is almost irrelevant for physical disks because
clustering normally increases the bulk transfer size to MAXPHYS.
Clustering takes a lot of CPU but doesn't affect the transfer rate much
unless there is not enough CPU. It is even less relevant for network
i/o since there is a sort of reverse-clustering -- the buffers get split
up into tiny packets (normally 1500 bytes less some header bytes) at
the hardware level. Again a lot of CPU is involved doing the (reverse)
clustering, and again this doesn't affect the transfer rate much.
However, 1500 is so tiny that the reverse-clustering ratio of the i/o
size relative to MAXBSIZE (65536/1500) is much smaller than the normal
clustering ratio relative to MAXBSIZE (131072/65536) and the extra CPU
is more significant for network i/o. (These aren't the actual normal
ratios, but the limits of the attainable ones by varying only the
block sizes under the file system's control.) However2, increasing the
network i/o size can make little difference to this problem -- it can
only increase the already-too-large reverse-clustering ratio, while
possibly reducing other reverse-clustering ratios (the others are for
assembling the nfs buffers from local file system buffers; the local
file system buffers are normally disassembled from pbuf size (MAXPHYS)
to file system size (normally 16K); then conversion to nfs buffers
involves either a sort of clustering or reverse clustering depending
on the relative sizes of the buffers). There are more gains to be
had from increasing the network i/o size. tcp allows larger buffers
at intermediate levels but they still get split up at the hardware
level. Only some networks allow jumbo frames.
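
To put rough numbers on the two ratios in the paragraph above, here is a
small sketch. It assumes MAXBSIZE is 64K (65536 bytes), MAXPHYS is 128K
(131072 bytes) and a 1500-byte packet with header overhead ignored; the
function name pieces_needed is made up for illustration.

```c
#include <assert.h>

/* Sizes assumed for illustration; header overhead per packet is ignored. */
enum {
	MTU_BYTES = 1500,	/* typical Ethernet payload size */
	MAXBSIZE_BYTES = 65536,	/* 64K buffer cache limit */
	MAXPHYS_BYTES = 131072	/* 128K physical transfer limit */
};

/* Number of pieces of size `piece` needed to carry `total` bytes, rounded up. */
unsigned
pieces_needed(unsigned total, unsigned piece)
{
	assert(piece > 0);
	return ((total + piece - 1) / piece);
}
```

With these assumptions, pieces_needed(MAXBSIZE_BYTES, MTU_BYTES) is 44
packets per 64K buffer, while pieces_needed(MAXPHYS_BYTES, MAXBSIZE_BYTES)
is only 2 buffers per clustered disk transfer.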

>> Using
>> larger I/O sizes for NFS is a simpler way to increase bulk data transfer
>> rate than more buffers and more aggressive read-ahead/write-behind.

I'm not sure about that. Read-ahead and write-behind is already very
aggressive but seems to be not working right. I use some patches by
Bjorn Groenwald (?) which make it work better for the old nfs implementation
(I haven't tried the experimental one). The problems seem to be mainly
timing ones. vfs clustering makes the buffer sizes almost irrelevant for
physical disks, but there are latency problems for the network i/o.
The latency problems seem to be larger for reads than for writes. I
get best results by using the same size for network buffers as for local
buffers (16K). This avoids 1 layer of buffer size changing (see above)
and using 16K-buffers avoids buffer kva fragmentation (see below). I
saw little difference from changing the user buffer size, except small
buffers tend to work better and smallest (512-byte) buffers may have
actually worked best, I think by reducing latencies.

> I have lightly tested this under qemu.
> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
> I removed size > MAXBSIZE check in getblk (see a parallel thread "panic: getblk:
> size(%d) > MAXBSIZE(%d)").

Did you change the other known things that depend on this? There is the
b_pages limit of MAXPHYS bytes which should be checked for in another
way, and the soft limits for hibufspace and lobufspace which only matter
under load conditions.

> And I bumped MAXPHYS to 1MB.
>
> Some results.
> I got no panics, data was read correctly and system remained stable, which is good.
> But I observed reading process (dd bs=1m on avgfs) spending a lot of time sleeping
> on needsbuffer in getnewbuf. needsbuffer value was VFS_BIO_NEED_ANY.
> Apparently there was some shortage of free buffers.
> Perhaps some limits/counts were incorrectly auto-tuned.

This is not surprising, since even 64K is 4 times too large to work well.
Buffer sizes larger than BKVASIZE (16K) always cause fragmentation of
buffer kva. Recovering from fragmentation always takes a lot of CPU,
and if you are unlucky it will also take a lot of real time (stalling
waiting for free buffer kva). Buffer sizes larger than BKVASIZE also
reduce the number of available buffers significantly below the number
of buffers configured. This mainly takes a lot of CPU to reconstitute
buffers.

BKVASIZE being less than MAXBSIZE is a hack to reduce the amount of kva
statically allocated for buffers for systems that cannot support enough
kva to work right (mainly i386's). It only works well when it is not
actually used (when all buffers have size <= BKVASIZE = 16K, as would
be enforced by reducing MAXBSIZE to BKVASIZE). This hack and the
complications to support it are bogus on systems that support enough
kva to work right.

nfs buffers larger than 16K would exceed BKVASIZE. This may have been
why nfs buffer sizes of size 32K gave negative benefits.

Bruce

From owner-freebsd-arch@FreeBSD.ORG Sun Apr 11 13:48:56 2010
Date: Sun, 11 Apr 2010 10:02:39 -0400 (EDT)
From: Rick Macklem
To: Bruce Evans
Cc: arch@freebsd.org, Andriy Gapon
In-Reply-To: <20100411114405.L10562@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
Subject: Re: (in)appropriate uses for MAXBSIZE

On Sun, 11 Apr 2010, Bruce Evans wrote:

>
> Er, the maximum size of buffers in the buffer cache is especially
> irrelevant for nfs. It is almost irrelevant for physical disks because
> clustering normally increases the bulk transfer size to MAXPHYS.
> Clustering takes a lot of CPU but doesn't affect the transfer rate much
> unless there is not enough CPU. It is even less relevant for network
> i/o since there is a sort of reverse-clustering -- the buffers get split
> up into tiny packets (normally 1500 bytes less some header bytes) at
> the hardware level. Again a lot of CPU is involved doing the (reverse)
> clustering, and again this doesn't affect the transfer rate much.
> However, 1500 is so tiny that the reverse-clustering ratio of the i/o
> size relative to MAXBSIZE (65536/1500) is much smaller than the normal
> clustering ratio relative to MAXBSIZE (131072/65536) and the extra CPU
> is more significant for network i/o. (These aren't the actual normal
> ratios, but the limits of the attainable ones by varying only the
> block sizes under the file system's control.) However2, increasing the
> network i/o size can make little difference to this problem -- it can
> only increase the already-too-large reverse-clustering ratio, while
> possibly reducing other reverse-clustering ratios (the others are for
> assembling the nfs buffers from local file system buffers; the local
> file system buffers are normally disassembled from pbuf size (MAXPHYS)
> to file system size (normally 16K); then conversion to nfs buffers
> involves either a sort of clustering or reverse clustering depending
> on the relative sizes of the buffers). There are more gains to be
> had from increasing the network i/o size. tcp allows larger buffers
> at intermediate levels but they still get split up at the hardware
> level. Only some networks allow jumbo frames.
>

I've done a simple experiment on Mac OS X 10, where I tried different
sizes for the read and write RPCs plus different amounts of
read-ahead/write-behind and found the I/O rate increased linearly,
up to the max allowed by Mac OS X (MAXBSIZE == 128K) without
read-ahead/write-behind. Using read-ahead/write-behind the performance
didn't increase at all, until the RPC read/write size was reduced.
(Solaris10 is using 256K by default and allowing up to 1Mb for read/write
RPC size now, so they seem to think that large values work well?)
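
A rough way to see why the RPC size matters more as latency grows is the
bandwidth-delay product: the amount of data that has to be in flight to keep
a link busy. The sketch below is only an illustration; the function names
and the link figures used afterwards (100 Mbit/s, 50 ms round trip) are
assumptions, not measurements from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes for a link speed and round-trip time. */
uint64_t
bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
	return (bits_per_sec * rtt_ms / 8 / 1000);
}

/* How many RPCs of rpc_bytes must be outstanding to cover bdp bytes. */
uint64_t
rpcs_to_fill(uint64_t bdp, uint64_t rpc_bytes)
{
	assert(rpc_bytes > 0);
	return ((bdp + rpc_bytes - 1) / rpc_bytes);
}
```

For the assumed link, bdp_bytes(100000000, 50) is 625000 bytes. Covering
that takes 20 outstanding 32K RPCs but a single 1 Mbyte RPC, which is one
way to read the preference for large RPCs over more read-ahead.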

When you start using a WAN environment, large read/write RPCs really
help, from what I've seen, since that helps fill the TCP pipe
(bits * latency between client<->server).

I care much more about WAN performance than LAN performance w.r.t. this.

I am not sure what you were referring to w.r.t. clustering, but if you
meant that the NFS client can easily do an RPC with a larger I/O size
than the size of the buffer handed it by the buffer cache, I'd like to
hear how that's done? (If not, then a bigger buffer from the buffer
cache is what I need to do a larger I/O size in the RPC.)

Once NFS hands the TCP socket the large RPC, I figure it's up to the
networking to get it on/off the wire, etc. If you are arguing that
that is where there can be major gains, I'll believe you, but it's not
my area of expertise and there's lots of other FreeBSD folks to work
on that. I do believe that being able to do a large read/write RPC is
going to help performance, particularly in the WAN case.

>>> Using
>>> larger I/O sizes for NFS is a simpler way to increase bulk data transfer
>>> rate than more buffers and more aggressive read-ahead/write-behind.
>
> I'm not sure about that. Read-ahead and write-behind is already very
> aggressive but seems to be not working right. I use some patches by
> Bjorn Groenwald (?) which make it work better for the old nfs implementation
> (I haven't tried the experimental one). The problems seem to be mainly
> timing ones. vfs clustering makes the buffer sizes almost irrelevant for
> physical disks, but there are latency problems for the network i/o.
> The latency problems seem to be larger for reads than for writes. I
> get best results by using the same size for network buffers as for local
> buffers (16K). This avoids 1 layer of buffer size changing (see above)
> and using 16K-buffers avoids buffer kva fragmentation (see below). I
> saw little difference from changing the user buffer size, except small
> buffers tend to work better and smallest (512-byte) buffers may have
> actually worked best, I think by reducing latencies.
>

See above. There are always going to be cases like use over a WAN where
latency is going to be large. That's when large I/O RPCs will win. I suspect
you are focusing on the high bandwidth/low latency LAN, which is not where
I believe that large I/O sized RPCs will make much difference.

Hope this helps clarify what I am looking for, rick

From owner-freebsd-arch@FreeBSD.ORG Sun Apr 11 14:09:31 2010
Date: Sun, 11 Apr 2010 10:23:14 -0400 (EDT)
From: Rick Macklem
To: Bruce Evans
Cc: arch@freebsd.org, Andriy Gapon
In-Reply-To: <20100411114405.L10562@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
Subject: Re: (in)appropriate uses for MAXBSIZE

On Sun, 11 Apr 2010, Bruce Evans wrote:

>
> Er, the maximum size of buffers in the buffer cache is especially
> irrelevant for nfs. It is almost irrelevant for physical disks because
> clustering normally increases the bulk transfer size to MAXPHYS.
> Clustering takes a lot of CPU but doesn't affect the transfer rate much
> unless there is not enough CPU. It is even less relevant for network
> i/o since there is a sort of reverse-clustering -- the buffers get split
> up into tiny packets (normally 1500 bytes less some header bytes) at
> the hardware level. Again a lot of CPU is involved doing the (reverse)
> clustering, and again this doesn't affect the transfer rate much.
> However, 1500 is so tiny that the reverse-clustering ratio of the i/o
> size relative to MAXBSIZE (65536/1500) is much smaller than the normal
> clustering ratio relative to MAXBSIZE (131072/65536) and the extra CPU
> is more significant for network i/o. (These aren't the actual normal
> ratios, but the limits of the attainable ones by varying only the
> block sizes under the file system's control.) However2, increasing the
> network i/o size can make little difference to this problem -- it can
> only increase the already-too-large reverse-clustering ratio, while
> possibly reducing other reverse-clustering ratios (the others are for
> assembling the nfs buffers from local file system buffers; the local
> file system buffers are normally disassembled from pbuf size (MAXPHYS)
> to file system size (normally 16K); then conversion to nfs buffers
> involves either a sort of clustering or reverse clustering depending
> on the relative sizes of the buffers). There are more gains to be
> had from increasing the network i/o size. tcp allows larger buffers
> at intermediate levels but they still get split up at the hardware
> level. Only some networks allow jumbo frames.
>

Oh, and if the 1Mbyte write rpc can somehow hand the data portion (the
1Mbyte of data) to sosend() as a single 1Mbyte mbuf cluster referencing
(not copied from) the 1Mbyte buffer cache block, so the data never needs
to be copied until it gets to the network device driver, that would be
great. However, this goes way beyond the increase of MAXBSIZE that I
think I need so that the client can actually do a 1Mbyte write RPC.
Have a good weekend (what's left of it), rick From owner-freebsd-arch@FreeBSD.ORG Sun Apr 11 14:20:19 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 59711106564A; Sun, 11 Apr 2010 14:20:19 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 084B98FC14; Sun, 11 Apr 2010 14:20:18 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3BEHVRN091386; Sun, 11 Apr 2010 08:17:31 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Sun, 11 Apr 2010 08:17:39 -0600 (MDT) Message-Id: <20100411.081739.974702306123419358.imp@bsdimp.com> To: kientzle@FreeBSD.org From: "M. Warner Losh" In-Reply-To: <4BC1188F.3060001@freebsd.org> References: <4BC1188F.3060001@freebsd.org> X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: yanefbsd@gmail.com, portmgr@FreeBSD.org, arch@FreeBSD.org Subject: Re: [RFC] Remove @owner and @user from package list X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 11 Apr 2010 14:20:19 -0000 In message: <4BC1188F.3060001@freebsd.org> Tim Kientzle writes: : Garrett Cooper wrote: : > On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper : > wrote: : >>> When doing some research, it appears that while functionality in : >>> theory exists for @owner and @user in the package list, it isn't : >>> actually used in the pkg_install code at all, adding unnecessary bloat : >>> to package lists; : > Doing some more digging, there are a handful of ports that I don't : > have installed that implement this functionality: : > 
@mode ... : > @owner ... : > @group ... : : I would certainly shed no tears if these went away. : : OTOH, I can see a use for them in pkg_create, to : set the mode/owner/group in the resulting tarball. : This would be good when building a package from a : port while running as non-root user. : : Of course, we could also do this from the mtree : description at either package creation time (reading : the mtree description and using it to set file properties : in the tarball) or package install time (using the : mtree description to set the final file properties : on disk). On the creation side, something like the above would be useful. makefs supports storing a tree's metadata in an .mtree file. We could obviate the need for those keywords if tar could be made to do the same thing :) I'm working on an unpriv'd installworld (where the meta data would go to the .mtree file, and the files would go into a tree owned as the user building). Mostly it is a port from NetBSD, but having tar that would respect this stuff would be great. Bonus points if the tag in mtree could be used as a file selector (either additively or subtractively). 
Warner From owner-freebsd-arch@FreeBSD.ORG Mon Apr 12 11:06:55 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9A0431065674 for ; Mon, 12 Apr 2010 11:06:55 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 6DB148FC0C for ; Mon, 12 Apr 2010 11:06:55 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id o3CB6tad042344 for ; Mon, 12 Apr 2010 11:06:55 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.4/8.14.4/Submit) id o3CB6s8O042342 for freebsd-arch@FreeBSD.org; Mon, 12 Apr 2010 11:06:54 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 12 Apr 2010 11:06:54 GMT Message-Id: <201004121106.o3CB6s8O042342@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-arch@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Apr 2010 11:06:55 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. 
Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. From owner-freebsd-arch@FreeBSD.ORG Mon Apr 12 13:56:03 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A009E1065673; Mon, 12 Apr 2010 13:56:03 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 6FA3E8FC0A; Mon, 12 Apr 2010 13:56:03 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 2121346B6C; Mon, 12 Apr 2010 09:56:03 -0400 (EDT) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id 1D63D8A01F; Mon, 12 Apr 2010 09:56:00 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Mon, 12 Apr 2010 09:31:36 -0400 User-Agent: KMail/1.12.1 (FreeBSD/7.3-CBSD-20100217; KDE/4.3.1; amd64; ; ) References: <4BC1188F.3060001@freebsd.org> In-Reply-To: <4BC1188F.3060001@freebsd.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201004120931.36907.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Mon, 12 Apr 2010 09:56:00 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-1.8 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: Garrett Cooper , Tim Kientzle , portmgr@freebsd.org, FreeBSD Arch Subject: Re: [RFC] Remove @owner and @user from package list X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list 
List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Apr 2010 13:56:03 -0000 On Saturday 10 April 2010 8:32:15 pm Tim Kientzle wrote: > Garrett Cooper wrote: > > On Sat, Apr 10, 2010 at 3:57 PM, Garrett Cooper wrote: > >>> When doing some research, it appears that while functionality in > >>> theory exists for @owner and @user in the package list, it isn't > >>> actually used in the pkg_install code at all, adding unnecessary bloat > >>> to package lists; > > > > Doing some more digging, there are a handful of ports that I don't > > have installed that implement this functionality: > > @mode ... > > @owner ... > > @group ... > > I would certainly shed no tears if these went away. > > OTOH, I can see a use for them in pkg_create, to > set the mode/owner/group in the resulting tarball. > This would be good when building a package from a > port while running as non-root user. Yes. I have used this to build 3rd party packages at a previous employer. 
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Apr 12 16:02:28 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 88C331065674 for ; Mon, 12 Apr 2010 16:02:28 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id CA6088FC13 for ; Mon, 12 Apr 2010 16:02:27 +0000 (UTC) Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua [212.40.38.101]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id TAA29562; Mon, 12 Apr 2010 19:02:11 +0300 (EEST) (envelope-from avg@freebsd.org) Message-ID: <4BC34402.1050509@freebsd.org> Date: Mon, 12 Apr 2010 19:02:10 +0300 From: Andriy Gapon User-Agent: Thunderbird 2.0.0.24 (X11/20100319) MIME-Version: 1.0 To: Bruce Evans References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> In-Reply-To: <20100411114405.L10562@delplex.bde.org> X-Enigmail-Version: 0.95.7 Content-Type:
text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, Rick Macklem Subject: Re: (in)appropriate uses for MAXBSIZE X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Apr 2010 16:02:28 -0000 on 11/04/2010 05:56 Bruce Evans said the following: > On Fri, 9 Apr 2010, Andriy Gapon wrote: [snip] >> I have lightly tested this under qemu. >> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s. >> I removed size > MAXBSIZE check in getblk (see a parallel thread >> "panic: getblk: >> size(%d) > MAXBSIZE(%d)"). > > Did you change the other known things that depend on this? There is the > b_pages limit of MAXPHYS bytes which should be checked for in another > way I changed the check the way I described in the parallel thread. > and the soft limits for hibufspace and lobufspace which only matter > under load conditions. And what these should be? hibufspace and lobufspace seem to be auto-calculated. One thing that I noticed and that was a direct cause of the problem described below, is that difference between hibufspace and lobufspace should be at least the maximum block size allowed in getblk() (perhaps it should be strictly equal to that value?). So in my case I had to make that difference MAXPHYS. >> And I bumped MAXPHYS to 1MB. >> >> Some results. >> I got no panics, data was read correctly and system remained stable, >> which is good. >> But I observed reading process (dd bs=1m on avgfs) spending a lot of >> time sleeping >> on needsbuffer in getnewbuf. needsbuffer value was VFS_BIO_NEED_ANY. >> Apparently there was some shortage of free buffers. >> Perhaps some limits/counts were incorrectly auto-tuned. > > This is not surprising, since even 64K is 4 times too large to work > well. 
Buffer sizes larger than BKVASIZE (16K) always cause > fragmentation of buffer kva. Recovering from fragmentation always > takes a lot of CPU, and if you are unlucky it will also take a lot of > real time (stalling waiting for free buffer kva). Buffer sizes larger > than BKVASIZE also reduce the number of available buffers significantly > below the number of buffers configured. This mainly takes a lot of > CPU to reconstitute buffers. BKVASIZE being less than MAXBSIZE is a > hack to reduce the amount of kva statically allocated for buffers for > systems that cannot support enough kva to work right (mainly i386's). > It only works well when it is not actually used (when all buffers have > size <= BKVASIZE = 16K, as would be enforced by reducing MAXBSIZE to > BKVASIZE). This hack and the complications to support it are bogus on > systems that support enough kva to work right. So, BKVASIZE is the best read size from the point of view of buffer space usage? E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read request, but leads to buffer space map fragmentation, because of size > BKVASIZE. On the other hand, four sequential reads of BKVASIZE=16K bytes are perfect from the buffer space point of view (no fragmentation potential) but they result in 4 GEOM I/O requests. The thing is that a single read requires a single contiguous virtual address space chunk. Would it be possible to take the best of both worlds by somehow allowing a single large I/O request to work with several buffers (with b_kvasize == BKVASIZE) in an iovec-like style? Have I just reinvented the bicycle? :) Probably not, because the answer to my question is probably 'no (without lots of work in lots of places)' as well. I see that breadn() certainly doesn't work that way. As I understand it, it works like bread() for one block plus starts something like 'asynchronous breads()' for a given count of other blocks. I am not sure about the details of how cluster_read() works, though. 
Could you please explain the essence of it? Thank you! Perhaps, there are other approaches to the fragmentation issue. Like, for example, using sort of zones for different block sizes. But that all adds complications and takes away performance of the easy cases. -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Mon Apr 12 22:28:47 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C745A106566B for ; Mon, 12 Apr 2010 22:28:47 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.geekcn.org (tarsier.geekcn.org [IPv6:2001:470:a803::1]) by mx1.freebsd.org (Postfix) with ESMTP id 710C98FC0C for ; Mon, 12 Apr 2010 22:28:47 +0000 (UTC) Received: from mail.geekcn.org (tarsier.geekcn.org [211.166.10.233]) by tarsier.geekcn.org (Postfix) with ESMTP id 805B3A561A1; Tue, 13 Apr 2010 06:28:46 +0800 (CST) X-Virus-Scanned: amavisd-new at geekcn.org Received: from tarsier.geekcn.org ([211.166.10.233]) by mail.geekcn.org (mail.geekcn.org [211.166.10.233]) (amavisd-new, port 10024) with LMTP id TZUpelbWeKNT; Tue, 13 Apr 2010 06:28:40 +0800 (CST) Received: from delta.delphij.net (drawbridge.ixsystems.com [206.40.55.65]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by tarsier.geekcn.org (Postfix) with ESMTPSA id 7E005A56199; Tue, 13 Apr 2010 06:28:39 +0800 (CST) DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns; h=message-id:date:from:reply-to:organization:user-agent: mime-version:to:subject:x-enigmail-version:openpgp:content-type:content-transfer-encoding; b=cLlVQu6zh3kqkdAmLLrCuYRufh2md+ZnqofYf6atiYqskPrj0Pnbln/R9OmGMw9It zsh88YSptj4Dr6vXTk/SQ== Message-ID: <4BC39E93.7060906@delphij.net> Date: Mon, 12 Apr 2010 15:28:35 -0700 From: Xin LI Organization: The Geek China Organization User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.1.9) Gecko/20100408 
Thunderbird/3.0.4 ThunderBrowse/3.2.8.1 MIME-Version: 1.0 To: freebsd-arch@freebsd.org X-Enigmail-Version: 1.0.1 OpenPGP: id=3FCA37C1; url=http://www.delphij.net/delphij.asc Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Subject: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: d@delphij.net List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Apr 2010 22:28:47 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, Is there a sane way to copyout ioctl request when the returning errno != 0? Looking at the code, currently, in sys/kern/sys_generic.c, we have: =========== error = kern_ioctl(td, uap->fd, com, data); if (error == 0 && (com & IOC_OUT)) error = copyout(data, uap->data, (u_int)size); =========== Is there any objection if I change it to something like: =========== saved_error = kern_ioctl(td, uap->fd, com, data); if (com & IOC_OUT) error = copyout(data, uap->data, (u_int)size); if (saved_error) error = saved_error; =========== Cheers, - -- Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! 
Live free or die -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iQEcBAEBAgAGBQJLw56PAAoJEATO+BI/yjfBuYMIAM7qAVWuWn/noQPzH12W3IoH TInBLyGjG8tH5z9CPJeXe3X+aVz932KEuE85E6GXBo7zoGf1IWbMk8+LO+Ai+5It AgxeFrBUn0MUEY4dJPZs89Ag8LCBFvvHOe1eTxw+6sjdSDtFg2OV55F2nrCcPtoG jIEQtcfhy1H+evihEycoN9uMdTH0XWEcCZVhXKS0R4a3veOp2RUt4I21LhSYdyrx xairvHNIOp0eBdHf8O2TlwyWzlZpHg3XMO9UM/aZ5uiVeSIsB0nEX3SXGi3o7Rih DaCTqZpk4L6z1UIUsGEqLl5i6yrbP5LFwNDk9dYbQL3of4SVPofsD9O1hJ3MuIE= =cMdX -----END PGP SIGNATURE----- From owner-freebsd-arch@FreeBSD.ORG Mon Apr 12 23:49:23 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 61A48106566B for ; Mon, 12 Apr 2010 23:49:23 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 553518FC1B for ; Mon, 12 Apr 2010 23:49:23 +0000 (UTC) Received: by elvis.mu.org (Postfix, from userid 1192) id 857C41A3C86; Mon, 12 Apr 2010 16:33:30 -0700 (PDT) Date: Mon, 12 Apr 2010 16:33:30 -0700 From: Alfred Perlstein To: d@delphij.net Message-ID: <20100412233330.GC19003@elvis.mu.org> References: <4BC39E93.7060906@delphij.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC39E93.7060906@delphij.net> User-Agent: Mutt/1.4.2.3i Cc: freebsd-arch@freebsd.org Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Apr 2010 23:49:23 -0000 * Xin LI [100412 15:28] wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, > > Is there a sane way to copyout ioctl request when the returning errno != > 0? 
Looking at the code, currently, in sys/kern/sys_generic.c, we have: > > =========== > error = kern_ioctl(td, uap->fd, com, data); > > if (error == 0 && (com & IOC_OUT)) > error = copyout(data, uap->data, (u_int)size); > =========== > > Is there any objection if I change it to something like: > > =========== > saved_error = kern_ioctl(td, uap->fd, com, data); > > if (com & IOC_OUT) > error = copyout(data, uap->data, (u_int)size); > if (saved_error) > error = saved_error; > =========== Is this for Linux compat? I'm not sure this would work; it might seriously break userland compat. Have you looked around/queried what the expected outcome is from a bad ioctl? By default the buffer will be zeroed; this might be unexpected by apps. (all or nothing) -- - Alfred Perlstein .- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250 .- FreeBSD committer From owner-freebsd-arch@FreeBSD.ORG Tue Apr 13 00:27:02 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0E128106566C; Tue, 13 Apr 2010 00:27:02 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.geekcn.org (tarsier.geekcn.org [IPv6:2001:470:a803::1]) by mx1.freebsd.org (Postfix) with ESMTP id 8E4A18FC08; Tue, 13 Apr 2010 00:27:01 +0000 (UTC) Received: from mail.geekcn.org (tarsier.geekcn.org [211.166.10.233]) by tarsier.geekcn.org (Postfix) with ESMTP id 9DAD8A563BB; Tue, 13 Apr 2010 08:27:00 +0800 (CST) X-Virus-Scanned: amavisd-new at geekcn.org Received: from tarsier.geekcn.org ([211.166.10.233]) by mail.geekcn.org (mail.geekcn.org [211.166.10.233]) (amavisd-new, port 10024) with LMTP id cTHSnfUMPDeC; Tue, 13 Apr 2010 08:26:54 +0800 (CST) Received: from delta.delphij.net (drawbridge.ixsystems.com [206.40.55.65]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by tarsier.geekcn.org (Postfix) with ESMTPSA id D5467A56357; Tue, 13 Apr 2010 
08:26:52 +0800 (CST) DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns; h=message-id:date:from:reply-to:organization:user-agent: mime-version:to:cc:subject:references:in-reply-to: x-enigmail-version:openpgp:content-type:content-transfer-encoding; b=NGXXAfTZQKyEqnbzykefQAORUbaecgn1wTUsNWWXvoNTgpqHO0nm0eBbaH5hzz+cJ KcpLEkuDDEU7MdJ4DhXGA== Message-ID: <4BC3BA48.9010009@delphij.net> Date: Mon, 12 Apr 2010 17:26:48 -0700 From: Xin LI Organization: The Geek China Organization User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.1.9) Gecko/20100408 Thunderbird/3.0.4 ThunderBrowse/3.2.8.1 MIME-Version: 1.0 To: Alfred Perlstein References: <4BC39E93.7060906@delphij.net> <20100412233330.GC19003@elvis.mu.org> In-Reply-To: <20100412233330.GC19003@elvis.mu.org> X-Enigmail-Version: 1.0.1 OpenPGP: id=3FCA37C1; url=http://www.delphij.net/delphij.asc Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: d@delphij.net, freebsd-arch@freebsd.org Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: d@delphij.net List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Apr 2010 00:27:02 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2010/04/12 16:33, Alfred Perlstein wrote: > * Xin LI [100412 15:28] wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Hi, >> >> Is there a sane way to copyout ioctl request when the returning errno != >> 0? 
Looking at the code, currently, in sys/kern/sys_generic.c, we have: >> >> =========== >> error = kern_ioctl(td, uap->fd, com, data); >> >> if (error == 0 && (com & IOC_OUT)) >> error = copyout(data, uap->data, (u_int)size); >> =========== >> >> Is there any objection if I change it to something like: >> >> =========== >> saved_error = kern_ioctl(td, uap->fd, com, data); >> >> if (com & IOC_OUT) >> error = copyout(data, uap->data, (u_int)size); >> if (saved_error) >> error = saved_error; >> =========== > > Is this for Linux compat? Do they do it this way? I'm not quite sure :-/ I got a bug report and am thinking about how to fix it; it seems that we do not have a generic way of returning an error number while giving some "hints" about the error at the same time, for the ioctl() call. Adding an extra pointer to the request structure seems to be a last-resort way and sounds ugly. > I'm not sure this would work; it might seriously break userland > compat. Have you looked around/queried what the expected outcome > is from a bad ioctl? By default the buffer will be zeroed; this > might be unexpected by apps. (all or nothing) Yes, that's exactly why I'm asking. My understanding is that normal usage would be something like: if (ioctl(fd, SIOCSOMETHING, &req) < 0) { // do something to handle the error } else { // use data fed back from req } In this case, I think the result would not be affected. Are there many (if any) programs that don't bother to check the return value of ioctl()? As for the userland buffer, for _IOR ioctls, the side effect would be that userland would see a zeroed out 'req' structure (kernel buffer gets zeroed out before calling kern_ioctl), or a "half-baked" one (the kernel code may have only written partial data). For _IOWR ioctls, the side effect would be that userland may get half-baked data. 
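To make the _IOWR case concrete, here is a rough userland model of the two policies. It is only a sketch and not kernel code: `demo_req`, `demo_handler` and the two `dispatch_*` functions are made-up names, and memcpy() merely stands in for copyin()/copyout().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request layout, for illustration only. */
struct demo_req {
	int in_value;	/* filled in by the caller */
	int out_value;	/* filled in by the handler */
};

/* Stand-in for a driver handler: writes part of the result, then fails. */
static int
demo_handler(struct demo_req *kreq)
{
	kreq->out_value = 42;	/* the "half-baked" partial result */
	return (EINVAL);
}

/* Current policy: copy the buffer back only if the handler succeeded. */
static int
dispatch_current(struct demo_req *ureq)
{
	struct demo_req kbuf;
	int error;

	assert(ureq != NULL);
	memcpy(&kbuf, ureq, sizeof(kbuf));		/* models copyin() */
	error = demo_handler(&kbuf);
	if (error == 0)
		memcpy(ureq, &kbuf, sizeof(kbuf));	/* models copyout() */
	return (error);
}

/* Proposed policy: always copy back, but still report the handler's error. */
static int
dispatch_proposed(struct demo_req *ureq)
{
	struct demo_req kbuf;
	int error;

	assert(ureq != NULL);
	memcpy(&kbuf, ureq, sizeof(kbuf));		/* models copyin() */
	error = demo_handler(&kbuf);
	memcpy(ureq, &kbuf, sizeof(kbuf));		/* models copyout() */
	return (error);
}
```

With a request initialised to { 7, -1 }, both functions return EINVAL; the first leaves out_value at -1, while the second exposes the partial 42 to the caller. 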
The in-kernel request buffer is always initialized, as it is either overwritten by copyin(), or by bzero() so I don't think sensitive data could be leaked, unless the kernel code intentionally copy some sensitive data to the req buffer, detect if there is error, and then scrub sensitive data away. Cheers, - -- Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! Live free or die -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iQEcBAEBAgAGBQJLw7pIAAoJEATO+BI/yjfBXjwH/RaheqNyhY0eECcqC5Gz0ycm 2VOpHoe+oRpwHNDYrlNqILKl815HTjpvyi145IpMPIKvEct2O0i6wGJ3FH7VFQwP ucZh6Tj3K3yF+OsFw3iAk69aqFhslb/SuZtuAbJAA4DB+H1rUPtEfWs9y8XjmAaS ZvFTmmP1w1V50I843UJEbY86LqwJGOgGH0mJ6n1mEsLOFyrASrjGajAOb/mEvju4 pLVoaKI9sWGk4QfE9QKol083DuSC/WVbJBFHmzN0K0sNmRfyZofcSIYpWDMkwS4n Mt2M3b6irwul83EkK+cw1gclmV7lUTslfMGtyLbLahZek3HFDh4oZ5xnctfI1xA= =1Hn6 -----END PGP SIGNATURE----- From owner-freebsd-arch@FreeBSD.ORG Tue Apr 13 01:45:57 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A31A5106564A for ; Tue, 13 Apr 2010 01:45:57 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 93CA88FC12 for ; Tue, 13 Apr 2010 01:45:57 +0000 (UTC) Received: by elvis.mu.org (Postfix, from userid 1192) id 3B76B1A3C86; Mon, 12 Apr 2010 18:45:57 -0700 (PDT) Date: Mon, 12 Apr 2010 18:45:57 -0700 From: Alfred Perlstein To: d@delphij.net Message-ID: <20100413014557.GE19003@elvis.mu.org> References: <4BC39E93.7060906@delphij.net> <20100412233330.GC19003@elvis.mu.org> <4BC3BA48.9010009@delphij.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4BC3BA48.9010009@delphij.net> User-Agent: Mutt/1.4.2.3i Cc: freebsd-arch@freebsd.org Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 
Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Apr 2010 01:45:57 -0000 * Xin LI [100412 17:27] wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 2010/04/12 16:33, Alfred Perlstein wrote: > > * Xin LI [100412 15:28] wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- > >> Hash: SHA1 > >> > >> Hi, > >> > >> Is there a sane way to copyout ioctl request when the returning errno != > >> 0? Looking at the code, currently, in sys/kern/sys_generic.c, we have: > >> > >> =========== > >> error = kern_ioctl(td, uap->fd, com, data); > >> > >> if (error == 0 && (com & IOC_OUT)) > >> error = copyout(data, uap->data, (u_int)size); > >> =========== > >> > >> Is there any objection if I change it to something like: > >> > >> =========== > >> saved_error = kern_ioctl(td, uap->fd, com, data); > >> > >> if (com & IOC_OUT) > >> error = copyout(data, uap->data, (u_int)size); > >> if (saved_error) > >> error = saved_error; > >> =========== > > > > Is this for linux compat? > > Do they do this way? I'm not quite sure :-/ > > I got a bug report and am thinking about how to fix it, it seems that we > do not have a generic way of returning an error number while giving some > "hints" about the error at the same time, for the ioctl() call. Adding > an extra pointer to the request structure seems to be a last-resort way > and sounds to be ugly. Why not just have the ioctl return success but have an error code inside the result, example: struct yourioctldata { int error; // 0 = ok, else errno char data[DATASIZE]; // data.. ... } > > > I'm not sure this would work, it might seriously break userland > > compat. Have you looked around/queiried what the expected outcome > > is from a bad ioctl? By default the buffer will be zero'd this > > might be unexpected by apps. 
(all or nothing) > > Yes that's exactly why I'm asking, my understanding is that for normal > usages would be something like: > > if (ioctl(fd, SIOCSOMETHING, &req) < 0) { > // do something to handle the error > } else { > // use data fed back from req > } > > In this case, I think the result would not be affected. Is there many > (if any) programs that don't bother to check return value of ioctl()? > > Speaking for the userland buffer, for _IOR ioctls, the side effect would > be that userland would see a zeroed out 'req' structure (kernel buffer > gets zeroed out before calling kern_ioctl), or "half-baked" one (the > kernel code may have only written partial data). For _IOWR ioctls, the > side effect would be that the userland may get half-baked data. > > The in-kernel request buffer is always initialized, as it is either > overwritten by copyin(), or by bzero() so I don't think sensitive data > could be leaked, unless the kernel code intentionally copy some > sensitive data to the req buffer, detect if there is error, and then > scrub sensitive data away. I'm not sure and certainly not an authority on this. It's probably worth pinging a few of the standards people. This is interesting, good luck! 
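For what it's worth, the embedded-error layout suggested above might look roughly like this. It is a sketch under assumptions: DATASIZE is given an arbitrary value, and `your_handler` and its `fail` switch are hypothetical stand-ins for a real driver routine.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define DATASIZE 64	/* assumed size, for illustration only */

struct yourioctldata {
	int error;		/* 0 = ok, else an errno-style code */
	char data[DATASIZE];	/* result, or a hint about the failure */
};

/*
 * Hypothetical handler.  It always returns 0, so the generic ioctl code
 * would still copy the structure out; failure is reported in-band through
 * req->error, with a short hint left in req->data.
 */
static int
your_handler(struct yourioctldata *req, int fail)
{
	assert(req != NULL);
	memset(req->data, 0, sizeof(req->data));
	if (fail) {
		req->error = ENXIO;
		snprintf(req->data, sizeof(req->data), "unit not attached");
		return (0);
	}
	req->error = 0;
	snprintf(req->data, sizeof(req->data), "ok");
	return (0);
}
```

The caller then has two things to check: the return value of the call itself, and the error field inside the structure. 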
-- - Alfred Perlstein .- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250 .- FreeBSD committer From owner-freebsd-arch@FreeBSD.ORG Tue Apr 13 13:08:33 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CB9651065670; Tue, 13 Apr 2010 13:08:33 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 880EE8FC19; Tue, 13 Apr 2010 13:08:33 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id E6E2046BA4; Tue, 13 Apr 2010 09:08:32 -0400 (EDT) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id C771D8A021; Tue, 13 Apr 2010 09:08:28 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org, d@delphij.net Date: Tue, 13 Apr 2010 08:53:16 -0400 User-Agent: KMail/1.12.1 (FreeBSD/7.3-CBSD-20100217; KDE/4.3.1; amd64; ; ) References: <4BC39E93.7060906@delphij.net> <20100412233330.GC19003@elvis.mu.org> <4BC3BA48.9010009@delphij.net> In-Reply-To: <4BC3BA48.9010009@delphij.net> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201004130853.16994.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Tue, 13 Apr 2010 09:08:28 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-1.8 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: Alfred Perlstein Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: 
List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Apr 2010 13:08:34 -0000 On Monday 12 April 2010 8:26:48 pm Xin LI wrote: > On 2010/04/12 16:33, Alfred Perlstein wrote: > > * Xin LI [100412 15:28] wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- > >> Hash: SHA1 > >> > >> Hi, > >> > >> Is there a sane way to copyout ioctl request when the returning errno != > >> 0? Looking at the code, currently, in sys/kern/sys_generic.c, we have: > >> > >> =========== > >> error = kern_ioctl(td, uap->fd, com, data); > >> > >> if (error == 0 && (com & IOC_OUT)) > >> error = copyout(data, uap->data, (u_int)size); > >> =========== > >> > >> Is there any objection if I change it to something like: > >> > >> =========== > >> saved_error = kern_ioctl(td, uap->fd, com, data); > >> > >> if (com & IOC_OUT) > >> error = copyout(data, uap->data, (u_int)size); > >> if (saved_error) > >> error = saved_error; > >> =========== > > > > Is this for linux compat? > > Do they do this way? I'm not quite sure :-/ > > I got a bug report and am thinking about how to fix it, it seems that we > do not have a generic way of returning an error number while giving some > "hints" about the error at the same time, for the ioctl() call. Adding > an extra pointer to the request structure seems to be a last-resort way > and sounds to be ugly. Actually, this pattern of embedding an error is quite common. The mfi(4) and mpt(4) pass-thru ioctls to send firmware commands embed the return status of any firmware command in the structure that is passed in and out for example. > > I'm not sure this would work, it might seriously break userland > > compat. Have you looked around/queiried what the expected outcome > > is from a bad ioctl? By default the buffer will be zero'd this > > might be unexpected by apps. 
(all or nothing) > > Yes that's exactly why I'm asking, my understanding is that for normal > usages would be something like: > > if (ioctl(fd, SIOCSOMETHING, &req) < 0) { > // do something to handle the error > } else { > // use data fed back from req > } > > In this case, I think the result would not be affected. Is there many > (if any) programs that don't bother to check return value of ioctl()? > > Speaking for the userland buffer, for _IOR ioctls, the side effect would > be that userland would see a zeroed out 'req' structure (kernel buffer > gets zeroed out before calling kern_ioctl), or "half-baked" one (the > kernel code may have only written partial data). For _IOWR ioctls, the > side effect would be that the userland may get half-baked data. You miss one important variation where the error handling involves adjusting the request and retrying (or submitting the same request to a different ioctl to handle renumbering conflicts, etc.). Other APIs such as sysctl(2) and setsockopt(2) can leave partial data, but the callers of those APIs expect that (and in fact, those APIs return the actual length of data that is copied out). ioctl(2) has not had that behavior, however, and I would find it surprising. 
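The retry case can be sketched in userland C. This is only an illustration, not the real sys_ioctl() path: struct demo_req, DEMO_EXCL, demo_handler(), demo_ioctl() and demo_retry() are all hypothetical names, and memcpy() stands in for copyin()/copyout(). It assumes a handler that scribbles on its in-kernel copy before failing.

```c
#include <errno.h>
#include <string.h>

#define	DEMO_EXCL	0x1	/* hypothetical "exclusive access" flag */

/* Hypothetical request layout; not taken from any real driver. */
struct demo_req {
	int	unit;	/* in: which unit to query */
	int	flags;	/* in: DEMO_EXCL or 0 */
	int	value;	/* out: meaningful only when the call succeeds */
};

/* Stand-in for a driver handler working on the in-kernel copy. */
static int
demo_handler(struct demo_req *kreq)
{
	if (kreq->flags & DEMO_EXCL) {
		kreq->unit = -1;	/* half-baked write before failing */
		return (EBUSY);
	}
	if (kreq->unit < 0)
		return (EINVAL);
	kreq->value = kreq->unit * 10;
	return (0);
}

/* memcpy() stands in for copyin()/copyout(); returns an errno value. */
static int
demo_ioctl(struct demo_req *ureq, int copyout_on_error)
{
	struct demo_req kreq;
	int error;

	memcpy(&kreq, ureq, sizeof(kreq));
	error = demo_handler(&kreq);
	if (error == 0 || copyout_on_error)
		memcpy(ureq, &kreq, sizeof(kreq));
	return (error);
}

/* Caller that adjusts one field of the request and retries after EBUSY. */
static int
demo_retry(int copyout_on_error)
{
	struct demo_req req = { .unit = 3, .flags = DEMO_EXCL, .value = 0 };
	int error;

	error = demo_ioctl(&req, copyout_on_error);
	if (error == EBUSY) {
		req.flags &= ~DEMO_EXCL;	/* reuse the rest of the request */
		error = demo_ioctl(&req, copyout_on_error);
	}
	return (error);
}
```

With copyout only on success, demo_retry(0) returns 0 because the request still holds unit 3 on the second attempt. With copyout on error, demo_retry(1) returns EINVAL because the failed first call overwrote unit in the caller's buffer.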
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Apr 13 19:01:27 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EBD371065679; Tue, 13 Apr 2010 19:01:27 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from tarsier.geekcn.org (tarsier.geekcn.org [IPv6:2001:470:a803::1]) by mx1.freebsd.org (Postfix) with ESMTP id BE1B48FC24; Tue, 13 Apr 2010 19:01:26 +0000 (UTC) Received: from mail.geekcn.org (tarsier.geekcn.org [211.166.10.233]) by tarsier.geekcn.org (Postfix) with ESMTP id 58CDAA56587; Wed, 14 Apr 2010 03:01:25 +0800 (CST) X-Virus-Scanned: amavisd-new at geekcn.org Received: from tarsier.geekcn.org ([211.166.10.233]) by mail.geekcn.org (mail.geekcn.org [211.166.10.233]) (amavisd-new, port 10024) with LMTP id KkbYGbmQsBul; Wed, 14 Apr 2010 03:01:19 +0800 (CST) Received: from delta.delphij.net (drawbridge.ixsystems.com [206.40.55.65]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by tarsier.geekcn.org (Postfix) with ESMTPSA id 41B2FA5502B; Wed, 14 Apr 2010 03:01:18 +0800 (CST) DomainKey-Signature: a=rsa-sha1; s=default; d=delphij.net; c=nofws; q=dns; h=message-id:date:from:reply-to:organization:user-agent: mime-version:to:cc:subject:references:in-reply-to: x-enigmail-version:openpgp:content-type:content-transfer-encoding; b=L/qffVBjyhfSQzVWULPMzhlpceKJ99kIViVTjOxzdbOUC+sOISvr4ieVvryH8f92g KxRRtl/bkuW7JMCpG7G1w== Message-ID: <4BC4BF7A.9090106@delphij.net> Date: Tue, 13 Apr 2010 12:01:14 -0700 From: Xin LI Organization: The Geek China Organization User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.1.9) Gecko/20100408 Thunderbird/3.0.4 ThunderBrowse/3.2.8.1 MIME-Version: 1.0 To: John Baldwin References: <4BC39E93.7060906@delphij.net> <20100412233330.GC19003@elvis.mu.org> <4BC3BA48.9010009@delphij.net> <201004130853.16994.jhb@freebsd.org> In-Reply-To: 
<201004130853.16994.jhb@freebsd.org> X-Enigmail-Version: 1.0.1 OpenPGP: id=3FCA37C1; url=http://www.delphij.net/delphij.asc Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: Alfred Perlstein , d@delphij.net, freebsd-arch@freebsd.org Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: d@delphij.net List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Apr 2010 19:01:28 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2010/04/13 05:53, John Baldwin wrote: > On Monday 12 April 2010 8:26:48 pm Xin LI wrote: >> On 2010/04/12 16:33, Alfred Perlstein wrote: >>> * Xin LI [100412 15:28] wrote: >>>> -----BEGIN PGP SIGNED MESSAGE----- >>>> Hash: SHA1 >>>> >>>> Hi, >>>> >>>> Is there a sane way to copyout ioctl request when the returning errno != >>>> 0? Looking at the code, currently, in sys/kern/sys_generic.c, we have: >>>> >>>> =========== >>>> error = kern_ioctl(td, uap->fd, com, data); >>>> >>>> if (error == 0 && (com & IOC_OUT)) >>>> error = copyout(data, uap->data, (u_int)size); >>>> =========== >>>> >>>> Is there any objection if I change it to something like: >>>> >>>> =========== >>>> saved_error = kern_ioctl(td, uap->fd, com, data); >>>> >>>> if (com & IOC_OUT) >>>> error = copyout(data, uap->data, (u_int)size); >>>> if (saved_error) >>>> error = saved_error; >>>> =========== >>> >>> Is this for linux compat? >> >> Do they do this way? I'm not quite sure :-/ >> >> I got a bug report and am thinking about how to fix it, it seems that we >> do not have a generic way of returning an error number while giving some >> "hints" about the error at the same time, for the ioctl() call. Adding >> an extra pointer to the request structure seems to be a last-resort way >> and sounds to be ugly. > > Actually, this pattern of embedding an error is quite common. 
The mfi(4) and > mpt(4) pass-thru ioctls to send firmware commands embed the return status of > any firmware command in the structure that is passed in and out for example. > >>> I'm not sure this would work, it might seriously break userland >>> compat. Have you looked around/queiried what the expected outcome >>> is from a bad ioctl? By default the buffer will be zero'd this >>> might be unexpected by apps. (all or nothing) >> >> Yes that's exactly why I'm asking, my understanding is that for normal >> usages would be something like: >> >> if (ioctl(fd, SIOCSOMETHING, &req) < 0) { >> // do something to handle the error >> } else { >> // use data fed back from req >> } >> >> In this case, I think the result would not be affected. Is there many >> (if any) programs that don't bother to check return value of ioctl()? >> >> Speaking for the userland buffer, for _IOR ioctls, the side effect would >> be that userland would see a zeroed out 'req' structure (kernel buffer >> gets zeroed out before calling kern_ioctl), or "half-baked" one (the >> kernel code may have only written partial data). For _IOWR ioctls, the >> side effect would be that the userland may get half-baked data. > > You miss one important variation where the error handling involves adjusting > the request and retrying (or submitting the same request to a different ioctl > to handle renumbering conflicts, etc.). Other APIs such as sysctl(2) and > setsockopt(2) can leave partial data, but the callers of those APIs expect > that (and in fact, those APIs return the actual length of data that is copied > out). ioctl(2) has not had that behavior, however, and I would find it > surprising. I see, that's what I am concerned about, thanks for the explanation. In order to maintain ABI compatibility I now have a patch which changes the current behavior of SIOCGIFDESCR to set the buffer field to NULL and return no errno. 
The existing code in -HEAD doesn't seem to work when the field is big :( I will post the patch to -net@ for review once I get it tested. Cheers, - -- Xin LI http://www.delphij.net/ FreeBSD - The Power to Serve! Live free or die -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iQEcBAEBAgAGBQJLxL96AAoJEATO+BI/yjfBkkMIAIVjUmwrfHLl5F+mIlRD+Zpv hYZVBZaeu3/ymv0Zepo5vhbvJCOWxgdtRnJgoVlkpglZLwVrKkAdfxWp/di5n8xm O4BMc+BIra6tYnqaxmbCYoigKGoLVhim1n6j2Xld/h0n91ErBDpdrWBdHVbs8uV+ mRFLCPbGzGnEXw68rdbWjXFIDRIe7btTdmyYotaHd5AFaqQw6EM+OAXRG3UqGtm3 92o+9TW2LcTTP9gyresbQGoXvITHXVfSdihhDVfDMCtbaClQ+IFlny0oGqg0DttR OhnEWDvBgUQD+aADYx2k8YLXziUsQzvTc7WTZuoxdz3LzZVecyQSewiydEhor/U= =IHjf -----END PGP SIGNATURE----- From owner-freebsd-arch@FreeBSD.ORG Wed Apr 14 03:45:42 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EA9B71065673; Wed, 14 Apr 2010 03:45:41 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id 1D0758FC08; Wed, 14 Apr 2010 03:45:40 +0000 (UTC) Received: from c211-30-173-227.carlnfd1.nsw.optusnet.com.au (c211-30-173-227.carlnfd1.nsw.optusnet.com.au [211.30.173.227]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o3E3jNcM027998 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 14 Apr 2010 13:45:24 +1000 Date: Wed, 14 Apr 2010 13:45:23 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: John Baldwin In-Reply-To: <201004130853.16994.jhb@freebsd.org> Message-ID: <20100414130627.V12547@delplex.bde.org> References: <4BC39E93.7060906@delphij.net> <20100412233330.GC19003@elvis.mu.org> <4BC3BA48.9010009@delphij.net> <201004130853.16994.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Alfred Perlstein , d@delphij.net, 
freebsd-arch@FreeBSD.org Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 14 Apr 2010 03:45:42 -0000 On Tue, 13 Apr 2010, John Baldwin wrote: > On Monday 12 April 2010 8:26:48 pm Xin LI wrote: >> On 2010/04/12 16:33, Alfred Perlstein wrote: >>> * Xin LI [100412 15:28] wrote: >>>> -----BEGIN PGP SIGNED MESSAGE----- >>>> Hash: SHA1 >>>> >>>> Hi, >>>> >>>> Is there a sane way to copyout ioctl request when the returning errno != >>>> 0? Looking at the code, currently, in sys/kern/sys_generic.c, we have: No. You could just do it, but this would be insane since it would just waste time. >>>> >>>> =========== >>>> error = kern_ioctl(td, uap->fd, com, data); >>>> >>>> if (error == 0 && (com & IOC_OUT)) >>>> error = copyout(data, uap->data, (u_int)size); >>>> =========== >>>> >>>> Is there any objection if I change it to something like: >>>> >>>> =========== >>>> saved_error = kern_ioctl(td, uap->fd, com, data); >>>> >>>> if (com & IOC_OUT) >>>> error = copyout(data, uap->data, (u_int)size); >>>> if (saved_error) >>>> error = saved_error; >>>> =========== errno != 0 means that the ioctl failed, so the contents of the output buffer (output from the kernel) is indeterminate, so only broken applications would look at it (except merely insane ones could look at it and not use the results). > Actually, this pattern of embedding an error is quite common. The mfi(4) and > mpt(4) pass-thru ioctls to send firmware commands embed the return status of > any firmware command in the structure that is passed in and out for example. > >>> I'm not sure this would work, it might seriously break userland >>> compat. Have you looked around/queiried what the expected outcome >>> is from a bad ioctl? By default the buffer will be zero'd this >>> might be unexpected by apps. 
(all or nothing) Such applications are broken. The error might occur at any point in the syscall and apps have no way of telling where. Errors during the copyout would cause a partial copy (!(all or nothing) unless partial is actually nothing). With a partial copy, the changed bytes could be anywhere in the copy, depending on the implementation. >> Yes that's exactly why I'm asking, my understanding is that for normal >> usages would be something like: >> >> if (ioctl(fd, SIOCSOMETHING, &req) < 0) { Testing syscalls that return 0 on error using " < 0" is a normal style bug. >> // do something to handle the error >> } else { >> // use data fed back from req >> } >> >> In this case, I think the result would not be affected. Is there many >> (if any) programs that don't bother to check return value of ioctl()? Only broken ones. >> Speaking for the userland buffer, for _IOR ioctls, the side effect would >> be that userland would see a zeroed out 'req' structure (kernel buffer >> gets zeroed out before calling kern_ioctl), or "half-baked" one (the >> kernel code may have only written partial data). For _IOWR ioctls, the >> side effect would be that the userland may get half-baked data. Hmm, the kernel probably depends on the pre-zeroing, so that half-baked data is not necessarily an error. > You miss one important variation where the error handling involves adjusting > the request and retrying (or submitting the same request to a different ioctl > to handle renumbering conflicts, etc.). Other APIs such as sysctl(2) and > setsockopt(2) can leave partial data, but the callers of those APIs expect > that (and in fact, those APIs return the actual length of data that is copied > out). ioctl(2) has not had that behavior, however, and I would find it > surprising. Yes, it has no general way of reporting partial success, and doing this for special ioctls would be complicated. 
At the very least you would need to add special error codes to distinguish normal failure (output buffer indeterminate) from partial success (some bytes in output buffer valid, and encode further details of the partialness). Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Apr 14 04:40:50 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 68820106566C; Wed, 14 Apr 2010 04:40:50 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by mx1.freebsd.org (Postfix) with ESMTP id F22C68FC18; Wed, 14 Apr 2010 04:40:49 +0000 (UTC) Received: from c211-30-173-227.carlnfd1.nsw.optusnet.com.au (c211-30-173-227.carlnfd1.nsw.optusnet.com.au [211.30.173.227]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o3E4ejVQ013943 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 14 Apr 2010 14:40:47 +1000 Date: Wed, 14 Apr 2010 14:40:45 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Rick Macklem In-Reply-To: Message-ID: <20100414135230.U12587@delplex.bde.org> References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, Andriy Gapon Subject: Re: (in)appropriate uses for MAXBSIZE X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 14 Apr 2010 04:40:50 -0000 On Sun, 11 Apr 2010, Rick Macklem wrote: > On Sun, 11 Apr 2010, Bruce Evans wrote: > >> Er, the maximum size of buffers in the buffer cache is especially >> irrelevant for nfs. 
It is almost irrelevant for physical disks because >> clustering normally increases the bulk transfer size to MAXPHYS. >> Clustering takes a lot of CPU but doesn't affect the transfer rate much >> unless there is not enough CPU. It is even less relevant for network >> i/o since there is a sort of reverse-clustering -- the buffers get split >> up into tiny packets (normally 1500 bytes less some header bytes) at >> the hardware level. ... > > I've done a simple experiment on Mac OS X 10, where I tried different > sizes for the read and write RPCs plus different amounts of > read-ahead/write-behind and found the I/O rate increased linearly, > up to the max allowed by Mac OS X (MAXBSIZE == 128K) without > read-ahead/write-behind. Using read-ahead/write-behind the performance > didn't increase at all, until the RPC read/write size was reduced. > (Solaris10 is using 256K by default and allowing up to 1Mb for read/write > RPC size now, so they seem to think that large values work well?) > > When you start using a WAN environment, large read/write RPCs really > help, from what I've seen, since that helps fill the TCP pipe > (bits * latency between client<->server). > > I care much more about WAN performance than LAN performance w.r.t. this. Indeed, I was only caring about a LAN environment. Especially with LANs optimized for latency (50-100 uS), nfs performance is poor for small files, at least for the old nfs client, mainly due to close to open consistency defeating caching, but not a problem for bulk transfers. > I am not sure what you were referring to w.r.t. clustering, but if you > meant that the NFS client can easily do an RPC with a larger I/O size > than the size of the buffer handed it by the buffer cache, I'd like to > hear how that's done? (If not, then a bigger buffer from the buffer > cache is what I need to do a larger I/O size in the RPC.) Clustering is currently only for the local file system, at least for the old nfs server. 
nfs just does a VOP_READ() into its own buffer, with ioflag set to indicate nfs's idea of sequentialness. (User reads are similar except their uio destination is UIO_USERSPACE instead of UIO_SYSSPACE and their sequentialness is set generically and thus not so well (but the nfs setting isn't very good either).) The local file system then normally does a clustered read into a larger buffer, with the sequentialness affecting mainly startup (per-file), and virtually copies the results to the local file system's smaller buffers. VOP_READ() completes by physically copying the results to nfs's buffer (using bcopy() for UIO_SYSSPACE and copyout() for UIO_USERSPACE). nfs can't easily get at the larger clustering buffers or even the local file system's buffers. It can more easily benefit from larger MAXBSIZE. There is still the bcopy() to take a lot of CPU and memory bus resources, but that is insignificant compared with WAN latency. But as I said in a related thread, even the current MAXBSIZE is too large to use routinely, due to buffer cache fragmentation causing significant latency problems, so any increase in MAXBSIZE and/or routine use of buffers of that size needs to be accompanied by avoiding the fragmentation. Note that the fragmentation is avoided for the larger clustering buffers by allocating them from a different pool. 
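As a rough illustration of why RPC size matters so much more on a WAN than on a LAN, the "bits * latency" product mentioned earlier in the thread can be worked out as below. This is only a sketch: bdp_bytes() and rpcs_to_fill() are hypothetical helper names, and the link speeds and round-trip times used in the note that follows are assumed figures, not measurements from this thread.

```c
#include <stdint.h>

/*
 * Bytes that must be in flight to keep a link busy:
 * bandwidth (bits/s) / 8 * round-trip time (microseconds) / 10^6.
 */
static uint64_t
bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_usec)
{
	return (bits_per_sec / 8 * rtt_usec / 1000000);
}

/* Number of outstanding RPCs of rpc_size bytes needed to cover bdp bytes. */
static uint64_t
rpcs_to_fill(uint64_t bdp, uint64_t rpc_size)
{
	return ((bdp + rpc_size - 1) / rpc_size);
}
```

For an assumed 100 Mbit/s WAN path with a 50 ms round trip, bdp_bytes(100000000, 50000) gives 625000 bytes, so about 10 RPCs of 64K or 20 RPCs of 32K would have to be outstanding. For an assumed 1 Gbit/s LAN with a 100 us round trip the product is only 12500 bytes, which a single RPC already covers.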
Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Apr 14 05:03:29 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 915B9106566B; Wed, 14 Apr 2010 05:03:29 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.freebsd.org (Postfix) with ESMTP id 29F448FC0A; Wed, 14 Apr 2010 05:03:28 +0000 (UTC) Received: from [127.0.0.1] (pooker.samsco.org [168.103.85.57]) (authenticated bits=0) by pooker.samsco.org (8.14.3/8.14.3) with ESMTP id o3E53ECd018824; Tue, 13 Apr 2010 23:03:14 -0600 (MDT) (envelope-from scottl@samsco.org) Mime-Version: 1.0 (Apple Message framework v1078) Content-Type: text/plain; charset=us-ascii From: Scott Long In-Reply-To: <20100414130627.V12547@delplex.bde.org> Date: Tue, 13 Apr 2010 23:03:14 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: <463B2945-8599-4031-A7A4-E091C69E049F@samsco.org> References: <4BC39E93.7060906@delphij.net> <20100412233330.GC19003@elvis.mu.org> <4BC3BA48.9010009@delphij.net> <201004130853.16994.jhb@freebsd.org> <20100414130627.V12547@delplex.bde.org> To: Bruce Evans X-Mailer: Apple Mail (2.1078) X-Spam-Status: No, score=-1.0 required=3.8 tests=ALL_TRUSTED, T_RP_MATCHES_RCVD autolearn=unavailable version=3.3.0 X-Spam-Checker-Version: SpamAssassin 3.3.0 (2010-01-18) on pooker.samsco.org Cc: Alfred Perlstein , d@delphij.net, freebsd-arch@freebsd.org Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 14 Apr 2010 05:03:29 -0000 On Apr 13, 2010, at 9:45 PM, Bruce Evans wrote: > On Tue, 13 Apr 2010, John Baldwin wrote: >=20 >> On Monday 12 April 2010 8:26:48 pm Xin LI wrote: >>> On 2010/04/12 16:33, Alfred Perlstein wrote: 
>>>> * Xin LI [100412 15:28] wrote: >>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>> Hash: SHA1 >>>>> >>>>> Hi, >>>>> >>>>> Is there a sane way to copyout ioctl request when the returning errno != >>>>> 0? Looking at the code, currently, in sys/kern/sys_generic.c, we have: > > No. You could just do it, but this would be insane since it would > just waste time. > >>>>> >>>>> =========== >>>>> error = kern_ioctl(td, uap->fd, com, data); >>>>> >>>>> if (error == 0 && (com & IOC_OUT)) >>>>> error = copyout(data, uap->data, (u_int)size); >>>>> =========== >>>>> >>>>> Is there any objection if I change it to something like: >>>>> >>>>> =========== >>>>> saved_error = kern_ioctl(td, uap->fd, com, data); >>>>> >>>>> if (com & IOC_OUT) >>>>> error = copyout(data, uap->data, (u_int)size); >>>>> if (saved_error) >>>>> error = saved_error; >>>>> =========== > > errno != 0 means that the ioctl failed, so the contents of the output > buffer (output from the kernel) is indeterminate, so only broken > applications would look at it (except merely insane ones could look > at it and not use the results). More specifically, think of ioctl as a transport mechanism for information. The errno returned by it is a reflection of the state of the transport, not the state of the information transported by it. Layers that use ioctl to transport their information need to use another mechanism to relay the state of those layers and the data transported. errno != 0 means that the ioctl transport failed, period. Or in other words, the transport of information failed. As John pointed out, if you want the client layers of ioctl to convey their status, you need to build that status into the messages that are conveyed over the ioctl, and not overload the ioctl status. If that means changing poorly written apps, then that's what it means. 
Trying to further overload the functionality of ioctl with heuristic guesses is only going to lead to fragility and frustration. Scott From owner-freebsd-arch@FreeBSD.ORG Wed Apr 14 06:38:33 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5A47C106564A; Wed, 14 Apr 2010 06:38:33 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au [211.29.132.183]) by mx1.freebsd.org (Postfix) with ESMTP id D0E7F8FC16; Wed, 14 Apr 2010 06:38:32 +0000 (UTC) Received: from c211-30-173-227.carlnfd1.nsw.optusnet.com.au (c211-30-173-227.carlnfd1.nsw.optusnet.com.au [211.30.173.227]) by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o3E6cS6O027715 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 14 Apr 2010 16:38:29 +1000 Date: Wed, 14 Apr 2010 16:38:28 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Andriy Gapon In-Reply-To: <4BC34402.1050509@freebsd.org> Message-ID: <20100414144336.L12587@delplex.bde.org> References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> <4BC34402.1050509@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, Rick Macklem Subject: Re: (in)appropriate uses for MAXBSIZE X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 14 Apr 2010 06:38:33 -0000 On Mon, 12 Apr 2010, Andriy Gapon wrote: > on 11/04/2010 05:56 Bruce Evans said the following: >> On Fri, 9 Apr 2010, Andriy Gapon wrote: > [snip] >>> I have lightly tested this under qemu. >>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s. 
>>> I removed size > MAXBSIZE check in getblk (see a parallel thread >>> "panic: getblk: >>> size(%d) > MAXBSIZE(%d)"). >> >> Did you change the other known things that depend on this? There is the >> b_pages limit of MAXPHYS bytes which should be checked for in another >> way > > I changed the check the way I described in the parallel thread. I didn't notice anything there about checking MAXPHYS instead of MAXBSIZE. Was an explicit check needed? (An implicit check would probably have worked: most clients were limited by the MAXBSIZE check, and the pbuf client always uses MAXPHYS or DFLTPHYS.) >> and the soft limits for hibufspace and lobufspace which only matter >> under load conditions. > > And what these should be? > hibufspace and lobufspace seem to be auto-calculated. One thing that I noticed > and that was a direct cause of the problem described below, is that difference > between hibufspace and lobufspace should be at least the maximum block size > allowed in getblk() (perhaps it should be strictly equal to that value?). > So in my case I had to make that difference MAXPHYS. Hard to say. They are mostly only heuristics which mostly only matter under heavy loads. You can change the defaults using sysctl but it is even harder to know what changes might be good without knowing the details of the implementation. >>> And I bumped MAXPHYS to 1MB. >>> ... >>> But I observed reading process (dd bs=1m on avgfs) spending a lot of >>> time sleeping >>> on needsbuffer in getnewbuf. needsbuffer value was VFS_BIO_NEED_ANY. >>> Apparently there was some shortage of free buffers. >>> Perhaps some limits/counts were incorrectly auto-tuned. >> >> This is not surprising, since even 64K is 4 times too large to work >> well. Buffer sizes of larger than BKVASIZE (16K) always cause >> fragmentation of buffer kva. ... > > So, BKVASIZE is the best read size from the point of view of buffer space usage? It is the best buffer size, which is almost independent of the best read size. 
First, userland reads will be re-blocked into file-system-block-size reads... > E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read requests, but > leads to buffer space map fragmentation, because of size > BKVASIZE. > On the other hand, four sequential reads of BKVASIZE=16K bytes are perfect from > buffer space point of view (no fragmentation potential) but they result in 4 GEOM Clustering occurs above geom, so geom only sees small requests for small files, random accesses, and buggy cases for sequential accesses to large files where the bugs give partial randomness. E.g., a single 64K read from userland normally gives 4 16K ffs blocks in the buffer cache. Clustering turns these into 1 128K block in a pbuf (64K for the amount read now and 64K for read-ahead; there may be more read-ahead but it would go in another pbuf). geom then sees the 128K (MAXPHYS) block. Most device drivers still only support i/o's of size <= DFLTPHYS, but geom confuses the clustering code into producing clusters larger than its intended maximum of what the device supports by advertising support for MAXPHYS (v_mount->mnt_iosize_max). So geom normally turns the 128K request into 2 64K requests. Clustering finishes by converting the 128K request into 8 16K requests (4 for use now and 4 later for read-ahead). OTOH, the first block of 4 sequential reads of 16K produces the same 128K block at the geom level, modulo bugs in the read-ahead. This now consists of 1 and 7 blocks of normal read and read-ahead, respectively, instead of 4 and 4. Then the next 3 blocks are found in the buffer cache as read-ahead instead of read from the disk (actually, this is insignificantly different from the first case after ffs splits up the 64K into 4 times 16K). So the block size makes almost no difference at the syscall level (512-blocks take significantly more CPU but improve latency, while huge blocks take significantly less CPU but significantly unimprove latency). 
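The block counts in the walk-through above can be checked with a few lines of arithmetic. This is only a sketch: the sizes are the ones quoted in this message (16K ffs blocks, 64K DFLTPHYS, 128K MAXPHYS), the macro and function names are hypothetical, and it assumes a single cluster exactly fills one MAXPHYS-sized pbuf.

```c
#include <stddef.h>

/* Sizes as quoted in the message above; assumed, not read from a kernel. */
#define	DEMO_FS_BSIZE	((size_t)16 * 1024)	/* ffs block size */
#define	DEMO_DFLTPHYS	((size_t)64 * 1024)	/* largest i/o most drivers take */
#define	DEMO_MAXPHYS	((size_t)128 * 1024)	/* size of one cluster (pbuf) */

/* Number of unit-sized pieces needed to cover nbytes, rounding up. */
static size_t
demo_pieces(size_t nbytes, size_t unit)
{
	return ((nbytes + unit - 1) / unit);
}

/* Buffer-cache blocks that a userland read of nbytes is split into. */
static size_t
demo_read_blocks(size_t nbytes)
{
	return (demo_pieces(nbytes, DEMO_FS_BSIZE));
}

/* Requests the driver sees after geom splits one cluster. */
static size_t
demo_driver_requests(void)
{
	return (demo_pieces(DEMO_MAXPHYS, DEMO_DFLTPHYS));
}

/* Blocks of one cluster left over as read-ahead after a read of nbytes. */
static size_t
demo_readahead_blocks(size_t nbytes)
{
	size_t total = demo_pieces(DEMO_MAXPHYS, DEMO_FS_BSIZE);
	size_t now = demo_read_blocks(nbytes);

	return (now >= total ? 0 : total - now);
}
```

A 64K read gives 4 blocks now and 4 of read-ahead, a 16K read gives 1 and 7, and either way the driver sees 2 requests of 64K for the cluster.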
The file system block size makes only secondary differences:
- clustering only works to turn small logical i/o's into large physical
  ones when sequential blocks are allocated sequentially, but always
  allocating blocks sequentially is hard to do and using large file
  system blocks reduces the loss when the allocation is not sequential
- large file system blocks also reduce the amount of work that clustering
  has to do to reblock. This benefit is much smaller than the previous one.
- the buffer cache is only designed to handle medium-sized blocks well.
  With 512-blocks, it can only hold 1/32 as much as with 16K-blocks, so
  it will thrash 32 times as much with the former. Now that the thrashing
  is to VMIO instead of to the disk, this only wastes CPU. With any block
  size larger than BKVASIZE, the buffer cache may become fragmented,
  depending on the combination of block sizes. Mixed combinations are the
  worst, and the system doesn't do anything to try to avoid them. The
  worst case is a buffer cache full of 512-blocks, with getblk() wanting
  to allocate a 64K-block. Then it needs to wait for 32 contiguous blocks
  to become free, or forcibly free some, or move some...

> I/O requests.
> The thing is that a single read requires a single contiguous virtual address space
> chunk. Would it be possible to take the best of both worlds by somehow allowing a
> single large I/O request to work with several buffers (with b_kvasize == BKVASIZE)
> in a iovec-like style?
> Have I just reinvented bicycle? :)
> Probably not, because an answer to my question is probably 'not (without lots of
> work in lots of places)' as well.

Separate buffers already partly provided this, and combined with command queuing in the hardware they provided it completely in perhaps a better way than can be done in software. vfs clustering attempts much less but is still complicated.
It mainly wants to convert buffers that have contiguous disk addresses into a super-buffer that has contiguous virtual memory and combine this with read-ahead, to reduce the number of i/o's. All drives less than 10 years old benefit only marginally from this, since the same cases that vfs clustering can handle are also easy for drive clustering, caching and read-ahead/write-behind (especially the latter) to handle even better, so I occasionally try turning off vfs clustering to see if it makes a difference; unfortunately it still seems to help on all drives, including even reducing total CPU usage despite its own large CPU usage. > I see that breadn() certainly doesn't work that way. As I understand, it works > like bread() for one block plus starts something like 'asynchronous breads()' for > a given count of other blocks. Usually breadn() isn't called, but clustering reads to the end of the current cluster or maybe the next cluster. breadn() was designed when reading ahead a single cluster was helpful. Now, drives read-ahead a whole track or similar probably hundreds of sectors, so reading ahead a single sector is almost useless. It doesn't even reduce the number of i/o's unless it is clustered with the previous i/o. > I am not sure about details of how cluster_read() works, though. > Could you please explain the essence of it? See above. It is essentially the old hack of reading ahead a whole track in software, done in a sophisticated way but with fewer attempts to satisfy disk geometry timing requirements. Long ago, everything was so slow that sequential reads done from userland could not keep up with even a floppy disk, but sequential i/o's done from near the driver could, even with i/o's of only 1 sector. I've only ever seen this working well for floppy disks. 
For hard disks, the i/o's need to be multi-sector, and needed to be related to the disk geometry (handle full tracks and don't keep results from intermediate sectors that are not needed yet iff doing so wouldn't thrash the cache). Now, it is unreasonable to try to know the disk geometry, and vfs clustering doesn't try. Fortunately, this is not much needed, since newer drives have their own track caches which, although they don't form a complete replacement for vfs clustering (see above), they reduces the losses to extra non-physical reads. Similarly for another problem with vfs: all buffers and their clustering are file (vnode) based, which almost forces missing intermediate sectors when reading a file, but a working device (track or similar) in the drive mostly compensates for not having one in the OS. Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Apr 14 08:10:25 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1B038106566C for ; Wed, 14 Apr 2010 08:10:25 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 1E5CD8FC14 for ; Wed, 14 Apr 2010 08:10:23 +0000 (UTC) Received: from porto.topspin.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id LAA19688; Wed, 14 Apr 2010 11:10:15 +0300 (EEST) (envelope-from avg@freebsd.org) Received: from localhost.topspin.kiev.ua ([127.0.0.1]) by porto.topspin.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1O1xfq-0000OY-Qt; Wed, 14 Apr 2010 11:10:14 +0300 Message-ID: <4BC57866.50807@freebsd.org> Date: Wed, 14 Apr 2010 11:10:14 +0300 From: Andriy Gapon User-Agent: Thunderbird 2.0.0.24 (X11/20100321) MIME-Version: 1.0 To: Bruce Evans References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> 
<4BC34402.1050509@freebsd.org> <20100414144336.L12587@delplex.bde.org> In-Reply-To: <20100414144336.L12587@delplex.bde.org> X-Enigmail-Version: 0.96.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, Rick Macklem Subject: Re: (in)appropriate uses for MAXBSIZE X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 14 Apr 2010 08:10:25 -0000 on 14/04/2010 09:38 Bruce Evans said the following: > On Mon, 12 Apr 2010, Andriy Gapon wrote: > >> on 11/04/2010 05:56 Bruce Evans said the following: >>> On Fri, 9 Apr 2010, Andriy Gapon wrote: >> [snip] >>>> I have lightly tested this under qemu. >>>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s. >>>> I removed size > MAXBSIZE check in getblk (see a parallel thread >>>> "panic: getblk: >>>> size(%d) > MAXBSIZE(%d)"). >>> >>> Did you change the other known things that depend on this? There is the >>> b_pages limit of MAXPHYS bytes which should be checked for in another >>> way >> >> I changed the check the way I described in the parallel thread. > > I didn't notice anything there about checking MAXPHYS instead of MAXBSIZE. > Was an explicit check needed? (An implicit check would probably have > worked: most clients were limited by the MAXBSIZE check, and the pbuf > client always uses MAXPHYS or DFLTPHYS.) 
I added this: --- a/sys/kern/vfs_bio.c +++ b/sys/kern/vfs_bio.c @@ -2541,8 +2541,8 @@ getblk(struct vnode * vp, daddr_t blkno, int size, int slpflag, int slptimeo, CTR3(KTR_BUF, "getblk(%p, %ld, %d)", vp, (long)blkno, size); ASSERT_VOP_LOCKED(vp, "getblk"); - if (size > MAXBSIZE) - panic("getblk: size(%d) > MAXBSIZE(%d)\n", size, MAXBSIZE); + if (size > MAXPHYS) + panic("getblk: size(%d) > MAXPHYS(%d)\n", size, MAXPHYS); bo = &vp->v_bufobj; loop: It wasn't really needed 'by default' but, as I said, I use my own "filesystem" for testing and in it I do all kinds of nasty things like huge bread()-s. So I had to add the check to get a nice panic instead of a crash and/or corruption a little bit later. >>> and the soft limits for hibufspace and lobufspace which only matter >>> under load conditions. >> >> And what these should be? >> hibufspace and lobufspace seem to be auto-calculated. One thing that >> I noticed >> and that was a direct cause of the problem described below, is that >> difference >> between hibufspace and lobufspace should be at least the maximum block >> size >> allowed in getblk() (perhaps it should be strictly equal to that value?). >> So in my case I had to make that difference MAXPHYS. > > Hard to say. They are mostly only heuristics which mostly only matter > under > heavy loads. You can change the defaults using sysctl but it is even > harder > to know what changes might be good without knowing the details of the > implementation. Yes. 
I ended up with this change: --- a/sys/kern/vfs_bio.c +++ b/sys/kern/vfs_bio.c @@ -613,7 +613,7 @@ bufinit(void) */ maxbufspace = (long)nbuf * BKVASIZE; hibufspace = lmax(3 * maxbufspace / 4, maxbufspace - MAXBSIZE * 10); - lobufspace = hibufspace - MAXBSIZE; + lobufspace = hibufspace - MAXPHYS; lorunningspace = 512 * 1024; hirunningspace = 1024 * 1024; Otherwise, in situation where we need a buffer of size 'size' and buffspace < lobufspace but buffspace + size > hibufspace, logic in getnewbuf() falls through the cracks. MAXBSIZE => MAXPHYS change is to reflect that we support bread()-s as large as that. >>>> And I bumped MAXPHYS to 1MB. >>>> ... >>>> But I observed reading process (dd bs=1m on avgfs) spending a lot of >>>> time sleeping >>>> on needsbuffer in getnewbuf. needsbuffer value was VFS_BIO_NEED_ANY. >>>> Apparently there was some shortage of free buffers. >>>> Perhaps some limits/counts were incorrectly auto-tuned. >>> >>> This is not surprising, since even 64K is 4 times too large to work >>> well. Buffer sizes of larger than BKVASIZE (16K) always cause >>> fragmentation of buffer kva. ... >> >> So, BKVASIZE is the best read size from the point of view of buffer >> space usage? > > It is the best buffer size, which is almost independent of the best read > size. First, userland reads will be re-blocked into file-system-block-size > reads... Umm, I meant 'bread() size', sorry for not being explicit. It doesn't make much sense to talk about userland in this context. >> E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read >> requests, but >> leads to buffer space map fragmentation, because of size > BKVASIZE. 
>> On the other hand, four sequential reads of BKVASIZE=16K bytes are >> perfect from >> buffer space point of view (no fragmentation potential) but they >> result in 4 GEOM > > Clustering occurs above geom, so geom only sees small requests for small > files, random accesses, and buggy cases for sequential accesses to large > files where the bugs give partial randomness. Yes, provided that cluster API is used. Which is the case for most places in most filesystems, but not 'all in all'. E.g. my own "filesystem", which I mention only for a joke,. And, e.g. msdosfs_readdir, which bread()-s the whole FAT cluster in one go, and, as we learned, that could be a lot (if sector size is large). > E.g., a single 64K read from userland normally gives 4 16K ffs blocks > in the buffer cache. Clustering turns these into 1 128K block in a > pbuf (64K for the amount read now and 128K for read-ahead; there may > be more read-ahead but it would go in another pbuf). Indeed. I missed the fact that cluster I/O uses different kind of buffers from different space. Those are uniformly sized with MAXPHYS. > geom then sees > the 128K (MAXPHYS) block. Most device drivers still only support i/o's > of size <= DFLTPHYS, but geom confuses the clustering code into producing > clusters larger than its intended maximum of what the device supports > by advertising support for MAXPHYS (v_mount->mnt_iosize_max). Oh, I missed this. GEOM setting always si_iosize_max to MAXPHYS seems like a bug. Actual hardware/driver capabilities need to be honored. > So geom > normally turns the 128K request into 2 64K requests. Clustering > finishes by converting the 128K request into 8 16K requests (4 for use > now and 4 later for read-ahead). > > OTOH, the first block of 4 sequential reads of 16K produces the same > 128K block at the geom level, modulo bugs in the read-ahead. This now > consists 1 and 7 blocks of normal read and read-ahead, respectively, > instead of 4 and 4. 
Then the next 3 blocks are found in the buffer > cache as read-ahead instead of read from the disk (actually, this is > insignificantly different from the first case after ffs splits up the > 64K into 4 times 16K). > > So the block size makes almost no difference at the syscall level > (512-blocks take significantly more CPU but improve latency, while > hige blocks take significantly less CPU but significantly unimprove > latency). > > The file system block size makes only secondary differences: > - clustering only works to turn small logical i/o's into large physical > ones when sequential blocks are allocated sequentially, but always > allocating blocks sequentially is hard to do and using large file > system blocks reduces the loss when the allocation is not sequential > - large file system blocks also reduce the amount of work that clustering > has to do to reblock. This benefit is much smaller than the previous > one. > - the buffer cache is only designed to handle medium-sized blocks well. > With 512-blocks, it can only hold 1/32 as much as with 16K-blocks, > so it will thrash 32 times as much with the former. Now that the > thrashing is to VMIO instead of to the disk, this only wastes CPU. > With any block size larger than BKVASIZE, the buffer cache may become > fragmented, depending on the combination of block sizes. Mixed > combinations are the worst, and the system doesn't do anything to > try to avoid them. The worst case is a buffer cache full of 512-blocks, > with getblk() wanting to allocate a 64K-block. Then it needs to > wait for 32 contiguous blocks to become free, or forcibly free some, > or move some... Agree. >> I/O requests. >> The thing is that a single read requires a single contiguous virtual >> address space >> chunk. Would it be possible to take the best of both worlds by >> somehow allowing a >> single large I/O request to work with several buffers (with b_kvasize >> == BKVASIZE) >> in a iovec-like style? 
>> Have I just reinvented bicycle? :) >> Probably not, because an answer to my question is probably 'not >> (without lots of >> work in lots of places)' as well. > > Separate buffers already partly provided, this, and combined with command > queuing in the hardware they provided it completely in perhaps a better > way than can be done in software. > > vfs clustering attempts much less but still complicated. It mainly wants > to convert buffers that have contiguous disk addresses into a super-buffer > that has contiguous virtual memory and combine this with read-ahead, to > reduce the number of i/o's. All drives less than 10 years old benefit > only marginally from this, since the same cases that vfs clustering can > handle are also easy for drive clustering, caching and read-ahead/write- > behind (especially the latter) to handle even better, so I occasionally > try turning off vfs clustering to see if it makes a difference; > unfortunately it still seems to help on all drives, including even > reducing total CPU usage despite its own large CPU usage. I think that this mainly tells that our code doesn't optimally handle non-cluster I/O. For example, all calls of breadn() that I see specify only one read-ahead block. And all bread*()-s are, of course, have fs block size. So, with e.g. typical 8KB blocks we gets lots of GEOM level and hardware level I/O going back and forth. While, as you say, the disks may handle that well, it is not optimal for communication between disks and controllers, and it's definitely bad for GEOM and drivers layer. Essentially, what you wrote in the paragraph below :-) >> I see that breadn() certainly doesn't work that way. As I understand, >> it works >> like bread() for one block plus starts something like 'asynchronous >> breads()' for >> a given count of other blocks. > > Usually breadn() isn't called, but clustering reads to the end of the > current > cluster or maybe the next cluster. 
breadn() was designed when reading > ahead a single cluster was helpful. Now, drives read-ahead a whole track > or similar probably hundreds of sectors, so reading ahead a single sector > is almost useless. It doesn't even reduce the number of i/o's unless it is > clustered with the previous i/o. Yes. (I should have read the whole email before starting my reply). Perhaps, it makes sense now to change breadn() interface and turn it into a simple/cheaper version of cluster_read(). I don't really see much point in passing an array of block numbers and block sizes. E.g. perhaps something like: breadn(vp, startblock, blocksize, count, &bp) and a filesystem must ensure that all the requested blocks are contiguous. Then breadn() could read and read-ahead those blocks using optimal block size, not the fs block size. >> I am not sure about details of how cluster_read() works, though. >> Could you please explain the essence of it? > > See above. It is essentially the old hack of reading ahead a whole > track in software, done in a sophisticated way but with fewer attempts > to satisfy disk geometry timing requirements. Long ago, everything > was so slow that sequential reads done from userland could not keep > up with even a floppy disk, but sequential i/o's done from near the > driver could, even with i/o's of only 1 sector. I've only ever seen > this working well for floppy disks. For hard disks, the i/o's need > to be multi-sector, and needed to be related to the disk geometry > (handle full tracks and don't keep results from intermediate sectors > that are not needed yet iff doing so wouldn't thrash the cache). Now, > it is unreasonable to try to know the disk geometry, and vfs clustering > doesn't try. Fortunately, this is not much needed, since newer drives > have their own track caches which, although they don't form a complete > replacement for vfs clustering (see above), they reduces the losses > to extra non-physical reads. 
Similarly for another problem with vfs: > all buffers and their clustering are file (vnode) based, which almost > forces missing intermediate sectors when reading a file, but a working > device (track or similar) in the drive mostly compensates for not having > one in the OS. Thank you for the explanation. I mostly wondered how clustering worked with buffer cache, somehow I was overlooking the whole pbuf thing and couldn't place all the pieces together. -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 02:34:33 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 04B401065670; Thu, 15 Apr 2010 02:34:33 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 94B0C8FC1E; Thu, 15 Apr 2010 02:34:32 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AvsEAH8XxkuDaFvK/2dsb2JhbACbW3G+M4UNBA X-IronPort-AV: E=Sophos;i="4.52,209,1270440000"; d="scan'208";a="72811042" Received: from fraser.cs.uoguelph.ca ([131.104.91.202]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 14 Apr 2010 22:34:31 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by fraser.cs.uoguelph.ca (Postfix) with ESMTP id 2A7E7109C31C; Wed, 14 Apr 2010 22:34:31 -0400 (EDT) X-Virus-Scanned: amavisd-new at fraser.cs.uoguelph.ca Received: from fraser.cs.uoguelph.ca ([127.0.0.1]) by localhost (fraser.cs.uoguelph.ca [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id D5JnLDqJsxal; Wed, 14 Apr 2010 22:34:30 -0400 (EDT) Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.91.102]) by fraser.cs.uoguelph.ca (Postfix) with ESMTP id 7FE84109C28B; Wed, 14 Apr 2010 22:34:30 -0400 (EDT) Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id o3F2mNg00557; Wed, 14 Apr 2010 
22:48:24 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Wed, 14 Apr 2010 22:48:23 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher.cs.uoguelph.ca To: Bruce Evans In-Reply-To: <20100414135230.U12587@delplex.bde.org> Message-ID: References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> <20100414135230.U12587@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, Andriy Gapon Subject: Re: (in)appropriate uses for MAXBSIZE X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 02:34:33 -0000 On Wed, 14 Apr 2010, Bruce Evans wrote: > On Sun, 11 Apr 2010, Rick Macklem wrote: > >> On Sun, 11 Apr 2010, Bruce Evans wrote: >> >>> Er, the maximum size of buffers in the buffer cache is especially >>> irrelevant for nfs. It is almost irrelevant for physical disks because >>> clustering normally increases the bulk transfer size to MAXPHYS. >>> Clustering takes a lot of CPU but doesn't affect the transfer rate much >>> unless there is not enough CPU. It is even less relevant for network >>> i/o since there is a sort of reverse-clustering -- the buffers get split >>> up into tiny packets (normally 1500 bytes less some header bytes) at >>> the hardware level. ... >> [stuff snipped] > > Indeed, I was only caring about a LAN environment. Especially with > LANs optimized for latency (50-100 uS), nfs performance is poor for > small files, at least for the old nfs client, mainly due to close to > open consistency defeating caching, but not a problem for bulk transfers. 
> And I'll admit I was thinking that for a low latency LAN, a large read/write RPC wouldn't have a negative impact, but it sounds like you've found 16KB to be optimal for this case. For NFSv4, if the client has a delegation for the file, it doesn't have to worry about close/open consistency, so there is some hope w.r.t. small files for this case. > > Clustering is currently only for the local file system, at least for > the old nfs server. nfs just does a VOP_READ() into its own buffer, > with ioflag set to indicate nfs's idea of sequentialness. (User reads > are similar except their uio destination is UIO_USERSPACE instead of > UIO_SYSSPACE and their sequentialness is set generically and thus not > so well (but the nfs setting isn't very good either).) The local file > system then normally does a clustered read into a larger buffer, with > the sequentialness affecting mainly startup (per-file), and virtually > copies the results to the local file system's smaller buffers. VOP_READ() > completes by physically copying the results to nfs's buffer (using > bcopy() for UIO_SYSSPACE and copyout() for UIO_USERSPACE). nfs can't > easily get at the larger clustering buffers or even the local file > system's buffers. It can more easily benefit from larger MAXBSIZE. > There is still the bcopy() to take a lot of CPU and memory bus resources, > but that is insignificant compared with WAN latency. But as I said in > a related thread, even the current MAXBSIZE is too large to use > routinely, due to buffer cache fragmentation causing significant latency > problems, so any increase in MAXBSIZE and/or routine use of buffers > of that size needs to be accompanied by avoiding the fragmentation. > Note that the fragmentation is avoided for the larger clustering buffers > by allocating them from a different pool. > Ah, now I know what you were referring to w.r.t. clustering.
I haven't looked at the mechanism used to allocate buffer space in the buffer cache, so I'll just take your word for it w.r.t. fragmentation. It sounds like the allocation mechanism needs to be thought about if/when MAXBSIZE gets increased. Thanks for your input and I hope I didn't upset you when I jumped on the "I care about WANs" bandwagon, while basically ignoring the LAN case. rick From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 06:41:55 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BC00B106566C for ; Thu, 15 Apr 2010 06:41:55 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (chello089077043238.chello.pl [89.77.43.238]) by mx1.freebsd.org (Postfix) with ESMTP id 0A9018FC18 for ; Thu, 15 Apr 2010 06:41:53 +0000 (UTC) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id D420045CAC; Thu, 15 Apr 2010 08:41:51 +0200 (CEST) Received: from localhost (chello089077043238.chello.pl [89.77.43.238]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 786C645685; Thu, 15 Apr 2010 08:41:46 +0200 (CEST) Date: Thu, 15 Apr 2010 08:41:49 +0200 From: Pawel Jakub Dawidek To: d@delphij.net Message-ID: <20100415064149.GB2252@garage.freebsd.pl> References: <4BC39E93.7060906@delphij.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="+g7M9IMkV8truYOl" Content-Disposition: inline In-Reply-To: <4BC39E93.7060906@delphij.net> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 9.0-CURRENT i386 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-0.6 required=4.5 tests=BAYES_00,RCVD_IN_SORBS_DUL autolearn=no version=3.0.4 Cc: 
freebsd-arch@freebsd.org Subject: Re: _IOWR when errno != 0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 06:41:55 -0000 --+g7M9IMkV8truYOl Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Mon, Apr 12, 2010 at 03:28:35PM -0700, Xin LI wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
>
> Is there a sane way to copyout ioctl request when the returning errno !=
> 0?  Looking at the code, currently, in sys/kern/sys_generic.c, we have:
>
> ===========
>         error = kern_ioctl(td, uap->fd, com, data);
>
>         if (error == 0 && (com & IOC_OUT))
>                 error = copyout(data, uap->data, (u_int)size);
> ===========
>
> Is there any objection if I change it to something like:
>
> ===========
>         saved_error = kern_ioctl(td, uap->fd, com, data);
>
>         if (com & IOC_OUT)
>                 error = copyout(data, uap->data, (u_int)size);
>         if (saved_error)
>                 error = saved_error;
> ===========

I'd like to note that OpenSolaris does copy data back even if an error
occurs. I needed to change ZFS to return 0 for ioctl(2) and return an
error within the zfs_cmd structure. I think the FreeBSD way is better, BTW.
ioctl(2) can fail for other reasons, for example if the data pointer is
invalid, so we return EFAULT and we are unable to copy data back in that
case anyway.

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
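For anyone following along, here is a rough userland sketch of the two policies discussed above, just to make the difference concrete. It is not the sys_generic.c code: sk_copyout() is a made-up stand-in for copyout(9), the has_out flag stands in for (com & IOC_OUT), and all the sk_* names are hypothetical. One thing the sketch does differently from the quoted proposal is that it initialises error, since otherwise the path with no IOC_OUT and no handler error would return an unset value.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for copyout(9): EFAULT on a NULL "user" pointer. */
static int
sk_copyout(const void *kaddr, void *uaddr, size_t len)
{
	if (uaddr == NULL)
		return (EFAULT);
	memcpy(uaddr, kaddr, len);
	return (0);
}

/* Current policy: copy the data back only if the handler succeeded. */
int
sk_finish_ioctl_current(int handler_error, int has_out,
    const void *data, void *udata, size_t size)
{
	int error = handler_error;

	if (error == 0 && has_out)
		error = sk_copyout(data, udata, size);
	return (error);
}

/*
 * Proposed policy: always copy the data back, but let the handler's
 * error take precedence over any copyout error.
 */
int
sk_finish_ioctl_proposed(int handler_error, int has_out,
    const void *data, void *udata, size_t size)
{
	int error = 0;	/* defined result when there is nothing to copy */

	if (has_out)
		error = sk_copyout(data, udata, size);
	if (handler_error != 0)
		error = handler_error;
	return (error);
}
```

With a failing handler, the first variant leaves the caller's buffer untouched while the second fills it and still reports the handler's errno. With an invalid pointer both report EFAULT, which is the case where no data can be returned either way.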
--+g7M9IMkV8truYOl Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iEYEARECAAYFAkvGtSwACgkQForvXbEpPzTcNACg3Iq+vXbNNUIv2Irudz1D7rE3 gjUAoOUhQ2PkIM0C2u6I2OL2gPLkTnZ/ =y1Ws -----END PGP SIGNATURE----- --+g7M9IMkV8truYOl-- From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 10:10:24 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3E8D4106566C; Thu, 15 Apr 2010 10:10:24 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.159]) by mx1.freebsd.org (Postfix) with ESMTP id 9C9158FC13; Thu, 15 Apr 2010 10:10:22 +0000 (UTC) Received: by fg-out-1718.google.com with SMTP id l26so1416648fgb.13 for ; Thu, 15 Apr 2010 03:10:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:date :x-google-sender-auth:received:message-id:subject:from:to:cc :content-type; bh=hCDk0oD70uEo6LrCGM57pj1oCwP7FyVRtghE+cwavbg=; b=Ob/grwGt/NTGbuaGWbyUUXZZYBYkAa9DhNwGPLSSjQZvRfBXyqXZ2SgiCAGG3PJTRZ HuGBZ/dSt2KHKfGbDi2z5zDdjNsBiyRY03BwUC0uRCRtZyU+gcLnIBfmMVv1NFzFUcKC waufk3YpvB1FIx5N2FeMzBMsKfPphHUQx9EDk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type; b=cmgimaGkY2J8z8P3Me/yZfixNVoSEzZHHMFWRLwTz64nO32Qt8hCZAVuBcs4GuFOOW QqFKKJTJdw/DFYYRYD0Z+fA8H+s1DKLLubi4T4VXEBbFN3L+L8buy2SHJWZOp/FS6LRI CShUTr1t3Y2h4p0msHpZifflfH8v83CtEuh4Q= MIME-Version: 1.0 Sender: asmrookie@gmail.com Received: by 10.239.164.140 with HTTP; Thu, 15 Apr 2010 03:10:21 -0700 (PDT) Date: Thu, 15 Apr 2010 12:10:21 +0200 X-Google-Sender-Auth: 7e5271c0cf20c8e6 Received: by 10.239.186.140 with SMTP id g12mr520973hbh.146.1271326221435; Thu, 15 Apr 2010 03:10:21 -0700 (PDT) Message-ID: 
From: Attilio Rao To: freebsd-arch@freebsd.org Content-Type: text/plain; charset=UTF-8 Cc: Giovanni Trematerra Subject: [PATCH] Syncer rewriting X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 10:10:24 -0000

With fundamental help from Giovanni Trematerra and Peter Holm, I rewrote the syncer, following plans and discussions that took place over the last 2 years and were started by Jeff's effort during BSDCan 2008 (for a more complete reference you may check: http://people.freebsd.org/~jeff/bsdcanbuf.pdf ).

Summarizing a bit, the syncer suffers from the following problems:
- Poor scalability: just one thread has to serve all the different mounted filesystems.
- Poor flexibility: the current syncer is only used to sync dirty buffers to disk and nothing else, catering to buffer-cache based filesystems.
- Complex design: in order to DTRT, the syncer needs the help of a syncer vnode and introduces some complex locking patterns. Additionally, as a partial mitigation, a separate queue for the !MPSAFE filesystems might be added.
- Poor performance: this is actually more FS specific than anything. UFS (but I'm not sure if it is the only one), after having synced the dirty vnodes, does a VFS_SYNC() that actually re-syncs all the referenced vnodes. That means dirty vnodes will be synced 2 times in the same timeframe.

The rewrite wants to address all these problems. The main idea is to offer a simple and opaque layer that interacts directly with the VFS and that any filesystem may override in order to offer its own implementation of the syncer. Right now, the layer lives within the VFS_* methods and the mount structure.
More precisely it offers 5 virtual functions (VFS_SYNCER_INIT, VFS_SYNCER_DESTROY, VFS_SYNCER_ATTACH, VFS_SYNCER_DETACH, VFS_SYNCER_SPEEDUP) and an opaque, private pointer for storing syncer-specific data. This means that, for a given filesystem, the syncer design need not be tied to the specific thread/process model used now. Also, this design may be easily extended in order to support more features, if needed.

The syncer, meant as what we have now, becomes the 'standard one' but switches to a different model. It becomes per-mount and thus gets rid of the syncer vnode. This also helps a lot in simplifying the locking within the syncer, because now any thread is responsible only for its own dog-food.

Filesystems specify their own syncer in the vfsops or they receive, by default, the buffer cache "standard" syncer. Current filesystems not using the buffer cache, however, may use the VFS_EOPNOTSUPP specification in order to avoid defining a filesystem syncer at all.

The patch has been tested intensively by trema and pho on a lot of different workloads and filesystems:
http://www.freebsd.org/~attilio/syncer_beta_0.diff

Sparse notes:
- The performance problem, even if the patch doesn't currently address it, may now be easily addressed by skipping syncing in ffs_fsync() for the MNT_LAZY case and having ffs_sync() take care of it.
- The standard syncer may be further improved by getting rid of the bufobj. It should actually handle a list of vnodes rather than a list of bufobjs. However, such optimizations may be done after the patch is ready to enter the tree.
- The mount interlock now protects the bo_flag & BO_ONWORKLST and the synclist iterator, thus there is no need to hold the bufobj lock when accessing them. However, the specifics for checking whether a bufobj is dirty are still protected by the bufobj lock, thus the insertion path still needs it too.
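To make the shape of the layer described above a bit more concrete, here is a minimal userland sketch of a per-mount table with the five hooks plus an opaque private pointer, falling back to a default table when the filesystem supplies none. This is only an illustration under assumptions: every sk_* name, the struct layouts and the toy "default syncer" state are hypothetical and are not taken from syncer_beta_0.diff.

```c
#include <stdlib.h>

struct sk_mount;

/* Hypothetical table, one entry per VFS_SYNCER_* hook named above. */
struct sk_syncer_ops {
	int	(*init)(struct sk_mount *);
	void	(*destroy)(struct sk_mount *);
	int	(*attach)(struct sk_mount *);
	void	(*detach)(struct sk_mount *);
	void	(*speedup)(struct sk_mount *);
};

/* Hypothetical mount: the ops chosen by the fs and its opaque data. */
struct sk_mount {
	const struct sk_syncer_ops *ops;	/* NULL: use the default */
	void	*syncer_priv;			/* syncer-specific data */
};

/* Toy per-mount state for the "standard" syncer of this sketch. */
struct sk_default_state {
	int	running;
	int	speedups;
};

static int
sk_default_init(struct sk_mount *mp)
{
	mp->syncer_priv = calloc(1, sizeof(struct sk_default_state));
	return (mp->syncer_priv == NULL ? -1 : 0);
}

static void
sk_default_destroy(struct sk_mount *mp)
{
	free(mp->syncer_priv);
	mp->syncer_priv = NULL;
}

static int
sk_default_attach(struct sk_mount *mp)
{
	struct sk_default_state *st = mp->syncer_priv;

	st->running = 1;	/* a real syncer would start its thread here */
	return (0);
}

static void
sk_default_detach(struct sk_mount *mp)
{
	struct sk_default_state *st = mp->syncer_priv;

	st->running = 0;
}

static void
sk_default_speedup(struct sk_mount *mp)
{
	struct sk_default_state *st = mp->syncer_priv;

	st->speedups++;		/* only counts requests in this sketch */
}

static const struct sk_syncer_ops sk_default_ops = {
	sk_default_init, sk_default_destroy,
	sk_default_attach, sk_default_detach, sk_default_speedup
};

/* Pick the filesystem's own syncer if it has one, else the default. */
const struct sk_syncer_ops *
sk_syncer_ops_for(const struct sk_mount *mp)
{
	return (mp->ops != NULL ? mp->ops : &sk_default_ops);
}
```

Keeping init/destroy separate from attach/detach mirrors the split in the hook list: state can exist (and be filled) before the syncer is actually started for that mount.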
Notably, the things I would like to receive comments on are mostly linked to the default syncer:
- I didn't use any form of thread consolidation for threads automatically assigned by the default syncer. We may have different opinions and good arguments on it.
- Something we might want to think about is the !SMP case. Maybe we don't want the multi-thread approach for that case? Should we fall back to the current approach for !SMP?
- Right now VFS_SYNCER_INIT() and VFS_SYNCER_DESTROY() are used not only for flexibility but also out of necessity by the default syncer. Some consumers may want to fill in the workitem queues before the syncer starts (VFS_SYNCER_ATTACH()) and you may not want to lose such queued vnodes. This approach is good and also makes it possible to support mount state updates simply, without losing information, but it has the disadvantage of allocating structures for filesystems that may forever be RO.

More testing, reviews and comments are undoubtedly welcome at this point.

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A.
Einstein From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 05:43:39 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7956D106566B for ; Thu, 15 Apr 2010 05:43:39 +0000 (UTC) (envelope-from paul-zimmerman@sbcglobal.net) Received: from web80804.mail.mud.yahoo.com (web80804.mail.mud.yahoo.com [209.191.72.108]) by mx1.freebsd.org (Postfix) with SMTP id EA9FE8FC15 for ; Thu, 15 Apr 2010 05:43:38 +0000 (UTC) Received: (qmail 21632 invoked by uid 60001); 15 Apr 2010 05:16:57 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sbcglobal.net; s=s1024; t=1271308617; bh=PCEKahc6hSmdZD44IFGzbQ8sy+Z59MVOsHvLZ83tG5U=; h=Message-ID:X-YMail-OSG:Received:X-Mailer:Date:From:Subject:To:Cc:MIME-Version:Content-Type; b=vG8c7OZmH18r0lkRHMBLCwbCK6Qe+k3LN14bFih05969Szr5cRurCP0SuI7bC/79x/MhnpdJHcjaVnfp602V637oTSNF2jQTMjfdFPWW2EUUYz39gZ1VdMcb7NLbnwSMUPHcy/OIl7RhUrPxcvtyse/JcEUQKvoxJBWW52abzZU= DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=sbcglobal.net; h=Message-ID:X-YMail-OSG:Received:X-Mailer:Date:From:Subject:To:Cc:MIME-Version:Content-Type; b=bOnORIeKQXBggHz0Dhe6QCPDQ7+5P6RKU9eOea1VV7cHloRljDlHb1qR34e5vp5MaixEhkhq6jsZGwT760efO3IfdeldFCwv7LTAj8TiCxS30CAgk4SJZ+jTVivtjtSLNvfIfa0gm4MB0d01QAP6rgaFqMG/G7KnRKT4R3N8U9M=; Message-ID: <493092.21442.qm@web80804.mail.mud.yahoo.com> X-YMail-OSG: f2v4k68VM1kgqyRe_6Bwoi9uwrTy2klK6I5eZxNaHjZ_hpn ojcz.hgHTQnlPFsR6JYVtDyGIaPNqd.iSoPQZGIPPNqzhR3zl8KL_o.eU_kj zK2_2zz7KJuru5stHu9PRSYui0jpVwAoDefu9xYUy3oITB71rq58wrroFjBF L7fT4ra976aHBgmidIgFSTvSFTX6JRLFOAIUrMOWmKB_tgId.9wB0J6VVc1V 2hUyrbIwOtkCog1qZFENhamuO451ZO32CjQm4xqdTixOGAHilArpktzkh70f NyvgqnCYUcEEnNGZrVK3PyYUz4hZEgEqRkPTMq2cwEKfjyBblCeqgbUDF18Q 0rlIsefVCKc9a6oMWB2be7_6Iyg-- Received: from [75.52.253.215] by web80804.mail.mud.yahoo.com via HTTP; Wed, 14 Apr 2010 22:16:57 PDT X-Mailer: YahooMailClassic/10.0.8 YahooMailWebService/0.8.100.260964 Date: Wed, 14 
Apr 2010 22:16:57 -0700 (PDT) From: Paul Zimmerman To: peterjeremy@acm.org, freebsd-arch@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailman-Approved-At: Thu, 15 Apr 2010 11:06:35 +0000 Cc: bruce@cran.org.uk, ed@80386.nl, scottl@samsco.org, matthew.fleming@isilon.com, avg@icyb.net.ua, rwatson@freebsd.org, ivoras@freebsd.org, stefan@fafoe.narf.at, max@love2party.net Subject: Re: likely and unlikely X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 05:43:39 -0000 On 2010-Mar-21 19:52:40 -0600, Peter Jeremy wrote: >I suspect predict_true/predict_false is unlikely to help in most cases. > >What would probably be more useful for Atom would be gcc scheduling >support. This is available in gcc 4.3 (ie GPL3) but not in gcc 4.2. >I've had a look at dumping the gcc 4.3 Atom scheduler into my gcc 4.2 >but the infrastructure has changed sufficiently that this would be a >non-trivial task. (And since it would not be committable, I don't >think it's worth my time). Likewise, implementing scheduling from >scratch in gcc 4.2 would be a non-trivial task. Just FYI, the use of likely/unlikely in the Linux kernel is not for branch prediction. It is a hint to gcc which branch of the if() should be moved out-of-line. The idea is to reduce the cache footprint of the most frequently executed code paths. 
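The code-layout hint described above can be sketched in C as follows. This is a minimal illustration, not code from the Linux or FreeBSD trees: the macros are written in the style of likely()/unlikely() on top of the GCC/Clang builtin __builtin_expect, and sum_bytes() is a hypothetical function invented for the example. Whether the cold branch is actually moved out of line depends on the compiler and optimization level.

```c
#include <stddef.h>

/*
 * Illustrative macros in the style of likely()/unlikely().
 * __builtin_expect() is a GCC/Clang builtin; "!!" normalizes the
 * condition to 0 or 1 before it is compared with the expected value.
 */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/*
 * Hypothetical example: sum the bytes of a buffer.  The NULL check is
 * marked unlikely, so the compiler is free to place the error return
 * away from the hot loop and keep the common path compact.
 */
static long
sum_bytes(const unsigned char *buf, size_t len)
{
	long total = 0;
	size_t i;

	if (unlikely(buf == NULL))
		return (-1);		/* rare error path */
	for (i = 0; i < len; i++)
		total += buf[i];	/* frequently executed path */
	return (total);
}
```

The hint changes only code placement, never the result: sum_bytes() returns the same value whether or not the annotation is present.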
-- Paul From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 14:31:32 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D0558106566C; Thu, 15 Apr 2010 14:31:32 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 7F5288FC54; Thu, 15 Apr 2010 14:31:32 +0000 (UTC) Received: from ds4.des.no (des.no [84.49.246.2]) by smtp.des.no (Postfix) with ESMTP id 0F1871FFC22; Thu, 15 Apr 2010 14:31:31 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 3946E844DA; Thu, 15 Apr 2010 16:30:59 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Attilio Rao References: Date: Thu, 15 Apr 2010 16:30:58 +0200 In-Reply-To: (Attilio Rao's message of "Thu, 15 Apr 2010 12:10:21 +0200") Message-ID: <86sk6waiu5.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.95 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: Giovanni Trematerra , freebsd-arch@freebsd.org Subject: Re: [PATCH] Syncer rewriting X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 14:31:32 -0000 Attilio Rao writes: > With a fundamental aid by Giovanni Trematerra and Peter Holm, I > rewrote the syncer following plans and discussions happened > over the last 2 years and started by a Jeff's effort during BSDCan > 2008 (for a more complete reference you may check: > http://people.freebsd.org/~jeff/bsdcanbuf.pdf ). This is great! The "lemming syncer" we currently have has been a thorn in our side for years. 
DES
--
Dag-Erling Smørgrav - des@des.no
From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 14:38:26 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B65CD106566C; Thu, 15 Apr 2010 14:38:26 +0000 (UTC) (envelope-from daniel.rodrick@gmail.com) Received: from mail-pw0-f54.google.com (mail-pw0-f54.google.com [209.85.160.54]) by mx1.freebsd.org (Postfix) with ESMTP id 89F808FC0A; Thu, 15 Apr 2010 14:38:26 +0000 (UTC) Received: by pwi9 with SMTP id 9so1175932pwi.13 for ; Thu, 15 Apr 2010 07:38:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:date:received:message-id :subject:from:to:content-type; bh=E4GoelLAQblwS/5ZeLMtrtTR1Q2UTbUEPPGFNmTMaqk=; b=QADC22lZI7RKVtdDy4T7xOYqrbriNPms//Cv7LiIYserDnH0Gx7cw7WF09d5f25b6M QfbE4yfvh5e30H/tlG0rA1oZU24B/M9cjrV2QM1+ZAqdfPzDz2KnkNY0nMYbQTa2IBBy 8FhfFU/3Tm1IRcFjZ3leVZ2WmqYvmyUSdgWbc= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type; b=HD+LNdyeBNe2Je4fduT3kKpMTv1GSTmvDU55qnXRURWZRCec9QmKfLRM+EPzma41bL lFuyxorHHav4ieWcNFAdS5KG+FNSEXfjhjauyLpKzKdj+rRXExTKc4VsDg2S7UTr4NB3 JyOtLKq4jtroteOqTmNZD6Ybp2vyfw6VHgl9s= MIME-Version: 1.0 Received: by 10.142.230.18 with HTTP; Thu, 15 Apr 2010 07:38:25 -0700 (PDT) Date: Thu, 15 Apr 2010 20:08:25 +0530 Received: by 10.142.247.33 with SMTP id u33mr154860wfh.44.1271342305705; Thu, 15 Apr 2010 07:38:25 -0700 (PDT) Message-ID: From: Daniel Rodrick To: freebsd-arch@freebsd.org, freebsd-drivers@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Cc: Subject: Multiple PCI controllers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr
2010 14:38:26 -0000 Hello, Can some one please help me understand how did the old FreeBSD kernel that DID not have the PCI domains concept (say 6.x) used to deal with systems that had multiple PCI / PCIe controllers on them, from a bus numbering point of view? Was there a unified PCI tree - thus each PCI bus number being unique in the system? Also, how is this dealt with now? Dan From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 14:58:26 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 015BF106564A; Thu, 15 Apr 2010 14:58:25 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 802CF8FC08; Thu, 15 Apr 2010 14:58:25 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3FEljvH072965; Thu, 15 Apr 2010 08:47:45 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Thu, 15 Apr 2010 08:48:00 -0600 (MDT) Message-Id: <20100415.084800.714788496340685106.imp@bsdimp.com> To: daniel.rodrick@gmail.com From: "M. 
Warner Losh" In-Reply-To: References: X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-drivers@freebsd.org, freebsd-arch@freebsd.org Subject: Re: Multiple PCI controllers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 14:58:26 -0000 In message: Daniel Rodrick writes: : Can some one please help me understand how did the old FreeBSD kernel : that DID not have the PCI domains concept (say 6.x) used to deal with : systems that had multiple PCI / PCIe controllers on them, from a bus : numbering point of view? Was there a unified PCI tree - thus each PCI : bus number being unique in the system? FreeBSD has handled multiple PCI domains for a very long time. The support was added so that the Alpha machines could run FreeBSD. The bus numbers were whatever the BIOS programmed them to be. FreeBSD doesn't program bus numbers at all, except in some very limited cases. : Also, how is this dealt with now? The same. Each host controller will have a pci device tree under it. 
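To picture the bus-numbering question, here is a minimal sketch in C, assuming a device is identified by a (domain, bus, slot, function) tuple. The structure and function names are hypothetical and chosen for illustration only; they are not the kernel's actual interfaces. The point is simply that two host controllers can each number their root bus 0 without ambiguity as long as the domain is part of the address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device address; field names are illustrative only. */
struct pci_addr {
	uint32_t domain;	/* which host controller's device tree */
	uint8_t	 bus;
	uint8_t	 slot;
	uint8_t	 func;
};

/* Full comparison: two addresses match only within the same domain. */
static bool
pci_addr_equal(const struct pci_addr *a, const struct pci_addr *b)
{
	return (a->domain == b->domain && a->bus == b->bus &&
	    a->slot == b->slot && a->func == b->func);
}

/*
 * Comparison that ignores the domain, as a flat numbering scheme would.
 * Devices under different host controllers can collide here.
 */
static bool
pci_addr_equal_flat(const struct pci_addr *a, const struct pci_addr *b)
{
	return (a->bus == b->bus && a->slot == b->slot &&
	    a->func == b->func);
}
```

With two devices at bus 0, slot 1, function 0 under different host controllers, pci_addr_equal() tells them apart while pci_addr_equal_flat() reports a collision.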
Warner From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 18:30:09 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A4D311065674; Thu, 15 Apr 2010 18:30:09 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 762878FC1C; Thu, 15 Apr 2010 18:30:09 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 0E83D46B81; Thu, 15 Apr 2010 14:30:09 -0400 (EDT) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id 5B2358A01F; Thu, 15 Apr 2010 14:30:08 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Thu, 15 Apr 2010 13:11:18 -0400 User-Agent: KMail/1.12.1 (FreeBSD/7.3-CBSD-20100217; KDE/4.3.1; amd64; ; ) References: In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201004151311.18487.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Thu, 15 Apr 2010 14:30:08 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: Daniel Rodrick , freebsd-drivers@freebsd.org Subject: Re: Multiple PCI controllers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 18:30:09 -0000 On Thursday 15 April 2010 10:38:25 am Daniel Rodrick wrote: > Hello, > > Can some one please help me 
understand how did the old FreeBSD kernel > that DID not have the PCI domains concept (say 6.x) used to deal with > systems that had multiple PCI / PCIe controllers on them, from a bus > numbering point of view? Was there a unified PCI tree - thus each PCI > bus number being unique in the system? I think there were not multiple-domain machines that FreeBSD ran on in previous releases in general. Some alpha machines had multiple domains (the alpha port referred to them as 'hoses') and the support was incomplete (VGA cards had to be in domain 0 for FreeBSD to see them IIRC). I am not personally aware of any x86 machines with multiple domains. I believe the x86 port only supports domain 0 currently. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 20:36:35 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3DC701065673 for ; Thu, 15 Apr 2010 20:36:35 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id DDD838FC1B for ; Thu, 15 Apr 2010 20:36:34 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3FKVWa8077210 for ; Thu, 15 Apr 2010 14:31:32 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Thu, 15 Apr 2010 14:31:47 -0600 (MDT) Message-Id: <20100415.143147.69510145118168557.imp@bsdimp.com> To: arch@freebsd.org From: "M. 
Warner Losh" X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: Subject: TARGET_BIG_ENDIAN branch collapse X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 20:36:35 -0000

I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN stuff. You can find a diff at http://people.freebsd.org/~imp/tbemd-20100415.diff.

Highlights include:
 o Eliminating TARGET_BIG_ENDIAN entirely
 o Eliminating the setting of endian flags in sys.mk and bsd.cpu.mk
 o Moving from mips to mipseb and mipsel for MACHINE_ARCH. [*]
 o Moving from arm to armeb and arm for MACHINE_ARCH. [**]
 o Creating MACHINE_CPUARCH, which is the set of architectures that's supported. The 'mips' CPUARCH will support MACHINE_ARCH of mipsel, mipseb, mips64eb, mips64el, for example. This means that in many of the places where we used to use MACHINE_ARCH we now use MACHINE_CPUARCH.
 o Moving to including Makefile.${MACHINE}, Makefile.${MACHINE_ARCH}, or Makefile.${MACHINE_CPUARCH}, in that order, to select or deselect portions of FreeBSD. We already did this for places like libc. I'm just generalizing it.
 o Some minor tweaks to gcc and binutils to make the build work with the new paradigm.

Please send me your comments and suggestions. I plan on starting to integrate some of these technologies into head soon (as well as coordinating with Juli Mallett's work to bring new ABIs to MIPS).

This is all orthogonal to MACHINE_CPUTYPE and MACHINE_CPU[***], which will remain unchanged in FreeBSD.

Comments?
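The relationship between MACHINE_ARCH and MACHINE_CPUARCH can be pictured as a many-to-one lookup. The sketch below is in C purely for illustration; the real mechanism lives in the make infrastructure, and the table and the cpuarch_for() helper are hypothetical. The mips rows follow the list above. The arm rows and the fall-through rule (an architecture not listed maps to itself) are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping table; not taken from the actual diff. */
struct arch_map {
	const char *machine_arch;	/* specific ABI/endianness variant */
	const char *machine_cpuarch;	/* CPU family it belongs to */
};

static const struct arch_map arch_table[] = {
	{ "mipsel",	"mips" },
	{ "mipseb",	"mips" },
	{ "mips64el",	"mips" },
	{ "mips64eb",	"mips" },
	{ "arm",	"arm" },	/* assumption: little-endian keeps its name */
	{ "armeb",	"arm" },	/* assumption: grouped under the arm family */
};

/*
 * Return the CPU family for a MACHINE_ARCH value.  Assumption: any
 * architecture not in the table is its own family.
 */
static const char *
cpuarch_for(const char *machine_arch)
{
	size_t i;

	for (i = 0; i < sizeof(arch_table) / sizeof(arch_table[0]); i++) {
		if (strcmp(arch_table[i].machine_arch, machine_arch) == 0)
			return (arch_table[i].machine_cpuarch);
	}
	return (machine_arch);
}
```

Code that cares only about the CPU family would key off the result of such a lookup, while code that depends on endianness or word size would keep using the specific MACHINE_ARCH value.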
Warner [*] While I generally don't want to talk about names here, since I've selected the names used by NetBSD, Linux and binutils/gcc, there may be some tweaking in the final values as these groups have minor variations in naming mips which complicates things... [**] These names are well established and consistent among all the groups. [***] NetBSD calls MACHINE_CPUARCH just MACHINE_CPU, but since we're already using that for something else, I had to diverge. From owner-freebsd-arch@FreeBSD.ORG Thu Apr 15 22:37:08 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7E8B81065672 for ; Thu, 15 Apr 2010 22:37:08 +0000 (UTC) (envelope-from glebius@FreeBSD.org) Received: from cell.glebius.int.ru (glebius.int.ru [81.19.64.117]) by mx1.freebsd.org (Postfix) with ESMTP id 81E208FC08 for ; Thu, 15 Apr 2010 22:37:06 +0000 (UTC) Received: from cell.glebius.int.ru (localhost [127.0.0.1]) by cell.glebius.int.ru (8.14.3/8.14.3) with ESMTP id o3FMbDw3038866 for ; Fri, 16 Apr 2010 02:37:13 +0400 (MSD) (envelope-from glebius@FreeBSD.org) Received: (from glebius@localhost) by cell.glebius.int.ru (8.14.3/8.14.3/Submit) id o3FMbDRK038865 for arch@freebsd.org; Fri, 16 Apr 2010 02:37:13 +0400 (MSD) (envelope-from glebius@FreeBSD.org) X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to glebius@FreeBSD.org using -f Date: Fri, 16 Apr 2010 02:37:13 +0400 From: Gleb Smirnoff To: arch@FreeBSD.org Message-ID: <20100415223713.GF97761@FreeBSD.org> References: <20100326211706.GI18894@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=koi8-r Content-Disposition: inline In-Reply-To: <20100326211706.GI18894@FreeBSD.org> User-Agent: Mutt/1.5.20 (2009-06-14) Cc: Subject: Re: touch panel support X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: 
List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Apr 2010 22:37:08 -0000

Here is my further hacking on getting a touch panel working with FreeBSD.

On Sat, Mar 27, 2010 at 12:17:06AM +0300, Gleb Smirnoff wrote:
T> And then I've got a problem. Our mouse subsystem is not ready for
T> touch panels. Our mouse(4) protocol does not support mouse driver
T> passing _absolute_ coordinates to the mouse(4) subsystem. It only
T> expects a relative movement of the mouse. But _absolute_ coordinates
T> are principal idea of any touch panel.

Well, syscons actually has support for absolute mouse movement. We need just this tiny patch to get it working.

Index: scmouse.c
===================================================================
--- scmouse.c (revision 206501)
+++ scmouse.c (working copy)
@@ -700,6 +700,7 @@
         scp->mouse_xpos = mouse->u.data.x;
         scp->mouse_ypos = mouse->u.data.y;
         set_mouse_pos(scp);
+        goto motion;
         splx(s);
         break;

@@ -732,6 +733,7 @@
             cur_scp->mouse_ypos += mouse->u.data.y;
             set_mouse_pos(cur_scp);
         }
+motion:
         f = 0;
         if (mouse->operation == MOUSE_ACTION) {
             f = cur_scp->mouse_buttons ^ mouse->u.data.buttons;

This also requires userland (moused(8)) to put absolute coordinates into mouse->u.data and to issue the MOUSE_MOVEABS command instead of MOUSE_ACTION. The patch to moused(8) looks like the following.
First, we need to recognize a mouse protocol that works with absolute coordinates:

@@ -1584,6 +1629,9 @@
         }
     }

+    if (rodent.mode.protocol == MOUSE_PROTO_EGALAX)
+        rodent.flags |= AbsoluteXY;
+
     debug("proto params: %02x %02x %02x %02x %d %02x %02x",
         cur_proto[0], cur_proto[1], cur_proto[2], cur_proto[3],
         cur_proto[4], cur_proto[5], cur_proto[6]);
@@ -2170,6 +2218,22 @@
         prev_y = y;
         break;

+    case MOUSE_PROTO_EGALAX: /* eGalax */
+        x = (pBuf[1] << 7) | pBuf[2];
+        y = (pBuf[3] << 7) | pBuf[4];
+
+        act->flags = 0;
+        act->button = 0; /* TODO */
+        if (x != prev_x || y != prev_y) {
+            act->dx = prev_x = x;
+            act->dy = prev_y = y;
+            act->flags |= MOUSE_POSCHANGED;
+        }
+
+        return (act->flags);
+
+        break;
+
     case MOUSE_PROTO_BUS: /* Bus */
     case MOUSE_PROTO_INPORT: /* InPort */
         act->button = butmapmsc[(~pBuf[0]) & MOUSE_MSC_BUTTONS];

Then we need to pass these absolute coords to mouse(4):

@@ -1295,11 +1335,14 @@
     if (action2.flags & MOUSE_POSCHANGED) {
         mouse.operation = MOUSE_MOTION_EVENT;
         mouse.u.data.buttons = action2.button;
-        if (rodent.flags & ExponentialAcc) {
+        if (rodent.flags & AbsoluteXY) {
+            absmove(action2.dx, action2.dy,
+                &mouse.u.data.x, &mouse.u.data.y);
+            mouse.operation = MOUSE_MOVEABS;
+        } else if (rodent.flags & ExponentialAcc) {
             expoacc(action2.dx, action2.dy,
                 &mouse.u.data.x, &mouse.u.data.y);
-        }
-        else {
+        } else {
             linacc(action2.dx, action2.dy,
                 &mouse.u.data.x, &mouse.u.data.y);
         }
@@ -1311,11 +1354,14 @@
     } else {
         mouse.operation = MOUSE_ACTION;
         mouse.u.data.buttons = action2.button;
-        if (rodent.flags & ExponentialAcc) {
+        if (rodent.flags & AbsoluteXY) {
+            absmove(action2.dx, action2.dy,
+                &mouse.u.data.x, &mouse.u.data.y);
+            mouse.operation = MOUSE_MOVEABS;
+        } else if (rodent.flags & ExponentialAcc) {
             expoacc(action2.dx, action2.dy,
                 &mouse.u.data.x, &mouse.u.data.y);
-        }
-        else {
+        } else {
             linacc(action2.dx, action2.dy,
                 &mouse.u.data.x, &mouse.u.data.y);
         }

The absmove() function should perform calibration and then assign u.data.x to calibrated x from
touchpanel, and y accordingly.

Doing calibration in moused(8) is a problem: userland can't guess or access the pixel size of syscons. Currently I've just hardcoded the calibration into absmove() with my values, and got all this stuff working. The stylus moves the mouse pointer flawlessly and correctly in syscons. :)

But, unfortunately, this is only step zero towards a touchscreen working in X. Although we got a working absolute mouse pointer in syscons, we can't pass it through the sysmouse(4) protocol :(

--
Totus tuus, Glebius.
From owner-freebsd-arch@FreeBSD.ORG Fri Apr 16 01:54:31 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 357431065670; Fri, 16 Apr 2010 01:54:31 +0000 (UTC) (envelope-from grehan@freebsd.org) Received: from dommail.onthenet.com.au (dommail.OntheNet.com.au [203.13.70.57]) by mx1.freebsd.org (Postfix) with ESMTP id 4222D8FC0A; Fri, 16 Apr 2010 01:54:29 +0000 (UTC) Received: from dallas-lxp.hq.netapp.com (c-67-190-167-186.hsd1.co.comcast.net [67.190.167.186]) by dommail.onthenet.com.au (MOS 4.1.8-GA) with ESMTP id ALT72123 (AUTH peterg@ptree32.com.au); Fri, 16 Apr 2010 11:42:42 +1000 Message-ID: <4BC7C08C.2050002@freebsd.org> Date: Thu, 15 Apr 2010 19:42:36 -0600 From: Peter Grehan User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228) MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: <201004151311.18487.jhb@freebsd.org> In-Reply-To: <201004151311.18487.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Daniel Rodrick , freebsd-drivers@freebsd.org Subject: Re: Multiple PCI controllers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Apr 2010 01:54:31 -0000
> I think there were not
multiple-domain machines that FreeBSD ran on in > previous releases in general. Power Macs have up to 3 PCI buses, with each one having bus number 0 at the host bridge. FreeBSD 6.* was fine with that, except if there was a conflict in bus/slot/function. Fortunately it looked like OpenFirmware was careful to avoid creating these conflicts when doing bus assignment. later, Peter. From owner-freebsd-arch@FreeBSD.ORG Fri Apr 16 08:11:46 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E6800106564A for ; Fri, 16 Apr 2010 08:11:46 +0000 (UTC) (envelope-from gary.jennejohn@freenet.de) Received: from mout7.freenet.de (mout7.freenet.de [IPv6:2001:748:100:40::2:9]) by mx1.freebsd.org (Postfix) with ESMTP id 801718FC13 for ; Fri, 16 Apr 2010 08:11:46 +0000 (UTC) Received: from [195.4.92.15] (helo=5.mx.freenet.de) by mout7.freenet.de with esmtpa (ID gary.jennejohn@freenet.de) (port 25) (Exim 4.72 #3) id 1O2geP-0001z0-2X; Fri, 16 Apr 2010 10:11:45 +0200 Received: from p57ae0f02.dip0.t-ipconnect.de ([87.174.15.2]:56989 helo=ernst.jennejohn.org) by 5.mx.freenet.de with esmtpa (ID gary.jennejohn@freenet.de) (port 25) (Exim 4.72 #3) id 1O2geO-0004Mr-Ps; Fri, 16 Apr 2010 10:11:45 +0200 Date: Fri, 16 Apr 2010 10:11:44 +0200 From: Gary Jennejohn To: "M. 
Warner Losh" Message-ID: <20100416101144.68e8beb8@ernst.jennejohn.org> In-Reply-To: <20100415.143147.69510145118168557.imp@bsdimp.com> References: <20100415.143147.69510145118168557.imp@bsdimp.com> X-Mailer: Claws Mail 3.7.5 (GTK+ 2.18.7; amd64-portbld-freebsd9.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org Subject: Re: TARGET_BIG_ENDIAN branch collapse X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: gary.jennejohn@freenet.de List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Apr 2010 08:11:47 -0000 On Thu, 15 Apr 2010 14:31:47 -0600 (MDT) "M. Warner Losh" wrote: > I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN > stuff. You can find a diff at > http://people.freebsd.org/~imp/tbemd-20100415.diff. > fetch http://people.freebsd.org/~imp/tbemd-20100415.diff fetch: http://people.freebsd.org/~imp/tbemd-20100415.diff: Not Found -- Gary Jennejohn From owner-freebsd-arch@FreeBSD.ORG Fri Apr 16 08:23:05 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 50132106566C; Fri, 16 Apr 2010 08:23:05 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 133A88FC1A; Fri, 16 Apr 2010 08:23:04 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id AE23090193; Fri, 16 Apr 2010 08:23:03 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.4/8.14.4) with ESMTP id o3G8N30A029918; Fri, 16 Apr 2010 08:23:03 GMT (envelope-from phk@critter.freebsd.dk) To: Attilio Rao From: "Poul-Henning Kamp" In-Reply-To: Your message of "Thu, 
15 Apr 2010 12:10:21 +0200." Date: Fri, 16 Apr 2010 08:23:03 +0000 Message-ID: <29917.1271406183@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Giovanni Trematerra , freebsd-arch@freebsd.org Subject: Re: [PATCH] Syncer rewriting X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Apr 2010 08:23:05 -0000

In message , Attilio Rao writes:

>The syncer, meant as what we have now, becomes the 'standard one' but
>switches to a different model. It becomes per-mount and it then gets
>rid of the syncer vnode. This also helps in simplifying a lot the
>locking within the syncer because now any thread is responsible only
>for its own dog-food.

YeeeeEEEEEHAAAAA!

Go! Go! GO!

>- The standard syncer may be further improved getting rid of the
>bufobj. It should actually handle a list of vnodes rather than a list
>of bufobj. However similar optimizations may be done after the patch
>is ready to enter the tree.

That would be the wrong direction: we need the bufobj because, for instance, a RAID5 geom module does not have a vnode for the parity data.

If you force the syncer to only work on vnodes, then we need a parallel mechanism for non-filesystem disk users.

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
From owner-freebsd-arch@FreeBSD.ORG Fri Apr 16 08:36:11 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CD6241065673; Fri, 16 Apr 2010 08:36:11 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au [211.29.132.189]) by mx1.freebsd.org (Postfix) with ESMTP id 629D08FC17; Fri, 16 Apr 2010 08:36:10 +0000 (UTC) Received: from c122-106-149-225.carlnfd1.nsw.optusnet.com.au (c122-106-149-225.carlnfd1.nsw.optusnet.com.au [122.106.149.225]) by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o3G8a7AV026419 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 16 Apr 2010 18:36:08 +1000 Date: Fri, 16 Apr 2010 18:36:07 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Rick Macklem In-Reply-To: Message-ID: <20100416181926.F1082@delplex.bde.org> References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> <20100414135230.U12587@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, Andriy Gapon Subject: Re: (in)appropriate uses for MAXBSIZE X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Apr 2010 08:36:11 -0000 On Wed, 14 Apr 2010, Rick Macklem wrote: > On Wed, 14 Apr 2010, Bruce Evans wrote: > [stuff snipped] >> >> Indeed, I was only caring about a LAN environment. Especially with >> LANs optimized for latency (50-100 uS), nfs performance is poor for >> small files, at least for the old nfs client, mainly due to close to >> open consistency defeating caching, but not a problem for bulk transfers. 
>
> And I'll admit I was thinking that for a low latency LAN, a large read/write
> RPC wouldn't have a negative impact, but it sounds like
> you've found 16Kb to be optimal for this case.

I'll try to find old benchmark results or repeat the benchmarks.

> For NFSv4, if the client has a delegation for the file, it doesn't
> have to worry about close/open consistency, so there is some hope w.r.t.
> small files for this case.

Do you have benchmarks? A kernel build (without -j) is a good test. Due to include bloat and include nesting bloat, a kernel build opens and closes the same small include files hundreds or thousands of times each, with O(10^5) includes altogether, so an RPC to read attributes on each open costs a lot of latency. nfs on a LAN does well to take only 10% longer than a local file system on a LAN and after disabling close/open consistency takes only about half as much longer, by reducing the number of RPCs by about a factor of 2. The difference should be even more noticeable on a WAN. Building with -j reduces the extra length by not stalling the whole build waiting for each RPC. I probably needed it to take only 10% longer.
Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Apr 16 09:42:02 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 20569106564A for ; Fri, 16 Apr 2010 09:42:02 +0000 (UTC) (envelope-from pluknet@gmail.com) Received: from mail-bw0-f214.google.com (mail-bw0-f214.google.com [209.85.218.214]) by mx1.freebsd.org (Postfix) with ESMTP id 9C5808FC16 for ; Fri, 16 Apr 2010 09:42:01 +0000 (UTC) Received: by bwz6 with SMTP id 6so2115519bwz.13 for ; Fri, 16 Apr 2010 02:42:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:received:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=JQBcp03ZKGd/axA2/VCpCJuUYub+08D1NVszQvkEA2c=; b=o3WiQFpvPpNCa3MQX12pfKGsmjaz/FCKhw90UMfDuPc9/Io6VQRKUC1B+koOinMKXf KroqZw6HSki3ghOTyUWPK2F6VuLucElbdoPX5/iqdjGWxsASVfFSg6rhYNKkdJZuwIAK WmjTfdG77jxS1QqUNnKj5z5F17siq7Vq3CHek= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=UpQDLvJnVL7jtyeXkTDN+p+9sZ9oYAvrVI1YGdPv9orpm0R+qaIIpH3TW4dIEF4rHP 0vXXnliBk3Hitq5/fuD1SmHjaV9LPLduy7NKnjTQgUItXK1+qdnLmOiIviYSh/E2jR+z PLLyg3FxPNtd6L+mrGngXBxJaOoJ+AuoY9m9k= MIME-Version: 1.0 Received: by 10.204.47.232 with HTTP; Fri, 16 Apr 2010 02:15:17 -0700 (PDT) In-Reply-To: <20100416101144.68e8beb8@ernst.jennejohn.org> References: <20100415.143147.69510145118168557.imp@bsdimp.com> <20100416101144.68e8beb8@ernst.jennejohn.org> Date: Fri, 16 Apr 2010 13:15:17 +0400 Received: by 10.204.32.77 with SMTP id b13mr1457345bkd.113.1271409317186; Fri, 16 Apr 2010 02:15:17 -0700 (PDT) Message-ID: From: pluknet To: gary.jennejohn@freenet.de Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: arch@freebsd.org Subject: Re: 
TARGET_BIG_ENDIAN branch collapse X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Apr 2010 09:42:02 -0000

On 16 April 2010 12:11, Gary Jennejohn wrote:
> On Thu, 15 Apr 2010 14:31:47 -0600 (MDT)
> "M. Warner Losh" wrote:
>
>> I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN
>> stuff. You can find a diff at
>> http://people.freebsd.org/~imp/tbemd-20100415.diff.
>>
>
> fetch http://people.freebsd.org/~imp/tbemd-20100415.diff
> fetch: http://people.freebsd.org/~imp/tbemd-20100415.diff: Not Found
>

Whilst there's http://people.freebsd.org/~imp/tbemd.diff

--
wbr, pluknet
From owner-freebsd-arch@FreeBSD.ORG Fri Apr 16 16:05:39 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 41ACF1065674 for ; Fri, 16 Apr 2010 16:05:39 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id DCB9A8FC22 for ; Fri, 16 Apr 2010 16:05:38 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o3GG0WKQ090653; Fri, 16 Apr 2010 10:00:33 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Fri, 16 Apr 2010 10:00:32 -0600 (MDT) Message-Id: <20100416.100032.74663955.imp@bsdimp.com> To: gary.jennejohn@freenet.de From: Warner Losh In-Reply-To: <20100416101144.68e8beb8@ernst.jennejohn.org> References: <20100415.143147.69510145118168557.imp@bsdimp.com> <20100416101144.68e8beb8@ernst.jennejohn.org> X-Mailer: Mew version 3.3 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: TARGET_BIG_ENDIAN branch collapse X-BeenThere:
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Apr 2010 16:05:39 -0000 From: Gary Jennejohn Subject: Re: TARGET_BIG_ENDIAN branch collapse Date: Fri, 16 Apr 2010 10:11:44 +0200 > On Thu, 15 Apr 2010 14:31:47 -0600 (MDT) > "M. Warner Losh" wrote: > > > I'm planning on doing a branch collapse of the TARGET_BIG_ENDIAN > > stuff. You can find a diff at > > http://people.freebsd.org/~imp/tbemd-20100415.diff. > > > > fetch http://people.freebsd.org/~imp/tbemd-20100415.diff > fetch: http://people.freebsd.org/~imp/tbemd-20100415.diff: Not Found > should be there now. Warner From owner-freebsd-arch@FreeBSD.ORG Sat Apr 17 02:11:38 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6ABD5106566C; Sat, 17 Apr 2010 02:11:38 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 003D08FC21; Sat, 17 Apr 2010 02:11:37 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AvsEANO1yEuDaFvK/2dsb2JhbACcAHG+VoJcgjIE X-IronPort-AV: E=Sophos;i="4.52,224,1270440000"; d="scan'208";a="73100956" Received: from fraser.cs.uoguelph.ca ([131.104.91.202]) by esa-annu-pri.mail.uoguelph.ca with ESMTP; 16 Apr 2010 22:11:36 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by fraser.cs.uoguelph.ca (Postfix) with ESMTP id DFDAF109C35D; Fri, 16 Apr 2010 22:11:36 -0400 (EDT) X-Virus-Scanned: amavisd-new at fraser.cs.uoguelph.ca Received: from fraser.cs.uoguelph.ca ([127.0.0.1]) by localhost (fraser.cs.uoguelph.ca [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 4wZFFW9NH+hb; Fri, 16 Apr 2010 22:11:36 -0400 (EDT) Received: from muncher.cs.uoguelph.ca 
(muncher.cs.uoguelph.ca [131.104.91.102]) by fraser.cs.uoguelph.ca (Postfix) with ESMTP id 66149109C2DF; Fri, 16 Apr 2010 22:11:36 -0400 (EDT) Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id o3H2PYi05419; Fri, 16 Apr 2010 22:25:35 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Fri, 16 Apr 2010 22:25:34 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher.cs.uoguelph.ca To: Bruce Evans In-Reply-To: <20100416181926.F1082@delplex.bde.org> Message-ID: References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> <20100414135230.U12587@delplex.bde.org> <20100416181926.F1082@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, Andriy Gapon Subject: Re: (in)appropriate uses for MAXBSIZE X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 17 Apr 2010 02:11:38 -0000 On Fri, 16 Apr 2010, Bruce Evans wrote: > > Do you have benchmarks? A kernel build (without -j) is a good test. > Due to include bloat and include nesting bloat, a kernel build opens > and closes the same small include files hundreds or thousands of times > each, with O(10^5) includes altogether, so an RPC to read attributes > on each open costs a lot of latency. nfs on a LAN does well to take > only 10% longer than a local file system on a LAN and after disabling > close/open consistency takes only about half as much longer, by reducing > the number of RPCs by about a factor of 2. The difference should be > even more noticeable on a WAN. Building with -j reduces the extra > length by not stalling the whole build waiting for each RPC. I probably > needed it to take only 10% longer. 
>

Well, I certainly wouldn't call these benchmarks, but here are the numbers I currently see. (The two machines involved are VERY slow by today's hardware standards. One is an 800MHz PIII and the other is a 4-5 year old cheap laptop with something like a 1.5GHz Celeron CPU.)

The results for something like the Connectathon test suite's read/write test can be highly variable, depending upon the hardware setup, etc. (I suspect that is at least partially based on when the writes get flushed during the test run. One thing that I'd like to do someday is have a read/shared lock on the buffer cache block while a write-back to a server is happening. It currently is write/exclusive locked, but doesn't need to be, after the data has been copied into the buffer.)

For the laptop as client:

without delegations:
./test5: read and write
        wrote 1048576 byte file 10 times in 7.27 seconds (1442019 bytes/sec)
        read 1048576 byte file 10 times in 0.4 seconds (238101682 bytes/sec)
./test5 ok.

with delegations:
./test5: read and write
        wrote 1048576 byte file 10 times in 1.64 seconds (6358890 bytes/sec)
        read 1048576 byte file 10 times in 0.70 seconds (14802158 bytes/sec)
./test5 ok.

but for the PIII as the client (why does this case run so much better when there are no delegations?):

without delegations:
./test5: read and write
        wrote 1048576 byte file 10 times in 1.75 seconds (5961944 bytes/sec)
        read 1048576 byte file 10 times in 0.7 seconds (131844940 bytes/sec)
./test5 ok.

with delegations:
./test5: read and write
        wrote 1048576 byte file 10 times in 1.39 seconds (7526450 bytes/sec)
        read 1048576 byte file 10 times in 0.67 seconds (15540698 bytes/sec)
./test5 ok.

Now, a kernel build with the PIII as client:

without delegations:
Real    User    System
6859    4635    1158

with delegations:
Real    User    System
6491    4634    1105

As you can see, there isn't that much improvement when delegations are enabled. 
Part of the problem here is that, for an 800MHz PIII, the build is CPU bound ("vmstat 5" shows 0->10% idle during the build), so the speed of the I/O over NFS won't have a lot of effect on it. This would be more interesting if the client had a much faster CPU. Not benchmarks, but might give you some idea. (The 2 machines are running off the same small $50 home router.) Someday, I'd like to implement aggressive client side caching to a disk in the client and do a performance evaluation (including introducing network latency) and see how it all does. I'm getting close to where I can do that. Maybe this summer. Have fun with it, rick From owner-freebsd-arch@FreeBSD.ORG Sat Apr 17 22:26:42 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CC542106564A for ; Sat, 17 Apr 2010 22:26:42 +0000 (UTC) (envelope-from kmatthew.macy@gmail.com) Received: from mail-qy0-f199.google.com (mail-qy0-f199.google.com [209.85.221.199]) by mx1.freebsd.org (Postfix) with ESMTP id 834028FC0C for ; Sat, 17 Apr 2010 22:26:42 +0000 (UTC) Received: by qyk37 with SMTP id 37so3262518qyk.8 for ; Sat, 17 Apr 2010 15:26:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:reply-to:received:date :x-google-sender-auth:received:message-id:subject:from:to:cc :content-type; bh=PjnAcQjqGFZ8pYrjRPHPdnklBVDduDRO0krJin4wgH4=; b=xS4AhzhTHxmyhDyubfmOERTBU4MGKqax7KJo0eo5iB17MjDmGlHcYVHNgHL2vezaf1 tEF+dutCfkG0V9n6XZEVC8WexbtWsFo1/dDBRbvj9lMQIiaAZJ3Q4phKOVie2wODDmDu izFZJb3VPoibskEYwq4WDHUwvT3Waf58bIIjY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:reply-to:date:x-google-sender-auth:message-id :subject:from:to:cc:content-type; b=he+TKMQFqc/sNYE2jrnS6f0FAu6SpRrbPKZ2veoKNen1BmRpkxpeID+dK5Cl/t6cJi YKmvxNevpXjYr7+hRVSaedqZA7I3KGjVhqBYcY4T9o1VvLBf1ZR1CoxnVwxXUMtjRBR1 
UbfIwfKE458kGsuofdpFvUdFrR1D2z9xkJAJ0= MIME-Version: 1.0 Sender: kmatthew.macy@gmail.com Received: by 10.229.226.6 with HTTP; Sat, 17 Apr 2010 14:55:30 -0700 (PDT) Date: Sat, 17 Apr 2010 14:55:30 -0700 X-Google-Sender-Auth: 37b9bca1f4275dab Received: by 10.229.88.72 with SMTP id z8mr2100122qcl.3.1271541330653; Sat, 17 Apr 2010 14:55:30 -0700 (PDT) Message-ID: From: "K. Macy" To: freebsd-arch@freebsd.org Content-Type: multipart/mixed; boundary=0016364eead454b468048475c972 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: jeff@freebsd.org, alc@cs.rice.edu Subject: Moving forward with vm page lock X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: kmacy@freebsd.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 17 Apr 2010 22:26:42 -0000 --0016364eead454b468048475c972 Content-Type: text/plain; charset=ISO-8859-1 Last February Jeff Roberson first shared his vm page lock patch with me. The general premise is that modification of a vm_page_t is no longer protected by the global "vm page queue mutex" but is instead protected by an entry in an array of locks which each vm_page_t is hashed to by its physical address. This complicates pmap somewhat because it increases the number of cases where retry logic is required if we need to drop the pmap lock in order to first acquire the page lock (see pa_tryrelock). I've continued refining Jeff's initial page lock patch by resolving lock ordering issues in vm_pageout, eliminating pv_lock, and eliminating the need for pmap_collect on amd64. Rather than exposing ourselves to a race condition by dropping the locks in pmap_collect, I pre-allocate any necessary pv_entrys before changing any pmap state. This complicated calls to demote slightly, but that can probably be simplified later. Currently only amd64 supports this. Other platforms map vm_page_lock(m) to the vm page queue mutex. 
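
The locking scheme described above (each page hashed by physical address to one lock in an array, plus retry logic when the pmap lock has to be dropped to take a page lock) can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not code from the patch: it uses pthread mutexes in place of kernel mutexes, and PA_LOCK_COUNT, PAGE_SHIFT_SKETCH, PA_NONE, pa_locks_init(), pa_lock_index(), pa_lock_for() and pa_relock_sketch() are hypothetical names invented for the example. The real pa_tryrelock may differ in signature and behavior.

```c
/*
 * Illustrative user-space sketch only; every name here is hypothetical
 * and none of this is taken from the head_page_lock patch.
 */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PA_LOCK_COUNT		64		/* assumed lock-array size */
#define PAGE_SHIFT_SKETCH	12		/* assumed 4 KB pages */
#define PA_NONE			UINT64_MAX	/* "no page lock held" marker */

static pthread_mutex_t pa_locks[PA_LOCK_COUNT];

/* Initialize every lock in the array; call once before any other use. */
static void
pa_locks_init(void)
{
	for (size_t i = 0; i < PA_LOCK_COUNT; i++)
		pthread_mutex_init(&pa_locks[i], NULL);
}

/* Hash a physical address to a slot; all addresses in a page share one. */
static size_t
pa_lock_index(uint64_t pa)
{
	return ((size_t)((pa >> PAGE_SHIFT_SKETCH) % PA_LOCK_COUNT));
}

/* Return the lock that covers the page containing pa. */
static pthread_mutex_t *
pa_lock_for(uint64_t pa)
{
	return (&pa_locks[pa_lock_index(pa)]);
}

/*
 * Switch from the page lock covering *held (PA_NONE if none is held) to
 * the one covering pa. The caller holds "outer" (standing in for the pmap
 * lock), which ranks after page locks in the lock order. Returns 0 if
 * "outer" was held throughout, or 1 if it had to be dropped and retaken,
 * in which case the caller must revalidate its state and retry.
 */
static int
pa_relock_sketch(pthread_mutex_t *outer, uint64_t pa, uint64_t *held)
{
	pthread_mutex_t *want = pa_lock_for(pa);

	assert(outer != NULL && held != NULL);
	if (*held != PA_NONE) {
		if (pa_lock_for(*held) == want) {
			*held = pa;	/* same slot: already locked */
			return (0);
		}
		pthread_mutex_unlock(pa_lock_for(*held));
	}
	*held = pa;
	if (pthread_mutex_trylock(want) == 0)
		return (0);		/* uncontended fast path */
	/* Contended: drop the outer lock so we can block in lock order. */
	pthread_mutex_unlock(outer);
	pthread_mutex_lock(want);
	pthread_mutex_lock(outer);
	return (1);
}
```

A caller would loop on pa_relock_sketch() and re-read whatever state it looked up under the outer lock whenever the function returns 1, since that state may have changed while the lock was dropped. Note that two unrelated pages can hash to the same slot, which costs some contention but does not affect correctness.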
The current version of the patch can be found at: http://people.freebsd.org/~kmacy/diffs/head_page_lock.diff I've been refining it in a subversion branch at: svn://svn.freebsd.org/base/user/kmacy/head_page_lock On my workloads at a CDN startup I've seen as much as a 50% increase in lighttpd throughput (3.2Gbps -> 4.8Gbps). At Jeff's request I've done some basic measurements with buildkernel to demonstrate that, at least on my hardware, a dual 4-core "CPU: Intel(R) Xeon(R) CPU L5420 @ 2.50GHz (2500.01-MHz K8-class CPU)" with 64GB of RAM, there is no performance regression. I did 2 warm-up runs followed by 10 samples of "time make -j16 buildkernel KERNCONF=GENERIC -DNO_MODULES -DNO_KERNELCONFIG -DNO_KERNELDEPEND" on a ZFS file system on a twa-based RAID device, both with and without page_lock. Wall clock time is consistently just under a second lower (faster build time) for the page_lock kernel. The bulk of the time is actually spent in user mode, so it is more meaningful to compare system times. I attached the logs of the runs and the two files I fed to ministat. 

ministat -c 95 -w 72 base page_lock
x base
+ page_lock
[ASCII distribution plot of the base (x) and page_lock (+) samples]
    N           Min           Max        Median           Avg        Stddev
x  10         47.35         49.09         48.64        48.417    0.53416706
+  10         40.04         41.52         40.98        40.844    0.41494846
Difference at 95.0% confidence
        -7.573 +/- 0.449396
        -15.6412% +/- 0.928179%
        (Student's t, pooled s = 0.478287)

ramsan2.lab1# head -2 prof.out
debug.lock.prof.stats:
max wait_max total wait_total count avg wait_avg cnt_hold cnt_lock name
ramsan2.lab1# sort -nrk 4 prof.out | head
1592 243918 1768980 12026988 287680 6 41 0 112005 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1065 (sleep mutex:vm page queue free mutex)
3967 750285 1678130 9447247 276594 6 34 0 104952 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1388 (sleep mutex:vm page queue mutex)
18234 163969 5417360 9213400 282459 19 32 0 6548 /usr/home/kmacy/head_page_lock/sys/amd64/amd64/pmap.c:3372 (sleep mutex:page lock)
173094 134890 18226507 8195920 49757 366 164 0 625 /usr/home/kmacy/head_page_lock/sys/kern/vfs_subr.c:2091 (lockmgr:zfs)
254 167136 38222 5153728 2736 13 1883 0 2333 /usr/home/kmacy/head_page_lock/sys/amd64/amd64/pmap.c:550 (sleep mutex:page lock)
1160 104774 1624269 4380034 279240 5 15 0 107998 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1508 (sleep mutex:vm page queue free mutex)
1107 80128 1581048 3377896 274341 5 12 0 100130 /usr/home/kmacy/head_page_lock/sys/vm/vm_page.c:1300 (sleep mutex:vm page queue mutex)
104802 284128 14712290 2970729 259423 56 11 0 1900 /usr/home/kmacy/head_page_lock/sys/vm/vm_object.c:721 (sleep mutex:page lock)
84339 158037 1455568 2875384 85147 17 33 0 292 /usr/home/kmacy/head_page_lock/sys/kern/vfs_cache.c:390 (rw:Name Cache)
9 995901 236 2468160 46 5 53655 0 45 /usr/home/kmacy/head_page_lock/sys/kern/sched_ule.c:2552 (spin mutex:sched lock 4)

Both Giovanni Trematerra and I have run 
stress2 on it for extended periods with no problems in evidence. I'd like to see this go into HEAD by the end of this month. Once this change has proven to be stable by a wider audience I will extend it to i386. Thanks, Kip --0016364eead454b468048475c972-- From owner-freebsd-arch@FreeBSD.ORG Sat Apr 17 22:49:40 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E8E8B1065672; Sat, 17 Apr 2010 22:49:40 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.freebsd.org (Postfix) with ESMTP id A513D8FC26; Sat, 17 Apr 2010 22:49:40 +0000 (UTC) Received: from [127.0.0.1] (pooker.samsco.org [168.103.85.57]) (authenticated bits=0) by pooker.samsco.org (8.14.3/8.14.3) with ESMTP id o3HMnaBB041181; Sat, 17 Apr 2010 16:49:37 -0600 (MDT) (envelope-from scottl@samsco.org) Mime-Version: 1.0 (Apple Message framework v1078) Content-Type: text/plain; charset=us-ascii From: Scott Long In-Reply-To: <29917.1271406183@critter.freebsd.dk> Date: Sat, 17 Apr 2010 16:49:36 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: References: <29917.1271406183@critter.freebsd.dk> To: Poul-Henning Kamp X-Mailer: Apple Mail (2.1078) X-Spam-Status: No, score=-1.0 required=3.8 tests=ALL_TRUSTED, T_RP_MATCHES_RCVD autolearn=unavailable version=3.3.0 X-Spam-Checker-Version: SpamAssassin 3.3.0 (2010-01-18) on pooker.samsco.org Cc: Attilio Rao , Giovanni Trematerra , freebsd-arch@freebsd.org Subject: Re: [PATCH] Syncer rewriting X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 17 Apr 2010 22:49:41 -0000 On Apr 16, 2010, at 2:23 AM, Poul-Henning Kamp wrote: > > >> - The standard syncer may be further improved getting rid of 
the >> bufobj. It should actually handle a list of vnodes rather than a list >> of bufobj. However similar optimizations may be done after the patch >> is ready to enter the tree. > > That would be the wrong direction: we need the bufobj because for instance > a RAID5 geom module does not have a vnode for the parity data. > > If you force the syncer to only work on vnodes, then we need a parallel > mechanism for non-filesystem disk users. It's been 5-6 (7?) years since you invented the bufobj, but I still haven't seen anything in GEOM use it as you suggest. You used to have a saying about premature optimization... I'd like to see Attilio's work move forward despite this. Scott