From owner-freebsd-fs Sun Jan 26 0:57:41 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 338A737B401 for ; Sun, 26 Jan 2003 00:57:39 -0800 (PST) Received: from heron.mail.pas.earthlink.net (heron.mail.pas.earthlink.net [207.217.120.189]) by mx1.FreeBSD.org (Postfix) with ESMTP id AB04143ED8 for ; Sun, 26 Jan 2003 00:57:38 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0074.cvx40-bradley.dialup.earthlink.net ([216.244.42.74] helo=mindspring.com) by heron.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18cibf-0000lK-00; Sun, 26 Jan 2003 00:57:36 -0800 Message-ID: <3E33A208.5ED9A35B@mindspring.com> Date: Sun, 26 Jan 2003 00:53:28 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Craig Reyenga Cc: freebsd-fs@freebsd.org Subject: Re: What about a case insensitive Filesystem? References: <001101c2c4e4$51686960$0200000a@sewer.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4a1abf569f4f6fb263c7073bf60832e4e350badd9bab72f9c350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Craig Reyenga wrote: > > Is there any way, either now or in the future, for FreeBSD to be able to > have a UFS-based case-insensitive filesystem? It would be great for many > applications, such as Samba servers, web servers catered to the general > public (angelfire, geocities) and places where the user just doesn't care. > Is this at all possible? > > > (I'm not on the list, so CC'ing would be great) Where do you mean case insensitive? Storage? Lookup? Iteration? 
In general, what people mean when they say this is "case sensitive on storage and iteration (for display), but case insensitive on lookup".

The short answer is "it's possible: go ahead and write the code".

The longer answer is "it's possible, but it's not something you really want to do, unless you are willing to move globbing into the kernel".

This is because, on lookup, you want to effectively turn each character into a wildcard based on case insensitivity. This is relatively easy for US ASCII, where if the character is in the right range, you AND off a bit, and treat everything that way. Thus it's better if you do this as if you were doing a globbing operation, rather than the way UNIX expects you to do it.

The other issue that wants globbing in the kernel is when you have a lookup for a particular purpose; in general, there are three types of lookup:

1) Lookup of existing entry for file operation (stat, open, etc.).
2) Lookup of existing entry for directory entry creation operation (create, link target, etc.).
3) Lookup of existing entry for directory entry deletion operation (rename, unlink, etc.).

This doesn't seem important, until you have two files in a directory, e.g. "start" and "Stop", and you try one of:

	mv Start stop
	mv Start start
	mv Stop Stop
	rm st*
	...

See the point? The globbing, particularly in the "rm" case, has to be moved into the kernel.

The alternative to this is to add globbing to each and every shell out there, and hope to God that it's implemented the same way in all of them. This is because in UNIX systems, the globbing is expanded before being passed across the system call boundary.

The main problem with doing this is that it, effectively, then assumes that the underlying FS *must* be case insensitive on lookup. Specifically:

	ls > Start
	ls > start

Can never end up with two files, because the shell would find the first file when looking up the redirect target for the second command, and dump to it anyway.
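As a rough illustration of the "start"/"Stop" example (a minimal user-space sketch in Python, not kernel code, and ASCII-only as described above), the difference between an ordinary glob and a case-folded one can be shown like this. The helper names `fold_ascii`, `lookup`, `glob_cs` and `glob_ci` are hypothetical and exist only for this example:

```python
from fnmatch import fnmatchcase

def fold_ascii(name: str) -> str:
    """Fold ASCII lower-case letters to upper case by clearing bit 0x20.

    Characters outside 'a'..'z' are left untouched, so this is only
    valid for US ASCII names.
    """
    return "".join(chr(ord(c) & ~0x20) if "a" <= c <= "z" else c for c in name)

def lookup(entries, name):
    """Case-insensitive lookup: stored names keep their case, the comparison does not."""
    want = fold_ascii(name)
    return [e for e in entries if fold_ascii(e) == want]

def glob_cs(entries, pattern):
    """Ordinary case-sensitive glob, as a shell would expand it."""
    return [e for e in entries if fnmatchcase(e, pattern)]

def glob_ci(entries, pattern):
    """Case-insensitive glob: fold both the pattern and each entry, then match."""
    folded_pattern = fold_ascii(pattern)
    return [e for e in entries if fnmatchcase(fold_ascii(e), folded_pattern)]

entries = ["start", "Stop"]
print(lookup(entries, "Start"))   # the stored name "start" is found
print(glob_cs(entries, "st*"))    # the shell's view: only "start"
print(glob_ci(entries, "st*"))    # the case-insensitive view: both names
```

The point of the sketch is only that the two globs disagree on "rm st*": whichever layer expands the pattern decides which files are removed.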
This also has a problem with files which *already* exist on an FS, on which such a shell is then used, e.g.: ls -i 136 Start 137 start cat sTart ... Therefore doing this in the shell is unacceptable from many, many perspectives.., not the least of which is the fact that you can't have case insensitivity be an attribute of the underlying FS, it is instead an attribute of the shell. FWIW, if this doesn't make a lot of sense to you, but you are willing to hack up a shell to try it out, you will quickly see what I mean. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Jan 27 11: 1:36 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EEACE37B405 for ; Mon, 27 Jan 2003 11:01:35 -0800 (PST) Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7CDCF43F13 for ; Mon, 27 Jan 2003 11:01:35 -0800 (PST) (envelope-from owner-bugmaster@freebsd.org) Received: from freefall.freebsd.org (peter@localhost [127.0.0.1]) by freefall.freebsd.org (8.12.6/8.12.6) with ESMTP id h0RJ1ZNS068943 for ; Mon, 27 Jan 2003 11:01:35 -0800 (PST) (envelope-from owner-bugmaster@freebsd.org) Received: (from peter@localhost) by freefall.freebsd.org (8.12.6/8.12.6/Submit) id h0RJ1ZCb068937 for fs@freebsd.org; Mon, 27 Jan 2003 11:01:35 -0800 (PST) Date: Mon, 27 Jan 2003 11:01:35 -0800 (PST) Message-Id: <200301271901.h0RJ1ZCb068937@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: peter set sender to owner-bugmaster@freebsd.org using -f From: FreeBSD bugmaster To: fs@FreeBSD.org Subject: Current problem reports assigned to you Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Current FreeBSD problem reports 
Critical problems Serious problems Non-critical problems S Submitted Tracker Resp. Description ------------------------------------------------------------------------------- a [2000/10/06] kern/21807 fs [patches] Make System attribute correspon 1 problem total. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Jan 27 16:33:48 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0F6C137B401; Mon, 27 Jan 2003 16:33:47 -0800 (PST) Received: from puffin.mail.pas.earthlink.net (puffin.mail.pas.earthlink.net [207.217.120.139]) by mx1.FreeBSD.org (Postfix) with ESMTP id A7AFF43F79; Mon, 27 Jan 2003 16:33:46 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0350.cvx22-bradley.dialup.earthlink.net ([209.179.199.95] helo=mindspring.com) by puffin.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18dJhA-0007mN-00; Mon, 27 Jan 2003 16:33:45 -0800 Message-ID: <3E35CF66.58143561@mindspring.com> Date: Mon, 27 Jan 2003 16:31:34 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: FreeBSD bugmaster Cc: fs@FreeBSD.org Subject: Re: Current problem reports assigned to you References: <200301271901.h0RJ1ZCb068937@freefall.freebsd.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a436657d9c72457ab537c8020cbff2b26e2601a10902912494350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org FreeBSD bugmaster wrote: > Current FreeBSD problem reports > Critical problems > Serious problems > Non-critical problems > > S Submitted Tracker Resp. 
Description > ------------------------------------------------------------------------------- > a [2000/10/06] kern/21807 fs [patches] Make System attribute correspon > > 1 problem total. Could someone point this PR at someone who cares to try and fix it (e.g. the original poster of the bug), instead of at the FreeBSD-FS mailing list? Brow-beating us with the "Open PR" cron job is not going to make any of us on this list any more likely to care about "fixing" this "problem" for you than we have been for the last 6 months. Thanks. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Jan 27 16:43: 4 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DE71837B401; Mon, 27 Jan 2003 16:43:03 -0800 (PST) Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6839A43E4A; Mon, 27 Jan 2003 16:43:03 -0800 (PST) (envelope-from dougb@FreeBSD.org) Received: from freefall.freebsd.org (dougb@localhost [127.0.0.1]) by freefall.freebsd.org (8.12.6/8.12.6) with ESMTP id h0S0h3NS084725; Mon, 27 Jan 2003 16:43:03 -0800 (PST) (envelope-from dougb@freefall.freebsd.org) Received: (from dougb@localhost) by freefall.freebsd.org (8.12.6/8.12.6/Submit) id h0S0h32g084721; Mon, 27 Jan 2003 16:43:03 -0800 (PST) Date: Mon, 27 Jan 2003 16:43:03 -0800 (PST) From: Doug Barton Message-Id: <200301280043.h0S0h32g084721@freefall.freebsd.org> To: dougb@FreeBSD.org, fs@FreeBSD.org, freebsd-bugs@FreeBSD.org Subject: Re: kern/21807: [patches] Make System attribute correspond to SF_IMMUTABLE Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Synopsis: [patches] Make System attribute correspond to SF_IMMUTABLE Responsible-Changed-From-To: 
fs->freebsd-bugs Responsible-Changed-By: dougb Responsible-Changed-When: Mon Jan 27 16:41:38 PST 2003 Responsible-Changed-Why: The -fs list has not expressed any interest. http://www.freebsd.org/cgi/query-pr.cgi?pr=21807 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Jan 27 16:43:38 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7753D37B401; Mon, 27 Jan 2003 16:43:37 -0800 (PST) Received: from 12-234-22-23.client.attbi.com (12-234-22-23.client.attbi.com [12.234.22.23]) by mx1.FreeBSD.org (Postfix) with ESMTP id B67DB43F85; Mon, 27 Jan 2003 16:43:36 -0800 (PST) (envelope-from DougB@FreeBSD.org) Received: from slave.gorean.org (budeafy5ukh64snm@slave.gorean.org [10.0.0.1]) by 12-234-22-23.client.attbi.com (8.12.6/8.12.6) with ESMTP id h0S0hZRJ008507; Mon, 27 Jan 2003 16:43:36 -0800 (PST) (envelope-from DougB@FreeBSD.org) Date: Mon, 27 Jan 2003 16:43:35 -0800 (PST) From: Doug Barton To: Terry Lambert Cc: FreeBSD bugmaster , fs@FreeBSD.org Subject: Re: Current problem reports assigned to you In-Reply-To: <3E35CF66.58143561@mindspring.com> Message-ID: <20030127164314.T1027@12-234-22-23.pyvrag.nggov.pbz> References: <200301271901.h0RJ1ZCb068937@freefall.freebsd.org> <3E35CF66.58143561@mindspring.com> Organization: http://www.FreeBSD.org/ X-message-flag: Outlook -- Not just for spreading viruses anymore! MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Mon, 27 Jan 2003, Terry Lambert wrote: > FreeBSD bugmaster wrote: > > Current FreeBSD problem reports > > Critical problems > > Serious problems > > Non-critical problems > > > > S Submitted Tracker Resp. 
Description > > ------------------------------------------------------------------------------- > > a [2000/10/06] kern/21807 fs [patches] Make System attribute correspon > > > > 1 problem total. > > > Could someone point this PR at someone who cares to try and fix > it (e.g. the original poster of the bug), instead of at the > FreeBSD-FS mailing list? Done. -- If it's moving, encrypt it. If it's not moving, encrypt it till it moves, then encrypt it some more. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Jan 27 16:47: 8 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id CE4A437B401; Mon, 27 Jan 2003 16:47:07 -0800 (PST) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6F0CB43E4A; Mon, 27 Jan 2003 16:47:07 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0350.cvx22-bradley.dialup.earthlink.net ([209.179.199.95] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18dJu6-0001sh-00; Mon, 27 Jan 2003 16:47:07 -0800 Message-ID: <3E35D281.4F6EDDBE@mindspring.com> Date: Mon, 27 Jan 2003 16:44:49 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Doug Barton Cc: FreeBSD bugmaster , fs@FreeBSD.org Subject: Re: Current problem reports assigned to you References: <200301271901.h0RJ1ZCb068937@freefall.freebsd.org> <3E35CF66.58143561@mindspring.com> <20030127164314.T1027@12-234-22-23.pyvrag.nggov.pbz> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a436657d9c72457ab56f58b639eae7d031350badd9bab72f9c350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: 
List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Doug Barton wrote: > > Could someone point this PR at someone who cares to try and fix > > it (e.g. the original poster of the bug), instead of at the > > FreeBSD-FS mailing list? > > Done. Thank you. You are a god. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Jan 27 17: 7: 7 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 37F3E37B401; Mon, 27 Jan 2003 17:07:06 -0800 (PST) Received: from mailsrv.otenet.gr (mailsrv.otenet.gr [195.170.0.5]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0D6E543E4A; Mon, 27 Jan 2003 17:07:05 -0800 (PST) (envelope-from keramida@freebsd.org) Received: from gothmog.gr (patr530-b162.otenet.gr [212.205.244.170]) by mailsrv.otenet.gr (8.12.6/8.12.6) with ESMTP id h0S16dBb023275; Tue, 28 Jan 2003 03:07:00 +0200 (EET) Received: from gothmog.gr (gothmog [127.0.0.1]) by gothmog.gr (8.12.6/8.12.6) with ESMTP id h0S16JVF003628; Tue, 28 Jan 2003 03:06:19 +0200 (EET) (envelope-from keramida@freebsd.org) Received: (from giorgos@localhost) by gothmog.gr (8.12.6/8.12.6/Submit) id h0S16JOW003627; Tue, 28 Jan 2003 03:06:19 +0200 (EET) (envelope-from keramida@freebsd.org) Date: Tue, 28 Jan 2003 03:06:19 +0200 From: Giorgos Keramidas To: Terry Lambert Cc: Doug Barton , fs@freebsd.org Subject: Re: Current problem reports assigned to you Message-ID: <20030128010618.GA3598@gothmog.gr> References: <200301271901.h0RJ1ZCb068937@freefall.freebsd.org> <3E35CF66.58143561@mindspring.com> <20030127164314.T1027@12-234-22-23.pyvrag.nggov.pbz> <3E35D281.4F6EDDBE@mindspring.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3E35D281.4F6EDDBE@mindspring.com> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk 
List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On 2003-01-27 16:44, Terry Lambert wrote: > Doug Barton wrote: > > > Could someone point this PR at someone who cares to try and fix > > > it (e.g. the original poster of the bug), instead of at the > > > FreeBSD-FS mailing list? > > > > Done. > > Thank you. You are a god. A fast god too. I had only just read the message and run query-pr, only to find the PR reassigned already :) Thanks Doug. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Jan 27 20:35:13 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0AC7037B401; Mon, 27 Jan 2003 20:35:11 -0800 (PST) Received: from obsecurity.dyndns.org (adsl-64-169-104-205.dsl.lsan03.pacbell.net [64.169.104.205]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6BFD743F3F; Mon, 27 Jan 2003 20:35:04 -0800 (PST) (envelope-from kris@obsecurity.org) Received: from rot13.obsecurity.org (rot13.obsecurity.org [10.0.0.5]) by obsecurity.dyndns.org (Postfix) with ESMTP id 41AE267872; Mon, 27 Jan 2003 20:35:03 -0800 (PST) Received: by rot13.obsecurity.org (Postfix, from userid 1000) id 35ED9171F; Mon, 27 Jan 2003 20:35:03 -0800 (PST) Date: Mon, 27 Jan 2003 20:35:03 -0800 From: Kris Kennaway To: Kris Kennaway Cc: current@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: INVARIANTS-related fs panic on alpha Message-ID: <20030128043503.GA902@rot13.obsecurity.org> References: <20030125081234.GA11722@rot13.obsecurity.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="qDbXVdCdHGoSgWSk" Content-Disposition: inline In-Reply-To: <20030125081234.GA11722@rot13.obsecurity.org> User-Agent: Mutt/1.4i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) 
List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org --qDbXVdCdHGoSgWSk Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Sat, Jan 25, 2003 at 12:12:34AM -0800, Kris Kennaway wrote:
> One of the alpha package clients panicked with this. It was under
> very high load at the time (25 simultaneous package builds):
>
> fatal kernel trap:
>
> trap entry = 0x2 (memory management fault)
> faulting va = 0xdeadc0dedeadc0e6
> type = access violation
> cause = store instruction
> pc = 0xfffffc000053453c
> ra = 0xfffffc000053b2a8
> sp = 0xfffffe001da15b30
> curthread = 0xfffffc003e33b930
> pid = 3, comm = g_up
>
> Stopped at add_to_worklist+0xac: stq a0,0x8(t0) <0xdeadc0dedeadc0e6>
> db> trace
> add_to_worklist() at add_to_worklist+0xac
> handle_written_inodeblock() at handle_written_inodeblock+0x5e8
> softdep_disk_write_complete() at softdep_disk_write_complete+0xac
> bufdone() at bufdone+0x19c
> bufdonebio() at bufdonebio+0x1c
> biodone() at biodone+0x28
> g_dev_done() at g_dev_done+0xd8
> biodone() at biodone+0x28
> g_io_schedule_up() at g_io_schedule_up+0x4c
> g_up_procbody() at g_up_procbody+0x9c
> fork_exit() at fork_exit+0x100
> exception_return() at exception_return
> --- root of call graph ---
> db>

Here it is again:

fatal kernel trap:

trap entry = 0x4 (unaligned access fault)
faulting va = 0xdeadc0dedeadc0e6
opcode = 0x2d
register = 0x10
pc = 0xfffffc0000534540
ra = 0xfffffc000053b2a8
sp = 0xfffffe0006c0fb30
curthread = 0xfffffc0007ba7930
pid = 3, comm = g_up

Stopped at add_to_worklist+0xb0: ldq t0,0x7c60(gp) <0xfffffc00006581d0>
db> trace
add_to_worklist() at add_to_worklist+0xb0
handle_written_inodeblock() at handle_written_inodeblock+0x5e8
softdep_disk_write_complete() at softdep_disk_write_complete+0xac
bufdone() at bufdone+0x19c
bufdonebio() at bufdonebio+0x1c
biodone() at biodone+0x28
g_dev_done() at g_dev_done+0xd8
biodone() at biodone+0x28
g_io_schedule_up() at g_io_schedule_up+0x4c
g_up_procbody() at g_up_procbody+0x9c
fork_exit() at fork_exit+0x100
exception_return() at exception_return
--- root of call graph ---
db>

--qDbXVdCdHGoSgWSk Content-Type: application/pgp-signature Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (FreeBSD)

iD8DBQE+Ngh2Wry0BWjoQKURAqiIAKCqGmPByHp3Dx2DyyjDGB/hQwUoAACggrtB
Nd8nsNkuPzG/fntL4bmpILg=
=uMly
-----END PGP SIGNATURE-----

--qDbXVdCdHGoSgWSk--

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message

From owner-freebsd-fs Tue Jan 28 3:21:51 2003 Delivered-To: freebsd-fs@freebsd.org Received: by hub.freebsd.org (Postfix, from userid 931) id A7F2037B401; Tue, 28 Jan 2003 03:21:50 -0800 (PST) Date: Tue, 28 Jan 2003 03:21:50 -0800 From: Juli Mallett To: freebsd-fs@FreeBSD.org Cc: Adrian Chadd Subject: Filesystem names with non-alphanum characters? Message-ID: <20030128032150.B45888@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i Organisation: The FreeBSD Project X-Alternate-Addresses: , , , , X-Towel: Yes X-LiveJournal: flata, jmallett X-Negacore: Yes Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org

Does anyone have a FreeBSD filesystem with a VFS type name which contains a space? Any other characters that are not alphanumeric? fsck currently has a hack to look for fsck_foo_bar for the vfstype "foo bar", but does not handle other things, and I am not sure if there is a use for this special case at all, or, if there is, whether we should handle a larger set of transpositions.
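The transposition in question can be sketched as follows. This is a minimal Python illustration rather than the actual fsck source: `fsck_program_for` mirrors the one case described here (a space becomes an underscore), while `fsck_program_generalized` is a purely hypothetical reading of "a larger set of transpositions" in which every non-alphanumeric character is mapped to an underscore.

```python
import re

def fsck_program_for(vfstype: str) -> str:
    """Map a VFS type name to a helper name, handling only spaces.

    This mirrors the single special case described in the message:
    vfstype "foo bar" is looked up as "fsck_foo_bar".
    """
    return "fsck_" + vfstype.replace(" ", "_")

def fsck_program_generalized(vfstype: str) -> str:
    """Hypothetical generalization: map every non-alphanumeric character to '_'."""
    return "fsck_" + re.sub(r"[^A-Za-z0-9]", "_", vfstype)

print(fsck_program_for("ufs"))
print(fsck_program_for("foo bar"))
print(fsck_program_generalized("4.2bsd"))
```

Note that the generalized form rewrites the dot in "4.2bsd" as well, which is one reason a broader mapping would need care before it replaced the current special case.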
Even the one case I thought maybe would exist ("4.2 ufs" or "4.2BSD ufs") does not, and the analogue to it ("4.2bsd") has no space, and there is no fsck to support that with a space ("fsck_4.2_bsd"). Anyone with input on this would be very welcome to speak up. We've had it since we got this stuff from NetBSD, sorta. It appears to be something Adrian added when converting it to our VFS system, so I'm willing to write off that it was a "this is a good idea" change, since our VFS system might not guarantee no spaces, but if it isn't something useful, then it might be a good idea to remove it, as we certainly don't actually try to make a name we can use, we just handle one small case. If nothing comes inre this, I may try to borrow phk's Danish axe for application to this code, otherwise I will try to make it more general purpose. Thanx, juli. -- Juli Mallett AIM: BSDFlata -- IRC: juli on EFnet OpenDarwin, Mono, FreeBSD Developer ircd-hybrid Developer, EFnet addict FreeBSD on MIPS-Anything on FreeBSD To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Jan 28 14: 6: 8 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7BA5B37B401 for ; Tue, 28 Jan 2003 14:06:07 -0800 (PST) Received: from tolkor.sgi.com (tolkor.sgi.com [198.149.18.6]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8F70F43E4A for ; Tue, 28 Jan 2003 14:06:06 -0800 (PST) (envelope-from cattelan@thebarn.com) Received: from ledzep.americas.sgi.com (ledzep.americas.sgi.com [192.48.203.134]) by tolkor.sgi.com (8.12.2/8.12.2/linux-outbound_gateway-1.2) with ESMTP id h0SKB5kq019782 for ; Tue, 28 Jan 2003 14:11:05 -0600 Received: from daisy-e236.americas.sgi.com (daisy-e236.americas.sgi.com [128.162.236.214]) by ledzep.americas.sgi.com (SGI-8.9.3/americas-smart-nospam1.1) with ESMTP id OAA85232 for ; Tue, 28 Jan 2003 14:02:44 -0600 (CST) 
Received: from [128.162.233.73] (naboo.americas.sgi.com [128.162.233.73]) by daisy-e236.americas.sgi.com (SGI-8.9.3/SGI-server-1.8) with ESMTP id OAA30079 for ; Tue, 28 Jan 2003 14:02:44 -0600 (CST) Subject: restricted blocks? From: Russell Cattelan To: fs@freebsd.org Content-Type: text/plain Organization: Message-Id: <1043784569.20928.58.camel@naboo.americas.sgi.com> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.0 Date: 28 Jan 2003 14:09:30 -0600 Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org

Does anybody have a quick answer as to why block 1 isn't writable?

naboo[6:24pm]#dd if=/dev/zero of=/dev/da1s1d bs=512 oseek=0 count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000575 secs (890518 bytes/sec)
naboo[6:24pm]#dd if=/dev/zero of=/dev/da1s1d bs=512 oseek=1 count=1
dd: /dev/da1s1d: Operation not permitted
1+0 records in
0+0 records out
0 bytes transferred in 0.000819 secs (0 bytes/sec)
naboo[6:24pm]#dd if=/dev/zero of=/dev/da1s1d bs=512 oseek=2 count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000507 secs (1009630 bytes/sec)

XFS uses this location for one of its metadata blocks.
-- Russell Cattelan To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Jan 28 15:12:35 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7792837B401 for ; Tue, 28 Jan 2003 15:12:34 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9333B43F43 for ; Tue, 28 Jan 2003 15:12:33 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0SNCQZE000900; Wed, 29 Jan 2003 00:12:32 +0100 (CET) (envelope-from phk@freebsd.org) To: Russell Cattelan Cc: fs@freebsd.org Subject: Re: restricted blocks? From: phk@freebsd.org In-Reply-To: Your message of "28 Jan 2003 14:09:30 CST." <1043784569.20928.58.camel@naboo.americas.sgi.com> Date: Wed, 29 Jan 2003 00:12:26 +0100 Message-ID: <899.1043795546@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message <1043784569.20928.58.camel@naboo.americas.sgi.com>, Russell Cattelan writes: >Does anybody have quick answer as to why block 1 isn't writable? In all likelihood your 'd' partition starts at offset zero, and therefore the second sector contains the disklabel, which the kernel will not allow you to overwrite. Use disklabel -e to reduce the size of the 'd' partition by 16 and set its offset to 16, and you should have no trouble. Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
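The arithmetic behind that advice can be sketched in a few lines of Python. This is only an illustration under the assumption stated in the reply, namely that the disklabel sits in the second sector (sector 1) of the slice; the names `LABEL_SECTOR`, `hits_label` and `shift_partition`, and the partition size of 1000 sectors, are hypothetical.

```python
LABEL_SECTOR = 1   # assumption from the reply: the label is in the slice's second sector
RESERVE = 16       # sectors to skip, as suggested in the reply

def hits_label(part_offset: int, rel_block: int) -> bool:
    """True if a partition-relative block lands on the protected label sector."""
    return part_offset + rel_block == LABEL_SECTOR

def shift_partition(offset: int, size: int, reserve: int = RESERVE):
    """Move the partition start forward and shrink it by the same amount."""
    return offset + reserve, size - reserve

# With the 'd' partition at offset 0, only block 1 is refused,
# which matches the dd transcript (oseek=0 and oseek=2 succeed).
print([hits_label(0, block) for block in range(3)])

# After shifting by 16 sectors, no partition-relative block maps to the label.
new_offset, new_size = shift_partition(0, 1000)
print(new_offset, new_size)
print(any(hits_label(new_offset, block) for block in range(new_size)))
```

The end of the partition stays where it was; only the first 16 sectors are given up so that the label falls outside it.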
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Jan 28 21: 4: 8 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 057FF37B401 for ; Tue, 28 Jan 2003 21:04:07 -0800 (PST) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4A4AC43F9B for ; Tue, 28 Jan 2003 21:04:05 -0800 (PST) (envelope-from bde@zeta.org.au) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3/8.8.7) with ESMTP id QAA01383; Wed, 29 Jan 2003 16:03:51 +1100 Date: Wed, 29 Jan 2003 16:05:56 +1100 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: Russell Cattelan Cc: fs@FreeBSD.ORG Subject: Re: restricted blocks? In-Reply-To: <1043784569.20928.58.camel@naboo.americas.sgi.com> Message-ID: <20030129160134.O31111-100000@gamplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On 28 Jan 2003, Russell Cattelan wrote: > Does anybody have quick answer as to why block 1 isn't writable? > > naboo[6:24pm]#dd if=/dev/zero of=/dev/da1s1d bs=512 oseek=0 count=1 > 1+0 records in > 1+0 records out > 512 bytes transferred in 0.000575 secs (890518 bytes/sec) > naboo[6:24pm]#dd if=/dev/zero of=/dev/da1s1d bs=512 oseek=1 count=1 > dd: /dev/da1s1d: Operation not permitted > 1+0 records in > 0+0 records out > 0 bytes transferred in 0.000819 secs (0 bytes/sec) > naboo[6:24pm]#dd if=/dev/zero of=/dev/da1s1d bs=512 oseek=2 count=1 > 1+0 records in > 1+0 records out > 512 bytes transferred in 0.000507 secs (1009630 bytes/sec) > > XFS uses this location for one of it's meta data block. Most likely block 1 has a disk label on it. 
The errno for this has apparently regressed from EROFS to EPERM. Bruce To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 8:30:51 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 96FF237B401 for ; Fri, 31 Jan 2003 08:30:50 -0800 (PST) Received: from mcomail02.maxtor.com (mcomail02.maxtor.com [134.6.76.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id AF07943F43 for ; Fri, 31 Jan 2003 08:30:49 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail02.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VGOOX00470; Fri, 31 Jan 2003 09:24:24 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKRXR; Fri, 31 Jan 2003 09:30:47 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id LAA0000001462; Fri, 31 Jan 2003 11:30:20 -0500 (EST) Date: Fri, 31 Jan 2003 11:30:18 -0500 Mime-Version: 1.0 (Apple Message framework v551) Content-Type: text/plain; charset=US-ASCII; format=flowed Subject: DEV_B_SIZE From: Steve Byan To: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Content-Transfer-Encoding: 7bit Message-Id: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com> X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org There's a notion afoot in IDEMA to enlarge the underlying physical block size of disks to 4096 bytes while keeping a 512-byte logical block size for the interface. 
Unaligned accesses would involve either a read-modify-write or some proprietary mechanism that provides persistence without the latency cost of a read-modify-write. Performance issues aside, it occurs to me that hiding the underlying physical block size may break many careful-write and transaction-logging mechanisms, which may depend on no more than one block being corrupted during a failure. In IDEMA's proposal, a power failure during a write of a single 512-byte logical block could result in the corruption of the full 4K block, i.e. reads of any of the 512-byte logical blocks in that 4K physical block would return an uncorrectable ECC error. I'd appreciate hearing examples where hiding the underlying physical block size would break a file system, database, transaction processing monitor, or whatever. Please let me know if I may forward your reply to the committee. Thanks. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 8:51: 0 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 55BE737B401 for ; Fri, 31 Jan 2003 08:50:59 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 398EA43F43 for ; Fri, 31 Jan 2003 08:50:58 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0VGor4W002640; Fri, 31 Jan 2003 17:50:54 +0100 (CET) (envelope-from phk@freebsd.org) To: Steve Byan Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE From: phk@freebsd.org In-Reply-To: Your message of "Fri, 31 Jan 2003 11:30:18 EST." 
<4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com> Date: Fri, 31 Jan 2003 17:50:53 +0100 Message-ID: <2639.1044031853@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>, Steve Byan writes : >I'd appreciate hearing examples where hiding the underlying physical >block size would break a file system, database, transaction processing >monitor, or whatever. Please let me know if I may forward your reply >to the committee. Thanks. If by "hide" you mean that there will be no way to discover the smallest atomic unit of writes, then you are right: it would be bad. Provided we can get the size of the smallest atomic unit of writes in a standardized, documented, mandatory way, we will have no problem coping with it: Using a 4k size is no problem for our current filesystem technologies and device sizes. It was my impression that already many drives write entire tracks as atomic units, at least we have had plenty of anecdotal evidence to this effect ? Poul-Henning (FreeBSD's disk-I/O wizard) -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 9: 4: 2 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C6E5637B401; Fri, 31 Jan 2003 09:04:00 -0800 (PST) Received: from mcomail01.maxtor.com (mcomail01.maxtor.com [134.6.76.15]) by mx1.FreeBSD.org (Postfix) with ESMTP id 47CD243E4A; Fri, 31 Jan 2003 09:03:59 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail01.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VGrUI31927; Fri, 31 Jan 2003 09:53:30 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKTKZ; Fri, 31 Jan 2003 10:03:58 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id MAA0000001510; Fri, 31 Jan 2003 12:03:46 -0500 (EST) Date: Fri, 31 Jan 2003 12:03:44 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org To: phk@freebsd.org From: Steve Byan In-Reply-To: <2639.1044031853@critter.freebsd.dk> Message-Id: Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 11:50 AM, phk@freebsd.org wrote: > In message <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>, Steve > Byan writes > : > >> I'd appreciate hearing examples where hiding the underlying physical >> block size would break a file system, database, transaction processing >> monitor, or whatever. 
Please let me know if I may forward your reply >> to the committee. Thanks. > > If by "hide" you mean that there will be no way to discover the > smallest atomic unit of writes, then you are right: it would be bad. The notion is that such a disk would be instantly-compatible with existing software, modulo performance issues. I suspect this is not the case, and am searching for expert opinions in this matter. > Provided we can get the size of the smallest atomic unit of writes > in a standardized, documented, mandatory way, we will have no problem > coping with it: Using a 4k size is no problem for our current > filesystem technologies and device sizes. Yes, I understand recompiling the world for 4K is possible. My question is whether not doing so poses a data-integrity / fail-recovery risk. > It was my impression that already many drives write entire tracks > as atomic units, at least we have had plenty of anecdotal evidence > to this effect ? I'm not aware of any SCSI or ATA disks which do this; certainly no Maxtor disk does. Count-key-data mainframe disks can be formatted to do so, but such disks probably don't run Unix. Caching in ATA disks might lead one to believe that the disk could corrupt an entire track, in the sense that a panic ( aka bluescreen) or a power-failure would cause all pending writes in its buffer to be lost, but even in ATA-land I don't believe a power failure would result in more than one disk block returning an uncorrectable read error. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. 
MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 9:18: 9 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 451C237B401 for ; Fri, 31 Jan 2003 09:18:08 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 65E6043F79 for ; Fri, 31 Jan 2003 09:18:07 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0VHI64W002904; Fri, 31 Jan 2003 18:18:06 +0100 (CET) (envelope-from phk@freebsd.org) To: Steve Byan Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE From: phk@freebsd.org In-Reply-To: Your message of "Fri, 31 Jan 2003 12:03:44 EST." Date: Fri, 31 Jan 2003 18:18:06 +0100 Message-ID: <2903.1044033486@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message , Steve Byan writes : > >On Friday, January 31, 2003, at 11:50 AM, phk@freebsd.org wrote: > >> In message <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>, Steve >> Byan writes >> : >> >>> I'd appreciate hearing examples where hiding the underlying physical >>> block size would break a file system, database, transaction processing >>> monitor, or whatever. Please let me know if I may forward your reply >>> to the committee. Thanks. >> >> If by "hide" you mean that there will be no way to discover the >> smallest atomic unit of writes, then you are right: it would be bad. > >The notion is that such a disk would be instantly-compatible with >existing software, modulo performance issues. 
I suspect this is not the >case, and am searching for expert opinions in this matter. I'm fine with that, as long as the disk somewhere in a data field we can query (if need be with a new request) exposes the smallest atomically writable unit. The only thing that exposes us to risk is we don't know the risk exists, so as long as the fact that a 4k physical sector size is used is not hidden from us, we can adapt. >Yes, I understand recompiling the world for 4K is possible. My question >is whether not doing so poses a data-integrity / fail-recovery risk. Nope. >> It was my impression that already many drives write entire tracks >> as atomic units, at least we have had plenty of anecdotal evidence >> to this effect ? > >I'm not aware of any SCSI or ATA disks which do this; certainly no >Maxtor disk does. Ok, that is nice to know. And yes, we've had our trouble with write caches. Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 9:55:14 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 6E2B437B401 for ; Fri, 31 Jan 2003 09:55:13 -0800 (PST) Received: from host213-122-85-204.in-addr.btopenworld.com (host213-122-85-204.in-addr.btopenworld.com [213.122.85.204]) by mx1.FreeBSD.org (Postfix) with ESMTP id A6C2443E4A for ; Fri, 31 Jan 2003 09:55:05 -0800 (PST) (envelope-from dsl@l8s.co.uk) Received: (from dsl@localhost) by snowdrop.l8s.co.uk (8.11.6/8.11.6) id h0VHxI547889; Fri, 31 Jan 2003 17:59:18 GMT Date: Fri, 31 Jan 2003 17:59:17 +0000 From: David Laight To: Steve Byan Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE Message-ID: <20030131175917.E1487@snowdrop.l8s.co.uk> References: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>; from stephen_byan@maxtor.com on Fri, Jan 31, 2003 at 11:30:18AM -0500 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Fri, Jan 31, 2003 at 11:30:18AM -0500, Steve Byan wrote: > There's a notion afoot in IDEMA to enlarge the underlying physical > block size of disks to 4096 bytes while keeping a 512-byte logical > block size for the interface. Unaligned accesses would involve either a > read-modify-write or some proprietary mechanism that provides > persistence without the latency cost of a read-modify-write. There probably ought to be a way of making the larger physical size visible to systems that are willing to support larger block sizes. That way misaligned transfers would be far less likely. 
One problem to consider is that disks are still partitioned on cylinder boundaries. This is largely historic and doesn't actually make much sense, since the geometry almost certainly varies across the disk and has to be faked to fit the ATA CHS limits and (on PCs) the BIOS interface.

However, what it does mean is that a partition could easily not start on an 8-sector (8 x 512 byte) boundary. So misaligned transfers are likely even if the filesystem itself is using 4k blocks.

On a PC the partitioning will typically have the first partition starting in sector 63, and the others at multiples of 16065 sectors from the start of the disk.

This doesn't bode well for getting any aligned transfer at all.

David

--
David Laight: david@l8s.co.uk

From owner-freebsd-fs Fri Jan 31 10:16:46 2003
Date: Fri, 31 Jan 2003 10:16:41 -0800 (PST)
From: Julian Elischer
To: Steve Byan
Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org
Subject: Re: DEV_B_SIZE
In-Reply-To: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>
On Fri, 31 Jan 2003, Steve Byan wrote:

> There's a notion afoot in IDEMA to enlarge the underlying physical
> block size of disks to 4096 bytes while keeping a 512-byte logical
> block size for the interface. Unaligned accesses would involve either a
> read-modify-write or some proprietary mechanism that provides
> persistence without the latency cost of a read-modify-write.
>
> Performance issues aside, it occurs to me that hiding the underlying
> physical block size may break many careful-write and
> transaction-logging mechanisms, which may depend on no more than one
> block being corrupted during a failure. In IDEMA's proposal, a power
> failure during a write of a single 512-byte logical block could result
> in the corruption of the full 4K block, i.e. reads of any of the
> 512-byte logical blocks in that 4K physical block would return an
> uncorrectable ECC error.
>
> I'd appreciate hearing examples where hiding the underlying physical
> block size would break a file system, database, transaction processing
> monitor, or whatever. Please let me know if I may forward your reply
> to the committee. Thanks.

I presume that if such a drive were made, there would be some way to identify it?

It would be very easy to configure a filesystem to have a minimum writable unit size of 4k, and I assume that doing so would be slightly advantageous (no read-modify-write). It would, however, be good if we could easily identify when doing so was a good idea.

Another idea would be to have some way that you could specify a block number and have the drive tell you the first block in the same group. That would allow a filesystem to work out the alignment. It may not be able to access absolute block numbers if it's going through some layers of translation, so some way of asking "am I aligned?" might be useful.
One thing that does come to mind is that, as you say, on power fail we would now be liable to lose a group of 8 sectors (4k) instead of one 512-byte sector.

Recovery algorithms might have to deal with this (should we actually decide to write one.. :-).

This matters particularly if the block being written was the 1st, but the other 7 blocks contain data that the OS has no way of knowing is in jeopardy. In other words, I might know that block 1 is in danger and put it in a write log (in a logging filesystem), but I have no way of knowing that the other 7 are in danger, so they may not be in the write log (assuming that the write log only holds the last N transactions). I'd say that this means that the drive should hold the active 4k block in nvram or something.

You seem to have considered this, but I agree that it could prove "nasty" in exactly the cases that are most important. People use write logging etc. in cases where they care about the data and recovery time. These are exactly the people who are going to be the most pissed off to lose their data.

If we can easily tell the system to use 4k frags or 4k block numbers (i.e. we can elect to expose the real blocksize) then we are probably in better shape.
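
The exposure described above can be worked through with a little arithmetic. The following is a minimal sketch, not anything proposed in the thread: it assumes 4096-byte physical blocks that begin at logical block 0 of the drive, and the names `physical_group`, `at_risk`, and `is_aligned` are hypothetical. It lists the logical blocks that share a physical block with the one being written, and checks the partition start sectors David Laight mentioned earlier in the thread (63 and multiples of 16065) for 8-sector alignment.

```python
LOGICAL = 512                  # logical block size seen at the interface (bytes)
PHYSICAL = 4096                # proposed physical block size (bytes)
RATIO = PHYSICAL // LOGICAL    # logical blocks per physical block (8)


def physical_group(lba):
    """Logical block numbers that share a physical block with `lba`.

    Assumption: physical blocks start at logical block 0, so group
    boundaries fall on multiples of RATIO.
    """
    first = lba - (lba % RATIO)
    return list(range(first, first + RATIO))


def at_risk(lba):
    """Blocks other than `lba` that an interrupted write of `lba` could take down."""
    return [b for b in physical_group(lba) if b != lba]


def is_aligned(start_lba):
    """True if a partition starting at `start_lba` begins on a physical block boundary."""
    return start_lba % RATIO == 0


if __name__ == "__main__":
    # Writing the first logical block of a group endangers the other seven.
    print("group of block 8:", physical_group(8))
    print("also at risk:    ", at_risk(8))

    # Typical PC partition starts quoted in the thread.
    for start in (63, 16065, 2 * 16065, 8 * 16065):
        print(start, "aligned" if is_aligned(start) else
              "misaligned by %d sectors" % (start % RATIO))
```

Under these assumptions a write to logical block 8 puts blocks 9 through 15 at risk, and neither sector 63 nor sector 16065 falls on an 8-sector boundary; because 16065 is odd, the first multiple of it that is aligned is the eighth.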
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 10:41:52 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3DC5A37B401; Fri, 31 Jan 2003 10:41:50 -0800 (PST) Received: from mcomail02.maxtor.com (mcomail02.maxtor.com [134.6.76.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 683B243F93; Fri, 31 Jan 2003 10:41:49 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail02.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VIZQU16809; Fri, 31 Jan 2003 11:35:26 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKX44; Fri, 31 Jan 2003 11:41:48 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id NAA0000002101; Fri, 31 Jan 2003 13:41:36 -0500 (EST) Date: Fri, 31 Jan 2003 13:41:35 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org To: phk@freebsd.org From: Steve Byan In-Reply-To: <2903.1044033486@critter.freebsd.dk> Message-Id: Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 12:18 PM, phk@freebsd.org wrote: > In message , Steve > Byan writes > : >> >> On Friday, January 31, 2003, at 11:50 AM, phk@freebsd.org wrote: >> >>> In message <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>, Steve >>> Byan writes >>> : >>> >>>> I'd appreciate hearing examples where hiding the underlying physical >>>> block size 
would break a file system, database, transaction >>>> processing >>>> monitor, or whatever. Please let me know if I may forward your >>>> reply >>>> to the committee. Thanks. >>> >>> If by "hide" you mean that there will be no way to discover the >>> smallest atomic unit of writes, then you are right: it would be bad. >> >> The notion is that such a disk would be instantly-compatible with >> existing software, modulo performance issues. I suspect this is not >> the >> case, and am searching for expert opinions in this matter. > > I'm fine with that, as long as the disk somewhere in a data field > we can query (if need be with a new request) exposes the smallest > atomically writable unit. > > The only thing that exposes us to risk is we don't know the risk > exists, so as long as the fact that a 4k physical sector size is > used is not hidden from us, we can adapt. But would existing code be functionally broken (perhaps with respect to failure recovery) if it were to not be modified to adapt to a different physical block size? > >> Yes, I understand recompiling the world for 4K is possible. My >> question >> is whether not doing so poses a data-integrity / fail-recovery risk. > > Nope. Really? fsck can recover from losing 4K bytes surrounding the last metadata block written? Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. 
MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 10:45:38 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D10B037B401 for ; Fri, 31 Jan 2003 10:45:36 -0800 (PST) Received: from mcomail01.maxtor.com (mcomail01.maxtor.com [134.6.76.15]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3869343F3F for ; Fri, 31 Jan 2003 10:45:36 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail01.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VIZ2F32011; Fri, 31 Jan 2003 11:35:02 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKXY8; Fri, 31 Jan 2003 11:45:30 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id NAA0000001943; Fri, 31 Jan 2003 13:45:04 -0500 (EST) Date: Fri, 31 Jan 2003 13:45:03 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org To: David Laight From: Steve Byan In-Reply-To: <20030131175917.E1487@snowdrop.l8s.co.uk> Message-Id: <1BBFD4B2-354C-11D7-B26B-00306548867E@maxtor.com> Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 12:59 PM, David Laight wrote: > On Fri, Jan 31, 2003 at 11:30:18AM -0500, Steve Byan wrote: >> There's a notion afoot in IDEMA to enlarge the underlying physical >> block size of disks to 4096 bytes while keeping 
a 512-byte logical
>> block size for the interface. Unaligned accesses would involve either
>> a
>> read-modify-write or some proprietary mechanism that provides
>> persistence without the latency cost of a read-modify-write.
>
> There probably ought to be a way of making the larger physical
> size visible to systems that are willing to support larger
> block sizes. That way misaligned transfers would be far less
> likely.

Yes, of course. But I asked with respect to an issue other than performance.

>
> One problem to consider is that disks are still partitioned
> on cylinder boundaries. This is largely historic and
> doesn't actually make much sense, since the geometry
> almost certainly varies across the disk and has to be faked
> to fit the ATA CHS limits and (on PCs) the BIOS interface.
>
> However what it does mean is that a partition could easily
> not start on an 8 (512 byte) sector boundary.
> So misaligned transfers are likely even if the filesystem
> itself is using 4k blocks.
>
> On a PC the partitioning will typically have the first one
> starting in sector 63, and the others at multiples of 16065
> sectors from the start of the disk.
>
> This doesn't bode well for getting any aligned transfer
> at all.

We understand that problem. It's just a performance issue. My concern is that even if we handwave the performance issues, there's an underlying semantic that would not be satisfied if we were to run existing software, unmodified, on a disk with an underlying 4K sector size.

Regards,
-Steve

--------
Steve Byan
Design Engineer
Maxtor Corp.
MS 1-3/E23
333 South Street
Shrewsbury, MA 01545
(508) 770-3414

From owner-freebsd-fs Fri Jan 31 10:50:52 2003
Date: Fri, 31 Jan 2003 18:55:07 +0000
From: David Laight
To: Steve Byan
Cc: phk@freebsd.org, freebsd-fs@freebsd.org, tech-kern@netbsd.org
Subject: Re: DEV_B_SIZE
Message-ID: <20030131185507.G1487@snowdrop.l8s.co.uk>
References: <2903.1044033486@critter.freebsd.dk>

> Really? fsck can recover from losing 4K bytes surrounding the last
> metadata block written?

The only metadata that matter are the inodes and (for ffs) the indirect blocks. You really do want the latter to be single disk blocks; many systems actually write them synchronously.

The inode is (probably) only 128 bytes, so losing an inode block will also lose the other files whose inodes share that block.

A journaling filesystem probably already has ways around this...
David

--
David Laight: david@l8s.co.uk

From owner-freebsd-fs Fri Jan 31 10:51:27 2003
To: Steve Byan
Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org
Subject: Re: DEV_B_SIZE
From: phk@freebsd.org
In-Reply-To: Your message of "Fri, 31 Jan 2003 13:41:35 EST."
Date: Fri, 31 Jan 2003 19:51:24 +0100
Message-ID: <19187.1044039084@critter.freebsd.dk>

Steve Byan writes:

>> The only thing that exposes us to risk is we don't know the risk
>> exists, so as long as the fact that a 4k physical sector size is
>> used is not hidden from us, we can adapt.
>
>But would existing code be functionally broken (perhaps with respect to
>failure recovery) if it were to not be modified to adapt to a different
>physical block size?

Not broken any worse than it already is because of write caching.

>> Nope.
>
>Really? fsck can recover from losing 4K bytes surrounding the last
>metadata block written?

If the fragment size is 4k when the filesystem is created, and this would happen automatically, then there is no window for lossage.

The thing we really need is working tagged queueing...
--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-fs Fri Jan 31 10:56: 9 2003
From: "Jamey Kirby"
To: "'Steve Byan'", "'David Laight'"
Subject: RE: DEV_B_SIZE
Date: Fri, 31 Jan 2003 10:55:47 -0800
Organization: StorageCraft
Message-ID: <001601c2c95a$63d52d70$0300a8c0@jkirbydesk>
In-Reply-To: <1BBFD4B2-354C-11D7-B26B-00306548867E@maxtor.com>

I have been a lurker for years and want to chime in.

Under Windows NT (all flavors), using a 4K sector size works fine. The OS abstraction layers are very good at handling the alignment. I wrote a virtual SCSI disk driver (ATA is presented as SCSI to the NT OS kernel) and experimented with all sorts of sector sizes to see how various software would handle it. I found no problems...

However, I myself have written test code in the past that assumes 512-byte sectors rather than reading the sector size from the OS. Surely this code would break.

Jamey Kirby
StorageCraft

From owner-freebsd-fs Fri Jan 31 10:56:47 2003
Date: Fri, 31 Jan 2003 13:56:09 -0500
Subject: Re: DEV_B_SIZE
Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org
To: Julian Elischer
From: Steve Byan

On Friday, January 31, 2003, at 01:16 PM, Julian Elischer wrote:

>
>
> On Fri, 31 Jan 2003, Steve Byan wrote:
>
>> There's a notion afoot in IDEMA to enlarge the underlying physical
>> block size of disks to 4096 bytes while keeping a 512-byte logical
>> block size for the interface. Unaligned accesses would involve either
>> a
>> read-modify-write or some proprietary mechanism that provides
>> persistence without the latency cost of a read-modify-write.
>>
>> Performance issues aside, it occurs to me that hiding the underlying
>> physical block size may break many careful-write and
>> transaction-logging mechanisms, which may depend on no more than one
>> block being corrupted during a failure. In IDEMA's proposal, a power
>> failure during a write of a single 512-byte logical block could result
>> in the corruption of the full 4K block, i.e. reads of any of the
>> 512-byte logical blocks in that 4K physical block would return an
>> uncorrectable ECC error.
>>
>> I'd appreciate hearing examples where hiding the underlying physical
>> block size would break a file system, database, transaction processing
>> monitor, or whatever. Please let me know if I may forward your reply
>> to the committee. Thanks.
>
> I presume that if such a drive were made, there would be some way to
> identify it?

Yes, but my concern is that advocates claim existing software could work (albeit slowly) with such a drive.
It's hard to retroactively modify binaries installed in the field to adapt to a larger block size :-) > > It would be very easy to configure a filesystem to have a minimum > writable unit size of 4k, and I assume that doing so would be > slightly advantageous (no read/modify/write). It would, however, > be good if we could easily identify when doing so was a good idea. Yes, I've built and run OSF/1 on a system with 4K sector size; this was essentially BSD4.3. Modifying DEV_B_SIZE and recompiling the world was sufficient (well, actually the boot loader had to know the block size, and I needed a way to format the disks to 4K, and ...). > > Another idea would be to have some way that you could specify a block > number and have the drive tell you the first in the same group. That > would allow a filesystem to work out the alignment. It may not be able > to access absolute block numbers, if it's going through some layers of > translation, and some way of saying "am I aligned?" might be useful. > > One thing that does come to mind is that, as you say, on power fail we > would now be liable to lose a group of 8 sectors (4k) instead of 1 x > 512-byte sector. > > Recovery algorithms might have to deal with this (should we actually > decide to write one.. :-). > > Particularly if the block being written was the 1st, but the other 7 > blocks contain data that the OS has no way of knowing is in > jeopardy. In other words, I might know that block 1 is in danger and put > it in a write log (in a logging filesystem), but I have no way of > knowing that the other 7 are in danger, so they may not be in the write > log (assuming that the write log only holds the last N transactions). > I'd say that this means that the drive should hold the active 4k block > in nvram or something.. > > You seem to have considered this but I'm in agreement that it could > prove "nasty" in exactly the cases that are most important.. > people use write logging etc.
in cases where they care about the data > and recovery time. these are exactly the people who are going to be the > most pissed off to lose their data. .. Thanks, may I forward your response on to the committee? > > If we can easily telll the system to use 4k frags or 4k blocknumbers > (i.e. we can elect to expose the real blocksize) then we are probably > in better shape. I agree. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11: 1:18 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D11C737B401; Fri, 31 Jan 2003 11:01:16 -0800 (PST) Received: from mcomail01.maxtor.com (mcomail01.maxtor.com [134.6.76.15]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3590F43F75; Fri, 31 Jan 2003 11:01:16 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail01.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VIofM04700; Fri, 31 Jan 2003 11:50:41 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKYLV; Fri, 31 Jan 2003 12:01:08 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id OAA0000002209; Fri, 31 Jan 2003 14:00:57 -0500 (EST) Date: Fri, 31 Jan 2003 14:00:55 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: phk@freebsd.org, freebsd-fs@freebsd.org, tech-kern@netbsd.org To: David Laight From: Steve Byan In-Reply-To: <20030131185507.G1487@snowdrop.l8s.co.uk> Message-Id: <538478DE-354E-11D7-B26B-00306548867E@maxtor.com> 
Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 01:55 PM, David Laight wrote: >> Really? fsck can recover from losing 4K bytes surrounding the last >> metadata block written? > > The only metadata that matter are the inodes and (for ffs) the > indirect blocks. You do really want the latter to be single disk > blocks - many systems actually write them synchonously. What could be the effect of losing surrounding blocks on the (failed) write of an indirect block? Can we guarantee that fsck can reconstruct the filesystem, modulo some recently-created or deleted files, or is there a possibility of losing the entire filesystem? > The inode is (probably) only 128 bytes, losing an inode block > will lose the other files. > > A journaling filesystem probably already has ways around this... I think journaling filesystems need to know the atomic block size in order to structure their log in a fault-tolerant way; I'm hoping someone on these lists can provide some details. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. 
MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11: 6:17 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B1FEA37B401; Fri, 31 Jan 2003 11:06:15 -0800 (PST) Received: from mcomail02.maxtor.com (mcomail02.maxtor.com [134.6.76.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1190343F3F; Fri, 31 Jan 2003 11:06:15 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail02.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VIxqN25634; Fri, 31 Jan 2003 11:59:52 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKYTB; Fri, 31 Jan 2003 12:06:14 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id OAA0000002262; Fri, 31 Jan 2003 14:06:13 -0500 (EST) Date: Fri, 31 Jan 2003 14:06:11 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org To: phk@freebsd.org From: Steve Byan In-Reply-To: <19187.1044039084@critter.freebsd.dk> Message-Id: <1010FEB6-354F-11D7-B26B-00306548867E@maxtor.com> Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 01:51 PM, phk@freebsd.org wrote: > In message , Steve > Byan writes > : > >>> The only thing that exposes us to risk is we don't know the risk >>> exists, so as long as the fact that a 4k physical sector size is >>> used is 
not hidden from us, we can adapt. >> >> But would existing code be functionally broken (perhaps with respect to >> failure recovery) if it were to not be modified to adapt to a different >> physical block size? > > Not broken any worse than because of write-caching. Agreed, but IDEMA is proposing to do this to SCSI drives, too. > >>> Nope. >> >> Really? fsck can recover from losing 4K bytes surrounding the last >> metadata block written? > > If the fragment size is 4k when the filesystem is created, and this > would happen automatically, then there is no window for lossage. But if someone were to plug a new 4K-block disk into a system compiled to use 512-byte-block disks, and the SCSI interface were faked to make it appear that the disk could read and write 512-byte blocks, then what happens? IDEMA's notion is that faking a 512-byte logical size is good enough to get new disks to work in systems running legacy code. My fear is that it is not so simple. > > The thing we really need is working tagged-queuing... Since I believe tagged-queuing works in SCSI, I assume you are asking for it in ATA? Or is there some feature missing from SCSI tagged-queuing that you'd like to see? Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp.
MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11: 9: 4 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 46A0B37B401 for ; Fri, 31 Jan 2003 11:09:03 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id BD5C943F85 for ; Fri, 31 Jan 2003 11:09:01 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0VJ8l4W022439; Fri, 31 Jan 2003 20:08:48 +0100 (CET) (envelope-from phk@freebsd.org) To: Steve Byan Cc: David Laight , freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE From: phk@freebsd.org In-Reply-To: Your message of "Fri, 31 Jan 2003 14:00:55 EST." <538478DE-354E-11D7-B26B-00306548867E@maxtor.com> Date: Fri, 31 Jan 2003 20:08:47 +0100 Message-ID: <22438.1044040127@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message <538478DE-354E-11D7-B26B-00306548867E@maxtor.com>, Steve Byan writes : >>> Really? fsck can recover from losing 4K bytes surrounding the last >>> metadata block written? >> >> The only metadata that matter are the inodes and (for ffs) the >> indirect blocks. You do really want the latter to be single disk >> blocks - many systems actually write them synchonously. > >What could be the effect of losing surrounding blocks on the (failed) >write of an indirect block? Can we guarantee that fsck can reconstruct >the filesystem, modulo some recently-created or deleted files, or is >there a possibility of losing the entire filesystem? 
For inodes the situation is no different, only the exposure is greater: instead of losing three neighbour inodes we lose 31 neighbour inodes. (Or for ufs2: 1 vs 15 inodes). As long as I can ask the drive what the size of an atomic transfer is, it doesn't matter much to us if it is 512, 1k, 2k or 4k. Going above 4k would probably be a bit premature and therefore inconvenient. If drives that come out with 4k sectors end up trashing too much data for people, they will get a bad reputation rather fast and I'm sure market mechanisms will take care of the issue. If they exhibit no worse losses than we already see due to write caching and bugs in same, then the market won't react and you guys can squeeze another N% of disk space out of the same platter. (I may be an anomaly in this, but I have actually worked on systems which used 1k sector size on their 8" floppies when they made backup copies, to increase the capacity a small bit.) I get the sense that you want us to say "NOOOO this is HORRIBLE!!!" and you won't stop asking until we do? You won't have that from this bloke at least. I don't know what agenda you are pushing, but I'm not pushing it for you... Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
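[Archive note] The neighbour-inode figures above follow from dividing the atomic write size by the on-disk inode size. A minimal sketch of that arithmetic, assuming 128-byte UFS1 and 256-byte UFS2 inodes (the sizes quoted elsewhere in this thread); the function name is made up for illustration and is not part of any FreeBSD interface:

```python
def neighbour_inodes_at_risk(atomic_bytes: int, inode_bytes: int) -> int:
    """Count the other inodes that share one atomic write unit
    with the inode being updated (hypothetical helper)."""
    if atomic_bytes % inode_bytes != 0:
        raise ValueError("atomic size must be a multiple of the inode size")
    return atomic_bytes // inode_bytes - 1


# Assumed inode sizes: 128 bytes (UFS1) and 256 bytes (UFS2).
for name, inode_bytes in (("UFS1", 128), ("UFS2", 256)):
    for atomic_bytes in (512, 4096):
        n = neighbour_inodes_at_risk(atomic_bytes, inode_bytes)
        print(f"{name}: {atomic_bytes}-byte unit -> {n} neighbour inodes at risk")
```

Under those assumptions the loop reproduces the 3 vs 31 (UFS1) and 1 vs 15 (UFS2) figures given in the message.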
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:11:18 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 18EFF37B401 for ; Fri, 31 Jan 2003 11:11:17 -0800 (PST) Received: from mcomail02.maxtor.com (mcomail02.maxtor.com [134.6.76.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 590AE43F43 for ; Fri, 31 Jan 2003 11:11:16 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail02.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VJ4oq27414; Fri, 31 Jan 2003 12:04:50 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKYZ1; Fri, 31 Jan 2003 12:11:13 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id OAA0000002089; Fri, 31 Jan 2003 14:10:55 -0500 (EST) Date: Fri, 31 Jan 2003 14:10:54 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: David Laight , freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org To: Lord Isildur From: Steve Byan In-Reply-To: Message-Id: Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 01:51 PM, Lord Isildur wrote: > to just get the performance of aligned accesses, we dont need to modify > block sizes and such stuff. 
an an example, read the paper linked to > from > this; http://www.pdl.cmu.edu/PDL-FTP/stray/traxtent_abs.html > (brought to you by the same folks who did soft updates and raidframe) Thanks, I'm aware of the excellent CMU paper. In fact, if anyone wants a way to get the complete physical geometry of Maxtor SCSI disks just by reading mode-pages, email me and I can supply the details. My concern is with the proposed backward-compatibility mode, which I fear subtly breaks the failure semantics which systems with persistent storage rely upon to recover. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:11:31 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 923CE37B405; Fri, 31 Jan 2003 11:11:29 -0800 (PST) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1615F43F9B; Fri, 31 Jan 2003 11:11:28 -0800 (PST) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id DEA1AAE1C1; Fri, 31 Jan 2003 11:11:27 -0800 (PST) Date: Fri, 31 Jan 2003 11:11:27 -0800 From: Alfred Perlstein To: Steve Byan Cc: David Laight , phk@freebsd.org, freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE Message-ID: <20030131191127.GS85104@elvis.mu.org> References: <20030131185507.G1487@snowdrop.l8s.co.uk> <538478DE-354E-11D7-B26B-00306548867E@maxtor.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <538478DE-354E-11D7-B26B-00306548867E@maxtor.com> User-Agent: Mutt/1.4i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: 
X-Loop: FreeBSD.org I hope I'm not mistaken here, but for FFS to work it needs the 512 byte ops to be atomic, making them not so, or possibly obliterate surrounding blocks doesn't sound like a good idea at all. Shouldn't you guys be asking Dr McKusick? I can forward this question on to some of the fs people at Apple as well. * Steve Byan [030131 11:01] wrote: > > On Friday, January 31, 2003, at 01:55 PM, David Laight wrote: > > >>Really? fsck can recover from losing 4K bytes surrounding the last > >>metadata block written? > > > >The only metadata that matter are the inodes and (for ffs) the > >indirect blocks. You do really want the latter to be single disk > >blocks - many systems actually write them synchonously. > > What could be the effect of losing surrounding blocks on the (failed) > write of an indirect block? Can we guarantee that fsck can reconstruct > the filesystem, modulo some recently-created or deleted files, or is > there a possibility of losing the entire filesystem? > > >The inode is (probably) only 128 bytes, losing an inode block > >will lose the other files. > > > >A journaling filesystem probably already has ways around this... > > I think journaling filesystems need to know the atomic block size in > order to structure their log in a fault-tolerant way; I'm hoping > someone on these lists can provide some details. > > Regards, > -Steve > -------- > Steve Byan > Design Engineer > Maxtor Corp. > MS 1-3/E23 > 333 South Street > Shrewsbury, MA 01545 > (508) 770-3414 > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message -- -Alfred Perlstein [alfred@freebsd.org] 'Instead of asking why a piece of software is using "1970s technology," start asking why software is ignoring 30 years of accumulated wisdom.' 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:21:20 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 90A3437B401 for ; Fri, 31 Jan 2003 11:21:19 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id A295543F9B for ; Fri, 31 Jan 2003 11:21:18 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0VJLG4W024732; Fri, 31 Jan 2003 20:21:17 +0100 (CET) (envelope-from phk@freebsd.org) To: Alfred Perlstein Cc: Steve Byan , David Laight , freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE From: phk@freebsd.org In-Reply-To: Your message of "Fri, 31 Jan 2003 11:11:27 PST." <20030131191127.GS85104@elvis.mu.org> Date: Fri, 31 Jan 2003 20:21:16 +0100 Message-ID: <24731.1044040876@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message <20030131191127.GS85104@elvis.mu.org>, Alfred Perlstein writes: >I hope I'm not mistaken here, but for FFS to work it needs the 512 >byte ops to be atomic, making them not so, or possibly obliterate >surrounding blocks doesn't sound like a good idea at all. UFS/FFS has no 512 bytes binding, it can work in other sectorsizes. The implication is that your fragment size may increase. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:22:13 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A553D37B401 for ; Fri, 31 Jan 2003 11:22:11 -0800 (PST) Received: from mcomail02.maxtor.com (mcomail02.maxtor.com [134.6.76.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id E75FE43F43 for ; Fri, 31 Jan 2003 11:22:10 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail02.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VJFkc31069; Fri, 31 Jan 2003 12:15:46 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRKZHS; Fri, 31 Jan 2003 12:22:09 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id OAA0000002265; Fri, 31 Jan 2003 14:21:51 -0500 (EST) Date: Fri, 31 Jan 2003 14:21:49 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: , To: From: Steve Byan In-Reply-To: <001601c2c95a$63d52d70$0300a8c0@jkirbydesk> Message-Id: <3F18DF97-3551-11D7-B26B-00306548867E@maxtor.com> Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 01:55 PM, Jamey Kirby wrote: > I have been a lurker for years and want to chime in. Hi Jamey, recognize your name from the NTFS list. > Under Windows NT (all flavors), using a 4K sector size works fine. The > OS abstraction layers are very good and handling the alignment. 
Yes, I've seen the code in the DDK and in the filesystem developers kit. NT's SCSI driver is already properly parameterized to use the block size returned by the device, as long as it is a power of 2 and greater than 512 bytes. However, I wonder about the failure semantics assumed by NTFS's log - does it rely on the beginning and the ending of each log record being in different physical sectors? Does it rely on no more than one sector being lost at the end of the log (i.e. could wiping out 4K at the tail of the log wipe out enough state such that the recovery code couldn't roll back or roll forward to a consistent filesystem state)? How about Exchange Server? Does its transaction mechanism depend on a specific block size? How about SQL Server? My concern is that a backwards-compatibility mechanism is being proposed that makes a device (even a SCSI device) with 4K physical blocks look like a 512-byte block device. I fear that since the failure semantics are subtly different, the careful-write and persistent logging strategies in current code will break, and no one will know until they experience the corner condition that results in their {filesystem | database | email server | transaction processing monitor} losing their data. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp.
MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:23:55 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2D45237B401 for ; Fri, 31 Jan 2003 11:23:54 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4485F43F75 for ; Fri, 31 Jan 2003 11:23:53 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0VJNq4W025201; Fri, 31 Jan 2003 20:23:52 +0100 (CET) (envelope-from phk@freebsd.org) To: Steve Byan Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE From: phk@freebsd.org In-Reply-To: Your message of "Fri, 31 Jan 2003 14:06:11 EST." <1010FEB6-354F-11D7-B26B-00306548867E@maxtor.com> Date: Fri, 31 Jan 2003 20:23:52 +0100 Message-ID: <25200.1044041032@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message <1010FEB6-354F-11D7-B26B-00306548867E@maxtor.com>, Steve Byan writes : >> Not broken any worse than because of write-caching. > >Agreed, but IDEMA is proposing to do this to SCSI drives, too. We've seen broken caching on SCSI as well, but not recently I think :-) >But if someone were to plug a new 4K-block disk into a system compiled >to use 512 byte block disks, and the SCSI interface were faked to make >it appear that the disk could read and write 512-byte blocks, then what >happens? IDEMA's notion is that faking 512-byte logical size is good >enough to get new disks to work in systems running legacy code. My fear >is that it is not so simple. 
If you plug a 4k sector disk into a system which doesn't know how to find out that the drive really has 4k sectors, then you will increase the window for lossage. >> The thing we really need is working tagged-queuing... > >Since I believe tagged-queuing works in SCSI, I assume you are asking >for it in ATA? Or is there some feature missing from SCSI >tagged-queuing that you'd like to see? Yes, I was talking about ATA there. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:24:57 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E915237B401; Fri, 31 Jan 2003 11:24:55 -0800 (PST) Received: from HAL9000.homeunix.com (12-233-57-224.client.attbi.com [12.233.57.224]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4ED2343F75; Fri, 31 Jan 2003 11:24:55 -0800 (PST) (envelope-from dschultz@uclink.Berkeley.EDU) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.6/8.12.5) with ESMTP id h0VJOrNt016232; Fri, 31 Jan 2003 11:24:53 -0800 (PST) (envelope-from dschultz@uclink.Berkeley.EDU) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.6/8.12.5/Submit) id h0VJOrC2016231; Fri, 31 Jan 2003 11:24:53 -0800 (PST) (envelope-from dschultz@uclink.Berkeley.EDU) Date: Fri, 31 Jan 2003 11:24:52 -0800 From: David Schultz To: Steve Byan Cc: phk@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE Message-ID: <20030131192452.GA15985@HAL9000.homeunix.com> Mail-Followup-To: Steve Byan , phk@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org References: <2903.1044033486@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain;
charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Thus spake Steve Byan : > >The only thing that exposes us to risk is we don't know the risk > >exists, so as long as the fact that a 4k physical sector size is > >used is not hidden from us, we can adapt. > > But would existing code be functionally broken (perhaps with respect to > failure recovery) if it were to not be modified to adapt to a different > physical block size? If the disk corrupts a sector it was writing, that's already a problem for us. If the sector is 4K, that just makes it more of a problem. With FFS and soft updates, we assume that the disk can atomically write 512 bytes, and we ensure filesystem consistency by establishing a safe partial ordering for metadata updates. We expect that after a crash, either the old contents or the new contents of the sector are there. I think we would need to implement journalling to ensure integrity if hard drives were likely to corrupt sectors on power failure. (How often do they do this right now, and how often would they with 4K sectors?) Inodes are 128 bytes (UFS1) or 256 bytes (UFS2), so a 4K sector could contain metadata for a lot of files. If an indirect block is squished, that might be less of a problem because it corresponds to only one file. In one sense, 4K sectors save a little bit of space, since directory entries are never split across a sector boundary so that they can be updated in a single, atomic write. But large sectors are still worse from a reliability point of view if it's possible to lose the entire sector. The LFS is probably in much better shape... 
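[Archive note] To make the exposure discussed in this thread concrete: with a 512-byte logical interface over 4096-byte physical sectors, a torn write to one logical block puts every logical block in the same physical sector at risk. A rough sketch, assuming physical sectors start at logical block 0 with no offset (a real drive might be laid out differently); all names here are illustrative:

```python
LOGICAL_BYTES = 512     # block size presented at the interface
PHYSICAL_BYTES = 4096   # assumed underlying physical sector size
PER_PHYSICAL = PHYSICAL_BYTES // LOGICAL_BYTES  # 8 logical blocks per sector


def blocks_at_risk(lba: int) -> range:
    """Logical blocks sharing a physical sector with `lba` (hypothetical helper)."""
    first = lba - lba % PER_PHYSICAL
    return range(first, first + PER_PHYSICAL)


def is_aligned(start_lba: int) -> bool:
    """True if a region starting at `start_lba` begins on a physical sector boundary."""
    return start_lba % PER_PHYSICAL == 0


# A partition starting at sector 63, as in the PC layout mentioned earlier
# in the thread, is not aligned; neither is a multiple of 16065.
print(list(blocks_at_risk(63)))
print(is_aligned(63), is_aligned(16065), is_aligned(64))
```

A write to logical block 63 in this model endangers blocks 56 through 63, seven of which the filesystem never asked to modify.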
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:42:58 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 11D6037B401 for ; Fri, 31 Jan 2003 11:42:57 -0800 (PST) Received: from sccrmhc03.attbi.com (sccrmhc03.attbi.com [204.127.202.63]) by mx1.FreeBSD.org (Postfix) with ESMTP id 77BB743F3F for ; Fri, 31 Jan 2003 11:42:56 -0800 (PST) (envelope-from julian@elischer.org) Received: from interjet.elischer.org (12-232-168-4.client.attbi.com[12.232.168.4]) by sccrmhc03.attbi.com (sccrmhc03) with ESMTP id <2003013119425500300jva6me>; Fri, 31 Jan 2003 19:42:55 +0000 Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id LAA46067; Fri, 31 Jan 2003 11:42:53 -0800 (PST) Date: Fri, 31 Jan 2003 11:42:53 -0800 (PST) From: Julian Elischer To: Steve Byan Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Fri, 31 Jan 2003, Steve Byan wrote: > > On Friday, January 31, 2003, at 01:16 PM, Julian Elischer wrote: > > > > > Recovery algorythms might have to deal with this (should we actually > > decide to write one.. :-). > > > > Particularly if the block being written was the 1st, but the other 7 > > blocks contain data that the OS has no way of knowing that they are in > > jeopardy. 
In other words, I might know that block 1 is in danger and > > put > > it in a write log, (in a logging filesystem) but I have no way of > > knowing that the other 7 are in danger, so they may not be in the write > > log (assuming thAat the write log only holds the last N transactions.). > > I'd say that this means that the drive should hold the active 4k block > > in nvram or something.. > > > > You seem to have considered this but I'm in agreement that it could > > prove "nasty" in exactly the cases that are most important.. > > people use write logging etc. in cases where they care about the data > > and recovery time. these are exactly the people who are going to be the > > most pissed off to lose their data. .. > > Thanks, may I forward your response on to the committee? sure.. correct the spelling though :-) To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:49: 3 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 35F2037B401; Fri, 31 Jan 2003 11:49:02 -0800 (PST) Received: from sccrmhc01.attbi.com (sccrmhc01.attbi.com [204.127.202.61]) by mx1.FreeBSD.org (Postfix) with ESMTP id EAA5C43F3F; Fri, 31 Jan 2003 11:49:00 -0800 (PST) (envelope-from julian@elischer.org) Received: from InterJet.elischer.org (12-232-168-4.client.attbi.com[12.232.168.4]) by sccrmhc01.attbi.com (sccrmhc01) with ESMTP id <200301311948590010087eaje>; Fri, 31 Jan 2003 19:49:00 +0000 Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id LAA46106; Fri, 31 Jan 2003 11:48:58 -0800 (PST) Date: Fri, 31 Jan 2003 11:48:56 -0800 (PST) From: Julian Elischer To: Steve Byan Cc: phk@freebsd.org, freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE In-Reply-To: <1010FEB6-354F-11D7-B26B-00306548867E@maxtor.com> Message-ID: MIME-Version: 
1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org One thing I thought of is that it is not uncommon to 'dd' an entire filesystem from one partition to another. If we create a filesystem that is 'aligned' and we copy it to be 'unaligned', we'd have a sudden performance drop for no immediately obvious reason. What was one write would become a 2-sector read, modify, and 2-sector write. This is especially likely when copying from one failing drive to another with slightly different characteristics. The idea isn't bad, but I think it should be sold as a 4k sector drive, with small print saying it can handle 512-byte IO, instead of the other way around. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:50:44 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E8B4737B401 for ; Fri, 31 Jan 2003 11:50:43 -0800 (PST) Received: from quic.net (rrcs-central-24-123-205-180.biz.rr.com [24.123.205.180]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3406F43E4A for ; Fri, 31 Jan 2003 11:50:43 -0800 (PST) (envelope-from utsl@quic.net) Received: from localhost (localhost [127.0.0.1]) (uid 1032) by quic.net with local; Fri, 31 Jan 2003 14:50:42 -0500 Date: Fri, 31 Jan 2003 14:50:42 -0500 To: Steve Byan , freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE Message-ID: <20030131195042.GD6243@quic.net> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.28i From: Nathan Hawkins Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: 
List-Unsubscribe: X-Loop: FreeBSD.org On Fri, Jan 31, 2003 at 02:10:54PM -0500, Steve Byan wrote: > Thanks, I'm aware of the excellent CMU paper. In fact, if anyone wants > a way to get the complete physical geometry of Maxtor SCSI disks just > by reading mode-pages, email me and I can supply the details. I'd be interested in that. Are those published? > My concern is with the proposed backward-compatibility mode, which I > fear subtly breaks the failure semantics which systems with persistent > storage rely upon to recover. You might want to talk with Veritas. I'm pretty sure their Volume Manager's log subdisks assume 512-byte sectors. More generally, what impact would this have on existing RAID implementations, hardware or software? This is a potentially more damaging impact than filesystem semantics. ---Nathan To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:52:44 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C81F737B405 for ; Fri, 31 Jan 2003 11:52:42 -0800 (PST) Received: from hitl.washington.edu (hitl-new.hitl.washington.edu [128.95.73.60]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4F98C43E4A for ; Fri, 31 Jan 2003 11:52:42 -0800 (PST) (envelope-from perseant@hitl.washington.edu) Received: from psychosis.hitl.washington.edu (psychosis.hitl.washington.edu [128.95.74.36]) by hitl.washington.edu (8.11.6/8.9.3) with ESMTP id h0VJqeh13942; Fri, 31 Jan 2003 11:52:40 -0800 (PST) Date: Fri, 31 Jan 2003 11:52:40 -0800 (PST) From: Konrad Schroder To: Steve Byan Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE In-Reply-To: <538478DE-354E-11D7-B26B-00306548867E@maxtor.com> Message-ID: References: <538478DE-354E-11D7-B26B-00306548867E@maxtor.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG 
Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org My $0.02 regarding FFS: since the default block size (including indirect blocks etc.) is 8k the only common alignment issue would come from (mis-)alignment of the partition as a whole. If the drive were structured so that it reported cylinders as multiples of 4k, (almost) no one would ever have the type of problem you're describing with FFS. On Fri, 31 Jan 2003, Steve Byan wrote: > I think journaling filesystems need to know the atomic block size in > order to structure their log in a fault-tolerant way; I'm hoping > someone on these lists can provide some details. I think LFS is mostly okay here, though there is a corner case in which some data could be lost (possibly the filesystem corrupted) without the user knowing about it. Let me describe such a case. Suppose that the cleaner were operating. Every cleaner write is a checkpoint, but following the cleaner write, the previous checkpoint is invalidated---so it is possible that there is only one valid checkpoint on disk, at all. Now further suppose that the filesystem were created with fragment size less than 4k, the cleaner has just cleaned segment n+1, filling segment n with that data; and another write has occurred into segment n+1, thereby invalidating the contents of segment n+1; and there were a power outage while that first segment summary in segment n+1 were being written. Both the previous checkpoint state (including segment n+1) and the current checkpoint state (including segment n) would be invalid in this case. The worst part about it is that even if fsck_lfs could fix this problem, no one would know to run it; LFS uses roll-forward as its default repair mechanism, and roll-forward always starts from the last known-valid checkpoint. 
The solution, of course, is to 1) Identify the disk as a 4k-sector disk; 2) Partition the disk so that LFS partitions begin on 4k boundaries; 3) Create the LFS filesystems with 4k or greater fragment size; 4) Play happily with your 8k/1k FFSes and 8k/4k LFSes. If you did that the 4k sector size would be truly invisible to you---and in particular, you would *not* need to recompile the kernel for any of that unless I'm misunderstanding what you're saying. ------------------------------------------------------------------------ Konrad Schroder http://www.hitl.washington.edu/people/perseant/ Information Tech & Services Box 352142 -or- 215 Fluke Hall, Mason Road Human Interface Technology Lab University of Washington Voice: +1.206.616.1478 Fax: +1.206.543.5380 Seattle, WA, 98195, USA To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:56:20 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9C9DA37B401; Fri, 31 Jan 2003 11:56:18 -0800 (PST) Received: from apollo.email.starband.net (smtp2.starband.net [148.78.247.23]) by mx1.FreeBSD.org (Postfix) with ESMTP id D9E5D43F43; Fri, 31 Jan 2003 11:56:17 -0800 (PST) (envelope-from jkirby@storagecraft.com) Received: from jkirbydesk (vsat-148-63-114-177.c002.t7.mrt.starband.net [148.63.114.177]) (authenticated bits=0) by apollo.email.starband.net (8.12.4/8.12.4) with ESMTP id h0VJtTH5005007; Fri, 31 Jan 2003 14:55:37 -0500 Reply-To: From: "Jamey Kirby" To: "'Julian Elischer'" , "'Steve Byan'" Cc: , , Subject: RE: DEV_B_SIZE Date: Fri, 31 Jan 2003 11:55:33 -0800 Organization: StorageCraft Message-ID: <000601c2c962$c04aa2d0$0300a8c0@jkirbydesk> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook, Build 10.0.4024 In-Reply-To: 
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 Importance: Normal Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Who will do the translation? Will there be a device driver that makes the 4K disk look like a 512-byte disk? If so, the device driver would have to pre-read the 4K, modify the 512-byte section and re-write the entire 4K. This would kill performance. If this will be handled in the drive, the same sort of logic must be employed and surely there will be a performance problem, unless the drive will be able to write the 512 bytes without a pre-read. How easy is it to change the firmware in the drive to make it a 4K block drive? I would be willing to tinker with a 4K drive and provide some feedback. Jamey -----Original Message----- From: owner-freebsd-fs@FreeBSD.ORG [mailto:owner-freebsd-fs@FreeBSD.ORG] On Behalf Of Julian Elischer Sent: Friday, January 31, 2003 11:49 AM To: Steve Byan Cc: phk@FreeBSD.ORG; freebsd-fs@FreeBSD.ORG; tech-kern@netbsd.org Subject: Re: DEV_B_SIZE One thign I thought of is that it is not uncommon to 'dd' an entire filesystem from one partition to another. If we create a filesystem that is 'aligned' and we copy it to be 'unalligned', we'd have a sudden performance drop for no immediatly obvious reason. What was one write, would become a 2-sector read, modify and 2-sector write. Especially when copying from one failing drive to another with slightly different characteristics. The idea isn't bad but I think it should be sold as a 4k sector drive, with small print saying it can handle 512byte IO instead of the other way around. 
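The read-modify-write cost discussed above can be sketched with a little arithmetic. This is only an illustration, not anything from the thread: the function name physical_io is hypothetical, and the 512-byte logical / 4096-byte physical sizes are the ones assumed in the discussion. It counts how many physical sectors a write touches and how many of them are only partially covered, so would need a pre-read.

```python
LOGICAL = 512    # logical sector size seen by the host (bytes)
PHYSICAL = 4096  # assumed underlying physical sector size (bytes)


def physical_io(start_lba, nbytes, logical=LOGICAL, physical=PHYSICAL):
    """Return (physical sectors touched, sectors needing read-modify-write).

    start_lba is a logical (512-byte) sector number; nbytes is the write length.
    A physical sector needs read-modify-write when the write covers only part of it.
    """
    first_byte = start_lba * logical
    end_byte = first_byte + nbytes            # one past the last byte written
    first_phys = first_byte // physical
    last_phys = (end_byte - 1) // physical
    touched = last_phys - first_phys + 1

    head_partial = first_byte % physical != 0
    tail_partial = end_byte % physical != 0
    if touched == 1:
        # Head and tail are the same physical sector; count it once.
        rmw = 1 if (head_partial or tail_partial) else 0
    else:
        rmw = int(head_partial) + int(tail_partial)
    return touched, rmw


if __name__ == "__main__":
    # Aligned 4K write: one physical sector, no pre-read.
    print(physical_io(0, 4096))
    # The same write shifted by one logical sector: two sectors, both pre-read.
    print(physical_io(1, 4096))
```

The second call is the case Julian describes: a filesystem copied to an offset that is not a multiple of eight logical sectors turns each formerly aligned 4K write into two partial physical-sector updates.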
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 11:58:39 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 120A537B401 for ; Fri, 31 Jan 2003 11:58:38 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1774943F3F for ; Fri, 31 Jan 2003 11:58:37 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0VJwR4W031672; Fri, 31 Jan 2003 20:58:28 +0100 (CET) (envelope-from phk@freebsd.org) To: Julian Elischer Cc: Steve Byan , freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE From: phk@freebsd.org In-Reply-To: Your message of "Fri, 31 Jan 2003 11:48:56 PST." Date: Fri, 31 Jan 2003 20:58:27 +0100 Message-ID: <31671.1044043107@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message , Ju lian Elischer writes: > > >One thign I thought of is that it is not uncommon to 'dd' an entire >filesystem from one partition to another. >If we create a filesystem that is 'aligned' and we copy it to be >'unalligned', we'd have a sudden performance drop for no immediatly >obvious reason. What was one write, would become a 2-sector read, >modify and 2-sector write. Especially when copying from one failing >drive to another with slightly different characteristics. If you run dd without bs=ALOT you deserve bad throughput. 
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 12:15:28 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8E64037B401; Fri, 31 Jan 2003 12:15:26 -0800 (PST) Received: from rwcrmhc53.attbi.com (rwcrmhc53.attbi.com [204.127.198.39]) by mx1.FreeBSD.org (Postfix) with ESMTP id E2A7A43F43; Fri, 31 Jan 2003 12:15:25 -0800 (PST) (envelope-from julian@elischer.org) Received: from interjet.elischer.org (12-232-168-4.client.attbi.com[12.232.168.4]) by rwcrmhc53.attbi.com (rwcrmhc53) with ESMTP id <20030131201519053003msr6e>; Fri, 31 Jan 2003 20:15:20 +0000 Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id MAA46356; Fri, 31 Jan 2003 12:15:19 -0800 (PST) Date: Fri, 31 Jan 2003 12:15:17 -0800 (PST) From: Julian Elischer To: David Schultz Cc: Steve Byan , phk@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE In-Reply-To: <20030131192452.GA15985@HAL9000.homeunix.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Fri, 31 Jan 2003, David Schultz wrote: > Thus spake Steve Byan : > > >The only thing that exposes us to risk is we don't know the risk > > >exists, so as long as the fact that a 4k physical sector size is > > >used is not hidden from us, we can adapt. 
> > > > But would existing code be functionally broken (perhaps with respect to > > failure recovery) if it were to not be modified to adapt to a different > > physical block size? > > If the disk corrupts a sector it was writing, that's already a > problem for us. If the sector is 4K, that just makes it more of a > problem. With FFS and soft updates, we assume that the disk can > atomically write 512 bytes, and we ensure filesystem consistency > by establishing a safe partial ordering for metadata updates. We > expect that after a crash, either the old contents or the new > contents of the sector are there. I think we would need to > implement journalling to ensure integrity if hard drives were > likely to corrupt sectors on power failure. (How often do they do > this right now, and how often would they with 4K sectors?) In this case the journal would have to include not only the block being written, but also the data on each side of it that may be in the same 4k. That implies a read. > > Inodes are 128 bytes (UFS1) or 256 bytes (UFS2), so a 4K sector > could contain metadata for a lot of files. If an indirect block > is squished, that might be less of a problem because it > corresponds to only one file. In one sense, 4K sectors save a > little bit of space, since directory entries are never split > across a sector boundary so that they can be updated in a single, > atomic write. But large sectors are still worse from a > reliability point of view if it's possible to lose the entire > sector. > > The LFS is probably in much better shape... 
> > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 12:16:52 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9924837B407; Fri, 31 Jan 2003 12:16:51 -0800 (PST) Received: from mcomail02.maxtor.com (mcomail02.maxtor.com [134.6.76.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id B06A743F85; Fri, 31 Jan 2003 12:16:50 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail02.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VKAR817917; Fri, 31 Jan 2003 13:10:27 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRK62D; Fri, 31 Jan 2003 13:16:52 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id PAA0000002380; Fri, 31 Jan 2003 15:16:38 -0500 (EST) Date: Fri, 31 Jan 2003 15:16:37 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org To: phk@freebsd.org From: Steve Byan In-Reply-To: <22438.1044040127@critter.freebsd.dk> Message-Id: Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 02:08 PM, phk@freebsd.org wrote: > I get the sense that you want us to say "NOOOO this is HORRIBLE!!!" > and you won't stop asking until we do ? > > You won't have that from this bloke at least. 
> > I don't know what the agenda you push are, but I'm not pushing it > for you... I keep getting a response that reads like "we'll detect the larger block size and run with it". I'm concerned that I'm not being clear that IDEMA is thinking of proposing a backward-compatibility mode with the presumption that it will work fine (albeit slowly) with existing binaries, i.e. code that hasn't been modified to be aware of the larger block size. If you think there are no functional problems with this backwards-compatibility scenario, including during recovery (fsck or journal roll-forward), I'd be happy to hear a clear "no problem". Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 12:40:34 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E3E4B37B401; Fri, 31 Jan 2003 12:40:32 -0800 (PST) Received: from sccrmhc03.attbi.com (sccrmhc03.attbi.com [204.127.202.63]) by mx1.FreeBSD.org (Postfix) with ESMTP id 44A3243FA7; Fri, 31 Jan 2003 12:40:32 -0800 (PST) (envelope-from julian@elischer.org) Received: from interjet.elischer.org (12-232-168-4.client.attbi.com[12.232.168.4]) by sccrmhc03.attbi.com (sccrmhc03) with ESMTP id <2003013120403000300jupd0e>; Fri, 31 Jan 2003 20:40:30 +0000 Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id MAA46556; Fri, 31 Jan 2003 12:40:29 -0800 (PST) Date: Fri, 31 Jan 2003 12:40:28 -0800 (PST) From: Julian Elischer To: phk@freebsd.org Cc: Steve Byan , freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE In-Reply-To: <31671.1044043107@critter.freebsd.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: 
owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Fri, 31 Jan 2003 phk@freebsd.org wrote: > In message , Ju > lian Elischer writes: > > > > > >One thign I thought of is that it is not uncommon to 'dd' an entire > >filesystem from one partition to another. > >If we create a filesystem that is 'aligned' and we copy it to be > >'unalligned', we'd have a sudden performance drop for no immediatly > >obvious reason. What was one write, would become a 2-sector read, > >modify and 2-sector write. Especially when copying from one failing > >drive to another with slightly different characteristics. > > If you run dd without bs=ALOT you deserve bad throughput. I'm talking about the performance of the filesystem after it's been moved. > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk@FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. 
> To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 12:41:44 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EE61937B401 for ; Fri, 31 Jan 2003 12:41:42 -0800 (PST) Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1C49F43E4A for ; Fri, 31 Jan 2003 12:41:42 -0800 (PST) (envelope-from phk@freebsd.org) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h0VKfe4W039529; Fri, 31 Jan 2003 21:41:40 +0100 (CET) (envelope-from phk@freebsd.org) To: Steve Byan Cc: freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE From: phk@freebsd.org In-Reply-To: Your message of "Fri, 31 Jan 2003 15:16:37 EST." Date: Fri, 31 Jan 2003 21:41:40 +0100 Message-ID: <39528.1044045700@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message , Steve Byan writes : > >On Friday, January 31, 2003, at 02:08 PM, phk@freebsd.org wrote: > >> I get the sense that you want us to say "NOOOO this is HORRIBLE!!!" >> and you won't stop asking until we do ? >> >> You won't have that from this bloke at least. >> >> I don't know what the agenda you push are, but I'm not pushing it >> for you... > >I keep getting a response that reads like "we'll detect the larger >block size and run with it". I'm concerned that I'm not being clear >that IDEMA is thinking of proposing a backward-compatibility mode with >the presumption that it will work fine (albeit slowly) with existing >binaries, i.e. code that hasn't been modified to be aware of the larger >block size. 
> >If you think there are no functional problems with this >backwards-compatibility scenario, including during recovery (fsck or >journal roll-forward), I'd be happy to hear a clear "no problem". Ok, to make it 100% clear: 1. We won't see any new problems. The effects of 3.5k around a sector we touched being corrupted is no different from any other 3.5k developing a bad sector read-error. (Hopefully the drive will flag it with a read-error when we come back so it won't look like random data corruption.) 2. Already existing issues will do greater damage. This follows directly from the fact that increasing the sectorsize increases the amount of data lost when a sector is lost. If the market place hates that, the new drives will not be popular there. 3. If the OS can detect the true sectorsize, some choices can be made intelligently and reduce the performance hit and some of recovery issues. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 13: 1:31 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 177AD37B401; Fri, 31 Jan 2003 13:01:30 -0800 (PST) Received: from mcomail02.maxtor.com (mcomail02.maxtor.com [134.6.76.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6EB2943F43; Fri, 31 Jan 2003 13:01:29 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail02.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VKt3u31068; Fri, 31 Jan 2003 13:55:03 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRK8AS; Fri, 31 Jan 2003 14:01:29 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id QAA0000002604; Fri, 31 Jan 2003 16:01:14 -0500 (EST) Date: Fri, 31 Jan 2003 16:01:13 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: phk@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org To: David Schultz From: Steve Byan In-Reply-To: <20030131192452.GA15985@HAL9000.homeunix.com> Message-Id: <21B8D16C-355F-11D7-B26B-00306548867E@maxtor.com> Content-Transfer-Encoding: quoted-printable X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 02:24 PM, David Schultz wrote: > If the disk corrupts a sector it was writing, that's already a > problem for us. 
From the Maxtor Atlas 10K III Product Spec: Section 4.5.1 Power Sequencing You may apply the power in any order or manner, or open either the power or power return line with no loss of data or damage to the disk drive. However, data may be lost in the sector being written at the time of power loss. The drive can withstand transient voltages of +10% to -100% from nominal while powering up or down. > If the sector is 4K, that just makes it more of a > problem. With FFS and soft updates, we assume that the disk can > atomically write 512 bytes, and we ensure filesystem consistency > by establishing a safe partial ordering for metadata updates. We > expect that after a crash, either the old contents or the new > contents of the sector are there. I think we would need to > implement journalling to ensure integrity if hard drives were > likely to corrupt sectors on power failure. (How often do they do > this right now, and how often would they with 4K sectors?) If you are doing nothing but continuously writing, the active data area covers more than 50% of the track, so you'd have more than a 0.5 probability of experiencing a corrupt sector. Derate this by your seek duty-cycle and your write disk utilization to arrive at the final probability. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. 
MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 13: 6:55 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A90C637B401; Fri, 31 Jan 2003 13:06:53 -0800 (PST) Received: from mcomail01.maxtor.com (mcomail01.maxtor.com [134.6.76.15]) by mx1.FreeBSD.org (Postfix) with ESMTP id F15B943F79; Fri, 31 Jan 2003 13:06:52 -0800 (PST) (envelope-from stephen_byan@maxtor.com) Received: from mcoexc03.mlm.maxtor.com (localhost.localdomain [127.0.0.1]) by mcomail01.maxtor.com (8.11.6/8.11.6) with ESMTP id h0VKuNq15356; Fri, 31 Jan 2003 13:56:23 -0700 Received: from mmans02.mma.maxtor.com ([134.6.232.101]) by mcoexc03.mlm.maxtor.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id D4XRK8HA; Fri, 31 Jan 2003 14:06:53 -0700 Received: from maxtor.com by mmans02.mma.maxtor.com (8.8.8/1.1.22.3/08May01-0432PM) id QAA0000002539; Fri, 31 Jan 2003 16:06:44 -0500 (EST) Date: Fri, 31 Jan 2003 16:06:43 -0500 Subject: Re: DEV_B_SIZE Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v551) Cc: "'Julian Elischer'" , , , To: From: Steve Byan In-Reply-To: <000601c2c962$c04aa2d0$0300a8c0@jkirbydesk> Message-Id: Content-Transfer-Encoding: 7bit X-Mailer: Apple Mail (2.551) Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, January 31, 2003, at 02:55 PM, Jamey Kirby wrote: > Who will do the translation? > > Will there be a device driver that makes the 4K disk look like a 512 > byte disk? No. > If so, the device driver would have to pre-read the 4K, > modify the 512 byte section and re-write the entire 4K. This would kill > performance. 
Yes, it would. > If this will be handled in the drive, the same sort of logic must be > employed and surly there will be a performance problem; unless the > drive > will be able to write the 512 bytes without a pre-read. Yes, there surely would be a performance problem if the I/O has to wait for a read-modify-write. There may be proprietary techniques for hiding the cost. The assumption is that this is purely a backward-compatibility case, and the performance hit would motivate folks to update their software to recognize the new larger block size. Regards, -Steve -------- Steve Byan Design Engineer Maxtor Corp. MS 1-3/E23 333 South Street Shrewsbury, MA 01545 (508) 770-3414 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message
From owner-freebsd-fs Fri Jan 31 13:48:33 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D735937B401 for ; Fri, 31 Jan 2003 13:48:31 -0800 (PST) Received: from uranium.vaxpower.org (URANIUM.CLUB.CC.cmu.edu [128.2.4.153]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9114743F3F for ; Fri, 31 Jan 2003 13:48:29 -0800 (PST) (envelope-from mrfusion@uranium.vaxpower.org) Received: (from mrfusion@localhost) by uranium.vaxpower.org (8.9.1/5.5.1) id NAA20481; Fri, 31 Jan 2003 13:51:52 -0500 Date: Fri, 31 Jan 2003 13:51:52 -0500 (EST) From: Lord Isildur Subject: Re: DEV_B_SIZE To: Steve Byan Cc: David Laight , freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org In-Reply-To: <1BBFD4B2-354C-11D7-B26B-00306548867E@maxtor.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org to just get the performance of aligned accesses, we dont need to modify block sizes and such stuff. 
as an example, read the paper linked to from this page: http://www.pdl.cmu.edu/PDL-FTP/stray/traxtent_abs.html (brought to you by the same folks who did soft updates and raidframe) happy hacking, isildur On Fri, 31 Jan 2003, Steve Byan wrote: > > On Friday, January 31, 2003, at 12:59 PM, David Laight wrote: > > > On Fri, Jan 31, 2003 at 11:30:18AM -0500, Steve Byan wrote: > >> There's a notion afoot in IDEMA to enlarge the underlying physical > >> block size of disks to 4096 bytes while keeping a 512-byte logical > >> block size for the interface. Unaligned accesses would involve either > >> a > >> read-modify-write or some proprietary mechanism that provides > >> persistence without the latency cost of a read-modify-write. > > > > There probably ought to be a way of making the larger physical > > size visible to systems that are willing to support larger > > block sizes. That way misaligned transfers would be far less > > likely. > > Yes, of course. But I asked with respect to an issue other than > performance. > > > > One problem to consider is that disks are still partitioned > > on cylinder boundaries. This is largely historic but isn't > > this doen't actually make much sense, since the geometry > > almost certainly varies across the disk and has to be faked > > to fit the ATA CHS limits and (on PCs) the BIOS interface. > > > > However what it does mean is that a partition could easily > > not start on a 8 (512 byte) sector boundary. > > So misaligned transefers are likely even if the filesystem > > itself is using 4k blocks. > > > > On a PC the partitioning will typically have the first one > > starting in sector 63, and the others at multiple of 16065 > > sectors from the start of the disk). > > > > This doesn't bode well for getting any aligned transfer > > at all. > > We understand that problem. It's just a performance issue. 
My concern > is that even if we handwave the performance issues, there's an > underlying semantic that would not be satisfied if we were to run > existing software, unmodified, on a disk with an underlying 4K sector > size. > > Regards, > -Steve > -------- > Steve Byan > Design Engineer > Maxtor Corp. > MS 1-3/E23 > 333 South Street > Shrewsbury, MA 01545 > (508) 770-3414 > > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 14:46:33 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7EF4937B401; Fri, 31 Jan 2003 14:46:32 -0800 (PST) Received: from mail.allcaps.org (allcaps.org [216.240.173.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1E18943F85; Fri, 31 Jan 2003 14:46:32 -0800 (PST) (envelope-from bsder@allcaps.org) Received: from mail.allcaps.org (localhost [127.0.0.1]) by mail.allcaps.org (Postfix) with ESMTP id EB39392FA9; Fri, 31 Jan 2003 17:46:27 -0500 (EST) Received: from localhost (bsder@localhost) by mail.allcaps.org (8.12.5/8.12.5/Submit) with ESMTP id h0VMkR90000481; Fri, 31 Jan 2003 14:46:27 -0800 X-Authentication-Warning: mail.allcaps.org: bsder owned process doing -bs Date: Fri, 31 Jan 2003 14:46:27 -0800 (PST) From: "Andrew P. Lentvorski, Jr." To: Steve Byan Cc: phk@freebsd.org, , Subject: Re: DEV_B_SIZE In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Fri, 31 Jan 2003, Steve Byan wrote: > I keep getting a response that reads like "we'll detect the larger > block size and run with it". 
I'm concerned that I'm not being clear > that IDEMA is thinking of proposing a backward-compatibility mode with > the presumption that it will work fine (albeit slowly) with existing > binaries, i.e. code that hasn't been modified to be aware of the larger > block size. Is this the scenario you're worried about? 1) Plug a shiny new 4K type disk into, say, FreeBSD 4.7 2) FreeBSD 4.7 doesn't know about 4K disks, so uses 512 byte mode 3) System configures softupdates and does a newfs 4) ... time passes ... 5) Luser trips over power cord in middle of write and corrupts disk Question: Does this work any differently given that the disk is 4K working in 512 compatibility mode vs. a real 512 disk? I think the answer depends upon the atomicity of the access. If the drive working in compatibility mode guarantees that only the new 512 bytes (out of the total 4096) will be corrupt, things probably work. If, however, any of the 4096 bytes can be corrupted, it probably will not. I assume that the whole reasoning behind moving to 4K size is to extend the error coding to a larger chunk of bits for less overhead. If that is the case, a read-modify-write is likely to clobber any of the 4096 bytes, and it is not likely to work transparently in compatibility mode under failure conditions. 
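To make the atomicity question concrete, here is a minimal sketch of which 512-byte logical sectors would be exposed if a torn write damaged the whole 4K physical block. It is only an illustration: the function name is made up, and the 512/4096 sizes are simply the ones discussed in this thread, not anything taken from an IDEMA document.

```python
LOGICAL = 512      # logical sector size presented at the interface
PHYSICAL = 4096    # assumed underlying physical block size


def exposed_sectors(lba, logical=LOGICAL, physical=PHYSICAL):
    """Return the logical sector numbers that share a physical block with lba.

    If a failed read-modify-write can clobber the whole physical block,
    these are the sectors that may come back unreadable, even though
    only lba itself was being written.
    """
    per_block = physical // logical          # logical sectors per physical block
    first = (lba // per_block) * per_block   # first sector of the enclosing block
    return list(range(first, first + per_block))


if __name__ == "__main__":
    # A write to sector 63 puts its seven neighbours in the same block at risk.
    print(exposed_sectors(63))
```

If the drive could guarantee that only the sector being written is ever damaged, the returned list would shrink to the single LBA, which is the "things probably work" case above.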
-a To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 15:49:38 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 433D137B401; Fri, 31 Jan 2003 15:49:37 -0800 (PST) Received: from HAL9000.homeunix.com (12-233-57-224.client.attbi.com [12.233.57.224]) by mx1.FreeBSD.org (Postfix) with ESMTP id A0D6543FA3; Fri, 31 Jan 2003 15:49:36 -0800 (PST) (envelope-from dschultz@uclink.berkeley.edu) Received: from HAL9000.homeunix.com (localhost [127.0.0.1]) by HAL9000.homeunix.com (8.12.6/8.12.5) with ESMTP id h0VNnWNt016972; Fri, 31 Jan 2003 15:49:32 -0800 (PST) (envelope-from dschultz@uclink.berkeley.edu) Received: (from das@localhost) by HAL9000.homeunix.com (8.12.6/8.12.5/Submit) id h0VNnWuL016971; Fri, 31 Jan 2003 15:49:32 -0800 (PST) (envelope-from dschultz@uclink.berkeley.edu) Date: Fri, 31 Jan 2003 15:49:32 -0800 From: David Schultz To: Julian Elischer Cc: Steve Byan , phk@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE Message-ID: <20030131234932.GA16959@HAL9000.homeunix.com> Mail-Followup-To: Julian Elischer , Steve Byan , phk@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org References: <20030131192452.GA15985@HAL9000.homeunix.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Thus spake Julian Elischer : > > contents of the sector are there. I think we would need to > > implement journalling to ensure integrity if hard drives were > > likely to corrupt sectors on power failure. (How often do they do > > this right now, and how often would they with 4K sectors?) 
> > > in this case teh journel would have to not only include the block being > written, but data on each side of it that may be in teh same 4k. > that implies a read.. If you had to do that, then nearly every write would be a read-modify-write cycle. It would be far less painful to use 4K blocks or larger and align filesystem blocks to disk sectors. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 16:11:43 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id F0ABD37B401 for ; Fri, 31 Jan 2003 16:11:41 -0800 (PST) Received: from mail.netbsd.org (mail.netbsd.org [155.53.1.253]) by mx1.FreeBSD.org (Postfix) with SMTP id 98A0443F79 for ; Fri, 31 Jan 2003 16:11:41 -0800 (PST) (envelope-from wrstuden@netbsd.org) Received: (qmail 15315 invoked by uid 1130); 1 Feb 2003 00:11:40 -0000 Date: Fri, 31 Jan 2003 16:11:29 -0800 (PST) From: Bill Studenmund X-X-Sender: To: Steve Byan Cc: , Subject: Re: DEV_B_SIZE In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Fri, 31 Jan 2003, Steve Byan wrote: > I keep getting a response that reads like "we'll detect the larger > block size and run with it". I'm concerned that I'm not being clear > that IDEMA is thinking of proposing a backward-compatibility mode with > the presumption that it will work fine (albeit slowly) with existing > binaries, i.e. code that hasn't been modified to be aware of the larger > block size. > > If you think there are no functional problems with this > backwards-compatibility scenario, including during recovery (fsck or > journal roll-forward), I'd be happy to hear a clear "no problem". 
I think Stephan Uphof hit on the main issues. I think there are functional problems with this, but that it may be useful in some situations. It just needs a BIG warning. Note I am assuming that if there's an error writing a 512-byte sector the full 4k sector will have issues. If that is avoided (say only the 512-byte area actually has an issue) then things are fine. I think the main place that problems will arise is that methods to reduce error exposure won't necessarily work. Methods that try to resist single-sector errors, say by making multiple copies of data, will need to know that the single-sector error size (how much data goes away) is 4k, not 512 bytes. Exactly how many programs use these methods is not something I know, so I can't tell you exactly what the exposure is. The fact that the errors from a 4k re-write failing are not unheard of isn't the issue. phk is right that that just looks like multiple sectors dying. The problem is that we would have multiple-sector-death happening with single-sector failure dynamics. If you want this to not be an issue 100%, then just put a battery-backed up cache on the device. Note I'm not saying back up the write cache, just have a cache of the last area(s) being written. We're talking maybe 8k of cache plus checksumming plus the logical block addresses. Shouldn't be hard (read should be cheap in mass quantities) to make a battery back up something that small. Use a rechargeable battery, and just say that if you lose power while writing, you should restore power within say a month or a few months to let said cache drain. With well-tuned CMOS, you might even be able to get away with just static charge or a capacitor for power storage.
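As a rough sketch of what "knowing the error size" could mean for a program that keeps duplicate copies: two copies only protect each other if a single failure cannot take out both. The helper below is hypothetical and assumes a failure destroys exactly one aligned unit of the given size.

```python
def copies_independent(lba_a, lba_b, logical=512, failure_unit=4096):
    """Return True if two copies cannot be lost to the same single failure.

    lba_a and lba_b are the logical sector numbers holding the two copies.
    The copies are independent only if they fall in different aligned
    failure units.
    """
    per_unit = failure_unit // logical   # logical sectors per failure unit
    return lba_a // per_unit != lba_b // per_unit


if __name__ == "__main__":
    # Adjacent sectors: safe against a 512-byte failure, not against a 4K one.
    print(copies_independent(16, 17, failure_unit=512))
    print(copies_independent(16, 17, failure_unit=4096))
```

Code that places its second copy in the very next sector passes the first check and fails the second, which is the exposure described above.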
Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 16:13:25 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EC2E337B401 for ; Fri, 31 Jan 2003 16:13:23 -0800 (PST) Received: from sccrmhc01.attbi.com (sccrmhc01.attbi.com [204.127.202.61]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2F50243F79 for ; Fri, 31 Jan 2003 16:13:23 -0800 (PST) (envelope-from julian@elischer.org) Received: from InterJet.elischer.org (12-232-168-4.client.attbi.com[12.232.168.4]) by sccrmhc01.attbi.com (sccrmhc01) with ESMTP id <20030201001316001008a0l0e>; Sat, 1 Feb 2003 00:13:17 +0000 Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id QAA48166; Fri, 31 Jan 2003 16:13:15 -0800 (PST) Date: Fri, 31 Jan 2003 16:13:13 -0800 (PST) From: Julian Elischer To: David Schultz Cc: Steve Byan , freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE In-Reply-To: <20030131234932.GA16959@HAL9000.homeunix.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Fri, 31 Jan 2003, David Schultz wrote: > Thus spake Julian Elischer : > > > contents of the sector are there. I think we would need to > > > implement journalling to ensure integrity if hard drives were > > > likely to corrupt sectors on power failure. (How often do they do > > > this right now, and how often would they with 4K sectors?) > > > > > > in this case teh journel would have to not only include the block being > > written, but data on each side of it that may be in teh same 4k. > > that implies a read.. 
> > If you had to do that, then nearly every write would be a > read-modify-write cycle. It would be far less painful > to use 4K blocks or larger and align filesystem blocks > to disk sectors. exactly.. But this is a case where "a filesystem using 512 byte blocks would behave significanlty differently with one of these drives" which is what he was asking. > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 16:34:42 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id D49AE37B406 for ; Fri, 31 Jan 2003 16:34:38 -0800 (PST) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id B997F43F75 for ; Fri, 31 Jan 2003 16:34:36 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0203.cvx21-bradley.dialup.earthlink.net ([209.179.192.203] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18elc7-00063u-00; Fri, 31 Jan 2003 16:34:32 -0800 Message-ID: <3E3B1582.39463573@mindspring.com> Date: Fri, 31 Jan 2003 16:32:02 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Steve Byan Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE References: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4fec0bdacb27578085064db9f0561ec03a2d4e88014a4647c350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Steve Byan wrote: > There's a notion afoot in IDEMA to enlarge the underlying physical > 
block size of disks to 4096 bytes while keeping a 512-byte logical > block size for the interface. Unaligned accesses would involve either a > read-modify-write or some proprietary mechanism that provides > persistence without the latency cost of a read-modify-write. > > Performance issues aside, it occurs to me that hiding the underlying > physical block size may break many careful-write and > transaction-logging mechanisms, which may depend on no more than one > block being corrupted during a failure. In IDEMA's proposal, a power > failure during a write of a single 512-byte logical block could result > in the corruption of the full 4K block, i.e. reads of any of the > 512-byte logical blocks in that 4K physical block would return an > uncorrectable ECC error. > > I'd appreciate hearing examples where hiding the underlying physical > block size would break a file system, database, transaction processing > monitor, or whatever. Please let me know if I may forward your reply > to the committee. Thanks. UFS directory operations are on the basis of physical disk blocks, which are assumed to be DEVBSIZE in size (512b). At a minimum, this change would break the I/O path, because it changes the atomic unit size to 4096. The reason this would break is that the atomic write guarantee is used to ensure that single-sector changes are recorded atomically. This is important in rename operations from a short name to a longer name, where the new name is allocated as a hard link in the new block; the place this becomes problematic is where the new block and the old block are the same block, unknown to the software.
The transaction in question is atomic file replacement; it involves:

    name    - name of the file
    name.1  - name of the file whose contents are to atomically
              replace the contents of "name"
    name.2  - name of intermediate file for use in transaction
              rollback/forward

The transaction is:

    ---------------------------  -----------------------------
    files                        view
    ---------------------------  -----------------------------
    name                         name
    +name.1                      name name.1
    explicit_sync(name.1)        name name.1
    name -> name.2               name name.1 name.2
                                 name.1 name.2
    name <- name.1               name name.1 name.2
                                 name name.2
    -name.2                      name
    ---------------------------  -----------------------------

The failure recovery is:

    ---------------------------  -----------------------------
    view                         process
    ---------------------------  -----------------------------
    name                         [NULL]
    name name.1                  [ROLL BACK(partial file?)]
                                 -name.1
    name name.1 name.2           [ROLL FORWARD]
                                 -name
                                 name <- name.1
                                 -name.2
    name name.2                  [ROLL FORWARD]
                                 -name.2
    ---------------------------  -----------------------------

Currently, UFS is subject to damage through corruption of data in a pending transaction. A corrupt sector destroys data. But this is a weakness of UFS, and is not a uniform weakness of all FS's that must provide the same transactional guarantees to the applications, for the purposes of recovery.

In a journalling or log structured FS, the failure of a write of a sector of data -- or rather, an extent or log or journal line -- is recoverable: you get the previous contents, because the journal line has not been replaced with new contents with a newer date stamp. The result is that it backs the transaction out for you. But this is still potentially a partial back-out, which can leave us with any of the views of the directory contents, which we need to use to discern our recovery strategy ([NULL]/[ROLL BACK]/[ROLL FORWARD]).
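The failure-recovery table above can be read as a lookup from the set of names that survive a crash to a recovery action. The following is only a sketch of that reading: the dictionary layout and the function name are invented for illustration, and the step strings are kept verbatim in the notation of the table rather than turned into real file operations.

```python
# Recovery table, keyed by the set of names found in the directory after
# a crash. Each value is (action, steps); steps are copied from the table
# in its own notation and are not interpreted here.
RECOVERY = {
    frozenset({"name"}): ("NULL", []),
    frozenset({"name", "name.1"}): ("ROLL BACK", ["-name.1"]),
    frozenset({"name", "name.1", "name.2"}): (
        "ROLL FORWARD",
        ["-name", "name <- name.1", "-name.2"],
    ),
    frozenset({"name", "name.2"}): ("ROLL FORWARD", ["-name.2"]),
}


def recovery_plan(view):
    """Look up the recovery action for the names visible after a crash.

    Returns (action, steps), or None for a view the table does not list.
    """
    return RECOVERY.get(frozenset(view))


if __name__ == "__main__":
    print(recovery_plan(["name", "name.1"]))
    print(recovery_plan(["name.1", "name.2"]))  # a view the table does not list
```

The lookup only works if the directory view itself can be trusted after the crash, which is exactly the property at stake when more than one sector can be damaged by a single failed write.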
The risk is much higher in this case, in that the logging extents may in fact be adjacent, and span the 4K boundary, while only being self protecting from spanning a 512b boundary. The net effect of this is that rather than guaranteeing to only damage a single extent, you may damage two extents containing pre- and post-operation data. Unless the filesystem maintains extents two back, or goes out of its way to ensure non-adjacency (can this be done, in the face of sector sparing?), this type of failure is unrecoverable. The main issue with this is that you can not ensure physical alignment of the underlying logical device that is acting as a backing store for the FS. This was and is a common performance problem for demand paged virtual memory using OS's: MSDOS FAT FS's on drives that claim an odd numbered physical sector count per track result in the first partition being on an odd 512b boundary. The result is that physical pages in memory are spanned by every third 1K FS block, because they are offset by 512b from the start of the disk. So even if you are not considering the single sector issue as a design flaw in UFS, and even if requiring recompilation is acceptable (it is, IMO), you can't necessarily avoid the failure case. Note: This is not an exhaustive list, this is just off the top of my head; I could probably come up with other scenarios, as well... e.g. at the very least, for FAT, you would probably be screwed with a number larger than 1K, even if you were careful to make sure that the sectors per track was an even multiple of your physical block size, since the FAT entry in FAT FS's *is* the inode. 
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 16:45:38 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 843F537B401 for ; Fri, 31 Jan 2003 16:45:37 -0800 (PST) Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net [207.217.120.188]) by mx1.FreeBSD.org (Postfix) with ESMTP id D975343E4A for ; Fri, 31 Jan 2003 16:45:36 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0203.cvx21-bradley.dialup.earthlink.net ([209.179.192.203] helo=mindspring.com) by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18elmm-00001U-00; Fri, 31 Jan 2003 16:45:33 -0800 Message-ID: <3E3B1857.2122B84F@mindspring.com> Date: Fri, 31 Jan 2003 16:44:07 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Julian Elischer Cc: Steve Byan , freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4fa5be0a4ef0945269477e932dc92a1a8350badd9bab72f9c350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Julian Elischer wrote: > I presume that if such a drive were made, thre would be some way to > identify it? > > It would be very easy to configure a filesystem to have a minimum > writable unit size of 4k, and I assume that doing so would be > slightly advantageous. (no Read/modify/write). it would however > be good if we could easily identify when doing so was a good idea. 
Substantial modifications would be required to the UFS directory management code to support both old and new disks in the same machine with the same FS code. Assuming that was addressed by making the DEVBSIZE define into a variable based on the underlying device, there's the problem of device concatenation. Your devices would have to be made up of homogeneous components, too, so once you got them to coexist with old disks, you would still not be able to get them to aggregate with them, in, e.g., a RAID 0, and maybe not in any RAID set. > I'd say that this means that the drive should hold the active 4k block > in nvram or something.. This would be very useful, but unlikely in the extreme, I think, because of the associated costs. 8-(. But it would be very, very useful. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 17:12:54 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DF5B937B401 for ; Fri, 31 Jan 2003 17:12:52 -0800 (PST) Received: from heron.mail.pas.earthlink.net (heron.mail.pas.earthlink.net [207.217.120.189]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5F27A43F43 for ; Fri, 31 Jan 2003 17:12:52 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0052.cvx40-bradley.dialup.earthlink.net ([216.244.42.52] helo=mindspring.com) by heron.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18emD9-0001P6-00; Fri, 31 Jan 2003 17:12:48 -0800 Message-ID: <3E3B1E96.B76237AD@mindspring.com> Date: Fri, 31 Jan 2003 17:10:46 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Nathan Hawkins Cc: Steve Byan , freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE References: <20030131195042.GD6243@quic.net> Content-Type: text/plain; charset=us-ascii 
Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4dd2b123101f54ac3b9604623d3bb7044350badd9bab72f9c350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Nathan Hawkins wrote: > You might want to talk with Veritas. I'm pretty sure their Volume > Manager's log subdisks assume 512-byte sectors. Yes, this is true. It would cause a problem for VXFS, at least the VXFS whose source code I disked around with for USL's use on UnixWare; almost all the directory entry management code is verbatim from the USL UFS sources. I know that AIX *would not* have a problem on the old HPFS, but the OS/2 HPFS might have a problem. I think Solaris, and anyone else using a UFS derived FS would probably have a problem with directory entry management, and for those areas I've already noted. I don't know if the NXFS I wrote for Novell's NetWare for UNIX product is still in use anywhere, or not, these days, but if it is, then it would have a problem, too, both in directory ops, and in secondary inode management for EA's and resource forks. The SGI XFS people, Novell, and the GFS people would also be good ones to ask for input. Microsoft and Apple, too, if it weren't obvious. 8-). > More generally, what impact would this have on existing RAID > implementations, hardware or software? This is a potentially more > damaging impact than filesystem semantics. The real question is sector sparing, when it comes to that, and whether it's on 4K boundaries or not, etc. For the most part, RAID that does parity should not care, but RAID 0 and 1 may be a problem during a power failure, unless PHK's issue about the write caching, and the inability to disconnect the bus on the data portion of the write, is fixed.
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 17:22:16 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 20ADF37B401; Fri, 31 Jan 2003 17:22:15 -0800 (PST) Received: from heron.mail.pas.earthlink.net (heron.mail.pas.earthlink.net [207.217.120.189]) by mx1.FreeBSD.org (Postfix) with ESMTP id 388B243F93; Fri, 31 Jan 2003 17:22:14 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0030.cvx21-bradley.dialup.earthlink.net ([209.179.192.30] helo=mindspring.com) by heron.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18emMC-0002qL-00; Fri, 31 Jan 2003 17:22:08 -0800 Message-ID: <3E3B20BF.B6F0BC6E@mindspring.com> Date: Fri, 31 Jan 2003 17:19:59 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: phk@freebsd.org Cc: Julian Elischer , Steve Byan , freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE References: <31671.1044043107@critter.freebsd.dk> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a42ee81c9e74eb11e50fc2b86576933bd5666fa475841a1c7a350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org phk@freebsd.org wrote: > In message , Ju > lian Elischer writes: > >One thign I thought of is that it is not uncommon to 'dd' an entire > >filesystem from one partition to another. > >If we create a filesystem that is 'aligned' and we copy it to be > >'unalligned', we'd have a sudden performance drop for no immediatly > >obvious reason. What was one write, would become a 2-sector read, > >modify and 2-sector write. 
Especially when copying from one failing > >drive to another with slightly different characteristics. > > If you run dd without bs=ALOT you deserve bad throughput. I think he means that the performance of the resulting FS, if it had expectations of running on a 4K block size, and got a 512b one instead, would be unexpected (e.g. the only difference between the disks is a "Q" or "R" at the end of the disk model number, etc.). The real answer, if that's what you mean, Julian, is that the FS is not likely to be transportable between the devices, or, minimally, from a 512b to a 4K, because of the existing data not having taken the 4K alignment issues into account (e.g. directories would be an even multiple of 512b in length, rather than an even multiple of 4K in length). From a 4K to a 512b, there might also be an offset issue, if they were not treated internally as if they were 512b on 4K systems, for data storage, and only treated as 4K for atomicity. My recommendation would be to indicate that doing this is no longer supported between drives of different physical block sizes. FWIW, the original NEC PC98 disks were 1K physical block size disks. It might be worthwhile to ask the PC98 folks about problems, but I'm going to guess that none of their fictitious geometries, before they moved to using standard disks, was ever an odd sector count per track.
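The dd scenario can be put into numbers. A minimal sketch, assuming 512-byte logical sectors, 4K physical blocks and 4K filesystem blocks (the function name and its parameters are made up for illustration):

```python
def physical_blocks_touched(part_start, fs_block, fs_block_size=4096,
                            logical=512, physical=4096):
    """Return (count, needs_rmw) for writing one filesystem block.

    part_start is the partition's first logical sector on the disk;
    fs_block is the index of the filesystem block within the partition.
    count is the number of physical blocks the write touches, and
    needs_rmw is True when the write does not cover them completely.
    """
    start = part_start * logical + fs_block * fs_block_size  # first byte written
    end = start + fs_block_size                              # one past the last byte
    first = start // physical
    last = (end - 1) // physical
    needs_rmw = (start % physical != 0) or (end % physical != 0)
    return last - first + 1, needs_rmw


if __name__ == "__main__":
    print(physical_blocks_touched(64, 0))   # partition starts on a 4K boundary
    print(physical_blocks_touched(63, 0))   # partition starts at sector 63
```

With a partition that starts at sector 64 each 4K filesystem block maps onto exactly one physical block. Copy the same filesystem to a partition that starts at sector 63 and every block straddles two physical blocks, so each write turns into a read-modify-write of both.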
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Jan 31 18:27:26 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2219F37B401; Fri, 31 Jan 2003 18:27:25 -0800 (PST) Received: from wantadilla.lemis.com (wantadilla.lemis.com [192.109.197.80]) by mx1.FreeBSD.org (Postfix) with ESMTP id C32D343F43; Fri, 31 Jan 2003 18:27:23 -0800 (PST) (envelope-from grog@lemis.com) Received: by wantadilla.lemis.com (Postfix, from userid 1004) id 0D4F651987; Sat, 1 Feb 2003 12:57:17 +1030 (CST) Date: Sat, 1 Feb 2003 12:57:16 +1030 From: Greg 'groggy' Lehey To: Steve Byan Cc: phk@freebsd.org, freebsd-fs@freebsd.org, tech-kern@netbsd.org Subject: Track buffering (was: DEV_B_SIZE) Message-ID: <20030201022716.GO92530@wantadilla.lemis.com> References: <2639.1044031853@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4i Organization: The FreeBSD Project Phone: +61-8-8388-8286 Fax: +61-8-8388-8725 Mobile: +61-418-838-708 WWW-Home-Page: http://www.FreeBSD.org/ X-PGP-Fingerprint: 9A1B 8202 BCCE B846 F92F 09AC 22E6 F290 507A 4223 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Friday, 31 January 2003 at 12:03:44 -0500, Steve Byan wrote: > > On Friday, January 31, 2003, at 11:50 AM, phk@freebsd.org wrote: >> It was my impression that already many drives write entire tracks >> as atomic units, at least we have had plenty of anecdotal evidence >> to this effect ? > > I'm not aware of any SCSI or ATA disks which do this; certainly no > Maxtor disk does. Count-key-data mainframe disks can be formatted to do > so, but such disks probably don't run Unix. 
Caching in ATA disks might > lead one to believe that the disk could corrupt an entire track, in the > sense that a panic ( aka bluescreen) or a power-failure would cause all > pending writes in its buffer to be lost, but even in ATA-land I don't > believe a power failure would result in more than one disk block > returning an uncorrectable read error. A couple of years back I did some power fail testing on IBM IDE drives. On one occasion I managed to blow out a whole range of sectors (about 80), which I attributed to trashing a track buffer. Greg -- See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Feb 1 0:40:48 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 36B9137B422 for ; Sat, 1 Feb 2003 00:40:43 -0800 (PST) Received: from host213-122-108-127.in-addr.btopenworld.com (host213-122-108-127.in-addr.btopenworld.com [213.122.108.127]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2552F43FA7 for ; Sat, 1 Feb 2003 00:40:36 -0800 (PST) (envelope-from dsl@l8s.co.uk) Received: (from dsl@localhost) by snowdrop.l8s.co.uk (8.11.6/8.11.6) id h118ise01592; Sat, 1 Feb 2003 08:44:54 GMT Date: Sat, 1 Feb 2003 08:44:54 +0000 From: David Laight To: Steve Byan Cc: freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE Message-ID: <20030201084454.A1388@snowdrop.l8s.co.uk> References: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com>; from stephen_byan@maxtor.com on Fri, Jan 31, 2003 at 11:30:18AM -0500 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: 
X-Loop: FreeBSD.org The only reason I can see for supporting 512byte reads is to allow them to to be used as system disks without requiring a BIOS update. I suspect that the only reason that the BSD systems don't support sector sizes other than 512 is a lack of test media. Indeed someone has recently gone through the netbsd code getting it to work with (IIRC) 1k blocks for a specific disk. With a test sample the ffs support would be fixed in a few days, and probably backported to recent releases within a few weeks. No one using windows will care :-) you could lock the ATA bus a few times a day and they'd just reset and continue. :-) David -- David Laight: david@l8s.co.uk To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Feb 1 1:59:10 2003 Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E573337B401 for ; Sat, 1 Feb 2003 01:59:09 -0800 (PST) Received: from chylonia.3miasto.net (chylonia.3miasto.net [217.96.12.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id AA46B43F3F for ; Sat, 1 Feb 2003 01:59:03 -0800 (PST) (envelope-from wojtek@tensor.3miasto.net) Received: from localhost (localhost [[UNIX: localhost]]) by chylonia.3miasto.net (8.11.6/8.11.6) with ESMTP id h119wYh01207; Sat, 1 Feb 2003 10:58:34 +0100 (CET) X-Authentication-Warning: chylonia.3miasto.net: wojtek owned process doing -bs Date: Sat, 1 Feb 2003 10:58:34 +0100 (CET) From: Wojciech Puchar X-X-Sender: wojtek@chylonia.3miasto.net To: David Laight Cc: Steve Byan , freebsd-fs@FreeBSD.ORG, tech-kern@netbsd.org Subject: Re: DEV_B_SIZE In-Reply-To: <20030201084454.A1388@snowdrop.l8s.co.uk> Message-ID: References: <4912E0FE-3539-11D7-B26B-00306548867E@maxtor.com> <20030201084454.A1388@snowdrop.l8s.co.uk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: 
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.org

> them to be used as system disks without requiring a BIOS update.
>
> I suspect that the only reason that the BSD systems don't support
> sector sizes other than 512 is a lack of test media.

This is not true. Older SCSI drives allow formatting with 1K sectors,
CD-ROMs are 2K and are usually EMULATED as 512-byte devices by NetBSD,
and magneto-opticals go up to 4KB (and don't work in NetBSD because of
that).

> Indeed, someone has recently gone through the NetBSD code getting
> it to work with (IIRC) 1k blocks for a specific disk.
>
> With a test sample the ffs support would be fixed in a few days,
> and probably backported to recent releases within a few weeks.
>
> No one using Windows will care :-) You could lock the ATA bus

Exactly.

From owner-freebsd-fs Sat Feb 1 5:18: 1 2003
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id 475B037B401
    for ; Sat, 1 Feb 2003 05:18:00 -0800 (PST)
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.86.163])
    by mx1.FreeBSD.org (Postfix) with ESMTP id 5D75043F43
    for ; Sat, 1 Feb 2003 05:17:59 -0800 (PST)
    (envelope-from phk@freebsd.org)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
    by critter.freebsd.dk (8.12.6/8.12.6) with ESMTP id h11DHq4W018172;
    Sat, 1 Feb 2003 14:17:57 +0100 (CET)
    (envelope-from phk@freebsd.org)
To: David Laight
Cc: Steve Byan , freebsd-fs@freebsd.org, tech-kern@netbsd.org
Subject: Re: DEV_B_SIZE
From: phk@freebsd.org
In-Reply-To: Your message of "Sat, 01 Feb 2003 08:44:54 GMT."
    <20030201084454.A1388@snowdrop.l8s.co.uk>
Date: Sat, 01 Feb 2003 14:17:52 +0100
Message-ID: <18171.1044105472@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.org

In message <20030201084454.A1388@snowdrop.l8s.co.uk>, David Laight writes:
>The only reason I can see for supporting 512-byte reads is to allow
>them to be used as system disks without requiring a BIOS update.
>
>I suspect that the only reason that the BSD systems don't support
>sector sizes other than 512 is a lack of test media.

What gave you the impression that we don't support anything but
512 bytes?

I'm running a 2k sectorsize device right now.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
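
The sector sizes mentioned in this thread (512 bytes, 1K, 2K, 4KB) all come down to the same arithmetic whenever upper layers count in 512-byte blocks while the device uses a larger native sector. The following is only a minimal sketch of that mapping, not how FreeBSD or NetBSD implement it. The names LOGICAL_BSIZE, struct sector_ref and logical_to_native() are hypothetical and chosen for illustration, and the sketch assumes the native sector size is a non-zero multiple of 512.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the block size upper layers count in.
 * Modelled on the 512-byte unit discussed in this thread. */
#define LOGICAL_BSIZE 512u

/* Hypothetical result type: where a logical block lives on the device. */
struct sector_ref {
	uint64_t sector;	/* native sector holding the logical block */
	uint32_t offset;	/* byte offset of the block inside that sector */
};

/*
 * Map a 512-byte logical block number onto a device whose native
 * sector size is sector_size bytes.
 * Returns 0 on success, or -1 if sector_size is zero or not a
 * multiple of LOGICAL_BSIZE (this sketch does not handle that case).
 */
static int
logical_to_native(uint64_t lblk, uint32_t sector_size, struct sector_ref *out)
{
	assert(out != NULL);

	if (sector_size == 0 || sector_size % LOGICAL_BSIZE != 0)
		return -1;

	/* Number of 512-byte logical blocks packed into one native sector. */
	uint32_t per_sector = sector_size / LOGICAL_BSIZE;

	out->sector = lblk / per_sector;
	out->offset = (uint32_t)(lblk % per_sector) * LOGICAL_BSIZE;
	return 0;
}
```

With a 2048-byte sector, for instance, logical block 5 lands in native sector 1 at byte offset 512. A driver emulating 512-byte access on such a device would have to read the whole native sector and, for writes, modify the 512-byte slice and write the full sector back.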