From owner-freebsd-hackers Thu Sep 28 14:27: 9 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx02.netapp.com (mx02.netapp.com [198.95.226.52])
    by hub.freebsd.org (Postfix) with ESMTP id 81B4837B424
    for ; Thu, 28 Sep 2000 14:26:25 -0700 (PDT)
Received: from frejya.corp.netapp.com (frejya.dmz.netapp.com [10.254.253.21])
    by mx02.netapp.com (8.11.0/8.11.0/NTAP-1.0) with ESMTP id e8SLPkA07627;
    Thu, 28 Sep 2000 14:25:47 -0700 (PDT)
Received: from tooting.eng.netapp.com (tooting.eng.netapp.com [10.100.4.46])
    by frejya.corp.netapp.com (8.11.0/8.11.0/NTAP-1.1) with ESMTP id e8SLPjG22250;
    Thu, 28 Sep 2000 14:25:46 -0700 (PDT)
Received: (from guy@localhost)
    by tooting.eng.netapp.com (8.8.8+Sun/8.8.8) id OAA25500;
    Thu, 28 Sep 2000 14:25:45 -0700 (PDT)
From: Guy Harris
Message-Id: <200009282125.OAA25500@tooting.eng.netapp.com>
Subject: Re: nfs v2
In-Reply-To: from Danny Braniss at "Sep 28, 2000 09:49:33 pm"
To: Danny Braniss
Date: Thu, 28 Sep 2000 14:25:44 -0700 (PDT)
Cc: Guy Harris , guy@netapp.com, freebsd-hackers@freebsd.org
X-Mailer: ELM [version 2.4ME++ PL59 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> } 1) NFS V2 having, as I remember, insufficient bits in the
> }    major/minor device value used when creating special files to
> }    support more than 8 bits of major and 8 bits of minor device;
> if i remember correctly,i copied the / over to the NetAPP via nfsv3
> either tar or dump, and all is ok. it's when it gets mounted v2 (which the
> diskcless boot does) it's when dev is wrong.

Originally:

    UNIX systems had 8-bit major and 8-bit minor devices;

    NFS V2 had no mechanism for creating special files.

Then Sun needed "mknod" support in V2 for NFS-only diskless operation,
so they added a hack to V2: a V2 CREATE operation in which the "mode"
field of the "attributes" member of the arguments had the upper 4 bits
set was treated as an attempt to create a file other than a plain file,
and those bits contained a standard UNIX file type, e.g. 0020000 for a
character special file; the "size" field of the attributes was to be
interpreted as the major/minor device.

Later, for SV-style named pipe support, they added an additional hack
wherein a character special file create with a size of 0xffffffff was
treated as an attempt to create a FIFO special file.  (That was
sufficiently long ago that I forget why passing a file type of 010000,
i.e. IFIFO, wasn't the way it was done.)

Later, SVR4 extended the major/minor device to 32 bits, with 14 bits of
major device and 18 bits of minor device.

To handle this over NFS V2, what SunOS 5.5.1's NFS server code, at
least, appears to do is:

    1) store the major and minor device as 14-bit and 18-bit fields in
       a 32-bit word;

    2) in a V2 CREATE request that attempts to create a character or
       block special file, check whether any of the upper 16 bits of
       the 32-bit size field are 1 and:

           if not, treat the size field as an 8-bit major device and
           an 8-bit minor device, and store the 8-bit major in the
           upper 14 bits of the resulting file's "rdev" and the 8-bit
           minor in the lower 18 bits of the resulting file's "rdev";

           if so, treat the size field as a 14-bit major device and an
           18-bit minor device, and store the field as the resulting
           file's "rdev";

    3) when constructing V2 attributes of a file, if the major and
       minor device both fit in 8 bits, shift the major left by 8 and
       OR in the minor and stuff the result into the "rdev" field;
       otherwise shift the major left by 18 and OR in the minor and
       stuff the result into the "rdev" field.
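
As a rough sketch of the three steps just listed (Python, purely for
illustration; the function names are made up and this is not the actual
SunOS code, only the bit widths and the upper-16-bit test come from the
description above), the V2 handling might look something like this:

```python
# Hypothetical sketch of the SunOS 5.5.1 NFS V2 heuristics described above.
# Names are invented for illustration; only the bit layout is from the text.

MINOR_BITS = 18                     # SVR4: 14-bit major, 18-bit minor
MINOR_MASK = (1 << MINOR_BITS) - 1


def v2_create_size_to_rdev(size):
    """Turn the 32-bit "size" field of a V2 CREATE into a 14/18 rdev."""
    if size & 0xFFFF0000:
        # At least one of the upper 16 bits is set: take it as 14/18 already.
        return size
    # Upper 16 bits are clear: assume an 8-bit major and an 8-bit minor.
    major = (size >> 8) & 0xFF
    minor = size & 0xFF
    return (major << MINOR_BITS) | minor


def rdev_to_v2_attr(rdev):
    """Build the V2 "rdev" attribute from a stored 14/18 rdev."""
    major = rdev >> MINOR_BITS
    minor = rdev & MINOR_MASK
    if major <= 0xFF and minor <= 0xFF:
        return (major << 8) | minor  # both fit in 8 bits: send 8/8
    return rdev                      # otherwise send the 14/18 word as is
```

For instance, an 8/8-style size of 0x0302 (major 3, minor 2) would be
stored as (3 << 18) | 2 and handed back to V2 clients as 0x0302 again.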

This dual interpretation was, presumably, done to allow both SunOS 4.x
(8-bit major, 8-bit minor) and SunOS 5.x (14-bit major, 18-bit minor)
systems to work together.

Then NFS V3 came along; in V3, there's a MKNOD operation, and it
supplies "specdata1" and "specdata2" for character and block special
files, which are, on UNIX systems, interpreted as major and minor
devices, respectively.

For V3, what SunOS 5.5.1 appears to do is:

    1) in a V3 MKNOD request that attempts to create a character or
       block special file, combine the "specdata1" and "specdata2"
       fields as if they were a 14-bit major and 18-bit minor;

    2) when constructing V3 attributes for the file, stuff the major
       into "specdata1" and the minor into "specdata2".

What FreeBSD 3.4's client code, at least, does on a "mknod" is:

    for V2, do a CREATE, pass the appropriate mode bits, and pass the
    "rdev" value as the size;

    for V3, do a MKNOD, and pass the major and minor as "specdata1"
    and "specdata2".

The FreeBSD 32-bit major/minor value is 8 bits of major and 24 bits of
minor, which isn't the same as SVR4's 14/18.

On a "getattr", what FreeBSD 3.4's client code does is:

    for V2, treat the "rdev" value as a dev_t;

    for V3, treat "specdata1" as an 8-bit major and "specdata2" as a
    24-bit minor, and combine them with "makedev" into a dev_t.

NetApp filers originally just, as I remember, stuffed the size field
into the 32-bit "rdev" field of our inode on a CREATE operation, and
returned it in the "rdev" field of an "fattr" structure on a GETATTR
operation.

When we added V3 support, on a V2 CREATE we interpreted the "size" field
as containing an 8-bit major and an 8-bit minor, and passed those on to
the file system as the "specdata1" and "specdata2" values, and passed
"specdata1" and "specdata2" from a V3 MKNOD on in the same fashion; the
file system then treated them both as 8-bit values, and stuffed them
into the "rdev" field of the inode.  (At the time, Solaris didn't
*support* NFS V3.)

On a GETATTR operation, the file system split the "rdev" field into
8-bit "specdata1" and "specdata2" fields, and then:

    for V2, combined them into an 8-bit+8-bit rdev field in the NFS
    reply;

    for V3, returned them as "specdata1" and "specdata2" in the NFS
    reply.

(NOTE: the rdev field occupies the same space as the top-level file
block pointers; we don't waste 32 bits of the inode for files that
aren't character or block special files - we don't have 32 bits to
waste, as we have to stuff various DOS/Windows gunk in there as well,
for Windows CIFS clients.)

Later, when we had to support diskless Solaris clients using V3:

    for a V2 create, we just passed the size field on to the file
    system unchanged as what amounts to an "rdev" value;

    for a V3 create, we assumed that the client was a 14/18 system,
    stuffed "specdata1" into the upper 14 bits and "specdata2" into
    the lower 18 bits, and passed that on to the file system as what
    amounts to an "rdev" value;

    we stuffed the "rdev" value into the inode's "rdev" field;

    for V2 GETATTR, we returned the "rdev" field as the "rdev" field;

    for V3 GETATTR, we checked whether the "rdev" field had any of the
    upper 16 bits set, and:

        if so, we split it into a 14-bit major and an 18-bit minor
        device, and returned those as "specdata1" and "specdata2";

        if not (meaning that either the inode was created by a
        version of our software that didn't have support for 14/18
        device values, or had a major device of 0), we split it into
        an 8-bit major and an 8-bit minor device, and returned those
        as "specdata1" and "specdata2".

A V2 CREATE from FreeBSD of device (2, 2) looks as if it'd pass
(2 << 24) | 2 over the wire, i.e. 0x02000002.  We'd stuff 0x02000002
into the inode, and should, in an NFS V2 reply, return it as 0x02000002.
It looks as if FreeBSD would handle that.
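
To make the later filer behaviour a bit more concrete, here's a small
sketch (Python, for illustration only; the function names are invented
and none of this is actual filer code, only the bit widths, the
upper-16-bit test and the (2 << 24) | 2 packing are taken from the
description above):

```python
# Hypothetical sketch of the later filer behaviour described above.
# Function names are made up; only the bit layout comes from the text.

def v2_create_to_rdev(size):
    """V2 create: the size field is stored unchanged as the rdev."""
    return size & 0xFFFFFFFF


def v3_create_to_rdev(specdata1, specdata2):
    """V3 create: assume a 14/18 client and pack major/minor into 32 bits."""
    return ((specdata1 & 0x3FFF) << 18) | (specdata2 & 0x3FFFF)


def v2_getattr_rdev(rdev):
    """V2 GETATTR: the stored rdev is returned unchanged."""
    return rdev


def v3_getattr_specdata(rdev):
    """V3 GETATTR: split as 14/18 if any upper 16 bit is set, else as 8/8."""
    if rdev & 0xFFFF0000:
        return rdev >> 18, rdev & 0x3FFFF
    return (rdev >> 8) & 0xFF, rdev & 0xFF


def freebsd_pack(major, minor):
    """The 8/24 packing this message attributes to the FreeBSD client."""
    return (major << 24) | minor


# V2 round trip for device (2, 2): what goes in comes back out unchanged.
wire = freebsd_pack(2, 2)
stored = v2_create_to_rdev(wire)
returned = v2_getattr_rdev(stored)
```

Here "wire", "stored" and "returned" all end up as 0x02000002, which is
the V2 round trip described in the paragraph above.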

A V3 MKNOD from FreeBSD of device (2, 2) would pass 2 over the wire as
"specdata1" and 2 over the wire as "specdata2"; what the server does
with that depends on the server:

    Solaris would, I suspect, turn that into 14 bits of 2 and 18 bits
    of 2, i.e. 0x00080002;

    NetApp filers would do the same;

    an OS with 12-bit majors and 20-bit minors would turn it into
    0x00200002;

    an OS such as FreeBSD with 8-bit majors and 24-bit minors would
    turn it into 0x02000002.

A V2 GETATTR would get back whichever of those the server's OS did,
which would not be correctly interpreted unless the server had 8-bit
majors and 24-bit minors and thus sent 0x02000002.

> } 2) some OSes - Solaris was the one with which we were having
> }    problems, as I remember - requiring those extra bits.
> i tried solaris 2.6 and it's ok.

If the major device is non-zero, the size field won't fit in 16 bits, so
the SunOS 5.5.1 server will probably misinterpret the size field of a V2
create as being 14/18 rather than 8/24.

(If the major device *is* zero, then, at least with a SunOS 5.5 client
and SunOS 5.5.1 server, the command

    mknod foobar c 0 8192

when using V2 created a file with a major device of 32 and a minor
device of 0; the upper 16 bits of the size were zero, so the 5.5.1
server assumed that the request was probably coming from a 4.x client.)

Perhaps later versions of SunOS 5.x don't do this; are you saying that
you tried a Solaris 2.6 server?  If so, what happens if you do an
"ls -l", *on the Solaris server*, of the FreeBSD client's "/dev/null"
file?  Does it report "2, 2", or does it report something else?

> }NFS V3 is probably a better idea, if you can use it; we (NetApp) have
> }supported it for many years, and I suspect most if not all other vendors
> }of NFS servers do so as well.
> }
> and it's the prefered mount here too, the problem is the FreeBSD nfs_root/boot
> that is booting using V2.
> im trying to see how to get the boot to it's magic
> via V3, but that does not fix the problem :-)

To which problem are you referring?

I don't think there *is* a solution to the "create special files using
V3, get their attributes using V2" problem other than "only run servers
whose OSes use the same major/minor bitfield sizes as your client".

One solution to the "special files don't work" problem is "create all
the special files using the same version of NFS as will be used to get
their attributes", in which case, if you're going to be creating the
special files with V3, getting the OS to mount the root file system
using V3 *would* fix the problem.

>
> }Also, could you get a network trace of:
> }
> }    the creation of the "/dev/null" entry, if it was done over NFS;
> }
> }    attempts by the FreeBSD box to get the attributes of "/dev/null"
> }    via NFS (e.g., an "ls -l /mnt/tmp/null", from your example);
> }
> }and send them to me?
> if you mena a tcpdump, that will have to wait till the morning (my morning :-)

Yes, I mean tcpdumps - if you use tcpdump, use "-s 65535", so that
tcpdump's annoying default teeny tiny snapshot length of 68 doesn't end
up cutting off a lot of the interesting parts of the NFS requests and
replies.

Also, send me the raw tcpdump captures (i.e., capture with "-w" to a
savefile), rather than tcpdump's printed interpretation thereof - I may
want to run them through Ethereal or convert them to snoop format and
run them through snoop.

Do captures for all the servers on which you've tried this, both of the
creation of the special files and the attempts to get the attributes.

> PS: out of curosity, what os is NetAPP base on?

The core kernel is our own; it's a message-passing, non-preemptive,
kernel-mode-only, single-address-space, no-demand-paging kernel.

The networking stack is 4.4-Lite-derived, with some bits of code from
various later BSDs added in, although it's drifted a fair bit from the
BSD base (i.e., it's not a no-brainer to move stuff into it from BSD
stacks or from it to BSD stacks at this point).

A number of the commands are 4.4-Lite-derived as well, although we had
to assault them with a chainsaw to get them to run in our
single-address-space environment.

The NFS server code takes some stuff from 4.4-Lite, but we changed that
code a lot.

The file system is our own, as is the CIFS server code (no Samba
involved), the disk/SCSI/Fibre Channel subsystem, and RAID.

Some of the platform support code on x86 originally came from BSD, and
the Alpha divide/remainder routine came from NetBSD, but, at this point,
most of the platform code is our own, even on x86.

I.e., it's based mostly on our code, with a bunch of BSD stuff, mainly
in the networking area, but even that stuff's often been changed a fair
bit.  It's not running a standard general-purpose OS with an appliance
wrapper.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message