From owner-freebsd-questions Tue Mar 19 18:19:56 1996
Return-Path: owner-questions
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id SAA21501 for questions-outgoing; Tue, 19 Mar 1996 18:19:56 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id SAA21494 for ; Tue, 19 Mar 1996 18:19:50 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id TAA25640; Tue, 19 Mar 1996 19:10:39 -0700
From: Terry Lambert
Message-Id: <199603200210.TAA25640@phaeton.artisoft.com>
Subject: Re: Diskless FreeBSD
To: msmith@atrad.adelaide.edu.au (Michael Smith)
Date: Tue, 19 Mar 1996 19:10:39 -0700 (MST)
Cc: questions@freebsd.org
In-Reply-To: <199603192342.KAA02833@genesis.atrad.adelaide.edu.au> from "Michael Smith" at Mar 20, 96 10:12:36 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-questions@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > > I don't know _why_, I suspect that the NFS swap code doesn't/won't/can't
> > > extend the file, but I haven't been bothered enough by it to try to find out.
> >
> > How can an NFS client know whether the server is zero-filling the
> > pages or not?
>
> Huh?  The NFS swapfile, in the eyes of the NFS server, is just a file that
> some client is scribbling all over.  It's nothing special.  The problem is
> just that if the file isn't as big as the client expects it to be, for
> some reason the client dies.
>
> Zero-filling here is a total non sequitur.

If you make the swap file a certain size, then unless you actually write
every FS block between the start and the end of the file, it will be a
sparse file: it won't take up the space it says it does.  That is, it will
have allocated only the necessary intermediate blocks, an inode, and one
terminal block (which you cause to be allocated by writing it).  Pages
"read" from unallocated areas in the file will reference '0' block
pointers, which will cause the pages to be created on demand.  In other
words, the file will be a sparse file on the server.

There is no easy way for an NFS client to know whether the file is sparse
or not.  That means it can't care whether the server is zero-filling the
file, i.e., it can't care whether the disk space has really been allocated
or not, unlike a local swap file.  Local swap files care, because they
can't reenter the file system to page-fill.

This has *NOTHING* to do with the fact that the file *MUST* be the right
size in the current code, even if there aren't blocks allocated to it.

> > Typically, I'd say it can't extend the file because, like mmap, the
> > vnode/extent used for cache mapping images (even for swap files)
> > references the length from the mapping structure, not from the in
> > core vnode.
>
> In other words, the kernel is told (via the config file) that it has a 20M
> swap arena, but the map to the swapfile to back that arena is flawed because
> the extent for the file is constrained to the size of the file itself.
> Makes sense I guess.

No.  It assumes that the file you give it is the size you give it in the
config file, even if it's really larger or smaller.

The problem is that the page size is not greater than or equal to the host
file system block size.  As a result, when you go to swap out to page 1024
(4M+0) of a file that has no real space allocated to it, the file system
must perform a partial block write to write the page.
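Concretely, here is a rough userland sketch of the sparse-file point
above (the path and the 16M size are invented for the example): seek to
the configured size and write the one terminal block, and stat(2) then
reports the full size the client expects while almost nothing is really
allocated.

/* sparse.c -- what a sparse "swap" file looks like on the server;
 * path and size are examples only. */
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/export/swap/client1";  /* hypothetical path */
    off_t size = 16 * 1024 * 1024;               /* 16M, per the config */
    struct stat sb;
    int fd;

    if ((fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600)) < 0)
        err(1, "%s", path);

    /* Seek to the last byte and write it; only that terminal block
     * (plus the inode and any indirect blocks) gets allocated. */
    if (lseek(fd, size - 1, SEEK_SET) == -1)
        err(1, "lseek");
    if (write(fd, "", 1) != 1)
        err(1, "write");

    if (fstat(fd, &sb) == -1)
        err(1, "fstat");

    /* st_size claims 16M; st_blocks (512-byte units) shows how little
     * of that is real.  Reads of the holes come back zero-filled. */
    printf("st_size   = %ld bytes\n", (long)sb.st_size);
    printf("st_blocks = %ld x 512 bytes allocated\n", (long)sb.st_blocks);

    close(fd);
    return (0);
}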
Because the bitmaps on the cache are not used the way the header file
implies that they are, a partial block write of a page causes the block
containing the page to be read *IF* the write area is not on an FS block
alignment boundary OR if the write area is smaller than the FS block size.
This has to happen so that if the file contained *real* data in the data
block, the part that isn't being written keeps the same data instead of
being filled with zeros.

Now, swapping will *always* do its writes on page boundaries.  The screwup
occurs because the FS block size is larger than one page.

Because there must be a read-before-write for the partial block to be
written, and the local NFS client's buffer cache must be read into for the
partial FS block write, the NFS client issues a read to the remote host.
This read is past the end of the swap file, and therefore fails.  The NFS
read failure for the full block past the end of the file causes the write
to not be attempted, and the file does not get self-extended.  The
resulting swap write failure panics the client machine.

To fix this "problem" (i.e., you want to do swap "overcommit" on your NFS
server without hacking bootp, which isn't a very good thing to do), you
will need to:

1) Make the FS block size on the server 4k, the same as the page size.  In
   addition, you must make the transfer size for the NFS read/write on the
   server for the swapfile 4k.

	-OR-

2) Make the 8 bit fractional buffer bitmap work as expected, so that
   partial block writes do not require block reads to allow them to
   complete (assuming they are on alignment boundaries).

Solution #2 above will also drastically increase FreeBSD's throughput on
aligned random writes of an existing file, since it will reduce the number
of device blocks which must be written for partial block writes.  Aligned
writes of size 512, 1024, 2048, and 4096 bytes are used in the Ziff-Davis
"WinBench 95" benchmark suite.  That suite also does aligned 200 byte
writes, which will cause one 512 byte disk block to be read/written per
I/O (assuming no cache hit), or two in the case of a 200 byte record
crossing a block boundary.  Sequential writes weenie out by getting a
cache hit, unless they are doing cache-busting (in which case, this would
help those benchmarks, too).

Neither of these guarantees that a write of an object not requiring a read
will in fact result in the object not being read; there used to be an
issue even with fs_bsize-sized writes on alignment boundaries.  I know
something like this was fixed in the last month or so by John Dyson, but
I'm not sure if this was exactly it, or if it was just a similar situation
with some other cache interaction.  In any case, if you avoid the NFS
read, then the NFS write will extend the file, as expected, instead of
causing the thing to crash.  But even if you get all your block sizes
lined up, you may still not be able to avoid the read-before-write, and
avoiding that read-before-write is the only way to avoid the NFS read
past the EOF.

I would suggest, instead, starting with no swap, and adding the swap after
creating it on the remote system in the client's /etc/rc file.  This is
not as "clean" as growing swap as necessary, but I think the zero-fill of
a sparse file will allow you to create the swap as sparse and fool the
client.  This still doesn't solve the crash when the NFS client writes and
the server has no real blocks left to convert the sparse file into a
non-sparse file.  Oh well; to do that, you'd need to get into deeper
detail on the vnode pager itself.
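As a rough sketch of the "create it on the server first" suggestion (the
path argument, the 20M size, and the program itself are all invented for
the example), the server-side step is just to make the file exist at its
full advertised size before the client's /etc/rc picks it up: ftruncate()
gives you the sparse version, and the zero-fill loop commits real blocks
if you don't want to overcommit.

/* mkswapfile.c -- pre-create a client swap file at its full size on
 * the NFS server; size and usage are examples only. */
#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SWAPSIZE (20 * 1024 * 1024)  /* must match the client's config */
#define BLKSIZE  4096                /* write in page-sized chunks */

int
main(int argc, char *argv[])
{
    char buf[BLKSIZE];
    off_t off;
    int fd;

    if (argc != 2)
        errx(1, "usage: mkswapfile /export/swap/clientname");

    if ((fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
        err(1, "%s", argv[1]);

    /* Sparse version: the file is the right size, so the client never
     * reads past EOF (the holes read back zero-filled), but no disk
     * space is committed yet. */
    if (ftruncate(fd, SWAPSIZE) < 0)
        err(1, "ftruncate");

    /* Non-sparse version: actually write every block so the space is
     * really allocated and the server can't run out later.  Leave this
     * loop out if you want to overcommit. */
    memset(buf, 0, sizeof(buf));
    for (off = 0; off < SWAPSIZE; off += BLKSIZE)
        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
            err(1, "write");

    close(fd);
    return (0);
}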
This is doubly difficult because FreeBSD *likes* to use up all its swap if
it can, to keep extra pages quickly accessible... so it will probably
quickly turn small files into large files if it can (you'd have to modify
the page replacement policy to make it act otherwise).

I have no idea in hell how you would safely decide that the swap was no
longer in use so that it could be reclaimed.  (I assume each client will
need its own swap, and if you can't decide which clients are active, you
can't make the file sparse again until the next login of the client that
created the thing.  Unless you have some kind of rc.shutdown, and can be
guaranteed that the clients will call it...)

Personally, I'd commit the files in the rc.local and start a *long* UDP
"keep-alive" to a daemon that likes to delete "dead" swap files; it seems
to be the best bet.  This is only if you convinced me to overcommit my NFS
server's disk space as swap for too many clients in the first place (a
hard sell, to say the least).

One possible intermediate fix, if the read-before-write of block-sized
objects still occurs for an aligned write, is to hack the vnode pager to
know it's on an NFS client (you can get the flag that it's remote from the
swapfile's vnode's pointer to the fs structure), and, if the write would
be past the end of the file, extend the file automatically by writing one
byte at the next FS block boundary past the area you are interested in.
You could do this safely because the area you are writing is guaranteed
not to contain good swap data: if it did, you wouldn't be writing past the
EOF.  You may have to update the client's idea of the file size
promiscuously in the local attribute cache if you do this.

Again, it won't buy you much, because you don't know when it's OK to
destroy the swap files.

Regards,
Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
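For illustration, a very rough sketch of the client half of the UDP
"keep-alive" idea above; the daemon, the port number, the "swapserver"
host name, and the five-minute interval are all invented for the example.
The daemon on the server would just time out any client it hasn't heard
from lately and truncate or unlink that client's swap file.

/* swapalive.c -- minimal client-side UDP "keep-alive" sketch; the
 * server name, port, and interval are assumptions for the example. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <netdb.h>
#include <string.h>
#include <unistd.h>

#define KEEPALIVE_PORT 7777  /* hypothetical swap-reaper daemon port */
#define INTERVAL       300   /* seconds between datagrams */

int
main(void)
{
    struct sockaddr_in sin;
    struct hostent *hp;
    char msg[256];
    int s;

    if ((hp = gethostbyname("swapserver")) == NULL)  /* hypothetical host */
        errx(1, "swapserver: unknown host");

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(KEEPALIVE_PORT);
    memcpy(&sin.sin_addr, hp->h_addr_list[0], hp->h_length);

    if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
        err(1, "socket");

    /* The payload is just our hostname; the daemon maps it to a swap
     * file and refreshes that file's "still alive" timestamp. */
    if (gethostname(msg, sizeof(msg) - 1) < 0)
        err(1, "gethostname");
    msg[sizeof(msg) - 1] = '\0';

    for (;;) {
        if (sendto(s, msg, strlen(msg), 0,
            (struct sockaddr *)&sin, sizeof(sin)) < 0)
            warn("sendto");
        sleep(INTERVAL);
    }
    /* NOTREACHED */
}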