Date:      Sat, 03 May 2008 20:09:25 +0200
From:      Attila Nagy <bra@fsn.hu>
To:        ticso@cicely.de
Cc:        freebsd-fs@freebsd.org
Subject:   Re: Consistent inodes between distinct machines
Message-ID:  <481CAA55.2030506@fsn.hu>
In-Reply-To: <20080503125050.GG40730@cicely12.cicely.de>
References:  <48070DCF.9090902@fsn.hu> <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org> <20080503125050.GG40730@cicely12.cicely.de>

Hello,

On 2008.05.03. 14:50, Bernd Walter wrote:
> On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
>   
>> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
>>
>>     
>>> Hello,
>>>
>>> I have several NFS servers where the service must be available
>>> 24/7. The servers are mounted read-only on the clients, and I've
>>> solved the problem of maintaining consistent inodes between them by
>>> rsyncing a UFS image and mounting it via md on the NFS servers.
>>> The machines have a common IP address with CARP, so if one of them  
>>> falls out, the other(s) can take over.
>>>
>>> This works nicely, but rsyncing multi-gigabyte files is becoming
>>> more and more annoying, so I've wondered whether it would be
>>> possible to keep the inodes consistent between machines some other way.
>>>       
>> Why not avoid syncing multi-gigabyte files by splitting your huge FS  
>> image into many smaller say 512MB files, then use md and geom concat/ 
>> stripe/etc to make them all one image that you mount?
>>     
>
> What would be the benefit of doing this?
> FFS distributes data over the media, so in almost every case all of
> the small files change and you have to checksum-compare the whole
> virtual disk anyway.
> With multiple files the syncing is more complex. For example, a normal
> rsync run can guarantee that you get a complete file synced or none
> at all, but this doesn't work out of the box with multiple files, so
> you risk half-updated data.
>   
I haven't got Eric's e-mail, but I agree with the above.
> Nevertheless I think that the UFS/NFS combo is not very good for this
> problem.
>   
I don't think so. I need a stable system, and on FreeBSD, UFS/NFS is 
the combination that is in that state.
> With ZFS send/receive however inode numbers are consistent.
>   
Yes, they are, but the filesystem IDs are not, so you cannot have CARP 
failover for the NFS servers: all clients will get ESTALE errors on 
everything.
I've already tried that, see my e-mails about this topic in the archives 
(it would be good if we could synchronize the filesystem IDs and 
therefore the filehandles too).
> Together with the differential stream creation it is quite efficient
> to sync large volumes as well.
> [75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test
> receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03
> received 126Mb stream in 28 seconds (4.50Mb/sec)
> 0.008u 5.046s 0:27.93 18.0%     53+2246k 0+0io 0pf+0w
>   
Yes, that's why I thought of this in the first place. But there is 
another problem, which bites us today (with the loopback image mount) 
as well: you have to unmount the image and restart the NFS server 
(otherwise the machine can panic), so we have to flip the active state 
from one machine to the other during each sync.
The exact process looks like this:
- rsync the image to the inactive server
- when it's done, remount the image and restart the nfsd
- flip CARP (this is when the new content will go into production)
- sync the image to the now inactive, previously active server
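The steps above can be sketched as a shell script. This is only a sketch: the standby host name, image path, md unit, and CARP interface/vhid are hypothetical, not from this mail.

```shell
#!/bin/sh
# Runbook sketch for the image-sync failover described above.
# Assumptions (not from the original mail): the standby host is
# "nfs-standby", the image lives at /images/data.img, it is attached
# as md0 and mounted on /data, and CARP runs on carp0.

set -e

# 1. rsync the image to the inactive server
rsync -a --inplace /images/data.img nfs-standby:/images/data.img

# 2. on the standby: remount the image and restart nfsd
ssh nfs-standby '
    umount /data
    mdconfig -d -u 0
    mdconfig -a -t vnode -f /images/data.img -u 0
    mount -o ro /dev/md0 /data
    /etc/rc.d/nfsd restart
'

# 3. flip CARP so the standby (holding the new content) goes active:
# lower its advskew below ours, then raise ours
ssh nfs-standby 'ifconfig carp0 advskew 0'
ifconfig carp0 advskew 100

# 4. repeat step 1 in the opposite direction, syncing the image to
# the now-inactive, previously active server
```

Step 3 is where the new content goes into production; until then NFS clients keep talking to the old active server.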

This is a painful, slow (because of the rsync) and fragile process. And 
if the active server crashes while the sync is running, you are left 
with a possibly non-working state.

With ZFS, the sync time is much smaller, but you have to flip the active 
state and restart nfsd as well.

Currently I'm experimenting with a silly kernel patch, which replaces 
the following arc4random()s with a constant value:
./ffs/ffs_alloc.c:              ip->i_gen = arc4random() / 2 + 1;
./ffs/ffs_alloc.c:              prefcg = arc4random() % fs->fs_ncg;
./ffs/ffs_alloc.c:                      dp2->di_gen = arc4random() / 2 + 1;
./ffs/ffs_vfsops.c:             ip->i_gen = arc4random() / 2 + 1;
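For illustration, the first of those replacements might look like this as a patch. A sketch only: the constant 1 is an arbitrary deterministic (and, importantly, non-zero) choice, and a real patch would need the same treatment at all four sites listed above.

```diff
--- sys/ufs/ffs/ffs_alloc.c.orig
+++ sys/ufs/ffs/ffs_alloc.c
@@
-		ip->i_gen = arc4random() / 2 + 1;
+		ip->i_gen = 1;	/* constant, so inodes match across machines */
```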

It seems that this works as long as I don't use soft updates on the 
volumes. So what I have now:
- all of the machines have the above arc4random()s replaced with a constant
- all machines run the data file system in async mode (for speed, and 
because soft updates seem to break the constant inodes)
- I have all the data in a subversion repository (better than a plain 
"master image", because it's versioned, logged, etc)
- I do updates in this way on the machines: mount -o rw,async /data; svn 
up; mount -o ro /data
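Spelled out as a script, that update step might look like the following sketch. Note one assumption beyond the mail: on FreeBSD, changing the options of an already-mounted filesystem needs `mount -u`.

```shell
# Sketch of the per-machine update procedure; /data is the dataset
# from the mail, the -u flag is an assumption about the exact command.

# Remount the (normally read-only) filesystem read-write and async
mount -u -o rw,async /data

# Pull the new content from the shared subversion repository
svn up /data

# Remount read-only again so nfsd serves an unchanging filesystem
mount -u -o ro /data
```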

So far it seems to be OK, but I'm not yet finished with the testing.


