Date:      Sat, 3 May 2008 20:51:55 +0200
From:      Bernd Walter <ticso@cicely12.cicely.de>
To:        Attila Nagy <bra@fsn.hu>
Cc:        freebsd-fs@freebsd.org, ticso@cicely.de
Subject:   Re: Consistent inodes between distinct machines
Message-ID:  <20080503185155.GA44005@cicely12.cicely.de>
In-Reply-To: <481CAA55.2030506@fsn.hu>
References:  <48070DCF.9090902@fsn.hu> <4CA7BA82-E95C-45FF-9B94-8EF27B6DB024@freebsd.org> <20080503125050.GG40730@cicely12.cicely.de> <481CAA55.2030506@fsn.hu>

On Sat, May 03, 2008 at 08:09:25PM +0200, Attila Nagy wrote:
> Hello,
> 
> On 2008.05.03. 14:50, Bernd Walter wrote:
> >On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
> >  
> >>On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
> >Nevertheless I think that the UFS/NFS combo is not very good for this
> >problem.
> >  
> I don't think so. I need a stable system and UFS/NFS is in that state in 
> FreeBSD.

ZFS is pretty stable as well, although there are some points you need
to watch out for and tune.

> >With ZFS send/receive however inode numbers are consistent.
> >  
> Yes, they are, but the filesystem IDs are not, so you cannot have CARP 
> failover for the NFS servers, because all clients will have ESTALE 
> errors on everything.

I hadn't thought about this.
Of course this is a real problem.
Have you tried the following:
Setup Server A with all required ZFS filesystems.
Replicate everything to Server B using dd.
Then the filesystem ID should be the same on both systems.
This will not work for newly created filesystems, however, and you may
need to take extra care not to accidentally swap disks between the
machines, since they carry the same disk IDs as well.
I admit - not exactly perfect :(
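
Roughly like this - the pool name 'data', the filesystem 'data/fs'
and the device names are just examples, and the pool must be idle
while you copy it:

  # on server A, with the pool exported:
  zpool export data
  dd if=/dev/da0 bs=1m | ssh serverB 'dd of=/dev/da0 bs=1m'
  zpool import data
  # afterwards ship only deltas, which keeps inode numbers in sync:
  zfs send -i data/fs@old data/fs@new | ssh serverB zfs receive data/fs

Note that after the dd both pools also share the same pool GUID,
which is exactly why the hosts must never see each other's disks.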

> I've already tried that, see my e-mails about this topic in the archives 
> (it would be good if we could synchronize the filesystem IDs and 
> therefore the filehandles too).
> >Together with the differential stream creation it is quite efficient
> >to sync large volumes as well.
> >[75]cicely14# zfs send data/arm-elf@2008-05-03 | zfs receive -v data/test
> >receiving full stream of data/arm-elf@2008-05-03 into data/test@2008-05-03
> >received 126Mb stream in 28 seconds (4.50Mb/sec)
> >0.008u 5.046s 0:27.93 18.0%     53+2246k 0+0io 0pf+0w
> >  
> Yes, that's why I thought of this in the first place. But there is 
> another problem, which hits us today (with the loopbacked image mount) 
> as well: you have to unmount the image and restart the NFS server (it 
> can panic the machine otherwise), so we have to flip the active state 
> from one machine to the other during the sync.

Of course you have to do this - a read-only mount means the kernel
won't write, but it still caches metadata and does not expect the
underlying media to change underneath it, so to pick up the new
contents you have to remount.
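
On the standby that means something like the following, assuming the
image is attached via mdconfig as /dev/md0 and served from /data
(names are just examples):

  umount /data
  mdconfig -d -u 0
  mdconfig -a -t vnode -f /export/data.img -u 0
  mount -o ro /dev/md0 /data
  /etc/rc.d/nfsd restart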

> The exact process looks like this:
> - rsync the image to the inactive server
> - when it's done, remount the image and restart the nfsd

You also have to sync the image to a different file, since you can't
overwrite the original file with new content while it is mounted.
But with proper (IIRC default) options rsync already writes to a new
file and then swaps it with the old one.
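
For example, with a made-up hostname and path:

  # default rsync behaviour: build the copy under a temporary name
  # in the target directory, then rename(2) it over the old file
  rsync -av active:/export/data.img /export/data.img
  # do NOT use --inplace here - it would overwrite the very blocks
  # the kernel still has mounted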

> - flip CARP (this is when the new content will go into production)
> - sync the image to the now inactive, previously active server
> 
> This is a painful, slow (because of the rsync) and fragile process. And 
> if the active server crashes while the sync is going, you are there with 
> a possibly non-working state.
> 
> With ZFS, the sync time is much smaller, but you have to flip the active 
> state and restart nfsd as well.

Sounds plausible to me.
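
The CARP flip itself is cheap, by the way. A sketch, with made-up
interface and skew values:

  sysctl net.inet.carp.preempt=1   # once, on both hosts
  ifconfig carp0 advskew 240       # demote this box; the peer with
                                   # the lower advskew takes over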

> Currently I'm experimenting with a silly kernel patch, which replaces 
> the following arc4random()s with a constant value:
> ./ffs/ffs_alloc.c:              ip->i_gen = arc4random() / 2 + 1;
> ./ffs/ffs_alloc.c:              prefcg = arc4random() % fs->fs_ncg;
> ./ffs/ffs_alloc.c:                      dp2->di_gen = arc4random() / 2 + 1;
> ./ffs/ffs_vfsops.c:             ip->i_gen = arc4random() / 2 + 1;
> 
> It seems that this works when I don't use soft updates on the volumes. 

But it is very fragile, and those arc4random() calls are there for
good reasons. The prefcg one spreads newly allocated inodes over the
cylinder groups, and since AFAIK at least small files have their data
allocated near their inode, you influence data distribution as well.
This will very likely lead to lower speed after some usage. The
i_gen ones make sure a reused inode gets a fresh generation number,
so that clients holding stale NFS filehandles get ESTALE instead of
silently reading the wrong file.
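
If you keep testing it anyway, a crude way to check that the inode
numbers really line up (the server name is made up):

  ls -iR /data > /tmp/inodes.local
  ssh serverB ls -iR /data > /tmp/inodes.remote
  diff /tmp/inodes.local /tmp/inodes.remote && echo inodes match

Keep in mind this only compares inode numbers - the generation
numbers that go into the NFS filehandles are not visible from
userland.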

> So what I have now:
> - all of the machines have the above arc4random()s removed
> - all machines run the data file system in async mode (for speed and 
> because soft updates seems to mess up the constant inodes)
> - I have all the data in a subversion repository (better than a plain 
> "master image", because it's versioned, logged, etc)
> - I do updates in this way on the machines: mount -o rw,async /data; svn 
> up; mount -o ro /data
> 
> So far it seems to be OK, but I'm not yet finished with the testing.

Honestly, I wouldn't trust that very much.
Say you use two Fibre Channel disk stations, connected to both hosts.
Put the disk stations on different power supply rails.
Then use a solidly constructed single server and keep an identical
machine as a cold, or maybe already booted, standby.
Use the disk stations to mirror - one half on each station.
If the host dies you can easily move the service over to the other
machine by just importing the disks there.
If you do this with ZFS it even makes sure that the original host
will not automatically mount them, since the pool's host-id has been
changed to that of the other host (see the sketch below).
It is not a hot standby like your solution, but when it comes to
service failures I would expect it to outperform any hackish
solution.
I see so many people trying to do fancy failover with additional
complexity and additional failure points, instead of just increasing
the quality of their hardware.
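
The takeover itself is then a one-liner on the standby (the pool
name 'data' is just an example):

  zpool import -f data   # -f overrides the "pool was last accessed
                         # by another system" warning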

-- 
B.Walter <bernd@bwct.de> http://www.bwct.de
Modbus/TCP Ethernet I/O modules, ARM-based FreeBSD computers, and more.


