Date:      Wed, 20 Sep 2017 12:44:18 +0100
From:      Roger Pau Monné <roger.pau@citrix.com>
To:        Karl Pielorz <kpielorz_lst@tdx.co.uk>
Cc:        <freebsd-xen@freebsd.org>
Subject:   Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID:  <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>
In-Reply-To: <62BC29D8E1F6EA5C09759861@[10.12.30.106]>
References:  <62BC29D8E1F6EA5C09759861@[10.12.30.106]>

On Wed, Sep 20, 2017 at 11:35:26AM +0100, Karl Pielorz wrote:
> 
> Hi All,
> 
> We recently experienced an "unplanned storage" failover on our XenServer
> pool. The pool is 7.1 based (on certified HP kit), and runs a mix of FreeBSD
> (all 10.3 based except for a legacy 9.x VM) and a few Windows VMs;
> storage is provided by two Citrix-certified Synology storage boxes.
> 
> During the failover, Xen sees the storage paths go down and come up
> again (re-attaching when they are available again). Timing this, it takes
> around a minute, worst case.
> 
> The process killed 99% of our FreeBSD VMs :(
> 
> The earlier 9.x FreeBSD box survived, and all the Windows VMs survived.
> 
> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant of
> the I/O delays that occur during a storage failover?

Do you know whether the VMs saw the disks disconnecting and then
connecting again?

> I've enclosed some of the errors we observed below. I realise a full storage
> failover is a 'stressful time' for VMs, but the Windows VMs and the earlier
> FreeBSD version survived without issue. All the 10.3 boxes logged I/O
> errors, and then panicked / rebooted.
> 
> We've set up a test lab with the same kit, and can now replicate this at
> will (every time, most to all of the FreeBSD 10.x boxes panic and reboot, but
> Windows survives), so we can test any potential fixes.
> 
> So if anyone can suggest anything we can tweak to minimize the chances of
> this happening (i.e. make I/O more timeout-tolerant, or set larger
> timeouts?), that'd be great.

Hm, I have the feeling that part of the problem is that in-flight
requests are basically lost when a disconnect/reconnect happens.
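To make that hypothesis concrete, here is a minimal and purely hypothetical C model. It is not the FreeBSD blkfront/xbd code, and the names `struct frontend`, `submit`, `on_disconnect` and `on_reconnect` are invented for illustration. It contrasts a frontend that fails its in-flight requests when the backend disconnects with one that keeps them and resubmits them after reconnecting.

```c
#include <stdbool.h>

/* Hypothetical model only: not the FreeBSD xbd driver. */
#define MAX_INFLIGHT 8

struct request {
    int id;
    bool failed;   /* set when the request is completed with an I/O error */
};

struct frontend {
    struct request *inflight[MAX_INFLIGHT]; /* handed to backend, not yet completed */
    int n_inflight;
    struct request *requeue[MAX_INFLIGHT];  /* saved across a disconnect */
    int n_requeue;
};

/* Record a request as submitted to the (simulated) backend ring. */
bool submit(struct frontend *fe, struct request *r) {
    if (fe->n_inflight >= MAX_INFLIGHT)
        return false;
    fe->inflight[fe->n_inflight++] = r;
    return true;
}

/* Backend went away: either fail every in-flight request (the "lost
 * request" case suspected above) or save it for resubmission. */
void on_disconnect(struct frontend *fe, bool resubmit) {
    for (int i = 0; i < fe->n_inflight; i++) {
        struct request *r = fe->inflight[i];
        if (resubmit && fe->n_requeue < MAX_INFLIGHT)
            fe->requeue[fe->n_requeue++] = r;
        else
            r->failed = true;   /* would reach the filesystem as an I/O error */
    }
    fe->n_inflight = 0;
}

/* Backend is back: push saved requests onto the new ring.
 * Returns how many were resubmitted; any that do not fit are failed. */
int on_reconnect(struct frontend *fe) {
    int resent = 0;
    for (int i = 0; i < fe->n_requeue; i++) {
        if (submit(fe, fe->requeue[i]))
            resent++;
        else
            fe->requeue[i]->failed = true;
    }
    fe->n_requeue = 0;
    return resent;
}
```

If a guest's frontend behaves like the `resubmit == false` branch, every request outstanding during the roughly one-minute path outage would be completed with an error, which would fit the I/O errors and panics described above. Whether the 10.x driver actually does this is not established in this thread.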

Thanks, Roger.


