From owner-freebsd-xen@freebsd.org Wed Sep 20 10:41:09 2017
Date: Wed, 20 Sep 2017 11:35:26 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: freebsd-xen@freebsd.org
Subject: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <62BC29D8E1F6EA5C09759861@[10.12.30.106]>
List-Id: Discussion of the freebsd port to xen - implementation and usage

Hi All,

We recently experienced an unplanned storage failover on our XenServer pool.
The pool is 7.1 based (on certified HP kit) and runs a mix of FreeBSD VMs (all 10.3 based, except for one legacy 9.x VM) and a few Windows VMs. Storage is provided by two Citrix-certified Synology storage boxes.

During the failover, Xen sees the storage paths go down and come up again (re-attaching when they are available again). Timing this, it takes around a minute, worst case.

The process killed 99% of our FreeBSD VMs :(

The earlier 9.x FreeBSD box survived, and all the Windows VMs survived.

Is there some tunable we can set to make the 10.3 boxes more tolerant of the I/O delays that occur during a storage failover?

I've enclosed some of the errors we observed below. I realise a full storage failover is a stressful time for VMs - but the Windows VMs and the earlier FreeBSD version survived without issue. All the 10.3 boxes logged I/O errors, and then panicked and rebooted.

We've set up a test lab with the same kit, and can now replicate this at will (every time, most to all of the FreeBSD 10.x boxes panic and reboot, but Windows prevails) - so we can test any potential fixes.

So if anyone can suggest anything we can tweak to minimize the chances of this happening (i.e. make I/O more timeout tolerant, or set larger timeouts?) that'd be great.
Thanks,

-Karl

Errors we observed:

ada0: disk error cmd=write 11339752-11339767 status: ffffffff
ada0: disk error cmd=write
g_vfs_done():11340544-11340607gpt/root[WRITE(offset=4731097088, length=8192)] status: ffffffff error = 5

(repeated a couple of times with different values)

Machine then goes on to panic:

g_vfs_done():panic: softdep_setup_freeblocks: inode busy
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff8098e810 at kdb_backtrace+0x60
#1 0xffffffff809514e6 at vpanic+0x126
#2 0xffffffff809513b3 at panic+0x43
#3 0xffffffff80b9c685 at softdep_setup_freeblocks+0xaf5
#4 0xffffffff80b86bae at ffs_truncate+0x44e
#5 0xffffffff80bbec49 at ufs_setattr+0x769
#6 0xffffffff80e81891 at VOP_SETATTR_APV+0xa1
#7 0xffffffff80a053c5 at vn_truncate+0x165
#8 0xffffffff809ff236 at kern_openat+0x326
#9 0xffffffff80d56e6f at amd64_syscall+0x40f
#10 0xffffffff80d3c0cb at Xfast_syscall+0xfb

Another box also logged:

ada0: disk error cmd=read 9970080-9970082 status: ffffffff
g_vfs_done():gpt/root[READ(offset=4029825024, length=1536)]error = 5
vnode_pager_getpages: I/O read error
vm_fault: pager read error, pid 24219 (make)

And again, went on to panic shortly thereafter.

I had to hand-transcribe the above from screenshots / video, so apologies if any errors crept in.

I'm hoping there's just a magic sysctl / kernel option we can set to up the timeouts?
(if it is as simple as timeouts killing things)

From owner-freebsd-xen@freebsd.org Wed Sep 20 14:22:02 2017
Date: Wed, 20 Sep 2017 12:44:18 +0100
From: Roger Pau Monné <roger.pau@citrix.com>
To: Karl Pielorz
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>
In-Reply-To: <62BC29D8E1F6EA5C09759861@[10.12.30.106]>

On Wed, Sep 20, 2017 at 11:35:26AM +0100, Karl Pielorz wrote:
>
> Hi All,
>
> We recently experienced an "unplanned storage" fail over on our XenServer
> pool.
> The pool is 7.1 based (on certified HP kit), and runs a mix of FreeBSD
> (all 10.3 based except for a legacy 9.x VM) - and a few Windows VM's -
> storage is provided by two Citrix certified Synology storage boxes.
>
> During the fail over - Xen see's the storage paths go down, and come up
> again (re-attaching when they are available again). Timing this - it takes
> around a minute, worst case.
>
> The process killed 99% of our FreeBSD VM's :(
>
> The earlier 9.x FreeBSD box survived, and all the Windows VM's survived.
>
> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant of
> the I/O delays that occur during a storage fail over?

Do you know whether the VMs saw the disks disconnecting and then
connecting again?

> I've enclosed some of the error we observed below. I realise a full storage
> fail over is a 'stressful time' for VM's - but the Windows VM's, and earlier
> FreeBSD version survived without issue. All the 10.3 boxes logged I/O
> errors, and then panic'd / rebooted.
>
> We've setup a test lab with the same kit - and can now replicate this at
> will (every time most to all the FreeBSD 10.x boxes panic and reboot, but
> Windows prevails) - so we can test any potential fixes.
>
> So if anyone can suggest anything we can tweak to minimize the chances of
> this happening (i.e. make I/O more timeout tolerant, or set larger
> timeouts?) that'd be great.

Hm, I have the feeling that part of the problem is that in-flight
requests are basically lost when a disconnect/reconnect happens.

Thanks, Roger.
From owner-freebsd-xen@freebsd.org Wed Sep 20 14:54:26 2017
Date: Wed, 20 Sep 2017 15:54:20 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: Roger Pau Monné
Cc: freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
In-Reply-To: <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>

--On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné wrote:

>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant
>> of the I/O delays that occur during a storage fail over?
>
> Do you know whether the VMs saw the disks disconnecting and then
> connecting again?

I can't see any evidence the drives actually get 'disconnected' from the
VM's point of view. Plenty of I/O errors - but no "device destroyed" type
stuff.

I have seen that kind of error logged on our test kit - when we
deliberately failed non-HA storage - but I don't see it this time.

> Hm, I have the feeling that part of the problem is that in-flight
> requests are basically lost when a disconnect/reconnect happens.

So if a disconnect doesn't happen (as appears to be the case here) - is
there any tunable to set the I/O timeout?

'sysctl -a | grep timeout' finds things like:

kern.cam.ada.default_timeout=30

I might see if that has any effect (from memory - as I'm out of the office
now - it did seem to be about 30 seconds before the VMs started logging
I/O related errors to the console).

As it's a pure test setup - I can try adjusting this without fear of
breaking anything :) Though I'm open to other suggestions...

fwiw - Whose responsibility is it to re-send lost "in flight" data? E.g.
if a write is 'in flight' when an I/O error occurs in the lower layers of
XenServer, is it XenServer's responsibility to retry that before giving up,
or does it just push the error straight back to the VM, expecting the VM
to retry it? [or a bit of both?] - just curious.

-Karl

From owner-freebsd-xen@freebsd.org Wed Sep 20 15:30:03 2017
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
To: Karl Pielorz, Roger Pau Monné
Cc: freebsd-xen@freebsd.org
From: Miroslav Lachman <000.fbsd@quip.cz>
Message-ID: <59C287E2.1030500@quip.cz>
Date: Wed, 20 Sep 2017 17:23:14 +0200

Karl Pielorz wrote on 2017/09/20 16:54:

> --On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné wrote:
>
>>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant
>>> of the I/O delays that occur during a storage fail over?
>>
>> Do you know whether the VMs saw the disks disconnecting and then
>> connecting again?
>
> I can't see any evidence the drives actually get 'disconnected' from the
> VM's point of view. Plenty of I/O errors - but no "device destroyed"
> type stuff.
>
> I have seen that kind of error logged on our test kit - when
> deliberately failed non-HA storage, but I don't see it this time.
>
>> Hm, I have the feeling that part of the problem is that in-flight
>> requests are basically lost when a disconnect/reconnect happens.
>
> So if a disconnect doesn't happen (as it appears it isn't) - is there
> any tunable to set the I/O timeout?
>
> 'sysctl -a | grep timeout' finds things like:
>
> kern.cam.ada.default_timeout=30

Yes, you can try to set kern.cam.ada.default_timeout to 60 or more, but it
can have downsides too.
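A minimal sketch of applying that suggestion (the values are illustrative only, not recommendations, and the sysctl calls are guarded so they only run on FreeBSD; later messages in this thread question whether these CAM tunables even apply to Xen-presented disks):

```shell
#!/bin/sh
# Illustrative only: raise the CAM ATA timeout and retry count at runtime,
# then persist them for future boots. Whether these tunables apply to the
# Xen-presented 'ada' disks at all is exactly what the thread is debating.
target_timeout=60   # kern.cam.ada.default_timeout (seconds)
target_retries=8    # kern.cam.ada.retry_count
if [ "$(uname -s)" = "FreeBSD" ]; then
    sysctl kern.cam.ada.default_timeout=$target_timeout
    sysctl kern.cam.ada.retry_count=$target_retries
    # persist across reboots:
    printf 'kern.cam.ada.default_timeout=%s\nkern.cam.ada.retry_count=%s\n' \
        "$target_timeout" "$target_retries" >> /etc/sysctl.conf
fi
```

As noted, raising the timeout trades faster failure detection for failover tolerance, so it can have downsides too.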
Miroslav Lachman

From owner-freebsd-xen@freebsd.org Wed Sep 20 18:15:09 2017
From: "Rodney W. Grimes" <freebsd-rwg@pdx.rh.CN85.dnsmgr.net>
Message-Id: <201709201815.v8KIF7Gi089958@pdx.rh.CN85.dnsmgr.net>
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
In-Reply-To: <62BC29D8E1F6EA5C09759861@[10.12.30.106]>
To: Karl Pielorz
Date: Wed, 20 Sep 2017 11:15:07 -0700 (PDT)
Cc: freebsd-xen@freebsd.org

> Hi All,
>
> We recently experienced an "unplanned storage" fail over on our XenServer
> pool.
> The pool is 7.1 based (on certified HP kit), and runs a mix of
> FreeBSD (all 10.3 based except for a legacy 9.x VM) - and a few Windows
> VM's - storage is provided by two Citrix certified Synology storage boxes.
>
> During the fail over - Xen see's the storage paths go down, and come up
> again (re-attaching when they are available again). Timing this - it takes
> around a minute, worst case.
>
> The process killed 99% of our FreeBSD VM's :(
>
> The earlier 9.x FreeBSD box survived, and all the Windows VM's survived.
>
> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant of
> the I/O delays that occur during a storage fail over?
>
> I've enclosed some of the error we observed below. I realise a full storage
> fail over is a 'stressful time' for VM's - but the Windows VM's, and
> earlier FreeBSD version survived without issue. All the 10.3 boxes logged
> I/O errors, and then panic'd / rebooted.
>
> We've setup a test lab with the same kit - and can now replicate this at
> will (every time most to all the FreeBSD 10.x boxes panic and reboot, but
> Windows prevails) - so we can test any potential fixes.
>
> So if anyone can suggest anything we can tweak to minimize the chances of
> this happening (i.e. make I/O more timeout tolerant, or set larger
> timeouts?) that'd be great.

As you found one of these, let me point out the pair of them:

kern.cam.ada.default_timeout: 30
kern.cam.ada.retry_count: 4

Rather than increasing default_timeout, you might try increasing
retry_count. Though it would seem that the default settings should have
allowed for a 2 minute failure window, it may be that these are not
working as I expect in this situation.

...

> Errors we observed:
>
> ada0: disk error cmd=write 11339752-11339767 status: ffffffff
> ada0: disk error cmd=write

Did you actually get this 4 times, then it fell through to the next error?
There should be some retry counts in here some place, counting up to 4;
then cam/ada should give up and pass the error up the stack.

> g_vfs_done():11340544-11340607gpt/root[WRITE(offset=4731097088,
> length=8192)] status: ffffffff error = 5
> (repeated a couple of times with different values)
>
> Machine then goes on to panic:

Ah, okay, so it is repeating... these messages should be 30 seconds apart,
there should be exactly 4 of them, then you get the panic. If that is the
case, try cranking kern.cam.ada.retry_count up and see if that resolves
your issue.

> g_vfs_done():panic: softdep_setup_freeblocks: inode busy
> cpuid = 0
> KDB: stack backtrace:
> #0 0xffffffff8098e810 at kdb_backtrace+0x60
> #1 0xffffffff809514e6 at vpanic+0x126
> #2 0xffffffff809513b3 at panic+0x43
> #3 0xffffffff80b9c685 at softdep_setup_freeblocks+0xaf5
> #4 0xffffffff80b86bae at ffs_truncate+0x44e
> #5 0xffffffff80bbec49 at ufs_setattr+0x769
> #6 0xffffffff80e81891 at VOP_SETATTR_APV+0xa1
> #7 0xffffffff80a053c5 at vn_trunacte+0x165
> #8 0xffffffff809ff236 at kern_openat+0x326
> #9 0xffffffff80d56e6f at amd64_syscall+0x40f
> #10 0xffffffff80d3c0cb at Xfast_syscall+0xfb
>
> Another box also logged:
>
> ada0: disk error cmd=read 9970080-9970082 status: ffffffff
> g_vfs_done():gpt/root[READ(offset=4029825024, length=1536)]error = 5
> vnode_pager_getpages: I/O read error
> vm_fault: pager read error, pid 24219 (make)
>
> And again, went on to panic shortly thereafter.
>
> I had to hand transcribe the above from screen shots / video, so apologies
> if any errors crept in.
>
> I'm hoping there's just a magic sysctl / kernel option we can set to up the
> timeouts? (if it is as simple as timeouts killing things)

Yes, FreeBSD does not live long when its disk drive goes away... 2.5
minutes to panic in almost all cases of a drive failure.
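Rodney's arithmetic can be sketched as a quick calculation (a simplified model that assumes each retry waits out the full timeout before failing):

```shell
#!/bin/sh
# Rough worst-case outage a guest should ride out before CAM gives up,
# under the simplifying assumption that every retry times out fully:
#   window = default_timeout * retry_count
timeout_s=${1:-30}   # kern.cam.ada.default_timeout (seconds)
retries=${2:-4}      # kern.cam.ada.retry_count
window=$((timeout_s * retries))
echo "worst-case tolerated outage: ${window}s"
```

With the stock values this gives 120 seconds, matching the roughly two-minute window mentioned above; a failover taking around a minute should, in theory, fit inside it, which is why the observed panics suggest these tunables are not being honoured here.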
--
Rod Grimes                    rgrimes@freebsd.org

From owner-freebsd-xen@freebsd.org Thu Sep 21 11:33:40 2017
Date: Thu, 21 Sep 2017 12:33:24 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: "Rodney W. Grimes"
Cc: freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <1913BD3E6623F2384770C39E@[10.12.30.106]>
In-Reply-To: <201709201815.v8KIF7Gi089958@pdx.rh.CN85.dnsmgr.net>

--On 20 September 2017 11:15 -0700 "Rodney W.
Grimes" wrote:

> As you found one of these let me point out the pair of them:
> kern.cam.ada.default_timeout: 30
> kern.cam.ada.retry_count: 4

Adjusting these doesn't seem to make any difference at all.

All the VMs - the control one, running defaults, and the 3 others (one
running longer timeouts, one running more retries, and one running both
longer timeouts and retries) - start throwing I/O errors on the console at
the same time. Regardless of whether the timeout is set to the default 30
seconds or extended out to 120, the I/O errors pop up at the same time. It
looks like the settings are either ignored, or not applicable to this
scenario.

Of particular concern: if I adjust the 'timeout' value from 30 to 100
seconds and time how long the first I/O error takes to appear on the
console, it's still ~30 seconds.

I'm going to re-setup the test VMs with a 'stable' boot disk (one that
won't go away during the switch) to give me something to log to - I should
be able to work out the timings involved then, and see whether the first
I/O error really does surface after 30 seconds or not.

If the timeout is set to, say, 100 seconds - I shouldn't see any console
errors until then, should I? Unless some other part of the storage stack
is still timing out first at 30 seconds?
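The logging plan described above could look something like this (a hypothetical probe; the mount point in the usage comment is an assumption):

```shell
#!/bin/sh
# Hypothetical lab probe: each call attempts one timestamped write to a
# file on the disk under test. Run it in a loop during a failover; the gap
# between the last OK stamp and the first ERR stamp measures the real time
# to first I/O error, independently of console messages.
probe_once() {
    log=$1
    if printf 'OK %s\n' "$(date +%s)" >> "$log" 2>/dev/null; then
        return 0
    fi
    printf 'ERR %s\n' "$(date +%s)"
    return 1
}

# assumed usage, against a filesystem backed by the disk under test:
#   while :; do probe_once /mnt/testdisk/probe.log; sleep 1; done
```

Logging the ERR lines to the stable boot disk (rather than the console) would give timings that survive a subsequent panic.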
-Karl

From owner-freebsd-xen@freebsd.org Thu Sep 21 12:05:33 2017
Date: Thu, 21 Sep 2017 14:04:08 +0200
From: rainer@ultra-secure.de
To: Karl Pielorz
Cc: "Rodney W. Grimes", freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <5afd1ad53171583603bb8518dfbdd51b@ultra-secure.de>
In-Reply-To: <1913BD3E6623F2384770C39E@[10.12.30.106]>

On 2017-09-21 13:33, Karl Pielorz wrote:

> --On 20 September 2017 11:15 -0700 "Rodney W. Grimes" wrote:
>
>> As you found one of these let me point out the pair of them:
>> kern.cam.ada.default_timeout: 30
>> kern.cam.ada.retry_count: 4
>
> Adjusting these doesn't seem to make any difference at all.

I've already been wondering whether the disks presented by Xen(Server) are
really CAM disks. They certainly don't show up in camcontrol devlist. And
if they don't show up there, why should any CAM timeouts apply?

BTW: storage failures also kill various Linux hosts. They usually turn
their filesystem read-only, and then you've got to reboot anyway.
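The camcontrol check could be wrapped up as a small helper (hypothetical; the device list is passed in as captured text so the logic can be exercised without real hardware). The point is that a Xen blkfront disk can apparently be presented under an ada-style name without being a CAM device, so the name alone proves nothing:

```shell
#!/bin/sh
# Decide whether a named disk is really CAM-attached by looking for it in
# captured `camcontrol devlist` output, rather than trusting the ada0-style
# name. Presumably only genuinely CAM-attached disks honour the
# kern.cam.ada.* tunables discussed in this thread.
is_cam_disk() {
    disk=$1
    devlist=$2   # output of: camcontrol devlist
    case $devlist in
        *"($disk,"*|*"($disk)"*|*",$disk)"*|*",$disk,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, a devlist line ending in "(pass0,ada0)" would classify ada0 as CAM-attached, while an empty devlist (as Karl reports seeing) would not.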
From owner-freebsd-xen@freebsd.org Thu Sep 21 12:40:16 2017
Date: Thu, 21 Sep 2017 13:40:07 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: rainer@ultra-secure.de
Cc: "Rodney W. Grimes", freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <7D39973713EC8FB944D4BA4F@[10.12.30.106]>
In-Reply-To: <5afd1ad53171583603bb8518dfbdd51b@ultra-secure.de>

--On 21 September 2017 14:04 +0200 rainer@ultra-secure.de wrote:

> I asked myself already if the disks from Xen(Server) are really CAM-disks.
>
> They certainly don't show up with camcontrol devlist.

So they don't. I presumed they were CAM, as they're presented as 'ada0'.

> If they don't show-up there, why should any cam timeouts apply?

It appears they don't :) (at least, so far).

> BTW: storage-failures also kill various Linux hosts.
> They usually turn their filesystem into read-only mode and then you've
> got to reboot anyway.

Yes, I know - it's a bit of an upheaval to cope with storage failover -
annoyingly, the Windows boxes (though they go 'comatose' while it's
happening) all seem to survive.

I could cope with a few VMs rebooting - but to see so many just fold and
panic adds a lot of "insult to injury" at failover time :(

[And I know, if I/O is unavailable you're going to be queuing up a whole
'world of pain' anyway for when it returns - listen queues, pending disk
I/O, hung processes waiting for I/O, etc.] - but to have a fighting chance
of unwinding it all when I/O recovers would be good.
-Karl

From owner-freebsd-xen@freebsd.org Thu Sep 21 14:23:24 2017
From: "Rodney W. Grimes" <freebsd-rwg@pdx.rh.CN85.dnsmgr.net>
Message-Id: <201709211423.v8LENKvN094067@pdx.rh.CN85.dnsmgr.net>
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
In-Reply-To: <7D39973713EC8FB944D4BA4F@[10.12.30.106]>
To: Karl Pielorz
Date: Thu, 21 Sep 2017 07:23:20 -0700 (PDT)
Cc: rainer@ultra-secure.de, freebsd-xen@freebsd.org

> --On 21 September 2017 14:04 +0200 rainer@ultra-secure.de wrote:
>
> > I asked myself already if the disks from Xen(Server) are really CAM-disks.
> > They certainly don't show up with camcontrol devlist.
>
> So they don't. I presumed they were cam, as they're presented as 'ada0'.

Ok, we need to sort that out for certain to get much further.

What are you running for Dom0?

Did you do the sysctls in Dom0 or in the guest? To be effective I would
think they would need to be run in the guest, but if Dom0 is timing out
and returning an I/O error then they will have to be adjusted there first.

> > If they don't show-up there, why should any cam timeouts apply?
>
> It appears they don't :) (at least, so far).

Are these timeouts coming from Dom0 or from a VM in a DomU?

> > BTW: storage-failures also kill various Linux hosts.
> > They usually turn their filesystem into read-only mode and then you've
> > got to reboot anyway.
>
> Yes, I know - it's a bit of an upheaval to cope with storage fail over -
> annoyingly the windows boxes (though they go 'comatose' while it's
> happening) all seem to survive.

Windows has horribly long timeouts and large retry counts; worse, it
doesn't warn the user that it is having issues other than via event logs,
and things usually go to the state of catastrophic drive failure before
the user ever sees an error.

> I could cope with a few VM's rebooting - but to see so many just fold and
> panic, adds a lot of "insult to injury" at fail over time :(
>
> [And I know, if I/O is unavailable you're going to be queuing up a whole
> 'world of pain' anyway for when it returns, such as listen queues, pending
> disk I/O, hung processes waiting for I/O etc. etc.) - but to have a
> -Karl

> _______________________________________________
> freebsd-xen@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-xen
> To unsubscribe, send any mail to "freebsd-xen-unsubscribe@freebsd.org"

-- 
Rod Grimes                                           rgrimes@freebsd.org

From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
Date: Thu, 21 Sep 2017 15:49:57 +0100
To: "Rodney W. Grimes"
Cc: freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
In-Reply-To: <201709211423.v8LENKvN094067@pdx.rh.CN85.dnsmgr.net>
References: <201709211423.v8LENKvN094067@pdx.rh.CN85.dnsmgr.net>

--On 21 September 2017 07:23 -0700 "Rodney W. Grimes" wrote:

>> So they don't. I presumed they were CAM, as they're presented as 'ada0'.
>
> Ok, need to sort that out for certain to get much further.
>
> What are you running for Dom0?

XenServer 7.1 - i.e. the official ISO distribution from www.xenserver.org.

> Did you do the sysctl's in Dom0 or in the guest?

In the guest. I don't have access to the equivalent in Dom0 (or, rather,
shouldn't have - it's the official installation, i.e. black-boxed).

> To be effective I would think they would need to be run in the guest, but
> if Dom0 is timing out and returning an I/O error then they will have to
> be adjusted there first.

Dom0 (i.e. XenServer) grumbles about paths going down and shows some I/O
errors - which get retried - but it doesn't invalidate the storage. As soon
as the paths are available again, you can see it re-attach to them.

> Are these timeouts coming from Dom0 or from a VM in a DomU?

DomU - as above, Dom0 grumbles but generally seems OK about things. It
doesn't do anything silly like invalidate the VM's disks.

> Windows has horribly long timeouts and large retry counts; worse, it
> doesn't warn the user that it is having issues other than via event logs,
> so things usually go all the way to catastrophic drive failure before the
> user ever sees an error.
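For anyone following along, the check and the CAM tunables under discussion
can be inspected from inside a guest along these lines (a sketch - the
sysctl names assume a stock FreeBSD 10.x GENERIC kernel):

```shell
# List CAM-attached disks. Xen blkfront (xbd) devices typically do
# not appear here, even when the guest sees a disk named 'ada0'.
camcontrol devlist

# CAM ada(4) timeout/retry tunables - only meaningful if the disks
# really are CAM devices:
sysctl kern.cam.ada.default_timeout
sysctl kern.cam.ada.retry_count
```

If `camcontrol devlist` comes back empty, the disks are not behind CAM and
these tunables would have no effect on them.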
I can believe that - I've seen the figure of 60 seconds bandied around for
Windows (as opposed to 30 seconds for Linux / FreeBSD). Sadly, making
FreeBSD use a similarly long timeout (at least just to test) may be what
fixes the issue.

-Karl

From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
Date: Fri, 22 Sep 2017 10:13:59 +0100
To: "Rodney W. Grimes"
Cc: freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <29F6204C1998F74F077A94D9@[10.12.30.106]>
References: <201709211423.v8LENKvN094067@pdx.rh.CN85.dnsmgr.net>

--On 21 September 2017 15:49 +0100 Karl Pielorz wrote:

>> Are these timeouts coming from Dom0 or from a VM in a DomU?
>
> DomU - as above, Dom0 grumbles but generally seems OK about things. It
> doesn't do anything silly like invalidate the VM's disks.

I've chased this down in the code. Having briefly looked at
blkfront/blkback, I can see all the mechanisms in place for performing I/O,
and the callback that fires when an I/O fails - but I cannot see any
timeouts set anywhere in that code.

It looks like, for the purposes of xbd, I/O requests are just gathered up,
processed, and then fired off to XenServer (i.e. upstream). If they fail,
callbacks fire and action is taken. But nowhere can I see any timeouts
either specified, or specifiable, by FreeBSD - nor can I see (certainly at
that level) any I/O retries in that code.

So:

- Timeouts may be set by Xen (i.e. outside of FreeBSD's scope).
- I/O may be retried at 'higher levels' than blkfront/blkback - but I
  can't see where.

It may simply be that I/O from FreeBSD through XenServer is a 'fire and
forget' process, where FreeBSD has no control over timeouts and currently
has no code (at that level) to perform retries. I'd need to figure out what
sits above blkfront/blkback, and whether that's likely to do any retries.
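A quick way to double-check that reading of the code (the paths are an
assumption - they presume a stock FreeBSD 10.3 source tree installed under
/usr/src):

```shell
# Look for any timeout handling in the blkfront driver itself:
grep -in timeout /usr/src/sys/dev/xen/blkfront/blkfront.c
grep -in timeout /usr/src/sys/dev/xen/blkfront/block.h

# Contrast with CAM's ada(4) driver, which does carry timeout and
# retry handling:
grep -in timeout /usr/src/sys/cam/ata/ata_da.c
```

If the first two greps turn up nothing relevant, that supports the 'fire
and forget' theory at the blkfront level.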
It's certainly not CAM running the storage - so those timeout/retry sysctl
values are completely irrelevant here. More study needed, and maybe a quick
post to -hackers to see what lies above blkfront/blkback.

-Kp