Message-Id: <491B5C2A.1F16.0013.0@ch.meggitt.com>
X-Mailer: Novell GroupWise Internet Agent 7.0.3 
Date: Wed, 12 Nov 2008 22:43:54 +0100
From: "Carole Macheret" <Carole.Macheret@ch.meggitt.com>
To: "Scott Long" <scottl@samsco.org>
References: <4874F53A0200001300130DE3@gw.vibro-meter.com>	<48A465B10200001300132295@gw.vibro-meter.com>
	<48A46586.1F16.0013.0@ch.meggitt.com><48A46586.1F16.0013.0@ch.meggitt.com>
	<48A4666C.6080008@samsco.org>
In-Reply-To: <48A4666C.6080008@samsco.org>
Cc: freebsd-scsi@freebsd.org, Roland Rothen <Roland.Rothen@ch.meggitt.com>
Subject: Re: g_vfs_done
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>

Hi Scott,

Thanks a lot for your advice. We have finally run some tests with the
following setting changed: kern.cam.da.retry_count=100 (in /etc/sysctl.conf).
Now the FreeBSD virtual machines no longer freeze after losing their
disks during the IpStor failover.
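
For reference, the persistent form of that setting is a single line in /etc/sysctl.conf. A minimal sketch, assuming the value 100 used in the tests above; the comment lines are illustrative and not part of the original configuration:

```conf
# /etc/sysctl.conf
# Retry failed da(4) I/O more times, so that a short disk outage
# (such as a storage failover) is ridden out instead of failing the I/O.
kern.cam.da.retry_count=100
```

The same value can be applied to a running system with sysctl(8), as suggested in the quoted message below, so no reboot is needed to try it.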

Best regards

Carole Macheret


>>> Scott Long <scottl@samsco.org> 14.08.2008 19:07 >>>
Carole Macheret wrote:
> Hello,
>
> We are using FreeBSD 7.0-RELEASE #1 running Squid and Zabbix on VMware
> ESX 3.0.2, and our VMware ESX servers access our SAN through an IpStor
> cluster (storage virtualization and mirroring).
>
> We have two storage arrays (EVA 6100), and the IpStor solution allows
> us to mirror disks on both EVAs.
>
> We have a problem with both the Zabbix and Squid FreeBSD virtual
> machines: when a virtual machine loses its disks (EVA controller
> reboot or IpStor cluster failover), we get several
> "g_vfs_done() : da1s1d[WRITE(offset=2312431234, length=12453)] error= 5"
> errors, and then the host freezes permanently. The disk loss lasts 1-5
> seconds. Windows virtual machines also freeze during the loss but then
> continue working. On Windows we had to specify a longer timeout for the
> local disk in the registry.
>
> Does anybody have an idea what could be tuned to avoid this problem?
>
> Attached you can find the dmesg and a screenshot of the g_vfs_done
> error...
>
> Thanks in advance for your help
>

So the virtual disks that the FreeBSD images are using in VMWare are on
an IpStor, and those periodically go away, yes?  What's probably
happening is that the VMWare host is triggering an event in the FreeBSD
client VM that essentially is making the virtual disks go away.  Inside
the FreeBSD VM, the SCSI layer tries to talk to the disk and gets a
selection timeout since the disk is no longer there.  It doesn't know
that this is a temporary state, and it declares the I/O as failed.  At
that point, the BSD VM gets upset and everything gets bad.

There is a property called kern.cam.da.default_timeout.  It's set to
60 seconds, but I don't think that it will help you in this case, since
it's likely that the i/o is failing because of a selection timeout, not
because the virtual disk is slow in completing the i/o.  The
kern.cam.da.retry_count property is set to 5, and changing it might help
since it might be able to force enough retries to give time for the
virtual disk to come back.  Try the following command on a running
system:

sysctl kern.cam.da.retry_count=100

This will allow for about 25 seconds worth of retries (a selection
attempt takes 250ms, so you'll get about 4 retries per second).  If
this doesn't work, try configuring VMWare to give you a serial console
that you can capture on the host, then set bootverbose during boot and
send me the log once the problem happens.
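
The retry arithmetic above can be checked with a few lines of shell. This is only a sketch: the helper names retry_window_ms and retries_for_outage are hypothetical, the 250 ms per selection attempt and the retry counts of 5 and 100 come from the text above, and the 5 second outage is the upper bound reported in the original question.

```shell
#!/bin/sh
# Hypothetical helpers for estimating how long retries can cover an outage.

# retry_window_ms RETRIES ATTEMPT_MS
# Total time (ms) spent if every one of RETRIES attempts times out
# after ATTEMPT_MS milliseconds.
retry_window_ms() {
    echo $(( $1 * $2 ))
}

# retries_for_outage OUTAGE_S ATTEMPT_MS
# Smallest number of attempts whose combined time covers an outage of
# OUTAGE_S seconds (integer ceiling division).
retries_for_outage() {
    echo $(( ($1 * 1000 + $2 - 1) / $2 ))
}

retry_window_ms 5 250      # window with a retry count of 5
retry_window_ms 100 250    # window with a retry count of 100
retries_for_outage 5 250   # attempts needed to cover a 5 second outage
```

Under these assumptions a retry count of 5 covers 1250 ms, which is shorter than the reported 1-5 second disk loss, while 100 covers 25000 ms. A 5 second outage needs at least 20 attempts, so 100 leaves a wide margin.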

Scott