Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 24 Apr 2010 09:30:31 +0200
From:      Pawel Jakub Dawidek <pjd@FreeBSD.org>
To:        Mikolaj Golub <to.my.trociny@gmail.com>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: HAST: primary might get stuck when there are connectivity problems with secondary
Message-ID:  <20100424073031.GD3067@garage.freebsd.pl>
In-Reply-To: <86k4rye33e.fsf@zhuzha.ua1>
References:  <86r5m9dvqf.fsf@zhuzha.ua1> <20100423062950.GD1670@garage.freebsd.pl> <86k4rye33e.fsf@zhuzha.ua1>

next in thread | previous in thread | raw e-mail | index | archive | help

--7mxbaLlpDEyR1+x6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Apr 23, 2010 at 04:00:05PM +0300, Mikolaj Golub wrote:
> This does not look for me just "confusing" :-). When I wrote that hast had
> gotten stuck I meant that all operations with FS above hast (ZFS in my ca=
se)
> had hanged and this lasted until hastd on the secondary was restarted.

Ok, I see.

> May be I miss something but I think the following happens when there is
> network outage:
>=20
> remote_send_thread():
>=20
> 1) the request is moved to recv queue;
>=20
> 2) the request is sent to secondary calling hast_proto_send();
>=20
> 3) hast_proto_send() returns OK although data are not sent yet (they are
> buffered by kernel) so the request is not removed from recv queue and not
> moved immediately to done queue as it would be in the case when
> hast_proto_send() returned error.
>=20
> The kernel is trying to send the request for some time, fails by timeout =
and
> close the connection.
>=20
> remote_recv_thread():
>=20
> 1) takes the request from recv queue;
>=20
> 2) waits for the respond from the secondary in hast_proto_recv_hdr() which
> will never come because the secondary has not received the request and has
> nothing to reply.

If secondary is not going to reply, hast_proto_recv_hdr() should
eventually timeout. On timeout, connection should be closed and this
requests (and all the others) should be moved to done queue.

It doesn't timeout at all or maybe the timeout is too long?=20

> So we have a request or several gotten stuck in in the recv queue. It loo=
ks
> like FS is waiting for the response on this request(s) sent by
> ggate_send_thread() and gets stuck.
>=20
> Below are logs for a similar example I made today:
>=20
> The last messages before "outage":
>=20
> Apr 23 14:10:20 hasta hastd: [storage] (primary) ggate_send: (0x28411c40)=
 Got request.
> Apr 23 14:10:20 hasta hastd: [storage] (primary) ggate_send: (0x28411c40)=
 Moving request to the free queue.
> Apr 23 14:10:20 hasta hastd: [storage] (primary) ggate_send: Taking reque=
st.
>=20
> During "outage":
>=20
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c80)=
 Request received from the kernel: WRITE(1925971968, 131072).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c80)=
 Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free =
request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0)=
 Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0)=
 Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0)=
 Request received from the kernel: WRITE(1925348352, 1024).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0)=
 Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free =
request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00)=
 Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00)=
 Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00)=
 Request received from the kernel: WRITE(922860544, 1024).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411c80)=
 Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411c80=
) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411c80=
) Moving request to the recv queue.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: Taking requ=
est.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411cc0=
) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411cc0=
) Moving request to the recv queue.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00)=
 Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free =
request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0)=
 Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0)=
 Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0)=
 Request received from the kernel: WRITE(1926103040, 131072).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0)=
 Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free =
request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80)=
 Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80)=
 Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking reque=
st.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411cc0)=
 Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking reque=
st.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411c00)=
 Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking reque=
st.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411bc0)=
 Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking reque=
st.
>=20
> During "outage" two requests 0x28411c80 and 0x28411cc0 was received,
> remote_send reported that it had succesfully sent the requests and moved =
them
> to recv queue. And the requests got stuck there

IMHO hast_proto_recv_hdr() should timeout, but I don't change default
timeout, which might be way too long.

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--7mxbaLlpDEyR1+x6
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iEYEARECAAYFAkvSnhcACgkQForvXbEpPzTFUwCgwflzhBfrmpHvFG3qogV/boc3
n4EAnRpciwMaxlokTftyxQ6YFBpPUWDb
=r0Nr
-----END PGP SIGNATURE-----

--7mxbaLlpDEyR1+x6--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100424073031.GD3067>