Date:      Wed, 21 Apr 2010 12:02:16 +0300
From:      Mikolaj Golub <to.my.trociny@gmail.com>
To:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   HAST: primary might get stuck when there are connectivity problems with secondary
Message-ID:  <86r5m9dvqf.fsf@zhuzha.ua1>

--=-=-=

Hi,

I can make the HAST primary get stuck by making the secondary unreachable
(network packets are lost) for some period of time. I run HAST on VirtualBox
hosts, so to emulate a network outage I just change the bridged interface in
the VirtualBox configuration.

Below are details for one example.

On the primary before the outage we have:

sockstat:
root     hastd      1571  10 tcp4   172.20.66.201:41841   172.20.66.202:8457
root     hastd      1571  11 tcp4   172.20.66.201:57596   172.20.66.202:8457

During the outage and after it sockstat shows the same, while netstat shows:

tcp4       0      0 172.20.66.201.57596    172.20.66.202.8457     ESTABLISHED
tcp4       0   8307 172.20.66.201.41841    172.20.66.202.8457     ESTABLISHED

(note the non-zero value for the send queue) and then later:

tcp4       0      0 172.20.66.201.57596    172.20.66.202.8457     ESTABLISHED
tcp4       0      0 172.20.66.201.41841    172.20.66.202.8457     CLOSED
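
The check done by eye on the netstat output above can be scripted. The
following is only a sketch: the helper name is made up for illustration, and
it assumes the six-column layout shown above (protocol, Recv-Q, Send-Q, local
address, foreign address, state).

```python
def suspicious_connections(netstat_text):
    """Return (local, remote, send_q, state) for TCP rows that have
    unsent data queued or that are already in the CLOSED state."""
    flagged = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto Recv-Q Send-Q local foreign state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        _proto, _recv_q, send_q, local, remote, state = fields
        if int(send_q) > 0 or state == "CLOSED":
            flagged.append((local, remote, int(send_q), state))
    return flagged


# The two rows seen on the primary during the outage.
SAMPLE = """\
tcp4       0      0 172.20.66.201.57596    172.20.66.202.8457     ESTABLISHED
tcp4       0   8307 172.20.66.201.41841    172.20.66.202.8457     ESTABLISHED
"""

if __name__ == "__main__":
    for row in suspicious_connections(SAMPLE):
        print(row)
```

For the sample rows this flags only the 41841 connection, the one with 8307
bytes still sitting in the send queue.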

Restoring the network after this changes nothing: the primary stays stuck.
There are no new messages in the log, and "dirty" in the status output does
not change:

[root@hasta ~]# hastctl status
storage:
  role: primary
  provname: storage
  localpath: /dev/ad4
  extentsize: 2097152
  keepdirty: 64
  remoteaddr: 172.20.66.202
  replication: memsync
  status: complete
  dirty: 2097152 bytes

On the secondary, netstat shows the following the whole time:

tcp4       0      0 172.20.66.202.8457     172.20.66.201.57596    ESTABLISHED
tcp4       0      0 172.20.66.202.8457     172.20.66.201.41841    ESTABLISHED

The last messages in the primary's log:

Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Request received from the kernel: READ(13565952, 65536).
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Moving request to the send queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Got free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Waiting for request from the kernel.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Moving request to the done queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Moving request to the free queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Request received from the kernel: READ(1812529152, 65536).
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Moving request to the send queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Got free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Waiting for request from the kernel.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Moving request to the done queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Moving request to the free queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: Taking request.

The backtrace of the stuck hastd is attached.

I interpret this in the following way. Although the network is down,
hast_proto_send() in remote_send_thread() returns success (the sent data are
stored in the kernel buffer). The kernel then tries to send the data,
eventually fails after a timeout, and closes the socket. hastd is not aware of
this: remote_send_thread() is blocked in "Taking request" at this time, and the
sync thread is waiting for a status reply from the secondary about the sent
data, but the secondary does not send one because it never received any data.
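
The first step of this interpretation (a send that "succeeds" only because the
kernel buffered the data) is easy to reproduce outside hastd. This is a
minimal sketch, not hastd code: a local socket pair stands in for the TCP
connection, the peer never reads, and the function name and the timeout value
are made up for illustration. It also shows that a receive timeout would turn
the silent hang into a detectable error.

```python
import socket


def send_without_reader(payload, wait_s=0.2):
    """Send to a peer that never reads, then wait for a reply with a timeout.

    Returns (bytes_handed_to_kernel, reply_wait_timed_out).
    """
    # Local stand-in for the primary<->secondary link (assumption).
    sender, peer = socket.socketpair()
    try:
        # sendall() returns once the kernel has accepted the data into the
        # socket buffer; it says nothing about the peer application.
        sender.sendall(payload)

        # Bound the wait for the peer's status reply instead of blocking
        # forever in recv().
        sender.settimeout(wait_s)
        try:
            sender.recv(1)
            timed_out = False
        except socket.timeout:
            timed_out = True
        return len(payload), timed_out
    finally:
        sender.close()
        peer.close()


if __name__ == "__main__":
    print(send_without_reader(b"x" * 1024))
```

In the real outage the data additionally never leaves the primary's send
queue, but the effect on the sender is the same: the send call reports
success and only the wait for a reply can reveal the problem.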

Restarting hastd on the secondary usually helps. A workaround is to set
net.inet.tcp.keepidle to some small value (e.g. 300 sec) on the
secondary. Then the secondary will notice much earlier that the peer has
closed the connection and will restart the worker itself:

Apr 21 11:52:21 hastb hastd: [storage] (secondary) Unable to receive request header: Connection reset by peer.
Apr 21 11:52:21 hastb hastd: [storage] (secondary) Worker process (pid=1398) exited ungracefully: status=19200.
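
The "status=19200" value in the last message can be decoded by hand. The
sketch below assumes the number is a raw wait(2) status word in the
traditional layout (terminating signal in the low 7 bits, exit code in the
next byte); the helper name is made up for illustration.

```python
def decode_wait_status(status):
    """Split a traditional wait(2) status word into
    (exited_normally, exit_code, signal)."""
    signal = status & 0x7F            # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF  # next byte: exit code
    return signal == 0, exit_code, signal


if __name__ == "__main__":
    print(decode_wait_status(19200))  # 19200 == 0x4B00
```

Under that assumption 19200 is a normal exit with code 75, which is
EX_TEMPFAIL in sysexits.h, so the worker exited on its own after the
connection reset rather than being killed by a signal.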

-- 
Mikolaj Golub


--=-=-=
Content-Type: application/octet-stream
Content-Disposition: attachment; filename=bt.log
Content-Transfer-Encoding: base64

VGhyZWFkIDggKFRocmVhZCAyODQwNDE0MCAoTFdQIDEwMDA3OCkpOgojMCAgMHgyODIzZWRkNyBp
biBfX2Vycm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMSAgMHgyODIzZTliOCBpbiBfX2Vy
cm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMiAgMHgyODRjMzUyMCBpbiA/PyAoKQojMyAg
MHgwMDAwMDAwOCBpbiA/PyAoKQojNCAgMHgwMDAwMDAwMSBpbiA/PyAoKQojNSAgMHgyODRjMzUw
MCBpbiA/PyAoKQojNiAgMHgwMDAwMDAwMCBpbiA/PyAoKQojNyAgMHgyODBhNWEwMCBpbiA/PyAo
KQojOCAgMHhiZmJmZTk4MCBpbiA/PyAoKQojOSAgMHgyODIzZDMxZiBpbiBwdGhyZWFkX3NldGNh
bmNlbHN0YXRlICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMTAgMHgyODIzY2JiZSBpbiBwdGhy
ZWFkX2NvbmRfc2lnbmFsICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMTEgMHgwODA1OGU3OCBp
biBjdl93YWl0IChjdj0weDgwNjdlMmMsIGxvY2s9MHg4MDY3ZTI4KSBhdCBzeW5jaC5oOjEyNQoj
MTIgMHgwODA1Yjc1ZSBpbiBjdl90aW1lZHdhaXQgKGN2PTB4ODA2N2UyYywgbG9jaz0weDgwNjdl
MjgsIHRpbWVvdXQ9MCkgYXQgc3luY2guaDoxMzUKIzEzIDB4MDgwNWI3MmMgaW4gZ3VhcmRfdGhy
ZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vzci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6MTc4
NwojMTQgMHgwODA1ODIwNiBpbiBoYXN0ZF9wcmltYXJ5IChyZXM9MHgyODRjYWIwMCkgYXQgL3Vz
ci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6NzczCiMxNSAweDA4MDRjNGU4IGluIGNvbnRyb2xf
c2V0X3JvbGUgKGNmZz0weDgwNjY1MDAsIG52b3V0PTB4Mjg0ZWIwYjAsIHJvbGU9MiAnXDAwMics
IHJlcz0weDI4NGNhYjAwLCAKICAgIG5hbWU9MHgyODQ4MTQ0MiAic3RvcmFnZSIsIG5vPTApIGF0
IC91c3Ivc3JjL3NiaW4vaGFzdGQvY29udHJvbC5jOjExNAojMTYgMHgwODA0Y2QwMSBpbiBjb250
cm9sX2hhbmRsZSAoY2ZnPTB4ODA2NjUwMCkgYXQgL3Vzci9zcmMvc2Jpbi9oYXN0ZC9jb250cm9s
LmM6MzMyCiMxNyAweDA4MDRmMDdjIGluIG1haW5fbG9vcCAoKSBhdCAvdXNyL3NyYy9zYmluL2hh
c3RkL2hhc3RkLmM6NDI1CiMxOCAweDA4MDRmM2U4IGluIG1haW4gKGFyZ2M9MCwgYXJndj0weGJm
YmZlZGE0KSBhdCAvdXNyL3NyYy9zYmluL2hhc3RkL2hhc3RkLmM6NTIxCgpUaHJlYWQgNyAoVGhy
ZWFkIDI4NDA0MjgwIChMV1AgMTAwMTExKSk6CiMwICAweDI4MzQ0NzczIGluIGlvY3RsICgpIGZy
b20gL2xpYi9saWJjLnNvLjcKIzEgIDB4MDgwNTg4YzQgaW4gZ2dhdGVfcmVjdl90aHJlYWQgKGFy
Zz0weDI4NGNhYjAwKSBhdCAvdXNyL3NyYy9zYmluL2hhc3RkL3ByaW1hcnkuYzo4OTQKIzIgIDB4
MjgyMzQyOGYgaW4gcHRocmVhZF9nZXRwcmlvICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMyAg
MHgwMDAwMDAwMCBpbiA/PyAoKQoKVGhyZWFkIDYgKFRocmVhZCAyODQwNDNjMCAoTFdQIDEwMDEx
MikpOgojMCAgMHgyODIzZWRkNyBpbiBfX2Vycm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwoj
MSAgMHgyODIzZTliOCBpbiBfX2Vycm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMiAgMHgy
ODRjMzJhMCBpbiA/PyAoKQojMyAgMHgwMDAwMDAwOCBpbiA/PyAoKQojNCAgMHgwMDAwMDAwMSBp
biA/PyAoKQojNSAgMHgyODRjMzI4MCBpbiA/PyAoKQojNiAgMHgwMDAwMDAwMCBpbiA/PyAoKQoj
NyAgMHhiZjhmZGU5NCBpbiA/PyAoKQojOCAgMHgyODIzOGRiNSBpbiBwdGhyZWFkX3J3bG9ja191
bmxvY2sgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiM5ICAweDI4MjNjYmJlIGluIHB0aHJlYWRf
Y29uZF9zaWduYWwgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMCAweDA4MDU4ZTc4IGluIGN2
X3dhaXQgKGN2PTB4Mjg0YzkwODAsIGxvY2s9MHgyODRjOTA3OCkgYXQgc3luY2guaDoxMjUKIzEx
IDB4MDgwNThmMzcgaW4gbG9jYWxfc2VuZF90aHJlYWQgKGFyZz0weDI4NGNhYjAwKSBhdCAvdXNy
L3NyYy9zYmluL2hhc3RkL3ByaW1hcnkuYzoxMDMyCiMxMiAweDI4MjM0MjhmIGluIHB0aHJlYWRf
Z2V0cHJpbyAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzEzIDB4MDAwMDAwMDAgaW4gPz8gKCkK
ClRocmVhZCA1IChUaHJlYWQgMjg0MDQ1MDAgKExXUCAxMDAxMTMpKToKIzAgIDB4MjgyM2VkZDcg
aW4gX19lcnJvciAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzEgIDB4MjgyM2U5YjggaW4gX19l
cnJvciAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzIgIDB4Mjg0YzMzYTAgaW4gPz8gKCkKIzMg
IDB4MDAwMDAwMDggaW4gPz8gKCkKIzQgIDB4MDAwMDAwMDEgaW4gPz8gKCkKIzUgIDB4Mjg0YzMz
ODAgaW4gPz8gKCkKIzYgIDB4MDAwMDAwMDAgaW4gPz8gKCkKIzcgIDB4MDAwMDAwMDAgaW4gPz8g
KCkKIzggIDB4ZDJlZGUzODkgaW4gPz8gKCkKIzkgIDB4MjgyM2QzMWYgaW4gcHRocmVhZF9zZXRj
YW5jZWxzdGF0ZSAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzEwIDB4MjgyM2NiYmUgaW4gcHRo
cmVhZF9jb25kX3NpZ25hbCAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzExIDB4MDgwNThlNzgg
aW4gY3Zfd2FpdCAoY3Y9MHgyODRjOTA4NCwgbG9jaz0weDI4NGM5MDdjKSBhdCBzeW5jaC5oOjEy
NQojMTIgMHgwODA1OTUwZiBpbiByZW1vdGVfc2VuZF90aHJlYWQgKGFyZz0weDI4NGNhYjAwKSBh
dCAvdXNyL3NyYy9zYmluL2hhc3RkL3ByaW1hcnkuYzoxMTE3CiMxMyAweDI4MjM0MjhmIGluIHB0
aHJlYWRfZ2V0cHJpbyAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzE0IDB4MDAwMDAwMDAgaW4g
Pz8gKCkKClRocmVhZCA0IChUaHJlYWQgMjg0MDQ2NDAgKExXUCAxMDAxMTQpKToKIzAgIDB4Mjgy
ZmFhNTcgaW4gcmVjdmZyb20gKCkgZnJvbSAvbGliL2xpYmMuc28uNwojMSAgMHgyODI4MGJlMiBp
biByZWN2ICgpIGZyb20gL2xpYi9saWJjLnNvLjcKIzIgIDB4MDgwNWMyODcgaW4gcHJvdG9fY29t
bW9uX3JlY3YgKGZkPTExLCBkYXRhPTB4YmY2ZmJmMjcgIiIsIHNpemU9NSkKICAgIGF0IC91c3Iv
c3JjL3NiaW4vaGFzdGQvcHJvdG9fY29tbW9uLmM6NzgKIzMgIDB4MDgwNWQ0ZjAgaW4gdGNwNF9y
ZWN2IChjdHg9MHgyODQ3ZjIyMCwgZGF0YT0weGJmNmZiZjI3ICIiLCBzaXplPTUpCiAgICBhdCAv
dXNyL3NyYy9zYmluL2hhc3RkL3Byb3RvX3RjcDQuYzozMjUKIzQgIDB4MDgwNWJkZjEgaW4gcHJv
dG9fcmVjdiAoY29ubj0weDI4NGViMTUwLCBkYXRhPTB4YmY2ZmJmMjcsIHNpemU9NSkgYXQgL3Vz
ci9zcmMvc2Jpbi9oYXN0ZC9wcm90by5jOjE5OAojNSAgMHgwODA0ZGRhZSBpbiBoYXN0X3Byb3Rv
X3JlY3ZfaGRyIChjb25uPTB4Mjg0ZWIxNTAsIG52cD0weGJmNmZiZjdjKSBhdCAvdXNyL3NyYy9z
YmluL2hhc3RkL2hhc3RfcHJvdG8uYzoyOTgKIzYgIDB4MDgwNTllZjkgaW4gcmVtb3RlX3JlY3Zf
dGhyZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vzci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6
MTI4MgojNyAgMHgyODIzNDI4ZiBpbiBwdGhyZWFkX2dldHByaW8gKCkgZnJvbSAvbGliL2xpYnRo
ci5zby4zCiM4ICAweDAwMDAwMDAwIGluID8/ICgpCgpUaHJlYWQgMyAoVGhyZWFkIDI4NDA0Nzgw
IChMV1AgMTAwMTE1KSk6CiMwICAweDI4MjNlZGQ3IGluIF9fZXJyb3IgKCkgZnJvbSAvbGliL2xp
YnRoci5zby4zCiMxICAweDI4MjNlOWI4IGluIF9fZXJyb3IgKCkgZnJvbSAvbGliL2xpYnRoci5z
by4zCiMyICAweDI4NGMzNGEwIGluID8/ICgpCiMzICAweDAwMDAwMDA4IGluID8/ICgpCiM0ICAw
eDAwMDAwMDAxIGluID8/ICgpCiM0ICAweDAwMDAwMDAxIGluID8/ICgpCiM1ICAweDI4NGMzNDgw
IGluID8/ICgpCiM2ICAweDAwMDAwMDAwIGluID8/ICgpCiM3ICAweDAwMDAwMDAwIGluID8/ICgp
CiM4ICAweDAwMDAwMDAwIGluID8/ICgpCiM5ICAweDI4MjNkMzFmIGluIHB0aHJlYWRfc2V0Y2Fu
Y2Vsc3RhdGUgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMCAweDI4MjNjYmJlIGluIHB0aHJl
YWRfY29uZF9zaWduYWwgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMSAweDA4MDU4ZTc4IGlu
IGN2X3dhaXQgKGN2PTB4ODA2N2UxNCwgbG9jaz0weDgwNjdlMTApIGF0IHN5bmNoLmg6MTI1CiMx
MiAweDA4MDVhNDBiIGluIGdnYXRlX3NlbmRfdGhyZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vz
ci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6MTM4MwojMTMgMHgyODIzNDI4ZiBpbiBwdGhyZWFk
X2dldHByaW8gKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxNCAweDAwMDAwMDAwIGluID8/ICgp
CgpUaHJlYWQgMiAoVGhyZWFkIDI4NDA0OGMwIChMV1AgMTAwMTE2KSk6CiMwICAweDI4MjNlZGQ3
IGluIF9fZXJyb3IgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxICAweDI4MjNlOWI4IGluIF9f
ZXJyb3IgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMyICAweDI4NGMzMWEwIGluID8/ICgpCiMz
ICAweDAwMDAwMDA4IGluID8/ICgpCiM0ICAweDAwMDAwMDAxIGluID8/ICgpCiM1ICAweDI4NGMz
MTgwIGluID8/ICgpCiM2ICAweDAwMDAwMDAwIGluID8/ICgpCiM3ICAweDAwMDAwMDAwIGluID8/
ICgpCiM4ICAweGJmNGY5ZWE4IGluID8/ICgpCiM5ICAweDI4MjNkMzFmIGluIHB0aHJlYWRfc2V0
Y2FuY2Vsc3RhdGUgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMCAweDI4MjNjYmJlIGluIHB0
aHJlYWRfY29uZF9zaWduYWwgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMSAweDA4MDU4ZTc4
IGluIGN2X3dhaXQgKGN2PTB4ODA2N2UyMCwgbG9jaz0weDgwNjdlMWMpIGF0IHN5bmNoLmg6MTI1
CiMxMiAweDA4MDVhN2NjIGluIHN5bmNfdGhyZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vzci9z
cmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6MTQ3MgojMTMgMHgyODIzNDI4ZiBpbiBwdGhyZWFkX2dl
dHByaW8gKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxNCAweDAwMDAwMDAwIGluID8/ICgpCgpU
aHJlYWQgMSAoVGhyZWFkIDI4NDA0YTAwIChMV1AgMTAwMTE3KSk6CiMwICAweDI4MmZhYTU1IGlu
IHJlY3Zmcm9tICgpIGZyb20gL2xpYi9saWJjLnNvLjcKIzEgIDB4MjgyODBiZTIgaW4gcmVjdiAo
KSBmcm9tIC9saWIvbGliYy5zby43CiMyICAweDA4MDVjMjg3IGluIHByb3RvX2NvbW1vbl9yZWN2
IChmZD05LCBkYXRhPTB4YmYzZjhmNDcgIioiLCBzaXplPTUpCiAgICBhdCAvdXNyL3NyYy9zYmlu
L2hhc3RkL3Byb3RvX2NvbW1vbi5jOjc4CiMzICAweDA4MDVjNmFlIGluIHNwX3JlY3YgKGN0eD0w
eDI4NGViMTAwLCBkYXRhPTB4YmYzZjhmNDcgIioiLCBzaXplPTUpCiAgICBhdCAvdXNyL3NyYy9z
YmluL2hhc3RkL3Byb3RvX3NvY2tldHBhaXIuYzoxNzcKIzQgIDB4MDgwNWJkZjEgaW4gcHJvdG9f
cmVjdiAoY29ubj0weDI4NGViMGYwLCBkYXRhPTB4YmYzZjhmNDcsIHNpemU9NSkgYXQgL3Vzci9z
cmMvc2Jpbi9oYXN0ZC9wcm90by5jOjE5OAojNSAgMHgwODA0ZGRhZSBpbiBoYXN0X3Byb3RvX3Jl
Y3ZfaGRyIChjb25uPTB4Mjg0ZWIwZjAsIG52cD0weGJmM2Y4ZjgwKSBhdCAvdXNyL3NyYy9zYmlu
L2hhc3RkL2hhc3RfcHJvdG8uYzoyOTgKIzYgIDB4MDgwNGNlMjcgaW4gY3RybF90aHJlYWQgKGFy
Zz0weDI4NGNhYjAwKSBhdCAvdXNyL3NyYy9zYmluL2hhc3RkL2NvbnRyb2wuYzozNzMKIzcgIDB4
MjgyMzQyOGYgaW4gcHRocmVhZF9nZXRwcmlvICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojOCAg
MHgwMDAwMDAwMCBpbiA/PyAoKQo=
--=-=-=--


