Date:      Mon, 30 May 2011 21:42:22 +0300
From:      Mikolaj Golub <trociny@freebsd.org>
To:        Daniel Kalchev <daniel@digsys.bg>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: HAST instability
Message-ID:  <86d3j02fox.fsf@kopusha.home.net>
In-Reply-To: <4DE3ACF8.4070809@digsys.bg> (Daniel Kalchev's message of "Mon, 30 May 2011 17:43:04 +0300")
References:  <4DE21C64.8060107@digsys.bg> <4DE3ACF8.4070809@digsys.bg>

On Mon, 30 May 2011 17:43:04 +0300 Daniel Kalchev wrote:

 DK> Some further investigation:

 DK> The HAST nodes do not disconnect when checksum is enabled (either
 DK> crc32 or sha256).

 DK> One strange thing is that there is never established TCP connection
 DK> between both nodes:

 DK> tcp4       0      0 10.2.101.11.48939      10.2.101.12.8457       FIN_WAIT_2
 DK> tcp4       0   1288 10.2.101.11.57008      10.2.101.12.8457       CLOSE_WAIT
 DK> tcp4       0      0 10.2.101.11.46346      10.2.101.12.8457       FIN_WAIT_2
 DK> tcp4       0  90648 10.2.101.11.13916      10.2.101.12.8457       CLOSE_WAIT
 DK> tcp4       0      0 10.2.101.11.8457       *.*                    LISTEN

This is normal. hastd uses each connection in one direction only, so it calls
shutdown to close the unused direction, which leaves the connections in the
FIN_WAIT_2 and CLOSE_WAIT states.
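
As a rough illustration of such a half-closed connection (this is not hastd's
code; socketpair() merely stands in for the TCP connection between the nodes,
and half_close_demo is a name made up for the example):

```python
import socket


def half_close_demo():
    """Show that one direction of a connection can be shut down
    while the other direction keeps carrying data."""
    # Assumption: a local socket pair stands in for the link between two nodes.
    sender, receiver = socket.socketpair()
    try:
        # The receiver never writes on this connection, so it closes
        # its write side only.
        receiver.shutdown(socket.SHUT_WR)
        # The sender sees end-of-file on the unused direction ...
        eof = sender.recv(1)
        # ... but the direction that is in use still works.
        sender.sendall(b"payload")
        data = receiver.recv(16)
        return eof, data
    finally:
        sender.close()
        receiver.close()
```

On a real TCP connection the side that called shutdown on its write direction
shows up as FIN_WAIT_2 and its peer as CLOSE_WAIT, as in the netstat output
above.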

 DK> When using sha256 one CPU core is 100% utilized by each hastd process,
 DK> while 70-80MB/sec per HAST resource is being transferred (total of up
 DK> to 140 MB/sec traffic for both);

 DK> When using crc32 each CPU core is at 22% utilization;

 DK> When using none as checksum, CPU usage is under 10%

I suppose that when checksum is enabled the bottleneck is the CPU, so the
traffic rate is lower and the problem is not triggered.
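
To get a feeling for how much a checksum can limit the rate, something like
the sketch below can be used. It only measures Python's zlib.crc32 and
hashlib.sha256 on a 128 kB block (the size of the WRITE request in the log);
the absolute numbers will differ from what hastd achieves, and
throughput_mb_s is a hypothetical helper name.

```python
import hashlib
import time
import zlib


def throughput_mb_s(func, block, rounds=64):
    """Return the approximate rate, in MB/s, at which func processes block."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(block)
    # Guard against a zero interval on a very coarse clock.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(block) * rounds / elapsed / 1e6


if __name__ == "__main__":
    block = bytes(131072)  # 128 kB of zeroes
    print("crc32 :", throughput_mb_s(zlib.crc32, block))
    print("sha256:", throughput_mb_s(lambda b: hashlib.sha256(b).digest(), block))
```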

 DK> Eventually after many hours, got corrupted communication:

 DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.

The "Hash mismatch" message suggests that you were actually using checksum
then, weren't you?

 DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive
 DK> request data: No such file or directory.
 DK> May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process
 DK> exited ungracefully (pid=9827, exitcode=75).

 DK> and

 DK> May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive
 DK> reply header: Operation timed out.
 DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from
 DK> 10.2.101.12.
 DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send
 DK> request (Broken pipe): WRITE(99128470016, 131072).

It looks a little different than in your first message.

Are the clocks in sync on both nodes?

I would like to look at the full logs from both primary and secondary for a
rather long period, covering several cases (and to be sure the time is
synchronized).

Also, it might be worth checking that there is no network packet corruption
(look for strange things in netstat -di and netstat -s, or maybe copy large
files over the network and compare checksums).
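
For the checksum comparison any tool will do; a minimal sketch in Python,
where file_sha256 is a name made up for the example, could be:

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the empty read that marks end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on the original file on one node and on the copy on the other node;
differing digests would mean the data was damaged somewhere on the way.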

-- 
Mikolaj Golub


