Date: Tue, 31 May 2011 15:51:07 +0300
From: Daniel Kalchev
To: Mikolaj Golub
Cc: freebsd-stable@freebsd.org
Subject: Re: HAST instability
Message-ID: <4DE4E43B.7030302@digsys.bg>
In-Reply-To: <86d3j02fox.fsf@kopusha.home.net>
References: <4DE21C64.8060107@digsys.bg> <4DE3ACF8.4070809@digsys.bg> <86d3j02fox.fsf@kopusha.home.net>

On 30.05.11 21:42, Mikolaj Golub wrote:
>  DK> One strange thing is that there is never established TCP connection
>  DK> between both nodes:
>
>  DK> tcp4       0      0 10.2.101.11.48939   10.2.101.12.8457    FIN_WAIT_2
>  DK> tcp4       0   1288 10.2.101.11.57008   10.2.101.12.8457    CLOSE_WAIT
>  DK> tcp4       0      0 10.2.101.11.46346   10.2.101.12.8457    FIN_WAIT_2
>  DK> tcp4       0  90648 10.2.101.11.13916   10.2.101.12.8457    CLOSE_WAIT
>  DK> tcp4       0      0 10.2.101.11.8457    *.*                 LISTEN
>
> It is normal. hastd uses the connections only in one direction so it calls
> shutdown to close unused directions.

So the TCP connections are all so short-lived that I can never see a
single one in ESTABLISHED state? 10Gbit Ethernet is indeed fast, so this
might well be possible...

> I suppose when checksum is enabled the bottleneck is cpu, the traffic
> rate is lower and the problem is not triggered.

I was thinking something like this. My later tests seem to suggest that
this gets triggered when the network transfer rate is much higher than
the disk transfer rate.

> "Hash mismatch" message suggests that actually you were using checksum then,
> weren't you?

Yes, this occurs only when checksums are enabled. It happens with both
crc32 and sha256.

> I would like to look at full logs for some rather large period, with several
> cases, from both primary and secondary (and be sure about synchronized time).

I have made sure the clocks are synchronized and am currently running on
freshly rebooted nodes (with two additional SATA drives at each node).
So far there are some interesting findings: I now get hash errors and
disconnects much more frequently. I will post the logs when a bonnie++
run on the ZFS filesystem on top of the HAST resources finishes.

> Also, it might be worth checking that there is no network packet
> corruption (some strange things in netstat -di, netstat -s, maybe
> copying large files via net and comparing checksums).

I will post these as well; however, so far I have seen no indication of
any network problems, no interface errors, etc. Of course, it might also
be that the ix driver is not reporting them.

One additional note: while playing with this setup, I tried to simulate
the local disk going away, in the hope that HAST would switch to using
the remote disk. Instead of asking someone at the site to pull out the
drive, I just issued on the primary

hastctl role init data0

which resulted in a kernel panic. Unfortunately, there was not enough
dump space for 48GB.
I will re-run this with more drives for the crash dump. Anything you
want me to look for in particular? (The kernels have no KDB compiled in
yet.)

Daniel