From: Mikolaj Golub
To: Daniel Kalchev
Cc: freebsd-stable@freebsd.org
Subject: Re: HAST instability
Date: Mon, 30 May 2011 21:42:22 +0300
Message-ID: <86d3j02fox.fsf@kopusha.home.net>
In-Reply-To: <4DE3ACF8.4070809@digsys.bg> (Daniel Kalchev's message of "Mon, 30 May 2011 17:43:04 +0300")
References: <4DE21C64.8060107@digsys.bg> <4DE3ACF8.4070809@digsys.bg>

On Mon, 30 May 2011 17:43:04 +0300 Daniel Kalchev wrote:

DK> Some further investigation:

DK> The HAST nodes do not disconnect when checksum is enabled (either
DK> crc32 or sha256).

DK> One strange thing is that there is never established TCP connection
DK> between both nodes:

DK> tcp4       0      0 10.2.101.11.48939      10.2.101.12.8457       FIN_WAIT_2
DK> tcp4       0   1288 10.2.101.11.57008      10.2.101.12.8457       CLOSE_WAIT
DK> tcp4       0      0 10.2.101.11.46346      10.2.101.12.8457       FIN_WAIT_2
DK> tcp4       0  90648 10.2.101.11.13916      10.2.101.12.8457       CLOSE_WAIT
DK> tcp4       0      0 10.2.101.11.8457       *.*                    LISTEN

This is normal. hastd uses each connection in only one direction, so it
calls shutdown() to close the unused direction. That is why the
connections show up as FIN_WAIT_2 and CLOSE_WAIT rather than ESTABLISHED.

DK> When using sha256 one CPU core is 100% utilized by each hastd process,
DK> while 70-80MB/sec per HAST resource is being transferred (total of up
DK> to 140 MB/sec traffic for both);

DK> When using crc32 each CPU core is at 22% utilization;

DK> When using none as checksum, CPU usage is under 10%

I suppose that when checksum is enabled the bottleneck is the CPU, so the
traffic rate is lower and the problem is not triggered.

DK> Eventually after many hours, got corrupted communication:

DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.

The "Hash mismatch" message suggests that you were actually using
checksum at that point, weren't you?

DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive
DK> request data: No such file or directory.
DK> May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process
DK> exited ungracefully (pid=9827, exitcode=75).

DK> and

DK> May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive
DK> reply header: Operation timed out.
DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from
DK> 10.2.101.12.
DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send
DK> request (Broken pipe): WRITE(99128470016, 131072).

This looks a little different from what was in your first message. Are
the clocks in sync on both nodes?

I would like to look at full logs covering a fairly long period, with
several occurrences of the problem, from both primary and secondary (and
to be sure the time is synchronized).

Also, it might be worth checking that there is no network packet
corruption (look for anything strange in netstat -di and netstat -s, and
maybe copy large files over the network and compare checksums).

-- 
Mikolaj Golub
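
The file-copy check mentioned above could be sketched roughly as follows.
This is only an illustration, assuming Python is available on both nodes;
the helper name file_sha256 and the script name filehash.py are
hypothetical, and the sha256(1) utility in the FreeBSD base system would
serve equally well.

```python
import hashlib
import sys


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in fixed-size chunks so that large test files
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical usage: python filehash.py /path/to/large/file
    for path in sys.argv[1:]:
        print(file_sha256(path), path)
```

Run it against the source file on one node and against the copy on the
other, then compare the printed digests. A mismatch on a large file would
suggest the data was corrupted somewhere between the two nodes.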