Date: Sun, 11 Mar 2012 23:09:11 +0100
From: Phil Regnauld
To: Mikolaj Golub
Cc: freebsd-stable@freebsd.org
Subject: Re: Issue with hast replication
Message-ID: <20120311220911.GD1684@macbook.bluepipe.net>
References: <20120311185457.GB1684@macbook.bluepipe.net> <861uoyvpzh.fsf@kopusha.home.net>
In-Reply-To: <861uoyvpzh.fsf@kopusha.home.net>

Mikolaj Golub (trociny) writes:
>
> PR> Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary)
> PR>   Disconnected from tcp4://192.168.1.200.
> PR> Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
> PR> Mar 11 02:02:41 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31642091520, 131072).
>
> 31642091520 looks like rather large offset for 10Gb volume...

    Sorry, that should have been 100G - I typed from memory instead of
    copy-pasting.

> Just to be more confident that this is a HAST issue could you please try the
> following experiment?
>
> 1) Stop hastd on h2.
>
> 2) On h1 run something like below:
>
> dd if=/dev/zvol/zfs/hvol bs=131072 | ssh h2 dd bs=131072 of=/dev/zvol/zfs/hvol
>
> (copy hvol from h1 to h2 without hastd to see if it will succeed).
>
> Note: you will need to recreate HAST provider on secondary after this.

    Ok this is interesting. (For debugging purposes I've renamed the
    target zvol as "junk", you'll see why below).

1) As you suggested:

    h1# dd if=/dev/zvol/zfs/hvol bs=131072 | ssh h2 dd bs=131072 of=/dev/zvol/zfs/junk
    dd: /dev/zvol/zfs/junk: Invalid argument
    0+6 records in
    0+5 records out
    131072 bytes transferred in 0.002344 secs (55920640 bytes/sec)

   To be certain which dd was complaining, I renamed the target zvol.

2) Tried repeatedly, sometimes the number of bytes is a bit different:

    0+7 records in
    0+6 records out
    147456 bytes transferred in 0.002448 secs (60233277 bytes/sec)

   And yes, hastd is stopped on h2.

3) I tried dd'ing zero to the zvol locally on h2:

    h2# dd if=/dev/zero of=/dev/zvol/zfs/junk bs=131072
    ^C1817+0 records in
    1816+0 records out
    238026752 bytes transferred in 1.582006 secs (150458820 bytes/sec)

   That works, until I ^C it.

4) I tried redirecting the output of the dd | ssh to a file on the h2 side:

    h1# dd if=/dev/zvol/zfs/hvol bs=131072 | ssh h2 dd bs=131072 of=/tmp/x
    ^C653+0 records in
    652+0 records out
    85458944 bytes transferred in 2.408074 secs (35488506 bytes/sec)

   That works too, until I ^C it.
5) Things get even weirder - if I then go over to h2 and dd the "/tmp/x"
   test file over to the zvol:

    h2# dd if=x bs=131072 of=/dev/zvol/zfs/junk
    dd: /dev/zvol/zfs/junk: Invalid argument
    652+1 records in
    652+0 records out
    85458944 bytes transferred in 0.444571 secs (192227879 bytes/sec)

   Note that the file /tmp/x is 86917120 bytes long.

6) I try to copy more data into /tmp/x - it's now 291946496 (~280 MB)

    h2# dd if=x bs=131072 of=/dev/zvol/zfs/junk
    2227+1 records in
    2227+1 records out
    291946496 bytes transferred in 3.564129 secs (81912441 bytes/sec)

   No more "invalid argument"...

7) ktrace on the destination dd:

    [...]
           \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
           \0"
      5807 dd       RET   read 17992/0x4648
      5807 dd       CALL  write(0x3,0x800c09000,0x4648)
      5807 dd       RET   write -1 errno 22 Invalid argument
      5807 dd       CALL  write(0x2,0x7fffffffd300,0x4)
      5807 dd       GIO   fd 2 wrote 4 bytes
           "dd: "
      5807 dd       RET   write 4
      5807 dd       CALL  write(0x2,0x7fffffffd3e0,0x12)
      5807 dd       GIO   fd 2 wrote 18 bytes
           "/dev/zvol/zfs/junk"

   truss is a bit more informative:

    fstat(0,{ mode=p--------- ,inode=5,size=16384,blksize=4096 }) = 0 (0x0)
    lseek(0,0x0,SEEK_CUR)                            ERR#29 'Illegal seek'

Illegal seek, eh ? Any clues ?

The boxes are identical (HP DL380 G6), though the RAM config is different.

Summary:

- ssh works fine
- h1 zvol to h2 zvol over ssh fails
- h1 zvol to h2 /tmp/x over ssh is fine
- h2 /dev/zero locally to h2 zvol is fine
- h2 /tmp/x locally to h2 zvol fails at first, but works afterwards...
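
One possible reading of the ktrace output, offered as an assumption and not
a confirmed diagnosis: the failing write is 17992 (0x4648) bytes, which is
not a multiple of 512, while every write that succeeded in the tests was a
whole 131072-byte block or a sector-aligned tail. If the zvol device rejects
writes that are not a whole number of sectors, short reads from the ssh pipe
would explain the "Invalid argument". The Python sketch below only
illustrates that idea. SECTOR, is_sector_aligned and reblock are
hypothetical names, and the 512-byte sector size is assumed, not measured
on h2.

```python
SECTOR = 512  # assumed sector size; the real device may use a different value


def is_sector_aligned(nbytes, sector=SECTOR):
    """Return True if a write of nbytes covers a whole number of sectors."""
    return nbytes % sector == 0


def reblock(stream, block_size):
    """Yield block_size-byte blocks from stream, accumulating short reads.

    A pipe may return fewer bytes than requested on each read. This keeps
    reading until a full block is buffered before handing it on. The final
    block may still be short if the stream ends mid-block.
    """
    buf = bytearray()
    while True:
        chunk = stream.read(block_size - len(buf))
        if not chunk:  # end of stream
            break
        buf += chunk
        if len(buf) == block_size:
            yield bytes(buf)
            buf.clear()
    if buf:
        yield bytes(buf)


if __name__ == "__main__":
    # The write size seen in the ktrace output versus a full dd block.
    for size in (17992, 131072):
        print(size, "aligned" if is_sector_aligned(size) else "NOT aligned")
```

In dd terms this corresponds to giving the receiving dd an obs= value
in place of bs=, since POSIX dd only aggregates short input blocks into
full output blocks when bs= is not specified. Whether that makes the copy
to the zvol succeed here has not been tested.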