From: Rainer Duffner
To: FreeBSD Filesystems
Subject: zfs receive stalls whole system
Date: Tue, 17 May 2016 01:07:24 +0200
Message-Id: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>

Hi,

I have two servers that were running FreeBSD 10.1-AMD64 for a long time, one zfs-sending to the other (via zxfer). Both are NFS servers and MySQL slaves. The sender is actively used as an NFS server; the recipient is just a warm standby, in case something serious happens and we don’t want to wait for a day until the restore is back in place. The MySQL slaves are actively used as read-only servers (at the application level; Python’s SQLAlchemy does that, apparently).

They are HP DL380G8 (one CPU, hexacore) with over 128 GB RAM (I think one has 144, the other has 192).

While they were running 10.1, they used HP P420 RAID controllers with 12 individual RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs.

I use zfsnap to do hourly, daily and weekly snapshots.

Sending worked well, especially after updating to 10.1.

Because the storage was over 90% full (and I really hate this RAID0 business we have with the HP RAID controllers), I rebuilt the servers with HP’s OEMed H220/221 controllers (LSI 2308 in disguise), added an external disk shelf hosting 12 additional disks, and upgraded to FreeBSD 10.3.

Because we didn’t want to throw out the original disks but wanted to increase available space a lot, the new disks are double the size of the original disks (600 vs. 1200 GB SAS).

I also created GPT partitions on the disks, labeled them according to the disk’s position in the cages/shelf, and created the pools with the GPT partition names instead of the daX names.

Now, when I do a zxfer, sometimes the whole system stalls while the data is sent over, especially if the delta is large or if something else is reading from the disk at the same time (backup agent).

I had this before, on 10.0 (I believe; we didn’t have this in 9.1 either, IIRC), and it went away in 10.1.

It’s very difficult (well, impossible) to debug, because the system hangs completely and doesn’t accept any keypresses.

Would a ZIL help in this case?
I always thought that NFS was the only thing that did SYNC writes…