Date:      Tue, 17 May 2016 12:44:50 +0200
From:      "Ronald Klop" <ronald-lists@klop.ws>
To:        "FreeBSD Filesystems" <freebsd-fs@freebsd.org>, "Rainer Duffner" <rainer@ultra-secure.de>
Subject:   Re: zfs receive stalls whole system
Message-ID:  <op.yhlr40k3kndu52@ronaldradial.radialsg.local>
In-Reply-To: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>
References:  <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>

On Tue, 17 May 2016 01:07:24 +0200, Rainer Duffner
<rainer@ultra-secure.de> wrote:

> Hi,
>
> I have two servers that were running FreeBSD 10.1-AMD64 for a long
> time, one zfs-sending to the other (via zxfer). Both are NFS servers
> and MySQL slaves; the sender is actively used as an NFS server, the
> recipient is just a warm standby, in case something serious happens
> and we don't want to wait for a day until the restore is back in
> place. The MySQL slaves are actively used as read-only servers (at the
> application level; Python's SQLAlchemy does that, apparently).
>
> They are HP DL380 G8s (one CPU, hexacore) with over 128 GB RAM (I
> think one has 144, the other has 192).
> While they were running 10.1, they used HP P420 RAID controllers with
> 12 individual RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs.
> I use zfsnap to do hourly, daily and weekly snapshots.
>
> Sending worked well, especially after updating to 10.1.
>
> Because the storage was over 90% full (and I really hate this RAID0
> business we have with the HP RAID controllers), I rebuilt the servers
> with HP's OEMed H220/221 controllers (LSI 2308 in disguise), added an
> external disk shelf hosting 12 additional disks, and upgraded to
> FreeBSD 10.3.
> Because we didn't want to throw out the original disks but wanted to
> increase the available space a lot, the new disks are double the size
> of the original ones (600 vs. 1200 GB SAS).
> I also created GPT partitions on the disks, labeled them according to
> each disk's position in the cages/shelf, and created the pools with
> the GPT partition names instead of the daX names.
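
For reference, the labeling scheme described above looks roughly like
this; the device names and label strings below are made up:

    # Label each disk after its physical position, then build the pool
    # from the stable gpt/ labels rather than the daX device names.
    gpart create -s gpt da2
    gpart add -t freebsd-zfs -l shelf0-bay02 da2
    # (repeat for the remaining disks, one label per bay)
    zpool create tank \
        raidz2 gpt/shelf0-bay00 gpt/shelf0-bay01 gpt/shelf0-bay02 \
               gpt/shelf0-bay03 gpt/shelf0-bay04 gpt/shelf0-bay05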
>
> Now, when I do a zxfer, sometimes the whole system stalls while the
> data is sent over, especially if the delta is large or if something
> else is reading from the disk at the same time (backup agent).
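
Under the hood zxfer is essentially doing incremental zfs send/receive;
a minimal equivalent of one transfer, with hypothetical snapshot and
host names, would be:

    # Send only the delta between the last common snapshot and the
    # newest one; -F makes the receiver roll back to the common
    # snapshot before applying the stream.
    zfs send -i tank/data@hourly-2016-05-16 tank/data@hourly-2016-05-17 | \
        ssh standby zfs receive -F tank/data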
>
> I had this before, on 10.0 (I believe; we didn't have this in 9.1,
> IIRC), and it went away in 10.1.
>
> It's very difficult (well, impossible) to debug, because the system
> totally hangs and doesn't accept any keypresses.
>
> Would a ZIL help in this case?
> I always thought that NFS was the only thing that did SYNC writes…

Databases love SYNC writes too. (But that doesn't say anything about
the unresponsive system.)
I think there is a statistic somewhere in FreeBSD to analyze sync vs.
async writes and decide whether a ZIL will help or not. (But that
doesn't say anything about the unresponsive system either.)
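
For example, on 10.x the ZIL counters are exported as kstats, so
something like the following (exact sysctl names may differ per
release) shows how much synchronous write traffic a pool actually sees:

    # ZIL commit/write counters; rapidly growing numbers mean lots of
    # sync writes, which a dedicated log device (SLOG) could absorb.
    sysctl kstat.zfs.misc.zil

    # Per-disk I/O load and latency while a zfs receive is running
    # (-p = physical providers only):
    gstat -p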

Ronald.


