From owner-freebsd-fs@freebsd.org Mon Jun 12 06:13:53 2017
From: Jov <amutu@amutu.com>
Date: Mon, 12 Jun 2017 14:13:30 +0800
Subject: Re: [EXTERNAL] Re: FreeBSD10 Stable + ZFS + PostgreSQL + SSD performance drop < 24 hours
To: "Caza, Aaron"
Cc: "freebsd-hackers@freebsd.org", freebsd-fs

From the output of explain analyze in PG, the problem can be excluded from the database. I am not a fs expert, so I have CCed freebsd-fs@freebsd.org. It may be helpful if you provide more info, such as sysctl -a | grep zfs, after the degradation.
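For example, a rough sketch of the kind of snapshot I mean (untested; "tank" is only a placeholder for your actual pool name), taken once before and once after the slowdown:

# ZFS tunables and kstats, including the ARC counters
sysctl -a | grep zfs > /tmp/zfs-sysctl.txt
# pool health and dataset properties (recordsize, compression, atime, ...)
zpool status tank > /tmp/zpool-status.txt
zfs get all tank > /tmp/zfs-props.txt
# one batch of per-disk statistics, including delete (TRIM) operations
gstat -bd -I 5s > /tmp/gstat.txt

Comparing the before and after snapshots (especially the ARC and per-disk numbers) should show where the time is going.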
2017-06-12 12:50 GMT+08:00 Caza, Aaron:

> Thanks, Jov, for your suggestions. Per your e-mail I added "explain
> analyze" to the script:
>
> #!/bin/sh
> psql --username=test --password=supersecret -h /db -d test << EOL
> \timing on
> explain analyze select count(*) from test;
> \q
> EOL
>
> Sample run of above script before degradation:
>
> Timing is on.
>                                                            QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3350822.35..3350822.36 rows=1 width=0) (actual time=60234.556..60234.556 rows=1 loops=1)
>    ->  Seq Scan on test  (cost=0.00..3296901.08 rows=21568508 width=0) (actual time=1.126..57021.470 rows=21568508 loops=1)
>  Planning time: 4.968 ms
>  Execution time: 60234.649 ms
> (4 rows)
>
> Time: 60248.503 ms
> test$ uptime
> 10:33PM  up 7 mins, 3 users, load averages: 1.68, 1.79, 0.94
>
> Sample run of above script after degradation (~11.33 hours uptime):
>
> Timing is on.
>                                                            QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3350822.35..3350822.36 rows=1 width=0) (actual time=485669.361..485669.361 rows=1 loops=1)
>    ->  Seq Scan on test  (cost=0.00..3296901.08 rows=21568508 width=0) (actual time=0.008..483241.253 rows=21568508 loops=1)
>  Planning time: 0.529 ms
>  Execution time: 485669.411 ms
> (4 rows)
>
> Time: 485670.432 ms
> test$ uptime
> 9:59PM  up 11:21, 2 users, load averages: 1.11, 2.13, 2.14
>
> Regarding dd'ing the pgdata directory, that didn't work for me as Postgres
> splits the database up into multiple 2GB files – dd'ing of a 2GB file on a
> system with 8GB ram doesn't seem representative.
> I opted to create a 16GB
> file (dd if=/dev/random of=/testdb/test bs=1m count=16000) on the
> pertinent ZFS file system, then performed a dd operation on that:
>
> Sample of run after degradation (~11.66 hours uptime):
> 16000+0 records in
> 16000+0 records out
> 16777216000 bytes transferred in 274.841792 secs (61043176 bytes/sec)
> test$ uptime
> 10:25PM  up 11:46, 2 users, load averages: 1.00, 1.28, 1.59
>
> After rebooting, we can see *MUCH* better performance:
> test$ dd if=/testdb/test of=/dev/null bs=1m
> 16000+0 records in
> 16000+0 records out
> 16777216000 bytes transferred in 19.456043 secs (862313883 bytes/sec)
> test$ dd if=/testdb/test of=/dev/null bs=1m
> 16000+0 records in
> 16000+0 records out
> 16777216000 bytes transferred in 19.375321 secs (865906473 bytes/sec)
> test$ dd if=/testdb/test of=/dev/null bs=1m
> 16000+0 records in
> 16000+0 records out
> 16777216000 bytes transferred in 19.173458 secs (875022968 bytes/sec)
> test$ uptime
> 10:30PM  up 4 mins, 3 users, load averages: 3.52, 1.62, 0.69
>
> These tests were conducted with the previously mentioned Samsung 850 Pro
> 256GB SSDs (Intel Xeon E31240 with 8GB ram). There's essentially nothing
> else running on this system (99.5-100% idle) and no other disk activity.
>
> Regards,
> A
>
> From: Jov [mailto:amutu@amutu.com]
> Sent: Sunday, June 11, 2017 5:50 PM
> To: Caza, Aaron
> Cc: freebsd-hackers@freebsd.org; Allan Jude
> Subject: [EXTERNAL] Re: FreeBSD10 Stable + ZFS + PostgreSQL + SSD performance drop < 24 hours
>
> To exclude a fs problem, I would do a dd test on the pgdata data set
> after the performance drop; if the read and/or write utilization can reach 100%
> or the expected performance, then I would say the problem is not the fs or the os.
>
> For pg, what's your output of explain analyze before and after the performance drop?
>
> On Jun 12, 2017 at 12:51 AM, "Caza, Aaron" wrote:
>
> Thanks Allan for the suggestions.  I tried gstat -d but deletes (d/s)
> doesn't seem to be it, as it stays at 0 despite vfs.zfs.trim.enabled=1.
>
> This is most likely due to the "layering" I use as, for historical
> reasons, I have GEOM ELI set up to essentially emulate 4k sectors
> regardless of the underlying media.  I do my own alignment and partition
> sizing, as well as have the ZFS record size set to 8k for Postgres.
>
> In gstat, the SSDs' %busy is 90-100% on startup after reboot.  Once the
> performance degradation hits (<24 hours later), I'm seeing %busy at ~10%.
>
> #!/bin/sh
> psql --username=test --password=supersecret -h /db -d test << EOL
> \timing on
> select count(*) from test;
> \q
> EOL
>
> Sample run of above script after reboot (before degradation hits) (Samsung
> 850 Pros in ZFS mirror):
> Timing is on.
>   count
> ----------
>  21568508
> (1 row)
>
> Time: 57029.262 ms
>
> Sample run of above script after degradation (Samsung 850 Pros in ZFS
> mirror):
> Timing is on.
>   count
> ----------
>  21568508
> (1 row)
>
> Time: 583595.239 ms
> (Uptime ~1 day in this particular case.)
>
> Any other suggestions?
>
> Regards,
> A
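For reference, the kind of layering Aaron describes above (GEOM ELI exposing 4k sectors underneath a ZFS mirror, with an 8k recordsize for Postgres) would look roughly like the following. This is only an illustrative sketch with placeholder device and pool names, not his actual commands:

# geli providers with an emulated 4k sector size (ada0p3/ada1p3 are placeholders)
geli init -s 4096 /dev/ada0p3
geli init -s 4096 /dev/ada1p3
geli attach /dev/ada0p3
geli attach /dev/ada1p3
# mirrored pool on the .eli providers; 8k recordsize to match PostgreSQL's 8k pages
zpool create tank mirror ada0p3.eli ada1p3.eli
zfs create -o recordsize=8k tank/pgdata

Note that recordsize only applies to files written after it is set, so it has to be in place before the database is loaded.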
> -----Original Message-----
> From: owner-freebsd-hackers@freebsd.org [mailto:owner-freebsd-hackers@freebsd.org] On Behalf Of Allan Jude
> Sent: Saturday, June 10, 2017 9:40 PM
> To: freebsd-hackers@freebsd.org
> Subject: [EXTERNAL] Re: FreeBSD10 Stable + ZFS + PostgreSQL + SSD performance drop < 24 hours
>
> On 06/10/2017 12:36, Slawa Olhovchenkov wrote:
> > On Sat, Jun 10, 2017 at 04:25:59PM +0000, Caza, Aaron wrote:
> >
> >> Gents,
> >>
> >> I'm experiencing an issue where iterating over a PostgreSQL table of
> >> ~21.5 million rows (select count(*)) goes from ~35 seconds to ~635 seconds
> >> on Intel 540 SSDs.  This is using a FreeBSD 10 amd64 stable kernel back
> >> from Jan 2017.  The SSDs are basically 2 drives in a ZFS mirrored zpool.  I'm
> >> using PostgreSQL 9.5.7.
> >>
> >> I've tried:
> >>
> >>   * Using the FreeBSD10 amd64 stable kernel snapshot of May 25, 2017.
> >>   * Testing on half a dozen machines with different models of SSDs:
> >>     o Intel 510s (120GB) in ZFS mirrored pair
> >>     o Intel 520s (120GB) in ZFS mirrored pair
> >>     o Intel 540s (120GB) in ZFS mirrored pair
> >>     o Samsung 850 Pros (256GB) in ZFS mirrored pair
> >>   * Using bonnie++ to remove Postgres from the equation; performance does indeed drop.
> >>   * Rebooting the server and immediately re-running the test; performance is back to original.
> >>   * Using Karl Denninger's patch from PR187594 (which took some work to find
> >>     a kernel that the FreeBSD10 patch would both apply and compile cleanly against).
> >>   * Disabling ZFS lz4 compression.
> >>   * Running the same test on a FreeBSD9.0 amd64 system using PostgreSQL 9.1.3
> >>     with 2 Intel 520s in a ZFS mirrored pair.  That system had 165 days uptime
> >>     and the test took ~80 seconds, after which I rebooted, re-ran the test, and it was
> >>     still at ~80 seconds (older processor and memory in this system).
> >>
> >> I realize that there's a whole lot of info I'm not including (dmesg,
> >> zfs-stats -a, gstat, et cetera): I'm hoping some enlightened individual
> >> will be able to point me to a solution with only the above to go on.
> >
> > Just a random guess: can you try r307264 (I mean the regression in
> > r307266)?
>
> This sounds a bit like an issue I investigated for a customer a few months ago.
>
> Look at gstat -d (it includes DELETE operations like TRIM).
>
> If you see a lot of that happening, try vfs.zfs.trim.enabled=0 in
> /boot/loader.conf and see if your issues go away.
>
> The FreeBSD TRIM code for ZFS basically waits until the sector has been
> free for a while (to avoid doing a TRIM on a block we'll immediately
> reuse), so your benchmark will run fine for a little while, then suddenly
> the TRIM will kick in.
>
> For postgres, fio, bonnie++ etc., make sure the ZFS dataset you are storing
> the data on / benchmarking has a recordsize that matches the workload.
>
> If you are doing a write-only benchmark and you see lots of reads in
> gstat, you know you are having to do read/modify/writes, and that is why
> your performance is so bad.
>
> --
> Allan Jude
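A minimal way to act on Allan's suggestion (a sketch; "tank/pgdata" is a placeholder dataset name, and the loader.conf tunable only takes effect after a reboot):

# watch per-disk activity, including delete (TRIM) operations, while the test runs
gstat -d
# confirm the recordsize actually in effect on the PostgreSQL dataset
zfs get recordsize tank/pgdata
# to rule TRIM in or out, disable it and reboot
echo 'vfs.zfs.trim.enabled=0' >> /boot/loader.conf

If the slowdown disappears with TRIM disabled, the batched TRIM behaviour Allan describes is the likely culprit; if not, the tunable can simply be removed again.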