Subject: Re: FreeBSD ZFS file server with SSD HDD
To: freebsd-questions@freebsd.org
From: markham breitbach <markham@ssimicro.com>
Date: Wed, 11 Oct 2017 12:08:01 -0600

I ran into some problems with disks choking on heavy IO under VMware.
It turned out to be an issue with the firmware on the SSDs and the
backplane in a Dell server.  It's probably worth making sure those are
all up to date.

-M

On 2017-10-11 11:30 AM, David Christensen wrote:
> On 10/11/17 06:05, Kate Dawson wrote:
>> Currently running a FreeBSD NFS server with a zpool comprising
>> 12 x 1TB hard disk drives, arranged as pairs of mirrors in a stripe
>> set (RAID 10).
>
> That should do 6+ Gb/s.
>
> bonnie++ should be able to measure that.  (It's been a while, but I
> seem to recall that bonnie++ expects raw drives and nukes your data.
> So, it could take some effort to use it.)
>
> https://www.coker.com.au/bonnie++/
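>
> If it is pointed at a scratch dataset on the pool, something like the
> following might be a reasonable starting point (the dataset name and
> size here are placeholders; -s of roughly twice RAM keeps the ARC
> from hiding the disks):
>
>     # scratch dataset so the test files are easy to clean up afterwards
>     zfs create tank/bench
>     # -d test directory, -s total file size, -u user to run as when root
>     bonnie++ -d /tank/bench -s 128g -u root
>     zfs destroy tank/bench
>
> Running the same thing against the NFS mount from one of the Debian
> hosts would also show roughly what the network path costs.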
>
>> An additional 2x 960GB SSD added.  These two SSDs are partitioned
>> with a small partition being used for a ZIL log, and a larger
>> partition arranged for L2ARC cache.
>
> Assuming the ZIL is mirrored, that should do 5+ Gb/s.
>
> Assuming the L2ARC is striped, that should do 10+ Gb/s.
>
> I don't know how to test the ZIL and L2ARC in isolation, but dbench
> should be able to test what ZFS exposes, both locally and over NFS
> (a sample invocation is sketched after the tool list below):
>
> https://dbench.samba.org/
>
>> Additionally the host has 64GB RAM and 16 CPU cores (AMD Opteron,
>> 2 GHz).
>
> That should do 20+ Gb/s.
>
> Memtest86+ should be able to measure that:
>
> http://www.memtest.org/
>
>> A dataset from the pool is exported via NFS to a number of Debian
>> GNU/Linux hosts running a Xen hypervisor.  These run several
>> disk-image-based virtual machines.
>>
>> In general use, the FreeBSD NFS host sees very little read IO, which
>> is to be expected, as the RAM cache and L2ARC are designed to
>> minimise the amount of read load on the disks.
>>
>> However, we're starting to see high load (mostly IO WAIT) on the
>> Linux virtualisation hosts and virtual machines, with kernel
>> timeouts occurring that result in crashes and instability.
>>
>> I believe this may be due to the limited number of random write IOPS
>> available on the zpool NFS export.
>>
>> I can get sequential writes and reads to and from the NFS server at
>> speeds that approach the maximum the network provides (currently
>> 1 Gb/s + jumbo frames, and I could increase this by bonding multiple
>> interfaces together).
>>
>> However, day-to-day usage does not show network utilisation anywhere
>> near this maximum.
>>
>> If I look at the output of `zpool iostat -v tank 1` I see that every
>> five seconds or so, the number of write operations goes to > 2k.
>>
>> I think this shows that I'm hitting the limit that the spinning
>> disks can provide in this workload.
>>
>> As a cost-effective way to improve this (rather than replacing the
>> whole chassis), I was considering replacing the 1TB HDDs with 1TB
>> SSDs, for the improved IOPS.
>>
>> I wonder if there are any opinions within the community here on:
>>
>> 1. What metrics can I gather to confirm the disk write IO as the
>> bottleneck?
>>
>> 2. Whether the proposed solution will have the required effect?
>> That is, a decrease in the IO WAIT on the GNU/Linux virtualisation
>> hosts.
>
> I infer your network to be:
>
> - 1 host running FreeBSD (freebsd-version? uname -a?) and an NFS
> server (version?).
>
> - N (how many?) Debian GNU/Linux hosts (/etc/debian_version?  uname
> -a?), each running a Xen hypervisor (version?) and an NFS client.
>
> - The VMs are configured to see their drives as local devices (e.g.
> the VMs are not running NFS clients connected to the FreeBSD NFS
> server).
>
> - Gigabit switch (make? model?).
>
> - 1 Gigabit connection between the switch and each host.
>
> As you have correctly stated, you need visibility on the relevant
> performance metrics to make informed decisions.  In addition to the
> above tools:
>
> - For networking, I'd try netstat:
>
> http://netstat.net/
>
> - For drive I/O, I use nmon on Debian:
>
> https://en.wikipedia.org/wiki/Nmon
>
> - I believe iostat is available on both:
>
> https://en.wikipedia.org/wiki/Iostat
>
> - For CPUs, RAM, and swap, I use top.
>
> https://en.wikipedia.org/wiki/Top_(software)
>
> - You seem to have found at least one ZFS tool.
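>
> On question 1, watching the pool and the underlying disks side by
> side during one of the bad periods should confirm it.  Something
> like this (the pool name is from your example; the rest are stock
> tools):
>
>     # FreeBSD NFS server: per-vdev write ops and per-disk busy time
>     zpool iostat -v tank 1
>     gstat -p
>
>     # Debian Xen hosts (iostat is in the sysstat package)
>     iostat -x 1
>
> If the HDDs sit near 100% busy with long service times while the SSD
> log devices stay mostly idle, that points at the spinning disks as
> the write bottleneck.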
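>
> For the dbench runs mentioned above, something along these lines
> might do (the directory and client count are placeholders; the
> scratch dataset above would do).  Run it once in a directory on the
> pool and again over the NFS mount from a Debian host, then compare
> throughput and latency:
>
>     # 16 simulated clients for 60 seconds against a test directory
>     dbench -D /tank/bench -t 60 16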
>
> As others have stated, you will want to ensure that all the pieces
> are reasonably in tune -- VM, NFS client, Xen, Debian networking,
> switch, FreeBSD networking, NFS server, ZFS, etc.  I'd start by
> looking for errors and/or warnings in the usual places (dmesg,
> /var/log, etc.).  I typically leave the settings at the installer
> defaults, unless I have some compelling reason to make a change (at
> least one reader made a suggestion).  Be sure to keep good notes if
> you're going to muck with the settings.
>
> As for 'zpool iostat -v tank 1', I suspect ZFS is telling you that it
> is flushing each transaction group of writes to the HDDs every five
> seconds (the default txg sync interval).  If the flushes always
> complete before the next scheduled flush, replacing the HDDs with
> SSDs probably will not help with the VM IO WAIT and kernel timeout
> problems.  But if the flushes are overrunning each other during peak
> usage, you may have found the bottleneck.
>
> That said, I suspect that the root cause of the VM IO WAIT and kernel
> timeout problems is that the virtual machines need a low-latency
> connection to their system drives, temporary file systems, and/or
> swap devices, and they aren't getting it.  I would not bet on NFS to
> provide this, even with SSDs instead of HDDs.  I would bet on local
> resources.  I suggest:
>
> 1.  Put 2 mirrored SSDs in each Xen server.
>
> 2.  Put VM system drives on the local SSD mirror.
>
> 3.  Put VM /tmp file systems on the local SSD mirror, or on RAM:
>
> https://en.wikipedia.org/wiki/Tmpfs
>
> 4.  Put VM swap devices on the local SSD mirror, or on RAM:
>
> https://en.wikipedia.org/wiki/Zram
>
> 5.  Put VM data drives on NFS.
>
> I am unsure whether it is better to do the "on RAM" and "on NFS"
> ideas at the Xen level or within each VM.  Performance is one
> consideration.  Other considerations are security and accountability
> -- e.g. do customers have root on the VMs?
>
> To improve NFS performance:
>
> 1.  Enlarge the pipe between the NFS server and the switch --
> bonding (your idea), an upgrade to 10 Gb/s, etc.
>
> 2.  Enlarge the pipes between the Xen hosts and the switch.
>
> 3.  Add NICs to the NFS server, add switches, and divide up the Xen
> hosts across the switches.
>
> 4.  Add NICs to the NFS server, one per Xen host, and make direct
> connections between the NFS server and each Xen host.
>
> Please let us know how it goes.  :-)
>
> David
> _______________________________________________
> freebsd-questions@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to
> "freebsd-questions-unsubscribe@freebsd.org"