From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 19:00:29 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 1E4ECB2E; Sun, 20 Jan 2013 19:00:29 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 43E93FDD; Sun, 20 Jan 2013 19:00:27 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id VAA07884; Sun, 20 Jan 2013 21:00:18 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1Tx07u-0005ww-48; Sun, 20 Jan 2013 21:00:18 +0200 Message-ID: <50FC3EBF.6070803@FreeBSD.org> Date: Sun, 20 Jan 2013 21:00:15 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: freebsd-current@FreeBSD.org, freebsd-fs , freebsd-geom@FreeBSD.org Subject: disk "flipped" - a known problem? X-Enigmail-Version: 1.4.6 Content-Type: text/plain; charset=X-VIET-VPS Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 19:00:29 -0000

Today something unusual happened on one of my machines:

kernel: (ada0:ahcich0:0:0:0): lost device
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: (aprobe1:ahcich0:0:15:0): NOP.
ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
kernel: adaasync: Unable to attach to new device due to status 0x6

It looks like the disk disappeared from the bus and then re-appeared on the bus, but not to the OS. One of the partitions that the disk hosted was a swap partition, and it seems to be the cause of some of the following consequences.

The consequences:

* ZFS properly noticed the disappearance of the disk, but its diagnostic was a little bit misleading:

  pool: pond
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        pond                                            DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
            gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0

  Yes, I agree that the disk got removed/lost, but I disagree that "the administrator" did it.

* The geom_event thread started consuming 100% of CPU in g_wither_washer().

* /dev/ada0 disappeared, but camcontrol devlist still reported:
  ada0: at scbus0 target 0 lun 0 (pass0,ada0)

* As seen in the system messages, the CAM layer refused to re-attach the disk.

* The gpart command would just crash.

So, I can explain the behavior of the geom_event thread: apparently swapgeom_orphan doesn't do anything that is really meaningful to GEOM, and so g_wither_washer is stuck waiting until the swap consumer goes away (drops its access bits). (Another sad thing about this state is that I couldn't swapoff the device, because there was no device entry.)
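Incidentally, states like the one shown in that status output are easy to pick out mechanically. A minimal sketch of such a check — the heredoc below merely stands in for live `zpool status` output, which you would pipe in on a real system:

```shell
#!/bin/sh
# Sketch: flag every vdev in `zpool status`-style output that is not ONLINE.
# On a live system you would feed it the real thing:  zpool status | awk '...'
zpool_status() {
cat <<'EOF'
        NAME                                            STATE     READ WRITE CKSUM
        pond                                            DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            12725235722288301230                        REMOVED      0     0     0
            gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0
EOF
}

# Print the name and state of anything that needs attention.
zpool_status | awk '$2 ~ /^(DEGRADED|REMOVED|FAULTED|OFFLINE|UNAVAIL)$/ { print $1, $2 }'
```

A cron job built around this would have flagged the REMOVED vdev immediately, without waiting for someone to read the console log.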
I am not sure if the "attempt to re-allocate valid device" failure was caused by this, but it could be, if something in the CAM layer was waiting for the GEOM layer to be done with the disk.

It would be nice if the swap code properly supported disappearance of the underlying disks. Especially in this case, where the swap was never actually used or touched at all (a few hours after reboot, on a completely idle system).

-- Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 19:53:27 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 2C70A4DE; Sun, 20 Jan 2013 19:53:27 +0000 (UTC) (envelope-from nicolas@i.0x5.de) Received: from n.0x5.de (n.0x5.de [217.197.85.144]) by mx1.freebsd.org (Postfix) with ESMTP id 12C4D1BF; Sun, 20 Jan 2013 19:53:26 +0000 (UTC) Received: by pc5.i.0x5.de (Postfix, from userid 1003) id 3Yq65B40g0z7ySG; Sun, 20 Jan 2013 20:53:18 +0100 (CET) Date: Sun, 20 Jan 2013 20:53:18 +0100 From: Nicolas Rachinsky To: Artem Belevich Subject: Re: slowdown of zfs (tx->tx) Message-ID: <20130120195318.GA24646@mid.pc5.i.0x5.de> References: <20130114195148.GA20540@mid.pc5.i.0x5.de> <20130114214652.GA76779@mid.pc5.i.0x5.de> <20130115224556.GA41774@mid.pc5.i.0x5.de> <20130116073759.GA47781@mid.pc5.i.0x5.de> <20130118112630.GA41074@mid.pc5.i.0x5.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Powered-by: FreeBSD X-Homepage: http://www.rachinsky.de X-PGP-Keyid: 887BAE72 X-PGP-Fingerprint: 039E 9433 115F BC5F F88D 4524 5092 45C4 887B AE72 X-PGP-Keys: http://www.rachinsky.de/nicolas/gpg/nicolas_rachinsky.asc User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-fs X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 19:53:27 -0000

* Artem
Belevich [2013-01-18 08:20 -0800]:
> On Fri, Jan 18, 2013 at 3:26 AM, Nicolas Rachinsky wrote:
> > * Artem Belevich [2013-01-16 00:45 -0800]:
> >> On Tue, Jan 15, 2013 at 11:37 PM, Nicolas Rachinsky wrote:
> >> >> You may want to update your system to very recent FreeBSD as quite a
> >> >> few fixes were recently imported from illumos. Hopefully it will deal
> >> >> with the issue. I'm out of ideas otherwise. Sorry.
> >> >
> >> > Do you mean -CURRENT or -STABLE with very recent? Or just 9.1?
> >>
> >> -HEAD or -STABLE (-8 or -9).
> >
> > I have now updated the machine to stable/8 r245541. I have not updated
> > the zpool.
> >
> > But the problem still occurs. Should I update the pool? Or try other
> > things first?
>
> Updating the pool is an irreversible operation. In general I'd
> suggest trying less drastic options first.
>
> Other people suggested that the problem may be just a side effect of
> almost-full filesystem. ZFS needs fair amount of unfragmented free
> space in order to work efficiently. If that's what's causing your
> problem, then one thing to try would be to free enough free space. The
> gotcha there is that you need to free up enough contiguous space.
> Removing bunch of recently written files may not help as those writes
> would happen on already fragmented FS. Removing files written when FS
> had a lot of free space may have better chance of freeing contiguous
> space. Old snapshots are good candidates for this.

It seems it was indeed too little free space. I copied some data to another machine and deleted it from this zpool (one part was written at the beginning of December, the other at the beginning of January). I removed no snapshots and no filesystems.

After this everything works fine.

Thank you all for your efforts!
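For anyone hitting the same slowdown, the quick thing to check first is how full the pool is. A sketch of such a check — the 80% threshold is a common rule of thumb rather than a documented ZFS constant, and the sample figures below are invented stand-ins for real `zpool list -H -o name,capacity` output:

```shell
#!/bin/sh
# Sketch: warn when a pool is above a capacity threshold.
# Live usage would be:  zpool list -H -o name,capacity | check_pools
THRESHOLD=80

check_pools() {
    while read -r name cap; do
        pct=${cap%\%}                  # strip the trailing '%'
        if [ "$pct" -gt "$THRESHOLD" ]; then
            echo "WARNING: pool $name is ${pct}% full"
        fi
    done
}

# Invented sample figures standing in for real output:
printf '%s\n' 'tank 91%' 'backup 42%' | check_pools
```

Run regularly, this catches the "almost-full pool, fragmented free space" condition before the tx->tx stalls start.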
Nicolas -- http://www.rachinsky.de/nicolas From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 22:26:57 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 67E80C1F; Sun, 20 Jan 2013 22:26:57 +0000 (UTC) (envelope-from zbeeble@gmail.com) Received: from mail-la0-f53.google.com (mail-la0-f53.google.com [209.85.215.53]) by mx1.freebsd.org (Postfix) with ESMTP id C141A8A7; Sun, 20 Jan 2013 22:26:56 +0000 (UTC) Received: by mail-la0-f53.google.com with SMTP id fn20so5431393lab.40 for ; Sun, 20 Jan 2013 14:26:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=JuNTwzghQnLFi7b4kiekbY/7yuz5RXU+skBR1484Yts=; b=jG8/p/Q+tBT0MGMnibYW3KSGnqU4BC0pjxAKeBYrRpvxu6Y7c6dPSzZ4oOon5t9X3p QPrfrHOL5JYV9bfr+oE5lDunivuHHfAy71SvZFFaW6seynZuLX6VlDMa91hhixjcNvJ4 abCmG0u6RvklvMgRrWClvSBswaXki54+gtM1xT23ScDaBiBoB1SasbXXwTlXb5Ft6Kwm jZF2npvpAu+Q5rH9XbPfnWmDVQtf+26o1Yb7GWpbhRZJrkInODdyV4Oc1OnBVUsMDckg UrIbyCMxPeZViXEa+t2jyyUS6WxuWTJkTLHIUxwSMLDDzDb/bpiKjuriFYyAGV9PNkSL Zytw== MIME-Version: 1.0 X-Received: by 10.112.28.9 with SMTP id x9mr6710216lbg.27.1358720810293; Sun, 20 Jan 2013 14:26:50 -0800 (PST) Received: by 10.112.6.38 with HTTP; Sun, 20 Jan 2013 14:26:50 -0800 (PST) Date: Sun, 20 Jan 2013 17:26:50 -0500 Message-ID: Subject: ZFS regimen: scrub, scrub, scrub and scrub again. 
From: Zaphod Beeblebrox To: freebsd-fs , FreeBSD Hackers Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 22:26:57 -0000

Please don't misinterpret this post: ZFS's ability to recover from fairly catastrophic failures is pretty stellar, but I'm wondering if there can be a little room for improvement.

I use RAID pretty much everywhere. I don't like to lose data and disks are cheap. I have a fair amount of experience with all flavors ... and ZFS has become a go-to filesystem for most of my applications.

One of the best recommendations I can give for ZFS is its crash-recoverability. As a counter-example, if you have most hardware RAID going, or a software whole-disk RAID, after a crash it will generally declare one disk as good and the other disk as "to be repaired" ... after which a full surface scan of the affected disks --- reading one and writing the other --- ensues. On my Windows desktop, the pair of 2T's takes 3 or 4 hours to do this. A pair of green 2T's can take over 6. You don't lose any data, but you have severely reduced performance until it's repaired.

The rub is that you know only one or two blocks could possibly even be different ... and that this is a highly unoptimized way of going about the problem.

ZFS is smart on this point: it will recover on reboot with a minimum amount of fuss. Even if you dislodge a drive ... so that it's missing the last 'n' transactions, ZFS seems to figure this out (which I thought deserved extra kudos).

MY PROBLEM comes from problems that scrub can fix.

Let's talk specifically about my home array. It has 9x 1.5T and 8x 2T in a RAID-Z configuration (2 sets, obviously). The drives themselves are housed (4 each) in external drive bays with a single SATA connection for each.
I think I have spoken of this here before. A full scrub of my drives weighs in at 36 hours or so.

Now around Christmas, while moving some things, I managed to pull the plug on one cabinet of 4 drives. It was likely that the only active use of the filesystem was an automated cvs checkin (backup), given that the errors only appeared on the cvs directory.

IN THE END, no data was lost, but I had to scrub 4 times to remove the complaints, which showed up like this in "zpool status -v":

errors: Permanent errors have been detected in the following files:

        vr2/cvs:<0x1c1>

Now ... this is just an example: after each scrub, the hex number was different. As a side note, I also couldn't actually find the error on the cvs filesystem. Not many files are stored there, and they all seemed to be present.

MY TAKEAWAY from this is that 2 major improvements could be made to ZFS:

1) A pause for scrub ... such that long scrubs could be paused during working hours.

2) Going back over errors ... during each scrub, the "new" error was found before the old error was cleared. Then this new error gets similarly cleared by the next scrub. It seems that if the scrub returned to the newly found error after fixing the "known" errors, this could save whole new scrub runs from being required.
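On suggestion 1): short of a real pause, the vfs.zfs.scrub_delay sysctl can be driven from cron to throttle a scrub during working hours and let it run flat out at night. A sketch — the delay values are arbitrary examples, not tuned recommendations:

```shell
#!/bin/sh
# Sketch: pick a vfs.zfs.scrub_delay value by hour of day, for use from cron.
# The delay numbers are arbitrary examples; tune them for your workload.
scrub_delay_for_hour() {
    hour=${1#0}                 # "09" -> "9", avoids any octal confusion
    if [ "$hour" -ge 9 ] && [ "$hour" -lt 18 ]; then
        echo 200                # working hours: throttle the scrub hard
    else
        echo 4                  # off hours: low delay, let the scrub run
    fi
}

delay=$(scrub_delay_for_hour "$(date +%H)")
echo "sysctl vfs.zfs.scrub_delay=$delay"
# On a real system you would actually run:
#   sysctl vfs.zfs.scrub_delay="$delay"
```

Scheduled hourly from cron, this approximates a "pause" without cancelling the scrub and losing its progress.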
From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 22:35:10 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id C001FF16 for ; Sun, 20 Jan 2013 22:35:10 +0000 (UTC) (envelope-from freebsd@deman.com) Received: from plato.corp.nas.com (plato.corp.nas.com [66.114.32.138]) by mx1.freebsd.org (Postfix) with ESMTP id 65BF0911 for ; Sun, 20 Jan 2013 22:35:10 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by plato.corp.nas.com (Postfix) with ESMTP id 251BC12DE30A0 for ; Sun, 20 Jan 2013 14:35:04 -0800 (PST) X-Virus-Scanned: amavisd-new at corp.nas.com Received: from plato.corp.nas.com ([127.0.0.1]) by localhost (plato.corp.nas.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id rqML15eqW17J for ; Sun, 20 Jan 2013 14:35:01 -0800 (PST) Received: from [192.168.0.129] (c-50-135-255-120.hsd1.wa.comcast.net [50.135.255.120]) by plato.corp.nas.com (Postfix) with ESMTPSA id 9DB1712DE3095 for ; Sun, 20 Jan 2013 14:35:01 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. From: Michael DeMan In-Reply-To: Date: Sun, 20 Jan 2013 14:34:59 -0800 Content-Transfer-Encoding: 7bit Message-Id: References: To: freebsd-fs X-Mailer: Apple Mail (2.1499) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 22:35:10 -0000 +1 on being able to 'pause scrubs' - that would be awesome. - Mike On Jan 20, 2013, at 2:26 PM, Zaphod Beeblebrox wrote: > Please don't misinterpret this post: ZFS's ability to recover from fairly > catastrophic failures is pretty stellar, but I'm wondering if there can be > a little room for improvement. > > I use RAID pretty much everywhere. I don't like to loose data and disks > are cheap. 
> [...]
> > IN-THE-END, no data was lost, but I had to scrub 4 times to remove the > complaints, which showed like this from "zpool status -v" > > errors: Permanent errors have been detected in the following files: > > vr2/cvs:<0x1c1> > > Now ... this is just an example: after each scrub, the hex number was > different. I also couldn't actually find the error on the cvs filesystem, > as a side note. Not many files are stored there, and they all seemed to be > present. > > MY TAKEAWAY from this is that 2 major improvements could be made to ZFS: > > 1) a pause for scrub... such that long scrubs could be paused during > working hours. > > 2) going back over errors... during each scrub, the "new" error was found > before the old error was cleared. Then this new error gets similarly > cleared by the next scrub. It seems that if the scrub returned to this new > found error after fixing the "known" errors, this could save whole new > scrub runs from being required. > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 22:45:38 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A3DBC142 for ; Sun, 20 Jan 2013 22:45:38 +0000 (UTC) (envelope-from zbeeble@gmail.com) Received: from mail-lb0-f177.google.com (mail-lb0-f177.google.com [209.85.217.177]) by mx1.freebsd.org (Postfix) with ESMTP id 16748972 for ; Sun, 20 Jan 2013 22:45:37 +0000 (UTC) Received: by mail-lb0-f177.google.com with SMTP id go11so22192lbb.8 for ; Sun, 20 Jan 2013 14:45:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=1hUyktHLhe6V5OBGW0mX3Wgk6Pae3DxLQ9M45huJENs=; 
b=OByNRbHBtEEgJToj1zQp8V40/jMqzymC9x/gRzST4MFHxw+/5rmBVZ0iXD/qj7uRvw +Ii+nf+VNBh2DM5zXBZ+sBu7Ys/5xYxLtfaROsmSZUK/YcyVpvr1SY9ZZrcgSaEBfwUs k9iigLLm1pIBHH1y+tn0Xnyg3iomnCmtz8RsurOdeydlDZ8rHsqvW+QtL9LDwBQlmL8v oH3OL/mJK2z8i/7DeIt5PV/8ba13zGEBNekNCl6pY0/0gLoYm0kV+X3ofCkmeURt9wtQ euVlfJ6hpwHpOFssoXLnXD8uC733Ss6TjJigAfpBytpCEOyP+bDZokvMO8wiNou03pN2 JLDg== MIME-Version: 1.0 X-Received: by 10.152.147.103 with SMTP id tj7mr15502313lab.54.1358721931142; Sun, 20 Jan 2013 14:45:31 -0800 (PST) Received: by 10.112.6.38 with HTTP; Sun, 20 Jan 2013 14:45:31 -0800 (PST) Date: Sun, 20 Jan 2013 17:45:31 -0500 Message-ID: Subject: ZFS _receiving_ TRIM ? (unmap) From: Zaphod Beeblebrox To: freebsd-fs Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 22:45:38 -0000

So... I see that using TRIM is a scheduled feature for ZFS. This is good --- ZFS's policy of never overwriting active data makes its use of TRIM all the more important.

But I'm concerned about the other side. If I have an iSCSI disk exported from ZFS, will ZFS receive the "UNMAP" (SCSI's TRIM) and free those blocks?
From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 23:30:37 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C6D82A82 for ; Sun, 20 Jan 2013 23:30:37 +0000 (UTC) (envelope-from william.devries@gmail.com) Received: from mail-qa0-f45.google.com (mail-qa0-f45.google.com [209.85.216.45]) by mx1.freebsd.org (Postfix) with ESMTP id 6C2B8B0A for ; Sun, 20 Jan 2013 23:30:37 +0000 (UTC) Received: by mail-qa0-f45.google.com with SMTP id bv4so2567081qab.11 for ; Sun, 20 Jan 2013 15:30:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=V0SohnDnW2m1N8OG51EPu0MdmxLkd4VHLoVmNPDmF8g=; b=msiApuy2V2xWeYyKjzYRfFR+Y5kD+g/BfnOzb4+E9XNZG2spYw5E+AWy9frq0lD3sT VTmHT7il/MN6uNScbTa2bhkXBgAVu9Pmy+AB+bImhydLAJg6J3MGZ3bI/OK0a6gFFqcp P9jvWyscOUfX5dtzEF4/wrTCscfpLMZkSbw4N1ckk0rNCk3MbiQ1SAj/lVrITzlVXMet lmklGWXpn2QjAK1RIUuKVSuXQcipm7LRIznUrbI9cbhrdLuK8nm7U20FsptsDNuCzKYA Kqq8cb+qqMq5AO8uBhBT0D6KwkdMSAZfIe6KqmhXuf0gSVSV0Hp1yXZ82WPAblwJCS/j UQUA== MIME-Version: 1.0 X-Received: by 10.49.24.135 with SMTP id u7mr20289110qef.4.1358724630304; Sun, 20 Jan 2013 15:30:30 -0800 (PST) Received: by 10.49.29.5 with HTTP; Sun, 20 Jan 2013 15:30:30 -0800 (PST) Date: Sun, 20 Jan 2013 15:30:30 -0800 Message-ID: Subject: Read-only port of NetBSD's UDF filesystem. 
From: Will DeVries To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 23:30:37 -0000

I have been working on a read-only port of NetBSD's UDF file system implementation, which I now believe to be complete except for any bug fixes that may arise. This file system supports UDF versions through 2.60 on CDs, DVDs and Blu-rays. While it could use more testing, it seems to be stable and working well, and now seems like a good time to publish it for review. At the very least, I can judge interest and get advice on aspects that perhaps need more work.

The code can be found at https://github.com/williamdevries/UDF, and installation instructions are present in the README file. It should compile and work correctly under both 9-STABLE and CURRENT. For full functionality it requires an additional ioctl in the scsi_cd driver, for which there is a patch in the repository. The patch also adds 'mount_udf2' as an external mount command to the 'mount' command and creates a header file needed for compilation.

The file system was named 'udf2' so that it can coexist with the much better-tested file system already in the base. At some point, the module should be renamed back to 'udf'. (A version of this code was posted to this list once before by Oleksandr Dudinskyi, but this code does not contain any of his changes.)
Will DeVries From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 23:40:09 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 38406CE5 for ; Sun, 20 Jan 2013 23:40:09 +0000 (UTC) (envelope-from baptiste.daroussin@gmail.com) Received: from mail-ea0-f178.google.com (mail-ea0-f178.google.com [209.85.215.178]) by mx1.freebsd.org (Postfix) with ESMTP id BD11EB56 for ; Sun, 20 Jan 2013 23:40:08 +0000 (UTC) Received: by mail-ea0-f178.google.com with SMTP id a14so2278390eaa.23 for ; Sun, 20 Jan 2013 15:40:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:date:from:to:subject:message-id:mime-version :content-type:content-disposition:user-agent; bh=iF8IpJivkVJw4QBWdv+lHNJmmeIKl8MhL8Iop6hhMyo=; b=U8QHQSUFEfI2bnRzakKt7uLfv6ao3quwTTwYXn0TUdHZKZrc1ZSPgdnF603yThrkps q4nZj2LhiVQvZzVsF6iUd3Vzne2slbVXH6AY2cPcUb3VCaUrUEV8miy95RfqB8C4Damr rYP3/Zi7+ymiWcFTfzcz6f4//u128vCJcd01rq2HxQatucB7r+/ZlCn5romk8inymR// yu6D8Sa2fS++28ISwky9wsWqWp+wf2lzFaZlB9pM4+GbZ1LOOizy862kHIlc5hv/X9Dj AAk6FCsE2sFToRvfz2LT7Ih90xZHixP9XVx6zSaZoV4C5l7nyEbAGfylPeJZej5ZguRl 2v+g== X-Received: by 10.14.225.133 with SMTP id z5mr867644eep.15.1358725207632; Sun, 20 Jan 2013 15:40:07 -0800 (PST) Received: from ithaqua.etoilebsd.net (ithaqua.etoilebsd.net. 
[37.59.37.188]) by mx.google.com with ESMTPS id 6sm19712705eea.3.2013.01.20.15.40.06 (version=TLSv1 cipher=RC4-SHA bits=128/128); Sun, 20 Jan 2013 15:40:06 -0800 (PST) Sender: Baptiste Daroussin Date: Mon, 21 Jan 2013 00:40:04 +0100 From: Baptiste Daroussin To: fs@FreeBSD.org Subject: pkgng 1.1 beta best way to make people tests Message-ID: <20130120234004.GB782@ithaqua.etoilebsd.net> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="QTprm0S8XgL7H0Dt" Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 23:40:09 -0000 --QTprm0S8XgL7H0Dt Content-Type: text/plain; charset=us-ascii Content-Disposition: inline

Hi all,

As a first mail on this list, I want to open a discussion on what would be the best way to handle the beta phase of pkgng in the ports tree.

We are about to enter a beta phase, and we want to add the beta to the ports tree so that people can test it easily. We also want users to keep 1.0 as the default.

We have 2 possibilities here:

1- Add WITH_PKGNG_DEVEL=yes to the actual ports-mgmt/pkg port, switching the 1.0 version to 1.1b1, a bit like WITH_NEW_XORG. Bonus: really simple, and it can only be activated on purpose by the user.

2- Add a ports-mgmt/pkg-devel port with the new beta, which will conflict with ports-mgmt/pkg.

One difficulty to remember: the local.sqlite database upgrades smoothly from 1.0 to 1.1, but cannot be switched back.
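For what it's worth, option 1 would come down to a one-line opt-in in /etc/make.conf before rebuilding ports-mgmt/pkg. A sketch only — the knob name follows the proposal in this mail and is not committed anywhere:

```make
# /etc/make.conf -- opt in to the pkg 1.1 beta (hypothetical knob, option 1)
WITH_PKGNG_DEVEL=yes
```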
So far I plan a two-month beta/rc phase for pkg 1.1. I think the code is about as robust as the 1.0 branch is and should not bring too many surprises; of course, there is no date-based release: it will be out when ready :)

Regards,
Bapt

--QTprm0S8XgL7H0Dt Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlD8gFQACgkQ8kTtMUmk6Ez4nQCfUjH+kUrSAURUZDAiCHQPjT43 GfsAn1aaOInUg8+WJ1kp97K7zudS/fM0 =Bz9m -----END PGP SIGNATURE----- --QTprm0S8XgL7H0Dt-- From owner-freebsd-fs@FreeBSD.ORG Sun Jan 20 23:40:47 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 8CDC4D5C for ; Sun, 20 Jan 2013 23:40:47 +0000 (UTC) (envelope-from baptiste.daroussin@gmail.com) Received: from mail-ea0-f170.google.com (mail-ea0-f170.google.com [209.85.215.170]) by mx1.freebsd.org (Postfix) with ESMTP id 1E3E8B63 for ; Sun, 20 Jan 2013 23:40:46 +0000 (UTC) Received: by mail-ea0-f170.google.com with SMTP id a11so2182799eaa.1 for ; Sun, 20 Jan 2013 15:40:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:date:from:to:subject:message-id:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=Qa1OYSJfahUVHdN1nN7575NfUuiqoJ2OxB0WSM0EXWM=; b=IlzMAJ0q2kM3X1lSyMNJIjyHv/FEhUvL0otLtTWc36brLX6ENteKxZ8g9NEQUAjpj+ ajHah7I8JJ8gAhZ9OnEbdaLoSSt6BlkB7WXBPb86fKs+EnVBnbUoOrdMa9DGMR2XIO/k Ue6w1Jb0q3GByPm59RcGtAkhh5nIp0GcySBsMsYB4E8c2MJT5vpbz351BFtMKhnz/CMV xMlJjFQgIu3/G5OB7Dhfko92D3ZsYYfyro0I3brrSaSoAPy++D9z5J/UKaOI7Xzf7vot s4v0yXYuWTyeT3m9D30A9Yf4a3BMDDArZnTMSmb8Kn1fq4FtIf0zMVhOnBEL232bFfPB zuRQ== X-Received: by 10.14.2.5 with SMTP id 5mr16301769eee.30.1358725246158; Sun, 20 Jan 2013 15:40:46 -0800 (PST) Received: from ithaqua.etoilebsd.net (ithaqua.etoilebsd.net.
[37.59.37.188]) by mx.google.com with ESMTPS id t44sm19715782eeo.2.2013.01.20.15.40.44 (version=TLSv1 cipher=RC4-SHA bits=128/128); Sun, 20 Jan 2013 15:40:45 -0800 (PST) Sender: Baptiste Daroussin Date: Mon, 21 Jan 2013 00:40:43 +0100 From: Baptiste Daroussin To: fs@FreeBSD.org Subject: Re: pkgng 1.1 beta best way to make people tests Message-ID: <20130120234042.GC782@ithaqua.etoilebsd.net> References: <20130120234004.GB782@ithaqua.etoilebsd.net> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="oTHb8nViIGeoXxdp" Content-Disposition: inline In-Reply-To: <20130120234004.GB782@ithaqua.etoilebsd.net> User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Jan 2013 23:40:47 -0000 --oTHb8nViIGeoXxdp Content-Type: text/plain; charset=us-ascii Content-Disposition: inline

On Mon, Jan 21, 2013 at 12:40:04AM +0100, Baptiste Daroussin wrote:
> Hi all,
>
> As a first mail on this list, I want to open a discussion on what would be the
> best way to handle the beta phase of pkgng in the ports tree.
>
> We are about to enter in a beta phase and we want to add the beta to the ports
> tree so that people can test it easily. We also want the users to keep having
> the 1.0 as default.
>
> We have 2 possibilities here:
> 1- add WITH_PKGNG_DEVEL=yes to the actual ports-mgmt/pkg port switching the 1.0
> version to 1.1b1 a bit like WITH_NEW_XORG. bonus: really simple, can only be
> activated on purpose by the user.
>
> 2- add a ports-mgmt/pkg-devel with the new beta which will conflict with
> ports-mgmt/pkg.
>
> As a difficulty remember that the local.sqlite smoothly upgrade from 1.0 to 1.1
> but cannot be switch backward.
>
> So far I plan a two month beta/rc phase for pkg 1.1 I think the code is quite as
> robust as the 1.0 branch is and should not bring too much surprises, of course,
> no date base release, it will be out when ready :)
>
> Regards,
> Bapt

Sorry all, wrong mailing list :(

regards,
Bapt

--oTHb8nViIGeoXxdp Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlD8gHoACgkQ8kTtMUmk6EwzYQCcC5ZPVZPIsqILIWXmcVQJvLTM OOwAn1Y8kp+PwaJKRbeZrW+ub+r8h5mW =gBWh -----END PGP SIGNATURE----- --oTHb8nViIGeoXxdp-- From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 07:02:56 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C14442E8; Mon, 21 Jan 2013 07:02:56 +0000 (UTC) (envelope-from bra@fsn.hu) Received: from people.fsn.hu (people.fsn.hu [195.228.252.137]) by mx1.freebsd.org (Postfix) with ESMTP id 21EF8E54; Mon, 21 Jan 2013 07:02:55 +0000 (UTC) Received: by people.fsn.hu (Postfix, from userid 1001) id E6FE9F8FFE4; Mon, 21 Jan 2013 07:53:01 +0100 (CET) Received: from japan.t-online.private (japan.t-online.co.hu [195.228.243.99]) by people.fsn.hu (Postfix) with ESMTPSA id F0D9DF8FFD5; Mon, 21 Jan 2013 07:52:58 +0100 (CET) Message-ID: <50FCE5CA.7080006@fsn.hu> Date: Mon, 21 Jan 2013 07:52:58 +0100 From: Attila Nagy MIME-Version: 1.0 To: Zaphod Beeblebrox Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 07:02:56 -0000

Hi,

On 01/20/13 23:26, Zaphod Beeblebrox wrote:
>
> 1) a pause for scrub... such that long scrubs could be paused during
> working hours.
>

While not exactly a pause, wouldn't playing with scrub_delay work here?

vfs.zfs.scrub_delay: Number of ticks to delay scrub

Set this to a high value during working hours, and set it back to its normal (or even lower) value outside working hours.
(maybe the resilver delay or some other values should also be set; I haven't read the relevant code yet) From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 08:49:31 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 5F523E05 for ; Mon, 21 Jan 2013 08:49:31 +0000 (UTC) (envelope-from Alexei.Volkov@softlynx.ru) Received: from softlynx.ru (softlynx.ru [95.66.187.155]) by mx1.freebsd.org (Postfix) with ESMTP id 08C788A3 for ; Mon, 21 Jan 2013 08:49:30 +0000 (UTC) Received: from [10.192.61.11] (unknown [95.66.153.7]) by softlynx.ru (Postfix) with ESMTPSA id 02BCF11F04E for ; Mon, 21 Jan 2013 12:49:29 +0400 (MSK) Message-ID: <50FD0013.8060905@softlynx.ru> Date: Mon, 21 Jan 2013 12:45:07 +0400 From: Alexei Volkov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.11) Gecko/20121204 Thunderbird/10.0.11 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Fail to use ZFS ZVOL as a gmirror component Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 08:49:31 -0000 Hi there! In the most recent FreeBSD 9.1-RELEASE I found a serious issue with zvols and the GEOM layer. For instance, it is impossible to use a zvol as a gmirror component. I have posted a bug report at http://www.freebsd.org/cgi/query-pr.cgi?pr=175323 and look forward to fixes. Does anyone face the same issue, or have any suggestions or workarounds?
-- Best regards Alexei Volkov From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 09:26:03 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 64899145 for ; Mon, 21 Jan 2013 09:26:03 +0000 (UTC) (envelope-from prvs=1733fbe9f2=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id EF225A57 for ; Mon, 21 Jan 2013 09:26:02 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001787757.msg for ; Mon, 21 Jan 2013 09:26:01 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Mon, 21 Jan 2013 09:26:01 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1733fbe9f2=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk X-MDaemon-Deliver-To: freebsd-fs@freebsd.org Message-ID: <71300B07D20A4B1D900C850AB4450526@multiplay.co.uk> From: "Steven Hartland" To: "Zaphod Beeblebrox" , "freebsd-fs" References: Subject: Re: ZFS _receiving_ TRIM ? (unmap) Date: Mon, 21 Jan 2013 09:26:30 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 09:26:03 -0000 ----- Original Message ----- From: "Zaphod Beeblebrox" To: "freebsd-fs" Sent: Sunday, January 20, 2013 10:45 PM Subject: ZFS _receiving_ TRIM ? (unmap) > So... I see that using TRIM is a scheduled feature for ZFS. 
This is good > --- ZFS's policy of never overwriting active data makes its use of TRIM > all the more important. > > But I'm concerned about the other side. If I have an iSCSI disk exported > from ZFS, will ZFS receive the "UNMAP" (SCSI's TRIM) and free those blocks? The underlying disk UNMAP / TRIM support depends on implementation at the driver level. For controllers whose driver uses CAM, e.g. scsi (da) and ata (ada), this is done there. Having a quick glance at our iSCSI initiator, it looks like it's based on CAM, which means it should work, assuming the target supports UNMAP, or TRIM via ata_pass. TRIM via ata_pass requires a patch I'm working on here, which works fine on 8.3, but I haven't had a chance to port it to head or 9/stable yet. If you have your setup working with a TRIM-compatible ZFS or UFS, you can check the values of kern.cam.da.X.delete_method, where X is the disk number. If you're going to test this, ensure you have r239655 in your kernel; otherwise you could get some nasty results. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk.
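The kern.cam.da.X.delete_method check Steve describes can be scripted across several disks. A rough sketch — the wrapper functions and output wording are mine, only the sysctl name comes from the mail, and "NONE" is assumed to be the value reported when no delete method is active:

```shell
#!/bin/sh
# Report which BIO_DELETE method CAM chose for a given da(4) disk number.
# The sysctl lookup is wrapped in its own function so the logic can be
# exercised (or stubbed out) on a machine without the actual hardware.
delete_method() {
    # $1 = disk number; fall back to NONE when the sysctl is unavailable
    sysctl -n "kern.cam.da.$1.delete_method" 2>/dev/null || echo "NONE"
}

report_disk() {
    m=$(delete_method "$1")
    if [ "$m" = "NONE" ]; then
        echo "da$1: no UNMAP/TRIM delete support"
    else
        echo "da$1: deletes issued via $m"
    fi
}

# Example: check the first three da disks.
#   for d in 0 1 2; do report_disk "$d"; done
```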
From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 11:06:45 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 02B91C41 for ; Mon, 21 Jan 2013 11:06:45 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id E87AA715 for ; Mon, 21 Jan 2013 11:06:44 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0LB6iq1054052 for ; Mon, 21 Jan 2013 11:06:44 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0LB6iXG054050 for freebsd-fs@FreeBSD.org; Mon, 21 Jan 2013 11:06:44 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 21 Jan 2013 11:06:44 GMT Message-Id: <201301211106.r0LB6iXG054050@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-fs@FreeBSD.org Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 11:06:45 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. 
Description -------------------------------------------------------------------------------- o kern/175179 fs [zfs] ZFS may attach wrong device on move o kern/175071 fs [ufs] [panic] softdep_deallocate_dependencies: unrecov o kern/174372 fs [zfs] Pagefault appears to be related to ZFS o kern/174315 fs [zfs] chflags uchg not supported o kern/174310 fs [zfs] root point mounting broken on CURRENT with multi o kern/174279 fs [ufs] UFS2-SU+J journal and filesystem corruption o kern/174060 fs [ext2fs] Ext2FS system crashes (buffer overflow?) o kern/173830 fs [zfs] Brain-dead simple change to ZFS error descriptio o kern/173718 fs [zfs] phantom directory in zraid2 pool f kern/173657 fs [nfs] strange UID map with nfsuserd o kern/173363 fs [zfs] [panic] Panic on 'zpool replace' on readonly poo o kern/173136 fs [unionfs] mounting above the NFS read-only share panic o kern/172348 fs [unionfs] umount -f of filesystem in use with readonly o kern/172334 fs [unionfs] unionfs permits recursive union mounts; caus o kern/171626 fs [tmpfs] tmpfs should be noisier when the requested siz o kern/171415 fs [zfs] zfs recv fails with "cannot receive incremental o kern/170945 fs [gpt] disk layout not portable between direct connect o bin/170778 fs [zfs] [panic] FreeBSD panics randomly o kern/170680 fs [nfs] Multiple NFS Client bug in the FreeBSD 7.4-RELEA o kern/170497 fs [xfs][panic] kernel will panic whenever I ls a mounted o kern/169945 fs [zfs] [panic] Kernel panic while importing zpool (afte o kern/169480 fs [zfs] ZFS stalls on heavy I/O o kern/169398 fs [zfs] Can't remove file with permanent error o kern/169339 fs panic while " : > /etc/123" o kern/169319 fs [zfs] zfs resilver can't complete o kern/168947 fs [nfs] [zfs] .zfs/snapshot directory is messed up when o kern/168942 fs [nfs] [hang] nfsd hangs after being restarted (not -HU o kern/168158 fs [zfs] incorrect parsing of sharenfs options in zfs (fs o kern/167979 fs [ufs] DIOCGDINFO ioctl does not work on 8.2 file syste o 
kern/167977 fs [smbfs] mount_smbfs results are differ when utf-8 or U o kern/167688 fs [fusefs] Incorrect signal handling with direct_io o kern/167685 fs [zfs] ZFS on USB drive prevents shutdown / reboot o kern/167612 fs [portalfs] The portal file system gets stuck inside po o kern/167272 fs [zfs] ZFS Disks reordering causes ZFS to pick the wron o kern/167260 fs [msdosfs] msdosfs disk was mounted the second time whe o kern/167109 fs [zfs] [panic] zfs diff kernel panic Fatal trap 9: gene o kern/167105 fs [nfs] mount_nfs can not handle source exports wiht mor o kern/167067 fs [zfs] [panic] ZFS panics the server o kern/167065 fs [zfs] boot fails when a spare is the boot disk o kern/167048 fs [nfs] [patch] RELEASE-9 crash when using ZFS+NULLFS+NF o kern/166912 fs [ufs] [panic] Panic after converting Softupdates to jo o kern/166851 fs [zfs] [hang] Copying directory from the mounted UFS di o kern/166477 fs [nfs] NFS data corruption. o kern/165950 fs [ffs] SU+J and fsck problem o kern/165923 fs [nfs] Writing to NFS-backed mmapped files fails if flu o kern/165521 fs [zfs] [hang] livelock on 1 Gig of RAM with zfs when 31 o kern/165392 fs Multiple mkdir/rmdir fails with errno 31 o kern/165087 fs [unionfs] lock violation in unionfs o kern/164472 fs [ufs] fsck -B panics on particular data inconsistency o kern/164370 fs [zfs] zfs destroy for snapshot fails on i386 and sparc o kern/164261 fs [nullfs] [patch] fix panic with NFS served from NULLFS o kern/164256 fs [zfs] device entry for volume is not created after zfs o kern/164184 fs [ufs] [panic] Kernel panic with ufs_makeinode o kern/163801 fs [md] [request] allow mfsBSD legacy installed in 'swap' o kern/163770 fs [zfs] [hang] LOR between zfs&syncer + vnlru leading to o kern/163501 fs [nfs] NFS exporting a dir and a subdir in that dir to o kern/162944 fs [coda] Coda file system module looks broken in 9.0 o kern/162860 fs [zfs] Cannot share ZFS filesystem to hosts with a hyph o kern/162751 fs [zfs] [panic] kernel panics during 
file operations o kern/162591 fs [nullfs] cross-filesystem nullfs does not work as expe o kern/162519 fs [zfs] "zpool import" relies on buggy realpath() behavi o kern/162362 fs [snapshots] [panic] ufs with snapshot(s) panics when g o kern/161968 fs [zfs] [hang] renaming snapshot with -r including a zvo o kern/161864 fs [ufs] removing journaling from UFS partition fails on o bin/161807 fs [patch] add option for explicitly specifying metadata o kern/161579 fs [smbfs] FreeBSD sometimes panics when an smb share is o kern/161533 fs [zfs] [panic] zfs receive panic: system ioctl returnin o kern/161438 fs [zfs] [panic] recursed on non-recursive spa_namespace_ o kern/161424 fs [nullfs] __getcwd() calls fail when used on nullfs mou o kern/161280 fs [zfs] Stack overflow in gptzfsboot o kern/161205 fs [nfs] [pfsync] [regression] [build] Bug report freebsd o kern/161169 fs [zfs] [panic] ZFS causes kernel panic in dbuf_dirty o kern/161112 fs [ufs] [lor] filesystem LOR in FreeBSD 9.0-BETA3 o kern/160893 fs [zfs] [panic] 9.0-BETA2 kernel panic o kern/160860 fs [ufs] Random UFS root filesystem corruption with SU+J o kern/160801 fs [zfs] zfsboot on 8.2-RELEASE fails to boot from root-o o kern/160790 fs [fusefs] [panic] VPUTX: negative ref count with FUSE o kern/160777 fs [zfs] [hang] RAID-Z3 causes fatal hang upon scrub/impo o kern/160706 fs [zfs] zfs bootloader fails when a non-root vdev exists o kern/160591 fs [zfs] Fail to boot on zfs root with degraded raidz2 [r o kern/160410 fs [smbfs] [hang] smbfs hangs when transferring large fil o kern/160283 fs [zfs] [patch] 'zfs list' does abort in make_dataset_ha o kern/159930 fs [ufs] [panic] kernel core o kern/159402 fs [zfs][loader] symlinks cause I/O errors o kern/159357 fs [zfs] ZFS MAXNAMELEN macro has confusing name (off-by- o kern/159356 fs [zfs] [patch] ZFS NAME_ERR_DISKLIKE check is Solaris-s o kern/159351 fs [nfs] [patch] - divide by zero in mountnfs() o kern/159251 fs [zfs] [request]: add FLETCHER4 as DEDUP hash option o 
kern/159077 fs [zfs] Can't cd .. with latest zfs version o kern/159048 fs [smbfs] smb mount corrupts large files o kern/159045 fs [zfs] [hang] ZFS scrub freezes system o kern/158839 fs [zfs] ZFS Bootloader Fails if there is a Dead Disk o kern/158802 fs amd(8) ICMP storm and unkillable process. o kern/158231 fs [nullfs] panic on unmounting nullfs mounted over ufs o f kern/157929 fs [nfs] NFS slow read o kern/157399 fs [zfs] trouble with: mdconfig force delete && zfs strip o kern/157179 fs [zfs] zfs/dbuf.c: panic: solaris assert: arc_buf_remov o kern/156797 fs [zfs] [panic] Double panic with FreeBSD 9-CURRENT and o kern/156781 fs [zfs] zfs is losing the snapshot directory, p kern/156545 fs [ufs] mv could break UFS on SMP systems o kern/156193 fs [ufs] [hang] UFS snapshot hangs && deadlocks processes o kern/156039 fs [nullfs] [unionfs] nullfs + unionfs do not compose, re o kern/155615 fs [zfs] zfs v28 broken on sparc64 -current o kern/155587 fs [zfs] [panic] kernel panic with zfs p kern/155411 fs [regression] [8.2-release] [tmpfs]: mount: tmpfs : No o kern/155199 fs [ext2fs] ext3fs mounted as ext2fs gives I/O errors o bin/155104 fs [zfs][patch] use /dev prefix by default when importing o kern/154930 fs [zfs] cannot delete/unlink file from full volume -> EN o kern/154828 fs [msdosfs] Unable to create directories on external USB o kern/154491 fs [smbfs] smb_co_lock: recursive lock for object 1 p kern/154228 fs [md] md getting stuck in wdrain state o kern/153996 fs [zfs] zfs root mount error while kernel is not located o kern/153753 fs [zfs] ZFS v15 - grammatical error when attempting to u o kern/153716 fs [zfs] zpool scrub time remaining is incorrect o kern/153695 fs [patch] [zfs] Booting from zpool created on 4k-sector o kern/153680 fs [xfs] 8.1 failing to mount XFS partitions o kern/153418 fs [zfs] [panic] Kernel Panic occurred writing to zfs vol o kern/153351 fs [zfs] locking directories/files in ZFS o bin/153258 fs [patch][zfs] creating ZVOLs requires 
`refreservation' s kern/153173 fs [zfs] booting from a gzip-compressed dataset doesn't w o bin/153142 fs [zfs] ls -l outputs `ls: ./.zfs: Operation not support o kern/153126 fs [zfs] vdev failure, zpool=peegel type=vdev.too_small o kern/152022 fs [nfs] nfs service hangs with linux client [regression] o kern/151942 fs [zfs] panic during ls(1) zfs snapshot directory o kern/151905 fs [zfs] page fault under load in /sbin/zfs o bin/151713 fs [patch] Bug in growfs(8) with respect to 32-bit overfl o kern/151648 fs [zfs] disk wait bug o kern/151629 fs [fs] [patch] Skip empty directory entries during name o kern/151330 fs [zfs] will unshare all zfs filesystem after execute a o kern/151326 fs [nfs] nfs exports fail if netgroups contain duplicate o kern/151251 fs [ufs] Can not create files on filesystem with heavy us o kern/151226 fs [zfs] can't delete zfs snapshot o kern/150503 fs [zfs] ZFS disks are UNAVAIL and corrupted after reboot o kern/150501 fs [zfs] ZFS vdev failure vdev.bad_label on amd64 o kern/150390 fs [zfs] zfs deadlock when arcmsr reports drive faulted o kern/150336 fs [nfs] mountd/nfsd became confused; refused to reload n o kern/149208 fs mksnap_ffs(8) hang/deadlock o kern/149173 fs [patch] [zfs] make OpenSolaris installa o kern/149015 fs [zfs] [patch] misc fixes for ZFS code to build on Glib o kern/149014 fs [zfs] [patch] declarations in ZFS libraries/utilities o kern/149013 fs [zfs] [patch] make ZFS makefiles use the libraries fro o kern/148504 fs [zfs] ZFS' zpool does not allow replacing drives to be o kern/148490 fs [zfs]: zpool attach - resilver bidirectionally, and re o kern/148368 fs [zfs] ZFS hanging forever on 8.1-PRERELEASE o kern/148138 fs [zfs] zfs raidz pool commands freeze o kern/147903 fs [zfs] [panic] Kernel panics on faulty zfs device o kern/147881 fs [zfs] [patch] ZFS "sharenfs" doesn't allow different " o kern/147420 fs [ufs] [panic] ufs_dirbad, nullfs, jail panic (corrupt o kern/146941 fs [zfs] [panic] Kernel Double Fault - Happens 
constantly o kern/146786 fs [zfs] zpool import hangs with checksum errors o kern/146708 fs [ufs] [panic] Kernel panic in softdep_disk_write_compl o kern/146528 fs [zfs] Severe memory leak in ZFS on i386 o kern/146502 fs [nfs] FreeBSD 8 NFS Client Connection to Server s kern/145712 fs [zfs] cannot offline two drives in a raidz2 configurat o kern/145411 fs [xfs] [panic] Kernel panics shortly after mounting an f bin/145309 fs bsdlabel: Editing disk label invalidates the whole dev o kern/145272 fs [zfs] [panic] Panic during boot when accessing zfs on o kern/145246 fs [ufs] dirhash in 7.3 gratuitously frees hashes when it o kern/145238 fs [zfs] [panic] kernel panic on zpool clear tank o kern/145229 fs [zfs] Vast differences in ZFS ARC behavior between 8.0 o kern/145189 fs [nfs] nfsd performs abysmally under load o kern/144929 fs [ufs] [lor] vfs_bio.c + ufs_dirhash.c p kern/144447 fs [zfs] sharenfs fsunshare() & fsshare_main() non functi o kern/144416 fs [panic] Kernel panic on online filesystem optimization s kern/144415 fs [zfs] [panic] kernel panics on boot after zfs crash o kern/144234 fs [zfs] Cannot boot machine with recent gptzfsboot code o kern/143825 fs [nfs] [panic] Kernel panic on NFS client o bin/143572 fs [zfs] zpool(1): [patch] The verbose output from iostat o kern/143212 fs [nfs] NFSv4 client strange work ... o kern/143184 fs [zfs] [lor] zfs/bufwait LOR o kern/142878 fs [zfs] [vfs] lock order reversal o kern/142597 fs [ext2fs] ext2fs does not work on filesystems with real o kern/142489 fs [zfs] [lor] allproc/zfs LOR o kern/142466 fs Update 7.2 -> 8.0 on Raid 1 ends with screwed raid [re o kern/142306 fs [zfs] [panic] ZFS drive (from OSX Leopard) causes two o kern/142068 fs [ufs] BSD labels are got deleted spontaneously o kern/141897 fs [msdosfs] [panic] Kernel panic. 
msdofs: file name leng o kern/141463 fs [nfs] [panic] Frequent kernel panics after upgrade fro o kern/141305 fs [zfs] FreeBSD ZFS+sendfile severe performance issues ( o kern/141091 fs [patch] [nullfs] fix panics with DIAGNOSTIC enabled o kern/141086 fs [nfs] [panic] panic("nfs: bioread, not dir") on FreeBS o kern/141010 fs [zfs] "zfs scrub" fails when backed by files in UFS2 o kern/140888 fs [zfs] boot fail from zfs root while the pool resilveri o kern/140661 fs [zfs] [patch] /boot/loader fails to work on a GPT/ZFS- o kern/140640 fs [zfs] snapshot crash o kern/140068 fs [smbfs] [patch] smbfs does not allow semicolon in file o kern/139725 fs [zfs] zdb(1) dumps core on i386 when examining zpool c o kern/139715 fs [zfs] vfs.numvnodes leak on busy zfs p bin/139651 fs [nfs] mount(8): read-only remount of NFS volume does n o kern/139407 fs [smbfs] [panic] smb mount causes system crash if remot o kern/138662 fs [panic] ffs_blkfree: freeing free block o kern/138421 fs [ufs] [patch] remove UFS label limitations o kern/138202 fs mount_msdosfs(1) see only 2Gb o kern/136968 fs [ufs] [lor] ufs/bufwait/ufs (open) o kern/136945 fs [ufs] [lor] filedesc structure/ufs (poll) o kern/136944 fs [ffs] [lor] bufwait/snaplk (fsync) o kern/136873 fs [ntfs] Missing directories/files on NTFS volume o kern/136865 fs [nfs] [patch] NFS exports atomic and on-the-fly atomic p kern/136470 fs [nfs] Cannot mount / in read-only, over NFS o kern/135546 fs [zfs] zfs.ko module doesn't ignore zpool.cache filenam o kern/135469 fs [ufs] [panic] kernel crash on md operation in ufs_dirb o kern/135050 fs [zfs] ZFS clears/hides disk errors on reboot o kern/134491 fs [zfs] Hot spares are rather cold... 
o kern/133676 fs [smbfs] [panic] umount -f'ing a vnode-based memory dis p kern/133174 fs [msdosfs] [patch] msdosfs must support multibyte inter o kern/132960 fs [ufs] [panic] panic:ffs_blkfree: freeing free frag o kern/132397 fs reboot causes filesystem corruption (failure to sync b o kern/132331 fs [ufs] [lor] LOR ufs and syncer o kern/132237 fs [msdosfs] msdosfs has problems to read MSDOS Floppy o kern/132145 fs [panic] File System Hard Crashes o kern/131441 fs [unionfs] [nullfs] unionfs and/or nullfs not combineab o kern/131360 fs [nfs] poor scaling behavior of the NFS server under lo o kern/131342 fs [nfs] mounting/unmounting of disks causes NFS to fail o bin/131341 fs makefs: error "Bad file descriptor" on the mount poin o kern/130920 fs [msdosfs] cp(1) takes 100% CPU time while copying file o kern/130210 fs [nullfs] Error by check nullfs o kern/129760 fs [nfs] after 'umount -f' of a stale NFS share FreeBSD l o kern/129488 fs [smbfs] Kernel "bug" when using smbfs in smbfs_smb.c: o kern/129231 fs [ufs] [patch] New UFS mount (norandom) option - mostly o kern/129152 fs [panic] non-userfriendly panic when trying to mount(8) o kern/127787 fs [lor] [ufs] Three LORs: vfslock/devfs/vfslock, ufs/vfs o bin/127270 fs fsck_msdosfs(8) may crash if BytesPerSec is zero o kern/127029 fs [panic] mount(8): trying to mount a write protected zi o kern/126287 fs [ufs] [panic] Kernel panics while mounting an UFS file o kern/125895 fs [ffs] [panic] kernel: panic: ffs_blkfree: freeing free s kern/125738 fs [zfs] [request] SHA256 acceleration in ZFS o kern/123939 fs [msdosfs] corrupts new files o kern/122380 fs [ffs] ffs_valloc:dup alloc (Soekris 4801/7.0/USB Flash o bin/122172 fs [fs]: amd(8) automount daemon dies on 6.3-STABLE i386, o bin/121898 fs [nullfs] pwd(1)/getcwd(2) fails with Permission denied o bin/121072 fs [smbfs] mount_smbfs(8) cannot normally convert the cha o kern/120483 fs [ntfs] [patch] NTFS filesystem locking changes o kern/120482 fs [ntfs] [patch] Sync style 
changes between NetBSD and F o kern/118912 fs [2tb] disk sizing/geometry problem with large array o kern/118713 fs [minidump] [patch] Display media size required for a k o kern/118318 fs [nfs] NFS server hangs under special circumstances o bin/118249 fs [ufs] mv(1): moving a directory changes its mtime o kern/118126 fs [nfs] [patch] Poor NFS server write performance o kern/118107 fs [ntfs] [panic] Kernel panic when accessing a file at N o kern/117954 fs [ufs] dirhash on very large directories blocks the mac o bin/117315 fs [smbfs] mount_smbfs(8) and related options can't mount o kern/117158 fs [zfs] zpool scrub causes panic if geli vdevs detach on o bin/116980 fs [msdosfs] [patch] mount_msdosfs(8) resets some flags f o conf/116931 fs lack of fsck_cd9660 prevents mounting iso images with o kern/116583 fs [ffs] [hang] System freezes for short time when using o bin/115361 fs [zfs] mount(8) gets into a state where it won't set/un o kern/114955 fs [cd9660] [patch] [request] support for mask,dirmask,ui o kern/114847 fs [ntfs] [patch] [request] dirmask support for NTFS ala o kern/114676 fs [ufs] snapshot creation panics: snapacct_ufs2: bad blo o bin/114468 fs [patch] [request] add -d option to umount(8) to detach o kern/113852 fs [smbfs] smbfs does not properly implement DFS referral o bin/113838 fs [patch] [request] mount(8): add support for relative p o bin/113049 fs [patch] [request] make quot(8) use getopt(3) and show o kern/112658 fs [smbfs] [patch] smbfs and caching problems (resolves b o kern/111843 fs [msdosfs] Long Names of files are incorrectly created o kern/111782 fs [ufs] dump(8) fails horribly for large filesystems s bin/111146 fs [2tb] fsck(8) fails on 6T filesystem o bin/107829 fs [2TB] fdisk(8): invalid boundary checking in fdisk / w o kern/106107 fs [ufs] left-over fsck_snapshot after unfinished backgro o kern/104406 fs [ufs] Processes get stuck in "ufs" state under persist o kern/104133 fs [ext2fs] EXT2FS module corrupts EXT2/3 filesystems o kern/103035 
fs [ntfs] Directories in NTFS mounted disc images appear o kern/101324 fs [smbfs] smbfs sometimes not case sensitive when it's s o kern/99290 fs [ntfs] mount_ntfs ignorant of cluster sizes s bin/97498 fs [request] newfs(8) has no option to clear the first 12 o kern/97377 fs [ntfs] [patch] syntax cleanup for ntfs_ihash.c o kern/95222 fs [cd9660] File sections on ISO9660 level 3 CDs ignored o kern/94849 fs [ufs] rename on UFS filesystem is not atomic o bin/94810 fs fsck(8) incorrectly reports 'file system marked clean' o kern/94769 fs [ufs] Multiple file deletions on multi-snapshotted fil o kern/94733 fs [smbfs] smbfs may cause double unlock o kern/93942 fs [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D o kern/92272 fs [ffs] [hang] Filling a filesystem while creating a sna o kern/91134 fs [smbfs] [patch] Preserve access and modification time a kern/90815 fs [smbfs] [patch] SMBFS with character conversions somet o kern/88657 fs [smbfs] windows client hang when browsing a samba shar o kern/88555 fs [panic] ffs_blkfree: freeing free frag on AMD 64 o bin/87966 fs [patch] newfs(8): introduce -A flag for newfs to enabl o kern/87859 fs [smbfs] System reboot while umount smbfs. o kern/86587 fs [msdosfs] rm -r /PATH fails with lots of small files o bin/85494 fs fsck_ffs: unchecked use of cg_inosused macro etc. 
o kern/80088 fs [smbfs] Incorrect file time setting on NTFS mounted vi o bin/74779 fs Background-fsck checks one filesystem twice and omits o kern/73484 fs [ntfs] Kernel panic when doing `ls` from the client si o bin/73019 fs [ufs] fsck_ufs(8) cannot alloc 607016868 bytes for ino o kern/71774 fs [ntfs] NTFS cannot "see" files on a WinXP filesystem o bin/70600 fs fsck(8) throws files away when it can't grow lost+foun o kern/68978 fs [panic] [ufs] crashes with failing hard disk, loose po o kern/65920 fs [nwfs] Mounted Netware filesystem behaves strange o kern/65901 fs [smbfs] [patch] smbfs fails fsx write/truncate-down/tr o kern/61503 fs [smbfs] mount_smbfs does not work as non-root o kern/55617 fs [smbfs] Accessing an nsmb-mounted drive via a smb expo o kern/51685 fs [hang] Unbounded inode allocation causes kernel to loc o kern/36566 fs [smbfs] System reboot with dead smb mount and umount o bin/27687 fs fsck(8) wrapper is not properly passing options to fsc o kern/18874 fs [2TB] 32bit NFS servers export wrong negative values t 296 problems total. 
From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 11:12:56 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 2CF77807; Mon, 21 Jan 2013 11:12:56 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [188.252.31.196]) by mx1.freebsd.org (Postfix) with ESMTP id 983D8930; Mon, 21 Jan 2013 11:12:55 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id r0LBCjTd001085; Mon, 21 Jan 2013 12:12:45 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id r0LBCjTT001082; Mon, 21 Jan 2013 12:12:45 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Mon, 21 Jan 2013 12:12:45 +0100 (CET) From: Wojciech Puchar To: Zaphod Beeblebrox Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Mon, 21 Jan 2013 12:12:46 +0100 (CET) Cc: freebsd-fs , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 11:12:56 -0000 > Please don't misinterpret this post: ZFS's ability to recover from fairly > catastrophic failures is pretty stellar, but I'm wondering if there can be From my testing it is exactly the opposite. You have to see the difference between marketing and reality. > a little room for improvement. > > I use RAID pretty much everywhere.
I don't like to lose data > are cheap. I have a fair amount of experience with all flavors ... and ZFS just like me. And because I want performance and - as you described - disks are cheap - I use RAID-1 (gmirror). > has become a go-to filesystem for most of my applications. My applications don't tolerate low performance, overcomplexity and a high risk of data loss. That's why I use properly tuned UFS and gmirror, and prefer multiple filesystems over gstripe. > One of the best recommendations I can give for ZFS is its > crash-recoverability. Which is marketing, not truth. If you want bullet-proof recoverability, UFS beats everything I've ever seen. If you want FAST crash recovery, use softupdates+journal, available in FreeBSD 9. > As a counterexample, if you have most hardware RAID > going or a software whole-disk raid, after a crash it will generally > declare one disk as good and the other disk as "to be repaired" ... after > which a full surface scan of the affected disks --- reading one and writing > the other --- ensues. True. gmirror does it, but you can defer the mirror rebuild, which I do. I have a script that sends me a mail when gmirror is degraded, and - after finding out the cause of the problem, and possibly replacing the disk - I run the rebuild after work hours, so no slowdown is experienced. > ZFS is smart on this point: it will recover on reboot with a minimum amount > of fuss. Even if you dislodge a drive ... so that it's missing the last > 'n' transactions, ZFS seems to figure this out (which I thought was extra > kudos). Yes, this is marketing; practice is somewhat different, as you discovered yourself. > > MY PROBLEM comes from problems that scrub can fix. > > Let's talk, in specific, about my home array. It has 9x 1.5T and 8x 2T in > a RAID-Z configuration (2 sets, obviously). While RAID-Z is already a king of bad performance, I assume you mean two POOLS, not 2 RAID-Z sets.
If you mixed 2 different RAID-Z pools, you would spread the load unevenly and make performance even worse. > > A full scrub of my drives weighs in at 36 hours or so. Which is funny, as ZFS is marketed as doing this efficiently (like checking only used space). dd if=/dev/disk of=/dev/null bs=2m would take no more than a few hours, and you could do them all in parallel. > vr2/cvs:<0x1c1> > > Now ... this is just an example: after each scrub, the hex number was Seems like scrub simply doesn't do its work right. > before the old error was cleared. Then this new error gets similarly > cleared by the next scrub. It seems that if the scrub returned to this new > found error after fixing the "known" errors, this could save whole new > scrub runs from being required. Even better: use UFS, for both bullet-proof recoverability and performance. If you need help with tuning, you may ask me privately. From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 14:46:52 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 79387D96; Mon, 21 Jan 2013 14:46:52 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay06.ispgateway.de (smtprelay06.ispgateway.de [80.67.31.96]) by mx1.freebsd.org (Postfix) with ESMTP id 3196C714; Mon, 21 Jan 2013 14:46:52 +0000 (UTC) Received: from [78.35.171.46] (helo=fabiankeil.de) by smtprelay06.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1TxIcp-00047V-8m; Mon, 21 Jan 2013 15:45:27 +0100 Date: Mon, 21 Jan 2013 15:44:50 +0100 From: Fabian Keil To: Andriy Gapon Subject: Re: disk "flipped" - a known problem?
Message-ID: <20130121154450.60f457d1@fabiankeil.de> In-Reply-To: <50FC3EBF.6070803@FreeBSD.org> References: <50FC3EBF.6070803@FreeBSD.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/t7ntFaXohqG3FgxzzyKTQc7"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 Cc: freebsd-fs , freebsd-current@FreeBSD.org, freebsd-geom@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 14:46:52 -0000

--Sig_/t7ntFaXohqG3FgxzzyKTQc7
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

Andriy Gapon wrote:

> Today something unusual happened on one of my machines:
> kernel: (ada0:ahcich0:0:0:0): lost device
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected
> flags 0x18 refcount 1
> kernel: adaasync: Unable to attach to new device due to status 0x6

I believe I saw something similar when trying to forcefully end the cam
lockups reported in:
http://lists.freebsd.org/pipermail/freebsd-current/2012-October/037413.html

Detaching the disc drive caused /dev/cd0 to disappear as expected, but
reinserting the drive didn't bring cd0 back.

> It looks like the disk disappeared from the bus and then re-appeared on the bus,
> but not to the OS.
>
> One of the partitions that the disk hosted was a swap partition and it seems to
> be the cause of some of the following consequences.
>
> The consequences: [...]
> * geom_event thread started consuming 100% of CPU in g_wither_washer()

This sounds familiar as well:
http://www.freebsd.org/cgi/query-pr.cgi?pr=171865

Fabian

--Sig_/t7ntFaXohqG3FgxzzyKTQc7
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlD9VGUACgkQBYqIVf93VJ0RpQCfUBrj2QYbgBfT710Iy1tTmGWO
bUYAmwQoYLnfhRfr2pCN7o5FrQz9agGz
=Ex10
-----END PGP SIGNATURE-----

--Sig_/t7ntFaXohqG3FgxzzyKTQc7--

From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 22:16:19 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 332A4374 for ; Mon, 21 Jan 2013 22:16:19 +0000 (UTC) (envelope-from jdc@koitsu.org) Received: from qmta12.emeryville.ca.mail.comcast.net (qmta12.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:44:76:96:27:227]) by mx1.freebsd.org (Postfix) with ESMTP id EA29BBB for ; Mon, 21 Jan 2013 22:16:18 +0000 (UTC) Received: from omta20.emeryville.ca.mail.comcast.net ([76.96.30.87]) by qmta12.emeryville.ca.mail.comcast.net with comcast id qcMx1k0031smiN4ACmGJF4; Mon, 21 Jan 2013 22:16:18 +0000 Received: from koitsu.strangled.net ([67.180.84.87]) by omta20.emeryville.ca.mail.comcast.net with comcast id qmGH1k00P1t3BNj8gmGHnW; Mon, 21 Jan 2013 22:16:17 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 0FAB673A1B; Mon, 21 Jan 2013 14:16:17 -0800 (PST) Date: Mon, 21 Jan 2013 14:16:17 -0800 From: Jeremy Chadwick To: freebsd-fs@freebsd.org Subject: Re: disk "flipped" - a known problem?
Message-ID: <20130121221617.GA23909@icarus.home.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20121106; t=1358806578; bh=/iTyfZ3lJ0vlxQ1SmyJbSM32sUhApKQYl6ec7ToAT3g=; h=Received:Received:Received:Date:From:To:Subject:Message-ID: MIME-Version:Content-Type; b=YAuFLSrzgp8UNh9SKLb0Qn6gVvgDKFj1wedIrAM1IqTRqaGp9WQPjUME4+V6NpjV2 mvgUd1r9cngalYqNbZt3Honmw/wvQZg28tu3IpMdSZpK17AfO8cxPwCmdFFu3lLw9Z 3pYo0KdRvPXVKp3poqtuBuC8P/hkbAfNyV6rzRGUAplIrOFJaLVSzKtTSfMZ4Ovr6Z IT0zgu22CqoLkti052in2RcAwBk+Ru0I3//Tu3vkApoGjo39Q950jXzVt5jIa3FVTM WCTYnM0R/1uvfjMmu4SFsUH62M9lE8Zopa116RgZKr0DT3AvSVUdeeCdZKUkNy6eHj BDZ5l+/xjwQkw== Cc: mav@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 22:16:19 -0000 (Please keep me CC'd as I am not subscribed) WRT this: http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html I can reproduce the first problem 100% of the time on my home system here. I can provide hardware specs if needed, but the important part is that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI mode (and does not share an IRQ), hot-swap bays are in use, and I'm using ahci.ko. I also want to make this clear to Andriy: I'm not saying "there's a problem with your disk". In my case, I KNOW there's a problem with the disk (that's the entire point to my tests! :-) ). In my case the disk is a WD Raptor (150GB, circa 2006) that has a very badly-designed firmware that goes completely catatonic when encountering certain sector-level conditions. That's not the problem though -- the problem is with FreeBSD apparently getting confused as to the internal state of its devices after a device falls off the bus and comes back. 
Explanation:

1. System powered off; disk is attached; system powered on, shows up as
ada5. Can communicate with the device in every way (the way I tend to test
simple I/O is to use "smartctl -a /dev/ada5"). This disk has no
filesystems or other "stuff" on it -- it's just a raw disk, so I believe
the g_wither_washer oddity does not apply in this situation.

2. "dd if=/dev/zero of=/dev/ada5 bs=64k"

3. Drive hits a bad sector which it cannot remap/deal with. A drive
firmware design flaw results in the drive becoming 100% stuck trying to
re-read the sector and work out internal decisions to do remapping or
not. Drive is audibly clicking during this time (not actuator arm being
reset to track 0 noise; some other mechanical issue). Due to the firmware
issue, the drive remains in this state indefinitely.

4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ)
errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5
times (kern.cam.ada.retry_count+1).

5. FreeBSD spits out messages similar to what you see: retries exhausted,
the cam_periph_alloc error, and devfs claims device removal.

6. Drive is still catatonic, of course. The only way to reset the drive
is to power-cycle it. Drive removed from hot-swap bay, let sit for 20
seconds, then reinserted.

7. FreeBSD sees the disk reappear; it shows up much like it did during #1,
except...

8. "smartctl -a /dev/ada5" claims no such device or unknown device type
(I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol
devlist" shows the disk on the bus, yet I/O does not work. If I
remember right, re-attempting the dd command returns some error (I
forget which).

9. "camcontrol rescan all" stalls for quite some time when trying to
communicate with entry 5, but eventually does return (I think with some
error). "camcontrol reset all" works without a hitch. "camcontrol
devlist" during this time shows the same disk on ada5 (which to me means
ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning
I/O works at some level).
10. System otherwise works fine, but the only way to bring back usability of ada5 is to reboot ("shutdown -r now"). To me, this looks like FreeBSD at some layer within the kernel (or some driver (I don't know which)) is internally confused about the true state of things. Alexander, do you have any ideas? I can enable CAM debugging (I do use options CAMDEBUG so I can toggle this with camcontrol) as well as take notes and do a full step-by-step diagnosis (along with relevant kernel output seen during each phase) if that would help you. And I can test patches but not against -CURRENT (will be a cold day in hell before I run that, sorry). Let me know, time permitting. :-) -- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Mon Jan 21 22:43:33 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 698B8D3; Mon, 21 Jan 2013 22:43:33 +0000 (UTC) (envelope-from prvs=1733fbe9f2=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 99DC91CC; Mon, 21 Jan 2013 22:43:32 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001797160.msg; Mon, 21 Jan 2013 22:43:24 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Mon, 21 Jan 2013 22:43:24 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1733fbe9f2=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: From: "Steven Hartland" To: "Jeremy Chadwick" , References: <20130121221617.GA23909@icarus.home.lan> Subject: Re: disk "flipped" - a known problem? 
Date: Mon, 21 Jan 2013 22:43:55 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: mav@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jan 2013 22:43:33 -0000 ----- Original Message ----- From: "Jeremy Chadwick" To: Cc: ; Sent: Monday, January 21, 2013 10:16 PM Subject: Re: disk "flipped" - a known problem? > (Please keep me CC'd as I am not subscribed) > > WRT this: > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html > > I can reproduce the first problem 100% of the time on my home system > here. I can provide hardware specs if needed, but the important part is > that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI > mode (and does not share an IRQ), hot-swap bays are in use, and I'm > using ahci.ko. > > I also want to make this clear to Andriy: I'm not saying "there's a > problem with your disk". In my case, I KNOW there's a problem with the > disk (that's the entire point to my tests! :-) ). > > In my case the disk is a WD Raptor (150GB, circa 2006) that has a very > badly-designed firmware that goes completely catatonic when encountering > certain sector-level conditions. That's not the problem though -- the > problem is with FreeBSD apparently getting confused as to the internal > state of its devices after a device falls off the bus and comes back. > Explanation: > > 1. System powered off; disk is attached; system powered on, shows up as > ada5. Can communicate with device in every way (the way I tend to test > simple I/O is to use "smartctl -a /dev/ada5"). 
This disk has no > filesystems or other "stuff" on it -- it's just a raw disk, so I believe > the g_wither_washer oddity does not apply in this situation. > > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k" > > 3. Drive hits a bad sector which it cannot remap/deal with. Drive > firmware design flaw results in drive becoming 100% stuck trying to > re-read the sector and work out internal decisions to do remapping or > not. Drive audibly clicking during this time (not actuator arm being > reset to track 0 noise; some other mechanical issue). Due to firmware > issue, drive remains in this state indefinitely. > > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ) > errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5 > times (kern.cam.da.retry_count+1). > > 5. FreeBSD spits out similar messages you see; retries exhausted, > cam_periph_alloc error, and devfs claims device removal. > > 6. Drive is still catatonic of course. Only way to reset the drive is > to power-cycle it. Drive removed from hot-swap bay, let sit for 20 > seconds, then is reinserted. > > 7. FreeBSD sees the disk reappear, shows up much like it did during #1, > except... > > 8. "smartctl -a /dev/ada5" claims no such device or unknown device type > (I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol > devlist" shows the disk on the bus, yet I/O does not work. If I > remember right, re-attempting the dd command returns some error (I > forget which). > > 9. "camcontrol rescan all" stalls for quite some time when trying to > communicate with entry 5, but eventually does return (I think with some > error). camcontrol reset all" works without a hitch. "camcontrol > devlist" during this time shows the same disk on ada5 (which to me means > ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning > I/O works at some level). > > 10. System otherwise works fine, but the only way to bring back > usability of ada5 is to reboot ("shutdown -r now"). 
>
> To me, this looks like FreeBSD at some layer within the kernel (or some
> driver (I don't know which)) is internally confused about the true state
> of things.
>
> Alexander, do you have any ideas?
>
> I can enable CAM debugging (I do use options CAMDEBUG so I can toggle
> this with camcontrol) as well as take notes and do a full step-by-step
> diagnosis (along with relevant kernel output seen during each phase) if
> that would help you. And I can test patches but not against -CURRENT
> (will be a cold day in hell before I run that, sorry).
>
> Let me know, time permitting. :-)

Do you have a controller which is not ATA-based that you can test this
on, e.g. mps? That may help identify whether the issue is ATA-specific
or more generic.

If you have the messages log for the above scenario, that might also
help track down the problem.

It does, as you say, sound like something isn't being cleaned up
properly, which might be confirmed by adding a printf just inside
cam_periph_alloc's

    if ((periph = cam_periph_find(path, name)) != NULL) {

Regards
Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk.
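Pulling out the messages-log slice asked for above could be as simple as grepping for the error strings quoted at the top of the thread; a sketch (the pattern list and the canned sample log are my additions; on a real system you would run it against /var/log/messages):

```shell
# Sketch: extract the CAM/ada error burst from a messages log.
# Patterns come from the kernel lines quoted earlier in the thread.
cam_errs() {
    grep -E 'ahcich|cam_periph_alloc|adaasync' "$@"
}

# canned sample standing in for /var/log/messages
sample='Jan 20 18:00:01 host kernel: (ada0:ahcich0:0:0:0): lost device
Jan 20 18:00:02 host sshd[123]: unrelated noise
Jan 20 18:00:03 host kernel: adaasync: Unable to attach to new device due to status 0x6'

matches=$(printf '%s\n' "$sample" | cam_errs)
```

Real use would be `cam_errs /var/log/messages`, possibly widened with whatever device name applies (ada5 in Jeremy's case).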
From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 00:40:32 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 395023A4 for ; Tue, 22 Jan 2013 00:40:32 +0000 (UTC) (envelope-from jdc@koitsu.org) Received: from qmta01.emeryville.ca.mail.comcast.net (qmta01.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:43:76:96:30:16]) by mx1.freebsd.org (Postfix) with ESMTP id 17887979 for ; Tue, 22 Jan 2013 00:40:32 +0000 (UTC) Received: from omta17.emeryville.ca.mail.comcast.net ([76.96.30.73]) by qmta01.emeryville.ca.mail.comcast.net with comcast id qfLL1k00C1afHeLA1ogXoC; Tue, 22 Jan 2013 00:40:31 +0000 Received: from koitsu.strangled.net ([67.180.84.87]) by omta17.emeryville.ca.mail.comcast.net with comcast id qogW1k00Q1t3BNj8dogWCe; Tue, 22 Jan 2013 00:40:31 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 8D40E73A1B; Mon, 21 Jan 2013 16:40:30 -0800 (PST) Date: Mon, 21 Jan 2013 16:40:30 -0800 From: Jeremy Chadwick To: Steven Hartland Subject: Re: disk "flipped" - a known problem? 
Message-ID: <20130122004030.GA25201@icarus.home.lan> References: <20130121221617.GA23909@icarus.home.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20121106; t=1358815231; bh=IJy99UtPvqSqTogXqfp6P3jXHmqPgFmFsOsNaK97dUA=; h=Received:Received:Received:Date:From:To:Subject:Message-ID: MIME-Version:Content-Type; b=mOyQgmnt+movjgbOZ9FKIRHsmJFWRraxkIBfRJLkdC/wJBCVk6ZjZ7VARguPKE0kS eIQOv1ftZf8YPYosD6q7B2H8nNWK08mJ9soP7YfnPkNbNu44QfwyGT+TI1vRCcYwnw +mO9CsFT0ryWGcgI2AQVMcjlqUdT20LiBn4+Lz/LU6lLFj0xl0aVLUH2vcMgtBElzf 3ZJ9iHuV5nDyt1kFr/8J9rCg+AjsGgCv1BxOjpqc1rhUQFfJs2vDEXwescCWFKKxl2 2NbngInL5cc7F5cBH0T3Nmyd+tCZx7HAAb61yuN4eH58su21AVBTqIY8lPzkhEcPzM Pyw2RE6lIil+A== Cc: freebsd-fs@freebsd.org, mav@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 00:40:32 -0000 On Mon, Jan 21, 2013 at 10:43:55PM -0000, Steven Hartland wrote: > ----- Original Message ----- From: "Jeremy Chadwick" > > To: > Cc: ; > Sent: Monday, January 21, 2013 10:16 PM > Subject: Re: disk "flipped" - a known problem? > > > >(Please keep me CC'd as I am not subscribed) > > > >WRT this: > > > >http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html > > > >I can reproduce the first problem 100% of the time on my home system > >here. I can provide hardware specs if needed, but the important part is > >that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI > >mode (and does not share an IRQ), hot-swap bays are in use, and I'm > >using ahci.ko. > > > >I also want to make this clear to Andriy: I'm not saying "there's a > >problem with your disk". 
In my case, I KNOW there's a problem with the > >disk (that's the entire point to my tests! :-) ). > > > >In my case the disk is a WD Raptor (150GB, circa 2006) that has a very > >badly-designed firmware that goes completely catatonic when encountering > >certain sector-level conditions. That's not the problem though -- the > >problem is with FreeBSD apparently getting confused as to the internal > >state of its devices after a device falls off the bus and comes back. > >Explanation: > > > >1. System powered off; disk is attached; system powered on, shows up as > >ada5. Can communicate with device in every way (the way I tend to test > >simple I/O is to use "smartctl -a /dev/ada5"). This disk has no > >filesystems or other "stuff" on it -- it's just a raw disk, so I believe > >the g_wither_washer oddity does not apply in this situation. > > > >2. "dd if=/dev/zero of=/dev/ada5 bs=64k" > > > >3. Drive hits a bad sector which it cannot remap/deal with. Drive > >firmware design flaw results in drive becoming 100% stuck trying to > >re-read the sector and work out internal decisions to do remapping or > >not. Drive audibly clicking during this time (not actuator arm being > >reset to track 0 noise; some other mechanical issue). Due to firmware > >issue, drive remains in this state indefinitely. > > > >4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ) > >errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5 > >times (kern.cam.da.retry_count+1). > > > >5. FreeBSD spits out similar messages you see; retries exhausted, > >cam_periph_alloc error, and devfs claims device removal. > > > >6. Drive is still catatonic of course. Only way to reset the drive is > >to power-cycle it. Drive removed from hot-swap bay, let sit for 20 > >seconds, then is reinserted. > > > >7. FreeBSD sees the disk reappear, shows up much like it did during #1, > >except... > > > >8. 
"smartctl -a /dev/ada5" claims no such device or unknown device type > >(I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol > >devlist" shows the disk on the bus, yet I/O does not work. If I > >remember right, re-attempting the dd command returns some error (I > >forget which). > > > >9. "camcontrol rescan all" stalls for quite some time when trying to > >communicate with entry 5, but eventually does return (I think with some > >error). camcontrol reset all" works without a hitch. "camcontrol > >devlist" during this time shows the same disk on ada5 (which to me means > >ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning > >I/O works at some level). > > > >10. System otherwise works fine, but the only way to bring back > >usability of ada5 is to reboot ("shutdown -r now"). > > > >To me, this looks like FreeBSD at some layer within the kernel (or some > >driver (I don't know which)) is internally confused about the true state > >of things. > > > >Alexander, do you have any ideas? > > > >I can enable CAM debugging (I do use options CAMDEBUG so I can toggle > >this with camcontrol) as well as take notes and do a full step-by-step > >diagnosis (along with relevant kernel output seen during each phase) if > >that would help you. And I can test patches but not against -CURRENT > >(will be a cold day in hell before I run that, sorry). > > > >Let me know, time permitting. :-) > > Do you have a controller which not ata based you can test this on e.g. > mps as this may help identify if the issue is ata specific or more > generic. I do not. Well, that's not entirely true -- I have an Adaptec 2410SA laying around here somewhere, but xxx(4), as I understand it, has been neglected for quite some time and I stopped using that controller a few months after I got it simply because it sucked. :P It's SiI-3112-based with a bunch of hullabaloo on it. The best I could do is try to pick up an inexpensive 3124-based siis(4) controller and try that. 
I understand your logic here -- you're trying to narrow down if the issue is within CAM(4) or not. The above used to work. That is to say, I could literally yank a disk out of my hot-swap bay, insert a new one, and FreeBSD would do the right thing. Possibly the issue is with the same disk being re-inserted? Not sure. > ... If you have the messages log for above scenario that also might > help to track down the problem. I can do that but will need some time (not a lot, just dedicated linear time. :-) ). I also need to know what kind of output folks are wanting -- I know you want kernel output, as well as whatever physical action was last taken, but output from userland (ex. "camcontrol devlist") seems relevant, thus need some advice on what would be useful. "camcontrol debug" might be helpful but I'd need to know what printfs are wanted (see man page). camcontrol(8) implies that simply doing "camcontrol debug -x -y -z ahcich5" would be sufficient to see everything within ahci(4) on port 5 as well as ada5. Yes/no? Want to make this clear too: the issue I see is not specific to just ada5 on my system, i.e. it is not a problem with the physical port or somesuch. It's reproducible regardless of port number -- it just so happens that I actively use ports 0-4 on my system and leave port 5 available for drive testing/forensics. > It does, as you say, sound like something isn't being cleaned up properly > which might be confirmed by adding a printf just inside cam_periph_alloc's > if ((periph = cam_periph_find(path, name)) != NULL) { I'd rather wait on this; I never feel comfortable poking about in kernel innards, especially something like CAM. Remember: the system I'm testing with is actually used, so I don't want to risk impacting other ("good") CAM transactions with a change. I just don't have the knowledge or familiarity with those pieces to be poking around in there with comfort. 
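The per-phase snapshot discussed here might be driven by a small wrapper along these lines (a sketch: the DRY_RUN switch and the command list are my additions; the debug flags are as I read camcontrol(8), with ahcich5/ada5 taken from the mail):

```shell
# Dry-runnable sketch of a per-phase state snapshot.  DRY_RUN=1 (the
# default here) only prints the commands; set DRY_RUN=0 on the real box.
DRY_RUN=${DRY_RUN:-1}

snapshot() {
    for cmd in \
        'camcontrol devlist' \
        'camcontrol debug -I -T -c ahcich5' \
        'smartctl -a /dev/ada5' \
        'dmesg | tail -n 50'
    do
        if [ "$DRY_RUN" = 1 ]; then
            echo "would run: $cmd"
        else
            sh -c "$cmd"
        fi
    done
}

out=$(snapshot)
```

Running `snapshot` after each physical action (insert, yank, rescan) would give matching userland and kernel views of every phase.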
-- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 07:36:55 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id B2E1FA07; Tue, 22 Jan 2013 07:36:55 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id 4545C9FF; Tue, 22 Jan 2013 07:36:54 +0000 (UTC) Received: from server.rulingia.com (c220-239-253-186.belrs5.nsw.optusnet.com.au [220.239.253.186]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r0M7anI0064423 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 22 Jan 2013 18:36:51 +1100 (EST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r0M7ahxi050714 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 22 Jan 2013 18:36:43 +1100 (EST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by server.rulingia.com (8.14.5/8.14.5/Submit) id r0M7afm9050710; Tue, 22 Jan 2013 18:36:41 +1100 (EST) (envelope-from peter) Date: Tue, 22 Jan 2013 18:36:41 +1100 From: Peter Jeremy To: Wojciech Puchar Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
Message-ID: <20130122073641.GH30633@server.rulingia.com> References: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="n+lFg1Zro7sl44OB" Content-Disposition: inline In-Reply-To: X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-fs , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 07:36:55 -0000

--n+lFg1Zro7sl44OB
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2013-Jan-21 12:12:45 +0100, Wojciech Puchar wrote:
>That's why i use properly tuned UFS, gmirror, and prefer not to use
>gstripe but have multiple filesystems

When I started using ZFS, I didn't fully trust it so I had a gmirrored
UFS root (including a full src tree). Over time, I found that gmirror
plus UFS was giving me more problems than ZFS. In particular, I was
seeing behaviour that suggested that the mirrors were out of sync, even
though gmirror insisted they were in sync. Unfortunately, there is no
way to get gmirror to verify the mirroring or to get UFS to check
correctness of data or metadata (fsck can only check metadata
consistency). I've since moved to a ZFS root.

>Which is marketing, not truth. If you want bullet-proof recoverability,
>UFS beats everything i've ever seen.

I've seen the opposite. One big difference is that ZFS is designed to
ensure it returns the data that was written to it whereas UFS just
returns the bytes it finds where it thinks it wrote your data. One side
effect of this is that ZFS is far fussier about hardware quality -
since it checksums everything, it is likely to pick up glitches that
UFS doesn't notice.

>If you want FAST crash recovery, use softupdates+journal, available in
>FreeBSD 9.
I'll admit that I haven't used SU+J, but one downside of SU+J is that
it prevents the use of snapshots, which in turn prevents the (safe) use
of dump(8) (which is the official tool for UFS backups) on live
filesystems.

>> of fuss. Even if you dislodge a drive ... so that it's missing the last
>> 'n' transactions, ZFS seems to figure this out (which I thought was extra
>> kudos).
>
>Yes this is marketing. practice is somehow different. as you discovered
>yourself.

Most of the time this works as designed. It's possible there are bugs
in the implementation.

>While RAID-Z is already a king of bad performance,

I don't believe RAID-Z is any worse than RAID5. Do you have any actual
measurements to back up your claim?

> i assume
>you mean two POOLS, not 2 RAID-Z sets. if you mixed 2 different RAID-Z
>pools you would spread load unevenly and make performance even worse.

There's no real reason why you couldn't have 2 different vdevs in the
same pool.

>> A full scrub of my drives weighs in at 36 hours or so.
>
>which is funny as ZFS is marketed as doing this efficient (like checking
>only used space).

It _does_ only check used space, but it does so in logical order rather
than physical order. For a fragmented pool, this means random accesses.

>Even better - use UFS.

Then you'll never know that your data has been corrupted.

>For both bullet proof recoverability and performance.

use ZFS.
--=20 Peter Jeremy --n+lFg1Zro7sl44OB Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlD+QYkACgkQ/opHv/APuIdH7QCfQcSzk1BtPmFuSWNBqH/UUZL0 r+kAoKU/ks97MatHjPwjXl2BarlMyOzg =KFNN -----END PGP SIGNATURE----- --n+lFg1Zro7sl44OB-- From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 07:52:55 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 858BCEB for ; Tue, 22 Jan 2013 07:52:55 +0000 (UTC) (envelope-from zbeeble@gmail.com) Received: from mail-la0-f44.google.com (mail-la0-f44.google.com [209.85.215.44]) by mx1.freebsd.org (Postfix) with ESMTP id 15C83AC6 for ; Tue, 22 Jan 2013 07:52:54 +0000 (UTC) Received: by mail-la0-f44.google.com with SMTP id eb20so5953616lab.3 for ; Mon, 21 Jan 2013 23:52:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=AEM7TtIxrj3qml11a/gq9WKhfRXwQ9vbO1dp1qi4EnQ=; b=kqWYOJ/3jQyS8Y548D3lgkwdv9YsOhap0zkc8j6ePI6ejJCZ/vzdDxOHmJB6k1f16p gtjutHEapSloTcxz1Fw1McFQ7KJb0iFfECL9Y1i1/WL+eFxMDzvxxFw11o8Vh6XxJDlt LjhLLKRp+BhMvqZ57onKks3rpMsW6ZHX9LN3dzzEjoBVouRQTRjarJ90K/BWQ5Dtevi5 zVukfDrmhUCL3UoqfL5UnKAZAuCV0nTbaYa/gF3+nVKf7GlGY0njMWUibuqUDWh+17iB +LkwvLfY/gDDNF5DuQCV0LMQ7Y2P7ltyZOAYDUBLiZ9/SeMDiXNKgGGflD4AafQHFZ2s Vozw== MIME-Version: 1.0 X-Received: by 10.112.28.9 with SMTP id x9mr8747176lbg.27.1358841173688; Mon, 21 Jan 2013 23:52:53 -0800 (PST) Received: by 10.112.6.38 with HTTP; Mon, 21 Jan 2013 23:52:53 -0800 (PST) In-Reply-To: References: Date: Tue, 22 Jan 2013 02:52:53 -0500 Message-ID: Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
From: Zaphod Beeblebrox To: Wojciech Puchar Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 07:52:55 -0000

On Mon, Jan 21, 2013 at 6:12 AM, Wojciech Puchar <
wojtek@wojtek.tensor.gdynia.pl> wrote:

>> Please don't misinterpret this post: ZFS's ability to recover from fairly
>> catastrophic failures is pretty stellar, but I'm wondering if there can be
>
> from my testing it is exactly opposite. You have to see a difference
> between marketing and reality.

Far from listening to the marketing, I have experienced it. I have been
using ZFS for as long as it's been part of FreeBSD. It has never lost me
data, and it has recovered from some pretty incredible situations that
would have lost data with standard RAID deployments. Most of what I'm
talking about is bad hardware --- ZFS is marvelous at helping you find
bad hardware.

>> has become a go-to filesystem for most of my applications.
>
> My applications doesn't tolerate low performance, overcomplexity and high
> risk of data loss.
>
> That's why i use properly tuned UFS, gmirror, and prefer not to use
> gstripe but have multiple filesystems

Bull-hockey. I still have a 250-ish gig RAID-1 UFS partition with cyrus
IMAP data on it that takes nearly an hour to FSCK. UFS snapshots have
definitely lost me data, and UFS has definite problems with large
partitions and/or lots of small files. I've even spent time with Kirk
McKusick on this. Softupdates on UFS is an incredible piece of work, but
simply put: UFS is not designed for 50 or 100 Terabyte partitions.

... now... if you don't mind. I've sparred with you verbally 2 or 3
times now on list. Trust me: I have no use for "marketing" ... but
equally, I have less use for someone who would accuse me of it...
or does anyone "market" FreeBSD effectively anyways. I haven't found your posts (in general) to be useful or on topic at all. If you don't mind, just don't reply to my post next time. From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 09:17:39 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 2248E386 for ; Tue, 22 Jan 2013 09:17:39 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id D85791EC for ; Tue, 22 Jan 2013 09:17:38 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id 25FCD47E0F; Tue, 22 Jan 2013 10:11:17 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.5 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.0.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id 9BD2147DE6 for ; Tue, 22 Jan 2013 10:11:17 +0100 (CET) Message-ID: <50FE57A9.2040104@platinum.linux.pl> Date: Tue, 22 Jan 2013 10:11:05 +0100 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. References: <20130122073641.GH30633@server.rulingia.com> In-Reply-To: <20130122073641.GH30633@server.rulingia.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 09:17:39 -0000 >> Even better - use UFS. > Then you'll never know that your data has been corrupted. This is exactly what happened to me. 
I had a server connected to a failing mains socket. For about a month ZFS reported checksum errors on multiple disks, all fixed thanks to raidz2. Then the socket failed, completely burned off, and the UPS woke me up. I replaced the socket, and there have been no checksum errors since. This is what was in my wall: http://tepeserwery.pl/DSC_0178.JPG

From: Borja Marcos
Date: Tue, 22 Jan 2013 12:03:59 +0100
Message-Id: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>
To: FreeBSD Filesystems
Cc: Scott Long
Subject: RFC: Suggesting ZFS "best practices" in FreeBSD

(Scott, I hope you don't mind being CC'd; I'm not sure you read the -fs mailing list, and this is a SCSI/FS issue.)

Hi :)

I hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well-proven "doctrine" in order to avoid problems.
In particular, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").

ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers which tend to make things worse.

Since I've been using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I will try to explain some of them, along with my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.

1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)

ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.

For example, imagine that I have 16 disks, da0 to da15. One of them, say da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to their device number minus one. Potential for trouble, as a minimum.

After several different installations, I now prefer to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive and label it with a readable name. Imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example pool-raidz1-disk1. When creating a pool, I use these names instead of dealing with device numbers.
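(For illustration, the pool-vdev-disk labeling scheme described above might be set up roughly like this; the disk and label names are hypothetical examples.)

```shell
# Create a GPT scheme and one full-size, labeled partition per disk.
gpart create -s gpt da0
gpart add -t freebsd-zfs -l rpool-disk1 da0
gpart create -s gpt da1
gpart add -t freebsd-zfs -l rpool-disk2 da1

# Build the pool from the stable /dev/gpt/* names instead of daN,
# so device renumbering after a disk failure cannot confuse it.
zpool create rpool mirror gpt/rpool-disk1 gpt/rpool-disk2
```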
For example:

% zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan  7 16:25:47 2013
config:

        NAME                 STATE     READ WRITE CKSUM
        rpool                ONLINE       0     0     0
          mirror-0           ONLINE       0     0     0
            gpt/rpool-disk1  ONLINE       0     0     0
            gpt/rpool-disk2  ONLINE       0     0     0
        logs
          gpt/zfs-log        ONLINE       0     0     0
        cache
          gpt/zfs-cache      ONLINE       0     0     0

Using a unique name for each disk within your organization is important. That way, you can safely move the disks to a different server, which might already be using ZFS, and still be able to import the pool without name collisions. Of course you could use gptids, which, as far as I know, are unique, but they are difficult to use, and in case of a disk failure it's not easy to determine which disk to replace.

2- RAID cards.

Simply: avoid them like the pest. ZFS is designed to operate on bare disks, and it does an amazingly good job. Any additional software layer you add on top will compromise it. I have had bad experiences with "mfi" and "aac" cards.

There are two solutions adopted by RAID card users, and neither of them is good. The first, obvious one is to create a RAID5, taking advantage of the battery-backed cache (if present). It works, but it loses some of the advantages of ZFS. Moreover, trying different cards, I have been forced to reboot whole servers in order to do something trivial like replacing a failed disk. Yes, there are software tools to control some of the cards, but they are at the very least cumbersome and confusing.

The second "solution" is to create a RAID0 volume for each disk (some RAID card manufacturers even dare to call it JBOD). I haven't seen a single instance of this working flawlessly. Again, a replaced disk can be a headache. At the very least you have to deal with a cumbersome and complicated management program to replace a disk, and you often have to reboot the server.
The biggest reason to avoid these stupid cards, anyway, is plain and simple: those cards, at least the ones I have tried bundled by Dell as PERC(insert a random number here) or Sun, isolate the ASC/ASCQ sense codes from the filesystem. Pure crap.

Years ago, fighting this issue, and when ZFS was still rather experimental, I asked for help and Scott Long sent me a "don't try this at home" simple patch so that the disks become available to the CAM layer, bypassing the RAID card. He warned me of potential issues and lost sense codes, but, so far so good. And indeed the sense codes are lost when a RAID card creates a volume, even in the misnamed "JBOD" configuration.

http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679

Anyway, even if there might be some issues due to command handling, the end-to-end verification performed by ZFS should ensure that, as a minimum, the data on the disks won't be corrupted and, if corruption happens, it will be detected. I much prefer to have ZFS deal with it, instead of working on a sort of "virtual" disk implemented on the RAID card.

Another *strong* reason to avoid those cards, even in "JBOD" configurations, is disk portability. The RAID card labels the disks. Moving one disk from one machine to another will result in a funny situation of confusing "import foreign config/ignore" messages when rebooting the destination server (mandatory in order to be able to access the transferred disk). Once again: additional complexity, useless layering and more reboots. That may be acceptable for Metoosoft crap, not for Unix systems.

Summarizing: I would *strongly* recommend avoiding RAID cards and instead getting proper host adapters without any fancy functionality. The one sold by Dell as H200 seems to work very well. No need to create any JBOD or fancy thing at all.
It will just expose the drives as normal SAS/SATA ones. A host adapter without fancy firmware is the best guarantee against failures caused by fancy firmware.

But in case that's not possible, I am still leaning towards the kludge of bypassing the RAID functionality, and even avoiding the JBOD/RAID0 thing by patching the driver. There is one issue, though: on reboot, the RAID cards freeze, and I am not sure why. Maybe that could be fixed; it happens on machines on which I am not using the RAID functionality at all. They should become "transparent", but they don't.

Also, I think that the so-called JBOD thing would impair the correct operation of a ZFS health daemon doing things such as automatic replacement of failed disks with hot spares, etc. And there won't be a real ASC/ASCQ log message for diagnosis.

(See the bottom of this message for a problem I have just had with a "JBOD" configuration.)

3- Installation, boot, etc.

Here I am not sure. Before zfsboot became available, I used to create a zfs-on-root system by doing, more or less, this:

- Install the base system on a pendrive. After the installation, just /boot will be used from the pendrive, and /boot/loader.conf will point at the ZFS root.
- Create the ZFS pool.
- Create and populate the root hierarchy. I used to create something like:

  pool/root
  pool/root/var
  pool/root/usr
  pool/root/tmp

Why pool/root instead of simply "pool"? Because it's easier to understand, snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's possible to snapshot the whole "system" tree atomically.

I also set the mountpoint of the "system" tree as legacy, and rely on /etc/fstab. Why? In order to avoid an accidental "auto mount" of critical filesystems if, for example, I boot off a pendrive in order to tinker.

For the last system I installed, I tried zfsboot instead of booting off the /boot directory of an FFS partition.
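(For illustration, the hierarchy described above might be created like this; the pool name and da0 device are hypothetical, and a real install would use GPT labels as argued earlier.)

```shell
# Sketch of the pool/root layout; "pool" and da0 are example names.
zpool create pool da0

# Group the whole system under pool/root so the entire "system" tree
# can be snapshotted and sent/received atomically.
zfs create pool/root
zfs create pool/root/var
zfs create pool/root/usr
zfs create pool/root/tmp

# Legacy mountpoint (inherited by the children): mounting is driven by
# /etc/fstab, so importing the pool from a rescue pendrive will not
# auto-mount the critical filesystems.
zfs set mountpoint=legacy pool/root
```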
(*) An example of RAID/JBOD-induced crap and the problem of not using static naming follows.

I am using a Sun server running FreeBSD. It has 16 160 GB SAS disks and one of those cards I worship: this particular example is controlled by the aac driver.

As I was going to tinker a lot, I decided to create a RAID-based mirror for the system, so that I can boot off it and have swap even with a failed disk, and use the other 14 disks as a pool with two raidz vdevs of 6 disks, leaving two disks as hot spares. Later I removed one of the hot spares and installed an SSD with two partitions, to try and make it work as L2ARC and log. As I had gone for the JBOD pain, of course replacing that disk meant rebooting the server in order to do something as illogical as creating a "logical" volume on top of it. These cards just love to be rebooted.

  pool: pool
 state: ONLINE
  scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
config:

        NAME             STATE     READ WRITE CKSUM
        pool             ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            aacd1        ONLINE       0     0     0
            aacd2        ONLINE       0     0     0
            aacd3        ONLINE       0     0     0
            aacd4        ONLINE       0     0     0
            aacd5        ONLINE       0     0     0
            aacd6        ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            aacd7        ONLINE       0     0     0
            aacd8        ONLINE       0     0     0
            aacd9        ONLINE       0     0     0
            aacd10       ONLINE       0     0     0
            aacd11       ONLINE       0     0     0
            aacd12       ONLINE       0     0     0
        logs
          gpt/zfs-log    ONLINE       0     0     0
        cache
          gpt/zfs-cache  ONLINE       0     0     0
        spares
          aacd14         AVAIL

errors: No known data errors

The fun began when a disk failed. When it happened, I offlined it and replaced it with the remaining hot spare. But something had changed, and the pool remained in this state:

% zpool status
  pool: pool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 192K in 0h0m with 0 errors on Wed Dec  5 08:31:57 2012
config:

        NAME                        STATE     READ WRITE CKSUM
        pool                        DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            spare-0                 DEGRADED     0     0     0
              13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
              aacd14                ONLINE       0     0     0
            aacd2                   ONLINE       0     0     0
            aacd3                   ONLINE       0     0     0
            aacd4                   ONLINE       0     0     0
            aacd5                   ONLINE       0     0     0
            aacd6                   ONLINE       0     0     0
          raidz1-1                  ONLINE       0     0     0
            aacd7                   ONLINE       0     0     0
            aacd8                   ONLINE       0     0     0
            aacd9                   ONLINE       0     0     0
            aacd10                  ONLINE       0     0     0
            aacd11                  ONLINE       0     0     0
            aacd12                  ONLINE       0     0     0
        logs
          gpt/zfs-log               ONLINE       0     0     0
        cache
          gpt/zfs-cache             ONLINE       0     0     0
        spares
          2388350688826453610       INUSE     was /dev/aacd14

errors: No known data errors
%

ZFS was somewhat confused by the JBOD volumes, and it was impossible to end this situation. A reboot revealed that the card, apparently, had changed volume numbers. Thanks to the resiliency of ZFS I didn't lose a single bit of data, but the situation seemed risky. Finally I could fix it by replacing the failed disk, rebooting the whole server (of course) and doing a zpool replace. But the card added some confusion, and I still don't know what the disk failure was. No trace of a meaningful error message.

Best regards,

Borja.
From: Andreas Nilsson
Date: Tue, 22 Jan 2013 13:00:33 +0100
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
To: Borja Marcos
Cc: FreeBSD Filesystems, Scott Long
X-List-Received-Date: Tue, 22 Jan 2013 12:00:34
Seems good to me, but I think you could/should also push for pool names that are unique, to ease import on another system.

I would recommend avoiding HP RAID cards as well (at least the P400i and P410i), as gptzfsboot cannot find a pool on the first "logical" disk presented to the OS; see
http://lists.freebsd.org/pipermail/freebsd-current/2011-August/026175.html

Best regards
Andreas

From: Andriy Gapon
Date: Tue, 22 Jan 2013 15:41:13 +0200
Message-ID: <50FE96F9.6000900@FreeBSD.org>
To: Borja Marcos
Cc: FreeBSD Filesystems, Scott Long
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD

on 22/01/2013 13:03 Borja Marcos said the following:
> pool/root pool/root/var pool/root/usr pool/root/tmp
>
> Why pool/root instead of
simply "pool"? Because it's easier to understand, > snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's > possible to snapshot the whole "system" tree atomically. I recommend placing "/" into pool/ROOT/. That would very useful for boot environments (BEs - use them!). > I also set the mountpoint of the "system" tree as legacy, and rely on > /etc/fstab. I do place anything for ZFS into fstab. Nor I use vfs.root.mountfrom loader variable. I depend on the boot and kernel code doing the right thing based on pool's bootfs property. > Why? In order to avoid an accidental "auto mount" of critical > filesystems in case, for example, I boot off a pendrive in order to tinker. Not sure what you mean, if you don't import the pool nothing gets mounted. If you remember to use import -R then everything gets mounted in controlled places. -- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 14:34:47 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id AE2EB6BD for ; Tue, 22 Jan 2013 14:34:47 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.freebsd.org (Postfix) with ESMTP id 46D8399C for ; Tue, 22 Jan 2013 14:34:46 +0000 (UTC) Received: from [127.0.0.1] (Scott4long@pooker.samsco.org [168.103.85.57]) (authenticated bits=0) by pooker.samsco.org (8.14.5/8.14.5) with ESMTP id r0MEXxPb077679; Tue, 22 Jan 2013 07:33:59 -0700 (MST) (envelope-from scottl@samsco.org) Content-Type: text/plain; charset=iso-8859-1 Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: Scott Long In-Reply-To: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> Date: Tue, 22 Jan 2013 07:33:59 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <565CB55B-9A75-47F4-A88B-18FA8556E6A2@samsco.org> References: 
<314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>

On Jan 22, 2013, at 4:03 AM, Borja Marcos wrote:

> [...]
>
> 1- Dynamic disk naming -> We should use static naming (GPT labels, for
> instance)
>
> ZFS was born in a system with static device naming (Solaris). When you
> plug a disk it gets a fixed name.
> As far as I know, at least from my experience with Sun boxes, c1t3d12 is
> always c1t3d12. FreeBSD's dynamic naming can be very problematic.

Look up SCSI device wiring in /sys/conf/NOTES. That's one solution to static naming, just with a slightly different angle than Solaris. I do agree with your general thesis here, and either wiring should be made a much more visible and documented feature, or a new mechanism should be developed to provide naming stability. Please let me know what you think of the wiring mechanic.

> 2- RAID cards.
>
> Simply: Avoid them like the pest. ZFS is designed to operate on bare
> disks. And it does an amazingly good job. Any additional software layer
> you add on top will compromise it. I have had bad experiences with "mfi"
> and "aac" cards.

Agree 200%. Despite the best efforts of sales and marketing people, RAID cards do not make good HBAs. At best they add latency. At worst, they add a lot of latency and extra failure modes.
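(For illustration, the CAM wiring mentioned above can be expressed as loader hints; the controller, bus and target numbers below are hypothetical and must match the actual hardware.)

```
# /boot/device.hints -- wire da0 to a fixed bus/target so the unit
# number survives disk removals and reboots.
hint.scbus.0.at="ahcich0"
hint.da.0.at="scbus0"
hint.da.0.target="0"
hint.da.0.unit="0"
```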
Scott

From: Warren Block
Date: Tue, 22 Jan 2013 08:04:42 -0700 (MST)
To: Borja Marcos
Cc: FreeBSD Filesystems, Scott Long
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD

On Tue, 22 Jan 2013, Borja Marcos wrote:

> 1- Dynamic disk naming -> We should use static naming (GPT labels, for
> instance)
>
> ZFS was born in a system with static device naming (Solaris). When you
> plug a disk it gets a fixed name. As far as I know, at least from my
> experience with Sun boxes, c1t3d12 is always c1t3d12.
> FreeBSD's dynamic naming can be very problematic.
>
> For example, imagine that I have 16 disks, da0 to da15. One of them, say,
> da5, dies. When I reboot the machine, all the devices from da6 to da15
> will be renamed to their device number minus one. Potential for trouble
> as a minimum.
>
> After several different installations, I am preferring to rely on static
> naming. [...]

I'm a proponent of using various types of labels, but my impression after a recent experience was that ZFS metadata was enough to identify the drives even if they were moved around. That is, bare ZFS metadata on a drive with no other partitioning or labels. Is that incorrect?

From: Warren Block
Date: Tue, 22 Jan 2013 08:12:04 -0700 (MST)
To: Borja Marcos
Cc: FreeBSD Filesystems, Scott Long, wblock@freebsd.org
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
References:
<314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>

On Tue, 22 Jan 2013, Borja Marcos wrote:

> Hope nobody will hate me too much, but ZFS usage under FreeBSD is
> still chaotic. We badly need a well proven "doctrine" in order to
> avoid problems.

I would like to see guidelines for at least two common scenarios:

- a multi-terabyte file server with a multi-drive pool;
- a limited-RAM (1 GB) root-on-ZFS workstation with a single disk.

The first is easy with the defaults, but particular tuning could be beneficial, and it would be a good place to talk about NFS on ZFS, usage of SSDs, and so on. The second is supposed to be achievable, but the specifics...

These could go in the ZFS section of the Handbook. I'm interested in working on that.
From: Mark Felder
Date: Tue, 22 Jan 2013 09:26:39 -0600
To: Warren Block
Cc: FreeBSD Filesystems
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD

On Tue, 22 Jan 2013 09:04:42 -0600, Warren Block wrote:

> I'm a proponent of using various types of labels, but my impression
> after a recent experience was that ZFS metadata was enough to identify
> the drives even if they were moved around.
> That is, ZFS bare metadata on a drive with no other partitioning or
> labels. Is that incorrect?

If you have an enclosure with 48 drives, can you be confident which drive is failing using only the ZFS metadata?

From: Borja Marcos
Date: Tue, 22 Jan 2013 16:36:52 +0100
Message-Id: <16B2C50C-DD36-4375-A002-F866A612D842@sarenet.es>
To: Warren Block
Cc: FreeBSD Filesystems
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD

On Jan 22, 2013, at 4:04 PM, Warren Block wrote:

> I'm a proponent of using various types of labels, but my impression after
> a recent experience was that ZFS metadata was enough to identify the
> drives even if they were moved around. That is, ZFS bare metadata on a
> drive with no other partitioning or labels.
>
> Is that incorrect?
I'm afraid it's inconvenient unless you enjoy reboots ;) This is a pathological but likely example I just demonstrated to a friend. We were testing a new server with 12 hard disks and a proper HBA. The disks are, unsurprisingly, da0-da11. There is a da12 used (for now) for the OS, so that there's no problem creating and destroying pools at leisure. My friend had created a pool with two raidz vdevs, nothing rocket science: da0-5, da6-11. So, we were doing some tests and I pulled one of the disks. Nothing special, ZFS recovers nicely. Now comes the fun part. I reboot the machine with the missing disk. What happens now? I had pulled da4, I think. So, disks with an ID > 4 have been renamed to N - 1: da5 became da4, da6 became da5, da7 became da6... and, critically, da12 became da11. The reboot began by failing to mount the root filesystem, but that one is trivial. Just tell the kernel where it is now (da11) and it boots happily. Now we have a degraded pool with a missing disk (da4) and a da4 that previously was da5. It works, of course, but in a degraded state. OK, we found a replacement disk, and we plugged it in. It became, guess what! Yes, da12. Now: I cannot "online" da4, because it exists. I cannot online da12, because it didn't belong to the pool. I cannot replace da4 with da12, because da4 is there. Now that I think of it, in this case: 15896790606386444480 OFFLINE 0 0 0 was /dev/da4 Is it possible to say zpool replace 15896790606386444480 da12? I haven't tried it. Anyway, it seems a bit confusing. The logical, albeit cumbersome, approach is to reboot the machine with the new da4 in place, and after rebooting, to online or replace it. Using names prevents this kind of confusion. Borja.
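[For what it's worth, zpool subcommands do accept a vdev's numeric GUID in place of a device name, so the situation above can in principle be untangled without a reboot. A sketch only: the pool name "tank" is an assumption (it is never named in the thread), the GUID is the one from the status output quoted above, and the GPT label "slot04" is a hypothetical name tied to the physical drive bay:

```sh
# Replace the vdev identified by its GUID with the newly inserted disk:
zpool replace tank 15896790606386444480 da12

# Labeling sidesteps the renumbering entirely: give the new disk a GPT
# label tied to its physical slot, and refer to the label instead:
gpart create -s gpt da12
gpart add -t freebsd-zfs -l slot04 da12
zpool replace tank 15896790606386444480 gpt/slot04
```

Pool members named by label keep the same /dev/gpt/* names no matter how the daN numbers shuffle after a reboot.]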
From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 15:40:24 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id AC856D0D for ; Tue, 22 Jan 2013 15:40:24 +0000 (UTC) (envelope-from claudiu.vasadi@gmail.com) Received: from mail-ie0-f178.google.com (mail-ie0-f178.google.com [209.85.223.178]) by mx1.freebsd.org (Postfix) with ESMTP id 56F82E36 for ; Tue, 22 Jan 2013 15:40:24 +0000 (UTC) Received: by mail-ie0-f178.google.com with SMTP id c12so11646001ieb.23 for ; Tue, 22 Jan 2013 07:40:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=1ciVddKAnkarB86VkrstRmL+G5j25tdaYeub/w6zWu0=; b=s56RSmqldG0KyRZn91DmMe3n+su7eDyVLqbKmWbUZiI1YO1G6AJ/bdhaZI316YYtzP GcTY1lIxifnHWhCNY3zYFNamDOpTWmkiKuVtoowjQfK/W1zv0KtuJby9jiA1YvNtSe4Y Ed5wDVtMkX6AQgiYXq530Zf4qILkXRNDUhLzA9juPFAfkpuSZ7vlttmRh4gkAeCIOpNZ yukWr4+61vAP2HhuWLFa2TwcX1i8Dx7AtEvWoje9iXRalEo2tNpIz1U7YRqPgzaMf6+d wu7a1jGLl7VriqWxl8PZEcI1I2MzSdgs581LOVUBajVlqdMRexd6/aDWoZGKstIe4mKJ T0Gw== MIME-Version: 1.0 X-Received: by 10.50.170.66 with SMTP id ak2mr12402662igc.38.1358869223676; Tue, 22 Jan 2013 07:40:23 -0800 (PST) Received: by 10.64.33.110 with HTTP; Tue, 22 Jan 2013 07:40:23 -0800 (PST) In-Reply-To: <16B2C50C-DD36-4375-A002-F866A612D842@sarenet.es> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <16B2C50C-DD36-4375-A002-F866A612D842@sarenet.es> Date: Tue, 22 Jan 2013 16:40:23 +0100 Message-ID: Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: claudiu vasadi To: Borja Marcos Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 15:40:24 -0000 On Tue, Jan 22, 2013 at 4:36 PM, Borja Marcos wrote: > > On Jan 22, 2013, at 4:04 PM, Warren Block wrote: > > > I'm a proponent of using various types of labels, but my impression > after a recent experience was that ZFS metadata was enough to identify the > drives even if they were moved around. That is, ZFS bare metadata on a > drive with no other partitioning or labels. > > > > Is that incorrect? > > I'm afraid it's inconvenient unless you enjoy reboots ;) > > This is a pathological but likely example I just demonstrated to a friend. > > We were testing a new server with 12 hard disks and a proper HBA. > > The disks are, unsurprisingly, da0-da11. There is a da12 used (for now) > for the OS, so that there's no problem to create and destroy pools at > leisure. > > My friend had created a pool with two raidz vdevs, nothing rocket science: > da0-5, da6-11. > > So, we were doing some tests and I pulled one of the disks. Nothing > special, ZFS recovers nicely. > > Now comes the fun part. > > I reboot the machine with the missing disk. > > What happens now? > > I had pulled da4, I think. So, disks with an ID > 4 have been renamed to N > - 1: da5 became da4, da6 became da5, da7 became da6... and, critically, > da12 became da11. > > The reboot began by failing to mount the root filesystem, but that one is > trivial. Just tell the kernel where it is now (da11) and it boots happily. > > Now, we have a degraded pool with a missing disk (da4) and a da4 that > previously was da5. It works of course, but in degraded state. > > OK, we found a replacement disk, and we plugged it in. It became, guess! Yes, > da12. > > Now: I cannot "online" da4, because it exists. I cannot online da12 > because it didn't belong to the pool. I cannot replace da4 with da12, > because it is there.
> > Now that I think of it, in this case: > 15896790606386444480 OFFLINE 0 0 0 was /dev/da4 > > Is it possible to say zpool replace 15896790606386444480 da12? I haven't > tried it. > > Anyway, it seems a bit confusing. The logical, albeit cumbersome > approach is to reboot the machine with the new da4 in place, and after > rebooting, onlining or replacing. > > Using names prevents this kind of confusion. > > > > > The same thing happened to me on a production server that crashed. Sometimes it's not easy to reboot again simply because you need to insert a disk. This, of course, makes the hot-swappable capability of the hardware useless. -- Best regards, Claudiu Vasadi From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 15:40:29 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id B44C7D10 for ; Tue, 22 Jan 2013 15:40:29 +0000 (UTC) (envelope-from toasty@dragondata.com) Received: from mail-ie0-f179.google.com (mail-ie0-f179.google.com [209.85.223.179]) by mx1.freebsd.org (Postfix) with ESMTP id 7EEB4E38 for ; Tue, 22 Jan 2013 15:40:29 +0000 (UTC) Received: by mail-ie0-f179.google.com with SMTP id k14so11801556iea.24 for ; Tue, 22 Jan 2013 07:40:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dragondata.com; s=google; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer; bh=dCcQIj721dYgKr0N3v/TuQDJYhdhVcBPgVvd1N9AJM4=; b=FUjCvM7HWeAup7n4k/V/p+lGz+FG9ICsmfC3aI/lrUu4Nkx1eCTcAq4g+KL1smOJ52 ENrvt6Ro9VeoVmm6zig0XTQ7SAh8QW5/Vwapg20Ca4ViBdS/pwpkIK2zsXNA9fPgMgc8 iAjVurzYqL8QoLM79fZYAkz6huoOUFE5W+GyY= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:content-type:mime-version:subject:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer :x-gm-message-state;
bh=dCcQIj721dYgKr0N3v/TuQDJYhdhVcBPgVvd1N9AJM4=; b=HoLfLujbU3Ehb0UaIDf0t63BdAp/H2xtGBu7StZ2B7F5c/iAKUt1Nd7sAHr1vmG7mY RAORK0M7rKoJyi8ByzbyTCoRIW+QfXyRl1tBMdsX5sxSOmksY57UoF7DvoHhhN++cdvQ U38MfCfBUp/KsCjOqJ91lye+49+qSWkWdgZJ8kiSCA14PnMo1IQa3oZh0Xf8CyceCB29 20UMKGHOIteLSPPDrSP6C64gZiRFSMzMpCTsSn5MsrOHbhhbocjXe+61ZiTh3f9I1JEf /FXUNShWKaeelPy3qL7Qbw+i3Dj1yDqMwG0hwK3holWjpR9uHWSPjx4CXFziabTixyr8 x2/w== X-Received: by 10.42.91.7 with SMTP id n7mr14711171icm.40.1358869229011; Tue, 22 Jan 2013 07:40:29 -0800 (PST) Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132]) by mx.google.com with ESMTPS id ez8sm12098977igb.17.2013.01.22.07.40.26 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 22 Jan 2013 07:40:28 -0800 (PST) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: Kevin Day In-Reply-To: Date: Tue, 22 Jan 2013 09:40:24 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> To: Warren Block X-Mailer: Apple Mail (2.1499) X-Gm-Message-State: ALoCoQnl9gnkZTIyzyJDuQA9Jh6xtWfui86+qbHnXFq0EXxGD2BgZ3oTeUsjXrSF40BkpYQPSJ4I Cc: FreeBSD Filesystems , Scott Long , wblock@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 15:40:29 -0000 On Jan 22, 2013, at 9:12 AM, Warren Block wrote: > > I would like to see guidelines for at least two common scenarios: > > Multi-terabyte file server with multi-drive pool. [...] > The first is easy with the defaults, but particular tuning could be beneficial. And would be a good place to talk about NFS on ZFS, usage of SSDs, and so on. I run ftpmirror.your.org, which is a 72 x 3TB drive ZFS server. It's a very busy server.
It currently houses the only off-site backup of all of the Wikimedia projects (121TB), a full FreeBSD FTP mirror (1TB), a full CentOS mirror, all of FreeBSD-Archive (1.5TB), FreeBSD-CVS, etc. It's usually running between 100 and 1500 Mbps of ethernet traffic in/out of it. There are usually around 15 FTP connections, 20-50 HTTP connections, 10 rsync connections and 1 or 2 CVS connections. The only changes we've made that are ZFS-specific are atime=off and sync=disabled. Nothing we do uses atimes, so disabling that cuts down on a ton of unnecessary writes. Disabling sync is okay here too - we're just mirroring stuff that's available elsewhere, so there's no threat of data loss. Other than some TCP tuning in sysctl.conf, this is running a totally stock kernel with no special settings. I've looked at using an SSD for metadata-only caching, but it appears that we've got far more than 256GB of metadata here that's being accessed regularly (nearly every file is being stat'ed when rsync runs), so I'm guessing it's not going to be incredibly effective unless I buy a seriously large SSD. If you have any specific questions I'm happy to answer though.
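[The ZFS-specific changes Kevin describes are single property settings; a sketch, with the pool name "tank" assumed since the real pool name is not given:

```sh
# Skip access-time updates; saves a metadata write for every file read:
zfs set atime=off tank

# Treat all writes as asynchronous; only safe when the data can be
# re-fetched from elsewhere, as on a mirror server like this one:
zfs set sync=disabled tank

# If an L2ARC SSD were added later, it could be limited to metadata:
zfs set secondarycache=metadata tank
```

The last property is what "an SSD for metadata-only caching" would map to, once a cache device is attached with zpool add tank cache <dev>.]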
-- Kevin From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 16:54:31 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 2A0C6558 for ; Tue, 22 Jan 2013 16:54:31 +0000 (UTC) (envelope-from feld@feld.me) Received: from feld.me (unknown [IPv6:2607:f4e0:100:300::2]) by mx1.freebsd.org (Postfix) with ESMTP id DB9112D5 for ; Tue, 22 Jan 2013 16:54:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=feld.me; s=blargle; h=In-Reply-To:Message-Id:From:Mime-Version:Date:References:Subject:To:Content-Type; bh=9K/lQ9iTNyv0EBtW1QP4hMMGtkmYPtOMgltoPq20Db0=; b=r8p/5cO5uIOPPDrvbdNVk4Q4X4AtJI0sd9aeO9jgjVxdtEeolU20l/odIYQgx11TqzRhqIV3uXlRumI3SxxbHUR+vMq+lK6kTE+xo1ezpe5XHudeXrE7Ve9RY4S1Xned; Received: from localhost ([127.0.0.1] helo=mwi1.coffeenet.org) by feld.me with esmtp (Exim 4.80.1 (FreeBSD)) (envelope-from ) id 1Txh7F-000OZS-7E for freebsd-fs@freebsd.org; Tue, 22 Jan 2013 10:54:29 -0600 Received: from feld@feld.me by mwi1.coffeenet.org (Archiveopteryx 3.1.4) with esmtpsa id 1358873668-62102-17996/5/4; Tue, 22 Jan 2013 16:54:28 +0000 Content-Type: text/plain; format=flowed; delsp=yes To: freebsd-fs@freebsd.org Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> Date: Tue, 22 Jan 2013 10:54:28 -0600 Mime-Version: 1.0 From: Mark Felder Message-Id: In-Reply-To: User-Agent: Opera Mail/12.12 (FreeBSD) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 16:54:31 -0000 On Tue, 22 Jan 2013 09:40:24 -0600, Kevin Day wrote: > I've looked at using an SSD for meta-data only caching, but it appears > that we've got far more than 256GB of metadata here that's being > accessed regularly (nearly every file is being stat'ed when rsync 
runs) > so I'm guessing it's not going to be incredibly effective unless I buy a > seriously large SSD. Well, 512GB SSDs are less than $500 now (Samsung 830) but can't you just add multiple SSDs? AFAIK you can have multiple L2ARC devices and ZFS will just split the load between them. From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 18:19:00 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 792D6FA6 for ; Tue, 22 Jan 2013 18:19:00 +0000 (UTC) (envelope-from matthew.ahrens@delphix.com) Received: from mail-la0-f53.google.com (mail-la0-f53.google.com [209.85.215.53]) by mx1.freebsd.org (Postfix) with ESMTP id 01D84ABD for ; Tue, 22 Jan 2013 18:18:59 +0000 (UTC) Received: by mail-la0-f53.google.com with SMTP id fr10so1647185lab.12 for ; Tue, 22 Jan 2013 10:18:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=qIMylib9uTp3ioe2QABsre9UeXFqZcBxYifHZU4Sgqk=; b=GQGa23gi7gjOeKmCQC/jTv4yvJvdl8ZCqTyFW7y+1xaaDHMHwcXznPz8BWc6TCC8a8 cjzMxA2yYe9blMXNKxmwlWF6/1iPEF+01TozCVu3G77ztY9N5FHNX2dGA8XZRE2iW0E9 cgWMAovrbVXWvhR5CepPwxTXVU/C/qqc0coGg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type:x-gm-message-state; bh=qIMylib9uTp3ioe2QABsre9UeXFqZcBxYifHZU4Sgqk=; b=XuEqf18CztNJJWhcwBjmcOe9/FmoDLNGbtVeCbQK1tET5MkY6D9yJBQJChbSKWtqyu ieFoQ5QDQ+l7jyVQSuCWe/39ed6T7oDAO0ksoZ8wLuJG0JDADyu4e54tSxC0KOcHYuLX zLORiN6POF67UQGuY5DgZV6qKaMu8uMcztU5a/T/IEC3OnuRKDnsdWu1L9Z8qJZcg3HK T47e70GCyPhgfPwnpDYfLyA8hE4vHC8DIwkFvcpAdMWrsneKce/lBpUUz38SKQvcEpfO bqtCvNJvGr/CjjFam7AorOOljL+65y4qarZlEv6+1H5l3fky+fnzH62ORZqa7STIHgsI xF9Q== MIME-Version: 1.0 X-Received: by 10.112.88.105 with SMTP id 
bf9mr9704257lbb.43.1358878732612; Tue, 22 Jan 2013 10:18:52 -0800 (PST) Received: by 10.114.63.100 with HTTP; Tue, 22 Jan 2013 10:18:52 -0800 (PST) In-Reply-To: <20130122073641.GH30633@server.rulingia.com> References: <20130122073641.GH30633@server.rulingia.com> Date: Tue, 22 Jan 2013 10:18:52 -0800 Message-ID: Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. From: Matthew Ahrens To: Peter Jeremy X-Gm-Message-State: ALoCoQlLHuzt8meD6fgYKc4h7VgB1cpOiLAen6yBQBHoEdvCg0QIhP7S0FuTKpoh1Eo6T1GRVk/q Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs , Wojciech Puchar , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 18:19:00 -0000 On Mon, Jan 21, 2013 at 11:36 PM, Peter Jeremy wrote: > On 2013-Jan-21 12:12:45 +0100, Wojciech Puchar < wojtek@wojtek.tensor.gdynia.pl> wrote: >>While RAID-Z is already a king of bad performance, > > I don't believe RAID-Z is any worse than RAID5. Do you have any actual > measurements to back up your claim? Leaving aside anecdotal evidence (or actual measurements), RAID-Z is fundamentally slower than RAID4/5 *for random reads*. This is because RAID-Z spreads each block out over all disks, whereas RAID5 (as it is typically configured) puts each block on only one disk. So to read a block from RAID-Z, all data disks must be involved, vs. for RAID5 only one disk needs to have its head moved. For other workloads (especially streaming reads/writes), there is no fundamental difference, though of course implementation quality may vary. >> Even better - use UFS. To each their own. As a ZFS developer, it should come as no surprise that in my opinion and experience, the benefits of ZFS almost always outweigh this downside. 
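[Matt's distinction can be reduced to a back-of-envelope model. The per-spindle IOPS figure and disk count below are illustrative assumptions, not measurements:

```sh
# Toy model of random-read IOPS for an 8-disk redundancy group at a
# hypothetical 100 random reads per second per spindle.
per_disk_iops=100
data_disks=8

# RAID5-like layout: each block lives on one disk, so independent
# random reads can proceed on all spindles in parallel.
raid5_like=$((per_disk_iops * data_disks))

# RAID-Z-like layout: each block spans every data disk, so all heads
# move together and reads cannot overlap.
raidz_like=$per_disk_iops

echo "RAID5-like: ${raid5_like} IOPS, RAID-Z-like: ${raidz_like} IOPS"
# prints: RAID5-like: 800 IOPS, RAID-Z-like: 100 IOPS
```

For streaming reads and writes both layouts keep every spindle busy, which is why the gap is specific to random reads.]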
--matt From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 18:19:09 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id C9D847B; Tue, 22 Jan 2013 18:19:09 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-la0-f46.google.com (mail-la0-f46.google.com [209.85.215.46]) by mx1.freebsd.org (Postfix) with ESMTP id 2A8D4ABF; Tue, 22 Jan 2013 18:19:08 +0000 (UTC) Received: by mail-la0-f46.google.com with SMTP id fq12so5380274lab.19 for ; Tue, 22 Jan 2013 10:19:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=2Ky5WXRbp7Q12poesWHeD5DZjJ4qtFD4L0BXepddLBU=; b=jgUIXrYUm1EXQIu3fKJz0HusWe1pdCfp31qvDVulAhl7xsTsGu5WVZVm+IsTzmUW0T Qv4v/9h+JFijYsZj9Cd4AoP4V1fW6SPaZhP8aga4O4OfglGEx7Ohz7Zfxd6XG+u5CSV+ 8wCZZNNlJ5QzTUWLkKWw0IupwgvEZkltVMRw2tqvuJ884nJBclrKbQINfFaO76OaUggR t3uIk9NfvcPLzSHb4CBzUu2ZWsbTgdmhhV7pIh6Py5wAsZTTdR5EO/SUNW0sg2L+Pz20 opIhReRWtf9FVG/lmY8N4IW73JYuNK58GT4O8HmthLeYWpxwS9C7bpx0UMJbKtdVU2ZQ AF4Q== X-Received: by 10.112.88.7 with SMTP id bc7mr9741645lbb.108.1358878747794; Tue, 22 Jan 2013 10:19:07 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. [213.227.240.37]) by mx.google.com with ESMTPS id ml1sm7357808lab.15.2013.01.22.10.19.06 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 22 Jan 2013 10:19:07 -0800 (PST) Sender: Alexander Motin Message-ID: <50FED818.7070704@FreeBSD.org> Date: Tue, 22 Jan 2013 20:19:04 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Jeremy Chadwick Subject: Re: disk "flipped" - a known problem? 
References: <20130121221617.GA23909@icarus.home.lan> In-Reply-To: <20130121221617.GA23909@icarus.home.lan> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 18:19:09 -0000 On 22.01.2013 00:16, Jeremy Chadwick wrote: > (Please keep me CC'd as I am not subscribed) > > WRT this: > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html > > I can reproduce the first problem 100% of the time on my home system > here. I can provide hardware specs if needed, but the important part is > that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI > mode (and does not share an IRQ), hot-swap bays are in use, and I'm > using ahci.ko. > > I also want to make this clear to Andriy: I'm not saying "there's a > problem with your disk". In my case, I KNOW there's a problem with the > disk (that's the entire point to my tests! :-) ). > > In my case the disk is a WD Raptor (150GB, circa 2006) that has a very > badly-designed firmware that goes completely catatonic when encountering > certain sector-level conditions. That's not the problem though -- the > problem is with FreeBSD apparently getting confused as to the internal > state of its devices after a device falls off the bus and comes back. > Explanation: > > 1. System powered off; disk is attached; system powered on, shows up as > ada5. Can communicate with device in every way (the way I tend to test > simple I/O is to use "smartctl -a /dev/ada5"). This disk has no > filesystems or other "stuff" on it -- it's just a raw disk, so I believe > the g_wither_washer oddity does not apply in this situation. > > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k" > > 3. Drive hits a bad sector which it cannot remap/deal with. 
> Drive > firmware design flaw results in drive becoming 100% stuck trying to > re-read the sector and work out internal decisions to do remapping or > not. Drive audibly clicking during this time (not actuator arm being > reset to track 0 noise; some other mechanical issue). Due to firmware > issue, drive remains in this state indefinitely. > > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ) > errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5 > times (kern.cam.da.retry_count+1). > > 5. FreeBSD spits out similar messages you see; retries exhausted, > cam_periph_alloc error, and devfs claims device removal. > > 6. Drive is still catatonic of course. Only way to reset the drive is > to power-cycle it. Drive removed from hot-swap bay, let sit for 20 > seconds, then is reinserted. > > 7. FreeBSD sees the disk reappear, shows up much like it did during #1, > except... > > 8. "smartctl -a /dev/ada5" claims no such device or unknown device type > (I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol > devlist" shows the disk on the bus, yet I/O does not work. If I > remember right, re-attempting the dd command returns some error (I > forget which). > > 9. "camcontrol rescan all" stalls for quite some time when trying to > communicate with entry 5, but eventually does return (I think with some > error). "camcontrol reset all" works without a hitch. "camcontrol > devlist" during this time shows the same disk on ada5 (which to me means > ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning > I/O works at some level). > > 10. System otherwise works fine, but the only way to bring back > usability of ada5 is to reboot ("shutdown -r now"). > > To me, this looks like FreeBSD at some layer within the kernel (or some > driver (I don't know which)) is internally confused about the true state > of things. > > Alexander, do you have any ideas?
> > I can enable CAM debugging (I do use options CAMDEBUG so I can toggle > this with camcontrol) as well as take notes and do a full step-by-step > diagnosis (along with relevant kernel output seen during each phase) if > that would help you. And I can test patches but not against -CURRENT > (will be a cold day in hell before I run that, sorry). A command timeout by itself is not a reason for the AHCI driver to drop the disk, nor is it for CAM in the case of payload requests. The disk can be dropped if the controller reports device absence detected by the SATA PHY, or on errors during device reinitialization after a reset by the CAM SATA XPT. What is interesting is what exactly goes on after the disk gets stuck and you remove it. In the normal case the controller should immediately report a PHY status change, the driver should run a PHY reset and see that the link is lost. That should trigger a bus rescan for CAM, which should invalidate the device. That should make dd abort with an error. After dd is gone, the device should be destroyed and ready for reattachment. So it would be great if you start with the full verbose dmesg from boot up to the moment when the system becomes stable after the disk removal. If that isn't enough, we can enable some more debugging with `camcontrol debug -IPXp BUS`, where BUS is the bus number from `camcontrol devlist`.
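[The debugging session being suggested looks roughly like this. A sketch only: bus number 0 is an assumption for illustration, and the log file names are hypothetical; take the real bus number from the camcontrol devlist output:

```sh
# Capture the verbose boot log first (boot with -v for full detail):
dmesg -a > /var/tmp/boot-verbose.txt

# Find the bus (scbusN) number of the affected controller:
camcontrol devlist -v

# Enable info/periph/XPT/probe CAM debugging on that bus (0 assumed):
camcontrol debug -IPXp 0

# ...reproduce the disk drop and reinsertion, then save the kernel log:
dmesg > /var/tmp/cam-debug.txt

# Turn CAM debugging back off:
camcontrol debug off
```
]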
-- Alexander Motin From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 19:01:56 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id E05F6A4D for ; Tue, 22 Jan 2013 19:01:56 +0000 (UTC) (envelope-from zbeeble@gmail.com) Received: from mail-la0-f52.google.com (mail-la0-f52.google.com [209.85.215.52]) by mx1.freebsd.org (Postfix) with ESMTP id 40CCEE65 for ; Tue, 22 Jan 2013 19:01:55 +0000 (UTC) Received: by mail-la0-f52.google.com with SMTP id fs12so850581lab.39 for ; Tue, 22 Jan 2013 11:01:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=Qatv82pqI5mbzbDodn6fPNlkZNYWBNvXz2dvbvDmtU8=; b=mmojILm+UQfzQqs66tI3GWwFyoe4fB0E461qzy9YGSDsZJM/cGwg7/oaAGYy1Dzw+k tJdAylmmCmNtqZ7emqGLzfpP989nLLeFHWurN0rRjIaf9xt2JqA5SwJjt8pGVwSfobUo harSphextNWzzI5IMuiENyr+UZj7deODPwTpIKiEwd6wcpwuBdXar0BBHSrTSsma4O+3 ngxHNMC1V89gIl9RMd8xOlGDWd1q+ZEqdy/Q19RJ+kuJSzOKH+3atSTwogWceBMq6GNx fcC/v9UN8EZyAYpOlKwMAdZBbAJi87f6Ezu7+hvaLuR4L/yq+z05vET/Y5OXphCNVXBk grvA== MIME-Version: 1.0 X-Received: by 10.112.46.199 with SMTP id x7mr9532516lbm.109.1358881314809; Tue, 22 Jan 2013 11:01:54 -0800 (PST) Received: by 10.112.6.38 with HTTP; Tue, 22 Jan 2013 11:01:54 -0800 (PST) In-Reply-To: <50FE57A9.2040104@platinum.linux.pl> References: <20130122073641.GH30633@server.rulingia.com> <50FE57A9.2040104@platinum.linux.pl> Date: Tue, 22 Jan 2013 14:01:54 -0500 Message-ID: Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
From: Zaphod Beeblebrox To: Adam Nowacki Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 19:01:56 -0000 On Tue, Jan 22, 2013 at 4:11 AM, Adam Nowacki wrote: > > This is exactly what happened to me. I had a server connected to a failing > mains socket. For about a month ZFS reported checksum errors on multiple > disks, all fixed thanks to raidz2. Socket failed, completely burned off, > UPS woke me up. Replaced the socket, no checksum errors since then. > > This is what was in my wall: http://tepeserwery.pl/DSC_0178.JPG > > Damn, son. That socket is obviously not rated for whatever you used it for. I'm not familiar with European sockets, but any time you smell burning in relation to electric stuff --- it's just bad. Here in North America, many "cheap" wall sockets are really only meant for lamps or TVs and whatnot. Even cheap contractors put "real" sockets in the kitchen. Most "decora" style wall switches are only good for 10 or fewer amps ... not 15 or 20 --- although at least their failure mode is reasonable (they just stop working). I used to be in peer1 (Canada) at our "151 front" (center of Canadian telecoms). I had been noticing for some time the smell of burnt electronics coming from their UPS --- so I moved (this UPS unit would be in the 50 to 100 KVA range).
From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 19:28:59 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 2C543F0B for ; Tue, 22 Jan 2013 19:28:59 +0000 (UTC) (envelope-from utisoft@gmail.com) Received: from mail-ia0-x229.google.com (mail-ia0-x229.google.com [IPv6:2607:f8b0:4001:c02::229]) by mx1.freebsd.org (Postfix) with ESMTP id CB731F4D for ; Tue, 22 Jan 2013 19:28:58 +0000 (UTC) Received: by mail-ia0-f169.google.com with SMTP id j5so3555246iaf.28 for ; Tue, 22 Jan 2013 11:28:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:mime-version:sender:in-reply-to:references:from:date :x-google-sender-auth:message-id:subject:to:cc:content-type; bh=5p1wbNocHDaMsXlJBOYl5RA2uJI5n9ywZL5gi9PxlmQ=; b=Lij+6adr/mNbL39NCgxi5FGSx4b2FgjrwosEwWfMGQgtenchm2+TsTtLi8FVHTlMqo 3N5amxMP4SD48+FogdzxiQoYNe5L6bSOnTuYVQ8pYYruLWE1m/nKfy7tKjGZlPx4VnX1 HiUd484eEf51ynQCahLOnzuSzFzN6SIyEpvigDp78/v2FOJTafSKU36vsHq2oH34mfdi 2zt4GYnuDOe7O+eqIPuwxJeY2tZGWKWb8RxvphjaJcjxb8d8QIdz6QWQ/2A6D4zAZw9o 3WoK+bB1Dp30u1uoaCLvSx6mdjRPh+MvcB0GU8amzKqBktmKKiB3FYg5TEl7fXz19YzS tn5w== X-Received: by 10.50.178.10 with SMTP id cu10mr12981271igc.75.1358882938313; Tue, 22 Jan 2013 11:28:58 -0800 (PST) MIME-Version: 1.0 Sender: utisoft@gmail.com Received: by 10.64.16.73 with HTTP; Tue, 22 Jan 2013 11:28:28 -0800 (PST) In-Reply-To: References: <20130122073641.GH30633@server.rulingia.com> <50FE57A9.2040104@platinum.linux.pl> From: Chris Rees Date: Tue, 22 Jan 2013 19:28:28 +0000 X-Google-Sender-Auth: X94CNZ899MeaJ0RLbonCejodMDQ Message-ID: Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
To: Zaphod Beeblebrox Content-Type: text/plain; charset=ISO-8859-1 Cc: "freebsd-fs@freebsd.org" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 19:28:59 -0000 On 22 January 2013 19:01, Zaphod Beeblebrox wrote: > On Tue, Jan 22, 2013 at 4:11 AM, Adam Nowacki wrote: > >> >> This is exactly what happened to me. I had a server connected to a failing >> mains socket. For about a month ZFS reported checksum errors on multiple >> disks, all fixed thanks to raidz2. Socket failed, completely burned off, >> UPS woke me up. Replaced the socket, no checksum errors since then. >> >> This is what was in my wall: http://tepeserwery.pl/DSC_0178.JPG >> >> > Damn, son. That socket is obviously not rated for whatever you used it > for. I'm not familiar with European sockets, but any time you smell > burning in relation to electric stuff --- it's just bad. Here in North > America, many "cheap" wall sockets are really only meant for lamps or TVs > and whatnot. Even cheap contractors put "real" sockets in the kitchen. > Most "decora" style wall switches are only good for 10 or fewer amps ... > not 15 or 20 --- although at least their failure mode is reasonable (they > just stop working). That is a standard European socket, which is normally rated at 13A. Also please remember that we use 230V here, so 10A is a high current for us. Most of our computers have 5A fuses in them (~1.2 kW). Mercifully, in Britain we have seriously anal standards for electrical equipment, so there is no such thing here as a 'cheap' socket; all are fine. Not sure if one can say the same for Poland. Chris > I used to be in peer1 (Canada) at our "151 front" (center of Canadian > telecoms). I had been noticing for some time the smell of burnt > electronics coming from their UPS --- so I moved (this UPS unit would be in > the 50 to 100 KVA range).
About a year later, it caught fire and had that > colo down for most of the day. I had pointed it out to the people who > worked there --- but they obviously didn't take it seriously enough. > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 21:57:13 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 337C2C55; Tue, 22 Jan 2013 21:57:13 +0000 (UTC) (envelope-from steven@pyro.eu.org) Received: from falkenstein-2.sn.de.cluster.ok24.net (falkenstein-2.sn.de.cluster.ok24.net [IPv6:2002:4e2f:2f89:2::1]) by mx1.freebsd.org (Postfix) with ESMTP id D99F18D1; Tue, 22 Jan 2013 21:57:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=simple/simple; d=pyro.eu.org; s=01.2013; h=Content-Transfer-Encoding:Content-Type:In-Reply-To:References:Subject:CC:To:MIME-Version:From:Date:Message-ID; bh=0ZHrX35EhLBBzYZvqack3wej6dBux2WfK9I452eXZ8s=; b=YlTC+F/2jkvpcoCuTOhyQMmyvaD0qivg6H5cNz/EFxdFW+YWXkp10XLt72kXRd563MrslH7jo9j9W+StH42U3F/Jb3KPHnfHaFkURJXDnQGyG/NZ0IGvktnGqx3wAwGXZTwbTzQslvbzJDsVOPt6GdDGWJtDdJFAkV4UK1sWj2M=; X-Spam-Status: No, score=-1.1 required=2.0 tests=ALL_TRUSTED, BAYES_00, DKIM_ADSP_DISCARD, TVD_RCVD_IP Received: from 188-220-33-66.zone11.bethere.co.uk ([188.220.33.66] helo=guisborough-1.rcc.uk.cluster.ok24.net) by falkenstein-2.sn.de.cluster.ok24.net with esmtp (Exim 4.72) (envelope-from ) id 1Txlq7-0007mr-NF; Tue, 22 Jan 2013 21:57:10 +0000 X-Spam-Status: No, score=-4.3 required=2.0 tests=ALL_TRUSTED, AWL, BAYES_00, DKIM_POLICY_SIGNALL Received: from [192.168.0.110] (helo=[192.168.0.9]) by guisborough-1.rcc.uk.cluster.ok24.net with esmtpsa (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.69) (envelope-from ) id 1Txlpu-0004Ta-6N; Tue, 
22 Jan 2013 21:57:07 +0000 Message-ID: <50FF0B25.3050009@pyro.eu.org> Date: Tue, 22 Jan 2013 21:56:53 +0000 From: Steven Chamberlain User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130116 Icedove/10.0.12 MIME-Version: 1.0 To: Adam Nowacki Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. References: <20130122073641.GH30633@server.rulingia.com> <50FE57A9.2040104@platinum.linux.pl> In-Reply-To: X-Enigmail-Version: 1.4.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: "freebsd-fs@freebsd.org" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 21:57:13 -0000 On 22/01/13 19:28, Chris Rees wrote: > On 22 January 2013 19:01, Zaphod Beeblebrox wrote: >> On Tue, Jan 22, 2013 at 4:11 AM, Adam Nowacki wrote: >>> This is what was in my wall: http://tepeserwery.pl/DSC_0178.JPG >>> >> Damn, son. That socket is obviously not rated for whatever you used it >> for. > > That is a standard European socket, which are normally rated at 13A. Maybe they don't work very well with paint splattered inside them. My guess is the paint was slightly metallic or conductive, so a current was flowing between one of the screw terminals on the right, and the bolt in the mounting bracket (which is probably earthed). The socket may have been supplying a lower voltage as a result, hence the equipment suffering brownouts. (But it seems the UPS wasn't sensitive enough to this?) Ideally a circuit breaker would have tripped before anything got hot enough to melt, but in this case the heat in the right-hand-side rail (in a wall, with no air circulation) became enough to discolour the plastic. I think it's lucky it didn't cause a fire. 
In the UK most house/office socket circuits are supposed to be protected by an RCD, which are extremely sensitive to fault currents like this flowing to earth. If there are any more sockets in that room/building I would definitely check all of them! (Carefully, with the supply disconnected, of course). Or an insulation/leakage test of the circuit by an electrician would have detected this. > checksum errors on multiple > disks, all fixed thanks to raidz2. Well, that's some relief :) Regards, -- Steven Chamberlain steven@pyro.eu.org From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 22:34:44 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 86EA0AE7; Tue, 22 Jan 2013 22:34:44 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id 2B487A5C; Tue, 22 Jan 2013 22:34:43 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id 2C3F147E0F; Tue, 22 Jan 2013 23:34:40 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.0.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id 743FB47DE6; Tue, 22 Jan 2013 23:34:30 +0100 (CET) Message-ID: <50FF13E8.2030904@platinum.linux.pl> Date: Tue, 22 Jan 2013 23:34:16 +0100 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: Steven Chamberlain Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
References: <20130122073641.GH30633@server.rulingia.com> <50FE57A9.2040104@platinum.linux.pl> <50FF0B25.3050009@pyro.eu.org> In-Reply-To: <50FF0B25.3050009@pyro.eu.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "freebsd-fs@freebsd.org" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 22:34:44 -0000 Many theories - all wrong. The socket failed because of an unstuck screw. Everything was wired correctly, the whole circuit was on a 10A fuse. After many years of use (and neglect) one of the screws became dislodged (thermal expansion of the wires or whatever) - enough to spark and cause problems. From owner-freebsd-fs@FreeBSD.ORG Tue Jan 22 23:02:42 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id DD665205 for ; Tue, 22 Jan 2013 23:02:42 +0000 (UTC) (envelope-from fjwcash@gmail.com) Received: from mail-lb0-f171.google.com (mail-lb0-f171.google.com [209.85.217.171]) by mx1.freebsd.org (Postfix) with ESMTP id 55E37B5C for ; Tue, 22 Jan 2013 23:02:41 +0000 (UTC) Received: by mail-lb0-f171.google.com with SMTP id gg13so4210871lbb.30 for ; Tue, 22 Jan 2013 15:02:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=epgTUus8irZmPELv4jZRSNsBP2Tp1yUuFi2BXUzQc58=; b=MVSCtUzQKHHM38CJ6xoq2R16tB/lnYNRclkEvhRXM+i7kMTMzyWhaPdii6taMh1HxT bRxQ76B0A5/Hw3MS5miscIAH8OWyhebuQr7havoHKZGwn95RxkVcpp2UcatWhyu3IlyZ J6DTv8oZDXvbC0/gECBpFBrlZywdxoyZrMl9b6xRoy6pn+Qt/hlTpmM0LwBSyeXGs7P3 sVgSyKjrNoTKcITe0yZb3YksDNeXy0sH8/sfsKyaMpUuGtCSZj6W/OzGHJFX7UxR5ac4 PeufUQAnR/ZfE0MhgsD52wkCyOy4OE00G023xsYBqBlD0vHtNHpc8vPsX8OfE3IIbjMv nahQ== 
MIME-Version: 1.0 X-Received: by 10.112.44.134 with SMTP id e6mr9857851lbm.134.1358895760864; Tue, 22 Jan 2013 15:02:40 -0800 (PST) Received: by 10.114.81.40 with HTTP; Tue, 22 Jan 2013 15:02:40 -0800 (PST) In-Reply-To: References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> Date: Tue, 22 Jan 2013 15:02:40 -0800 Message-ID: Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: Freddie Cash To: Warren Block Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: FreeBSD Filesystems , Scott Long X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Jan 2013 23:02:42 -0000

On Jan 22, 2013 7:04 AM, "Warren Block" wrote:
>
> On Tue, 22 Jan 2013, Borja Marcos wrote:
>
>> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>>
>> ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>>
>> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to one device number lower. Potential for trouble, as a minimum.
>>
>> After several different installations, I now prefer to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive, and label it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.
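The labelling scheme quoted above can be sketched with gpart; the device name da5 is illustrative, not from the original mail:

```shell
# Sketch of the pool-vdev-disk labelling scheme described above.
# da5 and the label text are illustrative examples.
gpart create -s gpt da5                            # fresh GPT on the disk
gpart add -t freebsd-zfs -l pool-raidz1-disk1 da5  # one big labelled partition
# The stable name now appears under /dev/gpt/ and survives daN renumbering:
ls /dev/gpt/pool-raidz1-disk1
```

The pool is then built from /dev/gpt/* names instead of daN device numbers, so a dead disk no longer shifts the numbering of the survivors across a reboot.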
> > > I'm a proponent of using various types of labels, but my impression after a recent experience was that ZFS metadata was enough to identify the drives even if they were moved around. That is, ZFS bare metadata on a drive with no other partitioning or labels. > > Is that incorrect? The ZFS metadata on disk allows you to move disks around in a system and still import the pool, correct. But the ZFS metadata will not help you figure out which disk, in which bay, of which drive shelf just died and needs to be replaced. That's where glabels, gpt labels, and similar come in handy. It's for the sysadmin, not the system itself. From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 00:11:34 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 297DDE32 for ; Wed, 23 Jan 2013 00:11:34 +0000 (UTC) (envelope-from jhs@berklix.com) Received: from flat.berklix.org (flat.berklix.org [83.236.223.115]) by mx1.freebsd.org (Postfix) with ESMTP id 972BBE55 for ; Wed, 23 Jan 2013 00:11:32 +0000 (UTC) Received: from mart.js.berklix.net (p5DCBFE5B.dip.t-dialin.net [93.203.254.91]) (authenticated bits=128) by flat.berklix.org (8.14.5/8.14.5) with ESMTP id r0N0BHAw083778; Wed, 23 Jan 2013 01:11:17 +0100 (CET) (envelope-from jhs@berklix.com) Received: from fire.js.berklix.net (fire.js.berklix.net [192.168.91.41]) by mart.js.berklix.net (8.14.3/8.14.3) with ESMTP id r0N0DHUv038279; Wed, 23 Jan 2013 01:13:19 +0100 (CET) (envelope-from jhs@berklix.com) Received: from fire.js.berklix.net (localhost [127.0.0.1]) by fire.js.berklix.net (8.14.4/8.14.4) with ESMTP id r0N0CwUH054867; Wed, 23 Jan 2013 01:13:17 +0100 (CET) (envelope-from jhs@fire.js.berklix.net) Message-Id: <201301230013.r0N0CwUH054867@fire.js.berklix.net> To: Steven Chamberlain Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. From: "Julian H. 
Stacey" Organization: http://berklix.com BSD Unix Linux Consultancy, Munich Germany User-agent: EXMH on FreeBSD http://berklix.com/free/ X-URL: http://www.berklix.com In-reply-to: Your message "Tue, 22 Jan 2013 21:56:53 GMT." <50FF0B25.3050009@pyro.eu.org> Date: Wed, 23 Jan 2013 01:12:58 +0100 Sender: jhs@berklix.com Cc: "freebsd-fs@freebsd.org" X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 00:11:34 -0000

Steven Chamberlain wrote:
> On 22/01/13 19:28, Chris Rees wrote:
> > On 22 January 2013 19:01, Zaphod Beeblebrox wrote:
> >> On Tue, Jan 22, 2013 at 4:11 AM, Adam Nowacki wrote:
> >>> This is what was in my wall: http://tepeserwery.pl/DSC_0178.JPG
> >>>
> >> Damn, son. That socket is obviously not rated for whatever you used it
> >> for.
> >
> > That is a standard European socket, which are normally rated at 13A.

Re. European sockets: I doubt 13 Amps (though I don't know the Polish system; possible, I suppose). British square-pin plugs are 13 Amp; German and other continental ones are not. This (Polish?) socket looks similar to the French type, and the French are marginally better than the German, Austrian and North Italian/South Tyrol ones.

Continental wiring standards I've seen are shamefully dangerous compared to British ones (which are built to a better standard, cost more per item of equipment and take more work to install). In a German rental flat here, one of several hundred in a big, not cheap complex built in Munich in 1985, there are just 16 A and 10 A trip fuses. I suspect a lot is spur wired, not ring main. Certainly it's combined lighting and floor power on one fuse per several rooms. No earth trip, though I'm told it should have one (but I won't commit myself on whether it should by law, as I've not myself researched German law/standards on that). A naked socket with unscreened holes and no switch between the two hand basins in the bathroom, inches from the basins; a second socket for the clothes washing machine in the bathroom; a light switch in the wall, not a ceiling pull cord. All quite normal by German and continental standards, all appalling to a British electrician.

What is shown in http://tepeserwery.pl/DSC_0178.JPG I suppose is Polish - never been there - but it looks just like a French-style socket (same material as used in Germany, Austria and the very north of Italy (= South Tyrol)). But the French (and, I see, the Poles) at least achieve the possibility of differentiating neutral from live, by virtue of the offset earth pin.

These (types of) sockets are rubbish compared with British 13A sockets. Reasons: no chance of a switch (cheap British ones don't have one, decent ones do). Big reason: see the tiny claws (more visible on the left)? The leftmost screw pushes them sideways into the wall. That's all that holds the socket in the wall, two claws; after a while they work loose. If you've got a vacuum cleaner or kettle plugged in (which needs firm contact for all the current, to avoid getting hot), there's a heavy outward force on the socket when unplugging. (I always use one hand on the plastic to help it stay in the wall, with the other hand on the plug to pull it out.) Another reason: all that naked metal when the cover is off (a British socket is a lot more covered - much lower chance of electrocution at 230V in Europe). Another reason: UK plugs also have variable 2/3/5/13 amp fuses in the plugs; continental sockets supply up to the room circuit fuse rating, a lot more than many appliance cables can take. Another reason: polarised live and neutral (well, at least the French and Poles achieve that; the Germans, Austrians and North Tyrol fail).

They like crap sockets on the continent, as seen in the picture, because you can just use a combi circular saw, with a drill in the middle to quickly pilot a hole, then sink a bigger circular hole in the wall and bung in a cheap circular plastic cylinder (which the metal claws eat into and scratch out of over the years). In Britain, you have to hack out a square hole (a lot more work), then put in a more expensive galvanised steel square cavity box, then bang in several masonry nails sideways to hold the steel box in place, then screw in the more expensive, better socket, with proper metal-thread screws into threaded holes that make a good grip. The plastic cover on British sockets is much thicker and stronger. In Munich, Schiller Str (the main computer/PC street) sold British polarised square plugs and sockets as high-quality luxury equipment at several times UK prices. Much continental wiring is substandard and would be condemned under British (ex-IEE, as was) wiring regs. British plugs are admittedly an absolute swine if you walk on them accidentally barefoot, and clunkier in a laptop case, but the plugs and sockets are _much_ better. What cost a life or a burnt building?

> Maybe they don't work very well with paint splattered inside them.
>
> My guess is the paint was slightly metallic or conductive, so a current
> was flowing between one of the screw terminals on the right, and the
> bolt in the mounting bracket (which is probably earthed).
>
> The socket may have been supplying a lower voltage as a result,

The voltage would only drop enough to count as a brownout if there was horrendous heat evolved inside the plug/socket combo. These crappy continental plastic covers suffer from heat easily, and you'd need very little voltage drop to cause the browning seen.

> hence
> the equipment suffering brownouts. (But it seems the UPS wasn't
> sensitive enough to this?)
>
> Ideally a circuit breaker would have tripped before anything got hot
> enough to melt, but in this case the heat in the right-hand-side rail
> (in a wall, with no air circulation) became enough to discolour the
> plastic. I think it's lucky it didn't cause a fire.
>
> In the UK most house/office socket circuits are supposed to be protected
> by an RCD, which are extremely sensitive to fault currents like this
> flowing to earth.

Probably just an HR (high resistance) bad contact causing heat; I've seen so many loose continental sockets, relatively few UK 13A ones by contrast. When feeling mean, or when the shops are closed, I remove the socket and file the holes (with the fuse off :-). Fine emery (black sand) paper is good for polishing plug contacts. I've known many people in Britain and Germany use plugs for decades, till the pins are really dirty, and they never think to polish the plug pins. So easy to do, even if it takes an electrician to remove and replace or clean (or tighten the springs in) a socket.

PS: 2-pin multi-way continental adapters are also crap. Insert a 2-pin plug fully into an adapter and you can feel the metal contact is sometimes off, or often almost off, ready to be high resistance or to fail, because it's gone in too deep, because too much of the shaft is plastic and not enough metal along the tip. Can occasionally be the cause of e.g. laptops, electric toothbrushes and razors not charging.

> If there are any more sockets in that room/building I would definitely
> check all of them! (Carefully, with the supply disconnected, of
> course). Or an insulation/leakage test of the circuit by an electrician
> would have detected this.
>
> > checksum errors on multiple
> > disks, all fixed thanks to raidz2.
>
> Well, that's some relief :)
>
> Regards,
> --
> Steven Chamberlain
> steven@pyro.eu.org

Cheers,
Julian
--
Julian Stacey, BSD Unix Linux C Sys Eng Consultant, Munich http://berklix.com
Reply below not above, like a play script. Indent old text with "> ".
Send plain text. Not: HTML, multipart/alternative, base64, quoted-printable.
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 01:27:31 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 552818E2 for ; Wed, 23 Jan 2013 01:27:31 +0000 (UTC) (envelope-from freebsd@deman.com) Received: from plato.corp.nas.com (plato.corp.nas.com [66.114.32.138]) by mx1.freebsd.org (Postfix) with ESMTP id 1DBD394 for ; Wed, 23 Jan 2013 01:27:31 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by plato.corp.nas.com (Postfix) with ESMTP id EA31312E1E527 for ; Tue, 22 Jan 2013 17:27:23 -0800 (PST) X-Virus-Scanned: amavisd-new at corp.nas.com Received: from plato.corp.nas.com ([127.0.0.1]) by localhost (plato.corp.nas.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id PSRNq6K78xZT for ; Tue, 22 Jan 2013 17:27:22 -0800 (PST) Received: from [192.168.0.116] (c-50-135-255-120.hsd1.wa.comcast.net [50.135.255.120]) by plato.corp.nas.com (Postfix) with ESMTPSA id 1B9B512E1E51B for ; Tue, 22 Jan 2013 17:27:22 -0800 (PST) Content-Type: text/plain; charset=iso-8859-1 Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: Michael DeMan In-Reply-To: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> Date: Tue, 22 Jan 2013 17:27:13 -0800 Content-Transfer-Encoding: quoted-printable Message-Id: References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> To: FreeBSD Filesystems X-Mailer: Apple Mail (2.1499) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 01:27:31 -0000

I think this would be awesome. Googling around, it is extremely difficult to know what to do and which practices are current or obsolete, etc.

I would suggest maybe some separate sections so the information is organized well and can be easily maintained?

MAIN:
- recommend anybody using ZFS have a 64-bit processor and 8GB RAM.
- I don't know, but it seems to me that much of what would go in here is fairly well known now and probably not changing much?

ROOT ON ZFS:
- section just for this

32-bit AND/OR TINY MEMORY:
- all the tuning needed for the people who aren't following the recommended 64-bit + 8GB RAM setup.
- probably there are enough such people, even though it seems pretty obvious that in a couple more years nobody will have 32-bit or less than 8GB RAM?

A couple more things for subsections in topic MAIN - lots of stuff to go in there...

PARTITIONING:
(I could be misinformed here, but my understanding is that) best practice is to use gpart + gnop to:
#1. Ensure proper alignment for 4K-sector drives - the latest Western Digitals still report as 512.
#2. Ensure a little extra space is left on the drive, since if the whole drive is used, a replacement may be a tiny bit smaller and will not work.
#3. Label the disks so you know what is what.

MAPPING PHYSICAL DRIVES:
Particularly an issue with SATA drives - basically force the mapping so that if the system reboots with a drive missing (or you add drives) you know what is what.
- http://lists.freebsd.org/pipermail/freebsd-fs/2011-March/011039.html
- so you can put a label on the disk caddies, and when the system says 'diskXYZ' died you can just look at the label on the front of the box and change 'diskXYZ'.
- also without this - if you reboot after adding disks or with a disk missing - all the adaXYZ numbering shifts :(

SPECIFIC TUNABLES:
- there are still a myriad of specific tunables that can be very helpful even with 8GB+.

ZFS GENERAL BEST PRACTICES:
- address the regular ZFS stuff here
- why the ZIL is a good thing even if you think it kills your NFS performance
- no vdevs > 8 disks; raidz1 best with 5 disks, raidz2 best with 6 disks, etc.
- striping over raidz1/raidz2 pools
- striping over mirrors
- etc...

On Jan 22, 2013, at 3:03 AM, Borja Marcos wrote:

> (Scott, I hope you don't mind being CC'd; I'm not sure you read the -FS mailing list, and this is a SCSI/FS issue)
>
> Hi :)
>
> Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well-proven "doctrine" in order to avoid problems. Especially, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").
>
> ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers which tend to make things worse.
>
> Since I've been using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I try to explain some of them, and my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.
>
> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>
> ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12.
FreeBSD's dynamic naming can be very problematic.
>
> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to one device number lower. Potential for trouble, as a minimum.
>
> After several different installations, I now prefer to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive, and label it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.
>
> When creating a pool, I use these names instead of dealing with device numbers. For example:
>
> % zpool status
>   pool: rpool
>  state: ONLINE
>   scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan 7 16:25:47 2013
> config:
>
>         NAME                 STATE     READ WRITE CKSUM
>         rpool                ONLINE       0     0     0
>           mirror-0           ONLINE       0     0     0
>             gpt/rpool-disk1  ONLINE       0     0     0
>             gpt/rpool-disk2  ONLINE       0     0     0
>         logs
>           gpt/zfs-log        ONLINE       0     0     0
>         cache
>           gpt/zfs-cache      ONLINE       0     0     0
>
> Using a unique name for each disk within your organization is important. That way, you can safely move the disks to a different server, which might be using ZFS, and still be able to import the pool without name collisions. Of course you could use gptids, which, as far as I know, are unique, but they are difficult to use and in case of a disk failure it's not easy to determine which disk to replace.
>
> 2- RAID cards.
>
> Simply: avoid them like the pest. ZFS is designed to operate on bare disks, and it does an amazingly good job. Any additional software layer you add on top will compromise it. I have had bad experiences with "mfi" and "aac" cards.
>
> There are two solutions adopted by RAID card users. Neither of them is good. The first, an obvious one, is to create a RAID5 taking advantage of the battery-based cache (if present). It works, but it loses some of the advantages of ZFS. Moreover, trying different cards, I have been forced to reboot whole servers in order to do something trivial like replacing a failed disk. Yes, there are software tools to control some of the cards, but they are at the very least cumbersome and confusing.
>
> The second "solution" is to create a RAID0 volume for each disk (some RAID card manufacturers even dare to call it JBOD). I haven't seen a single instance of this working flawlessly. Again, a replaced disk can be a headache. At the very least, you have to deal with a cumbersome and complicated management program to replace a disk, and you often have to reboot the server.
>
> The biggest reason to avoid these stupid cards, anyway, is plain simple: those cards, at least the ones I have tried bundled by Dell as PERC(insert a random number here) or Sun, isolate the ASC/ASCQ sense codes from the filesystem. Pure crap.
>
> Years ago, fighting this issue, and when ZFS was still rather experimental, I asked for help and Scott Long sent me a "don't try this at home" simple patch, so that the disks become available to the CAM layer, bypassing the RAID card. He warned me of potential issues and lost sense codes, but, so far so good. And indeed the sense codes are lost when a RAID card creates a volume, even in the misnamed "JBOD" configuration.
>
> http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
> http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679
>
> Anyway, even if there might be some issues due to command handling, the end-to-end verification performed by ZFS should ensure that, as a minimum, the data on the disks won't be corrupted and, in case it happens, it will be detected. I rather prefer to have ZFS deal with it, instead of working on a sort of "virtual" disk implemented on the RAID card.
>
> Another *strong* reason to avoid those cards, even in "JBOD" configurations, is disk portability. The RAID card labels the disks. Moving one disk from one machine to another will result in a funny situation of confusing "import foreign config/ignore" messages when rebooting the destination server (mandatory in order to be able to access the transferred disk). Once again, additional complexity, useless layering and more reboots. That may be acceptable for Metoosoft crap, not for Unix systems.
>
> Summarizing: I would *strongly* recommend avoiding the RAID cards and getting proper host adapters without any fancy functionality instead. The one sold by Dell as H200 seems to work very well. No need to create any JBOD or fancy thing at all. It will just expose the drives as normal SAS/SATA ones. A host adapter without fancy firmware is the best guarantee against failures caused by fancy firmware.
>
> But, in case that's not possible, I am still leaning towards the kludge of bypassing the RAID functionality, and even avoiding the JBOD/RAID0 thing by patching the driver. There is one issue, though. In case of reboot, the RAID cards freeze, I am not sure why. Maybe that could be fixed; it happens on machines on which I am not using the RAID functionality at all. They should become "transparent" but they don't.
>
> Also, I think that the so-called JBOD thing would impair the correct operation of a ZFS health daemon doing things such as automatic replacement of failed disks by hot-spares, etc. And there won't be a real ASC/ASCQ log message for diagnosis.
>
> (See at the bottom to read about a problem I have just had with a "JBOD" configuration)
>
> 3- Installation, boot, etc.
>
> Here I am not sure.
Before zfsboot became available, I used to create a ZFS-on-root system by doing, more or less, this:
>
> - Install the base system on a pendrive. After the installation, just /boot will be used from the pendrive, and /boot/loader.conf will...
>
> - Create the ZFS pool.
>
> - Create and populate the root hierarchy. I used to create something like:
>
> pool/root
> pool/root/var
> pool/root/usr
> pool/root/tmp
>
> Why pool/root instead of simply "pool"? Because it's easier to understand, snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's possible to snapshot the whole "system" tree atomically.
>
> I also set the mountpoint of the "system" tree as legacy, and rely on /etc/fstab. Why? In order to avoid an accidental "auto mount" of critical filesystems in case, for example, I boot off a pendrive in order to tinker.
>
> For the last system I installed, I tried zfsboot instead of booting off the /boot directory of an FFS partition.
>
> (*) An example of RAID/JBOD-induced crap and the problem of not using static naming follows.
>
> I am using a Sun server running FreeBSD. It has 16 160 GB SAS disks, and one of those cards I worship: this particular example is controlled by the aac driver.
>
> As I was going to tinker a lot, I decided to create a RAID-based mirror for the system, so that I can boot off it and have swap even with a failed disk, and use the other 14 disks as a pool with two raidz vdevs of 6 disks, leaving two disks as hot-spares. Later I removed one of the hot-spares and installed an SSD with two partitions to try to make it work as L2ARC and log. As I had gone for the JBOD pain, of course replacing that disk meant rebooting the server in order to do something as illogical as creating a "logical" volume on top of it. These cards just love to be rebooted.
>
>   pool: pool
>  state: ONLINE
>   scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         pool             ONLINE       0     0     0
>           raidz1-0       ONLINE       0     0     0
>             aacd1        ONLINE       0     0     0
>             aacd2        ONLINE       0     0     0
>             aacd3        ONLINE       0     0     0
>             aacd4        ONLINE       0     0     0
>             aacd5        ONLINE       0     0     0
>             aacd6        ONLINE       0     0     0
>           raidz1-1       ONLINE       0     0     0
>             aacd7        ONLINE       0     0     0
>             aacd8        ONLINE       0     0     0
>             aacd9        ONLINE       0     0     0
>             aacd10       ONLINE       0     0     0
>             aacd11       ONLINE       0     0     0
>             aacd12       ONLINE       0     0     0
>         logs
>           gpt/zfs-log    ONLINE       0     0     0
>         cache
>           gpt/zfs-cache  ONLINE       0     0     0
>         spares
>           aacd14         AVAIL
>
> errors: No known data errors
>
> The fun began when a disk failed. When it happened, I offlined it and replaced it with the remaining hot-spare. But something had changed, and the pool remained in this state:
>
> % zpool status
>   pool: pool
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: resilvered 192K in 0h0m with 0 errors on Wed Dec 5 08:31:57 2012
> config:
>
>         NAME                        STATE     READ WRITE CKSUM
>         pool                        DEGRADED     0     0     0
>           raidz1-0                  DEGRADED     0     0     0
>             spare-0                 DEGRADED     0     0     0
>               13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
>               aacd14                ONLINE       0     0     0
>             aacd2                   ONLINE       0     0     0
>             aacd3                   ONLINE       0     0     0
>             aacd4                   ONLINE       0     0     0
>             aacd5                   ONLINE       0     0     0
>             aacd6                   ONLINE       0     0     0
>           raidz1-1                  ONLINE       0     0     0
>             aacd7                   ONLINE       0     0     0
>             aacd8                   ONLINE       0     0     0
>             aacd9                   ONLINE       0     0     0
>             aacd10                  ONLINE       0     0     0
>             aacd11                  ONLINE       0     0     0
>             aacd12                  ONLINE       0     0     0
>         logs
>           gpt/zfs-log               ONLINE       0     0     0
>         cache
>           gpt/zfs-cache             ONLINE       0     0     0
>         spares
>           2388350688826453610       INUSE     was /dev/aacd14
>
> errors: No known data errors
> %
>
> ZFS was somewhat confused by the JBOD volumes, and it was impossible to end this situation. A reboot revealed that the card, apparently, had changed volume numbers. Thanks to the resiliency of ZFS, I didn't lose a single bit of data, but the situation seemed risky. Finally I could fix it by replacing the failed disk, rebooting the whole server, of course, and doing a zpool replace. But the card added some confusion, and I still don't know what the disk failure was. No trace of a meaningful error message.
>
> Best regards,
>
> Borja.

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 01:52:35 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4E03AAFB for ; Wed, 23 Jan 2013 01:52:35 +0000 (UTC) (envelope-from freebsd@deman.com) Received: from plato.corp.nas.com (plato.corp.nas.com [66.114.32.138]) by mx1.freebsd.org (Postfix) with ESMTP id 1715013C for ; Wed, 23 Jan 2013 01:52:35 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by plato.corp.nas.com (Postfix) with ESMTP id D7EC712E1EAAE for ; Tue, 22 Jan 2013 17:52:34 -0800 (PST) X-Virus-Scanned: amavisd-new at corp.nas.com Received: from plato.corp.nas.com ([127.0.0.1]) by localhost (plato.corp.nas.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 6OkOwnbiHGau for ; Tue, 22 Jan 2013 17:52:34 -0800 (PST) Received: from [192.168.0.116] (c-50-135-255-120.hsd1.wa.comcast.net [50.135.255.120]) by plato.corp.nas.com (Postfix) with ESMTPSA id 3DEF612E1EAA3 for ; Tue, 22 Jan 2013 17:52:34 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: Michael DeMan In-Reply-To: Date:
Date: Tue, 22 Jan 2013 17:52:34 -0800

On Jan 22, 2013, at 7:04 AM, Warren Block wrote:

> On Tue, 22 Jan 2013, Borja Marcos wrote:
>
>> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>>
>> ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>>
>> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to their device number minus one. Potential for trouble, as a minimum.
>>
>> After several different installations, I now prefer to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive and label it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.
>
> I'm a proponent of using various types of labels, but my impression after a recent experience was that ZFS metadata was enough to identify the drives even if they were moved around. That is, ZFS bare metadata on a drive with no other partitioning or labels.
>
> Is that incorrect?
I don't know if it is correct or not, but the best I could figure out was to both label the drives and also force the mapping so the physical and logical drives always show up associated correctly.

I also ended up deciding I wanted the hostname as a prefix for the labels - so if they get moved around to, say, another machine, I can look and know what is going on - 'oh yeah, those disks are from the ones we moved over to this machine'...

Again - no idea if this is right or 'best practice', but it was what I ended up doing, since we don't have that 'best practice' document. Basically what I came to was:

#1. Map the physical drive slots to how they show up in FreeBSD, so that if a disk is removed and the machine is rebooted, all the disks after the removed one do not have an 'off by one error'. I.e., if you have ada0-ada14 and remove ada8, then reboot - normally FreeBSD skips the missing ada8 drive and the next drive (which used to be ada9) is now called ada8, and so on...

#2. Use gpart+gnop to deal with 4K sector disks in a standardized way, and also to leave a little extra room so that if a replacement disk is a few MB smaller than the original, it all 'just works'. (All disks are partitioned to a slightly smaller size than their physical capacity.)

#3. For ZFS - make the pool run off the labels. The labels include the 'adaXXX' physical disk name for easy reference. If the disks are moved to another machine (say ada0-ada14 are moved to ada30-ada44 in a new box) then naturally that is off - but with the original hostname prefix in the label (presuming hostnames are unique) you can tell what is going on. Having the disks in another host is something I treat as an emergency/temporary situation; the pool can be taken offline and the labels fixed up if the plan is for the disks to live in that new machine for a long time.
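The undersizing in point #2 can be sketched numerically. A minimal sketch - the "2 TB" capacity and the 64 MiB safety margin are illustrative assumptions, not numbers from this thread:

```shell
#!/bin/sh
# Compute a partition size a little below the disk's capacity, rounded
# down to a 4K boundary, so a marginally smaller replacement disk can
# still hold the same partition.
DISK_BYTES=2000398934016          # assumed capacity of a nominal "2 TB" disk
MARGIN=$((64 * 1024 * 1024))      # assumed 64 MiB spare for smaller replacements
PART_BYTES=$(( (DISK_BYTES - MARGIN) / 4096 * 4096 ))  # round down to 4K multiple
echo "$PART_BYTES"
```

The resulting byte count could then be fed to gpart's -s option when creating the partition; how much margin to leave is a matter of taste.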
Example below on a test box - so if these drives got moved to another machine where ada6 and ada14 are already present -

        NAME                          STATE     READ WRITE CKSUM
        zpmirrorTEST                  ONLINE       0     0     0
          mirror-0                    ONLINE       0     0     0
            gpt/myhostname-ada6p1     ONLINE       0     0     0
            gpt/myhostname-ada14p1    ONLINE       0     0     0
        logs
          da1

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 02:06:28 2013
From: Michael DeMan <freebsd@deman.com>
Date: Tue, 22 Jan 2013 18:06:27 -0800
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD - mapping logical to physical drives
Message-Id: <16E9D784-D2F2-4C55-9138-907BF3957CE8@deman.com>
To: Freddie Cash
Cc: FreeBSD Filesystems, Scott Long
Hi,

We have been able to effectively mitigate this (rigorously tested) problem.

I myself am fussy, and in the situation where a disk drive dies I want to make sure that the data-center technician is removing/replacing exactly the correct disk - AND that if the machine reboots with a disk removed, or added, it all just looks normal.

I think this is basically another item where there are standard ways to deal with it, but there is no documentation?

What we did was /boot/device.hints. On the machine we rigorously tested this on, we have the following in /boot/device.hints. This is for the particular controllers as noted, but I think it works for any SATA or SAS controllers?

# OAIMFD 2011.04.13 adding this to force ordering on adaX disks
# dev.mvs.0.%desc: Marvell 88SX6081 SATA controller
# dev.mvs.1.%desc: Marvell 88SX6081 SATA controller
hint.scbus.0.at="mvsch0"
hint.ada.0.at="scbus0"
hint.scbus.1.at="mvsch1"
hint.ada.1.at="scbus1"
hint.scbus.2.at="mvsch2"
hint.ada.2.at="scbus2"
hint.scbus.3.at="mvsch3"
hint.ada.3.at="scbus3"
...and so on up to ada14...

Inserting disks that were empty before and rebooting, or removing disks that did exist and rebooting - it all 'just works'.

On Jan 22, 2013, at 3:02 PM, Freddie Cash wrote:

> On Jan 22, 2013 7:04 AM, "Warren Block" wrote:
>>
>> On Tue, 22 Jan 2013, Borja Marcos wrote:
>>
>>> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>>>
>>> ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>>>
>>> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies.
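Rather than typing fifteen pairs of hints by hand, wiring like the above can be generated. A small sketch - the mvsch channel names are reused from the hypothetical Marvell example and are an assumption for any other controller:

```shell
#!/bin/sh
# Emit /boot/device.hints lines wiring ada0..ada14 to scbus0..scbus14,
# each scbus pinned to one controller channel (mvsch* assumed here).
i=0
hints=""
while [ "$i" -le 14 ]; do
    hints="${hints}hint.scbus.${i}.at=\"mvsch${i}\"
hint.ada.${i}.at=\"scbus${i}\"
"
    i=$((i + 1))
done
printf '%s' "$hints"
```

The output could be reviewed and appended to /boot/device.hints; the point is only that the mapping is mechanical, so a script can keep it consistent across machines.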
>>> When I reboot the machine, all the devices from da6 to da15 will be renamed to their device number minus one. Potential for trouble, as a minimum.
>>>
>>> After several different installations, I now prefer to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive and label it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.
>>
>> I'm a proponent of using various types of labels, but my impression after a recent experience was that ZFS metadata was enough to identify the drives even if they were moved around. That is, ZFS bare metadata on a drive with no other partitioning or labels.
>>
>> Is that incorrect?
>
> The ZFS metadata on disk allows you to move disks around in a system and still import the pool, correct.
>
> But the ZFS metadata will not help you figure out which disk, in which bay, of which drive shelf just died and needs to be replaced.
>
> That's where glabel, GPT labels, and similar come in handy. It's for the sysadmin, not the system itself.
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 02:16:35 2013
From: Warren Block <wblock@wonkity.com>
Date: Tue, 22 Jan 2013 19:16:25 -0700 (MST)
To: Michael DeMan
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Cc: FreeBSD Filesystems

On Tue, 22 Jan 2013, Michael DeMan wrote:

> On Jan 22, 2013, at 7:04 AM, Warren Block wrote:
>>
>> I'm a proponent of using various types of labels, but my impression
>> after a recent experience was that ZFS metadata
>> was enough to
>> identify the drives even if they were moved around. That is, ZFS
>> bare metadata on a drive with no other partitioning or labels.
>>
>> Is that incorrect?
>
> I don't know if it is correct or not, but the best I could figure out
> was to both label the drives and also force the mapping so the
> physical and logical drives always show up associated correctly. I
> also ended up deciding I wanted the hostname as a prefix for the
> labels - so if they get moved around to say another machine I can look
> and know what is going on - 'oh yeah, those disks are from the ones we
> moved over to this machine'...

It helps to avoid duplicate labels, a good idea.

> #1. Map the physical drive slots to how they show up in FBSD so if a
> disk is removed and the machine is rebooted all the disks after that
> removed one do not have an 'off by one error'. i.e. if you have
> ada0-ada14 and remove ada8 then reboot - normally FBSD skips that
> missing ada8 drive and the next drive (that used to be ada9) is now
> called ada8 and so on...

How do you do that? If I'm in that situation, I think I could find the bad drive, or at least the good ones, with diskinfo and the drive serial number. One suggestion I saw somewhere was to use disk serial numbers for label values.

> #2. Use gpart+gnop to deal with 4K disk sizes in a standardized way
> and also to leave a little extra room so if when doing a replacement
> disk and that disk is a few MB smaller than the original - it all
> 'just works'. (All disks are partitioned to a slightly smaller size
> than their physical capacity).

I've been told (but have not personally verified) that newer versions of ZFS actually leave some unused space at the end of a drive to allow for variations in nominally-sized drives. I don't know how much.
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 02:43:24 2013
From: Jason Keltz <jas@cse.yorku.ca>
Date: Tue, 22 Jan 2013 21:40:07 -0500
To: Warren Block
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Message-ID: <50FF4D87.5080404@cse.yorku.ca>
Cc: FreeBSD Filesystems

On 22/01/2013 9:16 PM, Warren Block wrote:
> On Tue, 22 Jan 2013, Michael DeMan wrote:
>
>> On Jan 22, 2013, at 7:04 AM, Warren Block wrote:
>>>
>>> I'm a proponent of using various types of labels, but my impression
>>> after a recent experience was that ZFS metadata was enough to
>>> identify the drives even if they were moved around. That is, ZFS
>>> bare metadata on a drive with no other partitioning or labels.
>>>
>>> Is that incorrect?
>>
>> I don't know if it is correct or not, but the best I could figure out
>> was to both label the drives and also force the mapping so the
>> physical and logical drives always show up associated correctly. I
>> also ended up deciding I wanted the hostname as a prefix for the
>> labels - so if they get moved around to say another machine I can
>> look and know what is going on - 'oh yeah, those disks are from the
>> ones we moved over to this machine'...
>
> It helps to avoid duplicate labels, a good idea.
>
>> #1. Map the physical drive slots to how they show up in FBSD so if a
>> disk is removed and the machine is rebooted all the disks after that
>> removed one do not have an 'off by one error'. i.e. if you have
>> ada0-ada14 and remove ada8 then reboot - normally FBSD skips that
>> missing ada8 drive and the next drive (that used to be ada9) is now
>> called ada8 and so on...
>
> How do you do that? If I'm in that situation, I think I could find
> the bad drive, or at least the good ones, with diskinfo and the drive
> serial number. One suggestion I saw somewhere was to use disk serial
> numbers for label values.

I think that was using /boot/device.hints.
Unfortunately it only works for some systems, and not for all... and someone shared an experience with me where a kernel update caused the card probe order to change, the devices to change, and then it all broke... It worked for one card, not for the other... I gave up because I wanted consistency across different systems.

In my own opinion, the whole process of partitioning drives, labelling them, all kinds of tricks for dealing with 4K drives, manually configuring /boot/device.hints, etc. is something that we have to do, but honestly, I really believe there *has* to be a better way...

Years back, when I was using a 3ware/AMCC RAID card (actually, I AM still using a few), none of this was an issue... every disk just appeared in order. I didn't have to configure anything specially; ordering never changed when I removed a drive. I didn't need to partition or do anything with the disks - just give it the raw disks, and it knew what to do... If anything, I took my labeller and labelled the disk bays with a numeric label, so when I got an error I knew which disk to pull - but the order never changed, and I always pulled the right drive... Now I look at my pricey "new" system and see disks ordered by default in what seems like an almost "random" order... I dd'ed each drive to figure out the exact ordering and labelled the disks, but it just gets really annoying...

>> #2. Use gpart+gnop to deal with 4K disk sizes in a standardized way
>> and also to leave a little extra room so if when doing a replacement
>> disk and that disk is a few MB smaller than the original - it all
>> 'just works'. (All disks are partitioned to a slightly smaller size
>> than their physical capacity).
>
> I've been told (but have not personally verified) that newer versions
> of ZFS actually leave some unused space at the end of a drive to
> allow for variations in nominally-sized drives. Don't know how much.

You see...
this point has been mentioned on the list a whole bunch of times, and it is exactly the type of information that needs to make it into a "best practices" document. Does anyone know if this applies to ZFS in FreeBSD? From what version? Who knows - maybe a whole bunch of people are partitioning devices that don't need to be! :)

Jason.

-- 
Jason Keltz
Manager of Development
Department of Computer Science and Engineering
York University, Toronto, Canada
Tel: 416-736-2100 x. 33570  Fax: 416-736-5872

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 03:51:25 2013
From: Michael DeMan <freebsd@deman.com>
Date: Tue, 22 Jan 2013 19:51:22 -0800
To: Jason Keltz
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Message-Id: <8DC70418-7CF9-4839-BDC6-1A1AF5354307@deman.com>
Cc: FreeBSD Filesystems

Inline below...

On Jan 22, 2013, at 6:40 PM, Jason Keltz wrote:

>>> #1. Map the physical drive slots to how they show up in FBSD so if a disk is removed and the machine is rebooted all the disks after that removed one do not have an 'off by one error'. i.e. if you have ada0-ada14 and remove ada8 then reboot - normally FBSD skips that missing ada8 drive and the next drive (that used to be ada9) is now called ada8 and so on...
>>
>> How do you do that? If I'm in that situation, I think I could find the bad drive, or at least the good ones, with diskinfo and the drive serial number. One suggestion I saw somewhere was to use disk serial numbers for label values.
>
> I think that was using /boot/device.hints. Unfortunately it only works for some systems, and not for all... and someone shared an experience with me where a kernel update caused the card probe order to change, the devices to change, and then it all broke... It worked for one card, not for the other... I gave up because I wanted consistency across different systems.

I am not sure, but possibly I hit that same issue about PCI probing with our ZFS test machine - basically I vaguely recall asking to have the SATA controllers' slots swapped, without completely knowing why it needed to be done other than that it did need to be done. It could have been from an upgrade from FreeBSD 7.x -> 8.x -> 9.x, or it could have been just because it is a test box; there were other things going on with it for a while, and the cards had been put back in out of order after doing some other stuff.
This is actually kind of an interesting problem overall - logical vs. physical, and how to keep things mapped in a way that makes sense. The Linux community has run into this and substantially (from a basic end-user perspective) changed the way they deal with hardware MAC addresses and ethernet cards between RHEL5 and RHEL6. Ultimately neither of their techniques works very well. For the FreeBSD community we should probably pick one strategy or another, standardize on it warts and all, and have it documented?

> In my own opinion, the whole process of partitioning drives, labelling them, all kinds of tricks for dealing with 4K drives, manually configuring /boot/device.hints, etc. is something that we have to do, but honestly, I really believe there *has* to be a better way...

I agree. At this point the only solution I can think of, to be able to use ZFS on FreeBSD for production systems, is to write scripts that do all of this - all the goofy gpart + gnop + everything else. How is anybody supposed to replace a disk in a system in an emergency situation when they have to run a bunch of cryptic command-line stuff on a disk before they can even confidently put it in as a replacement for the original? And by definition, having to do a bunch of manual command-line stuff means you cannot be reliably confident.

> Years back when I was using a 3ware/AMCC RAID card (actually, I AM still using a few), none of this was an issue... every disk just appeared in order. I didn't have to configure anything specially... ordering never changed when I removed a drive. I didn't need to partition or do anything with the disks - just give it the raw disks, and it knew what to do... If anything, I took my labeller and labelled the disk bays with a numeric label so when I got an error, I knew which disk to pull, but order never changed, and I always pulled the right drive...
> Now, I look at my pricey "new" system, see disks ordered by default in what seems like an almost "random" order... I dd'ed each drive to figure out the exact ordering, and labelled the disks, but it just gets really annoying...

A lot of these things - like making sure that a little extra space is spared on the drive when an array is first built, so that a new drive with slightly smaller capacity can serve as the replacement - the RAID vendors have hidden away from the end user. In many cases they have only done that in the last 10 years or so? And I stumbled a few weeks ago on a Sun ZFS user who had received Sun-certified disks with the same issue - a few sectors too small...

Overall you are describing exactly the kind of behavior I want, and I think everybody needs, from a FreeBSD+ZFS system:

- Alarm sent out - drive #52 failed - wake up and deal with it.
- Go to server (or call data center) - groggily look at labels on front of disk caddies.
- Physically pull drive #52.
- Insert new similarly sized drive from inventory as new #52.
- Verify resilver is in progress.
- Confidently go back to bed knowing all is okay.

The above scenario is just unworkable right now for most people (even tech-savvy people) because of the lack of documentation - hence I am glad to see some kind of 'best practices' document put together.
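A site-local script is one way to standardize that replacement drill. A dry-run sketch that only prints the commands it would run, so they can be reviewed first - the disk name, label scheme, pool name, and the old-device placeholder are all assumptions for illustration, and the gpart/gnop/zpool steps follow the scheme discussed earlier in this thread:

```shell
#!/bin/sh
# Dry-run helper for prepping a replacement disk: print the commands
# rather than executing them, so a groggy operator can sanity-check.
disk=ada8                 # assumed new disk device
host=myhostname           # assumed hostname prefix for labels
pool=pond                 # assumed pool name
old=OLD_DEVICE_OR_GUID    # placeholder for the failed vdev

cmds=$(cat <<EOF
gpart create -s gpt ${disk}
gpart add -t freebsd-zfs -a 4k -l ${host}-${disk}p1 ${disk}
gnop create -S 4096 /dev/gpt/${host}-${disk}p1
zpool replace ${pool} ${old} /dev/gpt/${host}-${disk}p1.nop
EOF
)
printf '%s\n' "$cmds"
```

Dropping the dry-run indirection and adding sizing (see the earlier undersizing discussion) would turn this into the "one command, then go back to bed" tool the scenario above asks for.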
- Mike

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 05:38:13 2013
From: David Xu <davidxu@freebsd.org>
Date: Wed, 23 Jan 2013 13:38:44 +0800
To: Scott Long
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Message-ID: <50FF7764.2020803@freebsd.org>
Cc: FreeBSD Filesystems

On 2013/01/22 22:33, Scott Long wrote:
>
> On Jan 22, 2013, at 4:03 AM, Borja Marcos wrote:
>
>> (Scott, I hope you don't mind being CC'd; I'm not sure you read the -FS mailing list, and this is a SCSI/FS issue)
>>
>> Hi :)
>>
>> Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well-proven "doctrine" in order to avoid problems.
>> Especially, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").
>>
>> ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers which tend to make things worse.
>>
>> Since I've been using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I try to explain some of them, and my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.
>>
>> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>>
>> ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>
> Look up SCSI device wiring in /sys/conf/NOTES. That's one solution to static naming, just with a slightly different angle than Solaris. I do agree with your general thesis here, and either wiring should be made a much more visible and documented feature, or a new mechanism should be developed to provide naming stability. Please let me know what you think of the wiring mechanic.

I am curious: since we already have devfs, why does the driver not create device entries like the following?

/dev/scsi/bus0/target0/lun0/ada0
/dev/scsi/bus0/target0/lun0/ada0s1
/dev/scsi/bus0/target0/lun0/ada0s2
...

This would eliminate the need for hints.

>> 2- RAID cards.
>>
>> Simply: avoid them like the pest. ZFS is designed to operate on bare disks. And it does an amazingly good job. Any additional software layer you add on top will compromise it.
>> I have had bad experiences with "mfi" and "aac" cards.
>
> Agree 200%. Despite the best efforts of sales and marketing people, RAID cards do not make good HBAs. At best they add latency. At worst, they add a lot of latency and extra failure modes.
>
> Scott

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 05:45:29 2013
From: Scott Long <scottl@samsco.org>
Date: Tue, 22 Jan 2013 22:45:26 -0700
To: David Xu
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Message-Id: <5DD4A455-A351-4676-979B-4D7199F0FA1C@samsco.org>
Cc: FreeBSD Filesystems
On Jan 22, 2013, at 10:38 PM, David Xu wrote:

> On 2013/01/22 22:33, Scott Long wrote:
>>
>> On Jan 22, 2013, at 4:03 AM, Borja Marcos wrote:
>>
>>> (Scott, I hope you don't mind being CC'd; I'm not sure you read the -FS mailing list, and this is a SCSI/FS issue)
>>>
>>> Hi :)
>>>
>>> Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well-proven "doctrine" in order to avoid problems. Especially, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").
>>>
>>> ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers which tend to make things worse.
>>>
>>> Since I've been using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I try to explain some of them, and my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.
>>>
>>> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>>>
>>> ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>>
>> Look up SCSI device wiring in /sys/conf/NOTES. That's one solution to static naming, just with a slightly different angle than Solaris.
I do agree with your general thesis here, and either wiring should be made a much more visible and documented feature, or a new mechanism should be developed to provide naming stability. Please let me know what you think of the wiring mechanic.

> I am curious: since we already have devfs, why doesn't the driver create device entries like the following?
>
> /dev/scsi/bus0/target0/lun0/ada0
> /dev/scsi/bus0/target0/lun0/ada0s1
> /dev/scsi/bus0/target0/lun0/ada0s2
> ...
>
> This would eliminate the need for hints.

The problem is that this structure is easy for computers to manipulate, but hard for humans to manipulate. One thing that could be done with devfs, however, is to create device aliases, i.e.:

crw-r----- 1 root operator 0x86 Jan 2 23:00 c0b0t1l0
lrwxr-xr-x 1 root wheel 6 Jan 2 23:00 da0 -> c0b0t1l0

This gets hairy because aliases then need to be made for partitions, unless some sort of transparent translator is created in devfs. It also gets complicated because you still need to arbitrate the controller numbering (the 'c0' in the above example), so wiring might still be needed.
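For readers unfamiliar with the wiring mechanism discussed above: it is driven by /boot/device.hints entries along these lines (a sketch based on the style of examples in /sys/conf/NOTES; the controller, bus, and target numbers here are illustrative, not taken from this thread):

```sh
# /boot/device.hints -- wire scbus0 to a specific controller channel and
# pin da0 to a fixed bus/target/lun, so the name survives probe-order
# changes elsewhere in the system. Values are examples only.
hint.scbus.0.at="ahcich0"
hint.da.0.at="scbus0"
hint.da.0.target="0"
hint.da.0.unit="0"
```
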
Scott From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 05:51:49 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 86F05F57 for ; Wed, 23 Jan 2013 05:51:49 +0000 (UTC) (envelope-from haroldp@internal.org) Received: from mail.internal.org (mail.internal.org [64.182.209.41]) by mx1.freebsd.org (Postfix) with ESMTP id 47508A6A for ; Wed, 23 Jan 2013 05:51:49 +0000 (UTC) Received: from [10.0.0.32] (99-46-24-87.lightspeed.renonv.sbcglobal.net [99.46.24.87]) by mail.internal.org (Postfix) with ESMTPSA id 56C4CBA940 for ; Tue, 22 Jan 2013 21:43:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=internal.org; s=default; t=1358919822; bh=G+3TklcHEgded37O7E5pBi1RK0qhcTaxzBTUn+fTxiU=; h=Subject:From:In-Reply-To:Date:References:To; b=swRluAeW3fSq0HlOlfM+JtoCgnMhcwQPc0kXvBLVL/kcoD8bMHgwj9qWbng7kdRV5 vjdJDR9TQzKqzuYdM2ha2QXFpfVffiN6nBGLVUZRI/n0nys5xK3dq1S+1JGZ0gO3BF 7JEqKnKwOAfj2/AuXW/h6EXs/xvNFxYaOGOkmJpk= Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: Harold Paulson In-Reply-To: <50FF4D87.5080404@cse.yorku.ca> Date: Tue, 22 Jan 2013 21:44:25 -0800 Content-Transfer-Encoding: 7bit Message-Id: <9DBE767B-4427-49AB-A47E-6C334E860E08@internal.org> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <50FF4D87.5080404@cse.yorku.ca> To: FreeBSD Filesystems X-Mailer: Apple Mail (2.1499) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 05:51:49 -0000 I'm not allowed to edit this page: https://wiki.freebsd.org/action/edit/ZFSBestPractices?action=edit but someone should start pouring this info into the wiki. It's gold. 
- H

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 07:37:35 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 2B8E5237 for ; Wed, 23 Jan 2013 07:37:35 +0000 (UTC) (envelope-from lehmann@ans-netz.de) Received: from avocado.salatschuessel.net (avocado.salatschuessel.net [78.111.72.186]) by mx1.freebsd.org (Postfix) with ESMTP id 84BA5EBD for ; Wed, 23 Jan 2013 07:37:34 +0000 (UTC) Received: (qmail 19034 invoked by uid 80); 23 Jan 2013 07:37:26 -0000 Received: from 164.61.223.12 ([164.61.223.12]) by avocado.salatschuessel.net (Horde Framework) with HTTP; Wed, 23 Jan 2013 08:37:26 +0100 Date: Wed, 23 Jan 2013 08:37:26 +0100 Message-ID: <20130123083726.Horde.WFWBzChUy3RRAjdkMOnNeg7@avocado.salatschuessel.net> From: Oliver Lehmann To: fs@freebsd.org Subject: ZFS: is getting zfs slower with more data on the fs? User-Agent: Internet Messaging Program (IMP) H5 (6.0.3) Content-Type: text/plain; charset=UTF-8; format=flowed; DelSp=Yes MIME-Version: 1.0 Content-Disposition: inline X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 07:37:35 -0000

Hi,

I have for example 2 filesystems:

NAME            zroot/usr              zroot/usr/testfs
TYPE            filesystem             filesystem
CREATION        Wed Sep  7 21:45 2011  Tue Jan 22 21:47 2013
USED            15.6G                  31K
AVAIL           7.21G                  20.0G
REFER           12.8G                  31K
RATIO           1.00x                  1.00x
MOUNTED         yes                    yes
ORIGIN          -                      -
QUOTA           none                   none
RESERV          none                   none
VOLSIZE         -                      -
VOLBLOCK        -                      -
RECSIZE         128K                   128K
MOUNTPOINT      legacy                 legacy
SHARENFS        off                    off
CHECKSUM        fletcher4              fletcher4
COMPRESS        off                    off
ATIME           off                    off
DEVICES         on                     on
EXEC            on                     on
SETUID          on                     on
RDONLY          off                    off
JAILED          off                    off
SNAPDIR         hidden                 hidden
ACLMODE         discard                discard
ACLINHERIT      restricted             restricted
CANMOUNT        on                     on
XATTR           off                    off
COPIES          1                      1
VERSION         5                      5
UTF8ONLY        off                    off
NORMALIZATION   none                   none
CASE            sensitive              sensitive
VSCAN           off                    off
NBMAND          off                    off
SHARESMB        off                    off
REFQUOTA        20G                    20G
REFRESERV       none                   none
PRIMARYCACHE    all                    all
SECONDARYCACHE  all                    all
USEDSNAP        0                      0
USEDDS          12.8G                  31K
USEDCHILD       2.82G                  0
USEDREFRESERV   0                      0
DEFER_DESTROY   -                      -
USERREFS        -                      -
LOGBIAS         latency                latency
DEDUP           off                    off
MLSLABEL        standard               standard
SYNC            1.00x                  1.00x
REFRATIO        12.8G                  31K
WRITTEN         -                      -

And I wonder why I have different write performance on each of those filesystems:

zroot/usr:
Version 1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size  K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nudel.salatsc 6544M   18  99 48439  50 37901  47    46  99 188508  87 214.8  14
Latency             585ms   1550ms    780ms     223ms     388ms     214ms

zroot/usr/testfs:
Version 1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size  K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nudel.salatsc 6544M   18  99 80818  91 55605  74    46  99 184481  85 225.7  15
Latency             631ms   1159ms    2156ms    199ms     361ms     239ms

root@nudel testfs> zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0h9m with 0 errors on Fri Jan  4 15:55:10 2013
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0

errors: No known data errors
root@nudel testfs>

/usr is filled with 12GB of data, /usr/testfs is empty. Why is it like this? I'm trying to figure out how to configure a ZFS filesystem to be as fast as possible for things like /usr/obj where I do not care about data security that much. My "testfs" is a filesystem where I wanted to test different things - but now it is faster without any changes?!
PS: please keep me CCed From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 09:18:06 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 0D2877BB for ; Wed, 23 Jan 2013 09:18:06 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id 98EDF2FC for ; Wed, 23 Jan 2013 09:18:04 +0000 (UTC) Received: from server.rulingia.com (c220-239-246-167.belrs5.nsw.optusnet.com.au [220.239.246.167]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r0N8wcRL070159 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 23 Jan 2013 19:58:38 +1100 (EST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r0N8wXAR036068 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 23 Jan 2013 19:58:33 +1100 (EST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by server.rulingia.com (8.14.5/8.14.5/Submit) id r0N8wWFp036066; Wed, 23 Jan 2013 19:58:32 +1100 (EST) (envelope-from peter) Date: Wed, 23 Jan 2013 19:58:32 +1100 From: Peter Jeremy To: Michael DeMan Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD - mapping logical to physical drives Message-ID: <20130123085832.GJ30633@server.rulingia.com> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <16E9D784-D2F2-4C55-9138-907BF3957CE8@deman.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="reI/iBAAp9kzkmX4" Content-Disposition: inline In-Reply-To: <16E9D784-D2F2-4C55-9138-907BF3957CE8@deman.com> X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: FreeBSD Filesystems X-BeenThere: 
freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 09:18:06 -0000 --reI/iBAAp9kzkmX4 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On 2013-Jan-22 18:06:27 -0800, Michael DeMan wrote:
># OAIMFD 2011.04.13 adding this to force ordering on adaX disks
># dev.mvs.0.%desc: Marvell 88SX6081 SATA controller
># dev.mvs.1.%desc: Marvell 88SX6081 SATA controller
>
>hint.scbus.0.at="mvsch0"
>hint.ada.0.at="scbus0"
...

That only works until a BIOS or OS change alters the probe order and reverses the controller numbers. The correct solution to the problem is gpart labels - which rely on on-disk metadata and so don't care about changes in the path to the disk.

-- 
Peter Jeremy

--reI/iBAAp9kzkmX4 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlD/pjgACgkQ/opHv/APuIeoswCggtCZsZO/r3TAOMmpfub1DaTZ W88AoK7LEHvqpL5qdJHFJg/MWgJCXiNI =SY7c -----END PGP SIGNATURE----- --reI/iBAAp9kzkmX4--

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 11:00:32 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 2C4CAFBC for ; Wed, 23 Jan 2013 11:00:32 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id 99CFF986 for ; Wed, 23 Jan 2013 11:00:30 +0000 (UTC) Received: from server.rulingia.com (c220-239-246-167.belrs5.nsw.optusnet.com.au [220.239.246.167]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r0NB0LxN070489 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 23 Jan 2013 22:00:23 +1100 (EST) (envelope-from peter@rulingia.com)
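To make the gpart-label approach above concrete, here is a sketch of the commands involved (the disk name `ada0` and the label `disk-bay3` are examples for illustration, not taken from this thread; run as root on a blank disk):

```sh
# Partition the disk with GPT and give the ZFS partition a label that
# follows the physical disk rather than the probe order.
gpart create -s gpt ada0
gpart add -t freebsd-zfs -a 4k -l disk-bay3 ada0

# Build the pool on the label, not on /dev/ada0p1, so the pool still
# imports cleanly if the disk shows up as ada5 after a hardware change.
zpool create tank /dev/gpt/disk-bay3
```

The label lives in the GPT metadata on the disk itself, which is why it is immune to controller renumbering.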
X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r0NB0GSV078381 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 23 Jan 2013 22:00:16 +1100 (EST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by server.rulingia.com (8.14.5/8.14.5/Submit) id r0NB0Eu9078372; Wed, 23 Jan 2013 22:00:14 +1100 (EST) (envelope-from peter) Date: Wed, 23 Jan 2013 22:00:14 +1100 From: Peter Jeremy To: Mark Felder Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD Message-ID: <20130123110014.GL30633@server.rulingia.com> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="X0cz4bGbQuRbxrVl" Content-Disposition: inline In-Reply-To: X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 11:00:32 -0000 --X0cz4bGbQuRbxrVl Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On 2013-Jan-22 09:26:39 -0600, Mark Felder wrote:
>On Tue, 22 Jan 2013 09:04:42 -0600, Warren Block wrote:
>
>> I'm a proponent of using various types of labels, but my impression after a recent experience was that ZFS metadata was enough to identify the drives even if they were moved around. That is, ZFS bare metadata on a drive with no other partitioning or labels.
>> Is that incorrect?
>
>If you have an enclosure with 48 drives can you be confident which drive is failing using only the ZFS metadata?

There are two different issues here.
ZFS stores metadata on each disk which it uses to determine the pool layout and where that disk fits into that layout. The device pathname is solely used as a hint and ZFS doesn't care how you juggle the disks. OTOH, the sysadmin needs some way of identifying a physical disk based on the logical identifiers that the system provides. This is totally up to the sysadmin and there's no reason why you couldn't write ZFS disklabel numbers on your physical disks in addition to or instead of writing "daX" or the gpart label onto the disk. --=20 Peter Jeremy --X0cz4bGbQuRbxrVl Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlD/wr4ACgkQ/opHv/APuIfYfgCfVd+wj/su3P64uAemE2L+xpu3 tgoAn3OHxDDSUyqA7DLglR5zEOF+7pOs =6jxg -----END PGP SIGNATURE----- --X0cz4bGbQuRbxrVl-- From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 11:19:05 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 6BF445C8 for ; Wed, 23 Jan 2013 11:19:05 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id 01882A35 for ; Wed, 23 Jan 2013 11:19:03 +0000 (UTC) Received: from server.rulingia.com (c220-239-246-167.belrs5.nsw.optusnet.com.au [220.239.246.167]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r0NBIvI1070522 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 23 Jan 2013 22:18:58 +1100 (EST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r0NBIqgx023019 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 23 Jan 2013 22:18:52 +1100 (EST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by 
server.rulingia.com (8.14.5/8.14.5/Submit) id r0NBIqI7023016; Wed, 23 Jan 2013 22:18:52 +1100 (EST) (envelope-from peter) Date: Wed, 23 Jan 2013 22:18:52 +1100 From: Peter Jeremy To: Michael DeMan Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD Message-ID: <20130123111852.GM30633@server.rulingia.com> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="SLfjTIIQuAzj8yil" Content-Disposition: inline In-Reply-To: X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 11:19:05 -0000 --SLfjTIIQuAzj8yil Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On 2013-Jan-22 17:27:13 -0800, Michael DeMan wrote:
>- probably there are enough people even though it seems pretty obvious in a couple more years nobody will have 32-bit or less than 8GB RAM?

FreeBSD goes into an awful lot of embedded devices which are unlikely to transition off 32-bit or have more than a few hundred MB RAM for the foreseeable future. The PC world is definitely going that way - but it's heavily driven by Microsoft, and it's quite likely that the next version of Windows won't fit into 32 bits.

>I could be misinformed here, but my understanding is best practice is to use gpart + gnop to:
>#1. Ensure proper alignment for 4K sector drives - the latest Western Digitals still report as 512.

Recent versions of ZFS allow you to specify ashift=12 without needing gnop. And in any case, gnop is only needed when you initialise the pool. Once the initial ashift value is set, it can't be changed. (And I'd be interested in knowing how well an SSD ZIL compensates for having an ashift=9 pool on 4K-sector disks.)

>#2. Ensure a little extra space is left on the drive, since if the whole drive is used, a replacement may be a tiny bit smaller and will not work.

As someone else has mentioned, recent ZFS allows some slop here. But I still think it's worthwhile carving out some space to allow for a marginally smaller replacement disk.

>#3. Label the disks so you know what is what.

With a paper label and a pen.

>ZFS GENERAL BEST PRACTICES - address the regular ZFS stuff here
>- why the ZIL is a good thing even if you think it kills your NFS performance
>- no vdevs > 8 disks, raidz1 best with 5 disks, raidz2 best with 6 disks, etc.
>- striping over raidz1/raidz2 pools
>- striping over mirrors
>- etc...

Most of this can be pointers to generic ZFS documentation. Which brings up another point - the documentation should make it clear whether a particular hint is FreeBSD-specific or generic across all ZFS platforms. And which "generic" ZFS hints are unnecessary on FreeBSD.
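For reference, the gpart + gnop trick mentioned above was commonly practiced at the time roughly as follows (a sketch only, assuming a mirror of `ada0p3`/`ada1p3` and a pool named `tank`; as noted in the thread, newer ZFS versions make this unnecessary):

```sh
# Create a fake 4K-sector provider on top of one member so that
# zpool create picks ashift=12 for the whole vdev.
gnop create -S 4096 ada0p3
zpool create tank mirror ada0p3.nop ada1p3

# The gnop device is only needed at pool creation time: export the
# pool, destroy the nop layer, and re-import using the real device.
# The ashift recorded in the vdev label stays at 12.
zpool export tank
gnop destroy ada0p3.nop
zpool import tank
```

The key point is the one Peter makes: ashift is fixed per vdev at creation and cannot be changed afterwards.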
--=20 Peter Jeremy --SLfjTIIQuAzj8yil Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlD/xxwACgkQ/opHv/APuIeUFACfc9Nsr1d6RtSfgcGmtXTrRCd/ uTsAn2LBwgWuMbR8sxwUFcA9dAjKSz+K =vLtP -----END PGP SIGNATURE----- --SLfjTIIQuAzj8yil-- From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 14:16:02 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 975B2325 for ; Wed, 23 Jan 2013 14:16:02 +0000 (UTC) (envelope-from prvs=1735826444=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 3C63C3C9 for ; Wed, 23 Jan 2013 14:16:01 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001821491.msg for ; Wed, 23 Jan 2013 14:15:55 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Wed, 23 Jan 2013 14:15:55 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1735826444=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk X-MDaemon-Deliver-To: freebsd-fs@freebsd.org Message-ID: <32D3BF4572C74ED49F6978A16E340FE2@multiplay.co.uk> From: "Steven Hartland" To: "Peter Jeremy" , "Michael DeMan" References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <20130123111852.GM30633@server.rulingia.com> Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD Date: Wed, 23 Jan 2013 14:16:29 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 
Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 14:16:02 -0000

----- Original Message ----- From: "Peter Jeremy"

On 2013-Jan-22 17:27:13 -0800, Michael DeMan wrote:
>>I could be misinformed here, but my understanding is best practice is
>> to use gpart + gnop to:
>>#1. Ensure proper alignment for 4K sector drives - the latest Western
>> Digitals still report as 512.
>
> Recent versions of ZFS allow you to specify ashift=12 without needing
> gnop. And in any case, gnop is only needed when you initialise the
> pool. Once the initial ashift value is set, it can't be changed.
> (And I'd be interested in knowing how well an SSD ZIL compensates for
> having an ashift=9 pool on 4K sector disks).

I'm not aware of any other option to specify ashift in current sources. I have outstanding patches which add automatic alignment detection for disks which report 4K either natively or via quirks to ZFS. In addition, it provides the ability to set a default minimum ashift, which essentially provides an override. This works well for us, but the implementation details, which use stripsize, are under discussion with pjd and avg.

Regards
Steve

================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk.
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 14:31:03 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C2ADFCFF for ; Wed, 23 Jan 2013 14:31:03 +0000 (UTC) (envelope-from roberto@keltia.freenix.fr) Received: from keltia.net (centre.keltia.net [IPv6:2a01:240:fe5c::41]) by mx1.freebsd.org (Postfix) with ESMTP id 7FEB36FB for ; Wed, 23 Jan 2013 14:31:03 +0000 (UTC) Received: from roberto02-aw.eurocontrol.fr (aran.keltia.net [88.191.250.24]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: roberto) by keltia.net (Postfix/TLS) with ESMTPSA id 75C25EB9F for ; Wed, 23 Jan 2013 15:31:02 +0100 (CET) Date: Wed, 23 Jan 2013 15:30:18 +0100 From: Ollivier Robert To: freebsd-fs@freebsd.org Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD Message-ID: <20130123143018.GA5533@roberto02-aw.eurocontrol.fr> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Operating-System: MacOS X / Macbook Pro - FreeBSD 7.2 / Dell D820 SMP User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 14:31:03 -0000 According to Freddie Cash on Tue, Jan 22, 2013 at 03:02:40PM -0800: > The ZFS metadata on disk allows you to move disks around in a system and > still import the pool, correct. Even better, a few years ago before I enabled AHCI on my machine, the drives were named adX. After I started using AHCI, the drives became adaY and it still booted fine. -- Ollivier ROBERT -=- FreeBSD: The Power to Serve! 
-=- roberto@keltia.net In memoriam to Ondine, our 2nd child: http://ondine.keltia.net/

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 14:37:31 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A1BB722B for ; Wed, 23 Jan 2013 14:37:31 +0000 (UTC) (envelope-from romain@blogreen.org) Received: from marvin.blogreen.org (unknown [IPv6:2001:470:1f12:b9c::2]) by mx1.freebsd.org (Postfix) with ESMTP id E92E6788 for ; Wed, 23 Jan 2013 14:37:30 +0000 (UTC) Received: by marvin.blogreen.org (Postfix, from userid 1001) id AAA7213259; Wed, 23 Jan 2013 15:37:29 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=blogreen.org; s=default; t=1358951849; bh=MurGs17donigxurmdjb+BehCGrm8YBmwPGKWNlRXQdM=; h=Date:From:To:Subject; b=mwnS1+OSiQW9rN//qR2CnpBnnb8Wb8gXLklM4OhocFWFYJm2JPoD54Jlyt62tqMBy 1cx+MPHzbQJs8nDGUeruNISHt1K9ioGqSopxeG0b9eeOFj44Ix03Xhj2iXWODGakCV jr/7C0NUPnJCk/48dlVmG8BqW+su1DA0Q3K+bOS8= Date: Wed, 23 Jan 2013 15:37:29 +0100 From: Romain Tartière To: freebsd-fs@freebsd.org Subject: ZFS deduplication Message-ID: <20130123143728.GA84218@blogreen.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="bp/iNruPH9dso1Pn" Content-Disposition: inline X-PGP-Key: http://romain.blogreen.org/pubkey.asc User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 14:37:31 -0000 --bp/iNruPH9dso1Pn Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

Hello

My machine has run on full ZFS for quite a few years now without major problems, but it suddenly started to report inconsistent "statistics". It all started with `zfs list` reporting too much available space:

> # zfs list data
> NAME USED AVAIL REFER MOUNTPOINT
> data 117G 5,73E 4,52G legacy

"data" being a mirror of two 500 GB disks, the available space is obviously wrong. The pool is healthy according to `zpool status`:

> # zpool status data
>   pool: data
>  state: ONLINE
>   scan: scrub repaired 0 in 5h1m with 0 errors on Tue Jan  1 09:52:50 2013
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           mirror-0  ONLINE       0     0     0
>             ada0p3  ONLINE       0     0     0
>             ada1p3  ONLINE       0     0     0
>
> errors: No known data errors

However, `zpool list` reports an inconsistent deduplication value (it used to be ~1.4 AFAICR):

> zpool list data
> NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
> data 460G 101G 359G 21% 1386985.39x ONLINE -

At first I thought about a process that might have created a bazillion files with the same contents, but df(1) does not report any filesystem with an abnormal number of used inodes. Likewise, df(1) does not show a filesystem with a huge amount of data that could come from a single file whose content would be a repeated single block of data (but maybe df(1) can't tell me that):

> df -i
> Filesystem 1K-blocks Used Avail Capacity iused ifree %iused Mounted on
> data 6455005842309380 4735434 6455005837573946 0% 45457 281474976665199 0% /
> devfs 1 1 0 100% 0 0 100% /dev
> procfs 4 4 0 100% 1 0 100% /proc
> fdescfs 1 1 0 100% 4 22497 0% /dev/fd
> data/tmp 6455005837580729 6783 6455005837573946 0% 2087 281474976708569 0% /tmp
> data/usr 6455005853245211 15671265 6455005837573946 0% 532871 281474976177785 0% /usr
> tank 1920937431 490426844 1430510587 26% 1315413 2861021175 0% /usr/home
> data/poudriere/data 6455005839432671 1858725 6455005837573946 0% 11958 281474976698698 0% /usr/local/poudriere/data
> data/poudriere/8_3_RELEASE_amd64 6455005838384969 811023 6455005837573946 0% 56480 281474976654176 0% /usr/local/poudriere/jails/8_3_RELEASE_amd64
> data/poudriere/9_0_RELEASE_amd64 6455005838615454 1041508 6455005837573946 0% 66523 281474976644133 0% /usr/local/poudriere/jails/9_0_RELEASE_amd64
> data/poudriere/jails/9_1_RELEASE_amd64 6455005838652760 1078814 6455005837573946 0% 67766 281474976642890 0% /usr/local/poudriere/jails/9_1_RELEASE_amd64
> data/poudriere/ports/default 6455005838064055 490109 6455005837573946 0% 151864 281474976558792 0% /usr/local/poudriere/ports/default/ports
> data/poudriere/ports/gnome 6455005838069283 495337 6455005837573946 0% 152758 281474976557898 0% /usr/local/poudriere/ports/gnome/ports
> data/poudriere/ports/mono 6455005838062663 488717 6455005837573946 0% 152411 281474976558245 0% /usr/local/poudriere/ports/mono/ports
> data/poudriere/ports/romain 6455005838064956 491010 6455005837573946 0% 152030 281474976558626 0% /usr/local/poudriere/ports/romain/ports
> data/usr/main 6455005838086216 512270 6455005837573946 0% 164164 281474976546492 0% /usr/ports
> data/var 6455005856718703 19144757 6455005837573946 0% 158141 281474976552515 0% /var
> data/var/cache 6455005837574041 95 6455005837573946 0% 17 281474976710639 0% /var/cache
> data/var/cache/portshaker 6455005837587068 13122 6455005837573946 0% 4658 281474976705998 0% /var/cache/portshaker
> data/var/cache/portshaker/bsd_sharp 6455005837578549 4603 6455005837573946 0% 1695 281474976708961 0% /var/cache/portshaker/bsd_sharp
> data/var/cache/portshaker/e 6455005837575230 1284 6455005837573946 0% 977 281474976709679 0% /var/cache/portshaker/e
> data/var/cache/portshaker/freebsd_texlive 6455005837574185 239 6455005837573946 0% 124 281474976710532 0% /var/cache/portshaker/freebsd_texlive
> data/var/cache/portshaker/freebsd_texlive_ports 6455005837578532 4586 6455005837573946 0% 289 281474976710367 0% /var/cache/portshaker/freebsd_texlive_ports
> data/var/cache/portshaker/freebsd_texlive_ports_marcuscom 6455005837574066 120 6455005837573946 0% 29 281474976710627 0% /var/cache/portshaker/freebsd_texlive_ports_marcuscom
> data/var/cache/portshaker/freebsd_texlive_releng 6455005837611826 37880 6455005837573946 0% 29335 281474976681321 0% /var/cache/portshaker/freebsd_texlive_releng
> data/var/cache/portshaker/ports 6455005838062115 488169 6455005837573946 0% 152039 281474976558617 0% /var/cache/portshaker/ports
> data/var/cache/portshaker/ports/distfiles 6455005842833837 5259891 6455005837573946 0% 2605 281474976708051 0% /var/cache/portshaker/ports/distfiles
> data/var/cache/portshaker/ports/packages 6455005856315476 18741530 6455005837573946 0% 16799 281474976693857 0% /var/cache/portshaker/ports/packages
> data/var/cache/portshaker/redports 6455005837574488 542 6455005837573946 0% 312 281474976710344 0% /var/cache/portshaker/redports
> data/var/cache/portshaker/redports:fneufneu 6455005837575482 1536 6455005837573946 0% 317 281474976710339 0% /var/cache/portshaker/redports:fneufneu
> data/var/cache/portshaker/redports:romain 6455005837574295 349 6455005837573946 0% 226 281474976710430 0% /var/cache/portshaker/redports:romain
> data/var/cache/portshaker/redports:virtualbox 6455005837574717 771 6455005837573946 0% 444 281474976710212 0% /var/cache/portshaker/redports:virtualbox
> data/var/cache/portshaker/redports:xbmc 6455005837573977 31 6455005837573946 0% 7 281474976710649 0% /var/cache/portshaker/redports:xbmc
> data/var/cache/portshaker/romain 6455005837577011 3065 6455005837573946 0% 930 281474976709726 0% /var/cache/portshaker/romain
> data/var/cache/portshaker/woot 6455005837574065 119 6455005837573946 0% 75 281474976710581 0% /var/cache/portshaker/woot
> data/var/jails 6455005841336901 3762955 6455005837573946 0% 517515 281474976193141 0% /var/jails
> data/var/jails/base 6455005838403648 829702 6455005837573946 0% 11659 281474976698997 0% /var/jails/base
> data/var/jails/r230911 6455005837857185 283239 6455005837573946 0% 11581 281474976699075 0% /var/jails/r230911
> data/var/jails/texlive-test 6455005844601398 7027452 6455005837573946 0% 496201 281474976214455 0% /var/jails/texlive-test
> data/var/jails/texlive.home.sigabrt.org 6455005850841853 13267907 6455005837573946 0% 377896 281474976332760 0% /var/jails/texlive.home.sigabrt.org
> data/var/jails/texlive.home.sigabrt.org/ports 6455005838230119 656173 6455005837573946 0% 152331 281474976558325 0% /var/jails/texlive.home.sigabrt.org/usr/ports
> data/var/jails/xbmc 6455005840442396 2868450 6455005837573946 0% 228330 281474976482326 0% /var/jails/xbmc
> devfs 1 1 0 100% 0 0 100% /var/named/dev
> devfs 1 1 0 100% 0 0 100% /var/jails/texlive.home.sigabrt.org/dev
> devfs 1 1 0 100% 0 0 100% /var/jails/texlive/dev

Does anyone have an idea about what is going on? Is there any way I can know which deduplicated blocks are most used and in which file(s), to locate the source of the problem? Or is there any way to prove that this reporting is wrong?

I'm running 9.1-STABLE r245578, with zpool v28 (I guess, no way to check that?) and ZFS v5 for everything but a few snapshots (v4, v3; according to `zfs get version`).

Thanks!
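One way to sanity-check a dedup ratio like the one reported above is to dump the pool's dedup table with zdb (a sketch; must be run as root, and the pool name `data` is the one from this mail):

```sh
# Summarise the dedup table (DDT): -D prints per-pool totals, -DD adds
# a histogram of blocks by reference count, which shows whether a few
# blocks with enormous refcounts are skewing the ratio.
zdb -DD data

# Simulate dedup over the pool's current contents; comparing this
# estimate against the reported ratio helps show whether the statistic
# itself is bogus.
zdb -S data
```
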
Romain

-- 
Romain Tartière http://romain.blogreen.org/
pgp: 8234 9A78 E7C0 B807 0B59 80FF BA4D 1D95 5112 336F (ID: 0x5112336F)
(plain text =non-HTML= PGP/GPG encrypted/signed e-mail much appreciated)

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 15:16:58 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id BE10AF87 for ; Wed, 23 Jan 2013 15:16:58 +0000 (UTC) (envelope-from jym@baaz.fr) Received: from smtp6-g21.free.fr (smtp6-g21.free.fr [IPv6:2a01:e0c:1:1599::15]) by mx1.freebsd.org (Postfix) with ESMTP id 2E9569C6 for ; Wed, 23 Jan 2013 15:16:56 +0000 (UTC) Received: from niglo.baaz.fr (unknown [82.231.174.34]) by smtp6-g21.free.fr (Postfix) with ESMTP id 83DBE8230A; Wed, 23 Jan 2013 16:16:50 +0100 (CET) Received: from babylone.eileo-local.org (free.eileo.net [82.247.15.147]) by niglo.baaz.fr (Postfix) with ESMTPSA id 4692114C18; Wed, 23 Jan 2013 16:16:48 +0100 (CET) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD From: Jean-Yves Moulin In-Reply-To: <565CB55B-9A75-47F4-A88B-18FA8556E6A2@samsco.org> Date: Wed, 23 Jan 2013
16:16:45 +0100 Content-Transfer-Encoding: quoted-printable Message-Id: <81460DE8-89B4-41E8-9D93-81B8CC27AA87@baaz.fr> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <565CB55B-9A75-47F4-A88B-18FA8556E6A2@samsco.org> To: Scott Long X-Mailer: Apple Mail (2.1499) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 15:16:58 -0000

Hi,

On 22 Jan 2013, at 15:33 , Scott Long wrote:

> Agree 200%. Despite the best effort of sales and marketing people, RAID cards do not make good HBAs. At best they add latency. At worst, they add a lot of latency and extra failure modes.

But what about battery-backed cache RAID cards? They offer a non-volatile cache that improves writes, and this cache is safe because of the battery. This feature doesn't exist on bare disks.

best,
jym

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 16:22:49 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id ECD9C500 for ; Wed, 23 Jan 2013 16:22:49 +0000 (UTC) (envelope-from mad@madpilot.net) Received: from winston.madpilot.net (winston.madpilot.net [78.47.75.155]) by mx1.freebsd.org (Postfix) with ESMTP id A7E69E64 for ; Wed, 23 Jan 2013 16:22:49 +0000 (UTC) Received: from winston.madpilot.net (localhost [127.0.0.1]) by winston.madpilot.net (Postfix) with ESMTP id 3YrsGv1fT1zFX1H for ; Wed, 23 Jan 2013 17:22:47 +0100 (CET) X-Virus-Scanned: amavisd-new at madpilot.net Received: from winston.madpilot.net ([127.0.0.1]) by winston.madpilot.net (winston.madpilot.net [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Fts7KA5XHnfC for ; Wed, 23 Jan 2013 17:22:44 +0100 (CET) Received: from vwg82.hq.ignesti.it (unknown [80.74.176.55]) by winston.madpilot.net (Postfix) with ESMTPSA for ;
Wed, 23 Jan 2013 17:22:44 +0100 (CET) Message-ID: <51000E55.6070901@madpilot.net> Date: Wed, 23 Jan 2013 17:22:45 +0100 From: Guido Falsi User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130115 Thunderbird/17.0.2 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <565CB55B-9A75-47F4-A88B-18FA8556E6A2@samsco.org> <81460DE8-89B4-41E8-9D93-81B8CC27AA87@baaz.fr> In-Reply-To: <81460DE8-89B4-41E8-9D93-81B8CC27AA87@baaz.fr> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 16:22:50 -0000

On 01/23/13 16:16, Jean-Yves Moulin wrote:
> Hi,
>
> On 22 Jan 2013, at 15:33 , Scott Long wrote:
>
>> Agree 200%. Despite the best effort of sales and marketing people, RAID cards do not make good HBAs. At best they add latency. At worst, they add a lot of latency and extra failure modes.
>
> But what about battery-backed cache RAID cards? They offer a non-volatile cache that improves writes. And this cache is safe because of the battery. This feature doesn't exist on bare disks.

Safe is optimistic. The cache can usually keep the memory alive for 36-48 hours at most. In that (short) time frame you need to find identical hardware to which to move the disks and the controller without detaching the batteries. In practice this mostly means you need a second server, without disks, just in case you need a recovery. Also, expected battery life decreases with time.

Some vendors now sell solid-state cache memory which can hold data indefinitely. This is a more sensible approach (and looks very similar to a dedicated ZIL device to me).
It does not remove the need to find identical hardware to which to move the disks and controllers to recover the array, though. This is the one aspect in which open-source software RAID is better: any hardware with enough connectors of the correct kind (and enough RAM) will do for recovery.

-- 
Guido Falsi

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 18:38:41 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 3159F8B6; Wed, 23 Jan 2013 18:38:41 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id CFB4A78B; Wed, 23 Jan 2013 18:38:40 +0000 (UTC) Received: from pakbsde14.localnet (unknown [38.105.238.108]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 214FCB93E; Wed, 23 Jan 2013 13:38:40 -0500 (EST) From: John Baldwin To: fs@freebsd.org Subject: [PATCH] More time cleanups in the NFS code Date: Wed, 23 Jan 2013 13:38:38 -0500 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p22; KDE/4.5.5; amd64; ; ) MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201301231338.39056.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 23 Jan 2013 13:38:40 -0500 (EST) Cc: Rick Macklem , bde@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 18:38:41 -0000

This patch removes all calls to get*time(). Most of them it replaces with time_uptime (especially ones that are attempting to handle time intervals, for which time_uptime is far better suited than time_second).
One specific case it replaces with nanotime() as suggested by Bruce previously. A few of the timestamps were not used (nd_starttime and the curtime in the lease expiry function).

Index: fs/nfs/nfsport.h
===================================================================
--- fs/nfs/nfsport.h	(revision 245742)
+++ fs/nfs/nfsport.h	(working copy)
@@ -588,12 +588,6 @@
 #define NCHNAMLEN	9999999
 
 /*
- * Define these to use the time of day clock.
- */
-#define NFSGETTIME(t)	(getmicrotime(t))
-#define NFSGETNANOTIME(t)	(getnanotime(t))
-
-/*
  * These macros are defined to initialize and set the timer routine.
  */
 #define NFS_TIMERINIT \
Index: fs/nfs/nfs_commonkrpc.c
===================================================================
--- fs/nfs/nfs_commonkrpc.c	(revision 245742)
+++ fs/nfs/nfs_commonkrpc.c	(working copy)
@@ -459,18 +459,17 @@
 {
 	struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg;
 	struct nfsmount *nmp = nf->nf_mount;
-	struct timeval now;
+	time_t now;
 
-	getmicrouptime(&now);
-
 	switch (type) {
 	case FEEDBACK_REXMIT2:
 	case FEEDBACK_RECONNECT:
-		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) {
+		now = NFSD_MONOSEC;
+		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) {
 			nfs_down(nmp, nf->nf_td, "not responding", 0, NFSSTA_TIMEO);
 			nf->nf_tprintfmsg = TRUE;
-			nf->nf_lastmsg = now.tv_sec;
+			nf->nf_lastmsg = now;
 		}
 		break;
@@ -501,7 +500,7 @@
 	u_int16_t procnum;
 	u_int trylater_delay = 1;
 	struct nfs_feedback_arg nf;
-	struct timeval timo, now;
+	struct timeval timo;
 	AUTH *auth;
 	struct rpc_callextra ext;
 	enum clnt_stat stat;
@@ -617,8 +616,7 @@
 		bzero(&nf, sizeof(struct nfs_feedback_arg));
 		nf.nf_mount = nmp;
 		nf.nf_td = td;
-		getmicrouptime(&now);
-		nf.nf_lastmsg = now.tv_sec -
+		nf.nf_lastmsg = NFSD_MONOSEC -
 		    ((nmp->nm_tprintf_delay)-(nmp->nm_tprintf_initial_delay));
 	}
Index: fs/nfs/nfs.h
===================================================================
--- fs/nfs/nfs.h	(revision 245742)
+++ fs/nfs/nfs.h	(working copy)
@@ -523,7 +523,6 @@
 	int *nd_errp;			/* Pointer to ret status */
 	u_int32_t nd_retxid;		/* Reply xid */
 	struct nfsrvcache *nd_rp;	/* Assoc. cache entry */
-	struct timeval nd_starttime;	/* Time RPC initiated */
 	fhandle_t nd_fh;		/* File handle */
 	struct ucred *nd_cred;		/* Credentials */
 	uid_t nd_saveduid;		/* Saved uid */
Index: fs/nfsclient/nfs_clstate.c
===================================================================
--- fs/nfsclient/nfs_clstate.c	(revision 245742)
+++ fs/nfsclient/nfs_clstate.c	(working copy)
@@ -2447,7 +2447,7 @@
 	u_int32_t clidrev;
 	int error, cbpathdown, islept, igotlock, ret, clearok;
 	uint32_t recover_done_time = 0;
-	struct timespec mytime;
+	time_t mytime;
 	static time_t prevsec = 0;
 	struct nfscllockownerfh *lfhp, *nlfhp;
 	struct nfscllockownerfhhead lfh;
@@ -2720,9 +2720,9 @@
 	 * Call nfscl_cleanupkext() once per second to check for
 	 * open/lock owners where the process has exited.
 	 */
-	NFSGETNANOTIME(&mytime);
-	if (prevsec != mytime.tv_sec) {
-		prevsec = mytime.tv_sec;
+	mytime = NFSD_MONOSEC;
+	if (prevsec != mytime) {
+		prevsec = mytime;
 		nfscl_cleanupkext(clp, &lfh);
 	}
@@ -4611,7 +4611,7 @@
 	}
 	dp = nfscl_finddeleg(clp, np->n_fhp->nfh_fh, np->n_fhp->nfh_len);
 	if (dp != NULL && (dp->nfsdl_flags & NFSCLDL_WRITE)) {
-		NFSGETNANOTIME(&dp->nfsdl_modtime);
+		nanotime(&dp->nfsdl_modtime);
 		dp->nfsdl_flags |= NFSCLDL_MODTIMESET;
 	}
 	NFSUNLOCKCLSTATE();
Index: fs/nfsserver/nfs_nfsdkrpc.c
===================================================================
--- fs/nfsserver/nfs_nfsdkrpc.c	(revision 245742)
+++ fs/nfsserver/nfs_nfsdkrpc.c	(working copy)
@@ -310,7 +310,6 @@
 	} else {
 		isdgram = 1;
 	}
-	NFSGETTIME(&nd->nd_starttime);
 
 	/*
 	 * Two cases:
Index: fs/nfsserver/nfs_nfsdstate.c
===================================================================
--- fs/nfsserver/nfs_nfsdstate.c	(revision 245742)
+++ fs/nfsserver/nfs_nfsdstate.c	(working copy)
@@ -3967,7 +3967,6 @@
 	int error, i, tryagain;
 	off_t off = 0;
 	ssize_t aresid, len;
-	struct timeval curtime;
 
 	/*
 	 * If NFSNSF_UPDATEDONE is set, this is a restart of the nfsds without
@@ -3978,8 +3977,7 @@
 	/*
 	 * Set Grace over just until the file reads successfully.
 	 */
-	NFSGETTIME(&curtime);
-	nfsrvboottime = curtime.tv_sec;
+	nfsrvboottime = time_second;
 	LIST_INIT(&sf->nsf_head);
 	sf->nsf_flags = (NFSNSF_GRACEOVER | NFSNSF_NEEDLOCK);
 	sf->nsf_eograce = NFSD_MONOSEC + NFSRV_LEASEDELTA;
@@ -4650,7 +4648,7 @@
 APPLESTATIC void
 nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p)
 {
-	struct timespec mytime;
+	time_t mytime;
 	int32_t starttime;
 	int error;
@@ -4675,8 +4673,8 @@
 	 * Now, call nfsrv_checkremove() in a loop while it returns
 	 * NFSERR_DELAY. Return upon any other error or when timed out.
 	 */
-	NFSGETNANOTIME(&mytime);
-	starttime = (u_int32_t)mytime.tv_sec;
+	mytime = NFSD_MONOSEC;
+	starttime = (u_int32_t)mytime;
 	do {
 		if (NFSVOPLOCK(vp, LK_EXCLUSIVE) == 0) {
 			error = nfsrv_checkremove(vp, 0, p);
@@ -4684,11 +4682,9 @@
 		} else
 			error = EPERM;
 		if (error == NFSERR_DELAY) {
-			NFSGETNANOTIME(&mytime);
-			if (((u_int32_t)mytime.tv_sec - starttime) >
-			    NFS_REMOVETIMEO &&
-			    ((u_int32_t)mytime.tv_sec - starttime) <
-			    100000)
+			mytime = NFSD_MONOSEC;
+			if (((u_int32_t)mytime - starttime) > NFS_REMOVETIMEO &&
+			    ((u_int32_t)mytime - starttime) < 100000)
 				break;
 			/* Sleep for a short period of time */
 			(void) nfs_catnap(PZERO, 0, "nfsremove");
@@ -4949,9 +4945,7 @@
 static time_t
 nfsrv_leaseexpiry(void)
 {
-	struct timeval curtime;
 
-	NFSGETTIME(&curtime);
 	if (nfsrv_stablefirst.nsf_eograce > NFSD_MONOSEC)
 		return (NFSD_MONOSEC + 2 * (nfsrv_lease + NFSRV_LEASEDELTA));
 	return (NFSD_MONOSEC + nfsrv_lease + NFSRV_LEASEDELTA);
Index: nfsclient/nfs_krpc.c
===================================================================
--- nfsclient/nfs_krpc.c	(revision 245742)
+++ nfsclient/nfs_krpc.c	(working copy)
@@ -394,18 +394,17 @@
 {
 	struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg;
 	struct nfsmount *nmp = nf->nf_mount;
-	struct timeval now;
+	time_t now;
 
-	getmicrouptime(&now);
-
 	switch (type) {
 	case FEEDBACK_REXMIT2:
 	case FEEDBACK_RECONNECT:
-		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) {
+		now = time_uptime;
+		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) {
 			nfs_down(nmp, nf->nf_td, "not responding", 0, NFSSTA_TIMEO);
 			nf->nf_tprintfmsg = TRUE;
-			nf->nf_lastmsg = now.tv_sec;
+			nf->nf_lastmsg = now;
 		}
 		break;
@@ -438,7 +437,6 @@
 	time_t waituntil;
 	caddr_t dpos;
 	int error = 0, timeo;
-	struct timeval now;
 	AUTH *auth = NULL;
 	enum nfs_rto_timer_t timer;
 	struct nfs_feedback_arg nf;
@@ -455,8 +453,7 @@
 	bzero(&nf, sizeof(struct nfs_feedback_arg));
 	nf.nf_mount = nmp;
 	nf.nf_td = td;
-	getmicrouptime(&now);
-	nf.nf_lastmsg = now.tv_sec -
+	nf.nf_lastmsg = time_uptime -
 	    ((nmp->nm_tprintf_delay) - (nmp->nm_tprintf_initial_delay));
 
 	/*

-- 
John Baldwin

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 20:23:02 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 86CD17DF; Wed, 23 Jan 2013 20:23:02 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [188.252.31.196]) by mx1.freebsd.org (Postfix) with ESMTP id EA022CDE; Wed, 23 Jan 2013 20:23:01 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id r0NKMrEl001682; Wed, 23 Jan 2013 21:22:53 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id r0NKMqI3001679; Wed, 23 Jan 2013 21:22:53 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Wed, 23 Jan 2013 21:22:52 +0100 (CET) From: Wojciech Puchar To: Peter Jeremy Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
In-Reply-To: <20130122073641.GH30633@server.rulingia.com> Message-ID: References: <20130122073641.GH30633@server.rulingia.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Wed, 23 Jan 2013 21:22:53 +0100 (CET) Cc: freebsd-fs , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 20:23:02 -0000 >> While RAID-Z is already a king of bad performance, > > I don't believe RAID-Z is any worse than RAID5. Do you have any actual > measurements to back up your claim? it is clearly described even in ZFS papers. Both on reads and writes it gives single drive random I/O performance. From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 20:24:19 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 9BC9A93F; Wed, 23 Jan 2013 20:24:19 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [188.252.31.196]) by mx1.freebsd.org (Postfix) with ESMTP id 0A006CF5; Wed, 23 Jan 2013 20:24:18 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id r0NKOGVX001691; Wed, 23 Jan 2013 21:24:16 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id r0NKOFFl001688; Wed, 23 Jan 2013 21:24:16 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Wed, 23 Jan 2013 21:24:15 +0100 (CET) From: Wojciech Puchar To: Matthew Ahrens Subject: Re: ZFS regimen: scrub, scrub, 
scrub and scrub again. In-Reply-To: Message-ID: References: <20130122073641.GH30633@server.rulingia.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Wed, 23 Jan 2013 21:24:16 +0100 (CET) Cc: freebsd-fs , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 20:24:19 -0000 > This is because RAID-Z spreads each block out over all disks, whereas RAID5 > (as it is typically configured) puts each block on only one disk. So to > read a block from RAID-Z, all data disks must be involved, vs. for RAID5 > only one disk needs to have its head moved. > > For other workloads (especially streaming reads/writes), there is no > fundamental difference, though of course implementation quality may vary. Streaming workloads are generally good. Random I/O is what is important.
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 20:26:44 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 35989CF5; Wed, 23 Jan 2013 20:26:44 +0000 (UTC) (envelope-from utisoft@gmail.com) Received: from mail-ie0-f178.google.com (mail-ie0-f178.google.com [209.85.223.178]) by mx1.freebsd.org (Postfix) with ESMTP id 00D61D55; Wed, 23 Jan 2013 20:26:43 +0000 (UTC) Received: by mail-ie0-f178.google.com with SMTP id c12so14695291ieb.9 for ; Wed, 23 Jan 2013 12:26:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=9XrBORLOe2IsLQljrp9MmYuhcFI8/4m7lInRuUOVuew=; b=v1vzwwfD+9DwZKuxlIRR1ok9A8sExeD+7auzbZ6SJvT2wWKx4GuhcBV87L8f5KREPX rO+qf0XJpnt8VpRZaKLJpOPHm9VOaoXIZpJQG1PyQPBgOveZyb5ENDxm8pqnAksICrkY rRYEgijFaQwGrFSZDASME7CLMIFFsDZJYoedTP7uHlB/cA3NGaKb+MrmOvgKjlb6uftw b6IhABdJbtmelaKRBPzjxvRJKIF71wXv31M+OC7Z5s+G5tjhGU0DiUaFsrKqK3Bms+Av v2B5BJTqzUylhN7rYSayRgKkI0WWXS5OBaxcz8BQgDDO4N9MxTgzEXRslB6sDhbYE+2s E8+g== MIME-Version: 1.0 X-Received: by 10.43.114.4 with SMTP id ey4mr1995217icc.27.1358972803501; Wed, 23 Jan 2013 12:26:43 -0800 (PST) Received: by 10.64.16.73 with HTTP; Wed, 23 Jan 2013 12:26:43 -0800 (PST) Received: by 10.64.16.73 with HTTP; Wed, 23 Jan 2013 12:26:43 -0800 (PST) In-Reply-To: References: <20130122073641.GH30633@server.rulingia.com> Date: Wed, 23 Jan 2013 20:26:43 +0000 Message-ID: Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
From: Chris Rees To: Wojciech Puchar Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 20:26:44 -0000 On 23 Jan 2013 20:23, "Wojciech Puchar" wrote: >>> >>> While RAID-Z is already a king of bad performance, >> >> >> I don't believe RAID-Z is any worse than RAID5. Do you have any actual >> measurements to back up your claim? > > > it is clearly described even in ZFS papers. Both on reads and writes it gives single drive random I/O performance. So we have to take your word for it? Provide a link if you're going to make assertions, or they're no more than your own opinion. Chris From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:09:36 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 40A15A73; Wed, 23 Jan 2013 21:09:36 +0000 (UTC) (envelope-from feld@feld.me) Received: from feld.me (unknown [IPv6:2607:f4e0:100:300::2]) by mx1.freebsd.org (Postfix) with ESMTP id E88DDF12; Wed, 23 Jan 2013 21:09:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=feld.me; s=blargle; h=In-Reply-To:Message-Id:From:Mime-Version:Date:References:Subject:Cc:To:Content-Type; bh=IFkEk7sxzpdO70vPGpC8J/+tzwpriEknEjCNY6nj8CI=; b=gwP/BUB4vMWXgEYRdTkkboCD7L4b2CuuCc2FlM8aFqvUKUSEwzsbtBJCAoRLSg9JdJdGEmNYxREFa3YBvnk2xjhGMr2AgIGniYuIaHdHqGWigjtT8cg+bVyBRAVsg8Ia; Received: from localhost ([127.0.0.1] helo=mwi1.coffeenet.org) by feld.me with esmtp (Exim 4.80.1 (FreeBSD)) (envelope-from ) id 1Ty7ZY-0003UU-Fb; Wed, 23 Jan 2013 15:09:28 -0600 Received: from feld@feld.me by mwi1.coffeenet.org (Archiveopteryx 3.1.4) with esmtpsa id 1358975362-13187-64949/5/1; Wed, 23 
Jan 2013 21:09:22 +0000 Content-Type: text/plain; format=flowed; delsp=yes To: Wojciech Puchar , Chris Rees Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. References: <20130122073641.GH30633@server.rulingia.com> Date: Wed, 23 Jan 2013 15:09:22 -0600 Mime-Version: 1.0 From: Mark Felder Message-Id: In-Reply-To: User-Agent: Opera Mail/12.12 (FreeBSD) Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 21:09:36 -0000 On Wed, 23 Jan 2013 14:26:43 -0600, Chris Rees wrote: > > So we have to take your word for it? > Provide a link if you're going to make assertions, or they're no more > than > your own opinion. I've heard this same thing -- every vdev == 1 drive in performance. I've never seen any proof/papers on it though. From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:10:55 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4CDF2D00; Wed, 23 Jan 2013 21:10:55 +0000 (UTC) (envelope-from artemb@gmail.com) Received: from mail-vc0-f182.google.com (mail-vc0-f182.google.com [209.85.220.182]) by mx1.freebsd.org (Postfix) with ESMTP id F197DF40; Wed, 23 Jan 2013 21:10:54 +0000 (UTC) Received: by mail-vc0-f182.google.com with SMTP id fl17so4160851vcb.27 for ; Wed, 23 Jan 2013 13:10:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=f/8KJxGm7K3Y6HyhRdjkHuaatBFIhKtuAn2fAteXSO0=; b=IUtkWkHehzDzWfVuxnYY6/vaqlHyLwWsa8ktg0hkWDAOgoZs75m81WrPnCQS2PeaqJ NLiE/6A3B+uhfDuLOm2E/y8DU2douELunwmomb2+rA0UdrbVFrWgTsZqHXzDpm9YbsDL 
aOm658PmfJwjD1KQfLTuLdOBVJ02BYPDVlQBmpusm9/7Pak7R/aQMTHHYXqiM6a4jb8U WzPB1X1yZ4NMoTQmwfzq7g1H+k6OOAnA4ZnCvfN1X6KJ+1jHIvIRW+JmTz1RVcOW7QEW DdaW+skTNQ74kVQ2GL5ICI1KmbipFGCmG96skC4ntl8VKR2LKLTdKCQDa8nTdL12qXQb G7oA== MIME-Version: 1.0 X-Received: by 10.52.67.45 with SMTP id k13mr2820757vdt.9.1358975454188; Wed, 23 Jan 2013 13:10:54 -0800 (PST) Sender: artemb@gmail.com Received: by 10.220.123.2 with HTTP; Wed, 23 Jan 2013 13:10:54 -0800 (PST) In-Reply-To: References: <20130122073641.GH30633@server.rulingia.com> Date: Wed, 23 Jan 2013 13:10:54 -0800 X-Google-Sender-Auth: 2JIK8ydSsaEj5o2ILc8pnpmjeT8 Message-ID: Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. From: Artem Belevich To: Wojciech Puchar Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-fs , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 21:10:55 -0000 On Wed, Jan 23, 2013 at 12:22 PM, Wojciech Puchar wrote: >>> While RAID-Z is already a king of bad performance, >> >> >> I don't believe RAID-Z is any worse than RAID5. Do you have any actual >> measurements to back up your claim? > > > it is clearly described even in ZFS papers. Both on reads and writes it > gives single drive random I/O performance. For reads - true. For writes it probably behaves better than RAID5, as it does not have to go through read-modify-write for partial block updates. Search for "RAID-5 write hole". If you need higher performance, build your pool out of multiple RAID-Z vdevs.
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:23:48 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id E024648C; Wed, 23 Jan 2013 21:23:48 +0000 (UTC) (envelope-from artemb@gmail.com) Received: from mail-vb0-f47.google.com (mail-vb0-f47.google.com [209.85.212.47]) by mx1.freebsd.org (Postfix) with ESMTP id 84C6BFCC; Wed, 23 Jan 2013 21:23:48 +0000 (UTC) Received: by mail-vb0-f47.google.com with SMTP id e21so1650535vbm.34 for ; Wed, 23 Jan 2013 13:23:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=24nHO+462d1riarIWZ5ktl944GDji+uBqmN8eJS75Cw=; b=wtDtseVtMbZ3aY9LCKhAvibga+3GBvllFcKrnr5X67xko9tcHej7CHS+5de55u6fPI mSw6UGC/vy7CCECS9XokyHrobwLanuG3gm1QUAbt5y5N6T47panY+5z88Plsbf9aYRwb BJ+8tcZyJOmfCWQiFx6pewFIAOURP7XVK3wOU5hUrF2n3gqhSZtAWoiOA/NF2vHRXjep XJTePdWwVJlP0KMPB9DQl+xJZ6xvPtA/W4SOFPS3uZFCcWQ3JUVtmSOXikZV/P2wK6Le vygIVFSomCJ6GD/z9Bp+NF/6F6yBB4QOUZnZn8/XbNuhvUyc2GIT26pdjqqP2LMPOjn+ X4hQ== MIME-Version: 1.0 X-Received: by 10.52.76.7 with SMTP id g7mr2694351vdw.95.1358976227679; Wed, 23 Jan 2013 13:23:47 -0800 (PST) Sender: artemb@gmail.com Received: by 10.220.123.2 with HTTP; Wed, 23 Jan 2013 13:23:47 -0800 (PST) In-Reply-To: References: <20130122073641.GH30633@server.rulingia.com> Date: Wed, 23 Jan 2013 13:23:47 -0800 X-Google-Sender-Auth: H1jEo4QrSn9xnvkNQGX15PeFTCk Message-ID: Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
From: Artem Belevich To: Mark Felder Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-fs , Wojciech Puchar , Chris Rees , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 21:23:49 -0000 On Wed, Jan 23, 2013 at 1:09 PM, Mark Felder wrote: > On Wed, 23 Jan 2013 14:26:43 -0600, Chris Rees wrote: > >> >> So we have to take your word for it? >> Provide a link if you're going to make assertions, or they're no more than >> your own opinion. > > > I've heard this same thing -- every vdev == 1 drive in performance. I've > never seen any proof/papers on it though. "1 drive in performance" only applies to the number of random I/O operations a vdev can perform. You still get increased throughput. I.e. a 5-drive RAIDZ will have 4x the bandwidth of the individual disks in the vdev, but would deliver only as many IOPS as the slowest drive, as each record has to be read back from N-1 or N-2 drives in the vdev. It's the same for RAID5. IMHO, for identical record/block size, RAID5 has no advantage over RAID-Z for reads and does have a disadvantage when it comes to small writes. Never mind the lack of data integrity checks and other bells and whistles ZFS provides.
--Artem From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:24:26 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id A57855E3; Wed, 23 Jan 2013 21:24:26 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [188.252.31.196]) by mx1.freebsd.org (Postfix) with ESMTP id 1937FFDE; Wed, 23 Jan 2013 21:24:25 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id r0NLOF1t001978; Wed, 23 Jan 2013 22:24:15 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id r0NLOFxU001975; Wed, 23 Jan 2013 22:24:15 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Wed, 23 Jan 2013 22:24:15 +0100 (CET) From: Wojciech Puchar To: Mark Felder Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. In-Reply-To: Message-ID: References: <20130122073641.GH30633@server.rulingia.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Wed, 23 Jan 2013 22:24:16 +0100 (CET) Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Chris Rees X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 21:24:26 -0000 > > I've heard this same thing -- every vdev == 1 drive in performance. I've > never seen any proof/papers on it though. read original ZFS papers. 
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:25:09 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4CF28740; Wed, 23 Jan 2013 21:25:09 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [188.252.31.196]) by mx1.freebsd.org (Postfix) with ESMTP id B2C7BFF5; Wed, 23 Jan 2013 21:25:08 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id r0NLP79g001989; Wed, 23 Jan 2013 22:25:07 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id r0NLP6YK001986; Wed, 23 Jan 2013 22:25:06 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Wed, 23 Jan 2013 22:25:06 +0100 (CET) From: Wojciech Puchar To: Artem Belevich Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. In-Reply-To: Message-ID: References: <20130122073641.GH30633@server.rulingia.com> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Wed, 23 Jan 2013 22:25:07 +0100 (CET) Cc: freebsd-fs , FreeBSD Hackers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Jan 2013 21:25:09 -0000 >> gives single drive random I/O performance. > > For reads - true. For writes it's probably behaves better than RAID5 yes, because as with reads it gives single-drive performance. Small writes on RAID5 give lower than single-disk performance.
> If you need higher performance, build your pool out of multiple RAID-Z
> vdevs.

Even if you need only normal performance, use gmirror and UFS.

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:27:06 2013
Date: Wed, 23 Jan 2013 21:26:35 +0000
From: Chris Rees <utisoft@gmail.com>
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
To: Wojciech Puchar
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Mark Felder

On 23 January 2013 21:24, Wojciech Puchar wrote:
>> I've heard this same thing -- every vdev == 1 drive in performance.
>> I've never seen any proof/papers on it though.
>
> read original ZFS papers.

No, you are making the assertion; provide a link.

Chris

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:40:44 2013
Date: Wed, 23 Jan 2013 22:40:39 +0100 (CET)
From: Wojciech Puchar <wojtek@wojtek.tensor.gdynia.pl>
To: Artem Belevich
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
> "1 drive in performance" only applies to the number of random I/O
> operations a vdev can perform. You still get increased throughput, i.e.
> a 5-drive RAIDZ will have 4x the bandwidth of the individual disks in
> the vdev, but

Unless your work is serving movies, it doesn't matter.
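[Editorial note: both halves of the exchange above fit one rough model -- random IOPS scale per vdev, streaming bandwidth per data disk. All per-disk numbers are illustrative assumptions:]

```python
# Random IOPS: a full-stripe RAIDZ read/write touches every disk in the
# vdev, so each vdev contributes roughly one drive's worth of random IOPS.
def pool_random_iops(n_vdevs, per_disk_iops):
    return n_vdevs * per_disk_iops

# Streaming bandwidth: the data disks transfer in parallel, so it scales
# with (disks - parity) within each vdev.
def raidz_stream_mbps(disks_per_vdev, parity, per_disk_mbps):
    return (disks_per_vdev - parity) * per_disk_mbps

# One 5-disk RAIDZ vdev of 150-IOPS / 100 MB/s drives:
assert pool_random_iops(1, 150) == 150       # ~1 drive of random I/O
assert raidz_stream_mbps(5, 1, 100) == 400   # ~4x streaming bandwidth

# Ten 2-way mirror vdevs from 20 such disks: 10 drives of random IOPS.
assert pool_random_iops(10, 150) == 1500
```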
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:49:38 2013
Date: Wed, 23 Jan 2013 13:49:37 -0800
From: Artem Belevich <artemb@gmail.com>
To: Wojciech Puchar
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.

On Wed, Jan 23, 2013 at 1:25 PM, Wojciech Puchar wrote:
>>> gives single drive random I/O performance.
>>
>> For reads - true. For writes it probably behaves better than RAID5.
>
> yes, because as with reads it gives single drive performance. small
> writes on RAID5 gives lower than single disk performance.
>
>> If you need higher performance, build your pool out of multiple RAID-Z
>> vdevs.
>
> even you need normal performance use gmirror and UFS

I have no objection. If it works for you -- go for it. For me personally,
ZFS performance is good enough, and data-integrity verification is
something I am willing to sacrifice some performance for. A ZFS scrub
either gives me a warm and fuzzy feeling that everything is OK, or
explicitly tells me that something bad happened *and* reconstructs the
data when possible.
Just my $0.02,
--Artem

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 21:52:27 2013
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
From: Nikolay Denev <ndenev@gmail.com>
Date: Wed, 23 Jan 2013 23:52:21 +0200
To: Mark Felder
Cc: freebsd-fs@freebsd.org, Wojciech Puchar, Chris Rees,
    freebsd-hackers@freebsd.org

On Jan 23, 2013, at 11:09 PM, Mark Felder wrote:

> On Wed, 23 Jan 2013 14:26:43 -0600, Chris Rees wrote:
>
>> So we have to take your word for it?
>> Provide a link if you're going to make assertions, or they're no more
>> than your own opinion.
>
> I've heard this same thing -- every vdev == 1 drive in performance.
> I've never seen any proof/papers on it though.
Here is a blog post that describes why this is true for IOPS:
http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 22:27:56 2013
Date: Wed, 23 Jan 2013 23:27:53 +0100 (CET)
From: Wojciech Puchar <wojtek@wojtek.tensor.gdynia.pl>
To: Artem Belevich
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
>>> even you need normal performance use gmirror and UFS
>>
>> I've no objection. If it works for you -- go for it.

Both "work". Given today's trend of solving everything with more hardware,
ZFS may even have "enough" performance.

But it is still dangerous, for the reasons I explained, and it also
promotes bad setups and layouts, like making a single filesystem out of a
large number of disks. That is bad no matter what filesystem, RAID setup,
or even OS you use.
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 22:39:52 2013
Date: Wed, 23 Jan 2013 14:33:44 -0800
From: matt <sendtomatt@gmail.com>
To: Wojciech Puchar
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
Cc: freebsd-fs, FreeBSD Hackers

On 01/23/13 14:27, Wojciech Puchar wrote:
> both "works". For todays trend of solving everything by more hardware
> ZFS may even have "enough" performance.
>
> But still it is dangerous for a reasons i explained, as well as it
> promotes bad setups and layouts like making single filesystem out of
> large amount of disks. This is bad for no matter what filesystem and
> RAID setup you use, or even what OS.

ZFS mirror performance is quite good (both random I/O and sequential),
and resilvers/scrubs are measured in an hour or less. You can always
build a pool out of mirrors instead of RAIDZ if you can get away with
less total available space. I think RAIDZ vs. gmirror is a bad
comparison; you can use a ZFS mirror with all the ZFS features, plus
N-way mirroring (I'm not sure gmirror does this).
Regarding single large filesystems, there is an old saying about not
putting all your eggs in one basket, even if it's a great basket :)

Matt
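[Editorial note: the mirror behaviour described above can be sketched the same way as the RAIDZ case -- figures are illustrative assumptions, not benchmarks:]

```python
# An N-way ZFS mirror can satisfy independent reads from different
# members, while every write must be committed to all members.

def mirror_read_iops(n_way, per_disk_iops):
    return n_way * per_disk_iops   # reads fan out across members

def mirror_write_iops(n_way, per_disk_iops):
    return per_disk_iops           # each write hits every member

# A 3-way mirror of 150-IOPS drives:
assert mirror_read_iops(3, 150) == 450
assert mirror_write_iops(3, 150) == 150
```

Resilver work is also bounded by one member's worth of *allocated* data, which is consistent with mirror resilvers finishing in an hour or less.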
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 22:45:02 2013
Date: Wed, 23 Jan 2013 22:44:44 +0000
From: Steven Chamberlain <steven@pyro.eu.org>
To: Wojciech Puchar
Cc: freebsd-fs, Mark Felder, Chris Rees
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.

On 23/01/13 21:40, Wojciech Puchar wrote:
>> "1 drive in performance" only applies to number of random i/o
>> operations vdev can perform. You still get increased throughput. I.e.
>> 5-drive RAIDZ will have 4x bandwidth of individual disks in vdev, but
>
> unless your work is serving movies it doesn't matter.

That's why I find it really interesting that the Netflix Open Connect
appliance didn't use ZFS - it would have seemed perfect for that
application.

http://lists.freebsd.org/pipermail/freebsd-stable/2012-June/068129.html

Instead there are plain UFS+J filesystems on some 36 disks and no RAID -
it tries to handle almost everything at the application layer instead.

I'm sure it's worked out okay for them, but I wonder how much easier it
would be if they could push content updates out with a 'zfs send',
without having to take the machine offline. If split into a few large
RAID-Zs, a disk could stay in use even when a few blocks have failed.
Regards,
--
Steven Chamberlain
steven@pyro.eu.org

From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 22:52:38 2013
Date: Wed, 23 Jan 2013 23:52:32 +0100 (CET)
From: Wojciech Puchar <wojtek@wojtek.tensor.gdynia.pl>
To: Steven Chamberlain
Cc: freebsd-fs, Mark Felder, Chris Rees
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.

>> unless your work is serving movies it doesn't matter.
> That's why I find it really interesting the Netflix Open Connect
> appliance didn't use ZFS - it would have seemed perfect for that
> application.

It "seems perfect" only to ZFS marketers and their victims; it is at most
usable, and dangerous. Because doing it with UFS is ACTUALLY perfect:
large parallel transfers are great with UFS, >95% of platter speed is
normal with near-zero CPU load, and the amount of metadata is minimal and
doesn't matter for performance or fsck time (though +J would make it even
smaller). Getting ca. 90% of platter speed under a multitasking load is
possible with a proper setup.

> http://lists.freebsd.org/pipermail/freebsd-stable/2012-June/068129.html
>
> Instead there are plain UFS+J filesystems on some 36 disks and no RAID -
> it tries to handle almost everything at the application layer instead.

This is exactly the kind of setup I would do in their case. They can
restore all the data, since the master movie storage is not here; but
they only have to restore 2 drives if 2 drives fail at the same time, not
36 :)

The "application layer" is quite trivial - just store where each movie
is. Such a setup could easily handle two 10Gb/s cards, or more if the
load is spread over the drives.
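[Editorial note: a sketch of the "just store where each movie is" application layer described above; the disk paths, disk count, and hashing scheme are hypothetical illustrations, not Netflix's actual design:]

```python
import hashlib

# 36 independent UFS filesystems, one per disk, as in the appliance.
DISKS = [f"/content/disk{i}" for i in range(36)]

def disk_for(title: str) -> str:
    """Deterministically map a title to a disk.

    A real deployment would more likely keep an explicit catalog, so
    titles can be re-placed when a drive fails or content is rebalanced.
    """
    h = int.from_bytes(hashlib.sha256(title.encode()).digest()[:8], "big")
    return DISKS[h % len(DISKS)]

path = disk_for("some_movie.mp4")
assert path in DISKS
assert path == disk_for("some_movie.mp4")  # stable across calls
```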
From owner-freebsd-fs@FreeBSD.ORG Wed Jan 23 23:49:28 2013
Date: Wed, 23 Jan 2013 15:49:21 -0800
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
From: Artem Belevich <artemb@gmail.com>
To: Wojciech Puchar
Cc: freebsd-fs, Mark Felder, Chris Rees

On Wed, Jan 23, 2013 at 2:52 PM, Wojciech Puchar wrote:
> because doing it with UFS is ACTUALLY perfect.

That's a bold statement. Literally. :-)

However, it appears that we're nowhere near having a rational discussion:
http://thoughtcatalog.com/2011/how-to-have-a-rational-discussion/

I've just reached the early exit point on the chart.

--Artem

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 03:21:01 2013
Date: Wed, 23 Jan 2013 22:20:53 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin
In-Reply-To: <201301231338.39056.jhb@freebsd.org>
Subject: Re:
 [PATCH] More time cleanups in the NFS code
Cc: Rick Macklem, bde@freebsd.org, fs@freebsd.org

John Baldwin wrote:
> This patch removes all calls to get*time(). Most of them it replaces
> with time_uptime (especially ones that are attempting to handle time
> intervals, for which time_uptime is far better suited than
> time_second). One specific case it replaces with nanotime(), as
> suggested by Bruce previously. A few of the timestamps were not used
> (nd_starttime and the curtime in the lease expiry function).

All looks fine to me. Thanks for doing this, John.

rick

> Index: fs/nfs/nfsport.h
> ===================================================================
> --- fs/nfs/nfsport.h (revision 245742)
> +++ fs/nfs/nfsport.h (working copy)
> @@ -588,12 +588,6 @@
> #define NCHNAMLEN 9999999
>
> /*
> - * Define these to use the time of day clock.
> - */
> -#define NFSGETTIME(t) (getmicrotime(t))
> -#define NFSGETNANOTIME(t) (getnanotime(t))
> -
> -/*
>  * These macros are defined to initialize and set the timer routine.
> */
> #define NFS_TIMERINIT \
> Index: fs/nfs/nfs_commonkrpc.c
> ===================================================================
> --- fs/nfs/nfs_commonkrpc.c (revision 245742)
> +++ fs/nfs/nfs_commonkrpc.c (working copy)
> @@ -459,18 +459,17 @@
> {
> 	struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg;
> 	struct nfsmount *nmp = nf->nf_mount;
> -	struct timeval now;
> +	time_t now;
>
> -	getmicrouptime(&now);
> -
> 	switch (type) {
> 	case FEEDBACK_REXMIT2:
> 	case FEEDBACK_RECONNECT:
> -		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) {
> +		now = NFSD_MONOSEC;
> +		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) {
> 			nfs_down(nmp, nf->nf_td,
> 			    "not responding", 0, NFSSTA_TIMEO);
> 			nf->nf_tprintfmsg = TRUE;
> -			nf->nf_lastmsg = now.tv_sec;
> +			nf->nf_lastmsg = now;
> 		}
> 		break;
>
> @@ -501,7 +500,7 @@
> 	u_int16_t procnum;
> 	u_int trylater_delay = 1;
> 	struct nfs_feedback_arg nf;
> -	struct timeval timo, now;
> +	struct timeval timo;
> 	AUTH *auth;
> 	struct rpc_callextra ext;
> 	enum clnt_stat stat;
> @@ -617,8 +616,7 @@
> 	bzero(&nf, sizeof(struct nfs_feedback_arg));
> 	nf.nf_mount = nmp;
> 	nf.nf_td = td;
> -	getmicrouptime(&now);
> -	nf.nf_lastmsg = now.tv_sec -
> +	nf.nf_lastmsg = NFSD_MONOSEC -
> 	    ((nmp->nm_tprintf_delay)-(nmp->nm_tprintf_initial_delay));
> 	}
>
> Index: fs/nfs/nfs.h
> ===================================================================
> --- fs/nfs/nfs.h (revision 245742)
> +++ fs/nfs/nfs.h (working copy)
> @@ -523,7 +523,6 @@
> 	int *nd_errp;			/* Pointer to ret status */
> 	u_int32_t nd_retxid;		/* Reply xid */
> 	struct nfsrvcache *nd_rp;	/* Assoc. cache entry */
> -	struct timeval nd_starttime;	/* Time RPC initiated */
> 	fhandle_t nd_fh;		/* File handle */
> 	struct ucred *nd_cred;		/* Credentials */
> 	uid_t nd_saveduid;		/* Saved uid */
> Index: fs/nfsclient/nfs_clstate.c
> ===================================================================
> --- fs/nfsclient/nfs_clstate.c (revision 245742)
> +++ fs/nfsclient/nfs_clstate.c (working copy)
> @@ -2447,7 +2447,7 @@
> 	u_int32_t clidrev;
> 	int error, cbpathdown, islept, igotlock, ret, clearok;
> 	uint32_t recover_done_time = 0;
> -	struct timespec mytime;
> +	time_t mytime;
> 	static time_t prevsec = 0;
> 	struct nfscllockownerfh *lfhp, *nlfhp;
> 	struct nfscllockownerfhhead lfh;
> @@ -2720,9 +2720,9 @@
> 	 * Call nfscl_cleanupkext() once per second to check for
> 	 * open/lock owners where the process has exited.
> 	 */
> -	NFSGETNANOTIME(&mytime);
> -	if (prevsec != mytime.tv_sec) {
> -		prevsec = mytime.tv_sec;
> +	mytime = NFSD_MONOSEC;
> +	if (prevsec != mytime) {
> +		prevsec = mytime;
> 		nfscl_cleanupkext(clp, &lfh);
> 	}
>
> @@ -4611,7 +4611,7 @@
> 	}
> 	dp = nfscl_finddeleg(clp, np->n_fhp->nfh_fh, np->n_fhp->nfh_len);
> 	if (dp != NULL && (dp->nfsdl_flags & NFSCLDL_WRITE)) {
> -		NFSGETNANOTIME(&dp->nfsdl_modtime);
> +		nanotime(&dp->nfsdl_modtime);
> 		dp->nfsdl_flags |= NFSCLDL_MODTIMESET;
> 	}
> 	NFSUNLOCKCLSTATE();
> Index: fs/nfsserver/nfs_nfsdkrpc.c
> ===================================================================
> --- fs/nfsserver/nfs_nfsdkrpc.c (revision 245742)
> +++ fs/nfsserver/nfs_nfsdkrpc.c (working copy)
> @@ -310,7 +310,6 @@
> 	} else {
> 		isdgram = 1;
> 	}
> -	NFSGETTIME(&nd->nd_starttime);
>
> 	/*
> 	 * Two cases:
> Index: fs/nfsserver/nfs_nfsdstate.c
> ===================================================================
> --- fs/nfsserver/nfs_nfsdstate.c (revision 245742)
> +++ fs/nfsserver/nfs_nfsdstate.c (working copy)
> @@ -3967,7 +3967,6 @@
> 	int error, i, tryagain;
> 	off_t off = 0;
> 	ssize_t aresid, len;
> -	struct timeval curtime;
>
> 	/*
> 	 * If NFSNSF_UPDATEDONE is set, this is a restart of the nfsds without
> @@ -3978,8 +3977,7 @@
> 	/*
> 	 * Set Grace over just until the file reads successfully.
> 	 */
> -	NFSGETTIME(&curtime);
> -	nfsrvboottime = curtime.tv_sec;
> +	nfsrvboottime = time_second;
> 	LIST_INIT(&sf->nsf_head);
> 	sf->nsf_flags = (NFSNSF_GRACEOVER | NFSNSF_NEEDLOCK);
> 	sf->nsf_eograce = NFSD_MONOSEC + NFSRV_LEASEDELTA;
> @@ -4650,7 +4648,7 @@
> APPLESTATIC void
> nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p)
> {
> -	struct timespec mytime;
> +	time_t mytime;
> 	int32_t starttime;
> 	int error;
>
> @@ -4675,8 +4673,8 @@
> 	 * Now, call nfsrv_checkremove() in a loop while it returns
> 	 * NFSERR_DELAY. Return upon any other error or when timed out.
> 	 */
> -	NFSGETNANOTIME(&mytime);
> -	starttime = (u_int32_t)mytime.tv_sec;
> +	mytime = NFSD_MONOSEC;
> +	starttime = (u_int32_t)mytime;
> 	do {
> 		if (NFSVOPLOCK(vp, LK_EXCLUSIVE) == 0) {
> 			error = nfsrv_checkremove(vp, 0, p);
> @@ -4684,11 +4682,9 @@
> 		} else
> 			error = EPERM;
> 		if (error == NFSERR_DELAY) {
> -			NFSGETNANOTIME(&mytime);
> -			if (((u_int32_t)mytime.tv_sec - starttime) >
> -			    NFS_REMOVETIMEO &&
> -			    ((u_int32_t)mytime.tv_sec - starttime) <
> -			    100000)
> +			mytime = NFSD_MONOSEC;
> +			if (((u_int32_t)mytime - starttime) > NFS_REMOVETIMEO &&
> +			    ((u_int32_t)mytime - starttime) < 100000)
> 				break;
> 			/* Sleep for a short period of time */
> 			(void) nfs_catnap(PZERO, 0, "nfsremove");
> @@ -4949,9 +4945,7 @@
> static time_t
> nfsrv_leaseexpiry(void)
> {
> -	struct timeval curtime;
>
> -	NFSGETTIME(&curtime);
> 	if (nfsrv_stablefirst.nsf_eograce > NFSD_MONOSEC)
> 		return (NFSD_MONOSEC + 2 * (nfsrv_lease + NFSRV_LEASEDELTA));
> 	return (NFSD_MONOSEC + nfsrv_lease + NFSRV_LEASEDELTA);
> Index: nfsclient/nfs_krpc.c
> ===================================================================
> --- nfsclient/nfs_krpc.c (revision 245742)
> +++ nfsclient/nfs_krpc.c (working copy)
> @@ -394,18 +394,17 @@
> {
> 	struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg;
> 	struct nfsmount *nmp = nf->nf_mount;
> -	struct timeval now;
> +	time_t now;
>
> -	getmicrouptime(&now);
> -
> 	switch (type) {
> 	case FEEDBACK_REXMIT2:
> 	case FEEDBACK_RECONNECT:
> -		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) {
> +		now = time_uptime;
> +		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) {
> 			nfs_down(nmp, nf->nf_td,
> 			    "not responding", 0, NFSSTA_TIMEO);
> 			nf->nf_tprintfmsg = TRUE;
> -			nf->nf_lastmsg = now.tv_sec;
> +			nf->nf_lastmsg = now;
> 		}
> 		break;
>
> @@ -438,7 +437,6 @@
> 	time_t waituntil;
> 	caddr_t dpos;
> 	int error = 0, timeo;
> -	struct timeval now;
> 	AUTH *auth = NULL;
> 	enum nfs_rto_timer_t timer;
> 	struct nfs_feedback_arg nf;
> @@ -455,8 +453,7 @@
> 	bzero(&nf, sizeof(struct nfs_feedback_arg));
> 	nf.nf_mount = nmp;
> 	nf.nf_td = td;
> -	getmicrouptime(&now);
> -	nf.nf_lastmsg = now.tv_sec -
> +	nf.nf_lastmsg = time_uptime -
> 	    ((nmp->nm_tprintf_delay) - (nmp->nm_tprintf_initial_delay));
>
> 	/*
>
> --
> John Baldwin
r0O6cUl1004459; Thu, 24 Jan 2013 07:38:30 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Thu, 24 Jan 2013 07:38:30 +0100 (CET) From: Wojciech Puchar To: Artem Belevich Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. In-Reply-To: Message-ID: References: <20130122073641.GH30633@server.rulingia.com> <510067DC.7030707@pyro.eu.org> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Thu, 24 Jan 2013 07:38:31 +0100 (CET) Cc: freebsd-fs , Mark Felder X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 06:38:41 -0000 > > That's a bold statement. Literally. :-) And literally tested. But a few settings have to be changed, and they have for unknown reasons been capped at bad values for years. I will not repeat the details, as I have already written them many times. If you want real results: I am getting around 50MB/s of total I/O throughput when reading 10 different files in parallel on my really low-speed laptop disk. It can do at most 69MB/s from the outer surface, tested by dd if=/dev/ada0 of=/dev/null bs=2m 100MB/s is normal for modern 7200rpm SATA disks. UFS has weak points, like handling lots of small files, but except for writing, or reading already-cached things, it is still faster than ZFS. That's about speed. As for reliability, I have already explained it. UFS can be improved. The best improvement would be SSD caching for metadata and selected data. But something real, not L2ARC or DragonFlyBSD's swapcache. The moment you need high performance the most is after a failure, when everyone has been waiting for the system to come up and then, all at once, wants to use it. At that moment the L2ARC is empty and gives no help. SSD cache must be persistent.
Another improvement would be adding, after the user quota and group quota, a jail quota and a jail ID in the metadata. Not that useful for me, but it may be useful for people who serve thousands of random unknown clients. Things like snapshots and pseudo-filesystems are cool but not really that useful, at least for me. And when using them heavily one will quickly get lost in the mess. > However, it appears that we're nowhere near having a rational discussion: > http://thoughtcatalog.com/2011/how-to-have-a-rational-discussion/ > True. And it will not change until technical arguments are the only arguments, not fanatic reactions or repeated marketing words. ZFS is a great marketing tool. It promises features that at first would seem to make administering a server very easy. Then - there are companies for data recovery :) I wish them lots of money. From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 08:15:36 2013 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id BC721C9D; Thu, 24 Jan 2013 08:15:36 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail27.syd.optusnet.com.au (mail27.syd.optusnet.com.au [211.29.133.168]) by mx1.freebsd.org (Postfix) with ESMTP id 3D603DE0; Thu, 24 Jan 2013 08:15:35 +0000 (UTC) Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106]) by mail27.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r0O8FQ9s029583 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 24 Jan 2013 19:15:28 +1100 Date: Thu, 24 Jan 2013 19:15:26 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin Subject: Re: [PATCH] More time cleanups in the NFS code In-Reply-To: <201301231338.39056.jhb@freebsd.org> Message-ID: <20130124184756.O1180@besplex.bde.org> References: <201301231338.39056.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII;
format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=MscKcBme c=1 sm=1 a=Kpej93CV1R8A:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=_HDPiWP5c-IA:10 a=Ae79VNAGcdZGe2jW8Y0A:9 a=CjuIK1q_8ugA:10 a=TEtd8y5WR3g2ypngnwZWYw==:117 Cc: Rick Macklem , bde@FreeBSD.org, fs@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 08:15:36 -0000 On Wed, 23 Jan 2013, John Baldwin wrote: > This patch removes all calls to get*time(). Most of them it replaces with > time_uptime (especially ones that are attempting to handle time intervals for > which time_uptime is far better suited than time_second). One specific case > it replaces with nanotime() as suggested by Bruce previously. A few of the > timestamps were not used (nd_starttime and the curtime in the lease expiry > function). Looks good. I didn't check for completeness. oldnfs might benefit from use of NFSD_MONOSEC. Both nfs's might benefit from use of NFS_REALSEC (doesn't exist but would be #defined as time_second if acceses to this global are atomic (which I think is implied by its existence)). > Index: fs/nfs/nfs_commonkrpc.c > =================================================================== > --- fs/nfs/nfs_commonkrpc.c (revision 245742) > +++ fs/nfs/nfs_commonkrpc.c (working copy) > @@ -459,18 +459,17 @@ > { > struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg; > struct nfsmount *nmp = nf->nf_mount; > - struct timeval now; > + time_t now; > > - getmicrouptime(&now); > - > switch (type) { > case FEEDBACK_REXMIT2: > case FEEDBACK_RECONNECT: > - if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) { > + now = NFSD_MONOSEC; It's confusing for 'now' to be in mono-time. 
> + if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) { > nfs_down(nmp, nf->nf_td, > "not responding", 0, NFSSTA_TIMEO); > nf->nf_tprintfmsg = TRUE; > - nf->nf_lastmsg = now.tv_sec; > + nf->nf_lastmsg = now; > } > break; It's safest but probably unnecessary (uncritical) to copy the (not quite volatile) variable NFSD_MONOSEC to a local variable, since it is used twice. Now I don't like the NFSD_MONOSEC macro. It looks like a constant, but is actually a not quite volatile variable. > Index: fs/nfsclient/nfs_clstate.c > =================================================================== > --- fs/nfsclient/nfs_clstate.c (revision 245742) > +++ fs/nfsclient/nfs_clstate.c (working copy) > @@ -2447,7 +2447,7 @@ > u_int32_t clidrev; > int error, cbpathdown, islept, igotlock, ret, clearok; > uint32_t recover_done_time = 0; > - struct timespec mytime; > + time_t mytime; Another name for the cached copy of mono-now. > @@ -2720,9 +2720,9 @@ > * Call nfscl_cleanupkext() once per second to check for > * open/lock owners where the process has exited. > */ > - NFSGETNANOTIME(&mytime); > - if (prevsec != mytime.tv_sec) { > - prevsec = mytime.tv_sec; > + mytime = NFSD_MONOSEC; > + if (prevsec != mytime) { > + prevsec = mytime; > nfscl_cleanupkext(clp, &lfh); > } > Now copying it is clearly needed. > @@ -4684,11 +4682,9 @@ > } else > error = EPERM; > if (error == NFSERR_DELAY) { > - NFSGETNANOTIME(&mytime); > - if (((u_int32_t)mytime.tv_sec - starttime) > > - NFS_REMOVETIMEO && > - ((u_int32_t)mytime.tv_sec - starttime) < > - 100000) > + mytime = NFSD_MONOSEC; > + if (((u_int32_t)mytime - starttime) > NFS_REMOVETIMEO && > + ((u_int32_t)mytime - starttime) < 100000) > break; > /* Sleep for a short period of time */ > (void) nfs_catnap(PZERO, 0, "nfsremove"); Should use time_t for all times in seconds and no casts to u_int32_t (unless the times are put in data structures -- then 64-bit times are wasteful). 
Here, when not doing this cleanup, mytime might as well have type u_int32_t to begin with, to match starttime. Then the bogus cast would be implicit in the assignment to mytime. The old code had to cast to break the type of mytime.tv_sec to match that of starttime. This of course only mattered when the times were non-monotonic, time_t was 64 bits, and the non-mono time was later than the middle of 2038. Bruce From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 08:33:01 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 6FF4A224 for ; Thu, 24 Jan 2013 08:33:01 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from proxypop04.sare.net (proxypop04.sare.net [194.30.0.65]) by mx1.freebsd.org (Postfix) with ESMTP id 34AFFEAA for ; Thu, 24 Jan 2013 08:33:00 +0000 (UTC) Received: from [172.16.2.2] (izaro.sarenet.es [192.148.167.11]) by proxypop04.sare.net (Postfix) with ESMTPSA id A75C69DD7DF; Thu, 24 Jan 2013 09:32:47 +0100 (CET) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD Mime-Version: 1.0 (Apple Message framework v1085) Content-Type: text/plain; charset=us-ascii From: Borja Marcos In-Reply-To: <81460DE8-89B4-41E8-9D93-81B8CC27AA87@baaz.fr> Date: Thu, 24 Jan 2013 09:32:58 +0100 Content-Transfer-Encoding: quoted-printable Message-Id: <8BA7B786-3B4B-473B-B4F0-798C9B5AEF00@sarenet.es> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <565CB55B-9A75-47F4-A88B-18FA8556E6A2@samsco.org> <81460DE8-89B4-41E8-9D93-81B8CC27AA87@baaz.fr> To: Jean-Yves Moulin X-Mailer: Apple Mail (2.1085) Cc: FreeBSD Filesystems , Scott Long X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 08:33:01 -0000 On Jan 23, 2013, at 4:16 PM, Jean-Yves Moulin wrote: > But what about battery-backed cache RAID card 
? They offer a non-volatile cache that improves writes. And this cache is safe because of the battery. This feature doesn't exist on bare disks. They can be "fine" for certain applications, especially with limited-ability filesystems. But we are speaking about using maybe the latest and greatest in filesystem technology, with a superior mechanism to manage redundancy and I/O bandwidth. Using another redundancy mechanism underneath can make matters worse, with one system working against the other. ZFS manages it better. ZFS allows you to decide if you need to cache metadata and/or data or none of them. RAID cards can show stupid caching behaviors depending on your workload. So, RAID card with ZFS, definitely a no-no. As Scott said, more failure modes. And some of them, complex. Many trivial operations may require a reboot. The card hides important disk diagnostics from ZFS. Borja. From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 08:34:30 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 55D582B4 for ; Thu, 24 Jan 2013 08:34:30 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from proxypop04.sare.net (proxypop04.sare.net [194.30.0.65]) by mx1.freebsd.org (Postfix) with ESMTP id 198DBEBE for ; Thu, 24 Jan 2013 08:34:29 +0000 (UTC) Received: from [172.16.2.2] (izaro.sarenet.es [192.148.167.11]) by proxypop04.sare.net (Postfix) with ESMTPSA id 5DF2F9DD778; Thu, 24 Jan 2013 09:27:06 +0100 (CET) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD Mime-Version: 1.0 (Apple Message framework v1085) Content-Type: text/plain; charset=us-ascii From: Borja Marcos In-Reply-To: <20130123143018.GA5533@roberto02-aw.eurocontrol.fr> Date: Thu, 24 Jan 2013 09:27:14 +0100 Content-Transfer-Encoding: quoted-printable Message-Id: <0AF9A29D-5B5A-4CB6-B880-7F43CA7FC612@sarenet.es> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>
<20130123143018.GA5533@roberto02-aw.eurocontrol.fr> To: Ollivier Robert X-Mailer: Apple Mail (2.1085) Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 08:34:30 -0000 On Jan 23, 2013, at 3:30 PM, Ollivier Robert wrote: > According to Freddie Cash on Tue, Jan 22, 2013 at 03:02:40PM -0800: >> The ZFS metadata on disk allows you to move disks around in a system and >> still import the pool, correct. > > Even better, a few years ago before I enabled AHCI on my machine, the > drives were named adX. After I started using AHCI, the drives became adaY > and it still booted fine. Yes, that's right. As an example, yesterday I used the gnop kludge to have an SSD recognized as a 4K-sector drive. After a reboot, ZFS was able to locate the device even though the named gnop device had disappeared. However, remember that Murphy's field is enormously intense around anything that holds data, especially if that data is important. Yes, it works, but it's better not to rely too much on error recovery mechanisms. And there is at least one situation in which the dynamic renumbering causes trouble (failure + reboot), which is not so rare on high-uptime machines with many disks. Borja.
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 09:12:38 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 65F43C9B for ; Thu, 24 Jan 2013 09:12:38 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from proxypop03b.sare.net (proxypop03b.sare.net [194.30.0.251]) by mx1.freebsd.org (Postfix) with ESMTP id 2ABA4108 for ; Thu, 24 Jan 2013 09:12:37 +0000 (UTC) Received: from [172.16.2.2] (izaro.sarenet.es [192.148.167.11]) by proxypop03.sare.net (Postfix) with ESMTPSA id 33E3C9E084E; Thu, 24 Jan 2013 10:12:08 +0100 (CET) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD Mime-Version: 1.0 (Apple Message framework v1085) Content-Type: text/plain; charset=us-ascii From: Borja Marcos In-Reply-To: <565CB55B-9A75-47F4-A88B-18FA8556E6A2@samsco.org> Date: Thu, 24 Jan 2013 10:12:29 +0100 Content-Transfer-Encoding: quoted-printable Message-Id: <19EED306-9AA8-4DFE-8164-331C1DAD28CC@sarenet.es> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <565CB55B-9A75-47F4-A88B-18FA8556E6A2@samsco.org> To: Scott Long X-Mailer: Apple Mail (2.1085) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 09:12:38 -0000 On Jan 22, 2013, at 3:33 PM, Scott Long wrote: > Look up SCSI device wiring in /sys/conf/NOTES. That's one solution to static naming, just with a slightly different angle than Solaris. I do agree with your general thesis here, and either wiring should be made a much more visible and documented feature, or a new mechanism should be developed to provide naming stability. Please let me know what you think of the wiring mechanic. The mechanism used in Solaris has, in my opinion, two benefits: it is used by default, which is important.
It means less troublesome installations, fewer time bombs lurking. The second important benefit is that, especially with many disks, it's easier (at least for me) to think in terms of controllers and disks, rather than "disk number 47". But well, it can be different for many people. Of course, a big advantage for Solaris was Sun hardware, at least in the golden years, where everything was well predictable. PCs are chaos, and Intel-based servers have inherited the worst of the PC chaos. But a good mechanism, and, I think, one working by default, is badly needed. And I would advocate a more "Solaris-like" approach. Borja. From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 09:46:26 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id D5DEC5E9; Thu, 24 Jan 2013 09:46:26 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from proxypop04.sare.net (proxypop04.sare.net [194.30.0.65]) by mx1.freebsd.org (Postfix) with ESMTP id 5F0EA83F; Thu, 24 Jan 2013 09:46:26 +0000 (UTC) Received: from [172.16.2.2] (izaro.sarenet.es [192.148.167.11]) by proxypop04.sare.net (Postfix) with ESMTPSA id 0714F9DEB8D; Thu, 24 Jan 2013 10:46:13 +0100 (CET) From: Borja Marcos Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Subject: Problem adding SCSI quirks for a SSD, 4K sector and ZFS Date: Thu, 24 Jan 2013 10:46:23 +0100 Message-Id: <492280E6-E3EE-4540-92CE-C535C8943CCF@sarenet.es> To: freebsd-scsi@freebsd.org Mime-Version: 1.0 (Apple Message framework v1085) X-Mailer: Apple Mail (2.1085) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 09:46:26 -0000 Hello, Crossposting to FreeBSD-fs, as I am wondering if I have had a problem
with ZFS and sector size detection as well. I am doing tests with an OCZ Vertex 4 connected to a SAS backplane.

< OCZ-VERTEX4 1.5> at scbus6 target 22 lun 0 (pass19,da15)

(The blank before "OCZ" really appears there)

pass19: < OCZ-VERTEX4 1.5> Fixed Direct Access SCSI-5 device
pass19: Serial Number OCZ-1SVG6KZ2YRMSS8E1
pass19: 3.300MB/s transfers

I am bypassing an "aac" RAID card so that the disks are directly attached to the da driver, instead of relying on the so-called JBOD feature. I have had a weird problem, with the disk being unresponsive to the READ CAPACITY(16) command. Weird; it seems to time out. So, just to complete the tests, I have added a quirk to scsi_da.c. Anyway, I also need the disk to be recognized as a 4K-sector drive. I created a new quirk, called it DA_Q_NO_RC16, and added an entry to the quirk table, so that these drives are recognized as 4K drives and the driver doesn't try to send an RC(16) command.

diff scsi_da.c.orig scsi_da.c
93c93,94
< 	DA_Q_4K = 0x08
---
> 	DA_Q_4K = 0x08,
> 	DA_Q_NO_RC16 = 0x10
811a813,817
> 	/* OCZ Vertex 4 firmware 1.5 */
> 	{ T_DIRECT, SIP_MEDIA_FIXED, "", "OCZ-VERTEX4", "*" },
> 	/*quirks*/DA_Q_NO_RC16 | DA_Q_4K
> },
> {
1635,1636c1641,1646
< 	/* Predict whether device may support READ CAPACITY(16). */
< 	if (SID_ANSI_REV(&cgd->inq_data) >= SCSI_REV_SPC3) {
---
> 	/*
> 	 * Predict whether device may support READ CAPACITY(16).
> 	 * BUT Some disks don't support RC(16) even though they should.
> 	 */
> 	if ((SID_ANSI_REV(&cgd->inq_data) >= SCSI_REV_SPC3)
> 	    && !(softc->quirks & DA_Q_NO_RC16) ) {

I think it's working. I haven't seen any more RC(16) errors, and the disk is working fine. Anyway, I am not sure I've done it right.
After adding the 4K quirk and rebooting, GEOM_PART complained that the partitions weren't aligned to 4K:

/var/log/messages.0:Jan 23 16:01:30 kernel: GEOM_PART: partition 1 is not aligned on 4096 bytes
/var/log/messages.0:Jan 23 16:01:30 kernel: GEOM_PART: partition 2 is not aligned on 4096 bytes

So it seems it works. However, when using the disk for ZFS, it still detects a 512 byte sector size, which is odd.

Jan 23 16:01:30 rasputin kernel: GEOM: new disk da15
Jan 23 16:01:30 rasputin kernel: da15 at aacp0 bus 0 scbus6 target 22 lun 0
Jan 23 16:01:30 rasputin kernel: da15: < OCZ-VERTEX4 1.5> Fixed Direct Access SCSI-5 device
Jan 23 16:01:30 rasputin kernel: da15: Serial Number OCZ-1SVG6KZ2YRMSS8E1
Jan 23 16:01:30 rasputin kernel: da15: 3.300MB/s transfers
Jan 23 16:01:30 rasputin kernel: da15: 488386MB (1000215216 512 byte sectors: 255H 63S/T 62260C)

diskinfo is returning a sector size of 512 bytes and a stripesize of 4096. Is this correct? ZFS is still detecting it as a 512 byte sector disk.

/dev/da15
	512             	# sectorsize
	512110190592    	# mediasize in bytes (477G)
	1000215216      	# mediasize in sectors
	4096            	# stripesize
	0               	# stripeoffset
	62260           	# Cylinders according to firmware.
	255             	# Heads according to firmware.
	63              	# Sectors according to firmware.
	OCZ-1SVG6KZ2YRMSS8E1	# Disk ident.

So, to summarize: If the quirk was working, should diskinfo return a sector size of 512 bytes, or is it correct to show a "stripesize" of 4096? Do we have a bug either in ZFS or in the disk drivers? The same experiment on another system (both are 9.1-RELEASE) and a similar drive attached to a SATA controller, also adding a 4K sector quirk for it, defines a stripe size instead of a sector size. Thanks, Borja.
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 11:04:03 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id AF545886 for ; Thu, 24 Jan 2013 11:04:03 +0000 (UTC) (envelope-from araujobsdport@gmail.com) Received: from mail-wi0-x229.google.com (mail-wi0-x229.google.com [IPv6:2a00:1450:400c:c05::229]) by mx1.freebsd.org (Postfix) with ESMTP id 5241EE98 for ; Thu, 24 Jan 2013 11:04:03 +0000 (UTC) Received: by mail-wi0-f169.google.com with SMTP id hq12so324673wib.0 for ; Thu, 24 Jan 2013 03:04:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:reply-to:date:message-id:subject:from:to :content-type; bh=H9BpRdzVQLMc/kPoinC4UXOmoe+bQPx4xruKkrf2IxA=; b=W3T6pxYn4Ej0l4pt5s5zakrkAISao2Y/a57alQ8WW0Pd3kbaPt6FuOjXnONF2tNMm2 8Kim6x4a8UjtWu47SqtgcYf3ZGoVbSXPd2zGOyJDY3Yq+9+xk0m84Xgwn3U4RdVMZQkJ 67IkDDR5D/FtAAGDS9b71GWlpCGbUJM/Yiq3ArCBrv5P28REv7UN77L0PTqBbUYBNjjD x7NEt3U/uMPUecaDeg7jIrsEYQJ5cbmkpaCbSZzzyYnuIMzMKuI3chO8Opc8MwxZiklh eGBQGbc+TjYTxmmoYcrLdDImpjSRMECq/SA6BEyS2kMDbYXb6PoF4uO3KrHQUBNyEV8/ 8DOQ== MIME-Version: 1.0 X-Received: by 10.194.240.233 with SMTP id wd9mr2117585wjc.54.1359025441876; Thu, 24 Jan 2013 03:04:01 -0800 (PST) Received: by 10.180.145.44 with HTTP; Thu, 24 Jan 2013 03:04:01 -0800 (PST) Date: Thu, 24 Jan 2013 19:04:01 +0800 Message-ID: Subject: gmirror doubt. From: Marcelo Araujo To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: araujo@FreeBSD.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 11:04:03 -0000 Hello Guys, I'm wondering if it is possible to load gmirror without activating the mirrors.
As an example: I have several mirrors using gmirror. When I load gmirror, all mirrors that are activated are loaded. What I'd like to do is totally the opposite, something like: load gmirror, and then activate every mirror myself. Could anyone with more experience with geom give me a clue about it? Best Regards, -- Marcelo Araujo araujo@FreeBSD.org From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 11:19:24 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 4D869D1D; Thu, 24 Jan 2013 11:19:24 +0000 (UTC) (envelope-from prvs=1736dd70aa=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id 5F530F73; Thu, 24 Jan 2013 11:19:23 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001833957.msg; Thu, 24 Jan 2013 11:19:20 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Thu, 24 Jan 2013 11:19:20 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1736dd70aa=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: From: "Steven Hartland" To: "Borja Marcos" , References: <492280E6-E3EE-4540-92CE-C535C8943CCF@sarenet.es> Subject: Re: Problem adding SCSI quirks for a SSD, 4K sector and ZFS Date: Thu, 24 Jan 2013 11:19:52 -0000 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_0014_01CDFA24.BAC45C90" X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan
2013 11:19:24 -0000 This is a multi-part message in MIME format. ------=_NextPart_000_0014_01CDFA24.BAC45C90 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit ----- Original Message ----- From: "Borja Marcos" To: Cc: "FreeBSD Filesystems" Sent: Thursday, January 24, 2013 9:46 AM Subject: Problem adding SCSI quirks for a SSD, 4K sector and ZFS > > Hello, > > Crossposting to FreeBSD-fs, as I am wondering if I have had a problem with ZFS and sector size detection as well. > > I am doing tests with an OCZ Vertex 4 connected to a SAS backplane. > > < OCZ-VERTEX4 1.5> at scbus6 target 22 lun 0 (pass19,da15) > > (The blank before "OCZ" really appears there) > > pass19: < OCZ-VERTEX4 1.5> Fixed Direct Access SCSI-5 device > pass19: Serial Number OCZ-1SVG6KZ2YRMSS8E1 > pass19: 3.300MB/s transfers > > I am bypassing an "aac" RAID card so that the disks are directly attached to the da driver, instead of relying on the so-called > JBOD feature. > > I have had a weird problem, with the disk being unresponsive to the REQUEST CAPACITY(16) command. Weird, seems it timeouts. > > So, just to complete the tests, I have added a quirk to scsi_da.c. Anyway, I also need the disk to be recognized as a 4K sector > drive. > > I created a new quirk, called it DA_Q_NO_RC16, and added an entry to the quirk table, so that these drives are recognized as 4K > drives and the driver doesn't try to send a RC(16) command. > > diff scsi_da.c.orig scsi_da.c > 93c93,94 > < DA_Q_4K = 0x08 > --- >> DA_Q_4K = 0x08, >> DA_Q_NO_RC16 = 0x10 > 811a813,817 >> /* OCZ Vertex 4 firmware 1.5 */ >> { T_DIRECT, SIP_MEDIA_FIXED, "", "OCZ-VERTEX4", "*" }, >> /*quirks*/DA_Q_NO_RC16 | DA_Q_4K >> }, >> { > 1635,1636c1641,1646 > < /* Predict whether device may support READ CAPACITY(16). */ > < if (SID_ANSI_REV(&cgd->inq_data) >= SCSI_REV_SPC3) { > --- >> /* >> * Predict whether device may support READ CAPACITY(16). 
>> * BUT Some disks don't support RC(16) even though they should. >> */ >> if ((SID_ANSI_REV(&cgd->inq_data) >= SCSI_REV_SPC3) >> && !(softc->quirks & DA_Q_NO_RC16) ) { > > > > I think it's working. I haven't seen any more RC(16) errors, and the disk is working fine. Anyway I am not sure I've done it > right. After adding the 4K quirk and rebooting, GEOM_PART complained that the partitions weren't aligned to 4K > > /var/log/messages.0:Jan 23 16:01:30 kernel: GEOM_PART: partition 1 is not aligned on 4096 bytes > /var/log/messages.0:Jan 23 16:01:30 kernel: GEOM_PART: partition 2 is not aligned on 4096 bytes > > So it seems it works. However, when using the disk for ZFS, it still detects a 512 byte sector size, which is odd. > > Jan 23 16:01:30 rasputin kernel: GEOM: new disk da15 > Jan 23 16:01:30 rasputin kernel: da15 at aacp0 bus 0 scbus6 target 22 lun 0 > Jan 23 16:01:30 rasputin kernel: da15: < OCZ-VERTEX4 1.5> Fixed Direct Access SCSI-5 device > Jan 23 16:01:30 rasputin kernel: da15: Serial Number OCZ-1SVG6KZ2YRMSS8E1 > Jan 23 16:01:30 rasputin kernel: da15: 3.300MB/s transfers > Jan 23 16:01:30 rasputin kernel: da15: 488386MB (1000215216 512 byte sectors: 255H 63S/T 62260C) > > > diskinfo is returning a sector size of 512 bytes, and a stripesize of 4096. Is this correct? ZFS is still detecting it as a 512 > byte sector disk. > > /dev/da15 > 512 # sectorsize > 512110190592 # mediasize in bytes (477G) > 1000215216 # mediasize in sectors > 4096 # stripesize > 0 # stripeoffset > 62260 # Cylinders according to firmware. > 255 # Heads according to firmware. > 63 # Sectors according to firmware. > OCZ-1SVG6KZ2YRMSS8E1 # Disk ident. > > > > So, to summarize: > > If the quirk was working, should diskinfo return a sector size of 512 bytes, or is it correct to show a "stripesize" of 4096? > > Do we have a bug either on ZFS or the disk drivers? 
The same experiment on another system (both are 9.1-RELEASE) and a similar > drive attached to a SATA controller, also adding a 4K sector quirk for it, defines a stripe size instead of a sector size. Simple answer: ZFS doesn't understand quirks. The attached patch does what you're looking for, along with a few other things; see the notes at the top for details. It's not a final version, as there's still some discussion about implementation details, but it should do what you're looking for. Regards Steve ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. ------=_NextPart_000_0014_01CDFA24.BAC45C90 Content-Type: application/octet-stream; name="zzz-zfs-ashift-fix.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="zzz-zfs-ashift-fix.patch"

Changes zfs zpool initial / desired ashift to be based off stripesize
instead of sectorsize making it compatible with drives marked with
the 4k sector size quirk.

Without the correct min block size BIO_DELETE requests passed to
a large number of current SSD's via TRIM don't actually perform
any LBA TRIM so its vital for the correct operation of TRIM to get
the correct min block size.

To do this we added the additional dashift (desired ashift) to
vdev_open_func_t calls.
This was needed as just updating ashift to be based off stripesize would mean that a device's reported minimum transfer size (ashift) could increase, and that in turn would cause member devices to become unusable and hence break pools with error ZFS-8000-5E.

The global minimum ashift used for new zpools can now also be tuned using the vfs.zfs.min_create_ashift sysctl. This defaults to 12 (4096-byte blocks) in order to optimise for newer disks which are migrating from 512 to 4096-byte sectors.

The value of vfs.zfs.min_create_ashift is limited to a minimum of SPA_MINBLOCKSHIFT (9) and a maximum of SPA_MAXBLOCKSHIFT (17).

--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c.orig	2011-06-06 09:36:46.000000000 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c	2012-11-02 14:47:55.293668071 +0000
@@ -32,6 +32,8 @@
 #include
 #include

+extern int zfs_min_ashift;
+
 /*
  * Virtual device vector for disks.
  */
@@ -103,7 +105,7 @@
 }

 static int
-vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
+vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift)
 {
	spa_t *spa = vd->vdev_spa;
	vdev_disk_t *dvd;
@@ -284,7 +286,7 @@
 }

	/*
-	 * Determine the device's minimum transfer size.
+	 * Determine the device's minimum and desired transfer size.
	 * If the ioctl isn't supported, assume DEV_BSIZE.
	 */
	if (ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFOEXT, (intptr_t)&dkmext,
@@ -292,6 +294,7 @@
		dkmext.dki_pbsize = DEV_BSIZE;

	*ashift = highbit(MAX(dkmext.dki_pbsize, SPA_MINBLOCKSIZE)) - 1;
+	*dashift = highbit(MAX(dkmext.dki_pbsize, (1ULL << zfs_min_ashift))) - 1;

	/*
	 * Clear the nowritecache bit, so that on a vdev_reopen() we will
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c.orig	2012-01-05 22:31:25.000000000 +0000
+++
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c	2012-11-02 14:47:38.252107541 +0000
@@ -30,6 +30,8 @@
 #include
 #include

+extern int zfs_min_ashift;
+
 /*
  * Virtual device vector for files.
  */
@@ -47,7 +49,7 @@
 }

 static int
-vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
+vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift)
 {
	vdev_file_t *vf;
	vnode_t *vp;
@@ -127,6 +129,7 @@

	*psize = vattr.va_size;
	*ashift = SPA_MINBLOCKSHIFT;
+	*dashift = zfs_min_ashift;

	return (0);
 }
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c.orig	2012-11-02 12:20:15.918986181 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	2012-11-02 14:47:48.135273692 +0000
@@ -36,6 +36,8 @@
 #include
 #include

+extern int zfs_min_ashift;
+
 /*
  * Virtual device vector for GEOM.
  */
@@ -408,7 +410,7 @@
 }

 static int
-vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
+vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift)
 {
	struct g_provider *pp;
	struct g_consumer *cp;
@@ -494,9 +496,10 @@
	*psize = pp->mediasize;

	/*
-	 * Determine the device's minimum transfer size.
+	 * Determine the device's minimum and desired transfer size.
	 */
	*ashift = highbit(MAX(pp->sectorsize, SPA_MINBLOCKSIZE)) - 1;
+	*dashift = highbit(MAX(pp->stripesize, (1ULL << zfs_min_ashift))) - 1;

	/*
	 * Clear the nowritecache settings, so that on a vdev_reopen()
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c.orig	2012-07-03 11:49:22.342245151 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c	2012-07-03 11:58:02.161948585 +0000
@@ -127,7 +127,7 @@
 }

 static int
-vdev_mirror_open(vdev_t *vd,
uint64_t *asize, uint64_t *ashift)
+vdev_mirror_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t *dashift)
 {
	int numerrors = 0;
	int lasterror = 0;
@@ -150,6 +150,7 @@

	*asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
	*ashift = MAX(*ashift, cvd->vdev_ashift);
+	*dashift = MAX(*dashift, cvd->vdev_dashift);
 }

 if (numerrors == vd->vdev_children) {
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c.orig	2012-07-03 11:49:10.545275865 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c	2012-07-03 11:58:07.670470640 +0000
@@ -40,7 +40,7 @@

 /* ARGSUSED */
 static int
-vdev_missing_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
+vdev_missing_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift)
 {
	/*
	 * Really this should just fail. But then the root vdev will be in the
@@ -50,6 +50,7 @@
	 */
	*psize = 0;
	*ashift = 0;
+	*dashift = 0;
	return (0);
 }

--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c.orig	2012-07-03 11:49:03.675875505 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c	2012-07-03 11:58:15.334806334 +0000
@@ -1447,7 +1447,7 @@
 }

 static int
-vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)
+vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t *dashift)
 {
	vdev_t *cvd;
	uint64_t nparity = vd->vdev_nparity;
@@ -1476,6 +1476,7 @@

	*asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
	*ashift = MAX(*ashift, cvd->vdev_ashift);
+	*dashift = MAX(*dashift, cvd->vdev_dashift);
 }

 *asize *= vd->vdev_children;
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c.orig	2012-07-03 11:49:27.901760380 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c	2012-07-03
11:58:19.704427068 +0000
@@ -50,7 +50,7 @@
 }

 static int
-vdev_root_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)
+vdev_root_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t *dashift)
 {
	int lasterror = 0;
	int numerrors = 0;
@@ -78,6 +78,7 @@

	*asize = 0;
	*ashift = 0;
+	*dashift = 0;

	return (0);
 }
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c.orig	2012-10-22 20:41:50.234005351 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c	2012-10-22 20:42:16.355805894 +0000
@@ -1125,6 +1125,7 @@
	uint64_t osize = 0;
	uint64_t asize, psize;
	uint64_t ashift = 0;
+	uint64_t dashift = 0;

	ASSERT(vd->vdev_open_thread == curthread ||
	    spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
@@ -1154,7 +1155,7 @@
		return (ENXIO);
	}

-	error = vd->vdev_ops->vdev_op_open(vd, &osize, &ashift);
+	error = vd->vdev_ops->vdev_op_open(vd, &osize, &ashift, &dashift);

	/*
	 * Reset the vdev_reopening flag so that we actually close
@@ -1255,14 +1256,16 @@
	 */
	vd->vdev_asize = asize;
	vd->vdev_ashift = MAX(ashift, vd->vdev_ashift);
+	vd->vdev_dashift = MAX(dashift, vd->vdev_dashift);
 } else {
	/*
	 * Make sure the alignment requirement hasn't increased.
	 */
	if (ashift > vd->vdev_top->vdev_ashift) {
+		printf("ZFS ashift open failure of %s (%ld > %ld)\n", vd->vdev_path, ashift, vd->vdev_top->vdev_ashift);
		vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
		    VDEV_AUX_BAD_LABEL);
		return (EINVAL);
	}
 }

--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c.orig	2012-11-05 15:27:52.092194343 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c	2012-11-05 15:53:26.449021023 +0000
@@ -145,9 +145,12 @@
 #include

 static boolean_t vdev_trim_on_init = B_TRUE;
+static boolean_t vdev_dashift_enable = B_TRUE;
 SYSCTL_DECL(_vfs_zfs_vdev);
 SYSCTL_INT(_vfs_zfs_vdev, OID_AUTO, trim_on_init, CTLFLAG_RW,
     &vdev_trim_on_init, 0, "Enable/disable full vdev trim on initialisation");
+SYSCTL_INT(_vfs_zfs_vdev, OID_AUTO, optimal_ashift, CTLFLAG_RW,
+    &vdev_dashift_enable, 0, "Enable/disable optimal ashift usage on initialisation");

 /*
  * Basic routines to read and write from a vdev label.
@@ -282,6 +285,16 @@
	    vd->vdev_ms_array) == 0);
	VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_SHIFT,
	    vd->vdev_ms_shift) == 0);
+	/*
+	 * We use the max of ashift and dashift (the desired/optimal
+	 * ashift), which is typically the stripesize of a device, to
+	 * ensure we get the best performance from underlying devices.
+	 *
+	 * Its done here as it should only ever have an effect on new
+	 * zpool creation.
+	 */
+	if (vdev_dashift_enable)
+		vd->vdev_ashift = MAX(vd->vdev_ashift, vd->vdev_dashift);
	VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASHIFT,
	    vd->vdev_ashift) == 0);
	VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASIZE,
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h.orig	2012-10-22 20:40:08.361577293 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h	2012-10-22 21:02:52.447781800 +0000
@@ -55,7 +55,7 @@
 /*
  * Virtual device operations
  */
-typedef int vdev_open_func_t(vdev_t *vd, uint64_t *size, uint64_t *ashift);
+typedef int vdev_open_func_t(vdev_t *vd, uint64_t *size, uint64_t *ashift, uint64_t *dashift);
 typedef void vdev_close_func_t(vdev_t *vd);
 typedef uint64_t vdev_asize_func_t(vdev_t *vd, uint64_t psize);
 typedef int vdev_io_start_func_t(zio_t *zio);
@@ -119,6 +119,7 @@
	uint64_t vdev_asize;	/* allocatable device capacity */
	uint64_t vdev_min_asize;	/* min acceptable asize */
	uint64_t vdev_ashift;	/* block alignment
shift */
+	uint64_t vdev_dashift;	/* desired blk alignment shift */
	uint64_t vdev_state;	/* see VDEV_STATE_* #defines */
	uint64_t vdev_prevstate;	/* used when reopening a vdev */
	vdev_ops_t *vdev_ops;	/* vdev operations */
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c.orig	2012-11-02 14:56:29.474248887 +0000
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c	2012-11-03 01:27:28.066912403 +0000
@@ -41,6 +41,30 @@
 #include
 #include

+#define ZFS_MIN_ASHIFT SPA_MINBLOCKSHIFT
+/*
+ * Max ashift - limited by how labels are accessed by zio_read_phys using offsets
+ * within vdev_label_t
+ *
+ * If label access is fixed to work with ashift properly then the max should be
+ * set to SPA_MAXBLOCKSHIFT
+ */
+#define ZFS_MAX_ASHIFT 13
+/*
+ * Optimum ashift - defaults to 12 which results in a min block size of 4096 as
+ * this is the optimum value for newer disks which are migrating from 512 to 4096
+ * byte sectors
+ */
+#define ZFS_OPTIMUM_ASHIFT 12
+
+/*
+ * Minimum ashift used when creating new pools
+ *
+ * This can be tuned using the sysctl vfs.zfs.min_create_ashift but is limited
+ * to a min of ZFS_MIN_ASHIFT and a max of ZFS_MAX_ASHIFT
+ */
+int zfs_min_ashift = MAX(SPA_MINBLOCKSHIFT, ZFS_OPTIMUM_ASHIFT);
 int zfs_no_write_throttle = 0;
 int zfs_write_limit_shift = 3;	/* 1/8th of physical memory */
 int zfs_txg_synctime_ms = 1000;	/* target millisecs to sync a txg */
@@ -54,6 +78,9 @@

 static pgcnt_t old_physmem = 0;

+#ifdef _KERNEL
+static int min_ashift_sysctl(SYSCTL_HANDLER_ARGS);
+
 SYSCTL_DECL(_vfs_zfs);
 TUNABLE_INT("vfs.zfs.no_write_throttle", &zfs_no_write_throttle);
 SYSCTL_INT(_vfs_zfs, OID_AUTO, no_write_throttle, CTLFLAG_RDTUN,
@@ -78,6 +105,32 @@
 TUNABLE_QUAD("vfs.zfs.write_limit_override",
&zfs_write_limit_override);
 SYSCTL_QUAD(_vfs_zfs, OID_AUTO, write_limit_override, CTLFLAG_RDTUN,
     &zfs_write_limit_override, 0, "");
+SYSCTL_PROC(_vfs_zfs, OID_AUTO, min_create_ashift, CTLTYPE_INT | CTLFLAG_RW,
+    &zfs_min_ashift, 0, min_ashift_sysctl, "I",
+    "Minimum ashift used when creating new pools");
+
+static int
+min_ashift_sysctl(SYSCTL_HANDLER_ARGS)
+{
+	int error, value;
+
+	value = *(int *)arg1;
+
+	error = sysctl_handle_int(oidp, &value, 0, req);
+
+	if ((error != 0) || (req->newptr == NULL))
+		return (error);
+
+	if (value < ZFS_MIN_ASHIFT)
+		value = ZFS_MIN_ASHIFT;
+	else if (value > ZFS_MAX_ASHIFT)
+		value = ZFS_MAX_ASHIFT;
+
+	*(int *)arg1 = value;
+
+	return (0);
+}
+#endif

 int
 dsl_pool_open_special_dir(dsl_pool_t *dp, const char *name, dsl_dir_t **ddp)
------=_NextPart_000_0014_01CDFA24.BAC45C90--

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 12:40:36 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 8C844380; Thu, 24 Jan 2013 12:40:36 +0000 (UTC) (envelope-from universite@ukr.net) Received: from ffe16.ukr.net (ffe16.ukr.net [195.214.192.51]) by mx1.freebsd.org (Postfix) with ESMTP id 36F543CE; Thu, 24 Jan 2013 12:40:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net; s=ffe; h=Date:Message-Id:From:To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version; bh=Fi9RGfPkJ88vQumCDEouStYkub4SO0ftgDNuWGVx07w=; b=dXuVSCHZDnmaJIoyukwr7yWEO/XGeJDcmKT37oc6LZ2eKtHagKs87RpZxqHZh29HSBNe/3+IFzqVX0tRY8NKR/9xEiCmS2vEAajMrlqDpPFN+6K9q0sBKX372NQDxw28bNoSoZ5LmIm82M5GsFVwf0XalKiRKk5/JADG0GZgBgc=; Received: from mail by ffe16.ukr.net with local ID 1TyLmM-0003XF-5m ; Thu, 24 Jan 2013 14:19:38 +0200 MIME-Version: 1.0 Content-Disposition: inline Content-Transfer-Encoding: binary
Content-Type: text/plain; charset="windows-1251" Subject: AHCI timeout when using ZFS + AIO + NCQ To: fs@freebsd.org From: "Vladislav Prodan" X-Mailer: freemail.ukr.net 4.0 Message-Id: <13391.1359029978.3957795939058384896@ffe16.ukr.net> X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0 Date: Thu, 24 Jan 2013 14:19:38 +0200 Cc: current@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 12:40:36 -0000

I have the server:

FreeBSD 9.1-PRERELEASE #0: Wed Jul 25 01:40:56 EEST 2012

Jan 24 12:53:01 vesuvius kernel: atapci0: port 0xc040-0xc047,0xc030-0xc033,0xc020-0xc027,0xc010-0xc013,0xc000-0xc00f mem 0xfe210000-0xfe2101ff irq 51 at device 0.0 on pci3
...
Jan 24 12:53:01 vesuvius kernel: ahci0: port 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfe307000-0xfe3073ff irq 19 at device 17.0 on pci0
Jan 24 12:53:01 vesuvius kernel: ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported
...
Jan 24 12:53:01 vesuvius kernel: ada2 at ahcich2 bus 0 scbus4 target 0 lun 0
Jan 24 12:53:01 vesuvius kernel: ada2: ATA-8 SATA 3.x device
Jan 24 12:53:01 vesuvius kernel: ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Jan 24 12:53:01 vesuvius kernel: ada2: Command Queueing enabled
Jan 24 12:53:01 vesuvius kernel: ada2: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
Jan 24 12:53:01 vesuvius kernel: ada2: Previously was known as ad12
...

I use 4 HDDs in RAID10 via ZFS.

At very irregular intervals, HDDs drop off the bus. As a result, the server stops.

Jan 24 06:48:06 vesuvius kernel: ahcich2: Timeout on slot 6 port 0
Jan 24 06:48:06 vesuvius kernel: ahcich2: is 00000000 cs 00000000 ss 000000c0 rs 000000c0 tfd 40 serr 00000000 cmd 0000e817
Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED.
ACB: 60 00 4c 4e 1e 40 68 00 00 01 00 00 Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): Retrying command Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080) Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0 Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817 Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080) Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0 Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817 Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. 
ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout
Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked
Jan 24 06:51:11 vesuvius kernel: swap_pager: I/O error - pagein failed; blkno 4227133,size 8192, error 6
Jan 24 06:51:11 vesuvius kernel: (ada2:(pass2:vm_fault: pager read error, pid 1943 (named)
Jan 24 06:51:11 vesuvius kernel: ahcich2:0:ahcich2:0:0:0:0): lost device
Jan 24 06:51:11 vesuvius kernel: 0): passdevgonecb: devfs entry is gone
Jan 24 06:51:11 vesuvius kernel: pid 1943 (named), uid 53: exited on signal 11
...

Only a restart by pressing the Power button helps.
Judging by SMART, the HDDs have no problems. The SATA data cable has been changed.

I found a similar problem:

http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
PR: amd64/165547: NVIDIA MCP67 AHCI SATA controller timeout

--
Vladislav V. Prodan
System & Network Administrator
http://support.od.ua
+380 67 4584408, +380 99 4060508
VVP88-RIPE

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 12:49:59 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 7E2FC7EB; Thu, 24 Jan 2013 12:49:59 +0000 (UTC) (envelope-from prvs=1736dd70aa=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id F3C9166C; Thu, 24 Jan 2013 12:49:58 +0000 (UTC) Received: from r2d2 ([188.220.16.49]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50001835522.msg; Thu, 24 Jan 2013 12:49:57 +0000 X-Spam-Processed: mail1.multiplay.co.uk, Thu, 24 Jan 2013 12:49:57 +0000 (not processed: message from valid local sender) X-MDRemoteIP: 188.220.16.49 X-Return-Path: prvs=1736dd70aa=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID:
<221B307551154F489452F89E304CA5F7@multiplay.co.uk> From: "Steven Hartland" To: , "Vladislav Prodan" References: <13391.1359029978.3957795939058384896@ffe16.ukr.net> Subject: Re: AHCI timeout when using ZFS + AIO + NCQ Date: Thu, 24 Jan 2013 12:50:30 -0000 MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: current@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 12:49:59 -0000

Is it always the same disk? If so, replace it. SMART helps identify issues, but it doesn't tell you with 100% certainty that there's no problem.

----- Original Message ----- From: "Vladislav Prodan" To: Cc: Sent: Thursday, January 24, 2013 12:19 PM Subject: AHCI timeout when using ZFS + AIO + NCQ >I have the server: > > FreeBSD 9.1-PRERELEASE #0: Wed Jul 25 01:40:56 EEST 2012 > > Jan 24 12:53:01 vesuvius kernel: atapci0: port > 0xc040-0xc047,0xc030-0xc033,0xc020-0xc027,0xc010-0xc013,0xc000-0xc00f mem 0xfe210000-0xfe2101ff irq 51 at device 0.0 on pci3 > ... > Jan 24 12:53:01 vesuvius kernel: ahci0: port > 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfe307000-0xfe3073ff irq 19 at device 17.0 on pci0 > Jan 24 12:53:01 vesuvius kernel: ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported > ...
> Jan 24 12:53:01 vesuvius kernel: ada2 at ahcich2 bus 0 scbus4 target 0 lun 0 > Jan 24 12:53:01 vesuvius kernel: ada2: ATA-8 SATA 3.x device > Jan 24 12:53:01 vesuvius kernel: ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) > Jan 24 12:53:01 vesuvius kernel: ada2: Command Queueing enabled > Jan 24 12:53:01 vesuvius kernel: ada2: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C) > Jan 24 12:53:01 vesuvius kernel: ada2: Previously was known as ad12 > ... > I use 4 HDD in RAID10 via ZFS. > > With a very irregular intervals fall off HDD drives. As a result, the server stops. > > Jan 24 06:48:06 vesuvius kernel: ahcich2: Timeout on slot 6 port 0 > Jan 24 06:48:06 vesuvius kernel: ahcich2: is 00000000 cs 00000000 ss 000000c0 rs 000000c0 tfd 40 serr 00000000 cmd 0000e817 > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 4c 4e 1e 40 68 00 00 01 00 00 > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): Retrying command > Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080) > Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0 > Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817 > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. 
ACB: ec 00 00 00 00 40 00 00 00 00 00 00 > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 > Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080) > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192 > Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0 > Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817 > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked > Jan 24 06:51:11 vesuvius kernel: swap_pager: I/O error - pagein failed; blkno 4227133,size 8192, error 6 > Jan 24 06:51:11 vesuvius kernel: (ada2:(pass2:vm_fault: pager read error, pid 1943 (named) > Jan 24 06:51:11 vesuvius kernel: ahcich2:0:ahcich2:0:0:0:0): lost device > Jan 24 06:51:11 vesuvius kernel: 0): passdevgonecb: devfs entry is gone > Jan 24 06:51:11 vesuvius kernel: pid 1943 (named), uid 53: exited on signal 11 > ... > > Helps only restart by pressing Power. > Judging by the state of SMART, HDD have no problems. SATA data cable changed. > > > I found a similar problem: > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html > PR: amd64/165547: NVIDIA MCP67 AHCI SATA controller timeout > > -- > Vladislav V. 
Prodan > System & Network Administrator > http://support.od.ua > +380 67 4584408, +380 99 4060508 > VVP88-RIPE > _______________________________________________ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org" > From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 13:13:04 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id CB9D6CC2; Thu, 24 Jan 2013 13:13:04 +0000 (UTC) (envelope-from nowakpl@platinum.linux.pl) Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4]) by mx1.freebsd.org (Postfix) with ESMTP id 7FA9E76C; Thu, 24 Jan 2013 13:13:04 +0000 (UTC) Received: by platinum.linux.pl (Postfix, from userid 87) id AA1DD47E11; Thu, 24 Jan 2013 14:12:56 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl X-Spam-Level: X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.3.2 Received: from [10.255.0.2] (unknown [83.151.38.73]) by platinum.linux.pl (Postfix) with ESMTPA id 08DB247DE6; Thu, 24 Jan 2013 14:12:56 +0100 (CET) Message-ID: <51013345.8010701@platinum.linux.pl> Date: Thu, 24 Jan 2013 14:12:37 +0100 From: Adam Nowacki User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject:
Re: ZFS regimen: scrub, scrub, scrub and scrub again. References: <20130122073641.GH30633@server.rulingia.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 13:13:04 -0000

On 2013-01-23 21:22, Wojciech Puchar wrote:
>>> While RAID-Z is already a king of bad performance,
>>
>> I don't believe RAID-Z is any worse than RAID5. Do you have any actual
>> measurements to back up your claim?
>
> it is clearly described even in ZFS papers. Both on reads and writes it
> gives single drive random I/O performance.

With ZFS and RAID-Z the situation is a bit more complex. Let's assume a 5-disk raidz1 vdev with ashift=9 (512-byte sectors).

A worst-case scenario could happen if your random i/o workload was reading random files each of 2048 bytes. Each file read would require data from 4 disks (the 5th is parity and won't be read unless there are errors). However, if files were 512 bytes or less, then only one disk would be used; 1024 bytes - two disks, etc.

So ZFS is probably not the best choice to store millions of small files if random access to whole files is the primary concern.

But let's look at a different scenario - a PostgreSQL database. Here table data is split and stored in 1GB files. ZFS splits each file into 128KiB records (the recordsize property). Each record is then split again into 4 columns of 32768 bytes; a 5th column is generated containing parity. Each column is then stored on a different disk. You could think of it as a regular RAID-5 with a stripe size of 32768 bytes.

PostgreSQL uses 8192-byte pages that fit evenly both into the ZFS record size and the column size. Each page access requires only a single disk read. Random i/o performance here should be 5 times that of a single disk.
For me the reliability ZFS offers is far more important than pure performance. From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 14:24:40 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id D2A0FA70; Thu, 24 Jan 2013 14:24:40 +0000 (UTC) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from wojtek.tensor.gdynia.pl (wojtek.tensor.gdynia.pl [188.252.31.196]) by mx1.freebsd.org (Postfix) with ESMTP id 37800A5F; Thu, 24 Jan 2013 14:24:39 +0000 (UTC) Received: from wojtek.tensor.gdynia.pl (localhost [127.0.0.1]) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5) with ESMTP id r0OEObtO005693; Thu, 24 Jan 2013 15:24:37 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Received: from localhost (wojtek@localhost) by wojtek.tensor.gdynia.pl (8.14.5/8.14.5/Submit) with ESMTP id r0OEObiI005690; Thu, 24 Jan 2013 15:24:37 +0100 (CET) (envelope-from wojtek@wojtek.tensor.gdynia.pl) Date: Thu, 24 Jan 2013 15:24:37 +0100 (CET) From: Wojciech Puchar To: Adam Nowacki Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. In-Reply-To: <51013345.8010701@platinum.linux.pl> Message-ID: References: <20130122073641.GH30633@server.rulingia.com> <51013345.8010701@platinum.linux.pl> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (wojtek.tensor.gdynia.pl [127.0.0.1]); Thu, 24 Jan 2013 15:24:37 +0100 (CET) Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 14:24:40 -0000 > then stored on a different disk. You could think of it as a regular RAID-5 > with stripe size of 32768 bytes. 
> > PostgreSQL uses 8192 byte pages that fit evenly both into ZFS record size and > column size. Each page access requires only a single disk read. Random i/o > performance here should be 5 times that of a single disk. think about writing 8192 byte pages randomly. and then doing linear search over table. > > For me the reliability ZFS offers is far more important than pure > performance. Except it is on paper reliability. From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 14:45:55 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 08BD6567; Thu, 24 Jan 2013 14:45:55 +0000 (UTC) (envelope-from zbeeble@gmail.com) Received: from mail-lb0-f178.google.com (mail-lb0-f178.google.com [209.85.217.178]) by mx1.freebsd.org (Postfix) with ESMTP id 5F4F9BE9; Thu, 24 Jan 2013 14:45:54 +0000 (UTC) Received: by mail-lb0-f178.google.com with SMTP id n1so4755178lba.23 for ; Thu, 24 Jan 2013 06:45:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=oRSFPzW30F41pK0pa7TGi3bqr8iGepeaoQzU5EOC5QE=; b=LBeEA0fN2+Yjzl4ytf9aDe/LDyMReEDJj2P0dU1tDdC6Nznu7so6zEznfhuQ306/Wx GQaOCHvygKB+RDW2ryYWqKPpjXxEHTeqcFtECaS9Tx9jHu2gihaOWonkG1qJw0S2xVWY pIlgdA1AkmfZslHIdLFgg7oK/vLXNaG8zHh2g9ULD5KB4m3uxs+lhSRVych6Ai495LYR TIGrg49KWkUZozFpnOkfKO8qaq3LfIbU8K73DoDuXnRO9G7oVxIP2vEw2BWtpVAH9KaD jvij6YquNyP1BUg2zhxdHTKIshNQesQP2IgktcbzNCEzZakGISB7YLAj9O0IYZfmfs99 UA3A== MIME-Version: 1.0 X-Received: by 10.112.38.67 with SMTP id e3mr872339lbk.105.1359038753054; Thu, 24 Jan 2013 06:45:53 -0800 (PST) Received: by 10.112.6.38 with HTTP; Thu, 24 Jan 2013 06:45:52 -0800 (PST) In-Reply-To: <51013345.8010701@platinum.linux.pl> References: <20130122073641.GH30633@server.rulingia.com> <51013345.8010701@platinum.linux.pl> Date: Thu, 24 Jan 2013 09:45:52 -0500 Message-ID: Subject: 
Re: ZFS regimen: scrub, scrub, scrub and scrub again.
From: Zaphod Beeblebrox
To: Adam Nowacki
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org

Wow! OK. It sounds like you (or someone like you) can answer some of my burning questions about ZFS.

On Thu, Jan 24, 2013 at 8:12 AM, Adam Nowacki wrote:

> Lets assume a 5 disk raidz1 vdev with ashift=9 (512 byte sectors).
>
> A worst case scenario could happen if your random i/o workload was reading
> random files, each of 2048 bytes. Each file read would require data from 4
> disks (the 5th is parity and won't be read unless there are errors). However,
> if files were 512 bytes or less then only one disk would be used; 1024 bytes,
> two disks, etc.
>
> So ZFS is probably not the best choice to store millions of small files if
> random access to whole files is the primary concern.
>
> But lets look at a different scenario - a PostgreSQL database. Here table
> data is split and stored in 1GB files. ZFS splits each file into 128KiB
> records (the recordsize property). Each record is then again split into 4
> columns of 32768 bytes each, and a 5th column is generated containing parity.
> Each column is then stored on a different disk. You could think of it as a
> regular RAID-5 with a stripe size of 32768 bytes.

Ok... so my question then would be... what of the small files? If I write several small files at once, does the transaction use one record, or does each file need its own record? Additionally, if small files use sub-records, when you delete such a file, does the sub-record get moved or just wasted (until the record is completely free)?
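Adam's read-fanout arithmetic is easy to check numerically. A back-of-the-envelope sketch (the function name is made up for illustration; this is not ZFS code):

```c
#include <assert.h>

/*
 * For an N-disk raidz1 vdev with 2^ashift-byte sectors, a small file
 * occupies ceil(filesize / sector) data columns, each on a different
 * disk, capped at N-1 data disks (the remaining capacity holds parity,
 * which is not read unless an error is detected).
 */
static unsigned
raidz1_disks_read(unsigned long filesize, unsigned sector, unsigned ndisks)
{
	unsigned long cols = (filesize + sector - 1) / sector;	/* ceil */
	unsigned data_disks = ndisks - 1;

	return (cols < data_disks ? (unsigned)cols : data_disks);
}
```

With the 5-disk ashift=9 layout above this reproduces the quoted numbers: a 2048-byte file touches 4 disks, a 512-byte file one, a 1024-byte file two.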
I'm considering the difference, say, between cyrus imap (one file per message on ZFS, database files on a different ZFS filesystem) and dbmail imap (postgresql on ZFS). ... now I realize that PostgreSQL on ZFS has some special issues (but I don't have a choice here between ZFS and non-ZFS ... ZFS has already been chosen), but I'm also figuring that PostgreSQL on ZFS has some waste compared to cyrus IMAP on ZFS.

So far in my research, Cyrus makes some compelling arguments that the common use case for most IMAP database files is a full scan --- for which its database files are optimized and SQL-based files are not. I agree that some operations can be more efficient in a good SQL database, but full scan (the most often used query) is not.

Cyrus also makes sense to me as a collection of small files ... for which I expect ZFS to excel ... including the ability to snapshot with impunity ... but I am terribly curious how the files are handled in transactions. I'm actually (right now) running some filesize statistics (and I'll get back to the list, if asked), but I'd like to know how ZFS is going to store the arriving mail... :).
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 14:53:19 2013
From: Wojciech Puchar
To: Zaphod Beeblebrox
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org
Date: Thu, 24 Jan 2013 15:53:17 +0100 (CET)
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.

> several small files at once, does the transaction use a record, or does
> each file need to use a record?
> Additionally, if small files use
> sub-records, when you delete that file, does the sub-record get moved or
> just wasted (until the record is completely free)?

Writes of small files are always good with ZFS.

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 14:54:34 2013
From: Adam Nowacki
To: Wojciech Puchar
Date: Thu, 24 Jan 2013 15:54:32 +0100
Message-ID: <51014B28.8070404@platinum.linux.pl>
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
References: <20130122073641.GH30633@server.rulingia.com> <51013345.8010701@platinum.linux.pl>
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org

On 2013-01-24 15:24, Wojciech Puchar wrote:
>> For me the reliability ZFS offers is far more important than pure
>> performance.
> Except it is on paper reliability.

This "on paper" reliability has in practice saved a 20TB pool - see one of my previous emails. Any other filesystem, or a hardware/software RAID without per-disk checksums, would have failed: silent corruption of unimportant files in the best case, complete filesystem death from corrupted important metadata in the worst.

I've been using ZFS for 3 years on many systems. The biggest one has 44 disks and 4 ZFS pools - it has survived SAS expander disconnects, a few kernel panics and countless power failures (the UPS only holds for a few hours). So far I've not lost a single ZFS pool or any of the data stored.
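The "per-disk checksums" point is the crux: with a checksum stored outside the data block, a mirror read can tell which copy is good and silently repair the other, instead of returning whichever copy the disk happened to serve. A toy model of that idea (the checksum and function names are invented for illustration; real ZFS uses fletcher/SHA-256 checksums stored in block pointers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy Fletcher-style checksum standing in for ZFS's block checksums. */
static uint32_t
toy_cksum(const uint8_t *buf, size_t len)
{
	uint32_t a = 0, b = 0;

	for (size_t i = 0; i < len; i++) {
		a += buf[i];
		b += a;
	}
	return ((b << 16) | (a & 0xffff));
}

/*
 * Read from a two-way mirror: return the index of a copy whose checksum
 * matches the one stored in the block pointer, repairing the other copy
 * if it differs; return -1 only if both copies are bad.
 */
static int
mirror_read(uint8_t *copy0, uint8_t *copy1, size_t len, uint32_t stored)
{
	int ok0 = (toy_cksum(copy0, len) == stored);
	int ok1 = (toy_cksum(copy1, len) == stored);

	if (ok0 && !ok1)
		memcpy(copy1, copy0, len);	/* self-heal the bad side */
	else if (ok1 && !ok0)
		memcpy(copy0, copy1, len);
	return (ok0 ? 0 : (ok1 ? 1 : -1));
}
```

A RAID-1 without the independently stored checksum has no way to decide which side is right, which is the failure mode being described.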
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 15:12:12 2013
From: Rick Macklem
To: Bruce Evans
Cc: bde@FreeBSD.org, fs@FreeBSD.org
Date: Thu, 24 Jan 2013 10:12:10 -0500 (EST)
Message-ID: <1868504817.2310318.1359040330206.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20130124184756.O1180@besplex.bde.org>
Subject: Re: [PATCH] More time cleanups in the NFS code

Bruce Evans wrote:
> On Wed, 23 Jan 2013, John Baldwin wrote:
>
> > This patch removes all calls to get*time().
> > Most of them it replaces with
> > time_uptime (especially ones that are attempting to handle time intervals,
> > for which time_uptime is far better suited than time_second). One specific
> > case it replaces with nanotime() as suggested by Bruce previously. A few
> > of the timestamps were not used (nd_starttime and the curtime in the lease
> > expiry function).
>
> Looks good.
>
> I didn't check for completeness.
>
> oldnfs might benefit from use of NFSD_MONOSEC.

I put NFSD_MONOSEC in for portability between BSDens, back when that mattered
to me. Do whatever you like with it, such as get rid of it or ... rick

> Both nfs's might benefit from use of NFS_REALSEC (doesn't exist but
> would be #defined as time_second if accesses to this global are atomic
> (which I think is implied by its existence)).
>
> > Index: fs/nfs/nfs_commonkrpc.c
> > ===================================================================
> > --- fs/nfs/nfs_commonkrpc.c	(revision 245742)
> > +++ fs/nfs/nfs_commonkrpc.c	(working copy)
> > @@ -459,18 +459,17 @@
> >  {
> >  	struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg;
> >  	struct nfsmount *nmp = nf->nf_mount;
> > -	struct timeval now;
> > +	time_t now;
> >
> > -	getmicrouptime(&now);
> > -
> >  	switch (type) {
> >  	case FEEDBACK_REXMIT2:
> >  	case FEEDBACK_RECONNECT:
> > -		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) {
> > +		now = NFSD_MONOSEC;
>
> It's confusing for 'now' to be in mono-time.
>
> > +		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) {
> >  			nfs_down(nmp, nf->nf_td,
> >  			    "not responding", 0, NFSSTA_TIMEO);
> >  			nf->nf_tprintfmsg = TRUE;
> > -			nf->nf_lastmsg = now.tv_sec;
> > +			nf->nf_lastmsg = now;
> >  		}
> >  		break;
>
> It's safest but probably unnecessary (uncritical) to copy the (not quite
> volatile) variable NFSD_MONOSEC to a local variable, since it is used twice.
>
> Now I don't like the NFSD_MONOSEC macro. It looks like a constant, but
> is actually a not quite volatile variable.
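The pattern in this hunk, remembering when a message last fired and firing again only after a delay measured on the monotonic clock, can be sketched in isolation. This is an illustrative userland analogue, not the kernel code: time_uptime/NFSD_MONOSEC is a cheap monotonic seconds counter, which suits interval checks like nm_tprintf_delay better than the wall-clock time_second.

```c
#include <assert.h>
#include <time.h>

/*
 * Rate-limit a "server not responding" message: fire only when at least
 * 'delay' seconds of monotonic time have passed since the last message,
 * mirroring the nf_lastmsg / nm_tprintf_delay logic in the patch.
 */
static int
should_print(time_t *lastmsg, time_t now, time_t delay)
{
	if (*lastmsg + delay < now) {
		*lastmsg = now;		/* the same update the patch keeps */
		return (1);
	}
	return (0);
}
```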
> > Index: fs/nfsclient/nfs_clstate.c
> > ===================================================================
> > --- fs/nfsclient/nfs_clstate.c	(revision 245742)
> > +++ fs/nfsclient/nfs_clstate.c	(working copy)
> > @@ -2447,7 +2447,7 @@
> >  	u_int32_t clidrev;
> >  	int error, cbpathdown, islept, igotlock, ret, clearok;
> >  	uint32_t recover_done_time = 0;
> > -	struct timespec mytime;
> > +	time_t mytime;
>
> Another name for the cached copy of mono-now.
>
> > @@ -2720,9 +2720,9 @@
> >  	 * Call nfscl_cleanupkext() once per second to check for
> >  	 * open/lock owners where the process has exited.
> >  	 */
> > -	NFSGETNANOTIME(&mytime);
> > -	if (prevsec != mytime.tv_sec) {
> > -		prevsec = mytime.tv_sec;
> > +	mytime = NFSD_MONOSEC;
> > +	if (prevsec != mytime) {
> > +		prevsec = mytime;
> >  		nfscl_cleanupkext(clp, &lfh);
> >  	}
>
> Now copying it is clearly needed.
>
> > @@ -4684,11 +4682,9 @@
> >  		} else
> >  			error = EPERM;
> >  		if (error == NFSERR_DELAY) {
> > -			NFSGETNANOTIME(&mytime);
> > -			if (((u_int32_t)mytime.tv_sec - starttime) >
> > -			    NFS_REMOVETIMEO &&
> > -			    ((u_int32_t)mytime.tv_sec - starttime) <
> > -			    100000)
> > +			mytime = NFSD_MONOSEC;
> > +			if (((u_int32_t)mytime - starttime) > NFS_REMOVETIMEO &&
> > +			    ((u_int32_t)mytime - starttime) < 100000)
> >  				break;
> >  			/* Sleep for a short period of time */
> >  			(void) nfs_catnap(PZERO, 0, "nfsremove");
>
> Should use time_t for all times in seconds and no casts to u_int32_t
> (unless the times are put in data structures -- then 64-bit times are
> wasteful).
>
> Here, when not doing this cleanup, mytime might as well have type
> u_int32_t to begin with, to match starttime. Then the bogus cast would
> be implicit in the assignment to mytime. The old code had to cast to
> break the type of mytime.tv_sec to match that of starttime. This of
> course only mattered when the times were non-monotonic, time_t was 64
> bits, and the non-mono time was later than the middle of 2038.
> Bruce

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 16:10:01 2013
From: Borja Marcos
To: Steven Hartland
Cc: FreeBSD Filesystems, freebsd-scsi@freebsd.org
Date: Thu, 24 Jan 2013 17:09:56 +0100
Subject: Re: Problem adding SCSI quirks for a SSD, 4K sector and ZFS

On Jan 24, 2013, at 12:19 PM, Steven Hartland wrote:

> ----- Original Message ----- From: "Borja Marcos"
> Sent: Thursday, January 24, 2013 9:46 AM
> Subject: Problem adding SCSI quirks for a SSD, 4K sector and ZFS
>
>> Hello,
>>
>> Crossposting to FreeBSD-fs, as I am wondering if I have had a problem with ZFS and sector size detection as well.
>>
>> I am doing tests with an OCZ Vertex 4 connected to a SAS backplane.
>>
>> < OCZ-VERTEX4 1.5> at scbus6 target 22 lun 0 (pass19,da15)
>>
>> (The blank before "OCZ" really appears there)
>>
>> pass19: < OCZ-VERTEX4 1.5> Fixed Direct Access SCSI-5 device
>> pass19: Serial Number OCZ-1SVG6KZ2YRMSS8E1
>> pass19: 3.300MB/s transfers
>>
>> I am bypassing an "aac" RAID card so that the disks are directly attached
>> to the da driver, instead of relying on the so-called JBOD feature.
>>
>> I have had a weird problem, with the disk being unresponsive to the
>> REQUEST CAPACITY(16) command. Weird, it seems to time out.
>>
>> So, just to complete the tests, I have added a quirk to scsi_da.c. Anyway,
>> I also need the disk to be recognized as a 4K sector drive.
>>
(...)
> Simple answer is ZFS doesn't understand quirks. The attached patch does
> what you're looking for along with a few other things, see notes at
> the top for details.
>
> It's not a final version as there's still some discussion about
> implementation details, but it should do what you're looking for.

Oh, sorry for the confusion! I assumed that the da driver would announce a 4 KB sector instead of the 512 bytes due to the quirks.

Which brings up an interesting question: I might have a potential time bomb in my experimental machine at home. It's a ZFS mirror with two 500 MB disks. One of them has 512 byte sectors, the other is "advanced format".

Borja.
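Whether that mixed mirror is a time bomb comes down to the vdev's ashift: ZFS picks it from the sector size the drives report at vdev creation, as log2 of the logical block size. A minimal sketch of that mapping (illustrative only, not the ZFS source):

```c
#include <assert.h>

/* log2 of the reported sector size, i.e. the ashift ZFS would want. */
static int
ashift_for_sector(unsigned sector)
{
	int shift = 0;

	/* Sector sizes are powers of two: 512 -> 9, 4096 -> 12. */
	while ((1u << shift) < sector)
		shift++;
	return (shift);
}
```

If both drives report 512-byte sectors, the mirror gets ashift=9; an advanced-format member then has to do read-modify-write inside the drive for sub-4K writes, which costs performance rather than correctness.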
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 16:50:07 2013
From: Pawel Jakub Dawidek
To: araujo@FreeBSD.org
Cc: freebsd-fs@freebsd.org
Date: Thu, 24 Jan 2013 17:50:43 +0100
Message-ID: <20130124165043.GB2386@garage.freebsd.pl>
Subject: Re: gmirror doubt.

On Thu, Jan 24, 2013 at 07:04:01PM +0800, Marcelo Araujo wrote:
> Hello Guys,
>
> I'm wondering if it is possible to load gmirror without activating the mirrors.
>
> As an example:
> I have several mirrors using gmirror. When I load gmirror, all mirrors
> that are activated are loaded. What I'd like to do is totally the opposite,
> something like: load gmirror, and activate every mirror by myself.
>
> Anyone with more experience with geom could give me a clue about it?

Unfortunately it is not possible.
You could probably prepare some hack that would keep all gmirror components
open for writing during the load of the gmirror module, which would prevent
gmirror from using them immediately, but it would be gross.

Alternatively, you could introduce even a simple sysctl which, when set,
disables tasting in gmirror globally.

--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://tupytaj.pl

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 17:04:20 2013
From: Christian Gusenbauer
To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org
Date: Thu, 24 Jan 2013 18:05:57 +0100
Subject: 9.1-stable crashes while copying data from a NFS mounted directory
Message-Id: <201301241805.57623.c47g@gmx.at>

Hi!

I'm using 9.1-stable svn revision 245605 and I get the panic below if I
execute the following commands (as single user):

# swapon -a
# dumpon /dev/ada0s3b
# mount -u /
# ifconfig age0 inet 192.168.2.2 mtu 6144 up
# mount -t nfs -o rsize=32768 data:/multimedia /mnt
# cp /mnt/Movies/test/a.m2ts /tmp

then the system panics almost immediately. I'll attach the stack trace.

Note that I'm using jumbo frames (6144 bytes) on a 1Gbit network; maybe
that's the cause of the panic, because the bcopy (see stack frame #15) fails.

Any clues?

Ciao,
Christian.

#0  doadump (textdump=0)
    at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
265         if (textdump && textdump_pending) {
(kgdb) #0  doadump (textdump=0)
    at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
#1  0xffffffff802a8ba0 in db_dump (dummy=<value optimized out>,
    dummy2=<value optimized out>, dummy3=<value optimized out>,
    dummy4=<value optimized out>)
    at /spare/tmp/src-stable9/sys/ddb/db_command.c:538
#2  0xffffffff802a84ce in db_command (last_cmdp=0xffffffff808bc5c0,
    cmd_table=<value optimized out>, dopager=1)
    at /spare/tmp/src-stable9/sys/ddb/db_command.c:449
#3  0xffffffff802a8720 in db_command_loop ()
    at /spare/tmp/src-stable9/sys/ddb/db_command.c:502
#4  0xffffffff802aa859 in db_trap (type=<value optimized out>,
    code=<value optimized out>)
    at /spare/tmp/src-stable9/sys/ddb/db_main.c:231
#5  0xffffffff803c4918 in kdb_trap (type=3, code=0, tf=0xffffff81b2da8a80)
    at /spare/tmp/src-stable9/sys/kern/subr_kdb.c:649
#6  0xffffffff805a02cf in trap (frame=0xffffff81b2da8a80)
    at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:579
#7  0xffffffff8058992f in calltrap ()
    at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:228
#8  0xffffffff803c43cb in kdb_enter (why=0xffffffff806145f3 "panic",
    msg=0x80
) at cpufunc.h:63
#9  0xffffffff8038f407 in panic (fmt=<value optimized out>)
    at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:627
#10 0xffffffff80568049 in vm_fault_hold (map=0xfffffe0002000000,
    vaddr=18446743530148802560, fault_type=2 '\002', fault_flags=0, m_hold=0x0)
    at /spare/tmp/src-stable9/sys/vm/vm_fault.c:285
#11 0xffffffff80568753 in vm_fault (map=0xfffffe0002000000,
    vaddr=18446743530148802560, fault_type=<value optimized out>,
    fault_flags=0) at /spare/tmp/src-stable9/sys/vm/vm_fault.c:229
#12 0xffffffff805a00c7 in trap_pfault (frame=0xffffff81b2da9170, usermode=0)
    at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:771
#13 0xffffffff805a051e in trap (frame=0xffffff81b2da9170)
    at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:463
#14 0xffffffff8058992f in calltrap ()
    at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:228
#15 0xffffffff8059d7b5 in bcopy ()
    at /spare/tmp/src-stable9/sys/amd64/amd64/support.S:134
#16 0xffffffff81c5963b in nfsm_mbufuio (nd=0xffffff81b2da9320,
    uiop=<value optimized out>, siz=32768)
    at /spare/tmp/src-stable9/sys/modules/nfscommon/../../fs/nfs/nfs_commonsubs.c:212
#17 0xffffffff81c19571 in nfsrpc_read (vp=0xfffffe0005ca2000,
    uiop=0xffffff81b2da95a0, cred=<value optimized out>, p=0xfffffe0005f28000,
    nap=0xffffff81b2da9480, attrflagp=0xffffff81b2da954c, stuff=0x0)
    at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clrpcops.c:1343
#18 0xffffffff81c3aff0 in ncl_readrpc (vp=0xfffffe0005ca2000,
    uiop=0xffffff81b2da95a0, cred=<value optimized out>)
    at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clvnops.c:1366
#19 0xffffffff81c2fed3 in ncl_doio (vp=0xfffffe0005ca2000,
    bp=0xffffff816fabca20, cr=0xfffffe0002d59e00, td=0xfffffe0005f28000,
    called_from_strategy=0)
    at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clbio.c:1605
#20 0xffffffff81c32aaf in ncl_bioread (vp=0xfffffe0005ca2000,
    uio=0xffffff81b2da9ad0, ioflag=<value optimized out>,
    cred=0xfffffe0002d59e00)
    at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clbio.c:541
#21 0xffffffff804379c3 in vn_read (fp=0xfffffe0005f3e960,
    uio=0xffffff81b2da9ad0, active_cred=<value optimized out>,
    flags=<value optimized out>, td=<value optimized out>) at vnode_if.h:384
#22 0xffffffff80434d40 in vn_io_fault (fp=0xfffffe0005f3e960,
    uio=0xffffff81b2da9ad0, active_cred=0xfffffe0002d59e00, flags=0,
    td=0xfffffe0005f28000)
    at /spare/tmp/src-stable9/sys/kern/vfs_vnops.c:903
#23 0xffffffff803d7bd1 in dofileread (td=0xfffffe0005f28000, fd=3,
    fp=0xfffffe0005f3e960, auio=0xffffff81b2da9ad0,
    offset=<value optimized out>, flags=0) at file.h:287
#24 0xffffffff803d7f7c in kern_readv (td=0xfffffe0005f28000, fd=3,
    auio=0xffffff81b2da9ad0)
    at /spare/tmp/src-stable9/sys/kern/sys_generic.c:250
#25 0xffffffff803d8074 in sys_read (td=<value optimized out>,
    uap=<value optimized out>)
    at /spare/tmp/src-stable9/sys/kern/sys_generic.c:166
#26 0xffffffff8059f4f0 in amd64_syscall (td=0xfffffe0005f28000, traced=0)
    at subr_syscall.c:135
#27 0xffffffff80589c17 in Xfast_syscall ()
    at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:387
#28 0x00000008009245fc in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb)

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 17:22:10 2013
From: Adam McDougall
To: freebsd-fs@freebsd.org
Date: Thu, 24 Jan 2013 12:22:04 -0500
Message-ID: <51016DBC.5050001@egr.msu.edu>
Subject: Re: gmirror doubt.
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
rv:17.0) Gecko/20130107 Thunderbird/17.0.2
In-Reply-To: <20130124165043.GB2386@garage.freebsd.pl>

On 1/24/2013 11:50 AM, Pawel Jakub Dawidek wrote:
> On Thu, Jan 24, 2013 at 07:04:01PM +0800, Marcelo Araujo wrote:
>> Hello Guys,
>>
>> I'm wondering if it is possible to load gmirror without activating the mirrors.
>>
>> As an example:
>> I have several mirrors using gmirror. When I load gmirror, all mirrors
>> that are activated are loaded. What I'd like to do is totally the opposite,
>> something like: load gmirror, and activate every mirror by myself.
>>
>> Anyone with more experience with geom could give me a clue about it?
> Unfortunately it is not possible. You could probably prepare some hack
> that would keep all gmirror components open for writing during load of
> gmirror module, which would prevent gmirror from using them immediately,
> but it would be gross.
>
> Alternatively you could introduce even simple sysctl which, when set,
> will disable tasting in gmirror globally.

Just putting in my 2 cents that I'd LOVE LOVE LOVE a sysctl to disable
tasting by geom, even if global. My #1 extreme frustration with gmirror,
gjournal etc. is the difficulty of removing the on-disk config when the
kernel is fighting you to keep re-tasting it and keeping it active.
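Pawel's proposed escape hatch is conceptually tiny: a switch consulted before the class ever looks at a new provider's metadata. A schematic sketch with invented names (this is not real GEOM code; actual tasting happens through a class's taste method, and "GEOM::MIRROR" is gmirror's on-disk magic string):

```c
#include <assert.h>
#include <string.h>

/*
 * Model of tasting: examine a provider's metadata block and decide
 * whether the class claims it. 'disabled' stands in for a hypothetical
 * global kern.geom.mirror.taste_disabled sysctl: when set, the metadata
 * is never even examined, so stale on-disk config cannot re-activate.
 */
static int
taste(const char *metadata, int disabled)
{
	if (disabled)
		return (0);	/* skip tasting entirely */
	return (strncmp(metadata, "GEOM::MIRROR", 12) == 0);
}
```

With such a switch set, an administrator could wipe the metadata (e.g. with gmirror clear) without the kernel re-attaching the mirror behind their back.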
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 17:32:03 2013
From: Mark Felder
To: freebsd-fs@freebsd.org, Adam McDougall
Date: Thu, 24 Jan 2013 11:32:01 -0600
In-Reply-To: <51016DBC.5050001@egr.msu.edu>
Subject: Re: gmirror doubt.

On Thu, 24 Jan 2013 11:22:04 -0600, Adam McDougall wrote:
>
> Just putting in my 2 cents that I'd LOVE LOVE LOVE a sysctl to disable
> tasting by geom, even if global.
> My #1 extreme frustration with gmirror,
> gjournal etc. is the difficulty of removing the on-disk config when the
> kernel is fighting you to keep re-tasting it and keeping it active.

I can't agree more. The GEOM framework and tasting are brilliant, but we
need to be able to control it.

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 18:04:11 2013
From: Konstantin Belousov
To: Christian Gusenbauer
Date: Thu, 24 Jan 2013 20:03:59 +0200
Message-ID: <20130124180359.GH2522@kib.kiev.ua>
In-Reply-To: <201301241805.57623.c47g@gmx.at>
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Cc: freebsd-fs@freebsd.org,
freebsd-stable@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 18:04:11 -0000 --vxNpRqk6sR1V+qwt Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote: > Hi! >=20 > I'm using 9.1 stable svn revision 245605 and I get the panic below if I e= xecute the following commands (as single user): >=20 > # swapon -a > # dumpon /dev/ada0s3b > # mount -u / > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > # mount -t nfs -o rsize=3D32768 data:/multimedia /mnt > # cp /mnt/Movies/test/a.m2ts /tmp >=20 > then the system panics almost immediately. I'll attach the stack trace. >=20 > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit network, maybe t= hat's the cause for the panic, because the bcopy (see=20 > stack frame #15) fails. >=20 > Any clues? I tried a similar operation with the nfs mount of rsize=3D32768 and mtu 6144, but the machine runs HEAD and em instead of age. I was unable to reproduce the panic on the copy of the 5GB file from nfs mount. Show the output of "p *(struct uio *)0xffffff81b2da95a0" in kgdb. >=20 > Ciao, > Christian. 
>
> #0  doadump (textdump=0)
>     at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
> 265         if (textdump && textdump_pending) {
> (kgdb) #0  doadump (textdump=0)
>     at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265
> #1  0xffffffff802a8ba0 in db_dump (dummy=,
>     dummy2=, dummy3=, dummy4=)
>     at /spare/tmp/src-stable9/sys/ddb/db_command.c:538
> #2  0xffffffff802a84ce in db_command (last_cmdp=0xffffffff808bc5c0,
>     cmd_table=, dopager=1)
>     at /spare/tmp/src-stable9/sys/ddb/db_command.c:449
> #3  0xffffffff802a8720 in db_command_loop ()
>     at /spare/tmp/src-stable9/sys/ddb/db_command.c:502
> #4  0xffffffff802aa859 in db_trap (type=, code=)
>     at /spare/tmp/src-stable9/sys/ddb/db_main.c:231
> #5  0xffffffff803c4918 in kdb_trap (type=3, code=0, tf=0xffffff81b2da8a80)
>     at /spare/tmp/src-stable9/sys/kern/subr_kdb.c:649
> #6  0xffffffff805a02cf in trap (frame=0xffffff81b2da8a80)
>     at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:579
> #7  0xffffffff8058992f in calltrap ()
>     at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:228
> #8  0xffffffff803c43cb in kdb_enter (why=0xffffffff806145f3 "panic",
>     msg=0x80) at cpufunc.h:63
> #9  0xffffffff8038f407 in panic (fmt=)
>     at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:627
> #10 0xffffffff80568049 in vm_fault_hold (map=0xfffffe0002000000,
>     vaddr=18446743530148802560, fault_type=2 '\002', fault_flags=0,
>     m_hold=0x0) at /spare/tmp/src-stable9/sys/vm/vm_fault.c:285
> #11 0xffffffff80568753 in vm_fault (map=0xfffffe0002000000,
>     vaddr=18446743530148802560, fault_type=, fault_flags=0)
>     at /spare/tmp/src-stable9/sys/vm/vm_fault.c:229
> #12 0xffffffff805a00c7 in trap_pfault (frame=0xffffff81b2da9170, usermode=0)
>     at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:771
> #13 0xffffffff805a051e in trap (frame=0xffffff81b2da9170)
>     at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:463
> #14 0xffffffff8058992f in calltrap ()
>     at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:228
> #15 0xffffffff8059d7b5 in bcopy ()
>     at /spare/tmp/src-stable9/sys/amd64/amd64/support.S:134
> #16 0xffffffff81c5963b in nfsm_mbufuio (nd=0xffffff81b2da9320,
>     uiop=, siz=32768)
>     at /spare/tmp/src-stable9/sys/modules/nfscommon/../../fs/nfs/nfs_commonsubs.c:212
> #17 0xffffffff81c19571 in nfsrpc_read (vp=0xfffffe0005ca2000,
>     uiop=0xffffff81b2da95a0, cred=, p=0xfffffe0005f28000,
>     nap=0xffffff81b2da9480, attrflagp=0xffffff81b2da954c, stuff=0x0)
>     at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clrpcops.c:1343
> #18 0xffffffff81c3aff0 in ncl_readrpc (vp=0xfffffe0005ca2000,
>     uiop=0xffffff81b2da95a0, cred=)
>     at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clvnops.c:1366
> #19 0xffffffff81c2fed3 in ncl_doio (vp=0xfffffe0005ca2000,
>     bp=0xffffff816fabca20, cr=0xfffffe0002d59e00, td=0xfffffe0005f28000,
>     called_from_strategy=0)
>     at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clbio.c:1605
> #20 0xffffffff81c32aaf in ncl_bioread (vp=0xfffffe0005ca2000,
>     uio=0xffffff81b2da9ad0, ioflag=, cred=0xfffffe0002d59e00)
>     at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clbio.c:541
> #21 0xffffffff804379c3 in vn_read (fp=0xfffffe0005f3e960,
>     uio=0xffffff81b2da9ad0, active_cred=, flags=, td=)
>     at vnode_if.h:384
> #22 0xffffffff80434d40 in vn_io_fault (fp=0xfffffe0005f3e960,
>     uio=0xffffff81b2da9ad0, active_cred=0xfffffe0002d59e00, flags=0,
>     td=0xfffffe0005f28000)
>     at /spare/tmp/src-stable9/sys/kern/vfs_vnops.c:903
> #23 0xffffffff803d7bd1 in dofileread (td=0xfffffe0005f28000, fd=3,
>     fp=0xfffffe0005f3e960, auio=0xffffff81b2da9ad0, offset=, flags=0)
>     at file.h:287
> #24 0xffffffff803d7f7c in kern_readv (td=0xfffffe0005f28000, fd=3,
>     auio=0xffffff81b2da9ad0)
>     at /spare/tmp/src-stable9/sys/kern/sys_generic.c:250
> #25 0xffffffff803d8074 in sys_read (td=, uap=)
>     at /spare/tmp/src-stable9/sys/kern/sys_generic.c:166
> #26 0xffffffff8059f4f0 in amd64_syscall (td=0xfffffe0005f28000, traced=0)
>     at subr_syscall.c:135
> #27 0xffffffff80589c17 in Xfast_syscall ()
>     at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:387
> #28 0x00000008009245fc in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb)
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 18:07:27 2013
Date: Thu, 24 Jan 2013 20:07:23 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Christian Gusenbauer
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Message-ID: <20130124180723.GI2522@kib.kiev.ua>
References: <201301241805.57623.c47g@gmx.at> <20130124180359.GH2522@kib.kiev.ua>

On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > Hi!
> >
> > I'm using 9.1-stable svn revision 245605 and I get the panic below if I
> > execute the following commands (as single user):
> >
> > # swapon -a
> > # dumpon /dev/ada0s3b
> > # mount -u /
> > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > # cp /mnt/Movies/test/a.m2ts /tmp
> >
> > then the system panics almost immediately. I'll attach the stack trace.
> >
> > Note that I'm using jumbo frames (6144 bytes) on a 1Gbit network; maybe
> > that's the cause of the panic, because the bcopy (see stack frame #15)
> > fails.
> >
> > Any clues?
> I tried a similar operation with the nfs mount of rsize=32768 and mtu
> 6144, but the machine runs HEAD and em instead of age. I was unable to
> reproduce the panic on the copy of the 5GB file from nfs mount.
>
> Show the output of "p *(struct uio *)0xffffff81b2da95a0" in kgdb.

And the output of "p *(struct buf *)0xffffff816fabca20".

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 18:22:39 2013
Date: Thu, 24 Jan 2013 19:24:12 +0100
From: Christian Gusenbauer <c47g@gmx.at>
To: Konstantin Belousov
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Message-ID: <201301241924.12999.c47g@gmx.at>
In-Reply-To: <20130124180723.GI2522@kib.kiev.ua>

On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > > Hi!
> > >
> > > I'm using 9.1-stable svn revision 245605 and I get the panic below if
> > > I execute the following commands (as single user):
> > >
> > > # swapon -a
> > > # dumpon /dev/ada0s3b
> > > # mount -u /
> > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > # cp /mnt/Movies/test/a.m2ts /tmp
> > >
> > > then the system panics almost immediately. I'll attach the stack trace.
> > >
> > > Note that I'm using jumbo frames (6144 bytes) on a 1Gbit network;
> > > maybe that's the cause of the panic, because the bcopy (see stack
> > > frame #15) fails.
> > >
> > > Any clues?
> >
> > I tried a similar operation with the nfs mount of rsize=32768 and mtu
> > 6144, but the machine runs HEAD and em instead of age. I was unable to
> > reproduce the panic on the copy of the 5GB file from nfs mount.
> >
> > Show the output of "p *(struct uio *)0xffffff81b2da95a0" in kgdb.

(kgdb) p *(struct uio *)0xffffff81b2da95a0
$1 = {uio_iov = 0xffffff81b2da95d0, uio_iovcnt = 1, uio_offset = 5964,
  uio_resid = 26804, uio_segflg = UIO_SYSSPACE, uio_rw = UIO_READ,
  uio_td = 0xfffffe0005f28000}

> And the output of "p *(struct buf *)0xffffff816fabca20".

(kgdb) p *(struct buf *)0xffffff816fabca20
$2 = {b_bufobj = 0xfffffe0005ca2120, b_bcount = 32768, b_caller1 = 0x0,
  b_data = 0xffffff8171418000 "\370\017", b_error = 0, b_iocmd = 1 '\001',
  b_ioflags = 0 '\0', b_iooffset = 0, b_resid = 0, b_iodone = 0, b_blkno = 0,
  b_offset = 0, b_bobufs = {tqe_next = 0x0, tqe_prev = 0xfffffe0005ca2140},
  b_left = 0x0, b_right = 0x0, b_vflags = 0, b_freelist = {tqe_next = 0x0,
    tqe_prev = 0xffffffff80926900}, b_qindex = 0, b_flags = 536870912,
  b_xflags = 2 '\002', b_lock = {lock_object = {
      lo_name = 0xffffffff8061d778 "bufwait", lo_flags = 91422720,
      lo_data = 0, lo_witness = 0x0}, lk_lock = 18446741874786074624,
    lk_exslpfail = 0, lk_timo = 0, lk_pri = 96},
  b_bufsize = 32768, b_runningbufspace = 0,
  b_kvabase = 0xffffff8171418000 "\370\017", b_kvasize = 32768,
  b_lblkno = 0, b_vp = 0xfffffe0005ca2000, b_dirtyoff = 0, b_dirtyend = 0,
  b_rcred = 0x0, b_wcred = 0x0, b_saveaddr = 0xffffff8171418000,
  b_pager = {pg_reqpage = 0}, b_cluster = {cluster_head = {tqh_first = 0x0,
      tqh_last = 0x0}, cluster_entry = {tqe_next = 0x0, tqe_prev = 0x0}},
  b_pages = {0xfffffe01f1786708, 0xfffffe01f1785880, 0xfffffe01f17858f8,
    0xfffffe01f1785970, 0xfffffe01f17859e8, 0xfffffe01f1785a60,
    0xfffffe01f1785ad8, 0xfffffe01f1785b50, 0x0}, b_npages = 8,
  b_dep = {lh_first = 0x0}, b_fsprivate1 = 0x0, b_fsprivate2 = 0x0,
  b_fsprivate3 = 0x0, b_pin_count = 0}

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 18:49:12 2013
Date: Thu, 24 Jan 2013 19:50:49 +0100
From: Christian Gusenbauer <c47g@gmx.at>
To: Konstantin Belousov
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Message-ID: <201301241950.49455.c47g@gmx.at>
In-Reply-To: <20130124180723.GI2522@kib.kiev.ua>

On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > > Hi!
> > >
> > > I'm using 9.1-stable svn revision 245605 and I get the panic below if
> > > I execute the following commands (as single user):
> > >
> > > # swapon -a
> > > # dumpon /dev/ada0s3b
> > > # mount -u /
> > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > # cp /mnt/Movies/test/a.m2ts /tmp
> > >
> > > then the system panics almost immediately. I'll attach the stack trace.
> > >
> > > Note that I'm using jumbo frames (6144 bytes) on a 1Gbit network;
> > > maybe that's the cause of the panic, because the bcopy (see stack
> > > frame #15) fails.
> > >
> > > Any clues?
> >
> > I tried a similar operation with the nfs mount of rsize=32768 and mtu
> > 6144, but the machine runs HEAD and em instead of age. I was unable to
> > reproduce the panic on the copy of the 5GB file from nfs mount.

Hmmm, I did a quick test. If I do not change the MTU, so just configuring
age0 with

# ifconfig age0 inet 192.168.2.2 up

then I can copy all files from the mounted directory without any problems,
too. So it's probably age0 related?

Ciao,
Christian.
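[Editorial aside: the struct uio printout above encodes a simple invariant
that is worth making explicit. The read requested 32768 bytes; at the
moment of the crash, uio_offset = 5964 and uio_resid = 26804, and
5964 + 26804 == 32768. A minimal userspace sketch of that bookkeeping,
with simplified hypothetical types rather than the kernel's definitions:]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Minimal userspace model of struct uio bookkeeping (simplified,
 * hypothetical types -- not the kernel definitions).  Consuming "xfer"
 * bytes advances uio_offset and shrinks uio_resid, so for a single
 * 32768-byte read, consumed + remaining always sums to 32768 -- exactly
 * the relation visible in the $1 printout (5964 + 26804 == 32768).
 */
struct miniuio {
	char  *base;        /* current position in the (single) iovec */
	size_t iov_len;     /* bytes left in the current segment */
	long   uio_offset;  /* bytes consumed so far */
	size_t uio_resid;   /* bytes still to transfer */
};

static void
uio_consume(struct miniuio *uio, const char *src, size_t xfer)
{
	assert(xfer <= uio->uio_resid && xfer <= uio->iov_len);
	memcpy(uio->base, src, xfer);	/* stands in for the bcopy */
	uio->base += xfer;
	uio->iov_len -= xfer;
	uio->uio_offset += xfer;
	uio->uio_resid -= xfer;
}
```

The consistency of the printed counters is what lets kib conclude below
that the server response itself did not overflow the buffer.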
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 19:24:41 2013
Date: Thu, 24 Jan 2013 20:24:38 +0100 (CET)
From: Wojciech Puchar <wojtek@wojtek.tensor.gdynia.pl>
To: Adam Nowacki
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again.
In-Reply-To: <51014B28.8070404@platinum.linux.pl>

> So far I've not lost a single ZFS pool or any data stored.

so far my house wasn't robbed.
From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 19:37:21 2013
Date: Thu, 24 Jan 2013 21:37:09 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Christian Gusenbauer
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Message-ID: <20130124193709.GL2522@kib.kiev.ua>
In-Reply-To: <201301241950.49455.c47g@gmx.at>

On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
> On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > > > Hi!
> > > >
> > > > I'm using 9.1-stable svn revision 245605 and I get the panic below
> > > > if I execute the following commands (as single user):
> > > >
> > > > # swapon -a
> > > > # dumpon /dev/ada0s3b
> > > > # mount -u /
> > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > >
> > > > then the system panics almost immediately. I'll attach the stack
> > > > trace.
> > > >
> > > > Note that I'm using jumbo frames (6144 bytes) on a 1Gbit network;
> > > > maybe that's the cause of the panic, because the bcopy (see stack
> > > > frame #15) fails.
> > > >
> > > > Any clues?
> > >
> > > I tried a similar operation with the nfs mount of rsize=32768 and mtu
> > > 6144, but the machine runs HEAD and em instead of age. I was unable to
> > > reproduce the panic on the copy of the 5GB file from nfs mount.
>
> Hmmm, I did a quick test. If I do not change the MTU, so just configuring
> age0 with
>
> # ifconfig age0 inet 192.168.2.2 up
>
> then I can copy all files from the mounted directory without any
> problems, too. So it's probably age0 related?

From your backtrace and the buffer printout, I see a somewhat strange
thing. The buffer data address is 0xffffff8171418000, while the kernel
faulted at the attempt to write at 0xffffff8171413000, which is lower
than the buffer data pointer, at the attempt to bcopy to the buffer.
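[Editorial aside: the failure mode hypothesized here -- the copy
destination ending up *below* the buffer base -- falls out naturally if a
segment length ever goes negative. A userspace sketch (a model of the
arithmetic, not the kernel code): nfsm_mbufuio picks the transfer size as
the smaller of "bytes left to deliver" and "bytes in this mbuf", using
signed ints, so a negative mbuf length propagates into the transfer size
and walks the destination pointer backwards.]

```c
#include <assert.h>

/*
 * Userspace model of the xfer computation in nfsm_mbufuio():
 *     xfer = (left > len) ? len : left;
 * With signed ints, a negative "len" (a corrupt mbuf length) wins the
 * comparison against any positive "left", so xfer goes negative.
 */
static int
compute_xfer(int left, int len)
{
	return ((left > len) ? len : left);
}

/* Advancing the destination by a negative xfer moves it below the
 * buffer start -- consistent with a fault address lower than b_data. */
static char *
advance(char *dst, int xfer)
{
	return (dst + xfer);
}
```

This is exactly the case kib's KASSERT(len > 0, ...) in the debugging
patch below would catch before the bad bcopy happens.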
The other data suggests that there was no overflow of the data from the
server response. So it might be that mbuf_len(mp) returned a negative
number? I am not sure whether that is possible at all.

Try this debugging patch, please. You need to add INVARIANTS etc. to the
kernel config.

diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
index efc0786..9a6bda5 100644
--- a/sys/fs/nfs/nfs_commonsubs.c
+++ b/sys/fs/nfs/nfs_commonsubs.c
@@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 			}
 			mbufcp = NFSMTOD(mp, caddr_t);
 			len = mbuf_len(mp);
+			KASSERT(len > 0, ("len %d", len));
 		}
 		xfer = (left > len) ? len : left;
 #ifdef notdef
@@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
 			uiop->uio_resid -= xfer;
 		}
 		if (uiop->uio_iov->iov_len <= siz) {
+			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
+			    uiop->uio_iovcnt));
 			uiop->uio_iovcnt--;
 			uiop->uio_iov++;
 		} else {

I thought that the server had returned a too-long response, but that
seems not to be the case from your data. Still, I think the patch below
might be due.
diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
index be0476a..a89b907 100644
--- a/sys/fs/nfsclient/nfs_clrpcops.c
+++ b/sys/fs/nfsclient/nfs_clrpcops.c
@@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
 			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
 			eof = fxdr_unsigned(int, *tl);
 		}
-		NFSM_STRSIZ(retlen, rsize);
+		NFSM_STRSIZ(retlen, len);
 		error = nfsm_mbufuio(nd, uiop, retlen);
 		if (error)
 			goto nfsmout;

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 19:52:40 2013
Date: Thu, 24 Jan 2013 11:27:33 -0500
From: John Baldwin <jhb@freebsd.org>
To: Bruce Evans
Cc: Rick Macklem, bde@freebsd.org, fs@freebsd.org
Subject: Re: [PATCH] More time cleanups in the NFS code
Message-ID: <201301241127.33274.jhb@freebsd.org>
In-Reply-To: <20130124184756.O1180@besplex.bde.org>

On Thursday, January 24, 2013 3:15:26 am Bruce Evans wrote:
> On Wed, 23 Jan 2013, John Baldwin wrote:
>
> > This patch removes all calls to get*time().  Most of them it replaces
> > with time_uptime (especially ones that are attempting to handle time
> > intervals, for which time_uptime is far better suited than time_second).
> > One specific case it replaces with nanotime() as suggested by Bruce
> > previously.  A few of the timestamps were not used (nd_starttime and
> > the curtime in the lease expiry function).
>
> Looks good.
>
> I didn't check for completeness.
>
> oldnfs might benefit from use of NFSD_MONOSEC.
>
> Both nfs's might benefit from use of NFS_REALSEC (doesn't exist but
> would be #defined as time_second if accesses to this global are atomic
> (which I think is implied by its existence)).

Accesses should be atomic.
> > Index: fs/nfs/nfs_commonkrpc.c
> > ===================================================================
> > --- fs/nfs/nfs_commonkrpc.c	(revision 245742)
> > +++ fs/nfs/nfs_commonkrpc.c	(working copy)
> > @@ -459,18 +459,17 @@
> >  {
> >  	struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg;
> >  	struct nfsmount *nmp = nf->nf_mount;
> > -	struct timeval now;
> > +	time_t now;
> >
> > -	getmicrouptime(&now);
> > -
> >  	switch (type) {
> >  	case FEEDBACK_REXMIT2:
> >  	case FEEDBACK_RECONNECT:
> > -		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) {
> > +		now = NFSD_MONOSEC;
>
> It's confusing for 'now' to be in mono-time.

'now' is all relative anyway.  :)

> > +		if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) {
> >  			nfs_down(nmp, nf->nf_td,
> >  			    "not responding", 0, NFSSTA_TIMEO);
> >  			nf->nf_tprintfmsg = TRUE;
> > -			nf->nf_lastmsg = now.tv_sec;
> > +			nf->nf_lastmsg = now;
> >  		}
> >  		break;
>
> It's safest but probably unnecessary (uncritical) to copy the (not quite
> volatile) variable NFSD_MONOSEC to a local variable, since it is used
> twice.
>
> Now I don't like the NFSD_MONOSEC macro.  It looks like a constant, but
> is actually a not quite volatile variable.

I have a separate patch to make both time_second and time_uptime volatile.
The global 'ticks' should also be made volatile for the same reason.

> > @@ -4684,11 +4682,9 @@
> >  		} else
> >  			error = EPERM;
> >  		if (error == NFSERR_DELAY) {
> > -			NFSGETNANOTIME(&mytime);
> > -			if (((u_int32_t)mytime.tv_sec - starttime) >
> > -			    NFS_REMOVETIMEO &&
> > -			    ((u_int32_t)mytime.tv_sec - starttime) <
> > -			    100000)
> > +			mytime = NFSD_MONOSEC;
> > +			if (((u_int32_t)mytime - starttime) > NFS_REMOVETIMEO &&
> > +			    ((u_int32_t)mytime - starttime) < 100000)
> >  				break;
> >  			/* Sleep for a short period of time */
> >  			(void) nfs_catnap(PZERO, 0, "nfsremove");
>
> Should use time_t for all times in seconds and no casts to u_int32_t
> (unless the times are put in data structures -- then 64-bit times are
> wasteful).
Ah, for some reason I had thought starttime was stuffed into a packet or
some such.  It is not, so I redid this as:

@@ -4650,8 +4648,7 @@ out:
 APPLESTATIC void
 nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p)
 {
-	struct timespec mytime;
-	int32_t starttime;
+	time_t starttime, elapsed;
 	int error;
 
 	/*
@@ -4675,8 +4672,7 @@ nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p)
 	 * Now, call nfsrv_checkremove() in a loop while it returns
 	 * NFSERR_DELAY. Return upon any other error or when timed out.
 	 */
-	NFSGETNANOTIME(&mytime);
-	starttime = (u_int32_t)mytime.tv_sec;
+	starttime = NFSD_MONOSEC;
 	do {
 		if (NFSVOPLOCK(vp, LK_EXCLUSIVE) == 0) {
 			error = nfsrv_checkremove(vp, 0, p);
@@ -4684,11 +4680,8 @@ nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p)
 		} else
 			error = EPERM;
 		if (error == NFSERR_DELAY) {
-			NFSGETNANOTIME(&mytime);
-			if (((u_int32_t)mytime.tv_sec - starttime) >
-			    NFS_REMOVETIMEO &&
-			    ((u_int32_t)mytime.tv_sec - starttime) <
-			    100000)
+			elapsed = NFSD_MONOSEC - starttime;
+			if (elapsed > NFS_REMOVETIMEO && elapsed < 100000)
 				break;
 			/* Sleep for a short period of time */
 			(void) nfs_catnap(PZERO, 0, "nfsremove");

-- 
John Baldwin

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 20:49:16 2013
Date: Thu, 24 Jan 2013 21:50:52 +0100
From: Christian Gusenbauer <c47g@gmx.at>
To: Konstantin Belousov
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Message-ID: <201301242150.52238.c47g@gmx.at>
In-Reply-To: <20130124193709.GL2522@kib.kiev.ua>

On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
> On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
> > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > > > > Hi!
> > > > >
> > > > > I'm using 9.1-stable svn revision 245605 and I get the panic below
> > > > > if I execute the following commands (as single user):
> > > > >
> > > > > # swapon -a
> > > > > # dumpon /dev/ada0s3b
> > > > > # mount -u /
> > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > > >
> > > > > then the system panics almost immediately. I'll attach the stack
> > > > > trace.
> > > > >
> > > > > Note that I'm using jumbo frames (6144 bytes) on a 1Gbit network;
> > > > > maybe that's the cause of the panic, because the bcopy (see stack
> > > > > frame #15) fails.
> > > > >
> > > > > Any clues?
> > > >
> > > > I tried a similar operation with the nfs mount of rsize=32768 and mtu
> > > > 6144, but the machine runs HEAD and em instead of age.  I was unable
> > > > to reproduce the panic on the copy of the 5GB file from nfs mount.
> >
> > Hmmm, I did a quick test. If I do not change the MTU, so just configuring
> > age0 with
> >
> > # ifconfig age0 inet 192.168.2.2 up
> >
> > then I can copy all files from the mounted directory without any
> > problems, too. So it's probably age0 related?
>
> From your backtrace and the buffer printout, I see a somewhat strange thing.
> The buffer data address is 0xffffff8171418000, while the kernel faulted
> at the attempt to write at 0xffffff8171413000, which is lower than
> the buffer data pointer, at the attempt to bcopy to the buffer.
>
> The other data suggests that there was no overflow of the data from the
> server response. So it might be that mbuf_len(mp) returned a negative
> number? I am not sure whether that is possible at all.
>
> Try this debugging patch, please. You need to add INVARIANTS etc to the
> kernel config.
>
> diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
> index efc0786..9a6bda5 100644
> --- a/sys/fs/nfs/nfs_commonsubs.c
> +++ b/sys/fs/nfs/nfs_commonsubs.c
> @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
>  		}
>  		mbufcp = NFSMTOD(mp, caddr_t);
>  		len = mbuf_len(mp);
> +		KASSERT(len > 0, ("len %d", len));
>  	}
>  	xfer = (left > len) ? len : left;
>  #ifdef notdef
> @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
>  		uiop->uio_resid -= xfer;
>  	}
>  	if (uiop->uio_iov->iov_len <= siz) {
> +		KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
> +		    uiop->uio_iovcnt));
>  		uiop->uio_iovcnt--;
>  		uiop->uio_iov++;
>  	} else {
>
> I thought that the server had returned a too-long response, but it seems
> not to be the case from your data. Still, I think the patch below might be
> due.
> > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > --- a/sys/fs/nfsclient/nfs_clrpcops.c > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > eof = fxdr_unsigned(int, *tl); > } > - NFSM_STRSIZ(retlen, rsize); > + NFSM_STRSIZ(retlen, len); > error = nfsm_mbufuio(nd, uiop, retlen); > if (error) > goto nfsmout; I applied your patches and now I get a panic: len -4 cpuid = 1 KDB: enter: panic Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% #0 doadump (textdump=0) at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265 265 if (textdump && textdump_pending) { (kgdb) #0 doadump (textdump=0) at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265 #1 0xffffffff802a7490 in db_dump (dummy=, dummy2=, dummy3=, dummy4=) at /spare/tmp/src-stable9/sys/ddb/db_command.c:538 #2 0xffffffff802a6a7e in db_command (last_cmdp=0xffffffff808ca140, cmd_table=, dopager=1) at /spare/tmp/src-stable9/sys/ddb/db_command.c:449 #3 0xffffffff802a6cd0 in db_command_loop () at /spare/tmp/src-stable9/sys/ddb/db_command.c:502 #4 0xffffffff802a8e29 in db_trap (type=, code=) at /spare/tmp/src-stable9/sys/ddb/db_main.c:231 #5 0xffffffff803bf548 in kdb_trap (type=3, code=0, tf=0xffffff81b2ba1080) at /spare/tmp/src-stable9/sys/kern/subr_kdb.c:649 #6 0xffffffff80594c28 in trap (frame=0xffffff81b2ba1080) at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:579 #7 0xffffffff8057e06f in calltrap () at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:228 #8 0xffffffff803beffb in kdb_enter (why=0xffffffff8060ebcf "panic", msg=0x80
) at cpufunc.h:63 #9 0xffffffff80389391 in panic (fmt=) at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:627 #10 0xffffffff81e5bab2 in nfsm_mbufuio (nd=0xffffff81b2ba1340, uiop=0x7cf, siz=18) at /spare/tmp/src-stable9/sys/modules/nfscommon/../../fs/nfs/nfs_commonsubs.c:202 #11 0xffffffff81e195c1 in nfsrpc_read (vp=0xfffffe0006c94dc8, uiop=0xffffff81b2ba15c0, cred=, p=0xfffffe0006aa6490, nap=0xffffff81b2ba14a0, attrflagp=0xffffff81b2ba156c, stuff=0x0) at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clrpcops.c:1343 #12 0xffffffff81e3bd80 in ncl_readrpc (vp=0xfffffe0006c94dc8, uiop=0xffffff81b2ba15c0, cred=) at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clvnops.c:1366 #13 0xffffffff81e3086b in ncl_doio (vp=0xfffffe0006c94dc8, bp=0xffffff816f8f4120, cr=0xfffffe0002d58e00, td=0xfffffe0006aa6490, called_from_strategy=0) at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clbio.c:1605 #14 0xffffffff81e3254f in ncl_bioread (vp=0xfffffe0006c94dc8, uio=0xffffff81b2ba1ad0, ioflag=, cred=0xfffffe0002d58e00) at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_clbio.c:541 #15 0xffffffff80434ae8 in vn_read (fp=0xfffffe0006abda50, uio=0xffffff81b2ba1ad0, active_cred=, flags=, td=) at vnode_if.h:384 #16 0xffffffff8043206e in vn_io_fault (fp=0xfffffe0006abda50, uio=0xffffff81b2ba1ad0, active_cred=0xfffffe0002d58e00, flags=0, td=0xfffffe0006aa6490) at /spare/tmp/src-stable9/sys/kern/vfs_vnops.c:903 #17 0xffffffff803d7ac1 in dofileread (td=0xfffffe0006aa6490, fd=3, fp=0xfffffe0006abda50, auio=0xffffff81b2ba1ad0, offset=, flags=0) at file.h:287 #18 0xffffffff803d7e1c in kern_readv (td=0xfffffe0006aa6490, fd=3, auio=0xffffff81b2ba1ad0) at /spare/tmp/src-stable9/sys/kern/sys_generic.c:250 #19 0xffffffff803d7f34 in sys_read (td=, uap=) at /spare/tmp/src-stable9/sys/kern/sys_generic.c:166 #20 0xffffffff80593cb3 in amd64_syscall (td=0xfffffe0006aa6490, traced=0) at subr_syscall.c:135 #21 0xffffffff8057e357 in 
Xfast_syscall () at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:387 #22 0x00000008009245fc in ?? () Previous frame inner to this frame (corrupt stack?) (kgdb) From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 21:08:05 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 08D59D9C; Thu, 24 Jan 2013 21:08:05 +0000 (UTC) (envelope-from daniel@digsys.bg) Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.3.230]) by mx1.freebsd.org (Postfix) with ESMTP id 435F2723; Thu, 24 Jan 2013 21:08:03 +0000 (UTC) Received: from digsys200-136.pip.digsys.bg (digsys200-136.pip.digsys.bg [193.68.136.200]) (authenticated bits=0) by smtp-sofia.digsys.bg (8.14.5/8.14.5) with ESMTP id r0OKkb3B079889 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NO); Thu, 24 Jan 2013 22:46:38 +0200 (EET) (envelope-from daniel@digsys.bg) Subject: Re: Problem adding SCSI quirks for a SSD, 4K sector and ZFS Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Content-Type: text/plain; charset=us-ascii From: Daniel Kalchev X-Priority: 3 In-Reply-To: Date: Thu, 24 Jan 2013 22:46:37 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <92BD25C7-FC6C-41AF-80AE-05AAD4CD4659@digsys.bg> References: <492280E6-E3EE-4540-92CE-C535C8943CCF@sarenet.es> To: Borja Marcos X-Mailer: Apple Mail (2.1499) Cc: FreeBSD Filesystems , freebsd-scsi@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 21:08:05 -0000 On Jan 24, 2013, at 6:09 PM, Borja Marcos wrote: > Oh, sorrry for the confusion! I assumed that the da driver would = announce a 4 KB sector instead of the 512 bytes due to the quirks.=20 >=20 > Which brings an interesting question, I might have a potential time = bomb in my experimental machine at home. 
It's a ZFS mirror with two 500 MB disks. One of them has 512 byte sectors, the other is "advanced format".

If you did not create your zpool with ashift=12, then you have the bomb already. Disks with different native sector sizes coexist quite happily in the same zpool. If the storage does not lie, ZFS will in fact use the largest 'sector size' and everything will be ok. Many (most) advanced format drives however lie about their sector size and pretend to have 512 byte sectors. If you create a zpool with such a drive and a 'normal' 512 byte drive without explicitly requesting 4k alignment, things are bad.

Daniel

From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 21:22:20 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 6F3C72D6; Thu, 24 Jan 2013 21:22:20 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id BF89B7DA; Thu, 24 Jan 2013 21:22:19 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r0OLMCUx094635; Thu, 24 Jan 2013 23:22:12 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.4 kib.kiev.ua r0OLMCUx094635 Received: (from kostik@localhost) by tom.home (8.14.6/8.14.6/Submit) id r0OLMCx5094634; Thu, 24 Jan 2013 23:22:12 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 24 Jan 2013 23:22:12 +0200 From: Konstantin Belousov To: Christian Gusenbauer Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Message-ID: <20130124212212.GM2522@kib.kiev.ua> References: <201301241805.57623.c47g@gmx.at> <201301241950.49455.c47g@gmx.at> <20130124193709.GL2522@kib.kiev.ua> <201301242150.52238.c47g@gmx.at> MIME-Version:
1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="RS1722//baS0C3Tp" Content-Disposition: inline In-Reply-To: <201301242150.52238.c47g@gmx.at> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-fs@freebsd.org, net@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 21:22:20 -0000 --RS1722//baS0C3Tp Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote: > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote: > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote: > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote: > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wr= ote: > > > > > > Hi! > > > > > >=20 > > > > > > I'm using 9.1 stable svn revision 245605 and I get the panic be= low > > > > > > if I execute the following commands (as single user): > > > > > >=20 > > > > > > # swapon -a > > > > > > # dumpon /dev/ada0s3b > > > > > > # mount -u / > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > # mount -t nfs -o rsize=3D32768 data:/multimedia /mnt > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > >=20 > > > > > > then the system panics almost immediately. I'll attach the stack > > > > > > trace. 
> > > > > >=20 > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit networ= k, > > > > > > maybe that's the cause for the panic, because the bcopy (see st= ack > > > > > > frame #15) fails. > > > > > >=20 > > > > > > Any clues? > > > > >=20 > > > > > I tried a similar operation with the nfs mount of rsize=3D32768 a= nd mtu > > > > > 6144, but the machine runs HEAD and em instead of age. I was unab= le > > > > > to reproduce the panic on the copy of the 5GB file from nfs mount. > > >=20 > > > Hmmm, I did a quick test. If I do not change the MTU, so just configu= ring > > > age0 with > > >=20 > > > # ifconfig age0 inet 192.168.2.2 up > > >=20 > > > then I can copy all files from the mounted directory without any > > > problems, too. So it's probably age0 related? > >=20 > > From your backtrace and the buffer printout, I see somewhat strange thi= ng. > > The buffer data address is 0xffffff8171418000, while kernel faulted > > at the attempt to write at 0xffffff8171413000, which is is lower then > > the buffer data pointer, at the attempt to bcopy to the buffer. > >=20 > > The other data suggests that there were no overflow of the data from the > > server response. So it might be that mbuf_len(mp) returned negative num= ber > > ? I am not sure is it possible at all. > >=20 > > Try this debugging patch, please. You need to add INVARIANTS etc to the > > kernel config. > >=20 > > diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c > > index efc0786..9a6bda5 100644 > > --- a/sys/fs/nfs/nfs_commonsubs.c > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > *uiop, int siz) } > > mbufcp =3D NFSMTOD(mp, caddr_t); > > len =3D mbuf_len(mp); > > + KASSERT(len > 0, ("len %d", len)); > > } > > xfer =3D (left > len) ? 
len : left; > > #ifdef notdef > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > *uiop, int siz) uiop->uio_resid -=3D xfer; > > } > > if (uiop->uio_iov->iov_len <=3D siz) { > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > + uiop->uio_iovcnt)); > > uiop->uio_iovcnt--; > > uiop->uio_iov++; > > } else { > >=20 > > I thought that server have returned too long response, but it seems to > > be not the case from your data. Still, I think the patch below might be > > due. > >=20 > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, stru= ct > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > eof =3D fxdr_unsigned(int, *tl); > > } > > - NFSM_STRSIZ(retlen, rsize); > > + NFSM_STRSIZ(retlen, len); > > error =3D nfsm_mbufuio(nd, uiop, retlen); > > if (error) > > goto nfsmout; >=20 > I applied your patches and now I get a >=20 > panic: len -4 > cpuid =3D 1 > KDB: enter: panic > Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..= 94% >=20 This means that the age driver either produced corrupted mbuf chain, or filled wrong negative value into the mbuf len field. I am quite certain that the issue is in the driver. I added the net@ to Cc:, hopefully you could get help there. 
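The failure mode Konstantin describes — a driver handing up an mbuf with a negative length, which the KASSERT catches as "len -4" — can be modeled in a few lines of userland C. This is a toy sketch: struct mock_mbuf and chain_copy() are made-up stand-ins, not the kernel's struct mbuf or nfsm_mbufuio():

```c
#include <string.h>

/*
 * Toy model of the bug: the copy loop trusts each buffer's m_len.  A
 * corrupted (negative) m_len makes the transfer size go wrong and the
 * copy walk outside the destination; a sanity check equivalent to
 * KASSERT(len > 0, ("len %d", len)) stops it first.
 */
struct mock_mbuf {
	struct mock_mbuf *m_next;
	int		  m_len;	/* valid chains have m_len >= 0 */
	char		  m_data[64];
};

/*
 * Copy up to 'siz' bytes from the chain into 'dst'.  Returns the number
 * of bytes copied, or -1 when a corrupt (negative) length is found.
 */
long
chain_copy(const struct mock_mbuf *m, char *dst, long siz)
{
	long done = 0;

	for (; m != NULL && done < siz; m = m->m_next) {
		if (m->m_len < 0)	/* the KASSERT analogue */
			return (-1);
		long xfer = m->m_len;
		if (xfer > siz - done)
			xfer = siz - done;
		memcpy(dst + done, m->m_data, (size_t)xfer);
		done += xfer;
	}
	return (done);
}
```

With a well-formed chain the copy succeeds; poisoning one m_len to -4 makes the check trip immediately instead of the copy faulting.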
>=20 > #0 doadump (textdump=3D0) > at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265 > 265 if (textdump && textdump_pending) { > (kgdb) #0 doadump (textdump=3D0) > at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:265 > #1 0xffffffff802a7490 in db_dump (dummy=3D, > dummy2=3D, dummy3=3D, > dummy4=3D) > at /spare/tmp/src-stable9/sys/ddb/db_command.c:538 > #2 0xffffffff802a6a7e in db_command (last_cmdp=3D0xffffffff808ca140, > cmd_table=3D, dopager=3D1) > at /spare/tmp/src-stable9/sys/ddb/db_command.c:449 > #3 0xffffffff802a6cd0 in db_command_loop () > at /spare/tmp/src-stable9/sys/ddb/db_command.c:502 > #4 0xffffffff802a8e29 in db_trap (type=3D, > code=3D) > at /spare/tmp/src-stable9/sys/ddb/db_main.c:231 > #5 0xffffffff803bf548 in kdb_trap (type=3D3, code=3D0, tf=3D0xffffff81b2= ba1080) > at /spare/tmp/src-stable9/sys/kern/subr_kdb.c:649 > #6 0xffffffff80594c28 in trap (frame=3D0xffffff81b2ba1080) > at /spare/tmp/src-stable9/sys/amd64/amd64/trap.c:579 > #7 0xffffffff8057e06f in calltrap () > at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:228 > #8 0xffffffff803beffb in kdb_enter (why=3D0xffffffff8060ebcf "panic", > msg=3D0x80
) at cpufunc.h:63 > #9 0xffffffff80389391 in panic (fmt=3D) > at /spare/tmp/src-stable9/sys/kern/kern_shutdown.c:627 > #10 0xffffffff81e5bab2 in nfsm_mbufuio (nd=3D0xffffff81b2ba1340, uiop=3D0= x7cf, > siz=3D18) > at /spare/tmp/src-stable9/sys/modules/nfscommon/../../fs/nfs/nfs_comm= onsubs.c:202 > #11 0xffffffff81e195c1 in nfsrpc_read (vp=3D0xfffffe0006c94dc8, > uiop=3D0xffffff81b2ba15c0, cred=3D, > p=3D0xfffffe0006aa6490, nap=3D0xffffff81b2ba14a0, > attrflagp=3D0xffffff81b2ba156c, stuff=3D0x0) > at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_cl= rpcops.c:1343 > #12 0xffffffff81e3bd80 in ncl_readrpc (vp=3D0xfffffe0006c94dc8, > uiop=3D0xffffff81b2ba15c0, cred=3D) > at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_cl= vnops.c:1366 > #13 0xffffffff81e3086b in ncl_doio (vp=3D0xfffffe0006c94dc8, > bp=3D0xffffff816f8f4120, cr=3D0xfffffe0002d58e00, td=3D0xfffffe0006aa= 6490, > called_from_strategy=3D0) > at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_cl= bio.c:1605 > #14 0xffffffff81e3254f in ncl_bioread (vp=3D0xfffffe0006c94dc8, > uio=3D0xffffff81b2ba1ad0, ioflag=3D, > cred=3D0xfffffe0002d58e00) > at /spare/tmp/src-stable9/sys/modules/nfscl/../../fs/nfsclient/nfs_cl= bio.c:541 > #15 0xffffffff80434ae8 in vn_read (fp=3D0xfffffe0006abda50, > uio=3D0xffffff81b2ba1ad0, active_cred=3D, > flags=3D, td=3D) at vnode_i= f.h:384 > #16 0xffffffff8043206e in vn_io_fault (fp=3D0xfffffe0006abda50, > uio=3D0xffffff81b2ba1ad0, active_cred=3D0xfffffe0002d58e00, flags=3D0, > td=3D0xfffffe0006aa6490) at /spare/tmp/src-stable9/sys/kern/vfs_vnops= =2Ec:903 > #17 0xffffffff803d7ac1 in dofileread (td=3D0xfffffe0006aa6490, fd=3D3, > fp=3D0xfffffe0006abda50, auio=3D0xffffff81b2ba1ad0, > offset=3D, flags=3D0) at file.h:287 > #18 0xffffffff803d7e1c in kern_readv (td=3D0xfffffe0006aa6490, fd=3D3, > auio=3D0xffffff81b2ba1ad0) > at /spare/tmp/src-stable9/sys/kern/sys_generic.c:250 > #19 0xffffffff803d7f34 in sys_read (td=3D, > uap=3D) > at 
/spare/tmp/src-stable9/sys/kern/sys_generic.c:166 > #20 0xffffffff80593cb3 in amd64_syscall (td=3D0xfffffe0006aa6490, traced= =3D0) > at subr_syscall.c:135 > #21 0xffffffff8057e357 in Xfast_syscall () > at /spare/tmp/src-stable9/sys/amd64/amd64/exception.S:387 > #22 0x00000008009245fc in ?? () > Previous frame inner to this frame (corrupt stack?) > (kgdb) --RS1722//baS0C3Tp-- From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 21:52:24 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 44427A66 for ; Thu, 24 Jan 2013 21:52:24 +0000 (UTC) (envelope-from 000.fbsd@quip.cz) Received: from elsa.codelab.cz (elsa.codelab.cz [94.124.105.4]) by mx1.freebsd.org (Postfix) with ESMTP id 097C692E for ; Thu, 24 Jan 2013 21:52:23 +0000 (UTC) Received: from elsa.codelab.cz (localhost [127.0.0.1]) by elsa.codelab.cz (Postfix) with ESMTP id 346352842D; Thu, 24 Jan 2013 22:45:19 +0100 (CET) Received: from [192.168.1.2] (unknown [89.177.49.69]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
(No client certificate requested) by elsa.codelab.cz (Postfix) with ESMTPSA id BF5FC2842A; Thu, 24 Jan 2013 22:45:17 +0100 (CET) Message-ID: <5101AB6C.5060905@quip.cz> Date: Thu, 24 Jan 2013 22:45:16 +0100 From: Miroslav Lachman <000.fbsd@quip.cz> User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.1.19) Gecko/20110420 Lightning/1.0b1 SeaMonkey/2.0.14 MIME-Version: 1.0 To: Mark Felder Subject: Re: gmirror doubt. References: <20130124165043.GB2386@garage.freebsd.pl> <51016DBC.5050001@egr.msu.edu> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 21:52:24 -0000 Mark Felder wrote: > On Thu, 24 Jan 2013 11:22:04 -0600, Adam McDougall > wrote: > >> >> Just putting in my 2 cents that I'd LOVE LOVE LOVE a sysctl to disable >> tasting by geom, even if global. My #1 extreme frustration with gmirror, >> gjournal etc is the difficulty of removing the on-disk config when the >> kernel is fighting you to keep re-tasting it and keeping it active. > > I can't agree more. The GEOM framework and tasting is brilliant, but we > need to be able to control it. +1 from me. 
I had a problem with gjournal re-tasting few weeks ago (posted on GEOM mailinglist) Miroslav Lachman From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 22:23:38 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 6585A3E1; Thu, 24 Jan 2013 22:23:38 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 12721A96; Thu, 24 Jan 2013 22:23:38 +0000 (UTC) Received: from pakbsde14.localnet (unknown [38.105.238.108]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 2B2DDB91A; Thu, 24 Jan 2013 17:23:37 -0500 (EST) From: John Baldwin To: freebsd-fs@freebsd.org, yongari@freebsd.org Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Date: Thu, 24 Jan 2013 17:21:50 -0500 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p22; KDE/4.5.5; amd64; ; ) References: <201301241805.57623.c47g@gmx.at> <201301242150.52238.c47g@gmx.at> <20130124212212.GM2522@kib.kiev.ua> In-Reply-To: <20130124212212.GM2522@kib.kiev.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201301241721.51102.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 24 Jan 2013 17:23:37 -0500 (EST) Cc: Christian Gusenbauer X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 22:23:38 -0000 On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote: > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian 
Gusenbauer wrote: > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote: > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote: > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote: > > > > > > > Hi! > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the panic below > > > > > > > if I execute the following commands (as single user): > > > > > > > > > > > > > > # swapon -a > > > > > > > # dumpon /dev/ada0s3b > > > > > > > # mount -u / > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > then the system panics almost immediately. I'll attach the stack > > > > > > > trace. > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit network, > > > > > > > maybe that's the cause for the panic, because the bcopy (see stack > > > > > > > frame #15) fails. > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > I tried a similar operation with the nfs mount of rsize=32768 and mtu > > > > > > 6144, but the machine runs HEAD and em instead of age. I was unable > > > > > > to reproduce the panic on the copy of the 5GB file from nfs mount. > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just configuring > > > > age0 with > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > then I can copy all files from the mounted directory without any > > > > problems, too. So it's probably age0 related? > > > > > > From your backtrace and the buffer printout, I see somewhat strange thing. > > > The buffer data address is 0xffffff8171418000, while kernel faulted > > > at the attempt to write at 0xffffff8171413000, which is is lower then > > > the buffer data pointer, at the attempt to bcopy to the buffer. 
> > > > > > The other data suggests that there were no overflow of the data from the > > > server response. So it might be that mbuf_len(mp) returned negative number > > > ? I am not sure is it possible at all. > > > > > > Try this debugging patch, please. You need to add INVARIANTS etc to the > > > kernel config. > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c > > > index efc0786..9a6bda5 100644 > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > > *uiop, int siz) } > > > mbufcp = NFSMTOD(mp, caddr_t); > > > len = mbuf_len(mp); > > > + KASSERT(len > 0, ("len %d", len)); > > > } > > > xfer = (left > len) ? len : left; > > > #ifdef notdef > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > > *uiop, int siz) uiop->uio_resid -= xfer; > > > } > > > if (uiop->uio_iov->iov_len <= siz) { > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > + uiop->uio_iovcnt)); > > > uiop->uio_iovcnt--; > > > uiop->uio_iov++; > > > } else { > > > > > > I thought that server have returned too long response, but it seems to > > > be not the case from your data. Still, I think the patch below might be > > > due. 
> > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > > eof = fxdr_unsigned(int, *tl); > > > } > > > - NFSM_STRSIZ(retlen, rsize); > > > + NFSM_STRSIZ(retlen, len); > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > if (error) > > > goto nfsmout; > > > > I applied your patches and now I get a > > > > panic: len -4 > > cpuid = 1 > > KDB: enter: panic > > Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > This means that the age driver either produced corrupted mbuf chain, > or filled wrong negative value into the mbuf len field. I am quite > certain that the issue is in the driver. > > I added the net@ to Cc:, hopefully you could get help there. And I've cc'd Pyun who has written most of this driver and is likely the one most familiar with its handling of jumbo frames. 
-- John Baldwin From owner-freebsd-fs@FreeBSD.ORG Thu Jan 24 23:32:03 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 976951D7; Thu, 24 Jan 2013 23:32:03 +0000 (UTC) (envelope-from ndenev@gmail.com) Received: from mail-bk0-f52.google.com (mail-bk0-f52.google.com [209.85.214.52]) by mx1.freebsd.org (Postfix) with ESMTP id 0719BE1D; Thu, 24 Jan 2013 23:32:02 +0000 (UTC) Received: by mail-bk0-f52.google.com with SMTP id jk13so1054414bkc.11 for ; Thu, 24 Jan 2013 15:31:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:subject:mime-version:content-type:from:in-reply-to:date :cc:content-transfer-encoding:message-id:references:to:x-mailer; bh=hNhdEjjgWuOGeQh4UlpJcQNV1MGOZosoZxTU/uQpSn4=; b=EYInC7OLDDYM/nCBZIHB3cx1mxbNr3NWppIsFmXUUA+W7KSWOjScKtRot4sK2DqGth UojRU0BwAorsoi76bqXqiMZg8e9mELKR4kGCIRBizctW7qdQVAMKiqGoE7kuafh50lyr Hi5PY3ccYYDNFjJckZdrTx5vH2qMDJL0FkVXpQxWJyXsA/MG/s8nbJwOHjv8w6OfFx3J iYMd5EFKwRyJ1IpkN+ALgPwgiDKFLoUtgifFKmAHqbo8WiCLkjaiebOZde2eJKAXQ5oA m1BLb15kAKepBG34hNimXGN8QWFgHNxJKVn5184kIyduPIxytD8VF6B9mbqPEFjhfIgQ 6ZBg== X-Received: by 10.204.128.151 with SMTP id k23mr1333499bks.65.1359070316405; Thu, 24 Jan 2013 15:31:56 -0800 (PST) Received: from [10.0.0.3] ([93.152.184.10]) by mx.google.com with ESMTPS id o9sm18380914bko.15.2013.01.24.15.31.54 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Jan 2013 15:31:55 -0800 (PST) Subject: Re: ZFS regimen: scrub, scrub, scrub and scrub again. 
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Content-Type: text/plain; charset=us-ascii From: Nikolay Denev In-Reply-To: Date: Fri, 25 Jan 2013 01:31:53 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <4241CA0A-9AFC-4EB4-89B7-18BC7E645B03@gmail.com> References: <20130122073641.GH30633@server.rulingia.com> <51013345.8010701@platinum.linux.pl> To: Wojciech Puchar X-Mailer: Apple Mail (2.1499) Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Jan 2013 23:32:03 -0000

On Jan 24, 2013, at 4:24 PM, Wojciech Puchar wrote:

>
> Except it is on paper reliability.

This "on paper" reliability saved my ass numerous times.

For example, I had one home NAS server machine with a flaky SATA controller that would fail to detect one of the four drives from time to time on reboot. This made my pool degraded several times, and even rebooting from a state where, say, disk4 had failed into one where disk3 was failed did not corrupt any data. I don't think this is possible with any other open source FS, let alone hardware RAID that would drop the whole array because of this.

I have never personally lost any data on ZFS. Yes, performance is another topic, and you must know what you are doing and what your usage pattern is, but from a reliability standpoint, to me ZFS looks more durable than anything else.

P.S.: My home NAS is running freebsd-CURRENT with ZFS from the first version available. Several drives died; two times the pool was expanded by replacing all drives one by one and resilvering, and not a single byte was lost.
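The one-drive-at-a-time upgrade described above follows the standard zpool replace/resilver sequence. A sketch with hypothetical pool and device names (tank, ada1, ada5 — adjust to your system); the commands are destructive, so check `zpool status` output at every step:

```shell
zpool status tank              # confirm the pool is healthy before starting
zpool replace tank ada1 ada5   # swap old member ada1 for new disk ada5
zpool status tank              # watch resilver progress; wait for
                               # "resilver completed" before the next disk

# Repeat for each remaining member.  Once every member is larger,
# the extra capacity can be claimed:
zpool set autoexpand=on tank
zpool online -e tank ada5      # expand into the larger device
```

Because a mirror (or raidz) stays redundant while one member resilvers, the pool survives the whole procedure without downtime, which is what makes the "no single byte lost" upgrades possible.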
From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 01:11:19 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 3728D29B; Fri, 25 Jan 2013 01:11:19 +0000 (UTC) (envelope-from ler@lerctr.org) Received: from thebighonker.lerctr.org (lrosenman-1-pt.tunnel.tserv8.dal1.ipv6.he.net [IPv6:2001:470:1f0e:3ad::2]) by mx1.freebsd.org (Postfix) with ESMTP id E3B981CE; Fri, 25 Jan 2013 01:11:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lerctr.org; s=lerami; h=Content-Type:MIME-Version:References:Message-ID:In-Reply-To:Subject:cc:To:From:Date; bh=JKPrHWFPsUdoRnExCWBdFvgw3Ul+/7WT2hKYGWWy6DA=; b=nskPQJii0nghIfVKg0TQ9F93+78pXp42WA22BYzppuUXCVX2anznS0RdD/2VgJ+fLyAjrW79+CuUJ73zemCK1bgXjcv1uVFwd5hEHh/fHz6EZMcYhuznhhrN/NI+oQVrG0IMCiTQpJXZDOdVzVIBd0sCwUNjvu8OgVXgSZNdLuQ=; Received: from lrosenman-1-pt.tunnel.tserv8.dal1.ipv6.he.net ([2001:470:1f0e:3ad::2]:54684) by thebighonker.lerctr.org with esmtpsa (TLSv1:DHE-RSA-AES256-SHA:256) (Exim 4.80.1 (FreeBSD)) (envelope-from ) id 1TyXp4-000LlG-TF; Thu, 24 Jan 2013 19:11:17 -0600 Date: Thu, 24 Jan 2013 19:11:12 -0600 (CST) From: Larry Rosenman To: Andriy Gapon Subject: Re: My panic in amd64/pmap In-Reply-To: Message-ID: References: <38def6b37be1a3128fb1b64595e9044e@webmail.lerctr.org> <50F95964.6060706@FreeBSD.org> <6f1d46304fbcc6e32f51109f6ab4c60d@webmail.lerctr.org> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Spam-Score: -2.9 (--) X-LERCTR-Spam-Score: -2.9 (--) X-Spam-Report: SpamScore (-2.9/5.0) ALL_TRUSTED=-1,BAYES_00=-1.9 X-LERCTR-Spam-Report: SpamScore (-2.9/5.0) ALL_TRUSTED=-1,BAYES_00=-1.9 Cc: freebsd-fs@freebsd.org, freebsd-emulation@freebsd.org, freebsd-current@freebsd.org, freebsd-amd64@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 01:11:19 -0000 On Sun, 20 Jan 2013, Larry Rosenman wrote: > On Fri, 18 Jan 2013, Larry Rosenman wrote: >> Never mind, it's in VirtualBox itself. The line is at ~~line 8020 in the >> same file. I've patched it and am recompiling >> VirtualBox. >> >> If I don't see the panic for a few days, I'll submit a PR. >> > I've submitted the PR, because for Nehalem-class or better CPUs it's > probably needed; however, I can still panic FreeBSD 9 or FreeBSD 10 by > running a zpool scrub, sometimes :(. > > I have vmcores and kernels from both VMs available. > > Latest core.txts: http://www.lerctr.org/~ler/zfs10-core.txt.4 > http://www.lerctr.org/~ler/zfs9-core.txt.4 > > I can still give ssh access to both VMs as well as the host. > > I'd really like to get to the bottom of this..... > > > I've moved all the core.txt's to: http://www.lerctr.org/~ler/FreeBSD-PMAP/ I got another one on FreeBSD9 today.... Is there ANYONE interested in this? These always seem to be ZFS-induced..... I've added freebsd-fs to the cc list. I have vmcores from them all. 
-- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 512-248-2683 E-Mail: ler@lerctr.org US Mail: 430 Valona Loop, Round Rock, TX 78681-3893 From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 02:26:23 2013 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 12092C3A; Fri, 25 Jan 2013 02:26:23 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 2D60E615; Fri, 25 Jan 2013 02:26:21 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAA/sAVGDaFvO/2dsb2JhbABEhkW4GXOCHgEBAQMBAQEBIAQnIAsFFg4KAgINGQIpAQkmBggHBAEcBIdzBgyrJpJfgSOLcIJVgRMDiGGKfYIugRyPLIMWgVE1 X-IronPort-AV: E=Sophos;i="4.84,534,1355115600"; d="scan'208";a="10807869" Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu.net.uoguelph.ca with ESMTP; 24 Jan 2013 21:26:15 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 1FD80B4044; Thu, 24 Jan 2013 21:26:15 -0500 (EST) Date: Thu, 24 Jan 2013 21:26:15 -0500 (EST) From: Rick Macklem To: John Baldwin Message-ID: <1303380973.2342056.1359080775074.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201301241127.33274.jhb@freebsd.org> Subject: Re: [PATCH] More time cleanups in the NFS code MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: Rick Macklem , bde@freebsd.org, fs@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 02:26:23 -0000 John Baldwin wrote: > On Thursday, January 24, 2013 
3:15:26 am Bruce Evans wrote: > > On Wed, 23 Jan 2013, John Baldwin wrote: > > > > > This patch removes all calls to get*time(). Most of them it > > > replaces with > > > time_uptime (especially ones that are attempting to handle time > > > intervals for > > > which time_uptime is far better suited than time_second). One > > > specific case > > > it replaces with nanotime() as suggested by Bruce previously. A > > > few of the > > > timestamps were not used (nd_starttime and the curtime in the > > > lease expiry > > > function). > > > > Looks good. > > > > I didn't check for completeness. > > > > oldnfs might benefit from use of NFSD_MONOSEC. > > > > Both nfs's might benefit from use of NFS_REALSEC (doesn't exist but > > would be #defined as time_second if acceses to this global are > > atomic > > (which I think is implied by its existence)). > > Accesses should be atomic. > > > > Index: fs/nfs/nfs_commonkrpc.c > > > =================================================================== > > > --- fs/nfs/nfs_commonkrpc.c (revision 245742) > > > +++ fs/nfs/nfs_commonkrpc.c (working copy) > > > @@ -459,18 +459,17 @@ > > > { > > > struct nfs_feedback_arg *nf = (struct nfs_feedback_arg *) arg; > > > struct nfsmount *nmp = nf->nf_mount; > > > - struct timeval now; > > > + time_t now; > > > > > > - getmicrouptime(&now); > > > - > > > switch (type) { > > > case FEEDBACK_REXMIT2: > > > case FEEDBACK_RECONNECT: > > > - if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now.tv_sec) { > > > + now = NFSD_MONOSEC; > > > > It's confusing for 'now' to be in mono-time. > > 'now' is all relative anyway. 
:) > > > > + if (nf->nf_lastmsg + nmp->nm_tprintf_delay < now) { > > > nfs_down(nmp, nf->nf_td, > > > "not responding", 0, NFSSTA_TIMEO); > > > nf->nf_tprintfmsg = TRUE; > > > - nf->nf_lastmsg = now.tv_sec; > > > + nf->nf_lastmsg = now; > > > } > > > break; > > > > It's safest but probably unnecessary (uncritical) to copy the (not > > quite > > volatile) variable NFSD_MONOSEC to a local variable, since it is > > used > > twice. > > > > Now I don't like the NFSD_MONOSEC macro. It looks like a constant, > > but > > is actually a not quite volatile variable. > > I have a separate patch to make both time_second and time_uptime > volatile. > The global 'ticks' should also be made volatile for the same reason. > > > > @@ -4684,11 +4682,9 @@ > > > } else > > > error = EPERM; > > > if (error == NFSERR_DELAY) { > > > - NFSGETNANOTIME(&mytime); > > > - if (((u_int32_t)mytime.tv_sec - starttime) > > > > - NFS_REMOVETIMEO && > > > - ((u_int32_t)mytime.tv_sec - starttime) < > > > - 100000) > > > + mytime = NFSD_MONOSEC; > > > + if (((u_int32_t)mytime - starttime) > NFS_REMOVETIMEO && > > > + ((u_int32_t)mytime - starttime) < 100000) > > > break; > > > /* Sleep for a short period of time */ > > > (void) nfs_catnap(PZERO, 0, "nfsremove"); > > > > Should use time_t for all times in seconds and no casts to u_int32_t > > (unless the times are put in data structures -- then 64-bit times > > are > > wasteful). > > Ah, for some reason I had thought starttime was stuffed into a packet > or some > such. It is not, so I redid this as: > > @@ -4650,8 +4648,7 @@ out: > APPLESTATIC void > nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p) > { > - struct timespec mytime; > - int32_t starttime; > + time_t starttime, elapsed; > int error; > > /* > @@ -4675,8 +4672,7 @@ nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p) > * Now, call nfsrv_checkremove() in a loop while it returns > * NFSERR_DELAY. Return upon any other error or when timed out. 
> */ > - NFSGETNANOTIME(&mytime); > - starttime = (u_int32_t)mytime.tv_sec; > + starttime = NFSD_MONOSEC; > do { > if (NFSVOPLOCK(vp, LK_EXCLUSIVE) == 0) { > error = nfsrv_checkremove(vp, 0, p); > @@ -4684,11 +4680,8 @@ nfsd_recalldelegation(vnode_t vp, NFSPROC_T *p) > } else > error = EPERM; > if (error == NFSERR_DELAY) { > - NFSGETNANOTIME(&mytime); > - if (((u_int32_t)mytime.tv_sec - starttime) > > - NFS_REMOVETIMEO && > - ((u_int32_t)mytime.tv_sec - starttime) < > - 100000) > + elapsed = NFSD_MONOSEC - starttime; > + if (elapsed > NFS_REMOVETIMEO && elapsed < 100000) For this patch, I think you can get rid of "&& elapsed < 100000" from the above line. I can't really remember, but I think I coded this as a sanity check for a clock going backwards. (And, yes, this patch looks fine to me, too.) Thanks again for working on this, rick > break; > /* Sleep for a short period of time */ > (void) nfs_catnap(PZERO, 0, "nfsremove"); > > > -- > John Baldwin > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 02:39:43 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 9E3B3EA7; Fri, 25 Jan 2013 02:39:43 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id F1395693; Fri, 25 Jan 2013 02:39:42 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqAEAEvvAVGDaFvO/2dsb2JhbABEhkW4GXOCHgEBAQMBAQEBIAQnIAsbDgoCAg0ZAikBCSYGAQcCBQQBHASHcwYMqyWSXIEji3CCVYETA4hhin2CLoEcjyyDFoFRNQ X-IronPort-AV: E=Sophos;i="4.84,534,1355115600"; d="scan'208";a="13547159" Received: from erie.cs.uoguelph.ca (HELO 
zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 24 Jan 2013 21:39:38 -0500 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id A079AB3EEA; Thu, 24 Jan 2013 21:39:38 -0500 (EST) Date: Thu, 24 Jan 2013 21:39:38 -0500 (EST) From: Rick Macklem To: John Baldwin , kib Message-ID: <1390810985.2342318.1359081578591.JavaMail.root@erie.cs.uoguelph.ca> In-Reply-To: <201301241721.51102.jhb@freebsd.org> Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692) Cc: freebsd-fs@freebsd.org, yongari@freebsd.org, Christian Gusenbauer X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 02:39:43 -0000 John Baldwin wrote: > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer > > wrote: > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer > > > > wrote: > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov > > > > > wrote: > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin > > > > > > Belousov wrote: > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian > > > > > > > Gusenbauer wrote: > > > > > > > > Hi! 
> > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the > > > > > > > > panic below > > > > > > > > if I execute the following commands (as single user): > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > # mount -u / > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll attach > > > > > > > > the stack > > > > > > > > trace. > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit > > > > > > > > network, > > > > > > > > maybe that's the cause for the panic, because the bcopy > > > > > > > > (see stack > > > > > > > > frame #15) fails. > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of > > > > > > > rsize=32768 and mtu > > > > > > > 6144, but the machine runs HEAD and em instead of age. I > > > > > > > was unable > > > > > > > to reproduce the panic on the copy of the 5GB file from > > > > > > > nfs mount. > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just > > > > > configuring > > > > > age0 with > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > then I can copy all files from the mounted directory without > > > > > any > > > > > problems, too. So it's probably age0 related? > > > > > > > > From your backtrace and the buffer printout, I see somewhat > > > > strange thing. > > > > The buffer data address is 0xffffff8171418000, while kernel > > > > faulted > > > > at the attempt to write at 0xffffff8171413000, which is is lower > > > > then > > > > the buffer data pointer, at the attempt to bcopy to the buffer. > > > > > > > > The other data suggests that there were no overflow of the data > > > > from the > > > > server response. 
So it might be that mbuf_len(mp) returned > > > > negative number > > > > ? I am not sure is it possible at all. > > > > > > > > Try this debugging patch, please. You need to add INVARIANTS etc > > > > to the > > > > kernel config. > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c > > > > b/sys/fs/nfs/nfs_commonsubs.c > > > > index efc0786..9a6bda5 100644 > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > struct uio > > > > *uiop, int siz) } > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > len = mbuf_len(mp); > > > > + KASSERT(len > 0, ("len %d", len)); > > > > } > > > > xfer = (left > len) ? len : left; > > > > #ifdef notdef > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > struct uio > > > > *uiop, int siz) uiop->uio_resid -= xfer; > > > > } > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > + uiop->uio_iovcnt)); > > > > uiop->uio_iovcnt--; > > > > uiop->uio_iov++; > > > > } else { > > > > > > > > I thought that server have returned too long response, but it > > > > seems to > > > > be not the case from your data. Still, I think the patch below > > > > might be > > > > due. > > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio > > > > *uiop, struct > > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > > > eof = fxdr_unsigned(int, *tl); > > > > } > > > > - NFSM_STRSIZ(retlen, rsize); > > > > + NFSM_STRSIZ(retlen, len); > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > if (error) > > > > goto nfsmout; > > > I think this patch is appropriate, although I don't see it as too critical. 
It just tightens the "sanity check" on the read reply length (which should never exceed what the client requested). nfsm_mbufuio() shouldn't transfer more than the uio structure can handle, even if the replied read size is larger than requested. It does seem that nfsm_mbufuio() should apply a sanity check on m_len. I think m_len == 0 is ok, but negative or very large should be checked for. Maybe just return EBADRPC after a printf() instead of a KASSERT(), as a safety belt against a trashed m_len from a driver or ??? rick > > > I applied your patches and now I get a > > > > > > panic: len -4 > > > cpuid = 1 > > > KDB: enter: panic > > > Dumping 377 out of 6116 > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > This means that the age driver either produced corrupted mbuf chain, > > or filled wrong negative value into the mbuf len field. I am quite > > certain that the issue is in the driver. > > > > I added the net@ to Cc:, hopefully you could get help there. > > And I've cc'd Pyun who has written most of this driver and is likely > the one > most familiar with its handling of jumbo frames. 
> > -- > John Baldwin > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 03:26:43 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id E2BB8502 for ; Fri, 25 Jan 2013 03:26:43 +0000 (UTC) (envelope-from araujobsdport@gmail.com) Received: from mail-wi0-x229.google.com (wi-in-x0229.1e100.net [IPv6:2a00:1450:400c:c05::229]) by mx1.freebsd.org (Postfix) with ESMTP id 6FF4E903 for ; Fri, 25 Jan 2013 03:26:43 +0000 (UTC) Received: by mail-wi0-f169.google.com with SMTP id hq12so1087016wib.4 for ; Thu, 24 Jan 2013 19:26:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:reply-to:in-reply-to:references:date :message-id:subject:from:to:cc:content-type; bh=Z477pa/D8/NewLqUeqGB2BtW//Pcx2Zsg/3KQW80RvE=; b=fqysCOJWddDBNckHkzBBWk3bQF0efmd/PsFfYhcrItP7P0ClCbNhU2n49EhvatgAzy 80UjrRYBLXiOcb8CQmwS/hjfOe0e8KFnh/m9OseOdr2xoi3cXBA8VAsnkFLy5fTLGT6s OiRzD74RoIDwc5vRaWxngBHY/F4nNimgasQ/6JMTqo0yy0M+uIk2FZAcZ0ymjD/Vbw76 M1Tu7ResC26j+EO1u/xSwoPK0BGU/Cy/nxkbDPYSKaxqOvAr15hjyB31M3PIPDoHQw5w 35WtSVSlMk4LA9JMKZChWCQGk34OjsnZPwhzULBhUbxZ/fnVvNifQJ1+av5MN9SzQCwj iXfQ== MIME-Version: 1.0 X-Received: by 10.180.99.129 with SMTP id eq1mr6212946wib.30.1359084402333; Thu, 24 Jan 2013 19:26:42 -0800 (PST) Received: by 10.180.145.44 with HTTP; Thu, 24 Jan 2013 19:26:42 -0800 (PST) In-Reply-To: <5101AB6C.5060905@quip.cz> References: <20130124165043.GB2386@garage.freebsd.pl> <51016DBC.5050001@egr.msu.edu> <5101AB6C.5060905@quip.cz> Date: Fri, 25 Jan 2013 11:26:42 +0800 Message-ID: Subject: Re: gmirror doubt. 
From: Marcelo Araujo To: Miroslav Lachman <000.fbsd@quip.cz> Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-fs@freebsd.org, Mark Felder X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: araujo@FreeBSD.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 03:26:43 -0000 2013/1/25 Miroslav Lachman <000.fbsd@quip.cz> > Mark Felder wrote: > >> On Thu, 24 Jan 2013 11:22:04 -0600, Adam McDougall >> wrote: >> >> >>> Just putting in my 2 cents that I'd LOVE LOVE LOVE a sysctl to disable >>> tasting by geom, even if global. My #1 extreme frustration with gmirror, >>> gjournal etc. is the difficulty of removing the on-disk config when the >>> kernel is fighting you to keep re-tasting it and keeping it active. >>> >> >> I can't agree more. The GEOM framework and tasting are brilliant, but we >> need to be able to control it. >> > > +1 from me. I had a problem with gjournal re-tasting a few weeks ago (posted > on the GEOM mailing list). Yes, it seems reasonable to have a sysctl to control the tasting. I'm going to take a look at it; basically I need to solve a personal issue, and if it works for me I will share my changes with you. I don't know yet when I will start to work on it, but it is on my TODO list! Thanks to everybody who replied! 
Best Regards, -- Marcelo Araujo araujo@FreeBSD.org From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 04:30:59 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C48ABC08; Fri, 25 Jan 2013 04:30:59 +0000 (UTC) (envelope-from pyunyh@gmail.com) Received: from mail-pb0-f49.google.com (mail-pb0-f49.google.com [209.85.160.49]) by mx1.freebsd.org (Postfix) with ESMTP id 870C2BDC; Fri, 25 Jan 2013 04:30:59 +0000 (UTC) Received: by mail-pb0-f49.google.com with SMTP id xa12so147264pbc.8 for ; Thu, 24 Jan 2013 20:30:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:from:date:to:cc:subject:message-id:reply-to:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=uZ4lEa2DyFBvAp+hMozUPavjooonHG3Zi6jKGNaqc9o=; b=NflBnyulH0516Q4KMhlDh/0hXNJajENr7KOd4fXNNGcaXkO+S5IkXRYp+THNAktNf5 Wqwl+gzWNlC/yf2/Kb15TvWZR9Hzqtx6qCxxGTN4ocfp0w7MZtA/BN39z1cH62kSpoMx ZoD7iVAkCPwSx2uY8c+U6tdDzTBS5+eCunNh0+4CjvlduR2yET/t4myihlfWo1g3GKsd jjQaZlxFpxGvgnd5FgCd9nnNs0sX+htZ+BqOEiZS/oSedWJ9urh81Y0OvhvCqu8jy5mJ kwNPSFGZOwfhN1uGnX4/+24/DkfM9QZC/FpdM16Mnt77vfQy/DS4GM8s2CA57BwHkYPb Toqg== X-Received: by 10.68.226.71 with SMTP id rq7mr11149858pbc.60.1359088253822; Thu, 24 Jan 2013 20:30:53 -0800 (PST) Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. 
[114.111.62.249]) by mx.google.com with ESMTPS id nw9sm16104964pbb.42.2013.01.24.20.30.49 (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 24 Jan 2013 20:30:52 -0800 (PST) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Fri, 25 Jan 2013 13:30:43 +0900 From: YongHyeon PYUN Date: Fri, 25 Jan 2013 13:30:43 +0900 To: John Baldwin Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Message-ID: <20130125043043.GA1429@michelle.cdnetworks.com> References: <201301241805.57623.c47g@gmx.at> <201301242150.52238.c47g@gmx.at> <20130124212212.GM2522@kib.kiev.ua> <201301241721.51102.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="GvXjxJ+pjyke8COw" Content-Disposition: inline In-Reply-To: <201301241721.51102.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: freebsd-fs@freebsd.org, Christian Gusenbauer , yongari@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 04:30:59 -0000 --GvXjxJ+pjyke8COw Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote: > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote: > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote: > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote: > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote: > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote: > > > > > > > > Hi! 
> > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the panic below > > > > > > > > if I execute the following commands (as single user): > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > # mount -u / > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll attach the stack > > > > > > > > trace. > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit network, > > > > > > > > maybe that's the cause for the panic, because the bcopy (see stack > > > > > > > > frame #15) fails. > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of rsize=32768 and mtu > > > > > > > 6144, but the machine runs HEAD and em instead of age. I was unable > > > > > > > to reproduce the panic on the copy of the 5GB file from nfs mount. > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just configuring > > > > > age0 with > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > then I can copy all files from the mounted directory without any > > > > > problems, too. So it's probably age0 related? > > > > > > > > From your backtrace and the buffer printout, I see somewhat strange thing. > > > > The buffer data address is 0xffffff8171418000, while kernel faulted > > > > at the attempt to write at 0xffffff8171413000, which is is lower then > > > > the buffer data pointer, at the attempt to bcopy to the buffer. > > > > > > > > The other data suggests that there were no overflow of the data from the > > > > server response. So it might be that mbuf_len(mp) returned negative number > > > > ? I am not sure is it possible at all. > > > > > > > > Try this debugging patch, please. 
You need to add INVARIANTS etc to the > > > > kernel config. > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c > > > > index efc0786..9a6bda5 100644 > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > > > *uiop, int siz) } > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > len = mbuf_len(mp); > > > > + KASSERT(len > 0, ("len %d", len)); > > > > } > > > > xfer = (left > len) ? len : left; > > > > #ifdef notdef > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > > > *uiop, int siz) uiop->uio_resid -= xfer; > > > > } > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > + uiop->uio_iovcnt)); > > > > uiop->uio_iovcnt--; > > > > uiop->uio_iov++; > > > > } else { > > > > > > > > I thought that server have returned too long response, but it seems to > > > > be not the case from your data. Still, I think the patch below might be > > > > due. 
> > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct > > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > > > eof = fxdr_unsigned(int, *tl); > > > > } > > > > - NFSM_STRSIZ(retlen, rsize); > > > > + NFSM_STRSIZ(retlen, len); > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > if (error) > > > > goto nfsmout; > > > > > > I applied your patches and now I get a > > > > > > panic: len -4 > > > cpuid = 1 > > > KDB: enter: panic > > > Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > This means that the age driver either produced corrupted mbuf chain, > > or filled wrong negative value into the mbuf len field. I am quite > > certain that the issue is in the driver. > > > > I added the net@ to Cc:, hopefully you could get help there. > > And I've cc'd Pyun who has written most of this driver and is likely the one > most familiar with its handling of jumbo frames. > Try attached one and let me know how it goes. Note, I don't have age(4) anymore so it wasn't tested at all. 
--GvXjxJ+pjyke8COw Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="age.diff" Index: sys/dev/age/if_age.c =================================================================== --- sys/dev/age/if_age.c (revision 245870) +++ sys/dev/age/if_age.c (working copy) @@ -2289,9 +2289,7 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r nsegs = AGE_RX_NSEGS(index); sc->age_cdata.age_rxlen = AGE_RX_BYTES(le32toh(rxrd->len)); - if ((status & AGE_RRD_ERROR) != 0 && - (status & (AGE_RRD_CRC | AGE_RRD_CODE | AGE_RRD_DRIBBLE | - AGE_RRD_RUNT | AGE_RRD_OFLOW | AGE_RRD_TRUNC)) != 0) { + if ((status & (AGE_RRD_ERROR | AGE_RRD_LENGTH_NOK)) != 0) { /* * We want to pass the following frames to upper * layer regardless of error status of Rx return @@ -2301,9 +2299,18 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r * o frame length and protocol specific length * does not match. */ - sc->age_cdata.age_rx_cons += nsegs; - sc->age_cdata.age_rx_cons %= AGE_RX_RING_CNT; - return; + status |= AGE_RRD_IPCSUM_NOK | AGE_RRD_TCP_UDPCSUM_NOK; + if ((status & (AGE_RRD_CRC | AGE_RRD_CODE | AGE_RRD_DRIBBLE | + AGE_RRD_RUNT | AGE_RRD_OFLOW | AGE_RRD_TRUNC)) != 0) { + for (count = 0; count < nsegs; count++) { + rxd = &sc->age_cdata.age_rxdesc[rx_cons]; + desc = rxd->rx_desc; + desc->len = htole32(((MCLBYTES - ETHER_ALIGN) & + AGE_RD_LEN_MASK) << AGE_RD_LEN_SHIFT); + AGE_DESC_INC(rx_cons, AGE_RX_RING_CNT); + } + return; + } } pktlen = 0; @@ -2316,10 +2323,10 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r if (age_newbuf(sc, rxd) != 0) { ifp->if_iqdrops++; /* Reuse Rx buffers. 
*/ - if (sc->age_cdata.age_rxhead != NULL) { + desc->len = htole32(((MCLBYTES - ETHER_ALIGN) & + AGE_RD_LEN_MASK) << AGE_RD_LEN_SHIFT); + if (sc->age_cdata.age_rxhead != NULL) m_freem(sc->age_cdata.age_rxhead); - AGE_RXCHAIN_RESET(sc); - } break; } @@ -2383,9 +2390,11 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r */ if ((ifp->if_capenable & IFCAP_RXCSUM) != 0 && (status & AGE_RRD_IPV4) != 0) { - m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED; - if ((status & AGE_RRD_IPCSUM_NOK) == 0) + if ((status & AGE_RRD_IPCSUM_NOK) == 0) { + m->m_pkthdr.csum_flags |= + CSUM_IP_CHECKED; m->m_pkthdr.csum_flags |= CSUM_IP_VALID; + } if ((status & (AGE_RRD_TCP | AGE_RRD_UDP)) && (status & AGE_RRD_TCP_UDPCSUM_NOK) == 0) { m->m_pkthdr.csum_flags |= @@ -2411,17 +2420,11 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r AGE_UNLOCK(sc); (*ifp->if_input)(ifp, m); AGE_LOCK(sc); - - /* Reset mbuf chains. */ - AGE_RXCHAIN_RESET(sc); } } - if (count != nsegs) { - sc->age_cdata.age_rx_cons += nsegs; - sc->age_cdata.age_rx_cons %= AGE_RX_RING_CNT; - } else - sc->age_cdata.age_rx_cons = rx_cons; + /* Reset mbuf chains. */ + AGE_RXCHAIN_RESET(sc); } static int @@ -2460,12 +2463,13 @@ age_rxintr(struct age_softc *sc, int rr_prod, int (MCLBYTES - ETHER_ALIGN))) break; - prog++; /* Received a frame. */ age_rxeof(sc, rxrd); /* Clear return ring. 
*/ rxrd->index = 0; AGE_DESC_INC(rr_cons, AGE_RR_RING_CNT); + sc->age_cdata.age_rx_cons += nsegs; + sc->age_cdata.age_rx_cons %= AGE_RX_RING_CNT; } if (prog > 0) { --GvXjxJ+pjyke8COw-- From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 04:50:58 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id BDC33299; Fri, 25 Jan 2013 04:50:58 +0000 (UTC) (envelope-from pyunyh@gmail.com) Received: from mail-pb0-f50.google.com (mail-pb0-f50.google.com [209.85.160.50]) by mx1.freebsd.org (Postfix) with ESMTP id 6F60ECFC; Fri, 25 Jan 2013 04:50:58 +0000 (UTC) Received: by mail-pb0-f50.google.com with SMTP id wz7so5859035pbc.37 for ; Thu, 24 Jan 2013 20:50:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:from:date:to:cc:subject:message-id:reply-to:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=cu02ZUxruGu2a2KsBXED5VfD8QYlMsybFxpp/jYdzI0=; b=xk3CUOTF3Gd2V8KUKKfNdXqqgkGQ8vsF1UlrH1DKUNqLxBRlOLqGOK1quMR3BxrAdN jt83GTbLOvDC+b7LDyJfgYGgF0amBXSEOajX3VSfpQl32RE4gGgvELBuK0q3avpaLaj+ KJSONrzJUUpecEiCuv8qDwLTuXM5DF6kNEZ0C9T4wXCy9HPYTw922qg2vzK1u/wgRfn0 6pZ8jSnprYUc2QHuLkRKHU8lvAflF6cHf0W74MDFyhIrCKedfQP1ee8cOEDxZ0HH3Wgm pEGnyu7Um71y2qTsToKfz3wibh1i4VKQNDv3v/H6lzHRtpCb1oD9OEenXtgKU/2geR7w Ynmw== X-Received: by 10.69.16.100 with SMTP id fv4mr11085888pbd.135.1359089457030; Thu, 24 Jan 2013 20:50:57 -0800 (PST) Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. 
[114.111.62.249]) by mx.google.com with ESMTPS id pj1sm16131128pbb.71.2013.01.24.20.50.53 (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 24 Jan 2013 20:50:55 -0800 (PST) Received: by pyunyh@gmail.com (sSMTP sendmail emulation); Fri, 25 Jan 2013 13:50:48 +0900 From: YongHyeon PYUN Date: Fri, 25 Jan 2013 13:50:48 +0900 To: John Baldwin Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Message-ID: <20130125045048.GB1429@michelle.cdnetworks.com> References: <201301241805.57623.c47g@gmx.at> <201301242150.52238.c47g@gmx.at> <20130124212212.GM2522@kib.kiev.ua> <201301241721.51102.jhb@freebsd.org> <20130125043043.GA1429@michelle.cdnetworks.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="8P1HSweYDcXXzwPJ" Content-Disposition: inline In-Reply-To: <20130125043043.GA1429@michelle.cdnetworks.com> User-Agent: Mutt/1.4.2.3i Cc: freebsd-fs@freebsd.org, Christian Gusenbauer , yongari@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: pyunyh@gmail.com List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 04:50:58 -0000 --8P1HSweYDcXXzwPJ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote: > On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote: > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote: > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote: > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote: > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote: > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote: > > > > > > > > > Hi! 
> > > > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the panic below > > > > > > > > > if I execute the following commands (as single user): > > > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > > # mount -u / > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll attach the stack > > > > > > > > > trace. > > > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit network, > > > > > > > > > maybe that's the cause for the panic, because the bcopy (see stack > > > > > > > > > frame #15) fails. > > > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of rsize=32768 and mtu > > > > > > > > 6144, but the machine runs HEAD and em instead of age. I was unable > > > > > > > > to reproduce the panic on the copy of the 5GB file from nfs mount. > > > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just configuring > > > > > > age0 with > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > > > then I can copy all files from the mounted directory without any > > > > > > problems, too. So it's probably age0 related? > > > > > > > > > > From your backtrace and the buffer printout, I see somewhat strange thing. > > > > > The buffer data address is 0xffffff8171418000, while kernel faulted > > > > > at the attempt to write at 0xffffff8171413000, which is is lower then > > > > > the buffer data pointer, at the attempt to bcopy to the buffer. > > > > > > > > > > The other data suggests that there were no overflow of the data from the > > > > > server response. So it might be that mbuf_len(mp) returned negative number > > > > > ? 
I am not sure is it possible at all. > > > > > > > > > > Try this debugging patch, please. You need to add INVARIANTS etc to the > > > > > kernel config. > > > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c > > > > > index efc0786..9a6bda5 100644 > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > > > > *uiop, int siz) } > > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > > len = mbuf_len(mp); > > > > > + KASSERT(len > 0, ("len %d", len)); > > > > > } > > > > > xfer = (left > len) ? len : left; > > > > > #ifdef notdef > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio > > > > > *uiop, int siz) uiop->uio_resid -= xfer; > > > > > } > > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > > + uiop->uio_iovcnt)); > > > > > uiop->uio_iovcnt--; > > > > > uiop->uio_iov++; > > > > > } else { > > > > > > > > > > I thought that server have returned too long response, but it seems to > > > > > be not the case from your data. Still, I think the patch below might be > > > > > due. 
> > > > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct > > > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > > > > eof = fxdr_unsigned(int, *tl); > > > > > } > > > > > - NFSM_STRSIZ(retlen, rsize); > > > > > + NFSM_STRSIZ(retlen, len); > > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > > if (error) > > > > > goto nfsmout; > > > > > > > > I applied your patches and now I get a > > > > > > > > panic: len -4 > > > > cpuid = 1 > > > > KDB: enter: panic > > > > Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > > > This means that the age driver either produced corrupted mbuf chain, > > > or filled wrong negative value into the mbuf len field. I am quite > > > certain that the issue is in the driver. > > > > > > I added the net@ to Cc:, hopefully you could get help there. > > > > And I've cc'd Pyun who has written most of this driver and is likely the one > > most familiar with its handling of jumbo frames. > > > > Try attached one and let me know how it goes. > Note, I don't have age(4) anymore so it wasn't tested at all. Sorry, ignore previous patch and use this one(age.diff2) instead. 
--8P1HSweYDcXXzwPJ Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="age.diff2" Index: sys/dev/age/if_age.c =================================================================== --- sys/dev/age/if_age.c (revision 245870) +++ sys/dev/age/if_age.c (working copy) @@ -143,6 +143,7 @@ static void age_init_rr_ring(struct age_softc *); static void age_init_cmb_block(struct age_softc *); static void age_init_smb_block(struct age_softc *); static int age_newbuf(struct age_softc *, struct age_rxdesc *); +static void age_resetbuf(struct age_softc *, int, int); static void age_rxvlan(struct age_softc *); static void age_rxfilter(struct age_softc *); static int sysctl_age_stats(SYSCTL_HANDLER_ARGS); @@ -2289,9 +2290,7 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r nsegs = AGE_RX_NSEGS(index); sc->age_cdata.age_rxlen = AGE_RX_BYTES(le32toh(rxrd->len)); - if ((status & AGE_RRD_ERROR) != 0 && - (status & (AGE_RRD_CRC | AGE_RRD_CODE | AGE_RRD_DRIBBLE | - AGE_RRD_RUNT | AGE_RRD_OFLOW | AGE_RRD_TRUNC)) != 0) { + if ((status & (AGE_RRD_ERROR | AGE_RRD_LENGTH_NOK)) != 0) { /* * We want to pass the following frames to upper * layer regardless of error status of Rx return @@ -2301,9 +2300,12 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r * o frame length and protocol specific length * does not match. */ - sc->age_cdata.age_rx_cons += nsegs; - sc->age_cdata.age_rx_cons %= AGE_RX_RING_CNT; - return; + status |= AGE_RRD_IPCSUM_NOK | AGE_RRD_TCP_UDPCSUM_NOK; + if ((status & (AGE_RRD_CRC | AGE_RRD_CODE | AGE_RRD_DRIBBLE | + AGE_RRD_RUNT | AGE_RRD_OFLOW | AGE_RRD_TRUNC)) != 0) { + age_resetbuf(sc, rx_cons, nsegs); + return; + } } pktlen = 0; @@ -2316,10 +2318,9 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r if (age_newbuf(sc, rxd) != 0) { ifp->if_iqdrops++; /* Reuse Rx buffers. 
*/ - if (sc->age_cdata.age_rxhead != NULL) { + age_resetbuf(sc, rx_cons, nsegs - count); + if (sc->age_cdata.age_rxhead != NULL) m_freem(sc->age_cdata.age_rxhead); - AGE_RXCHAIN_RESET(sc); - } break; } @@ -2383,9 +2384,9 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r */ if ((ifp->if_capenable & IFCAP_RXCSUM) != 0 && (status & AGE_RRD_IPV4) != 0) { - m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED; if ((status & AGE_RRD_IPCSUM_NOK) == 0) - m->m_pkthdr.csum_flags |= CSUM_IP_VALID; + m->m_pkthdr.csum_flags |= + CSUM_IP_CHECKED | CSUM_IP_VALID; if ((status & (AGE_RRD_TCP | AGE_RRD_UDP)) && (status & AGE_RRD_TCP_UDPCSUM_NOK) == 0) { m->m_pkthdr.csum_flags |= @@ -2411,17 +2412,11 @@ age_rxeof(struct age_softc *sc, struct rx_rdesc *r AGE_UNLOCK(sc); (*ifp->if_input)(ifp, m); AGE_LOCK(sc); - - /* Reset mbuf chains. */ - AGE_RXCHAIN_RESET(sc); } } - if (count != nsegs) { - sc->age_cdata.age_rx_cons += nsegs; - sc->age_cdata.age_rx_cons %= AGE_RX_RING_CNT; - } else - sc->age_cdata.age_rx_cons = rx_cons; + /* Reset mbuf chains. */ + AGE_RXCHAIN_RESET(sc); } static int @@ -2460,12 +2455,13 @@ age_rxintr(struct age_softc *sc, int rr_prod, int (MCLBYTES - ETHER_ALIGN))) break; - prog++; /* Received a frame. */ age_rxeof(sc, rxrd); /* Clear return ring. 
*/ rxrd->index = 0; AGE_DESC_INC(rr_cons, AGE_RR_RING_CNT); + sc->age_cdata.age_rx_cons += nsegs; + sc->age_cdata.age_rx_cons %= AGE_RX_RING_CNT; } if (prog > 0) { @@ -3094,6 +3090,22 @@ age_newbuf(struct age_softc *sc, struct age_rxdesc } static void +age_resetbuf(struct age_softc *sc, int rx_cons, int count) +{ + struct age_rxdesc *rxd; + struct rx_desc *desc; + int n; + + for (n = 0; n < count; n++) { + rxd = &sc->age_cdata.age_rxdesc[rx_cons]; + desc = rxd->rx_desc; + desc->len = htole32(((MCLBYTES - ETHER_ALIGN) & + AGE_RD_LEN_MASK) << AGE_RD_LEN_SHIFT); + AGE_DESC_INC(rx_cons, AGE_RX_RING_CNT); + } +} + +static void age_rxvlan(struct age_softc *sc) { struct ifnet *ifp; --8P1HSweYDcXXzwPJ-- From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 08:36:21 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id C7A9EBCC for ; Fri, 25 Jan 2013 08:36:21 +0000 (UTC) (envelope-from jdc@koitsu.org) Received: from qmta14.emeryville.ca.mail.comcast.net (qmta14.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:44:76:96:27:212]) by mx1.freebsd.org (Postfix) with ESMTP id ADB99720 for ; Fri, 25 Jan 2013 08:36:21 +0000 (UTC) Received: from omta05.emeryville.ca.mail.comcast.net ([76.96.30.43]) by qmta14.emeryville.ca.mail.comcast.net with comcast id s8bg1k0020vp7WLAE8cLsF; Fri, 25 Jan 2013 08:36:20 +0000 Received: from koitsu.strangled.net ([67.180.84.87]) by omta05.emeryville.ca.mail.comcast.net with comcast id s8cK1k0081t3BNj8R8cKZW; Fri, 25 Jan 2013 08:36:19 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 6A48E73A1C; Fri, 25 Jan 2013 00:36:19 -0800 (PST) Date: Fri, 25 Jan 2013 00:36:19 -0800 From: Jeremy Chadwick To: Alexander Motin Subject: Re: disk "flipped" - a known problem? 
Message-ID: <20130125083619.GA51096@icarus.home.lan> References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50FED818.7070704@FreeBSD.org> User-Agent: Mutt/1.5.21 (2010-09-15) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20121106; t=1359102980; bh=Yu/l8WSrYcTVJGNg/Wi4+q++ziKm8aNU3Gv6N7ZpbPI=; h=Received:Received:Received:Date:From:To:Subject:Message-ID: MIME-Version:Content-Type; b=BJFH/4ScrPr2u6DGsqrcBMboiDyzKZYZpXARbRaVPlhc7aJmGdU7XICmmrEkzBpPL zeuIbjQFuBn58p9hreuftH6W6e9zFLfMvsSJr16Jas3O1r8FBL4/ceD37lPJWnfTHB ahqo9Oq9h257m3LiUahnFHdxmTIg5iFC9UQ7IjmsD0HKRwDWAdjuI1KW14pilCJd2x 0KBU2Po/jzlkbljb5RBa8Nn9S6JThGX21lbcN6WwXbFPxWQ8jDlxK6/glECK/cvot/ gLdEQ1FFSmubchhVcMDHcKUUtF90pquzDK1L2JXTuYLwSTxTzgyNNpVg1KjcJ5peNc oqOQzsTNPt8Lg== Cc: freebsd-fs@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 08:36:21 -0000 On Tue, Jan 22, 2013 at 08:19:04PM +0200, Alexander Motin wrote: > On 22.01.2013 00:16, Jeremy Chadwick wrote: > > (Please keep me CC'd as I am not subscribed) > > > > WRT this: > > > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html > > > > I can reproduce the first problem 100% of the time on my home system > > here. I can provide hardware specs if needed, but the important part is > > that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI > > mode (and does not share an IRQ), hot-swap bays are in use, and I'm > > using ahci.ko. > > > > I also want to make this clear to Andriy: I'm not saying "there's a > > problem with your disk". In my case, I KNOW there's a problem with the > > disk (that's the entire point to my tests! :-) ). 
> > > > In my case the disk is a WD Raptor (150GB, circa 2006) that has a very > > badly-designed firmware that goes completely catatonic when encountering > > certain sector-level conditions. That's not the problem though -- the > > problem is with FreeBSD apparently getting confused as to the internal > > state of its devices after a device falls off the bus and comes back. > > Explanation: > > > > 1. System powered off; disk is attached; system powered on, shows up as > > ada5. Can communicate with device in every way (the way I tend to test > > simple I/O is to use "smartctl -a /dev/ada5"). This disk has no > > filesystems or other "stuff" on it -- it's just a raw disk, so I believe > > the g_wither_washer oddity does not apply in this situation. > > > > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k" > > > > 3. Drive hits a bad sector which it cannot remap/deal with. Drive > > firmware design flaw results in drive becoming 100% stuck trying to > > re-read the sector and work out internal decisions to do remapping or > > not. Drive audibly clicking during this time (not actuator arm being > > reset to track 0 noise; some other mechanical issue). Due to firmware > > issue, drive remains in this state indefinitely. > > > > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ) > > errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5 > > times (kern.cam.da.retry_count+1). > > > > 5. FreeBSD spits out similar messages you see; retries exhausted, > > cam_periph_alloc error, and devfs claims device removal. > > > > 6. Drive is still catatonic of course. Only way to reset the drive is > > to power-cycle it. Drive removed from hot-swap bay, let sit for 20 > > seconds, then is reinserted. > > > > 7. FreeBSD sees the disk reappear, shows up much like it did during #1, > > except... > > > > 8. "smartctl -a /dev/ada5" claims no such device or unknown device type > > (I forget which). "ls -l /dev/ada5" shows an entry. 
"camcontrol > > devlist" shows the disk on the bus, yet I/O does not work. If I > > remember right, re-attempting the dd command returns some error (I > > forget which). > > > > 9. "camcontrol rescan all" stalls for quite some time when trying to > > communicate with entry 5, but eventually does return (I think with some > > error). "camcontrol reset all" works without a hitch. "camcontrol > > devlist" during this time shows the same disk on ada5 (which to me means > > ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning > > I/O works at some level). > > > > 10. System otherwise works fine, but the only way to bring back > > usability of ada5 is to reboot ("shutdown -r now"). > > > > To me, this looks like FreeBSD at some layer within the kernel (or some > > driver (I don't know which)) is internally confused about the true state > > of things. > > > > Alexander, do you have any ideas? > > > > I can enable CAM debugging (I do use options CAMDEBUG so I can toggle > > this with camcontrol) as well as take notes and do a full step-by-step > > diagnosis (along with relevant kernel output seen during each phase) if > > that would help you. And I can test patches but not against -CURRENT > > (will be a cold day in hell before I run that, sorry). > > Command timeout itself is not a reason for AHCI driver to drop the disk, > neither it is for CAM in case of payload requests. Disk can be dropped > if controller report device absence detected by SATA PHY, or by errors > during device reinitialization after reset by CAM SATA XPT. I have some theories as to why this is happening and it relates to the underlying design of the drive firmware and the drive controller used. I could write some pseudo-code showing how I believe the drive behaves, but it's really beside the point, as you point out below. > What is interesting, is what exactly goes on after disk got stuck and > you have removed it.
In normal case controller should immediately report > PHY status change, driver should run PHY reset and see that link is > lost. It should trigger bus rescan for CAM, that should invalidate > device. That should make dd abort with error. After dd gone, device > should be destroyed and ready for reattachment. Yup, that sounds exactly like what should happen. I know that in userland (dd) the command eventually does abort/fail with an error (I believe I/O error or some other message), and that's good. The device disappearing can also be confirmed. It's after the drive is power-cycled (to bring it back online) where it's re-tasted and I/O (at the kernel level) works, but now userland utilities interfacing with /dev/ada5 insist "unknown device" or "no such device". It's easier to show than it is to explain. My theory is that there is some kind of internal (kernel-level) "state" that is not being reset correctly when a device is lost and then brought back. > So it should be great if you start with the full verbose dmesg from the > boot up to the moment when system becomes stable after disk removal. If > it won't be enough, we can enable some more debugging with `camcontrol > debug -IPXp BUS`, where BUS is the bus number from `camcontrol devlist`. This is exactly what I needed; thank you! I'll spend some time tomorrow collecting the data + documenting and will provide the results once I've compiled them. This will be more useful than speculation on my part. -- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977.
PGP 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 10:45:16 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 3368F2E3; Fri, 25 Jan 2013 10:45:16 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id D8B74F95; Fri, 25 Jan 2013 10:45:14 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id MAA00680; Fri, 25 Jan 2013 12:45:08 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1TygmS-000FKK-H3; Fri, 25 Jan 2013 12:45:08 +0200 Message-ID: <51026233.2020601@FreeBSD.org> Date: Fri, 25 Jan 2013 12:45:07 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130121 Thunderbird/17.0.2 MIME-Version: 1.0 To: Larry Rosenman Subject: Re: My panic in amd64/pmap References: <38def6b37be1a3128fb1b64595e9044e@webmail.lerctr.org> <50F95964.6060706@FreeBSD.org> <6f1d46304fbcc6e32f51109f6ab4c60d@webmail.lerctr.org> In-Reply-To: X-Enigmail-Version: 1.4.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@FreeBSD.org, freebsd-emulation@FreeBSD.org, freebsd-current@FreeBSD.org, freebsd-amd64@FreeBSD.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 10:45:16 -0000 on 25/01/2013 03:11 Larry Rosenman said the following: > I've moved all the core.txt's to: > http://www.lerctr.org/~ler/FreeBSD-PMAP/ > > I got another one on FreeBSD9 today.... > > Is there ANYONE interested in this? > > These always seem to be ZFS induced..... 
> > I've added freebsd-fs to the cc list. > > I have vmcore's from them all. Can you try to reproduce the issue using the same VM image but in a different VM implementation? E.g. qemu... -- Andriy Gapon From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 11:36:15 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 07347AF1; Fri, 25 Jan 2013 11:36:15 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 3A7B72B3; Fri, 25 Jan 2013 11:36:14 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.6/8.14.6) with ESMTP id r0PBa6xe084024; Fri, 25 Jan 2013 13:36:06 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.4 kib.kiev.ua r0PBa6xe084024 Received: (from kostik@localhost) by tom.home (8.14.6/8.14.6/Submit) id r0PBa6Tv084023; Fri, 25 Jan 2013 13:36:06 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 25 Jan 2013 13:36:06 +0200 From: Konstantin Belousov To: Rick Macklem Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Message-ID: <20130125113606.GO2522@kib.kiev.ua> References: <201301241721.51102.jhb@freebsd.org> <1390810985.2342318.1359081578591.JavaMail.root@erie.cs.uoguelph.ca> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="E0e4ihfNxLmjeTLW" Content-Disposition: inline In-Reply-To: <1390810985.2342318.1359081578591.JavaMail.root@erie.cs.uoguelph.ca> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: 
freebsd-fs@freebsd.org, yongari@freebsd.org, Christian Gusenbauer X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 11:36:15 -0000 --E0e4ihfNxLmjeTLW Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jan 24, 2013 at 09:39:38PM -0500, Rick Macklem wrote: > John Baldwin wrote: > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer > > > wrote: > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer > > > > > wrote: > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov > > > > > > wrote: > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin > > > > > > > Belousov wrote: > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian > > > > > > > > Gusenbauer wrote: > > > > > > > > > Hi! > > > > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the > > > > > > > > > panic below > > > > > > > > > if I execute the following commands (as single user): > > > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > > # mount -u / > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > > # mount -t nfs -o rsize=3D32768 data:/multimedia /mnt > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll attach > > > > > > > > > the stack > > > > > > > > > trace. 
> > > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit > > > > > > > > > network, > > > > > > > > > maybe that's the cause for the panic, because the bcopy > > > > > > > > > (see stack > > > > > > > > > frame #15) fails. > > > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of > > > > > > > > rsize=3D32768 and mtu > > > > > > > > 6144, but the machine runs HEAD and em instead of age. I > > > > > > > > was unable > > > > > > > > to reproduce the panic on the copy of the 5GB file from > > > > > > > > nfs mount. > > > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just > > > > > > configuring > > > > > > age0 with > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > > > then I can copy all files from the mounted directory without > > > > > > any > > > > > > problems, too. So it's probably age0 related? > > > > > > > > > > From your backtrace and the buffer printout, I see somewhat > > > > > strange thing. > > > > > The buffer data address is 0xffffff8171418000, while kernel > > > > > faulted > > > > > at the attempt to write at 0xffffff8171413000, which is is lower > > > > > then > > > > > the buffer data pointer, at the attempt to bcopy to the buffer. > > > > > > > > > > The other data suggests that there were no overflow of the data > > > > > from the > > > > > server response. So it might be that mbuf_len(mp) returned > > > > > negative number > > > > > ? I am not sure is it possible at all. > > > > > > > > > > Try this debugging patch, please. You need to add INVARIANTS etc > > > > > to the > > > > > kernel config. 
> > > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c > > > > > b/sys/fs/nfs/nfs_commonsubs.c > > > > > index efc0786..9a6bda5 100644 > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > struct uio > > > > > *uiop, int siz) } > > > > > mbufcp =3D NFSMTOD(mp, caddr_t); > > > > > len =3D mbuf_len(mp); > > > > > + KASSERT(len > 0, ("len %d", len)); > > > > > } > > > > > xfer =3D (left > len) ? len : left; > > > > > #ifdef notdef > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > struct uio > > > > > *uiop, int siz) uiop->uio_resid -=3D xfer; > > > > > } > > > > > if (uiop->uio_iov->iov_len <=3D siz) { > > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > > + uiop->uio_iovcnt)); > > > > > uiop->uio_iovcnt--; > > > > > uiop->uio_iov++; > > > > > } else { > > > > > > > > > > I thought that server have returned too long response, but it > > > > > seems to > > > > > be not the case from your data. Still, I think the patch below > > > > > might be > > > > > due. > > > > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio > > > > > *uiop, struct > > > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > > > > eof =3D fxdr_unsigned(int, *tl); > > > > > } > > > > > - NFSM_STRSIZ(retlen, rsize); > > > > > + NFSM_STRSIZ(retlen, len); > > > > > error =3D nfsm_mbufuio(nd, uiop, retlen); > > > > > if (error) > > > > > goto nfsmout; > > > > > I think this patch is appropriate, although I don't see it as too > critical. It just tightens the "sanity check" on the read reply > length (which should never exceed what the client requested). 
I agree, but the client cannot control the server response. Anyway, I think there are too many things that could go wrong if the server actively exploits the client code. > > nfsm_mbufuio() shouldn't transfer more than the uio structure can > handle, even if the replied read size is larger than requested. Yes, this is what happens, I suppose, due to the decrement of the uio_iovcnt and the EBADRPC error return at the beginning of the loop. But IMO the situation should be caught and asserted instead. This is why I added KASSERT(uio_iovcnt > 1) before the decrement. I do not think that we should both add my KASSERT for iovcnt and leave the EBADRPC return. What is your preference there? > > It does seem that nfsm_mbufuio() should apply a sanity check on > m_len. I think m_len == 0 is ok, but negative or very large should > be checked for. Maybe just return EBADRPC after a printf() instead > of a KASSERT(), as a safety belt against a trashed m_len from a driver > or ??? The same as with the overflowed size, it would only hide another bug in the kernel. Probably, the assert m_len >= 0 and other useful assertions should be performed in a central place in the network stack, to catch an error earlier and for all consumers. > > rick > > > > > > I applied your patches and now I get a > > > > > > > > panic: len -4 > > > > cpuid = 1 > > > > KDB: enter: panic > > > > Dumping 377 out of 6116 > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > > > This means that the age driver either produced corrupted mbuf chain, > > > or filled wrong negative value into the mbuf len field. I am quite > > > certain that the issue is in the driver. > > > > > > I added the net@ to Cc:, hopefully you could get help there. > > > > And I've cc'd Pyun who has written most of this driver and is likely > > the one > > most familiar with its handling of jumbo frames.
> >=20 > > -- > > John Baldwin > > _______________________________________________ > > freebsd-fs@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" --E0e4ihfNxLmjeTLW Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJRAm4lAAoJEJDCuSvBvK1BsXEP/1/LNhc9/ugRGyAzhZyv/osk DL6tuIcfMfKaPekPG8hxQL5L/Pp5U0CAPnxwylVva82/A0qMo5l0PhOCUSVNluMu dfDh81dBYbJkKAgAeiN5s03OwfJCqGEubReXCDRW4O0xF0gQlGVdYHK7Fhm2zYxX ZlwbrjC1wlVnMa2NZ6NFT8MtXV4/jhoJ7wtyYWmwpJGo1faFH1m1GBabqM++S1UA 3ARV4t+nLAsifKFr4AHhCmoa2IuGvrWnzUwBcwSN5EDHyJlVswDRazYxRK3vDgbz dnG7egSEXlb8W24lzw+Fw/mZxX4s0+krmeh6yd2aC65BaaDd6vPA+uG2CGvj//lM FiS4yopOKDg2exdOKrn2AukLalqrZLsrHFWcpMDgwxyou9rTZ9HnbIPuL4X3Rpwd j68lSQCmHnwDAzP4aEt+sjVv5eZqwC/sfxsaPmpAUjtml1UDi+0ERXxL2loq4gM4 39k+qVKu1iJa7t630UvOeihLePS9OuyF7BvCk8cyCUARtfx/lpi7sTWgpan7ngxC ZYfP6KwkaWjuCRn7KjODB4TB0qyOEA6iX5QccdNhwN6bm1R0FxjvHChechg5t0ik sCXJrTe0Kk0rpbsXmyNsTZICXUe5osYTt3PFmqFUZvxJW+RoIcvnOzqc1K/u3w63 wUXMcErVczRhnQiMl0Vr =1nmz -----END PGP SIGNATURE----- --E0e4ihfNxLmjeTLW-- From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 12:17:59 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 906E2A1C for ; Fri, 25 Jan 2013 12:17:59 +0000 (UTC) (envelope-from laurencesgill@googlemail.com) Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by mx1.freebsd.org (Postfix) with ESMTP id 191776F3 for ; Fri, 25 Jan 2013 12:17:58 +0000 (UTC) Received: by mail-wi0-f173.google.com with SMTP id hn17so1356946wib.0 for ; Fri, 25 Jan 2013 04:17:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=20120113; h=x-received:date:from:to:subject:message-id:x-mailer:mime-version :content-type:content-transfer-encoding; 
Date: Fri, 25 Jan 2013 12:10:44 +0000
From: Laurence Gill
To: freebsd-fs@freebsd.org
Subject: HAST performance overheads?
Message-ID: <20130125121044.1afac72e@googlemail.com>

Hi all,

I realise the second line is going to sound pretty vague...

What sort of performance overhead should we expect when using HAST?

I'm seeing approximately 1/10th of the performance of sequential
writes when using the HAST layer. It's "stacked" like so:
"hard disk --> hast layer --> filesystem"

I've run tests using several setups of various numbers of disks and
UFS/ZFS - all are slower when I introduce HAST.

An example of this is using 6 disks of the following spec:
 - # dmesg | grep ^da0
     da0 at mps0 bus 0 scbus0 target 11 lun 0
     da0: <TOSHIBA MK1001TRKB DCA8> Fixed Direct Access SCSI-6 device
     da0: 600.000MB/s transfers
     da0: Command Queueing enabled
     da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)

If I create ZFS raidz2 on these...

 - # zpool create pool raidz2 da0 da1 da2 da3 da4 da5

Then run a dd test, a sample output is...

 - # dd if=/dev/zero of=test.dat bs=1M count=1024
     1073741824 bytes transferred in 7.689634 secs (139634974 bytes/sec)

 - # dd if=/dev/zero of=test.dat bs=16k count=65535
     1073725440 bytes transferred in 1.909157 secs (562408130 bytes/sec)

This is much faster compared to running HAST. I would expect an
overhead, but not this much. For example:

 - # hastctl create disk0/disk1/disk2/disk3/disk4/disk5
 - # hastctl role primary all
 - # zpool create pool raidz2 disk0 disk1 disk2 disk3 disk4 disk5

Run a dd test, and the speed is...

 - # dd if=/dev/zero of=test.dat bs=1M count=1024
     1073741824 bytes transferred in 40.908153 secs (26247624 bytes/sec)

 - # dd if=/dev/zero of=test.dat bs=16k count=65535
     1073725440 bytes transferred in 42.017997 secs (25553942 bytes/sec)

Note that no secondary server is set up, as this degrades the speed
even further and I have removed that for the testing.

We can see better speeds than this (up to approx 30MB/s) with metaflush
switched off. The async replication mode actually seems to degrade the
speed too, which was quite unexpected.

There is 1 exception where the speed seems quite good; by configuring
ZFS-on-HAST-on-ZVOL-on-ZFS :)

 - # zpool create pool raidz2 da0 da1 da2 da3 da4 da5
 - # zfs create -s -V 3T pool/hastpool
 - # hastctl create zhast
 - # zpool create zhast /dev/hast/zhast

However this setup seems quite insane to me, but we do get better
performance (metaflush off):

 - # dd if=/dev/zero of=test.dat bs=1M count=1024
     1073741824 bytes transferred in 14.057880 secs (76380067 bytes/sec)
 - # dd if=/dev/zero of=test.dat bs=16k count=65535
     1073725440 bytes transferred in 10.341796 secs (103823884 bytes/sec)

I guess the question for me really is why the big difference when
having just 1 HAST provider, in comparison to using 6? Would we see
this performance gain if we were to concatenate the disks together or
use a hardware RAID controller? If so, why? Is there a massive
performance overhead with using several providers?

Thanks in advance for reading.

Regards
Laurence

--
Laurence Gill

t: 01843 590 784
f: 08721 157 665
skype: laurencegg
e: laurencesgill@googlemail.com
PGP on Key Servers

From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 14:57:11 2013
Date: Fri, 25 Jan 2013 08:57:10 -0600
From: Larry Rosenman
To: Andriy Gapon
Subject: Re: My panic in amd64/pmap
In-Reply-To: <51026233.2020601@FreeBSD.org>
References: <38def6b37be1a3128fb1b64595e9044e@webmail.lerctr.org> <50F95964.6060706@FreeBSD.org> <6f1d46304fbcc6e32f51109f6ab4c60d@webmail.lerctr.org> <51026233.2020601@FreeBSD.org>
Cc: freebsd-fs@freebsd.org, freebsd-emulation@freebsd.org, freebsd-current@freebsd.org, freebsd-amd64@freebsd.org

On 2013-01-25 04:45, Andriy Gapon wrote:
> on 25/01/2013 03:11 Larry Rosenman said the following:
>> I've moved all the core.txt's to:
>> http://www.lerctr.org/~ler/FreeBSD-PMAP/
>>
>> I got another one on FreeBSD9 today....
>>
>> Is there ANYONE interested in this?
>>
>> These always seem to be ZFS induced.....
>>
>> I've added freebsd-fs to the cc list.
>>
>> I have vmcore's from them all.
>
> Can you try to reproduce the issue using the same VM image but in a
> different VM implementation? E.g. qemu...

Can qemu use a VBox setup, or is it easy to convert the vdi's and vbox
xml file?
--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640 (c)         E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893

From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 15:07:02 2013
Date: Fri, 25 Jan 2013 10:06:54 -0500 (EST)
From: Rick Macklem
To: Konstantin Belousov
Message-ID: <1436756651.2351478.1359126414802.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20130125113606.GO2522@kib.kiev.ua>
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Cc: freebsd-fs@freebsd.org, yongari@freebsd.org, Christian Gusenbauer
X-List-Received-Date:
Fri, 25 Jan 2013 15:07:02 -0000 Konstantin Belousov wrote: > On Thu, Jan 24, 2013 at 09:39:38PM -0500, Rick Macklem wrote: > > John Baldwin wrote: > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov > > > wrote: > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer > > > > wrote: > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov > > > > > wrote: > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian > > > > > > Gusenbauer > > > > > > wrote: > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov > > > > > > > wrote: > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin > > > > > > > > Belousov wrote: > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian > > > > > > > > > Gusenbauer wrote: > > > > > > > > > > Hi! > > > > > > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get > > > > > > > > > > the > > > > > > > > > > panic below > > > > > > > > > > if I execute the following commands (as single > > > > > > > > > > user): > > > > > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > > > # mount -u / > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll > > > > > > > > > > attach > > > > > > > > > > the stack > > > > > > > > > > trace. > > > > > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a > > > > > > > > > > 1Gbit > > > > > > > > > > network, > > > > > > > > > > maybe that's the cause for the panic, because the > > > > > > > > > > bcopy > > > > > > > > > > (see stack > > > > > > > > > > frame #15) fails. > > > > > > > > > > > > > > > > > > > > Any clues? 
> > > > > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of > > > > > > > > > rsize=32768 and mtu > > > > > > > > > 6144, but the machine runs HEAD and em instead of age. > > > > > > > > > I > > > > > > > > > was unable > > > > > > > > > to reproduce the panic on the copy of the 5GB file > > > > > > > > > from > > > > > > > > > nfs mount. > > > > > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so > > > > > > > just > > > > > > > configuring > > > > > > > age0 with > > > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > > > > > then I can copy all files from the mounted directory > > > > > > > without > > > > > > > any > > > > > > > problems, too. So it's probably age0 related? > > > > > > > > > > > > From your backtrace and the buffer printout, I see somewhat > > > > > > strange thing. > > > > > > The buffer data address is 0xffffff8171418000, while kernel > > > > > > faulted > > > > > > at the attempt to write at 0xffffff8171413000, which is is > > > > > > lower > > > > > > then > > > > > > the buffer data pointer, at the attempt to bcopy to the > > > > > > buffer. > > > > > > > > > > > > The other data suggests that there were no overflow of the > > > > > > data > > > > > > from the > > > > > > server response. So it might be that mbuf_len(mp) returned > > > > > > negative number > > > > > > ? I am not sure is it possible at all. > > > > > > > > > > > > Try this debugging patch, please. You need to add INVARIANTS > > > > > > etc > > > > > > to the > > > > > > kernel config. 
> > > > > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c > > > > > > b/sys/fs/nfs/nfs_commonsubs.c > > > > > > index efc0786..9a6bda5 100644 > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > struct uio > > > > > > *uiop, int siz) } > > > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > > > len = mbuf_len(mp); > > > > > > + KASSERT(len > 0, ("len %d", len)); > > > > > > } > > > > > > xfer = (left > len) ? len : left; > > > > > > #ifdef notdef > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > struct uio > > > > > > *uiop, int siz) uiop->uio_resid -= xfer; > > > > > > } > > > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > > > + uiop->uio_iovcnt)); > > > > > > uiop->uio_iovcnt--; > > > > > > uiop->uio_iov++; > > > > > > } else { > > > > > > > > > > > > I thought that server have returned too long response, but > > > > > > it > > > > > > seems to > > > > > > be not the case from your data. Still, I think the patch > > > > > > below > > > > > > might be > > > > > > due. > > > > > > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 > > > > > > 100644 > > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio > > > > > > *uiop, struct > > > > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > > > > > eof = fxdr_unsigned(int, *tl); > > > > > > } > > > > > > - NFSM_STRSIZ(retlen, rsize); > > > > > > + NFSM_STRSIZ(retlen, len); > > > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > > > if (error) > > > > > > goto nfsmout; > > > > > > > I think this patch is appropriate, although I don't see it as too > > critical. 
> > It just tightens the "sanity check" on the read reply
> > length (which should never exceed what the client requested).
> I agree, but the client cannot control the server response.
Righto. However, the server can put the correct "size" at the beginning
of the reply and then follow it with the wrong amount of data. Those are
the cases that the EBADRPC reply from nfsm_mbufuio() hopefully catches.

But I agree that changing "rsize" to "len" is correct in this case and
would catch the case where the server replies with too large a "size".

> Anyway, I think there are too many things that could go wrong if the
> server actively exploits the client code.
>
> > nfsm_mbufuio() shouldn't transfer more than the uio structure can
> > handle, even if the replied read size is larger than requested.
> Yes, this is what happens, I suppose, due to the decrement of uio_iovcnt
> and the EBADRPC error return at the beginning of the loop. But IMO the
> situation should be caught and asserted instead. This is why I added
> KASSERT(uio_iovcnt > 1) before the decrement.
>
> I do not think that we should both add my KASSERT for iovcnt and leave
> the EBADRPC return. What is your preference there?

Well, if a server sends a reply with "size == 16384", but then follows
it with 32768 bytes of data, I don't think that should panic the client,
since it isn't a client bug. I think this is where the EBADRPC reply for
iovcnt will happen. The reverse, a reply with "size == 32768" but
followed by 16384 bytes of data, would hit the end of the mbuf list. I
don't think that should be a KASSERT() either since, again, it is a
server bug and not a client one.

Now, for the case where the mbuf list is bogus (bad m_len or ???) I see
the argument for KASSERT()s, since it is a bug somewhere in the client
machine and that bug may soon corrupt (or may already have corrupted)
other data structures, even if there is no damage done for this case
once caught. I'm fine with KASSERT()s for these.
(A KASSERT() panic with a message is probably easier to debug than a bcopy() crash, although you did the latter amazingly well;-) > > > > It does seem that nfsm_mbufuio() should apply a sanity check on > > m_len. I think m_len == 0 is ok, but negative or very large should > > be checked for. Maybe just return EBADRPC after a printf() instead > > of a KASSERT(), as a safety belt against a trashed m_len from a > > driver > > or ??? > The same as with the overflowed size, it would only hide another bug > in > the kernel. Probably, the assert m_len >= 0 and other useful > assertions > should be performed in the central place in the network stack, to > catch > an error earlier and for all consumers. > Yes. That sounds like a good idea to me. To be honest, there are a lot of places where a bogus mbuf list can trash the NFS code, so catching it in a central place would be good, I think. Thanks for looking into this, rick > > > > rick > > > > > > > I applied your patches and now I get a > > > > > > > > > > panic: len -4 > > > > > cpuid = 1 > > > > > KDB: enter: panic > > > > > Dumping 377 out of 6116 > > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > > > > > This means that the age driver either produced corrupted mbuf > > > > chain, > > > > or filled wrong negative value into the mbuf len field. I am > > > > quite > > > > certain that the issue is in the driver. > > > > > > > > I added the net@ to Cc:, hopefully you could get help there. > > > > > > And I've cc'd Pyun who has written most of this driver and is > > > likely > > > the one > > > most familiar with its handling of jumbo frames. 
> > >
> > > --
> > > John Baldwin
> > > _______________________________________________
> > > freebsd-fs@freebsd.org mailing list
> > > http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > > To unsubscribe, send any mail to
> > > "freebsd-fs-unsubscribe@freebsd.org"

From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 15:25:00 2013
Date: Fri, 25 Jan 2013 10:24:51 -0500 (EST)
From: Rick Macklem
To: Konstantin Belousov
Message-ID: <582361609.2352469.1359127491504.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <1436756651.2351478.1359126414802.JavaMail.root@erie.cs.uoguelph.ca>
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Cc: freebsd-fs@freebsd.org, Christian Gusenbauer, yongari@freebsd.org
X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 15:25:00 -0000 I wrote: > Konstantin Belousov wrote: > > On Thu, Jan 24, 2013 at 09:39:38PM -0500, Rick Macklem wrote: > > > John Baldwin wrote: > > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov > > > > wrote: > > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer > > > > > wrote: > > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov > > > > > > wrote: > > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian > > > > > > > Gusenbauer > > > > > > > wrote: > > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov > > > > > > > > wrote: > > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin > > > > > > > > > Belousov wrote: > > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian > > > > > > > > > > Gusenbauer wrote: > > > > > > > > > > > Hi! > > > > > > > > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get > > > > > > > > > > > the > > > > > > > > > > > panic below > > > > > > > > > > > if I execute the following commands (as single > > > > > > > > > > > user): > > > > > > > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > > > > # mount -u / > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia > > > > > > > > > > > /mnt > > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll > > > > > > > > > > > attach > > > > > > > > > > > the stack > > > > > > > > > > > trace. 
> > > > > > > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a > > > > > > > > > > > 1Gbit > > > > > > > > > > > network, > > > > > > > > > > > maybe that's the cause for the panic, because the > > > > > > > > > > > bcopy > > > > > > > > > > > (see stack > > > > > > > > > > > frame #15) fails. > > > > > > > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of > > > > > > > > > > rsize=32768 and mtu > > > > > > > > > > 6144, but the machine runs HEAD and em instead of > > > > > > > > > > age. > > > > > > > > > > I > > > > > > > > > > was unable > > > > > > > > > > to reproduce the panic on the copy of the 5GB file > > > > > > > > > > from > > > > > > > > > > nfs mount. > > > > > > > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so > > > > > > > > just > > > > > > > > configuring > > > > > > > > age0 with > > > > > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > > > > > > > then I can copy all files from the mounted directory > > > > > > > > without > > > > > > > > any > > > > > > > > problems, too. So it's probably age0 related? > > > > > > > > > > > > > > From your backtrace and the buffer printout, I see > > > > > > > somewhat > > > > > > > strange thing. > > > > > > > The buffer data address is 0xffffff8171418000, while > > > > > > > kernel > > > > > > > faulted > > > > > > > at the attempt to write at 0xffffff8171413000, which is is > > > > > > > lower > > > > > > > then > > > > > > > the buffer data pointer, at the attempt to bcopy to the > > > > > > > buffer. > > > > > > > > > > > > > > The other data suggests that there were no overflow of the > > > > > > > data > > > > > > > from the > > > > > > > server response. So it might be that mbuf_len(mp) returned > > > > > > > negative number > > > > > > > ? I am not sure is it possible at all. 
> > > > > > > > > > > > > > Try this debugging patch, please. You need to add > > > > > > > INVARIANTS > > > > > > > etc > > > > > > > to the > > > > > > > kernel config. > > > > > > > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c > > > > > > > b/sys/fs/nfs/nfs_commonsubs.c > > > > > > > index efc0786..9a6bda5 100644 > > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript > > > > > > > *nd, > > > > > > > struct uio > > > > > > > *uiop, int siz) } > > > > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > > > > len = mbuf_len(mp); > > > > > > > + KASSERT(len > 0, ("len %d", len)); > > > > > > > } > > > > > > > xfer = (left > len) ? len : left; > > > > > > > #ifdef notdef > > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript > > > > > > > *nd, > > > > > > > struct uio > > > > > > > *uiop, int siz) uiop->uio_resid -= xfer; > > > > > > > } > > > > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > > > > + uiop->uio_iovcnt)); > > > > > > > uiop->uio_iovcnt--; > > > > > > > uiop->uio_iov++; > > > > > > > } else { > > > > > > > > > > > > > > I thought that server have returned too long response, but > > > > > > > it > > > > > > > seems to > > > > > > > be not the case from your data. Still, I think the patch > > > > > > > below > > > > > > > might be > > > > > > > due. 
> > > > > > > > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 > > > > > > > 100644 > > > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct > > > > > > > uio > > > > > > > *uiop, struct > > > > > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED); > > > > > > > eof = fxdr_unsigned(int, *tl); > > > > > > > } > > > > > > > - NFSM_STRSIZ(retlen, rsize); > > > > > > > + NFSM_STRSIZ(retlen, len); > > > > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > > > > if (error) > > > > > > > goto nfsmout; > > > > > > > > > I think this patch is appropriate, although I don't see it as too > > > critical. It just tightens the "sanity check" on the read reply > > > length (which should never exceed what the client requested). > > I agree, but client cannot control the server response. > Righto. However, the server can put the correct "size" at the > beginning > of the reply and then follow it with the wrong amount of data. Those > are > the cases that the EBADRPC reply from nfsm_mbufuio() hopefully > catches. > > But, I agree that changing "rsize" to "len" is correct in this case > and > would catch the case where the server replies with too large a "size". > > > Anyway, I > > think > > there is too much things that could go wrong if the server actively > > exploit the client code. > > > > > > > > nfsm_mbufuio() shouldn't transfer more than the uio structure can > > > handle, even if the replied read size is larger than requested. > > Yes, this what happen, I suppose, due to the decrement of the > > uio_iovcnt > > and the EBADRPC error return at the beginning of loop. But IMO the > > situation should be catched and asserted instead. This is why I > > added > > KASSERT(uio_iovcnt > 1) before the decrement. 
> > > > I do not think that we should both add my KASSERT for iovcnt and > > leave > > the EBADRPC return. What is your preference there ? > > > Well, if a server sends a reply with "size == 16384", but then follows > it with 32768bytes of data, I don't think that should panic the > client, > since it isn't a client bug. I think this is where the EBADRPC reply > for > iovcnt will happen. Oops, I realize this wouldn't do the iovcnt EBADRPC reply. I would just use the bytes after 16384 as the next field of the reply. I think this one can be a KASSERT(), assuming the patch that changes "rsize" to "len" for NFSM_STRSIZ() is applied. I'll take another look at the code and email again, if I still haven't gotten this right;-) rick > The reverse: A reply with "size == 32768", but followed by 16384bytes > of > data, would hit the end of the mbuf list. I don't think that should be > a KASSERT() either since, again, it is a server bug and not a client > one. > > Now, the case where the mbuf list is bogus (bad m_len or ???) I see > the argument for KASSERT()s, since it is a bug somewhere in the client > machine and that bug may soon (or have already) corrupted other data > structures, even if there is no damage done for this case once caught. > I'm fine with KASSERT()s for these. (A KASSERT() panic with a message > is probably easier to debug than a bcopy() crash, although you did the > latter amazingly well;-) > > > > > > > It does seem that nfsm_mbufuio() should apply a sanity check on > > > m_len. I think m_len == 0 is ok, but negative or very large should > > > be checked for. Maybe just return EBADRPC after a printf() instead > > > of a KASSERT(), as a safety belt against a trashed m_len from a > > > driver > > > or ??? > > The same as with the overflowed size, it would only hide another bug > > in > > the kernel. 
Probably, the assert m_len >= 0 and other useful > > assertions > > should be performed in the central place in the network stack, to > > catch > > an error earlier and for all consumers. > > > Yes. That sounds like a good idea to me. To be honest, there are a lot > of places where a bogus mbuf list can trash the NFS code, so catching > it in a central place would be good, I think. > > Thanks for looking into this, rick > > > > > > > rick > > > > > > > > > I applied your patches and now I get a > > > > > > > > > > > > panic: len -4 > > > > > > cpuid = 1 > > > > > > KDB: enter: panic > > > > > > Dumping 377 out of 6116 > > > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > > > > > > > > This means that the age driver either produced corrupted mbuf > > > > > chain, > > > > > or filled wrong negative value into the mbuf len field. I am > > > > > quite > > > > > certain that the issue is in the driver. > > > > > > > > > > I added the net@ to Cc:, hopefully you could get help there. > > > > > > > > And I've cc'd Pyun who has written most of this driver and is > > > > likely > > > > the one > > > > most familiar with its handling of jumbo frames. 
> > > > > > > > -- > > > > John Baldwin > > > > _______________________________________________ > > > > freebsd-fs@freebsd.org mailing list > > > > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > > > > To unsubscribe, send any mail to > > > > "freebsd-fs-unsubscribe@freebsd.org" > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 17:08:15 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id C6710BE9 for ; Fri, 25 Jan 2013 17:08:15 +0000 (UTC) (envelope-from c47g@gmx.at) Received: from mout.gmx.net (mout.gmx.net [212.227.17.21]) by mx1.freebsd.org (Postfix) with ESMTP id 74427A98 for ; Fri, 25 Jan 2013 17:08:15 +0000 (UTC) Received: from mailout-de.gmx.net ([10.1.76.2]) by mrigmx.server.lan (mrigmx001) with ESMTP (Nemesis) id 0MeNJB-1UNquz0Orf-00QDqu for ; Fri, 25 Jan 2013 18:08:10 +0100 Received: (qmail invoked by alias); 25 Jan 2013 17:08:09 -0000 Received: from cm56-168-232.liwest.at (EHLO bones.gusis.at) [86.56.168.232] by mail.gmx.net (mp002) with SMTP; 25 Jan 2013 18:08:09 +0100 X-Authenticated: #9978462 X-Provags-ID: V01U2FsdGVkX1+SgGJw7OqSOBofPCJSqsUzVxGRuVjDIRHBRhzPYn 8VStME5O41LgsX From: Christian Gusenbauer To: pyunyh@gmail.com Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory Date: Fri, 25 Jan 2013 18:09:50 +0100 User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; ) References: <201301241805.57623.c47g@gmx.at> <20130125043043.GA1429@michelle.cdnetworks.com> <20130125045048.GB1429@michelle.cdnetworks.com> In-Reply-To: <20130125045048.GB1429@michelle.cdnetworks.com> MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: 
<201301251809.50929.c47g@gmx.at> X-Y-GMX-Trusted: 0 Cc: freebsd-fs@freebsd.org, net@freebsd.org, yongari@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 17:08:15 -0000 On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote: > On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote: > > On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote: > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote: > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote: > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote: > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote: > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote: > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote: > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote: > > > > > > > > > > Hi! > > > > > > > > > > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the > > > > > > > > > > panic below if I execute the following commands (as > > > > > > > > > > single user): > > > > > > > > > > > > > > > > > > > > # swapon -a > > > > > > > > > > # dumpon /dev/ada0s3b > > > > > > > > > > # mount -u / > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp > > > > > > > > > > > > > > > > > > > > then the system panics almost immediately. I'll attach > > > > > > > > > > the stack trace. > > > > > > > > > > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit > > > > > > > > > > network, maybe that's the cause for the panic, because > > > > > > > > > > the bcopy (see stack frame #15) fails. 
> > > > > > > > > > > > > > > > > > > > Any clues? > > > > > > > > > > > > > > > > > > I tried a similar operation with the nfs mount of > > > > > > > > > rsize=32768 and mtu 6144, but the machine runs HEAD and em > > > > > > > > > instead of age. I was unable to reproduce the panic on the > > > > > > > > > copy of the 5GB file from nfs mount. > > > > > > > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just > > > > > > > configuring age0 with > > > > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 up > > > > > > > > > > > > > > then I can copy all files from the mounted directory without > > > > > > > any problems, too. So it's probably age0 related? > > > > > > > > > > > > From your backtrace and the buffer printout, I see a somewhat > > > > > > strange thing. The buffer data address is 0xffffff8171418000, > > > > > > while the kernel faulted at the attempt to write at > > > > > > 0xffffff8171413000, which is lower than the buffer data > > > > > > pointer, at the attempt to bcopy to the buffer. > > > > > > > > > > > > The other data suggests that there was no overflow of the data > > > > > > from the server response. So it might be that mbuf_len(mp) > > > > > > returned a negative number? I am not sure whether that is possible at all. > > > > > > > > > > > > Try this debugging patch, please. You need to add INVARIANTS etc. > > > > > > to the kernel config. > > > > > > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c > > > > > > b/sys/fs/nfs/nfs_commonsubs.c index efc0786..9a6bda5 100644 > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > struct uio *uiop, int siz) } > > > > > > > > > > > > mbufcp = NFSMTOD(mp, caddr_t); > > > > > > len = mbuf_len(mp); > > > > > > > > > > > > + KASSERT(len > 0, ("len %d", len)); > > > > > > > > > > > > } > > > > > > xfer = (left > len) ?
len : left; > > > > > > #ifdef notdef > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, > > > > > > struct uio *uiop, int siz) uiop->uio_resid -= xfer; > > > > > > } > > > > > > if (uiop->uio_iov->iov_len <= siz) { > > > > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d", > > > > > > + uiop->uio_iovcnt)); > > > > > > uiop->uio_iovcnt--; > > > > > > uiop->uio_iov++; > > > > > > } else { > > > > > > I thought that the server had returned a too-long response, but it > > > > > > seems not to be the case from your data. Still, I think the > > > > > > patch below might be due. > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644 > > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio > > > > > > *uiop, struct ucred *cred, NFSM_DISSECT(tl, u_int32_t *, > > > > > > NFSX_UNSIGNED); > > > > > > eof = fxdr_unsigned(int, *tl); > > > > > > } > > > > > > - NFSM_STRSIZ(retlen, rsize); > > > > > > + NFSM_STRSIZ(retlen, len); > > > > > > error = nfsm_mbufuio(nd, uiop, retlen); > > > > > > if (error) > > > > > > goto nfsmout; > > > > > I applied your patches and now I get a > > > > > panic: len -4 > > > > > cpuid = 1 > > > > > KDB: enter: panic > > > > > Dumping 377 out of 6116 > > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94% > > > > This means that the age driver either produced corrupted mbuf chain, > > > > or filled wrong negative value into the mbuf len field. I am quite > > > > certain that the issue is in the driver. > > > > I added the net@ to Cc:, hopefully you could get help there.
> > > > > > And I've cc'd Pyun who has written most of this driver and is likely > > > the one most familiar with its handling of jumbo frames. > > > > Try attached one and let me know how it goes. > > Note, I don't have age(4) anymore so it wasn't tested at all. > > Sorry, ignore previous patch and use this one(age.diff2) instead. Thanks for the patch! I ignored the first and applied only the second one, but unfortunately that did not change anything. I still get the "panic: len -4" :-(. Ciao, Christian. From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 21:12:37 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 072E545A for ; Fri, 25 Jan 2013 21:12:37 +0000 (UTC) (envelope-from jdc@koitsu.org) Received: from qmta12.emeryville.ca.mail.comcast.net (qmta12.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:44:76:96:27:227]) by mx1.freebsd.org (Postfix) with ESMTP id E0A7B9FA for ; Fri, 25 Jan 2013 21:12:36 +0000 (UTC) Received: from omta11.emeryville.ca.mail.comcast.net ([76.96.30.36]) by qmta12.emeryville.ca.mail.comcast.net with comcast id sDyF1k0060mlR8UACMCZc0; Fri, 25 Jan 2013 21:12:33 +0000 Received: from koitsu.strangled.net ([67.180.84.87]) by omta11.emeryville.ca.mail.comcast.net with comcast id sMCY1k00M1t3BNj8XMCYuh; Fri, 25 Jan 2013 21:12:33 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 4D1BE73A1B; Fri, 25 Jan 2013 13:12:32 -0800 (PST) Date: Fri, 25 Jan 2013 13:12:32 -0800 From: Jeremy Chadwick To: Alexander Motin Subject: Re: disk "flipped" - a known problem? 
Message-ID: <20130125211232.GA3037@icarus.home.lan> References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130125083619.GA51096@icarus.home.lan> User-Agent: Mutt/1.5.21 (2010-09-15) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20121106; t=1359148353; bh=VZptO9AI8u471i/dBbOu6hEHnAkJfaBEkGsD70HPFoY=; h=Received:Received:Received:Date:From:To:Subject:Message-ID: MIME-Version:Content-Type; b=l5PhHb8eqGG9o1Dq7XYm3e0ghU2fN2axf0+QflL/MwVwbVOBoQRC5TDYzaWAIDwyv /bia5qWjznnOr9wOaxL3J4q2Rw5fvihLxTTclhzzfC4vlNtp+/iPwSENXWmTT0msJh ph+1yfq1zZ/tVjUpPCwB029MR/WFODHzL87Aq/+KhsSg9AKPU3WvoOvdr9NZHH72k5 WRC3m1WA7EzBw7BdGDLuypE5tBu7GsHRGTwFDhTvPMMBPCRR8oWRM5bPjmKDzMWQIr VbJcYpqPcx0lgkKwJyXVFtvk3hkXm7sLvA/wCuWWRQSRjlsRG9uuBZrslYKVn9ayRp 7Ob+OIHasIMNw== Cc: freebsd-fs@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 21:12:37 -0000 On Fri, Jan 25, 2013 at 12:36:19AM -0800, Jeremy Chadwick wrote: > On Tue, Jan 22, 2013 at 08:19:04PM +0200, Alexander Motin wrote: > > On 22.01.2013 00:16, Jeremy Chadwick wrote: > > > (Please keep me CC'd as I am not subscribed) > > > > > > WRT this: > > > > > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html > > > > > > I can reproduce the first problem 100% of the time on my home system > > > here. I can provide hardware specs if needed, but the important part is > > > that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI > > > mode (and does not share an IRQ), hot-swap bays are in use, and I'm > > > using ahci.ko. > > > > > > I also want to make this clear to Andriy: I'm not saying "there's a > > > problem with your disk". 
In my case, I KNOW there's a problem with the > > > disk (that's the entire point to my tests! :-) ). > > > > > > In my case the disk is a WD Raptor (150GB, circa 2006) that has a very > > > badly-designed firmware that goes completely catatonic when encountering > > > certain sector-level conditions. That's not the problem though -- the > > > problem is with FreeBSD apparently getting confused as to the internal > > > state of its devices after a device falls off the bus and comes back. > > > Explanation: > > > > > > 1. System powered off; disk is attached; system powered on, shows up as > > > ada5. Can communicate with device in every way (the way I tend to test > > > simple I/O is to use "smartctl -a /dev/ada5"). This disk has no > > > filesystems or other "stuff" on it -- it's just a raw disk, so I believe > > > the g_wither_washer oddity does not apply in this situation. > > > > > > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k" > > > > > > 3. Drive hits a bad sector which it cannot remap/deal with. Drive > > > firmware design flaw results in drive becoming 100% stuck trying to > > > re-read the sector and work out internal decisions to do remapping or > > > not. Drive audibly clicking during this time (not actuator arm being > > > reset to track 0 noise; some other mechanical issue). Due to firmware > > > issue, drive remains in this state indefinitely. > > > > > > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ) > > > errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5 > > > times (kern.cam.da.retry_count+1). > > > > > > 5. FreeBSD spits out similar messages you see; retries exhausted, > > > cam_periph_alloc error, and devfs claims device removal. > > > > > > 6. Drive is still catatonic of course. Only way to reset the drive is > > > to power-cycle it. Drive removed from hot-swap bay, let sit for 20 > > > seconds, then is reinserted. > > > > > > 7. 
FreeBSD sees the disk reappear, shows up much like it did during #1, > > > except... > > > > > > 8. "smartctl -a /dev/ada5" claims no such device or unknown device type > > > (I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol > > > devlist" shows the disk on the bus, yet I/O does not work. If I > > > remember right, re-attempting the dd command returns some error (I > > > forget which). > > > > > > 9. "camcontrol rescan all" stalls for quite some time when trying to > > > communicate with entry 5, but eventually does return (I think with some > > > error). "camcontrol reset all" works without a hitch. "camcontrol > > > devlist" during this time shows the same disk on ada5 (which to me means > > > ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning > > > I/O works at some level). > > > > > > 10. System otherwise works fine, but the only way to bring back > > > usability of ada5 is to reboot ("shutdown -r now"). > > > > > > To me, this looks like FreeBSD at some layer within the kernel (or some > > > driver (I don't know which)) is internally confused about the true state > > > of things. > > > > > > Alexander, do you have any ideas? > > > > > > I can enable CAM debugging (I do use options CAMDEBUG so I can toggle > > > this with camcontrol) as well as take notes and do a full step-by-step > > > diagnosis (along with relevant kernel output seen during each phase) if > > > that would help you. And I can test patches but not against -CURRENT > > > (will be a cold day in hell before I run that, sorry). > > > > A command timeout itself is not a reason for the AHCI driver to drop the disk, > > nor is it for CAM in the case of payload requests. A disk can be dropped > > if the controller reports device absence detected by the SATA PHY, or on errors > > during device reinitialization after a reset by the CAM SATA XPT. > > I have some theories as to why this is happening and it relates to the > underlying design of the drive firmware and the drive controller used.
> I could write some pseudo-code showing how I believe the drive behaves, > but it's really beside the point, as you point out below. > > > What is interesting is what exactly goes on after the disk got stuck and > > you have removed it. In the normal case the controller should immediately report > > a PHY status change, the driver should run a PHY reset and see that the link is > > lost. It should trigger a bus rescan for CAM, which should invalidate the > > device. That should make dd abort with an error. After dd is gone, the device > > should be destroyed and ready for reattachment. > > Yup, that sounds exactly like what should happen. I know that in > userland (dd) the command eventually does abort/fail with an error (I > believe I/O error or some other message), and that's good. The device > disappearing can also be confirmed. It's after the drive is > power-cycled (to bring it back online) where it's re-tasted and I/O (at > the kernel level) works, but now userland utilities interfacing with > /dev/ada5 insist "unknown device" or "no such device". It's easier to > show than it is to explain. My theory is that there is some kind of > internal (kernel-level) "state" that is not being reset correctly when a > device is lost and then brought back. > > > So it would be great if you start with the full verbose dmesg from the > > boot up to the moment when the system becomes stable after disk removal. If > > it won't be enough, we can enable some more debugging with `camcontrol > > debug -IPXp BUS`, where BUS is the bus number from `camcontrol devlist`. > > This is exactly what I needed; thank you! > > I'll spend some time tomorrow collecting the data + documenting and will > provide the results once I've compiled them. This will be more useful > than speculation on my part. Finished. http://jdc.koitsu.org/freebsd/ahci_cam_testing/ I was not able to get the cam_periph_alloc error message; I'll talk about that at the end of my mail in an attempt to stay focused on what I did find / what I was able to reproduce.
Things start to get interesting around phase 23. Phase 31 is where things are confirmed broken in some way ("no such file or directory" even though /dev/ada5 is there). Direct I/O to /dev/ada5 still works (shown in phase 33), but smartctl ceases to work ***to that device only*** from then onward (e.g. smartctl ada0 works fine). A reboot is needed to recover from this. I'm aware that smartmontools uses xpt(4), and I think therein is where the issue is. The only difference between the tests/phases is that I issued "camcontrol reset" and "camcontrol rescan" prior to the breakage. Based on CAMDEBUG output in phase 36 it looks like xpt(4) is spinning on something internally and causing what I'm reporting. I can reproduce it reliably at least. Let me know what else I can do to help. If I need to turn on CAMDEBUG and re-do some of the tests + provide full kernel/CAMDEBUG output during each phase, no problem, just let me know what you need. I just hate risking interspersed kernel printf output... The rest of my Email from here is probably for a separate issue. ------ Now about cam_periph_alloc -- I wanted to provide proof that I have seen this message before / proving Andriy isn't crazy. 
:-) This is from when I was messing about with this bad disk the day I received it: Jan 18 19:54:57 icarus kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0 Jan 18 19:54:57 icarus kernel: ada5: ATA-7 SATA 1.x device Jan 18 19:54:57 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes) Jan 18 19:54:57 icarus kernel: ada5: Command Queueing enabled Jan 18 19:54:57 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C) Jan 18 19:54:57 icarus kernel: ada5: Previously was known as ad14 Jan 18 19:54:57 icarus kernel: cam_periph_alloc: attempt to re-allocate valid device pass5 rejected flags 0x18 refcount 1 Jan 18 19:54:57 icarus kernel: passasync: Unable to attach new device due to status 0x6: CCB request was invalid Jan 18 19:54:57 icarus kernel: GEOM_RAID: NVIDIA-6: Array NVIDIA-6 created. Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Force array start due to timeout. Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Disk ada5 state changed from NONE to ACTIVE. Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Subdisk RAID 0+1 279.47G:3-ada5 state changed from NONE to REBUILD. Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Array started. Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Volume RAID 0+1 279.47G state changed from STARTING to BROKEN. Jan 18 19:55:39 icarus kernel: GEOM_RAID: NVIDIA-6: Volume RAID 0+1 279.47G state changed from BROKEN to STOPPED. Jan 18 19:55:49 icarus kernel: GEOM_RAID: NVIDIA-6: Array NVIDIA-6 destroyed. So why didn't I see this message today? On January 20th I rebuild world/kernel after removing GEOM_RAID from my kernel config. The reason I removed GEOM_RAID is that, as you can see, that bad disk** was previously in a system (not my own) with an nVidia SATA chipset with their RAID option ROM enabled (my system is Intel, hence "array timeout" since there's no nVidia option ROM, I believe). I got sick and tired of having to "fight" with the kernel. 
The last two messages were a result of me doing "graid stop ada5". And of course "dd if=/dev/zero of=/dev/ada5 bs=64k" will cause GEOM to re-taste, causing the RAID metadata to get re-read, "NVIDIA-7" created, rinse lather repeat. But there's already a thread on this: http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016292.html Just easier for me to remove the option, that's all. **: People from all over the US send me bad disks for lots of reasons. Sometimes to do data recovery, sometimes to do forensics, blah blah. I have quite a collection and all with different behaviours. Bad disks are often fun/interesting. -- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 21:26:00 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 7C0A4E26 for ; Fri, 25 Jan 2013 21:26:00 +0000 (UTC) (envelope-from jdc@koitsu.org) Received: from qmta01.emeryville.ca.mail.comcast.net (qmta01.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:43:76:96:30:16]) by mx1.freebsd.org (Postfix) with ESMTP id 5469AABD for ; Fri, 25 Jan 2013 21:26:00 +0000 (UTC) Received: from omta20.emeryville.ca.mail.comcast.net ([76.96.30.87]) by qmta01.emeryville.ca.mail.comcast.net with comcast id sGDu1k0041smiN4A1MS0rF; Fri, 25 Jan 2013 21:26:00 +0000 Received: from koitsu.strangled.net ([67.180.84.87]) by omta20.emeryville.ca.mail.comcast.net with comcast id sMRz1k00H1t3BNj8gMRze4; Fri, 25 Jan 2013 21:25:59 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 277FB73A1C; Fri, 25 Jan 2013 13:25:59 -0800 (PST) Date: Fri, 25 Jan 2013 13:25:59 -0800 From: Jeremy Chadwick To: Alexander Motin Subject: Re: disk "flipped" - a known problem? 
Message-ID: <20130125212559.GA1772@icarus.home.lan> References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130125211232.GA3037@icarus.home.lan> User-Agent: Mutt/1.5.21 (2010-09-15) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20121106; t=1359149160; bh=jGuUf1OTXtxZrjIjXgBHxlmcf8c6hSnhkDanVC7AUaw=; h=Received:Received:Received:Date:From:To:Subject:Message-ID: MIME-Version:Content-Type; b=j/b0/5KbvhB/6XGfrAP7FXZb+8ZC8+56Cx+SGXFaq5BMD4i0K47u7+rThi/59G8JO KDW6y6e5W632Kgk0bxDNG+/gGGJhO2jpQtouK1xKK5PurFv46LmM4n7TWEeQy4jEAr hbgOsJBbuoj/toxPpjSqm8QC0Q6lYr29M4jRDa2jfaa/H8CXRa2CfkOUP8T+KPfoao X183Lg9GAqGcbQNNDip41uPCwGrva/bXCLeKwHHn+oB4evhuUXrod9NwNX/YoeP0uh PGa04oCD/p4Y4U/O3xxUngGC/GAd9+eoI4A6VW/nNv5Qs7djYTZECrySNkbsuRLjip G73TuF2q31o3w== Cc: freebsd-fs@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 25 Jan 2013 21:26:00 -0000 On Fri, Jan 25, 2013 at 01:12:32PM -0800, Jeremy Chadwick wrote: > On Fri, Jan 25, 2013 at 12:36:19AM -0800, Jeremy Chadwick wrote: > > On Tue, Jan 22, 2013 at 08:19:04PM +0200, Alexander Motin wrote: > > > On 22.01.2013 00:16, Jeremy Chadwick wrote: > > > > (Please keep me CC'd as I am not subscribed) > > > > > > > > WRT this: > > > > > > > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html > > > > > > > > I can reproduce the first problem 100% of the time on my home system > > > > here. 
> [...]

I just realised that the numbering scheme I was using for the phases is *completely* buggered. I obviously made a manual typo at some point and it just proliferated from there. I'll see if I can figure out where my mistake was and clean it up.
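[Editor's sketch: the debug-collection sequence suggested earlier in this thread (`camcontrol devlist`, then `camcontrol debug -IPXp BUS` on the affected bus) scripted roughly as below. It assumes a kernel built with "options CAMDEBUG"; the default bus number 0 is only an example, take yours from `camcontrol devlist -v`. Each command is echoed before it runs so the sequence stays visible in the captured log, and nothing is executed where camcontrol is absent.]

```shell
#!/bin/sh
# Sketch of the CAM debug-collection steps discussed in this thread.
# Assumes "options CAMDEBUG" in the kernel; BUS comes from
# `camcontrol devlist -v` (0 here is only an example default).

BUS="${1:-0}"

# Echo each command first so the sequence is visible in the log;
# skip execution entirely if camcontrol is not available.
run() {
	echo "# $*"
	if command -v camcontrol >/dev/null 2>&1; then
		"$@"
	fi
}

run camcontrol devlist -v           # find the affected bus
run camcontrol debug -IPXp "$BUS"   # enable the flags suggested above
run camcontrol rescan "$BUS"        # reproduce the stall under debug
run camcontrol debug off            # disable debugging again
```

Capture the console (or /var/log/messages) while this runs; the interspersed kernel printf output is exactly what is wanted here, per-phase.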
-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

From owner-freebsd-fs@FreeBSD.ORG Fri Jan 25 21:32:11 2013
Date: Fri, 25 Jan 2013 13:32:10 -0800
From: Jeremy Chadwick
To: Alexander Motin
Subject: Re: disk "flipped" - a known problem?
Message-ID: <20130125213209.GA1858@icarus.home.lan>
In-Reply-To: <20130125212559.GA1772@icarus.home.lan>
Cc: freebsd-fs@freebsd.org, avg@freebsd.org

On Fri, Jan 25, 2013 at 01:25:59PM -0800, Jeremy Chadwick wrote:
> On Fri, Jan 25, 2013 at 01:12:32PM -0800, Jeremy Chadwick wrote:
> > On Fri, Jan 25, 2013 at 12:36:19AM -0800, Jeremy Chadwick wrote:
> > > On Tue, Jan 22, 2013 at 08:19:04PM +0200, Alexander Motin wrote:
> > > > On 22.01.2013 00:16, Jeremy Chadwick wrote:
> > > > > (Please keep me CC'd as I am not subscribed)
> > > > >
> > > > > WRT this:
> > > > >
> > > > > http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html
> > > > >
> > > > > I can reproduce the first problem 100% of the time on my home system
> > > > > here. I can provide hardware specs if needed, but the important part
> > > > > is that I'm using RELENG_9 / r245697, the controller is an ICH9R in
> > > > > AHCI mode (and does not share an IRQ), hot-swap bays are in use, and
> > > > > I'm using ahci.ko.
> > > > >
> > > > > I also want to make this clear to Andriy: I'm not saying "there's a
> > > > > problem with your disk". In my case, I KNOW there's a problem with
> > > > > the disk (that's the entire point to my tests! :-) ).
> > > > >
> > > > > In my case the disk is a WD Raptor (150GB, circa 2006) that has a
> > > > > very badly-designed firmware that goes completely catatonic when
> > > > > encountering certain sector-level conditions. That's not the problem
> > > > > though -- the problem is with FreeBSD apparently getting confused as
> > > > > to the internal state of its devices after a device falls off the
> > > > > bus and comes back. Explanation:
> > > > >
> > > > > 1. System powered off; disk is attached; system powered on, shows up
> > > > > as ada5. Can communicate with device in every way (the way I tend to
> > > > > test simple I/O is to use "smartctl -a /dev/ada5"). This disk has no
> > > > > filesystems or other "stuff" on it -- it's just a raw disk, so I
> > > > > believe the g_wither_washer oddity does not apply in this situation.
> > > > >
> > > > > 2. "dd if=/dev/zero of=/dev/ada5 bs=64k"
> > > > >
> > > > > 3. Drive hits a bad sector which it cannot remap/deal with. Drive
> > > > > firmware design flaw results in drive becoming 100% stuck trying to
> > > > > re-read the sector and work out internal decisions to do remapping
> > > > > or not. Drive audibly clicking during this time (not actuator arm
> > > > > being reset to track 0 noise; some other mechanical issue). Due to
> > > > > firmware issue, drive remains in this state indefinitely.
> > > > >
> > > > > 4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes
> > > > > using NCQ) errors every 30 seconds (kern.cam.ada.default_timeout),
> > > > > for a total of 5 times (kern.cam.da.retry_count+1).
> > > > >
> > > > > 5. FreeBSD spits out similar messages you see; retries exhausted,
> > > > > cam_periph_alloc error, and devfs claims device removal.
> > > > >
> > > > > 6. Drive is still catatonic of course. Only way to reset the drive
> > > > > is to power-cycle it. Drive removed from hot-swap bay, let sit for
> > > > > 20 seconds, then is reinserted.
> > > > >
> > > > > 7. FreeBSD sees the disk reappear, shows up much like it did during
> > > > > #1, except...
> > > > >
> > > > > 8. "smartctl -a /dev/ada5" claims no such device or unknown device
> > > > > type (I forget which). "ls -l /dev/ada5" shows an entry. "camcontrol
> > > > > devlist" shows the disk on the bus, yet I/O does not work. If I
> > > > > remember right, re-attempting the dd command returns some error (I
> > > > > forget which).
> > > > >
> > > > > 9. "camcontrol rescan all" stalls for quite some time when trying to
> > > > > communicate with entry 5, but eventually does return (I think with
> > > > > some error). "camcontrol reset all" works without a hitch.
> > > > > "camcontrol devlist" during this time shows the same disk on ada5
> > > > > (which to me means ATA IDENTIFY, i.e. vendor strings, etc. are
> > > > > reobtained somehow, meaning I/O works at some level).
> > > > >
> > > > > 10. System otherwise works fine, but the only way to bring back
> > > > > usability of ada5 is to reboot ("shutdown -r now").
> > > > >
> > > > > To me, this looks like FreeBSD at some layer within the kernel (or
> > > > > some driver (I don't know which)) is internally confused about the
> > > > > true state of things.
> > > > >
> > > > > Alexander, do you have any ideas?
> > > > >
> > > > > I can enable CAM debugging (I do use options CAMDEBUG so I can
> > > > > toggle this with camcontrol) as well as take notes and do a full
> > > > > step-by-step diagnosis (along with relevant kernel output seen
> > > > > during each phase) if that would help you. And I can test patches
> > > > > but not against -CURRENT (will be a cold day in hell before I run
> > > > > that, sorry).
>
> [...]
>
> I just realised that the numbering scheme I was using for the phases is
> *completely* buggered. I obviously made a manual typo at some point and
> it just proliferated from there.
>
> I'll see if I can figure out where my mistake was and clean it up.

Yeah, there's no way for me to easily work out what happened here.
Sorry, I'm going to have to re-do these tests. I'm still fairly certain
the issue has to do with camcontrol reset or camcontrol rescan though,
but I can't work out exactly where my typo was in the files.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 01:17:57 2013
Date: Fri, 25 Jan 2013 17:17:54 -0800
From: Jeremy Chadwick
To:
Alexander Motin
Subject: Re: disk "flipped" - a known problem?
Message-ID: <20130126011754.GA1806@icarus.home.lan>
In-Reply-To: <20130125213209.GA1858@icarus.home.lan>
Cc: freebsd-fs@freebsd.org, avg@freebsd.org

On Fri, Jan 25, 2013 at 01:32:09PM -0800, Jeremy Chadwick wrote:
> [...]
>
> Yeah, there's no way for me to easily work out what happened here.
> Sorry, I'm going to have to re-do these tests. I'm still fairly certain
> the issue has to do with camcontrol reset or camcontrol rescan though,
> but I can't work out exactly where my typo was in the files.

Okay, I've figured out the exact, 100% reproducible condition that
causes the situation. It took me a lot of tries and a digital pocket
recorder to take verbal notes (there are just too many things to look at
simultaneously), but I've figured it out. I'm sorry for the verbosity,
but it's necessary.

Assume the disk we're talking about is /dev/ada5.

1. Prior to any issues, we have this:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw-r----- 1 root operator 0x8c Jan 25 16:41 /dev/ada5
crw------- 1 root operator 0x75 Jan 25 16:35 /dev/pass5
crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0

2. ada5 begins experiencing issues -- ATA commands (CDBs) submitted do
not get a response (not going to discuss how/why that can happen).

3.
These types of messages are seen on console (naturally the CDB and
request type will vary -- in this case it was because I was doing the dd
zero'ing, thus tickling the bad sector/naughty firmware on the drive):

Jan 25 16:29:28 icarus kernel: ahcich5: Timeout on slot 0 port 0
Jan 25 16:29:28 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017
Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset...
Jan 25 16:29:28 icarus kernel: ahcich5: SATA connect time=1000us status=00000113
Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset: device found
Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 80 77 01 40 00 00 00 00 00 00
Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout
Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command

4. Any I/O submitted to ada5 during this time blocks (this is normal).

5. **While this situation is happening**, something using xpt(4)
attempts to submit a CDB to the disk (ex. smartctl -a /dev/ada5). This
request also blocks (again, normal).

6. Physical device falls off the bus, or CAM kicks the disk off the
bus. Doesn't matter which. We see messages resembling this (boy am I
tired of this interspersed output problem):

Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): lost device
Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): lost device
Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry
Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): passdevgonecb: devfs entry is gone

7. Standard I/O requests fail with errno=6 "Device not configured".
xpt(4) requests also fail with the same errno.

8. Device-wise, at this stage all we have is:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0

9. Device comes back online for whatever reason. FreeBSD sees the disk,
blah blah blah:

Jan 25 16:30:16 icarus kernel: GEOM: new disk ada5
Jan 25 16:30:16 icarus kernel: ada5: ATA-7 SATA 1.x device
Jan 25 16:30:16 icarus kernel: ada5: Serial Number WD-WMAP41573589
Jan 25 16:30:16 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
Jan 25 16:30:16 icarus kernel: ada5: Command Queueing enabled
Jan 25 16:30:16 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C)
Jan 25 16:30:16 icarus kernel: ada5: Previously was known as ad14

...um, where's pass5?

10. /dev/pass5 is now completely (permanently) missing:

root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5*
crw-r----- 1 root operator 0x99 Jan 25 16:42 /dev/ada5
crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0

11. Any further attempts to communicate via xpt(4) with ada5 fail.
Detaching and reattaching the disk does not fix the issue; the only fix
is to reboot the system.

12. "camcontrol debug -IPXp scbus5" results in tons and tons of output,
all pertaining to xpt(4). It looks like xpt(4) is in some kind of loop.

Below is my verbose boot (with non-kernel things removed), which also
includes "camcontrol debug" output once things are in a bad state:

http://jdc.koitsu.org/freebsd/xpt_oddity.log

In this log you'll see that after 1 CAM timeout I yanked the drive,
then roughly 30 seconds later reinserted it. If you need me to turn on
CAM debugging *prior* to the above, I can do that, just let me know.

The important step is #5. Without that, the problem shown in #9/10/11
does not happen.

It's a good thing I don't run smartd(8) -- most users I see using that
software set the interval to something like 180s or 60s. Imagine this
frustration: "okay so the disk fell off the bus, but what, now I can't
talk to it with SMART? Uhhh... Err, works now? Whatever".

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.
PGP 4BD6C0CB |

From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 01:30:01 2013
Message-Id: <201301260130.r0Q1U1QA033345@freefall.freebsd.org>
To: freebsd-fs@FreeBSD.org
From: Harry Coin
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Sat, 26 Jan 2013 01:30:01 GMT

The following reply was made to PR kern/169480; it has been noted by GNATS.

From: Harry Coin
To: bug-followup@FreeBSD.org, levent.serinol@mynet.com
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Fri, 25 Jan 2013 19:26:10 -0600

I think I have an easier way to reproduce this problem on a very simple
setup:

Boot a 'nas4free' livecd version 573 (FreeBSD 9.1.0.1). Mount one zpool
with no dedup, a simple raidz 4-drive setup, 4GB memory. Don't enable
any of the various features, just get a shell from the live cd prompt.

Use dd to make a 20GB file on the pool, writing from /dev/random. While
that's going, go to another virtual console, cd to the pool, and do an
ls -l. It won't return.

Of interest: long before wired memory explodes to the limit, with ls
not returning while dd is still running, 'top' on another shell reports
99.7% idle, 2.6GB free RAM, dd in tx->tx, zfskern in zio->i, and intr
in WAIT.

Eventually disk activity as shown by the cheery little flickering lamp
slows, then stops. Characters still echo on the ls -l command that
hasn't returned, but no others. alt-tab strangely still works. The only
way out is to power cycle the box. Details here:

http://forums.nas4free.org/viewtopic.php?p=12519#p12519

Feels like completed I/Os are getting lost and the system just waits
and waits until there are no more resources left to issue new commands,
or too many threads are locked waiting for what will never return.

Harry

From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 02:10:01 2013
Message-Id: <201301260210.r0Q2A1Eo040039@freefall.freebsd.org>
To: freebsd-fs@FreeBSD.org
From: Jeremy Chadwick
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Precedence: list
Reply-To: Jeremy Chadwick List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 26 Jan 2013 02:10:01 -0000

The following reply was made to PR kern/169480; it has been noted by GNATS.

From: Jeremy Chadwick To: bug-followup@FreeBSD.org Cc: Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O Date: Fri, 25 Jan 2013 18:08:59 -0800

 Harry, things that come to mind immediately:

 1. http://www.quietfountain.com/fs1pool1.txt shows your pool when it contained both L2ARC and ZIL devices on SSDs. Please remove the SSDs from the picture entirely and use the raidz1 disks ada[2345] only at this point. I do not want to discuss ada[01] at this time, because they're SSDs. There are quite literally 4 or 5 "catches" to using these devices on FreeBSD ZFS, but the biggest problem -- and this WILL hurt you, no arguments about it -- is lack of TRIM support. You will hurt your SSDs over time doing this. If you want TRIM support on ZFS you will need to run -CURRENT. We can talk more about the SSDs later. As said, please remove them from the picture for starters, as all they do is make troubleshooting much, much harder.

 2. I do see some raw I/O benchmarks, but only for ada2. This is insufficient. A single disk performing like crap in a pool can slow down the entire response time for everything. I can do analysis of all of your disks if the issue is narrowed down to one of them. "gstat -I500ms" is a good way to watch I/O speeds in real time. I find this more effective than "zpool iostat -v 1" for per-device info.

 3. The ada[2345] disks involved are Hitachi HDS723015BLA642 (7K3000, 1.5TB), and there is sparse info on the web as to whether these are 512-byte or 4096-byte physical-sector disks. smartmontools 6.0 or newer will tell you. Regardless, all disks advertise 512 bytes as the logical size to remain fully compatible with legacy systems, but the performance hit on I/O is major if the device + pool ashift isn't 12.
 So please check this with smartmontools 6.0 or newer. If the disks physically use 4096-byte sectors, you need to use gnop(8) to align them and create the pool off of that. Ivan Voras wrote a wonderful guide on how to do this, and it's very simple:

 http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html

 It wouldn't hurt you to do this regardless, as there's no performance hit from using the gnop(8) method on 512-byte-sector drives; it would also "future-proof" you when upgrading to newer disks. You want ashift=12.

 4. Why are all of your drives partitioned? In other words, why are you using adaXpX rather than just adaX for your raidz1 pool? "gpart show" output was not provided, and I can only speculate as to what's going on under the hood there. Please use raw disks when recreating your pool, i.e. ada2, ada3, ada4, etc... I know for your cache/logs this is a different situation, but again, please remove those from the picture.

 5. Please keep your Hitachi disks on the Intel ICH7 controller for the time being. It's SATA300, but that isn't going to hurt these disks. Don't bring the Marvell into the picture yet. Don't change around cabling or anything else.

 6. For any process that takes a long while, you're going to need to run "procstat -kk" (yes, -k twice) against it.

 7. I do not think your issue is related to this PR. I would suggest discussing it on freebsd-fs first. Of course, you're also using something called "nas4free", which may or may not be *true, unaltered* FreeBSD -- I have no idea. I often find it frustrating when, say, the FreeNAS folks or other "FreeBSD fork projects" appear on the FreeBSD mailing lists "just because it uses FreeBSD". You always have to go to the vendor for support (like you did on their forum), but if you really think this is a FreeBSD "kernel thing", freebsd-fs is fine. Start with what I described above and go from there.
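 [Editor's sketch, not part of the original mail: a minimal outline of the gnop(8) alignment procedure described above. The pool name "tank" and the use of raw disks ada2-ada5 are assumptions; adapt to your layout. The temporary .nop device makes ZFS pick ashift=12 at creation time; it is then discarded.]

```shell
# Sketch only -- pool name "tank" and disk names are assumptions.
# Create a 4096-byte-sector shim on one member; ZFS derives the pool's
# ashift from the largest sector size it sees at creation.
gnop create -S 4096 /dev/ada2
zpool create tank raidz /dev/ada2.nop ada3 ada4 ada5

# Drop the shim: export the pool, destroy the nop device, re-import.
zpool export tank
gnop destroy /dev/ada2.nop
zpool import tank

# Verify the result: the vdevs should report ashift: 12.
zdb -C tank | grep ashift
```

 (These commands are FreeBSD-specific and destructive to existing pool data, so they are shown untested, as a reading aid only.)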
-- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 03:10:01 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id A83022B6 for ; Sat, 26 Jan 2013 03:10:01 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id 8CCC4AF1 for ; Sat, 26 Jan 2013 03:10:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0Q3A1tB051108 for ; Sat, 26 Jan 2013 03:10:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0Q3A1fe051107; Sat, 26 Jan 2013 03:10:01 GMT (envelope-from gnats) Date: Sat, 26 Jan 2013 03:10:01 GMT Message-Id: <201301260310.r0Q3A1fe051107@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org Cc: From: Harry Coin Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: Harry Coin List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 26 Jan 2013 03:10:01 -0000 The following reply was made to PR kern/169480; it has been noted by GNATS. From: Harry Coin To: bug-followup@FreeBSD.org, levent.serinol@mynet.com, Jeremy Chadwick Cc: Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O Date: Fri, 25 Jan 2013 21:04:05 -0600 Jeremy, Thanks. re: 1. Not only have I removed the zil but I removed the cache before running this test as well, thinking as you did the whole l2arc thing was adding too much complexity. 
 The pool was nothing more than the single raidz when doing the test that caused the failure noted.

 re 2: The raw benchmarks you saw for ada2 were identically duplicated for 3, 4 and 5. They are all exactly the same drive make and model, purchased at the same time.

 re 3: These are 512-byte/sector drives. v6.0 smartctl -a reports 512 bytes/sector, logical and physical.

 re 4: The drives are partitioned so that each of them can hold boot information should I decide someday to boot off the array, to make sure sector 0 of the partition starts on the afore-noticed 4K boundary, and so there is a couple of gig at the end for swap if I decide I want that. Partition 1 is just big enough for the boot code, 2 is the bulk of the storage, and 3 is the last 2 gig. So: no zil or cache used in the test, all formatted zfs v28, and a scrub reported zero errors on the array two days ago (though the scrub speed was 14M/s or so). It's something about max-speed writes getting interrupted by another read that kills zfs/nas4free somehow.

 re 5: The ata drives indeed are on the ICH7; SATA300 isn't an issue. The ssd's are on the higher-speed controller, and are unused as noted before.

 re 6: You'll note that 'ls -l' didn't return; the only thing still running was the 'top' program, launched before the dd was started.
 Here are the procstat -kk's for the dd and ls during their run (the ls not returning):

 nas4free:~# ps axu | grep dd
 root  4006 47.1  0.1  9916 1752  v0  R+   2:56AM 0:07.92 dd if=/dev/urandom of=foo bs=512 count=20000000
 root  4009  0.0  0.1 16280 2092   0  S+   2:56AM 0:00.00 grep dd
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# ps axu | grep ls
 root  4016  0.0  0.1 14420 2196  v2  D+   2:57AM 0:00.00 ls -l
 root  4018  0.0  0.1 16280 2084   0  RL+  2:57AM 0:00.00 grep ls
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~#

 re 7: The 'submit a problem' form doesn't list 'fs' as an option, just kern. If there's another place for fs bugs, I trust the freebsd folks will know what to do with it. Anyhow, I do think it's about write completions getting lost. And 'nas4free' is a pretty popular project, with the major virtue that anyone, anywhere looking to solve this problem can download a livecd iso and reproduce it. This is not my first rodeo; there are several bug reports with various black-magic tunable tweaks that, I think, generally avoid the bug only because their usage patterns and tweaks don't hit it. This particular iso freebsd download has the virtue of being completely reproducible.

 Does any of the above help?
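 [Editor's sketch, not part of the original mail: the repeated hand-run sampling above can be scripted. The PID value is hypothetical; use whatever `ps axu` reports for the hung command.]

```shell
# PID of the hung process -- 4016 is the stuck "ls -l" in the transcript above.
PID=4016

# procstat -kk is FreeBSD-specific; sample its kernel stack once a second
# until the process finally exits (kill -0 only tests for existence).
while kill -0 "$PID" 2>/dev/null; do
    procstat -kk "$PID"
    sleep 1
done
```

 (Shown untested, since procstat(1) exists only on FreeBSD.)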
 Also, here's an updated zpool status:

 nas4free:~# zpool status pool1
   pool: pool1
  state: ONLINE
   scan: scrub canceled on Fri Jan 25 22:07:51 2013
 config:

         NAME        STATE     READ WRITE CKSUM
         pool1       ONLINE       0     0     0
           raidz1-0  ONLINE       0     0     0
             ada2p2  ONLINE       0     0     0
             ada3p2  ONLINE       0     0     0
             ada4p2  ONLINE       0     0     0
             ada5p2  ONLINE       0     0     0

 errors: No known data errors

 nas4free:~# zpool iostat -v pool1
                capacity     operations    bandwidth
 pool        alloc   free   read  write   read  write
 ----------  -----  -----  -----  -----  -----  -----
 pool1       3.01T  2.43T    168      0   360K  3.42K
   raidz1    3.01T  2.43T    168      0   360K  3.42K
     ada2p2      -      -    124      0   104K  2.34K
     ada3p2      -      -    123      0   104K  2.28K
     ada4p2      -      -    124      0   103K  2.34K
     ada5p2      -      -    126      0  98.1K  2.29K
 ----------  -----  -----  -----  -----  -----  -----

 nas4free:~# zpool get all pool1
 NAME   PROPERTY       VALUE                SOURCE
 pool1  size           5.44T                -
 pool1  capacity       55%                  -
 pool1  altroot        -                    default
 pool1  health         ONLINE               -
 pool1  guid           1701438519865110975  default
 pool1  version        28                   default
 pool1  bootfs         -                    default
 pool1  delegation     on                   default
 pool1  autoreplace    off                  default
 pool1  cachefile      -                    default
 pool1  failmode       wait                 default
 pool1  listsnapshots  off                  default
 pool1  autoexpand     off                  default
 pool1  dedupditto     0                    default
 pool1  dedupratio     1.46x                -
 pool1  free           2.43T                -
 pool1  allocated      3.01T                -
 pool1  readonly       off                  -
 pool1  comment        -                    default
 pool1  expandsize     0                    -

 nas4free:~# zfs get all pool1/videos
 NAME          PROPERTY              VALUE                  SOURCE
 pool1/videos  type                  filesystem             -
 pool1/videos  creation              Sat Jan 19  5:12 2013  -
 pool1/videos  used                  526G                   -
 pool1/videos  available             1.74T                  -
 pool1/videos  referenced            526G                   -
 pool1/videos  compressratio         1.00x                  -
 pool1/videos  mounted               yes                    -
 pool1/videos  quota                 none                   local
 pool1/videos  reservation           none                   local
 pool1/videos  recordsize            128K                   default
 pool1/videos  mountpoint            /mnt/pool1/videos      inherited from pool1
 pool1/videos  sharenfs              off                    default
 pool1/videos  checksum              on                     default
 pool1/videos  compression           off                    local
 pool1/videos  atime                 off                    local
 pool1/videos  devices               on                     default
 pool1/videos  exec                  on                     default
 pool1/videos  setuid                on                     default
 pool1/videos  readonly              off                    local
 pool1/videos  jailed                off                    default
 pool1/videos  snapdir               hidden                 local
 pool1/videos  aclmode               discard                default
 pool1/videos  aclinherit            restricted             default
 pool1/videos  canmount              on                     local
 pool1/videos  xattr                 off                    temporary
 pool1/videos  copies                1                      default
 pool1/videos  version               5                      -
 pool1/videos  utf8only              off                    -
 pool1/videos  normalization         none                   -
 pool1/videos  casesensitivity       sensitive              -
 pool1/videos  vscan                 off                    default
 pool1/videos  nbmand                off                    default
 pool1/videos  sharesmb              off                    default
 pool1/videos  refquota              none                   default
 pool1/videos  refreservation        none                   default
 pool1/videos  primarycache          all                    default
 pool1/videos  secondarycache        all                    default
 pool1/videos  usedbysnapshots       0                      -
 pool1/videos  usedbydataset         526G                   -
 pool1/videos  usedbychildren        0                      -
 pool1/videos  usedbyrefreservation  0                      -
 pool1/videos  logbias               latency                default
 pool1/videos  dedup                 off                    local
 pool1/videos  mlslabel              -
 pool1/videos  sync                  standard               local
 pool1/videos  refcompressratio      1.00x                  -
 pool1/videos  written               526G                   -
 nas4free:~#

From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 04:00:01 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 346C9BA0 for ; Sat, 26 Jan 2013 04:00:01 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id 27159D36 for ; Sat, 26 Jan 2013 04:00:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0Q401dF059912 for ; Sat, 26 Jan 2013 04:00:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0Q401QP059909; Sat, 26 Jan 2013 04:00:01 GMT (envelope-from gnats) Date: Sat, 26 Jan 2013 04:00:01 GMT Message-Id: <201301260400.r0Q401QP059909@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org Cc: From: Jeremy Chadwick
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: Jeremy Chadwick List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 26 Jan 2013 04:00:01 -0000

The following reply was made to PR kern/169480; it has been noted by GNATS.

From: Jeremy Chadwick To: Harry Coin Cc: bug-followup@FreeBSD.org, levent.serinol@mynet.com Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O Date: Fri, 25 Jan 2013 19:55:26 -0800

 Recommendations:

 1. Instead of /dev/random, use /dev/zero. /dev/random is not blazingly fast, given that it has to harvest lots of entropy from various places. If you're doing I/O speed testing, just use /dev/zero; the speed difference is quite big.

 2. For dd, use bs=64k instead of bs=512. bs=512 is far from ideal; these are direct I/O writes of 512 bytes each, which is dog slow. I repeat: dog slow. Linux does this differently.

 3. During the dd, in another VTY or window, use "gstat -I500ms" and watch the I/O speeds for your ada[2345] disks. They should be hitting peaks between 60-150MBytes/sec under the far-right "Kbps" field (far left = read, far right = write). The large potential speed variance has to do with how much data you already have on the pool, i.e. mechanical HDDs get slower as the actuator arms move inward towards the spindle motor. That's why you might see, for example, 150MBytes/sec when reading/writing low-numbered LBAs but slower speeds when writing to high-numbered LBAs. This speed will be "bursty" and "sporadic" due to how the ZFS ARC works. The interval at which things are flushed to disk is based on the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should default to 5 (seconds).

 4. "zpool iostat -v {pool}" does not provide accurate speed indications, for the same reason a bare "iostat" doesn't show the information most people would hope for while "iostat 1" does. You need to run it with an interval, i.e. "zpool iostat -v {pool} 1", and let it run for a while while doing I/O. But I recommend using gstat like I said, simply because the interval can be set to 500ms (0.5s) and you get a better idea of what your peak I/O speed is. If you find a single disk that is **always** performing badly, then that disk is your bottleneck, and I can help you with analysis of its problem.

 5. Your "zpool scrub" speed being 14MBytes/second indicates you are nowhere close to your ideal I/O speed. It should not be that slow unless you're doing tons of I/O at the same time as the scrub. Also, scrubs take longer now due to the disabling of the vdev cache (and that's not a FreeBSD thing, it's that way in Illumos too, and it's a sensitive topic to discuss).

 6. On FreeBSD 9.1-RELEASE, generally speaking, you should not have to tune any sysctls. The situation was different in 8.x and 9.0. Your system only has 4GB of RAM, so prefetching automatically gets disabled, by the way, just in case you were wondering about that (there were problems with prefetch in older releases).

 7. You should probably keep "top -s 1" running, and you might even consider "top -S -s 1" to see system/kernel threads (they're in brackets). This isn't going to tell you outright what's making things slow, though. "vmstat -i" during heavy I/O would be useful too, just in case you somehow have a shared interrupt that's being pegged hard (for example, I've seen SATA controllers and USB controllers sharing an interrupt, even with APICs, where the USB layer is busted, churning out 1000 ints/sec and thus affecting SATA I/O speed).

 8. If you want to compare systems, I'm happy to do so, although I have fewer disks than you do (3 in raidz1, WD Red 1TB drives). However, my system is not a Pentium D-class processor; it's a Core 2 Quad Q9500. The D-class stuff is fairly old.

 I have some other theories as well, but one thing at a time.
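 [Editor's sketch, not part of the original mail: points 1-2 above as a runnable fragment. TESTDIR is an assumption and should point at a dataset on the pool being measured; the size is kept small here purely for illustration.]

```shell
# Hypothetical target directory -- point this at a dataset on the pool
# under test (e.g. /mnt/pool1); defaults to /tmp so the sketch runs anywhere.
TESTDIR=${TESTDIR:-/tmp}

# Sequential write from /dev/zero with a sane block size (64k, not 512).
# dd prints the elapsed time and transfer rate on stderr when it finishes.
dd if=/dev/zero of="$TESTDIR/ddtest" bs=64k count=256

# Meanwhile, in another terminal:  gstat -I500ms   (watch the write Kbps column)

# Clean up the test file.
rm -f "$TESTDIR/ddtest"
```

 A real benchmark would use a file much larger than RAM (Jeremy's thread uses 20GB) so the ARC cannot absorb the whole write.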
-- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB | From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 04:26:28 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id EC43FE6 for ; Sat, 26 Jan 2013 04:26:28 +0000 (UTC) (envelope-from freebsd@deman.com) Received: from plato.corp.nas.com (plato.corp.nas.com [66.114.32.138]) by mx1.freebsd.org (Postfix) with ESMTP id 8E5A9E35 for ; Sat, 26 Jan 2013 04:26:28 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by plato.corp.nas.com (Postfix) with ESMTP id 0283612E653F5; Fri, 25 Jan 2013 20:26:21 -0800 (PST) X-Virus-Scanned: amavisd-new at corp.nas.com Received: from plato.corp.nas.com ([127.0.0.1]) by localhost (plato.corp.nas.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id A7a6K8dOrxMd; Fri, 25 Jan 2013 20:26:19 -0800 (PST) Received: from [172.20.10.5] (mobile-166-147-080-234.mycingular.net [166.147.80.234]) by plato.corp.nas.com (Postfix) with ESMTPSA id D354612E653E7; Fri, 25 Jan 2013 20:26:16 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\)) Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD - mapping logical to physical drives From: Michael DeMan In-Reply-To: <20130123085832.GJ30633@server.rulingia.com> Date: Fri, 25 Jan 2013 20:26:02 -0800 Content-Transfer-Encoding: quoted-printable Message-Id: <32F8B97D-DE8D-4876-B127-91BE6AB9A854@deman.com> References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es> <16E9D784-D2F2-4C55-9138-907BF3957CE8@deman.com> <20130123085832.GJ30633@server.rulingia.com> To: Peter Jeremy X-Mailer: Apple Mail (2.1499) Cc: FreeBSD Filesystems X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Sat, 26 Jan 2013 04:26:29 -0000

Except that does not work when the technician just needs to know which darn physical disk to remove/replace by looking at the front of the box?

I think on this thread we are discussing some major topics - all of which would be good categories within a 'best practices' document:

#1. Best practices for design.
#2. Best practices for maintenance (logical) - a ZFS 'halt scrub' would be awesome - when to scrub, etc., how to label your disks; blends into #3...
#3. Best practices for physical maintenance - i.e. at the end of the day the ZFS theory breaks down if somebody can't look at the front of the machine and figure out which disk was the bad one?

Maintenance (in my definition) includes the scenario where the guy changing the disk is not necessarily a FreeBSD wizard, all-knowledgeable about gpart, gnop, glabel and such. Some idea that whoever designs, builds, deploys and then maintains the physical server over a 3-7 year service life - that one person is supposed to do all of that - does not work well for either small businesses or large businesses?

Sorta off topic on this - sorry.

On Jan 23, 2013, at 12:58 AM, Peter Jeremy wrote:

> On 2013-Jan-22 18:06:27 -0800, Michael DeMan wrote:
>> # OAIMFD 2011.04.13 adding this to force ordering on adaX disks
>> # dev.mvs.0.%desc: Marvell 88SX6081 SATA controller
>> # dev.mvs.1.%desc: Marvell 88SX6081 SATA controller
>>
>> hint.scbus.0.at="mvsch0"
>> hint.ada.0.at="scbus0"
> ...
>
> That only works until a BIOS or OS change alters the probe order and
> reverses the controller numbers.
>
> The correct solution to the problem is gpart labels - which rely on
> on-disk metadata and so don't care about changes in the path to the
> disk.
>
> --
> Peter Jeremy

From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 05:50:01 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 6A753C24 for ; Sat, 26 Jan 2013 05:50:01 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id 57F439D for ; Sat, 26 Jan 2013 05:50:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0Q5o1sH079535 for ; Sat, 26 Jan 2013 05:50:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0Q5o1YW079534; Sat, 26 Jan 2013 05:50:01 GMT (envelope-from gnats) Date: Sat, 26 Jan 2013 05:50:01 GMT Message-Id: <201301260550.r0Q5o1YW079534@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org Cc: From: Harry Coin Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: Harry Coin List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 26 Jan 2013 05:50:01 -0000

The following reply was made to PR kern/169480; it has been noted by GNATS.

From: Harry Coin To: bug-followup@FreeBSD.org, levent.serinol@mynet.com, jdc@koitsu.org Cc: Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O Date: Fri, 25 Jan 2013 23:49:23 -0600

 I appreciate your effort very much. One change from last time: I set failmode=continue instead of 'wait'. There were no errors in zpool or other logs.

 re 1,2: Switching to bs=64k and /dev/zero as the input source, the dd completes and the ls -l during the process causes no problems. Transfer speed reported by dd was 183 MBytes/sec.

 re 3: gstat -I500ms shows each of the raid partitions hanging in at about a top speed of 50-ish MBps during the above dd. All disks perform within a couple percent of one another.

 re 4: We'll stick with gstat.

 re 5: I see we agree about that. Nothing happened during the scrub other than the scrub. A 14MByte/s scrub speed is 1/6th of the minimum I'd expect.

 re 6: Tuning Is Evil. On the other hand, Crashing is Eviler. So I tried Evil Tuning, with poor results as noted upstream. So perhaps tuning is indeed not only evil but a black hole for time as well.

 Switching back to the 512-byte writes, I notice on gstat that the writes are zero for 4 secs or so, then a burst of activity (all partitions within a few % of one another), all quiet, repeat. Then I throw in the ls -l... and... it all works with no problems.

 Trying the dd with the big blocksize and /dev/urandom... gstat looks bursty as with the 512-byte writes. ls -l causes no changes, works. Trying the original /dev/urandom and bs=512... the gstat pattern does the usual bursty thing; ls -l works well. Thinking it's about the gstat... Doing top -S -s 1... nothing.

 nas4free:~# vmstat -i
 interrupt                          total       rate
 irq1: atkbd0                         168          0
 irq18: uhci2                        1136          0
 irq20: hpet0                      620136        328
 irq21: uhci0 ehci0                  7764          4
 irq256: hdac0                         37          0
 irq257: ahci0                       1090          0
 irq258: em0                         5625          2
 irq259: ahci1                     849746        450
 Total                            1485702        787

 So weird. It's not crashing. So, I tried to move a video from one dataset to another using mv. Within moments all the writes gstat shows have stopped, while there are a few K/s of reads. And there it sits.
Here's top last pid: 3932; load averages: 0.02, 0.04, 0.13 up 0+00:24:46 05:03:53 249 processes: 3 running, 229 sleeping, 17 waiting CPU: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle Mem: 14M Active, 19M Inact, 953M Wired, 8256K Buf, 2246M Free Swap: PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND 11 root 155 ki31 0K 32K CPU1 1 22:02 100.00% idle{idle: cpu1} 11 root 155 ki31 0K 32K RUN 0 22:06 98.49% idle{idle: cpu0} 13 root -8 - 0K 48K - 1 0:06 0.10% geom{g_up} 5 root -8 - 0K 80K zio->i 0 0:05 0.10% zfskern{txg_thread_enter} 0 root -16 0 0K 2624K sched 1 2:30 0.00% kernel{swapper} 0 root -16 0 0K 2624K - 0 0:30 0.00% kernel{zio_read_intr_1} 0 root -16 0 0K 2624K - 0 0:30 0.00% kernel{zio_read_intr_0} 0 root -16 0 0K 2624K - 1 0:10 0.00% kernel{zio_write_issue_} 0 root -16 0 0K 2624K - 1 0:10 0.00% kernel{zio_write_issue_} 12 root -88 - 0K 272K WAIT 1 0:08 0.00% intr{irq259: ahci1} 0 root -16 0 0K 2624K - 1 0:07 0.00% kernel{zio_write_intr_2} 0 root -16 0 0K 2624K - 1 0:07 0.00% kernel{zio_write_intr_3} 0 root -16 0 0K 2624K - 1 0:07 0.00% kernel{zio_write_intr_6} 0 root -16 0 0K 2624K - 1 0:07 0.00% kernel{zio_write_intr_1} 0 root -16 0 0K 2624K - 0 0:07 0.00% kernel{zio_write_intr_0} 0 root -16 0 0K 2624K - 0 0:07 0.00% kernel{zio_write_intr_4} 0 root -16 0 0K 2624K - 1 0:07 0.00% kernel{zio_write_intr_7} 0 root -16 0 0K 2624K - 1 0:07 0.00% kernel{zio_write_intr_5} 13 root -8 - 0K 48K - 1 0:07 0.00% geom{g_down} 3919 root 20 0 9916K 1636K tx->tx 1 0:07 0.00% mv 20 root 16 - 0K 16K syncer 0 0:03 0.00% syncer 12 root -60 - 0K 272K WAIT 0 0:01 0.00% intr{swi4: clock} 12 root -52 - 0K 272K WAIT 1 0:01 0.00% intr{swi6: Giant task} 14 root -16 - 0K 16K - 0 0:01 0.00% yarrow 5 root -8 - 0K 80K arc_re 0 0:01 0.00% zfskern{arc_reclaim_thre} 12 root -84 - 0K 272K WAIT 1 0:01 0.00% intr{irq1: atkbd0} 0 root -16 0 0K 2624K - 1 0:00 0.00% kernel{zio_write_issue_} 0 root -8 0 0K 2624K - 0 0:00 0.00% kernel{zil_clean} 0 root -16 0 0K 2624K - 1 0:00 
0.00% kernel{zio_write_issue_} Set failmode back to wait. Rebooted from the livecd and now repeating the tests: And they work properly. So, with gstat running I once again try to move a video file from one dataset to another. I watch a bunch of reads happen, the a burst of writes, and that goes on for a couple of minutes. Then the writes stop and the reads continue at 3Kbps according to gstat. The total memory still has 2.5G free. There is no writing to the zpool. I notice that one of the video files has moved, and the writing stopped when it was trying to open the next one I'm guessing. I can sucessfully ^C out of the mv, and notice a constant low speed read going on, 80-100KBps. I try restarting the move, no change to the read pattern, no writes. So, no memory explosion, ^C works, but I can't write, and gstat reports a steady read low pace read for no obvious reason, just as it was when the mv process was active. So, a huger puzzle. From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 06:00:01 2013 Return-Path: Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 81AADCE7 for ; Sat, 26 Jan 2013 06:00:01 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:1900:2254:206c::16:87]) by mx1.freebsd.org (Postfix) with ESMTP id 721F7D0 for ; Sat, 26 Jan 2013 06:00:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0Q600OZ081133 for ; Sat, 26 Jan 2013 06:00:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0Q600He081129; Sat, 26 Jan 2013 06:00:00 GMT (envelope-from gnats) Date: Sat, 26 Jan 2013 06:00:00 GMT Message-Id: <201301260600.r0Q600He081129@freefall.freebsd.org> To: freebsd-fs@FreeBSD.org Cc: From: Harry Coin Subject: Re: 
kern/169480: [zfs] ZFS stalls on heavy I/O
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: Harry Coin
List-Id: Filesystems
X-List-Received-Date: Sat, 26 Jan 2013 06:00:01 -0000

The following reply was made to PR kern/169480; it has been noted by GNATS.

From: Harry Coin
To: bug-followup@FreeBSD.org, levent.serinol@mynet.com, jdc@koitsu.org
Cc:
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Fri, 25 Jan 2013 23:49:15 -0600

 I appreciate your effort very much. One change from last time: I set failmode=continue, from 'wait'. There were no errors in zpool or other logs.

 re 1,2: switching to bs=64k and /dev/zero as the input source, the dd completes and the ls -l during the process causes no problems. Transfer speed reported by dd: 183 MBytes/sec.

 re 3: gstat -I500ms shows each of the raid partitions hanging in at about a top speed of 50-ish MBps during the above dd. All disks perform within a couple % of one another.

 re 4: we'll stick with gstat.

 re 5: I see we agree about that. Nothing happened during the scrub other than the scrub. 14Mbps scrub speed is 1/6th of the minimum I'd expect.

 re 6: Tuning Is Evil. On the other hand, Crashing is Eviler. So I tried Evil Tuning, with poor results as noted upstream. So, perhaps tuning is indeed not only evil but a black hole for time as well.

 Switching back to the 512-byte writes, I notice in gstat that the writes are zero for 4 seconds or so, then a burst of activity (all partitions within a few % of one another), all quiet, repeat. Then I throw in the ls -l... and... it all works with no problems. Trying the dd with the big block size and /dev/urandom... gstat looks bursty as with the 512-byte writes; ls -l causes no changes, works. Trying the original /dev/urandom and bs=512... the gstat pattern does the usual bursty thing, ls -l works well.

 Thinking it's about the gstat... Doing top -S -s 1... nothing.

 nas4free:~# vmstat -i
 interrupt            total  rate
 irq1: atkbd0           168     0
 irq18: uhci2          1136     0
 irq20: hpet0        620136   328
 irq21: uhci0 ehci0    7764     4
 irq256: hdac0           37     0
 irq257: ahci0         1090     0
 irq258: em0           5625     2
 irq259: ahci1       849746   450
 Total              1485702   787

 So weird. It's not crashing. So, I tried to move a video from one dataset to another using mv. Within moments all the writes gstat shows have stopped, while there are a few K/s of reads. And there it sits. Here's top:

 last pid:  3932;  load averages:  0.02,  0.04,  0.13    up 0+00:24:46  05:03:53
 249 processes: 3 running, 229 sleeping, 17 waiting
 CPU:  0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
 Mem: 14M Active, 19M Inact, 953M Wired, 8256K Buf, 2246M Free
 Swap:

   PID USERNAME  PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    11 root      155 ki31     0K    32K CPU1    1  22:02 100.00% idle{idle: cpu1}
    11 root      155 ki31     0K    32K RUN     0  22:06  98.49% idle{idle: cpu0}
    13 root       -8    -     0K    48K -       1   0:06   0.10% geom{g_up}
     5 root       -8    -     0K    80K zio->i  0   0:05   0.10% zfskern{txg_thread_enter}
     0 root      -16    0     0K  2624K sched   1   2:30   0.00% kernel{swapper}
     0 root      -16    0     0K  2624K -       0   0:30   0.00% kernel{zio_read_intr_1}
     0 root      -16    0     0K  2624K -       0   0:30   0.00% kernel{zio_read_intr_0}
     0 root      -16    0     0K  2624K -       1   0:10   0.00% kernel{zio_write_issue_}
     0 root      -16    0     0K  2624K -       1   0:10   0.00% kernel{zio_write_issue_}
    12 root      -88    -     0K   272K WAIT    1   0:08   0.00% intr{irq259: ahci1}
     0 root      -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_2}
     0 root      -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_3}
     0 root      -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_6}
     0 root      -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_1}
     0 root      -16    0     0K  2624K -       0   0:07   0.00% kernel{zio_write_intr_0}
     0 root      -16    0     0K  2624K -       0   0:07   0.00% kernel{zio_write_intr_4}
     0 root      -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_7}
     0 root      -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_5}
    13 root       -8    -     0K    48K -       1   0:07   0.00% geom{g_down}
  3919 root       20    0  9916K  1636K tx->tx  1   0:07   0.00% mv
    20 root       16    -     0K    16K syncer  0   0:03   0.00% syncer
    12 root      -60    -     0K   272K WAIT    0   0:01   0.00% intr{swi4: clock}
    12 root      -52    -     0K   272K WAIT    1   0:01   0.00% intr{swi6: Giant task}
    14 root      -16    -     0K    16K -       0   0:01   0.00% yarrow
     5 root       -8    -     0K    80K arc_re  0   0:01   0.00% zfskern{arc_reclaim_thre}
    12 root      -84    -     0K   272K WAIT    1   0:01   0.00% intr{irq1: atkbd0}
     0 root      -16    0     0K  2624K -       1   0:00   0.00% kernel{zio_write_issue_}
     0 root       -8    0     0K  2624K -       0   0:00   0.00% kernel{zil_clean}
     0 root      -16    0     0K  2624K -       1   0:00   0.00% kernel{zio_write_issue_}

 Set failmode back to wait. Rebooted from the livecd and now repeating the tests: and they work properly. So, with gstat running I once again try to move a video file from one dataset to another. I watch a bunch of reads happen, then a burst of writes, and that goes on for a couple of minutes. Then the writes stop and the reads continue at 3 KBps according to gstat. There is still 2.5G of memory free. There is no writing to the zpool. I notice that one of the video files has moved, and the writing stopped when it was trying to open the next one, I'm guessing. I can successfully ^C out of the mv, and notice a constant low-speed read going on, 80-100 KBps. I try restarting the move: no change to the read pattern, no writes.

 So, no memory explosion, ^C works, but I can't write, and gstat reports a steady low-paced read for no obvious reason, just as when the mv process was active. So, an even bigger puzzle.
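[Editor's note: the test sequence described in the mail can be sketched roughly as below. This is a reconstruction, not part of the original thread; a temporary directory stands in for the ZFS dataset, the pool name "tank" is a placeholder, and the FreeBSD-specific monitoring and zpool commands are shown as comments since they need real hardware.]

```shell
#!/bin/sh
# Sketch of the I/O tests from the mail; target paths are placeholders.
TESTDIR=$(mktemp -d)

# 512-byte writes from /dev/urandom -- the case that produced the bursty
# write pattern in gstat (quiet for ~4 s, then a burst, repeat):
dd if=/dev/urandom of="$TESTDIR/small" bs=512 count=2048 2>/dev/null

# 64 KiB blocks from /dev/zero -- the case that completed at ~183 MBytes/sec:
dd if=/dev/zero of="$TESTDIR/large" bs=64k count=1024 2>/dev/null

# On the FreeBSD box itself, activity was watched with:
#   gstat -I 500ms        # per-provider throughput, 500 ms sampling interval
#   vmstat -i             # interrupt counts and rates
#   top -S -s 1           # include system processes, 1 s refresh
#
# The failmode property change mentioned in the mail ("tank" is hypothetical):
#   zpool set failmode=continue tank  # return EIO to new I/O instead of blocking
#   zpool set failmode=wait tank      # the default, restored afterwards

wc -c "$TESTDIR/small" "$TESTDIR/large"
rm -rf "$TESTDIR"
```

With these counts the small file comes out to 1 MiB (512 * 2048 bytes) and the large one to 64 MiB (65536 * 1024 bytes), enough to make the burst-versus-streaming difference visible in gstat on a real pool.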
From owner-freebsd-fs@FreeBSD.ORG Sat Jan 26 10:16:35 2013 Return-Path: Delivered-To: freebsd-fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 8F8FCEC; Sat, 26 Jan 2013 10:16:35 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 9F23391A; Sat, 26 Jan 2013 10:16:34 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id MAA10508; Sat, 26 Jan 2013 12:16:29 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1Tz2oG-000HjC-Vf; Sat, 26 Jan 2013 12:16:29 +0200 Message-ID: <5103ACFC.1040306@FreeBSD.org> Date: Sat, 26 Jan 2013 12:16:28 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/20130121 Thunderbird/17.0.2 MIME-Version: 1.0 To: Chris Ross Subject: Re: Changes to kern.geom.debugflags? 
References: <7AA0B5D0-D49C-4D5A-8FA0-AA57C091C040@distal.com> <6A0C1005-F328-4C4C-BB83-CA463BD85127@distal.com> <20121225232507.GA47735@alchemy.franken.de> <8D01A854-97D9-4F1F-906A-7AB59BF8850B@distal.com> <6FC4189B-85FA-466F-AA00-C660E9C16367@distal.com> <20121230032403.GA29164@pix.net> <56B28B8A-2284-421D-A666-A21F995C7640@distal.com> <20130104234616.GA37999@alchemy.franken.de> <50F82846.6030104@FreeBSD.org> <315EDE17-4995-4819-BC82-E9B7D942E82A@distal.com>
In-Reply-To: <315EDE17-4995-4819-BC82-E9B7D942E82A@distal.com>
X-Enigmail-Version: 1.4.6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: "freebsd-fs@freebsd.org", Kurt Lidl, "freebsd-sparc64@freebsd.org", Marius Strobl
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
X-List-Received-Date: Sat, 26 Jan 2013 10:16:35 -0000

on 18/01/2013 02:49 Chris Ross said the following:
> How long will this take to get to stable/9? Being new to FreeBSD,
> I'm not too familiar with the process of HEAD/stable/etc. (In NetBSD,
> it would be a commit followed by a pull request.)

I've just MFC-ed the change to stable/9 and 8.

--
Andriy Gapon