From: George Hartzell
Date: Wed, 17 Jul 2013 13:47:52 -0700
To: freebsd-stable@freebsd.org
Subject: Help with filing a [maybe] ZFS/mmap bug.

Hi All,

I have what I think is a ZFS related bug. Unfortunately my simplest test case is a bit cumbersome and I haven't definitively proven that the problem is ZFS related. I'm hoping for some feedback on how to move forward.

Quick background: I rip my CDs using grip and produce flac files. I tag the music using MusicBrainz Picard and transcode it to mp3s within Picard using a plugin that I wrote. Picard is a Python-based app and uses the Mutagen library to tag files.

I'm working on a MacPro with 10GB of RAM and using Seagate ST31000340AS drives updated to the latest firmware (SD1A). The system is running 9-STABLE from late June. It is ZFS only and boots from a mirrored pool that provides a bunch of zfs filesystems, including my home directory.

I recently realized that some of the flacs were corrupt and have been chasing down the problem. I've blamed Picard, my disks (there was newer, "important" firmware, which they're now running), my RAM, etc...

After blaming each of the moving parts in turn I offer up the following experiment as evidence that I have found a ZFS problem.

- start with a bunch of untagged flac files that pass validation with "flac -t".
- load them into Picard, tag them and save them (this also transcodes them to mp3s using my plugin and runs a plugin which runs flac -t on the tagged file).
- run flac -t on all of the tagged flac files and collect the result as pre-exit-validation.
- exit Picard "politely" (using the menu options, not killing it from the command line...).
- run flac -t on all of the tagged flac files and collect the result as post-exit-validation.
- reboot the machine.
- run flac -t on all of the tagged flac files and collect the result as post-reboot-validation.

On multiple runs through this routine I'll sometimes see errors in the {pre,post}-exit-validations, but they'll often all validate perfectly. On every run I see many invalid files in the post-reboot-validation output.

I've even scp'd the directories to an unrelated machine (Mac OS X 10.8) at the various points to do the "flac -t" validation, with the same results.

Looking carefully at a couple of instances shows that the good and bad copies of a file differ in only a few bytes. E.g.
one file differs by a few bytes starting at 139253 to 139264 (I might have an off by one counting issue, using emacs' buffer positions here). 2^17 + 2^13 = 139264, which is an interesting coincidence. In another file I see a difference ending at 2^17 + 2^12 (again, I might be off by one or so in my counting). Patching the differing hunk from a good file into a bad file (again via emacs) results in a file that passes validation.

At one point I was blaming RAM and was pulling/swapping SIMMs. Running with less memory increased the likelihood of files being invalid.

I built up a similar system running 9-STABLE as of yesterday (7/16) that uses UFS and have been unable to recreate the problem.

Given that the files are valid after exiting Picard, I do not think that there is anything in my tagging pipeline that is causing the problem.

The fact that the files "become" invalid after a reboot suggests something in the ZFS buffering and/or interactions with the VM system. The observation that running with less memory causes more/earlier problems reinforces this. The fact that the garbage in the file happens near a power-of-two boundary also reinforces this.

My current test case involves my local version of Picard and my plugins, and 165 flac files (some of which Picard can discover automatically based on grip's freedb based metadata, some of which need a helping hand). It's not particularly minimal, but I'm not sure that I can ever get it trimmed down to something trivial that a ZFS developer might be able to run locally.

Thanks for making it this far! How should I move forward with this?

g.
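
The byte-level comparison described above (done by hand in emacs) could also be scripted. What follows is a minimal Python sketch, not part of the original test case: it lists the byte ranges where a known-good and a bad copy of a file differ, and reports how far the end of each range is from the next page boundary. The 4096-byte page size, the function names and the command-line usage are all assumptions made for illustration. Offsets are 0-based, unlike emacs buffer positions.

```python
import sys

# Assumed page size; the report does not state one.
PAGE = 4096


def diff_ranges(a: bytes, b: bytes):
    """Return half-open (start, end) byte ranges where a and b differ."""
    ranges = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new differing run begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, n))  # run extends to the end of the common part
    if len(a) != len(b):
        # Any trailing bytes present in only one file count as a difference.
        ranges.append((n, max(len(a), len(b))))
    return ranges


def to_next_boundary(offset: int, page: int = PAGE) -> int:
    """Bytes from offset up to the next multiple of page (0 if on one)."""
    return (-offset) % page


def main(good_path: str, bad_path: str) -> None:
    with open(good_path, "rb") as f:
        good = f.read()
    with open(bad_path, "rb") as f:
        bad = f.read()
    for start, end in diff_ranges(good, bad):
        print(f"differs at [{start}, {end}): {end - start} bytes, "
              f"end is {to_next_boundary(end)} bytes before a "
              f"{PAGE}-byte boundary")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python diffpages.py good.flac bad.flac
    main(sys.argv[1], sys.argv[2])
```

For the offsets quoted above, 139264 = 34 * 4096 and 2^17 + 2^12 = 135168 = 33 * 4096, so both differences end on a multiple of 4096 (allowing for the possible off-by-one in the counting).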