From owner-freebsd-stable@FreeBSD.ORG Fri Jun 24 12:32:08 2005
Message-Id: <200506241231.j5OCV6jp047730@dungeon.home>
To: stable@freebsd.org
Date: Fri, 24 Jun 2005 22:31:06 +1000
From: Stephen McKay
Cc: Stephen McKay
Subject: Data corruption in cd9660 on FreeBSD 4.11?
List-Id: Production branch of FreeBSD source code

Hi!

I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11. My best theory so far is that cd9660 or perhaps the VFS layer is mishandling 2048-byte buffers (since they are smaller than one virtual memory page), occasionally writing them to the wrong location in RAM. Read on for why I think so.

First up, I don't think this is the usual hardware problem since the machine has done huge numbers of buildworlds (in 4.x and -current) without any of the telltale signs (e.g. bus errors and segmentation violations).
There are no error messages in /var/log/messages. Also, it moonlights as a games machine and plays Doom 3, Battlefield 1942, Neverwinter Nights and so forth like a champ. Memory, CPU, video, disk and networking are all just fine 100% of the time.

The hardware is an ASUS P4P800 mobo (including onboard Marvell Yukon gigabit ethernet) with a P4 2.8GHz CPU, 1GB RAM, Maxtor 120GB disk, Pioneer 103S DVD-ROM, and LiteOn SOHW-1673S DVD burner in an Antec Sonata case.

Now that I have a DVD burner, I make backups of my main machines (over NFS) but have found that they often don't verify as 100% correct. The symptom is that, for some files, an entire 2048-byte DVD sector is replaced with different (non-zero) data. This occurs both when reading with the Pioneer DVD-ROM and when reading with the LiteOn burner (though I don't test with the Pioneer much as it is slower). I emphasise that all burns have been 100% correct (i.e. the burning process worked and this can be verified by reading on, say, my iBook), so all of the hardware seems to be operating correctly (and swiftly, I might add). The problem is that reading the iso9660 file system is not safe.

After some experimenting, I've found that the problem also occurs when reading CDs, so I built a test CD (of photos of a recent wedding) and in testing I read this CD over and over. I compare the CD with the original files (via NFS) using diff. When diff finds a difference, I save copies of the differing files before they can be flushed from the cache. I have calculated checksums for all 2048-byte blocks on the CD, so I can tell whether any given block of 2048 bytes came from the CD and, if so, which file it came from. In all cases so far, the 2048-byte error has been a block from another file, not a random corruption.

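The per-block checksum lookup described above could be sketched roughly as follows. This is only an illustration in Python, not the tooling actually used in these tests; the function names, the choice of SHA-1, and the in-memory dict of file contents are all assumptions made for the example.

```python
import hashlib

SECTOR = 2048  # block size discussed above (one CD/DVD data sector)


def split_blocks(data, size=SECTOR):
    """Yield (block_number, block_bytes) for each sector-sized chunk."""
    for offset in range(0, len(data), size):
        yield offset // size, data[offset:offset + size]


def build_index(files):
    """Map block digest -> list of (file name, block number).

    `files` is a dict of name -> bytes holding the known-good originals.
    Note: the last block of a file may be shorter than 2048 bytes here,
    whereas on the disc it would be padded out to a full sector.
    """
    index = {}
    for name, data in files.items():
        for number, block in split_blocks(data):
            digest = hashlib.sha1(block).hexdigest()
            index.setdefault(digest, []).append((name, number))
    return index


def explain_corruption(good, bad, index):
    """Return [(block number, origins)] for every block where bad != good.

    `origins` lists where the unexpected block is known to live among the
    indexed files; an empty list means it matches nothing in the index.
    """
    report = []
    good_blocks = dict(split_blocks(good))
    for number, block in split_blocks(bad):
        if block != good_blocks.get(number):
            digest = hashlib.sha1(block).hexdigest()
            report.append((number, index.get(digest, [])))
    return report
```

Given the original bytes of a file and the corrupted copy saved from the cache, `explain_corruption` reports each differing block together with every place the substituted block is known to occur. Identical blocks in several files would yield several candidate origins, so the answer is only as unique as the data on the disc.
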
I am starting to believe that, under high load, the cd9660 file system code tells the ata driver to put a 2K block in the wrong spot in memory, leaving some old junk in the gap in the file being read, and blasting some other 2K block of memory. It may not be cd9660 code per se that is wrong, but a bug in the complex buffer handling code (getblk, getnewbuf, allocbuf, etc).

Why do I believe it is writing to the wrong memory, rather than any number of other flaws? In two runs (out of many), unusual things occurred that are consistent with memory being overwritten, rather than, say, a 2K block just not being read at all: In one, an innocent sshd core-dumped (which is something that has never happened except when running my cd9660 tests), and in another, a previously OK cached NFS file became corrupted.

Explaining that last case further: I had been running a test script that would mount the CD, compare files, unmount the CD, and repeat. This meant that the NFS copy of the files was read over and over and hence became memory resident (there being enough space in 1GB of RAM for one copy of the files, plus my normal programs). Several tests passed without fault (hence all the NFS files were cached and correct), when suddenly there were multiple corruptions; call them file A and file B. File A was the usual corruption where a 2K block of another file was unexpectedly present in the copy read from the CD, but in file B it was the NFS file that was wrong. In fact it contained the missing block from file A! In short, the fully memory resident NFS file B had been corrupted by reading file A from the CD.

It's been pretty interesting hunting this problem, but now I'm sort of stuck. I believe that some 2K reads from DVDs and CDs end up in the wrong place in RAM, but I can't find where this happens in the code (it's pretty hard to work out just by reading it), and I can't rule out the possibility that there's a hardware error here that I've just never run across before.

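For anyone wanting to reproduce the mount, compare, unmount, repeat loop described above, a rough sketch follows. It is written in Python purely for illustration and is not the script used in these tests; the device name `/dev/acd0c`, the directory arguments, and all function names are placeholders. It does its own byte comparison instead of calling diff so that every pass really re-reads both copies.

```python
import os
import shutil
import subprocess


def same_bytes(path_a, path_b, chunk=64 * 1024):
    """Compare two files byte for byte, reading both in full."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if a != b:
                return False
            if not a:  # both reads hit end of file together
                return True


def differing_files(reference, candidate):
    """Return relative paths under `reference` that are missing from, or
    differ from, the same path under `candidate`."""
    bad = []
    for root, _dirs, names in os.walk(reference):
        for name in names:
            ref_path = os.path.join(root, name)
            rel = os.path.relpath(ref_path, reference)
            cand_path = os.path.join(candidate, rel)
            if not os.path.isfile(cand_path) or not same_bytes(ref_path, cand_path):
                bad.append(rel)
    return sorted(bad)


def save_copies(rel_paths, reference, candidate, save_dir):
    """Copy both versions of each differing file while still cached."""
    for rel in rel_paths:
        for label, base in (("ref", reference), ("cd", candidate)):
            src = os.path.join(base, rel)
            if os.path.isfile(src):
                dst = os.path.join(save_dir, label, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copyfile(src, dst)


def run_passes(passes, reference, mountpoint, save_root, device="/dev/acd0c"):
    """Mount, compare, unmount, repeat. Device and paths are placeholders."""
    for n in range(passes):
        subprocess.run(["mount", "-t", "cd9660", device, mountpoint], check=True)
        try:
            bad = differing_files(reference, mountpoint)
            if bad:
                save_copies(bad, reference, mountpoint,
                            os.path.join(save_root, "pass%03d" % n))
                print("pass %d: differing files: %s" % (n, bad))
        finally:
            subprocess.run(["umount", mountpoint], check=True)
```

Unmounting between passes matters because it forces the next pass to read from the disc again instead of from cached buffers, while the reference copy stays resident in memory.
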
So, can anyone suggest any more tests I could try? Or is there a kind of hardware fault that could cause this substitution of whole blocks read from CDs without causing any other problems? And does anyone know of any commits made anywhere in the 5 years since 4.x split off from 5.x that may be relevant? Yep. 5 years. I have started looking, but there's a fair bit of stuff in there...

Stephen.