Date: 01 Jun 2003 23:44:53 +0200
From: Kern Sibbald <kern@sibbald.com>
To: "Justin T. Gibbs" <gibbs@scsiguy.com>
Cc: mjacob@feral.com
Subject: Re: SCSI tape data loss
Message-ID: <1054503893.1578.1723.camel@rufus>
In-Reply-To: <2846020000.1054498114@aslan.scsiguy.com>
References: <1054490081.1582.1685.camel@rufus> <2846020000.1054498114@aslan.scsiguy.com>

Hello again,

I just re-read the Linux mt pages, and I see that they have a
setting both for async-writes and buffer-writes, so I'm now
confused about what the distinction really is. I had assumed
that if you are buffering then the writes must be asynchronous,
otherwise why would you buffer?

Best regards,

Kern

On Sun, 2003-06-01 at 22:08, Justin T. Gibbs wrote:
> > Hello,
> >
> > I'm the author of a GPL'ed network backup program called
> > Bacula (www.bacula.org). For the last three years, it
> > has been working flawlessly on Solaris and Linux systems.
> > When users attempted to use it recently on FreeBSD,
> > it did not work. I subsequently modified Bacula so that
> > it would work on FreeBSD -- basically, I had to program
> > around some important differences in the way FreeBSD
> > handles EOFs compared to Solaris and Linux. At some point
> > in the future, I would like to discuss the problems
> > I had in detail, if that interests you.
>
> I would be interested as I'm sure would other readers of this
> list.
>
> > We've now worked on this problem for several weeks, and
> > I believe we have now isolated the problem (data loss) to occur
> > when the end of medium is reached.
> >
> > We have now confirmed that Bacula correctly wrote
> > to the tape, but when it was read back 13 blocks
> > of 64512 bytes were missing.
> >
> > Below, I have listed in pseudo-language what
> > Bacula was doing. Each write with the exception
> > of the first block on the second tape is 64512
> > bytes:
> >
> > first tape mounted
> > write(block 1)
> > ...
> > write(block 1554);
> > write(block 1555); <=== block lost
> > ... <=== blocks lost
> > write(block 1567); <=== block lost
> > write(block 1568) failed because of EOM detected
> > ioctl(MTIOCERRSTAT);
>
> What was the residual reported by MTIOCERRSTAT? If the
> device is in buffered mode, that residual can be larger than
> the last transaction that was failed. My guess is that either
> MTIOCERRSTAT is not properly pulling the residual out of the
> info field, or you are not backing up far enough in the data
> stream when the EOM occurs.
>
> > I have verified that Bacula did successfully write 1567 blocks to the
> > first tape, but in reading back the tape, blocks 1555-1567 are not
> > on the tape.
> >
> > Now, the big question is: what caused the loss of those blocks?
> > The most likely causes I can think of are:
> >
> > 1. Bacula is doing something (e.g. MTIOCERRSTAT, or the MTBSF)
> > to cause the data to be lost. If this is the case, it is
> > something specific to FreeBSD since this sequence of commands
> > works on both Solaris and Linux (except that MTIOCERRSTAT is
> > MTIOCLRERR on those systems).
>
> Perhaps both Linux and Solaris force the tape drives to run in
> unbuffered mode?
>
> > 2. The SCSI driver is doing asynchronous writes (very bad) and
> > the End of Medium is not sent to Bacula until many writes after
> > the end of the tape.
>
> Disabling the tape drive's write buffer kills performance. All
> of the information required to handle buffered writes should be
> available to you.
>
> Perhaps we should also implement the MTCACHE/MTNOCACHE opcodes so
> that userland apps can control this. It's not clear if this is
> exactly what they were created for, but it may be better to use
> these than to add some other opcodes.
>
> > 3. The SCSI driver has some sort of bug that causes buffers to be
> > lost.
>
> I doubt that this would occur only at EOM.
>
> --
> Justin
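
To make the point about the residual concrete, here is a minimal sketch in C of how an application might decide how far to back up in its data stream after a write fails at EOM. It is not Bacula's code and not a FreeBSD API: the function names are hypothetical, and it assumes the residual is reported in bytes, that every block has the same size, and that the residual includes the block whose write() failed. The residual value used in the usage note is made up to match the 13 lost blocks reported above; the real value returned by MTIOCERRSTAT is not known from this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of blocks covered by the residual.
 * Assumes residual_bytes is in bytes and all blocks are block_size bytes.
 */
size_t
blocks_to_rewrite(size_t residual_bytes, size_t block_size)
{
    assert(block_size > 0);
    /* Round up: a partially written block must be rewritten in full. */
    return (residual_bytes + block_size - 1) / block_size;
}

/*
 * Hypothetical helper: first block number (counting from 1) to replay on
 * the next tape. failed_block is the block whose write() returned the EOM
 * error; it is assumed to be included in the residual.
 */
size_t
first_block_to_replay(size_t failed_block, size_t residual_bytes,
    size_t block_size)
{
    size_t pending = blocks_to_rewrite(residual_bytes, block_size);

    /* The failed block itself always has to be replayed. */
    if (pending == 0)
        pending = 1;
    /* Never back up past the first block of the tape. */
    if (pending > failed_block)
        pending = failed_block;
    return failed_block - pending + 1;
}
```

With a made-up residual of 14 * 64512 = 903168 bytes and the failed write at block 1568, first_block_to_replay(1568, 903168, 64512) gives 1555, the first block reported missing. If the application only replays the failed block, which is what a residual of a single block would suggest, blocks 1555-1567 would be lost.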