From owner-freebsd-scsi@FreeBSD.ORG  Mon Jun  2 01:28:54 2003
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: freebsd-scsi@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D8BBE37B401
	for <freebsd-scsi@freebsd.org>; Mon,  2 Jun 2003 01:28:52 -0700 (PDT)
Received: from matou.sibbald.com (matou.sibbald.com [195.202.201.48])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 81EDF43FA3
	for <freebsd-scsi@freebsd.org>; Mon,  2 Jun 2003 01:28:50 -0700 (PDT)
	(envelope-from kern@sibbald.com)
Received: from [192.168.68.112] (rufus [192.168.68.112])
	by matou.sibbald.com (8.11.6/8.11.6) with ESMTP id h528Scv04594;
	Mon, 2 Jun 2003 10:28:39 +0200
From: Kern Sibbald <kern@sibbald.com>
To: mjacob@feral.com
In-Reply-To: <20030601163730.T97138@beppo>
References: <20030601124620.S18592@root.org>  <20030601163730.T97138@beppo>
Content-Type: text/plain
Organization: 
Message-Id: <1054542517.1578.1770.camel@rufus>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.4 
Date: 02 Jun 2003 10:28:38 +0200
Content-Transfer-Encoding: 7bit
cc: freebsd-scsi@freebsd.org
Subject: Re: SCSI tape data loss
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 02 Jun 2003 08:28:55 -0000

Hello,

Yes, I've seen both your name and Justin Gibbs's in the FreeBSD
documentation (a partial answer to question d below).


On Mon, 2003-06-02 at 02:13, Matthew Jacob wrote:
> Hello, I'm the author of the SA driver. This specific case is something
> I have indeed tried to handle correctly, but could have missed something
> on. In particular I've been wary of devices in fixed block mode.
> 
> The executive summary: I need more info. I need to know:
> 
> 	a) was the tape device in fixed or variable block mode

The tape drive was in variable block mode. However, at the time of
the error all the blocks Bacula was writing were the same size,
i.e. 64512 bytes.

> 
> 	b) you claim to have lost blocks 1555..1567, and that
> 	1568 was the signifier to change tapes. Are these tape
> 	blocks reflective of single 'write' requests? Or are
> 	these multiple tape records issued in one write?

All Bacula writes are single, unbuffered writes of what I call a
block (a single tape record). By the way, these are Bacula block
numbers, counting from 1 starting at the last EOF that Bacula wrote
or at the beginning of the tape. Each block contains, among other
things, its block number; this allowed us to identify with certainty
which blocks were missing. In addition, Bacula increments the block
number in only one place in the code: after a "successful" write. At
the end of the tape Bacula writes the final block number to its
database, and we found the correct block number (the last of the
missing blocks) in the Bacula database, thus "proving" that Bacula
actually wrote the blocks.
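To illustrate why embedding the block number in each block makes the loss detectable, here is a minimal C sketch. It assumes the block numbers have already been extracted from the blocks read back from one tape file; `find_missing_blocks` is a hypothetical name, not a Bacula function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not Bacula source: given the block numbers found
 * in the blocks read back from one tape file, in read order, record every
 * number that was skipped. Blocks are numbered from 1 after each EOF (or
 * from the beginning of the tape), so the first expected number is 1.
 * Writes up to max_missing skipped numbers into missing[] and returns the
 * total count of skipped numbers (which may exceed max_missing).
 */
static size_t find_missing_blocks(const unsigned *seen, size_t n,
                                  unsigned *missing, size_t max_missing)
{
    size_t count = 0;
    unsigned expected = 1;

    assert(seen != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Every number between the expected one and the one actually
         * read never made it onto the tape. */
        while (expected < seen[i]) {
            if (count < max_missing)
                missing[count] = expected;
            count++;
            expected++;
        }
        expected = seen[i] + 1;
    }
    return count;
}
```

For example, blocks read back as 1, 2, 3, 5, 6, 10 yield the missing numbers 4, 7, 8 and 9.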

> 
> 	c) What was the signifier you got that indicated that it
> 	was time to change tapes (viz block 1568)? -1 and an errno
> 	set? A residual that indicated that some data that you
> 	had requested to be written had not been written.

Bacula stops writing and changes tapes under a single condition:
the return status from write() does not equal the number of bytes
requested to be written.

Then a bit of analysis is done and the reason is reported (write error,
end of medium, ...). The effect is the same whether it was an
end of tape or an I/O error. In this particular case I cannot say
with 100% certainty what happened, but I believe that Bacula
received a -1 status with errno set to ENOSPC.
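The stop rule above amounts to one comparison followed by a look at errno. A minimal C sketch of it, assuming the caller passes errno captured immediately after write(); `classify_write` and the enum names are hypothetical, not Bacula's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical outcome codes; anything but WRITE_OK means "change tapes". */
enum write_outcome {
    WRITE_OK,            /* all requested bytes were written */
    WRITE_SHORT,         /* fewer bytes than requested, no error */
    WRITE_END_OF_MEDIUM, /* -1 with errno == ENOSPC */
    WRITE_IO_ERROR       /* -1 with any other errno */
};

static enum write_outcome classify_write(ssize_t ret, size_t requested,
                                         int err)
{
    assert(ret >= -1);   /* write() returns -1 or a byte count */
    if (ret >= 0 && (size_t)ret == requested)
        return WRITE_OK;
    if (ret >= 0)
        return WRITE_SHORT;
    if (err == ENOSPC)
        return WRITE_END_OF_MEDIUM;
    return WRITE_IO_ERROR;
}
```

A caller would do something like `n = write(fd, buf, len); o = classify_write(n, len, errno);` and start the tape change for any result other than `WRITE_OK`; the remaining codes only affect the reason that is reported.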

If this point is critical, we can re-run the test with debug code
inserted to give us the exact status that was returned.  It is a
bit of work for us both, but if it is important, say so and we
will do it.

> 
> 
> 	d) Other general info about whether you were indeed using
> 	the 'no-rewind' device, whether you'd changed the default
> 	EOT model (from 'dual filemark' to 'single filemark'- you
> 	*have* read the man pages, yes? :-))

Yes, we were using the no-rewind device (though this makes absolutely
no difference to Bacula). I am sure the EOT model was two EOFs, because
I questioned Dan on that point and he showed that it was 2.

Yes, I have read the man pages several times in detail, as well as the
corresponding man pages for Linux and Solaris.


> 
> 
> 
> There is one case I'm also worried about. This is from sa.c:saerror:
> 
>         if (csio->cdb_io.cdb_bytes[0] == SA_WRITE) {
>                 if (sense_key == SSD_KEY_VOLUME_OVERFLOW) {
>                         csio->resid = resid;
>                         error = ENOSPC;
>                 } else if (sense->flags & SSD_EOM) {
>                         softc->flags |= SA_FLAG_EOM_PENDING;
>                         /*
>                          * Grotesque as it seems, the few times
>                          * I've actually seen a non-zero resid,
>                          * the tape drive actually lied and had
>                          * written all the data!
>                          */
>                         csio->resid = 0;
>                 }
> 
> This is saying: if we were writing, and we got SSD_KEY_VOLUME_OVERFLOW,
> we're at hard EOT- we have to assume we didn't write *any* data
> for this last operation, and we return an errno.
> 
> Otherwise, if early warning was spotted, mark EOM pending, but *don't*
> believe the residual field.

On both Solaris and Linux, the driver immediately notifies Bacula with
errno=ENOSPC (or at least a -1 status) when the early warning is hit,
and that write is reported as unsuccessful. I then immediately clear
the error status, write two EOF marks (one would be sufficient) and
change tapes.

> 
> Every tape drive I'd tested with (around 7 or 8 of them) had, when
> presenting a non-zero residual, lied about what it had actually
> put on the tape.
> 
> What I'm obviously worried about here is whether or not your tape drive
> was correct in reporting a residual. This would indeed fit your data.
> 
> I'm pretty sure I also tested my EOT test program with an Archive
> autoloader- but I don't remember for sure.

As long as the full record is not successfully written to the tape,
Bacula will be 100% data correct because any "short" block that Bacula
reads is discarded. 

Logically, what you should do is this: if a partial or full record is
written to tape and an early EOM mark is detected, return the number of
bytes written and set a flag. On the next write, report no data written
and errno=ENOSPC.

That will ensure that every program knows exactly where it is.
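The proposed semantics can be illustrated with a small simulated device. This is a toy model in C, not driver code: `struct sim_tape`, its `room` field (bytes left before the early-warning point) and `sim_write` are hypothetical. For simplicity it assumes the record that crosses early warning is written in full; a partial record would be handled the same way, just returning the smaller count.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Simulated tape: models only the deferred-ENOSPC proposal above. */
struct sim_tape {
    size_t room;        /* bytes that can be written before early warning */
    bool eom_pending;   /* early warning was seen on a previous write */
};

/* Returns bytes written, or -1 with *err set to ENOSPC. */
static ssize_t sim_write(struct sim_tape *t, size_t len, int *err)
{
    assert(t != NULL && err != NULL);
    if (t->eom_pending) {
        /* First write after early warning: nothing is written. */
        *err = ENOSPC;
        return -1;
    }
    if (len >= t->room) {
        /* This record reaches early warning: report it as written and
         * defer the error to the next call by setting the flag. */
        t->room = 0;
        t->eom_pending = true;
        return (ssize_t)len;
    }
    t->room -= len;
    return (ssize_t)len;
}
```

With this behavior every record whose write returned its full length is really on the tape, and the first -1/ENOSPC marks exactly where the data stops, which matches the single stop condition Bacula already uses.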

> 
> 
> 
> Other points:
> 
> > However, more recently Dan Langille did some extensive
> > testing writing a 6GB file to six tapes. This brought
> > out additional problems of the driver "freezing" the tape,
> 
> If the tape is 'freezing' it means that tape position was lost.
> Under what circumstances did this occur?

Yes, the tape is "freezing". I do not believe that it freezes during
the writing; rather, it apparently freezes during Bacula's check of
whether the last block was correctly written. This check probably
fails on FreeBSD for two reasons: 1. you freeze the tape;
2. your handling of EOF marks does not correspond to what Solaris/Linux
do.

Point 1, freezing of the tape:
At EOM (or on an I/O error) Bacula writes two EOF marks, backspaces over
them, backspaces one record, then rereads that record and compares it to
the last block successfully written. This works perfectly on
Solaris/Linux, but does not work on FreeBSD. One reason, I believe, is
that you freeze the tape on the backspace record.
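The check described in point 1 can be sketched with the `MTIOCTOP` ioctl from `<sys/mtio.h>`, which both FreeBSD and Linux provide. This is a minimal sketch, not Bacula's code: `mt_op` and `verify_last_block` are hypothetical names, and it assumes variable-block mode (so one read() returns one record) on the no-rewind device.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mtio.h>
#include <sys/types.h>
#include <unistd.h>

/* Issue one magnetic-tape operation; returns ioctl()'s result. */
static int mt_op(int fd, int op, int count)
{
    struct mtop mt;

    mt.mt_op = op;
    mt.mt_count = count;
    return ioctl(fd, MTIOCTOP, &mt);
}

/*
 * Write two EOF marks, back up over them and over one record, reread
 * that record and compare it with the last block we believe was written.
 * Returns 1 on a match, 0 on a mismatch, -1 on any error.
 */
static int verify_last_block(int fd, const void *last, size_t len)
{
    char *buf;
    ssize_t n;
    int result;

    assert(last != NULL && len > 0);
    if (mt_op(fd, MTWEOF, 2) < 0 ||   /* two EOF marks */
        mt_op(fd, MTBSF, 2) < 0 ||    /* back over both EOF marks */
        mt_op(fd, MTBSR, 1) < 0)      /* back over the last record */
        return -1;

    buf = malloc(len);
    if (buf == NULL)
        return -1;
    n = read(fd, buf, len);           /* one record in variable mode */
    if (n < 0)
        result = -1;
    else
        result = ((size_t)n == len && memcmp(buf, last, len) == 0);
    free(buf);
    return result;
}
```

On a descriptor that is not a tape device, the first ioctl fails and the function returns -1, so the sketch can only be exercised meaningfully against a real drive.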

Point 2, handling of EOF marks:
FreeBSD's handling of EOF marks is quite different from that of
Solaris/Linux. Solaris/Linux are "transparent": the program writer never
sees the extra EOF marks. In FreeBSD the EOF marks that the driver adds
are visible to the program (causing Bacula great problems, most of which
I have programmed around).

Basically, as best I can determine, after Bacula writes its two EOF
marks FreeBSD adds another one, but leaves the tape positioned after
the EOF mark it wrote rather than before it. When Solaris/Linux add
an EOF mark in the driver, they always backspace over it and leave
you positioned "correctly".

As an example: if I write:

  write()
  EOF
  EOF
  ioctl(MTBSF)
  ioctl(MTBSF)
  ioctl(MTBSR)

On Linux/Solaris I end up positioned just before the last write. On
FreeBSD, I seem to always end up positioned *after* the last write,
which I claim is "incorrect" (i.e. not expected) in BSD tape mode.

Actually, if Linux/Solaris are intelligent, they will not write a third
EOF mark.  If I had only written one EOF mark, they would have added a
second one, but backspaced over it before returning control to me.
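A toy position model shows why one extra, un-backspaced EOF mark is enough to break the sequence above. This is an illustration in C, not driver code: the tape is an array of records and filemarks with `pos` the gap before `e[pos]`. It assumes the FreeBSD behavior as described here (the driver appends a third filemark and leaves the tape after it) and the SCSI rule that a reverse record space which meets a filemark ends on the beginning-of-tape side of that mark. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum elem { REC, FM };                 /* data record or filemark */

struct model_tape {
    enum elem e[8];
    size_t len;                        /* elements on the tape */
    size_t pos;                        /* gap before e[pos] */
};

/* MTBSF, count 1: move back until one filemark has been crossed. */
static void bsf(struct model_tape *t)
{
    while (t->pos > 0) {
        t->pos--;
        if (t->e[t->pos] == FM)
            return;
    }
}

/* MTBSR, count 1: cross one element backwards, record or filemark. */
static void bsr(struct model_tape *t)
{
    if (t->pos > 0)
        t->pos--;
}

/* Read forward one element and report what was found. */
static enum elem read_elem(struct model_tape *t)
{
    assert(t->pos < t->len);
    return t->e[t->pos++];
}

/* write(), EOF, EOF, MTBSF, MTBSF, MTBSR, then read. */
static enum elem run_sequence(bool driver_adds_fm)
{
    struct model_tape t = { .len = 0, .pos = 0 };

    t.e[t.len++] = REC;                /* last data block */
    t.e[t.len++] = FM;                 /* EOF written by Bacula */
    t.e[t.len++] = FM;                 /* EOF written by Bacula */
    if (driver_adds_fm)
        t.e[t.len++] = FM;             /* extra mark, tape left after it */
    t.pos = t.len;

    bsf(&t);
    bsf(&t);
    bsr(&t);
    return read_elem(&t);              /* REC only if before last write */
}
```

Without the extra mark the final read returns the last record; with it, the read hits a filemark instead, i.e. the tape sits after the last write. The outcome would be the same if the reverse record space stopped at the filemark without crossing it.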