From owner-freebsd-scsi  Sat Dec  2 20:20:53 2000
Delivered-To: freebsd-scsi@freebsd.org
Received: from mass.osd.bsdi.com (adsl-63-202-178-34.dsl.snfc21.pacbell.net [63.202.178.34])
	by hub.freebsd.org (Postfix) with ESMTP id 7D15037B400
	for <freebsd-scsi@freebsd.org>; Sat,  2 Dec 2000 20:20:49 -0800 (PST)
Received: from mass.osd.bsdi.com (localhost [127.0.0.1])
	by mass.osd.bsdi.com (8.11.0/8.11.1) with ESMTP id eB34TMF33893;
	Sat, 2 Dec 2000 20:29:22 -0800 (PST)
	(envelope-from msmith@mass.osd.bsdi.com)
Message-Id: <200012030429.eB34TMF33893@mass.osd.bsdi.com>
X-Mailer: exmh version 2.1.1 10/15/1999
To: Peter Gradwell <peter@gradwell.com>
Cc: freebsd-scsi@freebsd.org
Subject: Re: Mylex DAC960 Driver "online/offline" 
In-reply-to: Your message of "Sat, 02 Dec 2000 23:49:24 GMT."
             <5.0.0.25.0.20001202233356.0366b2d8@pop3.gradwell.net> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Sat, 02 Dec 2000 20:29:22 -0800
From: Mike Smith <msmith@freebsd.org>
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> Hi Mike,
> 
> At 15:39 02/12/2000 -0800, Mike Smith wrote:
> > > What does this message really mean?
> >
> >It means that the controller is telling us that the drive is offline.
> >Then that it's online.  Then that it's offline again.
> >
> >You don't say what the time intervals between these messages are; you can
> >get the 'drive offline' message from either the status poll (once per
> >second) or if an I/O operation is sent to a drive that the controller
> >reports as offline.  The 'drive online' message only comes from the
> >status poll though.
> 
> It was occurring without any apparent activity, about once per second,
> so I would guess it was from the status poll.

Did you get one message each second, or two?  (two would be more 
confusing)
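
A rough way to answer that from the logs is to count how many offline/online
lines land in each second.  This is only a sketch: the sample lines and the
"mlxd0" device name are made up for illustration, and it assumes the usual
15-character syslog timestamp at the start of each line.

```python
from collections import Counter

def messages_per_second(lines, needle="drive o"):
    """Count matching log lines per syslog timestamp (first 15 characters)."""
    counts = Counter()
    for line in lines:
        # "drive o" matches both "drive offline" and "drive online".
        if needle in line:
            counts[line[:15]] += 1
    return counts

# Hypothetical sample lines; the real driver output may be worded differently.
sample = [
    "Dec  2 10:00:01 news /kernel: mlxd0: drive offline",
    "Dec  2 10:00:01 news /kernel: mlxd0: drive online",
    "Dec  2 10:00:02 news /kernel: mlxd0: drive offline",
]

per_second = messages_per_second(sample)
for stamp, n in sorted(per_second.items()):
    print(stamp, n)
```

A steady count of two per second would suggest the status poll is flipping
between two states on every pass.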

> >Can you describe your configuration?  I can try to reproduce the
> >situation here and see if it's not possible that there's a bug in the
> >driver confusing the status between your two drives.  I have to say,
> >though, that the fact that the controller thinks that one of your system
> >drives is offline when you claim it's a mirror is a bit troubling.
> 
> Ok, on an update to the situation though, I was able to get to the
> mylex bios (there is 250 miles between me and the machine you see!)
> via a serial console and discovered that it had marked two drives offline.
> 
> We have:
>          3 x 18 gig disks, of which two are bonded in a raid 1 pack
>          and one is a hot spare
>          2 x 36 gig disks, bonded in a raid 0 pack.
> 
> Everything apart from /var/spool/news is on the raid 1 pack. (Yeah, it's
> a news server.)
> 
> One of the 18 gig disks and one of the 36 gig disks were marked offline.
> 
> I believe that when the 18 gig disk was marked offline the RAID card
> rebuilt its redundancy data onto the hot spare disk and carried on,
> because the 18 gig disk which is offline was part of the raid 1 pack and
> there is now no hot spare. *So, that's good.*

That sounds about right.  The failure rate is terrible though.  Heat 
issues?

> So, we hard reset the machine and it booted. However, the symptoms
> described previously prevailed. We couldn't login via ssh or on the console
> as it was unresponsive.
>
> * This worries me. I would hope the machine would take the loss of
> /v/s/news gracefully, and carry on.

If a filesystem listed in /etc/fstab can't be mounted, the system won't 
boot.  If you want resiliency against filesystems that won't mount, don't 
list them there; mount them manually as part of e.g. the news subsystem 
startup instead.
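
Something along these lines, as a minimal sketch: give the filesystem the
"noauto" option in /etc/fstab so it is skipped at boot, then have the news
startup try the mount itself and decline to start if that fails.  The mount
point is from your description; the function names and the injectable "run"
argument are mine, purely for illustration.

```python
import subprocess

def mount_spool(mount_point="/var/spool/news", run=subprocess.call):
    """Try to mount the spool via its fstab entry; return True on success."""
    # 'mount <dir>' looks the filesystem up in /etc/fstab; a 'noauto' entry
    # is skipped at boot but can still be mounted explicitly like this.
    return run(["mount", mount_point]) == 0

def start_news(run=subprocess.call):
    """Start news only if the spool mounted; otherwise leave the system up."""
    if not mount_spool(run=run):
        print("news spool not mounted; not starting news")
        return False
    # Placeholder: launch the real news server from here.
    print("news spool mounted; start the news server here")
    return True
```

Calling start_news() from the news startup script keeps a dead spool disk
from getting in the way of the rest of the boot.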

> So, when I accessed the bios this morning, I tried, as an "experiment"
> to put the 36 gig disk back online and rebooted. After running fsck
> a bit (is there a journaling file system for FreeBSD?!) the machine is
> now running ok.

You can use snapshots with softupdates, but that's still a work in 
progress, so I wouldn't count on it yet.

> I have yet to schedule a reboot to mark the currently off line 18 gig
> disk as the hot spare. I think I will be able to do this.
> 
> I am worried that the controller randomly marks the drives off line. Mylex
> tell me this happens when it loses contact with the drives.
> They are internal drives, well screwed into a big case, nicely racked
> into a locked cabinet in Telehouse Europe. From what I can gather, no
> one accessed the rack. It appears they aren't disconnected anyway
> because I can mark them online and we're go again.

"loses contact" doesn't necessarily mean that they're disconnected; it 
can be caused by the drive failing to behave (e.g. SCSI error -> drive 
reset -> repeated SCSI error(s)).  If this is the case, you should have a 
lot more log messages from the controller.

> I'd be happy to help with more information if it helps. Directed questions
> work best!

I'd start by checking your system logs for more verbosity from the 
controller.  At the very least, there should be some indication that the 
drives died, and some reporting on the rebuild that seems to have occurred.
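
A quick way to pull those lines out, as a sketch: filter the log for the
controller's device names.  I'm assuming the messages are tagged "mlxN:" or
"mlxdN:" and that the log lives in /var/log/messages; adjust both to match
what your system actually prints.

```python
import re

def controller_lines(lines, tag=r"\bmlxd?\d+:"):
    """Return log lines that carry a controller or system-drive device tag."""
    pattern = re.compile(tag)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

# Hypothetical sample; real messages will be worded differently.
sample = [
    "Dec  2 09:58:10 news /kernel: mlx0: physical drive 0:1 killed\n",
    "Dec  2 09:58:11 news sshd[201]: connection from 10.0.0.5\n",
    "Dec  2 09:58:12 news /kernel: mlxd0: drive offline\n",
]

hits = controller_lines(sample)
for line in hits:
    print(line)

# Against the real log (the path is an assumption):
# with open("/var/log/messages") as log:
#     for line in controller_lines(log):
#         print(line)
```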

-- 
... every activity meets with opposition, everyone who acts has his
rivals and unfortunately opponents also.  But not because people want
to be opponents, rather because the tasks and relationships force
people to take different points of view.  [Dr. Fritz Todt]
           V I C T O R Y   N O T   V E N G E A N C E

