Date:      Wed, 12 Jul 1995 19:14:53 -0700 (PDT)
From:      "Rodney W. Grimes" <rgrimes@gndrsh.aac.dev.com>
To:        karl@Mcs.Net (Karl Denninger)
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: SCSI disk wedge
Message-ID:  <199507130214.TAA19888@gndrsh.aac.dev.com>
In-Reply-To: <199507130143.UAA00551@Jupiter.mcs.net> from "Karl Denninger" at Jul 12, 95 08:43:04 pm

> 
> > On Wed, 12 Jul 1995, Karl Denninger wrote:
> > 
> > > This hang is only seen about once a day, and it is NOT load related.  It
> > > happens infrequently enough that tracking it is going to be a real bitch.
> > 
> >   I don't see this at all on a 1742 equipped system.  I have seen uptimes 
> > of 25 days before rebooting for a hardware upgrade.  I have DEC 3210 
> > drives though.
> > 
> >   It could be that one of the drives has a firmware bug.  This is not that 
> > uncommon.  It was reported in hackers that some Conner drives have such 
> > problems.  I also remember getting bug-fix firmware upgrades for old 
> > Micropolis drives.
> > 
> > Tom
> 
> The drives on these machines are (1) less than two months old, (2) have
> current firmware, and (3) don't have ANY problems with BSDI.

Slow down... (1) New drives are often prone to firmware bugs if by
new you also mean new model.  (2) Good!!  But the ``new'' firmware
could still have a bug in it.  (3) This is good, but it does not
necessarily mean the bug is in FreeBSD.  We do things like very
large I/O requests through the vm system; perhaps one of your
drives does not like it when we drop a 64K I/O operation on it.

> If FreeBSD is going to be a production platform then it is going to have to
> start behaving like one.  This means that pushing things off on drive
> vendors is not acceptable.

I didn't see anyone explicitly state that the drives were at fault.  I
saw a lot of people reporting that their site works, and a lot of requests
for details about just what you are running.  If you want your problem
fixed you will need to fill in as much detail as possible.  I have seen a
thread like this go for 3 to 5 days, and then suddenly some little detail
comes out and we put a finger right on the problem.

I even once exchanged some 15 to 20 emails with a gentleman trying to
fix his problem, when the little detail that he was running BSDI and
not FreeBSD came out.  Needless to say, that pointed right at what was
wrong; I corrected my strategy and fixed his problem for him in about
10 minutes.

What I am trying to get at here is: please be tolerant of all the
questions.  We are trying to get you fixed, but we need the details!!

> If you have a problem with a device, you *report it*.  Silent death is never
> acceptable.  The kernel is running in this case, but the system is hung
> waiting on I/O completion.

If you _detect_ a problem this holds true, but no code can detect all
possible problems :-(.

> I am not at all convinced this is a firmware issue.  If it was then the 83
> days of uptime on identically-configured BSDI machines wouldn't be happening.

Unless FreeBSD happens to do some different type of operations that cause
a different path in the firmware to be taken, or presses other limits that
you had not been hitting before.  Just because X works does not mean that
Y does not have a flaw in it.

> But they are.

:-(.

> Those 83-day uptimes are recorded on our production NFS servers which run a
> much heavier disk load, with the same devices, on a different OS with no
> problems.

Same _exact_ devices, or same _model_/_pn_/_revision_/_date_code_?

FreeBSD has been running on aha1742 based controllers quite stably for
well over 2 years (I know, in that my personal machines, plus the FreeBSD
development machine (freefall) and the cdrom ftp site (wcarchive), were
all built by me initially as ECS MB with aha1742 controllers).

I am seeing lots of sites here reporting ``no problem'' with their
aha1742/aha2742, so that leads me to want to know what is ``different''
about your site that causes this problem to show up.

I know these things:
a) You have a hang problem on a 2742 with no error message
b) You have a hang problem on a 1742 with some error before it, but
   I did not see any error in your mail.
c) You are using Seagate and Micropolis (I think that is what you said)
   disk drives, but I have no idea as to what models.
d) You have BSDI running on similar hardware (maybe even the exact
   hardware) with long uptimes.
e) You crash once a day.
f) You publicly posted that you get a 200% performance boost running
   FreeBSD over BSDI, telling me we are probably pushing your hardware
   quite a bit harder than BSDI did.

What I do not know:

a) Are you using active termination?
b) Do your SCSI cables meet the SCSI-II spec with respect to all
   parameters (length, impedance, capacitance, etc)?
c) What exact models of disk drives are you using?
d) What is that error message you get?
e) What motherboard are you running on?  As much detail as possible.
f) What exact model/revision aha174x and 274x are you using?
g) What other I/O cards are in the machine?
h) What work load is the system running?  Does any one specific
   work load tend to bring the crash out?
i) Are you willing to pay for production type support, or is this the
   reason you switched from BSDI to FreeBSD and now expect to get that
   level of support for free?  Contracted support is available from
   several people if you expect that level of service.

What I am willing to do:

a) As long as you keep answering the questions and filling in the
   details I will continue to follow the thread so that we might
   come to a final resolution of your problem.

b) Resurrect my DX2/66 EISA 1742 based system and run some tests,
   duplicating your environment as much as I can, time permitting
   (and I am one very busy person), to try and reproduce the bug here.

c) Loan you my aha1742 that I know has worked for 2.5 years with 
   FreeBSD without a single hiccup.

d) Since you mentioned ``production'': If you are in a real hurry
   to get it fixed, you can pay me at contracted rates and I will be 
   at your site with my equipment within 2 days.  This is an expensive
   option, but one that does exist.

-- 
Rod Grimes                                      rgrimes@gndrsh.aac.dev.com
Accurate Automation Company                 Reliable computers for FreeBSD


