From owner-freebsd-hackers Wed Jul 12 19:15:07 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.10/8.6.6)
        id TAA14271 for hackers-outgoing; Wed, 12 Jul 1995 19:15:07 -0700
Received: from gndrsh.aac.dev.com (gndrsh.aac.dev.com [198.145.92.241])
        by freefall.cdrom.com (8.6.10/8.6.6) with ESMTP id TAB14265
        for ; Wed, 12 Jul 1995 19:15:03 -0700
Received: (from rgrimes@localhost) by gndrsh.aac.dev.com (8.6.11/8.6.9)
        id TAA19888; Wed, 12 Jul 1995 19:14:54 -0700
From: "Rodney W. Grimes"
Message-Id: <199507130214.TAA19888@gndrsh.aac.dev.com>
Subject: Re: SCSI disk wedge
To: karl@Mcs.Net (Karl Denninger)
Date: Wed, 12 Jul 1995 19:14:53 -0700 (PDT)
Cc: freebsd-hackers@FreeBSD.ORG
In-Reply-To: <199507130143.UAA00551@Jupiter.mcs.net> from "Karl Denninger"
        at Jul 12, 95 08:43:04 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 5993
Sender: hackers-owner@FreeBSD.ORG
Precedence: bulk

>
> > On Wed, 12 Jul 1995, Karl Denninger wrote:
> >
> > > This hang is only seen about once a day, and it is NOT load related.  It
> > > happens infrequently enough that tracking it is going to be a real bitch.
> >
> > I don't see this at all on a 1742 equipped system.  I have seen uptimes
> > of 25 days before rebooting for a hardware upgrade.  I have DEC 3210
> > drives though.
> >
> > It could be that one of the drives has a firmware bug.  This is not that
> > uncommon.  It was reported in hackers that some Conner drives have such
> > problems.  I also remember getting bug-fix firmware upgrades for old
> > Micropolis drives.
> >
> > Tom
>
> The drives on these machines are (1) less than two months old, (2) have
> current firmware, and (3) don't have ANY problems with BSDI.

Slow down...

(1) New drives are often prone to firmware bugs if by new you also mean
    new model.
(2) Good!!
    But the ``new'' firmware could still have a bug in it.
(3) This is good, but it does not necessarily mean the bug is in FreeBSD.
    We do things like very large I/O requests through the VM system;
    perhaps one of your drives does not like it when we drop a 64K I/O
    operation on it.

> If FreeBSD is going to be a production platform then it is going to have to
> start behaving like one.  This means that pushing things off on drive
> vendors is not acceptable.

I didn't see anyone explicitly state that the drivers were at fault.  I saw
a lot of people reporting that their site works, and a lot of requests for
details about just what you are running.  If you want your problem fixed
you will need to fill in as much detail as possible.

I have seen a thread like this go for 3 to 5 days, and then suddenly some
little detail comes out and we put a finger right on the problem.  I even
once exchanged some 15 to 20 emails with a gentleman trying to fix his
problem, when the little detail that he was running BSDI and not FreeBSD
came out.  Needless to say, that pointed right at what was wrong; I
corrected my strategy and fixed his problem for him in about 10 minutes.

What I am trying to get at here is: please be tolerant of all the
questions.  We are trying to get you fixed, but we need the details!!

> If you have a problem with a device, you *report it*.  Silent death is never
> acceptable.  The kernel is running in this case, but the system is hung
> waiting on I/O completion.

If you _detect_ a problem this holds true, but no code can detect all
possible problems :-(.

> I am not at all convinced this is a firmware issue.  If it was then the 83
> days of uptime on identically-configured BSDI machines wouldn't be happening.

Unless FreeBSD happens to do some different type of operation that causes
a different path in the firmware to be taken, or presses other limits that
you had not been using before.  Just because X works does not mean that Y
does not have a flaw in it.

> But they are.

:-(.

> Those 83-day uptimes are recorded on our production NFS servers which run a
> much heavier disk load, with the same devices, on a different OS with no
> problems.

Same _exact_ devices, or same _model_/_pn_/_revision_/_date_code_?

FreeBSD has been running on aha1742 based controllers quite stably for
well over 2 years (I know, in that my personal machines, plus the FreeBSD
development machine (freefall) and the cdrom ftp site (wcarchive), were
all built by me initially as ECS MB with aha1742 controllers).

I am seeing lots of sites here reporting ``no problem'' with their
aha1742/aha2742, so that leads me to want to know what is ``different''
about your site that causes this problem to show up.

I know these things:

a) You have a hang problem on a 2742 with no error message.
b) You have a hang problem on a 1742 with some error before it, but I
   did not see any error in your mail.
c) You are using Seagate and Micropolis (I think that is what you said)
   disk drives, but I have no idea as to what models.
d) You have BSDI running on similar hardware (maybe even the exact
   hardware) with long uptimes.
e) You crash once a day.
f) You publicly posted that you get a 200% performance boost running
   FreeBSD over BSDI, telling me we are probably pushing your hardware
   quite a bit harder than BSDI did.

What I do not know:

a) Are you using active termination?
b) Do your SCSI cables meet the SCSI-II spec with respect to all
   parameters (length, impedance, capacitance, etc)?
c) What exact models of disk drives are you using?
d) What is the error message you get?
e) What motherboard are you running on?  As much detail as possible.
f) What exact model/revision of aha174x and 274x are you using?
g) What other I/O cards are in the machine?
h) What is the system running as far as a work load?  Does any one
   specific work load tend to bring the crash out?
i) Are you willing to pay for production type support, or is this the
   reason you switched from BSDI to FreeBSD and now expect to get that
   level of support for free?  Contracted support is available from
   several people if you expect that level of service.

What I am willing to do:

a) As long as you keep answering the questions and filling in the
   details, I will continue to follow the thread so that we might come
   to a final resolution of your problem.
b) Resurrect my DX2/66 EISA 1742 based system and, time permitting (and
   I am one very busy person), replicate your environment as closely as
   I can to try and duplicate the bug here.
c) Loan you my aha1742 that I know has worked for 2.5 years with FreeBSD
   without a single hiccup.
d) Since you mentioned ``production'': if you are in a real hurry to get
   it fixed, you can pay me at contracted rates and I will be at your
   site with my equipment within 2 days.  This is an expensive option,
   but one that does exist.

-- 
Rod Grimes                                      rgrimes@gndrsh.aac.dev.com
Accurate Automation Company                 Reliable computers for FreeBSD