From owner-freebsd-hackers  Mon Aug 28 18:20:12 1995
Return-Path: hackers-owner
Received: (from majordom@localhost)
          by freefall.FreeBSD.org (8.6.11/8.6.6) id SAA03869
          for hackers-outgoing; Mon, 28 Aug 1995 18:20:12 -0700
Received: from hutcs.cs.hut.fi (hutcs.cs.hut.fi [130.233.192.2])
          by freefall.FreeBSD.org (8.6.11/8.6.6) with SMTP id SAA03863
          for <freebsd-hackers@freefall.FreeBSD.org>; Mon, 28 Aug 1995 18:20:07 -0700
Received: from shadows.cs.hut.fi by hutcs.cs.hut.fi with SMTP id AA07478
  (5.65c8/HUTCS-S 1.4 for <freebsd-hackers@freefall.cdrom.com>); Tue, 29 Aug 1995 04:20:00 +0300
Received: (hsu@localhost) by shadows.cs.hut.fi (8.6.10/8.6.10) id EAA08982; Tue, 29 Aug 1995 04:20:02 +0300
Date: Tue, 29 Aug 1995 04:20:02 +0300
Message-Id: <199508290120.EAA08982@shadows.cs.hut.fi>
From: Heikki Suonsivu <hsu@cs.hut.fi>
To: "Rashid Karimov." <rashid@haven.ios.com>
Cc: freebsd-hackers@freefall.FreeBSD.org
In-Reply-To: "Rashid Karimov."'s message of 28 Aug 1995 21:16:34 +0300
Subject: S.O.S -2.1Stable and ASUSP54TP4
Organization: Helsinki University of Technology, Otaniemi, Finland
Sender: hackers-owner@FreeBSD.org
Precedence: bulk


	   system locks up at random times w/o any messages at the console/
	   log files. "Locks up" means the system becomes unreachable both
	   from the local net and from the console.
	   After I hit "reboot" switch, system reboots up to the fsck
	   level and it starts complaining that it can't read partition
	   information off the second HDD ( Seagate Barracuda 4 Gb) (!).

	   If one hits "reboot" again and goes to the Adaptec BIOS and runs
	   disk utilities --> media check from there - the BIOS (!) complains
	   that it cannot talk to the second HD.

	   The problem goes away _only_ after power-cycling the whole PC.
	   I have never seen anything like this before ... any suggestions?

What we see here:

One of the SCSI disks becomes unreachable: the IBM 0662s say "Disk drive is
becoming ready" and often survive; the Seagates lock up.  Usually we get I/O
errors, then a panic, and the system gets stuck in SCSI BIOS probes
(probably; it says WAIT and sits there until reset, sometimes requiring
several resets or a power cycle).

Almost everything has been changed already, the whole system around it.
The only original things are the box (power supply) and the IBM 0662 root
disk; the latter will be replaced by the end of the week.

It has been both a P60 and a P90, and both Buslogic and cheap NCR
controllers have been tried.  Currently it has two NCRs, one with the 0662
and a 4G Seagate Hawk and another with a 1G Seagate Hawk.


I had a Seagate Barracuda in the system, and it gave similar problems.

	   Everything is fine as long as you don't have too much activity
	   going on the system. Some of the servers I have here run for
	   months w/o problems - but they do DNS/WWW/INN stuff.
	   As soon as you put 3000-4000 users on the system - that's when
	   the shit begins.

The problem is clearly load related: we get about one lockup a day, while
it is taking in a news feed.  If the news feed is dead, it seems to stay up
fine.

	   So far _the_ most stable version is the SNAP from Feb 95.
	   It has been up for 24 days, runs 4000 accounts, 50-70 users online.
	   Bad things about it:
	   no support for the 2940 or SMC EtherPower, and QUOTAs don't work.

I still suspect the hardware I have, but once the last original component
is replaced I will be out of ideas.  In any case it certainly should not
hang in BIOS probes (assuming that Buslogic & NCR got their code right).
This has been going on since spring, at least.

I could also find certain sequences of disk accesses which killed the
machine repeatably.  When I switched the 2G Barracuda to a 4G Hawk and
copied the news spool over, it always hung on certain files when tarring; I
tarred the files before the place it hung separately, removed the copied
files, and reran the tar starting at the last hang; this time it got past
it.  Another case came up when I tried to install an application which
created dbm indexes; it always hung the system when creating the indexes.
So it seems that certain sequences of disk accesses kill the SCSI.

Maybe Seagate did something wrong in their disks?  Tagged queuing?  When
did it come around?

-- 
Heikki Suonsivu, Täysikuu 10 C 83/02210 Espoo/FINLAND,
hsu@cs.hut.fi  home +358-0-8031121 work -4513377 fax -4555276  riippu SN