From owner-freebsd-hackers Mon Aug 28 18:20:12 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.FreeBSD.org (8.6.11/8.6.6) id SAA03869 for hackers-outgoing; Mon, 28 Aug 1995 18:20:12 -0700
Received: from hutcs.cs.hut.fi (hutcs.cs.hut.fi [130.233.192.2]) by freefall.FreeBSD.org (8.6.11/8.6.6) with SMTP id SAA03863 for ; Mon, 28 Aug 1995 18:20:07 -0700
Received: from shadows.cs.hut.fi by hutcs.cs.hut.fi with SMTP id AA07478 (5.65c8/HUTCS-S 1.4 for ); Tue, 29 Aug 1995 04:20:00 +0300
Received: (hsu@localhost) by shadows.cs.hut.fi (8.6.10/8.6.10) id EAA08982; Tue, 29 Aug 1995 04:20:02 +0300
Date: Tue, 29 Aug 1995 04:20:02 +0300
Message-Id: <199508290120.EAA08982@shadows.cs.hut.fi>
From: Heikki Suonsivu
To: "Rashid Karimov."
Cc: freebsd-hackers@freefall.FreeBSD.org
In-Reply-To: "Rashid Karimov."'s message of 28 Aug 1995 21:16:34 +0300
Subject: S.O.S -2.1Stable and ASUSP54TP4
Organization: Helsinki University of Technology, Otaniemi, Finland
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

> system locks at random times w/o any messages at the console/
> log files. Locks means the system becomes unreachable both
> from the local net and from the console.
>
> After I hit the "reboot" switch, the system reboots up to the fsck
> stage and starts complaining that it can't read partition
> information off the second HDD (Seagate Barracuda 4 GB) (!).
>
> If one hits "reboot" again, goes to the Adaptec BIOS and
> runs disk utilities --> media check from there - the BIOS (!)
> complains that it can not talk to the second HD.
>
> The problem goes away _only_ after powercycling the whole PC.
>
> I never saw stuff like this before ... any suggestions ?

What we see here: One of the SCSI disks becomes unreachable: IBM 0662's
say "Disk drive is becoming ready" and often survive; Seagates lock up.
Usually we get I/O errors, a panic, and the system gets stuck in the
SCSI BIOS probes (probably; it says WAIT and sits there until reset,
sometimes requiring several resets or a power cycle).
Almost everything has been changed already, the whole system around it.
The only original parts are the box (power supply) and the IBM 0662 root
disk; the latter will be replaced by the end of the week. It has been a
P60 and a P90, and Buslogic and cheap NCR controllers have been tried
out. Currently it has two NCRs, one with the 0662 and a 4G Seagate Hawk
and another with a 1G Seagate Hawk. I had a Seagate Barracuda in the
system, and it gave similar problems.

> Everything is fine as long as you don't have too much activity
> going on on the system. Some of the servers I have here run for
> months w/o problems - but they do DNS/WWW/INN stuff. As soon
> as you put 3000-4000 users on the system - that's when the
> shit begins.

The problem is clearly load related: we get about one lockup a day, when
it is getting a news feed in. If the news feed is dead, it seems to stay
up fine.

> Till now _the_ most stable version is the SNAP from back in Feb 95.
> It has been up for 24 days, runs 4000 accounts, 50-70 users online.
> Bad things about it: no support for 2940, SMC EtherPower, and
> QUOTAs don't work.

I'm still suspecting the hardware I have, but after replacing the last
original component I'm running out of ideas. But it certainly should not
hang in BIOS probes (assuming that Buslogic & NCR did their code right).
This has been around since spring, at least.

I also could find certain sequences of disk accesses which killed the
machine repeatably. When I switched the 2G Barracuda to a 4G Hawk and
copied the news spool over, it always hung on certain files when
tarring; I tarred the files before the place it hung separately, removed
the copied files, and reran the tar starting at the last hang; now it
got past it. Another case I had was when trying to install an
application which created dbm indexes; it always hung the system when
creating the indexes.

So it seems that certain sequences of disk accesses kill the SCSI. Maybe
Seagate did something wrong in their disks? Tagged queuing? When did it
come around?

-- 
Heikki Suonsivu, Täysikuu 10 C 83/02210 Espoo/FINLAND, hsu@cs.hut.fi
home +358-0-8031121 work -4513377 fax -4555276 riippu SN