From owner-freebsd-current@FreeBSD.ORG  Fri Apr 22 13:13:49 2005
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 938E916A4CE
	for <freebsd-current@freebsd.org>;
	Fri, 22 Apr 2005 13:13:49 +0000 (GMT)
Received: from pne-smtpout2-sn1.fre.skanova.net
	(pne-smtpout2-sn1.fre.skanova.net [81.228.11.159])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 2AA1E43D31
	for <freebsd-current@freebsd.org>;
	Fri, 22 Apr 2005 13:13:49 +0000 (GMT)
	(envelope-from daniel_k_eriksson@telia.com)
Received: from sentinel (195.198.193.104) by pne-smtpout2-sn1.fre.skanova.net
	(7.1.026.7)
	id 42687F2000023DBE for freebsd-current@freebsd.org;
	Fri, 22 Apr 2005 15:13:48 +0200
From: "Daniel Eriksson" <daniel_k_eriksson@telia.com>
To: "'FreeBSD Current'" <freebsd-current@freebsd.org>
Date: Fri, 22 Apr 2005 15:13:43 +0200
Organization: Home
Message-ID: <!~!UENERkVCMDkAAQACAAAAAAAAAAAAAAAAABgAAAAAAAAA0VcX9IoJqUaXPS8MjT1PdsKAAAAQAAAAKDq8qQ2O9UK7PKMOCt2NqwEAAAAA@telia.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Mailer: Microsoft Office Outlook, Build 11.0.6353
Thread-Index: AcVHPRvXPQgKmmOvTWmFZWPqKX1DMA==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527
Subject: Serious I/O problems (bad performance and live-lock)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
X-List-Received-Date: Fri, 22 Apr 2005 13:13:49 -0000


With recent CURRENT (at least for the last 2 days, but probably longer), two
of my systems can be brought to their knees (live-lock) with a simple "dd
if=/dev/zero of=test bs=128k" command. I have not tested any other systems.
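
For anyone who wants to try this without filling a disk, a bounded variant of
the command above might look like the sketch below. The function name
run_dd_test and the count= limit are assumptions added for illustration; the
original command has no count= and writes until it is interrupted or the
filesystem is full.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): the same dd write
# test, but bounded with count= so that it terminates on its own.
#   $1 = output file
#   $2 = number of 128k blocks to write
run_dd_test() {
    out=$1
    count=$2
    # Same input, output style and block size as the reported command;
    # only count= is added. dd's transfer summary on stderr is discarded.
    dd if=/dev/zero of="$out" bs=128k count="$count" 2>/dev/null
}

# Example (left commented out so that sourcing this file writes nothing):
#   run_dd_test test 8192     # 8192 x 128k = 1 GiB
```

A bounded run may finish before the problem shows up, so the count would have
to be large enough to write for a minute or more on the system under test.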

I keep both servers in sync, running 6-CURRENT:

Server #1: dual AthlonMP 2600+, Compaq SmartArray 5302/64 hardware RAID card
(ciss). The card hosts two arrays: one RAID-5 array built from 4 discs that
holds the system, and one RAID-0 array built from 14 discs. All the discs are
36GB 10k rpm, and each array is on its own channel on the card.

Server #2: AthlonXP 2500+ with an old Maxtor 27GB UDMA66 disc for the
system.


What made me take notice was that server #2 ran through a "make
installkernel; make installworld" faster than server #1 during a recent
upgrade. This makes no sense given the superior I/O performance of the
hardware SCSI RAID array on server #1, and I know that in the past server #1
has finished the process ahead of server #2.

After the upgrade was done I ran some simple tests with 'dd', and it only
took ~1 minute for the system to live-lock. Breaking into DDB and killing
the 'dd' process brought the machine back to life. I assumed the problem was
ciss-related, CAM-related or SMP-related, but I just tried doing the same
thing on the UP machine (server #2), and it too live-locked within a minute.
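
Since killing 'dd' was enough to recover, one way to make repeated test runs
less painful might be to start 'dd' with a watchdog that kills it after a
fixed time. The sketch below is only an illustration: the function name
dd_with_timeout and the time limit are assumptions, and there is no guarantee
that the watchdog still gets scheduled once the machine has live-locked.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): run the dd write
# test in the background and kill it after a time limit.
#   $1 = output file
#   $2 = time limit in seconds
dd_with_timeout() {
    out=$1
    limit=$2

    # Unbounded write, as in the reported command, but in the background.
    dd if=/dev/zero of="$out" bs=128k 2>/dev/null &
    dd_pid=$!

    # Watchdog: wait for the time limit, then send SIGTERM to dd.
    ( sleep "$limit"; kill "$dd_pid" 2>/dev/null ) &
    watchdog_pid=$!

    # Block until dd exits, either on its own or because it was killed.
    wait "$dd_pid"
    status=$?

    # dd is gone, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null
    return "$status"
}

# Example (left commented out so that sourcing this file writes nothing):
#   dd_with_timeout test 30
```

A non-zero return status means 'dd' did not finish by itself, which is the
expected outcome when the watchdog has to step in.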

Both systems use pretty much the same config, the only major difference
being whether SMP is enabled:
* SCHED_4BSD, PREEMPTION, ADAPTIVE_GIANT, DEVICE_POLLING, HZ=2000
* debug.mpsafenet="1", debug.mpsafevfs="1"

The problem manifests itself like this:
* Shortly after 'dd' is started, the machine starts to swap.
* The swapping makes the machine very unresponsive.
* After about a minute the machine enters some sort of live-lock where
  the IP stack still replies to ICMP echo requests, but nothing else can
  be done.

The last test I did was on a system compiled from sources dated
2005.04.22.01.00.00 (earlier today). The oldest system I've tested is from
2005.04.20.14.30.00 (but I did notice the system being slightly sluggish
earlier in the week too, so I think the problem is older than that).

This is a serious regression! I don't know when I last did any testing with
'dd', but I'm pretty sure it was less than 3 months ago (and back then
neither system live-locked).

/Daniel Eriksson