From: Mikhail Teterin
To: stable@FreeBSD.org
Date: Sun, 18 Jul 2004 03:03:30 -0400
Subject: how to royally mess up a -stable system

Have ImageMagick try to load a really big image file -- big enough to overflow your /var/tmp ...

Here is the state of the box (after the libMagick process was killed), from `systat -pigs':

                    /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   |||||||||||||||||||||||||||||||||||||||||||||||||| 10.5

                    /0   /10  /20  /30  /40  /50  /60  /70  /80  /90  /100
root         syncer XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root     pagedaemon XXXXX
                    X
[...]

The machine is almost entirely unresponsive. When tcsh echoes the commands back at all, it cannot execute them for many minutes.

The kernel then tries to log each and every error:

[...]
Jul 18 02:06:33 corbulon /kernel: vnode_pager_putpages: residual I/O 65536 at 18777
Jul 18 02:06:33 corbulon /kernel: pid 7 (syncer), uid 0 on /var: file system full
Jul 18 02:06:33 corbulon /kernel: vnode_pager_putpages: I/O error 28
Jul 18 02:06:33 corbulon /kernel: vnode_pager_putpages: residual I/O 65536 at 18778
Jul 18 02:06:34 corbulon /kernel: pid 7 (syncer), uid 0 on /var: file system full
Jul 18 02:06:34 corbulon /kernel: vnode_pager_putpages: I/O error 28
Jul 18 02:06:34 corbulon /kernel: vnode_pager_putpages: residual I/O 65536 at 18779
Jul 18 02:06:34 corbulon /kernel: pid 7 (syncer), uid 0 on /var: file system full
Jul 18 02:06:35 corbulon /kernel: vnode_pager_putpages: I/O error 28
[...]
Jul 18 02:09:43 corbulon /kernel: vnode_pager_putpages: residual I/O 40960 at 8179
Jul 18 02:09:43 corbulon /kernel: pid 3 (pagedaemon), uid 0 on /var: file system full
Jul 18 02:09:43 corbulon /kernel: vnode_pager_putpages: I/O error 28
Jul 18 02:09:43 corbulon /kernel: vnode_pager_putpages: residual I/O 40960 at 8179
Jul 18 02:09:43 corbulon /kernel: pid 3 (pagedaemon), uid 0 on /var: file system full
Jul 18 02:09:43 corbulon /kernel: vnode_pager_putpages: I/O error 28
Jul 18 02:09:43 corbulon /kernel: vnode_pager_putpages: residual I/O 40960 at 8179
Jul 18 02:09:43 corbulon /kernel: pid 3 (pagedaemon), uid 0 on /var: file system full
[...]

which really chokes the box even though /var/log is on a different device from /var/tmp ...

Yesterday my 4.8-stable kernel had to be cold-rebooted after almost a year because of this -- existing processes (sshd, webmin) were responding sometimes, but were unable to launch any new processes -- like a shell (in the case of sshd) or even /sbin/reboot (in the case of webmin).

Why is a fast-writing program (not run by root) able to hang a server?

Perhaps these errors logged by the kernel could be made less specific and fit on one line? That way syslogd would at least be able to cope with them better.

	-mi
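
To illustrate the "less specific, one line" idea: the three kernel messages above differ only in their numeric fields (block offset, pid, error number), so consecutive entries never look identical. A minimal user-space sketch in Python, assuming that replacing every number with a placeholder is acceptable, shows how the flood collapses to a few summary lines. The names message_key and coalesce are hypothetical; this is not how the kernel or syslogd actually handles these messages.

```python
import re
from collections import OrderedDict

# Assumption for this sketch: any run of digits is a "variable" field.
_NUM = re.compile(r"\d+")


def message_key(msg):
    """Hypothetical helper: replace every number with 'N' so that
    messages differing only in numeric fields share one key."""
    return _NUM.sub("N", msg)


def coalesce(lines):
    """Return one summary line per distinct message shape,
    in first-seen order, with a repeat count appended."""
    counts = OrderedDict()
    for line in lines:
        key = message_key(line)
        counts[key] = counts.get(key, 0) + 1
    return ["%s (repeated %d times)" % (key, n) for key, n in counts.items()]


# Message texts taken from the log excerpt above (timestamps stripped).
sample = [
    "vnode_pager_putpages: residual I/O 65536 at 18777",
    "pid 7 (syncer), uid 0 on /var: file system full",
    "vnode_pager_putpages: I/O error 28",
    "vnode_pager_putpages: residual I/O 65536 at 18778",
    "pid 7 (syncer), uid 0 on /var: file system full",
    "vnode_pager_putpages: I/O error 28",
]

summary = coalesce(sample)
for entry in summary:
    print(entry)
```

The obvious cost is that the summary no longer says which offset or error number was involved; a real implementation would probably keep the first occurrence verbatim and only count the rest.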