Date:      Wed, 17 Jan 2001 01:21:05 -0800 (PST)
From:      "Ronald F. Guilmette" <rfg@monkeys.com>
To:        FreeBSD-gnats-submit@freebsd.org
Subject:   kern/24401: Advansys SCSI driver crashes random userland progs w/SIGPROF
Message-ID:  <200101170921.f0H9L5203676@mail.monkeys.com>


>Number:         24401
>Category:       kern
>Synopsis:       Advansys SCSI driver crashes random userland progs w/SIGPROF
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Jan 17 01:30:01 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Ronald F. Guilmette
>Release:        FreeBSD 4.2-RELEASE i386
>Organization:
Infinite Monkeys & Co.
>Environment:

	System consists of:

	ASUS P5A motherboard
	256 MB SDRAM
	3 different PCI 10/100 ethernet controllers (xl0 rl0 rl1)
	ATAPI/EIDE CD ROM drive (ASUS 40x)
	Advansys model ASB3940UA Ultra/Narrow PCI SCSI controller
	3 different SCSI hard drives

>Description:

	(Note:  This system was crashing random userland processes under
	FreeBSD 4.0, usually with spurious SIGVTALRM signals being sent
	to random processes at seemingly random times.  Now I suspect that
	at last I know why, however I'm not 100% sure that those incidents
	were directly related to the bug that I am reporting here.  THOSE
	problems with FreeBSD 4.0 got so bad that I had to back out of my
	upgrade to 4.0 and go back to 3.3.)

	Now, on to the current problem/bug...

	I recently was in the process of decommissioning an old (but large)
	Narrow/Ultra SCSI drive that was in the system (along with some
	others) i.e. an IBM model DCAS-34330 (4.3 GB).

	I backed up everything useful from that drive (a complete 4.1.1 system)
	onto my trusty old HP 35470A SCSI DAT/DDS tape drive and removed the
	drive physically from the system.  I then installed a Quantum 4.5 GB 
	SCSI drive (Quantum Viking 4.5) and loaded up FreeBSD 4.2 onto it.
	I then powered down, reattached the old IBM SCSI drive to the
	system (in an external case this time), did a low-level format on it,
	made a fresh file system on it, and then tried to restore my
	important stuff from my backup tape onto the old IBM drive (using
	cpio).

	That restore from tape seemed to work OK until about half-way through
	when cpio crashed, apparently because it had received a totally
	unexpected SIGPROF.  (The console message at the time cpio crashed
	said "Profiling time alarm" aka SIGPROF.)

	At first, I just chalked this up to sunspots or to gremlins or to
	the phase of the moon or something, and I just shrugged it off.  (I
	didn't really need to do this restore from tape anyway.)

	A little later, I decided to sell the old IBM drive on eBay, but
	first I wanted to make sure that there would not be any incriminating
	White House E-mail messages left intact on the drive. :-) So I did
	the following to try to erase whatever was on there formerly, and
	to wipe the drive totally clean:

		dd if=/dev/zero of=/dev/da1 bs=4096

	This also seemed to be working ok... for a while.  But after a while,
	the dd process also crashed and the console said "Profiling time alarm"
	(aka SIGPROF).

	I did the dd again and the same exact thing happened.

	I then decided to try to see if these failures were random or if they
	were always happening at the same spot on the disk.  So I wrote a
	little C program (attached below) which would just write 4 KB sized
	blocks of zeros to any device it was told to write them to... while
	printing the block numbers as it was writing... and then I ran that
	against /dev/da1.  Sure enough, after 485207 4KB blocks had been
	written (about half the disk) the system locked up.  The X server
	stopped responding to the mouse and to keyboard input and about a
	minute later, the system rebooted of its own accord.

	I figured that the Advansys driver was sending the spurious SIGPROF
	signals to whatever userland process happened to be running at the
	unfortunate moment when it (the Advansys driver) tried to throw one
	of these signals.  So I decided that it would be best to try running
	my little "zerodisk" test program when X was *not* running.

	I then did that... several times.

	In all cases (4) my little "zerodisk" program crashed unexpectedly
	(console message was always "Profiling time alarm") after it had
	already written several hundred thousand 4KB blocks of zeros to
	/dev/da1.  Here are some of the block counts at the times of the
	crashes:

		344449
		329357
		314214
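
	(For scale, a rough conversion of these block counts to byte
	offsets.  This is only arithmetic on the numbers above, assuming
	the 4096-byte block size that zerodisk writes; the helper names
	`blocks_to_bytes' and `print_offset' are made up for illustration.)

```c
#include <stdio.h>

enum { block_size = 4096 };	/* same block size that zerodisk writes */

/* Bytes covered by `nblocks' full blocks.  unsigned long long avoids
   overflow on systems where long is only 32 bits (e.g. i386). */
static unsigned long long
blocks_to_bytes (unsigned long nblocks)
{
  return (unsigned long long) nblocks * block_size;
}

/* Print one block count alongside the byte total it corresponds to. */
static void
print_offset (unsigned long nblocks)
{
  printf ("%lu blocks = %llu bytes\n", nblocks, blocks_to_bytes (nblocks));
}
```

	By this arithmetic the three counts above come to roughly 1.41 GB,
	1.35 GB and 1.29 GB, and the earlier 485207-block lockup to roughly
	1.99 GB, so the failures do not land on one fixed spot on the disk.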

	As you can see, it may take a while, but with the Advansys controller
	in the system, I could *always* and *repeatedly* get the driver to
	send one of these spurious SIGPROF signals to some undeserving userland
	process.  (On an otherwise quiet system, my little "zerodisk" program
	itself was the one most likely to be scheduled for execution by the
	kernel at any given instant in time, so it usually received these
	signals.  But I believe that I have evidence that these spurious
	SIGPROF signals might also get sent in some cases to other random
	userland processes... depending on the exact timing of their genera-
	tion within the kernel.)

	After this, I gen'd up a new kernel (with Adaptec support in it),
	installed that, yanked the Advansys SCSI card out and plugged in
	an Adaptec 3940AU and re-ran my "zerodisk" test program against
	/dev/da1.  I did this THREE TIMES, just to be sure, and it worked
	flawlessly each time, all the way to the end of the disk... over
	1,000,000 4 KB block writes in each case.

	The bottom line is that if you do enough writes (several hundred
	thousand, typically) using an Advansys 3940UA controller, and an
	ordinary Ultra/Narrow SCSI drive (note: the IBM I used does NOT
	support tagged command queueing) using FreeBSD 4.2 and the Advansys
	driver contained therein, then eventually you are going to work the 
	Advansys driver into a state where it will start throwing SIGPROF
	signals at random times to random userland processes for no apparent
	reason.  This _does not_ occur with other SCSI controllers (e.g.
	AHA-3940AU) in the exact same system/environment.

	Clearly the Advansys driver has a VERY subtle, but very bad bug
	which, it appears, can only be consistently/dependably elicited
	via a very intense stress test, e.g. several hundred thousand writes
	to disk before you can be assured of seeing the bug.

	I am filing this bug report as critical/high-priority because
	the effects of this bug are so nefarious, i.e. crashing random
	userland programs (maybe even init and/or the X server) at totally
	unpredictable and random times.  (This sort of thing could give
	FreeBSD a bad reputation for unreliability!)
	
>How-To-Repeat:

	Get yourself an Advansys ASB3940UA Ultra/Narrow PCI SCSI controller.
	Put it into an otherwise unremarkable x86/PCI system.  Plug in one
	SCSI drive and install FreeBSD 4.2 on it.  Plug in a second SCSI
	drive (at least 2GB, but 4GB would be better) that you can afford
	to overwrite entirely, and then just do:

		dd if=/dev/zero of=/dev/da1 bs=4096

	(preferably on a quiet system, without any X server running) and then
	just sit back and wait.  After a while, the dd process will crash and
	you'll get the message:

		Profiling time alarm

	(I will even loan this exact IBM drive, and the controller, to anyone
	who wants to work on this bug.  Just ask.  The controller is useless
	to me now anyway... until someone fixes this bug... and I was gonna
	sell the drive on eBay anyway.)

	Alternatively, you can run the following simple "zerodisk" program
	that I cooked up.  This will give you essentially the same results,
	but will show how many blocks got written before the spurious SIGPROF
	arrives.  (BE VERY CAREFUL USING THIS PROGRAM.  It must be run as
	root to access the disk device files and it can easily wipe out an
	entire disk permanently.  In fact that is the purpose for which it
	was written!)

	/* zerodisk.c */
	
	#include <stdio.h>
	#include <stdlib.h>	/* exit() */
	#include <stdarg.h>
	#include <string.h>
	#include <errno.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <signal.h>
	
	static char const *pname;
	
	static void
	usage (void)
	{
	  fprintf (stderr, "%s: Usage: `%s device'\n", pname, pname);
	  exit (1);
	}
	
	static void
	errorv (register char const *const fmt, va_list ap)
	{
	  fprintf (stderr, "%s: ", pname);
	  vfprintf (stderr, fmt, ap);
	  fputc ('\n', stderr);
	}
	
	static void
	error (register char const *const fmt, ...)
	{
	  va_list ap;
	
	  va_start (ap, fmt);
	  errorv (fmt, ap);
	  va_end (ap);
	}
	
	static void
	fatal (register char const *const fmt, ...)
	{
	  va_list ap;
	
	  va_start (ap, fmt);
	  errorv (fmt, ap);
	  va_end (ap);
	  exit (1);
	}
	
	int
	main (register int const argc, char *argv[])
	{
	  enum { block_size = 4096 };
	  static char zeros[block_size];
	  register int fd;
	  register unsigned long blockno = 0;
	
	  pname = strrchr (argv[0], '/');
	  pname = pname ? pname+1 : argv[0];
	
	  if (argc != 2)
	    usage ();
	
	  if ((fd = open (argv[1], O_WRONLY)) == -1)
	    fatal ("Error opening `%s': %s", argv[1], strerror (errno));
	
	  for (;;)
	    {
	      register ssize_t n;
	
	      printf ("\rWriting block %lu", ++blockno);
	      fflush (stdout);
	      if ((n = write (fd, zeros, block_size)) == -1)
		{
		  putchar ('\n');
		  fatal ("Error writing `%s': %s", argv[1], strerror (errno));
		}
	      if (n < block_size)
		{
		  putchar ('\n');
		  error ("EOF detected on `%s'", argv[1]);
		  exit (0);
		}
	    }
	}

>Fix:

	Buy and install a non-Advansys brand of SCSI controller.

>Release-Note:
>Audit-Trail:
>Unformatted:

