Date:      Mon, 20 Mar 2000 19:18:50 -0800
From:      Mike Smith <msmith@freebsd.org>
To:        "John W. DeBoskey" <jwd@unx.sas.com>
Cc:        Mike Smith <msmith@freebsd.org>, freebsd-current@freebsd.org, Brad Chisholm <sasblc@unx.sas.com>
Subject:   Re: AMI MegaRAID lockup? not accepting commands. 
Message-ID:  <200003210318.TAA64793@mass.cdrom.com>
In-Reply-To: Your message of "Mon, 20 Mar 2000 21:55:27 EST." <200003210255.VAA24932@bb01f39.unx.sas.com> 

>    The controller is new. Dell calls it a Perc2/dc and it has 128Meg
> of memory installed in it. I'm not sitting in front of the
> machine right now. More detailed information is available
> when the machine is booted and you enter the BIOS setup
> on the adapter card.

Ok.  From some rumours coming out of Dell, I get the impression that this 
is an Enterprise 1400 or 1500 with only two channels loaded.  I guess I 
need a better way of telling these controllers apart. 8(

> > >    We have a system with a new AMI card in it controlling a pair
> > > of shelves from Dell (fbsd dated: 4.0-20000313-SNAP).
> > > 
> > >    The relevant dmesg output is below: (complete dmesg at end)
> > > 
> > > amr0: <AMI MegaRAID> mem 0xf6c00000-0xf6ffffff irq 14 at device 10.1 on pci2
> > > amr0: firmware 1.01 bios 1p00  128MB memory
> > > amrd0: <MegaRAID logical drive> on amr0
> > > amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)
> > > 
> > >    The adapter does not lock up while testing with bonnie and such.
> > 
> > Try running 20 or so bonnie processes in parallel; I can usually get it 
> > to lock up with this configuration.  I'm wondering which controller 
> > you've got there though - I don't recognise the BIOS/firmware versions.
> > 
> > > However, we have a 50Gig CVS repository sitting on the raid
> > > volume. When we do a 'cvs co' of -HEAD, it causes it to lock up.
> > > The following messages are repeating continuously:
> > > 
> > > Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)
> > 
> > I'm not sure why this happens; the controller isn't coming ready even 
> > though we haven't hit any sort of limit that we're aware of.  I've been 
> > considering some workarounds involving deferring the command until the 
> > controller gives us back an interrupt, but I'm still surprised that we 
> > get to this point at all.
> 
>    Well, we've been playing around in amr.c/amr_start in the following
> code sequence:
> 
>     /* spin waiting for the mailbox */
>     debug("wait for mailbox");
>     for (i = 10000, done = 0, worked = 0; (i > 0) && !done; i--) {
>         s = splbio();
> 
>         /* is the mailbox free? */
>         if (sc->amr_mailbox->mb_busy == 0) {
>             debug("got mailbox");
>             sc->amr_mailbox64->mb64_segment = 0;
>             bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
>             sc->amr_submit_command(sc);
>             done = 1;
>             sc->amr_workcount++;
>             TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);
> 
>             /* not free, try to clean up while we wait */
>         } else {
> -->>       printf("%s: busy flag %x\n", __FUNCTION__, sc->amr_mailbox->mb_busy);
>             debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
>             worked = amr_done(sc); 
>         }
>         splx(s);
>     }
> 
> 
> 
> 
>    Note the addition of the printf statement in the else clause. Two
> interesting things happen. One, we are unable to cause the controller
> to lock up. Two, the following messages show up in syslog:
> 
> Mar 20 12:55:15 cvsstage /kernel: amr_start: busy flag 1
> Mar 20 12:55:46 cvsstage last message repeated 1057 times
> Mar 20 12:57:47 cvsstage last message repeated 5574 times
> Mar 20 12:59:26 cvsstage last message repeated 5431 times
> Mar 20 12:59:26 cvsstage /kernel: amr_start: busy flag 0
> 
>    If I understand the sequence correctly, we enter splbio() and
> then check the mailbox. Most of the time, we take the else clause
> and the busy flag is 1 as it should be. However, once every 10 to 12
> thousand loops, mb_busy is checked as being 1, but by the time we
> get to the else clause, it's 0.
> 
>    I wonder if there is some sort of timing issue since the
> addition of the printf allows the card to operate correctly. I
> haven't traced the kernel printf code, but it could change the
> spl level thus allowing the mb_busy flag to be modified.
> 
>    Comments?

The mb_busy flag is in system memory, but it's maintained by the card 
itself (it will bus-master and update it according to its internal state).
Thus, when you see it printed as 0, somewhere between the test and the 
printf the controller has updated the flag and indicated it's no longer busy. 

You probably only see this quite rarely because the code path from the 
if() to the printf() is very short (a jump) while the code path the rest of
the way 'round is much longer (through printf(), amr_done(), splx(),
splbio() etc.).
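
If you want the logged value to always agree with the branch that was 
taken, one option is to read the flag once into a local and use that copy 
for both the test and the message.  A minimal sketch, runnable in user 
space; struct mailbox_view and report_if_busy() are names made up for 
illustration and are not part of amr.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical view of the mailbox: only the busy byte is modelled. */
struct mailbox_view {
    volatile unsigned char mb_busy;   /* updated by the controller via bus-master DMA */
};

/*
 * Read the flag exactly once and use that copy for both the test and
 * the message, so the logged value always matches the branch taken.
 * Returns 1 and fills buf if the mailbox was busy, 0 if it was free.
 */
static int report_if_busy(const struct mailbox_view *mb, char *buf, size_t len)
{
    unsigned char busy;

    assert(mb != NULL && buf != NULL && len > 0);
    busy = mb->mb_busy;               /* single read of the shared flag */
    if (busy == 0)
        return 0;
    snprintf(buf, len, "busy flag %x", (unsigned int)busy);
    return 1;
}
```

This doesn't change the timing of the loop at all; it only removes the 
confusing "busy flag 0" line from the log.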

Adding the printfs massively slows the loop down; you might try 
increasing the timeout (initial value of 'i') by an order of magnitude 
instead.  The real problem here is the spinloop - because the flag is in 
system memory, the loop runs entirely in the cache and thus executes 
insanely quickly.  If it wasn't for the fact that this code is called 
both with interrupts enabled and disabled, I'd use a much shorter loop 
and simply defer the command if the controller didn't come ready almost 
immediately.  Some strategic use of DELAY() might also help.  The Linux 
driver uses the following code:

/*==================================================*/
/* Wait until the controller's mailbox is available */
/*==================================================*/
static int mega_busyWaitMbox (mega_host_config * megaCfg)
{
  mega_mailbox *mbox = (mega_mailbox *) megaCfg->mbox;
  long counter;

  for (counter = 0; counter < 10000; counter++) {
    if (!mbox->busy) {
      return 0;
    }
    udelay (100);
    barrier();
  }
  return -1;                    /* give up after 1 second */
}

I'd be guessing that the current loop (100k iterations) is probably 
completing far sooner than 1s.  You could confirm this by grabbing a 
timestamp at the beginning of amr_start and then checking again at the 
point where it bails out.  If that's the case, try cutting the initial 
value of i down to 10,000 and insert a DELAY(100) in the "did not get 
mailbox" case.
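
As a rough illustration of that suggestion (a user-space sketch, not the 
actual amr.c change): poll a bounded number of times with a fixed delay 
between polls.  All of the names here (struct mbox_sketch, sketch_delay(), 
sketch_wait_mailbox()) are hypothetical; sketch_delay() stands in for 
DELAY() and also fakes the controller clearing the flag, so the loop can 
be exercised outside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the mailbox; only the busy byte matters here. */
struct mbox_sketch {
    volatile unsigned char mb_busy;   /* written by the controller via DMA */
};

/* Simulated clock and simulated controller, for user-space testing only. */
static unsigned long sketch_elapsed_us;     /* microseconds "waited" so far */
static unsigned long sketch_ready_at_us;    /* when the fake controller frees up */
static struct mbox_sketch *sketch_mbox;     /* NULL: controller never comes ready */

/* Stand-in for DELAY(us): advance the fake clock, maybe clear the flag. */
static void sketch_delay(unsigned int us)
{
    sketch_elapsed_us += us;
    if (sketch_mbox != NULL && sketch_elapsed_us >= sketch_ready_at_us)
        sketch_mbox->mb_busy = 0;
}

/*
 * Poll the busy flag up to 'tries' times, waiting 'delay_us' between polls.
 * Returns 0 as soon as the mailbox is free, -1 if it never came free
 * (worst case roughly tries * delay_us microseconds).
 */
static int sketch_wait_mailbox(struct mbox_sketch *mbox, int tries,
                               unsigned int delay_us)
{
    int i;

    assert(mbox != NULL);
    for (i = 0; i < tries; i++) {
        if (mbox->mb_busy == 0)
            return 0;                 /* got the mailbox */
        sketch_delay(delay_us);       /* did not get mailbox: back off */
    }
    return -1;                        /* caller should defer the command */
}
```

With tries = 10000 and delay_us = 100 the worst case is about one second, 
the same bound as the Linux loop above.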

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  msmith@freebsd.org
\\ and he'll hate you for a lifetime.             \\  msmith@cdrom.com



