Date:      Tue, 27 Jun 2000 12:44:32 -0700 (PDT)
From:      Matt Jacob <mjacob@FreeBSD.org>
To:        cvs-committers@FreeBSD.org, cvs-all@FreeBSD.org
Subject:   cvs commit: src/sys/dev/isp isp.c
Message-ID:  <200006271944.MAA53998@freefall.freebsd.org>

mjacob      2000/06/27 12:44:32 PDT

  Modified files:
    sys/dev/isp          isp.c 
  Log:
  Fix usage of DELAY (SYS_DELAY is the platform independent local
  define).  Fix stupidity wrt checking whether we've gone to
  LOOP_PDB_RCVD loopstate- it's okay to be greater than this state.
  D'oh! Protect calls to isp_pdb_sync and isp_fclink_state with IS_FC
  macros.
  
  Completely redo mailbox command routine (in preparation to make this
  possibly wait rather than poll for completion).
  
  Make a major attempt to solve the 'lost interrupt' problem:
  
  1. Problem
  
  The Qlogic cards would appear to 'lose' interrupts, i.e., a legitimate
  regular SCSI command placed on the request queue would never complete
  and the watchdog routine in the driver would eventually wake up and
  catch it. This would typically only happen on Alphas, although a
  couple folks with 700MHz Intel platforms have also seen this.
  
  For a long time I thought it was a foulup with f/w negotiations of
  SYNC and/or WIDE as it always seemed to happen right after the
  platform it was running on had done a SET TARGET PARAMETERS mailbox
  command to (re)enable sync && wide (after initially forcing
  ASYNC/NARROW at startup). However, occasionally, the same thing
  would occur for the Fibre Channel cards as well (which, ahem,
  have no SET TARGET PARAMETERS for transfer mode).
  
  After finally putting in a better set of watchdog routines for the
  platforms for this driver, it seemed to be the case that the command
  in question (usually a READ CAPACITY) just had up and died- the
  watchdog routine would catch it after ~10 seconds. For some platforms
  (NetBSD/OpenBSD)- an ABORT COMMAND mailbox command was sent (which
  would always fail- indicating that the f/w denied knowledge of this
  command, i.e., the f/w thought it was a done command). In any case,
  retrying the command worked. But this whole problem needed to be
  really fixed.
  
  2. A False Step That Went in The Right Direction
  
  The mailbox code was completely rewritten so that it no longer tries
  to grab the mailbox semaphore register or to complete async fast
  posting completions 'by hand'. It was also rewritten to use separate
  in && out bit patterns: one for the registers to load to start a
  command, one for the registers to retrieve when it completes. This
  means that isp_intr now handles mailbox completions.
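
  As a rough illustration of the in && out bit patterns (this is not
  the code in isp.c; every name below is hypothetical and the register
  file is a plain array rather than real hardware), a command
  descriptor might carry one mask of registers to load and one mask
  of registers to retrieve:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NMBOX 8   /* assumed register count for this sketch */

/* Hypothetical descriptor: bit i of each mask selects mailbox register i. */
typedef struct {
    uint16_t param[EX_NMBOX]; /* values to load, then values retrieved */
    uint8_t  ibits;           /* registers to load to start the command */
    uint8_t  obits;           /* registers to retrieve on completion */
} ex_mbox_cmd_t;

/* Stand-in for the card's mailbox registers (no real hardware access). */
static uint16_t ex_mbox_regs[EX_NMBOX];

/* Start a command: write only the registers named in ibits. */
static void ex_mbox_load(const ex_mbox_cmd_t *cmd)
{
    assert(cmd != NULL);
    for (int i = 0; i < EX_NMBOX; i++) {
        if (cmd->ibits & (1u << i))
            ex_mbox_regs[i] = cmd->param[i];
    }
}

/* Complete a command: read back only the registers named in obits.
 * In the scheme described above this would run from the interrupt path. */
static void ex_mbox_retrieve(ex_mbox_cmd_t *cmd)
{
    assert(cmd != NULL);
    for (int i = 0; i < EX_NMBOX; i++) {
        if (cmd->obits & (1u << i))
            cmd->param[i] = ex_mbox_regs[i];
    }
}
```

  Registers outside ibits are never written and registers outside
  obits are never copied back, so each command touches only what it
  declares.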
  
  This substantially simplifies the mailbox handling code, and carries
  things 90% toward getting this to be a non-polled routine for this
  driver.
  
  This did not solve the problem, though.
  
  3. Register Debouncing
  
  I saw some comments in some errata sheets and some notes in a Qlogic
  produced Linux driver (for the Qlogic 2100) that seemed to indicate
  that debouncing of reads of the mailbox registers might be needed,
  so I added this.  This did not fix the problem. In fact, it made
  the problem worse for non-2100 cards.
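
  For readers unfamiliar with the term, 'debouncing' a register read
  usually means re-reading until two consecutive reads agree. The
  sketch below shows that general idea only; it is not the code that
  was added to the driver, and the reader callback and fake register
  are hypothetical stand-ins for real register access:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register accessor: returns the current register value. */
typedef uint16_t (*ex_reg_read_fn)(void *ctx);

/* Re-read until two consecutive reads match, or give up after max_tries
 * extra reads. *stable reports whether a matching pair was seen. */
static uint16_t ex_read_debounced(ex_reg_read_fn rd, void *ctx,
                                  int max_tries, int *stable)
{
    uint16_t prev = rd(ctx);
    for (int i = 0; i < max_tries; i++) {
        uint16_t cur = rd(ctx);
        if (cur == prev) {
            *stable = 1;
            return cur;
        }
        prev = cur;
    }
    *stable = 0;
    return prev;
}

/* Fake register that replays a fixed sequence and then holds the last value. */
typedef struct {
    const uint16_t *seq;
    int len;
    int pos;
} ex_fake_reg_t;

static uint16_t ex_fake_read(void *ctx)
{
    ex_fake_reg_t *r = ctx;
    uint16_t v = r->seq[r->pos];
    if (r->pos < r->len - 1)
        r->pos++;
    return v;
}
```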
  
  4. Interrupt masking/unmasking
  
  The driver *used* to do a substantial amount of masking/unmasking
  of the interrupt control register. This was done to make sure that
  the core common code could just assume it would never get pre-empted.
  
  This apparently substantially contributed to the lost interrupt
  problem.  The rewrite of the ICR (Interrupt Control Register),
  which is a separate register from the ISR (Interrupt Status Register)
  should not have caused any change to interrupt assertions pending.
  The manual does not state that it will, and the register layout
  seems to imply that the ICR is just an active route gate. We only
  enable PCI Interrupts and RISC Interrupts- this should mean that
  when the f/w asserts a RISC interrupt (and the ICR allows RISC
  Interrupts) and we have PCI Interrupts enabled, we should get a
  PCI interrupt. Apparently this is a latch- not a signal route.
  
  Removing this got rid of *most* but not all, lost interrupts.
  
  5. Watchdog Smartening
  
  I made sure that the watchdog routine would catch cases where the
  Qlogic's ISR showed an interrupt assertion. The watchdog routine
  now calls the interrupt service routine if it sees this. Some
  additional internal state flags were added so that the watchdog
  routine could then know whether the command it was in the middle
  of burying (because we had timed it out) was in fact completed by
  the interrupt service routine.
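
  A minimal sketch of that watchdog logic, assuming a simulated ISR
  bit and a single outstanding command (the types, flag names and
  functions are hypothetical, not the driver's own):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ISR_INT_PENDING 0x0001  /* assumed 'interrupt asserted' bit */

typedef struct {
    bool done;       /* set by the interrupt path on completion */
    bool timed_out;  /* set by the watchdog when it buries the command */
} ex_cmd_t;

typedef struct {
    uint16_t  isr;      /* simulated Interrupt Status Register */
    ex_cmd_t *pending;  /* the one command the fake firmware has finished */
} ex_softc_t;

/* Simulated interrupt service routine: completes the pending command
 * and clears the asserted bit. */
static void ex_intr(ex_softc_t *sc)
{
    if ((sc->isr & EX_ISR_INT_PENDING) == 0)
        return;
    if (sc->pending != NULL) {
        sc->pending->done = true;
        sc->pending = NULL;
    }
    sc->isr &= (uint16_t)~EX_ISR_INT_PENDING;
}

/* Watchdog: if the ISR shows an assertion, run the interrupt path first,
 * then check whether that completed the command before timing it out. */
static void ex_watchdog(ex_softc_t *sc, ex_cmd_t *cmd)
{
    if (sc->isr & EX_ISR_INT_PENDING)
        ex_intr(sc);
    if (cmd->done)
        return;              /* completed by the interrupt path after all */
    cmd->timed_out = true;   /* genuinely overdue */
}
```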
  
  6. Occasional Constipation Of Commands..
  
  In running some very strenuous high IOPs tests (generating about
  11000 interrupts/second across one Qlogic 1040, one Qlogic 1080
  and one Qlogic 2200 on an Alpha PC164), I found that I would get
  occasional but regular 'watchdog timeouts' on both the 1080 and
  the 2100 cards. This is under FreeBSD, and the watchdog timeout
  routine just marks the command in error and retries it.
  
  Invariably, right after this 'watchdog timeout' error, I'd get a
  command completion for the command that I had thought timed out.
  That is, I'd get a command completion, but the handle returned by
  the firmware mapped to no current command. The frequency of this
  problem is low under such a load- it would usually take 30
  minutes per 'lost' interrupt.
  
  I doubled the timeout for commands to see if it just was an edge
  case of waiting too short a period. This had no effect.
  
  I gathered and printed out microtimes for the watchdog completed
  command and the completion that couldn't find a command- it was
  always the case that the order of occurrence was "timeout, completion"
  separated by a time on the order of 100 to 150 ms.
  
  This caused me to consider 'firmware constipation' as a
  possible culprit. That is, resubmission of a command to the device
  that had suffered a watchdog timeout seemed to cause the presumed
  dead command to show back up.
  
  I added code in the watchdog routine that, when first entered for
  the command, marks the command with a flag, reissues a local timeout
  call for one second later, but also then issues a MARKER Request
  Queue entry to the Qlogic f/w. A MARKER entry is used typically
  after a Bus Reset to cause the f/w to get synchronized with respect
  to either a Bus, a Nexus or a Target.
  
  Since I've added this code, I still see the occasional watchdog
  timeout, but the command that was about to be terminated now
  always seems to be completed after the MARKER entry is issued (and
  before the timeout extension fires, which would come back and
  *really* terminate the command).
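
  The two-stage watchdog described above can be sketched roughly as
  follows. This is a simulation under stated assumptions, not the
  code in isp.c: the types and names are hypothetical, queueing a
  MARKER is reduced to a counter, and the one-second extension is
  just a recorded value.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    EX_WD_ALREADY_DONE,  /* command completed before the watchdog ran */
    EX_WD_REARMED,       /* first expiry: MARKER queued, grace period started */
    EX_WD_TERMINATED     /* second expiry: command really terminated */
} ex_wd_result_t;

typedef struct {
    bool done;    /* completion arrived */
    bool grace;   /* watchdog has already fired once for this command */
    bool failed;  /* command terminated with an error */
} ex_cmd_t;

typedef struct {
    int markers_queued;  /* stand-in for MARKER Request Queue entries */
    int rearm_ms;        /* stand-in for the re-issued local timeout */
} ex_hba_t;

static ex_wd_result_t ex_watchdog(ex_hba_t *hba, ex_cmd_t *cmd)
{
    if (cmd->done)
        return EX_WD_ALREADY_DONE;
    if (!cmd->grace) {
        cmd->grace = true;       /* mark the command on first entry */
        hba->markers_queued++;   /* nudge the firmware with a MARKER */
        hba->rearm_ms = 1000;    /* come back one second later */
        return EX_WD_REARMED;
    }
    cmd->failed = true;          /* still not done after the extension */
    return EX_WD_TERMINATED;
}
```

  In the behaviour reported above, the completion arrives between the
  first and second calls, so the second call finds the command done
  and never terminates it.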
  
  Revision  Changes    Path
  1.45      +383 -470  src/sys/dev/isp/isp.c


