Date:      Sun, 8 May 2011 10:53:14 +0200
From:      Joerg Wunsch <freebsd-scsi@uriah.heep.sax.de>
To:        freebsd-scsi@freebsd.org
Subject:   Panic when removing a SCSI device entry
Message-ID:  <20110508085314.GA5364@uriah.heep.sax.de>

I've got a setup where a tape library is attached with a
computer-controllable power switch, so it is only turned on during the
time when backups (or restores) are done.  This is mainly to reduce
the noise level, but also to reduce the overall power consumption
while that library is not needed.

Every now and then, the kernel panics with a page fault during the
(unattended, it happens at night) power cycling and surrounding
actions.  The current process when the page fault happens is always
mt(1), which is used inside the powerup/down script to ensure the
drive is being properly rewound.  The page fault happens in
destroy_devl(), at this location:

        /* If we are a child, remove us from the parents list */
        if (dev->si_flags & SI_CHILD) {
here --->>>     LIST_REMOVE(dev, si_siblings);
                dev->si_flags &= ~SI_CHILD;
        }

The preprocessed code of that looks like:

 if (dev->si_flags & 0x0010) {
  if ((((dev))->si_siblings.le_next) != ((void *)0))
        (((dev))->si_siblings.le_next)->si_siblings.le_prev =
             (dev)->si_siblings.le_prev;
  *(dev)->si_siblings.le_prev = (((dev))->si_siblings.le_next);
  dev->si_flags &= ~0x0010;
 }

and it's the indirection of *(dev)->si_siblings.le_prev that hits a
NULL pointer.  Obviously, LIST_REMOVE doesn't anticipate that
dev->si_siblings.le_prev might be a NULL pointer, so this is a usage
error, somehow.  Could it be that destroy_devl() is called twice for
the same device?
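
To make the failure condition concrete, here is a minimal userland
sketch of the same list linkage.  Everything named fake_* is
hypothetical and only stands in for the kernel structures; the flag
value is the one from the expanded code above.  Going by that
expansion, a removal does not clear the element's own le_prev, so in
this sketch a NULL le_prev arises from an entry that has SI_CHILD set
but was zeroed or never linked.  Whether that is what actually happens
in the kernel is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SI_CHILD 0x0010   /* flag value as in the expanded code above */

/* Hypothetical stand-in for struct cdev; only the fields used here. */
struct fake_dev {
    unsigned int si_flags;
    struct {
        struct fake_dev *le_next;   /* next sibling */
        struct fake_dev **le_prev;  /* address of the pointer pointing at us */
    } si_siblings;
};

/* Link dev at the head of a child list, the way LIST_INSERT_HEAD does. */
static void
fake_link_child(struct fake_dev **head, struct fake_dev *dev)
{
    dev->si_siblings.le_next = *head;
    if (*head != NULL)
        (*head)->si_siblings.le_prev = &dev->si_siblings.le_next;
    *head = dev;
    dev->si_siblings.le_prev = head;
    dev->si_flags |= SI_CHILD;
}

/* Would the unguarded "*le_prev = le_next" store fault for this entry? */
static int
fake_remove_would_fault(const struct fake_dev *dev)
{
    return ((dev->si_flags & SI_CHILD) != 0 &&
        dev->si_siblings.le_prev == NULL);
}

/* Compare a properly linked entry with one that only carries the flag. */
static int
fake_demo(void)
{
    struct fake_dev *head = NULL;
    struct fake_dev linked = { 0 }, stray = { 0 };

    fake_link_child(&head, &linked);
    stray.si_flags |= SI_CHILD;     /* flag set, but never linked */

    assert(!fake_remove_would_fault(&linked));
    return (fake_remove_would_fault(&stray));
}
```

fake_demo() reports the stray entry as the one that would fault, which
is the state the panic suggests: SI_CHILD set while le_prev is NULL.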

This used to happen on an earlier system (some version of 7.x-stable),
and I eventually managed to tweak the library's powerup/down scripts
so as to avoid the critical sequence of actions triggering this
situation.  Now that I have finally upgraded the machine to 8.2-STABLE,
it is triggered very frequently again.

Any ideas how to fix it, or at least apply a workaround, other than
turning

        *(elm)->field.le_prev = LIST_NEXT((elm), field);         \

in the LIST_REMOVE macro into

        if ((elm)->field.le_prev != NULL) \
          *(elm)->field.le_prev = LIST_NEXT((elm), field);       \

which affects the entire system, not just the SCSI subsystem part?
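
A narrower variant would be to put the guard at the single call site
instead of into the macro.  The following is only a userland sketch of
that idea, not a tested kernel patch: sketch_dev and
sketch_unlink_child() are hypothetical names, and clearing the links
after removal is my own addition so that a repeated call becomes
harmless.  Like the macro change, it would only hide whatever leaves
the entry in an inconsistent state.

```c
#include <assert.h>
#include <stddef.h>

#define SI_CHILD 0x0010   /* flag value as in the expanded code above */

/* Hypothetical stand-in for struct cdev; only the fields used here. */
struct sketch_dev {
    unsigned int si_flags;
    struct {
        struct sketch_dev *le_next;
        struct sketch_dev **le_prev;
    } si_siblings;
};

/*
 * Guarded removal for one call site.  Returns 1 if the entry was
 * unlinked, 0 if there was nothing (safe) to do.
 */
static int
sketch_unlink_child(struct sketch_dev *dev)
{
    if ((dev->si_flags & SI_CHILD) == 0)
        return (0);
    dev->si_flags &= ~SI_CHILD;
    if (dev->si_siblings.le_prev == NULL)
        return (0);     /* flag was set, but the entry is not linked */
    if (dev->si_siblings.le_next != NULL)
        dev->si_siblings.le_next->si_siblings.le_prev =
            dev->si_siblings.le_prev;
    *dev->si_siblings.le_prev = dev->si_siblings.le_next;
    /* Clear our own links so that a second call finds nothing to do. */
    dev->si_siblings.le_next = NULL;
    dev->si_siblings.le_prev = NULL;
    return (1);
}

/* Build head -> a -> b by hand, then remove a twice. */
static int
sketch_demo(void)
{
    struct sketch_dev *head = NULL;
    struct sketch_dev a = { 0 }, b = { 0 };
    int first, second;

    b.si_siblings.le_prev = &head;
    head = &b;
    b.si_flags |= SI_CHILD;

    a.si_siblings.le_next = &b;
    b.si_siblings.le_prev = &a.si_siblings.le_next;
    a.si_siblings.le_prev = &head;
    head = &a;
    a.si_flags |= SI_CHILD;

    first = sketch_unlink_child(&a);
    second = sketch_unlink_child(&a);
    assert(head == &b);
    return (first == 1 && second == 0);
}
```

In the sketch the second call returns 0 without touching the list, and
an entry that has the flag set but no links is skipped as well.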

-- 
cheers, J"org               .-.-.   --... ...--   -.. .  DL8DTL

http://www.sax.de/~joerg/                        NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)


