Date:      Fri, 28 Dec 2007 13:54:37 +0100
From:      Ulrich Spoerlein <uspoerlein@gmail.com>
To:        stable@freebsd.org
Cc:        Hidetoshi Shimokawa <simokawa@freebsd.org>
Subject:   sbp(4) write error wedging GEOM mirror
Message-ID:  <20071228125437.GB1532@roadrunner.spoerlein.net>

Hello *,

For a couple of months now, my 6-STABLE box has been wedging whenever I
rebuild my geom mirrors onto the two external drives simultaneously. The
setup is as follows:

Two internal ATA disks, plus two external ATA disks attached via FireWire.
Across these four disks there are three geom mirrors. When rebuilding two of
them at the same time, I frequently get sbp(4) write errors which eventually
lead to a reset, after which I/O to that mirror is no longer possible. The
system itself keeps running and keeps serving MP3s from the other mirror via
NFS.
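
For reference, the mirror layout boils down to roughly the following. The
exact creation commands are reconstructed from memory, so treat this as a
sketch rather than a transcript:

    gmirror label -v gm0 ad0s1    # system; later: gmirror insert gm0 da0s1
    gmirror label -v gm1 ad0s2    # /home;  later: gmirror insert gm1 da0s2
    gmirror label -v gm2 ad1      # data;   later: gmirror insert gm2 da1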

ad0: 238475MB <SAMSUNG SP2514N VF100-33> at ata0-master UDMA100
ad1: 381554MB <SAMSUNG HD400LD WQ100-14> at ata0-slave UDMA100
GEOM_MIRROR: Device gm2 created (id=3879710801).
GEOM_MIRROR: Device gm2: provider ad1 detected.
GEOM_MIRROR: Device gm0 created (id=3640684492).
GEOM_MIRROR: Device gm0: provider ad0s1 detected.
GEOM_MIRROR: Device gm1 created (id=2853507194).
GEOM_MIRROR: Device gm1: provider ad0s2 detected.
GEOM_LABEL: Label for provider ad0s3 is msdosfs/DATEN.
Root mount waiting for: GMIRROR GMIRROR GMIRROR
GEOM_MIRROR: Force device gm2 start due to timeout.
Expensive timeout(9) function: 0xc0948e15(0xc343fc00) 0.006123403 s
GEOM_MIRROR: Device gm2: provider ad1 activated.
GEOM_MIRROR: Device gm2: provider mirror/gm2 launched.
GEOM_MIRROR: Force device gm0 start due to timeout.
GEOM_MIRROR: Device gm0: provider ad0s1 activated.
GEOM_MIRROR: Device gm0: provider mirror/gm0 launched.
GEOM_MIRROR: Force device gm1 start due to timeout.
GEOM_MIRROR: Device gm1: provider ad0s2 activated.
GEOM_MIRROR: Device gm1: provider mirror/gm1 launched.
Trying to mount root from ufs:/dev/mirror/gm0a
GEOM_LABEL: Label msdosfs/DATEN removed.
GEOM_LABEL: Label for provider ad0s3 is msdosfs/DATEN.
GEOM_LABEL: Label msdosfs/DATEN removed.
WARNING: attempt to net_add_domain(netgraph) after domainfinalize()

fwohci0: BUS reset
fwohci0: node_id=0x8800ffc0, gen=2, non CYCLEMASTER mode
firewire0: 2 nodes, maxhop <= 1, cable IRM = 0 (me)
firewire0: root node is not cycle master capable
firewire0: bus manager 0 (me)
fwohci0: too many cycle lost, no cycle master presents?
fwohci0: txd err=14 ack busy_X
fwohci0: txd err=14 ack busy_X
fwohci0: txd err=14 ack busy_X
fwohci0: BUS reset
fwohci0: node_id=0xc800ffc1, gen=3, CYCLEMASTER mode
firewire0: 2 nodes, maxhop <= 1, cable IRM = 1 (me)
firewire0: bus manager 1 (me)
firewire0: New S400 device ID:0050770e012005cf
da0 at sbp0 bus 0 target 0 lun 0
da0: <Prolific PL-3507C Drive 2804> Fixed Simplified Direct Access SCSI-4 device
da0: 50.000MB/s transfers
da0: 238475MB (488397168 512 byte sectors: 255H 63S/T 30401C)
GEOM_MIRROR: Device gm0: provider da0s1 detected.
GEOM_MIRROR: Device gm0: provider da0s1 is stale.
fwohci0: BUS reset
fwohci0: node_id=0xc800ffc2, gen=4, CYCLEMASTER mode
firewire0: 3 nodes, maxhop <= 2, cable IRM = 2 (me)
firewire0: bus manager 2 (me)
fwohci0: txd err=14 ack busy_X
fwohci0: txd err=14 ack busy_X
fwohci0: txd err=14 ack busy_X
fwohci0: BUS reset
fwohci0: node_id=0xc800ffc2, gen=5, CYCLEMASTER mode
firewire0: 3 nodes, maxhop <= 2, cable IRM = 2 (me)
firewire0: bus manager 2 (me)
firewire0: New S400 device ID:0050770e013023f0
da1 at sbp0 bus 0 target 1 lun 0
da1: <Prolific PL-3507C Drive 2804> Fixed Simplified Direct Access SCSI-4 device
da1: 50.000MB/s transfers
da1: 381554MB (781422768 512 byte sectors: 255H 63S/T 48641C)
GEOM_MIRROR: Device gm2: provider da1 detected.
GEOM_MIRROR: Device gm2: rebuilding provider da1.

Rebuilding some of the components finishes OK, then write errors start
appearing:

(11:27:31) root@coyote: ~# gmirror status
      Name    Status  Components
mirror/gm2  DEGRADED  ad1
                      da1 (5%)
mirror/gm0  DEGRADED  ad0s1
                      da0s1 (87%)
mirror/gm1  DEGRADED  ad0s2
                      da0s2
(11:30:05) root@coyote: ~# GEOM_MIRROR: Device gm0: rebuilding provider da0s1 finished.
GEOM_MIRROR: Device gm0: provider da0s1 activated.
(da0:sbp0:0:0:0): WRITE(10). CDB: 2a 0 0 4a f3 f1 0 0 20 0
(da0:sbp0:0:0:0): CAM Status: SCSI Status Error
(da0:sbp0:0:0:0): SCSI Status: Check Condition
(da0:sbp0:0:0:0): MEDIUM ERROR csi:70,70,70,70 asc:c,0
(da0:sbp0:0:0:0): Write error
(da0:sbp0:0:0:0): Retrying Command (per Sense Data)
(da0:sbp0:0:0:0): WRITE(10). CDB: 2a 0 0 48 9c d1 0 0 20 0
(da0:sbp0:0:0:0): CAM Status: SCSI Status Error
(da0:sbp0:0:0:0): SCSI Status: Check Condition
(da0:sbp0:0:0:0): MEDIUM ERROR csi:70,70,70,70 asc:c,0
(da0:sbp0:0:0:0): Write error
(da0:sbp0:0:0:0): Retrying Command (per Sense Data)
...
(da0:sbp0:0:0:0): WRITE(10). CDB: 2a 0 0 49 c d1 0 0 20 0
(da0:sbp0:0:0:0): CAM Status: SCSI Status Error
(da0:sbp0:0:0:0): SCSI Status: Check Condition
(da0:sbp0:0:0:0): MEDIUM ERROR csi:70,70,70,70 asc:c,0
(da0:sbp0:0:0:0): Write error
(da0:sbp0:0:0:0): Retrying Command (per Sense Data)
(da0:sbp0:0:0:0): WRITE(10). CDB: 2a 0 0 82 7d 31 0 0 20 0
(da0:sbp0:0:0:0): CAM Status: SCSI Status Error
(da0:sbp0:0:0:0): SCSI Status: Check Condition
(da0:sbp0:0:0:0): MEDIUM ERROR csi:70,70,70,70 asc:c,0
(da0:sbp0:0:0:0): Write error
(da0:sbp0:0:0:0): Retrying Command (per Sense Data)

(12:17:05) root@coyote: ~# gmirror rebuild gm1 da0s2
GEOM_MIRROR: Device gm1: provider da0s2 disconnected.
GEOM_MIRROR: Device gm1: provider da0s2 detected.
GEOM_MIRROR: Device gm1: rebuilding provider da0s2.
(12:17:11) root@coyote: ~# gmirror status
      Name    Status  Components
mirror/gm2  DEGRADED  ad1 
                      da1 (21%)
mirror/gm0  COMPLETE  ad0s1
                      da0s1
mirror/gm1  DEGRADED  ad0s2 
                      da0s2 (0%)

(da0:sbp0:0:0:0): WRITE(10). CDB: 2a 0 2 3a e6 9 0 0 80 0
(da0:sbp0:0:0:0): CAM Status: SCSI Status Error
(da0:sbp0:0:0:0): SCSI Status: Check Condition 
(da0:sbp0:0:0:0): MEDIUM ERROR csi:70,70,70,70 asc:c,0
(da0:sbp0:0:0:0): Write error
(da0:sbp0:0:0:0): Retrying Command (per Sense Data)
sbp0:0:0 request timeout(cmd orb:0x1a6029dc) ... agent reset
sbp0:0:0 request timeout(cmd orb:0x1a602b14) ... target reset
sbp0:0:0 request timeout(mgm orb:0x1a602c4c) ... reset start
fwohci0: txd err=14 ack busy_X
sbp0:0:0 sbp_reset_start failed: resp=16

From here on, no login is possible, but I'm still listening to MP3s served
from gm1 via NFS (read-only, due to noatime). My /home partition is on gm1,
too. The system itself resides on gm0. gm2 can also be accessed just fine. An
NFS mount request for gm0 never comes back, however. The rebuild of gm2 is
also still going on.

Two questions remain:

Why on earth is sbp(4) suddenly spitting out write errors? The drive is OK,
and the rebuild itself is not complaining about write errors either. Could
there be some race or bus starvation that makes the write command time out?
Why does this not happen when I rebuild all mirrors sequentially? Or is the
FireWire controller simply crap?

And then, I would have assumed that if gmirror is no longer able to write to
one of its components, it kicks that component out of the mirror and
continues with the remaining good device. Or can such a failure only be
detected on failed reads?
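
(As a manual workaround I could probably kick the flaky component out myself,
assuming the gmirror control path is not wedged together with the I/O path,
e.g.:

    gmirror deactivate gm0 da0s1    # stop using the FireWire component
    gmirror activate gm0 da0s1      # later: bring it back and let it resync

but I would expect the class to do the first step by itself when writes
fail.)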


Here's some ddb(4) output, though I guess it's not very helpful:

KDB: enter: Break sequence on console
[thread pid 10 tid 100006 ]
Stopped at      kdb_enter+0x30: leave   
db> show pcpu
cpuid        = 0
curthread    = 0xc3237600: pid 10 "idle"
curpcb       = 0xd4051d90
fpcurthread  = none
idlethread   = 0xc3237600: pid 10 "idle"
APIC ID      = 0
currentldt   = 0x50
spin locks held:
db> show locks
db> show alllocks
db> show lockedvnods
Locked vnodes

0xc3621570: tag ufs, type VDIR
    usecount 1, writecount 0, refcount 144 mountedhere 0
    flags (VV_ROOT)
     lock type ufs: EXCL (count 1) by thread 0xc3c15a80 (pid 29986) with 142 pending#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05ba64a at vget+0xee
#5 0xc05b0934 at vfs_hash_get+0xdc
#6 0xc06b8343 at ffs_vget+0x49
#7 0xc06c3338 at ufs_root+0x28
#8 0xc05b21c2 at lookup+0x8c9
#9 0xc05b15f0 at namei+0x449
#10 0xc05c7a18 at vn_open_cred+0x2af
#11 0xc05c7767 at vn_open+0x33
#12 0xc05bfaa0 at kern_open+0x2b4
#13 0xc05bf7b4 at open+0x36
#14 0xc07294e9 at syscall+0x2d1
#15 0xc07142df at Xint0x80_syscall+0x1f

        ino 2, on dev mirror/gm0d

0xc3621414: tag syncer, type VNON
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags ()
     lock type syncer: EXCL (count 1) by thread 0xc332fa80 (pid 50)#0 0xc0544b0e at lockmgr+0x566
#1 0xc05af7a3 at vop_stdlock+0x32
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05b9c1d at sync_vnode+0x134
#5 0xc05b9f2f at sched_sync+0x23d
#6 0xc0539ab3 at fork_exit+0xc2
#7 0xc07142ec at fork_trampoline+0x8


0xc362fd98: tag ufs, type VDIR
    usecount 1, writecount 0, refcount 4 mountedhere 0
    flags ()
     lock type ufs: EXCL (count 1) by thread 0xc3c14180 (pid 29985) with 1 pending#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05ba64a at vget+0xee
#5 0xc05aca3c at cache_lookup+0x3d0
#6 0xc05ad05d at vfs_cache_lookup+0xa4
#7 0xc073cb2f at VOP_LOOKUP_APV+0xa6
#8 0xc05b1e33 at lookup+0x53a
#9 0xc05b15f0 at namei+0x449
#10 0xc05c7a18 at vn_open_cred+0x2af
#11 0xc05c7767 at vn_open+0x33
#12 0xc05bfaa0 at kern_open+0x2b4
#13 0xc05bf7b4 at open+0x36
#14 0xc07294e9 at syscall+0x2d1
#15 0xc07142df at Xint0x80_syscall+0x1f

        ino 164864, on dev mirror/gm0d

0xc382b828: tag ufs, type VREG
    usecount 1, writecount 1, refcount 3 mountedhere 0
    flags ()
    v_object 0xc378ece4 ref 0 pages 11
     lock type ufs: EXCL (count 1) by thread 0xc345a180 (pid 719)#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05c3e12 at fsync+0xc0
#5 0xc07294e9 at syscall+0x2d1
#6 0xc07142df at Xint0x80_syscall+0x1f

        ino 118536, on dev mirror/gm0d

0xc3831d98: tag ufs, type VREG
    usecount 0, writecount 0, refcount 3 mountedhere 0
    flags ()
    v_object 0xc3789dec ref 0 pages 1
     lock type ufs: EXCL (count 1) by thread 0xc3c15780 (pid 29983)#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05c7d34 at vn_close+0x7a
#5 0xc05c8d77 at vn_closefile+0xee
#6 0xc052f9b4 at fdrop_locked+0xef
#7 0xc052f8bf at fdrop+0x3c
#8 0xc052ddb2 at closef+0x426
#9 0xc052af69 at kern_close+0x226
#10 0xc052ad41 at close+0x1a
#11 0xc07294e9 at syscall+0x2d1
#12 0xc07142df at Xint0x80_syscall+0x1f

        ino 70921, on dev mirror/gm0d

0xc3870c3c: tag ufs, type VREG
    usecount 0, writecount 0, refcount 8 mountedhere 0
    flags ()
    v_object 0xc3821d68 ref 0 pages 325
     lock type ufs: EXCL (count 1) by thread 0xc332fa80 (pid 50)#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05ba64a at vget+0xee
#5 0xc05bbecb at vfs_msync+0xf5
#6 0xc05bc79d at sync_fsync+0x1d8
#7 0xc073e1fa at VOP_FSYNC_APV+0xc4
#8 0xc05b9c43 at sync_vnode+0x15a
#9 0xc05b9f2f at sched_sync+0x23d
#10 0xc0539ab3 at fork_exit+0xc2
#11 0xc07142ec at fork_trampoline+0x8

        ino 189547, on dev mirror/gm0d

0xc3874828: tag ufs, type VREG
    usecount 0, writecount 0, refcount 10 mountedhere 0
    flags ()
    v_object 0xc3821738 ref 0 pages 844
     lock type ufs: EXCL (count 1) by thread 0xc345c600 (pid 748)#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05c7d34 at vn_close+0x7a
#5 0xc05c8d77 at vn_closefile+0xee
#6 0xc052f9b4 at fdrop_locked+0xef
#7 0xc052f8bf at fdrop+0x3c
#8 0xc052ddb2 at closef+0x426
#9 0xc052af69 at kern_close+0x226
#10 0xc052ad41 at close+0x1a
#11 0xc07294e9 at syscall+0x2d1
#12 0xc07142df at Xint0x80_syscall+0x1f

        ino 189548, on dev mirror/gm0d

0xc38fe984: tag ufs, type VDIR
    usecount 1, writecount 0, refcount 4 mountedhere 0
    flags ()
     lock type ufs: EXCL (count 1) by thread 0xc3f91a80 (pid 29984) with 1 pending#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05ba64a at vget+0xee
#5 0xc05aca3c at cache_lookup+0x3d0
#6 0xc05ad05d at vfs_cache_lookup+0xa4
#7 0xc073cb2f at VOP_LOOKUP_APV+0xa6
#8 0xc05b1e33 at lookup+0x53a
#9 0xc05b15f0 at namei+0x449
#10 0xc05c77c4 at vn_open_cred+0x5b
#11 0xc05c7767 at vn_open+0x33
#12 0xc05bfaa0 at kern_open+0x2b4
#13 0xc05bf7b4 at open+0x36
#14 0xc07294e9 at syscall+0x2d1
#15 0xc07142df at Xint0x80_syscall+0x1f

        ino 211976, on dev mirror/gm0d

0xc3acac3c: tag ufs, type VREG
    usecount 0, writecount 0, refcount 9 mountedhere 0
    flags ()
    v_object 0xc3b8e210 ref 0 pages 56
     lock type ufs: EXCL (count 1) by thread 0xc3935000 (pid 1205)#0 0xc0544b0e at lockmgr+0x566
#1 0xc06b98d3 at ffs_lock+0x8a
#2 0xc073f176 at VOP_LOCK_APV+0xa6
#3 0xc05c8c00 at vn_lock+0xd3
#4 0xc05c7d34 at vn_close+0x7a
#5 0xc05c8d77 at vn_closefile+0xee
#6 0xc052f9b4 at fdrop_locked+0xef
#7 0xc052f8bf at fdrop+0x3c
#8 0xc052ddb2 at closef+0x426
#9 0xc052af69 at kern_close+0x226
#10 0xc052ad41 at close+0x1a
#11 0xc07294e9 at syscall+0x2d1
#12 0xc07142df at Xint0x80_syscall+0x1f

        ino 141531, on dev mirror/gm0d

0xc3e2d15c: tag ufs, type VNON
    usecount 1, writecount 0, refcount 1 mountedhere 0
    flags ()
     lock type ufs: EXCL (count 1) by thread 0xc3f91a80 (pid 29984)#0 0xc0544b0e at lockmgr+0x566
#1 0xc05b0a48 at vfs_hash_insert+0x36
#2 0xc06b84ac at ffs_vget+0x1b2
#3 0xc0698877 at ffs_valloc+0x12f
#4 0xc06c6c6a at ufs_makeinode+0x74
#5 0xc06c3682 at ufs_create+0x36
#6 0xc073cd7d at VOP_CREATE_APV+0xc4
#7 0xc05c7933 at vn_open_cred+0x1ca
#8 0xc05c7767 at vn_open+0x33
#9 0xc05bfaa0 at kern_open+0x2b4
#10 0xc05bf7b4 at open+0x36
#11 0xc07294e9 at syscall+0x2d1
#12 0xc07142df at Xint0x80_syscall+0x1f

        ino 212371, on dev mirror/gm0d

I then plugged my laptop into the same FireWire bus, and the resulting bus
reset unwedged the whole situation:

fwohci0: BUS reset
fwohci0: node_id=0xc800ffc3, gen=6, CYCLEMASTER mode
firewire0: 4 nodes, maxhop <= 2, cable IRM = 3 (me)
firewire0: bus manager 3 (me)
firewire0: New S400 device ID:354fc00035679830
firewire0: split transaction timeout dst=0xffc1 tl=0x3a state=3
firewire0: split transaction timeout dst=0xffc1 tl=0x0 state=3
firewire0: split transaction timeout dst=0xffc1 tl=0x1 state=3
(da0:sbp0:0:0:0): lost device
GEOM_MIRROR: Request failed (error=6). da0s1[WRITE(offset=998121832, length=131072)]
GEOM_MIRROR: Synchronization request failed (error=22). da0s2[WRITE(offset=2763128472, length=16384)]
GEOM_MIRROR: Device gm0: provider da0s1 disconnected.
GEOM_MIRROR: Device gm1: provider da0s2 disconnected.
GEOM_MIRROR: Device gm1: rebuilding provider da0s2 stopped.
(da0:sbp0:0:0:0): Synchronize cache failed, status == 0x8, scsi status == 0x0
(da0:sbp0:0:0:0): removing device entry

Please note that I'm running with hw.firewire.hold_count=0, so the manual bus
reset shouldn't have been necessary, should it?
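
For the record, that is set the obvious way; I am assuming it is an ordinary
read-write sysctl, so it can go into /etc/sysctl.conf as well:

    sysctl hw.firewire.hold_count=0
    echo 'hw.firewire.hold_count=0' >> /etc/sysctl.conf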

The system would still hang while shutting down, though:

GEOM_MIRROR: Device gm2: rebuilding provider da1 stopped.
Waiting (max 60 seconds) for system process `vnlru' to stop...done
Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining...13 13 7 2 5 3 3 1 1 0 0 0 0 done
All buffers synced.
sbp0:1:0 request timeout(cmd orb:0x1195d28c) ... agent reset
fwohci0: txd err=1e ack type_err
sbp0:1:0 sbp_agent_reset_callback: resp=22
fwohci0: txd err=1e ack type_err
sbp_orb_pointer_callback: xfer->resp = 22
sbp0:1:0 request timeout(cmd orb:0x1195d3c4) ... target reset
fwohci0: txd err=1e ack type_err
sbp0:1:0 sbp_agent_reset_callback: resp=22
fwohci0: txd err=1e ack type_err
sbp_orb_pointer_callback: xfer->resp = 22
sbp0:1:0 request timeout(cmd orb:0x1195d634) ... reset start

< here I unplugged my laptop again >

Uptime: 2h57m3s
fwohci0: BUS reset
fwohci0: node_id=0xc800ffc2, gen=8, CYCLEMASTER mode
firewire0: 3 nodes, maxhop <= 2, cable IRM = 2 (me)
firewire0: bus manager 2 (me)
fwohci0: BUS reset
fwohci0: node_id=0xc800ffc2, gen=8, CYCLEMASTER mode
firewire0: 3 nodes, maxhop <= 2, cable IRM = 2 (me)
firewire0: bus manager 2 (me)
GEOM_MIRROR: Device gm0: provider mirror/gm0 destroyed.
GEOM_MIRROR: Device gm0 destroyed.
Powering system off using ACPI

Anything I can do to help debug this FireWire issue?
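
In case raw bus state helps, here is what I intend to collect the next time
it wedges, assuming fwcontrol(8) still talks to the controller while sbp(4)
is stuck:

    fwcontrol        # list the nodes currently seen on the bus
    fwcontrol -t     # dump the topology map
    fwcontrol -r     # force a bus reset by hand, instead of hot-plugging the laptop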

Cheers,
Ulrich Spoerlein
-- 
It is better to remain silent and be thought a fool,
than to speak, and remove all doubt.


