Date: Fri, 3 Feb 2006 20:02:30 -0800 (PST) From: Garry Belka <garry@NetworkPhysics.COM> To: FreeBSD-gnats-submit@FreeBSD.org Subject: kern/92786: [patch] ATA fixes, write support for LSI v3 RAID Message-ID: <200602040402.k1442U17058626@focus5.fractal.networkphysics.com> Resent-Message-ID: <200602040410.k144A5rZ090747@freefall.freebsd.org>
>Number:         92786
>Category:       kern
>Synopsis:       [patch] ATA fixes, write support for LSI v3 RAID
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Feb 04 04:10:04 GMT 2006
>Closed-Date:
>Last-Modified:
>Originator:     Garry Belka
>Release:        FreeBSD 6.0-RELEASE i386
>Organization:
Network Physics
>Environment:
System: FreeBSD skyway.fractal.networkphysics.com 6.0-RELEASE FreeBSD 6.0-RELEASE #1: Sat Feb 4 01:32:59 PST 2006 garry@skyway.fractal.networkphysics.com:/usr/src/sys/i386/compile/SMP i386
>Description:
This patch fixes or helps to avoid several ATA subsystem problems, namely: system crashes after RAID hard disk failures or after disk removal for hot-swappable disks; a deadlock during RAID label access or update; systems failing to come up roughly every 8-10 reboots; system crashes when running atacontrol attach or detach commands; plus minor fixes.

In addition, this patch includes write support for LSI v3 RAID. This is the hardware used by a current Intel Server Board (SE7520JR2). It also introduces a scheme for defining array-specific label write functions that splits the work into two pieces: one fills the label data in memory, the other writes the label to disk. If adopted, this scheme may make it possible to use the same labelling routines from GEOM modules.
>How-To-Repeat:
Reboot a system with a RAID array multiple times; or simulate a disk failure (e.g., remove or detach a disk) during normal disk operation or an array rebuild; or run the rebuild cycle several times.
>Fix:
Some problems can be traced to missing synchronisation; locking was added. In other cases ATA requests weren't tracked correctly, and the system tried to access device structures after they were freed. We added reference counters to requests, so that a request is freed only by the last holder of a reference.
The larger problem is architectural: RAID label reads and writes are synchronous, and they stop all I/O completion on a channel until the label request finishes. As a result, the software interrupt servicing channel completions may have several completion tasks waiting to be executed. Now, if one of the requests queued on the channel ahead of the label is a RAID composite request that depends on a read operation on a different channel completing before it can continue, we are in a deadlock. The patch makes label updates asynchronous, which takes care of most situations. The price is changed semantics for label writes: a write now reports success after starting the operation. I think that's acceptable, because a failure of the forthcoming write will cause an array reconfiguration and a visible message to the user. The read label case remains a problem, however. Not willing to introduce massive architectural changes, I left it synchronous, so the deadlock described above may still happen if somebody runs atacontrol at an unlucky moment. Also, see above the comment regarding LSIv3 label support and the proposed label code structuring.
--- ata-incl.patch begins here --- Only in ata_curr: CVS diff -u ata_curr/ata-all.c mata/ata-all.c --- ata_curr/ata-all.c Wed Jan 18 05:10:16 2006 +++ mata/ata-all.c Tue Jan 31 18:24:55 2006 @@ -542,8 +542,8 @@ /* release the hook that got us here, we are only needed once during boot */ if (ata_delayed_attach) { config_intrhook_disestablish(ata_delayed_attach); - ata_delayed_attach = NULL; free(ata_delayed_attach, M_TEMP); + ata_delayed_attach = NULL; } mtx_unlock(&Giant); /* newbus suckage dealt with, release Giant */ diff -u ata_curr/ata-all.h mata/ata-all.h --- ata_curr/ata-all.h Wed Jan 18 05:10:17 2006 +++ mata/ata-all.h Tue Jan 31 16:04:40 2006 @@ -370,6 +370,8 @@ #define ATA_R_THREAD 0x00000800 #define ATA_R_DIRECT 0x00001000 +#define ATA_R_REQ_FAIL 0x00002000 /* needs + ata_request_incref() */ #define ATA_R_DEBUG 0x10000000 u_int8_t status; /* ATA status */ @@ -387,6 +389,8 @@ int this; /* this request ID */ struct ata_composite *composite; /* for composite atomic ops */ void *driver; /* driver specific */ + struct ata_channel *channel; /* handle to controller softc*/ + int ntimeoutref; /* active timeout ref count */ TAILQ_ENTRY(ata_request) chain; /* list management */ }; @@ -514,6 +518,9 @@ TAILQ_HEAD(, ata_request) ata_queue; /* head of ATA queue */ struct ata_request *freezepoint; /* composite freezepoint */ struct ata_request *running; /* currently running request */ + int stoplight; /* disable ata_start. counter. 
+ * access under queue_mtx lock + */ }; /* disk bay/enclosure related */ @@ -551,9 +558,11 @@ /* ata-queue.c: */ int ata_controlcmd(device_t dev, u_int8_t command, u_int16_t feature, u_int64_t lba, u_int16_t count); int ata_atapicmd(device_t dev, u_int8_t *ccb, caddr_t data, int count, int flags, int timeout); -void ata_queue_request(struct ata_request *request); +int ata_queue_request(struct ata_request *request); void ata_start(device_t dev); void ata_finish(struct ata_request *request); +void ata_free_nref_request(struct ata_request *); +int ata_orphan(struct ata_request *); void ata_timeout(struct ata_request *); void ata_catch_inflight(device_t dev); void ata_fail_requests(device_t dev); @@ -569,7 +578,19 @@ /* macros for alloc/free of struct ata_request */ extern uma_zone_t ata_request_zone; #define ata_alloc_request() uma_zalloc(ata_request_zone, M_NOWAIT | M_ZERO) -#define ata_free_request(request) uma_zfree(ata_request_zone, request) +#define ata_free_request_mem(request) uma_zfree(ata_request_zone, (request)) +#define ata_free_request(request) \ + (((request)->flags & ATA_R_REQ_FAIL) ? \ + ata_free_nref_request(request) : \ + ata_free_request_mem(request)) + +/* macros to increase/decrease request ref count. 
+ * used under mtx_lock(&ch->state_mtx), aka + * mtx_lock(&request->channel->state_mtx) + */ +#define ata_request_incref(request) (++ (request)->ntimeoutref) +#define ata_request_decref(request) (-- (request)->ntimeoutref) +#define ata_request_nref(request) ((request)->ntimeoutref) /* macros for alloc/free of struct ata_composite */ extern uma_zone_t ata_composite_zone; diff -u ata_curr/ata-queue.c mata/ata-queue.c --- ata_curr/ata-queue.c Wed Jan 18 05:10:17 2006 +++ mata/ata-queue.c Wed Feb 1 17:31:57 2006 @@ -48,20 +48,29 @@ static void ata_sort_queue(struct ata_channel *ch, struct ata_request *request); static char *ata_skey2str(u_int8_t); -void +int ata_queue_request(struct ata_request *request) { struct ata_channel *ch = device_get_softc(device_get_parent(request->dev)); + int timeoutflag = request->flags & ATA_R_TIMEOUT; + request->flags &= ~ ATA_R_TIMEOUT; /* mark request as virgin (this might be a ATA_R_REQUEUE) */ request->result = request->status = request->error = 0; +#if 1 + /* This is a 6.0-ism */ callout_init_mtx(&request->callout, &ch->state_mtx, CALLOUT_RETURNUNLOCKED); +#else + callout_init(&request->callout, 1); +#endif + /* stow the channel ptr in the request for ata_timeout()'s use: */ + request->channel = ch; if (!request->callback && !(request->flags & ATA_R_REQUEUE)) sema_init(&request->done, 0, "ATA request done"); /* in ATA_STALL_QUEUE state we call HW directly (used only during reinit) */ - if ((ch->state & ATA_STALL_QUEUE) && (request->flags & ATA_R_CONTROL)) { + if ((ch->state == ATA_STALL_QUEUE) && (request->flags & ATA_R_CONTROL)) { mtx_lock(&ch->state_mtx); ch->running = request; if (ch->hw.begin_transaction(request) == ATA_OP_FINISHED) { @@ -69,13 +78,21 @@ if (!request->callback) sema_destroy(&request->done); mtx_unlock(&ch->state_mtx); - return; + return 0; } mtx_unlock(&ch->state_mtx); } /* otherwise put request on the locked queue at the specified location */ else { mtx_lock(&ch->queue_mtx); + if (timeoutflag && 
ata_orphan(request)) { + KASSERT(!(request->flags & ATA_R_REQ_FAIL), + "ata_queue_request: request with ATA_R_REQ_FAIL\n"); + ata_request_incref(request); + request->flags |= ATA_R_REQ_FAIL; + mtx_unlock(&ch->queue_mtx); + return ENXIO; + } if (request->flags & ATA_R_AT_HEAD) TAILQ_INSERT_HEAD(&ch->ata_queue, request, chain); else if (request->flags & ATA_R_ORDERED) @@ -89,20 +106,25 @@ /* if this is a requeued request callback/sleep we're done */ if (request->flags & ATA_R_REQUEUE) - return; + return 0; /* if this is not a callback wait until request is completed */ if (!request->callback) { + int i = 0, imax = 50; ATA_DEBUG_RQ(request, "wait for completition"); - while (!dumping && - sema_timedwait(&request->done, request->timeout * hz * 4)) { - device_printf(request->dev, - "req=%p %s semaphore timeout !! DANGER Will Robinson !!\n", + while (!dumping && + sema_timedwait(&request->done, request->timeout * hz * 4)) { + if (i < 5 || i % 10 == 0) + device_printf(request->dev, + "req=%p unexpected %s semaphore timeout - DWR error\n", request, ata_cmd2str(request)); + if (++i >= imax) + panic("ATA sema\n"); ata_start(ch->dev); } sema_destroy(&request->done); } + return 0; } int @@ -165,6 +187,10 @@ /* if we have a request on the queue try to get it running */ mtx_lock(&ch->queue_mtx); + if (ch->stoplight) { + mtx_unlock(&ch->queue_mtx); + return; + } if ((request = TAILQ_FIRST(&ch->ata_queue))) { /* we need the locking function to get the lock for this channel */ @@ -218,14 +244,15 @@ void ata_finish(struct ata_request *request) { - struct ata_channel *ch = device_get_softc(device_get_parent(request->dev)); + //struct ata_channel *ch = device_get_softc(device_get_parent(request->dev)); + struct ata_channel *ch = request->channel; /* * if in ATA_STALL_QUEUE state or request has ATA_R_DIRECT flags set * we need to call ata_complete() directly here (no taskqueue involvement) */ if (dumping || - (ch->state & ATA_STALL_QUEUE) || (request->flags & ATA_R_DIRECT)) { + 
(ch->state == ATA_STALL_QUEUE) || (request->flags & ATA_R_DIRECT)) { ATA_DEBUG_RQ(request, "finish directly"); ata_completed(request, 0); } @@ -247,14 +274,21 @@ ata_completed(void *context, int dummy) { struct ata_request *request = (struct ata_request *)context; - struct ata_channel *ch = device_get_softc(device_get_parent(request->dev)); + //struct ata_channel *ch = device_get_softc(device_get_parent(request->dev)); + struct ata_channel *ch = request->channel; struct ata_device *atadev = device_get_softc(request->dev); struct ata_composite *composite; + int fail = request->flags & ATA_R_REQ_FAIL; /* XXX or atadev fail */ ATA_DEBUG_RQ(request, "completed entered"); + /* if we're to fail this request on this device, skip error handling */ + if (fail) + goto composite; + /* if we had a timeout, reinit channel and deal with the falldown */ if (request->flags & ATA_R_TIMEOUT) { + int orphan = 0; /* * if reinit succeeds and the device doesn't get detached and * there are retries left we reinject this request @@ -263,33 +297,35 @@ (request->retries-- > 0)) { if (!(request->flags & ATA_R_QUIET)) { device_printf(request->dev, - "TIMEOUT - %s retrying (%d retr%s left)", - ata_cmd2str(request), request->retries, - request->retries == 1 ? "y" : "ies"); + "TIMEOUT - %s retrying req %p (%d retr%s left, flags %x)", + ata_cmd2str(request), request, request->retries, + request->retries == 1 ? 
"y" : "ies", request->flags); if (!(request->flags & (ATA_R_ATAPI | ATA_R_CONTROL))) printf(" LBA=%llu", (unsigned long long)request->u.ata.lba); printf("\n"); } - request->flags &= ~(ATA_R_TIMEOUT | ATA_R_DEBUG); + request->flags &= ~ATA_R_DEBUG; request->flags |= (ATA_R_AT_HEAD | ATA_R_REQUEUE); ATA_DEBUG_RQ(request, "completed reinject"); - ata_queue_request(request); - return; + orphan = ata_queue_request(request); + if (!orphan) + return; } /* ran out of good intentions so finish with error */ if (!request->result) { if (!(request->flags & ATA_R_QUIET)) { if (request->dev) { - device_printf(request->dev, "FAILURE - %s timed out", - ata_cmd2str(request)); + device_printf(request->dev, + "FAILURE - %s req %p timed out (flags %x)", + ata_cmd2str(request), request, request->flags); if (!(request->flags & (ATA_R_ATAPI | ATA_R_CONTROL))) printf(" LBA=%llu", (unsigned long long)request->u.ata.lba); printf("\n"); } } - request->result = EIO; + request->result = orphan ? ENXIO : EIO; } } else { @@ -422,7 +458,8 @@ ATA_DEBUG_RQ(request, "completed callback/wakeup"); - /* if we are part of a composite operation we need to maintain progress */ + composite: + /* if we are part of a composite operation update progress */ if ((composite = request->composite)) { int index = 0; @@ -439,6 +476,8 @@ (composite->rd_done & composite->wr_depend)==composite->wr_depend && (composite->wr_needed & (~composite->wr_done))) { index = composite->wr_needed & ~composite->wr_done; + if (composite->request[index - 1]->flags & ATA_R_REQ_FAIL) + index = 0; } mtx_unlock(&composite->lock); @@ -447,9 +486,11 @@ if (index) { int bit; - for (bit = 0; bit < MAX_COMPOSITES; bit++) { - if (index & (1 << bit)) + for (bit = 0; index; bit++) { + if (index & (1 << bit)) { ata_start(device_get_parent(composite->request[bit]->dev)); + index &= ~(1<<bit); + } } } } @@ -460,17 +501,81 @@ else sema_post(&request->done); - ata_start(ch->dev); + if (!fail) + ata_start(ch->dev); } void +ata_free_nref_request(struct 
ata_request *request) +{ + struct ata_channel *ch = request->channel; + + mtx_lock(&ch->state_mtx); + ata_request_decref(request); + if (ata_request_nref(request) <= 0) { + mtx_unlock(&ch->state_mtx); + ata_free_request_mem(request); + return; + } + mtx_unlock(&ch->state_mtx); +} + +/* + * check whether request's dev has gone away + * XXX: do we need locking? + */ +int +ata_orphan(struct ata_request *request) +{ + struct ata_channel *ch = request->channel; + device_t *children; + int nchildren; + int i; + int found = 0; + + if (!device_get_children(ch->dev, &children, &nchildren)) { + for (i = 0; i < nchildren; i++) + if (children[i] && children[i] == request->dev) + found = 1; + free(children, M_TEMP); + } + return !found; +} + +void ata_timeout(struct ata_request *request) { - struct ata_channel *ch = device_get_softc(device_get_parent(request->dev)); + struct ata_channel *ch = request->channel; + + device_printf (ch->dev, "ata_timeout req %p flags x%x res %d chstate %d\n", + request, request->flags, request->result, ch->state); +#if 0 + mtx_lock(&ch->state_mtx); /* not needed on 6.0 due to callout_init_mtx() settings */ +#endif + if (request->flags & ATA_R_REQ_FAIL) { + mtx_unlock(&ch->state_mtx); + ata_free_request(request); + return; + } //request->flags |= ATA_R_DEBUG; ATA_DEBUG_RQ(request, "timeout"); + /* If our timeout callback has been cancelled while in process, then the + _fail_requests() code will have marked the request as a failure. Also + watch out for cases where request->dev has gotten blown away or + unhooked from the device tree (walk down from the channel to check + this). If either of these are the case, don't look any further, just + silently exit the timeout and be happy. 
(pavel 4-Nov-2005) */ + +#if 0 + if (request->result == ENXIO || ata_orphan(request)) { + device_printf (ch->dev, "ata_timeout called with bum req=%p\n", + request); + /* force the failure and pass the req to _finish/_completed */ + request->result = ENXIO; + } +#endif /* * if we have an ATA_ACTIVE request running, we flag the request * ATA_R_TIMEOUT so ata_finish will handle it correctly @@ -493,34 +598,72 @@ ata_fail_requests(device_t dev) { struct ata_channel *ch = device_get_softc(device_get_parent(dev)); - struct ata_request *request; + struct ata_request *request, *req0, *req1; + TAILQ_HEAD(, ata_request) rq; + + TAILQ_INIT(&rq); + mtx_lock(&ch->queue_mtx); - /* do we have any outstanding request to care about ?*/ mtx_lock(&ch->state_mtx); - if ((request = ch->running) && (!dev || request->dev == dev)) { - callout_stop(&request->callout); - ch->running = NULL; - } - else + + ++ ch->stoplight; /* prevent start of ATA cmds */ + + /* do we have any outstanding request to care about ?*/ + request = ch->running; + if (request && (!dev || request->dev == dev)) { + if (!callout_stop(&request->callout)) { + /* + * failed to stop timeout. race with ata_timeout() here. + * Make sure that only the last referer that touches the request + * will free it: account for timeout in ref counter. 
+ */ + ata_request_incref(request); + } + ata_request_incref(request); /* one for normal path */ + request->flags |= ATA_R_REQ_FAIL; /* mark as failed */ + ch->running = NULL; + request->result = ENXIO; + /* device_printf (ch->dev, + "ata_fail rureq %p flags x%x res %d chstate %d\n", + request, request->flags, request->result, ch->state); + */ + } else request = NULL; mtx_unlock(&ch->state_mtx); - if (request) { - request->result = ENXIO; + + /* gather all requests queued on this channel for device dev if !NULL */ + TAILQ_FOREACH_SAFE(req0, &ch->ata_queue, chain, req1) { + if (!dev || req0->dev == dev) { + TAILQ_REMOVE(&ch->ata_queue, req0, chain); + /* + * no callout for timeout was set => + * no timeout race here, so just a single ref counter increase + */ + ata_request_incref(req0); /* one for normal path */ + req0->flags |= ATA_R_REQ_FAIL; /* and mark as failed */ + req0->result = ENXIO; + TAILQ_INSERT_TAIL(&rq, req0, chain); + /* device_printf (ch->dev, + "ata_fail req %p flags x%x res %d chstate %d\n", + req0, req0->flags, req0->result, ch->state); + */ + } + } + mtx_unlock(&ch->queue_mtx); + + if (request) ata_finish(request); + + /* fail all requests queued on this channel for device dev */ + TAILQ_FOREACH_SAFE(req0, &rq, chain, req1) { + TAILQ_REMOVE(&rq, req0, chain); + ata_finish(req0); } - /* fail all requests queued on this channel for device dev if !NULL */ mtx_lock(&ch->queue_mtx); - while ((request = TAILQ_FIRST(&ch->ata_queue))) { - if (!dev || request->dev == dev) { - TAILQ_REMOVE(&ch->ata_queue, request, chain); - mtx_unlock(&ch->queue_mtx); - request->result = ENXIO; - ata_finish(request); - mtx_lock(&ch->queue_mtx); - } - } + -- ch->stoplight; mtx_unlock(&ch->queue_mtx); + ata_start(ch->dev); } static u_int64_t diff -u ata_curr/ata-raid.c mata/ata-raid.c --- ata_curr/ata-raid.c Wed Jan 18 05:10:17 2006 +++ mata/ata-raid.c Fri Feb 3 18:58:47 2006 @@ -55,13 +55,16 @@ /* prototypes */ static void ata_raid_done(struct ata_request *request); -static 
void ata_raid_config_changed(struct ar_softc *rdp, int writeback); +static void ata_raid_config_changed_unlock(struct ar_softc *rdp, int writeback); static int ata_raid_status(struct ata_ioc_raid_config *config); static int ata_raid_create(struct ata_ioc_raid_config *config); static int ata_raid_delete(int array); static int ata_raid_addspare(struct ata_ioc_raid_config *config); static int ata_raid_rebuild(int array); static int ata_raid_read_metadata(device_t subdisk); +static int ata_raid_write_metas(struct ata_raid_metas *metas, struct ar_softc *rdp); +static void ata_raid_free_metas(struct ata_raid_metas *metas); +static int ata_raid_write_metadata_old(struct ar_softc *rdp); static int ata_raid_write_metadata(struct ar_softc *rdp); static int ata_raid_wipe_metadata(struct ar_softc *rdp); static int ata_raid_adaptec_read_meta(device_t dev, struct ar_softc **raidp); @@ -73,9 +76,14 @@ static int ata_raid_ite_read_meta(device_t dev, struct ar_softc **raidp); static int ata_raid_lsiv2_read_meta(device_t dev, struct ar_softc **raidp); static int ata_raid_lsiv3_read_meta(device_t dev, struct ar_softc **raidp); +static int ata_raid_lsiv3_fill_metas_locked(struct ar_softc *rdp, + struct ata_raid_metas *metas); +static void ata_raid_lsiv3_fix_metas_unlocked(struct ata_raid_metas *metas); static int ata_raid_nvidia_read_meta(device_t dev, struct ar_softc **raidp); static int ata_raid_promise_read_meta(device_t dev, struct ar_softc **raidp, int native); -static int ata_raid_promise_write_meta(struct ar_softc *rdp); +static int ata_raid_promise_fill_metas_locked(struct ar_softc *rdp, + struct ata_raid_metas *metas); +static void ata_raid_promise_fix_metas_unlocked(struct ata_raid_metas *metas); static int ata_raid_sii_read_meta(device_t dev, struct ar_softc **raidp); static int ata_raid_sis_read_meta(device_t dev, struct ar_softc **raidp); static int ata_raid_sis_write_meta(struct ar_softc *rdp); @@ -84,6 +92,7 @@ static struct ata_request *ata_raid_init_request(struct 
ar_softc *rdp, struct bio *bio); static int ata_raid_send_request(struct ata_request *request); static int ata_raid_rw(device_t dev, u_int64_t lba, void *data, u_int bcount, int flags); +static int ata_raid_request_write(device_t dev, u_int64_t lba, void *data, u_int bcount, int flags, struct ar_softc *rdp); static char * ata_raid_format(struct ar_softc *rdp); static char * ata_raid_type(struct ar_softc *rdp); static char * ata_raid_flags(struct ar_softc *rdp); @@ -120,7 +129,9 @@ int disk; mtx_init(&rdp->lock, "ATA PseudoRAID metadata lock", NULL, MTX_DEF); - ata_raid_config_changed(rdp, writeback); + + mtx_lock(&rdp->lock); + ata_raid_config_changed_unlock(rdp, writeback); /* sanitize arrays total_size % (width * interleave) == 0 */ if (rdp->type == AR_T_RAID0 || rdp->type == AR_T_RAID01 || @@ -305,21 +316,24 @@ case AR_T_JBOD: case AR_T_SPAN: case AR_T_RAID0: + mtx_lock(&rdp->lock); if (((rdp->disks[drv].flags & (AR_DF_PRESENT|AR_DF_ONLINE)) == (AR_DF_PRESENT|AR_DF_ONLINE) && !rdp->disks[drv].dev)) { rdp->disks[drv].flags &= ~AR_DF_ONLINE; - ata_raid_config_changed(rdp, 1); + ata_raid_config_changed_unlock(rdp, 1); ata_free_request(request); biofinish(bp, NULL, EIO); return; } request->this = drv; request->dev = rdp->disks[request->this].dev; + mtx_unlock(&rdp->lock); ata_raid_send_request(request); break; case AR_T_RAID1: case AR_T_RAID01: + mtx_lock(&rdp->lock); if ((rdp->disks[drv].flags & (AR_DF_PRESENT|AR_DF_ONLINE))==(AR_DF_PRESENT|AR_DF_ONLINE) && !rdp->disks[drv].dev) { @@ -333,7 +347,9 @@ change = 1; } if (change) - ata_raid_config_changed(rdp, 1); + ata_raid_config_changed_unlock(rdp, 1); + else + mtx_unlock(&rdp->lock); if (!(rdp->status & AR_S_READY)) { ata_free_request(request); biofinish(bp, NULL, EIO); @@ -416,6 +432,8 @@ request->composite = composite; rebuild->composite = composite; ata_raid_send_request(rebuild); + rdp->disks[this].last_lba = + bp->bio_pblkno + chunk; } else { ata_free_composite(composite); @@ -458,9 +476,8 @@ if ((composite = 
ata_alloc_composite())) { if ((mirror = ata_alloc_request())) { - if ((blk <= rdp->rebuild_lba) && - ((blk + chunk) > rdp->rebuild_lba)) - rdp->rebuild_lba = blk + chunk; + if ((blk + chunk) > rdp->rebuild_lba) + rdp->rebuild_lba = blk + chunk; bcopy(request, mirror, sizeof(struct ata_request)); mirror->this = this; @@ -499,6 +516,7 @@ break; case AR_T_RAID5: + mtx_lock(&rdp->lock); if (((rdp->disks[drv].flags & (AR_DF_PRESENT|AR_DF_ONLINE)) == (AR_DF_PRESENT|AR_DF_ONLINE) && !rdp->disks[drv].dev)) { rdp->disks[drv].flags &= ~AR_DF_ONLINE; @@ -510,7 +528,9 @@ change = 1; } if (change) - ata_raid_config_changed(rdp, 1); + ata_raid_config_changed_unlock(rdp, 1); + else + mtx_unlock(&rdp->lock); if (!(rdp->status & AR_S_READY)) { ata_free_request(request); biofinish(bp, NULL, EIO); @@ -556,16 +576,21 @@ case AR_T_SPAN: case AR_T_RAID0: if (request->result) { + mtx_lock(&rdp->lock); rdp->disks[request->this].flags &= ~AR_DF_ONLINE; - ata_raid_config_changed(rdp, 1); - bp->bio_error = request->result; + ata_raid_config_changed_unlock(rdp, 1); + if (bp) + bp->bio_error = request->result; finished = 1; } - else { + else if (bp) { bp->bio_resid -= request->donecount; if (!bp->bio_resid) finished = 1; - } + } else { + free(request->data, M_AR); + finished = 1; + } break; case AR_T_RAID1: @@ -575,10 +600,19 @@ else mirror = request->this - rdp->width; if (request->result) { + mtx_lock(&rdp->lock); rdp->disks[request->this].flags &= ~AR_DF_ONLINE; - ata_raid_config_changed(rdp, 1); + ata_raid_config_changed_unlock(rdp, 1); } - if (rdp->status & AR_S_READY) { + if (!bp) { + /* this is write request set up by ata_raid_request_write() + * it's used for writing RAID label only. + * it's not composite. it has no dependencies. 
+ */ + free(request->data, M_AR); + finished = 1; + } + else if (rdp->status & AR_S_READY) { u_int64_t blk = 0; if (rdp->status & AR_S_REBUILDING) @@ -587,10 +621,11 @@ (request->this % rdp->width)) + request->u.ata.lba % rdp->interleave; - if (bp->bio_cmd == BIO_READ) { + if (bp->bio_cmd == BIO_READ) { /* is this a rebuild composite */ if ((composite = request->composite)) { + mtx_lock(&rdp->lock); mtx_lock(&composite->lock); /* handle the read part of a rebuild composite */ @@ -599,9 +634,10 @@ /* if read failed array is now broken */ if (request->result) { rdp->disks[request->this].flags &= ~AR_DF_ONLINE; - ata_raid_config_changed(rdp, 1); - bp->bio_error = request->result; rdp->rebuild_lba = blk; + mtx_unlock(&composite->lock); + ata_raid_config_changed_unlock(rdp, 1); + bp->bio_error = request->result; finished = 1; } @@ -613,6 +649,8 @@ if (composite->wr_done & (1 << mirror)) finished = 1; } + mtx_unlock(&composite->lock); + mtx_unlock(&rdp->lock); } } @@ -626,14 +664,15 @@ if (!composite->residual) finished = 1; } + mtx_unlock(&composite->lock); + mtx_unlock(&rdp->lock); } - mtx_unlock(&composite->lock); } /* if read failed retry on the mirror */ else if (request->result) { request->dev = rdp->disks[mirror].dev; - request->flags &= ~ATA_R_TIMEOUT; +/* request->flags &= ~ATA_R_TIMEOUT; XXXB */ ata_raid_send_request(request); return; } @@ -687,8 +726,9 @@ case AR_T_RAID5: if (request->result) { + mtx_lock(&rdp->lock); rdp->disks[request->this].flags &= ~AR_DF_ONLINE; - ata_raid_config_changed(rdp, 1); + ata_raid_config_changed_unlock(rdp, 1); if (rdp->status & AR_S_READY) { if (bp->bio_cmd == BIO_READ) { /* do the XOR game to recover data */ @@ -715,6 +755,7 @@ } if (finished) { + mtx_lock(&rdp->lock); if ((rdp->status & AR_S_REBUILDING) && rdp->rebuild_lba >= rdp->total_sectors) { int disk; @@ -728,12 +769,13 @@ } } rdp->status &= ~AR_S_REBUILDING; - ata_raid_config_changed(rdp, 1); - } - if (!bp->bio_resid) - biodone(bp); + 
ata_raid_config_changed_unlock(rdp, 1); + } else + mtx_unlock(&rdp->lock); + if (bp && !bp->bio_resid) + biodone(bp); } - + if (composite) { if (finished) { /* we are done with this composite, free all resources */ @@ -780,12 +822,14 @@ return bp.bio_error; } +/* called with (&rdp->lock) locked. + * however, returns with the lock *unlocked* + */ static void -ata_raid_config_changed(struct ar_softc *rdp, int writeback) +ata_raid_config_changed_unlock(struct ar_softc *rdp, int writeback) { int disk, count, status; - mtx_lock(&rdp->lock); /* set default all working mode */ status = rdp->status; rdp->status &= ~AR_S_DEGRADED; @@ -820,6 +864,10 @@ (rdp->disks [disk + rdp->width].flags & AR_DF_ONLINE))) { rdp->status |= AR_S_DEGRADED; } + if ((rdp->disks[disk].flags & + (AR_DF_PRESENT | AR_DF_ASSIGNED | AR_DF_SPARE)) != + (AR_DF_PRESENT | AR_DF_ASSIGNED | AR_DF_SPARE)) + rdp->status &= ~AR_S_REBUILDING; } break; @@ -853,9 +901,10 @@ ata_raid_type(rdp)); } } - mtx_unlock(&rdp->lock); if (writeback) ata_raid_write_metadata(rdp); + else + mtx_unlock(&rdp->lock); } @@ -878,7 +927,8 @@ } config->interleave = rdp->interleave; config->status = rdp->status; - config->progress = 100 * rdp->rebuild_lba / rdp->total_sectors; + config->progress = rdp->total_sectors ? 
+ (100 * rdp->rebuild_lba / rdp->total_sectors) : 0; return 0; } @@ -902,7 +952,9 @@ printf("ar%d: no memory for metadata storage\n", array); return ENOMEM; } - + /* printf("ar%d: cnfig lun %d tdisks %d type %d status %d\n", + array, config->lun, config->total_disks, config->type, config->status); + */ for (disk = 0; disk < config->total_disks; disk++) { if ((subdisk = devclass_get_device(ata_raid_sub_devclass, config->disks[disk]))) { @@ -1078,10 +1130,12 @@ case AR_F_LSIV3_RAID: rdp->interleave = min(max(2, rdp->interleave), 256); + rdp->metasize = sizeof(struct lsiv3_raid_conf); break; case AR_F_PROMISE_RAID: rdp->interleave = min(max(2, rdp->interleave), 2048); /*+*/ + rdp->metasize = sizeof(struct promise_raid_conf); break; case AR_F_SII_RAID: @@ -1133,6 +1187,7 @@ return ENXIO; rdp->status &= ~AR_S_READY; + if (rdp->disk) disk_destroy(rdp->disk); @@ -1163,15 +1218,18 @@ { struct ar_softc *rdp; device_t subdisk; - int disk; + int disk, error = 0; if (!(rdp = ata_raid_arrays[config->lun])) return ENXIO; + + mtx_lock(&rdp->lock); /* XXX Race here if rdp goes away + (e.g., due to disk failure) */ if (!(rdp->status & AR_S_DEGRADED) || !(rdp->status & AR_S_READY)) - return ENXIO; - if (rdp->status & AR_S_REBUILDING) - return EBUSY; - switch (rdp->type) { + error = ENXIO; + else if (rdp->status & AR_S_REBUILDING) + error = EBUSY; + else switch (rdp->type) { case AR_T_RAID1: case AR_T_RAID01: case AR_T_RAID5: @@ -1185,8 +1243,10 @@ config->disks[0] ))) { struct ata_raid_subdisk *ars = device_get_softc(subdisk); - if (ars->raid[rdp->volume]) + if (ars->raid[rdp->volume]) { + mtx_unlock(&rdp->lock); return EBUSY; + } /* XXX SOS validate size etc etc */ ars->raid[rdp->volume] = rdp; @@ -1198,15 +1258,17 @@ device_printf(rdp->disks[disk].dev, "inserted into ar%d disk%d as spare\n", rdp->lun, disk); - ata_raid_config_changed(rdp, 1); + ata_raid_config_changed_unlock(rdp, 1); return 0; } } - return ENXIO; + error = ENXIO; default: - return EPERM; + error = EPERM; } + 
mtx_unlock(&rdp->lock); + return error; } static int @@ -1321,12 +1383,90 @@ } static int -ata_raid_write_metadata(struct ar_softc *rdp) +ata_raid_write_metas(struct ata_raid_metas *metas, struct ar_softc *rdp) +{ + int disk; + int error = 0; + + for (disk = 0; disk < metas->am_nmeta; disk++) { + if (metas->am_metas[disk].m_dev) { + if (ata_raid_request_write(metas->am_metas[disk].m_dev, + metas->am_metas[disk].m_lba, + metas->am_metas[disk].m_meta, + metas->am_metasize, + ATA_R_WRITE | ATA_R_DIRECT, + rdp)) { + device_printf(metas->am_metas[disk].m_dev, + "write metadata failed\n"); + free(metas->am_metas[disk].m_meta, M_AR); + error = EIO; + } + } + } + free(metas, M_AR); + return error; +} + +static struct ata_raid_metas * +ata_raid_alloc_metas(int ndisks, int metasize) +{ + struct ata_raid_metas *metas; + int disk; + + metas = (struct ata_raid_metas *) + malloc(sizeof(*metas) + sizeof(metas->am_metas[0]) * (ndisks - 1), + M_AR, M_NOWAIT | M_ZERO); + if (!metas) + return NULL; + + metas->am_nmeta = ndisks; + metas->am_metasize = metasize; + + for (disk = 0; disk < ndisks; disk++) { + metas->am_metas[disk].m_meta = + malloc(metasize, M_AR, M_NOWAIT | M_ZERO); + if (!metas->am_metas[disk].m_meta) { + ata_raid_free_metas(metas); + return NULL; + } + } + return metas; +} + +static void +ata_raid_free_metas(struct ata_raid_metas *metas) +{ + int disk; + if (metas) { + for (disk = 0; disk < metas->am_nmeta; disk ++) + free(metas->am_metas[disk].m_meta, M_AR); + free(metas, M_AR); + } +} + +/* creating vendor metadata disk labels + * fill() is called under rdp->lock or otherwise guaranteed stability of *rdp + * it creates vendor-specific label in memory based on our internal label + * fix() needs no lock. 
it does checksums, etc based on filled content only + */ +static struct { + int format; + int (*fill)(struct ar_softc *rdp, + struct ata_raid_metas *metas); + void (*fix)(struct ata_raid_metas *metas); +} metas_fun[] = { + {AR_F_FREEBSD_RAID, ata_raid_promise_fill_metas_locked, + ata_raid_promise_fix_metas_unlocked}, + {AR_F_PROMISE_RAID, ata_raid_promise_fill_metas_locked, + ata_raid_promise_fix_metas_unlocked}, + {AR_F_LSIV3_RAID, ata_raid_lsiv3_fill_metas_locked, + ata_raid_lsiv3_fix_metas_unlocked} +}; + +static int +ata_raid_write_metadata_old(struct ar_softc *rdp) { switch (rdp->format) { - case AR_F_FREEBSD_RAID: - case AR_F_PROMISE_RAID: - return ata_raid_promise_write_meta(rdp); case AR_F_HPTV3_RAID: case AR_F_HPTV2_RAID: @@ -1334,7 +1474,7 @@ * always write HPT v2 metadata, the v3 BIOS knows it as well. * this is handy since we cannot know what version BIOS is on there */ - return ata_raid_hptv2_write_meta(rdp); + return ata_raid_hptv2_write_meta(rdp); /* XXX need to redo */ case AR_F_INTEL_RAID: return ata_raid_intel_write_meta(rdp); @@ -1357,9 +1497,6 @@ case AR_F_LSIV2_RAID: return ata_raid_lsiv2_write_meta(rdp); - case AR_F_LSIV3_RAID: - return ata_raid_lsiv3_write_meta(rdp); - case AR_F_NVIDIA_RAID: return ata_raid_nvidia_write_meta(rdp); @@ -1374,6 +1511,53 @@ return -1; } +/* called with mtx_lock(&rdp->lock) locked + * returns with the lock unlocked + */ +static int +ata_raid_write_metadata(struct ar_softc *rdp) +{ + int error = 0; + struct ata_raid_metas *metas = NULL; + int ndisks, fi; + + for (fi = 0; fi < ARRAY_SIZE(metas_fun); fi ++) { + if (metas_fun[fi].format == rdp->format) + break; + } + if (fi >= ARRAY_SIZE(metas_fun)) { + mtx_unlock(&rdp->lock); + return ata_raid_write_metadata_old(rdp); + } + if (rdp->total_disks == 0) { + mtx_unlock(&rdp->lock); + return 0; + } + KASSERT(rdp->metasize, ("%s: metasize is not set for %s!", + ata_raid_format(rdp))); + ndisks = rdp->total_disks; + + metas = ata_raid_alloc_metas(ndisks, rdp->metasize); + 
if (!metas) { + printf("ar%d: failed to allocate metadata storage\n", rdp->lun); + mtx_unlock(&rdp->lock); + return ENOMEM; + } + rdp->generation++; + + error = metas_fun[fi].fill(rdp, metas); + mtx_unlock(&rdp->lock); + + if (error) { + ata_raid_free_metas(metas); + return error; + } + metas_fun[fi].fix(metas); + error = ata_raid_write_metas(metas, rdp); + return error; +} + + static int ata_raid_wipe_metadata(struct ar_softc *rdp) { @@ -2516,6 +2700,165 @@ } /* LSILogic V3 MegaRAID Metadata */ + +/* + * generic checksum handling. move to some kern/subr_* file later + */ +static int +checksum_calc(struct checksum_descr *d, u_int8_t *s) +{ + u_int8_t checksum; + int i; + for (checksum = 0, i = d->from; i < d->to; i++) + checksum += s[i]; + return checksum; +} + +static void +checksum_set(struct checksum_descr *d, u_int8_t *s) +{ + u_int8_t checksum; + int i; + for (checksum = 0, i = d->from; i < d->to; i++) + checksum += s[i]; + s[d->checksum_offset] -= checksum; +} + +static boolean_t +checksum_test_n(struct checksum_descr descr[], int ndescr, u_int8_t *s) +{ + int j; + for (j = 0; j < ndescr; j ++) { + if (checksum_calc(descr + j, s) != 0) + return FALSE; + } + return TRUE; +} + +static void +checksum_set_n(struct checksum_descr descr[], int ndescr, u_int8_t *s) +{ + int j; + for (j = 0; j < ndescr; j ++) + checksum_set(descr + j, s); +} + +/* + * LSI v3 checksum descriptor array and functions + */ +static struct checksum_descr lsiv3_checksums[] = { + /* 0 [..[ 512 */ + {0, LSIV3_FIELD(filler_5), LSIV3_FIELD(checksum_0)}, + /* 0x600 [..[ 0x610 */ + {LSIV3_FIELD(lsi_id), LSIV3_FIELD(raid), LSIV3_FIELD(checksum_1)}, + /* 0x610 [..[ 0x800 */ + {LSIV3_FIELD(raid), sizeof(struct lsiv3_raid_conf), LSIV3_FIELD(checksum_2)} +}; + +static int +lsiv3_test_checksums(struct lsiv3_raid_conf *meta) +{ + return checksum_test_n(lsiv3_checksums, ARRAY_SIZE(lsiv3_checksums), + (u_int8_t *)meta); +} + +static void +lsiv3_set_checksums(struct lsiv3_raid_conf *meta) +{ + 
checksum_set_n(lsiv3_checksums, ARRAY_SIZE(lsiv3_checksums),
+ (u_int8_t *)meta);
+}
+
+static void
+ata_raid_lsiv3_print_timestamp(u_int8_t *p)
+{
+ u_int32_t t = (p[3] << 24) + (p[2] << 16) + (p[1] << 8) + p[0];
+
+ printf(" day %d\n", (t >> 27) & 0x1f);
+ printf(" month %d\n", (t >> 23) & 0x0f);
+ printf(" hour %d\n", (t >> 18) & 0x1f);
+ printf(" min %d\n", (t >> 12) & 0x3f);
+ printf(" sec %d\n", (t >> 6) & 0x3f);
+}
+
+/* we have no struct tm, gmtime() or similar function
+ * generally available in the kernel.
+ * Every subsystem or driver that needs similar functionality
+ * implements its own version.
+ * In particular, any subsystem that deals with MS-DOS or its progeny
+ * has reimplemented these functions.
+ * Code below is adapted from sys/fs/msdosfs/msdosfs_conv.c:unix2dostime()
+ */
+/*
+ * Total number of days that have passed for each month in a regular year.
+ */
+static u_short regyear[] = {
+ 31, 59, 90, 120, 151, 181,
+ 212, 243, 273, 304, 334, 365
+};
+
+/*
+ * Total number of days that have passed for each month in a leap year.
+ */
+static u_short leapyear[] = {
+ 31, 60, 91, 121, 152, 182,
+ 213, 244, 274, 305, 335, 366
+};
+
+static void
+ata_raid_lsiv3_make_timestamp(struct timeval *tsp, u_int8_t *dtp)
+{
+ u_long days;
+ u_long inc;
+ u_long year;
+ u_long month;
+ u_short *months;
+ u_long t;
+ /*
+ * Variables used to remember parts of the last time conversion.
+ * Maybe we can avoid a full conversion.
+ */
+ static u_long lasttime;
+ static u_long lastday;
+ static u_int32_t lasttimestamp;
+
+ /*
+ * If the time from the last conversion is the same as now, then
+ * skip the computations and use the saved result.
+ */
+ t = tsp->tv_sec;
+ if (lasttime != t) {
+ lasttime = t;
+ lasttimestamp = ((t % 60) << 6) +
+ (((t / 60) % 60) << 12) +
+ (((t / 3600) % 24) << 18);
+
+ /*
+ * If the number of days since 1970 is the same as the last
+ * time we did the computation then skip all this leap year
+ * and month stuff.
+ */ + days = t / (24 * 60 * 60); + if (days != lastday) { + lastday = days; + for (year = 1970;; year++) { + inc = year & 0x03 ? 365 : 366; + if (days < inc) + break; + days -= inc; + } + months = year & 0x03 ? regyear : leapyear; + for (month = 0; days >= months[month]; month++) + ; + if (month > 0) + days -= months[month - 1]; + lasttimestamp |= ((days + 1) << 27) + + ((month + 1) << 23); + } + } + bcopy(&lasttimestamp, dtp, 4); +} + static int ata_raid_lsiv3_read_meta(device_t dev, struct ar_softc **raidp) { @@ -2523,8 +2866,7 @@ device_t parent = device_get_parent(dev); struct lsiv3_raid_conf *meta; struct ar_softc *raid = NULL; - u_int8_t checksum, *ptr; - int array, entry, count, disk_number, retval = 0; + int array, entry, disk_number, retval = 0, attach = 0; if (!(meta = (struct lsiv3_raid_conf *) malloc(sizeof(struct lsiv3_raid_conf), M_AR, M_NOWAIT | M_ZERO))) @@ -2538,16 +2880,14 @@ } /* check if this is a LSI RAID struct */ - if (strncmp(meta->lsi_id, LSIV3_MAGIC, strlen(LSIV3_MAGIC))) { + if (strncmp(meta->lsi_id, LSIV3_MAGIC_ID, strlen(LSIV3_MAGIC_ID))) { if (testing || bootverbose) device_printf(parent, "LSI (v3) check1 failed\n"); goto lsiv3_out; } - /* check if the checksum is OK */ - for (checksum = 0, ptr = meta->lsi_id, count = 0; count < 512; count++) - checksum += *ptr++; - if (checksum) { + /* check if the checksums are OK */ + if (!lsiv3_test_checksums(meta)) { if (testing || bootverbose) device_printf(parent, "LSI (v3) check2 failed\n"); goto lsiv3_out; @@ -2566,6 +2906,7 @@ device_printf(parent, "failed to allocate metadata storage\n"); goto lsiv3_out; } + attach = 1; } raid = raidp[array]; if (raid->format && (raid->format != AR_F_LSIV3_RAID)) { @@ -2581,7 +2922,10 @@ switch (meta->raid[entry].total_disks) { case 0: + free(raidp[array], M_AR); + raidp[array] = NULL; entry++; + attach = 0; continue; case 1: if (meta->raid[entry].device == meta->device) { @@ -2618,9 +2962,29 @@ meta->raid[entry].type); free(raidp[array], M_AR); 
raidp[array] = NULL; + attach = 0; entry++; continue; } + switch (meta->raid[entry].status) { + case LSIV3_R_READY: + raid->status |= AR_S_READY; + break; + case LSIV3_R_DEGRADED: + raid->status |= AR_S_READY | AR_S_DEGRADED; + break; + case LSIV3_R_OFFLINE: + /* nothing to do */ + break; + default: + device_printf(parent, "LSI v3 unknown RAID status 0x%02x\n", + meta->raid[entry].status); + free(raidp[array], M_AR); + raidp[array] = NULL; + attach = 0; + entry++; + continue; + } raid->magic_0 = meta->timestamp; raid->format = AR_F_LSIV3_RAID; @@ -2634,11 +2998,28 @@ raid->offset_sectors = meta->raid[entry].offset; raid->rebuild_lba = 0; raid->lun = array; + raid->metasize = sizeof(*meta); raid->disks[disk_number].dev = parent; raid->disks[disk_number].sectors = raid->total_sectors / raid->width; - raid->disks[disk_number].flags = - (AR_DF_PRESENT | AR_DF_ASSIGNED | AR_DF_ONLINE); + switch (meta->disk[disk_number].disk_status) { + case LSIV3_D_ON: + raid->disks[disk_number].flags = + (AR_DF_PRESENT | AR_DF_ASSIGNED | AR_DF_ONLINE); + break; + case LSIV3_D_FAIL: + raid->disks[disk_number].flags = AR_DF_PRESENT | AR_DF_ASSIGNED; + break; + case LSIV3_D_CLEAR: + raid->disks[disk_number].flags = AR_DF_PRESENT; + break; + } + if (attach) { + /* ata_raid_attach(raid, 0, 0); */ + mtx_init(&raid->lock, "ATA PseudoRAID metadata lock", NULL, MTX_DEF); + device_printf(parent, "assigned to newly defined ar%d\n", array); + attach = 0; + } ars->raid[raid->volume] = raid; ars->disk_number[raid->volume] = disk_number; retval = 1; @@ -2651,6 +3032,116 @@ return retval; } +static int +ata_raid_lsiv3_fill_metas_locked(struct ar_softc *rdp, + struct ata_raid_metas *metas) +{ + struct timeval timestamp; + struct lsiv3_raid_conf *meta; + int disk, drive; + int entry; /* raid entry */ + + microtime(×tamp); + + for (disk = 0; disk < rdp->total_disks; disk++) { + meta = metas->am_metas[disk].m_meta; + metas->am_metas[disk].m_dev = rdp->disks[disk].dev; + if (!metas->am_metas[disk].m_dev) + 
continue; + + metas->am_metas[disk].m_lba = LSIV3_LBA(metas->am_metas[disk].m_dev); + + meta->magic_0 = 0xa0203200; + strncpy(meta->magic_1, LSIV3_MAGIC_SATA, sizeof(meta->magic_1)); + meta->dummy_0 = meta->dummy_1 = 0x0d000003; + + strncpy(meta->magic_2, LSIV3_MAGIC_ENQ, sizeof(meta->magic_2)); + meta->magic_e2 = 0x33; + meta->magic_e3 = 0x30; + ata_raid_lsiv3_make_timestamp(×tamp, meta->timestamp_c); + meta->dummy_e2 = 0xd6; /* XXX saw 0xe6 once */ + + strncpy(meta->lsi_id, LSIV3_MAGIC_ID, sizeof(meta->lsi_id)); + meta->magic_f2 = 0x33; + meta->magic_f3 = 0x31; + meta->dummy_2 = 1; /* 0 for clear, not in our case */ + meta->dummy_3 = 0x4004; + + /* + * XXX We handle a single lun and a single raid array per disk only. + * To write meta conf for several arrays using the same disk + * ata_raid_write_metadata() and its callers should be + * rearchitected. + */ + entry = rdp->lun; + + meta->raid[entry].stripe_pages = rdp->interleave / 8; /* XXX 4? */ + switch (rdp->type) { + case AR_T_RAID0: + meta->raid[entry].type = LSIV3_T_RAID0; + meta->raid[entry].array_width = rdp->total_disks; /*XXX is it so?*/ + break; + case AR_T_RAID1: + meta->raid[entry].type = LSIV3_T_RAID1; + meta->raid[entry].array_width = rdp->width; + break; + default: + printf( + "ar%d: RAID type %d unsupported in current LSIV3 implementation\n", + rdp->lun, rdp->type); + return ENODEV; + } + + switch (rdp->status & (AR_S_READY | AR_S_DEGRADED)) { + case AR_S_READY: + meta->raid[entry].status = LSIV3_R_READY; + break; + case AR_S_READY | AR_S_DEGRADED: + meta->raid[entry].status = LSIV3_R_DEGRADED; + break; + default: + meta->raid[entry].status = LSIV3_R_OFFLINE; + } + meta->raid[entry].total_disks = rdp->total_disks; + meta->raid[entry].sectors = rdp->total_sectors / rdp->width; + meta->raid[entry].offset = rdp->offset_sectors; + meta->raid[entry].device = 0; /* XXX */ + meta->raid[entry].dummy_3 = 0x10; + + for (drive = 0; drive < rdp->total_disks; drive++) { + meta->disk[drive].disk_sectors = 
rdp->total_sectors / rdp->width; + + switch (rdp->disks[drive].flags & + (AR_DF_PRESENT|AR_DF_ASSIGNED|AR_DF_ONLINE)) { + case AR_DF_PRESENT|AR_DF_ASSIGNED|AR_DF_ONLINE: + meta->disk[drive].disk_status = LSIV3_D_ON; + break; + case AR_DF_PRESENT|AR_DF_ASSIGNED: + meta->disk[drive].disk_status = LSIV3_D_FAIL; + break; + case AR_DF_PRESENT: + meta->disk[drive].disk_status = LSIV3_D_CLEAR; + break; + } + meta->disk[drive].flags = LSIV3_D_STRIPE; /* XXX all I ever saw */ + } + meta->device = (disk & 1 ? LSIV3_D_CHANNEL : 0) + + (disk & 2 ? LSIV3_D_DEVICE : 0); + bcopy(meta->timestamp_c, &meta->timestamp, sizeof(meta->timestamp)); + } + return 0; +} + +static void +ata_raid_lsiv3_fix_metas_unlocked(struct ata_raid_metas *metas) +{ + int disk; + + for (disk = 0; disk < metas->am_nmeta; disk++) + if (metas->am_metas[disk].m_dev) + lsiv3_set_checksums(metas->am_metas[disk].m_meta); +} + /* nVidia MediaShield Metadata */ static int ata_raid_nvidia_read_meta(device_t dev, struct ar_softc **raidp) @@ -2785,7 +3276,7 @@ struct promise_raid_conf *meta; struct ar_softc *raid; u_int32_t checksum, *ptr; - int array, count, disk, disksum = 0, retval = 0; + int array, count, disk, disksum = 0, retval = 0, attach = 0; if (!(meta = (struct promise_raid_conf *) malloc(sizeof(struct promise_raid_conf), M_AR, M_NOWAIT | M_ZERO))) @@ -2846,6 +3337,7 @@ device_printf(parent, "failed to allocate metadata storage\n"); goto promise_out; } + attach = 1; } raid = raidp[array]; if (raid->format && @@ -2886,6 +3378,7 @@ native ? 
"FreeBSD" : "Promise", meta->raid.type); free(raidp[array], M_AR); raidp[array] = NULL; + attach = 0; goto promise_out; } raid->magic_1 = meta->raid.magic_1; @@ -2901,6 +3394,8 @@ raid->offset_sectors = 0; raid->rebuild_lba = meta->raid.rebuild_lba; raid->lun = array; + raid->metasize = sizeof(*meta); + if ((meta->raid.status & (PR_S_VALID | PR_S_ONLINE | PR_S_INITED | PR_S_READY)) == (PR_S_VALID | PR_S_ONLINE | PR_S_INITED | PR_S_READY)) { @@ -2954,6 +3449,11 @@ } } } + if (attach) { + /* ata_raid_attach(raid, 0, 0)*/ ; + mtx_init(&raid->lock, "ATA PseudoRAID metadata lock", NULL, MTX_DEF); + device_printf(parent, "assigned to newly defined ar%d\n", array); + } break; } @@ -2963,23 +3463,23 @@ } static int -ata_raid_promise_write_meta(struct ar_softc *rdp) +ata_raid_promise_fill_metas_locked(struct ar_softc *rdp, + struct ata_raid_metas *metas) { struct promise_raid_conf *meta; struct timeval timestamp; - u_int32_t *ckptr; - int count, disk, drive, error = 0; - - if (!(meta = (struct promise_raid_conf *) - malloc(sizeof(struct promise_raid_conf), M_AR, M_NOWAIT))) { - printf("ar%d: failed to allocate metadata storage\n", rdp->lun); - return ENOMEM; - } + int count, disk, drive; - rdp->generation++; microtime(×tamp); for (disk = 0; disk < rdp->total_disks; disk++) { + meta = metas->am_metas[disk].m_meta; + metas->am_metas[disk].m_dev = rdp->disks[disk].dev; + if (!metas->am_metas[disk].m_dev) + continue; + + metas->am_metas[disk].m_lba = PROMISE_LBA(rdp->disks[disk].dev); + for (count = 0; count < sizeof(struct promise_raid_conf); count++) *(((u_int8_t *)meta) + count) = 255 - (count % 256); meta->dummy_0 = 0x00020000; @@ -3095,22 +3595,27 @@ } else bzero(meta->promise_id, sizeof(meta->promise_id)); + } + } + return 0; +} + +static void +ata_raid_promise_fix_metas_unlocked(struct ata_raid_metas *metas) +{ + struct promise_raid_conf *meta; + u_int32_t *ckptr; + int count, disk; + + for (disk = 0; disk < metas->am_nmeta; disk++) { + meta = 
metas->am_metas[disk].m_meta; + if (metas->am_metas[disk].m_dev) { meta->checksum = 0; for (ckptr = (int32_t *)meta, count = 0; count < 511; count++) meta->checksum += *ckptr++; - if (testing || bootverbose) - ata_raid_promise_print_meta(meta); - if (ata_raid_rw(rdp->disks[disk].dev, - PROMISE_LBA(rdp->disks[disk].dev), - meta, sizeof(struct promise_raid_conf), - ATA_R_WRITE | ATA_R_DIRECT)) { - device_printf(rdp->disks[disk].dev, "write metadata failed\n"); - error = EIO; - } - } + + } } - free(meta, M_AR); - return error; } /* Silicon Image Medley Metadata */ @@ -3600,6 +4105,7 @@ free(meta, M_AR); return retval; } + static int ata_raid_via_write_meta(struct ar_softc *rdp) { @@ -3801,6 +4307,46 @@ return error; } +static int +ata_raid_request_write(device_t dev, u_int64_t lba, void *data, + u_int bcount, int flags, struct ar_softc *rdp) +{ + struct ata_device *atadev = device_get_softc(dev); + struct ata_request *request; + + if (bcount % DEV_BSIZE) { + device_printf(dev, "FAILURE - transfers must be modulo sectorsize\n"); + return ENOMEM; + } + + if (!(request = ata_alloc_request())) { + device_printf(dev, + "FAILURE - out of memory in ata_raid_request_write\n"); + return ENOMEM; + } + + /* setup request */ + request->dev = dev; + request->timeout = 10; + request->retries = 0; + request->data = data; + request->bytecount = bcount; + request->transfersize = DEV_BSIZE; + request->u.ata.lba = lba; + request->u.ata.count = request->bytecount / DEV_BSIZE; + request->flags = flags; + request->driver = rdp; + request->callback = ata_raid_done; + + if (atadev->mode >= ATA_DMA) { + request->u.ata.command = ATA_WRITE_DMA; + request->flags |= ATA_R_DMA; + } else + request->u.ata.command = ATA_WRITE; + ata_queue_request(request); + return 0; +} + /* * module handeling */ @@ -3830,13 +4376,16 @@ { struct ata_raid_subdisk *ars = device_get_softc(dev); int volume; + struct ar_softc *rdp; for (volume = 0; volume < MAX_VOLUMES; volume++) { - if (ars->raid[volume]) { - 
ars->raid[volume]->disks[ars->disk_number[volume]].flags &= + rdp = ars->raid[volume]; + if (rdp) { + mtx_lock(&rdp->lock); + rdp->disks[ars->disk_number[volume]].flags &= ~(AR_DF_PRESENT | AR_DF_ONLINE); - ars->raid[volume]->disks[ars->disk_number[volume]].dev = NULL; - ata_raid_config_changed(ars->raid[volume], 1); + rdp->disks[ars->disk_number[volume]].dev = NULL; + ata_raid_config_changed_unlock(rdp, 1); ars->raid[volume] = NULL; ars->disk_number[volume] = -1; } @@ -4370,17 +4919,27 @@ int i; printf("******* ATA LSILogic V3 MegaRAID Metadata *******\n"); + printf("magic_1 <%.4s>\n", meta->magic_1); + + /* reversed timestamp_c printing order (Intel x86 is little-endian) */ + printf("timestamp_c 0x%02x%02x%02x%02x\n", + meta->timestamp_c[3], meta->timestamp_c[2], + meta->timestamp_c[1], meta->timestamp_c[0]); + ata_raid_lsiv3_print_timestamp(meta->timestamp_c); + printf("dummy_e2 0x%02x\n", meta->dummy_e2); + printf("lsi_id <%.6s>\n", meta->lsi_id); - printf("dummy_0 0x%04x\n", meta->dummy_0); - printf("version 0x%04x\n", meta->version); - printf("dummy_0 0x%04x\n", meta->dummy_1); + printf("dummy_2 0x%02x\n", meta->dummy_2); printf("RAID configs:\n"); for (i = 0; i < 8; i++) { if (meta->raid[i].total_disks) { printf("%02d stripe_pages %u\n", i, meta->raid[i].stripe_pages); - printf("%02d type %s\n", i, - ata_raid_lsiv3_type(meta->raid[i].type)); + printf("%02d type %s (%u)\n", i, + ata_raid_lsiv3_type(meta->raid[i].type), + meta->raid[i].type); + printf("%02d status %u\n", i, + meta->raid[i].status); printf("%02d total_disks %u\n", i, meta->raid[i].total_disks); printf("%02d array_width %u\n", i, @@ -4389,19 +4948,30 @@ printf("%02d offset %u\n", i, meta->raid[i].offset); printf("%02d device 0x%02x\n", i, meta->raid[i].device); + printf("%02d dummy_3 0x%02x\n", i, + meta->raid[i].dummy_3); } } printf("DISK configs:\n"); for (i = 0; i < 6; i++) { - if (meta->disk[i].disk_sectors) { + if (meta->disk[i].disk_sectors) { printf("%02d disk_sectors %u\n", i, 
meta->disk[i].disk_sectors); + printf("%02d disk_status 0x%02x\n", i, + meta->disk[i].disk_status); printf("%02d flags 0x%02x\n", i, meta->disk[i].flags); } } printf("device 0x%02x\n", meta->device); printf("timestamp 0x%08x\n", meta->timestamp); + ata_raid_lsiv3_print_timestamp((u_int8_t*)&meta->timestamp); + printf("checksum_0 0x%02x\n", meta->checksum_0); printf("checksum_1 0x%02x\n", meta->checksum_1); + printf("checksum_2 0x%02x\n", meta->checksum_2); + + if (!lsiv3_test_checksums(meta)) + printf("lsiv3 conf checksum check FAILED\n"); + printf("=================================================\n"); } diff -u ata_curr/ata-raid.h mata/ata-raid.h --- ata_curr/ata-raid.h Wed Jan 18 05:10:17 2006 +++ mata/ata-raid.h Tue Jan 31 16:21:03 2006 @@ -34,6 +34,17 @@ #define ATA_MAGIC "FreeBSD ATA driver RAID " +#define FIELD_OFFSET(struc, field) ((int)&(((struct struc*)0)->field)) +#define ARRAY_SIZE(a) (sizeof(a)/sizeof((a)[0])) + +/* generic checksum descriptor. move it to some generic include file later + */ +struct checksum_descr { + int from; /* from, inclusive lower limit */ + int to; /* to, non-inclusive upper limit */ + int checksum_offset; /* offset to checksum byte, within range */ +}; + struct ata_raid_subdisk { struct ar_softc *raid[MAX_VOLUMES]; int disk_number[MAX_VOLUMES]; @@ -103,6 +114,19 @@ struct mtx lock; /* metadata lock */ struct disk *disk; /* disklabel/slice stuff */ struct proc *pid; /* rebuilder process id */ + int metasize; /* sizeof(metadata label) */ +}; + +/* an array of metas (array disk labels) + */ +struct ata_raid_metas { + int am_nmeta; /* number of entries */ + int am_metasize; /* size of meta label, in bytes */ + struct { + void *m_meta; + device_t m_dev; + u_int64_t m_lba; + } am_metas[1]; }; /* Adaptec HostRAID Metadata */ @@ -463,25 +487,37 @@ #define LSIV3_LBA(dev) \ (((struct ad_softc *)device_get_ivars(dev))->total_secs - 4) +#define LSIV3_FIELD(field) FIELD_OFFSET(lsiv3_raid_conf, field) + struct lsiv3_raid_conf { u_int32_t 
magic_0; /* 0xa0203200 */ u_int32_t filler_0[3]; u_int8_t magic_1[4]; /* "SATA" */ +#define LSIV3_MAGIC_SATA "SATA" + u_int32_t filler_1[40]; u_int32_t dummy_0; /* 0x0d000003 */ u_int32_t filler_2[7]; u_int32_t dummy_1; /* 0x0d000003 */ u_int32_t filler_3[70]; - u_int8_t magic_2[8]; /* "$_ENQ$31" */ - u_int8_t filler_4[7]; - u_int8_t checksum_0; + u_int8_t magic_2[6]; /* "$_ENQ$" */ +#define LSIV3_MAGIC_ENQ "$_ENQ$" + + u_int8_t checksum_0; /* magic_0 .. dummy_e2 (0..512) */ + u_int8_t magic_e2; /* 0x33 */ + u_int8_t magic_e3; /* 0x30 */ + u_int8_t timestamp_c[4]; /* unaligned timestamp copy */ + u_int8_t filler_4[2]; + u_int8_t dummy_e2; /* 0xd6 or 0xe6 */ u_int8_t filler_5[512*2]; u_int8_t lsi_id[6]; -#define LSIV3_MAGIC "$_IDE$" +#define LSIV3_MAGIC_ID "$_IDE$" - u_int16_t dummy_2; /* 0x33de for OK disk */ - u_int16_t version; /* 0x0131 for this version */ - u_int16_t dummy_3; /* 0x0440 always */ + u_int8_t checksum_1; /* lsi_id .. filler6 (0x600..0x610) */ + u_int8_t magic_f2; /* 0x33 */ + u_int8_t magic_f3; /* 0x31 */ + u_int8_t dummy_2; /* 0 for clear, or 1 */ + u_int16_t dummy_3; /* 0x4004 */ u_int32_t filler_6; struct { @@ -490,7 +526,11 @@ #define LSIV3_T_RAID0 0x00 #define LSIV3_T_RAID1 0x01 - u_int8_t dummy_0; + u_int8_t status; +#define LSIV3_R_READY 0x00 +#define LSIV3_R_DEGRADED 0x02 +#define LSIV3_R_OFFLINE 0x03 + u_int8_t total_disks; u_int8_t array_width; u_int8_t filler_0[10]; @@ -503,27 +543,33 @@ #define LSIV3_D_DEVICE 0x01 #define LSIV3_D_CHANNEL 0x10 - u_int8_t dummy_3; + u_int8_t dummy_3; /* 0x10 */ u_int8_t dummy_4; u_int8_t dummy_5; u_int8_t filler_1[16]; } __packed raid[8]; struct { u_int32_t disk_sectors; - u_int32_t dummy_0; - u_int32_t dummy_1; - u_int8_t dummy_2; + u_int16_t dummy_0; + u_int8_t disk_status; +#define LSIV3_D_ON 0x00 +#define LSIV3_D_FAIL 0x02 +#define LSIV3_D_CLEAR 0xff + + u_int8_t dummy_1; + u_int32_t dummy_2; u_int8_t dummy_3; + u_int8_t dummy_4; u_int8_t flags; #define LSIV3_D_MIRROR 0x00 -#define LSIV3_D_STRIPE 
0xff - u_int8_t dummy_4; +#define LSIV3_D_STRIPE 0xff /* SPAN? */ + u_int8_t dummy_5; } __packed disk[6]; u_int8_t filler_7[7]; u_int8_t device; u_int32_t timestamp; u_int8_t filler_8[3]; - u_int8_t checksum_1; + u_int8_t checksum_2; /* raid .. end (0x610..0x800) */ } __packed; Only in ata_curr: ata_if.m Only in ata_curr: jre-1_5_0_06-linux-amd64.rpm --- ata-incl.patch ends here --- >Release-Note: >Audit-Trail: >Unformatted:
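The patch's central scheme splits label writing into a fill() phase (run under the softc lock, building the vendor label in memory) and a fix() phase (lock-free, checksums only), dispatched through the metas_fun[] format table. A minimal standalone sketch of that dispatch follows; every name in it is an illustrative stand-in, not the driver's own:

```c
#include <stddef.h>
#include <string.h>

struct label { unsigned char buf[32]; };

/* One entry per on-disk format: fill() runs under the softc lock,
 * fix() needs no lock since it only touches the filled buffer. */
struct label_ops {
    int fmt;
    int  (*fill)(struct label *);
    void (*fix)(struct label *);
};

static int demo_fill(struct label *l)
{
    memset(l->buf, 0xab, sizeof(l->buf));
    l->buf[0] = 0;                      /* checksum slot */
    return 0;
}

static void demo_fix(struct label *l)
{
    unsigned char sum = 0;
    size_t i;

    for (i = 1; i < sizeof(l->buf); i++)
        sum += l->buf[i];
    l->buf[0] = (unsigned char)(0 - sum); /* whole label sums to 0 mod 256 */
}

static const struct label_ops ops_tab[] = { {1, demo_fill, demo_fix} };

/* Look up the format and run both phases; 0 on success, -1 if unknown. */
static int write_label(int fmt, struct label *l)
{
    size_t i;

    for (i = 0; i < sizeof(ops_tab) / sizeof(ops_tab[0]); i++) {
        if (ops_tab[i].fmt != fmt)
            continue;
        int error = ops_tab[i].fill(l);
        if (error)
            return error;
        ops_tab[i].fix(l);
        return 0;   /* a real writer would queue the disk I/O here */
    }
    return -1;
}
```

Because fix() depends only on the buffer contents, the same fill/fix pairs could in principle back GEOM labelling modules, as the submission suggests.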
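The generic checksum code the patch adds (struct checksum_descr with checksum_calc()/checksum_set()) relies on one invariant: a valid byte range sums to zero modulo 256, and the adjusting checksum byte lives inside the range it covers. A userland-style sketch of the same logic, using stdint types in place of the kernel's u_int8_t:

```c
#include <stdint.h>

/* Mirrors the patch's struct checksum_descr: a byte range
 * [from, to) that must sum to zero (mod 256). */
struct checksum_descr {
    int from;            /* inclusive lower limit */
    int to;              /* non-inclusive upper limit */
    int checksum_offset; /* offset of checksum byte, within the range */
};

/* Sum of the described range; zero means the checksum is valid. */
static uint8_t checksum_calc(const struct checksum_descr *d, const uint8_t *s)
{
    uint8_t sum = 0;
    int i;

    for (i = d->from; i < d->to; i++)
        sum += s[i];
    return sum;
}

/* Subtract the current sum from the checksum byte; since that byte is
 * itself inside the range, the range then sums to exactly zero. */
static void checksum_set(const struct checksum_descr *d, uint8_t *s)
{
    s[d->checksum_offset] -= checksum_calc(d, s);
}
```

The LSI v3 metadata simply needs three such descriptors (lsiv3_checksums[]) because its label carries three independently checksummed regions.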
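The 32-bit LSI v3 timestamp layout implied by ata_raid_lsiv3_print_timestamp() and ata_raid_lsiv3_make_timestamp() packs day, month, hour, minute, and second into fixed bit fields; no year is stored, and the low 6 bits are unused. A hypothetical pack/unpack pair (names are illustrative) makes the layout explicit:

```c
#include <stdint.h>

/* Bit layout, per the print routine in the patch:
 *   bits 27..31  day of month (5 bits)
 *   bits 23..26  month        (4 bits)
 *   bits 18..22  hour         (5 bits)
 *   bits 12..17  minute       (6 bits)
 *   bits  6..11  second       (6 bits)
 *   bits  0..5   unused */
static uint32_t lsiv3_ts_pack(unsigned day, unsigned month, unsigned hour,
                              unsigned min, unsigned sec)
{
    return ((uint32_t)(day   & 0x1f) << 27) |
           ((uint32_t)(month & 0x0f) << 23) |
           ((uint32_t)(hour  & 0x1f) << 18) |
           ((uint32_t)(min   & 0x3f) << 12) |
           ((uint32_t)(sec   & 0x3f) << 6);
}

static unsigned lsiv3_ts_day(uint32_t t)   { return (t >> 27) & 0x1f; }
static unsigned lsiv3_ts_month(uint32_t t) { return (t >> 23) & 0x0f; }
static unsigned lsiv3_ts_hour(uint32_t t)  { return (t >> 18) & 0x1f; }
static unsigned lsiv3_ts_min(uint32_t t)   { return (t >> 12) & 0x3f; }
static unsigned lsiv3_ts_sec(uint32_t t)   { return (t >> 6) & 0x3f; }
```

This is why the kernel routine only needs the msdosfs-style day-of-year arithmetic: once the epoch seconds are reduced to day-of-month and month, the year is simply discarded.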