Date: Sat, 19 Sep 2020 02:31:03 -0700 From: Mark Millard <marklmi@yahoo.com> To: Hans Petter Selasky <hps@selasky.org> Cc: freebsd-arm <freebsd-arm@freebsd.org> Subject: Re: Comment #135 for bugzilla 237666 : a USB3-handling problem with a investigatory fix for a cortex-a72 context Message-ID: <F9E8A6A0-619B-4C6F-8485-032B8358B780@yahoo.com> In-Reply-To: <866bd652-5a8b-7500-b1e0-7528c035048b@selasky.org> References: <6E618C3D-12DF-429E-A249-5BAB90FC6B15.ref@yahoo.com> <6E618C3D-12DF-429E-A249-5BAB90FC6B15@yahoo.com> <866bd652-5a8b-7500-b1e0-7528c035048b@selasky.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On 2020-Sep-19, at 01:08, Hans Petter Selasky <hps at selasky.org> = wrote: > On 2020-09-19 05:41, Mark Millard via freebsd-arm wrote: >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D237666 is about = USB3 >> problems that folks have been having for over a year. Bjoern A. Zeeb >> comment #125 that looked like something I'd seen on a cortex-a72 when >> I tried a kernel that had been built with -mcpu=3Dcortex-a72 : an >> indefinite looping that involved "Resetting controller" for xhci0. >> (This prevented booting from the USB3 device.) >> Well, I got the A72 to boot and comment #135 indicates how but I'll >> paste the information below. All that was involved was adding >> examples of "dsb st" and one "dsb ld". >> Comment 135's content: >> A cortex-a72 success! (In the form of investigatory code, not >> code to check-in as is.) >> Just by adding some "dsb st" commands and a "dsb ld" command in >> places in the xhci code (just for __arach64__), I've gotten the >> cortext-A72 to boot from the USB3 SSD via a -mcpu=3Dcortex-a72 >> based kernel build. No more indefinitely repeating "Resetting >> controller" problem. >> I do not claim the additions are the minimal ones. I know one place >> is required for sure: the last one added enabled the boot (given >> the others were already present). Prior to that I still had the >> indefinitely repeating "Resetting controller" problem. >> There could be more places that should have dsb st or dsb ld for >> overall correctness. >> This evidence may suggest somewhat analogous problems for other >> platforms than aarch64 that are seeing problems. >> For the aarch64 context, the evidence is (the changes are) >> as follows. The first "dsb st" textually is the last one I >> added. >> # svnlite diff /usr/src/sys/dev/usb/controller/xhci.c >> Index: /usr/src/sys/dev/usb/controller/xhci.c >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> --- /usr/src/sys/dev/usb/controller/xhci.c (revision 363590) >> +++ /usr/src/sys/dev/usb/controller/xhci.c (working copy) >> @@ -431,6 +431,9 @@ >> phwr->hwr_ring_seg[0].qwEvrsTablePtr =3D htole64(addr); >> phwr->hwr_ring_seg[0].dwEvrsTableSize =3D = htole32(XHCI_MAX_EVENTS); >> +#ifdef __aarch64__ >> +__asm __volatile("dsb st" : : : "memory"); >> +#endif >> DPRINTF("ERDP(0)=3D0x%016llx\n", (unsigned long long)addr); >> @@ -1098,6 +1101,9 @@ >> while (1) { >> +#ifdef __aarch64__ >> +__asm __volatile("dsb ld" : : : "memory"); >> +#endif >> temp =3D le32toh(phwr->hwr_events[i].dwTrb3); >> k =3D (temp & XHCI_TRB_3_CYCLE_BIT) ? 1 : 0; >> @@ -1107,6 +1113,9 @@ >> event =3D XHCI_TRB_3_TYPE_GET(temp); >> +#ifdef __aarch64__ >> +__asm __volatile("dsb ld" : : : "memory"); >> +#endif >> DPRINTFN(10, "event[%u] =3D %u (0x%016llx 0x%08lx = 0x%08lx)\n", >> i, event, (long = long)le64toh(phwr->hwr_events[i].qwTrb0), >> (long)le32toh(phwr->hwr_events[i].dwTrb2), >> @@ -1239,6 +1248,9 @@ >> sc->sc_command_idx =3D i; >> sc->sc_command_ccs =3D j; >> +#ifdef __aarch64__ >> +__asm __volatile("dsb st" : : : "memory"); >> +#endif >> XWRITE4(sc, door, XHCI_DOORBELL(0), 0); >> err =3D cv_timedwait(&sc->sc_cmd_cv, &sc->sc_bus.bus_mtx, >> @@ -2855,6 +2867,9 @@ >> index =3D xfer->xroot->udev->controller_slot_id; >> if (xfer->xroot->udev->flags.self_suspended =3D=3D 0) { >> +#ifdef __aarch64__ >> +__asm __volatile("dsb st" : : : "memory"); >> +#endif >> XWRITE4(sc, door, XHCI_DOORBELL(index), >> epno | XHCI_DB_SID_SET(xfer->stream_id)); >> } >> @@ -4180,6 +4195,9 @@ >> for (n =3D 1; n !=3D XHCI_MAX_ENDPOINTS; n++) { >> for (p =3D 0; p !=3D XHCI_MAX_STREAMS; p++) { >> +#ifdef __aarch64__ >> +__asm __volatile("dsb st" : : : "memory"); >> +#endif >> XWRITE4(sc, door, XHCI_DOORBELL(index), >> n | XHCI_DB_SID_SET(p)); >> } >> I'm clearly just "evidence hacking" here. I've no clue what a >> good design for allowing aarch64 build to supply having any of >> those "dsb st"s or the "dsb ld". >> Hopefully someone that knows what they are doing can figure out >> if any of the other examples on other platforms are analogous >> --and ,if they are, provide some hook for platform to contribute >> such code to the xhci code. >=20 > Hi Mark, >=20 > Your finding indicate a problem in usb_pc_cpu_flush() and >=20 > bus_dmamap_sync(pc->tag, pc->map, BUS_DMASYNC_PREWRITE); >=20 > Try to put the dsb only after dmamap_sync. >=20 > void > usb_pc_cpu_flush(struct usb_page_cache *pc) > { > if (pc->page_offset_end =3D=3D pc->page_offset_buf) { > /* nothing has been loaded into this page cache! */ > return; > } > bus_dmamap_sync(pc->tag, pc->map, BUS_DMASYNC_PREWRITE); > } >=20 > The PCI I/O memory should be coherent and should not need any DSB's. [I'll note that historically I've gotten one "Resetting Controller" notice from the -mcpu=3Dcortex-a53 kernel during the sequence leading to the root file system mount. I get none from the -mcpu=3Dcortex-a72 kernel after the changes that I reported.] After adding a couple of more DSB ST instructions and rebuilding (cross build), installing, and booting, I started a self-hosted, from-scratch, buildworld buildlkernel and intend to installkernel installworld and reboot if it appears to go well. (Better than just a boot test for if things seem at least sufficient, even if overkill.) But it will be hours before buildworld build kernel is done (-j3). (I'll sleep during much of that time.) I'll get to experiments you suggest after that. I assume that you want the two (not one) dsb ld instructions removed as well. I'm happy to do your experiments, despite the following note about what I expect the result will be for the first one. I do not expect your experiment to work in one place because the addition that started things working does not involve a usb_pc_cpu_flush for the event ring initialization that the specific change was for: phwr =3D buf_res.buffer; addr =3D buf_res.physaddr; addr +=3D (uintptr_t)&((struct xhci_hw_root *)0)->hwr_events[0]; =20 /* reset hardware root structure */ memset(phwr, 0, sizeof(*phwr)); =20 phwr->hwr_ring_seg[0].qwEvrsTablePtr =3D htole64(addr); phwr->hwr_ring_seg[0].dwEvrsTableSize =3D = htole32(XHCI_MAX_EVENTS); =20 DPRINTF("ERDP(0)=3D0x%016llx\n", (unsigned long long)addr); =20 #ifdef __aarch64__ __asm __volatile("dsb st" : : : "memory"); #endif XWRITE4(sc, runt, XHCI_ERDP_LO(0), (uint32_t)addr); XWRITE4(sc, runt, XHCI_ERDP_HI(0), (uint32_t)(addr >> 32)); =20 addr =3D buf_res.physaddr; =20 DPRINTF("ERSTBA(0)=3D0x%016llx\n", (unsigned long long)addr); XWRITE4(sc, runt, XHCI_ERSTBA_LO(0), (uint32_t)addr); XWRITE4(sc, runt, XHCI_ERSTBA_HI(0), (uint32_t)(addr >> 32)); My understanding is that those last two lines start the event ring and without the DSB ST above the: phwr->hwr_ring_seg[0].qwEvrsTablePtr =3D htole64(addr); phwr->hwr_ring_seg[0].dwEvrsTableSize =3D = htole32(XHCI_MAX_EVENTS); result was not observed by xhci but needs to be observed at that ring start time. Without the DSB ST above, xhci_interrupt_poll's code (with one of my additions shown): #ifdef __aarch64__ __asm __volatile("dsb ld" : : : "memory"); #endif temp =3D le32toh(phwr->hwr_events[i].dwTrb3); =20 k =3D (temp & XHCI_TRB_3_CYCLE_BIT) ? 1 : 0; =20 if (j !=3D k) break; always took the break because temp was always zero, k was always zero, and j was 1. I expect this is because the wrong qwEvrsTablePtr and dwEvrsTableSize content was in use by the xhci (fixed by the DSB ST). (I had a temporarily version that printed the values. With and without the DSB LD made no difference without the DSB ST that was later added.) =46rom what I understand, the cortex-a72 has far more structure that involves the "DSB completes when" behavior being needed compared to the cortex-a53 and code tuned to the A72 is more likely to end up needing such. (Speculation and speculative-driven cache contents, out of order execution, and probably more.) QUOTE =E2=80=A2 all explicit memory accesses that are observed by Pe = before the DSB is executed, are of the required access types, and are = from observers in the same required shareability domain as Pe, are = complete for the set of observers in the required shareability domain =E2=80=A2 all cache, branch predictor, and TLB maintenance = operations issued by Pe before the DSB are complete for the required = shareability domain. END QUOTE (Pe means "executing processor".) I expect the removal of that specific DSB ST to result in xhci_interrupt_poll's code again taking the break every time, just based on event ring behavior alone. I'll note that the RPi4B SBC has a not-physically-exposed PCIe that the USB3 is off of, and the UEFI/ACPI that I have in use for the RPi4B does not expose the PICe to the loader or OS at all. ACPI does expose the USB3 related material. =3D=3D=3D Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?F9E8A6A0-619B-4C6F-8485-032B8358B780>