Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 19 Sep 2020 05:22:43 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Hans Petter Selasky <hps@selasky.org>
Cc:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: Comment #135 for bugzilla 237666 : a USB3-handling problem with a investigatory fix for a cortex-a72 context
Message-ID:  <723E6915-94F5-417C-B4AF-EEEBFBDF6162@yahoo.com>
In-Reply-To: <578faf26-6042-91ec-c639-2bca17105fae@selasky.org>
References:  <6E618C3D-12DF-429E-A249-5BAB90FC6B15.ref@yahoo.com> <6E618C3D-12DF-429E-A249-5BAB90FC6B15@yahoo.com> <866bd652-5a8b-7500-b1e0-7528c035048b@selasky.org> <F9E8A6A0-619B-4C6F-8485-032B8358B780@yahoo.com> <578faf26-6042-91ec-c639-2bca17105fae@selasky.org>

next in thread | previous in thread | raw e-mail | index | archive | help


On 2020-Sep-19, at 04:46, Hans Petter Selasky <hps at selasky.org> =
wrote:

> On 2020-09-19 11:31, Mark Millard wrote:
>> On 2020-Sep-19, at 01:08, Hans Petter Selasky <hps at selasky.org> =
wrote:
>>> On 2020-09-19 05:41, Mark Millard via freebsd-arm wrote:
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D237666 is about =
USB3
>>>> problems that folks have been having for over a year. Bjoern A. =
Zeeb
>>>> comment #125 that looked like something I'd seen on a cortex-a72 =
when
>>>> I tried a kernel that had been built with -mcpu=3Dcortex-a72 : an
>>>> indefinite looping that involved "Resetting controller" for xhci0.
>>>> (This prevented booting from the USB3 device.)
>>>> Well, I got the A72 to boot and comment #135 indicates how but I'll
>>>> paste the information below. All that was involved was adding
>>>> examples of "dsb st" and one "dsb ld".
>>>> Comment 135's content:
>>>> A cortex-a72 success! (In the form of investigatory code, not
>>>> code to check-in as is.)
>>>> Just by adding some "dsb st" commands and a "dsb ld" command in
>>>> places in the xhci code (just for __arach64__), I've gotten the
>>>> cortext-A72 to boot from the USB3 SSD via a -mcpu=3Dcortex-a72
>>>> based kernel build. No more indefinitely repeating "Resetting
>>>> controller" problem.
>>>> I do not claim the additions are the minimal ones. I know one place
>>>> is required for sure: the last one added enabled the boot (given
>>>> the others were already present). Prior to that I still had the
>>>> indefinitely repeating "Resetting controller" problem.
>>>> There could be more places that should have dsb st or dsb ld for
>>>> overall correctness.
>>>> This evidence may suggest somewhat analogous problems for other
>>>> platforms than aarch64 that are seeing problems.
>>>> For the aarch64 context, the evidence is (the changes are)
>>>> as follows. The first "dsb st" textually is the last one I
>>>> added.
>>>> # svnlite diff /usr/src/sys/dev/usb/controller/xhci.c
>>>> Index: /usr/src/sys/dev/usb/controller/xhci.c
>>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>>>> --- /usr/src/sys/dev/usb/controller/xhci.c	(revision 363590)
>>>> +++ /usr/src/sys/dev/usb/controller/xhci.c	(working copy)
>>>> @@ -431,6 +431,9 @@
>>>>    	phwr->hwr_ring_seg[0].qwEvrsTablePtr =3D htole64(addr);
>>>>  	phwr->hwr_ring_seg[0].dwEvrsTableSize =3D =
htole32(XHCI_MAX_EVENTS);
>>>> +#ifdef __aarch64__
>>>> +__asm __volatile("dsb st" : : : "memory");
>>>> +#endif
>>>>    	DPRINTF("ERDP(0)=3D0x%016llx\n", (unsigned long =
long)addr);
>>>>  @@ -1098,6 +1101,9 @@
>>>>    	while (1) {
>>>>  +#ifdef __aarch64__
>>>> +__asm __volatile("dsb ld" : : : "memory");
>>>> +#endif
>>>>  		temp =3D le32toh(phwr->hwr_events[i].dwTrb3);
>>>>    		k =3D (temp & XHCI_TRB_3_CYCLE_BIT) ? 1 : 0;
>>>> @@ -1107,6 +1113,9 @@
>>>>    		event =3D XHCI_TRB_3_TYPE_GET(temp);
>>>>  +#ifdef __aarch64__
>>>> +__asm __volatile("dsb ld" : : : "memory");
>>>> +#endif
>>>>  		DPRINTFN(10, "event[%u] =3D %u (0x%016llx 0x%08lx =
0x%08lx)\n",
>>>>  		    i, event, (long =
long)le64toh(phwr->hwr_events[i].qwTrb0),
>>>>  		    (long)le32toh(phwr->hwr_events[i].dwTrb2),
>>>> @@ -1239,6 +1248,9 @@
>>>>  	sc->sc_command_idx =3D i;
>>>>  	sc->sc_command_ccs =3D j;
>>>>  +#ifdef __aarch64__
>>>> +__asm __volatile("dsb st" : : : "memory");
>>>> +#endif
>>>>  	XWRITE4(sc, door, XHCI_DOORBELL(0), 0);
>>>>    	err =3D cv_timedwait(&sc->sc_cmd_cv, =
&sc->sc_bus.bus_mtx,
>>>> @@ -2855,6 +2867,9 @@
>>>>  	index =3D xfer->xroot->udev->controller_slot_id;
>>>>    	if (xfer->xroot->udev->flags.self_suspended =3D=3D 0) {
>>>> +#ifdef __aarch64__
>>>> +__asm __volatile("dsb st" : : : "memory");
>>>> +#endif
>>>>  		XWRITE4(sc, door, XHCI_DOORBELL(index),
>>>>  		    epno | XHCI_DB_SID_SET(xfer->stream_id));
>>>>  	}
>>>> @@ -4180,6 +4195,9 @@
>>>>    	for (n =3D 1; n !=3D XHCI_MAX_ENDPOINTS; n++) {
>>>>  		for (p =3D 0; p !=3D XHCI_MAX_STREAMS; p++) {
>>>> +#ifdef __aarch64__
>>>> +__asm __volatile("dsb st" : : : "memory");
>>>> +#endif
>>>>  			XWRITE4(sc, door, XHCI_DOORBELL(index),
>>>>  			    n | XHCI_DB_SID_SET(p));
>>>>  		}
>>>> I'm clearly just "evidence hacking" here. I've no clue what a
>>>> good design for allowing aarch64 build to supply having any of
>>>> those "dsb st"s or the "dsb ld".
>>>> Hopefully someone that knows what they are doing can figure out
>>>> if any of the other examples on other platforms are analogous
>>>> --and ,if they are, provide some hook for platform to contribute
>>>> such code to the xhci code.
>>>=20
>>> Hi Mark,
>>>=20
>>> Your finding indicate a problem in usb_pc_cpu_flush() and
>>>=20
>>> bus_dmamap_sync(pc->tag, pc->map, BUS_DMASYNC_PREWRITE);
>>>=20
>>> Try to put the dsb only after dmamap_sync.
>>>=20
>>> void
>>> usb_pc_cpu_flush(struct usb_page_cache *pc)
>>> {
>>>        if (pc->page_offset_end =3D=3D pc->page_offset_buf) {
>>>                /* nothing has been loaded into this page cache! */
>>>                return;
>>>        }
>>>        bus_dmamap_sync(pc->tag, pc->map, BUS_DMASYNC_PREWRITE);
>>> }
>>>=20
>>> The PCI I/O memory should be coherent and should not need any DSB's.
>> [I'll note that historically I've gotten one "Resetting Controller"
>> notice from the -mcpu=3Dcortex-a53 kernel during the sequence leading
>> to the root file system mount. I get none from the -mcpu=3Dcortex-a72
>> kernel after the changes that I reported.]
>> After adding a couple of more DSB ST instructions and rebuilding
>> (cross build), installing, and booting, I started a self-hosted,
>> from-scratch, buildworld buildlkernel and intend to installkernel
>> installworld and reboot if it appears to go well. (Better than
>> just a boot test for if things seem at least sufficient, even if
>> overkill.) But it will be hours before buildworld build kernel is
>> done (-j3). (I'll sleep during much of that time.)

I woke up so I figured I'd look at this before trying to sleep
again.

>> I'll get to experiments you suggest after that. I assume that
>> you want the two (not one) dsb ld instructions removed as well.
>> I'm happy to do your experiments, despite the following note
>> about what I expect the result will be for the first one.
>> I do not expect your experiment to work in one place because
>> the addition that started things working does not involve a
>> usb_pc_cpu_flush for the event ring initialization that the
>> specific change was for:
>>         phwr =3D buf_res.buffer;
>>         addr =3D buf_res.physaddr;
>>         addr +=3D (uintptr_t)&((struct xhci_hw_root =
*)0)->hwr_events[0];
>>                  /* reset hardware root structure */
>>         memset(phwr, 0, sizeof(*phwr));
>>                  phwr->hwr_ring_seg[0].qwEvrsTablePtr =3D =
htole64(addr);
>>         phwr->hwr_ring_seg[0].dwEvrsTableSize =3D =
htole32(XHCI_MAX_EVENTS);
>>                  DPRINTF("ERDP(0)=3D0x%016llx\n", (unsigned long =
long)addr);
>>             #ifdef __aarch64__
>> __asm __volatile("dsb st" : : : "memory");
>> #endif
>=20
> Hi Mark,
>=20
> The values in ERDP_LO/HI are not used until the RUN bit is set.

I just picked between the qwEvrsTablePtr and dwEvrsTableSize writes
and the following several XWRITE4's, figuring it would be sufficient
placement if it helped at all. (It did help.)

>>         XWRITE4(sc, runt, XHCI_ERDP_LO(0), (uint32_t)addr);
>>         XWRITE4(sc, runt, XHCI_ERDP_HI(0), (uint32_t)(addr >> 32));
>>                  addr =3D buf_res.physaddr;
>>                  DPRINTF("ERSTBA(0)=3D0x%016llx\n", (unsigned long =
long)addr);
>>         XWRITE4(sc, runt, XHCI_ERSTBA_LO(0), (uint32_t)addr);
>>         XWRITE4(sc, runt, XHCI_ERSTBA_HI(0), (uint32_t)(addr >> 32));
>> My understanding is that those last two lines start the event ring
>> and without the DSB ST above the:
>=20
> No they don't.

The text associated with "The following steps describe the xHC Event =
Ring Enqueue Pointer (EREP) Advancement algorithm (left side of Figure =
4-12)" in extensible-host-controler-interface-usb-xhci.pdf says:

QUOTE (partial)
	1. When the ERST Base Address (ERSTBA) register is initially =
written the Event Ring State Machine enters the Start state.
	2. The xHC initializes its internal PCS flag to =E2=80=981=E2=80=99=
.

	3. The xHC sets its internal ERST Count to =E2=80=980=E2=80=99.

	4. The xHC then fetches the entry in the Event Ring Segment =
Table referenced by the ERST Count (ERSTE =3D ERST[ERST Count]) and =
initializes its Enqueue Pointer (EREP) with the value of the Ring =
Segment Base Address field (ERSTE.BaseAddr), and the TRB Count with the =
value of the Segment Size field (ERSTE.Size).

	5. If the USBCMD Run/Stop (R/S) flag =3D =E2=80=980=E2=80=99 the =
Event Ring State Machine shall wait for Run/Stop (R/S) to return to =
=E2=80=981=E2=80=99. When Run/Stop (R/S) flag=3D =E2=80=981=E2=80=99 the =
xHC shall proceeds to check if an event is posted (step 6., otherwise it =
proceeds immediately to step 6.
END QUOTE

(Steps 6-8 are omitted above.) May be I read too much into (4) and the =
first reference to R/S being in step (5).

>>         phwr->hwr_ring_seg[0].qwEvrsTablePtr =3D htole64(addr);
>>         phwr->hwr_ring_seg[0].dwEvrsTableSize =3D =
htole32(XHCI_MAX_EVENTS);
>> result was not observed by xhci but needs to be observed at that ring
>> start time.
>> Without the DSB ST above, xhci_interrupt_poll's code (with one
>> of my additions shown):
>> #ifdef __aarch64__
>> __asm __volatile("dsb ld" : : : "memory");
>> #endif
>=20
> Probably same bug in the USB invalidate function.
>=20
> Try to patch those generic flush/invalidate functions instead. I've =
been very careful about those.

Sure.

>>                 temp =3D le32toh(phwr->hwr_events[i].dwTrb3);
>>                          k =3D (temp & XHCI_TRB_3_CYCLE_BIT) ? 1 : 0;
>>                          if (j !=3D k)
>>                         break;
>> always took the break because temp was always zero, k was always =
zero,
>> and j was 1. I expect this is because the wrong qwEvrsTablePtr and
>> dwEvrsTableSize content was in use by the xhci (fixed by the DSB ST).
>> (I had a temporarily version that printed the values. With and =
without
>> the DSB LD made no difference without the DSB ST that was later =
added.)
>> =46rom what I understand, the cortex-a72 has far more structure that
>> involves the "DSB completes when" behavior being needed compared
>> to the cortex-a53 and code tuned to the A72 is more likely to end
>> up needing such. (Speculation and speculative-driven cache contents,
>> out of order execution, and probably more.)
>> QUOTE
>> 	=E2=80=A2 all explicit memory accesses that are observed by Pe =
before the DSB is executed, are of the required access types, and are =
from observers in the same required shareability domain as Pe, are =
complete for the set of observers in the required shareability domain
>> 	=E2=80=A2 all cache, branch predictor, and TLB maintenance =
operations issued by Pe before the DSB are complete for the required =
shareability domain.
>> END QUOTE
>=20
> It smells like a generic problem in the busdma synchronize code.
>=20
>> (Pe means "executing processor".)
>> I expect the removal of that specific DSB ST to result in
>> xhci_interrupt_poll's code again taking the break every time,
>> just based on event ring behavior alone.
>=20
> Again, try to patch both
> usb_pc_cpu_flush()
> and
> usb_pc_cpu_invalidate()

Sure.

>> I'll note that the RPi4B SBC has a not-physically-exposed PCIe
>> that the USB3 is off of, and the UEFI/ACPI that I have in use for
>> the RPi4B does not expose the PICe to the loader or OS at all.
>> ACPI does expose the USB3 related material.

FYI: The buildworld buildkernel is continuing.

=3D=3D=3D
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?723E6915-94F5-417C-B4AF-EEEBFBDF6162>