Date: Sun, 6 Dec 2020 13:30:52 -0800
From: Mark Millard <marklmi@yahoo.com>
To: mmel@freebsd.org
Cc: freebsd-arm <freebsd-arm@freebsd.org>
Subject: Re: ThunderX Panic after r368370
Message-ID: <BB5C4C3E-EDF6-4C3D-BEE1-F8B2989216E0@yahoo.com>
In-Reply-To: <91654fc4-8734-d8a7-5309-0400f418438a@freebsd.org>
References: <1C3442ED-278E-45B8-9206-0DD24FCBC237@brickporch.com> <4331eee0-74a6-565c-3bec-0051415b2bc1@freebsd.org> <56F0E9EB-0B78-4B0B-830A-48F8AFC5ABE1@yahoo.com> <91654fc4-8734-d8a7-5309-0400f418438a@freebsd.org>
On 2020-Dec-6, at 03:51, Michal Meloun <meloun.michal at gmail.com> wrote:

> On 06.12.2020 10:47, Mark Millard wrote:
>> On 2020-Dec-6, at 00:17, Michal Meloun <meloun.michal at gmail.com> wrote:
>>> On 06.12.2020 3:21, Marcel Flores wrote:
>>>> Hi All,
>>>> Looks like the ThunderX started panicking at boot after r368370:
>>>> https://reviews.freebsd.org/rS368370
>>>> From a verbose boot, it looks like it bails in gic0 redistributor setup(?):
>>>> gic0: CPU29 Re-Distributor woke up
>>>> gic0: CPU24 enabled CPU interface via system registers
>>>> gic0: CPU17 enabled CPU interface via system registers
>>>> gic0: CPU29 enabled CPU interface via system registers
>>>> done
>>>> Full Verbose boot:
>>>> https://gist.github.com/mesflores/f026122495c8494d041bce04d30b15bb
>>>> I'm not really familiar with the details of the commit, but happy to test
>>>> anything if anyone has any ideas.
>>>
>>>
>>> Hi Marcel
>>> are you able to get crashdump and do backtrace?
>>> https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html#kerneldebug-obtain
>>> and
>>> https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-gdb.html
>>> If not, I'll make some debug patch.
>>>
>>> It's weird, even though GIC is potentially affected by my patch, in this case the cpuid numbering was not changed.
>> (I've no access to a ThunderX. I just looked for my own curiosity.
>> Sorry if this is obvious and so is noise.)
>> When I looked at the code it appeared to be the last "->" in
>> the following that was dereferencing the nullptr value (via [x8]
>> in assembler notation):
>> static uint64_t
>> its_cmd_prepare(struct its_cmd *cmd, struct its_cmd_desc *desc)
>> {
>>         uint64_t target;
>>         uint8_t cmd_type;
>>         u_int size;
>>         cmd_type = desc->cmd_type;
>>         target = ITS_TARGET_NONE;
>>         switch (cmd_type) {
>>         case ITS_CMD_MOVI: /* Move interrupt ID to another collection */
>>                 target = desc->cmd_desc_movi.col->col_target;
>> . . .
>> In other words: it appeared to me that the above desc->cmd_desc_movi.col
>> evaluated as 0 when used in what was reported.
> This is very probably right analysis. But problem is that cmd_desc_movi.col should not be NULL, is initialized in its_cmd_movi from sc->sc_its_cols which should be allocated in gicv3_its_attach().
>

The following is unlikely to directly contribute to the
specific problem's solution but documents an oddity that
took my time while looking around related to the problem.

One (comment?) oddity I ran into looking around:

/usr/src/sys/sys/cpuset.h:#define CPU_FFS(p) BIT_FFS(CPU_SETSIZE, p)

but in /usr/src/sys/sys/bitset.h :

#define BIT_FFS(_s, p) BIT_FFS_AT((_s), (p), 0)

and (comment wrong about start?):

/*
 * Note that `start` and the returned value from BIT_FFS_AT are
 * 1-based bit indices.
 */
#define BIT_FFS_AT(_s, p, start) __extension__ ({ \
. . .

In other words, BIT_FFS (and CPU_FFS) provide BIT_FFS_AT with
start==0 but start is documented to be a 1-based bit index.

So, looking into what happens with start==0, showing
BIT_FFS_AT:

#define BIT_FFS_AT(_s, p, start) __extension__ ({ \
        __size_t __i; \
        long __mask; \
        int __bit; \
 \
        __mask = ~0UL << ((start) % _BITSET_BITS); \
        __bit = 0; \
        for (__i = __bitset_word((_s), (start)); \
            __i < __bitset_words((_s)); \
            __i++) { \
                if (((p)->__bits[__i] & __mask) != 0) { \
                        __bit = ffsl((p)->__bits[__i] & __mask); \
                        __bit += __i * _BITSET_BITS; \
                        break; \
                } \
                __mask = ~0UL; \
        } \
        __bit; \
})

With start==0, the behavior traces to the use of:

        __mask = ~0UL << ((start) % _BITSET_BITS); \

and to the use of:

#define __bitset_word(_s, n) \
        (__constexpr_cond(__bitset_words((_s)) == 1) ? \
         0 : ((n) / _BITSET_BITS))

So __mask==~0UL and __bitset_word((_s), (start))==0 .

Then for __i==0:

((p)->__bits[0] & __mask) != 0

evaluates like

((p)->__bits[0] & ~0UL) != 0

which in turn evaluates like

(p)->__bits[0] != 0.
From there

__bit = ffsl((p)->__bits[0] & __mask)

would involve (p)->__bits[0] & __mask evaluating like
(p)->__bits[0] & ~0UL and that in turn evaluating like just
(p)->__bits[0] . Presuming non-zero as a context, effectively
for such a context:

        __bit = ffsl((p)->__bits[0]);
        __bit += 0;

which would seem to set __bit correctly.

It looks to me like start is 0-based in BIT_FFS_AT, not
1-based. So I expect that the comment is wrong about start.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away
in early 2018-Mar)
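
The start==0 reasoning in the message above can be checked in user
space with a minimal sketch. This is not the kernel macro itself: the
names sketch_ffsl, sketch_ffs_at, sketch_selfcheck and SKETCH_BITS are
hypothetical stand-ins, the loop is written as a function over a plain
array of unsigned long, and the single-word special case of
__bitset_word is folded into the plain division (the two agree whenever
start is below the bitset size).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for _BITSET_BITS: bits per unsigned long. */
#define SKETCH_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Portable stand-in for ffsl(3): 1-based index of the lowest set bit,
 * or 0 when no bit is set.
 */
static int
sketch_ffsl(unsigned long v)
{
	int bit;

	if (v == 0)
		return (0);
	for (bit = 1; (v & 1UL) == 0; bit++)
		v >>= 1;
	return (bit);
}

/*
 * Function form of the BIT_FFS_AT loop quoted in the message.
 * Assumption: bits[] holds nwords words, as (p)->__bits does.
 */
static int
sketch_ffs_at(const unsigned long *bits, size_t nwords, size_t start)
{
	unsigned long mask;
	size_t i;
	int bit;

	mask = ~0UL << (start % SKETCH_BITS);
	bit = 0;
	for (i = start / SKETCH_BITS; i < nwords; i++) {
		if ((bits[i] & mask) != 0) {
			bit = sketch_ffsl(bits[i] & mask);
			bit += (int)(i * SKETCH_BITS);
			break;
		}
		mask = ~0UL;	/* later words are searched in full */
	}
	return (bit);
}

/* Small self-check of the cases discussed in the message. */
void
sketch_selfcheck(void)
{
	const unsigned long low[1] = { 0x1UL };		/* only bit 0 set */
	const unsigned long split[2] = { 0x0UL, 0x1UL };	/* first bit of word 1 */

	/* start==0 searches from bit 0; the result is 1-based. */
	assert(sketch_ffs_at(low, 1, 0) == 1);
	/* start==1 already skips bit 0, so nothing is found. */
	assert(sketch_ffs_at(low, 1, 1) == 0);
	/* A hit in the second word is offset by one full word. */
	assert(sketch_ffs_at(split, 2, 0) == (int)SKETCH_BITS + 1);
}
```

In this sketch, start==1 skips bit 0, which is what a 0-based start
would do, while the returned value stays 1-based with 0 meaning that
no bit was found. That is consistent with the conclusion in the
message, under the assumptions stated above.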