Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 29 Nov 2017 02:56:02 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Andrew Turner <andrew@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r326312 - head/sys/arm64/arm64
Message-ID:  <20171128234051.L1761@besplex.bde.org>
In-Reply-To: <201711281104.vASB4lTu024913@repo.freebsd.org>
References:  <201711281104.vASB4lTu024913@repo.freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 28 Nov 2017, Andrew Turner wrote:

> Log:
>  When we exit the kernel debugger having entered because of a breakpoint
>  instruction we need to jump over the instruction. Without this we will
>  execute the same instruction again and enter into the debugger again.
>
>  PR:		223917
>  Reported by:	emaste
>  MFC after:	1 week
>  Sponsored by:	DARPA, AFRL

This belongs in the debugger.  ddb has several macros for skipping back and
forth over the breakpoint instruction, and normally returns with it skipped
over.  gdb presumably does the same.  The macros seem to be correct for arm64.

> Modified: head/sys/arm64/arm64/trap.c
> ==============================================================================
> --- head/sys/arm64/arm64/trap.c	Tue Nov 28 09:34:43 2017	(r326311)
> +++ head/sys/arm64/arm64/trap.c	Tue Nov 28 11:04:47 2017	(r326312)
> @@ -323,7 +323,10 @@ do_el1h_sync(struct thread *td, struct trapframe *fram
> 			break;
> 		}
> #endif
> -		/* FALLTHROUGH */
> +		kdb_trap(exception, 0,
> +		    (td->td_frame != NULL) ? td->td_frame : frame);
> +		frame->tf_elr -= 4;

This seems to have the wrong sign as well as being fundamentally wrong.
Execution only decrements the pc on exotic arches, and arm64 seems to
be non-exotic, since the BKPT_SKIP() instruction in db_machdep.h for
skipping the pc does frame->tf_elr += BKPT_SIZE.

BKPT_SKIP also does kdb_thrctx->pcb_rip += BKPT_SIZE on amd64, i386 and mips.
This is missing on arm, arm64 and riscv and sparc64.

BKPT_SKIP is missing on powerpc.

On sparc64, BKPT_SKIP also advances a second pc by 2*BKPT_SIZE.

> +		break;
> 	case EXCP_WATCHPT_EL1:
> 	case EXCP_SOFTSTP_EL1:
> #ifdef KDB

Falling through was almost correct.  I don't see anything wrong with
BKPT_SKIP(), so the bug seems to break the usual case where ddb
skips over the breakpoint.  Then the above reverse the skip.  Perhaps
gdb is broken and advances by 8.  Then the above corrects +8 to +4.
Unusual cases are much more complicated and broken.  Sometimes the
instruction needs to be restarted or user has set the pc to an unrelated
value -- then only the debugger knows what the pc should be.

Old bugs in the fall-through case:
- non-bug: when KDB is not configured, panic is called.  This is correct,
   especially for the breakpoint case since the instruction stream might
   be corrupt in this case.
- when KDB configured but no backend is attached, kdb_trap() fails and
   panic should be called then, as on x86, for the same reasons as when KDB
   is not configured.
- it is a style bug to ifdef the call to kdb_trap() with KDB.  When KDB is
   not configured, kdb_trap() fails and correct code checks for this and
   panics, as on x86.
- non-bug: since the hardware doesn't advance the pc on arm64, the
   instruction stream is actually never corrupt when kdb_trap() fails
- kdb_trap() can also fail when ddb is attached but the console is
   transiently unavailable.  Then the instruction stream is corrupt in some
   cases, but it is wrong to turn console unavailability into a panic.  This
   is partly fixed in my version, but fixing up the instruction stream and
   returning success from ddb (with the usual complications for continuing
   over a breakpoint).  Console unavailability remains very broken as
   implemented -- it does little except try to give this panic but not give
   it in many cases.

New bugs given by not falling through:
- non-bug: it is missing the style bug of the KDB ifdef
- it is missing the panic in the case where KDB is not configured
- non-new-bug: when KDB is not configured it is missing the error check and
   panic as before.

One case that is clearly broken now but probably worked before is when
multiple CPUs hit the same breakpoint and the breakpoint is removed on
one of them that is not the last one to enter ddb.  Then the correct
behaviour is:

- enter with all pc's advanced or not over the breakpoint instruction
   (assume that it is a specialy instruction written into the instruction
   stream; this corrupts the stream and must be fixed up)
- in each ddb instance, unadvance the pc if necessary to the breakpoint
   instruction
- determine if the pc is for a breakpoint set by us.  Until the breakpoint
   is unset by one of the ddb instances, find that it was set by us.
   Continue normally in this case (put back the original instruction and
   single-step over it.  This commit seems to completely break this handling,
   even for a single ddb instance).  Then get a single-step trap and put
   back the breakpoint instruction.)
- there is not enough synchronization, so ddb loses control when it returns
   to single-step.  Other ddb instances hitting the breakpoint see the
   original instruction while racing with the previous step.  They should
   put it back over itself and do much the same.
- eventually 1 of the ddb instances removes the breakpoint.  It puts back
   the original instruction and continues without single-stepping.
- ddb instances after this enter with a breakpoint trap but no sign of
   the breakpoint in the instruction stream.  They must restart at where
   the breakpoint was.  This is very broken on x86 (except in my version).
   I think it used to work automatically on arm64, because the hardware
   doesn't advance the pc.  Now the software breaks it by advancing
   unconditionally.

Here is the ddb code for fixing up the pc after a breakpoint, with my fixes
applied:

X bool
X db_stop_at_pc(int type, int code, bool *is_breakpoint, bool *is_watchpoint)
X {
X 	db_addr_t	pc;
X 	db_breakpoint_t bkpt;
X 
X 	*is_breakpoint = IS_BREAKPOINT_TRAP(type, code);
X 	*is_watchpoint = IS_WATCHPOINT_TRAP(type, code);
X 	pc = PC_REGS();
X 	if (db_pc_is_singlestep(pc))
X 		*is_breakpoint = false;

ddb watchpoints are mostly broken and cause confusion.  x86 doesn't
have any, but maps x86 hardware breakpoints to ddb watchpoints.  I
think the latter are supposed to be in software and can be implemented
using software interrupts, but x86 doesn't bother.  It is unclear what
they can do that breakpoint software interrupts can't do -- they can
at least give a different class of breakpoints encoded by the software
interrupt number.  arm64 has a watchpoint exception number
EXCP_WATCHPOINT_EL1.  This maps to is_watchpoint = TRUE in the above.

X 
X 	db_clear_single_step();
X 	db_clear_breakpoints();
X 	db_clear_watchpoints();
X 
X #ifdef	FIXUP_PC_AFTER_BREAK
X 	if (*is_breakpoint) {
X 	    /*
X 	     * Breakpoint trap.  Fix up the PC if the
X 	     * machine requires it.
X 	     */
X 	    FIXUP_PC_AFTER_BREAK
X 	    pc = PC_REGS();
X 	}
X #endif

FIXUP_PC_AFTER_BREAK must be defined by arches where the hardware doesn't
advance the pc over breakpoint instructions in the instruction stream.
On x86, it unadvances the pc by 1.  On arm64, it is not defined, since the
pc is already unadvanced.

X 
X 	/*
X 	 * Now check for a breakpoint at this address.
X 	 */
X 	bkpt = db_find_breakpoint_here(pc);
X 	if (bkpt) {
X 	    if (--bkpt->count == 0) {
X 		bkpt->count = bkpt->init_count;
X 		*is_breakpoint = true;
X 		return (true);	/* stop here */
X 	    }
X 	    return (false);	/* continue the countdown */
X 	} else if (*is_breakpoint) {
X #ifdef BKPT_SKIP

We unadvanced the pc to look it up in the breakpoint table.  Now we
must advance it over the breakpoint instruction to restart from there
since it was not in our table.  It should be part of the normal instruction
stream.  If another debugger overwrote the instruction stream and gave
control to us, then that is a bug in the other debugger and we will
probably crash soon.

X #ifdef SMP
X 	    /*
X 	     * In the SMP case, 'bkpt' doesn't tell us if this breakpoint
X 	     * was controlled by us when we hit it, since the breakpoint
X 	     * may have been cleared during a ddb entry by another CPU
X 	     * which won the race to enter.  Only skip if the breakpoint
X 	     * is still in the instruction stream.  Otherwise, turn the
X 	     * breakpoint into an unknown trap except for ignoring it if
X 	     * we are continuing.
X 	     */
X 	    BKPT_INST_TYPE memval, testval;
X 
X 	    memval = db_get_value(pc, BKPT_SIZE, FALSE);
X 	    testval = BKPT_SET(&testval);
X 	    if (bcmp(&memval, &testval, BKPT_SIZE) == 0)
X 		BKPT_SKIP;
X 	    else {
X 		*is_breakpoint = false;
X 		if (db_run_mode == STEP_CONTINUE)
X 		    return (false);	/* continue */
X 	    }

For the SMP case, the other debugger can be ddb, and this is my fix for it.

X #else
X 	    BKPT_SKIP;

The !SMP case is simpler.  BKPT_SKIP usually just advances the pc by
BKPT_SIZE, but does a contitional advance related to the current problem
on mips.

X #endif
X #endif
X 	}

We already returned with the unadvanced pc if we are contining over a
breakpoint.  Advancing in the caller breaks this.  Unadvancing breaks it
more.

X 
X 	*is_breakpoint = false;	/* might be a breakpoint, but not ours */

Here we continue with the advanced pc.  Again it seems to be correct on
arm64, and advancing or unadvancing in the caller breaks it.

Here are my fixes in the above code:

X Index: db_run.c
X ===================================================================
X --- db_run.c	(revision 325403)
X +++ db_run.c	(working copy)
X @@ -36,6 +36,7 @@
X  __FBSDID("$FreeBSD$");
X 
X  #include <sys/param.h>
X +#include <sys/systm.h>
X  #include <sys/kdb.h>
X  #include <sys/proc.h>
X 
X @@ -129,8 +130,31 @@
X  	    return (false);	/* continue the countdown */
X  	} else if (*is_breakpoint) {
X  #ifdef BKPT_SKIP
X +#ifdef SMP
X +	    /*
X +	     * In the SMP case, 'bkpt' doesn't tell us if this breakpoint
X +	     * was controlled by us when we hit it, since the breakpoint
X +	     * may have been cleared during a ddb entry by another CPU
X +	     * which won the race to enter.  Only skip if the breakpoint
X +	     * is still in the instruction stream.  Otherwise, turn the
X +	     * breakpoint into an unknown trap except for ignoring it if
X +	     * we are continuing.
X +	     */
X +	    BKPT_INST_TYPE memval, testval;
X +
X +	    memval = db_get_value(pc, BKPT_SIZE, FALSE);
X +	    testval = BKPT_SET(&testval);

BKPT_SET() is a misimplemented macro that takes an arg which is always ignored;
it is always implemented as a manifest constant (ddb macros mostly have the
opposite error of looking like manifest constants but being function-like
macros with no args but side effects).  It is apparently meant for setting
a breakpoint instruction directly in the instruction stream, but that is
too hard so it is always used as above by setting a local variable and then
copying that to memory using a special ddb access function.

X +	    if (bcmp(&memval, &testval, BKPT_SIZE) == 0)
X  		BKPT_SKIP;

kib didn't like the MD'ness of this, but it is no worse than hard-coding
the setting of the instruction using the accesss function because the
BKPT_SET() macro is useless for applying MD settings to the instruction
stream (as used, it might apply them only to a local variable)).

X +	    else {
X +		*is_breakpoint = false;
X +		if (db_run_mode == STEP_CONTINUE)
X +		    return (false);	/* continue */
X +	    }
X +#else
X +	    BKPT_SKIP;
X  #endif
X +#endif
X  	}
X 
X  	*is_breakpoint = false;	/* might be a breakpoint, but not ours */
X

Behaviour of various arches for advancing and unadvancing the pc after
a breakpoint:
- and64, i386: +-1 over 1-byte instruction.  1 is spelled as itself instead
   of as BKPT_SIZE in the implementation.  Most implentations have bogus
   parentheses around literal constants.
- arm: +4 and no unadvance over 4-byte instruction.  4 is spelled BKPT_SIZE
   and further obfuscated by spelling BKPT_SIZE as INSN_SIZE in the
   implementation.  INSN_SIZE is defined in another header, so db_machdep.h
   requires more namespace pollution than it should.  BKPT_INSTRUCTION is
   similarly obfuscated by spelling it as KERNEL_BREAKPOINT which is defined
   in yet another header so even more pollution is required.
- arm64: like arm except without the pollution.  Minor obfuscation of spelling
   4 as BKPT_SIZE in the implementation.  The change should use the macro
   if it is correct, but spells BKPT_SIZE as -4.
- mips: +4 and no unadvance over 4-byte instruction.  4 is spelled BKPT_SIZE
   BKPT_SIZE but BKPT_SIZE is not obfuscated as on arm.  BREAKPOINT_INSTRUCTION
   doesn't exit and shouldn't be exported as on arm*, but its literal value
   is more interesting.  mips apparently has many software interrupts which
   give the same debugger exception and the instruction has to be decoded
   to decide which one.  BKPT_SE() returns one for ddb, MIPS_BREAK_DDB, and
   pollution is implemented by not defining this in ddb's header file.
   BKPT_SKIP() is complicted to read the instruction stream much like my
   patch does.  Although MIPS_BREAK_DDB is a single instruction, BKPT_SKIP
   ignores some bits in the comparision so as to skip for this and some
   other instructions.  It makes some sense to use different software
   interrupts for different debuggers, but I don't see how this can fix
   the problem of debuggers possibly corrupting each others' instruction
   streams with breapoint instructions.  It can help detect the problem.
- powerpc: like arm64
- riscv: like arm
- sparc64: like arm64, except it spells BKPT_SIZE as 4 in the implementation
   and has a secondary pc which it advances.

amd64 and i386 are the only arches where the hardware advances over the
breakpoint so that the unadvance is needed.

kdb seems to be most broken for powerpc.  powerpc has no advance and no
unadvance.  It doesn't even call kdb_reenter().

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20171128234051.L1761>