Date:      Fri, 3 Jun 2005 19:57:40 -0700 (PDT)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        John-Mark Gurney <gurney_j@resnet.uoregon.edu>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Possible instruction pipelining problem between HT's on the same die ?
Message-ID:  <200506040257.j542veCm063487@apollo.backplane.com>
References:  <200506032057.j53KvOFw062012@apollo.backplane.com> <20050604021812.GG594@funkthat.com>

:have you put a SFENCE between write A and write B?  You never tell us
:where you've tried to put the various fence instructions...
:
:-- 
:  John-Mark Gurney				Voice: +1 415 225 5579

    No, I haven't tried doing that because both the AMD and Intel manuals
    make it very clear that writes are ordered. 

    This is the code on the read side:

    wi = ip->ip_windex;					<<<<<<< READ B (INDEX)
    while ((ri = ip->ip_rindex) != wi) {
        ip->ip_rindex = ri + 1;
        ri &= MAXCPUFIFO_MASK;
        ip->ip_func[ri](ip->ip_arg[ri], frame);
        ip->ip_xindex = ip->ip_rindex;
    }

    ip_func is lwkt_putport_remote, which is basically:

static
void
lwkt_putport_remote(lwkt_msg_t msg)
{
    lwkt_port_t port = msg->ms_target_port;	<<<<<< READ A 
    thread_t td = port->mp_td;

    TAILQ_INSERT_TAIL(&port->mp_msgq, msg, ms_node);
			[ CRASH ON BAD 'PORT' VARIABLE ]
    if (port->mp_flags & MSGPORTF_WAITING)
	lwkt_schedule(td);
}

    When the crash occurs, the data load of A is bad data... the contents of
    that field in the msg structure from BEFORE the other cpu had written it
    rather than after.  It is looking at the correct message structure: the
    pointer happened to be in a register when it crashed and it matches the
    message structure that was transmitted.  The contents of the field in the
    message structure post-crash were *CORRECT*.

    There are about 16 instructions between the READ B where the code sees
    the updated index and the READ A where the code reads the bad data.

				------------

    On the sending side we have this:

int
lwkt_default_putport(lwkt_port_t port, lwkt_msg_t msg)
{
    crit_enter();
    msg->ms_flags |= MSGF_QUEUED;       /* abort interlock */
    msg->ms_flags &= ~MSGF_DONE;
    msg->ms_target_port = port;		<<<<<<<<<<< WRITE A
    _lwkt_putport(port, msg, 0);
    crit_exit();
    return(EASYNC);
}

[ the inline that lwkt_default_putport calls actually comes before it in
  the source; it is shown after so the code flow is more obvious ]

static
__inline
void
_lwkt_putport(lwkt_port_t port, lwkt_msg_t msg, int force)
{
    thread_t td = port->mp_td;

    if (force || td->td_gd == mycpu) {
        TAILQ_INSERT_TAIL(&port->mp_msgq, msg, ms_node);
        if (port->mp_flags & MSGPORTF_WAITING)
            lwkt_schedule(td);
    } else {
        lwkt_send_ipiq(td->td_gd, (ipifunc_t)lwkt_putport_remote, msg);
    }
}

lwkt_send_ipiq( ... )
{
	... [ about 7-8 lines of executed C code ] ...
    /*
     * Queue the new message
     */
    windex = ip->ip_windex & MAXCPUFIFO_MASK;
    ip->ip_func[windex] = (ipifunc2_t)func;
    ip->ip_arg[windex] = arg;
    ++ip->ip_windex;			<<<<<<<<<<< WRITE B (INDEX)
    --gd->gd_intr_nesting_level;
	...
}


    That is roughly ~30 instructions between the writing of A and the writing
    of B.  It seems very unlikely that the writes got misordered on the
    sending side.

    But on the receiving side there are only ~16 instructions between READ B
    and READ A.  That seemed very unlikely to me too, but I have not been
    able to come to any other conclusion.  During testing, when we added
    'too much' debug code to the READ side the problem went away.  When we
    added debug code to the WRITE side the problem stayed put.

    The original crash was reported on a system with 4 processor boards
    (8 logical cpus).  The user pulled 3 boards out, leaving one processor
    board and 2 logical cpus, and the problem still occurred.

    It had seemed so unlikely that this could occur across physical cpus
    that I was not surprised at all that the problem survived the board pull.
    But 16 instructions still seemed unlikely to me.  The only scenario I can
    come up with is that the READ SIDE on the HT cpu (logical cpu #1) did a
    speculative read of B before logical cpu #0 wrote to it, then somehow
    held that speculative read for 16 whole instructions on logical cpu #1.

    Is that even possible?  Holding speculative read data across
    16 instructions?

    The only other possibility is that there are major interactions in the
    instruction pipeline and cpu #1 is reading e.g. the index B from the
    pipeline or write buffer and data A from memory prior to data A being
    retired to memory by cpu #0.  That seems ridiculous to me, but I 
    wonder if it's possible without an SFENCE.

    This crash occurs fairly rarely.  It takes a lot of packets for it to
    occur... perhaps a million or more.

    In any case, we are now testing a kernel with a locked bus cycle in
    between READ B and READ A to see if that fixes the problem.  If that
    doesn't work I will put an SFENCE between WRITE A and WRITE B.
    And if that doesn't work then I'm barking up the wrong tree and it
    isn't an instruction/memory ordering issue.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
