Date:      Sat, 4 Jun 2005 09:17:57 +0100
From:      Keir Fraser <Keir.Fraser@cl.cam.ac.uk>
To:        Kip Macy <kmacy@netapp.com>, dillon@apollo.backplane.com
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Possible instruction pipelining problem between HT's on the same die ? (fwd)
Message-ID:  <b309e8bd715874330d9dc2a50fc30742@cl.cam.ac.uk>
In-Reply-To: <Pine.LNX.4.44.0506031402050.27835-100000@octopus.eng.netapp.com>
References:  <Pine.LNX.4.44.0506031402050.27835-100000@octopus.eng.netapp.com>

Hi,

I did a fair amount of lock-free programming during my PhD and for Xen, 
so I may be able to shed some light on this situation. OTOH I may also 
be confused: the x86 memory model is poorly specified and the reference 
manuals are often badly written and misleading. I'll address the points 
and questions out of order....

>     But I'm beginning to think that it isn't working as advertised.  
> I've
>     read the manuals over and over again and they seem to only 
> guarentee
>     write ordering between physical cpus, not between logical HT cpus, 
> and
>     even then it appears that a cpu can do a speculative read and
>     thus get an old value for A even after getting a new value for B.

The ordering guarantees between HTs are identical to those between 
physical cpus. I'm referring to Section 7.6.19 of IARM (Intel IA-32 
Reference Manual) Vol 3. It's slightly confusing that it says "can 
further be defined as 'write-ordered with store buffer forwarding'" but 
this forwarding only occurs separately *within* each logical cpu (the 
store buffer is statically partitioned between the two HTs), and this 
phrase is identical to the one describing physical cpu behaviour in 
Section 7.2.2 (ie. it is redundant to reiterate it in this later 
section).

Reads can be speculatively executed out-of-order, but this property 
isn't unique to HTs. This race could in theory happen across physical 
cpus.

>     Now I was depending on the presumed write ordering, so if a foreign
>     cpu sees that B is updated it can assume that A has also been 
> updated.

You *can* depend on write ordering. But this ordering is no help if 
CPU#1 has already executed, and is retiring, the read from A by the 
time it executes the read from B. It's CPU#1 that is screwing up, not 
CPU#0.
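
For what it's worth, CPU#0 doesn't even need a fence instruction here: x86 
commits its two stores in program order anyway. As a minimal sketch of the 
write side in C with GCC inline asm (names are made up, not taken from your 
code), all that is worth adding is a compiler barrier so that gcc itself 
doesn't reorder the stores:

    /* Illustrative producer running on CPU#0.  A carries the data,
       B is the "published" flag. */
    volatile int A, B;

    void publish(int value)
    {
        A = value;
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier only: the
                                                   CPU already commits the two
                                                   stores in program order */
        B = 1;
    }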

>     I looked at the various SFENCE/LFENCE/MFENCE instructions and they
>     do not seem to guarentee ordering for speculative accesses at all.
>     They all say that they do not protect against speculative reads.
>     Bus-locked instructions don't seem to avoid speculative reads 
> either.

I think the reference manual is being almost wilfully misleading by 
referring to the speculative prefetch mechanism and its total 
independence from the fence instructions: "data could be speculatively 
loaded into the cache just before, during, or after the execution of an 
MFENCE instruction". It is important to realise that speculative 
execution of a memory-reading instruction is quite different from 
speculative prefetch into a cache. The latter should not matter to the 
programmer: the cache coherency protocol hides it. Consider the code 
example in the original email:

>     cpu #0	write A
> 		write B
>
>     (HT)cpu #1	read B
> 		if (B)
> 		    read A	<----  gets OLD data in A, not new data

If CPU#1 prefetches A into its cache before it reads B, it may indeed 
see the old value of A; *but* when CPU#0 writes A it will invalidate 
that cacheline in all remote caches; *furthermore* CPU#0 cannot commit 
its update of B until after it has committed its update of A (x86 
guarantees write order). So, if CPU#1 reads the new value of B, then 
any stale value of A in its cache has been invalidated by that point. 
All you need to ensure is that CPU#1 hasn't speculatively executed the 
read from A: precisely the purpose of MFENCE and LFENCE.
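
Concretely, on the read side that means something like the following (again 
just a sketch with the same made-up names, using GCC inline asm): the lfence 
sits between the two loads, so the load of A cannot execute until the load of 
B has completed.

    /* Illustrative consumer running on CPU#1, reading the variables
       published by the sketch above. */
    extern volatile int A, B;

    int consume(void)
    {
        if (B) {
            __asm__ __volatile__("lfence" ::: "memory");  /* the load of A
                                                             cannot execute
                                                             until the load of
                                                             B has completed */
            return A;   /* sees the value written before B was set */
        }
        return -1;      /* nothing published yet */
    }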

This is more complicated if both CPUs are sharing their memory 
hierarchy. However, either cache lines are tagged with an HT identifier 
and so the cache logically operates as two separate variable-sized 
caches (in which case normal cache coherency rules apply as described 
above), or there is true cacheline sharing (in which case there is no 
stale data to worry about, as CPU#0 will directly update the cache data 
that CPU#1 will read from). Either way, there's no weakening of the 
memory model.

  -- Keir



