1				 ============================
2				 LINUX KERNEL MEMORY BARRIERS
3				 ============================
4	
5	By: David Howells <dhowells@redhat.com>
6	    Paul E. McKenney <paulmck@linux.vnet.ibm.com>
7	    Will Deacon <will.deacon@arm.com>
8	    Peter Zijlstra <peterz@infradead.org>
9	
10	==========
11	DISCLAIMER
12	==========
13	
14	This document is not a specification; it is intentionally (for the sake of
15	brevity) and unintentionally (due to being human) incomplete. This document is
16	meant as a guide to using the various memory barriers provided by Linux, but
17	in case of any doubt (and there are many) please ask.
18	
19	To repeat, this document is not a specification of what Linux expects from
20	hardware.
21	
22	The purpose of this document is twofold:
23	
24	 (1) to specify the minimum functionality that one can rely on for any
25	     particular barrier, and
26	
27	 (2) to provide a guide as to how to use the barriers that are available.
28	
29	Note that an architecture can provide more than the minimum requirement
30	for any particular barrier, but if the architecture provides less than
31	that, that architecture is incorrect.
32	
33	Note also that it is possible that a barrier may be a no-op for an
34	architecture because the way that arch works renders an explicit barrier
35	unnecessary in that case.
36	
37	
38	========
39	CONTENTS
40	========
41	
42	 (*) Abstract memory access model.
43	
44	     - Device operations.
45	     - Guarantees.
46	
47	 (*) What are memory barriers?
48	
49	     - Varieties of memory barrier.
50	     - What may not be assumed about memory barriers?
51	     - Data dependency barriers.
52	     - Control dependencies.
53	     - SMP barrier pairing.
54	     - Examples of memory barrier sequences.
55	     - Read memory barriers vs load speculation.
56	     - Transitivity.
57	
58	 (*) Explicit kernel barriers.
59	
60	     - Compiler barrier.
61	     - CPU memory barriers.
62	     - MMIO write barrier.
63	
64	 (*) Implicit kernel memory barriers.
65	
66	     - Lock acquisition functions.
67	     - Interrupt disabling functions.
68	     - Sleep and wake-up functions.
69	     - Miscellaneous functions.
70	
71	 (*) Inter-CPU acquiring barrier effects.
72	
73	     - Acquires vs memory accesses.
74	     - Acquires vs I/O accesses.
75	
76	 (*) Where are memory barriers needed?
77	
78	     - Interprocessor interaction.
79	     - Atomic operations.
80	     - Accessing devices.
81	     - Interrupts.
82	
83	 (*) Kernel I/O barrier effects.
84	
85	 (*) Assumed minimum execution ordering model.
86	
87	 (*) The effects of the CPU cache.
88	
89	     - Cache coherency.
90	     - Cache coherency vs DMA.
91	     - Cache coherency vs MMIO.
92	
93	 (*) The things CPUs get up to.
94	
95	     - And then there's the Alpha.
96	     - Virtual Machine Guests.
97	
98	 (*) Example uses.
99	
100	     - Circular buffers.
101	
102	 (*) References.
103	
104	
105	============================
106	ABSTRACT MEMORY ACCESS MODEL
107	============================
108	
109	Consider the following abstract model of the system:
110	
111			            :                :
112			            :                :
113			            :                :
114			+-------+   :   +--------+   :   +-------+
115			|       |   :   |        |   :   |       |
116			|       |   :   |        |   :   |       |
117			| CPU 1 |<----->| Memory |<----->| CPU 2 |
118			|       |   :   |        |   :   |       |
119			|       |   :   |        |   :   |       |
120			+-------+   :   +--------+   :   +-------+
121			    ^       :       ^        :       ^
122			    |       :       |        :       |
123			    |       :       |        :       |
124			    |       :       v        :       |
125			    |       :   +--------+   :       |
126			    |       :   |        |   :       |
127			    |       :   |        |   :       |
128			    +---------->| Device |<----------+
129			            :   |        |   :
130			            :   |        |   :
131			            :   +--------+   :
132			            :                :
133	
134	Each CPU executes a program that generates memory access operations.  In the
135	abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
136	perform the memory operations in any order it likes, provided program causality
137	appears to be maintained.  Similarly, the compiler may also arrange the
138	instructions it emits in any order it likes, provided it doesn't affect the
139	apparent operation of the program.
140	
141	So in the above diagram, the effects of the memory operations performed by a
142	CPU are perceived by the rest of the system as the operations cross the
143	interface between the CPU and rest of the system (the dotted lines).
144	
145	
146	For example, consider the following sequence of events:
147	
148		CPU 1		CPU 2
149		===============	===============
150		{ A == 1; B == 2 }
151		A = 3;		x = B;
152		B = 4;		y = A;
153	
154	The set of accesses as seen by the memory system in the middle can be arranged
155	in 24 different combinations:
156	
157		STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
158		STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
159		STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
160		STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
161		STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
162		STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
163		STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
164		STORE B=4, ...
165		...
166	
167	and can thus result in four different combinations of values:
168	
169		x == 2, y == 1
170		x == 2, y == 3
171		x == 4, y == 1
172		x == 4, y == 3
173	
174	
175	Furthermore, the stores committed by a CPU to the memory system may not be
176	perceived by the loads made by another CPU in the same order as the stores were
177	committed.
178	
179	
180	As a further example, consider this sequence of events:
181	
182		CPU 1		CPU 2
183		===============	===============
184		{ A == 1, B == 2, C == 3, P == &A, Q == &C }
185		B = 4;		Q = P;
186		P = &B		D = *Q;
187	
188	There is an obvious data dependency here, as the value loaded into D depends on
189	the address retrieved from P by CPU 2.  At the end of the sequence, any of the
190	following results are possible:
191	
192		(Q == &A) and (D == 1)
193		(Q == &B) and (D == 2)
194		(Q == &B) and (D == 4)
195	
196	Note that CPU 2 will never try and load C into D because the CPU will load P
197	into Q before issuing the load of *Q.
198	
199	
200	DEVICE OPERATIONS
201	-----------------
202	
203	Some devices present their control interfaces as collections of memory
204	locations, but the order in which the control registers are accessed is very
205	important.  For instance, imagine an ethernet card with a set of internal
206	registers that are accessed through an address port register (A) and a data
207	port register (D).  To read internal register 5, the following code might then
208	be used:
209	
210		*A = 5;
211		x = *D;
212	
213	but this might show up as either of the following two sequences:
214	
215		STORE *A = 5, x = LOAD *D
216		x = LOAD *D, STORE *A = 5
217	
218	the second of which will almost certainly result in a malfunction, since it sets
219	the address _after_ attempting to read the register.
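
For illustration only, such an access might be written with the kernel's MMIO
accessors as in the sketch below; the register layout and function name are
assumptions made for this example, and the ordering properties of the readX()
and writeX() accessors themselves are discussed in the "Kernel I/O barrier
effects" section:

	/* Sketch: 'A' is the address port and 'D' the data port, as above. */
	static u32 example_read_internal_reg5(void __iomem *A, void __iomem *D)
	{
		writel(5, A);		/* select internal register 5 */
		return readl(D);	/* then read it through the data port */
	}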
220	
221	
222	GUARANTEES
223	----------
224	
225	There are some minimal guarantees that may be expected of a CPU:
226	
227	 (*) On any given CPU, dependent memory accesses will be issued in order, with
228	     respect to itself.  This means that for:
229	
230		Q = READ_ONCE(P); smp_read_barrier_depends(); D = READ_ONCE(*Q);
231	
232	     the CPU will issue the following memory operations:
233	
234		Q = LOAD P, D = LOAD *Q
235	
236	     and always in that order.  On most systems, smp_read_barrier_depends()
237	     does nothing, but it is required for DEC Alpha.  The READ_ONCE()
238	     is required to prevent compiler mischief.  Please note that you
239	     should normally use something like rcu_dereference() instead of
240	     open-coding smp_read_barrier_depends().
241	
242	 (*) Overlapping loads and stores within a particular CPU will appear to be
243	     ordered within that CPU.  This means that for:
244	
245		a = READ_ONCE(*X); WRITE_ONCE(*X, b);
246	
247	     the CPU will only issue the following sequence of memory operations:
248	
249		a = LOAD *X, STORE *X = b
250	
251	     And for:
252	
253		WRITE_ONCE(*X, c); d = READ_ONCE(*X);
254	
255	     the CPU will only issue:
256	
257		STORE *X = c, d = LOAD *X
258	
259	     (Loads and stores overlap if they are targeted at overlapping pieces of
260	     memory).
261	
262	And there are a number of things that _must_ or _must_not_ be assumed:
263	
264	 (*) It _must_not_ be assumed that the compiler will do what you want
265	     with memory references that are not protected by READ_ONCE() and
266	     WRITE_ONCE().  Without them, the compiler is within its rights to
267	     do all sorts of "creative" transformations, which are covered in
268	     the COMPILER BARRIER section.
269	
270	 (*) It _must_not_ be assumed that independent loads and stores will be issued
271	     in the order given.  This means that for:
272	
273		X = *A; Y = *B; *D = Z;
274	
275	     we may get any of the following sequences:
276	
277		X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
278		X = LOAD *A,  STORE *D = Z, Y = LOAD *B
279		Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
280		Y = LOAD *B,  STORE *D = Z, X = LOAD *A
281		STORE *D = Z, X = LOAD *A,  Y = LOAD *B
282		STORE *D = Z, Y = LOAD *B,  X = LOAD *A
283	
284	 (*) It _must_ be assumed that overlapping memory accesses may be merged or
285	     discarded.  This means that for:
286	
287		X = *A; Y = *(A + 4);
288	
289	     we may get any one of the following sequences:
290	
291		X = LOAD *A; Y = LOAD *(A + 4);
292		Y = LOAD *(A + 4); X = LOAD *A;
293		{X, Y} = LOAD {*A, *(A + 4) };
294	
295	     And for:
296	
297		*A = X; *(A + 4) = Y;
298	
299	     we may get any of:
300	
301		STORE *A = X; STORE *(A + 4) = Y;
302		STORE *(A + 4) = Y; STORE *A = X;
303		STORE {*A, *(A + 4) } = {X, Y};
304	
305	And there are anti-guarantees:
306	
307	 (*) These guarantees do not apply to bitfields, because compilers often
308	     generate code to modify these using non-atomic read-modify-write
309	     sequences.  Do not attempt to use bitfields to synchronize parallel
310	     algorithms.
311	
312	 (*) Even in cases where bitfields are protected by locks, all fields
313	     in a given bitfield must be protected by one lock.  If two fields
314	     in a given bitfield are protected by different locks, the compiler's
315	     non-atomic read-modify-write sequences can cause an update to one
316	     field to corrupt the value of an adjacent field.  (A sketch of this hazard follows this list.)
317	
318	 (*) These guarantees apply only to properly aligned and sized scalar
319	     variables.  "Properly sized" currently means variables that are
320	     the same size as "char", "short", "int" and "long".  "Properly
321	     aligned" means the natural alignment, thus no constraints for
322	     "char", two-byte alignment for "short", four-byte alignment for
323	     "int", and either four-byte or eight-byte alignment for "long",
324	     on 32-bit and 64-bit systems, respectively.  Note that these
325	     guarantees were introduced into the C11 standard, so beware when
326	     using older pre-C11 compilers (for example, gcc 4.6).  The portion
327	     of the standard containing this guarantee is Section 3.14, which
328	     defines "memory location" as follows:
329	
330	     	memory location
331			either an object of scalar type, or a maximal sequence
332			of adjacent bit-fields all having nonzero width
333	
334			NOTE 1: Two threads of execution can update and access
335			separate memory locations without interfering with
336			each other.
337	
338			NOTE 2: A bit-field and an adjacent non-bit-field member
339			are in separate memory locations. The same applies
340			to two bit-fields, if one is declared inside a nested
341			structure declaration and the other is not, or if the two
342			are separated by a zero-length bit-field declaration,
343			or if they are separated by a non-bit-field member
344			declaration. It is not safe to concurrently update two
345			bit-fields in the same structure if all members declared
346			between them are also bit-fields, no matter what the
347			sizes of those intervening bit-fields happen to be.
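
To illustrate the bitfield anti-guarantees above, consider this sketch (the
structure, lock and field names are invented for the example):

	struct foo {
		int a : 4;		/* protected by lock_a */
		int b : 4;		/* protected by lock_b */
	} f;

	/* CPU 1, holding lock_a */	/* CPU 2, holding lock_b */
	f.a = 1;			f.b = 1;

Because 'a' and 'b' share a single memory location, the compiler may compile
each assignment as a load of the whole location, a modification and a store
of the whole location, so one CPU's store can overwrite the other CPU's
update even though different locks are held.  Making 'a' and 'b' ordinary
ints, or separating them with a non-bit-field member, gives them distinct
memory locations and avoids the problem.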
348	
349	
350	=========================
351	WHAT ARE MEMORY BARRIERS?
352	=========================
353	
354	As can be seen above, independent memory operations are effectively performed
355	in random order, but this can be a problem for CPU-CPU interaction and for I/O.
356	What is required is some way of intervening to instruct the compiler and the
357	CPU to restrict the order.
358	
359	Memory barriers are such interventions.  They impose a perceived partial
360	ordering over the memory operations on either side of the barrier.
361	
362	Such enforcement is important because the CPUs and other devices in a system
363	can use a variety of tricks to improve performance, including reordering,
364	deferral and combination of memory operations; speculative loads; speculative
365	branch prediction and various types of caching.  Memory barriers are used to
366	override or suppress these tricks, allowing the code to sanely control the
367	interaction of multiple CPUs and/or devices.
368	
369	
370	VARIETIES OF MEMORY BARRIER
371	---------------------------
372	
373	Memory barriers come in four basic varieties:
374	
375	 (1) Write (or store) memory barriers.
376	
377	     A write memory barrier gives a guarantee that all the STORE operations
378	     specified before the barrier will appear to happen before all the STORE
379	     operations specified after the barrier with respect to the other
380	     components of the system.
381	
382	     A write barrier is a partial ordering on stores only; it is not required
383	     to have any effect on loads.
384	
385	     A CPU can be viewed as committing a sequence of store operations to the
386	     memory system as time progresses.  All stores before a write barrier will
387	     occur in the sequence _before_ all the stores after the write barrier.
388	
389	     [!] Note that write barriers should normally be paired with read or data
390	     dependency barriers; see the "SMP barrier pairing" subsection.
391	
392	
393	 (2) Data dependency barriers.
394	
395	     A data dependency barrier is a weaker form of read barrier.  In the case
396	     where two loads are performed such that the second depends on the result
397	     of the first (eg: the first load retrieves the address to which the second
398	     load will be directed), a data dependency barrier would be required to
399	     make sure that the target of the second load is updated before the address
400	     obtained by the first load is accessed.
401	
402	     A data dependency barrier is a partial ordering on interdependent loads
403	     only; it is not required to have any effect on stores, independent loads
404	     or overlapping loads.
405	
406	     As mentioned in (1), the other CPUs in the system can be viewed as
407	     committing sequences of stores to the memory system that the CPU being
408	     considered can then perceive.  A data dependency barrier issued by the CPU
409	     under consideration guarantees that for any load preceding it, if that
410	     load touches one of a sequence of stores from another CPU, then by the
411	     time the barrier completes, the effects of all the stores prior to that
412	     touched by the load will be perceptible to any loads issued after the data
413	     dependency barrier.
414	
415	     See the "Examples of memory barrier sequences" subsection for diagrams
416	     showing the ordering constraints.
417	
418	     [!] Note that the first load really has to have a _data_ dependency and
419	     not a control dependency.  If the address for the second load is dependent
420	     on the first load, but the dependency is through a conditional rather than
421	     actually loading the address itself, then it's a _control_ dependency and
422	     a full read barrier or better is required.  See the "Control dependencies"
423	     subsection for more information.
424	
425	     [!] Note that data dependency barriers should normally be paired with
426	     write barriers; see the "SMP barrier pairing" subsection.
427	
428	
429	 (3) Read (or load) memory barriers.
430	
431	     A read barrier is a data dependency barrier plus a guarantee that all the
432	     LOAD operations specified before the barrier will appear to happen before
433	     all the LOAD operations specified after the barrier with respect to the
434	     other components of the system.
435	
436	     A read barrier is a partial ordering on loads only; it is not required to
437	     have any effect on stores.
438	
439	     Read memory barriers imply data dependency barriers, and so can substitute
440	     for them.
441	
442	     [!] Note that read barriers should normally be paired with write barriers;
443	     see the "SMP barrier pairing" subsection.
444	
445	
446	 (4) General memory barriers.
447	
448	     A general memory barrier gives a guarantee that all the LOAD and STORE
449	     operations specified before the barrier will appear to happen before all
450	     the LOAD and STORE operations specified after the barrier with respect to
451	     the other components of the system.
452	
453	     A general memory barrier is a partial ordering over both loads and stores.
454	
455	     General memory barriers imply both read and write memory barriers, and so
456	     can substitute for either.
457	
458	
459	And a couple of implicit varieties:
460	
461	 (5) ACQUIRE operations.
462	
463	     This acts as a one-way permeable barrier.  It guarantees that all memory
464	     operations after the ACQUIRE operation will appear to happen after the
465	     ACQUIRE operation with respect to the other components of the system.
466	     ACQUIRE operations include LOCK operations and both smp_load_acquire()
467	     and smp_cond_acquire() operations.  The latter builds the necessary
468	     ACQUIRE semantics by combining a control dependency with smp_rmb().
469	
470	     Memory operations that occur before an ACQUIRE operation may appear to
471	     happen after it completes.
472	
473	     An ACQUIRE operation should almost always be paired with a RELEASE
474	     operation.
475	
476	
477	 (6) RELEASE operations.
478	
479	     This also acts as a one-way permeable barrier.  It guarantees that all
480	     memory operations before the RELEASE operation will appear to happen
481	     before the RELEASE operation with respect to the other components of the
482	     system. RELEASE operations include UNLOCK operations and
483	     smp_store_release() operations.
484	
485	     Memory operations that occur after a RELEASE operation may appear to
486	     happen before it completes.
487	
488	     The use of ACQUIRE and RELEASE operations generally precludes the need
489	     for other sorts of memory barrier (but note the exceptions mentioned in
490	     the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
491	     pair is -not- guaranteed to act as a full memory barrier.  However, after
492	     an ACQUIRE on a given variable, all memory accesses preceding any prior
493	     RELEASE on that same variable are guaranteed to be visible.  In other
494	     words, within a given variable's critical section, all accesses of all
495	     previous critical sections for that variable are guaranteed to have
496	     completed.
497	
498	     This means that ACQUIRE acts as a minimal "acquire" operation and
499	     RELEASE acts as a minimal "release" operation.
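
As an example, the following sketch (with invented variable and function
names) shows the usual message-passing pattern built from smp_store_release()
and smp_load_acquire():

	int msg, ready;

	void sender(void)
	{
		WRITE_ONCE(msg, 1);
		smp_store_release(&ready, 1);	/* RELEASE: the msg store cannot
						 * appear to move below this */
	}

	void receiver(void)
	{
		if (smp_load_acquire(&ready))	/* ACQUIRE: later accesses cannot
						 * appear to move above this */
			BUG_ON(READ_ONCE(msg) != 1);
	}

If receiver() observes the value stored by sender()'s RELEASE, then all of
sender()'s accesses preceding that RELEASE are guaranteed to be visible to
receiver()'s accesses following the ACQUIRE.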
500	
501	A subset of the atomic operations described in atomic_ops.txt have ACQUIRE
502	and RELEASE variants in addition to fully-ordered and relaxed (no barrier
503	semantics) definitions.  For compound atomics performing both a load and a
504	store, ACQUIRE semantics apply only to the load and RELEASE semantics apply
505	only to the store portion of the operation.
506	
507	Memory barriers are only required where there's a possibility of interaction
508	between two CPUs or between a CPU and a device.  If it can be guaranteed that
509	there won't be any such interaction in any particular piece of code, then
510	memory barriers are unnecessary in that piece of code.
511	
512	
513	Note that these are the _minimum_ guarantees.  Different architectures may give
514	more substantial guarantees, but they may _not_ be relied upon outside of arch
515	specific code.
516	
517	
518	WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
519	----------------------------------------------
520	
521	There are certain things that the Linux kernel memory barriers do not guarantee:
522	
523	 (*) There is no guarantee that any of the memory accesses specified before a
524	     memory barrier will be _complete_ by the completion of a memory barrier
525	     instruction; the barrier can be considered to draw a line in that CPU's
526	     access queue that accesses of the appropriate type may not cross.
527	
528	 (*) There is no guarantee that issuing a memory barrier on one CPU will have
529	     any direct effect on another CPU or any other hardware in the system.  The
530	     indirect effect will be the order in which the second CPU sees the effects
531	     of the first CPU's accesses occur, but see the next point:
532	
533	 (*) There is no guarantee that a CPU will see the correct order of effects
534	     from a second CPU's accesses, even _if_ the second CPU uses a memory
535	     barrier, unless the first CPU _also_ uses a matching memory barrier (see
536	     the subsection on "SMP Barrier Pairing").
537	
538	 (*) There is no guarantee that some intervening piece of off-the-CPU
539	     hardware[*] will not reorder the memory accesses.  CPU cache coherency
540	     mechanisms should propagate the indirect effects of a memory barrier
541	     between CPUs, but might not do so in order.
542	
543		[*] For information on bus mastering DMA and coherency please read:
544	
545		    Documentation/PCI/pci.txt
546		    Documentation/DMA-API-HOWTO.txt
547		    Documentation/DMA-API.txt
548	
549	
550	DATA DEPENDENCY BARRIERS
551	------------------------
552	
553	The usage requirements of data dependency barriers are a little subtle, and
554	it's not always obvious that they're needed.  To illustrate, consider the
555	following sequence of events:
556	
557		CPU 1		      CPU 2
558		===============	      ===============
559		{ A == 1, B == 2, C == 3, P == &A, Q == &C }
560		B = 4;
561		<write barrier>
562		WRITE_ONCE(P, &B)
563				      Q = READ_ONCE(P);
564				      D = *Q;
565	
566	There's a clear data dependency here, and it would seem that by the end of the
567	sequence, Q must be either &A or &B, and that:
568	
569		(Q == &A) implies (D == 1)
570		(Q == &B) implies (D == 4)
571	
572	But!  CPU 2's perception of P may be updated _before_ its perception of B, thus
573	leading to the following situation:
574	
575		(Q == &B) and (D == 2) ????
576	
577	Whilst this may seem like a failure of coherency or causality maintenance, it
578	isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
579	Alpha).
580	
581	To deal with this, a data dependency barrier or better must be inserted
582	between the address load and the data load:
583	
584		CPU 1		      CPU 2
585		===============	      ===============
586		{ A == 1, B == 2, C == 3, P == &A, Q == &C }
587		B = 4;
588		<write barrier>
589		WRITE_ONCE(P, &B);
590				      Q = READ_ONCE(P);
591				      <data dependency barrier>
592				      D = *Q;
593	
594	This enforces the occurrence of one of the two implications, and prevents the
595	third possibility from arising.
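
Expressed with the kernel's primitives, the same pattern might look like the
following sketch (the declarations are invented for the example; as noted
earlier, real code would normally use rcu_dereference() rather than
open-coding smp_read_barrier_depends()):

	int A = 1, B = 2;
	int *P = &A;

	void cpu1(void)
	{
		WRITE_ONCE(B, 4);
		smp_wmb();			/* <write barrier> */
		WRITE_ONCE(P, &B);
	}

	void cpu2(void)
	{
		int *Q, D;

		Q = READ_ONCE(P);
		smp_read_barrier_depends();	/* <data dependency barrier> */
		D = READ_ONCE(*Q);		/* (Q == &A) implies (D == 1),
						 * (Q == &B) implies (D == 4) */
	}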
596	
597	A data-dependency barrier must also order against dependent writes:
598	
599		CPU 1		      CPU 2
600		===============	      ===============
601	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
602		B = 4;
603		<write barrier>
604		WRITE_ONCE(P, &B);
605				      Q = READ_ONCE(P);
606				      <data dependency barrier>
607				      *Q = 5;
608	
609	The data-dependency barrier must order the read into Q with the store
610	into *Q.  This prohibits this outcome:
611	
612	(Q == &B) && (B == 4)
613	
614	Please note that this pattern should be rare.  After all, the whole point
615	of dependency ordering is to -prevent- writes to the data structure, along
616	with the expensive cache misses associated with those writes.  This pattern
617	can be used to record rare error conditions and the like, and the ordering
618	prevents such records from being lost.
619	
620	
621	[!] Note that this extremely counterintuitive situation arises most easily on
622	machines with split caches, so that, for example, one cache bank processes
623	even-numbered cache lines and the other bank processes odd-numbered cache
624	lines.  The pointer P might be stored in an odd-numbered cache line, and the
625	variable B might be stored in an even-numbered cache line.  Then, if the
626	even-numbered bank of the reading CPU's cache is extremely busy while the
627	odd-numbered bank is idle, one can see the new value of the pointer P (&B),
628	but the old value of the variable B (2).
629	
630	
631	The data dependency barrier is very important to the RCU system,
632	for example.  See rcu_assign_pointer() and rcu_dereference() in
633	include/linux/rcupdate.h.  This permits the current target of an RCU'd
634	pointer to be replaced with a new modified target, without the replacement
635	target appearing to be incompletely initialised.
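
As a sketch of that usage (the structure, variable and function names below
are invented, not taken from real kernel code):

	struct foo {
		int a;
	};
	struct foo __rcu *gp;			/* RCU-protected pointer */

	extern void do_something_with(int a);	/* hypothetical consumer */

	/* Updater */
	void publish(struct foo *newp)
	{
		newp->a = 42;			/* initialise before publication */
		rcu_assign_pointer(gp, newp);	/* provides the needed ordering */
	}

	/* Reader */
	void reader(void)
	{
		struct foo *p;

		rcu_read_lock();
		p = rcu_dereference(gp);	/* provides the data dependency
						 * barrier where it is needed */
		if (p)
			do_something_with(p->a);	/* never sees a partially
							 * initialised *p */
		rcu_read_unlock();
	}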
636	
637	See also the subsection on "Cache Coherency" for a more thorough example.
638	
639	
640	CONTROL DEPENDENCIES
641	--------------------
642	
643	A load-load control dependency requires a full read memory barrier, not
644	simply a data dependency barrier to make it work correctly.  Consider the
645	following bit of code:
646	
647		q = READ_ONCE(a);
648		if (q) {
649			<data dependency barrier>  /* BUG: No data dependency!!! */
650			p = READ_ONCE(b);
651		}
652	
653	This will not have the desired effect because there is no actual data
654	dependency, but rather a control dependency that the CPU may short-circuit
655	by attempting to predict the outcome in advance, so that other CPUs see
656	the load from b as having happened before the load from a.  In such a
657	case what's actually required is:
658	
659		q = READ_ONCE(a);
660		if (q) {
661			<read barrier>
662			p = READ_ONCE(b);
663		}
664	
665	However, stores are not speculated.  This means that ordering -is- provided
666	for load-store control dependencies, as in the following example:
667	
668		q = READ_ONCE(a);
669		if (q) {
670			WRITE_ONCE(b, p);
671		}
672	
673	Control dependencies pair normally with other types of barriers.  That
674	said, please note that READ_ONCE() is not optional! Without the
675	READ_ONCE(), the compiler might combine the load from 'a' with other
676	loads from 'a', and the store to 'b' with other stores to 'b', with
677	possible highly counterintuitive effects on ordering.
678	
679	Worse yet, if the compiler is able to prove (say) that the value of
680	variable 'a' is always non-zero, it would be well within its rights
681	to optimize the original example by eliminating the "if" statement
682	as follows:
683	
684		q = a;
685		b = p;  /* BUG: Compiler and CPU can both reorder!!! */
686	
687	So don't leave out the READ_ONCE().
688	
689	It is tempting to try to enforce ordering on identical stores on both
690	branches of the "if" statement as follows:
691	
692		q = READ_ONCE(a);
693		if (q) {
694			barrier();
695			WRITE_ONCE(b, p);
696			do_something();
697		} else {
698			barrier();
699			WRITE_ONCE(b, p);
700			do_something_else();
701		}
702	
703	Unfortunately, current compilers will transform this as follows at high
704	optimization levels:
705	
706		q = READ_ONCE(a);
707		barrier();
708		WRITE_ONCE(b, p);  /* BUG: No ordering vs. load from a!!! */
709		if (q) {
710			/* WRITE_ONCE(b, p); -- moved up, BUG!!! */
711			do_something();
712		} else {
713			/* WRITE_ONCE(b, p); -- moved up, BUG!!! */
714			do_something_else();
715		}
716	
717	Now there is no conditional between the load from 'a' and the store to
718	'b', which means that the CPU is within its rights to reorder them:
719	The conditional is absolutely required, and must be present in the
720	assembly code even after all compiler optimizations have been applied.
721	Therefore, if you need ordering in this example, you need explicit
722	memory barriers, for example, smp_store_release():
723	
724		q = READ_ONCE(a);
725		if (q) {
726			smp_store_release(&b, p);
727			do_something();
728		} else {
729			smp_store_release(&b, p);
730			do_something_else();
731		}
732	
733	In contrast, without explicit memory barriers, two-legged-if control
734	ordering is guaranteed only when the stores differ, for example:
735	
736		q = READ_ONCE(a);
737		if (q) {
738			WRITE_ONCE(b, p);
739			do_something();
740		} else {
741			WRITE_ONCE(b, r);
742			do_something_else();
743		}
744	
745	The initial READ_ONCE() is still required to prevent the compiler from
746	proving the value of 'a'.
747	
748	In addition, you need to be careful what you do with the local variable 'q',
749	otherwise the compiler might be able to guess the value and again remove
750	the needed conditional.  For example:
751	
752		q = READ_ONCE(a);
753		if (q % MAX) {
754			WRITE_ONCE(b, p);
755			do_something();
756		} else {
757			WRITE_ONCE(b, r);
758			do_something_else();
759		}
760	
761	If MAX is defined to be 1, then the compiler knows that (q % MAX) is
762	equal to zero, in which case the compiler is within its rights to
763	transform the above code into the following:
764	
765		q = READ_ONCE(a);
766		WRITE_ONCE(b, p);
767		do_something_else();
768	
769	Given this transformation, the CPU is not required to respect the ordering
770	between the load from variable 'a' and the store to variable 'b'.  It is
771	tempting to add a barrier(), but this does not help.  The conditional
772	is gone, and the barrier won't bring it back.  Therefore, if you are
773	relying on this ordering, you should make sure that MAX is greater than
774	one, perhaps as follows:
775	
776		q = READ_ONCE(a);
777		BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
778		if (q % MAX) {
779			WRITE_ONCE(b, p);
780			do_something();
781		} else {
782			WRITE_ONCE(b, r);
783			do_something_else();
784		}
785	
786	Please note once again that the stores to 'b' differ.  If they were
787	identical, as noted earlier, the compiler could pull this store outside
788	of the 'if' statement.
789	
790	You must also be careful not to rely too much on boolean short-circuit
791	evaluation.  Consider this example:
792	
793		q = READ_ONCE(a);
794		if (q || 1 > 0)
795			WRITE_ONCE(b, 1);
796	
797	Because the first condition cannot fault and the second condition is
798	always true, the compiler can transform this example as follows,
799	defeating the control dependency:
800	
801		q = READ_ONCE(a);
802		WRITE_ONCE(b, 1);
803	
804	This example underscores the need to ensure that the compiler cannot
805	out-guess your code.  More generally, although READ_ONCE() does force
806	the compiler to actually emit code for a given load, it does not force
807	the compiler to use the results.
808	
809	Finally, control dependencies do -not- provide transitivity.  This is
810	demonstrated by two related examples, with the initial values of
811	x and y both being zero:
812	
813		CPU 0                     CPU 1
814		=======================   =======================
815		r1 = READ_ONCE(x);        r2 = READ_ONCE(y);
816		if (r1 > 0)               if (r2 > 0)
817		  WRITE_ONCE(y, 1);         WRITE_ONCE(x, 1);
818	
819		assert(!(r1 == 1 && r2 == 1));
820	
821	The above two-CPU example will never trigger the assert().  However,
822	if control dependencies guaranteed transitivity (which they do not),
823	then adding the following CPU would guarantee a related assertion:
824	
825		CPU 2
826		=====================
827		WRITE_ONCE(x, 2);
828	
829		assert(!(r1 == 2 && r2 == 1 && x == 2)); /* FAILS!!! */
830	
831	But because control dependencies do -not- provide transitivity, the above
832	assertion can fail after the combined three-CPU example completes.  If you
833	need the three-CPU example to provide ordering, you will need smp_mb()
834	between the loads and stores in the CPU 0 and CPU 1 code fragments,
835	that is, just before or just after the "if" statements.  Furthermore,
836	the original two-CPU example is very fragile and should be avoided.
837	
838	These two examples are the LB and WWC litmus tests from this paper:
839	http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this
840	site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html.
841	
842	In summary:
843	
844	  (*) Control dependencies can order prior loads against later stores.
845	      However, they do -not- guarantee any other sort of ordering:
846	      Not prior loads against later loads, nor prior stores against
847	      later anything.  If you need these other forms of ordering,
848	      use smp_rmb(), smp_wmb(), or, in the case of prior stores and
849	      later loads, smp_mb().
850	
851	  (*) If both legs of the "if" statement begin with identical stores to
852	      the same variable, then those stores must be ordered, either by
853	      preceding both of them with smp_mb() or by using smp_store_release()
854	      to carry out the stores.  Please note that it is -not- sufficient
855	      to use barrier() at the beginning of each leg of the "if" statement
856	      because, as shown by the example above, optimizing compilers can
857	      destroy the control dependency while respecting the letter of the
858	      barrier() law.
859	
860	  (*) Control dependencies require at least one run-time conditional
861	      between the prior load and the subsequent store, and this
862	      conditional must involve the prior load.  If the compiler is able
863	      to optimize the conditional away, it will have also optimized
864	      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
865	      can help to preserve the needed conditional.
866	
867	  (*) Control dependencies require that the compiler avoid reordering the
868	      dependency into nonexistence.  Careful use of READ_ONCE() or
869	      atomic{,64}_read() can help to preserve your control dependency.
870	      Please see the COMPILER BARRIER section for more information.
871	
872	  (*) Control dependencies pair normally with other types of barriers.
873	
874	  (*) Control dependencies do -not- provide transitivity.  If you
875	      need transitivity, use smp_mb().
876	
877	
878	SMP BARRIER PAIRING
879	-------------------
880	
881	When dealing with CPU-CPU interactions, certain types of memory barrier should
882	always be paired.  A lack of appropriate pairing is almost certainly an error.
883	
884	General barriers pair with each other, though they also pair with most
885	other types of barriers, albeit without transitivity.  An acquire barrier
886	pairs with a release barrier, but both may also pair with other barriers,
887	including of course general barriers.  A write barrier pairs with a data
888	dependency barrier, a control dependency, an acquire barrier, a release
889	barrier, a read barrier, or a general barrier.  Similarly a read barrier,
890	control dependency, or a data dependency barrier pairs with a write
891	barrier, an acquire barrier, a release barrier, or a general barrier:
892	
893		CPU 1		      CPU 2
894		===============	      ===============
895		WRITE_ONCE(a, 1);
896		<write barrier>
897		WRITE_ONCE(b, 2);     x = READ_ONCE(b);
898				      <read barrier>
899				      y = READ_ONCE(a);
900	
901	Or:
902	
903		CPU 1		      CPU 2
904		===============	      ===============================
905		a = 1;
906		<write barrier>
907		WRITE_ONCE(b, &a);    x = READ_ONCE(b);
908				      <data dependency barrier>
909				      y = *x;
910	
911	Or even:
912	
913		CPU 1		      CPU 2
914		===============	      ===============================
915		r1 = READ_ONCE(y);
916		<general barrier>
917	WRITE_ONCE(x, 1);     if (r2 = READ_ONCE(x)) {
918				         <implicit control dependency>
919				         WRITE_ONCE(y, 1);
920				      }
921	
922		assert(r1 == 0 || r2 == 0);
923	
924	Basically, the read barrier always has to be there, even though it can be of
925	the "weaker" type.
926	
927	[!] Note that the stores before the write barrier would normally be expected to
928	match the loads after the read barrier or the data dependency barrier, and vice
929	versa:
930	
931		CPU 1                               CPU 2
932		===================                 ===================
933		WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
934		WRITE_ONCE(b, 2);    }    \ /    {  w = READ_ONCE(d);
935		<write barrier>            \        <read barrier>
936		WRITE_ONCE(c, 3);    }    / \    {  x = READ_ONCE(a);
937		WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
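
In code, barrier pairing of this kind typically takes the following
producer/consumer form (a sketch; the variable and function names are
invented):

	int data, flag;

	void producer(void)
	{
		WRITE_ONCE(data, 42);
		smp_wmb();		/* order the data store before the
					 * flag store */
		WRITE_ONCE(flag, 1);
	}

	void consumer(void)
	{
		if (READ_ONCE(flag)) {
			smp_rmb();	/* pairs with the smp_wmb() above */
			BUG_ON(READ_ONCE(data) != 42);
		}
	}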
938	
939	
940	EXAMPLES OF MEMORY BARRIER SEQUENCES
941	------------------------------------
942	
943	Firstly, write barriers act as partial orderings on store operations.
944	Consider the following sequence of events:
945	
946		CPU 1
947		=======================
948		STORE A = 1
949		STORE B = 2
950		STORE C = 3
951		<write barrier>
952		STORE D = 4
953		STORE E = 5
954	
955	This sequence of events is committed to the memory coherence system in an order
956	that the rest of the system might perceive as the unordered set of { STORE A,
957	STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
958	}:
959	
960		+-------+       :      :
961		|       |       +------+
962		|       |------>| C=3  |     }     /\
963		|       |  :    +------+     }-----  \  -----> Events perceptible to
964		|       |  :    | A=1  |     }        \/       the rest of the system
965		|       |  :    +------+     }
966		| CPU 1 |  :    | B=2  |     }
967		|       |       +------+     }
968		|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
969		|       |       +------+     }        requires all stores prior to the
970		|       |  :    | E=5  |     }        barrier to be committed before
971		|       |  :    +------+     }        further stores may take place
972		|       |------>| D=4  |     }
973		|       |       +------+
974		+-------+       :      :
975		                   |
976		                   | Sequence in which stores are committed to the
977		                   | memory system by CPU 1
978		                   V
979	
980	
981	Secondly, data dependency barriers act as partial orderings on data-dependent
982	loads.  Consider the following sequence of events:
983	
984		CPU 1			CPU 2
985		=======================	=======================
986			{ B = 7; X = 9; Y = 8; C = &Y }
987		STORE A = 1
988		STORE B = 2
989		<write barrier>
990		STORE C = &B		LOAD X
991		STORE D = 4		LOAD C (gets &B)
992					LOAD *C (reads B)
993	
994	Without intervention, CPU 2 may perceive the events on CPU 1 in some
995	effectively random order, despite the write barrier issued by CPU 1:
996	
997		+-------+       :      :                :       :
998		|       |       +------+                +-------+  | Sequence of update
999		|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
1000		|       |  :    +------+     \          +-------+  | CPU 2
1001		| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
1002		|       |       +------+       |        +-------+
1003		|       |   wwwwwwwwwwwwwwww   |        :       :
1004		|       |       +------+       |        :       :
1005		|       |  :    | C=&B |---    |        :       :       +-------+
1006		|       |  :    +------+   \   |        +-------+       |       |
1007		|       |------>| D=4  |    ----------->| C->&B |------>|       |
1008		|       |       +------+       |        +-------+       |       |
1009		+-------+       :      :       |        :       :       |       |
1010		                               |        :       :       |       |
1011		                               |        :       :       | CPU 2 |
1012		                               |        +-------+       |       |
1013		    Apparently incorrect --->  |        | B->7  |------>|       |
1014		    perception of B (!)        |        +-------+       |       |
1015		                               |        :       :       |       |
1016		                               |        +-------+       |       |
1017		    The load of X holds --->    \       | X->9  |------>|       |
1018		    up the maintenance           \      +-------+       |       |
1019		    of coherence of B             ----->| B->2  |       +-------+
1020		                                        +-------+
1021		                                        :       :
1022	
1023	
1024	In the above example, CPU 2 perceives that B is 7, despite the load of *C
1025	(which would be B) coming after the LOAD of C.
1026	
1027	If, however, a data dependency barrier were to be placed between the load of C
1028	and the load of *C (ie: B) on CPU 2:
1029	
1030		CPU 1			CPU 2
1031		=======================	=======================
1032			{ B = 7; X = 9; Y = 8; C = &Y }
1033		STORE A = 1
1034		STORE B = 2
1035		<write barrier>
1036		STORE C = &B		LOAD X
1037		STORE D = 4		LOAD C (gets &B)
1038					<data dependency barrier>
1039					LOAD *C (reads B)
1040	
1041	then the following will occur:
1042	
1043		+-------+       :      :                :       :
1044		|       |       +------+                +-------+
1045		|       |------>| B=2  |-----       --->| Y->8  |
1046		|       |  :    +------+     \          +-------+
1047		| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
1048		|       |       +------+       |        +-------+
1049		|       |   wwwwwwwwwwwwwwww   |        :       :
1050		|       |       +------+       |        :       :
1051		|       |  :    | C=&B |---    |        :       :       +-------+
1052		|       |  :    +------+   \   |        +-------+       |       |
1053		|       |------>| D=4  |    ----------->| C->&B |------>|       |
1054		|       |       +------+       |        +-------+       |       |
1055		+-------+       :      :       |        :       :       |       |
1056		                               |        :       :       |       |
1057		                               |        :       :       | CPU 2 |
1058		                               |        +-------+       |       |
1059		                               |        | X->9  |------>|       |
1060		                               |        +-------+       |       |
1061		  Makes sure all effects --->   \   ddddddddddddddddd   |       |
1062		  prior to the store of C        \      +-------+       |       |
1063		  are perceptible to              ----->| B->2  |------>|       |
1064		  subsequent loads                      +-------+       |       |
1065		                                        :       :       +-------+
1066	
1067	
1068	And thirdly, a read barrier acts as a partial order on loads.  Consider the
1069	following sequence of events:
1070	
1071		CPU 1			CPU 2
1072		=======================	=======================
1073			{ A = 0, B = 9 }
1074		STORE A=1
1075		<write barrier>
1076		STORE B=2
1077					LOAD B
1078					LOAD A
1079	
1080	Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
1081	some effectively random order, despite the write barrier issued by CPU 1:
1082	
1083		+-------+       :      :                :       :
1084		|       |       +------+                +-------+
1085		|       |------>| A=1  |------      --->| A->0  |
1086		|       |       +------+      \         +-------+
1087		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1088		|       |       +------+        |       +-------+
1089		|       |------>| B=2  |---     |       :       :
1090		|       |       +------+   \    |       :       :       +-------+
1091		+-------+       :      :    \   |       +-------+       |       |
1092		                             ---------->| B->2  |------>|       |
1093		                                |       +-------+       | CPU 2 |
1094		                                |       | A->0  |------>|       |
1095		                                |       +-------+       |       |
1096		                                |       :       :       +-------+
1097		                                 \      :       :
1098		                                  \     +-------+
1099		                                   ---->| A->1  |
1100		                                        +-------+
1101		                                        :       :
1102	
1103	
1104	If, however, a read barrier were to be placed between the load of B and the
1105	load of A on CPU 2:
1106	
1107		CPU 1			CPU 2
1108		=======================	=======================
1109			{ A = 0, B = 9 }
1110		STORE A=1
1111		<write barrier>
1112		STORE B=2
1113					LOAD B
1114					<read barrier>
1115					LOAD A
1116	
1117	then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
1118	2:
1119	
1120		+-------+       :      :                :       :
1121		|       |       +------+                +-------+
1122		|       |------>| A=1  |------      --->| A->0  |
1123		|       |       +------+      \         +-------+
1124		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1125		|       |       +------+        |       +-------+
1126		|       |------>| B=2  |---     |       :       :
1127		|       |       +------+   \    |       :       :       +-------+
1128		+-------+       :      :    \   |       +-------+       |       |
1129		                             ---------->| B->2  |------>|       |
1130		                                |       +-------+       | CPU 2 |
1131		                                |       :       :       |       |
1132		                                |       :       :       |       |
1133		  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1134		  barrier causes all effects      \     +-------+       |       |
1135		  prior to the storage of B        ---->| A->1  |------>|       |
1136		  to be perceptible to CPU 2            +-------+       |       |
1137		                                        :       :       +-------+
1138	
1139	
1140	To illustrate this more completely, consider what could happen if the code
1141	contained a load of A either side of the read barrier:
1142	
1143		CPU 1			CPU 2
1144		=======================	=======================
1145			{ A = 0, B = 9 }
1146		STORE A=1
1147		<write barrier>
1148		STORE B=2
1149					LOAD B
1150					LOAD A [first load of A]
1151					<read barrier>
1152					LOAD A [second load of A]
1153	
1154	Even though the two loads of A both occur after the load of B, they may both
1155	come up with different values:
1156	
1157		+-------+       :      :                :       :
1158		|       |       +------+                +-------+
1159		|       |------>| A=1  |------      --->| A->0  |
1160		|       |       +------+      \         +-------+
1161		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1162		|       |       +------+        |       +-------+
1163		|       |------>| B=2  |---     |       :       :
1164		|       |       +------+   \    |       :       :       +-------+
1165		+-------+       :      :    \   |       +-------+       |       |
1166		                             ---------->| B->2  |------>|       |
1167		                                |       +-------+       | CPU 2 |
1168		                                |       :       :       |       |
1169		                                |       :       :       |       |
1170		                                |       +-------+       |       |
1171		                                |       | A->0  |------>| 1st   |
1172		                                |       +-------+       |       |
1173		  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1174		  barrier causes all effects      \     +-------+       |       |
1175		  prior to the storage of B        ---->| A->1  |------>| 2nd   |
1176		  to be perceptible to CPU 2            +-------+       |       |
1177		                                        :       :       +-------+
1178	
1179	
1180	But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
1181	before the read barrier completes anyway:
1182	
1183		+-------+       :      :                :       :
1184		|       |       +------+                +-------+
1185		|       |------>| A=1  |------      --->| A->0  |
1186		|       |       +------+      \         +-------+
1187		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1188		|       |       +------+        |       +-------+
1189		|       |------>| B=2  |---     |       :       :
1190		|       |       +------+   \    |       :       :       +-------+
1191		+-------+       :      :    \   |       +-------+       |       |
1192		                             ---------->| B->2  |------>|       |
1193		                                |       +-------+       | CPU 2 |
1194		                                |       :       :       |       |
1195		                                 \      :       :       |       |
1196		                                  \     +-------+       |       |
1197		                                   ---->| A->1  |------>| 1st   |
1198		                                        +-------+       |       |
1199		                                    rrrrrrrrrrrrrrrrr   |       |
1200		                                        +-------+       |       |
1201		                                        | A->1  |------>| 2nd   |
1202		                                        +-------+       |       |
1203		                                        :       :       +-------+
1204	
1205	
1206	The guarantee is that the second load will always come up with A == 1 if the
1207	load of B came up with B == 2.  No such guarantee exists for the first load of
1208	A; that may come up with either A == 0 or A == 1.
1209	
1210	
1211	READ MEMORY BARRIERS VS LOAD SPECULATION
1212	----------------------------------------
1213	
1214	Many CPUs speculate with loads: that is they see that they will need to load an
1215	item from memory, and they find a time where they're not using the bus for any
1216	other loads, and so do the load in advance - even though they haven't actually
1217	got to that point in the instruction execution flow yet.  This permits the
1218	actual load instruction to potentially complete immediately because the CPU
1219	already has the value to hand.
1220	
1221	It may turn out that the CPU didn't actually need the value - perhaps because a
1222	branch circumvented the load - in which case it can discard the value or just
1223	cache it for later use.
1224	
1225	Consider:
1226	
1227		CPU 1			CPU 2
1228		=======================	=======================
1229					LOAD B
1230					DIVIDE		} Divide instructions generally
1231					DIVIDE		} take a long time to perform
1232					LOAD A
1233	
1234	Which might appear as this:
1235	
1236		                                        :       :       +-------+
1237		                                        +-------+       |       |
1238		                                    --->| B->2  |------>|       |
1239		                                        +-------+       | CPU 2 |
1240		                                        :       :DIVIDE |       |
1241		                                        +-------+       |       |
1242		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1243		division speculates on the              +-------+   ~   |       |
1244		LOAD of A                               :       :   ~   |       |
1245		                                        :       :DIVIDE |       |
1246		                                        :       :   ~   |       |
1247		Once the divisions are complete -->     :       :   ~-->|       |
1248		the CPU can then perform the            :       :       |       |
1249		LOAD with immediate effect              :       :       +-------+
1250	
1251	
1252	Placing a read barrier or a data dependency barrier just before the second
1253	load:
1254	
1255		CPU 1			CPU 2
1256		=======================	=======================
1257					LOAD B
1258					DIVIDE
1259					DIVIDE
1260					<read barrier>
1261					LOAD A
1262	
1263	will force any value speculatively obtained to be reconsidered to an extent
1264	dependent on the type of barrier used.  If there was no change made to the
1265	speculated memory location, then the speculated value will just be used:
1266	
1267		                                        :       :       +-------+
1268		                                        +-------+       |       |
1269		                                    --->| B->2  |------>|       |
1270		                                        +-------+       | CPU 2 |
1271		                                        :       :DIVIDE |       |
1272		                                        +-------+       |       |
1273		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1274		division speculates on the              +-------+   ~   |       |
1275		LOAD of A                               :       :   ~   |       |
1276		                                        :       :DIVIDE |       |
1277		                                        :       :   ~   |       |
1278		                                        :       :   ~   |       |
1279		                                    rrrrrrrrrrrrrrrr~   |       |
1280		                                        :       :   ~   |       |
1281		                                        :       :   ~-->|       |
1282		                                        :       :       |       |
1283		                                        :       :       +-------+
1284	
1285	
1286	but if there was an update or an invalidation from another CPU pending, then
1287	the speculation will be cancelled and the value reloaded:
1288	
1289		                                        :       :       +-------+
1290		                                        +-------+       |       |
1291		                                    --->| B->2  |------>|       |
1292		                                        +-------+       | CPU 2 |
1293		                                        :       :DIVIDE |       |
1294		                                        +-------+       |       |
1295		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1296		division speculates on the              +-------+   ~   |       |
1297		LOAD of A                               :       :   ~   |       |
1298		                                        :       :DIVIDE |       |
1299		                                        :       :   ~   |       |
1300		                                        :       :   ~   |       |
1301		                                    rrrrrrrrrrrrrrrrr   |       |
1302		                                        +-------+       |       |
1303		The speculation is discarded --->   --->| A->1  |------>|       |
1304		and an updated value is                 +-------+       |       |
1305		retrieved                               :       :       +-------+
1306	
1307	
1308	TRANSITIVITY
1309	------------
1310	
1311	Transitivity is a deeply intuitive notion about ordering that is not
1312	always provided by real computer systems.  The following example
1313	demonstrates transitivity:
1314	
1315		CPU 1			CPU 2			CPU 3
1316		=======================	=======================	=======================
1317			{ X = 0, Y = 0 }
1318		STORE X=1		LOAD X			STORE Y=1
1319					<general barrier>	<general barrier>
1320					LOAD Y			LOAD X
1321	
1322	Suppose that CPU 2's load from X returns 1 and its load from Y returns 0.
1323	This indicates that CPU 2's load from X in some sense follows CPU 1's
1324	store to X and that CPU 2's load from Y in some sense preceded CPU 3's
1325	store to Y.  The question is then "Can CPU 3's load from X return 0?"
1326	
1327	Because CPU 2's load from X in some sense came after CPU 1's store, it
1328	is natural to expect that CPU 3's load from X must therefore return 1.
1329	This expectation is an example of transitivity: if a load executing on
1330	CPU A follows a load from the same variable executing on CPU B, then
1331	CPU A's load must either return the same value that CPU B's load did,
1332	or must return some later value.
1333	
1334	In the Linux kernel, use of general memory barriers guarantees
1335	transitivity.  Therefore, in the above example, if CPU 2's load from X
1336	returns 1 and its load from Y returns 0, then CPU 3's load from X must
1337	also return 1.
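
For reference, the scenario above can also be written as C code using
the kernel's primitives.  This is a minimal sketch in the style of the
release-acquire example later in this section; the function names and
the per-CPU result variables r1, r2 and r3 are purely illustrative:

	int X, Y;

	void cpu1(void)
	{
		WRITE_ONCE(X, 1);
	}

	void cpu2(void)
	{
		r1 = READ_ONCE(X);
		smp_mb();		/* general barrier */
		r2 = READ_ONCE(Y);
	}

	void cpu3(void)
	{
		WRITE_ONCE(Y, 1);
		smp_mb();		/* general barrier */
		r3 = READ_ONCE(X);
	}

Given the general barriers, the outcome "r1 == 1 && r2 == 0 && r3 == 0"
is prohibited, which is exactly the transitivity guarantee described
above.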
1338	
1339	However, transitivity is -not- guaranteed for read or write barriers.
1340	For example, suppose that CPU 2's general barrier in the above example
1341	is changed to a read barrier as shown below:
1342	
1343		CPU 1			CPU 2			CPU 3
1344		=======================	=======================	=======================
1345			{ X = 0, Y = 0 }
1346		STORE X=1		LOAD X			STORE Y=1
1347					<read barrier>		<general barrier>
1348					LOAD Y			LOAD X
1349	
1350	This substitution destroys transitivity: in this example, it is perfectly
1351	legal for CPU 2's load from X to return 1, its load from Y to return 0,
1352	and CPU 3's load from X to return 0.
1353	
1354	The key point is that although CPU 2's read barrier orders its pair
1355	of loads, it does not guarantee to order CPU 1's store.  Therefore, if
1356	this example runs on a system where CPUs 1 and 2 share a store buffer
1357	or a level of cache, CPU 2 might have early access to CPU 1's writes.
1358	General barriers are therefore required to ensure that all CPUs agree
1359	on the combined order of CPU 1's and CPU 2's accesses.
1360	
1361	General barriers provide "global transitivity", so that all CPUs will
1362	agree on the order of operations.  In contrast, a chain of release-acquire
1363	pairs provides only "local transitivity", so that only those CPUs on
1364	the chain are guaranteed to agree on the combined order of the accesses.
1365	For example, switching to C code in deference to Herman Hollerith:
1366	
1367		int u, v, x, y, z;
1368	
1369		void cpu0(void)
1370		{
1371			r0 = smp_load_acquire(&x);
1372			WRITE_ONCE(u, 1);
1373			smp_store_release(&y, 1);
1374		}
1375	
1376		void cpu1(void)
1377		{
1378			r1 = smp_load_acquire(&y);
1379			r4 = READ_ONCE(v);
1380			r5 = READ_ONCE(u);
1381			smp_store_release(&z, 1);
1382		}
1383	
1384		void cpu2(void)
1385		{
1386			r2 = smp_load_acquire(&z);
1387			smp_store_release(&x, 1);
1388		}
1389	
1390		void cpu3(void)
1391		{
1392			WRITE_ONCE(v, 1);
1393			smp_mb();
1394			r3 = READ_ONCE(u);
1395		}
1396	
1397	Because cpu0(), cpu1(), and cpu2() participate in a local transitive
1398	chain of smp_store_release()/smp_load_acquire() pairs, the following
1399	outcome is prohibited:
1400	
1401		r0 == 1 && r1 == 1 && r2 == 1
1402	
1403	Furthermore, because of the release-acquire relationship between cpu0()
1404	and cpu1(), cpu1() must see cpu0()'s writes, so that the following
1405	outcome is prohibited:
1406	
1407		r1 == 1 && r5 == 0
1408	
1409	However, the transitivity of release-acquire is local to the participating
1410	CPUs and does not apply to cpu3().  Therefore, the following outcome
1411	is possible:
1412	
1413		r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
1414	
1415	As an aside, the following outcome is also possible:
1416	
1417		r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
1418	
1419	Although cpu0(), cpu1(), and cpu2() will see their respective reads and
1420	writes in order, CPUs not involved in the release-acquire chain might
1421	well disagree on the order.  This disagreement stems from the fact that
1422	the weak memory-barrier instructions used to implement smp_load_acquire()
1423	and smp_store_release() are not required to order prior stores against
1424	subsequent loads in all cases.  This means that cpu3() can see cpu0()'s
1425	store to u as happening -after- cpu1()'s load from v, even though
1426	both cpu0() and cpu1() agree that these two operations occurred in the
1427	intended order.
1428	
1429	However, please keep in mind that smp_load_acquire() is not magic.
1430	In particular, it simply reads from its argument with ordering.  It does
1431	-not- ensure that any particular value will be read.  Therefore, the
1432	following outcome is possible:
1433	
1434		r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
1435	
1436	Note that this outcome can happen even on a mythical sequentially
1437	consistent system where nothing is ever reordered.
1438	
1439	To reiterate, if your code requires global transitivity, use general
1440	barriers throughout.
1441	
1442	
1443	========================
1444	EXPLICIT KERNEL BARRIERS
1445	========================
1446	
1447	The Linux kernel has a variety of different barriers that act at different
1448	levels:
1449	
1450	  (*) Compiler barrier.
1451	
1452	  (*) CPU memory barriers.
1453	
1454	  (*) MMIO write barrier.
1455	
1456	
1457	COMPILER BARRIER
1458	----------------
1459	
1460	The Linux kernel has an explicit compiler barrier function that prevents the
1461	compiler from moving the memory accesses either side of it to the other side:
1462	
1463		barrier();
1464	
1465	This is a general barrier -- there are no read-read or write-write
1466	variants of barrier().  However, READ_ONCE() and WRITE_ONCE() can be
1467	thought of as weak forms of barrier() that affect only the specific
1468	accesses flagged by the READ_ONCE() or WRITE_ONCE().
1469	
1470	The barrier() function has the following effects:
1471	
1472	 (*) Prevents the compiler from reordering accesses following the
1473	     barrier() to precede any accesses preceding the barrier().
1474	     One example use for this property is to ease communication between
1475	     interrupt-handler code and the code that was interrupted.
1476	
1477	 (*) Within a loop, forces the compiler to load the variables used
1478	     in that loop's conditional on each pass through that loop.
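
For example, a flag set by an interrupt handler might be polled like
this (a minimal sketch; without the barrier(), the compiler could hoist
the load of 'flag' out of the loop and spin forever on a stale value):

	static int flag;	/* set to 1 by the interrupt handler */

	while (!flag)
		barrier();	/* force 'flag' to be reloaded on each pass */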
1479	
1480	The READ_ONCE() and WRITE_ONCE() functions can prevent any number of
1481	optimizations that, while perfectly safe in single-threaded code, can
1482	be fatal in concurrent code.  Here are some examples of these sorts
1483	of optimizations:
1484	
1485	 (*) The compiler is within its rights to reorder loads and stores
1486	     to the same variable, and in some cases, the CPU is within its
1487	     rights to reorder loads to the same variable.  This means that
1488	     the following code:
1489	
1490		a[0] = x;
1491		a[1] = x;
1492	
1493	     Might result in an older value of x stored in a[1] than in a[0].
1494	     Prevent both the compiler and the CPU from doing this as follows:
1495	
1496		a[0] = READ_ONCE(x);
1497		a[1] = READ_ONCE(x);
1498	
1499	     In short, READ_ONCE() and WRITE_ONCE() provide cache coherence for
1500	     accesses from multiple CPUs to a single variable.
1501	
1502	 (*) The compiler is within its rights to merge successive loads from
1503	     the same variable.  Such merging can cause the compiler to "optimize"
1504	     the following code:
1505	
1506		while (tmp = a)
1507			do_something_with(tmp);
1508	
1509	     into the following code, which, although in some sense legitimate
1510	     for single-threaded code, is almost certainly not what the developer
1511	     intended:
1512	
1513		if (tmp = a)
1514			for (;;)
1515				do_something_with(tmp);
1516	
1517	     Use READ_ONCE() to prevent the compiler from doing this to you:
1518	
1519		while (tmp = READ_ONCE(a))
1520			do_something_with(tmp);
1521	
1522	 (*) The compiler is within its rights to reload a variable, for example,
1523	     in cases where high register pressure prevents the compiler from
1524	     keeping all data of interest in registers.  The compiler might
1525	     therefore optimize the variable 'tmp' out of our previous example:
1526	
1527		while (tmp = a)
1528			do_something_with(tmp);
1529	
1530	     This could result in the following code, which is perfectly safe in
1531	     single-threaded code, but can be fatal in concurrent code:
1532	
1533		while (a)
1534			do_something_with(a);
1535	
1536	     For example, the optimized version of this code could result in
1537	     passing a zero to do_something_with() in the case where the variable
1538	     a was modified by some other CPU between the "while" statement and
1539	     the call to do_something_with().
1540	
1541	     Again, use READ_ONCE() to prevent the compiler from doing this:
1542	
1543		while (tmp = READ_ONCE(a))
1544			do_something_with(tmp);
1545	
1546	     Note that if the compiler runs short of registers, it might save
1547	     tmp onto the stack.  The overhead of this saving and later restoring
1548	     is why compilers reload variables.  Doing so is perfectly safe for
1549	     single-threaded code, so you need to tell the compiler about cases
1550	     where it is not safe.
1551	
1552	 (*) The compiler is within its rights to omit a load entirely if it knows
1553	     what the value will be.  For example, if the compiler can prove that
1554	     the value of variable 'a' is always zero, it can optimize this code:
1555	
1556		while (tmp = a)
1557			do_something_with(tmp);
1558	
1559	     Into this:
1560	
1561		do { } while (0);
1562	
1563	     This transformation is a win for single-threaded code because it
1564	     gets rid of a load and a branch.  The problem is that the compiler
1565	     will carry out its proof assuming that the current CPU is the only
1566	     one updating variable 'a'.  If variable 'a' is shared, then the
1567	     compiler's proof will be erroneous.  Use READ_ONCE() to tell the
1568	     compiler that it doesn't know as much as it thinks it does:
1569	
1570		while (tmp = READ_ONCE(a))
1571			do_something_with(tmp);
1572	
1573	     But please note that the compiler is also closely watching what you
1574	     do with the value after the READ_ONCE().  For example, suppose you
1575	     do the following and MAX is a preprocessor macro with the value 1:
1576	
1577		while ((tmp = READ_ONCE(a)) % MAX)
1578			do_something_with(tmp);
1579	
1580	     Then the compiler knows that the result of the "%" operator applied
1581	     to MAX will always be zero, again allowing the compiler to optimize
1582	     the code into near-nonexistence.  (It will still load from the
1583	     variable 'a'.)
1584	
1585	 (*) Similarly, the compiler is within its rights to omit a store entirely
1586	     if it knows that the variable already has the value being stored.
1587	     Again, the compiler assumes that the current CPU is the only one
1588	     storing into the variable, which can cause the compiler to do the
1589	     wrong thing for shared variables.  For example, suppose you have
1590	     the following:
1591	
1592		a = 0;
1593		... Code that does not store to variable a ...
1594		a = 0;
1595	
1596	     The compiler sees that the value of variable 'a' is already zero, so
1597	     it might well omit the second store.  This would come as a fatal
1598	     surprise if some other CPU might have stored to variable 'a' in the
1599	     meantime.
1600	
1601	     Use WRITE_ONCE() to prevent the compiler from making this sort of
1602	     wrong guess:
1603	
1604		WRITE_ONCE(a, 0);
1605		... Code that does not store to variable a ...
1606		WRITE_ONCE(a, 0);
1607	
1608	 (*) The compiler is within its rights to reorder memory accesses unless
1609	     you tell it not to.  For example, consider the following interaction
1610	     between process-level code and an interrupt handler:
1611	
1612		void process_level(void)
1613		{
1614			msg = get_message();
1615			flag = true;
1616		}
1617	
1618		void interrupt_handler(void)
1619		{
1620			if (flag)
1621				process_message(msg);
1622		}
1623	
1624	     There is nothing to prevent the compiler from transforming
1625	     process_level() to the following; in fact, this might well be a
1626	     win for single-threaded code:
1627	
1628		void process_level(void)
1629		{
1630			flag = true;
1631			msg = get_message();
1632		}
1633	
1634	     If the interrupt occurs between these two statements, then
1635	     interrupt_handler() might be passed a garbled msg.  Use WRITE_ONCE()
1636	     to prevent this as follows:
1637	
1638		void process_level(void)
1639		{
1640			WRITE_ONCE(msg, get_message());
1641			WRITE_ONCE(flag, true);
1642		}
1643	
1644		void interrupt_handler(void)
1645		{
1646			if (READ_ONCE(flag))
1647				process_message(READ_ONCE(msg));
1648		}
1649	
1650	     Note that the READ_ONCE() and WRITE_ONCE() wrappers in
1651	     interrupt_handler() are needed if this interrupt handler can itself
1652	     be interrupted by something that also accesses 'flag' and 'msg',
1653	     for example, a nested interrupt or an NMI.  Otherwise, READ_ONCE()
1654	     and WRITE_ONCE() are not needed in interrupt_handler() other than
1655	     for documentation purposes.  (Note also that nested interrupts
1656	     do not typically occur in modern Linux kernels; in fact, if an
1657	     interrupt handler returns with interrupts enabled, you will get a
1658	     WARN_ONCE() splat.)
1659	
1660	     You should assume that the compiler can move READ_ONCE() and
1661	     WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(),
1662	     barrier(), or similar primitives.
1663	
1664	     This effect could also be achieved using barrier(), but READ_ONCE()
1665	     and WRITE_ONCE() are more selective:  With READ_ONCE() and
1666	     WRITE_ONCE(), the compiler need only forget the contents of the
1667	     indicated memory locations, while with barrier() the compiler must
1668	     discard the value of all memory locations that it has currently
1669	     cached in any machine registers.  Of course, the compiler must also
1670	     respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur,
1671	     though the CPU of course need not do so.
1672	
1673	 (*) The compiler is within its rights to invent stores to a variable,
1674	     as in the following example:
1675	
1676		if (a)
1677			b = a;
1678		else
1679			b = 42;
1680	
1681	     The compiler might save a branch by optimizing this as follows:
1682	
1683		b = 42;
1684		if (a)
1685			b = a;
1686	
1687	     In single-threaded code, this is not only safe, but also saves
1688	     a branch.  Unfortunately, in concurrent code, this optimization
1689	     could cause some other CPU to see a spurious value of 42 -- even
1690	     if variable 'a' was never zero -- when loading variable 'b'.
1691	     Use WRITE_ONCE() to prevent this as follows:
1692	
1693		if (a)
1694			WRITE_ONCE(b, a);
1695		else
1696			WRITE_ONCE(b, 42);
1697	
1698	     The compiler can also invent loads.  These are usually less
1699	     damaging, but they can result in cache-line bouncing and thus in
1700	     poor performance and scalability.  Use READ_ONCE() to prevent
1701	     invented loads.
1702	
1703	 (*) For aligned memory locations whose size allows them to be accessed
1704	     with a single memory-reference instruction, they prevent "load tearing"
1705	     and "store tearing," in which a single large access is replaced by
1706	     multiple smaller accesses.  For example, given an architecture having
1707	     16-bit store instructions with 7-bit immediate fields, the compiler
1708	     might be tempted to use two 16-bit store-immediate instructions to
1709	     implement the following 32-bit store:
1710	
1711		p = 0x00010002;
1712	
1713	     Please note that GCC really does use this sort of optimization,
1714	     which is not surprising given that it would likely take more
1715	     than two instructions to build the constant and then store it.
1716	     This optimization can therefore be a win in single-threaded code.
1717	     In fact, a recent bug (since fixed) caused GCC to incorrectly use
1718	     this optimization in a volatile store.  In the absence of such bugs,
1719	     use of WRITE_ONCE() prevents store tearing in the following example:
1720	
1721		WRITE_ONCE(p, 0x00010002);
1722	
1723	     Use of packed structures can also result in load and store tearing,
1724	     as in this example:
1725	
1726		struct __attribute__((__packed__)) foo {
1727			short a;
1728			int b;
1729			short c;
1730		};
1731		struct foo foo1, foo2;
1732		...
1733	
1734		foo2.a = foo1.a;
1735		foo2.b = foo1.b;
1736		foo2.c = foo1.c;
1737	
1738	     Because there are no READ_ONCE() or WRITE_ONCE() wrappers and no
1739	     volatile markings, the compiler would be well within its rights to
1740	     implement these three assignment statements as a pair of 32-bit
1741	     loads followed by a pair of 32-bit stores.  This would result in
1742	     load tearing on 'foo1.b' and store tearing on 'foo2.b'.  READ_ONCE()
1743	     and WRITE_ONCE() again prevent tearing in this example:
1744	
1745		foo2.a = foo1.a;
1746		WRITE_ONCE(foo2.b, READ_ONCE(foo1.b));
1747		foo2.c = foo1.c;
1748	
1749	All that aside, it is never necessary to use READ_ONCE() and
1750	WRITE_ONCE() on a variable that has been marked volatile.  For example,
1751	because 'jiffies' is marked volatile, it is never necessary to
1752	say READ_ONCE(jiffies).  The reason for this is that READ_ONCE() and
1753	WRITE_ONCE() are implemented as volatile casts, which have no effect when
1754	their argument is already marked volatile.
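
As a simplified illustration of this point only (the kernel's actual
implementation is rather more involved, for example in how it handles
accesses of different sizes), READ_ONCE() and WRITE_ONCE() boil down to
volatile accesses along these lines:

	#define READ_ONCE(x)		(*(volatile typeof(x) *)&(x))
	#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))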
1755	
1756	Please note that these compiler barriers have no direct effect on the CPU,
1757	which may then reorder things however it wishes.
1758	
1759	
1760	CPU MEMORY BARRIERS
1761	-------------------
1762	
1763	The Linux kernel has eight basic CPU memory barriers:
1764	
1765		TYPE		MANDATORY		SMP CONDITIONAL
1766		===============	=======================	===========================
1767		GENERAL		mb()			smp_mb()
1768		WRITE		wmb()			smp_wmb()
1769		READ		rmb()			smp_rmb()
1770		DATA DEPENDENCY	read_barrier_depends()	smp_read_barrier_depends()
1771	
1772	
1773	All memory barriers except the data dependency barriers imply a compiler
1774	barrier.  Data dependencies do not impose any additional compiler ordering.
1775	
1776	Aside: In the case of data dependencies, the compiler would be expected
1777	to issue the loads in the correct order (eg. `a[b]` would have to load
1778	the value of b before loading a[b]), however there is no guarantee in
1779	the C specification that the compiler may not speculate the value of b
1780	(eg. is equal to 1) and load a before b (eg. tmp = a[1]; if (b != 1)
1781	tmp = a[b]; ).  There is also the problem of a compiler reloading b after
1782	having loaded a[b], thus having a newer copy of b than a[b].  A consensus
1783	has not yet been reached about these problems, however the READ_ONCE()
1784	macro is a good place to start looking.
1785	
1786	SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
1787	systems because it is assumed that a CPU will appear to be self-consistent,
1788	and will order overlapping accesses correctly with respect to itself.
1789	However, see the subsection on "Virtual Machine Guests" below.
1790	
1791	[!] Note that SMP memory barriers _must_ be used to control the ordering of
1792	references to shared memory on SMP systems, though the use of locking instead
1793	is sufficient.
1794	
1795	Mandatory barriers should not be used to control SMP effects, since mandatory
1796	barriers impose unnecessary overhead on both SMP and UP systems. They may,
1797	however, be used to control MMIO effects on accesses through relaxed memory I/O
1798	windows.  These barriers are required even on non-SMP systems as they affect
1799	the order in which memory operations appear to a device by prohibiting both the
1800	compiler and the CPU from reordering them.
1801	
1802	
1803	There are some more advanced barrier functions:
1804	
1805	 (*) smp_store_mb(var, value)
1806	
1807	     This assigns the value to the variable and then inserts a full memory
1808	     barrier after it.  It isn't guaranteed to insert anything more than a
1809	     compiler barrier in a UP compilation.
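
     As a rough sketch, in an SMP build:

	smp_store_mb(var, 1);

     is conceptually equivalent to:

	WRITE_ONCE(var, 1);
	smp_mb();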
1810	
1811	
1812	 (*) smp_mb__before_atomic();
1813	 (*) smp_mb__after_atomic();
1814	
1815	     These are for use with atomic (such as add, subtract, increment and
1816	     decrement) functions that don't return a value, especially when used for
1817	     reference counting.  These functions do not imply memory barriers.
1818	
1819	     These are also used for atomic bitop functions that do not return a
1820	     value (such as set_bit and clear_bit).
1821	
1822	     As an example, consider a piece of code that marks an object as being dead
1823	     and then decrements the object's reference count:
1824	
1825		obj->dead = 1;
1826		smp_mb__before_atomic();
1827		atomic_dec(&obj->ref_count);
1828	
1829	     This makes sure that the death mark on the object is perceived to be set
1830	     *before* the reference counter is decremented.
1831	
1832	     See Documentation/atomic_ops.txt for more information.  See the "Atomic
1833	     operations" subsection for information on where to use these.
1834	
1835	
1836	 (*) lockless_dereference();
1837	
1838	     This can be thought of as a pointer-fetch wrapper around the
1839	     smp_read_barrier_depends() data-dependency barrier.
1840	
1841	     This is also similar to rcu_dereference(), but in cases where
1842	     object lifetime is handled by some mechanism other than RCU, for
1843	     example, when the objects are removed only when the system goes down.
1844	     In addition, lockless_dereference() is used in some data structures
1845	     that can be used both with and without RCU.
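
     A minimal usage sketch (the pointer and field names are
     illustrative only):

	p = lockless_dereference(gp);
	if (p)
		do_something_with(p->a);

     The embedded data-dependency barrier ensures that accesses made
     through 'p' do not see data older than the pointer value itself.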
1846	
1847	
1848	 (*) dma_wmb();
1849	 (*) dma_rmb();
1850	
1851	     These are for use with consistent memory to guarantee the ordering
1852	     of writes or reads of shared memory accessible to both the CPU and a
1853	     DMA capable device.
1854	
1855	     For example, consider a device driver that shares memory with a device
1856	     and uses a descriptor status value to indicate if the descriptor belongs
1857	     to the device or the CPU, and a doorbell to notify it when new
1858	     descriptors are available:
1859	
1860		if (desc->status != DEVICE_OWN) {
1861			/* do not read data until we own descriptor */
1862			dma_rmb();
1863	
1864			/* read/modify data */
1865			read_data = desc->data;
1866			desc->data = write_data;
1867	
1868			/* flush modifications before status update */
1869			dma_wmb();
1870	
1871			/* assign ownership */
1872			desc->status = DEVICE_OWN;
1873	
1874			/* force memory to sync before notifying device via MMIO */
1875			wmb();
1876	
1877			/* notify device of new descriptors */
1878			writel(DESC_NOTIFY, doorbell);
1879		}
1880	
1881	     The dma_rmb() allows us to guarantee the device has released ownership
1882	     before we read the data from the descriptor, and the dma_wmb() allows
1883	     us to guarantee the data is written to the descriptor before the device
1884	     can see it now has ownership.  The wmb() is needed to guarantee that the
1885	     cache coherent memory writes have completed before attempting a write to
1886	     the cache incoherent MMIO region.
1887	
1888	     See Documentation/DMA-API.txt for more information on consistent memory.
1889	
1890	MMIO WRITE BARRIER
1891	------------------
1892	
1893	The Linux kernel also has a special barrier for use with memory-mapped I/O
1894	writes:
1895	
1896		mmiowb();
1897	
1898	This is a variation on the mandatory write barrier that causes writes to weakly
1899	ordered I/O regions to be partially ordered.  Its effects may go beyond the
1900	CPU->Hardware interface and actually affect the hardware at some level.
1901	
1902	See the subsection "Acquires vs I/O accesses" for more information.
1903	
1904	
1905	===============================
1906	IMPLICIT KERNEL MEMORY BARRIERS
1907	===============================
1908	
1909	Some of the other functions in the Linux kernel imply memory barriers, amongst
1910	which are locking and scheduling functions.
1911	
1912	This specification is a _minimum_ guarantee; any particular architecture may
1913	provide more substantial guarantees, but these may not be relied upon outside
1914	of arch specific code.
1915	
1916	
1917	LOCK ACQUISITION FUNCTIONS
1918	--------------------------
1919	
1920	The Linux kernel has a number of locking constructs:
1921	
1922	 (*) spin locks
1923	 (*) R/W spin locks
1924	 (*) mutexes
1925	 (*) semaphores
1926	 (*) R/W semaphores
1927	
1928	In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
1929	for each construct.  These operations all imply certain barriers:
1930	
1931	 (1) ACQUIRE operation implication:
1932	
1933	     Memory operations issued after the ACQUIRE will be completed after the
1934	     ACQUIRE operation has completed.
1935	
1936	     Memory operations issued before the ACQUIRE may be completed after
1937	     the ACQUIRE operation has completed.  An smp_mb__before_spinlock(),
1938	     combined with a following ACQUIRE, orders prior stores against
1939	     subsequent loads and stores.  Note that this is weaker than smp_mb()!
1940	     The smp_mb__before_spinlock() primitive is free on many architectures.
1941	
1942	 (2) RELEASE operation implication:
1943	
1944	     Memory operations issued before the RELEASE will be completed before the
1945	     RELEASE operation has completed.
1946	
1947	     Memory operations issued after the RELEASE may be completed before the
1948	     RELEASE operation has completed.
1949	
1950	 (3) ACQUIRE vs ACQUIRE implication:
1951	
1952	     All ACQUIRE operations issued before another ACQUIRE operation will be
1953	     completed before that ACQUIRE operation.
1954	
1955	 (4) ACQUIRE vs RELEASE implication:
1956	
1957	     All ACQUIRE operations issued before a RELEASE operation will be
1958	     completed before the RELEASE operation.
1959	
1960	 (5) Failed conditional ACQUIRE implication:
1961	
1962	     Certain locking variants of the ACQUIRE operation may fail, either due to
1963	     being unable to get the lock immediately, or due to receiving an unblocked
1964	     signal whilst asleep waiting for the lock to become available.  Failed
1965	     locks do not imply any sort of barrier.
1966	
1967	[!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
1968	one-way barriers is that the effects of instructions outside of a critical
1969	section may seep into the inside of the critical section.
1970	
1971	An ACQUIRE followed by a RELEASE may not be assumed to be a full memory barrier
1972	because it is possible for an access preceding the ACQUIRE to happen after the
1973	ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and
1974	the two accesses can themselves then cross:
1975	
1976		*A = a;
1977		ACQUIRE M
1978		RELEASE M
1979		*B = b;
1980	
1981	may occur as:
1982	
1983		ACQUIRE M, STORE *B, STORE *A, RELEASE M
1984	
1985	When the ACQUIRE and RELEASE are a lock acquisition and release,
1986	respectively, this same reordering can occur if the lock's ACQUIRE and
1987	RELEASE are to the same lock variable, but only from the perspective of
1988	another CPU not holding that lock.  In short, an ACQUIRE followed by a
1989	RELEASE may -not- be assumed to be a full memory barrier.
1990	
1991	Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
1992	not imply a full memory barrier.  Therefore, the CPU's execution of the
1993	critical sections corresponding to the RELEASE and the ACQUIRE can cross,
1994	so that:
1995	
1996		*A = a;
1997		RELEASE M
1998		ACQUIRE N
1999		*B = b;
2000	
2001	could occur as:
2002	
2003		ACQUIRE N, STORE *B, STORE *A, RELEASE M
2004	
2005	It might appear that this reordering could introduce a deadlock.
2006	However, this cannot happen because if such a deadlock threatened,
2007	the RELEASE would simply complete, thereby avoiding the deadlock.
2008	
2009		Why does this work?
2010	
2011		One key point is that we are only talking about the CPU doing
2012		the reordering, not the compiler.  If the compiler (or, for
2013		that matter, the developer) switched the operations, deadlock
2014		-could- occur.
2015	
2016		But suppose the CPU reordered the operations.  In this case,
2017		the unlock precedes the lock in the assembly code.  The CPU
2018		simply elected to try executing the later lock operation first.
2019		If there is a deadlock, this lock operation will simply spin (or
2020		try to sleep, but more on that later).	The CPU will eventually
2021		execute the unlock operation (which preceded the lock operation
2022		in the assembly code), which will unravel the potential deadlock,
2023		allowing the lock operation to succeed.
2024	
2025		But what if the lock is a sleeplock?  In that case, the code will
2026		try to enter the scheduler, where it will eventually encounter
2027		a memory barrier, which will force the earlier unlock operation
2028		to complete, again unraveling the deadlock.  There might be
2029		a sleep-unlock race, but the locking primitive needs to resolve
2030		such races properly in any case.
2031	
2032	Locks and semaphores may not provide any guarantee of ordering on UP compiled
2033	systems, and so cannot be counted on in such a situation to actually achieve
2034	anything at all - especially with respect to I/O accesses - unless combined
2035	with interrupt disabling operations.
2036	
2037	See also the section on "Inter-CPU acquiring barrier effects".
2038	
2039	
2040	As an example, consider the following:
2041	
2042		*A = a;
2043		*B = b;
2044		ACQUIRE
2045		*C = c;
2046		*D = d;
2047		RELEASE
2048		*E = e;
2049		*F = f;
2050	
2051	The following sequence of events is acceptable:
2052	
2053		ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
2054	
2055		[+] Note that {*F,*A} indicates a combined access.
2056	
2057	But none of the following are:
2058	
2059		{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
2060		*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
2061		*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
2062		*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
2063	
2064	
2065	
2066	INTERRUPT DISABLING FUNCTIONS
2067	-----------------------------
2068	
2069	Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
2070	(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
2071	barriers are required in such a situation, they must be provided by some
2072	other means.
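
For example, a driver that must make a descriptor update visible to a
device before ringing a doorbell register still needs an explicit
barrier inside the interrupt-disabled region.  A hypothetical sketch,
reusing names from the dma_wmb() example above:

	local_irq_save(flags);
	desc->status = DEVICE_OWN;	/* normal, cache coherent memory */
	wmb();				/* irq-disable is only a compiler barrier */
	writel(DESC_NOTIFY, doorbell);	/* MMIO write to notify the device */
	local_irq_restore(flags);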
2073	
2074	
2075	SLEEP AND WAKE-UP FUNCTIONS
2076	---------------------------
2077	
2078	Sleeping and waking on an event flagged in global data can be viewed as an
2079	interaction between two pieces of data: the task state of the task waiting for
2080	the event and the global data used to indicate the event.  To make sure that
2081	these appear to happen in the right order, the primitives to begin the process
2082	of going to sleep, and the primitives to initiate a wake up imply certain
2083	barriers.
2084	
2085	Firstly, the sleeper normally follows something like this sequence of events:
2086	
2087		for (;;) {
2088			set_current_state(TASK_UNINTERRUPTIBLE);
2089			if (event_indicated)
2090				break;
2091			schedule();
2092		}
2093	
2094	A general memory barrier is interpolated automatically by set_current_state()
2095	after it has altered the task state:
2096	
2097		CPU 1
2098		===============================
2099		set_current_state();
2100		  smp_store_mb();
2101		    STORE current->state
2102		    <general barrier>
2103		LOAD event_indicated
2104	
2105	set_current_state() may be wrapped by:
2106	
2107		prepare_to_wait();
2108		prepare_to_wait_exclusive();
2109	
2110	which therefore also imply a general memory barrier after setting the state.
2111	The whole sequence above is available in various canned forms, all of which
2112	interpolate the memory barrier in the right place:
2113	
2114		wait_event();
2115		wait_event_interruptible();
2116		wait_event_interruptible_exclusive();
2117		wait_event_interruptible_timeout();
2118		wait_event_killable();
2119		wait_event_timeout();
2120		wait_on_bit();
2121		wait_on_bit_lock();
2122	
2123	
2124	Secondly, code that performs a wake up normally follows something like this:
2125	
2126		event_indicated = 1;
2127		wake_up(&event_wait_queue);
2128	
2129	or:
2130	
2131		event_indicated = 1;
2132		wake_up_process(event_daemon);
2133	
2134	A write memory barrier is implied by wake_up() and co.  if and only if they
2135	wake something up.  The barrier occurs before the task state is cleared, and so
2136	sits between the STORE to indicate the event and the STORE to set TASK_RUNNING:
2137	
2138		CPU 1				CPU 2
2139		===============================	===============================
2140		set_current_state();		STORE event_indicated
2141		  smp_store_mb();		wake_up();
2142		    STORE current->state	  <write barrier>
2143		    <general barrier>		  STORE current->state
2144		LOAD event_indicated
2145	
2146	To repeat, this write memory barrier is present if and only if something
2147	is actually awakened.  To see this, consider the following sequence of
2148	events, where X and Y are both initially zero:
2149	
2150		CPU 1				CPU 2
2151		===============================	===============================
2152		X = 1;				STORE event_indicated
2153		smp_mb();			wake_up();
2154		Y = 1;				wait_event(wq, Y == 1);
2155		wake_up();			  load from Y sees 1, no memory barrier
2156						load from X might see 0
2157	
2158	In contrast, if a wakeup does occur, CPU 2's load from X would be guaranteed
2159	to see 1.
2160	
2161	The available waker functions include:
2162	
2163		complete();
2164		wake_up();
2165		wake_up_all();
2166		wake_up_bit();
2167		wake_up_interruptible();
2168		wake_up_interruptible_all();
2169		wake_up_interruptible_nr();
2170		wake_up_interruptible_poll();
2171		wake_up_interruptible_sync();
2172		wake_up_interruptible_sync_poll();
2173		wake_up_locked();
2174		wake_up_locked_poll();
2175		wake_up_nr();
2176		wake_up_poll();
2177		wake_up_process();
2178	
2179	
2180	[!] Note that the memory barriers implied by the sleeper and the waker do _not_
2181	order multiple stores before the wake-up with respect to loads of those stored
2182	values after the sleeper has called set_current_state().  For instance, if the
2183	sleeper does:
2184	
2185		set_current_state(TASK_INTERRUPTIBLE);
2186		if (event_indicated)
2187			break;
2188		__set_current_state(TASK_RUNNING);
2189		do_something(my_data);
2190	
2191	and the waker does:
2192	
2193		my_data = value;
2194		event_indicated = 1;
2195		wake_up(&event_wait_queue);
2196	
2197	there's no guarantee that the change to event_indicated will be perceived by
2198	the sleeper as coming after the change to my_data.  In such a circumstance, the
2199	code on both sides must interpolate its own memory barriers between the
2200	separate data accesses.  Thus the above sleeper ought to do:
2201	
2202		set_current_state(TASK_INTERRUPTIBLE);
2203		if (event_indicated) {
2204			smp_rmb();
2205			do_something(my_data);
2206		}
2207	
2208	and the waker should do:
2209	
2210		my_data = value;
2211		smp_wmb();
2212		event_indicated = 1;
2213		wake_up(&event_wait_queue);
2214	
2215	
2216	MISCELLANEOUS FUNCTIONS
2217	-----------------------
2218	
2219	Other functions that imply barriers:
2220	
2221	 (*) schedule() and similar imply full memory barriers.
2222	
2223	
2224	===================================
2225	INTER-CPU ACQUIRING BARRIER EFFECTS
2226	===================================
2227	
2228	On SMP systems locking primitives give a more substantial form of barrier: one
2229	that does affect memory access ordering on other CPUs, within the context of
2230	conflict on any particular lock.
2231	
2232	
2233	ACQUIRES VS MEMORY ACCESSES
2234	---------------------------
2235	
2236	Consider the following: the system has a pair of spinlocks (M) and (Q), and
2237	three CPUs; then should the following sequence of events occur:
2238	
2239		CPU 1				CPU 2
2240		===============================	===============================
2241		WRITE_ONCE(*A, a);		WRITE_ONCE(*E, e);
2242		ACQUIRE M			ACQUIRE Q
2243		WRITE_ONCE(*B, b);		WRITE_ONCE(*F, f);
2244		WRITE_ONCE(*C, c);		WRITE_ONCE(*G, g);
2245		RELEASE M			RELEASE Q
2246		WRITE_ONCE(*D, d);		WRITE_ONCE(*H, h);
2247	
2248	Then there is no guarantee as to what order CPU 3 will see the accesses to *A
2249	through *H occur in, other than the constraints imposed by the separate locks
2250	on the separate CPUs.  It might, for example, see:
2251	
2252		*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
2253	
2254	But it won't see any of:
2255	
2256		*B, *C or *D preceding ACQUIRE M
2257		*A, *B or *C following RELEASE M
2258		*F, *G or *H preceding ACQUIRE Q
2259		*E, *F or *G following RELEASE Q
2260	
2261	
2262	
2263	ACQUIRES VS I/O ACCESSES
2264	------------------------
2265	
2266	Under certain circumstances (especially involving NUMA), I/O accesses within
2267	two spinlocked sections on two different CPUs may be seen as interleaved by the
2268	PCI bridge, because the PCI bridge does not necessarily participate in the
2269	cache-coherence protocol, and is therefore incapable of issuing the required
2270	read memory barriers.
2271	
2272	For example:
2273	
2274		CPU 1				CPU 2
2275		===============================	===============================
2276		spin_lock(Q);
2277		writel(0, ADDR);
2278		writel(1, DATA);
2279		spin_unlock(Q);
2280						spin_lock(Q);
2281						writel(4, ADDR);
2282						writel(5, DATA);
2283						spin_unlock(Q);
2284	
2285	may be seen by the PCI bridge as follows:
2286	
2287		STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
2288	
2289	which would probably cause the hardware to malfunction.
2290	
2291	
2292	What is necessary here is to intervene with an mmiowb() before dropping the
2293	spinlock, for example:
2294	
2295		CPU 1				CPU 2
2296		===============================	===============================
2297		spin_lock(Q);
2298		writel(0, ADDR);
2299		writel(1, DATA);
2300		mmiowb();
2301		spin_unlock(Q);
2302						spin_lock(Q);
2303						writel(4, ADDR);
2304						writel(5, DATA);
2305						mmiowb();
2306						spin_unlock(Q);
2307	
2308	this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
2309	before either of the stores issued on CPU 2.
2310	
2311	
2312	Furthermore, following a store by a load from the same device obviates the need
2313	for the mmiowb(), because the load forces the store to complete before the load
2314	is performed:
2315	
2316		CPU 1				CPU 2
2317		===============================	===============================
2318		spin_lock(Q);
2319		writel(0, ADDR);
2320		a = readl(DATA);
2321		spin_unlock(Q);
2322						spin_lock(Q);
2323						writel(4, ADDR);
2324						b = readl(DATA);
2325						spin_unlock(Q);
2326	
2327	
2328	See Documentation/DocBook/deviceiobook.tmpl for more information.
2329	
2330	
2331	=================================
2332	WHERE ARE MEMORY BARRIERS NEEDED?
2333	=================================
2334	
2335	Under normal operation, memory operation reordering is generally not going to
2336	be a problem as a single-threaded linear piece of code will still appear to
2337	work correctly, even if it's in an SMP kernel.  There are, however, four
2338	circumstances in which reordering definitely _could_ be a problem:
2339	
2340	 (*) Interprocessor interaction.
2341	
2342	 (*) Atomic operations.
2343	
2344	 (*) Accessing devices.
2345	
2346	 (*) Interrupts.
2347	
2348	
2349	INTERPROCESSOR INTERACTION
2350	--------------------------
2351	
2352	When there's a system with more than one processor, more than one CPU in the
2353	system may be working on the same data set at the same time.  This can cause
2354	synchronisation problems, and the usual way of dealing with them is to use
2355	locks.  Locks, however, are quite expensive, and so it may be preferable to
2356	operate without the use of a lock if at all possible.  In such a case
2357	operations that affect both CPUs may have to be carefully ordered to prevent
2358	a malfunction.
2359	
2360	Consider, for example, the R/W semaphore slow path.  Here a waiting process is
2361	queued on the semaphore, by virtue of it having a piece of its stack linked to
2362	the semaphore's list of waiting processes:
2363	
2364		struct rw_semaphore {
2365			...
2366			spinlock_t lock;
2367			struct list_head waiters;
2368		};
2369	
2370		struct rwsem_waiter {
2371			struct list_head list;
2372			struct task_struct *task;
2373		};
2374	
2375	To wake up a particular waiter, the up_read() or up_write() functions have to:
2376	
2377	 (1) read the next pointer from this waiter's record to know where the
2378	     next waiter record is;
2379	
2380	 (2) read the pointer to the waiter's task structure;
2381	
2382	 (3) clear the task pointer to tell the waiter it has been given the semaphore;
2383	
2384	 (4) call wake_up_process() on the task; and
2385	
2386	 (5) release the reference held on the waiter's task struct.
2387	
2388	In other words, it has to perform this sequence of events:
2389	
2390		LOAD waiter->list.next;
2391		LOAD waiter->task;
2392		STORE waiter->task;
2393		CALL wakeup
2394		RELEASE task
2395	
2396	and if any of these steps occur out of order, then the whole thing may
2397	malfunction.
2398	
2399	Once it has queued itself and dropped the semaphore lock, the waiter does not
2400	get the lock again; it instead just waits for its task pointer to be cleared
2401	before proceeding.  Since the record is on the waiter's stack, this means that
2402	if the task pointer is cleared _before_ the next pointer in the list is read,
2403	another CPU might start processing the waiter and might clobber the waiter's
2404	stack before the up*() function has a chance to read the next pointer.
2405	
2406	Consider then what might happen to the above sequence of events:
2407	
2408		CPU 1				CPU 2
2409		===============================	===============================
2410						down_xxx()
2411						Queue waiter
2412						Sleep
2413		up_yyy()
2414		LOAD waiter->task;
2415		STORE waiter->task;
2416						Woken up by other event
2417		<preempt>
2418						Resume processing
2419						down_xxx() returns
2420						call foo()
2421						foo() clobbers *waiter
2422		</preempt>
2423		LOAD waiter->list.next;
2424		--- OOPS ---
2425	
2426	This could be dealt with using the semaphore lock, but then the down_xxx()
2427	function has to needlessly get the spinlock again after being woken up.
2428	
2429	The way to deal with this is to insert a general SMP memory barrier:
2430	
2431		LOAD waiter->list.next;
2432		LOAD waiter->task;
2433		smp_mb();
2434		STORE waiter->task;
2435		CALL wakeup
2436		RELEASE task
2437	
2438	In this case, the barrier makes a guarantee that all memory accesses before the
2439	barrier will appear to happen before all the memory accesses after the barrier
2440	with respect to the other CPUs on the system.  It does _not_ guarantee that all
2441	the memory accesses before the barrier will be complete by the time the barrier
2442	instruction itself is complete.
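
Expressed as C code, the corrected wake-up path might look like the
following sketch (illustrative only; this is not the kernel's actual
rwsem implementation):

	static struct list_head *wake_one_waiter(struct rwsem_waiter *waiter)
	{
		/* LOAD waiter->list.next and LOAD waiter->task */
		struct list_head *next = waiter->list.next;
		struct task_struct *tsk = waiter->task;

		/* order the two loads above before the store below */
		smp_mb();

		/* STORE waiter->task: the waiter may now proceed */
		waiter->task = NULL;

		/* CALL wakeup, then RELEASE task */
		wake_up_process(tsk);
		put_task_struct(tsk);

		/* safe: read before the waiter's stack could be reused */
		return next;
	}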
2443	
2444	On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2445	compiler barrier, thus making sure the compiler emits the instructions in the
2446	right order without actually intervening in the CPU.  Since there's only one
2447	CPU, that CPU's dependency ordering logic will take care of everything else.
2448	
2449	
2450	ATOMIC OPERATIONS
2451	-----------------
2452	
2453	Whilst they are technically interprocessor interaction considerations, atomic
2454	operations are noted specially as some of them imply full memory barriers and
2455	some don't, but they're very heavily relied on as a group throughout the
2456	kernel.
2457	
2458	Any atomic operation that modifies some state in memory and returns information
2459	about the state (old or new) implies an SMP-conditional general memory barrier
2460	(smp_mb()) on each side of the actual operation (with the exception of
2461	explicit lock operations, described later).  These include:
2462	
2463		xchg();
2464		atomic_xchg();			atomic_long_xchg();
2465		atomic_inc_return();		atomic_long_inc_return();
2466		atomic_dec_return();		atomic_long_dec_return();
2467		atomic_add_return();		atomic_long_add_return();
2468		atomic_sub_return();		atomic_long_sub_return();
2469		atomic_inc_and_test();		atomic_long_inc_and_test();
2470		atomic_dec_and_test();		atomic_long_dec_and_test();
2471		atomic_sub_and_test();		atomic_long_sub_and_test();
2472		atomic_add_negative();		atomic_long_add_negative();
2473		test_and_set_bit();
2474		test_and_clear_bit();
2475		test_and_change_bit();
2476	
2477		/* when succeeds */
2478		cmpxchg();
2479		atomic_cmpxchg();		atomic_long_cmpxchg();
2480		atomic_add_unless();		atomic_long_add_unless();
2481	
2482	These are used for such things as implementing ACQUIRE-class and RELEASE-class
2483	operations and adjusting reference counters towards object destruction, and as
2484	such the implicit memory barrier effects are necessary.
2485	
2486	
2487	The following operations are potential problems as they do _not_ imply memory
2488	barriers, but might be used for implementing such things as RELEASE-class
2489	operations:
2490	
2491		atomic_set();
2492		set_bit();
2493		clear_bit();
2494		change_bit();
2495	
2496	With these the appropriate explicit memory barrier should be used if necessary
2497	(smp_mb__before_atomic() for instance).
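
For example, a sketch of a RELEASE-class use of clear_bit(), in which
the 'flags' word and the OBJ_IN_USE bit are hypothetical:

	/* make the object's updates visible before releasing it */
	obj->result = value;
	smp_mb__before_atomic();
	clear_bit(OBJ_IN_USE, &obj->flags);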
2498	
2499	
2500	The following also do _not_ imply memory barriers, and so may require explicit
2501	memory barriers under some circumstances (smp_mb__before_atomic() for
2502	instance):
2503	
2504		atomic_add();
2505		atomic_sub();
2506		atomic_inc();
2507		atomic_dec();
2508	
2509	If they're used for statistics generation, then they probably don't need memory
2510	barriers, unless there's a coupling between statistical data.
2511	
2512	If they're used for reference counting on an object to control its lifetime,
2513	they probably don't need memory barriers because either the reference count
2514	will be adjusted inside a locked section, or the caller will already hold
2515	sufficient references to make the lock, and thus a memory barrier, unnecessary.
2516	
2517	If they're used for constructing a lock of some description, then they probably
2518	do need memory barriers as a lock primitive generally has to do things in a
2519	specific order.
2520	
2521	Basically, each usage case has to be carefully considered as to whether memory
2522	barriers are needed or not.
2523	
2524	The following operations are special locking primitives:
2525	
2526		test_and_set_bit_lock();
2527		clear_bit_unlock();
2528		__clear_bit_unlock();
2529	
2530	These implement ACQUIRE-class and RELEASE-class operations.  These should be
2531	used in preference to other operations when implementing locking primitives,
2532	because their implementations can be optimised on many architectures.
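
For illustration, a minimal bit-lock sketch built on these primitives
(real code would normally use the existing bit spinlock helpers rather
than open-coding this):

	/* ACQUIRE: spin until bit 0 of 'word' is ours */
	while (test_and_set_bit_lock(0, &word))
		cpu_relax();

	/* ... critical section ... */

	/* RELEASE: orders the critical section before dropping the bit */
	clear_bit_unlock(0, &word);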
2533	
2534	[!] Note that special memory barrier primitives are available for these
2535	situations because on some CPUs the atomic instructions used imply full memory
2536	barriers, and so barrier instructions are superfluous in conjunction with them,
2537	and in such cases the special barrier primitives will be no-ops.
2538	
2539	See Documentation/atomic_ops.txt for more information.
2540	
2541	
2542	ACCESSING DEVICES
2543	-----------------
2544	
2545	Many devices can be memory mapped, and so appear to the CPU as if they're just
2546	a set of memory locations.  To control such a device, the driver usually has to
2547	make the right memory accesses in exactly the right order.
2548	
2549	However, having a clever CPU or a clever compiler creates a potential problem
2550	in that the carefully sequenced accesses in the driver code won't reach the
2551	device in the requisite order if the CPU or the compiler thinks it is more
2552	efficient to reorder, combine or merge accesses - something that would cause
2553	the device to malfunction.
2554	
2555	Inside of the Linux kernel, I/O should be done through the appropriate accessor
2556	routines - such as inb() or writel() - which know how to make such accesses
2557	appropriately sequential.  Whilst this, for the most part, renders the explicit
2558	use of memory barriers unnecessary, there are a couple of situations where they
2559	might be needed:
2560	
2561	 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
2562	     so for _all_ general drivers locks should be used and mmiowb() must be
2563	     issued prior to unlocking the critical section.
2564	
2565	 (2) If the accessor functions are used to refer to an I/O memory window with
2566	     relaxed memory access properties, then _mandatory_ memory barriers are
2567	     required to enforce ordering.
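
As a sketch of case (2), assuming a write-combining window obtained with
ioremap_wc() (the buffer, register and constant names are made up):

	void __iomem *fb = ioremap_wc(fb_phys, fb_size);	/* relaxed (WC) window */

	memcpy_toio(fb, frame, frame_size);	/* writes may be combined or reordered */
	wmb();					/* mandatory barrier before the doorbell */
	writel(FRAME_READY, regs + DOORBELL);	/* strongly ordered doorbell register */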
2568	
2569	See Documentation/DocBook/deviceiobook.tmpl for more information.
2570	
2571	
2572	INTERRUPTS
2573	----------
2574	
2575	A driver may be interrupted by its own interrupt service routine, and thus the
2576	two parts of the driver may interfere with each other's attempts to control or
2577	access the device.
2578	
2579	This may be alleviated - at least in part - by disabling local interrupts (a
2580	form of locking), such that the critical operations are all contained within
2581	the interrupt-disabled section in the driver.  Whilst the driver's interrupt
2582	routine is executing, the driver's core may not run on the same CPU, and its
2583	interrupt is not permitted to happen again until the current interrupt has been
2584	handled, thus the interrupt handler does not need to lock against that.
2585	
2586	However, consider a driver that was talking to an ethernet card that sports an
2587	address register and a data register.  If that driver's core talks to the card
2588	under interrupt-disablement and then the driver's interrupt handler is invoked:
2589	
2590		LOCAL IRQ DISABLE
2591		writew(ADDR, 3);
2592		writew(DATA, y);
2593		LOCAL IRQ ENABLE
2594		<interrupt>
2595		writew(ADDR, 4);
2596		q = readw(DATA);
2597		</interrupt>
2598	
2599	The store to the data register might happen after the second store to the
2600	address register if ordering rules are sufficiently relaxed:
2601	
2602		STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
2603	
2604	
2605	If ordering rules are relaxed, it must be assumed that accesses done inside an
2606	interrupt disabled section may leak outside of it and may interleave with
2607	accesses performed in an interrupt - and vice versa - unless implicit or
2608	explicit barriers are used.
2609	
2610	Normally this won't be a problem because the I/O accesses done inside such
2611	sections will include synchronous load operations on strictly ordered I/O
2612	registers that form implicit I/O barriers.  If this isn't sufficient then an
2613	mmiowb() may need to be used explicitly.
2614	
2615	
2616	A similar situation may occur between an interrupt routine and two routines
2617	running on separate CPUs that communicate with each other.  If such a case is
2618	likely, then interrupt-disabling locks should be used to guarantee ordering.
2619	
2620	
2621	==========================
2622	KERNEL I/O BARRIER EFFECTS
2623	==========================
2624	
2625	When accessing I/O memory, drivers should use the appropriate accessor
2626	functions:
2627	
2628	 (*) inX(), outX():
2629	
2630	     These are intended to talk to I/O space rather than memory space, but
2631	     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
2632	     do indeed have special I/O space access cycles and instructions, but many
2633	     CPUs don't have such a concept.
2634	
2635	     The PCI bus, amongst others, defines an I/O space concept which - on such
2636	     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
2637	     space.  However, it may also be mapped as a virtual I/O space in the CPU's
2638	     memory map, particularly on those CPUs that don't support alternate I/O
2639	     spaces.
2640	
2641	     Accesses to this space may be fully synchronous (as on i386), but
2642	     intermediary bridges (such as the PCI host bridge) may not fully honour
2643	     that.
2644	
2645	     They are guaranteed to be fully ordered with respect to each other.
2646	
2647	     They are not guaranteed to be fully ordered with respect to other types of
2648	     memory and I/O operation.
2649	
2650	 (*) readX(), writeX():
2651	
2652	     Whether these are guaranteed to be fully ordered and uncombined with
2653	     respect to each other on the issuing CPU depends on the characteristics
2654	     defined for the memory window through which they're accessing.  On later
2655	     i386 architecture machines, for example, this is controlled by way of the
2656	     MTRR registers.
2657	
2658	     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
2659	     provided they're not accessing a prefetchable device.
2660	
2661	     However, intermediary hardware (such as a PCI bridge) may indulge in
2662	     deferral if it so wishes; to flush a store, a load from the same location
2663	     is preferred[*], but a load from the same device or from configuration
2664	     space should suffice for PCI.
2665	
2666	     [*] NOTE! attempting to load from the same location as was written to may
2667		 cause a malfunction - consider the 16550 Rx/Tx serial registers for
2668		 example.
2669	
2670	     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
2671	     force stores to be ordered.
2672	
2673	     Please refer to the PCI specification for more information on interactions
2674	     between PCI transactions.
2675	
2676	 (*) readX_relaxed(), writeX_relaxed()
2677	
2678	     These are similar to readX() and writeX(), but provide weaker memory
2679	     ordering guarantees.  Specifically, they do not guarantee ordering with
2680	     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
2681	     ordering with respect to LOCK or UNLOCK operations.  If the latter is
2682	     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
2683	     the same peripheral are guaranteed to be ordered with respect to each
2684	     other.
2685	
2686	 (*) ioreadX(), iowriteX()
2687	
2688	     These will perform appropriately for the type of access they're actually
2689	     doing, be it inX()/outX() or readX()/writeX().
2690	
2691	
2692	========================================
2693	ASSUMED MINIMUM EXECUTION ORDERING MODEL
2694	========================================
2695	
2696	It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2697	maintain the appearance of program causality with respect to itself.  Some CPUs
2698	(such as i386 or x86_64) are more constrained than others (such as powerpc or
2699	frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
2700	of arch-specific code.
2701	
2702	This means that it must be considered that the CPU will execute its instruction
2703	stream in any order it feels like - or even in parallel - provided that if an
2704	instruction in the stream depends on an earlier instruction, then that
2705	earlier instruction must be sufficiently complete[*] before the later
2706	instruction may proceed; in other words: provided that the appearance of
2707	causality is maintained.
2708	
2709	 [*] Some instructions have more than one effect - such as changing the
2710	     condition codes, changing registers or changing memory - and different
2711	     instructions may depend on different effects.
2712	
2713	A CPU may also discard any instruction sequence that winds up having no
2714	ultimate effect.  For example, if two adjacent instructions both load an
2715	immediate value into the same register, the first may be discarded.
2716	
2717	
2718	Similarly, it has to be assumed that compiler might reorder the instruction
2719	stream in any way it sees fit, again provided the appearance of causality is
2720	maintained.
2721	
2722	
2723	============================
2724	THE EFFECTS OF THE CPU CACHE
2725	============================
2726	
2727	The way cached memory operations are perceived across the system is affected to
2728	a certain extent by the caches that lie between CPUs and memory, and by the
2729	memory coherence system that maintains the consistency of state in the system.
2730	
2731	As far as the way a CPU interacts with another part of the system through the
2732	caches goes, the memory system has to include the CPU's caches, and memory
2733	barriers for the most part act at the interface between the CPU and its cache
2734	(memory barriers logically act on the dotted line in the following diagram):
2735	
2736		    <--- CPU --->         :       <----------- Memory ----------->
2737		                          :
2738		+--------+    +--------+  :   +--------+    +-----------+
2739		|        |    |        |  :   |        |    |           |    +--------+
2740		|  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
2741		|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2742		|        |    | Queue  |  :   |        |    |           |--->| Memory |
2743		|        |    |        |  :   |        |    |           |    |        |
2744		+--------+    +--------+  :   +--------+    |           |    |        |
2745		                          :                 | Cache     |    +--------+
2746		                          :                 | Coherency |
2747		                          :                 | Mechanism |    +--------+
2748		+--------+    +--------+  :   +--------+    |           |    |        |
2749		|        |    |        |  :   |        |    |           |    |        |
2750		|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
2751		|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2752		|        |    | Queue  |  :   |        |    |           |    |        |
2753		|        |    |        |  :   |        |    |           |    +--------+
2754		+--------+    +--------+  :   +--------+    +-----------+
2755		                          :
2756		                          :
2757	
2758	Although any particular load or store may not actually appear outside of the
2759	CPU that issued it since it may have been satisfied within the CPU's own cache,
2760	it will still appear as if the full memory access had taken place as far as the
2761	other CPUs are concerned since the cache coherency mechanisms will migrate the
2762	cacheline over to the accessing CPU and propagate the effects upon conflict.
2763	
2764	The CPU core may execute instructions in any order it deems fit, provided the
2765	expected program causality appears to be maintained.  Some of the instructions
2766	generate load and store operations which then go into the queue of memory
2767	accesses to be performed.  The core may place these in the queue in any order
2768	it wishes, and continue execution until it is forced to wait for an instruction
2769	to complete.
2770	
2771	What memory barriers are concerned with is controlling the order in which
2772	accesses cross from the CPU side of things to the memory side of things, and
2773	the order in which the effects are perceived to happen by the other observers
2774	in the system.
2775	
2776	[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
2777	their own loads and stores as if they had happened in program order.
2778	
2779	[!] MMIO or other device accesses may bypass the cache system.  This depends on
2780	the properties of the memory window through which devices are accessed and/or
2781	the use of any special device communication instructions the CPU may have.
2782	
2783	
2784	CACHE COHERENCY
2785	---------------
2786	
2787	Life isn't quite as simple as it may appear above, however: for while the
2788	caches are expected to be coherent, there's no guarantee that that coherency
2789	will be ordered.  This means that whilst changes made on one CPU will
2790	eventually become visible on all CPUs, there's no guarantee that they will
2791	become apparent in the same order on those other CPUs.
2792	
2793	
2794	Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
2795	has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
2796	
2797		            :
2798		            :                          +--------+
2799		            :      +---------+         |        |
2800		+--------+  : +--->| Cache A |<------->|        |
2801		|        |  : |    +---------+         |        |
2802		|  CPU 1 |<---+                        |        |
2803		|        |  : |    +---------+         |        |
2804		+--------+  : +--->| Cache B |<------->|        |
2805		            :      +---------+         |        |
2806		            :                          | Memory |
2807		            :      +---------+         | System |
2808		+--------+  : +--->| Cache C |<------->|        |
2809		|        |  : |    +---------+         |        |
2810		|  CPU 2 |<---+                        |        |
2811		|        |  : |    +---------+         |        |
2812		+--------+  : +--->| Cache D |<------->|        |
2813		            :      +---------+         |        |
2814		            :                          +--------+
2815		            :
2816	
2817	Imagine the system has the following properties:
2818	
2819	 (*) an odd-numbered cache line may be in cache A, cache C or it may still be
2820	     resident in memory;
2821	
2822	 (*) an even-numbered cache line may be in cache B, cache D or it may still be
2823	     resident in memory;
2824	
2825	 (*) whilst the CPU core is interrogating one cache, the other cache may be
2826	     making use of the bus to access the rest of the system - perhaps to
2827	     displace a dirty cacheline or to do a speculative load;
2828	
2829	 (*) each cache has a queue of operations that need to be applied to that cache
2830	     to maintain coherency with the rest of the system;
2831	
2832	 (*) the coherency queue is not flushed by normal loads to lines already
2833	     present in the cache, even though the contents of the queue may
2834	     potentially affect those loads.
2835	
2836	Imagine, then, that two writes are made on the first CPU, with a write barrier
2837	between them to guarantee that they will appear to reach that CPU's caches in
2838	the requisite order:
2839	
2840		CPU 1		CPU 2		COMMENT
2841		===============	===============	=======================================
2842						u == 0, v == 1 and p == &u, q == &u
2843		v = 2;
2844		smp_wmb();			Make sure change to v is visible before
2845						 change to p
2846		<A:modify v=2>			v is now in cache A exclusively
2847		p = &v;
2848		<B:modify p=&v>			p is now in cache B exclusively
2849	
2850	The write memory barrier forces the other CPUs in the system to perceive that
2851	the local CPU's caches have apparently been updated in the correct order.  But
2852	now imagine that the second CPU wants to read those values:
2853	
2854		CPU 1		CPU 2		COMMENT
2855		===============	===============	=======================================
2856		...
2857				q = p;
2858				x = *q;
2859	
2860	The above pair of reads may then fail to happen in the expected order, as the
2861	cacheline holding p may get updated in one of the second CPU's caches whilst
2862	the update to the cacheline holding v is delayed in the other of the second
2863	CPU's caches by some other cache event:
2864	
2865		CPU 1		CPU 2		COMMENT
2866		===============	===============	=======================================
2867						u == 0, v == 1 and p == &u, q == &u
2868		v = 2;
2869		smp_wmb();
2870		<A:modify v=2>	<C:busy>
2871				<C:queue v=2>
2872		p = &v;		q = p;
2873				<D:request p>
2874		<B:modify p=&v>	<D:commit p=&v>
2875				<D:read p>
2876				x = *q;
2877				<C:read *q>	Reads from v before v updated in cache
2878				<C:unbusy>
2879				<C:commit v=2>
2880	
2881	Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
2882	no guarantee that, without intervention, the order of update will be the same
2883	as that committed on CPU 1.
2884	
2885	
2886	To intervene, we need to interpolate a data dependency barrier or a read
2887	barrier between the loads.  This will force the cache to commit its coherency
2888	queue before processing any further requests:
2889	
2890		CPU 1		CPU 2		COMMENT
2891		===============	===============	=======================================
2892						u == 0, v == 1 and p == &u, q == &u
2893		v = 2;
2894		smp_wmb();
2895		<A:modify v=2>	<C:busy>
2896				<C:queue v=2>
2897		p = &v;		q = p;
2898				<D:request p>
2899		<B:modify p=&v>	<D:commit p=&v>
2900				<D:read p>
2901				smp_read_barrier_depends()
2902				<C:unbusy>
2903				<C:commit v=2>
2904				x = *q;
2905				<C:read *q>	Reads from v after v updated in cache
2906	
2907	
2908	This sort of problem can be encountered on DEC Alpha processors as they have a
2909	split cache that improves performance by making better use of the data bus.
2910	Whilst most CPUs do imply a data dependency barrier on the read when a memory
2911	access depends on a read, not all do, so it may not be relied on.
2912	
2913	Other CPUs may also have split caches, but must coordinate between the various
2914	cachelets for normal memory accesses.  The semantics of the Alpha removes the
2915	need for coordination in the absence of memory barriers.
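
Expressed as code rather than as the event traces above, this is the familiar
pointer-publication pattern (a minimal sketch using the same u, v, p, q and x;
if CPU 2 observes q == &v, the barriers guarantee that x == 2):

	CPU 1				CPU 2
	===============			===============
	v = 2;
	smp_wmb();
	p = &v;				q = p;
					smp_read_barrier_depends();
					x = *q;

On most architectures smp_read_barrier_depends() compiles to nothing; only
DEC Alpha emits a real barrier here.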
2916	
2917	
2918	CACHE COHERENCY VS DMA
2919	----------------------
2920	
2921	Not all systems maintain cache coherency with respect to devices doing DMA.  In
2922	such cases, a device attempting DMA may obtain stale data from RAM because
2923	dirty cache lines may be resident in the caches of various CPUs, and may not
2924	have been written back to RAM yet.  To deal with this, the appropriate part of
2925	the kernel must flush the overlapping bits of cache on each CPU (and maybe
2926	invalidate them as well).
2927	
2928	In addition, the data DMA'd to RAM by a device may be overwritten by dirty
2929	cache lines being written back to RAM from a CPU's cache after the device has
2930	installed its own data, or cache lines present in the CPU's cache may simply
2931	obscure the fact that RAM has been updated, until such time as the cacheline
2932	is discarded from the CPU's cache and reloaded.  To deal with this, the
2933	appropriate part of the kernel must invalidate the overlapping bits of the
2934	cache on each CPU.
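
On platforms using the streaming DMA API, this cache maintenance is normally
done on the kernel's behalf by the DMA mapping functions rather than by hand.
A rough sketch, assuming dev, buf and len have been set up elsewhere (on
cache-coherent systems these calls may be no-ops):

	dma_addr_t handle;

	/* CPU -> device: write back dirty cachelines covering the buffer */
	handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	/* ... instruct the device to DMA from 'handle' ... */
	dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);

	/* device -> CPU: discard stale cachelines before the CPU looks */
	handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	/* ... instruct the device to DMA into 'handle' and wait ... */
	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);

Documentation/DMA-API.txt describes the full rules.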
2935	
2936	See Documentation/cachetlb.txt for more information on cache management.
2937	
2938	
2939	CACHE COHERENCY VS MMIO
2940	-----------------------
2941	
2942	Memory mapped I/O usually takes place through memory locations that are part
2943	of a window in the CPU's memory space that has different properties assigned
2944	to it than the usual RAM-directed window.
2945	
2946	Amongst these properties is usually the fact that such accesses bypass the
2947	caching entirely and go directly to the device buses.  This means MMIO accesses
2948	may, in effect, overtake accesses to cached memory that were emitted earlier.
2949	A memory barrier isn't sufficient in such a case, but rather the cache must be
2950	flushed between the cached memory write and the MMIO access if the two are in
2951	any way dependent.
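
As a rough illustration of the pattern (names here are hypothetical: desc is
assumed to live in a streaming DMA buffer mapped at desc_dma, and doorbell is
an ioremap()ed device register), the cache flush is provided by the DMA sync
call rather than by a memory barrier:

	desc->addr = buf_dma;
	desc->len  = buf_len;

	/* push the descriptor out of the CPU cache so the device can see it */
	dma_sync_single_for_device(dev, desc_dma, sizeof(*desc), DMA_TO_DEVICE);

	/* the MMIO write can no longer overtake data stuck in the cache */
	writel(1, doorbell);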
2952	
2953	
2954	=========================
2955	THE THINGS CPUS GET UP TO
2956	=========================
2957	
2958	A programmer might take it for granted that the CPU will perform memory
2959	operations in exactly the order specified, so that if the CPU is, for example,
2960	given the following piece of code to execute:
2961	
2962		a = READ_ONCE(*A);
2963		WRITE_ONCE(*B, b);
2964		c = READ_ONCE(*C);
2965		d = READ_ONCE(*D);
2966		WRITE_ONCE(*E, e);
2967	
2968	they would then expect that the CPU will complete the memory operation for each
2969	instruction before moving on to the next one, leading to a definite sequence of
2970	operations as seen by external observers in the system:
2971	
2972		LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
2973	
2974	
2975	Reality is, of course, much messier.  With many CPUs and compilers, the above
2976	assumption doesn't hold because:
2977	
2978	 (*) loads are more likely to need to be completed immediately to permit
2979	     execution progress, whereas stores can often be deferred without a
2980	     problem;
2981	
2982	 (*) loads may be done speculatively, and the result discarded should it prove
2983	     to have been unnecessary;
2984	
2985	 (*) loads may be done speculatively, leading to the result having been fetched
2986	     at the wrong time in the expected sequence of events;
2987	
2988	 (*) the order of the memory accesses may be rearranged to promote better use
2989	     of the CPU buses and caches;
2990	
2991	 (*) loads and stores may be combined to improve performance when talking to
2992	     memory or I/O hardware that can do batched accesses of adjacent locations,
2993	     thus cutting down on transaction setup costs (memory and PCI devices may
2994	     both be able to do this); and
2995	
2996	 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
2997	     mechanisms may alleviate this - once the store has actually hit the cache
2998	     - there's no guarantee that the coherency management will be propagated in
2999	     order to other CPUs.
3000	
3001	So what another CPU, say, might actually observe from the above piece of code
3002	is:
3003	
3004		LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B
3005	
3006		(Where "LOAD {*C,*D}" is a combined load)
3007	
3008	
3009	However, it is guaranteed that a CPU will be self-consistent: it will see its
3010	_own_ accesses appear to be correctly ordered, without the need for a memory
3011	barrier.  For instance with the following code:
3012	
3013		U = READ_ONCE(*A);
3014		WRITE_ONCE(*A, V);
3015		WRITE_ONCE(*A, W);
3016		X = READ_ONCE(*A);
3017		WRITE_ONCE(*A, Y);
3018		Z = READ_ONCE(*A);
3019	
3020	and assuming no intervention by an external influence, it can be assumed that
3021	the final result will appear to be:
3022	
3023		U == the original value of *A
3024		X == W
3025		Z == Y
3026		*A == Y
3027	
3028	The code above may cause the CPU to generate the full sequence of memory
3029	accesses:
3030	
3031		U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A
3032	
3033	in that order, but, without intervention, the sequence may have almost any
3034	combination of elements combined or discarded, provided the program's view
3035	of the world remains consistent.  Note that READ_ONCE() and WRITE_ONCE()
3036	are -not- optional in the above example, as there are architectures
3037	where a given CPU might reorder successive loads to the same location.
3038	On such architectures, READ_ONCE() and WRITE_ONCE() do whatever is
3039	necessary to prevent this; for example, on Itanium the volatile casts
3040	used by READ_ONCE() and WRITE_ONCE() cause GCC to emit the special ld.acq
3041	and st.rel instructions (respectively) that prevent such reordering.
3042	
3043	The compiler may also combine, discard or defer elements of the sequence before
3044	the CPU even sees them.
3045	
3046	For instance:
3047	
3048		*A = V;
3049		*A = W;
3050	
3051	may be reduced to:
3052	
3053		*A = W;
3054	
3055	since, without either a write barrier or a WRITE_ONCE(), it can be
3056	assumed that the effect of the storage of V to *A is lost.  Similarly:
3057	
3058		*A = Y;
3059		Z = *A;
3060	
3061	may, without a memory barrier or a READ_ONCE() and WRITE_ONCE(), be
3062	reduced to:
3063	
3064		*A = Y;
3065		Z = Y;
3066	
3067	and the LOAD operation need never appear outside of the CPU.
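
If the store of V and the load from *A must actually be emitted, READ_ONCE()
and WRITE_ONCE() can be used to tell the compiler so (a minimal sketch):

	WRITE_ONCE(*A, V);	/* both stores must now be emitted,	*/
	WRITE_ONCE(*A, W);	/* and in this order			*/

	WRITE_ONCE(*A, Y);
	Z = READ_ONCE(*A);	/* the load must now be emitted		*/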
3068	
3069	
3070	AND THEN THERE'S THE ALPHA
3071	--------------------------
3072	
3073	The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
3074	some versions of the Alpha CPU have a split data cache, permitting them to have
3075	two semantically-related cache lines updated at separate times.  This is where
3076	the data dependency barrier really becomes necessary as this synchronises both
3077	caches with the memory coherence system, thus making it seem like pointer
3078	changes vs new data occur in the right order.
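
In practice the data dependency barrier is rarely written out by hand on the
reader side; it is folded into helpers such as rcu_dereference().  A minimal
sketch, assuming gp is an RCU-protected pointer published elsewhere with
rcu_assign_pointer() and that the pointed-to structure has a data field:

	rcu_read_lock();
	p = rcu_dereference(gp);	/* includes the dependency barrier */
	if (p)
		val = p->data;		/* sees the data stored before publication */
	rcu_read_unlock();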
3079	
3080	The Alpha defines the Linux kernel's memory barrier model.
3081	
3082	See the subsection on "Cache Coherency" above.
3083	
3084	
3085	VIRTUAL MACHINE GUESTS
3086	----------------------
3087	
3088	Guests running within virtual machines might be affected by SMP effects even if
3089	the guest itself is compiled without SMP support.  This is an artifact of
3090	interfacing with an SMP host while running a UP kernel.  Using mandatory
3091	barriers for this use-case would be possible but is often suboptimal.
3092	
3093	To handle this case optimally, low-level virt_mb() etc macros are available.
3094	These have the same effect as smp_mb() etc when SMP is enabled, but generate
3095	identical code for SMP and non-SMP systems.  For example, virtual machine guests
3096	should use virt_mb() rather than smp_mb() when synchronizing against a
3097	(possibly SMP) host.
3098	
3099	These are equivalent to smp_mb() etc counterparts in all other respects,
3100	in particular, they do not control MMIO effects: to control
3101	MMIO effects, use mandatory barriers.
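
For example, a guest publishing a descriptor into a ring shared with a
(possibly SMP) host might do something along these lines (a rough sketch;
the ring layout and field names are hypothetical, not a real virtio API):

	ring->desc[idx] = desc;		/* fill in the descriptor		*/
	virt_wmb();			/* make it visible to the host ...	*/
	ring->avail_idx = idx + 1;	/* ... before publishing the new index	*/

On the consuming side, virt_rmb() would similarly be placed between reading
the published index and reading the descriptors it covers.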
3102	
3103	
3104	============
3105	EXAMPLE USES
3106	============
3107	
3108	CIRCULAR BUFFERS
3109	----------------
3110	
3111	Memory barriers can be used to implement circular buffering without the need
3112	of a lock to serialise the producer with the consumer.  See:
3113	
3114		Documentation/circular-buffers.txt
3115	
3116	for details.
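
The gist of the pattern described there, as a minimal sketch (size is a power
of two, head and tail are the shared indices, CIRC_SPACE()/CIRC_CNT() come
from <linux/circ_buf.h>, and produce_item()/consume_item() are hypothetical
helpers):

	/* producer */
	unsigned long head = buf->head;
	unsigned long tail = READ_ONCE(buf->tail);

	if (CIRC_SPACE(head, tail, size) >= 1) {
		produce_item(&buf->items[head]);
		smp_store_release(&buf->head, (head + 1) & (size - 1));
	}

	/* consumer */
	unsigned long head = smp_load_acquire(&buf->head);
	unsigned long tail = buf->tail;

	if (CIRC_CNT(head, tail, size) >= 1) {
		consume_item(&buf->items[tail]);
		smp_store_release(&buf->tail, (tail + 1) & (size - 1));
	}

The smp_store_release() on the producer side makes the item visible before
the new head index, and the smp_load_acquire() on the consumer side ensures
the index is read before the item it covers.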
3117	
3118	
3119	==========
3120	REFERENCES
3121	==========
3122	
3123	Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
3124	Digital Press)
3125		Chapter 5.2: Physical Address Space Characteristics
3126		Chapter 5.4: Caches and Write Buffers
3127		Chapter 5.5: Data Sharing
3128		Chapter 5.6: Read/Write Ordering
3129	
3130	AMD64 Architecture Programmer's Manual Volume 2: System Programming
3131		Chapter 7.1: Memory-Access Ordering
3132		Chapter 7.4: Buffering and Combining Memory Writes
3133	
3134	IA-32 Intel Architecture Software Developer's Manual, Volume 3:
3135	System Programming Guide
3136		Chapter 7.1: Locked Atomic Operations
3137		Chapter 7.2: Memory Ordering
3138		Chapter 7.4: Serializing Instructions
3139	
3140	The SPARC Architecture Manual, Version 9
3141		Chapter 8: Memory Models
3142		Appendix D: Formal Specification of the Memory Models
3143		Appendix J: Programming with the Memory Models
3144	
3145	UltraSPARC Programmer Reference Manual
3146		Chapter 5: Memory Accesses and Cacheability
3147		Chapter 15: Sparc-V9 Memory Models
3148	
3149	UltraSPARC III Cu User's Manual
3150		Chapter 9: Memory Models
3151	
3152	UltraSPARC IIIi Processor User's Manual
3153		Chapter 8: Memory Models
3154	
3155	UltraSPARC Architecture 2005
3156		Chapter 9: Memory
3157		Appendix D: Formal Specifications of the Memory Models
3158	
3159	UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
3160		Chapter 8: Memory Models
3161		Appendix F: Caches and Cache Coherency
3162	
3163	Solaris Internals, Core Kernel Architecture, p63-68:
3164		Chapter 3.3: Hardware Considerations for Locks and
3165				Synchronization
3166	
3167	Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
3168	for Kernel Programmers:
3169		Chapter 13: Other Memory Models
3170	
3171	Intel Itanium Architecture Software Developer's Manual: Volume 1:
3172		Section 2.6: Speculation
3173		Section 4.4: Memory Access