
1				 ============================
2				 LINUX KERNEL MEMORY BARRIERS
3				 ============================
4	
5	By: David Howells <dhowells@redhat.com>
6	    Paul E. McKenney <paulmck@linux.vnet.ibm.com>
7	    Will Deacon <will.deacon@arm.com>
8	    Peter Zijlstra <peterz@infradead.org>
9	
10	==========
11	DISCLAIMER
12	==========
13	
14	This document is not a specification; it is intentionally (for the sake of
15	brevity) and unintentionally (due to being human) incomplete. This document is
16	meant as a guide to using the various memory barriers provided by Linux, but
17	in case of any doubt (and there are many) please ask.
18	
19	To repeat, this document is not a specification of what Linux expects from
20	hardware.
21	
22	The purpose of this document is twofold:
23	
24	 (1) to specify the minimum functionality that one can rely on for any
25	     particular barrier, and
26	
27	 (2) to provide a guide as to how to use the barriers that are available.
28	
29	Note that an architecture can provide more than the minimum requirement
30	for any particular barrier, but if the architecture provides less than
31	that, that architecture is incorrect.
32	
33	Note also that it is possible that a barrier may be a no-op for an
34	architecture because the way that arch works renders an explicit barrier
35	unnecessary in that case.
36	
37	
38	========
39	CONTENTS
40	========
41	
42	 (*) Abstract memory access model.
43	
44	     - Device operations.
45	     - Guarantees.
46	
47	 (*) What are memory barriers?
48	
49	     - Varieties of memory barrier.
50	     - What may not be assumed about memory barriers?
51	     - Data dependency barriers.
52	     - Control dependencies.
53	     - SMP barrier pairing.
54	     - Examples of memory barrier sequences.
55	     - Read memory barriers vs load speculation.
56	     - Multicopy atomicity.
57	
58	 (*) Explicit kernel barriers.
59	
60	     - Compiler barrier.
61	     - CPU memory barriers.
62	     - MMIO write barrier.
63	
64	 (*) Implicit kernel memory barriers.
65	
66	     - Lock acquisition functions.
67	     - Interrupt disabling functions.
68	     - Sleep and wake-up functions.
69	     - Miscellaneous functions.
70	
71	 (*) Inter-CPU acquiring barrier effects.
72	
73	     - Acquires vs memory accesses.
74	     - Acquires vs I/O accesses.
75	
76	 (*) Where are memory barriers needed?
77	
78	     - Interprocessor interaction.
79	     - Atomic operations.
80	     - Accessing devices.
81	     - Interrupts.
82	
83	 (*) Kernel I/O barrier effects.
84	
85	 (*) Assumed minimum execution ordering model.
86	
87	 (*) The effects of the cpu cache.
88	
89	     - Cache coherency.
90	     - Cache coherency vs DMA.
91	     - Cache coherency vs MMIO.
92	
93	 (*) The things CPUs get up to.
94	
95	     - And then there's the Alpha.
96	     - Virtual Machine Guests.
97	
98	 (*) Example uses.
99	
100	     - Circular buffers.
101	
102	 (*) References.
103	
104	
105	============================
106	ABSTRACT MEMORY ACCESS MODEL
107	============================
108	
109	Consider the following abstract model of the system:
110	
111			            :                :
112			            :                :
113			            :                :
114			+-------+   :   +--------+   :   +-------+
115			|       |   :   |        |   :   |       |
116			|       |   :   |        |   :   |       |
117			| CPU 1 |<----->| Memory |<----->| CPU 2 |
118			|       |   :   |        |   :   |       |
119			|       |   :   |        |   :   |       |
120			+-------+   :   +--------+   :   +-------+
121			    ^       :       ^        :       ^
122			    |       :       |        :       |
123			    |       :       |        :       |
124			    |       :       v        :       |
125			    |       :   +--------+   :       |
126			    |       :   |        |   :       |
127			    |       :   |        |   :       |
128			    +---------->| Device |<----------+
129			            :   |        |   :
130			            :   |        |   :
131			            :   +--------+   :
132			            :                :
133	
134	Each CPU executes a program that generates memory access operations.  In the
135	abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
136	perform the memory operations in any order it likes, provided program causality
137	appears to be maintained.  Similarly, the compiler may also arrange the
138	instructions it emits in any order it likes, provided it doesn't affect the
139	apparent operation of the program.
140	
141	So in the above diagram, the effects of the memory operations performed by a
142	CPU are perceived by the rest of the system as the operations cross the
143	interface between the CPU and rest of the system (the dotted lines).
144	
145	
146	For example, consider the following sequence of events:
147	
148		CPU 1		CPU 2
149		===============	===============
150		{ A == 1; B == 2 }
151		A = 3;		x = B;
152		B = 4;		y = A;
153	
154	The set of accesses as seen by the memory system in the middle can be arranged
155	in 24 different combinations:
156	
157		STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
158		STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
159		STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
160		STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
161		STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
162		STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
163		STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
164		STORE B=4, ...
165		...
166	
167	and can thus result in four different combinations of values:
168	
169		x == 2, y == 1
170		x == 2, y == 3
171		x == 4, y == 1
172		x == 4, y == 3
173	
174	
175	Furthermore, the stores committed by a CPU to the memory system may not be
176	perceived by the loads made by another CPU in the same order as the stores were
177	committed.
178	
179	
180	As a further example, consider this sequence of events:
181	
182		CPU 1		CPU 2
183		===============	===============
184		{ A == 1, B == 2, C == 3, P == &A, Q == &C }
185		B = 4;		Q = P;
186	P = &B;		D = *Q;
187	
188	There is an obvious data dependency here, as the value loaded into D depends on
189	the address retrieved from P by CPU 2.  At the end of the sequence, any of the
190	following results are possible:
191	
192		(Q == &A) and (D == 1)
193		(Q == &B) and (D == 2)
194		(Q == &B) and (D == 4)
195	
196	Note that CPU 2 will never try to load C into D because the CPU will load P
197	into Q before issuing the load of *Q.
198	
199	
200	DEVICE OPERATIONS
201	-----------------
202	
203	Some devices present their control interfaces as collections of memory
204	locations, but the order in which the control registers are accessed is very
205	important.  For instance, imagine an ethernet card with a set of internal
206	registers that are accessed through an address port register (A) and a data
207	port register (D).  To read internal register 5, the following code might then
208	be used:
209	
210		*A = 5;
211		x = *D;
212	
213	but this might show up as either of the following two sequences:
214	
215		STORE *A = 5, x = LOAD *D
216		x = LOAD *D, STORE *A = 5
217	
218	the second of which will almost certainly result in a malfunction, since it
219	sets the address _after_ attempting to read the register.
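
As a purely illustrative sketch (the function name, register offsets and the
"regs" mapping below are hypothetical, not taken from any real driver), the
indexed read might be written with the kernel's MMIO accessors from
<linux/io.h>, whose ordering properties are discussed later under "KERNEL
I/O BARRIER EFFECTS":

	/* Hypothetical: read internal register 5 via address/data ports. */
	#define EX_ADDR_PORT	0x00		/* illustrative offsets */
	#define EX_DATA_PORT	0x04

	static u32 example_read_indexed_reg(void __iomem *regs)
	{
		writel(5, regs + EX_ADDR_PORT);	/* select internal register 5 */
		return readl(regs + EX_DATA_PORT);	/* then read its value */
	}

The point of the later sections is precisely that the accessors (or explicit
barriers) must keep the address-port store ordered before the data-port load.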
220	
221	
222	GUARANTEES
223	----------
224	
225	There are some minimal guarantees that may be expected of a CPU:
226	
227	 (*) On any given CPU, dependent memory accesses will be issued in order, with
228	     respect to itself.  This means that for:
229	
230		Q = READ_ONCE(P); D = READ_ONCE(*Q);
231	
232	     the CPU will issue the following memory operations:
233	
234		Q = LOAD P, D = LOAD *Q
235	
236	     and always in that order.  However, on DEC Alpha, READ_ONCE() also
237	     emits a memory-barrier instruction, so that a DEC Alpha CPU will
238	     instead issue the following memory operations:
239	
240		Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER
241	
242	     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
243	     mischief.
244	
245	 (*) Overlapping loads and stores within a particular CPU will appear to be
246	     ordered within that CPU.  This means that for:
247	
248		a = READ_ONCE(*X); WRITE_ONCE(*X, b);
249	
250	     the CPU will only issue the following sequence of memory operations:
251	
252		a = LOAD *X, STORE *X = b
253	
254	     And for:
255	
256		WRITE_ONCE(*X, c); d = READ_ONCE(*X);
257	
258	     the CPU will only issue:
259	
260		STORE *X = c, d = LOAD *X
261	
262	     (Loads and stores overlap if they are targeted at overlapping pieces of
263	     memory).
264	
265	And there are a number of things that _must_ or _must_not_ be assumed:
266	
267	 (*) It _must_not_ be assumed that the compiler will do what you want
268	     with memory references that are not protected by READ_ONCE() and
269	     WRITE_ONCE().  Without them, the compiler is within its rights to
270	     do all sorts of "creative" transformations, which are covered in
271	     the COMPILER BARRIER section.
272	
273	 (*) It _must_not_ be assumed that independent loads and stores will be issued
274	     in the order given.  This means that for:
275	
276		X = *A; Y = *B; *D = Z;
277	
278	     we may get any of the following sequences:
279	
280		X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
281		X = LOAD *A,  STORE *D = Z, Y = LOAD *B
282		Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
283		Y = LOAD *B,  STORE *D = Z, X = LOAD *A
284		STORE *D = Z, X = LOAD *A,  Y = LOAD *B
285		STORE *D = Z, Y = LOAD *B,  X = LOAD *A
286	
287	 (*) It _must_ be assumed that overlapping memory accesses may be merged or
288	     discarded.  This means that for:
289	
290		X = *A; Y = *(A + 4);
291	
292	     we may get any one of the following sequences:
293	
294		X = LOAD *A; Y = LOAD *(A + 4);
295		Y = LOAD *(A + 4); X = LOAD *A;
296		{X, Y} = LOAD {*A, *(A + 4) };
297	
298	     And for:
299	
300		*A = X; *(A + 4) = Y;
301	
302	     we may get any of:
303	
304		STORE *A = X; STORE *(A + 4) = Y;
305		STORE *(A + 4) = Y; STORE *A = X;
306		STORE {*A, *(A + 4) } = {X, Y};
307	
308	And there are anti-guarantees:
309	
310	 (*) These guarantees do not apply to bitfields, because compilers often
311	     generate code to modify these using non-atomic read-modify-write
312	     sequences.  Do not attempt to use bitfields to synchronize parallel
313	     algorithms; a sketch of the hazard appears after this list.
314	
315	 (*) Even in cases where bitfields are protected by locks, all fields
316	     in a given bitfield must be protected by one lock.  If two fields
317	     in a given bitfield are protected by different locks, the compiler's
318	     non-atomic read-modify-write sequences can cause an update to one
319	     field to corrupt the value of an adjacent field.
320	
321	 (*) These guarantees apply only to properly aligned and sized scalar
322	     variables.  "Properly sized" currently means variables that are
323	     the same size as "char", "short", "int" and "long".  "Properly
324	     aligned" means the natural alignment, thus no constraints for
325	     "char", two-byte alignment for "short", four-byte alignment for
326	     "int", and either four-byte or eight-byte alignment for "long",
327	     on 32-bit and 64-bit systems, respectively.  Note that these
328	     guarantees were introduced into the C11 standard, so beware when
329	     using older pre-C11 compilers (for example, gcc 4.6).  The portion
330	     of the standard containing this guarantee is Section 3.14, which
331	     defines "memory location" as follows:
332	
333	     	memory location
334			either an object of scalar type, or a maximal sequence
335			of adjacent bit-fields all having nonzero width
336	
337			NOTE 1: Two threads of execution can update and access
338			separate memory locations without interfering with
339			each other.
340	
341			NOTE 2: A bit-field and an adjacent non-bit-field member
342			are in separate memory locations. The same applies
343			to two bit-fields, if one is declared inside a nested
344			structure declaration and the other is not, or if the two
345			are separated by a zero-length bit-field declaration,
346			or if they are separated by a non-bit-field member
347			declaration. It is not safe to concurrently update two
348			bit-fields in the same structure if all members declared
349			between them are also bit-fields, no matter what the
350			sizes of those intervening bit-fields happen to be.
351	
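As a purely illustrative sketch of the bitfield hazard above (the structure,
lock and function names are hypothetical):

	struct example_flags {		/* hypothetical structure */
		int	a : 1;		/* intended to be protected by a_lock */
		int	b : 1;		/* intended to be protected by b_lock */
	};

	static DEFINE_SPINLOCK(a_lock);
	static DEFINE_SPINLOCK(b_lock);

	static void example_set_a(struct example_flags *f)
	{
		/*
		 * BROKEN: the compiler may implement this assignment as a
		 * non-atomic read-modify-write of the word containing both
		 * bitfields, so it can corrupt a concurrent update of 'b'
		 * made under b_lock.  Use a single lock for the whole
		 * bitfield, or make 'a' and 'b' separate, properly sized
		 * scalar fields.
		 */
		spin_lock(&a_lock);
		f->a = 1;
		spin_unlock(&a_lock);
	}
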
352	
353	=========================
354	WHAT ARE MEMORY BARRIERS?
355	=========================
356	
357	As can be seen above, independent memory operations are effectively performed
358	in random order, but this can be a problem for CPU-CPU interaction and for I/O.
359	What is required is some way of intervening to instruct the compiler and the
360	CPU to restrict the order.
361	
362	Memory barriers are such interventions.  They impose a perceived partial
363	ordering over the memory operations on either side of the barrier.
364	
365	Such enforcement is important because the CPUs and other devices in a system
366	can use a variety of tricks to improve performance, including reordering,
367	deferral and combination of memory operations; speculative loads; speculative
368	branch prediction and various types of caching.  Memory barriers are used to
369	override or suppress these tricks, allowing the code to sanely control the
370	interaction of multiple CPUs and/or devices.
371	
372	
373	VARIETIES OF MEMORY BARRIER
374	---------------------------
375	
376	Memory barriers come in four basic varieties:
377	
378	 (1) Write (or store) memory barriers.
379	
380	     A write memory barrier gives a guarantee that all the STORE operations
381	     specified before the barrier will appear to happen before all the STORE
382	     operations specified after the barrier with respect to the other
383	     components of the system.
384	
385	     A write barrier is a partial ordering on stores only; it is not required
386	     to have any effect on loads.
387	
388	     A CPU can be viewed as committing a sequence of store operations to the
389	     memory system as time progresses.  All stores _before_ a write barrier
390	     will occur _before_ all the stores after the write barrier.
391	
392	     [!] Note that write barriers should normally be paired with read or data
393	     dependency barriers; see the "SMP barrier pairing" subsection.
394	
395	
396	 (2) Data dependency barriers.
397	
398	     A data dependency barrier is a weaker form of read barrier.  In the case
399	     where two loads are performed such that the second depends on the result
400	     of the first (eg: the first load retrieves the address to which the second
401	     load will be directed), a data dependency barrier would be required to
402	     make sure that the target of the second load is updated before the address
403	     obtained by the first load is accessed.
404	
405	     A data dependency barrier is a partial ordering on interdependent loads
406	     only; it is not required to have any effect on stores, independent loads
407	     or overlapping loads.
408	
409	     As mentioned in (1), the other CPUs in the system can be viewed as
410	     committing sequences of stores to the memory system that the CPU being
411	     considered can then perceive.  A data dependency barrier issued by the CPU
412	     under consideration guarantees that for any load preceding it, if that
413	     load touches one of a sequence of stores from another CPU, then by the
414	     time the barrier completes, the effects of all the stores prior to that
415	     touched by the load will be perceptible to any loads issued after the data
416	     dependency barrier.
417	
418	     See the "Examples of memory barrier sequences" subsection for diagrams
419	     showing the ordering constraints.
420	
421	     [!] Note that the first load really has to have a _data_ dependency and
422	     not a control dependency.  If the address for the second load is dependent
423	     on the first load, but the dependency is through a conditional rather than
424	     actually loading the address itself, then it's a _control_ dependency and
425	     a full read barrier or better is required.  See the "Control dependencies"
426	     subsection for more information.
427	
428	     [!] Note that data dependency barriers should normally be paired with
429	     write barriers; see the "SMP barrier pairing" subsection.
430	
431	
432	 (3) Read (or load) memory barriers.
433	
434	     A read barrier is a data dependency barrier plus a guarantee that all the
435	     LOAD operations specified before the barrier will appear to happen before
436	     all the LOAD operations specified after the barrier with respect to the
437	     other components of the system.
438	
439	     A read barrier is a partial ordering on loads only; it is not required to
440	     have any effect on stores.
441	
442	     Read memory barriers imply data dependency barriers, and so can substitute
443	     for them.
444	
445	     [!] Note that read barriers should normally be paired with write barriers;
446	     see the "SMP barrier pairing" subsection.
447	
448	
449	 (4) General memory barriers.
450	
451	     A general memory barrier gives a guarantee that all the LOAD and STORE
452	     operations specified before the barrier will appear to happen before all
453	     the LOAD and STORE operations specified after the barrier with respect to
454	     the other components of the system.
455	
456	     A general memory barrier is a partial ordering over both loads and stores.
457	
458	     General memory barriers imply both read and write memory barriers, and so
459	     can substitute for either.
460	
461	
462	And a couple of implicit varieties:
463	
464	 (5) ACQUIRE operations.
465	
466	     This acts as a one-way permeable barrier.  It guarantees that all memory
467	     operations after the ACQUIRE operation will appear to happen after the
468	     ACQUIRE operation with respect to the other components of the system.
469	     ACQUIRE operations include LOCK operations and both smp_load_acquire()
470	     and smp_cond_load_acquire() operations.  The latter builds the necessary
471	     ACQUIRE semantics by relying on a control dependency and smp_rmb().
472	
473	     Memory operations that occur before an ACQUIRE operation may appear to
474	     happen after it completes.
475	
476	     An ACQUIRE operation should almost always be paired with a RELEASE
477	     operation.
478	
479	
480	 (6) RELEASE operations.
481	
482	     This also acts as a one-way permeable barrier.  It guarantees that all
483	     memory operations before the RELEASE operation will appear to happen
484	     before the RELEASE operation with respect to the other components of the
485	     system. RELEASE operations include UNLOCK operations and
486	     smp_store_release() operations.
487	
488	     Memory operations that occur after a RELEASE operation may appear to
489	     happen before it completes.
490	
491	     The use of ACQUIRE and RELEASE operations generally precludes the need
492	     for other sorts of memory barrier (but note the exceptions mentioned in
493	     the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
494	     pair is -not- guaranteed to act as a full memory barrier.  However, after
495	     an ACQUIRE on a given variable, all memory accesses preceding any prior
496	     RELEASE on that same variable are guaranteed to be visible.  In other
497	     words, within a given variable's critical section, all accesses of all
498	     previous critical sections for that variable are guaranteed to have
499	     completed.
500	
501	     This means that ACQUIRE acts as a minimal "acquire" operation and
502	     RELEASE acts as a minimal "release" operation.
503	
504	A subset of the atomic operations described in atomic_t.txt have ACQUIRE and
505	RELEASE variants in addition to fully-ordered and relaxed (no barrier
506	semantics) definitions.  For compound atomics performing both a load and a
507	store, ACQUIRE semantics apply only to the load and RELEASE semantics apply
508	only to the store portion of the operation.
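
To illustrate (a minimal sketch; the variables and function names below are
made up for this example), a typical ACQUIRE/RELEASE message-passing pattern
uses smp_store_release() and smp_load_acquire():

	int example_data;
	int example_flag;

	void example_producer(void)
	{
		example_data = 42;
		/* RELEASE: orders the store to example_data before the
		 * store to example_flag. */
		smp_store_release(&example_flag, 1);
	}

	void example_consumer(void)
	{
		/* ACQUIRE: orders the load of example_flag before the
		 * load of example_data. */
		if (smp_load_acquire(&example_flag))
			do_something(example_data);	/* hypothetical helper */
	}

If the consumer observes example_flag == 1, the RELEASE/ACQUIRE pairing
guarantees that it also observes example_data == 42.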
509	
510	Memory barriers are only required where there's a possibility of interaction
511	between two CPUs or between a CPU and a device.  If it can be guaranteed that
512	there won't be any such interaction in any particular piece of code, then
513	memory barriers are unnecessary in that piece of code.
514	
515	
516	Note that these are the _minimum_ guarantees.  Different architectures may give
517	more substantial guarantees, but they may _not_ be relied upon outside of arch
518	specific code.
519	
520	
521	WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
522	----------------------------------------------
523	
524	There are certain things that the Linux kernel memory barriers do not guarantee:
525	
526	 (*) There is no guarantee that any of the memory accesses specified before a
527	     memory barrier will be _complete_ by the completion of a memory barrier
528	     instruction; the barrier can be considered to draw a line in that CPU's
529	     access queue that accesses of the appropriate type may not cross.
530	
531	 (*) There is no guarantee that issuing a memory barrier on one CPU will have
532	     any direct effect on another CPU or any other hardware in the system.  The
533	     indirect effect will be the order in which the second CPU sees the effects
534	     of the first CPU's accesses occur, but see the next point:
535	
536	 (*) There is no guarantee that a CPU will see the correct order of effects
537	     from a second CPU's accesses, even _if_ the second CPU uses a memory
538	     barrier, unless the first CPU _also_ uses a matching memory barrier (see
539	     the subsection on "SMP Barrier Pairing").
540	
541	 (*) There is no guarantee that some intervening piece of off-the-CPU
542	     hardware[*] will not reorder the memory accesses.  CPU cache coherency
543	     mechanisms should propagate the indirect effects of a memory barrier
544	     between CPUs, but might not do so in order.
545	
546		[*] For information on bus mastering DMA and coherency please read:
547	
548		    Documentation/PCI/pci.txt
549		    Documentation/DMA-API-HOWTO.txt
550		    Documentation/DMA-API.txt
551	
552	
553	DATA DEPENDENCY BARRIERS
554	------------------------
555	
556	The usage requirements of data dependency barriers are a little subtle, and
557	it's not always obvious that they're needed.  To illustrate, consider the
558	following sequence of events:
559	
560		CPU 1		      CPU 2
561		===============	      ===============
562		{ A == 1, B == 2, C == 3, P == &A, Q == &C }
563		B = 4;
564		<write barrier>
565	WRITE_ONCE(P, &B);
566				      Q = READ_ONCE(P);
567				      D = *Q;
568	
569	There's a clear data dependency here, and it would seem that by the end of the
570	sequence, Q must be either &A or &B, and that:
571	
572		(Q == &A) implies (D == 1)
573		(Q == &B) implies (D == 4)
574	
575	But!  CPU 2's perception of P may be updated _before_ its perception of B, thus
576	leading to the following situation:
577	
578		(Q == &B) and (D == 2) ????
579	
580	Whilst this may seem like a failure of coherency or causality maintenance, it
581	isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
582	Alpha).
583	
584	To deal with this, a data dependency barrier or better must be inserted
585	between the address load and the data load:
586	
587		CPU 1		      CPU 2
588		===============	      ===============
589		{ A == 1, B == 2, C == 3, P == &A, Q == &C }
590		B = 4;
591		<write barrier>
592		WRITE_ONCE(P, &B);
593				      Q = READ_ONCE(P);
594				      <data dependency barrier>
595				      D = *Q;
596	
597	This enforces the occurrence of one of the two implications, and prevents the
598	third possibility from arising.
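
In current kernel source, CPU 2's side of this example would typically be
written with READ_ONCE(), which (as noted in the GUARANTEES section above)
already emits the required barrier on DEC Alpha; a sketch using the same
variables:

	Q = READ_ONCE(P);	/* includes the data dependency barrier
				 * where the CPU needs one (DEC Alpha) */
	D = READ_ONCE(*Q);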
599	
600	
601	[!] Note that this extremely counterintuitive situation arises most easily on
602	machines with split caches, so that, for example, one cache bank processes
603	even-numbered cache lines and the other bank processes odd-numbered cache
604	lines.  The pointer P might be stored in an odd-numbered cache line, and the
605	variable B might be stored in an even-numbered cache line.  Then, if the
606	even-numbered bank of the reading CPU's cache is extremely busy while the
607	odd-numbered bank is idle, one can see the new value of the pointer P (&B),
608	but the old value of the variable B (2).
609	
610	
611	A data-dependency barrier is not required to order dependent writes
612	because the CPUs that the Linux kernel supports don't do writes
613	until they are certain (1) that the write will actually happen, (2)
614	of the location of the write, and (3) of the value to be written.
615	But please carefully read the "CONTROL DEPENDENCIES" section and the
616	Documentation/RCU/rcu_dereference.txt file:  The compiler can and does
617	break dependencies in a great many highly creative ways.
618	
619		CPU 1		      CPU 2
620		===============	      ===============
621	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
622		B = 4;
623		<write barrier>
624		WRITE_ONCE(P, &B);
625				      Q = READ_ONCE(P);
626				      WRITE_ONCE(*Q, 5);
627	
628	Therefore, no data-dependency barrier is required to order the read into
629	Q with the store into *Q.  In other words, this outcome is prohibited,
630	even without a data-dependency barrier:
631	
632		(Q == &B) && (B == 4)
633	
634	Please note that this pattern should be rare.  After all, the whole point
635	of dependency ordering is to -prevent- writes to the data structure, along
636	with the expensive cache misses associated with those writes.  This pattern
637	can be used to record rare error conditions and the like, and the CPUs'
638	naturally occurring ordering prevents such records from being lost.
639	
640	
641	Note well that the ordering provided by a data dependency is local to
642	the CPU containing it.  See the section on "Multicopy atomicity" for
643	more information.
644	
645	
646	The data dependency barrier is very important to the RCU system,
647	for example.  See rcu_assign_pointer() and rcu_dereference() in
648	include/linux/rcupdate.h.  This permits the current target of an RCU'd
649	pointer to be replaced with a new modified target, without the replacement
650	target appearing to be incompletely initialised.
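
A minimal sketch of that usage (the structure, the global pointer and the
function names below are hypothetical): the updater fully initialises the
new object before publishing it with rcu_assign_pointer(), and readers pick
it up with rcu_dereference(), which supplies the dependency ordering for the
subsequent accesses:

	struct example_obj {
		int a;
		int b;
	};

	struct example_obj __rcu *example_gp;	/* hypothetical RCU-protected pointer */

	void example_publish(struct example_obj *p)
	{
		p->a = 1;
		p->b = 2;
		rcu_assign_pointer(example_gp, p);	/* barrier before publication */
	}

	void example_reader(void)
	{
		struct example_obj *p;

		rcu_read_lock();
		p = rcu_dereference(example_gp);	/* dependency ordering for reads of *p */
		if (p)
			do_something(p->a, p->b);	/* hypothetical helper */
		rcu_read_unlock();
	}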
651	
652	See also the subsection on "Cache Coherency" for a more thorough example.
653	
654	
655	CONTROL DEPENDENCIES
656	--------------------
657	
658	Control dependencies can be a bit tricky because current compilers do
659	not understand them.  The purpose of this section is to help you prevent
660	the compiler's ignorance from breaking your code.
661	
662	A load-load control dependency requires a full read memory barrier, not
663	simply a data dependency barrier to make it work correctly.  Consider the
664	following bit of code:
665	
666		q = READ_ONCE(a);
667		if (q) {
668			<data dependency barrier>  /* BUG: No data dependency!!! */
669			p = READ_ONCE(b);
670		}
671	
672	This will not have the desired effect because there is no actual data
673	dependency, but rather a control dependency that the CPU may short-circuit
674	by attempting to predict the outcome in advance, so that other CPUs see
675	the load from b as having happened before the load from a.  In such a
676	case what's actually required is:
677	
678		q = READ_ONCE(a);
679		if (q) {
680			<read barrier>
681			p = READ_ONCE(b);
682		}
683	
684	However, stores are not speculated.  This means that ordering -is- provided
685	for load-store control dependencies, as in the following example:
686	
687		q = READ_ONCE(a);
688		if (q) {
689			WRITE_ONCE(b, 1);
690		}
691	
692	Control dependencies pair normally with other types of barriers.
693	That said, please note that neither READ_ONCE() nor WRITE_ONCE()
694	is optional!  Without the READ_ONCE(), the compiler might combine the
695	load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
696	the compiler might combine the store to 'b' with other stores to 'b'.
697	Either can result in highly counterintuitive effects on ordering.
698	
699	Worse yet, if the compiler is able to prove (say) that the value of
700	variable 'a' is always non-zero, it would be well within its rights
701	to optimize the original example by eliminating the "if" statement
702	as follows:
703	
704		q = a;
705		b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
706	
707	So don't leave out the READ_ONCE().
708	
709	It is tempting to try to enforce ordering on identical stores on both
710	branches of the "if" statement as follows:
711	
712		q = READ_ONCE(a);
713		if (q) {
714			barrier();
715			WRITE_ONCE(b, 1);
716			do_something();
717		} else {
718			barrier();
719			WRITE_ONCE(b, 1);
720			do_something_else();
721		}
722	
723	Unfortunately, current compilers will transform this as follows at high
724	optimization levels:
725	
726		q = READ_ONCE(a);
727		barrier();
728		WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
729		if (q) {
730			/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
731			do_something();
732		} else {
733			/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
734			do_something_else();
735		}
736	
737	Now there is no conditional between the load from 'a' and the store to
738	'b', which means that the CPU is within its rights to reorder them:
739	The conditional is absolutely required, and must be present in the
740	assembly code even after all compiler optimizations have been applied.
741	Therefore, if you need ordering in this example, you need explicit
742	memory barriers, for example, smp_store_release():
743	
744		q = READ_ONCE(a);
745		if (q) {
746			smp_store_release(&b, 1);
747			do_something();
748		} else {
749			smp_store_release(&b, 1);
750			do_something_else();
751		}
752	
753	In contrast, without explicit memory barriers, two-legged-if control
754	ordering is guaranteed only when the stores differ, for example:
755	
756		q = READ_ONCE(a);
757		if (q) {
758			WRITE_ONCE(b, 1);
759			do_something();
760		} else {
761			WRITE_ONCE(b, 2);
762			do_something_else();
763		}
764	
765	The initial READ_ONCE() is still required to prevent the compiler from
766	proving the value of 'a'.
767	
768	In addition, you need to be careful what you do with the local variable 'q',
769	otherwise the compiler might be able to guess the value and again remove
770	the needed conditional.  For example:
771	
772		q = READ_ONCE(a);
773		if (q % MAX) {
774			WRITE_ONCE(b, 1);
775			do_something();
776		} else {
777			WRITE_ONCE(b, 2);
778			do_something_else();
779		}
780	
781	If MAX is defined to be 1, then the compiler knows that (q % MAX) is
782	equal to zero, in which case the compiler is within its rights to
783	transform the above code into the following:
784	
785		q = READ_ONCE(a);
786		WRITE_ONCE(b, 2);
787		do_something_else();
788	
789	Given this transformation, the CPU is not required to respect the ordering
790	between the load from variable 'a' and the store to variable 'b'.  It is
791	tempting to add a barrier(), but this does not help.  The conditional
792	is gone, and the barrier won't bring it back.  Therefore, if you are
793	relying on this ordering, you should make sure that MAX is greater than
794	one, perhaps as follows:
795	
796		q = READ_ONCE(a);
797		BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
798		if (q % MAX) {
799			WRITE_ONCE(b, 1);
800			do_something();
801		} else {
802			WRITE_ONCE(b, 2);
803			do_something_else();
804		}
805	
806	Please note once again that the stores to 'b' differ.  If they were
807	identical, as noted earlier, the compiler could pull this store outside
808	of the 'if' statement.
809	
810	You must also be careful not to rely too much on boolean short-circuit
811	evaluation.  Consider this example:
812	
813		q = READ_ONCE(a);
814		if (q || 1 > 0)
815			WRITE_ONCE(b, 1);
816	
817	Because the first condition cannot fault and the second condition is
818	always true, the compiler can transform this example as follows,
819	defeating the control dependency:
820	
821		q = READ_ONCE(a);
822		WRITE_ONCE(b, 1);
823	
824	This example underscores the need to ensure that the compiler cannot
825	out-guess your code.  More generally, although READ_ONCE() does force
826	the compiler to actually emit code for a given load, it does not force
827	the compiler to use the results.
828	
829	In addition, control dependencies apply only to the then-clause and
830	else-clause of the if-statement in question.  In particular, they do
831	not necessarily apply to code following the if-statement:
832	
833		q = READ_ONCE(a);
834		if (q) {
835			WRITE_ONCE(b, 1);
836		} else {
837			WRITE_ONCE(b, 2);
838		}
839		WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */
840	
841	It is tempting to argue that there in fact is ordering because the
842	compiler cannot reorder volatile accesses and also cannot reorder
843	the writes to 'b' with the condition.  Unfortunately for this line
844	of reasoning, the compiler might compile the two writes to 'b' as
845	conditional-move instructions, as in this fanciful pseudo-assembly
846	language:
847	
848		ld r1,a
849		cmp r1,$0
850		cmov,ne r4,$1
851		cmov,eq r4,$2
852		st r4,b
853		st $1,c
854	
855	A weakly ordered CPU would have no dependency of any sort between the load
856	from 'a' and the store to 'c'.  The control dependencies would extend
857	only to the pair of cmov instructions and the store depending on them.
858	In short, control dependencies apply only to the stores in the then-clause
859	and else-clause of the if-statement in question (including functions
860	invoked by those two clauses), not to code following that if-statement.
861	
862	
863	Note well that the ordering provided by a control dependency is local
864	to the CPU containing it.  See the section on "Multicopy atomicity"
865	for more information.
866	
867	
868	In summary:
869	
870	  (*) Control dependencies can order prior loads against later stores.
871	      However, they do -not- guarantee any other sort of ordering:
872	      Not prior loads against later loads, nor prior stores against
873	      later anything.  If you need these other forms of ordering,
874	      use smp_rmb(), smp_wmb(), or, in the case of prior stores and
875	      later loads, smp_mb().
876	
877	  (*) If both legs of the "if" statement begin with identical stores to
878	      the same variable, then those stores must be ordered, either by
879	      preceding both of them with smp_mb() or by using smp_store_release()
880	      to carry out the stores.  Please note that it is -not- sufficient
881	      to use barrier() at the beginning of each leg of the "if" statement
882	      because, as shown by the example above, optimizing compilers can
883	      destroy the control dependency while respecting the letter of the
884	      barrier() law.
885	
886	  (*) Control dependencies require at least one run-time conditional
887	      between the prior load and the subsequent store, and this
888	      conditional must involve the prior load.  If the compiler is able
889	      to optimize the conditional away, it will have also optimized
890	      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
891	      can help to preserve the needed conditional.
892	
893	  (*) Control dependencies require that the compiler avoid reordering the
894	      dependency into nonexistence.  Careful use of READ_ONCE() or
895	      atomic{,64}_read() can help to preserve your control dependency.
896	      Please see the COMPILER BARRIER section for more information.
897	
898	  (*) Control dependencies apply only to the then-clause and else-clause
899	      of the if-statement containing the control dependency, including
900	      any functions that these two clauses call.  Control dependencies
901	      do -not- apply to code following the if-statement containing the
902	      control dependency.
903	
904	  (*) Control dependencies pair normally with other types of barriers.
905	
906	  (*) Control dependencies do -not- provide multicopy atomicity.  If you
907	      need all the CPUs to see a given store at the same time, use smp_mb().
908	
909	  (*) Compilers do not understand control dependencies.  It is therefore
910	      your job to ensure that they do not break your code.
911	
912	
913	SMP BARRIER PAIRING
914	-------------------
915	
916	When dealing with CPU-CPU interactions, certain types of memory barrier should
917	always be paired.  A lack of appropriate pairing is almost certainly an error.
918	
919	General barriers pair with each other, though they also pair with most
920	other types of barriers, albeit without multicopy atomicity.  An acquire
921	barrier pairs with a release barrier, but both may also pair with other
922	barriers, including of course general barriers.  A write barrier pairs
923	with a data dependency barrier, a control dependency, an acquire barrier,
924	a release barrier, a read barrier, or a general barrier.  Similarly a
925	read barrier, control dependency, or a data dependency barrier pairs
926	with a write barrier, an acquire barrier, a release barrier, or a
927	general barrier:
928	
929		CPU 1		      CPU 2
930		===============	      ===============
931		WRITE_ONCE(a, 1);
932		<write barrier>
933		WRITE_ONCE(b, 2);     x = READ_ONCE(b);
934				      <read barrier>
935				      y = READ_ONCE(a);
936	
937	Or:
938	
939		CPU 1		      CPU 2
940		===============	      ===============================
941		a = 1;
942		<write barrier>
943		WRITE_ONCE(b, &a);    x = READ_ONCE(b);
944				      <data dependency barrier>
945				      y = *x;
946	
947	Or even:
948	
949		CPU 1		      CPU 2
950		===============	      ===============================
951		r1 = READ_ONCE(y);
952		<general barrier>
953		WRITE_ONCE(x, 1);     if (r2 = READ_ONCE(x)) {
954				         <implicit control dependency>
955				         WRITE_ONCE(y, 1);
956				      }
957	
958		assert(r1 == 0 || r2 == 0);
959	
960	Basically, the read barrier always has to be there, even though it can be of
961	the "weaker" type.
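
Expressed with the explicit kernel barrier macros described later in this
document (a sketch; the variable and function names are illustrative), the
first pairing above corresponds to:

	int a, b;

	void example_writer(void)		/* CPU 1 */
	{
		WRITE_ONCE(a, 1);
		smp_wmb();	/* pairs with the smp_rmb() in example_reader() */
		WRITE_ONCE(b, 2);
	}

	void example_reader(void)		/* CPU 2 */
	{
		int x, y;

		x = READ_ONCE(b);
		smp_rmb();	/* pairs with the smp_wmb() in example_writer() */
		y = READ_ONCE(a);

		/* If x == 2 here, then y is guaranteed to be 1. */
	}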
962	
963	[!] Note that the stores before the write barrier would normally be expected to
964	match the loads after the read barrier or the data dependency barrier, and vice
965	versa:
966	
967		CPU 1                               CPU 2
968		===================                 ===================
969		WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
970		WRITE_ONCE(b, 2);    }    \ /    {  w = READ_ONCE(d);
971		<write barrier>            \        <read barrier>
972		WRITE_ONCE(c, 3);    }    / \    {  x = READ_ONCE(a);
973		WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
974	
975	
976	EXAMPLES OF MEMORY BARRIER SEQUENCES
977	------------------------------------
978	
979	Firstly, write barriers act as partial orderings on store operations.
980	Consider the following sequence of events:
981	
982		CPU 1
983		=======================
984		STORE A = 1
985		STORE B = 2
986		STORE C = 3
987		<write barrier>
988		STORE D = 4
989		STORE E = 5
990	
991	This sequence of events is committed to the memory coherence system in an order
992	that the rest of the system might perceive as the unordered set of { STORE A,
993	STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
994	}:
995	
996		+-------+       :      :
997		|       |       +------+
998		|       |------>| C=3  |     }     /\
999		|       |  :    +------+     }-----  \  -----> Events perceptible to
1000		|       |  :    | A=1  |     }        \/       the rest of the system
1001		|       |  :    +------+     }
1002		| CPU 1 |  :    | B=2  |     }
1003		|       |       +------+     }
1004		|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
1005		|       |       +------+     }        requires all stores prior to the
1006		|       |  :    | E=5  |     }        barrier to be committed before
1007		|       |  :    +------+     }        further stores may take place
1008		|       |------>| D=4  |     }
1009		|       |       +------+
1010		+-------+       :      :
1011		                   |
1012		                   | Sequence in which stores are committed to the
1013		                   | memory system by CPU 1
1014		                   V
1015	
1016	
1017	Secondly, data dependency barriers act as partial orderings on data-dependent
1018	loads.  Consider the following sequence of events:
1019	
1020		CPU 1			CPU 2
1021		=======================	=======================
1022			{ B = 7; X = 9; Y = 8; C = &Y }
1023		STORE A = 1
1024		STORE B = 2
1025		<write barrier>
1026		STORE C = &B		LOAD X
1027		STORE D = 4		LOAD C (gets &B)
1028					LOAD *C (reads B)
1029	
1030	Without intervention, CPU 2 may perceive the events on CPU 1 in some
1031	effectively random order, despite the write barrier issued by CPU 1:
1032	
1033		+-------+       :      :                :       :
1034		|       |       +------+                +-------+  | Sequence of update
1035		|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
1036		|       |  :    +------+     \          +-------+  | CPU 2
1037		| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
1038		|       |       +------+       |        +-------+
1039		|       |   wwwwwwwwwwwwwwww   |        :       :
1040		|       |       +------+       |        :       :
1041		|       |  :    | C=&B |---    |        :       :       +-------+
1042		|       |  :    +------+   \   |        +-------+       |       |
1043		|       |------>| D=4  |    ----------->| C->&B |------>|       |
1044		|       |       +------+       |        +-------+       |       |
1045		+-------+       :      :       |        :       :       |       |
1046		                               |        :       :       |       |
1047		                               |        :       :       | CPU 2 |
1048		                               |        +-------+       |       |
1049		    Apparently incorrect --->  |        | B->7  |------>|       |
1050		    perception of B (!)        |        +-------+       |       |
1051		                               |        :       :       |       |
1052		                               |        +-------+       |       |
1053		    The load of X holds --->    \       | X->9  |------>|       |
1054		    up the maintenance           \      +-------+       |       |
1055		    of coherence of B             ----->| B->2  |       +-------+
1056		                                        +-------+
1057		                                        :       :
1058	
1059	
1060	In the above example, CPU 2 perceives that B is 7, despite the load of *C
1061	(which would be B) coming after the LOAD of C.
1062	
1063	If, however, a data dependency barrier were to be placed between the load of C
1064	and the load of *C (ie: B) on CPU 2:
1065	
1066		CPU 1			CPU 2
1067		=======================	=======================
1068			{ B = 7; X = 9; Y = 8; C = &Y }
1069		STORE A = 1
1070		STORE B = 2
1071		<write barrier>
1072		STORE C = &B		LOAD X
1073		STORE D = 4		LOAD C (gets &B)
1074					<data dependency barrier>
1075					LOAD *C (reads B)
1076	
1077	then the following will occur:
1078	
1079		+-------+       :      :                :       :
1080		|       |       +------+                +-------+
1081		|       |------>| B=2  |-----       --->| Y->8  |
1082		|       |  :    +------+     \          +-------+
1083		| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
1084		|       |       +------+       |        +-------+
1085		|       |   wwwwwwwwwwwwwwww   |        :       :
1086		|       |       +------+       |        :       :
1087		|       |  :    | C=&B |---    |        :       :       +-------+
1088		|       |  :    +------+   \   |        +-------+       |       |
1089		|       |------>| D=4  |    ----------->| C->&B |------>|       |
1090		|       |       +------+       |        +-------+       |       |
1091		+-------+       :      :       |        :       :       |       |
1092		                               |        :       :       |       |
1093		                               |        :       :       | CPU 2 |
1094		                               |        +-------+       |       |
1095		                               |        | X->9  |------>|       |
1096		                               |        +-------+       |       |
1097		  Makes sure all effects --->   \   ddddddddddddddddd   |       |
1098		  prior to the store of C        \      +-------+       |       |
1099		  are perceptible to              ----->| B->2  |------>|       |
1100		  subsequent loads                      +-------+       |       |
1101		                                        :       :       +-------+
1102	
1103	
1104	And thirdly, a read barrier acts as a partial order on loads.  Consider the
1105	following sequence of events:
1106	
1107		CPU 1			CPU 2
1108		=======================	=======================
1109			{ A = 0, B = 9 }
1110		STORE A=1
1111		<write barrier>
1112		STORE B=2
1113					LOAD B
1114					LOAD A
1115	
1116	Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
1117	some effectively random order, despite the write barrier issued by CPU 1:
1118	
1119		+-------+       :      :                :       :
1120		|       |       +------+                +-------+
1121		|       |------>| A=1  |------      --->| A->0  |
1122		|       |       +------+      \         +-------+
1123		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1124		|       |       +------+        |       +-------+
1125		|       |------>| B=2  |---     |       :       :
1126		|       |       +------+   \    |       :       :       +-------+
1127		+-------+       :      :    \   |       +-------+       |       |
1128		                             ---------->| B->2  |------>|       |
1129		                                |       +-------+       | CPU 2 |
1130		                                |       | A->0  |------>|       |
1131		                                |       +-------+       |       |
1132		                                |       :       :       +-------+
1133		                                 \      :       :
1134		                                  \     +-------+
1135		                                   ---->| A->1  |
1136		                                        +-------+
1137		                                        :       :
1138	
1139	
1140	If, however, a read barrier were to be placed between the load of B and the
1141	load of A on CPU 2:
1142	
1143		CPU 1			CPU 2
1144		=======================	=======================
1145			{ A = 0, B = 9 }
1146		STORE A=1
1147		<write barrier>
1148		STORE B=2
1149					LOAD B
1150					<read barrier>
1151					LOAD A
1152	
1153	then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
1154	2:
1155	
1156		+-------+       :      :                :       :
1157		|       |       +------+                +-------+
1158		|       |------>| A=1  |------      --->| A->0  |
1159		|       |       +------+      \         +-------+
1160		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1161		|       |       +------+        |       +-------+
1162		|       |------>| B=2  |---     |       :       :
1163		|       |       +------+   \    |       :       :       +-------+
1164		+-------+       :      :    \   |       +-------+       |       |
1165		                             ---------->| B->2  |------>|       |
1166		                                |       +-------+       | CPU 2 |
1167		                                |       :       :       |       |
1168		                                |       :       :       |       |
1169		  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1170		  barrier causes all effects      \     +-------+       |       |
1171		  prior to the storage of B        ---->| A->1  |------>|       |
1172		  to be perceptible to CPU 2            +-------+       |       |
1173		                                        :       :       +-------+
1174	
1175	
1176	To illustrate this more completely, consider what could happen if the code
1177	contained a load of A on either side of the read barrier:
1178	
1179		CPU 1			CPU 2
1180		=======================	=======================
1181			{ A = 0, B = 9 }
1182		STORE A=1
1183		<write barrier>
1184		STORE B=2
1185					LOAD B
1186					LOAD A [first load of A]
1187					<read barrier>
1188					LOAD A [second load of A]
1189	
1190	Even though the two loads of A both occur after the load of B, they may both
1191	come up with different values:
1192	
1193		+-------+       :      :                :       :
1194		|       |       +------+                +-------+
1195		|       |------>| A=1  |------      --->| A->0  |
1196		|       |       +------+      \         +-------+
1197		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1198		|       |       +------+        |       +-------+
1199		|       |------>| B=2  |---     |       :       :
1200		|       |       +------+   \    |       :       :       +-------+
1201		+-------+       :      :    \   |       +-------+       |       |
1202		                             ---------->| B->2  |------>|       |
1203		                                |       +-------+       | CPU 2 |
1204		                                |       :       :       |       |
1205		                                |       :       :       |       |
1206		                                |       +-------+       |       |
1207		                                |       | A->0  |------>| 1st   |
1208		                                |       +-------+       |       |
1209		  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1210		  barrier causes all effects      \     +-------+       |       |
1211		  prior to the storage of B        ---->| A->1  |------>| 2nd   |
1212		  to be perceptible to CPU 2            +-------+       |       |
1213		                                        :       :       +-------+
1214	
1215	
1216	But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
1217	before the read barrier completes anyway:
1218	
1219		+-------+       :      :                :       :
1220		|       |       +------+                +-------+
1221		|       |------>| A=1  |------      --->| A->0  |
1222		|       |       +------+      \         +-------+
1223		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1224		|       |       +------+        |       +-------+
1225		|       |------>| B=2  |---     |       :       :
1226		|       |       +------+   \    |       :       :       +-------+
1227		+-------+       :      :    \   |       +-------+       |       |
1228		                             ---------->| B->2  |------>|       |
1229		                                |       +-------+       | CPU 2 |
1230		                                |       :       :       |       |
1231		                                 \      :       :       |       |
1232		                                  \     +-------+       |       |
1233		                                   ---->| A->1  |------>| 1st   |
1234		                                        +-------+       |       |
1235		                                    rrrrrrrrrrrrrrrrr   |       |
1236		                                        +-------+       |       |
1237		                                        | A->1  |------>| 2nd   |
1238		                                        +-------+       |       |
1239		                                        :       :       +-------+
1240	
1241	
1242	The guarantee is that the second load will always come up with A == 1 if the
1243	load of B came up with B == 2.  No such guarantee exists for the first load of
1244	A; that may come up with either A == 0 or A == 1.
1245	
1246	
1247	READ MEMORY BARRIERS VS LOAD SPECULATION
1248	----------------------------------------
1249	
1250	Many CPUs speculate with loads: that is, they see that they will need to load
1251	an item from memory, and they find a time when the bus is not being used for
1252	any other loads, and so do the load in advance - even though they haven't
1253	actually reached that point in the instruction execution flow yet.  This
1254	permits the actual load instruction to potentially complete immediately,
1255	because the CPU already has the value to hand.
1256	
1257	It may turn out that the CPU didn't actually need the value - perhaps because a
1258	branch circumvented the load - in which case it can discard the value or just
1259	cache it for later use.
1260	
1261	Consider:
1262	
1263		CPU 1			CPU 2
1264		=======================	=======================
1265					LOAD B
1266					DIVIDE		} Divide instructions generally
1267					DIVIDE		} take a long time to perform
1268					LOAD A
1269	
1270	Which might appear as this:
1271	
1272		                                        :       :       +-------+
1273		                                        +-------+       |       |
1274		                                    --->| B->2  |------>|       |
1275		                                        +-------+       | CPU 2 |
1276		                                        :       :DIVIDE |       |
1277		                                        +-------+       |       |
1278		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1279		division speculates on the              +-------+   ~   |       |
1280		LOAD of A                               :       :   ~   |       |
1281		                                        :       :DIVIDE |       |
1282		                                        :       :   ~   |       |
1283		Once the divisions are complete -->     :       :   ~-->|       |
1284		the CPU can then perform the            :       :       |       |
1285		LOAD with immediate effect              :       :       +-------+
1286	
1287	
1288	Placing a read barrier or a data dependency barrier just before the second
1289	load:
1290	
1291		CPU 1			CPU 2
1292		=======================	=======================
1293					LOAD B
1294					DIVIDE
1295					DIVIDE
1296					<read barrier>
1297					LOAD A
1298	
1299	will force any value speculatively obtained to be reconsidered to an extent
1300	dependent on the type of barrier used.  If there was no change made to the
1301	speculated memory location, then the speculated value will just be used:
1302	
1303		                                        :       :       +-------+
1304		                                        +-------+       |       |
1305		                                    --->| B->2  |------>|       |
1306		                                        +-------+       | CPU 2 |
1307		                                        :       :DIVIDE |       |
1308		                                        +-------+       |       |
1309		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1310		division speculates on the              +-------+   ~   |       |
1311		LOAD of A                               :       :   ~   |       |
1312		                                        :       :DIVIDE |       |
1313		                                        :       :   ~   |       |
1314		                                        :       :   ~   |       |
1315		                                    rrrrrrrrrrrrrrrr~   |       |
1316		                                        :       :   ~   |       |
1317		                                        :       :   ~-->|       |
1318		                                        :       :       |       |
1319		                                        :       :       +-------+
1320	
1321	
1322	but if there was an update or an invalidation from another CPU pending, then
1323	the speculation will be cancelled and the value reloaded:
1324	
1325		                                        :       :       +-------+
1326		                                        +-------+       |       |
1327		                                    --->| B->2  |------>|       |
1328		                                        +-------+       | CPU 2 |
1329		                                        :       :DIVIDE |       |
1330		                                        +-------+       |       |
1331		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1332		division speculates on the              +-------+   ~   |       |
1333		LOAD of A                               :       :   ~   |       |
1334		                                        :       :DIVIDE |       |
1335		                                        :       :   ~   |       |
1336		                                        :       :   ~   |       |
1337		                                    rrrrrrrrrrrrrrrrr   |       |
1338		                                        +-------+       |       |
1339		The speculation is discarded --->   --->| A->1  |------>|       |
1340		and an updated value is                 +-------+       |       |
1341		retrieved                               :       :       +-------+
1342	
1343	
1344	MULTICOPY ATOMICITY
1345	-------------------
1346	
1347	Multicopy atomicity is a deeply intuitive notion about ordering that is
1348	not always provided by real computer systems, namely that a given store
1349	becomes visible at the same time to all CPUs, or, alternatively, that all
1350	CPUs agree on the order in which all stores become visible.  However,
1351	support of full multicopy atomicity would rule out valuable hardware
1352	optimizations, so a weaker form called ``other multicopy atomicity''
1353	instead guarantees only that a given store becomes visible at the same
1354	time to all -other- CPUs.  The remainder of this document discusses this
1355	weaker form, but for brevity will call it simply ``multicopy atomicity''.
1356	
1357	The following example demonstrates multicopy atomicity:
1358	
1359		CPU 1			CPU 2			CPU 3
1360		=======================	=======================	=======================
1361			{ X = 0, Y = 0 }
1362		STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
1363					<general barrier>	<read barrier>
1364					STORE Y=r1		LOAD X
1365	
1366	Suppose that CPU 2's load from X returns 1, which it then stores to Y,
1367	and CPU 3's load from Y returns 1.  This indicates that CPU 1's store
1368	to X precedes CPU 2's load from X and that CPU 2's store to Y precedes
1369	CPU 3's load from Y.  In addition, the memory barriers guarantee that
1370	CPU 2 executes its load before its store, and CPU 3 loads from Y before
1371	it loads from X.  The question is then "Can CPU 3's load from X return 0?"
1372	
1373	Because CPU 3's load from X in some sense comes after CPU 2's load, it
1374	is natural to expect that CPU 3's load from X must therefore return 1.
1375	This expectation follows from multicopy atomicity: if a load executing
1376	on CPU B follows a load from the same variable executing on CPU A (and
1377	CPU A did not originally store the value which it read), then on
1378	multicopy-atomic systems, CPU B's load must return either the same value
1379	that CPU A's load did or some later value.  However, the Linux kernel
1380	does not require systems to be multicopy atomic.
1381	
1382	The use of a general memory barrier in the example above compensates
1383	for any lack of multicopy atomicity.  In the example, if CPU 2's load
1384	from X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load
1385	from X must indeed also return 1.
1386	
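	Rendered as C in the style of the later examples (a sketch; READ_ONCE()
	and WRITE_ONCE() stand in for the plain loads and stores shown above,
	and r1, r2 and r3 play the role of the per-CPU registers):

		int X, Y;
		int r1, r2, r3;

		void cpu1(void)
		{
			WRITE_ONCE(X, 1);
		}

		void cpu2(void)
		{
			r1 = READ_ONCE(X);	/* assume this reads 1 */
			smp_mb();		/* the <general barrier> */
			WRITE_ONCE(Y, r1);
		}

		void cpu3(void)
		{
			r2 = READ_ONCE(Y);	/* assume this reads 1 */
			smp_rmb();		/* the <read barrier> */
			r3 = READ_ONCE(X);	/* must then read 1 */
		}
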
1387	However, dependencies, read barriers, and write barriers are not always
1388	able to compensate for non-multicopy atomicity.  For example, suppose
1389	that CPU 2's general barrier is removed from the above example, leaving
1390	only the data dependency shown below:
1391	
1392		CPU 1			CPU 2			CPU 3
1393		=======================	=======================	=======================
1394			{ X = 0, Y = 0 }
1395		STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
1396					<data dependency>	<read barrier>
1397					STORE Y=r1		LOAD X (reads 0)
1398	
1399	This substitution allows non-multicopy atomicity to run rampant: in
1400	this example, it is perfectly legal for CPU 2's load from X to return 1,
1401	CPU 3's load from Y to return 1, and its load from X to return 0.
1402	
1403	The key point is that although CPU 2's data dependency orders its load
1404	and store, it does not guarantee to order CPU 1's store.  Thus, if this
1405	example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
1406	store buffer or a level of cache, CPU 2 might have early access to CPU 1's
1407	writes.  General barriers are therefore required to ensure that all CPUs
1408	agree on the combined order of multiple accesses.
1409	
1410	General barriers can compensate not only for non-multicopy atomicity,
1411	but can also generate additional ordering that can ensure that -all-
1412	CPUs will perceive the same order of -all- operations.  In contrast, a
1413	chain of release-acquire pairs does not provide this additional ordering,
1414	which means that only those CPUs on the chain are guaranteed to agree
1415	on the combined order of the accesses.  For example, switching to C code
1416	in deference to the ghost of Herman Hollerith:
1417	
1418		int u, v, x, y, z;
1419	
1420		void cpu0(void)
1421		{
1422			r0 = smp_load_acquire(&x);
1423			WRITE_ONCE(u, 1);
1424			smp_store_release(&y, 1);
1425		}
1426	
1427		void cpu1(void)
1428		{
1429			r1 = smp_load_acquire(&y);
1430			r4 = READ_ONCE(v);
1431			r5 = READ_ONCE(u);
1432			smp_store_release(&z, 1);
1433		}
1434	
1435		void cpu2(void)
1436		{
1437			r2 = smp_load_acquire(&z);
1438			smp_store_release(&x, 1);
1439		}
1440	
1441		void cpu3(void)
1442		{
1443			WRITE_ONCE(v, 1);
1444			smp_mb();
1445			r3 = READ_ONCE(u);
1446		}
1447	
1448	Because cpu0(), cpu1(), and cpu2() participate in a chain of
1449	smp_store_release()/smp_load_acquire() pairs, the following outcome
1450	is prohibited:
1451	
1452		r0 == 1 && r1 == 1 && r2 == 1
1453	
1454	Furthermore, because of the release-acquire relationship between cpu0()
1455	and cpu1(), cpu1() must see cpu0()'s writes, so that the following
1456	outcome is prohibited:
1457	
1458		r1 == 1 && r5 == 0
1459	
1460	However, the ordering provided by a release-acquire chain is local
1461	to the CPUs participating in that chain and does not apply to cpu3(),
1462	at least aside from stores.  Therefore, the following outcome is possible:
1463	
1464		r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
1465	
1466	As an aside, the following outcome is also possible:
1467	
1468		r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
1469	
1470	Although cpu0(), cpu1(), and cpu2() will see their respective reads and
1471	writes in order, CPUs not involved in the release-acquire chain might
1472	well disagree on the order.  This disagreement stems from the fact that
1473	the weak memory-barrier instructions used to implement smp_load_acquire()
1474	and smp_store_release() are not required to order prior stores against
1475	subsequent loads in all cases.  This means that cpu3() can see cpu0()'s
1476	store to u as happening -after- cpu1()'s load from v, even though
1477	both cpu0() and cpu1() agree that these two operations occurred in the
1478	intended order.
1479	
1480	However, please keep in mind that smp_load_acquire() is not magic.
1481	In particular, it simply reads from its argument with ordering.  It does
1482	-not- ensure that any particular value will be read.  Therefore, the
1483	following outcome is possible:
1484	
1485		r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
1486	
1487	Note that this outcome can happen even on a mythical sequentially
1488	consistent system where nothing is ever reordered.
1489	
1490	To reiterate, if your code requires full ordering of all operations,
1491	use general barriers throughout.
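
	For example, one way to apply this advice to the example above is to
	rewrite cpu0() and cpu1() using plain accesses separated by general
	barriers (a sketch, not a drop-in from the original example):

		void cpu0(void)
		{
			r0 = READ_ONCE(x);
			smp_mb();
			WRITE_ONCE(u, 1);
			smp_mb();
			WRITE_ONCE(y, 1);
		}

		void cpu1(void)
		{
			r1 = READ_ONCE(y);
			smp_mb();
			r4 = READ_ONCE(v);
			r5 = READ_ONCE(u);
			smp_mb();
			WRITE_ONCE(z, 1);
		}

	With general barriers separating every access, all CPUs -- including
	cpu3() -- are then guaranteed to agree on the order of these accesses,
	as described above.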
1492	
1493	
1494	========================
1495	EXPLICIT KERNEL BARRIERS
1496	========================
1497	
1498	The Linux kernel has a variety of different barriers that act at different
1499	levels:
1500	
1501	  (*) Compiler barrier.
1502	
1503	  (*) CPU memory barriers.
1504	
1505	  (*) MMIO write barrier.
1506	
1507	
1508	COMPILER BARRIER
1509	----------------
1510	
1511	The Linux kernel has an explicit compiler barrier function that prevents the
1512	compiler from moving the memory accesses on either side of it to the other side:
1513	
1514		barrier();
1515	
1516	This is a general barrier -- there are no read-read or write-write
1517	variants of barrier().  However, READ_ONCE() and WRITE_ONCE() can be
1518	thought of as weak forms of barrier() that affect only the specific
1519	accesses flagged by the READ_ONCE() or WRITE_ONCE().
1520	
1521	The barrier() function has the following effects:
1522	
1523	 (*) Prevents the compiler from reordering accesses following the
1524	     barrier() to precede any accesses preceding the barrier().
1525	     One example use for this property is to ease communication between
1526	     interrupt-handler code and the code that was interrupted.
1527	
1528	 (*) Within a loop, forces the compiler to load the variables used
1529	     in that loop's conditional on each pass through that loop.
1530	
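	As a minimal sketch of the second property (the 'flag' variable here is
	hypothetical and shared with, say, an interrupt handler):

		while (!flag)
			barrier();	/* 'flag' is re-read on each pass rather
					 * than being hoisted out of the loop */

	Without the barrier(), the compiler would be within its rights to read
	'flag' once and spin forever on the cached value.
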
1531	The READ_ONCE() and WRITE_ONCE() functions can prevent any number of
1532	optimizations that, while perfectly safe in single-threaded code, can
1533	be fatal in concurrent code.  Here are some examples of these sorts
1534	of optimizations:
1535	
1536	 (*) The compiler is within its rights to reorder loads and stores
1537	     to the same variable, and in some cases, the CPU is within its
1538	     rights to reorder loads to the same variable.  This means that
1539	     the following code:
1540	
1541		a[0] = x;
1542		a[1] = x;
1543	
1544	     Might result in an older value of x stored in a[1] than in a[0].
1545	     Prevent both the compiler and the CPU from doing this as follows:
1546	
1547		a[0] = READ_ONCE(x);
1548		a[1] = READ_ONCE(x);
1549	
1550	     In short, READ_ONCE() and WRITE_ONCE() provide cache coherence for
1551	     accesses from multiple CPUs to a single variable.
1552	
1553	 (*) The compiler is within its rights to merge successive loads from
1554	     the same variable.  Such merging can cause the compiler to "optimize"
1555	     the following code:
1556	
1557		while (tmp = a)
1558			do_something_with(tmp);
1559	
1560	     into the following code, which, although in some sense legitimate
1561	     for single-threaded code, is almost certainly not what the developer
1562	     intended:
1563	
1564		if (tmp = a)
1565			for (;;)
1566				do_something_with(tmp);
1567	
1568	     Use READ_ONCE() to prevent the compiler from doing this to you:
1569	
1570		while (tmp = READ_ONCE(a))
1571			do_something_with(tmp);
1572	
1573	 (*) The compiler is within its rights to reload a variable, for example,
1574	     in cases where high register pressure prevents the compiler from
1575	     keeping all data of interest in registers.  The compiler might
1576	     therefore optimize the variable 'tmp' out of our previous example:
1577	
1578		while (tmp = a)
1579			do_something_with(tmp);
1580	
1581	     This could result in the following code, which is perfectly safe in
1582	     single-threaded code, but can be fatal in concurrent code:
1583	
1584		while (a)
1585			do_something_with(a);
1586	
1587	     For example, the optimized version of this code could result in
1588	     passing a zero to do_something_with() in the case where the variable
1589	     a was modified by some other CPU between the "while" statement and
1590	     the call to do_something_with().
1591	
1592	     Again, use READ_ONCE() to prevent the compiler from doing this:
1593	
1594		while (tmp = READ_ONCE(a))
1595			do_something_with(tmp);
1596	
1597	     Note that if the compiler runs short of registers, it might save
1598	     tmp onto the stack.  The overhead of this saving and later restoring
1599	     is why compilers reload variables.  Doing so is perfectly safe for
1600	     single-threaded code, so you need to tell the compiler about cases
1601	     where it is not safe.
1602	
1603	 (*) The compiler is within its rights to omit a load entirely if it knows
1604	     what the value will be.  For example, if the compiler can prove that
1605	     the value of variable 'a' is always zero, it can optimize this code:
1606	
1607		while (tmp = a)
1608			do_something_with(tmp);
1609	
1610	     Into this:
1611	
1612		do { } while (0);
1613	
1614	     This transformation is a win for single-threaded code because it
1615	     gets rid of a load and a branch.  The problem is that the compiler
1616	     will carry out its proof assuming that the current CPU is the only
1617	     one updating variable 'a'.  If variable 'a' is shared, then the
1618	     compiler's proof will be erroneous.  Use READ_ONCE() to tell the
1619	     compiler that it doesn't know as much as it thinks it does:
1620	
1621		while (tmp = READ_ONCE(a))
1622			do_something_with(tmp);
1623	
1624	     But please note that the compiler is also closely watching what you
1625	     do with the value after the READ_ONCE().  For example, suppose you
1626	     do the following and MAX is a preprocessor macro with the value 1:
1627	
1628		while ((tmp = READ_ONCE(a)) % MAX)
1629			do_something_with(tmp);
1630	
1631	     Then the compiler knows that the result of the "%" operator applied
1632	     to MAX will always be zero, again allowing the compiler to optimize
1633	     the code into near-nonexistence.  (It will still load from the
1634	     variable 'a'.)
1635	
1636	 (*) Similarly, the compiler is within its rights to omit a store entirely
1637	     if it knows that the variable already has the value being stored.
1638	     Again, the compiler assumes that the current CPU is the only one
1639	     storing into the variable, which can cause the compiler to do the
1640	     wrong thing for shared variables.  For example, suppose you have
1641	     the following:
1642	
1643		a = 0;
1644		... Code that does not store to variable a ...
1645		a = 0;
1646	
1647	     The compiler sees that the value of variable 'a' is already zero, so
1648	     it might well omit the second store.  This would come as a fatal
1649	     surprise if some other CPU might have stored to variable 'a' in the
1650	     meantime.
1651	
1652	     Use WRITE_ONCE() to prevent the compiler from making this sort of
1653	     wrong guess:
1654	
1655		WRITE_ONCE(a, 0);
1656		... Code that does not store to variable a ...
1657		WRITE_ONCE(a, 0);
1658	
1659	 (*) The compiler is within its rights to reorder memory accesses unless
1660	     you tell it not to.  For example, consider the following interaction
1661	     between process-level code and an interrupt handler:
1662	
1663		void process_level(void)
1664		{
1665			msg = get_message();
1666			flag = true;
1667		}
1668	
1669		void interrupt_handler(void)
1670		{
1671			if (flag)
1672				process_message(msg);
1673		}
1674	
1675	     There is nothing to prevent the compiler from transforming
1676	     process_level() to the following; in fact, this might well be a
1677	     win for single-threaded code:
1678	
1679		void process_level(void)
1680		{
1681			flag = true;
1682			msg = get_message();
1683		}
1684	
1685	     If the interrupt occurs between these two statements, then
1686	     interrupt_handler() might be passed a garbled msg.  Use WRITE_ONCE()
1687	     to prevent this as follows:
1688	
1689		void process_level(void)
1690		{
1691			WRITE_ONCE(msg, get_message());
1692			WRITE_ONCE(flag, true);
1693		}
1694	
1695		void interrupt_handler(void)
1696		{
1697			if (READ_ONCE(flag))
1698				process_message(READ_ONCE(msg));
1699		}
1700	
1701	     Note that the READ_ONCE() and WRITE_ONCE() wrappers in
1702	     interrupt_handler() are needed if this interrupt handler can itself
1703	     be interrupted by something that also accesses 'flag' and 'msg',
1704	     for example, a nested interrupt or an NMI.  Otherwise, READ_ONCE()
1705	     and WRITE_ONCE() are not needed in interrupt_handler() other than
1706	     for documentation purposes.  (Note also that nested interrupts
1707	     do not typically occur in modern Linux kernels; in fact, if an
1708	     interrupt handler returns with interrupts enabled, you will get a
1709	     WARN_ONCE() splat.)
1710	
1711	     You should assume that the compiler can move READ_ONCE() and
1712	     WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(),
1713	     barrier(), or similar primitives.
1714	
1715	     This effect could also be achieved using barrier(), but READ_ONCE()
1716	     and WRITE_ONCE() are more selective:  With READ_ONCE() and
1717	     WRITE_ONCE(), the compiler need only forget the contents of the
1718	     indicated memory locations, while with barrier() the compiler must
1719	     discard the value of all memory locations that it has currently
1720	     cached in any machine registers.  Of course, the compiler must also
1721	     respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur,
1722	     though the CPU of course need not do so.
1723	
1724	 (*) The compiler is within its rights to invent stores to a variable,
1725	     as in the following example:
1726	
1727		if (a)
1728			b = a;
1729		else
1730			b = 42;
1731	
1732	     The compiler might save a branch by optimizing this as follows:
1733	
1734		b = 42;
1735		if (a)
1736			b = a;
1737	
1738	     In single-threaded code, this is not only safe, but also saves
1739	     a branch.  Unfortunately, in concurrent code, this optimization
1740	     could cause some other CPU to see a spurious value of 42 -- even
1741	     if variable 'a' was never zero -- when loading variable 'b'.
1742	     Use WRITE_ONCE() to prevent this as follows:
1743	
1744		if (a)
1745			WRITE_ONCE(b, a);
1746		else
1747			WRITE_ONCE(b, 42);
1748	
1749	     The compiler can also invent loads.  These are usually less
1750	     damaging, but they can result in cache-line bouncing and thus in
1751	     poor performance and scalability.  Use READ_ONCE() to prevent
1752	     invented loads.
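
	     As a hedged sketch of an invented load ('shared_flag' and the helper
	     functions are hypothetical):

		int tmp = shared_flag;

		if (tmp)
			first_action(tmp);
		if (tmp)
			second_action(tmp);

	     Under register pressure, the compiler may discard 'tmp' and instead
	     re-read 'shared_flag' at each use, so the two tests (and the two
	     calls) can observe different values.  Initializing 'tmp' with
	     READ_ONCE(shared_flag) forbids such invented loads.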
1753	
1754	 (*) For aligned memory locations whose size allows them to be accessed
1755	     with a single memory-reference instruction, READ_ONCE() and WRITE_ONCE()
1756	     prevent "load tearing" and "store tearing," in which a single large
1757	     access is replaced by multiple smaller accesses.  For example, given an
1758	     architecture having 16-bit store instructions with 7-bit immediate
1759	     fields, the compiler might be tempted to use two 16-bit store-immediate
1760	     instructions to implement the following 32-bit store:
1761	
1762		p = 0x00010002;
1763	
1764	     Please note that GCC really does use this sort of optimization,
1765	     which is not surprising given that it would likely take more
1766	     than two instructions to build the constant and then store it.
1767	     This optimization can therefore be a win in single-threaded code.
1768	     In fact, a recent bug (since fixed) caused GCC to incorrectly use
1769	     this optimization in a volatile store.  In the absence of such bugs,
1770	     use of WRITE_ONCE() prevents store tearing in the following example:
1771	
1772		WRITE_ONCE(p, 0x00010002);
1773	
1774	     Use of packed structures can also result in load and store tearing,
1775	     as in this example:
1776	
1777		struct __attribute__((__packed__)) foo {
1778			short a;
1779			int b;
1780			short c;
1781		};
1782		struct foo foo1, foo2;
1783		...
1784	
1785		foo2.a = foo1.a;
1786		foo2.b = foo1.b;
1787		foo2.c = foo1.c;
1788	
1789	     Because there are no READ_ONCE() or WRITE_ONCE() wrappers and no
1790	     volatile markings, the compiler would be well within its rights to
1791	     implement these three assignment statements as a pair of 32-bit
1792	     loads followed by a pair of 32-bit stores.  This would result in
1793	     load tearing on 'foo1.b' and store tearing on 'foo2.b'.  READ_ONCE()
1794	     and WRITE_ONCE() again prevent tearing in this example:
1795	
1796		foo2.a = foo1.a;
1797		WRITE_ONCE(foo2.b, READ_ONCE(foo1.b));
1798		foo2.c = foo1.c;
1799	
1800	All that aside, it is never necessary to use READ_ONCE() and
1801	WRITE_ONCE() on a variable that has been marked volatile.  For example,
1802	because 'jiffies' is marked volatile, it is never necessary to
1803	say READ_ONCE(jiffies).  The reason for this is that READ_ONCE() and
1804	WRITE_ONCE() are implemented as volatile casts, which have no effect when
1805	their argument is already marked volatile.
1806	
1807	Please note that these compiler barriers have no direct effect on the CPU,
1808	which may then reorder things however it wishes.
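
	For example (a sketch; 'a' and 'b' are hypothetical shared variables),
	the compiler must emit the two stores below in program order, but
	another CPU may still observe them in either order unless a CPU barrier
	is added:

		WRITE_ONCE(a, 1);
		smp_wmb();	/* CPU-level ordering; WRITE_ONCE() alone only
				 * constrains the compiler */
		WRITE_ONCE(b, 1);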
1809	
1810	
1811	CPU MEMORY BARRIERS
1812	-------------------
1813	
1814	The Linux kernel has the following basic CPU memory barriers:
1815	
1816		TYPE		MANDATORY		SMP CONDITIONAL
1817		===============	=======================	===========================
1818		GENERAL		mb()			smp_mb()
1819		WRITE		wmb()			smp_wmb()
1820		READ		rmb()			smp_rmb()
1821		DATA DEPENDENCY				READ_ONCE()
1822	
1823	
1824	All memory barriers except the data dependency barriers imply a compiler
1825	barrier.  Data dependencies do not impose any additional compiler ordering.
1826	
1827	Aside: In the case of data dependencies, the compiler would be expected
1828	to issue the loads in the correct order (eg. `a[b]` would have to load
1829	the value of b before loading a[b]); however, there is no guarantee in
1830	the C specification that the compiler may not speculate the value of b
1831	(eg. is equal to 1) and load a before b (eg. tmp = a[1]; if (b != 1)
1832	tmp = a[b]; ).  There is also the problem of a compiler reloading b after
1833	having loaded a[b], thus having a newer copy of b than a[b].  A consensus
1834	has not yet been reached about these problems, however the READ_ONCE()
1835	macro is a good place to start looking.
1836	
1837	SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
1838	systems because it is assumed that a CPU will appear to be self-consistent,
1839	and will order overlapping accesses correctly with respect to itself.
1840	However, see the subsection on "Virtual Machine Guests" below.
1841	
1842	[!] Note that SMP memory barriers _must_ be used to control the ordering of
1843	references to shared memory on SMP systems, though the use of locking instead
1844	is sufficient.
1845	
1846	Mandatory barriers should not be used to control SMP effects, since mandatory
1847	barriers impose unnecessary overhead on both SMP and UP systems. They may,
1848	however, be used to control MMIO effects on accesses through relaxed memory I/O
1849	windows.  These barriers are required even on non-SMP systems as they affect
1850	the order in which memory operations appear to a device by prohibiting both the
1851	compiler and the CPU from reordering them.
1852	
1853	
1854	There are some more advanced barrier functions:
1855	
1856	 (*) smp_store_mb(var, value)
1857	
1858	     This assigns the value to the variable and then inserts a full memory
1859	     barrier after it.  It isn't guaranteed to insert anything more than a
1860	     compiler barrier in a UP compilation.
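
	     As a rough sketch of the intent (illustrative only; the actual
	     definition is per-architecture):

		smp_store_mb(var, value);

	     behaves like:

		WRITE_ONCE(var, value);
		smp_mb();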
1861	
1862	
1863	 (*) smp_mb__before_atomic();
1864	 (*) smp_mb__after_atomic();
1865	
1866	     These are for use with atomic (such as add, subtract, increment and
1867	     decrement) functions that don't return a value, especially when used for
1868	     reference counting.  These functions do not imply memory barriers.
1869	
1870	     These are also used for atomic bitop functions that do not return a
1871	     value (such as set_bit and clear_bit).
1872	
1873	     As an example, consider a piece of code that marks an object as being dead
1874	     and then decrements the object's reference count:
1875	
1876		obj->dead = 1;
1877		smp_mb__before_atomic();
1878		atomic_dec(&obj->ref_count);
1879	
1880	     This makes sure that the death mark on the object is perceived to be set
1881	     *before* the reference counter is decremented.
1882	
1883	     See Documentation/atomic_{t,bitops}.txt for more information.
1884	
1885	
1886	 (*) dma_wmb();
1887	 (*) dma_rmb();
1888	
1889	     These are for use with consistent memory to guarantee the ordering
1890	     of writes or reads of shared memory accessible to both the CPU and a
1891	     DMA capable device.
1892	
1893	     For example, consider a device driver that shares memory with a device
1894	     and uses a descriptor status value to indicate if the descriptor belongs
1895	     to the device or the CPU, and a doorbell to notify it when new
1896	     descriptors are available:
1897	
1898		if (desc->status != DEVICE_OWN) {
1899			/* do not read data until we own descriptor */
1900			dma_rmb();
1901	
1902			/* read/modify data */
1903			read_data = desc->data;
1904			desc->data = write_data;
1905	
1906			/* flush modifications before status update */
1907			dma_wmb();
1908	
1909			/* assign ownership */
1910			desc->status = DEVICE_OWN;
1911	
1912			/* force memory to sync before notifying device via MMIO */
1913			wmb();
1914	
1915			/* notify device of new descriptors */
1916			writel(DESC_NOTIFY, doorbell);
1917		}
1918	
1919	     The dma_rmb() allows us to guarantee the device has released ownership
1920	     before we read the data from the descriptor, and the dma_wmb() allows
1921	     us to guarantee the data is written to the descriptor before the device
1922	     can see it now has ownership.  The wmb() is needed to guarantee that the
1923	     cache coherent memory writes have completed before attempting a write to
1924	     the cache incoherent MMIO region.
1925	
1926	     See Documentation/DMA-API.txt for more information on consistent memory.
1927	
1928	
1929	MMIO WRITE BARRIER
1930	------------------
1931	
1932	The Linux kernel also has a special barrier for use with memory-mapped I/O
1933	writes:
1934	
1935		mmiowb();
1936	
1937	This is a variation on the mandatory write barrier that causes writes to weakly
1938	ordered I/O regions to be partially ordered.  Its effects may go beyond the
1939	CPU->Hardware interface and actually affect the hardware at some level.
1940	
1941	See the subsection "Acquires vs I/O accesses" for more information.
1942	
1943	
1944	===============================
1945	IMPLICIT KERNEL MEMORY BARRIERS
1946	===============================
1947	
1948	Some of the other functions in the Linux kernel imply memory barriers, amongst
1949	which are locking and scheduling functions.
1950	
1951	This specification is a _minimum_ guarantee; any particular architecture may
1952	provide more substantial guarantees, but these may not be relied upon outside
1953	of arch specific code.
1954	
1955	
1956	LOCK ACQUISITION FUNCTIONS
1957	--------------------------
1958	
1959	The Linux kernel has a number of locking constructs:
1960	
1961	 (*) spin locks
1962	 (*) R/W spin locks
1963	 (*) mutexes
1964	 (*) semaphores
1965	 (*) R/W semaphores
1966	
1967	In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
1968	for each construct.  These operations all imply certain barriers:
1969	
1970	 (1) ACQUIRE operation implication:
1971	
1972	     Memory operations issued after the ACQUIRE will be completed after the
1973	     ACQUIRE operation has completed.
1974	
1975	     Memory operations issued before the ACQUIRE may be completed after
1976	     the ACQUIRE operation has completed.
1977	
1978	 (2) RELEASE operation implication:
1979	
1980	     Memory operations issued before the RELEASE will be completed before the
1981	     RELEASE operation has completed.
1982	
1983	     Memory operations issued after the RELEASE may be completed before the
1984	     RELEASE operation has completed.
1985	
1986	 (3) ACQUIRE vs ACQUIRE implication:
1987	
1988	     All ACQUIRE operations issued before another ACQUIRE operation will be
1989	     completed before that ACQUIRE operation.
1990	
1991	 (4) ACQUIRE vs RELEASE implication:
1992	
1993	     All ACQUIRE operations issued before a RELEASE operation will be
1994	     completed before the RELEASE operation.
1995	
1996	 (5) Failed conditional ACQUIRE implication:
1997	
1998	     Certain locking variants of the ACQUIRE operation may fail, either due to
1999	     being unable to get the lock immediately, or due to receiving an unblocked
2000	     signal whilst asleep waiting for the lock to become available.  Failed
2001	     locks do not imply any sort of barrier.
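
	     For example (a sketch; the lock, 'flag' and do_locked_work() are
	     hypothetical), the failed branch of spin_trylock() must not be
	     relied upon for ordering:

		if (spin_trylock(&lock)) {
			/* ACQUIRE semantics apply on this, the successful, path */
			do_locked_work();
			spin_unlock(&lock);
		} else {
			/* the failed attempt implies no barrier at all, so this
			 * load is not ordered with respect to the trylock */
			r1 = READ_ONCE(flag);
		}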
2002	
2003	[!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
2004	one-way barriers is that the effects of instructions outside of a critical
2005	section may seep into the inside of the critical section.
2006	
2007	An ACQUIRE followed by a RELEASE may not be assumed to be a full memory barrier
2008	because it is possible for an access preceding the ACQUIRE to happen after the
2009	ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and
2010	the two accesses can themselves then cross:
2011	
2012		*A = a;
2013		ACQUIRE M
2014		RELEASE M
2015		*B = b;
2016	
2017	may occur as:
2018	
2019		ACQUIRE M, STORE *B, STORE *A, RELEASE M
2020	
2021	When the ACQUIRE and RELEASE are a lock acquisition and release,
2022	respectively, this same reordering can occur if the lock's ACQUIRE and
2023	RELEASE are to the same lock variable, but only from the perspective of
2024	another CPU not holding that lock.  In short, an ACQUIRE followed by a
2025	RELEASE may -not- be assumed to be a full memory barrier.
2026	
2027	Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
2028	not imply a full memory barrier.  Therefore, the CPU's execution of the
2029	critical sections corresponding to the RELEASE and the ACQUIRE can cross,
2030	so that:
2031	
2032		*A = a;
2033		RELEASE M
2034		ACQUIRE N
2035		*B = b;
2036	
2037	could occur as:
2038	
2039		ACQUIRE N, STORE *B, STORE *A, RELEASE M
2040	
2041	It might appear that this reordering could introduce a deadlock.
2042	However, this cannot happen because if such a deadlock threatened,
2043	the RELEASE would simply complete, thereby avoiding the deadlock.
2044	
2045		Why does this work?
2046	
2047		One key point is that we are only talking about the CPU doing
2048		the reordering, not the compiler.  If the compiler (or, for
2049		that matter, the developer) switched the operations, deadlock
2050		-could- occur.
2051	
2052		But suppose the CPU reordered the operations.  In this case,
2053		the unlock precedes the lock in the assembly code.  The CPU
2054		simply elected to try executing the later lock operation first.
2055		If there is a deadlock, this lock operation will simply spin (or
2056		try to sleep, but more on that later).	The CPU will eventually
2057		execute the unlock operation (which preceded the lock operation
2058		in the assembly code), which will unravel the potential deadlock,
2059		allowing the lock operation to succeed.
2060	
2061		But what if the lock is a sleeplock?  In that case, the code will
2062		try to enter the scheduler, where it will eventually encounter
2063		a memory barrier, which will force the earlier unlock operation
2064		to complete, again unraveling the deadlock.  There might be
2065		a sleep-unlock race, but the locking primitive needs to resolve
2066		such races properly in any case.
2067	
2068	Locks and semaphores may not provide any guarantee of ordering on UP compiled
2069	systems, and so cannot be counted on in such a situation to actually achieve
2070	anything at all - especially with respect to I/O accesses - unless combined
2071	with interrupt disabling operations.
2072	
2073	See also the section on "Inter-CPU acquiring barrier effects".
2074	
2075	
2076	As an example, consider the following:
2077	
2078		*A = a;
2079		*B = b;
2080		ACQUIRE
2081		*C = c;
2082		*D = d;
2083		RELEASE
2084		*E = e;
2085		*F = f;
2086	
2087	The following sequence of events is acceptable:
2088	
2089		ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
2090	
2091		[+] Note that {*F,*A} indicates a combined access.
2092	
2093	But none of the following are:
2094	
2095		{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
2096		*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
2097		*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
2098		*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
2099	
2100	
2101	
2102	INTERRUPT DISABLING FUNCTIONS
2103	-----------------------------
2104	
2105	Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
2106	(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
2107	barriers are required in such a situation, they must be provided by some
2108	other means.
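
	As a hedged sketch (the variables here are hypothetical): disabling
	interrupts orders nothing as far as other CPUs are concerned, so an
	explicit barrier is still needed if another CPU must observe the two
	stores below in order:

		unsigned long flags;

		local_irq_save(flags);		/* compiler barrier only */
		WRITE_ONCE(shared_data, val);
		smp_wmb();			/* still required for cross-CPU
						 * ordering of the two stores */
		WRITE_ONCE(data_ready, 1);
		local_irq_restore(flags);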
2109	
2110	
2111	SLEEP AND WAKE-UP FUNCTIONS
2112	---------------------------
2113	
2114	Sleeping and waking on an event flagged in global data can be viewed as an
2115	interaction between two pieces of data: the task state of the task waiting for
2116	the event and the global data used to indicate the event.  To make sure that
2117	these appear to happen in the right order, the primitives to begin the process
2118	of going to sleep, and the primitives to initiate a wake up imply certain
2119	barriers.
2120	
2121	Firstly, the sleeper normally follows something like this sequence of events:
2122	
2123		for (;;) {
2124			set_current_state(TASK_UNINTERRUPTIBLE);
2125			if (event_indicated)
2126				break;
2127			schedule();
2128		}
2129	
2130	A general memory barrier is interpolated automatically by set_current_state()
2131	after it has altered the task state:
2132	
2133		CPU 1
2134		===============================
2135		set_current_state();
2136		  smp_store_mb();
2137		    STORE current->state
2138		    <general barrier>
2139		LOAD event_indicated
2140	
2141	set_current_state() may be wrapped by:
2142	
2143		prepare_to_wait();
2144		prepare_to_wait_exclusive();
2145	
2146	which therefore also imply a general memory barrier after setting the state.
2147	The whole sequence above is available in various canned forms, all of which
2148	interpolate the memory barrier in the right place:
2149	
2150		wait_event();
2151		wait_event_interruptible();
2152		wait_event_interruptible_exclusive();
2153		wait_event_interruptible_timeout();
2154		wait_event_killable();
2155		wait_event_timeout();
2156		wait_on_bit();
2157		wait_on_bit_lock();
2158	
2159	
2160	Secondly, code that performs a wake up normally follows something like this:
2161	
2162		event_indicated = 1;
2163		wake_up(&event_wait_queue);
2164	
2165	or:
2166	
2167		event_indicated = 1;
2168		wake_up_process(event_daemon);
2169	
2170	A write memory barrier is implied by wake_up() and co.  if and only if they
2171	wake something up.  The barrier occurs before the task state is cleared, and so
2172	sits between the STORE to indicate the event and the STORE to set TASK_RUNNING:
2173	
2174		CPU 1				CPU 2
2175		===============================	===============================
2176		set_current_state();		STORE event_indicated
2177		  smp_store_mb();		wake_up();
2178		    STORE current->state	  <write barrier>
2179		    <general barrier>		  STORE current->state
2180		LOAD event_indicated
2181	
2182	To repeat, this write memory barrier is present if and only if something
2183	is actually awakened.  To see this, consider the following sequence of
2184	events, where X and Y are both initially zero:
2185	
2186		CPU 1				CPU 2
2187		===============================	===============================
2188		X = 1;				STORE event_indicated
2189		smp_mb();			wake_up();
2190		Y = 1;				wait_event(wq, Y == 1);
2191		wake_up();			  load from Y sees 1, no memory barrier
2192						load from X might see 0
2193	
2194	In contrast, if a wakeup does occur, CPU 2's load from X would be guaranteed
2195	to see 1.
2196	
2197	The available waker functions include:
2198	
2199		complete();
2200		wake_up();
2201		wake_up_all();
2202		wake_up_bit();
2203		wake_up_interruptible();
2204		wake_up_interruptible_all();
2205		wake_up_interruptible_nr();
2206		wake_up_interruptible_poll();
2207		wake_up_interruptible_sync();
2208		wake_up_interruptible_sync_poll();
2209		wake_up_locked();
2210		wake_up_locked_poll();
2211		wake_up_nr();
2212		wake_up_poll();
2213		wake_up_process();
2214	
2215	
2216	[!] Note that the memory barriers implied by the sleeper and the waker do _not_
2217	order multiple stores before the wake-up with respect to loads of those stored
2218	values after the sleeper has called set_current_state().  For instance, if the
2219	sleeper does:
2220	
2221		set_current_state(TASK_INTERRUPTIBLE);
2222		if (event_indicated)
2223			break;
2224		__set_current_state(TASK_RUNNING);
2225		do_something(my_data);
2226	
2227	and the waker does:
2228	
2229		my_data = value;
2230		event_indicated = 1;
2231		wake_up(&event_wait_queue);
2232	
2233	there's no guarantee that the change to event_indicated will be perceived by
2234	the sleeper as coming after the change to my_data.  In such a circumstance, the
2235	code on both sides must interpolate its own memory barriers between the
2236	separate data accesses.  Thus the above sleeper ought to do:
2237	
2238		set_current_state(TASK_INTERRUPTIBLE);
2239		if (event_indicated) {
2240			smp_rmb();
2241			do_something(my_data);
2242		}
2243	
2244	and the waker should do:
2245	
2246		my_data = value;
2247		smp_wmb();
2248		event_indicated = 1;
2249		wake_up(&event_wait_queue);
2250	
2251	
2252	MISCELLANEOUS FUNCTIONS
2253	-----------------------
2254	
2255	Other functions that imply barriers:
2256	
2257	 (*) schedule() and similar imply full memory barriers.
2258	
2259	
2260	===================================
2261	INTER-CPU ACQUIRING BARRIER EFFECTS
2262	===================================
2263	
2264	On SMP systems locking primitives give a more substantial form of barrier: one
2265	that does affect memory access ordering on other CPUs, within the context of
2266	conflict on any particular lock.
2267	
2268	
2269	ACQUIRES VS MEMORY ACCESSES
2270	---------------------------
2271	
2272	Consider the following: the system has a pair of spinlocks (M) and (Q), and
2273	three CPUs; then should the following sequence of events occur:
2274	
2275		CPU 1				CPU 2
2276		===============================	===============================
2277		WRITE_ONCE(*A, a);		WRITE_ONCE(*E, e);
2278		ACQUIRE M			ACQUIRE Q
2279		WRITE_ONCE(*B, b);		WRITE_ONCE(*F, f);
2280		WRITE_ONCE(*C, c);		WRITE_ONCE(*G, g);
2281		RELEASE M			RELEASE Q
2282		WRITE_ONCE(*D, d);		WRITE_ONCE(*H, h);
2283	
2284	Then there is no guarantee as to what order CPU 3 will see the accesses to *A
2285	through *H occur in, other than the constraints imposed by the separate locks
2286	on the separate CPUs.  It might, for example, see:
2287	
2288		*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
2289	
2290	But it won't see any of:
2291	
2292		*B, *C or *D preceding ACQUIRE M
2293		*A, *B or *C following RELEASE M
2294		*F, *G or *H preceding ACQUIRE Q
2295		*E, *F or *G following RELEASE Q
2296	
2297	
2298	
2299	ACQUIRES VS I/O ACCESSES
2300	------------------------
2301	
2302	Under certain circumstances (especially involving NUMA), I/O accesses within
2303	two spinlocked sections on two different CPUs may be seen as interleaved by the
2304	PCI bridge, because the PCI bridge does not necessarily participate in the
2305	cache-coherence protocol, and is therefore incapable of issuing the required
2306	read memory barriers.
2307	
2308	For example:
2309	
2310		CPU 1				CPU 2
2311		===============================	===============================
2312		spin_lock(Q);
2313		writel(0, ADDR);
2314		writel(1, DATA);
2315		spin_unlock(Q);
2316						spin_lock(Q);
2317						writel(4, ADDR);
2318						writel(5, DATA);
2319						spin_unlock(Q);
2320	
2321	may be seen by the PCI bridge as follows:
2322	
2323		STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
2324	
2325	which would probably cause the hardware to malfunction.
2326	
2327	
2328	What is necessary here is to intervene with an mmiowb() before dropping the
2329	spinlock, for example:
2330	
2331		CPU 1				CPU 2
2332		===============================	===============================
2333		spin_lock(Q);
2334		writel(0, ADDR);
2335		writel(1, DATA);
2336		mmiowb();
2337		spin_unlock(Q);
2338						spin_lock(Q);
2339						writel(4, ADDR);
2340						writel(5, DATA);
2341						mmiowb();
2342						spin_unlock(Q);
2343	
2344	this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
2345	before either of the stores issued on CPU 2.
2346	
2347	
2348	Furthermore, following a store by a load from the same device obviates the need
2349	for the mmiowb(), because the load forces the store to complete before the load
2350	is performed:
2351	
2352		CPU 1				CPU 2
2353		===============================	===============================
2354		spin_lock(Q);
2355		writel(0, ADDR);
2356		a = readl(DATA);
2357		spin_unlock(Q);
2358						spin_lock(Q);
2359						writel(4, ADDR);
2360						b = readl(DATA);
2361						spin_unlock(Q);
2362	
2363	
2364	See Documentation/driver-api/device-io.rst for more information.
2365	
2366	
2367	=================================
2368	WHERE ARE MEMORY BARRIERS NEEDED?
2369	=================================
2370	
2371	Under normal operation, memory operation reordering is generally not going to
2372	be a problem as a single-threaded linear piece of code will still appear to
2373	work correctly, even if it's in an SMP kernel.  There are, however, four
2374	circumstances in which reordering definitely _could_ be a problem:
2375	
2376	 (*) Interprocessor interaction.
2377	
2378	 (*) Atomic operations.
2379	
2380	 (*) Accessing devices.
2381	
2382	 (*) Interrupts.
2383	
2384	
2385	INTERPROCESSOR INTERACTION
2386	--------------------------
2387	
2388	When there's a system with more than one processor, more than one CPU in the
2389	system may be working on the same data set at the same time.  This can cause
2390	synchronisation problems, and the usual way of dealing with them is to use
2391	locks.  Locks, however, are quite expensive, and so it may be preferable to
2392	operate without the use of a lock if at all possible.  In such a case
2393	operations that affect both CPUs may have to be carefully ordered to prevent
2394	a malfunction.
2395	
2396	Consider, for example, the R/W semaphore slow path.  Here a waiting process is
2397	queued on the semaphore, by virtue of it having a piece of its stack linked to
2398	the semaphore's list of waiting processes:
2399	
2400		struct rw_semaphore {
2401			...
2402			spinlock_t lock;
2403			struct list_head waiters;
2404		};
2405	
2406		struct rwsem_waiter {
2407			struct list_head list;
2408			struct task_struct *task;
2409		};
2410	
2411	To wake up a particular waiter, the up_read() or up_write() functions have to:
2412	
2413	 (1) read the next pointer from this waiter's record to know where the
2414	     next waiter record is;
2415	
2416	 (2) read the pointer to the waiter's task structure;
2417	
2418	 (3) clear the task pointer to tell the waiter it has been given the semaphore;
2419	
2420	 (4) call wake_up_process() on the task; and
2421	
2422	 (5) release the reference held on the waiter's task struct.
2423	
2424	In other words, it has to perform this sequence of events:
2425	
2426		LOAD waiter->list.next;
2427		LOAD waiter->task;
2428		STORE waiter->task;
2429		CALL wakeup
2430		RELEASE task
2431	
2432	and if any of these steps occur out of order, then the whole thing may
2433	malfunction.
2434	
2435	Once it has queued itself and dropped the semaphore lock, the waiter does not
2436	get the lock again; it instead just waits for its task pointer to be cleared
2437	before proceeding.  Since the record is on the waiter's stack, this means that
2438	if the task pointer is cleared _before_ the next pointer in the list is read,
2439	another CPU might start processing the waiter and might clobber the waiter's
2440	stack before the up*() function has a chance to read the next pointer.
2441	
2442	Consider then what might happen to the above sequence of events:
2443	
2444		CPU 1				CPU 2
2445		===============================	===============================
2446						down_xxx()
2447						Queue waiter
2448						Sleep
2449		up_yyy()
2450		LOAD waiter->task;
2451		STORE waiter->task;
2452						Woken up by other event
2453		<preempt>
2454						Resume processing
2455						down_xxx() returns
2456						call foo()
2457						foo() clobbers *waiter
2458		</preempt>
2459		LOAD waiter->list.next;
2460		--- OOPS ---
2461	
2462	This could be dealt with using the semaphore lock, but then the down_xxx()
2463	function has to needlessly get the spinlock again after being woken up.
2464	
2465	The way to deal with this is to insert a general SMP memory barrier:
2466	
2467		LOAD waiter->list.next;
2468		LOAD waiter->task;
2469		smp_mb();
2470		STORE waiter->task;
2471		CALL wakeup
2472		RELEASE task
2473	
2474	In this case, the barrier makes a guarantee that all memory accesses before the
2475	barrier will appear to happen before all the memory accesses after the barrier
2476	with respect to the other CPUs on the system.  It does _not_ guarantee that all
2477	the memory accesses before the barrier will be complete by the time the barrier
2478	instruction itself is complete.
2479	
2480	On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2481	compiler barrier, thus making sure the compiler emits the instructions in the
2482	right order without actually intervening in the CPU.  Since there's only one
2483	CPU, that CPU's dependency ordering logic will take care of everything else.
2484	
2485	
2486	ATOMIC OPERATIONS
2487	-----------------
2488	
2489	Whilst they are technically interprocessor interaction considerations, atomic
2490	operations are noted specially as some of them imply full memory barriers and
2491	some don't, but they're very heavily relied on as a group throughout the
2492	kernel.
2493	
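	As a brief illustration (the atomic_t 'refcount' and release_the_object()
	are hypothetical): atomics that do not return a value imply no barrier,
	whereas value-returning atomics are fully ordered:

		atomic_inc(&refcount);			/* no implied barrier */

		if (atomic_dec_and_test(&refcount))	/* fully ordered */
			release_the_object();
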
2494	See Documentation/atomic_t.txt for more information.
2495	
2496	
2497	ACCESSING DEVICES
2498	-----------------
2499	
2500	Many devices can be memory mapped, and so appear to the CPU as if they're just
2501	a set of memory locations.  To control such a device, the driver usually has to
2502	make the right memory accesses in exactly the right order.
2503	
2504	However, having a clever CPU or a clever compiler creates a potential problem
2505	in that the carefully sequenced accesses in the driver code won't reach the
2506	device in the requisite order if the CPU or the compiler thinks it is more
2507	efficient to reorder, combine or merge accesses - something that would cause
2508	the device to malfunction.
2509	
2510	Inside of the Linux kernel, I/O should be done through the appropriate accessor
2511	routines - such as inb() or writel() - which know how to make such accesses
2512	appropriately sequential.  Whilst this, for the most part, renders the explicit
2513	use of memory barriers unnecessary, there are a couple of situations where they
2514	might be needed:
2515	
2516	 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
2517	     so for _all_ general drivers locks should be used and mmiowb() must be
2518	     issued prior to unlocking the critical section.
2519	
2520	 (2) If the accessor functions are used to refer to an I/O memory window with
2521	     relaxed memory access properties, then _mandatory_ memory barriers are
2522	     required to enforce ordering.
2523	
2524	See Documentation/driver-api/device-io.rst for more information.
2525	
2526	
2527	INTERRUPTS
2528	----------
2529	
2530	A driver may be interrupted by its own interrupt service routine, and thus the
2531	two parts of the driver may interfere with each other's attempts to control or
2532	access the device.
2533	
2534	This may be alleviated - at least in part - by disabling local interrupts (a
2535	form of locking), such that the critical operations are all contained within
2536	the interrupt-disabled section in the driver.  Whilst the driver's interrupt
2537	routine is executing, the driver's core may not run on the same CPU, and its
2538	interrupt is not permitted to happen again until the current interrupt has been
2539	handled, thus the interrupt handler does not need to lock against that.
2540	
2541	However, consider a driver that was talking to an ethernet card that sports an
2542	address register and a data register.  If that driver's core talks to the card
2543	under interrupt-disablement and then the driver's interrupt handler is invoked:
2544	
2545		LOCAL IRQ DISABLE
2546		writew(3, ADDR);
2547		writew(y, DATA);
2548		LOCAL IRQ ENABLE
2549		<interrupt>
2550		writew(4, ADDR);
2551		q = readw(DATA);
2552		</interrupt>
2553	
2554	The store to the data register might happen after the second store to the
2555	address register if ordering rules are sufficiently relaxed:
2556	
2557		STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
2558	
2559	
2560	If ordering rules are relaxed, it must be assumed that accesses done inside an
2561	interrupt disabled section may leak outside of it and may interleave with
2562	accesses performed in an interrupt - and vice versa - unless implicit or
2563	explicit barriers are used.
2564	
2565	Normally this won't be a problem because the I/O accesses done inside such
2566	sections will include synchronous load operations on strictly ordered I/O
2567	registers that form implicit I/O barriers.  If this isn't sufficient then an
2568	mmiowb() may need to be used explicitly.
2569	
2570	
2571	A similar situation may occur between an interrupt routine and two routines
2572	running on separate CPUs that communicate with each other.  If such a case is
2573	likely, then interrupt-disabling locks should be used to guarantee ordering.
2574	
2575	
2576	==========================
2577	KERNEL I/O BARRIER EFFECTS
2578	==========================
2579	
2580	When accessing I/O memory, drivers should use the appropriate accessor
2581	functions:
2582	
2583	 (*) inX(), outX():
2584	
2585	     These are intended to talk to I/O space rather than memory space, but
2586	     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
2587	     do indeed have special I/O space access cycles and instructions, but many
2588	     CPUs don't have such a concept.
2589	
2590	     The PCI bus, amongst others, defines an I/O space concept which - on such
2591	     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
2592	     space.  However, it may also be mapped as a virtual I/O space in the CPU's
2593	     memory map, particularly on those CPUs that don't support alternate I/O
2594	     spaces.
2595	
2596	     Accesses to this space may be fully synchronous (as on i386), but
2597	     intermediary bridges (such as the PCI host bridge) may not fully honour
2598	     that.
2599	
2600	     They are guaranteed to be fully ordered with respect to each other.
2601	
2602	     They are not guaranteed to be fully ordered with respect to other types of
2603	     memory and I/O operation.
2604	
2605	 (*) readX(), writeX():
2606	
2607	     Whether these are guaranteed to be fully ordered and uncombined with
2608	     respect to each other on the issuing CPU depends on the characteristics
2609	     defined for the memory window through which they're accessing.  On later
2610	     i386 architecture machines, for example, this is controlled by way of the
2611	     MTRR registers.
2612	
2613	     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
2614	     provided they're not accessing a prefetchable device.
2615	
2616	     However, intermediary hardware (such as a PCI bridge) may indulge in
2617	     deferral if it so wishes; to flush a store, a load from the same location
2618	     is preferred[*], but a load from the same device or from configuration
2619	     space should suffice for PCI.
2620	
2621	     [*] NOTE! attempting to load from the same location as was written to may
2622		 cause a malfunction - consider the 16550 Rx/Tx serial registers for
2623		 example.
2624	
2625	     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
2626	     force stores to be ordered.
2627	
2628	     Please refer to the PCI specification for more information on interactions
2629	     between PCI transactions.
2630	
2631	 (*) readX_relaxed(), writeX_relaxed()
2632	
2633	     These are similar to readX() and writeX(), but provide weaker memory
2634	     ordering guarantees.  Specifically, they do not guarantee ordering with
2635	     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
2636	     ordering with respect to LOCK or UNLOCK operations.  If the latter is
2637	     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
2638	     the same peripheral are guaranteed to be ordered with respect to each
2639	     other.
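
	     As a sketch of using mmiowb() with a relaxed accessor as described
	     above (the lock and device register names are illustrative):

		spin_lock(&dev_lock);
		writel_relaxed(val, dev_reg);
		mmiowb();		/* order the relaxed MMIO write with
					 * respect to the following unlock */
		spin_unlock(&dev_lock);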
2640	
2641	 (*) ioreadX(), iowriteX()
2642	
2643	     These will perform appropriately for the type of access they're actually
2644	     doing, be it inX()/outX() or readX()/writeX().
2645	
2646	
2647	========================================
2648	ASSUMED MINIMUM EXECUTION ORDERING MODEL
2649	========================================
2650	
2651	It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2652	maintain the appearance of program causality with respect to itself.  Some CPUs
2653	(such as i386 or x86_64) are more constrained than others (such as powerpc or
2654	frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
2655	of arch-specific code.
2656	
2657	This means that it must be considered that the CPU will execute its instruction
2658	stream in any order it feels like - or even in parallel - provided that if an
2659	instruction in the stream depends on an earlier instruction, then that
2660	earlier instruction must be sufficiently complete[*] before the later
2661	instruction may proceed; in other words: provided that the appearance of
2662	causality is maintained.
2663	
2664	 [*] Some instructions have more than one effect - such as changing the
2665	     condition codes, changing registers or changing memory - and different
2666	     instructions may depend on different effects.
2667	
2668	A CPU may also discard any instruction sequence that winds up having no
2669	ultimate effect.  For example, if two adjacent instructions both load an
2670	immediate value into the same register, the first may be discarded.
2671	
2672	
2673	Similarly, it has to be assumed that the compiler might reorder the instruction
2674	stream in any way it sees fit, again provided the appearance of causality is
2675	maintained.
2676	
2677	
2678	============================
2679	THE EFFECTS OF THE CPU CACHE
2680	============================
2681	
2682	The way cached memory operations are perceived across the system is affected to
2683	a certain extent by the caches that lie between CPUs and memory, and by the
2684	memory coherence system that maintains the consistency of state in the system.
2685	
2686	As far as the way a CPU interacts with another part of the system through the
2687	caches goes, the memory system has to include the CPU's caches, and memory
2688	barriers for the most part act at the interface between the CPU and its cache
2689	(memory barriers logically act on the dotted line in the following diagram):
2690	
2691		    <--- CPU --->         :       <----------- Memory ----------->
2692		                          :
2693		+--------+    +--------+  :   +--------+    +-----------+
2694		|        |    |        |  :   |        |    |           |    +--------+
2695		|  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
2696		|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2697		|        |    | Queue  |  :   |        |    |           |--->| Memory |
2698		|        |    |        |  :   |        |    |           |    |        |
2699		+--------+    +--------+  :   +--------+    |           |    |        |
2700		                          :                 | Cache     |    +--------+
2701		                          :                 | Coherency |
2702		                          :                 | Mechanism |    +--------+
2703	+--------+    +--------+  :   +--------+    |           |    |        |
2704		|        |    |        |  :   |        |    |           |    |        |
2705		|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
2706		|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2707		|        |    | Queue  |  :   |        |    |           |    |        |
2708		|        |    |        |  :   |        |    |           |    +--------+
2709		+--------+    +--------+  :   +--------+    +-----------+
2710		                          :
2711		                          :
2712	
2713	Although any particular load or store may not actually appear outside of the
2714	CPU that issued it since it may have been satisfied within the CPU's own cache,
2715	it will still appear as if the full memory access had taken place as far as the
2716	other CPUs are concerned since the cache coherency mechanisms will migrate the
2717	cacheline over to the accessing CPU and propagate the effects upon conflict.
2718	
2719	The CPU core may execute instructions in any order it deems fit, provided the
2720	expected program causality appears to be maintained.  Some of the instructions
2721	generate load and store operations which then go into the queue of memory
2722	accesses to be performed.  The core may place these in the queue in any order
2723	it wishes, and continue execution until it is forced to wait for an instruction
2724	to complete.
2725	
2726	What memory barriers are concerned with is controlling the order in which
2727	accesses cross from the CPU side of things to the memory side of things, and
2728	the order in which the effects are perceived to happen by the other observers
2729	in the system.
2730	
2731	[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
2732	their own loads and stores as if they had happened in program order.
2733	
2734	[!] MMIO or other device accesses may bypass the cache system.  This depends on
2735	the properties of the memory window through which devices are accessed and/or
2736	the use of any special device communication instructions the CPU may have.
2737	
2738	
2739	CACHE COHERENCY
2740	---------------
2741	
2742	Life isn't quite as simple as it may appear above, however: for while the
2743	caches are expected to be coherent, there's no guarantee that that coherency
2744	will be ordered.  This means that whilst changes made on one CPU will
2745	eventually become visible on all CPUs, there's no guarantee that they will
2746	become apparent in the same order on those other CPUs.
2747	
2748	
2749	Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
2750	has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
2751	
2752		            :
2753		            :                          +--------+
2754		            :      +---------+         |        |
2755		+--------+  : +--->| Cache A |<------->|        |
2756		|        |  : |    +---------+         |        |
2757		|  CPU 1 |<---+                        |        |
2758		|        |  : |    +---------+         |        |
2759		+--------+  : +--->| Cache B |<------->|        |
2760		            :      +---------+         |        |
2761		            :                          | Memory |
2762		            :      +---------+         | System |
2763		+--------+  : +--->| Cache C |<------->|        |
2764		|        |  : |    +---------+         |        |
2765		|  CPU 2 |<---+                        |        |
2766		|        |  : |    +---------+         |        |
2767		+--------+  : +--->| Cache D |<------->|        |
2768		            :      +---------+         |        |
2769		            :                          +--------+
2770		            :
2771	
2772	Imagine the system has the following properties:
2773	
2774	 (*) an odd-numbered cache line may be in cache A, cache C or it may still be
2775	     resident in memory;
2776	
2777	 (*) an even-numbered cache line may be in cache B, cache D or it may still be
2778	     resident in memory;
2779	
2780	 (*) whilst the CPU core is interrogating one cache, the other cache may be
2781	     making use of the bus to access the rest of the system - perhaps to
2782	     displace a dirty cacheline or to do a speculative load;
2783	
2784	 (*) each cache has a queue of operations that need to be applied to that cache
2785	     to maintain coherency with the rest of the system;
2786	
2787	 (*) the coherency queue is not flushed by normal loads to lines already
2788	     present in the cache, even though the contents of the queue may
2789	     potentially affect those loads.
2790	
2791	Imagine, then, that two writes are made on the first CPU, with a write barrier
2792	between them to guarantee that they will appear to reach that CPU's caches in
2793	the requisite order:
2794	
2795		CPU 1		CPU 2		COMMENT
2796		===============	===============	=======================================
2797						u == 0, v == 1 and p == &u, q == &u
2798		v = 2;
2799		smp_wmb();			Make sure change to v is visible before
2800						 change to p
2801		<A:modify v=2>			v is now in cache A exclusively
2802		p = &v;
2803		<B:modify p=&v>			p is now in cache B exclusively
2804	
2805	The write memory barrier forces the other CPUs in the system to perceive that
2806	the local CPU's caches have apparently been updated in the correct order.  But
2807	now imagine that the second CPU wants to read those values:
2808	
2809		CPU 1		CPU 2		COMMENT
2810		===============	===============	=======================================
2811		...
2812				q = p;
2813				x = *q;
2814	
2815	The above pair of reads may then fail to happen in the expected order, as the
2816	cacheline holding p may get updated in one of the second CPU's caches whilst
2817	the update to the cacheline holding v is delayed in the other of the second
2818	CPU's caches by some other cache event:
2819	
2820		CPU 1		CPU 2		COMMENT
2821		===============	===============	=======================================
2822						u == 0, v == 1 and p == &u, q == &u
2823		v = 2;
2824		smp_wmb();
2825		<A:modify v=2>	<C:busy>
2826				<C:queue v=2>
2827		p = &v;		q = p;
2828				<D:request p>
2829		<B:modify p=&v>	<D:commit p=&v>
2830				<D:read p>
2831				x = *q;
2832				<C:read *q>	Reads from v before v updated in cache
2833				<C:unbusy>
2834				<C:commit v=2>
2835	
2836	Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
2837	no guarantee that, without intervention, the order of update will be the same
2838	as that committed on CPU 1.
2839	
2840	
2841	To intervene, we need to interpolate a data dependency barrier or a read
2842	barrier between the loads.  This will force the cache to commit its coherency
2843	queue before processing any further requests:
2844	
2845		CPU 1		CPU 2		COMMENT
2846		===============	===============	=======================================
2847						u == 0, v == 1 and p == &u, q == &u
2848		v = 2;
2849		smp_wmb();
2850		<A:modify v=2>	<C:busy>
2851				<C:queue v=2>
2852		p = &v;		q = p;
2853				<D:request p>
2854		<B:modify p=&v>	<D:commit p=&v>
2855				<D:read p>
2856				smp_read_barrier_depends()
2857				<C:unbusy>
2858				<C:commit v=2>
2859				x = *q;
2860				<C:read *q>	Reads from v after v updated in cache
2861	
2862	
2863	This sort of problem can be encountered on DEC Alpha processors as they have a
2864	split cache that improves performance by making better use of the data bus.
2865	Whilst most CPUs do imply a data dependency barrier on the read when a memory
2866	access depends on a read, not all do, so it may not be relied on.
2867	
2868	Other CPUs may also have split caches, but must coordinate between the various
2869	cachelets for normal memory accesses.  The semantics of the Alpha removes the
2870	need for hardware coordination in the absence of memory barriers, which
2871	permitted Alpha to sport higher CPU clock rates back in the day.  However,
2872	please note that smp_read_barrier_depends() should not be used except in
2873	Alpha arch-specific code and within the READ_ONCE() macro.
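
Expressed as C rather than as the event tables above, the pattern looks
roughly like the following sketch (the names match the tables; this is an
illustration, not complete code).  The reader side relies on READ_ONCE() to
supply the data dependency barrier on architectures such as Alpha, as
recommended above:

	static int u, v = 1;
	static int *p = &u;

	/* CPU 1 */
	static void writer(void)
	{
		v = 2;
		smp_wmb();		/* order the store to v before the store to p */
		WRITE_ONCE(p, &v);
	}

	/* CPU 2 */
	static int reader(void)
	{
		int *q;

		q = READ_ONCE(p);	/* includes the data dependency barrier */
		return *q;		/* sees v == 2 if q turned out to be &v */
	}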
2874	
2875	
2876	CACHE COHERENCY VS DMA
2877	----------------------
2878	
2879	Not all systems maintain cache coherency with respect to devices doing DMA.  In
2880	such cases, a device attempting DMA may obtain stale data from RAM because
2881	dirty cache lines may be resident in the caches of various CPUs, and may not
2882	have been written back to RAM yet.  To deal with this, the appropriate part of
2883	the kernel must flush the overlapping bits of cache on each CPU (and maybe
2884	invalidate them as well).
2885	
2886	In addition, the data DMA'd to RAM by a device may be overwritten by dirty
2887	cache lines being written back to RAM from a CPU's cache after the device has
2888	installed its own data, or cache lines present in the CPU's cache may simply
2889	obscure the fact that RAM has been updated, until such time as the cacheline
2890	is discarded from the CPU's cache and reloaded.  To deal with this, the
2891	appropriate part of the kernel must invalidate the overlapping bits of the
2892	cache on each CPU.
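
In practice, drivers rarely issue these flushes and invalidations by hand:
the streaming DMA mapping API (see Documentation/DMA-API.txt) performs them
where the architecture requires it.  A minimal sketch, with a hypothetical
device, buffer and length, might look like:

	#include <linux/dma-mapping.h>

	/* 'dev', 'buf' and 'len' are hypothetical; error handling is trimmed. */
	static void hand_buffer_to_device(struct device *dev, void *buf, size_t len)
	{
		dma_addr_t handle;

		/* Writes back any dirty cachelines covering buf where needed. */
		handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, handle))
			return;

		/* ... point the device at 'handle' and start the transfer ... */

		dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
	}

For data travelling the other way, mapping with DMA_FROM_DEVICE has the API
invalidate the overlapping cachelines instead, as described above.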
2893	
2894	See Documentation/cachetlb.txt for more information on cache management.
2895	
2896	
2897	CACHE COHERENCY VS MMIO
2898	-----------------------
2899	
2900	Memory mapped I/O usually takes place through memory locations that are part of
2901	a window in the CPU's memory space that has different properties assigned than
2902	the usual RAM directed window.
2903	
2904	Amongst these properties is usually the fact that such accesses bypass the
2905	caching entirely and go directly to the device buses.  This means MMIO accesses
2906	may, in effect, overtake accesses to cached memory that were emitted earlier.
2907	A memory barrier isn't sufficient in such a case, but rather the cache must be
2908	flushed between the cached memory write and the MMIO access if the two are in
2909	any way dependent.
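
For example, a driver that builds a descriptor in ordinary cached memory and
then rings an MMIO doorbell must make sure the descriptor has actually left
the CPU's cache first.  A hedged sketch, with entirely hypothetical
descriptor and register names, using the streaming DMA API to do the flush:

	/* 'desc', 'desc_handle' and 'doorbell' are made up for illustration. */
	desc->addr = buf_handle;
	desc->len  = len;

	/* Push the descriptor out of the CPU cache so the device can see it. */
	dma_sync_single_for_device(dev, desc_handle, sizeof(*desc), DMA_TO_DEVICE);

	writel(1, doorbell);		/* MMIO write; may bypass the cache */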
2910	
2911	
2912	=========================
2913	THE THINGS CPUS GET UP TO
2914	=========================
2915	
2916	A programmer might take it for granted that the CPU will perform memory
2917	operations in exactly the order specified, so that if the CPU is, for example,
2918	given the following piece of code to execute:
2919	
2920		a = READ_ONCE(*A);
2921		WRITE_ONCE(*B, b);
2922		c = READ_ONCE(*C);
2923		d = READ_ONCE(*D);
2924		WRITE_ONCE(*E, e);
2925	
2926	they would then expect that the CPU will complete the memory operation for each
2927	instruction before moving on to the next one, leading to a definite sequence of
2928	operations as seen by external observers in the system:
2929	
2930		LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
2931	
2932	
2933	Reality is, of course, much messier.  With many CPUs and compilers, the above
2934	assumption doesn't hold because:
2935	
2936	 (*) loads are more likely to need to be completed immediately to permit
2937	     execution progress, whereas stores can often be deferred without a
2938	     problem;
2939	
2940	 (*) loads may be done speculatively, and the result discarded should it prove
2941	     to have been unnecessary;
2942	
2943	 (*) loads may be done speculatively, leading to the result having been fetched
2944	     at the wrong time in the expected sequence of events;
2945	
2946	 (*) the order of the memory accesses may be rearranged to promote better use
2947	     of the CPU buses and caches;
2948	
2949	 (*) loads and stores may be combined to improve performance when talking to
2950	     memory or I/O hardware that can do batched accesses of adjacent locations,
2951	     thus cutting down on transaction setup costs (memory and PCI devices may
2952	     both be able to do this); and
2953	
2954	 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
2955	     mechanisms may alleviate this - once the store has actually hit the cache
2956	     - there's no guarantee that the coherency management will be propagated in
2957	     order to other CPUs.
2958	
2959	So what another CPU, say, might actually observe from the above piece of code
2960	is:
2961	
2962		LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B
2963	
2964		(Where "LOAD {*C,*D}" is a combined load)
2965	
2966	
2967	However, it is guaranteed that a CPU will be self-consistent: it will see its
2968	_own_ accesses appear to be correctly ordered, without the need for a memory
2969	barrier.  For instance with the following code:
2970	
2971		U = READ_ONCE(*A);
2972		WRITE_ONCE(*A, V);
2973		WRITE_ONCE(*A, W);
2974		X = READ_ONCE(*A);
2975		WRITE_ONCE(*A, Y);
2976		Z = READ_ONCE(*A);
2977	
2978	and assuming no intervention by an external influence, it can be assumed that
2979	the final result will appear to be:
2980	
2981		U == the original value of *A
2982		X == W
2983		Z == Y
2984		*A == Y
2985	
2986	The code above may cause the CPU to generate the full sequence of memory
2987	accesses:
2988	
2989		U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A
2990	
2991	in that order, but, without intervention, the sequence may have almost any
2992	combination of elements combined or discarded, provided the program's view
2993	of the world remains consistent.  Note that READ_ONCE() and WRITE_ONCE()
2994	are -not- optional in the above example, as there are architectures
2995	where a given CPU might reorder successive loads to the same location.
2996	On such architectures, READ_ONCE() and WRITE_ONCE() do whatever is
2997	necessary to prevent this, for example, on Itanium the volatile casts
2998	used by READ_ONCE() and WRITE_ONCE() cause GCC to emit the special ld.acq
2999	and st.rel instructions (respectively) that prevent such reordering.
3000	
3001	The compiler may also combine, discard or defer elements of the sequence before
3002	the CPU even sees them.
3003	
3004	For instance:
3005	
3006		*A = V;
3007		*A = W;
3008	
3009	may be reduced to:
3010	
3011		*A = W;
3012	
3013	since, without either a write barrier or a WRITE_ONCE(), it can be
3014	assumed that the effect of the storage of V to *A is lost.  Similarly:
3015	
3016		*A = Y;
3017		Z = *A;
3018	
3019	may, without a memory barrier or a READ_ONCE() and WRITE_ONCE(), be
3020	reduced to:
3021	
3022		*A = Y;
3023		Z = Y;
3024	
3025	and the LOAD operation never appears outside of the CPU.
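
If the intermediate store and the load actually matter - because, say,
another CPU or an interrupt handler is watching *A - then WRITE_ONCE() and
READ_ONCE() stop the compiler discarding them:

	WRITE_ONCE(*A, V);	/* both stores are now emitted */
	WRITE_ONCE(*A, W);
	...
	WRITE_ONCE(*A, Y);
	Z = READ_ONCE(*A);	/* and the load is actually performed */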
3026	
3027	
3028	AND THEN THERE'S THE ALPHA
3029	--------------------------
3030	
3031	The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
3032	some versions of the Alpha CPU have a split data cache, permitting them to have
3033	two semantically-related cache lines updated at separate times.  This is where
3034	the data dependency barrier really becomes necessary as this synchronises both
3035	caches with the memory coherence system, thus making it seem like pointer
3036	changes vs new data occur in the right order.
3037	
3038	The Alpha defines the Linux kernel's memory barrier model.
3039	
3040	See the subsection on "Cache Coherency" above.
3041	
3042	
3043	VIRTUAL MACHINE GUESTS
3044	----------------------
3045	
3046	Guests running within virtual machines might be affected by SMP effects even if
3047	the guest itself is compiled without SMP support.  This is an artifact of
3048	interfacing with an SMP host while running a UP kernel.  Using mandatory
3049	barriers for this use-case would be possible but is often suboptimal.
3050	
3051	To handle this case optimally, low-level virt_mb() etc macros are available.
3052	These have the same effect as smp_mb() etc when SMP is enabled, but generate
3053	identical code for SMP and non-SMP systems.  For example, virtual machine guests
3054	should use virt_mb() rather than smp_mb() when synchronizing against a
3055	(possibly SMP) host.
3056	
3057	These are equivalent to smp_mb() etc counterparts in all other respects,
3058	in particular, they do not control MMIO effects: to control
3059	MMIO effects, use mandatory barriers.
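
For example, a guest publishing entries into a ring shared with the host
might do something like the following sketch (the ring layout, RING_SIZE and
field names are made up for illustration; real transports such as virtio
have their own formats):

	struct guest_ring {
		unsigned int	head;
		struct entry {
			u64	addr;
			u32	len;
		}		slots[RING_SIZE];
	};

	static void guest_publish(struct guest_ring *r, u64 addr, u32 len)
	{
		unsigned int head = r->head;

		r->slots[head % RING_SIZE].addr = addr;
		r->slots[head % RING_SIZE].len  = len;

		/* Order the slot writes before the head update, even in a UP
		 * guest talking to a (possibly SMP) host. */
		virt_wmb();

		WRITE_ONCE(r->head, head + 1);
	}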
3060	
3061	
3062	============
3063	EXAMPLE USES
3064	============
3065	
3066	CIRCULAR BUFFERS
3067	----------------
3068	
3069	Memory barriers can be used to implement circular buffering without the need
3070	of a lock to serialise the producer with the consumer.  See:
3071	
3072		Documentation/circular-buffers.txt
3073	
3074	for details.
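
As a taster, the pattern in that document boils down to roughly the
following sketch, using a power-of-two sized buffer, CIRC_SPACE()/CIRC_CNT()
from <linux/circ_buf.h>, and acquire/release operations to publish and
consume items ('buffer', 'item' and SIZE are illustrative):

	/* Producer: make the item visible before advancing the head index. */
	unsigned long head = buffer.head;
	unsigned long tail = READ_ONCE(buffer.tail);

	if (CIRC_SPACE(head, tail, SIZE) >= 1) {
		buffer.data[head & (SIZE - 1)] = item;
		smp_store_release(&buffer.head, (head + 1) & (SIZE - 1));
	}

	/* Consumer: read the head index before reading the item it covers. */
	unsigned long chead = smp_load_acquire(&buffer.head);
	unsigned long ctail = buffer.tail;

	if (CIRC_CNT(chead, ctail, SIZE) >= 1) {
		item = buffer.data[ctail & (SIZE - 1)];
		smp_store_release(&buffer.tail, (ctail + 1) & (SIZE - 1));
	}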
3075	
3076	
3077	==========
3078	REFERENCES
3079	==========
3080	
3081	Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
3082	Digital Press)
3083		Chapter 5.2: Physical Address Space Characteristics
3084		Chapter 5.4: Caches and Write Buffers
3085		Chapter 5.5: Data Sharing
3086		Chapter 5.6: Read/Write Ordering
3087	
3088	AMD64 Architecture Programmer's Manual Volume 2: System Programming
3089		Chapter 7.1: Memory-Access Ordering
3090		Chapter 7.4: Buffering and Combining Memory Writes
3091	
3092	ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
3093		Chapter B2: The AArch64 Application Level Memory Model
3094	
3095	IA-32 Intel Architecture Software Developer's Manual, Volume 3:
3096	System Programming Guide
3097		Chapter 7.1: Locked Atomic Operations
3098		Chapter 7.2: Memory Ordering
3099		Chapter 7.4: Serializing Instructions
3100	
3101	The SPARC Architecture Manual, Version 9
3102		Chapter 8: Memory Models
3103		Appendix D: Formal Specification of the Memory Models
3104		Appendix J: Programming with the Memory Models
3105	
3106	Storage in the PowerPC (Stone and Fitzgerald)
3107	
3108	UltraSPARC Programmer Reference Manual
3109		Chapter 5: Memory Accesses and Cacheability
3110		Chapter 15: Sparc-V9 Memory Models
3111	
3112	UltraSPARC III Cu User's Manual
3113		Chapter 9: Memory Models
3114	
3115	UltraSPARC IIIi Processor User's Manual
3116		Chapter 8: Memory Models
3117	
3118	UltraSPARC Architecture 2005
3119		Chapter 9: Memory
3120		Appendix D: Formal Specifications of the Memory Models
3121	
3122	UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
3123		Chapter 8: Memory Models
3124		Appendix F: Caches and Cache Coherency
3125	
3126	Solaris Internals, Core Kernel Architecture, p63-68:
3127		Chapter 3.3: Hardware Considerations for Locks and
3128				Synchronization
3129	
3130	Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
3131	for Kernel Programmers:
3132		Chapter 13: Other Memory Models
3133	
3134	Intel Itanium Architecture Software Developer's Manual: Volume 1:
3135		Section 2.6: Speculation
3136		Section 4.4: Memory Access