About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / memory-barriers.txt




Custom Search

Based on kernel version 2.6.34. Page generated on 2010-05-31 16:03 EST.

1				 ============================
2				 LINUX KERNEL MEMORY BARRIERS
3				 ============================
4	
5	By: David Howells <dhowells[AT]redhat[DOT]com>
6	    Paul E. McKenney <paulmck[AT]linux.vnet.ibm[DOT]com>
7	
8	Contents:
9	
10	 (*) Abstract memory access model.
11	
12	     - Device operations.
13	     - Guarantees.
14	
15	 (*) What are memory barriers?
16	
17	     - Varieties of memory barrier.
18	     - What may not be assumed about memory barriers?
19	     - Data dependency barriers.
20	     - Control dependencies.
21	     - SMP barrier pairing.
22	     - Examples of memory barrier sequences.
23	     - Read memory barriers vs load speculation.
24	
25	 (*) Explicit kernel barriers.
26	
27	     - Compiler barrier.
28	     - CPU memory barriers.
29	     - MMIO write barrier.
30	
31	 (*) Implicit kernel memory barriers.
32	
33	     - Locking functions.
34	     - Interrupt disabling functions.
35	     - Sleep and wake-up functions.
36	     - Miscellaneous functions.
37	
38	 (*) Inter-CPU locking barrier effects.
39	
40	     - Locks vs memory accesses.
41	     - Locks vs I/O accesses.
42	
43	 (*) Where are memory barriers needed?
44	
45	     - Interprocessor interaction.
46	     - Atomic operations.
47	     - Accessing devices.
48	     - Interrupts.
49	
50	 (*) Kernel I/O barrier effects.
51	
52	 (*) Assumed minimum execution ordering model.
53	
54	 (*) The effects of the cpu cache.
55	
56	     - Cache coherency.
57	     - Cache coherency vs DMA.
58	     - Cache coherency vs MMIO.
59	
60	 (*) The things CPUs get up to.
61	
62	     - And then there's the Alpha.
63	
64	 (*) Example uses.
65	
66	     - Circular buffers.
67	
68	 (*) References.
69	
70	
71	============================
72	ABSTRACT MEMORY ACCESS MODEL
73	============================
74	
75	Consider the following abstract model of the system:
76	
77			            :                :
78			            :                :
79			            :                :
80			+-------+   :   +--------+   :   +-------+
81			|       |   :   |        |   :   |       |
82			|       |   :   |        |   :   |       |
83			| CPU 1 |<----->| Memory |<----->| CPU 2 |
84			|       |   :   |        |   :   |       |
85			|       |   :   |        |   :   |       |
86			+-------+   :   +--------+   :   +-------+
87			    ^       :       ^        :       ^
88			    |       :       |        :       |
89			    |       :       |        :       |
90			    |       :       v        :       |
91			    |       :   +--------+   :       |
92			    |       :   |        |   :       |
93			    |       :   |        |   :       |
94			    +---------->| Device |<----------+
95			            :   |        |   :
96			            :   |        |   :
97			            :   +--------+   :
98			            :                :
99	
100	Each CPU executes a program that generates memory access operations.  In the
101	abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
102	perform the memory operations in any order it likes, provided program causality
103	appears to be maintained.  Similarly, the compiler may also arrange the
104	instructions it emits in any order it likes, provided it doesn't affect the
105	apparent operation of the program.
106	
107	So in the above diagram, the effects of the memory operations performed by a
108	CPU are perceived by the rest of the system as the operations cross the
109	interface between the CPU and rest of the system (the dotted lines).
110	
111	
112	For example, consider the following sequence of events:
113	
114		CPU 1		CPU 2
115		===============	===============
116		{ A == 1; B == 2 }
117		A = 3;		x = A;
118		B = 4;		y = B;
119	
120	The set of accesses as seen by the memory system in the middle can be arranged
121	in 24 different combinations:
122	
123		STORE A=3,	STORE B=4,	x=LOAD A->3,	y=LOAD B->4
124		STORE A=3,	STORE B=4,	y=LOAD B->4,	x=LOAD A->3
125		STORE A=3,	x=LOAD A->3,	STORE B=4,	y=LOAD B->4
126		STORE A=3,	x=LOAD A->3,	y=LOAD B->2,	STORE B=4
127		STORE A=3,	y=LOAD B->2,	STORE B=4,	x=LOAD A->3
128		STORE A=3,	y=LOAD B->2,	x=LOAD A->3,	STORE B=4
129		STORE B=4,	STORE A=3,	x=LOAD A->3,	y=LOAD B->4
130		STORE B=4, ...
131		...
132	
133	and can thus result in four different combinations of values:
134	
135		x == 1, y == 2
136		x == 1, y == 4
137		x == 3, y == 2
138		x == 3, y == 4
139	
140	
141	Furthermore, the stores committed by a CPU to the memory system may not be
142	perceived by the loads made by another CPU in the same order as the stores were
143	committed.
144	
145	
146	As a further example, consider this sequence of events:
147	
148		CPU 1		CPU 2
149		===============	===============
150		{ A == 1, B == 2, C = 3, P == &A, Q == &C }
151		B = 4;		Q = P;
152		P = &B		D = *Q;
153	
154	There is an obvious data dependency here, as the value loaded into D depends on
155	the address retrieved from P by CPU 2.  At the end of the sequence, any of the
156	following results are possible:
157	
158		(Q == &A) and (D == 1)
159		(Q == &B) and (D == 2)
160		(Q == &B) and (D == 4)
161	
162	Note that CPU 2 will never try and load C into D because the CPU will load P
163	into Q before issuing the load of *Q.
164	
165	
166	DEVICE OPERATIONS
167	-----------------
168	
169	Some devices present their control interfaces as collections of memory
170	locations, but the order in which the control registers are accessed is very
171	important.  For instance, imagine an ethernet card with a set of internal
172	registers that are accessed through an address port register (A) and a data
173	port register (D).  To read internal register 5, the following code might then
174	be used:
175	
176		*A = 5;
177		x = *D;
178	
179	but this might show up as either of the following two sequences:
180	
181		STORE *A = 5, x = LOAD *D
182		x = LOAD *D, STORE *A = 5
183	
184	the second of which will almost certainly result in a malfunction, since it set
185	the address _after_ attempting to read the register.
186	
187	
188	GUARANTEES
189	----------
190	
191	There are some minimal guarantees that may be expected of a CPU:
192	
193	 (*) On any given CPU, dependent memory accesses will be issued in order, with
194	     respect to itself.  This means that for:
195	
196		Q = P; D = *Q;
197	
198	     the CPU will issue the following memory operations:
199	
200		Q = LOAD P, D = LOAD *Q
201	
202	     and always in that order.
203	
204	 (*) Overlapping loads and stores within a particular CPU will appear to be
205	     ordered within that CPU.  This means that for:
206	
207		a = *X; *X = b;
208	
209	     the CPU will only issue the following sequence of memory operations:
210	
211		a = LOAD *X, STORE *X = b
212	
213	     And for:
214	
215		*X = c; d = *X;
216	
217	     the CPU will only issue:
218	
219		STORE *X = c, d = LOAD *X
220	
221	     (Loads and stores overlap if they are targeted at overlapping pieces of
222	     memory).
223	
224	And there are a number of things that _must_ or _must_not_ be assumed:
225	
226	 (*) It _must_not_ be assumed that independent loads and stores will be issued
227	     in the order given.  This means that for:
228	
229		X = *A; Y = *B; *D = Z;
230	
231	     we may get any of the following sequences:
232	
233		X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
234		X = LOAD *A,  STORE *D = Z, Y = LOAD *B
235		Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
236		Y = LOAD *B,  STORE *D = Z, X = LOAD *A
237		STORE *D = Z, X = LOAD *A,  Y = LOAD *B
238		STORE *D = Z, Y = LOAD *B,  X = LOAD *A
239	
240	 (*) It _must_ be assumed that overlapping memory accesses may be merged or
241	     discarded.  This means that for:
242	
243		X = *A; Y = *(A + 4);
244	
245	     we may get any one of the following sequences:
246	
247		X = LOAD *A; Y = LOAD *(A + 4);
248		Y = LOAD *(A + 4); X = LOAD *A;
249		{X, Y} = LOAD {*A, *(A + 4) };
250	
251	     And for:
252	
253		*A = X; Y = *A;
254	
255	     we may get either of:
256	
257		STORE *A = X; Y = LOAD *A;
258		STORE *A = Y = X;
259	
260	
261	=========================
262	WHAT ARE MEMORY BARRIERS?
263	=========================
264	
265	As can be seen above, independent memory operations are effectively performed
266	in random order, but this can be a problem for CPU-CPU interaction and for I/O.
267	What is required is some way of intervening to instruct the compiler and the
268	CPU to restrict the order.
269	
270	Memory barriers are such interventions.  They impose a perceived partial
271	ordering over the memory operations on either side of the barrier.
272	
273	Such enforcement is important because the CPUs and other devices in a system
274	can use a variety of tricks to improve performance, including reordering,
275	deferral and combination of memory operations; speculative loads; speculative
276	branch prediction and various types of caching.  Memory barriers are used to
277	override or suppress these tricks, allowing the code to sanely control the
278	interaction of multiple CPUs and/or devices.
279	
280	
281	VARIETIES OF MEMORY BARRIER
282	---------------------------
283	
284	Memory barriers come in four basic varieties:
285	
286	 (1) Write (or store) memory barriers.
287	
288	     A write memory barrier gives a guarantee that all the STORE operations
289	     specified before the barrier will appear to happen before all the STORE
290	     operations specified after the barrier with respect to the other
291	     components of the system.
292	
293	     A write barrier is a partial ordering on stores only; it is not required
294	     to have any effect on loads.
295	
296	     A CPU can be viewed as committing a sequence of store operations to the
297	     memory system as time progresses.  All stores before a write barrier will
298	     occur in the sequence _before_ all the stores after the write barrier.
299	
300	     [!] Note that write barriers should normally be paired with read or data
301	     dependency barriers; see the "SMP barrier pairing" subsection.
302	
303	
304	 (2) Data dependency barriers.
305	
306	     A data dependency barrier is a weaker form of read barrier.  In the case
307	     where two loads are performed such that the second depends on the result
308	     of the first (eg: the first load retrieves the address to which the second
309	     load will be directed), a data dependency barrier would be required to
310	     make sure that the target of the second load is updated before the address
311	     obtained by the first load is accessed.
312	
313	     A data dependency barrier is a partial ordering on interdependent loads
314	     only; it is not required to have any effect on stores, independent loads
315	     or overlapping loads.
316	
317	     As mentioned in (1), the other CPUs in the system can be viewed as
318	     committing sequences of stores to the memory system that the CPU being
319	     considered can then perceive.  A data dependency barrier issued by the CPU
320	     under consideration guarantees that for any load preceding it, if that
321	     load touches one of a sequence of stores from another CPU, then by the
322	     time the barrier completes, the effects of all the stores prior to that
323	     touched by the load will be perceptible to any loads issued after the data
324	     dependency barrier.
325	
326	     See the "Examples of memory barrier sequences" subsection for diagrams
327	     showing the ordering constraints.
328	
329	     [!] Note that the first load really has to have a _data_ dependency and
330	     not a control dependency.  If the address for the second load is dependent
331	     on the first load, but the dependency is through a conditional rather than
332	     actually loading the address itself, then it's a _control_ dependency and
333	     a full read barrier or better is required.  See the "Control dependencies"
334	     subsection for more information.
335	
336	     [!] Note that data dependency barriers should normally be paired with
337	     write barriers; see the "SMP barrier pairing" subsection.
338	
339	
340	 (3) Read (or load) memory barriers.
341	
342	     A read barrier is a data dependency barrier plus a guarantee that all the
343	     LOAD operations specified before the barrier will appear to happen before
344	     all the LOAD operations specified after the barrier with respect to the
345	     other components of the system.
346	
347	     A read barrier is a partial ordering on loads only; it is not required to
348	     have any effect on stores.
349	
350	     Read memory barriers imply data dependency barriers, and so can substitute
351	     for them.
352	
353	     [!] Note that read barriers should normally be paired with write barriers;
354	     see the "SMP barrier pairing" subsection.
355	
356	
357	 (4) General memory barriers.
358	
359	     A general memory barrier gives a guarantee that all the LOAD and STORE
360	     operations specified before the barrier will appear to happen before all
361	     the LOAD and STORE operations specified after the barrier with respect to
362	     the other components of the system.
363	
364	     A general memory barrier is a partial ordering over both loads and stores.
365	
366	     General memory barriers imply both read and write memory barriers, and so
367	     can substitute for either.
368	
369	
370	And a couple of implicit varieties:
371	
372	 (5) LOCK operations.
373	
374	     This acts as a one-way permeable barrier.  It guarantees that all memory
375	     operations after the LOCK operation will appear to happen after the LOCK
376	     operation with respect to the other components of the system.
377	
378	     Memory operations that occur before a LOCK operation may appear to happen
379	     after it completes.
380	
381	     A LOCK operation should almost always be paired with an UNLOCK operation.
382	
383	
384	 (6) UNLOCK operations.
385	
386	     This also acts as a one-way permeable barrier.  It guarantees that all
387	     memory operations before the UNLOCK operation will appear to happen before
388	     the UNLOCK operation with respect to the other components of the system.
389	
390	     Memory operations that occur after an UNLOCK operation may appear to
391	     happen before it completes.
392	
393	     LOCK and UNLOCK operations are guaranteed to appear with respect to each
394	     other strictly in the order specified.
395	
396	     The use of LOCK and UNLOCK operations generally precludes the need for
397	     other sorts of memory barrier (but note the exceptions mentioned in the
398	     subsection "MMIO write barrier").
399	
400	
401	Memory barriers are only required where there's a possibility of interaction
402	between two CPUs or between a CPU and a device.  If it can be guaranteed that
403	there won't be any such interaction in any particular piece of code, then
404	memory barriers are unnecessary in that piece of code.
405	
406	
407	Note that these are the _minimum_ guarantees.  Different architectures may give
408	more substantial guarantees, but they may _not_ be relied upon outside of arch
409	specific code.
410	
411	
412	WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
413	----------------------------------------------
414	
415	There are certain things that the Linux kernel memory barriers do not guarantee:
416	
417	 (*) There is no guarantee that any of the memory accesses specified before a
418	     memory barrier will be _complete_ by the completion of a memory barrier
419	     instruction; the barrier can be considered to draw a line in that CPU's
420	     access queue that accesses of the appropriate type may not cross.
421	
422	 (*) There is no guarantee that issuing a memory barrier on one CPU will have
423	     any direct effect on another CPU or any other hardware in the system.  The
424	     indirect effect will be the order in which the second CPU sees the effects
425	     of the first CPU's accesses occur, but see the next point:
426	
427	 (*) There is no guarantee that a CPU will see the correct order of effects
428	     from a second CPU's accesses, even _if_ the second CPU uses a memory
429	     barrier, unless the first CPU _also_ uses a matching memory barrier (see
430	     the subsection on "SMP Barrier Pairing").
431	
432	 (*) There is no guarantee that some intervening piece of off-the-CPU
433	     hardware[*] will not reorder the memory accesses.  CPU cache coherency
434	     mechanisms should propagate the indirect effects of a memory barrier
435	     between CPUs, but might not do so in order.
436	
437		[*] For information on bus mastering DMA and coherency please read:
438	
439		    Documentation/PCI/pci.txt
440		    Documentation/PCI/PCI-DMA-mapping.txt
441		    Documentation/DMA-API.txt
442	
443	
444	DATA DEPENDENCY BARRIERS
445	------------------------
446	
447	The usage requirements of data dependency barriers are a little subtle, and
448	it's not always obvious that they're needed.  To illustrate, consider the
449	following sequence of events:
450	
451		CPU 1		CPU 2
452		===============	===============
453		{ A == 1, B == 2, C = 3, P == &A, Q == &C }
454		B = 4;
455		<write barrier>
456		P = &B
457				Q = P;
458				D = *Q;
459	
460	There's a clear data dependency here, and it would seem that by the end of the
461	sequence, Q must be either &A or &B, and that:
462	
463		(Q == &A) implies (D == 1)
464		(Q == &B) implies (D == 4)
465	
466	But!  CPU 2's perception of P may be updated _before_ its perception of B, thus
467	leading to the following situation:
468	
469		(Q == &B) and (D == 2) ????
470	
471	Whilst this may seem like a failure of coherency or causality maintenance, it
472	isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
473	Alpha).
474	
475	To deal with this, a data dependency barrier or better must be inserted
476	between the address load and the data load:
477	
478		CPU 1		CPU 2
479		===============	===============
480		{ A == 1, B == 2, C = 3, P == &A, Q == &C }
481		B = 4;
482		<write barrier>
483		P = &B
484				Q = P;
485				<data dependency barrier>
486				D = *Q;
487	
488	This enforces the occurrence of one of the two implications, and prevents the
489	third possibility from arising.
490	
491	[!] Note that this extremely counterintuitive situation arises most easily on
492	machines with split caches, so that, for example, one cache bank processes
493	even-numbered cache lines and the other bank processes odd-numbered cache
494	lines.  The pointer P might be stored in an odd-numbered cache line, and the
495	variable B might be stored in an even-numbered cache line.  Then, if the
496	even-numbered bank of the reading CPU's cache is extremely busy while the
497	odd-numbered bank is idle, one can see the new value of the pointer P (&B),
498	but the old value of the variable B (2).
499	
500	
501	Another example of where data dependency barriers might by required is where a
502	number is read from memory and then used to calculate the index for an array
503	access:
504	
505		CPU 1		CPU 2
506		===============	===============
507		{ M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
508		M[1] = 4;
509		<write barrier>
510		P = 1
511				Q = P;
512				<data dependency barrier>
513				D = M[Q];
514	
515	
516	The data dependency barrier is very important to the RCU system, for example.
517	See rcu_dereference() in include/linux/rcupdate.h.  This permits the current
518	target of an RCU'd pointer to be replaced with a new modified target, without
519	the replacement target appearing to be incompletely initialised.
520	
521	See also the subsection on "Cache Coherency" for a more thorough example.
522	
523	
524	CONTROL DEPENDENCIES
525	--------------------
526	
527	A control dependency requires a full read memory barrier, not simply a data
528	dependency barrier to make it work correctly.  Consider the following bit of
529	code:
530	
531		q = &a;
532		if (p)
533			q = &b;
534		<data dependency barrier>
535		x = *q;
536	
537	This will not have the desired effect because there is no actual data
538	dependency, but rather a control dependency that the CPU may short-circuit by
539	attempting to predict the outcome in advance.  In such a case what's actually
540	required is:
541	
542		q = &a;
543		if (p)
544			q = &b;
545		<read barrier>
546		x = *q;
547	
548	
549	SMP BARRIER PAIRING
550	-------------------
551	
552	When dealing with CPU-CPU interactions, certain types of memory barrier should
553	always be paired.  A lack of appropriate pairing is almost certainly an error.
554	
555	A write barrier should always be paired with a data dependency barrier or read
556	barrier, though a general barrier would also be viable.  Similarly a read
557	barrier or a data dependency barrier should always be paired with at least an
558	write barrier, though, again, a general barrier is viable:
559	
560		CPU 1		CPU 2
561		===============	===============
562		a = 1;
563		<write barrier>
564		b = 2;		x = b;
565				<read barrier>
566				y = a;
567	
568	Or:
569	
570		CPU 1		CPU 2
571		===============	===============================
572		a = 1;
573		<write barrier>
574		b = &a;		x = b;
575				<data dependency barrier>
576				y = *x;
577	
578	Basically, the read barrier always has to be there, even though it can be of
579	the "weaker" type.
580	
581	[!] Note that the stores before the write barrier would normally be expected to
582	match the loads after the read barrier or the data dependency barrier, and vice
583	versa:
584	
585		CPU 1                           CPU 2
586		===============                 ===============
587		a = 1;           }----   --->{  v = c
588		b = 2;           }    \ /    {  w = d
589		<write barrier>        \        <read barrier>
590		c = 3;           }    / \    {  x = a;
591		d = 4;           }----   --->{  y = b;
592	
593	
594	EXAMPLES OF MEMORY BARRIER SEQUENCES
595	------------------------------------
596	
597	Firstly, write barriers act as partial orderings on store operations.
598	Consider the following sequence of events:
599	
600		CPU 1
601		=======================
602		STORE A = 1
603		STORE B = 2
604		STORE C = 3
605		<write barrier>
606		STORE D = 4
607		STORE E = 5
608	
609	This sequence of events is committed to the memory coherence system in an order
610	that the rest of the system might perceive as the unordered set of { STORE A,
611	STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
612	}:
613	
614		+-------+       :      :
615		|       |       +------+
616		|       |------>| C=3  |     }     /\
617		|       |  :    +------+     }-----  \  -----> Events perceptible to
618		|       |  :    | A=1  |     }        \/       the rest of the system
619		|       |  :    +------+     }
620		| CPU 1 |  :    | B=2  |     }
621		|       |       +------+     }
622		|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
623		|       |       +------+     }        requires all stores prior to the
624		|       |  :    | E=5  |     }        barrier to be committed before
625		|       |  :    +------+     }        further stores may take place
626		|       |------>| D=4  |     }
627		|       |       +------+
628		+-------+       :      :
629		                   |
630		                   | Sequence in which stores are committed to the
631		                   | memory system by CPU 1
632		                   V
633	
634	
635	Secondly, data dependency barriers act as partial orderings on data-dependent
636	loads.  Consider the following sequence of events:
637	
638		CPU 1			CPU 2
639		=======================	=======================
640			{ B = 7; X = 9; Y = 8; C = &Y }
641		STORE A = 1
642		STORE B = 2
643		<write barrier>
644		STORE C = &B		LOAD X
645		STORE D = 4		LOAD C (gets &B)
646					LOAD *C (reads B)
647	
648	Without intervention, CPU 2 may perceive the events on CPU 1 in some
649	effectively random order, despite the write barrier issued by CPU 1:
650	
651		+-------+       :      :                :       :
652		|       |       +------+                +-------+  | Sequence of update
653		|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
654		|       |  :    +------+     \          +-------+  | CPU 2
655		| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
656		|       |       +------+       |        +-------+
657		|       |   wwwwwwwwwwwwwwww   |        :       :
658		|       |       +------+       |        :       :
659		|       |  :    | C=&B |---    |        :       :       +-------+
660		|       |  :    +------+   \   |        +-------+       |       |
661		|       |------>| D=4  |    ----------->| C->&B |------>|       |
662		|       |       +------+       |        +-------+       |       |
663		+-------+       :      :       |        :       :       |       |
664		                               |        :       :       |       |
665		                               |        :       :       | CPU 2 |
666		                               |        +-------+       |       |
667		    Apparently incorrect --->  |        | B->7  |------>|       |
668		    perception of B (!)        |        +-------+       |       |
669		                               |        :       :       |       |
670		                               |        +-------+       |       |
671		    The load of X holds --->    \       | X->9  |------>|       |
672		    up the maintenance           \      +-------+       |       |
673		    of coherence of B             ----->| B->2  |       +-------+
674		                                        +-------+
675		                                        :       :
676	
677	
678	In the above example, CPU 2 perceives that B is 7, despite the load of *C
679	(which would be B) coming after the LOAD of C.
680	
681	If, however, a data dependency barrier were to be placed between the load of C
682	and the load of *C (ie: B) on CPU 2:
683	
684		CPU 1			CPU 2
685		=======================	=======================
686			{ B = 7; X = 9; Y = 8; C = &Y }
687		STORE A = 1
688		STORE B = 2
689		<write barrier>
690		STORE C = &B		LOAD X
691		STORE D = 4		LOAD C (gets &B)
692					<data dependency barrier>
693					LOAD *C (reads B)
694	
695	then the following will occur:
696	
697		+-------+       :      :                :       :
698		|       |       +------+                +-------+
699		|       |------>| B=2  |-----       --->| Y->8  |
700		|       |  :    +------+     \          +-------+
701		| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
702		|       |       +------+       |        +-------+
703		|       |   wwwwwwwwwwwwwwww   |        :       :
704		|       |       +------+       |        :       :
705		|       |  :    | C=&B |---    |        :       :       +-------+
706		|       |  :    +------+   \   |        +-------+       |       |
707		|       |------>| D=4  |    ----------->| C->&B |------>|       |
708		|       |       +------+       |        +-------+       |       |
709		+-------+       :      :       |        :       :       |       |
710		                               |        :       :       |       |
711		                               |        :       :       | CPU 2 |
712		                               |        +-------+       |       |
713		                               |        | X->9  |------>|       |
714		                               |        +-------+       |       |
715		  Makes sure all effects --->   \   ddddddddddddddddd   |       |
716		  prior to the store of C        \      +-------+       |       |
717		  are perceptible to              ----->| B->2  |------>|       |
718		  subsequent loads                      +-------+       |       |
719		                                        :       :       +-------+
720	
721	
722	And thirdly, a read barrier acts as a partial order on loads.  Consider the
723	following sequence of events:
724	
725		CPU 1			CPU 2
726		=======================	=======================
727			{ A = 0, B = 9 }
728		STORE A=1
729		<write barrier>
730		STORE B=2
731					LOAD B
732					LOAD A
733	
734	Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
735	some effectively random order, despite the write barrier issued by CPU 1:
736	
737		+-------+       :      :                :       :
738		|       |       +------+                +-------+
739		|       |------>| A=1  |------      --->| A->0  |
740		|       |       +------+      \         +-------+
741		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
742		|       |       +------+        |       +-------+
743		|       |------>| B=2  |---     |       :       :
744		|       |       +------+   \    |       :       :       +-------+
745		+-------+       :      :    \   |       +-------+       |       |
746		                             ---------->| B->2  |------>|       |
747		                                |       +-------+       | CPU 2 |
748		                                |       | A->0  |------>|       |
749		                                |       +-------+       |       |
750		                                |       :       :       +-------+
751		                                 \      :       :
752		                                  \     +-------+
753		                                   ---->| A->1  |
754		                                        +-------+
755		                                        :       :
756	
757	
758	If, however, a read barrier were to be placed between the load of B and the
759	load of A on CPU 2:
760	
761		CPU 1			CPU 2
762		=======================	=======================
763			{ A = 0, B = 9 }
764		STORE A=1
765		<write barrier>
766		STORE B=2
767					LOAD B
768					<read barrier>
769					LOAD A
770	
771	then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
772	2:
773	
774		+-------+       :      :                :       :
775		|       |       +------+                +-------+
776		|       |------>| A=1  |------      --->| A->0  |
777		|       |       +------+      \         +-------+
778		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
779		|       |       +------+        |       +-------+
780		|       |------>| B=2  |---     |       :       :
781		|       |       +------+   \    |       :       :       +-------+
782		+-------+       :      :    \   |       +-------+       |       |
783		                             ---------->| B->2  |------>|       |
784		                                |       +-------+       | CPU 2 |
785		                                |       :       :       |       |
786		                                |       :       :       |       |
787		  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
788		  barrier causes all effects      \     +-------+       |       |
789		  prior to the storage of B        ---->| A->1  |------>|       |
790		  to be perceptible to CPU 2            +-------+       |       |
791		                                        :       :       +-------+
792	
793	
794	To illustrate this more completely, consider what could happen if the code
795	contained a load of A either side of the read barrier:
796	
797		CPU 1			CPU 2
798		=======================	=======================
799			{ A = 0, B = 9 }
800		STORE A=1
801		<write barrier>
802		STORE B=2
803					LOAD B
804					LOAD A [first load of A]
805					<read barrier>
806					LOAD A [second load of A]
807	
808	Even though the two loads of A both occur after the load of B, they may both
809	come up with different values:
810	
811		+-------+       :      :                :       :
812		|       |       +------+                +-------+
813		|       |------>| A=1  |------      --->| A->0  |
814		|       |       +------+      \         +-------+
815		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
816		|       |       +------+        |       +-------+
817		|       |------>| B=2  |---     |       :       :
818		|       |       +------+   \    |       :       :       +-------+
819		+-------+       :      :    \   |       +-------+       |       |
820		                             ---------->| B->2  |------>|       |
821		                                |       +-------+       | CPU 2 |
822		                                |       :       :       |       |
823		                                |       :       :       |       |
824		                                |       +-------+       |       |
825		                                |       | A->0  |------>| 1st   |
826		                                |       +-------+       |       |
827		  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
828		  barrier causes all effects      \     +-------+       |       |
829		  prior to the storage of B        ---->| A->1  |------>| 2nd   |
830		  to be perceptible to CPU 2            +-------+       |       |
831		                                        :       :       +-------+
832	
833	
834	But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
835	before the read barrier completes anyway:
836	
837		+-------+       :      :                :       :
838		|       |       +------+                +-------+
839		|       |------>| A=1  |------      --->| A->0  |
840		|       |       +------+      \         +-------+
841		| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
842		|       |       +------+        |       +-------+
843		|       |------>| B=2  |---     |       :       :
844		|       |       +------+   \    |       :       :       +-------+
845		+-------+       :      :    \   |       +-------+       |       |
846		                             ---------->| B->2  |------>|       |
847		                                |       +-------+       | CPU 2 |
848		                                |       :       :       |       |
849		                                 \      :       :       |       |
850		                                  \     +-------+       |       |
851		                                   ---->| A->1  |------>| 1st   |
852		                                        +-------+       |       |
853		                                    rrrrrrrrrrrrrrrrr   |       |
854		                                        +-------+       |       |
855		                                        | A->1  |------>| 2nd   |
856		                                        +-------+       |       |
857		                                        :       :       +-------+
858	
859	
860	The guarantee is that the second load will always come up with A == 1 if the
861	load of B came up with B == 2.  No such guarantee exists for the first load of
862	A; that may come up with either A == 0 or A == 1.
863	
864	
865	READ MEMORY BARRIERS VS LOAD SPECULATION
866	----------------------------------------
867	
868	Many CPUs speculate with loads: that is they see that they will need to load an
869	item from memory, and they find a time where they're not using the bus for any
870	other loads, and so do the load in advance - even though they haven't actually
871	got to that point in the instruction execution flow yet.  This permits the
872	actual load instruction to potentially complete immediately because the CPU
873	already has the value to hand.
874	
875	It may turn out that the CPU didn't actually need the value - perhaps because a
876	branch circumvented the load - in which case it can discard the value or just
877	cache it for later use.
878	
879	Consider:
880	
881		CPU 1	   		CPU 2
882		=======================	=======================
883		 	   		LOAD B
884		 	   		DIVIDE		} Divide instructions generally
885		 	   		DIVIDE		} take a long time to perform
886		 	   		LOAD A
887	
888	Which might appear as this:
889	
890		                                        :       :       +-------+
891		                                        +-------+       |       |
892		                                    --->| B->2  |------>|       |
893		                                        +-------+       | CPU 2 |
894		                                        :       :DIVIDE |       |
895		                                        +-------+       |       |
896		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
897		division speculates on the              +-------+   ~   |       |
898		LOAD of A                               :       :   ~   |       |
899		                                        :       :DIVIDE |       |
900		                                        :       :   ~   |       |
901		Once the divisions are complete -->     :       :   ~-->|       |
902		the CPU can then perform the            :       :       |       |
903		LOAD with immediate effect              :       :       +-------+
904	
905	
906	Placing a read barrier or a data dependency barrier just before the second
907	load:
908	
909		CPU 1	   		CPU 2
910		=======================	=======================
911		 	   		LOAD B
912		 	   		DIVIDE
913		 	   		DIVIDE
914					<read barrier>
915		 	   		LOAD A
916	
917	will force any value speculatively obtained to be reconsidered to an extent
918	dependent on the type of barrier used.  If there was no change made to the
919	speculated memory location, then the speculated value will just be used:
920	
921		                                        :       :       +-------+
922		                                        +-------+       |       |
923		                                    --->| B->2  |------>|       |
924		                                        +-------+       | CPU 2 |
925		                                        :       :DIVIDE |       |
926		                                        +-------+       |       |
927		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
928		division speculates on the              +-------+   ~   |       |
929		LOAD of A                               :       :   ~   |       |
930		                                        :       :DIVIDE |       |
931		                                        :       :   ~   |       |
932		                                        :       :   ~   |       |
933		                                    rrrrrrrrrrrrrrrr~   |       |
934		                                        :       :   ~   |       |
935		                                        :       :   ~-->|       |
936		                                        :       :       |       |
937		                                        :       :       +-------+
938	
939	
940	but if there was an update or an invalidation from another CPU pending, then
941	the speculation will be cancelled and the value reloaded:
942	
943		                                        :       :       +-------+
944		                                        +-------+       |       |
945		                                    --->| B->2  |------>|       |
946		                                        +-------+       | CPU 2 |
947		                                        :       :DIVIDE |       |
948		                                        +-------+       |       |
949		The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
950		division speculates on the              +-------+   ~   |       |
951		LOAD of A                               :       :   ~   |       |
952		                                        :       :DIVIDE |       |
953		                                        :       :   ~   |       |
954		                                        :       :   ~   |       |
955		                                    rrrrrrrrrrrrrrrrr   |       |
956		                                        +-------+       |       |
957		The speculation is discarded --->   --->| A->1  |------>|       |
958		and an updated value is                 +-------+       |       |
959		retrieved                               :       :       +-------+
960	
961	
962	========================
963	EXPLICIT KERNEL BARRIERS
964	========================
965	
966	The Linux kernel has a variety of different barriers that act at different
967	levels:
968	
969	  (*) Compiler barrier.
970	
971	  (*) CPU memory barriers.
972	
973	  (*) MMIO write barrier.
974	
975	
976	COMPILER BARRIER
977	----------------
978	
979	The Linux kernel has an explicit compiler barrier function that prevents the
980	compiler from moving the memory accesses either side of it to the other side:
981	
982		barrier();
983	
984	This is a general barrier - lesser varieties of compiler barrier do not exist.
985	
986	The compiler barrier has no direct effect on the CPU, which may then reorder
987	things however it wishes.
988	
989	
990	CPU MEMORY BARRIERS
991	-------------------
992	
993	The Linux kernel has eight basic CPU memory barriers:
994	
995		TYPE		MANDATORY		SMP CONDITIONAL
996		===============	=======================	===========================
997		GENERAL		mb()			smp_mb()
998		WRITE		wmb()			smp_wmb()
999		READ		rmb()			smp_rmb()
1000		DATA DEPENDENCY	read_barrier_depends()	smp_read_barrier_depends()
1001	
1002	
1003	All memory barriers except the data dependency barriers imply a compiler
1004	barrier. Data dependencies do not impose any additional compiler ordering.
1005	
1006	Aside: In the case of data dependencies, the compiler would be expected to
1007	issue the loads in the correct order (eg. `a[b]` would have to load the value
1008	of b before loading a[b]), however there is no guarantee in the C specification
1009	that the compiler may not speculate the value of b (eg. is equal to 1) and load
1010	a before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ). There is also the
1011	problem of a compiler reloading b after having loaded a[b], thus having a newer
1012	copy of b than a[b]. A consensus has not yet been reached about these problems,
1013	however the ACCESS_ONCE macro is a good place to start looking.
1014	
1015	SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
1016	systems because it is assumed that a CPU will appear to be self-consistent,
1017	and will order overlapping accesses correctly with respect to itself.
1018	
1019	[!] Note that SMP memory barriers _must_ be used to control the ordering of
1020	references to shared memory on SMP systems, though the use of locking instead
1021	is sufficient.
1022	
1023	Mandatory barriers should not be used to control SMP effects, since mandatory
1024	barriers unnecessarily impose overhead on UP systems. They may, however, be
1025	used to control MMIO effects on accesses through relaxed memory I/O windows.
1026	These are required even on non-SMP systems as they affect the order in which
1027	memory operations appear to a device by prohibiting both the compiler and the
1028	CPU from reordering them.
1029	
1030	
1031	There are some more advanced barrier functions:
1032	
1033	 (*) set_mb(var, value)
1034	
1035	     This assigns the value to the variable and then inserts a full memory
1036	     barrier after it, depending on the function.  It isn't guaranteed to
1037	     insert anything more than a compiler barrier in a UP compilation.
1038	
1039	
1040	 (*) smp_mb__before_atomic_dec();
1041	 (*) smp_mb__after_atomic_dec();
1042	 (*) smp_mb__before_atomic_inc();
1043	 (*) smp_mb__after_atomic_inc();
1044	
1045	     These are for use with atomic add, subtract, increment and decrement
1046	     functions that don't return a value, especially when used for reference
1047	     counting.  These functions do not imply memory barriers.
1048	
1049	     As an example, consider a piece of code that marks an object as being dead
1050	     and then decrements the object's reference count:
1051	
1052		obj->dead = 1;
1053		smp_mb__before_atomic_dec();
1054		atomic_dec(&obj->ref_count);
1055	
1056	     This makes sure that the death mark on the object is perceived to be set
1057	     *before* the reference counter is decremented.
1058	
1059	     See Documentation/atomic_ops.txt for more information.  See the "Atomic
1060	     operations" subsection for information on where to use these.
1061	
1062	
1063	 (*) smp_mb__before_clear_bit(void);
1064	 (*) smp_mb__after_clear_bit(void);
1065	
1066	     These are for use similar to the atomic inc/dec barriers.  These are
1067	     typically used for bitwise unlocking operations, so care must be taken as
1068	     there are no implicit memory barriers here either.
1069	
1070	     Consider implementing an unlock operation of some nature by clearing a
1071	     locking bit.  The clear_bit() would then need to be barriered like this:
1072	
1073		smp_mb__before_clear_bit();
1074		clear_bit( ... );
1075	
1076	     This prevents memory operations before the clear leaking to after it.  See
1077	     the subsection on "Locking Functions" with reference to UNLOCK operation
1078	     implications.
1079	
1080	     See Documentation/atomic_ops.txt for more information.  See the "Atomic
1081	     operations" subsection for information on where to use these.
1082	
1083	
1084	MMIO WRITE BARRIER
1085	------------------
1086	
1087	The Linux kernel also has a special barrier for use with memory-mapped I/O
1088	writes:
1089	
1090		mmiowb();
1091	
1092	This is a variation on the mandatory write barrier that causes writes to weakly
1093	ordered I/O regions to be partially ordered.  Its effects may go beyond the
1094	CPU->Hardware interface and actually affect the hardware at some level.
1095	
1096	See the subsection "Locks vs I/O accesses" for more information.
1097	
1098	
1099	===============================
1100	IMPLICIT KERNEL MEMORY BARRIERS
1101	===============================
1102	
1103	Some of the other functions in the linux kernel imply memory barriers, amongst
1104	which are locking and scheduling functions.
1105	
1106	This specification is a _minimum_ guarantee; any particular architecture may
1107	provide more substantial guarantees, but these may not be relied upon outside
1108	of arch specific code.
1109	
1110	
1111	LOCKING FUNCTIONS
1112	-----------------
1113	
1114	The Linux kernel has a number of locking constructs:
1115	
1116	 (*) spin locks
1117	 (*) R/W spin locks
1118	 (*) mutexes
1119	 (*) semaphores
1120	 (*) R/W semaphores
1121	 (*) RCU
1122	
1123	In all cases there are variants on "LOCK" operations and "UNLOCK" operations
1124	for each construct.  These operations all imply certain barriers:
1125	
1126	 (1) LOCK operation implication:
1127	
1128	     Memory operations issued after the LOCK will be completed after the LOCK
1129	     operation has completed.
1130	
1131	     Memory operations issued before the LOCK may be completed after the LOCK
1132	     operation has completed.
1133	
1134	 (2) UNLOCK operation implication:
1135	
1136	     Memory operations issued before the UNLOCK will be completed before the
1137	     UNLOCK operation has completed.
1138	
1139	     Memory operations issued after the UNLOCK may be completed before the
1140	     UNLOCK operation has completed.
1141	
1142	 (3) LOCK vs LOCK implication:
1143	
1144	     All LOCK operations issued before another LOCK operation will be completed
1145	     before that LOCK operation.
1146	
1147	 (4) LOCK vs UNLOCK implication:
1148	
1149	     All LOCK operations issued before an UNLOCK operation will be completed
1150	     before the UNLOCK operation.
1151	
1152	     All UNLOCK operations issued before a LOCK operation will be completed
1153	     before the LOCK operation.
1154	
1155	 (5) Failed conditional LOCK implication:
1156	
1157	     Certain variants of the LOCK operation may fail, either due to being
1158	     unable to get the lock immediately, or due to receiving an unblocked
1159	     signal whilst asleep waiting for the lock to become available.  Failed
1160	     locks do not imply any sort of barrier.
1161	
1162	Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
1163	equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
1164	
1165	[!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
1166	    barriers is that the effects of instructions outside of a critical section
1167	    may seep into the inside of the critical section.
1168	
1169	A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
1170	because it is possible for an access preceding the LOCK to happen after the
1171	LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
1172	two accesses can themselves then cross:
1173	
1174		*A = a;
1175		LOCK
1176		UNLOCK
1177		*B = b;
1178	
1179	may occur as:
1180	
1181		LOCK, STORE *B, STORE *A, UNLOCK
1182	
1183	Locks and semaphores may not provide any guarantee of ordering on UP compiled
1184	systems, and so cannot be counted on in such a situation to actually achieve
1185	anything at all - especially with respect to I/O accesses - unless combined
1186	with interrupt disabling operations.
1187	
1188	See also the section on "Inter-CPU locking barrier effects".
1189	
1190	
1191	As an example, consider the following:
1192	
1193		*A = a;
1194		*B = b;
1195		LOCK
1196		*C = c;
1197		*D = d;
1198		UNLOCK
1199		*E = e;
1200		*F = f;
1201	
1202	The following sequence of events is acceptable:
1203	
1204		LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
1205	
1206		[+] Note that {*F,*A} indicates a combined access.
1207	
1208	But none of the following are:
1209	
1210		{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
1211		*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
1212		*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
1213		*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
1214	
1215	
1216	
1217	INTERRUPT DISABLING FUNCTIONS
1218	-----------------------------
1219	
1220	Functions that disable interrupts (LOCK equivalent) and enable interrupts
1221	(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
1222	barriers are required in such a situation, they must be provided from some
1223	other means.
1224	
1225	
1226	SLEEP AND WAKE-UP FUNCTIONS
1227	---------------------------
1228	
1229	Sleeping and waking on an event flagged in global data can be viewed as an
1230	interaction between two pieces of data: the task state of the task waiting for
1231	the event and the global data used to indicate the event.  To make sure that
1232	these appear to happen in the right order, the primitives to begin the process
1233	of going to sleep, and the primitives to initiate a wake up imply certain
1234	barriers.
1235	
1236	Firstly, the sleeper normally follows something like this sequence of events:
1237	
1238		for (;;) {
1239			set_current_state(TASK_UNINTERRUPTIBLE);
1240			if (event_indicated)
1241				break;
1242			schedule();
1243		}
1244	
1245	A general memory barrier is interpolated automatically by set_current_state()
1246	after it has altered the task state:
1247	
1248		CPU 1
1249		===============================
1250		set_current_state();
1251		  set_mb();
1252		    STORE current->state
1253		    <general barrier>
1254		LOAD event_indicated
1255	
1256	set_current_state() may be wrapped by:
1257	
1258		prepare_to_wait();
1259		prepare_to_wait_exclusive();
1260	
1261	which therefore also imply a general memory barrier after setting the state.
1262	The whole sequence above is available in various canned forms, all of which
1263	interpolate the memory barrier in the right place:
1264	
1265		wait_event();
1266		wait_event_interruptible();
1267		wait_event_interruptible_exclusive();
1268		wait_event_interruptible_timeout();
1269		wait_event_killable();
1270		wait_event_timeout();
1271		wait_on_bit();
1272		wait_on_bit_lock();
1273	
1274	
1275	Secondly, code that performs a wake up normally follows something like this:
1276	
1277		event_indicated = 1;
1278		wake_up(&event_wait_queue);
1279	
1280	or:
1281	
1282		event_indicated = 1;
1283		wake_up_process(event_daemon);
1284	
1285	A write memory barrier is implied by wake_up() and co. if and only if they wake
1286	something up.  The barrier occurs before the task state is cleared, and so sits
1287	between the STORE to indicate the event and the STORE to set TASK_RUNNING:
1288	
1289		CPU 1				CPU 2
1290		===============================	===============================
1291		set_current_state();		STORE event_indicated
1292		  set_mb();			wake_up();
1293		    STORE current->state	  <write barrier>
1294		    <general barrier>		  STORE current->state
1295		LOAD event_indicated
1296	
1297	The available waker functions include:
1298	
1299		complete();
1300		wake_up();
1301		wake_up_all();
1302		wake_up_bit();
1303		wake_up_interruptible();
1304		wake_up_interruptible_all();
1305		wake_up_interruptible_nr();
1306		wake_up_interruptible_poll();
1307		wake_up_interruptible_sync();
1308		wake_up_interruptible_sync_poll();
1309		wake_up_locked();
1310		wake_up_locked_poll();
1311		wake_up_nr();
1312		wake_up_poll();
1313		wake_up_process();
1314	
1315	
1316	[!] Note that the memory barriers implied by the sleeper and the waker do _not_
1317	order multiple stores before the wake-up with respect to loads of those stored
1318	values after the sleeper has called set_current_state().  For instance, if the
1319	sleeper does:
1320	
1321		set_current_state(TASK_INTERRUPTIBLE);
1322		if (event_indicated)
1323			break;
1324		__set_current_state(TASK_RUNNING);
1325		do_something(my_data);
1326	
1327	and the waker does:
1328	
1329		my_data = value;
1330		event_indicated = 1;
1331		wake_up(&event_wait_queue);
1332	
1333	there's no guarantee that the change to event_indicated will be perceived by
1334	the sleeper as coming after the change to my_data.  In such a circumstance, the
1335	code on both sides must interpolate its own memory barriers between the
1336	separate data accesses.  Thus the above sleeper ought to do:
1337	
1338		set_current_state(TASK_INTERRUPTIBLE);
1339		if (event_indicated) {
1340			smp_rmb();
1341			do_something(my_data);
1342		}
1343	
1344	and the waker should do:
1345	
1346		my_data = value;
1347		smp_wmb();
1348		event_indicated = 1;
1349		wake_up(&event_wait_queue);
1350	
1351	
1352	MISCELLANEOUS FUNCTIONS
1353	-----------------------
1354	
1355	Other functions that imply barriers:
1356	
1357	 (*) schedule() and similar imply full memory barriers.
1358	
1359	
1360	=================================
1361	INTER-CPU LOCKING BARRIER EFFECTS
1362	=================================
1363	
1364	On SMP systems locking primitives give a more substantial form of barrier: one
1365	that does affect memory access ordering on other CPUs, within the context of
1366	conflict on any particular lock.
1367	
1368	
1369	LOCKS VS MEMORY ACCESSES
1370	------------------------
1371	
1372	Consider the following: the system has a pair of spinlocks (M) and (Q), and
1373	three CPUs; then should the following sequence of events occur:
1374	
1375		CPU 1				CPU 2
1376		===============================	===============================
1377		*A = a;				*E = e;
1378		LOCK M				LOCK Q
1379		*B = b;				*F = f;
1380		*C = c;				*G = g;
1381		UNLOCK M			UNLOCK Q
1382		*D = d;				*H = h;
1383	
1384	Then there is no guarantee as to what order CPU 3 will see the accesses to *A
1385	through *H occur in, other than the constraints imposed by the separate locks
1386	on the separate CPUs. It might, for example, see:
1387	
1388		*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
1389	
1390	But it won't see any of:
1391	
1392		*B, *C or *D preceding LOCK M
1393		*A, *B or *C following UNLOCK M
1394		*F, *G or *H preceding LOCK Q
1395		*E, *F or *G following UNLOCK Q
1396	
1397	
1398	However, if the following occurs:
1399	
1400		CPU 1				CPU 2
1401		===============================	===============================
1402		*A = a;
1403		LOCK M		[1]
1404		*B = b;
1405		*C = c;
1406		UNLOCK M	[1]
1407		*D = d;				*E = e;
1408						LOCK M		[2]
1409						*F = f;
1410						*G = g;
1411						UNLOCK M	[2]
1412						*H = h;
1413	
1414	CPU 3 might see:
1415	
1416		*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
1417			LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
1418	
1419	But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
1420	
1421		*B, *C, *D, *F, *G or *H preceding LOCK M [1]
1422		*A, *B or *C following UNLOCK M [1]
1423		*F, *G or *H preceding LOCK M [2]
1424		*A, *B, *C, *E, *F or *G following UNLOCK M [2]
1425	
1426	
1427	LOCKS VS I/O ACCESSES
1428	---------------------
1429	
1430	Under certain circumstances (especially involving NUMA), I/O accesses within
1431	two spinlocked sections on two different CPUs may be seen as interleaved by the
1432	PCI bridge, because the PCI bridge does not necessarily participate in the
1433	cache-coherence protocol, and is therefore incapable of issuing the required
1434	read memory barriers.
1435	
1436	For example:
1437	
1438		CPU 1				CPU 2
1439		===============================	===============================
1440		spin_lock(Q)
1441		writel(0, ADDR)
1442		writel(1, DATA);
1443		spin_unlock(Q);
1444						spin_lock(Q);
1445						writel(4, ADDR);
1446						writel(5, DATA);
1447						spin_unlock(Q);
1448	
1449	may be seen by the PCI bridge as follows:
1450	
1451		STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
1452	
1453	which would probably cause the hardware to malfunction.
1454	
1455	
1456	What is necessary here is to intervene with an mmiowb() before dropping the
1457	spinlock, for example:
1458	
1459		CPU 1				CPU 2
1460		===============================	===============================
1461		spin_lock(Q)
1462		writel(0, ADDR)
1463		writel(1, DATA);
1464		mmiowb();
1465		spin_unlock(Q);
1466						spin_lock(Q);
1467						writel(4, ADDR);
1468						writel(5, DATA);
1469						mmiowb();
1470						spin_unlock(Q);
1471	
1472	this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
1473	before either of the stores issued on CPU 2.
1474	
1475	
1476	Furthermore, following a store by a load from the same device obviates the need
1477	for the mmiowb(), because the load forces the store to complete before the load
1478	is performed:
1479	
1480		CPU 1				CPU 2
1481		===============================	===============================
1482		spin_lock(Q)
1483		writel(0, ADDR)
1484		a = readl(DATA);
1485		spin_unlock(Q);
1486						spin_lock(Q);
1487						writel(4, ADDR);
1488						b = readl(DATA);
1489						spin_unlock(Q);
1490	
1491	
1492	See Documentation/DocBook/deviceiobook.tmpl for more information.
1493	
1494	
1495	=================================
1496	WHERE ARE MEMORY BARRIERS NEEDED?
1497	=================================
1498	
1499	Under normal operation, memory operation reordering is generally not going to
1500	be a problem as a single-threaded linear piece of code will still appear to
1501	work correctly, even if it's in an SMP kernel.  There are, however, four
1502	circumstances in which reordering definitely _could_ be a problem:
1503	
1504	 (*) Interprocessor interaction.
1505	
1506	 (*) Atomic operations.
1507	
1508	 (*) Accessing devices.
1509	
1510	 (*) Interrupts.
1511	
1512	
1513	INTERPROCESSOR INTERACTION
1514	--------------------------
1515	
1516	When there's a system with more than one processor, more than one CPU in the
1517	system may be working on the same data set at the same time.  This can cause
1518	synchronisation problems, and the usual way of dealing with them is to use
1519	locks.  Locks, however, are quite expensive, and so it may be preferable to
1520	operate without the use of a lock if at all possible.  In such a case
1521	operations that affect both CPUs may have to be carefully ordered to prevent
1522	a malfunction.
1523	
1524	Consider, for example, the R/W semaphore slow path.  Here a waiting process is
1525	queued on the semaphore, by virtue of it having a piece of its stack linked to
1526	the semaphore's list of waiting processes:
1527	
1528		struct rw_semaphore {
1529			...
1530			spinlock_t lock;
1531			struct list_head waiters;
1532		};
1533	
1534		struct rwsem_waiter {
1535			struct list_head list;
1536			struct task_struct *task;
1537		};
1538	
1539	To wake up a particular waiter, the up_read() or up_write() functions have to:
1540	
1541	 (1) read the next pointer from this waiter's record to know as to where the
1542	     next waiter record is;
1543	
1544	 (2) read the pointer to the waiter's task structure;
1545	
1546	 (3) clear the task pointer to tell the waiter it has been given the semaphore;
1547	
1548	 (4) call wake_up_process() on the task; and
1549	
1550	 (5) release the reference held on the waiter's task struct.
1551	
1552	In other words, it has to perform this sequence of events:
1553	
1554		LOAD waiter->list.next;
1555		LOAD waiter->task;
1556		STORE waiter->task;
1557		CALL wakeup
1558		RELEASE task
1559	
1560	and if any of these steps occur out of order, then the whole thing may
1561	malfunction.
1562	
1563	Once it has queued itself and dropped the semaphore lock, the waiter does not
1564	get the lock again; it instead just waits for its task pointer to be cleared
1565	before proceeding.  Since the record is on the waiter's stack, this means that
1566	if the task pointer is cleared _before_ the next pointer in the list is read,
1567	another CPU might start processing the waiter and might clobber the waiter's
1568	stack before the up*() function has a chance to read the next pointer.
1569	
1570	Consider then what might happen to the above sequence of events:
1571	
1572		CPU 1				CPU 2
1573		===============================	===============================
1574						down_xxx()
1575						Queue waiter
1576						Sleep
1577		up_yyy()
1578		LOAD waiter->task;
1579		STORE waiter->task;
1580						Woken up by other event
1581		<preempt>
1582						Resume processing
1583						down_xxx() returns
1584						call foo()
1585						foo() clobbers *waiter
1586		</preempt>
1587		LOAD waiter->list.next;
1588		--- OOPS ---
1589	
1590	This could be dealt with using the semaphore lock, but then the down_xxx()
1591	function has to needlessly get the spinlock again after being woken up.
1592	
1593	The way to deal with this is to insert a general SMP memory barrier:
1594	
1595		LOAD waiter->list.next;
1596		LOAD waiter->task;
1597		smp_mb();
1598		STORE waiter->task;
1599		CALL wakeup
1600		RELEASE task
1601	
1602	In this case, the barrier makes a guarantee that all memory accesses before the
1603	barrier will appear to happen before all the memory accesses after the barrier
1604	with respect to the other CPUs on the system.  It does _not_ guarantee that all
1605	the memory accesses before the barrier will be complete by the time the barrier
1606	instruction itself is complete.
1607	
1608	On a UP system - where this wouldn't be a problem - the smp_mb() is just a
1609	compiler barrier, thus making sure the compiler emits the instructions in the
1610	right order without actually intervening in the CPU.  Since there's only one
1611	CPU, that CPU's dependency ordering logic will take care of everything else.
1612	
1613	
1614	ATOMIC OPERATIONS
1615	-----------------
1616	
1617	Whilst they are technically interprocessor interaction considerations, atomic
1618	operations are noted specially as some of them imply full memory barriers and
1619	some don't, but they're very heavily relied on as a group throughout the
1620	kernel.
1621	
1622	Any atomic operation that modifies some state in memory and returns information
1623	about the state (old or new) implies an SMP-conditional general memory barrier
1624	(smp_mb()) on each side of the actual operation (with the exception of
1625	explicit lock operations, described later).  These include:
1626	
1627		xchg();
1628		cmpxchg();
1629		atomic_cmpxchg();
1630		atomic_inc_return();
1631		atomic_dec_return();
1632		atomic_add_return();
1633		atomic_sub_return();
1634		atomic_inc_and_test();
1635		atomic_dec_and_test();
1636		atomic_sub_and_test();
1637		atomic_add_negative();
1638		atomic_add_unless();	/* when succeeds (returns 1) */
1639		test_and_set_bit();
1640		test_and_clear_bit();
1641		test_and_change_bit();
1642	
1643	These are used for such things as implementing LOCK-class and UNLOCK-class
1644	operations and adjusting reference counters towards object destruction, and as
1645	such the implicit memory barrier effects are necessary.
1646	
1647	
1648	The following operations are potential problems as they do _not_ imply memory
1649	barriers, but might be used for implementing such things as UNLOCK-class
1650	operations:
1651	
1652		atomic_set();
1653		set_bit();
1654		clear_bit();
1655		change_bit();
1656	
1657	With these the appropriate explicit memory barrier should be used if necessary
1658	(smp_mb__before_clear_bit() for instance).
1659	
1660	
1661	The following also do _not_ imply memory barriers, and so may require explicit
1662	memory barriers under some circumstances (smp_mb__before_atomic_dec() for
1663	instance):
1664	
1665		atomic_add();
1666		atomic_sub();
1667		atomic_inc();
1668		atomic_dec();
1669	
1670	If they're used for statistics generation, then they probably don't need memory
1671	barriers, unless there's a coupling between statistical data.
1672	
1673	If they're used for reference counting on an object to control its lifetime,
1674	they probably don't need memory barriers because either the reference count
1675	will be adjusted inside a locked section, or the caller will already hold
1676	sufficient references to make the lock, and thus a memory barrier unnecessary.
1677	
1678	If they're used for constructing a lock of some description, then they probably
1679	do need memory barriers as a lock primitive generally has to do things in a
1680	specific order.
1681	
1682	Basically, each usage case has to be carefully considered as to whether memory
1683	barriers are needed or not.
1684	
1685	The following operations are special locking primitives:
1686	
1687		test_and_set_bit_lock();
1688		clear_bit_unlock();
1689		__clear_bit_unlock();
1690	
1691	These implement LOCK-class and UNLOCK-class operations. These should be used in
1692	preference to other operations when implementing locking primitives, because
1693	their implementations can be optimised on many architectures.
1694	
1695	[!] Note that special memory barrier primitives are available for these
1696	situations because on some CPUs the atomic instructions used imply full memory
1697	barriers, and so barrier instructions are superfluous in conjunction with them,
1698	and in such cases the special barrier primitives will be no-ops.
1699	
1700	See Documentation/atomic_ops.txt for more information.
1701	
1702	
1703	ACCESSING DEVICES
1704	-----------------
1705	
1706	Many devices can be memory mapped, and so appear to the CPU as if they're just
1707	a set of memory locations.  To control such a device, the driver usually has to
1708	make the right memory accesses in exactly the right order.
1709	
1710	However, having a clever CPU or a clever compiler creates a potential problem
1711	in that the carefully sequenced accesses in the driver code won't reach the
1712	device in the requisite order if the CPU or the compiler thinks it is more
1713	efficient to reorder, combine or merge accesses - something that would cause
1714	the device to malfunction.
1715	
1716	Inside of the Linux kernel, I/O should be done through the appropriate accessor
1717	routines - such as inb() or writel() - which know how to make such accesses
1718	appropriately sequential.  Whilst this, for the most part, renders the explicit
1719	use of memory barriers unnecessary, there are a couple of situations where they
1720	might be needed:
1721	
1722	 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
1723	     so for _all_ general drivers locks should be used and mmiowb() must be
1724	     issued prior to unlocking the critical section.
1725	
1726	 (2) If the accessor functions are used to refer to an I/O memory window with
1727	     relaxed memory access properties, then _mandatory_ memory barriers are
1728	     required to enforce ordering.
1729	
1730	See Documentation/DocBook/deviceiobook.tmpl for more information.
1731	
1732	
1733	INTERRUPTS
1734	----------
1735	
1736	A driver may be interrupted by its own interrupt service routine, and thus the
1737	two parts of the driver may interfere with each other's attempts to control or
1738	access the device.
1739	
1740	This may be alleviated - at least in part - by disabling local interrupts (a
1741	form of locking), such that the critical operations are all contained within
1742	the interrupt-disabled section in the driver.  Whilst the driver's interrupt
1743	routine is executing, the driver's core may not run on the same CPU, and its
1744	interrupt is not permitted to happen again until the current interrupt has been
1745	handled, thus the interrupt handler does not need to lock against that.
1746	
1747	However, consider a driver that was talking to an ethernet card that sports an
1748	address register and a data register.  If that driver's core talks to the card
1749	under interrupt-disablement and then the driver's interrupt handler is invoked:
1750	
1751		LOCAL IRQ DISABLE
1752		writew(ADDR, 3);
1753		writew(DATA, y);
1754		LOCAL IRQ ENABLE
1755		<interrupt>
1756		writew(ADDR, 4);
1757		q = readw(DATA);
1758		</interrupt>
1759	
1760	The store to the data register might happen after the second store to the
1761	address register if ordering rules are sufficiently relaxed:
1762	
1763		STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
1764	
1765	
1766	If ordering rules are relaxed, it must be assumed that accesses done inside an
1767	interrupt disabled section may leak outside of it and may interleave with
1768	accesses performed in an interrupt - and vice versa - unless implicit or
1769	explicit barriers are used.
1770	
1771	Normally this won't be a problem because the I/O accesses done inside such
1772	sections will include synchronous load operations on strictly ordered I/O
1773	registers that form implicit I/O barriers. If this isn't sufficient then an
1774	mmiowb() may need to be used explicitly.
1775	
1776	
1777	A similar situation may occur between an interrupt routine and two routines
1778	running on separate CPUs that communicate with each other. If such a case is
1779	likely, then interrupt-disabling locks should be used to guarantee ordering.
1780	
1781	
1782	==========================
1783	KERNEL I/O BARRIER EFFECTS
1784	==========================
1785	
1786	When accessing I/O memory, drivers should use the appropriate accessor
1787	functions:
1788	
1789	 (*) inX(), outX():
1790	
1791	     These are intended to talk to I/O space rather than memory space, but
1792	     that's primarily a CPU-specific concept. The i386 and x86_64 processors do
1793	     indeed have special I/O space access cycles and instructions, but many
1794	     CPUs don't have such a concept.
1795	
1796	     The PCI bus, amongst others, defines an I/O space concept which - on such
1797	     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
1798	     space.  However, it may also be mapped as a virtual I/O space in the CPU's
1799	     memory map, particularly on those CPUs that don't support alternate I/O
1800	     spaces.
1801	
1802	     Accesses to this space may be fully synchronous (as on i386), but
1803	     intermediary bridges (such as the PCI host bridge) may not fully honour
1804	     that.
1805	
1806	     They are guaranteed to be fully ordered with respect to each other.
1807	
1808	     They are not guaranteed to be fully ordered with respect to other types of
1809	     memory and I/O operation.
1810	
1811	 (*) readX(), writeX():
1812	
1813	     Whether these are guaranteed to be fully ordered and uncombined with
1814	     respect to each other on the issuing CPU depends on the characteristics
1815	     defined for the memory window through which they're accessing. On later
1816	     i386 architecture machines, for example, this is controlled by way of the
1817	     MTRR registers.
1818	
1819	     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
1820	     provided they're not accessing a prefetchable device.
1821	
1822	     However, intermediary hardware (such as a PCI bridge) may indulge in
1823	     deferral if it so wishes; to flush a store, a load from the same location
1824	     is preferred[*], but a load from the same device or from configuration
1825	     space should suffice for PCI.
1826	
1827	     [*] NOTE! attempting to load from the same location as was written to may
1828	     	 cause a malfunction - consider the 16550 Rx/Tx serial registers for
1829	     	 example.
1830	
1831	     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
1832	     force stores to be ordered.
1833	
1834	     Please refer to the PCI specification for more information on interactions
1835	     between PCI transactions.
1836	
1837	 (*) readX_relaxed()
1838	
1839	     These are similar to readX(), but are not guaranteed to be ordered in any
1840	     way. Be aware that there is no I/O read barrier available.
1841	
1842	 (*) ioreadX(), iowriteX()
1843	
1844	     These will perform appropriately for the type of access they're actually
1845	     doing, be it inX()/outX() or readX()/writeX().
1846	
1847	
1848	========================================
1849	ASSUMED MINIMUM EXECUTION ORDERING MODEL
1850	========================================
1851	
1852	It has to be assumed that the conceptual CPU is weakly-ordered but that it will
1853	maintain the appearance of program causality with respect to itself.  Some CPUs
1854	(such as i386 or x86_64) are more constrained than others (such as powerpc or
1855	frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
1856	of arch-specific code.
1857	
1858	This means that it must be considered that the CPU will execute its instruction
1859	stream in any order it feels like - or even in parallel - provided that if an
1860	instruction in the stream depends on an earlier instruction, then that
1861	earlier instruction must be sufficiently complete[*] before the later
1862	instruction may proceed; in other words: provided that the appearance of
1863	causality is maintained.
1864	
1865	 [*] Some instructions have more than one effect - such as changing the
1866	     condition codes, changing registers or changing memory - and different
1867	     instructions may depend on different effects.
1868	
1869	A CPU may also discard any instruction sequence that winds up having no
1870	ultimate effect.  For example, if two adjacent instructions both load an
1871	immediate value into the same register, the first may be discarded.
1872	
1873	
1874	Similarly, it has to be assumed that compiler might reorder the instruction
1875	stream in any way it sees fit, again provided the appearance of causality is
1876	maintained.
1877	
1878	
1879	============================
1880	THE EFFECTS OF THE CPU CACHE
1881	============================
1882	
1883	The way cached memory operations are perceived across the system is affected to
1884	a certain extent by the caches that lie between CPUs and memory, and by the
1885	memory coherence system that maintains the consistency of state in the system.
1886	
1887	As far as the way a CPU interacts with another part of the system through the
1888	caches goes, the memory system has to include the CPU's caches, and memory
1889	barriers for the most part act at the interface between the CPU and its cache
1890	(memory barriers logically act on the dotted line in the following diagram):
1891	
1892		    <--- CPU --->         :       <----------- Memory ----------->
1893		                          :
1894		+--------+    +--------+  :   +--------+    +-----------+
1895		|        |    |        |  :   |        |    |           |    +--------+
1896		|  CPU   |    | Memory |  :   | CPU    |    |           |    |	      |
1897		|  Core  |--->| Access |----->| Cache  |<-->|           |    |	      |
1898		|        |    | Queue  |  :   |        |    |           |--->| Memory |
1899		|        |    |        |  :   |        |    |           |    |	      |
1900		+--------+    +--------+  :   +--------+    |           |    | 	      |
1901		                          :                 | Cache     |    +--------+
1902		                          :                 | Coherency |
1903		                          :                 | Mechanism |    +--------+
1904		+--------+    +--------+  :   +--------+    |           |    |	      |
1905		|        |    |        |  :   |        |    |           |    |        |
1906		|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
1907		|  Core  |--->| Access |----->| Cache  |<-->|           |    | 	      |
1908		|        |    | Queue  |  :   |        |    |           |    | 	      |
1909		|        |    |        |  :   |        |    |           |    +--------+
1910		+--------+    +--------+  :   +--------+    +-----------+
1911		                          :
1912		                          :
1913	
1914	Although any particular load or store may not actually appear outside of the
1915	CPU that issued it since it may have been satisfied within the CPU's own cache,
1916	it will still appear as if the full memory access had taken place as far as the
1917	other CPUs are concerned since the cache coherency mechanisms will migrate the
1918	cacheline over to the accessing CPU and propagate the effects upon conflict.
1919	
1920	The CPU core may execute instructions in any order it deems fit, provided the
1921	expected program causality appears to be maintained.  Some of the instructions
1922	generate load and store operations which then go into the queue of memory
1923	accesses to be performed.  The core may place these in the queue in any order
1924	it wishes, and continue execution until it is forced to wait for an instruction
1925	to complete.
1926	
1927	What memory barriers are concerned with is controlling the order in which
1928	accesses cross from the CPU side of things to the memory side of things, and
1929	the order in which the effects are perceived to happen by the other observers
1930	in the system.
1931	
1932	[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
1933	their own loads and stores as if they had happened in program order.
1934	
1935	[!] MMIO or other device accesses may bypass the cache system.  This depends on
1936	the properties of the memory window through which devices are accessed and/or
1937	the use of any special device communication instructions the CPU may have.
1938	
1939	
1940	CACHE COHERENCY
1941	---------------
1942	
1943	Life isn't quite as simple as it may appear above, however: for while the
1944	caches are expected to be coherent, there's no guarantee that that coherency
1945	will be ordered.  This means that whilst changes made on one CPU will
1946	eventually become visible on all CPUs, there's no guarantee that they will
1947	become apparent in the same order on those other CPUs.
1948	
1949	
1950	Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
1951	has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
1952	
1953		            :
1954		            :                          +--------+
1955		            :      +---------+         |        |
1956		+--------+  : +--->| Cache A |<------->|        |
1957		|        |  : |    +---------+         |        |
1958		|  CPU 1 |<---+                        |        |
1959		|        |  : |    +---------+         |        |
1960		+--------+  : +--->| Cache B |<------->|        |
1961		            :      +---------+         |        |
1962		            :                          | Memory |
1963		            :      +---------+         | System |
1964		+--------+  : +--->| Cache C |<------->|        |
1965		|        |  : |    +---------+         |        |
1966		|  CPU 2 |<---+                        |        |
1967		|        |  : |    +---------+         |        |
1968		+--------+  : +--->| Cache D |<------->|        |
1969		            :      +---------+         |        |
1970		            :                          +--------+
1971		            :
1972	
1973	Imagine the system has the following properties:
1974	
1975	 (*) an odd-numbered cache line may be in cache A, cache C or it may still be
1976	     resident in memory;
1977	
1978	 (*) an even-numbered cache line may be in cache B, cache D or it may still be
1979	     resident in memory;
1980	
1981	 (*) whilst the CPU core is interrogating one cache, the other cache may be
1982	     making use of the bus to access the rest of the system - perhaps to
1983	     displace a dirty cacheline or to do a speculative load;
1984	
1985	 (*) each cache has a queue of operations that need to be applied to that cache
1986	     to maintain coherency with the rest of the system;
1987	
1988	 (*) the coherency queue is not flushed by normal loads to lines already
1989	     present in the cache, even though the contents of the queue may
1990	     potentially affect those loads.
1991	
1992	Imagine, then, that two writes are made on the first CPU, with a write barrier
1993	between them to guarantee that they will appear to reach that CPU's caches in
1994	the requisite order:
1995	
1996		CPU 1		CPU 2		COMMENT
1997		===============	===============	=======================================
1998						u == 0, v == 1 and p == &u, q == &u
1999		v = 2;
2000		smp_wmb();			Make sure change to v is visible before
2001						 change to p
2002		<A:modify v=2>			v is now in cache A exclusively
2003		p = &v;
2004		<B:modify p=&v>			p is now in cache B exclusively
2005	
2006	The write memory barrier forces the other CPUs in the system to perceive that
2007	the local CPU's caches have apparently been updated in the correct order.  But
2008	now imagine that the second CPU wants to read those values:
2009	
2010		CPU 1		CPU 2		COMMENT
2011		===============	===============	=======================================
2012		...
2013				q = p;
2014				x = *q;
2015	
2016	The above pair of reads may then fail to happen in the expected order, as the
2017	cacheline holding p may get updated in one of the second CPU's caches whilst
2018	the update to the cacheline holding v is delayed in the other of the second
2019	CPU's caches by some other cache event:
2020	
2021		CPU 1		CPU 2		COMMENT
2022		===============	===============	=======================================
2023						u == 0, v == 1 and p == &u, q == &u
2024		v = 2;
2025		smp_wmb();
2026		<A:modify v=2>	<C:busy>
2027				<C:queue v=2>
2028		p = &v;		q = p;
2029				<D:request p>
2030		<B:modify p=&v>	<D:commit p=&v>
2031			  	<D:read p>
2032				x = *q;
2033				<C:read *q>	Reads from v before v updated in cache
2034				<C:unbusy>
2035				<C:commit v=2>
2036	
2037	Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
2038	no guarantee that, without intervention, the order of update will be the same
2039	as that committed on CPU 1.
2040	
2041	
2042	To intervene, we need to interpolate a data dependency barrier or a read
2043	barrier between the loads.  This will force the cache to commit its coherency
2044	queue before processing any further requests:
2045	
2046		CPU 1		CPU 2		COMMENT
2047		===============	===============	=======================================
2048						u == 0, v == 1 and p == &u, q == &u
2049		v = 2;
2050		smp_wmb();
2051		<A:modify v=2>	<C:busy>
2052				<C:queue v=2>
2053		p = &v;		q = p;
2054				<D:request p>
2055		<B:modify p=&v>	<D:commit p=&v>
2056			  	<D:read p>
2057				smp_read_barrier_depends()
2058				<C:unbusy>
2059				<C:commit v=2>
2060				x = *q;
2061				<C:read *q>	Reads from v after v updated in cache
2062	
2063	
2064	This sort of problem can be encountered on DEC Alpha processors as they have a
2065	split cache that improves performance by making better use of the data bus.
2066	Whilst most CPUs do imply a data dependency barrier on the read when a memory
2067	access depends on a read, not all do, so it may not be relied on.
2068	
2069	Other CPUs may also have split caches, but must coordinate between the various
2070	cachelets for normal memory accesses.  The semantics of the Alpha removes the
2071	need for coordination in the absence of memory barriers.
2072	
2073	
2074	CACHE COHERENCY VS DMA
2075	----------------------
2076	
2077	Not all systems maintain cache coherency with respect to devices doing DMA.  In
2078	such cases, a device attempting DMA may obtain stale data from RAM because
2079	dirty cache lines may be resident in the caches of various CPUs, and may not
2080	have been written back to RAM yet.  To deal with this, the appropriate part of
2081	the kernel must flush the overlapping bits of cache on each CPU (and maybe
2082	invalidate them as well).
2083	
2084	In addition, the data DMA'd to RAM by a device may be overwritten by dirty
2085	cache lines being written back to RAM from a CPU's cache after the device has
2086	installed its own data, or cache lines present in the CPU's cache may simply
2087	obscure the fact that RAM has been updated, until at such time as the cacheline
2088	is discarded from the CPU's cache and reloaded.  To deal with this, the
2089	appropriate part of the kernel must invalidate the overlapping bits of the
2090	cache on each CPU.
2091	
2092	See Documentation/cachetlb.txt for more information on cache management.
2093	
2094	
2095	CACHE COHERENCY VS MMIO
2096	-----------------------
2097	
2098	Memory mapped I/O usually takes place through memory locations that are part of
2099	a window in the CPU's memory space that has different properties assigned than
2100	the usual RAM directed window.
2101	
2102	Amongst these properties is usually the fact that such accesses bypass the
2103	caching entirely and go directly to the device buses.  This means MMIO accesses
2104	may, in effect, overtake accesses to cached memory that were emitted earlier.
2105	A memory barrier isn't sufficient in such a case, but rather the cache must be
2106	flushed between the cached memory write and the MMIO access if the two are in
2107	any way dependent.
2108	
2109	
2110	=========================
2111	THE THINGS CPUS GET UP TO
2112	=========================
2113	
2114	A programmer might take it for granted that the CPU will perform memory
2115	operations in exactly the order specified, so that if the CPU is, for example,
2116	given the following piece of code to execute:
2117	
2118		a = *A;
2119		*B = b;
2120		c = *C;
2121		d = *D;
2122		*E = e;
2123	
2124	they would then expect that the CPU will complete the memory operation for each
2125	instruction before moving on to the next one, leading to a definite sequence of
2126	operations as seen by external observers in the system:
2127	
2128		LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
2129	
2130	
2131	Reality is, of course, much messier.  With many CPUs and compilers, the above
2132	assumption doesn't hold because:
2133	
2134	 (*) loads are more likely to need to be completed immediately to permit
2135	     execution progress, whereas stores can often be deferred without a
2136	     problem;
2137	
2138	 (*) loads may be done speculatively, and the result discarded should it prove
2139	     to have been unnecessary;
2140	
2141	 (*) loads may be done speculatively, leading to the result having been fetched
2142	     at the wrong time in the expected sequence of events;
2143	
2144	 (*) the order of the memory accesses may be rearranged to promote better use
2145	     of the CPU buses and caches;
2146	
2147	 (*) loads and stores may be combined to improve performance when talking to
2148	     memory or I/O hardware that can do batched accesses of adjacent locations,
2149	     thus cutting down on transaction setup costs (memory and PCI devices may
2150	     both be able to do this); and
2151	
2152	 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
2153	     mechanisms may alleviate this - once the store has actually hit the cache
2154	     - there's no guarantee that the coherency management will be propagated in
2155	     order to other CPUs.
2156	
2157	So what another CPU, say, might actually observe from the above piece of code
2158	is:
2159	
2160		LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B
2161	
2162		(Where "LOAD {*C,*D}" is a combined load)
2163	
2164	
2165	However, it is guaranteed that a CPU will be self-consistent: it will see its
2166	_own_ accesses appear to be correctly ordered, without the need for a memory
2167	barrier.  For instance with the following code:
2168	
2169		U = *A;
2170		*A = V;
2171		*A = W;
2172		X = *A;
2173		*A = Y;
2174		Z = *A;
2175	
2176	and assuming no intervention by an external influence, it can be assumed that
2177	the final result will appear to be:
2178	
2179		U == the original value of *A
2180		X == W
2181		Z == Y
2182		*A == Y
2183	
2184	The code above may cause the CPU to generate the full sequence of memory
2185	accesses:
2186	
2187		U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A
2188	
2189	in that order, but, without intervention, the sequence may have almost any
2190	combination of elements combined or discarded, provided the program's view of
2191	the world remains consistent.
2192	
2193	The compiler may also combine, discard or defer elements of the sequence before
2194	the CPU even sees them.
2195	
2196	For instance:
2197	
2198		*A = V;
2199		*A = W;
2200	
2201	may be reduced to:
2202	
2203		*A = W;
2204	
2205	since, without a write barrier, it can be assumed that the effect of the
2206	storage of V to *A is lost.  Similarly:
2207	
2208		*A = Y;
2209		Z = *A;
2210	
2211	may, without a memory barrier, be reduced to:
2212	
2213		*A = Y;
2214		Z = Y;
2215	
2216	and the LOAD operation never appear outside of the CPU.
2217	
2218	
2219	AND THEN THERE'S THE ALPHA
2220	--------------------------
2221	
2222	The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
2223	some versions of the Alpha CPU have a split data cache, permitting them to have
2224	two semantically-related cache lines updated at separate times.  This is where
2225	the data dependency barrier really becomes necessary as this synchronises both
2226	caches with the memory coherence system, thus making it seem like pointer
2227	changes vs new data occur in the right order.
2228	
2229	The Alpha defines the Linux kernel's memory barrier model.
2230	
2231	See the subsection on "Cache Coherency" above.
2232	
2233	
2234	============
2235	EXAMPLE USES
2236	============
2237	
2238	CIRCULAR BUFFERS
2239	----------------
2240	
2241	Memory barriers can be used to implement circular buffering without the need
2242	of a lock to serialise the producer with the consumer.  See:
2243	
2244		Documentation/circular-buffers.txt
2245	
2246	for details.
2247	
2248	
2249	==========
2250	REFERENCES
2251	==========
2252	
2253	Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
2254	Digital Press)
2255		Chapter 5.2: Physical Address Space Characteristics
2256		Chapter 5.4: Caches and Write Buffers
2257		Chapter 5.5: Data Sharing
2258		Chapter 5.6: Read/Write Ordering
2259	
2260	AMD64 Architecture Programmer's Manual Volume 2: System Programming
2261		Chapter 7.1: Memory-Access Ordering
2262		Chapter 7.4: Buffering and Combining Memory Writes
2263	
2264	IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2265	System Programming Guide
2266		Chapter 7.1: Locked Atomic Operations
2267		Chapter 7.2: Memory Ordering
2268		Chapter 7.4: Serializing Instructions
2269	
2270	The SPARC Architecture Manual, Version 9
2271		Chapter 8: Memory Models
2272		Appendix D: Formal Specification of the Memory Models
2273		Appendix J: Programming with the Memory Models
2274	
2275	UltraSPARC Programmer Reference Manual
2276		Chapter 5: Memory Accesses and Cacheability
2277		Chapter 15: Sparc-V9 Memory Models
2278	
2279	UltraSPARC III Cu User's Manual
2280		Chapter 9: Memory Models
2281	
2282	UltraSPARC IIIi Processor User's Manual
2283		Chapter 8: Memory Models
2284	
2285	UltraSPARC Architecture 2005
2286		Chapter 9: Memory
2287		Appendix D: Formal Specifications of the Memory Models
2288	
2289	UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
2290		Chapter 8: Memory Models
2291		Appendix F: Caches and Cache Coherency
2292	
2293	Solaris Internals, Core Kernel Architecture, p63-68:
2294		Chapter 3.3: Hardware Considerations for Locks and
2295				Synchronization
2296	
2297	Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
2298	for Kernel Programmers:
2299		Chapter 13: Other Memory Models
2300	
2301	Intel Itanium Architecture Software Developer's Manual: Volume 1:
2302		Section 2.6: Speculation
2303		Section 4.4: Memory Access
Hide Line Numbers
About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.