Based on kernel version 2.6.26. Page generated on 2008-07-16 21:13 EST.
1 ============================ 2 LINUX KERNEL MEMORY BARRIERS 3 ============================ 4 5 By: David Howells <dhowells[AT]redhat[DOT]com> 6 7 Contents: 8 9 (*) Abstract memory access model. 10 11 - Device operations. 12 - Guarantees. 13 14 (*) What are memory barriers? 15 16 - Varieties of memory barrier. 17 - What may not be assumed about memory barriers? 18 - Data dependency barriers. 19 - Control dependencies. 20 - SMP barrier pairing. 21 - Examples of memory barrier sequences. 22 - Read memory barriers vs load speculation. 23 24 (*) Explicit kernel barriers. 25 26 - Compiler barrier. 27 - CPU memory barriers. 28 - MMIO write barrier. 29 30 (*) Implicit kernel memory barriers. 31 32 - Locking functions. 33 - Interrupt disabling functions. 34 - Miscellaneous functions. 35 36 (*) Inter-CPU locking barrier effects. 37 38 - Locks vs memory accesses. 39 - Locks vs I/O accesses. 40 41 (*) Where are memory barriers needed? 42 43 - Interprocessor interaction. 44 - Atomic operations. 45 - Accessing devices. 46 - Interrupts. 47 48 (*) Kernel I/O barrier effects. 49 50 (*) Assumed minimum execution ordering model. 51 52 (*) The effects of the cpu cache. 53 54 - Cache coherency. 55 - Cache coherency vs DMA. 56 - Cache coherency vs MMIO. 57 58 (*) The things CPUs get up to. 59 60 - And then there's the Alpha. 61 62 (*) References. 63 64 65 ============================ 66 ABSTRACT MEMORY ACCESS MODEL 67 ============================ 68 69 Consider the following abstract model of the system: 70 71 : : 72 : : 73 : : 74 +-------+ : +--------+ : +-------+ 75 | | : | | : | | 76 | | : | | : | | 77 | CPU 1 |<----->| Memory |<----->| CPU 2 | 78 | | : | | : | | 79 | | : | | : | | 80 +-------+ : +--------+ : +-------+ 81 ^ : ^ : ^ 82 | : | : | 83 | : | : | 84 | : v : | 85 | : +--------+ : | 86 | : | | : | 87 | : | | : | 88 +---------->| Device |<----------+ 89 : | | : 90 : | | : 91 : +--------+ : 92 : : 93 94 Each CPU executes a program that generates memory access operations. In the 95 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually 96 perform the memory operations in any order it likes, provided program causality 97 appears to be maintained. Similarly, the compiler may also arrange the 98 instructions it emits in any order it likes, provided it doesn't affect the 99 apparent operation of the program. 100 101 So in the above diagram, the effects of the memory operations performed by a 102 CPU are perceived by the rest of the system as the operations cross the 103 interface between the CPU and rest of the system (the dotted lines). 104 105 106 For example, consider the following sequence of events: 107 108 CPU 1 CPU 2 109 =============== =============== 110 { A == 1; B == 2 } 111 A = 3; x = A; 112 B = 4; y = B; 113 114 The set of accesses as seen by the memory system in the middle can be arranged 115 in 24 different combinations: 116 117 STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4 118 STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3 119 STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4 120 STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4 121 STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3 122 STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4 123 STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4 124 STORE B=4, ... 125 ... 126 127 and can thus result in four different combinations of values: 128 129 x == 1, y == 2 130 x == 1, y == 4 131 x == 3, y == 2 132 x == 3, y == 4 133 134 135 Furthermore, the stores committed by a CPU to the memory system may not be 136 perceived by the loads made by another CPU in the same order as the stores were 137 committed. 138 139 140 As a further example, consider this sequence of events: 141 142 CPU 1 CPU 2 143 =============== =============== 144 { A == 1, B == 2, C = 3, P == &A, Q == &C } 145 B = 4; Q = P; 146 P = &B D = *Q; 147 148 There is an obvious data dependency here, as the value loaded into D depends on 149 the address retrieved from P by CPU 2. At the end of the sequence, any of the 150 following results are possible: 151 152 (Q == &A) and (D == 1) 153 (Q == &B) and (D == 2) 154 (Q == &B) and (D == 4) 155 156 Note that CPU 2 will never try and load C into D because the CPU will load P 157 into Q before issuing the load of *Q. 158 159 160 DEVICE OPERATIONS 161 ----------------- 162 163 Some devices present their control interfaces as collections of memory 164 locations, but the order in which the control registers are accessed is very 165 important. For instance, imagine an ethernet card with a set of internal 166 registers that are accessed through an address port register (A) and a data 167 port register (D). To read internal register 5, the following code might then 168 be used: 169 170 *A = 5; 171 x = *D; 172 173 but this might show up as either of the following two sequences: 174 175 STORE *A = 5, x = LOAD *D 176 x = LOAD *D, STORE *A = 5 177 178 the second of which will almost certainly result in a malfunction, since it set 179 the address _after_ attempting to read the register. 180 181 182 GUARANTEES 183 ---------- 184 185 There are some minimal guarantees that may be expected of a CPU: 186 187 (*) On any given CPU, dependent memory accesses will be issued in order, with 188 respect to itself. This means that for: 189 190 Q = P; D = *Q; 191 192 the CPU will issue the following memory operations: 193 194 Q = LOAD P, D = LOAD *Q 195 196 and always in that order. 197 198 (*) Overlapping loads and stores within a particular CPU will appear to be 199 ordered within that CPU. This means that for: 200 201 a = *X; *X = b; 202 203 the CPU will only issue the following sequence of memory operations: 204 205 a = LOAD *X, STORE *X = b 206 207 And for: 208 209 *X = c; d = *X; 210 211 the CPU will only issue: 212 213 STORE *X = c, d = LOAD *X 214 215 (Loads and stores overlap if they are targeted at overlapping pieces of 216 memory). 217 218 And there are a number of things that _must_ or _must_not_ be assumed: 219 220 (*) It _must_not_ be assumed that independent loads and stores will be issued 221 in the order given. This means that for: 222 223 X = *A; Y = *B; *D = Z; 224 225 we may get any of the following sequences: 226 227 X = LOAD *A, Y = LOAD *B, STORE *D = Z 228 X = LOAD *A, STORE *D = Z, Y = LOAD *B 229 Y = LOAD *B, X = LOAD *A, STORE *D = Z 230 Y = LOAD *B, STORE *D = Z, X = LOAD *A 231 STORE *D = Z, X = LOAD *A, Y = LOAD *B 232 STORE *D = Z, Y = LOAD *B, X = LOAD *A 233 234 (*) It _must_ be assumed that overlapping memory accesses may be merged or 235 discarded. This means that for: 236 237 X = *A; Y = *(A + 4); 238 239 we may get any one of the following sequences: 240 241 X = LOAD *A; Y = LOAD *(A + 4); 242 Y = LOAD *(A + 4); X = LOAD *A; 243 {X, Y} = LOAD {*A, *(A + 4) }; 244 245 And for: 246 247 *A = X; Y = *A; 248 249 we may get either of: 250 251 STORE *A = X; Y = LOAD *A; 252 STORE *A = Y = X; 253 254 255 ========================= 256 WHAT ARE MEMORY BARRIERS? 257 ========================= 258 259 As can be seen above, independent memory operations are effectively performed 260 in random order, but this can be a problem for CPU-CPU interaction and for I/O. 261 What is required is some way of intervening to instruct the compiler and the 262 CPU to restrict the order. 263 264 Memory barriers are such interventions. They impose a perceived partial 265 ordering over the memory operations on either side of the barrier. 266 267 Such enforcement is important because the CPUs and other devices in a system 268 can use a variety of tricks to improve performance, including reordering, 269 deferral and combination of memory operations; speculative loads; speculative 270 branch prediction and various types of caching. Memory barriers are used to 271 override or suppress these tricks, allowing the code to sanely control the 272 interaction of multiple CPUs and/or devices. 273 274 275 VARIETIES OF MEMORY BARRIER 276 --------------------------- 277 278 Memory barriers come in four basic varieties: 279 280 (1) Write (or store) memory barriers. 281 282 A write memory barrier gives a guarantee that all the STORE operations 283 specified before the barrier will appear to happen before all the STORE 284 operations specified after the barrier with respect to the other 285 components of the system. 286 287 A write barrier is a partial ordering on stores only; it is not required 288 to have any effect on loads. 289 290 A CPU can be viewed as committing a sequence of store operations to the 291 memory system as time progresses. All stores before a write barrier will 292 occur in the sequence _before_ all the stores after the write barrier. 293 294 [!] Note that write barriers should normally be paired with read or data 295 dependency barriers; see the "SMP barrier pairing" subsection. 296 297 298 (2) Data dependency barriers. 299 300 A data dependency barrier is a weaker form of read barrier. In the case 301 where two loads are performed such that the second depends on the result 302 of the first (eg: the first load retrieves the address to which the second 303 load will be directed), a data dependency barrier would be required to 304 make sure that the target of the second load is updated before the address 305 obtained by the first load is accessed. 306 307 A data dependency barrier is a partial ordering on interdependent loads 308 only; it is not required to have any effect on stores, independent loads 309 or overlapping loads. 310 311 As mentioned in (1), the other CPUs in the system can be viewed as 312 committing sequences of stores to the memory system that the CPU being 313 considered can then perceive. A data dependency barrier issued by the CPU 314 under consideration guarantees that for any load preceding it, if that 315 load touches one of a sequence of stores from another CPU, then by the 316 time the barrier completes, the effects of all the stores prior to that 317 touched by the load will be perceptible to any loads issued after the data 318 dependency barrier. 319 320 See the "Examples of memory barrier sequences" subsection for diagrams 321 showing the ordering constraints. 322 323 [!] Note that the first load really has to have a _data_ dependency and 324 not a control dependency. If the address for the second load is dependent 325 on the first load, but the dependency is through a conditional rather than 326 actually loading the address itself, then it's a _control_ dependency and 327 a full read barrier or better is required. See the "Control dependencies" 328 subsection for more information. 329 330 [!] Note that data dependency barriers should normally be paired with 331 write barriers; see the "SMP barrier pairing" subsection. 332 333 334 (3) Read (or load) memory barriers. 335 336 A read barrier is a data dependency barrier plus a guarantee that all the 337 LOAD operations specified before the barrier will appear to happen before 338 all the LOAD operations specified after the barrier with respect to the 339 other components of the system. 340 341 A read barrier is a partial ordering on loads only; it is not required to 342 have any effect on stores. 343 344 Read memory barriers imply data dependency barriers, and so can substitute 345 for them. 346 347 [!] Note that read barriers should normally be paired with write barriers; 348 see the "SMP barrier pairing" subsection. 349 350 351 (4) General memory barriers. 352 353 A general memory barrier gives a guarantee that all the LOAD and STORE 354 operations specified before the barrier will appear to happen before all 355 the LOAD and STORE operations specified after the barrier with respect to 356 the other components of the system. 357 358 A general memory barrier is a partial ordering over both loads and stores. 359 360 General memory barriers imply both read and write memory barriers, and so 361 can substitute for either. 362 363 364 And a couple of implicit varieties: 365 366 (5) LOCK operations. 367 368 This acts as a one-way permeable barrier. It guarantees that all memory 369 operations after the LOCK operation will appear to happen after the LOCK 370 operation with respect to the other components of the system. 371 372 Memory operations that occur before a LOCK operation may appear to happen 373 after it completes. 374 375 A LOCK operation should almost always be paired with an UNLOCK operation. 376 377 378 (6) UNLOCK operations. 379 380 This also acts as a one-way permeable barrier. It guarantees that all 381 memory operations before the UNLOCK operation will appear to happen before 382 the UNLOCK operation with respect to the other components of the system. 383 384 Memory operations that occur after an UNLOCK operation may appear to 385 happen before it completes. 386 387 LOCK and UNLOCK operations are guaranteed to appear with respect to each 388 other strictly in the order specified. 389 390 The use of LOCK and UNLOCK operations generally precludes the need for 391 other sorts of memory barrier (but note the exceptions mentioned in the 392 subsection "MMIO write barrier"). 393 394 395 Memory barriers are only required where there's a possibility of interaction 396 between two CPUs or between a CPU and a device. If it can be guaranteed that 397 there won't be any such interaction in any particular piece of code, then 398 memory barriers are unnecessary in that piece of code. 399 400 401 Note that these are the _minimum_ guarantees. Different architectures may give 402 more substantial guarantees, but they may _not_ be relied upon outside of arch 403 specific code. 404 405 406 WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS? 407 ---------------------------------------------- 408 409 There are certain things that the Linux kernel memory barriers do not guarantee: 410 411 (*) There is no guarantee that any of the memory accesses specified before a 412 memory barrier will be _complete_ by the completion of a memory barrier 413 instruction; the barrier can be considered to draw a line in that CPU's 414 access queue that accesses of the appropriate type may not cross. 415 416 (*) There is no guarantee that issuing a memory barrier on one CPU will have 417 any direct effect on another CPU or any other hardware in the system. The 418 indirect effect will be the order in which the second CPU sees the effects 419 of the first CPU's accesses occur, but see the next point: 420 421 (*) There is no guarantee that a CPU will see the correct order of effects 422 from a second CPU's accesses, even _if_ the second CPU uses a memory 423 barrier, unless the first CPU _also_ uses a matching memory barrier (see 424 the subsection on "SMP Barrier Pairing"). 425 426 (*) There is no guarantee that some intervening piece of off-the-CPU 427 hardware[*] will not reorder the memory accesses. CPU cache coherency 428 mechanisms should propagate the indirect effects of a memory barrier 429 between CPUs, but might not do so in order. 430 431 [*] For information on bus mastering DMA and coherency please read: 432 433 Documentation/pci.txt 434 Documentation/DMA-mapping.txt 435 Documentation/DMA-API.txt 436 437 438 DATA DEPENDENCY BARRIERS 439 ------------------------ 440 441 The usage requirements of data dependency barriers are a little subtle, and 442 it's not always obvious that they're needed. To illustrate, consider the 443 following sequence of events: 444 445 CPU 1 CPU 2 446 =============== =============== 447 { A == 1, B == 2, C = 3, P == &A, Q == &C } 448 B = 4; 449 <write barrier> 450 P = &B 451 Q = P; 452 D = *Q; 453 454 There's a clear data dependency here, and it would seem that by the end of the 455 sequence, Q must be either &A or &B, and that: 456 457 (Q == &A) implies (D == 1) 458 (Q == &B) implies (D == 4) 459 460 But! CPU 2's perception of P may be updated _before_ its perception of B, thus 461 leading to the following situation: 462 463 (Q == &B) and (D == 2) ???? 464 465 Whilst this may seem like a failure of coherency or causality maintenance, it 466 isn't, and this behaviour can be observed on certain real CPUs (such as the DEC 467 Alpha). 468 469 To deal with this, a data dependency barrier or better must be inserted 470 between the address load and the data load: 471 472 CPU 1 CPU 2 473 =============== =============== 474 { A == 1, B == 2, C = 3, P == &A, Q == &C } 475 B = 4; 476 <write barrier> 477 P = &B 478 Q = P; 479 <data dependency barrier> 480 D = *Q; 481 482 This enforces the occurrence of one of the two implications, and prevents the 483 third possibility from arising. 484 485 [!] Note that this extremely counterintuitive situation arises most easily on 486 machines with split caches, so that, for example, one cache bank processes 487 even-numbered cache lines and the other bank processes odd-numbered cache 488 lines. The pointer P might be stored in an odd-numbered cache line, and the 489 variable B might be stored in an even-numbered cache line. Then, if the 490 even-numbered bank of the reading CPU's cache is extremely busy while the 491 odd-numbered bank is idle, one can see the new value of the pointer P (&B), 492 but the old value of the variable B (2). 493 494 495 Another example of where data dependency barriers might by required is where a 496 number is read from memory and then used to calculate the index for an array 497 access: 498 499 CPU 1 CPU 2 500 =============== =============== 501 { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 } 502 M[1] = 4; 503 <write barrier> 504 P = 1 505 Q = P; 506 <data dependency barrier> 507 D = M[Q]; 508 509 510 The data dependency barrier is very important to the RCU system, for example. 511 See rcu_dereference() in include/linux/rcupdate.h. This permits the current 512 target of an RCU'd pointer to be replaced with a new modified target, without 513 the replacement target appearing to be incompletely initialised. 514 515 See also the subsection on "Cache Coherency" for a more thorough example. 516 517 518 CONTROL DEPENDENCIES 519 -------------------- 520 521 A control dependency requires a full read memory barrier, not simply a data 522 dependency barrier to make it work correctly. Consider the following bit of 523 code: 524 525 q = &a; 526 if (p) 527 q = &b; 528 <data dependency barrier> 529 x = *q; 530 531 This will not have the desired effect because there is no actual data 532 dependency, but rather a control dependency that the CPU may short-circuit by 533 attempting to predict the outcome in advance. In such a case what's actually 534 required is: 535 536 q = &a; 537 if (p) 538 q = &b; 539 <read barrier> 540 x = *q; 541 542 543 SMP BARRIER PAIRING 544 ------------------- 545 546 When dealing with CPU-CPU interactions, certain types of memory barrier should 547 always be paired. A lack of appropriate pairing is almost certainly an error. 548 549 A write barrier should always be paired with a data dependency barrier or read 550 barrier, though a general barrier would also be viable. Similarly a read 551 barrier or a data dependency barrier should always be paired with at least an 552 write barrier, though, again, a general barrier is viable: 553 554 CPU 1 CPU 2 555 =============== =============== 556 a = 1; 557 <write barrier> 558 b = 2; x = b; 559 <read barrier> 560 y = a; 561 562 Or: 563 564 CPU 1 CPU 2 565 =============== =============================== 566 a = 1; 567 <write barrier> 568 b = &a; x = b; 569 <data dependency barrier> 570 y = *x; 571 572 Basically, the read barrier always has to be there, even though it can be of 573 the "weaker" type. 574 575 [!] Note that the stores before the write barrier would normally be expected to 576 match the loads after the read barrier or the data dependency barrier, and vice 577 versa: 578 579 CPU 1 CPU 2 580 =============== =============== 581 a = 1; }---- --->{ v = c 582 b = 2; } \ / { w = d 583 <write barrier> \ <read barrier> 584 c = 3; } / \ { x = a; 585 d = 4; }---- --->{ y = b; 586 587 588 EXAMPLES OF MEMORY BARRIER SEQUENCES 589 ------------------------------------ 590 591 Firstly, write barriers act as partial orderings on store operations. 592 Consider the following sequence of events: 593 594 CPU 1 595 ======================= 596 STORE A = 1 597 STORE B = 2 598 STORE C = 3 599 <write barrier> 600 STORE D = 4 601 STORE E = 5 602 603 This sequence of events is committed to the memory coherence system in an order 604 that the rest of the system might perceive as the unordered set of { STORE A, 605 STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E 606 }: 607 608 +-------+ : : 609 | | +------+ 610 | |------>| C=3 | } /\ 611 | | : +------+ }----- \ -----> Events perceptible to 612 | | : | A=1 | } \/ the rest of the system 613 | | : +------+ } 614 | CPU 1 | : | B=2 | } 615 | | +------+ } 616 | | wwwwwwwwwwwwwwww } <--- At this point the write barrier 617 | | +------+ } requires all stores prior to the 618 | | : | E=5 | } barrier to be committed before 619 | | : +------+ } further stores may take place 620 | |------>| D=4 | } 621 | | +------+ 622 +-------+ : : 623 | 624 | Sequence in which stores are committed to the 625 | memory system by CPU 1 626 V 627 628 629 Secondly, data dependency barriers act as partial orderings on data-dependent 630 loads. Consider the following sequence of events: 631 632 CPU 1 CPU 2 633 ======================= ======================= 634 { B = 7; X = 9; Y = 8; C = &Y } 635 STORE A = 1 636 STORE B = 2 637 <write barrier> 638 STORE C = &B LOAD X 639 STORE D = 4 LOAD C (gets &B) 640 LOAD *C (reads B) 641 642 Without intervention, CPU 2 may perceive the events on CPU 1 in some 643 effectively random order, despite the write barrier issued by CPU 1: 644 645 +-------+ : : : : 646 | | +------+ +-------+ | Sequence of update 647 | |------>| B=2 |----- --->| Y->8 | | of perception on 648 | | : +------+ \ +-------+ | CPU 2 649 | CPU 1 | : | A=1 | \ --->| C->&Y | V 650 | | +------+ | +-------+ 651 | | wwwwwwwwwwwwwwww | : : 652 | | +------+ | : : 653 | | : | C=&B |--- | : : +-------+ 654 | | : +------+ \ | +-------+ | | 655 | |------>| D=4 | ----------->| C->&B |------>| | 656 | | +------+ | +-------+ | | 657 +-------+ : : | : : | | 658 | : : | | 659 | : : | CPU 2 | 660 | +-------+ | | 661 Apparently incorrect ---> | | B->7 |------>| | 662 perception of B (!) | +-------+ | | 663 | : : | | 664 | +-------+ | | 665 The load of X holds ---> \ | X->9 |------>| | 666 up the maintenance \ +-------+ | | 667 of coherence of B ----->| B->2 | +-------+ 668 +-------+ 669 : : 670 671 672 In the above example, CPU 2 perceives that B is 7, despite the load of *C 673 (which would be B) coming after the LOAD of C. 674 675 If, however, a data dependency barrier were to be placed between the load of C 676 and the load of *C (ie: B) on CPU 2: 677 678 CPU 1 CPU 2 679 ======================= ======================= 680 { B = 7; X = 9; Y = 8; C = &Y } 681 STORE A = 1 682 STORE B = 2 683 <write barrier> 684 STORE C = &B LOAD X 685 STORE D = 4 LOAD C (gets &B) 686 <data dependency barrier> 687 LOAD *C (reads B) 688 689 then the following will occur: 690 691 +-------+ : : : : 692 | | +------+ +-------+ 693 | |------>| B=2 |----- --->| Y->8 | 694 | | : +------+ \ +-------+ 695 | CPU 1 | : | A=1 | \ --->| C->&Y | 696 | | +------+ | +-------+ 697 | | wwwwwwwwwwwwwwww | : : 698 | | +------+ | : : 699 | | : | C=&B |--- | : : +-------+ 700 | | : +------+ \ | +-------+ | | 701 | |------>| D=4 | ----------->| C->&B |------>| | 702 | | +------+ | +-------+ | | 703 +-------+ : : | : : | | 704 | : : | | 705 | : : | CPU 2 | 706 | +-------+ | | 707 | | X->9 |------>| | 708 | +-------+ | | 709 Makes sure all effects ---> \ ddddddddddddddddd | | 710 prior to the store of C \ +-------+ | | 711 are perceptible to ----->| B->2 |------>| | 712 subsequent loads +-------+ | | 713 : : +-------+ 714 715 716 And thirdly, a read barrier acts as a partial order on loads. Consider the 717 following sequence of events: 718 719 CPU 1 CPU 2 720 ======================= ======================= 721 { A = 0, B = 9 } 722 STORE A=1 723 <write barrier> 724 STORE B=2 725 LOAD B 726 LOAD A 727 728 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in 729 some effectively random order, despite the write barrier issued by CPU 1: 730 731 +-------+ : : : : 732 | | +------+ +-------+ 733 | |------>| A=1 |------ --->| A->0 | 734 | | +------+ \ +-------+ 735 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 736 | | +------+ | +-------+ 737 | |------>| B=2 |--- | : : 738 | | +------+ \ | : : +-------+ 739 +-------+ : : \ | +-------+ | | 740 ---------->| B->2 |------>| | 741 | +-------+ | CPU 2 | 742 | | A->0 |------>| | 743 | +-------+ | | 744 | : : +-------+ 745 \ : : 746 \ +-------+ 747 ---->| A->1 | 748 +-------+ 749 : : 750 751 752 If, however, a read barrier were to be placed between the load of B and the 753 load of A on CPU 2: 754 755 CPU 1 CPU 2 756 ======================= ======================= 757 { A = 0, B = 9 } 758 STORE A=1 759 <write barrier> 760 STORE B=2 761 LOAD B 762 <read barrier> 763 LOAD A 764 765 then the partial ordering imposed by CPU 1 will be perceived correctly by CPU 766 2: 767 768 +-------+ : : : : 769 | | +------+ +-------+ 770 | |------>| A=1 |------ --->| A->0 | 771 | | +------+ \ +-------+ 772 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 773 | | +------+ | +-------+ 774 | |------>| B=2 |--- | : : 775 | | +------+ \ | : : +-------+ 776 +-------+ : : \ | +-------+ | | 777 ---------->| B->2 |------>| | 778 | +-------+ | CPU 2 | 779 | : : | | 780 | : : | | 781 At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 782 barrier causes all effects \ +-------+ | | 783 prior to the storage of B ---->| A->1 |------>| | 784 to be perceptible to CPU 2 +-------+ | | 785 : : +-------+ 786 787 788 To illustrate this more completely, consider what could happen if the code 789 contained a load of A either side of the read barrier: 790 791 CPU 1 CPU 2 792 ======================= ======================= 793 { A = 0, B = 9 } 794 STORE A=1 795 <write barrier> 796 STORE B=2 797 LOAD B 798 LOAD A [first load of A] 799 <read barrier> 800 LOAD A [second load of A] 801 802 Even though the two loads of A both occur after the load of B, they may both 803 come up with different values: 804 805 +-------+ : : : : 806 | | +------+ +-------+ 807 | |------>| A=1 |------ --->| A->0 | 808 | | +------+ \ +-------+ 809 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 810 | | +------+ | +-------+ 811 | |------>| B=2 |--- | : : 812 | | +------+ \ | : : +-------+ 813 +-------+ : : \ | +-------+ | | 814 ---------->| B->2 |------>| | 815 | +-------+ | CPU 2 | 816 | : : | | 817 | : : | | 818 | +-------+ | | 819 | | A->0 |------>| 1st | 820 | +-------+ | | 821 At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 822 barrier causes all effects \ +-------+ | | 823 prior to the storage of B ---->| A->1 |------>| 2nd | 824 to be perceptible to CPU 2 +-------+ | | 825 : : +-------+ 826 827 828 But it may be that the update to A from CPU 1 becomes perceptible to CPU 2 829 before the read barrier completes anyway: 830 831 +-------+ : : : : 832 | | +------+ +-------+ 833 | |------>| A=1 |------ --->| A->0 | 834 | | +------+ \ +-------+ 835 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 836 | | +------+ | +-------+ 837 | |------>| B=2 |--- | : : 838 | | +------+ \ | : : +-------+ 839 +-------+ : : \ | +-------+ | | 840 ---------->| B->2 |------>| | 841 | +-------+ | CPU 2 | 842 | : : | | 843 \ : : | | 844 \ +-------+ | | 845 ---->| A->1 |------>| 1st | 846 +-------+ | | 847 rrrrrrrrrrrrrrrrr | | 848 +-------+ | | 849 | A->1 |------>| 2nd | 850 +-------+ | | 851 : : +-------+ 852 853 854 The guarantee is that the second load will always come up with A == 1 if the 855 load of B came up with B == 2. No such guarantee exists for the first load of 856 A; that may come up with either A == 0 or A == 1. 857 858 859 READ MEMORY BARRIERS VS LOAD SPECULATION 860 ---------------------------------------- 861 862 Many CPUs speculate with loads: that is they see that they will need to load an 863 item from memory, and they find a time where they're not using the bus for any 864 other loads, and so do the load in advance - even though they haven't actually 865 got to that point in the instruction execution flow yet. This permits the 866 actual load instruction to potentially complete immediately because the CPU 867 already has the value to hand. 868 869 It may turn out that the CPU didn't actually need the value - perhaps because a 870 branch circumvented the load - in which case it can discard the value or just 871 cache it for later use. 872 873 Consider: 874 875 CPU 1 CPU 2 876 ======================= ======================= 877 LOAD B 878 DIVIDE } Divide instructions generally 879 DIVIDE } take a long time to perform 880 LOAD A 881 882 Which might appear as this: 883 884 : : +-------+ 885 +-------+ | | 886 --->| B->2 |------>| | 887 +-------+ | CPU 2 | 888 : :DIVIDE | | 889 +-------+ | | 890 The CPU being busy doing a ---> --->| A->0 |~~~~ | | 891 division speculates on the +-------+ ~ | | 892 LOAD of A : : ~ | | 893 : :DIVIDE | | 894 : : ~ | | 895 Once the divisions are complete --> : : ~-->| | 896 the CPU can then perform the : : | | 897 LOAD with immediate effect : : +-------+ 898 899 900 Placing a read barrier or a data dependency barrier just before the second 901 load: 902 903 CPU 1 CPU 2 904 ======================= ======================= 905 LOAD B 906 DIVIDE 907 DIVIDE 908 <read barrier> 909 LOAD A 910 911 will force any value speculatively obtained to be reconsidered to an extent 912 dependent on the type of barrier used. If there was no change made to the 913 speculated memory location, then the speculated value will just be used: 914 915 : : +-------+ 916 +-------+ | | 917 --->| B->2 |------>| | 918 +-------+ | CPU 2 | 919 : :DIVIDE | | 920 +-------+ | | 921 The CPU being busy doing a ---> --->| A->0 |~~~~ | | 922 division speculates on the +-------+ ~ | | 923 LOAD of A : : ~ | | 924 : :DIVIDE | | 925 : : ~ | | 926 : : ~ | | 927 rrrrrrrrrrrrrrrr~ | | 928 : : ~ | | 929 : : ~-->| | 930 : : | | 931 : : +-------+ 932 933 934 but if there was an update or an invalidation from another CPU pending, then 935 the speculation will be cancelled and the value reloaded: 936 937 : : +-------+ 938 +-------+ | | 939 --->| B->2 |------>| | 940 +-------+ | CPU 2 | 941 : :DIVIDE | | 942 +-------+ | | 943 The CPU being busy doing a ---> --->| A->0 |~~~~ | | 944 division speculates on the +-------+ ~ | | 945 LOAD of A : : ~ | | 946 : :DIVIDE | | 947 : : ~ | | 948 : : ~ | | 949 rrrrrrrrrrrrrrrrr | | 950 +-------+ | | 951 The speculation is discarded ---> --->| A->1 |------>| | 952 and an updated value is +-------+ | | 953 retrieved : : +-------+ 954 955 956 ======================== 957 EXPLICIT KERNEL BARRIERS 958 ======================== 959 960 The Linux kernel has a variety of different barriers that act at different 961 levels: 962 963 (*) Compiler barrier. 964 965 (*) CPU memory barriers. 966 967 (*) MMIO write barrier. 968 969 970 COMPILER BARRIER 971 ---------------- 972 973 The Linux kernel has an explicit compiler barrier function that prevents the 974 compiler from moving the memory accesses either side of it to the other side: 975 976 barrier(); 977 978 This is a general barrier - lesser varieties of compiler barrier do not exist. 979 980 The compiler barrier has no direct effect on the CPU, which may then reorder 981 things however it wishes. 982 983 984 CPU MEMORY BARRIERS 985 ------------------- 986 987 The Linux kernel has eight basic CPU memory barriers: 988 989 TYPE MANDATORY SMP CONDITIONAL 990 =============== ======================= =========================== 991 GENERAL mb() smp_mb() 992 WRITE wmb() smp_wmb() 993 READ rmb() smp_rmb() 994 DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() 995 996 997 All CPU memory barriers unconditionally imply compiler barriers. 998 999 SMP memory barriers are reduced to compiler barriers on uniprocessor compiled 1000 systems because it is assumed that a CPU will appear to be self-consistent, 1001 and will order overlapping accesses correctly with respect to itself. 1002 1003 [!] Note that SMP memory barriers _must_ be used to control the ordering of 1004 references to shared memory on SMP systems, though the use of locking instead 1005 is sufficient. 1006 1007 Mandatory barriers should not be used to control SMP effects, since mandatory 1008 barriers unnecessarily impose overhead on UP systems. They may, however, be 1009 used to control MMIO effects on accesses through relaxed memory I/O windows. 1010 These are required even on non-SMP systems as they affect the order in which 1011 memory operations appear to a device by prohibiting both the compiler and the 1012 CPU from reordering them. 1013 1014 1015 There are some more advanced barrier functions: 1016 1017 (*) set_mb(var, value) 1018 1019 This assigns the value to the variable and then inserts a full memory 1020 barrier after it, depending on the function. It isn't guaranteed to 1021 insert anything more than a compiler barrier in a UP compilation. 1022 1023 1024 (*) smp_mb__before_atomic_dec(); 1025 (*) smp_mb__after_atomic_dec(); 1026 (*) smp_mb__before_atomic_inc(); 1027 (*) smp_mb__after_atomic_inc(); 1028 1029 These are for use with atomic add, subtract, increment and decrement 1030 functions that don't return a value, especially when used for reference 1031 counting. These functions do not imply memory barriers. 1032 1033 As an example, consider a piece of code that marks an object as being dead 1034 and then decrements the object's reference count: 1035 1036 obj->dead = 1; 1037 smp_mb__before_atomic_dec(); 1038 atomic_dec(&obj->ref_count); 1039 1040 This makes sure that the death mark on the object is perceived to be set 1041 *before* the reference counter is decremented. 1042 1043 See Documentation/atomic_ops.txt for more information. See the "Atomic 1044 operations" subsection for information on where to use these. 1045 1046 1047 (*) smp_mb__before_clear_bit(void); 1048 (*) smp_mb__after_clear_bit(void); 1049 1050 These are for use similar to the atomic inc/dec barriers. These are 1051 typically used for bitwise unlocking operations, so care must be taken as 1052 there are no implicit memory barriers here either. 1053 1054 Consider implementing an unlock operation of some nature by clearing a 1055 locking bit. The clear_bit() would then need to be barriered like this: 1056 1057 smp_mb__before_clear_bit(); 1058 clear_bit( ... ); 1059 1060 This prevents memory operations before the clear leaking to after it. See 1061 the subsection on "Locking Functions" with reference to UNLOCK operation 1062 implications. 1063 1064 See Documentation/atomic_ops.txt for more information. See the "Atomic 1065 operations" subsection for information on where to use these. 1066 1067 1068 MMIO WRITE BARRIER 1069 ------------------ 1070 1071 The Linux kernel also has a special barrier for use with memory-mapped I/O 1072 writes: 1073 1074 mmiowb(); 1075 1076 This is a variation on the mandatory write barrier that causes writes to weakly 1077 ordered I/O regions to be partially ordered. Its effects may go beyond the 1078 CPU->Hardware interface and actually affect the hardware at some level. 1079 1080 See the subsection "Locks vs I/O accesses" for more information. 1081 1082 1083 =============================== 1084 IMPLICIT KERNEL MEMORY BARRIERS 1085 =============================== 1086 1087 Some of the other functions in the linux kernel imply memory barriers, amongst 1088 which are locking and scheduling functions. 1089 1090 This specification is a _minimum_ guarantee; any particular architecture may 1091 provide more substantial guarantees, but these may not be relied upon outside 1092 of arch specific code. 1093 1094 1095 LOCKING FUNCTIONS 1096 ----------------- 1097 1098 The Linux kernel has a number of locking constructs: 1099 1100 (*) spin locks 1101 (*) R/W spin locks 1102 (*) mutexes 1103 (*) semaphores 1104 (*) R/W semaphores 1105 (*) RCU 1106 1107 In all cases there are variants on "LOCK" operations and "UNLOCK" operations 1108 for each construct. These operations all imply certain barriers: 1109 1110 (1) LOCK operation implication: 1111 1112 Memory operations issued after the LOCK will be completed after the LOCK 1113 operation has completed. 1114 1115 Memory operations issued before the LOCK may be completed after the LOCK 1116 operation has completed. 1117 1118 (2) UNLOCK operation implication: 1119 1120 Memory operations issued before the UNLOCK will be completed before the 1121 UNLOCK operation has completed. 1122 1123 Memory operations issued after the UNLOCK may be completed before the 1124 UNLOCK operation has completed. 1125 1126 (3) LOCK vs LOCK implication: 1127 1128 All LOCK operations issued before another LOCK operation will be completed 1129 before that LOCK operation. 1130 1131 (4) LOCK vs UNLOCK implication: 1132 1133 All LOCK operations issued before an UNLOCK operation will be completed 1134 before the UNLOCK operation. 1135 1136 All UNLOCK operations issued before a LOCK operation will be completed 1137 before the LOCK operation. 1138 1139 (5) Failed conditional LOCK implication: 1140 1141 Certain variants of the LOCK operation may fail, either due to being 1142 unable to get the lock immediately, or due to receiving an unblocked 1143 signal whilst asleep waiting for the lock to become available. Failed 1144 locks do not imply any sort of barrier. 1145 1146 Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is 1147 equivalent to a full barrier, but a LOCK followed by an UNLOCK is not. 1148 1149 [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way 1150 barriers is that the effects of instructions outside of a critical section 1151 may seep into the inside of the critical section. 1152 1153 A LOCK followed by an UNLOCK may not be assumed to be full memory barrier 1154 because it is possible for an access preceding the LOCK to happen after the 1155 LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the 1156 two accesses can themselves then cross: 1157 1158 *A = a; 1159 LOCK 1160 UNLOCK 1161 *B = b; 1162 1163 may occur as: 1164 1165 LOCK, STORE *B, STORE *A, UNLOCK 1166 1167 Locks and semaphores may not provide any guarantee of ordering on UP compiled 1168 systems, and so cannot be counted on in such a situation to actually achieve 1169 anything at all - especially with respect to I/O accesses - unless combined 1170 with interrupt disabling operations. 1171 1172 See also the section on "Inter-CPU locking barrier effects". 1173 1174 1175 As an example, consider the following: 1176 1177 *A = a; 1178 *B = b; 1179 LOCK 1180 *C = c; 1181 *D = d; 1182 UNLOCK 1183 *E = e; 1184 *F = f; 1185 1186 The following sequence of events is acceptable: 1187 1188 LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK 1189 1190 [+] Note that {*F,*A} indicates a combined access. 1191 1192 But none of the following are: 1193 1194 {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E 1195 *A, *B, *C, LOCK, *D, UNLOCK, *E, *F 1196 *A, *B, LOCK, *C, UNLOCK, *D, *E, *F 1197 *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E 1198 1199 1200 1201 INTERRUPT DISABLING FUNCTIONS 1202 ----------------------------- 1203 1204 Functions that disable interrupts (LOCK equivalent) and enable interrupts 1205 (UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O 1206 barriers are required in such a situation, they must be provided from some 1207 other means. 1208 1209 1210 MISCELLANEOUS FUNCTIONS 1211 ----------------------- 1212 1213 Other functions that imply barriers: 1214 1215 (*) schedule() and similar imply full memory barriers. 1216 1217 1218 ================================= 1219 INTER-CPU LOCKING BARRIER EFFECTS 1220 ================================= 1221 1222 On SMP systems locking primitives give a more substantial form of barrier: one 1223 that does affect memory access ordering on other CPUs, within the context of 1224 conflict on any particular lock. 1225 1226 1227 LOCKS VS MEMORY ACCESSES 1228 ------------------------ 1229 1230 Consider the following: the system has a pair of spinlocks (M) and (Q), and 1231 three CPUs; then should the following sequence of events occur: 1232 1233 CPU 1 CPU 2 1234 =============================== =============================== 1235 *A = a; *E = e; 1236 LOCK M LOCK Q 1237 *B = b; *F = f; 1238 *C = c; *G = g; 1239 UNLOCK M UNLOCK Q 1240 *D = d; *H = h; 1241 1242 Then there is no guarantee as to what order CPU 3 will see the accesses to *A 1243 through *H occur in, other than the constraints imposed by the separate locks 1244 on the separate CPUs. It might, for example, see: 1245 1246 *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M 1247 1248 But it won't see any of: 1249 1250 *B, *C or *D preceding LOCK M 1251 *A, *B or *C following UNLOCK M 1252 *F, *G or *H preceding LOCK Q 1253 *E, *F or *G following UNLOCK Q 1254 1255 1256 However, if the following occurs: 1257 1258 CPU 1 CPU 2 1259 =============================== =============================== 1260 *A = a; 1261 LOCK M [1] 1262 *B = b; 1263 *C = c; 1264 UNLOCK M [1] 1265 *D = d; *E = e; 1266 LOCK M [2] 1267 *F = f; 1268 *G = g; 1269 UNLOCK M [2] 1270 *H = h; 1271 1272 CPU 3 might see: 1273 1274 *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], 1275 LOCK M [2], *H, *F, *G, UNLOCK M [2], *D 1276 1277 But assuming CPU 1 gets the lock first, CPU 3 won't see any of: 1278 1279 *B, *C, *D, *F, *G or *H preceding LOCK M [1] 1280 *A, *B or *C following UNLOCK M [1] 1281 *F, *G or *H preceding LOCK M [2] 1282 *A, *B, *C, *E, *F or *G following UNLOCK M [2] 1283 1284 1285 LOCKS VS I/O ACCESSES 1286 --------------------- 1287 1288 Under certain circumstances (especially involving NUMA), I/O accesses within 1289 two spinlocked sections on two different CPUs may be seen as interleaved by the 1290 PCI bridge, because the PCI bridge does not necessarily participate in the 1291 cache-coherence protocol, and is therefore incapable of issuing the required 1292 read memory barriers. 1293 1294 For example: 1295 1296 CPU 1 CPU 2 1297 =============================== =============================== 1298 spin_lock(Q) 1299 writel(0, ADDR) 1300 writel(1, DATA); 1301 spin_unlock(Q); 1302 spin_lock(Q); 1303 writel(4, ADDR); 1304 writel(5, DATA); 1305 spin_unlock(Q); 1306 1307 may be seen by the PCI bridge as follows: 1308 1309 STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5 1310 1311 which would probably cause the hardware to malfunction. 1312 1313 1314 What is necessary here is to intervene with an mmiowb() before dropping the 1315 spinlock, for example: 1316 1317 CPU 1 CPU 2 1318 =============================== =============================== 1319 spin_lock(Q) 1320 writel(0, ADDR) 1321 writel(1, DATA); 1322 mmiowb(); 1323 spin_unlock(Q); 1324 spin_lock(Q); 1325 writel(4, ADDR); 1326 writel(5, DATA); 1327 mmiowb(); 1328 spin_unlock(Q); 1329 1330 this will ensure that the two stores issued on CPU 1 appear at the PCI bridge 1331 before either of the stores issued on CPU 2. 1332 1333 1334 Furthermore, following a store by a load from the same device obviates the need 1335 for the mmiowb(), because the load forces the store to complete before the load 1336 is performed: 1337 1338 CPU 1 CPU 2 1339 =============================== =============================== 1340 spin_lock(Q) 1341 writel(0, ADDR) 1342 a = readl(DATA); 1343 spin_unlock(Q); 1344 spin_lock(Q); 1345 writel(4, ADDR); 1346 b = readl(DATA); 1347 spin_unlock(Q); 1348 1349 1350 See Documentation/DocBook/deviceiobook.tmpl for more information. 1351 1352 1353 ================================= 1354 WHERE ARE MEMORY BARRIERS NEEDED? 1355 ================================= 1356 1357 Under normal operation, memory operation reordering is generally not going to 1358 be a problem as a single-threaded linear piece of code will still appear to 1359 work correctly, even if it's in an SMP kernel. There are, however, three 1360 circumstances in which reordering definitely _could_ be a problem: 1361 1362 (*) Interprocessor interaction. 1363 1364 (*) Atomic operations. 1365 1366 (*) Accessing devices. 1367 1368 (*) Interrupts. 1369 1370 1371 INTERPROCESSOR INTERACTION 1372 -------------------------- 1373 1374 When there's a system with more than one processor, more than one CPU in the 1375 system may be workin