Based on kernel version 2.6.25. Page generated on 2008-04-18 21:22 EST.
1 <?xml version="1.0" encoding="UTF-8"?> 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" 3 "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> 4 5 <book id="LKLockingGuide"> 6 <bookinfo> 7 <title>Unreliable Guide To Locking</title> 8 9 <authorgroup> 10 <author> 11 <firstname>Rusty</firstname> 12 <surname>Russell</surname> 13 <affiliation> 14 <address> 15 <email>rusty[AT]rustcorp.com[DOT]au</email> 16 </address> 17 </affiliation> 18 </author> 19 </authorgroup> 20 21 <copyright> 22 <year>2003</year> 23 <holder>Rusty Russell</holder> 24 </copyright> 25 26 <legalnotice> 27 <para> 28 This documentation is free software; you can redistribute 29 it and/or modify it under the terms of the GNU General Public 30 License as published by the Free Software Foundation; either 31 version 2 of the License, or (at your option) any later 32 version. 33 </para> 34 35 <para> 36 This program is distributed in the hope that it will be 37 useful, but WITHOUT ANY WARRANTY; without even the implied 38 warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 39 See the GNU General Public License for more details. 40 </para> 41 42 <para> 43 You should have received a copy of the GNU General Public 44 License along with this program; if not, write to the Free 45 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, 46 MA 02111-1307 USA 47 </para> 48 49 <para> 50 For more details see the file COPYING in the source 51 distribution of Linux. 52 </para> 53 </legalnotice> 54 </bookinfo> 55 56 <toc></toc> 57 <chapter id="intro"> 58 <title>Introduction</title> 59 <para> 60 Welcome, to Rusty's Remarkably Unreliable Guide to Kernel 61 Locking issues. This document describes the locking systems in 62 the Linux Kernel in 2.6. 63 </para> 64 <para> 65 With the wide availability of HyperThreading, and <firstterm 66 linkend="gloss-preemption">preemption </firstterm> in the Linux 67 Kernel, everyone hacking on the kernel needs to know the 68 fundamentals of concurrency and locking for 69 <firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>. 70 </para> 71 </chapter> 72 73 <chapter id="races"> 74 <title>The Problem With Concurrency</title> 75 <para> 76 (Skip this if you know what a Race Condition is). 77 </para> 78 <para> 79 In a normal program, you can increment a counter like so: 80 </para> 81 <programlisting> 82 very_important_count++; 83 </programlisting> 84 85 <para> 86 This is what they would expect to happen: 87 </para> 88 89 <table> 90 <title>Expected Results</title> 91 92 <tgroup cols="2" align="left"> 93 94 <thead> 95 <row> 96 <entry>Instance 1</entry> 97 <entry>Instance 2</entry> 98 </row> 99 </thead> 100 101 <tbody> 102 <row> 103 <entry>read very_important_count (5)</entry> 104 <entry></entry> 105 </row> 106 <row> 107 <entry>add 1 (6)</entry> 108 <entry></entry> 109 </row> 110 <row> 111 <entry>write very_important_count (6)</entry> 112 <entry></entry> 113 </row> 114 <row> 115 <entry></entry> 116 <entry>read very_important_count (6)</entry> 117 </row> 118 <row> 119 <entry></entry> 120 <entry>add 1 (7)</entry> 121 </row> 122 <row> 123 <entry></entry> 124 <entry>write very_important_count (7)</entry> 125 </row> 126 </tbody> 127 128 </tgroup> 129 </table> 130 131 <para> 132 This is what might happen: 133 </para> 134 135 <table> 136 <title>Possible Results</title> 137 138 <tgroup cols="2" align="left"> 139 <thead> 140 <row> 141 <entry>Instance 1</entry> 142 <entry>Instance 2</entry> 143 </row> 144 </thead> 145 146 <tbody> 147 <row> 148 <entry>read very_important_count (5)</entry> 149 <entry></entry> 150 </row> 151 <row> 152 <entry></entry> 153 <entry>read very_important_count (5)</entry> 154 </row> 155 <row> 156 <entry>add 1 (6)</entry> 157 <entry></entry> 158 </row> 159 <row> 160 <entry></entry> 161 <entry>add 1 (6)</entry> 162 </row> 163 <row> 164 <entry>write very_important_count (6)</entry> 165 <entry></entry> 166 </row> 167 <row> 168 <entry></entry> 169 <entry>write very_important_count (6)</entry> 170 </row> 171 </tbody> 172 </tgroup> 173 </table> 174 175 <sect1 id="race-condition"> 176 <title>Race Conditions and Critical Regions</title> 177 <para> 178 This overlap, where the result depends on the 179 relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>. 180 The piece of code containing the concurrency issue is called a 181 <firstterm>critical region</firstterm>. And especially since Linux starting running 182 on SMP machines, they became one of the major issues in kernel 183 design and implementation. 184 </para> 185 <para> 186 Preemption can have the same effect, even if there is only one 187 CPU: by preempting one task during the critical region, we have 188 exactly the same race condition. In this case the thread which 189 preempts might run the critical region itself. 190 </para> 191 <para> 192 The solution is to recognize when these simultaneous accesses 193 occur, and use locks to make sure that only one instance can 194 enter the critical region at any time. There are many 195 friendly primitives in the Linux kernel to help you do this. 196 And then there are the unfriendly primitives, but I'll pretend 197 they don't exist. 198 </para> 199 </sect1> 200 </chapter> 201 202 <chapter id="locks"> 203 <title>Locking in the Linux Kernel</title> 204 205 <para> 206 If I could give you one piece of advice: never sleep with anyone 207 crazier than yourself. But if I had to give you advice on 208 locking: <emphasis>keep it simple</emphasis>. 209 </para> 210 211 <para> 212 Be reluctant to introduce new locks. 213 </para> 214 215 <para> 216 Strangely enough, this last one is the exact reverse of my advice when 217 you <emphasis>have</emphasis> slept with someone crazier than yourself. 218 And you should think about getting a big dog. 219 </para> 220 221 <sect1 id="lock-intro"> 222 <title>Three Main Types of Kernel Locks: Spinlocks, Mutexes and Semaphores</title> 223 224 <para> 225 There are three main types of kernel locks. The fundamental type 226 is the spinlock 227 (<filename class="headerfile">include/asm/spinlock.h</filename>), 228 which is a very simple single-holder lock: if you can't get the 229 spinlock, you keep trying (spinning) until you can. Spinlocks are 230 very small and fast, and can be used anywhere. 231 </para> 232 <para> 233 The second type is a mutex 234 (<filename class="headerfile">include/linux/mutex.h</filename>): it 235 is like a spinlock, but you may block holding a mutex. 236 If you can't lock a mutex, your task will suspend itself, and be woken 237 up when the mutex is released. This means the CPU can do something 238 else while you are waiting. There are many cases when you simply 239 can't sleep (see <xref linkend="sleeping-things"/>), and so have to 240 use a spinlock instead. 241 </para> 242 <para> 243 The third type is a semaphore 244 (<filename class="headerfile">include/asm/semaphore.h</filename>): it 245 can have more than one holder at any time (the number decided at 246 initialization time), although it is most commonly used as a 247 single-holder lock (a mutex). If you can't get a semaphore, your 248 task will be suspended and later on woken up - just like for mutexes. 249 </para> 250 <para> 251 Neither type of lock is recursive: see 252 <xref linkend="deadlock"/>. 253 </para> 254 </sect1> 255 256 <sect1 id="uniprocessor"> 257 <title>Locks and Uniprocessor Kernels</title> 258 259 <para> 260 For kernels compiled without <symbol>CONFIG_SMP</symbol>, and 261 without <symbol>CONFIG_PREEMPT</symbol> spinlocks do not exist at 262 all. This is an excellent design decision: when no-one else can 263 run at the same time, there is no reason to have a lock. 264 </para> 265 266 <para> 267 If the kernel is compiled without <symbol>CONFIG_SMP</symbol>, 268 but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks 269 simply disable preemption, which is sufficient to prevent any 270 races. For most purposes, we can think of preemption as 271 equivalent to SMP, and not worry about it separately. 272 </para> 273 274 <para> 275 You should always test your locking code with <symbol>CONFIG_SMP</symbol> 276 and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it 277 will still catch some kinds of locking bugs. 278 </para> 279 280 <para> 281 Semaphores still exist, because they are required for 282 synchronization between <firstterm linkend="gloss-usercontext">user 283 contexts</firstterm>, as we will see below. 284 </para> 285 </sect1> 286 287 <sect1 id="usercontextlocking"> 288 <title>Locking Only In User Context</title> 289 290 <para> 291 If you have a data structure which is only ever accessed from 292 user context, then you can use a simple semaphore 293 (<filename>linux/asm/semaphore.h</filename>) to protect it. This 294 is the most trivial case: you initialize the semaphore to the number 295 of resources available (usually 1), and call 296 <function>down_interruptible()</function> to grab the semaphore, and 297 <function>up()</function> to release it. There is also a 298 <function>down()</function>, which should be avoided, because it 299 will not return if a signal is received. 300 </para> 301 302 <para> 303 Example: <filename>linux/net/core/netfilter.c</filename> allows 304 registration of new <function>setsockopt()</function> and 305 <function>getsockopt()</function> calls, with 306 <function>nf_register_sockopt()</function>. Registration and 307 de-registration are only done on module load and unload (and boot 308 time, where there is no concurrency), and the list of registrations 309 is only consulted for an unknown <function>setsockopt()</function> 310 or <function>getsockopt()</function> system call. The 311 <varname>nf_sockopt_mutex</varname> is perfect to protect this, 312 especially since the setsockopt and getsockopt calls may well 313 sleep. 314 </para> 315 </sect1> 316 317 <sect1 id="lock-user-bh"> 318 <title>Locking Between User Context and Softirqs</title> 319 320 <para> 321 If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares 322 data with user context, you have two problems. Firstly, the current 323 user context can be interrupted by a softirq, and secondly, the 324 critical region could be entered from another CPU. This is where 325 <function>spin_lock_bh()</function> 326 (<filename class="headerfile">include/linux/spinlock.h</filename>) is 327 used. It disables softirqs on that CPU, then grabs the lock. 328 <function>spin_unlock_bh()</function> does the reverse. (The 329 '_bh' suffix is a historical reference to "Bottom Halves", the 330 old name for software interrupts. It should really be 331 called spin_lock_softirq()' in a perfect world). 332 </para> 333 334 <para> 335 Note that you can also use <function>spin_lock_irq()</function> 336 or <function>spin_lock_irqsave()</function> here, which stop 337 hardware interrupts as well: see <xref linkend="hardirq-context"/>. 338 </para> 339 340 <para> 341 This works perfectly for <firstterm linkend="gloss-up"><acronym>UP 342 </acronym></firstterm> as well: the spin lock vanishes, and this macro 343 simply becomes <function>local_bh_disable()</function> 344 (<filename class="headerfile">include/linux/interrupt.h</filename>), which 345 protects you from the softirq being run. 346 </para> 347 </sect1> 348 349 <sect1 id="lock-user-tasklet"> 350 <title>Locking Between User Context and Tasklets</title> 351 352 <para> 353 This is exactly the same as above, because <firstterm 354 linkend="gloss-tasklet">tasklets</firstterm> are actually run 355 from a softirq. 356 </para> 357 </sect1> 358 359 <sect1 id="lock-user-timers"> 360 <title>Locking Between User Context and Timers</title> 361 362 <para> 363 This, too, is exactly the same as above, because <firstterm 364 linkend="gloss-timers">timers</firstterm> are actually run from 365 a softirq. From a locking point of view, tasklets and timers 366 are identical. 367 </para> 368 </sect1> 369 370 <sect1 id="lock-tasklets"> 371 <title>Locking Between Tasklets/Timers</title> 372 373 <para> 374 Sometimes a tasklet or timer might want to share data with 375 another tasklet or timer. 376 </para> 377 378 <sect2 id="lock-tasklets-same"> 379 <title>The Same Tasklet/Timer</title> 380 <para> 381 Since a tasklet is never run on two CPUs at once, you don't 382 need to worry about your tasklet being reentrant (running 383 twice at once), even on SMP. 384 </para> 385 </sect2> 386 387 <sect2 id="lock-tasklets-different"> 388 <title>Different Tasklets/Timers</title> 389 <para> 390 If another tasklet/timer wants 391 to share data with your tasklet or timer , you will both need to use 392 <function>spin_lock()</function> and 393 <function>spin_unlock()</function> calls. 394 <function>spin_lock_bh()</function> is 395 unnecessary here, as you are already in a tasklet, and 396 none will be run on the same CPU. 397 </para> 398 </sect2> 399 </sect1> 400 401 <sect1 id="lock-softirqs"> 402 <title>Locking Between Softirqs</title> 403 404 <para> 405 Often a softirq might 406 want to share data with itself or a tasklet/timer. 407 </para> 408 409 <sect2 id="lock-softirqs-same"> 410 <title>The Same Softirq</title> 411 412 <para> 413 The same softirq can run on the other CPUs: you can use a 414 per-CPU array (see <xref linkend="per-cpu"/>) for better 415 performance. If you're going so far as to use a softirq, 416 you probably care about scalable performance enough 417 to justify the extra complexity. 418 </para> 419 420 <para> 421 You'll need to use <function>spin_lock()</function> and 422 <function>spin_unlock()</function> for shared data. 423 </para> 424 </sect2> 425 426 <sect2 id="lock-softirqs-different"> 427 <title>Different Softirqs</title> 428 429 <para> 430 You'll need to use <function>spin_lock()</function> and 431 <function>spin_unlock()</function> for shared data, whether it 432 be a timer, tasklet, different softirq or the same or another 433 softirq: any of them could be running on a different CPU. 434 </para> 435 </sect2> 436 </sect1> 437 </chapter> 438 439 <chapter id="hardirq-context"> 440 <title>Hard IRQ Context</title> 441 442 <para> 443 Hardware interrupts usually communicate with a 444 tasklet or softirq. Frequently this involves putting work in a 445 queue, which the softirq will take out. 446 </para> 447 448 <sect1 id="hardirq-softirq"> 449 <title>Locking Between Hard IRQ and Softirqs/Tasklets</title> 450 451 <para> 452 If a hardware irq handler shares data with a softirq, you have 453 two concerns. Firstly, the softirq processing can be 454 interrupted by a hardware interrupt, and secondly, the 455 critical region could be entered by a hardware interrupt on 456 another CPU. This is where <function>spin_lock_irq()</function> is 457 used. It is defined to disable interrupts on that cpu, then grab 458 the lock. <function>spin_unlock_irq()</function> does the reverse. 459 </para> 460 461 <para> 462 The irq handler does not to use 463 <function>spin_lock_irq()</function>, because the softirq cannot 464 run while the irq handler is running: it can use 465 <function>spin_lock()</function>, which is slightly faster. The 466 only exception would be if a different hardware irq handler uses 467 the same lock: <function>spin_lock_irq()</function> will stop 468 that from interrupting us. 469 </para> 470 471 <para> 472 This works perfectly for UP as well: the spin lock vanishes, 473 and this macro simply becomes <function>local_irq_disable()</function> 474 (<filename class="headerfile">include/asm/smp.h</filename>), which 475 protects you from the softirq/tasklet/BH being run. 476 </para> 477 478 <para> 479 <function>spin_lock_irqsave()</function> 480 (<filename>include/linux/spinlock.h</filename>) is a variant 481 which saves whether interrupts were on or off in a flags word, 482 which is passed to <function>spin_unlock_irqrestore()</function>. This 483 means that the same code can be used inside an hard irq handler (where 484 interrupts are already off) and in softirqs (where the irq 485 disabling is required). 486 </para> 487 488 <para> 489 Note that softirqs (and hence tasklets and timers) are run on 490 return from hardware interrupts, so 491 <function>spin_lock_irq()</function> also stops these. In that 492 sense, <function>spin_lock_irqsave()</function> is the most 493 general and powerful locking function. 494 </para> 495 496 </sect1> 497 <sect1 id="hardirq-hardirq"> 498 <title>Locking Between Two Hard IRQ Handlers</title> 499 <para> 500 It is rare to have to share data between two IRQ handlers, but 501 if you do, <function>spin_lock_irqsave()</function> should be 502 used: it is architecture-specific whether all interrupts are 503 disabled inside irq handlers themselves. 504 </para> 505 </sect1> 506 507 </chapter> 508 509 <chapter id="cheatsheet"> 510 <title>Cheat Sheet For Locking</title> 511 <para> 512 Pete Zaitcev gives the following summary: 513 </para> 514 <itemizedlist> 515 <listitem> 516 <para> 517 If you are in a process context (any syscall) and want to 518 lock other process out, use a semaphore. You can take a semaphore 519 and sleep (<function>copy_from_user*(</function> or 520 <function>kmalloc(x,GFP_KERNEL)</function>). 521 </para> 522 </listitem> 523 <listitem> 524 <para> 525 Otherwise (== data can be touched in an interrupt), use 526 <function>spin_lock_irqsave()</function> and 527 <function>spin_unlock_irqrestore()</function>. 528 </para> 529 </listitem> 530 <listitem> 531 <para> 532 Avoid holding spinlock for more than 5 lines of code and 533 across any function call (except accessors like 534 <function>readb</function>). 535 </para> 536 </listitem> 537 </itemizedlist> 538 539 <sect1 id="minimum-lock-reqirements"> 540 <title>Table of Minimum Requirements</title> 541 542 <para> The following table lists the <emphasis>minimum</emphasis> 543 locking requirements between various contexts. In some cases, 544 the same context can only be running on one CPU at a time, so 545 no locking is required for that context (eg. a particular 546 thread can only run on one CPU at a time, but if it needs 547 shares data with another thread, locking is required). 548 </para> 549 <para> 550 Remember the advice above: you can always use 551 <function>spin_lock_irqsave()</function>, which is a superset 552 of all other spinlock primitives. 553 </para> 554 555 <table> 556 <title>Table of Locking Requirements</title> 557 <tgroup cols="11"> 558 <tbody> 559 560 <row> 561 <entry></entry> 562 <entry>IRQ Handler A</entry> 563 <entry>IRQ Handler B</entry> 564 <entry>Softirq A</entry> 565 <entry>Softirq B</entry> 566 <entry>Tasklet A</entry> 567 <entry>Tasklet B</entry> 568 <entry>Timer A</entry> 569 <entry>Timer B</entry> 570 <entry>User Context A</entry> 571 <entry>User Context B</entry> 572 </row> 573 574 <row> 575 <entry>IRQ Handler A</entry> 576 <entry>None</entry> 577 </row> 578 579 <row> 580 <entry>IRQ Handler B</entry> 581 <entry>SLIS</entry> 582 <entry>None</entry> 583 </row> 584 585 <row> 586 <entry>Softirq A</entry> 587 <entry>SLI</entry> 588 <entry>SLI</entry> 589 <entry>SL</entry> 590 </row> 591 592 <row> 593 <entry>Softirq B</entry> 594 <entry>SLI</entry> 595 <entry>SLI</entry> 596 <entry>SL</entry> 597 <entry>SL</entry> 598 </row> 599 600 <row> 601 <entry>Tasklet A</entry> 602 <entry>SLI</entry> 603 <entry>SLI</entry> 604 <entry>SL</entry> 605 <entry>SL</entry> 606 <entry>None</entry> 607 </row> 608 609 <row> 610 <entry>Tasklet B</entry> 611 <entry>SLI</entry> 612 <entry>SLI</entry> 613 <entry>SL</entry> 614 <entry>SL</entry> 615 <entry>SL</entry> 616 <entry>None</entry> 617 </row> 618 619 <row> 620 <entry>Timer A</entry> 621 <entry>SLI</entry> 622 <entry>SLI</entry> 623 <entry>SL</entry> 624 <entry>SL</entry> 625 <entry>SL</entry> 626 <entry>SL</entry> 627 <entry>None</entry> 628 </row> 629 630 <row> 631 <entry>Timer B</entry> 632 <entry>SLI</entry> 633 <entry>SLI</entry> 634 <entry>SL</entry> 635 <entry>SL</entry> 636 <entry>SL</entry> 637 <entry>SL</entry> 638 <entry>SL</entry> 639 <entry>None</entry> 640 </row> 641 642 <row> 643 <entry>User Context A</entry> 644 <entry>SLI</entry> 645 <entry>SLI</entry> 646 <entry>SLBH</entry> 647 <entry>SLBH</entry> 648 <entry>SLBH</entry> 649 <entry>SLBH</entry> 650 <entry>SLBH</entry> 651 <entry>SLBH</entry> 652 <entry>None</entry> 653 </row> 654 655 <row> 656 <entry>User Context B</entry> 657 <entry>SLI</entry> 658 <entry>SLI</entry> 659 <entry>SLBH</entry> 660 <entry>SLBH</entry> 661 <entry>SLBH</entry> 662 <entry>SLBH</entry> 663 <entry>SLBH</entry> 664 <entry>SLBH</entry> 665 <entry>DI</entry> 666 <entry>None</entry> 667 </row> 668 669 </tbody> 670 </tgroup> 671 </table> 672 673 <table> 674 <title>Legend for Locking Requirements Table</title> 675 <tgroup cols="2"> 676 <tbody> 677 678 <row> 679 <entry>SLIS</entry> 680 <entry>spin_lock_irqsave</entry> 681 </row> 682 <row> 683 <entry>SLI</entry> 684 <entry>spin_lock_irq</entry> 685 </row> 686 <row> 687 <entry>SL</entry> 688 <entry>spin_lock</entry> 689 </row> 690 <row> 691 <entry>SLBH</entry> 692 <entry>spin_lock_bh</entry> 693 </row> 694 <row> 695 <entry>DI</entry> 696 <entry>down_interruptible</entry> 697 </row> 698 699 </tbody> 700 </tgroup> 701 </table> 702 703 </sect1> 704 </chapter> 705 706 <chapter id="Examples"> 707 <title>Common Examples</title> 708 <para> 709 Let's step through a simple example: a cache of number to name 710 mappings. The cache keeps a count of how often each of the objects is 711 used, and when it gets full, throws out the least used one. 712 713 </para> 714 715 <sect1 id="examples-usercontext"> 716 <title>All In User Context</title> 717 <para> 718 For our first example, we assume that all operations are in user 719 context (ie. from system calls), so we can sleep. This means we can 720 use a mutex to protect the cache and all the objects within 721 it. Here's the code: 722 </para> 723 724 <programlisting> 725 #include <linux/list.h> 726 #include <linux/slab.h> 727 #include <linux/string.h> 728 #include <linux/mutex.h> 729 #include <asm/errno.h> 730 731 struct object 732 { 733 struct list_head list; 734 int id; 735 char name[32]; 736 int popularity; 737 }; 738 739 /* Protects the cache, cache_num, and the objects within it */ 740 static DEFINE_MUTEX(cache_lock); 741 static LIST_HEAD(cache); 742 static unsigned int cache_num = 0; 743 #define MAX_CACHE_SIZE 10 744 745 /* Must be holding cache_lock */ 746 static struct object *__cache_find(int id) 747 { 748 struct object *i; 749 750 list_for_each_entry(i, &cache, list) 751 if (i->id == id) { 752 i->popularity++; 753 return i; 754 } 755 return NULL; 756 } 757 758 /* Must be holding cache_lock */ 759 static void __cache_delete(struct object *obj) 760 { 761 BUG_ON(!obj); 762 list_del(&obj->list); 763 kfree(obj); 764 cache_num--; 765 } 766 767 /* Must be holding cache_lock */ 768 static void __cache_add(struct object *obj) 769 { 770 list_add(&obj->list, &cache); 771 if (++cache_num > MAX_CACHE_SIZE) { 772 struct object *i, *outcast = NULL; 773 list_for_each_entry(i, &cache, list) { 774 if (!outcast || i->popularity < outcast->popularity) 775 outcast = i; 776 } 777 __cache_delete(outcast); 778 } 779 } 780 781 int cache_add(int id, const char *name) 782 { 783 struct object *obj; 784 785 if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) 786 return -ENOMEM; 787 788 strlcpy(obj->name, name, sizeof(obj->name)); 789 obj->id = id; 790 obj->popularity = 0; 791 792 mutex_lock(&cache_lock); 793 __cache_add(obj); 794 mutex_unlock(&cache_lock); 795 return 0; 796 } 797 798 void cache_delete(int id) 799 { 800 mutex_lock(&cache_lock); 801 __cache_delete(__cache_find(id)); 802 mutex_unlock(&cache_lock); 803 } 804 805 int cache_find(int id, char *name) 806 { 807 struct object *obj; 808 int ret = -ENOENT; 809 810 mutex_lock(&cache_lock); 811 obj = __cache_find(id); 812 if (obj) { 813 ret = 0; 814 strcpy(name, obj->name); 815 } 816 mutex_unlock(&cache_lock); 817 return ret; 818 } 819 </programlisting> 820 821 <para> 822 Note that we always make sure we have the cache_lock when we add, 823 delete, or look up the cache: both the cache infrastructure itself and 824 the contents of the objects are protected by the lock. In this case 825 it's easy, since we copy the data for the user, and never let them 826 access the objects directly. 827 </para> 828 <para> 829 There is a slight (and common) optimization here: in 830 <function>cache_add</function> we set up the fields of the object 831 before grabbing the lock. This is safe, as no-one else can access it 832 until we put it in cache. 833 </para> 834 </sect1> 835 836 <sect1 id="examples-interrupt"> 837 <title>Accessing From Interrupt Context</title> 838 <para> 839 Now consider the case where <function>cache_find</function> can be 840 called from interrupt context: either a hardware interrupt or a 841 softirq. An example would be a timer which deletes object from the 842 cache. 843 </para> 844 <para> 845 The change is shown below, in standard patch format: the 846 <symbol>-</symbol> are lines which are taken away, and the 847 <symbol>+</symbol> are lines which are added. 848 </para> 849 <programlisting> 850 --- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 851 +++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 852 @@ -12,7 +12,7 @@ 853 int popularity; 854 }; 855 856 -static DEFINE_MUTEX(cache_lock); 857 +static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; 858 static LIST_HEAD(cache); 859 static unsigned int cache_num = 0; 860 #define MAX_CACHE_SIZE 10 861 @@ -55,6 +55,7 @@ 862 int cache_add(int id, const char *name) 863 { 864 struct object *obj; 865 + unsigned long flags; 866 867 if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) 868 return -ENOMEM; 869 @@ -63,30 +64,33 @@ 870 obj->id = id; 871 obj->popularity = 0; 872 873 - mutex_lock(&cache_lock); 874 + spin_lock_irqsave(&cache_lock, flags); 875 __cache_add(obj); 876 - mutex_unlock(&cache_lock); 877 + spin_unlock_irqrestore(&cache_lock, flags); 878 return 0; 879 } 880 881 void cache_delete(int id) 882 { 883 - mutex_lock(&cache_lock); 884 + unsigned long flags; 885 + 886 + spin_lock_irqsave(&cache_lock, flags); 887 __cache_delete(__cache_find(id)); 888 - mutex_unlock(&cache_lock); 889 + spin_unlock_irqrestore(&cache_lock, flags); 890 } 891 892 int cache_find(int id, char *name) 893 { 894 struct object *obj; 895 int ret = -ENOENT; 896 + unsigned long flags; 897 898 - mutex_lock(&cache_lock); 899 + spin_lock_irqsave(&cache_lock, flags); 900 obj = __cache_find(id); 901 if (obj) { 902 ret = 0; 903 strcpy(name, obj->name); 904 } 905 - mutex_unlock(&cache_lock); 906 + spin_unlock_irqrestore(&cache_lock, flags); 907 return ret; 908 } 909 </programlisting> 910 911 <para> 912 Note that the <function>spin_lock_irqsave</function> will turn off 913 interrupts if they are on, otherwise does nothing (if we are already 914 in an interrupt handler), hence these functions are safe to call from 915 any context. 916 </para> 917 <para> 918 Unfortunately, <function>cache_add</function> calls 919 <function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol> 920 flag, which is only legal in user context. I have assumed that 921 <function>cache_add</function> is still only called in user context, 922 otherwise this should become a parameter to 923 <function>cache_add</function>. 924 </para> 925 </sect1> 926 <sect1 id="examples-refcnt"> 927 <title>Exposing Objects Outside This File</title> 928 <para> 929 If our objects contained more information, it might not be sufficient 930 to copy the information in and out: other parts of the code might want 931 to keep pointers to these objects, for example, rather than looking up 932 the id every time. This produces two problems. 933 </para> 934 <para> 935 The first problem is that we use the <symbol>cache_lock</symbol> to 936 protect objects: we'd need to make this non-static so the rest of the 937 code can use it. This makes locking trickier, as it is no longer all 938 in one place. 939 </para> 940 <para> 941 The second problem is the lifetime problem: if another structure keeps 942 a pointer to an object, it presumably expects that pointer to remain 943 valid. Unfortunately, this is only guaranteed while you hold the 944 lock, otherwise someone might call <function>cache_delete</function> 945 and even worse, add another object, re-using the same address. 946 </para> 947 <para> 948 As there is only one lock, you can't hold it forever: no-one else would 949 get any work done. 950 </para> 951 <para> 952 The solution to this problem is to use a reference count: everyone who 953 has a pointer to the object increases it when they first get the 954 object, and drops the reference count when they're finished with it. 955 Whoever drops it to zero knows it is unused, and can actually delete it. 956 </para> 957 <para> 958 Here is the code: 959 </para> 960 961 <programlisting> 962 --- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 963 +++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 964 @@ -7,6 +7,7 @@ 965 struct object 966 { 967 struct list_head list; 968 + unsigned int refcnt; 969 int id; 970 char name[32]; 971 int popularity; 972 @@ -17,6 +18,35 @@ 973 static unsigned int cache_num = 0; 974 #define MAX_CACHE_SIZE 10 975 976 +static void __object_put(struct object *obj) 977 +{ 978 + if (--obj->refcnt == 0) 979 + kfree(obj); 980 +} 981 + 982 +static void __object_get(struct object *obj) 983 +{ 984 + obj->refcnt++; 985 +} 986 + 987 +void object_put(struct object *obj) 988 +{ 989 + unsigned long flags; 990 + 991 + spin_lock_irqsave(&cache_lock, flags); 992 + __object_put(obj); 993 + spin_unlock_irqrestore(&cache_lock, flags); 994 +} 995 + 996 +void object_get(struct object *obj) 997 +{ 998 + unsigned long flags; 999 + 1000 + spin_lock_irqsave(&cache_lock, flags); 1001 + __object_get(obj); 1002 + spin_unlock_irqrestore(&cache_lock, flags); 1003 +} 1004 + 1005 /* Must be holding cache_lock */ 1006 static struct object *__cache_find(int id) 1007 { 1008 @@ -35,6 +65,7 @@ 1009 { 1010 BUG_ON(!obj); 1011 list_del(&obj->list); 1012 + __object_put(obj); 1013 cache_num--; 1014 } 1015 1016 @@ -63,6 +94,7 @@ 1017 strlcpy(obj->name, name, sizeof(obj->name)); 1018 obj->id = id; 1019 obj->popularity = 0; 1020 + obj->refcnt = 1; /* The cache holds a reference */ 1021 1022 spin_lock_irqsave(&cache_lock, flags); 1023 __cache_add(obj); 1024 @@ -79,18 +111,15 @@ 1025 spin_unlock_irqrestore(&cache_lock, flags); 1026 } 1027 1028 -int cache_find(int id, char *name) 1029 +struct object *cache_find(int id) 1030 { 1031 struct object *obj; 1032 - int ret = -ENOENT; 1033 unsigned long flags; 1034 1035 spin_lock_irqsave(&cache_lock, flags); 1036 obj = __cache_find(id); 1037 - if (obj) { 1038 - ret = 0; 1039 - strcpy(name, obj->name); 1040 - } 1041 + if (obj) 1042 + __object_get(obj); 1043 spin_unlock_irqrestore(&cache_lock, flags); 1044 - return ret; 1045 + return obj; 1046 } 1047 </programlisting> 1048 1049 <para> 1050 We encapsulate the reference counting in the standard 'get' and 'put' 1051 functions. Now we can return the object itself from 1052 <function>cache_find</function> which has the advantage that the user 1053 can now sleep holding the object (eg. to 1054 <function>copy_to_user</function> to name to userspace). 1055 </para> 1056 <para> 1057 The other point to note is that I said a reference should be held for 1058 every pointer to the object: thus the reference count is 1 when first 1059 inserted into the cache. In some versions the framework does not hold 1060 a reference count, but they are more complicated. 1061 </para> 1062 1063 <sect2 id="examples-refcnt-atomic"> 1064 <title>Using Atomic Operations For The Reference Count</title> 1065 <para> 1066 In practice, <type>atomic_t</type> would usually be used for 1067 <structfield>refcnt</structfield>. There are a number of atomic 1068 operations defined in 1069 1070 <filename class="headerfile">include/asm/atomic.h</filename>: these are 1071 guaranteed to be seen atomically from all CPUs in the system, so no 1072 lock is required. In this case, it is simpler than using spinlocks, 1073 although for anything non-trivial using spinlocks is clearer. The 1074 <function>atomic_inc</function> and 1075 <function>atomic_dec_and_test</function> are used instead of the 1076 standard increment and decrement operators, and the lock is no longer 1077 used to protect the reference count itself. 1078 </para> 1079 1080 <programlisting> 1081 --- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 1082 +++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 1083 @@ -7,7 +7,7 @@ 1084 struct object 1085 { 1086 struct list_head list; 1087 - unsigned int refcnt; 1088 + atomic_t refcnt; 1089 int id; 1090 char name[32]; 1091 int popularity; 1092 @@ -18,33 +18,15 @@ 1093 static unsigned int cache_num = 0; 1094 #define MAX_CACHE_SIZE 10 1095 1096 -static void __object_put(struct object *obj) 1097 -{ 1098 - if (--obj->refcnt == 0) 1099 - kfree(obj); 1100 -} 1101 - 1102 -static void __object_get(struct object *obj) 1103 -{ 1104 - obj->refcnt++; 1105 -} 1106 - 1107 void object_put(struct object *obj) 1108 { 1109 - unsigned long flags; 1110 - 1111 - spin_lock_irqsave(&cache_lock, flags); 1112 - __object_put(obj); 1113 - spin_unlock_irqrestore(&cache_lock, flags); 1114 + if (atomic_dec_and_test(&obj->refcnt)) 1115 + kfree(obj); 1116 } 1117 1118 void object_get(struct object *obj) 1119 { 1120 - unsigned long flags; 1121 - 1122 - spin_lock_irqsave(&cache_lock, flags); 1123 - __object_get(obj); 1124 - spin_unlock_irqrestore(&cache_lock, flags); 1125 + atomic_inc(&obj->refcnt); 1126 } 1127 1128 /* Must be holding cache_lock */ 1129 @@ -65,7 +47,7 @@ 1130 { 1131 BUG_ON(!obj); 1132 list_del(&obj->list); 1133 - __object_put(obj); 1134 + object_put(obj); 1135 cache_num--; 1136 } 1137 1138 @@ -94,7 +76,7 @@ 1139 strlcpy(obj->name, name, sizeof(obj->name)); 1140 obj->id = id; 1141 obj->popularity = 0; 1142 - obj->refcnt = 1; /* The cache holds a reference */ 1143 + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ 1144 1145 spin_lock_irqsave(&cache_lock, flags); 1146 __cache_add(obj); 1147 @@ -119,7 +101,7 @@ 1148 spin_lock_irqsave(&cache_lock, flags); 1149 obj = __cache_find(id); 1150 if (obj) 1151 - __object_get(obj); 1152 + object_get(obj); 1153 spin_unlock_irqrestore(&cache_lock, flags); 1154 return obj; 1155 } 1156 </programlisting> 1157 </sect2> 1158 </sect1> 1159 1160 <sect1 id="examples-lock-per-obj"> 1161 <title>Protecting The Objects Themselves</title> 1162 <para> 1163 In these examples, we assumed that the objects (except the reference 1164 counts) never changed once they are created. If we wanted to allow 1165 the name to change, there are three possibilities: 1166 </para> 1167 <itemizedlist> 1168 <listitem> 1169 <para> 1170 You can make <symbol>cache_lock</symbol> non-static, and tell people 1171 to grab that lock before changing the name in any object. 1172 </para> 1173 </listitem> 1174 <listitem> 1175 <para> 1176 You can provide a <function>cache_obj_rename</function> which grabs 1177 this lock and changes the name for the caller, and tell everyone to 1178 use that function. 1179 </para> 1180 </listitem> 1181 <listitem> 1182 <para> 1183 You can make the <symbol>cache_lock</symbol> protect only the cache 1184 itself, and use another lock to protect the name. 1185 </para> 1186 </listitem> 1187 </itemizedlist> 1188 1189 <para> 1190 Theoretically, you can make the locks as fine-grained as one lock for 1191 every field, for every object. In practice, the most common variants 1192 are: 1193 </para> 1194 <itemizedlist> 1195 <listitem> 1196 <para> 1197 One lock which protects the infrastructure (the <symbol>cache</symbol> 1198 list in this example) and all the objects. This is what we have done 1199 so far. 1200 </para> 1201 </listitem> 1202 <listitem> 1203 <para> 1204 One lock which protects the infrastructure (including the list 1205 pointers inside the objects), and one lock inside the object which 1206 protects the rest of that object. 1207 </para> 1208 </listitem> 1209 <listitem> 1210 <para> 1211 Multiple locks to protect the infrastructure (eg. one lock per hash 1212 chain), possibly with a separate per-object lock. 1213 </para> 1214 </listitem> 1215 </itemizedlist> 1216 1217 <para> 1218 Here is the "lock-per-object" implementation: 1219 </para> 1220 <programlisting> 1221 --- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 1222 +++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 1223 @@ -6,11 +6,17 @@ 1224 1225 struct object 1226 { 1227 + /* These two protected by cache_lock. */ 1228 struct list_head list; 1229 + int popularity; 1230 + 1231 atomic_t refcnt; 1232 + 1233 + /* Doesn't change once created. */ 1234 int id; 1235 + 1236 + spinlock_t lock; /* Protects the name */ 1237 char name[32]; 1238 - int popularity; 1239 }; 1240 1241 static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; 1242 @@ -77,6 +84,7 @@ 1243 obj->id = id; 1244 obj->popularity = 0; 1245 atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ 1246 + spin_lock_init(&obj->lock); 1247 1248 spin_lock_irqsave(&cache_lock, flags); 1249 __cache_add(obj); 1250 </programlisting> 1251 1252 <para> 1253 Note that I decide that the <structfield>popularity</structfield> 1254 count should be protected by the <symbol>cache_lock</symbol> rather 1255 than the per-object lock: this is because it (like the 1256 <structname>struct list_head</structname> inside the object) is 1257 logically part of the infrastructure. This way, I don't need to grab 1258 the lock of every object in <function>__cache_add</function> when 1259 seeking the least popular. 1260 </para> 1261 1262 <para> 1263 I also decided that the <structfield>id</structfield> member is 1264 unchangeable, so I don't need to grab each object lock in 1265 <function>__cache_find()</function> to examine the 1266 <structfield>id</structfield>: the object lock is only used by a 1267 caller who wants to read or write the <structfield>name</structfield> 1268 field. 1269 </para> 1270 1271 <para> 1272 Note also that I added a comment describing what data was protected by 1273 which locks. This is extremely important, as it describes the runtime 1274 behavior of the code, and can be hard to gain from just reading. And 1275 as Alan Cox says, <quote>Lock data, not code</quote>. 1276 </para> 1277 </sect1> 1278 </chapter> 1279 1280 <chapter id="common-problems"> 1281 <title>Common Problems</title> 1282 <sect1 id="deadlock"> 1283 <title>Deadlock: Simple and Advanced</title> 1284 1285 <para> 1286 There is a coding bug where a piece of code tries to grab a 1287 spinlock twice: it will spin forever, waiting for the lock to 1288 be released (spinlocks, rwlocks and semaphores are not 1289 recursive in Linux). This is trivial to diagnose: not a 1290 stay-up-five-nights-talk-to-fluffy-code-bunnies kind of 1291 problem. 1292 </para> 1293 1294 <para> 1295 For a slightly more complex case, imagine you have a region 1296 shared by a softirq and user context. If you use a 1297 <function>spin_lock()</function> call to protect it, it is 1298 possible that the user context will be interrupted by the softirq 1299 while it holds the lock, and the softirq will then spin 1300 forever trying to get the same lock. 1301 </para> 1302 1303 <para> 1304 Both of these are called deadlock, and as shown above, it can 1305 occur even with a single CPU (although not on UP compiles, 1306 since spinlocks vanish on kernel compiles with 1307 <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption 1308 in the second example). 1309 </para> 1310 1311 <para> 1312 This complete lockup is easy to diagnose: on SMP boxes the 1313 watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set 1314 (<filename>include/linux/spinlock.h</filename>) will show this up 1315 immediately when it happens. 1316 </para> 1317 1318 <para> 1319 A more complex problem is the so-called 'deadly embrace', 1320 involving two or more locks. Say you have a hash table: each 1321 entry in the table is a spinlock, and a chain of hashed 1322 objects. Inside a softirq handler, you sometimes want to 1323 alter an object from one place in the hash to another: you 1324 grab the spinlock of the old hash chain and the spinlock of 1325 the new hash chain, and delete the object from the old one, 1326 and insert it in the new one. 1327 </para> 1328 1329 <para> 1330 There are two problems here. First, if your code ever 1331 tries to move the object to the same chain, it will deadlock 1332 with itself as it tries to lock it twice. Secondly, if the 1333 same softirq on another CPU is trying to move another object 1334 in the reverse direction, the following could happen: 1335 </para> 1336 1337 <table> 1338 <title>Consequences</title> 1339 1340 <tgroup cols="2" align="left"> 1341 1342 <thead> 1343 <row> 1344 <entry>CPU 1</entry> 1345 <entry>CPU 2</entry> 1346 </row> 1347 </thead> 1348 1349 <tbody> 1350 <row> 1351 <entry>Grab lock A -> OK</entry> 1352 <entry>Grab lock B -> OK</entry> 1353 </row> 1354 <row> 1355 <entry>Grab lock B -> spin</entry> 1356 <entry>Grab lock A -> spin</entry> 1357 </row> 1358 </tbody> 1359 </tgroup> 1360 </table> 1361 1362 <para> 1363 The two CPUs will spin forever, waiting for the other to give up 1364 their lock. It will look, smell, and feel like a crash. 1365 </para> 1366 </sect1> 1367 1368 <sect1 id="techs-deadlock-prevent"> 1369 <title>Preventing Deadlock</title> 1370 1371 <para> 1372 Textbooks will tell you that if you always lock in the same 1373 order, you will never get this kind of deadlock. Practice 1374 will tell you that this approach doesn't scale: when I 1375 create a new lock, I don't understand enough of the kernel 1376 to figure out where in the 5000 lock hierarchy it will fit. 1377 </para> 1378 1379 <para> 1380 The best locks are encapsulated: they never get exposed in 1381 headers, and are never held around calls to non-trivial 1382 functions outside the same file. You can read through this 1383 code and see that it will never deadlock, because it never 1384 tries to grab another lock while it has that one. People 1385 using your code don't even need to know you are using a 1386 lock. 1387 </para>