About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / DocBook / kernel-locking.tmpl

Based on kernel version 2.6.25. Page generated on 2008-04-18 21:22 EST.

1	<?xml version="1.0" encoding="UTF-8"?>
2	<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
3		"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
4	
5	<book id="LKLockingGuide">
6	 <bookinfo>
7	  <title>Unreliable Guide To Locking</title>
8	  
9	  <authorgroup>
10	   <author>
11	    <firstname>Rusty</firstname>
12	    <surname>Russell</surname>
13	    <affiliation>
14	     <address>
15	      <email>rusty[AT]rustcorp.com[DOT]au</email>
16	     </address>
17	    </affiliation>
18	   </author>
19	  </authorgroup>
20	
21	  <copyright>
22	   <year>2003</year>
23	   <holder>Rusty Russell</holder>
24	  </copyright>
25	
26	  <legalnotice>
27	   <para>
28	     This documentation is free software; you can redistribute
29	     it and/or modify it under the terms of the GNU General Public
30	     License as published by the Free Software Foundation; either
31	     version 2 of the License, or (at your option) any later
32	     version.
33	   </para>
34	      
35	   <para>
36	     This program is distributed in the hope that it will be
37	     useful, but WITHOUT ANY WARRANTY; without even the implied
38	     warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
39	     See the GNU General Public License for more details.
40	   </para>
41	      
42	   <para>
43	     You should have received a copy of the GNU General Public
44	     License along with this program; if not, write to the Free
45	     Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
46	     MA 02111-1307 USA
47	   </para>
48	      
49	   <para>
50	     For more details see the file COPYING in the source
51	     distribution of Linux.
52	   </para>
53	  </legalnotice>
54	 </bookinfo>
55	
56	 <toc></toc>
57	  <chapter id="intro">
58	   <title>Introduction</title>
59	   <para>
60	     Welcome, to Rusty's Remarkably Unreliable Guide to Kernel
61	     Locking issues.  This document describes the locking systems in
62	     the Linux Kernel in 2.6.
63	   </para>
64	   <para>
65	     With the wide availability of HyperThreading, and <firstterm
66	     linkend="gloss-preemption">preemption </firstterm> in the Linux
67	     Kernel, everyone hacking on the kernel needs to know the
68	     fundamentals of concurrency and locking for
69	     <firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>.
70	   </para>
71	  </chapter>
72	
73	   <chapter id="races">
74	    <title>The Problem With Concurrency</title>
75	    <para>
76	      (Skip this if you know what a Race Condition is).
77	    </para>
78	    <para>
79	      In a normal program, you can increment a counter like so:
80	    </para>
81	    <programlisting>
82	      very_important_count++;
83	    </programlisting>
84	
85	    <para>
86	      This is what they would expect to happen:
87	    </para>
88	
89	    <table>
90	     <title>Expected Results</title>
91	
92	     <tgroup cols="2" align="left">
93	
94	      <thead>
95	       <row>
96	        <entry>Instance 1</entry>
97	        <entry>Instance 2</entry>
98	       </row>
99	      </thead>
100	
101	      <tbody>
102	       <row>
103	        <entry>read very_important_count (5)</entry>
104	        <entry></entry>
105	       </row>
106	       <row>
107	        <entry>add 1 (6)</entry>
108	        <entry></entry>
109	       </row>
110	       <row>
111	        <entry>write very_important_count (6)</entry>
112	        <entry></entry>
113	       </row>
114	       <row>
115	        <entry></entry>
116	        <entry>read very_important_count (6)</entry>
117	       </row>
118	       <row>
119	        <entry></entry>
120	        <entry>add 1 (7)</entry>
121	       </row>
122	       <row>
123	        <entry></entry>
124	        <entry>write very_important_count (7)</entry>
125	       </row>
126	      </tbody>
127	
128	     </tgroup>
129	    </table>
130	
131	    <para>
132	     This is what might happen:
133	    </para>
134	
135	    <table>
136	     <title>Possible Results</title>
137	
138	     <tgroup cols="2" align="left">
139	      <thead>
140	       <row>
141	        <entry>Instance 1</entry>
142	        <entry>Instance 2</entry>
143	       </row>
144	      </thead>
145	
146	      <tbody>
147	       <row>
148	        <entry>read very_important_count (5)</entry>
149	        <entry></entry>
150	       </row>
151	       <row>
152	        <entry></entry>
153	        <entry>read very_important_count (5)</entry>
154	       </row>
155	       <row>
156	        <entry>add 1 (6)</entry>
157	        <entry></entry>
158	       </row>
159	       <row>
160	        <entry></entry>
161	        <entry>add 1 (6)</entry>
162	       </row>
163	       <row>
164	        <entry>write very_important_count (6)</entry>
165	        <entry></entry>
166	       </row>
167	       <row>
168	        <entry></entry>
169	        <entry>write very_important_count (6)</entry>
170	       </row>
171	      </tbody>
172	     </tgroup>
173	    </table>
174	
175	    <sect1 id="race-condition">
176	    <title>Race Conditions and Critical Regions</title>
177	    <para>
178	      This overlap, where the result depends on the
179	      relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>.
180	      The piece of code containing the concurrency issue is called a
181	      <firstterm>critical region</firstterm>.  And especially since Linux starting running
182	      on SMP machines, they became one of the major issues in kernel
183	      design and implementation.
184	    </para>
185	    <para>
186	      Preemption can have the same effect, even if there is only one
187	      CPU: by preempting one task during the critical region, we have
188	      exactly the same race condition.  In this case the thread which
189	      preempts might run the critical region itself.
190	    </para>
191	    <para>
192	      The solution is to recognize when these simultaneous accesses
193	      occur, and use locks to make sure that only one instance can
194	      enter the critical region at any time.  There are many
195	      friendly primitives in the Linux kernel to help you do this.
196	      And then there are the unfriendly primitives, but I'll pretend
197	      they don't exist.
198	    </para>
199	    </sect1>
200	  </chapter>
201	
202	  <chapter id="locks">
203	   <title>Locking in the Linux Kernel</title>
204	
205	   <para>
206	     If I could give you one piece of advice: never sleep with anyone
207	     crazier than yourself.  But if I had to give you advice on
208	     locking: <emphasis>keep it simple</emphasis>.
209	   </para>
210	
211	   <para>
212	     Be reluctant to introduce new locks.
213	   </para>
214	
215	   <para>
216	     Strangely enough, this last one is the exact reverse of my advice when
217	     you <emphasis>have</emphasis> slept with someone crazier than yourself.
218	     And you should think about getting a big dog.
219	   </para>
220	
221	   <sect1 id="lock-intro">
222	   <title>Three Main Types of Kernel Locks: Spinlocks, Mutexes and Semaphores</title>
223	
224	   <para>
225	     There are three main types of kernel locks.  The fundamental type
226	     is the spinlock 
227	     (<filename class="headerfile">include/asm/spinlock.h</filename>),
228	     which is a very simple single-holder lock: if you can't get the 
229	     spinlock, you keep trying (spinning) until you can.  Spinlocks are 
230	     very small and fast, and can be used anywhere.
231	   </para>
232	   <para>
233	     The second type is a mutex
234	     (<filename class="headerfile">include/linux/mutex.h</filename>): it
235	     is like a spinlock, but you may block holding a mutex.
236	     If you can't lock a mutex, your task will suspend itself, and be woken
237	     up when the mutex is released.  This means the CPU can do something
238	     else while you are waiting.  There are many cases when you simply
239	     can't sleep (see <xref linkend="sleeping-things"/>), and so have to
240	     use a spinlock instead.
241	   </para>
242	   <para>
243	     The third type is a semaphore
244	     (<filename class="headerfile">include/asm/semaphore.h</filename>): it
245	     can have more than one holder at any time (the number decided at
246	     initialization time), although it is most commonly used as a
247	     single-holder lock (a mutex).  If you can't get a semaphore, your
248	     task will be suspended and later on woken up - just like for mutexes.
249	   </para>
250	   <para>
251	     Neither type of lock is recursive: see
252	     <xref linkend="deadlock"/>.
253	   </para>
254	   </sect1>
255	 
256	   <sect1 id="uniprocessor">
257	    <title>Locks and Uniprocessor Kernels</title>
258	
259	    <para>
260	      For kernels compiled without <symbol>CONFIG_SMP</symbol>, and
261	      without <symbol>CONFIG_PREEMPT</symbol> spinlocks do not exist at
262	      all.  This is an excellent design decision: when no-one else can
263	      run at the same time, there is no reason to have a lock.
264	    </para>
265	
266	    <para>
267	      If the kernel is compiled without <symbol>CONFIG_SMP</symbol>,
268	      but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks
269	      simply disable preemption, which is sufficient to prevent any
270	      races.  For most purposes, we can think of preemption as
271	      equivalent to SMP, and not worry about it separately.
272	    </para>
273	
274	    <para>
275	      You should always test your locking code with <symbol>CONFIG_SMP</symbol>
276	      and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it
277	      will still catch some kinds of locking bugs.
278	    </para>
279	
280	    <para>
281	      Semaphores still exist, because they are required for
282	      synchronization between <firstterm linkend="gloss-usercontext">user 
283	      contexts</firstterm>, as we will see below.
284	    </para>
285	   </sect1>
286	
287	    <sect1 id="usercontextlocking">
288	     <title>Locking Only In User Context</title>
289	
290	     <para>
291	       If you have a data structure which is only ever accessed from
292	       user context, then you can use a simple semaphore
293	       (<filename>linux/asm/semaphore.h</filename>) to protect it.  This 
294	       is the most trivial case: you initialize the semaphore to the number 
295	       of resources available (usually 1), and call
296	       <function>down_interruptible()</function> to grab the semaphore, and 
297	       <function>up()</function> to release it.  There is also a 
298	       <function>down()</function>, which should be avoided, because it 
299	       will not return if a signal is received.
300	     </para>
301	
302	     <para>
303	       Example: <filename>linux/net/core/netfilter.c</filename> allows 
304	       registration of new <function>setsockopt()</function> and 
305	       <function>getsockopt()</function> calls, with
306	       <function>nf_register_sockopt()</function>.  Registration and 
307	       de-registration are only done on module load and unload (and boot 
308	       time, where there is no concurrency), and the list of registrations 
309	       is only consulted for an unknown <function>setsockopt()</function>
310	       or <function>getsockopt()</function> system call.  The 
311	       <varname>nf_sockopt_mutex</varname> is perfect to protect this,
312	       especially since the setsockopt and getsockopt calls may well
313	       sleep.
314	     </para>
315	   </sect1>
316	
317	   <sect1 id="lock-user-bh">
318	    <title>Locking Between User Context and Softirqs</title>
319	
320	    <para>
321	      If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares
322	      data with user context, you have two problems.  Firstly, the current 
323	      user context can be interrupted by a softirq, and secondly, the
324	      critical region could be entered from another CPU.  This is where
325	      <function>spin_lock_bh()</function> 
326	      (<filename class="headerfile">include/linux/spinlock.h</filename>) is
327	      used.  It disables softirqs on that CPU, then grabs the lock.
328	      <function>spin_unlock_bh()</function> does the reverse.  (The
329	      '_bh' suffix is a historical reference to "Bottom Halves", the
330	      old name for software interrupts.  It should really be
331	      called spin_lock_softirq()' in a perfect world).
332	    </para>
333	
334	    <para>
335	      Note that you can also use <function>spin_lock_irq()</function>
336	      or <function>spin_lock_irqsave()</function> here, which stop
337	      hardware interrupts as well: see <xref linkend="hardirq-context"/>.
338	    </para>
339	
340	    <para>
341	      This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
342	      </acronym></firstterm> as well: the spin lock vanishes, and this macro 
343	      simply becomes <function>local_bh_disable()</function>
344	      (<filename class="headerfile">include/linux/interrupt.h</filename>), which
345	      protects you from the softirq being run.
346	    </para>
347	   </sect1>
348	
349	   <sect1 id="lock-user-tasklet">
350	    <title>Locking Between User Context and Tasklets</title>
351	
352	    <para>
353	      This is exactly the same as above, because <firstterm
354	      linkend="gloss-tasklet">tasklets</firstterm> are actually run
355	      from a softirq.
356	    </para>
357	   </sect1>
358	
359	   <sect1 id="lock-user-timers">
360	    <title>Locking Between User Context and Timers</title>
361	
362	    <para>
363	      This, too, is exactly the same as above, because <firstterm
364	      linkend="gloss-timers">timers</firstterm> are actually run from
365	      a softirq.  From a locking point of view, tasklets and timers
366	      are identical.
367	    </para>
368	   </sect1>
369	
370	   <sect1 id="lock-tasklets">
371	    <title>Locking Between Tasklets/Timers</title>
372	
373	    <para>
374	      Sometimes a tasklet or timer might want to share data with
375	      another tasklet or timer.
376	    </para>
377	
378	    <sect2 id="lock-tasklets-same">
379	     <title>The Same Tasklet/Timer</title>
380	     <para>
381	       Since a tasklet is never run on two CPUs at once, you don't
382	       need to worry about your tasklet being reentrant (running
383	       twice at once), even on SMP.
384	     </para>
385	    </sect2>
386	
387	    <sect2 id="lock-tasklets-different">
388	     <title>Different Tasklets/Timers</title>
389	     <para>
390	       If another tasklet/timer wants
391	       to share data with your tasklet or timer , you will both need to use
392	       <function>spin_lock()</function> and
393	       <function>spin_unlock()</function> calls.  
394	       <function>spin_lock_bh()</function> is
395	       unnecessary here, as you are already in a tasklet, and
396	       none will be run on the same CPU.
397	     </para>
398	    </sect2>
399	   </sect1>
400	
401	   <sect1 id="lock-softirqs">
402	    <title>Locking Between Softirqs</title>
403	
404	    <para>
405	      Often a softirq might
406	      want to share data with itself or a tasklet/timer.
407	    </para>
408	
409	    <sect2 id="lock-softirqs-same">
410	     <title>The Same Softirq</title>
411	
412	     <para>
413	       The same softirq can run on the other CPUs: you can use a
414	       per-CPU array (see <xref linkend="per-cpu"/>) for better
415	       performance.  If you're going so far as to use a softirq,
416	       you probably care about scalable performance enough
417	       to justify the extra complexity.
418	     </para>
419	
420	     <para>
421	       You'll need to use <function>spin_lock()</function> and 
422	       <function>spin_unlock()</function> for shared data.
423	     </para>
424	    </sect2>
425	
426	    <sect2 id="lock-softirqs-different">
427	     <title>Different Softirqs</title>
428	
429	     <para>
430	       You'll need to use <function>spin_lock()</function> and
431	       <function>spin_unlock()</function> for shared data, whether it
432	       be a timer, tasklet, different softirq or the same or another
433	       softirq: any of them could be running on a different CPU.
434	     </para>
435	    </sect2>
436	   </sect1>
437	  </chapter>
438	
439	  <chapter id="hardirq-context">
440	   <title>Hard IRQ Context</title>
441	
442	   <para>
443	     Hardware interrupts usually communicate with a
444	     tasklet or softirq.  Frequently this involves putting work in a
445	     queue, which the softirq will take out.
446	   </para>
447	
448	   <sect1 id="hardirq-softirq">
449	    <title>Locking Between Hard IRQ and Softirqs/Tasklets</title>
450	
451	    <para>
452	      If a hardware irq handler shares data with a softirq, you have
453	      two concerns.  Firstly, the softirq processing can be
454	      interrupted by a hardware interrupt, and secondly, the
455	      critical region could be entered by a hardware interrupt on
456	      another CPU.  This is where <function>spin_lock_irq()</function> is 
457	      used.  It is defined to disable interrupts on that cpu, then grab 
458	      the lock. <function>spin_unlock_irq()</function> does the reverse.
459	    </para>
460	
461	    <para>
462	      The irq handler does not to use
463	      <function>spin_lock_irq()</function>, because the softirq cannot
464	      run while the irq handler is running: it can use
465	      <function>spin_lock()</function>, which is slightly faster.  The
466	      only exception would be if a different hardware irq handler uses
467	      the same lock: <function>spin_lock_irq()</function> will stop
468	      that from interrupting us.
469	    </para>
470	
471	    <para>
472	      This works perfectly for UP as well: the spin lock vanishes,
473	      and this macro simply becomes <function>local_irq_disable()</function>
474	      (<filename class="headerfile">include/asm/smp.h</filename>), which
475	      protects you from the softirq/tasklet/BH being run.
476	    </para>
477	
478	    <para>
479	      <function>spin_lock_irqsave()</function> 
480	      (<filename>include/linux/spinlock.h</filename>) is a variant
481	      which saves whether interrupts were on or off in a flags word,
482	      which is passed to <function>spin_unlock_irqrestore()</function>.  This
483	      means that the same code can be used inside an hard irq handler (where
484	      interrupts are already off) and in softirqs (where the irq
485	      disabling is required).
486	    </para>
487	
488	    <para>
489	      Note that softirqs (and hence tasklets and timers) are run on
490	      return from hardware interrupts, so
491	      <function>spin_lock_irq()</function> also stops these.  In that
492	      sense, <function>spin_lock_irqsave()</function> is the most
493	      general and powerful locking function.
494	    </para>
495	
496	   </sect1>
497	   <sect1 id="hardirq-hardirq">
498	    <title>Locking Between Two Hard IRQ Handlers</title>
499	    <para>
500	      It is rare to have to share data between two IRQ handlers, but
501	      if you do, <function>spin_lock_irqsave()</function> should be
502	      used: it is architecture-specific whether all interrupts are
503	      disabled inside irq handlers themselves.
504	    </para>
505	   </sect1>
506	
507	  </chapter>
508	
509	  <chapter id="cheatsheet">
510	   <title>Cheat Sheet For Locking</title>
511	   <para>
512	     Pete Zaitcev gives the following summary:
513	   </para>
514	   <itemizedlist>
515	      <listitem>
516		<para>
517	          If you are in a process context (any syscall) and want to
518		lock other process out, use a semaphore.  You can take a semaphore
519		and sleep (<function>copy_from_user*(</function> or
520		<function>kmalloc(x,GFP_KERNEL)</function>).
521	      </para>
522	      </listitem>
523	      <listitem>
524		<para>
525		Otherwise (== data can be touched in an interrupt), use
526		<function>spin_lock_irqsave()</function> and
527		<function>spin_unlock_irqrestore()</function>.
528		</para>
529	      </listitem>
530	      <listitem>
531		<para>
532		Avoid holding spinlock for more than 5 lines of code and
533		across any function call (except accessors like
534		<function>readb</function>).
535		</para>
536	      </listitem>
537	    </itemizedlist>
538	
539	   <sect1 id="minimum-lock-reqirements">
540	   <title>Table of Minimum Requirements</title>
541	
542	   <para> The following table lists the <emphasis>minimum</emphasis>
543		locking requirements between various contexts.  In some cases,
544		the same context can only be running on one CPU at a time, so
545		no locking is required for that context (eg. a particular
546		thread can only run on one CPU at a time, but if it needs
547		shares data with another thread, locking is required).
548	   </para>
549	   <para>
550		Remember the advice above: you can always use
551		<function>spin_lock_irqsave()</function>, which is a superset
552		of all other spinlock primitives.
553	   </para>
554	
555	   <table>
556	<title>Table of Locking Requirements</title>
557	<tgroup cols="11">
558	<tbody>
559	
560	<row>
561	<entry></entry>
562	<entry>IRQ Handler A</entry>
563	<entry>IRQ Handler B</entry>
564	<entry>Softirq A</entry>
565	<entry>Softirq B</entry>
566	<entry>Tasklet A</entry>
567	<entry>Tasklet B</entry>
568	<entry>Timer A</entry>
569	<entry>Timer B</entry>
570	<entry>User Context A</entry>
571	<entry>User Context B</entry>
572	</row>
573	
574	<row>
575	<entry>IRQ Handler A</entry>
576	<entry>None</entry>
577	</row>
578	
579	<row>
580	<entry>IRQ Handler B</entry>
581	<entry>SLIS</entry>
582	<entry>None</entry>
583	</row>
584	
585	<row>
586	<entry>Softirq A</entry>
587	<entry>SLI</entry>
588	<entry>SLI</entry>
589	<entry>SL</entry>
590	</row>
591	
592	<row>
593	<entry>Softirq B</entry>
594	<entry>SLI</entry>
595	<entry>SLI</entry>
596	<entry>SL</entry>
597	<entry>SL</entry>
598	</row>
599	
600	<row>
601	<entry>Tasklet A</entry>
602	<entry>SLI</entry>
603	<entry>SLI</entry>
604	<entry>SL</entry>
605	<entry>SL</entry>
606	<entry>None</entry>
607	</row>
608	
609	<row>
610	<entry>Tasklet B</entry>
611	<entry>SLI</entry>
612	<entry>SLI</entry>
613	<entry>SL</entry>
614	<entry>SL</entry>
615	<entry>SL</entry>
616	<entry>None</entry>
617	</row>
618	
619	<row>
620	<entry>Timer A</entry>
621	<entry>SLI</entry>
622	<entry>SLI</entry>
623	<entry>SL</entry>
624	<entry>SL</entry>
625	<entry>SL</entry>
626	<entry>SL</entry>
627	<entry>None</entry>
628	</row>
629	
630	<row>
631	<entry>Timer B</entry>
632	<entry>SLI</entry>
633	<entry>SLI</entry>
634	<entry>SL</entry>
635	<entry>SL</entry>
636	<entry>SL</entry>
637	<entry>SL</entry>
638	<entry>SL</entry>
639	<entry>None</entry>
640	</row>
641	
642	<row>
643	<entry>User Context A</entry>
644	<entry>SLI</entry>
645	<entry>SLI</entry>
646	<entry>SLBH</entry>
647	<entry>SLBH</entry>
648	<entry>SLBH</entry>
649	<entry>SLBH</entry>
650	<entry>SLBH</entry>
651	<entry>SLBH</entry>
652	<entry>None</entry>
653	</row>
654	
655	<row>
656	<entry>User Context B</entry>
657	<entry>SLI</entry>
658	<entry>SLI</entry>
659	<entry>SLBH</entry>
660	<entry>SLBH</entry>
661	<entry>SLBH</entry>
662	<entry>SLBH</entry>
663	<entry>SLBH</entry>
664	<entry>SLBH</entry>
665	<entry>DI</entry>
666	<entry>None</entry>
667	</row>
668	
669	</tbody>
670	</tgroup>
671	</table>
672	
673	   <table>
674	<title>Legend for Locking Requirements Table</title>
675	<tgroup cols="2">
676	<tbody>
677	
678	<row>
679	<entry>SLIS</entry>
680	<entry>spin_lock_irqsave</entry>
681	</row>
682	<row>
683	<entry>SLI</entry>
684	<entry>spin_lock_irq</entry>
685	</row>
686	<row>
687	<entry>SL</entry>
688	<entry>spin_lock</entry>
689	</row>
690	<row>
691	<entry>SLBH</entry>
692	<entry>spin_lock_bh</entry>
693	</row>
694	<row>
695	<entry>DI</entry>
696	<entry>down_interruptible</entry>
697	</row>
698	
699	</tbody>
700	</tgroup>
701	</table>
702	
703	</sect1>
704	</chapter>
705	
706	  <chapter id="Examples">
707	   <title>Common Examples</title>
708	    <para>
709	Let's step through a simple example: a cache of number to name
710	mappings.  The cache keeps a count of how often each of the objects is
711	used, and when it gets full, throws out the least used one.
712	
713	    </para>
714	
715	   <sect1 id="examples-usercontext">
716	    <title>All In User Context</title>
717	    <para>
718	For our first example, we assume that all operations are in user
719	context (ie. from system calls), so we can sleep.  This means we can
720	use a mutex to protect the cache and all the objects within
721	it.  Here's the code:
722	    </para>
723	
724	    <programlisting>
725	#include &lt;linux/list.h&gt;
726	#include &lt;linux/slab.h&gt;
727	#include &lt;linux/string.h&gt;
728	#include &lt;linux/mutex.h&gt;
729	#include &lt;asm/errno.h&gt;
730	
731	struct object
732	{
733	        struct list_head list;
734	        int id;
735	        char name[32];
736	        int popularity;
737	};
738	
739	/* Protects the cache, cache_num, and the objects within it */
740	static DEFINE_MUTEX(cache_lock);
741	static LIST_HEAD(cache);
742	static unsigned int cache_num = 0;
743	#define MAX_CACHE_SIZE 10
744	
745	/* Must be holding cache_lock */
746	static struct object *__cache_find(int id)
747	{
748	        struct object *i;
749	
750	        list_for_each_entry(i, &amp;cache, list)
751	                if (i-&gt;id == id) {
752	                        i-&gt;popularity++;
753	                        return i;
754	                }
755	        return NULL;
756	}
757	
758	/* Must be holding cache_lock */
759	static void __cache_delete(struct object *obj)
760	{
761	        BUG_ON(!obj);
762	        list_del(&amp;obj-&gt;list);
763	        kfree(obj);
764	        cache_num--;
765	}
766	
767	/* Must be holding cache_lock */
768	static void __cache_add(struct object *obj)
769	{
770	        list_add(&amp;obj-&gt;list, &amp;cache);
771	        if (++cache_num > MAX_CACHE_SIZE) {
772	                struct object *i, *outcast = NULL;
773	                list_for_each_entry(i, &amp;cache, list) {
774	                        if (!outcast || i-&gt;popularity &lt; outcast-&gt;popularity)
775	                                outcast = i;
776	                }
777	                __cache_delete(outcast);
778	        }
779	}
780	
781	int cache_add(int id, const char *name)
782	{
783	        struct object *obj;
784	
785	        if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
786	                return -ENOMEM;
787	
788	        strlcpy(obj-&gt;name, name, sizeof(obj-&gt;name));
789	        obj-&gt;id = id;
790	        obj-&gt;popularity = 0;
791	
792	        mutex_lock(&amp;cache_lock);
793	        __cache_add(obj);
794	        mutex_unlock(&amp;cache_lock);
795	        return 0;
796	}
797	
798	void cache_delete(int id)
799	{
800	        mutex_lock(&amp;cache_lock);
801	        __cache_delete(__cache_find(id));
802	        mutex_unlock(&amp;cache_lock);
803	}
804	
805	int cache_find(int id, char *name)
806	{
807	        struct object *obj;
808	        int ret = -ENOENT;
809	
810	        mutex_lock(&amp;cache_lock);
811	        obj = __cache_find(id);
812	        if (obj) {
813	                ret = 0;
814	                strcpy(name, obj-&gt;name);
815	        }
816	        mutex_unlock(&amp;cache_lock);
817	        return ret;
818	}
819	</programlisting>
820	
821	    <para>
822	Note that we always make sure we have the cache_lock when we add,
823	delete, or look up the cache: both the cache infrastructure itself and
824	the contents of the objects are protected by the lock.  In this case
825	it's easy, since we copy the data for the user, and never let them
826	access the objects directly.
827	    </para>
828	    <para>
829	There is a slight (and common) optimization here: in
830	<function>cache_add</function> we set up the fields of the object
831	before grabbing the lock.  This is safe, as no-one else can access it
832	until we put it in cache.
833	    </para>
834	    </sect1>
835	
836	   <sect1 id="examples-interrupt">
837	    <title>Accessing From Interrupt Context</title>
838	    <para>
839	Now consider the case where <function>cache_find</function> can be
840	called from interrupt context: either a hardware interrupt or a
841	softirq.  An example would be a timer which deletes object from the
842	cache.
843	    </para>
844	    <para>
845	The change is shown below, in standard patch format: the
846	<symbol>-</symbol> are lines which are taken away, and the
847	<symbol>+</symbol> are lines which are added.
848	    </para>
849	<programlisting>
850	--- cache.c.usercontext	2003-12-09 13:58:54.000000000 +1100
851	+++ cache.c.interrupt	2003-12-09 14:07:49.000000000 +1100
852	@@ -12,7 +12,7 @@
853	         int popularity;
854	 };
855	
856	-static DEFINE_MUTEX(cache_lock);
857	+static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED;
858	 static LIST_HEAD(cache);
859	 static unsigned int cache_num = 0;
860	 #define MAX_CACHE_SIZE 10
861	@@ -55,6 +55,7 @@
862	 int cache_add(int id, const char *name)
863	 {
864	         struct object *obj;
865	+        unsigned long flags;
866	
867	         if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
868	                 return -ENOMEM;
869	@@ -63,30 +64,33 @@
870	         obj-&gt;id = id;
871	         obj-&gt;popularity = 0;
872	
873	-        mutex_lock(&amp;cache_lock);
874	+        spin_lock_irqsave(&amp;cache_lock, flags);
875	         __cache_add(obj);
876	-        mutex_unlock(&amp;cache_lock);
877	+        spin_unlock_irqrestore(&amp;cache_lock, flags);
878	         return 0;
879	 }
880	
881	 void cache_delete(int id)
882	 {
883	-        mutex_lock(&amp;cache_lock);
884	+        unsigned long flags;
885	+
886	+        spin_lock_irqsave(&amp;cache_lock, flags);
887	         __cache_delete(__cache_find(id));
888	-        mutex_unlock(&amp;cache_lock);
889	+        spin_unlock_irqrestore(&amp;cache_lock, flags);
890	 }
891	
892	 int cache_find(int id, char *name)
893	 {
894	         struct object *obj;
895	         int ret = -ENOENT;
896	+        unsigned long flags;
897	
898	-        mutex_lock(&amp;cache_lock);
899	+        spin_lock_irqsave(&amp;cache_lock, flags);
900	         obj = __cache_find(id);
901	         if (obj) {
902	                 ret = 0;
903	                 strcpy(name, obj-&gt;name);
904	         }
905	-        mutex_unlock(&amp;cache_lock);
906	+        spin_unlock_irqrestore(&amp;cache_lock, flags);
907	         return ret;
908	 }
909	</programlisting>
910	
911	    <para>
912	Note that the <function>spin_lock_irqsave</function> will turn off
913	interrupts if they are on, otherwise does nothing (if we are already
914	in an interrupt handler), hence these functions are safe to call from
915	any context.
916	    </para>
917	    <para>
918	Unfortunately, <function>cache_add</function> calls
919	<function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol>
920	flag, which is only legal in user context.  I have assumed that
921	<function>cache_add</function> is still only called in user context,
922	otherwise this should become a parameter to
923	<function>cache_add</function>.
924	    </para>
925	  </sect1>
926	   <sect1 id="examples-refcnt">
927	    <title>Exposing Objects Outside This File</title>
928	    <para>
929	If our objects contained more information, it might not be sufficient
930	to copy the information in and out: other parts of the code might want
931	to keep pointers to these objects, for example, rather than looking up
932	the id every time.  This produces two problems.
933	    </para>
934	    <para>
935	The first problem is that we use the <symbol>cache_lock</symbol> to
936	protect objects: we'd need to make this non-static so the rest of the
937	code can use it.  This makes locking trickier, as it is no longer all
938	in one place.
939	    </para>
940	    <para>
941	The second problem is the lifetime problem: if another structure keeps
942	a pointer to an object, it presumably expects that pointer to remain
943	valid.  Unfortunately, this is only guaranteed while you hold the
944	lock, otherwise someone might call <function>cache_delete</function>
945	and even worse, add another object, re-using the same address.
946	    </para>
947	    <para>
948	As there is only one lock, you can't hold it forever: no-one else would
949	get any work done.
950	    </para>
951	    <para>
952	The solution to this problem is to use a reference count: everyone who
953	has a pointer to the object increases it when they first get the
954	object, and drops the reference count when they're finished with it.
955	Whoever drops it to zero knows it is unused, and can actually delete it.
956	    </para>
957	    <para>
958	Here is the code:
959	    </para>
960	
961	<programlisting>
962	--- cache.c.interrupt	2003-12-09 14:25:43.000000000 +1100
963	+++ cache.c.refcnt	2003-12-09 14:33:05.000000000 +1100
964	@@ -7,6 +7,7 @@
965	 struct object
966	 {
967	         struct list_head list;
968	+        unsigned int refcnt;
969	         int id;
970	         char name[32];
971	         int popularity;
972	@@ -17,6 +18,35 @@
973	 static unsigned int cache_num = 0;
974	 #define MAX_CACHE_SIZE 10
975	
976	+static void __object_put(struct object *obj)
977	+{
978	+        if (--obj-&gt;refcnt == 0)
979	+                kfree(obj);
980	+}
981	+
982	+static void __object_get(struct object *obj)
983	+{
984	+        obj-&gt;refcnt++;
985	+}
986	+
987	+void object_put(struct object *obj)
988	+{
989	+        unsigned long flags;
990	+
991	+        spin_lock_irqsave(&amp;cache_lock, flags);
992	+        __object_put(obj);
993	+        spin_unlock_irqrestore(&amp;cache_lock, flags);
994	+}
995	+
996	+void object_get(struct object *obj)
997	+{
998	+        unsigned long flags;
999	+
1000	+        spin_lock_irqsave(&amp;cache_lock, flags);
1001	+        __object_get(obj);
1002	+        spin_unlock_irqrestore(&amp;cache_lock, flags);
1003	+}
1004	+
1005	 /* Must be holding cache_lock */
1006	 static struct object *__cache_find(int id)
1007	 {
1008	@@ -35,6 +65,7 @@
1009	 {
1010	         BUG_ON(!obj);
1011	         list_del(&amp;obj-&gt;list);
1012	+        __object_put(obj);
1013	         cache_num--;
1014	 }
1015	
1016	@@ -63,6 +94,7 @@
1017	         strlcpy(obj-&gt;name, name, sizeof(obj-&gt;name));
1018	         obj-&gt;id = id;
1019	         obj-&gt;popularity = 0;
1020	+        obj-&gt;refcnt = 1; /* The cache holds a reference */
1021	
1022	         spin_lock_irqsave(&amp;cache_lock, flags);
1023	         __cache_add(obj);
1024	@@ -79,18 +111,15 @@
1025	         spin_unlock_irqrestore(&amp;cache_lock, flags);
1026	 }
1027	
1028	-int cache_find(int id, char *name)
1029	+struct object *cache_find(int id)
1030	 {
1031	         struct object *obj;
1032	-        int ret = -ENOENT;
1033	         unsigned long flags;
1034	
1035	         spin_lock_irqsave(&amp;cache_lock, flags);
1036	         obj = __cache_find(id);
1037	-        if (obj) {
1038	-                ret = 0;
1039	-                strcpy(name, obj-&gt;name);
1040	-        }
1041	+        if (obj)
1042	+                __object_get(obj);
1043	         spin_unlock_irqrestore(&amp;cache_lock, flags);
1044	-        return ret;
1045	+        return obj;
1046	 }
1047	</programlisting>
1048	
1049	<para>
1050	We encapsulate the reference counting in the standard 'get' and 'put'
1051	functions.  Now we can return the object itself from
1052	<function>cache_find</function> which has the advantage that the user
1053	can now sleep holding the object (eg. to
1054	<function>copy_to_user</function> to name to userspace).
1055	</para>
1056	<para>
1057	The other point to note is that I said a reference should be held for
1058	every pointer to the object: thus the reference count is 1 when first
1059	inserted into the cache.  In some versions the framework does not hold
1060	a reference count, but they are more complicated.
1061	</para>
1062	
1063	   <sect2 id="examples-refcnt-atomic">
1064	    <title>Using Atomic Operations For The Reference Count</title>
1065	<para>
1066	In practice, <type>atomic_t</type> would usually be used for
1067	<structfield>refcnt</structfield>.  There are a number of atomic
1068	operations defined in
1069	
1070	<filename class="headerfile">include/asm/atomic.h</filename>: these are
1071	guaranteed to be seen atomically from all CPUs in the system, so no
1072	lock is required.  In this case, it is simpler than using spinlocks,
1073	although for anything non-trivial using spinlocks is clearer.  The
1074	<function>atomic_inc</function> and
1075	<function>atomic_dec_and_test</function> are used instead of the
1076	standard increment and decrement operators, and the lock is no longer
1077	used to protect the reference count itself.
1078	</para>
1079	
1080	<programlisting>
1081	--- cache.c.refcnt	2003-12-09 15:00:35.000000000 +1100
1082	+++ cache.c.refcnt-atomic	2003-12-11 15:49:42.000000000 +1100
1083	@@ -7,7 +7,7 @@
1084	 struct object
1085	 {
1086	         struct list_head list;
1087	-        unsigned int refcnt;
1088	+        atomic_t refcnt;
1089	         int id;
1090	         char name[32];
1091	         int popularity;
1092	@@ -18,33 +18,15 @@
1093	 static unsigned int cache_num = 0;
1094	 #define MAX_CACHE_SIZE 10
1095	
1096	-static void __object_put(struct object *obj)
1097	-{
1098	-        if (--obj-&gt;refcnt == 0)
1099	-                kfree(obj);
1100	-}
1101	-
1102	-static void __object_get(struct object *obj)
1103	-{
1104	-        obj-&gt;refcnt++;
1105	-}
1106	-
1107	 void object_put(struct object *obj)
1108	 {
1109	-        unsigned long flags;
1110	-
1111	-        spin_lock_irqsave(&amp;cache_lock, flags);
1112	-        __object_put(obj);
1113	-        spin_unlock_irqrestore(&amp;cache_lock, flags);
1114	+        if (atomic_dec_and_test(&amp;obj-&gt;refcnt))
1115	+                kfree(obj);
1116	 }
1117	
1118	 void object_get(struct object *obj)
1119	 {
1120	-        unsigned long flags;
1121	-
1122	-        spin_lock_irqsave(&amp;cache_lock, flags);
1123	-        __object_get(obj);
1124	-        spin_unlock_irqrestore(&amp;cache_lock, flags);
1125	+        atomic_inc(&amp;obj-&gt;refcnt);
1126	 }
1127	
1128	 /* Must be holding cache_lock */
1129	@@ -65,7 +47,7 @@
1130	 {
1131	         BUG_ON(!obj);
1132	         list_del(&amp;obj-&gt;list);
1133	-        __object_put(obj);
1134	+        object_put(obj);
1135	         cache_num--;
1136	 }
1137	
1138	@@ -94,7 +76,7 @@
1139	         strlcpy(obj-&gt;name, name, sizeof(obj-&gt;name));
1140	         obj-&gt;id = id;
1141	         obj-&gt;popularity = 0;
1142	-        obj-&gt;refcnt = 1; /* The cache holds a reference */
1143	+        atomic_set(&amp;obj-&gt;refcnt, 1); /* The cache holds a reference */
1144	
1145	         spin_lock_irqsave(&amp;cache_lock, flags);
1146	         __cache_add(obj);
1147	@@ -119,7 +101,7 @@
1148	         spin_lock_irqsave(&amp;cache_lock, flags);
1149	         obj = __cache_find(id);
1150	         if (obj)
1151	-                __object_get(obj);
1152	+                object_get(obj);
1153	         spin_unlock_irqrestore(&amp;cache_lock, flags);
1154	         return obj;
1155	 }
1156	</programlisting>
1157	</sect2>
1158	</sect1>
1159	
1160	   <sect1 id="examples-lock-per-obj">
1161	    <title>Protecting The Objects Themselves</title>
1162	    <para>
1163	In these examples, we assumed that the objects (except the reference
1164	counts) never changed once they are created.  If we wanted to allow
1165	the name to change, there are three possibilities:
1166	    </para>
1167	    <itemizedlist>
1168	      <listitem>
1169		<para>
1170	You can make <symbol>cache_lock</symbol> non-static, and tell people
1171	to grab that lock before changing the name in any object.
1172	        </para>
1173	      </listitem>
1174	      <listitem>
1175	        <para>
1176	You can provide a <function>cache_obj_rename</function> which grabs
1177	this lock and changes the name for the caller, and tell everyone to
1178	use that function.
1179	        </para>
1180	      </listitem>
1181	      <listitem>
1182	        <para>
1183	You can make the <symbol>cache_lock</symbol> protect only the cache
1184	itself, and use another lock to protect the name.
1185	        </para>
1186	      </listitem>
1187	    </itemizedlist>
1188	
1189	      <para>
1190	Theoretically, you can make the locks as fine-grained as one lock for
1191	every field, for every object.  In practice, the most common variants
1192	are:
1193	</para>
1194	    <itemizedlist>
1195	      <listitem>
1196		<para>
1197	One lock which protects the infrastructure (the <symbol>cache</symbol>
1198	list in this example) and all the objects.  This is what we have done
1199	so far.
1200		</para>
1201	      </listitem>
1202	      <listitem>
1203	        <para>
1204	One lock which protects the infrastructure (including the list
1205	pointers inside the objects), and one lock inside the object which
1206	protects the rest of that object.
1207	        </para>
1208	      </listitem>
1209	      <listitem>
1210	        <para>
1211	Multiple locks to protect the infrastructure (eg. one lock per hash
1212	chain), possibly with a separate per-object lock.
1213	        </para>
1214	      </listitem>
1215	    </itemizedlist>
1216	
1217	<para>
1218	Here is the "lock-per-object" implementation:
1219	</para>
1220	<programlisting>
1221	--- cache.c.refcnt-atomic	2003-12-11 15:50:54.000000000 +1100
1222	+++ cache.c.perobjectlock	2003-12-11 17:15:03.000000000 +1100
1223	@@ -6,11 +6,17 @@
1224	
1225	 struct object
1226	 {
1227	+        /* These two protected by cache_lock. */
1228	         struct list_head list;
1229	+        int popularity;
1230	+
1231	         atomic_t refcnt;
1232	+
1233	+        /* Doesn't change once created. */
1234	         int id;
1235	+
1236	+        spinlock_t lock; /* Protects the name */
1237	         char name[32];
1238	-        int popularity;
1239	 };
1240	
1241	 static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED;
1242	@@ -77,6 +84,7 @@
1243	         obj-&gt;id = id;
1244	         obj-&gt;popularity = 0;
1245	         atomic_set(&amp;obj-&gt;refcnt, 1); /* The cache holds a reference */
1246	+        spin_lock_init(&amp;obj-&gt;lock);
1247	
1248	         spin_lock_irqsave(&amp;cache_lock, flags);
1249	         __cache_add(obj);
1250	</programlisting>
1251	
1252	<para>
1253	Note that I decide that the <structfield>popularity</structfield>
1254	count should be protected by the <symbol>cache_lock</symbol> rather
1255	than the per-object lock: this is because it (like the
1256	<structname>struct list_head</structname> inside the object) is
1257	logically part of the infrastructure.  This way, I don't need to grab
1258	the lock of every object in <function>__cache_add</function> when
1259	seeking the least popular.
1260	</para>
1261	
1262	<para>
1263	I also decided that the <structfield>id</structfield> member is
1264	unchangeable, so I don't need to grab each object lock in
1265	<function>__cache_find()</function> to examine the
1266	<structfield>id</structfield>: the object lock is only used by a
1267	caller who wants to read or write the <structfield>name</structfield>
1268	field.
1269	</para>
1270	
1271	<para>
1272	Note also that I added a comment describing what data was protected by
1273	which locks.  This is extremely important, as it describes the runtime
1274	behavior of the code, and can be hard to gain from just reading.  And
1275	as Alan Cox says, <quote>Lock data, not code</quote>.
1276	</para>
1277	</sect1>
1278	</chapter>
1279	
1280	   <chapter id="common-problems">
1281	    <title>Common Problems</title>
1282	    <sect1 id="deadlock">
1283	    <title>Deadlock: Simple and Advanced</title>
1284	
1285	    <para>
1286	      There is a coding bug where a piece of code tries to grab a
1287	      spinlock twice: it will spin forever, waiting for the lock to
1288	      be released (spinlocks, rwlocks and semaphores are not
1289	      recursive in Linux).  This is trivial to diagnose: not a
1290	      stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
1291	      problem.
1292	    </para>
1293	
1294	    <para>
1295	      For a slightly more complex case, imagine you have a region
1296	      shared by a softirq and user context.  If you use a
1297	      <function>spin_lock()</function> call to protect it, it is 
1298	      possible that the user context will be interrupted by the softirq
1299	      while it holds the lock, and the softirq will then spin
1300	      forever trying to get the same lock.
1301	    </para>
1302	
1303	    <para>
1304	      Both of these are called deadlock, and as shown above, it can
1305	      occur even with a single CPU (although not on UP compiles,
1306	      since spinlocks vanish on kernel compiles with 
1307	      <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption 
1308	      in the second example).
1309	    </para>
1310	
1311	    <para>
1312	      This complete lockup is easy to diagnose: on SMP boxes the
1313	      watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set
1314	      (<filename>include/linux/spinlock.h</filename>) will show this up 
1315	      immediately when it happens.
1316	    </para>
1317	
1318	    <para>
1319	      A more complex problem is the so-called 'deadly embrace',
1320	      involving two or more locks.  Say you have a hash table: each
1321	      entry in the table is a spinlock, and a chain of hashed
1322	      objects.  Inside a softirq handler, you sometimes want to
1323	      alter an object from one place in the hash to another: you
1324	      grab the spinlock of the old hash chain and the spinlock of
1325	      the new hash chain, and delete the object from the old one,
1326	      and insert it in the new one.
1327	    </para>
1328	
1329	    <para>
1330	      There are two problems here.  First, if your code ever
1331	      tries to move the object to the same chain, it will deadlock
1332	      with itself as it tries to lock it twice.  Secondly, if the
1333	      same softirq on another CPU is trying to move another object
1334	      in the reverse direction, the following could happen:
1335	    </para>
1336	
1337	    <table>
1338	     <title>Consequences</title>
1339	
1340	     <tgroup cols="2" align="left">
1341	
1342	      <thead>
1343	       <row>
1344	        <entry>CPU 1</entry>
1345	        <entry>CPU 2</entry>
1346	       </row>
1347	      </thead>
1348	
1349	      <tbody>
1350	       <row>
1351	        <entry>Grab lock A -&gt; OK</entry>
1352	        <entry>Grab lock B -&gt; OK</entry>
1353	       </row>
1354	       <row>
1355	        <entry>Grab lock B -&gt; spin</entry>
1356	        <entry>Grab lock A -&gt; spin</entry>
1357	       </row>
1358	      </tbody>
1359	     </tgroup>
1360	    </table>
1361	
1362	    <para>
1363	      The two CPUs will spin forever, waiting for the other to give up
1364	      their lock.  It will look, smell, and feel like a crash.
1365	    </para>
1366	    </sect1>
1367	
1368	    <sect1 id="techs-deadlock-prevent">
1369	     <title>Preventing Deadlock</title>
1370	
1371	     <para>
1372	       Textbooks will tell you that if you always lock in the same
1373	       order, you will never get this kind of deadlock.  Practice
1374	       will tell you that this approach doesn't scale: when I
1375	       create a new lock, I don't understand enough of the kernel
1376	       to figure out where in the 5000 lock hierarchy it will fit.
1377	     </para>
1378	
1379	     <para>
1380	       The best locks are encapsulated: they never get exposed in
1381	       headers, and are never held around calls to non-trivial
1382	       functions outside the same file.  You can read through this
1383	       code and see that it will never deadlock, because it never
1384	       tries to grab another lock while it has that one.  People
1385	       using your code don't even need to know you are using a
1386	       lock.
1387	     </para>