About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / trace / ring-buffer-design.txt




Custom Search

Based on kernel version 3.15.4. Page generated on 2014-07-07 09:04 EST.

1			Lockless Ring Buffer Design
2			===========================
3	
4	Copyright 2009 Red Hat Inc.
5	   Author:   Steven Rostedt <srostedt@redhat.com>
6	  License:   The GNU Free Documentation License, Version 1.2
7	               (dual licensed under the GPL v2)
8	Reviewers:   Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
9		     and Frederic Weisbecker.
10	
11	
12	Written for: 2.6.31
13	
14	Terminology used in this Document
15	---------------------------------
16	
17	tail - where new writes happen in the ring buffer.
18	
19	head - where new reads happen in the ring buffer.
20	
21	producer - the task that writes into the ring buffer (same as writer)
22	
23	writer - same as producer
24	
25	consumer - the task that reads from the buffer (same as reader)
26	
27	reader - same as consumer.
28	
29	reader_page - A page outside the ring buffer used solely (for the most part)
30	    by the reader.
31	
32	head_page - a pointer to the page that the reader will use next
33	
34	tail_page - a pointer to the page that will be written to next
35	
36	commit_page - a pointer to the page with the last finished non-nested write.
37	
38	cmpxchg - hardware-assisted atomic transaction that performs the following:
39	
40	   A = B iff previous A == C
41	
42	   R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
43	      current A is equal to C, and we put the old (current) A into R
44	
45	   R gets the previous A regardless if A is updated with B or not.
46	
47	   To see if the update was successful a compare of R == C may be used.
48	
49	The Generic Ring Buffer
50	-----------------------
51	
52	The ring buffer can be used in either an overwrite mode or in
53	producer/consumer mode.
54	
55	Producer/consumer mode is where if the producer were to fill up the
56	buffer before the consumer could free up anything, the producer
57	will stop writing to the buffer. This will lose most recent events.
58	
59	Overwrite mode is where if the producer were to fill up the buffer
60	before the consumer could free up anything, the producer will
61	overwrite the older data. This will lose the oldest events.
62	
63	No two writers can write at the same time (on the same per-cpu buffer),
64	but a writer may interrupt another writer, but it must finish writing
65	before the previous writer may continue. This is very important to the
66	algorithm. The writers act like a "stack". The way interrupts works
67	enforces this behavior.
68	
69	
70	  writer1 start
71	     <preempted> writer2 start
72	         <preempted> writer3 start
73	                     writer3 finishes
74	                 writer2 finishes
75	  writer1 finishes
76	
77	This is very much like a writer being preempted by an interrupt and
78	the interrupt doing a write as well.
79	
80	Readers can happen at any time. But no two readers may run at the
81	same time, nor can a reader preempt/interrupt another reader. A reader
82	cannot preempt/interrupt a writer, but it may read/consume from the
83	buffer at the same time as a writer is writing, but the reader must be
84	on another processor to do so. A reader may read on its own processor
85	and can be preempted by a writer.
86	
87	A writer can preempt a reader, but a reader cannot preempt a writer.
88	But a reader can read the buffer at the same time (on another processor)
89	as a writer.
90	
91	The ring buffer is made up of a list of pages held together by a linked list.
92	
93	At initialization a reader page is allocated for the reader that is not
94	part of the ring buffer.
95	
96	The head_page, tail_page and commit_page are all initialized to point
97	to the same page.
98	
99	The reader page is initialized to have its next pointer pointing to
100	the head page, and its previous pointer pointing to a page before
101	the head page.
102	
103	The reader has its own page to use. At start up time, this page is
104	allocated but is not attached to the list. When the reader wants
105	to read from the buffer, if its page is empty (like it is on start-up),
106	it will swap its page with the head_page. The old reader page will
107	become part of the ring buffer and the head_page will be removed.
108	The page after the inserted page (old reader_page) will become the
109	new head page.
110	
111	Once the new page is given to the reader, the reader could do what
112	it wants with it, as long as a writer has left that page.
113	
114	A sample of how the reader page is swapped: Note this does not
115	show the head page in the buffer, it is for demonstrating a swap
116	only.
117	
118	  +------+
119	  |reader|          RING BUFFER
120	  |page  |
121	  +------+
122	                  +---+   +---+   +---+
123	                  |   |-->|   |-->|   |
124	                  |   |<--|   |<--|   |
125	                  +---+   +---+   +---+
126	                   ^ |             ^ |
127	                   | +-------------+ |
128	                   +-----------------+
129	
130	
131	  +------+
132	  |reader|          RING BUFFER
133	  |page  |-------------------+
134	  +------+                   v
135	    |             +---+   +---+   +---+
136	    |             |   |-->|   |-->|   |
137	    |             |   |<--|   |<--|   |<-+
138	    |             +---+   +---+   +---+  |
139	    |              ^ |             ^ |   |
140	    |              | +-------------+ |   |
141	    |              +-----------------+   |
142	    +------------------------------------+
143	
144	  +------+
145	  |reader|          RING BUFFER
146	  |page  |-------------------+
147	  +------+ <---------------+ v
148	    |  ^          +---+   +---+   +---+
149	    |  |          |   |-->|   |-->|   |
150	    |  |          |   |   |   |<--|   |<-+
151	    |  |          +---+   +---+   +---+  |
152	    |  |             |             ^ |   |
153	    |  |             +-------------+ |   |
154	    |  +-----------------------------+   |
155	    +------------------------------------+
156	
157	  +------+
158	  |buffer|          RING BUFFER
159	  |page  |-------------------+
160	  +------+ <---------------+ v
161	    |  ^          +---+   +---+   +---+
162	    |  |          |   |   |   |-->|   |
163	    |  |  New     |   |   |   |<--|   |<-+
164	    |  | Reader   +---+   +---+   +---+  |
165	    |  |  page ----^                 |   |
166	    |  |                             |   |
167	    |  +-----------------------------+   |
168	    +------------------------------------+
169	
170	
171	
172	It is possible that the page swapped is the commit page and the tail page,
173	if what is in the ring buffer is less than what is held in a buffer page.
174	
175	
176	          reader page    commit page   tail page
177	              |              |             |
178	              v              |             |
179	             +---+           |             |
180	             |   |<----------+             |
181	             |   |<------------------------+
182	             |   |------+
183	             +---+      |
184	                        |
185	                        v
186	    +---+    +---+    +---+    +---+
187	<---|   |--->|   |--->|   |--->|   |--->
188	--->|   |<---|   |<---|   |<---|   |<---
189	    +---+    +---+    +---+    +---+
190	
191	This case is still valid for this algorithm.
192	When the writer leaves the page, it simply goes into the ring buffer
193	since the reader page still points to the next location in the ring
194	buffer.
195	
196	
197	The main pointers:
198	
199	  reader page - The page used solely by the reader and is not part
200	                of the ring buffer (may be swapped in)
201	
202	  head page - the next page in the ring buffer that will be swapped
203	              with the reader page.
204	
205	  tail page - the page where the next write will take place.
206	
207	  commit page - the page that last finished a write.
208	
209	The commit page only is updated by the outermost writer in the
210	writer stack. A writer that preempts another writer will not move the
211	commit page.
212	
213	When data is written into the ring buffer, a position is reserved
214	in the ring buffer and passed back to the writer. When the writer
215	is finished writing data into that position, it commits the write.
216	
217	Another write (or a read) may take place at anytime during this
218	transaction. If another write happens it must finish before continuing
219	with the previous write.
220	
221	
222	   Write reserve:
223	
224	       Buffer page
225	      +---------+
226	      |written  |
227	      +---------+  <--- given back to writer (current commit)
228	      |reserved |
229	      +---------+ <--- tail pointer
230	      | empty   |
231	      +---------+
232	
233	   Write commit:
234	
235	       Buffer page
236	      +---------+
237	      |written  |
238	      +---------+
239	      |written  |
240	      +---------+  <--- next position for write (current commit)
241	      | empty   |
242	      +---------+
243	
244	
245	 If a write happens after the first reserve:
246	
247	       Buffer page
248	      +---------+
249	      |written  |
250	      +---------+  <-- current commit
251	      |reserved |
252	      +---------+  <--- given back to second writer
253	      |reserved |
254	      +---------+ <--- tail pointer
255	
256	  After second writer commits:
257	
258	
259	       Buffer page
260	      +---------+
261	      |written  |
262	      +---------+  <--(last full commit)
263	      |reserved |
264	      +---------+
265	      |pending  |
266	      |commit   |
267	      +---------+ <--- tail pointer
268	
269	  When the first writer commits:
270	
271	       Buffer page
272	      +---------+
273	      |written  |
274	      +---------+
275	      |written  |
276	      +---------+
277	      |written  |
278	      +---------+  <--(last full commit and tail pointer)
279	
280	
281	The commit pointer points to the last write location that was
282	committed without preempting another write. When a write that
283	preempted another write is committed, it only becomes a pending commit
284	and will not be a full commit until all writes have been committed.
285	
286	The commit page points to the page that has the last full commit.
287	The tail page points to the page with the last write (before
288	committing).
289	
290	The tail page is always equal to or after the commit page. It may
291	be several pages ahead. If the tail page catches up to the commit
292	page then no more writes may take place (regardless of the mode
293	of the ring buffer: overwrite and produce/consumer).
294	
295	The order of pages is:
296	
297	 head page
298	 commit page
299	 tail page
300	
301	Possible scenario:
302	                             tail page
303	  head page         commit page  |
304	      |                 |        |
305	      v                 v        v
306	    +---+    +---+    +---+    +---+
307	<---|   |--->|   |--->|   |--->|   |--->
308	--->|   |<---|   |<---|   |<---|   |<---
309	    +---+    +---+    +---+    +---+
310	
311	There is a special case that the head page is after either the commit page
312	and possibly the tail page. That is when the commit (and tail) page has been
313	swapped with the reader page. This is because the head page is always
314	part of the ring buffer, but the reader page is not. Whenever there
315	has been less than a full page that has been committed inside the ring buffer,
316	and a reader swaps out a page, it will be swapping out the commit page.
317	
318	
319	          reader page    commit page   tail page
320	              |              |             |
321	              v              |             |
322	             +---+           |             |
323	             |   |<----------+             |
324	             |   |<------------------------+
325	             |   |------+
326	             +---+      |
327	                        |
328	                        v
329	    +---+    +---+    +---+    +---+
330	<---|   |--->|   |--->|   |--->|   |--->
331	--->|   |<---|   |<---|   |<---|   |<---
332	    +---+    +---+    +---+    +---+
333	                        ^
334	                        |
335	                    head page
336	
337	
338	In this case, the head page will not move when the tail and commit
339	move back into the ring buffer.
340	
341	The reader cannot swap a page into the ring buffer if the commit page
342	is still on that page. If the read meets the last commit (real commit
343	not pending or reserved), then there is nothing more to read.
344	The buffer is considered empty until another full commit finishes.
345	
346	When the tail meets the head page, if the buffer is in overwrite mode,
347	the head page will be pushed ahead one. If the buffer is in producer/consumer
348	mode, the write will fail.
349	
350	Overwrite mode:
351	
352	            tail page
353	               |
354	               v
355	    +---+    +---+    +---+    +---+
356	<---|   |--->|   |--->|   |--->|   |--->
357	--->|   |<---|   |<---|   |<---|   |<---
358	    +---+    +---+    +---+    +---+
359	                        ^
360	                        |
361	                    head page
362	
363	
364	            tail page
365	               |
366	               v
367	    +---+    +---+    +---+    +---+
368	<---|   |--->|   |--->|   |--->|   |--->
369	--->|   |<---|   |<---|   |<---|   |<---
370	    +---+    +---+    +---+    +---+
371	                                 ^
372	                                 |
373	                             head page
374	
375	
376	                    tail page
377	                        |
378	                        v
379	    +---+    +---+    +---+    +---+
380	<---|   |--->|   |--->|   |--->|   |--->
381	--->|   |<---|   |<---|   |<---|   |<---
382	    +---+    +---+    +---+    +---+
383	                                 ^
384	                                 |
385	                             head page
386	
387	Note, the reader page will still point to the previous head page.
388	But when a swap takes place, it will use the most recent head page.
389	
390	
391	Making the Ring Buffer Lockless:
392	--------------------------------
393	
394	The main idea behind the lockless algorithm is to combine the moving
395	of the head_page pointer with the swapping of pages with the reader.
396	State flags are placed inside the pointer to the page. To do this,
397	each page must be aligned in memory by 4 bytes. This will allow the 2
398	least significant bits of the address to be used as flags, since
399	they will always be zero for the address. To get the address,
400	simply mask out the flags.
401	
402	  MASK = ~3
403	
404	  address & MASK
405	
406	Two flags will be kept by these two bits:
407	
408	   HEADER - the page being pointed to is a head page
409	
410	   UPDATE - the page being pointed to is being updated by a writer
411	          and was or is about to be a head page.
412	
413	
414	          reader page
415	              |
416	              v
417	             +---+
418	             |   |------+
419	             +---+      |
420	                        |
421	                        v
422	    +---+    +---+    +---+    +---+
423	<---|   |--->|   |-H->|   |--->|   |--->
424	--->|   |<---|   |<---|   |<---|   |<---
425	    +---+    +---+    +---+    +---+
426	
427	
428	The above pointer "-H->" would have the HEADER flag set. That is
429	the next page is the next page to be swapped out by the reader.
430	This pointer means the next page is the head page.
431	
432	When the tail page meets the head pointer, it will use cmpxchg to
433	change the pointer to the UPDATE state:
434	
435	
436	            tail page
437	               |
438	               v
439	    +---+    +---+    +---+    +---+
440	<---|   |--->|   |-H->|   |--->|   |--->
441	--->|   |<---|   |<---|   |<---|   |<---
442	    +---+    +---+    +---+    +---+
443	
444	            tail page
445	               |
446	               v
447	    +---+    +---+    +---+    +---+
448	<---|   |--->|   |-U->|   |--->|   |--->
449	--->|   |<---|   |<---|   |<---|   |<---
450	    +---+    +---+    +---+    +---+
451	
452	"-U->" represents a pointer in the UPDATE state.
453	
454	Any access to the reader will need to take some sort of lock to serialize
455	the readers. But the writers will never take a lock to write to the
456	ring buffer. This means we only need to worry about a single reader,
457	and writes only preempt in "stack" formation.
458	
459	When the reader tries to swap the page with the ring buffer, it
460	will also use cmpxchg. If the flag bit in the pointer to the
461	head page does not have the HEADER flag set, the compare will fail
462	and the reader will need to look for the new head page and try again.
463	Note, the flags UPDATE and HEADER are never set at the same time.
464	
465	The reader swaps the reader page as follows:
466	
467	  +------+
468	  |reader|          RING BUFFER
469	  |page  |
470	  +------+
471	                  +---+    +---+    +---+
472	                  |   |--->|   |--->|   |
473	                  |   |<---|   |<---|   |
474	                  +---+    +---+    +---+
475	                   ^ |               ^ |
476	                   | +---------------+ |
477	                   +-----H-------------+
478	
479	The reader sets the reader page next pointer as HEADER to the page after
480	the head page.
481	
482	
483	  +------+
484	  |reader|          RING BUFFER
485	  |page  |-------H-----------+
486	  +------+                   v
487	    |             +---+    +---+    +---+
488	    |             |   |--->|   |--->|   |
489	    |             |   |<---|   |<---|   |<-+
490	    |             +---+    +---+    +---+  |
491	    |              ^ |               ^ |   |
492	    |              | +---------------+ |   |
493	    |              +-----H-------------+   |
494	    +--------------------------------------+
495	
496	It does a cmpxchg with the pointer to the previous head page to make it
497	point to the reader page. Note that the new pointer does not have the HEADER
498	flag set.  This action atomically moves the head page forward.
499	
500	  +------+
501	  |reader|          RING BUFFER
502	  |page  |-------H-----------+
503	  +------+                   v
504	    |  ^          +---+   +---+   +---+
505	    |  |          |   |-->|   |-->|   |
506	    |  |          |   |<--|   |<--|   |<-+
507	    |  |          +---+   +---+   +---+  |
508	    |  |             |             ^ |   |
509	    |  |             +-------------+ |   |
510	    |  +-----------------------------+   |
511	    +------------------------------------+
512	
513	After the new head page is set, the previous pointer of the head page is
514	updated to the reader page.
515	
516	  +------+
517	  |reader|          RING BUFFER
518	  |page  |-------H-----------+
519	  +------+ <---------------+ v
520	    |  ^          +---+   +---+   +---+
521	    |  |          |   |-->|   |-->|   |
522	    |  |          |   |   |   |<--|   |<-+
523	    |  |          +---+   +---+   +---+  |
524	    |  |             |             ^ |   |
525	    |  |             +-------------+ |   |
526	    |  +-----------------------------+   |
527	    +------------------------------------+
528	
529	  +------+
530	  |buffer|          RING BUFFER
531	  |page  |-------H-----------+  <--- New head page
532	  +------+ <---------------+ v
533	    |  ^          +---+   +---+   +---+
534	    |  |          |   |   |   |-->|   |
535	    |  |  New     |   |   |   |<--|   |<-+
536	    |  | Reader   +---+   +---+   +---+  |
537	    |  |  page ----^                 |   |
538	    |  |                             |   |
539	    |  +-----------------------------+   |
540	    +------------------------------------+
541	
542	Another important point: The page that the reader page points back to
543	by its previous pointer (the one that now points to the new head page)
544	never points back to the reader page. That is because the reader page is
545	not part of the ring buffer. Traversing the ring buffer via the next pointers
546	will always stay in the ring buffer. Traversing the ring buffer via the
547	prev pointers may not.
548	
549	Note, the way to determine a reader page is simply by examining the previous
550	pointer of the page. If the next pointer of the previous page does not
551	point back to the original page, then the original page is a reader page:
552	
553	
554	             +--------+
555	             | reader |  next   +----+
556	             |  page  |-------->|    |<====== (buffer page)
557	             +--------+         +----+
558	                 |                | ^
559	                 |                v | next
560	            prev |              +----+
561	                 +------------->|    |
562	                                +----+
563	
564	The way the head page moves forward:
565	
566	When the tail page meets the head page and the buffer is in overwrite mode
567	and more writes take place, the head page must be moved forward before the
568	writer may move the tail page. The way this is done is that the writer
569	performs a cmpxchg to convert the pointer to the head page from the HEADER
570	flag to have the UPDATE flag set. Once this is done, the reader will
571	not be able to swap the head page from the buffer, nor will it be able to
572	move the head page, until the writer is finished with the move.
573	
574	This eliminates any races that the reader can have on the writer. The reader
575	must spin, and this is why the reader cannot preempt the writer.
576	
577	            tail page
578	               |
579	               v
580	    +---+    +---+    +---+    +---+
581	<---|   |--->|   |-H->|   |--->|   |--->
582	--->|   |<---|   |<---|   |<---|   |<---
583	    +---+    +---+    +---+    +---+
584	
585	            tail page
586	               |
587	               v
588	    +---+    +---+    +---+    +---+
589	<---|   |--->|   |-U->|   |--->|   |--->
590	--->|   |<---|   |<---|   |<---|   |<---
591	    +---+    +---+    +---+    +---+
592	
593	The following page will be made into the new head page.
594	
595	           tail page
596	               |
597	               v
598	    +---+    +---+    +---+    +---+
599	<---|   |--->|   |-U->|   |-H->|   |--->
600	--->|   |<---|   |<---|   |<---|   |<---
601	    +---+    +---+    +---+    +---+
602	
603	After the new head page has been set, we can set the old head page
604	pointer back to NORMAL.
605	
606	           tail page
607	               |
608	               v
609	    +---+    +---+    +---+    +---+
610	<---|   |--->|   |--->|   |-H->|   |--->
611	--->|   |<---|   |<---|   |<---|   |<---
612	    +---+    +---+    +---+    +---+
613	
614	After the head page has been moved, the tail page may now move forward.
615	
616	                    tail page
617	                        |
618	                        v
619	    +---+    +---+    +---+    +---+
620	<---|   |--->|   |--->|   |-H->|   |--->
621	--->|   |<---|   |<---|   |<---|   |<---
622	    +---+    +---+    +---+    +---+
623	
624	
625	The above are the trivial updates. Now for the more complex scenarios.
626	
627	
628	As stated before, if enough writes preempt the first write, the
629	tail page may make it all the way around the buffer and meet the commit
630	page. At this time, we must start dropping writes (usually with some kind
631	of warning to the user). But what happens if the commit was still on the
632	reader page? The commit page is not part of the ring buffer. The tail page
633	must account for this.
634	
635	
636	          reader page    commit page
637	              |              |
638	              v              |
639	             +---+           |
640	             |   |<----------+
641	             |   |
642	             |   |------+
643	             +---+      |
644	                        |
645	                        v
646	    +---+    +---+    +---+    +---+
647	<---|   |--->|   |-H->|   |--->|   |--->
648	--->|   |<---|   |<---|   |<---|   |<---
649	    +---+    +---+    +---+    +---+
650	               ^
651	               |
652	           tail page
653	
654	If the tail page were to simply push the head page forward, the commit when
655	leaving the reader page would not be pointing to the correct page.
656	
657	The solution to this is to test if the commit page is on the reader page
658	before pushing the head page. If it is, then it can be assumed that the
659	tail page wrapped the buffer, and we must drop new writes.
660	
661	This is not a race condition, because the commit page can only be moved
662	by the outermost writer (the writer that was preempted).
663	This means that the commit will not move while a writer is moving the
664	tail page. The reader cannot swap the reader page if it is also being
665	used as the commit page. The reader can simply check that the commit
666	is off the reader page. Once the commit page leaves the reader page
667	it will never go back on it unless a reader does another swap with the
668	buffer page that is also the commit page.
669	
670	
671	Nested writes
672	-------------
673	
674	In the pushing forward of the tail page we must first push forward
675	the head page if the head page is the next page. If the head page
676	is not the next page, the tail page is simply updated with a cmpxchg.
677	
678	Only writers move the tail page. This must be done atomically to protect
679	against nested writers.
680	
681	  temp_page = tail_page
682	  next_page = temp_page->next
683	  cmpxchg(tail_page, temp_page, next_page)
684	
685	The above will update the tail page if it is still pointing to the expected
686	page. If this fails, a nested write pushed it forward, the current write
687	does not need to push it.
688	
689	
690	           temp page
691	               |
692	               v
693	            tail page
694	               |
695	               v
696	    +---+    +---+    +---+    +---+
697	<---|   |--->|   |--->|   |--->|   |--->
698	--->|   |<---|   |<---|   |<---|   |<---
699	    +---+    +---+    +---+    +---+
700	
701	Nested write comes in and moves the tail page forward:
702	
703	                    tail page (moved by nested writer)
704	            temp page   |
705	               |        |
706	               v        v
707	    +---+    +---+    +---+    +---+
708	<---|   |--->|   |--->|   |--->|   |--->
709	--->|   |<---|   |<---|   |<---|   |<---
710	    +---+    +---+    +---+    +---+
711	
712	The above would fail the cmpxchg, but since the tail page has already
713	been moved forward, the writer will just try again to reserve storage
714	on the new tail page.
715	
716	But the moving of the head page is a bit more complex.
717	
718	            tail page
719	               |
720	               v
721	    +---+    +---+    +---+    +---+
722	<---|   |--->|   |-H->|   |--->|   |--->
723	--->|   |<---|   |<---|   |<---|   |<---
724	    +---+    +---+    +---+    +---+
725	
726	The write converts the head page pointer to UPDATE.
727	
728	            tail page
729	               |
730	               v
731	    +---+    +---+    +---+    +---+
732	<---|   |--->|   |-U->|   |--->|   |--->
733	--->|   |<---|   |<---|   |<---|   |<---
734	    +---+    +---+    +---+    +---+
735	
736	But if a nested writer preempts here, it will see that the next
737	page is a head page, but it is also nested. It will detect that
738	it is nested and will save that information. The detection is the
739	fact that it sees the UPDATE flag instead of a HEADER or NORMAL
740	pointer.
741	
742	The nested writer will set the new head page pointer.
743	
744	           tail page
745	               |
746	               v
747	    +---+    +---+    +---+    +---+
748	<---|   |--->|   |-U->|   |-H->|   |--->
749	--->|   |<---|   |<---|   |<---|   |<---
750	    +---+    +---+    +---+    +---+
751	
752	But it will not reset the update back to normal. Only the writer
753	that converted a pointer from HEAD to UPDATE will convert it back
754	to NORMAL.
755	
756	                    tail page
757	                        |
758	                        v
759	    +---+    +---+    +---+    +---+
760	<---|   |--->|   |-U->|   |-H->|   |--->
761	--->|   |<---|   |<---|   |<---|   |<---
762	    +---+    +---+    +---+    +---+
763	
764	After the nested writer finishes, the outermost writer will convert
765	the UPDATE pointer to NORMAL.
766	
767	
768	                    tail page
769	                        |
770	                        v
771	    +---+    +---+    +---+    +---+
772	<---|   |--->|   |--->|   |-H->|   |--->
773	--->|   |<---|   |<---|   |<---|   |<---
774	    +---+    +---+    +---+    +---+
775	
776	
777	It can be even more complex if several nested writes came in and moved
778	the tail page ahead several pages:
779	
780	
781	(first writer)
782	
783	            tail page
784	               |
785	               v
786	    +---+    +---+    +---+    +---+
787	<---|   |--->|   |-H->|   |--->|   |--->
788	--->|   |<---|   |<---|   |<---|   |<---
789	    +---+    +---+    +---+    +---+
790	
791	The write converts the head page pointer to UPDATE.
792	
793	            tail page
794	               |
795	               v
796	    +---+    +---+    +---+    +---+
797	<---|   |--->|   |-U->|   |--->|   |--->
798	--->|   |<---|   |<---|   |<---|   |<---
799	    +---+    +---+    +---+    +---+
800	
801	Next writer comes in, and sees the update and sets up the new
802	head page.
803	
804	(second writer)
805	
806	           tail page
807	               |
808	               v
809	    +---+    +---+    +---+    +---+
810	<---|   |--->|   |-U->|   |-H->|   |--->
811	--->|   |<---|   |<---|   |<---|   |<---
812	    +---+    +---+    +---+    +---+
813	
814	The nested writer moves the tail page forward. But does not set the old
815	update page to NORMAL because it is not the outermost writer.
816	
817	                    tail page
818	                        |
819	                        v
820	    +---+    +---+    +---+    +---+
821	<---|   |--->|   |-U->|   |-H->|   |--->
822	--->|   |<---|   |<---|   |<---|   |<---
823	    +---+    +---+    +---+    +---+
824	
825	Another writer preempts and sees the page after the tail page is a head page.
826	It changes it from HEAD to UPDATE.
827	
828	(third writer)
829	
830	                    tail page
831	                        |
832	                        v
833	    +---+    +---+    +---+    +---+
834	<---|   |--->|   |-U->|   |-U->|   |--->
835	--->|   |<---|   |<---|   |<---|   |<---
836	    +---+    +---+    +---+    +---+
837	
838	The writer will move the head page forward:
839	
840	
841	(third writer)
842	
843	                    tail page
844	                        |
845	                        v
846	    +---+    +---+    +---+    +---+
847	<---|   |--->|   |-U->|   |-U->|   |-H->
848	--->|   |<---|   |<---|   |<---|   |<---
849	    +---+    +---+    +---+    +---+
850	
851	But now that the third writer did change the HEAD flag to UPDATE it
852	will convert it to normal:
853	
854	
855	(third writer)
856	
857	                    tail page
858	                        |
859	                        v
860	    +---+    +---+    +---+    +---+
861	<---|   |--->|   |-U->|   |--->|   |-H->
862	--->|   |<---|   |<---|   |<---|   |<---
863	    +---+    +---+    +---+    +---+
864	
865	
866	Then it will move the tail page, and return back to the second writer.
867	
868	
869	(second writer)
870	
871	                             tail page
872	                                 |
873	                                 v
874	    +---+    +---+    +---+    +---+
875	<---|   |--->|   |-U->|   |--->|   |-H->
876	--->|   |<---|   |<---|   |<---|   |<---
877	    +---+    +---+    +---+    +---+
878	
879	
880	The second writer will fail to move the tail page because it was already
881	moved, so it will try again and add its data to the new tail page.
882	It will return to the first writer.
883	
884	
885	(first writer)
886	
887	                             tail page
888	                                 |
889	                                 v
890	    +---+    +---+    +---+    +---+
891	<---|   |--->|   |-U->|   |--->|   |-H->
892	--->|   |<---|   |<---|   |<---|   |<---
893	    +---+    +---+    +---+    +---+
894	
895	The first writer cannot know atomically if the tail page moved
896	while it updates the HEAD page. It will then update the head page to
897	what it thinks is the new head page.
898	
899	
900	(first writer)
901	
902	                             tail page
903	                                 |
904	                                 v
905	    +---+    +---+    +---+    +---+
906	<---|   |--->|   |-U->|   |-H->|   |-H->
907	--->|   |<---|   |<---|   |<---|   |<---
908	    +---+    +---+    +---+    +---+
909	
910	Since the cmpxchg returns the old value of the pointer the first writer
911	will see it succeeded in updating the pointer from NORMAL to HEAD.
912	But as we can see, this is not good enough. It must also check to see
913	if the tail page is either where it use to be or on the next page:
914	
915	
916	(first writer)
917	
918	               A        B    tail page
919	               |        |        |
920	               v        v        v
921	    +---+    +---+    +---+    +---+
922	<---|   |--->|   |-U->|   |-H->|   |-H->
923	--->|   |<---|   |<---|   |<---|   |<---
924	    +---+    +---+    +---+    +---+
925	
926	If tail page != A and tail page != B, then it must reset the pointer
927	back to NORMAL. The fact that it only needs to worry about nested
928	writers means that it only needs to check this after setting the HEAD page.
929	
930	
931	(first writer)
932	
933	               A        B    tail page
934	               |        |        |
935	               v        v        v
936	    +---+    +---+    +---+    +---+
937	<---|   |--->|   |-U->|   |--->|   |-H->
938	--->|   |<---|   |<---|   |<---|   |<---
939	    +---+    +---+    +---+    +---+
940	
941	Now the writer can update the head page. This is also why the head page must
942	remain in UPDATE and only reset by the outermost writer. This prevents
943	the reader from seeing the incorrect head page.
944	
945	
946	(first writer)
947	
948	               A        B    tail page
949	               |        |        |
950	               v        v        v
951	    +---+    +---+    +---+    +---+
952	<---|   |--->|   |--->|   |--->|   |-H->
953	--->|   |<---|   |<---|   |<---|   |<---
954	    +---+    +---+    +---+    +---+
Hide Line Numbers
About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.