Documentation / robust-futex-ABI.txt





Based on kernel version 4.13.3. Page generated on 2017-09-23 13:56 EST.

1	====================
2	The robust futex ABI
3	====================
4	
5	:Author: Started by Paul Jackson <pj@sgi.com>
6	
7	
8	Robust_futexes provide a mechanism, used in addition to normal futexes,
9	by which the kernel assists in cleaning up held locks on task exit.
10	
11	The record of which futexes a thread is holding is kept on a
12	linked list in user space, where it can be updated efficiently as locks
13	are taken and dropped, without kernel intervention.  The only additional
14	kernel intervention required for robust_futexes above and beyond what is
15	required for futexes is:
16	
17	 1) a one time call, per thread, to tell the kernel where its list of
18	    held robust_futexes begins, and
19	 2) internal kernel code at exit, to handle any listed locks held
20	    by the exiting thread.
21	
22	The existing normal futexes already provide a "Fast Userspace Locking"
23	mechanism, which handles uncontested locking without needing a system
24	call, and handles contested locking by maintaining a list of waiting
25	threads in the kernel.  Options on the sys_futex(2) system call support
26	waiting on a particular futex, and waking up the next waiter on a
27	particular futex.
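
The two futex operations mentioned above can be sketched as thin wrappers
around the raw system call.  This is a minimal illustration, not the
glibc implementation: the wrapper names futex_wait() and futex_wake() are
hypothetical, while SYS_futex, FUTEX_WAIT and FUTEX_WAKE come from
<sys/syscall.h> and <linux/futex.h>.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: sleep while *addr still equals 'expected'.
 * Returns 0 when woken, or -1 with errno set (EAGAIN if the value
 * at *addr no longer matches 'expected'). */
static long futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Hypothetical wrapper: wake at most 'count' threads waiting on addr.
 * Returns the number of threads woken. */
static long futex_wake(uint32_t *addr, int count)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

The uncontested path never reaches these wrappers; they are only needed
once a lock attempt finds the lock already held.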
28	
29	For robust_futexes to work, the user code (typically in a library such
30	as glibc linked with the application) has to manage and place the
31	necessary list elements exactly as the kernel expects them.  If it fails
32	to do so, then improperly listed locks will not be cleaned up on exit,
33	probably causing deadlock or other such failure of the other threads
34	waiting on the same locks.
35	
36	A thread that anticipates possibly using robust_futexes should first
37	issue the system call::
38	
39	    asmlinkage long
40	    sys_set_robust_list(struct robust_list_head __user *head, size_t len);
41	
42	The pointer 'head' points to a structure in the thread's address space
43	consisting of three words.  Each word is 32 bits on 32 bit architectures,
44	or 64 bits on 64 bit architectures, in native byte order.  Each thread
45	should have its own thread-private 'head'.
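
A minimal sketch of the registration call follows, assuming the raw
system call is reached through syscall(2) and that the structure
definition comes from <linux/futex.h>.  The names my_head and
register_robust_list() are hypothetical.  Note that a threading library
such as glibc typically registers its own list for every thread, so a
call like this replaces that registration; it is shown for illustration
only.

```c
#define _GNU_SOURCE
#include <linux/futex.h>    /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* One head per thread.  It must remain valid until the thread exits,
 * because the kernel reads it at exit time. */
static _Thread_local struct robust_list_head my_head;

/* Hypothetical helper: register an empty list for the calling thread.
 * Returns 0 on success, or -1 with errno set. */
static long register_robust_list(void)
{
    my_head.list.next = &my_head.list;  /* empty list points to itself */
    my_head.futex_offset = 0;           /* placeholder offset */
    my_head.list_op_pending = NULL;

    /* 'len' is the size of the head structure being registered. */
    return syscall(SYS_set_robust_list, &my_head, sizeof(my_head));
}
```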
46	
47	If a thread is running in 32 bit compatibility mode on a 64 bit native
48	arch kernel, then it can actually have two such structures - one using 32
49	bit words for 32 bit compatibility mode, and one using 64 bit words for 64
50	bit native mode.  The kernel, if it is a 64 bit kernel supporting 32 bit
51	compatibility mode, will attempt to process both lists on each task
52	exit, if the corresponding sys_set_robust_list() call has been made to
53	set up that list.
54	
55	  The first word in the memory structure at 'head' contains a
56	  pointer to a singly linked list of 'lock entries', one per lock,
57	  as described below.  If the list is empty, the pointer will point
58	  to itself, 'head'.  The last 'lock entry' points back to the 'head'.
59	
60	  The second word, called 'offset', specifies the signed offset (plus
61	  or minus) from the address of each 'lock entry' to what will be
62	  called that entry's 'lock word'.  The 'lock word'
63	  is always a 32 bit word, unlike the other words above.  The 'lock
64	  word' holds 3 flag bits in the upper 3 bits, and the thread id (TID)
65	  of the thread holding the lock in the bottom 29 bits.  See further
66	  below for a description of the flag bits.
67	
68	  The third word, called 'list_op_pending', contains a transient copy of
69	  the address of the 'lock entry', during list insertion and removal,
70	  and is needed to correctly resolve races should a thread exit while
71	  in the middle of a locking or unlocking operation.
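
The three words just described can be mirrored in a small user space
structure.  The sketch below uses illustrative names (lock_entry,
rlist_head, rlist_init, rlist_count); the definitions a real program
would use are struct robust_list and struct robust_list_head from
<linux/futex.h>.  It assumes 'long' is as wide as a pointer, which holds
for the usual Linux ABIs.

```c
#include <stddef.h>

/* Illustrative mirror of the layout described above. */
struct lock_entry {
    struct lock_entry *next;            /* next entry, or back to head */
};

struct rlist_head {
    struct lock_entry  list;            /* word 1: first 'lock entry'  */
    long               offset;          /* word 2: entry to lock word  */
    struct lock_entry *list_op_pending; /* word 3: entry in transit    */
};

/* An empty list: the first word points back at the head itself. */
static void rlist_init(struct rlist_head *h, long offset)
{
    h->list.next = &h->list;
    h->offset = offset;
    h->list_op_pending = NULL;
}

/* Count entries by following 'next' until it returns to the head. */
static size_t rlist_count(const struct rlist_head *h)
{
    size_t n = 0;

    for (const struct lock_entry *e = h->list.next; e != &h->list; e = e->next)
        n++;
    return n;
}
```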
72	
73	Each 'lock entry' on the singly linked list starting at 'head' consists
74	of just a single word, pointing to the next 'lock entry', or back to
75	'head' if there are no more entries.  In addition, near each 'lock
76	entry', at an offset from the 'lock entry' specified by the 'offset'
77	word, is one 'lock word'.
78	
79	The 'lock word' is always 32 bits, and is intended to be the same 32 bit
80	lock variable used by the futex mechanism, in conjunction with
81	robust_futexes.  The kernel will only be able to wake up the next thread
82	waiting for a lock on a thread's exit if that next thread used the futex
83	mechanism to register the address of that 'lock word' with the kernel.
84	
85	For each futex lock currently held by a thread, if it wants this
86	robust_futex support for exit cleanup of that lock, it should have one
87	'lock entry' on this list, with its associated 'lock word' at the
88	specified 'offset'.  Should a thread die while holding any such locks,
89	the kernel will walk this list, mark any such locks with a bit
90	indicating their holder died, and wake up the next thread waiting for
91	that lock using the futex mechanism.
92	
93	When a thread has invoked the above system call to indicate it
94	anticipates using robust_futexes, the kernel stores the passed in 'head'
95	pointer for that task.  The task may retrieve that value later on by
96	using the system call::
97	
98	    asmlinkage long
99	    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
100	                        size_t __user *len_ptr);
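
As a hedged illustration of the retrieval call, the sketch below asks
for the calling thread's own registration.  It assumes that a 'pid' of 0
selects the calling thread, and the helper name query_robust_list() is
hypothetical.

```c
#define _GNU_SOURCE
#include <linux/futex.h>    /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: fetch the head pointer and length registered
 * for the calling thread.  Returns 0 on success, -1 with errno set. */
static long query_robust_list(struct robust_list_head **head, size_t *len)
{
    /* Assumption: pid 0 means "the calling thread". */
    return syscall(SYS_get_robust_list, 0, head, len);
}
```

On return, '*len' holds the size of the head structure and '*head' the
pointer most recently registered, which in a typical program is the one
set up by the threading library.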
101	
102	It is anticipated that threads will use robust_futexes embedded in
103	larger, user level locking structures, one per lock.  The kernel
104	robust_futex mechanism doesn't care what else is in that structure, so
105	long as the 'offset' to the 'lock word' is the same for all
106	robust_futexes used by that thread.  The thread should link those locks
107	it currently holds using the 'lock entry' pointers.  It may also have
108	other links between the locks, such as the reverse side of a doubly
109	linked list, but that doesn't matter to the kernel.
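
To make the embedding concrete, here is a sketch of a hypothetical user
level lock structure (my_lock) that carries a 'lock word', a 'lock
entry' and some unrelated fields.  The 'offset' is computed once with
offsetof() and is the same for every lock of this type; here it is
negative because the 'lock word' precedes the 'lock entry'.  All names
are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

struct lock_entry {
    struct lock_entry *next;
};

/* Hypothetical user-level lock with the robust futex parts embedded. */
struct my_lock {
    const char       *name;       /* ignored by the kernel            */
    uint32_t          lock_word;  /* the 32 bit futex word            */
    struct lock_entry entry;      /* linked on the thread's list      */
    struct my_lock   *prev;       /* extra back link, also ignored    */
};

/* Signed distance from the 'lock entry' to its 'lock word'. */
#define MY_LOCK_OFFSET \
    ((long)offsetof(struct my_lock, lock_word) - \
     (long)offsetof(struct my_lock, entry))

/* Locate the 'lock word' that belongs to a given 'lock entry'. */
static uint32_t *lock_word_of(struct lock_entry *e, long offset)
{
    return (uint32_t *)((char *)e + offset);
}
```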
110	
111	By keeping its locks linked this way, on a list starting with a 'head'
112	pointer known to the kernel, the kernel can provide to a thread the
113	essential service available for robust_futexes, which is to help clean
114	up locks held at the time of a (perhaps unexpected) exit.
115	
116	Actual locking and unlocking, during normal operations, is handled
117	entirely by user level code in the contending threads, and by the
118	existing futex mechanism to wait for locks and wake up waiters.  The kernel's
119	only essential involvement in robust_futexes is to remember where the
120	list 'head' is, and to walk the list on thread exit, handling locks
121	still held by the departing thread, as described below.
122	
123	There may exist thousands of futex lock structures in a thread's shared
124	memory, on various data structures, at a given point in time. Only those
125	lock structures for locks currently held by that thread should be on
126	that thread's robust_futex linked lock list at a given time.
127	
128	A given futex lock structure in a user shared memory region may be held
129	at different times by any of the threads with access to that region. The
130	thread currently holding such a lock, if any, is identified by its TID,
131	stored in the lower 29 bits of the 'lock word'.
132	
133	When adding or removing a lock from its list of held locks, in order for
134	the kernel to correctly handle lock cleanup regardless of when the task
135	exits (perhaps it gets an unexpected signal 9 in the middle of
136	manipulating this list), the user code must observe the following
137	protocol on 'lock entry' insertion and removal:
138	
139	On insertion:
140	
141	 1) set the 'list_op_pending' word to the address of the 'lock entry'
142	    to be inserted,
143	 2) acquire the futex lock,
144	 3) add the lock entry, with its thread id (TID) in the bottom 29 bits
145	    of the 'lock word', to the linked list starting at 'head', and
146	 4) clear the 'list_op_pending' word.
147	
148	On removal:
149	
150	 1) set the 'list_op_pending' word to the address of the 'lock entry'
151	    to be removed,
152	 2) remove the lock entry for this lock from the 'head' list,
153	 3) release the futex lock, and
154	 4) clear the 'list_op_pending' word.
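
The two protocols above can be sketched in user space as follows.  This
is not the glibc implementation: it covers only the uncontested path (a
trylock and an unlock, with no futex wait or wake), it assumes that a
'lock word' of 0 means "unlocked", and every name in it is hypothetical.
The compiler barriers are there because the order of the stores matters
if the thread is killed part way through.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LW_TID_MASK 0x1fffffffu     /* bottom 29 bits, per the text */

struct lock_entry { struct lock_entry *next; };
struct rlist_head {
    struct lock_entry  list;
    long               offset;
    struct lock_entry *list_op_pending;
};
struct my_lock {                    /* hypothetical user-level lock */
    _Atomic uint32_t  lock_word;    /* assumption: 0 means unlocked */
    struct lock_entry entry;
};

/* Keep the compiler from reordering the protocol steps. */
#define STEP_BARRIER() atomic_signal_fence(memory_order_seq_cst)

static bool robust_trylock(struct rlist_head *h, struct my_lock *l, uint32_t tid)
{
    uint32_t unlocked = 0;

    h->list_op_pending = &l->entry;                 /* 1) announce */
    STEP_BARRIER();
    bool got = atomic_compare_exchange_strong(&l->lock_word, &unlocked,
                                              tid & LW_TID_MASK); /* 2) */
    if (got) {
        l->entry.next = h->list.next;               /* 3) link in  */
        h->list.next = &l->entry;
    }
    STEP_BARRIER();
    h->list_op_pending = NULL;                      /* 4) clear    */
    return got;
}

static bool robust_unlock(struct rlist_head *h, struct my_lock *l)
{
    struct lock_entry *prev = &h->list;

    while (prev->next != &l->entry) {               /* find predecessor */
        if (prev->next == &h->list)
            return false;                           /* not on this list */
        prev = prev->next;
    }
    h->list_op_pending = &l->entry;                 /* 1) announce */
    STEP_BARRIER();
    prev->next = l->entry.next;                     /* 2) unlink   */
    STEP_BARRIER();
    atomic_store(&l->lock_word, 0);                 /* 3) release  */
    STEP_BARRIER();
    h->list_op_pending = NULL;                      /* 4) clear    */
    return true;
}
```

A complete unlock would also check whether waiters are recorded in the
'lock word' and issue a futex wakeup; that part is left out here.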
155	
156	On exit, the kernel will consider the address stored in
157	'list_op_pending' and the address of each 'lock entry' found by walking
158	the list starting at 'head'.  For each such address, if the bottom 29
159	bits of the 'lock word' at offset 'offset' from that address equal the
160	exiting thread's TID, then the kernel will do two things:
161	
162	 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
163	    wakeup on that address, which will waken the next thread that has
164	    used the futex mechanism to wait on that address, and
165	 2) atomically set bit 30 (0x40000000) in the 'lock word'.
166	
167	In the above, bit 31 was set by futex waiters on that lock to indicate
168	they were waiting, and bit 30 is set by the kernel to indicate that the
169	lock owner died holding the lock.
170	
171	The kernel exit code will silently stop scanning the list further if at
172	any point:
173	
174	 1) the 'head' pointer or a subsequent linked list pointer
175	    is not a valid address of a user space word,
176	 2) the calculated location of the 'lock word' (address plus
177	    'offset') is not the valid address of a 32 bit user space
178	    word, or
179	 3) the list contains more than 1 million (subject to
180	    future kernel configuration changes) elements.
181	
182	When the kernel sees a list entry whose 'lock word' doesn't have the
183	current thread's TID in the lower 29 bits, it does nothing with that
184	entry, and goes on to the next entry.
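
The exit-time behaviour described above can be modelled in user space.
The sketch below is not kernel code: it cannot check address validity,
it counts the wakeups that would be attempted instead of issuing them,
and it sets the owner-died bit with a plain store where the kernel uses
an atomic operation.  As a choice made for this model, an entry that is
both on the list and in 'list_op_pending' is handled only once.  All
names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LW_WAITERS    0x80000000u   /* bit 31: set by waiters      */
#define LW_OWNER_DIED 0x40000000u   /* bit 30: set on owner death  */
#define LW_TID_MASK   0x1fffffffu   /* bottom 29 bits: owner TID   */
#define WALK_LIMIT    1000000L      /* "1 million" from the text   */

struct lock_entry { struct lock_entry *next; };
struct rlist_head {
    struct lock_entry  list;
    long               offset;
    struct lock_entry *list_op_pending;
};
struct my_lock {                    /* hypothetical user-level lock */
    uint32_t          lock_word;
    struct lock_entry entry;
};

/* Handle one entry; returns 1 if a futex wakeup would be attempted. */
static int model_handle(struct lock_entry *e, long offset, uint32_t tid)
{
    uint32_t *word = (uint32_t *)((char *)e + offset);

    if ((*word & LW_TID_MASK) != tid)
        return 0;                   /* not held by 'tid': skip it  */
    int wake = (*word & LW_WAITERS) != 0;
    *word |= LW_OWNER_DIED;         /* atomic in the real kernel   */
    return wake;
}

/* Returns the number of wakeups the model would attempt. */
static int model_exit_walk(struct rlist_head *h, uint32_t tid)
{
    int wakeups = 0;
    long seen = 0;

    for (struct lock_entry *e = h->list.next; e != &h->list; e = e->next) {
        if (++seen > WALK_LIMIT)
            break;                  /* silently stop on long lists */
        if (e != h->list_op_pending)
            wakeups += model_handle(e, h->offset, tid);
    }
    if (h->list_op_pending)
        wakeups += model_handle(h->list_op_pending, h->offset, tid);
    return wakeups;
}
```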
185	
186	Bit 29 (0x20000000) of the 'lock word' is reserved for future use.
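
The 'lock word' layout described in this document can be summarised with
a few masks and accessors.  The names below are illustrative and the
values follow the text above.  Real code should take its constants
(FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK) from <linux/futex.h>;
note that the header defines FUTEX_TID_MASK as 0x3fffffff, which is one
bit wider than the 29 bit TID field described here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; values follow the description in this document. */
#define LW_WAITERS    0x80000000u   /* bit 31: threads are waiting   */
#define LW_OWNER_DIED 0x40000000u   /* bit 30: owner died with lock  */
#define LW_RESERVED   0x20000000u   /* bit 29: reserved              */
#define LW_TID_MASK   0x1fffffffu   /* bottom 29 bits: owner TID     */

static uint32_t lw_tid(uint32_t word)
{
    return word & LW_TID_MASK;
}

static bool lw_has_waiters(uint32_t word)
{
    return (word & LW_WAITERS) != 0;
}

static bool lw_owner_died(uint32_t word)
{
    return (word & LW_OWNER_DIED) != 0;
}
```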

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.