===========
Static Keys
===========

.. warning::

   DEPRECATED API:

   The use of 'struct static_key' directly is now DEPRECATED. In addition,
   static_key_{true,false}() are also DEPRECATED. That is, DO NOT use the
   following::

	struct static_key false = STATIC_KEY_INIT_FALSE;
	struct static_key true = STATIC_KEY_INIT_TRUE;
	static_key_true()
	static_key_false()

   The updated API replacements are::

	DEFINE_STATIC_KEY_TRUE(key);
	DEFINE_STATIC_KEY_FALSE(key);
	DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
	DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
	static_branch_likely()
	static_branch_unlikely()

Abstract
========

Static keys allow the inclusion of seldom used features in
performance-sensitive fast-path kernel code, via a GCC feature and a code
patching technique. A quick example::

	DEFINE_STATIC_KEY_FALSE(key);

	...

        if (static_branch_unlikely(&key))
                do unlikely code
        else
                do likely code

	...
	static_branch_enable(&key);
	...
	static_branch_disable(&key);
	...

The static_branch_unlikely() branch will be generated into the code with as
little impact on the likely code path as possible.


Motivation
==========

Currently, tracepoints are implemented using a conditional branch. The
conditional check requires checking a global variable for each tracepoint.
Although the overhead of this check is small, it increases when the memory
cache comes under pressure (memory cache lines for these global variables may
be shared with other memory accesses). As we increase the number of tracepoints
in the kernel, this overhead may become more of an issue. In addition,
tracepoints are often dormant (disabled) and provide no direct kernel
functionality. Thus, it is highly desirable to reduce their impact as much as
possible. Although tracepoints are the original motivation for this work, other
kernel code paths should be able to make use of the static keys facility.

Solution
========

GCC (v4.5) added a new 'asm goto' statement that allows branching to a label:

http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html

Using the 'asm goto', we can create branches that are either taken or not taken
by default, without the need to check memory. Then, at run-time, we can patch
the branch site to change the branch direction.
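
To see the GCC feature in isolation, here is a minimal, self-contained
userspace sketch of the idea (x86-64, illustrative only; the kernel
additionally records each branch site in a table so that it can rewrite
the instruction at run-time)::

  #include <stdbool.h>
  #include <stdio.h>

  /* The inline asm emits a 5-byte no-op and tells the compiler it *may*
   * jump to l_yes, without any run-time memory check. Jump label
   * patching would later overwrite the no-op with a 5-byte 'jmp' to
   * flip the branch; this sketch simply never jumps. */
  static inline bool sketch_static_branch(void)
  {
          asm goto(".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t" /* 5-byte nop */
                   : : : : l_yes);
          return false;
  l_yes:
          return true;
  }

  int main(void)
  {
          if (sketch_static_branch())
                  printf("I am the true branch\n");
          return 0;
  }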

For example, if we have a simple branch that is disabled by default::

	if (static_branch_unlikely(&key))
		printk("I am the true branch\n");

Thus, by default the 'printk' will not be emitted. And the code generated will
consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
straight-line code path. When the branch is 'flipped', we will patch the
'no-op' in the straight-line code path with a 'jump' instruction to the
out-of-line true branch. Thus, changing branch direction is expensive but
branch selection is basically 'free'. That is the basic tradeoff of this
optimization.

This low-level patching mechanism is called 'jump label patching', and it
provides the basis for the static keys facility.

Static key label API, usage and examples
========================================

In order to make use of this optimization you must first define a key::

	DEFINE_STATIC_KEY_TRUE(key);

or::

	DEFINE_STATIC_KEY_FALSE(key);

The key must be global, that is, it can't be allocated on the stack or dynamically
allocated at run-time.
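
Under the hood, each such macro wraps a 'struct static_key' in a type that
records the default value. Roughly (a simplified sketch of
include/linux/jump_label.h; the real initializers also encode branch-type
flag bits in the key)::

  struct static_key_false {
          struct static_key key;
  };

  /* simplified: see STATIC_KEY_FALSE_INIT for the real initializer */
  #define DEFINE_STATIC_KEY_FALSE(name)					\
	  struct static_key_false name = { .key = { .enabled = ATOMIC_INIT(0) } }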

The key is then used in code as::

        if (static_branch_unlikely(&key))
                do unlikely code
        else
                do likely code

Or::

        if (static_branch_likely(&key))
                do likely code
        else
                do unlikely code

Keys defined via DEFINE_STATIC_KEY_TRUE() or DEFINE_STATIC_KEY_FALSE() may
be used in either static_branch_likely() or static_branch_unlikely()
statements.
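
For example, a minimal sketch of a default-off feature guarded by a static
key might look like this (the function and key names here are made up for
illustration)::

  #include <linux/jump_label.h>
  #include <linux/printk.h>

  /* Off by default: the fast path carries only a no-op until enabled. */
  static DEFINE_STATIC_KEY_FALSE(my_trace_key);

  void my_fast_path(void)
  {
          if (static_branch_unlikely(&my_trace_key))
                  pr_info("slow path: tracing enabled\n");
          /* likely code continues with no conditional overhead */
  }

  /* e.g. called from a debugfs write handler */
  void my_trace_set(bool on)
  {
          if (on)
                  static_branch_enable(&my_trace_key);
          else
                  static_branch_disable(&my_trace_key);
  }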

Branch(es) can be set true via::

	static_branch_enable(&key);

or false via::

	static_branch_disable(&key);

The branch(es) can then be switched via reference counts::

	static_branch_inc(&key);
	...
	static_branch_dec(&key);

Thus, 'static_branch_inc()' means 'make the branch true', and
'static_branch_dec()' means 'make the branch false', with appropriate
reference counting. For example, if the key is initialized true, a
static_branch_dec() will switch the branch to false. And a subsequent
static_branch_inc() will change the branch back to true. Likewise, if the
key is initialized false, a 'static_branch_inc()' will change the branch to
true. And then a 'static_branch_dec()' will again make the branch false.
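
For instance, when several independent users may need the same branch, the
reference count keeps it true until the last user drops it (a sketch with
illustrative names)::

  static DEFINE_STATIC_KEY_FALSE(shared_key);

  void user_a_start(void) { static_branch_inc(&shared_key); } /* 0 -> 1: branch true  */
  void user_b_start(void) { static_branch_inc(&shared_key); } /* 1 -> 2: stays true   */
  void user_a_stop(void)  { static_branch_dec(&shared_key); } /* 2 -> 1: stays true   */
  void user_b_stop(void)  { static_branch_dec(&shared_key); } /* 1 -> 0: branch false */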

The state and the reference count can be retrieved with 'static_key_enabled()'
and 'static_key_count()'. In general, if you use these functions, they
should be protected with the same mutex used around the enable/disable
or increment/decrement function.
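
A sketch of that locking pattern (the mutex and helper names are
illustrative)::

  #include <linux/jump_label.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(my_key_mutex);      /* serializes updates and queries */
  static DEFINE_STATIC_KEY_FALSE(my_key);

  void my_key_get(void)
  {
          mutex_lock(&my_key_mutex);
          static_branch_inc(&my_key);
          mutex_unlock(&my_key_mutex);
  }

  bool my_key_active(void)
  {
          bool on;

          mutex_lock(&my_key_mutex);
          on = static_key_enabled(&my_key);
          mutex_unlock(&my_key_mutex);
          return on;
  }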

Note that switching branches results in some locks being taken,
particularly the CPU hotplug lock (in order to avoid races against
CPUs being brought online while the kernel is getting patched).
Calling the static key API from within a hotplug notifier is
thus a sure deadlock recipe. In order to still allow use of the
functionality, the following functions are provided::

	static_key_enable_cpuslocked()
	static_key_disable_cpuslocked()
	static_branch_enable_cpuslocked()
	static_branch_disable_cpuslocked()

These functions are *not* general purpose, and must only be used when
you really know that you're in the above context, and no other.
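
For instance, a CPU hotplug callback already runs with the hotplug lock
held, so it must use the _cpuslocked variant (a hedged sketch; the key
definition and callback names are illustrative, and the callback
registration is omitted)::

  static DEFINE_STATIC_KEY_FALSE(my_feature_key);

  /* Hypothetical cpuhp startup callback: the hotplug lock is held when
   * this runs, so plain static_branch_enable() would deadlock here. */
  static int my_cpu_online(unsigned int cpu)
  {
          static_branch_enable_cpuslocked(&my_feature_key);
          return 0;
  }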

Where an array of keys is required, it can be defined as::

	DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);

or::

	DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
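
Each element is then used exactly like a standalone key, e.g. (a sketch,
assuming a hypothetical NR_FEATURES count)::

  #define NR_FEATURES 4   /* hypothetical */

  static DEFINE_STATIC_KEY_ARRAY_FALSE(feature_keys, NR_FEATURES);

  bool feature_active(int i)
  {
          return static_branch_unlikely(&feature_keys[i]);
  }

  void feature_enable(int i)
  {
          static_branch_enable(&feature_keys[i]);
  }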

Architecture level code patching interface, 'jump labels'
==========================================================

There are a few functions and macros that architectures must implement in order
to take advantage of this optimization. If there is no architecture support, we
simply fall back to a traditional load, test, and jump sequence. Also, the
struct jump_entry table must be at least 4-byte aligned because the
static_key->entries field makes use of the two least significant bits.

* ``select HAVE_ARCH_JUMP_LABEL``,
    see: arch/x86/Kconfig

* ``#define JUMP_LABEL_NOP_SIZE``,
    see: arch/x86/include/asm/jump_label.h

* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``,
    see: arch/x86/include/asm/jump_label.h

* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``,
    see: arch/x86/include/asm/jump_label.h

* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``,
    see: arch/x86/kernel/jump_label.c

* ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``,
    see: arch/x86/kernel/jump_label.c

* ``struct jump_entry``,
    see: arch/x86/include/asm/jump_label.h
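
For reference, on x86_64 in this era each table entry is three 64-bit words
recording the patch site, the jump target, and the owning key (simplified
from arch/x86/include/asm/jump_label.h; the low bit of 'key' encodes the
branch polarity)::

  typedef u64 jump_label_t;

  struct jump_entry {
          jump_label_t code;      /* address of the no-op/jmp to patch */
          jump_label_t target;    /* where the patched jump goes */
          jump_label_t key;       /* associated struct static_key */
  };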


Static keys / jump label analysis, results (x86_64)
===================================================

As an example, let's add the following branch to 'getppid()', such that the
system call now looks like::

  SYSCALL_DEFINE0(getppid)
  {
        int pid;

  +     if (static_branch_unlikely(&key))
  +             printk("I am the true branch\n");

        rcu_read_lock();
        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
        rcu_read_unlock();

        return pid;
  }

The resulting instructions with jump labels generated by GCC are::

  ffffffff81044290 <sys_getppid>:
  ffffffff81044290:       55                      push   %rbp
  ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
  ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
  ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
  ffffffff810442a0:       00 00
  ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
  ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
  ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
  ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
  ffffffff810442bc:       5d                      pop    %rbp
  ffffffff810442bd:       48 98                   cltq
  ffffffff810442bf:       c3                      retq
  ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
  ffffffff810442c7:       31 c0                   xor    %eax,%eax
  ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
  ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>

Without the jump label optimization it looks like::

  ffffffff810441f0 <sys_getppid>:
  ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
  ffffffff810441f6:       55                      push   %rbp
  ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
  ffffffff810441fa:       85 c0                   test   %eax,%eax
  ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
  ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
  ffffffff81044205:       00 00
  ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
  ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
  ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
  ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
  ffffffff81044221:       5d                      pop    %rbp
  ffffffff81044222:       48 98                   cltq
  ffffffff81044224:       c3                      retq
  ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
  ffffffff8104422c:       31 c0                   xor    %eax,%eax
  ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
  ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
  ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
  ffffffff8104423c:       00 00 00 00

Thus, the disabled jump label case adds a 'mov', 'test' and 'jne' instruction,
whereas the jump label case just has a 'no-op' or 'jmp 0'. (The 'jmp 0' is
patched to a 5-byte atomic no-op instruction at boot time.) Thus, the disabled
jump label case adds::

  6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 additional bytes.

If we then include the padding bytes, the jump label code saves 16 total bytes
of instruction memory for this small function. In this case the non-jump label
function is 80 bytes long. Thus, we have saved 20% of the instruction
footprint. We could in fact improve this even further, since the 5-byte no-op
really could be a 2-byte no-op, since we can reach the branch with a 2-byte
jmp. However, we have not yet implemented optimal no-op sizes (they are
currently hard-coded).

Since there are a number of static key API uses in the scheduler paths,
'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
performance improvement. Testing done on 3.3.0-rc2:

jump label disabled::

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )

       1.601607384 seconds time elapsed                                          ( +-  0.07% )

jump label enabled::

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )

       1.579384366 seconds time elapsed

The percentage of saved branches is 0.7%, and we've saved 12% on
'branch-misses'. This is where we would expect to get the most savings, since
this optimization is about reducing the number of branches. In addition, we've
saved 0.2% on instructions, 2.8% on cycles, and 1.4% on elapsed time.