About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / static-keys.txt




Custom Search

Based on kernel version 4.13.3. Page generated on 2017-09-23 13:56 EST.

1	===========
2	Static Keys
3	===========
4	
5	.. warning::
6	
7	   DEPRECATED API:
8	
9	   The use of 'struct static_key' directly, is now DEPRECATED. In addition
10	   static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following::
11	
12		struct static_key false = STATIC_KEY_INIT_FALSE;
13		struct static_key true = STATIC_KEY_INIT_TRUE;
14		static_key_true()
15		static_key_false()
16	
17	   The updated API replacements are::
18	
19		DEFINE_STATIC_KEY_TRUE(key);
20		DEFINE_STATIC_KEY_FALSE(key);
21		DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
22		DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
23		static_branch_likely()
24		static_branch_unlikely()
25	
26	Abstract
27	========
28	
29	Static keys allows the inclusion of seldom used features in
30	performance-sensitive fast-path kernel code, via a GCC feature and a code
31	patching technique. A quick example::
32	
33		DEFINE_STATIC_KEY_FALSE(key);
34	
35		...
36	
37	        if (static_branch_unlikely(&key))
38	                do unlikely code
39	        else
40	                do likely code
41	
42		...
43		static_branch_enable(&key);
44		...
45		static_branch_disable(&key);
46		...
47	
48	The static_branch_unlikely() branch will be generated into the code with as little
49	impact to the likely code path as possible.
50	
51	
52	Motivation
53	==========
54	
55	
56	Currently, tracepoints are implemented using a conditional branch. The
57	conditional check requires checking a global variable for each tracepoint.
58	Although the overhead of this check is small, it increases when the memory
59	cache comes under pressure (memory cache lines for these global variables may
60	be shared with other memory accesses). As we increase the number of tracepoints
61	in the kernel this overhead may become more of an issue. In addition,
62	tracepoints are often dormant (disabled) and provide no direct kernel
63	functionality. Thus, it is highly desirable to reduce their impact as much as
64	possible. Although tracepoints are the original motivation for this work, other
65	kernel code paths should be able to make use of the static keys facility.
66	
67	
68	Solution
69	========
70	
71	
72	gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
73	
74	http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
75	
76	Using the 'asm goto', we can create branches that are either taken or not taken
77	by default, without the need to check memory. Then, at run-time, we can patch
78	the branch site to change the branch direction.
79	
80	For example, if we have a simple branch that is disabled by default::
81	
82		if (static_branch_unlikely(&key))
83			printk("I am the true branch\n");
84	
85	Thus, by default the 'printk' will not be emitted. And the code generated will
86	consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
87	straight-line code path. When the branch is 'flipped', we will patch the
88	'no-op' in the straight-line codepath with a 'jump' instruction to the
89	out-of-line true branch. Thus, changing branch direction is expensive but
90	branch selection is basically 'free'. That is the basic tradeoff of this
91	optimization.
92	
93	This lowlevel patching mechanism is called 'jump label patching', and it gives
94	the basis for the static keys facility.
95	
96	Static key label API, usage and examples
97	========================================
98	
99	
100	In order to make use of this optimization you must first define a key::
101	
102		DEFINE_STATIC_KEY_TRUE(key);
103	
104	or::
105	
106		DEFINE_STATIC_KEY_FALSE(key);
107	
108	
109	The key must be global, that is, it can't be allocated on the stack or dynamically
110	allocated at run-time.
111	
112	The key is then used in code as::
113	
114	        if (static_branch_unlikely(&key))
115	                do unlikely code
116	        else
117	                do likely code
118	
119	Or::
120	
121	        if (static_branch_likely(&key))
122	                do likely code
123	        else
124	                do unlikely code
125	
126	Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may
127	be used in either static_branch_likely() or static_branch_unlikely()
128	statements.
129	
130	Branch(es) can be set true via::
131	
132		static_branch_enable(&key);
133	
134	or false via::
135	
136		static_branch_disable(&key);
137	
138	The branch(es) can then be switched via reference counts::
139	
140		static_branch_inc(&key);
141		...
142		static_branch_dec(&key);
143	
144	Thus, 'static_branch_inc()' means 'make the branch true', and
145	'static_branch_dec()' means 'make the branch false' with appropriate
146	reference counting. For example, if the key is initialized true, a
147	static_branch_dec(), will switch the branch to false. And a subsequent
148	static_branch_inc(), will change the branch back to true. Likewise, if the
149	key is initialized false, a 'static_branch_inc()', will change the branch to
150	true. And then a 'static_branch_dec()', will again make the branch false.
151	
152	Where an array of keys is required, it can be defined as::
153	
154		DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
155	
156	or::
157	
158		DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
159	
160	4) Architecture level code patching interface, 'jump labels'
161	
162	
163	There are a few functions and macros that architectures must implement in order
164	to take advantage of this optimization. If there is no architecture support, we
165	simply fall back to a traditional, load, test, and jump sequence. Also, the
166	struct jump_entry table must be at least 4-byte aligned because the
167	static_key->entry field makes use of the two least significant bits.
168	
169	* ``select HAVE_ARCH_JUMP_LABEL``,
170	    see: arch/x86/Kconfig
171	
172	* ``#define JUMP_LABEL_NOP_SIZE``,
173	    see: arch/x86/include/asm/jump_label.h
174	
175	* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``,
176	    see: arch/x86/include/asm/jump_label.h
177	
178	* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``,
179	    see: arch/x86/include/asm/jump_label.h
180	
181	* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``,
182	    see: arch/x86/kernel/jump_label.c
183	
184	* ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``,
185	    see: arch/x86/kernel/jump_label.c
186	
187	* ``struct jump_entry``,
188	    see: arch/x86/include/asm/jump_label.h
189	
190	
191	5) Static keys / jump label analysis, results (x86_64):
192	
193	
194	As an example, let's add the following branch to 'getppid()', such that the
195	system call now looks like::
196	
197	  SYSCALL_DEFINE0(getppid)
198	  {
199	        int pid;
200	
201	  +     if (static_branch_unlikely(&key))
202	  +             printk("I am the true branch\n");
203	
204	        rcu_read_lock();
205	        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
206	        rcu_read_unlock();
207	
208	        return pid;
209	  }
210	
211	The resulting instructions with jump labels generated by GCC is::
212	
213	  ffffffff81044290 <sys_getppid>:
214	  ffffffff81044290:       55                      push   %rbp
215	  ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
216	  ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
217	  ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
218	  ffffffff810442a0:       00 00
219	  ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
220	  ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
221	  ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
222	  ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
223	  ffffffff810442bc:       5d                      pop    %rbp
224	  ffffffff810442bd:       48 98                   cltq
225	  ffffffff810442bf:       c3                      retq
226	  ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
227	  ffffffff810442c7:       31 c0                   xor    %eax,%eax
228	  ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
229	  ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>
230	
231	Without the jump label optimization it looks like::
232	
233	  ffffffff810441f0 <sys_getppid>:
234	  ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
235	  ffffffff810441f6:       55                      push   %rbp
236	  ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
237	  ffffffff810441fa:       85 c0                   test   %eax,%eax
238	  ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
239	  ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
240	  ffffffff81044205:       00 00
241	  ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
242	  ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
243	  ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
244	  ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
245	  ffffffff81044221:       5d                      pop    %rbp
246	  ffffffff81044222:       48 98                   cltq
247	  ffffffff81044224:       c3                      retq
248	  ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
249	  ffffffff8104422c:       31 c0                   xor    %eax,%eax
250	  ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
251	  ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
252	  ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
253	  ffffffff8104423c:       00 00 00 00
254	
255	Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction
256	vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched
257	to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump
258	label case adds::
259	
260	  6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes.
261	
262	If we then include the padding bytes, the jump label code saves, 16 total bytes
263	of instruction memory for this small function. In this case the non-jump label
264	function is 80 bytes long. Thus, we have saved 20% of the instruction
265	footprint. We can in fact improve this even further, since the 5-byte no-op
266	really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
267	However, we have not yet implemented optimal no-op sizes (they are currently
268	hard-coded).
269	
270	Since there are a number of static key API uses in the scheduler paths,
271	'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
272	performance improvement. Testing done on 3.3.0-rc2:
273	
274	jump label disabled::
275	
276	 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
277	
278	        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
279	           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
280	                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
281	               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
282	     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
283	   <not supported> stalled-cycles-frontend
284	   <not supported> stalled-cycles-backend
285	     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
286	       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
287	         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )
288	
289	       1.601607384 seconds time elapsed                                          ( +-  0.07% )
290	
291	jump label enabled::
292	
293	 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
294	
295	        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
296	           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
297	                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
298	               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
299	     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
300	   <not supported> stalled-cycles-frontend
301	   <not supported> stalled-cycles-backend
302	     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
303	       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
304	         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )
305	
306	       1.579384366 seconds time elapsed
307	
308	The percentage of saved branches is .7%, and we've saved 12% on
309	'branch-misses'. This is where we would expect to get the most savings, since
310	this optimization is about reducing the number of branches. In addition, we've
311	saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time.
Hide Line Numbers
About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.