
Documentation / static-keys.txt





Based on kernel version 4.3.

			Static Keys
			-----------

DEPRECATED API:

The use of 'struct static_key' directly is now DEPRECATED. In addition,
static_key_{true,false}() is also DEPRECATED, i.e. DO NOT use the following:

struct static_key false = STATIC_KEY_INIT_FALSE;
struct static_key true = STATIC_KEY_INIT_TRUE;
static_key_true()
static_key_false()

The updated API replacements are:

DEFINE_STATIC_KEY_TRUE(key);
DEFINE_STATIC_KEY_FALSE(key);
static_branch_likely()
static_branch_unlikely()
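
For example, converting a user of the old API to the new one looks roughly
like this (do_unlikely_work() is a made-up placeholder for whatever the
unlikely path does):

	/* old, deprecated style */
	static struct static_key key = STATIC_KEY_INIT_FALSE;

	if (static_key_false(&key))
		do_unlikely_work();

	/* new style */
	DEFINE_STATIC_KEY_FALSE(key);

	if (static_branch_unlikely(&key))
		do_unlikely_work();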

0) Abstract

Static keys allow the inclusion of seldom used features in
performance-sensitive fast-path kernel code, via a GCC feature and a code
patching technique. A quick example:

	DEFINE_STATIC_KEY_FALSE(key);

	...

	if (static_branch_unlikely(&key))
		do unlikely code
	else
		do likely code

	...
	static_branch_enable(&key);
	...
	static_branch_disable(&key);
	...

The static_branch_unlikely() branch will be generated into the code with as
little impact on the likely code path as possible.


1) Motivation


Currently, tracepoints are implemented using a conditional branch. The
conditional check requires checking a global variable for each tracepoint.
Although the overhead of this check is small, it increases when the memory
cache comes under pressure (memory cache lines for these global variables may
be shared with other memory accesses). As we increase the number of tracepoints
in the kernel this overhead may become more of an issue. In addition,
tracepoints are often dormant (disabled) and provide no direct kernel
functionality. Thus, it is highly desirable to reduce their impact as much as
possible. Although tracepoints are the original motivation for this work, other
kernel code paths should be able to make use of the static keys facility.


2) Solution


gcc (v4.5) introduced a new 'asm goto' statement that allows branching to a
label:

http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html

Using 'asm goto', we can create branches that are either taken or not taken
by default, without the need to check memory. Then, at run-time, we can patch
the branch site to change the branch direction.

For example, if we have a simple branch that is disabled by default:

	if (static_branch_unlikely(&key))
		printk("I am the true branch\n");

then by default the 'printk' will not be emitted, and the generated code will
consist of a single atomic 'no-op' instruction (5 bytes on x86) in the
straight-line code path. When the branch is 'flipped', we will patch the
'no-op' in the straight-line codepath with a 'jump' instruction to the
out-of-line true branch. Thus, changing branch direction is expensive but
branch selection is basically 'free'. That is the basic tradeoff of this
optimization.

This low-level patching mechanism is called 'jump label patching', and it
provides the basis for the static keys facility.
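
To make the mechanism concrete, the following is a simplified sketch of such
an 'asm goto' branch, modeled on the x86 implementation in
arch/x86/include/asm/jump_label.h (the 5-byte no-op encoding and the section
directives here are illustrative; the real header builds them from macros):

	static __always_inline bool arch_static_branch(struct static_key *key,
						       bool branch)
	{
		/*
		 * Emit a 5-byte no-op in the straight-line path, and record
		 * its address, the out-of-line target label, and the key in
		 * the __jump_table section so it can be patched later.
		 */
		asm_volatile_goto("1: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"
			".pushsection __jump_table, \"aw\"\n\t"
			".quad 1b, %l[l_yes], %c0 + %c1\n\t"
			".popsection\n\t"
			: : "i" (key), "i" (branch) : : l_yes);

		/* Fall through: the branch is not taken by default. */
		return false;
	l_yes:
		/* Reached only after the no-op is patched to a jump. */
		return true;
	}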

3) Static key label API, usage and examples:


In order to make use of this optimization you must first define a key:

	DEFINE_STATIC_KEY_TRUE(key);

or:

	DEFINE_STATIC_KEY_FALSE(key);


The key must be global, that is, it can't be allocated on the stack or
dynamically allocated at run-time.
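
For instance, a file-scope definition like the following works (a sketch;
'my_key' and the function are made-up names):

	/* File scope: the key's address is fixed at build time, so the
	 * branch sites that refer to it can be patched. */
	DEFINE_STATIC_KEY_FALSE(my_key);

	void my_fast_path(void)
	{
		if (static_branch_unlikely(&my_key))
			printk("feature enabled\n");
	}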

The key is then used in code as:

	if (static_branch_unlikely(&key))
		do unlikely code
	else
		do likely code

Or:

	if (static_branch_likely(&key))
		do likely code
	else
		do unlikely code

Keys defined via DEFINE_STATIC_KEY_TRUE() or DEFINE_STATIC_KEY_FALSE() may
be used in either static_branch_likely() or static_branch_unlikely()
statements.

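In other words, a key's initial value and the branch layout chosen at the use
site are independent. For example (a sketch), a key initialized true may
still sit in an unlikely-annotated branch:

	DEFINE_STATIC_KEY_TRUE(key);

	if (static_branch_unlikely(&key))
		do unlikely code	/* runs until static_branch_disable(&key) */
	else
		do likely code
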
Branch(es) can be set true via:

	static_branch_enable(&key);

or false via:

	static_branch_disable(&key);

The branch(es) can then be switched via reference counts:

	static_branch_inc(&key);
	...
	static_branch_dec(&key);

Thus, 'static_branch_inc()' means 'make the branch true', and
'static_branch_dec()' means 'make the branch false', with appropriate
reference counting. For example, if the key is initialized true, a
static_branch_dec() will switch the branch to false, and a subsequent
static_branch_inc() will change the branch back to true. Likewise, if the
key is initialized false, a static_branch_inc() will change the branch to
true, and a static_branch_dec() will again make the branch false.
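
Putting this together, a common pattern is to count the users of a feature
and let the branch follow the count. A minimal sketch (the key name and the
helper functions below are made up for illustration):

	DEFINE_STATIC_KEY_FALSE(feature_key);

	void feature_user_register(void)
	{
		/* the first registered user flips the branch to true */
		static_branch_inc(&feature_key);
	}

	void feature_user_unregister(void)
	{
		/* the last user going away flips the branch back to false */
		static_branch_dec(&feature_key);
	}

	void fast_path(void)
	{
		if (static_branch_unlikely(&feature_key))
			do_feature_work();	/* out of line, usually off */
		/* the likely path runs over a single no-op when disabled */
	}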


4) Architecture level code patching interface, 'jump labels'


There are a few functions and macros that architectures must implement in order
to take advantage of this optimization. If there is no architecture support, we
simply fall back to a traditional load, test, and jump sequence.

* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig

* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h

* __always_inline bool arch_static_branch(struct static_key *key, bool branch), see:
					arch/x86/include/asm/jump_label.h

* __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch),
					see: arch/x86/include/asm/jump_label.h

* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
					see: arch/x86/kernel/jump_label.c

* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
					see: arch/x86/kernel/jump_label.c


* struct jump_entry, see: arch/x86/include/asm/jump_label.h
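
For reference, on x86 struct jump_entry looks roughly like the following (a
sketch of the layout in this kernel version; see the header above for the
authoritative definition). Each entry describes one patch site collected in
the __jump_table section:

	typedef u64 jump_label_t;

	struct jump_entry {
		jump_label_t code;	/* address of the no-op/jump site */
		jump_label_t target;	/* address of the out-of-line label */
		jump_label_t key;	/* address of the controlling static_key */
	};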


5) Static keys / jump label analysis, results (x86_64):


As an example, let's add the following branch to 'getppid()', such that the
system call now looks like:

SYSCALL_DEFINE0(getppid)
{
        int pid;

+       if (static_branch_unlikely(&key))
+               printk("I am the true branch\n");

        rcu_read_lock();
        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
        rcu_read_unlock();

        return pid;
}

The resulting instructions with jump labels generated by GCC are:

ffffffff81044290 <sys_getppid>:
ffffffff81044290:       55                      push   %rbp
ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff810442a0:       00 00
ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
ffffffff810442bc:       5d                      pop    %rbp
ffffffff810442bd:       48 98                   cltq
ffffffff810442bf:       c3                      retq
ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
ffffffff810442c7:       31 c0                   xor    %eax,%eax
ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>

Without the jump label optimization it looks like:

ffffffff810441f0 <sys_getppid>:
ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
ffffffff810441f6:       55                      push   %rbp
ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
ffffffff810441fa:       85 c0                   test   %eax,%eax
ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff81044205:       00 00
ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
ffffffff81044221:       5d                      pop    %rbp
ffffffff81044222:       48 98                   cltq
ffffffff81044224:       c3                      retq
ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
ffffffff8104422c:       31 c0                   xor    %eax,%eax
ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
ffffffff8104423c:       00 00 00 00

Thus, the disabled jump label case adds a 'mov', 'test' and 'jne' instruction,
whereas the jump label case just has a 'no-op' or 'jmp 0'. (The 'jmp 0' is
patched to a 5-byte atomic no-op instruction at boot time.) Thus, the disabled
jump label case adds:

6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5-byte jump 0) = 5 additional bytes.

If we then include the padding bytes, the jump label code saves 16 total bytes
of instruction memory for this small function. In this case the non-jump label
function is 80 bytes long. Thus, we have saved 20% of the instruction
footprint. We can in fact improve this even further, since the 5-byte no-op
really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
However, we have not yet implemented optimal no-op sizes (they are currently
hard-coded).

Since there are a number of static key API uses in the scheduler paths,
'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
performance improvement. Testing done on 3.3.0-rc2:

jump label disabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )

       1.601607384 seconds time elapsed                                          ( +-  0.07% )

jump label enabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )

       1.579384366 seconds time elapsed

The percentage of saved branches is 0.7%, and we've saved 12% on
'branch-misses'. This is where we would expect to get the most savings, since
this optimization is about reducing the number of branches. In addition, we've
saved 0.2% on instructions, 2.8% on cycles, and 1.4% on elapsed time.