
			Static Keys
			-----------

By: Jason Baron <jbaron@redhat.com>

0) Abstract

Static keys allow the inclusion of seldom used features in
performance-sensitive fast-path kernel code, via a GCC feature and a code
patching technique. A quick example:

	struct static_key key = STATIC_KEY_INIT_FALSE;

	...

	if (static_key_false(&key))
		do unlikely code
	else
		do likely code

	...
	static_key_slow_inc(&key);
	...
	static_key_slow_dec(&key);
	...

The static_key_false() branch will be generated into the code with as little
impact on the likely code path as possible.


1) Motivation


Currently, tracepoints are implemented using a conditional branch. The
conditional check requires checking a global variable for each tracepoint.
Although the overhead of this check is small, it increases when the memory
cache comes under pressure (memory cache lines for these global variables may
be shared with other memory accesses). As we increase the number of tracepoints
in the kernel, this overhead may become more of an issue. In addition,
tracepoints are often dormant (disabled) and provide no direct kernel
functionality. Thus, it is highly desirable to reduce their impact as much as
possible. Although tracepoints are the original motivation for this work, other
kernel code paths should be able to make use of the static keys facility.


2) Solution


gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:

http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html

Using the 'asm goto', we can create branches that are either taken or not taken
by default, without the need to check memory. Then, at run-time, we can patch
the branch site to change the branch direction.
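
To make this concrete, here is a minimal, stand-alone illustration of
'asm goto' in user-space C (the function and label names are illustrative,
and the 'nop' stands in for the patchable instruction). The assembly body
lists the C labels it may branch to, while the default path falls through:

	static int example_branch(void)
	{
		/* In the kernel, this 'nop' would later be rewritten
		 * to a 'jmp t_yes' to enable the branch. */
		asm goto("nop" : : : : t_yes);
		return 0;	/* default: fall through, straight-line path */
	t_yes:
		return 1;	/* taken only once the site is patched */
	}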

For example, if we have a simple branch that is disabled by default:

	if (static_key_false(&key))
		printk("I am the true branch\n");

Thus, by default the 'printk' will not be executed, and the generated code
will consist of a single atomic 'no-op' instruction (5 bytes on x86) in the
straight-line code path. When the branch is 'flipped', we will patch the
'no-op' in the straight-line code path with a 'jump' instruction to the
out-of-line true branch. Thus, changing branch direction is expensive but
branch selection is basically 'free'. That is the basic tradeoff of this
optimization.

This low-level patching mechanism is called 'jump label patching', and it
forms the basis for the static keys facility.

3) Static key label API, usage and examples:


In order to make use of this optimization, you must first define a key:

	struct static_key key;

Which is initialized as:

	struct static_key key = STATIC_KEY_INIT_TRUE;

or:

	struct static_key key = STATIC_KEY_INIT_FALSE;

If the key is not initialized, it defaults to false. The 'struct static_key'
must be a global; that is, it can't be allocated on the stack or dynamically
allocated at run-time.

The key is then used in code as:

	if (static_key_false(&key))
		do unlikely code
	else
		do likely code

Or:

	if (static_key_true(&key))
		do likely code
	else
		do unlikely code

A key that is initialized via 'STATIC_KEY_INIT_FALSE' must be used in a
'static_key_false()' construct. Likewise, a key initialized via
'STATIC_KEY_INIT_TRUE' must be used in a 'static_key_true()' construct. A
single key can be used in many branches, but all the branches must match the
way that the key has been initialized.

The branch(es) can then be switched via:

	static_key_slow_inc(&key);
	...
	static_key_slow_dec(&key);

Thus, 'static_key_slow_inc()' means 'make the branch true', and
'static_key_slow_dec()' means 'make the branch false', with appropriate
reference counting. For example, if the key is initialized true, a
static_key_slow_dec() will switch the branch to false, and a subsequent
static_key_slow_inc() will change the branch back to true. Likewise, if the
key is initialized false, a 'static_key_slow_inc()' will change the branch to
true, and a 'static_key_slow_dec()' will again make the branch false.
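
Putting the pieces together, a minimal hypothetical user of this API could
look like the following sketch ('my_feature_key', 'hot_path()',
'do_slow_feature()' and 'set_feature()' are illustrative names, not existing
kernel symbols):

	/* Must be global; disabled by default. */
	static struct static_key my_feature_key = STATIC_KEY_INIT_FALSE;

	static void hot_path(void)
	{
		/* Compiles to a single no-op until the key is enabled. */
		if (static_key_false(&my_feature_key))
			do_slow_feature();
	}

	static void set_feature(bool enable)
	{
		if (enable)
			static_key_slow_inc(&my_feature_key);
		else
			static_key_slow_dec(&my_feature_key);
	}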

An example usage in the kernel is the implementation of tracepoints:

        static inline void trace_##name(proto)                          \
        {                                                               \
                if (static_key_false(&__tracepoint_##name.key))		\
                        __DO_TRACE(&__tracepoint_##name,                \
                                TP_PROTO(data_proto),                   \
                                TP_ARGS(data_args),                     \
                                TP_CONDITION(cond));                    \
        }

Tracepoints are disabled by default and can be placed in performance-critical
pieces of the kernel. Thus, by using a static key, the tracepoints can have
absolutely minimal impact when not in use.
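
When probes are attached to or detached from a tracepoint, the tracepoint
code flips the key. A simplified sketch of that logic, where
'first_probe_added' and 'last_probe_removed' stand in for the real
bookkeeping (see kernel/tracepoint.c for the actual code):

	/* Enable the key when the first probe is attached, disable it
	 * when the last probe is removed; this patches every
	 * trace_##name() call site in the kernel. */
	if (first_probe_added && !static_key_enabled(&tp->key))
		static_key_slow_inc(&tp->key);
	else if (last_probe_removed && static_key_enabled(&tp->key))
		static_key_slow_dec(&tp->key);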


4) Architecture level code patching interface, 'jump labels'


There are a few functions and macros that architectures must implement in order
to take advantage of this optimization. If there is no architecture support, we
simply fall back to a traditional load, test, and jump sequence.
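
In the fallback case, the generic 'static_key_false()' in
include/linux/jump_label.h is essentially an ordinary test of the key's
enable count:

	static __always_inline bool static_key_false(struct static_key *key)
	{
		/* No code patching: just read and test the counter. */
		if (unlikely(atomic_read(&key->enabled) > 0))
			return true;
		return false;
	}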

* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig

* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h

* __always_inline bool arch_static_branch(struct static_key *key), see:
					arch/x86/include/asm/jump_label.h

* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
					see: arch/x86/kernel/jump_label.c

* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
					see: arch/x86/kernel/jump_label.c


* struct jump_entry, see: arch/x86/include/asm/jump_label.h

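For reference, on x86_64 a 'struct jump_entry' records three addresses: the
patch site, the out-of-line target, and the controlling key (sketched from
arch/x86/include/asm/jump_label.h):

	typedef u64 jump_label_t;

	struct jump_entry {
		jump_label_t code;	/* address of the patchable no-op */
		jump_label_t target;	/* address of the out-of-line label */
		jump_label_t key;	/* address of the static_key */
	};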

5) Static keys / jump label analysis, results (x86_64):


As an example, let's add the following branch to 'getppid()', such that the
system call now looks like:

SYSCALL_DEFINE0(getppid)
{
        int pid;

+       if (static_key_false(&key))
+               printk("I am the true branch\n");

        rcu_read_lock();
        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
        rcu_read_unlock();

        return pid;
}

The resulting code generated by GCC with jump labels is:

ffffffff81044290 <sys_getppid>:
ffffffff81044290:       55                      push   %rbp
ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff810442a0:       00 00
ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
ffffffff810442bc:       5d                      pop    %rbp
ffffffff810442bd:       48 98                   cltq
ffffffff810442bf:       c3                      retq
ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
ffffffff810442c7:       31 c0                   xor    %eax,%eax
ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>

Without the jump label optimization it looks like:

ffffffff810441f0 <sys_getppid>:
ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
ffffffff810441f6:       55                      push   %rbp
ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
ffffffff810441fa:       85 c0                   test   %eax,%eax
ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff81044205:       00 00
ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
ffffffff81044221:       5d                      pop    %rbp
ffffffff81044222:       48 98                   cltq
ffffffff81044224:       c3                      retq
ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
ffffffff8104422c:       31 c0                   xor    %eax,%eax
ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
ffffffff8104423c:       00 00 00 00

Thus, the disabled jump label case adds a 'mov', a 'test' and a 'jne'
instruction, whereas the jump label case has just a 'no-op' or 'jmp 0'. (The
'jmp 0' is patched to a 5-byte atomic no-op instruction at boot-time.) Thus,
the disabled jump label case adds:

6 (mov) + 2 (test) + 2 (jne) = 10 bytes, versus the 5-byte no-op, i.e. 5 additional bytes.

If we then include the padding bytes, the jump label code saves 16 total bytes
of instruction memory for this small function. In this case the non-jump label
function is 80 bytes long. Thus, we have saved 20% of the instruction
footprint. We can in fact improve this even further, since the 5-byte no-op
really can be a 2-byte no-op, since we can reach the branch with a 2-byte jmp.
However, we have not yet implemented optimal no-op sizes (they are currently
hard-coded).

Since there are a number of static key API uses in the scheduler paths,
'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
performance improvement. Testing done on 3.3.0-rc2:

jump label disabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )

       1.601607384 seconds time elapsed                                          ( +-  0.07% )

jump label enabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )

       1.579384366 seconds time elapsed

The percentage of saved branches is 0.7%, and we've saved 12% on
'branch-misses'. This is where we would expect to get the most savings, since
this optimization is about reducing the number of branches. In addition, we've
saved 0.2% on instructions, 2.8% on cycles, and 1.4% on elapsed time.