About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / trace / ftrace.txt




Custom Search

Based on kernel version 3.9. Page generated on 2013-05-02 23:15 EST.

1			ftrace - Function Tracer
2			========================
3	
4	Copyright 2008 Red Hat Inc.
5	   Author:   Steven Rostedt <srostedt@redhat.com>
6	  License:   The GNU Free Documentation License, Version 1.2
7	               (dual licensed under the GPL v2)
8	Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
9		     John Kacur, and David Teigland.
10	Written for: 2.6.28-rc2
11	
12	Introduction
13	------------
14	
15	Ftrace is an internal tracer designed to help out developers and
16	designers of systems to find what is going on inside the kernel.
17	It can be used for debugging or analyzing latencies and
18	performance issues that take place outside of user-space.
19	
20	Although ftrace is the function tracer, it also includes an
21	infrastructure that allows for other types of tracing. Some of
22	the tracers that are currently in ftrace include a tracer to
23	trace context switches, the time it takes for a high priority
24	task to run after it was woken up, the time interrupts are
25	disabled, and more (ftrace allows for tracer plugins, which
26	means that the list of tracers can always grow).
27	
28	
29	Implementation Details
30	----------------------
31	
32	See ftrace-design.txt for details for arch porters and such.
33	
34	
35	The File System
36	---------------
37	
38	Ftrace uses the debugfs file system to hold the control files as
39	well as the files to display output.
40	
41	When debugfs is configured into the kernel (which selecting any ftrace
42	option will do) the directory /sys/kernel/debug will be created. To mount
43	this directory, you can add to your /etc/fstab file:
44	
45	 debugfs       /sys/kernel/debug          debugfs defaults        0       0
46	
47	Or you can mount it at run time with:
48	
49	 mount -t debugfs nodev /sys/kernel/debug
50	
51	For quicker access to that directory you may want to make a soft link to
52	it:
53	
54	 ln -s /sys/kernel/debug /debug
55	
56	Any selected ftrace option will also create a directory called tracing
57	within the debugfs. The rest of the document will assume that you are in
58	the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate
59	on the files within that directory and not distract from the content with
60	the extended "/sys/kernel/debug/tracing" path name.
61	
62	That's it! (assuming that you have ftrace configured into your kernel)
63	
64	After mounting the debugfs, you can see a directory called
65	"tracing".  This directory contains the control and output files
66	of ftrace. Here is a list of some of the key files:
67	
68	
69	 Note: all time values are in microseconds.
70	
71	  current_tracer:
72	
73		This is used to set or display the current tracer
74		that is configured.
75	
76	  available_tracers:
77	
78		This holds the different types of tracers that
79		have been compiled into the kernel. The
80		tracers listed here can be configured by
81		echoing their name into current_tracer.
82	
83	  tracing_on:
84	
85		This sets or displays whether writing to the trace
86		ring buffer is enabled. Echo 0 into this file to disable
87		the tracer or 1 to enable it.
88	
89	  trace:
90	
91		This file holds the output of the trace in a human
92		readable format (described below).
93	
94	  trace_pipe:
95	
96		The output is the same as the "trace" file but this
97		file is meant to be streamed with live tracing.
98		Reads from this file will block until new data is
99		retrieved.  Unlike the "trace" file, this file is a
100		consumer. This means reading from this file causes
101		sequential reads to display more current data. Once
102		data is read from this file, it is consumed, and
103		will not be read again with a sequential read. The
104		"trace" file is static, and if the tracer is not
105		adding more data,they will display the same
106		information every time they are read.
107	
108	  trace_options:
109	
110		This file lets the user control the amount of data
111		that is displayed in one of the above output
112		files.
113	
114	  tracing_max_latency:
115	
116		Some of the tracers record the max latency.
117		For example, the time interrupts are disabled.
118		This time is saved in this file. The max trace
119		will also be stored, and displayed by "trace".
120		A new max trace will only be recorded if the
121		latency is greater than the value in this
122		file. (in microseconds)
123	
124	  buffer_size_kb:
125	
126		This sets or displays the number of kilobytes each CPU
127		buffer can hold. The tracer buffers are the same size
128		for each CPU. The displayed number is the size of the
129		CPU buffer and not total size of all buffers. The
130		trace buffers are allocated in pages (blocks of memory
131		that the kernel uses for allocation, usually 4 KB in size).
132		If the last page allocated has room for more bytes
133		than requested, the rest of the page will be used,
134		making the actual allocation bigger than requested.
135		( Note, the size may not be a multiple of the page size
136		  due to buffer management overhead. )
137	
138		This can only be updated when the current_tracer
139		is set to "nop".
140	
141	  tracing_cpumask:
142	
143		This is a mask that lets the user only trace
144		on specified CPUS. The format is a hex string
145		representing the CPUS.
146	
147	  set_ftrace_filter:
148	
149		When dynamic ftrace is configured in (see the
150		section below "dynamic ftrace"), the code is dynamically
151		modified (code text rewrite) to disable calling of the
152		function profiler (mcount). This lets tracing be configured
153		in with practically no overhead in performance.  This also
154		has a side effect of enabling or disabling specific functions
155		to be traced. Echoing names of functions into this file
156		will limit the trace to only those functions.
157	
158		This interface also allows for commands to be used. See the
159		"Filter commands" section for more details.
160	
161	  set_ftrace_notrace:
162	
163		This has an effect opposite to that of
164		set_ftrace_filter. Any function that is added here will not
165		be traced. If a function exists in both set_ftrace_filter
166		and set_ftrace_notrace,	the function will _not_ be traced.
167	
168	  set_ftrace_pid:
169	
170		Have the function tracer only trace a single thread.
171	
172	  set_graph_function:
173	
174		Set a "trigger" function where tracing should start
175		with the function graph tracer (See the section
176		"dynamic ftrace" for more details).
177	
178	  available_filter_functions:
179	
180		This lists the functions that ftrace
181		has processed and can trace. These are the function
182		names that you can pass to "set_ftrace_filter" or
183		"set_ftrace_notrace". (See the section "dynamic ftrace"
184		below for more details.)
185	
186	
187	The Tracers
188	-----------
189	
190	Here is the list of current tracers that may be configured.
191	
192	  "function"
193	
194		Function call tracer to trace all kernel functions.
195	
196	  "function_graph"
197	
198		Similar to the function tracer except that the
199		function tracer probes the functions on their entry
200		whereas the function graph tracer traces on both entry
201		and exit of the functions. It then provides the ability
202		to draw a graph of function calls similar to C code
203		source.
204	
205	  "irqsoff"
206	
207		Traces the areas that disable interrupts and saves
208		the trace with the longest max latency.
209		See tracing_max_latency. When a new max is recorded,
210		it replaces the old trace. It is best to view this
211		trace with the latency-format option enabled.
212	
213	  "preemptoff"
214	
215		Similar to irqsoff but traces and records the amount of
216		time for which preemption is disabled.
217	
218	  "preemptirqsoff"
219	
220		Similar to irqsoff and preemptoff, but traces and
221		records the largest time for which irqs and/or preemption
222		is disabled.
223	
224	  "wakeup"
225	
226		Traces and records the max latency that it takes for
227		the highest priority task to get scheduled after
228		it has been woken up.
229	        Traces all tasks as an average developer would expect.
230	
231	  "wakeup_rt"
232	
233	        Traces and records the max latency that it takes for just
234	        RT tasks (as the current "wakeup" does). This is useful
235	        for those interested in wake up timings of RT tasks.
236	
237	  "hw-branch-tracer"
238	
239		Uses the BTS CPU feature on x86 CPUs to traces all
240		branches executed.
241	
242	  "nop"
243	
244		This is the "trace nothing" tracer. To remove all
245		tracers from tracing simply echo "nop" into
246		current_tracer.
247	
248	
249	Examples of using the tracer
250	----------------------------
251	
252	Here are typical examples of using the tracers when controlling
253	them only with the debugfs interface (without using any
254	user-land utilities).
255	
256	Output format:
257	--------------
258	
259	Here is an example of the output format of the file "trace"
260	
261	                             --------
262	# tracer: function
263	#
264	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
265	#              | |      |          |         |
266	            bash-4251  [01] 10152.583854: path_put <-path_walk
267	            bash-4251  [01] 10152.583855: dput <-path_put
268	            bash-4251  [01] 10152.583855: _atomic_dec_and_lock <-dput
269	                             --------
270	
271	A header is printed with the tracer name that is represented by
272	the trace. In this case the tracer is "function". Then a header
273	showing the format. Task name "bash", the task PID "4251", the
274	CPU that it was running on "01", the timestamp in <secs>.<usecs>
275	format, the function name that was traced "path_put" and the
276	parent function that called this function "path_walk". The
277	timestamp is the time at which the function was entered.
278	
279	Latency trace format
280	--------------------
281	
282	When the latency-format option is enabled, the trace file gives
283	somewhat more information to see why a latency happened.
284	Here is a typical trace.
285	
286	# tracer: irqsoff
287	#
288	irqsoff latency trace v1.1.5 on 2.6.26-rc8
289	--------------------------------------------------------------------
290	 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
291	    -----------------
292	    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
293	    -----------------
294	 => started at: apic_timer_interrupt
295	 => ended at:   do_softirq
296	
297	#                _------=> CPU#
298	#               / _-----=> irqs-off
299	#              | / _----=> need-resched
300	#              || / _---=> hardirq/softirq
301	#              ||| / _--=> preempt-depth
302	#              |||| /
303	#              |||||     delay
304	#  cmd     pid ||||| time  |   caller
305	#     \   /    |||||   \   |   /
306	  <idle>-0     0d..1    0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
307	  <idle>-0     0d.s.   97us : __do_softirq (do_softirq)
308	  <idle>-0     0d.s1   98us : trace_hardirqs_on (do_softirq)
309	
310	
311	This shows that the current tracer is "irqsoff" tracing the time
312	for which interrupts were disabled. It gives the trace version
313	and the version of the kernel upon which this was executed on
314	(2.6.26-rc8). Then it displays the max latency in microsecs (97
315	us). The number of trace entries displayed and the total number
316	recorded (both are three: #3/3). The type of preemption that was
317	used (PREEMPT). VP, KP, SP, and HP are always zero and are
318	reserved for later use. #P is the number of online CPUS (#P:2).
319	
320	The task is the process that was running when the latency
321	occurred. (swapper pid: 0).
322	
323	The start and stop (the functions in which the interrupts were
324	disabled and enabled respectively) that caused the latencies:
325	
326	  apic_timer_interrupt is where the interrupts were disabled.
327	  do_softirq is where they were enabled again.
328	
329	The next lines after the header are the trace itself. The header
330	explains which is which.
331	
332	  cmd: The name of the process in the trace.
333	
334	  pid: The PID of that process.
335	
336	  CPU#: The CPU which the process was running on.
337	
338	  irqs-off: 'd' interrupts are disabled. '.' otherwise.
339		    Note: If the architecture does not support a way to
340			  read the irq flags variable, an 'X' will always
341			  be printed here.
342	
343	  need-resched: 'N' task need_resched is set, '.' otherwise.
344	
345	  hardirq/softirq:
346		'H' - hard irq occurred inside a softirq.
347		'h' - hard irq is running
348		's' - soft irq is running
349		'.' - normal context.
350	
351	  preempt-depth: The level of preempt_disabled
352	
353	The above is mostly meaningful for kernel developers.
354	
355	  time: When the latency-format option is enabled, the trace file
356		output includes a timestamp relative to the start of the
357		trace. This differs from the output when latency-format
358		is disabled, which includes an absolute timestamp.
359	
360	  delay: This is just to help catch your eye a bit better. And
361		 needs to be fixed to be only relative to the same CPU.
362		 The marks are determined by the difference between this
363		 current trace and the next trace.
364		  '!' - greater than preempt_mark_thresh (default 100)
365		  '+' - greater than 1 microsecond
366		  ' ' - less than or equal to 1 microsecond.
367	
368	  The rest is the same as the 'trace' file.
369	
370	
371	trace_options
372	-------------
373	
374	The trace_options file is used to control what gets printed in
375	the trace output. To see what is available, simply cat the file:
376	
377	  cat trace_options
378	  print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
379	  noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
380	
381	To disable one of the options, echo in the option prepended with
382	"no".
383	
384	  echo noprint-parent > trace_options
385	
386	To enable an option, leave off the "no".
387	
388	  echo sym-offset > trace_options
389	
390	Here are the available options:
391	
392	  print-parent - On function traces, display the calling (parent)
393			 function as well as the function being traced.
394	
395	  print-parent:
396	   bash-4000  [01]  1477.606694: simple_strtoul <-strict_strtoul
397	
398	  noprint-parent:
399	   bash-4000  [01]  1477.606694: simple_strtoul
400	
401	
402	  sym-offset - Display not only the function name, but also the
403		       offset in the function. For example, instead of
404		       seeing just "ktime_get", you will see
405		       "ktime_get+0xb/0x20".
406	
407	  sym-offset:
408	   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
409	
410	  sym-addr - this will also display the function address as well
411		     as the function name.
412	
413	  sym-addr:
414	   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
415	
416	  verbose - This deals with the trace file when the
417	            latency-format option is enabled.
418	
419	    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
420	    (+0.000ms): simple_strtoul (strict_strtoul)
421	
422	  raw - This will display raw numbers. This option is best for
423		use with user applications that can translate the raw
424		numbers better than having it done in the kernel.
425	
426	  hex - Similar to raw, but the numbers will be in a hexadecimal
427		format.
428	
429	  bin - This will print out the formats in raw binary.
430	
431	  block - TBD (needs update)
432	
433	  stacktrace - This is one of the options that changes the trace
434		       itself. When a trace is recorded, so is the stack
435		       of functions. This allows for back traces of
436		       trace sites.
437	
438	  userstacktrace - This option changes the trace. It records a
439			   stacktrace of the current userspace thread.
440	
441	  sym-userobj - when user stacktrace are enabled, look up which
442			object the address belongs to, and print a
443			relative address. This is especially useful when
444			ASLR is on, otherwise you don't get a chance to
445			resolve the address to object/file/line after
446			the app is no longer running
447	
448			The lookup is performed when you read
449			trace,trace_pipe. Example:
450	
451			a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
452	x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
453	
454	  sched-tree - trace all tasks that are on the runqueue, at
455		       every scheduling event. Will add overhead if
456		       there's a lot of tasks running at once.
457	
458	  latency-format - This option changes the trace. When
459	                   it is enabled, the trace displays
460	                   additional information about the
461	                   latencies, as described in "Latency
462	                   trace format".
463	
464	  overwrite - This controls what happens when the trace buffer is
465	              full. If "1" (default), the oldest events are
466	              discarded and overwritten. If "0", then the newest
467	              events are discarded.
468	
469	ftrace_enabled
470	--------------
471	
472	The following tracers (listed below) give different output
473	depending on whether or not the sysctl ftrace_enabled is set. To
474	set ftrace_enabled, one can either use the sysctl function or
475	set it via the proc file system interface.
476	
477	  sysctl kernel.ftrace_enabled=1
478	
479	 or
480	
481	  echo 1 > /proc/sys/kernel/ftrace_enabled
482	
483	To disable ftrace_enabled simply replace the '1' with '0' in the
484	above commands.
485	
486	When ftrace_enabled is set the tracers will also record the
487	functions that are within the trace. The descriptions of the
488	tracers will also show an example with ftrace enabled.
489	
490	
491	irqsoff
492	-------
493	
494	When interrupts are disabled, the CPU can not react to any other
495	external event (besides NMIs and SMIs). This prevents the timer
496	interrupt from triggering or the mouse interrupt from letting
497	the kernel know of a new mouse event. The result is a latency
498	with the reaction time.
499	
500	The irqsoff tracer tracks the time for which interrupts are
501	disabled. When a new maximum latency is hit, the tracer saves
502	the trace leading up to that latency point so that every time a
503	new maximum is reached, the old saved trace is discarded and the
504	new trace is saved.
505	
506	To reset the maximum, echo 0 into tracing_max_latency. Here is
507	an example:
508	
509	 # echo irqsoff > current_tracer
510	 # echo latency-format > trace_options
511	 # echo 0 > tracing_max_latency
512	 # echo 1 > tracing_on
513	 # ls -ltr
514	 [...]
515	 # echo 0 > tracing_on
516	 # cat trace
517	# tracer: irqsoff
518	#
519	irqsoff latency trace v1.1.5 on 2.6.26
520	--------------------------------------------------------------------
521	 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
522	    -----------------
523	    | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
524	    -----------------
525	 => started at: sys_setpgid
526	 => ended at:   sys_setpgid
527	
528	#                _------=> CPU#
529	#               / _-----=> irqs-off
530	#              | / _----=> need-resched
531	#              || / _---=> hardirq/softirq
532	#              ||| / _--=> preempt-depth
533	#              |||| /
534	#              |||||     delay
535	#  cmd     pid ||||| time  |   caller
536	#     \   /    |||||   \   |   /
537	    bash-3730  1d...    0us : _write_lock_irq (sys_setpgid)
538	    bash-3730  1d..1    1us+: _write_unlock_irq (sys_setpgid)
539	    bash-3730  1d..2   14us : trace_hardirqs_on (sys_setpgid)
540	
541	
542	Here we see that that we had a latency of 12 microsecs (which is
543	very good). The _write_lock_irq in sys_setpgid disabled
544	interrupts. The difference between the 12 and the displayed
545	timestamp 14us occurred because the clock was incremented
546	between the time of recording the max latency and the time of
547	recording the function that had that latency.
548	
549	Note the above example had ftrace_enabled not set. If we set the
550	ftrace_enabled, we get a much larger output:
551	
552	# tracer: irqsoff
553	#
554	irqsoff latency trace v1.1.5 on 2.6.26-rc8
555	--------------------------------------------------------------------
556	 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
557	    -----------------
558	    | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
559	    -----------------
560	 => started at: __alloc_pages_internal
561	 => ended at:   __alloc_pages_internal
562	
563	#                _------=> CPU#
564	#               / _-----=> irqs-off
565	#              | / _----=> need-resched
566	#              || / _---=> hardirq/softirq
567	#              ||| / _--=> preempt-depth
568	#              |||| /
569	#              |||||     delay
570	#  cmd     pid ||||| time  |   caller
571	#     \   /    |||||   \   |   /
572	      ls-4339  0...1    0us+: get_page_from_freelist (__alloc_pages_internal)
573	      ls-4339  0d..1    3us : rmqueue_bulk (get_page_from_freelist)
574	      ls-4339  0d..1    3us : _spin_lock (rmqueue_bulk)
575	      ls-4339  0d..1    4us : add_preempt_count (_spin_lock)
576	      ls-4339  0d..2    4us : __rmqueue (rmqueue_bulk)
577	      ls-4339  0d..2    5us : __rmqueue_smallest (__rmqueue)
578	      ls-4339  0d..2    5us : __mod_zone_page_state (__rmqueue_smallest)
579	      ls-4339  0d..2    6us : __rmqueue (rmqueue_bulk)
580	      ls-4339  0d..2    6us : __rmqueue_smallest (__rmqueue)
581	      ls-4339  0d..2    7us : __mod_zone_page_state (__rmqueue_smallest)
582	      ls-4339  0d..2    7us : __rmqueue (rmqueue_bulk)
583	      ls-4339  0d..2    8us : __rmqueue_smallest (__rmqueue)
584	[...]
585	      ls-4339  0d..2   46us : __rmqueue_smallest (__rmqueue)
586	      ls-4339  0d..2   47us : __mod_zone_page_state (__rmqueue_smallest)
587	      ls-4339  0d..2   47us : __rmqueue (rmqueue_bulk)
588	      ls-4339  0d..2   48us : __rmqueue_smallest (__rmqueue)
589	      ls-4339  0d..2   48us : __mod_zone_page_state (__rmqueue_smallest)
590	      ls-4339  0d..2   49us : _spin_unlock (rmqueue_bulk)
591	      ls-4339  0d..2   49us : sub_preempt_count (_spin_unlock)
592	      ls-4339  0d..1   50us : get_page_from_freelist (__alloc_pages_internal)
593	      ls-4339  0d..2   51us : trace_hardirqs_on (__alloc_pages_internal)
594	
595	
596	
597	Here we traced a 50 microsecond latency. But we also see all the
598	functions that were called during that time. Note that by
599	enabling function tracing, we incur an added overhead. This
600	overhead may extend the latency times. But nevertheless, this
601	trace has provided some very helpful debugging information.
602	
603	
604	preemptoff
605	----------
606	
607	When preemption is disabled, we may be able to receive
608	interrupts but the task cannot be preempted and a higher
609	priority task must wait for preemption to be enabled again
610	before it can preempt a lower priority task.
611	
612	The preemptoff tracer traces the places that disable preemption.
613	Like the irqsoff tracer, it records the maximum latency for
614	which preemption was disabled. The control of preemptoff tracer
615	is much like the irqsoff tracer.
616	
617	 # echo preemptoff > current_tracer
618	 # echo latency-format > trace_options
619	 # echo 0 > tracing_max_latency
620	 # echo 1 > tracing_on
621	 # ls -ltr
622	 [...]
623	 # echo 0 > tracing_on
624	 # cat trace
625	# tracer: preemptoff
626	#
627	preemptoff latency trace v1.1.5 on 2.6.26-rc8
628	--------------------------------------------------------------------
629	 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
630	    -----------------
631	    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
632	    -----------------
633	 => started at: do_IRQ
634	 => ended at:   __do_softirq
635	
636	#                _------=> CPU#
637	#               / _-----=> irqs-off
638	#              | / _----=> need-resched
639	#              || / _---=> hardirq/softirq
640	#              ||| / _--=> preempt-depth
641	#              |||| /
642	#              |||||     delay
643	#  cmd     pid ||||| time  |   caller
644	#     \   /    |||||   \   |   /
645	    sshd-4261  0d.h.    0us+: irq_enter (do_IRQ)
646	    sshd-4261  0d.s.   29us : _local_bh_enable (__do_softirq)
647	    sshd-4261  0d.s1   30us : trace_preempt_on (__do_softirq)
648	
649	
650	This has some more changes. Preemption was disabled when an
651	interrupt came in (notice the 'h'), and was enabled while doing
652	a softirq. (notice the 's'). But we also see that interrupts
653	have been disabled when entering the preempt off section and
654	leaving it (the 'd'). We do not know if interrupts were enabled
655	in the mean time.
656	
657	# tracer: preemptoff
658	#
659	preemptoff latency trace v1.1.5 on 2.6.26-rc8
660	--------------------------------------------------------------------
661	 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
662	    -----------------
663	    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
664	    -----------------
665	 => started at: remove_wait_queue
666	 => ended at:   __do_softirq
667	
668	#                _------=> CPU#
669	#               / _-----=> irqs-off
670	#              | / _----=> need-resched
671	#              || / _---=> hardirq/softirq
672	#              ||| / _--=> preempt-depth
673	#              |||| /
674	#              |||||     delay
675	#  cmd     pid ||||| time  |   caller
676	#     \   /    |||||   \   |   /
677	    sshd-4261  0d..1    0us : _spin_lock_irqsave (remove_wait_queue)
678	    sshd-4261  0d..1    1us : _spin_unlock_irqrestore (remove_wait_queue)
679	    sshd-4261  0d..1    2us : do_IRQ (common_interrupt)
680	    sshd-4261  0d..1    2us : irq_enter (do_IRQ)
681	    sshd-4261  0d..1    2us : idle_cpu (irq_enter)
682	    sshd-4261  0d..1    3us : add_preempt_count (irq_enter)
683	    sshd-4261  0d.h1    3us : idle_cpu (irq_enter)
684	    sshd-4261  0d.h.    4us : handle_fasteoi_irq (do_IRQ)
685	[...]
686	    sshd-4261  0d.h.   12us : add_preempt_count (_spin_lock)
687	    sshd-4261  0d.h1   12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
688	    sshd-4261  0d.h1   13us : move_native_irq (ack_ioapic_quirk_irq)
689	    sshd-4261  0d.h1   13us : _spin_unlock (handle_fasteoi_irq)
690	    sshd-4261  0d.h1   14us : sub_preempt_count (_spin_unlock)
691	    sshd-4261  0d.h1   14us : irq_exit (do_IRQ)
692	    sshd-4261  0d.h1   15us : sub_preempt_count (irq_exit)
693	    sshd-4261  0d..2   15us : do_softirq (irq_exit)
694	    sshd-4261  0d...   15us : __do_softirq (do_softirq)
695	    sshd-4261  0d...   16us : __local_bh_disable (__do_softirq)
696	    sshd-4261  0d...   16us+: add_preempt_count (__local_bh_disable)
697	    sshd-4261  0d.s4   20us : add_preempt_count (__local_bh_disable)
698	    sshd-4261  0d.s4   21us : sub_preempt_count (local_bh_enable)
699	    sshd-4261  0d.s5   21us : sub_preempt_count (local_bh_enable)
700	[...]
701	    sshd-4261  0d.s6   41us : add_preempt_count (__local_bh_disable)
702	    sshd-4261  0d.s6   42us : sub_preempt_count (local_bh_enable)
703	    sshd-4261  0d.s7   42us : sub_preempt_count (local_bh_enable)
704	    sshd-4261  0d.s5   43us : add_preempt_count (__local_bh_disable)
705	    sshd-4261  0d.s5   43us : sub_preempt_count (local_bh_enable_ip)
706	    sshd-4261  0d.s6   44us : sub_preempt_count (local_bh_enable_ip)
707	    sshd-4261  0d.s5   44us : add_preempt_count (__local_bh_disable)
708	    sshd-4261  0d.s5   45us : sub_preempt_count (local_bh_enable)
709	[...]
710	    sshd-4261  0d.s.   63us : _local_bh_enable (__do_softirq)
711	    sshd-4261  0d.s1   64us : trace_preempt_on (__do_softirq)
712	
713	
714	The above is an example of the preemptoff trace with
715	ftrace_enabled set. Here we see that interrupts were disabled
716	the entire time. The irq_enter code lets us know that we entered
717	an interrupt 'h'. Before that, the functions being traced still
718	show that it is not in an interrupt, but we can see from the
719	functions themselves that this is not the case.
720	
721	Notice that __do_softirq when called does not have a
722	preempt_count. It may seem that we missed a preempt enabling.
723	What really happened is that the preempt count is held on the
724	thread's stack and we switched to the softirq stack (4K stacks
725	in effect). The code does not copy the preempt count, but
726	because interrupts are disabled, we do not need to worry about
727	it. Having a tracer like this is good for letting people know
728	what really happens inside the kernel.
729	
730	
731	preemptirqsoff
732	--------------
733	
734	Knowing the locations that have interrupts disabled or
735	preemption disabled for the longest times is helpful. But
736	sometimes we would like to know when either preemption and/or
737	interrupts are disabled.
738	
739	Consider the following code:
740	
741	    local_irq_disable();
742	    call_function_with_irqs_off();
743	    preempt_disable();
744	    call_function_with_irqs_and_preemption_off();
745	    local_irq_enable();
746	    call_function_with_preemption_off();
747	    preempt_enable();
748	
749	The irqsoff tracer will record the total length of
750	call_function_with_irqs_off() and
751	call_function_with_irqs_and_preemption_off().
752	
753	The preemptoff tracer will record the total length of
754	call_function_with_irqs_and_preemption_off() and
755	call_function_with_preemption_off().
756	
757	But neither will trace the time that interrupts and/or
758	preemption is disabled. This total time is the time that we can
759	not schedule. To record this time, use the preemptirqsoff
760	tracer.
761	
762	Again, using this trace is much like the irqsoff and preemptoff
763	tracers.
764	
765	 # echo preemptirqsoff > current_tracer
766	 # echo latency-format > trace_options
767	 # echo 0 > tracing_max_latency
768	 # echo 1 > tracing_on
769	 # ls -ltr
770	 [...]
771	 # echo 0 > tracing_on
772	 # cat trace
773	# tracer: preemptirqsoff
774	#
775	preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
776	--------------------------------------------------------------------
777	 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
778	    -----------------
779	    | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
780	    -----------------
781	 => started at: apic_timer_interrupt
782	 => ended at:   __do_softirq
783	
784	#                _------=> CPU#
785	#               / _-----=> irqs-off
786	#              | / _----=> need-resched
787	#              || / _---=> hardirq/softirq
788	#              ||| / _--=> preempt-depth
789	#              |||| /
790	#              |||||     delay
791	#  cmd     pid ||||| time  |   caller
792	#     \   /    |||||   \   |   /
793	      ls-4860  0d...    0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
794	      ls-4860  0d.s.  294us : _local_bh_enable (__do_softirq)
795	      ls-4860  0d.s1  294us : trace_preempt_on (__do_softirq)
796	
797	
798	
799	The trace_hardirqs_off_thunk is called from assembly on x86 when
800	interrupts are disabled in the assembly code. Without the
801	function tracing, we do not know if interrupts were enabled
802	within the preemption points. We do see that it started with
803	preemption enabled.
804	
805	Here is a trace with ftrace_enabled set:
806	
807	
808	# tracer: preemptirqsoff
809	#
810	preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
811	--------------------------------------------------------------------
812	 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
813	    -----------------
814	    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
815	    -----------------
816	 => started at: write_chan
817	 => ended at:   __do_softirq
818	
819	#                _------=> CPU#
820	#               / _-----=> irqs-off
821	#              | / _----=> need-resched
822	#              || / _---=> hardirq/softirq
823	#              ||| / _--=> preempt-depth
824	#              |||| /
825	#              |||||     delay
826	#  cmd     pid ||||| time  |   caller
827	#     \   /    |||||   \   |   /
828	      ls-4473  0.N..    0us : preempt_schedule (write_chan)
829	      ls-4473  0dN.1    1us : _spin_lock (schedule)
830	      ls-4473  0dN.1    2us : add_preempt_count (_spin_lock)
831	      ls-4473  0d..2    2us : put_prev_task_fair (schedule)
832	[...]
833	      ls-4473  0d..2   13us : set_normalized_timespec (ktime_get_ts)
834	      ls-4473  0d..2   13us : __switch_to (schedule)
835	    sshd-4261  0d..2   14us : finish_task_switch (schedule)
836	    sshd-4261  0d..2   14us : _spin_unlock_irq (finish_task_switch)
837	    sshd-4261  0d..1   15us : add_preempt_count (_spin_lock_irqsave)
838	    sshd-4261  0d..2   16us : _spin_unlock_irqrestore (hrtick_set)
839	    sshd-4261  0d..2   16us : do_IRQ (common_interrupt)
840	    sshd-4261  0d..2   17us : irq_enter (do_IRQ)
841	    sshd-4261  0d..2   17us : idle_cpu (irq_enter)
842	    sshd-4261  0d..2   18us : add_preempt_count (irq_enter)
843	    sshd-4261  0d.h2   18us : idle_cpu (irq_enter)
844	    sshd-4261  0d.h.   18us : handle_fasteoi_irq (do_IRQ)
845	    sshd-4261  0d.h.   19us : _spin_lock (handle_fasteoi_irq)
846	    sshd-4261  0d.h.   19us : add_preempt_count (_spin_lock)
847	    sshd-4261  0d.h1   20us : _spin_unlock (handle_fasteoi_irq)
848	    sshd-4261  0d.h1   20us : sub_preempt_count (_spin_unlock)
849	[...]
850	    sshd-4261  0d.h1   28us : _spin_unlock (handle_fasteoi_irq)
851	    sshd-4261  0d.h1   29us : sub_preempt_count (_spin_unlock)
852	    sshd-4261  0d.h2   29us : irq_exit (do_IRQ)
853	    sshd-4261  0d.h2   29us : sub_preempt_count (irq_exit)
854	    sshd-4261  0d..3   30us : do_softirq (irq_exit)
855	    sshd-4261  0d...   30us : __do_softirq (do_softirq)
856	    sshd-4261  0d...   31us : __local_bh_disable (__do_softirq)
857	    sshd-4261  0d...   31us+: add_preempt_count (__local_bh_disable)
858	    sshd-4261  0d.s4   34us : add_preempt_count (__local_bh_disable)
859	[...]
860	    sshd-4261  0d.s3   43us : sub_preempt_count (local_bh_enable_ip)
861	    sshd-4261  0d.s4   44us : sub_preempt_count (local_bh_enable_ip)
862	    sshd-4261  0d.s3   44us : smp_apic_timer_interrupt (apic_timer_interrupt)
863	    sshd-4261  0d.s3   45us : irq_enter (smp_apic_timer_interrupt)
864	    sshd-4261  0d.s3   45us : idle_cpu (irq_enter)
865	    sshd-4261  0d.s3   46us : add_preempt_count (irq_enter)
866	    sshd-4261  0d.H3   46us : idle_cpu (irq_enter)
867	    sshd-4261  0d.H3   47us : hrtimer_interrupt (smp_apic_timer_interrupt)
868	    sshd-4261  0d.H3   47us : ktime_get (hrtimer_interrupt)
869	[...]
870	    sshd-4261  0d.H3   81us : tick_program_event (hrtimer_interrupt)
871	    sshd-4261  0d.H3   82us : ktime_get (tick_program_event)
872	    sshd-4261  0d.H3   82us : ktime_get_ts (ktime_get)
873	    sshd-4261  0d.H3   83us : getnstimeofday (ktime_get_ts)
874	    sshd-4261  0d.H3   83us : set_normalized_timespec (ktime_get_ts)
875	    sshd-4261  0d.H3   84us : clockevents_program_event (tick_program_event)
876	    sshd-4261  0d.H3   84us : lapic_next_event (clockevents_program_event)
877	    sshd-4261  0d.H3   85us : irq_exit (smp_apic_timer_interrupt)
878	    sshd-4261  0d.H3   85us : sub_preempt_count (irq_exit)
879	    sshd-4261  0d.s4   86us : sub_preempt_count (irq_exit)
880	    sshd-4261  0d.s3   86us : add_preempt_count (__local_bh_disable)
881	[...]
882	    sshd-4261  0d.s1   98us : sub_preempt_count (net_rx_action)
883	    sshd-4261  0d.s.   99us : add_preempt_count (_spin_lock_irq)
884	    sshd-4261  0d.s1   99us+: _spin_unlock_irq (run_timer_softirq)
885	    sshd-4261  0d.s.  104us : _local_bh_enable (__do_softirq)
886	    sshd-4261  0d.s.  104us : sub_preempt_count (_local_bh_enable)
887	    sshd-4261  0d.s.  105us : _local_bh_enable (__do_softirq)
888	    sshd-4261  0d.s1  105us : trace_preempt_on (__do_softirq)
889	
890	
891	This is a very interesting trace. It started with the preemption
892	of the ls task. We see that the task had the "need_resched" bit
893	set via the 'N' in the trace.  Interrupts were disabled before
894	the spin_lock at the beginning of the trace. We see that a
895	schedule took place to run sshd.  When the interrupts were
896	enabled, we took an interrupt. On return from the interrupt
897	handler, the softirq ran. We took another interrupt while
898	running the softirq as we see from the capital 'H'.
899	
900	
901	wakeup
902	------
903	
904	In a Real-Time environment it is very important to know the
905	wakeup time it takes for the highest priority task that is woken
906	up to the time that it executes. This is also known as "schedule
907	latency". I stress the point that this is about RT tasks. It is
908	also important to know the scheduling latency of non-RT tasks,
909	but the average schedule latency is better for non-RT tasks.
910	Tools like LatencyTop are more appropriate for such
911	measurements.
912	
913	Real-Time environments are interested in the worst case latency.
914	That is the longest latency it takes for something to happen,
915	and not the average. We can have a very fast scheduler that may
916	only have a large latency once in a while, but that would not
917	work well with Real-Time tasks.  The wakeup tracer was designed
918	to record the worst case wakeups of RT tasks. Non-RT tasks are
919	not recorded because the tracer only records one worst case and
920	tracing non-RT tasks that are unpredictable will overwrite the
921	worst case latency of RT tasks.
922	
923	Since this tracer only deals with RT tasks, we will run this
924	slightly differently than we did with the previous tracers.
925	Instead of performing an 'ls', we will run 'sleep 1' under
926	'chrt' which changes the priority of the task.
927	
928	 # echo wakeup > current_tracer
929	 # echo latency-format > trace_options
930	 # echo 0 > tracing_max_latency
931	 # echo 1 > tracing_on
932	 # chrt -f 5 sleep 1
933	 # echo 0 > tracing_on
934	 # cat trace
935	# tracer: wakeup
936	#
937	wakeup latency trace v1.1.5 on 2.6.26-rc8
938	--------------------------------------------------------------------
939	 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
940	    -----------------
941	    | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
942	    -----------------
943	
944	#                _------=> CPU#
945	#               / _-----=> irqs-off
946	#              | / _----=> need-resched
947	#              || / _---=> hardirq/softirq
948	#              ||| / _--=> preempt-depth
949	#              |||| /
950	#              |||||     delay
951	#  cmd     pid ||||| time  |   caller
952	#     \   /    |||||   \   |   /
953	  <idle>-0     1d.h4    0us+: try_to_wake_up (wake_up_process)
954	  <idle>-0     1d..4    4us : schedule (cpu_idle)
955	
956	
957	Running this on an idle system, we see that it only took 4
958	microseconds to perform the task switch.  Note, since the trace
959	marker in the schedule is before the actual "switch", we stop
960	the tracing when the recorded task is about to schedule in. This
961	may change if we add a new marker at the end of the scheduler.
962	
963	Notice that the recorded task is 'sleep' with the PID of 4901
964	and it has an rt_prio of 5. This priority is user-space priority
965	and not the internal kernel priority. The policy is 1 for
966	SCHED_FIFO and 2 for SCHED_RR.
967	
968	Doing the same with chrt -r 5 and ftrace_enabled set.
969	
970	# tracer: wakeup
971	#
972	wakeup latency trace v1.1.5 on 2.6.26-rc8
973	--------------------------------------------------------------------
974	 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
975	    -----------------
976	    | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
977	    -----------------
978	
979	#                _------=> CPU#
980	#               / _-----=> irqs-off
981	#              | / _----=> need-resched
982	#              || / _---=> hardirq/softirq
983	#              ||| / _--=> preempt-depth
984	#              |||| /
985	#              |||||     delay
986	#  cmd     pid ||||| time  |   caller
987	#     \   /    |||||   \   |   /
988	ksoftirq-7     1d.H3    0us : try_to_wake_up (wake_up_process)
989	ksoftirq-7     1d.H4    1us : sub_preempt_count (marker_probe_cb)
990	ksoftirq-7     1d.H3    2us : check_preempt_wakeup (try_to_wake_up)
991	ksoftirq-7     1d.H3    3us : update_curr (check_preempt_wakeup)
992	ksoftirq-7     1d.H3    4us : calc_delta_mine (update_curr)
993	ksoftirq-7     1d.H3    5us : __resched_task (check_preempt_wakeup)
994	ksoftirq-7     1d.H3    6us : task_wake_up_rt (try_to_wake_up)
995	ksoftirq-7     1d.H3    7us : _spin_unlock_irqrestore (try_to_wake_up)
996	[...]
997	ksoftirq-7     1d.H2   17us : irq_exit (smp_apic_timer_interrupt)
998	ksoftirq-7     1d.H2   18us : sub_preempt_count (irq_exit)
999	ksoftirq-7     1d.s3   19us : sub_preempt_count (irq_exit)
1000	ksoftirq-7     1..s2   20us : rcu_process_callbacks (__do_softirq)
1001	[...]
1002	ksoftirq-7     1..s2   26us : __rcu_process_callbacks (rcu_process_callbacks)
1003	ksoftirq-7     1d.s2   27us : _local_bh_enable (__do_softirq)
1004	ksoftirq-7     1d.s2   28us : sub_preempt_count (_local_bh_enable)
1005	ksoftirq-7     1.N.3   29us : sub_preempt_count (ksoftirqd)
1006	ksoftirq-7     1.N.2   30us : _cond_resched (ksoftirqd)
1007	ksoftirq-7     1.N.2   31us : __cond_resched (_cond_resched)
1008	ksoftirq-7     1.N.2   32us : add_preempt_count (__cond_resched)
1009	ksoftirq-7     1.N.2   33us : schedule (__cond_resched)
1010	ksoftirq-7     1.N.2   33us : add_preempt_count (schedule)
1011	ksoftirq-7     1.N.3   34us : hrtick_clear (schedule)
1012	ksoftirq-7     1dN.3   35us : _spin_lock (schedule)
1013	ksoftirq-7     1dN.3   36us : add_preempt_count (_spin_lock)
1014	ksoftirq-7     1d..4   37us : put_prev_task_fair (schedule)
1015	ksoftirq-7     1d..4   38us : update_curr (put_prev_task_fair)
1016	[...]
1017	ksoftirq-7     1d..5   47us : _spin_trylock (tracing_record_cmdline)
1018	ksoftirq-7     1d..5   48us : add_preempt_count (_spin_trylock)
1019	ksoftirq-7     1d..6   49us : _spin_unlock (tracing_record_cmdline)
1020	ksoftirq-7     1d..6   49us : sub_preempt_count (_spin_unlock)
1021	ksoftirq-7     1d..4   50us : schedule (__cond_resched)
1022	
1023	The interrupt went off while running ksoftirqd. This task runs
1024	at SCHED_OTHER. Why did not we see the 'N' set early? This may
1025	be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K
1026	stacks configured, the interrupt and softirq run with their own
1027	stack. Some information is held on the top of the task's stack
1028	(need_resched and preempt_count are both stored there). The
1029	setting of the NEED_RESCHED bit is done directly to the task's
1030	stack, but the reading of the NEED_RESCHED is done by looking at
1031	the current stack, which in this case is the stack for the hard
1032	interrupt. This hides the fact that NEED_RESCHED has been set.
1033	We do not see the 'N' until we switch back to the task's
1034	assigned stack.
1035	
1036	function
1037	--------
1038	
1039	This tracer is the function tracer. Enabling the function tracer
1040	can be done from the debug file system. Make sure the
1041	ftrace_enabled is set; otherwise this tracer is a nop.
1042	
1043	 # sysctl kernel.ftrace_enabled=1
1044	 # echo function > current_tracer
1045	 # echo 1 > tracing_on
1046	 # usleep 1
1047	 # echo 0 > tracing_on
1048	 # cat trace
1049	# tracer: function
1050	#
1051	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1052	#              | |      |          |         |
1053	            bash-4003  [00]   123.638713: finish_task_switch <-schedule
1054	            bash-4003  [00]   123.638714: _spin_unlock_irq <-finish_task_switch
1055	            bash-4003  [00]   123.638714: sub_preempt_count <-_spin_unlock_irq
1056	            bash-4003  [00]   123.638715: hrtick_set <-schedule
1057	            bash-4003  [00]   123.638715: _spin_lock_irqsave <-hrtick_set
1058	            bash-4003  [00]   123.638716: add_preempt_count <-_spin_lock_irqsave
1059	            bash-4003  [00]   123.638716: _spin_unlock_irqrestore <-hrtick_set
1060	            bash-4003  [00]   123.638717: sub_preempt_count <-_spin_unlock_irqrestore
1061	            bash-4003  [00]   123.638717: hrtick_clear <-hrtick_set
1062	            bash-4003  [00]   123.638718: sub_preempt_count <-schedule
1063	            bash-4003  [00]   123.638718: sub_preempt_count <-preempt_schedule
1064	            bash-4003  [00]   123.638719: wait_for_completion <-__stop_machine_run
1065	            bash-4003  [00]   123.638719: wait_for_common <-wait_for_completion
1066	            bash-4003  [00]   123.638720: _spin_lock_irq <-wait_for_common
1067	            bash-4003  [00]   123.638720: add_preempt_count <-_spin_lock_irq
1068	[...]
1069	
1070	
1071	Note: function tracer uses ring buffers to store the above
1072	entries. The newest data may overwrite the oldest data.
1073	Sometimes using echo to stop the trace is not sufficient because
1074	the tracing could have overwritten the data that you wanted to
1075	record. For this reason, it is sometimes better to disable
1076	tracing directly from a program. This allows you to stop the
1077	tracing at the point that you hit the part that you are
1078	interested in. To disable the tracing directly from a C program,
1079	something like following code snippet can be used:
1080	
1081	int trace_fd;
1082	[...]
1083	int main(int argc, char *argv[]) {
1084		[...]
1085		trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
1086		[...]
1087		if (condition_hit()) {
1088			write(trace_fd, "0", 1);
1089		}
1090		[...]
1091	}
1092	
1093	
1094	Single thread tracing
1095	---------------------
1096	
1097	By writing into set_ftrace_pid you can trace a
1098	single thread. For example:
1099	
1100	# cat set_ftrace_pid
1101	no pid
1102	# echo 3111 > set_ftrace_pid
1103	# cat set_ftrace_pid
1104	3111
1105	# echo function > current_tracer
1106	# cat trace | head
1107	 # tracer: function
1108	 #
1109	 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1110	 #              | |       |          |         |
1111	     yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
1112	     yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
1113	     yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
1114	     yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
1115	     yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
1116	     yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
1117	# echo -1 > set_ftrace_pid
1118	# cat trace |head
1119	 # tracer: function
1120	 #
1121	 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1122	 #              | |       |          |         |
1123	 ##### CPU 3 buffer started ####
1124	     yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
1125	     yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
1126	     yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
1127	     yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
1128	     yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
1129	
1130	If you want to trace a function when executing, you could use
1131	something like this simple program:
1132	
1133	#include <stdio.h>
1134	#include <stdlib.h>
1135	#include <sys/types.h>
1136	#include <sys/stat.h>
1137	#include <fcntl.h>
1138	#include <unistd.h>
1139	#include <string.h>
1140	
1141	#define _STR(x) #x
1142	#define STR(x) _STR(x)
1143	#define MAX_PATH 256
1144	
1145	const char *find_debugfs(void)
1146	{
1147	       static char debugfs[MAX_PATH+1];
1148	       static int debugfs_found;
1149	       char type[100];
1150	       FILE *fp;
1151	
1152	       if (debugfs_found)
1153	               return debugfs;
1154	
1155	       if ((fp = fopen("/proc/mounts","r")) == NULL) {
1156	               perror("/proc/mounts");
1157	               return NULL;
1158	       }
1159	
1160	       while (fscanf(fp, "%*s %"
1161	                     STR(MAX_PATH)
1162	                     "s %99s %*s %*d %*d\n",
1163	                     debugfs, type) == 2) {
1164	               if (strcmp(type, "debugfs") == 0)
1165	                       break;
1166	       }
1167	       fclose(fp);
1168	
1169	       if (strcmp(type, "debugfs") != 0) {
1170	               fprintf(stderr, "debugfs not mounted");
1171	               return NULL;
1172	       }
1173	
1174	       strcat(debugfs, "/tracing/");
1175	       debugfs_found = 1;
1176	
1177	       return debugfs;
1178	}
1179	
1180	const char *tracing_file(const char *file_name)
1181	{
1182	       static char trace_file[MAX_PATH+1];
1183	       snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name);
1184	       return trace_file;
1185	}
1186	
1187	int main (int argc, char **argv)
1188	{
1189	        if (argc < 1)
1190	                exit(-1);
1191	
1192	        if (fork() > 0) {
1193	                int fd, ffd;
1194	                char line[64];
1195	                int s;
1196	
1197	                ffd = open(tracing_file("current_tracer"), O_WRONLY);
1198	                if (ffd < 0)
1199	                        exit(-1);
1200	                write(ffd, "nop", 3);
1201	
1202	                fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
1203	                s = sprintf(line, "%d\n", getpid());
1204	                write(fd, line, s);
1205	
1206	                write(ffd, "function", 8);
1207	
1208	                close(fd);
1209	                close(ffd);
1210	
1211	                execvp(argv[1], argv+1);
1212	        }
1213	
1214	        return 0;
1215	}
1216	
1217	
1218	hw-branch-tracer (x86 only)
1219	---------------------------
1220	
1221	This tracer uses the x86 last branch tracing hardware feature to
1222	collect a branch trace on all cpus with relatively low overhead.
1223	
1224	The tracer uses a fixed-size circular buffer per cpu and only
1225	traces ring 0 branches. The trace file dumps that buffer in the
1226	following format:
1227	
1228	# tracer: hw-branch-tracer
1229	#
1230	# CPU#        TO  <-  FROM
1231	   0  scheduler_tick+0xb5/0x1bf	  <-  task_tick_idle+0x5/0x6
1232	   2  run_posix_cpu_timers+0x2b/0x72a	  <-  run_posix_cpu_timers+0x25/0x72a
1233	   0  scheduler_tick+0x139/0x1bf	  <-  scheduler_tick+0xed/0x1bf
1234	   0  scheduler_tick+0x17c/0x1bf	  <-  scheduler_tick+0x148/0x1bf
1235	   2  run_posix_cpu_timers+0x9e/0x72a	  <-  run_posix_cpu_timers+0x5e/0x72a
1236	   0  scheduler_tick+0x1b6/0x1bf	  <-  scheduler_tick+0x1aa/0x1bf
1237	
1238	
1239	The tracer may be used to dump the trace for the oops'ing cpu on
1240	a kernel oops into the system log. To enable this,
1241	ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
1242	can either use the sysctl function or set it via the proc system
1243	interface.
1244	
1245	  sysctl kernel.ftrace_dump_on_oops=n
1246	
1247	or
1248	
1249	  echo n > /proc/sys/kernel/ftrace_dump_on_oops
1250	
1251	If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will
1252	only dump the buffer of the CPU that triggered the oops.
1253	
1254	Here's an example of such a dump after a null pointer
1255	dereference in a kernel module:
1256	
1257	[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
1258	[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
1259	[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
1260	[57848.106019] Oops: 0002 [#1] SMP
1261	[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
1262	[57848.106019] Dumping ftrace buffer:
1263	[57848.106019] ---------------------------------
1264	[...]
1265	[57848.106019]    0  chrdev_open+0xe6/0x165	  <-  cdev_put+0x23/0x24
1266	[57848.106019]    0  chrdev_open+0x117/0x165	  <-  chrdev_open+0xfa/0x165
1267	[57848.106019]    0  chrdev_open+0x120/0x165	  <-  chrdev_open+0x11c/0x165
1268	[57848.106019]    0  chrdev_open+0x134/0x165	  <-  chrdev_open+0x12b/0x165
1269	[57848.106019]    0  open+0x0/0x14 [oops]	  <-  chrdev_open+0x144/0x165
1270	[57848.106019]    0  page_fault+0x0/0x30	  <-  open+0x6/0x14 [oops]
1271	[57848.106019]    0  error_entry+0x0/0x5b	  <-  page_fault+0x4/0x30
1272	[57848.106019]    0  error_kernelspace+0x0/0x31	  <-  error_entry+0x59/0x5b
1273	[57848.106019]    0  error_sti+0x0/0x1	  <-  error_kernelspace+0x2d/0x31
1274	[57848.106019]    0  page_fault+0x9/0x30	  <-  error_sti+0x0/0x1
1275	[57848.106019]    0  do_page_fault+0x0/0x881	  <-  page_fault+0x1a/0x30
1276	[...]
1277	[57848.106019]    0  do_page_fault+0x66b/0x881	  <-  is_prefetch+0x1ee/0x1f2
1278	[57848.106019]    0  do_page_fault+0x6e0/0x881	  <-  do_page_fault+0x67a/0x881
1279	[57848.106019]    0  oops_begin+0x0/0x96	  <-  do_page_fault+0x6e0/0x881
1280	[57848.106019]    0  trace_hw_branch_oops+0x0/0x2d	  <-  oops_begin+0x9/0x96
1281	[...]
1282	[57848.106019]    0  ds_suspend_bts+0x2a/0xe3	  <-  ds_suspend_bts+0x1a/0xe3
1283	[57848.106019] ---------------------------------
1284	[57848.106019] CPU 0
1285	[57848.106019] Modules linked in: oops
1286	[57848.106019] Pid: 5542, comm: cat Tainted: G        W  2.6.28 #23
1287	[57848.106019] RIP: 0010:[<ffffffffa0000006>]  [<ffffffffa0000006>] open+0x6/0x14 [oops]
1288	[57848.106019] RSP: 0018:ffff880235457d48  EFLAGS: 00010246
1289	[...]
1290	
1291	
1292	function graph tracer
1293	---------------------------
1294	
1295	This tracer is similar to the function tracer except that it
1296	probes a function on its entry and its exit. This is done by
1297	using a dynamically allocated stack of return addresses in each
1298	task_struct. On function entry the tracer overwrites the return
1299	address of each function traced to set a custom probe. Thus the
1300	original return address is stored on the stack of return address
1301	in the task_struct.
1302	
1303	Probing on both ends of a function leads to special features
1304	such as:
1305	
1306	- measure of a function's time execution
1307	- having a reliable call stack to draw function calls graph
1308	
1309	This tracer is useful in several situations:
1310	
1311	- you want to find the reason of a strange kernel behavior and
1312	  need to see what happens in detail on any areas (or specific
1313	  ones).
1314	
1315	- you are experiencing weird latencies but it's difficult to
1316	  find its origin.
1317	
1318	- you want to find quickly which path is taken by a specific
1319	  function
1320	
1321	- you just want to peek inside a working kernel and want to see
1322	  what happens there.
1323	
1324	# tracer: function_graph
1325	#
1326	# CPU  DURATION                  FUNCTION CALLS
1327	# |     |   |                     |   |   |   |
1328	
1329	 0)               |  sys_open() {
1330	 0)               |    do_sys_open() {
1331	 0)               |      getname() {
1332	 0)               |        kmem_cache_alloc() {
1333	 0)   1.382 us    |          __might_sleep();
1334	 0)   2.478 us    |        }
1335	 0)               |        strncpy_from_user() {
1336	 0)               |          might_fault() {
1337	 0)   1.389 us    |            __might_sleep();
1338	 0)   2.553 us    |          }
1339	 0)   3.807 us    |        }
1340	 0)   7.876 us    |      }
1341	 0)               |      alloc_fd() {
1342	 0)   0.668 us    |        _spin_lock();
1343	 0)   0.570 us    |        expand_files();
1344	 0)   0.586 us    |        _spin_unlock();
1345	
1346	
1347	There are several columns that can be dynamically
1348	enabled/disabled. You can use every combination of options you
1349	want, depending on your needs.
1350	
1351	- The cpu number on which the function executed is default
1352	  enabled.  It is sometimes better to only trace one cpu (see
1353	  tracing_cpu_mask file) or you might sometimes see unordered
1354	  function calls while cpu tracing switch.
1355	
1356		hide: echo nofuncgraph-cpu > trace_options
1357		show: echo funcgraph-cpu > trace_options
1358	
1359	- The duration (function's time of execution) is displayed on
1360	  the closing bracket line of a function or on the same line
1361	  than the current function in case of a leaf one. It is default
1362	  enabled.
1363	
1364		hide: echo nofuncgraph-duration > trace_options
1365		show: echo funcgraph-duration > trace_options
1366	
1367	- The overhead field precedes the duration field in case of
1368	  reached duration thresholds.
1369	
1370		hide: echo nofuncgraph-overhead > trace_options
1371		show: echo funcgraph-overhead > trace_options
1372		depends on: funcgraph-duration
1373	
1374	  ie:
1375	
1376	  0)               |    up_write() {
1377	  0)   0.646 us    |      _spin_lock_irqsave();
1378	  0)   0.684 us    |      _spin_unlock_irqrestore();
1379	  0)   3.123 us    |    }
1380	  0)   0.548 us    |    fput();
1381	  0) + 58.628 us   |  }
1382	
1383	  [...]
1384	
1385	  0)               |      putname() {
1386	  0)               |        kmem_cache_free() {
1387	  0)   0.518 us    |          __phys_addr();
1388	  0)   1.757 us    |        }
1389	  0)   2.861 us    |      }
1390	  0) ! 115.305 us  |    }
1391	  0) ! 116.402 us  |  }
1392	
1393	  + means that the function exceeded 10 usecs.
1394	  ! means that the function exceeded 100 usecs.
1395	
1396	
1397	- The task/pid field displays the thread cmdline and pid which
1398	  executed the function. It is default disabled.
1399	
1400		hide: echo nofuncgraph-proc > trace_options
1401		show: echo funcgraph-proc > trace_options
1402	
1403	  ie:
1404	
1405	  # tracer: function_graph
1406	  #
1407	  # CPU  TASK/PID        DURATION                  FUNCTION CALLS
1408	  # |    |    |           |   |                     |   |   |   |
1409	  0)    sh-4802     |               |                  d_free() {
1410	  0)    sh-4802     |               |                    call_rcu() {
1411	  0)    sh-4802     |               |                      __call_rcu() {
1412	  0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
1413	  0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
1414	  0)    sh-4802     |   2.899 us    |                      }
1415	  0)    sh-4802     |   4.040 us    |                    }
1416	  0)    sh-4802     |   5.151 us    |                  }
1417	  0)    sh-4802     | + 49.370 us   |                }
1418	
1419	
1420	- The absolute time field is an absolute timestamp given by the
1421	  system clock since it started. A snapshot of this time is
1422	  given on each entry/exit of functions
1423	
1424		hide: echo nofuncgraph-abstime > trace_options
1425		show: echo funcgraph-abstime > trace_options
1426	
1427	  ie:
1428	
1429	  #
1430	  #      TIME       CPU  DURATION                  FUNCTION CALLS
1431	  #       |         |     |   |                     |   |   |   |
1432	  360.774522 |   1)   0.541 us    |                                          }
1433	  360.774522 |   1)   4.663 us    |                                        }
1434	  360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
1435	  360.774524 |   1)   6.796 us    |                                      }
1436	  360.774524 |   1)   7.952 us    |                                    }
1437	  360.774525 |   1)   9.063 us    |                                  }
1438	  360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
1439	  360.774527 |   1)   0.578 us    |                                  __brelse();
1440	  360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
1441	  360.774528 |   1)               |                                    unlock_buffer() {
1442	  360.774529 |   1)               |                                      wake_up_bit() {
1443	  360.774529 |   1)               |                                        bit_waitqueue() {
1444	  360.774530 |   1)   0.594 us    |                                          __phys_addr();
1445	
1446	
1447	You can put some comments on specific functions by using
1448	trace_printk() For example, if you want to put a comment inside
1449	the __might_sleep() function, you just have to include
1450	<linux/ftrace.h> and call trace_printk() inside __might_sleep()
1451	
1452	trace_printk("I'm a comment!\n")
1453	
1454	will produce:
1455	
1456	 1)               |             __might_sleep() {
1457	 1)               |                /* I'm a comment! */
1458	 1)   1.449 us    |             }
1459	
1460	
1461	You might find other useful features for this tracer in the
1462	following "dynamic ftrace" section such as tracing only specific
1463	functions or tasks.
1464	
1465	dynamic ftrace
1466	--------------
1467	
1468	If CONFIG_DYNAMIC_FTRACE is set, the system will run with
1469	virtually no overhead when function tracing is disabled. The way
1470	this works is the mcount function call (placed at the start of
1471	every kernel function, produced by the -pg switch in gcc),
1472	starts of pointing to a simple return. (Enabling FTRACE will
1473	include the -pg switch in the compiling of the kernel.)
1474	
1475	At compile time every C file object is run through the
1476	recordmcount.pl script (located in the scripts directory). This
1477	script will process the C object using objdump to find all the
1478	locations in the .text section that call mcount. (Note, only the
1479	.text section is processed, since processing other sections like
1480	.init.text may cause races due to those sections being freed).
1481	
1482	A new section called "__mcount_loc" is created that holds
1483	references to all the mcount call sites in the .text section.
1484	This section is compiled back into the original object. The
1485	final linker will add all these references into a single table.
1486	
1487	On boot up, before SMP is initialized, the dynamic ftrace code
1488	scans this table and updates all the locations into nops. It
1489	also records the locations, which are added to the
1490	available_filter_functions list.  Modules are processed as they
1491	are loaded and before they are executed.  When a module is
1492	unloaded, it also removes its functions from the ftrace function
1493	list. This is automatic in the module unload code, and the
1494	module author does not need to worry about it.
1495	
1496	When tracing is enabled, kstop_machine is called to prevent
1497	races with the CPUS executing code being modified (which can
1498	cause the CPU to do undesirable things), and the nops are
1499	patched back to calls. But this time, they do not call mcount
1500	(which is just a function stub). They now call into the ftrace
1501	infrastructure.
1502	
1503	One special side-effect to the recording of the functions being
1504	traced is that we can now selectively choose which functions we
1505	wish to trace and which ones we want the mcount calls to remain
1506	as nops.
1507	
1508	Two files are used, one for enabling and one for disabling the
1509	tracing of specified functions. They are:
1510	
1511	  set_ftrace_filter
1512	
1513	and
1514	
1515	  set_ftrace_notrace
1516	
1517	A list of available functions that you can add to these files is
1518	listed in:
1519	
1520	   available_filter_functions
1521	
1522	 # cat available_filter_functions
1523	put_prev_task_idle
1524	kmem_cache_create
1525	pick_next_task_rt
1526	get_online_cpus
1527	pick_next_task_fair
1528	mutex_lock
1529	[...]
1530	
1531	If I am only interested in sys_nanosleep and hrtimer_interrupt:
1532	
1533	 # echo sys_nanosleep hrtimer_interrupt \
1534			> set_ftrace_filter
1535	 # echo function > current_tracer
1536	 # echo 1 > tracing_on
1537	 # usleep 1
1538	 # echo 0 > tracing_on
1539	 # cat trace
1540	# tracer: ftrace
1541	#
1542	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1543	#              | |      |          |         |
1544	          usleep-4134  [00]  1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
1545	          usleep-4134  [00]  1317.070111: sys_nanosleep <-syscall_call
1546	          <idle>-0     [00]  1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1547	
1548	To see which functions are being traced, you can cat the file:
1549	
1550	 # cat set_ftrace_filter
1551	hrtimer_interrupt
1552	sys_nanosleep
1553	
1554	
1555	Perhaps this is not enough. The filters also allow simple wild
1556	cards. Only the following are currently available
1557	
1558	  <match>*  - will match functions that begin with <match>
1559	  *<match>  - will match functions that end with <match>
1560	  *<match>* - will match functions that have <match> in it
1561	
1562	These are the only wild cards which are supported.
1563	
1564	  <match>*<match> will not work.
1565	
1566	Note: It is better to use quotes to enclose the wild cards,
1567	      otherwise the shell may expand the parameters into names
1568	      of files in the local directory.
1569	
1570	 # echo 'hrtimer_*' > set_ftrace_filter
1571	
1572	Produces:
1573	
1574	# tracer: ftrace
1575	#
1576	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1577	#              | |      |          |         |
1578	            bash-4003  [00]  1480.611794: hrtimer_init <-copy_process
1579	            bash-4003  [00]  1480.611941: hrtimer_start <-hrtick_set
1580	            bash-4003  [00]  1480.611956: hrtimer_cancel <-hrtick_clear
1581	            bash-4003  [00]  1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
1582	          <idle>-0     [00]  1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
1583	          <idle>-0     [00]  1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
1584	          <idle>-0     [00]  1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
1585	          <idle>-0     [00]  1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
1586	          <idle>-0     [00]  1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
1587	
1588	
1589	Notice that we lost the sys_nanosleep.
1590	
1591	 # cat set_ftrace_filter
1592	hrtimer_run_queues
1593	hrtimer_run_pending
1594	hrtimer_init
1595	hrtimer_cancel
1596	hrtimer_try_to_cancel
1597	hrtimer_forward
1598	hrtimer_start
1599	hrtimer_reprogram
1600	hrtimer_force_reprogram
1601	hrtimer_get_next_event
1602	hrtimer_interrupt
1603	hrtimer_nanosleep
1604	hrtimer_wakeup
1605	hrtimer_get_remaining
1606	hrtimer_get_res
1607	hrtimer_init_sleeper
1608	
1609	
1610	This is because the '>' and '>>' act just like they do in bash.
1611	To rewrite the filters, use '>'
1612	To append to the filters, use '>>'
1613	
1614	To clear out a filter so that all functions will be recorded
1615	again:
1616	
1617	 # echo > set_ftrace_filter
1618	 # cat set_ftrace_filter
1619	 #
1620	
1621	Again, now we want to append.
1622	
1623	 # echo sys_nanosleep > set_ftrace_filter
1624	 # cat set_ftrace_filter
1625	sys_nanosleep
1626	 # echo 'hrtimer_*' >> set_ftrace_filter
1627	 # cat set_ftrace_filter
1628	hrtimer_run_queues
1629	hrtimer_run_pending
1630	hrtimer_init
1631	hrtimer_cancel
1632	hrtimer_try_to_cancel
1633	hrtimer_forward
1634	hrtimer_start
1635	hrtimer_reprogram
1636	hrtimer_force_reprogram
1637	hrtimer_get_next_event
1638	hrtimer_interrupt
1639	sys_nanosleep
1640	hrtimer_nanosleep
1641	hrtimer_wakeup
1642	hrtimer_get_remaining
1643	hrtimer_get_res
1644	hrtimer_init_sleeper
1645	
1646	
1647	The set_ftrace_notrace prevents those functions from being
1648	traced.
1649	
1650	 # echo '*preempt*' '*lock*' > set_ftrace_notrace
1651	
1652	Produces:
1653	
1654	# tracer: ftrace
1655	#
1656	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1657	#              | |      |          |         |
1658	            bash-4043  [01]   115.281644: finish_task_switch <-schedule
1659	            bash-4043  [01]   115.281645: hrtick_set <-schedule
1660	            bash-4043  [01]   115.281645: hrtick_clear <-hrtick_set
1661	            bash-4043  [01]   115.281646: wait_for_completion <-__stop_machine_run
1662	            bash-4043  [01]   115.281647: wait_for_common <-wait_for_completion
1663	            bash-4043  [01]   115.281647: kthread_stop <-stop_machine_run
1664	            bash-4043  [01]   115.281648: init_waitqueue_head <-kthread_stop
1665	            bash-4043  [01]   115.281648: wake_up_process <-kthread_stop
1666	            bash-4043  [01]   115.281649: try_to_wake_up <-wake_up_process
1667	
1668	We can see that there's no more lock or preempt tracing.
1669	
1670	
1671	Dynamic ftrace with the function graph tracer
1672	---------------------------------------------
1673	
1674	Although what has been explained above concerns both the
1675	function tracer and the function-graph-tracer, there are some
1676	special features only available in the function-graph tracer.
1677	
1678	If you want to trace only one function and all of its children,
1679	you just have to echo its name into set_graph_function:
1680	
1681	 echo __do_fault > set_graph_function
1682	
1683	will produce the following "expanded" trace of the __do_fault()
1684	function:
1685	
1686	 0)               |  __do_fault() {
1687	 0)               |    filemap_fault() {
1688	 0)               |      find_lock_page() {
1689	 0)   0.804 us    |        find_get_page();
1690	 0)               |        __might_sleep() {
1691	 0)   1.329 us    |        }
1692	 0)   3.904 us    |      }
1693	 0)   4.979 us    |    }
1694	 0)   0.653 us    |    _spin_lock();
1695	 0)   0.578 us    |    page_add_file_rmap();
1696	 0)   0.525 us    |    native_set_pte_at();
1697	 0)   0.585 us    |    _spin_unlock();
1698	 0)               |    unlock_page() {
1699	 0)   0.541 us    |      page_waitqueue();
1700	 0)   0.639 us    |      __wake_up_bit();
1701	 0)   2.786 us    |    }
1702	 0) + 14.237 us   |  }
1703	 0)               |  __do_fault() {
1704	 0)               |    filemap_fault() {
1705	 0)               |      find_lock_page() {
1706	 0)   0.698 us    |        find_get_page();
1707	 0)               |        __might_sleep() {
1708	 0)   1.412 us    |        }
1709	 0)   3.950 us    |      }
1710	 0)   5.098 us    |    }
1711	 0)   0.631 us    |    _spin_lock();
1712	 0)   0.571 us    |    page_add_file_rmap();
1713	 0)   0.526 us    |    native_set_pte_at();
1714	 0)   0.586 us    |    _spin_unlock();
1715	 0)               |    unlock_page() {
1716	 0)   0.533 us    |      page_waitqueue();
1717	 0)   0.638 us    |      __wake_up_bit();
1718	 0)   2.793 us    |    }
1719	 0) + 14.012 us   |  }
1720	
1721	You can also expand several functions at once:
1722	
1723	 echo sys_open > set_graph_function
1724	 echo sys_close >> set_graph_function
1725	
1726	Now if you want to go back to trace all functions you can clear
1727	this special filter via:
1728	
1729	 echo > set_graph_function
1730	
1731	
1732	Filter commands
1733	---------------
1734	
1735	A few commands are supported by the set_ftrace_filter interface.
1736	Trace commands have the following format:
1737	
1738	<function>:<command>:<parameter>
1739	
1740	The following commands are supported:
1741	
1742	- mod
1743	  This command enables function filtering per module. The
1744	  parameter defines the module. For example, if only the write*
1745	  functions in the ext3 module are desired, run:
1746	
1747	   echo 'write*:mod:ext3' > set_ftrace_filter
1748	
1749	  This command interacts with the filter in the same way as
1750	  filtering based on function names. Thus, adding more functions
1751	  in a different module is accomplished by appending (>>) to the
1752	  filter file. Remove specific module functions by prepending
1753	  '!':
1754	
1755	   echo '!writeback*:mod:ext3' >> set_ftrace_filter
1756	
1757	- traceon/traceoff
1758	  These commands turn tracing on and off when the specified
1759	  functions are hit. The parameter determines how many times the
1760	  tracing system is turned on and off. If unspecified, there is
1761	  no limit. For example, to disable tracing when a schedule bug
1762	  is hit the first 5 times, run:
1763	
1764	   echo '__schedule_bug:traceoff:5' > set_ftrace_filter
1765	
1766	  These commands are cumulative whether or not they are appended
1767	  to set_ftrace_filter. To remove a command, prepend it by '!'
1768	  and drop the parameter:
1769	
1770	   echo '!__schedule_bug:traceoff' > set_ftrace_filter
1771	
1772	
1773	trace_pipe
1774	----------
1775	
1776	The trace_pipe outputs the same content as the trace file, but
1777	the effect on the tracing is different. Every read from
1778	trace_pipe is consumed. This means that subsequent reads will be
1779	different. The trace is live.
1780	
1781	 # echo function > current_tracer
1782	 # cat trace_pipe > /tmp/trace.out &
1783	[1] 4153
1784	 # echo 1 > tracing_on
1785	 # usleep 1
1786	 # echo 0 > tracing_on
1787	 # cat trace
1788	# tracer: function
1789	#
1790	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1791	#              | |      |          |         |
1792	
1793	 #
1794	 # cat /tmp/trace.out
1795	            bash-4043  [00] 41.267106: finish_task_switch <-schedule
1796	            bash-4043  [00] 41.267106: hrtick_set <-schedule
1797	            bash-4043  [00] 41.267107: hrtick_clear <-hrtick_set
1798	            bash-4043  [00] 41.267108: wait_for_completion <-__stop_machine_run
1799	            bash-4043  [00] 41.267108: wait_for_common <-wait_for_completion
1800	            bash-4043  [00] 41.267109: kthread_stop <-stop_machine_run
1801	            bash-4043  [00] 41.267109: init_waitqueue_head <-kthread_stop
1802	            bash-4043  [00] 41.267110: wake_up_process <-kthread_stop
1803	            bash-4043  [00] 41.267110: try_to_wake_up <-wake_up_process
1804	            bash-4043  [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1805	
1806	
1807	Note, reading the trace_pipe file will block until more input is
1808	added. By changing the tracer, trace_pipe will issue an EOF. We
1809	needed to set the function tracer _before_ we "cat" the
1810	trace_pipe file.
1811	
1812	
1813	trace entries
1814	-------------
1815	
1816	Having too much or not enough data can be troublesome in
1817	diagnosing an issue in the kernel. The file buffer_size_kb is
1818	used to modify the size of the internal trace buffers. The
1819	number listed is the number of entries that can be recorded per
1820	CPU. To know the full size, multiply the number of possible CPUS
1821	with the number of entries.
1822	
1823	 # cat buffer_size_kb
1824	1408 (units kilobytes)
1825	
1826	Note, to modify this, you must have tracing completely disabled.
1827	To do that, echo "nop" into the current_tracer. If the
1828	current_tracer is not set to "nop", an EINVAL error will be
1829	returned.
1830	
1831	 # echo nop > current_tracer
1832	 # echo 10000 > buffer_size_kb
1833	 # cat buffer_size_kb
1834	10000 (units kilobytes)
1835	
1836	The number of pages which will be allocated is limited to a
1837	percentage of available memory. Allocating too much will produce
1838	an error.
1839	
1840	 # echo 1000000000000 > buffer_size_kb
1841	-bash: echo: write error: Cannot allocate memory
1842	 # cat buffer_size_kb
1843	85
1844	
1845	Snapshot
1846	--------
1847	CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
1848	available to all non latency tracers. (Latency tracers which
1849	record max latency, such as "irqsoff" or "wakeup", can't use
1850	this feature, since those are already using the snapshot
1851	mechanism internally.)
1852	
1853	Snapshot preserves a current trace buffer at a particular point
1854	in time without stopping tracing. Ftrace swaps the current
1855	buffer with a spare buffer, and tracing continues in the new
1856	current (=previous spare) buffer.
1857	
1858	The following debugfs files in "tracing" are related to this
1859	feature:
1860	
1861	  snapshot:
1862	
1863		This is used to take a snapshot and to read the output
1864		of the snapshot. Echo 1 into this file to allocate a
1865		spare buffer and to take a snapshot (swap), then read
1866		the snapshot from this file in the same format as
1867		"trace" (described above in the section "The File
1868		System"). Both reads snapshot and tracing are executable
1869		in parallel. When the spare buffer is allocated, echoing
1870		0 frees it, and echoing else (positive) values clear the
1871		snapshot contents.
1872		More details are shown in the table below.
1873	
1874		status\input  |     0      |     1      |    else    |
1875		--------------+------------+------------+------------+
1876		not allocated |(do nothing)| alloc+swap |(do nothing)|
1877		--------------+------------+------------+------------+
1878		allocated     |    free    |    swap    |   clear    |
1879		--------------+------------+------------+------------+
1880	
1881	Here is an example of using the snapshot feature.
1882	
1883	 # echo 1 > events/sched/enable
1884	 # echo 1 > snapshot
1885	 # cat snapshot
1886	# tracer: nop
1887	#
1888	# entries-in-buffer/entries-written: 71/71   #P:8
1889	#
1890	#                              _-----=> irqs-off
1891	#                             / _----=> need-resched
1892	#                            | / _---=> hardirq/softirq
1893	#                            || / _--=> preempt-depth
1894	#                            ||| /     delay
1895	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
1896	#              | |       |   ||||       |         |
1897	          <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
1898	           sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
1899	[...]
1900	          <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
1901	
1902	 # cat trace
1903	# tracer: nop
1904	#
1905	# entries-in-buffer/entries-written: 77/77   #P:8
1906	#
1907	#                              _-----=> irqs-off
1908	#                             / _----=> need-resched
1909	#                            | / _---=> hardirq/softirq
1910	#                            || / _--=> preempt-depth
1911	#                            ||| /     delay
1912	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
1913	#              | |       |   ||||       |         |
1914	          <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
1915	 snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
1916	[...]
1917	
1918	
1919	If you try to use this snapshot feature when current tracer is
1920	one of the latency tracers, you will get the following results.
1921	
1922	 # echo wakeup > current_tracer
1923	 # echo 1 > snapshot
1924	bash: echo: write error: Device or resource busy
1925	 # cat snapshot
1926	cat: snapshot: Device or resource busy
1927	
1928	-----------
1929	
1930	More details can be found in the source code, in the
1931	kernel/trace/*.c files.
Hide Line Numbers
About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.