Linux Kernel Documentation :: trace : ftrace.txt

About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog
Documentation / trace / ftrace.txt

Based on kernel version 4.16.1. Page generated on 2018-04-09 11:53 EST.
1			ftrace - Function Tracer
2			========================
3	
4	Copyright 2008 Red Hat Inc.
5	   Author:   Steven Rostedt <srostedt@redhat.com>
6	  License:   The GNU Free Documentation License, Version 1.2
7	               (dual licensed under the GPL v2)
8	Original Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
9			      John Kacur, and David Teigland.
10	Written for: 2.6.28-rc2
11	Updated for: 3.10
12	Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
13	
14	Introduction
15	------------
16	
17	Ftrace is an internal tracer designed to help out developers and
18	designers of systems to find what is going on inside the kernel.
19	It can be used for debugging or analyzing latencies and
20	performance issues that take place outside of user-space.
21	
22	Although ftrace is typically considered the function tracer, it
23	is really a frame work of several assorted tracing utilities.
24	There's latency tracing to examine what occurs between interrupts
25	disabled and enabled, as well as for preemption and from a time
26	a task is woken to the task is actually scheduled in.
27	
28	One of the most common uses of ftrace is the event tracing.
29	Through out the kernel is hundreds of static event points that
30	can be enabled via the tracefs file system to see what is
31	going on in certain parts of the kernel.
32	
33	See events.txt for more information.
34	
35	
36	Implementation Details
37	----------------------
38	
39	See ftrace-design.txt for details for arch porters and such.
40	
41	
42	The File System
43	---------------
44	
45	Ftrace uses the tracefs file system to hold the control files as
46	well as the files to display output.
47	
48	When tracefs is configured into the kernel (which selecting any ftrace
49	option will do) the directory /sys/kernel/tracing will be created. To mount
50	this directory, you can add to your /etc/fstab file:
51	
52	 tracefs       /sys/kernel/tracing       tracefs defaults        0       0
53	
54	Or you can mount it at run time with:
55	
56	 mount -t tracefs nodev /sys/kernel/tracing
57	
58	For quicker access to that directory you may want to make a soft link to
59	it:
60	
61	 ln -s /sys/kernel/tracing /tracing
62	
63	     *** NOTICE ***
64	
65	Before 4.1, all ftrace tracing control files were within the debugfs
66	file system, which is typically located at /sys/kernel/debug/tracing.
67	For backward compatibility, when mounting the debugfs file system,
68	the tracefs file system will be automatically mounted at:
69	
70	 /sys/kernel/debug/tracing
71	
72	All files located in the tracefs file system will be located in that
73	debugfs file system directory as well.
74	
75	     *** NOTICE ***
76	
77	Any selected ftrace option will also create the tracefs file system.
78	The rest of the document will assume that you are in the ftrace directory
79	(cd /sys/kernel/tracing) and will only concentrate on the files within that
80	directory and not distract from the content with the extended
81	"/sys/kernel/tracing" path name.
82	
83	That's it! (assuming that you have ftrace configured into your kernel)
84	
85	After mounting tracefs you will have access to the control and output files
86	of ftrace. Here is a list of some of the key files:
87	
88	
89	 Note: all time values are in microseconds.
90	
91	  current_tracer:
92	
93		This is used to set or display the current tracer
94		that is configured.
95	
96	  available_tracers:
97	
98		This holds the different types of tracers that
99		have been compiled into the kernel. The
100		tracers listed here can be configured by
101		echoing their name into current_tracer.
102	
103	  tracing_on:
104	
105		This sets or displays whether writing to the trace
106		ring buffer is enabled. Echo 0 into this file to disable
107		the tracer or 1 to enable it. Note, this only disables
108		writing to the ring buffer, the tracing overhead may
109		still be occurring.
110	
111		The kernel function tracing_off() can be used within the
112		kernel to disable writing to the ring buffer, which will
113		set this file to "0". User space can re-enable tracing by
114		echoing "1" into the file.
115	
116		Note, the function and event trigger "traceoff" will also
117		set this file to zero and stop tracing. Which can also
118		be re-enabled by user space using this file.
119	
120	  trace:
121	
122		This file holds the output of the trace in a human
123		readable format (described below). Note, tracing is temporarily
124		disabled while this file is being read (opened).
125	
126	  trace_pipe:
127	
128		The output is the same as the "trace" file but this
129		file is meant to be streamed with live tracing.
130		Reads from this file will block until new data is
131		retrieved.  Unlike the "trace" file, this file is a
132		consumer. This means reading from this file causes
133		sequential reads to display more current data. Once
134		data is read from this file, it is consumed, and
135		will not be read again with a sequential read. The
136		"trace" file is static, and if the tracer is not
137		adding more data, it will display the same
138		information every time it is read. This file will not
139		disable tracing while being read.
140	
141	  trace_options:
142	
143		This file lets the user control the amount of data
144		that is displayed in one of the above output
145		files. Options also exist to modify how a tracer
146		or events work (stack traces, timestamps, etc).
147	
148	  options:
149	
150		This is a directory that has a file for every available
151		trace option (also in trace_options). Options may also be set
152		or cleared by writing a "1" or "0" respectively into the
153		corresponding file with the option name.
154	
155	  tracing_max_latency:
156	
157		Some of the tracers record the max latency.
158		For example, the maximum time that interrupts are disabled.
159		The maximum time is saved in this file. The max trace will also be
160		stored,	and displayed by "trace". A new max trace will only be
161		recorded if the latency is greater than the value in this file
162		(in microseconds).
163	
164		By echoing in a time into this file, no latency will be recorded
165		unless it is greater than the time in this file.
166	
167	  tracing_thresh:
168	
169		Some latency tracers will record a trace whenever the
170		latency is greater than the number in this file.
171		Only active when the file contains a number greater than 0.
172		(in microseconds)
173	
174	  buffer_size_kb:
175	
176		This sets or displays the number of kilobytes each CPU
177		buffer holds. By default, the trace buffers are the same size
178		for each CPU. The displayed number is the size of the
179		CPU buffer and not total size of all buffers. The
180		trace buffers are allocated in pages (blocks of memory
181		that the kernel uses for allocation, usually 4 KB in size).
182		If the last page allocated has room for more bytes
183		than requested, the rest of the page will be used,
184		making the actual allocation bigger than requested or shown.
185		( Note, the size may not be a multiple of the page size
186		  due to buffer management meta-data. )
187	
188		Buffer sizes for individual CPUs may vary
189		(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
190		this file will show "X".
191	
192	  buffer_total_size_kb:
193	
194		This displays the total combined size of all the trace buffers.
195	
196	  free_buffer:
197	
198		If a process is performing tracing, and the ring buffer	should be
199		shrunk "freed" when the process is finished, even if it were to be
200		killed by a signal, this file can be used for that purpose. On close
201		of this file, the ring buffer will be resized to its minimum size.
202		Having a process that is tracing also open this file, when the process
203		exits its file descriptor for this file will be closed, and in doing so,
204		the ring buffer will be "freed".
205	
206		It may also stop tracing if disable_on_free option is set.
207	
208	  tracing_cpumask:
209	
210		This is a mask that lets the user only trace on specified CPUs.
211		The format is a hex string representing the CPUs.
212	
213	  set_ftrace_filter:
214	
215		When dynamic ftrace is configured in (see the
216		section below "dynamic ftrace"), the code is dynamically
217		modified (code text rewrite) to disable calling of the
218		function profiler (mcount). This lets tracing be configured
219		in with practically no overhead in performance.  This also
220		has a side effect of enabling or disabling specific functions
221		to be traced. Echoing names of functions into this file
222		will limit the trace to only those functions.
223	
224		The functions listed in "available_filter_functions" are what
225		can be written into this file.
226	
227		This interface also allows for commands to be used. See the
228		"Filter commands" section for more details.
229	
230	  set_ftrace_notrace:
231	
232		This has an effect opposite to that of
233		set_ftrace_filter. Any function that is added here will not
234		be traced. If a function exists in both set_ftrace_filter
235		and set_ftrace_notrace,	the function will _not_ be traced.
236	
237	  set_ftrace_pid:
238	
239		Have the function tracer only trace the threads whose PID are
240		listed in this file.
241	
242		If the "function-fork" option is set, then when a task whose
243		PID is listed in this file forks, the child's PID will
244		automatically be added to this file, and the child will be
245		traced by the function tracer as well. This option will also
246		cause PIDs of tasks that exit to be removed from the file.
247	
248	  set_event_pid:
249	
250		Have the events only trace a task with a PID listed in this file.
251		Note, sched_switch and sched_wake_up will also trace events
252		listed in this file.
253	
254		To have the PIDs of children of tasks with their PID in this file
255		added on fork, enable the "event-fork" option. That option will also
256		cause the PIDs of tasks to be removed from this file when the task
257		exits.
258	
259	  set_graph_function:
260	
261		Functions listed in this file will cause the function graph
262		tracer to only trace these functions and the functions that
263		they call. (See the section "dynamic ftrace" for more details).
264	
265	  set_graph_notrace:
266	
267		Similar to set_graph_function, but will disable function graph
268		tracing when the function is hit until it exits the function.
269		This makes it possible to ignore tracing functions that are called
270		by a specific function.
271	
272	  available_filter_functions:
273	
274		This lists the functions that ftrace has processed and can trace.
275		These are the function names that you can pass to
276		"set_ftrace_filter" or "set_ftrace_notrace".
277		(See the section "dynamic ftrace" below for more details.)
278	
279	  dyn_ftrace_total_info:
280	
281		This file is for debugging purposes. The number of functions that
282		have been converted to nops and are available to be traced.
283	
284	  enabled_functions:
285	
286		This file is more for debugging ftrace, but can also be useful
287		in seeing if any function has a callback attached to it.
288		Not only does the trace infrastructure use ftrace function
289		trace utility, but other subsystems might too. This file
290		displays all functions that have a callback attached to them
291		as well as the number of callbacks that have been attached.
292		Note, a callback may also call multiple functions which will
293		not be listed in this count.
294	
295		If the callback registered to be traced by a function with
296		the "save regs" attribute (thus even more overhead), a 'R'
297		will be displayed on the same line as the function that
298		is returning registers.
299	
300		If the callback registered to be traced by a function with
301		the "ip modify" attribute (thus the regs->ip can be changed),
302		an 'I' will be displayed on the same line as the function that
303		can be overridden.
304	
305		If the architecture supports it, it will also show what callback
306		is being directly called by the function. If the count is greater
307		than 1 it most likely will be ftrace_ops_list_func().
308	
309		If the callback of the function jumps to a trampoline that is
310		specific to a the callback and not the standard trampoline,
311		its address will be printed as well as the function that the
312		trampoline calls.
313	
314	  function_profile_enabled:
315	
316		When set it will enable all functions with either the function
317		tracer, or if configured, the function graph tracer. It will
318		keep a histogram of the number of functions that were called
319		and if the function graph tracer was configured, it will also keep
320		track of the time spent in those functions. The histogram
321		content can be displayed in the files:
322	
323		trace_stats/function<cpu> ( function0, function1, etc).
324	
325	  trace_stats:
326	
327		A directory that holds different tracing stats.
328	
329	  kprobe_events:
330	 
331		Enable dynamic trace points. See kprobetrace.txt.
332	
333	  kprobe_profile:
334	
335		Dynamic trace points stats. See kprobetrace.txt.
336	
337	  max_graph_depth:
338	
339		Used with the function graph tracer. This is the max depth
340		it will trace into a function. Setting this to a value of
341		one will show only the first kernel function that is called
342		from user space.
343	
344	  printk_formats:
345	
346		This is for tools that read the raw format files. If an event in
347		the ring buffer references a string, only a pointer to the string
348		is recorded into the buffer and not the string itself. This prevents
349		tools from knowing what that string was. This file displays the string
350		and address for	the string allowing tools to map the pointers to what
351		the strings were.
352	
353	  saved_cmdlines:
354	
355		Only the pid of the task is recorded in a trace event unless
356		the event specifically saves the task comm as well. Ftrace
357		makes a cache of pid mappings to comms to try to display
358		comms for events. If a pid for a comm is not listed, then
359		"<...>" is displayed in the output.
360	
361		If the option "record-cmd" is set to "0", then comms of tasks
362		will not be saved during recording. By default, it is enabled.
363	
364	  saved_cmdlines_size:
365	
366		By default, 128 comms are saved (see "saved_cmdlines" above). To
367		increase or decrease the amount of comms that are cached, echo
368		in a the number of comms to cache, into this file.
369	
370	  saved_tgids:
371	
372		If the option "record-tgid" is set, on each scheduling context switch
373		the Task Group ID of a task is saved in a table mapping the PID of
374		the thread to its TGID. By default, the "record-tgid" option is
375		disabled.
376	
377	  snapshot:
378	
379		This displays the "snapshot" buffer and also lets the user
380		take a snapshot of the current running trace.
381		See the "Snapshot" section below for more details.
382	
383	  stack_max_size:
384	
385		When the stack tracer is activated, this will display the
386		maximum stack size it has encountered.
387		See the "Stack Trace" section below.
388	
389	  stack_trace:
390	
391		This displays the stack back trace of the largest stack
392		that was encountered when the stack tracer is activated.
393		See the "Stack Trace" section below.
394	
395	  stack_trace_filter:
396	
397		This is similar to "set_ftrace_filter" but it limits what
398		functions the stack tracer will check.
399	
400	  trace_clock:
401	
402		Whenever an event is recorded into the ring buffer, a
403		"timestamp" is added. This stamp comes from a specified
404		clock. By default, ftrace uses the "local" clock. This
405		clock is very fast and strictly per cpu, but on some
406		systems it may not be monotonic with respect to other
407		CPUs. In other words, the local clocks may not be in sync
408		with local clocks on other CPUs.
409	
410		Usual clocks for tracing:
411	
412		  # cat trace_clock
413		  [local] global counter x86-tsc
414	
415		  The clock with the square brackets around it is the one
416		  in effect.
417	
418		  local: Default clock, but may not be in sync across CPUs
419	
420		  global: This clock is in sync with all CPUs but may
421		  	  be a bit slower than the local clock.
422	
423		  counter: This is not a clock at all, but literally an atomic
424		  	   counter. It counts up one by one, but is in sync
425			   with all CPUs. This is useful when you need to
426			   know exactly the order events occurred with respect to
427			   each other on different CPUs.
428	
429		  uptime: This uses the jiffies counter and the time stamp
430		  	  is relative to the time since boot up.
431	
432		  perf: This makes ftrace use the same clock that perf uses.
433		  	Eventually perf will be able to read ftrace buffers
434			and this will help out in interleaving the data.
435	
436		  x86-tsc: Architectures may define their own clocks. For
437		  	   example, x86 uses its own TSC cycle clock here.
438	
439		  ppc-tb: This uses the powerpc timebase register value.
440			  This is in sync across CPUs and can also be used
441			  to correlate events across hypervisor/guest if
442			  tb_offset is known.
443	
444		  mono: This uses the fast monotonic clock (CLOCK_MONOTONIC)
445			which is monotonic and is subject to NTP rate adjustments.
446	
447		  mono_raw:
448			This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
449			which is montonic but is not subject to any rate adjustments
450			and ticks at the same rate as the hardware clocksource.
451	
452		  boot: This is the boot clock (CLOCK_BOOTTIME) and is based on the
453			fast monotonic clock, but also accounts for time spent in
454			suspend. Since the clock access is designed for use in
455			tracing in the suspend path, some side effects are possible
456			if clock is accessed after the suspend time is accounted before
457			the fast mono clock is updated. In this case, the clock update
458			appears to happen slightly sooner than it normally would have.
459			Also on 32-bit systems, it's possible that the 64-bit boot offset
460			sees a partial update. These effects are rare and post
461			processing should be able to handle them. See comments in the
462			ktime_get_boot_fast_ns() function for more information.
463	
464		To set a clock, simply echo the clock name into this file.
465	
466		  echo global > trace_clock
467	
468	  trace_marker:
469	
470		This is a very useful file for synchronizing user space
471		with events happening in the kernel. Writing strings into
472		this file will be written into the ftrace buffer.
473	
474		It is useful in applications to open this file at the start
475		of the application and just reference the file descriptor
476		for the file.
477	
478		void trace_write(const char *fmt, ...)
479		{
480			va_list ap;
481			char buf[256];
482			int n;
483	
484			if (trace_fd < 0)
485				return;
486	
487			va_start(ap, fmt);
488			n = vsnprintf(buf, 256, fmt, ap);
489			va_end(ap);
490	
491			write(trace_fd, buf, n);
492		}
493	
494		start:
495	
496			trace_fd = open("trace_marker", WR_ONLY);
497	
498	  trace_marker_raw:
499	
500		This is similar to trace_marker above, but is meant for for binary data
501		to be written to it, where a tool can be used to parse the data
502		from trace_pipe_raw.
503	
504	  uprobe_events:
505	 
506		Add dynamic tracepoints in programs.
507		See uprobetracer.txt
508	
509	  uprobe_profile:
510	
511		Uprobe statistics. See uprobetrace.txt
512	
513	  instances:
514	
515		This is a way to make multiple trace buffers where different
516		events can be recorded in different buffers.
517		See "Instances" section below.
518	
519	  events:
520	
521		This is the trace event directory. It holds event tracepoints
522		(also known as static tracepoints) that have been compiled
523		into the kernel. It shows what event tracepoints exist
524		and how they are grouped by system. There are "enable"
525		files at various levels that can enable the tracepoints
526		when a "1" is written to them.
527	
528		See events.txt for more information.
529	
530	  set_event:
531	
532		By echoing in the event into this file, will enable that event.
533	
534		See events.txt for more information.
535	
536	  available_events:
537	
538		A list of events that can be enabled in tracing.
539	
540		See events.txt for more information.
541	
542	  hwlat_detector:
543	
544		Directory for the Hardware Latency Detector.
545		See "Hardware Latency Detector" section below.
546	
547	  per_cpu:
548	
549		This is a directory that contains the trace per_cpu information.
550	
551	  per_cpu/cpu0/buffer_size_kb:
552	
553		The ftrace buffer is defined per_cpu. That is, there's a separate
554		buffer for each CPU to allow writes to be done atomically,
555		and free from cache bouncing. These buffers may have different
556		size buffers. This file is similar to the buffer_size_kb
557		file, but it only displays or sets the buffer size for the
558		specific CPU. (here cpu0).
559	
560	  per_cpu/cpu0/trace:
561	
562		This is similar to the "trace" file, but it will only display
563		the data specific for the CPU. If written to, it only clears
564		the specific CPU buffer.
565	
566	  per_cpu/cpu0/trace_pipe
567	
568		This is similar to the "trace_pipe" file, and is a consuming
569		read, but it will only display (and consume) the data specific
570		for the CPU.
571	
572	  per_cpu/cpu0/trace_pipe_raw
573	
574		For tools that can parse the ftrace ring buffer binary format,
575		the trace_pipe_raw file can be used to extract the data
576		from the ring buffer directly. With the use of the splice()
577		system call, the buffer data can be quickly transferred to
578		a file or to the network where a server is collecting the
579		data.
580	
581		Like trace_pipe, this is a consuming reader, where multiple
582		reads will always produce different data.
583	
584	  per_cpu/cpu0/snapshot:
585	
586		This is similar to the main "snapshot" file, but will only
587		snapshot the current CPU (if supported). It only displays
588		the content of the snapshot for a given CPU, and if
589		written to, only clears this CPU buffer.
590	
591	  per_cpu/cpu0/snapshot_raw:
592	
593		Similar to the trace_pipe_raw, but will read the binary format
594		from the snapshot buffer for the given CPU.
595	
596	  per_cpu/cpu0/stats:
597	
598		This displays certain stats about the ring buffer:
599	
600		 entries: The number of events that are still in the buffer.
601	
602		 overrun: The number of lost events due to overwriting when
603		 	  the buffer was full.
604	
605		 commit overrun: Should always be zero.
606		 	This gets set if so many events happened within a nested
607			event (ring buffer is re-entrant), that it fills the
608			buffer and starts dropping events.
609	
610		 bytes: Bytes actually read (not overwritten).
611	
612		 oldest event ts: The oldest timestamp in the buffer
613	
614		 now ts: The current timestamp
615	
616		 dropped events: Events lost due to overwrite option being off.
617	
618		 read events: The number of events read.
619	
620	The Tracers
621	-----------
622	
623	Here is the list of current tracers that may be configured.
624	
625	  "function"
626	
627		Function call tracer to trace all kernel functions.
628	
629	  "function_graph"
630	
631		Similar to the function tracer except that the
632		function tracer probes the functions on their entry
633		whereas the function graph tracer traces on both entry
634		and exit of the functions. It then provides the ability
635		to draw a graph of function calls similar to C code
636		source.
637	
638	  "blk"
639	
640		The block tracer. The tracer used by the blktrace user
641		application.
642	
643	  "hwlat"
644	
645		The Hardware Latency tracer is used to detect if the hardware
646		produces any latency. See "Hardware Latency Detector" section
647		below.
648	
649	  "irqsoff"
650	
651		Traces the areas that disable interrupts and saves
652		the trace with the longest max latency.
653		See tracing_max_latency. When a new max is recorded,
654		it replaces the old trace. It is best to view this
655		trace with the latency-format option enabled, which
656		happens automatically when the tracer is selected.
657	
658	  "preemptoff"
659	
660		Similar to irqsoff but traces and records the amount of
661		time for which preemption is disabled.
662	
663	  "preemptirqsoff"
664	
665		Similar to irqsoff and preemptoff, but traces and
666		records the largest time for which irqs and/or preemption
667		is disabled.
668	
669	  "wakeup"
670	
671		Traces and records the max latency that it takes for
672		the highest priority task to get scheduled after
673		it has been woken up.
674	        Traces all tasks as an average developer would expect.
675	
676	  "wakeup_rt"
677	
678	        Traces and records the max latency that it takes for just
679	        RT tasks (as the current "wakeup" does). This is useful
680	        for those interested in wake up timings of RT tasks.
681	
682	  "wakeup_dl"
683	
684		Traces and records the max latency that it takes for
685		a SCHED_DEADLINE task to be woken (as the "wakeup" and
686		"wakeup_rt" does).
687	
688	  "mmiotrace"
689	
690		A special tracer that is used to trace binary module.
691		It will trace all the calls that a module makes to the
692		hardware. Everything it writes and reads from the I/O
693		as well.
694	
695	  "branch"
696	
697		This tracer can be configured when tracing likely/unlikely
698		calls within the kernel. It will trace when a likely and
699		unlikely branch is hit and if it was correct in its prediction
700		of being correct.
701	
702	  "nop"
703	
704		This is the "trace nothing" tracer. To remove all
705		tracers from tracing simply echo "nop" into
706		current_tracer.
707	
708	
709	Examples of using the tracer
710	----------------------------
711	
712	Here are typical examples of using the tracers when controlling
713	them only with the tracefs interface (without using any
714	user-land utilities).
715	
716	Output format:
717	--------------
718	
719	Here is an example of the output format of the file "trace"
720	
721	                             --------
722	# tracer: function
723	#
724	# entries-in-buffer/entries-written: 140080/250280   #P:4
725	#
726	#                              _-----=> irqs-off
727	#                             / _----=> need-resched
728	#                            | / _---=> hardirq/softirq
729	#                            || / _--=> preempt-depth
730	#                            ||| /     delay
731	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
732	#              | |       |   ||||       |         |
733	            bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
734	            bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
735	            bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
736	            sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
737	            bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
738	            bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
739	            bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
740	            bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
741	            bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
742	            sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
743	                             --------
744	
745	A header is printed with the tracer name that is represented by
746	the trace. In this case the tracer is "function". Then it shows the
747	number of events in the buffer as well as the total number of entries
748	that were written. The difference is the number of entries that were
749	lost due to the buffer filling up (250280 - 140080 = 110200 events
750	lost).
751	
752	The header explains the content of the events. Task name "bash", the task
753	PID "1977", the CPU that it was running on "000", the latency format
754	(explained below), the timestamp in <secs>.<usecs> format, the
755	function name that was traced "sys_close" and the parent function that
756	called this function "system_call_fastpath". The timestamp is the time
757	at which the function was entered.
758	
759	Latency trace format
760	--------------------
761	
762	When the latency-format option is enabled or when one of the latency
763	tracers is set, the trace file gives somewhat more information to see
764	why a latency happened. Here is a typical trace.
765	
766	# tracer: irqsoff
767	#
768	# irqsoff latency trace v1.1.5 on 3.8.0-test+
769	# --------------------------------------------------------------------
770	# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
771	#    -----------------
772	#    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
773	#    -----------------
774	#  => started at: __lock_task_sighand
775	#  => ended at:   _raw_spin_unlock_irqrestore
776	#
777	#
778	#                  _------=> CPU#            
779	#                 / _-----=> irqs-off        
780	#                | / _----=> need-resched    
781	#                || / _---=> hardirq/softirq 
782	#                ||| / _--=> preempt-depth   
783	#                |||| /     delay             
784	#  cmd     pid   ||||| time  |   caller      
785	#     \   /      |||||  \    |   /           
786	      ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
787	      ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
788	      ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
789	      ps-6143    2d..1  306us : <stack trace>
790	 => trace_hardirqs_on_caller
791	 => trace_hardirqs_on
792	 => _raw_spin_unlock_irqrestore
793	 => do_task_stat
794	 => proc_tgid_stat
795	 => proc_single_show
796	 => seq_read
797	 => vfs_read
798	 => sys_read
799	 => system_call_fastpath
800	
801	
802	This shows that the current tracer is "irqsoff" tracing the time
803	for which interrupts were disabled. It gives the trace version (which
804	never changes) and the version of the kernel upon which this was executed on
805	(3.8). Then it displays the max latency in microseconds (259 us). The number
806	of trace entries displayed and the total number (both are four: #4/4).
807	VP, KP, SP, and HP are always zero and are reserved for later use.
808	#P is the number of online CPUs (#P:4).
809	
810	The task is the process that was running when the latency
811	occurred. (ps pid: 6143).
812	
813	The start and stop (the functions in which the interrupts were
814	disabled and enabled respectively) that caused the latencies:
815	
816	 __lock_task_sighand is where the interrupts were disabled.
817	 _raw_spin_unlock_irqrestore is where they were enabled again.
818	
819	The next lines after the header are the trace itself. The header
820	explains which is which.
821	
822	  cmd: The name of the process in the trace.
823	
824	  pid: The PID of that process.
825	
826	  CPU#: The CPU which the process was running on.
827	
828	  irqs-off: 'd' interrupts are disabled. '.' otherwise.
829		    Note: If the architecture does not support a way to
830			  read the irq flags variable, an 'X' will always
831			  be printed here.
832	
833	  need-resched:
834		'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
835		'n' only TIF_NEED_RESCHED is set,
836		'p' only PREEMPT_NEED_RESCHED is set,
837		'.' otherwise.
838	
839	  hardirq/softirq:
840		'Z' - NMI occurred inside a hardirq
841		'z' - NMI is running
842		'H' - hard irq occurred inside a softirq.
843		'h' - hard irq is running
844		's' - soft irq is running
845		'.' - normal context.
846	
847	  preempt-depth: The level of preempt_disabled
848	
849	The above is mostly meaningful for kernel developers.
850	
851	  time: When the latency-format option is enabled, the trace file
852		output includes a timestamp relative to the start of the
853		trace. This differs from the output when latency-format
854		is disabled, which includes an absolute timestamp.
855	
856	  delay: This is just to help catch your eye a bit better. And
857		 needs to be fixed to be only relative to the same CPU.
858		 The marks are determined by the difference between this
859		 current trace and the next trace.
860		  '$' - greater than 1 second
861		  '@' - greater than 100 milisecond
862		  '*' - greater than 10 milisecond
863		  '#' - greater than 1000 microsecond
864		  '!' - greater than 100 microsecond
865		  '+' - greater than 10 microsecond
866		  ' ' - less than or equal to 10 microsecond.
867	
868	  The rest is the same as the 'trace' file.
869	
870	  Note, the latency tracers will usually end with a back trace
871	  to easily find where the latency occurred.
872	
873	trace_options
874	-------------
875	
876	The trace_options file (or the options directory) is used to control
877	what gets printed in the trace output, or manipulate the tracers.
878	To see what is available, simply cat the file:
879	
880	  cat trace_options
881	print-parent
882	nosym-offset
883	nosym-addr
884	noverbose
885	noraw
886	nohex
887	nobin
888	noblock
889	trace_printk
890	annotate
891	nouserstacktrace
892	nosym-userobj
893	noprintk-msg-only
894	context-info
895	nolatency-format
896	record-cmd
897	norecord-tgid
898	overwrite
899	nodisable_on_free
900	irq-info
901	markers
902	noevent-fork
903	function-trace
904	nofunction-fork
905	nodisplay-graph
906	nostacktrace
907	nobranch
908	
909	To disable one of the options, echo in the option prepended with
910	"no".
911	
912	  echo noprint-parent > trace_options
913	
914	To enable an option, leave off the "no".
915	
916	  echo sym-offset > trace_options
917	
918	Here are the available options:
919	
920	  print-parent - On function traces, display the calling (parent)
921			 function as well as the function being traced.
922	
923	  print-parent:
924	   bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul
925	
926	  noprint-parent:
927	   bash-4000  [01]  1477.606694: simple_strtoul
928	
929	
930	  sym-offset - Display not only the function name, but also the
931		       offset in the function. For example, instead of
932		       seeing just "ktime_get", you will see
933		       "ktime_get+0xb/0x20".
934	
935	  sym-offset:
936	   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
937	
938	  sym-addr - this will also display the function address as well
939		     as the function name.
940	
941	  sym-addr:
942	   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
943	
944	  verbose - This deals with the trace file when the
945	            latency-format option is enabled.
946	
947	    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
948	    (+0.000ms): simple_strtoul (kstrtoul)
949	
950	  raw - This will display raw numbers. This option is best for
951		use with user applications that can translate the raw
952		numbers better than having it done in the kernel.
953	
954	  hex - Similar to raw, but the numbers will be in a hexadecimal
955		format.
956	
957	  bin - This will print out the formats in raw binary.
958	
959	  block - When set, reading trace_pipe will not block when polled.
960	
961	  trace_printk - Can disable trace_printk() from writing into the buffer.
962	
963	  annotate - It is sometimes confusing when the CPU buffers are full
964	  	     and one CPU buffer had a lot of events recently, thus
965		     a shorter time frame, were another CPU may have only had
966		     a few events, which lets it have older events. When
967		     the trace is reported, it shows the oldest events first,
968		     and it may look like only one CPU ran (the one with the
969		     oldest events). When the annotate option is set, it will
970		     display when a new CPU buffer started:
971	
972	          <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
973	          <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
974	          <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
975	##### CPU 2 buffer started ####
976	          <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
977	          <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
978	          <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
979	
980	  userstacktrace - This option changes the trace. It records a
981			   stacktrace of the current user space thread after
982			   each trace event.
983	
984	  sym-userobj - when user stacktrace are enabled, look up which
985			object the address belongs to, and print a
986			relative address. This is especially useful when
987			ASLR is on, otherwise you don't get a chance to
988			resolve the address to object/file/line after
989			the app is no longer running
990	
991			The lookup is performed when you read
992			trace,trace_pipe. Example:
993	
994			a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
995	x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
996	
997	
998	  printk-msg-only - When set, trace_printk()s will only show the format
999	  		    and not their parameters (if trace_bprintk() or
1000			    trace_bputs() was used to save the trace_printk()).
1001	
1002	  context-info - Show only the event data. Hides the comm, PID,
1003	  	         timestamp, CPU, and other useful data.
1004	
1005	  latency-format - This option changes the trace output. When it is enabled,
1006			   the trace displays additional information about the
1007			   latency, as described in "Latency trace format".
1008	
1009	  record-cmd - When any event or tracer is enabled, a hook is enabled
1010		       in the sched_switch trace point to fill comm cache
1011		       with mapped pids and comms. But this may cause some
1012		       overhead, and if you only care about pids, and not the
1013		       name of the task, disabling this option can lower the
1014		       impact of tracing. See "saved_cmdlines".
1015	
1016	  record-tgid - When any event or tracer is enabled, a hook is enabled
1017		        in the sched_switch trace point to fill the cache of
1018			mapped Thread Group IDs (TGID) mapping to pids. See
1019			"saved_tgids".
1020	
1021	  overwrite - This controls what happens when the trace buffer is
1022	              full. If "1" (default), the oldest events are
1023	              discarded and overwritten. If "0", then the newest
1024	              events are discarded.
1025		        (see per_cpu/cpu0/stats for overrun and dropped)
1026	
1027	  disable_on_free - When the free_buffer is closed, tracing will
1028	  		    stop (tracing_on set to 0).
1029	
1030	  irq-info - Shows the interrupt, preempt count, need resched data.
1031	  	     When disabled, the trace looks like:
1032	
1033	# tracer: function
1034	#
1035	# entries-in-buffer/entries-written: 144405/9452052   #P:4
1036	#
1037	#           TASK-PID   CPU#      TIMESTAMP  FUNCTION
1038	#              | |       |          |         |
1039	          <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
1040	          <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
1041	          <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
1042	
1043	
1044	  markers - When set, the trace_marker is writable (only by root).
1045	  	    When disabled, the trace_marker will error with EINVAL
1046		    on write.
1047	
1048	  event-fork - When set, tasks with PIDs listed in set_event_pid will have
1049		       the PIDs of their children added to set_event_pid when those
1050		       tasks fork. Also, when tasks with PIDs in set_event_pid exit,
1051		       their PIDs will be removed from the file.
1052	
1053	  function-trace - The latency tracers will enable function tracing
1054	  	    if this option is enabled (default it is). When
1055		    it is disabled, the latency tracers do not trace
1056		    functions. This keeps the overhead of the tracer down
1057		    when performing latency tests.
1058	
1059	  function-fork - When set, tasks with PIDs listed in set_ftrace_pid will
1060			  have the PIDs of their children added to set_ftrace_pid
1061			  when those tasks fork. Also, when tasks with PIDs in
1062			  set_ftrace_pid exit, their PIDs will be removed from the
1063			  file.
1064	
1065	  display-graph - When set, the latency tracers (irqsoff, wakeup, etc) will
1066		          use function graph tracing instead of function tracing.
1067	
1068	  stacktrace - When set, a stack trace is recorded after any trace event
1069		       is recorded.
1070	
1071	  branch - Enable branch tracing with the tracer. This enables branch
1072		   tracer along with the currently set tracer. Enabling this
1073		   with the "nop" tracer is the same as just enabling the
1074		   "branch" tracer.
1075	
1076	 Note: Some tracers have their own options. They only appear in this
1077	       file when the tracer is active. They always appear in the
1078	       options directory.
1079	
1080	
1081	Here are the per tracer options:
1082	
1083	Options for function tracer:
1084	
1085	  func_stack_trace - When set, a stack trace is recorded after every
1086			     function that is recorded. NOTE! Limit the functions
1087			     that are recorded before enabling this, with
1088			     "set_ftrace_filter" otherwise the system performance
1089			     will be critically degraded. Remember to disable
1090			     this option before clearing the function filter.
1091	
1092	Options for function_graph tracer:
1093	
1094	 Since the function_graph tracer has a slightly different output
1095	 it has its own options to control what is displayed.
1096	
1097	  funcgraph-overrun - When set, the "overrun" of the graph stack is
1098			      displayed after each function traced. The
1099			      overrun, is when the stack depth of the calls
1100			      is greater than what is reserved for each task.
1101			      Each task has a fixed array of functions to
1102			      trace in the call graph. If the depth of the
1103			      calls exceeds that, the function is not traced.
1104			      The overrun is the number of functions missed
1105			      due to exceeding this array.
1106	
1107	  funcgraph-cpu - When set, the CPU number of the CPU where the trace
1108			  occurred is displayed.
1109	
1110	  funcgraph-overhead - When set, if the function takes longer than
1111			       A certain amount, then a delay marker is
1112			       displayed. See "delay" above, under the
1113			       header description.
1114	
1115	  funcgraph-proc - Unlike other tracers, the process' command line
1116			   is not displayed by default, but instead only
1117			   when a task is traced in and out during a context
1118			   switch. Enabling this options has the command
1119			   of each process displayed at every line.
1120	
1121	  funcgraph-duration - At the end of each function (the return)
1122			       the duration of the amount of time in the
1123			       function is displayed in microseconds.
1124	
1125	  funcgraph-abstime - When set, the timestamp is displayed at each
1126			      line.
1127	
1128	  funcgraph-irqs - When disabled, functions that happen inside an
1129			   interrupt will not be traced.
1130	
1131	  funcgraph-tail - When set, the return event will include the function
1132			   that it represents. By default this is off, and
1133			   only a closing curly bracket "}" is displayed for
1134			   the return of a function.
1135	
1136	  sleep-time - When running function graph tracer, to include
1137		       the time a task schedules out in its function.
1138		       When enabled, it will account time the task has been
1139		       scheduled out as part of the function call.
1140	
1141	  graph-time - When running function profiler with function graph tracer,
1142		       to include the time to call nested functions. When this is
1143		       not set, the time reported for the function will only
1144		       include the time the function itself executed for, not the
1145		       time for functions that it called.
1146	
1147	Options for blk tracer:
1148	
1149	  blk_classic - Shows a more minimalistic output.
1150	
1151	
1152	irqsoff
1153	-------
1154	
1155	When interrupts are disabled, the CPU can not react to any other
1156	external event (besides NMIs and SMIs). This prevents the timer
1157	interrupt from triggering or the mouse interrupt from letting
1158	the kernel know of a new mouse event. The result is a latency
1159	with the reaction time.
1160	
1161	The irqsoff tracer tracks the time for which interrupts are
1162	disabled. When a new maximum latency is hit, the tracer saves
1163	the trace leading up to that latency point so that every time a
1164	new maximum is reached, the old saved trace is discarded and the
1165	new trace is saved.
1166	
1167	To reset the maximum, echo 0 into tracing_max_latency. Here is
1168	an example:
1169	
1170	 # echo 0 > options/function-trace
1171	 # echo irqsoff > current_tracer
1172	 # echo 1 > tracing_on
1173	 # echo 0 > tracing_max_latency
1174	 # ls -ltr
1175	 [...]
1176	 # echo 0 > tracing_on
1177	 # cat trace
1178	# tracer: irqsoff
1179	#
1180	# irqsoff latency trace v1.1.5 on 3.8.0-test+
1181	# --------------------------------------------------------------------
1182	# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1183	#    -----------------
1184	#    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
1185	#    -----------------
1186	#  => started at: run_timer_softirq
1187	#  => ended at:   run_timer_softirq
1188	#
1189	#
1190	#                  _------=> CPU#            
1191	#                 / _-----=> irqs-off        
1192	#                | / _----=> need-resched    
1193	#                || / _---=> hardirq/softirq 
1194	#                ||| / _--=> preempt-depth   
1195	#                |||| /     delay             
1196	#  cmd     pid   ||||| time  |   caller      
1197	#     \   /      |||||  \    |   /           
1198	  <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
1199	  <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
1200	  <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
1201	  <idle>-0       0dNs3   25us : <stack trace>
1202	 => _raw_spin_unlock_irq
1203	 => run_timer_softirq
1204	 => __do_softirq
1205	 => call_softirq
1206	 => do_softirq
1207	 => irq_exit
1208	 => smp_apic_timer_interrupt
1209	 => apic_timer_interrupt
1210	 => rcu_idle_exit
1211	 => cpu_idle
1212	 => rest_init
1213	 => start_kernel
1214	 => x86_64_start_reservations
1215	 => x86_64_start_kernel
1216	
1217	Here we see that that we had a latency of 16 microseconds (which is
1218	very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1219	interrupts. The difference between the 16 and the displayed
1220	timestamp 25us occurred because the clock was incremented
1221	between the time of recording the max latency and the time of
1222	recording the function that had that latency.
1223	
1224	Note the above example had function-trace not set. If we set
1225	function-trace, we get a much larger output:
1226	
1227	 with echo 1 > options/function-trace
1228	
1229	# tracer: irqsoff
1230	#
1231	# irqsoff latency trace v1.1.5 on 3.8.0-test+
1232	# --------------------------------------------------------------------
1233	# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1234	#    -----------------
1235	#    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1236	#    -----------------
1237	#  => started at: ata_scsi_queuecmd
1238	#  => ended at:   ata_scsi_queuecmd
1239	#
1240	#
1241	#                  _------=> CPU#            
1242	#                 / _-----=> irqs-off        
1243	#                | / _----=> need-resched    
1244	#                || / _---=> hardirq/softirq 
1245	#                ||| / _--=> preempt-depth   
1246	#                |||| /     delay             
1247	#  cmd     pid   ||||| time  |   caller      
1248	#     \   /      |||||  \    |   /           
1249	    bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1250	    bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
1251	    bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1252	    bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1253	    bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1254	    bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
1255	    bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
1256	    bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1257	    bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1258	[...]
1259	    bash-2042    3d..1   67us : delay_tsc <-__delay
1260	    bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1261	    bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
1262	    bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1263	    bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
1264	    bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1265	    bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1266	    bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1267	    bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1268	    bash-2042    3d..1  120us : <stack trace>
1269	 => _raw_spin_unlock_irqrestore
1270	 => ata_scsi_queuecmd
1271	 => scsi_dispatch_cmd
1272	 => scsi_request_fn
1273	 => __blk_run_queue_uncond
1274	 => __blk_run_queue
1275	 => blk_queue_bio
1276	 => generic_make_request
1277	 => submit_bio
1278	 => submit_bh
1279	 => __ext3_get_inode_loc
1280	 => ext3_iget
1281	 => ext3_lookup
1282	 => lookup_real
1283	 => __lookup_hash
1284	 => walk_component
1285	 => lookup_last
1286	 => path_lookupat
1287	 => filename_lookup
1288	 => user_path_at_empty
1289	 => user_path_at
1290	 => vfs_fstatat
1291	 => vfs_stat
1292	 => sys_newstat
1293	 => system_call_fastpath
1294	
1295	
1296	Here we traced a 71 microsecond latency. But we also see all the
1297	functions that were called during that time. Note that by
1298	enabling function tracing, we incur an added overhead. This
1299	overhead may extend the latency times. But nevertheless, this
1300	trace has provided some very helpful debugging information.
1301	
1302	
1303	preemptoff
1304	----------
1305	
1306	When preemption is disabled, we may be able to receive
1307	interrupts but the task cannot be preempted and a higher
1308	priority task must wait for preemption to be enabled again
1309	before it can preempt a lower priority task.
1310	
1311	The preemptoff tracer traces the places that disable preemption.
1312	Like the irqsoff tracer, it records the maximum latency for
1313	which preemption was disabled. The control of preemptoff tracer
1314	is much like the irqsoff tracer.
1315	
1316	 # echo 0 > options/function-trace
1317	 # echo preemptoff > current_tracer
1318	 # echo 1 > tracing_on
1319	 # echo 0 > tracing_max_latency
1320	 # ls -ltr
1321	 [...]
1322	 # echo 0 > tracing_on
1323	 # cat trace
1324	# tracer: preemptoff
1325	#
1326	# preemptoff latency trace v1.1.5 on 3.8.0-test+
1327	# --------------------------------------------------------------------
1328	# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1329	#    -----------------
1330	#    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1331	#    -----------------
1332	#  => started at: do_IRQ
1333	#  => ended at:   do_IRQ
1334	#
1335	#
1336	#                  _------=> CPU#            
1337	#                 / _-----=> irqs-off        
1338	#                | / _----=> need-resched    
1339	#                || / _---=> hardirq/softirq 
1340	#                ||| / _--=> preempt-depth   
1341	#                |||| /     delay             
1342	#  cmd     pid   ||||| time  |   caller      
1343	#     \   /      |||||  \    |   /           
1344	    sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
1345	    sshd-1991    1d..1   46us : irq_exit <-do_IRQ
1346	    sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
1347	    sshd-1991    1d..1   52us : <stack trace>
1348	 => sub_preempt_count
1349	 => irq_exit
1350	 => do_IRQ
1351	 => ret_from_intr
1352	
1353	
1354	This has some more changes. Preemption was disabled when an
1355	interrupt came in (notice the 'h'), and was enabled on exit.
1356	But we also see that interrupts have been disabled when entering
1357	the preempt off section and leaving it (the 'd'). We do not know if
1358	interrupts were enabled in the mean time or shortly after this
1359	was over.
1360	
1361	# tracer: preemptoff
1362	#
1363	# preemptoff latency trace v1.1.5 on 3.8.0-test+
1364	# --------------------------------------------------------------------
1365	# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1366	#    -----------------
1367	#    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1368	#    -----------------
1369	#  => started at: wake_up_new_task
1370	#  => ended at:   task_rq_unlock
1371	#
1372	#
1373	#                  _------=> CPU#            
1374	#                 / _-----=> irqs-off        
1375	#                | / _----=> need-resched    
1376	#                || / _---=> hardirq/softirq 
1377	#                ||| / _--=> preempt-depth   
1378	#                |||| /     delay             
1379	#  cmd     pid   ||||| time  |   caller      
1380	#     \   /      |||||  \    |   /           
1381	    bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
1382	    bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
1383	    bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
1384	    bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1385	    bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1386	[...]
1387	    bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
1388	    bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
1389	    bash-1994    1d..1   13us : add_preempt_count <-irq_enter
1390	    bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
1391	    bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1392	    bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
1393	    bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
1394	    bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
1395	[...]
1396	    bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
1397	    bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
1398	    bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
1399	    bash-1994    1d..2   36us : do_softirq <-irq_exit
1400	    bash-1994    1d..2   36us : __do_softirq <-call_softirq
1401	    bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
1402	    bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
1403	    bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
1404	    bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
1405	    bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
1406	[...]
1407	    bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
1408	    bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
1409	    bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
1410	    bash-1994    1dN.2   82us : idle_cpu <-irq_exit
1411	    bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
1412	    bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
1413	    bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1414	    bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
1415	    bash-1994    1.N.1  104us : <stack trace>
1416	 => sub_preempt_count
1417	 => _raw_spin_unlock_irqrestore
1418	 => task_rq_unlock
1419	 => wake_up_new_task
1420	 => do_fork
1421	 => sys_clone
1422	 => stub_clone
1423	
1424	
1425	The above is an example of the preemptoff trace with
1426	function-trace set. Here we see that interrupts were not disabled
1427	the entire time. The irq_enter code lets us know that we entered
1428	an interrupt 'h'. Before that, the functions being traced still
1429	show that it is not in an interrupt, but we can see from the
1430	functions themselves that this is not the case.
1431	
1432	preemptirqsoff
1433	--------------
1434	
1435	Knowing the locations that have interrupts disabled or
1436	preemption disabled for the longest times is helpful. But
1437	sometimes we would like to know when either preemption and/or
1438	interrupts are disabled.
1439	
1440	Consider the following code:
1441	
1442	    local_irq_disable();
1443	    call_function_with_irqs_off();
1444	    preempt_disable();
1445	    call_function_with_irqs_and_preemption_off();
1446	    local_irq_enable();
1447	    call_function_with_preemption_off();
1448	    preempt_enable();
1449	
1450	The irqsoff tracer will record the total length of
1451	call_function_with_irqs_off() and
1452	call_function_with_irqs_and_preemption_off().
1453	
1454	The preemptoff tracer will record the total length of
1455	call_function_with_irqs_and_preemption_off() and
1456	call_function_with_preemption_off().
1457	
1458	But neither will trace the time that interrupts and/or
1459	preemption is disabled. This total time is the time that we can
1460	not schedule. To record this time, use the preemptirqsoff
1461	tracer.
1462	
1463	Again, using this trace is much like the irqsoff and preemptoff
1464	tracers.
1465	
1466	 # echo 0 > options/function-trace
1467	 # echo preemptirqsoff > current_tracer
1468	 # echo 1 > tracing_on
1469	 # echo 0 > tracing_max_latency
1470	 # ls -ltr
1471	 [...]
1472	 # echo 0 > tracing_on
1473	 # cat trace
1474	# tracer: preemptirqsoff
1475	#
1476	# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1477	# --------------------------------------------------------------------
1478	# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1479	#    -----------------
1480	#    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1481	#    -----------------
1482	#  => started at: ata_scsi_queuecmd
1483	#  => ended at:   ata_scsi_queuecmd
1484	#
1485	#
1486	#                  _------=> CPU#            
1487	#                 / _-----=> irqs-off        
1488	#                | / _----=> need-resched    
1489	#                || / _---=> hardirq/softirq 
1490	#                ||| / _--=> preempt-depth   
1491	#                |||| /     delay             
1492	#  cmd     pid   ||||| time  |   caller      
1493	#     \   /      |||||  \    |   /           
1494	      ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1495	      ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1496	      ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
1497	      ls-2230    3...1  111us : <stack trace>
1498	 => sub_preempt_count
1499	 => _raw_spin_unlock_irqrestore
1500	 => ata_scsi_queuecmd
1501	 => scsi_dispatch_cmd
1502	 => scsi_request_fn
1503	 => __blk_run_queue_uncond
1504	 => __blk_run_queue
1505	 => blk_queue_bio
1506	 => generic_make_request
1507	 => submit_bio
1508	 => submit_bh
1509	 => ext3_bread
1510	 => ext3_dir_bread
1511	 => htree_dirblock_to_tree
1512	 => ext3_htree_fill_tree
1513	 => ext3_readdir
1514	 => vfs_readdir
1515	 => sys_getdents
1516	 => system_call_fastpath
1517	
1518	
1519	The trace_hardirqs_off_thunk is called from assembly on x86 when
1520	interrupts are disabled in the assembly code. Without the
1521	function tracing, we do not know if interrupts were enabled
1522	within the preemption points. We do see that it started with
1523	preemption enabled.
1524	
1525	Here is a trace with function-trace set:
1526	
1527	# tracer: preemptirqsoff
1528	#
1529	# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1530	# --------------------------------------------------------------------
1531	# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1532	#    -----------------
1533	#    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1534	#    -----------------
1535	#  => started at: schedule
1536	#  => ended at:   mutex_unlock
1537	#
1538	#
1539	#                  _------=> CPU#            
1540	#                 / _-----=> irqs-off        
1541	#                | / _----=> need-resched    
1542	#                || / _---=> hardirq/softirq 
1543	#                ||| / _--=> preempt-depth   
1544	#                |||| /     delay             
1545	#  cmd     pid   ||||| time  |   caller      
1546	#     \   /      |||||  \    |   /           
1547	kworker/-59      3...1    0us : __schedule <-schedule
1548	kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
1549	kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
1550	kworker/-59      3d..2    1us : deactivate_task <-__schedule
1551	kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
1552	kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
1553	kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
1554	kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
1555	kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
1556	kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
1557	kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
1558	kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
1559	kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1560	kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
1561	kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
1562	kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
1563	kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
1564	kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
1565	kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
1566	kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
1567	kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
1568	kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
1569	kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
1570	kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
1571	kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
1572	      ls-2269    3d..2    7us : finish_task_switch <-__schedule
1573	      ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
1574	      ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
1575	      ls-2269    3d..2    8us : irq_enter <-do_IRQ
1576	      ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
1577	      ls-2269    3d..2    9us : add_preempt_count <-irq_enter
1578	      ls-2269    3d.h2    9us : exit_idle <-do_IRQ
1579	[...]
1580	      ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
1581	      ls-2269    3d.h2   20us : irq_exit <-do_IRQ
1582	      ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
1583	      ls-2269    3d..3   21us : do_softirq <-irq_exit
1584	      ls-2269    3d..3   21us : __do_softirq <-call_softirq
1585	      ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
1586	      ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
1587	      ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
1588	      ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
1589	      ls-2269    3d.s5   31us : irq_enter <-do_IRQ
1590	      ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1591	[...]
1592	      ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1593	      ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
1594	      ls-2269    3d.H5   32us : exit_idle <-do_IRQ
1595	      ls-2269    3d.H5   32us : handle_irq <-do_IRQ
1596	      ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
1597	      ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
1598	[...]
1599	      ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1600	      ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1601	      ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
1602	      ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
1603	      ls-2269    3d..3  159us : idle_cpu <-irq_exit
1604	      ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
1605	      ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
1606	      ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
1607	      ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
1608	      ls-2269    3d...  186us : <stack trace>
1609	 => __mutex_unlock_slowpath
1610	 => mutex_unlock
1611	 => process_output
1612	 => n_tty_write
1613	 => tty_write
1614	 => vfs_write
1615	 => sys_write
1616	 => system_call_fastpath
1617	
1618	This is an interesting trace. It started with kworker running and
1619	scheduling out and ls taking over. But as soon as ls released the
1620	rq lock and enabled interrupts (but not preemption) an interrupt
1621	triggered. When the interrupt finished, it started running softirqs.
1622	But while the softirq was running, another interrupt triggered.
1623	When an interrupt is running inside a softirq, the annotation is 'H'.
1624	
1625	
1626	wakeup
1627	------
1628	
1629	One common case that people are interested in tracing is the
1630	time it takes for a task that is woken to actually wake up.
1631	Now for non Real-Time tasks, this can be arbitrary. But tracing
1632	it none the less can be interesting. 
1633	
1634	Without function tracing:
1635	
1636	 # echo 0 > options/function-trace
1637	 # echo wakeup > current_tracer
1638	 # echo 1 > tracing_on
1639	 # echo 0 > tracing_max_latency
1640	 # chrt -f 5 sleep 1
1641	 # echo 0 > tracing_on
1642	 # cat trace
1643	# tracer: wakeup
1644	#
1645	# wakeup latency trace v1.1.5 on 3.8.0-test+
1646	# --------------------------------------------------------------------
1647	# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1648	#    -----------------
1649	#    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1650	#    -----------------
1651	#
1652	#                  _------=> CPU#            
1653	#                 / _-----=> irqs-off        
1654	#                | / _----=> need-resched    
1655	#                || / _---=> hardirq/softirq 
1656	#                ||| / _--=> preempt-depth   
1657	#                |||| /     delay             
1658	#  cmd     pid   ||||| time  |   caller      
1659	#     \   /      |||||  \    |   /           
1660	  <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
1661	  <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1662	  <idle>-0       3d..3   15us : __schedule <-schedule
1663	  <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
1664	
1665	The tracer only traces the highest priority task in the system
1666	to avoid tracing the normal circumstances. Here we see that
1667	the kworker with a nice priority of -20 (not very nice), took
1668	just 15 microseconds from the time it woke up, to the time it
1669	ran.
1670	
1671	Non Real-Time tasks are not that interesting. A more interesting
1672	trace is to concentrate only on Real-Time tasks.
1673	
1674	wakeup_rt
1675	---------
1676	
1677	In a Real-Time environment it is very important to know the
1678	wakeup time it takes for the highest priority task that is woken
1679	up to the time that it executes. This is also known as "schedule
1680	latency". I stress the point that this is about RT tasks. It is
1681	also important to know the scheduling latency of non-RT tasks,
1682	but the average schedule latency is better for non-RT tasks.
1683	Tools like LatencyTop are more appropriate for such
1684	measurements.
1685	
1686	Real-Time environments are interested in the worst case latency.
1687	That is the longest latency it takes for something to happen,
1688	and not the average. We can have a very fast scheduler that may
1689	only have a large latency once in a while, but that would not
1690	work well with Real-Time tasks.  The wakeup_rt tracer was designed
1691	to record the worst case wakeups of RT tasks. Non-RT tasks are
1692	not recorded because the tracer only records one worst case and
1693	tracing non-RT tasks that are unpredictable will overwrite the
1694	worst case latency of RT tasks (just run the normal wakeup
1695	tracer for a while to see that effect).
1696	
1697	Since this tracer only deals with RT tasks, we will run this
1698	slightly differently than we did with the previous tracers.
1699	Instead of performing an 'ls', we will run 'sleep 1' under
1700	'chrt' which changes the priority of the task.
1701	
1702	 # echo 0 > options/function-trace
1703	 # echo wakeup_rt > current_tracer
1704	 # echo 1 > tracing_on
1705	 # echo 0 > tracing_max_latency
1706	 # chrt -f 5 sleep 1
1707	 # echo 0 > tracing_on
1708	 # cat trace
1709	# tracer: wakeup
1710	#
1711	# tracer: wakeup_rt
1712	#
1713	# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1714	# --------------------------------------------------------------------
1715	# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1716	#    -----------------
1717	#    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
1718	#    -----------------
1719	#
1720	#                  _------=> CPU#            
1721	#                 / _-----=> irqs-off        
1722	#                | / _----=> need-resched    
1723	#                || / _---=> hardirq/softirq 
1724	#                ||| / _--=> preempt-depth   
1725	#                |||| /     delay             
1726	#  cmd     pid   ||||| time  |   caller      
1727	#     \   /      |||||  \    |   /           
1728	  <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
1729	  <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1730	  <idle>-0       3d..3    5us : __schedule <-schedule
1731	  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1732	
1733	
1734	Running this on an idle system, we see that it only took 5 microseconds
1735	to perform the task switch.  Note, since the trace point in the schedule
1736	is before the actual "switch", we stop the tracing when the recorded task
1737	is about to schedule in. This may change if we add a new marker at the
1738	end of the scheduler.
1739	
1740	Notice that the recorded task is 'sleep' with the PID of 2389
1741	and it has an rt_prio of 5. This priority is user-space priority
1742	and not the internal kernel priority. The policy is 1 for
1743	SCHED_FIFO and 2 for SCHED_RR.
1744	
1745	Note, that the trace data shows the internal priority (99 - rtprio).
1746	
1747	  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1748	
1749	The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
1750	and in the running state 'R'. The sleep task was scheduled in with
1751	2389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
1752	and it too is in the running state.
1753	
1754	Doing the same with chrt -r 5 and function-trace set.
1755	
1756	  echo 1 > options/function-trace
1757	
1758	# tracer: wakeup_rt
1759	#
1760	# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1761	# --------------------------------------------------------------------
1762	# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1763	#    -----------------
1764	#    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
1765	#    -----------------
1766	#
1767	#                  _------=> CPU#            
1768	#                 / _-----=> irqs-off        
1769	#                | / _----=> need-resched    
1770	#                || / _---=> hardirq/softirq 
1771	#                ||| / _--=> preempt-depth   
1772	#                |||| /     delay             
1773	#  cmd     pid   ||||| time  |   caller      
1774	#     \   /      |||||  \    |   /           
1775	  <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
1776	  <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1777	  <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
1778	  <idle>-0       3d.h3    3us : resched_curr <-check_preempt_curr
1779	  <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
1780	  <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
1781	  <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
1782	  <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
1783	  <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
1784	  <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1785	  <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
1786	  <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
1787	  <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
1788	  <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
1789	  <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
1790	  <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
1791	  <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
1792	  <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
1793	  <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
1794	  <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
1795	  <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
1796	  <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
1797	  <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
1798	  <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
1799	  <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
1800	  <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
1801	  <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
1802	  <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
1803	  <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
1804	  <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
1805	  <idle>-0       3dN.1   13us : cpu_load_update_nohz <-tick_nohz_idle_exit
1806	  <idle>-0       3dN.1   13us : _raw_spin_lock <-cpu_load_update_nohz
1807	  <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
1808	  <idle>-0       3dN.2   13us : __cpu_load_update <-cpu_load_update_nohz
1809	  <idle>-0       3dN.2   14us : sched_avg_update <-__cpu_load_update
1810	  <idle>-0       3dN.2   14us : _raw_spin_unlock <-cpu_load_update_nohz
1811	  <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
1812	  <idle>-0       3dN.1   15us : calc_load_nohz_stop <-tick_nohz_idle_exit
1813	  <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
1814	  <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
1815	  <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
1816	  <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
1817	  <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1818	  <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
1819	  <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
1820	  <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
1821	  <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
1822	  <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
1823	  <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
1824	  <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
1825	  <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
1826	  <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1827	  <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
1828	  <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
1829	  <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
1830	  <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
1831	  <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
1832	  <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
1833	  <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1834	  <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
1835	  <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
1836	  <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
1837	  <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
1838	  <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
1839	  <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
1840	  <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
1841	  <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
1842	  <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1843	  <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
1844	  <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
1845	  <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
1846	  <idle>-0       3.N..   25us : schedule <-cpu_idle
1847	  <idle>-0       3.N..   25us : __schedule <-preempt_schedule
1848	  <idle>-0       3.N..   26us : add_preempt_count <-__schedule
1849	  <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
1850	  <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
1851	  <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
1852	  <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
1853	  <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
1854	  <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
1855	  <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
1856	  <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
1857	  <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
1858	  <idle>-0       3d..3   29us : __schedule <-preempt_schedule
1859	  <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
1860	
1861	This isn't that big of a trace, even with function tracing enabled,
1862	so I included the entire trace.
1863	
1864	The interrupt went off while when the system was idle. Somewhere
1865	before task_woken_rt() was called, the NEED_RESCHED flag was set,
1866	this is indicated by the first occurrence of the 'N' flag.
1867	
1868	Latency tracing and events
1869	--------------------------
1870	As function tracing can induce a much larger latency, but without
1871	seeing what happens within the latency it is hard to know what
1872	caused it. There is a middle ground, and that is with enabling
1873	events.
1874	
1875	 # echo 0 > options/function-trace
1876	 # echo wakeup_rt > current_tracer
1877	 # echo 1 > events/enable
1878	 # echo 1 > tracing_on
1879	 # echo 0 > tracing_max_latency
1880	 # chrt -f 5 sleep 1
1881	 # echo 0 > tracing_on
1882	 # cat trace
1883	# tracer: wakeup_rt
1884	#
1885	# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1886	# --------------------------------------------------------------------
1887	# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1888	#    -----------------
1889	#    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
1890	#    -----------------
1891	#
1892	#                  _------=> CPU#            
1893	#                 / _-----=> irqs-off        
1894	#                | / _----=> need-resched    
1895	#                || / _---=> hardirq/softirq 
1896	#                ||| / _--=> preempt-depth   
1897	#                |||| /     delay             
1898	#  cmd     pid   ||||| time  |   caller      
1899	#     \   /      |||||  \    |   /           
1900	  <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
1901	  <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1902	  <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
1903	  <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
1904	  <idle>-0       2.N.2    2us : power_end: cpu_id=2
1905	  <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
1906	  <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
1907	  <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
1908	  <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
1909	  <idle>-0       2.N.2    5us : rcu_utilization: End context switch
1910	  <idle>-0       2d..3    6us : __schedule <-schedule
1911	  <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
1912	
1913	
1914	Hardware Latency Detector
1915	-------------------------
1916	
1917	The hardware latency detector is executed by enabling the "hwlat" tracer.
1918	
1919	NOTE, this tracer will affect the performance of the system as it will
1920	periodically make a CPU constantly busy with interrupts disabled.
1921	
1922	 # echo hwlat > current_tracer
1923	 # sleep 100
1924	 # cat trace
1925	# tracer: hwlat
1926	#
1927	#                              _-----=> irqs-off
1928	#                             / _----=> need-resched
1929	#                            | / _---=> hardirq/softirq
1930	#                            || / _--=> preempt-depth
1931	#                            ||| /     delay
1932	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
1933	#              | |       |   ||||       |         |
1934	           <...>-3638  [001] d... 19452.055471: #1     inner/outer(us):   12/14    ts:1499801089.066141940
1935	           <...>-3638  [003] d... 19454.071354: #2     inner/outer(us):   11/9     ts:1499801091.082164365
1936	           <...>-3638  [002] dn.. 19461.126852: #3     inner/outer(us):   12/9     ts:1499801098.138150062
1937	           <...>-3638  [001] d... 19488.340960: #4     inner/outer(us):    8/12    ts:1499801125.354139633
1938	           <...>-3638  [003] d... 19494.388553: #5     inner/outer(us):    8/12    ts:1499801131.402150961
1939	           <...>-3638  [003] d... 19501.283419: #6     inner/outer(us):    0/12    ts:1499801138.297435289 nmi-total:4 nmi-count:1
1940	
1941	
1942	The above output is somewhat the same in the header. All events will have
1943	interrupts disabled 'd'. Under the FUNCTION title there is:
1944	
1945	 #1 - This is the count of events recorded that were greater than the
1946	      tracing_threshold (See below).
1947	
1948	 inner/outer(us):   12/14
1949	
1950	      This shows two numbers as "inner latency" and "outer latency". The test
1951	      runs in a loop checking a timestamp twice. The latency detected within
1952	      the two timestamps is the "inner latency" and the latency detected
1953	      after the previous timestamp and the next timestamp in the loop is
1954	      the "outer latency".
1955	
1956	 ts:1499801089.066141940
1957	
1958	      The absolute timestamp that the event happened.
1959	
1960	 nmi-total:4 nmi-count:1
1961	
1962	      On architectures that support it, if an NMI comes in during the
1963	      test, the time spent in NMI is reported in "nmi-total" (in
1964	      microseconds).
1965	
1966	      All architectures that have NMIs will show the "nmi-count" if an
1967	      NMI comes in during the test.
1968	
1969	hwlat files:
1970	
1971	  tracing_threshold - This gets automatically set to "10" to represent 10
1972			      microseconds. This is the threshold of latency that
1973			      needs to be detected before the trace will be recorded.
1974	
1975			      Note, when hwlat tracer is finished (another tracer is
1976			      written into "current_tracer"), the original value for
1977			      tracing_threshold is placed back into this file.
1978	
1979	  hwlat_detector/width - The length of time the test runs with interrupts
1980				 disabled.
1981	
1982	  hwlat_detector/window - The length of time of the window which the test
1983				  runs. That is, the test will run for "width"
1984				  microseconds per "window" microseconds
1985	
1986	  tracing_cpumask - When the test is started. A kernel thread is created that
1987			    runs the test. This thread will alternate between CPUs
1988			    listed in the tracing_cpumask between each period
1989			    (one "window"). To limit the test to specific CPUs
1990			    set the mask in this file to only the CPUs that the test
1991			    should run on.
1992	
1993	function
1994	--------
1995	
1996	This tracer is the function tracer. Enabling the function tracer
1997	can be done from the debug file system. Make sure the
1998	ftrace_enabled is set; otherwise this tracer is a nop.
1999	See the "ftrace_enabled" section below.
2000	
2001	 # sysctl kernel.ftrace_enabled=1
2002	 # echo function > current_tracer
2003	 # echo 1 > tracing_on
2004	 # usleep 1
2005	 # echo 0 > tracing_on
2006	 # cat trace
2007	# tracer: function
2008	#
2009	# entries-in-buffer/entries-written: 24799/24799   #P:4
2010	#
2011	#                              _-----=> irqs-off
2012	#                             / _----=> need-resched
2013	#                            | / _---=> hardirq/softirq
2014	#                            || / _--=> preempt-depth
2015	#                            ||| /     delay
2016	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2017	#              | |       |   ||||       |         |
2018	            bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
2019	            bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
2020	            bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
2021	            bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
2022	            bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
2023	            bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
2024	            bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
2025	            bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
2026	[...]
2027	
2028	
2029	Note: function tracer uses ring buffers to store the above
2030	entries. The newest data may overwrite the oldest data.
2031	Sometimes using echo to stop the trace is not sufficient because
2032	the tracing could have overwritten the data that you wanted to
2033	record. For this reason, it is sometimes better to disable
2034	tracing directly from a program. This allows you to stop the
2035	tracing at the point that you hit the part that you are
2036	interested in. To disable the tracing directly from a C program,
2037	something like following code snippet can be used:
2038	
2039	int trace_fd;
2040	[...]
2041	int main(int argc, char *argv[]) {
2042		[...]
2043		trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
2044		[...]
2045		if (condition_hit()) {
2046			write(trace_fd, "0", 1);
2047		}
2048		[...]
2049	}
2050	
2051	
2052	Single thread tracing
2053	---------------------
2054	
2055	By writing into set_ftrace_pid you can trace a
2056	single thread. For example:
2057	
2058	# cat set_ftrace_pid
2059	no pid
2060	# echo 3111 > set_ftrace_pid
2061	# cat set_ftrace_pid
2062	3111
2063	# echo function > current_tracer
2064	# cat trace | head
2065	 # tracer: function
2066	 #
2067	 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2068	 #              | |       |          |         |
2069	     yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
2070	     yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
2071	     yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
2072	     yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
2073	     yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
2074	     yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
2075	# echo > set_ftrace_pid
2076	# cat trace |head
2077	 # tracer: function
2078	 #
2079	 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
2080	 #              | |       |          |         |
2081	 ##### CPU 3 buffer started ####
2082	     yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
2083	     yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
2084	     yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
2085	     yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
2086	     yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
2087	
2088	If you want to trace a function when executing, you could use
2089	something like this simple program:
2090	
2091	#include <stdio.h>
2092	#include <stdlib.h>
2093	#include <sys/types.h>
2094	#include <sys/stat.h>
2095	#include <fcntl.h>
2096	#include <unistd.h>
2097	#include <string.h>
2098	
2099	#define _STR(x) #x
2100	#define STR(x) _STR(x)
2101	#define MAX_PATH 256
2102	
2103	const char *find_tracefs(void)
2104	{
2105	       static char tracefs[MAX_PATH+1];
2106	       static int tracefs_found;
2107	       char type[100];
2108	       FILE *fp;
2109	
2110	       if (tracefs_found)
2111	               return tracefs;
2112	
2113	       if ((fp = fopen("/proc/mounts","r")) == NULL) {
2114	               perror("/proc/mounts");
2115	               return NULL;
2116	       }
2117	
2118	       while (fscanf(fp, "%*s %"
2119	                     STR(MAX_PATH)
2120	                     "s %99s %*s %*d %*d\n",
2121	                     tracefs, type) == 2) {
2122	               if (strcmp(type, "tracefs") == 0)
2123	                       break;
2124	       }
2125	       fclose(fp);
2126	
2127	       if (strcmp(type, "tracefs") != 0) {
2128	               fprintf(stderr, "tracefs not mounted");
2129	               return NULL;
2130	       }
2131	
2132	       strcat(tracefs, "/tracing/");
2133	       tracefs_found = 1;
2134	
2135	       return tracefs;
2136	}
2137	
2138	const char *tracing_file(const char *file_name)
2139	{
2140	       static char trace_file[MAX_PATH+1];
2141	       snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
2142	       return trace_file;
2143	}
2144	
2145	int main (int argc, char **argv)
2146	{
2147	        if (argc < 1)
2148	                exit(-1);
2149	
2150	        if (fork() > 0) {
2151	                int fd, ffd;
2152	                char line[64];
2153	                int s;
2154	
2155	                ffd = open(tracing_file("current_tracer"), O_WRONLY);
2156	                if (ffd < 0)
2157	                        exit(-1);
2158	                write(ffd, "nop", 3);
2159	
2160	                fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
2161	                s = sprintf(line, "%d\n", getpid());
2162	                write(fd, line, s);
2163	
2164	                write(ffd, "function", 8);
2165	
2166	                close(fd);
2167	                close(ffd);
2168	
2169	                execvp(argv[1], argv+1);
2170	        }
2171	
2172	        return 0;
2173	}
2174	
2175	Or this simple script!
2176	
2177	------
2178	#!/bin/bash
2179	
2180	tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
2181	echo nop > $tracefs/tracing/current_tracer
2182	echo 0 > $tracefs/tracing/tracing_on
2183	echo $$ > $tracefs/tracing/set_ftrace_pid
2184	echo function > $tracefs/tracing/current_tracer
2185	echo 1 > $tracefs/tracing/tracing_on
2186	exec "$@"
2187	------
2188	
2189	
2190	function graph tracer
2191	---------------------------
2192	
2193	This tracer is similar to the function tracer except that it
2194	probes a function on its entry and its exit. This is done by
2195	using a dynamically allocated stack of return addresses in each
2196	task_struct. On function entry the tracer overwrites the return
2197	address of each function traced to set a custom probe. Thus the
2198	original return address is stored on the stack of return address
2199	in the task_struct.
2200	
2201	Probing on both ends of a function leads to special features
2202	such as:
2203	
2204	- measure of a function's time execution
2205	- having a reliable call stack to draw function calls graph
2206	
2207	This tracer is useful in several situations:
2208	
2209	- you want to find the reason of a strange kernel behavior and
2210	  need to see what happens in detail on any areas (or specific
2211	  ones).
2212	
2213	- you are experiencing weird latencies but it's difficult to
2214	  find its origin.
2215	
2216	- you want to find quickly which path is taken by a specific
2217	  function
2218	
2219	- you just want to peek inside a working kernel and want to see
2220	  what happens there.
2221	
2222	# tracer: function_graph
2223	#
2224	# CPU  DURATION                  FUNCTION CALLS
2225	# |     |   |                     |   |   |   |
2226	
2227	 0)               |  sys_open() {
2228	 0)               |    do_sys_open() {
2229	 0)               |      getname() {
2230	 0)               |        kmem_cache_alloc() {
2231	 0)   1.382 us    |          __might_sleep();
2232	 0)   2.478 us    |        }
2233	 0)               |        strncpy_from_user() {
2234	 0)               |          might_fault() {
2235	 0)   1.389 us    |            __might_sleep();
2236	 0)   2.553 us    |          }
2237	 0)   3.807 us    |        }
2238	 0)   7.876 us    |      }
2239	 0)               |      alloc_fd() {
2240	 0)   0.668 us    |        _spin_lock();
2241	 0)   0.570 us    |        expand_files();
2242	 0)   0.586 us    |        _spin_unlock();
2243	
2244	
2245	There are several columns that can be dynamically
2246	enabled/disabled. You can use every combination of options you
2247	want, depending on your needs.
2248	
2249	- The cpu number on which the function executed is default
2250	  enabled.  It is sometimes better to only trace one cpu (see
2251	  tracing_cpu_mask file) or you might sometimes see unordered
2252	  function calls while cpu tracing switch.
2253	
2254		hide: echo nofuncgraph-cpu > trace_options
2255		show: echo funcgraph-cpu > trace_options
2256	
2257	- The duration (function's time of execution) is displayed on
2258	  the closing bracket line of a function or on the same line
2259	  than the current function in case of a leaf one. It is default
2260	  enabled.
2261	
2262		hide: echo nofuncgraph-duration > trace_options
2263		show: echo funcgraph-duration > trace_options
2264	
2265	- The overhead field precedes the duration field in case of
2266	  reached duration thresholds.
2267	
2268		hide: echo nofuncgraph-overhead > trace_options
2269		show: echo funcgraph-overhead > trace_options
2270		depends on: funcgraph-duration
2271	
2272	  ie:
2273	
2274	  3) # 1837.709 us |          } /* __switch_to */
2275	  3)               |          finish_task_switch() {
2276	  3)   0.313 us    |            _raw_spin_unlock_irq();
2277	  3)   3.177 us    |          }
2278	  3) # 1889.063 us |        } /* __schedule */
2279	  3) ! 140.417 us  |      } /* __schedule */
2280	  3) # 2034.948 us |    } /* schedule */
2281	  3) * 33998.59 us |  } /* schedule_preempt_disabled */
2282	
2283	  [...]
2284	
2285	  1)   0.260 us    |              msecs_to_jiffies();
2286	  1)   0.313 us    |              __rcu_read_unlock();
2287	  1) + 61.770 us   |            }
2288	  1) + 64.479 us   |          }
2289	  1)   0.313 us    |          rcu_bh_qs();
2290	  1)   0.313 us    |          __local_bh_enable();
2291	  1) ! 217.240 us  |        }
2292	  1)   0.365 us    |        idle_cpu();
2293	  1)               |        rcu_irq_exit() {
2294	  1)   0.417 us    |          rcu_eqs_enter_common.isra.47();
2295	  1)   3.125 us    |        }
2296	  1) ! 227.812 us  |      }
2297	  1) ! 457.395 us  |    }
2298	  1) @ 119760.2 us |  }
2299	
2300	  [...]
2301	
2302	  2)               |    handle_IPI() {
2303	  1)   6.979 us    |                  }
2304	  2)   0.417 us    |      scheduler_ipi();
2305	  1)   9.791 us    |                }
2306	  1) + 12.917 us   |              }
2307	  2)   3.490 us    |    }
2308	  1) + 15.729 us   |            }
2309	  1) + 18.542 us   |          }
2310	  2) $ 3594274 us  |  }
2311	
2312	  + means that the function exceeded 10 usecs.
2313	  ! means that the function exceeded 100 usecs.
2314	  # means that the function exceeded 1000 usecs.
2315	  * means that the function exceeded 10 msecs.
2316	  @ means that the function exceeded 100 msecs.
2317	  $ means that the function exceeded 1 sec.
2318	
2319	
2320	- The task/pid field displays the thread cmdline and pid which
2321	  executed the function. It is default disabled.
2322	
2323		hide: echo nofuncgraph-proc > trace_options
2324		show: echo funcgraph-proc > trace_options
2325	
2326	  ie:
2327	
2328	  # tracer: function_graph
2329	  #
2330	  # CPU  TASK/PID        DURATION                  FUNCTION CALLS
2331	  # |    |    |           |   |                     |   |   |   |
2332	  0)    sh-4802     |               |                  d_free() {
2333	  0)    sh-4802     |               |                    call_rcu() {
2334	  0)    sh-4802     |               |                      __call_rcu() {
2335	  0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
2336	  0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
2337	  0)    sh-4802     |   2.899 us    |                      }
2338	  0)    sh-4802     |   4.040 us    |                    }
2339	  0)    sh-4802     |   5.151 us    |                  }
2340	  0)    sh-4802     | + 49.370 us   |                }
2341	
2342	
2343	- The absolute time field is an absolute timestamp given by the
2344	  system clock since it started. A snapshot of this time is
2345	  given on each entry/exit of functions
2346	
2347		hide: echo nofuncgraph-abstime > trace_options
2348		show: echo funcgraph-abstime > trace_options
2349	
2350	  ie:
2351	
2352	  #
2353	  #      TIME       CPU  DURATION                  FUNCTION CALLS
2354	  #       |         |     |   |                     |   |   |   |
2355	  360.774522 |   1)   0.541 us    |                                          }
2356	  360.774522 |   1)   4.663 us    |                                        }
2357	  360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
2358	  360.774524 |   1)   6.796 us    |                                      }
2359	  360.774524 |   1)   7.952 us    |                                    }
2360	  360.774525 |   1)   9.063 us    |                                  }
2361	  360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
2362	  360.774527 |   1)   0.578 us    |                                  __brelse();
2363	  360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
2364	  360.774528 |   1)               |                                    unlock_buffer() {
2365	  360.774529 |   1)               |                                      wake_up_bit() {
2366	  360.774529 |   1)               |                                        bit_waitqueue() {
2367	  360.774530 |   1)   0.594 us    |                                          __phys_addr();
2368	
2369	
2370	The function name is always displayed after the closing bracket
2371	for a function if the start of that function is not in the
2372	trace buffer.
2373	
2374	Display of the function name after the closing bracket may be
2375	enabled for functions whose start is in the trace buffer,
2376	allowing easier searching with grep for function durations.
2377	It is default disabled.
2378	
2379		hide: echo nofuncgraph-tail > trace_options
2380		show: echo funcgraph-tail > trace_options
2381	
2382	  Example with nofuncgraph-tail (default):
2383	  0)               |      putname() {
2384	  0)               |        kmem_cache_free() {
2385	  0)   0.518 us    |          __phys_addr();
2386	  0)   1.757 us    |        }
2387	  0)   2.861 us    |      }
2388	
2389	  Example with funcgraph-tail:
2390	  0)               |      putname() {
2391	  0)               |        kmem_cache_free() {
2392	  0)   0.518 us    |          __phys_addr();
2393	  0)   1.757 us    |        } /* kmem_cache_free() */
2394	  0)   2.861 us    |      } /* putname() */
2395	
2396	You can put some comments on specific functions by using
2397	trace_printk() For example, if you want to put a comment inside
2398	the __might_sleep() function, you just have to include
2399	<linux/ftrace.h> and call trace_printk() inside __might_sleep()
2400	
2401	trace_printk("I'm a comment!\n")
2402	
2403	will produce:
2404	
2405	 1)               |             __might_sleep() {
2406	 1)               |                /* I'm a comment! */
2407	 1)   1.449 us    |             }
2408	
2409	
2410	You might find other useful features for this tracer in the
2411	following "dynamic ftrace" section such as tracing only specific
2412	functions or tasks.
2413	
2414	dynamic ftrace
2415	--------------
2416	
2417	If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2418	virtually no overhead when function tracing is disabled. The way
2419	this works is the mcount function call (placed at the start of
2420	every kernel function, produced by the -pg switch in gcc),
2421	starts of pointing to a simple return. (Enabling FTRACE will
2422	include the -pg switch in the compiling of the kernel.)
2423	
2424	At compile time every C file object is run through the
2425	recordmcount program (located in the scripts directory). This
2426	program will parse the ELF headers in the C object to find all
2427	the locations in the .text section that call mcount. Starting
2428	with gcc verson 4.6, the -mfentry has been added for x86, which
2429	calls "__fentry__" instead of "mcount". Which is called before
2430	the creation of the stack frame.
2431	
2432	Note, not all sections are traced. They may be prevented by either
2433	a notrace, or blocked another way and all inline functions are not
2434	traced. Check the "available_filter_functions" file to see what functions
2435	can be traced.
2436	
2437	A section called "__mcount_loc" is created that holds
2438	references to all the mcount/fentry call sites in the .text section.
2439	The recordmcount program re-links this section back into the
2440	original object. The final linking stage of the kernel will add all these
2441	references into a single table.
2442	
2443	On boot up, before SMP is initialized, the dynamic ftrace code
2444	scans this table and updates all the locations into nops. It
2445	also records the locations, which are added to the
2446	available_filter_functions list.  Modules are processed as they
2447	are loaded and before they are executed.  When a module is
2448	unloaded, it also removes its functions from the ftrace function
2449	list. This is automatic in the module unload code, and the
2450	module author does not need to worry about it.
2451	
2452	When tracing is enabled, the process of modifying the function
2453	tracepoints is dependent on architecture. The old method is to use
2454	kstop_machine to prevent races with the CPUs executing code being
2455	modified (which can cause the CPU to do undesirable things, especially
2456	if the modified code crosses cache (or page) boundaries), and the nops are
2457	patched back to calls. But this time, they do not call mcount
2458	(which is just a function stub). They now call into the ftrace
2459	infrastructure.
2460	
2461	The new method of modifying the function tracepoints is to place
2462	a breakpoint at the location to be modified, sync all CPUs, modify
2463	the rest of the instruction not covered by the breakpoint. Sync
2464	all CPUs again, and then remove the breakpoint with the finished
2465	version to the ftrace call site.
2466	
2467	Some archs do not even need to monkey around with the synchronization,
2468	and can just slap the new code on top of the old without any
2469	problems with other CPUs executing it at the same time.
2470	
2471	One special side-effect to the recording of the functions being
2472	traced is that we can now selectively choose which functions we
2473	wish to trace and which ones we want the mcount calls to remain
2474	as nops.
2475	
2476	Two files are used, one for enabling and one for disabling the
2477	tracing of specified functions. They are:
2478	
2479	  set_ftrace_filter
2480	
2481	and
2482	
2483	  set_ftrace_notrace
2484	
2485	A list of available functions that you can add to these files is
2486	listed in:
2487	
2488	   available_filter_functions
2489	
2490	 # cat available_filter_functions
2491	put_prev_task_idle
2492	kmem_cache_create
2493	pick_next_task_rt
2494	get_online_cpus
2495	pick_next_task_fair
2496	mutex_lock
2497	[...]
2498	
2499	If I am only interested in sys_nanosleep and hrtimer_interrupt:
2500	
2501	 # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2502	 # echo function > current_tracer
2503	 # echo 1 > tracing_on
2504	 # usleep 1
2505	 # echo 0 > tracing_on
2506	 # cat trace
2507	# tracer: function
2508	#
2509	# entries-in-buffer/entries-written: 5/5   #P:4
2510	#
2511	#                              _-----=> irqs-off
2512	#                             / _----=> need-resched
2513	#                            | / _---=> hardirq/softirq
2514	#                            || / _--=> preempt-depth
2515	#                            ||| /     delay
2516	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2517	#              | |       |   ||||       |         |
2518	          usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
2519	          <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2520	          usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2521	          <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2522	          <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2523	
2524	To see which functions are being traced, you can cat the file:
2525	
2526	 # cat set_ftrace_filter
2527	hrtimer_interrupt
2528	sys_nanosleep
2529	
2530	
2531	Perhaps this is not enough. The filters also allow glob(7) matching.
2532	
2533	  <match>*  - will match functions that begin with <match>
2534	  *<match>  - will match functions that end with <match>
2535	  *<match>* - will match functions that have <match> in it
2536	  <match1>*<match2> - will match functions that begin with
2537	                      <match1> and end with <match2>
2538	
2539	Note: It is better to use quotes to enclose the wild cards,
2540	      otherwise the shell may expand the parameters into names
2541	      of files in the local directory.
2542	
2543	 # echo 'hrtimer_*' > set_ftrace_filter
2544	
2545	Produces:
2546	
2547	# tracer: function
2548	#
2549	# entries-in-buffer/entries-written: 897/897   #P:4
2550	#
2551	#                              _-----=> irqs-off
2552	#                             / _----=> need-resched
2553	#                            | / _---=> hardirq/softirq
2554	#                            || / _--=> preempt-depth
2555	#                            ||| /     delay
2556	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2557	#              | |       |   ||||       |         |
2558	          <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
2559	          <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
2560	          <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2561	          <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2562	          <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2563	          <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2564	          <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2565	          <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
2566	
2567	Notice that we lost the sys_nanosleep.
2568	
2569	 # cat set_ftrace_filter
2570	hrtimer_run_queues
2571	hrtimer_run_pending
2572	hrtimer_init
2573	hrtimer_cancel
2574	hrtimer_try_to_cancel
2575	hrtimer_forward
2576	hrtimer_start
2577	hrtimer_reprogram
2578	hrtimer_force_reprogram
2579	hrtimer_get_next_event
2580	hrtimer_interrupt
2581	hrtimer_nanosleep
2582	hrtimer_wakeup
2583	hrtimer_get_remaining
2584	hrtimer_get_res
2585	hrtimer_init_sleeper
2586	
2587	
2588	This is because the '>' and '>>' act just like they do in bash.
2589	To rewrite the filters, use '>'
2590	To append to the filters, use '>>'
2591	
2592	To clear out a filter so that all functions will be recorded
2593	again:
2594	
2595	 # echo > set_ftrace_filter
2596	 # cat set_ftrace_filter
2597	 #
2598	
2599	Again, now we want to append.
2600	
2601	 # echo sys_nanosleep > set_ftrace_filter
2602	 # cat set_ftrace_filter
2603	sys_nanosleep
2604	 # echo 'hrtimer_*' >> set_ftrace_filter
2605	 # cat set_ftrace_filter
2606	hrtimer_run_queues
2607	hrtimer_run_pending
2608	hrtimer_init
2609	hrtimer_cancel
2610	hrtimer_try_to_cancel
2611	hrtimer_forward
2612	hrtimer_start
2613	hrtimer_reprogram
2614	hrtimer_force_reprogram
2615	hrtimer_get_next_event
2616	hrtimer_interrupt
2617	sys_nanosleep
2618	hrtimer_nanosleep
2619	hrtimer_wakeup
2620	hrtimer_get_remaining
2621	hrtimer_get_res
2622	hrtimer_init_sleeper
2623	
2624	
2625	The set_ftrace_notrace prevents those functions from being
2626	traced.
2627	
2628	 # echo '*preempt*' '*lock*' > set_ftrace_notrace
2629	
2630	Produces:
2631	
2632	# tracer: function
2633	#
2634	# entries-in-buffer/entries-written: 39608/39608   #P:4
2635	#
2636	#                              _-----=> irqs-off
2637	#                             / _----=> need-resched
2638	#                            | / _---=> hardirq/softirq
2639	#                            || / _--=> preempt-depth
2640	#                            ||| /     delay
2641	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2642	#              | |       |   ||||       |         |
2643	            bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
2644	            bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
2645	            bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
2646	            bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
2647	            bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
2648	            bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
2649	            bash-1994  [000] ....  4342.324899: do_truncate <-do_last
2650	            bash-1994  [000] ....  4342.324899: should_remove_suid <-do_truncate
2651	            bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
2652	            bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
2653	            bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
2654	            bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
2655	
2656	We can see that there's no more lock or preempt tracing.
2657	
2658	
2659	Dynamic ftrace with the function graph tracer
2660	---------------------------------------------
2661	
2662	Although what has been explained above concerns both the
2663	function tracer and the function-graph-tracer, there are some
2664	special features only available in the function-graph tracer.
2665	
2666	If you want to trace only one function and all of its children,
2667	you just have to echo its name into set_graph_function:
2668	
2669	 echo __do_fault > set_graph_function
2670	
2671	will produce the following "expanded" trace of the __do_fault()
2672	function:
2673	
2674	 0)               |  __do_fault() {
2675	 0)               |    filemap_fault() {
2676	 0)               |      find_lock_page() {
2677	 0)   0.804 us    |        find_get_page();
2678	 0)               |        __might_sleep() {
2679	 0)   1.329 us    |        }
2680	 0)   3.904 us    |      }
2681	 0)   4.979 us    |    }
2682	 0)   0.653 us    |    _spin_lock();
2683	 0)   0.578 us    |    page_add_file_rmap();
2684	 0)   0.525 us    |    native_set_pte_at();
2685	 0)   0.585 us    |    _spin_unlock();
2686	 0)               |    unlock_page() {
2687	 0)   0.541 us    |      page_waitqueue();
2688	 0)   0.639 us    |      __wake_up_bit();
2689	 0)   2.786 us    |    }
2690	 0) + 14.237 us   |  }
2691	 0)               |  __do_fault() {
2692	 0)               |    filemap_fault() {
2693	 0)               |      find_lock_page() {
2694	 0)   0.698 us    |        find_get_page();
2695	 0)               |        __might_sleep() {
2696	 0)   1.412 us    |        }
2697	 0)   3.950 us    |      }
2698	 0)   5.098 us    |    }
2699	 0)   0.631 us    |    _spin_lock();
2700	 0)   0.571 us    |    page_add_file_rmap();
2701	 0)   0.526 us    |    native_set_pte_at();
2702	 0)   0.586 us    |    _spin_unlock();
2703	 0)               |    unlock_page() {
2704	 0)   0.533 us    |      page_waitqueue();
2705	 0)   0.638 us    |      __wake_up_bit();
2706	 0)   2.793 us    |    }
2707	 0) + 14.012 us   |  }
2708	
2709	You can also expand several functions at once:
2710	
2711	 echo sys_open > set_graph_function
2712	 echo sys_close >> set_graph_function
2713	
2714	Now if you want to go back to trace all functions you can clear
2715	this special filter via:
2716	
2717	 echo > set_graph_function
2718	
2719	
2720	ftrace_enabled
2721	--------------
2722	
2723	Note, the proc sysctl ftrace_enable is a big on/off switch for the
2724	function tracer. By default it is enabled (when function tracing is
2725	enabled in the kernel). If it is disabled, all function tracing is
2726	disabled. This includes not only the function tracers for ftrace, but
2727	also for any other uses (perf, kprobes, stack tracing, profiling, etc).
2728	
2729	Please disable this with care.
2730	
2731	This can be disable (and enabled) with:
2732	
2733	  sysctl kernel.ftrace_enabled=0
2734	  sysctl kernel.ftrace_enabled=1
2735	
2736	 or
2737	
2738	  echo 0 > /proc/sys/kernel/ftrace_enabled
2739	  echo 1 > /proc/sys/kernel/ftrace_enabled
2740	
2741	
2742	Filter commands
2743	---------------
2744	
2745	A few commands are supported by the set_ftrace_filter interface.
2746	Trace commands have the following format:
2747	
2748	<function>:<command>:<parameter>
2749	
2750	The following commands are supported:
2751	
2752	- mod
2753	  This command enables function filtering per module. The
2754	  parameter defines the module. For example, if only the write*
2755	  functions in the ext3 module are desired, run:
2756	
2757	   echo 'write*:mod:ext3' > set_ftrace_filter
2758	
2759	  This command interacts with the filter in the same way as
2760	  filtering based on function names. Thus, adding more functions
2761	  in a different module is accomplished by appending (>>) to the
2762	  filter file. Remove specific module functions by prepending
2763	  '!':
2764	
2765	   echo '!writeback*:mod:ext3' >> set_ftrace_filter
2766	
2767	  Mod command supports module globbing. Disable tracing for all
2768	  functions except a specific module:
2769	
2770	   echo '!*:mod:!ext3' >> set_ftrace_filter
2771	
2772	  Disable tracing for all modules, but still trace kernel:
2773	
2774	   echo '!*:mod:*' >> set_ftrace_filter
2775	
2776	  Enable filter only for kernel:
2777	
2778	   echo '*write*:mod:!*' >> set_ftrace_filter
2779	
2780	  Enable filter for module globbing:
2781	
2782	   echo '*write*:mod:*snd*' >> set_ftrace_filter
2783	
2784	- traceon/traceoff
2785	  These commands turn tracing on and off when the specified
2786	  functions are hit. The parameter determines how many times the
2787	  tracing system is turned on and off. If unspecified, there is
2788	  no limit. For example, to disable tracing when a schedule bug
2789	  is hit the first 5 times, run:
2790	
2791	   echo '__schedule_bug:traceoff:5' > set_ftrace_filter
2792	
2793	  To always disable tracing when __schedule_bug is hit:
2794	
2795	   echo '__schedule_bug:traceoff' > set_ftrace_filter
2796	
2797	  These commands are cumulative whether or not they are appended
2798	  to set_ftrace_filter. To remove a command, prepend it by '!'
2799	  and drop the parameter:
2800	
2801	   echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
2802	
2803	    The above removes the traceoff command for __schedule_bug
2804	    that have a counter. To remove commands without counters:
2805	
2806	   echo '!__schedule_bug:traceoff' > set_ftrace_filter
2807	
2808	- snapshot
2809	  Will cause a snapshot to be triggered when the function is hit.
2810	
2811	   echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
2812	
2813	  To only snapshot once:
2814	
2815	   echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
2816	
2817	  To remove the above commands:
2818	
2819	   echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
2820	   echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
2821	
2822	- enable_event/disable_event
2823	  These commands can enable or disable a trace event. Note, because
2824	  function tracing callbacks are very sensitive, when these commands
2825	  are registered, the trace point is activated, but disabled in
2826	  a "soft" mode. That is, the tracepoint will be called, but
2827	  just will not be traced. The event tracepoint stays in this mode
2828	  as long as there's a command that triggers it.
2829	
2830	   echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
2831	   	 set_ftrace_filter
2832	
2833	  The format is:
2834	
2835	    <function>:enable_event:<system>:<event>[:count]
2836	    <function>:disable_event:<system>:<event>[:count]
2837	
2838	  To remove the events commands:
2839	
2840	
2841	   echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
2842	   	 set_ftrace_filter
2843	   echo '!schedule:disable_event:sched:sched_switch' > \
2844	   	 set_ftrace_filter
2845	
2846	- dump
2847	  When the function is hit, it will dump the contents of the ftrace
2848	  ring buffer to the console. This is useful if you need to debug
2849	  something, and want to dump the trace when a certain function
2850	  is hit. Perhaps its a function that is called before a tripple
2851	  fault happens and does not allow you to get a regular dump.
2852	
2853	- cpudump
2854	  When the function is hit, it will dump the contents of the ftrace
2855	  ring buffer for the current CPU to the console. Unlike the "dump"
2856	  command, it only prints out the contents of the ring buffer for the
2857	  CPU that executed the function that triggered the dump.
2858	
2859	trace_pipe
2860	----------
2861	
2862	The trace_pipe outputs the same content as the trace file, but
2863	the effect on the tracing is different. Every read from
2864	trace_pipe is consumed. This means that subsequent reads will be
2865	different. The trace is live.
2866	
2867	 # echo function > current_tracer
2868	 # cat trace_pipe > /tmp/trace.out &
2869	[1] 4153
2870	 # echo 1 > tracing_on
2871	 # usleep 1
2872	 # echo 0 > tracing_on
2873	 # cat trace
2874	# tracer: function
2875	#
2876	# entries-in-buffer/entries-written: 0/0   #P:4
2877	#
2878	#                              _-----=> irqs-off
2879	#                             / _----=> need-resched
2880	#                            | / _---=> hardirq/softirq
2881	#                            || / _--=> preempt-depth
2882	#                            ||| /     delay
2883	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2884	#              | |       |   ||||       |         |
2885	
2886	 #
2887	 # cat /tmp/trace.out
2888	            bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
2889	            bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
2890	            bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
2891	            bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
2892	            bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
2893	            bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
2894	            bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
2895	            bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
2896	            bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
2897	
2898	
2899	Note, reading the trace_pipe file will block until more input is
2900	added.
2901	
2902	trace entries
2903	-------------
2904	
2905	Having too much or not enough data can be troublesome in
2906	diagnosing an issue in the kernel. The file buffer_size_kb is
2907	used to modify the size of the internal trace buffers. The
2908	number listed is the number of entries that can be recorded per
2909	CPU. To know the full size, multiply the number of possible CPUs
2910	with the number of entries.
2911	
2912	 # cat buffer_size_kb
2913	1408 (units kilobytes)
2914	
2915	Or simply read buffer_total_size_kb
2916	
2917	 # cat buffer_total_size_kb 
2918	5632
2919	
2920	To modify the buffer, simple echo in a number (in 1024 byte segments).
2921	
2922	 # echo 10000 > buffer_size_kb
2923	 # cat buffer_size_kb
2924	10000 (units kilobytes)
2925	
2926	It will try to allocate as much as possible. If you allocate too
2927	much, it can cause Out-Of-Memory to trigger.
2928	
2929	 # echo 1000000000000 > buffer_size_kb
2930	-bash: echo: write error: Cannot allocate memory
2931	 # cat buffer_size_kb
2932	85
2933	
2934	The per_cpu buffers can be changed individually as well:
2935	
2936	 # echo 10000 > per_cpu/cpu0/buffer_size_kb
2937	 # echo 100 > per_cpu/cpu1/buffer_size_kb
2938	
2939	When the per_cpu buffers are not the same, the buffer_size_kb
2940	at the top level will just show an X
2941	
2942	 # cat buffer_size_kb
2943	X
2944	
2945	This is where the buffer_total_size_kb is useful:
2946	
2947	 # cat buffer_total_size_kb 
2948	12916
2949	
2950	Writing to the top level buffer_size_kb will reset all the buffers
2951	to be the same again.
2952	
2953	Snapshot
2954	--------
2955	CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
2956	available to all non latency tracers. (Latency tracers which
2957	record max latency, such as "irqsoff" or "wakeup", can't use
2958	this feature, since those are already using the snapshot
2959	mechanism internally.)
2960	
2961	Snapshot preserves a current trace buffer at a particular point
2962	in time without stopping tracing. Ftrace swaps the current
2963	buffer with a spare buffer, and tracing continues in the new
2964	current (=previous spare) buffer.
2965	
2966	The following tracefs files in "tracing" are related to this
2967	feature:
2968	
2969	  snapshot:
2970	
2971		This is used to take a snapshot and to read the output
2972		of the snapshot. Echo 1 into this file to allocate a
2973		spare buffer and to take a snapshot (swap), then read
2974		the snapshot from this file in the same format as
2975		"trace" (described above in the section "The File
2976		System"). Both reads snapshot and tracing are executable
2977		in parallel. When the spare buffer is allocated, echoing
2978		0 frees it, and echoing else (positive) values clear the
2979		snapshot contents.
2980		More details are shown in the table below.
2981	
2982		status\input  |     0      |     1      |    else    |
2983		--------------+------------+------------+------------+
2984		not allocated |(do nothing)| alloc+swap |(do nothing)|
2985		--------------+------------+------------+------------+
2986		allocated     |    free    |    swap    |   clear    |
2987		--------------+------------+------------+------------+
2988	
2989	Here is an example of using the snapshot feature.
2990	
2991	 # echo 1 > events/sched/enable
2992	 # echo 1 > snapshot
2993	 # cat snapshot
2994	# tracer: nop
2995	#
2996	# entries-in-buffer/entries-written: 71/71   #P:8
2997	#
2998	#                              _-----=> irqs-off
2999	#                             / _----=> need-resched
3000	#                            | / _---=> hardirq/softirq
3001	#                            || / _--=> preempt-depth
3002	#                            ||| /     delay
3003	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3004	#              | |       |   ||||       |         |
3005	          <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
3006	           sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
3007	[...]
3008	          <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
3009	
3010	 # cat trace
3011	# tracer: nop
3012	#
3013	# entries-in-buffer/entries-written: 77/77   #P:8
3014	#
3015	#                              _-----=> irqs-off
3016	#                             / _----=> need-resched
3017	#                            | / _---=> hardirq/softirq
3018	#                            || / _--=> preempt-depth
3019	#                            ||| /     delay
3020	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3021	#              | |       |   ||||       |         |
3022	          <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
3023	 snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
3024	[...]
3025	
3026	
3027	If you try to use this snapshot feature when current tracer is
3028	one of the latency tracers, you will get the following results.
3029	
3030	 # echo wakeup > current_tracer
3031	 # echo 1 > snapshot
3032	bash: echo: write error: Device or resource busy
3033	 # cat snapshot
3034	cat: snapshot: Device or resource busy
3035	
3036	
3037	Instances
3038	---------
3039	In the tracefs tracing directory is a directory called "instances".
3040	This directory can have new directories created inside of it using
3041	mkdir, and removing directories with rmdir. The directory created
3042	with mkdir in this directory will already contain files and other
3043	directories after it is created.
3044	
3045	 # mkdir instances/foo
3046	 # ls instances/foo
3047	buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
3048	set_event  snapshot  trace  trace_clock  trace_marker  trace_options
3049	trace_pipe  tracing_on
3050	
3051	As you can see, the new directory looks similar to the tracing directory
3052	itself. In fact, it is very similar, except that the buffer and
3053	events are agnostic from the main director, or from any other
3054	instances that are created.
3055	
3056	The files in the new directory work just like the files with the
3057	same name in the tracing directory except the buffer that is used
3058	is a separate and new buffer. The files affect that buffer but do not
3059	affect the main buffer with the exception of trace_options. Currently,
3060	the trace_options affect all instances and the top level buffer
3061	the same, but this may change in future releases. That is, options
3062	may become specific to the instance they reside in.
3063	
3064	Notice that none of the function tracer files are there, nor is
3065	current_tracer and available_tracers. This is because the buffers
3066	can currently only have events enabled for them.
3067	
3068	 # mkdir instances/foo
3069	 # mkdir instances/bar
3070	 # mkdir instances/zoot
3071	 # echo 100000 > buffer_size_kb
3072	 # echo 1000 > instances/foo/buffer_size_kb
3073	 # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
3074	 # echo function > current_trace
3075	 # echo 1 > instances/foo/events/sched/sched_wakeup/enable
3076	 # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
3077	 # echo 1 > instances/foo/events/sched/sched_switch/enable
3078	 # echo 1 > instances/bar/events/irq/enable
3079	 # echo 1 > instances/zoot/events/syscalls/enable
3080	 # cat trace_pipe
3081	CPU:2 [LOST 11745 EVENTS]
3082	            bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
3083	            bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
3084	            bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
3085	            bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
3086	            bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
3087	            bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
3088	            bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
3089	            bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
3090	            bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3091	            bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3092	            bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
3093	[...]
3094	
3095	 # cat instances/foo/trace_pipe
3096	            bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3097	            bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3098	          <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
3099	          <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
3100	     rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
3101	            bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3102	            bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3103	            bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
3104	     kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
3105	     kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
3106	[...]
3107	
3108	 # cat instances/bar/trace_pipe
3109	     migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
3110	          <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
3111	            bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
3112	            bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
3113	            bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
3114	            bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
3115	            bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
3116	            bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
3117	            sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
3118	            sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
3119	            sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
3120	            sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
3121	[...]
3122	
3123	 # cat instances/zoot/trace
3124	# tracer: nop
3125	#
3126	# entries-in-buffer/entries-written: 18996/18996   #P:4
3127	#
3128	#                              _-----=> irqs-off
3129	#                             / _----=> need-resched
3130	#                            | / _---=> hardirq/softirq
3131	#                            || / _--=> preempt-depth
3132	#                            ||| /     delay
3133	#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
3134	#              | |       |   ||||       |         |
3135	            bash-1998  [000] d...   140.733501: sys_write -> 0x2
3136	            bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
3137	            bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
3138	            bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
3139	            bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
3140	            bash-1998  [000] d...   140.733510: sys_close(fd: a)
3141	            bash-1998  [000] d...   140.733510: sys_close -> 0x0
3142	            bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
3143	            bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
3144	            bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
3145	            bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
3146	
3147	You can see that the trace of the top most trace buffer shows only
3148	the function tracing. The foo instance displays wakeups and task
3149	switches.
3150	
3151	To remove the instances, simply delete their directories:
3152	
3153	 # rmdir instances/foo
3154	 # rmdir instances/bar
3155	 # rmdir instances/zoot
3156	
3157	Note, if a process has a trace file open in one of the instance
3158	directories, the rmdir will fail with EBUSY.
3159	
3160	
3161	Stack trace
3162	-----------
3163	Since the kernel has a fixed sized stack, it is important not to
3164	waste it in functions. A kernel developer must be conscience of
3165	what they allocate on the stack. If they add too much, the system
3166	can be in danger of a stack overflow, and corruption will occur,
3167	usually leading to a system panic.
3168	
3169	There are some tools that check this, usually with interrupts
3170	periodically checking usage. But if you can perform a check
3171	at every function call that will become very useful. As ftrace provides
3172	a function tracer, it makes it convenient to check the stack size
3173	at every function call. This is enabled via the stack tracer.
3174	
3175	CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
3176	To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
3177	
3178	 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
3179	
3180	You can also enable it from the kernel command line to trace
3181	the stack size of the kernel during boot up, by adding "stacktrace"
3182	to the kernel command line parameter.
3183	
3184	After running it for a few minutes, the output looks like:
3185	
3186	 # cat stack_max_size
3187	2928
3188	
3189	 # cat stack_trace
3190	        Depth    Size   Location    (18 entries)
3191	        -----    ----   --------
3192	  0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
3193	  1)     2704     160   find_busiest_group+0x31/0x1f1
3194	  2)     2544     256   load_balance+0xd9/0x662
3195	  3)     2288      80   idle_balance+0xbb/0x130
3196	  4)     2208     128   __schedule+0x26e/0x5b9
3197	  5)     2080      16   schedule+0x64/0x66
3198	  6)     2064     128   schedule_timeout+0x34/0xe0
3199	  7)     1936     112   wait_for_common+0x97/0xf1
3200	  8)     1824      16   wait_for_completion+0x1d/0x1f
3201	  9)     1808     128   flush_work+0xfe/0x119
3202	 10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
3203	 11)     1664      48   input_available_p+0x1d/0x5c
3204	 12)     1616      48   n_tty_poll+0x6d/0x134
3205	 13)     1568      64   tty_poll+0x64/0x7f
3206	 14)     1504     880   do_select+0x31e/0x511
3207	 15)      624     400   core_sys_select+0x177/0x216
3208	 16)      224      96   sys_select+0x91/0xb9
3209	 17)      128     128   system_call_fastpath+0x16/0x1b
3210	
3211	Note, if -mfentry is being used by gcc, functions get traced before
3212	they set up the stack frame. This means that leaf level functions
3213	are not tested by the stack tracer when -mfentry is used.
3214	
3215	Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
3216	
3217	---------
3218	
3219	More details can be found in the source code, in the
3220	kernel/trace/*.c files.
Hide Line Numbers
About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog
Information is copyright its respective author.
All material is available from the Linux Kernel Source distributed under a GPL License.
Hosted by mjmwired.net.