Documentation / cpu-freq / intel-pstate.txt


Based on kernel version 4.10.8. Page generated on 2017-04-01 14:42 EST.

Intel P-State driver
--------------------

This driver provides an interface to control the P-State selection for the
SandyBridge+ Intel processors.

The following document explains P-States:
http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
As stated in the document, P-State doesn’t exactly mean a frequency. However, for
the sake of the relationship with cpufreq, P-State and frequency are used
interchangeably.

Understanding the cpufreq core governors and policies is important before
discussing more details about the Intel P-State driver. Based on what callbacks
a cpufreq driver provides to the cpufreq core, the core supports two types of
drivers:
- with target_index() callback: In this mode, the drivers using cpufreq core
simply provide the minimum and maximum frequency limits and an additional
interface target_index() to set the current frequency. The cpufreq subsystem
has a number of scaling governors ("performance", "powersave", "ondemand",
etc.). Depending on which governor is in use, cpufreq core will call for
transitions to a specific frequency using the target_index() callback.
- with setpolicy() callback: In this mode, drivers do not provide the
target_index() callback, so cpufreq core can't request a transition to a
specific frequency. The driver provides minimum and maximum frequency limits
and callbacks to set a policy. The policy in cpufreq sysfs is referred to as
the "scaling governor". The cpufreq core can request the driver to operate in
either of two policies: "performance" and "powersave". The driver decides which
frequency to use based on the selected policy, within the minimum and maximum
frequency limits.

The Intel P-State driver falls under the latter category, which implements the
setpolicy() callback. This driver decides what P-State to use based on the
requested policy from the cpufreq core. If the processor is capable of
selecting its next P-State internally, then the driver will offload this
responsibility to the processor (aka HWP: Hardware P-States). If not, the
driver implements algorithms to select the next P-State.
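To check which of the two driver types a system is using, the cpufreq sysfs
attributes scaling_driver and scaling_available_governors can be read. The
following is a minimal sketch, not part of the driver: the function name
show_cpufreq_driver is hypothetical, and the default path assumes cpu0 is
present and has a cpufreq directory.

```shell
# Print the scaling driver and the policies ("governors") offered for one CPU.
# $1: cpufreq directory of a CPU (default: cpu0's directory, an assumption).
show_cpufreq_driver() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpufreq}"
    if [ ! -r "$dir/scaling_driver" ]; then
        echo "no cpufreq driver information in $dir" >&2
        return 1
    fi
    echo "driver:    $(cat "$dir/scaling_driver")"
    echo "governors: $(cat "$dir/scaling_available_governors")"
}
```

Calling show_cpufreq_driver with no argument inspects cpu0. With this driver
active, only the "performance" and "powersave" policies are expected in the
governors line.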

Since these policies are implemented in the driver, they are not the same as
the cpufreq scaling governors implementation, even if they have the same name
in the cpufreq sysfs (scaling_governor). For example, the "performance" policy
is similar to cpufreq’s "performance" governor, but "powersave" is completely
different from the cpufreq "powersave" governor. The strategy of the driver's
"powersave" policy is similar to cpufreq "ondemand", where the requested
P-State is related to the system load.

Sysfs Interface

In addition to the frequency-controlling interfaces provided by the cpufreq
core, the driver provides its own sysfs files to control the P-State selection.
These files have been added to /sys/devices/system/cpu/intel_pstate/.
Any changes made to these files are applicable to all CPUs (even in a
multi-package system; refer to the later section "Per-CPU limits").

      max_perf_pct: Limits the maximum P-State that will be requested by
      the driver. It is stated as a percentage of the available performance.
      The available (P-State) performance may be reduced by the no_turbo
      setting described below.

      min_perf_pct: Limits the minimum P-State that will be requested by
      the driver. It is stated as a percentage of the max (non-turbo)
      performance level.

      no_turbo: Limits the driver to selecting P-States below the turbo
      frequency range.

      turbo_pct: Displays the percentage of the total performance that
      is supported by hardware that is in the turbo range. This number
      is independent of whether turbo has been disabled or not.

      num_pstates: Displays the number of P-States that are supported
      by hardware. This number is independent of whether turbo has
      been disabled or not.
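The attributes above can be inspected in one pass. This is a small sketch
under the assumption that the files live in the directory named in this
section; the function name dump_intel_pstate is hypothetical. Attributes that
cannot be read are reported instead of causing an error.

```shell
# Print every intel_pstate attribute described above as "name: value".
# $1: directory holding the attributes (default: the path named in the text).
dump_intel_pstate() {
    dir="${1:-/sys/devices/system/cpu/intel_pstate}"
    for attr in max_perf_pct min_perf_pct no_turbo turbo_pct num_pstates; do
        if [ -r "$dir/$attr" ]; then
            echo "$attr: $(cat "$dir/$attr")"
        else
            echo "$attr: (not available)"
        fi
    done
}
```

Changing a limit is a plain write to the same file, for example
"echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct" as root.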

For example, if a system has these parameters:
	Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State)
	Max non turbo ratio: 0x17
	Minimum ratio: 0x08 (Here the ratio is called max efficiency ratio)

Sysfs will show:
	max_perf_pct:100, which corresponds to 1 core ratio
	min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio
	no_turbo:0, turbo is not disabled
	num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1)
	turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates
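The numbers in this example can be reproduced with integer arithmetic. The
sketch below is illustrative only and its variable names are hypothetical.
Note that (0x21 - 0x17) / 26 is about 38.5%; the sketch assumes the displayed
39 comes from subtracting the truncated non-turbo share from 100, which is one
way to arrive at that value. The exact rounding used by the driver is not
stated in this document.

```shell
# Ratios from the example above.
max_turbo=$((0x21))        # max 1 core turbo ratio (33)
max_non_turbo=$((0x17))    # max non turbo ratio (23)
min_ratio=$((0x08))        # max efficiency ratio (8)

num_pstates=$((max_turbo - min_ratio + 1))
min_perf_pct=$((min_ratio * 100 / max_turbo))

# Assumption: turbo share = 100 - truncated share of the non-turbo P-States.
non_turbo_states=$((max_non_turbo - min_ratio + 1))
turbo_pct=$((100 - non_turbo_states * 100 / num_pstates))

echo "num_pstates:$num_pstates min_perf_pct:$min_perf_pct turbo_pct:$turbo_pct"
```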

Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3: System Programming Guide" to understand ratios.

cpufreq sysfs for Intel P-State

Since this driver registers with cpufreq, cpufreq sysfs is also presented.
There are some important differences, which need to be considered.

scaling_cur_freq: This displays the real frequency which was used during
the last sample period instead of what is requested. Some other cpufreq
drivers, like acpi-cpufreq, display what is requested (some changes are on the
way to fix this for the acpi-cpufreq driver). The same is true for frequencies
displayed in /proc/cpuinfo.

scaling_governor: This displays the currently active policy. Since each CPU has
a cpufreq sysfs, it is possible to set a scaling governor for each CPU. But
this is not possible with Intel P-States, as there is one common policy for all
CPUs. Here, the last requested policy will be applicable to all CPUs. It is
suggested that one use the cpupower utility to change the policy for all CPUs
at the same time.
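Where cpupower is not at hand, the same effect can be approximated by writing
the policy to every CPU's scaling_governor file. This is a sketch only: the
function name set_policy_all is hypothetical, and the check for the two policy
names reflects the two policies described in this document.

```shell
# Write one policy to the scaling_governor file of every CPU.
# $1: policy name; $2: CPU sysfs root (default assumed below).
set_policy_all() {
    policy="$1"
    root="${2:-/sys/devices/system/cpu}"
    case "$policy" in
        performance|powersave) ;;
        *) echo "unsupported policy: $policy" >&2; return 1 ;;
    esac
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue    # skip CPUs without a writable cpufreq entry
        echo "$policy" > "$f"
    done
}
```

Run as root, "set_policy_all powersave" corresponds to what
"cpupower frequency-set -g powersave" does for all CPUs.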

scaling_setspeed: This attribute can never be used with Intel P-State.

scaling_max_freq/scaling_min_freq: This interface can be used similarly to
the max_perf_pct/min_perf_pct of Intel P-State sysfs. However, since
frequencies are converted to the nearest possible P-State, this is prone to
rounding errors. This method is not preferred for limiting performance.

affected_cpus: Not used
related_cpus: Not used

For contemporary Intel processors, the frequency is controlled by the
processor itself and the P-State exposed to software is related to
performance levels.  The idea that the frequency can be set to a single
value is fictional for Intel Core processors. Even if the scaling
driver selects a single P-State, the actual frequency the processor
will run at is selected by the processor itself.

Per-CPU limits

The kernel command line option "intel_pstate=per_cpu_perf_limits" forces
the intel_pstate driver to use per-CPU performance limits.  When it is set,
the sysfs control interface described above is subject to limitations.
- The following controls are not available for either read or write:
	/sys/devices/system/cpu/intel_pstate/max_perf_pct
	/sys/devices/system/cpu/intel_pstate/min_perf_pct
- The following controls can be used to set performance limits, as far as the
architecture of the processor permits:
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- Users can still observe the turbo percentage and the number of P-States from:
	/sys/devices/system/cpu/intel_pstate/turbo_pct
	/sys/devices/system/cpu/intel_pstate/num_pstates
- Users can read and write the system-wide turbo status via:
	/sys/devices/system/cpu/intel_pstate/no_turbo

Support of energy performance hints

It is possible to provide hints to the HWP algorithms in the processor,
ranging from more performance centric to more energy centric. When the driver
is using HWP, two additional cpufreq sysfs attributes are presented for
each logical CPU.
These attributes are:
	- energy_performance_available_preferences
	- energy_performance_preference

To get the list of supported hints:
$ cat energy_performance_available_preferences
    default performance balance_performance balance_power power

The current preference can be read or changed via the cpufreq sysfs
attribute "energy_performance_preference". Reading from this attribute
will display the current effective setting. Users can write any of the valid
preference strings to this attribute. Users can always restore the power-on
default by writing "default".

Since threads can migrate to different CPUs, it is possible that the
new CPU has a different energy performance preference than the previous
one. To avoid such issues, either pin threads to specific CPUs or set the
same energy performance preference value on all CPUs.
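Setting the same preference on all CPUs can be scripted. The sketch below
assumes the two attributes sit in each CPU's cpufreq directory, as described
above; the function name set_epp_all is hypothetical. A hint is written only
if the CPU lists it in energy_performance_available_preferences.

```shell
# Write the same energy performance hint to every CPU that exposes it.
# $1: hint (e.g. balance_power); $2: CPU sysfs root (default assumed below).
set_epp_all() {
    pref="$1"
    root="${2:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        avail="$dir/energy_performance_available_preferences"
        [ -r "$avail" ] || continue    # attribute absent, e.g. HWP not in use
        # Accept the hint only if it appears as a whole word in the list.
        case " $(cat "$avail") " in
            *" $pref "*)
                echo "$pref" > "$dir/energy_performance_preference" ;;
            *)
                echo "$dir: '$pref' is not a supported hint" >&2
                return 1 ;;
        esac
    done
}
```

For example, "set_epp_all default" restores the power-on default everywhere.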

Tuning Intel P-State driver

When the performance can be tuned using a PID (Proportional Integral
Derivative) controller, debugfs files are provided for adjusting performance.
They are presented under:
/sys/kernel/debug/pstate_snb/

The PID tunable parameters are:
      deadband
      d_gain_pct
      i_gain_pct
      p_gain_pct
      sample_rate_ms
      setpoint

To adjust these parameters, some understanding of the driver implementation is
necessary. There are some tweaks described here, but be very careful. Adjusting
them requires an expert-level understanding of the relationship between power
and performance. These parameters are only useful when the "powersave" policy
is active.

- To make the system more responsive to load changes, sample_rate_ms can
be adjusted (current default is 10ms).
- To make the system use higher performance, even if the load is lower, setpoint
can be adjusted to a lower number. This will also lead to faster ramp up time
to reach the maximum P-State.

If there are no derivative and integral coefficients, the next P-State will be
equal to:
	current P-State - ((setpoint - current cpu load) * p_gain_pct)

For example, if the current PID parameters are (which are the defaults for Core
processors such as SandyBridge):
      deadband = 0
      d_gain_pct = 0
      i_gain_pct = 0
      p_gain_pct = 20
      sample_rate_ms = 10
      setpoint = 97

If the current P-State = 0x08 and current load = 100, this will result in the
next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State
goes up by only 1. If during the next sample interval the current load doesn't
change and is still 100, then the P-State goes up by one again. This process
will continue as long as the load is more than the setpoint until the maximum
P-State is reached.

For the same load at setpoint = 60, this will result in the next P-State
= 0x08 - ((60 - 100) * 0.2) = 16
So by changing the setpoint from 97 to 60, there is an increase of the
next P-State from 9 to 16. This will make the processor execute at a higher
P-State for the same CPU load. If the load continues to be more than the
setpoint during the next sample intervals, then the P-State will go up again
until the maximum P-State is reached. But the ramp up time to reach the maximum
P-State will be much faster when the setpoint is 60 compared to 97.
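The proportional-only formula above can be tried out with a few lines of
shell. This is a sketch of the arithmetic in this section, not of the driver's
actual implementation: the function names next_pstate and ramp_steps are
hypothetical, the result is rounded to the nearest integer, and clamping to a
minimum and maximum P-State is an assumption added so that the ramp
terminates.

```shell
# next_pstate CUR LOAD SETPOINT P_GAIN_PCT MIN MAX
# Proportional-only step: CUR - ((SETPOINT - LOAD) * P_GAIN_PCT / 100).
next_pstate() {
    cur=$1; load=$2; setpoint=$3; gain=$4; min=$5; max=$6
    # Work in hundredths of a P-State to avoid floating point.
    scaled=$((cur * 100 - (setpoint - load) * gain))
    next=$(( (scaled + 50) / 100 ))    # round to nearest; low values are clamped
    [ "$next" -lt "$min" ] && next=$min
    [ "$next" -gt "$max" ] && next=$max
    echo "$next"
}

# ramp_steps START LOAD SETPOINT P_GAIN_PCT MIN MAX
# Count sample intervals until MAX is reached under a constant load.
ramp_steps() {
    p=$1; steps=0
    while [ "$p" -lt "$6" ]; do
        new=$(next_pstate "$p" "$2" "$3" "$4" "$5" "$6")
        [ "$new" -le "$p" ] && break   # load not above setpoint: no ramp up
        p=$new
        steps=$((steps + 1))
    done
    echo "$steps"
}
```

With the example values, "next_pstate 8 100 97 20 8 33" prints 9 and
"next_pstate 8 100 60 20 8 33" prints 16, matching the text. Under the same
assumptions, ramp_steps reports 25 sample intervals from P-State 8 to 33 with
setpoint 97 and 4 intervals with setpoint 60.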

Debugging Intel P-State driver

Event tracing

To debug P-State transitions, the Linux event tracing interface can be used.
There are two specific events, which can be enabled (provided the kernel
configs related to event tracing are enabled).

# cd /sys/kernel/debug/tracing/
# echo 1 > events/power/pstate_sample/enable
# echo 1 > events/power/cpu_frequency/enable
# cat trace
gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107
	scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618
		freq=2474476
cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2
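Fields of a pstate_sample line can be pulled out with sed for further
processing. In the sample above, core_busy=107 equals aperf * 100 / mperf
truncated to an integer; the sketch below assumes that relationship, which
this document does not state explicitly. The function names trace_field and
busy_pct are hypothetical.

```shell
# trace_field NAME LINE: print the value of "NAME=<digits>" found in LINE.
trace_field() {
    printf '%s\n' "$2" | sed -n "s/.*[[:space:]]$1=\([0-9][0-9]*\).*/\1/p"
}

# busy_pct LINE: aperf as a percentage of mperf, truncated to an integer.
busy_pct() {
    aperf=$(trace_field aperf "$1")
    mperf=$(trace_field mperf "$1")
    [ -n "$aperf" ] && [ -n "$mperf" ] && [ "$mperf" -gt 0 ] || return 1
    echo $((aperf * 100 / mperf))
}
```

Passing the pstate_sample line shown above to busy_pct prints 107.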

Using ftrace

If function level tracing is required, the Linux ftrace interface can be used.
For example, if we want to check how often a function to set a P-State is
called, we can set the ftrace filter to intel_pstate_set_pstate.

# cd /sys/kernel/debug/tracing/
# cat available_filter_functions | grep -i pstate
intel_pstate_set_pstate
intel_pstate_cpu_init
...

# echo intel_pstate_set_pstate > set_ftrace_filter
# echo function > current_tracer
# cat trace | head -15
# tracer: function
#
# entries-in-buffer/entries-written: 80/80   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
 gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
     gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
          <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
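To answer "how often is the function called", the trace output can be
summarized per CPU. The following is a sketch that assumes the function
tracer line format shown above; the function name count_calls_per_cpu is
hypothetical.

```shell
# Read function tracer output on stdin and print "<cpu> <count>" for every
# CPU on which the function named in $1 was traced.
count_calls_per_cpu() {
    grep -F ": $1 " |
        sed -n 's/.*\[\([0-9][0-9]*\)\].*/\1/p' |
        sort | uniq -c |
        awk '{ print $2, $1 }'
}
```

For example: "count_calls_per_cpu intel_pstate_set_pstate < trace" when run
from /sys/kernel/debug/tracing/. Writing "nop" to current_tracer afterwards
switches the function tracer off again.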