Documentation / power / pci.txt




Based on kernel version 3.13. Page generated on 2014-01-20 22:04 EST.

1	PCI Power Management
2	
3	Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
4	
5	An overview of concepts and the Linux kernel's interfaces related to PCI power
6	management.  Based on previous work by Patrick Mochel <mochel@transmeta.com>
7	(and others).
8	
9	This document only covers the aspects of power management specific to PCI
devices.  For a general description of the kernel's interfaces related to device
11	power management refer to Documentation/power/devices.txt and
12	Documentation/power/runtime_pm.txt.
13	
14	---------------------------------------------------------------------------
15	
16	1. Hardware and Platform Support for PCI Power Management
17	2. PCI Subsystem and Device Power Management
18	3. PCI Device Drivers and Power Management
19	4. Resources
20	
21	
22	1. Hardware and Platform Support for PCI Power Management
23	=========================================================
24	
25	1.1. Native and Platform-Based Power Management
26	-----------------------------------------------
27	In general, power management is a feature allowing one to save energy by putting
28	devices into states in which they draw less power (low-power states) at the
29	price of reduced functionality or performance.
30	
31	Usually, a device is put into a low-power state when it is underutilized or
32	completely inactive.  However, when it is necessary to use the device once
33	again, it has to be put back into the "fully functional" state (full-power
34	state).  This may happen when there are some data for the device to handle or
35	as a result of an external event requiring the device to be active, which may
36	be signaled by the device itself.
37	
PCI devices may be put into low-power states in two ways: by using the device
39	capabilities introduced by the PCI Bus Power Management Interface Specification,
40	or with the help of platform firmware, such as an ACPI BIOS.  In the first
approach, referred to as native PCI power management (native PCI PM)
42	in what follows, the device power state is changed as a result of writing a
43	specific value into one of its standard configuration registers.  The second
44	approach requires the platform firmware to provide special methods that may be
45	used by the kernel to change the device's power state.
46	
47	Devices supporting the native PCI PM usually can generate wakeup signals called
48	Power Management Events (PMEs) to let the kernel know about external events
49	requiring the device to be active.  After receiving a PME the kernel is supposed
50	to put the device that sent it into the full-power state.  However, the PCI Bus
51	Power Management Interface Specification doesn't define any standard method of
52	delivering the PME from the device to the CPU and the operating system kernel.
53	It is assumed that the platform firmware will perform this task and therefore,
54	even though a PCI device is set up to generate PMEs, it also may be necessary to
55	prepare the platform firmware for notifying the CPU of the PMEs coming from the
56	device (e.g. by generating interrupts).
57	
58	In turn, if the methods provided by the platform firmware are used for changing
59	the power state of a device, usually the platform also provides a method for
60	preparing the device to generate wakeup signals.  In that case, however, it
61	often also is necessary to prepare the device for generating PMEs using the
62	native PCI PM mechanism, because the method provided by the platform depends on
63	that.
64	
65	Thus in many situations both the native and the platform-based power management
66	mechanisms have to be used simultaneously to obtain the desired result.
67	
68	1.2. Native PCI Power Management
69	--------------------------------
70	The PCI Bus Power Management Interface Specification (PCI PM Spec) was
71	introduced between the PCI 2.1 and PCI 2.2 Specifications.  It defined a
72	standard interface for performing various operations related to power
73	management.
74	
75	The implementation of the PCI PM Spec is optional for conventional PCI devices,
76	but it is mandatory for PCI Express devices.  If a device supports the PCI PM
77	Spec, it has an 8 byte power management capability field in its PCI
78	configuration space.  This field is used to describe and control the standard
79	features related to the native PCI power management.
80	
81	The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
82	(B0-B3).  The higher the number, the less power is drawn by the device or bus
83	in that state.  However, the higher the number, the longer the latency for
84	the device or bus to return to the full-power state (D0 or B0, respectively).
85	
86	There are two variants of the D3 state defined by the specification.  The first
87	one is D3hot, referred to as the software accessible D3, because devices can be
88	programmed to go into it.  The second one, D3cold, is the state that PCI devices
89	are in when the supply voltage (Vcc) is removed from them.  It is not possible
90	to program a PCI device to go into D3cold, although there may be a programmable
91	interface for putting the bus the device is on into a state in which Vcc is
92	removed from all devices on the bus.
93	
94	PCI bus power management, however, is not supported by the Linux kernel at the
95	time of this writing and therefore it is not covered by this document.
96	
97	Note that every PCI device can be in the full-power state (D0) or in D3cold,
98	regardless of whether or not it implements the PCI PM Spec.  In addition to
99	that, if the PCI PM Spec is implemented by the device, it must support D3hot
100	as well as D0.  The support for the D1 and D2 power states is optional.
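
As a rough illustration of these rules, the following user-space sketch
computes the value that would be written to the device's power management
control/status register (PMCSR) for a requested state.  It refuses D1 and D2
unless the capabilities register (PMC) advertises them.  The macro values
mirror PCI_PM_CAP_D1, PCI_PM_CAP_D2 and PCI_PM_CTRL_STATE_MASK from the
kernel's pci_regs.h; the function pmcsr_for_state() is made up for this
example and is not a kernel interface.

```c
#include <stdint.h>

#define PM_CAP_D1          0x0200u  /* PMC: D1 supported */
#define PM_CAP_D2          0x0400u  /* PMC: D2 supported */
#define PM_CTRL_STATE_MASK 0x0003u  /* PMCSR: current state (D0 to D3hot) */

/*
 * Hypothetical helper: return the new PMCSR value for 'state'
 * (0 = D0, 1 = D1, 2 = D2, 3 = D3hot), or -1 if the device does not
 * support that state.  D0 and D3hot are always accepted.
 */
static int32_t pmcsr_for_state(uint16_t pmc, uint16_t pmcsr, unsigned int state)
{
	if (state > 3)
		return -1;	/* D3cold cannot be programmed */
	if (state == 1 && !(pmc & PM_CAP_D1))
		return -1;
	if (state == 2 && !(pmc & PM_CAP_D2))
		return -1;

	/* Replace only the state field, keep all other PMCSR bits. */
	return (int32_t)((pmcsr & ~PM_CTRL_STATE_MASK) | state);
}
```

For a device whose PMC advertises D1 but not D2, a request for D2 is rejected
while requests for D0, D1 and D3hot yield a PMCSR value with only the two
low-order bits changed.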
101	
102	PCI devices supporting the PCI PM Spec can be programmed to go to any of the
103	supported low-power states (except for D3cold).  While in D1-D3hot the
104	standard configuration registers of the device must be accessible to software
105	(i.e. the device is required to respond to PCI configuration accesses), although
106	its I/O and memory spaces are then disabled.  This allows the device to be
107	programmatically put into D0.  Thus the kernel can switch the device back and
108	forth between D0 and the supported low-power states (except for D3cold) and the
109	possible power state transitions the device can undergo are the following:
110	
111	+----------------------------+
112	| Current State | New State  |
113	+----------------------------+
114	| D0            | D1, D2, D3 |
115	+----------------------------+
116	| D1            | D2, D3     |
117	+----------------------------+
118	| D2            | D3         |
119	+----------------------------+
120	| D1, D2, D3    | D0         |
121	+----------------------------+
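
The table can be condensed into a small predicate.  This is only a sketch of
the rule the table expresses; the enum and the function name are hypothetical,
and D3 stands for D3hot here, since D3cold cannot be entered by programming
the device.

```c
#include <stdbool.h>

/* Hypothetical state numbering for this sketch (D3 means D3hot). */
enum pm_state { PM_D0, PM_D1, PM_D2, PM_D3HOT };

/*
 * Return true if software may move the device from 'from' to 'to':
 * either to any deeper low-power state, or from any low-power state
 * directly back to D0.
 */
static bool pm_transition_allowed(enum pm_state from, enum pm_state to)
{
	if (from == to)
		return false;	/* not a transition */
	if (to == PM_D0)
		return true;	/* D1, D2, D3 -> D0 */
	return to > from;	/* otherwise only deeper states */
}
```

In particular, going from D2 to D1 is not a valid direct transition; the
device has to be put into D0 first.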
122	
123	The transition from D3cold to D0 occurs when the supply voltage is provided to
124	the device (i.e. power is restored).  In that case the device returns to D0 with
125	a full power-on reset sequence and the power-on defaults are restored to the
126	device by hardware just as at initial power up.
127	
128	PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
129	while in a low-power state (D1-D3), but they are not required to be capable
130	of generating PMEs from all supported low-power states.  In particular, the
131	capability of generating PMEs from D3cold is optional and depends on the
132	presence of additional voltage (3.3Vaux) allowing the device to remain
133	sufficiently active to generate a wakeup signal.
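
The states from which a device can generate PMEs are advertised in the upper
five bits of its PMC register, one bit per state from D0 up to D3cold.  The
sketch below extracts that bitmask and tests it.  The macro values mirror
PCI_PM_CAP_PME_MASK and PCI_PM_CAP_PME_SHIFT from the kernel's pci_regs.h; the
enum and the two helper functions are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PM_CAP_PME_MASK  0xF800u  /* PMC: PME support bits */
#define PM_CAP_PME_SHIFT 11

/* Hypothetical state indices matching the bit order of the mask. */
enum { ST_D0, ST_D1, ST_D2, ST_D3HOT, ST_D3COLD };

/* Extract the 5-bit "PME can be generated from state N" mask. */
static unsigned int pme_support_mask(uint16_t pmc)
{
	return (pmc & PM_CAP_PME_MASK) >> PM_CAP_PME_SHIFT;
}

/* Can the device signal a PME while in 'state'? */
static bool pme_capable(unsigned int pme_support, int state)
{
	return (pme_support & (1u << state)) != 0;
}
```

A PMC value of 0xC203, for instance, describes a device that can signal PMEs
from D3hot and D3cold only.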
134	
135	1.3. ACPI Device Power Management
136	---------------------------------
137	The platform firmware support for the power management of PCI devices is
138	system-specific.  However, if the system in question is compliant with the
139	Advanced Configuration and Power Interface (ACPI) Specification, like the
140	majority of x86-based systems, it is supposed to implement device power
141	management interfaces defined by the ACPI standard.
142	
143	For this purpose the ACPI BIOS provides special functions called "control
144	methods" that may be executed by the kernel to perform specific tasks, such as
145	putting a device into a low-power state.  These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
147	stored in the machine's BIOS.  The kernel loads them from the BIOS and executes
148	them as needed using an AML interpreter that translates the AML byte code into
149	computations and memory or I/O space accesses.  This way, in theory, a BIOS
150	writer can provide the kernel with a means to perform actions depending
151	on the system design in a system-specific fashion.
152	
153	ACPI control methods may be divided into global control methods, that are not
154	associated with any particular devices, and device control methods, that have
155	to be defined separately for each device supposed to be handled with the help of
156	the platform.  This means, in particular, that ACPI device control methods can
157	only be used to handle devices that the BIOS writer knew about in advance.  The
158	ACPI methods used for device power management fall into that category.
159	
160	The ACPI specification assumes that devices can be in one of four power states
161	labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
162	D0-D3 states (although the difference between D3hot and D3cold is not taken
163	into account by ACPI).  Moreover, for each power state of a device there is a
164	set of power resources that have to be enabled for the device to be put into
165	that state.  These power resources are controlled (i.e. enabled or disabled)
166	with the help of their own control methods, _ON and _OFF, that have to be
167	defined individually for each of them.
168	
169	To put a device into the ACPI power state Dx (where x is a number between 0 and
170	3 inclusive) the kernel is supposed to (1) enable the power resources required
171	by the device in this state using their _ON control methods and (2) execute the
172	_PSx control method defined for the device.  In addition to that, if the device
173	is going to be put into a low-power state (D1-D3) and is supposed to generate
174	wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
175	3.0) control method defined for it has to be executed before _PSx.  Power
176	resources that are not required by the device in the target power state and are
177	not required any more by any other device should be disabled (by executing their
178	_OFF control methods).  If the current power state of the device is D3, it can
179	only be put into D0 this way.
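
The order of operations described above can be modeled as follows.  This is a
toy model that only records the sequence of steps; it does not evaluate any
ACPI methods, and all names in it are hypothetical.

```c
#include <stddef.h>

/* Hypothetical step labels for the sequence described in the text. */
enum acpi_step {
	STEP_RESOURCES_ON,	/* _ON for resources needed in the target state */
	STEP_DSW,		/* _DSW (or _PSW) to arm wakeup */
	STEP_PSX,		/* _PSx for the target state */
	STEP_RESOURCES_OFF,	/* _OFF for resources no longer needed */
};

/*
 * Fill 'steps' with the operations needed to put a device into Dx
 * (target_d between 0 and 3) and return how many were recorded.
 * Wakeup is only armed when entering a low-power state (D1-D3).
 */
static size_t acpi_dx_steps(int target_d, int wakeup, enum acpi_step steps[4])
{
	size_t n = 0;

	steps[n++] = STEP_RESOURCES_ON;
	if (target_d > 0 && wakeup)
		steps[n++] = STEP_DSW;	/* must precede _PSx */
	steps[n++] = STEP_PSX;
	steps[n++] = STEP_RESOURCES_OFF;
	return n;
}
```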
180	
181	However, quite often the power states of devices are changed during a
182	system-wide transition into a sleep state or back into the working state.  ACPI
183	defines four system sleep states, S1, S2, S3, and S4, and denotes the system
184	working state as S0.  In general, the target system sleep (or working) state
185	determines the highest power (lowest number) state the device can be put
186	into and the kernel is supposed to obtain this information by executing the
187	device's _SxD control method (where x is a number between 0 and 4 inclusive).
188	If the device is required to wake up the system from the target sleep state, the
189	lowest power (highest number) state it can be put into is also determined by the
190	target state of the system.  The kernel is then supposed to use the device's
191	_SxW control method to obtain the number of that state.  It also is supposed to
192	use the device's _PRW control method to learn which power resources need to be
193	enabled for the device to be able to generate wakeup signals.
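
Put differently, _SxD and _SxW bound the range of device power states that
are acceptable for a given target sleep state.  The sketch below is a
simplified model of that rule, not the kernel's implementation; the structure,
the function name and the clamping of an inconsistent _SxW value are
assumptions made for illustration.

```c
/* Hypothetical result type: both bounds are D-state numbers (0-3). */
struct dstate_range {
	int shallowest;	/* highest-power state allowed, from _SxD */
	int deepest;	/* lowest-power state allowed */
};

/*
 * sxd:    value returned by the device's _SxD method
 * sxw:    value returned by the device's _SxW method
 * wakeup: nonzero if the device has to wake up the system
 */
static struct dstate_range sleep_dstate_range(int sxd, int sxw, int wakeup)
{
	struct dstate_range r;

	r.shallowest = sxd;
	/* Without wakeup the device may go all the way down to D3. */
	r.deepest = wakeup ? sxw : 3;
	/* Assumption: never report a range that is empty. */
	if (r.deepest < r.shallowest)
		r.deepest = r.shallowest;
	return r;
}
```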
194	
195	1.4. Wakeup Signaling
196	---------------------
197	Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
198	a result of the execution of the _DSW (or _PSW) ACPI control method before
199	putting the device into a low-power state, have to be caught and handled as
200	appropriate.  If they are sent while the system is in the working state
201	(ACPI S0), they should be translated into interrupts so that the kernel can
202	put the devices generating them into the full-power state and take care of the
203	events that triggered them.  In turn, if they are sent while the system is
204	sleeping, they should cause the system's core logic to trigger wakeup.
205	
206	On ACPI-based systems wakeup signals sent by conventional PCI devices are
207	converted into ACPI General-Purpose Events (GPEs) which are hardware signals
208	from the system core logic generated in response to various events that need to
209	be acted upon.  Every GPE is associated with one or more sources of potentially
210	interesting events.  In particular, a GPE may be associated with a PCI device
211	capable of signaling wakeup.  The information on the connections between GPEs
212	and event sources is recorded in the system's ACPI BIOS from where it can be
213	read by the kernel.
214	
215	If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
216	associated with it (if there is one) is triggered.  The GPEs associated with PCI
217	bridges may also be triggered in response to a wakeup signal from one of the
218	devices below the bridge (this also is the case for root bridges) and, for
219	example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
220	handled this way.
221	
222	A GPE may be triggered when the system is sleeping (i.e. when it is in one of
223	the ACPI S1-S4 states), in which case system wakeup is started by its core logic
224	(the device that was the source of the signal causing the system wakeup to occur
225	may be identified later).  The GPEs used in such situations are referred to as
226	wakeup GPEs.
227	
228	Usually, however, GPEs are also triggered when the system is in the working
229	state (ACPI S0) and in that case the system's core logic generates a System
230	Control Interrupt (SCI) to notify the kernel of the event.  Then, the SCI
231	handler identifies the GPE that caused the interrupt to be generated which,
232	in turn, allows the kernel to identify the source of the event (that may be
233	a PCI device signaling wakeup).  The GPEs used for notifying the kernel of
234	events occurring while the system is in the working state are referred to as
235	runtime GPEs.
236	
237	Unfortunately, there is no standard way of handling wakeup signals sent by
238	conventional PCI devices on systems that are not ACPI-based, but there is one
239	for PCI Express devices.  Namely, the PCI Express Base Specification introduced
240	a native mechanism for converting native PCI PMEs into interrupts generated by
241	root ports.  For conventional PCI devices native PMEs are out-of-band, so they
242	are routed separately and they need not pass through bridges (in principle they
243	may be routed directly to the system's core logic), but for PCI Express devices
244	they are in-band messages that have to pass through the PCI Express hierarchy,
245	including the root port on the path from the device to the Root Complex.  Thus
246	it was possible to introduce a mechanism by which a root port generates an
247	interrupt whenever it receives a PME message from one of the devices below it.
248	The PCI Express Requester ID of the device that sent the PME message is then
249	recorded in one of the root port's configuration registers from where it may be
250	read by the interrupt handler allowing the device to be identified.  [PME
251	messages sent by PCI Express endpoints integrated with the Root Complex don't
252	pass through root ports, but instead they cause a Root Complex Event Collector
253	(if there is one) to generate interrupts.]
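
To make the identification step concrete, the following user-space sketch
decodes a value read from a root port's Root Status register.  The low 16 bits
hold the Requester ID (bus number in bits 15:8, device number in bits 7:3,
function number in bits 2:0) and bit 16 is the PME status bit, which
corresponds to PCI_EXP_RTSTA_PME in the kernel's pci_regs.h.  The structure
and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define RTSTA_PME_STATUS 0x00010000u	/* a PME has been received */

/* Hypothetical decoded form of the PME source. */
struct pme_source {
	bool valid;	/* false if no PME is being reported */
	uint8_t bus;
	uint8_t dev;
	uint8_t fn;
};

static struct pme_source decode_root_status(uint32_t rtsta)
{
	struct pme_source s = { false, 0, 0, 0 };
	uint16_t req_id;

	if (!(rtsta & RTSTA_PME_STATUS))
		return s;

	req_id = (uint16_t)(rtsta & 0xFFFFu);
	s.valid = true;
	s.bus = (uint8_t)(req_id >> 8);
	s.dev = (uint8_t)((req_id >> 3) & 0x1Fu);
	s.fn = (uint8_t)(req_id & 0x07u);
	return s;
}
```

A register value of 0x00010308, for example, reports a PME from function 0 of
device 1 on bus 3.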
254	
255	In principle the native PCI Express PME signaling may also be used on ACPI-based
256	systems along with the GPEs, but to use it the kernel has to ask the system's
257	ACPI BIOS to release control of root port configuration registers.  The ACPI
258	BIOS, however, is not required to allow the kernel to control these registers
259	and if it doesn't do that, the kernel must not modify their contents.  Of course
260	the native PCI Express PME signaling cannot be used by the kernel in that case.
261	
262	
263	2. PCI Subsystem and Device Power Management
264	============================================
265	
266	2.1. Device Power Management Callbacks
267	--------------------------------------
268	The PCI Subsystem participates in the power management of PCI devices in a
269	number of ways.  First of all, it provides an intermediate code layer between
270	the device power management core (PM core) and PCI device drivers.
271	Specifically, the pm field of the PCI subsystem's struct bus_type object,
272	pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
273	pointers to several device power management callbacks:
274	
const struct dev_pm_ops pci_dev_pm_ops = {
	.prepare = pci_pm_prepare,
	.complete = pci_pm_complete,
	.suspend = pci_pm_suspend,
	.resume = pci_pm_resume,
	.freeze = pci_pm_freeze,
	.thaw = pci_pm_thaw,
	.poweroff = pci_pm_poweroff,
	.restore = pci_pm_restore,
	.suspend_noirq = pci_pm_suspend_noirq,
	.resume_noirq = pci_pm_resume_noirq,
	.freeze_noirq = pci_pm_freeze_noirq,
	.thaw_noirq = pci_pm_thaw_noirq,
	.poweroff_noirq = pci_pm_poweroff_noirq,
	.restore_noirq = pci_pm_restore_noirq,
	.runtime_suspend = pci_pm_runtime_suspend,
	.runtime_resume = pci_pm_runtime_resume,
	.runtime_idle = pci_pm_runtime_idle,
};
294	
295	These callbacks are executed by the PM core in various situations related to
296	device power management and they, in turn, execute power management callbacks
297	provided by PCI device drivers.  They also perform power management operations
298	involving some standard configuration registers of PCI devices that device
299	drivers need not know or care about.
300	
301	The structure representing a PCI device, struct pci_dev, contains several fields
302	that these callbacks operate on:
303	
struct pci_dev {
	...
	pci_power_t	current_state;	/* Current operating state. */
	int		pm_cap;		/* PM capability offset in the
					   configuration space */
	unsigned int	pme_support:5;	/* Bitmask of states from which PME#
					   can be generated */
	unsigned int	pme_interrupt:1;/* Is native PCIe PME signaling used? */
	unsigned int	d1_support:1;	/* Low power state D1 is supported */
	unsigned int	d2_support:1;	/* Low power state D2 is supported */
	unsigned int	no_d1d2:1;	/* D1 and D2 are forbidden */
	unsigned int	wakeup_prepared:1;  /* Device prepared for wake up */
	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
	...
};
319	
320	They also indirectly use some fields of the struct device that is embedded in
321	struct pci_dev.
322	
323	2.2. Device Initialization
324	--------------------------
325	The PCI subsystem's first task related to device power management is to
326	prepare the device for power management and initialize the fields of struct
327	pci_dev used for this purpose.  This happens in two functions defined in
328	drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().
329	
330	The first of these functions checks if the device supports native PCI PM
331	and if that's the case the offset of its power management capability structure
332	in the configuration space is stored in the pm_cap field of the device's struct
333	pci_dev object.  Next, the function checks which PCI low-power states are
334	supported by the device and from which low-power states the device can generate
335	native PCI PMEs.  The power management fields of the device's struct pci_dev and
336	the struct device embedded in it are updated accordingly and the generation of
337	PMEs by the device is disabled.
338	
339	The second function checks if the device can be prepared to signal wakeup with
340	the help of the platform firmware, such as the ACPI BIOS.  If that is the case,
341	the function updates the wakeup fields in struct device embedded in the
342	device's struct pci_dev and uses the firmware-provided method to prevent the
343	device from signaling wakeup.
344	
345	At this point the device is ready for power management.  For driverless devices,
346	however, this functionality is limited to a few basic operations carried out
347	during system-wide transitions to a sleep state and back to the working state.
348	
349	2.3. Runtime Device Power Management
350	------------------------------------
351	The PCI subsystem plays a vital role in the runtime power management of PCI
352	devices.  For this purpose it uses the general runtime power management
353	(runtime PM) framework described in Documentation/power/runtime_pm.txt.
354	Namely, it provides subsystem-level callbacks:
355	
356		pci_pm_runtime_suspend()
357		pci_pm_runtime_resume()
358		pci_pm_runtime_idle()
359	
360	that are executed by the core runtime PM routines.  It also implements the
361	entire mechanics necessary for handling runtime wakeup signals from PCI devices
362	in low-power states, which at the time of this writing works for both the native
363	PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
364	Section 1.
365	
366	First, a PCI device is put into a low-power state, or suspended, with the help
367	of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
368	pci_pm_runtime_suspend() to do the actual job.  For this to work, the device's
369	driver has to provide a pm->runtime_suspend() callback (see below), which is
370	run by pci_pm_runtime_suspend() as the first action.  If the driver's callback
371	returns successfully, the device's standard configuration registers are saved,
372	the device is prepared to generate wakeup signals and, finally, it is put into
373	the target low-power state.
374	
375	The low-power state to put the device into is the lowest-power (highest number)
376	state from which it can signal wakeup.  The exact method of signaling wakeup is
377	system-dependent and is determined by the PCI subsystem on the basis of the
378	reported capabilities of the device and the platform firmware.  To prepare the
379	device for signaling wakeup and put it into the selected low-power state, the
380	PCI subsystem can use the platform firmware as well as the device's native PCI
381	PM capabilities, if supported.
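
For a device relying on native PCI PMEs only, the selection of the target
state can be pictured as a scan of the pme_support bitmask from D3hot
downwards.  The following is a simplified user-space sketch of that idea under
the assumption that platform firmware is not involved; the function name is
made up for this example.

```c
/*
 * Return the deepest programmable state (3 = D3hot, 2 = D2, 1 = D1)
 * from which the device can generate a PME, or 0 (D0) if it cannot
 * signal wakeup from any of them.  Bit N of 'pme_support' is set if
 * PMEs can be generated from state N; bit 4 (D3cold) is ignored here
 * because the device cannot be programmed into D3cold.
 */
static int deepest_wakeup_state(unsigned int pme_support)
{
	int state = 3;	/* start from D3hot */

	while (state > 0 && !(pme_support & (1u << state)))
		state--;
	return state;
}
```

With pme_support equal to 0x06 (PMEs from D1 and D2 only) the result is D2,
so the device would not be put into D3hot even though it supports that state.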
382	
383	It is expected that the device driver's pm->runtime_suspend() callback will
384	not attempt to prepare the device for signaling wakeup or to put it into a
385	low-power state.  The driver ought to leave these tasks to the PCI subsystem
386	that has all of the information necessary to perform them.
387	
388	A suspended device is brought back into the "active" state, or resumed,
389	with the help of pm_request_resume() or pm_runtime_resume() which both call
390	pci_pm_runtime_resume() for PCI devices.  Again, this only works if the device's
391	driver provides a pm->runtime_resume() callback (see below).  However, before
392	the driver's callback is executed, pci_pm_runtime_resume() brings the device
393	back into the full-power state, prevents it from signaling wakeup while in that
394	state and restores its standard configuration registers.  Thus the driver's
395	callback need not worry about the PCI-specific aspects of the device resume.
396	
397	Note that generally pci_pm_runtime_resume() may be called in two different
398	situations.  First, it may be called at the request of the device's driver, for
399	example if there are some data for it to process.  Second, it may be called
400	as a result of a wakeup signal from the device itself (this sometimes is
401	referred to as "remote wakeup").  Of course, for this purpose the wakeup signal
402	is handled in one of the ways described in Section 1 and finally converted into
403	a notification for the PCI subsystem after the source device has been
404	identified.
405	
406	The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
407	and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is not
409	present at all), suspends the device with the help of pm_runtime_suspend().
410	Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
411	example, it is called right after the device has just been resumed), in which
412	cases it is expected to suspend the device if that makes sense.  Usually,
however, the PCI subsystem doesn't know whether the device really can be
414	suspended, so it lets the device's driver decide by running its
415	pm->runtime_idle() callback.
416	
417	2.4. System-Wide Power Transitions
418	----------------------------------
419	There are a few different types of system-wide power transitions, described in
420	Documentation/power/devices.txt.  Each of them requires devices to be handled
421	in a specific way and the PM core executes subsystem-level power management
422	callbacks for this purpose.  They are executed in phases such that each phase
423	involves executing the same subsystem-level callback for every device belonging
424	to the given subsystem before the next phase begins.  These phases always run
425	after tasks have been frozen.
426	
427	2.4.1. System Suspend
428	
429	When the system is going into a sleep state in which the contents of memory will
430	be preserved, such as one of the ACPI sleep states S1-S3, the phases are:
431	
432		prepare, suspend, suspend_noirq.
433	
434	The following PCI bus type's callbacks, respectively, are used in these phases:
435	
436		pci_pm_prepare()
437		pci_pm_suspend()
438		pci_pm_suspend_noirq()
439	
440	The pci_pm_prepare() routine first puts the device into the "fully functional"
441	state with the help of pm_runtime_resume().  Then, it executes the device
442	driver's pm->prepare() callback if defined (i.e. if the driver's struct
443	dev_pm_ops object is present and the prepare pointer in that object is valid).
444	
445	The pci_pm_suspend() routine first checks if the device's driver implements
446	legacy PCI suspend routines (see Section 3), in which case the driver's legacy
447	suspend callback is executed, if present, and its result is returned.  Next, if
448	the device's driver doesn't provide a struct dev_pm_ops object (containing
449	pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
450	simply turns off the device's bus master capability and runs
451	pcibios_disable_device() to disable it, unless the device is a bridge (PCI
452	bridges are ignored by this routine).  Next, the device driver's pm->suspend()
453	callback is executed, if defined, and its result is returned if it fails.
454	Finally, pci_fixup_device() is called to apply hardware suspend quirks related
455	to the device if necessary.
456	
457	Note that the suspend phase is carried out asynchronously for PCI devices, so
458	the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
459	devices that don't depend on each other in a known way (i.e. none of the paths
460	in the device tree from the root bridge to a leaf device contains both of them).
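
The "don't depend on each other" condition amounts to neither device being an
ancestor of the other in the device tree.  A minimal sketch of that check,
using a hypothetical node type with a parent pointer rather than the kernel's
struct device, might look like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device tree node; only the parent link matters here. */
struct node {
	const struct node *parent;
};

/* Is 'a' on the path from the root to 'd' (excluding 'd' itself)? */
static bool is_ancestor(const struct node *a, const struct node *d)
{
	for (d = d->parent; d != NULL; d = d->parent)
		if (d == a)
			return true;
	return false;
}

/* May the suspend callbacks of 'x' and 'y' run in parallel? */
static bool may_suspend_in_parallel(const struct node *x, const struct node *y)
{
	return x != y && !is_ancestor(x, y) && !is_ancestor(y, x);
}
```

Two devices behind the same bridge may therefore be suspended in parallel,
whereas a bridge and a device below it may not.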
461	
462	The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
463	been called, which means that the device driver's interrupt handler won't be
464	invoked while this routine is running.  It first checks if the device's driver
implements legacy PCI suspend routines (see Section 3), in which case the legacy
466	late suspend routine is called and its result is returned (the standard
467	configuration registers of the device are saved if the driver's callback hasn't
468	done that).  Second, if the device driver's struct dev_pm_ops object is not
469	present, the device's standard configuration registers are saved and the routine
470	returns success.  Otherwise the device driver's pm->suspend_noirq() callback is
471	executed, if present, and its result is returned if it fails.  Next, if the
472	device's standard configuration registers haven't been saved yet (one of the
473	device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
474	saves them, prepares the device to signal wakeup (if necessary) and puts it into
475	a low-power state.
476	
477	The low-power state to put the device into is the lowest-power (highest number)
478	state from which it can signal wakeup while the system is in the target sleep
479	state.  Just like in the runtime PM case described above, the mechanism of
480	signaling wakeup is system-dependent and determined by the PCI subsystem, which
481	is also responsible for preparing the device to signal wakeup from the system's
482	target sleep state as appropriate.
483	
484	PCI device drivers (that don't implement legacy power management callbacks) are
485	generally not expected to prepare devices for signaling wakeup or to put them
486	into low-power states.  However, if one of the driver's suspend callbacks
487	(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
488	registers, pci_pm_suspend_noirq() will assume that the device has been prepared
489	to signal wakeup and put into a low-power state by the driver (the driver is
490	then assumed to have used the helper functions provided by the PCI subsystem for
491	this purpose).  PCI device drivers are not encouraged to do that, but in some
492	rare cases doing that in the driver may be the optimum approach.
493	
494	2.4.2. System Resume
495	
496	When the system is undergoing a transition from a sleep state in which the
497	contents of memory have been preserved, such as one of the ACPI sleep states
498	S1-S3, into the working state (ACPI S0), the phases are:
499	
500		resume_noirq, resume, complete.
501	
502	The following PCI bus type's callbacks, respectively, are executed in these
503	phases:
504	
505		pci_pm_resume_noirq()
506		pci_pm_resume()
507		pci_pm_complete()
508	
509	The pci_pm_resume_noirq() routine first puts the device into the full-power
510	state, restores its standard configuration registers and applies early resume
511	hardware quirks related to the device, if necessary.  This is done
512	unconditionally, regardless of whether or not the device's driver implements
513	legacy PCI power management callbacks (this way all PCI devices are in the
514	full-power state and their standard configuration registers have been restored
515	when their interrupt handlers are invoked for the first time during resume,
516	which allows the kernel to avoid problems with the handling of shared interrupts
517	by drivers whose devices are still suspended).  If legacy PCI power management
518	callbacks (see Section 3) are implemented by the device's driver, the legacy
519	early resume callback is executed and its result is returned.  Otherwise, the
520	device driver's pm->resume_noirq() callback is executed, if defined, and its
521	result is returned.
522	
523	The pci_pm_resume() routine first checks if the device's standard configuration
524	registers have been restored and restores them if that's not the case (this
525	only is necessary in the error path during a failing suspend).  Next, resume
526	hardware quirks related to the device are applied, if necessary, and if the
527	device's driver implements legacy PCI power management callbacks (see
528	Section 3), the driver's legacy resume callback is executed and its result is
529	returned.  Otherwise, the device's wakeup signaling mechanisms are blocked and
530	its driver's pm->resume() callback is executed, if defined (the callback's
531	result is then returned).
532	
533	The resume phase is carried out asynchronously for PCI devices, like the
534	suspend phase described above, which means that if two PCI devices don't depend
535	on each other in a known way, the pci_pm_resume() routine may be executed for
536	both of them in parallel.
537	
538	The pci_pm_complete() routine only executes the device driver's pm->complete()
539	callback, if defined.
540	
541	2.4.3. System Hibernation
542	
543	System hibernation is more complicated than system suspend, because it requires
544	a system image to be created and written into a persistent storage medium.  The
545	image is created atomically and all devices are quiesced, or frozen, before that
546	happens.
547	
548	The freezing of devices is carried out after enough memory has been freed (at
549	the time of this writing the image creation requires at least 50% of system RAM
550	to be free) in the following three phases:
551	
552		prepare, freeze, freeze_noirq
553	
554	that correspond to the PCI bus type's callbacks:
555	
556		pci_pm_prepare()
557		pci_pm_freeze()
558		pci_pm_freeze_noirq()
559	
560	This means that the prepare phase is exactly the same as for system suspend.
561	The other two phases, however, are different.
562	
563	The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
564	the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
565	and it doesn't apply the suspend-related hardware quirks.  It is executed
566	asynchronously for different PCI devices that don't depend on each other in a
567	known way.
568	
569	The pci_pm_freeze_noirq() routine, in turn, is similar to
570	pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
571	routine instead of pm->suspend_noirq().  It also doesn't attempt to prepare the
572	device for signaling wakeup and put it into a low-power state.  Still, it saves
573	the device's standard configuration registers if they haven't been saved by one
574	of the driver's callbacks.
575	
576	Once the image has been created, it has to be saved.  However, at this point all
577	devices are frozen and they cannot handle I/O, while their ability to handle
578	I/O is obviously necessary for the image saving.  Thus they have to be brought
579	back to the fully functional state and this is done in the following phases:
580	
581		thaw_noirq, thaw, complete
582	
583	using the following PCI bus type's callbacks:
584	
585		pci_pm_thaw_noirq()
586		pci_pm_thaw()
587		pci_pm_complete()
588	
589	respectively.
590	
591	The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(),
592	but it doesn't put the device into the full power state and doesn't attempt to
593	restore its standard configuration registers.  It also executes the device
594	driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().
595	
596	The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
597	driver's pm->thaw() callback instead of pm->resume().  It is executed
598	asynchronously for different PCI devices that don't depend on each other in a
599	known way.
600	
601	The complete phase is the same as for system resume.
602	
603	After saving the image, devices need to be powered down before the system can
604	enter the target sleep state (ACPI S4 for ACPI-based systems).  This is done in
605	three phases:
606	
607		prepare, poweroff, poweroff_noirq
608	
609	where the prepare phase is exactly the same as for system suspend.  The other
610	two phases are analogous to the suspend and suspend_noirq phases, respectively.
611	The PCI subsystem-level callbacks they correspond to
612	
613		pci_pm_poweroff()
614		pci_pm_poweroff_noirq()
615	
616	work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
617	although they don't attempt to save the device's standard configuration
618	registers.
619	
620	2.4.4. System Restore
621	
622	System restore requires a hibernation image to be loaded into memory and the
623	pre-hibernation memory contents to be restored before the pre-hibernation system
624	activity can be resumed.
625	
626	As described in Documentation/power/devices.txt, the hibernation image is loaded
627	into memory by a fresh instance of the kernel, called the boot kernel, which in
628	turn is loaded and run by a boot loader in the usual way.  After the boot kernel
629	has loaded the image, it needs to replace its own code and data with the code
630	and data of the "hibernated" kernel stored within the image, called the image
631	kernel.  For this purpose all devices are frozen just like before creating
632	the image during hibernation, in the
633	
634		prepare, freeze, freeze_noirq
635	
636	phases described above.  However, the devices affected by these phases are only
637	those having drivers in the boot kernel; other devices will still be in whatever
638	state the boot loader left them.
639	
640	Should the restoration of the pre-hibernation memory contents fail, the boot
641	kernel would go through the "thawing" procedure described above, using the
642	thaw_noirq, thaw, and complete phases (that will only affect the devices having
643	drivers in the boot kernel), and then continue running normally.
644	
645	If the pre-hibernation memory contents are restored successfully, which is the
646	usual situation, control is passed to the image kernel, which then becomes
647	responsible for bringing the system back to the working state.  To achieve this,
648	it must restore the devices' pre-hibernation functionality, which is done much
649	like waking up from the memory sleep state, although it involves different
650	phases:
651	
652		restore_noirq, restore, complete
653	
654	The first two of these are analogous to the resume_noirq and resume phases
655	described above, respectively, and correspond to the following PCI subsystem
656	callbacks:
657	
658		pci_pm_restore_noirq()
659		pci_pm_restore()
660	
661	These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
662	respectively, but they execute the device driver's pm->restore_noirq() and
663	pm->restore() callbacks, if available.
664	
665	The complete phase is carried out in exactly the same way as during system
666	resume.
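
The phase sequences of Sections 2.4.3 and 2.4.4 can be summarized as data.  The
following is a small user-space sketch, not kernel code; the array names and
the two helper functions are hypothetical.  It relies on the observation that
every phase named in this document maps to a PCI bus type callback called
pci_pm_<phase>().

```c
#include <stddef.h>
#include <stdio.h>

/* Phases of hibernation, in execution order (Section 2.4.3). */
static const char *const hibernation_phases[] = {
	"prepare", "freeze", "freeze_noirq",     /* before image creation */
	"thaw_noirq", "thaw", "complete",        /* so the image can be saved */
	"prepare", "poweroff", "poweroff_noirq", /* after the image is saved */
	NULL
};

/* Phases of a successful system restore (Section 2.4.4). */
static const char *const restore_phases[] = {
	"prepare", "freeze", "freeze_noirq",     /* run by the boot kernel */
	"restore_noirq", "restore", "complete",  /* run by the image kernel */
	NULL
};

/* Counts the entries of a NULL-terminated phase list. */
static size_t count_phases(const char *const *phases)
{
	size_t n = 0;

	while (phases[n])
		n++;
	return n;
}

/* Writes the name of the PCI bus type callback for a phase into buf. */
static int pci_callback_name(const char *phase, char *buf, size_t len)
{
	return snprintf(buf, len, "pci_pm_%s", phase);
}
```

The failure path of system restore is not listed: in that case the boot kernel
runs thaw_noirq, thaw and complete instead of handing control to the image
kernel.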
667	
668	
669	3. PCI Device Drivers and Power Management
670	==========================================
671	
672	3.1. Power Management Callbacks
673	-------------------------------
674	PCI device drivers participate in power management by providing callbacks to be
675	executed by the PCI subsystem's power management routines described above and by
676	controlling the runtime power management of their devices.
677	
678	At the time of this writing there are two ways to define power management
679	callbacks for a PCI device driver: the recommended one, based on using a
680	dev_pm_ops structure described in Documentation/power/devices.txt, and the
681	"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and
682	.resume() callbacks from struct pci_driver are used.  The legacy approach,
683	however, doesn't allow one to define runtime power management callbacks and is
684	not really suitable for any new drivers.  Therefore it is not covered by this
685	document (refer to the source code to learn more about it).
686	
687	It is recommended that all PCI device drivers define a struct dev_pm_ops object
688	containing pointers to power management (PM) callbacks that will be executed by
689	the PCI subsystem's PM routines in various circumstances.  A pointer to the
690	driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
691	its struct pci_driver object.  Once that has happened, the "legacy" PM callbacks
692	in struct pci_driver are ignored (even if they are not NULL).
693	
694	The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
695	defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
696	subsystem will handle the device in a simplified default manner.  If they are
697	defined, though, they are expected to behave as described in the following
698	subsections.
699	
700	3.1.1. prepare()
701	
702	The prepare() callback is executed during system suspend, during hibernation
703	(when a hibernation image is about to be created), during power-off after
704	saving a hibernation image and during system restore, when a hibernation image
705	has just been loaded into memory.
706	
707	This callback is only necessary if the driver's device has children that in
708	general may be registered at any time.  In that case the role of the prepare()
709	callback is to prevent new children of the device from being registered until
710	one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.
711	
712	In addition to that the prepare() callback may carry out some operations
713	preparing the device to be suspended, although it should not allocate memory
714	(if additional memory is required to suspend the device, it has to be
715	preallocated earlier, for example in a suspend/hibernate notifier as described
716	in Documentation/power/notifiers.txt).
717	
718	3.1.2. suspend()
719	
720	The suspend() callback is only executed during system suspend, after prepare()
721	callbacks have been executed for all devices in the system.
722	
723	This callback is expected to quiesce the device and prepare it to be put into a
724	low-power state by the PCI subsystem.  It is not required (in fact it is not
725	even recommended) that a PCI driver's suspend() callback save the standard
726	configuration registers of the device, prepare it for waking up the system, or
727	put it into a low-power state.  All of these operations can very well be taken
728	care of by the PCI subsystem, without the driver's participation.
729	
730	However, in some rare cases it is convenient to carry out these operations in
731	a PCI driver.  Then, pci_save_state(), pci_prepare_to_sleep(), and
732	pci_set_power_state() should be used to save the device's standard configuration
733	registers, to prepare it for system wakeup (if necessary), and to put it into a
734	low-power state, respectively.  Moreover, if the driver calls pci_save_state(),
735	the PCI subsystem will not execute either pci_prepare_to_sleep() or
736	pci_set_power_state() for its device, so the driver is then responsible for
737	handling the device as appropriate.
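
The hand-over rule in the previous paragraph can be modeled as follows.  This
is a user-space sketch under simplifying assumptions: struct mock_pci_dev and
both functions are hypothetical, and the single flag merely stands in for the
subsystem's record of whether pci_save_state() has already been called.

```c
/* Hypothetical model of the state tracked for one PCI device. */
struct mock_pci_dev {
	int state_saved;     /* standard config registers have been saved */
	int sleep_prepared;  /* subsystem prepared the device for wakeup */
	int low_power;       /* subsystem put the device into a low-power state */
};

/* Models a driver's suspend() that calls pci_save_state() itself. */
static void mock_driver_suspend_saves_state(struct mock_pci_dev *dev)
{
	dev->state_saved = 1;
}

/* Models what the PCI subsystem does after the driver's callbacks ran. */
static void mock_subsystem_finish_suspend(struct mock_pci_dev *dev)
{
	/* The driver saved the state, so it owns the rest of the job too. */
	if (dev->state_saved)
		return;

	dev->state_saved = 1;
	dev->sleep_prepared = 1;
	dev->low_power = 1;
}
```

A driver that saves the state but then does nothing else leaves its device in
the full-power state with wakeup not configured, which is why the driver must
complete the remaining steps once it has called pci_save_state().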
738	
739	While the suspend() callback is being executed, the driver's interrupt handler
740	can be invoked to handle an interrupt from the device, so all suspend-related
741	operations relying on the driver's ability to handle interrupts should be
742	carried out in this callback.
743	
744	3.1.3. suspend_noirq()
745	
746	The suspend_noirq() callback is only executed during system suspend, after
747	suspend() callbacks have been executed for all devices in the system and
748	after device interrupts have been disabled by the PM core.
749	
750	The difference between suspend_noirq() and suspend() is that the driver's
751	interrupt handler will not be invoked while suspend_noirq() is running.  Thus
752	suspend_noirq() can carry out operations that would cause race conditions to
753	arise if they were performed in suspend().
754	
755	3.1.4. freeze()
756	
757	The freeze() callback is hibernation-specific and is executed in two situations:
758	during hibernation, after prepare() callbacks have been executed for all devices
759	in preparation for the creation of a system image, and during restore,
760	after a system image has been loaded into memory from persistent storage and the
761	prepare() callbacks have been executed for all devices.
762	
763	The role of this callback is analogous to the role of the suspend() callback
764	described above.  In fact, they only need to be different in the rare cases when
765	the driver takes the responsibility for putting the device into a low-power
766	state.
767	
768	In those cases freeze() should not prepare the device for system wakeup
769	or put it into a low-power state.  Still, either it or freeze_noirq() should
770	save the device's standard configuration registers using pci_save_state().
771	
772	3.1.5. freeze_noirq()
773	
774	The freeze_noirq() callback is hibernation-specific.  It is executed during
775	hibernation, after prepare() and freeze() callbacks have been executed for all
776	devices in preparation for the creation of a system image, and during restore,
777	after a system image has been loaded into memory and after prepare() and
778	freeze() callbacks have been executed for all devices.  It is always executed
779	after device interrupts have been disabled by the PM core.
780	
781	The role of this callback is analogous to the role of the suspend_noirq()
782	callback described above, and it is very rarely necessary to define
783	freeze_noirq().
784	
785	The difference between freeze_noirq() and freeze() is analogous to the
786	difference between suspend_noirq() and suspend().
787	
788	3.1.6. poweroff()
789	
790	The poweroff() callback is hibernation-specific.  It is executed when the system
791	is about to be powered off after saving a hibernation image to persistent
792	storage.  prepare() callbacks are executed for all devices before poweroff() is
793	called.
794	
795	The role of this callback is analogous to the role of the suspend() and freeze()
796	callbacks described above, although it does not need to save the contents of
797	the device's registers.  In particular, if the driver wants to put the device
798	into a low-power state itself instead of allowing the PCI subsystem to do that,
799	the poweroff() callback should use pci_prepare_to_sleep() and
800	pci_set_power_state() to prepare the device for system wakeup and to put it
801	into a low-power state, respectively, but it need not save the device's standard
802	configuration registers.
803	
804	3.1.7. poweroff_noirq()
805	
806	The poweroff_noirq() callback is hibernation-specific.  It is executed after
807	poweroff() callbacks have been executed for all devices in the system.
808	
809	The role of this callback is analogous to the role of the suspend_noirq() and
810	freeze_noirq() callbacks described above, but it does not need to save the
811	contents of the device's registers.
812	
813	The difference between poweroff_noirq() and poweroff() is analogous to the
814	difference between suspend_noirq() and suspend().
815	
816	3.1.8. resume_noirq()
817	
818	The resume_noirq() callback is only executed during system resume, after the
819	PM core has enabled the non-boot CPUs.  The driver's interrupt handler will not
820	be invoked while resume_noirq() is running, so this callback can carry out
821	operations that might race with the interrupt handler.
822	
823	Since the PCI subsystem unconditionally puts all devices into the full power
824	state in the resume_noirq phase of system resume and restores their standard
825	configuration registers, resume_noirq() is usually not necessary.  In general
826	it should only be used for performing operations that would lead to race
827	conditions if carried out by resume().
828	
829	3.1.9. resume()
830	
831	The resume() callback is only executed during system resume, after
832	resume_noirq() callbacks have been executed for all devices in the system and
833	device interrupts have been enabled by the PM core.
834	
835	This callback is responsible for restoring the pre-suspend configuration of the
836	device and bringing it back to the fully functional state.  The device should be
837	able to process I/O in a usual way after resume() has returned.
838	
839	3.1.10. thaw_noirq()
840	
841	The thaw_noirq() callback is hibernation-specific.  It is executed after a
842	system image has been created and the non-boot CPUs have been enabled by the PM
843	core, in the thaw_noirq phase of hibernation.  It also may be executed if the
844	loading of a hibernation image fails during system restore (it is then executed
845	after enabling the non-boot CPUs).  The driver's interrupt handler will not be
846	invoked while thaw_noirq() is running.
847	
848	The role of this callback is analogous to the role of resume_noirq().  The
849	difference between these two callbacks is that thaw_noirq() is executed after
850	freeze() and freeze_noirq(), so in general it does not need to modify the
851	contents of the device's registers.
852	
853	3.1.11. thaw()
854	
855	The thaw() callback is hibernation-specific.  It is executed after thaw_noirq()
856	callbacks have been executed for all devices in the system and after device
857	interrupts have been enabled by the PM core.
858	
859	This callback is responsible for restoring the pre-freeze configuration of
860	the device, so that it will work in a usual way after thaw() has returned.
861	
862	3.1.12. restore_noirq()
863	
864	The restore_noirq() callback is hibernation-specific.  It is executed in the
865	restore_noirq phase of hibernation, when the boot kernel has passed control to
866	the image kernel and the non-boot CPUs have been enabled by the image kernel's
867	PM core.
868	
869	This callback is analogous to resume_noirq() with the exception that it cannot
870	make any assumption on the previous state of the device, even if the BIOS (or
871	generally the platform firmware) is known to preserve that state over a
872	suspend-resume cycle.
873	
874	For the vast majority of PCI device drivers there is no difference between
875	resume_noirq() and restore_noirq().
876	
877	3.1.13. restore()
878	
879	The restore() callback is hibernation-specific.  It is executed after
880	restore_noirq() callbacks have been executed for all devices in the system and
881	after the PM core has enabled device drivers' interrupt handlers to be invoked.
882	
883	This callback is analogous to resume(), just like restore_noirq() is analogous
884	to resume_noirq().  Consequently, the difference between restore_noirq() and
885	restore() is analogous to the difference between resume_noirq() and resume().
886	
887	For the vast majority of PCI device drivers there is no difference between
888	resume() and restore().
889	
890	3.1.14. complete()
891	
892	The complete() callback is executed in the following situations:
893	  - during system resume, after resume() callbacks have been executed for all
894	    devices,
895	  - during hibernation, before saving the system image, after thaw() callbacks
896	    have been executed for all devices,
897	  - during system restore, when the system is going back to its pre-hibernation
898	    state, after restore() callbacks have been executed for all devices.
899	It also may be executed if the loading of a hibernation image into memory fails
900	(in that case it is run after thaw() callbacks have been executed for all
901	devices that have drivers in the boot kernel).
902	
903	This callback is entirely optional, although it may be necessary if the
904	prepare() callback performs operations that need to be reversed.
905	
906	3.1.15. runtime_suspend()
907	
908	The runtime_suspend() callback is specific to device runtime power management
909	(runtime PM).  It is executed by the PM core's runtime PM framework when the
910	device is about to be suspended (i.e. quiesced and put into a low-power state)
911	at run time.
912	
913	This callback is responsible for freezing the device and preparing it to be
914	put into a low-power state, but it must allow the PCI subsystem to perform all
915	of the PCI-specific actions necessary for suspending the device.
916	
917	3.1.16. runtime_resume()
918	
919	The runtime_resume() callback is specific to device runtime PM.  It is executed
920	by the PM core's runtime PM framework when the device is about to be resumed
921	(i.e. put into the full-power state and programmed to process I/O normally) at
922	run time.
923	
924	This callback is responsible for restoring the normal functionality of the
925	device after it has been put into the full-power state by the PCI subsystem.
926	The device is expected to be able to process I/O in the usual way after
927	runtime_resume() has returned.
928	
929	3.1.17. runtime_idle()
930	
931	The runtime_idle() callback is specific to device runtime PM.  It is executed
932	by the PM core's runtime PM framework whenever it may be desirable to suspend
933	the device according to the PM core's information.  In particular, it is
934	automatically executed right after runtime_resume() has returned in case the
935	resume of the device has happened as a result of a spurious event.
936	
937	This callback is optional, but if it is not implemented or if it returns 0, the
938	PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
939	cause the driver's runtime_suspend() callback to be executed.
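
The rule above can be illustrated with a user-space model.  The names
struct mock_rpm_dev, sample_idle() and mock_idle_check() are hypothetical, and
setting the suspended flag merely stands in for the effect of
pm_runtime_suspend().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a device handled by the runtime PM framework. */
struct mock_rpm_dev {
	int busy;       /* device still has work to do */
	int suspended;  /* set when a runtime suspend has been carried out */
	int (*runtime_idle)(struct mock_rpm_dev *dev);
};

/* Sample runtime_idle(): vetoes the suspend while the device is busy. */
static int sample_idle(struct mock_rpm_dev *dev)
{
	return dev->busy ? -EBUSY : 0;
}

/* Models the idle check: a missing callback counts as "returned 0". */
static void mock_idle_check(struct mock_rpm_dev *dev)
{
	int ret = dev->runtime_idle ? dev->runtime_idle(dev) : 0;

	if (ret == 0)
		dev->suspended = 1;  /* stands in for pm_runtime_suspend() */
}
```

Returning a nonzero value from runtime_idle() is therefore the way to keep the
device from being suspended right after it has been resumed.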
940	
941	3.1.18. Pointing Multiple Callback Pointers to One Routine
942	
943	Although in principle each of the callbacks described in the previous
944	subsections can be defined as a separate function, it often is convenient to
945	point two or more members of struct dev_pm_ops to the same routine.  There are
946	a few convenience macros that can be used for this purpose.
947	
948	The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
949	suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
950	members and one resume routine pointed to by the .resume(), .thaw(), and
951	.restore() members.  The other function pointers in this struct dev_pm_ops are
952	unset.
953	
954	The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
955	additionally sets the .runtime_resume() pointer to the same value as
956	.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to
957	the same value as .suspend() (and .freeze() and .poweroff()).
958	
959	The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside of a declaration of struct
960	dev_pm_ops to indicate that one suspend routine is to be pointed to by the
961	.suspend(), .freeze(), and .poweroff() members and one resume routine is to
962	be pointed to by the .resume(), .thaw(), and .restore() members.
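
The effect of these macros can be shown with local stand-ins.  The sketch below
does not use the kernel headers: struct mock_dev_pm_ops is a reduced,
hypothetical version of struct dev_pm_ops, the MOCK_ macros only mirror the
behavior described above, and the idle argument of MOCK_UNIVERSAL_DEV_PM_OPS
as well as the foo_ names are assumptions made for the example.

```c
#include <stddef.h>

/* Reduced stand-in for struct dev_pm_ops: only the members discussed here. */
struct mock_dev_pm_ops {
	int (*suspend)(void *dev);
	int (*resume)(void *dev);
	int (*freeze)(void *dev);
	int (*thaw)(void *dev);
	int (*poweroff)(void *dev);
	int (*restore)(void *dev);
	int (*runtime_suspend)(void *dev);
	int (*runtime_resume)(void *dev);
	int (*runtime_idle)(void *dev);
};

/* One suspend routine and one resume routine for all system sleep members. */
#define MOCK_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, .resume = resume_fn, \
	.freeze = suspend_fn, .thaw = resume_fn, \
	.poweroff = suspend_fn, .restore = resume_fn,

/* Declares an object with only the system sleep members set. */
#define MOCK_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct mock_dev_pm_ops name = { \
		MOCK_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Additionally reuses the same two routines for runtime PM. */
#define MOCK_UNIVERSAL_DEV_PM_OPS(name, suspend_fn, resume_fn, idle_fn) \
	const struct mock_dev_pm_ops name = { \
		MOCK_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
		.runtime_suspend = suspend_fn, \
		.runtime_resume = resume_fn, \
		.runtime_idle = idle_fn, \
	}

static int foo_suspend(void *dev) { (void)dev; return 0; }
static int foo_resume(void *dev) { (void)dev; return 0; }

static MOCK_SIMPLE_DEV_PM_OPS(foo_simple_ops, foo_suspend, foo_resume);
static MOCK_UNIVERSAL_DEV_PM_OPS(foo_universal_ops, foo_suspend, foo_resume,
				 NULL);
```

With these definitions foo_simple_ops leaves the runtime PM members unset,
while foo_universal_ops points them to the same two routines.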
963	
964	3.2. Device Runtime Power Management
965	------------------------------------
966	In addition to providing device power management callbacks PCI device drivers
967	are responsible for controlling the runtime power management (runtime PM) of
968	their devices.
969	
970	The PCI device runtime PM is optional, but it is recommended that PCI device
971	drivers implement it at least in the cases where there is a reliable way of
972	verifying that the device is not used (like when the network cable is detached
973	from an Ethernet adapter or there are no devices attached to a USB controller).
974	
975	To support the PCI runtime PM the driver first needs to implement the
976	runtime_suspend() and runtime_resume() callbacks.  It also may need to implement
977	the runtime_idle() callback to prevent the device from being suspended again
978	every time right after the runtime_resume() callback has returned
979	(alternatively, the runtime_suspend() callback will have to check if the
980	device should really be suspended and return -EAGAIN if that is not the case).
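
The alternative mentioned in parentheses can be sketched for the Ethernet
example.  This is a user-space model with hypothetical names (struct mock_nic,
mock_nic_runtime_suspend()); a real driver would inspect its own hardware
state instead of a flag.

```c
#include <errno.h>

/* Hypothetical model of an Ethernet adapter. */
struct mock_nic {
	int link_up;    /* a network cable is attached and the link is active */
	int suspended;  /* device has been quiesced for runtime suspend */
};

/* Models a runtime_suspend() that refuses to suspend a device in use. */
static int mock_nic_runtime_suspend(struct mock_nic *nic)
{
	if (nic->link_up)
		return -EAGAIN;  /* still in use: ask the core to try later */

	nic->suspended = 1;
	return 0;
}
```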
981	
982	The runtime PM of PCI devices is disabled by default.  It is also blocked by
983	pci_pm_init() that runs the pm_runtime_forbid() helper function.  If a PCI
984	driver implements the runtime PM callbacks and intends to use the runtime PM
985	framework provided by the PM core and the PCI subsystem, it should enable this
986	feature by executing the pm_runtime_enable() helper function.  However, the
987	driver should not call the pm_runtime_allow() helper function unblocking
988	the runtime PM of the device.  Instead, it should allow user space or some
989	platform-specific code to do that (user space can do it via sysfs), although
990	once it has called pm_runtime_enable(), it must be prepared to handle the
991	runtime PM of the device correctly as soon as pm_runtime_allow() is called
992	(which may happen at any time).  [It also is possible that user space causes
993	pm_runtime_allow() to be called via sysfs before the driver is loaded, so in
994	fact the driver has to be prepared to handle the runtime PM of the device as
995	soon as it calls pm_runtime_enable().]
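
The interplay of the three helpers can be modeled with two flags.  This is a
deliberately simplified user-space sketch with hypothetical names; the real
runtime PM core keeps more detailed state than this.

```c
/* Hypothetical two-flag model of a device's runtime PM status. */
struct mock_rpm_state {
	int enabled;    /* driver has called pm_runtime_enable() */
	int forbidden;  /* pm_runtime_forbid() is in effect */
};

/* Initial state as described above: disabled and blocked. */
static void mock_pci_pm_init(struct mock_rpm_state *s)
{
	s->enabled = 0;
	s->forbidden = 1;
}

/* Stands in for the driver calling pm_runtime_enable(). */
static void mock_driver_enable(struct mock_rpm_state *s)
{
	s->enabled = 1;
}

/* Stands in for user space causing pm_runtime_allow() to be called. */
static void mock_user_allow(struct mock_rpm_state *s)
{
	s->forbidden = 0;
}

/* The device may be runtime-suspended only if both conditions hold. */
static int mock_may_runtime_suspend(const struct mock_rpm_state *s)
{
	return s->enabled && !s->forbidden;
}
```

If user space has already unblocked runtime PM before the driver is loaded,
mock_may_runtime_suspend() becomes true at the moment of mock_driver_enable(),
which corresponds to the situation described in the bracketed note.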
996	
997	The runtime PM framework works by processing requests to suspend or resume
998	devices, or to check if they are idle (in which case it is reasonable to
999	subsequently request that they be suspended).  These requests are represented
1000	by work items put into the power management workqueue, pm_wq.  Although there
1001	are a few situations in which power management requests are automatically
1002	queued by the PM core (for example, after processing a request to resume a
1003	device the PM core automatically queues a request to check if the device is
1004	idle), device drivers are generally responsible for queuing power management
1005	requests for their devices.  For this purpose they should use the runtime PM
1006	helper functions provided by the PM core, discussed in
1007	Documentation/power/runtime_pm.txt.
1008	
1009	Devices can also be suspended and resumed synchronously, without placing a
1010	request into pm_wq.  In the majority of cases this also is done by their
1011	drivers that use helper functions provided by the PM core for this purpose.
1012	
1013	For more information on the runtime PM of devices refer to
1014	Documentation/power/runtime_pm.txt.
1015	
1016	
1017	4. Resources
1018	============
1019	
1020	PCI Local Bus Specification, Rev. 3.0
1021	PCI Bus Power Management Interface Specification, Rev. 1.2
1022	Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
1023	PCI Express Base Specification, Rev. 2.0
1024	Documentation/power/devices.txt
1025	Documentation/power/runtime_pm.txt