Documentation / PCI / MSI-HOWTO.txt




Based on kernel version 3.16. Page generated on 2014-08-06 21:40 EST.

1			The MSI Driver Guide HOWTO
2		Tom L Nguyen tom.l.nguyen@intel.com
3				10/03/2003
4		Revised Feb 12, 2004 by Martine Silbermann
5			email: Martine.Silbermann@hp.com
6		Revised Jun 25, 2004 by Tom L Nguyen
7		Revised Jul  9, 2008 by Matthew Wilcox <willy@linux.intel.com>
8			Copyright 2003, 2008 Intel Corporation
9	
10	1. About this guide
11	
12	This guide describes the basics of Message Signaled Interrupts (MSIs),
13	the advantages of using MSI over traditional interrupt mechanisms, how
14	to change your driver to use MSI or MSI-X and some basic diagnostics to
15	try if a device doesn't support MSIs.
16	
17	
18	2. What are MSIs?
19	
20	A Message Signaled Interrupt is a write from the device to a special
21	address which causes an interrupt to be received by the CPU.
22	
23	The MSI capability was first specified in PCI 2.2 and was later enhanced
24	in PCI 3.0 to allow each interrupt to be masked individually.  The MSI-X
25	capability was also introduced with PCI 3.0.  It supports more interrupts
26	per device than MSI and allows interrupts to be independently configured.
27	
28	Devices may support both MSI and MSI-X, but only one can be enabled at
29	a time.
30	
31	
32	3. Why use MSIs?
33	
34	There are three reasons why using MSIs can give an advantage over
35	traditional pin-based interrupts.
36	
37	Pin-based PCI interrupts are often shared amongst several devices.
38	To support this, the kernel must call each interrupt handler associated
39	with an interrupt, which leads to reduced performance for the system as
40	a whole.  MSIs are never shared, so this problem cannot arise.
41	
42	When a device writes data to memory, then raises a pin-based interrupt,
43	it is possible that the interrupt may arrive before all the data has
44	arrived in memory (this becomes more likely with devices behind PCI-PCI
45	bridges).  In order to ensure that all the data has arrived in memory,
46	the interrupt handler must read a register on the device which raised
47	the interrupt.  PCI transaction ordering rules require that all the data
48	arrive in memory before the value may be returned from the register.
49	Using MSIs avoids this problem as the interrupt-generating write cannot
50	pass the data writes, so by the time the interrupt is raised, the driver
51	knows that all the data has arrived in memory.
52	
53	PCI devices can only support a single pin-based interrupt per function.
54	Often drivers have to query the device to find out what event has
55	occurred, slowing down interrupt handling for the common case.  With
56	MSIs, a device can support more interrupts, allowing each interrupt
57	to be specialised to a different purpose.  One possible design gives
58	infrequent conditions (such as errors) their own interrupt which allows
59	the driver to handle the normal interrupt handling path more efficiently.
60	Other possible designs include giving one interrupt to each packet queue
61	in a network card or each port in a storage controller.
62	
63	
64	4. How to use MSIs
65	
66	PCI devices are initialised to use pin-based interrupts.  The device
67	driver has to set up the device to use MSI or MSI-X.  Not all machines
68	support MSIs correctly, and for those machines, the APIs described below
69	will simply fail and the device will continue to use pin-based interrupts.
70	
71	4.1 Include kernel support for MSIs
72	
73	To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
74	option enabled.  This option is only available on some architectures,
75	and it may depend on some other options also being set.  For example,
76	on x86, you must also enable X86_UP_APIC or SMP in order to see the
77	CONFIG_PCI_MSI option.
78	
79	4.2 Using MSI
80	
81	Most of the hard work is done for the driver in the PCI layer.  It simply
82	has to request that the PCI layer set up the MSI capability for this
83	device.
84	
85	4.2.1 pci_enable_msi
86	
87	int pci_enable_msi(struct pci_dev *dev)
88	
89	A successful call allocates ONE interrupt to the device, regardless
90	of how many MSIs the device supports.  The device is switched from
91	pin-based interrupt mode to MSI mode.  The dev->irq number is changed
92	to a new number which represents the message signaled interrupt;
93	consequently, this function should be called before the driver calls
94	request_irq(), because an MSI is delivered via a vector that is
95	different from the vector of a pin-based interrupt.
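
The call ordering described above can be sketched as follows.  This is a
userspace illustration only: 'struct pci_dev' and pci_enable_msi() are
reduced stand-ins for the kernel definitions, 'msi_ok' is a hypothetical
flag that models whether the platform grants MSI, and request_irq_stub()
is a hypothetical placeholder for request_irq().

```c
/*
 * Userspace stand-ins so the sketch compiles outside the kernel.
 * In a driver, the real definitions come from <linux/pci.h>.
 */
struct pci_dev {
	int irq;	/* current interrupt number */
	int msi_ok;	/* hypothetical: does the mock platform grant MSI? */
};

static int pci_enable_msi(struct pci_dev *dev)
{
	if (!dev->msi_ok)
		return -1;	/* stand-in for a negative errno */
	dev->irq = 64;		/* arbitrary new number chosen by the mock */
	return 0;
}

static int requested_irq = -1;

/* Hypothetical placeholder for request_irq(): records the number used. */
static int request_irq_stub(int irq)
{
	requested_irq = irq;
	return 0;
}

/*
 * Enable MSI first, then register the handler on whatever dev->irq is
 * at that point.  Returns 1 if MSI is in use, 0 for pin-based mode.
 */
static int foo_setup_irq(struct pci_dev *pdev)
{
	int using_msi = (pci_enable_msi(pdev) == 0);

	/* On failure dev->irq still names the pin-based interrupt. */
	if (request_irq_stub(pdev->irq) != 0)
		return -1;

	return using_msi;
}
```

Calling request_irq() before pci_enable_msi() would register the handler
on the pin-based number, which the device no longer uses once MSI mode
is enabled.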
96	
97	4.2.2 pci_enable_msi_range
98	
99	int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
100	
101	This function allows a device driver to request any number of MSI
102	interrupts within specified range from 'minvec' to 'maxvec'.
103	
If this function returns a positive number, it indicates the number of
MSI interrupts that have been successfully allocated.  In this case the
function switches the device from pin-based interrupt mode to MSI mode
and updates dev->irq to be the lowest of the new interrupts assigned to
it.  The other interrupts assigned to the device are in the range
dev->irq to dev->irq + returned value - 1.  The device driver can use the
returned number of successfully allocated MSI interrupts to further
allocate and initialize device resources.
112	
113	If this function returns a negative number, it indicates an error and
114	the driver should not attempt to request any more MSI interrupts for
115	this device.
116	
117	This function should be called before the driver calls request_irq(),
118	because MSI interrupts are delivered via vectors that are different
119	from the vector of a pin-based interrupt.
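
A minimal sketch of consuming the returned count, assuming a driver that
wants up to FOO_MAX_QUEUES vectors (a hypothetical constant).  The
pci_enable_msi_range() below is a userspace mock with a hypothetical
'msi_avail' field standing in for what the platform can grant; only the
use of the return value and of the consecutive range starting at
dev->irq is the point.

```c
#include <errno.h>

/* Reduced stand-in for the kernel structure. */
struct pci_dev {
	int irq;
	int msi_avail;	/* hypothetical: vectors the mock platform can grant */
};

/* Mock: grants min(maxvec, msi_avail) vectors, or fails below minvec. */
static int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
{
	int n = maxvec < dev->msi_avail ? maxvec : dev->msi_avail;

	if (minvec < 1 || maxvec < minvec)
		return -ERANGE;
	if (n < minvec)
		return -ENOSPC;

	dev->irq = 40;	/* arbitrary base number chosen by the mock */
	return n;
}

#define FOO_MAX_QUEUES 8

static int foo_irqs[FOO_MAX_QUEUES];

/* Returns the number of vectors in use, or a negative errno. */
static int foo_enable_queues(struct pci_dev *pdev)
{
	int i, nvec;

	nvec = pci_enable_msi_range(pdev, 1, FOO_MAX_QUEUES);
	if (nvec < 0)
		return nvec;

	/* The vectors are consecutive: dev->irq .. dev->irq + nvec - 1. */
	for (i = 0; i < nvec; i++)
		foo_irqs[i] = pdev->irq + i;	/* request_irq() would go here */

	return nvec;
}
```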
120	
121	It is ideal if drivers can cope with a variable number of MSI interrupts;
122	there are many reasons why the platform may not be able to provide the
123	exact number that a driver asks for.
124	
There could be devices that cannot operate with just any number of MSI
interrupts within a range.  See chapter 4.3.1.3 for an idea of how to
handle such devices for MSI-X; the same logic applies to MSI.
128	
4.2.2.1 Maximum possible number of MSI interrupts

The typical usage of MSI interrupts is to allocate as many vectors as
possible, likely up to the limit returned by the pci_msi_vec_count()
function:

static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
{
	return pci_enable_msi_range(pdev, 1, nvec);
}

Note that the value of the 'minvec' parameter is 1.  As 'minvec' is
inclusive, a value of 0 would be meaningless and could result in an error.

Some devices have a minimum limit on the number of MSI interrupts.
In this case the function could look like this:

static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
{
	return pci_enable_msi_range(pdev, FOO_DRIVER_MINIMUM_NVEC, nvec);
}

4.2.2.2 Exact number of MSI interrupts

If a driver is unable or unwilling to deal with a variable number of MSI
interrupts, it can request a particular number of interrupts by passing
that number to pci_enable_msi_range() as both the 'minvec' and 'maxvec'
parameters:

static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
{
	return pci_enable_msi_range(pdev, nvec, nvec);
}

Note that, unlike pci_enable_msi_exact(), which can also be used to
enable a particular number of MSI interrupts, pci_enable_msi_range()
returns either a negative errno or 'nvec' (not a negative errno or 0, as
pci_enable_msi_exact() does).

4.2.2.3 Single MSI mode

The most notable example of the request type described above is
enabling the single MSI mode for a device.  It can be done by passing
two 1s as 'minvec' and 'maxvec':

static int foo_driver_enable_single_msi(struct pci_dev *pdev)
{
	return pci_enable_msi_range(pdev, 1, 1);
}

Note that, unlike pci_enable_msi(), which can also be used to enable the
single MSI mode, pci_enable_msi_range() returns either a negative errno
or 1 (not a negative errno or 0, as pci_enable_msi() does).
182	
183	4.2.3 pci_enable_msi_exact
184	
185	int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
186	
187	This variation on pci_enable_msi_range() call allows a device driver to
188	request exactly 'nvec' MSIs.
189	
190	If this function returns a negative number, it indicates an error and
191	the driver should not attempt to request any more MSI interrupts for
192	this device.
193	
194	By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
195	returns zero in case of success, which indicates MSI interrupts have been
196	successfully allocated.
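
The difference in return conventions can be illustrated with a small
wrapper.  This is a sketch, not the kernel's implementation:
foo_enable_msi_exact() is a hypothetical name, and the
pci_enable_msi_range() below is a userspace mock driven by a
hypothetical 'msi_avail' field.

```c
#include <errno.h>

/* Reduced stand-in for the kernel structure. */
struct pci_dev {
	int msi_avail;	/* hypothetical: vectors the mock platform can grant */
};

/* Mock: grants min(maxvec, msi_avail) vectors, or fails below minvec. */
static int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
{
	int n = maxvec < dev->msi_avail ? maxvec : dev->msi_avail;

	if (minvec < 1 || maxvec < minvec)
		return -ERANGE;
	if (n < minvec)
		return -ENOSPC;
	return n;
}

/*
 * Same request as pci_enable_msi_range(dev, nvec, nvec), but success is
 * reported as 0 instead of 'nvec'.
 */
static int foo_enable_msi_exact(struct pci_dev *dev, int nvec)
{
	int rc = pci_enable_msi_range(dev, nvec, nvec);

	return rc < 0 ? rc : 0;
}
```

A caller that tests "rc > 0" for success works with the range variant
but would treat a successful exact call as a failure, so the two checks
should not be mixed.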
197	
198	4.2.4 pci_disable_msi
199	
200	void pci_disable_msi(struct pci_dev *dev)
201	
202	This function should be used to undo the effect of pci_enable_msi_range().
203	Calling it restores dev->irq to the pin-based interrupt number and frees
204	the previously allocated MSIs.  The interrupts may subsequently be assigned
205	to another device, so drivers should not cache the value of dev->irq.
206	
207	Before calling this function, a device driver must always call free_irq()
208	on any interrupt for which it previously called request_irq().
209	Failure to do so results in a BUG_ON(), leaving the device with
210	MSI enabled and thus leaking its vector.
211	
4.2.5 pci_msi_vec_count
213	
214	int pci_msi_vec_count(struct pci_dev *dev)
215	
216	This function could be used to retrieve the number of MSI vectors the
217	device requested (via the Multiple Message Capable register). The MSI
218	specification only allows the returned value to be a power of two,
219	up to a maximum of 2^5 (32).
220	
221	If this function returns a negative number, it indicates the device is
222	not capable of sending MSIs.
223	
224	If this function returns a positive number, it indicates the maximum
225	number of MSI interrupt vectors that could be allocated.
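
As an illustration of why the value is always a power of two: the
Multiple Message Capable field occupies bits 3:1 of the MSI Message
Control register and encodes the base-2 logarithm of the vector count.
The sketch below decodes it in userspace; MSI_FLAGS_QMASK and
msi_vec_count_from_flags() are local, hypothetical names, and the check
for reserved encodings is an addition of this sketch.

```c
#include <stdint.h>

/* Bits 3:1 of MSI Message Control: Multiple Message Capable. */
#define MSI_FLAGS_QMASK	0x000e

/*
 * Returns the number of vectors the device requests (1, 2, 4, 8, 16 or
 * 32), or -1 for the reserved encodings 6 and 7.
 */
static int msi_vec_count_from_flags(uint16_t msgctl)
{
	unsigned int mmc = (msgctl & MSI_FLAGS_QMASK) >> 1;

	if (mmc > 5)
		return -1;

	return 1 << mmc;
}
```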
226	
227	4.3 Using MSI-X
228	
229	The MSI-X capability is much more flexible than the MSI capability.
230	It supports up to 2048 interrupts, each of which can be controlled
231	independently.  To support this flexibility, drivers must use an array of
232	`struct msix_entry':
233	
struct msix_entry {
	u16	vector;	/* kernel uses to write allocated vector */
	u16	entry;	/* driver uses to specify entry */
};
238	
239	This allows for the device to use these interrupts in a sparse fashion;
240	for example, it could use interrupts 3 and 1027 and yet allocate only a
241	two-element array.  The driver is expected to fill in the 'entry' value
242	in each element of the array to indicate for which entries the kernel
243	should assign interrupts; it is invalid to fill in two entries with the
244	same number.
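
A minimal userspace sketch of preparing such an array, assuming the
driver keeps a list of the table indices it wants.  The structure
mirrors the layout shown above using <stdint.h> types;
foo_fill_msix_entries() and the explicit range check are hypothetical
additions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout shown in this guide, with userspace types. */
struct msix_entry {
	uint16_t vector;	/* written by the kernel on allocation */
	uint16_t entry;		/* chosen by the driver */
};

#define FOO_MSIX_TABLE_MAX	2048

/*
 * Copy the wanted table indices into 'entries'.  Returns 0 on success,
 * or -1 if an index is out of range or appears twice.
 */
static int foo_fill_msix_entries(struct msix_entry *entries,
				 const uint16_t *indices, size_t n)
{
	size_t i, j;

	for (i = 0; i < n; i++) {
		if (indices[i] >= FOO_MSIX_TABLE_MAX)
			return -1;

		for (j = 0; j < i; j++)
			if (indices[j] == indices[i])
				return -1;	/* duplicate entry number */

		entries[i].entry = indices[i];
		entries[i].vector = 0;
	}

	return 0;
}
```

With the indices { 3, 1027 } this fills a two-element array, matching
the sparse example in the text.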
245	
246	4.3.1 pci_enable_msix_range
247	
248	int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
249				  int minvec, int maxvec)
250	
251	Calling this function asks the PCI subsystem to allocate any number of
252	MSI-X interrupts within specified range from 'minvec' to 'maxvec'.
253	The 'entries' argument is a pointer to an array of msix_entry structs
254	which should be at least 'maxvec' entries in size.
255	
256	On success, the device is switched into MSI-X mode and the function
257	returns the number of MSI-X interrupts that have been successfully
258	allocated.  In this case the 'vector' member in entries numbered from
259	0 to the returned value - 1 is populated with the interrupt number;
260	the driver should then call request_irq() for each 'vector' that it
261	decides to use.  The device driver is responsible for keeping track of the
262	interrupts assigned to the MSI-X vectors so it can free them again later.
263	Device driver can use the returned number of successfully allocated MSI-X
264	interrupts to further allocate and initialize device resources.
265	
266	If this function returns a negative number, it indicates an error and
267	the driver should not attempt to allocate any more MSI-X interrupts for
268	this device.
269	
270	This function, in contrast with pci_enable_msi_range(), does not adjust
271	dev->irq.  The device will not generate interrupts for this interrupt
272	number once MSI-X is enabled.
273	
274	Device drivers should normally call this function once per device
275	during the initialization phase.
276	
277	It is ideal if drivers can cope with a variable number of MSI-X interrupts;
278	there are many reasons why the platform may not be able to provide the
279	exact number that a driver asks for.
280	
There could be devices that cannot operate with just any number of MSI-X
interrupts within a range.  For example, a network adapter might need, say,
four vectors for each queue it provides.  Therefore, the number of MSI-X
interrupts allocated should be a multiple of four.  In this case
pci_enable_msix_range() cannot be used alone to request MSI-X interrupts
(since it can allocate any number within the range, without any notion of
the multiple of four), and the device driver must implement custom logic
to request the required number of MSI-X interrupts.
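
One possible shape for such custom logic, sketched in userspace for the
four-vectors-per-queue case: request an exact multiple of four and drop
one queue at a time while the request fails with -ENOSPC.  The
pci_enable_msix_range() below is a mock driven by a hypothetical
'msix_avail' field, and FOO_VECS_PER_QUEUE and foo_enable_msix_queues()
are hypothetical names.

```c
#include <errno.h>

/* Reduced stand-ins for the kernel structures. */
struct pci_dev {
	int msix_avail;	/* hypothetical: vectors the mock platform can grant */
};

struct msix_entry {
	unsigned short vector;
	unsigned short entry;
};

/* Mock: grants min(maxvec, msix_avail) vectors, or fails below minvec. */
static int pci_enable_msix_range(struct pci_dev *dev,
				 struct msix_entry *entries,
				 int minvec, int maxvec)
{
	int n = maxvec < dev->msix_avail ? maxvec : dev->msix_avail;
	int i;

	if (minvec < 1 || maxvec < minvec)
		return -ERANGE;
	if (n < minvec)
		return -ENOSPC;

	for (i = 0; i < n; i++)
		entries[i].vector = 100 + i;	/* arbitrary mock numbers */
	return n;
}

#define FOO_VECS_PER_QUEUE	4

/* Returns the number of queues enabled, or a negative errno. */
static int foo_enable_msix_queues(struct pci_dev *pdev,
				  struct msix_entry *entries, int max_queues)
{
	int queues, rc;

	for (queues = max_queues; queues > 0; queues--) {
		int nvec = queues * FOO_VECS_PER_QUEUE;

		/* minvec == maxvec, so only a full multiple is accepted. */
		rc = pci_enable_msix_range(pdev, entries, nvec, nvec);
		if (rc != -ENOSPC)
			return rc < 0 ? rc : queues;
	}

	return -ENOSPC;
}
```

Only -ENOSPC triggers a retry with fewer queues; any other error is
returned to the caller unchanged.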
289	
290	4.3.1.1 Maximum possible number of MSI-X interrupts
291	
292	The typical usage of MSI-X interrupts is to allocate as many vectors as
293	possible, likely up to the limit returned by pci_msix_vec_count() function:
294	
static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
{
	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
				     1, nvec);
}
300	
301	Note the value of 'minvec' parameter is 1.  As 'minvec' is inclusive,
302	the value of 0 would be meaningless and could result in error.
303	
304	Some devices have a minimal limit on number of MSI-X interrupts.
305	In this case the function could look like this:
306	
static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
{
	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
				     FOO_DRIVER_MINIMUM_NVEC, nvec);
}
312	
313	4.3.1.2 Exact number of MSI-X interrupts
314	
315	If a driver is unable or unwilling to deal with a variable number of MSI-X
316	interrupts it could request a particular number of interrupts by passing
317	that number to pci_enable_msix_range() function as both 'minvec' and 'maxvec'
318	parameters:
319	
static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
{
	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
				     nvec, nvec);
}
325	
326	Note, unlike pci_enable_msix_exact() function, which could be also used to
327	enable a particular number of MSI-X interrupts, pci_enable_msix_range()
328	returns either a negative errno or 'nvec' (not negative errno or 0 - as
329	pci_enable_msix_exact() does).
330	
331	4.3.1.3 Specific requirements to the number of MSI-X interrupts
332	
As noted above, there could be devices that cannot operate with just any
number of MSI-X interrupts within a range.  For example, assume a device
that is only capable of sending a number of MSI-X interrupts which is a
power of two.  A routine that enables MSI-X mode for such a device might
look like this:
337	
/*
 * Assume 'minvec' and 'maxvec' are non-zero
 */
static int foo_driver_enable_msix(struct foo_adapter *adapter,
				  int minvec, int maxvec)
{
	int rc;

	minvec = roundup_pow_of_two(minvec);
	maxvec = rounddown_pow_of_two(maxvec);

	if (minvec > maxvec)
		return -ERANGE;

retry:
	rc = pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
				   maxvec, maxvec);
	/*
	 * -ENOSPC is the only error code allowed to be analyzed
	 */
	if (rc == -ENOSPC) {
		if (maxvec == 1)
			return -ENOSPC;

		maxvec /= 2;

		if (minvec > maxvec)
			return -ENOSPC;

		goto retry;
	}

	return rc;
}
372	
Note how the pci_enable_msix_range() return value is analyzed for a
fallback: any error code other than -ENOSPC indicates a fatal error and
should not be retried.
376	
377	4.3.2 pci_enable_msix_exact
378	
379	int pci_enable_msix_exact(struct pci_dev *dev,
380				  struct msix_entry *entries, int nvec)
381	
382	This variation on pci_enable_msix_range() call allows a device driver to
383	request exactly 'nvec' MSI-Xs.
384	
385	If this function returns a negative number, it indicates an error and
386	the driver should not attempt to allocate any more MSI-X interrupts for
387	this device.
388	
389	By contrast with pci_enable_msix_range() function, pci_enable_msix_exact()
390	returns zero in case of success, which indicates MSI-X interrupts have been
391	successfully allocated.
392	
393	Another version of a routine that enables MSI-X mode for a device with
394	specific requirements described in chapter 4.3.1.3 might look like this:
395	
/*
 * Assume 'minvec' and 'maxvec' are non-zero
 */
static int foo_driver_enable_msix(struct foo_adapter *adapter,
				  int minvec, int maxvec)
{
	int rc;

	minvec = roundup_pow_of_two(minvec);
	maxvec = rounddown_pow_of_two(maxvec);

	if (minvec > maxvec)
		return -ERANGE;

retry:
	rc = pci_enable_msix_exact(adapter->pdev,
				   adapter->msix_entries, maxvec);

	/*
	 * -ENOSPC is the only error code allowed to be analyzed
	 */
	if (rc == -ENOSPC) {
		if (maxvec == 1)
			return -ENOSPC;

		maxvec /= 2;

		if (minvec > maxvec)
			return -ENOSPC;

		goto retry;
	} else if (rc < 0) {
		return rc;
	}

	return maxvec;
}
433	
434	4.3.3 pci_disable_msix
435	
436	void pci_disable_msix(struct pci_dev *dev)
437	
438	This function should be used to undo the effect of pci_enable_msix_range().
439	It frees the previously allocated MSI-X interrupts. The interrupts may
440	subsequently be assigned to another device, so drivers should not cache
441	the value of the 'vector' elements over a call to pci_disable_msix().
442	
443	Before calling this function, a device driver must always call free_irq()
444	on any interrupt for which it previously called request_irq().
445	Failure to do so results in a BUG_ON(), leaving the device with
446	MSI-X enabled and thus leaking its vector.
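
The required teardown order can be modelled in userspace as below.  All
names are hypothetical: 'struct foo_adapter' tracks which vectors still
have a handler registered, foo_free_irq() stands in for free_irq(), and
foo_disable_msix() stands in for pci_disable_msix(), returning -1 in the
situation where the real function would hit the BUG_ON().

```c
#define FOO_MAX_VECS	8

struct foo_adapter {
	int nvec;			/* vectors allocated at enable time */
	int irq_held[FOO_MAX_VECS];	/* 1 while a handler is registered */
	int msix_enabled;
};

/* Stand-in for free_irq(): drops the handler on vector 'i'. */
static void foo_free_irq(struct foo_adapter *a, int i)
{
	a->irq_held[i] = 0;
}

/* Stand-in for pci_disable_msix(): refuses while handlers remain. */
static int foo_disable_msix(struct foo_adapter *a)
{
	int i;

	for (i = 0; i < a->nvec; i++)
		if (a->irq_held[i])
			return -1;	/* the real call would BUG_ON() here */

	a->msix_enabled = 0;
	return 0;
}

/* Correct order: free every requested interrupt, then disable MSI-X. */
static int foo_teardown(struct foo_adapter *a)
{
	int i;

	for (i = 0; i < a->nvec; i++)
		foo_free_irq(a, i);

	return foo_disable_msix(a);
}
```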
447	
4.3.4 The MSI-X Table

The MSI-X capability specifies a BAR and offset within that BAR for the
MSI-X Table.  This address is mapped by the PCI subsystem, and should not
be accessed directly by the device driver.  If the driver wishes to
mask or unmask an interrupt, it should call disable_irq() / enable_irq().

4.3.5 pci_msix_vec_count
456	
457	int pci_msix_vec_count(struct pci_dev *dev)
458	
459	This function could be used to retrieve number of entries in the device
460	MSI-X table.
461	
462	If this function returns a negative number, it indicates the device is
463	not capable of sending MSI-Xs.
464	
465	If this function returns a positive number, it indicates the maximum
466	number of MSI-X interrupt vectors that could be allocated.
467	
468	4.4 Handling devices implementing both MSI and MSI-X capabilities
469	
470	If a device implements both MSI and MSI-X capabilities, it can
471	run in either MSI mode or MSI-X mode, but not both simultaneously.
472	This is a requirement of the PCI spec, and it is enforced by the
473	PCI layer.  Calling pci_enable_msi_range() when MSI-X is already
474	enabled or pci_enable_msix_range() when MSI is already enabled
475	results in an error.  If a device driver wishes to switch between MSI
476	and MSI-X at runtime, it must first quiesce the device, then switch
477	it back to pin-interrupt mode, before calling pci_enable_msi_range()
478	or pci_enable_msix_range() and resuming operation.  This is not expected
479	to be a common operation but may be useful for debugging or testing
480	during development.
481	
482	4.5 Considerations when using MSIs
483	
484	4.5.1 Choosing between MSI-X and MSI
485	
If your device supports both MSI-X and MSI capabilities, you should use
the MSI-X facilities in preference to the MSI facilities.  As mentioned
above, MSI-X supports any number of interrupts between 1 and 2048.
In contrast, MSI is restricted to a maximum of 32 interrupts (and the
number must be a power of two).  In addition, the MSI interrupt vectors
must be allocated consecutively, so the system might not be able to
allocate as many vectors for MSI as it could for MSI-X.  On some
platforms, MSI interrupts must all be targeted at the same set of CPUs
whereas MSI-X interrupts can all be targeted at different CPUs.
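
A driver following this preference might try MSI-X first, then MSI, and
finally stay with the pin-based interrupt.  The sketch below shows that
order in userspace; both enable functions are mocks driven by
hypothetical 'msix_avail' and 'msi_avail' fields, and 'enum
foo_irq_mode' and foo_pick_irq_mode() are hypothetical names.

```c
#include <errno.h>

enum foo_irq_mode { FOO_IRQ_PIN, FOO_IRQ_MSI, FOO_IRQ_MSIX };

/* Reduced stand-ins for the kernel structures. */
struct pci_dev {
	int msix_avail;	/* hypothetical: MSI-X vectors the mock can grant */
	int msi_avail;	/* hypothetical: MSI vectors the mock can grant */
};

struct msix_entry {
	unsigned short vector;
	unsigned short entry;
};

/* Shared mock policy: grant min(maxvec, avail), or fail below minvec. */
static int foo_mock_grant(int avail, int minvec, int maxvec)
{
	int n = maxvec < avail ? maxvec : avail;

	if (minvec < 1 || maxvec < minvec)
		return -ERANGE;
	return n < minvec ? -ENOSPC : n;
}

static int pci_enable_msix_range(struct pci_dev *dev,
				 struct msix_entry *entries,
				 int minvec, int maxvec)
{
	(void)entries;	/* the mock does not fill in 'vector' */
	return foo_mock_grant(dev->msix_avail, minvec, maxvec);
}

static int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
{
	return foo_mock_grant(dev->msi_avail, minvec, maxvec);
}

/* Stores the usable interrupt count in '*got' and returns the mode. */
static enum foo_irq_mode foo_pick_irq_mode(struct pci_dev *pdev,
					   struct msix_entry *entries,
					   int want, int *got)
{
	int rc;

	rc = pci_enable_msix_range(pdev, entries, 1, want);
	if (rc > 0) {
		*got = rc;
		return FOO_IRQ_MSIX;
	}

	rc = pci_enable_msi_range(pdev, 1, want);
	if (rc > 0) {
		*got = rc;
		return FOO_IRQ_MSI;
	}

	*got = 1;	/* a single pin-based interrupt per function */
	return FOO_IRQ_PIN;
}
```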
495	
496	4.5.2 Spinlocks
497	
498	Most device drivers have a per-device spinlock which is taken in the
499	interrupt handler.  With pin-based interrupts or a single MSI, it is not
500	necessary to disable interrupts (Linux guarantees the same interrupt will
501	not be re-entered).  If a device uses multiple interrupts, the driver
502	must disable interrupts while the lock is held.  If the device sends
503	a different interrupt, the driver will deadlock trying to recursively
504	acquire the spinlock.
505	
506	There are two solutions.  The first is to take the lock with
507	spin_lock_irqsave() or spin_lock_irq() (see
508	Documentation/DocBook/kernel-locking).  The second is to specify
509	IRQF_DISABLED to request_irq() so that the kernel runs the entire
510	interrupt routine with interrupts disabled.
511	
512	If your MSI interrupt routine does not hold the lock for the whole time
513	it is running, the first solution may be best.  The second solution is
514	normally preferred as it avoids making two transitions from interrupt
515	disabled to enabled and back again.
516	
517	4.6 How to tell whether MSI/MSI-X is enabled on a device
518	
519	Using 'lspci -v' (as root) may show some devices with "MSI", "Message
520	Signalled Interrupts" or "MSI-X" capabilities.  Each of these capabilities
521	has an 'Enable' flag which is followed with either "+" (enabled)
522	or "-" (disabled).
523	
524	
525	5. MSI quirks
526	
527	Several PCI chipsets or devices are known not to support MSIs.
528	The PCI stack provides three ways to disable MSIs:
529	
530	1. globally
531	2. on all devices behind a specific bridge
532	3. on a single device
533	
534	5.1. Disabling MSIs globally
535	
536	Some host chipsets simply don't support MSIs properly.  If we're
537	lucky, the manufacturer knows this and has indicated it in the ACPI
538	FADT table.  In this case, Linux automatically disables MSIs.
539	Some boards don't include this information in the table and so we have
540	to detect them ourselves.  The complete list of these is found near the
541	quirk_disable_all_msi() function in drivers/pci/quirks.c.
542	
543	If you have a board which has problems with MSIs, you can pass pci=nomsi
544	on the kernel command line to disable MSIs on all devices.  It would be
545	in your best interests to report the problem to linux-pci@vger.kernel.org
546	including a full 'lspci -v' so we can add the quirks to the kernel.
547	
548	5.2. Disabling MSIs below a bridge
549	
550	Some PCI bridges are not able to route MSIs between busses properly.
551	In this case, MSIs must be disabled on all devices behind the bridge.
552	
553	Some bridges allow you to enable MSIs by changing some bits in their
554	PCI configuration space (especially the Hypertransport chipsets such
555	as the nVidia nForce and Serverworks HT2000).  As with host chipsets,
556	Linux mostly knows about them and automatically enables MSIs if it can.
557	If you have a bridge unknown to Linux, you can enable
558	MSIs in configuration space using whatever method you know works, then
559	enable MSIs on that bridge by doing:
560	
561	       echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
562	
563	where $bridge is the PCI address of the bridge you've enabled (eg
564	0000:00:0e.0).
565	
566	To disable MSIs, echo 0 instead of 1.  Changing this value should be
567	done with caution as it could break interrupt handling for all devices
568	below this bridge.
569	
570	Again, please notify linux-pci@vger.kernel.org of any bridges that need
571	special handling.
572	
573	5.3. Disabling MSIs on a single device
574	
575	Some devices are known to have faulty MSI implementations.  Usually this
576	is handled in the individual device driver, but occasionally it's necessary
577	to handle this with a quirk.  Some drivers have an option to disable use
578	of MSI.  While this is a convenient workaround for the driver author,
579	it is not good practise, and should not be emulated.
580	
581	5.4. Finding why MSIs are disabled on a device
582	
583	From the above three sections, you can see that there are many reasons
584	why MSIs may not be enabled for a given device.  Your first step should
585	be to examine your dmesg carefully to determine whether MSIs are enabled
586	for your machine.  You should also check your .config to be sure you
587	have enabled CONFIG_PCI_MSI.
588	
589	Then, 'lspci -t' gives the list of bridges above a device.  Reading
590	/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
591	or disabled (0).  If 0 is found in any of the msi_bus files belonging
592	to bridges between the PCI root and the device, MSIs are disabled.
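
The msi_bus check can also be done from a small userspace program.  The
sketch below reads one msi_bus file and reports its value;
parse_msi_bus() and read_msi_bus() are hypothetical helper names, and
the caller is assumed to supply the sysfs path of each bridge between
the PCI root and the device.

```c
#include <stdio.h>

/* Returns 1 (MSIs enabled), 0 (disabled) or -1 (unrecognised text). */
static int parse_msi_bus(const char *text)
{
	if (text[0] == '1')
		return 1;
	if (text[0] == '0')
		return 0;
	return -1;
}

/*
 * Reads a file such as /sys/bus/pci/devices/<address>/msi_bus.
 * Returns 1, 0, or -1 if the file cannot be read or parsed.
 */
static int read_msi_bus(const char *path)
{
	char buf[8];
	int rc = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	if (fgets(buf, sizeof(buf), f))
		rc = parse_msi_bus(buf);

	fclose(f);
	return rc;
}
```

A result of 0 for any bridge on the path means MSIs are disabled for the
device below it.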
593	
594	It is also worth checking the device driver to see whether it supports MSIs.
595	For example, it may contain calls to pci_enable_msi_range() or
596	pci_enable_msix_range().

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.