
Documentation / powerpc / pci_iov_resource_on_powernv.txt


Based on kernel version 4.16.1. Page generated on 2018-04-09 11:53 EST.

1	Wei Yang <weiyang@linux.vnet.ibm.com>
2	Benjamin Herrenschmidt <benh@au1.ibm.com>
3	Bjorn Helgaas <bhelgaas@google.com>
4	26 Aug 2014
5	
6	This document describes the requirement from hardware for PCI MMIO resource
7	sizing and assignment on PowerKVM and how generic PCI code handles this
8	requirement. The first two sections describe the concepts of Partitionable
9	Endpoints and the implementation on P8 (IODA2). The next two sections talk
10	about considerations on enabling SR-IOV on IODA2.
11	
12	1. Introduction to Partitionable Endpoints
13	
14	A Partitionable Endpoint (PE) is a way to group the various resources
15	associated with a device or a set of devices to provide isolation between
16	partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
17	to freeze a device that is causing errors in order to limit the possibility
18	of propagation of bad data.
19	
20	There is thus, in HW, a table of PE states that contains a pair of "frozen"
21	state bits (one for MMIO and one for DMA; they get set together but can be
22	cleared independently) for each PE.
23	
24	When a PE is frozen, all stores in any direction are dropped and all loads
25	return an all-1s value. MSIs are also blocked. There's a bit more state that
26	captures things like the details of the error that caused the freeze etc., but
27	that's not critical.
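
The freeze behaviour can be modelled in a few lines.  This is only an
illustrative sketch, not how the hardware or the kernel implements it: the
class PEState and its methods are hypothetical names, a 64-bit load width is
assumed, and only the MMIO side is modelled.

```python
class PEState:
    """Toy model of one entry in the PE state table (hypothetical names)."""

    ALL_ONES_64 = 0xFFFF_FFFF_FFFF_FFFF  # assumed 64-bit load width

    def __init__(self):
        self.mmio_frozen = False
        self.dma_frozen = False

    def freeze(self):
        # Both "frozen" bits are set together when the PE freezes.
        self.mmio_frozen = True
        self.dma_frozen = True

    def clear_mmio(self):
        # The bits can be cleared independently of each other.
        self.mmio_frozen = False

    def clear_dma(self):
        self.dma_frozen = False

    def mmio_load(self, device_value):
        # A frozen PE returns all 1s instead of the device's data.
        return self.ALL_ONES_64 if self.mmio_frozen else device_value

    def mmio_store(self, memory, addr, value):
        # A frozen PE silently drops the store.
        if not self.mmio_frozen:
            memory[addr] = value
```

A caller that reads all 1s from a device register cannot tell from the value
alone whether the PE is frozen; it has to check the PE state as well.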
28	
29	The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
30	are matched to their corresponding PEs.
31	
32	The following section provides a rough description of what we have on P8
33	(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
34	is a completely separate HW entity that replicates the entire logic, so has
35	its own set of PEs, etc.
36	
37	2. Implementation of Partitionable Endpoints on P8 (IODA2)
38	
39	P8 supports up to 256 Partitionable Endpoints per PHB.
40	
41	  * Inbound
42	
43	    For DMA, MSIs and inbound PCIe error messages, we have a table (in
44	    memory but accessed in HW by the chip) that provides a direct
45	    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
46	    We call this the RTT.
47	
48	    - For DMA we then provide an entire address space for each PE that can
49	      contain two "windows", depending on the value of PCI address bit 59.
50	      Each window can be configured to be remapped via a "TCE table" (IOMMU
51	      translation table), which has various configurable characteristics
52	      not described here.
53	
54	    - For MSIs, we have two windows in the address space (one at the top of
55	      the 32-bit space and one much higher) which, via a combination of the
56	      address and MSI value, will result in one of the 2048 interrupts per
57	      bridge being triggered.  There's a PE# in the interrupt controller
58	      descriptor table as well which is compared with the PE# obtained from
59	      the RTT to "authorize" the device to emit that specific interrupt.
60	
61	    - Error messages just use the RTT.
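
The inbound lookups can be sketched as follows.  The RID packing (8-bit bus,
5-bit device, 3-bit function) is the standard PCIe layout; everything else
here (the rtt list, its default of PE 0, and the function names) is a
hypothetical model for illustration, not the in-memory format the chip uses.

```python
def pci_rid(bus, dev, fn):
    """Pack bus/dev/fn into the 16-bit PCIe Requester ID."""
    assert 0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8
    return (bus << 8) | (dev << 3) | fn

# Hypothetical RTT: one PE# per possible RID (default PE 0 is an assumption).
rtt = [0] * 65536
rtt[pci_rid(0x01, 0x00, 0)] = 5

def dma_window(pci_addr):
    """Select one of the two per-PE DMA windows from PCI address bit 59."""
    return (pci_addr >> 59) & 1

def msi_allowed(rid, descriptor_pe):
    """An MSI is authorized only if the PE# stored with the interrupt
    descriptor matches the PE# the RTT gives for the sender."""
    return rtt[rid] == descriptor_pe
```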
62	
63	  * Outbound.  That's where the tricky part is.
64	
65	    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
66	    from the CPU address space to the PCI address space.  There is one M32
67	    window and sixteen M64 windows.  They have different characteristics.
68	    First what they have in common: they forward a configurable portion of
69	    the CPU address space to the PCIe bus and must be naturally aligned and a
70	    power of two in size.  The rest is different:
71	
72	    - The M32 window:
73	
74	      * Is limited to 4GB in size.
75	
76	      * Drops the top bits of the address (above the size) and replaces
77		them with a configurable value.  This is typically used to generate
78		32-bit PCIe accesses.  We configure that window at boot from FW and
79		don't touch it from Linux; it's usually set to forward a 2GB
80		portion of address space from the CPU to PCIe addresses
81		0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
82		reserved for MSIs but this is not a problem at this point; we just
83		need to ensure Linux doesn't assign anything there.  The M32 logic
84		ignores that, however, and will forward in that space if we try.)
85	
86	      * It is divided into 256 segments of equal size.  A table in the chip
87		maps each segment to a PE#.  That allows portions of the MMIO space
88		to be assigned to PEs on a segment granularity.  For a 2GB window,
89		the segment granularity is 2GB/256 = 8MB.
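
The M32 segment arithmetic can be sketched like this, using the usual 2GB
window at 0x8000_0000 on the PCI side.  The table m32_seg_to_pe and the
function names are hypothetical; the real table lives in the chip.

```python
M32_BASE = 0x8000_0000                 # PCI-side base of the usual window
M32_SIZE = 2 << 30                     # 2GB
NUM_SEGMENTS = 256
SEG_SIZE = M32_SIZE // NUM_SEGMENTS    # 8MB for a 2GB window

# Hypothetical segment-to-PE# table (default PE 0 is an assumption).
m32_seg_to_pe = [0] * NUM_SEGMENTS

def m32_segment(pci_addr):
    """Return the index of the M32 segment that contains pci_addr."""
    if not M32_BASE <= pci_addr < M32_BASE + M32_SIZE:
        raise ValueError("address outside the M32 window")
    return (pci_addr - M32_BASE) // SEG_SIZE

def m32_pe(pci_addr):
    """Look up the PE# for pci_addr through the segment table."""
    return m32_seg_to_pe[m32_segment(pci_addr)]
```

Because the mapping goes through a table, any segment can be pointed at any
PE#, which is what makes the M32 window flexible.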
90	
91	    Now, this is the "main" window we use in Linux today (excluding
92	    SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
93	    onto a segment alignment/granularity so that the space behind a bridge
94	    can be assigned to a PE.
95	
96	    Ideally we would like to be able to have individual functions in PEs
97	    but that would mean using a completely different address allocation
98	    scheme where individual function BARs can be "grouped" to fit in one or
99	    more segments.
100	
101	    - The M64 windows:
102	
103	      * Must be at least 256MB in size.
104	
105	      * Do not translate addresses (the address on PCIe is the same as the
106		address on the PowerBus).  There is a way to also set the top 14
107		bits which are not conveyed by PowerBus but we don't use this.
108	
109	      * Can be configured to be segmented.  When not segmented, we can
110		specify the PE# for the entire window.  When segmented, a window
111		has 256 segments; however, there is no table for mapping a segment
112		to a PE#.  The segment number *is* the PE#.
113	
114	      * Support overlaps.  If an address is covered by multiple windows,
115		there's a defined ordering for which window applies.
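
For comparison with M32, here is a minimal sketch of how a PE# falls out of
an M64 window.  The function m64_pe and its parameters are hypothetical; the
checks only restate the constraints listed above (power-of-two size, natural
alignment, at least 256MB).

```python
def m64_pe(addr, win_base, win_size, segmented=True, window_pe=None):
    """Return the PE# for addr inside one M64 window (illustrative only)."""
    if win_size < (256 << 20) or win_size & (win_size - 1):
        raise ValueError("window must be a power of two, at least 256MB")
    if win_base % win_size:
        raise ValueError("window must be naturally aligned")
    if not win_base <= addr < win_base + win_size:
        raise ValueError("address outside the window")
    if not segmented:
        # One PE# is configured for the entire window.
        return window_pe
    # Segmented: no lookup table, the segment number *is* the PE#.
    return (addr - win_base) // (win_size // 256)
```

With a 64GB segmented window, each segment is 64GB/256 = 256MB, so an address
3 * 256MB into the window lands in PE 3 and cannot be remapped elsewhere.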
116	
117	    We have code (fairly new compared to the M32 stuff) that exploits that
118	    for large BARs in 64-bit space:
119	
120	    We configure an M64 window to cover the entire region of address space
121	    that has been assigned by FW for the PHB (about 64GB, ignore the space
122	    for the M32, it comes out of a different "reserve").  We configure it
123	    as segmented.
124	
125	    Then we do the same thing as with M32, using the bridge alignment
126	    trick, to match to those giant segments.
127	
128	    Since we cannot remap, we have two additional constraints:
129	
130	    - We do the PE# allocation *after* the 64-bit space has been assigned
131	      because the addresses we use directly determine the PE#.  We then
132	      update the M32 PE# for the devices that use both 32-bit and 64-bit
133	      spaces or assign the remaining PE# to 32-bit only devices.
134	
135	    - We cannot "group" segments in HW, so if a device ends up using more
136	      than one segment, we end up with more than one PE#.  There is a HW
137	      mechanism to make the freeze state cascade to "companion" PEs but
138	      that only works for PCIe error messages (typically used so that if
139	      you freeze a switch, it freezes all its children).  So we do it in
140	      SW.  We lose a bit of effectiveness of EEH in that case, but that's
141	      the best we found.  So when any of the PEs freezes, we freeze the
142	      other ones for that "domain".  We thus introduce the concept of
143	      "master PE" which is the one used for DMA, MSIs, etc., and "secondary
144	      PEs" that are used for the remaining M64 segments.
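
The software cascade described above can be pictured roughly as follows.
This is a toy model under our own naming (PEDomain, on_freeze); it is not the
EEH code in the kernel.

```python
class PEDomain:
    """A master PE plus the secondary PEs covering extra M64 segments."""

    def __init__(self, master_pe, secondary_pes):
        self.master_pe = master_pe          # used for DMA, MSIs, etc.
        self.secondary_pes = list(secondary_pes)
        self.frozen = set()

    def all_pes(self):
        return [self.master_pe] + self.secondary_pes

    def on_freeze(self, pe):
        """Called when hardware reports that one PE froze."""
        if pe not in self.all_pes():
            return
        # Hardware will not cascade this for us, so software freezes
        # every PE that belongs to the same domain.
        self.frozen.update(self.all_pes())
```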
145	
146	    We would like to investigate using additional M64 windows in "single
147	    PE" mode to overlay over specific BARs to work around some of that, for
148	    example for devices with very large BARs, e.g., GPUs.  It would make
149	    sense, but we haven't done it yet.
150	
151	3. Considerations for SR-IOV on PowerKVM
152	
153	  * SR-IOV Background
154	
155	    The PCIe SR-IOV feature allows a single Physical Function (PF) to
156	    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
157	    Capability control the number of VFs and whether they are enabled.
158	
159	    When VFs are enabled, they appear in Configuration Space like normal
160	    PCI devices, but the BARs in VF config space headers are unusual.  For
161	    a non-VF device, software uses BARs in the config space header to
162	    discover the BAR sizes and assign addresses for them.  For VF devices,
163	    software uses VF BAR registers in the *PF* SR-IOV Capability to
164	    discover sizes and assign addresses.  The BARs in the VF's config space
165	    header are read-only zeros.
166	
167	    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
168	    base address for all the corresponding VF(n) BARs.  For example, if the
169	    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
170	    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
171	    This region is divided into eight contiguous 1MB regions, each of which
172	    is a BAR0 for one of the VFs.  Note that even though the VF BAR
173	    describes an 8MB region, the alignment requirement is for a single VF,
174	    i.e., 1MB in this example.
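
The eight-VF example can be worked through in code.  The helper name
vf_bar_layout and the base address used below are made up for illustration.

```python
def vf_bar_layout(vf_bar_base, vf_bar_size, num_vfs):
    """Return (address, size) of the BAR for each VF, in VF order."""
    if vf_bar_size <= 0 or vf_bar_size & (vf_bar_size - 1):
        raise ValueError("VF BAR size must be a power of two")
    if vf_bar_base % vf_bar_size:
        # Alignment is to a single VF BAR, not to the whole region.
        raise ValueError("base must be aligned to one VF BAR size")
    return [(vf_bar_base + n * vf_bar_size, vf_bar_size)
            for n in range(num_vfs)]
```

For instance, vf_bar_layout(0x9010_0000, 1 << 20, 8) is accepted even though
0x9010_0000 is only 1MB-aligned, not 8MB-aligned; VF7's BAR0 then starts at
0x9080_0000.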
175	
176	  There are several strategies for isolating VFs in PEs:
177	
178	  - M32 window: There's one M32 window, and it is split into 256
179	    equally-sized segments.  The finest granularity possible is a 256MB
180	    window with 1MB segments.  VF BARs that are 1MB or larger could be
181	    mapped to separate PEs in this window.  Each segment can be
182	    individually mapped to a PE via the lookup table, so this is quite
183	    flexible, but it works best when all the VF BARs are the same size.  If
184	    they are different sizes, the entire window has to be small enough that
185	    the segment size matches the smallest VF BAR, which means larger VF
186	    BARs span several segments.
187	
188	  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
189	    to a single PE, so it could only isolate one VF.
190	
191	  - Single segmented M64 windows: A segmented M64 window could be used just
192	    like the M32 window, but the segments can't be individually mapped to
193	    PEs (the segment number is the PE#), so there isn't as much
194	    flexibility.  A VF with multiple BARs would have to be in a "domain" of
195	    multiple PEs, which is not as well isolated as a single PE.
196	
197	  - Multiple segmented M64 windows: As usual, each window is split into 256
198	    equally-sized segments, and the segment number is the PE#.  But if we
199	    use several M64 windows, they can be set to different base addresses
200	    and different segment sizes.  If we have VFs that each have a 1MB BAR
201	    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
202	    another M64 window to assign 32MB segments.
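
To illustrate the last strategy: if the 1MB BARs and the 32MB BARs each start
at the same segment index of their own window, both BARs of a given VF land
in the same PE#.  The window base addresses below are invented for the
example, and the helper names are hypothetical.

```python
MB = 1 << 20

# Hypothetical window bases, each naturally aligned to its window size.
WIN_SMALL_BASE = 0x2000_0000_0000   # 256 x 1MB segments  = 256MB window
WIN_LARGE_BASE = 0x2002_0000_0000   # 256 x 32MB segments = 8GB window

def segment_pe(addr, win_base, seg_size):
    """In a segmented M64 window the segment number is the PE#."""
    return (addr - win_base) // seg_size

def vf_pes(n, first_seg=0):
    """Return the PE# seen by VF n's 1MB BAR and by its 32MB BAR."""
    small_bar = WIN_SMALL_BASE + (first_seg + n) * 1 * MB
    large_bar = WIN_LARGE_BASE + (first_seg + n) * 32 * MB
    return (segment_pe(small_bar, WIN_SMALL_BASE, 1 * MB),
            segment_pe(large_bar, WIN_LARGE_BASE, 32 * MB))
```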
203	
204	  Finally, the plan is to use M64 windows for SR-IOV, as described in more
205	  detail in the rest of this document.  For a given VF BAR, we need to
206	  effectively reserve the entire 256 segments (256 * VF BAR size) and
207	  position the VF BAR to start at the beginning of a free range of
208	  segments/PEs inside that M64 window.
209	
210	  The goal is of course to be able to give a separate PE for each VF.
211	
212	  The IODA2 platform has 16 M64 windows, which are used to map MMIO
213	  range to PE#.  Each M64 window defines one MMIO range and this range is
214	  divided into 256 segments, with each segment corresponding to one PE.
215	
216	  We decided to leverage these M64 windows to map VFs to individual PEs, since
217	  the VF(n) BARs described by a given VF BAR are all the same size.
218	
219	  But doing so introduces another problem: total_VFs is usually smaller
220	  than the number of M64 window segments, so if we map one VF BAR directly
221	  to one M64 window, some part of the M64 window will map to another
222	  device's MMIO range.
223	
224	  IODA supports 256 PEs, so segmented windows contain 256 segments.  If
225	  total_VFs is less than 256, we have the situation in Figure 1.0, where
226	  segments [total_VFs, 255] of the M64 window may map to some MMIO range on
227	  other devices:
228	
229	     0      1                     total_VFs - 1
230	     +------+------+-     -+------+------+
231	     |      |      |  ...  |      |      |
232	     +------+------+-     -+------+------+
233	
234	                           VF(n) BAR space
235	
236	     0      1                     total_VFs - 1                255
237	     +------+------+-     -+------+------+-      -+------+------+
238	     |      |      |  ...  |      |      |   ...  |      |      |
239	     +------+------+-     -+------+------+-      -+------+------+
240	
241	                           M64 window
242	
243			Figure 1.0 Direct map VF(n) BAR space
244	
245	  Our current solution is to allocate 256 segments even if the VF(n) BAR
246	  space doesn't need that much, as shown in Figure 1.1:
247	
248	     0      1                     total_VFs - 1                255
249	     +------+------+-     -+------+------+-      -+------+------+
250	     |      |      |  ...  |      |      |   ...  |      |      |
251	     +------+------+-     -+------+------+-      -+------+------+
252	
253	                           VF(n) BAR space + extra
254	
255	     0      1                     total_VFs - 1                255
256	     +------+------+-     -+------+------+-      -+------+------+
257	     |      |      |  ...  |      |      |   ...  |      |      |
258	     +------+------+-     -+------+------+-      -+------+------+
259	
260				   M64 window
261	
262			Figure 1.1 Map VF(n) BAR space + extra
263	
264	  Allocating the extra space ensures that the entire M64 window will be
265	  assigned to this one SR-IOV device and none of the space will be
266	  available for other devices.  Note that this only expands the space
267	  reserved in software; there are still only total_VFs VFs, and they only
268	  respond to segments [0, total_VFs - 1].  There's nothing in hardware that
269	  responds to segments [total_VFs, 255].
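
The cost of this padding is easy to quantify.  A minimal sketch, with a
hypothetical helper name, assuming the segment size equals the VF BAR size:

```python
def reserve_for_m64(vf_bar_size, total_vfs, segments=256):
    """Compare the space the VFs need with the space we reserve."""
    if not 0 < total_vfs <= segments:
        raise ValueError("total_VFs must be between 1 and the segment count")
    needed = vf_bar_size * total_vfs      # what the hardware decodes
    reserved = vf_bar_size * segments     # what software sets aside
    live = range(0, total_vfs)            # segments backed by a VF
    idle = range(total_vfs, segments)     # reserved, nothing responds
    return needed, reserved, live, idle
```

With eight VFs and a 1MB VF BAR, only 8MB is decoded by the device, but 256MB
is reserved so that the whole window belongs to this one SR-IOV device.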
270	
271	4. Implications for the Generic PCI Code
272	
273	The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
274	aligned to the size of an individual VF BAR.
275	
276	In IODA2, the MMIO address determines the PE#.  If the address is in an M32
277	window, we can set the PE# by updating the table that translates segments
278	to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
279	set the PE# for the window.  But if it's in a segmented M64 window, the
280	segment number is the PE#.
281	
282	Therefore, the only way to control the PE# for a VF is to change the base
283	of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
284	amount of space required for the VF(n) BAR space, the VF BAR value is fixed
285	and cannot be changed.
286	
287	On the other hand, if the PCI core allocates additional space, the VF BAR
288	value can be changed as long as the entire VF(n) BAR space remains inside
289	the space allocated by the core.
290	
291	Ideally the segment size will be the same as an individual VF BAR size.
292	Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
293	are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
294	allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.
295	
296	If the segment size is smaller than the VF BAR size, it will take several
297	segments to cover a VF BAR, and a VF will be in several PEs.  This is
298	possible, but the isolation isn't as good, and it reduces the number of PE#
299	choices because instead of consuming only numVFs segments, the VF(n) BAR
300	space will consume (numVFs * n) segments, where n here is the number of
301	segments per VF BAR.  That means there aren't as many available segments
302	for adjusting the base of the VF(n) BAR space.
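
The relationship between the VF BAR base and the resulting PE#s can be
sketched as follows.  The helper names and the window base used in the
examples are hypothetical; the sketch only covers the two cases discussed
above, where the segment size is equal to or smaller than the VF BAR size.

```python
def vf_bar_base_for_pe(win_base, seg_size, vf_bar_size, num_vfs,
                       pe_of_vf0, segments=256):
    """Return (VF BAR base, segments per VF) that puts VF0 at pe_of_vf0."""
    if vf_bar_size % seg_size:
        raise ValueError("VF BAR size must be a multiple of the segment size")
    segs_per_vf = vf_bar_size // seg_size
    base = win_base + pe_of_vf0 * seg_size
    if base % vf_bar_size:
        raise ValueError("base must stay aligned to one VF BAR size")
    if pe_of_vf0 < 0 or pe_of_vf0 + num_vfs * segs_per_vf > segments:
        raise ValueError("VF(n) BAR space would leave the reserved window")
    return base, segs_per_vf

def pes_of_vf(n, pe_of_vf0, segs_per_vf):
    """Return the PE#s covered by VF n's BAR."""
    first = pe_of_vf0 + n * segs_per_vf
    return list(range(first, first + segs_per_vf))
```

When the segment size equals the VF BAR size, pes_of_vf(3, 16, 1) gives [19],
i.e. VF(n) sits in PE(x+n).  With 1MB segments and a 4MB VF BAR, each VF
spans four PEs, e.g. pes_of_vf(1, 4, 4) gives [8, 9, 10, 11].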