Documentation / virtual / virtio-spec.txt




Based on kernel version 3.9. Page generated on 2013-05-02 23:16 EST.

1	[Generated file: see http://ozlabs.org/~rusty/virtio-spec/]
2	Virtio PCI Card Specification
3	v0.9.5 DRAFT
4	-
5	
6	Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor)
7	
8	2012 May 7.
9	
10	Purpose and Description
11	
This document describes the specifications of the “virtio” family
of PCI devices. These devices are found in virtual environments,
yet by design they are not all that different from physical PCI
devices, and this document treats them as such. This allows the
guest to use standard PCI drivers and discovery mechanisms.
18	
19	The purpose of virtio and this specification is that virtual
20	environments and guests should have a straightforward, efficient,
21	standard and extensible mechanism for virtual devices, rather
22	than boutique per-environment or per-OS mechanisms.
23	
24	  Straightforward: Virtio PCI devices use normal PCI mechanisms
25	  of interrupts and DMA which should be familiar to any device
26	  driver author. There is no exotic page-flipping or COW
27	  mechanism: it's just a PCI device.[footnote:
28	This lack of page-sharing implies that the implementation of the
29	device (e.g. the hypervisor or host) needs full access to the
30	guest memory. Communication with untrusted parties (i.e.
31	inter-guest communication) requires copying.
32	]
33	
34	  Efficient: Virtio PCI devices consist of rings of descriptors
35	  for input and output, which are neatly separated to avoid cache
36	  effects from both guest and device writing to the same cache
37	  lines.
38	
39	  Standard: Virtio PCI makes no assumptions about the environment
40	  in which it operates, beyond supporting PCI. In fact the virtio
41	  devices specified in the appendices do not require PCI at all:
42	  they have been implemented on non-PCI buses.[footnote:
43	The Linux implementation further separates the PCI virtio code
44	from the specific virtio drivers: these drivers are shared with
45	the non-PCI implementations (currently lguest and S/390).
46	]
47	
48	  Extensible: Virtio PCI devices contain feature bits which are
49	  acknowledged by the guest operating system during device setup.
50	  This allows forwards and backwards compatibility: the device
51	  offers all the features it knows about, and the driver
52	  acknowledges those it understands and wishes to use.
53	
54	  Virtqueues
55	
56	The mechanism for bulk data transport on virtio PCI devices is
57	pretentiously called a virtqueue. Each device can have zero or
58	more virtqueues: for example, the network device has one for
59	transmit and one for receive.
60	
61	Each virtqueue occupies two or more physically-contiguous pages
62	(defined, for the purposes of this specification, as 4096 bytes),
63	and consists of three parts:
64	
65	
66	+-------------------+-----------------------------------+-----------+
67	| Descriptor Table  |   Available Ring     (padding)    | Used Ring |
68	+-------------------+-----------------------------------+-----------+
69	
70	
When the driver wants to send a buffer to the device, it fills in
a slot in the descriptor table (or chains several together), and
writes the descriptor index into the available ring. It then
notifies the device. When the device has finished a buffer, it
writes the descriptor index into the used ring, and sends an
interrupt.
76	
77	Specification
78	
79	  PCI Discovery
80	
81	Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000
82	through 0x103F inclusive is a virtio device[footnote:
83	The actual value within this range is ignored
84	]. The device must also have a Revision ID of 0 to match this
85	specification.
86	
87	The Subsystem Device ID indicates which virtio device is
88	supported by the device. The Subsystem Vendor ID should reflect
89	the PCI Vendor ID of the environment (it's currently only used
90	for informational purposes by the guest).
91	
92	
93	+----------------------+--------------------+---------------+
94	| Subsystem Device ID  |   Virtio Device    | Specification |
95	+----------------------+--------------------+---------------+
96	+----------------------+--------------------+---------------+
97	|          1           |   network card     |  Appendix C   |
98	+----------------------+--------------------+---------------+
99	|          2           |   block device     |  Appendix D   |
100	+----------------------+--------------------+---------------+
101	|          3           |      console       |  Appendix E   |
102	+----------------------+--------------------+---------------+
103	|          4           |  entropy source    |  Appendix F   |
104	+----------------------+--------------------+---------------+
105	|          5           | memory ballooning  |  Appendix G   |
106	+----------------------+--------------------+---------------+
107	|          6           |     ioMemory       |       -       |
108	+----------------------+--------------------+---------------+
109	|          7           |       rpmsg        |  Appendix H   |
110	+----------------------+--------------------+---------------+
111	|          8           |     SCSI host      |  Appendix I   |
112	+----------------------+--------------------+---------------+
113	|          9           |   9P transport     |       -       |
114	+----------------------+--------------------+---------------+
115	|         10           |   mac80211 wlan    |       -       |
116	+----------------------+--------------------+---------------+
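
The discovery rules above can be sketched in C. This is only an
illustration: struct pci_ids and both function names are hypothetical,
and a real driver would read these values from PCI configuration
space rather than from a struct.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_PCI_VENDOR_ID      0x1AF4
#define VIRTIO_PCI_DEVICE_ID_MIN  0x1000
#define VIRTIO_PCI_DEVICE_ID_MAX  0x103F
#define VIRTIO_PCI_REVISION_ID    0

/* Hypothetical snapshot of the PCI configuration fields used for discovery. */
struct pci_ids {
    uint16_t vendor_id;
    uint16_t device_id;
    uint8_t  revision_id;
    uint16_t subsystem_device_id;
};

/* True if the IDs identify a virtio device that matches this specification. */
static bool is_virtio_pci(const struct pci_ids *ids)
{
    return ids->vendor_id == VIRTIO_PCI_VENDOR_ID &&
           ids->device_id >= VIRTIO_PCI_DEVICE_ID_MIN &&
           ids->device_id <= VIRTIO_PCI_DEVICE_ID_MAX &&
           ids->revision_id == VIRTIO_PCI_REVISION_ID;
}

/* Name of the device type for a Subsystem Device ID, per the table above. */
static const char *virtio_device_name(uint16_t subsystem_device_id)
{
    static const char *const names[] = {
        [1]  = "network card",
        [2]  = "block device",
        [3]  = "console",
        [4]  = "entropy source",
        [5]  = "memory ballooning",
        [6]  = "ioMemory",
        [7]  = "rpmsg",
        [8]  = "SCSI host",
        [9]  = "9P transport",
        [10] = "mac80211 wlan",
    };
    size_t count = sizeof(names) / sizeof(names[0]);

    if (subsystem_device_id < count && names[subsystem_device_id] != NULL)
        return names[subsystem_device_id];
    return "unknown";
}
```

Note that the Device ID only selects "some virtio device"; the device
type always comes from the Subsystem Device ID.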
117	
118	
119	  Device Configuration
120	
121	To configure the device, we use the first I/O region of the PCI
122	device. This contains a virtio header followed by a
123	device-specific region.
124	
There may be different widths of accesses to the I/O region; the
“natural” access method for each field in the virtio header must
be used (i.e. 32-bit accesses for 32-bit fields, etc), but the
device-specific region can be accessed using any width accesses,
and should obtain the same results.
130	
131	Note that this is possible because while the virtio header is PCI
132	(i.e. little) endian, the device-specific region is encoded in
133	the native endian of the guest (where such distinction is
134	applicable).
135	
  Device Initialization Sequence
137	
We start with an overview of device initialization, then expand
on the details of the device and how each step is performed.
140	
  1. Reset the device. This is not required on initial start up.

  2. The ACKNOWLEDGE status bit is set: we have noticed the device.

  3. The DRIVER status bit is set: we know how to drive the device.

  4. Device-specific setup, including reading the Device Feature
     Bits, discovery of virtqueues for the device, optional MSI-X
     setup, and reading and possibly writing the virtio
     configuration space.

  5. The subset of Device Feature Bits understood by the driver is
     written to the device.

  6. The DRIVER_OK status bit is set.

  7. The device can now be used (i.e. buffers added to the
     virtqueues)[footnote:
Historically, drivers have used the device before steps 5 and 6.
This is only allowed if the driver does not use any features
which would alter this early use of the device.
]
163	
164	If any of these steps go irrecoverably wrong, the guest should
165	set the FAILED status bit to indicate that it has given up on the
166	device (it can reset the device later to restart if desired).
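
The sequence above, including the FAILED path, can be sketched in C.
This is a minimal illustration rather than driver code: struct
fake_virtio_dev is a hypothetical in-memory stand-in for the header
registers (a real driver would access the first I/O region),
required_features is an assumed parameter modelling a driver that
cannot work without certain features, and virtqueue discovery is
omitted. The macro names are illustrative; the values are the status
bits defined in this specification.

```c
#include <stdint.h>

#define VIRTIO_STATUS_ACKNOWLEDGE 1
#define VIRTIO_STATUS_DRIVER      2
#define VIRTIO_STATUS_DRIVER_OK   4
#define VIRTIO_STATUS_FAILED      128

/* Hypothetical stand-in for the Device Features, Guest Features and
 * Device Status fields of the virtio header. */
struct fake_virtio_dev {
    uint32_t device_features;
    uint32_t guest_features;
    uint8_t  status;
};

static void set_status_bit(struct fake_virtio_dev *dev, uint8_t bit)
{
    dev->status |= bit;
}

/* Returns 0 once the device is usable, -1 if the driver gave up. */
static int virtio_init(struct fake_virtio_dev *dev,
                       uint32_t driver_features,
                       uint32_t required_features)
{
    uint32_t offered;

    dev->status = 0;                                 /* reset */
    set_status_bit(dev, VIRTIO_STATUS_ACKNOWLEDGE);  /* device noticed */
    set_status_bit(dev, VIRTIO_STATUS_DRIVER);       /* driver available */

    /* Device-specific setup: here we only read the feature bits. */
    offered = dev->device_features;
    if ((offered & required_features) != required_features) {
        /* Irrecoverable for this driver: report that we have given up. */
        set_status_bit(dev, VIRTIO_STATUS_FAILED);
        return -1;
    }

    /* Write back the subset of features the driver understands. */
    dev->guest_features = offered & driver_features;

    set_status_bit(dev, VIRTIO_STATUS_DRIVER_OK);
    return 0;                                        /* device may be used */
}
```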
167	
168	We now cover the fields required for general setup in detail.
169	
170	  Virtio Header
171	
172	The virtio header looks as follows:
173	
174	
175	+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
176	| Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
177	+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
178	| Read/Write || R                   | R+W                 | R+W      | R      | R+W     | R+W     | R+W     | R      |
179	+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
180	| Purpose    || Device              | Guest               | Queue    | Queue  | Queue   | Queue   | Device  | ISR    |
181	|            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
182	+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
183	
184	
185	If MSI-X is enabled for the device, two additional fields
186	immediately follow this header:[footnote:
187	ie. once you enable MSI-X on the device, the other fields move.
188	If you turn it off again, they move back!
189	]
190	
191	
+------------++----------------+--------+
| Bits       || 16             | 16     |
+------------++----------------+--------+
| Read/Write || R+W            | R+W    |
+------------++----------------+--------+
| Purpose    || Configuration  | Queue  |
| (MSI-X)    || Vector         | Vector |
+------------++----------------+--------+
201	
202	
203	Immediately following these general headers, there may be
204	device-specific headers:
205	
206	
+------------++--------------------+
| Bits       || Device Specific    |
+------------++--------------------+
| Read/Write || Device Specific    |
+------------++--------------------+
| Purpose    || Device Specific... |
|            ||                    |
+------------++--------------------+
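
Reading the field widths in the header tables from left to right
gives the byte offset of each field. The sketch below writes that
out as a C structure; the structure, macro and function names are
illustrative, and the offsets are derived from the tables rather than
quoted from them. A real driver would access each field with an I/O
access of the matching width instead of overlaying a struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Field layout of the virtio header when MSI-X is not enabled. */
struct virtio_pci_header {
    uint32_t device_features;  /* R   : Device Features bits 0:31 */
    uint32_t guest_features;   /* R+W : Guest Features bits 0:31  */
    uint32_t queue_address;    /* R+W */
    uint16_t queue_size;       /* R   */
    uint16_t queue_select;     /* R+W */
    uint16_t queue_notify;     /* R+W */
    uint8_t  device_status;    /* R+W */
    uint8_t  isr_status;       /* R   */
};

_Static_assert(sizeof(struct virtio_pci_header) == 20,
               "general header is 20 bytes");
_Static_assert(offsetof(struct virtio_pci_header, queue_size) == 12,
               "Queue Size at offset 12");
_Static_assert(offsetof(struct virtio_pci_header, isr_status) == 19,
               "ISR Status at offset 19");

/* The two 16-bit MSI-X fields follow the general header when enabled. */
#define VIRTIO_MSIX_CONFIG_VECTOR_OFFSET 20
#define VIRTIO_MSIX_QUEUE_VECTOR_OFFSET  22

/* Offset at which the device-specific region starts. */
static size_t virtio_device_config_offset(int msix_enabled)
{
    return msix_enabled ? 24 : 20;
}
```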
216	
217	
218	  Device Status
219	
220	The Device Status field is updated by the guest to indicate its
221	progress. This provides a simple low-level diagnostic: it's most
222	useful to imagine them hooked up to traffic lights on the console
223	indicating the status of each device.
224	
225	The device can be reset by writing a 0 to this field, otherwise
226	at least one bit should be set:
227	
228	  ACKNOWLEDGE (1) Indicates that the guest OS has found the
229	  device and recognized it as a valid virtio device.
230	
231	  DRIVER (2) Indicates that the guest OS knows how to drive the
232	  device. Under Linux, drivers can be loadable modules so there
233	  may be a significant (or infinite) delay before setting this
234	  bit.
235	
236	  DRIVER_OK (4) Indicates that the driver is set up and ready to
237	  drive the device.
238	
239	  FAILED (128) Indicates that something went wrong in the guest,
240	  and it has given up on the device. This could be an internal
241	  error, or the driver didn't like the device for some reason, or
242	  even a fatal error during device operation. The device must be
243	  reset before attempting to re-initialize.
244	
  Feature Bits

The first configuration field indicates the features that the
device supports. The bits are allocated as follows:

  0 to 23 Feature bits for the specific device type

  24 to 31 Feature bits reserved for extensions to the queue and
  feature negotiation mechanisms
254	
255	For example, feature bit 0 for a network device (i.e. Subsystem
256	Device ID 1) indicates that the device supports checksumming of
257	packets.
258	
259	The feature bits are negotiated: the device lists all the
260	features it understands in the Device Features field, and the
261	guest writes the subset that it understands into the Guest
262	Features field. The only way to renegotiate is to reset the
263	device.
264	
265	In particular, new fields in the device configuration header are
266	indicated by offering a feature bit, so the guest can check
267	before accessing that part of the configuration space.
268	
269	This allows for forwards and backwards compatibility: if the
270	device is enhanced with a new feature bit, older guests will not
271	write that feature bit back to the Guest Features field and it
272	can go into backwards compatibility mode. Similarly, if a guest
273	is enhanced with a feature that the device doesn't support, it
274	will not see that feature bit in the Device Features field and
275	can go into backwards compatibility mode (or, for poor
276	implementations, set the FAILED Device Status bit).
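
As a small illustration of the negotiation, the accepted set is just
the intersection of what the device offers and what the driver
understands. The function names below are hypothetical; only the bit
ranges come from this specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value the guest writes to the Guest Features field. */
static uint32_t virtio_negotiate(uint32_t device_features,
                                 uint32_t driver_features)
{
    return device_features & driver_features;
}

/* True if the given feature bit was accepted during negotiation. */
static bool virtio_has_feature(uint32_t negotiated, unsigned int bit)
{
    return bit < 32 && ((negotiated >> bit) & 1u) != 0;
}

/* Bits 0 to 23 belong to the device type; higher bits are reserved
 * for the queue and feature negotiation mechanisms. */
static bool virtio_feature_is_device_specific(unsigned int bit)
{
    return bit <= 23;
}
```

A driver would call virtio_has_feature() before touching any
configuration field that only exists when the matching feature bit
was offered and accepted.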
277	
278	  Configuration/Queue Vectors
279	
280	When MSI-X capability is present and enabled in the device
281	(through standard PCI configuration space) 4 bytes at byte offset
282	20 are used to map configuration change and queue interrupts to
283	MSI-X vectors. In this case, the ISR Status field is unused, and
284	device specific configuration starts at byte offset 24 in virtio
285	header structure. When MSI-X capability is not enabled, device
286	specific configuration starts at byte offset 20 in virtio header.
287	
Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
the Configuration/Queue Vector registers maps interrupts triggered
by the configuration change/selected queue events respectively to
the corresponding MSI-X vector. To disable interrupts for a
specific event type, unmap it by writing a special NO_VECTOR
value:

/* Vector value used to disable MSI for queue */
#define VIRTIO_MSI_NO_VECTOR            0xffff

Reading these registers returns the vector mapped to a given event,
or NO_VECTOR if unmapped. All queue and configuration change
events are unmapped by default.

Note that mapping an event to a vector might require allocating
internal device resources, and might fail. Devices report such
failures by returning the NO_VECTOR value when the relevant
Vector field is read. After mapping an event to a vector, the
driver must verify success by reading the Vector field value: on
success, the previously written value is returned, and on
failure, NO_VECTOR is returned. If a mapping failure is detected,
the driver can retry mapping with fewer vectors, or disable MSI-X.
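
The write-then-read-back check can be sketched as follows. The
register model is hypothetical: struct fake_vector_reg stands in for
a Configuration or Queue Vector field and simply refuses vectors at
or above max_vectors, which is one assumed way a device might run out
of resources.

```c
#include <stdint.h>

#define VIRTIO_MSI_NO_VECTOR 0xffff

/* Hypothetical model of one Vector field in the virtio header. */
struct fake_vector_reg {
    uint16_t value;        /* what a read of the field returns */
    uint16_t max_vectors;  /* vectors the device can actually map */
};

/* Device side: a failed mapping leaves NO_VECTOR in the field. */
static void vector_reg_write(struct fake_vector_reg *reg, uint16_t vector)
{
    if (vector == VIRTIO_MSI_NO_VECTOR || vector < reg->max_vectors)
        reg->value = vector;
    else
        reg->value = VIRTIO_MSI_NO_VECTOR;
}

static uint16_t vector_reg_read(const struct fake_vector_reg *reg)
{
    return reg->value;
}

/* Driver side: write the vector, then verify by reading it back.
 * Returns 0 on success, -1 if the device reported a mapping failure. */
static int virtio_map_vector(struct fake_vector_reg *reg, uint16_t vector)
{
    vector_reg_write(reg, vector);
    return vector_reg_read(reg) == vector ? 0 : -1;
}
```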
311	
  Virtqueue Configuration
313	
314	As a device can have zero or more virtqueues for bulk data
315	transport (for example, the network driver has two), the driver
316	needs to configure them as part of the device-specific
317	configuration.
318	
319	This is done as follows, for each virtqueue a device has:
320	
321	  Write the virtqueue index (first queue is 0) to the Queue
322	  Select field.
323	
324	  Read the virtqueue size from the Queue Size field, which is
325	  always a power of 2. This controls how big the virtqueue is
326	  (see below). If this field is 0, the virtqueue does not exist.
327	
328	  Allocate and zero virtqueue in contiguous physical memory, on a
329	  4096 byte alignment. Write the physical address, divided by
330	  4096 to the Queue Address field.[footnote:
331	The 4096 is based on the x86 page size, but it's also large
332	enough to ensure that the separate parts of the virtqueue are on
333	separate cache lines.
334	]
335	
336	  Optionally, if MSI-X capability is present and enabled on the
337	  device, select a vector to use to request interrupts triggered
338	  by virtqueue events. Write the MSI-X Table entry number
339	  corresponding to this vector in Queue Vector field. Read the
340	  Queue Vector field: on success, previously written value is
341	  returned; on failure, NO_VECTOR value is returned.
342	
343	The Queue Size field controls the total number of bytes required
344	for the virtqueue according to the following formula:
345	
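#define ALIGN(x) (((x) + 4095) & ~4095)

static inline unsigned vring_size(unsigned int qsz)
{
     /* The available ring holds flags, idx and used_event (three
        u16s) plus qsz entries; the used ring holds flags, idx and
        avail_event (three u16s) plus qsz elements. */
     return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(3 + qsz))
          + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz);
}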
358	
359	This currently wastes some space with padding, but also allows
360	future extensions. The virtqueue layout structure looks like this
361	(qsz is the Queue Size field, which is a variable, so this code
362	won't compile):
363	
struct vring {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[qsz];

    /* A ring of available descriptor heads with free-running index. */
    struct vring_avail avail;

    /* Padding to the next 4096 boundary. */
    char pad[];

    /* A ring of used descriptor heads with free-running index. */
    struct vring_used used;
};
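
Since the structure above cannot be compiled, the same layout can be
computed at run time. The sketch below is an illustration under
stated assumptions: the names are hypothetical, the element sizes (16
bytes per descriptor, 2 per available entry, 8 per used element) are
taken from the structure definitions in this document, and both rings
are counted with their flags, idx and event fields.

```c
#include <stddef.h>
#include <stdint.h>

#define VRING_ALIGN 4096u

/* Byte offsets of the three parts within the virtqueue memory. */
struct vring_layout {
    size_t desc_off;   /* descriptor table */
    size_t avail_off;  /* available ring */
    size_t used_off;   /* used ring, on the next 4096 boundary */
    size_t total;      /* bytes to allocate, a multiple of 4096 */
};

static size_t align_up(size_t x)
{
    return (x + VRING_ALIGN - 1) & ~(size_t)(VRING_ALIGN - 1);
}

static struct vring_layout vring_compute_layout(unsigned int qsz)
{
    struct vring_layout l;
    size_t avail_bytes = 2 * (3 + (size_t)qsz);     /* flags, idx, ring, used_event */
    size_t used_bytes  = 2 * 3 + 8 * (size_t)qsz;   /* flags, idx, avail_event, ring */

    l.desc_off  = 0;
    l.avail_off = 16 * (size_t)qsz;
    l.used_off  = align_up(l.avail_off + avail_bytes);
    l.total     = l.used_off + align_up(used_bytes);
    return l;
}

/* Value to write to the Queue Address field for a 4096-aligned
 * guest-physical address. */
static uint32_t queue_address_value(uint64_t phys_addr)
{
    return (uint32_t)(phys_addr / 4096);
}
```

For a Queue Size of 256 this gives the descriptor table at offset 0,
the available ring at 4096, the used ring at 8192 and a total of
12288 bytes, i.e. three pages.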
390	
391	  A Note on Virtqueue Endianness
392	
393	Note that the endian of these fields and everything else in the
394	virtqueue is the native endian of the guest, not little-endian as
395	PCI normally is. This makes for simpler guest code, and it is
396	assumed that the host already has to be deeply aware of the guest
397	endian so such an “endian-aware” device is not a significant
398	issue.
399	
400	  Descriptor Table
401	
402	The descriptor table refers to the buffers the guest is using for
403	the device. The addresses are physical addresses, and the buffers
404	can be chained via the next field. Each descriptor describes a
405	buffer which is read-only or write-only, but a chain of
406	descriptors can contain both read-only and write-only buffers.
407	
No descriptor chain may be more than 2^32 bytes long in total.

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT       1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE      2
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT   4

struct vring_desc {
    /* Address (guest-physical). */
    u64 addr;
    /* Length. */
    u32 len;
    /* The flags as indicated above. */
    u16 flags;
    /* Next field if flags & NEXT */
    u16 next;
};
440	
441	The number of descriptors in the table is specified by the Queue
442	Size field for this virtqueue.
443	
  Indirect Descriptors
445	
446	Some devices benefit by concurrently dispatching a large number
447	of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be
448	used to allow this (see [cha:Reserved-Feature-Bits]). To increase
449	ring capacity it is possible to store a table of indirect
450	descriptors anywhere in memory, and insert a descriptor in main
451	virtqueue (with flags&INDIRECT on) that refers to memory buffer
452	containing this indirect descriptor table; fields addr and len
453	refer to the indirect table address and length in bytes,
454	respectively. The indirect table layout structure looks like this
455	(len is the length of the descriptor that refers to this table,
456	which is a variable, so this code won't compile):
457	
struct indirect_descriptor_table {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[len / 16];
};
465	
466	The first indirect descriptor is located at start of the indirect
467	descriptor table (index 0), additional indirect descriptors are
468	chained by next field. An indirect descriptor without next field
469	(with flags&NEXT off) signals the end of the indirect descriptor
470	table, and transfers control back to the main virtqueue. An
471	indirect descriptor can not refer to another indirect descriptor
472	table (flags&INDIRECT must be off). A single indirect descriptor
473	table can include both read-only and write-only descriptors;
474	write-only flag (flags&WRITE) in the descriptor that refers to it
475	is ignored.
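
A sketch of how a driver might fill an indirect table and the single
main-ring descriptor that refers to it. struct buf_elem and
fill_indirect() are hypothetical names, table_phys is assumed to be
the guest-physical address of the table, and the caller is assumed to
pass at least one element.

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT     1
#define VRING_DESC_F_WRITE    2
#define VRING_DESC_F_INDIRECT 4

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical description of one physically-contiguous buffer element. */
struct buf_elem {
    uint64_t addr;          /* guest-physical address */
    uint32_t len;           /* length in bytes */
    int      device_writes; /* non-zero for write-only elements */
};

/* Fill table[0..n-1] from elems and point main_desc at the table. */
static void fill_indirect(struct vring_desc *main_desc,
                          struct vring_desc *table, uint64_t table_phys,
                          const struct buf_elem *elems, uint16_t n)
{
    for (uint16_t i = 0; i < n; i++) {
        table[i].addr  = elems[i].addr;
        table[i].len   = elems[i].len;
        table[i].flags = elems[i].device_writes ? VRING_DESC_F_WRITE : 0;
        table[i].next  = 0;
        if (i + 1 < n) {
            /* Chain to the following entry of the indirect table. */
            table[i].flags |= VRING_DESC_F_NEXT;
            table[i].next   = (uint16_t)(i + 1);
        }
    }

    /* One descriptor in the main virtqueue covers the whole table.
     * The WRITE flag is left clear here since it is ignored. */
    main_desc->addr  = table_phys;
    main_desc->len   = (uint32_t)(n * sizeof(struct vring_desc));
    main_desc->flags = VRING_DESC_F_INDIRECT;
    main_desc->next  = 0;
}
```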
476	
477	  Available Ring
478	
The available ring refers to what descriptors we are offering the
device: it refers to the head of a descriptor chain. The “flags”
field is currently 0 or 1: 1 indicating that we do not need an
interrupt when the device consumes a descriptor from the
available ring. Alternatively, the guest can ask the device to
delay interrupts until an entry with an index specified by the
“used_event” field is written in the used ring (equivalently,
until the idx field in the used ring reaches the value
used_event + 1). The method employed by the device is controlled
by the VIRTIO_RING_F_EVENT_IDX feature bit (see
[cha:Reserved-Feature-Bits]). This interrupt suppression is merely
an optimization; it may not suppress interrupts entirely.
491	
492	The “idx” field indicates where we would put the next descriptor
493	entry (modulo the ring size). This starts at 0, and increases.
494	
struct vring_avail {
#define VRING_AVAIL_F_NO_INTERRUPT      1
   u16 flags;
   u16 idx;
   u16 ring[qsz]; /* qsz is the Queue Size field read from device */
   u16 used_event;
};
509	
510	  Used Ring
511	
The used ring is where the device returns buffers once it is done
with them. The flags field can be used by the device to hint that
no notification is necessary when the guest adds to the available
ring. Alternatively, the “avail_event” field can be used by the
device to hint that no notification is necessary until an entry
with an index specified by the “avail_event” field is written in
the available ring (equivalently, until the idx field in the
available ring reaches the value avail_event + 1). The method
employed by the device is controlled by the guest through the
VIRTIO_RING_F_EVENT_IDX feature bit (see
[cha:Reserved-Feature-Bits]).[footnote:
These fields are kept here because this is the only part of the
virtqueue written by the device
]

Each entry in the ring is a pair: the head entry of the
descriptor chain describing the buffer (this matches an entry
placed in the available ring by the guest earlier), and the total
number of bytes written into the buffer. The latter is extremely
useful for guests using untrusted buffers: if you do not know
exactly how much has been written by the device, you usually have
to zero the buffer to ensure no data leakage occurs.
534	
/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
    /* Index of start of used descriptor chain. */
    u32 id;
    /* Total length of the descriptor chain which was used (written to) */
    u32 len;
};

struct vring_used {
#define VRING_USED_F_NO_NOTIFY  1
    u16 flags;
    u16 idx;
    struct vring_used_elem ring[qsz];
    u16 avail_event;
};
565	
566	  Helpers for Managing Virtqueues
567	
568	The Linux Kernel Source code contains the definitions above and
569	helper routines in a more usable form, in
570	include/linux/virtio_ring.h. This was explicitly licensed by IBM
571	and Red Hat under the (3-clause) BSD license so that it can be
572	freely used by all other projects, and is reproduced (with slight
573	variation to remove Linux assumptions) in Appendix A.
574	
  Device Operation
576	
577	There are two parts to device operation: supplying new buffers to
578	the device, and processing used buffers from the device. As an
579	example, the virtio network device has two virtqueues: the
580	transmit virtqueue and the receive virtqueue. The driver adds
581	outgoing (read-only) packets to the transmit virtqueue, and then
582	frees them after they are used. Similarly, incoming (write-only)
583	buffers are added to the receive virtqueue, and processed after
584	they are used.
585	
586	  Supplying Buffers to The Device
587	
588	Actual transfer of buffers from the guest OS to the device
589	operates as follows:
590	
  1. Place the buffer(s) into free descriptor(s).

     (a) If there are no free descriptors, the guest may choose to
         notify the device even if notifications are suppressed (to
         reduce latency).[footnote:
The Linux drivers do this only for read-only buffers: for
write-only buffers, it is assumed that the driver is merely
trying to keep the receive buffer ring full, and no notification
of this expected condition is necessary.
]

  2. Place the id of the buffer (the index of its head descriptor)
     in the next ring entry of the available ring.

  3. The steps (1) and (2) may be performed repeatedly if batching
     is possible.

  4. A memory barrier should be executed to ensure the device sees
     the updated descriptor table and available ring before the next
     step.

  5. The available “idx” field should be increased by the number of
     entries added to the available ring.

  6. A memory barrier should be executed to ensure that we update
     the idx field before checking for notification suppression.

  7. If notifications are not suppressed, the device should be
     notified of the new buffers.

Note that the above steps do not take precautions against the
available ring buffer wrapping around: this is not possible since
the ring buffer is the same size as the descriptor table, so step
(1) will prevent such a condition.
625	
626	In addition, the maximum queue size is 32768 (it must be a power
627	of 2 which fits in 16 bits), so the 16-bit “idx” value can always
628	distinguish between a full and empty buffer.
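
A small worked illustration of why the 16-bit index suffices: with
free-running indices, the number of buffers offered but not yet used
is the difference of the two idx values modulo 65536. The function
name is hypothetical.

```c
#include <stdint.h>

/* Buffers made available by the driver that the device has not yet
 * returned, computed with natural 16-bit wrap-around. */
static uint16_t vring_in_flight(uint16_t avail_idx, uint16_t used_idx)
{
    return (uint16_t)(avail_idx - used_idx);
}
```

With the maximum queue size of 32768, a completely full ring yields
32768 and an empty ring yields 0, so the two cases never collide. A
queue of 65536 entries would make both cases read as 0.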
629	
630	Here is a description of each stage in more detail.
631	
632	  Placing Buffers Into The Descriptor Table
633	
634	A buffer consists of zero or more read-only physically-contiguous
635	elements followed by zero or more physically-contiguous
636	write-only elements (it must have at least one element). This
637	algorithm maps it into the descriptor table:
638	
  for each buffer element, b:

    Get the next free descriptor table entry, d

    Set d.addr to the physical address of the start of b

    Set d.len to the length of b.

    If b is write-only, set d.flags to VRING_DESC_F_WRITE,
    otherwise 0.

    If there is a buffer element after this:

      Set d.next to the index of the next free descriptor element.

      Set the VRING_DESC_F_NEXT bit in d.flags.
655	
656	In practice, the d.next fields are usually used to chain free
657	descriptors, and a separate count kept to check there are enough
658	free descriptors before beginning the mappings.
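
The algorithm, together with the free-list practice just described,
might look like this in C. It is a sketch only: struct desc_pool,
struct buf_elem and the function names are hypothetical, and the
value qsz stored in the last free entry's next field is an assumed
out-of-range marker that is never dereferenced because the free count
is checked first.

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT  1
#define VRING_DESC_F_WRITE 2

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical description of one physically-contiguous buffer element. */
struct buf_elem {
    uint64_t addr;
    uint32_t len;
    int      device_writes;  /* non-zero for write-only elements */
};

/* Free descriptors are chained through their next fields. */
struct desc_pool {
    struct vring_desc *desc;
    uint16_t free_head;
    uint16_t num_free;
};

static void desc_pool_init(struct desc_pool *p, struct vring_desc *desc,
                           uint16_t qsz)
{
    for (uint16_t i = 0; i < qsz; i++)
        desc[i].next = (uint16_t)(i + 1);
    p->desc = desc;
    p->free_head = 0;
    p->num_free = qsz;
}

/* Map n elements into a descriptor chain. Returns the head index,
 * or -1 if n is zero or there are not enough free descriptors. */
static int desc_pool_map(struct desc_pool *p, const struct buf_elem *elems,
                         uint16_t n)
{
    uint16_t head, i;

    if (n == 0 || n > p->num_free)
        return -1;

    head = i = p->free_head;
    for (uint16_t k = 0; k < n; k++) {
        struct vring_desc *d = &p->desc[i];
        uint16_t next_free = d->next;

        d->addr  = elems[k].addr;
        d->len   = elems[k].len;
        d->flags = elems[k].device_writes ? VRING_DESC_F_WRITE : 0;
        if (k + 1 < n)
            d->flags |= VRING_DESC_F_NEXT;  /* d->next already names the next free entry */
        i = next_free;
    }
    p->free_head = i;
    p->num_free  = (uint16_t)(p->num_free - n);
    return head;
}
```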
659	
660	  Updating The Available Ring
661	
662	The head of the buffer we mapped is the first d in the algorithm
663	above. A naive implementation would do the following:
664	
665	avail->ring[avail->idx % qsz] = head;
666	
667	However, in general we can add many descriptors before we update
668	the “idx” field (at which point they become visible to the
669	device), so we keep a counter of how many we've added:
670	
671	avail->ring[(avail->idx + added++) % qsz] = head;
672	
673	  Updating The Index Field
674	
Once the idx field of the virtqueue is updated, the device will
be able to access the descriptor entries we've created and the
memory they refer to. This is why a memory barrier is generally
used before the idx update, to ensure the device sees the most
up-to-date copy.
680	
681	The idx field always increments, and we let it wrap naturally at
682	65536:
683	
684	avail->idx += added;
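
Putting the ring update and the index update together, a batched
publish could be sketched as below. QSZ is an assumed fixed queue
size used only to keep the example self-contained (real code reads it
from the Queue Size field), struct avail_batch is hypothetical, and
the C11 fences are stand-ins for whatever barrier the platform
actually requires between guest and device.

```c
#include <stdatomic.h>
#include <stdint.h>

#define QSZ 8  /* example queue size, a power of 2 */

struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QSZ];
    uint16_t used_event;
};

/* Heads added to the ring but not yet made visible through idx. */
struct avail_batch {
    uint16_t added;
};

/* Store a chain head in the next ring slot without exposing it yet. */
static void avail_add(struct vring_avail *avail, struct avail_batch *b,
                      uint16_t head)
{
    avail->ring[(uint16_t)(avail->idx + b->added++) % QSZ] = head;
}

/* Expose every head added since the last publish. */
static void avail_publish(struct vring_avail *avail, struct avail_batch *b)
{
    /* Descriptors and ring entries must be visible before idx moves. */
    atomic_thread_fence(memory_order_release);
    avail->idx = (uint16_t)(avail->idx + b->added);  /* wraps at 65536 */
    b->added = 0;
    /* idx must be visible before notification suppression is checked. */
    atomic_thread_fence(memory_order_seq_cst);
}
```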
685	
  Notifying The Device

Device notification occurs by writing the 16-bit virtqueue index
of this virtqueue to the Queue Notify field of the virtio header
in the first I/O region of the PCI device. This can be expensive,
however, so the device can suppress such notifications if it
doesn't need them. We have to be careful to expose the new idx
value before checking the suppression flag: it's OK to notify
gratuitously, but not to omit a required notification. So again,
we use a memory barrier here before reading the flags or the
avail_event field.

If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated, and if
the VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and write
to the Queue Notify field.

If the VIRTIO_RING_F_EVENT_IDX feature is negotiated, we read the
avail_event field in the used ring structure. If the available
index crossed the avail_event field value since the last
notification, we go ahead and write to the Queue Notify field.
The avail_event field wraps naturally at 65536 as well:
707	
708	(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)
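
The comparison can be wrapped in a helper to make the wrap-around
behaviour easier to check. The function name and parameter names are
illustrative; the expression is the one given above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if moving the index from old_idx to new_idx stepped past
 * event_idx, i.e. the other side asked to be told about this range.
 * All arithmetic is modulo 65536. */
static bool vring_need_event(uint16_t event_idx, uint16_t new_idx,
                             uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

For example, advancing from 10 to 12 requires a notification when the
event index is 10 or 11, but not when it is 9 or 12. The same test
holds across the wrap: advancing from 65534 to 2 crosses an event
index of 65535.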
709	
  Receiving Used Buffers From The Device

Once the device has used a buffer (read from or written to it, or
parts of both, depending on the nature of the virtqueue and the
device), it sends an interrupt, following an algorithm very
similar to the algorithm used for the driver to send the device a
buffer:

  1. Write the head descriptor number to the next field in the used
     ring.

  2. Update the used ring idx.

  3. Determine whether an interrupt is necessary:

     (a) If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated:
         check whether the VRING_AVAIL_F_NO_INTERRUPT flag is not
         set in avail->flags.

     (b) If the VIRTIO_RING_F_EVENT_IDX feature is negotiated: check
         whether the used index crossed the used_event field value
         since the last update. The used_event field wraps naturally
         at 65536 as well:

         (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)

  4. If an interrupt is necessary:

     (a) If MSI-X capability is disabled:

         i. Set the lower bit of the ISR Status field for the device.

         ii. Send the appropriate PCI interrupt for the device.

     (b) If MSI-X capability is enabled:

         i. Request the appropriate MSI-X interrupt message for the
            device; the Queue Vector field sets the MSI-X Table
            entry number.

         ii. If the Queue Vector field value is NO_VECTOR, no
             interrupt message is requested for this event.
751	
752	The guest interrupt handler should:
753	
754	  If MSI-X capability is disabled: read the ISR Status field,
755	  which will reset it to zero. If the lower bit is zero, the
756	  interrupt was not for this device. Otherwise, the guest driver
757	  should look through the used rings of each virtqueue for the
758	  device, to see if any progress has been made by the device
759	  which requires servicing.
760	
761	  If MSI-X capability is enabled: look through the used rings of
762	  each virtqueue mapped to the specific MSI-X vector for the
763	  device, to see if any progress has been made by the device
764	  which requires servicing.
765	
For each ring, the guest should then disable interrupts by setting
the VRING_AVAIL_F_NO_INTERRUPT flag in the avail structure, if
required. It can then process used ring entries, finally enabling
interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or
updating the used_event field in the available structure. The
guest should then execute a memory barrier, and then recheck the
ring empty condition. This is necessary to handle the case where,
after the last check and before enabling interrupts, an interrupt
has been suppressed by the device:
775	
vring_disable_interrupts(vq);

for (;;) {
    if (vq->last_seen_used == vring->used.idx) {
        /* Ring looks empty: enable interrupts, then recheck. */
        vring_enable_interrupts(vq);
        mb();
        if (vq->last_seen_used == vring->used.idx)
            break;
        /* A buffer arrived in the window: keep processing. */
        vring_disable_interrupts(vq);
    }

    struct vring_used_elem *e =
        &vring->used.ring[vq->last_seen_used % qsz];
    process_buffer(e);
    vq->last_seen_used++;
}
800	
801	  Dealing With Configuration Changes<sub:Dealing-With-Configuration>
802	
803	Some virtio PCI devices can change the device configuration
804	state, as reflected in the virtio header in the PCI configuration
805	space. In this case:
806	
  If MSI-X capability is disabled: an interrupt is delivered and
  the second lowest bit is set in the ISR Status field to
  indicate that the driver should re-examine the configuration
  space. Note that a single interrupt can indicate both that one
  or more virtqueues have been used and that the configuration
  space has changed: even if the config bit is set, virtqueues
  must be scanned.
814	
815	  If MSI-X capability is enabled: an interrupt message is
816	  requested. The Configuration Vector field sets the MSI-X Table
817	  entry number to use. If Configuration Vector field value is
818	  NO_VECTOR, no interrupt message is requested for this event.
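
The non-MSI-X interrupt handling described above can be sketched as
follows. This is only an illustration: the names ISR_QUEUE, ISR_CONFIG,
struct isr_actions and decode_isr are hypothetical, and the sketch
assumes the queue bit is bit 0 and the configuration-change bit is
bit 1 of the value read from the ISR Status field. It conservatively
treats any non-zero value as belonging to this device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the two ISR Status bits (assumed values). */
#define ISR_QUEUE   0x1u  /* lowest bit: a virtqueue may hold used buffers */
#define ISR_CONFIG  0x2u  /* next bit: configuration space has changed */

struct isr_actions {
    bool ours;          /* false: the interrupt was not for this device */
    bool scan_queues;   /* look through the used rings */
    bool reread_config; /* re-examine the configuration space */
};

/* Decide what to do from one read of ISR Status (the read resets it). */
static struct isr_actions decode_isr(uint8_t isr)
{
    struct isr_actions a = { false, false, false };

    if ((isr & (ISR_QUEUE | ISR_CONFIG)) == 0)
        return a;

    a.ours = true;
    a.reread_config = (isr & ISR_CONFIG) != 0;
    /* Even if only the config bit is set, virtqueues must be scanned. */
    a.scan_queues = true;
    return a;
}
```

A handler would read the ISR Status field exactly once per interrupt
and pass the value to decode_isr(), since the read itself clears the
field.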
819	
820	Creating New Device Types
821	
822	Various considerations are necessary when creating a new device
823	type:
824	
825	  How Many Virtqueues?
826	
It is possible that a very simple device will operate entirely
through its configuration space, but most will need at least one
virtqueue in which requests are placed. A device with both
input and output (eg. console and network devices described here)
needs two queues: one which the driver fills with buffers to
receive input, and one in which the driver places buffers to
transmit output.
834	
835	  What Configuration Space Layout?
836	
837	Configuration space is generally used for rarely-changing or
838	initialization-time parameters. But it is a limited resource, so
839	it might be better to use a virtqueue to update configuration
840	information (the network device does this for filtering,
841	otherwise the table in the config space could potentially be very
842	large).
843	
844	Note that this space is generally the guest's native endian,
845	rather than PCI's little-endian.
846	
847	  What Device Number?
848	
849	Currently device numbers are assigned quite freely: a simple
850	request mail to the author of this document or the Linux
851	virtualization mailing list[footnote:
852	
853	https://lists.linux-foundation.org/mailman/listinfo/virtualization
854	] will be sufficient to secure a unique one.
855	
856	Meanwhile for experimental drivers, use 65535 and work backwards.
857	
858	  How many MSI-X vectors?
859	
Using the optional MSI-X capability, devices can speed up
interrupt processing by removing the need for the guest driver to
read the ISR Status register (which might be an expensive
operation), reducing interrupt sharing between devices and queues
within the device, and allowing interrupts to be handled on
multiple CPUs. However, some systems impose a limit (which might
be as low as 256) on the total number of MSI-X vectors that can be
allocated to all devices. Devices and/or device drivers should
take this into account, limiting the number of vectors used unless
the device is expected to cause a high volume of interrupts.
Devices can control the number of vectors used by limiting the
MSI-X Table Size or not presenting MSI-X capability in PCI
configuration space. Drivers can control this by mapping events to
as small a number of vectors as possible, or disabling MSI-X
capability altogether.
875	
876	  Message Framing
877	
The descriptors used for a buffer should not affect the semantics
of the message, except for the total length of the buffer. For
example, a network buffer consists of a 10 byte header followed
by the network packet. Whether this is presented in the ring
descriptor chain as (say) a 10 byte buffer and a 1514 byte
buffer, or a single 1524 byte buffer, or even three buffers,
should have no effect.
885	
886	In particular, no implementation should use the descriptor
887	boundaries to determine the size of any header in a request.[footnote:
888	The current qemu device implementations mistakenly insist that
889	the first descriptor cover the header in these cases exactly, so
890	a cautious driver should arrange it so.
891	]
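
One way to stay independent of descriptor boundaries is to read
message fields by logical offset into the chain rather than by
descriptor. The sketch below assumes the descriptors have already
been mapped into memory and are described by a hypothetical struct
seg; chain_read is likewise an illustrative name, not part of any
virtio header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one descriptor: already-mapped memory. */
struct seg {
    const uint8_t *base;
    size_t len;
};

/*
 * Copy `want` bytes starting at logical offset `off` of the chain into dst,
 * crossing segment boundaries as needed. Returns the number of bytes
 * copied, which is less than `want` if the chain is too short.
 */
static size_t chain_read(const struct seg *segs, size_t nsegs,
                         size_t off, uint8_t *dst, size_t want)
{
    size_t copied = 0;

    for (size_t i = 0; i < nsegs && copied < want; i++) {
        if (off >= segs[i].len) {       /* offset lies past this segment */
            off -= segs[i].len;
            continue;
        }
        size_t n = segs[i].len - off;
        if (n > want - copied)
            n = want - copied;
        memcpy(dst + copied, segs[i].base + off, n);
        copied += n;
        off = 0;                        /* later segments start at 0 */
    }
    return copied;
}
```

Reading the 10 byte network header with chain_read(segs, n, 0, hdr, 10)
gives the same bytes whether the chain is one 1524 byte buffer, a
10 + 1514 split, or three arbitrary pieces.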
892	
893	  Device Improvements
894	
895	Any change to configuration space, or new virtqueues, or
896	behavioural changes, should be indicated by negotiation of a new
897	feature bit. This establishes clarity[footnote:
898	Even if it does mean documenting design or implementation
899	mistakes!
900	] and avoids future expansion problems.
901	
902	Clusters of functionality which are always implemented together
903	can use a single bit, but if one feature makes sense without the
904	others they should not be gratuitously grouped together to
905	conserve feature bits. We can always extend the spec when the
906	first person needs more than 24 feature bits for their device.
907	
909	
910	Appendix A: virtio_ring.h
911	
#ifndef VIRTIO_RING_H
#define VIRTIO_RING_H
/* An interface for efficient virtio implementation.
 *
 * This header is BSD licensed so anyone can use the definitions
 * to implement compatible drivers/servers.
 *
 * Copyright 2007, 2009, IBM Corporation
 * Copyright 2011, Red Hat, Inc
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of IBM nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
997	
998	
999	
/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT       1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE      2

/* The Host uses this in used->flags to advise the Guest: don't kick me
 * when you add a buffer.  It's unreliable, so it's simply an
 * optimization.  Guest will still kick if it's out of buffers. */
#define VRING_USED_F_NO_NOTIFY  1
/* The Guest uses this in avail->flags to advise the Host: don't
 * interrupt me when you consume a buffer.  It's unreliable, so it's
 * simply an optimization.  */
#define VRING_AVAIL_F_NO_INTERRUPT      1
1028	
1029	
1030	
/* Virtio ring descriptors: 16 bytes.
 * These can chain together via "next". */
struct vring_desc {
        /* Address (guest-physical). */
        uint64_t addr;
        /* Length. */
        uint32_t len;
        /* The flags as indicated above. */
        uint16_t flags;
        /* We chain unused descriptors via this, too */
        uint16_t next;
};

struct vring_avail {
        uint16_t flags;
        uint16_t idx;
        uint16_t ring[];
        uint16_t used_event;
};

/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
        /* Index of start of used descriptor chain. */
        uint32_t id;
        /* Total length of the descriptor chain which was written to. */
        uint32_t len;
};

struct vring_used {
        uint16_t flags;
        uint16_t idx;
        struct vring_used_elem ring[];
        uint16_t avail_event;
};

struct vring {
        unsigned int num;

        struct vring_desc *desc;
        struct vring_avail *avail;
        struct vring_used *used;
};
1115	
1116	
1117	
/* The standard layout for the ring is a continuous chunk of memory which
 * looks like this.  We assume num is a power of 2.
 *
 * struct vring {
 *      // The actual descriptors (16 bytes each)
 *      struct vring_desc desc[num];
 *
 *      // A ring of available descriptor heads with free-running index.
 *      __u16 avail_flags;
 *      __u16 avail_idx;
 *      __u16 available[num];
 *
 *      // Padding to the next align boundary.
 *      char pad[];
 *
 *      // A ring of used descriptor heads with free-running index.
 *      __u16 used_flags;
 *      __u16 used_idx;
 *      struct vring_used_elem used[num];
 * };
 * Note: for virtio PCI, align is 4096.
 */
1164	
static inline void vring_init(struct vring *vr, unsigned int num, void *p,
                              unsigned long align)
{
        vr->num = num;
        vr->desc = p;
        vr->avail = p + num*sizeof(struct vring_desc);
        vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
                              + align-1)
                            & ~(align - 1));
}

static inline unsigned vring_size(unsigned int num, unsigned long align)
{
        return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num)
                 + align - 1) & ~(align - 1))
                + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num;
}

static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
         return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
#endif /* VIRTIO_RING_H */
1216	
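As a worked illustration of the two helper calculations in the header
above, the sketch below restates them under hypothetical names
(example_vring_size, example_need_event) so that it stands alone. It
assumes the 16 byte descriptor and 8 byte used element sizes given in
the header, and that align is a power of 2.

```c
#include <stdint.h>

/* Same arithmetic as vring_size(): descriptor table plus available ring,
 * padded up to `align`, followed by the used ring. */
static unsigned example_vring_size(unsigned num, unsigned long align)
{
    unsigned long first = 16ul * num                     /* descriptors */
                        + sizeof(uint16_t) * (2 + num);  /* flags, idx, ring */

    first = (first + align - 1) & ~(align - 1);
    return (unsigned)(first
                      + sizeof(uint16_t) * 3             /* used header */
                      + 8ul * num);                      /* used elements */
}

/* Same test as vring_need_event(): true when event_idx lies in the
 * half-open window [old_idx, new_idx), computed modulo 2^16. */
static int example_need_event(uint16_t event_idx, uint16_t new_idx,
                              uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

For a 256 entry ring with align 4096, the first part occupies
4096 + 516 bytes and is padded to 8192, and the used ring adds
6 + 2048 bytes. The casts to uint16_t in example_need_event are what
make the comparison work when the free-running indices wrap from
0xFFFF to 0.
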
1217	<cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits
1218	
1219	Currently there are five device-independent feature bits defined:
1220	
1221	  VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
1222	  indicates that the driver wants an interrupt if the device runs
1223	  out of available descriptors on a virtqueue, even though
1224	  interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
1225	  flag or the used_event field. An example of this is the
1226	  networking driver: it doesn't need to know every time a packet
1227	  is transmitted, but it does need to free the transmitted
1228	  packets a finite time after they are transmitted. It can avoid
1229	  using a timer if the device interrupts it when all the packets
1230	  are transmitted.
1231	
1232	  VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature
1233	  indicates that the driver can use descriptors with the
1234	  VRING_DESC_F_INDIRECT flag set, as described in [sub:Indirect-Descriptors]
1235	  .
1236	
  VIRTIO_F_RING_EVENT_IDX (29) This feature enables the used_event
  and the avail_event fields. If set, it indicates that the
  device should ignore the flags field in the available ring
  structure. Instead, the used_event field in this structure is
  used by the guest to suppress device interrupts. Further, the
  driver should ignore the flags field in the used ring
  structure. Instead, the avail_event field in this structure is
  used by the device to suppress notifications. If unset, the
  driver should ignore the used_event field; the device should
  ignore the avail_event field; the flags field is used
1247	
1248	Appendix C: Network Device
1249	
The virtio network device is a virtual ethernet card, and is the
most complex of the devices supported so far by virtio. It has
been enhanced rapidly and demonstrates clearly how support for new
features should be added to an existing device. Empty buffers are
placed in one virtqueue for receiving packets, and outgoing
packets are enqueued into another for transmission in that order.
A third command queue is used to control advanced filtering
features.
1258	
1259	  Configuration
1260	
1261	  Subsystem Device ID 1
1262	
1263	  Virtqueues 0:receiveq. 1:transmitq. 2:controlq[footnote:
1264	Only if VIRTIO_NET_F_CTRL_VQ set
1265	]
1266	
1267	  Feature bits
1268	
1269	  VIRTIO_NET_F_CSUM (0) Device handles packets with partial
1270	    checksum
1271	
1272	  VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial
1273	    checksum
1274	
1275	  VIRTIO_NET_F_MAC (5) Device has given MAC address.
1276	
1277	  VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with
1278	    any GSO type.[footnote:
1279	It was supposed to indicate segmentation offload support, but
1280	upon further investigation it became clear that multiple bits
1281	were required.
1282	]
1283	
1284	  VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4.
1285	
1286	  VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6.
1287	
1288	  VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN.
1289	
1290	  VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO.
1291	
1292	  VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.
1293	
1294	  VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.
1295	
1296	  VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN.
1297	
1298	  VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO.
1299	
1300	  VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers.
1301	
1302	  VIRTIO_NET_F_STATUS (16) Configuration status field is
1303	    available.
1304	
1305	  VIRTIO_NET_F_CTRL_VQ (17) Control channel is available.
1306	
1307	  VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support.
1308	
1309	  VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering.
1310	
  VIRTIO_NET_F_GUEST_ANNOUNCE (21) Guest can send gratuitous
    packets.
1313	
  Device configuration layout Two configuration fields are
  currently defined. The mac address field always exists (though
  is only valid if VIRTIO_NET_F_MAC is set), and the status field
  only exists if VIRTIO_NET_F_STATUS is set. Two read-only bits
  are currently defined for the status field:
  VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE.

#define VIRTIO_NET_S_LINK_UP    1
#define VIRTIO_NET_S_ANNOUNCE   2

struct virtio_net_config {
    u8 mac[6];
    u16 status;
};
1332	
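A driver's use of the status field might look like the sketch below.
The function name net_link_up and the idea of passing the negotiated
feature bits as a 32-bit mask are assumptions made for illustration;
the bit numbers and status values are the ones listed in this
appendix. When VIRTIO_NET_F_STATUS was not negotiated the status
field does not exist, and the sketch treats the link as active.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_STATUS     16  /* feature bit number from the list above */
#define VIRTIO_NET_S_LINK_UP    1   /* bottom bit of the status field */

/* Report link state from the negotiated features and the status field. */
static bool net_link_up(uint32_t negotiated_features, uint16_t status)
{
    /* Without VIRTIO_NET_F_STATUS there is no status field to consult. */
    if (!(negotiated_features & (1u << VIRTIO_NET_F_STATUS)))
        return true;

    return (status & VIRTIO_NET_S_LINK_UP) != 0;
}
```
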
1333	  Device Initialization
1334	
1335	  The initialization routine should identify the receive and
1336	  transmission virtqueues.
1337	
  If the VIRTIO_NET_F_MAC feature bit is set, the configuration
  space “mac” entry indicates the “physical” address of the
  network card, otherwise a private MAC address should be
  assigned. All guests are expected to negotiate this feature if
  it is set.
1343	
1344	  If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify
1345	  the control virtqueue.
1346	
1347	  If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link
1348	  status can be read from the bottom bit of the “status” config
1349	  field. Otherwise, the link should be assumed active.
1350	
1351	  The receive virtqueue should be filled with receive buffers.
1352	  This is described in detail below in “Setting Up Receive
1353	  Buffers”.
1354	
  A driver can indicate that it will generate checksumless
  packets by negotiating the VIRTIO_NET_F_CSUM feature. This
  “checksum offload” is a common feature on modern network cards.
1358	
  If that feature is negotiated[footnote:
ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are
dependent on VIRTIO_NET_F_CSUM; a device which offers the offload
features must offer the checksum feature, and a driver which
accepts the offload features must accept the checksum feature.
Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features
depending on VIRTIO_NET_F_GUEST_CSUM.
], a driver can use TCP or UDP segmentation offload by
  negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 TCP),
  VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO
  (UDP fragmentation) features. It should not send TCP packets
  requiring segmentation offload which have the Explicit
  Congestion Notification bit set, unless the
  VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote:
This is a common restriction in real, older network cards.
]
1375	
1376	  The converse features are also available: a driver can save the
1377	  virtual device some work by negotiating these features.[footnote:
1378	For example, a network packet transported between two guests on
1379	the same system may not require checksumming at all, nor
1380	segmentation, if both guests are amenable.
1381	] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially
1382	  checksummed packets can be received, and if it can do that then
1383	  the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
1384	  VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input
1385	  equivalents of the features described above. See “Receiving
1386	  Packets” below.
1387	
1388	  Device Operation
1389	
Packets are transmitted by placing them in the transmitq, and
buffers for incoming packets are placed in the receiveq. In each
case, the packet itself is preceded by a header:
1393	
struct virtio_net_hdr {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM    1
    u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE        0
#define VIRTIO_NET_HDR_GSO_TCPV4       1
#define VIRTIO_NET_HDR_GSO_UDP         3
#define VIRTIO_NET_HDR_GSO_TCPV6       4
#define VIRTIO_NET_HDR_GSO_ECN      0x80
    u8 gso_type;
    u16 hdr_len;
    u16 gso_size;
    u16 csum_start;
    u16 csum_offset;
/* Only if VIRTIO_NET_F_MRG_RXBUF: */
    u16 num_buffers;
};
1425	
1426	The controlq is used to control device features such as
1427	filtering.
1428	
1429	  Packet Transmission
1430	
1431	Transmitting a single packet is simple, but varies depending on
1432	the different features the driver negotiated.
1433	
1434	  If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has
1435	  not been fully checksummed, then the virtio_net_hdr's fields
1436	  are set as follows. Otherwise, the packet must be fully
1437	  checksummed, and flags is zero.
1438	
1439	  flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set,
1440	
1441	  <ite:csum_start-is-set>csum_start is set to the offset within
1442	    the packet to begin checksumming, and
1443	
1444	  csum_offset indicates how many bytes after the csum_start the
1445	    new (16 bit ones' complement) checksum should be placed.[footnote:
1446	For example, consider a partially checksummed TCP (IPv4) packet.
1447	It will have a 14 byte ethernet header and 20 byte IP header
1448	followed by the TCP header (with the TCP checksum field 16 bytes
1449	into that header). csum_start will be 14+20 = 34 (the TCP
1450	checksum includes the header), and csum_offset will be 16. The
1451	value in the TCP checksum field should be initialized to the sum
1452	of the TCP pseudo header, so that replacing it by the ones'
1453	complement checksum of the TCP header and body will give the
1454	correct result.
1455	]
1456	
1457	  <enu:If-the-driver>If the driver negotiated
1458	  VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires
1459	  TCP segmentation or UDP fragmentation, then the “gso_type”
1460	  field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP.
1461	  (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this
1462	  case, packets larger than 1514 bytes can be transmitted: the
1463	  metadata indicates how to replicate the packet header to cut it
1464	  into smaller packets. The other gso fields are set:
1465	
1466	  hdr_len is a hint to the device as to how much of the header
1467	    needs to be kept to copy into each packet, usually set to the
1468	    length of the headers, including the transport header.[footnote:
1469	Due to various bugs in implementations, this field is not useful
1470	as a guarantee of the transport header size.
1471	]
1472	
1473	  gso_size is the maximum size of each packet beyond that header
1474	    (ie. MSS).
1475	
1476	  If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
1477	    VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well,
1478	    indicating that the TCP packet has the ECN bit set.[footnote:
1479	This case is not handled by some older hardware, so is called out
1480	specifically in the protocol.
1481	]
1482	
1483	  If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
1484	  the num_buffers field is set to zero.
1485	
1486	  The header and packet are added as one output buffer to the
1487	  transmitq, and the device is notified of the new entry (see [sub:Notifying-The-Device]
1488	  ).[footnote:
1489	Note that the header will be two bytes longer for the
1490	VIRTIO_NET_F_MRG_RXBUF case.
1491	]
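
To make the csum_start and csum_offset fields described above more
concrete, the following sketch shows how a receiver of a partially
checksummed packet could finish the checksum. It is a hedged
illustration, not the method of any particular device: ones_sum and
complete_partial_csum are hypothetical names, the packet is assumed
to be a flat byte array, and the checksum field is assumed to have
been seeded with the pseudo header sum by the sender.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of big-endian 16-bit words, folded to 16 bits.
 * `sum` is an initial value to add in (0 when starting fresh). */
static uint16_t ones_sum(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                        /* odd trailing byte, zero padded */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)               /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Checksum everything from csum_start to the end of the packet and store
 * the result csum_offset bytes after csum_start.
 * Returns 0 on success, -1 if the fields point outside the packet. */
static int complete_partial_csum(uint8_t *pkt, size_t len,
                                 uint16_t csum_start, uint16_t csum_offset)
{
    size_t field = (size_t)csum_start + csum_offset;

    if (csum_start > len || field + 2 > len)
        return -1;

    uint16_t csum = (uint16_t)~ones_sum(pkt + csum_start,
                                        len - csum_start, 0);
    pkt[field] = (uint8_t)(csum >> 8);
    pkt[field + 1] = (uint8_t)(csum & 0xFF);
    return 0;
}
```

For the TCP over IPv4 case in the footnote above, the call would be
complete_partial_csum(pkt, len, 34, 16). Bounds checking matters here
because both offsets come from the other side of the virtqueue.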
1492	
1493	  Packet Transmission Interrupt
1494	
1495	Often a driver will suppress transmission interrupts using the
1496	VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers]
1497	) and check for used packets in the transmit path of following
1498	packets. However, it will still receive interrupts if the
1499	VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that
1500	the transmission queue is completely emptied.
1501	
The normal behavior in this interrupt handler is to retrieve any
new descriptors from the used ring and free the corresponding
headers and packets.
1505	
1506	  Setting Up Receive Buffers
1507	
1508	It is generally a good idea to keep the receive virtqueue as
1509	fully populated as possible: if it runs out, network performance
1510	will suffer.
1511	
1512	If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or
1513	VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to
1514	accept packets of up to 65550 bytes long (the maximum size of a
1515	TCP or UDP packet, plus the 14 byte ethernet header), otherwise
1516	1514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every
1517	buffer in the receive queue needs to be at least this length [footnote:
1518	Obviously each one can be split across multiple descriptor
1519	elements.
1520	].
1521	
1522	If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at
1523	least the size of the struct virtio_net_hdr.
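
The sizing rules above can be collected into one helper, sketched
below. The names rx_max_packet, rx_min_buffer, NET_HDR_LEN and
NET_HDR_LEN_MRG are hypothetical. The sketch also assumes that room
for the virtio_net_hdr (10 bytes, or 12 with num_buffers) is added on
top of the packet length when buffers are not merged; the text does
not state this explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Feature bit numbers from the list earlier in this appendix. */
#define VIRTIO_NET_F_GUEST_TSO4   7
#define VIRTIO_NET_F_GUEST_TSO6   8
#define VIRTIO_NET_F_GUEST_UFO   10
#define VIRTIO_NET_F_MRG_RXBUF   15

#define NET_HDR_LEN       10  /* struct virtio_net_hdr without num_buffers */
#define NET_HDR_LEN_MRG   12  /* with the extra u16 num_buffers */

/* Largest packet the guest must be ready to accept. */
static size_t rx_max_packet(uint32_t features)
{
    uint32_t big = (1u << VIRTIO_NET_F_GUEST_TSO4)
                 | (1u << VIRTIO_NET_F_GUEST_TSO6)
                 | (1u << VIRTIO_NET_F_GUEST_UFO);

    return (features & big) ? 65550 : 1514;
}

/* Smallest acceptable length for one receive buffer. */
static size_t rx_min_buffer(uint32_t features)
{
    if (features & (1u << VIRTIO_NET_F_MRG_RXBUF))
        return NET_HDR_LEN_MRG;   /* packets may span several buffers */

    /* Assumption: header space is needed in addition to the packet. */
    return NET_HDR_LEN + rx_max_packet(features);
}
```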
1524	
1525	  Packet Receive Interrupt
1526	
1527	When a packet is copied into a buffer in the receiveq, the
1528	optimal path is to disable further interrupts for the receiveq
1529	(see [sub:Receiving-Used-Buffers]) and process packets until no
1530	more are found, then re-enable them.
1531	
Processing a packet involves:
1533	
  If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
  then the “num_buffers” field indicates how many descriptors
  this packet is spread over (including this one). This allows
  receipt of large packets without having to allocate large
  buffers. In this case, there will be at least “num_buffers”
  buffers in the used ring, and they should be chained together
  to form a single packet. The other buffers will not begin with
  a struct virtio_net_hdr.
1542	
1543	  If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or
1544	  the “num_buffers” field is one, then the entire packet will be
1545	  contained within this buffer, immediately following the struct
1546	  virtio_net_hdr.
1547	
  If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
  VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be
  set: if so, the checksum on the packet is incomplete and the
  “csum_start” and “csum_offset” fields indicate how to calculate
  it (see [ite:csum_start-is-set]).
1553	
1554	  If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
1555	  negotiated, then the “gso_type” may be something other than
1556	  VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
1557	  desired MSS (see [enu:If-the-driver]).
1558	
1559	  Control Virtqueue
1560	
The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is
negotiated) to send commands to manipulate various features of
the device which would not easily map into the configuration
space.
1565	
1566	All commands are of the following form:
1567	
1568	struct virtio_net_ctrl {
1569	
1570		u8 class;
1571	
1572		u8 command;
1573	
1574		u8 command-specific-data[];
1575	
1576		u8 ack;
1577	
1578	};
1579	
1580	
1581	
1582	/* ack values */
1583	
1584	#define VIRTIO_NET_OK     0
1585	
1586	#define VIRTIO_NET_ERR    1
1587	
1588	The class, command and command-specific-data are set by the
1589	driver, and the device sets the ack byte. There is little it can
1590	do except issue a diagnostic if the ack byte is not
1591	VIRTIO_NET_OK.
1592	
1593	  Packet Receive Filtering
1594	
1595	If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
1596	send control commands for promiscuous mode, multicast receiving,
1597	and filtering of MAC addresses.
1598	
1599	Note that in general, these commands are best-effort: unwanted
1600	packets may still arrive.
1601	
1602	  Setting Promiscuous Mode
1603	
1604	#define VIRTIO_NET_CTRL_RX    0
1605	
1606	 #define VIRTIO_NET_CTRL_RX_PROMISC      0
1607	
1608	 #define VIRTIO_NET_CTRL_RX_ALLMULTI     1
1609	
1610	The class VIRTIO_NET_CTRL_RX has two commands:
1611	VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
1612	VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
1613	off. The command-specific-data is one byte containing 0 (off) or
1614	1 (on).
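
As a concrete instance of the command format, the sketch below fills
in an RX mode command. struct rx_mode_cmd and the two helper functions
are hypothetical names; the flat four byte layout is only a
convenient way to show the fields and assumes the single data byte
described above. In a real virtqueue submission the ack byte has to
sit in a device-writable descriptor.

```c
#include <stdint.h>

#define VIRTIO_NET_CTRL_RX            0
#define VIRTIO_NET_CTRL_RX_PROMISC    0
#define VIRTIO_NET_CTRL_RX_ALLMULTI   1

#define VIRTIO_NET_OK                 0
#define VIRTIO_NET_ERR                1

/* Hypothetical flat image of one RX mode command. */
struct rx_mode_cmd {
    uint8_t class_id;   /* "class" in struct virtio_net_ctrl */
    uint8_t command;
    uint8_t on;         /* command-specific-data: 0 (off) or 1 (on) */
    uint8_t ack;        /* written by the device */
};

/* Build a command; any non-zero `on` is normalised to 1. */
static struct rx_mode_cmd make_rx_mode_cmd(uint8_t command, int on)
{
    struct rx_mode_cmd c;

    c.class_id = VIRTIO_NET_CTRL_RX;
    c.command = command;
    c.on = on ? 1 : 0;
    c.ack = VIRTIO_NET_ERR;  /* pessimistic until the device overwrites it */
    return c;
}

/* After the buffer comes back on the used ring, check the result. */
static int rx_mode_cmd_ok(const struct rx_mode_cmd *c)
{
    return c->ack == VIRTIO_NET_OK;
}
```

Pre-setting ack to VIRTIO_NET_ERR is a design choice of this sketch:
if the device never writes the byte, the driver reports a failure
rather than a silent success.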
1615	
1616	  Setting MAC Address Filtering
1617	
1618	struct virtio_net_ctrl_mac {
1619	
1620		u32 entries;
1621	
1622		u8 macs[entries][ETH_ALEN];
1623	
1624	};
1625	
1626	
1627	
1628	#define VIRTIO_NET_CTRL_MAC    1
1629	
1630	 #define VIRTIO_NET_CTRL_MAC_TABLE_SET        0
1631	
The device can filter incoming packets by any number of
destination MAC addresses.[footnote:
Since there are no guarantees, it can use a hash filter
or silently switch to allmulti or promiscuous mode if it is given
too many addresses.
] This table is set using the class VIRTIO_NET_CTRL_MAC and the
command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data
is two variable length tables of 6-byte MAC addresses. The first
table contains unicast addresses, and the second contains
multicast addresses.
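
The two-table payload can be serialised as sketched below. The helper
names put_mac_table and build_mac_filter are hypothetical, and the
sketch assumes the u32 entry count is written in the guest's native
byte order, which this document only states explicitly for
configuration space.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Append one table: a u32 entry count followed by entries * 6 bytes.
 * Returns the number of bytes written, or 0 if it does not fit. */
static size_t put_mac_table(uint8_t *out, size_t room,
                            const uint8_t *macs, uint32_t entries)
{
    size_t need = sizeof(entries) + (size_t)entries * ETH_ALEN;

    if (need > room)
        return 0;
    memcpy(out, &entries, sizeof(entries));  /* native order (assumption) */
    if (entries)
        memcpy(out + sizeof(entries), macs, (size_t)entries * ETH_ALEN);
    return need;
}

/* Build the command-specific-data: unicast table, then multicast table.
 * Returns the total length, or 0 if `room` is too small. */
static size_t build_mac_filter(uint8_t *out, size_t room,
                               const uint8_t *uc, uint32_t n_uc,
                               const uint8_t *mc, uint32_t n_mc)
{
    size_t a = put_mac_table(out, room, uc, n_uc);
    if (!a)
        return 0;

    size_t b = put_mac_table(out + a, room - a, mc, n_mc);
    if (!b)
        return 0;

    return a + b;
}
```

An empty table is still written as a count of zero, so the device can
always find the start of the multicast table.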
1642	
1643	  VLAN Filtering
1644	
If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it
can control a VLAN filter table in the device.
1647	
1648	#define VIRTIO_NET_CTRL_VLAN       2
1649	
1650	 #define VIRTIO_NET_CTRL_VLAN_ADD             0
1651	
1652	 #define VIRTIO_NET_CTRL_VLAN_DEL             1
1653	
Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
commands take a 16-bit VLAN id as the command-specific-data.
1656	
1657	  Gratuitous Packet Sending
1658	
If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE feature
(which depends on VIRTIO_NET_F_CTRL_VQ), the device can ask the
guest to send gratuitous packets; this is usually done after the
guest has been physically migrated, and needs to announce its
presence on the new network links. (As the hypervisor does not
have knowledge of the guest network configuration (eg. tagged
vlan) it is simplest to prod the guest in this way).
1666	
1667	#define VIRTIO_NET_CTRL_ANNOUNCE       3
1668	
1669	 #define VIRTIO_NET_CTRL_ANNOUNCE_ACK             0
1670	
The Guest needs to check the VIRTIO_NET_S_ANNOUNCE bit in the
status field when it notices a change of device configuration. The
command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that the
driver has received the notification; the device clears the
VIRTIO_NET_S_ANNOUNCE bit in the status field after it receives
this command.
1677	
1678	Processing this notification involves:
1679	
  Sending the gratuitous packets, or marking that gratuitous
  packets are pending and letting a deferred routine send them.

  Sending the VIRTIO_NET_CTRL_ANNOUNCE_ACK command through the
  control vq.
1688	
1689	Appendix D: Block Device
1690	
1691	The virtio block device is a simple virtual block device (ie.
1692	disk). Read and write requests (and other exotic requests) are
1693	placed in the queue, and serviced (probably out of order) by the
1694	device except where noted.
1695	
1696	  Configuration
1697	
1698	  Subsystem Device ID 2
1699	
1700	  Virtqueues 0:requestq.
1701	
1702	  Feature bits
1703	
1704	  VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.
1705	
1706	  VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
1707	    in “size_max”.
1708	
1709	  VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
1710	    request is in “seg_max”.
1711	
  VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in
    “geometry”.
1714	
1715	  VIRTIO_BLK_F_RO (5) Device is read-only.
1716	
1717	  VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
1718	
1719	  VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
1720	
1721	  VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
1722	
  Device configuration layout The capacity of the device
  (expressed in 512-byte sectors) is always present. The
  availability of the others depends on various feature bits
  as indicated above.

struct virtio_blk_config {
    u64 capacity;
    u32 size_max;
    u32 seg_max;
    struct virtio_blk_geometry {
        u16 cylinders;
        u8 heads;
        u8 sectors;
    } geometry;
    u32 blk_size;
};
1749	
1750	  Device Initialization
1751	
  The device size should be read from the “capacity”
  configuration field. No requests should be submitted which go
  beyond this limit.
1755	
  If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
  blk_size field can be read to determine the optimal sector size
  for the driver to use. This does not affect the units used in
  the protocol (always 512 bytes), but awareness of the correct
  value can affect performance.
1761	
1762	  If the VIRTIO_BLK_F_RO feature is set by the device, any write
1763	  requests will fail.
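
As an illustration of the capacity limit described above, a driver could validate each request before queueing it. The following is a minimal sketch only; the function name blk_request_in_range and its parameters are hypothetical and not part of this specification.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a request covering nsectors
 * 512-byte sectors, starting at sector, lies within the device
 * capacity (also expressed in 512-byte sectors). The subtraction
 * form avoids overflow in sector + nsectors.
 */
bool blk_request_in_range(uint64_t capacity, uint64_t sector,
			  uint64_t nsectors)
{
	if (sector > capacity)
		return false;
	return nsectors <= capacity - sector;
}
```

A request that fails this check would go beyond the limit given by the "capacity" field and should not be submitted.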
1764	
1765	  Device Operation
1766	
1767	The driver queues requests to the virtqueue, and they are used by
1768	the device (not necessarily in order). Each request is of form:
1769	
1770	struct virtio_blk_req {
1771	
1772	
1773	
1774		u32 type;
1775	
1776		u32 ioprio;
1777	
1778		u64 sector;
1779	
1780		char data[][512];
1781	
1782		u8 status;
1783	
1784	};
1785	
1786	If the device has the VIRTIO_BLK_F_SCSI feature, it can also
1787	support scsi packet command requests. Each request is of form:
1788	struct virtio_scsi_pc_req {
1789		u32 type;
1790	
1791		u32 ioprio;
1792	
1793		u64 sector;
1794	
1795	    char cmd[];
1796	
1797		char data[][512];
1798	
1799	#define SCSI_SENSE_BUFFERSIZE   96
1800	
1801	    u8 sense[SCSI_SENSE_BUFFERSIZE];
1802	
1803	    u32 errors;
1804	
1805	    u32 data_len;
1806	
1807	    u32 sense_len;
1808	
1809	    u32 residual;
1810	
1811		u8 status;
1812	
1813	};
1814	
1815	The type of the request is either a read (VIRTIO_BLK_T_IN), a
1816	write (VIRTIO_BLK_T_OUT), a scsi packet command
1817	(VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote:
1818	the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device
1819	does not distinguish between them
1820	]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote:
1821	the FLUSH and FLUSH_OUT types are equivalent, the device does not
1822	distinguish between them
1823	]). If the device has the VIRTIO_BLK_F_BARRIER feature, the high
1824	bit (VIRTIO_BLK_T_BARRIER) indicates that this request acts as a
1825	barrier: all preceding requests must be complete before this
1826	one starts, and no following request may be started until this
1827	one is complete. Note that a barrier does not flush caches in
1828	the underlying backend device in the host, and thus does not
1829	serve as a data consistency guarantee. The driver must use a
1830	FLUSH request to flush the host cache.
1831	
1832	#define VIRTIO_BLK_T_IN           0
1833	
1834	#define VIRTIO_BLK_T_OUT          1
1835	
1836	#define VIRTIO_BLK_T_SCSI_CMD     2
1837	
1838	#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
1839	
1840	#define VIRTIO_BLK_T_FLUSH        4
1841	
1842	#define VIRTIO_BLK_T_FLUSH_OUT    5
1843	
1844	#define VIRTIO_BLK_T_BARRIER	 0x80000000
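
Since the barrier flag shares the type field with the request type, a device implementation has to separate the two. A minimal sketch, assuming hypothetical helper names (blk_type_is_barrier and blk_type_base are not defined by this specification):

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_BLK_T_BARRIER 0x80000000u

/* Hypothetical helper: true if the high (barrier) bit is set. */
bool blk_type_is_barrier(uint32_t type)
{
	return (type & VIRTIO_BLK_T_BARRIER) != 0;
}

/* Hypothetical helper: the request type with the barrier bit cleared. */
uint32_t blk_type_base(uint32_t type)
{
	return type & ~VIRTIO_BLK_T_BARRIER;
}
```

For example, a type of 0x80000001 would be treated as a write (VIRTIO_BLK_T_OUT) that also acts as a barrier.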
1845	
1846	The ioprio field is a hint about the relative priorities of
1847	requests to the device: higher numbers indicate more important
1848	requests.
1849	
1850	The sector number indicates the offset (multiplied by 512) where
1851	the read or write is to occur. This field is unused and set to 0
1852	for scsi packet commands and for flush commands.
1853	
1854	The cmd field is only present for scsi packet command requests,
1855	and indicates the command to perform. This field must reside in a
1856	single, separate read-only buffer; command length can be derived
1857	from the length of this buffer.
1858	
1859	Note that these first three (four for scsi packet commands)
1860	fields are always read-only: the data field is either read-only
1861	or write-only, depending on the request. The size of the read or
1862	write can be derived from the total size of the request buffers.
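
For a plain read or write request, deriving the transfer size from the total buffer size could look like the sketch below. It assumes the request consists only of the 16-byte header (type, ioprio, sector), the data and the 1-byte status, and that the data is a whole number of 512-byte units as the data[][512] declaration suggests; the names blk_data_len, BLK_HDR_LEN, BLK_STATUS_LEN and BLK_SECTOR_SIZE are hypothetical.

```c
#include <stdint.h>

#define BLK_HDR_LEN     16u  /* type (4) + ioprio (4) + sector (8) */
#define BLK_STATUS_LEN  1u   /* trailing status byte */
#define BLK_SECTOR_SIZE 512u

/*
 * Hypothetical helper: given the total size of all buffers in a
 * read or write request, return the number of data bytes, or -1 if
 * the request is too short or the data is not a multiple of 512.
 */
int64_t blk_data_len(uint64_t total_len)
{
	uint64_t data;

	if (total_len < BLK_HDR_LEN + BLK_STATUS_LEN)
		return -1;
	data = total_len - BLK_HDR_LEN - BLK_STATUS_LEN;
	if (data % BLK_SECTOR_SIZE != 0 || data > INT64_MAX)
		return -1;
	return (int64_t)data;
}
```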
1863	
1864	The sense field is only present for scsi packet command requests,
1865	and indicates the buffer for scsi sense data.
1866	
1867	The data_len field is only present for scsi packet command
1868	requests. This field is deprecated and should be ignored by the
1869	driver. Historically, devices copied the data length there.
1870	
1871	The sense_len field is only present for scsi packet command
1872	requests and indicates the number of bytes actually written to
1873	the sense buffer.
1874	
1875	The residual field is only present for scsi packet command
1876	requests and indicates the residual size, calculated as data
1877	length - number of bytes actually transferred.
1878	
1879	The final status byte is written by the device: either
1880	VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
1881	error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
1882	
1883	#define VIRTIO_BLK_S_OK        0
1884	#define VIRTIO_BLK_S_IOERR     1
1885	#define VIRTIO_BLK_S_UNSUPP    2
1886	
1887	Historically, devices assumed that the fields type, ioprio and
1888	sector reside in a single, separate read-only buffer; the sense
1889	field in a separate write-only buffer of size 96 bytes, by
1890	itself; the fields errors, data_len, sense_len and residual in a
1891	single write-only buffer; and the status field in a separate
1892	write-only buffer of size 1 byte, by itself.
1895	
1896	Appendix E: Console Device
1897	
1898	The virtio console device is a simple device for data input and
1899	output. A device may have one or more ports. Each port has a pair
1900	of input and output virtqueues. Moreover, a device has a pair of
1901	control IO virtqueues. The control virtqueues are used to
1902	communicate information between the device and the driver about
1903	ports being opened and closed on either side of the connection,
1904	indication from the host about whether a particular port is a
1905	console port, adding new ports, port hot-plug/unplug, etc., and
1906	indication from the guest about whether a port or a device was
1907	successfully added, port open/close, etc. For data IO, one or
1908	more empty buffers are placed in the receive queue for incoming
1909	data and outgoing characters are placed in the transmit queue.
1910	
1911	  Configuration
1912	
1913	  Subsystem Device ID 3
1914	
1915	  Virtqueues 0:receiveq(port0), 1:transmitq(port0), 2:control
1916	  receiveq[footnote:
1917	Virtqueues 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set
1918	], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
1919	  ...
1920	
1921	  Feature bits
1922	
1923	  VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
1924	    are valid.
1925	
1926	  VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple
1927	    ports; configuration fields nr_ports and max_nr_ports are
1928	    valid and control virtqueues will be used.
1929	
1930	  Device configuration layout The size of the console is supplied
1931	  in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
1932	  is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
1933	  is set, the maximum number of ports supported by the device can
1934	  be fetched.
1935	struct virtio_console_config {
1936		u16 cols;
1937	
1938		u16 rows;
1939	
1940	
1941	
1942		u32 max_nr_ports;
1943	
1944	};
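
The virtqueue numbering listed under Configuration can be expressed as a small mapping from port number to queue index. This is a sketch only: the function names are hypothetical, and the formula for ports beyond port1 is an assumption extrapolated from the "..." in the virtqueue list.

```c
#include <stdint.h>

/*
 * Hypothetical helper: virtqueue index of the receiveq for a port.
 * Port 0 uses queues 0 and 1; queues 2 and 3 are the control pair,
 * so port 1 starts at queue 4, port 2 (assumed) at queue 6, etc.
 */
uint32_t console_port_rx_vq(uint32_t port)
{
	return port == 0 ? 0 : 2 * (port + 1);
}

/* Hypothetical helper: the transmitq directly follows the receiveq. */
uint32_t console_port_tx_vq(uint32_t port)
{
	return console_port_rx_vq(port) + 1;
}
```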
1945	
1946	  Device Initialization
1947	
1948	  If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
1949	  can read the console dimensions from the configuration fields.
1950	
1951	  If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
1952	  driver can spawn multiple ports, not all of which may be
1953	  attached to a console. Some could be generic ports. In this
1954	  case, the control virtqueues are enabled and according to the
1955	  max_nr_ports configuration-space value, the appropriate number
1956	  of virtqueues are created. A control message indicating the
1957	  driver is ready is sent to the host. The host can then send
1958	  control messages for adding new ports to the device. After
1959	  creating and initializing each port, a
1960	  VIRTIO_CONSOLE_PORT_READY control message is sent to the host
1961	  for that port so the host can let us know of any additional
1962	  configuration options set for that port.
1963	
1964	  The receiveq for each port is populated with one or more
1965	  receive buffers.
1966	
1967	  Device Operation
1968	
1969	  For output, a buffer containing the characters is placed in the
1970	  port's transmitq.[footnote:
1971	Because this is high importance and low bandwidth, the current
1972	Linux implementation polls for the buffer to be used, rather than
1973	waiting for an interrupt, simplifying the implementation
1974	significantly. However, for generic serial ports with the
1975	O_NONBLOCK flag set, the polling limitation is relaxed and the
1976	consumed buffers are freed upon the next write or poll call or
1977	when a port is closed or hot-unplugged.
1978	]
1979	
1980	  When a buffer is used in the receiveq (signalled by an
1981	  interrupt), the contents are the input to the port associated
1982	  with the virtqueue for which the notification was received.
1983	
1984	  If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
1985	  configuration change interrupt may occur. The updated size can
1986	  be read from the configuration fields.
1987	
1988	  If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
1989	  feature, active ports are announced by the host using the
1990	  VIRTIO_CONSOLE_PORT_ADD control message. The same message is
1991	  used for port hot-plug as well.
1992	
1993	  If the host specified a port `name', a sysfs attribute is
1994	  created with the name filled in, so that udev rules can be
1995	  written that can create a symlink from the port's name to the
1996	  char device for port discovery by applications in the guest.
1997	
1998	  Changes to ports' state are effected by control messages.
1999	  Appropriate action is taken on the port indicated in the
2000	  control message. The layout of the structure of the control
2001	  buffer and the associated events are:
2002	struct virtio_console_control {
2003		uint32_t id;    /* Port number */
2004	
2005		uint16_t event; /* The kind of control event */
2006	
2007		uint16_t value; /* Extra information for the event */
2008	
2009	};
2010	
2011	
2012	
2013	/* Some events for the internal messages (control packets) */
2014	
2015	
2016	
2017	#define VIRTIO_CONSOLE_DEVICE_READY     0
2018	
2019	#define VIRTIO_CONSOLE_PORT_ADD         1
2020	
2021	#define VIRTIO_CONSOLE_PORT_REMOVE      2
2022	
2023	#define VIRTIO_CONSOLE_PORT_READY       3
2024	
2025	#define VIRTIO_CONSOLE_CONSOLE_PORT     4
2026	
2027	#define VIRTIO_CONSOLE_RESIZE           5
2028	
2029	#define VIRTIO_CONSOLE_PORT_OPEN        6
2030	
2031	#define VIRTIO_CONSOLE_PORT_NAME        7
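
As an example of the control buffer layout, the message a driver sends after setting up a port might be built as sketched below. The helper name console_port_ready_msg is hypothetical, and the value of 1 is an assumption (this text does not define the meaning of the value field for VIRTIO_CONSOLE_PORT_READY).

```c
#include <stdint.h>

struct virtio_console_control {
	uint32_t id;    /* Port number */
	uint16_t event; /* The kind of control event */
	uint16_t value; /* Extra information for the event */
};

#define VIRTIO_CONSOLE_PORT_READY 3

/* Hypothetical helper: control message announcing that a port is ready. */
struct virtio_console_control console_port_ready_msg(uint32_t port)
{
	struct virtio_console_control msg = {
		.id = port,
		.event = VIRTIO_CONSOLE_PORT_READY,
		.value = 1, /* assumption: 1 indicates success */
	};
	return msg;
}
```

The resulting 8-byte structure would then be placed in a buffer on the control transmitq.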
2032	
2033	Appendix F: Entropy Device
2034	
2035	The virtio entropy device supplies high-quality randomness for
2036	guest use.
2037	
2038	  Configuration
2039	
2040	  Subsystem Device ID 4
2041	
2042	  Virtqueues 0:requestq.
2043	
2044	  Feature bits None currently defined
2045	
2046	  Device configuration layout None currently defined.
2047	
2048	  Device Initialization
2049	
2050	  The virtqueue is initialized.
2051	
2052	  Device Operation
2053	
2054	When the driver requires random bytes, it places the descriptor
2055	of one or more buffers in the queue. The device will completely
2056	fill them with random data.
2057	
2058	Appendix G: Memory Balloon Device
2059	
2060	The virtio memory balloon device is a primitive device for
2061	managing guest memory: the device asks for a certain amount of
2062	memory, and the guest supplies it (or withdraws it, if the device
2063	has more than it asks for). This allows the guest to adapt to
2064	changes in allowance of underlying physical memory. If the
2065	VIRTIO_BALLOON_F_STATS_VQ feature is negotiated, the device can
2066	also be used to communicate guest memory statistics to the host.
2067	
2068	  Configuration
2069	
2070	  Subsystem Device ID 5
2071	
2072	  Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[footnote:
2073	Only if VIRTIO_BALLOON_F_STATS_VQ set
2074	]
2075	
2076	  Feature bits
2077	
2078	  VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
2079	    pages from the balloon are used.
2080	
2081	  VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
2082	    memory statistics is present.
2083	
2084	  Device configuration layout Both fields of this configuration
2085	  are always available. Note that they are little endian, despite
2086	  convention that device fields are guest endian:
2087	struct virtio_balloon_config {
2088		u32 num_pages;
2089	
2090		u32 actual;
2091	
2092	};
2093	
2094	  Device Initialization
2095	
2096	  The inflate and deflate virtqueues are identified.
2097	
2098	  If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated:
2099	
2100	  Identify the stats virtqueue.
2101	
2102	  Add one empty buffer to the stats virtqueue and notify the
2103	    host.
2104	
2105	Device operation begins immediately.
2106	
2107	  Device Operation
2108	
2109	  Memory Ballooning The device is driven by the receipt of a
2110	  configuration change interrupt.
2111	
2112	  The “num_pages” configuration field is examined. If this is
2113	  greater than the “actual” number of pages, memory must be given
2114	  to the balloon. If it is less than the “actual” number of
2115	  pages, memory may be taken back from the balloon for general
2116	  use.
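
The comparison of "num_pages" and "actual" might be sketched as follows. Both helpers are hypothetical; le32_decode reflects the note above that the configuration fields are little endian regardless of guest endianness.

```c
#include <stdint.h>

/* Decode a little-endian 32-bit configuration field from raw bytes. */
uint32_t le32_decode(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/*
 * Hypothetical helper: number of pages to move. A positive result
 * means memory must be given to the balloon (inflate), a negative
 * result means memory may be taken back (deflate), zero means
 * nothing to do.
 */
int64_t balloon_delta(uint32_t num_pages, uint32_t actual)
{
	return (int64_t)num_pages - (int64_t)actual;
}
```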
2117	
2118	  To supply memory to the balloon (aka. inflate):
2119	
2120	  The driver constructs an array of addresses of unused memory
2121	    pages. These addresses are divided by 4096[footnote:
2122	This is historical, and independent of the guest page size
2123	] and the descriptor describing the resulting 32-bit array is
2124	    added to the inflateq.
2125	
2126	  To remove memory from the balloon (aka. deflate):
2127	
2128	  The driver constructs an array of addresses of memory pages it
2129	    has previously given to the balloon, as described above. This
2130	    descriptor is added to the deflateq.
2131	
2132	  If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the
2133	    guest may not use these requested pages until that descriptor
2134	    in the deflateq has been used by the device.
2135	
2136	  Otherwise, the guest may begin to re-use pages previously given
2137	    to the balloon before the device has acknowledged their
2138	    withdrawal.[footnote:
2139	In this case, deflation advice is merely a courtesy
2140	]
2141	
2142	  In either case, once the device has completed the inflation or
2143	  deflation, the “actual” field of the configuration should be
2144	  updated to reflect the new number of pages in the balloon.[footnote:
2145	As updates to configuration space are not atomic, this field
2146	isn't particularly reliable, but can be used to diagnose buggy
2147	guests.
2148	]
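
Building the 32-bit array used for both inflation and deflation could look like the sketch below. The function name balloon_fill_pfns and the constant name are hypothetical; rejecting addresses whose quotient does not fit in 32 bits is an assumption about how a driver might handle that case.

```c
#include <stddef.h>
#include <stdint.h>

#define BALLOON_PFN_SHIFT 12 /* addresses are divided by 4096 */

/*
 * Hypothetical helper: convert n page addresses into the 32-bit
 * values placed in the inflateq or deflateq. Returns 0 on success,
 * or -1 if an address is too large to be represented.
 */
int balloon_fill_pfns(const uint64_t *addrs, size_t n, uint32_t *pfns)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t pfn = addrs[i] >> BALLOON_PFN_SHIFT;

		if (pfn > UINT32_MAX)
			return -1;
		pfns[i] = (uint32_t)pfn;
	}
	return 0;
}
```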
2149	
2150	  Memory Statistics
2151	
2152	The stats virtqueue is atypical because communication is driven
2153	by the device (not the driver). The channel becomes active at
2154	driver initialization time when the driver adds an empty buffer
2155	and notifies the device. A request for memory statistics proceeds
2156	as follows:
2157	
2158	  The device pushes the buffer onto the used ring and sends an
2159	  interrupt.
2160	
2161	  The driver pops the used buffer and discards it.
2162	
2163	  The driver collects memory statistics and writes them into a
2164	  new buffer.
2165	
2166	  The driver adds the buffer to the virtqueue and notifies the
2167	  device.
2168	
2169	  The device pops the buffer (retaining it to initiate a
2170	  subsequent request) and consumes the statistics.
2171	
2172	  Memory Statistics Format Each statistic consists of a 16-bit
2173	  tag and a 64-bit value. Both quantities are represented in the
2174	  native endianness of the guest. All statistics are optional and
2175	  the driver may choose which ones to supply. To guarantee backwards
2176	  compatibility, unsupported statistics should be omitted.
2177	
2178	  struct virtio_balloon_stat {
2179	
2180	#define VIRTIO_BALLOON_S_SWAP_IN  0
2181	
2182	#define VIRTIO_BALLOON_S_SWAP_OUT 1
2183	
2184	#define VIRTIO_BALLOON_S_MAJFLT   2
2185	
2186	#define VIRTIO_BALLOON_S_MINFLT   3
2187	
2188	#define VIRTIO_BALLOON_S_MEMFREE  4
2189	
2190	#define VIRTIO_BALLOON_S_MEMTOT   5
2191	
2192		u16 tag;
2193	
2194		u64 val;
2195	
2196	} __attribute__((packed));
2197	
2198	  Tags
2199	
2200	  VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
2201	  swapped in (in bytes).
2202	
2203	  VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
2204	  swapped out to disk (in bytes).
2205	
2206	  VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
2207	  have occurred.
2208	
2209	  VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
2210	  have occurred.
2211	
2212	  VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
2213	  for any purpose (in bytes).
2214	
2215	  VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
2216	  (in bytes).
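
A driver filling the statistics buffer might append entries as sketched below. The helper balloon_stat_add is hypothetical; the packed attribute is the GCC/Clang syntax used in the structure definition above, giving 10 bytes per entry.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BALLOON_S_MEMFREE 4
#define VIRTIO_BALLOON_S_MEMTOT  5

struct virtio_balloon_stat {
	uint16_t tag;
	uint64_t val;
} __attribute__((packed));

/*
 * Hypothetical helper: append one statistic to buf, which has room
 * for cap entries and currently holds count. Returns the new count,
 * or the unchanged count if the buffer is full. Statistics the
 * driver cannot supply are simply never added.
 */
size_t balloon_stat_add(struct virtio_balloon_stat *buf, size_t count,
			size_t cap, uint16_t tag, uint64_t val)
{
	if (count >= cap)
		return count;
	buf[count].tag = tag;
	buf[count].val = val;
	return count + 1;
}
```

The number of bytes to place in the stats virtqueue is then the entry count multiplied by sizeof(struct virtio_balloon_stat).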
2217	
2218	Appendix H: Rpmsg: Remote Processor Messaging
2219	
2220	Virtio rpmsg devices represent remote processors on the system
2221	which run in asymmetric multi-processing (AMP) configuration, and
2222	which are usually used to offload cpu-intensive tasks from the
2223	main application processor (a typical SoC methodology).
2224	
2225	Virtio is being used to communicate with those remote processors;
2226	empty buffers are placed in one virtqueue for receiving messages,
2227	and non-empty buffers, containing outbound messages, are enqueued
2228	in a second virtqueue for transmission.
2229	
2230	Numerous communication channels can be multiplexed over those two
2231	virtqueues, so different entities, running on the application and
2232	remote processor, can directly communicate in a point-to-point
2233	fashion.
2234	
2235	  Configuration
2236	
2237	  Subsystem Device ID 7
2238	
2239	  Virtqueues 0:receiveq. 1:transmitq.
2240	
2241	  Feature bits
2242	
2243	  VIRTIO_RPMSG_F_NS (0) Device sends (and is capable of receiving)
2244	    name service messages announcing the creation (or
2245	    destruction) of a channel:
2246	
2247	/**
2248	
2249	 * struct rpmsg_ns_msg - dynamic name service announcement message
2250	
2251	 * @name: name of remote service that is published
2252	
2253	 * @addr: address of remote service that is published
2254	
2255	 * @flags: indicates whether service is created or destroyed
2256	
2257	 *
2258	
2259	 * This message is sent across to publish a new service (or announce
2260	
2261	 * its removal). When we receive these messages, an appropriate
2262	
2263	 * rpmsg channel (i.e. device) is created/destroyed.
2264	
2265	 */
2266	
2267	struct rpmsg_ns_msg {
2268	
2269	
2270		char name[RPMSG_NAME_SIZE];
2271	
2272		u32 addr;
2273	
2274		u32 flags;
2275	
2276	} __packed;
2277	
2278	
2279	
2280	/**
2281	
2282	 * enum rpmsg_ns_flags - dynamic name service announcement flags
2283	
2284	 *
2285	
2286	 * @RPMSG_NS_CREATE: a new remote service was just created
2287	
2288	 * @RPMSG_NS_DESTROY: a remote service was just destroyed
2289	
2290	 */
2291	
2292	enum rpmsg_ns_flags {
2293	
2294		RPMSG_NS_CREATE = 0,
2295	
2296		RPMSG_NS_DESTROY = 1,
2297	
2298	};
2299	
2300	  Device configuration layout
2301	
2302	At this point none are currently defined.
2303	
2304	  Device Initialization
2305	
2306	  The initialization routine should identify the receive and
2307	  transmission virtqueues.
2308	
2309	  The receive virtqueue should be filled with receive buffers.
2310	
2311	  Device Operation
2312	
2313	Messages are transmitted by placing them in the transmitq, and
2314	buffers for inbound messages are placed in the receiveq. In any
2315	case, messages are always preceded by the following header:
2316	/**
2317	 * struct rpmsg_hdr - common header for all rpmsg messages
2318	
2319	 * @src: source address
2320	
2321	 * @dst: destination address
2322	
2323	 * @reserved: reserved for future use
2324	
2325	 * @len: length of payload (in bytes)
2326	
2327	 * @flags: message flags
2328	
2329	 * @data: @len bytes of message payload data
2330	
2331	 *
2332	
2333	 * Every message sent(/received) on the rpmsg bus begins with
2334	 * this header.
2335	
2336	 */
2337	
2338	struct rpmsg_hdr {
2339	
2340		u32 src;
2341	
2342		u32 dst;
2343	
2344		u32 reserved;
2345	
2346		u16 len;
2347	
2348		u16 flags;
2349	
2350		u8 data[0];
2351	
2352	} __packed;
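
Placing a message in a transmit buffer could be sketched as below. The function name rpmsg_build and the constant RPMSG_HDR_LEN are hypothetical, and writing the header fields in the byte order of the sending processor is an assumption, since this text does not state the byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RPMSG_HDR_LEN 16u /* src, dst, reserved (4 each) + len, flags (2 each) */

/*
 * Hypothetical helper: write an rpmsg header followed by len bytes
 * of payload into buf. Returns the total number of bytes written,
 * or 0 if the payload does not fit the 16-bit len field or buf.
 */
size_t rpmsg_build(uint8_t *buf, size_t buf_len, uint32_t src,
		   uint32_t dst, const void *payload, size_t len)
{
	uint32_t reserved = 0;
	uint16_t len16, flags = 0;

	if (len > UINT16_MAX || buf_len < RPMSG_HDR_LEN + len)
		return 0;
	len16 = (uint16_t)len;
	memcpy(buf, &src, 4);
	memcpy(buf + 4, &dst, 4);
	memcpy(buf + 8, &reserved, 4);
	memcpy(buf + 12, &len16, 2);
	memcpy(buf + 14, &flags, 2);
	if (len > 0)
		memcpy(buf + RPMSG_HDR_LEN, payload, len);
	return RPMSG_HDR_LEN + len;
}
```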
2353	
2354	Appendix I: SCSI Host Device
2355	
2356	The virtio SCSI host device groups together one or more virtual
2357	logical units (such as disks), and allows communicating with them
2358	using the SCSI protocol. An instance of the device represents a
2359	SCSI host to which many targets and LUNs are attached.
2360	
2361	The virtio SCSI device services two kinds of requests:
2362	
2363	  command requests for a logical unit;
2364	
2365	  task management functions related to a logical unit, target or
2366	  command.
2367	
2368	The device is also able to send out notifications about added and
2369	removed logical units. Together, these capabilities provide a
2370	SCSI transport protocol that uses virtqueues as the transfer
2371	medium. In the transport protocol, the virtio driver acts as the
2372	initiator, while the virtio SCSI host provides one or more
2373	targets that receive and process the requests.
2374	
2375	  Configuration
2376	
2377	  Subsystem Device ID 8
2378	
2379	  Virtqueues 0:controlq; 1:eventq; 2..n:request queues.
2380	
2381	  Feature bits
2382	
2383	  VIRTIO_SCSI_F_INOUT (0) A single request can include both
2384	    read-only and write-only data buffers.
2385	
2386	  VIRTIO_SCSI_F_HOTPLUG (1) The host should enable
2387	    hot-plug/hot-unplug of new LUNs and targets on the SCSI bus.
2388	
2389	  Device configuration layout All fields of this configuration
2390	  are always available. sense_size and cdb_size are writable by
2391	  the guest.
2392	struct virtio_scsi_config {
2393	    u32 num_queues;
2394	
2395	    u32 seg_max;
2396	
2397	    u32 max_sectors;
2398	
2399	    u32 cmd_per_lun;
2400	
2401	    u32 event_info_size;
2402	
2403	    u32 sense_size;
2404	
2405	    u32 cdb_size;
2406	
2407	    u16 max_channel;
2408	
2409	    u16 max_target;
2410	
2411	    u32 max_lun;
2412	
2413	};
2414	
2415	  num_queues is the total number of request virtqueues exposed by
2416	    the device. The driver is free to use only one request queue,
2417	    or it can use more to achieve better performance.
2418	
2419	  seg_max is the maximum number of segments that can be in a
2420	    command. A bidirectional command can include seg_max input
2421	    segments and seg_max output segments.
2422	
2423	  max_sectors is a hint to the guest about the maximum transfer
2424	    size it should use.
2425	
2426	  cmd_per_lun is a hint to the guest about the maximum number of
2427	    linked commands it should send to one LUN. The actual value
2428	    to be used is the minimum of cmd_per_lun and the virtqueue
2429	    size.
2430	
2431	  event_info_size is the maximum size that the device will fill
2432	    for buffers that the driver places in the eventq. The driver
2433	    should always put buffers at least of this size. It is
2434	    written by the device depending on the set of negotiated
2435	    features.
2436	
2437	  sense_size is the maximum size of the sense data that the
2438	    device will write. The default value is written by the device
2439	    and will always be 96, but the driver can modify it. It is
2440	    restored to the default when the device is reset.
2441	
2442	  cdb_size is the maximum size of the CDB that the driver will
2443	    write. The default value is written by the device and will
2444	    always be 32, but the driver can likewise modify it. It is
2445	    restored to the default when the device is reset.
2446	
2447	  max_channel, max_target and max_lun can be used by the driver
2448	    as hints to constrain scanning the logical units on the
2449	    host.
2450	
2451	  Device Initialization
2452	
2453	The initialization routine should first of all discover the
2454	device's virtqueues.
2455	
2456	If the driver uses the eventq, it should then place at least one
2457	buffer in the eventq.
2458	
2459	The driver can immediately issue requests (for example, INQUIRY
2460	or REPORT LUNS) or task management functions (for example, I_T
2461	RESET).
2462	
2463	  Device Operation: request queues
2464	
2465	The driver queues requests to an arbitrary request queue, and
2466	they are used by the device on that same queue. It is the
2467	responsibility of the driver to ensure strict request ordering
2468	for commands placed on different queues, because they will be
2469	consumed with no order constraints.
2470	
2471	Requests have the following format:
2472	
2473	struct virtio_scsi_req_cmd {
2474	
2475	    // Read-only part
2476	
2477	    u8 lun[8];
2478	
2479	    u64 id;
2480	
2481	    u8 task_attr;
2482	
2483	    u8 prio;
2484	
2485	    u8 crn;
2486	
2487	    char cdb[cdb_size];
2488	
2489	    char dataout[];
2490	
2491	    // Write-only part
2492	
2493	    u32 sense_len;
2494	
2495	    u32 residual;
2496	
2497	    u16 status_qualifier;
2498	
2499	    u8 status;
2500	
2501	    u8 response;
2502	
2503	    u8 sense[sense_size];
2504	
2505	    char datain[];
2506	
2507	};
2508	
2509	
2510	
2511	/* command-specific response values */
2512	
2513	#define VIRTIO_SCSI_S_OK                0
2514	
2515	#define VIRTIO_SCSI_S_OVERRUN           1
2516	
2517	#define VIRTIO_SCSI_S_ABORTED           2
2518	
2519	#define VIRTIO_SCSI_S_BAD_TARGET        3
2520	
2521	#define VIRTIO_SCSI_S_RESET             4
2522	
2523	#define VIRTIO_SCSI_S_BUSY              5
2524	
2525	#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
2526	
2527	#define VIRTIO_SCSI_S_TARGET_FAILURE    7
2528	
2529	#define VIRTIO_SCSI_S_NEXUS_FAILURE     8
2530	
2531	#define VIRTIO_SCSI_S_FAILURE           9
2532	
2533	
2534	
2535	/* task_attr */
2536	
2537	#define VIRTIO_SCSI_S_SIMPLE            0
2538	
2539	#define VIRTIO_SCSI_S_ORDERED           1
2540	
2541	#define VIRTIO_SCSI_S_HEAD              2
2542	
2543	#define VIRTIO_SCSI_S_ACA               3
2544	
2545	The lun field addresses a target and logical unit in the
2546	virtio-scsi device's SCSI domain. The only supported format for
2547	the LUN field is: first byte set to 1, second byte set to target,
2548	third and fourth byte representing a single level LUN structure,
2549	followed by four zero bytes. With this representation, a
2550	virtio-scsi device can serve up to 256 targets and 16384 LUNs per
2551	target.
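
The LUN format described above could be produced as in the following sketch. The function name is hypothetical. Setting 0x40 in the third byte selects the flat addressing method of SAM for the single level LUN structure; that detail is an assumption taken from SAM rather than from this text, and it is what limits the LUN to 14 bits (16384 values).

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: fill the 8-byte lun field for a target and
 * LUN. Returns 0 on success, or -1 if lun is not below 16384.
 */
int virtio_scsi_encode_lun(uint8_t out[8], uint8_t target, uint16_t lun)
{
	if (lun >= 16384)
		return -1;
	memset(out, 0, 8);
	out[0] = 1;
	out[1] = target;
	out[2] = (uint8_t)((lun >> 8) | 0x40); /* assumption: SAM flat addressing */
	out[3] = (uint8_t)(lun & 0xff);
	return 0;
}
```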
2552	
2553	The id field is the command identifier (“tag”).
2554	
2555	task_attr, prio and crn should be set to zero. task_attr defines
2556	the task attribute as in the table above, but all task attributes
2557	may be mapped to SIMPLE by the device; crn may also be provided
2558	by clients, but is generally expected to be 0. The maximum CRN
2559	value defined by the protocol is 255, since CRN is stored in an
2560	8-bit integer.
2561	
2562	All of these fields are defined in SAM. They are always
2563	read-only, as are the cdb and dataout fields. The cdb_size is
2564	taken from the configuration space.
2565	
2566	sense_len and subsequent fields are always write-only. sense_len
2567	indicates the number of bytes actually written to the sense
2568	buffer. The residual field indicates the residual size,
2569	calculated as “data_length - number_of_transferred_bytes”, for
2570	read or write operations. For bidirectional commands, the
2571	number_of_transferred_bytes includes both read and written bytes.
2572	A residual field that is less than the size of datain means that
2573	the dataout field was processed entirely. A residual field that
2574	exceeds the size of datain means that the dataout field was
2575	processed partially and the datain field was not processed at
2576	all.
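
The interpretation of the residual field for a bidirectional command might be sketched as follows. The structure and function names are hypothetical. The case where the residual equals the size of datain is not covered by the text; treating it as "dataout complete, datain untouched" is an assumption.

```c
#include <stdint.h>

/* Hypothetical result type: bytes processed in each direction. */
struct scsi_xfer {
	uint32_t dataout_done;
	uint32_t datain_done;
};

/*
 * Hypothetical helper: split a residual into per-direction progress.
 * A residual not exceeding datain_len means dataout was processed
 * entirely; a larger residual means datain was not processed at all
 * and the excess is the unprocessed part of dataout.
 */
struct scsi_xfer scsi_split_residual(uint32_t dataout_len,
				     uint32_t datain_len,
				     uint32_t residual)
{
	struct scsi_xfer x;

	if (residual <= datain_len) {
		x.dataout_done = dataout_len;
		x.datain_done = datain_len - residual;
	} else {
		uint32_t out_left = residual - datain_len;

		x.dataout_done = out_left > dataout_len ?
				 0 : dataout_len - out_left;
		x.datain_done = 0;
	}
	return x;
}
```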
2577	
2578	The status byte is written by the device to be the status code as
2579	defined in SAM.
2580	
2581	The response byte is written by the device to be one of the
2582	following:
2583	
2584	  VIRTIO_SCSI_S_OK when the request was completed and the status
2585	  byte is filled with a SCSI status code (not necessarily
2586	  "GOOD").
2587	
2588	  VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires
2589	  transferring more data than is available in the data buffers.
2590	
2591	  VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an
2592	  ABORT TASK or ABORT TASK SET task management function.
2593	
2594	  VIRTIO_SCSI_S_BAD_TARGET if the request was never processed
2595	  because the target indicated by the lun field does not exist.
2596	
2597	  VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus
2598	  or device reset (including a task management function).
2599	
2600	  VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a
2601	  problem in the connection between the host and the target
2602	  (severed link).
2603	
2604	  VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a
2605	  failure and the guest should not retry on other paths.
2606	
2607	  VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure
2608	  but retrying on other paths might yield a different result.
2609	
2610	  VIRTIO_SCSI_S_BUSY if the request failed but retrying on the
2611	  same path should work.
2612	
2613	  VIRTIO_SCSI_S_FAILURE for other host or guest error. In
2614	  particular, if neither dataout nor datain is empty, and the
2615	  VIRTIO_SCSI_F_INOUT feature has not been negotiated, the
2616	  request will be immediately returned with a response equal to
2617	  VIRTIO_SCSI_S_FAILURE.
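
Only three of the response values above carry explicit retry guidance. A driver could capture that guidance as in the sketch below; the enum and function names are hypothetical, and every response without guidance in the text maps to SCSI_RETRY_UNSPECIFIED rather than to an invented policy.

```c
#include <stdint.h>

#define VIRTIO_SCSI_S_BUSY           5
#define VIRTIO_SCSI_S_TARGET_FAILURE 7
#define VIRTIO_SCSI_S_NEXUS_FAILURE  8

/* Hypothetical retry hints; these names are not part of the protocol. */
enum scsi_retry_hint {
	SCSI_RETRY_UNSPECIFIED,    /* the text gives no retry guidance */
	SCSI_RETRY_SAME_PATH,      /* retrying on the same path should work */
	SCSI_RETRY_OTHER_PATH,     /* other paths might yield a different result */
	SCSI_NO_RETRY_OTHER_PATHS, /* do not retry on other paths */
};

/* Hypothetical helper: map a response byte to its retry hint. */
enum scsi_retry_hint scsi_response_retry_hint(uint8_t response)
{
	switch (response) {
	case VIRTIO_SCSI_S_BUSY:
		return SCSI_RETRY_SAME_PATH;
	case VIRTIO_SCSI_S_NEXUS_FAILURE:
		return SCSI_RETRY_OTHER_PATH;
	case VIRTIO_SCSI_S_TARGET_FAILURE:
		return SCSI_NO_RETRY_OTHER_PATHS;
	default:
		return SCSI_RETRY_UNSPECIFIED;
	}
}
```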
2618	
2619	  Device Operation: controlq
2620	
2621	The controlq is used for other SCSI transport operations.
2622	Requests have the following format:
2623	
2624	struct virtio_scsi_ctrl {
2625	
2626	    u32 type;
2627	
2628	    ...
2629	
2630	    u8 response;
2631	
2632	};
2633	
2634	
2635	
2636	/* response values valid for all commands */
2637	
2638	#define VIRTIO_SCSI_S_OK                       0
2639	
2640	#define VIRTIO_SCSI_S_BAD_TARGET               3
2641	
2642	#define VIRTIO_SCSI_S_BUSY                     5
2643	
2644	#define VIRTIO_SCSI_S_TRANSPORT_FAILURE        6
2645	
2646	#define VIRTIO_SCSI_S_TARGET_FAILURE           7
2647	
2648	#define VIRTIO_SCSI_S_NEXUS_FAILURE            8
2649	
2650	#define VIRTIO_SCSI_S_FAILURE                  9
2651	
2652	#define VIRTIO_SCSI_S_INCORRECT_LUN            12
2653	
2654	The type identifies the remaining fields.
2655	
2656	The following commands are defined:
2657	
2658	  Task management function
2659	#define VIRTIO_SCSI_T_TMF                      0
2660	
2661	
2662	
2663	#define VIRTIO_SCSI_T_TMF_ABORT_TASK           0
2664	
2665	#define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET       1
2666	
2667	#define VIRTIO_SCSI_T_TMF_CLEAR_ACA            2
2668	
2669	#define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET       3
2670	
2671	#define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET      4
2672	
2673	#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET   5
2674	
2675	#define VIRTIO_SCSI_T_TMF_QUERY_TASK           6
2676	
2677	#define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET       7
2678	
2679	
2680	
2681	struct virtio_scsi_ctrl_tmf
2682	
2683	{
2684	
2685	    // Read-only part
2686	
2687	    u32 type;
2688	
2689	    u32 subtype;
2690	
2691	    u8 lun[8];
2692	
2693	    u64 id;
2694	
2695	    // Write-only part
2696	
2697	    u8 response;
2698	
2699	};
2700	
2701	
2702	
2703	/* command-specific response values */
2704	
2705	#define VIRTIO_SCSI_S_FUNCTION_COMPLETE        0
2706	
2707	#define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED       10
2708	
2709	#define VIRTIO_SCSI_S_FUNCTION_REJECTED        11
2710	
2711	  The type is VIRTIO_SCSI_T_TMF. All fields except response are
2712	  filled by the driver. The subtype field must always be
2713	  specified and identifies the requested task management
2714	  function.
2715	
2716	  Other fields may be irrelevant for the requested TMF; if so,
2717	  they are ignored but they should still be present. The lun
2718	  field is in the same format specified for request queues; the
2719	  single level LUN is ignored when the task management function
2720	  addresses a whole I_T nexus. When relevant, the value of the id
2721	  field is matched against the id values passed on the requestq.
2722	
2723	  The outcome of the task management function is written by the
2724	  device in the response field. The command-specific response
2725	  values map 1-to-1 with those defined in SAM.
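
Filling the driver-written part of a task management request could look like the sketch below, here for ABORT TASK. The structure name tmf_request and the function tmf_abort_task are hypothetical; the structure mirrors only the read-only part of struct virtio_scsi_ctrl_tmf, and the response byte would be supplied in a separate write-only buffer.

```c
#include <stdint.h>
#include <string.h>

#define VIRTIO_SCSI_T_TMF            0
#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0

/* Read-only part of a task management request, filled by the driver. */
struct tmf_request {
	uint32_t type;
	uint32_t subtype;
	uint8_t  lun[8];
	uint64_t id;
};

/*
 * Hypothetical helper: build an ABORT TASK request for the command
 * that was queued with the given id on the given lun.
 */
struct tmf_request tmf_abort_task(const uint8_t lun[8], uint64_t id)
{
	struct tmf_request req;

	memset(&req, 0, sizeof(req));
	req.type = VIRTIO_SCSI_T_TMF;
	req.subtype = VIRTIO_SCSI_T_TMF_ABORT_TASK;
	memcpy(req.lun, lun, sizeof(req.lun));
	req.id = id;
	return req;
}
```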
2726	
2727	  Asynchronous notification query
#define VIRTIO_SCSI_T_AN_QUERY                    1

struct virtio_scsi_ctrl_an {
    // Read-only part
    u32 type;
    u8  lun[8];
    u32 event_requested;
    // Write-only part
    u32 event_actual;
    u8  response;
}
2749	
2750	
2751	
#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE  2
#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT          4
#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST    8
#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE        16
#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST          32
#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY         64
2763	
  By sending this command, the driver asks the device which
  events the given LUN can report, as described in paragraphs 6.6
  and A.6 of the SCSI MMC specification. The driver writes the
  events it is interested in into the event_requested field; the
  device responds by writing the events that it supports into
  event_actual.
2770	
2771	  The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested
2772	  fields are written by the driver. The event_actual and response
2773	  fields are written by the device.
2774	
2775	  No command-specific values are defined for the response byte.
2776	
2777	  Asynchronous notification subscription
#define VIRTIO_SCSI_T_AN_SUBSCRIBE                2

struct virtio_scsi_ctrl_an {
    // Read-only part
    u32 type;
    u8  lun[8];
    u32 event_requested;
    // Write-only part
    u32 event_actual;
    u8  response;
}
2799	
  By sending this command, the driver asks the specified LUN to
  report events for its physical interface, again as described in
  the SCSI MMC specification. The driver writes the events it is
  interested in into the event_requested field; the device
  responds by writing the events that it supports into
  event_actual.
2805	
2806	  Event types are the same as for the asynchronous notification
2807	  query message.
2808	
2809	  The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and
2810	  event_requested fields are written by the driver. The
2811	  event_actual and response fields are written by the device.
2812	
2813	  No command-specific values are defined for the response byte.
2814	
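  The query and subscription commands share one layout, so a driver
  could plausibly query first and then subscribe only to the events
  the device reported. The sketch below illustrates that flow under
  stated assumptions: an_prepare and an_granted are hypothetical
  helper names, stdint types stand in for u8/u32, and the packed
  attribute is GCC/Clang syntax.

```c
#include <stdint.h>
#include <string.h>

#define VIRTIO_SCSI_T_AN_QUERY              1
#define VIRTIO_SCSI_T_AN_SUBSCRIBE          2
#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT    4
#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE  16

/* Same field order as the structure above; packing is an assumption. */
struct virtio_scsi_ctrl_an {
    /* Read-only part: filled by the driver */
    uint32_t type;
    uint8_t  lun[8];
    uint32_t event_requested;
    /* Write-only part: filled by the device */
    uint32_t event_actual;
    uint8_t  response;
} __attribute__((packed));

/* Hypothetical helper: fill the driver-owned fields of a query or
 * subscription request. */
static void an_prepare(struct virtio_scsi_ctrl_an *req, uint32_t type,
                       const uint8_t lun[8], uint32_t events)
{
    memset(req, 0, sizeof(*req));
    req->type = type;
    memcpy(req->lun, lun, 8);
    req->event_requested = events;
}

/* Hypothetical helper: events that were both requested by the driver
 * and reported back by the device in event_actual. */
static uint32_t an_granted(const struct virtio_scsi_ctrl_an *req)
{
    return req->event_requested & req->event_actual;
}
```

  A driver following this pattern would send a
  VIRTIO_SCSI_T_AN_QUERY request, compute an_granted() on the
  completed buffer, and pass that mask as event_requested in a
  VIRTIO_SCSI_T_AN_SUBSCRIBE request.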
2815	  Device Operation: eventq
2816	
2817	The eventq is used by the device to report information on logical
2818	units that are attached to it. The driver should always leave a
2819	few buffers ready in the eventq. In general, the device will not
2820	queue events to cope with an empty eventq, and will end up
2821	dropping events if it finds no buffer ready. However, when
2822	reporting events for many LUNs (e.g. when a whole target
2823	disappears), the device can throttle events to avoid dropping
2824	them. For this reason, placing 10-15 buffers on the event queue
2825	should be enough.
2826	
2827	Buffers are placed in the eventq and filled by the device when
2828	interesting events occur. The buffers should be strictly
2829	write-only (device-filled) and the size of the buffers should be
2830	at least the value given in the device's configuration
2831	information.
2832	
2833	Buffers returned by the device on the eventq will be referred to
2834	as "events" in the rest of this section. Events have the
2835	following format:
2836	
#define VIRTIO_SCSI_T_EVENTS_MISSED   0x80000000

struct virtio_scsi_event {
    // Write-only part
    u32 event;
    ...
}
2850	
2851	If bit 31 is set in the event field, the device failed to report
2852	an event due to missing buffers. In this case, the driver should
2853	poll the logical units for unit attention conditions, and/or do
2854	whatever form of bus scan is appropriate for the guest operating
2855	system.
2856	
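A driver therefore has to separate the "events missed" flag from the
rest of the event field before dispatching. A minimal sketch follows;
the helper names are hypothetical, and treating the remaining 31 bits
as the event type is an assumption based on the flag being carried in
bit 31.

```c
#include <stdint.h>

#define VIRTIO_SCSI_T_EVENTS_MISSED   0x80000000u

/* Nonzero if the device dropped at least one event for lack of buffers. */
static int virtio_scsi_event_missed(uint32_t event)
{
    return (event & VIRTIO_SCSI_T_EVENTS_MISSED) != 0;
}

/* Event type with the "events missed" flag masked off (assumed encoding). */
static uint32_t virtio_scsi_event_type(uint32_t event)
{
    return event & ~VIRTIO_SCSI_T_EVENTS_MISSED;
}
```

When virtio_scsi_event_missed() is true, the driver would schedule
the polling or bus scan described above in addition to handling the
event type, if any.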
2857	Other data that the device writes to the buffer depends on the
2858	contents of the event field. The following events are defined:
2859	
2860	  No event
2861	#define VIRTIO_SCSI_T_NO_EVENT         0
2862	
2863	  This event is fired in the following cases:
2864	
2865	  When the device detects in the eventq a buffer that is shorter
2866	    than what is indicated in the configuration field, it might
2867	    use it immediately and put this dummy value in the event
2868	    field. A well-written driver will never observe this
2869	    situation.
2870	
  When events are dropped, the device may signal this event as
    soon as the driver makes a buffer available, in order to
    request action from the driver. In this case, of course, this
    event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED
    flag.
2876	
2877	  Transport reset
#define VIRTIO_SCSI_T_TRANSPORT_RESET  1

struct virtio_scsi_event_reset {
    // Write-only part
    u32 event;
    u8  lun[8];
    u32 reason;
}

#define VIRTIO_SCSI_EVT_RESET_HARD         0
#define VIRTIO_SCSI_EVT_RESET_RESCAN       1
#define VIRTIO_SCSI_EVT_RESET_REMOVED      2
2901	
  By sending this event, the device signals that a logical unit
  on a target has been reset, including the case of a new device
  appearing or disappearing on the bus. The device fills in all
  fields. The event field is set to
  VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a
  logical unit in the SCSI host.
2908	
2909	  The reason value is one of the three #define values appearing
2910	  above:
2911	
2912	  VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used if
2913	    the target or logical unit is no longer able to receive
2914	    commands.
2915	
2916	  VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the
2917	    logical unit has been reset, but is still present.
2918	
2919	  VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if a
2920	    target or logical unit has just appeared on the device.
2921	
2922	  The “removed” and “rescan” events, when sent for LUN 0, may
2923	  apply to the entire target. After receiving them the driver
2924	  should ask the initiator to rescan the target, in order to
2925	  detect the case when an entire target has appeared or
2926	  disappeared. These two events will never be reported unless the
2927	  VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host
2928	  and the guest.
2929	
2930	  Events will also be reported via sense codes (this obviously
2931	  does not apply to newly appeared buses or targets, since the
2932	  application has never discovered them):
2933	
2934	  “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc
2935	    0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED)
2936	
2937	  “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29
2938	    (POWER ON, RESET OR BUS DEVICE RESET OCCURRED)
2939	
2940	  “rescan LUN/target” maps to sense key UNIT ATTENTION, asc 0x3f,
2941	    ascq 0x0e (REPORTED LUNS DATA HAS CHANGED)
2942	
  The preferred way to detect transport reset is always to use
  events, because sense codes are only seen by the driver when it
  sends a SCSI command to the logical unit or target. However, in
  case events are dropped, the initiator will still be able to
  synchronize with the actual state of the controller if the
  driver asks the initiator to rescan the SCSI bus. During the
  rescan, the initiator will be able to observe the above sense
  codes, and it will process them as if the driver had received
  the equivalent event.
2952	
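  The correspondence between reset reasons and sense codes can be
  written as a small lookup table. This is only a sketch: the
  sense_triple type and the function name are hypothetical, the
  numeric sense key values (0x05 for ILLEGAL REQUEST, 0x06 for UNIT
  ATTENTION) come from the SCSI standards rather than from this
  document, and the ascq of 0x00 for the hard reset case is an
  assumption because the text above gives only the asc.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_SCSI_EVT_RESET_HARD     0
#define VIRTIO_SCSI_EVT_RESET_RESCAN   1
#define VIRTIO_SCSI_EVT_RESET_REMOVED  2

/* Hypothetical container for a sense key / asc / ascq triple. */
struct sense_triple {
    uint8_t key;
    uint8_t asc;
    uint8_t ascq;
};

/* Return the sense data equivalent to a transport reset reason,
 * or NULL if the reason is not one of the three defined values. */
static const struct sense_triple *reset_reason_to_sense(uint32_t reason)
{
    static const struct sense_triple map[] = {
        /* UNIT ATTENTION, power on/reset occurred (ascq assumed 0x00) */
        [VIRTIO_SCSI_EVT_RESET_HARD]    = { 0x06, 0x29, 0x00 },
        /* UNIT ATTENTION, reported LUNs data has changed */
        [VIRTIO_SCSI_EVT_RESET_RESCAN]  = { 0x06, 0x3f, 0x0e },
        /* ILLEGAL REQUEST, logical unit not supported */
        [VIRTIO_SCSI_EVT_RESET_REMOVED] = { 0x05, 0x25, 0x00 },
    };

    if (reason >= sizeof(map) / sizeof(map[0]))
        return NULL;
    return &map[reason];
}
```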
2953	  Asynchronous notification
#define VIRTIO_SCSI_T_ASYNC_NOTIFY     2

struct virtio_scsi_event_an {
    // Write-only part
    u32 event;
    u8  lun[8];
    u32 reason;
}
2969	
2970	  By sending this event, the device signals that an asynchronous
2971	  event was fired from a physical interface.
2972	
2973	  All fields are written by the device. The event field is set to
2974	  VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical
2975	  unit in the SCSI host. The reason field is a subset of the
2976	  events that the driver has subscribed to via the "Asynchronous
2977	  notification subscription" command.
2978	
2979	  When dropped events are reported, the driver should poll for
2980	  asynchronous events manually using SCSI commands.
2981	
2982	Appendix X: virtio-mmio
2983	
Virtual environments without PCI support (a common situation in
embedded device models) might use a simple memory mapped device
(“virtio-mmio”) instead of the PCI device.

The memory mapped virtio device behaviour is based on the PCI
device specification. Therefore most operations, such as device
initialization, queue configuration and buffer transfers, are
nearly identical. Existing differences are described in the
following sections.
2993	
2994	  Device Initialization
2995	
Instead of using the PCI IO space for the virtio header, the
“virtio-mmio” device provides a set of memory mapped control
registers, all 32 bits wide, followed by device-specific
configuration space. The following list presents their layout:
3000	
3001	  Offset from the device base address | Direction | Name
3002	 Description
3003	
3004	  0x000 | R | MagicValue
3005	 “virt” string.
3006	
3007	  0x004 | R | Version
3008	 Device version number. Currently must be 1.
3009	
  0x008 | R | DeviceID
 Virtio Subsystem Device ID (e.g. 1 for network card).
3012	
3013	  0x00c | R | VendorID
3014	 Virtio Subsystem Vendor ID.
3015	
  0x010 | R | HostFeatures
 Flags representing features the device supports.
 Reading from this register returns 32 consecutive flag bits,
  first bit depending on the last value written to the
  HostFeaturesSel register. Access to this register returns bits
  HostFeaturesSel*32 to (HostFeaturesSel*32)+31, e.g. feature
  bits 0 to 31 if HostFeaturesSel is set to 0 and feature bits
  32 to 63 if HostFeaturesSel is set to 1. Also see
  [sub:Feature-Bits]
3026	
3027	  0x014 | W | HostFeaturesSel
3028	 Device (Host) features word selection.
3029	 Writing to this register selects a set of 32 device feature bits
3030	  accessible by reading from HostFeatures register. Device driver
3031	  must write a value to the HostFeaturesSel register before
3032	  reading from the HostFeatures register.
3033	
  0x020 | W | GuestFeatures
 Flags representing device features understood and activated by
  the driver.
 Writing to this register sets 32 consecutive flag bits, first
  bit depending on the last value written to the GuestFeaturesSel
  register. Access to this register sets bits
  GuestFeaturesSel*32 to (GuestFeaturesSel*32)+31, e.g. feature
  bits 0 to 31 if GuestFeaturesSel is set to 0 and feature bits
  32 to 63 if GuestFeaturesSel is set to 1. Also see
  [sub:Feature-Bits]
3045	
3046	  0x024 | W | GuestFeaturesSel
3047	 Activated (Guest) features word selection.
3048	 Writing to this register selects a set of 32 activated feature
3049	  bits accessible by writing to the GuestFeatures register.
3050	  Device driver must write a value to the GuestFeaturesSel
3051	  register before writing to the GuestFeatures register.
3052	
3053	  0x028 | W | GuestPageSize
3054	 Guest page size.
3055	 Device driver must write the guest page size in bytes to the
3056	  register during initialization, before any queues are used.
3057	  This value must be a power of 2 and is used by the Host to
3058	  calculate Guest address of the first queue page (see QueuePFN).
3059	
3060	  0x030 | W | QueueSel
3061	 Virtual queue index (first queue is 0).
3062	 Writing to this register selects the virtual queue that the
3063	  following operations on QueueNum, QueueAlign and QueuePFN apply
3064	  to.
3065	
3066	  0x034 | R | QueueNumMax
3067	 Maximum virtual queue size.
3068	 Reading from the register returns the maximum size of the queue
3069	  the Host is ready to process or zero (0x0) if the queue is not
3070	  available. This applies to the queue selected by writing to
3071	  QueueSel and is allowed only when QueuePFN is set to zero
3072	  (0x0), so when the queue is not actively used.
3073	
  0x038 | W | QueueNum
 Virtual queue size.
 Queue size is the number of elements in the queue, and therefore
  the size of the descriptor table and of both the available and
  used rings.
 Writing to this register notifies the Host what size of queue
  the Guest will use. This applies to the queue selected by
  writing to QueueSel.
3081	
3082	  0x03c | W | QueueAlign
3083	 Used Ring alignment in the virtual queue.
3084	 Writing to this register notifies the Host about alignment
3085	  boundary of the Used Ring in bytes. This value must be a power
3086	  of 2 and applies to the queue selected by writing to QueueSel.
3087	
3088	  0x040 | RW | QueuePFN
3089	 Guest physical page number of the virtual queue.
3090	 Writing to this register notifies the host about location of the
3091	  virtual queue in the Guest's physical address space. This value
3092	  is the index number of a page starting with the queue
3093	  Descriptor Table. Value zero (0x0) means physical address zero
3094	  (0x00000000) and is illegal. When the Guest stops using the
3095	  queue it must write zero (0x0) to this register.
3096	 Reading from this register returns the currently used page
3097	  number of the queue, therefore a value other than zero (0x0)
3098	  means that the queue is in use.
3099	 Both read and write accesses apply to the queue selected by
3100	  writing to QueueSel.
3101	
3102	  0x050 | W | QueueNotify
3103	 Queue notifier.
3104	 Writing a queue index to this register notifies the Host that
3105	  there are new buffers to process in the queue.
3106	
  0x060 | R | InterruptStatus
 Interrupt status.
 Reading from this register returns a bit mask of interrupts
  asserted by the device. An interrupt is asserted if the
  corresponding bit is set, i.e. equals one (1).

  Bit 0 | Used Ring Update
 This interrupt is asserted when the Host has updated the Used
  Ring in at least one of the active virtual queues.

  Bit 1 | Configuration change
 This interrupt is asserted when configuration of the device has
  changed.
3120	
3121	  0x064 | W | InterruptACK
3122	 Interrupt acknowledge.
3123	 Writing to this register notifies the Host that the Guest
3124	  finished handling interrupts. Set bits in the value clear the
3125	  corresponding bits of the InterruptStatus register.
3126	
3127	  0x070 | RW | Status
3128	 Device status.
3129	 Reading from this register returns the current device status
3130	  flags.
3131	 Writing non-zero values to this register sets the status flags,
3132	  indicating the Guest progress. Writing zero (0x0) to this
3133	  register triggers a device reset.
3134	 Also see [sub:Device-Initialization-Sequence]
3135	
  0x100+ | RW | Config
 Device-specific configuration space starts at offset 0x100
  and is accessed with byte alignment. Its meaning and size
  depend on the device and the driver.
3140	
Virtual queue size is the number of elements in the queue, and
therefore the size of the descriptor table and of both the
available and used rings.
3144	
3145	The endianness of the registers follows the native endianness of
3146	the Guest. Writing to registers described as “R” and reading from
3147	registers described as “W” is not permitted and can cause
3148	undefined behavior.
3149	
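The select-then-access pattern used by the feature registers can be
sketched as follows. The VIRTIO_MMIO_* macro names and the helper
functions are illustrative assumptions; only the offsets come from
the list above. The sketch treats the register block as an array of
32-bit words reachable through a pointer, and assumes that no more
than 64 feature bits are of interest.

```c
#include <stdint.h>

/* Register offsets from the list above; the macro names are illustrative. */
#define VIRTIO_MMIO_HOST_FEATURES       0x010
#define VIRTIO_MMIO_HOST_FEATURES_SEL   0x014
#define VIRTIO_MMIO_GUEST_FEATURES      0x020
#define VIRTIO_MMIO_GUEST_FEATURES_SEL  0x024

/* All control registers are 32 bits wide, so index by offset / 4. */
static uint32_t mmio_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static void mmio_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/* Read feature bits 0..63: select a 32-bit word, then read it. */
static uint64_t virtio_mmio_read_host_features(volatile uint32_t *base)
{
    uint64_t features;

    mmio_write(base, VIRTIO_MMIO_HOST_FEATURES_SEL, 1);
    features = (uint64_t)mmio_read(base, VIRTIO_MMIO_HOST_FEATURES) << 32;
    mmio_write(base, VIRTIO_MMIO_HOST_FEATURES_SEL, 0);
    features |= mmio_read(base, VIRTIO_MMIO_HOST_FEATURES);
    return features;
}

/* Write activated feature bits 0..63: select a word, then write it. */
static void virtio_mmio_write_guest_features(volatile uint32_t *base,
                                             uint64_t features)
{
    mmio_write(base, VIRTIO_MMIO_GUEST_FEATURES_SEL, 0);
    mmio_write(base, VIRTIO_MMIO_GUEST_FEATURES, (uint32_t)features);
    mmio_write(base, VIRTIO_MMIO_GUEST_FEATURES_SEL, 1);
    mmio_write(base, VIRTIO_MMIO_GUEST_FEATURES, (uint32_t)(features >> 32));
}
```

In a real driver, base would point at the mapped device registers;
any accessor or barrier requirements of the platform are outside the
scope of this sketch.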
The device initialization is performed as described in
[sub:Device-Initialization-Sequence] with one exception: the
Guest must notify the Host about its page size, writing the size
in bytes to the GuestPageSize register before the initialization
is finished.
3154	
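A probe routine along these lines might check the identification
registers and then report the page size. This is a hedged sketch: the
function and macro names are hypothetical, the -1 error convention is
arbitrary, and the magic constant assumes that the “virt” string is
read with 'v' in the least significant byte, as it would be on a
little-endian Guest.

```c
#include <stdint.h>

/* Register offsets from the list above; the macro names are illustrative. */
#define VIRTIO_MMIO_MAGIC_VALUE      0x000
#define VIRTIO_MMIO_VERSION          0x004
#define VIRTIO_MMIO_DEVICE_ID        0x008
#define VIRTIO_MMIO_GUEST_PAGE_SIZE  0x028

/* 'v' | 'i' << 8 | 'r' << 16 | 't' << 24 (assumed little-endian view) */
#define VIRTIO_MMIO_MAGIC            0x74726976u

static uint32_t mmio_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static void mmio_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/* Returns 0 and stores the device ID on success, -1 if the block does not
 * look like a version 1 virtio-mmio device or page_size is not a power of 2. */
static int virtio_mmio_probe(volatile uint32_t *base, uint32_t page_size,
                             uint32_t *device_id)
{
    if (page_size == 0 || (page_size & (page_size - 1)) != 0)
        return -1;
    if (mmio_read(base, VIRTIO_MMIO_MAGIC_VALUE) != VIRTIO_MMIO_MAGIC)
        return -1;
    if (mmio_read(base, VIRTIO_MMIO_VERSION) != 1)
        return -1;

    *device_id = mmio_read(base, VIRTIO_MMIO_DEVICE_ID);

    /* Must happen before any queue is used. */
    mmio_write(base, VIRTIO_MMIO_GUEST_PAGE_SIZE, page_size);
    return 0;
}
```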
Memory mapped virtio devices generate a single interrupt only,
therefore no special configuration is required.
3157	
3158	  Virtqueue Configuration
3159	
3160	The virtual queue configuration is performed in a similar way to
3161	the one described in [sec:Virtqueue-Configuration] with a few
3162	additional operations:
3163	
  Select the queue by writing its index (first queue is 0) to the
  QueueSel register.

  Check that the queue is not already in use: read the QueuePFN
  register; the returned value should be zero (0x0).
3169	
3170	  Read maximum queue size (number of elements) from the
3171	  QueueNumMax register. If the returned value is zero (0x0) the
3172	  queue is not available.
3173	
3174	  Allocate and zero the queue pages in contiguous virtual memory,
3175	  aligning the Used Ring to an optimal boundary (usually page
3176	  size). Size of the allocated queue may be smaller than or equal
3177	  to the maximum size returned by the Host.
3178	
3179	  Notify the Host about the queue size by writing the size to
3180	  QueueNum register.
3181	
3182	  Notify the Host about the used alignment by writing its value
3183	  in bytes to QueueAlign register.
3184	
  Write the physical page number of the first page of the queue
  to the QueuePFN register.
3187	
3188	The queue and the device are ready to begin normal operations
3189	now.
3190	
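The register accesses in the steps above can be sketched in C as
shown below. Memory allocation is left to the caller, who passes the
Guest physical address of an already allocated and zeroed queue. The
function and macro names, the parameter list and the -1 error
convention are assumptions for illustration; the sketch also assumes
that GuestPageSize was already written with the same page_size and
that the page number fits in 32 bits.

```c
#include <stdint.h>

/* Register offsets from the list above; the macro names are illustrative. */
#define VIRTIO_MMIO_QUEUE_SEL      0x030
#define VIRTIO_MMIO_QUEUE_NUM_MAX  0x034
#define VIRTIO_MMIO_QUEUE_NUM      0x038
#define VIRTIO_MMIO_QUEUE_ALIGN    0x03c
#define VIRTIO_MMIO_QUEUE_PFN      0x040

static uint32_t mmio_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static void mmio_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/* Configure queue `index` with `num` elements located at `queue_phys`.
 * Returns 0 on success, -1 if the queue is busy, unavailable, too large
 * for the Host, or not page aligned. */
static int virtio_mmio_setup_queue(volatile uint32_t *base, uint32_t index,
                                   uint32_t num, uint32_t align,
                                   uint64_t queue_phys, uint32_t page_size)
{
    uint32_t max;

    if (page_size == 0 || queue_phys % page_size != 0)
        return -1;

    /* Select the queue. */
    mmio_write(base, VIRTIO_MMIO_QUEUE_SEL, index);

    /* A non-zero QueuePFN means the queue is already in use. */
    if (mmio_read(base, VIRTIO_MMIO_QUEUE_PFN) != 0)
        return -1;

    /* Zero means the queue is not available. */
    max = mmio_read(base, VIRTIO_MMIO_QUEUE_NUM_MAX);
    if (max == 0 || num == 0 || num > max)
        return -1;

    /* Report size and Used Ring alignment, then the first page number. */
    mmio_write(base, VIRTIO_MMIO_QUEUE_NUM, num);
    mmio_write(base, VIRTIO_MMIO_QUEUE_ALIGN, align);
    mmio_write(base, VIRTIO_MMIO_QUEUE_PFN, (uint32_t)(queue_phys / page_size));
    return 0;
}
```

For example, with a 4096-byte page size, a queue placed at Guest
physical address 0x10000 results in the value 0x10 being written to
QueuePFN.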
3191	  Device Operation
3192	
3193	The memory mapped virtio device behaves in the same way as
3194	described in [sec:Device-Operation], with the following
3195	exceptions:
3196	
  The device is notified about new buffers available in a queue
  by writing the queue index to the QueueNotify register instead
  of the virtio header in PCI I/O space
  ([sub:Notifying-The-Device]).

  The memory mapped virtio device uses a single, dedicated
  interrupt signal, which is raised when at least one of the
  interrupts described in the InterruptStatus register
  description is asserted. After receiving an interrupt, the
  driver must read the InterruptStatus register to check what
  caused the interrupt (see the register description). After the
  interrupt is handled, the driver must acknowledge it by writing
  a bit mask corresponding to the serviced interrupt to the
  InterruptACK register.
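The notification and interrupt paths might look like the following
sketch. The function and macro names are hypothetical; the offsets
and bit positions are those given in the QueueNotify,
InterruptStatus and InterruptACK register descriptions.

```c
#include <stdint.h>

/* Register offsets from the register list; the macro names are illustrative. */
#define VIRTIO_MMIO_QUEUE_NOTIFY      0x050
#define VIRTIO_MMIO_INTERRUPT_STATUS  0x060
#define VIRTIO_MMIO_INTERRUPT_ACK     0x064

/* InterruptStatus bits */
#define VIRTIO_MMIO_INT_USED_RING     (1u << 0)
#define VIRTIO_MMIO_INT_CONFIG        (1u << 1)

static uint32_t mmio_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static void mmio_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/* Tell the Host that queue `index` has new buffers to process. */
static void virtio_mmio_notify(volatile uint32_t *base, uint32_t index)
{
    mmio_write(base, VIRTIO_MMIO_QUEUE_NOTIFY, index);
}

/* First step of interrupt handling: find out what caused the interrupt. */
static uint32_t virtio_mmio_irq_pending(volatile uint32_t *base)
{
    return mmio_read(base, VIRTIO_MMIO_INTERRUPT_STATUS);
}

/* Last step: acknowledge the interrupts that were serviced. */
static void virtio_mmio_irq_ack(volatile uint32_t *base, uint32_t serviced)
{
    mmio_write(base, VIRTIO_MMIO_INTERRUPT_ACK, serviced);
}
```

An interrupt handler built on these helpers would call
virtio_mmio_irq_pending(), process the used rings if
VIRTIO_MMIO_INT_USED_RING is set, re-read the configuration if
VIRTIO_MMIO_INT_CONFIG is set, and finally pass the serviced bits
to virtio_mmio_irq_ack().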