Documentation / s390 / vfio-ccw.rst


Based on kernel version 6.5. Page generated on 2023-08-29 08:57 EST.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445
==================================
vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. Motivation for vfio-ccw is to passthrough subchannels to a
virtual machine, while vfio is the means.

Different than other hardware architectures, s390 has defined a unified
I/O access method, which is so called Channel I/O. It has its own access
patterns:

- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.

Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group, so as to make itself able to be managed by the
vfio framework. And we add read/write callbacks for special vfio I/O
regions to pass the channel programs from the mdev to its parent device
(the real I/O subchannel device) to do further address translation and
to perform I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail. More information/reference could be found here:

- A good start to know Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing QEMU code which implements a simple emulated channel
  subsystem could also be a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For vfio mediated device framework:
- Documentation/driver-api/vfio-mediated-device.rst

Motivation of vfio-ccw
----------------------

Typically, a guest virtualized via QEMU/KVM on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However this is not enough. On s390 for the majority of devices, which
use the standard Channel I/O based mechanism, we also need to provide
the functionality of passing through them to a QEMU virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. We implement this vfio support for channel
devices via the vfio mediated device framework and the subchannel device
driver "vfio_ccw".

Access patterns of CCW devices
------------------------------

s390 architecture has implemented a so called channel subsystem, that
provides a unified view of the devices physically attached to the
systems. Though the s390 hardware platform knows about a huge variety of
different peripheral attachments like disk devices (aka. DASDs), tapes,
communication controllers, etc. They can all be accessed by a well
defined access method and they are presenting I/O completion a unified
way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem.  To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which can be used to point out
the format of the CCW and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with a SSCH (start sub-channel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted. The I/O completion result is received by the
interrupt handler in the form of interrupt response block (IRB).

Back to vfio-ccw, in short:

- ORBs and channel programs are built in guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- Host kernel translates the guest physical addresses to real addresses
  and starts the I/O with issuing a privileged Channel I/O instruction
  (e.g SSCH).
- channel programs run asynchronously on a separate processor.
- I/O completion will be signaled to the host with I/O interruptions.
  And it will be copied as IRB to user space to pass it back to the
  guest.

Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with a mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw has the software
policing and translation how the channel program is programmed before
it gets sent to hardware.

Within this implementation, we have two drivers for two types of
devices:

- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device.  It
  realizes a group of callbacks and registers to the mdev framework as a
  parent (physical) device. As a consequence, mdev provides vfio_ccw a
  generic interface (sysfs) to create mdev devices. A vfio mdev could be
  created by vfio_ccw then and added to the mediated bus. It is the vfio
  device that added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  request from user space and store I/O interrupt result for user
  space to retrieve. To notify user space an I/O completion, it offers
  an interface to setup an eventfd fd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself to the mdev framework as a mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For a mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. Mdev framework will only maintain a database
  of the iova<->vaddr mappings in this operation. And they export a
  vfio_pin_pages and a vfio_unpin_pages interfaces from the vfio iommu
  backend for the physical devices to pin and unpin pages by demand.

Below is a high Level block diagram::

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_parent() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

The process of how these work together.

1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) to mdev framework.
   When vfio_ccw probing the subchannel device, it registers device
   pointer and callbacks to the mdev framework. Mdev related file nodes
   under the device node in sysfs would be created for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Use the 'mdev_create' sysfs file, we need to manually create one (and
   only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we could pass through
   the mdev to a guest.


VFIO-CCW Regions
----------------

The vfio-ccw driver exposes MMIO regions to accept requests from and return
results to userspace.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program request from user
space and store I/O interrupt result for user space to retrieve. The
definition of the region is::

  struct ccw_io_region {
  #define ORB_AREA_SIZE 12
	  __u8    orb_area[ORB_AREA_SIZE];
  #define SCSW_AREA_SIZE 12
	  __u8    scsw_area[SCSW_AREA_SIZE];
  #define IRB_AREA_SIZE 96
	  __u8    irb_area[IRB_AREA_SIZE];
	  __u32   ret_code;
  } __packed;

This region is always available.

While starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region. The following
values may occur:

``0``
  The operation was successful.

``-EOPNOTSUPP``
  The ORB specified transport mode or the
  SCSW specified a function other than the start function.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests, or an internal error occurred.

``-EBUSY``
  The subchannel was status pending or busy, or a request is already active.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EACCES``
  The channel path(s) used for the I/O were found to be not operational.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  The orb specified a chain longer than 255 ccws, or an internal error
  occurred.


vfio-ccw cmd region
-------------------

The vfio-ccw cmd region is used to accept asynchronous instructions
from userspace::

  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
  struct ccw_cmd_region {
         __u32 command;
         __u32 ret_code;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD.

Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region.

command specifies the command to be issued; ret_code stores a return code
for each access of the region. The following values may occur:

``0``
  The operation was successful.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  A command other than halt or clear was specified.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EBUSY``
  The subchannel was status pending or busy while processing a halt request.

vfio-ccw schib region
---------------------

The vfio-ccw schib region is used to return Subchannel-Information
Block (SCHIB) data to userspace::

  struct ccw_schib_region {
  #define SCHIB_AREA_SIZE 52
         __u8 schib_area[SCHIB_AREA_SIZE];
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB.

Reading this region triggers a STORE SUBCHANNEL to be issued to the
associated hardware.

vfio-ccw crw region
---------------------

The vfio-ccw crw region is used to return Channel Report Word (CRW)
data to userspace::

  struct ccw_crw_region {
         __u32 crw;
         __u32 pad;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW.

Reading this region returns a CRW if one that is relevant for this
subchannel (e.g. one reporting changes in channel path state) is
pending, or all zeroes if not. If multiple CRWs are pending (including
possibly chained CRWs), reading this region again will return the next
one, until no more CRWs are pending and zeroes are returned. This is
similar to how STORE CHANNEL REPORT WORD works.

vfio-ccw operation details
--------------------------

vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend.

* CCW translation APIs
  A group of APIs (start with `cp_`) to do CCW translation. The CCWs
  passed in by a user space program are organized with their guest
  physical memory addresses. These APIs will copy the CCWs into kernel
  space, and assemble a runnable kernel channel program by updating the
  guest physical addresses with their corresponding host physical addresses.
  Note that we have to use IDALs even for direct-access CCWs, as the
  referenced memory can be located anywhere, including above 2G.

* vfio_ccw device driver
  This driver utilizes the CCW translation APIs and introduces
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls::

    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS

  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, to do further CCW translation before
  issuing them to a real device.
  This also provides the SET_IRQ ioctl to setup an event notifier to
  notify the user space program the I/O completion in an asynchronous
  way.

The use of vfio-ccw is not limited to QEMU, while QEMU is definitely a
good example to get understand how these patches work. Here is a little
bit more detail how an I/O request triggered by the QEMU guest will be
handled (without error handling).

Explanation:

- Q1-Q7: QEMU side process.
- K1-K5: Kernel side process.

Q1.
    Get I/O region info during initialization.

Q2.
    Setup event notifier and handler to handle I/O completion.

... ...

Q3.
    Intercept a ssch instruction.
Q4.
    Write the guest channel program and ORB to the I/O region.

    K1.
	Copy from guest to kernel.
    K2.
	Translate the guest channel program to a host kernel space
	channel program, which becomes runnable for a real device.
    K3.
	With the necessary information contained in the orb passed in
	by QEMU, issue the ccwchain to the device.
    K4.
	Return the ssch CC code.
Q5.
    Return the CC code to the guest.

... ...

    K5.
	Interrupt handler gets the I/O result and write the result to
	the I/O region.
    K6.
	Signal QEMU to retrieve the result.

Q6.
    Get the signal and event handler reads out the result from the I/O
    region.
Q7.
    Update the irb for the guest.

Limitations
-----------

The current vfio-ccw implementation focuses on supporting basic commands
needed to implement block device functionality (read/write) of DASD/ECKD
device only. Some commands may need special handling in the future, for
example, anything related to path grouping.

DASD is a kind of storage device. While ECKD is a data recording format.
More information for DASD and ECKD could be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in QEMU, we can bring the passed
through DASD/ECKD device online in a guest now and use it as a block
device.

The current code allows the guest to start channel programs via
START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL,
and STORE SUBCHANNEL.

Currently all channel programs are prefetched, regardless of the
p-bit setting in the ORB.  As a result, self modifying channel
programs are not supported.  For this reason, IPL has to be handled as
a special case by a userspace/guest program; this has been implemented
in QEMU's s390-ccw bios as of QEMU 4.1.

vfio-ccw supports classic (command mode) channel I/O only. Transport
mode (HPF) is not supported.

QDIO subchannels are currently not supported. Classic devices other than
DASD/ECKD might work, but have not been tested.

Reference
---------
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/s390/cds.rst
5. Documentation/driver-api/vfio.rst
6. Documentation/driver-api/vfio-mediated-device.rst