
Documentation / target / tcmu-design.txt


Based on kernel version 4.16.1. Page generated on 2018-04-09 11:53 EST.

Contents:

1) TCM Userspace Design
  a) Background
  b) Benefits
  c) Design constraints
  d) Implementation overview
     i. Mailbox
     ii. Command ring
     iii. Data Area
  e) Device discovery
  f) Device events
  g) Other contingencies
2) Writing a user pass-through handler
  a) Discovering and configuring TCMU uio devices
  b) Waiting for events on the device(s)
  c) Managing the command ring
3) A final note


TCM Userspace Design
--------------------

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel.  TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols.  TCM also modularizes the data storage.  There are existing
modules for file, block device, RAM or using another SCSI device as
storage.  These are called "backstores" or "storage engines".  These
built-in modules are implemented entirely as kernel code.

Background:

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage as well. These are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits:

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is there are more distinct components to configure, and
potentially to malfunction. This is unavoidable, but hopefully not
fatal if we're careful to keep things as simple as possible.

Design constraints:

- Good performance: high throughput, low latency
- Cleanly handle if userspace:
   1) never attaches
   2) hangs
   3) dies
   4) misbehaves
- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview:

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region is: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.

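Because everything in the region is an offset, a handler typically
needs one small translation step before it can touch anything. The
following is a minimal sketch of such a step, not code from the kernel
or tcmu-runner; region_ptr() is a hypothetical helper name, and the
bounds check is an added precaution rather than something the interface
mandates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn a region-relative offset into a pointer that
 * is valid in this process. 'map' is the address returned by mmap() and
 * 'map_len' the size of the mapping. Returns NULL if the byte range
 * [off, off + len) does not fit inside the mapping.
 */
static void *region_ptr(void *map, size_t map_len, uint64_t off, size_t len)
{
	if (off > map_len || len > map_len - off)
		return NULL;
	return (uint8_t *)map + off;
}
```

If the process is restarted and the region is mapped at a different
address, only the 'map' argument changes; the offsets stored in the
region stay valid.
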
The Mailbox:

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)
flags:
- TCMU_MAILBOX_FLAG_CAP_OOOC: indicates out-of-order completion is
  supported.  See "The Command Ring" for details.
cmdr_off - The offset of the start of the command ring from the start
of the memory region, to account for the mailbox size.
cmdr_size - The size of the command ring. This does *not* need to be a
power of two.
cmd_head - Modified by the kernel to indicate when a command has been
placed on the ring.
cmd_tail - Modified by userspace to indicate when it has completed
processing of a command.

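A handler may want to sanity-check these fields once after mapping the
region. The sketch below is illustrative only: struct mailbox_view is a
simplified stand-in for the fields listed above, not the real struct
tcmu_mailbox (whose layout and padding are defined in
target_core_user.h), and mailbox_looks_sane() is a hypothetical name.
The expected version is passed in by the caller; this document gives it
as 1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the mailbox fields; NOT the kernel's layout. */
struct mailbox_view {
	uint16_t version;
	uint16_t flags;
	uint32_t cmdr_off;
	uint32_t cmdr_size;
	uint32_t cmd_head;
	uint32_t cmd_tail;
};

/* Return true if the mailbox fields are consistent with a mapping of
 * map_len bytes and with the version the handler was written for. */
static bool mailbox_looks_sane(const struct mailbox_view *mb, size_t map_len,
			       uint16_t expected_version)
{
	if (mb->version != expected_version)
		return false;		/* userspace should abort */
	if (mb->cmdr_size == 0)
		return false;
	/* The command ring must lie entirely inside the mapping. */
	if (mb->cmdr_off > map_len || mb->cmdr_size > map_len - mb->cmdr_off)
		return false;
	/* head and tail are offsets within the ring. */
	return mb->cmd_head < mb->cmdr_size && mb->cmd_tail < mb->cmdr_size;
}
```
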
The Command Ring:

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

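The head/tail arithmetic described above can be sketched as follows.
The function names are hypothetical and the code only models the index
math; it does not touch a real mailbox. Note that plain modulo is used
rather than a bit mask, since cmdr_size need not be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Advance a ring offset by the length of one entry, wrapping at cmdr_size. */
static uint32_t ring_advance(uint32_t idx, uint32_t entry_len, uint32_t cmdr_size)
{
	return (idx + entry_len) % cmdr_size;
}

/* The ring is empty when the consumer has caught up with the producer. */
static bool ring_empty(uint32_t cmd_head, uint32_t cmd_tail)
{
	return cmd_head == cmd_tail;
}

/* Number of bytes between tail and head, i.e. still to be processed. */
static uint32_t ring_pending(uint32_t cmd_head, uint32_t cmd_tail, uint32_t cmdr_size)
{
	return cmd_head >= cmd_tail ? cmd_head - cmd_tail
				    : cmdr_size - cmd_tail + cmd_head;
}
```

For a ring of 1000 bytes, an entry of 40 bytes starting at offset 960
moves the index back to 0.
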
TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

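To illustrate how a length and an opcode can share one 32-bit word,
here is a small sketch. It assumes the opcode occupies the low three
bits, which are free because entry lengths are multiples of 8; the
EX_/ex_ names are made up for this example. Real handlers should use
the tcmu_hdr_get_op() and tcmu_hdr_get_len() helpers from
target_core_user.h instead.

```c
#include <stdint.h>

/* Assumption for this sketch: low 3 bits hold the opcode. */
#define EX_OP_MASK 0x7u

/* Combine an 8-byte-aligned length with an opcode. */
static uint32_t ex_pack_len_op(uint32_t len, uint32_t op)
{
	return (len & ~EX_OP_MASK) | (op & EX_OP_MASK);
}

/* Extract the opcode from the low bits. */
static uint32_t ex_get_op(uint32_t len_op)
{
	return len_op & EX_OP_MASK;
}

/* Extract the entry length by clearing the opcode bits. */
static uint32_t ex_get_len(uint32_t len_op)
{
	return len_op & ~EX_OP_MASK;
}
```
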
Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Data Block) via
tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the
overall shared memory region, not the entry. The data in/out buffers
are accessible via the req.iov[] array. iov_cnt contains the number of
entries in iov[] needed to describe either the Data-In or Data-Out
buffers. For bidirectional commands, iov_cnt specifies how many iovec
entries cover the Data-Out area, and iov_bidi_cnt specifies how many
iovec entries immediately after that in iov[] cover the Data-In
area. Just like other fields, iov.iov_base is an offset from the start
of the region.

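As an illustration of reading data through iov[], the sketch below
copies the bytes described by a set of iovecs into one contiguous
buffer, treating each iov_base as an offset from the start of the
mapped region. gather_iov() is a hypothetical helper, and for brevity
it does not check the offsets against the size of the mapping, which a
careful handler should do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy at most dst_len bytes described by iov[0..iov_cnt) into dst.
 * 'map' is the start of the shared region; iov_base holds an offset
 * from it, not a pointer. Returns the number of bytes copied.
 */
static size_t gather_iov(const void *map, const struct iovec *iov,
			 size_t iov_cnt, void *dst, size_t dst_len)
{
	size_t copied = 0;

	for (size_t i = 0; i < iov_cnt && copied < dst_len; i++) {
		size_t off = (size_t)(uintptr_t)iov[i].iov_base;
		size_t n = iov[i].iov_len;

		if (n > dst_len - copied)
			n = dst_len - copied;	/* clamp to remaining space */
		memcpy((uint8_t *)dst + copied, (const uint8_t *)map + off, n);
		copied += n;
	}
	return copied;
}
```

For a bidirectional command, the same helper could be called on the
first iov_cnt entries for Data-Out, while the following iov_bidi_cnt
entries describe where Data-In is to be written.
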
When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the entry's length, which is stored in hdr.len_op
(mod cmdr_size), and signals the kernel via the UIO method, a 4-byte
write to the file descriptor.

If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel is
capable of handling out-of-order completions. In this case, userspace
can handle commands in an order different from the one in which they
were queued. Because the kernel still processes ring entries in the
order they appear in the command ring, userspace needs to update the
cmd_id in the entry when completing a command (in effect, stealing the
original command's entry).

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

The Data Area:

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery:

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
format:

tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at:

/sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

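Splitting such a name into its parts could look like the sketch below.
It is only an illustration: the struct, the function names, and the
buffer sizes are arbitrary choices for this example, <hba_num> is
assumed to be a decimal number, and everything after the subtype's
trailing '/' is treated as the handler-specific path.

```c
#include <stdio.h>

/* Hypothetical container for the parsed components. */
struct tcmu_uio_name {
	unsigned int hba_num;
	char device_name[64];
	char subtype[64];
	char path[128];
};

/* Parse "tcm-user/<hba_num>/<device_name>/<subtype>/<path>".
 * Returns 0 on success (path may be empty), -1 if the name does not
 * look like a TCMU device. */
static int parse_tcmu_uio_name(const char *name, struct tcmu_uio_name *out)
{
	int n;

	out->path[0] = '\0';
	n = sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]/%127[^\n]",
		   &out->hba_num, out->device_name, out->subtype, out->path);
	return n >= 3 ? 0 : -1;
}

/* Build the configfs directory for the device, assuming the usual
 * mount point. Returns 0 on success, -1 if buf is too small. */
static int tcmu_configfs_path(const struct tcmu_uio_name *n, char *buf, size_t len)
{
	int w = snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
			 n->hba_num, n->device_name);

	return (w < 0 || (size_t)w >= len) ? -1 : 0;
}
```

A handler would then compare the parsed subtype against the string it
is responsible for and ignore every other device.
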
For all devices so discovered, the user handler opens /dev/uioX and
calls mmap():

mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events:

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies:

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds.)

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
-------------------------------------------------------

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices:

(error checking omitted for brevity)

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int fd, dev_fd;
ssize_t ret;
char buf[256];
unsigned long long map_len;
void *map;

fd = open("/sys/class/uio/uio0/name", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

/* we only want uio devices whose name is a format we expect */
if (strncmp(buf, "tcm-user", 8))
	exit(-1);

/* Further checking for subtype also needed here */

/* uio0 is hardcoded here to match the name lookup above */
fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

map_len = strtoull(buf, NULL, 0);

dev_fd = open("/dev/uio0", O_RDWR);
map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)

while (1) {
  char buf[4];

  int ret = read(dev_fd, buf, 4); /* will block */

  handle_device_events(dev_fd, map);
}


c) Managing the command ring

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/target_core_user.h>

/*
 * SCSI_NO_SENSE and SCSI_CHECK_CONDITION are not defined by
 * target_core_user.h. They stand for the SCSI status codes GOOD (0x00)
 * and CHECK CONDITION (0x02) and must be defined by the handler.
 */

int handle_device_events(int fd, void *map)
{
  struct tcmu_mailbox *mb = map;
  struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
  int did_some_work = 0;

  /* Process events from cmd ring until we catch up with cmd_head */
  while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

    if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
      uint8_t *cdb = (void *)mb + ent->req.cdb_off;
      bool success = true;

      /* Handle command here. */
      printf("SCSI opcode: 0x%x\n", cdb[0]);

      /* Set response fields */
      if (success)
        ent->rsp.scsi_status = SCSI_NO_SENSE;
      else {
        /* Also fill in rsp->sense_buffer here */
        ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
      }
    }
    else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
      /* Tell the kernel we didn't handle unknown opcodes */
      ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
    }
    else {
      /* Do nothing for PAD entries except update cmd_tail */
    }

    /* update cmd_tail; tcmu_hdr_get_len() takes the len_op value */
    mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
    ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
    did_some_work = 1;
  }

  /* Notify the kernel that work has been finished */
  if (did_some_work) {
    uint32_t buf = 0;

    write(fd, &buf, 4);
  }

  return 0;
}


A final note
------------

Please be careful to return codes as defined by the SCSI
specifications. These are different than some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
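
As a small illustration of returning specification-defined values, the
sketch below names a few SCSI status codes and fills in a fixed-format
sense buffer. The EX_/ex_ identifiers are invented for this example and
are not part of target_core_user.h; the numeric values are the ones
given by the SCSI specifications (GOOD is 0x00, CHECK CONDITION is
0x02, BUSY is 0x08).

```c
#include <stdint.h>
#include <string.h>

/* SCSI status byte values as defined by the SCSI specifications. */
enum ex_scsi_status {
	EX_STATUS_GOOD            = 0x00,
	EX_STATUS_CHECK_CONDITION = 0x02,
	EX_STATUS_BUSY            = 0x08,
};

/*
 * Fill an 18-byte fixed-format sense buffer: response code 0x70 in
 * byte 0, sense key in byte 2, additional length in byte 7, and the
 * ASC/ASCQ pair in bytes 12 and 13.
 */
static void ex_fill_fixed_sense(uint8_t sense[18], uint8_t key,
				uint8_t asc, uint8_t ascq)
{
	memset(sense, 0, 18);
	sense[0] = 0x70;	/* current error, fixed format */
	sense[2] = key & 0x0f;
	sense[7] = 0x0a;	/* 10 more bytes follow byte 7 */
	sense[12] = asc;
	sense[13] = ascq;
}
```

A handler rejecting an unsupported CDB might, for instance, call
ex_fill_fixed_sense(buf, 0x05, 0x20, 0x00) (ILLEGAL REQUEST, INVALID
COMMAND OPERATION CODE), copy the result into rsp.sense_buffer, and set
rsp.scsi_status to 2.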