Documentation / networking / msg_zerocopy.rst

Based on kernel version 4.15. Page generated on 2018-01-29 10:00 EST.

============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
read the excellent reporting over at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Setting the socket option only works when the socket is in its initial
(TCP_CLOSE) state. Trying to set the option for a socket returned by accept(),
for example, will lead to an EBUSY error. In this case, the option should be set
on the listening socket and it will be inherited by the accepted sockets.
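
For a server, the ordering constraint above can be sketched as follows. This is
a minimal illustration, not code from the kernel tree: the helper names
enable_zerocopy() and listen_with_zerocopy() are hypothetical, and the fallback
definition of SO_ZEROCOPY assumes the asm-generic value, which differs on a few
architectures.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60	/* asm-generic value; fallback for older libc headers */
#endif

/* Returns 0 on success, -1 with errno set on failure (for example EBUSY). */
static int enable_zerocopy(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/*
 * Set the option on the listening socket before listen(), so that
 * sockets later returned by accept() inherit it.
 */
static int listen_with_zerocopy(int listen_fd, int backlog)
{
	if (enable_zerocopy(listen_fd))
		return -1;
	return listen(listen_fd, backlog);
}
```

A client would call enable_zerocopy() right after socket() and before connect().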

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.
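
One possible way to react to ENOBUFS is to retry the same buffer as an ordinary
copying send. The sketch below assumes that policy; the helper name
send_zc_or_copy() and its out-parameter are hypothetical, and a real
application might prefer to back off or raise its limits instead.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/*
 * Try a zerocopy send. If the kernel refuses with ENOBUFS, retry once
 * as a plain copying send. *zerocopy is set to 1 only if the zerocopy
 * call sent data, in which case a completion notification will follow.
 */
static ssize_t send_zc_or_copy(int fd, const void *buf, size_t len,
			       int *zerocopy)
{
	ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

	*zerocopy = (ret > 0);
	if (ret == -1 && errno == ENOBUFS)
		ret = send(fd, buf, len, 0);
	return ret;
}
```

Any error other than ENOBUFS is passed back to the caller unchanged.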


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.
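
A per-call decision can be as simple as a size check. In this sketch the
cutoff ZC_THRESHOLD and the helper send_flags_for() are hypothetical; the
10 KB value only mirrors the rule of thumb given earlier and should be tuned
by measurement.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/* Hypothetical cutoff based on the "around 10 KB" rule of thumb. */
#define ZC_THRESHOLD (10 * 1024)

/* Request copy avoidance only for buffers large enough to benefit. */
static int send_flags_for(size_t len)
{
	return len >= ZC_THRESHOLD ? MSG_ZEROCOPY : 0;
}
```

A caller would then write send(fd, buf, len, send_flags_for(len)) and expect a
notification only for the calls that carried the flag.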


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
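
To match notifications to buffers, a process can mirror this counter in user
space. The sketch below is one way to do so. The struct zc_tracker and
zc_track_send() names are hypothetical, and it assumes the counter of a new
socket starts at zero.

```c
#include <stdint.h>
#include <sys/types.h>

struct zc_tracker {
	uint32_t next_id;	/* id expected for the next counted send */
};

/*
 * Call with the return value of send(..., MSG_ZEROCOPY).
 * Returns 1 and stores the id of this call in *id if the call was
 * counted, 0 otherwise.
 */
static int zc_track_send(struct zc_tracker *t, ssize_t ret, uint32_t *id)
{
	if (ret <= 0)
		return 0;	/* failures and zero-length sends are not counted */
	*id = t->next_id++;	/* unsigned arithmetic wraps after UINT_MAX */
	return 1;
}
```

The id can then be stored next to the buffer so the buffer can be released
once a notification covering that id arrives.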


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;		/* POLLERR is reported even if not requested */
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
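
A non-blocking drain of the error queue might look like the sketch below. The
function name drain_errqueue(), the handle callback and the control buffer
size are assumptions made for illustration; the callback stands in for a
parsing routine such as read_notification().

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Read every notification currently queued, without blocking.
 * Returns the number of messages read, or -1 on an error other than
 * an empty queue.
 */
static int drain_errqueue(int fd, void (*handle)(struct msghdr *))
{
	char control[128];	/* assumed large enough for one notification */
	int n = 0;

	for (;;) {
		struct msghdr msg;

		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN)
				return n;	/* queue is empty */
			return -1;
		}
		if (handle)
			handle(&msg);
		n++;
	}
}
```

Calling this once every few sends keeps the number of pinned buffers bounded
without a poll() per send.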

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.
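
The coalescing rule can be modelled in a few lines. This is a simplified
user-space model written for illustration, not the kernel implementation; the
struct zc_range and zc_range_extend() names are hypothetical.

```c
#include <stdint.h>

struct zc_range {
	uint32_t lo;
	uint32_t hi;	/* inclusive */
};

/*
 * If the new range [lo, hi] starts right after the queued tail range,
 * grow the tail and return 1. Otherwise return 0, in which case a
 * separate notification has to be queued.
 */
static int zc_range_extend(struct zc_range *tail, uint32_t lo, uint32_t hi)
{
	if (lo != (uint32_t)(tail->hi + 1))
		return 0;
	tail->hi = hi;
	return 1;
}
```

With in-order completions every call extends the tail, which is why a single
outstanding notification is the common case for TCP.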


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR, or SOL_IPV6 with IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	cm = CMSG_FIRSTHDR(msg);
	if (!cm)
		error(1, 0, "cmsg: no control data");

	/* IPv4 socket; for IPv6 compare with SOL_IPV6 and IPV6_RECVERR */
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
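
Instead of printing the range, a process will usually release the buffers it
covers. The sketch below shows one way to walk an inclusive range that may
span a counter wrap. The slot table, ZC_SLOTS and zc_complete_range() are
hypothetical, and the table assumes no more than ZC_SLOTS sends are ever
awaiting completion.

```c
#include <stddef.h>
#include <stdint.h>

#define ZC_SLOTS 64	/* hypothetical cap on sends awaiting completion */

struct zc_pending {
	void *buf[ZC_SLOTS];	/* buf[id % ZC_SLOTS] is busy until send id completes */
};

/* Mark every send in the inclusive range [lo, hi] as complete. */
static uint32_t zc_complete_range(struct zc_pending *p, uint32_t lo,
				  uint32_t hi)
{
	uint32_t n = hi - lo + 1;	/* modulo 2^32, so valid across a wrap */
	uint32_t i;

	for (i = 0; i < n; i++)
		p->buf[(uint32_t)(lo + i) % ZC_SLOTS] = NULL;	/* reusable */
	return n;
}
```

It would be called as zc_complete_range(&pending, serr->ee_info, serr->ee_data).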


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
