About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / networking / rds.txt

Custom Search

Based on kernel version 4.7.2. Page generated on 2016-08-22 22:47 EST.

2	Overview
3	========
5	This readme tries to provide some background on the hows and whys of RDS,
6	and will hopefully help you find your way around the code.
8	In addition, please see this email about RDS origins:
9	http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
11	RDS Architecture
12	================
14	RDS provides reliable, ordered datagram delivery by using a single
15	reliable connection between any two nodes in the cluster. This allows
16	applications to use a single socket to talk to any other process in the
17	cluster - so in a cluster with N processes you need N sockets, in contrast
18	to N*N if you use a connection-oriented socket transport like TCP.
20	RDS is not Infiniband-specific; it was designed to support different
21	transports.  The current implementation used to support RDS over TCP as well
22	as IB.
24	The high-level semantics of RDS from the application's point of view are
26	 *	Addressing
27	        RDS uses IPv4 addresses and 16bit port numbers to identify
28	        the end point of a connection. All socket operations that involve
29	        passing addresses between kernel and user space generally
30	        use a struct sockaddr_in.
32	        The fact that IPv4 addresses are used does not mean the underlying
33	        transport has to be IP-based. In fact, RDS over IB uses a
34	        reliable IB connection; the IP address is used exclusively to
35	        locate the remote node's GID (by ARPing for the given IP).
37	        The port space is entirely independent of UDP, TCP or any other
38	        protocol.
40	 *	Socket interface
41	        RDS sockets work *mostly* as you would expect from a BSD
42	        socket. The next section will cover the details. At any rate,
43	        all I/O is performed through the standard BSD socket API.
44	        Some additions like zerocopy support are implemented through
45	        control messages, while other extensions use the getsockopt/
46	        setsockopt calls.
48	        Sockets must be bound before you can send or receive data.
49	        This is needed because binding also selects a transport and
50	        attaches it to the socket. Once bound, the transport assignment
51	        does not change. RDS will tolerate IPs moving around (eg in
52	        a active-active HA scenario), but only as long as the address
53	        doesn't move to a different transport.
55	 *	sysctls
56	        RDS supports a number of sysctls in /proc/sys/net/rds
59	Socket Interface
60	================
63		AF_RDS and PF_RDS are the domain type to be used with socket(2)
64		to create RDS sockets. SOL_RDS is the socket-level to be used
65		with setsockopt(2) and getsockopt(2) for RDS specific socket
66		options.
68	  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
69	        This creates a new, unbound RDS socket.
71	  setsockopt(SOL_SOCKET): send and receive buffer size
72	        RDS honors the send and receive buffer size socket options.
73	        You are not allowed to queue more than SO_SNDSIZE bytes to
74	        a socket. A message is queued when sendmsg is called, and
75	        it leaves the queue when the remote system acknowledges
76	        its arrival.
78	        The SO_RCVSIZE option controls the maximum receive queue length.
79	        This is a soft limit rather than a hard limit - RDS will
80	        continue to accept and queue incoming messages, even if that
81	        takes the queue length over the limit. However, it will also
82	        mark the port as "congested" and send a congestion update to
83	        the source node. The source node is supposed to throttle any
84	        processes sending to this congested port.
86	  bind(fd, &sockaddr_in, ...)
87	        This binds the socket to a local IP address and port, and a
88	        transport.
90	  sendmsg(fd, ...)
91	        Sends a message to the indicated recipient. The kernel will
92	        transparently establish the underlying reliable connection
93	        if it isn't up yet.
95	        An attempt to send a message that exceeds SO_SNDSIZE will
96	        return with -EMSGSIZE
98	        An attempt to send a message that would take the total number
99	        of queued bytes over the SO_SNDSIZE threshold will return
100	        EAGAIN.
102	        An attempt to send a message to a destination that is marked
103	        as "congested" will return ENOBUFS.
105	  recvmsg(fd, ...)
106	        Receives a message that was queued to this socket. The sockets
107	        recv queue accounting is adjusted, and if the queue length
108	        drops below SO_SNDSIZE, the port is marked uncongested, and
109	        a congestion update is sent to all peers.
111	        Applications can ask the RDS kernel module to receive
112	        notifications via control messages (for instance, there is a
113	        notification when a congestion update arrived, or when a RDMA
114	        operation completes). These notifications are received through
115	        the msg.msg_control buffer of struct msghdr. The format of the
116	        messages is described in manpages.
118	  poll(fd)
119	        RDS supports the poll interface to allow the application
120	        to implement async I/O.
122	        POLLIN handling is pretty straightforward. When there's an
123	        incoming message queued to the socket, or a pending notification,
124	        we signal POLLIN.
126	        POLLOUT is a little harder. Since you can essentially send
127	        to any destination, RDS will always signal POLLOUT as long as
128	        there's room on the send queue (ie the number of bytes queued
129	        is less than the sendbuf size).
131	        However, the kernel will refuse to accept messages to
132	        a destination marked congested - in this case you will loop
133	        forever if you rely on poll to tell you what to do.
134	        This isn't a trivial problem, but applications can deal with
135	        this - by using congestion notifications, and by checking for
136	        ENOBUFS errors returned by sendmsg.
138	  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
139	        This allows the application to discard all messages queued to a
140	        specific destination on this particular socket.
142	        This allows the application to cancel outstanding messages if
143	        it detects a timeout. For instance, if it tried to send a message,
144	        and the remote host is unreachable, RDS will keep trying forever.
145	        The application may decide it's not worth it, and cancel the
146	        operation. In this case, it would use RDS_CANCEL_SENT_TO to
147	        nuke any pending messages.
150	RDMA for RDS
151	============
153	  see rds-rdma(7) manpage (available in rds-tools)
156	Congestion Notifications
157	========================
159	  see rds(7) manpage
162	RDS Protocol
163	============
165	  Message header
167	    The message header is a 'struct rds_header' (see rds.h):
168	    Fields:
169	      h_sequence:
170	          per-packet sequence number
171	      h_ack:
172	          piggybacked acknowledgment of last packet received
173	      h_len:
174	          length of data, not including header
175	      h_sport:
176	          source port
177	      h_dport:
178	          destination port
179	      h_flags:
180	          CONG_BITMAP - this is a congestion update bitmap
181	          ACK_REQUIRED - receiver must ack this packet
182	          RETRANSMITTED - packet has previously been sent
183	      h_credit:
184	          indicate to other end of connection that
185	          it has more credits available (i.e. there is
186	          more send room)
187	      h_padding[4]:
188	          unused, for future use
189	      h_csum:
190	          header checksum
191	      h_exthdr:
192	          optional data can be passed here. This is currently used for
193	          passing RDMA-related information.
195	  ACK and retransmit handling
197	      One might think that with reliable IB connections you wouldn't need
198	      to ack messages that have been received.  The problem is that IB
199	      hardware generates an ack message before it has DMAed the message
200	      into memory.  This creates a potential message loss if the HCA is
201	      disabled for any reason between when it sends the ack and before
202	      the message is DMAed and processed.  This is only a potential issue
203	      if another HCA is available for fail-over.
205	      Sending an ack immediately would allow the sender to free the sent
206	      message from their send queue quickly, but could cause excessive
207	      traffic to be used for acks. RDS piggybacks acks on sent data
208	      packets.  Ack-only packets are reduced by only allowing one to be
209	      in flight at a time, and by the sender only asking for acks when
210	      its send buffers start to fill up. All retransmissions are also
211	      acked.
213	  Flow Control
215	      RDS's IB transport uses a credit-based mechanism to verify that
216	      there is space in the peer's receive buffers for more data. This
217	      eliminates the need for hardware retries on the connection.
219	  Congestion
221	      Messages waiting in the receive queue on the receiving socket
222	      are accounted against the sockets SO_RCVBUF option value.  Only
223	      the payload bytes in the message are accounted for.  If the
224	      number of bytes queued equals or exceeds rcvbuf then the socket
225	      is congested.  All sends attempted to this socket's address
226	      should return block or return -EWOULDBLOCK.
228	      Applications are expected to be reasonably tuned such that this
229	      situation very rarely occurs.  An application encountering this
230	      "back-pressure" is considered a bug.
232	      This is implemented by having each node maintain bitmaps which
233	      indicate which ports on bound addresses are congested.  As the
234	      bitmap changes it is sent through all the connections which
235	      terminate in the local address of the bitmap which changed.
237	      The bitmaps are allocated as connections are brought up.  This
238	      avoids allocation in the interrupt handling path which queues
239	      sages on sockets.  The dense bitmaps let transports send the
240	      entire bitmap on any bitmap change reasonably efficiently.  This
241	      is much easier to implement than some finer-grained
242	      communication of per-port congestion.  The sender does a very
243	      inexpensive bit test to test if the port it's about to send to
244	      is congested or not.
247	RDS Transport Layer
248	==================
250	  As mentioned above, RDS is not IB-specific. Its code is divided
251	  into a general RDS layer and a transport layer.
253	  The general layer handles the socket API, congestion handling,
254	  loopback, stats, usermem pinning, and the connection state machine.
256	  The transport layer handles the details of the transport. The IB
257	  transport, for example, handles all the queue pairs, work requests,
258	  CM event handlers, and other Infiniband details.
261	RDS Kernel Structures
262	=====================
264	  struct rds_message
265	    aka possibly "rds_outgoing", the generic RDS layer copies data to
266	    be sent and sets header fields as needed, based on the socket API.
267	    This is then queued for the individual connection and sent by the
268	    connection's transport.
269	  struct rds_incoming
270	    a generic struct referring to incoming data that can be handed from
271	    the transport to the general code and queued by the general code
272	    while the socket is awoken. It is then passed back to the transport
273	    code to handle the actual copy-to-user.
274	  struct rds_socket
275	    per-socket information
276	  struct rds_connection
277	    per-connection information
278	  struct rds_transport
279	    pointers to transport-specific functions
280	  struct rds_statistics
281	    non-transport-specific statistics
282	  struct rds_cong_map
283	    wraps the raw congestion bitmap, contains rbnode, waitq, etc.
285	Connection management
286	=====================
288	  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
289	  ERROR states.
291	  The first time an attempt is made by an RDS socket to send data to
292	  a node, a connection is allocated and connected. That connection is
293	  then maintained forever -- if there are transport errors, the
294	  connection will be dropped and re-established.
296	  Dropping a connection while packets are queued will cause queued or
297	  partially-sent datagrams to be retransmitted when the connection is
298	  re-established.
301	The send path
302	=============
304	  rds_sendmsg()
305	    struct rds_message built from incoming data
306	    CMSGs parsed (e.g. RDMA ops)
307	    transport connection alloced and connected if not already
308	    rds_message placed on send queue
309	    send worker awoken
310	  rds_send_worker()
311	    calls rds_send_xmit() until queue is empty
312	  rds_send_xmit()
313	    transmits congestion map if one is pending
314	    may set ACK_REQUIRED
315	    calls transport to send either non-RDMA or RDMA message
316	    (RDMA ops never retransmitted)
317	  rds_ib_xmit()
318	    allocs work requests from send ring
319	    adds any new send credits available to peer (h_credits)
320	    maps the rds_message's sg list
321	    piggybacks ack
322	    populates work requests
323	    post send to connection's queue pair
325	The recv path
326	=============
328	  rds_ib_recv_cq_comp_handler()
329	    looks at write completions
330	    unmaps recv buffer from device
331	    no errors, call rds_ib_process_recv()
332	    refill recv ring
333	  rds_ib_process_recv()
334	    validate header checksum
335	    copy header to rds_ib_incoming struct if start of a new datagram
336	    add to ibinc's fraglist
337	    if competed datagram:
338	      update cong map if datagram was cong update
339	      call rds_recv_incoming() otherwise
340	      note if ack is required
341	  rds_recv_incoming()
342	    drop duplicate packets
343	    respond to pings
344	    find the sock associated with this datagram
345	    add to sock queue
346	    wake up sock
347	    do some congestion calculations
348	  rds_recvmsg
349	    copy data into user iovec
350	    handle CMSGs
351	    return to application
Hide Line Numbers
About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.