Documentation / networking / packet_mmap.txt




Based on kernel version 3.13. Page generated on 2014-01-20 22:04 EST.

1	--------------------------------------------------------------------------------
2	+ ABSTRACT
3	--------------------------------------------------------------------------------
4	
This file documents the mmap() facility available with the PACKET
socket interface on 2.4/2.6/3.x kernels. This type of socket is used to
i) capture network traffic with utilities like tcpdump, ii) transmit network
traffic, or for any other purpose that needs raw access to a network interface.
9	
10	You can find the latest version of this document at:
11	    http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
12	
13	Howto can be found at:
14	    http://wiki.gnu-log.net (packet_mmap)
15	
16	Please send your comments to
17	    Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
18	    Johann Baudy <johann.baudy@gnu-log.net>
19	
20	-------------------------------------------------------------------------------
21	+ Why use PACKET_MMAP
22	--------------------------------------------------------------------------------
23	
In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet, or two if you want to get the packet's timestamp
(like libpcap always does).

On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a
size-configurable circular buffer mapped in user space that can be used to
either send or receive packets. This way, reading packets just requires waiting
for them; most of the time there is no need to issue a single system call.
For transmission, multiple packets can be sent through one system call to get
the highest bandwidth. Using a buffer shared between the kernel and the user
also has the benefit of minimizing packet copies.
36	
It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
at high speeds (this is relative to the CPU speed), you should check whether the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) NAPI, and make sure it is enabled. For
transmission, check the MTU (Maximum Transmission Unit) used and supported by
the devices of your network. Pinning the IRQ of your network interface card to
a CPU can also be an advantage.
45	
46	--------------------------------------------------------------------------------
47	+ How to use mmap() to improve capture process
48	--------------------------------------------------------------------------------
49	
50	From the user standpoint, you should use the higher level libpcap library, which
51	is a de facto standard, portable across nearly all operating systems
52	including Win32. 
53	
That said, at the time of this writing, official libpcap 0.8.1 is out and does
not include support for PACKET_MMAP, and the libpcap included in your
distribution probably does not either.
56	
57	I'm aware of two implementations of PACKET_MMAP in libpcap:
58	
59	    http://wiki.ipxwarzone.com/		     (by Simon Patarin, based on libpcap 0.6.2)
    http://public.lanl.gov/cpw/              (by Phil Wood, based on latest libpcap)
61	
62	The rest of this document is intended for people who want to understand
63	the low level details or want to improve libpcap by including PACKET_MMAP
64	support.
65	
66	--------------------------------------------------------------------------------
67	+ How to use mmap() directly to improve capture process
68	--------------------------------------------------------------------------------
69	
From the system call standpoint, the use of PACKET_MMAP involves
the following process:
72	
73	
74	[setup]     socket() -------> creation of the capture socket
75	            setsockopt() ---> allocation of the circular buffer (ring)
76	                              option: PACKET_RX_RING
77	            mmap() ---------> mapping of the allocated buffer to the
78	                              user process
79	
80	[capture]   poll() ---------> to wait for incoming packets
81	
82	[shutdown]  close() --------> destruction of the capture socket and
83	                              deallocation of all associated 
84	                              resources.
85	
86	
Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP:
89	
90	 int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
91	
where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.
97	
98	The destruction of the socket and all associated resources
99	is done by a simple call to close(fd).
100	
Next I will describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.
104	
105	--------------------------------------------------------------------------------
106	+ How to use mmap() directly to improve transmission process
107	--------------------------------------------------------------------------------
108	Transmission process is similar to capture as shown below.
109	
110	[setup]          socket() -------> creation of the transmission socket
111	                 setsockopt() ---> allocation of the circular buffer (ring)
112	                                   option: PACKET_TX_RING
113	                 bind() ---------> bind transmission socket with a network interface
114	                 mmap() ---------> mapping of the allocated buffer to the
115	                                   user process
116	
117	[transmission]   poll() ---------> wait for free packets (optional)
118	                 send() ---------> send all packets that are set as ready in
119	                                   the ring
120	                                   The flag MSG_DONTWAIT can be used to return
121	                                   before end of transfer.
122	
123	[shutdown]  close() --------> destruction of the transmission socket and
124	                              deallocation of all associated resources.
125	
Socket creation and destruction is also straightforward, and is done
the same way as for capture, described in the previous section:
128	
129	 int fd = socket(PF_PACKET, mode, 0);
130	
The protocol can optionally be 0 if we only want to transmit
via this socket, which avoids an expensive call to packet_rcv().
In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.
135	
136	Binding the socket to your network interface is mandatory (with zero copy) to
137	know the header size of frames used in the circular buffer.
138	
As in capture, each frame contains two parts:

 --------------------
| struct tpacket_hdr | Header. It contains the status of
|                    | this frame
144	|--------------------|
145	| data buffer        |
146	.                    .  Data that will be sent over the network interface.
147	.                    .
148	 --------------------
149	
 bind() associates the socket with your network interface through the
 sll_ifindex field of struct sockaddr_ll.

 Initialization example:

 struct sockaddr_ll my_addr;
 struct ifreq s_ifr;
 ...

 /* copy the interface name, keeping it NUL-terminated */
 memset(&s_ifr, 0, sizeof(s_ifr));
 strncpy(s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name) - 1);

 /* get interface index of eth0 */
 ioctl(fd, SIOCGIFINDEX, &s_ifr);

 /* fill sockaddr_ll struct to prepare binding */
 memset(&my_addr, 0, sizeof(my_addr));
 my_addr.sll_family = AF_PACKET;
 my_addr.sll_protocol = htons(ETH_P_ALL);
 my_addr.sll_ifindex =  s_ifr.ifr_ifindex;

 /* bind socket to eth0 */
 bind(fd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
171	
172	 A complete tutorial is available at: http://wiki.gnu-log.net/
173	
By default, the user should put data at:
 frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)

So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at:
 frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
180	
181	If you wish to put user data at a custom offset from the beginning of
182	the frame (for payload alignment with SOCK_RAW mode for instance) you
183	can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
184	to make this work it must be enabled previously with setsockopt()
185	and the PACKET_TX_HAS_OFF option.
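
The two expressions given above for the default data position name the same
offset, because TPACKET_HDRLEN is defined in linux/if_packet.h as
TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + sizeof(struct sockaddr_ll). The
following is a minimal sketch of that calculation, assuming TPACKET_V1 and no
PACKET_TX_HAS_OFF; the helper names tx_payload_offset() and tx_payload() are
hypothetical and not part of any kernel or libc API.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Default offset of the user data inside a TX frame (TPACKET_V1). */
static size_t tx_payload_offset(void)
{
	return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Hypothetical helper: pointer to the user data of the frame at frame_base. */
static uint8_t *tx_payload(void *frame_base)
{
	return (uint8_t *)frame_base + tx_payload_offset();
}
```
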
186	
187	--------------------------------------------------------------------------------
188	+ PACKET_MMAP settings
189	--------------------------------------------------------------------------------
190	
Setting up PACKET_MMAP from user level code is done with a call like
192	
193	 - Capture process
194	     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
195	 - Transmission process
196	     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
197	
The most significant argument in the previous call is the req parameter,
which must have the following structure:
200	
201	    struct tpacket_req
202	    {
203	        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
204	        unsigned int    tp_block_nr;    /* Number of blocks */
205	        unsigned int    tp_frame_size;  /* Size of frame */
206	        unsigned int    tp_frame_nr;    /* Total number of frames */
207	    };
208	
This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Mapping it into the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.
213	
214	Frames are grouped in blocks. Each block is a physically contiguous
215	region of memory and holds tp_block_size/tp_frame_size frames. The total number 
216	of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
217	
218	    frames_per_block = tp_block_size/tp_frame_size
219	
220	indeed, packet_set_ring checks that the following condition is true
221	
222	    frames_per_block * tp_block_nr == tp_frame_nr
223	
Let's see an example, with the following values:
225	
226	     tp_block_size= 4096
227	     tp_frame_size= 2048
228	     tp_block_nr  = 4
229	     tp_frame_nr  = 8
230	
231	we will get the following buffer structure:
232	
233	        block #1                 block #2         
234	+---------+---------+    +---------+---------+    
235	| frame 1 | frame 2 |    | frame 3 | frame 4 |    
236	+---------+---------+    +---------+---------+    
237	
238	        block #3                 block #4
239	+---------+---------+    +---------+---------+
240	| frame 5 | frame 6 |    | frame 7 | frame 8 |
241	+---------+---------+    +---------+---------+
242	
A frame can be of any size, provided that it fits in a block. A block
can only hold an integer number of frames; in other words, a frame cannot
span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
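
Because tp_frame_nr is redundant, it can be derived rather than typed in by
hand. The sketch below fills a struct tpacket_req from the three independent
values; ring_req_init() is a hypothetical helper name, and the only check it
performs is that a frame fits in a block.

```c
#include <string.h>
#include <linux/if_packet.h>

/*
 * Fill req from a block size, a frame size and a block count, deriving
 * tp_frame_nr so that frames_per_block * tp_block_nr == tp_frame_nr.
 * Returns 0 on success, -1 if a frame does not fit in a block.
 */
static int ring_req_init(struct tpacket_req *req, unsigned int block_size,
			 unsigned int frame_size, unsigned int block_nr)
{
	unsigned int frames_per_block;

	if (frame_size == 0 || frame_size > block_size)
		return -1;

	frames_per_block = block_size / frame_size;

	memset(req, 0, sizeof(*req));
	req->tp_block_size = block_size;
	req->tp_block_nr = block_nr;
	req->tp_frame_size = frame_size;
	req->tp_frame_nr = frames_per_block * block_nr;
	return 0;
}
```

With the example values above (4096, 2048, 4) this yields tp_frame_nr = 8. The
resulting structure is what would be passed to setsockopt() with
PACKET_RX_RING or PACKET_TX_RING.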
248	
249	--------------------------------------------------------------------------------
250	+ PACKET_MMAP setting constraints
251	--------------------------------------------------------------------------------
252	
253	In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
254	the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
255	16384 in a 64 bit architecture. For information on these kernel versions
256	see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
257	
258	 Block size limit
259	------------------
260	
261	As stated earlier, each block is a contiguous physical region of memory. These 
262	memory regions are allocated with calls to the __get_free_pages() function. As 
263	the name indicates, this function allocates pages of memory, and the second
264	argument is "order" or a power of two number of pages, that is 
265	(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, 
266	order=2 ==> 16384 bytes, etc. The maximum size of a 
267	region allocated by __get_free_pages is determined by the MAX_ORDER macro. More 
268	precisely the limit can be calculated as:
269	
270	   PAGE_SIZE << MAX_ORDER
271	
272	   In a i386 architecture PAGE_SIZE is 4096 bytes 
273	   In a 2.4/i386 kernel MAX_ORDER is 10
274	   In a 2.6/i386 kernel MAX_ORDER is 11
275	
So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.
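
As an illustration of the order arithmetic above, the following sketch computes
the size of an allocation of a given order and queries the page size at run
time. It uses sysconf(_SC_PAGESIZE), the POSIX counterpart of getpagesize();
the function names are hypothetical. MAX_ORDER is a kernel constant that is
passed in by the caller here, and the result for order == MAX_ORDER follows the
formula as stated in this document, so it is best treated as an upper bound.

```c
#include <stdint.h>
#include <unistd.h>

/* Size in bytes of an allocation of the given order: page_size * 2^order. */
static uint64_t order_bytes(uint64_t page_size, unsigned int order)
{
	return page_size << order;
}

/* Page size of the running system, or 0 if it cannot be determined. */
static uint64_t runtime_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	return sz > 0 ? (uint64_t)sz : 0;
}
```
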
284	
285	 Block number limit
286	--------------------
287	
288	To understand the constraints of PACKET_MMAP, we have to see the structure 
289	used to hold the pointers to each block.
290	
Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated.
293	
294	    +---+---+---+---+
295	    | x | x | x | x |
296	    +---+---+---+---+
297	      |   |   |   |
298	      |   |   |   v
299	      |   |   v  block #4
300	      |   v  block #3
301	      v  block #2
302	     block #1
303	
kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum amount of memory that kmalloc can allocate.
308	
309	In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The 
310	predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" 
311	entries of /proc/slabinfo
312	
313	In a 32 bit architecture, pointers are 4 bytes long, so the total number of 
314	pointers to blocks is
315	
316	     131072/4 = 32768 blocks
317	
318	 PACKET_MMAP buffer size calculator
319	------------------------------------
320	
321	Definitions:
322	
<size-max>    : is the maximum size allocable with kmalloc (see /proc/slabinfo)
324	<pointer size>: depends on the architecture -- sizeof(void *)
325	<page size>   : depends on the architecture -- PAGE_SIZE or getpagesize (2)
326	<max-order>   : is the value defined with MAX_ORDER
327	<frame size>  : it's an upper bound of frame's capture size (more on this later)
328	
from these definitions we will derive

	<block number> = <size-max>/<pointer size>
	<block size> = <page size> << <max-order>

so, the max buffer size is

	<block number> * <block size>

and the number of frames is

	<block number> * <block size> / <frame size>
341	
342	Suppose the following parameters, which apply for 2.6 kernel and an
343	i386 architecture:
344	
345		<size-max> = 131072 bytes
346		<pointer size> = 4 bytes
	<page size> = 4096 bytes
348		<max-order> = 11
349	
350	and a value for <frame size> of 2048 bytes. These parameters will yield
351	
352		<block number> = 131072/4 = 32768 blocks
353		<block size> = 4096 << 11 = 8 MiB.
354	
and hence the buffer will have a size of 262144 MiB. So it can hold
262144 MiB / 2048 bytes = 134217728 frames.

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; on i386, the
kernel's memory size is limited to 1GiB.

No memory allocation is freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, which basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally the limits can be reached.
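
The calculator above can be written down directly. This is only a sketch of
the arithmetic in this section: struct ring_limits and ring_limits_calc() are
hypothetical names, all inputs are supplied by the caller, and the result is
the theoretical maximum, not what a given machine can actually allocate.

```c
#include <stdint.h>

struct ring_limits {
	uint64_t block_nr;	/* <block number> */
	uint64_t block_size;	/* <block size>, in bytes */
	uint64_t buffer_size;	/* maximum buffer size, in bytes */
	uint64_t frame_nr;	/* number of frames the buffer can hold */
};

/* Apply the formulas of the buffer size calculator to the given inputs. */
static struct ring_limits ring_limits_calc(uint64_t size_max,
					   uint64_t pointer_size,
					   uint64_t page_size,
					   unsigned int max_order,
					   uint64_t frame_size)
{
	struct ring_limits l;

	l.block_nr = size_max / pointer_size;
	l.block_size = page_size << max_order;
	l.buffer_size = l.block_nr * l.block_size;
	l.frame_nr = l.buffer_size / frame_size;
	return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) reproduces the 2.6/i386
figures above: 32768 blocks of 8 MiB each, 262144 MiB in total, and 134217728
frames.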
366	
367	 Other constraints
368	-------------------
369	
If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
frame's meta information, such as the timestamp. So what we draw here as a
frame is really the following (from include/linux/if_packet.h):
375	
376	/*
377	   Frame structure:
378	
379	   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
380	   - struct tpacket_hdr
381	   - pad to TPACKET_ALIGNMENT=16
382	   - struct sockaddr_ll
383	   - Gap, chosen so that packet data (Start+tp_net) aligns to 
384	     TPACKET_ALIGNMENT=16
385	   - Start+tp_mac: [ Optional MAC header ]
386	   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
387	   - Pad to align to TPACKET_ALIGNMENT=16
388	 */
389	 
390	 The following are conditions that are checked in packet_set_ring
391	
392	   tp_block_size must be a multiple of PAGE_SIZE (1)
393	   tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
394	   tp_frame_size must be a multiple of TPACKET_ALIGNMENT
395	   tp_frame_nr   must be exactly frames_per_block*tp_block_nr
396	
397	Note that tp_block_size should be chosen to be a power of two or there will
398	be a waste of memory.
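
The four conditions listed above can be checked in user space before calling
setsockopt(), which gives a clearer diagnosis than a failed system call. The
sketch below assumes TPACKET_V1 (it compares against TPACKET_HDRLEN) and takes
the page size as a parameter; ring_req_valid() is a hypothetical helper and
only mirrors the checks as this document describes them, not the kernel code
itself.

```c
#include <linux/if_packet.h>

/*
 * Returns 1 if req passes the four checks listed for packet_set_ring,
 * 0 otherwise. page_size is the system page size in bytes.
 */
static int ring_req_valid(const struct tpacket_req *req,
			  unsigned int page_size)
{
	unsigned int frames_per_block;

	/* tp_block_size must be a multiple of PAGE_SIZE */
	if (page_size == 0 || req->tp_block_size == 0 ||
	    req->tp_block_size % page_size != 0)
		return 0;
	/* tp_frame_size must be greater than TPACKET_HDRLEN */
	if (req->tp_frame_size <= TPACKET_HDRLEN)
		return 0;
	/* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
	if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
		return 0;

	/* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
	frames_per_block = req->tp_block_size / req->tp_frame_size;
	if (frames_per_block == 0)
		return 0;
	return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}
```
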
399	
400	--------------------------------------------------------------------------------
401	+ Mapping and use of the circular buffer (ring)
402	--------------------------------------------------------------------------------
403	
The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several physically
discontiguous blocks of memory, they are contiguous in user space, hence
just one call to mmap is needed:
408	
409	    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
410	
If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, there will be a gap
after every tp_block_size/tp_frame_size frames. This is because a frame
cannot span two blocks.
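
The rule above translates into a small address calculation. The following
sketch locates a frame inside the mapping; frames are counted from 0 here
(the diagrams in this document count from 1), and frame_offset() and
frame_at() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Byte offset of frame idx from the start of the mapping. Frames never
 * cross a block boundary, so any space left at the end of a block is
 * skipped.
 */
static size_t frame_offset(unsigned int idx, unsigned int block_size,
			   unsigned int frame_size)
{
	unsigned int frames_per_block = block_size / frame_size;
	size_t block = idx / frames_per_block;
	size_t slot = idx % frames_per_block;

	return block * block_size + slot * frame_size;
}

/* Pointer to frame idx inside the mapping that starts at ring. */
static void *frame_at(void *ring, unsigned int idx, unsigned int block_size,
		      unsigned int frame_size)
{
	return (uint8_t *)ring + frame_offset(idx, block_size, frame_size);
}
```

For example, with tp_block_size = 4096 and tp_frame_size = 1536 each block
holds two frames, so frame 2 starts at offset 4096 rather than 3072.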
416	
At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:
421	
422	+++ Capture process:
423	     from include/linux/if_packet.h
424	
425	     #define TP_STATUS_COPY          2 
426	     #define TP_STATUS_LOSING        4 
427	     #define TP_STATUS_CSUMNOTREADY  8 
428	
TP_STATUS_COPY        : This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING      : indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose
                        checksum will be computed in hardware. So while
                        reading the packet we should not try to check the
                        checksum.
450	
451	for convenience there are also the following defines:
452	
453	     #define TP_STATUS_KERNEL        0
454	     #define TP_STATUS_USER          1
455	
The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet, it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read, the user must zero the status field so the kernel
can use that frame buffer again.
461	
462	The user can use poll (any other variant should apply too) to check if new
463	packets are in the ring:
464	
465	    struct pollfd pfd;
466	
467	    pfd.fd = fd;
468	    pfd.revents = 0;
469	    pfd.events = POLLIN|POLLRDNORM|POLLERR;
470	
471	    if (status == TP_STATUS_KERNEL)
472	        retval = poll(&pfd, 1, timeout);
473	
Checking the status value first and then polling for frames does not
incur a race condition.
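
The ownership handshake described above can be sketched as two small helpers
operating on one TPACKET_V1 frame. The names rx_frame_peek() and
rx_frame_release() are hypothetical. The use of the GCC/Clang builtin
__sync_synchronize() as a memory barrier is an assumption of this sketch, not
something this document prescribes.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * If the TPACKET_V1 frame holds a packet, return a pointer to its data
 * (at offset tp_mac) and store the captured length in *len.
 * Return NULL if the frame still belongs to the kernel.
 */
static const uint8_t *rx_frame_peek(void *frame, unsigned int *len)
{
	volatile struct tpacket_hdr *hdr = frame;

	if ((hdr->tp_status & TP_STATUS_USER) == 0)
		return NULL;

	/* Do not read the packet before the status word has been seen. */
	__sync_synchronize();
	*len = hdr->tp_snaplen;
	return (const uint8_t *)frame + hdr->tp_mac;
}

/* Hand the frame back to the kernel once the packet has been consumed. */
static void rx_frame_release(void *frame)
{
	volatile struct tpacket_hdr *hdr = frame;

	/* Finish all reads of the frame before giving it up. */
	__sync_synchronize();
	hdr->tp_status = TP_STATUS_KERNEL;
}
```

A capture loop would call rx_frame_peek() on the current frame, call poll() as
shown above when it returns NULL, and otherwise process the data, call
rx_frame_release() and advance to the next frame of the ring.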
476	
+++ Transmission process:
These defines are used for transmission:
479	
480	     #define TP_STATUS_AVAILABLE        0 // Frame is available
481	     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
482	     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
483	     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct
484	
485	First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
486	packet, the user fills a data buffer of an available frame, sets tp_len to
487	current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
488	This can be done on multiple frames. Once the user is ready to transmit, it
489	calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
490	forwarded to the network device. The kernel updates each status of sent
491	frames with TP_STATUS_SENDING until the end of transfer.
492	At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
493	
    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(fd, NULL, 0, 0);
497	
498	The user can also use poll() to check if a buffer is available:
499	(status == TP_STATUS_SENDING)
500	
501	    struct pollfd pfd;
502	    pfd.fd = fd;
503	    pfd.revents = 0;
504	    pfd.events = POLLOUT;
505	    retval = poll(&pfd, 1, timeout);
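
The user side of the transmission handshake can be sketched as follows,
assuming TPACKET_V1 and the default data offset (no PACKET_TX_HAS_OFF).
tx_frame_queue() is a hypothetical helper; the __sync_synchronize() barrier
(a GCC/Clang builtin) is an assumption of this sketch and is there to make
sure the payload is written before the status changes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/if_packet.h>

/*
 * Copy len bytes into a TPACKET_V1 TX frame and mark it for sending.
 * Returns 0 on success, -1 if the frame is not available or the payload
 * does not fit. The frame goes out on the next send() on the socket.
 */
static int tx_frame_queue(void *frame, unsigned int frame_size,
			  const void *payload, unsigned int len)
{
	volatile struct tpacket_hdr *hdr = frame;
	const size_t off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

	if (hdr->tp_status != TP_STATUS_AVAILABLE)
		return -1;
	if (frame_size < off || len > frame_size - off)
		return -1;

	memcpy((uint8_t *)frame + off, payload, len);
	hdr->tp_len = len;

	/* Publish the status only after payload and length are written. */
	__sync_synchronize();
	hdr->tp_status = TP_STATUS_SEND_REQUEST;
	return 0;
}
```

Several frames can be queued this way before a single send(fd, NULL, 0, 0)
flushes them all.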
506	
507	-------------------------------------------------------------------------------
508	+ What TPACKET versions are available and when to use them?
509	-------------------------------------------------------------------------------
510	
 int val = tpacket_version;
 socklen_t len = sizeof(val);
 setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
 getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);
514	
515	where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
516	
TPACKET_V1:
	- Default if not otherwise specified by setsockopt(2)
	- RX_RING, TX_RING available

TPACKET_V1 --> TPACKET_V2:
	- Made 64 bit clean due to unsigned long usage in TPACKET_V1
	  structures, thus this also works on 64 bit kernel with 32 bit
	  userspace and the like
	- Timestamp resolution in nanoseconds instead of microseconds
	- RX_RING, TX_RING available
	- VLAN metadata information available for packets
	  (TP_STATUS_VLAN_VALID, tp_vlan_tci in struct tpacket2_hdr)
529		- How to switch to TPACKET_V2:
530			1. Replace struct tpacket_hdr by struct tpacket2_hdr
531			2. Query header len and save
532			3. Set protocol version to 2, set up ring as usual
533			4. For getting the sockaddr_ll,
534			   use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
535			   (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
536	
537	TPACKET_V2 --> TPACKET_V3:
538		- Flexible buffer implementation:
539			1. Blocks can be configured with non-static frame-size
540			2. Read/poll is at a block-level (as opposed to packet-level)
541			3. Added poll timeout to avoid indefinite user-space wait
542			   on idle links
543			4. Added user-configurable knobs:
544				4.1 block::timeout
545				4.2 tpkt_hdr::sk_rxhash
546		- RX Hash data available in user space
547		- Currently only RX_RING available
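
Steps 2 and 4 of the switch to TPACKET_V2 can be sketched as follows. The
PACKET_HDRLEN socket option takes the TPACKET version as input and returns the
matching header length; query_hdrlen() and frame_sockaddr_ll() are
hypothetical helper names, and the socket fd is assumed to have been created
as shown earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Step 2: ask the kernel for the header length of the given version
 * (for example TPACKET_V2). Returns the length, or -1 on error.
 */
static int query_hdrlen(int fd, int version)
{
	int val = version;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
		return -1;
	return val;
}

/* Step 4: locate the sockaddr_ll that follows the frame header. */
static struct sockaddr_ll *frame_sockaddr_ll(void *hdr, unsigned int hdrlen)
{
	return (struct sockaddr_ll *)((uint8_t *)hdr + TPACKET_ALIGN(hdrlen));
}
```

Saving the queried length once and reusing it for every frame avoids
hard-coding sizeof(struct tpacket_hdr) into the frame parsing code.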
548	
549	-------------------------------------------------------------------------------
550	+ AF_PACKET fanout mode
551	-------------------------------------------------------------------------------
552	
553	In the AF_PACKET fanout mode, packet reception can be load balanced among
554	processes. This also works in combination with mmap(2) on packet sockets.
555	
556	Currently implemented fanout policies are:
557	
558	  - PACKET_FANOUT_HASH: schedule to socket by skb's rxhash
559	  - PACKET_FANOUT_LB: schedule to socket by round-robin
560	  - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
561	  - PACKET_FANOUT_RND: schedule to socket by random selection
562	  - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
563	
564	Minimal example code by David S. Miller (try things like "./test eth0 hash",
565	"./test eth0 lb", etc.):
566	
567	#include <stddef.h>
568	#include <stdlib.h>
569	#include <stdio.h>
570	#include <string.h>
571	
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/ioctl.h>

#include <unistd.h>
#include <arpa/inet.h>	/* htons() */
578	
579	#include <linux/if_ether.h>
580	#include <linux/if_packet.h>
581	
582	#include <net/if.h>
583	
584	static const char *device_name;
585	static int fanout_type;
586	static int fanout_id;
587	
588	#ifndef PACKET_FANOUT
589	# define PACKET_FANOUT			18
590	# define PACKET_FANOUT_HASH		0
591	# define PACKET_FANOUT_LB		1
592	#endif
593	
/* Returns the bound, fanout-enabled socket, or -1 on failure. */
static int setup_socket(void)
{
	int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
	struct sockaddr_ll ll;
	struct ifreq ifr;
	int fanout_arg;

	if (fd < 0) {
		perror("socket");
		return -1;
	}

	/* Look up the interface index; keep ifr_name NUL-terminated. */
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, device_name, sizeof(ifr.ifr_name) - 1);
	err = ioctl(fd, SIOCGIFINDEX, &ifr);
	if (err < 0) {
		perror("SIOCGIFINDEX");
		goto fail;
	}

	memset(&ll, 0, sizeof(ll));
	ll.sll_family = AF_PACKET;
	ll.sll_ifindex = ifr.ifr_ifindex;
	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	if (err < 0) {
		perror("bind");
		goto fail;
	}

	/* Low 16 bits: fanout group id; high 16 bits: fanout policy. */
	fanout_arg = (fanout_id | (fanout_type << 16));
	err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
			 &fanout_arg, sizeof(fanout_arg));
	if (err) {
		perror("setsockopt");
		goto fail;
	}

	return fd;

fail:
	close(fd);
	return -1;
}
633	
634	static void fanout_thread(void)
635	{
636		int fd = setup_socket();
637		int limit = 10000;
638	
	if (fd < 0)
		exit(EXIT_FAILURE);
641	
642		while (limit-- > 0) {
643			char buf[1600];
644			int err;
645	
646			err = read(fd, buf, sizeof(buf));
647			if (err < 0) {
648				perror("read");
649				exit(EXIT_FAILURE);
650			}
651			if ((limit % 10) == 0)
652				fprintf(stdout, "(%d) \n", getpid());
653		}
654	
655		fprintf(stdout, "%d: Received 10000 packets\n", getpid());
656	
657		close(fd);
658		exit(0);
659	}
660	
661	int main(int argc, char **argp)
662	{
663		int fd, err;
664		int i;
665	
666		if (argc != 3) {
667			fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
668			return EXIT_FAILURE;
669		}
670	
671		if (!strcmp(argp[2], "hash"))
672			fanout_type = PACKET_FANOUT_HASH;
673		else if (!strcmp(argp[2], "lb"))
674			fanout_type = PACKET_FANOUT_LB;
675		else {
676			fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
677			exit(EXIT_FAILURE);
678		}
679	
680		device_name = argp[1];
681		fanout_id = getpid() & 0xffff;
682	
683		for (i = 0; i < 4; i++) {
684			pid_t pid = fork();
685	
		switch (pid) {
		case 0:
			fanout_thread();	/* does not return */
			break;

		case -1:
			perror("fork");
			exit(EXIT_FAILURE);
		}
694		}
695	
696		for (i = 0; i < 4; i++) {
697			int status;
698	
699			wait(&status);
700		}
701	
702		return 0;
703	}
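
The PACKET_FANOUT argument built in setup_socket() above packs the fanout
group id into the low 16 bits and the fanout policy into the high 16 bits.
The sketch below isolates that encoding; the fanout_arg_*() helper names are
hypothetical, and the PACKET_FANOUT_* constants are taken from
linux/if_packet.h, which is assumed to be recent enough to define them.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Pack a 16-bit group id and a fanout policy into the option argument. */
static int fanout_arg_pack(uint16_t group_id, uint16_t type)
{
	return (int)((uint32_t)group_id | ((uint32_t)type << 16));
}

/* Extract the group id (low 16 bits) from a packed argument. */
static uint16_t fanout_arg_group(int arg)
{
	return (uint16_t)((uint32_t)arg & 0xffff);
}

/* Extract the fanout policy (high 16 bits) from a packed argument. */
static uint16_t fanout_arg_type(int arg)
{
	return (uint16_t)(((uint32_t)arg >> 16) & 0xffff);
}
```

All sockets that should share the load must pass the same group id, which is
why the example derives it once in main() before forking.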
704	
705	-------------------------------------------------------------------------------
706	+ AF_PACKET TPACKET_V3 example
707	-------------------------------------------------------------------------------
708	
AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and its
predecessor.
712	
713	It is said that TPACKET_V3 brings the following benefits:
714	 *) ~15 - 20% reduction in CPU-usage
715	 *) ~20% increase in packet capture rate
716	 *) ~2x increase in packet density
717	 *) Port aggregation analysis
718	 *) Non static frame size to capture entire packet payload
719	
720	So it seems to be a good candidate to be used with packet fanout.
721	
722	Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
723	it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
724	
725	/* Written from scratch, but kernel-to-user space API usage
726	 * dissected from lolpcap:
727	 *  Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
728	 *  License: GPL, version 2.0
729	 */
730	
731	#include <stdio.h>
732	#include <stdlib.h>
733	#include <stdint.h>
734	#include <string.h>
735	#include <assert.h>
736	#include <net/if.h>
737	#include <arpa/inet.h>
738	#include <netdb.h>
739	#include <poll.h>
740	#include <unistd.h>
741	#include <signal.h>
742	#include <inttypes.h>
743	#include <sys/socket.h>
744	#include <sys/mman.h>
745	#include <linux/if_packet.h>
746	#include <linux/if_ether.h>
747	#include <linux/ip.h>
748	
749	#ifndef likely
750	# define likely(x)		__builtin_expect(!!(x), 1)
751	#endif
752	#ifndef unlikely
753	# define unlikely(x)		__builtin_expect(!!(x), 0)
754	#endif
755	
756	struct block_desc {
757		uint32_t version;
758		uint32_t offset_to_priv;
759		struct tpacket_hdr_v1 h1;
760	};
761	
762	struct ring {
763		struct iovec *rd;
764		uint8_t *map;
765		struct tpacket_req3 req;
766	};
767	
768	static unsigned long packets_total = 0, bytes_total = 0;
static volatile sig_atomic_t sigint = 0;
770	
771	static void sighandler(int num)
772	{
773		sigint = 1;
774	}
775	
776	static int setup_socket(struct ring *ring, char *netdev)
777	{
778		int err, i, fd, v = TPACKET_V3;
779		struct sockaddr_ll ll;
780		unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
781		unsigned int blocknum = 64;
782	
783		fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
784		if (fd < 0) {
785			perror("socket");
786			exit(1);
787		}
788	
789		err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
790		if (err < 0) {
791			perror("setsockopt");
792			exit(1);
793		}
794	
795		memset(&ring->req, 0, sizeof(ring->req));
796		ring->req.tp_block_size = blocksiz;
797		ring->req.tp_frame_size = framesiz;
798		ring->req.tp_block_nr = blocknum;
799		ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
800		ring->req.tp_retire_blk_tov = 60;
801		ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
802	
803		err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
804				 sizeof(ring->req));
805		if (err < 0) {
806			perror("setsockopt");
807			exit(1);
808		}
809	
810		ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
811				 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
812		if (ring->map == MAP_FAILED) {
813			perror("mmap");
814			exit(1);
815		}
816	
817		ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
818		assert(ring->rd);
819		for (i = 0; i < ring->req.tp_block_nr; ++i) {
820			ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
821			ring->rd[i].iov_len = ring->req.tp_block_size;
822		}
823	
824		memset(&ll, 0, sizeof(ll));
825		ll.sll_family = PF_PACKET;
826		ll.sll_protocol = htons(ETH_P_ALL);
827		ll.sll_ifindex = if_nametoindex(netdev);
828		ll.sll_hatype = 0;
829		ll.sll_pkttype = 0;
830		ll.sll_halen = 0;
831	
832		err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
833		if (err < 0) {
834			perror("bind");
835			exit(1);
836		}
837	
838		return fd;
839	}
840	
841	static void display(struct tpacket3_hdr *ppd)
842	{
843		struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
844		struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
845	
846		if (eth->h_proto == htons(ETH_P_IP)) {
847			struct sockaddr_in ss, sd;
848			char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
849	
850			memset(&ss, 0, sizeof(ss));
851			ss.sin_family = PF_INET;
852			ss.sin_addr.s_addr = ip->saddr;
853			getnameinfo((struct sockaddr *) &ss, sizeof(ss),
854				    sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
855	
856			memset(&sd, 0, sizeof(sd));
857			sd.sin_family = PF_INET;
858			sd.sin_addr.s_addr = ip->daddr;
859			getnameinfo((struct sockaddr *) &sd, sizeof(sd),
860				    dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
861	
862			printf("%s -> %s, ", sbuff, dbuff);
863		}
864	
865		printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
866	}
867	
868	static void walk_block(struct block_desc *pbd, const int block_num)
869	{
870		int num_pkts = pbd->h1.num_pkts, i;
871		unsigned long bytes = 0;
872		struct tpacket3_hdr *ppd;
873	
874		ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
875					       pbd->h1.offset_to_first_pkt);
876		for (i = 0; i < num_pkts; ++i) {
877			bytes += ppd->tp_snaplen;
878			display(ppd);
879	
880			ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
881						       ppd->tp_next_offset);
882		}
883	
884		packets_total += num_pkts;
885		bytes_total += bytes;
886	}
887	
888	static void flush_block(struct block_desc *pbd)
889	{
890		pbd->h1.block_status = TP_STATUS_KERNEL;
891	}
892	
893	static void teardown_socket(struct ring *ring, int fd)
894	{
895		munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
896		free(ring->rd);
897		close(fd);
898	}
899	
900	int main(int argc, char **argp)
901	{
902		int fd, err;
903		socklen_t len;
904		struct ring ring;
905		struct pollfd pfd;
906		unsigned int block_num = 0, blocks = 64;
907		struct block_desc *pbd;
908		struct tpacket_stats_v3 stats;
909	
910		if (argc != 2) {
911			fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
912			return EXIT_FAILURE;
913		}
914	
915		signal(SIGINT, sighandler);
916	
917		memset(&ring, 0, sizeof(ring));
918		fd = setup_socket(&ring, argp[argc - 1]);
919		assert(fd > 0);
920	
921		memset(&pfd, 0, sizeof(pfd));
922		pfd.fd = fd;
923		pfd.events = POLLIN | POLLERR;
924		pfd.revents = 0;
925	
926		while (likely(!sigint)) {
927			pbd = (struct block_desc *) ring.rd[block_num].iov_base;
928	
929			if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
930				poll(&pfd, 1, -1);
931				continue;
932			}
933	
934			walk_block(pbd, block_num);
935			flush_block(pbd);
936			block_num = (block_num + 1) % blocks;
937		}
938	
939		len = sizeof(stats);
940		err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
941		if (err < 0) {
942			perror("getsockopt");
943			exit(1);
944		}
945	
946		fflush(stdout);
947		printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
948		       stats.tp_packets, bytes_total, stats.tp_drops,
949		       stats.tp_freeze_q_cnt);
950	
951		teardown_socket(&ring, fd);
952		return 0;
953	}
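
The sizes passed to PACKET_RX_RING in setup_socket() above are tied together
by simple arithmetic. The following is only a sketch, under the assumption
that tp_frame_nr has to equal the number of frames per block times
tp_block_nr. The names struct ring_geometry and compute_ring_geometry() are
hypothetical and not part of the example program. With the values used above
(4 MiB blocks, 2 KiB frames, 64 blocks) it yields 2048 frames per block,
131072 frames in total and a 256 MiB mapping:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not part of the example program above. */
struct ring_geometry {
	uint32_t frames_per_block;
	uint32_t frame_nr;	/* candidate value for tp_frame_nr */
	uint64_t map_len;	/* length for mmap() and munmap() */
};

static struct ring_geometry compute_ring_geometry(uint32_t blocksiz,
						  uint32_t framesiz,
						  uint32_t blocknum)
{
	struct ring_geometry g;
	uint64_t frames;

	/* At least one frame has to fit into a block. */
	assert(framesiz != 0 && framesiz <= blocksiz);

	g.frames_per_block = blocksiz / framesiz;

	/* Widen before multiplying so that large rings do not wrap. */
	frames = (uint64_t) g.frames_per_block * blocknum;
	assert(frames <= UINT32_MAX);
	g.frame_nr = (uint32_t) frames;

	g.map_len = (uint64_t) blocksiz * blocknum;
	return g;
}
```

Doing the multiplications in 64 bit avoids the wrap-around that the plain
unsigned int expressions in the example would hit once block size times
block count reaches 4 GiB.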
954	
955	-------------------------------------------------------------------------------
956	+ PACKET_TIMESTAMP
957	-------------------------------------------------------------------------------
958	
959	The PACKET_TIMESTAMP setting determines the source of the timestamp in
960	the packet meta information for mmap(2)ed RX_RING and TX_RINGs.  If your
961	NIC is capable of timestamping packets in hardware, you can request those
962	hardware timestamps to be used. Note: you may need to enable the generation
963	of hardware timestamps with SIOCSHWTSTAMP (see related information from
964	Documentation/networking/timestamping.txt).
965	
966	PACKET_TIMESTAMP accepts the same integer bit field as
967	SO_TIMESTAMPING.  However, only the SOF_TIMESTAMPING_SYS_HARDWARE
968	and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by
969	PACKET_TIMESTAMP.  SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over
970	SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.
971	
972	    int req = 0;
973	    req |= SOF_TIMESTAMPING_SYS_HARDWARE;
974	    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));
975	
976	For the mmap(2)ed ring buffers, such timestamps are stored in the
977	tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
978	what kind of timestamp has been reported, the tp_status field is binary |'ed
979	with the following possible bits ...
980	
981	    TP_STATUS_TS_SYS_HARDWARE
982	    TP_STATUS_TS_RAW_HARDWARE
983	    TP_STATUS_TS_SOFTWARE
984	
985	... that are equivalent to their SOF_TIMESTAMPING_* counterparts. For the
986	RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set),
987	then this means that a software fallback was invoked *within* PF_PACKET's
988	processing code (less precise).
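
A minimal sketch of how an application might decode these bits from
tp_status is given below. The enum ts_source and the two helpers are
hypothetical names, and the order in which the bits are tested is an
arbitrary choice of this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical classification of the timestamp found in a ring frame. */
enum ts_source {
	TS_SRC_FALLBACK,	/* none of the three bits is set */
	TS_SRC_SOFTWARE,
	TS_SRC_RAW_HARDWARE,
	TS_SRC_SYS_HARDWARE,
};

static enum ts_source classify_timestamp(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_SYS_HARDWARE)
		return TS_SRC_SYS_HARDWARE;
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_SRC_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_SRC_SOFTWARE;
	return TS_SRC_FALLBACK;
}

/* Printable name, e.g. for a display() routine. */
static const char *ts_source_name(enum ts_source src)
{
	static const char *const names[] = {
		"fallback", "software", "raw hardware", "sys hardware",
	};

	assert((unsigned int) src < sizeof(names) / sizeof(names[0]));
	return names[src];
}
```

For a TPACKET_V3 frame this would be called as
classify_timestamp(ppd->tp_status) while walking a block.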
989	
990	Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
991	ii) call sendto() e.g. in blocking mode, iii) wait until the status of the
992	relevant frames is updated, i.e. the frames are handed back to the application,
993	iv) walk through the frames to pick up the individual hw/sw timestamps.
994	
995	Only (!) if transmit timestamping is enabled are these bits combined with
996	TP_STATUS_AVAILABLE using binary |, so your application must account for
997	that: first check !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
998	to see whether the frame belongs to the application, and then extract the
999	type of timestamp from tp_status in a second step!
1000	
1001	If you don't care about timestamps and leave them disabled, checking for
1002	TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the
1003	TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
1004	members do not contain a valid value. For TX_RINGs, by default no timestamp
1005	is generated!
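
The two-step check for TX_RING frames described above could be sketched as
follows. TS_BITS, tx_frame_is_ours() and tx_timestamp_valid() are
hypothetical names introduced for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* All bits that may report a timestamp type in tp_status. */
#define TS_BITS	(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_SYS_HARDWARE | \
		 TP_STATUS_TS_RAW_HARDWARE)

/* Step 1: the frame belongs to the application once it is neither queued
 * for sending nor currently being sent. */
static bool tx_frame_is_ours(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/* Step 2: tp_sec and tp_{n,u}sec only carry a valid value if one of the
 * timestamp bits is set. Must only be called after step 1 succeeded. */
static bool tx_timestamp_valid(uint32_t tp_status)
{
	assert(tx_frame_is_ours(tp_status));
	return (tp_status & TS_BITS) != 0;
}
```

A frame marked TP_STATUS_WRONG_FORMAT also passes step 1, so an application
that cares about rejected frames still has to test that bit separately.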
1006	
1007	See include/linux/net_tstamp.h and Documentation/networking/timestamping
1008	for more information on hardware timestamps.
1009	
1010	-------------------------------------------------------------------------------
1011	+ Miscellaneous bits
1012	-------------------------------------------------------------------------------
1013	
1014	- Packet sockets work well together with Linux socket filters, thus you also
1015	  might want to have a look at Documentation/networking/filter.txt
1016	
1017	--------------------------------------------------------------------------------
1018	+ THANKS
1019	--------------------------------------------------------------------------------
1020	   
1021	   Jesse Brandeburg, for fixing my grammatical/spelling errors