Documentation / vm / hugetlbpage.txt

Based on kernel version 2.6.26. Page generated on 2008-07-16 21:13 EST.

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel.  This support is built on top of multiple page size support
that is provided by most modern architectures.  For example, the i386
architecture supports 4K and 4M (2M in PAE mode) page sizes, the ia64
architecture supports multiple page sizes (4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M) and ppc64 supports 4K and 16M.  A TLB is a cache of virtual-to-physical
translations.  Typically this is a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB
entries.  A larger page size lets each TLB entry cover more memory, so fewer
entries are needed to map the same working set.  This optimization is more
critical now as bigger and bigger physical memories (several GBs) are more
readily available.

Users can use the huge page support in the Linux kernel through either the
mmap system call or the standard SysV shared memory system calls (shmget,
shmat).

First, the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.

A kernel built with hugepage support should show the number of configured
hugepages in the system when the "cat /proc/meminfo" command is run.

/proc/meminfo reports the total number of hugetlb pages configured in the
kernel, the number of free hugetlb pages at any time, and the configured
hugepage size.  The hugepage size is needed for generating the proper
alignment and size of the arguments to the above system calls.
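
As an illustration of the alignment requirement above, the following is a
minimal sketch of rounding a requested length up to a whole number of huge
pages.  The function name round_up_to_hugepage is hypothetical, and the huge
page size is assumed to be a power of two taken from the Hugepagesize line of
/proc/meminfo (converted from kB to bytes).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round len up to the next multiple of hpage_size.
 * Assumption: hpage_size is a non-zero power of two (e.g. 2MB or 4MB).
 */
static size_t round_up_to_hugepage(size_t len, size_t hpage_size)
{
	assert(hpage_size != 0 && (hpage_size & (hpage_size - 1)) == 0);
	return (len + hpage_size - 1) & ~(hpage_size - 1);
}
```

For example, with a 2MB huge page size a request for 3MB would be rounded up
to 4MB before being passed to mmap or shmget.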

The output of "cat /proc/meminfo" will have lines like:

.....
HugePages_Total: vvv
HugePages_Free:  www
HugePages_Rsvd:  xxx
HugePages_Surp:  yyy
Hugepagesize:    zzz kB

where:
HugePages_Total is the size of the pool of hugepages.
HugePages_Free is the number of hugepages in the pool that are not yet
allocated.
HugePages_Rsvd is short for "reserved," and is the number of hugepages
for which a commitment to allocate from the pool has been made, but no
allocation has yet been made. It's vaguely analogous to overcommit.
HugePages_Surp is short for "surplus," and is the number of hugepages in
the pool above the value in /proc/sys/vm/nr_hugepages. The maximum
number of surplus hugepages is controlled by
/proc/sys/vm/nr_overcommit_hugepages.
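
The fields described above can also be read programmatically.  The following
is a minimal sketch, not part of the kernel sources: the struct and function
names are hypothetical, and the parser only assumes the "Name: value" line
layout shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the five hugepage-related meminfo fields. */
struct hugepage_info {
	unsigned long total;		/* HugePages_Total */
	unsigned long free_pages;	/* HugePages_Free */
	unsigned long rsvd;		/* HugePages_Rsvd */
	unsigned long surp;		/* HugePages_Surp */
	unsigned long size_kb;		/* Hugepagesize, in kB */
};

/*
 * Scan a meminfo-format stream and fill in *hi.
 * Returns the number of hugepage fields found (5 when all are present).
 */
static int read_hugepage_info(FILE *fp, struct hugepage_info *hi)
{
	char line[256];
	int found = 0;

	assert(fp != NULL && hi != NULL);
	memset(hi, 0, sizeof(*hi));
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "HugePages_Total: %lu", &hi->total) == 1 ||
		    sscanf(line, "HugePages_Free: %lu", &hi->free_pages) == 1 ||
		    sscanf(line, "HugePages_Rsvd: %lu", &hi->rsvd) == 1 ||
		    sscanf(line, "HugePages_Surp: %lu", &hi->surp) == 1 ||
		    sscanf(line, "Hugepagesize: %lu", &hi->size_kb) == 1)
			found++;
	}
	return found;
}
```

A caller would typically pass the stream returned by
fopen("/proc/meminfo", "r") and treat a return value below 5 as a sign that
the kernel was built without hugepage support or reports fewer fields.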

/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
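
A program can perform the same check before trying to use hugetlbfs.  The
sketch below assumes the usual /proc/filesystems layout, where each line is
either "nodev <name>" or just "<name>"; the function name stream_lists_fs is
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if a /proc/filesystems-format stream lists fstype, else 0.
 * Each line holds an optional "nodev" column followed by the type name.
 */
static int stream_lists_fs(FILE *fp, const char *fstype)
{
	char line[128], first[64], second[64];

	assert(fp != NULL && fstype != NULL);
	while (fgets(line, sizeof(line), fp)) {
		int n = sscanf(line, "%63s %63s", first, second);
		const char *name = (n == 2) ? second : (n == 1) ? first : NULL;

		if (name && strcmp(name, fstype) == 0)
			return 1;
	}
	return 0;
}
```

Calling it with the stream from fopen("/proc/filesystems", "r") and the
string "hugetlbfs" mirrors the manual check described above.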

/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
pages in the kernel.  The superuser can dynamically request more (or free some
pre-configured) hugepages.
Allocation of hugetlb pages is possible only if there are enough physically
contiguous free pages in the system.  Freeing of hugepages is possible only
if there are enough free hugetlb pages that can be transferred back to the
regular memory pool.

Pages that are used as hugetlb pages are reserved inside the kernel and cannot
be used for other purposes.

Once a kernel with hugetlb page support is built and running, a user can
use either the mmap system call or shared memory system calls to start using
the huge pages.  The system administrator must preallocate enough memory for
huge page purposes.

Use the following command to dynamically allocate/deallocate hugepages:

	echo 20 > /proc/sys/vm/nr_hugepages

This command will try to configure 20 hugepages in the system.  The success
or failure of allocation depends on the amount of physically contiguous
memory that is present in the system at this time.  System administrators may
want to put this command in one of the local rc init files.  This will enable
the kernel to request huge pages early in the boot process (when the
possibility of getting physically contiguous pages is still very high).  In
either case, administrators will want to verify the number of hugepages
actually allocated by checking the sysctl or meminfo.
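
The request-then-verify step can be sketched in C as follows.  This is an
illustration only: request_hugepages is a hypothetical helper, the path is a
parameter so the sketch can be tried on an ordinary file, and writing to the
real /proc/sys/vm/nr_hugepages requires superuser privileges.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the requested page count to a sysctl-style file, then read back
 * the value actually in effect.  Returns that value, or -1 on error.
 * The kernel may grant fewer pages than requested, so callers should
 * compare the result with the count they asked for.
 */
static long request_hugepages(const char *path, unsigned long want)
{
	FILE *fp;
	long got = -1;

	assert(path != NULL);
	fp = fopen(path, "w");
	if (!fp)
		return -1;
	fprintf(fp, "%lu\n", want);
	if (fclose(fp) != 0)	/* write errors surface when flushing */
		return -1;

	fp = fopen(path, "r");
	if (!fp)
		return -1;
	if (fscanf(fp, "%ld", &got) != 1)
		got = -1;
	fclose(fp);
	return got;
}
```

A return value smaller than the requested count would indicate that not
enough physically contiguous memory was available at the time of the call.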

/proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of
hugepages can grow if applications request more hugepages than
/proc/sys/vm/nr_hugepages provides.  Writing any non-zero value into this
file indicates that the hugetlb subsystem is allowed to try to obtain
hugepages from the buddy allocator if the normal pool is exhausted.  As
these surplus hugepages go out of use, they are freed back to the buddy
allocator.

Caveat: Shrinking the pool via nr_hugepages such that it becomes less
than the number of hugepages in use will convert the balance to surplus
huge pages even if it would exceed the overcommit value.  As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.

If the user applications are going to request hugepages using the mmap system
call, then the system administrator must mount a file system of type
hugetlbfs:

  mount -t hugetlbfs \
	-o uid=<value>,gid=<value>,mode=<value>,size=<value>,nr_inodes=<value> \
	none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
/mnt/huge.  Any files created on /mnt/huge use hugepages.  The uid and gid
options set the owner and group of the root of the file system.  By default
the uid and gid of the current process are taken.  The mode option sets the
mode of the root of the file system to value & 0777.  This value is given in
octal.  By default the value 0755 is picked.  The size option sets the maximum
amount of memory (huge pages) allowed for that filesystem (/mnt/huge).  The
size is rounded down to HPAGE_SIZE.  The nr_inodes option sets the maximum
number of inodes that /mnt/huge can use.  If the size or nr_inodes option is
not provided on the command line then no limits are set.  For the size and
nr_inodes options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.
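
The suffix handling and rounding described above can be sketched in user
space as follows.  This is not the kernel's own parser: parse_size and
round_down_to_hugepage are hypothetical names, and the huge page size is
passed in by the caller rather than taken from HPAGE_SIZE.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a value such as "2048", "2K", "512m" or "1G" into a plain number.
 * Returns 0 if the string is not a number with an optional K/M/G suffix.
 */
static unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long v;

	assert(s != NULL);
	v = strtoull(s, &end, 0);
	if (end == s)
		return 0;
	switch (*end) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	return *end == '\0' ? v : 0;
}

/* Round a byte count down to a whole number of huge pages. */
static unsigned long long round_down_to_hugepage(unsigned long long bytes,
						 unsigned long long hpage_size)
{
	assert(hpage_size != 0);
	return bytes - bytes % hpage_size;
}
```

Under these assumptions, size=5M on a system with 2MB huge pages would be
treated as 4MB, that is, two huge pages.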

While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with the right permissions) can be
used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if the
applications are going to use only shmat/shmget system calls.  Users who
wish to use hugetlb pages via a shared memory segment should be members of
a supplementary group, and the system administrator needs to configure that
gid into /proc/sys/vm/hugetlb_shm_group.  It is possible for the same or
different applications to use any combination of mmap and shm* calls, though
the filesystem must be mounted for the mmap calls to work.
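
An application can check the group requirement before calling shmget.  The
sketch below uses the standard getegid and getgroups calls; the function
names are hypothetical, and the gid to test is assumed to have been read
from /proc/sys/vm/hugetlb_shm_group by the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if gid appears in list[0..n-1], else 0. */
static int gid_in_list(gid_t gid, const gid_t *list, int n)
{
	int i;

	assert(n == 0 || list != NULL);
	for (i = 0; i < n; i++)
		if (list[i] == gid)
			return 1;
	return 0;
}

/*
 * Return 1 if the calling process has gid as its effective group or as
 * one of its supplementary groups, 0 if not, -1 on error.
 */
static int process_in_group(gid_t gid)
{
	gid_t *list;
	int n, found;

	if (getegid() == gid)
		return 1;
	n = getgroups(0, NULL);		/* ask for the number of groups */
	if (n <= 0)
		return n;		/* 0: none, -1: error */
	list = malloc((size_t)n * sizeof(*list));
	if (!list)
		return -1;
	n = getgroups(n, list);
	found = (n < 0) ? -1 : gid_in_list(gid, list, n);
	free(list);
	return found;
}
```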

*******************************************************************

/*
 * Example of using hugepage memory in a user application using Sys V shared
 * memory system calls.  In this example the app is requesting 256MB of
 * memory that is backed by huge pages.  The application uses the flag
 * SHM_HUGETLB in the shmget system call to inform the kernel that it is
 * requesting hugepages.
 *
 * For the ia64 architecture, the Linux kernel reserves Region number 4 for
 * hugepages.  That means the addresses starting with 0x800000... will need
 * to be specified.  Specifying a fixed address is not required on ppc64,
 * i386 or x86_64.
 *
 * Note: The default shared memory limit is quite low on many kernels,
 * you may need to increase it via:
 *
 * echo 268435456 > /proc/sys/kernel/shmmax
 *
 * This will increase the maximum size per shared memory segment to 256MB.
 * The other limit that you will hit eventually is shmall which is the
 * total amount of shared memory in pages. To set it to 16GB on a system
 * with a 4kB pagesize do:
 *
 * echo 4194304 > /proc/sys/kernel/shmall
 */
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

/* Older C library headers may not define SHM_HUGETLB. */
#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define LENGTH (256UL*1024*1024)

/* Progress output helper; only ever called with string literals. */
#define dprintf(x)  printf(x)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define SHMAT_FLAGS (SHM_RND)
#else
#define ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)
#endif

int main(void)
{
	int shmid;
	unsigned long i;
	char *shmaddr;

	/* Create (or look up) a 256MB segment backed by huge pages. */
	if ((shmid = shmget(2, LENGTH,
			    SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
		perror("shmget");
		exit(1);
	}
	printf("shmid: 0x%x\n", shmid);

	/* Attach the segment; remove it again if the attach fails. */
	shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
	if (shmaddr == (char *)-1) {
		perror("Shared memory attach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(2);
	}
	printf("shmaddr: %p\n", shmaddr);

	/* Touch every byte, printing a dot for each megabyte written. */
	dprintf("Starting the writes:\n");
	for (i = 0; i < LENGTH; i++) {
		shmaddr[i] = (char)(i);
		if (!(i % (1024 * 1024)))
			dprintf(".");
	}
	dprintf("\n");

	/* Read the pattern back and report any byte that differs. */
	dprintf("Starting the Check...");
	for (i = 0; i < LENGTH; i++)
		if (shmaddr[i] != (char)i)
			printf("\nIndex %lu mismatched\n", i);
	dprintf("Done.\n");

	if (shmdt((const void *)shmaddr) != 0) {
		perror("Detach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(3);
	}

	/* Mark the segment for removal so the huge pages are released. */
	shmctl(shmid, IPC_RMID, NULL);

	return 0;
}
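
The shmmax and shmall values quoted in the comment at the top of this example
follow from simple arithmetic: shmmax is a byte count, while shmall is counted
in base pages.  A minimal sketch of the shmall calculation, using a
hypothetical helper name:

```c
#include <assert.h>

/*
 * Convert a total amount of shared memory in bytes into the page count
 * expected by /proc/sys/kernel/shmall.
 */
static unsigned long long shmall_pages(unsigned long long total_bytes,
				       unsigned long page_size)
{
	assert(page_size != 0);
	return total_bytes / page_size;
}
```

For 16GB of shared memory and a 4kB page size this gives
17179869184 / 4096 = 4194304 pages, and the 256MB shmmax value is
256 * 1024 * 1024 = 268435456 bytes.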

*******************************************************************

/*
 * Example of using hugepage memory in a user application using the mmap
 * system call.  Before running this application, make sure that the
 * administrator has mounted the hugetlbfs filesystem (on some directory
 * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
 * example, the app is requesting memory of size 256MB that is backed by
 * huge pages.
 *
 * For the ia64 architecture, the Linux kernel reserves Region number 4 for
 * hugepages.  That means the addresses starting with 0x800000... will need
 * to be specified.  Specifying a fixed address is not required on ppc64,
 * i386 or x86_64.
 */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

#define FILE_NAME "/mnt/hugepagefile"
#define LENGTH (256UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define FLAGS (MAP_SHARED | MAP_FIXED)
#else
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_SHARED)
#endif

/* Print the first word of the mapping as a quick sanity check. */
void check_bytes(char *addr)
{
	printf("First hex is %x\n", *((unsigned int *)addr));
}

/* Fill the whole mapping with a repeating byte pattern. */
void write_bytes(char *addr)
{
	unsigned long i;

	for (i = 0; i < LENGTH; i++)
		*(addr + i) = (char)i;
}

/* Verify the pattern written by write_bytes, stopping at the first mismatch. */
void read_bytes(char *addr)
{
	unsigned long i;

	check_bytes(addr);
	for (i = 0; i < LENGTH; i++)
		if (*(addr + i) != (char)i) {
			printf("Mismatch at %lu\n", i);
			break;
		}
}

int main(void)
{
	void *addr;
	int fd;

	/* Files created on a hugetlbfs mount are backed by huge pages. */
	fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
	if (fd < 0) {
		perror("Open failed");
		exit(1);
	}

	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		close(fd);
		unlink(FILE_NAME);
		exit(1);
	}

	printf("Returned address is %p\n", addr);
	check_bytes(addr);
	write_bytes(addr);
	read_bytes(addr);

	/* Unmap and remove the file so the huge pages return to the pool. */
	munmap(addr, LENGTH);
	close(fd);
	unlink(FILE_NAME);

	return 0;
}
Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.