Based on kernel version 4.9. Page generated on 2016-12-21 14:28 EST.

2	Control Group v2
4	October, 2015		Tejun Heo <tj@kernel.org>
6	This is the authoritative documentation on the design, interface and
7	conventions of cgroup v2.  It describes all userland-visible aspects
8	of cgroup including core and specific controller behaviors.  All
9	future changes must be reflected in this document.  Documentation for
10	v1 is available under Documentation/cgroup-v1/.
14	1. Introduction
15	  1-1. Terminology
16	  1-2. What is cgroup?
17	2. Basic Operations
18	  2-1. Mounting
19	  2-2. Organizing Processes
20	  2-3. [Un]populated Notification
21	  2-4. Controlling Controllers
22	    2-4-1. Enabling and Disabling
23	    2-4-2. Top-down Constraint
24	    2-4-3. No Internal Process Constraint
25	  2-5. Delegation
26	    2-5-1. Model of Delegation
27	    2-5-2. Delegation Containment
28	  2-6. Guidelines
29	    2-6-1. Organize Once and Control
30	    2-6-2. Avoid Name Collisions
31	3. Resource Distribution Models
32	  3-1. Weights
33	  3-2. Limits
34	  3-3. Protections
35	  3-4. Allocations
36	4. Interface Files
37	  4-1. Format
38	  4-2. Conventions
39	  4-3. Core Interface Files
40	5. Controllers
41	  5-1. CPU
42	    5-1-1. CPU Interface Files
43	  5-2. Memory
44	    5-2-1. Memory Interface Files
45	    5-2-2. Usage Guidelines
46	    5-2-3. Memory Ownership
47	  5-3. IO
48	    5-3-1. IO Interface Files
49	    5-3-2. Writeback
50	6. Namespace
51	  6-1. Basics
52	  6-2. The Root and Views
53	  6-3. Migration and setns(2)
54	  6-4. Interaction with Other Namespaces
55	P. Information on Kernel Programming
56	  P-1. Filesystem Support for Writeback
57	D. Deprecated v1 Core Features
58	R. Issues with v1 and Rationales for v2
59	  R-1. Multiple Hierarchies
60	  R-2. Thread Granularity
61	  R-3. Competition Between Inner Nodes and Threads
62	  R-4. Other Interface Issues
63	  R-5. Controller Issues and Remedies
64	    R-5-1. Memory
67	1. Introduction
69	1-1. Terminology
71	"cgroup" stands for "control group" and is never capitalized.  The
72	singular form is used to designate the whole feature and also as a
73	qualifier as in "cgroup controllers".  When explicitly referring to
74	multiple individual control groups, the plural form "cgroups" is used.
77	1-2. What is cgroup?
79	cgroup is a mechanism to organize processes hierarchically and
80	distribute system resources along the hierarchy in a controlled and
81	configurable manner.
83	cgroup is largely composed of two parts - the core and controllers.
84	cgroup core is primarily responsible for hierarchically organizing
85	processes.  A cgroup controller is usually responsible for
86	distributing a specific type of system resource along the hierarchy
87	although there are utility controllers which serve purposes other than
88	resource distribution.
90	cgroups form a tree structure and every process in the system belongs
91	to one and only one cgroup.  All threads of a process belong to the
92	same cgroup.  On creation, all processes are put in the cgroup that
93	the parent process belongs to at the time.  A process can be migrated
94	to another cgroup.  Migration of a process doesn't affect already
95	existing descendant processes.
97	Following certain structural constraints, controllers may be enabled or
98	disabled selectively on a cgroup.  All controller behaviors are
99	hierarchical - if a controller is enabled on a cgroup, it affects all
100	processes which belong to the cgroups constituting the inclusive
101	sub-hierarchy of the cgroup.  When a controller is enabled on a nested
102	cgroup, it always restricts the resource distribution further.  The
103	restrictions set closer to the root in the hierarchy can not be
104	overridden from further away.
107	2. Basic Operations
109	2-1. Mounting
111	Unlike v1, cgroup v2 has only a single hierarchy.  The cgroup v2
112	hierarchy can be mounted with the following mount command.
114	  # mount -t cgroup2 none $MOUNT_POINT
116	The cgroup2 filesystem has the magic number 0x63677270 ("cgrp").  All
117	controllers which support v2 and are not bound to a v1 hierarchy are
118	automatically bound to the v2 hierarchy and show up at the root.
119	Controllers which are not in active use in the v2 hierarchy can be
120	bound to other hierarchies.  This allows mixing v2 hierarchy with the
121	legacy v1 multiple hierarchies in a fully backward compatible way.
123	A controller can be moved across hierarchies only after the controller
124	is no longer referenced in its current hierarchy.  Because per-cgroup
125	controller states are destroyed asynchronously and controllers may
126	have lingering references, a controller may not show up immediately on
127	the v2 hierarchy after the final umount of the previous hierarchy.
128	Similarly, a controller should be fully disabled to be moved out of
129	the unified hierarchy and it may take some time for the disabled
130	controller to become available for other hierarchies; furthermore, due
131	to inter-controller dependencies, other controllers may need to be
132	disabled too.
134	While useful for development and manual configurations, moving
135	controllers dynamically between the v2 and other hierarchies is
136	strongly discouraged for production use.  It is recommended to decide
137	the hierarchies and controller associations before starting to use the
138	controllers after system boot.
140	During transition to v2, system management software might still
141	automount the v1 cgroup filesystem and so hijack all controllers
142	during boot, before manual intervention is possible. To make testing
143	and experimenting easier, the kernel parameter cgroup_no_v1= allows
144	disabling controllers in v1 and makes them always available in v2.
147	2-2. Organizing Processes
149	Initially, only the root cgroup exists to which all processes belong.
150	A child cgroup can be created by creating a sub-directory.
152	  # mkdir $CGROUP_NAME
154	A given cgroup may have multiple child cgroups forming a tree
155	structure.  Each cgroup has a read-writable interface file
156	"cgroup.procs".  When read, it lists the PIDs of all processes which
157	belong to the cgroup one-per-line.  The PIDs are not ordered and the
158	same PID may show up more than once if the process got moved to
159	another cgroup and then back or the PID got recycled while reading.
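Because "cgroup.procs" may list the same PID more than once and in no
particular order, a reader usually wants to normalize what it read.  The
following is a minimal sketch of that step; the function name and the
sample content are made up for illustration.

```python
def unique_pids(procs_text):
    """Parse "cgroup.procs" content into a sorted list of distinct PIDs."""
    # split() handles the one-per-line layout and ignores blank lines.
    return sorted({int(field) for field in procs_text.split()})


# Hypothetical content in which PID 842 shows up twice.
sample = "842\n17\n842\n"
print(unique_pids(sample))
```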
161	A process can be migrated into a cgroup by writing its PID to the
162	target cgroup's "cgroup.procs" file.  Only one process can be migrated
163	on a single write(2) call.  If a process is composed of multiple
164	threads, writing the PID of any thread migrates all threads of the
165	process.
167	When a process forks a child process, the new process is born into the
168	cgroup that the forking process belongs to at the time of the
169	operation.  After exit, a process stays associated with the cgroup
170	that it belonged to at the time of exit until it's reaped; however, a
171	zombie process does not appear in "cgroup.procs" and thus can't be
172	moved to another cgroup.
174	A cgroup which doesn't have any children or live processes can be
175	destroyed by removing the directory.  Note that a cgroup which doesn't
176	have any children and is associated only with zombie processes is
177	considered empty and can be removed.
179	  # rmdir $CGROUP_NAME
181	"/proc/$PID/cgroup" lists a process's cgroup membership.  If legacy
182	cgroup is in use in the system, this file may contain multiple lines,
183	one for each hierarchy.  The entry for cgroup v2 is always in the
184	format "0::$PATH".
186	  # cat /proc/842/cgroup
187	  ...
188	  0::/test-cgroup/test-cgroup-nested
190	If the process becomes a zombie and the cgroup it was associated with
191	is removed subsequently, " (deleted)" is appended to the path.
193	  # cat /proc/842/cgroup
194	  ...
195	  0::/test-cgroup/test-cgroup-nested (deleted)
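As an illustration of the "0::$PATH" format, the sketch below picks the
cgroup v2 entry out of the text of "/proc/$PID/cgroup" and reports whether
the " (deleted)" suffix is present.  The function name and the sample text
(including the legacy line) are assumptions made for the example, and a
cgroup whose own name ends in " (deleted)" would be misreported.

```python
DELETED_SUFFIX = " (deleted)"


def v2_cgroup_path(proc_cgroup_text):
    """Return (path, deleted) for the cgroup v2 entry, or None if absent."""
    for line in proc_cgroup_text.splitlines():
        # Split into at most three fields so that ':' in a path survives.
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        # The v2 entry has hierarchy ID 0 and an empty controller field.
        if hierarchy_id == "0" and controllers == "":
            if path.endswith(DELETED_SUFFIX):
                return path[:-len(DELETED_SUFFIX)], True
            return path, False
    return None


# Hypothetical file content: one legacy line followed by the v2 line.
sample = "4:memory:/legacy\n0::/test-cgroup/test-cgroup-nested (deleted)\n"
print(v2_cgroup_path(sample))
```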
198	2-3. [Un]populated Notification
200	Each non-root cgroup has a "cgroup.events" file which contains a
201	"populated" field indicating whether the cgroup's sub-hierarchy has
202	live processes in it.  Its value is 0 if there is no live process in
203	the cgroup and its descendants; otherwise, 1.  poll and [id]notify
204	events are triggered when the value changes.  This can be used, for
205	example, to start a clean-up operation after all processes of a given
206	sub-hierarchy have exited.  The populated state updates and
207	notifications are recursive.  Consider the following sub-hierarchy
208	where the numbers in the parentheses represent the numbers of processes
209	in each cgroup.
211	  A(4) - B(0) - C(1)
212	              \ D(0)
214	A, B and C's "populated" fields would be 1 while D's 0.  After the one
215	process in C exits, B and C's "populated" fields would flip to "0" and
216	file modified events will be generated on the "cgroup.events" files of
217	both cgroups.
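The recursive nature of the "populated" state can be modelled with a few
lines of code.  This is only a sketch of the rule stated above, not of how
the kernel tracks it; the dictionaries are a made-up representation of the
A/B/C/D example.

```python
def populated(tree, procs, name):
    """Return 1 if `name` or any of its descendants has live processes."""
    if procs[name] > 0:
        return 1
    children = tree.get(name, [])
    return int(any(populated(tree, procs, child) for child in children))


# The example hierarchy: A(4) - B(0) - C(1)
#                                    \ D(0)
tree = {"A": ["B"], "B": ["C", "D"]}
procs = {"A": 4, "B": 0, "C": 1, "D": 0}

before = {name: populated(tree, procs, name) for name in procs}
procs["C"] = 0  # the one process in C exits
after = {name: populated(tree, procs, name) for name in procs}

# cgroups whose "populated" value changed would see a file modified event.
flipped = sorted(name for name in procs if before[name] != after[name])
print(before, after, flipped)
```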
220	2-4. Controlling Controllers
222	2-4-1. Enabling and Disabling
224	Each cgroup has a "cgroup.controllers" file which lists all
225	controllers available for the cgroup to enable.
227	  # cat cgroup.controllers
228	  cpu io memory
230	No controller is enabled by default.  Controllers can be enabled and
231	disabled by writing to the "cgroup.subtree_control" file.
233	  # echo "+cpu +memory -io" > cgroup.subtree_control
235	Only controllers which are listed in "cgroup.controllers" can be
236	enabled.  When multiple operations are specified as above, either they
237	all succeed or all fail.  If multiple operations on the same controller
238	are specified, the last one is effective.
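The write semantics of "cgroup.subtree_control" described above (the last
operation on a controller wins, and the whole write either succeeds or
fails) can be sketched as follows.  The function is a hypothetical model
and only checks the one rule stated here, namely that a controller must be
listed in "cgroup.controllers" to be enabled.

```python
def apply_subtree_control(enabled, available, request):
    """Return the new enabled set; raise ValueError and change nothing on error."""
    changes = {}
    for token in request.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError(f"malformed token: {token!r}")
        # A later operation on the same controller replaces an earlier one.
        changes[token[1:]] = token[0] == "+"

    # Validate everything first so that a failure leaves `enabled` untouched.
    for name, enable in changes.items():
        if enable and name not in available:
            raise ValueError(f"controller not available: {name}")

    result = set(enabled)
    for name, enable in changes.items():
        if enable:
            result.add(name)
        else:
            result.discard(name)
    return result


print(apply_subtree_control({"io"}, {"cpu", "io", "memory"}, "+cpu +memory -io"))
```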
240	Enabling a controller in a cgroup indicates that the distribution of
241	the target resource across its immediate children will be controlled.
242	Consider the following sub-hierarchy.  The enabled controllers are
243	listed in parentheses.
245	  A(cpu,memory) - B(memory) - C()
246	                            \ D()
248	As A has "cpu" and "memory" enabled, A will control the distribution
249	of CPU cycles and memory to its children, in this case, B.  As B has
250	"memory" enabled but not "cpu", C and D will compete freely on CPU
251	cycles but their division of memory available to B will be controlled.
253	As a controller regulates the distribution of the target resource to
254	the cgroup's children, enabling it creates the controller's interface
255	files in the child cgroups.  In the above example, enabling "cpu" on B
256	would create the "cpu." prefixed controller interface files in C and
257	D.  Likewise, disabling "memory" from B would remove the "memory."
258	prefixed controller interface files from C and D.  This means that the
259	controller interface files - anything which doesn't start with
260	"cgroup." - are owned by the parent rather than the cgroup itself.
263	2-4-2. Top-down Constraint
265	Resources are distributed top-down and a cgroup can further distribute
266	a resource only if the resource has been distributed to it from the
267	parent.  This means that all non-root "cgroup.subtree_control" files
268	can only contain controllers which are enabled in the parent's
269	"cgroup.subtree_control" file.  A controller can be enabled only if
270	the parent has the controller enabled and a controller can't be
271	disabled if one or more children have it enabled.
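The two halves of the top-down constraint can be written as a pair of
predicates.  This is a sketch under the assumption that each cgroup's
"cgroup.subtree_control" content is available as a set; the function names
are invented, and the data reuses the A/B/C/D example from 2-4-1.

```python
def can_enable(parent_enabled, controller):
    """A controller can be enabled only if the parent has it enabled.

    `parent_enabled` is None for the root cgroup, which has no parent.
    """
    return parent_enabled is None or controller in parent_enabled


def can_disable(children_enabled, controller):
    """A controller can't be disabled while any child has it enabled."""
    return not any(controller in child for child in children_enabled)


# A(cpu,memory) - B(memory) - C()
#                           \ D()
subtree_control = {"A": {"cpu", "memory"}, "B": {"memory"}, "C": set(), "D": set()}

print(can_enable(subtree_control["A"], "cpu"))       # B asking for "cpu"
print(can_enable(subtree_control["B"], "cpu"))       # C asking for "cpu"
print(can_disable([subtree_control["B"]], "memory"))  # A dropping "memory"
```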
274	2-4-3. No Internal Process Constraint
276	Non-root cgroups can only distribute resources to their children when
277	they don't have any processes of their own.  In other words, only
278	cgroups which don't contain any processes can have controllers enabled
279	in their "cgroup.subtree_control" files.
281	This guarantees that, when a controller is looking at the part of the
282	hierarchy which has it enabled, processes are always only on the
283	leaves.  This rules out situations where child cgroups compete against
284	internal processes of the parent.
286	The root cgroup is exempt from this restriction.  Root contains
287	processes and anonymous resource consumption which can't be associated
288	with any other cgroups and requires special treatment from most
289	controllers.  How resource consumption in the root cgroup is governed
290	is up to each controller.
292	Note that the restriction doesn't get in the way if there is no
293	enabled controller in the cgroup's "cgroup.subtree_control".  This is
294	important as otherwise it wouldn't be possible to create children of a
295	populated cgroup.  To control resource distribution of a cgroup, the
296	cgroup must create children and transfer all its processes to the
297	children before enabling controllers in its "cgroup.subtree_control"
298	file.
301	2-5. Delegation
303	2-5-1. Model of Delegation
305	A cgroup can be delegated to a less privileged user by granting write
306	access of the directory and its "cgroup.procs" file to the user.  Note
307	that resource control interface files in a given directory control the
308	distribution of the parent's resources and thus must not be delegated
309	along with the directory.
311	Once delegated, the user can build a sub-hierarchy under the directory,
312	organize processes as it sees fit and further distribute the resources
313	it received from the parent.  The limits and other settings of all
314	resource controllers are hierarchical and regardless of what happens
315	in the delegated sub-hierarchy, nothing can escape the resource
316	restrictions imposed by the parent.
318	Currently, cgroup doesn't impose any restrictions on the number of
319	cgroups in or nesting depth of a delegated sub-hierarchy; however,
320	this may be limited explicitly in the future.
323	2-5-2. Delegation Containment
325	A delegated sub-hierarchy is contained in the sense that processes
326	can't be moved into or out of the sub-hierarchy by the delegatee.  For
327	a process with a non-root euid to migrate a target process into a
328	cgroup by writing its PID to the "cgroup.procs" file, the following
329	conditions must be met.
331	- The writer's euid must match either uid or suid of the target process.
333	- The writer must have write access to the "cgroup.procs" file.
335	- The writer must have write access to the "cgroup.procs" file of the
336	  common ancestor of the source and destination cgroups.
338	The above three constraints ensure that while a delegatee may migrate
339	processes around freely in the delegated sub-hierarchy it can't pull
340	in from or push out to outside the sub-hierarchy.
342	For example, let's assume cgroups C0 and C1 have been delegated to
343	user U0 who created C00, C01 under C0 and C10 under C1 as follows and
344	all processes under C0 and C1 belong to U0.
346	  ~~~~~~~~~~~~~ - C0 - C00
347	  ~ cgroup    ~      \ C01
348	  ~ hierarchy ~
349	  ~~~~~~~~~~~~~ - C1 - C10
351	Let's also say U0 wants to write the PID of a process which is
352	currently in C10 into "C00/cgroup.procs".  U0 has write access to the
353	file and uid match on the process; however, the common ancestor of the
354	source cgroup C10 and the destination cgroup C00 is above the points
355	of delegation and U0 would not have write access to its "cgroup.procs"
356	file and thus the write will be denied with -EACCES.
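The common-ancestor check in the example can be sketched as below.  The
model is deliberately simplified: the uid/suid condition is assumed to
hold, write access is represented by a made-up set of cgroup paths, and
the function names are not kernel interfaces.

```python
import posixpath


def common_ancestor(src, dst):
    """Deepest cgroup that contains both the source and the destination."""
    return posixpath.commonpath([src, dst])


def migration_allowed(writable, src, dst):
    """Check the two write-access conditions for a non-root writer.

    `writable` is the set of cgroups whose "cgroup.procs" the writer
    may write to.
    """
    return dst in writable and common_ancestor(src, dst) in writable


# U0 was delegated C0 and C1 and created C00, C01 and C10 underneath.
u0_writable = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}

print(migration_allowed(u0_writable, "/C1/C10", "/C0/C00"))  # crosses the root
print(migration_allowed(u0_writable, "/C0/C00", "/C0/C01"))  # stays under C0
```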
359	2-6. Guidelines
361	2-6-1. Organize Once and Control
363	Migrating a process across cgroups is a relatively expensive operation
364	and stateful resources such as memory are not moved together with the
365	process.  This is an explicit design decision as there often exist
366	inherent trade-offs between migration and various hot paths in terms
367	of synchronization cost.
369	As such, migrating processes across cgroups frequently as a means to
370	apply different resource restrictions is discouraged.  A workload
371	should be assigned to a cgroup according to the system's logical and
372	resource structure once on start-up.  Dynamic adjustments to resource
373	distribution can be made by changing controller configuration through
374	the interface files.
377	2-6-2. Avoid Name Collisions
379	Interface files for a cgroup and its children cgroups occupy the same
380	directory and it is possible to create children cgroups which collide
381	with interface files.
383	All cgroup core interface files are prefixed with "cgroup." and each
384	controller's interface files are prefixed with the controller name and
385	a dot.  A controller's name is composed of lower case letters and
386	'_'s but never begins with an '_' so it can be used as the prefix
387	character for collision avoidance.  Also, interface file names won't
388	start or end with terms which are often used in categorizing workloads
389	such as job, service, slice, unit or workload.
391	cgroup doesn't do anything to prevent name collisions and it's the
392	user's responsibility to avoid them.
395	3. Resource Distribution Models
397	cgroup controllers implement several resource distribution schemes
398	depending on the resource type and expected use cases.  This section
399	describes major schemes in use along with their expected behaviors.
402	3-1. Weights
404	A parent's resource is distributed by adding up the weights of all
405	active children and giving each the fraction matching the ratio of its
406	weight against the sum.  As only children which can make use of the
407	resource at the moment participate in the distribution, this is
408	work-conserving.  Due to the dynamic nature, this model is usually
409	used for stateless resources.
411	All weights are in the range [1, 10000] with the default at 100.  This
412	allows symmetric multiplicative biases in both directions at fine
413	enough granularity while staying in the intuitive range.
415	As long as the weight is in range, all configuration combinations are
416	valid and there is no reason to reject configuration changes or
417	process migrations.
419	"cpu.weight" proportionally distributes CPU cycles to active children
420	and is an example of this type.
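The weight model reduces to simple arithmetic: each active child receives
its weight divided by the sum of the weights of all active children.  The
sketch below illustrates that ratio only; the function name and the sample
weights are made up, and real controllers apply the ratio in their own
scheduling terms.

```python
def weight_shares(weights, active):
    """Return the fraction of the parent's resource for each active child."""
    for name in active:
        if not 1 <= weights[name] <= 10000:
            raise ValueError(f"weight out of range for {name}")
    total = sum(weights[name] for name in active)
    # Inactive children do not take part, which keeps the model
    # work-conserving.
    return {name: weights[name] / total for name in active}


weights = {"a": 100, "b": 200, "c": 100}
print(weight_shares(weights, ["a", "b", "c"]))  # all three are active
print(weight_shares(weights, ["a", "b"]))       # "c" is idle
```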
423	3-2. Limits
425	A child can only consume up to the configured amount of the resource.
426	Limits can be over-committed - the sum of the limits of children can
427	exceed the amount of resource available to the parent.
429	Limits are in the range [0, max] and default to "max", which is a noop.
431	As limits can be over-committed, all configuration combinations are
432	valid and there is no reason to reject configuration changes or
433	process migrations.
435	"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
436	on an IO device and is an example of this type.
439	3-3. Protections
441	A cgroup is protected to be allocated up to the configured amount of
442	the resource if the usages of all its ancestors are under their
443	protected levels.  Protections can be hard guarantees or best effort
444	soft boundaries.  Protections can also be over-committed in which case
445	only up to the amount available to the parent is protected among
446	children.
448	Protections are in the range [0, max] and default to 0, which is a
449	noop.
451	As protections can be over-committed, all configuration combinations
452	are valid and there is no reason to reject configuration changes or
453	process migrations.
455	"memory.low" implements best-effort memory protection and is an
456	example of this type.
459	3-4. Allocations
461	A cgroup is exclusively allocated a certain amount of a finite
462	resource.  Allocations can't be over-committed - the sum of the
463	allocations of children can not exceed the amount of resource
464	available to the parent.
466	Allocations are in the range [0, max] and default to 0, which is no
467	resource.
469	As allocations can't be over-committed, some configuration
470	combinations are invalid and should be rejected.  Also, if the
471	resource is mandatory for execution of processes, process migrations
472	may be rejected.
474	"cpu.rt.max" hard-allocates realtime slices and is an example of this
475	type.
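Unlike limits and protections, an allocation change has to be validated
against the siblings.  A minimal sketch of that check, with a hypothetical
function name and made-up numbers, could look like this:

```python
def allocation_fits(parent_available, child_allocations, name, new_value):
    """Return True if setting `name` to `new_value` avoids over-commit."""
    proposed = dict(child_allocations)
    proposed[name] = new_value
    # The sum of the children's allocations must not exceed what the
    # parent has available.
    return sum(proposed.values()) <= parent_available


children = {"a": 40, "b": 30}
print(allocation_fits(100, children, "b", 60))  # 40 + 60 fits in 100
print(allocation_fits(100, children, "b", 61))  # 40 + 61 does not
```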
478	4. Interface Files
480	4-1. Format
482	All interface files should be in one of the following formats whenever
483	possible.
485	  New-line separated values
486	  (when only one value can be written at once)
488		VAL0\n
489		VAL1\n
490		...
492	  Space separated values
493	  (when read-only or multiple values can be written at once)
495		VAL0 VAL1 ...\n
497	  Flat keyed
499		KEY0 VAL0\n
500		KEY1 VAL1\n
501		...
503	  Nested keyed
505		KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
506		KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
507		...
509	For a writable file, the format for writing should generally match
510	reading; however, controllers may allow omitting later fields or
511	implement restricted shortcuts for most common use cases.
513	For both flat and nested keyed files, only the values for a single key
514	can be written at a time.  For nested keyed files, the sub key pairs
515	may be specified in any order and not all pairs have to be specified.
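The two keyed formats are straightforward to parse.  The following sketch
shows one way to do it; the function names are invented and the sample
strings reuse the placeholder keys from the format descriptions above
rather than the content of any real interface file.

```python
def parse_flat_keyed(text):
    """Parse "KEY VAL" lines into a dict of strings."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split(None, 1)
            result[key] = value
    return result


def parse_nested_keyed(text):
    """Parse "KEY SUB_KEY=VAL ..." lines into a dict of dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            # Sub key pairs may come in any order, so key them by name.
            result[fields[0]] = dict(f.split("=", 1) for f in fields[1:])
    return result


print(parse_flat_keyed("KEY0 1\nKEY1 max\n"))
print(parse_nested_keyed("KEY0 SUB_KEY0=1 SUB_KEY1=2\nKEY1 SUB_KEY1=max\n"))
```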
518	4-2. Conventions
520	- Settings for a single feature should be contained in a single file.
522	- The root cgroup should be exempt from resource control and thus
523	  shouldn't have resource control interface files.  Also,
524	  informational files on the root cgroup which end up showing global
525	  information available elsewhere shouldn't exist.
527	- If a controller implements weight based resource distribution, its
528	  interface file should be named "weight" and have the range [1,
529	  10000] with 100 as the default.  The values are chosen to allow
530	  enough and symmetric bias in both directions while keeping it
531	  intuitive (the default is 100%).
533	- If a controller implements an absolute resource guarantee and/or
534	  limit, the interface files should be named "min" and "max"
535	  respectively.  If a controller implements best effort resource
536	  guarantee and/or limit, the interface files should be named "low"
537	  and "high" respectively.
539	  In the above four control files, the special token "max" should be
540	  used to represent upward infinity for both reading and writing.
542	- If a setting has a configurable default value and keyed specific
543	  overrides, the default entry should be keyed with "default" and
544	  appear as the first entry in the file.
546	  The default value can be updated by writing either "default $VAL" or
547	  "$VAL".
549	  When writing to update a specific override, "default" can be used as
550	  the value to indicate removal of the override.  Override entries
551	  with "default" as the value must not appear when read.
553	  For example, a setting which is keyed by major:minor device numbers
554	  with integer values may look like the following.
556	    # cat cgroup-example-interface-file
557	    default 150
558	    8:0 300
560	  The default value can be updated by
562	    # echo 125 > cgroup-example-interface-file
564	  or
566	    # echo "default 125" > cgroup-example-interface-file
568	  An override can be set by
570	    # echo "8:16 170" > cgroup-example-interface-file
572	  and cleared by
574	    # echo "8:0 default" > cgroup-example-interface-file
575	    # cat cgroup-example-interface-file
576	    default 125
577	    8:16 170
579	- For events which are not very high frequency, an interface file
580	  "events" should be created which lists event key value pairs.
581	  Whenever a notifiable event happens, a file modified event should be
582	  generated on the file.
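The "default" convention for keyed overrides can be modelled in a few
lines.  The sketch below replays the cgroup-example-interface-file
commands from the convention above against an in-memory dict; the function
names are made up and no real interface file is involved.

```python
def write_setting(state, data):
    """Apply one write following the "default" convention."""
    fields = data.split()
    if len(fields) == 1:
        state["default"] = fields[0]        # "$VAL" updates the default
    elif len(fields) == 2 and fields[0] == "default":
        if fields[1] == "default":
            raise ValueError("the default entry cannot be removed")
        state["default"] = fields[1]        # "default $VAL"
    elif len(fields) == 2 and fields[1] == "default":
        state.pop(fields[0], None)          # "$KEY default" drops the override
    elif len(fields) == 2:
        state[fields[0]] = fields[1]        # "$KEY $VAL" sets an override
    else:
        raise ValueError(f"malformed write: {data!r}")


def read_setting(state):
    """Render the file with the default entry first."""
    lines = [f"default {state['default']}"]
    lines += [f"{key} {value}" for key, value in state.items() if key != "default"]
    return "\n".join(lines) + "\n"


state = {"default": "150", "8:0": "300"}
for data in ("125", "8:16 170", "8:0 default"):
    write_setting(state, data)
print(read_setting(state), end="")
```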
585	4-3. Core Interface Files
587	All cgroup core files are prefixed with "cgroup."
589	  cgroup.procs
591		A read-write new-line separated values file which exists on
592		all cgroups.
594		When read, it lists the PIDs of all processes which belong to
595		the cgroup one-per-line.  The PIDs are not ordered and the
596		same PID may show up more than once if the process got moved
597		to another cgroup and then back or the PID got recycled while
598		reading.
600		A PID can be written to migrate the process associated with
601		the PID to the cgroup.  The writer should match all of the
602		following conditions.
604		- Its euid must either be root or match either the uid or suid of
605		  the target process.
607		- It must have write access to the "cgroup.procs" file.
609		- It must have write access to the "cgroup.procs" file of the
610		  common ancestor of the source and destination cgroups.
612		When delegating a sub-hierarchy, write access to this file
613		should be granted along with the containing directory.
615	  cgroup.controllers
617		A read-only space separated values file which exists on all
618		cgroups.
620		It shows a space separated list of all controllers available to
621		the cgroup.  The controllers are not ordered.
623	  cgroup.subtree_control
625		A read-write space separated values file which exists on all
626		cgroups.  Starts out empty.
628		When read, it shows a space separated list of the controllers
629		which are enabled to control resource distribution from the
630		cgroup to its children.
632		A space separated list of controllers prefixed with '+' or '-'
633		can be written to enable or disable controllers.  A controller
634		name prefixed with '+' enables the controller and '-'
635		disables.  If a controller appears more than once on the list,
636		the last one is effective.  When multiple enable and disable
637		operations are specified, either all succeed or all fail.
639	  cgroup.events
641		A read-only flat-keyed file which exists on non-root cgroups.
642		The following entries are defined.  Unless specified
643		otherwise, a value change in this file generates a file
644		modified event.
646		  populated
648			1 if the cgroup or its descendants contain any live
649			processes; otherwise, 0.
652	5. Controllers
654	5-1. CPU
656	[NOTE: The interface for the cpu controller hasn't been merged yet]
658	The "cpu" controller regulates distribution of CPU cycles.  This
659	controller implements weight and absolute bandwidth limit models for
660	normal scheduling policy and absolute bandwidth allocation model for
661	realtime scheduling policy.
664	5-1-1. CPU Interface Files
666	All time durations are in microseconds.
668	  cpu.stat
670		A read-only flat-keyed file which exists on non-root cgroups.
672		It reports the following six stats.
674		  usage_usec
675		  user_usec
676		  system_usec
677		  nr_periods
678		  nr_throttled
679		  throttled_usec
681	  cpu.weight
683		A read-write single value file which exists on non-root
684		cgroups.  The default is "100".
686		The weight in the range [1, 10000].
688	  cpu.max
690		A read-write two value file which exists on non-root cgroups.
691		The default is "max 100000".
693		The maximum bandwidth limit.  It's in the following format.
695		  $MAX $PERIOD
697		which indicates that the group may consume up to $MAX in each
698		$PERIOD duration.  "max" for $MAX indicates no limit.  If only
699		one number is written, $MAX is updated.
701	  cpu.rt.max
703	  [NOTE: The semantics of this file are still under discussion and the
704	   interface hasn't been merged yet]
706		A read-write two value file which exists on all cgroups.
707		The default is "0 100000".
709		The maximum realtime runtime allocation.  Over-committing
710		configurations are disallowed and process migrations are
711		rejected if not enough bandwidth is available.  It's in the
712		following format.
714		  $MAX $PERIOD
716		which indicates that the group may consume up to $MAX in each
717		$PERIOD duration.  If only one number is written, $MAX is
718		updated.
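The "$MAX $PERIOD" format used by "cpu.max" can be handled as sketched
below.  The helpers are hypothetical, and since the cpu controller
interface is noted above as not yet merged, this is only a model of the
format described here: "max" means no limit, and writing a single number
updates $MAX while keeping $PERIOD.

```python
def parse_cpu_max(text):
    """Return (quota_usec or None for no limit, period_usec)."""
    max_field, period_field = text.split()
    quota = None if max_field == "max" else int(max_field)
    return quota, int(period_field)


def cpu_fraction(text):
    """Ratio of $MAX to $PERIOD, or None when there is no limit."""
    quota, period = parse_cpu_max(text)
    return None if quota is None else quota / period


def write_cpu_max(current, written):
    """Model a write: a single number only replaces $MAX."""
    fields = written.split()
    period = current.split()[1]
    if len(fields) == 1:
        return f"{fields[0]} {period}"
    return f"{fields[0]} {fields[1]}"


print(parse_cpu_max("max 100000"))
print(cpu_fraction("50000 100000"))
print(write_cpu_max("max 100000", "20000"))
```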
721	5-2. Memory
723	The "memory" controller regulates distribution of memory.  Memory is
724	stateful and implements both limit and protection models.  Due to the
725	intertwining between memory usage and reclaim pressure and the
726	stateful nature of memory, the distribution model is relatively
727	complex.
729	While not completely water-tight, all major memory usages by a given
730	cgroup are tracked so that the total memory consumption can be
731	accounted and controlled to a reasonable extent.  Currently, the
732	following types of memory usages are tracked.
734	- Userland memory - page cache and anonymous memory.
736	- Kernel data structures such as dentries and inodes.
738	- TCP socket buffers.
740	The above list may expand in the future for better coverage.
743	5-2-1. Memory Interface Files
745	All memory amounts are in bytes.  If a value which is not aligned to
746	PAGE_SIZE is written, the value may be rounded up to the closest
747	PAGE_SIZE multiple when read back.
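The rounding mentioned above is ordinary round-up-to-multiple arithmetic.
In the sketch below the page size of 4096 bytes is an assumption chosen
for illustration; the actual PAGE_SIZE depends on the architecture and
kernel configuration.

```python
def round_up_to_page(value, page_size=4096):
    """Round `value` up to the closest multiple of `page_size`."""
    # Ceiling division using integers only, then scale back up.
    return -(-value // page_size) * page_size


for written in (0, 1, 4096, 4097, 1000000):
    print(written, round_up_to_page(written))
```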
749	  memory.current
751		A read-only single value file which exists on non-root
752		cgroups.
754		The total amount of memory currently being used by the cgroup
755		and its descendants.
757	  memory.low
759		A read-write single value file which exists on non-root
760		cgroups.  The default is "0".
762		Best-effort memory protection.  If the memory usages of a
763		cgroup and all its ancestors are below their low boundaries,
764		the cgroup's memory won't be reclaimed unless memory can be
765		reclaimed from unprotected cgroups.
767		Putting more memory than generally available under this
768		protection is discouraged.
770	  memory.high
772		A read-write single value file which exists on non-root
773		cgroups.  The default is "max".
775		Memory usage throttle limit.  This is the main mechanism to
776		control memory usage of a cgroup.  If a cgroup's usage goes
777		over the high boundary, the processes of the cgroup are
778		throttled and put under heavy reclaim pressure.
780		Going over the high limit never invokes the OOM killer and
781		under extreme conditions the limit may be breached.
783	  memory.max
785		A read-write single value file which exists on non-root
786		cgroups.  The default is "max".
788		Memory usage hard limit.  This is the final protection
789		mechanism.  If a cgroup's memory usage reaches this limit and
790		can't be reduced, the OOM killer is invoked in the cgroup.
791		Under certain circumstances, the usage may go over the limit
792		temporarily.
794		This is the ultimate protection mechanism.  As long as the
795		high limit is used and monitored properly, this limit's
796		utility is limited to providing the final safety net.
798	  memory.events
800		A read-only flat-keyed file which exists on non-root cgroups.
801		The following entries are defined.  Unless specified
802		otherwise, a value change in this file generates a file
803		modified event.
805		  low
807			The number of times the cgroup is reclaimed due to
808			high memory pressure even though its usage is under
809			the low boundary.  This usually indicates that the low
810			boundary is over-committed.
812		  high
814			The number of times processes of the cgroup are
815			throttled and routed to perform direct memory reclaim
816			because the high memory boundary was exceeded.  For a
817			cgroup whose memory usage is capped by the high limit
818			rather than global memory pressure, this event's
819			occurrences are expected.
821		  max
823			The number of times the cgroup's memory usage was
824			about to go over the max boundary.  If direct reclaim
825			fails to bring it down, the OOM killer is invoked.
827		  oom
829			The number of times the OOM killer has been invoked in
830			the cgroup.  This may not exactly match the number of
831			processes killed but should generally be close.
833	  memory.stat
835		A read-only flat-keyed file which exists on non-root cgroups.
837		This breaks down the cgroup's memory footprint into different
838		types of memory, type-specific details, and other information
839		on the state and past events of the memory management system.
841		All memory amounts are in bytes.
843		The entries are ordered to be human readable, and new entries
844		can show up in the middle. Don't rely on items remaining in a
845		fixed position; use the keys to look up specific values!
847		  anon
849			Amount of memory used in anonymous mappings such as
850			brk(), sbrk(), and mmap(MAP_ANONYMOUS)
852		  file
854			Amount of memory used to cache filesystem data,
855			including tmpfs and shared memory.
857		  kernel_stack
859			Amount of memory allocated to kernel stacks.
861		  slab
863			Amount of memory used for storing in-kernel data
864			structures.
866		  sock
868			Amount of memory used in network transmission buffers
870		  file_mapped
872			Amount of cached filesystem data mapped with mmap()
874		  file_dirty
876			Amount of cached filesystem data that was modified but
877			not yet written back to disk
879		  file_writeback
881			Amount of cached filesystem data that was modified and
882			is currently being written back to disk
884		  inactive_anon
885		  active_anon
886		  inactive_file
887		  active_file
888		  unevictable
890			Amount of memory, swap-backed and filesystem-backed,
891			on the internal memory management lists used by the
892			page reclaim algorithm
894		  slab_reclaimable
896			Part of "slab" that might be reclaimed, such as
897			dentries and inodes.
899		  slab_unreclaimable
901			Part of "slab" that cannot be reclaimed on memory
902			pressure.
904		  pgfault
906			Total number of page faults incurred
908		  pgmajfault
910			Number of major page faults incurred
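Because entries may be added in the middle, a reader should select
values by key.  The Python sketch below shows one way to do that; the
sample contents and the key "new_entry" are made up for illustration
and do not correspond to a real memory.stat entry.

```python
def read_stat(text, wanted):
    """Return {key: value} for the requested keys, looked up by name."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in wanted:
            values[fields[0]] = int(fields[1])
    return values


# Hypothetical excerpts; the second has an extra entry in the middle,
# as a newer kernel might add.
STAT_A = "anon 1048576\nfile 4194304\nslab 262144\n"
STAT_B = "anon 1048576\nfile 4194304\nnew_entry 0\nslab 262144\n"

a = read_stat(STAT_A, {"anon", "slab"})
b = read_stat(STAT_B, {"anon", "slab"})
print(a == b)
```

Both reads yield the same result even though "slab" moved from the
third line to the fourth.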
912	  memory.swap.current
914		A read-only single value file which exists on non-root
915		cgroups.
917		The total amount of swap currently being used by the cgroup
918		and its descendants.
920	  memory.swap.max
922		A read-write single value file which exists on non-root
923		cgroups.  The default is "max".
925		Swap usage hard limit.  If a cgroup's swap usage reaches this
926		limit, anonymous memory of the cgroup will not be swapped out.
929	5-2-2. Usage Guidelines
931	"memory.high" is the main mechanism to control memory usage.
932	Over-committing on high limits (sum of high limits > available memory)
933	and letting global memory pressure distribute memory according to
934	usage is a viable strategy.
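Whether a set of high limits is over-committed can be checked with
simple arithmetic.  The following Python sketch assumes the limits
have already been read from each cgroup's memory.high; the cgroup
names and sizes are hypothetical.

```python
def high_overcommit_ratio(high_limits, available_bytes):
    """Sum of finite memory.high limits divided by available memory.

    Limits given as "max" (no limit) are skipped, which is a
    simplification of this sketch: such a cgroup is not bounded at all.
    """
    total = sum(int(v) for v in high_limits.values() if v != "max")
    return total / available_bytes


GIB = 1 << 30
# Hypothetical memory.high values, as strings read from the files.
limits = {"web": str(6 * GIB), "batch": str(4 * GIB), "misc": "max"}
ratio = high_overcommit_ratio(limits, 8 * GIB)
print(f"{ratio:.2f}")
```

A ratio above 1 means the high limits are over-committed and global
memory pressure decides the actual distribution.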
936	Because breach of the high limit doesn't trigger the OOM killer but
937	throttles the offending cgroup, a management agent has ample
938	opportunities to monitor and take appropriate actions such as granting
939	more memory or terminating the workload.
941	Determining whether a cgroup has enough memory is not trivial as
942	memory usage doesn't indicate whether the workload can benefit from
943	more memory.  For example, a workload which writes data received from
944	the network to a file can use all available memory but can also perform
945	just as well with a small amount of memory.  A measure of memory
946	pressure - how much the workload is being impacted due to lack of
947	memory - is necessary to determine whether a workload needs more
948	memory; unfortunately, a memory pressure monitoring mechanism isn't
949	implemented yet.
952	5-2-3. Memory Ownership
954	A memory area is charged to the cgroup which instantiated it and stays
955	charged to the cgroup until the area is released.  Migrating a process
956	to a different cgroup doesn't move the memory usages that it
957	instantiated while in the previous cgroup to the new cgroup.
959	A memory area may be used by processes belonging to different cgroups.
960	To which cgroup the area will be charged is indeterminate; however,
961	over time, the memory area is likely to end up in a cgroup which has
962	enough memory allowance to avoid high reclaim pressure.
964	If a cgroup sweeps a considerable amount of memory which is expected
965	to be accessed repeatedly by other cgroups, it may make sense to use
966	POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
967	belonging to the affected files to ensure correct memory ownership.
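A minimal Python sketch of relinquishing cached pages of a file with
POSIX_FADV_DONTNEED follows.  The helper name drop_file_cache and the
scratch file are illustrative only.  Dirty data is flushed first
because DONTNEED does not discard pages that still await writeback.

```python
import os
import tempfile


def drop_file_cache(path):
    """Ask the kernel to drop the cached pages of the file at `path`.

    offset=0 and length=0 select the whole file.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # write back dirty pages so that they can be dropped
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


# Demonstration on a scratch file; a real caller would pass the files
# that the sweeping cgroup has just read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
    scratch = tmp.name
drop_file_cache(scratch)
os.unlink(scratch)
```

The advice is only a hint; pages that are mapped or otherwise in use
may stay cached and therefore stay charged.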
970	5-3. IO
972	The "io" controller regulates the distribution of IO resources.  This
973	controller implements both weight based and absolute bandwidth or IOPS
974	limit distribution; however, weight based distribution is available
975	only if cfq-iosched is in use and neither scheme is available for
976	blk-mq devices.
979	5-3-1. IO Interface Files
981	  io.stat
983		A read-only nested-keyed file which exists on non-root
984		cgroups.
986		Lines are keyed by $MAJ:$MIN device numbers and not ordered.
987		The following nested keys are defined.
989		  rbytes	Bytes read
990		  wbytes	Bytes written
991		  rios		Number of read IOs
992		  wios		Number of write IOs
994		An example read output follows.
996		  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
997		  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
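The nested-keyed format can be parsed as sketched below in Python.
The function name parse_nested_keyed is hypothetical; the sample is
the example output shown above.

```python
def parse_nested_keyed(text):
    """Parse nested-keyed content into {device: {key: int}}."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        device, pairs = fields[0], fields[1:]
        devices[device] = {
            key: int(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return devices


SAMPLE = (
    "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353\n"
    "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252\n"
)

stats = parse_nested_keyed(SAMPLE)
total_written = sum(dev["wbytes"] for dev in stats.values())
print(stats["8:16"]["rios"], total_written)
```

Values are looked up by device and key, so the unordered lines and
any nested keys added later do not affect the reader.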
999	  io.weight
1001		A read-write flat-keyed file which exists on non-root cgroups.
1002		The default is "default 100".
1004		The first line is the default weight applied to devices
1005		without specific override.  The rest are overrides keyed by
1006		$MAJ:$MIN device numbers and not ordered.  The weights are in
1007		the range [1, 10000] and specify the relative amount of IO time
1008		the cgroup can use in relation to its siblings.
1010		The default weight can be updated by writing either "default
1011		$WEIGHT" or simply "$WEIGHT".  Overrides can be set by writing
1012		"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
1014		An example read output follows.
1016		  default 100
1017		  8:16 200
1018		  8:0 50
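The write formats and the fallback to the default line can be
sketched in Python as follows.  Both helper names are hypothetical,
and the sketch only builds and interprets strings; it does not touch
the cgroup filesystem.

```python
def weight_command(weight, device=None):
    """Return the string to write to io.weight.

    `device` is a "MAJ:MIN" string for an override, or None for the
    default weight.  `weight` is an integer in [1, 10000], or the
    string "default" to unset an override.
    """
    if weight == "default":
        if device is None:
            raise ValueError("only a device override can be unset")
        return f"{device} default"
    if not 1 <= int(weight) <= 10000:
        raise ValueError("weight must be in the range [1, 10000]")
    target = "default" if device is None else device
    return f"{target} {int(weight)}"


def effective_weight(text, device):
    """Weight that applies to `device` given the content of io.weight."""
    table = dict(line.split() for line in text.splitlines() if line.strip())
    return int(table.get(device, table["default"]))


SAMPLE = "default 100\n8:16 200\n8:0 50\n"
print(weight_command(200, "8:16"), "|", effective_weight(SAMPLE, "8:32"))
```

Device 8:32 has no override in the sample, so the default weight of
100 applies to it.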
1020	  io.max
1022		A read-write nested-keyed file which exists on non-root
1023		cgroups.
1025		BPS and IOPS based IO limit.  Lines are keyed by $MAJ:$MIN
1026		device numbers and not ordered.  The following nested keys are
1027		defined.
1029		  rbps		Max read bytes per second
1030		  wbps		Max write bytes per second
1031		  riops		Max read IO operations per second
1032		  wiops		Max write IO operations per second
1034		When writing, any number of nested key-value pairs can be
1035		specified in any order.  "max" can be specified as the value
1036		to remove a specific limit.  If the same key is specified
1037		multiple times, the outcome is undefined.
1039		BPS and IOPS are measured in each IO direction and IOs are
1040		delayed if the limit is reached.  Temporary bursts are allowed.
1042		Setting read limit at 2M BPS and write at 120 IOPS for 8:16.
1044		  echo "8:16 rbps=2097152 wiops=120" > io.max
1046		Reading returns the following.
1048		  8:16 rbps=2097152 wbps=max riops=max wiops=120
1050		Write IOPS limit can be removed by writing the following.
1052		  echo "8:16 wiops=max" > io.max
1054		Reading now returns the following.
1056		  8:16 rbps=2097152 wbps=max riops=max wiops=max
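The update semantics shown above, where a write changes only the keys
it names, can be modelled in user space with the Python sketch below.
This is an illustration of the described behavior, not kernel code,
and the function names are hypothetical.

```python
KEYS = ("rbps", "wbps", "riops", "wiops")


def apply_io_max_write(limits, line):
    """Apply one "MAJ:MIN key=value ..." write to per-device limits."""
    device, *pairs = line.split()
    # Keys that have never been set start out as "max" (no limit).
    current = dict(limits.get(device, dict.fromkeys(KEYS, "max")))
    for pair in pairs:
        key, value = pair.split("=", 1)
        if key not in KEYS:
            raise ValueError(f"unknown key: {key}")
        current[key] = value
    updated = dict(limits)
    updated[device] = current
    return updated


def format_io_max(limits):
    """Render the limits in the layout of the read examples above."""
    return "\n".join(
        f"{device} " + " ".join(f"{key}={values[key]}" for key in KEYS)
        for device, values in limits.items()
    )


state = apply_io_max_write({}, "8:16 rbps=2097152 wiops=120")
first = format_io_max(state)
state = apply_io_max_write(state, "8:16 wiops=max")
second = format_io_max(state)
print(first)
print(second)
```

A key repeated within one write resolves to the last value here,
which is a choice of this sketch; the interface leaves that case
undefined.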
1059	5-3-2. Writeback
1061	Page cache is dirtied through buffered writes and shared mmaps and
1062	written asynchronously to the backing filesystem by the writeback
1063	mechanism.  Writeback sits between the memory and IO domains and
1064	regulates the proportion of dirty memory by balancing dirtying and
1065	write IOs.
1067	The io controller, in conjunction with the memory controller,
1068	implements control of page cache writeback IOs.  The memory controller
1069	defines the memory domain for which the dirty memory ratio is calculated
1070	and maintained, and the io controller defines the io domain which
1071	writes out dirty pages for the memory domain.  Both system-wide and
1072	per-cgroup dirty memory states are examined and the more restrictive
1073	of the two is enforced.
1075	cgroup writeback requires explicit support from the underlying
1076	filesystem.  Currently, cgroup writeback is implemented on ext2, ext4
1077	and btrfs.  On other filesystems, all writeback IOs are attributed to
1078	the root cgroup.
1080	There are inherent differences in memory and writeback management
1081	which affect how cgroup ownership is tracked.  Memory is tracked per
1082	page while writeback is tracked per inode.  For the purpose of writeback, an
1083	inode is assigned to a cgroup and all IO requests to write dirty pages
1084	from the inode are attributed to that cgroup.
1086	As cgroup ownership for memory is tracked per page, there can be pages
1087	which are associated with different cgroups than the one the inode is
1088	associated with.  These are called foreign pages.  The writeback
1089	constantly keeps track of foreign pages and, if a particular foreign
1090	cgroup becomes the majority over a certain period of time, switches
1091	the ownership of the inode to that cgroup.
1093	While this model is enough for most use cases where a given inode is
1094	mostly dirtied by a single cgroup even when the main writing cgroup
1095	changes over time, use cases where multiple cgroups write to a single
1096	inode simultaneously are not supported well.  In such circumstances, a
1097	significant portion of IOs are likely to be attributed incorrectly.
1098	As the memory controller assigns page ownership on the first use and
1099	doesn't update it until the page is released, even if writeback
1100	strictly follows page ownership, multiple cgroups dirtying overlapping
1101	areas wouldn't work as expected.  It's recommended to avoid such usage
1102	patterns.
1104	The sysctl knobs which affect writeback behavior are applied to cgroup
1105	writeback as follows.
1107	  vm.dirty_background_ratio
1108	  vm.dirty_ratio
1110		These ratios apply the same to cgroup writeback with the
1111		amount of available memory capped by limits imposed by the
1112		memory controller and system-wide clean memory.
1114	  vm.dirty_background_bytes
1115	  vm.dirty_bytes
1117		For cgroup writeback, this is calculated into ratio against
1118		total available memory and applied the same way as
1119		vm.dirty[_background]_ratio.
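The arithmetic implied by the two descriptions above can be sketched
in Python as follows.  This is a simplified reading of the text, not
the kernel's implementation; the function names, the parameters and
the sample sizes are assumptions made for illustration.

```python
def bytes_to_ratio_percent(dirty_bytes, total_available):
    """Express a vm.dirty*_bytes setting as a percentage of total
    available memory."""
    return 100 * dirty_bytes / total_available


def cgroup_dirty_threshold(ratio_percent, cgroup_available, system_clean):
    """Dirty threshold in bytes for one cgroup.

    cgroup_available: memory usable under the memory controller limits.
    system_clean: clean memory available system-wide.
    """
    capped = min(cgroup_available, system_clean)
    return int(capped * ratio_percent / 100)


GIB = 1 << 30
# Hypothetical system: 16G available, vm.dirty_bytes set to 2G.
ratio = bytes_to_ratio_percent(2 * GIB, 16 * GIB)
# Hypothetical cgroup: 4G usable, with 8G of clean memory system-wide.
threshold = cgroup_dirty_threshold(ratio, 4 * GIB, 8 * GIB)
print(ratio, threshold)
```

With these numbers the byte setting corresponds to 12.5% and the
cgroup may hold 512M of dirty memory before it is throttled.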
1122	6. Namespace
1124	6-1. Basics
1126	cgroup namespace provides a mechanism to virtualize the view of the
1127	"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
1128	flag can be used with clone(2) and unshare(2) to create a new cgroup
1129	namespace.  The process running inside the cgroup namespace will have
1130	its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
1131	cgroupns root is the cgroup of the process at the time of creation of
1132	the cgroup namespace.
1134	Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
1135	complete path of the cgroup of a process.  In a container setup where
1136	a set of cgroups and namespaces are intended to isolate processes, the
1137	"/proc/$PID/cgroup" file may leak potential system level information
1138	to the isolated processes.  For example:
1140	  # cat /proc/self/cgroup
1141	  0::/batchjobs/container_id1
1143	The path '/batchjobs/container_id1' can be considered as system-data
1144	and undesirable to expose to the isolated processes.  cgroup namespace
1145	can be used to restrict visibility of this path.  For example, before
1146	creating a cgroup namespace, one would see:
1148	  # ls -l /proc/self/ns/cgroup
1149	  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
1150	  # cat /proc/self/cgroup
1151	  0::/batchjobs/container_id1
1153	After unsharing a new namespace, the view changes.
1155	  # ls -l /proc/self/ns/cgroup
1156	  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
1157	  # cat /proc/self/cgroup
1158	  0::/
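Each line of "/proc/$PID/cgroup" has the form
"hierarchy-ID:controller-list:path", and the v2 entry is the one with
hierarchy ID 0 and an empty controller list.  A Python sketch of
extracting the v2 path from the outputs above follows; the function
names are hypothetical.

```python
def parse_proc_cgroup(text):
    """Parse /proc/$PID/cgroup content into
    (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def v2_path(text):
    """Return the cgroup v2 path, or None if there is no v2 entry."""
    for hierarchy_id, controllers, path in parse_proc_cgroup(text):
        if hierarchy_id == 0 and not controllers:
            return path
    return None


BEFORE = "0::/batchjobs/container_id1\n"  # before unsharing
AFTER = "0::/\n"                          # inside the new namespace
print(v2_path(BEFORE), v2_path(AFTER))
```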
1160	When some thread from a multi-threaded process unshares its cgroup
1161	namespace, the new cgroupns gets applied to the entire process (all
1162	the threads).  This is natural for the v2 hierarchy; however, for the
1163	legacy hierarchies, this may be unexpected.
1165	A cgroup namespace is alive as long as there are processes inside or
1166	mounts pinning it.  When the last usage goes away, the cgroup
1167	namespace is destroyed.  The cgroupns root and the actual cgroups
1168	remain.
1171	6-2. The Root and Views
1173	The 'cgroupns root' for a cgroup namespace is the cgroup in which the
1174	process calling unshare(2) is running.  For example, if a process in
1175	/batchjobs/container_id1 cgroup calls unshare, cgroup
1176	/batchjobs/container_id1 becomes the cgroupns root.  For the
1177	init_cgroup_ns, this is the real root ('/') cgroup.
1179	The cgroupns root cgroup does not change even if the namespace creator
1180	process later moves to a different cgroup.
1182	  # ~/unshare -c # unshare cgroupns in some cgroup
1183	  # cat /proc/self/cgroup
1184	  0::/
1185	  # mkdir sub_cgrp_1
1186	  # echo 0 > sub_cgrp_1/cgroup.procs
1187	  # cat /proc/self/cgroup
1188	  0::/sub_cgrp_1
1190	Each process gets its namespace-specific view of "/proc/$PID/cgroup".
1192	Processes running inside the cgroup namespace will be able to see
1193	cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
1194	From within an unshared cgroupns:
1196	  # sleep 100000 &
1197	  [1] 7353
1198	  # echo 7353 > sub_cgrp_1/cgroup.procs
1199	  # cat /proc/7353/cgroup
1200	  0::/sub_cgrp_1
1202	From the initial cgroup namespace, the real cgroup path will be
1203	visible:
1205	  $ cat /proc/7353/cgroup
1206	  0::/batchjobs/container_id1/sub_cgrp_1
1208	From a sibling cgroup namespace (that is, a namespace rooted at a
1209	different cgroup), the cgroup path relative to its own cgroup
1210	namespace root will be shown.  For instance, if PID 7353's cgroup
1211	namespace root is at '/batchjobs/container_id2', then it will see:
1213	  # cat /proc/7353/cgroup
1214	  0::/../container_id2/sub_cgrp_1
1216	Note that the relative path always starts with '/' to indicate that
1217	it is relative to the cgroup namespace root of the caller.
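The views above follow from expressing the real cgroup path relative
to the cgroupns root of the reader and prefixing the result with '/'.
The Python sketch below reproduces that computation in user space; the
function name cgroup_view is hypothetical and the kernel does not use
this code.

```python
import posixpath


def cgroup_view(real_path, cgroupns_root):
    """Path of `real_path` as seen from a namespace rooted at
    `cgroupns_root`.  Both arguments are absolute paths in the
    initial cgroup namespace."""
    relative = posixpath.relpath(real_path, cgroupns_root)
    return "/" if relative == "." else "/" + relative


# A cgroup inside the reader's own namespace root.
print(cgroup_view("/batchjobs/container_id1/sub_cgrp_1",
                  "/batchjobs/container_id1"))
# The same cgroup read from the initial namespace, rooted at "/".
print(cgroup_view("/batchjobs/container_id1/sub_cgrp_1", "/"))
# A cgroup under a sibling of the reader's namespace root.
print(cgroup_view("/batchjobs/container_id2/sub_cgrp_1",
                  "/batchjobs/container_id1"))
```

The last call yields a path that climbs out of the namespace root
with "..", which is how cgroups outside the root are presented.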
1220	6-3. Migration and setns(2)
1222	Processes inside a cgroup namespace can move into and out of the
1223	namespace root if they have proper access to external cgroups.  For
1224	example, from inside a namespace with cgroupns root at
1225	/batchjobs/container_id1, and assuming that the global hierarchy is
1226	still accessible inside cgroupns:
1228	  # cat /proc/7353/cgroup
1229	  0::/sub_cgrp_1
1230	  # echo 7353 > batchjobs/container_id2/cgroup.procs
1231	  # cat /proc/7353/cgroup
1232	  0::/../container_id2
1234	Note that this kind of setup is not encouraged.  A task inside cgroup
1235	namespace should only be exposed to its own cgroupns hierarchy.
1237	setns(2) to another cgroup namespace is allowed when:
1239	(a) the process has CAP_SYS_ADMIN against its current user namespace
1240	(b) the process has CAP_SYS_ADMIN against the target cgroup
1241	    namespace's userns
1243	No implicit cgroup changes happen with attaching to another cgroup
1244	namespace.  It is expected that someone moves the attaching
1245	process under the target cgroup namespace root.
1248	6-4. Interaction with Other Namespaces
1250	Namespace specific cgroup hierarchy can be mounted by a process
1251	running inside a non-init cgroup namespace.
1253	  # mount -t cgroup2 none $MOUNT_POINT
1255	This will mount the unified cgroup hierarchy with cgroupns root as the
1256	filesystem root.  The process needs CAP_SYS_ADMIN against its user and
1257	mount namespaces.
1259	The virtualization of /proc/self/cgroup file combined with restricting
1260	the view of cgroup hierarchy by namespace-private cgroupfs mount
1261	provides a properly isolated cgroup view inside the container.
1264	P. Information on Kernel Programming
1266	This section contains kernel programming information in the areas
1267	where interacting with cgroup is necessary.  cgroup core and
1268	controllers are not covered.
1271	P-1. Filesystem Support for Writeback
1273	A filesystem can support cgroup writeback by updating
1274	address_space_operations->writepage[s]() to annotate bio's using the
1275	following two functions.
1277	  wbc_init_bio(@wbc, @bio)
1279		Should be called for each bio carrying writeback data and
1280		associates the bio with the inode's owner cgroup.  Can be
1281		called anytime between bio allocation and submission.
1283	  wbc_account_io(@wbc, @page, @bytes)
1285		Should be called for each data segment being written out.
1286		While this function doesn't care exactly when it's called
1287		during the writeback session, it's the easiest and most
1288		natural to call it as data segments are added to a bio.
1290	With writeback bio's annotated, cgroup support can be enabled per
1291	super_block by setting SB_I_CGROUPWB in ->s_iflags.  This allows for
1292	selective disabling of cgroup writeback support which is helpful when
1293	certain filesystem features, e.g. journaled data mode, are
1294	incompatible.
1296	wbc_init_bio() binds the specified bio to its cgroup.  Depending on
1297	the configuration, the bio may be executed at a lower priority and if
1298	the writeback session is holding shared resources, e.g. a journal
1299	entry, may lead to priority inversion.  There is no one easy solution
1300	for the problem.  Filesystems can try to work around specific problem
1301	cases by skipping wbc_init_bio() or using bio_associate_blkcg()
1302	directly.
1305	D. Deprecated v1 Core Features
1307	- Multiple hierarchies including named ones are not supported.
1309	- All mount options and remounting are not supported.
1311	- The "tasks" file is removed and "cgroup.procs" is not sorted.
1313	- "cgroup.clone_children" is removed.
1315	- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" file
1316	  at the root instead.
1319	R. Issues with v1 and Rationales for v2
1321	R-1. Multiple Hierarchies
1323	cgroup v1 allowed an arbitrary number of hierarchies and each
1324	hierarchy could host any number of controllers.  While this seemed to
1325	provide a high level of flexibility, it wasn't useful in practice.
1327	For example, as there is only one instance of each controller, utility
1328	type controllers such as freezer which can be useful in all
1329	hierarchies could only be used in one.  The issue is exacerbated by
1330	the fact that controllers couldn't be moved to another hierarchy once
1331	hierarchies were populated.  Another issue was that all controllers
1332	bound to a hierarchy were forced to have exactly the same view of the
1333	hierarchy.  It wasn't possible to vary the granularity depending on
1334	the specific controller.
1336	In practice, these issues heavily limited which controllers could be
1337	put on the same hierarchy and most configurations resorted to putting
1338	each controller on its own hierarchy.  Only closely related ones, such
1339	as the cpu and cpuacct controllers, made sense to be put on the same
1340	hierarchy.  This often meant that userland ended up managing multiple
1341	similar hierarchies repeating the same steps on each hierarchy
1342	whenever a hierarchy management operation was necessary.
1344	Furthermore, support for multiple hierarchies came at a steep cost.
1345	It greatly complicated cgroup core implementation but more importantly
1346	the support for multiple hierarchies restricted how cgroup could be
1347	used in general and what controllers were able to do.
1349	There was no limit on how many hierarchies there might be, which meant
1350	that a thread's cgroup membership couldn't be described in finite
1351	length.  The key might contain any number of entries and was unlimited
1352	in length, which made it highly awkward to manipulate and led to
1353	addition of controllers which existed only to identify membership,
1354	which in turn exacerbated the original problem of proliferating number
1355	of hierarchies.
1357	Also, as a controller couldn't have any expectation regarding the
1358	topologies of hierarchies other controllers might be on, each
1359	controller had to assume that all other controllers were attached to
1360	completely orthogonal hierarchies.  This made it impossible, or at
1361	least very cumbersome, for controllers to cooperate with each other.
1363	In most use cases, putting controllers on hierarchies which are
1364	completely orthogonal to each other isn't necessary.  What usually is
1365	called for is the ability to have differing levels of granularity
1366	depending on the specific controller.  In other words, hierarchy may
1367	be collapsed from leaf towards root when viewed from specific
1368	controllers.  For example, a given configuration might not care about
1369	how memory is distributed beyond a certain level while still wanting
1370	to control how CPU cycles are distributed.
1373	R-2. Thread Granularity
1375	cgroup v1 allowed threads of a process to belong to different cgroups.
1376	This didn't make sense for some controllers and those controllers
1377	ended up implementing different ways to ignore such situations but
1378	much more importantly it blurred the line between API exposed to
1379	individual applications and system management interface.
1381	Generally, in-process knowledge is available only to the process
1382	itself; thus, unlike service-level organization of processes,
1383	categorizing threads of a process requires active participation from
1384	the application which owns the target process.
1386	cgroup v1 had an ambiguously defined delegation model which got abused
1387	in combination with thread granularity.  cgroups were delegated to
1388	individual applications so that they can create and manage their own
1389	sub-hierarchies and control resource distributions along them.  This
1390	effectively raised cgroup to the status of a syscall-like API exposed
1391	to lay programs.
1393	First of all, cgroup has a fundamentally inadequate interface to be
1394	exposed this way.  For a process to access its own knobs, it has to
1395	extract the path on the target hierarchy from /proc/self/cgroup,
1396	construct the path by appending the name of the knob to the path, open
1397	and then read and/or write to it.  This is not only extremely clunky
1398	and unusual but also inherently racy.  There is no conventional way to
1399	define a transaction across the required steps and nothing can guarantee
1400	that the process would actually be operating on its own sub-hierarchy.
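The steps described above look roughly as follows in Python.  The
mount point, the sample "/proc/self/cgroup" content and the function
name are hypothetical; memory.limit_in_bytes is used as an example of
a v1 knob.

```python
import os


def own_knob_path(proc_cgroup_text, controller, mount_root, knob):
    """Build the path of `knob` for the caller's cgroup on the v1
    hierarchy that `controller` is attached to."""
    for line in proc_cgroup_text.splitlines():
        _hierarchy_id, controllers, path = line.split(":", 2)
        if controller in controllers.split(","):
            return os.path.join(mount_root, path.lstrip("/"), knob)
    raise LookupError(f"{controller} is not attached to any hierarchy")


# Hypothetical content of /proc/self/cgroup on a v1 system.
SAMPLE = "4:memory:/app/worker\n3:cpu,cpuacct:/app\n"
path = own_knob_path(SAMPLE, "memory", "/sys/fs/cgroup/memory",
                     "memory.limit_in_bytes")
print(path)
# Nothing ties `path` to the process from here on: it may be migrated
# to another cgroup before the file is opened.
```

The gap between computing the path and opening it is the race the
paragraph above refers to.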
1402	cgroup controllers implemented a number of knobs which would never be
1403	accepted as public APIs because they were just adding control knobs to
1404	system-management pseudo filesystem.  cgroup ended up with interface
1405	knobs which were not properly abstracted or refined and directly
1406	revealed kernel internal details.  These knobs got exposed to
1407	individual applications through the ill-defined delegation mechanism
1408	effectively abusing cgroup as a shortcut to implementing public APIs
1409	without going through the required scrutiny.
1411	This was painful for both userland and kernel.  Userland ended up with
1412	misbehaving and poorly abstracted interfaces, and the kernel ended up
1413	inadvertently exposing and being locked into constructs.
1416	R-3. Competition Between Inner Nodes and Threads
1418	cgroup v1 allowed threads to be in any cgroups which created an
1419	interesting problem where threads belonging to a parent cgroup and its
1420	children cgroups competed for resources.  This was nasty as two
1421	different types of entities competed and there was no obvious way to
1422	settle it.  Different controllers did different things.
1424	The cpu controller considered threads and cgroups as equivalents and
1425	mapped nice levels to cgroup weights.  This worked for some cases but
1426	fell flat when children wanted to be allocated specific ratios of CPU
1427	cycles and the number of internal threads fluctuated - the ratios
1428	constantly changed as the number of competing entities fluctuated.
1429	There also were other issues.  The mapping from nice level to weight
1430	wasn't obvious or universal, and there were various other knobs which
1431	simply weren't available for threads.
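The fluctuation can be made concrete with a small calculation.  The
Python sketch below assumes that all internal threads run at nice 0,
that a nice-0 thread has the scheduler weight 1024, and that every
entity is busy; the function name is hypothetical.

```python
NICE_0_WEIGHT = 1024  # weight of a nice-0 thread, same as default cpu.shares


def child_share(child_weight, sibling_weights, internal_threads):
    """Fraction of the parent's CPU time that one child cgroup gets
    while competing with sibling cgroups and the parent's own threads."""
    total = (child_weight + sum(sibling_weights)
             + internal_threads * NICE_0_WEIGHT)
    return child_weight / total


# Two child cgroups with equal weights; the number of threads living
# directly in the parent varies.
for threads in (0, 2, 6):
    print(threads, child_share(1024, [1024], threads))
```

The child's share drops from 1/2 to 1/4 to 1/8 although no weight was
changed, only the number of internal threads.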
1433	The io controller implicitly created a hidden leaf node for each
1434	cgroup to host the threads.  The hidden leaf had its own copies of all
1435	the knobs with "leaf_" prefixed.  While this allowed equivalent
1436	control over internal threads, it came with serious drawbacks.  It
1437	always added an extra layer of nesting which wouldn't be necessary
1438	otherwise, made the interface messy and significantly complicated the
1439	implementation.
1441	The memory controller didn't have a way to control what happened
1442	between internal tasks and child cgroups and the behavior was not
1443	clearly defined.  There were attempts to add ad-hoc behaviors and
1444	knobs to tailor the behavior to specific workloads which would have
1445	led to problems extremely difficult to resolve in the long term.
1447	Multiple controllers struggled with internal tasks and came up with
1448	different ways to deal with it; unfortunately, all the approaches were
1449	severely flawed and, furthermore, the widely different behaviors
1450	made cgroup as a whole highly inconsistent.
1452	This clearly is a problem which needs to be addressed from cgroup core
1453	in a uniform way.
1456	R-4. Other Interface Issues
1458	cgroup v1 grew without oversight and developed a large number of
1459	idiosyncrasies and inconsistencies.  One issue on the cgroup core side
1460	was how an empty cgroup was notified - a userland helper binary was
1461	forked and executed for each event.  The event delivery wasn't
1462	recursive or delegatable.  The limitations of the mechanism also led
1463	to an in-kernel event delivery filtering mechanism, further complicating
1464	the interface.
1466	Controller interfaces were problematic too.  An extreme example is
1467	controllers completely ignoring hierarchical organization and treating
1468	all cgroups as if they were all located directly under the root
1469	cgroup.  Some controllers exposed a large amount of inconsistent
1470	implementation details to userland.
1472	There also was no consistency across controllers.  When a new cgroup
1473	was created, some controllers defaulted to not imposing extra
1474	restrictions while others disallowed any resource usage until
1475	explicitly configured.  Configuration knobs for the same type of
1476	control used widely differing naming schemes and formats.  Statistics
1477	and information knobs were named arbitrarily and used different
1478	formats and units even in the same controller.
1480	cgroup v2 establishes common conventions where appropriate and updates
1481	controllers so that they expose minimal and consistent interfaces.
1484	R-5. Controller Issues and Remedies
1486	R-5-1. Memory
1488	The original lower boundary, the soft limit, is defined as a limit
1489	that is per default unset.  As a result, the set of cgroups that
1490	global reclaim prefers is opt-in, rather than opt-out.  The costs for
1491	optimizing these mostly negative lookups are so high that the
1492	implementation, despite its enormous size, does not even provide the
1493	basic desirable behavior.  First off, the soft limit has no
1494	hierarchical meaning.  All configured groups are organized in a global
1495	rbtree and treated like equal peers, regardless of where they are located
1496	in the hierarchy.  This makes subtree delegation impossible.  Second,
1497	the soft limit reclaim pass is so aggressive that it not only
1498	introduces high allocation latencies into the system, but also impacts
1499	system performance due to overreclaim, to the point where the feature
1500	becomes self-defeating.
1502	The memory.low boundary on the other hand is a top-down allocated
1503	reserve.  A cgroup enjoys reclaim protection when it and all its
1504	ancestors are below their low boundaries, which makes delegation of
1505	subtrees possible.  Secondly, new cgroups have no reserve per default
1506	and in the common case most cgroups are eligible for the preferred
1507	reclaim pass.  This allows the new low boundary to be efficiently
1508	implemented with just a minor addition to the generic reclaim code,
1509	without the need for out-of-band data structures and reclaim passes.
1510	Because the generic reclaim code considers all cgroups except for the
1511	ones running low in the preferred first reclaim pass, overreclaim of
1512	individual groups is eliminated as well, resulting in much better
1513	overall workload performance.
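The protection rule stated above can be written as a small predicate.
The Python sketch below is a user-space illustration with hypothetical
names and numbers; it is not the kernel's reclaim code.

```python
def is_protected(chain):
    """Return True if the cgroup enjoys reclaim protection.

    `chain` lists (usage, low) pairs in bytes, starting with the
    cgroup itself and followed by each of its ancestors.
    """
    return all(usage < low for usage, low in chain)


MIB = 1 << 20
# The cgroup and its parent are both below their low boundaries.
print(is_protected([(100 * MIB, 200 * MIB), (500 * MIB, 1000 * MIB)]))
# The parent exceeds its low boundary, so the child loses protection.
print(is_protected([(100 * MIB, 200 * MIB), (1500 * MIB, 1000 * MIB)]))
# A new cgroup has no reserve (low is 0) and is never protected.
print(is_protected([(100 * MIB, 0)]))
```

Because an ancestor above its boundary removes the protection of the
whole subtree, a delegatee cannot reserve more than what was granted
to it.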
1515	The original high boundary, the hard limit, is defined as a strict
1516	limit that cannot budge, even if the OOM killer has to be called.
1517	But this generally goes against the goal of making the most out of the
1518	available memory.  The memory consumption of workloads varies during
1519	runtime, and that requires users to overcommit.  But doing that with a
1520	strict upper limit requires either a fairly accurate prediction of the
1521	working set size or adding slack to the limit.  Since working set size
1522	estimation is hard and error prone, and getting it wrong results in
1523	OOM kills, most users tend to err on the side of a looser limit and
1524	end up wasting precious resources.
1526	The memory.high boundary on the other hand can be set much more
1527	conservatively.  When hit, it throttles allocations by forcing them
1528	into direct reclaim to work off the excess, but it never invokes the
1529	OOM killer.  As a result, a high boundary that is chosen too
1530	aggressively will not terminate the processes, but instead it will
1531	lead to gradual performance degradation.  The user can monitor this
1532	and make corrections until the minimal memory footprint that still
1533	gives acceptable performance is found.
1535	In extreme cases, with many concurrent allocations and a complete
1536	breakdown of reclaim progress within the group, the high boundary can
1537	be exceeded.  But even then it's mostly better to satisfy the
1538	allocation from the slack available in other groups or the rest of the
1539	system than killing the group.  Otherwise, memory.max is there to
1540	limit this type of spillover and ultimately contain buggy or even
1541	malicious applications.
1543	Setting the original memory.limit_in_bytes below the current usage was
1544	subject to a race condition, where concurrent charges could cause the
1545	limit setting to fail. memory.max on the other hand will first set the
1546	limit to prevent new charges, and then reclaim and OOM kill until the
1547	new limit is met - or the task writing to memory.max is killed.
1549	The combined memory+swap accounting and limiting is replaced by real
1550	control over swap space.
1552	The main argument for a combined memory+swap facility in the original
1553	cgroup design was that global or parental pressure would always be
1554	able to swap all anonymous memory of a child group, regardless of the
1555	child's own (possibly untrusted) configuration.  However, untrusted
1556	groups can sabotage swapping by other means - such as referencing their
1557	anonymous memory in a tight loop - and an admin cannot assume full
1558	swappability when overcommitting untrusted jobs.
1560	For trusted jobs, on the other hand, a combined counter is not an
1561	intuitive userspace interface, and it flies in the face of the idea
1562	that cgroup controllers should account and limit specific physical
1563	resources.  Swap space is a resource like all others in the system,
1564	and that's why unified hierarchy allows distributing it separately.
