
Documentation/cgroups/unified-hierarchy.txt


Based on kernel version 4.3.

1	
2	Cgroup unified hierarchy
3	
4	April, 2014		Tejun Heo <tj@kernel.org>
5	
6	This document describes the changes made by unified hierarchy and
7	their rationales.  It will eventually be merged into the main cgroup
8	documentation.
9	
10	CONTENTS
11	
12	1. Background
13	2. Basic Operation
14	  2-1. Mounting
15	  2-2. cgroup.subtree_control
16	  2-3. cgroup.controllers
17	3. Structural Constraints
18	  3-1. Top-down
19	  3-2. No internal tasks
20	4. Delegation
21	  4-1. Model of delegation
22	  4-2. Common ancestor rule
23	5. Other Changes
24	  5-1. [Un]populated Notification
25	  5-2. Other Core Changes
26	  5-3. Controller File Conventions
27	    5-3-1. Format
28	    5-3-2. Control Knobs
29	  5-4. Per-Controller Changes
30	    5-4-1. io
31	    5-4-2. cpuset
32	    5-4-3. memory
33	6. Planned Changes
34	  6-1. CAP for resource control
35	
36	
37	1. Background
38	
39	cgroup allows an arbitrary number of hierarchies and each hierarchy
40	can host any number of controllers.  While this seems to provide a
41	high level of flexibility, it isn't quite useful in practice.
42	
43	For example, as there is only one instance of each controller, utility
44	type controllers such as freezer which can be useful in all
45	hierarchies can only be used in one.  The issue is exacerbated by the
46	fact that controllers can't be moved around once hierarchies are
47	populated.  Another issue is that all controllers bound to a hierarchy
48	are forced to have exactly the same view of the hierarchy.  It isn't
49	possible to vary the granularity depending on the specific controller.
50	
51	In practice, these issues heavily limit which controllers can be put
52	on the same hierarchy and most configurations resort to putting each
53	controller on its own hierarchy.  Only closely related ones, such as
54	the cpu and cpuacct controllers, make sense to put on the same
55	hierarchy.  This often means that userland ends up managing multiple
56	similar hierarchies repeating the same steps on each hierarchy
57	whenever a hierarchy management operation is necessary.
58	
59	Unfortunately, support for multiple hierarchies comes at a steep cost.
60	Internal implementation in cgroup core proper is dazzlingly
61	complicated but more importantly the support for multiple hierarchies
62	restricts how cgroup is used in general and what controllers can do.
63	
64	There's no limit on how many hierarchies there may be, which means
65	that a task's cgroup membership can't be described in finite length.
66	The key may contain a varying number of entries and is unlimited in
67	length, which makes it highly awkward to handle and leads to addition
68	of controllers which exist only to identify membership, which in turn
69	exacerbates the original problem.
70	
71	Also, as a controller can't have any expectation regarding what shape
72	of hierarchies other controllers would be on, each controller has to
73	assume that all other controllers are operating on completely
74	orthogonal hierarchies.  This makes it impossible, or at least very
75	cumbersome, for controllers to cooperate with each other.
76	
77	In most use cases, putting controllers on hierarchies which are
78	completely orthogonal to each other isn't necessary.  What usually is
79	called for is the ability to have differing levels of granularity
80	depending on the specific controller.  In other words, hierarchy may
81	be collapsed from leaf towards root when viewed from specific
82	controllers.  For example, a given configuration might not care about
83	how memory is distributed beyond a certain level while still wanting
84	to control how CPU cycles are distributed.
85	
86	Unified hierarchy is the next version of cgroup interface.  It aims to
87	address the aforementioned issues by having more structure while
88	retaining enough flexibility for most use cases.  Various other
89	general and controller-specific interface issues are also addressed in
90	the process.
91	
92	
93	2. Basic Operation
94	
95	2-1. Mounting
96	
97	Currently, unified hierarchy can be mounted with the following mount
98	command.  Note that this is still under development and scheduled to
99	change soon.
100	
101	 mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
102	
103	All controllers which support the unified hierarchy and are not bound
104	to other hierarchies are automatically bound to unified hierarchy and
105	show up at the root of it.  Controllers which are enabled only in the
106	root of unified hierarchy can be bound to other hierarchies.  This
107	allows mixing unified hierarchy with the traditional multiple
108	hierarchies in a fully backward compatible way.
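
For example, a traditional single-controller hierarchy can be kept
for cpuset while every other controller shows up on the unified
hierarchy (the cpuset mount point below is hypothetical):

  mount -t cgroup -o cpuset cgroup /sys/fs/cgroup/cpuset
  mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT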
109	
110	For development purposes, the following boot parameter makes all
111	controllers appear on the unified hierarchy whether they support it
112	or not.
113	
114	 cgroup__DEVEL__legacy_files_on_dfl
115	
116	A controller can be moved across hierarchies only after the controller
117	is no longer referenced in its current hierarchy.  Because per-cgroup
118	controller states are destroyed asynchronously and controllers may
119	have lingering references, a controller may not show up immediately on
120	the unified hierarchy after the final umount of the previous
121	hierarchy.  Similarly, a controller should be fully disabled to be
122	moved out of the unified hierarchy and it may take some time for the
123	disabled controller to become available for other hierarchies;
124	furthermore, due to dependencies among controllers, other controllers
125	may need to be disabled too.
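
Continuing the hypothetical mount points above, a sketch of moving
cpuset onto the unified hierarchy:

  # Once nothing uses the traditional cpuset hierarchy any more ...
  umount /sys/fs/cgroup/cpuset
  # ... cpuset eventually shows up in the unified root; as the old
  # per-cgroup state is destroyed asynchronously, this may not be
  # immediate.
  cat $MOUNT_POINT/cgroup.controllers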
126	
127	While useful for development and manual configurations, dynamically
128	moving controllers between the unified and other hierarchies is
129	strongly discouraged for production use.  It is recommended to decide
130	the hierarchies and controller associations before starting to use
131	the controllers.
132	
133	
134	2-2. cgroup.subtree_control
135	
136	All cgroups on unified hierarchy have a "cgroup.subtree_control" file
137	which governs which controllers are enabled on the children of the
138	cgroup.  Let's assume a hierarchy like the following.
139	
140	  root - A - B - C
141	               \ D
142	
143	root's "cgroup.subtree_control" file determines which controllers are
144	enabled on A.  A's on B.  B's on C and D.  This coincides with the
145	fact that controllers on the immediate sub-level are used to
146	distribute the resources of the parent.  In fact, it's natural to
147	assume that resource control knobs of a child belong to its parent.
148	Enabling a controller in a "cgroup.subtree_control" file declares that
149	distribution of the respective resources of the cgroup will be
150	controlled.  Note that this means that controller enable states are
151	shared among siblings.
152	
153	When read, the file contains a space-separated list of currently
154	enabled controllers.  A write to the file should contain a
155	space-separated list of controllers with '+' or '-' prefixed (without
156	the quotes).  Controllers prefixed with '+' are enabled and '-'
157	disabled.  If a controller is listed multiple times, the last entry
158	wins.  The specified operations are executed atomically - either they
159	all succeed or they all fail.
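
For example, with the unified hierarchy mounted at $MOUNT_POINT as in
section 2-1 and the hypothetical cgroup A from the diagram above,
memory can be enabled and io disabled for A's children with:

  echo "+memory -io" > $MOUNT_POINT/A/cgroup.subtree_control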
160	
161	
162	2-3. cgroup.controllers
163	
164	Read-only "cgroup.controllers" file contains a space-separated list of
165	controllers which can be enabled in the cgroup's
166	"cgroup.subtree_control" file.
167	
168	In the root cgroup, this lists controllers which are not bound to
169	other hierarchies and the content changes as controllers are bound to
170	and unbound from other hierarchies.
171	
172	In non-root cgroups, the content of this file equals that of the
173	parent's "cgroup.subtree_control" file as only controllers enabled
174	from the parent can be used in its children.
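
For example, if A enables only memory in its "cgroup.subtree_control",
its child B is expected to see exactly that:

  cat $MOUNT_POINT/A/cgroup.subtree_control    # prints "memory"
  cat $MOUNT_POINT/A/B/cgroup.controllers      # prints "memory"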
175	
176	
177	3. Structural Constraints
178	
179	3-1. Top-down
180	
181	As it doesn't make sense to nest control of an uncontrolled resource,
182	all non-root "cgroup.subtree_control" files can only contain
183	controllers which are enabled in the parent's "cgroup.subtree_control"
184	file.  A controller can be enabled only if the parent has the
185	controller enabled and a controller can't be disabled if one or more
186	children have it enabled.
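
A short sketch of the rule, reusing the hypothetical A - B layout:

  # A enables memory for its children and B passes it on.
  echo "+memory" > $MOUNT_POINT/A/cgroup.subtree_control
  echo "+memory" > $MOUNT_POINT/A/B/cgroup.subtree_control
  # Enabling io in B fails - io is not enabled in A's subtree_control.
  echo "+io" > $MOUNT_POINT/A/B/cgroup.subtree_control
  # Disabling memory in A also fails while B still has it enabled.
  echo "-memory" > $MOUNT_POINT/A/cgroup.subtree_control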
187	
188	
189	3-2. No internal tasks
190	
191	One long-standing issue that cgroup faces is the competition between
192	tasks belonging to the parent cgroup and its children cgroups.  This
193	is inherently nasty as two different types of entities compete and
194	there is no agreed-upon obvious way to handle it.  Different
195	controllers are doing different things.
196	
197	The cpu controller considers tasks and cgroups as equivalents and maps
198	nice levels to cgroup weights.  This works for some cases but falls
199	flat when children should be allocated specific ratios of CPU cycles
200	and the number of internal tasks fluctuates - the ratios constantly
201	change as the number of competing entities fluctuates.  There also are
202	other issues.  The mapping from nice level to weight isn't obvious or
203	universal, and there are various other knobs which simply aren't
204	available for tasks.
205	
206	The io controller implicitly creates a hidden leaf node for each
207	cgroup to host the tasks.  The hidden leaf has its own copies of all
208	the knobs with "leaf_" prefixed.  While this allows equivalent control
209	over internal tasks, it comes with serious drawbacks.  It always adds
210	an extra layer of nesting which may not be necessary, makes the
211	interface messy and significantly complicates the implementation.
212	
213	The memory controller currently doesn't have a way to control what
214	happens between internal tasks and child cgroups and the behavior is
215	not clearly defined.  There have been attempts to add ad-hoc behaviors
216	and knobs to tailor the behavior to specific workloads.  Continuing
217	this direction will lead to problems which will be extremely difficult
218	to resolve in the long term.
219	
220	Multiple controllers struggle with internal tasks and have come up
221	with different ways to deal with it; unfortunately, all the approaches
222	currently in use are severely flawed and, furthermore, the widely
223	different behaviors make cgroup as a whole highly inconsistent.
224	
225	It is clear that this is something which needs to be addressed from
226	cgroup core proper in a uniform way so that controllers don't need to
227	worry about it and cgroup as a whole shows a consistent and logical
228	behavior.  To achieve that, unified hierarchy enforces the following
229	structural constraint:
230	
231	 Except for the root, only cgroups which don't contain any task may
232	 have controllers enabled in their "cgroup.subtree_control" files.
233	
234	Combined with other properties, this guarantees that, when a
235	controller is looking at the part of the hierarchy which has it
236	enabled, tasks are always only on the leaves.  This rules out
237	situations where child cgroups compete against internal tasks of the
238	parent.
239	
240	There are two things to note.  Firstly, the root cgroup is exempt from
241	the restriction.  Root contains tasks and anonymous resource
242	consumption which can't be associated with any other cgroup and
243	requires special treatment from most controllers.  How resource
244	consumption in the root cgroup is governed is up to each controller.
245	
246	Secondly, the restriction doesn't take effect if there is no enabled
247	controller in the cgroup's "cgroup.subtree_control" file.  This is
248	important as otherwise it wouldn't be possible to create children of a
249	populated cgroup.  To control resource distribution of a cgroup, the
250	cgroup must create children and transfer all its tasks to the children
251	before enabling controllers in its "cgroup.subtree_control" file.
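
A minimal sketch of that ordering, for a hypothetical populated cgroup
named "parent":

  # Create a leaf child and move every process into it first.
  mkdir $MOUNT_POINT/parent/leaf
  while read pid; do
          echo "$pid" > $MOUNT_POINT/parent/leaf/cgroup.procs
  done < $MOUNT_POINT/parent/cgroup.procs
  # Only then may controllers be enabled for parent's children.
  echo "+memory" > $MOUNT_POINT/parent/cgroup.subtree_control

In practice the loop may need to be repeated until "cgroup.procs"
reads back empty, since new processes can appear while it runs.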
252	
253	
254	4. Delegation
255	
256	4-1. Model of delegation
257	
258	A cgroup can be delegated to a less privileged user by granting the
259	user write access to the directory and its "cgroup.procs" file.  Note
260	that the resource control knobs in a given directory concern the
261	resources of the parent and thus must not be delegated along with the
262	directory.
263	
264	Once delegated, the user can build a sub-hierarchy under the directory,
265	organize processes as it sees fit and further distribute the resources
266	it got from the parent.  The limits and other settings of all resource
267	controllers are hierarchical and regardless of what happens in the
268	delegated sub-hierarchy, nothing can escape the resource restrictions
269	imposed by the parent.
270	
271	Currently, cgroup doesn't impose any restrictions on the number of
272	cgroups in or nesting depth of a delegated sub-hierarchy; however,
273	this may in the future be limited explicitly.
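
For illustration, delegating an existing cgroup to a hypothetical
unprivileged user with uid 10000 might look like the following; only
the directory and its "cgroup.procs" are handed over, while the
resource control knobs stay with the parent's owner:

  chown 10000 $MOUNT_POINT/delegated
  chown 10000 $MOUNT_POINT/delegated/cgroup.procs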
274	
275	
276	4-2. Common ancestor rule
277	
278	On the unified hierarchy, to write to a "cgroup.procs" file, in
279	addition to the usual write permission to the file and uid match, the
280	writer must also have write access to the "cgroup.procs" file of the
281	common ancestor of the source and destination cgroups.  This prevents
282	delegatees from smuggling processes across disjoint sub-hierarchies.
283	
284	Let's say cgroups C0 and C1 have been delegated to user U0 who created
285	C00, C01 under C0 and C10 under C1 as follows.
286	
287	 ~~~~~~~~~~~~~ - C0 - C00
288	 ~ cgroup    ~      \ C01
289	 ~ hierarchy ~
290	 ~~~~~~~~~~~~~ - C1 - C10
291	
292	C0 and C1 are separate entities in terms of resource distribution
293	regardless of their relative positions in the hierarchy.  The
294	resources the processes under C0 are entitled to are controlled by
295	C0's ancestors and may be completely different from those of C1.
296	It's clear that the intention of delegating C0 to U0 is to allow U0
297	to organize the processes under C0 and further control the
298	distribution of C0's resources.
299	
300	On traditional hierarchies, if a task has write access to the "tasks"
301	or "cgroup.procs" file of a cgroup and its uid agrees with the target,
302	it can move the target to the cgroup.  In the above example, U0 will
303	not only be able to move processes within each sub-hierarchy but also
304	across the two sub-hierarchies, effectively allowing it to violate the
305	organizational and resource restrictions implied by the hierarchical
306	structure above C0 and C1.
307	
308	On the unified hierarchy, let's say U0 wants to write the pid of a
309	process which has a matching uid and is currently in C10 into
310	"C00/cgroup.procs".  U0 obviously has write access to the file and
311	migration permission on the process; however, the common ancestor of
312	the source cgroup C10 and the destination cgroup C00 is above the
313	points of delegation and U0 would not have write access to its
314	"cgroup.procs"; the write is thus denied with -EACCES.
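
In other words, the following write, issued by U0 with the pid of such
a process (path abbreviated), fails even though U0 owns both delegated
subtrees:

  echo "$PID" > .../C0/C00/cgroup.procs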
315	
316	
317	5. Other Changes
318	
319	5-1. [Un]populated Notification
320	
321	cgroup users often need a way to determine when a cgroup's
322	subhierarchy becomes empty so that it can be cleaned up.  cgroup
323	currently provides release_agent for it; unfortunately, this mechanism
324	is riddled with issues.
325	
326	- It delivers events by forking and execing a userland binary
327	  specified as the release_agent.  This is a long deprecated method of
328	  notification delivery.  It's extremely heavy, slow and cumbersome to
329	  integrate with larger infrastructure.
330	
331	- There is a single monitoring point at the root.  There's no way to
332	  delegate management of a subtree.
333	
334	- The event isn't recursive.  It triggers when a cgroup doesn't have
335	  any tasks or child cgroups.  Events for internal nodes trigger only
336	  after all children are removed.  This again makes it impossible to
337	  delegate management of a subtree.
338	
339	- Events are filtered from the kernel side.  A "notify_on_release"
340	  file is used to subscribe to or suppress release events.  This is
341	  unnecessarily complicated and probably done this way because event
342	  delivery itself was expensive.
343	
344	Unified hierarchy implements an interface file "cgroup.populated"
345	which can be used to monitor whether the cgroup's subhierarchy has
346	tasks in it or not.  Its value is 0 if there is no task in the cgroup
347	and its descendants; otherwise, 1.  poll and [id]notify events are
348	triggered when the value changes.
349	
350	This is significantly lighter and simpler and trivially allows
351	delegating management of a subhierarchy - a subhierarchy monitor can
352	block further propagation simply by putting itself or another process
353	in the subhierarchy and monitoring the events it's interested in from
354	there without interfering with monitoring higher in the tree.
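
As a sketch, a monitor for a hypothetical cgroup "job" could simply
block on the file with the inotifywait utility (from inotify-tools)
and clean up once the value reads 0:

  inotifywait -e modify $MOUNT_POINT/job/cgroup.populated
  if [ "$(cat $MOUNT_POINT/job/cgroup.populated)" = "0" ]; then
          rmdir $MOUNT_POINT/job   # assuming no child cgroups remain
  fi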
355	
356	In unified hierarchy, the release_agent mechanism is no longer
357	supported and the interface files "release_agent" and
358	"notify_on_release" do not exist.
359	
360	
361	5-2. Other Core Changes
362	
363	- None of the mount options is allowed.
364	
365	- remount is disallowed.
366	
367	- rename(2) is disallowed.
368	
369	- The "tasks" file is removed.  Everything should be at process
370	  granularity.  Use the "cgroup.procs" file instead.
371	
372	- The "cgroup.procs" file is not sorted.  pids will be unique unless
373	  they were recycled between reads.
374	
375	- The "cgroup.clone_children" file is removed.
376	
377	
378	5-3. Controller File Conventions
379	
380	5-3-1. Format
381	
382	In general, all controller files should be in one of the following
383	formats whenever possible.
384	
385	- Values only files
386	
387	  VAL0 VAL1...\n
388	
389	- Flat keyed files
390	
391	  KEY0 VAL0\n
392	  KEY1 VAL1\n
393	  ...
394	
395	- Nested keyed files
396	
397	  KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
398	  KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
399	  ...
400	
401	For a writeable file, the format for writing should generally match
402	reading; however, controllers may allow omitting later fields or
403	implement restricted shortcuts for most common use cases.
404	
405	For both flat and nested keyed files, only the values for a single key
406	can be written at a time.  For nested keyed files, the sub key pairs
407	may be specified in any order and not all pairs have to be specified.
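
For example, with the "io.max" file described in section 5-4-1, the
read bandwidth limit of a single (hypothetical) device can be updated
from within a cgroup's directory without touching its other settings:

  echo "8:16 rbps=2097152" > io.max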
408	
409	
410	5-3-2. Control Knobs
411	
412	- Settings for a single feature should generally be implemented in a
413	  single file.
414	
415	- In general, the root cgroup should be exempt from resource control
416	  and thus shouldn't have resource control knobs.
417	
418	- If a controller implements ratio based resource distribution, the
419	  control knob should be named "weight" and have the range [1, 10000]
420	  and 100 should be the default value.  The values are chosen to allow
421	  enough and symmetric bias in both directions while keeping it
422	  intuitive (the default is 100%).
423	
424	- If a controller implements an absolute resource guarantee and/or
425	  limit, the control knobs should be named "min" and "max"
426	  respectively.  If a controller implements best effort resource
427	  guarantee and/or limit, the control knobs should be named "low" and
428	  "high" respectively.
429	
430	  In the above four control files, the special token "max" should be
431	  used to represent upward infinity for both reading and writing.
432	
433	- If a setting has a configurable default value and specific overrides,
434	  the default setting should be keyed with "default" and appear as the
435	  first entry in the file.  Specific entries can use "default" as their
436	  value to indicate inheritance of the default value.
437	
438	
439	5-4. Per-Controller Changes
440	
441	5-4-1. io
442	
443	- blkio is renamed to io.  The interface is overhauled anyway.  The
444	  new name is more in line with the other two major controllers, cpu
445	  and memory, and better suited given that it may be used for cgroup
446	  writeback without involving block layer.
447	
448	- Everything including stat is always hierarchical, making separate
449	  recursive stat files pointless and, as no internal node can have
450	  tasks, leaf weights are meaningless.  The operation model is
451	  simplified and the interface is overhauled accordingly.
452	
453	  io.stat
454	
455		The stat file.  The reported stats are from the point where
456		bio's are issued to request_queue.  The stats are counted
457		independent of which policies are enabled.  Each line in the
458		file follows the following format.  More fields may later be
459		added at the end.
460	
461		  $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
462	
463	  io.weight
464	
465		The weight setting, currently only available and effective if
466		cfq-iosched is in use for the target device.  The weight is
467		between 1 and 10000 and defaults to 100.  The first line
468		always contains the default weight, in the following format,
469		which is used when a per-device setting is missing.
470	
471		  default $WEIGHT
472	
473		Subsequent lines list per-device weights of the following
474		format.
475	
476		  $MAJ:$MIN $WEIGHT
477	
478		Writing "$WEIGHT" or "default $WEIGHT" changes the default
479		setting.  Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
480		while "$MAJ:$MIN default" clears it.
481	
482		This file is available only on non-root cgroups.
483	
484	  io.max
485	
486		The maximum bandwidth and/or iops setting, only available if
487		blk-throttle is enabled.  The file is of the following format.
488	
489		  $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
490	
491		${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
492		read/write IOs per second.  "max" indicates no limit.  Writing
493		to the file follows the same format but the individual
494		settings may be omitted or specified in any order.
495	
496		This file is available only on non-root cgroups.
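
  A hedged example of using both files for a hypothetical device 8:16
  from within a delegated cgroup's directory:

    # Give this cgroup twice the default share of disk time when
    # cfq-iosched manages the device.
    echo "8:16 200" > io.weight
    # Cap write bandwidth to about 1MiB/s; the other limits stay "max".
    echo "8:16 wbps=1048576" > io.max
    # Clear the per-device weight again.
    echo "8:16 default" > io.weight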
497	
498	
499	5-4-2. cpuset
500	
501	- Tasks are kept in empty cpusets after hotplug and take on the masks
502	  of the nearest non-empty ancestor, instead of being moved to it.
503	
504	- A task can be moved into an empty cpuset, and again it takes on the
505	  masks of the nearest non-empty ancestor.
506	
507	
508	5-4-3. memory
509	
510	- use_hierarchy is on by default and the cgroup file for the flag is
511	  not created.
512	
513	- The original lower boundary, the soft limit, is defined as a limit
514	  that is unset by default.  As a result, the set of cgroups that
515	  global reclaim prefers is opt-in, rather than opt-out.  The costs
516	  for optimizing these mostly negative lookups are so high that the
517	  implementation, despite its enormous size, does not even provide the
518	  basic desirable behavior.  First off, the soft limit has no
519	  hierarchical meaning.  All configured groups are organized in a
520	  global rbtree and treated like equal peers, regardless of where they
521	  are located in the hierarchy.  This makes subtree delegation
522	  impossible.  Second, the soft limit reclaim pass is so aggressive
523	  that it not only introduces high allocation latencies into the
524	  system, but also impacts system performance due to overreclaim, to
525	  the point where the feature becomes self-defeating.
526	
527	  The memory.low boundary on the other hand is a top-down allocated
528	  reserve.  A cgroup enjoys reclaim protection when it and all its
529	  ancestors are below their low boundaries, which makes delegation of
530	  subtrees possible.  Secondly, new cgroups have no reserve by default
531	  and in the common case most cgroups are eligible for the
532	  preferred reclaim pass.  This allows the new low boundary to be
533	  efficiently implemented with just a minor addition to the generic
534	  reclaim code, without the need for out-of-band data structures and
535	  reclaim passes.  Because the generic reclaim code considers all
536	  cgroups except for the ones running low in the preferred first
537	  reclaim pass, overreclaim of individual groups is eliminated as
538	  well, resulting in much better overall workload performance.
539	
540	- The original high boundary, the hard limit, is defined as a strict
541	  limit that cannot budge, even if the OOM killer has to be called.
542	  But this generally goes against the goal of making the most out of
543	  the available memory.  The memory consumption of workloads varies
544	  during runtime, and that requires users to overcommit.  But doing
545	  that with a strict upper limit requires either a fairly accurate
546	  prediction of the working set size or adding slack to the limit.
547	  Since working set size estimation is hard and error prone, and
548	  getting it wrong results in OOM kills, most users tend to err on the
549	  side of a looser limit and end up wasting precious resources.
550	
551	  The memory.high boundary on the other hand can be set much more
552	  conservatively.  When hit, it throttles allocations by forcing them
553	  into direct reclaim to work off the excess, but it never invokes the
554	  OOM killer.  As a result, a high boundary that is chosen too
555	  aggressively will not terminate the processes, but instead it will
556	  lead to gradual performance degradation.  The user can monitor this
557	  and make corrections until the minimal memory footprint that still
558	  gives acceptable performance is found.
559	
560	  In extreme cases, with many concurrent allocations and a complete
561	  breakdown of reclaim progress within the group, the high boundary
562	  can be exceeded.  But even then it's mostly better to satisfy the
563	  allocation from the slack available in other groups or the rest of
564	  the system than killing the group.  Otherwise, memory.max is there
565	  to limit this type of spillover and ultimately contain buggy or even
566	  malicious applications.
567	
568	- The original control file names are unwieldy and inconsistent in
569	  many different ways.  For example, the upper boundary hit count is
570	  exported in the memory.failcnt file, but an OOM event count has to
571	  be manually counted by listening to memory.oom_control events, and
572	  lower boundary / soft limit events have to be counted by first
573	  setting a threshold for that value and then counting those events.
574	  Also, usage and limit files encode their units in the filename.
575	  That makes the filenames very long, even though this is not
576	  information that a user needs to be reminded of every time they type
577	  out those names.
578	
579	  To address these naming issues, as well as to signal clearly that
580	  the new interface carries a new configuration model, the naming
581	  conventions in it necessarily differ from those of the old interface.
582	
583	- The original limit files indicate the state of an unset limit with a
584	  Very High Number, and a configured limit can be unset by echoing -1
585	  into those files.  But that very high number is implementation and
586	  architecture dependent and not very descriptive.  And while -1 can
587	  be understood as an underflow into the highest possible value, -2 or
588	  -10M etc. do not work, so it's not consistent.
589	
590	  memory.low, memory.high, and memory.max will use the string "max" to
591	  indicate and set the highest possible value.
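
  As an illustration (the values below are made up and assumed to be
  given in bytes), a cgroup could be configured and later uncapped
  like this:

    echo 536870912  > memory.low     # protect ~512MiB from reclaim
    echo 4294967296 > memory.high    # begin throttling at 4GiB
    echo 8589934592 > memory.max     # hard cap at 8GiB
    echo max > memory.max            # remove the hard cap again
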
592	
593	6. Planned Changes
594	
595	6-1. CAP for resource control
596	
597	Unified hierarchy will require one of the capabilities(7), which is
598	yet to be decided, for all resource control related knobs.  Process
599	organization operations - creation of sub-cgroups and migration of
600	processes in sub-hierarchies - may be delegated by changing the
601	ownership and/or permissions on the cgroup directory and
602	"cgroup.procs" interface file; however, all operations which affect
603	resource control - writes to a "cgroup.subtree_control" file or any
604	controller-specific knobs - will require an explicit CAP privilege.
605	
606	This, in part, is to prevent the cgroup interface from being
607	inadvertently promoted to a programmable API used by non-privileged
608	binaries.  cgroup exposes various aspects of the system in ways which
609	aren't properly abstracted for direct consumption by regular programs.
610	This is an administration interface much closer to sysctl knobs than
611	system calls.  Even the basic access model, being filesystem path
612	based, isn't suitable for direct consumption.  There's no way to
613	access "my cgroup" in a race-free way or make multiple operations
614	atomic against migration to another cgroup.
615	
616	Another aspect is that, for better or for worse, the cgroup interface
617	goes through far less scrutiny than regular interfaces for
618	unprivileged userland.  The upside is that cgroup is able to expose
619	useful features which may not be suitable for general consumption in a
620	reasonable time frame.  It provides a relatively short path between
621	internal details and userland-visible interface.  Of course, this
622	shortcut comes with high risk.  We go through what we go through for
623	general kernel APIs for good reasons.  It may end up leaking internal
624	details in a way which can exert significant pain by locking the
625	kernel into a contract that can't be maintained in a reasonable
626	manner.
627	
628	Also, due to its specific nature, cgroup and its controllers don't
629	tend to attract attention from a wide scope of developers.  cgroup's
630	short history is already fraught with severely mis-designed
631	interfaces, unnecessary commitments to and exposing of internal
632	details, and broken and dangerous implementations of various features.
633	
634	Keeping cgroup as an administration interface is both advantageous for
635	its role and imperative given its nature.  Some of the cgroup features
636	may make sense for unprivileged access.  If deemed justified, those
637	must be further abstracted and implemented as a different interface,
638	be it a system call or process-private filesystem, and survive through
639	the scrutiny that any interface for general consumption is required to
640	go through.
641	
642	Requiring CAP is not a complete solution but should serve as a
643	significant deterrent against spraying cgroup usages in non-privileged
644	programs.