Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed.  "Basic model" explains general concepts using a simplified view
of the system.  The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run.  This
means that we need some coordination in order to ensure that critical
cluster-level operations are only performed when it is truly safe to do
so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up.  Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

The mechanism presented in this document describes a coherent memory
based protocol for performing the needed coordination.  It aims to be as
lightweight as possible, while providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

        DOWN
        COMING_UP
        UP
        GOING_DOWN

            +---------> UP ----------+
            |                        v

        COMING_UP                GOING_DOWN

            ^                        |
            +--------- DOWN <--------+


DOWN:   The CPU or cluster is not coherent, and is either powered off or
        suspended, or is ready to be powered off or suspended.

COMING_UP: The CPU or cluster has committed to moving to the UP state.
        It may be part way through the process of initialisation and
        enabling coherency.

UP:     The CPU or cluster is active and coherent at the hardware
        level.  A CPU in this state is not necessarily being used
        actively by the kernel.

GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
        state.  It may be part way through the process of teardown and
        coherency exit.


Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state.  The cluster-
level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a CPU_ prefix for the CPU states,
and a CLUSTER_ or INBOUND_ prefix for the cluster states.
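
For illustration only, the sketch below shows one way such states could
be laid out in memory.  The type, field and helper names are
hypothetical, not the kernel's own: the real implementation
(arch/arm/common/mcpm_entry.c and mcpm_head.S) keeps equivalent per-CPU
and per-cluster state fields in a dedicated memory area and publishes
them with explicit cache clean/invalidate operations, because the CPUs
involved are not always coherent.  C11 atomics merely stand in for that
machinery here; the later sketches in this document build on these
definitions.

        #include <stdatomic.h>

        enum { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
        enum { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
        enum { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

        #define EXAMPLE_MAX_CPUS_PER_CLUSTER 4

        struct example_cluster_sync {
                _Atomic int cluster;    /* CLUSTER_* state, outbound side */
                _Atomic int inbound;    /* INBOUND_* state, inbound side  */
                _Atomic int cpu[EXAMPLE_MAX_CPUS_PER_CLUSTER];  /* CPU_* */
        };

        /* Publish a state change so that other CPUs reading it see it. */
        static inline void example_set_state(_Atomic int *p, int state)
        {
                atomic_store_explicit(p, state, memory_order_release);
        }

        /* Read the most recently published value of a state field. */
        static inline int example_get_state(_Atomic int *p)
        {
                return atomic_load_explicit(p, memory_order_acquire);
        }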

CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU".  CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

        CPU_DOWN
        CPU_COMING_UP
        CPU_UP
        CPU_GOING_DOWN

         cluster setup and
        CPU setup complete          policy decision
              +-----------> CPU_UP ------------+
              |                                v

        CPU_COMING_UP                   CPU_GOING_DOWN

              ^                                |
              +----------- CPU_DOWN <----------+
         policy decision           CPU teardown complete
        or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event (spontaneous) means that the CPU can transition to the
next state as a result of making local progress only, with no
requirement for any external event to happen.


CPU_DOWN:

        A CPU reaches the CPU_DOWN state when it is ready for
        power-down.  On reaching this state, the CPU will typically
        power itself down or suspend itself, via a WFI instruction or a
        firmware call.

        Next state:     CPU_COMING_UP
        Conditions:     none

        Trigger events:

                a) an explicit hardware power-up operation, resulting
                   from a policy decision on another CPU;

                b) a hardware event, such as an interrupt.


CPU_COMING_UP:

        A CPU cannot start participating in hardware coherency until the
        cluster is set up and coherent.  If the cluster is not ready,
        then the CPU will wait in the CPU_COMING_UP state until the
        cluster has been set up.

        Next state:     CPU_UP
        Conditions:     The CPU's parent cluster must be in CLUSTER_UP.
        Trigger events: Transition of the parent cluster to CLUSTER_UP.

        Refer to the "Cluster state" section for a description of the
        CLUSTER_UP state.
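
        As a rough illustration of this wait, building on the hypothetical
        example_cluster_sync sketch from the "Basic model" section (the
        real code is assembly in mcpm_head.S and runs before the CPU is
        coherent, so it must use explicit cache maintenance rather than
        the atomics assumed here):

        /* Hypothetical sketch of the CPU_COMING_UP wait, not mcpm_head.S. */
        static void example_cpu_coming_up(struct example_cluster_sync *sync,
                                          unsigned int cpu)
        {
                example_set_state(&sync->cpu[cpu], CPU_COMING_UP);

                /*
                 * Wait for the first man to finish cluster setup.  A real
                 * implementation would avoid spinning at full speed here,
                 * e.g. by waiting for an event between polls.
                 */
                while (example_get_state(&sync->cluster) != CLUSTER_UP)
                        ;

                /* Safe to proceed: mark the CPU up and resume the kernel. */
                example_set_state(&sync->cpu[cpu], CPU_UP);
        }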


CPU_UP:
        When a CPU reaches the CPU_UP state, it is safe for the CPU to
        start participating in local coherency.

        This is done by jumping to the kernel's CPU resume code.

        Note that the definition of this state is slightly different
        from the basic model definition: CPU_UP does not mean that the
        CPU is coherent yet, but it does mean that it is safe to resume
        the kernel.  The kernel handles the rest of the resume
        procedure, so the remaining steps are not visible as part of the
        race avoidance algorithm.

        The CPU remains in this state until an explicit policy decision
        is made to shut down or suspend the CPU.

        Next state:     CPU_GOING_DOWN
        Conditions:     none
        Trigger events: explicit policy decision


CPU_GOING_DOWN:

        While in this state, the CPU exits coherency, including any
        operations required to achieve this (such as cleaning data
        caches).

        Next state:     CPU_DOWN
        Conditions:     local CPU teardown complete
        Trigger events: (spontaneous)
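
        Putting CPU_GOING_DOWN and CPU_DOWN together, one CPU's power-down
        path can be sketched as follows.  This again uses the hypothetical
        helpers from the "Basic model" section and is not the actual
        mcpm_entry.c code:

        /* Platform-specific operations, not shown here. */
        void example_cpu_exit_coherency(void);
        void example_cpu_power_off_or_suspend(void);

        /* Hypothetical sketch of one CPU's power-down path. */
        static void example_cpu_power_down(struct example_cluster_sync *sync,
                                           unsigned int cpu)
        {
                /* Policy decision made: commit to going down. */
                example_set_state(&sync->cpu[cpu], CPU_GOING_DOWN);

                /* Exit coherency, e.g. clean data caches and leave SMP mode. */
                example_cpu_exit_coherency();

                /* Teardown complete: others may now cut this CPU's power. */
                example_set_state(&sync->cpu[cpu], CPU_DOWN);

                /* Power off or suspend, e.g. via WFI or a firmware call. */
                example_cpu_power_off_or_suspend();
        }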


Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time.  This has some implications.  In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down.  The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster.  For this
reason, the cluster state is split into two parts:

        "cluster" state: The global state of the cluster; or the state
                on the outbound side:

                CLUSTER_DOWN
                CLUSTER_UP
                CLUSTER_GOING_DOWN

        "inbound" state: The state of the cluster on the inbound side.

                INBOUND_NOT_COMING_UP
                INBOUND_COMING_UP


        The different pairings of these states result in six possible
        states for the cluster as a whole:

                                            CLUSTER_UP
                  +==========> INBOUND_NOT_COMING_UP -------------+
                  #                                               |
                                                                  |
             CLUSTER_UP     <----+                                |
          INBOUND_COMING_UP      |                                v

                  ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
                  #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP

            CLUSTER_DOWN         |                                |
          INBOUND_COMING_UP <----+                                |
                                                                  |
                  ^                                               |
                  +===========     CLUSTER_DOWN      <------------+
                               INBOUND_NOT_COMING_UP

        Transitions -----> can only be made by the outbound CPU, and
        only involve changes to the "cluster" state.

        Transitions ===##> can only be made by the inbound CPU, and only
        involve changes to the "inbound" state, except where there is no
        further transition possible on the outbound side (i.e., the
        outbound CPU has put the cluster into the CLUSTER_DOWN state).

        The race avoidance algorithm does not provide a way to determine
        which exact CPUs within the cluster play these roles.  This must
        be decided in advance by some other means.  Refer to the section
        "Last man and First man selection" for more explanation.


        CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
        cluster can actually be powered down.

        The parallelism of the inbound and outbound CPUs is reflected by
        the existence of two different paths from CLUSTER_GOING_DOWN/
        INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
        model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
        COMING_UP in the basic model).  The second path avoids cluster
        teardown completely.

        CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
        model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
        is trivial and merely resets the state machine ready for the
        next cycle.

        Details of the allowable transitions follow.

        The next state in each case is notated

                <cluster state>/<inbound state> (<transitioner>)

        where the <transitioner> is the side on which the transition
        can occur; either the inbound or the outbound side.


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:

        Next state:     CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
        Conditions:     none
        Trigger events:

                a) an explicit hardware power-up operation, resulting
                   from a policy decision on another CPU;

                b) a hardware event, such as an interrupt.


CLUSTER_DOWN/INBOUND_COMING_UP:

        In this state, an inbound CPU sets up the cluster, including
        enabling of hardware coherency at the cluster level and any
        other operations (such as cache invalidation) which are required
        in order to achieve this.

        The purpose of this state is to do sufficient cluster-level
        setup to enable other CPUs in the cluster to enter coherency
        safely.

        Next state:     CLUSTER_UP/INBOUND_COMING_UP (inbound)
        Conditions:     cluster-level setup and hardware coherency complete
        Trigger events: (spontaneous)


CLUSTER_UP/INBOUND_COMING_UP:

        Cluster-level setup is complete and hardware coherency is
        enabled for the cluster.  Other CPUs in the cluster can safely
        enter coherency.

        This is a transient state, leading immediately to
        CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs in the cluster
        should treat these two states as equivalent.

        Next state:     CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
        Conditions:     none
        Trigger events: (spontaneous)
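
        Taken together, the inbound (first man) walk through these cluster
        states can be sketched as below, building on the hypothetical
        helpers introduced in the "Basic model" section.  The sketch
        assumes this CPU has already won the first man arbitration (see
        "Last man and First man selection") and also covers the case,
        described further below, where an outbound CPU is still in
        CLUSTER_GOING_DOWN; the real code is the low-level assembly in
        mcpm_head.S.

        /* Platform-specific cluster setup: coherency controls, caches, ... */
        void example_cluster_setup(void);

        /* Hypothetical sketch of the inbound (first man) bring-up path. */
        static void example_first_man_bring_up(struct example_cluster_sync *sync)
        {
                int state;

                /* Advertise that an inbound CPU is setting the cluster up. */
                example_set_state(&sync->inbound, INBOUND_COMING_UP);

                /* If an outbound CPU is mid-teardown, wait for it to settle. */
                do {
                        state = example_get_state(&sync->cluster);
                } while (state == CLUSTER_GOING_DOWN);

                if (state == CLUSTER_DOWN) {
                        /* The cluster really is down: set it up again. */
                        example_cluster_setup();
                        example_set_state(&sync->cluster, CLUSTER_UP);
                }

                /* CLUSTER_UP/INBOUND_COMING_UP is transient: reset inbound. */
                example_set_state(&sync->inbound, INBOUND_NOT_COMING_UP);
        }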


CLUSTER_UP/INBOUND_NOT_COMING_UP:

        Cluster-level setup is complete and hardware coherency is
        enabled for the cluster.  Other CPUs in the cluster can safely
        enter coherency.

        The cluster will remain in this state until a policy decision is
        made to power the cluster down.

        Next state:     CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
        Conditions:     none
        Trigger events: policy decision to power down the cluster


CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

        An outbound CPU is tearing the cluster down.  The selected CPU
        must wait in this state until all CPUs in the cluster are in the
        CPU_DOWN state.

        When all CPUs are in the CPU_DOWN state, the cluster can be torn
        down, for example by cleaning data caches and exiting
        cluster-level coherency.

        To avoid unnecessary teardown operations, the outbound CPU
        should check the inbound cluster state for asynchronous
        transitions to INBOUND_COMING_UP.  Alternatively, individual
        CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.


        Next states:

        CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
                Conditions:     cluster torn down and ready to power off
                Trigger events: (spontaneous)

        CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
                Conditions:     none
                Trigger events:

                        a) an explicit hardware power-up operation,
                           resulting from a policy decision on another
                           CPU;

                        b) a hardware event, such as an interrupt.


CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

        The cluster is (or was) being torn down, but another CPU has
        come online in the meantime and is trying to set up the cluster
        again.

        If the outbound CPU observes this state, it has two choices:

                a) back out of teardown, restoring the cluster to the
                   CLUSTER_UP state;

                b) finish tearing the cluster down and put the cluster
                   in the CLUSTER_DOWN state; the inbound CPU will
                   set up the cluster again from there.

        Choice (a) can save some latency by avoiding
        unnecessary teardown and setup operations in situations where
        the cluster is not really going to be powered down.


        Next states:

        CLUSTER_UP/INBOUND_COMING_UP (outbound)
                Conditions:     cluster-level setup and hardware
                                coherency complete
                Trigger events: (spontaneous)

        CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
                Conditions:     cluster torn down and ready to power off
                Trigger events: (spontaneous)
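
        The outbound (last man) side of these cluster states can be
        sketched in the same style, again with the hypothetical helpers
        from the "Basic model" section.  This fragment only implements
        choice (a) for the abort case, and it corresponds only loosely to
        __mcpm_outbound_enter_critical()/__mcpm_outbound_leave_critical()
        in mcpm_entry.c:

        /* Platform-specific teardown: clean caches, exit cluster coherency. */
        void example_cluster_teardown(void);

        /*
         * Hypothetical sketch of the outbound (last man) teardown path.
         * Returns 1 if the cluster was really torn down and may be powered
         * off, or 0 if teardown was aborted because an inbound CPU appeared.
         */
        static int example_last_man_tear_down(struct example_cluster_sync *sync,
                                              unsigned int self,
                                              unsigned int ncpus)
        {
                unsigned int i;

                /* Policy decision made: commit to taking the cluster down. */
                example_set_state(&sync->cluster, CLUSTER_GOING_DOWN);

                /* Wait for every other CPU in the cluster to reach CPU_DOWN... */
                for (i = 0; i < ncpus; i++) {
                        if (i == self)
                                continue; /* the last man marks itself down later */

                        while (example_get_state(&sync->cpu[i]) != CPU_DOWN) {
                                /* ...but give up if an inbound CPU has appeared. */
                                if (example_get_state(&sync->inbound) ==
                                    INBOUND_COMING_UP) {
                                        /* Choice (a): back out of the teardown. */
                                        example_set_state(&sync->cluster,
                                                          CLUSTER_UP);
                                        return 0;
                                }
                        }
                }

                /* Everyone is down and nobody is inbound: tear the cluster down. */
                example_cluster_teardown();
                example_set_state(&sync->cluster, CLUSTER_DOWN);
                return 1;
        }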


Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent.  Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.
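
For example, a per-cluster count of the CPUs that are still up,
protected by an ordinary spinlock, is sufficient.  This is a
hypothetical sketch in kernel style (the mcpm code tracks a comparable
use count; see mcpm_entry.c); incrementing the count when a CPU comes
up is omitted:

        #include <linux/spinlock.h>

        /* Hypothetical: pick the last man while the CPUs are still coherent. */
        static DEFINE_SPINLOCK(example_cluster_lock);
        static unsigned int example_cpus_up;    /* CPUs in this cluster still up */

        /* Returns 1 if the calling CPU is the last man for its cluster. */
        static int example_select_last_man(void)
        {
                int last_man;

                spin_lock(&example_cluster_lock);
                last_man = (--example_cpus_up == 0);
                spin_unlock(&example_cluster_lock);

                return last_man;
        }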


First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual exclusion
mechanism to do this arbitration.  This mechanism is documented in
detail in vlocks.txt.
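
Conceptually, first man selection is a try-lock: exactly one of the CPUs
racing to bring the cluster up must win it.  The sketch below shows that
shape using a C11 atomic flag, but note that an ordinary atomic operation
of this kind itself depends on working coherency, which is precisely what
is not yet available at this point; that is why the vlock algorithm is
used instead.

        #include <stdatomic.h>

        /*
         * Conceptual sketch only.  The vlock algorithm provides the same
         * "exactly one winner" guarantee using only non-coherent memory
         * accesses.  Resetting the flag when the cluster goes back down
         * is omitted here.
         */
        static atomic_flag example_first_man_claimed = ATOMIC_FLAG_INIT;

        /* Returns 1 for exactly one of the CPUs racing to come up. */
        static int example_claim_first_man(void)
        {
                return !atomic_flag_test_and_set(&example_first_man_claimed);
        }

CPUs which lose this arbitration simply wait for the cluster to reach
CLUSTER_UP, as described for the CPU_COMING_UP state above.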


Features and Limitations
------------------------

Implementation:

        The current ARM-based implementation is split between
        arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
        arch/arm/common/mcpm_entry.c (everything else):

        __mcpm_cpu_going_down() signals the transition of a CPU to the
                CPU_GOING_DOWN state.

        __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
                state.

        A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
                low-level power-up code in mcpm_head.S.  This could
                involve CPU-specific setup code, but in the current
                implementation it does not.

        __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
                handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
                and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
                the case of an aborted cluster power-down).

                These functions are more complex than the __mcpm_cpu_*()
                functions due to the extra inter-CPU coordination which
                is needed for safe transitions at the cluster level.

        A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
                the low-level power-up code in mcpm_head.S.  This
                typically involves platform-specific setup code,
                provided by the platform-specific power_up_setup
                function registered via mcpm_sync_init.
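
        As a rough outline (not actual kernel code), these helpers fit
        together on one CPU's power-down path roughly as shown below.  The
        ordering follows the state descriptions in this document, but the
        argument lists are assumptions, and last-man selection and the
        platform-specific teardown steps are only hinted at; see
        mcpm_entry.c for the real prototypes and sequencing.

        static void example_power_down_path(unsigned int cpu,
                                            unsigned int cluster,
                                            int last_man)
        {
                __mcpm_cpu_going_down(cpu, cluster);  /* CPU_UP -> CPU_GOING_DOWN */

                if (last_man && __mcpm_outbound_enter_critical(cpu, cluster)) {
                        /* CLUSTER_UP -> CLUSTER_GOING_DOWN; we may tear down. */
                        /* ...clean caches, exit cluster-level coherency... */
                        __mcpm_outbound_leave_critical(cluster, CLUSTER_DOWN);
                }
                /* Otherwise only CPU-level teardown is needed here. */

                __mcpm_cpu_down(cpu, cluster);        /* CPU_GOING_DOWN -> CPU_DOWN */

                /* Finally power off or suspend, e.g. WFI or a firmware call. */
        }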

Deep topologies:

        As currently described and implemented, the algorithm does not
        support CPU topologies involving more than two levels (i.e.,
        clusters of clusters are not supported).  The algorithm could be
        extended by replicating the cluster-level states for the
        additional topological levels, and modifying the transition
        rules for the intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013  Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.