About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog

Documentation / assoc_array.txt


Based on kernel version 4.9. Page generated on 2016-12-21 14:28 EST.

1			   ========================================
2			   GENERIC ASSOCIATIVE ARRAY IMPLEMENTATION
3			   ========================================
4	
5	Contents:
6	
7	 - Overview.
8	
9	 - The public API.
10	   - Edit script.
11	   - Operations table.
12	   - Manipulation functions.
13	   - Access functions.
14	   - Index key form.
15	
16	 - Internal workings.
17	   - Basic internal tree layout.
18	   - Shortcuts.
19	   - Splitting and collapsing nodes.
20	   - Non-recursive iteration.
21	   - Simultaneous alteration and iteration.
22	
23	
24	========
25	OVERVIEW
26	========
27	
28	This associative array implementation is an object container with the following
29	properties:
30	
31	 (1) Objects are opaque pointers.  The implementation does not care where they
32	     point (if anywhere) or what they point to (if anything).
33	
34	     [!] NOTE: Pointers to objects _must_ be zero in the least significant bit.
35	
36	 (2) Objects do not need to contain linkage blocks for use by the array.  This
37	     permits an object to be located in multiple arrays simultaneously.
38	     Rather, the array is made up of metadata blocks that point to objects.
39	
40	 (3) Objects require index keys to locate them within the array.
41	
42	 (4) Index keys must be unique.  Inserting an object with the same key as one
43	     already in the array will replace the old object.
44	
45	 (5) Index keys can be of any length and can be of different lengths.
46	
47	 (6) Index keys should encode the length early on, before any variation due to
48	     length is seen.
49	
50	 (7) Index keys can include a hash to scatter objects throughout the array.
51	
52	 (8) The array can iterated over.  The objects will not necessarily come out in
53	     key order.
54	
55	 (9) The array can be iterated over whilst it is being modified, provided the
56	     RCU readlock is being held by the iterator.  Note, however, under these
57	     circumstances, some objects may be seen more than once.  If this is a
58	     problem, the iterator should lock against modification.  Objects will not
59	     be missed, however, unless deleted.
60	
61	(10) Objects in the array can be looked up by means of their index key.
62	
63	(11) Objects can be looked up whilst the array is being modified, provided the
64	     RCU readlock is being held by the thread doing the look up.
65	
66	The implementation uses a tree of 16-pointer nodes internally that are indexed
67	on each level by nibbles from the index key in the same manner as in a radix
68	tree.  To improve memory efficiency, shortcuts can be emplaced to skip over
69	what would otherwise be a series of single-occupancy nodes.  Further, nodes
70	pack leaf object pointers into spare space in the node rather than making an
71	extra branch until as such time an object needs to be added to a full node.
72	
73	
74	==============
75	THE PUBLIC API
76	==============
77	
78	The public API can be found in <linux/assoc_array.h>.  The associative array is
79	rooted on the following structure:
80	
81		struct assoc_array {
82			...
83		};
84	
85	The code is selected by enabling CONFIG_ASSOCIATIVE_ARRAY.
86	
87	
88	EDIT SCRIPT
89	-----------
90	
91	The insertion and deletion functions produce an 'edit script' that can later be
92	applied to effect the changes without risking ENOMEM.  This retains the
93	preallocated metadata blocks that will be installed in the internal tree and
94	keeps track of the metadata blocks that will be removed from the tree when the
95	script is applied.
96	
97	This is also used to keep track of dead blocks and dead objects after the
98	script has been applied so that they can be freed later.  The freeing is done
99	after an RCU grace period has passed - thus allowing access functions to
100	proceed under the RCU read lock.
101	
102	The script appears as outside of the API as a pointer of the type:
103	
104		struct assoc_array_edit;
105	
106	There are two functions for dealing with the script:
107	
108	 (1) Apply an edit script.
109	
110		void assoc_array_apply_edit(struct assoc_array_edit *edit);
111	
112	     This will perform the edit functions, interpolating various write barriers
113	     to permit accesses under the RCU read lock to continue.  The edit script
114	     will then be passed to call_rcu() to free it and any dead stuff it points
115	     to.
116	
117	 (2) Cancel an edit script.
118	
119		void assoc_array_cancel_edit(struct assoc_array_edit *edit);
120	
121	     This frees the edit script and all preallocated memory immediately.  If
122	     this was for insertion, the new object is _not_ released by this function,
123	     but must rather be released by the caller.
124	
125	These functions are guaranteed not to fail.
126	
127	
128	OPERATIONS TABLE
129	----------------
130	
131	Various functions take a table of operations:
132	
133		struct assoc_array_ops {
134			...
135		};
136	
137	This points to a number of methods, all of which need to be provided:
138	
139	 (1) Get a chunk of index key from caller data:
140	
141		unsigned long (*get_key_chunk)(const void *index_key, int level);
142	
143	     This should return a chunk of caller-supplied index key starting at the
144	     *bit* position given by the level argument.  The level argument will be a
145	     multiple of ASSOC_ARRAY_KEY_CHUNK_SIZE and the function should return
146	     ASSOC_ARRAY_KEY_CHUNK_SIZE bits.  No error is possible.
147	
148	
149	 (2) Get a chunk of an object's index key.
150	
151		unsigned long (*get_object_key_chunk)(const void *object, int level);
152	
153	     As the previous function, but gets its data from an object in the array
154	     rather than from a caller-supplied index key.
155	
156	
157	 (3) See if this is the object we're looking for.
158	
159		bool (*compare_object)(const void *object, const void *index_key);
160	
161	     Compare the object against an index key and return true if it matches and
162	     false if it doesn't.
163	
164	
165	 (4) Diff the index keys of two objects.
166	
167		int (*diff_objects)(const void *object, const void *index_key);
168	
169	     Return the bit position at which the index key of the specified object
170	     differs from the given index key or -1 if they are the same.
171	
172	
173	 (5) Free an object.
174	
175		void (*free_object)(void *object);
176	
177	     Free the specified object.  Note that this may be called an RCU grace
178	     period after assoc_array_apply_edit() was called, so synchronize_rcu() may
179	     be necessary on module unloading.
180	
181	
182	MANIPULATION FUNCTIONS
183	----------------------
184	
185	There are a number of functions for manipulating an associative array:
186	
187	 (1) Initialise an associative array.
188	
189		void assoc_array_init(struct assoc_array *array);
190	
191	     This initialises the base structure for an associative array.  It can't
192	     fail.
193	
194	
195	 (2) Insert/replace an object in an associative array.
196	
197		struct assoc_array_edit *
198		assoc_array_insert(struct assoc_array *array,
199				   const struct assoc_array_ops *ops,
200				   const void *index_key,
201				   void *object);
202	
203	     This inserts the given object into the array.  Note that the least
204	     significant bit of the pointer must be zero as it's used to type-mark
205	     pointers internally.
206	
207	     If an object already exists for that key then it will be replaced with the
208	     new object and the old one will be freed automatically.
209	
210	     The index_key argument should hold index key information and is
211	     passed to the methods in the ops table when they are called.
212	
213	     This function makes no alteration to the array itself, but rather returns
214	     an edit script that must be applied.  -ENOMEM is returned in the case of
215	     an out-of-memory error.
216	
217	     The caller should lock exclusively against other modifiers of the array.
218	
219	
220	 (3) Delete an object from an associative array.
221	
222		struct assoc_array_edit *
223		assoc_array_delete(struct assoc_array *array,
224				   const struct assoc_array_ops *ops,
225				   const void *index_key);
226	
227	     This deletes an object that matches the specified data from the array.
228	
229	     The index_key argument should hold index key information and is
230	     passed to the methods in the ops table when they are called.
231	
232	     This function makes no alteration to the array itself, but rather returns
233	     an edit script that must be applied.  -ENOMEM is returned in the case of
234	     an out-of-memory error.  NULL will be returned if the specified object is
235	     not found within the array.
236	
237	     The caller should lock exclusively against other modifiers of the array.
238	
239	
240	 (4) Delete all objects from an associative array.
241	
242		struct assoc_array_edit *
243		assoc_array_clear(struct assoc_array *array,
244				  const struct assoc_array_ops *ops);
245	
246	     This deletes all the objects from an associative array and leaves it
247	     completely empty.
248	
249	     This function makes no alteration to the array itself, but rather returns
250	     an edit script that must be applied.  -ENOMEM is returned in the case of
251	     an out-of-memory error.
252	
253	     The caller should lock exclusively against other modifiers of the array.
254	
255	
256	 (5) Destroy an associative array, deleting all objects.
257	
258		void assoc_array_destroy(struct assoc_array *array,
259					 const struct assoc_array_ops *ops);
260	
261	     This destroys the contents of the associative array and leaves it
262	     completely empty.  It is not permitted for another thread to be traversing
263	     the array under the RCU read lock at the same time as this function is
264	     destroying it as no RCU deferral is performed on memory release -
265	     something that would require memory to be allocated.
266	
267	     The caller should lock exclusively against other modifiers and accessors
268	     of the array.
269	
270	
271	 (6) Garbage collect an associative array.
272	
273		int assoc_array_gc(struct assoc_array *array,
274				   const struct assoc_array_ops *ops,
275				   bool (*iterator)(void *object, void *iterator_data),
276				   void *iterator_data);
277	
278	     This iterates over the objects in an associative array and passes each one
279	     to iterator().  If iterator() returns true, the object is kept.  If it
280	     returns false, the object will be freed.  If the iterator() function
281	     returns true, it must perform any appropriate refcount incrementing on the
282	     object before returning.
283	
284	     The internal tree will be packed down if possible as part of the iteration
285	     to reduce the number of nodes in it.
286	
287	     The iterator_data is passed directly to iterator() and is otherwise
288	     ignored by the function.
289	
290	     The function will return 0 if successful and -ENOMEM if there wasn't
291	     enough memory.
292	
293	     It is possible for other threads to iterate over or search the array under
294	     the RCU read lock whilst this function is in progress.  The caller should
295	     lock exclusively against other modifiers of the array.
296	
297	
298	ACCESS FUNCTIONS
299	----------------
300	
301	There are two functions for accessing an associative array:
302	
303	 (1) Iterate over all the objects in an associative array.
304	
305		int assoc_array_iterate(const struct assoc_array *array,
306					int (*iterator)(const void *object,
307							void *iterator_data),
308					void *iterator_data);
309	
310	     This passes each object in the array to the iterator callback function.
311	     iterator_data is private data for that function.
312	
313	     This may be used on an array at the same time as the array is being
314	     modified, provided the RCU read lock is held.  Under such circumstances,
315	     it is possible for the iteration function to see some objects twice.  If
316	     this is a problem, then modification should be locked against.  The
317	     iteration algorithm should not, however, miss any objects.
318	
319	     The function will return 0 if no objects were in the array or else it will
320	     return the result of the last iterator function called.  Iteration stops
321	     immediately if any call to the iteration function results in a non-zero
322	     return.
323	
324	
325	 (2) Find an object in an associative array.
326	
327		void *assoc_array_find(const struct assoc_array *array,
328				       const struct assoc_array_ops *ops,
329				       const void *index_key);
330	
331	     This walks through the array's internal tree directly to the object
332	     specified by the index key..
333	
334	     This may be used on an array at the same time as the array is being
335	     modified, provided the RCU read lock is held.
336	
337	     The function will return the object if found (and set *_type to the object
338	     type) or will return NULL if the object was not found.
339	
340	
341	INDEX KEY FORM
342	--------------
343	
344	The index key can be of any form, but since the algorithms aren't told how long
345	the key is, it is strongly recommended that the index key includes its length
346	very early on before any variation due to the length would have an effect on
347	comparisons.
348	
349	This will cause leaves with different length keys to scatter away from each
350	other - and those with the same length keys to cluster together.
351	
352	It is also recommended that the index key begin with a hash of the rest of the
353	key to maximise scattering throughout keyspace.
354	
355	The better the scattering, the wider and lower the internal tree will be.
356	
357	Poor scattering isn't too much of a problem as there are shortcuts and nodes
358	can contain mixtures of leaves and metadata pointers.
359	
360	The index key is read in chunks of machine word.  Each chunk is subdivided into
361	one nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and
362	on a 64-bit CPU, 16 levels.  Unless the scattering is really poor, it is
363	unlikely that more than one word of any particular index key will have to be
364	used.
365	
366	
367	=================
368	INTERNAL WORKINGS
369	=================
370	
371	The associative array data structure has an internal tree.  This tree is
372	constructed of two types of metadata blocks: nodes and shortcuts.
373	
374	A node is an array of slots.  Each slot can contain one of four things:
375	
376	 (*) A NULL pointer, indicating that the slot is empty.
377	
378	 (*) A pointer to an object (a leaf).
379	
380	 (*) A pointer to a node at the next level.
381	
382	 (*) A pointer to a shortcut.
383	
384	
385	BASIC INTERNAL TREE LAYOUT
386	--------------------------
387	
388	Ignoring shortcuts for the moment, the nodes form a multilevel tree.  The index
389	key space is strictly subdivided by the nodes in the tree and nodes occur on
390	fixed levels.  For example:
391	
392	 Level:	0		1		2		3
393		===============	===============	===============	===============
394								NODE D
395				NODE B		NODE C	+------>+---+
396			+------>+---+	+------>+---+	|	| 0 |
397		NODE A	|	| 0 |	|	| 0 |	|	+---+
398		+---+	|	+---+	|	+---+	|	:   :
399		| 0 |	|	:   :	|	:   :	|	+---+
400		+---+	|	+---+	|	+---+	|	| f |
401		| 1 |---+	| 3 |---+	| 7 |---+	+---+
402		+---+		+---+		+---+
403		:   :		:   :		| 8 |---+
404		+---+		+---+		+---+	|	NODE E
405		| e |---+	| f |		:   :   +------>+---+
406		+---+	|	+---+		+---+		| 0 |
407		| f |	|			| f |		+---+
408		+---+	|			+---+		:   :
409			|	NODE F				+---+
410			+------>+---+				| f |
411				| 0 |		NODE G		+---+
412				+---+	+------>+---+
413				:   :	|	| 0 |
414				+---+	|	+---+
415				| 6 |---+	:   :
416				+---+		+---+
417				:   :		| f |
418				+---+		+---+
419				| f |
420				+---+
421	
422	In the above example, there are 7 nodes (A-G), each with 16 slots (0-f).
423	Assuming no other meta data nodes in the tree, the key space is divided thusly:
424	
425		KEY PREFIX	NODE
426		==========	====
427		137*		D
428		138*		E
429		13[0-69-f]*	C
430		1[0-24-f]*	B
431		e6*		G
432		e[0-57-f]*	F
433		[02-df]*	A
434	
435	So, for instance, keys with the following example index keys will be found in
436	the appropriate nodes:
437	
438		INDEX KEY	PREFIX	NODE
439		===============	=======	====
440		13694892892489	13	C
441		13795289025897	137	D
442		13889dde88793	138	E
443		138bbb89003093	138	E
444		1394879524789	12	C
445		1458952489	1	B
446		9431809de993ba	-	A
447		b4542910809cd	-	A
448		e5284310def98	e	F
449		e68428974237	e6	G
450		e7fffcbd443	e	F
451		f3842239082	-	A
452	
453	To save memory, if a node can hold all the leaves in its portion of keyspace,
454	then the node will have all those leaves in it and will not have any metadata
455	pointers - even if some of those leaves would like to be in the same slot.
456	
457	A node can contain a heterogeneous mix of leaves and metadata pointers.
458	Metadata pointers must be in the slots that match their subdivisions of key
459	space.  The leaves can be in any slot not occupied by a metadata pointer.  It
460	is guaranteed that none of the leaves in a node will match a slot occupied by a
461	metadata pointer.  If the metadata pointer is there, any leaf whose key matches
462	the metadata key prefix must be in the subtree that the metadata pointer points
463	to.
464	
465	In the above example list of index keys, node A will contain:
466	
467		SLOT	CONTENT		INDEX KEY (PREFIX)
468		====	===============	==================
469		1	PTR TO NODE B	1*
470		any	LEAF		9431809de993ba
471		any	LEAF		b4542910809cd
472		e	PTR TO NODE F	e*
473		any	LEAF		f3842239082
474	
475	and node B:
476	
477		3	PTR TO NODE C	13*
478		any	LEAF		1458952489
479	
480	
481	SHORTCUTS
482	---------
483	
484	Shortcuts are metadata records that jump over a piece of keyspace.  A shortcut
485	is a replacement for a series of single-occupancy nodes ascending through the
486	levels.  Shortcuts exist to save memory and to speed up traversal.
487	
488	It is possible for the root of the tree to be a shortcut - say, for example,
489	the tree contains at least 17 nodes all with key prefix '1111'.  The insertion
490	algorithm will insert a shortcut to skip over the '1111' keyspace in a single
491	bound and get to the fourth level where these actually become different.
492	
493	
494	SPLITTING AND COLLAPSING NODES
495	------------------------------
496	
497	Each node has a maximum capacity of 16 leaves and metadata pointers.  If the
498	insertion algorithm finds that it is trying to insert a 17th object into a
499	node, that node will be split such that at least two leaves that have a common
500	key segment at that level end up in a separate node rooted on that slot for
501	that common key segment.
502	
503	If the leaves in a full node and the leaf that is being inserted are
504	sufficiently similar, then a shortcut will be inserted into the tree.
505	
506	When the number of objects in the subtree rooted at a node falls to 16 or
507	fewer, then the subtree will be collapsed down to a single node - and this will
508	ripple towards the root if possible.
509	
510	
511	NON-RECURSIVE ITERATION
512	-----------------------
513	
514	Each node and shortcut contains a back pointer to its parent and the number of
515	slot in that parent that points to it.  None-recursive iteration uses these to
516	proceed rootwards through the tree, going to the parent node, slot N + 1 to
517	make sure progress is made without the need for a stack.
518	
519	The backpointers, however, make simultaneous alteration and iteration tricky.
520	
521	
522	SIMULTANEOUS ALTERATION AND ITERATION
523	-------------------------------------
524	
525	There are a number of cases to consider:
526	
527	 (1) Simple insert/replace.  This involves simply replacing a NULL or old
528	     matching leaf pointer with the pointer to the new leaf after a barrier.
529	     The metadata blocks don't change otherwise.  An old leaf won't be freed
530	     until after the RCU grace period.
531	
532	 (2) Simple delete.  This involves just clearing an old matching leaf.  The
533	     metadata blocks don't change otherwise.  The old leaf won't be freed until
534	     after the RCU grace period.
535	
536	 (3) Insertion replacing part of a subtree that we haven't yet entered.  This
537	     may involve replacement of part of that subtree - but that won't affect
538	     the iteration as we won't have reached the pointer to it yet and the
539	     ancestry blocks are not replaced (the layout of those does not change).
540	
541	 (4) Insertion replacing nodes that we're actively processing.  This isn't a
542	     problem as we've passed the anchoring pointer and won't switch onto the
543	     new layout until we follow the back pointers - at which point we've
544	     already examined the leaves in the replaced node (we iterate over all the
545	     leaves in a node before following any of its metadata pointers).
546	
547	     We might, however, re-see some leaves that have been split out into a new
548	     branch that's in a slot further along than we were at.
549	
550	 (5) Insertion replacing nodes that we're processing a dependent branch of.
551	     This won't affect us until we follow the back pointers.  Similar to (4).
552	
553	 (6) Deletion collapsing a branch under us.  This doesn't affect us because the
554	     back pointers will get us back to the parent of the new node before we
555	     could see the new node.  The entire collapsed subtree is thrown away
556	     unchanged - and will still be rooted on the same slot, so we shouldn't
557	     process it a second time as we'll go back to slot + 1.
558	
559	Note:
560	
561	 (*) Under some circumstances, we need to simultaneously change the parent
562	     pointer and the parent slot pointer on a node (say, for example, we
563	     inserted another node before it and moved it up a level).  We cannot do
564	     this without locking against a read - so we have to replace that node too.
565	
566	     However, when we're changing a shortcut into a node this isn't a problem
567	     as shortcuts only have one slot and so the parent slot number isn't used
568	     when traversing backwards over one.  This means that it's okay to change
569	     the slot number first - provided suitable barriers are used to make sure
570	     the parent slot number is read after the back pointer.
571	
572	Obsolete blocks and leaves are freed up after an RCU grace period has passed,
573	so as long as anyone doing walking or iteration holds the RCU read lock, the
574	old superstructure should not go away on them.
Hide Line Numbers


About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog