XFS Self Describing Metadata
----------------------------

Introduction
------------

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of the
structures and indexes on disk and the algorithms for iterating them are
adequate for supporting PB scale filesystems with billions of inodes, however it
is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so as the amount of analysis goes
up, the more likely it is that the cause will be lost in the noise.  Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.

Self Describing Metadata
------------------------

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify whether it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes, it
is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding how things like cross-linked block lists (e.g. sibling pointers in
a btree ending up with loops in them) came about is the key to understanding
what went wrong, but it is impossible to tell what order the blocks were linked
into each other or written to disk after the fact.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable.  Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.
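
Note that the stored CRC has to be excluded from the calculation, otherwise
updating the CRC would invalidate it. That is why the verifier examples later in
this document pass the offset of the CRC field to xfs_verify_cksum() and
xfs_update_cksum(). The following is a minimal, illustrative sketch of that idea
only - not the actual XFS helpers - assuming the kernel's generic crc32c()
routine:

#include <linux/crc32c.h>

/*
 * Illustrative only: compute a CRC32c over a metadata block while
 * treating the 32 bit CRC field at @crc_offset as if it were zero.
 * The seed value is arbitrary here; the real helpers live elsewhere
 * in the XFS code.
 */
static u32
example_metadata_cksum(const char *buf, size_t length, unsigned long crc_offset)
{
	const u32 zero = 0;
	u32 crc;

	/* CRC the data up to the stored checksum field */
	crc = crc32c(~0U, buf, crc_offset);

	/* feed in zeroes in place of the stored checksum */
	crc = crc32c(crc, &zero, sizeof(zero));

	/* CRC the remainder of the block */
	return crc32c(crc, buf + crc_offset + sizeof(zero),
		      length - crc_offset - sizeof(zero));
}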

Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, that it is valid
and/or corrupted, and how long ago it was last modified. Knowing the owner of
the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the entire
filesystem to find the owner of the block. Worse, the corruption could mean that
no owner can be found (i.e. it's an orphan block), and so without an owner field
in the metadata we have no idea of the scope of the corruption. If we have an
owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, whilst
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at.  The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing forensic
analysis is how recently the block was modified. Correlation of a set of
corrupted metadata blocks based on modification times is important as it can
indicate whether the corruptions are related, whether there have been multiple
corruption events that led to the eventual failure, and even whether there are
corruptions present that the run-time verification is not detecting.

For example, by looking at when the free space btree block that contains a
metadata object was last written compared to when the metadata object itself was
last written, we can determine whether the object is supposed to be free space
or is still allocated and referenced by its owner. If the free space block is
more recent than the object and the object's owner, then there is a very good
chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block gets the Log Sequence
Number (LSN) of the most recent transaction it was modified on written into it.
This number will always increase over the life of the filesystem, and the only
thing that resets it is running xfs_repair on the filesystem. Further, by use of
the LSN we can tell if the corrupted metadata all belonged to the same log
checkpoint and hence have some idea of how much modification occurred between
the first and last instance of corrupt metadata on disk and, further, how much
modification occurred between the corruption being written and when it was
detected.
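
As a rough illustration of how this is used during forensic analysis, the
relative order of two on-disk writes can be established simply by comparing the
LSN fields of the two blocks, because the LSN only ever moves forwards. A hedged
sketch, treating the LSN as an opaque 64 bit big-endian counter and assuming the
xfs_ondisk_hdr layout shown in the Structures section below:

/*
 * Illustrative forensic helper, not actual kernel code: report whether
 * header @a was written to disk more recently than header @b.
 */
static int
example_hdr_newer(struct xfs_ondisk_hdr *a, struct xfs_ondisk_hdr *b)
{
	/* LSNs only ever increase, so a straight comparison orders writes */
	return be64_to_cpu(a->lsn) > be64_to_cpu(b->lsn);
}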

Runtime Validation
------------------

Validation of self-describing metadata takes place at runtime in two places:

	- immediately after a successful read from disk
	- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification failure it can do so by checking for the EFSCORRUPTED error value.
If we need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.
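
For instance, a caller reading a hypothetical "foo" block might look roughly
like the sketch below. xfs_foo_read() and xfs_foo_get() are made up names used
purely for illustration; the point is only the error handling pattern:

/*
 * Illustrative caller, not actual kernel code: xfs_foo_read() is assumed
 * to submit the read with the foo verifiers attached. If the read
 * verifier rejects the contents, the error is EFSCORRUPTED rather than a
 * plain IO error, and the caller can act on that distinction.
 */
static int
xfs_foo_get(
	struct xfs_mount	*mp,
	struct xfs_trans	*tp,
	xfs_daddr_t		blkno,
	struct xfs_buf		**bpp)
{
	int			error;

	error = xfs_foo_read(mp, tp, blkno, bpp);
	if (error == EFSCORRUPTED) {
		/* verification failure, not a media error */
		xfs_alert(mp, "foo block 0x%llx failed verification",
			  (unsigned long long)blkno);
	}
	return error;
}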

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

Structures
----------

A typical on-disk structure needs to contain the following information:

struct xfs_ondisk_hdr {
        __be32  magic;		/* magic number */
        __be32  crc;		/* CRC, not logged */
        uuid_t  uuid;		/* filesystem identifier */
        __be64  owner;		/* parent object */
        __be64  blkno;		/* location on disk */
        __be64  lsn;		/* last modification in log, not logged */
};

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
	  number for location. The two of these combined provide the same
	  information as @owner and @blkno in the above structure, but using 8
	  bytes less space on disk. (A sketch of such a compact header follows
	  this list.)

	- directory/attribute node blocks have a 16 bit magic number, and the
	  header that contains the magic number has other information in it as
	  well. Hence the additional metadata headers change the overall format
	  of the metadata.
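
As a purely hypothetical illustration of the first point above (this is not an
actual XFS structure), a compact header for AG-local metadata might look
something like:

struct xfs_ondisk_short_hdr {
        __be32  magic;		/* magic number */
        __be32  crc;		/* CRC, not logged */
        uuid_t  uuid;		/* filesystem identifier */
        __be32  owner;		/* owning AG number */
        __be32  blkno;		/* AG-relative block number */
        __be64  lsn;		/* last modification in log, not logged */
};

Here the owner and location fit in a single 8 byte slot rather than the two
8 byte fields of struct xfs_ondisk_hdr above, which is where the 8 bytes of
on-disk space are saved.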

A typical buffer read verifier is structured as follows:

#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)

static void
xfs_foo_read_verify(
	struct xfs_buf	*bp)
{
        struct xfs_mount *mp = bp->b_target->bt_mount;

        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
						XFS_FOO_CRC_OFF)) ||
            !xfs_foo_verify(bp)) {
                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
                xfs_buf_ioerror(bp, EFSCORRUPTED);
        }
}

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case where it can't, the code is structured as follows:

static bool
xfs_foo_verify(
	struct xfs_buf		*bp)
{
        struct xfs_mount	*mp = bp->b_target->bt_mount;
        struct xfs_ondisk_hdr	*hdr = bp->b_addr;

        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                return false;

        if (xfs_sb_version_hascrc(&mp->m_sb)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	}

	/* object specific verification checks here */

        return true;
}

If there are different magic numbers for the different formats, the verifier
will look like:

static bool
xfs_foo_verify(
	struct xfs_buf		*bp)
{
        struct xfs_mount	*mp = bp->b_target->bt_mount;
        struct xfs_ondisk_hdr	*hdr = bp->b_addr;

        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		return false;

	/* object specific verification checks here */

        return true;
}

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order. A typical write verifier:

static void
xfs_foo_write_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_target->bt_mount;
	struct xfs_buf_log_item	*bip = bp->b_fspriv;

	if (!xfs_foo_verify(bp)) {
		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		xfs_buf_ioerror(bp, EFSCORRUPTED);
		return;
	}

	if (!xfs_sb_version_hascrc(&mp->m_sb))
		return;

	if (bip) {
		struct xfs_ondisk_hdr	*hdr = bp->b_addr;
		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	}
	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
}

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.
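
The read and write verifiers are packaged together and attached to the buffer
when it is read, so that read verification runs at IO completion and write
verification runs just before the buffer is submitted for IO. A minimal sketch
of that glue, assuming the 3.13-era struct xfs_buf_ops layout and using a made
up xfs_foo_buf_ops object:

/*
 * Sketch only: the ops pointer is passed to the buffer read routines and
 * stored in bp->b_ops, which is how the buffer cache knows which
 * verifiers to run for this type of metadata.
 */
const struct xfs_buf_ops xfs_foo_buf_ops = {
	.verify_read	= xfs_foo_read_verify,
	.verify_write	= xfs_foo_write_verify,
};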

Inodes and Dquots
-----------------

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that they contain inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.

The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_iread() when the inode is first
read out of the buffer and the struct xfs_inode is instantiated. The inode is
already extensively verified during writeback in xfs_iflush_int, so the only
addition here is to add the LSN and CRC to the inode as it is copied back into
the buffer.
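
To make the per-object nature concrete, checking a single inode's CRC looks
roughly like the sketch below: the CRC covers only that inode's slice of the
cluster buffer, not the whole buffer. The helper name is made up and the
XFS_DINODE_CRC_OFF offset is recalled from the 3.13-era code, so treat both as
illustrative:

/*
 * Illustrative per-object check: verify the CRC of one on-disk inode
 * within an inode cluster buffer.
 */
static bool
example_dinode_crc_ok(
	struct xfs_mount	*mp,
	struct xfs_dinode	*dip)
{
	if (!xfs_sb_version_hascrc(&mp->m_sb))
		return true;	/* no CRCs on pre-v5 filesystems */

	return xfs_verify_cksum((char *)dip, mp->m_sb.sb_inodesize,
				XFS_DINODE_CRC_OFF);
}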

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.