Based on kernel version 4.9. Page generated on 2016-12-21 14:34 EST.
1 2 The LogFS Flash Filesystem 3 ========================== 4 5 Specification 6 ============= 7 8 Superblocks 9 ----------- 10 11 Two superblocks exist at the beginning and end of the filesystem. 12 Each superblock is 256 Bytes large, with another 3840 Bytes reserved 13 for future purposes, making a total of 4096 Bytes. 14 15 Superblock locations may differ for MTD and block devices. On MTD the 16 first non-bad block contains a superblock in the first 4096 Bytes and 17 the last non-bad block contains a superblock in the last 4096 Bytes. 18 On block devices, the first 4096 Bytes of the device contain the first 19 superblock and the last aligned 4096 Byte-block contains the second 20 superblock. 21 22 For the most part, the superblocks can be considered read-only. They 23 are written only to correct errors detected within the superblocks, 24 move the journal and change the filesystem parameters through tunefs. 25 As a result, the superblock does not contain any fields that require 26 constant updates, like the amount of free space, etc. 27 28 Segments 29 -------- 30 31 The space in the device is split up into equal-sized segments. 32 Segments are the primary write unit of LogFS. Within each segments, 33 writes happen from front (low addresses) to back (high addresses. If 34 only a partial segment has been written, the segment number, the 35 current position within and optionally a write buffer are stored in 36 the journal. 37 38 Segments are erased as a whole. Therefore Garbage Collection may be 39 required to completely free a segment before doing so. 40 41 Journal 42 -------- 43 44 The journal contains all global information about the filesystem that 45 is subject to frequent change. At mount time, it has to be scanned 46 for the most recent commit entry, which contains a list of pointers to 47 all currently valid entries. 48 49 Object Store 50 ------------ 51 52 All space except for the superblocks and journal is part of the object 53 store. Each segment contains a segment header and a number of 54 objects, each consisting of the object header and the payload. 55 Objects are either inodes, directory entries (dentries), file data 56 blocks or indirect blocks. 57 58 Levels 59 ------ 60 61 Garbage collection (GC) may fail if all data is written 62 indiscriminately. One requirement of GC is that data is separated 63 roughly according to the distance between the tree root and the data. 64 Effectively that means all file data is on level 0, indirect blocks 65 are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, 66 respectively. Inode file data is on level 6 for the inodes and 7-11 67 for indirect blocks. 68 69 Each segment contains objects of a single level only. As a result, 70 each level requires its own separate segment to be open for writing. 71 72 Inode File 73 ---------- 74 75 All inodes are stored in a special file, the inode file. Single 76 exception is the inode file's inode (master inode) which for obvious 77 reasons is stored in the journal instead. Instead of data blocks, the 78 leaf nodes of the inode files are inodes. 79 80 Aliases 81 ------- 82 83 Writes in LogFS are done by means of a wandering tree. A naïve 84 implementation would require that for each write or a block, all 85 parent blocks are written as well, since the block pointers have 86 changed. Such an implementation would not be very efficient. 87 88 In LogFS, the block pointer changes are cached in the journal by means 89 of alias entries. Each alias consists of its logical address - inode 90 number, block index, level and child number (index into block) - and 91 the changed data. Any 8-byte word can be changes in this manner. 92 93 Currently aliases are used for block pointers, file size, file used 94 bytes and the height of an inodes indirect tree. 95 96 Segment Aliases 97 --------------- 98 99 Related to regular aliases, these are used to handle bad blocks. 100 Initially, bad blocks are handled by moving the affected segment 101 content to a spare segment and noting this move in the journal with a 102 segment alias, a simple (to, from) tupel. GC will later empty this 103 segment and the alias can be removed again. This is used on MTD only. 104 105 Vim 106 --- 107 108 By cleverly predicting the life time of data, it is possible to 109 separate long-living data from short-living data and thereby reduce 110 the GC overhead later. Each type of distinc life expectency (vim) can 111 have a separate segment open for writing. Each (level, vim) tupel can 112 be open just once. If an open segment with unknown vim is encountered 113 at mount time, it is closed and ignored henceforth. 114 115 Indirect Tree 116 ------------- 117 118 Inodes in LogFS are similar to FFS-style filesystems with direct and 119 indirect block pointers. One difference is that LogFS uses a single 120 indirect pointer that can be either a 1x, 2x, etc. indirect pointer. 121 A height field in the inode defines the height of the indirect tree 122 and thereby the indirection of the pointer. 123 124 Another difference is the addressing of indirect blocks. In LogFS, 125 the first 16 pointers in the first indirect block are left empty, 126 corresponding to the 16 direct pointers in the inode. In ext2 (maybe 127 others as well) the first pointer in the first indirect block 128 corresponds to logical block 12, skipping the 12 direct pointers. 129 So where ext2 is using arithmetic to better utilize space, LogFS keeps 130 arithmetic simple and uses compression to save space. 131 132 Compression 133 ----------- 134 135 Both file data and metadata can be compressed. Compression for file 136 data can be enabled with chattr +c and disabled with chattr -c. Doing 137 so has no effect on existing data, but new data will be stored 138 accordingly. New inodes will inherit the compression flag of the 139 parent directory. 140 141 Metadata is always compressed. However, the space accounting ignores 142 this and charges for the uncompressed size. Failing to do so could 143 result in GC failures when, after moving some data, indirect blocks 144 compress worse than previously. Even on a 100% full medium, GC may 145 not consume any extra space, so the compression gains are lost space 146 to the user. 147 148 However, they are not lost space to the filesystem internals. By 149 cheating the user for those bytes, the filesystem gained some slack 150 space and GC will run less often and faster. 151 152 Garbage Collection and Wear Leveling 153 ------------------------------------ 154 155 Garbage collection is invoked whenever the number of free segments 156 falls below a threshold. The best (known) candidate is picked based 157 on the least amount of valid data contained in the segment. All 158 remaining valid data is copied elsewhere, thereby invalidating it. 159 160 The GC code also checks for aliases and writes then back if their 161 number gets too large. 162 163 Wear leveling is done by occasionally picking a suboptimal segment for 164 garbage collection. If a stale segments erase count is significantly 165 lower than the active segments' erase counts, it will be picked. Wear 166 leveling is rate limited, so it will never monopolize the device for 167 more than one segment worth at a time. 168 169 Values for "occasionally", "significantly lower" are compile time 170 constants. 171 172 Hashed directories 173 ------------------ 174 175 To satisfy efficient lookup(), directory entries are hashed and 176 located based on the hash. In order to both support large directories 177 and not be overly inefficient for small directories, several hash 178 tables of increasing size are used. For each table, the hash value 179 modulo the table size gives the table index. 180 181 Tables sizes are chosen to limit the number of indirect blocks with a 182 fully populated table to 0, 1, 2 or 3 respectively. So the first 183 table contains 16 entries, the second 512-16, etc. 184 185 The last table is special in several ways. First its size depends on 186 the effective 32bit limit on telldir/seekdir cookies. Since logfs 187 uses the upper half of the address space for indirect blocks, the size 188 is limited to 2^31. Secondly the table contains hash buckets with 16 189 entries each. 190 191 Using single-entry buckets would result in birthday "attacks". At 192 just 2^16 used entries, hash collisions would be likely (P >= 0.5). 193 My math skills are insufficient to do the combinatorics for the 17x 194 collisions necessary to overflow a bucket, but testing showed that in 195 10,000 runs the lowest directory fill before a bucket overflow was 196 188,057,130 entries with an average of 315,149,915 entries. So for 197 directory sizes of up to a million, bucket overflows should be 198 virtually impossible under normal circumstances. 199 200 With carefully chosen filenames, it is obviously possible to cause an 201 overflow with just 21 entries (4 higher tables + 16 entries + 1). So 202 there may be a security concern if a malicious user has write access 203 to a directory. 204 205 Open For Discussion 206 =================== 207 208 Device Address Space 209 -------------------- 210 211 A device address space is used for caching. Both block devices and 212 MTD provide functions to either read a single page or write a segment. 213 Partial segments may be written for data integrity, but where possible 214 complete segments are written for performance on simple block device 215 flash media. 216 217 Meta Inodes 218 ----------- 219 220 Inodes are stored in the inode file, which is just a regular file for 221 most purposes. At umount time, however, the inode file needs to 222 remain open until all dirty inodes are written. So 223 generic_shutdown_super() may not close this inode, but shouldn't 224 complain about remaining inodes due to the inode file either. Same 225 goes for mapping inode of the device address space. 226 227 Currently logfs uses a hack that essentially copies part of fs/inode.c 228 code over. A general solution would be preferred. 229 230 Indirect block mapping 231 ---------------------- 232 233 With compression, the block device (or mapping inode) cannot be used 234 to cache indirect blocks. Some other place is required. Currently 235 logfs uses the top half of each inode's address space. The low 8TB 236 (on 32bit) are filled with file data, the high 8TB are used for 237 indirect blocks. 238 239 One problem is that 16TB files created on 64bit systems actually have 240 data in the top 8TB. But files >16TB would cause problems anyway, so 241 only the limit has changed.