The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

Separate write-intent bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to ensure
that one node doesn't read from a location where another node (or the
same node) is writing.


2. DLM Locks for management

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

The bm_lockres protects the individual node bitmaps. They are named
in the form bitmap000 for node 1, bitmap001 for node 2, and so on.
When a node joins the cluster, it acquires the lock in PW mode and
holds it for as long as the node is part of the cluster. The lock
resource name is based on the slot number returned by the DLM
subsystem. Since DLM counts nodes from one while bitmap slots start
from zero, one is subtracted from the DLM slot number to arrive at
the bitmap slot number.

The LVB of the bitmap lock for a particular node records the range
of sectors that are being re-synced by that node. No other node may
write to those sectors. This is used when a new node joins the
cluster.

2.2 Message passing locks

Each node has to communicate with the other nodes when starting or
ending a resync, and for metadata superblock updates. This
communication is managed through three lock resources: "token",
"message" and "ack", together with the Lock Value Block (LVB) of the
"message" lock resource.

2.3 New-device management

A single lock resource, "no-new-dev", is used to co-ordinate the
addition of new devices - this must be synchronized across the array.
Normally all nodes hold a concurrent-read (CR) lock on this resource.

3. Communication

Messages can be broadcast to all nodes, and the sender waits for all
other nodes to acknowledge the message before proceeding. Only one
message can be processed at a time.

3.1 Message Types

There are six types of messages which are passed:

3.1.1 METADATA_UPDATED: informs other nodes that the metadata has
   been updated, and the node must re-read the md superblock. This is
   performed synchronously. It is primarily used to signal device
   failure.

3.1.2 RESYNCING: informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region. Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per node.

3.1.3 NEWDISK: informs other nodes that a device is being added to
   the array. The message contains an identifier for that device. See
   below for further details.

3.1.4 REMOVE: a failed or spare device is being removed from the
   array. The slot number of the device is included in the message.

3.1.5 RE_ADD: a failed device is being re-activated - the assumption
   is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC: if a node is stopped locally but its bitmap
   isn't clean, another node is informed to take ownership of the
   resync.
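To make this concrete, the sketch below shows roughly what the
message carried in the "message" LVB looks like. It is modelled on
struct cluster_msg in drivers/md/md-cluster.c, but the exact field
names, layout, and which fields each message type uses should be
taken as illustrative rather than authoritative:

    /* Illustrative sketch of the on-wire message; see
     * struct cluster_msg in drivers/md/md-cluster.c for the
     * authoritative definition. */
    enum msg_type {
        METADATA_UPDATED = 0,
        RESYNCING,
        NEWDISK,
        REMOVE,
        RE_ADD,
        BITMAP_NEEDS_SYNC,
    };

    struct cluster_msg {
        __le32 type;       /* one of msg_type above */
        __le32 slot;       /* sender's slot number */
        __le64 low;        /* RESYNCING: start of suspended range */
        __le64 high;       /* RESYNCING: end of suspended range */
        char uuid[16];     /* NEWDISK: identifier of the new device */
        __le32 raid_slot;  /* REMOVE/RE_ADD: slot of the device */
    };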
3.2 Communication mechanism

The DLM LVB is used to communicate between the nodes of the cluster.
There are three lock resources used for this purpose:

3.2.1 token: The resource which protects the entire communication
   system. The node holding the token resource is allowed to
   communicate.

3.2.2 message: The lock resource whose LVB carries the data to
   communicate.

3.2.3 ack: Acquiring this resource in EX mode means the message has
   been acknowledged by all nodes in the cluster. The BAST of the
   resource is used to inform the receiving nodes that a node wants
   to communicate.

The algorithm is:

 1. receive status - all nodes have a concurrent-read lock on "ack".

   sender                         receiver                 receiver
   "ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token"
    sender gets EX on "message"

   sender                         receiver                 receiver
   "token":EX                     "ack":CR                 "ack":CR
   "message":EX
   "ack":CR

    The sender checks that it still needs to send the message.
    Messages received, or other events that happened while waiting
    for the "token", may have made this message inappropriate or
    redundant.

 3. sender writes the LVB
    sender down-converts "message" from EX to CW
    sender tries to get EX on "ack"
    [ waits until all receivers have *processed* the "message" ]

                                   [ triggered by BAST of "ack" ]
                                   receiver gets CR on "message"
                                   receiver reads the LVB
                                   receiver processes the message
                                   [ wait finish ]
                                   receiver releases "ack"
                                   receiver tries to get PR on "message"

   sender                         receiver                 receiver
   "token":EX                     "message":CR             "message":CR
   "message":CW
   "ack":EX

 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed the message)
    sender down-converts "ack" from EX to CR
    sender releases "message"
    sender releases "token"
                                   receiver up-converts to PR on "message"
                                   receiver gets CR on "ack"
                                   receiver releases "message"

   sender                         receiver                 receiver
   "ack":CR                       "ack":CR                 "ack":CR
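The sender's side of this sequence can be expressed as a C sketch.
dlm_lock_sync() and dlm_unlock_sync() here stand for helpers that
request the given mode (or release the lock) and wait for completion
- md-cluster.c has helpers of this shape - but the names, types and
error handling below are illustrative, not the exact in-tree code:

    /* Sketch of the sender side of the algorithm above. */
    static int send_message(struct md_cluster_info *cinfo,
                            struct cluster_msg *cmsg)
    {
        int error;

        /* step 2: only the "token" holder may communicate */
        error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
        if (error)
            return error;
        error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
        if (error)
            goto out_token;

        /* step 3: publish the message in the LVB, then let the
         * receivers in by down-converting "message" EX -> CW */
        memcpy(cinfo->message_lockres->lksb.sb_lvbptr, cmsg,
               sizeof(*cmsg));
        error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
        if (error)
            goto out_message;

        /* EX on "ack" is only granted once every receiver has
         * processed the message and dropped its CR lock on "ack" */
        error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);

        /* step 4: return to the idle state, holding CR on "ack" */
        dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
    out_message:
        dlm_unlock_sync(cinfo->message_lockres);
    out_token:
        dlm_unlock_sync(cinfo->token_lockres);
        return error;
    }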
4. Handling Failures

4.1 Node Failure

When a node fails, the DLM informs the cluster with the slot number
of the failed node, and a cluster recovery thread is started on a
surviving node. The cluster recovery thread:

 - acquires the bitmap<number> lock of the failed node
 - opens the bitmap
 - reads the bitmap of the failed node
 - copies the set bits into the local node's bitmap
 - clears the bitmap of the failed node
 - releases the bitmap<number> lock of the failed node
 - initiates a resync of the bitmap on the current node

md_check_recovery() is invoked from within recover_bitmaps();
md_check_recovery() then calls metadata_update_start()/finish(),
which locks the communication channel via lock_comm(). This means
that while one node is resyncing, all other nodes are blocked from
writing anywhere on the array.

The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed the node needs to tell the
other nodes about the areas which are suspended. Before a resync
starts, the node sends out a RESYNCING message with the (lo,hi) range
of the area which needs to be suspended. Each node maintains a
suspend_list, which contains the list of ranges which are currently
suspended. On receiving a RESYNCING message, the node adds the range
to its suspend_list. Similarly, when the node performing the resync
finishes, it sends a RESYNCING message with an empty range to the
other nodes, and the other nodes remove the corresponding entry from
their suspend_list.

A helper function, ->area_resyncing(), can be used to check whether
a particular I/O range should be suspended or not.
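As an illustration of the state each node keeps, the sketch below
shows a simplified suspend_list entry and the overlap test behind
->area_resyncing(). The real code in drivers/md/md-cluster.c keeps
similar per-node state, but the types, naming and locking here are
simplified assumptions:

    /* Illustrative only: a simplified suspend_list entry and the
     * range-overlap check used to decide whether an I/O must wait. */
    struct suspend_info {
        int slot;               /* node performing the resync */
        sector_t lo, hi;        /* suspended range, in sectors */
        struct list_head list;
    };

    /* Does [lo, hi) overlap a range some node is currently
     * resyncing? */
    static bool range_suspended(struct list_head *suspend_list,
                                sector_t lo, sector_t hi)
    {
        struct suspend_info *s;

        list_for_each_entry(s, suspend_list, list)
            if (hi > s->lo && lo < s->hi)
                return true;
        return false;
    }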
4.2 Device Failure

Device failures are handled and communicated with the metadata update
routine. When a node detects a device failure, it does not allow any
further writes to that device until the failure has been acknowledged
by all other nodes.

5. Adding a new Device

For adding a new device, it is necessary that all nodes "see" the new
device to be added. For this, the following algorithm is used:

 1. Node 1 issues "mdadm --manage /dev/mdX --add /dev/sdYY", which
    issues ioctl(ADD_NEW_DISK with disc.state set to
    MD_DISK_CLUSTER_ADD).
 2. Node 1 sends a NEWDISK message with the uuid and the slot number.
 3. The other nodes issue kobject_uevent_env with the uuid and slot
    number. (Steps 4 and 5 could be a udev rule.)
 4. In userspace, the node searches for the disk, perhaps using
    blkid -t SUB_UUID=""
 5. The other nodes issue either of the following, depending on
    whether the disk was found:
      ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE
            and disc.number set to the slot number)
      ioctl(CLUSTERED_DISK_NACK)
 6. The other nodes drop their (CR) lock on "no-new-dev" if the
    device was found.
 7. Node 1 attempts an EX lock on "no-new-dev".
 8. If node 1 gets the lock, it sends METADATA_UPDATED after
    unmarking the disk as SpareLocal.
 9. If node 1 cannot get the "no-new-dev" lock, it fails the
    operation and sends METADATA_UPDATED.
 10. The other nodes learn whether the disk was added or not from
    the following METADATA_UPDATED message.

6. Module interface

There are 17 call-backs which the md core can make to the cluster
module. Understanding these can give a good overview of the whole
process (a sketch of the full call-back table is given at the end of
this document).

6.1 join(nodes) and leave()

These are called when an array is started with a clustered bitmap,
and when the array is stopped. join() ensures the cluster is
available and initializes the various resources.
Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()

Reports the slot number advised by the cluster infrastructure.
The range is from 0 to nodes-1.

6.3 resync_info_update()

This updates the resync range that is stored in the bitmap lock.
The starting point is updated as the resync progresses. The end
point is always the end of the array.
It does *not* send a RESYNCING message.

6.4 resync_start(), resync_finish()

These are called when a resync/recovery/reshape starts or stops.
They update the resyncing range in the bitmap lock and also send
a RESYNCING message. resync_start reports the whole array as
resyncing, resync_finish reports none of it.

resync_finish() also sends a BITMAP_NEEDS_SYNC message which allows
some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(),
    metadata_update_cancel()

metadata_update_start is used to get exclusive access to the
metadata. If a change is still needed once that access is gained,
metadata_update_finish() will send a METADATA_UPDATED message to all
other nodes; otherwise metadata_update_cancel() can be used to
release the lock.

6.6 area_resyncing()

This combines two elements of functionality.

Firstly, it will check if any node is currently resyncing anything
in a given range of sectors. If any resync is found, then the caller
will avoid writing or read-balancing in that range.

Secondly, while node recovery is happening it reports that all areas
are resyncing for READ requests. This avoids races between the
cluster filesystem and the cluster RAID handling a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()

These are used to manage the new-disk protocol described above.
When a new device is added, add_new_disk_start() is called before
it is bound to the array and, if that succeeds, add_new_disk_finish()
is called once the device is fully added.

When a device is added in acknowledgement of a previous request, or
when the device is declared "unavailable", new_disk_ack() is called.

6.8 remove_disk()

This is called when a spare or failed device is removed from the
array. It causes a REMOVE message to be sent to the other nodes.

6.9 gather_bitmaps()

This sends a RE_ADD message to all other nodes and then gathers
bitmap information from all bitmaps. This combined bitmap is then
used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()

These are called when the bitmap is being changed to "none". If a
node plans to clear the bitmap of a clustered array, it needs to
make sure that no other node is using the array. This is achieved
by taking all the bitmap locks within the cluster, and unlocking
them again afterwards.

7. Unsupported features

There are some things which cluster MD does not support yet:

 - updating the size and changing array_sectors.
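For reference, the call-backs described in section 6 are collected
into a single operations table which the cluster module registers
with the md core. The sketch below reconstructs its general shape
from the descriptions above; the authoritative definition is
struct md_cluster_operations in drivers/md/md-cluster.h, and the
exact names and signatures there may differ from this illustration:

    /* Rough shape of the cluster module's call-back table,
     * reconstructed from section 6; illustrative only. */
    struct md_cluster_operations {
        int (*join)(struct mddev *mddev, int nodes);
        int (*leave)(struct mddev *mddev);
        int (*slot_number)(struct mddev *mddev);
        int (*resync_info_update)(struct mddev *mddev,
                                  sector_t lo, sector_t hi);
        int (*resync_start)(struct mddev *mddev);
        int (*resync_finish)(struct mddev *mddev);
        int (*metadata_update_start)(struct mddev *mddev);
        int (*metadata_update_finish)(struct mddev *mddev);
        void (*metadata_update_cancel)(struct mddev *mddev);
        /* 'direction' distinguishes READ from WRITE, see 6.6 */
        int (*area_resyncing)(struct mddev *mddev, int direction,
                              sector_t lo, sector_t hi);
        int (*add_new_disk_start)(struct mddev *mddev,
                                  struct md_rdev *rdev);
        int (*add_new_disk_finish)(struct mddev *mddev);
        int (*new_disk_ack)(struct mddev *mddev, bool ack);
        int (*remove_disk)(struct mddev *mddev, struct md_rdev *rdev);
        int (*gather_bitmaps)(struct md_rdev *rdev);
        int (*lock_all_bitmaps)(struct mddev *mddev);
        void (*unlock_all_bitmaps)(struct mddev *mddev);
    };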