Based on kernel version 4.8. Page generated on 2016-10-06 23:10 EST.
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but there is no simple pattern
that would allow for a compact representation of the mapping such as
dm-stripe.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters. When a LUN
is created it is spread across multiple members. The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used. When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member. Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required. This
forwarding is invisible to the initiator. The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.
An initiator could use a simple round-robin
algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets. However, in this architecture the LUN is
spread with an address region size on the order of tens of megabytes,
which means the resulting table could have more than a million entries
and consume far too much memory.

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover reasons.

The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O. By using a bitmap we are able to
use 4 bits for each address range in a 16-member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...]
    [<dev_path> <offset>]+

<num_paths>
    The number of paths across which to distribute the I/O.

<region_size>
    The number of 512-byte sectors in a region. Each region can be redirected
    to any of the available paths.

<num_optional_args>
    The number of optional arguments.
    Currently, no optional arguments are supported and so this must be zero.

<dev_path>
    The block device that represents a specific path to the device.

<offset>
    The offset of the start of data on the specific <dev_path> (in units
    of 512-byte sectors). This number is added to the sector number when
    forwarding the request to the specific path. Typically it is zero.

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size.

Create a switch device with 64kB region size:
    dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0`
	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set repetitive mapping. This command:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
	:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
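
The set_region_mappings syntax and the region lookup described above can be
modelled with a short Python sketch. It is only an illustration of the
documented semantics, not the kernel parser. The function names
apply_region_mappings and route, and the use of a dict in place of the
packed bitmap, are assumptions made for this example. The sketch also
assumes that the first argument carries an explicit index and that an
R<n>,<m> argument only repeats regions that were set earlier in the same
table.

```python
def apply_region_mappings(args, num_paths, table=None):
    """Apply set_region_mappings arguments to a {region: path} dict.

    Illustrative model only: this is not the kernel parser.
    """
    table = {} if table is None else table
    index = None  # no region has been addressed yet
    for arg in args:
        if arg.startswith("R"):
            # R<n>,<m>: repeat the last <n> mappings into the next <m> slots.
            n_text, m_text = arg[1:].split(",")
            n, m = int(n_text, 16), int(m_text, 16)
            if index is None or n < 1:
                raise ValueError("nothing to repeat for " + arg)
            for _ in range(m):
                index += 1
                if index - n not in table:
                    raise ValueError("region %x was never set" % (index - n))
                table[index] = table[index - n]
            continue
        # <index>:<path_nr>, both hexadecimal without a 0x prefix.
        index_text, path_text = arg.split(":")
        if index_text:
            index = int(index_text, 16)
        elif index is None:
            # Assumption of this sketch: the first argument names a region.
            raise ValueError("first argument needs an explicit index")
        else:
            index += 1  # omitted index means previous index + 1
        path = int(path_text, 16)
        if not 0 <= path < num_paths:
            raise ValueError("path %x out of range" % path)
        table[index] = path
    return table


def route(sector, region_size, table, offsets):
    """Return (path_nr, sector on that path) for a sector of the switch device."""
    path = table[sector // region_size]
    # The per-path <offset> is added to the sector number when forwarding.
    return path, sector + offsets[path]
```

With the seven mappings from the example above and a region size of 128
sectors, route(130, 128, table, [0, 0, 0]) selects path 1, because sector
130 falls in region 1. Applying "1000:1 :2 R2,10" yields the same table as
the fully written-out command shown above.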