Based on kernel version 4.10.8. Page generated on 2017-04-01 14:44 EST.

# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are (the poorly named) CONFIG_DEBUG_RODATA and
CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at __init time, these can
be marked with the (new and under development) __ro_after_init
attribute.

What remains are variables that are updated rarely (e.g. GDT).
These will need another infrastructure (similar to the temporary
exceptions made to kernel code mentioned above) that allows them to
spend the rest of their lifetime read-only. (For example, when being
updated, only the CPU thread performing the update would be given
uninterruptible write access to the memory.)

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface.
(The on-demand loading of modules via their predefined subsystems,
e.g. MODULE_ALIAS_*, is considered "expected" here, though additional
consideration should be given even to these.) For example, loading a
filesystem module via an unprivileged socket API is nonsense: only the
root or physically local user should trigger filesystem module loading.
(And even this can be up for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.

### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations.
With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.


## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.
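
As a rough illustration of literal blinding, the userspace C sketch below
splits a caller-supplied immediate into a masked value and a key, so the
original literal never appears verbatim in what would be emitted. The names
`blinded_imm`, `blind_imm`, and `unblind_imm` are hypothetical and are not
kernel functions; the sketch assumes the key comes from a working RNG and
that a real JIT would emit actual machine instructions instead of a struct.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for a two-instruction JIT sequence:
 *   load  reg, masked
 *   xor   reg, key
 * Neither emitted operand equals the attacker-chosen literal
 * (as long as key is non-zero).
 */
struct blinded_imm {
    uint32_t masked; /* imm ^ key, emitted as the load operand */
    uint32_t key;    /* secret value, emitted as the xor operand */
};

/* Split an immediate into a masked value plus the key (assumed random). */
static struct blinded_imm blind_imm(uint32_t imm, uint32_t key)
{
    struct blinded_imm out = { .masked = imm ^ key, .key = key };
    return out;
}

/* What the emitted code computes at run time: the original immediate. */
static uint32_t unblind_imm(struct blinded_imm b)
{
    return b.masked ^ b.key;
}
```

The defense is only as strong as the key: if the key is predictable or
leaks, an attacker can pick an immediate whose masked form is still the
byte sequence they want.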

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.
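
To make the effect of layout randomization concrete, the sketch below
declares the same set of fields in two different orders, standing in for
two differently randomized builds. The struct and function names are
hypothetical, and the reordering is done by hand here; this is only an
illustration of why a hard-coded offset stops working, not of how the
randomization itself is performed.

```c
#include <stddef.h>

/* Hypothetical layout as it might appear in one build. */
struct ops_build_a {
    void (*handler)(void);
    unsigned long flags;
    int id;
};

/* Same fields, different order, as another build might lay them out. */
struct ops_build_b {
    int id;
    void (*handler)(void);
    unsigned long flags;
};

/* Offset an exploit would need in order to overwrite the function pointer. */
static size_t handler_offset_a(void)
{
    return offsetof(struct ops_build_a, handler);
}

static size_t handler_offset_b(void)
{
    return offsetof(struct ops_build_b, handler);
}
```

An exploit that writes to the offset returned by `handler_offset_a()`
would corrupt a different field (or padding) on the second layout. Code
that accesses members by name is unaffected, since the compiler resolves
the offset for each build.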


## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.

### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.
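
One way to picture destination tracking is a formatting helper that is
told where its output is going and censors addresses accordingly. The
userspace C sketch below assumes a hypothetical `format_addr()` helper
with an explicit `dest_is_user` flag; in a real implementation the
destination would be tracked by the infrastructure, not passed by hand.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format an address into buf as 16 hex digits.
 * When the buffer is destined for userspace, the real address is
 * replaced with zero so it cannot be used to defeat KASLR.
 * Returns the number of characters snprintf() would have written.
 */
static int format_addr(char *buf, size_t len, const void *addr,
                       bool dest_is_user)
{
    uintptr_t shown = dest_is_user ? 0 : (uintptr_t)addr;

    return snprintf(buf, len, "%016llx", (unsigned long long)shown);
}
```

Making the censored form the same width as a real address keeps the
output format stable for any parser reading it, while removing the
sensitive value.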