Based on kernel version 3.9.
-*-Mode: outline-*-

                Light-weight System Calls for IA-64
                -----------------------------------

                        Started: 13-Jan-2003
                    Last update: 27-Sep-2003

                      David Mosberger-Tang
                  <firstname.lastname@example.org>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel. We call this mode the
"fsys-mode". To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory. The user-level state is saved
        in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory. The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state in which all
        interruption-handlers start execution. The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode. Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).

* How to tell fsys-mode

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet. For convenience, the header file <asm-ia64/ptrace.h> provides
three macros:

        user_mode(regs)
        user_stack(task,regs)
        fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure. The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs. user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s). Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression:

        !user_mode(regs) && user_stack(task,regs)

* How to write an fsyscall handler

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table). This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall(). This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler. For performance-critical system
calls, it is possible to write a hand-tuned fsyscall_handler. For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.
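
The table layout described above (every slot defaults to the fallback
routine, and only selected slots are overridden with a hand-tuned
handler) can be modelled in portable C. This is only an illustrative
sketch, not the contents of fsys.S: the real table and handlers are
written in IA-64 assembly, and every name below (the demo_ prefix, the
syscall numbers, the table size and the constant returned by the
getpid stand-in) is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical syscall numbers and table size; not the real ia64 values. */
enum { DEMO_NR_GETPID = 0, DEMO_NR_WRITE = 1, DEMO_NR_SYSCALLS = 4 };

/* Which path served the call. DEMO_BAD_NUMBER is this model's own choice
 * for out-of-range numbers. */
enum demo_path { DEMO_FAST_PATH, DEMO_FALLBACK_PATH, DEMO_BAD_NUMBER };

struct demo_result {
    enum demo_path path;
    long value;
};

typedef struct demo_result (*demo_handler_t)(long arg);

/* Stand-in for fsys_fallback_syscall(): in the kernel this enters full
 * kernel mode; here it only records that the slow path was taken. */
static struct demo_result demo_fallback_syscall(long arg)
{
    struct demo_result r = { DEMO_FALLBACK_PATH, arg };
    return r;
}

/* Stand-in for a hand-tuned handler such as fsys_getpid(); 42 is an
 * invented value. */
static struct demo_result demo_fsys_getpid(long arg)
{
    struct demo_result r = { DEMO_FAST_PATH, 42 };
    (void)arg;
    return r;
}

static demo_handler_t demo_table[DEMO_NR_SYSCALLS];

/* Fill every slot with the fallback, then override the tuned entries. */
void demo_table_init(void)
{
    size_t i;

    for (i = 0; i < (size_t)DEMO_NR_SYSCALLS; i++)
        demo_table[i] = demo_fallback_syscall;
    demo_table[DEMO_NR_GETPID] = demo_fsys_getpid;
}

/* Look up the handler for a syscall number and run it. */
struct demo_result demo_dispatch(unsigned long nr, long arg)
{
    if (nr >= (unsigned long)DEMO_NR_SYSCALLS) {
        struct demo_result bad = { DEMO_BAD_NUMBER, -1 };
        return bad;
    }
    assert(demo_table[nr] != NULL);
    return demo_table[nr](arg);
}
```

After demo_table_init(), dispatching DEMO_NR_GETPID takes the fast
path, while any number without a tuned handler lands in the fallback.
Adding a new fast handler is then a matter of overriding one slot.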

The entry and exit-state of an fsyscall handler is as follows:

** Machine state on entry to fsyscall handler:

 - r10     = 0
 - r11     = saved ar.pfs (a user-level value)
 - r15     = system call number
 - r16     = "current" task pointer (in normal kernel-mode, this is in r13)
 - r32-r39 = system call arguments
 - b6      = return address (a user-level value)
 - ar.pfs  = previous frame-state (a user-level value)
 - PSR.be  = cleared to zero (i.e., little-endian byte order is in effect)
 - all other registers may contain values passed in from user-mode

** Required machine state on exit from fsyscall handler:

 - r11     = saved ar.pfs (as passed into the fsyscall handler)
 - r15     = system call number (as passed into the fsyscall handler)
 - r32-r39 = system call arguments (as passed into the fsyscall handler)
 - b6      = return address (as passed into the fsyscall handler)
 - ar.pfs  = previous frame-state (as passed into the fsyscall handler)

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

 o Fsyscall-handlers MUST check for any pending work in the flags
   member of the thread-info structure and if any of the
   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
   doing a full system call (by calling fsys_fallback_syscall).

 o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
   r15, b6, and ar.pfs) because they will be needed in case of a
   system call restart. Of course, all "preserved" registers also
   must be preserved, in accordance with the normal calling conventions.

 o Fsyscall-handlers MUST check argument registers for containing a
   NaT value before using them in any way that could trigger a
   NaT-consumption fault. If a system call argument is found to
   contain a NaT value, an fsyscall-handler may return immediately
   with r8=EINVAL, r10=-1.

 o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
   any other operation that would trigger mandatory RSE
   (register-stack engine) traffic.

 o Fsyscall-handlers MUST NOT write to any stacked registers because
   it is not safe to assume that user-level called a handler with the
   proper number of arguments.

 o Fsyscall-handlers need to be careful when accessing per-CPU variables:
   unless proper safe-guards are taken (e.g., interruptions are avoided),
   execution may be pre-empted and resumed on another CPU at any given
   time.

 o Fsyscall-handlers must be careful not to leak sensitive kernel
   information back to user-level. In particular, before returning to
   user-level, care needs to be taken to clear any scratch registers
   that could contain sensitive information (note that the current
   task pointer is not considered sensitive: it's already exposed
   through ar.k6).

 o Fsyscall-handlers MUST NOT access user-memory without first
   validating access-permission (this can typically be done via
   probe.r.fault and/or probe.w.fault) and without guarding against
   memory access exceptions (this can be done with the EX() macros
   defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead. For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed. In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

* Signal handling

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.
This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately. When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur. The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

* PSR Handling

The "epc" instruction doesn't change the contents of PSR at all. This
is in contrast to a regular interruption, which clears almost all
bits. Because of that, some care needs to be taken to ensure things
work as expected. The following discussion describes how each PSR bit
is handled.

PSR.be  Cleared when entering fsys-mode. A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed. PSR.be is normally NOT
        restored upon return from an fsys-mode handler. In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.mfh Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.ic  Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.dfh Unchanged. Note: fsys-mode handlers must not write floating-point registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged. The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than 3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect. If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3. Note: the
        taken-branch trap would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe. If the system call number is invalid,
        the fsys-mode handler will return directly to user-level. This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect. If set, "epc" will cause a Single Step Trap to
        be taken. The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged. Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted. If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged. Note: the ia64 linux kernel never sets this bit.

* Using fast system calls

To use fast system calls, userspace applications need only call
__kernel_syscall_via_epc().
For example:

-- example fgettimeofday() call --
-- fgettimeofday.S --

#include <asm/asmmacro.h>

GLOBAL_ENTRY(fgettimeofday)
.prologue
.save ar.pfs, r11
mov r11 = ar.pfs
.body

movl r2 = 0xa000000000020660;;  // gate address (64-bit immediate, hence movl)
                                // found by inspection of System.map for the
                                // __kernel_syscall_via_epc() function. See
                                // below for how to do this for real.

mov b7 = r2
mov r15 = 1087                  // gettimeofday syscall
;;
br.call.sptk.many b6 = b7
;;

.restore sp

mov ar.pfs = r11
br.ret.sptk.many rp;;           // return to caller
END(fgettimeofday)

-- end fgettimeofday.S --

In reality, the gate address is obtained from two extra values passed
via the ELF auxiliary vector (include/asm-ia64/elf.h):

 o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
 o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page. It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
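
As a rough illustration of the auxiliary-vector lookup described above,
the following C sketch reads both entries with getauxval(), which
glibc provides in <sys/auxv.h> (version 2.16 and later; that it is
available on the target system is an assumption). The struct and
function names are invented for the example. On architectures other
than ia64 the AT_SYSINFO entry may be absent, in which case
getauxval() returns 0, and AT_SYSINFO_EHDR refers to that
architecture's own kernel-provided DSO rather than the ia64 gate page.

```c
#include <assert.h>
#include <elf.h>
#include <stdio.h>
#include <sys/auxv.h>

/* Hypothetical container for the two auxiliary-vector values. */
struct gate_info {
    unsigned long sysinfo;       /* AT_SYSINFO, 0 if not supplied */
    unsigned long sysinfo_ehdr;  /* AT_SYSINFO_EHDR, 0 if not supplied */
};

/* Read both entries from the process's auxiliary vector. */
struct gate_info read_gate_info(void)
{
    struct gate_info info;

    info.sysinfo = getauxval(AT_SYSINFO);
    info.sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
    return info;
}

/* Return 1 if addr is non-zero and the bytes there start with the ELF
 * magic number, as expected for a mapped ELF shared object. */
int ehdr_looks_like_elf(unsigned long addr)
{
    const unsigned char *p = (const unsigned char *)addr;

    if (addr == 0)
        return 0;
    return p[EI_MAG0] == ELFMAG0 && p[EI_MAG1] == ELFMAG1 &&
           p[EI_MAG2] == ELFMAG2 && p[EI_MAG3] == ELFMAG3;
}

/* Print the two values in hexadecimal for inspection. */
void print_gate_info(FILE *out, struct gate_info info)
{
    assert(out != NULL);
    fprintf(out, "AT_SYSINFO      = 0x%lx\n", info.sysinfo);
    fprintf(out, "AT_SYSINFO_EHDR = 0x%lx\n", info.sysinfo_ehdr);
}
```

Calling print_gate_info(stdout, read_gate_info()) from a test program
shows what the running kernel supplies. A real caller on ia64 would
branch to the AT_SYSINFO address instead of hard-coding a value taken
from System.map.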