Documentation / ia64 / fsys.txt


Based on kernel version 4.16.1. Page generated on 2018-04-09 11:53 EST.

1	-*-Mode: outline-*-
2	
3			Light-weight System Calls for IA-64
4			-----------------------------------
5	
6			        Started: 13-Jan-2003
7			    Last update: 27-Sep-2003
8	
9		              David Mosberger-Tang
10			      <davidm@hpl.hp.com>
11	
12	Using the "epc" instruction effectively introduces a new mode of
13	execution to the ia64 linux kernel.  We call this mode the
14	"fsys-mode".  To recap, the normal states of execution are:
15	
16	  - kernel mode:
17		Both the register stack and the memory stack have been
18		switched over to kernel memory.  The user-level state is saved
19		in a pt_regs structure at the top of the kernel memory stack.
20	
21	  - user mode:
22		Both the register stack and the memory stack are in
23		user memory.  The user-level state is contained in the
24		CPU registers.
25	
26	  - bank 0 interruption-handling mode:
27		This is the non-interruptible state which all
28		interruption-handlers start execution in.  The user-level
29		state remains in the CPU registers and some kernel state may
30		be stored in bank 0 of registers r16-r31.
31	
32	In contrast, fsys-mode has the following special properties:
33	
34	  - execution is at privilege level 0 (most-privileged)
35	
36	  - CPU registers may contain a mixture of user-level and kernel-level
37	    state (it is the responsibility of the kernel to ensure that no
38	    security-sensitive kernel-level state is leaked back to
39	    user-level)
40	
41	  - execution is interruptible and preemptible (an fsys-mode handler
42	    can disable interrupts and avoid all other interruption-sources
43	    to prevent preemption)
44	
45	  - neither the memory-stack nor the register-stack can be trusted while
46	    in fsys-mode (they point to the user-level stacks, which may
47	    be invalid, or completely bogus addresses)
48	
49	In summary, fsys-mode is much more similar to running in user-mode
50	than it is to running in kernel-mode.  However, because execution is
51	at privilege level 0, fsys-mode code must be written with some care
52	(see below).
53	
54	
55	* How to tell fsys-mode
56	
57	Linux operates in fsys-mode when (a) the privilege level is 0 (most
58	privileged) and (b) the stacks have NOT been switched to kernel memory
59	yet.  For convenience, the header file <asm/ptrace.h> provides
60	three macros:
61	
62		user_mode(regs)
63		user_stack(task,regs)
64		fsys_mode(task,regs)
65	
66	The "regs" argument is a pointer to a pt_regs structure.  The "task"
67	argument is a pointer to the task structure to which the "regs"
68	pointer belongs.  user_mode() returns TRUE if the CPU state pointed
69	to by "regs" was executing in user mode (privilege level 3).
70	user_stack() returns TRUE if the state pointed to by "regs" was
71	executing on the user-level stack(s).  Finally, fsys_mode() returns
72	TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
73	The fsys_mode() macro is equivalent to the expression:
74	
75		!user_mode(regs) && user_stack(task,regs)
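
The relationship between the three predicates can be modelled in
portable C.  The sketch below is not the kernel's definition: the
struct layouts and every name ending in "_model" are assumptions made
for illustration.  It assumes that a context was running on the
user-level stacks exactly when its saved pt_regs is the frame at the
very top of the task's kernel stack, because nothing had been pushed
onto the kernel stack before the interruption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pt_regs: only the saved privilege level. */
struct pt_regs_model {
	unsigned int cpl;	/* 0 = most privileged, 3 = user */
};

/* Hypothetical stand-in for a task and its kernel stack. */
struct task_model {
	struct pt_regs_model nested;	/* a frame saved deeper in the kernel stack */
	struct pt_regs_model top;	/* the frame at the top of the kernel stack */
};

/* TRUE if the saved state was executing at privilege level 3. */
static int user_mode_model(const struct pt_regs_model *regs)
{
	assert(regs != NULL);
	return regs->cpl == 3;
}

/* TRUE if the saved state was executing on the user-level stack(s). */
static int user_stack_model(const struct task_model *task,
			    const struct pt_regs_model *regs)
{
	assert(task != NULL && regs != NULL);
	return regs == &task->top;
}

/* Privilege level 0, but the stacks have not been switched yet. */
static int fsys_mode_model(const struct task_model *task,
			   const struct pt_regs_model *regs)
{
	return !user_mode_model(regs) && user_stack_model(task, regs);
}
```

In this model, a frame at the top of the stack with cpl 0 is in
fsys-mode, the same frame with cpl 3 is in user mode, and any nested
frame is in (full) kernel mode.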
76	
77	* How to write an fsyscall handler
78	
79	The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
80	(fsyscall_table).  This table contains one entry for each system call.
81	By default, a system call is handled by fsys_fallback_syscall().  This
82	routine takes care of entering (full) kernel mode and calling the
83	normal Linux system call handler.  For performance-critical system
84	calls, it is possible to write a hand-tuned fsyscall_handler.  For
85	example, fsys.S contains fsys_getpid(), which is a hand-tuned version
86	of the getpid() system call.
87	
88	The entry and exit-state of an fsyscall handler is as follows:
89	
90	** Machine state on entry to fsyscall handler:
91	
92	 - r10	  = 0
93	 - r11	  = saved ar.pfs (a user-level value)
94	 - r15	  = system call number
95	 - r16	  = "current" task pointer (in normal kernel-mode, this is in r13)
96	 - r32-r39 = system call arguments
97	 - b6	  = return address (a user-level value)
98	 - ar.pfs = previous frame-state (a user-level value)
99	 - PSR.be = cleared to zero (i.e., little-endian byte order is in effect)
100	 - all other registers may contain values passed in from user-mode
101	
102	** Required machine state on exit from fsyscall handler:
103	
104	 - r11	  = saved ar.pfs (as passed into the fsyscall handler)
105	 - r15	  = system call number (as passed into the fsyscall handler)
106	 - r32-r39 = system call arguments (as passed into the fsyscall handler)
107	 - b6	  = return address (as passed into the fsyscall handler)
108	 - ar.pfs = previous frame-state (as passed into the fsyscall handler)
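
The exit requirements amount to "everything needed for a system call
restart is returned unmodified".  A minimal C sketch of that invariant
follows; the struct and the function name are hypothetical and merely
mirror the register list above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of the registers a handler must preserve. */
struct fsys_regs_model {
	uint64_t r11;		/* saved ar.pfs */
	uint64_t r15;		/* system call number */
	uint64_t r32_r39[8];	/* system call arguments */
	uint64_t b6;		/* return address */
	uint64_t ar_pfs;	/* previous frame-state */
};

/* Returns 1 if the exit state still matches the entry state. */
static int exit_state_ok_model(const struct fsys_regs_model *entry,
			       const struct fsys_regs_model *exit_state)
{
	assert(entry != NULL && exit_state != NULL);
	return entry->r11 == exit_state->r11 &&
	       entry->r15 == exit_state->r15 &&
	       memcmp(entry->r32_r39, exit_state->r32_r39,
		      sizeof(entry->r32_r39)) == 0 &&
	       entry->b6 == exit_state->b6 &&
	       entry->ar_pfs == exit_state->ar_pfs;
}
```

Registers that are not listed (for example r8 and r10, which carry the
result) are deliberately left out of the comparison.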
109	
110	Fsyscall handlers can execute with very little overhead, but with that
111	speed comes a set of restrictions:
112	
113	 o Fsyscall-handlers MUST check for any pending work in the flags
114	   member of the thread-info structure and if any of the
115	   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
116	   doing a full system call (by calling fsys_fallback_syscall).
117	
118	 o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
119	   r15, b6, and ar.pfs) because they will be needed in case of a
120	   system call restart.  Of course, all "preserved" registers also
121	   must be preserved, in accordance with the normal calling conventions.
122	
123	 o Fsyscall-handlers MUST check argument registers for containing a
124	   NaT value before using them in any way that could trigger a
125	   NaT-consumption fault.  If a system call argument is found to
126	   contain a NaT value, an fsyscall-handler may return immediately
127	   with r8=EINVAL, r10=-1.
128	
129	 o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
130	   any other operation that would trigger mandatory RSE
131	   (register-stack engine) traffic.
132	
133	 o Fsyscall-handlers MUST NOT write to any stacked registers because
134	   it is not safe to assume that user-level called a handler with the
135	   proper number of arguments.
136	
137	 o Fsyscall-handlers need to be careful when accessing per-CPU variables:
138	   unless proper safe-guards are taken (e.g., interruptions are avoided),
139	   execution may be pre-empted and resumed on another CPU at any given
140	   time.
141	
142	 o Fsyscall-handlers must be careful not to leak sensitive kernel
143	   information back to user-level.  In particular, before returning to
144	   user-level, care needs to be taken to clear any scratch registers
145	   that could contain sensitive information (note that the current
146	   task pointer is not considered sensitive: it's already exposed
147	   through ar.k6).
148	
149	 o Fsyscall-handlers MUST NOT access user-memory without first
150	   validating access-permission (this can be done typically via
151	   probe.r.fault and/or probe.w.fault) and without guarding against
152	   memory access exceptions (this can be done with the EX() macros
153	   defined by asmmacro.h).
154	
155	The above restrictions may seem draconian, but remember that it's
156	possible to relax some of them at the cost of slightly higher
157	overhead.  For example, if an fsyscall-handler could benefit
158	from the shadow register bank, it could temporarily disable PSR.i and
159	PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
160	needed.  In other words, following the above rules yields extremely
161	fast system call execution (while fully preserving system call
162	semantics), but there is also a lot of flexibility in handling more
163	complicated cases.
164	
165	* Signal handling
166	
167	The delivery of (asynchronous) signals must be delayed until fsys-mode
168	is exited.  This is accomplished with the help of the lower-privilege
169	transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
170	checks whether the interrupted task was in fsys-mode and, if so, sets
171	PSR.lp and returns immediately.  When fsys-mode is exited via the
172	"br.ret" instruction that lowers the privilege level, a trap will
173	occur.  The trap handler clears PSR.lp again and returns immediately.
174	The kernel exit path then checks for and delivers any pending signals.
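
The deferral sequence can be followed step by step in a small state
machine.  This is a sketch under simplifying assumptions: the struct
and function names are hypothetical, a single counter stands in for
signal delivery, and the trap handler and the kernel exit path are
folded into one function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the state involved in delayed delivery. */
struct cpu_model {
	int in_fsys_mode;	/* interrupted context was in fsys-mode */
	int psr_lp;		/* lower-privilege transfer trap enabled */
	int signal_pending;	/* an asynchronous signal is waiting */
	int delivered;		/* number of signals delivered so far */
};

/* Models the check made when pending work is noticed.
 * Returns 1 if delivery had to be deferred. */
static int notify_resume_model(struct cpu_model *c)
{
	assert(c != NULL);
	if (c->in_fsys_mode) {
		c->psr_lp = 1;	/* arm the trap and return immediately */
		return 1;
	}
	if (c->signal_pending) {
		c->signal_pending = 0;
		c->delivered++;
	}
	return 0;
}

/* Models the "br.ret" that lowers the privilege level. */
static void br_ret_model(struct cpu_model *c)
{
	assert(c != NULL);
	c->in_fsys_mode = 0;
	if (c->psr_lp) {
		c->psr_lp = 0;		/* trap handler clears PSR.lp */
		notify_resume_model(c);	/* exit path delivers the signal */
	}
}
```

Calling notify_resume_model() while in_fsys_mode is set only arms
psr_lp; the signal is counted as delivered once br_ret_model() runs.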
175	
176	* PSR Handling
177	
178	The "epc" instruction doesn't change the contents of PSR at all.  This
179	is in contrast to a regular interruption, which clears almost all
180	bits.  Because of that, some care needs to be taken to ensure things
181	work as expected.  The following discussion describes how each PSR bit
182	is handled.
183	
184	PSR.be	Cleared when entering fsys-mode.  A srlz.d instruction is used
185		to ensure the CPU is in little-endian mode before the first
186		load/store instruction is executed.  PSR.be is normally NOT
187		restored upon return from an fsys-mode handler.  In other
188		words, user-level code must not rely on PSR.be being preserved
189		across a system call.
190	PSR.up	Unchanged.
191	PSR.ac	Unchanged.
192	PSR.mfl	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
193	PSR.mfh	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
194	PSR.ic	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
195	PSR.i	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
196	PSR.pk	Unchanged.
197	PSR.dt	Unchanged.
198	PSR.dfl	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
199	PSR.dfh	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
200	PSR.sp	Unchanged.
201	PSR.pp	Unchanged.
202	PSR.di	Unchanged.
203	PSR.si	Unchanged.
204	PSR.db	Unchanged.  The kernel prevents user-level from setting a hardware
205		breakpoint that triggers at any privilege level other than 3 (user-mode).
206	PSR.lp	Unchanged.
207	PSR.tb	Lazy redirect.  If a taken-branch trap occurs while in
208		fsys-mode, the trap-handler modifies the saved machine state
209		such that execution resumes in the gate page at
210		syscall_via_break(), with privilege level 3.  Note: the
211		taken-branch trap would occur on the branch invoking the
212		fsyscall-handler, at which point, by definition, a syscall
213		restart is still safe.  If the system call number is invalid,
214		the fsys-mode handler will return directly to user-level.  This
215		return will trigger a taken-branch trap, but since the trap is
216		taken _after_ restoring the privilege level, the CPU has already
217		left fsys-mode, so no special treatment is needed.
218	PSR.rt	Unchanged.
219	PSR.cpl	Cleared to 0.
220	PSR.is	Unchanged (guaranteed to be 0 on entry to the gate page).
221	PSR.mc	Unchanged.
222	PSR.it	Unchanged (guaranteed to be 1).
223	PSR.id	Unchanged.  Note: the ia64 linux kernel never sets this bit.
224	PSR.da	Unchanged.  Note: the ia64 linux kernel never sets this bit.
225	PSR.dd	Unchanged.  Note: the ia64 linux kernel never sets this bit.
226	PSR.ss	Lazy redirect.  If set, "epc" will cause a Single Step Trap to
227		be taken.  The trap handler then modifies the saved machine
228		state such that execution resumes in the gate page at
229		syscall_via_break(), with privilege level 3.
230	PSR.ri	Unchanged.
231	PSR.ed	Unchanged.  Note: This bit could only have an effect if an fsys-mode
232		handler performed a speculative load that gets NaTted.  If so, this
233		would be the normal & expected behavior, so no special treatment is
234		needed.
235	PSR.bn	Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
236		Doing so requires clearing PSR.i and PSR.ic as well.
237	PSR.ia	Unchanged.  Note: the ia64 linux kernel never sets this bit.
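
The net effect of the table can be summarised as a bit-mask operation:
only PSR.be and PSR.cpl differ from the user-level PSR once a handler
is running, and PSR.tb and PSR.ss are the two bits that lead to a lazy
redirect.  The sketch below models this in C.  The bit positions are
assumptions based on the usual IA-64 PSR layout and should be checked
against the architecture manual; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PSR bit positions (illustration only). */
#define PSR_BE_MODEL	(UINT64_C(1) << 1)	/* big-endian */
#define PSR_I_MODEL	(UINT64_C(1) << 14)	/* interrupts enabled */
#define PSR_TB_MODEL	(UINT64_C(1) << 26)	/* taken-branch trap */
#define PSR_CPL_MODEL	(UINT64_C(3) << 32)	/* current privilege level */
#define PSR_SS_MODEL	(UINT64_C(1) << 40)	/* single step */

/* PSR as seen by an fsyscall handler, given the user-level PSR:
 * PSR.be and PSR.cpl are cleared, every other bit is unchanged. */
static uint64_t fsys_entry_psr_model(uint64_t user_psr)
{
	return user_psr & ~(PSR_BE_MODEL | PSR_CPL_MODEL);
}

/* Returns 1 if a bit that causes a lazy redirect is set. */
static int needs_lazy_redirect_model(uint64_t psr)
{
	return (psr & (PSR_TB_MODEL | PSR_SS_MODEL)) != 0;
}
```

Note that the model only reports whether a redirect-causing bit is
set; whether a trap is actually taken depends on the instruction being
executed, as described in the table.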
238	
239	* Using fast system calls
240	
241	To use fast system calls, userspace applications need only call
242	__kernel_syscall_via_epc().  For example:
243	
244	-- example fgettimeofday() call --
245	-- fgettimeofday.S --
246	
247	#include <asm/asmmacro.h>
248	
249	GLOBAL_ENTRY(fgettimeofday)
250	.prologue
251	.save ar.pfs, r11
252	mov r11 = ar.pfs
253	.body 
254	
255	movl r2 = 0xa000000000020660;;  // gate address
256				       // found by inspection of System.map for the 
257				       // __kernel_syscall_via_epc() function.  See
258				       // below for how to do this for real.
259	
260	mov b7 = r2
261	mov r15 = 1087		       // gettimeofday syscall
262	;;
263	br.call.sptk.many b6 = b7
264	;;
265	
266	.restore sp
267	
268	mov ar.pfs = r11
269	br.ret.sptk.many rp;;	      // return to caller
270	END(fgettimeofday)
271	
272	-- end fgettimeofday.S --
273	
274	In reality, getting the gate address is accomplished by two extra
275	values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):
276	
277	 o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
278	 o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO
279	
280	The ELF DSO is a pre-linked library that is mapped in by the kernel at
281	the gate page.  It is a proper ELF shared object so, with a dynamic
282	loader that recognises the library, you should be able to make calls to
283	the exported functions within it as with any other shared library.
284	AT_SYSINFO points into the kernel DSO at the
285	__kernel_syscall_via_epc() function for historical reasons (it was
286	used before the kernel DSO) and as a convenience.
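
From C, the two auxiliary vector entries can be read without walking
the vector by hand.  The sketch below assumes a C library that
provides getauxval() in <sys/auxv.h> (glibc does) and the AT_SYSINFO
and AT_SYSINFO_EHDR constants in <elf.h>; the struct and function
names are hypothetical.  On architectures where the kernel does not
supply an entry, getauxval() returns 0.

```c
#include <assert.h>
#include <elf.h>	/* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <stdio.h>
#include <sys/auxv.h>	/* getauxval() */

/* Hypothetical container for the two gate-related auxv values. */
struct gate_info_model {
	unsigned long sysinfo;		/* address of the syscall entry point */
	unsigned long sysinfo_ehdr;	/* address of the gate ELF DSO */
};

/* Read both values; a field is 0 if the kernel did not supply it. */
static struct gate_info_model read_gate_info_model(void)
{
	struct gate_info_model g;

	g.sysinfo = getauxval(AT_SYSINFO);
	g.sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
	return g;
}

/* Print whatever was found, for inspection. */
static void print_gate_info_model(const struct gate_info_model *g)
{
	assert(g != NULL);
	if (g->sysinfo)
		printf("AT_SYSINFO      = %#lx\n", g->sysinfo);
	else
		printf("AT_SYSINFO      not supplied\n");
	if (g->sysinfo_ehdr)
		printf("AT_SYSINFO_EHDR = %#lx\n", g->sysinfo_ehdr);
	else
		printf("AT_SYSINFO_EHDR not supplied\n");
}
```

An assembly stub such as fgettimeofday() above could then branch to
the address in the sysinfo field instead of using a hard-coded
constant.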