GCC and Linux Kernel Idioms
=======================================================================
The Assembler Directive (a gcc extension)

General format:

    asm ("" : "" () : "" ());

In each operand the quoted string is the register (Reg) constraint and
the parenthesized expression is the C variable:

    asm ("instruction %op1,%op2"
         : "qualifier-datatype" (result)
         : "datatype" (input)
         : changed regs);

Example:

    asm ("fsinx %1,%0" : "=f" (result) : "f" (angle));

This executes the floating-point sine instruction fsinx (a Motorola
68881 FPU instruction). The input is %1, the C variable "angle" of type
float; the output is %0, a destructive ("=") write of a float to the C
variable "result".

http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Extended-Asm.html#Extended-Asm
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
=======================================================================
likely() and unlikely() (a gcc extension)

likely() and unlikely() are Linux macros that wrap GCC's
branch-prediction hint, __builtin_expect(). The macros are located in
include/linux/compiler.h:

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

And you would see these macros used like this:

    if (unlikely(PossibleErrorFlag)) {
        free(resource);
        errno = EBUSTED;
        goto out;
    }

Given the typical behavior of the kernel, this routine may test
"PossibleErrorFlag" billions and billions of times without it ever
being true. The compiler, on the other hand, cannot tell, and may put
the error-processing code inline with a jump around it for the typical
false condition. But then the error-processing code wastes instruction
cache space, and the jump makes the CPU flush its prefetched
instruction pipeline, go to another location, and start decoding
instructions there. Thus, the unlikely() macro instructs the compiler
to place the error code elsewhere and jump only on the true condition.
In this way the next instruction is local and in the decoded
instruction cache.
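As a rough user-space illustration of the hint described above, the
sketch below repeats the two macro definitions and applies unlikely()
to a rare error check. The function sum_bytes() is a hypothetical
helper invented for this example, and it assumes GCC or Clang, since
__builtin_expect() is a compiler extension.

```c
#include <stddef.h>

/* Same definitions as include/linux/compiler.h, repeated here so the
 * sketch builds in user space with GCC or Clang. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sum a buffer, treating a NULL buffer as the
 * rare error case. */
long sum_bytes(const unsigned char *buf, size_t len)
{
    long total = 0;
    size_t i;

    if (unlikely(buf == NULL))
        return -1;           /* cold path: hinted as rarely taken */

    for (i = 0; i < len; i++)
        total += buf[i];     /* hot path: hinted as the common case */

    return total;
}
```

The macros do not change the result of the test; they only tell the
compiler which side of the branch to favor when it lays out the code.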
Even though this is a small manual optimization, it has large payoffs
over the life of the computer, because the branch is executed so many
times.
=======================================================================
__attribute__ directive (a gcc extension)

The CPP_ASMLINKAGE __attribute__((regparm(0))) Macro

On i386 the asmlinkage macro is defined as:

    #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

where CPP_ASMLINKAGE expands to extern "C" when the header is compiled
as C++ (and to nothing in C), so the result is effectively:

    extern "C" __attribute__((regparm(0)))

This is used in the system call interface, where C library routines
enter the kernel after setting up their arguments and executing the
trap instruction (INT 0x80). The "asmlinkage" tag really should read
"C language linkage."

GCC accepts the i386-specific __attribute__((regparm(0))), which causes
the compiler to pass integer arguments on the stack instead of in
registers. Functions that take a variable number of arguments continue
to be passed all of their arguments on the stack.

The typical assembly kernel entry format is the system call number as
its first argument in register EAX and up to four more arguments passed
in other CPU registers. If more than four additional arguments are
required, they are left on the user stack. All system calls are marked
with the asmlinkage tag, so the "sys_" routines look to the stack for
their arguments.
=======================================================================
__attribute__ directive (a gcc extension)

The FASTCALL(x) macro passes arguments in CPU registers and not on the
stack.
=======================================================================
The DO-WHILE Macro Wrapper

The reason you see the do{ ...
}while(0) loop construct in most defined macros is that:

  A) the loop employs the constant "0", which causes the optimizing
     compiler to strip out the non-functional loop but leave the block,
  B) if there is an empty statement in a macro, it will generate
     compiler warnings,
  C) the wrapper provides a block in which local variables can be
     declared,
  D) the wrapper allows you to invoke macros as if they were functions,
     meaning that you can terminate the macro with a trailing semicolon
     (see FOO(x); below),
  E) you get the correct semantic action in all constructs, including
     the then clause of an if/else statement, and
  F) this has to be the longest run-on sentence I have ever written.

For example:

    #define FOO(x) { printf("x is %d\n", x); exit(-1); }

    void bar(int x)
    {
        if (x)
            FOO(x);
    }

works fine, but if we add an else clause:

    #define FOO(x) { printf("x is %d\n", x); exit(-1); }

    void bar(int x)
    {
        if (x)
            FOO(x);
        else
            mumbleFratz();
    }

Then the compiler barfs with:

    test.c: In function `bar':
    test.c:7: parse error before `else'

The else is unexpected because the semicolon after the macro's closing
brace is an empty statement that terminates the if statement. By
wrapping the macro definition with a do{ ... }while(0), the semicolon
is now attached to the while statement and a trailing else is parsed
correctly.
=======================================================================
GET_CURRENT() Assembly Language Inline Function

The function is:

    static inline struct task_struct * get_current(void)
    {
        struct task_struct *current;
        __asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
        return current;
    }

get_current() is a routine for getting access to the task_struct of the
currently executing process. It is declared static inline, so the
compiler expands it in place in the calling function instead of
emitting a call, and its body uses the inline assembly features of GCC.

The GCC syntax for inline assembly is as follows:

 | __asm__(

This signifies a piece of inline assembly that the compiler must insert
into its output code.
The __asm__ is the same as asm, but can't be disabled by command line
flags.

 | "andl %%esp,%0

"%%" is an escape sequence that expands to a "%". "%0" is a placeholder
that expands to the first input/output specification. So in this case,
it takes the stack pointer (register %esp) and ANDs it into a register
that contains 0xFFFFE000, leaving the result in that register.

Basically, the process' task_struct and the process' kernel stack are
placed next to each other in memory, and together they occupy an 8KB
block that is also 8KB aligned. The task_struct is at the beginning
(low memory address) and the stack is at the end (high memory address),
growing from the end downwards. So you can find the task_struct by
clearing the bottom 13 bits (8KB aligning) of any value in the kernel
stack pointer.

 | ; "

The semicolon can be used to separate assembly statements, as can the
newline character escape sequence ("\n").

 | :"=r" (current)

This specifies an OUTPUT constraint (all of which occur after the first
colon, but before the second). The '=' specifies that this overwrites
an existing value. The 'r' indicates that a general purpose register
should be allocated such that the instruction can place the output
value into it. The bit inside the brackets - 'current' - is the
intended destination of the output value (normally a local variable)
once the C code is returned to.

 | : "0" (~8191UL));

This specifies an INPUT constraint (all of which occur after the second
colon, but before the third). The '0' references another constraint (in
this case, the first output constraint), saying that the same register
or memory location should be used for both. The '~8191UL' inside the
brackets is the one's complement ('~') of the unsigned long constant
8191 (8KB - 1); this INPUT value is loaded into the register allocated
for the output value before the instructions inside the asm block run.

See also the GCC info pages, Topic "C Extensions", subtopic "Extended
Asm".

(Mostly courtesy of David Howells of Red Hat).
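The masking step can be tried in portable C without any inline
assembly. This is only a sketch of the arithmetic, not the kernel's
code: block_base() is a hypothetical helper, and THREAD_SIZE is defined
here as 8192 to match the 8KB block described above.

```c
#include <stdint.h>

/* Size of the combined task_struct + kernel stack block (8KB),
 * as described in the text; defined locally for this sketch. */
#define THREAD_SIZE 8192UL

/* Hypothetical helper: round an address down to the start of the
 * 8KB-aligned block that contains it by clearing the low 13 bits. */
uintptr_t block_base(uintptr_t sp)
{
    return sp & ~(uintptr_t)(THREAD_SIZE - 1);
}
```

For instance, a stack pointer value of 0xC0123ABC lies in the block
that starts at 0xC0122000, which is where the task_struct would be
found under the layout described above.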
=======================================================================
Negative Kernel Return Values

Many implementations of Unix and other OSs have used the carry bit of
the processor status word's condition codes (i.e., the Intel Flags
register) to indicate an error. It works because the CALL, RET, INT and
IRET assembly language instructions do not change the processor status
word and, therefore, the Flags register can be used as a sort of
mailbox holding code values between the caller and callee both before
the call and after the return. This design is used for performance
reasons and obviates a stack or memory reference. (Forget prior art,
let's file a patent application!) Regardless, if there is an error, the
error number still has to be stored in the global "errno" for user
level program inspection.

Instead of the traditional method, Linux combines these two operations
by employing a negative "ERRNO" return value to indicate both that a
syscall failed and the type of error.

Since version 2.1 the return value of a system call might be negative
even if the call succeeded, e.g., the `lseek' system call might return
a large offset that appears to be negative if viewed as a signed
integer data type. So, instead of testing whether the returned value is
merely negative or equal to negative one, glibc checks for an error
number range. If the returned value is within the range of known error
numbers (-1 to -4095), it is negated again (made positive) and assigned
to the global errno. It is assumed that returned values more negative
than -4095 are, in fact, user data to be left in %EAX.
=======================================================================
Default error return value

Most functions declare and initialize the local return variable with a
generic error code. In this way, multiple error conditions only require
a break or a goto to the end of the function to return the error.
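Functions that return pointers apply the same idea by encoding the
negative error number in the pointer itself, which is what ERR_PTR()
does in the getname() example that follows. The sketch below shows
user-space stand-ins modeled on the kernel's ERR_PTR(), PTR_ERR() and
IS_ERR(); the lowercase names are hypothetical, and the kernel's own
definitions may differ in detail.

```c
#include <errno.h>
#include <stdint.h>

/* The top 4095 addresses are treated as encoded error numbers,
 * matching the -1 to -4095 range described above. */
#define MAX_ERRNO 4095

/* Encode a negative error number as a pointer value. */
static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the negative error number from an error pointer. */
static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the reserved error range. */
static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

A caller would test the returned pointer with is_err() and, on failure,
read the error number back with ptr_err(). An ordinary valid pointer,
or NULL, does not fall in the reserved range.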
Example:

    char * getname(const char __user * filename)
    {
        char *tmp, *result;

        // Initialize result to "Error, No Memory"
        // negate the value to indicate error
        // and cast the constant to a pointer since
        // the compiler will complain if a pointer
        // data type is not returned to the caller.
        result = ERR_PTR(-ENOMEM);
        tmp = __getname();
        if (tmp) {
            // May be valid or invalid pointer
            int retval = do_getname(filename, tmp);

            // Save kernel memory pointer
            result = tmp;
            // Did we get a valid pointer to user memory?
            if (retval < 0) {
                // Nope, return allocated memory to pool
                __putname(tmp);
                // Pass the exact type of error to caller
                result = ERR_PTR(retval);
            }
        }
        audit_getname(result);
        // Return success or failure
        return result;
    }
=======================================================================
Kernel Memory Limit Constants

    // In /usr/src/linux/include/asm-i386/uaccess.h
    // Kernel data segment (paging limit) extends through the full 4 GiB
    // virtual address space
    #define KERNEL_DS MAKE_MM_SEG(0xFFFFFFFFUL)

    // In /usr/src/linux/include/asm-i386/page.h
    // User data segment (paging limit) extends up to 3 GiB, where the
    // kernel code space begins.
    #define USER_DS       MAKE_MM_SEG(PAGE_OFFSET)
    #define PAGE_OFFSET   ((unsigned long)__PAGE_OFFSET)
    #define __PAGE_OFFSET (0xC0000000)

    #define get_ds()      (KERNEL_DS)
    // Just like current(); fetch the top data address (addr_limit) from the PPDA
    #define get_fs()      (current_thread_info()->addr_limit)
    #define set_fs(x)     (current_thread_info()->addr_limit = (x))

    // current_thread_info() finds the thread_info (ti) the same way
    // get_current() finds the task_struct:
    __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~(THREAD_SIZE - 1)))

    // After 2.6.11
    #define __PAGE_OFFSET CONFIG_PAGE_OFFSET
    // and in the /usr/src/linux/.config file
    CONFIG_PAGE_OFFSET=0xC0000000
=======================================================================
Directory File Descriptor or DFD API Calls

Ulrich Drepper, the maintainer of glibc, has added 11 variants on
current file operations:

    int mknodat(int dfd, const char *pathname, mode_t mode, dev_t dev);
    int mkdirat(int dfd, const char *pathname, mode_t mode);
    int unlinkat(int dfd, const char *pathname);
    int symlinkat(const char *oldname, int newdfd, const char *newname);
    int linkat(int olddfd, const char *oldname, int newdfd, const char *newname);
    int renameat(int olddfd, const char *oldname, int newdfd, const char *newname);
    int utimesat(int dfd, const char *filename, struct timeval *tvp);
    int chownat(int dfd, const char *path, uid_t owner, gid_t group);
    int openat(int dfd, const char *filename, int flags, int mode);
    int newfstatat(int dfd, char *filename, struct stat *buf, int flag);
    int readlinkat(int dfd, const char *pathname, char *buf, int size);

Each new system call extends an existing one by adding one or more
"dfd" (directory file descriptor) arguments. The new argument indicates
a directory which is used in place of the current working directory
when relative path names are provided. These calls allow applications
to navigate through directory trees without race conditions. They may
also be used to implement a virtual per-thread working directory.

The "normal" calls now behave as if the new formal parameter had been
added with the value -100 (AT_FDCWD), which means "use the current
working directory."
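A minimal user-space sketch of the idea, using the openat() call as it
is exposed by the C library today. The helper names open_in_dir() and
open_in_cwd() are hypothetical, and the example assumes a POSIX.1-2008
system such as a current Linux with glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open "name" relative to the directory
 * "dirpath" rather than relative to the current working directory.
 * Returns a file descriptor, or -1 on failure. */
int open_in_dir(const char *dirpath, const char *name)
{
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    int fd;

    if (dfd < 0)
        return -1;

    fd = openat(dfd, name, O_RDONLY);   /* lookup starts at dfd */
    close(dfd);                         /* fd stays valid after this */
    return fd;
}

/* Hypothetical helper: passing AT_FDCWD makes openat() behave like a
 * plain open() relative to the current working directory. */
int open_in_cwd(const char *name)
{
    return openat(AT_FDCWD, name, O_RDONLY);
}
```

Because the directory is pinned by an open descriptor, renaming or
replacing a path component between the two calls cannot redirect the
second lookup, which is the race the text refers to.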
=======================================================================
Summary of Memory Ordering

When it comes to how memory ordering works on different CPUs, there is
good news and bad news. The bad news is each CPU's memory ordering is a
bit different. The good news is you can count on a few things:

  * A given CPU always perceives its own memory operations as occurring
    in program order. That is, memory-reordering issues arise only when
    a CPU is observing other CPUs' memory operations.
  * An operation is reordered with a store only if the operation
    accesses a different location than does the store.
  * Aligned simple loads and stores are atomic.
  * Linux-kernel synchronization primitives contain any needed memory
    barriers, which is a good reason to use these primitives.

The most important differences are called out in Table 1. More detailed
descriptions of specific CPUs' features will be addressed in a later
installment. Parenthesized CPU names indicate modes that are allowed
architecturally but rarely used in practice. The cells marked with a Y
indicate weak memory ordering; the more Ys, the more reordering is
possible. In general, it is easier to port SMP code from a CPU with
many Ys to a CPU with fewer Ys, though your mileage may vary. However,
code that uses standard synchronization primitives (spinlocks,
semaphores, RCU) should not need explicit memory barriers, because any
required barriers already are present in these primitives. Only tricky
code that bypasses these synchronization primitives needs barriers. It
is important to note that most atomic operations, for example,
atomic_inc() and atomic_add(), do not include any memory barriers.

How Linux Copes

One of Linux's great advantages is it runs on a wide variety of
different CPUs. Unfortunately, as we have seen, these CPUs sport a wide
variety of memory-consistency models. So what is a portable kernel to
do?
Linux provides a carefully chosen set of memory-barrier primitives, as
follows:

  * smp_mb(): "memory barrier" that orders both loads and stores. This
    means loads and stores preceding the memory barrier are committed
    to memory before any loads and stores following the memory barrier.
  * smp_rmb(): "read memory barrier" that orders only loads.
  * smp_wmb(): "write memory barrier" that orders only stores.
  * smp_read_barrier_depends(): forces subsequent operations that
    depend on prior operations to be ordered. This primitive is a no-op
    on all platforms except Alpha.

The smp_mb(), smp_rmb() and smp_wmb() primitives also force the
compiler to eschew any optimizations that would have the effect of
reordering memory operations across the barriers. The
smp_read_barrier_depends() primitive must do the same, but only on
Alpha CPUs.

These primitives generate code only in SMP kernels; however, each also
has a UP version (mb(), rmb(), wmb() and read_barrier_depends(),
respectively) that generates a memory barrier even in UP kernels. The
smp_ versions should be used in most cases. However, these latter
primitives are useful when writing drivers, because memory-mapped I/O
accesses must remain ordered even in UP kernels. In the absence of
memory-barrier instructions, both CPUs and compilers happily would
rearrange these accesses. At best, this would make the device act
strangely; at worst, it would crash your kernel or, in some cases, even
damage your hardware.

http://www.linuxjournal.com/article/8211
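The kernel primitives cannot be called from an ordinary program, but
the usual write-barrier/read-barrier pairing can be sketched with C11
atomics. This is only an analogy under stated assumptions: the release
and acquire fences below stand in for smp_wmb() and smp_rmb() and are
not identical to them, and publish() and try_consume() are hypothetical
names invented for the example.

```c
#include <stdatomic.h>

static int payload;        /* ordinary data being handed over */
static atomic_int ready;   /* flag that announces the data */

/* Writer side: store the data, then the flag, with a fence between
 * them so the data store cannot be reordered after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);   /* role of smp_wmb() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader side: load the flag, then the data, with a fence between
 * them so the data load cannot be reordered before the flag load.
 * Returns 1 and fills *out if the data was published, else 0. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);   /* role of smp_rmb() */
    *out = payload;
    return 1;
}
```

In a real program publish() and try_consume() would run on different
threads or CPUs. Without the two fences, a weakly ordered CPU or the
compiler could let the reader see the flag set while still reading a
stale payload.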