System Calls - What’s Underneath the Hood

In previous memos, I briefly mentioned that the read() and write() functions in libc are wrappers around the read and write system calls. The two layers are very different animals and should not be confused. In short, libc functions provide a higher-level interface for common programming tasks, while system calls are the direct entry points into kernel services and are used for operations that require kernel privileges.

System calls, also known as syscalls, enable applications to request services and resources from the operating system. By invoking specific system calls, applications can perform operations such as file I/O, process management, memory allocation, and hardware interaction. System calls facilitate privileged operations by transferring control from user mode to kernel mode, where the operating system handles the requested tasks.

Linux System Call Table

Reference

System calls are vital for the functioning of an operating system as they provide a standardized mechanism for applications to interact with the underlying system, offering a predefined set of functions and parameters for various operations. Below is an excerpt from the Linux system call table for x86_64, showing each call's number, the value placed in %rax, and the register that carries each argument:

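| NR | syscall name | %rax | arg0 (%rdi) | arg1 (%rsi) | arg2 (%rdx) | arg3 (%r10) | arg4 (%r8) | arg5 (%r9) |
|----|--------------|------|-------------|-------------|-------------|-------------|------------|------------|
| 0 | read | 0x00 | unsigned int fd | char *buf | size_t count | - | - | - |
| 1 | write | 0x01 | unsigned int fd | const char *buf | size_t count | - | - | - |
| 2 | open | 0x02 | const char *filename | int flags | umode_t mode | - | - | - |
| 3 | close | 0x03 | unsigned int fd | - | - | - | - | - |
| 4 | stat | 0x04 | const char *filename | struct __old_kernel_stat *statbuf | - | - | - | - |
| 7 | poll | 0x07 | struct pollfd *ufds | unsigned int nfds | int timeout | - | - | - |
| 8 | lseek | 0x08 | unsigned int fd | off_t offset | unsigned int whence | - | - | - |
| 9 | mmap | 0x09 | ? | ? | ? | ? | ? | ? |
| 12 | brk | 0x0c | unsigned long brk | - | - | - | - | - |
| 16 | ioctl | 0x10 | unsigned int fd | unsigned int cmd | unsigned long arg | - | - | - |
| 21 | access | 0x15 | const char *filename | int mode | - | - | - | - |
| 22 | pipe | 0x16 | int *fildes | - | - | - | - | - |
| 23 | select | 0x17 | int n | fd_set *inp | fd_set *outp | fd_set *exp | struct timeval *tvp | - |
| 29 | shmget | 0x1d | key_t key | size_t size | int flag | - | - | - |
| 30 | shmat | 0x1e | int shmid | char *shmaddr | int shmflg | - | - | - |
| 31 | shmctl | 0x1f | int shmid | int cmd | struct shmid_ds *buf | - | - | - |
| 32 | dup | 0x20 | unsigned int fildes | - | - | - | - | - |
| 33 | dup2 | 0x21 | unsigned int oldfd | unsigned int newfd | - | - | - | - |
| 39 | getpid | 0x27 | - | - | - | - | - | - |
| 41 | socket | 0x29 | int | int | int | - | - | - |
| 42 | connect | 0x2a | int | struct sockaddr * | int | - | - | - |
| 43 | accept | 0x2b | int | struct sockaddr * | int * | - | - | - |
| 44 | sendto | 0x2c | int | void * | size_t | unsigned | struct sockaddr * | int |
| 45 | recvfrom | 0x2d | int | void * | size_t | unsigned | struct sockaddr * | int * |
| 49 | bind | 0x31 | int | struct sockaddr * | int | - | - | - |
| 50 | listen | 0x32 | int | int | - | - | - | - |
| 56 | clone | 0x38 | unsigned long | unsigned long | int * | int * | unsigned long | - |
| 57 | fork | 0x39 | - | - | - | - | - | - |
| 59 | execve | 0x3b | const char *filename | const char *const *argv | const char *const *envp | - | - | - |
| 60 | exit | 0x3c | int error_code | - | - | - | - | - |
| 61 | wait4 | 0x3d | pid_t pid | int *stat_addr | int options | struct rusage *ru | - | - |
| 62 | kill | 0x3e | pid_t pid | int sig | - | - | - | - |
| 67 | shmdt | 0x43 | char *shmaddr | - | - | - | - | - |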

Do you see many familiar names? Yes, the libc functions with the same names are just wrappers for these syscalls. Also note that the read and write syscalls sit at the very top of the table, as they have been critical features since the earliest days of UNIX.

Let’s take the example of the read system call. When making a read system call in a program, the following steps occur:

  1. The actual system call number is typically stored in the rax register, which would contain the value 0 for read.
  2. The file descriptor identifying the file or input stream to read from is typically passed as the first argument. In the x86_64 architecture, this value would be placed in the rdi register. The buffer where the read data will be stored is passed as the second argument, usually a pointer to a memory location. This pointer would be placed in the rsi register. The maximum number of bytes to read is provided as the third argument, which would be placed in the rdx register.
  3. Once the necessary arguments are prepared in the respective registers, the program triggers the system call by executing the syscall instruction, causing a transition to kernel mode. Inside the kernel, the system call handler identifies the system call number (0 in this case) from the rax register.
  4. The handler retrieves the arguments from the appropriate registers (rdi, rsi, rdx) and performs the requested operation, which is reading data from the specified file descriptor into the provided buffer. After the read operation is completed, the number of bytes read is returned in the rax register; on failure, rax holds a negative error code instead, which libc translates into a return value of -1 and a value in errno.

Technically it is possible to write programs using only system calls - as an exercise, try printing an integer in decimal with nothing but the write syscall - but libc serves as a collection of programmer-friendly building blocks. When one of these libc wrappers is called, it internally invokes the corresponding system call to interact with the kernel and perform the required operation. The libc function acts as a wrapper around the system call, providing a higher-level and more convenient interface to the application developer.

Trace Syscalls

If you wonder how system calls are used during program execution, try the strace utility. strace is a command-line tool used on Unix-like systems to trace the system calls made by a program and the signals it receives. It allows you to see the interactions between a program and the operating system, which can be helpful for debugging, performance analysis, or understanding program behavior.

Let’s say we have a simple C program called hello.c that prints “Hello, world!” to the console:

    #include <stdio.h>

    int main(void) {
        printf("Hello, world!\n");
        return 0;
    }

To trace the system calls made by this program, we can compile it and run strace as follows:

    gcc -o hello hello.c
    strace ./hello

The output of strace will show each system call made by the program and its corresponding result. Here’s a sample excerpt of the strace output for the hello program:

    execve("./hello", ["./hello"], 0x7ffdcdd64280 /*58 vars*/) = 0
    brk(NULL) = 0x5609415c4000

    fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
    brk(NULL) = 0x5609415c4000
    brk(0x5609415e5000) = 0x5609415e5000
    write(1, "Hello, world!\n", 14Hello, world!
    ) = 14
    exit_group(0) = ?
    +++ exited with 0 +++

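In the above excerpt, you can see system calls such as execve, brk, fstat, write, and exit_group; the full trace also contains calls such as arch_prctl, access, and openat, which are typically issued while the program and its shared libraries are being loaded. Each line provides information about the system call, its arguments, and the return value. The stray "Hello, world!" inside the write line is the program's own output, which lands on the same terminal as strace's trace. By analyzing the strace output, you can gain insights into how the program interacts with the operating system, diagnose issues, or understand the underlying system behavior during program execution.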