

Linking and Shared Libraries

For large software, separate compilation becomes an important capability. The compiler can compile source files independently of each other, and the compiled code is then linked together to create an executable.
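As a minimal sketch of separate compilation (the file and function names are invented for illustration), the two C files below can each be compiled into an object file on its own; the linker then combines the two object files into one executable.

/* part1.c -- compiled on its own into part1.o */
extern int twice(int);        /* defined elsewhere; unresolved at compile time */

int main(void) {
    return twice(21);         /* the linker fills in where twice lives */
}

/* part2.c -- compiled separately into part2.o */
int twice(int x) {
    return 2 * x;
}

When part1.c is compiled, the call to twice cannot be resolved; how such references are recorded and later resolved is described next.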

Object code

Traditionally, a compiler generates assembly code from which an assembler generates object code, which is machine code accompanied by additional information. Object code is generated into an object file, typically with extension .o or .obj, depending on the operating system. Assembly code is allowed to contain references to labels defined by other assembly source files; when the assembler generates code, it leaves empty the part of the instruction corresponding to the externally defined label. The object code describes each point in the machine code that is left unresolved in this way. This information is used by the linker to patch the code from each object file so it uses the actual memory addresses of the resources provided by the other object files.

For example, suppose that abs is an externally defined function, and consider the compilation of the following line of code:

y = y + abs(x)

This code might be compiled to assembly as follows:

.globl abs                 # declare the symbol abs without defining it
mov rdi, rcx               # move x into the first argument register
call abs                   # call abs; its address is not yet known
add [rbp-16], rax          # add the returned value to y, stored at rbp-16

The .globl directive tells the assembler that such a symbol exists but does not give its address (or generate any code). The assembler then generates object code like the following, shown in hexadecimal:

0: 48 89 cf               mov rdi, rcx
3: e8 00 00 00 00         call abs
8: 48 01 45 f0            add [rbp-16], rax

The four zero bytes following the e8 (call) opcode represent the address of the function abs, which is not known when the object code is generated. Hence, those bytes are filled in with zeros, to be fixed later by the linker. In addition, the object file includes a relocation entry pointing to those bytes. The actual data filled in is an offset from the current program counter, rather than the absolute address of the target: the call instruction uses relative addressing, which means that the call instruction and its target can be moved around in memory as long as they stay at the same offset from each other.
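To make the patching step concrete, the following sketch shows the arithmetic a linker performs for a pc-relative call on x86-64. The structure and function names are invented for illustration; a real linker reads its relocation records from the object file rather than from a hand-built structure.

#include <stdint.h>
#include <string.h>

/* Sketch of the patch a linker applies for a pc-relative call on
 * x86-64. The 4-byte field of the call instruction must hold
 * target - (address of the next instruction), which equals
 * target + addend - patch_site with addend = -4.
 * The structure below is a simplification of a real relocation record. */
struct reloc {
    uint64_t patch_site;   /* address of the four placeholder bytes */
    int64_t  addend;       /* -4 for a call rel32 */
};

void patch_call(uint8_t *image, uint64_t image_base,
                struct reloc r, uint64_t target) {
    int32_t rel32 = (int32_t)(target + (uint64_t)r.addend - r.patch_site);
    /* Overwrite the zero bytes left by the assembler. */
    memcpy(image + (r.patch_site - image_base), &rel32, sizeof rel32);
}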

Structure of object and executable files

An object file is actually a data structure, divided into “sections”. The ELF object file format is a widely used standard. The machine code is contained in a “text” section. Another section contains initialized data. Object code includes a symbol table that specifies both the symbols defined by the object code and the external symbols on which the object code depends. For symbols defined in the code, the symbol table gives the value of the symbol (typically, an address). For symbols not in the code, the symbol table gives their name but not a value. Also in the object file is a relocation section specifying which code and data locations need to be supplied with values during linking.
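As a small example of this structure, the sketch below uses the Elf64_Ehdr type from <elf.h> on a Linux system to read an ELF file's header and report its type and section count. A 64-bit, native-endian file is assumed, and error handling is abbreviated.

#include <elf.h>
#include <stdio.h>

/* Minimal sketch: read the ELF header of an object file or executable
 * and report its type and number of sections. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1) { perror("fread"); return 1; }

    const char *type =
        eh.e_type == ET_REL  ? "relocatable object" :
        eh.e_type == ET_EXEC ? "executable"         :
        eh.e_type == ET_DYN  ? "shared object"      : "other";

    printf("%s: %s, %u sections, section header table at offset %lu\n",
           argv[1], type, (unsigned)eh.e_shnum, (unsigned long)eh.e_shoff);
    fclose(f);
    return 0;
}

Tools such as readelf and objdump display the full section, symbol, and relocation information.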

Executable files (at least on Unix-based systems) are structured like object files, except with a different header to indicate that they are executable. Similarly, they include a text section, but it contains the concatenated and linked text sections of various object files. And the initialized data section concatenates the corresponding sections of object files.

When an executable is run, pages from its text and data sections are brought into memory. If there are multiple processes running the executable, only one copy of the executable code is placed in memory, and it is mapped into the address space of each of the processes using address translation. Each process has its own copy of the data section, in different locations in physical memory but, again, mapped into the processes' address spaces in the same place.

Object files and executables may contain more detailed symbol table information, which C compilers generate if given the -g option. This extended information is helpful for debugging. In addition, source code line number information can be embedded into the object file, allowing executables built from arbitrary languages to be debugged using debuggers like gdb. The assembler embeds line number information into its output if the .file and .loc directives are used to indicate the name of the source file and the locations of lines of source code.

Libraries

A library is a collection of object files, combined with an index for quickly looking up which object file contains a given symbol. When linking a program against a library, only the object files that are actually used are linked into the program. In general, a linker invocation can specify any number of object files and libraries. These are given in priority order so that a given symbol is bound to the first definition encountered.

Shared libraries

As libraries developed and grew over time, they started to have too large a memory footprint, even though any given executable contains only the object code it actually uses. In the C ecosystem, the standard C library libc is particularly problematic; it is quite large, and a system with many running C programs might have many copies of large parts of this library.

Shared libraries (on Windows, DLLs) are libraries that are intended to be loaded into memory just once. When linking an executable against a shared library, the linker works against an import library that does not actually contain the library's code. The executable does not contain the library code; instead, when the executable is loaded into memory, run-time relocation is done so that it can use the functions and other resources in the shared library.

Global tables

However, supporting the full functionality of ordinary libraries is not easy with shared libraries. Recall that libraries and object files are searched, in order, for external symbols. This rule means that applications can, if they wish, override the behavior of a library by providing their own implementation of a library function.

For example, the C standard library makes many calls to the function malloc, in order to allocate memory. However, suppose that we want to override malloc, perhaps because we have a faster implementation or because we are building a garbage collector. We would like our version of malloc to be used not just by our own code directly, but also by the standard library. Otherwise, our malloc and the standard malloc will be stepping on each other as they both try to manage the program's data segment.
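As a rough sketch of what such an override can look like on a Unix-like system (where the program's own definitions take precedence during symbol resolution), the following bump allocator replaces malloc and its companions. It is purely illustrative: it is not thread-safe, it never reuses memory, and a real replacement must handle more cases.

#include <stddef.h>
#include <string.h>

/* Illustrative replacement allocator: a bump allocator over a fixed pool.
 * Because the program's definitions take priority over libc's, both the
 * application and the standard library end up calling these functions. */
static unsigned char pool[1 << 20];
static size_t used;

void *malloc(size_t n) {
    size_t rounded = (n + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    if (rounded < n || used + rounded > sizeof pool) return NULL;
    void *p = pool + used;
    used += rounded;
    return p;
}

void free(void *p) { (void)p; }                /* no-op: memory is never reused */

void *calloc(size_t nmemb, size_t size) {
    void *p = malloc(nmemb * size);            /* overflow check omitted */
    return p ? memset(p, 0, nmemb * size) : NULL;
}

void *realloc(void *p, size_t n) {
    void *q = malloc(n);                       /* sketch: may copy too many bytes */
    if (q && p) memcpy(q, p, n);
    return q;
}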

To be able to override globals, the library cannot directly call its own functions like malloc. Instead, calls are indirected through global tables located in the data segment. Instead of jumping directly to the address of malloc, the address of the malloc function is read from a location in the global table, and the call goes to that address. Programs can bind malloc differently because their global tables are located in the data segment, which is not shared.
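Conceptually, the arrangement looks something like the C model below. The names are invented, and real toolchains generate these tables automatically during linking rather than as explicit structures in the source.

#include <stddef.h>
#include <stdlib.h>

/* Conceptual model only. The table lives in the data segment, so every
 * process gets its own copy and can bind the entries differently. */
struct global_table {
    void *(*malloc_ptr)(size_t);
    void  (*free_ptr)(void *);
};

static struct global_table tbl = { malloc, free };

/* "Library" code never calls malloc directly; it reads the address
 * out of the table and calls through it. */
void *library_allocate(size_t n) {
    return tbl.malloc_ptr(n);
}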

Position-independent code

Many early computer systems lacked the capability for hardware address translation, that is, virtual memory. In fact, modern embedded systems often still lack it. To be able to run multiple programs simultaneously on such systems, it is useful to be able to move code around in memory so programs do not collide. Position-independent code is designed to be movable in memory. Most instructions can be moved without harm, including most branch instructions, because they specify their target using pc-relative addressing (as an offset from the program counter). On the other hand, accesses to absolute memory addresses pose a challenge.

On systems with virtual memory, it would seem that position independence is no longer needed; address translation allows programs to run at the same virtual address even though their memory has different physical addresses.

However, shared libraries cause trouble, because they are supposed to be shared across multiple executables, yet nothing prevents them from being located at the same place in memory as the application or each other. Hence, shared libraries are usually generated as position-independent code so that they can be placed anywhere in memory, appearing in the virtual memory of different applications at different locations.

This strategy raises a question: if the shared library uses global tables to support flexible linkage, how does the code know where to find the global table itself in a position-independent way? The usual approach is to place the global tables in memory at a link-time-determined offset from the program counter. Then, pc-relative addressing can be used to load addresses from the global table.

For calls to external functions, recall that the object code has space only for a regular, direct call. An indirect, pc-relative call instruction is too big to fit into that space. Instead, the linker can generate a short trampoline function that does a pc-relative indirect jump to the correct location. For example, in the code above we had a call to an external function abs:

          call abs

This call is linked in such a way that the address (offset) filled in goes to a trampoline:

          call abs_stub
          ...
abs_stub: jmp [rip + abs_offset]

Here, the constant abs_offset indexes within the global table to the location where the actual address of abs is stored.

Accessing global variables is similarly more expensive. Previously we might have expected to simply load from the absolute memory address of a global variable v:

          mov rbx, [v]

Instead, the address of the variable must be found in the global table:

          mov rax, [rip + v_offset]    # rax is now the address of v
          mov rbx, [rax]

Clearly, running code in this way adds overheads to accesses to global variables and global functions. In fact, the cost of accessing them is comparable to the cost of accessing methods and fields in an object-oriented language!

Because of these hidden overheads, it is common for processor manufacturers and others reporting performance numbers to fully statically link their applications. The linker then does peephole optimizations to generate faster, absolute-addressed code that includes its own copy of library code. The performance numbers are therefore not entirely representative of what will be seen on a real system! They aim to make a single program look as fast as possible. However, for a large system being used in practice, running dozens or hundreds of programs at the same time, shared libraries and position-independent code lead to better performance overall.

Prelinking (aka prebinding) is often used to speed up loading of applications. The idea is to analyze the shared libraries present on the system and to choose ahead of time where to place them in the memory space of applications. The global tables of the shared libraries can then be largely shared across applications, reducing application startup time. Prelinking is a global system optimization, so it is done infrequently, for example when a new operating system is installed, or every few weeks.

Dynamic linking

Once code can be positioned arbitrarily in memory, it becomes relatively easy to support dynamically adding new code at run time. Operating systems offer an API for explicitly linking object files into a running application. This API is how plugins are typically supported, so dynamic linking is an important feature for extensibility.

In Unix systems, the call h = dlopen(filename) loads the named object file into free memory (if it has not already been loaded). Globals defined in the file can then be queried via the returned handle h. The call p = dlsym(h, name) looks up the given name in the object file and returns its address. On Windows systems, the corresponding functions are called LoadLibrary() and GetProcAddress().
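As a minimal sketch (the file name plugin.so and the symbol plugin_init are hypothetical), a Unix program might load and use a plugin like this:

#include <dlfcn.h>
#include <stdio.h>

/* Sketch of explicit dynamic linking on a Unix system. Any shared
 * object and exported function would do; some systems require
 * linking with -ldl. */
int main(void) {
    void *h = dlopen("./plugin.so", RTLD_NOW);
    if (!h) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    /* Look up a symbol by name and call it through a function pointer. */
    void (*init)(void) = (void (*)(void))dlsym(h, "plugin_init");
    if (!init) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }
    init();

    dlclose(h);
    return 0;
}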

Conclusions

Shared libraries and DSOs allow efficient memory use on a machine running many different programs that share code, improving cache and TLB performance overall. However, they hurt individual program performance, because of indirections through global tables and some expansion in code size. We can see that globals are more expensive than they look! However, we also gain an important new capability: the ability to extend programs dynamically.