Understanding Address Spacing in Detail

Every operating system uses address spaces to manage the memory of its processes. Virtual address spaces opened the door to more features, such as swapping and easy relocation of instructions and data, and they help prevent address tampering.

Welcome to the very first post in the PE file analysis mini-series. In it, I will guide you through address spacing, a crucial concept for understanding any file format, although this mini-series focuses on the PE file only.

Let's first define some terms I will be using in this post:

  1. Loader – A program in the operating system that takes a PE file, such as an EXE or a DLL, as input and loads it into memory for execution.
  2. Image File – The file that is created by the compiler and used by the loader to start execution. It is also called a program file, but since it is an image of the process or program that can be loaded multiple times, it is nowadays called an image file.
  3. Process – An instance of the image file that is executing after being loaded by the loader.
  4. Memory – A big chunk of RAM where the process gets loaded. The CPU reads from and writes to this place to update the state of the process (i.e. its variables).

What is Address Spacing Anyway?

When an image file is loaded into memory, the resulting process has a bunch of memory-related tasks to perform, such as calling functions and handling data in local or global variables. It needs to know the addresses of these symbols because the values are not always held in CPU registers. So a range of memory addresses is assigned to the process, within which it can perform all of its reads and writes. This range is called the address space of the process.

Traditionally on 8-bit computers, and still today on microcontrollers (for example, an Arduino), where the CPU runs only one program at a time, this was easy: the program got a fixed set of addresses where all of its data was placed. To make things easy for developers, it is assumed that there is only one process running on the system.

Fig 1: Abstract intuition of process loading in memory. (Left: RAM Right: Process)

Now in the case of multiple programs, no program may write into the address space of another, because that would corrupt memory and the programs would not produce the expected output.

There is also a problem discussed in the video below: memory fragmentation. It is similar to disk fragmentation. Small chunks of memory are left unused because no program fits into them, while the loader keeps placing new programs elsewhere, and eventually the system runs out of usable memory.

Address Spacing Explained
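
The fragmentation problem can be illustrated with a minimal sketch. The hole sizes and the `first_fit` helper below are hypothetical and only model the idea that free memory split into small holes cannot serve a larger contiguous request. It is not how any real loader allocates memory.

```python
def first_fit(holes, size):
    """Return the index of the first free hole that can hold `size` units, or None."""
    for index, hole in enumerate(holes):
        if hole >= size:
            return index
    return None


# Hypothetical free holes (in KB) left between programs that are already loaded.
free_holes = [3, 4, 2]
total_free = sum(free_holes)  # 9 KB are free in total
request = 5                   # a new program needs 5 contiguous KB

slot = first_fit(free_holes, request)
print(f"total free: {total_free} KB, request: {request} KB, slot: {slot}")
# prints "total free: 9 KB, request: 5 KB, slot: None"
```

Even though 9 KB are free in total, no single hole is large enough for the 5 KB request, so the load fails.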

If you are wondering where exactly the Windows process loader is located, it is a part of the Windows internals and is exposed by a set of APIs defined in kernel32.dll and ntdll.dll. For more information, you can refer to this answer on StackExchange.

Why Use Virtual Address?

All the problems you have seen in the previous topic (memory fragmentation and overwriting), plus one I have not covered yet, "where to start the execution", are solved by virtual addressing.

Before moving on, let me explain what "where to start the execution" means. In a multitasking operating system, after the image file is loaded into memory, the CPU starts executing the process at the entry point of the image. The entry point is located relative to the address at which the image is loaded, and that load address is known as the ImageBase. If the load address keeps changing from run to run, it is hard to know where the entry point is located.

The virtual address space overcomes this issue by giving every process the same range of addresses, preserving the notion that the current process is the only one running. The application does not interact with physical memory directly. Instead, a translator maps the application's addresses to physical memory. For example, a value stored at address 0x40001 of the application could actually be stored at 0x7f46b1bbc1f0 in physical memory. The addresses you see in a debugger are these virtual addresses, as seen by the process being debugged.

Fig 2: Virtual addressing in GDB debugger

This physical memory can be either RAM or a hard drive. Which of the two backs a given address is known only to the kernel and the address translator (the MMU, or Memory Management Unit). On 32-bit systems the maximum virtual address space is \( 2^{32} \) bytes (4 GiB with 1024 as the base, or approximately 4.29 GB with 1000 as the base). On 64-bit systems, the theoretical range is a massive \( 2^{64} \) bytes, about 18446.744 petabytes.
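
The figures above can be checked with a few lines of arithmetic. This is just a sketch. The helper name `address_space_bytes` is made up for illustration.

```python
def address_space_bytes(bits):
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 2 ** bits


size_32 = address_space_bytes(32)
size_64 = address_space_bytes(64)

print(size_32)                                # 4294967296 bytes
print(size_32 // 1024**3, "GiB")              # 4 GiB (base 1024)
print(f"{size_32 / 1000**3:.2f} GB")          # 4.29 GB (base 1000)
print(f"{size_64 / 1000**5:.3f} PB")          # 18446.744 PB (base 1000)
print(size_64 // 1024**6, "EiB")              # 16 EiB (base 1024)
```

Note that these are upper bounds given by the pointer width. How much of that range a process can actually use depends on the CPU and the operating system.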

Keeping an index from every single virtual address (VA) to a physical address, for every process, would be impractical and would require a lot of extra space. To overcome this issue, the VA space is divided into equal chunks of 4096 bytes (4 KB), each known as a page. Similarly, a page frame is the smallest fixed-length contiguous block of physical memory into which memory pages are mapped by the operating system. Moving the least frequently used pages to the hard drive, and retrieving them when they are needed, is known as paging or swapping.
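
To make the page idea concrete, here is a toy translation sketch. It assumes a single-level page table stored as a Python dictionary, and the page and frame numbers are invented. Real MMUs use multi-level tables and hardware caches, which this deliberately ignores.

```python
PAGE_SIZE = 4096  # 4 KB pages, as described above


def split_address(virtual_address):
    """Split a virtual address into (page number, offset inside the page)."""
    return virtual_address // PAGE_SIZE, virtual_address % PAGE_SIZE


def translate(virtual_address, page_table):
    """Translate a virtual address with a toy page table {page: frame}.

    A missing page raises KeyError, which stands in for a page fault here.
    """
    page, offset = split_address(virtual_address)
    frame = page_table[page]
    return frame * PAGE_SIZE + offset


# Hypothetical mapping: virtual page 0x40 lives in physical frame 0x1A2.
toy_page_table = {0x40: 0x1A2}

physical = translate(0x40001, toy_page_table)
print(hex(physical))  # 0x1a2001
```

Only the page number is looked up. The offset inside the page is carried over unchanged, which is why one table entry covers 4096 addresses.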

Fig 3: Page memory translation with RAM and Hard Disk. The image is taken from https://www.cs.uic.edu/

The default value of ImageBase in the process's VA is 0x400000 for 32-bit executable images or 0x140000000 for 64-bit executable images. However, it can be changed by explicitly passing the /BASE option to the linker. Read more on Microsoft's documentation page.
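
The ImageBase is stored in the PE optional header, so it can be read straight from the file bytes. Below is a minimal sketch that assumes a well-formed file and does very little validation. The function name `read_image_base` is my own. The field offsets follow the documented PE32 and PE32+ optional header layouts.

```python
import struct


def read_image_base(data):
    """Return (image_base, entry_point_rva) from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    optional_header = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", data, optional_header)
    (entry_point_rva,) = struct.unpack_from("<I", data, optional_header + 16)
    if magic == 0x10B:    # PE32: 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", data, optional_header + 28)
    elif magic == 0x20B:  # PE32+: 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", data, optional_header + 24)
    else:
        raise ValueError(f"unexpected optional header magic: {magic:#x}")
    return image_base, entry_point_rva
```

To try it, read any executable in binary mode and pass the bytes in, for example `read_image_base(open(path, "rb").read())`, where `path` is a file of your choosing. The second value returned is the `AddressOfEntryPoint` field, which is an offset from the ImageBase rather than an absolute address.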

What is a Relative Virtual Address?

Using the actual virtual address every time would be confusing for both computers and developers, increasing the chance of bugs. Therefore a mechanism is needed by which a process can reach its data by adding or subtracting some number of bytes from a fixed reference address. That reference address is known as the base, and the offset of a virtual address from the base is known as the relative virtual address (RVA). In most cases, the data will be relative to the ImageBase address. The formula to calculate the RVA of any virtual address from the image base is

\[ RVA \; = \; VA \; - \; ImageBase \]

The concept of RVA is specifically designed to ease relocation problems: whatever base address the image ends up loaded at, the code stays well aligned and can adapt to any situation just by addressing everything relative to its origin (in this case its ImageBase address). For more details, read this answer on StackOverflow.
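
The formula and the relocation idea can be sketched in a few lines. The symbol address and the relocated base below are made-up values, and the helper names are only for illustration.

```python
def va_to_rva(va, image_base):
    """RVA = VA - ImageBase."""
    if va < image_base:
        raise ValueError("address lies below the image base")
    return va - image_base


def rva_to_va(rva, image_base):
    """VA = ImageBase + RVA."""
    return image_base + rva


preferred_base = 0x400000  # default 32-bit ImageBase mentioned above
symbol_va = 0x401500       # hypothetical address of some symbol

rva = va_to_rva(symbol_va, preferred_base)
print(hex(rva))            # 0x1500

# If the image is loaded at a different (hypothetical) base, the RVA is
# unchanged and only the final virtual address moves.
relocated_base = 0x7A0000
print(hex(rva_to_va(rva, relocated_base)))  # 0x7a1500
```

Because the RVA stays the same, nothing stored as an RVA needs patching when the image is loaded at a different base.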