Prevent Privilege Escalation from Container Breakout via UserNS Remapping

Hello World! In my previous posts, I discussed at length how a user with certain capabilities can escape a Docker container and execute commands as root on the host. A naive fix for this issue could be a combination of the following:

  • Disable capabilities like CAP_DAC_READ_SEARCH, CAP_SYS_MODULE, etc.
  • Relinquish the root user's privileges before executing ENTRYPOINT in the Dockerfile
  • Implement a firewall that blocks privileged containers and bind mounts of the host file system via the -v argument, so that volumes are used instead
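To check whether the first mitigation actually took effect, the effective capability mask of a process can be decoded by hand. The following is a minimal sketch, not part of the lab scripts: has_cap is a hypothetical helper, the bit numbers come from <linux/capability.h>, and the sample mask is assumed to be the one typically seen for root in a container started with default settings.

```shell
#!/bin/sh
# Capability bit numbers from <linux/capability.h>.
CAP_DAC_READ_SEARCH=2
CAP_SYS_MODULE=16

# has_cap MASK BIT (hypothetical helper): print "yes" if bit BIT is set
# in the hex string MASK, otherwise print "no".
# MASK is the value of the CapEff line in /proc/<pid>/status.
has_cap() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# Assumed example mask; replace it with the value from your own container.
mask=00000000a80425fb
echo "CAP_DAC_READ_SEARCH: $(has_cap "$mask" "$CAP_DAC_READ_SEARCH")"
echo "CAP_SYS_MODULE: $(has_cap "$mask" "$CAP_SYS_MODULE")"
```

Inside a running container, the real mask can be read with grep CapEff /proc/self/status and passed to the helper in place of the sample value.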

In most cases, however, some of these options are required. For instance, in one of the applications that I am working on right now, we save build time for production releases by reusing the image of the stage environment and replacing the environment file at run time using a bind mount. A better approach is to remap the default root user, which is used to spawn the containerd-shims and then the child processes, to a separate low-privileged user. This technique is known as User Namespace Remapping in the Docker world.

Implementing User Namespace Remapping

You can see that the current session is running as a low-privileged user, student. However, it is allowed to perform all Docker actions because, as you can see, it has been added to the docker group, which means it can interact with the Docker UNIX socket.

Docker is accessible from low privileged account

There are two repositories cloned in the home directory which I will be using to demonstrate the remapping and then try to exploit it.

Spoiler alert! It won't happen 😅

In the docker-privsec directory, you will find a shell script that contains the instructions to implement the remapping.

Docker exploits in the current directory

You will find the following contents in the userns-remap.sh script. The first two commands are pretty straightforward: they create a group and a user named dockremap and set the shell to /bin/false so that the account cannot be used to log in. The last two commands assign dockremap a range of 65536 subordinate UIDs and GIDs starting at 99999.

groupadd -g 99999 dockremap &&
useradd -u 99999 -g dockremap -s /bin/false dockremap &&
echo "dockremap:99999:65536" >> /etc/subuid &&
echo "dockremap:99999:65536" >> /etc/subgid
Create the default user remap user and group information
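Each line in /etc/subuid and /etc/subgid has the form name:start:count. As a rough sketch, the range of host IDs that such an entry hands out can be computed like this (subid_range is a hypothetical helper name, not something from the script):

```shell
#!/bin/sh
# subid_range ENTRY (hypothetical helper): print the first and last host ID
# covered by ENTRY, where ENTRY has the /etc/subuid form name:start:count.
subid_range() {
    start=$(printf '%s' "$1" | cut -d: -f2)
    count=$(printf '%s' "$1" | cut -d: -f3)
    echo "$start $(( start + count - 1 ))"
}

subid_range "dockremap:99999:65536"   # prints "99999 165534"
```

On a real host, the entry can be taken from grep '^dockremap:' /etc/subuid instead of being typed in by hand.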

You will also see that it updates the /etc/docker/daemon.json file and adds { "userns-remap": "default" } to it. Edit the echo line in the script as shown below to support both insecure registries and user namespace remapping.

echo "{
+	\"insecure-registries\": [\"registry:5000\"],
	\"userns-remap\": \"default\"
}" > /etc/docker/daemon.json
Fix for the daemon config, include default user dockremap for remapping

The default value of user namespace remapping in Docker points to the dockremap user. If you wish to use a different user, make sure to change this value to that user and group, in the format user:group.

Lastly, the script reloads the systemd units and then restarts the docker service. dockerd will now read the updated configuration from the daemon.json file and map the root user in the namespace to dockremap.

Execute the userns-remap script as the root user
Note: The password of the root user is provided in the lab description.

Now, go to the $HOME/dockerrootplease directory and edit the Dockerfile, as shown in the following diff. This lets you pull a fresh parent image from the registry if it has not been pulled already.

- FROM ubuntu:18.04
+ FROM registry:5000/ubuntu:18.04
Fix to pull image from remote registry

Build the image using the docker build command and give it any tag you want. I am using the short and relevant tag rootme:latest.

Building exploit docker image

You will find the command to run the exploit in the README.md file as shown below. After copying it, make sure you change the image name to the one you used while building.

Instructions for privilege escalation

Run the Docker container as shown below, and you will see that it spawns a shell after chroot'ing into the /hostOS directory. You can confirm the container breakout from the process listing, which starts with the /sbin/init process.

Create new container with similar command

Even though the effective user and group IDs are 0 (root), you won't be able to read the contents of protected files like /etc/shadow or the flag in /root/flag. The container is so well isolated that it cannot even list the home directory of the root user.

Confirm the security fix

containerd-shim has started the entry point process as the dockremap user, as you can see from the process listing on the host machine. When the process accesses resources such as files on the file system, the kernel uses this user instead of the namespace user (root) to check the DAC permissions of those resources.

Confirm commands in container are running with low-privileged user
Note: Remember that containerd-shim will launch the entry point as the root user if no remapping is done.
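Besides the process listing, the owner of a process can be cross-checked from its /proc/<pid>/status file on the host. The sketch below uses a hypothetical helper, host_uid, and an illustrative status excerpt rather than output captured from the lab.

```shell
#!/bin/sh
# host_uid STATUS_FILE (hypothetical helper): print the real UID recorded
# in a /proc/<pid>/status file. The Uid line holds four values:
# real, effective, saved set and filesystem UID.
host_uid() {
    awk '/^Uid:/ { print $2; exit }' "$1"
}

# Self-contained demo on a sample excerpt (the values are illustrative).
sample=$(mktemp)
printf 'Name:\tsh\nUid:\t99999\t99999\t99999\t99999\n' > "$sample"
host_uid "$sample"   # prints 99999
rm -f "$sample"
```

On the host, pointing the helper at /proc/<pid>/status for the PID of the container's entry point should print the remapped UID when remapping is active, and 0 when it is not.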

Why does it work the way it works?

UID 99999 is mapped to UID 0 (root) within the namespace, and this mapping is inherited by all the child processes spawned by the first process (the entry point). Similarly, the mapping works for the GID. Since the mapping information is visible from inside the namespace, you can confirm it by reading the uid_map and gid_map files from procfs.

UID and GID mappings in the container

Let's ignore the last entry, 65536, for the time being. The first entry in the map file tells you the user or group ID inside the namespace, while the second entry tells you the user or group ID outside the namespace, which is the one the kernel uses on the host.
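To make the translation concrete, here is a small sketch that applies a map file to an ID from inside the namespace. It also uses the third field, which is the length of the mapped range. map_id is a hypothetical helper, and the sample line assumes the dockremap setup from this post.

```shell
#!/bin/sh
# map_id MAPFILE ID (hypothetical helper): translate an ID inside the
# namespace to the matching host ID, using lines of the form
# "inside_start outside_start length".
# Prints nothing and returns 1 if no line covers ID.
map_id() {
    awk -v id="$2" '
        id + 0 >= $1 && id + 0 < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Sample map assumed for the dockremap configuration in this post.
map=$(mktemp)
printf '0 99999 65536\n' > "$map"
map_id "$map" 0      # prints 99999
map_id "$map" 1000   # prints 100999
rm -f "$map"
```

Inside a container, the same helper can be pointed at /proc/self/uid_map or /proc/self/gid_map.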

How Does This Differ from What fakeroot Does?

When you run a program with fakeroot, it injects its interceptor library via the LD_PRELOAD and LD_LIBRARY_PATH environment variables and wraps the file-related library calls on the fly. It does not wrap the open() and creat() functions, so the files a program can actually open are still decided by your real user.

In the case of remapping, when containerd runs the program, the mapping is written into the uid_map and gid_map files as shown below. The kernel then uses this mapping to translate users and groups between the inside and the outside of the container, without patching anything at run time.

unshare(CLONE_NEWUSER)                  = 0
openat(AT_FDCWD, "/proc/self/uid_map", O_WRONLY) = 3
write(3, "1000 1000 1", 11)             = 11
close(3)                                = 0
openat(AT_FDCWD, "/proc/self/setgroups", O_WRONLY) = 3
write(3, "deny", 4)                     = 4
close(3)                                = 0
openat(AT_FDCWD, "/proc/self/gid_map", O_WRONLY) = 3
write(3, "1000 1000 1", 11)             = 11
close(3)                                = 0
Syscalls used in user namespace with 
💡
To learn more about shared libraries and LD_PRELOAD, check out two of the posts on Linux privilege escalation – Understanding Concept of Shared Libraries and Exploiting Shared Library Misconfigurations

Where the Hell did Images go?

After enabling user namespace remapping, you won't be able to list the previously pulled images anymore, and this is expected behaviour. The Docker daemon (dockerd) creates a separate data directory at /var/lib/docker/[uid].[gid] and stores its images and containers there.

Separate data directories created for remapping
Note: For testing purposes, I have created the user mapping for www-data. That is why you are seeing the 33.33 directory here.
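As a quick sketch of the naming scheme, the directory to look in can be derived from the UID and GID of the mapping. userns_data_root is a hypothetical helper, and the base path is assumed to be the default /var/lib/docker mentioned above.

```shell
#!/bin/sh
# userns_data_root UID GID [BASE] (hypothetical helper): print the data
# directory used for a given mapping. BASE defaults to /var/lib/docker.
userns_data_root() {
    echo "${3:-/var/lib/docker}/$1.$2"
}

userns_data_root 99999 99999   # prints /var/lib/docker/99999.99999
```

Listing that directory as root on the host shows the images and containers created after remapping was enabled. The old ones remain in the original data directory.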

Resources