Hello World! Welcome to yet another Docker security post. In one of my older posts, "Breakout from the Seccomp Unconfined Container", you saw how to run a Docker container with the seccomp security option set to unconfined and exploit the CAP_SYS_MODULE capability in the container to break out of it. In this post, I will discuss what seccomp is and how you can configure it in your existing Docker environment.
What is Seccomp?
seccomp is a syscall that has been available in the Linux kernel since version 3.17, and as the man page says:
operate on Secure Computing state of the process
The developer writes the call into the program during development, and it takes effect when the process executes it at run time. There are two main operations we will be discussing here: SECCOMP_SET_MODE_STRICT and SECCOMP_SET_MODE_FILTER.
The image above depicts how information is transferred from user space to kernel space. I came across it while reading the research paper "A Trustworthy In-Kernel Interpreter Infrastructure".
Difference between MAC and Seccomp
Seccomp is different from Mandatory Access Control (MAC) systems such as AppArmor, which let you write a plain-text profile for a program binary in a specific syntax and load it into the kernel. The profile is then applied to every process created from that binary, and it denies everything it explicitly marks as denied as well as anything it does not list. With seccomp, on the other hand, the developers have full control over when, how, and which syscalls are filtered.
You can disable a MAC service by stopping it, removing its kernel module, or even uninstalling the package. Seccomp, however, is implemented in the kernel itself, so on kernel 3.17 and later there is no service or package to remove (assuming the kernel was built with seccomp support, which mainstream distributions do). That makes it much harder to switch off than a MAC service.
In case you are wondering what "Mandatory Access Control" is, it is another way to run confined Docker containers, usually with the help of AppArmor. I have written a mini-series on it for you. You can find it here: AppArmor Basics for Sysadmins
Write your First Seccomp Application
I have created a simple program to demonstrate how the SECCOMP_SET_MODE_STRICT mode works. This one is very easy to understand, as it allows only a fixed set of syscalls and blocks all the rest:
- read
- write
- _exit (but not exit_group)
- sigreturn
Other system calls result in the termination of the calling thread, or termination of the entire process with the SIGKILL signal when there is only one thread.
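Here is a minimal sketch of such a program, not necessarily the one used in this post. It assumes a Linux system with the kernel headers installed. The function name strict_mode_demo is hypothetical, and the experiment runs in a forked child so that the SIGKILL hits the child rather than your shell session. It enters strict mode through prctl with SECCOMP_MODE_STRICT, which is the prctl counterpart of SECCOMP_SET_MODE_STRICT.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the signal that killed the child,
 * 0 if the child exited on its own, or -1 if fork/waitpid failed. */
int strict_mode_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        static const char msg[] = "strict mode is on\n";

        /* Before strict mode: opening a file works as usual. */
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0)
            _exit(3);
        close(fd);

        /* From here on only read, write, _exit and sigreturn are allowed. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);

        /* write(2) is on the allow list, so this still succeeds. */
        if (write(STDOUT_FILENO, msg, sizeof msg - 1) < 0) {
            /* nothing useful to do about a failed write here */
        }

        /* open(2)/openat(2) is not allowed: the kernel kills the child. */
        fd = open("/dev/null", O_RDONLY);
        (void)fd;

        /* Not reached. A raw exit(2) is used because glibc's _exit()
         * calls exit_group(2), which strict mode also forbids. */
        syscall(SYS_exit, 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

If you call strict_mode_demo() from a main function, it should return SIGKILL, because the second open never gets the chance to return.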
Let's compile it using gcc and run it. You will find that the open syscall succeeds the first time but fails the second time, because by then the kernel knows what to filter and how to handle it when the program tries to execute it.
Don't trust me blindly; I have the strace output ready for you. In this case, after the write syscalls, there is a call to prctl, which puts seccomp into strict mode. Therefore, when another openat call is made, the process is killed with a SIGKILL signal before the call can return any value.
prctl is another syscall used to manipulate certain behaviours of the running process or thread.
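To give a feel for how general prctl is, here is a small sketch with two unrelated uses: renaming the calling thread and asking the kernel which seccomp mode the process is in. The function names rename_thread and current_seccomp_mode are hypothetical; the PR_SET_NAME, PR_GET_NAME and PR_GET_SECCOMP options are standard prctl options.

```c
#include <string.h>
#include <sys/prctl.h>

/* Hypothetical helper: rename the calling thread and read the name back.
 * The kernel keeps at most 16 bytes, including the trailing NUL.
 * Returns 0 on success, -1 on failure. */
int rename_thread(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0);
}

/* Hypothetical helper: returns 0 when seccomp is off and 2 in filter mode.
 * In strict mode prctl itself is forbidden, so the caller would be
 * killed instead of seeing a return value. */
int current_seccomp_mode(void)
{
    return prctl(PR_GET_SECCOMP, 0, 0, 0, 0);
}
```

On a normal host, current_seccomp_mode() usually reports 0, while inside a Docker container running with the default profile it usually reports 2.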
Advanced BPF Filtering with Granular Control
Strict mode is pretty cool and secure, as we have seen. But real programs need many more syscalls, and strict mode cannot meet that requirement. This is where advanced filtering with a BPF filter comes into play (seccomp filters are written in classic BPF rather than eBPF). It is implemented by another seccomp mode, known as SECCOMP_SET_MODE_FILTER. With it, developers (or users, if the developer exposes the option) have full control over allowing or disallowing syscalls at runtime, without hard-coding the list in the program itself.
Unfortunately, the demonstration on the man page looks scary if you are new to or inexperienced with the C programming language (I am, at least 🥲). To make things a lot easier for this post and for you, I found an answer on StackExchange which uses some helper functions for this.
Let's copy the source code from this answer for now and modify it to disallow the read syscall and return the EBADF error, which stands for "Bad file descriptor".
These functions are defined in the libseccomp.so library, so pass -lseccomp at link time to resolve the references to them in the binary.
What is wrong with the output? Why is the process not killed, like in strict mode? Because killing is the default action only in strict mode. In this case, the filter for the read syscall returns an EBADF error instead of killing the process, so the program continued and reported the error on the third call.
Even though the file descriptor passed as the first argument is valid, read returns -1 with errno set to EBADF, as defined in the seccomp filter.
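For comparison, the same rule can be written without any helper library. The sketch below is not the StackExchange code used in this post; it is a library-free equivalent that installs a four-instruction classic BPF program through prctl. The function names deny_read_with_ebadf and filtered_read_errno are hypothetical, and the filter skips the architecture check that a production filter should perform.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: make read(2) fail with EBADF, allow everything else.
 * Simplification: seccomp_data.arch is not checked here. */
static int deny_read_with_ebadf(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* read? fall through to the ERRNO return; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EBADF & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof filter / sizeof filter[0],
        .filter = filter,
    };

    /* Needed so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Hypothetical helper: returns the errno that read(2) reports inside a
 * filtered child process, 0 if read succeeded, or -1 on setup failure. */
int filtered_read_errno(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        char buf[8];
        int fd = open("/dev/zero", O_RDONLY);   /* a perfectly valid fd */
        if (fd < 0 || deny_read_with_ebadf() != 0)
            _exit(255);
        ssize_t n = read(fd, buf, sizeof buf);  /* rejected by the filter */
        _exit(n == -1 ? errno : 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

The child is not killed here: it exits normally and simply reports the errno it saw, which is the difference from strict mode described above.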
Seccomp Profile of Docker
Docker uses the filter mode, which gives developers the power to decide which syscalls should be allowed or blocked for individual containers. By default, it uses the configuration from the moby project, https://github.com/moby/moby/blob/master/profiles/seccomp/default.json, for all the containers.
You can either disable it using --security-opt seccomp=unconfined or supply your own JSON profile using --security-opt seccomp=/path/to/profile.json while creating or running the container with the docker CLI.
Let's use the following profile to disallow syscalls such as connect, which makes the mkdir and ping commands fail with an "Operation not permitted" error.
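A minimal sketch of what such a profile could look like, saved as myprofile.json. The exact syscall list is an assumption: connect comes from the text above, while mkdir and mkdirat are added here on the assumption that the mkdir command relies on one of them. Everything not listed is allowed, and SCMP_ACT_ERRNO makes the listed syscalls fail with EPERM, which the shell reports as "Operation not permitted".

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mkdir", "mkdirat", "connect"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

Note that this allow-by-default layout is only suitable for a demonstration. Docker's default profile works the other way round and denies everything that is not explicitly allowed.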
Run the container with the appropriate security option, --security-opt seccomp=myprofile.json, and try to execute the ping or mkdir command. It will fail even if you are the root user in the container.
With the default profile, all the above commands work as expected. This confirms that seccomp is effective in the container even if you are running as the root user. Therefore, it is a trustworthy solution for setting up a secure containerized environment on your production infrastructure.