Before we get into OpenShift lets explore what
Linux Kernel Capabilities are
and some of the steps that OpenShift has taken to remove certain capabilities
According to the capabilities man page;
Capabilities are distinct units of privilege that can be independently enabled or disabled.
Capabilities were added to the kernel around 15 or so years ago to try to divide up the power of root. Originally the kernel allocated a 32-bit bitmask to define these capabilities. A few years ago it was expanded to 64. There are currently around 38 capabilities defined.
Capabilities are things like the ability to send raw IP packets, change hostnames or bind to ports below 1024. When we run containers we can drop a whole bunch of capabilities before running our containers without causing the vast majority of containerized applications to fail.
Most capabilities are required to manipulate the kernel/system, and these are used by the container framework (docker), but seldom used by the processes running inside the container. However, some containers require a few capabilities, for example a container process needs capabilities like setuid/setgid to drop privileges. As with most things in the container world, we try to establish a compromise between security and the ability to get work done.
OpenShift Container Platform and OpenShift Dedicated have removed most of the default 38 capabilities from root users running inside a container.
So, how can we drop these capabilities using docker? First, let’s see what
capabilities a process has. There is a cool tool in Linux that can help you
view what capabilities a process has called
pscap, available in the
libcap-ng-utils package on Fedora.
pscap | head -10
sudo docker run -d fedora sleep 5 >/dev/null; pscap | grep sleep
If I wanted to drop
mknod, I could use
sudo docker run -d --cap-drop=setfcap --cap-drop=audit_write --cap-drop=mknod fedora sleep 5 > /dev/null; pscap | grep sleep
Better yet, if you know your container only needs setuid and setgid, you can drop all capabilities and just add setgid and setuid back in.
sudo docker run -d --cap-drop=all --cap-add=setuid --cap-add=setgid fedora sleep 5 > /dev/null; pscap | grep sleep
The following bellow is a comparison of the out put form
chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot (1) vs. setgid, setuid (2)
|1||Blacklisting: Specifying each capability you want to drop.|
|2||Whitelisting: Dropping all and adding back the ones you need.|
The main point is that it is easier to be specific and only allow the capabilities that you know your application needs. However Blacklisting can still be very effective in removing the capabilities you are sure you do not want.
sudo docker ps -a ... CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES fdf41302efbc (1)
|1||Container ID to copy|
Then stop and remove the Container.
sudo docker stop fdf41302efbc sudo docker rm fdf41302efbc
Lets look deeper into each of these remaining capabilities.
The man page describes chown as the ability to make arbitrary changes to file UIDs and GIDs.
This means that root can change the ownership or group of any file system object. If you are not running a shell within a container and not installing packages into the container, you should drop this capability.
I would make the argument this should never be needed in production. If you need to chown, allow the capability, do the work, then take it away.
The man page says that dac_override allows root to bypass file read, write, and execute permission checks. DAC is an abbreviation of “discretionary access control”.
This means a root capable process can read, write, and execute any file on the system, even if the permission and ownership fields would not allow it. Almost no apps need DAC_OVERRIDE, and if they do they are probably doing something wrong. There are probably less than ten in the whole distribution that actually need it. Of course the administrator shell could require DAC_OVERRIDE fixing bad permissions in the file system.
Steve Grubb, security standards expert at Red Hat, says that “nothing should need this. If your container needs this, it’s probably doing something horrible.”
According to the man page, fowner conveys the ability to bypass permission checks on operations that normally require the filesystem UID of the process to match the UID of the file. For example, chmod and utime, and excludes operations covered by cap_dac_override and cap_dac_read_search. Here’s more from the man page:
set extended file attributes (see chattr(1)) on arbitrary files; set Access Control Lists (ACLs) on arbitrary files; ignore directory sticky bit on file deletion; specify O_NOATIME for arbitrary files in open(2) and fcntl(2). This is similar to DAC_OVERRIDE, almost no applications need this other than, potentially, software installation tools. Most likely your container would run fine without this capability. You might need to allow this for docker build but it should be blocked it when you run your container is production.
The man page says “don’t clear set-user-ID and set-group-ID mode bits when a file is modified; set the set-group-ID bit for a file whose GID does not match the filesystem or any of the supplementary GIDs of the calling process.”
My take: if you are not running an installation, you probably do not need this capability. I would disable this one by default.
If a process has this capability it can override the restriction that “the real or effective user ID of a process sending a signal must match the real or effective user ID of the process receiving the signal.”
This capability basically means that a root owned process can send kill signals to non root processes. If your container is running all processes as root or the root processes never kills processes running as non root, you do not need this capability. If you are running systemd as PID 1 inside of a container and you want to stop a container running with a different UID you might need this capability.
It’s probably also worth mentioning on the danger scale, this one is on the low end.
The man page says that the setgid capability lets a process make arbitrary manipulations of process GIDs and supplementary GID list. It can also forge GID when passing socket credentials via UNIX domain sockets or write a group ID mapping in a user namespace. See user_namespaces(7) for more information.
In short, a process with this capability can change its GID to any other GID. Basically allows full group access to all files on the system. If your container processes do not change UIDs/GIDs, they do not need this capability.
If a process has the setuid capability it can “make arbitrary manipulations of process UIDs (setuid(2), setreuid(2), setresuid(2), setfsuid(2)); forge UID when passing socket credentials via UNIX domain sockets; write a user ID mapping in a user namespace (see user_namespaces(7)).”
A process with this capability can change its UID to any other UID. Basically, it allows full access to all files on the system. If your container processes do not change UIDs/GIDs always running as the same UID, preferably non root, they do not need this capability. Applications that that need setuid usually start as root in order to bind to ports below 1024 and then changes their UIDS and drop capabilities. Apache binding to port 80 requires net_bind_service, usually starting as root. It then needs setuid/setgid to switch to the apache user and drop capabilities.
Most containers can safely drop setuid/setgid capability.
Let’s look at the man page description: “Add any capability from the calling thread’s bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.”
In layman’s terms, a process with this capability can change its current capability set within its bounding set. Meaning a process could drop capabilities or add capabilities if it did not currently have them, but limited by the bounding set capabilities.
This one’s easy. If you have this capability, you can bind to privileged ports (e.g., those below 1024).
If you want to bind to a port below 1024 you need this capability. If you are running a service that listens to a port above 1024 you should drop this capability.
The risk of this capabilty is a rogue process interpreting a service like sshd, and collecting users passwords. Running a container in a different network namespace reduces the risk of this capability. It would be difficult for the container process to get to the public network interface
The man page says, “allow use of RAW and PACKET sockets. Allow binding to any address for transparent proxying.”
This access allows a process to spy on packets on its network. That’s bad, right? Most container processes would not need this access so it probably should be dropped. Note this would only affect the containers that share the same network that your container process is running on, usually preventing access to the real network.
RAW sockets also give an attacker the ability to inject scary things onto the network. Depending on what you are doing with the ping command, it could require this access.
This capability allows use of chroot(). In other words, it allows your processes to chroot into a different rootfs. chroot is probably not used within your container, so it should be dropped.
If you have this capability, you can create special files using mknod.
This allows your processes to create device nodes. Containers are usually provided all of the device nodes they need in /dev, the creation of device nodes is controlled by the device node cgroup, but I really think this should be dropped by default. Almost no containers ever do this, and even fewer containers should do this.
If you have this one, you can write a message to kernel auditing log. Few processes attempt to write to the audit log (login programs, su, sudo) and processes inside of the container are probably not trusted. The audit subsystem is not currently namespace aware, so this should be dropped by default.
Finally, the setfcap capability allows you to set file capabilities on a file system. Might be needed for doing installs during builds, but in production it should probably be dropped.