vim -R /usr/include/linux/capability.h
Containers provide a certain degree of process isolation via kernel namespaces. In this module, we’ll examine the capabilities of a process running in a containerized namespace. Specifically we’ll look at how Linux capabilities can be used to grant a particular or subset of root privileges to a process running in a container.
It would be worth taking a few minutes to read this blog post before beginning this lab.
Capabilities are distinct units of privilege that can be independently enabled or disabled.
Start by examining the Linux process capabilities header file.
vim -R /usr/include/linux/capability.h
Turn on line numbering in vim.
The capability bit mask definitions begin around line 103. Each capability is defined by a bit position followed by a short description. For example, the first capability is
CAP_CHOWN (bit 0). This capability grants permission to change the ownership of files.
103 /* 104 ** POSIX-draft defined capabilities. 105 **/ 106 107 /* In a system with the [_POSIX_CHOWN_RESTRICTED] option defined, this 108 overrides the restriction of changing file ownership and group 109 ownership. */ 110 111 #define CAP_CHOWN 0
Now examine the capabilities bit mask of a rootless process running on the host.
grep CapEff /proc/self/status
The bitmask returned is
0x0 meaning this process does not have any capabilities. For example, since bit 0 is not set (
CAP_CHOWN) a process with this
CapEff bitmask can not execute
chgrp on a file that it does not own.
Now examine the capabilities bit mask of a root process on the host.
sudo grep CapEff /proc/self/status
Notice that all 38 capability bits are set indicating this process has a full set of capabilities. In a later exercise, you’ll discover that a container running as root has a filtered set of capabilities by default but this can be changed at run time.
capsh command can decode a capabilities bitmask into a human readable output. Try it out!
NOTE: for those interested, capability '38' allows the following:
* CAP_PERFMON relaxes the verifier checks further: * - BPF progs can use of pointer-to-integer conversions * - speculation attack hardening measures are bypassed * - bpf_probe_read to read arbitrary kernel memory is allowed * - bpf_trace_printk to print kernel memory is allowed
NOTE: for those interested, capability '39' allows the following:
* CAP_BPF allows the following BPF operations: * - Creating all types of BPF maps * - Advanced verifier features * - Indirect variable access * - Bounded loops * - BPF to BPF function calls * - Scalar precision tracking * - Larger complexity limits * - Dead code elimination * - And potentially other features * - Loading BPF Type Format (BTF) data * - Retrieve xlated and JITed code of BPF programs * - Use bpf_spin_lock() helper
In comparison, '200' indicates the immutable capability (CAP_LINUX_IMMUTABLE):
A non-null CapEff value indicates the process has some capabilities. You will discover that the capabilities of a container are often less than what a root process has running on the host.
Start by running a rootless container as
--user=32767 and look at it’s capabilities.
podman run --rm -it --user=32767 registry.access.redhat.com/ubi8/ubi grep CapEff /proc/self/status
Next run a rootless container as
--user=0 and look at it’s capabilities. Note that it’s capabilities are filtered from that of a root process on the host.
podman run --user=0 --rm -it registry.access.redhat.com/ubi8/ubi grep CapEff /proc/self/status
Now run a container as privileged and compare the results to the previous exercises. What conclusions can you draw?
sudo podman run --user=0 --rm -it --privileged registry.access.redhat.com/ubi8/ubi grep CapEff /proc/self/status
Run a container in the background that runs for a few minutes.
podman run --user=0 --name=sleepy -it -d registry.access.redhat.com/ubi8/ubi sleep 999
podman top to examine the container’s capabilities.
podman top sleepy capeff
EFFECTIVE CAPS AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT
Try out the following very useful shortcut (
-l). It tells
podman to act on the latest container it is aware of:
podman top -l capeff
top has many additional format descriptors you can check out.
podman top -h
How could you determine which capabilities podman filters from a root process running in a container?
From a previous exercise we know that a root process on the host has a capabilities mask of CapEff =
From a previous exercise we know that a root process in a container has a capabilities mask of CapEff =
Hint: Below is an example that uses the Linux binary calculator
bc to add hexadecimal numbers
(0x9 + 0x1) = A.
sudo yum install bc -y echo 'obase=16;ibase=16;9+1' | bc
Suppose an application had a legitimate reason to change the date (ntpd, license testing, etc) How would you allow a container to change the date on the host? What capabilities are needed to allow this?
Run a container, save the date then try to change the date.
podman run --rm -ti --user 0 --name temp registry.access.redhat.com/ubi8/ubi bash savethedate=$(date) date -s "$savethedate"
date: cannot set date: Operation not permitted Mon Apr 8 21:45:24 UTC 2019
You have been given a container image to deploy (
quay.io/bkozdemb/hello). The application needs to use the
chattr utility but must not be allowed to
ping any hosts. Use what you’ve learned about capabilities to properly deploy this application using
ping succeeds but
chattr fails. We want the opposite.
podman run -it --name=chattr_no_ping --rm quay.io/bkozdemb/utils bash ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.035 ms --- 127.0.0.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.035/0.035/0.035/0.000 ms
cd /tmp touch file chattr +i file
chattr: Operation not permitted while setting flags on file
Exit from the container:
Challenge #1: One approach would be to use your favorite binary calculator (
bc) to calculate the difference in
CapEff between a host root process
(0xffffffffff) and a containerized root process
0xFFFFFFFFFF - 0x00A80427FB ------------ 0xFF57FBD804
echo 'obase=16;ibase=16;FFFFFFFFFF-A80427FB' | bc
To produce a human readable list, use
capsh to decode the vector.
Challenge #2: To allow a container to set the system clock, the
sys_time capability must be added. Add this capability then try setting the date again.
sudo podman run --rm -ti --user 0 --name temp --cap-add=sys_time registry.access.redhat.com/ubi8/ubi bash savethedate=$(date) date -s "$savethedate"
Mon Apr 8 21:46:18 UTC 2019
And exit from the container:
Challenge #3: Drop all capabilities then add
linux_immutable. The trick with this is the container must run as root because
linux_immutable is a filtered capability.
sudo podman run -it --name=chattr_no_ping --rm --cap-drop=all --cap-add=linux_immutable quay.io/bkozdemb/utils bash
chattr command should succeed in making a file read only:
cd /tmp touch file chattr +i file rm -rf file
rm: cannot remove 'file': Operation not permitted
Remember to reset the file attributes so the container can shutdown cleanly:
chattr -i file lsattr file
On the host, check the capabilities of the container. This will require you to open another terminal:
sudo podman top chattr_no_ping capeff
EFFECTIVE CAPS LINUX_IMMUTABLE
Exit the container (in the original container):