Learning Points: Containers Under the Hood
In this video, Gerlof Langeveld from AT Computing talks about how Linux containers work under the hood. The focus is mostly on the Linux namespaces and not Cgroups.
Conventional Unix Approach
All processes run in one ecosystem, which includes:
- Hostname.
- PID numbers.
- Mounted filesystems.
- Network stack.
- IPC objects (semaphores, pipes).
- Users.
Even before containers, every process could have its own root directory (chroot) which it is limited to for usage.
If a process assumes the root identity means all the privileged actions are allowed. Non-root identity meant no privileged actions at all.
There were no tools to control resource consumption for each process.
Containerized Approach
Processes are isolated from other processes in the host. Containers are implemented by administering the root directory, namespaces and control groups of a child process differently from the parent.
- Private filesystem - chroot.
- Isolated hostname - namespace ‘uts’.
- Isolated IPC (shmem, semaphores, message queues) - namespace ‘ipc’. Processes in the same namespace can share the objects.
- Isolated PID numbering - namespace ‘pid’. The process will have different PID in ancestor namespace and its own namespace. Process with PID 1 in any namespace reaps the orphaned children in the namespace. The
/proc
has to be mounted accordingly in the new PID namespace too. - Isolated users - namespace ‘user’.
- Isolated mount - namespace ‘mnt’. Processes connected to the same namespace share mountpoints. When a new namespace is created, the mount structure is inherited. However, when there are updates to the mount points, there is no impact on the original namespaces.
- Private network stack - namespace ‘net’. A new network namespace initially has only the loopback interface but more can be added. Physical devices can only be in one namespace. Network namespaces can be connected via veth pairs.
- Limited privilege under root identity - capabilities.
- Limited utilization of CPU - cgroup ‘cpu’.
- Limited utilization of memory - cgroup ‘memory’.
- Limited utilization of disk - cgroup ‘blkio’.
Namespaces
Every process refers to namespaces. When there is a fork, the child processes inherit the binding of namespaces by default. Processes can unshare namespaces.
The namespace details are in the pseudo-filesystem /proc
. The $$
refers to the current process. The number in the bracket refers to the inode representing the namespace.
1
2
3
4
5
6
$ ls -l /proc/$$/ns
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 net -> 'net:[4026531840]'
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 pid -> 'pid:[4026531836]'
...
The unshare
command executes a specified program in new namespaces. It uses the system call unshare
. The nsenter
command connects with existing namespaces of other processes. It uses the system call setns
.
Docker Namespaces
Docker allows sharing of namespaces. For example, docker run --pid=host
shares the PID namespace with the host and docker run --pid=container:CID
shares the PID namespace with anothe container.
Modified Root Directory
Even before containers were prevalent, every process has own root directory. Usually, all the processes inherit the root directory of the entire filesystem from systemd. Use chroot
to use a prepared directory as root. Use pivot_root
to change root directory for all processes in the mount namespace.
1
sudo chroot topdir bash --login
Capabilities
Traditional Unix privilege scheme only checks whether UID = 0. Linux privilege scheme has a collection of distinct privileges that can be set for each process. For example: CAP_CHOWN, CAP_KILL, CAP_SYS_BOOT etc. Thread running with effective UID = 0 initially has all the capabilities set.
1
docker run --cap-add foo --cap-drop bar ...
Build a Container Step by Step (Without Cgroups)
In the step1.sh
, a new hostname namespace is created.
1
2
#!/bin/bash
unshare -u bash step2.sh
In step2.sh
, the hostname for the child process is changed. Then, unshare
is run again to create a new PID namespace. The current process’ environment is not modified. The forked child process will be in the new namespace with PID 1.
1
2
hostname mycontainer
unshare -p --fork --mount-proc bash step3.sh
In step3.sh
, a new network namespace is created.
1
unshare -n bash step4.sh
In step4.sh
, the local loopback link is set to up. Then, the script uses nsenter
to set up veth pairing interface mybr0
in the parent process’s network stack namespace. Then the script sets up mybr1
device for its own namespace. Now packets transmitted on one device in the pair are immediately received on the other device.
At the end, a new mount namespace is created.
1
2
3
4
5
6
7
8
9
10
ip link set dev lo up
nsenter -n -t 1 ip link add name mybr0 type veth peer name mybr1 netns $$
nsenter -n -t 1 ip addr add 192.168.47.11/24 dev mybr0
nsenter -n -t 1 ip link set dev mybr0 up
ip addr add 192.168.47.12/24 dev my br1
ip link set dev mybr1 up
unshare -m bash step5.sh
In step5.sh
, a new directory for the container’s root is created. A tmpfs filesystem (in-memory filesystem) of 50MB is mounted to the new root’s directory. This mount point will be not visible within the parent’s process. The contents from a skeleton root folder is copied into the mount point. pivot_root
is used to change the root directory.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
ROOTDIR=${pwd}/newroot
[ -d ${ROOTDIR} ] || mkdir "$ROOTDIR"
mount -n -t tmpfs -i size=50M none "${ROOTDIR}"
rsync -a skeletonfs/ "${ROOTDIR}"
cd "${ROOTDIR}"
pivot_root . oldroot
mount -t proc proc /proc
export PS1="[\u$\h \W]# "
bash