7. Docker vs. Podman
                   | Docker                      | Podman
Architecture       | Daemon-based (dockerd)      | Daemonless
CLI                | Standard Docker CLI         | Docker CLI-compatible
Rootless           | Available (requires setup)  | Built-in
Container runtime  | Uses containerd and runc    | Uses crun or runc
Security           | Requires root by default    | Designed for rootless
8. Docker vs. Podman
Use Docker when:
● needing seamless integration with existing Docker-based workflows
● relying heavily on Docker Compose or Docker Swarm.
● requiring full compatibility because of third-party plugins and tools
Use Podman when:
● security is important: rootless and/or daemonless operation
● crun is wanted as the default low-level runtime
○ lower overhead than runc
9. Daemonless
Advantages:
● Improved security due to less code running as root.
● Simplified resource management.
● Better alignment with Unix philosophy:
○ containers are ordinary child processes that inherit environment variables from the process that launched them
○ on Linux, fork/exec is cheap
Drawbacks:
● Some tooling and orchestration features are less mature.
● Harder to coordinate if containers run on multiple machines.
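The child-process model mentioned on this slide can be seen with plain shell tools. The sketch below is only an illustration of fork/exec inheritance, not of Podman itself; the function name `child_sees` and the variable `GREETING` are hypothetical.

```shell
# Start a child process (fork + exec of "sh") with one extra environment
# variable and let the child print it. GREETING is a made-up example name.
child_sees() {
  GREETING="$1" sh -c 'printf "%s\n" "$GREETING"'
}

child_sees "hello from the parent"
```

In a daemonless setup the container process descends from the command that launched it, so it picks up that command's environment in the same way; with a daemon, the container is a descendant of the daemon instead.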
10. runc vs. crun
runc (Open Container Initiative)
● container runtime written in Go
● extracted from Docker and now maintained under the OCI
crun (Red Hat / IBM)
● container runtime written in C
● fresh re-implementation of a container runtime
12. Building Blocks: Namespaces
● Mount (mnt): Isolates filesystem mount points.
● Process ID (pid): Isolates process ID number space.
● Network (net): Provides isolated network interfaces.
● Inter-process Communication (ipc): Isolates IPC resources.
● UTS (uts): Isolates hostname and domain name.
● User (user): Isolates user and group IDs.
● Cgroup (cgroup): Isolates view of cgroups.
● Time (time): Isolates system clocks.
Upcoming (proposed)
● Device Namespace: isolate some devices
● Keyring Namespace: isolate kernel keyrings
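The namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns, with targets of the form `pid:[4026531836]`. The following is a minimal sketch for listing them; the function names `parse_ns_link` and `list_namespaces` are assumptions made for this example.

```shell
# Split a /proc/<pid>/ns link target such as "pid:[4026531836]"
# into "type inode", e.g. "pid 4026531836".
parse_ns_link() {
  printf '%s\n' "$1" | sed -n 's/^\([a-z_]*\):\[\([0-9]*\)]$/\1 \2/p'
}

# Print type and inode for every namespace of a process
# (defaults to the current shell).
list_namespaces() {
  pid="${1:-$$}"
  for link in /proc/"$pid"/ns/*; do
    [ -L "$link" ] || continue
    parse_ns_link "$(readlink "$link")"
  done
}
```

Two processes share a namespace of a given type exactly when the inode numbers match, so comparing the output of `list_namespaces` for a host shell and for a containerized process shows which namespaces were unshared.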
13. Building Blocks: Control Groups
cgroups limit, account for, and isolate the resource usage of groups of processes.
Resources Managed:
● CPU Usage
● Memory
● Block I/O
● Network Bandwidth
● cgroup v1: The original implementation. Each resource controller (e.g.,
cpu, memory, blkio) operates within its own distinct hierarchy, and the
controlled container appears in each of those hierarchies.
● cgroup v2: A single unified hierarchy. All controllers are attached to the
same tree, so the controlled container appears in exactly one place.
14. Building Blocks: Control Groups v1 (2006)
$ tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu
└── docker
    └── container123
        ├── cpu.cfs_period_us
        ├── cpu.cfs_quota_us
        ├── cpu.shares
        ├── cpu.stat
        └── tasks
15. Building Blocks: Control Groups v2 (2016)
$ tree /sys/fs/cgroup/docker/my_container
/sys/fs/cgroup/docker/my_container
├── cgroup.controllers
├── cgroup.procs
├── cpu.max
├── memory.max
└── io.max
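In cgroup v2, cpu.max holds two numbers, a quota and a period in microseconds, or the word `max` in place of the quota when there is no limit. The sketch below computes the value to write for a given number of CPUs; the function name `cpu_max_for` is hypothetical, and the default period of 100000 microseconds is the kernel default.

```shell
# Compute the "quota period" pair for cpu.max from a CPU count.
# Usage: cpu_max_for <cpus> [period_us]
cpu_max_for() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%.0f %d\n", cpus * period, period }'
}

cpu_max_for 0.5
cpu_max_for 2
```

With write access to the cgroup, the result could be applied with something like `cpu_max_for 0.5 > /sys/fs/cgroup/docker/my_container/cpu.max`, using the path shown on this slide.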
16. Practical problem: resource allocation
Unique problem: given that containers are not a full isolation solution, how does an
application know how to scale?
Business cases:
● Java application determining its max heap size (-Xmx) or thread pool limit
based on the available resources (also see a presentation on this topic)
● Go application determining how many OS threads may run Go code at once
here see: GOMAXPROCS, or the automaxprocs package
Solution: your containerized app needs to be aware of the cgroup-allocated limits
and requests.
Problem: the cgroups layout may differ between Docker, Podman, containerd, and
CRI-O, so your library needs to be aware of the cgroups version.
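One common way to tell the two versions apart is the filesystem type mounted at /sys/fs/cgroup: `cgroup2fs` indicates v2, while `tmpfs` indicates a v1 layout. The sketch below assumes GNU `stat`; the function names are made up for this example.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
cgroup_version_for_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# Detect the version on the current machine or container.
# Prints "unknown" if the path is missing or stat is not GNU stat.
detect_cgroup_version() {
  cgroup_version_for_fstype "$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null)"
}
```

A library would branch on this result before deciding which files to read, for example cpu.max on v2 versus cpu.cfs_quota_us and cpu.cfs_period_us on v1.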
17. Practical problem: resource allocation
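# CPU limit (cgroup v2, run inside the container)
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f CPUs\n", $1/$2}}' /sys/fs/cgroup/cpu.max
# Memory limit (cgroup v2, run inside the container)
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f GB\n", $1/1024/1024/1024}}' /sys/fs/cgroup/memory.max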
Qodea is not responsible for damages from these commands ;)
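On cgroup v1 the same CPU figure comes from two separate files, cpu.cfs_quota_us and cpu.cfs_period_us, and a quota of -1 means no limit. A minimal sketch of the equivalent calculation follows; the function name `cpu_limit_v1` is an assumption for this example.

```shell
# Compute the CPU limit from cgroup v1 values.
# Usage: cpu_limit_v1 <cfs_quota_us> <cfs_period_us>
cpu_limit_v1() {
  awk -v quota="$1" -v period="$2" 'BEGIN {
    if (quota < 0) { print "Unlimited" }
    else           { printf "%.2f CPUs\n", quota / period }
  }'
}

cpu_limit_v1 50000 100000
cpu_limit_v1 -1 100000
```

Inside a v1 container the arguments would typically be read from the files shown on the cgroup v1 slide, for example `cpu_limit_v1 "$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)" "$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)"`; the exact path depends on the runtime.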
18. Not Namespaced
Kernel Modules:
● Loading or unloading modules affects the entire system.
System Time:
● Adjusting the system clock can affect all processes (though time namespaces
now exist). [As of late 2024] None of the container tooling currently uses the
time namespace.
Devices:
● Access to /dev devices isn't fully namespaced.
Sysctl Settings:
● Some kernel parameters are system-wide and not namespaced.
19. Going Deeper: how containers are created
● Prepare the root filesystem
mkdir rootfs
debootstrap --variant=minbase focal rootfs http://archive.ubuntu.com/ubuntu/
● Set Up Namespaces (unshare or clone)
unshare --fork --pid --mount --uts --ipc --net --user --map-root-user chroot rootfs
● Change Root Directory
use chroot to change the root filesystem.
● Set Up Mount Points: mount /proc, /sys, and others
mount -t proc proc /proc
mount -t sysfs sysfs /sys
● Start the process
20. Going Deeper: key system calls
clone()
● Creates a new process with specified namespaces
● CLONE_NEW* flags determine which namespaces to unshare
unshare()
● Allows a process to disassociate parts of its execution context.
setns()
● Joins an existing namespace.
chroot()
● Changes the root directory of the calling process.
mount()
● Mounts filesystems within the namespace.
21. Going Deeper: layered images
Union Mounts
● Filesystem model supported by Linux and other Unix-like systems
● Read-only base layers + writable last layer
mount -t overlay \
  -o lowerdir=/base/dir,upperdir=/upper/dir,workdir=/work/dir \
  none /path/to/my/mount/point
Can be used to combine multiple filesystems into one.
Useful for caching and efficient storage in an object store (read-only layers can be shared).
22. Going Deeper: layered images
Docker / OCI “image” layers
# Layer 1
FROM ubuntu:20.04
# Layer 2
RUN apt-get update && apt-get install -y python3
# Layer 3
COPY . /app
# Just metadata, not a layer
CMD ["python3", "/app/app.py"]
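+--------------------------------+
| Layer 4: CMD                   |
| CMD ["python3", "/app/app.py"] |
+--------------------------------+
| Layer 3: COPY                  |
| COPY . /app                    |
+--------------------------------+
| Layer 2: RUN                   |
| RUN apt-get update &&          |
|     apt-get install -y python3 |
+--------------------------------+
| Layer 1: FROM                  |
| FROM ubuntu:20.04              |
+--------------------------------+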
23. Going Deeper: COWs, Union FS & Overlay FS
● OverlayFS: a union mount filesystem that allows multiple filesystems (layers) to be overlaid
● enables the layering of filesystem changes, which is essential for creating and
managing container images
● Lowerdir: Read-only layers, typically the base images.
● Upperdir: Writable layer where changes are stored.
● Workdir: Directory required by OverlayFS for internal operations.
● Merged View: The combined view presented to the container or user.
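When there are several read-only layers, OverlayFS takes them as a colon-separated lowerdir list in which the leftmost entry is the topmost layer. The sketch below builds the mount options from layers given in build order (base first); the function name `overlay_options` and all paths are hypothetical.

```shell
# Build the -o option string for an overlay mount.
# Usage: overlay_options <upperdir> <workdir> <base-layer> [<next-layer> ...]
overlay_options() {
  upper="$1"; work="$2"; shift 2
  lower=""
  for layer in "$@"; do
    # Prepend each layer: the leftmost lowerdir is the topmost layer.
    lower="$layer${lower:+:$lower}"
  done
  printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
}

overlay_options /upper /work /layers/1 /layers/2 /layers/3
```

The result can be passed to mount as root, for example `mount -t overlay overlay -o "$(overlay_options /upper /work /layers/1 /layers/2)" /merged`, which yields the merged view described above.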
24. Going Deeper: COWs, Union FS & Overlay FS
Storage efficiency:
● Shared layers reduce disk usage.
● Works well with object stores: S3, Google Cloud Storage, etc.
● Changes in the upper layer don't affect the lower layers.
● Common base layers can be used by multiple images.
Build Optimization:
● Caching layers speeds up image build.
26. Practical problem: How to minimize image size
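● Know: every layer is stored separately
● Clean up after package installs in the same RUN instruction (a later layer cannot shrink an earlier one)
● Multi-stage builds (multiple FROMs)
● FROM scratch
● Fat binaries
● Lightweight Linux distros like Alpine
○ Beware of `musl`: Alpine uses it instead of glibc, and some software behaves differently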
● Google’s Distroless project
Any others?
28. Practical problem: How to minimize startup
● Pre-bake the image (do not download packages or copy across the network)
● Reduce image size (also cuts down on cloud data transfer)
● Cache: docker buildx, etc.
● Consolidate machines (docker cache is per-machine normally)
● For cloud: it is possible to have pre-baked machine images or Union-FS style
● Use a faster/lighter language
Any others?
29. K8s
Kubernetes Pod: The smallest deployable unit in Kubernetes, encapsulating one or
more containers that share certain namespaces and resources.
Shared Namespaces Between Containers in a Pod:
● Network Namespace (net): Shared IP address and ports.
● Inter-Process Communication Namespace (ipc): Shared IPC mechanisms.
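● Process ID Namespace (pid): Shared process visibility, but only when shareProcessNamespace: true is set in the pod spec; isolated per container by default.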
● UTS Namespace (uts): Shared hostname and domain name.
Isolated Namespaces:
● Mount Namespace (mnt): Each container has its own filesystem view.
● User Namespace (user): Not isolated by default (containers use the host's user namespace); a separate user namespace for the pod can be enabled with hostUsers: false.
Use Cases: Sidecar containers, ambassador containers, and scenarios requiring
tight coupling between containers.
30. gVisor
● open-sourced by Google in 2018
● user-space kernel re-implementation
● implements some (but not all) Linux system calls [1] in user space
● compatibility with OCI container runtime standards
● available in Google Cloud as a feature of GKE [2]
[1] - https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/
[2] - https://cloud.google.com/kubernetes-engine/docs/how-to/sandbox-pods
31. Try Yourself
● run a program using only container primitives like namespaces, without
high-level tools like docker
● run a “pod” like docker-compose but using only docker
● unpack a container image (untar / unzip) and examine the contents
● run docker on a remote machine
● build a container image by using only standard linux utils, without
docker/podman or any specialized container tooling
● run docker inside of docker
● run podman inside of podman
● build a rootless image and try to run it
● build a root filesystem and save it as a base layer