7. Docker vs. Podman
                 Docker                       Podman
Architecture     Daemon-based (dockerd)       Daemonless
CLI              Standard Docker CLI          Docker CLI-compatible
Rootless         Available (requires setup)   Built-in
Runtime          Uses containerd and runc     Uses crun or runc
Security         Requires root by default     Designed for rootless
8. Docker vs. Podman
Use Docker when:
● you need seamless integration with existing Docker-based workflows
● you rely heavily on Docker Compose or Docker Swarm
● you require full compatibility with third-party plugins and tools
Use Podman when:
● security is a priority: rootless and/or daemonless operation
● you want crun as the default low-level runtime
○ lower overhead
9. Daemonless
Advantages
● Improved security due to less code running as root.
● Simplified resource management.
● Better alignment with the Unix philosophy:
○ child processes inherit environment variables
○ on Linux, fork/exec is cheap
Disadvantages
● Some tooling and orchestration features are less mature.
● Harder to coordinate if containers run on multiple machines.
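One way to see the daemonless model in action (a sketch; assumes podman, the alpine image, and the pstree/pgrep utilities are available): the container's process is parented by a small per-container conmon monitor rather than by a system-wide daemon.
$ podman run -d --name demo alpine sleep 300
$ pstree -p $(pgrep -n conmon)   # the sleep process hangs off conmon, not a daemon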
10. runc vs. crun
runc
● container runtime written in Go
● extracted from Docker and donated to the OCI (Open Container Initiative)
crun (Red Hat / IBM)
● container runtime written in C
● a from-scratch re-implementation of an OCI container runtime
12. Building Blocks: Namespaces
Namespaces
● Mount (mnt): Isolates filesystem mount points.
● Process ID (pid): Isolates process ID number space.
● Network (net): Provides isolated network interfaces.
● Inter-process Communication (ipc): Isolates IPC resources.
● UTS (uts): Isolates hostname and domain name.
● User (user): Isolates user and group IDs.
● Cgroup (cgroup): Isolates view of cgroups.
● Time (time): Isolates system clocks.
Upcoming (proposed)
● Device Namespace: isolate some devices
● Keyring Namespace: isolate kernel keyrings
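A minimal sketch of namespaces in action, using util-linux unshare (requires root; the hostname ns-demo is illustrative):
$ sudo unshare --uts --pid --fork sh -c 'hostname ns-demo; hostname; echo $$'
# prints "ns-demo" (the host's hostname is untouched) and "1" (the shell is PID 1 in the new PID namespace)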
13. Building Blocks: Control Groups
cgroups limit, account for, and isolate the resource usage of process
groups
Resources Managed:
● CPU Usage
● Memory
● Block I/O
● Network Bandwidth
Versions:
● cgroup v1: The original implementation; each resource controller (e.g., cpu, memory, blkio) has its own distinct hierarchy, so a controlled container appears once in each of those hierarchies.
● cgroup v2: A single unified hierarchy; all controllers attach to one tree, so each container appears exactly once.
14. Building Blocks: Control Groups v1 (2006)
$ tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu
└── docker
    └── container123
        ├── cpu.cfs_period_us
        ├── cpu.cfs_quota_us
        ├── cpu.shares
        ├── cpu.stat
        └── tasks
15. Building Blocks: Control Groups v2 (2016)
$ tree /sys/fs/cgroup/docker/my_container
/sys/fs/cgroup/docker/my_container
├── cgroup.controllers
├── cgroup.procs
├── cpu.max
├── memory.max
└── io.max
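Limits are just files: a sketch of creating a cgroup and capping its CPU (assumes cgroup v2 mounted at /sys/fs/cgroup with the cpu controller enabled for child groups; run as root):
# mkdir /sys/fs/cgroup/demo
# echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max   # quota/period in microseconds: 0.5 CPU
# echo $$ > /sys/fs/cgroup/demo/cgroup.procs          # move the current shell into the group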
16. Practical problem: resource allocation
Unique problem: since containers are not a full isolation solution, how does an application know how to scale?
Business cases:
● a Java application determining its max heap size (-Xmx) or thread pool limit based on the available resources (also see a presentation on this topic)
● a Go application determining its maximum process limit; see GOMAXPROCS, or the automaxprocs package
Solution: your containerized app needs to be aware of the cgroup-allocated limits and requests.
Problem: the cgroups layout may differ between Docker, Podman, containerd, and CRI-O, so your library also needs to be aware of the cgroups version.
17. Practical problem: resource allocation
CPU LIMIT
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f CPUs", $1/$2}}' /sys/fs/cgroup/cpu.max
MEMORY LIMIT
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f GB", $1/1024/1024/1024}}' /sys/fs/cgroup/memory.max
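On cgroup v1 the same limits live in per-controller files instead (a sketch; a quota of -1 means unlimited):
$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us
$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes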
Qodea is not responsible for damages from these commands ;)
18. Not Namespaced
Kernel Modules:
● Loading or unloading modules affects the entire system.
System Time:
● Adjusting the system clock affects all processes. Time namespaces now exist, but as of late 2024 none of the mainstream container tooling uses them.
Devices:
● Access to /dev devices isn't fully namespaced.
Sysctl Settings:
● Some kernel parameters are system-wide and not namespaced.
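Docker makes this boundary visible: per-container sysctls are allowed only for namespaced keys (a sketch):
$ docker run --rm --sysctl net.ipv4.ip_forward=1 alpine sysctl net.ipv4.ip_forward   # allowed: net.* is namespaced
$ docker run --rm --sysctl vm.swappiness=10 alpine true                              # rejected: vm.* is system-wide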
19. Going Deeper: how containers are created
● Prepare the root filesystem
mkdir rootfs
debootstrap --variant=minbase focal rootfs http://archive.ubuntu.com/ubuntu/
● Set up namespaces (unshare or clone)
unshare --fork --pid --mount --uts --ipc --net --user --map-root-user chroot rootfs /bin/sh
● Change the root directory
use chroot to change the root filesystem
● Set up mount points: mount /proc, /sys, and others
mount -t proc proc /proc
mount -t sysfs sysfs /sys
● Start the process
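A few sanity checks from inside the unshared shell (a sketch):
echo $$         # prints 1: this shell is PID 1 in the new PID namespace
hostname demo   # changes the hostname only inside the new UTS namespace
ps aux          # once /proc is mounted, only this namespace's processes appear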
20. Going Deeper: key system calls
clone()
● Creates a new process with specified namespaces
● Flags determine which namespaces to unshare
unshare()
● Allows a process to disassociate parts of its execution context.
setns()
● Joins an existing namespace.
chroot()
● Changes the root directory of the calling process.
mount()
● Mounts filesystems within the namespace.
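The util-linux wrappers map directly onto these calls (a sketch; the target PID 1234 is a placeholder):
$ sudo unshare --net ip link                 # unshare(2): a fresh network namespace has only loopback
$ sudo nsenter --target 1234 --net ip addr   # setns(2): join PID 1234's network namespace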
21. Going Deeper: layered images
Union Mounts
● a filesystem model supported by Linux and other *nixes
● read-only base layers + a writable top layer
Example
mount -t overlay \
  -o lowerdir=/base/dir,upperdir=/upper/dir,workdir=/work/dir \
  none /path/to/my/mount/point
Can be used to combine multiple filesystems into one.
Useful for caching, and for efficient storage in an object store (read-only layers can be immutable).
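A complete, runnable version of the example above (a sketch; requires root and overlayfs support; all paths are illustrative):
$ mkdir -p /tmp/ovl/{lower,upper,work,merged}
$ echo base > /tmp/ovl/lower/file
$ sudo mount -t overlay none \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged
$ echo changed > /tmp/ovl/merged/file   # copy-up: the write lands in upperdir
$ cat /tmp/ovl/lower/file               # still prints "base": the lower layer is untouched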
22. Going Deeper: layered images
Docker / OCI “image” layers
# Layer 1
FROM ubuntu:20.04
# Layer 2
RUN apt-get update && apt-get install -y python3
# Layer 3
COPY . /app
# CMD or ENTRYPOINT (just metadata, not a layer)
CMD ["python3", "/app/app.py"]
+-------------------------------+
| Metadata: CMD (not a layer)   |
|-------------------------------|
| CMD ["python3", "/app/app.py"]|
+-------------------------------+
| Layer 3: COPY                 |
|-------------------------------|
| COPY . /app                   |
+-------------------------------+
| Layer 2: RUN                  |
|-------------------------------|
| RUN apt-get update &&         |
|   apt-get install -y python3  |
+-------------------------------+
| Layer 1: FROM                 |
|-------------------------------|
| FROM ubuntu:20.04             |
+-------------------------------+
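The layers can be inspected after a build (a sketch; the image name myapp is illustrative):
$ docker history myapp                                           # one row per instruction; CMD shows 0B
$ docker image inspect myapp --format '{{json .RootFS.Layers}}'  # digests of the actual filesystem layers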
23. Going Deeper: COWs, Union FS & Overlay FS
OverlayFS
● a union mount filesystem that allows multiple filesystems (layers) to be overlaid
● enables the layering of filesystem changes, which is essential for creating and
managing container images
Components:
● Lowerdir: Read-only layers, typically the base images.
● Upperdir: Writable layer where changes are stored.
● Workdir: Directory required by OverlayFS for internal operations.
● Merged View: The combined view presented to the container or user.
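Docker reports these directories per container when using the overlay2 storage driver (a sketch; the container ID is a placeholder):
$ docker inspect --format '{{json .GraphDriver.Data}}' <container-id>   # prints LowerDir, UpperDir, WorkDir, MergedDir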
24. Going Deeper: COWs, Union FS & Overlay FS
Storage efficiency:
● Shared layers reduce disk usage.
● Works well with object stores: S3, Google Cloud Storage, etc.
Isolation:
● Changes in the upper layer don't affect the lower layers.
Reusability:
● Common base layers can be used by multiple images.
Build Optimization:
● Layer caching speeds up image builds.
26. Practical problem: How to minimize image size
● Know: every layer is stored separately
● Clean up after package installs, in the same RUN layer
● Multi-stage builds (multiple FROMs; sketch after this list)
● FROM scratch
● Fat binaries (self-contained, statically linked)
● Lightweight Linux distros like Alpine
○ Beware of `musl`
● Google’s Distroless project
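A minimal multi-stage sketch (assumes a Go application; all names are illustrative): the toolchain stays in the build stage and only a static binary ships.
# build stage: full toolchain
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .
# final stage: just the binary
FROM scratch
COPY --from=build /app /app
ENTRYPOINT ["/app"]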
Any others?
28. Practical problem: How to minimize startup
● Pre-bake the image (do not download packages or copy across the network at startup)
● Reduce image size (also cuts down on cloud data transfer)
● Cache: docker buildx, etc. (sketch after this list)
● Consolidate machines (the Docker cache is normally per-machine)
● For cloud: pre-baked machine images or Union-FS-style mounts are possible
● Use a faster/lighter language
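A sketch of a shared build cache with buildx (the registry reference is illustrative):
$ docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t myapp .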
Any others?
29. K8s
Kubernetes Pod: The smallest deployable unit in Kubernetes, encapsulating one or
more containers that share certain namespaces and resources.
Shared Namespaces Between Containers in a Pod:
● Network Namespace (net): Shared IP address and ports.
● Inter-Process Communication Namespace (ipc): Shared IPC mechanisms.
● Process ID Namespace (pid): Process visibility can be shared (opt-in via shareProcessNamespace; off by default).
● UTS Namespace (uts): Shared hostname and domain name.
Isolated Namespaces:
● Mount Namespace (mnt): Each container has its own filesystem view.
● User Namespace (user): Typically isolated to manage user privileges.
Use Cases: Sidecar containers, ambassador containers, and scenarios requiring
tight coupling between containers.
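The shared network namespace can be reproduced with plain Docker (a sketch; this is essentially the “pod” exercise on a later slide):
$ docker run -d --name web nginx
$ docker run --rm --network container:web curlimages/curl -s http://localhost   # reaches nginx over the shared netns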
30. gVisor
● released as open source by Google in 2018
● user-space kernel re-implementation
● implements some (but not all) Linux system calls [1] in user space
● compatibility with OCI container runtime standards
● available in Google Cloud as a feature of GKE [2]
[1] - https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/
[2] - https://cloud.google.com/kubernetes-engine/docs/how-to/sandbox-pods
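Once runsc is installed and registered as a Docker runtime, sandboxing is one flag away (a sketch; dmesg inside the sandbox prints gVisor’s own boot messages):
$ docker run --rm --runtime=runsc alpine dmesg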
31. Try Yourself
● run a program using only container primitives like namespaces, without high-level tools like Docker
● run a “pod” like docker-compose does, but using only Docker
● unpack a container image (untar / unzip) and examine the contents (hint after this list)
● run Docker on a remote machine
● build a container image using only standard Linux utils, without docker/podman or any specialized container tooling
● run Docker inside of Docker
● run Podman inside of Podman
● build a rootless image and try to run it
● build a root filesystem and save it as a base layer
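A starting point for the image-unpacking exercise (a sketch):
$ docker pull alpine && docker save alpine -o alpine.tar
$ tar -tf alpine.tar   # manifest.json, an image config, and one tarball per layer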
33. Qodea Services Overview
● Managed Services: Value-Add Operational Services
● Data Modernisation: Generative AI, Modern Data Platform, Data Analytics & Machine Learning
● Cloud Training: Google Cloud and AWS Training, for Engineers by Engineers
● Infrastructure Modernisation: VM Migration, High Performance Compute
● Application Modernisation: App Modernisation, Apigee, Anthos, App Development
● Cloud Security: Chronicle and Mandiant Consulting, Managed Security Services
● Google Workspace: Migration and Deployment, Complex Change Management, Insights as a Service, Training and Managed Services
● Cloud Foundations: Cloud Centre of Excellence, Secure Landing Zones, Cloud Assure/FinOps
Ad: Our Google Experience
https://qodea.com/careers/