7. Docker vs. Podman
                 Docker                       Podman
Architecture     Daemon-based (dockerd)       Daemonless
CLI              Standard Docker CLI          Docker CLI-compatible
Rootless         Available (requires setup)   Built-in
Runtime          Uses containerd and runc     Uses crun or runc
Security         Requires root by default     Designed for rootless
8. Docker vs. Podman
Use Docker when:
● you need seamless integration with existing Docker-based workflows
● you rely heavily on Docker Compose or Docker Swarm
● you require full compatibility with third-party plugins and tools
Use Podman when:
● security is a priority: rootless and/or daemonless operation
● you want crun as the default low-level runtime
○ lower overhead
9. Daemonless
Advantages
● Improved security due to less code running as root.
● Simplified resource management.
● Better alignment with the Unix philosophy:
○ child processes inherit environment variables
○ on Linux, fork/exec is cheap
Disadvantages
● Some tooling and orchestration features are less mature.
● Harder to coordinate if containers run on multiple machines.
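One way to see the daemonless model in action (a sketch; assumes podman, the alpine image, and the pstree/pgrep utilities are available): the container's process is parented by a small per-container conmon monitor rather than by a system-wide daemon.
$ podman run -d --name demo alpine sleep 300
$ pstree -p $(pgrep -n conmon)   # the sleep process hangs off conmon, not a daemon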
10. runc vs. crun
runc
● container runtime written in Go
● extracted from Docker and donated to the OCI (Open Container Initiative)
crun (Red Hat / IBM)
● container runtime written in C
● a from-scratch re-implementation of an OCI container runtime
12. Building Blocks: Namespaces
Namespaces
● Mount (mnt): Isolates filesystem mount points.
● Process ID (pid): Isolates process ID number space.
● Network (net): Provides isolated network interfaces.
● Inter-process Communication (ipc): Isolates IPC resources.
● UTS (uts): Isolates hostname and domain name.
● User (user): Isolates user and group IDs.
● Cgroup (cgroup): Isolates view of cgroups.
● Time (time): Isolates system clocks.
Upcoming (proposed)
● Device Namespace: isolate some devices
● Keyring Namespace: isolate kernel keyrings
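A minimal sketch of namespaces in action, using util-linux unshare (requires root; the hostname ns-demo is illustrative):
$ sudo unshare --uts --pid --fork sh -c 'hostname ns-demo; hostname; echo $$'
# prints "ns-demo" (the host's hostname is untouched) and "1" (the shell is PID 1 in the new PID namespace)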
13. Building Blocks: Control Groups
cgroups limit, account for, and isolate the resource usage of process
groups
Resources Managed:
● CPU Usage
● Memory
● Block I/O
● Network Bandwidth
Versions:
● cgroup v1: The original implementation; each resource controller (e.g., cpu, memory, blkio) has its own distinct hierarchy, so a controlled container appears once in each of those hierarchies.
● cgroup v2: A single unified hierarchy; all controllers attach to one tree, so each container appears exactly once.
14. Building Blocks: Control Groups v1 (2006)
$ tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu
└── docker
    └── container123
        ├── cpu.cfs_period_us
        ├── cpu.cfs_quota_us
        ├── cpu.shares
        ├── cpu.stat
        └── tasks
15. Building Blocks: Control Groups v2 (2016)
$ tree /sys/fs/cgroup/docker/my_container
/sys/fs/cgroup/docker/my_container
├── cgroup.controllers
├── cgroup.procs
├── cpu.max
├── memory.max
└── io.max
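Limits are just files: a sketch of creating a cgroup and capping its CPU (assumes cgroup v2 mounted at /sys/fs/cgroup with the cpu controller enabled for child groups; run as root):
# mkdir /sys/fs/cgroup/demo
# echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max   # quota/period in microseconds: 0.5 CPU
# echo $$ > /sys/fs/cgroup/demo/cgroup.procs          # move the current shell into the group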
16. Practical problem: resource allocation
Unique problem: since containers are not a full isolation solution, how does an application know how to scale?
Business cases:
● a Java application determining its max heap size (-Xmx) or thread pool limit based on the available resources (also see a presentation on this topic)
● a Go application determining its maximum process limit; see GOMAXPROCS, or the automaxprocs package
Solution: your containerized app needs to be aware of the cgroup-allocated limits and requests.
Problem: the cgroups layout may differ between Docker, Podman, containerd, and CRI-O, so your library also needs to be aware of the cgroups version.
17. Practical problem: resource allocation
CPU LIMIT
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f CPUs", $1/$2}}' /sys/fs/cgroup/cpu.max
MEMORY LIMIT
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f GB", $1/1024/1024/1024}}' /sys/fs/cgroup/memory.max
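On cgroup v1 the same limits live in per-controller files instead (a sketch; a quota of -1 means unlimited):
$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us
$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes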
Qodea is not responsible for damages from these commands ;)
18. Not Namespaced
Kernel Modules:
● Loading or unloading modules affects the entire system.
System Time:
● Adjusting the system clock affects all processes. Time namespaces now exist, but as of late 2024 none of the mainstream container tooling uses them.
Devices:
● Access to /dev devices isn't fully namespaced.
Sysctl Settings:
● Some kernel parameters are system-wide and not namespaced.
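Docker makes this boundary visible: per-container sysctls are allowed only for namespaced keys (a sketch):
$ docker run --rm --sysctl net.ipv4.ip_forward=1 alpine sysctl net.ipv4.ip_forward   # allowed: net.* is namespaced
$ docker run --rm --sysctl vm.swappiness=10 alpine true                              # rejected: vm.* is system-wide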
19. Going Deeper: how containers are created
● Prepare the root filesystem
mkdir rootfs
debootstrap --variant=minbase focal rootfs http://archive.ubuntu.com/ubuntu/
● Set up namespaces (unshare or clone)
unshare --fork --pid --mount --uts --ipc --net --user --map-root-user chroot rootfs /bin/sh
● Change the root directory
use chroot to change the root filesystem
● Set up mount points: mount /proc, /sys, and others
mount -t proc proc /proc
mount -t sysfs sysfs /sys
● Start the process
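A few sanity checks from inside the unshared shell (a sketch):
echo $$         # prints 1: this shell is PID 1 in the new PID namespace
hostname demo   # changes the hostname only inside the new UTS namespace
ps aux          # once /proc is mounted, only this namespace's processes appear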
20. Going Deeper: key system calls
clone()
● Creates a new process with specified namespaces
● Flags determine which namespaces to unshare
unshare()
● Allows a process to disassociate parts of its execution context.
setns()
● Joins an existing namespace.
chroot()
● Changes the root directory of the calling process.
mount()
● Mounts filesystems within the namespace.
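The util-linux wrappers map directly onto these calls (a sketch; the target PID 1234 is a placeholder):
$ sudo unshare --net ip link                 # unshare(2): a fresh network namespace has only loopback
$ sudo nsenter --target 1234 --net ip addr   # setns(2): join PID 1234's network namespace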
21. Going Deeper: layered images
Union Mounts
● a filesystem model supported by Linux and other *nixes
● read-only base layers + a writable top layer
Example
mount -t overlay \
  -o lowerdir=/base/dir,upperdir=/upper/dir,workdir=/work/dir \
  none /path/to/my/mount/point
Can be used to combine multiple filesystems into one.
Useful for caching, and for efficient storage in an object store (read-only layers can be immutable).
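A complete, runnable version of the example above (a sketch; requires root and overlayfs support; all paths are illustrative):
$ mkdir -p /tmp/ovl/{lower,upper,work,merged}
$ echo base > /tmp/ovl/lower/file
$ sudo mount -t overlay none \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged
$ echo changed > /tmp/ovl/merged/file   # copy-up: the write lands in upperdir
$ cat /tmp/ovl/lower/file               # still prints "base": the lower layer is untouched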
22. Going Deeper: layered images
Docker / OCI “image” layers
# Layer 1
FROM ubuntu:20.04
# Layer 2
RUN apt-get update && apt-get install -y python3
# Layer 3
COPY . /app
# CMD or ENTRYPOINT (just metadata, not a layer)
CMD ["python3", "/app/app.py"]
+-------------------------------+
| Metadata: CMD (not a layer)   |
|-------------------------------|
| CMD ["python3", "/app/app.py"]|
+-------------------------------+
| Layer 3: COPY                 |
|-------------------------------|
| COPY . /app                   |
+-------------------------------+
| Layer 2: RUN                  |
|-------------------------------|
| RUN apt-get update &&         |
|   apt-get install -y python3  |
+-------------------------------+
| Layer 1: FROM                 |
|-------------------------------|
| FROM ubuntu:20.04             |
+-------------------------------+
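The layers can be inspected after a build (a sketch; the image name myapp is illustrative):
$ docker history myapp                                           # one row per instruction; CMD shows 0B
$ docker image inspect myapp --format '{{json .RootFS.Layers}}'  # digests of the actual filesystem layers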
23. Going Deeper: COWs, Union FS & Overlay FS
OverlayFS
● a union mount filesystem that allows multiple filesystems (layers) to be overlaid
● enables the layering of filesystem changes, which is essential for creating and
managing container images
Components:
● Lowerdir: Read-only layers, typically the base images.
● Upperdir: Writable layer where changes are stored.
● Workdir: Directory required by OverlayFS for internal operations.
● Merged View: The combined view presented to the container or user.
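Docker reports these directories per container when using the overlay2 storage driver (a sketch; the container ID is a placeholder):
$ docker inspect --format '{{json .GraphDriver.Data}}' <container-id>   # prints LowerDir, UpperDir, WorkDir, MergedDir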
24. Going Deeper: COWs, Union FS & Overlay FS
Storage efficiency:
● Shared layers reduce disk usage.
● Works well with object stores: S3, Google Cloud Storage, etc.
Isolation:
● Changes in the upper layer don't affect the lower layers.
Reusability:
● Common base layers can be used by multiple images.
Build Optimization:
● Layer caching speeds up image builds.
26. Practical problem: How to minimize image size
● Know: every layer is stored separately
● Clean up after package installs, in the same RUN layer
● Multi-stage builds (multiple FROMs; sketch after this list)
● FROM scratch
● Fat binaries (self-contained, statically linked)
● Lightweight Linux distros like Alpine
○ Beware of `musl`
● Google’s Distroless project
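A minimal multi-stage sketch (assumes a Go application; all names are illustrative): the toolchain stays in the build stage and only a static binary ships.
# build stage: full toolchain
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .
# final stage: just the binary
FROM scratch
COPY --from=build /app /app
ENTRYPOINT ["/app"]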
Any others?
28. Practical problem: How to minimize startup
● Pre-bake the image (do not download packages or copy across the network at startup)
● Reduce image size (also cuts down on cloud data transfer)
● Cache: docker buildx, etc. (sketch after this list)
● Consolidate machines (the Docker cache is normally per-machine)
● For cloud: pre-baked machine images or Union-FS-style mounts are possible
● Use a faster/lighter language
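A sketch of a shared build cache with buildx (the registry reference is illustrative):
$ docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t myapp .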
Any others?
29. K8s
Kubernetes Pod: The smallest deployable unit in Kubernetes, encapsulating one or
more containers that share certain namespaces and resources.
Shared Namespaces Between Containers in a Pod:
● Network Namespace (net): Shared IP address and ports.
● Inter-Process Communication Namespace (ipc): Shared IPC mechanisms.
● Process ID Namespace (pid): Process visibility can be shared (opt-in via shareProcessNamespace; off by default).
● UTS Namespace (uts): Shared hostname and domain name.
Isolated Namespaces:
● Mount Namespace (mnt): Each container has its own filesystem view.
● User Namespace (user): Typically isolated to manage user privileges.
Use Cases: Sidecar containers, ambassador containers, and scenarios requiring
tight coupling between containers.
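The shared network namespace can be reproduced with plain Docker (a sketch; this is essentially the “pod” exercise on a later slide):
$ docker run -d --name web nginx
$ docker run --rm --network container:web curlimages/curl -s http://localhost   # reaches nginx over the shared netns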
30. gVisor
● released as open source by Google in 2018
● user-space kernel re-implementation
● implements some (but not all) Linux system calls [1] in user space
● compatibility with OCI container runtime standards
● available in Google Cloud as a feature of GKE [2]
[1] - https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/
[2] - https://cloud.google.com/kubernetes-engine/docs/how-to/sandbox-pods
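Once runsc is installed and registered as a Docker runtime, sandboxing is one flag away (a sketch; dmesg inside the sandbox prints gVisor’s own boot messages):
$ docker run --rm --runtime=runsc alpine dmesg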
31. Try Yourself
● run a program using only container primitives like namespaces, without high-level tools like Docker
● run a “pod” like docker-compose does, but using only Docker
● unpack a container image (untar / unzip) and examine the contents (hint after this list)
● run Docker on a remote machine
● build a container image using only standard Linux utils, without docker/podman or any specialized container tooling
● run Docker inside of Docker
● run Podman inside of Podman
● build a rootless image and try to run it
● build a root filesystem and save it as a base layer
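A starting point for the image-unpacking exercise (a sketch):
$ docker pull alpine && docker save alpine -o alpine.tar
$ tar -tf alpine.tar   # manifest.json, an image config, and one tarball per layer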
33. Qodea Services Overview
● Managed Services: Value-Add Operational Services
● Data Modernisation: Generative AI, Modern Data Platform, Data Analytics & Machine Learning
● Cloud Training: Google Cloud and AWS Training, for Engineers by Engineers
● Infrastructure Modernisation: VM Migration, High Performance Compute
● Application Modernisation: App Modernisation, Apigee, Anthos, App Development
● Cloud Security: Chronicle and Mandiant Consulting, Managed Security Services
● Google Workspace: Migration and Deployment, Complex Change Management, Insights as a Service, Training and Managed Services
● Cloud Foundations: Cloud Centre of Excellence, Secure Landing Zones, Cloud Assure/FinOps
Ad: Our Google Experience
https://qodea.com/careers/