Docker internals for the busy developer
Adrian Mârza @ Qodea
Contents
key dates
overview
docker and podman
runc vs. crun
namespaces and cgroups
resource allocation
(some) low level details
layered images
K8s
gVisor
Key Dates - Containers
1979: chroot (Bill Joy)
2000: FreeBSD Jails
2006: cgroups v1
2008: LXC (original “containers”)
2013: Docker
2015: Borg White Paper (Google)
2015: runC & Open Container Initiative (OCI)
2016: cgroups v2
2016: k8s v1.2 (v1.0, the first stable version, shipped in 2015)
2017: OCI Runtime & Image spec 1.0
2018: Podman
2020: DockerHub starts to throttle
2020: OCI distribution spec released
2022: Docker Compose v2
Key Dates - Linux Namespaces
● 2002: Mount Namespace (2.4.19)
● 2006: UTS & IPC Namespaces (2.6.19)
● 2008: PID & Network Namespaces (2.6.24)
● 2013: User Namespace (3.8)
● 2016: Cgroup Namespace (4.6)
● 2020: Time Namespace (5.6)
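Overview - runtime / stack
+-----------------STACK-(DOCKER)---------------+
| Docker CLI / Docker API                      |
| (docker run, docker build, REST API calls)   |
+----------------------------------------------+
| Docker Daemon (dockerd)                      |
+----------------------------------------------+
| Container Runtime (containerd)               |
| - Uses syscalls to interact with kernel      |
+----------------------------------------------+
| OCI Runtime (runc, crun, etc.)               |
| - Sets up namespaces (CLONE_NEW*)            |
| - Applies cgroups for resource limits        |
| - Launches containerized process (execve)    |
+----------------------------------------------+
| Linux Kernel / System Primitives             |
+----------------------------------------------+

+---------------RUNTIME--------------------+
| Host Machine                             |
| +--------------------------------------+ |
| | Host OS Kernel                       | |
| | (Shared by all processes)            | |
| +--------------------------------------+ |
| | Processes                            | |
| | +----------------------------------+ | |
| | | Container Runtime                | | |
| | | (e.g., Docker, containerd)       | | |
| | +----------------------------------+ | |
| | +----------------------------------+ | |
| | | Running Container Process        | | |
| | | (e.g., Nginx, PostgreSQL)        | | |
| | | PID: 12345                       | | |
| | +----------------------------------+ | |
| +--------------------------------------+ |
+------------------------------------------+

+------------------STACK-(PODMAN)---------------+
| Podman CLI / REST API                         |
| (podman run, podman build, podman system)     |
+-----------------------------------------------+
| Libpod Library (Podman Core)                  |
+-----------------------------------------------+
| Container Runtime (OCI Compliant)             |
| - Directly invokes OCI runtimes (e.g., runc)  |
+-----------------------------------------------+
| OCI Runtime (runc, crun)                      |
| - Sets up namespaces (CLONE_NEW*)             |
| - Applies cgroups for resource limits         |
| - Launches containerized process (execve)     |
+-----------------------------------------------+
| Linux Kernel / System Primitives              |
+-----------------------------------------------+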
Overview - process
# Docker
systemd (PID 1)
└── dockerd (Docker Daemon)
    └── containerd (Container Runtime)
        ├── containerd-shim for Container 1
        │   └── nginx (Container Process)
        └── containerd-shim for Container 2
            └── redis (Container Process)
# Podman
systemd (PID 1)
└── systemd --user (User's systemd instance)
    └── podman (Podman CLI)
        ├── crun (Runtime for Container 1)
        │   └── nginx (Container Process)
        └── crun (Runtime for Container 2)
            └── redis (Container Process)
Docker vs. Podman
               Docker                       Podman
Architecture   Daemon-based (dockerd)       Daemonless
CLI            Standard Docker CLI          Docker CLI-compatible
Rootless       Available (requires setup)   Built-in
Runtime        Uses containerd and runc     Uses crun or runc
Security       Requires root by default     Designed for rootless
Docker vs. Podman
Use Docker when:
● needing seamless integration with existing Docker-based workflows
● relying heavily on Docker Compose or Docker Swarm
● requiring full compatibility due to use of third-party plugins and tools
Use Podman when:
● security is important: rootless and/or daemonless
● crun as the default low-level runtime
○ lower overhead
Daemonless
Advantages
● Improved security due to less code running as root.
● Simplified resource management.
● Better alignment with Unix philosophy:
○ containers are ordinary child processes that inherit env vars
○ on Linux, “fork / exec” is really cheap
Disadvantages
● Some tooling and orchestration features are less mature.
● Harder to coordinate if containers run on multiple machines.
runc vs. crun
runc
● Go container runtime
● forked from Docker and now part of OCI
crun (Red Hat / IBM)
● C container runtime
● fresh re-implementation of a container runtime
Building Blocks
1. Namespaces
2. Control Groups (cgroups)
Building Blocks: Namespaces
● Mount (mnt): Isolates filesystem mount points.
● Process ID (pid): Isolates process ID number space.
● Network (net): Provides isolated network interfaces.
● Inter-process Communication (ipc): Isolates IPC resources.
● UTS (uts): Isolates hostname and domain name.
● User (user): Isolates user and group IDs.
● Cgroup (cgroup): Isolates view of cgroups.
● Time (time): Isolates system clocks.
Upcoming (proposed)
● Device Namespace: isolate some devices
● Keyring Namespace: isolate kernel keyrings
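To see which namespaces a process actually belongs to, the kernel exposes one symlink per namespace under /proc/<pid>/ns. The sketch below is only an illustration: the helper name list_namespaces is made up for this example, and it assumes a Linux host with /proc mounted.

```shell
#!/bin/sh
# list_namespaces [PID]  (hypothetical helper, not part of any tool)
# Prints one line per namespace of the given process (default: the caller),
# for example "net net:[4026531840]". Two processes share a namespace
# exactly when the bracketed inode numbers are equal.
list_namespaces() {
    pid="${1:-self}"
    for ns in /proc/"$pid"/ns/*; do
        # Each entry is a symlink whose target looks like "mnt:[4026531841]".
        printf '%s %s\n' "$(basename "$ns")" "$(readlink "$ns")"
    done
}

list_namespaces
```

Running it once on the host and once inside a container, then comparing the inode numbers, shows which namespaces the container runtime created.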
Building Blocks: Control Groups
cgroups limit, account for, and isolate the resource usage of process groups
Resources Managed:
● CPU Usage
● Memory
● Block I/O
● Network Bandwidth
Versions:
● cgroup v1: The original implementation. Each resource controller (e.g., cpu, memory, blkio) operates within its own distinct hierarchy, with the controlled container appearing in each of those hierarchies.
● cgroup v2: A single unified hierarchy. All controllers attach to the same tree, so each controlled container appears exactly once.
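A quick way to tell which version a host uses is to look at the filesystem type mounted at /sys/fs/cgroup: cgroup2fs indicates the unified v2 hierarchy, while tmpfs indicates v1 (or a hybrid setup). This is a minimal sketch; the function names are invented for the example and it assumes a stat that supports -f and -c (GNU coreutils or BusyBox).

```shell
#!/bin/sh
# Classify the filesystem type found at the cgroup mount point.
# (hypothetical helper names, used only for this illustration)
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo v2 ;;       # unified hierarchy
        tmpfs)     echo v1 ;;       # tmpfs holding one mount per controller
        *)         echo unknown ;;
    esac
}

# detect_cgroup_version [MOUNTPOINT]
detect_cgroup_version() {
    # stat -f reports on the filesystem; %T prints its type as a name.
    fstype="$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null)" || fstype=""
    cgroup_version_from_fstype "$fstype"
}

detect_cgroup_version
```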
Building Blocks: Control Groups v1 (2006)
$ tree /sys/fs/cgroup/cpu
└── docker
    └── container123
        ├── cpu.cfs_period_us
        ├── cpu.cfs_quota_us
        ├── cpu.shares
        ├── cpu.stat
        └── tasks
Building Blocks: Control Groups v2 (2016)
$ tree /sys/fs/cgroup/docker/my_container
├── cgroup.controllers
├── cgroup.max.depth
├── cpu.max
├── memory.max
├── io.max
└── cgroup.procs
Practical problem: resource allocation
Unique problem: containers are not a full isolation solution (by default /proc/cpuinfo and /proc/meminfo still describe the host), so how does an application know how far to scale?
Business cases:
● Java application determining its max memory (-Xmx) or thread pool limit based on the available resources (also see a presentation on this topic)
● Go application determining how many threads may run Go code in parallel: see GOMAXPROCS, or the automaxprocs package
Solution: your containerized app needs to be aware of the cgroup-allocated limits and requests.
Problem: the cgroups layout may differ between Docker, Podman, containerd, and CRI-O, so your library needs to be aware of the cgroups version.
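As a rough sketch of what "being aware of the cgroups version" means in practice, the function below reads the CPU limit from either layout. The name cpu_limit and the optional directory argument are assumptions made for this example; the v1 branch assumes the cpu controller is visible at <dir>/cpu, which is common inside containers but not guaranteed.

```shell
#!/bin/sh
# cpu_limit [CGROUP_DIR]  (hypothetical helper)
# Prints the CPU limit in cores (e.g. "1.50"), "unlimited", or "unknown".
cpu_limit() {
    dir="${1:-/sys/fs/cgroup}"
    if [ -r "$dir/cpu.max" ]; then
        # cgroup v2: one file holding "<quota> <period>" or "max <period>".
        read -r quota period < "$dir/cpu.max"
    elif [ -r "$dir/cpu/cpu.cfs_quota_us" ]; then
        # cgroup v1: two files under the cpu controller; quota -1 means no limit.
        quota="$(cat "$dir/cpu/cpu.cfs_quota_us")"
        period="$(cat "$dir/cpu/cpu.cfs_period_us")"
    else
        echo unknown
        return 1
    fi
    case "$quota" in
        max|-1) echo unlimited ;;
        *) LC_ALL=C awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }' ;;
    esac
}
```

Calling cpu_limit with no argument inside a container reads the container's own limit; passing a directory makes the same logic usable against any cgroup path.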
Practical problem: resource allocation
CPU LIMIT
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f CPUs\n", $1/$2}}' /sys/fs/cgroup/cpu.max
MEMORY LIMIT
$ awk '{if ($1=="max") {print "Unlimited"} else {printf "%.2f GB\n", $1/1024/1024/1024}}' /sys/fs/cgroup/memory.max
Qodea is not responsible for damages from these commands ;)
Not Namespaced
Kernel Modules:
● Loading or unloading modules affects the entire system.
System Time:
● Adjusting the system clock can affect all processes (though time namespaces
now exist). [As of late 2024] None of the container tooling currently uses the
time namespace.
Devices:
● Access to /dev devices isn't fully namespaced.
Sysctl Settings:
● Some kernel parameters are system-wide and not namespaced.
Going Deeper: how containers are created
● Prepare the root filesystem
    mkdir rootfs
    debootstrap --variant=minbase focal rootfs
● Set Up Namespaces (unshare or clone)
    unshare --fork --pid --mount --uts --ipc --net --user --map-root-user chroot rootfs /bin/sh
● Change Root Directory
    use chroot to change the root filesystem (already done by the chroot in the command above).
● Set Up Mount Points: mount /proc, /sys, and others
    mount -t proc proc /proc
    mount -t sysfs sysfs /sys
● Start the process
Going Deeper: key system calls
clone()
● Creates a new process with specified namespaces
● Flags determine which namespaces to unshare
unshare()
● Allows a process to disassociate parts of its execution context.
setns()
● Joins an existing namespace.
chroot()
● Changes the root directory of the calling process.
mount()
● Mounts filesystems within the namespace.
Going Deeper: layered images
Union Mounts
● Filesystem model supported by Linux and other *nixes
● Read-only base layers + writable last layer
Example
mount -t overlay \
  -o lowerdir=/base/dir,upperdir=/upper/dir,workdir=/work/dir \
  none /path/to/my/mount/point
Can be used to combine multiple filesystems into one.
Useful for caching, and efficient to store in an object store (read-only layers can be immutable).
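With more than one read-only layer, OverlayFS takes a colon-separated lowerdir list in which the leftmost directory is the topmost layer. The helper below only builds the option string, so it is safe to run without root; its name and argument order are assumptions for this sketch.

```shell
#!/bin/sh
# overlay_options UPPER WORK LOWER_TOP [LOWER_BELOW...]  (hypothetical helper)
# Builds the -o string for an overlay mount. Lower layers are listed from
# topmost to bottommost, matching the order OverlayFS expects.
overlay_options() {
    upper="$1"
    work="$2"
    shift 2
    lower=""
    for layer in "$@"; do
        # Append with ":" only when something is already in the list.
        lower="${lower:+$lower:}$layer"
    done
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
}

# Usage (the mount itself needs root, so it is left as a comment):
#   mount -t overlay overlay \
#     -o "$(overlay_options /upper /work /layers/3 /layers/2 /layers/1)" /merged
```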
Going Deeper: layered images
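Docker / OCI “image” layers
# Layer 1
FROM ubuntu:20.04
# Layer 2
RUN apt-get update && apt-get install -y python3
# Layer 3
COPY . /app
# A CMD or ENTRYPOINT (just metadata, not a layer)
CMD ["python3", "/app/"]

+-------------------------------+
| Layer 4: CMD (metadata only)  |
|-------------------------------|
| CMD ["python3", "/app/"]      |
+-------------------------------+
| Layer 3: COPY                 |
|-------------------------------|
| COPY . /app                   |
+-------------------------------+
| Layer 2: RUN                  |
|-------------------------------|
| RUN apt-get update &&         |
|   apt-get install -y python3  |
+-------------------------------+
| Layer 1: FROM                 |
|-------------------------------|
| FROM ubuntu:20.04             |
+-------------------------------+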
Going Deeper: COWs, Union FS & Overlay FS
OverlayFS
● a union mount filesystem that allows multiple filesystems (layers) to be overlaid
● enables the layering of filesystem changes, which is essential for creating and managing container images
Components:
● Lowerdir: Read-only layers, typically the base images.
● Upperdir: Writable layer where changes are stored.
● Workdir: Directory required by OverlayFS for internal operations.
● Merged View: The combined view presented to the container or user.
Going Deeper: COWs, Union FS & Overlay FS
Storage efficiency:
● Shared layers reduce disk usage.
● Very friendly with object store: S3, Google Storage etc.
Isolation:
● Changes in the upper layer don't affect the lower layers.
Reusability:
● Common base layers can be used by multiple images.
Build Optimization:
● Caching layers speeds up image build.
Practical problem: How to minimize image size
● Know: Every layer is stored separately
● Clean-up after package install
● Multi-stage builds (multiple FROM’s)
● FROM scratch
● Fat binaries
● Lightweight linux distros like Alpine
○ Beware of `musl`
● Google’s Distroless project
Any others?
Practical problem: How to minimize startup
● Pre-bake the image (do not download packages or copy across the network)
● Reduce image size (also cuts down on cloud data transfer)
● Cache: docker buildx, etc.
● Consolidate machines (docker cache is per-machine normally)
● For cloud: it is possible to have pre-baked machine images or Union-FS style mounts
● Use a faster/lighter language
Any others?
K8s
Kubernetes Pod: The smallest deployable unit in Kubernetes, encapsulating one or more containers that share certain namespaces and resources.
Shared Namespaces Between Containers in a Pod:
● Network Namespace (net): Shared IP address and ports.
● Inter-Process Communication Namespace (ipc): Shared IPC mechanisms.
● Process ID Namespace (pid): Shared process visibility, but only when shareProcessNamespace is enabled in the pod spec (it is off by default).
● UTS Namespace (uts): Shared hostname and domain name.
Isolated Namespaces:
● Mount Namespace (mnt): Each container has its own filesystem view.
● User Namespace (user): Typically isolated to manage user privileges.
Use Cases: Sidecar containers, ambassador containers, and scenarios requiring
tight coupling between containers.
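The same sharing can be approximated with plain Docker by letting a second container join the namespaces of the first. This is only a sketch of the idea, not how Kubernetes implements pods (Kubernetes uses a small "pause" container to own the shared namespaces). The function name, container names, and images are placeholders.

```shell
#!/bin/sh
# start_pod_like POD_NAME APP_IMAGE SIDECAR_IMAGE  (hypothetical helper)
# Starts two containers that share network and IPC namespaces, pod-style.
start_pod_like() {
    pod="$1"
    app_image="$2"
    sidecar_image="$3"
    # The first container owns the namespaces; "--ipc shareable" lets
    # other containers join its IPC namespace.
    docker run -d --name "${pod}-app" --ipc shareable "$app_image" &&
    # The sidecar joins them, so both see the same IP address, ports,
    # and IPC objects. Adding "--pid container:${pod}-app" would also
    # share process visibility.
    docker run -d --name "${pod}-sidecar" \
        --network "container:${pod}-app" \
        --ipc "container:${pod}-app" \
        "$sidecar_image"
}

# Usage (requires a running Docker daemon):
#   start_pod_like demo nginx redis
```

Each container keeps its own mount namespace here, which matches the isolated filesystem view listed above.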
gVisor
● started in 2018 by Google
● user-space kernel re-implementation
● implements some (but not all) Linux system calls [1] in user space
● compatibility with OCI container runtime standards
● available in Google Cloud as a feature of GKE [2]
[1] -
[2] -
Try Yourself
● run a program using only container primitives like namespaces, without
high-level tools like docker
● run a “pod” like docker-compose but using only docker
● unpack a container image (untar / unzip) and examine the contents
● run docker on a remote machine
● build a container image by using only standard linux utils, without
docker/podman or any specialized container tooling
● run docker inside of docker
● run podman inside of podman
● build a rootless image and try to run it
● build a root filesystem and save it as a base layer
Thanks to
People: Florian Blaga, Dan & Sabina Zaharia
I am Adrian Mârza @Qodea
Feedback welcome to:
To join GDG Iași: Partners:
