One of the superpowers of containers is their isolated filesystem view — from inside a container it looks like a full Linux distro, often different from the host. Run docker run nginx, and Nginx lands in its familiar Debian userspace no matter what Linux flavor your host runs. But how is that illusion built?
In this post, we’ll walk through how to assemble a tiny but realistic container using only stock Linux tools: unshare, mount, and pivot_root. No runtime magic. Along the way, you’ll see why the mount namespace is the bedrock of container isolation, while other namespaces (PID, cgroup, UTS, network) play complementary roles.
What Does Mount Namespace Actually Isolate?
Start a new shell in its own mount namespace:
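A minimal sketch of this step, assuming util-linux unshare and sudo are installed. The helper names current_mntns and enter_new_mntns are invented for this example.

```shell
# Print the mount namespace of the calling process, e.g. "mnt:[4026531841]".
current_mntns() {
    readlink /proc/self/ns/mnt
}

# Start a shell in a fresh mount namespace (root is required to unshare).
enter_new_mntns() {
    sudo unshare --mount /bin/sh
}
```

Run current_mntns on the host, then enter_new_mntns, and inside the new shell run readlink /proc/self/ns/mnt: the two identifiers should differ.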
Now from another terminal, create a file on the host:
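For example, assuming a marker file under /tmp (the path is arbitrary and chosen only for this illustration):

```shell
# Hypothetical marker path; any location on the host works the same way.
MARKER=/tmp/mntns-demo.txt

# Write the file from the host terminal and read it back.
echo "hello from the host" > "$MARKER"
cat "$MARKER"
```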
Surprisingly, the file is also visible from the namespaced shell. So what did we actually isolate?
The answer is the mount table, not the filesystem itself. Linux mount namespaces isolate the list of mount points seen by processes in each namespace. The underlying filesystem is still shared — it’s only when you create new mount points that the views start diverging.
To verify, from the namespaced shell:
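One possible experiment, sketched under the assumption that the namespaced shell runs as root. The helper name mount_scratch and the file name hello.txt are made up for this example.

```shell
# Run inside the namespaced shell: mount a fresh tmpfs on a target directory
# (default /mnt) and drop a file into it.
mount_scratch() {
    target="${1:-/mnt}"
    mount -t tmpfs tmpfs "$target" &&
        echo "only in this namespace" > "$target/hello.txt"
}
```

Paste the function into the namespaced shell and call mount_scratch /mnt; ls /mnt there should then list hello.txt.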
But from the host terminal, /mnt remains empty. The mount only exists in the new namespace’s mount table.
You can compare mount tables using findmnt from each terminal — the namespaced shell will show the extra /mnt mount point that the host doesn’t see.
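findmnt is convenient interactively. For a scripted comparison, the sketch below reads mountinfo-format files such as /proc/PID/mountinfo directly, where field 5 is the mount point. The helper names mount_points and extra_mounts are hypothetical.

```shell
# List the mount points recorded in a mountinfo-format file (field 5).
mount_points() {
    awk '{ print $5 }' "$1" | sort -u
}

# Print mount points present in the second table but missing from the first.
extra_mounts() {
    tmp=$(mktemp) || return 1
    mount_points "$1" > "$tmp"
    # -v -x -F: keep only lines that match none of the fixed strings exactly
    mount_points "$2" | grep -vxF -f "$tmp"
    rm -f "$tmp"
}
```

As root on the host, extra_mounts /proc/1/mountinfo /proc/PID/mountinfo (with PID being the namespaced shell's process ID) should list the mounts that exist only in the new namespace.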
Mount namespaces were the first namespace type added to Linux, appearing in Linux 2.4.19 in 2002. That is why the corresponding clone flag is simply called CLONE_NEWNS.
Mount Propagation
Before diving into how container runtimes use mount namespaces, there’s an important related concept: mount propagation.
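When you create a new mount namespace, it starts with a copy of the parent's mount table, and each mount point can be configured to propagate later mount and unmount events (or not) between the two. This is controlled by propagation types:
- shared: mount events propagate in both directions between the mount and its peers
- private: no propagation at all
- slave: propagation is one-way, from the master (typically the parent namespace) to the slave, never back
- unbindable: like private, and the mount also cannot be used as the source of a bind mount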
Container runtimes typically mark the root mount private (or slave) in the container’s namespace. With private, nothing propagates in either direction; with slave, new host mounts can still appear in the container, but container mounts never leak out to the host. The unshare CLI tool sets propagation to private by default when given --mount, but if you use the unshare() syscall directly, you need to handle it yourself:
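A quick way to see whether that handling is needed is to read the propagation state from /proc/self/mountinfo, where shared mounts carry a shared:N tag and slave mounts a master:N tag in the optional fields. The helper name show_propagation is made up for this sketch, and unbindable mounts are reported as private for brevity.

```shell
# Print "<mount point> <propagation>" for every entry of a mountinfo file.
show_propagation() {
    awk '{
        shared = 0; slave = 0
        # Optional fields start at column 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) shared = 1
            if ($i ~ /^master:/) slave = 1
        }
        prop = shared ? (slave ? "shared,slave" : "shared") : (slave ? "slave" : "private")
        print $5, prop
    }' "${1:-/proc/self/mountinfo}"
}
```

If the root mount shows up as shared, running mount --make-rprivate / inside the new namespace is the shell equivalent of handling it yourself.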
This is a common gotcha when building containers from scratch.
Building the Container Filesystem
Step 1: Prepare rootfs
You need a root filesystem for the container. You can extract one from a Docker image:
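One way to do it, sketched under the assumption that the docker CLI is available. The alpine image and the /tmp/rootfs directory are example choices, and extract_rootfs is a made-up helper name.

```shell
# Export the filesystem of an image into a directory.
extract_rootfs() {
    image="${1:-alpine}"           # example image
    rootfs="${2:-/tmp/rootfs}"     # assumed target directory
    mkdir -p "$rootfs" || return 1

    # Create (but do not start) a container, then stream its filesystem out.
    cid=$(docker create "$image") || return 1
    docker export "$cid" | tar -x -C "$rootfs"
    status=$?

    docker rm "$cid" > /dev/null
    return "$status"
}
```

Running it as root keeps file ownership in the extracted tree intact.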
Step 2: Create namespaces
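A possible invocation, assuming a util-linux version whose unshare supports all of these flags (--cgroup is the most recent of them). The helper name enter_namespaces is hypothetical.

```shell
# Start a shell in new mount, PID, UTS, network, and cgroup namespaces.
# --fork is needed with --pid so that the shell becomes PID 1 of the
# new PID namespace instead of staying in the old one.
enter_namespaces() {
    sudo unshare --mount --pid --fork --uts --net --cgroup /bin/sh
}
```

Inside the resulting shell, echo $$ should print 1.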
Step 3: Isolate mount namespace
Make all existing mounts private so nothing leaks:
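In shell form this is a single recursive propagation change, run as root inside the new mount namespace. The helper name isolate_mounts is made up for this sketch.

```shell
# Recursively mark every mount in this namespace as private.
isolate_mounts() {
    mount --make-rprivate /
}
```

Afterwards, findmnt -o TARGET,PROPAGATION should report private for every entry.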
Step 4: Prepare /proc
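The /proc pseudo-filesystem must be mounted fresh inside the container. Tools such as ps read process information from /proc, so without a new mount made from within the container’s PID namespace they would keep showing the host’s processes: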
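A sketch of the mount, assuming the rootfs lives in /tmp/rootfs and the command runs as root inside the new PID and mount namespaces. The helper name mount_proc is hypothetical.

```shell
# Mount a new procfs instance under the container rootfs.
mount_proc() {
    rootfs="${1:-/tmp/rootfs}"     # assumed rootfs location
    mkdir -p "$rootfs/proc" &&
        mount -t proc -o nosuid,nodev,noexec proc "$rootfs/proc"
}
```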
Step 5: Prepare /dev
The container needs a few basic device nodes, such as /dev/null, /dev/zero, and /dev/urandom. A minimal approach:
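The sketch below is one such approach: a tmpfs on /dev filled with a handful of character devices, using their standard Linux major and minor numbers. It assumes root privileges and a rootfs in /tmp/rootfs; populate_dev is a made-up helper name.

```shell
# Populate a minimal /dev inside the rootfs.
populate_dev() {
    dev="${1:-/tmp/rootfs}/dev"    # /tmp/rootfs is an assumed location
    mkdir -p "$dev" || return 1
    mount -t tmpfs -o nosuid,mode=755 tmpfs "$dev" || return 1

    # name:major:minor of the standard character devices
    for spec in null:1:3 zero:1:5 full:1:7 random:1:8 urandom:1:9 tty:5:0; do
        name=${spec%%:*}
        numbers=${spec#*:}
        mknod -m 666 "$dev/$name" c "${numbers%%:*}" "${numbers#*:}" || return 1
    done

    # Conventional symlink that many programs expect
    ln -s /proc/self/fd "$dev/fd"
}
```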
Step 6: Prepare /sys
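A minimal sketch, again assuming root privileges and a rootfs in /tmp/rootfs. The helper name mount_sys is hypothetical.

```shell
# Mount sysfs under the container rootfs.
mount_sys() {
    rootfs="${1:-/tmp/rootfs}"     # assumed rootfs location
    mkdir -p "$rootfs/sys" &&
        mount -t sysfs -o nosuid,nodev,noexec sysfs "$rootfs/sys"
}
```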
Step 7: Bind hostname, hosts, and resolv.conf
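One way this step could look: generate the three files in a per-container directory on the host and bind-mount each over its counterpart in the rootfs. The directory /tmp/container-etc, the name demo, and the helper bind_etc_files are all assumptions made for this sketch.

```shell
# Bind-mount per-container hostname, hosts, and resolv.conf into the rootfs.
bind_etc_files() {
    rootfs="${1:-/tmp/rootfs}"           # assumed rootfs location
    state="${2:-/tmp/container-etc}"     # hypothetical per-container directory
    name="${3:-demo}"                    # hypothetical container hostname

    mkdir -p "$state" || return 1
    echo "$name" > "$state/hostname"
    printf '127.0.0.1\tlocalhost %s\n' "$name" > "$state/hosts"
    cp /etc/resolv.conf "$state/resolv.conf" || return 1

    for f in hostname hosts resolv.conf; do
        # The bind target must exist before it can be mounted over.
        touch "$rootfs/etc/$f" &&
            mount --bind "$state/$f" "$rootfs/etc/$f" || return 1
    done
}
```

Keeping these files outside the image lets each container have its own identity and DNS settings without modifying the shared rootfs.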
Step 8: Pivot into the new rootfs
This is the key step — pivot_root swaps the root filesystem:
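A sketch of the usual sequence, assuming root privileges, private mount propagation, and a rootfs in /tmp/rootfs. The helper name pivot_into and the .oldroot directory name are made up. pivot_root requires the new root to be a mount point, hence the recursive bind mount of the directory onto itself.

```shell
# Switch the root filesystem of the current mount namespace to the rootfs.
pivot_into() {
    rootfs="${1:-/tmp/rootfs}"     # assumed rootfs location

    # Turn the rootfs directory into a mount point, keeping any submounts.
    mount --rbind "$rootfs" "$rootfs" || return 1

    cd "$rootfs" || return 1
    mkdir -p .oldroot || return 1

    # The new root becomes "/", the old root is parked under /.oldroot.
    pivot_root . .oldroot || return 1
    cd /

    # Detach the old root so the host filesystem is no longer reachable.
    umount -l /.oldroot && rmdir /.oldroot
}
```

After this function returns, every path the shell resolves comes from the new rootfs.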
Step 9: Harden the filesystem
Make certain mounts read-only to prevent the container from modifying them:
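A common technique for this is to bind-mount a path onto itself and then remount that bind mount read-only, which affects only the one mount point. The helper name ro_remount is hypothetical, and which paths to protect is a policy decision.

```shell
# Make each given path read-only via a self bind mount plus a ro remount.
ro_remount() {
    for path in "$@"; do
        mount --bind "$path" "$path" &&
            mount -o remount,bind,ro "$path" || return 1
    done
}
```

For example, ro_remount /sys /proc/sys /proc/sysrq-trigger /proc/irq /proc/bus covers paths that container runtimes commonly treat as read-only.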
Step 10: Run the application
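A minimal sketch: replace the setup shell with the application, starting from a clean environment so that host variables do not leak in. The helper name run_app and the PATH value are assumptions.

```shell
# Replace the current shell with the given command (default: /bin/sh),
# keeping only a minimal environment.
run_app() {
    [ "$#" -gt 0 ] || set -- /bin/sh
    exec env -i \
        PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
        "$@"
}
```

Because of exec, the application takes over the shell's process ID, which matters when that shell is PID 1 of the container.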
You now have a shell running in an isolated filesystem that looks like a standalone Alpine Linux system.
Sharing Files with Containers
Under the hood, Docker volumes are bind mounts from a host directory into the container’s filesystem:
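A sketch of such a bind mount, performed as root inside the container's mount namespace before pivoting into the rootfs. The helper name share_dir and the default paths are made up for illustration.

```shell
# Bind-mount a host directory into the container rootfs.
share_dir() {
    src="$1"                       # host directory to share
    rootfs="${2:-/tmp/rootfs}"     # assumed rootfs location
    dst="${3:-/mnt/data}"          # path as seen inside the container
    mkdir -p "$rootfs$dst" &&
        mount --bind "$src" "$rootfs$dst"
}
```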
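If the bind mount is created inside the container’s mount namespace and propagation is private, it shows up only in the container’s mount table. The host does not see a mount at the target path, but it still sees the source directory at its original location, and because both sides refer to the same files, changes made by either one are visible to the other. The container sees the host directory at /mnt/data.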
Where Do Union Filesystems Come In?
Everything above uses a plain directory as rootfs. Real container runtimes add a layer on top: union filesystems (OverlayFS, typically) that combine read-only image layers with a writable upper layer. This is what enables:
- Multiple containers sharing the same base image layers (saves disk space)
- Copy-on-write semantics (container writes don’t modify the image)
- Efficient image distribution (only changed layers need to be pulled)
But the union filesystem is separate from namespace isolation. You can build a fully functional container without it — it’s an optimization, not a requirement.
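To make the layering concrete, here is a sketch of an OverlayFS mount that puts a writable upper layer over a read-only lower directory. The helper name mount_overlay and the directory layout are assumptions; upperdir and workdir must live on the same filesystem.

```shell
# Combine a read-only lower directory with a writable upper layer.
# $1: lower (image) directory, $2: base directory for upper/work/merged
mount_overlay() {
    lower="$1"
    base="$2"
    mkdir -p "$base/upper" "$base/work" "$base/merged" &&
        mount -t overlay overlay \
            -o "lowerdir=$lower,upperdir=$base/upper,workdir=$base/work" \
            "$base/merged"
}
```

The merged directory can then serve as the container rootfs: writes land in upper, while the lower directory stays untouched and can be shared by other containers.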
Key Takeaways
- Mount namespace is the foundation of container filesystem isolation — it isolates the mount table, not the filesystem itself
- Mount propagation controls whether mounts leak between namespaces; container runtimes typically set it to private (or slave)
- pivot_root is what actually switches the container to its own root filesystem
- Pseudo filesystems (/proc, /dev, /sys) need to be explicitly set up inside the container
- Union filesystems (OverlayFS) are an optimization layer on top, not strictly required for isolation
- Other namespaces (PID, UTS, network, cgroup) work together with mount namespace to complete the isolation picture
Understanding these primitives makes debugging container issues much easier — when something goes wrong with mounts, volumes, or filesystem permissions, you know exactly which layer to investigate.