One of the superpowers of containers is their isolated filesystem view — from inside a container it looks like a full Linux distro, often different from the host. Run docker run nginx, and Nginx lands in its familiar Debian userspace no matter what Linux flavor your host runs. But how is that illusion built?

In this post, we’ll walk through how to assemble a tiny but realistic container using only stock Linux tools: unshare, mount, and pivot_root. No runtime magic. Along the way, you’ll see why the mount namespace is the bedrock of container isolation, while other namespaces (PID, cgroup, UTS, network) play complementary roles.

What Does Mount Namespace Actually Isolate?

Start a new shell in its own mount namespace:

sudo unshare --mount bash

Now from another terminal, create a file on the host:

echo "Hello from host" | sudo tee /opt/marker.txt

Surprisingly, the file is visible from the namespaced shell too. So what did we actually isolate?

The answer is the mount table, not the filesystem itself. Linux mount namespaces isolate the list of mount points seen by processes in each namespace. The underlying filesystem is still shared — it’s only when you create new mount points that the views start diverging.

To verify, from the namespaced shell:

sudo mount --bind /tmp /mnt
ls -l /mnt    # shows /tmp contents

But from the host terminal, /mnt is unchanged (typically empty). The bind mount exists only in the new namespace’s mount table.

You can compare mount tables using findmnt from each terminal — the namespaced shell will show the extra /mnt mount point that the host doesn’t see.
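If you want something quicker than diffing findmnt output, the following sketch prints the identity of the current mount namespace and the size of its mount table. It assumes /proc is mounted; the function names are made up for this example. Run it in both terminals: the mnt:[...] values should differ, and the namespaced shell should report one more mount point after the bind mount.

```shell
#!/bin/sh
# Identity of the mount namespace this shell lives in.
# Two processes in the same namespace see the same mnt:[...] value.
mnt_ns_id() {
    readlink /proc/self/ns/mnt
}

# Number of mount points visible in this namespace
# (one line per mount in /proc/self/mountinfo).
mount_count() {
    wc -l < /proc/self/mountinfo | tr -d ' '
}

echo "mount namespace: $(mnt_ns_id)"
echo "mount points:    $(mount_count)"
```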

Mount namespaces were the first namespace type added to Linux, appearing in Linux 2.4.19 in 2002.

Mount Propagation

Before diving into how container runtimes use mount namespaces, there’s an important related concept: mount propagation.

When you create a new mount namespace, mount points can be configured to propagate (or not) between parent and child namespaces. This is controlled by propagation types:

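  • shared: mount and unmount events propagate in both directions between the mounts of a peer group
  • private: no propagation at all
  • slave: one-way propagation; events flow from the master (for example, the host) into the slave, never back
  • unbindable: like private, and the mount additionally cannot be used as the source of a bind mount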

Container runtimes typically set the root mount to private (or slave) in the container’s namespace so that host mounts don’t leak in, and container mounts don’t leak out. The unshare CLI tool does this automatically with --mount, but if you use the unshare() syscall directly, you need to handle it yourself:

mount --make-rprivate /

This is a common gotcha when building containers from scratch.
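To see which propagation type a mount actually has, you can read the optional fields of /proc/self/mountinfo: shared:N marks a shared mount, master:N a slave, and the absence of both a private one. Here is a minimal sketch of that classification; propagation_of is a hypothetical helper name, and it only inspects the fields described in proc(5).

```shell
#!/bin/sh
# Classify one mountinfo line by its optional fields, i.e. everything
# between field 6 (mount options) and the "-" separator.
propagation_of() {
    printf '%s\n' "$1" | awk '{
        shared = 0; slave = 0; unbindable = 0
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)         shared = 1
            else if ($i ~ /^master:/)    slave = 1
            else if ($i == "unbindable") unbindable = 1
        }
        if (shared && slave) print "shared+slave"
        else if (shared)     print "shared"
        else if (slave)      print "slave"
        else if (unbindable) print "unbindable"
        else                 print "private"
    }'
}

# Report the propagation type of the root mount in this namespace.
root_line=$(awk '$5 == "/" { print; exit }' /proc/self/mountinfo)
if [ -n "$root_line" ]; then
    echo "/ is $(propagation_of "$root_line")"
fi
```

On a typical systemd host this should print "shared" in the host shell and "private" after mount --make-rprivate / in the new namespace.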

Building the Container Filesystem

Step 1: Prepare rootfs

You need a root filesystem for the container. You can extract one from a Docker image:

mkdir -p /tmp/container/rootfs
docker export $(docker create alpine) | tar -C /tmp/container/rootfs -xf -
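Before moving on, it can save some head-scratching to confirm that the extracted tree contains the directories the later steps mount onto. The check below is only a sketch: check_rootfs is a hypothetical helper, and the directory list simply mirrors what this walkthrough uses.

```shell
#!/bin/sh
# Report directories that later steps rely on but that are missing from
# the rootfs. Prints nothing and returns 0 when the tree looks complete.
check_rootfs() {
    cr_rootfs=$1
    cr_missing=0
    for cr_dir in bin etc proc dev sys; do
        if [ ! -d "$cr_rootfs/$cr_dir" ]; then
            echo "missing: $cr_rootfs/$cr_dir"
            cr_missing=1
        fi
    done
    return "$cr_missing"
}

if check_rootfs /tmp/container/rootfs; then
    echo "rootfs looks complete"
else
    echo "rootfs is incomplete; re-run the export step"
fi
```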

Step 2: Create namespaces

sudo unshare --mount --pid --fork --uts bash

Step 3: Isolate mount namespace

Make all existing mounts private so nothing leaks:

mount --make-rprivate /

Step 4: Prepare /proc

The /proc pseudo-filesystem must be mounted inside the container so that tools such as ps, which read /proc, can see the processes of the new PID namespace:

mount -t proc proc /tmp/container/rootfs/proc
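Each numeric directory under a proc mount corresponds to one visible process, so listing them is a cheap way to check what a given /proc exposes. A small sketch, with visible_pids as a made-up helper name:

```shell
#!/bin/sh
# List the process IDs exposed by a proc mount (default: /proc).
# Every all-digit directory name is one process visible through that mount.
visible_pids() {
    ls "${1:-/proc}" | grep -E '^[0-9]+$' | sort -n
}

echo "processes visible here: $(visible_pids | wc -l | tr -d ' ')"
```

Pointed at the freshly mounted proc (visible_pids /tmp/container/rootfs/proc) from inside the new PID namespace, the list should be short and start at 1, while the host's /proc shows everything running on the machine.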

Step 5: Prepare /dev

The container needs a few basic device nodes. A minimal approach:

mount -t tmpfs tmpfs /tmp/container/rootfs/dev
mknod -m 666 /tmp/container/rootfs/dev/null c 1 3
mknod -m 666 /tmp/container/rootfs/dev/zero c 1 5
mknod -m 666 /tmp/container/rootfs/dev/random c 1 8
mknod -m 666 /tmp/container/rootfs/dev/urandom c 1 9
mknod -m 666 /tmp/container/rootfs/dev/tty c 5 0
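The major and minor numbers passed to mknod are not arbitrary; they have to match what the kernel assigns to each device. If you are unsure about a pair, you can read it off the host's own nodes. This sketch assumes a stat that supports the %t and %T format specifiers (GNU coreutils and BusyBox both do); dev_numbers is a hypothetical helper.

```shell
#!/bin/sh
# Print "major:minor" in decimal for a device node.
# stat reports both numbers in hexadecimal: %t is the major, %T the minor.
dev_numbers() {
    dn_hex=$(stat -c '%t %T' "$1") || return 1
    printf '%d:%d\n' "0x${dn_hex% *}" "0x${dn_hex#* }"
}

for name in null zero random urandom tty; do
    echo "$name $(dev_numbers "/dev/$name")"
done
```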

Step 6: Prepare /sys

mount -t sysfs sysfs /tmp/container/rootfs/sys

Step 7: Copy hostname, hosts, and resolv.conf

cp /etc/hostname /tmp/container/rootfs/etc/hostname
cp /etc/hosts /tmp/container/rootfs/etc/hosts
cp /etc/resolv.conf /tmp/container/rootfs/etc/resolv.conf
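Copying the host's files works, but the container then carries the host's name. As an alternative, you could generate container-specific files instead. The sketch below is one way to do that; write_identity and the name demo-container are invented for this example, and the hosts content is deliberately minimal.

```shell
#!/bin/sh
# Write a container-specific /etc/hostname and /etc/hosts into a rootfs.
# $1 is the rootfs directory, $2 the name the container should carry.
write_identity() {
    wi_rootfs=$1
    wi_name=$2
    mkdir -p "$wi_rootfs/etc"
    printf '%s\n' "$wi_name" > "$wi_rootfs/etc/hostname"
    printf '127.0.0.1\tlocalhost %s\n' "$wi_name" > "$wi_rootfs/etc/hosts"
}

# Demonstrate on a throwaway directory rather than the real rootfs.
demo_root=$(mktemp -d)
write_identity "$demo_root" demo-container
cat "$demo_root/etc/hosts"
rm -rf "$demo_root"
```

Because the shell from Step 2 runs in its own UTS namespace, you can also run hostname demo-container there to change the kernel hostname without affecting the host.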

Step 8: Pivot into the new rootfs

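This is the key step: pivot_root swaps the root filesystem. It requires the new root to be a mount point, which a plain directory is not, so bind-mount the rootfs onto itself first. The bind is recursive so that the /proc, /dev, and /sys mounts from the previous steps come along:

mkdir -p /tmp/container/rootfs/.old_root
mount --rbind /tmp/container/rootfs /tmp/container/rootfs
pivot_root /tmp/container/rootfs /tmp/container/rootfs/.old_root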
cd /
umount -l /.old_root
rmdir /.old_root
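pivot_root rejects a new root that is not a mount point, so when it fails with "Invalid argument" that is the first thing worth checking. A minimal sketch of such a check, reading field 5 (the mount point) of mountinfo; is_mountpoint and its optional second argument are assumptions made for this example:

```shell
#!/bin/sh
# Succeed if the given absolute path is a mount point in this namespace.
# $2 optionally names an alternative mountinfo file (handy for testing).
is_mountpoint() {
    awk -v target="$1" '$5 == target { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mountinfo}"
}

for p in / /tmp/container/rootfs; do
    if is_mountpoint "$p"; then
        echo "$p: mount point"
    else
        echo "$p: not a mount point"
    fi
done
```

The util-linux mountpoint command answers the same question if it is available on your system.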

Step 9: Harden the filesystem

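Make certain paths read-only to prevent the container from modifying them. /proc/sys is not a mount point of its own, so it is first bind-mounted onto itself and then remounted read-only. The bind flag on the remounts limits the read-only change to this mount instead of the underlying filesystem instance, which the host may share:

mount -o bind /proc/sys /proc/sys
mount -o remount,bind,ro /proc/sys /proc/sys
mount -o remount,bind,ro /sys /sys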
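To confirm the hardening took effect, you can look at the per-mount options in mountinfo (field 6) and check for the ro flag. The helpers below are a sketch; mount_opts and is_readonly are hypothetical names, and the optional second argument exists only to make them easy to try against a sample file.

```shell
#!/bin/sh
# Print the per-mount options (field 6) of the top-most mount at a path.
mount_opts() {
    awk -v target="$1" '$5 == target { opts = $6 } END { print opts }' \
        "${2:-/proc/self/mountinfo}"
}

# Succeed if the mount at the path carries the "ro" flag.
is_readonly() {
    case ",$(mount_opts "$@")," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

for p in /proc/sys /sys; do
    if is_readonly "$p"; then
        echo "$p: read-only"
    else
        echo "$p: writable or not a mount point"
    fi
done
```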

Step 10: Run the application

exec /bin/sh

You now have a shell running in an isolated filesystem that looks like a standalone Alpine Linux system.

Sharing Files with Containers

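Docker bind mounts and volumes work the same way under the hood: a host directory is bind-mounted into the container’s root filesystem. Run this inside the container’s mount namespace, before pivot_root, with the target directory already created:

mount --bind /host/path /tmp/container/rootfs/mnt/data

Because the container has its own mount namespace with private propagation, this mount appears only in the container’s mount table. The host’s mount table is unchanged, although the host can still reach the same files at /host/path. After the pivot, the container sees the host directory at /mnt/data.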
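Runtimes also offer read-only shares, which takes one more step than a plain bind mount. The sketch below prints the commands involved when DRY_RUN=1 and executes them otherwise (which needs root inside the container's mount namespace). share_dir, run, and DRY_RUN are names invented for this example, and the paths are placeholders.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, execute it otherwise.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Bind-mount host directory $1 into rootfs $2 at container path $3.
# $4 is "rw" (default) or "ro".
share_dir() {
    sd_src=$1
    sd_rootfs=$2
    sd_target=$3
    sd_mode=${4:-rw}
    run mkdir -p "$sd_rootfs$sd_target"
    run mount --bind "$sd_src" "$sd_rootfs$sd_target"
    if [ "$sd_mode" = ro ]; then
        # Historically the read-only flag had to be applied with a separate
        # remount; doing it explicitly works on old and new tools alike.
        run mount -o remount,bind,ro "$sd_rootfs$sd_target"
    fi
}

DRY_RUN=1
share_dir /host/path /tmp/container/rootfs /mnt/data ro
```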

Where Do Union Filesystems Come In?

Everything above uses a plain directory as rootfs. Real container runtimes add a layer on top: union filesystems (OverlayFS, typically) that combine read-only image layers with a writable upper layer. This is what enables:

  • Multiple containers sharing the same base image layers (saves disk space)
  • Copy-on-write semantics (container writes don’t modify the image)
  • Efficient image distribution (only changed layers need to be pulled)

But the union filesystem is separate from namespace isolation. You can build a fully functional container without it — it’s an optimization, not a requirement.
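For a feel of what that extra layer looks like, here is a sketch that only prints an OverlayFS mount command rather than running it. The lowerdir, upperdir, and workdir options are the real overlay mount options; the /layers and /c1 paths and the overlay_opts helper are placeholders made up for this example.

```shell
#!/bin/sh
# Build the option string for an overlay mount.
# $1: colon-separated read-only layers, top-most first
# $2: writable upper directory
# $3: empty work directory on the same filesystem as the upper directory
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$1" "$2" "$3"
}

echo "mount -t overlay overlay -o $(overlay_opts /layers/app:/layers/base /c1/upper /c1/work) /c1/merged"
```

The merged directory would then take the place of /tmp/container/rootfs in the steps above, and a second container could reuse the same lower layers with its own upper and work directories.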

Key Takeaways

  • Mount namespace is the foundation of container filesystem isolation — it isolates the mount table, not the filesystem itself
  • Mount propagation controls whether mounts leak between namespaces; container runtimes typically set the container’s root to private or slave
  • pivot_root is what actually switches the container to its own root filesystem
  • Pseudo filesystems (/proc, /dev, /sys) need to be explicitly set up inside the container
  • Union filesystems (OverlayFS) are an optimization layer on top — not strictly required for isolation
  • Other namespaces (PID, UTS, network, cgroup) work together with mount namespace to complete the isolation picture

Understanding these primitives makes debugging container issues much easier — when something goes wrong with mounts, volumes, or filesystem permissions, you know exactly which layer to investigate.