Barges of Serfs

We're going to connect multiple containers running on multiple hosts over a public network. Suppose, for a moment, that one server is a multicore monster capable of servicing a lot of small processes, and another is a beefy RAID machine for storing things. Since we cannot trust a public network, we must also set up a VPN tunnel between the hosts. If you're fine with Ruby, grab the Vagrantfile, then run vagrant up and query the Serf cluster:

$ vagrant ssh barge3 -c "docker exec serf0 serf members"
barge3-serf4  172.20.103.5:7946  alive
barge3-serf3  172.20.103.4:7946  alive
barge3-serf0  172.20.103.1:7946  alive
barge2-serf0  172.20.102.1:7946  alive
barge1-serf0  172.20.101.1:7946  alive
barge2-serf1  172.20.102.2:7946  alive
barge3-serf1  172.20.103.2:7946  alive
barge3-serf2  172.20.103.3:7946  alive
Connection to 127.0.0.1 closed.

Planning the addressing space

Our environment consists of several nodes with IPs like 10.10.10.10x, the easy-to-follow addresses created by Vagrant. On the real Internet these can be any public IP addresses that every node knows about and can reach. Now we must create an IP addressing plan for an overlay network, i.e. a virtual network where all containers operate knowing nothing about their hosts or the hosts' public addresses. The simplest plan is a network with up to 256 nodes (NNN) and 254 containers (MMM) per host:

172.20.0.0/16 overlay network address
- 172.20.NNN.0/24 subnet for each node bargeNNN
  - 172.20.NNN.MMM container addresses serfMMM
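
The plan above can be written down as two small shell helpers. This is only a sketch: the function names node_subnet and container_addr are made up for illustration, and the 1..254 limit on MMM is an assumption taken from the "254 containers per host" figure.

```shell
#!/bin/sh
# Hypothetical helpers for the addressing plan; names are illustrative only.
OVERLAY_PREFIX="172.20"

# node_subnet NNN -> the /24 reserved for one host
node_subnet() {
    printf '%s.%s.0/24\n' "$OVERLAY_PREFIX" "$1"
}

# container_addr NNN MMM -> address of container MMM on host NNN
container_addr() {
    node=$1
    ctr=$2
    # Assumption: NNN is any byte value, MMM is limited to 1..254.
    if [ "$node" -lt 0 ] || [ "$node" -gt 255 ] ||
       [ "$ctr" -lt 1 ] || [ "$ctr" -gt 254 ]; then
        echo "out of range: node=$node container=$ctr" >&2
        return 1
    fi
    printf '%s.%s.%s\n' "$OVERLAY_PREFIX" "$node" "$ctr"
}

node_subnet 101          # prints 172.20.101.0/24
container_addr 103 23    # prints 172.20.103.23
```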

Installing Docker and WireGuard on Barge

Barge is a Linux distribution similar in properties to boot2docker. The only reason to pick it over Ubuntu or any other popular distribution is its memory footprint: we have to run several nodes to simulate a real multihost environment. Either way, the first step on every host is to install Docker and WireGuard (take a look at install() in the Vagrantfile).

Creating overlay network

We have to run this on every node, adjusting --ip-range for each one:

# At `barge1` with installed Docker
$ docker network create \
    --driver=bridge \
    --subnet=172.20.0.0/16 \
    --ip-range=172.20.101.0/24 \
    --opt "com.docker.network.bridge.name"="ovl0" \
    ovl0

As you can see, we instruct Docker's IPAM driver to assign IPs within a selected subrange according to our plan. This command creates a bridge and sets up the routing tables of the host and of any container attached to that network (for example, docker run -d --net=ovl0 nginx).
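
Since only --ip-range differs between hosts, the command can be derived from the hostname. The sketch below only prints the command instead of running it. It assumes that bargeN maps to subnet 172.20.(100+N).0/24, which is inferred from the examples in this post and not from the Vagrantfile itself; ip_range_for and network_create_cmd are hypothetical names.

```shell
#!/bin/sh
# ip_range_for HOSTNAME -> the per-host --ip-range
# Assumption: "bargeN" owns 172.20.(100+N).0/24, as in the examples here.
ip_range_for() {
    n=${1#barge}                       # "barge2" -> "2"
    case $n in
        ''|*[!0-9]*|0*)
            echo "unexpected hostname: $1" >&2
            return 1 ;;
    esac
    printf '172.20.%d.0/24\n' "$((100 + n))"
}

# Print (do not execute) the network-create command for one host.
network_create_cmd() {
    range=$(ip_range_for "$1") || return 1
    printf 'docker network create --driver=bridge --subnet=172.20.0.0/16 --ip-range=%s --opt com.docker.network.bridge.name=ovl0 ovl0\n' "$range"
}

network_create_cmd barge2
```

On a real node the printed line could be reviewed first and then run by hand.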

If you look at the routing table of a container, you'll see this:

# Inside a container at `barge1`
$ ip route
default via 172.20.101.0 dev eth0
172.20.0.0/16 dev eth0 scope link  src 172.20.101.5

Anything within the ovl0 subnet (172.20.0.0/16) goes through eth0 with source address 172.20.101.5. Notice that container addresses won't span the full --subnet range, only the specific --ip-range available to a barge. Everything else goes through the default gateway, with the container's source IP replaced by the gateway's IP. Let's follow the first container-to-container request.

If container serf15 with IP 172.20.101.15 on host barge1 wants to connect to container serf23 on barge3 with IP 172.20.103.23, it must know the destination MAC address. Since those containers are attached to the same bridge network, 172.20.0.0/16, an ARP request within this subnet is going to be used. However, by default, the Linux network stack does not relay this kind of request from one interface to another, and the first secret sauce ingredient is:

# Run at every `bargeN` to allow ARPing within `ovl0`
$ sysctl net.ipv4.conf.ovl0.proxy_arp=1

Now the ARP packet leaves the ovl0 bridge, and the host's routing table must decide where to forward it (IP forwarding must be enabled on the host: sysctl -w net.ipv4.ip_forward=1). By default, Docker adds the route 172.20.0.0/16 dev ovl0, which is not what we need: it tells the kernel to send the ARP request back to the sender through the ovl0 device. We must fix this by changing the output device from ovl0 to the WireGuard interface wg0, and by letting packets from other hosts addressed to this host's --ip-range be forwarded to the ovl0 bridge. This is the second sauce ingredient, by the way:

# At `barge1`
$ ip route del 172.20.0.0/16 dev ovl0
$ ip route add 172.20.101.0/24 dev ovl0

# The next line does not work here because there is no `wg0` yet
# We will make this call later in `ifup`
#
# ip route add 172.20.0.0/16 dev wg0
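
Once both routes are in place, the kernel picks the most specific match: the local /24 wins over the /16. The following sketch imitates that decision with plain pattern matching, which works here only because both prefixes end on octet boundaries. egress_dev is a hypothetical helper, not a real tool, and "default" stands for whatever the host's default route is.

```shell
#!/bin/sh
# egress_dev DST NNN -> device the host with subnet 172.20.NNN.0/24 would use
egress_dev() {
    dst=$1
    node=$2
    case $dst in
        172.20."$node".*) echo ovl0 ;;     # 172.20.NNN.0/24 dev ovl0 (more specific)
        172.20.*)         echo wg0 ;;      # 172.20.0.0/16  dev wg0
        *)                echo default ;;  # everything else: host's default route
    esac
}

egress_dev 172.20.101.15 101   # prints ovl0
egress_dev 172.20.103.23 101   # prints wg0
egress_dev 10.10.10.103 101    # prints default
```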

Configuring WireGuard

With the addressing plan in mind, let's add a WireGuard config for every node. Assuming a three-node cluster, here is /etc/wireguard/wg0.conf for the barge2 host:

# HOSTNAME:   barge2
# IP:         10.10.10.102
[Interface]
PrivateKey = -- private key --
ListenPort = 12345

# HOSTNAME:   barge1
[Peer]
PublicKey = -- public key --
Endpoint = 10.10.10.101:12345
AllowedIPs = 172.20.101.0/24

# HOSTNAME:   barge3
[Peer]
PublicKey = -- public key --
Endpoint = 10.10.10.103:12345
AllowedIPs = 172.20.103.0/24
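
The [Peer] sections follow directly from the addressing plan, so they can be generated. WireGuard uses AllowedIPs both to choose the peer for an outgoing packet and to accept incoming packets only from matching source addresses, which is why each peer gets exactly its own /24. The sketch below is illustrative: wg_peers is a made-up name, the key placeholders are left untouched, and the bargeN to 10.10.10.(100+N) and 172.20.(100+N).0/24 mapping is assumed from the examples in this post.

```shell
#!/bin/sh
# wg_peers SELF N... -> [Peer] stanzas for every listed node except SELF
wg_peers() {
    self=$1
    shift
    for n in "$@"; do
        [ "$n" = "$self" ] && continue    # a node is not its own peer
        printf '# HOSTNAME:   barge%s\n[Peer]\nPublicKey = -- public key --\nEndpoint = 10.10.10.%d:12345\nAllowedIPs = 172.20.%d.0/24\n\n' \
            "$n" "$((100 + n))" "$((100 + n))"
    done
}

# Peers for barge2 in a three-node cluster
wg_peers 2 1 2 3
```

The real keys still have to be produced separately, for example with wg genkey and wg pubkey.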

Setting up WireGuard network interface

The last step is setting up the wg0 interface. It is perfectly reasonable to use wg-quick or just follow the quick start guide, but since barge-os is a BusyBox-based distro, we are going to use ifup and the /etc/network/interfaces file:

auto wg0
iface wg0 inet static
  address 0.0.0.0                                                             # (1)
  netmask 255.255.255.255                                                     # (1)
  pre-up ip link add dev wg0 type wireguard
  pre-up wg setconf wg0 /etc/wireguard/wg0.conf
  post-up sysctl net.ipv4.conf.ovl0.proxy_arp=1                               # (2)
  post-up ip route add 172.20.0.0/16 dev wg0                                  # (3)
  post-up iptables -t nat -I POSTROUTING -o wg0 -s 172.20.0.0/16 -j ACCEPT    # (4)
  post-down iptables -t nat -D POSTROUTING -o wg0 -s 172.20.0.0/16 -j ACCEPT
  post-down sysctl net.ipv4.conf.ovl0.proxy_arp=0
  post-down ip link delete dev wg0

Notes:

(1) We don't need to assign any address to the wg0 interface. address and netmask are set here only because ifup requires them.
(2) After the interface is up, enable proxy_arp for the ovl0 Docker bridge.
(3) Route --subnet packets to wg0 (not ovl0) by default.
(4) The third ingredient: a firewall rule.

The POSTROUTING chain of the nat table translates the source address of anything leaving the host (SNAT), which would hide the real container address behind the host IP. This iptables rule leaves anything going out through wg0 with an overlay source address untouched. This rule must be the first one in the chain! That prevents other MASQUERADE/SNAT rules from processing the overlay's packets.
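
Because the order of the rules matters, it is worth checking that the ACCEPT rule really sits at the top. The sketch below reads rule listings in the format printed by iptables -t nat -S POSTROUTING from stdin. The function name overlay_rule_is_first is hypothetical, the sample input is made up for illustration, and the exact text of the normalized rule is an assumption about how iptables prints it on your system.

```shell
#!/bin/sh
# Succeeds only if the first appended POSTROUTING rule is the overlay ACCEPT.
overlay_rule_is_first() {
    # Ignore the "-P" policy line and look at the first "-A" rule only.
    first=$(grep '^-A POSTROUTING' | head -n 1)
    [ "$first" = "-A POSTROUTING -s 172.20.0.0/16 -o wg0 -j ACCEPT" ]
}

# Illustrative input; on a real host pipe in: iptables -t nat -S POSTROUTING
overlay_rule_is_first <<'EOF' && echo "overlay ACCEPT rule is first"
-P POSTROUTING ACCEPT
-A POSTROUTING -s 172.20.0.0/16 -o wg0 -j ACCEPT
-A POSTROUTING -s 172.20.0.0/16 ! -o ovl0 -j MASQUERADE
EOF
```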

Looking back

Let me explain it again. Packets from barge1 to barge2 leave the ovl0 bridge of barge1 and are directed into the wg0 tunnel by route (3) and firewall rule (4). On the barge2 side, the wg0 interface lets the packet in (remember the peer setup in /etc/wireguard/wg0.conf) and the kernel routes it to the ovl0 bridge via the narrower route, the second sauce ingredient.

Testing setup with Serf

In order to test this setup, we have to run some containers and let them chat. Serf is an agent process that helps build decentralized clusters. The agents maintain membership information across all the nodes by carefully watching each other's health status. To join the cluster, a new agent must know the address of any member that is already in the club.

FROM busybox

RUN wget https://releases.hashicorp.com/serf/0.8.2/serf_0.8.2_linux_amd64.zip \
 && unzip -d /bin serf_0.8.2_linux_amd64.zip

CMD ["/bin/serf", "agent", "-retry-join", "172.20.101.1"]

Eventually, serf members inside any container of our simulation will show the list of members across multiple hosts, just as shown at the beginning.
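
To see at a glance that every host contributes members, the listing can be summarized per host. This is a small sketch: alive_per_host is a hypothetical helper, and it assumes member names of the form HOST-serfN with the status in the third column, as in the output at the top of this post.

```shell
#!/bin/sh
# Count members reported as "alive", grouped by the host part of the name.
alive_per_host() {
    awk '$3 == "alive" { split($1, p, "-"); n[p[1]]++ }
         END { for (h in n) print h, n[h] }' | sort
}

# Sample input copied from the beginning of the post; on a real node use:
#   docker exec serf0 serf members | alive_per_host
alive_per_host <<'EOF'
barge3-serf4  172.20.103.5:7946  alive
barge3-serf3  172.20.103.4:7946  alive
barge3-serf0  172.20.103.1:7946  alive
barge2-serf0  172.20.102.1:7946  alive
barge1-serf0  172.20.101.1:7946  alive
barge2-serf1  172.20.102.2:7946  alive
barge3-serf1  172.20.103.2:7946  alive
barge3-serf2  172.20.103.3:7946  alive
Connection to 127.0.0.1 closed.
EOF
# prints:
# barge1 1
# barge2 2
# barge3 5
```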