We're going to connect multiple containers running on multiple hosts over a public network. Suppose for a moment that one server is a multicore monster capable of servicing a lot of small processes, and another one is a beefy RAID machine for storing things. Since we cannot trust a public network, we must also set up a VPN tunnel between the hosts. If you're fine with Ruby, grab the Vagrantfile, then vagrant up
and query the Serf cluster:
$ vagrant ssh barge3 -c "docker exec serf0 serf members"
barge3-serf4 172.20.103.5:7946 alive
barge3-serf3 172.20.103.4:7946 alive
barge3-serf0 172.20.103.1:7946 alive
barge2-serf0 172.20.102.1:7946 alive
barge1-serf0 172.20.101.1:7946 alive
barge2-serf1 172.20.102.2:7946 alive
barge3-serf1 172.20.103.2:7946 alive
barge3-serf2 172.20.103.3:7946 alive
Connection to 127.0.0.1 closed.
Planning the addressing space
Our environment consists of several nodes with IPs like 10.10.10.10x, easy-to-follow addresses created by Vagrant. On the real Internet these can be any public IP addresses that every node knows about and can reach. Now we must create an IP addressing plan for the overlay network, i.e. the virtual network where all containers operate, knowing nothing about their hosts and their public faces. The simplest one is a net with 256 nodes (NNN) and 254 containers (MMM) per host:
172.20.0.0/16     overlay network address
172.20.NNN.0/24   subnet for each node bargeNNN
172.20.NNN.MMM    container addresses serfMMM
Installing Docker and WireGuard at Barge
Barge is a Linux distribution similar in properties to boot2docker. The single reason to pick it instead of Ubuntu or any other popular variant is its memory footprint: we have to run several nodes to simulate a real multi-host environment. Anyway, the first step at any host is to install Docker and WireGuard (take a look at install()
in the Vagrantfile).
Creating overlay network
We have to run this at every node, adjusting --ip-range
appropriately:
# At `barge1` with installed Docker
$ docker network create \
--driver=bridge \
--subnet=172.20.0.0/16 \
--ip-range=172.20.101.0/24 \
--opt "com.docker.network.bridge.name"="ovl0" \
ovl0
As you can see, we instruct Docker's IPAM driver to assign IPs within a selected subrange according to our plan. This command creates a bridge and updates the routing tables of the host and of any container attached to that network (for example docker run -d --net=ovl0 nginx
).
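To double-check that the IPAM driver picked up our plan, you can inspect the network. The one-liner below is only a sanity check, not part of the setup itself; the format template assumes a reasonably recent Docker, and the exact output may differ:
# At `barge1`
$ docker network inspect ovl0 --format '{{range .IPAM.Config}}{{.Subnet}} {{.IPRange}}{{end}}'
172.20.0.0/16 172.20.101.0/24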
If you look at the routing table inside a container, you'll see this:
# Inside a container at `barge1`
$ ip route
default via 172.20.101.0 dev eth0
172.20.0.0/16 dev eth0 scope link src 172.20.101.5
Anything within the ovl0
subnet (172.20.0.0/16) goes through eth0
with source address 172.20.101.5. Notice that container addresses won't span the full --subnet
range, only the concrete --ip-range
available to a given barge. Everything else goes through the default gateway, which substitutes the container's source IP with the gateway's IP.
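That substitution is done by a MASQUERADE rule that Docker installs for the bridge. You can see it with iptables; only the relevant line is shown here, and it may sit next to similar rules for Docker's default bridge:
# At `barge1`
$ iptables -t nat -S POSTROUTING | grep 172.20
-A POSTROUTING -s 172.20.0.0/16 ! -o ovl0 -j MASQUERADE
Keep this rule in mind: later we will have to step in front of it so that overlay traffic leaves the tunnel with its real source address.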
Let's follow the first container-to-container request. If container serf15
with IP 172.20.101.15
at host barge1
would like to connect to container serf23
at barge3
with IP 172.20.103.23
, it must know the destination MAC address. Since both containers are attached to the same network 172.20.0.0/16
, an ARP request to this subnet is going to be used. However, by default, the Linux network stack prevents this kind of broadcast from crossing from one interface to another, and the first secret sauce ingredient is:
# Run at every `bargeN` to allow ARPing within `ovl0`
$ sysctl net.ipv4.conf.ovl0.proxy_arp=1
Now the ARP request gets answered and the packet leaves the ovl0
bridge; the host's routing table must decide where to forward it (IP forwarding must be enabled on the host: sysctl -w net.ipv4.ip_forward=1
). By default, Docker has added the route 172.20.0.0/16 dev ovl0
, which is not what we need: it tells the kernel to return the packet back to the sender using the ovl0
device. We must fix it by replacing ovl0
with the WireGuard wg0
interface as the output device, and by letting packets from other hosts destined to this host's --ip-range
be forwarded to the ovl0
bridge. The second sauce ingredient, by the way:
# At `barge1`
$ ip route del 172.20.0.0/16 dev ovl0
$ ip route add 172.20.101.0/24 dev ovl0
# The next line does not work here because there is no `wg0` yet
# We will make this call later in `ifup`
#
# ip route add 172.20.0.0/16 dev wg0
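Once wg0 exists and the commented route is added (see ifup below), the overlay part of the host routing table on barge1 should look roughly like this; attributes such as proto and scope may differ on your system:
# At `barge1`, after `wg0` is up
$ ip route | grep 172.20
172.20.0.0/16 dev wg0 scope link
172.20.101.0/24 dev ovl0 scope link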
Configuring WireGuard
With the addressing plan in mind, let's add a WireGuard config to every node. Assuming a three-node cluster, here is /etc/wireguard/wg0.conf
for the barge2
host:
# HOSTNAME: barge2
# IP: 10.10.10.102
[Interface]
PrivateKey = -- private key --
ListenPort = 12345
# HOSTNAME: barge1
[Peer]
PublicKey = -- public key --
Endpoint = 10.10.10.101:12345
AllowedIPs = 172.20.101.0/24
# HOSTNAME: barge3
[Peer]
PublicKey = -- public key --
Endpoint = 10.10.10.103:12345
AllowedIPs = 172.20.103.0/24
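The private and public keys above are placeholders. Each host generates its own pair with the wg tool; the file paths here are just a convention, not something the rest of the setup depends on:
# At any `bargeN`
$ wg genkey | tee /etc/wireguard/privatekey | wg pubkey > /etc/wireguard/publickey
$ chmod 600 /etc/wireguard/privatekey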
Setting up WireGuard network interface
The last step is setting up the wg0
interface. It is perfectly reasonable to use wg-quick
or just follow the quick start guide, but since barge-os is a Busybox-based distro, we are going to use ifup
and the /etc/network/interfaces
file:
auto wg0
iface wg0 inet static
address 0.0.0.0 # (1)
netmask 255.255.255.255 # (1)
pre-up ip link add dev wg0 type wireguard
pre-up wg setconf wg0 /etc/wireguard/wg0.conf
post-up sysctl net.ipv4.conf.ovl0.proxy_arp=1 # (2)
post-up ip route add 172.20.0.0/16 dev wg0 # (3)
post-up iptables -t nat -I POSTROUTING -o wg0 -s 172.20.0.0/16 -j ACCEPT # (4)
post-down iptables -t nat -D POSTROUTING -o wg0 -s 172.20.0.0/16 -j ACCEPT
post-down sysctl net.ipv4.conf.ovl0.proxy_arp=0
post-down ip link delete dev wg0
Notes:
1. We don't need to assign any address to the wg0 interface. address and netmask are only here because they are required options for ifup.
2. After the interface is up, enable proxy_arp for the ovl0 Docker bridge.
3. Direct --subnet packets to wg0 by default (not ovl0).
4. The third ingredient, the firewall rule. The POSTROUTING chain of the nat table translates the source address of anything leaving the host (SNAT). That would shadow the real container address with the host IP. This iptables rule lets anything going out through wg0 with an overlay network source address pass untouched. This rule must be the first one in the chain! That prevents other MASQUERADE/SNAT rules from processing the overlay's packets.
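With that file in place, bringing the tunnel up and checking the configured peers is straightforward:
# At any `bargeN`
$ ifup wg0
$ wg show wg0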
Looking back
Let me explain it again. Packets from barge1
to barge2
come out of the ovl0
bridge on barge1
and are directed into the wg0
tunnel via route (3) and firewall rule (4). On the barge2
side, the wg0
interface lets the packet in (remember the peer setup in /etc/wireguard/wg0.conf
), and the kernel routes it to the ovl0
bridge via the narrowing route, the second sauce ingredient.
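Before bringing Serf into the picture, the whole path can be sanity-checked with a plain ping between two containers. The names below come from the Vagrant demo at the top of the post, where barge1-serf0 has 172.20.101.1 and barge3-serf0 has 172.20.103.1:
# At `barge1`: reach a container on `barge3` from a container on `barge1`
$ docker exec serf0 ping -c 1 172.20.103.1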
Testing setup with Serf
In order to test this setup, we have to run some containers and let them chat. Serf is an agent process that helps to build decentralized clusters. These agents maintain membership information across all the nodes by carefully watching each other's health status. To join the cluster, a new agent must know the address of any member that is already in the club.
FROM busybox
RUN wget https://releases.hashicorp.com/serf/0.8.2/serf_0.8.2_linux_amd64.zip \
&& unzip -d /bin serf_0.8.2_linux_amd64.zip
CMD ["/bin/serf", "agent", "-retry-join", "172.20.101.1"]
Eventually, serf members
inside any container of our simulation will show the list of members across multiple hosts, just as it was shown at the beginning.