OpenVPN Always-On Auto Roaming with Route Metric

Jul 2, 2023 · 1305 words · 7 minute read

Sysadmin Networking OpenVPN

I recently switched from WireGuard to OpenVPN for remote access from endpoint devices (Point-to-Site) because OpenVPN offers greater manageability. The clients are all Windows 10 devices, and I would like them to have the VPN always on, so users can log on to the domain seamlessly from outside the on-prem network. Moreover, I want the VPN to be off when they are on-prem to optimize routing. DirectAccess uses an internal HTTP endpoint to detect whether the network is internal, while OpenVPN does not provide anything similar. However, it is easy to achieve this using a well-designed network topology and different route metrics.

Note: I am a complete amateur in network engineering, so this post just shows one possible approach that I used at home. Do not take it too seriously!

My topology is a little bit complex: I have three subnets, each representing a different site, and they all need to be interconnected:

CORE - Cloud site: 10.0.1.0/24: The core domain controllers and other servers (e.g., ADCS, NAS, Unix servers, and SCCM primary site in the future). This is hosted on my homelab vSphere cluster.
YVR - Vancouver site: 10.0.2.0/24: Basically my home: RODC, printer, PCs, phones.
ROAM - Mobile devices site: 10.0.3.0/24: The subnet for OpenVPN clients. It used to be WireGuard, so I assigned a designated /24 to all the mobile devices. With OpenVPN, the clients could (?) potentially (?) possibly (?) be placed in the cloud site subnet or some other subnet? I’m no expert in networking, so I just use the simple approach: give them a dedicated subnet.

Note that all three subnets sit properly in a 10.0.0.0/22 subnet. I will use that later.
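
As a quick sanity check of that claim, here is a minimal sketch using Python's standard ipaddress module. The subnets are the ones listed above; the variable names are just for illustration.

```python
import ipaddress

# Site subnets as listed in the post.
sites = {
    "CORE": ipaddress.ip_network("10.0.1.0/24"),
    "YVR": ipaddress.ip_network("10.0.2.0/24"),
    "ROAM": ipaddress.ip_network("10.0.3.0/24"),
}
supernet = ipaddress.ip_network("10.0.0.0/22")

# Every site subnet should be contained in the /22.
for name, net in sites.items():
    print(name, net, "inside", supernet, "->", net.subnet_of(supernet))

# The /22 spans four /24s; list the ones that no site uses.
unused = [n for n in supernet.subnets(new_prefix=24) if n not in sites.values()]
print("unused /24s:", unused)
```

Note that the /22 also covers 10.0.0.0/24, which none of the sites use, so the summary route is slightly wider than strictly necessary.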

The site-to-site VPN between CORE and YVR is WireGuard, and it works perfectly.

The point-to-site VPN between ROAM and CORE is OpenVPN.

The topology is as follows:

        +------------------------+
        |   CORE - 10.0.1.0/24   |
        |               +--------+
        +---------+     | OpenVPN|
        |WireGuard|     |10.0.3.1|<----- When roaming,
        +----+----+-----+----+---+       VPN is preferred
             |               |
+-------+----+----+     +----+-------+
|       |WireGuard|     | OpenVPN    |
|       +---------+     |10.0.3.x/24 |
|YVR - 10.0.2.0/24+-----+   Client   |
+-----------------+ ^   +------------+
                    |
                    |
                   When on-prem, WiFi is preferred

(Generated using https://asciiflow.com/)

An intuitive approach is to write a script: if a ping to 10.0.2.1 succeeds, stop the VPN; otherwise, start it. However, this is not only complicated but also interrupts the network (OpenVPN startup takes time). Therefore, it is better to set a higher metric on the routes that go over the OpenVPN interface, so the OS will prefer the WiFi over the VPN. We can safely leave OpenVPN always connected because the client will not route much traffic over it.

This approach has one major downside: if the WiFi network conflicts with any of the subnets, it breaks the whole setup (roaming clients cannot access that subnet). This does happen on public networks, so it is not guaranteed to work 100% of the time. A more robust way would indeed be to have the client access an internal endpoint to switch the VPN on and off.
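
To make the conflict concrete, here is a small sketch that checks whether a given WiFi network overlaps any of the site subnets. The function name and the 192.168.1.0/24 network are made up for illustration; this is not something the setup itself runs.

```python
import ipaddress

# Site subnets as listed in the post.
SITES = {
    "CORE": ipaddress.ip_network("10.0.1.0/24"),
    "YVR": ipaddress.ip_network("10.0.2.0/24"),
    "ROAM": ipaddress.ip_network("10.0.3.0/24"),
}

def conflicting_sites(wifi_cidr):
    """Return the names of the sites whose subnet overlaps the WiFi network."""
    wifi = ipaddress.ip_network(wifi_cidr)
    return [name for name, net in SITES.items() if wifi.overlaps(net)]

# A public WiFi that happens to reuse CORE's range.
print(conflicting_sites("10.0.1.0/24"))
# A hypothetical network that does not collide with any site.
print(conflicting_sites("192.168.1.0/24"))
```

A roaming client on the first network would see 10.0.1.0/24 as directly connected on WiFi, which is more specific than the /22 over the VPN, so CORE becomes unreachable.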

Implementation

Basically, the client has two NICs: WiFi and OpenVPN. We need to route CORE and YVR over WiFi (when possible), and I simply leave the ROAM site always on the VPN (using the OpenVPN client-to-client option). This makes the client config much simpler, but it also means that devices in the YVR site without OpenVPN deployed cannot reach devices in the ROAM site. We can work around this by adding a static route to ROAM via CORE on the YVR site router (remember: CORE is always connected to ROAM devices through OpenVPN). However, this is not that important and is not implemented right now.

Every operating system matches routes by preferring the most specific route, and (at least on Windows) if two equally specific routing table entries exist, it will pick the one with the lower metric.

Therefore, I implemented it as follows:

The YVR site router continues to use the 10.0.2.0/24 subnet, which is also reflected on the client WiFi NIC addresses.
The YVR DHCP server offers a 10.0.0.0/22 classless static route via 10.0.2.1 (the site router). Note that according to RFC 3442, the DHCP client must ignore the Router option if it received a classless static route option. Please set your default route accordingly (probably using another classless static route).
The OpenVPN server uses 10.0.3.0/24 space, which is also reflected on client OpenVPN NIC addresses.
The OpenVPN server pushes 10.0.0.0/22 route via vpn_gateway, with a very high metric.
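
For reference, this is roughly what the classless static route option looks like on the wire. The sketch below encodes destination/router pairs in the RFC 3442 format (prefix length, then only the significant octets of the destination, then the router). The function name is made up, and the default route via 10.0.2.1 is an assumption: it presumes the site router is also the default gateway.

```python
import ipaddress

def encode_classless_route(destination, router):
    """Encode one destination/router pair in the RFC 3442 (option 121) format."""
    net = ipaddress.ip_network(destination)
    # Only the significant octets of the destination are sent: ceil(prefixlen / 8).
    significant = (net.prefixlen + 7) // 8
    return (
        bytes([net.prefixlen])
        + net.network_address.packed[:significant]
        + ipaddress.ip_address(router).packed
    )

routes = [
    ("10.0.0.0/22", "10.0.2.1"),  # the /22 covering all three sites
    ("0.0.0.0/0", "10.0.2.1"),    # assumed default route, since the Router option is ignored
]
payload = b"".join(encode_classless_route(d, r) for d, r in routes)
print(" ".join(f"{b:02x}" for b in payload))
# prints: 16 0a 00 00 0a 00 02 01 00 0a 00 02 01
```

Most routers build this for you from a CIDR and a next hop, but some DHCP servers only accept the raw bytes, in which case a helper like this saves some head-scratching.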

The result is:

On-prem access CORE: 10.0.1.x matches two 10.0.0.0/22 routes, and the client prefers the one with lower metric, so client -> YVR -> CORE.
On-prem access YVR: 10.0.2.x matches 10.0.2.0/24 on WiFi and 10.0.0.0/22 on VPN, and the client prefers the more exact one, which is WiFi.
On-prem access ROAM: 10.0.3.x matches 10.0.3.0/24 on VPN and 10.0.0.0/22 on WiFi, and the client prefers the more exact one, which is VPN.
Roaming access CORE: 10.0.1.x matches 10.0.0.0/22 only, and therefore client -> VPN -> CORE.
Roaming access YVR: 10.0.2.x matches 10.0.0.0/22 only, and therefore client -> VPN -> CORE -> YVR.
Roaming access ROAM: 10.0.3.x matches 10.0.3.0/24 on VPN NIC, and it uses the OpenVPN client-to-client feature to access other roaming devices.
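
The six cases above can be replayed with a toy route picker: longest prefix first, then lowest metric. This is a simplified model, not how Windows is actually implemented; the metric values (25 and 100) are illustrative assumptions, and the roaming WiFi's own routes are left out because its default route is less specific than anything here.

```python
import ipaddress
from typing import NamedTuple

class Route(NamedTuple):
    prefix: str
    interface: str
    metric: int

def pick(routes, destination):
    """Pick the matching route with the longest prefix, then the lowest metric."""
    dest = ipaddress.ip_address(destination)
    matching = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    return min(
        matching,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, r.metric),
    )

# Metric values are assumptions; only "VPN higher than WiFi" matters.
VPN = [Route("10.0.3.0/24", "VPN", 100), Route("10.0.0.0/22", "VPN", 100)]
WIFI_ON_PREM = [Route("10.0.2.0/24", "WiFi", 25), Route("10.0.0.0/22", "WiFi", 25)]

on_prem = VPN + WIFI_ON_PREM
roaming = VPN  # the roaming WiFi only contributes a less specific default route

for label, table in (("on-prem", on_prem), ("roaming", roaming)):
    for dest in ("10.0.1.1", "10.0.2.1", "10.0.3.1"):
        print(label, dest, "->", pick(table, dest).interface)
```

On-prem, only 10.0.3.1 should come out as VPN; when roaming, all three should.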

This works as intended.

Implementation Details

Three changes need to be made:

Set YVR DHCP server: it is too trivial to show here. I used EdgeRouter X.
Push 10.0.0.0/22 over OpenVPN: push "route 10.0.0.0 255.255.252.0 vpn_gateway". Do not use two /24 routes, because they are more specific and would take precedence over the /22 route offered by the YVR router.
Set the OpenVPN route metric to a higher value. The manual says that you just need to put route-metric 100 in the client configuration file. However, that does not seem to work on Windows: OpenVPN executes netsh.exe without the METRIC argument. The manual also says that you can set a per-route metric by adding the metric value after vpn_gateway, and that did not work for me either. The only solution I found was to push route-metric from the server. This one magically works, and I really don’t know why. Digging through the source would probably give me the answer. Anyway, if it works, it works. Remember to use verb 4 in your client configuration to see what is going on. Also use Find-NetRoute to check the route (this cmdlet is similar to ip r get). Lastly, note that the route metric differs from the adapter metric (which Get-NetIPInterface gives you), and we only need to set the route metric to a higher value.
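
OpenVPN's route directive wants a network address and a dotted netmask rather than CIDR notation. Here is a small sketch that builds the server-side push lines from a CIDR; the helper names are made up, and the metric value of 100 is only an example, not the value I necessarily use.

```python
import ipaddress

def push_route(cidr):
    """Build a server-side push line for a route given in CIDR notation."""
    net = ipaddress.ip_network(cidr)
    return f'push "route {net.network_address} {net.netmask} vpn_gateway"'

def push_route_metric(metric):
    """Build the push line for the route metric (the value is an example)."""
    return f'push "route-metric {metric}"'

print(push_route("10.0.0.0/22"))
print(push_route_metric(100))
```

The first line printed should match the directive quoted above.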

On-prem, this works as intended:

C:\>tracert 10.0.1.1

Tracing route to 10.0.1.1 over a maximum of 30 hops

  1     6 ms     2 ms     1 ms  10.0.2.1
  2    19 ms    34 ms    29 ms  10.0.1.1

Trace complete.

C:\>tracert 10.0.2.1

Tracing route to 10.0.2.1 over a maximum of 30 hops

  1     4 ms     3 ms     2 ms  10.0.2.1

Trace complete.

C:\>tracert 10.0.3.1

Tracing route to 10.0.3.1 over a maximum of 30 hops

  1    33 ms    33 ms    30 ms  10.0.3.1

Trace complete.

When roaming (note that the WiFi here uses 10.0.1.0/24, which conflicts with CORE):

C:\>tracert 10.0.1.1

Tracing route to 10.0.1.1 over a maximum of 30 hops

  1    20 ms    15 ms    33 ms  10.0.1.1

Trace complete.

C:\>tracert 10.0.2.1

Tracing route to 10.0.2.1 over a maximum of 30 hops

  1    30 ms    28 ms    18 ms  10.0.3.1
  2    59 ms    57 ms    56 ms  10.0.2.1

Trace complete.

C:\>tracert 10.0.3.1

Tracing route to 10.0.3.1 over a maximum of 30 hops

  1    58 ms    32 ms    28 ms  10.0.3.1

Trace complete.

Alternative Approach

While writing this post, I realized that it would be simpler if the YVR router used 10.0.0.0/22 as its subnet (but only offered 10.0.2.0/24 addresses over DHCP), or something similar (like offering the 10.0.0.0/22 subnet directly over DHCP, without the need for a dedicated classless static route). In this way, traffic to ROAM from YVR would go over CORE, and there would be no need to keep a separate 10.0.2.0/24 route on the WiFi NIC.

I will consider that later.