I recently switched from WireGuard to OpenVPN for remote access from endpoint devices (point-to-site) because OpenVPN offers better manageability. The clients are all Windows 10 devices, and I would like them to have the VPN always on, so users can log on to the domain seamlessly outside the on-prem network. Moreover, I want the VPN to be bypassed when they are on-prem to optimize routing. DirectAccess uses an internal HTTP endpoint to detect whether the network is internal, but OpenVPN provides no such mechanism. However, it is easy to achieve the same thing with a well-designed network topology and different route metrics.
Note: I am a complete amateur in network engineering, so this post just shows a possible approach I used at home. Do not take it too seriously!
My topology is a little complex: I have three subnets, each representing a different site, and they all need to be interconnected:
- CORE - Cloud site: 10.0.1.0/24: The core domain controllers and other servers (e.g., ADCS, NAS, Unix servers, and SCCM primary site in the future). This is hosted on my homelab vSphere cluster.
- YVR - Vancouver site: 10.0.2.0/24: Basically my home: RODC, printer, PCs, phones.
- ROAM - Mobile devices site: 10.0.3.0/24: The subnet for OpenVPN clients. It used to be WireGuard, so I assigned a dedicated /24 to all the mobile devices. With OpenVPN, the clients could perhaps be placed directly into the cloud site subnet (or some other subnet), but I am no expert in networking, so I stuck with the simple approach: give them a dedicated subnet.
Note that all three subnets fit neatly into the 10.0.0.0/22 supernet, which I will use later for route summarization.
The site-to-site VPN between CORE and YVR is WireGuard, and it works perfectly.
The point-to-site VPN between ROAM and CORE is OpenVPN.
The topology is as follows:
```
+------------------------+
|   CORE - 10.0.1.0/24   |
| +--------+ +---------+ |
| | OpenVPN| |WireGuard| |
| |10.0.3.1| |         | |
+-----+----------+-------+
      ^          |
      | When     |
      | roaming, |
      | VPN is   |
      | preferred|
+-----+------+   |  +-----------------+
|  OpenVPN   |   +--+    WireGuard    |
|10.0.3.x/24 |      +-----------------+
|   Client   +----->|YVR - 10.0.2.0/24|
+------------+  ^   +-----------------+
                |
     When on-prem, WiFi is preferred
```
(Generated using https://asciiflow.com/)
An intuitive approach is to write a script: if pinging 10.0.2.1 succeeds, stop the VPN; otherwise, start it. However, this is not only complicated but also causes network interruptions (OpenVPN takes time to start up). It is therefore better to set a higher metric on the OpenVPN interface, so the OS prefers WiFi over the VPN. We can safely leave OpenVPN always connected because the client will not route much traffic over it.
This approach has one major downside: if the WiFi network conflicts with any of the subnets, it breaks the setup (roaming clients cannot access the conflicting subnet). This does happen on public networks, so it is not guaranteed to work 100% of the time. A more robust way would indeed be to have the client probe an internal endpoint and switch the VPN on and off.
Basically, the client has two NICs: WiFi and OpenVPN. We need to route CORE and YVR over WiFi (when possible), and I just leave the ROAM site always routed over the VPN (using the OpenVPN `client-to-client` option). This makes the client configuration much simpler, but it also leaves devices in the YVR site that do not have OpenVPN deployed unable to reach devices in the ROAM site. We could work around this by adding a static route to ROAM via CORE on the YVR site router (remember: CORE is always connected to ROAM devices via OpenVPN). However, this is not that important and is not implemented right now.
Every operating system matches routes by preferring the most specific (longest-prefix) route, and (at least on Windows) if two equally specific routing table entries exist, it picks the one with the lower metric.
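To make this concrete, here is a simplified, hypothetical excerpt of what an on-prem client's routing table looks like under this scheme (the prefixes match the sites above; the metric values are purely illustrative):

```
Destination     Gateway     Interface   Metric
10.0.2.0/24     on-link     WiFi          50   <- most specific for YVR: stays on WiFi
10.0.3.0/24     10.0.3.1    OpenVPN      100   <- most specific for ROAM: goes over VPN
10.0.0.0/22     10.0.2.1    WiFi          50   <- summary route via the site router
10.0.0.0/22     10.0.3.1    OpenVPN      100   <- same prefix, higher metric: WiFi wins
```

A destination in 10.0.1.0/24 (CORE) matches only the two /22 entries, so the lower-metric WiFi route wins; a destination in 10.0.3.0/24 (ROAM) matches the more specific VPN route regardless of metric.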
Therefore, I implemented it as follows:
- The YVR site router continues to use the `10.0.2.0/24` subnet, which is also reflected in the client WiFi NIC addresses.
- The YVR DHCP server offers a `10.0.0.0/22` classless static route via `10.0.2.1` (the site router). Note that according to RFC 3442, a DHCP client must ignore the Router option if it receives a classless static route option, so set your default route accordingly (probably using another classless static route).
- The OpenVPN server uses the `10.0.3.0/24` space, which is also reflected in the client OpenVPN NIC addresses.
- The OpenVPN server pushes a `10.0.0.0/22` route via `vpn_gateway`, with a very high metric.
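Putting the OpenVPN side together, the server configuration contains something like the following sketch (a hypothetical excerpt; certificate, key, and interface directives of a real deployment are omitted):

```
# server.conf excerpt (illustrative)
server 10.0.3.0 255.255.255.0                      # hand out ROAM addresses to clients
client-to-client                                   # let roaming clients reach each other
push "route 10.0.0.0 255.255.252.0 vpn_gateway"    # summary route covering all three sites
push "route-metric 100"                            # high metric so on-prem WiFi routes win
```

The single /22 push is the key: it is deliberately less specific than the /24 route the YVR router offers over DHCP, so it only takes effect when the client is off-prem.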
The result is:
- On-prem access to CORE: both NICs carry a `10.0.0.0/22` route, and the client prefers the one with the lower metric, so client -> YVR -> CORE.
- On-prem access to YVR: `10.0.2.0/24` on WiFi and `10.0.0.0/22` on VPN, and the client prefers the more specific one, which is WiFi.
- On-prem access to ROAM: `10.0.3.0/24` on VPN and `10.0.0.0/22` on WiFi, and the client prefers the more specific one, which is VPN.
- Roaming access to CORE: `10.0.0.0/22` only, and therefore client -> VPN -> CORE.
- Roaming access to YVR: `10.0.0.0/22` only, and therefore client -> VPN -> CORE -> YVR.
- Roaming access to ROAM: `10.0.3.0/24` on the VPN NIC, using the OpenVPN `client-to-client` feature to reach other roaming devices.
This works as intended.
Implementation Details 🔗
Three changes need to be made:
- Set up the YVR DHCP server: this is too trivial to show here. I used an EdgeRouter X.
- Set the OpenVPN server to `push "route 10.0.0.0 255.255.252.0 vpn_gateway"`. Do not push two `/24` routes instead: being exact matches, they would take precedence over the `/22` route offered by the YVR router.
- Set the OpenVPN metric to a higher value. The manual says you just need `route-metric 100` in the client configuration file. However, that does not seem to work on Windows: OpenVPN executes netsh.exe without the `METRIC` argument. The manual also says you can set a per-route metric by adding the metric value after `vpn_gateway`, but that never works either. The only solution I found was to actually push `route-metric` from the server. This one magically works, and I really don't know why; digging through the source would probably give me the answer. Anyway, if it works, it works. Remember to use `verb 4` in your client configuration to see what is going on. Also use `Find-NetRoute` to check the routes (this cmdlet is similar to `ip r get`). Lastly, note that the route metric differs from the interface metric (which `Get-NetIPInterface` gives you); we just need to set the route metric to a higher value.
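For verification, a PowerShell session along these lines covers the checks mentioned above (these are real NetTCPIP cmdlets, but the exact output columns depend on your Windows version, so no sample output is shown):

```
# Which route (and interface) would be used to reach a CORE host?
PS C:\> Find-NetRoute -RemoteIPAddress 10.0.1.1

# Compare the two /22 entries; the one with the lower RouteMetric wins
PS C:\> Get-NetRoute -DestinationPrefix 10.0.0.0/22 |
>>   Format-Table ifIndex, NextHop, RouteMetric

# Interface metric -- NOT the same thing as the route metric above
PS C:\> Get-NetIPInterface | Format-Table ifIndex, InterfaceAlias, InterfaceMetric
```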
This works as intended:
```
C:\>tracert 10.0.1.1

Tracing route to 10.0.1.1 over a maximum of 30 hops

  1     6 ms     2 ms     1 ms  10.0.2.1
  2    19 ms    34 ms    29 ms  10.0.1.1

Trace complete.

C:\>tracert 10.0.2.1

Tracing route to 10.0.2.1 over a maximum of 30 hops

  1     4 ms     3 ms     2 ms  10.0.2.1

Trace complete.

C:\>tracert 10.0.3.1

Tracing route to 10.0.3.1 over a maximum of 30 hops

  1    33 ms    33 ms    30 ms  10.0.3.1

Trace complete.
```
On roaming (note that the WiFi network uses `10.0.1.0/24`, which conflicts with the CORE subnet):
```
C:\>tracert 10.0.1.1

Tracing route to 10.0.1.1 over a maximum of 30 hops

  1    20 ms    15 ms    33 ms  10.0.1.1

Trace complete.

C:\>tracert 10.0.2.1

Tracing route to 10.0.2.1 over a maximum of 30 hops

  1    30 ms    28 ms    18 ms  10.0.3.1
  2    59 ms    57 ms    56 ms  10.0.2.1

Trace complete.

C:\>tracert 10.0.3.1

Tracing route to 10.0.3.1 over a maximum of 30 hops

  1    58 ms    32 ms    28 ms  10.0.3.1

Trace complete.
```
Alternative Approach 🔗
While writing this post, I realized that it would be simpler if the YVR router used the `10.0.0.0/22` subnet directly (but only offered `10.0.2.0/24` addresses over DHCP), or something similar (like offering the `10.0.0.0/22` subnet directly over DHCP, without the need for a dedicated classless static route). That way, traffic from YVR to ROAM would go through CORE, but it would eliminate the need to keep `10.0.2.0/24` on the WiFi NIC.
I will consider that later.