user3466413's questions -server

user3466413

Asked: 2021-03-13 09:54:37 +0800 CST

OPNsense WAN failover causes disruption when non-active WAN is down

I have the latest version of OPNsense set up in a VM on ESXi 7. OPNsense is very similar to pfSense, and I suspect the solution would apply to both. All the NICs are PCI passthrough devices:

A management interface
WAN 1, my preferred WAN, to be used all the time unless it has failed
WAN 2, my fallback WAN, to be used only when WAN 1 has failed
A LAN interface through which all LAN clients connect to the internet

WAN 1 and WAN 2 are set up as gateways. WAN 1, being the connection I want to use all the time unless it's down or has high loss, has a priority of 254, and WAN 2 has 255, so WAN 1 is preferred. There are no other enabled gateways. Both are upstream gateways. No separate monitor IPs are set, so each gateway is monitored by pinging the gateway IP provided by its ISP.

I then made a gateway group with both the WANs as members. Tier 1 has WAN 1 only, and Tier 2 has WAN 2 only. The "Trigger Level" is "Packet Loss or High Latency." My understanding is having two tiers like this means Tier 1 is used 100% of the time unless it fails, then Tier 2 is used 100% of the time until Tier 1 comes back.
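To make that understanding concrete, here is a minimal Python model of the tier behaviour I expect. It is only a sketch of the expectation described above, not OPNsense's actual code; `Member` and `active_members` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    tier: int      # 1 is the most preferred tier
    healthy: bool  # outcome of the gateway monitor (assumed input)


def active_members(group):
    """Return the healthy members of the lowest-numbered tier that has any."""
    healthy = [m for m in group if m.healthy]
    if not healthy:
        return []  # every gateway in the group is down
    best_tier = min(m.tier for m in healthy)
    return [m.name for m in healthy if m.tier == best_tier]


if __name__ == "__main__":
    # WAN 2 is reported down while WAN 1 stays healthy.
    group = [Member("WAN 1", 1, True), Member("WAN 2", 2, False)]
    print(active_members(group))
```

Under this model, a Tier 2 failure should never change which gateway carries traffic as long as Tier 1 is healthy, which is exactly why the behaviour I'm seeing is surprising.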

Under normal circumstances with both WANs up and reporting no loss, this works fine. All traffic seems to be sent through WAN 1 and WAN 2 is ignored, aside from health checks and the firewall blocking random script kiddies probing it.

However, when WAN 2 starts to report packet loss and OPNsense considers it down, which usually happens a handful of times per day for a few minutes at a time, connectivity is disrupted even though WAN 1 is healthy: some devices suddenly can't reach the internet until WAN 2 is healthy again. While this is happening, OPNsense still reports WAN 1 as healthy. If I instead disable the WAN 2 gateway entirely, the connection is stable, presuming of course that WAN 1 stays up.

The logs suggest that the routes aren't changing at all from WAN 1 when WAN 2 fails, which makes this more confusing. Here are log entries around the time of a WAN 2 failure:

2021-03-10T10:31:44 kernel  pflog0: promiscuous mode disabled    
2021-03-10T10:31:44 opnsense[38519] /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '(WAN 1's IP)'  
2021-03-10T10:31:44 opnsense[41442] /system_gateways.php: The LAN_DHCP monitor address is empty, skipping.   
2021-03-10T10:31:44 opnsense[41442] /system_gateways.php: Choose to bind WAN 1 on (WAN 1's IP) since we could not find a proper match.   
2021-03-10T10:31:44 opnsense[41442] plugins_configure monitor (execute task : dpinger_configure_do())    
2021-03-10T10:31:44 opnsense[41442] plugins_configure monitor ()     
2021-03-10T10:31:44 opnsense[41442] /system_gateways.php: ROUTING: keeping current default gateway '(WAN 1's IP)'    
2021-03-10T10:31:44 opnsense[41442] /system_gateways.php: ROUTING: setting IPv4 default route to (WAN 1's IP)    
2021-03-10T10:31:44 opnsense[41442] /system_gateways.php: ROUTING: IPv4 default gateway set to opt2  
2021-03-10T10:31:44 opnsense[41442] /system_gateways.php: ROUTING: entering configure using defaults     
2021-03-10T10:31:43 configctl[59484]    event @ 1615397503.05 exec: system event config_changed  
2021-03-10T10:31:43 configctl[59484]    event @ 1615397503.05 msg: Mar 10 10:31:43 OPNsense.localdomain config[41442]: config-event: new_config /conf/backup/config-1615397503.052.xml

And here are gateway log entries around the same time:

2021-03-10T10:31:03 dpinger[59862]  GATEWAY ALARM: WAN 2 (Addr: ******* Alarm: 1 RTT: 8702us RTTd: 7796us Loss: 22%)     
2021-03-10T10:31:03 dpinger[77013]  WAN 2 *******: Alarm latency 8702us stddev 7796us loss 22%   
2021-03-10T10:30:05 dpinger[79106]  GATEWAY ALARM: WAN 2 (Addr: ******* Alarm: 0 RTT: 9164us RTTd: 14634us Loss: 11%)

The most confusing entry to me is "keeping current default gateway", which I read as explicitly saying "WAN 2 failed but I don't care because WAN 1 is the default gateway", but maybe I'm misinterpreting it.

So, in this dual WAN failover scenario, why does a failure of the second, supposedly inactive WAN make the active WAN unusable?

user3466413

Asked: 2021-02-18 14:28:50 +0800 CST

Offloading PPPoE from an OPNsense router

I'm running OPNsense, a FreeBSD-based firewall and router similar to pfSense, in a virtual machine under VMware ESXi 7 on a Dell PowerEdge R230, as a router for my home network. No other VMs are running or even set up on the host, just this one.

My ISP uses symmetrical gigabit fiber. You connect to it via PPPoE with VLAN tagging. The IP is dynamically assigned, although it's unclear if it's DHCP or not. Either way, I get assigned a single IP not of my choosing. I don't have any static IPs nor a static block, something for which my ISP charges a non-trivial amount. Reconnecting often does give me a new IP, so I can't assume it would stay the same all the time.

I set up a gateway for the ISP in OPNsense. OPNsense does the PPPoE connection and, for simplicity, ESXi itself does the VLAN tagging so it's transparent to the underlying switch. The topology ends up looking like this:

       Internet
           ↓
      Fiber line
           ↓
       Fiber ONT
           ↓
    Ethernet cable
           ↓
     Physical NIC
           ↓
Virtual switch/port group
           ↓
      Virtual NIC
           ↓
      OPNsense VM

The NICs in ESXi look like this. The first two are the onboard LAN. The last four are the PCI-E card: Dell Intel I350-T4 Quad Port 1GbE PCI-E card. The ISP's connection is using vmnic2, the first port on the PCI-E card.

But the problem is I'm not getting the full speed. This is because FreeBSD has a bug with some NICs where PPPoE does not use all the CPU cores. The VM has what I'd consider overkill assigned to it: 4 vCPUs from a Xeon E3-1240 v5 (3.5 GHz) and 8 GB RAM, but apparently the single-core performance isn't fast enough to do gigabit over PPPoE. When I do a speed test, the CPU usage within the VM predictably goes to 25% (one of four cores) and I only get about 600 Mbps peak download, and for some reason upload is much worse still. When I use other hardware, like a Ubiquiti UniFi USG-3P or the ISP's provided modem/router, I get nearly the full speed in both directions. I've tried making the virtual NIC both vmxnet3 and e1000e with no difference in performance.
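The arithmetic behind the "25% means one core is maxed out" reading can be written down as a quick check. This is a back-of-the-envelope sketch only: it assumes all the PPPoE work lands on a single core and that throughput scales linearly with per-core speed, and both helper names are hypothetical.

```python
def busiest_core_percent(aggregate_percent, cores):
    """Load on one core if the whole aggregate load runs on that core alone."""
    return min(100.0, aggregate_percent * cores)


def required_speedup(target_mbps, observed_mbps):
    """How much faster a single core would need to be to reach the target rate."""
    return target_mbps / observed_mbps


if __name__ == "__main__":
    # 25% aggregate usage across 4 vCPUs, as seen during the speed test.
    print(busiest_core_percent(25, 4))
    # Gigabit target versus the roughly 600 Mbps peak download observed.
    print(round(required_speedup(1000, 600), 2))
```

If those assumptions hold, adding more vCPUs would not help, since the one busy core is already saturated.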

So with this setup, how can I get full speed with a PPPoE connection? I see several options but I don't know which is best or most likely to succeed:

1. Use some other network driver in OPNsense, although I have no idea how to do that or if an applicable one even exists. Is this even an option?
2. Offload the PPPoE unwrapping to a Linux VM, which doesn't have the PPPoE bug, and then have OPNsense use that VM as the gateway instead of the ISP directly. With this option there are two approaches I can think of:
   - Use a Linux VM as a very simple router with NAT and treat OPNsense as a DMZ, so all the port forwarding, etc. applies only to OPNsense and I don't have to do it in two places. This does directly expose the Linux VM to the internet without the OPNsense firewall to protect it. Is this safe?
   - Use a Linux VM only to unwrap the PPPoE layer of the connection and send the unwrapped Ethernet frames straight to another interface; i.e., OPNsense still gets assigned the external IP and there's no NAT. I saw a tutorial on that but I don't think it works in my case because I don't have a static IP or static IP block. This sounds like the cheapest option, with no new hardware or service charges if it works, but is it a good idea and is it safe?
3. Use a hardware device of some kind to do the PPPoE-to-Ethernet conversion. My ISP-provided modem/router, which I'm not using right now, is a Zyxel C1100Z. It has an option for "transparent bridging", but as best I can tell, the PPPoE handling would still happen on OPNsense in this scenario, which doesn't help. I'm not using the UniFi USG-3P because it is very buggy with software updates and also doesn't do WAN failover reliably at all.
4. Replace the PCI-E NIC with something else, or pass the physical PCI-E card through to the VM instead of virtualizing it. From what I've seen so far, other vendors' cards like Realtek are even more unreliable in OPNsense. Is there a NIC that's known to work for this scenario? Would the onboard LAN work any better?
5. Install OPNsense directly on the host or run it directly off a USB stick to eliminate the VMware side. I'm really hoping I can do a VM-based setup for snapshots and whatnot for safety. Would this be any improvement at all given the underlying hardware won't change?
6. Replace OPNsense with some other virtual appliance not based on FreeBSD, although I haven't yet found something as comparatively easy to use or feature-rich, or they're expensive or require subscription pricing. I prefer UIs over text configuration and want to use open source self-hosted software. Does such a product exist?
