I am evaluating a system for a client where many OpenVPN clients connect to an OpenVPN server. "Many" means 50,000 to 1,000,000.
Why do I do that? The clients are distributed embedded systems, each sitting behind the system owner's DSL router. The server needs to be able to send commands to the clients. My first naive approach is to make the clients connect to the server via an OpenVPN network. This way, the secure communication tunnel can be used in both directions.
This means that all clients are always connected to the server, and the number of clients will add up over the years.
The question is: does the OpenVPN server explode when reaching a certain number of clients? I am already aware of the limit on the maximum number of TCP connections, therefore (and for other reasons) the VPN would have to use UDP transport.
OpenVPN gurus, what is your opinion?
I doubt that a setup that large has ever been attempted before, so you will likely be pushing limits when trying. I could find an article on a VPN deployment for 400 clients, but judging from the text, the author just relied on rough estimates of how many clients could be run per CPU and lacked some understanding of how his setup would perform.
You would mainly need to consider these two points:
The bandwidth your data transfers use needs encryption / decryption on the VPN server side, consuming CPU resources
OpenVPN client connections consume both memory and CPU resources on the server even when no data is transferred
Any decent PC hardware available today should easily saturate a Gigabit link with Blowfish or AES-128, and even $100 embedded devices are capable of rates near 100 Mbps, so CPU bottlenecks due to bandwidth should not be a concern.
Given the default rekeying interval of 3600 seconds, 1,000,000 clients would mean that the server needs to complete about 278 key exchanges per second on average. While a key exchange is a rather CPU-intensive task, you could offload it to dedicated hardware if needed: available cryptographic accelerator cards easily meet and exceed this number of TLS handshakes. Memory restrictions should not be much of a concern either, since a 64-bit binary should take care of any virtual memory restrictions you would otherwise be likely to hit.
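The 278 figure is simply the client count divided by the rekeying interval. A quick sketch of that arithmetic, assuming renegotiations are spread evenly across the interval (the function name is only illustrative):

```python
def key_exchanges_per_second(clients: int, reneg_interval_s: int = 3600) -> float:
    """Average key exchanges per second if rekeys are spread evenly."""
    return clients / reneg_interval_s


# Both ends of the range given in the question.
for n in (50_000, 1_000_000):
    print(f"{n:>9,} clients: {key_exchanges_per_second(n):6.1f} key exchanges/s")
```

That works out to roughly 14 per second at 50,000 clients and 278 per second at 1,000,000. Note that this is an average: if all clients reconnect at once, for example after a server restart, the short-term rate would be far higher.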
But the real beauty with OpenVPN is that you can scale it out quite easily - simply set up an arbitrary number of OpenVPN servers and make sure your clients are using them (e.g. through DNS round-robin), configure a dynamic routing protocol of your choice (typically this would be RIP due to its simplicity) and your infrastructure would be capable of supporting an arbitrary number of clients as long as you've got enough hardware.
I have actually done this, albeit with "only" a few hundred remote connections similarly behind DSL routers. I can't comment too much about the rekeying issues, but a few practical things I learned along the way:
1) When deploying clients, make sure you specify multiple VPN servers in the client conf, vpn1.example.com, vpn2.example.com, vpn3..... Even if you only provide one or two of these now, you give yourself headroom. Configured properly, the clients will keep retrying them at random till they find one that works.
2) We use a custom AWS VPN server image, and can spin up additional capacity on demand, and Amazon DNS (R53) handles the DNS side of things. It is completely detached from the rest of our infrastructure.
3) At the server(s) end, make careful use of the netmask to restrict the number of potential clients. That should force clients onto an alternative server, mitigating the CPU issues. I think we limit our servers to 300 or so clients. This choice was somewhat arbitrary on our part - "gut feel" if you like.
4) Also at the server end, you should make careful use of firewalls. In simple terms, we have ours configured such that the clients can VPN connect in, but the servers strictly disallow all ssh connections inbound except from a known IP address. We can SSH to the clients if we occasionally need to, they can't SSH to us.
5) Don't rely on OpenVPN doing the reconnect for you at the client end. 9 times out of 10 it will, but sometimes it gets stuck. Have a separate process to reset/restart OpenVPN at the client end regularly.
6) You need a way of generating unique keys for the clients so you can revoke (disavow) them if you ever need to. We generate these internally with our server build (PXEboot) process. We have never had to revoke one, but we know we can do it.
7) You will need some management tools and scripts to monitor your VPN server connections effectively.
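The separate restart process from point 5 could look something like the following. This is a minimal sketch under stated assumptions, not the setup described above: the tunnel address `10.8.0.1`, the service name `openvpn`, the use of `systemctl` and Linux `ping`, and the threshold of three failed checks are all placeholders to adapt.

```python
import subprocess
import time

VPN_PEER = "10.8.0.1"   # assumption: the server's address inside the tunnel
SERVICE = "openvpn"     # assumption: the service name differs between distros
MAX_FAILURES = 3        # assumption: restart after three failed checks in a row


def tunnel_is_up(peer: str = VPN_PEER) -> bool:
    """Return True if a single ping through the tunnel gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "5", peer],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_vpn(service: str = SERVICE) -> None:
    """Restart the OpenVPN service (needs root privileges)."""
    subprocess.run(["systemctl", "restart", service], check=False)


def watchdog_step(failures: int, up: bool, restart=restart_vpn,
                  max_failures: int = MAX_FAILURES) -> int:
    """Process one check result and return the new failure count."""
    if up:
        return 0
    failures += 1
    if failures >= max_failures:
        restart()
        return 0  # start counting again after a restart
    return failures


def main(interval_s: int = 60) -> None:
    """Check the tunnel forever, restarting OpenVPN when it looks stuck."""
    failures = 0
    while True:
        failures = watchdog_step(failures, tunnel_is_up())
        time.sleep(interval_s)
```

Calling `main()` from a small service or init script on each client would run the loop. Requiring several consecutive failures before restarting avoids tearing down a healthy tunnel over one lost ping.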
There is not much material out there about how to do this unfortunately, but it is possible, with careful configuration.
Update 2018
Not sure what all has changed since 2012, but I wanted to give an update on my experience in 2018. We have deployed an OpenVPN network very similar to the OP's setup. Our endpoints are full-blown Linux PCs instead of embedded devices. Each endpoint has a monitor used to display information and alarms for that site, and our server gives us a single point from which to remote into all endpoints. The network isn't overly active but sometimes has 5-10 remote sessions simultaneously.
Using a current build of OpenVPN with around 100 clients on an Azure image with a single core and 2 GB of RAM, we use around 0.7% of memory on average and the CPU usage is almost always around 0%. Based on what I found in this smaller test, I figure a single server with decent specs would easily handle 50,000 concurrent clients if it had the RAM to support it. If RAM usage scaled linearly, then 16 GB would be able to handle 50,000 users with enough to spare on a dedicated OpenVPN machine.
We are not at a large enough scale to say that with significant confidence, but I wanted to give a recent update, since when originally deploying our network I found this and was expecting much more resource usage at this scale. I do believe the CPU that runs this has hardware encryption, and I'm not sure at what point that would be overloaded traffic-wise, but for endpoints that do not communicate a lot this shouldn't be an issue.
At 1,000,000 you would need 200 GB of RAM on a single machine (if it scaled linearly, with extra). While this is possible, I would think at that point you would want 5 machines each with 64 GB of RAM so you don't have a single point of failure. This should allow maintenance, restarts and replacements of 1 or even 2 machines without significant issues.
My RAM estimates are likely way overkill, since I'm dividing the entire OpenVPN usage by the number of clients when only a portion of that RAM is due to clients.
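The linear extrapolation can be reproduced in a few lines. This sketch assumes, as the estimate does, that the whole measured footprint (0.7% of 2 GB at 100 clients) is per-client cost, so it is an upper bound; the function name and parameters are only illustrative.

```python
def ram_needed_gb(target_clients: int,
                  measured_clients: int = 100,
                  total_ram_gb: float = 2.0,
                  used_fraction: float = 0.007) -> float:
    """Linearly extrapolate OpenVPN memory use from a single measurement."""
    per_client_gb = total_ram_gb * used_fraction / measured_clients
    return per_client_gb * target_clients


for n in (50_000, 1_000_000):
    print(f"{n:>9,} clients: about {ram_needed_gb(n):.0f} GB")
```

This gives about 7 GB for 50,000 clients and about 140 GB for 1,000,000, so the 16 GB and 200 GB figures include generous headroom.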
We have added 74 endpoints in a year since initially deployed. I hope to continue to grow that number significantly and will make a further update if we get to a decent scale.
I'm looking into a similar problem, although the number of clients would be in the hundreds, maybe a couple of thousand.
I figured that I can't keep all the clients connected all the time.
I'm thinking of starting the OpenVPN daemon on clients at randomized time intervals so they can check if they were polled. If they were, they are to send an email or something to say that they are online, and send keep-alive packets for a period of time so I can connect to them.
If there is no traffic for some time, the daemon would be stopped.
The problem I'm facing right now is that it seems impossible to get a list of currently connected VPN clients ...
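One possible way to get such a list is the server's `status` directive, which makes OpenVPN periodically write a status file that includes the connected clients. Below is a minimal sketch of parsing it, assuming the default (version 1) status file layout; the sample data is made up for illustration, and the function name is hypothetical.

```python
import csv
from typing import Dict, List


def connected_clients(status_text: str) -> List[Dict[str, str]]:
    """Extract the client list section from an OpenVPN status file (v1 layout)."""
    clients: List[Dict[str, str]] = []
    header = None
    in_client_list = False
    for row in csv.reader(status_text.splitlines()):
        if not row:
            continue
        if row[0] == "OpenVPN CLIENT LIST":
            in_client_list = True
        elif row[0] == "ROUTING TABLE":
            break  # the client section ends where the routing table starts
        elif in_client_list and row[0] == "Updated":
            continue  # timestamp line, not a client
        elif in_client_list and header is None:
            header = row  # "Common Name,Real Address,..."
        elif in_client_list:
            clients.append(dict(zip(header, row)))
    return clients


# Made-up sample in the assumed layout, using documentation IP ranges.
SAMPLE = """\
OpenVPN CLIENT LIST
Updated,Thu Jun 18 08:12:15 2009
Common Name,Real Address,Bytes Received,Bytes Sent,Connected Since
client1,203.0.113.7:51234,1234,5678,Thu Jun 18 04:23:03 2009
client2,198.51.100.9:40022,910,1112,Thu Jun 18 05:01:44 2009
ROUTING TABLE
Virtual Address,Common Name,Real Address,Last Ref
10.8.0.6,client1,203.0.113.7:51234,Thu Jun 18 08:12:09 2009
GLOBAL STATS
Max bcast/mcast queue length,0
END
"""

for client in connected_clients(SAMPLE):
    print(client["Common Name"], client["Real Address"])
```

In practice the text would be read from whatever file the server's `status` line names. The file is only refreshed periodically, so the list can lag behind reality by up to the configured interval.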