I have a collection of CentOS servers in my corporate network. For security reasons, most servers do not have general outbound internet access unless it is a core functional requirement for the server.
This creates a challenge when I need to update packages. For yum repositories, I currently mirror all needed repos from the internet, and make the mirrors available inside the intranet. I keep copies of each repo in each of our five environments: dev, QA, staging, and two production datacenters.
I don't currently solve for language-specific package repos. When servers need an update from rubygems, PyPI, PECL, CPAN or npm, they have to acquire temporary outbound internet access to fetch the packages. I've been asked to start mirroring rubygems and PyPI, and the rest will probably follow.
All of this is clunky and doesn't work well. I'd like to replace it with a single caching proxy in one environment and four daisy-chained proxies in my other environments, to eliminate the complexity and disk overhead of full mirrors. Additionally:
- It can be either a forward or reverse proxy; each package manager supports a proxy server or a custom repository endpoint, which could be either a local mirror or a reverse proxy.
- It needs granular access control, so I can limit which client IPs can connect to which repo domains.
- Clients need to be able to follow redirects to unknown domains. Your original request might be limited to rubygems.org, but if that server returns a 302 to a random CDN, you should be able to follow it.
- It should support HTTPS backends. I don't necessarily need to impersonate other SSL servers, but I should be able to re-expose an HTTPS site over HTTP, or terminate and re-encrypt with a different certificate.
I was initially looking at reverse proxies, and Varnish seems to be the only one that would allow me to internally resolve 302 redirects within the proxy. However, the free version of Varnish does not support HTTPS backends. I'm now evaluating Squid as a forward proxy option.
This seems like something that ought to be a relatively common problem within enterprise networks, but I'm having trouble finding examples of how other people have solved this. Has anyone implemented something similar or have thoughts on how best to do so?
Thanks!
We use Squid for this; the nice thing about Squid is that you can fairly easily set individual expiry of objects based on a pattern match, which allows the metadata from the yum repo to be purged fairly quickly. The directive we use to implement this is refresh_pattern:
http://www.squid-cache.org/Doc/config/refresh_pattern/
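For illustration, rules along these lines would expire repo metadata quickly while keeping packages around for a long time; the regexes and times below are placeholders, not the exact values from our config:

    # Hypothetical rules: expire yum metadata (repodata/) within an hour
    # so clients see newly published packages promptly.
    refresh_pattern -i /repodata/ 0 0% 60
    # RPMs don't change once published, so they can be cached much longer.
    refresh_pattern -i \.rpm$ 10080 100% 43200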
That's the definitive use case for a proxy. A normal forward proxy, not a reverse proxy (a.k.a. a load balancer).
The most well-known free and open-source one is Squid. Luckily it's one of the few good pieces of open-source software that can easily be installed with a single
apt-get install squid3
and configured with a single file, /etc/squid3/squid.conf.
We'll go over the good practices and the lessons to know about.
Start from the official configuration file, slightly modified (with the 5000 useless commented lines removed).
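As a rough sketch, what remains boils down to something like this (the port, networks and ACL names are illustrative assumptions, not anyone's actual file):

    # Listen for proxy requests from clients.
    http_port 3128

    # Placeholder internal networks; only they may use the proxy.
    acl localnet src 10.0.0.0/8 192.168.0.0/16
    http_access allow localnet
    http_access deny all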
Client Configuration - Environment Variables
Configure these two environment variables, http_proxy and https_proxy, on all systems.
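For example, in a shell profile (the proxy address and port are placeholders):

    # Placeholder proxy address; adjust to your squid host.
    export http_proxy=http://10.0.0.1:3128
    export https_proxy=http://10.0.0.1:3128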
Most HTTP client libraries (libcurl, httpclient, ...) configure themselves automatically from these environment variables. Most applications use one of the common libraries and thus support proxying out-of-the-box (without the developer necessarily knowing that they do).
Note that the syntax is strict:
- http_proxy MUST be lowercase on most Linux systems.
- The value MUST begin with http:// for both variables; the scheme describes how clients talk to the proxy (plain HTTP), not the protocol being proxied.
Client Configuration - Specific
Some applications ignore the environment variables and/or run as a service before the variables can be set (e.g. Debian apt). These applications will require special configuration (e.g. /etc/apt/apt.conf).
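For instance, apt can be pointed at the proxy with a small drop-in file; the filename and address below are made up:

    # /etc/apt/apt.conf.d/95proxy (hypothetical path)
    Acquire::http::Proxy "http://10.0.0.1:3128/";
    Acquire::https::Proxy "http://10.0.0.1:3128/";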
HTTPS Proxying - Connect
HTTPS proxying is fully supported by design. It uses a special "CONNECT" method which establishes some sort of tunnel between the browser and the proxy.
Dunno much about that thing but I've never had issues with it in years. It just works.
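You can watch the tunnel being set up with a verbose curl call through the proxy (the address is a placeholder):

    # curl honours https_proxy, or the proxy can be given explicitly with -x.
    # The verbose output shows a "CONNECT rubygems.org:443" request to the proxy,
    # after which TLS runs end-to-end inside the tunnel.
    curl -v -x http://10.0.0.1:3128 https://rubygems.org/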
HTTPS Special Case - Transparent Proxy
A note on transparent proxies (i.e. the proxy is hidden and intercepts client requests, man-in-the-middle style).
Transparent proxies break HTTPS. The client doesn't know that there is a proxy and has no reason to use the special CONNECT method.
The client tries a direct HTTPS connection... that is intercepted. The interception is detected and errors are thrown all over the place. (HTTPS is meant to detect man-in-the-middle attacks.)
Domain and CDN whitelisting
Domain and subdomain whitelisting is fully supported by squid. Nonetheless, it's bound to fail in unexpected ways from time to time.
Modern websites can have all sorts of domain redirections and CDNs. That will break the ACLs whenever the site operator didn't go the extra mile to put everything neatly under a single domain.
Sometimes there will be an installer or a package that wants to phone home or retrieve external dependencies before running. It will fail every single time and there is nothing you can do about it.
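A whitelist in squid.conf might look like this sketch; the ACL names and domains are illustrative, and the CDN hosts are exactly the part people forget:

    # 'localnet' is assumed to be defined as the internal client networks.
    acl pkg_repos dstdomain .rubygems.org .pypi.org .pythonhosted.org .npmjs.org
    http_access allow localnet pkg_repos
    http_access deny all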
Caching
The provided configuration file disables all forms of caching. Better safe than sorry.
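I haven't reproduced the whole file here, but disabling the cache in squid.conf is typically a single directive:

    # Never cache anything; every request is forwarded to the origin server.
    cache deny all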
Personally, I'm running things in the cloud at the moment, all instances have at least 100 Mbps connectivity and the provider runs its own repos for popular stuff (e.g. Debian) which are discovered automatically. That makes bandwidth a commodity I couldn't care less about.
I'd rather disable caching entirely than experience a single caching bug that melts my brain in troubleshooting. Not a single person on the internet gets their caching headers right.
Not all environments have the same requirements though. You may go the extra mile and configure caching.
NEVER EVER require authentication on the proxy
There is an option to require password authentication from clients, typically with their LDAP accounts. It will break every browser and every command line tool in the universe.
If you want to do authentication on the proxy, don't.
If management wants authentication, explain that it's not possible.
If you're a dev and you just joined a company that is blocking direct internet AND forcing proxy authentication, RUN AWAY WHILE YOU CAN.
Conclusion
We went through the common configuration, common mistakes and things one must know about proxying.
Lesson learnt:
As usual in programming and system design, it's critical to manage requirements and expectations.
I'd recommend sticking to the basics when setting up a proxy. Generally speaking, a plain proxy without any particular filtering will work well and not give any trouble. Just gotta remember to (auto) configure the clients.
This won't solve all your tasks, but maybe it's still helpful. Despite the name, apt-cacher-ng doesn't only work with Debian and derivatives; by its own description it is a caching proxy specialized for package files from Linux distributors, primarily for Debian (and Debian-based) distributions but not limited to those.
I'm using this in production in a (Debian-based) environment similar to yours.
However, AFAIK, this won't support rubygems, PyPI, PECL, CPAN or npm and doesn't provide granular ACLs.
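For what it's worth, pointing an apt client at it is just a proxy setting; apt-cacher-ng listens on port 3142 by default, and the hostname here is a placeholder:

    # /etc/apt/apt.conf.d/00aptproxy (hypothetical path)
    Acquire::http::Proxy "http://apt-cacher.internal:3142";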
Personally, I think that investigating Squid is a good idea. If you implement a setup in the end, could you please share your experiences? I'm quite interested in how it goes.
We had a similar challenge and have solved it using local repos and a snapshot-based storage system. We basically update the development repository, clone it for testing, clone that for staging and finally for production. The amount of disk used is limited that way, plus it's all slow SATA storage and that's OK.
The clients get the repository info from our configuration management so switching is easy if necessary.
You could achieve what you want with ACLs on the proxy server, matching user-agent strings or source IP/mask combinations and restricting their access to certain domains. But if you do that, one problem I see is that of different versions of packages/libraries. If one of the hosts is allowed to access CPAN and requests module xxx::yyy, then unless the client asks for a specific version it will pull the latest from CPAN (or PyPI or rubygems), which may or may not be the one that was already cached in the proxy. So you might end up with different versions in the same environment. You will not have that problem if you use local repositories.
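As a sketch of that per-environment ACL idea in squid.conf (every name and address below is made up):

    # Only the dev hosts may reach the language repos through the proxy.
    acl dev_hosts src 10.1.0.0/24
    acl lang_repos dstdomain .cpan.org .pypi.org .rubygems.org
    http_access allow dev_hosts lang_repos
    http_access deny all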