I need to mirror a website and deploy the copy under a different domain name. The mirroring procedure should be fully automatic, so that I can update the copy on a regular basis with cron.
The mirror MUST NOT be a live, real-time mirror, but it MUST be a static copy, i.e. a snapshot of the site at a specific time, so I think wget might fit.
As of now, I've come up with the following script to get a copy of the original site:
#!/bin/bash
DOMAIN="example.com"
cd /srv/mirrors
# Download into a scratch directory so the live copy is only
# replaced once the new snapshot is complete
TMPDIR=$(mktemp -p . -d)
cd "${TMPDIR}"
wget -m -p -E --tries=10 --convert-links --retry-connrefused "${DOMAIN}"
cd ..
# Rotate: keep the previous snapshot around as oldcopy
rm -rf oldcopy
mv "${DOMAIN}" oldcopy
mv "${TMPDIR}/${DOMAIN}" "${DOMAIN}"
rmdir "${TMPDIR}"
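The idea is to run this from cron; a sketch of a nightly crontab entry, assuming the script is saved as /usr/local/bin/mirror-example.sh (hypothetical path):

# m h dom mon dow  command
0 3 * * * /usr/local/bin/mirror-example.sh >>/var/log/mirror-example.log 2>&1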
The resulting copy is then served by Nginx under the new domain name, with a simple configuration for a local static site, and it seems to work.
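For completeness, the serving side is just a minimal static server block along these lines (the new domain name and root path here are placeholders):

server {
    listen 80;
    server_name newdomain.example;
    # Points at the snapshot directory produced by the script above
    root /srv/mirrors/example.com;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }
}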
The problem is that the origin server produces web pages with absolute links in them, even when the links point to internal resources. For example, a page at https://example.com/page1 contains
<link rel="stylesheet" href="https://example.com/style.css">
<script src="https://example.com/ui.js"></script>
and so on (it's WordPress), and there is no way I can change that behavior. wget then does not convert those links for local browsing, because they are absolute (or at least I think that's the cause).
EDIT: the real domain name is assodigitale.it, though I need a script that works regardless of the particular domain, because I will need it for a few other domains too.
Can I make wget convert those links to the new domain name?
There is another solution to your problem.
Instead of making wget convert those links to the new domain name, you can make your webserver rewrite links on the fly.
With Apache, you can use mod_sed to rewrite the links, e.g.:
AddOutputFilter Sed html
OutputSed "s/example.com/newdomain.com/g"
https://httpd.apache.org/docs/trunk/mod/mod_sed.html
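In context, those directives would sit in the virtual host serving the copy; a sketch, assuming the snapshot lives in /srv/mirrors/example.com and the new name is newdomain.com (both placeholders):

<VirtualHost *:80>
    ServerName newdomain.com
    DocumentRoot /srv/mirrors/example.com

    <Directory /srv/mirrors/example.com>
        Require all granted
        # Pipe every .html response through mod_sed
        AddOutputFilter Sed html
        OutputSed "s/example.com/newdomain.com/g"
    </Directory>
</VirtualHost>

Since you are serving the copy with Nginx, the equivalent on-the-fly rewrite there would be the sub_filter directive from ngx_http_sub_module.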
Could this be a mixed content issue or otherwise related to using both HTTP & HTTPS protocols?
It might be that you are doing the mirror over HTTP while the mentioned URLs to be converted are absolute HTTPS URLs: your script passes the bare domain to wget, and wget assumes http:// when no scheme is given.
The link conversion is the last phase of your command, and wget prints detailed information on the conversion process for each file it touches. Only at the end does wget know what has been downloaded, and it converts only the links it knows about (from this download history) into relative paths pointing at the existing files. It's possible that, while wget is able to retrieve content using HTTP, it fails with HTTPS.
Try this:
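For instance, rerun the mirror with the scheme spelled out explicitly; a sketch using the same options as your script, where only the explicit https:// is new:

wget -m -p -E --tries=10 --convert-links --retry-connrefused "https://${DOMAIN}"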
It might either work or give you an error that helps you pin down the actual problem.