I need to mirror a website and deploy the copy under a different domain name. The mirroring procedure should be fully automatic, so that I can update the copy on a regular basis with cron.
The mirror MUST NOT be a live, real-time mirror, but it MUST be a static copy, i.e. a snapshot of the site at a specific time, so I think wget might fit.
As of now, I've come up with the following script to get a copy of the original site:
#!/bin/bash
DOMAIN="example.com"
cd /srv/mirrors
# Download into a scratch directory so the live copy is only
# replaced once the new snapshot is complete
TMPDIR=$(mktemp -p . -d)
cd "${TMPDIR}"
wget -m -p -E --tries=10 --convert-links --retry-connrefused "${DOMAIN}"
cd ..
# Rotate: keep the previous snapshot around as oldcopy
rm -rf oldcopy
mv "${DOMAIN}" oldcopy
mv "${TMPDIR}/${DOMAIN}" "${DOMAIN}"
rmdir "${TMPDIR}"
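The idea is to run this from cron; a sketch of a nightly crontab entry, assuming the script is saved as /usr/local/bin/mirror-example.sh (hypothetical path):

# m h dom mon dow  command
0 3 * * * /usr/local/bin/mirror-example.sh >>/var/log/mirror-example.log 2>&1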
The resulting copy is then served by Nginx under the new domain name, with a simple configuration for a local static site, and it seems to work.
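For completeness, the serving side is just a minimal static server block along these lines (the new domain name and root path here are placeholders):

server {
    listen 80;
    server_name newdomain.example;
    # Points at the snapshot directory produced by the script above
    root /srv/mirrors/example.com;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }
}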
The problem is that the origin server produces web pages with absolute links in them, even when the links point to internal resources. For example, a page at https://example.com/page1 contains
<link rel="stylesheet" href="https://example.com/style.css">
<script src="https://example.com/ui.js"></script>
and so on (it's WordPress), and there is no way I can change that behavior. wget then does not convert those links for local browsing, because they are absolute (or at least I think that's the cause).
EDIT: the real domain name is assodigitale.it, though I need a script that works regardless of the particular domain, because I will need it for a few other domains too.
Can I make wget convert those links to the new domain name?
There is another solution to your problem.
Instead of making wget convert those links to the new domain name, you can make your webserver rewrite links on the fly.
With Apache, you can use mod_sed to rewrite the links, e.g.:
AddOutputFilter Sed html
OutputSed "s/example.com/newdomain.com/g"
https://httpd.apache.org/docs/trunk/mod/mod_sed.html
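In context, those directives would sit in the virtual host serving the copy; a sketch, assuming the snapshot lives in /srv/mirrors/example.com and the new name is newdomain.com (both placeholders):

<VirtualHost *:80>
    ServerName newdomain.com
    DocumentRoot /srv/mirrors/example.com

    <Directory /srv/mirrors/example.com>
        Require all granted
        # Pipe every .html response through mod_sed
        AddOutputFilter Sed html
        OutputSed "s/example.com/newdomain.com/g"
    </Directory>
</VirtualHost>

Since you are serving the copy with Nginx, the equivalent on-the-fly rewrite there would be the sub_filter directive from ngx_http_sub_module.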
Could this be a mixed content issue or otherwise related to using both HTTP & HTTPS protocols?
It might be that you are doing the mirror over HTTP while the mentioned URLs to be converted are absolute HTTPS URLs: your script passes the bare domain to wget, and wget assumes http:// when no scheme is given.
The link conversion is the last phase of your command, and wget prints detailed information on the conversion process for each file it touches. Only at the end does wget know what has been downloaded, and it converts only the links it knows about (from this download history) into relative paths pointing at the existing files. It's possible that, while wget is able to retrieve content using HTTP, it fails with HTTPS.
Try this:
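For instance, rerun the mirror with the scheme spelled out explicitly; a sketch using the same options as your script, where only the explicit https:// is new:

wget -m -p -E --tries=10 --convert-links --retry-connrefused "https://${DOMAIN}"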
It might either work or give you an error that helps you pin down the actual problem.