Re: Problems with wget
- Date: Mon, 26 Feb 2018 13:50:44 +0100
- From: <tomas@xxxxxxxxxx>
- Subject: Re: Problems with wget
On Mon, Feb 26, 2018 at 06:40:02AM -0600, Richard Owlett wrote:
> I'm attempting to download a site which is an instruction manual.
> Its URL is of the form
> That page has several lines whose target URLs are of the form
> I wish a single HTML file consisting of all the pages of the site.
> Where <http://example.com/index.html> points to
> <http://example.com/pageN.html> I wish my local file to have
> appropriate internal references.
> There are references of the form
> which I do not wish to download.
> I tried
> wget -l 2 -O owl.html --no-parent http://example.com/index.html
> It *almost* worked as intended.
> I did get all the text of the site.
> 1. I also got the text of <http://some_where_else.com/pagex.html>
> 2. Where <http://example.com/index.html> referenced
> <http://example.com/pageN.html> there were still references to
> the original site rather than a relative link within owl.html .
Ad (1): this is strange. By default wget doesn't "span" hosts,
i.e. doesn't follow links to other hosts unless you specify
that with -H (--span-hosts).
Ad (2): you want option -k. Quoth the man page:
After the download is complete, convert the links in
the document to make them suitable for local viewing.
This affects not only the visible hyperlinks, but any
part of the document that links to external content,
such as embedded images, links to style sheets,
hyperlinks to non-HTML content, etc.
Each link will be changed in one of the two ways:
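Putting the two points together, a command along these lines should do what
you describe (an untested sketch; example.com stands in for the real site,
as in your message):

```shell
# Mirror the manual and rewrite links for local viewing.
# -r           recurse into linked pages (-l alone has no effect without -r)
# -l 2         limit recursion depth to 2
# -k           after the download, convert links so they work locally
# --no-parent  never ascend above index.html's directory
# No -H, so wget stays on example.com and ignores some_where_else.com.
wget -r -l 2 -k --no-parent http://example.com/index.html
```

Note that this drops your -O owl.html: link conversion with -k needs each
page saved as its own local file, so writing everything into a single file
defeats it (and current wget refuses -k with -O for multi-file downloads).
You'd get a directory tree under example.com/ with working relative links
instead of one merged HTML file.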
-- tomás