
Re: Problems with wget


On Mon, Feb 26, 2018 at 06:40:02AM -0600, Richard Owlett wrote:
> I'm attempting to download a site which is an instruction manual.
> Its URL is of the form
>    http://example.com/index.html
> That page has several lines whose target URLs are of form
>    http://example.com/page1.html
>    http://example.com/page2.html
>    http://example.com/page3.html
>   etc.
> I wish a single HTML file consisting of all the pages of the site.
> Where <http://example.com/index.html> points to
> <http://example.com/pageN.html> I wish my local file to have
> appropriate internal references.
> There are references of form
>    http://some_where_else.com/pagex.html
> which I do not wish to download.
> I tried
> wget -l 2 -O owl.html --no-parent http://example.com/index.html
> It *almost* worked as intended.
> I did get all the text of the site.
>   1. I also got the text of <http://some_where_else.com/pagex.html>
>   2. Where <http://example.com/index.html> referenced
>      <http://example.com/pageN.html> there were still references to
>      the original site rather than a relative link within owl.html .

Ad (1): this is strange. By default wget doesn't "span" hosts,
  i.e. doesn't follow links to other hosts unless you specify
  that with -H (--span-hosts).
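
  If that foreign host keeps showing up anyway, you can reject it
  explicitly. As a sketch only (untested; some_where_else.com stands
  in for whatever host you don't want, and note that it is -r that
  enables recursion at all, -l merely caps its depth):

      wget -r -l 2 --no-parent \
           --exclude-domains some_where_else.com \
           http://example.com/index.html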

Ad (2): you want option -k. Quoth the man page:

           After the download is complete, convert the links in
           the document to make them suitable for local viewing.
           This affects not only the visible hyperlinks, but any
           part of the document that links to external content,
           such as embedded images, links to style sheets,
           hyperlinks to non-HTML content, etc.

           Each link will be changed in one of the two ways:
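
So, as a rough sketch (again untested, example.com standing in for the
real site), something along these lines should fetch the manual's pages
and rewrite the links between them for local viewing:

      wget -r -l 2 --no-parent --convert-links \
           http://example.com/index.html

Note this leaves each page as a separate local file; if I remember
correctly, wget refuses to combine -k with -O, so stitching everything
into a single owl.html would have to be a separate step afterwards.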

-- tomás