There are some sites that provide books as HTML pages (e.g., legal stuff).
What can I use to create a PDF book from these pages, based on the already-existing structure?
On Windows there is Adobe Acrobat Professional (commercial software). I'm guessing that Linux has something free? A solution involving scripting would be OK for me.
Calibre is a pretty powerful tool for converting things into ebooks in various formats. Available in a Software Centre near you!
Don't be deceived by its less-than-beautiful UI; it can do a lot.
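Calibre also ships a command-line converter, ebook-convert, so the conversion can be scripted. Here is a minimal sketch, assuming Calibre is installed and that the book has an entry page such as index.html. The function name, the EBOOK_CONVERT override variable, and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert one HTML entry page to PDF with Calibre's
# ebook-convert. When given an HTML file, Calibre should follow links to
# other local files, so pointing it at the table-of-contents page is
# usually the place to start.
html_to_pdf_calibre() {
    input="$1"
    # Default output name: same base name with a .pdf extension.
    output="${2:-${input%.html}.pdf}"
    # EBOOK_CONVERT is an assumed override in case the binary is not on PATH.
    "${EBOOK_CONVERT:-ebook-convert}" "$input" "$output"
}

# Usage (placeholder file names):
#   html_to_pdf_calibre index.html book.pdf
```

The pages need to be saved locally first, since the sketch passes a file path rather than a URL.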
The easiest way? File > Print from your browser. Select Print to File as your printer, and it will ask you where to save the file. Be sure to select PDF as the output format. Hit "Print" and the page will be saved to your drive instead of being sent to a printer.
Htmldoc can be useful; see http://www.htmldoc.org/. It is available from the Software Centre. Sadly, version 1.8 has a problem with Unicode-encoded files, but on many occasions it can still be a saviour. The problem is fixed in the 1.9 development version.
I usually use the wonderful ScrapBook extension for Firefox (http://amb.vis.ne.jp/mozilla/scrapbook/) to capture the web pages, use the editing tools in ScrapBook to fix them up if needed, and then use htmldoc to convert all the pages to PDF.
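The last step of that workflow can be scripted. This is a rough sketch, assuming the captured pages are already saved as local .html files. The function name, the HTMLDOC override variable, and the file names are placeholders, not part of htmldoc itself.

```shell
#!/bin/sh
# Hypothetical helper: merge several saved HTML pages into one PDF book.
# Usage: html_book_htmldoc OUTPUT.pdf PAGE1.html [PAGE2.html ...]
html_book_htmldoc() {
    output="$1"
    shift
    # --book : structure the output as a book (headings become chapters,
    #          with a generated table of contents)
    # -t pdf : write PDF output
    # -f     : name of the output file
    # HTMLDOC is an assumed override in case the binary is not on PATH.
    "${HTMLDOC:-htmldoc}" --book -t pdf -f "$output" "$@"
}

# Usage (placeholder file names, listed in reading order):
#   html_book_htmldoc book.pdf chapter1.html chapter2.html
```

If the pages do not use headings consistently, --webpage in place of --book may give better results, since it simply renders the pages one after another.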
I would recommend using OpenOffice/LibreOffice to create the PDF. As a test, I downloaded the Wget manual (all on one page), opened the HTML page in OpenOffice, and clicked the "Export Directly to PDF" button. It created the PDF with an index built from the table of contents.
In the past I've found this to be the easiest way to convert HTML pages to PDF. It also allows you to make changes without much effort.
Screenshots: the Wget manual exported to PDF using OpenOffice; the "Export Directly to PDF" option in OpenOffice.
You could try http://www.xhtml2pdf.com/. It's a converter from HTML/XHTML and CSS to PDF, written entirely in Python.
I've actually voted for the calibre solution. But here's another you could try. Install AbiWord. It can do conversions between any formats it knows from the command line. To convert all the .html files in a folder to .pdf you could do:
for file in *.html ; do abiword --to=pdf "$file" ; done
For higher-level typography (but arguably more complicated), another option would be PrinceXML.
Depending on the HTML document to be printed, you might get the best results using pandoc. It is one of the most versatile HTML-to-LaTeX converters. The resulting .tex file can be turned into a PDF quite easily using xelatex or pdflatex. Lots of options are available if you are willing to delve into LaTeX syntax and packages. This may not work well if embedded images and fancy HTML styles should be preserved.
In Google Chrome, you can create a PDF file of a whole site by using an extension. I personally use the Web2PDF Converter extension, which makes a PDF in a single click.
Additionally, you can see a PDF I created with this tool by downloading it (right-click, "Save target as"): http://geppettvs.servehttp.com/resources/askubuntu-com.pdf (some browsers, such as Google Chrome, may let you view it online).
And if you wish to edit the PDFs created by the extension, for example to remove the digital signature it places at the bottom of each page or to remove anything else, take a look at this: Remove text information from a PDF?
Good luck!