Question

dan

Asked: 2011-07-07 08:54:48 +0800 CST

Is there a better pdf to text converter than pdftotext?

772

I'm using pdftotext (part of poppler-utils) to convert PDF documents to text. It works, for the most part, but one thing I wish it did was to insert blank lines between separate paragraphs instead of mashing them together.

Is there a way to get pdftotext to do this? And if not, is there another PDF-to-text utility that can?

5 Answers

Noah · Answer 1 · 2013-06-14T07:25:32+08:00

If you are using pdftotext you can use the -layout flag to preserve the layout of the text on the pages in your input pdf file:

pdftotext -layout input.pdf output.txt
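
If you have a whole folder of PDFs, a loop along these lines should do the same thing for each file. This is only a sketch: the function name convert_all_layout and its directory argument are made up for illustration, and -layout is the only pdftotext option it relies on.

```shell
# Convert every PDF in a directory to text, keeping the page layout.
# convert_all_layout is a hypothetical helper name, not part of poppler-utils.
convert_all_layout() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        # foo.pdf -> foo.txt, written next to the input file.
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}
```

Call it as `convert_all_layout ~/Documents/papers` (the path is just an example); with no argument it works on the current directory.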

130

frabjous · Answer 2 · 2011-08-09T20:52:14+08:00

Best Answer

You could try ebook-convert from Calibre.

If anything, I'd say it errs in the other direction: too many line breaks.
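
For reference, ebook-convert takes the input file followed by the output file and picks the output format from the extension. A minimal sketch, assuming Calibre is installed; the wrapper name pdf_to_txt_calibre is hypothetical and only adds a default output name and a missing-tool check:

```shell
# pdf_to_txt_calibre is a hypothetical wrapper; ebook-convert is Calibre's CLI.
pdf_to_txt_calibre() {
    src="$1"
    # Default output name: same as the input, with .pdf swapped for .txt.
    dst="${2:-${src%.pdf}.txt}"
    if ! command -v ebook-convert >/dev/null 2>&1; then
        echo "ebook-convert not found; install Calibre first" >&2
        return 127
    fi
    ebook-convert "$src" "$dst"
}
```

For example, `pdf_to_txt_calibre input.pdf` should write input.txt next to the PDF.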

Another thing I'd definitely consider, though, is converting to HTML using pdfreflow and then converting the HTML to TXT.

26

Darren Cook · Answer 3 · 2013-09-11T18:58:09+08:00

As a fan of open source (and automation) I hate to say this, but the best results I just got (on quite a large, complex PDF) were to open it in Adobe Reader, then choose File|Save As Text.

(I am pre-processing for text analysis experiments, not as a reader, but I think my first and second choice would be the same.)

I've been comparing the output side-by-side. My second choice is ebook-convert.

Adobe: left in form feeds (FF) for page breaks and left in page numbers; it hasn't converted headings/paragraphs to single lines, but it has fixed hyphens. Junk that was hidden in the PDF did not get output. It correctly got the big capitals at the start of sections, e.g. "The", not "T he" or even "T\n\nhe".

ebook-convert: Left in page numbers, and some hidden junk in header/footer (but no FFs). Converts most paragraphs to be single lines. The ones it missed are double-spaced though! Bullets don't always line up with the text. Correctly got "The" at the start of the chapter.

pdftotext (without -layout): Not bad, bullets line up, but header/footer noise. FFs are in there. Hyphens removed. Worst for start of chapter big letters: "T\n\nhe".

pdftotext (with -layout): Similar, but more indents. "T he" for start of chapter.

pdftohtml >> pdfreflow >> htmltotext: It removed page numbers, but still junk in header/footer. "T he" for start of chapter. Hyphens removed. (It uses multiple lines per paragraph, yet they are not the same line breaks as in the other versions!)
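
Some of the noise listed above (the FFs and the stray page numbers) can be filtered out after the fact. A rough sketch, not something I'd call robust: clean_pdf_text is a made-up helper name, and it assumes that any line containing nothing but digits is a page number, so it will also drop legitimate number-only lines such as table cells.

```shell
# clean_pdf_text is a hypothetical helper: reads extracted text on stdin,
# writes cleaned text on stdout.
clean_pdf_text() {
    # 1. Turn each form feed (page break) into a newline.
    # 2. Drop lines that hold nothing but a number (assumed to be page numbers).
    tr '\f' '\n' | grep -v -E '^[[:space:]]*[0-9]+[[:space:]]*$'
}
```

Usage would be something like `pdftotext input.pdf - | clean_pdf_text > output.txt`, where the `-` tells pdftotext to write to stdout.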

xangua · Answer 4 · 2011-07-07T10:13:46+08:00

If you have a Google account, you can use Google Docs to upload the PDF and transform it into editable text.

7

Max · Answer 5 · 2013-10-05T10:22:26+08:00

I also tried pypdf and compared it against pdftotext on two documents. It had more linebreaks and split some section names (REFERENCES was R E F E R E N C E S).

pdf2txt output complete garbage.

I often use PDFBox (Java) if pdftotext screws up the output. You might give it a try.

1
