I'm using pdftotext (part of poppler-utils) to convert PDF documents to text. It works, for the most part, but one thing I wish it did was to insert blank lines between separate paragraphs instead of mashing them together.
Is there a way to get pdftotext to do this? And if not, is there another PDF-to-text utility that can?
If you are using pdftotext, you can use the -layout flag to preserve the layout of the text on the pages of your input PDF file.
You could try ebook-convert from Calibre. If anything, I'd say it errs in the other direction: too many line breaks.
Another thing I'd definitely consider, though, is converting to HTML using pdfreflow and then converting the HTML to TXT.
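For anyone who wants to compare the two command-line suggestions above on the same file, here is a minimal Python sketch that runs pdftotext with -layout and Calibre's ebook-convert. The function names and output file names are my own invention for illustration. It assumes both tools are installed and on the PATH, and that ebook-convert picks the output format from the .txt extension.

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf, txt, layout=True):
    """Build a pdftotext command line; -layout asks it to keep the page layout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [str(pdf), str(txt)]
    return cmd


def ebook_convert_cmd(pdf, txt):
    """Build a Calibre ebook-convert command line (input file, then output file)."""
    return ["ebook-convert", str(pdf), str(txt)]


def convert_both(pdf, out_dir="."):
    """Run both tools on one PDF so the results can be compared side by side."""
    pdf = Path(pdf)
    out_dir = Path(out_dir)
    # Hypothetical output names, chosen only to keep the two results apart.
    outputs = {
        "pdftotext": out_dir / f"{pdf.stem}.pdftotext.txt",
        "ebook-convert": out_dir / f"{pdf.stem}.calibre.txt",
    }
    subprocess.run(pdftotext_cmd(pdf, outputs["pdftotext"]), check=True)
    subprocess.run(ebook_convert_cmd(pdf, outputs["ebook-convert"]), check=True)
    return outputs
```

Calling convert_both("paper.pdf") should leave two text files next to each other, which you can then diff. Passing layout=False to pdftotext_cmd gives the plain, non-layout mode.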
As a fan of open source (and automation) I hate to say this, but the best results I got (on quite a large, complex PDF) came from opening it in Adobe Reader and then choosing File|Save As Text.
(I am pre-processing for text analysis experiments, not as a reader, but I think my first and second choice would be the same.)
I've been comparing the output side-by-side. My second choice is ebook-convert.
Adobe: left in FF (form feed) characters for page breaks, left in page numbers, hasn't converted headings/paragraphs to single lines, but it has fixed hyphens. Junk that was hidden in the PDF did not get output. Correctly got the big capitals at start of sections, e.g. "The", not "T he" or even "T\n\nhe".
ebook-convert: Left in page numbers, and some hidden junk in header/footer (but no FFs). Converts most paragraphs to be single lines. The ones it missed are double-spaced though! Bullets don't always line up with the text. Correctly got "The" at the start of the chapter.
pdftotext (without -layout): Not bad, bullets line up, but header/footer noise. FFs are in there. Hyphens removed. Worst for start of chapter big letters: "T\n\nhe".
pdftotext (with -layout): Similar, but more indents. "T he" for start of chapter.
pdftohtml >> pdfreflow >> htmltotext: It removed page numbers, but still junk in header/footer. "T he" for start of chapter. Hyphens removed. (It uses multiple lines per paragraph, yet they are not the same line breaks as in the other versions!)
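Several of the leftovers listed above (form feeds, page-number lines, hyphens at line ends, paragraphs wrapped over multiple lines) can be cleaned up after extraction, whichever tool produced the text. Below is a rough Python sketch of that kind of post-processing. It is not what any of these tools does internally, the function name is made up, and the rules are heuristics: for example, it assumes paragraphs are already separated by blank lines or page breaks, and it will also glue together a genuine compound such as "well-known" if it happens to break at a line end.

```python
import re


def clean_extracted_text(text):
    """Tidy raw PDF-to-text output for text analysis (heuristics, not exact)."""
    # Form feeds ("\f") mark page breaks; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    # Drop lines that contain nothing but a page number.
    lines = [ln for ln in text.split("\n") if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)
    # Rejoin words hyphenated across a line break: "conver-\nsion" -> "conversion".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Split on blank lines, then collapse each paragraph onto a single line.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [" ".join(p.split()) for p in paragraphs]
    # Separate the surviving paragraphs with exactly one blank line.
    return "\n\n".join(p for p in joined if p)
```

The result has one paragraph per line with a blank line between paragraphs, which is roughly what the original question asked for. Header/footer junk and the split drop capitals ("T he") are not handled here; those need document-specific rules.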
If you have a Google account, you can use Google Docs to upload the PDF and transform it into editable text.
I also tried pypdf and compared it against pdftotext on two documents. It produced more line breaks and split some section names (REFERENCES came out as R E F E R E N C E S).
pdf2txt output complete garbage.
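If the letter-spaced section names mentioned above are the main problem, they can often be repaired after the fact. Here is a small Python sketch with a regular expression that collapses runs of three or more single capital letters separated by single spaces. The function name is hypothetical and this is only a heuristic: two letter-spaced words separated by just one space would be merged into one, and a legitimate sequence such as "A B C" would be collapsed too.

```python
import re

# Three or more single capital letters, each separated by exactly one space.
_SPACED_CAPS = re.compile(r"\b(?:[A-Z] ){2,}[A-Z]\b")


def collapse_spaced_caps(text):
    """Turn letter-spaced headings like 'R E F E R E N C E S' back into one word."""
    return _SPACED_CAPS.sub(lambda m: m.group(0).replace(" ", ""), text)
```

Ordinary text is left alone because normal words are longer than one letter, and a pair like "A B" is too short to match.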
I often use PDFBox (Java) if pdftotext screws up the output. You might give it a try.