How can I turn photos of paper documents into a scanned document? is related, but not the same, as I'm talking about pdf files. The processing of images seems complicated in the answers under the linked question, especially because it involves processing each image separately: given my pdf has hundreds of pages, the solution I expect is not that of processing/editing images, but simply of scanning digital photos and documents the way real ones are. I mean something like a "virtual scanner" for which the input would be a photo-based pdf or collection of photos and the output a "normal" scanned document. (Also the Scantailor tool recommended - also here - seems to lack a Linux version now.)
This is not about OCR and not about converting image to text.
To clarify what I mean I will post a few examples.
There are pdf files based on text, not image, and they are text files (let's say docx or odt) exported to pdf. They look ready to be printed:
The above is not what I discuss here.
What I'm interested in are the pdfs in the images below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.
The first are formed of images that look like pictures taken of book pages:
or
Such copies can hardly be re-printed on paper, as the background will be printed too.
The second ones are what one would expect from scanned text, and can be printed:
or
The picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of (page) photos: OCR is not the problem here.
What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.
As @vanadium noticed in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much like Google Scan on a smartphone.
As @user535733 said in a comment, the problem here seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white.
scantailor is not maintained anymore, but you can still build it from source and use it. However, the original repository needs qt4, which is not easily installable in recent Ubuntu versions. You can use e.g. this fork that has been adapted to qt5.
Prerequisites:
Installation:
Disclaimer: I don't know the maintainer of this fork, and cannot say anything about the safety of his version.
Another option would be to use Scantailor Advanced. You can install it via snap, via flatpak, or via a PPA.
Quick test:
As a direct solution on PDF (no manual image extraction):
Using ocrmypdf to restore OCR (as mentioned at the end of the complementary part of this answer), I noticed that ocrmypdf -h shows an option which sounded like exactly what is asked: --remove-background.
The initial pdf already had OCR, which gives an error unless one of the following options is used:
or
Applying each separately to one of my large files with hundreds of pages that already had OCR crashed the process.
The best solution seems to me to first print to pdf the initial file (which removes OCR), and then do
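A minimal sketch of such a command, wrapped in a small shell function. The function name clean_pdf and the file names are made up for illustration; only -l, -v and --remove-background come from this answer. Some newer OCRmyPDF releases may report --remove-background as not implemented, so check ocrmypdf -h for your version first.

```shell
# Sketch: clean the page backgrounds and add a fresh OCR layer in one pass.
# clean_pdf is a hypothetical helper name, not part of ocrmypdf.
clean_pdf() {
    src=$1            # PDF without a text layer (e.g. the "printed to pdf" copy)
    dst=$2            # cleaned, searchable result
    lang=${3:-eng}    # Tesseract language code; eng is what ocrmypdf uses by default
    # -v 1 prints more detailed progress in the terminal
    ocrmypdf -v 1 --remove-background -l "$lang" "$src" "$dst"
}

# Example call (Romanian text):
#   clean_pdf printed.pdf cleaned.pdf ron
```

The language argument is optional, so for English the call reduces to clean_pdf printed.pdf cleaned.pdf.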
For English, the -l option is not needed. -v is for verbose details in terminal.
The resulting pdf is larger than the input (because of the --remove-background option): reduce the size as said below.
About Scan Tailor, as a complement to the main answer
Even its icon illustrates the fact that it is intended exactly for what is asked here:
Here is how to use Scan Tailor with pdfs:
pdftoppm MY_PDF.pdf NAME -tiff - as said here. Other options can be used instead of tiff (which gives tif files), for example png or jpeg. See here a set of Dolphin service menu actions for the various extraction options:
(... tif files are as you want them.)
There are many ways to create a new pdf. Again, the GUI tools that I've tried very soon crashed or gave odd results, so I prefer to put the resulting tif files in a separate folder and there run the command img2pdf *.tif -o out.pdf - as said here. (This may need proper naming/numbering of the files. More on that here.)
The resulting "tailored" pdf will be smaller than the initial one, but the percentage of the size reduction varies depending on factors that I don't know (but I imagine that the pages contained in the initial pdf should be extracted, at step 1, in the format they already have; I think jpeg and tif should be used instead of png; use pdfimages -list your.pdf in terminal to see the format, dpi and other details before processing with the commands above and below).
The final pdf can be further reduced with a command like:
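A minimal sketch of such a command, assuming Ghostscript (gs) is installed. The helper name shrink_pdf and the choice of the /ebook preset are my assumptions, not necessarily what was originally used here; /screen gives smaller files and /printer larger ones.

```shell
# Sketch: rewrite a PDF through Ghostscript's pdfwrite device to reduce its size.
# shrink_pdf is a hypothetical helper name.
shrink_pdf() {
    src=$1    # large input PDF
    dst=$2    # smaller output PDF
    # -dPDFSETTINGS selects a quality preset: /screen, /ebook, /printer or /prepress
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
       -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$dst" "$src"
}

# Example call:
#   shrink_pdf out.pdf out_small.pdf
```

It is worth comparing the result with the original page by page, since stronger presets can visibly degrade the text.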
More details on that, here.
Here is a set of Dolphin service menu actions based on the above link:
I got some help from this answer too.
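The extraction and rebuilding steps around Scan Tailor can be sketched as two small shell functions. The function names, the folder layout and the 300 dpi default are assumptions for illustration; the pdftoppm and img2pdf calls are the ones mentioned above. The Scan Tailor processing itself still happens in its GUI between the two calls.

```shell
# Sketch: extract every page of a PDF as a TIFF image into a work folder.
# extract_pages is a hypothetical helper name.
extract_pages() {
    pdf=$1
    dir=$2
    dpi=${3:-300}    # assumed default; check the source with: pdfimages -list your.pdf
    mkdir -p "$dir"
    # Writes $dir/page-1.tif, $dir/page-2.tif, ... (numbers are zero-padded as needed)
    pdftoppm -tiff -r "$dpi" "$pdf" "$dir/page"
}

# Sketch: merge all tif files of a folder into one PDF, in file-name order.
# build_pdf is a hypothetical helper name.
build_pdf() {
    dir=$1    # folder holding the tif files written by Scan Tailor
    dst=$2
    img2pdf "$dir"/*.tif -o "$dst"
}

# Example calls:
#   extract_pages MY_PDF.pdf pages
#   ... process the "pages" folder in Scan Tailor ...
#   build_pdf pages/out out.pdf
```

Because the shell expands *.tif in alphabetical order, the page order of the new pdf depends on consistent file numbering.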
OCR (text search and copy capability) is lost during the above procedure, if present in the initial pdf. In order to get OCR, use ocrmypdf input.pdf output.pdf for English, as said here. For other languages, look for them with apt-cache search tesseract-ocr, and install them. Add -l <LANG> at the end of the command for specific languages; more here; see their names also here.
Here is a Dolphin service menu action for Romanian OCR with two options (one with progress in terminal and fixed output name, the other with background process but with output name based on input; I would like to have both progress in terminal and output name based on input but don't know how; if someone can do it, please post here!). For English, replace "Romanian" and remove the -l ron option:
(Extracting and processing images, as well as 'printing as pdf', removes OCR, but reducing size with ghostscript as above does not, so the "shrinking" can be applied before or after the OCR.)
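Outside of Dolphin, the same OCR step can be run from a terminal with an output name derived from the input. This is only a sketch: the function name ocr_pdf and the _ocr suffix are my own choices, and it has not been adapted into a service menu entry.

```shell
# Sketch: add an OCR layer, naming the output after the input file.
# ocr_pdf is a hypothetical helper name.
ocr_pdf() {
    src=$1
    lang=${2:-eng}                # e.g. ron for Romanian
    dst="${src%.pdf}_ocr.pdf"     # book.pdf becomes book_ocr.pdf
    # Run in the foreground so that ocrmypdf's progress stays visible in the terminal
    ocrmypdf -l "$lang" "$src" "$dst"
}

# Example call:
#   ocr_pdf book.pdf ron
```

The language pack for the chosen code (found with apt-cache search tesseract-ocr) has to be installed first.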
I got pretty good results using ImageMagick and the following script: http://www.fmwconcepts.com/imagemagick/shadowhighlight/index.php
Here is the result using the following parameters:
Just install GIMP (preferably use the AppImage). The following are the options:
Option 2) Select Image > Mode > Indexed > Use black and white (1-bit) palette
However many pages your pdf has, this will convert all of them to 1-bit black and white.
Edit on 02/11/2021: As per query raised by cipiricus
Here are steps that I follow: