I have a number of scanned documents in pdf and I want to be able to search them. How can I do that?
Essentially I have to OCR the PDF and then blend the extracted text back into a new PDF. I have unsuccessfully tried a number of different solutions (including the ones found in Adding OCR info to a PDF).
- pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)
- pdfsandwich (of which the software center says it is a poor package and I should not install it)
- OCRfeeder (in the software center) exports to odt nicely, but does not react when exporting to pdf.
- Gscan2pdf exports an all black (but searchable) image as reported in this discussion.
- I don't think Pdfxchange viewer can handle doing ocr on the fly on files over 500 pages.
Is there a software package I am unaware of? Or a script that does this?
As of Ubuntu 16.04, OCRmyPDF is available through apt, so you can install it with sudo apt install ocrmypdf.
You can then OCR your PDF with a command of the form ocrmypdf input.pdf output.pdf.
If the command seems unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter PDF, which you can produce by extracting a few pages from the original.
If you have any questions, have a look at the GitHub repo.
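A rough sketch of such a test run, in case it helps: the snippet cuts the first five pages out of a PDF and runs OCRmyPDF on that sample with extra verbosity. It assumes qpdf is installed for the page extraction (that tool is my choice, not something this answer prescribes), and the file names and page range are placeholders.

```shell
# Sketch only: file names and the page range are placeholders.
input="input.pdf"
sample="sample.pdf"          # short excerpt used for the test run
sample_out="sample_ocr.pdf"  # OCR result for the excerpt
pages="1-5"                  # must not exceed the page count of $input

if command -v qpdf >/dev/null 2>&1 && command -v ocrmypdf >/dev/null 2>&1 && [ -f "$input" ]; then
    # Copy only the selected pages into a new, shorter PDF.
    qpdf --empty --pages "$input" "$pages" -- "$sample"
    # The verbosity flag is placed after the file names here; check
    # "ocrmypdf --help" for how your version treats verbosity levels.
    ocrmypdf "$sample" "$sample_out" -v
else
    echo "Need qpdf, ocrmypdf and $input to run this test" >&2
fi
```

If the sample looks good, run the same ocrmypdf command on the full document.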
@don.joey answered with the ocrmypdf script. However, it can be installed directly now (from 16.10 onwards).
Then you have to install the tesseract languages you need.
To list which languages are already on your system, run tesseract --list-langs.
If the one you need is missing, install the corresponding tesseract-ocr-<lang> package.
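For illustration, here is a small check along those lines. The language code fra is only an example, and the package name is derived from the usual tesseract-ocr-<lang> naming on Ubuntu, so treat it as an assumption and confirm it with apt search if in doubt.

```shell
# Sketch: check whether a tesseract language is installed.
lang="fra"   # example language code, replace with the one you need
# Ubuntu packages use hyphens where tesseract codes use underscores.
package="tesseract-ocr-$(printf '%s' "$lang" | tr '_' '-')"

if command -v tesseract >/dev/null 2>&1; then
    # Older tesseract versions print the list on stderr, hence 2>&1.
    if tesseract --list-langs 2>&1 | grep -qx "$lang"; then
        echo "Language $lang is already installed"
    else
        echo "Language $lang is missing; try: sudo apt install $package"
    fi
else
    echo "tesseract is not installed" >&2
fi
```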
Now you can produce a searchable PDF (whose quality will vary, depending on the scanned document) by running ocrmypdf on the scanned file.
You can, of course, check its man page for some additional options.
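A minimal example of that last step, assuming a French document and placeholder file names. The -l option selects the tesseract language; several languages can be combined with a plus sign, for example eng+fra.

```shell
# Sketch: OCR one scanned PDF with an explicit language.
input="input.pdf"     # placeholder: your scanned file
output="output.pdf"   # placeholder: the searchable result
lang="fra"            # example language, must be installed for tesseract

if command -v ocrmypdf >/dev/null 2>&1 && [ -f "$input" ]; then
    ocrmypdf -l "$lang" "$input" "$output"
else
    echo "Need ocrmypdf and $input" >&2
fi
```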
pdfsandwich performs exactly this job. I wasn't aware that there is a package provided in the software center, but I'm providing Ubuntu deb packages for it on the project website (see http://www.tobias-elze.de/pdfsandwich/ for details), including the currently most recent version (0.1.2), which is unlikely to be in any software center yet.
If you have a scanned file scanned_file.pdf, simply call pdfsandwich scanned_file.pdf, which generates the file scanned_file_ocr.pdf with the recognized text added to the scanned pages.
Compared to most existing solutions, it autodetects the tesseract version installed and adapts its behavior accordingly. In addition, it performs preprocessing of the scanned images prior to the OCR process, such as de-skewing or removal of dark edges, which can considerably improve optical character recognition.
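As a small illustration of the naming described above, the following sketch runs pdfsandwich on one file and checks that the _ocr.pdf result appeared. The input name is the placeholder used in this answer.

```shell
# Sketch: run pdfsandwich and look for the file it is expected to create.
input="scanned_file.pdf"
expected="${input%.pdf}_ocr.pdf"   # pdfsandwich appends _ocr to the base name

if command -v pdfsandwich >/dev/null 2>&1 && [ -f "$input" ]; then
    pdfsandwich "$input"
    if [ -f "$expected" ]; then
        echo "Created $expected"
    fi
else
    echo "Need pdfsandwich and $input" >&2
fi
```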
DISCLAIMER: I'm the developer of pdfsandwich and therefore heavily biased.
I had this same problem so I wrote this over the weekend. Give it a shot; it works great! It is a simple wrapper around tesseract. It uses pdftoppm to convert a PDF into a bunch of TIFF files, then it uses tesseract to perform OCR (Optical Character Recognition) on them and produce a searchable PDF as output. All intermediate temporary files are automatically deleted when the script completes.
Source code: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
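To make the idea concrete, here is a stripped-down sketch of the same pipeline. It is not the actual script from the repository, just an illustration under my own assumptions: 300 dpi rendering, a temporary directory from mktemp, and mypdf.pdf as a placeholder input.

```shell
# Sketch of the pdftoppm + tesseract pipeline (not the real wrapper).
input="mypdf.pdf"                         # placeholder input
output_base="${input%.pdf}_searchable"    # tesseract appends .pdf itself
workdir="$(mktemp -d)"                    # holds the intermediate TIFFs

if command -v pdftoppm >/dev/null 2>&1 && command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    # Render every page to its own TIFF: pg-1.tif, pg-2.tif, ...
    # (page numbers are zero-padded for longer PDFs, so the glob sorts correctly).
    pdftoppm -tiff -r 300 "$input" "$workdir/pg"
    # tesseract accepts a text file listing one image path per line.
    printf '%s\n' "$workdir"/pg-*.tif > "$workdir/pages.txt"
    # OCR all pages and write a single searchable PDF.
    tesseract "$workdir/pages.txt" "$output_base" pdf
else
    echo "Need pdftoppm, tesseract and $input" >&2
fi

# Remove the intermediate files, as the wrapper does.
rm -rf "$workdir"
```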
Instructions to install & use pdf2searchablepdf:
Tested on Ubuntu 18.04 on 11 Nov 2019.
Install:
Use:
You'll now have a pdf called mypdf_searchable.pdf, which contains searchable text!
Done. The wrapper has no python dependencies, as it's currently written entirely in bash.
References or Related Resources:
- [pdftoppm] Extracting embedded images from a PDF
OS: Ubuntu 18.04
First, install tesseract-ocr with sudo apt install tesseract-ocr.
If you are going to use a language other than English with tesseract, then you will have to install the corresponding language package. For example, for Portuguese you will need tesseract-ocr-por.
Otherwise, tesseract will report an error when asked to use that language.
If you Google "tesseract PDF" you will probably find this somewhat outdated post. However, it gives you some useful hints. You will first have to convert your .pdf file to a .tiff one.
If, as in the outdated post, you forget to add -alpha off, you'll get an error.
Now you can run the final command. In the particular case that your original PDF is in Portuguese, you will need to tell tesseract to use the por language and to write a PDF.
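Put together, the two steps might look like the sketch below. The convert options are the ones commonly used for this conversion with ImageMagick (300 dpi, 8-bit depth, white background, alpha channel off); the input and TIFF names are placeholders of mine, and output is chosen so that tesseract writes output.pdf.

```shell
# Sketch: PDF -> TIFF with ImageMagick, then TIFF -> searchable PDF with tesseract.
input="input.pdf"            # placeholder: your scanned PDF
tiff="${input%.pdf}.tiff"    # intermediate multi-page TIFF
lang="por"                   # Portuguese; needs tesseract-ocr-por installed
outbase="output"             # tesseract appends .pdf to this

if command -v convert >/dev/null 2>&1 && command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    # -alpha off removes the alpha channel from the rendered pages.
    convert -density 300 "$input" -depth 8 -strip -background white -alpha off "$tiff"
    # The trailing "pdf" selects tesseract's PDF output.
    tesseract "$tiff" "$outbase" -l "$lang" pdf
else
    echo "Need ImageMagick (convert), tesseract and $input" >&2
fi
```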
The generated file will be named output.pdf. If, for example, your PDF is in French, then after you install the corresponding tesseract-ocr-fra package you will run the same command with fra in place of por, and the desired file will be, again, output.pdf.
OCRfeeder has a bug in
line 436 should read:
changed this and it worked for me
As of Ubuntu 16.04, OCRmyPDF has become available through apt. Just run sudo apt install ocrmypdf to install it. You can also run ocrmypdf --help to see its usage.
Finally, you can OCR your PDF with the command ocrmypdf input.pdf output.pdf (change input.pdf and output.pdf to the files you want).
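Since the question mentions a number of scanned documents, here is a hedged sketch that applies the same command to every PDF in a folder. The directory names scans and searchable and the helper out_path are made up for this example.

```shell
# Sketch: OCR every PDF in one directory into another (names are hypothetical).
src_dir="scans"        # where the scanned PDFs live
out_dir="searchable"   # where the OCRed copies go

# Map an input path to its output path, keeping the file name.
out_path() {
    printf '%s/%s\n' "$out_dir" "$(basename "$1")"
}

if ! command -v ocrmypdf >/dev/null 2>&1; then
    echo "OCRmyPDF not found; install it with: sudo apt install ocrmypdf" >&2
else
    mkdir -p "$out_dir"
    for pdf in "$src_dir"/*.pdf; do
        [ -f "$pdf" ] || continue   # skip if the glob matched nothing
        ocrmypdf "$pdf" "$(out_path "$pdf")"
    done
fi
```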