I installed gocr with the command suggested by the Ubuntu terminal (sudo apt install gocr) in order to run OCR on the text in a PDF file. How can I use it? I couldn't find a tutorial for this.
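As far as I know, gocr works on bitmap images (the PNM family), not on PDFs directly, so one common approach is to rasterize the PDF pages first with pdftoppm from poppler-utils and then run gocr on each page image. A minimal sketch, assuming poppler-utils is installed and with input.pdf as a placeholder file name:

```shell
# Rasterize each PDF page to a grayscale PGM image at 300 DPI
# (pdftoppm is part of the poppler-utils package).
pdftoppm -r 300 -gray input.pdf page

# Run gocr on each generated page image; gocr prints the
# recognized text to stdout, so collect it into one file.
for img in page-*.pgm; do
    gocr "$img" >> output.txt
done
```

Depending on the pdftoppm version the pages come out as page-1.pgm, page-01.pgm, and so on; the glob covers both.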
When I try to detect text in my JPEG, it correctly shows all the areas where it suspects text and images, but when I export to ODT it only creates an ODT with empty text and image frames.
Do I have to configure tesseract somehow?
(I use Ubuntu 14.10 32bit)
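One way to narrow this down is to check whether tesseract itself recognizes the text, which would point the finger at OCRFeeder's ODT export rather than the engine. A small sketch, with scan.jpg as a placeholder file name:

```shell
# Run tesseract directly on the scan; it writes the recognized
# text to scan.txt (the .txt extension is appended automatically).
tesseract scan.jpg scan

# If this file contains your text, the engine is fine and the
# problem lies in OCRFeeder's export step.
cat scan.txt
```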
Tesseract now creates an .hocr file rather than an .html file for its OCR output, but that is not exactly what is at issue here. Since the upgrade, when hocr2pdf uses this output it renders a large text size into small bounding boxes. Most of the text doesn't even appear in the resulting PDF, and what little does appear is unreadable and unselectable.
I'm using a script that goes through each .tif file in the directory and runs OCR on each one, with a for loop like this:
for page in "$dir"/*page*.tif
do
    base="${page%.tif}"
    tesseract "$page" "$base" -l eng hocr
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
done
I also tried specifying the resolution with a -r 400 switch to hocr2pdf, but that made no difference. I can only assume that the current version of tesseract is not producing output that hocr2pdf can work with.
Tesseract is my only OCR option because it handles Icelandic and Old Norse very well, so moving to another OCR tool is probably not an option.
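If I remember correctly, ExactImage's hocr2pdf has a sloppy-text option (-s) that places whole words instead of individual glyphs, which can sometimes work around bad per-glyph bounding boxes; treat the flag name as an assumption and check hocr2pdf --help. A variation of the loop above:

```shell
# Same loop as before, but ask hocr2pdf to place whole words
# (-s, sloppy text) and state the scan resolution explicitly (-r).
for page in "$dir"/*page*.tif
do
    base="${page%.tif}"
    tesseract "$page" "$base" -l eng hocr
    hocr2pdf -s -r 400 -i "$page" -o "$base.pdf" < "$base.hocr"
done
```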
I'm using the OCR utility of OCRFeeder, which uses the tesseract engine. I have installed the language packs needed for tesseract. How can I set the language so that tesseract uses the right language file when converting the scanned document to text?
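Assuming OCRFeeder passes its engine arguments straight through to the tesseract command line, the language is selected with tesseract's -l flag plus a matching traineddata package; the nld (Dutch) code below is just an example. A sketch of the command-line side:

```shell
# List the language files tesseract can currently find.
tesseract --list-langs

# Install an additional language pack; on Ubuntu the package
# names follow the pattern tesseract-ocr-<lang>.
sudo apt-get install tesseract-ocr-nld

# Use that language explicitly when recognizing a scan.
tesseract scan.png scan -l nld
```

If that works from the terminal, the remaining step is to add the same -l argument to the tesseract engine entry in OCRFeeder's engine settings.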
I'd like to scan a good number of papers I have lying around, with the least possible hassle. I would like to convert them to images using Simple Scan, then convert those to text using OCR. Is there a good OCR app with a GUI that will give me good results at the push of a button?