user299889's questions -ubuntu

user299889

Asked: 2014-07-03 11:23:49 +0800 CST

How do I prevent hocr2pdf from using a large font with a tesseract-generated .hocr file?

Tesseract now creates an .hocr file rather than an .html file for OCR output, but that is not the issue here. Since the upgrade, hocr2pdf renders this output with a large text size inside small bounding boxes. Most of the text doesn't appear in the resulting PDF at all, and the little that does appear is unreadable and unselectable.

I'm using a script that goes through each .tif file in the directory and runs OCR on each one, with a for loop like this:

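for page in "$dir"/*page*.tif
do
    # strip the .tif extension to get the output base name
    base="${page%.tif}"
    # the trailing "hocr" config makes tesseract write $base.hocr
    tesseract "$page" "$base" -l eng hocr
    # overlay the hOCR text (read from stdin) onto the page image
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
done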

I also tried specifying the resolution by passing -r 400 to hocr2pdf, but this changed nothing. I can only assume that the current version of tesseract is not producing output that hocr2pdf can work with.
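One way to test that assumption would be to look at the page bounding box that tesseract records in the .hocr file and compare it with the pixel size of the .tif. The sketch below is only a diagnostic aid, not a fix. The function names page_bbox and page_size are made up for illustration, and it assumes the ocr_page element sits on a single line with a title attribute of the form "bbox x0 y0 x1 y1".

```shell
#!/bin/sh
# Hypothetical helpers for inspecting an hOCR file; not part of tesseract or hocr2pdf.

# Print the page-level bounding box as "x0 y0 x1 y1".
# Assumption: the ocr_page element is on one line and its title contains "bbox x0 y0 x1 y1".
page_bbox() {
    grep 'ocr_page' "$1" | grep -o 'bbox [0-9 ]*' | head -n 1 | sed 's/^bbox //'
}

# Print the page width and height in pixels, derived from the bounding box.
page_size() {
    page_bbox "$1" | awk '{ print $3 - $1, $4 - $2 }'
}
```

Running page_size "$base.hocr" inside the loop would print the page dimensions that tesseract recorded. If ImageMagick is installed, identify -format '%w %h\n' "$page" prints the image's pixel dimensions, so the two can be compared page by page.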

Tesseract is my only OCR option because it handles Icelandic and Old Norse very well, so moving to another OCR tool is probably not an option.
