To Do

Asked: 2013-03-23 07:50:26 +0800 CST

How do I produce a multi-page sandwich pdf with hocr2pdf?


Starting from a multi-page TIFF, I used tesseract to produce the hOCR (HTML) output that hocr2pdf takes as input.

I then tried using hocr2pdf to produce a "sandwich PDF" (image + hidden text layer).

However, hocr2pdf produces a single-page PDF with all the pages superimposed.

Is there a way to solve this problem or an alternative solution?

1 Answer


Best Answer

To Do
2013-03-28T13:04:35+08:00
I found a workaround. hocr2pdf has trouble producing multi-page PDFs, so I produced single-page TIFFs, ran tesseract and then hocr2pdf on each page, and combined the resulting PDFs with pdftk, using the following script:

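    for f in ./*.tif; do
        # OCR one page (French); this script expects the hOCR output in "$f.html"
        tesseract "$f" "$f" -l fra hocr
        # Build a one-page sandwich PDF from the image and its hOCR text
        hocr2pdf -i "$f" -s -o "$f.pdf" < "$f.html"
    done
    # Concatenate the per-page PDFs, then remove the intermediate files
    pdftk *.tif.pdf cat output "output.pdf" && rm *.tif.pdf && rm *.tif.html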
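The script above assumes the single-page TIFFs already exist. A minimal sketch of that splitting step might look like the following, assuming tiffsplit from the libtiff tools is installed. The function name split_tiff and the file names used with it are hypothetical and not part of the original answer.

```shell
# Split a multi-page TIFF into single-page TIFFs.
# Assumption: tiffsplit (from the libtiff tools) is installed.
split_tiff() {
    input=$1    # multi-page TIFF, e.g. scan.tif (hypothetical name)
    prefix=$2   # output prefix, e.g. page_ (hypothetical name)

    if ! command -v tiffsplit >/dev/null 2>&1; then
        echo "tiffsplit not found; install the libtiff tools" >&2
        return 1
    fi
    if [ ! -f "$input" ]; then
        echo "no such file: $input" >&2
        return 1
    fi

    # tiffsplit writes one file per page: <prefix>aaa.tif, <prefix>aab.tif, ...
    # The alphabetical suffixes keep the pages in order for a *.tif glob.
    tiffsplit "$input" "$prefix"
}
```

For example, running split_tiff ../scan.tif page_ from an empty working directory leaves only the per-page files there, so the ./*.tif loop does not also pick up the multi-page original.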
