How can I extract text from images?
I am not talking about scanned files, but garden-variety images, such as a high-def picture of a nicely handwritten blackboard taken in class, or a photographed page from a recipe book whose recipe you want in text format.
Any free and open software for that?
I tried tesseract, and the results were awful.
The act of extracting text from images is called OCR, and Ubuntu has a wiki page dedicated to OCR. From that page:

Available OCR tools
The Ubuntu Universe repositories contain the following OCR tools:
The Ubuntu Multiverse repositories also contain:
Some packages are outdated, but newer unofficial builds can be found in the Alex_P PPA (PPA adding code: ppa:alex-p/notesalexp). If you have never used a PPA, check how to add software from a PPA.
Edit: As pointed out in a comment, Clara OCR exists too, but it got stuck at Hardy and its website shows 2009 as the last update.

tesseract-ocr is the best one compared to all the others. To install it, run:
sudo apt install tesseract-ocr
Usage is
tesseract filename.jpg output
which generates an output.txt file (tesseract appends .txt to the output name itself).
You might consider selecting the appropriate language. In that case, you will need to install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. The 18.04 repository currently offers 123 languages. Then use, for example:
tesseract filename.jpg output -l LANG

Using tesseract-ocr we can extract text from images. I also tested gocr, which did not work as well as tesseract-ocr.
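Since results depend on having the right language data, it can help to check what is installed before running tesseract. Here is a minimal sketch, assuming the tesseract-ocr-LANG package naming mentioned above and assuming that `tesseract --list-langs` prints a header line followed by one language code per line. The helper names (`parse_langs`, `package_for`, `installed_langs`) and the example code `deu` are made up for illustration.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Return the language codes found in `tesseract --list-langs` output.

    Assumption: a header line starting with "List of available languages"
    is followed by one code per line.
    """
    codes = []
    for line in listing.splitlines():
        line = line.strip()
        if line and not line.startswith("List of available languages"):
            codes.append(line)
    return codes


def package_for(lang):
    """Ubuntu package name for a language code (tesseract-ocr-LANG)."""
    return f"tesseract-ocr-{lang}"


def installed_langs():
    """Ask the tesseract CLI which languages it can use."""
    if shutil.which("tesseract") is None:
        return []  # tesseract itself is not installed
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Read both streams in case a release prints the list on stderr.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    wanted = "deu"  # example code only; use the language you need
    if wanted in installed_langs():
        print(f"{wanted} is installed")
    else:
        print(f"{wanted} is missing; try: sudo apt install {package_for(wanted)}")
```

If the language is missing, the script only suggests the package to install; it does not install anything itself.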
Installation:
A Python program can convert every image file with a png extension in the current directory to a txt file.
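The original script is not reproduced here, so what follows is only a sketch of how such a batch conversion might look, not the author's program. It assumes the `tesseract` command-line tool is installed and calls it through `subprocess`; the function names `build_command` and `convert_all` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image, lang=None):
    """Build the tesseract call for one image.

    tesseract appends ".txt" to the output base name, so photo.png is
    written to photo.txt next to the image.
    """
    image = Path(image)
    cmd = ["tesseract", str(image), str(image.with_suffix(""))]
    if lang:
        cmd += ["-l", lang]  # optional language code, e.g. "eng"
    return cmd


def convert_all(directory=".", lang=None):
    """Run tesseract on every *.png in directory; return the txt paths."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    outputs = []
    for image in sorted(Path(directory).glob("*.png")):
        subprocess.run(build_command(image, lang), check=True)
        outputs.append(image.with_suffix(".txt"))
    return outputs


if __name__ == "__main__":
    for txt in convert_all():
        print(f"wrote {txt}")
```

Run it from the directory that holds the images. Keeping the command construction in its own function makes it easy to inspect or adjust the exact tesseract call before processing a whole folder.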