Question

Strapakowsky

Asked: 2011-09-01 00:33:41 +0800 CST

How can I extract text from images?

772

I am not talking about scanned files, but garden-variety images, such as a high-definition photo of a neatly handwritten blackboard taken in class, or a photographed page from a recipe book that you want in text format.

Any free and open software for that?

I tried tesseract, and the results were awful.

3 Answers

Rinzwind · Answer 1 · 2011-09-01T00:55:52+08:00

Best Answer

Extracting text from images is called optical character recognition (OCR), and Ubuntu has a wiki page dedicated to it. From that page:

Available OCR tools

The Ubuntu Universe repositories contain the following OCR tools:

gocr - A command line OCR
fuzzyocr - spamassassin plugin to check image attachments
libhocr0 - Hebrew OCR
ocrad - Optical Character Recognition program
ocrfeeder - Document layout analysis and optical character recognition system
ocropus - document analysis and OCR system
tesseract-ocr

The Ubuntu Multiverse repositories also contain:

cuneiform - multi-language OCR system
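
Before comparing these tools, it can help to see which of them are already installed. The following is a minimal sketch, not part of the wiki page: the executable names in `CANDIDATE_TOOLS` are assumptions based on the package names above and may differ between releases, and `find_ocr_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names for some of the packages listed above;
# adjust them if your release ships different binaries.
CANDIDATE_TOOLS = ["tesseract", "gocr", "ocrad", "cuneiform"]


def find_ocr_tools(names=CANDIDATE_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_ocr_tools().items():
        print(f"{name}: {path or 'not installed'}")
```

Any tool reported as not installed can then be added with apt-get from the repository named above.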

Some packages are outdated, but unofficial newer builds can be found in the Alex_P PPA (to add it, use ppa:alex-p/notesalexp). If you have never used a PPA, check how to add software from a PPA.

Edit: as pointed out in a comment, Clara OCR exists too, but its package got stuck at Hardy and its website was last updated in 2009.

38

Sudhir Belagali · Answer 2 · 2016-04-18T19:44:27+08:00

tesseract-ocr is the best of these tools. To install it, run the command below:

sudo apt-get install tesseract-ocr

Usage is tesseract filename.jpg output. Tesseract appends .txt to the output name itself, so this generates the file output.txt.

You may also want to select the appropriate language. In that case, install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. The Ubuntu 18.04 repositories offer 123 languages. Then run, for example:

tesseract mySpanishText.jpg output -l spa
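
The same call can be scripted. Below is a minimal sketch, assuming tesseract is installed and on the PATH; the helper names `build_tesseract_command` and `ocr_image` are hypothetical and not part of tesseract. It builds the command as an argument list and names the output after the image, relying on tesseract adding the .txt extension to the output base.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, lang=None):
    """Return the argument list for running tesseract on one image.

    The output base is the image path without its extension, because
    tesseract adds ".txt" to the output base on its own.
    """
    image = Path(image)
    output_base = image.with_suffix("")
    cmd = ["tesseract", str(image), str(output_base)]
    if lang:
        # -l selects the language, e.g. "spa" for Spanish.
        cmd += ["-l", lang]
    return cmd


def ocr_image(image, lang=None):
    """Run tesseract and return the path of the text file it should write."""
    subprocess.run(build_tesseract_command(image, lang), check=True)
    return Path(image).with_suffix(".txt")
```

For example, `ocr_image("mySpanishText.jpg", lang="spa")` runs the command shown above with mySpanishText as the output name, provided the tesseract-ocr-spa package is installed.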

34

devp · Answer 3 · 2022-03-29T19:26:11+08:00

We can extract text from images using tesseract-ocr. I also tested gocr, which did not work as well as tesseract-ocr.

Installation:

sudo apt-get install tesseract-ocr

A Python program that converts every image file with a .png extension in the current directory to a text file:

#!/usr/bin/env python3
import os
import subprocess

def list_files(path):
    """Return the paths of all regular files directly inside path."""
    files = []
    for name in os.listdir(path):
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path):
            files.append(full_path)
    return files

def convertImageToText(img_file):
    # tesseract appends ".txt" to the output base, so drop the image extension.
    output_base = os.path.splitext(img_file)[0]
    # Passing an argument list (no shell) keeps file names with spaces intact.
    subprocess.run(["tesseract", img_file, output_base])

def startOperation():
    list_file = list_files(".")
    print(list_file)
    for img_file in list_file:
        # Compare the extension case-insensitively (.png, .PNG, ...).
        if os.path.splitext(img_file)[1].lower() == ".png":
            convertImageToText(img_file)

if __name__ == "__main__":
    startOperation()

0
