What are the Ubuntu options to create CSV tables from an image?
Ideally, the options would be simple and quick. The image is from a Yield Curve article.
I wrote a small Python script that can do what you want.

For the OCR part you need `tesseract`. On Ubuntu you can install it by running `sudo apt install tesseract-ocr`.

Then run `tesseract` on your image to create a text file with the data read from the image, for example `tesseract image.png tesseract_output`, where `image.png` stands for your image file. I am naming the output file `tesseract_output` (`tesseract` will add the `.txt` extension); you can name it as you wish.

Then copy and paste the script and save it to a file ending with `.py` (for example `script.py`).

For the script to work, you have to enter the following in the USER INPUT section:

- `input_file`: the complete path to the `tesseract` output.
- `output_file`: the complete path to the final CSV file.
- `rows`: the number of table rows. In your example image it's 10.
- `delimiter`: the delimiter to be used in the CSV. Here I use `;`. You can use any 1-character string you need.

Then run the file, for example with `python3 script.py`.
You should now have a CSV file at the path you set in `output_file`.
CAUTION

The result is satisfactory, but it depends on `tesseract`'s output. It is almost certain that `tesseract` won't detect everything correctly, as you can see by inspecting the CSV output. You will have to compare the results with the original image and fix them manually, either in the `tesseract` output or in the CSV output at the end.

Also, in the script I am taking care of trailing whitespace and redundant newlines that `tesseract` spits out, which works fine for your example image. However, if a table cell were empty, it would be removed completely, effectively destroying the whole table structure. In that case, if I were you, I would edit the `tesseract_output.txt` file and manually change the empty cells so they contain a `-`, so they wouldn't get deleted.
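
For reference, here is a minimal sketch of a script along these lines. It rests on an assumption that may not hold for every image: that `tesseract` writes the table one column at a time, so each block of `rows` non-empty lines is one column. The function names `ocr_text_to_table` and `convert` are hypothetical, and the sample text in the demo is made up.

```python
import csv
from pathlib import Path


def ocr_text_to_table(text, rows):
    """Turn column-by-column OCR text into a list of table rows.

    Assumption: every block of `rows` non-empty lines is one column.
    """
    # Strip surrounding whitespace and drop blank lines.
    cells = [line.strip() for line in text.splitlines() if line.strip()]
    if len(cells) % rows != 0:
        # Typical cause: an empty or undetected cell shifted the layout.
        raise ValueError(
            f"{len(cells)} cells cannot be split into columns of {rows}; "
            "check the OCR output for missing or empty cells"
        )
    # Cut the flat cell list into columns, then transpose into rows.
    columns = [cells[i:i + rows] for i in range(0, len(cells), rows)]
    return [list(row) for row in zip(*columns)]


def convert(input_file, output_file, rows, delimiter=";"):
    """Read the tesseract text file and write the table as CSV."""
    text = Path(input_file).read_text(encoding="utf-8")
    table = ocr_text_to_table(text, rows)
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(table)


if __name__ == "__main__":
    # Made-up sample: a 3-row table that was read as two columns.
    sample = "Name\nx\ny\n\nValue \n1\n2\n"
    for row in ocr_text_to_table(sample, rows=3):
        print(";".join(row))
```

With the inputs listed above, the call would look like `convert("tesseract_output.txt", "table.csv", rows=10)`, where both paths are placeholders. The length check only catches the empty-cell problem when the cell count stops being a multiple of `rows`, so comparing the CSV with the original image is still necessary.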