

Commercial OCR such as ABBYY and others will give you 99%+ accuracy out of the box and will detect tables automatically. I don't know whether open source can ever get close to 100% accuracy on those images, but judging by the answers here it probably can, if you spend some time on training and solve the table analysis problem and things like that.

The simple answer is YES, you just have to choose the right tools. If you want to see what Tesseract can do, have a look at the write-up I found by asking Google "OCR to table". The author published a full algorithm using Python and Tesseract, both open source solutions! The relevant part:

"Optical Character Recognition is pretty amazing stuff, but it isn't always perfect. To get the best possible results, it helps to use the cleanest input you can. Performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like '^', on each cell boundary: something the OCR would still recognize and that I could use later to split the resulting strings."
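To make the '^' marker idea more concrete, here is a minimal sketch of the last step only: splitting text that has already come back from the OCR engine into table cells. It is not the author's code. The function names (`split_ocr_line`, `ocr_text_to_rows`, `rows_to_csv`) and the sample text are made up for illustration, and it assumes the marker was also drawn on the outer table border, so every table line starts and ends with '^'.

```python
import csv
import io

DELIMITER = "^"  # assumed marker drawn on each cell boundary before OCR


def split_ocr_line(line, delimiter=DELIMITER):
    """Split one OCR'd text line into cell strings at the boundary marker."""
    cells = [cell.strip() for cell in line.split(delimiter)]
    # A marker on the outer table border produces an empty cell at each end;
    # drop those, but keep empty cells in the middle (they are real blanks).
    if cells and cells[0] == "":
        cells = cells[1:]
    if cells and cells[-1] == "":
        cells = cells[:-1]
    return cells


def ocr_text_to_rows(text, delimiter=DELIMITER):
    """Turn the full OCR output into a list of rows, one list of cells each."""
    rows = []
    for line in text.splitlines():
        if delimiter not in line:
            continue  # blank lines and text outside the table have no marker
        rows.append(split_ocr_line(line, delimiter))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text so the table structure is preserved."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical OCR output: whitespace is already collapsed to single spaces,
# but the '^' markers survive and still show where the columns were.
sample = (
    "Quarterly report\n"
    "^ Item name ^ Unit price ^ In stock ^\n"
    "^ Blue widget ^ 4.50 ^ 12 ^\n"
    "\n"
    "^ Red gadget ^ 19.99 ^ 3 ^\n"
)

rows = ocr_text_to_rows(sample)
print(rows_to_csv(rows))
```

Multi-word cells such as "Blue widget" stay together because the split happens on the marker and not on whitespace, which is exactly the information that gets lost when the OCR collapses the gaps between columns. In practice you would pick a marker character that never appears in your actual data.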
