Image to Excel online, API to extract tables from PDF

The cost of OCR error correction

The aim of ExtractTable is to extract the text content trapped in images or PDFs. Every effort to bring that data to life depends strongly on the input quality. On inputs that are not computer generated, such as camera images or scanned copies, which are common in production runs, the output is prone to errors. Working with wrong data is more harmful than working with no data, so it is essential to correct the captured data.

Imagine you have to extract tabular data with 40 rows and 5 columns from a financial statement. The easiest way to validate the extracted amounts is to tally each column's total against the column's summary value, if one is available. If the tally fails, you have to compare the input and the output for every row of the column to find the incorrectly recognized cell. The EXTRA plan provides each table cell's character accuracy and its location in the input, so instead of checking every cell of the table, you can go straight to the suspect cell and correct it. The ultimate goal is to reduce the effort required from your Data Entry/Quality Assurance team.
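The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cell dictionaries, the `char_conf` field, and the 0.95 threshold are hypothetical and do not reflect the actual ExtractTable response format.

```python
from decimal import Decimal


def column_tallies(values, stated_total):
    """Return True when the extracted values sum to the stated total."""
    return sum(values, Decimal("0")) == stated_total


def cells_to_review(cells, threshold=0.95):
    """Return cells whose weakest character falls below the threshold,
    least confident first. The threshold is an arbitrary assumption."""
    flagged = [c for c in cells if min(c["char_conf"]) < threshold]
    return sorted(flagged, key=lambda c: min(c["char_conf"]))


# Hypothetical extracted column: text plus one confidence value per character.
cells = [
    {"row": 0, "text": "120.50", "char_conf": [0.99, 0.99, 0.98, 0.99, 0.99, 0.99]},
    {"row": 1, "text": "310.00", "char_conf": [0.99, 0.62, 0.99, 0.99, 0.98, 0.99]},
    {"row": 2, "text": "75.25", "char_conf": [0.99, 0.99, 0.99, 0.97, 0.99]},
]
stated_total = Decimal("535.75")  # summary value printed on the statement

values = [Decimal(c["text"]) for c in cells]
if not column_tallies(values, stated_total):
    # Only the low-confidence cells need a manual look, not all rows.
    for cell in cells_to_review(cells):
        print(f"Review row {cell['row']}: {cell['text']!r}")
```

Amounts are parsed as `Decimal` rather than `float` so that the tally is an exact comparison and does not fail because of binary rounding.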

Character accuracy

While every user wants 100% accurate data, that is much harder to achieve on low-quality images. Without accuracy details for the recognized characters, the data-correction team has to spend more time locating errors. The cost of error correction is more than you imagine; in fact, it is more than the 2 cents you pay us for a full-page extraction.

Calculator to estimate the cost of OCR corrections

The calculator takes the following inputs:

Usual table format: Bordered or Borderless
Usual page content: Single Table or Multiple Tables
Usual data correction method: With Character Accuracy details or Without Character Accuracy
Usual table dimensions: Rows x Columns
Salary of your Data Entry Operator in USD (converted to $/minute)

For a given character detection accuracy (%), it estimates the seconds required to make corrections and the resulting cost in $/page.
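As a rough illustration of how such an estimate could be computed, here is a minimal sketch. It is not the formula behind the calculator on this page: the timing constants, the average characters per cell, and the assumption that accuracy details let the operator skip correct cells are all hypothetical placeholders.

```python
def correction_cost_per_page(rows, cols, char_accuracy, hourly_wage_usd,
                             has_accuracy_details=True,
                             chars_per_cell=6,       # assumed average cell length
                             seconds_per_check=3.0,  # assumed time to verify one cell
                             seconds_per_fix=10.0):  # assumed time to retype one cell
    """Estimate (seconds, USD) needed to correct one page of extracted table data."""
    cells = rows * cols
    # Probability that a cell contains at least one wrong character,
    # treating character errors as independent.
    p_cell_error = 1.0 - char_accuracy ** chars_per_cell
    error_cells = cells * p_cell_error

    if has_accuracy_details:
        # Operator jumps straight to the flagged cells.
        seconds = error_cells * (seconds_per_check + seconds_per_fix)
    else:
        # Operator has to verify every cell, then fix the wrong ones.
        seconds = cells * seconds_per_check + error_cells * seconds_per_fix

    cost = seconds / 3600.0 * hourly_wage_usd
    return seconds, cost


# Example: a 40 x 5 table at 98% character accuracy and a $12/hour wage.
with_details = correction_cost_per_page(40, 5, 0.98, 12.0, has_accuracy_details=True)
without_details = correction_cost_per_page(40, 5, 0.98, 12.0, has_accuracy_details=False)
print(f"with details: {with_details[0]:.0f} s, ${with_details[1]:.2f}/page")
print(f"without details: {without_details[0]:.0f} s, ${without_details[1]:.2f}/page")
```

Under this model the two methods differ mostly by the time spent verifying cells that were already correct, which is the cost that character accuracy details are meant to remove.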

Layout Detection

Plenty of services offer OCR, yet only a few dozen of them preserve the layout. Services that have optimized their search ranking for "image to excel" often put each text line into a row (without column separation) in the Excel file or, much worse, just insert the image into the sheet. The high-priced data capture solutions (3x the price of ExtractTable) only partially preserve the layout, and they come with strict input restrictions, such as a high-quality, non-skewed image with tables always in the same location. Input files in a production environment, however, come in every possible variety, and the traditional services cannot handle them.

ExtractTable offers Image to Excel and PDF to Excel conversions without the user having to worry about skew, non-templated bordered or borderless table layouts, or whether a page holds one table or several. We preprocess the input to give better results than other services.

Data Cleaning

As much as ExtractTable tries to extract the best table structure for the image to Excel output, outlying cases such as tables with tightly packed cells or low-quality images may sometimes result in merged rows or columns, or in date formatting or decimal separator issues. Such scenarios cannot be neglected. With those in mind, we have released a built-in functionality, "MakeCorrections", in the ExtractTable Python Library to ease corrections on the output. The functionality helps to

✔ Split Merged Rows
✔ Split Merged Columns
✔ Fix Decimal Format
✔ Fix Date Format
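
To make these corrections concrete, here is a minimal standard-library sketch of three of them. It is not the "MakeCorrections" implementation or its API: the function names, separators, and date formats below are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def fix_decimal(text, decimal_sep=",", thousands_sep="."):
    """Normalize a number such as '1.234,56' to '1234.56'."""
    return text.replace(thousands_sep, "").replace(decimal_sep, ".")


def fix_date(text, in_fmt="%d/%m/%Y", out_fmt="%Y-%m-%d"):
    """Re-render a date string from one known format in another."""
    return datetime.strptime(text, in_fmt).strftime(out_fmt)


def split_merged_row(row, sep="\n"):
    """Split one row whose cells each hold several sep-joined values
    into separate rows, padding short cells with empty strings."""
    parts = [cell.split(sep) for cell in row]
    height = max(len(p) for p in parts)
    return [[p[i] if i < len(p) else "" for p in parts] for i in range(height)]


# Two table rows that were recognized as a single merged row.
merged = ["Rent\nPower", "31/12/2020\n15/01/2021", "1.234,56\n89,10"]
for name, date, amount in split_merged_row(merged):
    print(name, fix_date(date), fix_decimal(amount))
```

Splitting a merged column follows the same idea as `split_merged_row`, applied to each cell along the other axis.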

If you haven't explored the service yet, sign up to receive promotional credits, and let us know how we did.