PDF to Excel Table Reader: OCR and Machine Learning Essay

The main difficulty with the existing manual processing of documents is its time and cost overhead. Extracting information from the tables in documents and processing it by hand is a very difficult task. Generally, employees have to manually enter the data from tables found in many kinds of documents, such as payroll records, resumes, invoices, bill receipts and legal documents, which is tedious work. This project aims to help such employees by automating the identification and extraction of tables from documents.

The detection of tables in unrestricted documents is a challenging problem for two reasons:

First, tables are so diverse that they hardly have any easy-to-identify characteristic in common.

Second, no single cue is sufficient: in some cases we should favour the analysis of groups of graphic lines, while in others we have to pay attention to relatively regular configurations of text elements.
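The second cue, regular configurations of text elements, can be illustrated with a small heuristic. The following is only a minimal sketch, not the project's actual detector: it assumes that OCR has already produced words as (text, x, y) tuples giving the left edge and vertical position of each word, and the names group_rows and aligned_columns as well as the tolerance values are hypothetical. It groups words into rows and reports the x positions at which left edges recur across several rows, which hints at table columns even when no lines are drawn.

```python
def group_rows(words, y_tol=5):
    """Group (text, x, y) words into rows of similar vertical position."""
    rows = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        # Start a new row when the word is too far below the current row.
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["words"].append((text, x))
        else:
            rows.append({"y": y, "words": [(text, x)]})
    return rows


def aligned_columns(rows, x_tol=8, min_rows=3):
    """Return x positions where a left edge recurs in at least min_rows rows."""
    anchors = []  # each entry: [reference_x, set of row indices seen there]
    for index, row in enumerate(rows):
        for _, x in row["words"]:
            for anchor in anchors:
                if abs(anchor[0] - x) <= x_tol:
                    anchor[1].add(index)
                    break
            else:
                anchors.append([x, {index}])
    return sorted(a[0] for a in anchors if len(a[1]) >= min_rows)


# Hypothetical OCR output: three table rows followed by one line of plain text.
sample_words = [
    ("Name", 10, 100), ("Qty", 100, 100), ("Price", 200, 101),
    ("Pen", 11, 120), ("2", 102, 120), ("1.50", 199, 121),
    ("Ink", 10, 140), ("1", 101, 140), ("4.00", 201, 140),
    ("Thank", 10, 300), ("you", 55, 300), ("all", 130, 300),
]
column_positions = aligned_columns(group_rows(sample_words))
```

A real page would need more than this, for example handling of centred or right-aligned columns, but the sketch shows why repeated alignment is a usable signal for borderless tables.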

In this project, an OCR component will be developed and applied to the documents provided. The tables will then be extracted using machine learning techniques. A table may be bordered, borderless, or may have some of its column lines missing.
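The final extraction step can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the OCR output is assumed to be a list of (text, x, y) word tuples, the column start positions are assumed to be already known from the detection stage, and the function names words_to_grid and grid_to_csv are hypothetical. CSV is used here instead of a native Excel file only so that the sketch needs nothing beyond the Python standard library; spreadsheet programs can open the result.

```python
import csv
import io
from bisect import bisect_right


def words_to_grid(words, column_starts, y_tol=5, x_tol=8):
    """Place (text, x, y) words into a grid of cell strings.

    column_starts is a sorted list of x positions where each column begins.
    """
    rows = []  # each entry: (reference_y, one list of (x, text) per column)
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if not rows or abs(rows[-1][0] - y) > y_tol:
            rows.append((y, [[] for _ in column_starts]))
        # Pick the last column that starts at or before x, allowing OCR jitter.
        col = max(bisect_right(column_starts, x + x_tol) - 1, 0)
        rows[-1][1][col].append((x, text))
    # Join the words of each cell in left-to-right order.
    return [
        [" ".join(text for _, text in sorted(cell)) for cell in cells]
        for _, cells in rows
    ]


def grid_to_csv(grid):
    """Serialize the grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


# Hypothetical OCR output for a two-row, three-column table.
sample_words = [
    ("Item", 10, 100), ("Qty", 100, 100), ("Price", 200, 101),
    ("Blue", 11, 121), ("pen", 40, 120), ("2", 102, 120), ("1.50", 199, 121),
]
csv_text = grid_to_csv(words_to_grid(sample_words, [10, 100, 200]))
```

The same grid could be handed to a spreadsheet library to produce an .xlsx file; that choice is left open here.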

Characterizing the structure of tables across a wide-ranging class of documents requires a rich and flexible representation framework. Both the physical layout and the logical structure of a table must be described. While information about the original layout may help visualization, the logical structure enables tables to be presented through different media and their content to be processed automatically.
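One possible way to hold both descriptions side by side is sketched below. It is an assumption made for illustration, not a framework prescribed by this essay: the class names Cell and Table, the field names, and the (x0, y0, x1, y1) bounding-box convention are all hypothetical. Each cell carries its logical position (row, column, spans, header flag) together with an optional physical bounding box, so the logical grid can be produced without consulting the layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Cell:
    text: str
    row: int            # logical row index
    col: int            # logical column index
    row_span: int = 1   # number of logical rows covered
    col_span: int = 1   # number of logical columns covered
    is_header: bool = False
    # Physical layout on the page as (x0, y0, x1, y1); None if unknown.
    bbox: Optional[Tuple[float, float, float, float]] = None


@dataclass
class Table:
    cells: List[Cell] = field(default_factory=list)

    def shape(self):
        """Return (rows, columns) implied by the cells and their spans."""
        n_rows = max((c.row + c.row_span for c in self.cells), default=0)
        n_cols = max((c.col + c.col_span for c in self.cells), default=0)
        return n_rows, n_cols

    def to_grid(self):
        """Expand the logical structure into a rectangular grid of strings.

        A spanning cell repeats its text in every position it covers.
        """
        n_rows, n_cols = self.shape()
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in self.cells:
            for r in range(c.row, c.row + c.row_span):
                for k in range(c.col, c.col + c.col_span):
                    grid[r][k] = c.text
        return grid


# A header spanning two columns, with its page position recorded.
example = Table([
    Cell("Totals", 0, 0, col_span=2, is_header=True, bbox=(10, 100, 210, 118)),
    Cell("Net", 1, 0),
    Cell("Gross", 1, 1),
])
```

Keeping the bounding box optional reflects the idea that the logical structure should remain usable even when the original layout is not available.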

The electronic exchange of complex documents often relies on formats such as PDF and PostScript, or on images of the documents. While these portable formats convey rich visual information to the user, they do not encourage the efficient reuse, organization and distribution of content, because the visual structure of a document is not explicitly represented in the corresponding file. Reconstructing this structure (in terms of components and relationships that are meaningful to the reader) from the visual appearance of a document is a prerequisite for a more flexible exploitation of its content.

