PDF extraction – Other Docs

The ODE (OtherDocExtraction) service extracts the content of interest from PDF documents other than the business documents processed by the DDE service, i.e. documents that are heterogeneous in nature, type and layout.

The following are the main features of this service:
1. The ODE service is fully automated. It covers acquisition of the original documents, transformation from PDF to an internal structured format, conversion to the required custom or standard format, and delivery of the output to the customer. File exchange takes place via SFTP.

2. Unlike DDE, the ODE service is not specialised for a fixed set of document types. Instead, the document classes needed are defined for each project, unless they were already created in previous projects.

3. The ODE service functions are adapted project by project to the specific document class so as to best satisfy the customer's requirements. Like DDE, ODE uses a template-based approach, which ensures maximum extraction reliability, and, for image PDFs (very frequent in these projects), high-performance OCR components (typically Google).

4. The ODE service runs in the cloud, outside the company's legacy system, which can nevertheless import the resulting output in a custom format. Extraction performance can be tuned and improved by running one or more extractor modules in parallel. The only interactive module is Verifier, which allows possible extraction errors to be fixed.
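The idea of running extractor modules in parallel can be illustrated with a minimal Python sketch. It is not the ODE implementation: `extract` and `run_extraction` are hypothetical names, and the body of `extract` is a placeholder for whatever a real extractor module does.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(document: str) -> dict:
    """Hypothetical stand-in for one extractor module working on one file."""
    # A real extractor would apply a template and, for image PDFs, OCR.
    return {"source": document, "fields": {"name_length": len(document)}}


def run_extraction(documents: list, extractors: int = 1) -> list:
    """Process documents with `extractors` workers running in parallel."""
    with ThreadPoolExecutor(max_workers=extractors) as pool:
        # pool.map returns results in the same order as the input documents.
        return list(pool.map(extract, documents))


if __name__ == "__main__":
    for result in run_extraction(["a.pdf", "b.pdf", "c.pdf"], extractors=2):
        print(result)
```

Raising `extractors` is the tuning knob in this sketch; whether threads, processes or separate machines are appropriate depends on the actual extractor, which is not described here.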

Key factors

Complete and reliable extraction!

The ODE service draws on the experience gained with the DDE service in extracting content from both vector PDFs, using proprietary algorithms, and image PDFs, using the best available OCR engines (ABBYY until last year, now Google).

In particular, the ODE service extends the ability to enrich templates with specific software functions that can replace values or derive data not reported in the document. ODE results are highly appreciated, in terms of both quality and performance, even in very critical application scenarios such as banks and insurance companies, which require top data quality and accuracy.
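As an illustration of what enriching a template with functions could look like, here is a minimal Python sketch. The structure is an assumption, not the ODE template format: `apply_template`, `normalise_amount` and the field names are all hypothetical.

```python
from typing import Callable, Dict

Fields = Dict[str, str]


def normalise_amount(value: str) -> str:
    """Replace an Italian-style amount such as '1.234,50' with '1234.50'."""
    return value.replace(".", "").replace(",", ".")


def apply_template(
    raw: Fields,
    replacements: Dict[str, Callable[[str], str]],
    derivations: Dict[str, Callable[[Fields], str]],
) -> Fields:
    """Apply value replacements, then derive fields missing from the document."""
    fields = {
        name: replacements[name](value) if name in replacements else value
        for name, value in raw.items()
    }
    for name, derive in derivations.items():
        if name not in fields:  # derive only what the document did not report
            fields[name] = derive(fields)
    return fields


if __name__ == "__main__":
    raw = {"total": "1.234,50", "iban": "IT60X0542811101000000123456"}
    enriched = apply_template(
        raw,
        replacements={"total": normalise_amount},
        derivations={"country": lambda f: f["iban"][:2]},
    )
    print(enriched)
```

In this sketch the replacement function rewrites an extracted value, while the derivation function computes a value (the country) that never appears as a field in the document.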

Single project optimisation!

The ODE service aims to fit the customer's requirements as closely as possible, so a project often begins with a Proof of Concept (POC) that highlights most of the specific issues to be addressed, from document quality to data volumes and expected throughput, and from processing times to the customised checks to apply.

The web form used to validate documents with missing, wrong or uncertain values is usually built from scratch for each project, so as to ensure the best user experience. In other words, while DDE is a mass service, ODE is typically customised and optimised to obtain the best results within all the project constraints.

Integration with Artificial Intelligence tools!

Some ODE projects highlighted the need to classify documents, and sometimes also to separate them from each other when the user company provides them in a single file.

Depending on the application scenario and the availability of examples, the basic ODE service can be integrated with classification functions based on heuristics, on templates, or on artificial intelligence functions based on image (or text) recognition.

The latter approach is only applicable with a base of at least 500-1000 already annotated examples, but for this purpose it is very promising. Whichever approach is used, the solution should provide a web application that lets the user correct possible classification errors.
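The choice between approaches, and the simplest heuristic one, can be sketched in Python as follows. This is only an illustration under stated assumptions: the keyword lists, the class labels and the function names are hypothetical, and the threshold simply takes the lower bound of the 500-1000 range mentioned above.

```python
from typing import Dict, Optional, Tuple

# Lower bound of the 500-1000 annotated examples mentioned in the text.
MIN_ANNOTATED_EXAMPLES = 500

# Hypothetical keyword lists; a real project would define its own classes.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "f24": ("f24", "delega"),
    "bank_statement": ("estratto conto", "saldo"),
}


def choose_strategy(annotated_examples: int) -> str:
    """Pick the AI approach only when enough annotated examples exist."""
    return "ai" if annotated_examples >= MIN_ANNOTATED_EXAMPLES else "heuristics"


def classify_by_keywords(text: str, keywords: Dict[str, Tuple[str, ...]]) -> Optional[str]:
    """Return the class whose keywords occur most often, or None if none match."""
    lowered = text.lower()
    scores = {
        label: sum(word in lowered for word in words)
        for label, words in keywords.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None  # left for manual classification in the web application
    return best
```

A `None` result corresponds to the case where the user has to classify the document by hand in the correction web application.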

FAQ

Service characteristics

Checkmarks in pre-defined boxes in the documents can be recognised. Signature boxes can also be identified through (possibly enriched) templates, and the service can recognise whether or not something is written in those boxes.
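One common way to decide whether a pre-defined box contains a mark is to measure the share of dark pixels inside it. The Python sketch below shows that technique only; it is not necessarily how ODE does it, and the function name, the box convention and both thresholds are assumptions.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale rows, 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def box_is_marked(image: Image, box: Box, dark_below: int = 128, min_fill: float = 0.1) -> bool:
    """Return True when at least `min_fill` of the pixels in the box are dark."""
    left, top, right, bottom = box
    pixels = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
    if not pixels:
        return False  # an empty box region cannot contain a mark
    dark = sum(1 for value in pixels if value < dark_below)
    return dark / len(pixels) >= min_fill


if __name__ == "__main__":
    page = [[255] * 10 for _ in range(10)]  # blank white page
    print(box_is_marked(page, (2, 2, 8, 8)))
    for i in range(2, 8):
        page[i][i] = 0                       # draw a diagonal stroke in the box
    print(box_is_marked(page, (2, 2, 8, 8)))
```

The same test applies to a signature box: the box position comes from the template, and the fill ratio tells whether something was written in it.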

Protected documents cannot be processed: page extraction must be enabled in any case. In particular, PDF documents that require a password to be opened cannot be processed, nor can PDF documents that can be opened but require a password to use the “Content copy” function.

A clean background and sharp characters are key factors when processing image PDFs. Original documents with serious deficiencies in this respect are pointed out at set-up time. The ideal scan resolution is 300 dpi (no less, but no more either).

Taking into account both running projects and POCs, quite a large variety of document types has been processed, from penalty refunds for rental cars to repossession letters, and from the identification of signature fields on heterogeneous forms to F24 forms and bank account statements.

Codice Destinatario is always XXXXXXX (seven times X). For EU countries, the VAT ID always exists and should be reported with the country prefix (ISO 3166-1 alpha-2 code). For non-EU countries, the service adopts the following rule for the VAT ID: Country code + Value, where Country code is the ISO 3166-1 alpha-2 code and Value is a string with a maximum length of 28 characters.
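The rule for countries outside the EU can be written as a small Python helper. This is a sketch under assumptions: `format_vat_id` is a hypothetical name, the country code is checked only for its two-letter shape (not against the list of assigned ISO 3166-1 codes), and Value is treated as an arbitrary non-empty string whose length alone is checked.

```python
import re

MAX_VALUE_LENGTH = 28  # maximum length of Value stated in the rule above


def format_vat_id(country_code: str, value: str) -> str:
    """Build 'Country code + Value' and reject inputs that break the rule."""
    code = country_code.strip().upper()
    if not re.fullmatch(r"[A-Z]{2}", code):
        raise ValueError(f"not a two-letter country code: {country_code!r}")
    value = value.strip()
    if not 1 <= len(value) <= MAX_VALUE_LENGTH:
        raise ValueError(f"value must be 1 to {MAX_VALUE_LENGTH} characters long")
    return code + value


if __name__ == "__main__":
    print(format_vat_id("us", "123456789"))
```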
