PDF extraction – Biz Docs

The DDE (DocDataExtraction) service extracts the contents of interest from typical business documents, such as orders, order responses, despatch advices and invoices, provided in PDF format.

In particular:
1. The DDE service is fully automated. It covers acquisition of the original documents, transformation from PDF to an internal structured format, conversion to the required custom or standard format (e.g. XML-SdI, Peppol 3.0), and delivery of the output to the customer. File exchange takes place via SFTP.

2. The DDE service delivers excellent performance because it is already specialised on a set of document types, for which it optimises the creation of models (templates) representing their layout. Templates can be created by expert users who are not necessarily skilled in ICT.

3. The DDE service runs in the cloud, outside the company's legacy system, which can nevertheless import the resulting output in a custom format. Performance in the extraction phase can be tuned and improved by running two or more extractor modules in parallel. The only interactive module is the Verifier, which allows operators to fix possible extraction errors.

4. The native languages of the DDE service are Italian and English. The desired contents can be extracted from documents in other languages, provided that the templates are created before the service goes into production.
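The idea of running several extractor modules in parallel can be illustrated with a minimal Python sketch. It is not the DDE implementation: the `extract` function is a hypothetical placeholder for one extractor module, and `extract_batch` and its `extractors` parameter are names assumed here for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(path: str) -> dict:
    """Hypothetical placeholder for a single extractor module.

    A real extractor would parse the PDF; this one only echoes the path.
    """
    return {"source": path, "fields": {}}


def extract_batch(paths: list[str], extractors: int = 2) -> list[dict]:
    """Spread a batch of documents over `extractors` parallel workers.

    Results are returned in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=extractors) as pool:
        return list(pool.map(extract, paths))


results = extract_batch(["inv_001.pdf", "inv_002.pdf", "ord_003.pdf"], extractors=2)
```

Raising `extractors` increases throughput only as long as the underlying resources (CPU, OCR quota) are not already saturated.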

Key factors

COMPLETE EXTRACTION!

The DDE service has steadily improved its effectiveness since its launch in 2009:

1. The initial objective of the DDE service was quite challenging, i.e. to derive the CBI2 4 “white label” standard format from PDF invoices. This meant dealing from the very beginning with VAT detail tables, payment methods and due dates, and other non-trivial details, and eventually led to very good results.

2. Since 2014, with the introduction of mandatory B2G eInvoicing, further important improvements have been made, particularly in the extraction of invoice lines and of the document footer, where information embedded in free text can be found.

Thanks to this evolution, the DDE service has reached a level of completeness and efficiency higher than that offered by SATA's competitors, first for invoices but also for orders and despatch advices.

EXTRACTION RELIABILITY!

The DDE service processes both text and image (scanned) PDF documents:

For text or semi-text PDF documents, the DDE service uses proprietary data extraction algorithms, ensuring 100% correctness on all document fields (unless the fields are missing, or syntactically or semantically wrong, in the original).

For image PDF documents, the DDE service uses the best available OCR engines: ABBYY, which was the default until last year and is still used for large volumes per single customer, and preferably Google, which is cloud-based and therefore particularly suited to a cloud service such as DDE.

In addition, the DDE service combines advanced techniques, from syntactical checks to fuzzy-logic and semantic controls on single fields and groups of fields, together with a smart use of positional logic and image enhancement.

A distinctive feature of the DDE service is the systematic use of templates, which are easy and fast to create. Competitors use heuristic techniques or machine learning, but DDE achieves very high success rates compared with these approaches.

MINIMUM SET-UP TIME!

SATA has invested heavily in making template creation for the DDE service almost automatic, in order to minimise the set-up effort for the user company:

1. When a document with an unknown template arrives, the DDE service tries to extract all the information it contains by applying a number of sophisticated rules, thus creating a draft template.

2. Fields that DDE considers questionable are highlighted in different colours according to the diagnosis result: red for missing information, yellow for suspect, orange for syntactically or semantically wrong, and green for OK.

3. The human operator working on templates starts from the exceptions, completing and correcting the corresponding fields, and thus obtains the final template in a very short time.
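The colour coding and the "start from the exceptions" workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the `Diagnosis` enum, the `exceptions_first` function, the field names and the severity order (missing, then wrong, then suspect) are hypothetical and not taken from DDE.

```python
from enum import Enum


class Diagnosis(Enum):
    """Diagnosis results and the highlight colour given in the text."""
    MISSING = "red"
    SUSPECT = "yellow"
    INVALID = "orange"   # syntactically or semantically wrong
    OK = "green"


def exceptions_first(fields: dict[str, Diagnosis]) -> list[str]:
    """List the fields an operator should review; OK fields are omitted.

    The severity order used here is an assumption made for this sketch.
    """
    order = [Diagnosis.MISSING, Diagnosis.INVALID, Diagnosis.SUSPECT]
    return [name for level in order
            for name, diagnosis in fields.items() if diagnosis is level]


draft = {
    "invoice_number": Diagnosis.OK,
    "issuer_vat_id": Diagnosis.MISSING,
    "due_date": Diagnosis.SUSPECT,
    "iban": Diagnosis.INVALID,
}
todo = exceptions_first(draft)
```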

According to statistics on several thousand templates built, the average time spent creating a template is 15 minutes, with a minimum of a few minutes and a maximum that seldom exceeds half an hour.

FAQ

Document requirements

The document size should be equivalent to A4. Smaller documents are accepted. Larger documents are not accepted, nor are documents with irregular or variable layouts.

The document can have a vertical or horizontal orientation with respect to the included text. It is not possible to process documents where some text is orthogonal to other text.

No, this practice makes template application less effective. To keep the layout stable, it is suggested to start from an XLS file (SATA can provide an example) and then generate the PDF via a virtual printer.

No, the document should not be protected, as extraction must always be possible. In particular, DDE cannot process PDFs that need a password to open, nor PDFs that open without a password but have the “Content copy” function protected.

A clean background and sharp characters are key factors for image PDF processing. Original documents with serious deficiencies in this respect are pointed out at set-up time. The ideal scan resolution is 300 dpi (no less, but no more either).

The main peculiarity of this kind of invoice is that the body usually reports data on resource usage rather than products or services sold, so it cannot be interpreted as a classic invoice body. For such invoices DDE therefore exports as many lines as there are VAT details, each with a description like “Charge for services provided from xx/xx/xxxx to yy/yy/yyyy”, where the start and end dates of the period are taken from the document header or body.
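A minimal sketch of this rule, generating one output line per VAT detail, might look like the following. The `VatDetail` and `Line` classes and the function name are hypothetical, and the dd/mm/yyyy date format is an assumption based on the placeholder shown in the text.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class VatDetail:
    rate: Decimal      # VAT rate in percent
    taxable: Decimal   # taxable amount for this rate


@dataclass
class Line:
    description: str
    amount: Decimal
    vat_rate: Decimal


def lines_from_vat_details(details: list[VatDetail], start: date, end: date) -> list[Line]:
    """Build one invoice line per VAT detail, all sharing the period description."""
    description = (
        f"Charge for services provided from {start:%d/%m/%Y} to {end:%d/%m/%Y}"
    )
    return [Line(description, d.taxable, d.rate) for d in details]


lines = lines_from_vat_details(
    [VatDetail(Decimal("22"), Decimal("100.00")), VatDetail(Decimal("10"), Decimal("35.50"))],
    start=date(2024, 1, 1),
    end=date(2024, 1, 31),
)
```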

Identification of parties

Templates are associated with the issuer's VAT ID. For issuers based in a European country the VAT ID always exists (including the ISO 3166-1 alpha-2 country prefix). For issuers based outside the EU, the template creator adopts the following convention for the VAT ID: EXTRAxxxxxx (11 characters in all); in this case template recognition is based on other data such as company name, phone, e-mail and address.
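The convention above can be illustrated with a short sketch that decides how a template would be looked up from its key. The function name, the returned labels and both regular expressions are assumptions made for illustration. The text only fixes the `EXTRA` prefix and the total length of 11, so the alphanumeric suffix and the allowed length of an EU VAT ID body are assumed here.

```python
import re

# Assumption: the six trailing characters of an extra-EU key are alphanumeric.
EXTRA_EU_KEY = re.compile(r"EXTRA[0-9A-Za-z]{6}")
# Assumption: two-letter country prefix followed by 2 to 12 alphanumerics.
EU_VAT_ID = re.compile(r"[A-Z]{2}[0-9A-Za-z]{2,12}")


def lookup_strategy(template_key: str) -> str:
    """Tell whether a template is matched by VAT ID or by other party data."""
    if EXTRA_EU_KEY.fullmatch(template_key):
        # Extra-EU issuer: recognise by company name, phone, e-mail, address.
        return "match_on_party_data"
    if EU_VAT_ID.fullmatch(template_key):
        return "match_on_vat_id"
    raise ValueError(f"unrecognised template key: {template_key!r}")
```

The `EXTRA` check comes first because such a key would otherwise also satisfy the looser EU pattern.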

In this case, the template is not found and the extraction produces more exceptions than expected. If a template for that issuer exists, the user can select and apply it to obtain a more reliable result; otherwise the template can be created taking advantage of the correct information already extracted. As noted in P01, for extra-EU issuers DDE does not expect a VAT ID and searches for the template on the basis of other information.

In this case, the PDF should include both the buyer's VAT ID and the tax code (codice fiscale). The tax code should be reported, anywhere in the invoice header or footer, with a label that can be easily recognised.

Invoices addressed to public entities or companies and delivered via SdI should include the “codice Destinatario” (7 characters for private entities and 6 characters for public ones), or the PEC address. That code (or PEC address) should be reported, anywhere in the invoice header or footer, with a label that can be easily recognised.

Invoices addressed to public entities or companies and delivered via the Peppol network should include the buyer's Peppol ID, which can be reported anywhere in the invoice header or footer, with a label that can be easily recognised.

Yes, they can, provided that whoever creates the template is aware of this requirement. Regular expressions can be associated with some template fields and, if the issuer uses stable labels in the description, they can extract the desired contents. In a few cases it is necessary to write specific pieces of code, associated with template fields whose interpretation goes beyond what is written on the document.
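As an illustration of a regular expression tied to a stable label, the sketch below pulls an order reference out of a line description. The label “Order no.”, the reference format and the function name are hypothetical; a real template would use whatever label the issuer actually prints.

```python
import re
from typing import Optional

# Hypothetical stable label "Order no." followed by an alphanumeric reference.
ORDER_REF = re.compile(
    r"Order\s+no\b\.?\s*[:#]?\s*(?P<ref>[A-Z0-9][A-Z0-9/-]*)",
    re.IGNORECASE,
)


def extract_order_ref(description: str) -> Optional[str]:
    """Return the order reference found after the label, or None if absent."""
    match = ORDER_REF.search(description)
    return match.group("ref") if match else None


ref = extract_order_ref("Maintenance fee - Order no. PO-2024/0017 - March")
```

If the issuer changes the label from one invoice to the next, a pattern like this stops matching, which is why stable labels matter.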

The best condition is that each line in the PDF reports a “VAT code” expressing the VAT rate or exemption article, and that this VAT code is also referenced in the summary VAT detail. VAT information at line level is mandatory in the XML-SdI format and can be derived automatically only when a single VAT code appears in the summary. Note that in the summary VAT detail it is mandatory to quote the exemption article (legal reference) when the VAT amount is zero.
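The derivation rule can be sketched as follows: line-level VAT codes are filled in automatically only when the summary contains exactly one code. The function name, the example code `"22"` and the choice to raise an error in the ambiguous case are assumptions for illustration, not DDE behaviour.

```python
from typing import Optional


def derive_line_vat_codes(
    line_codes: list[Optional[str]], summary_codes: list[str]
) -> list[str]:
    """Fill in missing line-level VAT codes from the VAT summary.

    Missing codes can be derived only when the summary holds a single code.
    """
    if all(line_codes):
        return [code for code in line_codes if code]  # nothing to derive
    distinct = set(summary_codes)
    if len(distinct) == 1:
        only_code = next(iter(distinct))
        return [code or only_code for code in line_codes]
    raise ValueError("line VAT codes missing and summary is not single-code")


filled = derive_line_vat_codes([None, None, "22"], ["22"])
```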

This is the suggested practice, particularly since January 2021, when the number of SdI Nature codes increased substantially. The best solution is to put the Nature code at the beginning of the exemption code description. If the Nature code cannot be reported in the invoice, the user can provide a table mapping exemption codes to Nature codes. This table should be sent to SATA support via the ticketing system.
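Both options can be sketched in a few lines: read the Nature code from the start of the exemption description, and fall back to a customer-provided table otherwise. The function name, the exemption code `"E10"` and the sample descriptions are hypothetical; the pattern assumes Nature codes of the form `N` plus a digit, optionally followed by a dot and a digit (for example `N4` or `N3.1`).

```python
import re
from typing import Optional

# Nature code at the very beginning of the description, e.g. "N3.1 ...".
NATURE_PREFIX = re.compile(r"\s*(N\d(?:\.\d)?)\b")


def nature_code(
    exemption_code: str, description: str, table: dict[str, str]
) -> Optional[str]:
    """Resolve the Nature code from the description, else from the table."""
    match = NATURE_PREFIX.match(description)
    if match:
        return match.group(1)
    return table.get(exemption_code)


# Hypothetical customer table mapping exemption codes to Nature codes.
customer_table = {"E10": "N4"}
from_text = nature_code("X1", "N3.1 Non imponibile art. 8 DPR 633/72", customer_table)
from_table = nature_code("E10", "Esente art. 10 DPR 633/72", customer_table)
```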

Typical requirements are: (a) a clear reference to social security charges (“cassa previdenziale”) and the related VAT treatment, and (b) a clear reference to withholding tax and the payable amount. These requirements do not apply to professionals under the “regime forfettario”.

First, there are syntactical errors, such as an invalid VAT ID, tax code, IBAN, CIG or CUP. Another problem is the application of line charge rates to the total excluding VAT: SdI has its own logic, and an additional line may be needed for rounding purposes in order to obtain exactly the same result as the original document. Also, additional expenses and stamp duty should be reported in specific lines; if these lines are missing, the amounts should be reported in the invoice footer, and a plug-in is needed to generate the corresponding lines.
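As an example of one such syntactical check, the sketch below validates an IBAN with the standard mod-97 checksum. It is a generic illustration, not the DDE validator: the function name is an assumption, and the check does not verify country-specific lengths.

```python
def is_valid_iban(iban: str) -> bool:
    """Check IBAN structure and the mod-97 checksum."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not s.isascii() or not s.isalnum():
        return False
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


ok = is_valid_iban("GB82 WEST 1234 5698 7654 32")
bad = is_valid_iban("GB82 WEST 1234 5698 7654 33")
```

A checksum of this kind catches typing and OCR errors, but it cannot tell whether the account actually exists.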
