Text Classification Models for Form Entity Linking

  1. María Villota¹
  2. César Domínguez¹
  3. Jónathan Heras¹
  4. Eloy Mata¹
  5. Vico Pascual¹

  ¹ Department of Mathematics and Computer Science, University of La Rioja
Proceedings:
15th IAPR International Workshop on Document Analysis Systems (DAS 2022), 22-25 May 2022. Short Paper Booklet

Publisher: La Rochelle Université

Year of publication: 2022

Pages: 40-43

Conference: 15th IAPR International Workshop on Document Analysis Systems (DAS 2022), 22-25 May 2022.

Type: Conference contribution

DOI: 10.48550/ARXIV.2112.07443
Institutional repository: Open access

Abstract

Forms are a widespread type of template-based document used in a great variety of fields. The automatic extraction of the information included in these documents is in great demand due to the increasing volume of forms that are generated on a daily basis. However, this is not a straightforward task when working with scanned forms because of the great diversity of templates with different locations of form entities, and the quality of the scanned documents. In this context, there is a feature that is shared by all forms: they contain a collection of interlinked entities built as key-value (or label-value) pairs, together with other entities such as headers or images. In this work, we have tackled the problem of entity linking in forms by combining image processing techniques and a text classification model based on the BERT architecture. This approach achieves state-of-the-art results with an F1-score of 0.80 on the FUNSD dataset, a 5% improvement over the best previous method.