Handwritten Text Recognition for the European Digital Treasures Collections: Hands-on workshop by Joan Andreu Sánchez and Enrique Vidal

The first day of the workshop “New Digital Exponential Technologies Towards The Generation Of Business Models” concluded with a hands-on session led by Joan Andreu Sánchez and Enrique Vidal.

Joan Andreu Sánchez is an assistant professor at the Universitat Politècnica de València and the Director of the Pattern Recognition and Human Language Technologies (PRHLT) Research Center at that university. His research focuses on machine learning and formal languages applied to text recognition and math recognition.

Enrique Vidal is an emeritus professor at the same university and a former co-leader of the PRHLT Research Center. For many years Dr. Vidal has focussed his research on handwritten document analysis and recognition, leading the development of probabilistic indexing technology. Joan Andreu and Enrique are founders of tranSkriptorium, an AI spin-off company.

The contents of a massive volume of digitised handwritten records in archives and libraries all over the world are practically inaccessible, buried beneath thousands of terabytes of high-resolution images. If perfect or sufficiently accurate transcripts of these text images were available, their textual content could be straightforwardly indexed for plain-text access using conventional information retrieval systems.

However, fully automatic transcription results generally lack the level of accuracy required for reliable text indexing and search. On the other hand, the massive volume of the image collections typically considered for indexing makes manual or even computer-assisted transcription entirely prohibitive. Dr. Sánchez and Dr. Vidal explained how very accurate indexing and search can be implemented directly on the images themselves, without explicitly resorting to image transcripts. They also presented the results obtained with the proposed techniques on several relevant historical data sets, results that have generated considerable interest in these technologies.
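To make the idea of searching without a single fixed transcript more concrete, here is a minimal Python sketch of a toy index in the spirit of probabilistic indexing. It is not the speakers' implementation: the page names, the word hypotheses, their probabilities, and the functions `build_index` and `search` are all invented for illustration. The assumption is that a recognition system has already produced, for each page image, a list of candidate words with an estimated probability that each word is written on that page.

```python
from collections import defaultdict

# Hypothetical toy data (not from the workshop): for each page image,
# candidate words with an assumed probability that the word appears on it.
PAGE_HYPOTHESES = {
    "page_001.jpg": [("treasure", 0.92), ("treasury", 0.31), ("archive", 0.08)],
    "page_002.jpg": [("archive", 0.85), ("treasure", 0.12)],
    "page_003.jpg": [("letter", 0.77), ("archive", 0.40)],
}


def build_index(page_hypotheses):
    """Build an inverted index: word -> {page: probability}."""
    index = defaultdict(dict)
    for page, hypotheses in page_hypotheses.items():
        for word, prob in hypotheses:
            # If a word is hypothesised several times on a page,
            # keep the highest probability seen.
            index[word][page] = max(prob, index[word].get(page, 0.0))
    return index


def search(index, query, threshold=0.5):
    """Return (page, probability) pairs at or above the threshold,
    sorted from most to least confident."""
    candidates = index.get(query.lower(), {})
    hits = [(page, prob) for page, prob in candidates.items() if prob >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


index = build_index(PAGE_HYPOTHESES)
print(search(index, "archive", threshold=0.3))
```

In a sketch like this, the threshold is the user's control over the trade-off: lowering it retrieves more pages that may contain the word (higher recall), while raising it keeps only the most confident matches (higher precision).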

You can watch the session on YouTube here and the paper presented at the workshop here: Part I & Part II.

Written by Leonard Callus and the European Digital Treasures Team.