Volume 15 (2024). Issue 3 (62). Paper No. 4 (451)

Applied software systems

Research Article

Recovering text sequences using deep learning models

Igor Victorovich Vinokurov

Financial University under the Government of the Russian Federation, Moscow, Russia
Igor Victorovich Vinokurov — Corresponding author, igvvinokurov@fa.ru

Abstract. This article presents the results of building, training, and evaluating the performance of models with the Encoder-Decoder and Sequence-To-Sequence (Seq2Seq) architectures for the problem of completing incomplete texts. Problems of this type often arise when restoring the contents of documents from their low-quality images. The studies conducted in this work are aimed at solving the practical problem of forming electronic copies of scanned documents of «Roskadastr» PLC whose recognition is difficult or impossible with standard tools.

The models were built and studied in Python using the high-level API of the Keras package. A dataset of several thousand pairs was formed for training and studying the models; each pair consists of an incomplete text and the corresponding full text. To evaluate the quality of the models, the values of the loss function and of the accuracy, BLEU, and ROUGE-L metrics were calculated. Loss and accuracy made it possible to evaluate the effectiveness of the models at the level of predicting individual words, while BLEU and ROUGE-L were used to evaluate the similarity between the full and reconstructed texts. The results showed that both the Encoder-Decoder and Seq2Seq models cope with the task of reconstructing text sequences from a fixed set, but the transformer-based Seq2Seq model achieves better results in terms of training speed and quality. (Linked article texts in Russian and in English.)
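For illustration, the two sequence-level metrics mentioned in the abstract can be computed in Python roughly as in the following minimal sketch. This is not the evaluation code used in the article: the sample texts, tokenization, and smoothing settings are illustrative assumptions, and the nltk and rouge-score packages are assumed to be available.

```python
# Minimal sketch of computing BLEU and ROUGE-L between a full (reference)
# text and a reconstructed (candidate) text. Illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

full_text = "the extract is issued within five working days"      # reference
reconstructed = "the extract is issued within five days"          # model output

# BLEU measures n-gram overlap between the candidate and the reference;
# smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [full_text.split()],           # list of reference token sequences
    reconstructed.split(),         # candidate token sequence
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L scores the longest common subsequence of the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
rouge_l = scorer.score(full_text, reconstructed)["rougeL"].fmeasure

print(f"BLEU = {bleu:.3f}, ROUGE-L F1 = {rouge_l:.3f}")
```

In the article, such scores are averaged over the pairs of full and reconstructed texts to compare the two architectures.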

Keywords: deep learning models, encoder-decoder, sequence-to-sequence transformer, text recovering, BLEU, ROUGE-L, Keras, Python

2020 Mathematics Subject Classification (MSC-2020): 68T20; 68T07; 68T45
MSC-2020 68-XX: Computer science
MSC-2020 68Txx: Artificial intelligence
MSC-2020 68T20: Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
MSC-2020 68T07: Artificial neural networks and deep learning

For citation: Igor V. Vinokurov. Recovering text sequences using deep learning models. Program Systems: Theory and Applications, 2024, 15:3, pp. 75–110. (In Russ., in Engl.). https://psta.psiras.ru/2024/3_75-110.

Full text of bilingual article (PDF): https://psta.psiras.ru/read/psta2024_3_75-110.pdf

The article was submitted 03.03.2024; approved after reviewing 14.04.2024; accepted for publication 15.08.2024; published online 23.09.2024.

© Vinokurov I. V., 2024
Editorial address: Ailamazyan Program Systems Institute of the Russian Academy of Sciences, Peter the First Street 4«a», Veskovo village, Pereslavl area, Yaroslavl region, 152021, Russia; Phone: +7(4852) 695-228; E-mail: ; Website: http://psta.psiras.ru
© Ailamazyan Program Systems Institute of the Russian Academy of Sciences (site design), 2010–2024. Licensed under CC-BY-4.0.