Volume 15 (2024). Issue 3 (62). Paper No. 4 (451)

Applied software systems

Research Article

Recovering text sequences using deep learning models

Igor Victorovich Vinokurov

Financial University under the Government of the Russian Federation, Moscow, Russia
Igor Victorovich Vinokurov — Corresponding author, igvvinokurov@fa.ru

Abstract. This article presents the results of building, training, and evaluating the performance of models with the Encoder-Decoder and Sequence-To-Sequence (Seq2Seq) architectures for the problem of completing incomplete texts. Problems of this type often arise when restoring the contents of documents from their low-quality images. The studies conducted in this work are aimed at solving the practical problem of forming electronic copies of scanned documents of «Roskadastr» PLC whose recognition is difficult or impossible with standard tools.

The models were built and studied in Python using the high-level API of the Keras package. A dataset of several thousand pairs was formed for training and studying the models; each pair consists of an incomplete text and the corresponding full text. To evaluate the quality of the models, the values of the loss function and of the accuracy, BLEU, and ROUGE-L metrics were calculated. Loss and accuracy made it possible to evaluate the effectiveness of the models at the level of predicting individual words, while BLEU and ROUGE-L were used to evaluate the similarity between the full and reconstructed texts. The results showed that both the Encoder-Decoder and Seq2Seq models cope with the task of reconstructing text sequences from a fixed set, but the transformer-based Seq2Seq model achieves better results in terms of training speed and quality. (Linked article texts in Russian and in English.)
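For illustration, the two sequence-level metrics mentioned in the abstract can be computed in Python roughly as in the following minimal sketch. This is not the evaluation code used in the article: the sample texts, tokenization, and smoothing settings are illustrative assumptions, and the nltk and rouge-score packages are assumed to be available.

```python
# Minimal sketch of computing BLEU and ROUGE-L between a full (reference)
# text and a reconstructed (candidate) text. Illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

full_text = "the extract is issued within five working days"      # reference
reconstructed = "the extract is issued within five days"          # model output

# BLEU measures n-gram overlap between the candidate and the reference;
# smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [full_text.split()],           # list of reference token sequences
    reconstructed.split(),         # candidate token sequence
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L scores the longest common subsequence of the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
rouge_l = scorer.score(full_text, reconstructed)["rougeL"].fmeasure

print(f"BLEU = {bleu:.3f}, ROUGE-L F1 = {rouge_l:.3f}")
```

In the article, such scores are averaged over the pairs of full and reconstructed texts to compare the two architectures.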

Keywords: deep learning models, encoder-decoder, sequence-to-sequence transformer, text recovering, BLEU, ROUGE-L, Keras, Python

2020 Mathematics Subject Classification (MSC-2020): 68T20; 68T07; 68T45
MSC-2020 68-XX: Computer science
MSC-2020 68Txx: Artificial intelligence
MSC-2020 68T20: Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
MSC-2020 68T07: Artificial neural networks and deep learning

For citation: Igor V. Vinokurov. Recovering text sequences using deep learning models. Program Systems: Theory and Applications, 2024, 15:3, pp. 75–110. (In Russ., in Engl.). https://psta.psiras.ru/2024/3_75-110.

Full text of bilingual article (PDF): https://psta.psiras.ru/read/psta2024_3_75-110.pdf

The article was submitted 03.03.2024; approved after reviewing 14.04.2024; accepted for publication 15.08.2024; published online 23.09.2024.

© Vinokurov I. V., 2024
Editorial address: Ailamazyan Program Systems Institute of the Russian Academy of Sciences, Peter the First Street 4«a», Veskovo village, Pereslavl area, Yaroslavl region, 152021, Russia; Phone: +7(4852) 695-228; E-mail: ; Website: http://psta.psiras.ru
© Ailamazyan Program Systems Institute of the Russian Academy of Sciences (site design), 2010–2024. Licensed under CC-BY-4.0.