Comparative Analysis of Backbone Architectures for Instance Segmentation of Objects in Aerial Imagery Using Mask R-CNN

Ivan R. Kuznetsov; Igor V. Vinokurov; Daria A. Frolova; Andrey I. Ilyin

Program Systems: Theory and Applications

ISSN 2079-3316

Bilingual online scientific journal of the Ailamazyan Program Systems Institute of the Russian Academy of Sciences (Ailamazyan PSI of RAS)

12+

Volume 16 (2025), Issue 4 (67), Paper No. 7 (454)

Artificial intelligence and machine learning

Research Article

DOI: 10.25209/2079-3316-2025-16-4-173-216

Igor Victorovich Vinokurov¹, Daria Aleksandrovna Frolova², Andrey Ivanovich Ilyin³, Ivan Romanovich Kuznetsov⁴

¹⁻⁴ Financial University under the Government of the Russian Federation, Moscow, Russia
¹ igvvinokurov@fa.ru

Abstract. This paper compares Mask R-CNN models with various pretrained backbone architectures for instance segmentation of real estate objects in aerial images. The models were fine-tuned on a specialized dataset provided by PLC «Roskadastr».

Analysis of bounding-box detection and mask segmentation accuracy identified the preferred architectures: the Swin transformers (Swin-S and Swin-T) and the ConvNeXt-T convolutional network. The authors attribute the high accuracy of these models to their ability to capture global contextual dependencies in the image.

The results support the following recommendations for choosing a backbone architecture: for real-time monitoring systems, where inference speed is critical, lightweight models (EfficientNet-B3, ConvNeXt-T, Swin-T) are advisable; for offline tasks requiring maximum accuracy, such as real estate mapping, the larger Swin-S model is recommended. (Linked article texts in English and in Russian).

Keywords: instance segmentation, backbone, Mask R-CNN, ResNet, DenseNet, EfficientNet, ConvNeXt, Swin

MSC-2020: 68T20; 68T07, 68T45
MSC-2020 68-XX: Computer science
MSC-2020 68Txx: Artificial intelligence
MSC-2020 68T20: Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
MSC-2020 68T07: Artificial neural networks and deep learning
MSC-2020 68T45: Machine vision and scene understanding

For citation: Igor V. Vinokurov, Daria A. Frolova, Andrey I. Ilyin, Ivan R. Kuznetsov. Comparative Analysis of Backbone Architectures for Instance Segmentation of Objects in Aerial Imagery Using Mask R-CNN. Program Systems: Theory and Applications, 2025, 16:4, pp. 173–216. (In Engl., in Russ.). https://psta.psiras.ru/2025/4_173-216.

Full text of bilingual article (PDF): https://psta.psiras.ru/read/psta2025_4_173-216.pdf

The article was submitted 22.09.2025; approved after reviewing 27.09.2025; accepted for publication 12.10.2025; published online 19.10.2025.

Editorial address: Ailamazyan Program Systems Institute of the Russian Academy of Sciences, Peter the First Street 4«a», Veskovo village, Pereslavl area, Yaroslavl region, 152021 Russia; Website: http://psta.psiras.ru

Phone: +7(4852) 695-228; E-mail: ; License: CC-BY-4.0
