Using multilevel data sources to prepare training sets for cyberattack detection

Sergey V. Isaev; Dmitry D. Kononov

Program Systems: Theory and Applications

ISSN 2079-3316

Bilingual online scientific journal of the Ailamazyan Program Systems Institute of the Russian Academy of Sciences


Volume 16 (2025), Issue 4 (67), Paper No. 10 (457)

Hardware and software for distributed and supercomputer systems

Research Article

DOI: 10.25209/2079-3316-2025-16-4-267-285


Dmitry Dmitrievich Kononov¹, Sergey Vladislavovich Isaev²

¹,² Institute of Computational Modelling of the Siberian Branch of the Russian Academy of Sciences, Krasnoyarsk, Russia
¹ ddk@icm.krasn.ru

Abstract. Network traffic analysis is an integral part of ensuring security in information and telecommunication systems. Machine learning gives modern approaches higher detection rates for cyber threats.

A new approach to generating training datasets is proposed. It introduces a new aggregation unit, the “session”, and uses signature analysis and multilevel data sources, including heterogeneous ones. A list of requirements for the datasets is formulated: preserving the first packets of each connection, preserving hidden areas of the packets, and recording extended information about traffic sources (country, autonomous system number (ASN)). This additional information makes it possible to detect attacks of the “hidden communication channel” type. Using the proposed approach, a software package has been developed for creating training datasets from multilevel sources at the L7, L4, and L3 levels of the OSI model. In contrast to existing works, real network activity data and long time intervals are used. The resulting training sets can be used to create more effective machine-learning methods of intrusion detection and prevention. (In Russian).
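As an illustration only, the idea of aggregating multilevel records into a "session" might be sketched as follows. The abstract does not define the session unit, so everything here is an assumption: a session is taken to be all records between one source and one destination address that are separated by gaps no longer than a hypothetical idle timeout. The names Record, Session, build_sessions, IDLE_TIMEOUT, KEEP_FIRST and the lookup table GEO_TABLE are invented for this sketch and are not taken from the authors' software package.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    ts: float            # timestamp in seconds
    src_ip: str
    dst_ip: str
    layer: str           # "L3", "L4" or "L7" (OSI level of the data source)
    payload: bytes = b""

@dataclass
class Session:
    src_ip: str
    dst_ip: str
    start: float
    end: float
    layers: set = field(default_factory=set)
    first_payloads: list = field(default_factory=list)
    country: str = "unknown"
    asn: Optional[int] = None

IDLE_TIMEOUT = 60.0      # assumed gap (seconds) that closes a session
KEEP_FIRST = 3           # assumed number of first packets to preserve

# Made-up enrichment table: documentation address, placeholder country code,
# and an ASN from the range reserved for documentation.
GEO_TABLE = {"192.0.2.10": ("ZZ", 64500)}

def build_sessions(records, idle_timeout=IDLE_TIMEOUT, keep_first=KEEP_FIRST):
    """Group records from all levels into sessions keyed by (src, dst)."""
    sessions = []
    open_by_key = {}
    for rec in sorted(records, key=lambda r: r.ts):
        key = (rec.src_ip, rec.dst_ip)
        sess = open_by_key.get(key)
        # Start a new session on first contact or after a long idle gap.
        if sess is None or rec.ts - sess.end > idle_timeout:
            country, asn = GEO_TABLE.get(rec.src_ip, ("unknown", None))
            sess = Session(rec.src_ip, rec.dst_ip, rec.ts, rec.ts,
                           country=country, asn=asn)
            open_by_key[key] = sess
            sessions.append(sess)
        sess.end = rec.ts
        sess.layers.add(rec.layer)
        # Preserve only the first packets of the session.
        if len(sess.first_payloads) < keep_first:
            sess.first_payloads.append(rec.payload)
    return sessions
```

In this sketch, each resulting Session carries the source country and ASN alongside the preserved first payloads, which corresponds loosely to the dataset requirements listed in the abstract; a real pipeline would draw these fields from actual L3, L4 and L7 logs and a real geolocation source.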

Keywords: Internet, network security, cyber threats, network traffic analysis, datasets, machine learning

MSC-2020: 68M25; 68-11, 62N86

MSC-2020 68-XX: Computer science
MSC-2020 68Mxx: Computer system organization
MSC-2020 68M25: Computer security
MSC-2020 68-11: Research data for problems pertaining to computer science
MSC-2020 62-XX: Statistics
MSC-2020 62Nxx: Survival analysis and censored data
MSC-2020 62N86: Fuzziness, and survival analysis and censored data


For citation: Dmitry D. Kononov, Sergey V. Isaev. Using multilevel data sources to prepare training sets for cyberattack detection. Program Systems: Theory and Applications, 2025, 16:4, pp. 267–285. (In Russ.). https://psta.psiras.ru/2025/4_267-285.

Full text of article (PDF): https://psta.psiras.ru/read/psta2025_4_267-285.pdf.

The article was submitted 10.07.2025; approved after reviewing 16.07.2025; accepted for publication 03.10.2025; published online 27.11.2025.


Editorial address: Ailamazyan Program Systems Institute of the Russian Academy of Sciences, Peter the First Street 4«a», Veskovo village, Pereslavl area, Yaroslavl region, 152021 Russia; Website: http://psta.psiras.ru

Phone: +7(4852) 695-228; License: CC-BY-4.0
