Cognitive information retrieval for data leakage prevention

Alexander Sergeevich Romanov, student at Moscow State Linguistic University.

Abstract: This paper discusses how DLP systems for preventing leaks of confidential information can be implemented using intelligent neural networks. The proposed system relies on linguistic analysis, with the neural network trained on the organization’s database, which makes it possible to tailor the system to the specific needs and technical capabilities of the organization. Its main advantage is ease of setup and operation in small and medium-sized businesses.

Keywords: information security, DLP systems, intelligent neural networks, protection against data leaks, confidential information.


As information and communication technologies spread, organizations increasingly fall victim to information leaks caused by their own employees, and the losses from such incidents can reach tens of billions of dollars. Reports increasingly concern authorized users who abuse their rights and responsibilities by intentionally passing information to competitors or third parties. At the same time, the business environment is changing: companies rely more and more on outsourcing, contractors, and third-party technology platforms, so valuable business information becomes available to a growing number of people. Against insider leaks, access control and perimeter protection do not help, because the malefactor is already inside the perimeter.

According to recent statistics, companies in the technology (5,071,144 stolen records), financial (4,915,553), and healthcare (1,923,340) sectors lead in the number of stolen data records. The most common causes of commercial data leaks are insider activity by employees, neglect of internal information security regulations, and insufficiently complex passwords. [1]

Given this trend, data leakage prevention (DLP) systems are becoming increasingly relevant. The most effective are DLP systems based on content analysis.

Implementing a cognitive model of information retrieval in DLP systems

Existing DLP systems are based on container analysis and content analysis technologies. The proposed model of cognitive information retrieval for preventing data leakage is based on content analysis implemented with intelligent neural networks.

The main technologies for detecting prohibited content in DLP systems are signature control, hash-based control, and linguistic methods. [3]

Signatures. The main control method is to search the data stream for a given sequence of characters. In the general case, the signature is not necessarily a word but an arbitrary set of characters. Most often, the search for a character sequence is applied to text analysis. [3] In the vast majority of cases, signature systems are configured to search for several words and to track how frequently the terms occur. The advantages of this method include independence from the language and ease of updating the dictionary of forbidden terms.
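As a minimal sketch of signature control, the Python fragment below counts occurrences of dictionary terms in a text and flags the text once a threshold is reached. The term list, the threshold, and the function names `signature_hits` and `violates` are illustrative assumptions and are not taken from any particular DLP product.

```python
import re
from collections import Counter

# Hypothetical dictionary of forbidden terms; a real system would load its own.
FORBIDDEN_TERMS = {"confidential", "salary", "contract"}


def signature_hits(text, terms=FORBIDDEN_TERMS):
    """Count how often each forbidden term occurs in the text (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    return dict(Counter(w for w in words if w in terms))


def violates(text, threshold=2, terms=FORBIDDEN_TERMS):
    """Flag the text when the total number of term occurrences reaches the threshold."""
    return sum(signature_hits(text, terms).values()) >= threshold
```

Because the dictionary is an ordinary set of strings, updating it does not require changing the search code, which reflects the ease of maintenance noted above.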

Masks. Mask search extends signature search. It finds content that cannot be listed exactly in the signature database but whose elements or structure can be specified. [3] Such content includes any codes that identify a person or company: individual tax numbers (ITN), account numbers, document numbers, etc. Searching for them with signatures is not possible.
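Masks are commonly expressed as regular expressions. The sketch below assumes, purely for illustration, that an ITN consists of 10 or 12 digits and an account number of 20 digits; the actual formats depend on the jurisdiction and the organization, and the names `MASKS` and `find_masked` are hypothetical.

```python
import re

# Assumed formats, for illustration only:
# an ITN of 10 or 12 digits and an account number of 20 digits.
MASKS = {
    "itn": re.compile(r"(?<!\d)(?:\d{12}|\d{10})(?!\d)"),
    "account": re.compile(r"(?<!\d)\d{20}(?!\d)"),
}


def find_masked(text):
    """Return a dict mapping each mask name to the list of matching substrings."""
    return {name: pattern.findall(text) for name, pattern in MASKS.items()}
```

The lookarounds `(?<!\d)` and `(?!\d)` keep a pattern from matching a fragment of a longer digit run, so a 20-digit account number is not also reported as an ITN.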

To implement contextual analysis with neural networks, the network must be connected to the organization’s database. This makes it possible to build a DLP system on a neural network trained by supervised learning, which then prevents data leakage according to pre-specified labels. Names of organizations and individuals, ITNs, account numbers, and other information defined by the organization’s trusted environment can serve as labels.
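The idea of supervised training on labeled examples can be illustrated with the simplest neural model, a single-layer perceptron over bag-of-words features. This is a deliberately small stand-in and not the architecture proposed in the paper; the training texts, labels, and function names are invented for the example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def train_perceptron(samples, epochs=20):
    """Train on (text, label) pairs; label 1 = confidential, 0 = public."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in samples:
            tokens = tokenize(text)
            score = bias + sum(weights.get(t, 0.0) for t in tokens)
            error = label - (1 if score > 0 else 0)
            if error:  # update weights only on a misclassified sample
                for t in tokens:
                    weights[t] = weights.get(t, 0.0) + error
                bias += error
    return weights, bias


def predict(model, text):
    """Return 1 if the text is classified as confidential, else 0."""
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return 1 if score > 0 else 0


# Hypothetical labeled examples standing in for the organization's database.
TRAINING = [
    ("contract with Acme Ltd, ITN attached", 1),
    ("client account details for Acme Ltd", 1),
    ("lunch menu for friday", 0),
    ("office party on friday", 0),
]
model = train_perceptron(TRAINING)
```

After training, `predict(model, text)` returns the label for a new message. A production system would need far more data and a richer model, but the training loop has the same shape: predict, compare with the label, adjust.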

A system for preventing leaks of confidential information can be implemented individually for any organization. Since simplicity and effectiveness of the solutions used are among an organization’s priorities, the system should be easy to configure and to integrate into business processes.

Accordingly, five main criteria for the system can be distinguished.

  • The ability to organize several modes of operation.
  • Implementation of various information recognition technologies.
  • Complete control of channels.
  • Ease of configuration and system support.
  • Creating reports and system operation logs.

Functioning of the DLP system

Modes of operation. DLP systems have two modes of operation: passive and active. [3]

The passive mode is needed when there is a high probability of false positives. It is required while the intelligent neural network is being trained. During network training [4], errors and false alarms will inevitably occur, so this stage is used to install the DLP system and adjust its settings.

After training is complete and the neural network reaches the expected average result in identifying information to be protected, the system switches to the active mode. The objective of this mode is to block the transfer of confidential information to organizations and persons not listed in the organization’s database. If such a transfer is detected, the system notifies the organization’s security department of the attempt.
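The difference between the two modes can be sketched as a small decision gate. In both modes a suspicious transfer raises an alert, but only the active mode blocks it. The class `DlpGate`, the enum `Mode`, and the recipient addresses are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    PASSIVE = "passive"  # observe and alert only (training and tuning stage)
    ACTIVE = "active"    # alert and block


@dataclass
class DlpGate:
    mode: Mode
    trusted_recipients: set
    alerts: list = field(default_factory=list)

    def handle(self, recipient, is_confidential):
        """Return True if the message may be sent, False if it is blocked."""
        if not is_confidential or recipient in self.trusted_recipients:
            return True
        # Notify the security department in both modes.
        self.alerts.append(f"confidential data sent to untrusted recipient: {recipient}")
        # Passive mode lets the message through; active mode blocks it.
        return self.mode is Mode.PASSIVE
```

Keeping the alert logic identical in both modes means that false alarms observed during the passive stage correspond directly to what would be blocked after switching to the active mode.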

Implementation of various technologies for recognition of confidential information by the system

As indicated earlier, the highest-priority recognition technologies that can be implemented in neural networks without complicating the system are signatures, masks, and linguistic methods. [4] Their advantage is that they work directly with the content of documents: it does not matter to the system where and how the document was created, what classification stamp it bears, or what the file is named. This is important when processing drafts of confidential documents or protecting incoming documentation. Linguistic analysis performs well on large amounts of information. For a long text, a DLP system with a linguistic analysis algorithm will more accurately select the correct class, assign the text to the desired category, and run the configured rule.

Full control of data channels. Each open data channel is a potential source of data leakage. It is therefore necessary to block unused channels and to monitor the remaining ones with the leakage prevention system. Even though DLP systems can control a large number of network channels, blocking unused channels remains advisable.

Convenience of setup and system support. Ease of configuration and operation is critical for the organization. These criteria are especially relevant for small and medium-sized businesses and for organizations that do not have a large information security department on staff. Complex maintenance can reduce the system’s effectiveness in recognizing confidential information. Regular maintenance, audits, and adjustment of settings are key to keeping the system functioning and therefore to ensuring a high level of information security.

Creating reports and system operation logs. A DLP system processes large volumes of data. For the system to function and be maintained properly, it is therefore necessary to work with the DLP archive, a database containing events and objects. [3] Examples of such objects are files, email messages, HTTP requests, and names of organizations and individuals recorded during operation.

This database is needed at the initial stages of implementation, since the information stored in it is used to train the intelligent neural network by supervised learning. Report generation is needed to summarize the results of network training, to guide subsequent configuration, and to monitor the protection status of confidential information critical to the organization.
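One possible shape for such an archive is sketched below with Python's built-in `sqlite3` module. The two-table schema, the column names, and the helper functions are assumptions made for illustration; the paper does not prescribe a storage layout.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Create (or open) a hypothetical DLP archive with objects and events."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS objects (
            id INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,      -- e.g. 'file', 'email', 'http-request'
            content TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS events (
            id INTEGER PRIMARY KEY,
            object_id INTEGER NOT NULL REFERENCES objects(id),
            channel TEXT NOT NULL,
            verdict TEXT NOT NULL,   -- e.g. 'allowed', 'blocked'
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)
    return conn


def record_event(conn, kind, content, channel, verdict):
    """Store an intercepted object together with the event that describes it."""
    cur = conn.execute(
        "INSERT INTO objects (kind, content) VALUES (?, ?)", (kind, content)
    )
    conn.execute(
        "INSERT INTO events (object_id, channel, verdict) VALUES (?, ?, ?)",
        (cur.lastrowid, channel, verdict),
    )
    conn.commit()
    return cur.lastrowid


def verdict_report(conn):
    """Summarize the archive as a count of events per verdict."""
    rows = conn.execute("SELECT verdict, COUNT(*) FROM events GROUP BY verdict")
    return dict(rows.fetchall())
```

The stored objects can later be exported as labeled training examples, while `verdict_report` is an example of the kind of summary a report could be built from.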


DLP systems are the best choice for effectively protecting an organization’s confidential data online. Their main advantages are flexible adjustment to the needs and standards of a particular organization and the absence of a need for a large maintenance staff. A DLP system combined with an intelligent neural network requires minimal human resources for training and operation, which makes it the most suitable solution for preventing information leaks in small and medium-sized businesses that face financial and staffing constraints.


  1. Main leaks of 2018 [Electronic resource] // InfoWatch. URL: © (accessed: 29.10.2019).
  2. Global study of confidential information leaks in the first half of 2019 [Electronic resource] // InfoWatch. URL: © (accessed: 04.11.2019).
  3. Bart Baesens, Wouter Verbeke. Fraud analytics using descriptive, predictive, and social network techniques. M., 2015. 437 p.
  4. Simon Haykin. Neural networks. Complete course. 2nd ed. Williams, 2019. 1104 p.
