UDC 004.855.5

Analysis of the use of classification methods of data mining for the detection of cardiovascular diseases

Li Jingzhao, Professor, Department of Telecommunications, Anhui University of Science and Technology (Anhui, China).

Lukicheva Daria Mikhailovna, Master of Computer Science and Engineering, Anhui University of Science and Technology (Anhui, China).

Abstract: This work explores the potential of data mining techniques to detect serious diseases at an early stage from a limited, easily collected data set. The work follows two main directions: the mathematical formulation and implementation of the data analysis, and an assessment of its practical value.

Keywords: data mining, mining models, mining methods, classification, cardiovascular disease.

As the worldwide effort to combat the coronavirus has shown, modern society urgently needs intelligent technologies. Moreover, data must be collected and processed faster than infections spread. The integration of modern technology and medical research is a major trend and a driving force in both disciplines. The speed of modern data processing methods and their low energy and resource consumption play an important role here.

The purpose of this work is to analyze how classification methods can be used to find patterns in medical data, using a real data set as an example, and to make a decision based on the results obtained. The object of research is data mining methods for making managerial decisions.

The following tasks were set to achieve this goal:

Study the features of the data classification method;
Consider the mathematical justification of the algorithm and the features of its implementation;
Carry out preliminary preparation and analysis of practical data;
Build data mining models based on the considered algorithms;
Analyze the obtained results and their practical value.

All data manipulations are programmed in R, a language for statistical data processing, in the free RStudio environment, and in the free Weka software.

Classification is an example of machine pattern recognition: the task of determining which of a set of previously known categories a new observation belongs to, based on a training data set of similar observations whose categories are known. In particular, a set of multiple measurements can be used to construct a discriminant function that is linear in the observations and that discriminates between the selected classes better than any other linear function [4, p. 179-180].

There are currently many classification algorithms; the main ones are shown in the scheme.

Picture 1. Classification algorithm selection scheme.

Mathematical formulation of the classification problem:

Suppose we have a data set $T$ consisting of $n$ elements: $T = \{t_1, t_2, \ldots, t_n\}$. Each element of $T$ is characterized by a set of $m$ parameters, or attributes, together with a class value: $t_i = (h_1, h_2, \ldots, h_m, y)$. Each parameter $h_j$ takes values from a set given for that parameter, $h_j \in H_j$, and the value of $y$ depends in some way on these parameters and is a known class. The task is to build a mapping $F$ of the set of elements $T$ onto the given set of classes $Y = \{y_1, y_2, \ldots, y_L\}$ that captures the dependency structure of the initial data: $F: T \to Y$.

Note that if the set of class values $Y$ is finite, that is, $L$ is a finite number, the problem is a classification problem, and it is well-posed if $L < n$. If instead $y$ takes real values, $Y \subseteq \mathbb{R}$, the problem is a regression problem [1, p. 112].

Thus, the task of classification reduces to finding a mapping F of the original set T onto a given set of classes Y according to the principle of spatial proximity of the characteristics of objects.
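
The idea of a mapping F built on spatial proximity can be illustrated with a one-nearest-neighbour rule. This is only a minimal sketch for intuition, not one of the methods evaluated in this article; the function name, the two attributes and the toy values are hypothetical.

```python
import math

def nearest_neighbor_classifier(training_set):
    """Build a mapping F from attribute tuples to classes.

    training_set is a list of (attributes, label) pairs, where attributes
    is a tuple of m numeric values (h_1, ..., h_m).
    """
    def F(attributes):
        # Return the known class of the closest labelled element
        # (Euclidean distance between attribute tuples).
        _, label = min(
            training_set,
            key=lambda pair: math.dist(pair[0], attributes),
        )
        return label
    return F

# Hypothetical toy data: (age in years, upper blood pressure) -> cardio
T = [((45, 120), 0), ((50, 125), 0), ((62, 160), 1), ((58, 150), 1)]
F = nearest_neighbor_classifier(T)
print(F((61, 158)))  # class of the nearest labelled element
print(F((47, 118)))
```

In practice the attributes would need to be brought to a common scale first, since raw distances are dominated by the attribute with the largest numeric range.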

In the classification problem, the set of classes is known and limited in advance. Suppose that, for a random sample $T$ generated by the distribution $P$, we have built a hypothesis $h = h_T$ that expresses the membership of objects $t$ in the classes generated by an unknown concept $y$. The error of the hypothesis $h$ is defined as $\mathrm{err}_P(h) = P\{h(t) \neq y(t)\}$. The quantity $\mathrm{err}_P(h)$ is a random variable, since $h = h_T$ is by definition a function of the random sample $T$.

We then need to find a hypothesis $h$ for which the probability that the error $\mathrm{err}_P(h)$ is large is small. In other words, we would like to state that $h$ is probably correct up to $\mathrm{err}_P(h) \le \varepsilon$. The degree of "probability" is measured by the confidence parameter $\delta$. We want to obtain a good approximation of the concept $y$ with high probability; specifically, we require that the inequality $\mathrm{err}_P(h) \le \varepsilon$ hold with probability at least $1 - \delta$ [3, p. 204-206].
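
To give the parameters ε and δ a concrete scale, the sketch below uses the standard two-sided Hoeffding inequality. It applies to a fixed hypothesis whose error is measured on n independent examples that were not used to build it, which is an assumption added here and not a derivation from this article. The function names and numeric values are illustrative.

```python
import math

def hoeffding_margin(n, delta):
    """Half-width eps such that |empirical error - true error| <= eps
    holds with probability at least 1 - delta, for a fixed hypothesis
    evaluated on n independent examples (two-sided Hoeffding bound)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def sample_size(eps, delta):
    """Smallest n for which the Hoeffding margin is at most eps."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Illustrative values, not taken from the article
print(hoeffding_margin(n=30000, delta=0.05))
print(sample_size(eps=0.01, delta=0.05))
```

The margin shrinks only as the square root of n, so halving ε requires roughly four times as many held-out examples.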

The main purpose of classification is to construct a model: a training set is formed from the existing data set, a classification algorithm is trained on it, and the result is a classification model for data processing.

Picture 2. Building a classification model.

After the model is created, it is applied to new data, assigning predicted classes to unlabeled data instances [2, p. 116].

Picture 3. Using the Classification Model.

Let's apply and analyze the classification methods and check whether they are really useful in such a complex and important field of human activity as cardiology.

For this task, a data set was taken from the materials provided for the ML Boot Camp machine learning competition. The data set contains the results of a standard medical examination of 100 thousand patients.

Table 1. Description of the attributes of the data set under investigation.

Attribute	Meaning	Additional information
Id	patient identification number
Age	patient's age	in days
Gender	patient gender	1 – female, 2 – male
Height	patient's height	in centimeters
Weight	patient weight	in kilograms
Ap_hi	upper blood pressure
Ap_lo	lower blood pressure
Cholesterol	blood cholesterol readings	1 – normal, 2 – above normal, 3 – significantly above normal
Gluc	blood glucose readings	1 – normal, 2 – above normal, 3 – significantly above normal
Smoke	whether the patient is a smoker	0 – no, 1 – yes
Alco	patient alcohol abuse	0 – no, 1 – yes
Active	the patient's activity	0 – absent, 1 – high
Cardio	patient's cardiovascular disease	0 – absent, 1 – present

Let's clean up the initial data. We convert age from days to years. We limit arterial pressure to the range from 50 to 250 mmHg for the upper value and from 20 to 200 mmHg for the lower value; every value outside these limits is set to the nearest limit. We remove the smoke, alco and active attributes, since they are subjective.
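
The cleaning steps can be sketched as follows. This is a minimal illustration on a single record, not the R code used in the work. The lowercase key names, the 365.25 days-per-year divisor, the truncation to whole years and the sample row are assumptions.

```python
def clamp(value, low, high):
    """Replace a value outside [low, high] with the nearest limit."""
    return max(low, min(high, value))

def clean_record(record):
    """Return a cleaned copy of one patient record (a dict)."""
    cleaned = dict(record)
    # Age is stored in days; 365.25 days per year is an assumption.
    cleaned["age"] = int(record["age"] / 365.25)
    # Upper pressure limited to 50..250, lower pressure to 20..200.
    cleaned["ap_hi"] = clamp(record["ap_hi"], 50, 250)
    cleaned["ap_lo"] = clamp(record["ap_lo"], 20, 200)
    # Drop the subjective, self-reported attributes.
    for name in ("smoke", "alco", "active"):
        cleaned.pop(name, None)
    return cleaned

# Hypothetical raw row with implausible pressure readings
raw = {"id": 1, "age": 18393, "gender": 2, "height": 168, "weight": 62.0,
       "ap_hi": 1100, "ap_lo": -70, "cholesterol": 1, "gluc": 1,
       "smoke": 0, "alco": 0, "active": 1, "cardio": 0}
print(clean_record(raw))
```

Clamping keeps every patient in the data set, at the cost of treating obviously mistyped readings as extreme but valid ones.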

Picture 4. Distribution of attributes after adjustment.

Let's build a classification model using decision trees in the RStudio environment. We split the original data set into a training part and a test part in a ratio of 70/30. Using the rpart function with the classification method (method = "class") and a complexity parameter cp = 0.005 set through rpart.control, we build a decision tree with cardio as the target variable and all other attributes of the data set as predictors.
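
The two steps described above, a 70/30 split and the search for a good split point, can be sketched in plain Python. This is not the rpart implementation: it shows a single Gini-based split on one attribute, the basic step that a classification tree repeats recursively. The random shuffle, the seed and the toy values are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Threshold on one numeric attribute that minimises the weighted
    Gini impurity of the two resulting groups."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: upper blood pressure and cardio label
ap_hi = [110, 120, 120, 130, 140, 150, 160, 170, 180, 190]
cardio = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
train, test = train_test_split(list(zip(ap_hi, cardio)))
print(len(train), len(test))
print(best_threshold(ap_hi, cardio))
```

A full tree would apply best_threshold to every attribute, keep the best split, and recurse on both groups until further splits no longer improve the fit enough, which is the role played by the complexity parameter.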

Picture 5. The resulting decision tree.

The accuracy of the model is 72.9%. It is determined using the predict function, which returns the predictions of the cardio target variable for the test data from the previously built model.
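
Accuracy here means the share of test instances whose predicted class matches the actual one. A minimal sketch of that calculation, together with the underlying confusion counts, is given below; the labels are hypothetical and the function names are illustrative.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs for a binary target."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of instances whose predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels for ten test patients
actual    = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]
counts = confusion_counts(actual, predicted)
print(counts[(1, 1)], counts[(0, 0)], counts[(0, 1)], counts[(1, 0)])
print(accuracy(actual, predicted))
```

For a medical task the pair (1, 0), a sick patient classified as healthy, is usually the costliest error, so the confusion counts are worth reporting alongside the overall accuracy.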

Picture 6. Classification accuracy of test data by a decision tree.

Let's build a classification model with the support vector machine using the ksvm function. We explicitly specify that the target variable is a factor, that the kernel type is "rbfdot", the Gaussian radial basis kernel $k(x, x') = \exp(-\sigma \|x - x'\|^2)$, with σ = 0.05 as the inverse kernel width, and that the penalty for a misclassified object is C = 5.
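
The Gaussian radial basis kernel measures similarity between two attribute vectors as exp(-σ‖x - x'‖²). The sketch below evaluates it directly with σ = 0.05 from the text. It is not the ksvm implementation, and the two vectors are hypothetical values assumed to be already scaled.

```python
import math

def rbf_kernel(x, y, sigma=0.05):
    """Gaussian RBF kernel k(x, y) = exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)

# Hypothetical, already scaled attribute vectors
a = (0.5, -1.0, 0.2)
b = (0.1, 0.3, -0.4)
print(rbf_kernel(a, a))  # identical points give 1.0
print(rbf_kernel(a, b))
```

A small σ makes the kernel decay slowly with distance, so each training point influences a wide neighbourhood and the decision boundary is smoother.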

Picture 7. Visualization of source data classification by support vector machines.

Using the predict function defined earlier, we calculate the accuracy of the method on the initial data, and compare the parameters of the classification results of the two methods.

Table 2. Classification results of the two methods.

Method	Number of correctly classified instances	Accuracy, %	Model building time, min
Decision tree	21870	72.9	0.2
Support vector machine	21988	73.3	27.6
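
The trade-off in Table 2 can be made explicit with a little arithmetic. The sketch below only reuses the numbers from the table. Reading the implied totals as the size of the 30% test part is an inference made here, not a statement from the article.

```python
results = {
    "Decision tree": {"correct": 21870, "accuracy": 0.729, "minutes": 0.2},
    "Support vector machine": {"correct": 21988, "accuracy": 0.733, "minutes": 27.6},
}

tree = results["Decision tree"]
svm = results["Support vector machine"]

# Number of evaluated instances implied by each row: correct / accuracy
implied_sizes = {
    name: round(r["correct"] / r["accuracy"]) for name, r in results.items()
}
# Additional correctly classified instances and relative building time
extra_correct = svm["correct"] - tree["correct"]
time_ratio = svm["minutes"] / tree["minutes"]

print(implied_sizes)
print(extra_correct, round(time_ratio))
```

Both rows imply roughly 30 thousand evaluated instances, and the small gain in correctly classified instances comes at a model building time that is two orders of magnitude longer.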

Neither method achieves high accuracy. However, given the assumptions made earlier, the minimal set of initial attributes and their relative impact on the target value, and the high complexity and specificity of cardiology, the models coped well with the task.

This work considered data mining methods, the mathematical formulation of the classification problem, and its implementation in R within the RStudio environment. Classification was carried out on real data using the decision tree method and the support vector machine. The resulting models reached an acceptable level of accuracy under the given conditions: 72.9% of patients were correctly classified by the decision tree and 73.3% by the support vector machine.

The analysis of the application of mining methods and models and their features can be used to search for patterns in medical data and make decisions on real data.

The results of the work can be used in the development of practical methods for analyzing data from various areas of human life. The application of the methods covered in the work will allow you to quickly and efficiently generalize the results of large amounts of data from various studies.

Future research in this area should focus on improving the accuracy of the methods and considering other methods of data mining and their combination, as well as introducing cardiology specialists as experts.

References

1. Barseghyan A. A., Kupriyanov M. S., Kholod I. I., Tess M. D., Elizarov S. I. Analysis of data and processes: textbook. 3rd ed., revised and expanded. St. Petersburg: BHV-Petersburg, 2009. 512 p.
2. Chubukova I. A. Data Mining: lecture course. INTUIT, 2006. 328 p.
3. Vyugin V. V. Mathematical foundations of the theory of machine learning and forecasting. Moscow, 2013. 387 p.
4. Fisher R. A. The statistical utilization of multiple measurements. Annals of Eugenics, 1936. 475 p.
