UDC 004.855.5

Analysis of the use of data mining classification methods for the detection of cardiovascular diseases

Li Jingzhao, Professor, Department of Telecommunications, Anhui University of Science and Technology (Anhui, China).

Lukicheva Daria Mikhailovna, Master of Computer Science and Engineering, Anhui University of Science and Technology (Anhui, China).

Abstract: This work explores the potential of data mining techniques to detect serious diseases at an early stage from a limited, easily collected data set. The work follows two main directions: the mathematical formulation of the data analysis and an assessment of its practical value.

Keywords: data mining, mining models, mining methods, classification, cardiovascular disease.

As the worldwide effort to combat the coronavirus has shown, modern society needs intelligent technologies. Moreover, data must be collected and processed faster than infections spread. The integration of modern technology and medical research is one of the major trends and driving forces in both disciplines. The speed of modern data processing methods and their low energy and resource consumption play an important role here.

The purpose of this work is to analyze, using real data as an example, how classification methods can be used to find patterns in medical research and to make decisions based on the results obtained. The object of research is data mining methods for making managerial decisions.

The following tasks were set to achieve this goal:

  • Study the features of the data classification method;
  • Consider the mathematical justification of the algorithm and the features of its implementation;
  • Carry out preliminary preparation and analysis of practical data;
  • Build data mining models based on the algorithms considered;
  • Analyze the results obtained and their practical value.

All data manipulation is carried out in R, a language for statistical data processing, within the free RStudio environment, and in the free Weka software.

Classification is an example of machine pattern recognition: the task of determining which of several previously known categories a new observation belongs to, based on a training data set of similar observations whose categories are known. Thus, a set of multiple measurements can be used to construct a discriminant function that is linear in the observations and discriminates between the selected classes better than any other linear function [4, p. 179-180].

There are currently many classification algorithms; the main ones are shown in Picture 1.

Picture 1. Classification algorithm selection scheme.

Mathematical formulation of the classification problem:

Suppose we have a data set $T$ consisting of $n$ elements: $T = \{t_1, t_2, \dots, t_n\}$. Each element of $T$ is characterized by a set of $m$ parameters, or attributes: $t_i = \{h_1, h_2, \dots, h_m, y\}$. Each parameter $h_j$ takes values from a given set of values for that parameter, $h_j \in H_j$, and the value of $y$ depends in some way on these parameters and is a known class. It is necessary to build a mapping $F$ of the set of elements $T$ onto the given set of classes $Y = \{y_1, y_2, \dots, y_L\}$ that captures the dependency structure of the initial data: $F: T \to Y$.

Note that if the set of class values $Y = \{y_1, y_2, \dots, y_L\}$ is finite, that is, $L$ is a finite number, the problem is a classification problem, and it is well-posed if $L < n$. If instead $y$ takes real values, $y \in \mathbb{R}$, the problem is a regression problem [1, p. 112].

Thus, the task of classification reduces to finding a mapping F of the original set T onto a given set of classes Y according to the principle of spatial proximity of the characteristics of the objects.

In a classification problem the set of classes is known and bounded in advance. Suppose that for some random sample $T = \{t_1, t_2, \dots, t_n\}$ generated by a distribution $P$ we have built a hypothesis $h = h_T$ that expresses the membership of objects $t$ in the classes generated by an unknown concept $y$. The error of the hypothesis $h$ is defined as $\mathrm{err}_P(h) = P\{h(t) \ne y(t)\}$. The quantity $\mathrm{err}_P(h)$ is a random variable, since $h = h_T$ is by definition a function of the sample $T$.

We then need to find a hypothesis $h$ for which the probability that the error $\mathrm{err}_P(h)$ is large is itself small. In other words, we would like to assert that $h$ is probably correct up to $\mathrm{err}_P(h) \le \varepsilon$. The degree of "probably" is measured by the confidence parameter $\delta$. We want a good approximation of the concept $y$, which takes values in $Y$, with high probability. In particular, we require that the inequality $\mathrm{err}_P(h) \le \varepsilon$ hold with probability at least $1 - \delta$ [3, p. 204-206].
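Since the distribution P is unknown, the error of a hypothesis cannot be computed directly; in practice it is estimated as the share of mistakes on a labeled sample. The following Python sketch illustrates this estimate. The function names, the threshold rule, and the toy sample are hypothetical and are not taken from the study.

```python
def empirical_error(h, sample):
    """Share of labeled examples (t, y) on which hypothesis h disagrees with y."""
    if not sample:
        raise ValueError("sample must not be empty")
    mistakes = sum(1 for t, y in sample if h(t) != y)
    return mistakes / len(sample)


# Hypothetical toy sample: t is an upper blood pressure reading, y is the known class.
toy_sample = [(110, 0), (125, 0), (150, 1), (165, 1), (135, 0)]


def threshold_rule(t):
    """Hypothetical hypothesis: predict class 1 when the reading exceeds 130."""
    return 1 if t > 130 else 0


toy_error = empirical_error(threshold_rule, toy_sample)
print(f"empirical error: {toy_error:.2f}")
```

The rule misclassifies only the reading 135, so the estimate on this toy sample is one mistake out of five.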

The main purpose of classification is to construct a model. A training set is formed from the existing data set; a classification algorithm is trained on it, and the result is a classification model for processing data.

Picture 2. Building a classification model.

Once the model has been built, it is applied to new data, assigning the required classes to unlabeled data instances [2, p. 116].

Picture 3. Using the Classification Model.
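The two stages, building a model from labeled data and then applying it to unlabeled instances, can be sketched in Python with a deliberately simple nearest-centroid classifier. This is only an illustration of the train-then-apply pattern, not one of the algorithms used in the study; all names and the toy data are hypothetical.

```python
def fit_centroids(rows, labels):
    """Training step: average the attribute vectors of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(model, row):
    """Application step: assign the class whose centroid is closest to the row."""
    return min(model, key=lambda label: squared_distance(model[label], row))


# Hypothetical labeled examples: (age in years, upper blood pressure) -> class.
train_rows = [(40, 110), (45, 120), (60, 150), (65, 160)]
train_labels = [0, 0, 1, 1]

model = fit_centroids(train_rows, train_labels)

# New, unlabeled instances receive a class from the trained model.
new_patients = [(42, 115), (63, 155)]
predicted = [predict(model, row) for row in new_patients]
print(predicted)
```

The model here is just a dictionary of class centroids; any classifier with a separate training step and prediction step fits the same scheme.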

Let's apply and analyze the classification methods and check whether they are really useful in such a complex and important field of human activity as cardiology.

For this task, a dataset was taken from the materials provided for the ML Boot Camp machine learning competition. The data set contains the results of a standard medical examination of 100 thousand patients.

Table 1. Description of the attributes of the data set under investigation.

| Attribute | Meaning | Additional information |
| --- | --- | --- |
| Id | patient identification number | |
| Age | patient's age | in days |
| Gender | patient's gender | 1 – female, 2 – male |
| Height | patient's height | in centimeters |
| Weight | patient's weight | in kilograms |
| Ap_hi | upper blood pressure | |
| Ap_lo | lower blood pressure | |
| Cholesterol | blood cholesterol readings | 1 – normal, 2 – above normal, 3 – significantly above normal |
| Gluc | blood glucose readings | 1 – normal, 2 – above normal, 3 – significantly above normal |
| Smoke | whether the patient is a smoker | 0 – no, 1 – yes |
| Alco | patient alcohol abuse | 0 – no, 1 – yes |
| Active | the patient's activity | 0 – absent, 1 – high |
| Cardio | patient's cardiovascular disease | 0 – absent, 1 – present |

Let's clean up the initial data. We convert the age values from days to years. We limit the blood pressure values to the range from 50 to 250 mmHg for the upper reading and from 20 to 200 mmHg for the lower reading, setting every value outside these limits to the nearest limit. We remove the smoke, alco, and active attributes, since they are subjective.

Picture 4. Distribution of attributes after adjustment.
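The cleaning steps described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the lowercase attribute names, the use of 365-day years, the rounding down to whole years, and the sample record are all assumptions.

```python
def clamp(value, low, high):
    """Set a value outside [low, high] to the nearest limit."""
    return max(low, min(high, value))


def clean_record(record):
    """Return a cleaned copy of one patient record (a dict keyed by attribute name)."""
    # Drop the attributes treated as subjective.
    cleaned = {k: v for k, v in record.items() if k not in ("smoke", "alco", "active")}
    # Days -> whole years (assumption: 365-day years, rounded down).
    cleaned["age"] = record["age"] // 365
    # Limit blood pressure readings to the stated ranges.
    cleaned["ap_hi"] = clamp(record["ap_hi"], 50, 250)
    cleaned["ap_lo"] = clamp(record["ap_lo"], 20, 200)
    return cleaned


# Hypothetical raw record with implausible pressure readings.
raw = {"id": 1, "age": 18393, "gender": 2, "height": 168, "weight": 62.0,
       "ap_hi": 1100, "ap_lo": 10, "cholesterol": 1, "gluc": 1,
       "smoke": 0, "alco": 0, "active": 1, "cardio": 0}

cleaned = clean_record(raw)
print(cleaned)
```

Clamping keeps every record in the data set, at the cost of treating obviously mistyped readings as extreme but valid values.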

Let's build a decision tree model in the RStudio environment. We split the original dataset into a training part and a test part in a 70/30 ratio. Using the rpart function with the "class" method and the complexity parameter cp = 0.005 set through rpart.control, we build a decision tree that predicts the cardio target value from all other attributes in the dataset.

Picture 5. The resulting decision tree.
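To illustrate what happens at this step, the Python sketch below performs a random 70/30 split and then finds a single best threshold on one attribute by Gini impurity, which is the basic operation a decision tree repeats at each node. It is not the rpart implementation; the function names, the seed, and the toy readings are hypothetical.

```python
import random


def train_test_split(rows, train_share=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_share))
    return shuffled[:cut], shuffled[cut:]


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Find the split 'value < threshold' with the lowest weighted Gini impurity."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[1:]:
        left = [y for v, y in zip(values, labels) if v < threshold]
        right = [y for v, y in zip(values, labels) if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


train, test = train_test_split(range(10))
print(len(train), len(test))

# Hypothetical upper blood pressure readings with known classes.
ap_hi = [110, 120, 130, 150, 160, 170]
cardio = [0, 0, 0, 1, 1, 1]
print(best_threshold(ap_hi, cardio))
```

A full tree applies the same search recursively to each resulting subset and across all attributes; a complexity parameter such as cp then limits how many of these splits are kept.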

The accuracy of the model is 72.9%. It was determined with the predict function, which returns the predictions of the cardio target variable for the test data classified by the previously built model.

Picture 6. Classification accuracy of test data by a decision tree.
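Accuracy of this kind is usually read off a table of actual versus predicted classes. The Python sketch below shows one way to build such a table and derive the accuracy from it; the function names and the label vectors are hypothetical and unrelated to the study's data.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Count (actual, predicted) pairs for two equally long label sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return Counter(zip(actual, predicted))


def accuracy_from_counts(counts):
    """Share of instances whose predicted class equals the actual class."""
    total = sum(counts.values())
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / total


# Hypothetical test labels and model predictions.
actual = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 1, 0]

counts = confusion_counts(predicted, actual)
share_correct = accuracy_from_counts(counts)
print(dict(counts), share_correct)
```

Unlike a single accuracy figure, the table also shows how the mistakes divide between missed cases of disease and false alarms, which matters in a medical setting.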

Let's build a classification model with the support vector machine, using the ksvm function. We explicitly declare the target variable as a factor and set the kernel type to "rbfdot", the Gaussian radial basis kernel $k(x, x') = \exp(-\sigma \lVert x - x' \rVert^2)$, with the inverse kernel width $\sigma = 0.05$ and a penalty of $C = 5$ for each misclassified object.

Picture 7. Visualization of source data classification by support vector machines.
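To give a feel for the role of sigma, the Python sketch below evaluates a Gaussian radial basis kernel of the form exp(-sigma * ||x - z||^2), which is the form documented for the "rbfdot" kernel in the kernlab package. The function name and the sample points are hypothetical; only sigma = 0.05 is taken from the text.

```python
from math import exp


def rbf_kernel(x, z, sigma=0.05):
    """Gaussian RBF kernel in the form exp(-sigma * ||x - z||^2)."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-sigma * squared_norm)


same = rbf_kernel((1.0, 2.0), (1.0, 2.0))   # identical points
near = rbf_kernel((1.0, 2.0), (2.0, 2.0))   # squared distance 1
far = rbf_kernel((1.0, 2.0), (7.0, 10.0))   # squared distance 100

print(same, near, far)
```

The kernel value decays from 1 toward 0 as points move apart, and a larger sigma makes that decay faster, so each training object influences a smaller neighborhood.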

Using the predict function as before, we calculate the accuracy of the method on the initial data and compare the classification results of the two methods.

Table 2. Classification results of the two methods.

| Method | Number of correctly classified instances | Accuracy, % | Model building time |
| --- | --- | --- | --- |
| Decision tree | 21870 | 72.9% | 0.2 min |
| Support vector machine | 21988 | 73.3% | 27.6 min |
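The figures in Table 2 can be cross-checked with a few lines of Python. The sketch assumes that the test part contains 30,000 instances, that is, 30% of the 100 thousand records; this size is not stated explicitly in the text and is an assumption.

```python
# Assumption: the test part is 30% of 100 thousand records.
assumed_test_size = 30_000

# (correctly classified instances, model building time in minutes) from Table 2.
results = {
    "Decision tree": (21870, 0.2),
    "Support vector machine": (21988, 27.6),
}

percent = {name: 100 * correct / assumed_test_size
           for name, (correct, _) in results.items()}

for name, value in percent.items():
    print(f"{name}: {value:.1f}%")

extra_correct = results["Support vector machine"][0] - results["Decision tree"][0]
time_ratio = results["Support vector machine"][1] / results["Decision tree"][1]
print(f"extra correct instances: {extra_correct}, time ratio: {time_ratio:.0f}x")
```

Under this assumption the counts reproduce the reported percentages, and the comparison makes the trade-off explicit: the support vector machine classifies 118 more instances correctly but takes roughly 138 times longer to build.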

Neither method achieves high accuracy. However, given the assumptions made earlier, the minimal set of initial attributes and their relative impact on the target value, and the high complexity and specificity of cardiology, the models coped well with the task.

This work considered data mining methods, the mathematical formulation of the classification problem, and its implementation in the R language within the RStudio environment. Classification was carried out on real data using the decision tree method and the support vector machine. The resulting models reached a reasonable level of accuracy under the given conditions: 72.9% of patients were classified correctly by the decision tree and 73.3% by the support vector machine.

The analysis of the application of mining methods and models and their features can be used to search for patterns in medical data and make decisions on real data.

The results of the work can be used in developing practical methods for analyzing data from various areas of human life. The methods covered in the work make it possible to generalize the results of large amounts of data from various studies quickly and efficiently.

Future research in this area should focus on improving the accuracy of the methods, on considering other data mining methods and their combinations, and on involving cardiology specialists as experts.

References

  1. Barseghyan A. A., Kupriyanov M. S., Kholod I. I., Tess M. D., Elizarov S. I. Analysis of data and processes: textbook. 3rd ed., revised and expanded. St. Petersburg: BHV-Petersburg, 2009. 512 p.
  2. Chubukova I. A. Data Mining: lecture course. INTUIT, 2006. 328 p.
  3. Vyugin V. V. Mathematical foundations of the theory of machine learning and forecasting. Moscow, 2013. 387 p.
  4. Fisher R. A. The statistical utilization of multiple measurements. Annals of Eugenics, 1936. 475 p.
