Table 1. Summary of related works on the PIMA dataset
FS: Feature Selection; MVI: Missing Value Imputation; KNN: K-Nearest Neighbors; PCA: Principal Component Analysis; SVM: Support Vector Machine; DT: Decision Tree; NB: Naïve Bayes; RB-Bayes: Recursive Bayesian; LR: Logistic Regression; RF: Random Forest; ANN: Artificial Neural Networks; NSGA: Non-dominated Sorting Genetic Algorithm; LDA: Linear Discriminant Analysis; QDA: Quadratic Discriminant Analysis; GPC: Gaussian Process Classification; MLP: Multilayer Perceptron; EBBM-based UTM: Evolutionary Bait Balls Model-based Unorganized Turing Machine
Figure 1. The proposed framework for diabetes prediction
Clustering the complete records yields two clusters: Cluster 1, with the majority of samples being non-diabetic and a smaller number of samples being diabetic (x(i) ∈ C1, i = 1, 2, ..., 204), and Cluster 2, represented by a mixture of non-diabetic and diabetic samples (x(i) ∈ C2, i = 1, 2, ..., 188).
This can be accomplished by utilizing information from the two clusters: Cluster 1 (C1) and Cluster 2 (C2). The analysis commences with records that contain only one missing value. After imputing the missing value for each record, the record is assigned to a cluster (C1 or C2) according to its class variable. This process enriches the reference data and improves the quality of imputing missing values for other records. Subsequently, the missing values in records with two missing values can be filled in and those records assigned to one of the clusters. This procedure continues for records with an increasing number of missing values, following a sequential order.
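The sequential ordering described above, processing records with the fewest missing values first, can be sketched as follows. The function name and the NaN-based encoding of missing values are our own assumptions, not notation from the paper:

```python
import numpy as np

def imputation_order(records):
    """Return indices of incomplete records, fewest missing values first.

    `records` is a 2-D array in which np.nan marks a missing feature.
    """
    missing_counts = np.isnan(records).sum(axis=1)
    incomplete = np.where(missing_counts > 0)[0]
    # A stable sort keeps the original order among records with equal counts.
    return incomplete[np.argsort(missing_counts[incomplete], kind="stable")]
```

Records returned earlier in this order are imputed first and then join the reference clusters, so later imputations can draw on them.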
Two scenarios are considered: Scenario 1 handles the MVI for diabetic individuals (class of x = diabetic), while Scenario 2 deals with the MVI for non-diabetic individuals (class of x = non-diabetic).
In Scenario 1, the distances between the record x and all diabetic data points in Cluster 2 (x(i) ∈ C2), which contains a higher number of diabetic patients, are calculated using the Euclidean method. This calculation considers only the valid columns, i.e., the non-missing values of the record x. After arranging the records according to their distances, the k (k = 10 in the proposed method) closest ones to x are chosen as xj (j = 1, 2, ..., k). Next, the distances between these k records (xj, j = 1, 2, ..., k) and the center of Cluster 1, which contains a higher number of non-diabetic patients, are calculated. The record that has the maximum distance from the center of Cluster 1 among these k records is identified as the reference record xref. The missing values of x are then imputed using the corresponding values from the xref record.
The objective is to align the imputed record more closely with individuals diagnosed with diabetes, who are predominantly found in Cluster 2, and, conversely, to distance these records from non-diabetic individuals, who are mainly present in Cluster 1.
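Scenario 1 can be sketched roughly as follows, assuming records are NumPy arrays with NaN marking missing values. The function and variable names are ours, and the Cluster 1 centroid is taken as given; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def impute_diabetic_record(x, cluster2, cluster1_center, k=10):
    """Scenario 1 sketch: impute a diabetic record's missing values.

    x               -- 1-D array with np.nan marking missing features
    cluster2        -- complete diabetic records (Cluster 2)
    cluster1_center -- centroid of the mostly non-diabetic Cluster 1
    """
    valid = ~np.isnan(x)                        # non-missing columns only
    # Euclidean distance from x to every Cluster-2 record over valid columns
    d = np.linalg.norm(cluster2[:, valid] - x[valid], axis=1)
    nearest = cluster2[np.argsort(d)[:k]]       # k closest diabetic records
    # Among those k, take the record farthest from the Cluster-1 centroid
    d_center = np.linalg.norm(nearest - cluster1_center, axis=1)
    x_ref = nearest[np.argmax(d_center)]
    x_imputed = x.copy()
    x_imputed[~valid] = x_ref[~valid]           # fill gaps from the reference
    return x_imputed
```

Choosing the neighbor farthest from Cluster 1's center is what pulls the imputed record toward the diabetic cluster, per the objective stated above.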
In Scenario 2, the distances between the record x and all non-diabetic data points in Cluster 1 (x(i) ∈ C1), denoted d1, and in Cluster 2 (x(i) ∈ C2), denoted d2, are calculated. This calculation considers only the valid columns with non-missing values of the record x. The cluster closest to the record x (either Cluster 1 or Cluster 2) is identified using the calculated distances and is deemed the most appropriate for identifying and replacing the missing values of that particular record. Two cases arise:
Case 1: x is closer to Cluster 2 (d2 < d1). After evaluating the distances between the record x and all non-diabetic records in Cluster 2 (x(i) ∈ C2), the k (k = 10 in the proposed method) nearest records are identified as xj (j = 1, 2, ..., k). Then, the distances between these k records and the centroid of Cluster 1 are calculated. The record with the minimum distance to the center of Cluster 1 is selected as the reference (xref) for the MVI for record x. The objective of this process is to move x closer to Cluster 1, which consists of a higher number of non-diabetic individuals, while simultaneously distancing it from Cluster 2, which contains a significant number of diabetic individuals.
Case 2: x is closer to Cluster 1 (d1 < d2). After evaluating the distances between the record x and all non-diabetic records in Cluster 1, the k (k = 10 in the proposed method) nearest records are identified as xj (j = 1, 2, ..., k). Then, the distances between these k records and the centroid of Cluster 2 are calculated. The record that has the maximum distance from the center of Cluster 2 is selected as the reference (xref) for the MVI for the record x. The objective of this process is again to move x closer to Cluster 1, which consists of a higher number of non-diabetic individuals, while simultaneously distancing it from Cluster 2, which contains a significant number of diabetic individuals.
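Both cases of Scenario 2 can be combined into one sketch. Where the text leaves details implicit, we assume nearest-record distances for deciding cluster proximity and mean-based centroids; all names are illustrative:

```python
import numpy as np

def impute_nondiabetic_record(x, c1, c2, k=10):
    """Scenario 2 sketch: impute a non-diabetic record's missing values.

    c1 -- complete non-diabetic records in Cluster 1 (non-diabetic majority)
    c2 -- complete non-diabetic records in Cluster 2 (diabetic majority)
    """
    valid = ~np.isnan(x)
    d1 = np.linalg.norm(c1[:, valid] - x[valid], axis=1)
    d2 = np.linalg.norm(c2[:, valid] - x[valid], axis=1)
    center1, center2 = c1.mean(axis=0), c2.mean(axis=0)  # assumed centroids
    if d2.min() < d1.min():
        # Case 1: x sits closer to Cluster 2. Among its k nearest Cluster-2
        # records, pick the one closest to the Cluster-1 centre, pulling x
        # toward the non-diabetic cluster.
        nearest = c2[np.argsort(d2)[:k]]
        x_ref = nearest[np.argmin(np.linalg.norm(nearest - center1, axis=1))]
    else:
        # Case 2: x is already closer to Cluster 1. Among its k nearest
        # Cluster-1 records, pick the one farthest from the Cluster-2 centre.
        nearest = c1[np.argsort(d1)[:k]]
        x_ref = nearest[np.argmax(np.linalg.norm(nearest - center2, axis=1))]
    out = x.copy()
    out[~valid] = x_ref[~valid]
    return out
```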
The dataset is divided into a train dataset and a test dataset; 70% of the dataset has been allocated to the train set and 30% to the test set.
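A minimal index-based 70/30 split might look like the following; the fixed seed and the use of all 768 PIMA records (after imputation) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed only for reproducibility
n = 768                             # full PIMA record count after imputation
idx = rng.permutation(n)            # shuffle record indices
n_train = int(0.7 * n)              # 70 % for training
train_idx, test_idx = idx[:n_train], idx[n_train:]
```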
Using three clusters (k = 3) on the PIMA dataset leads to fewer mixed clusters. The first cluster is referred to as the "impure" cluster, which contains an approximately equal number of diabetic and non-diabetic records. The second is the "non-diabetic" cluster (Cn), which mainly includes individuals without diabetes, and the third is the "diabetic" cluster (Cd), consisting primarily of diabetic individuals.
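Fitting a three-component Gaussian mixture model can be done with scikit-learn's GaussianMixture; the toy data below merely stands in for the real train records, and the naming of clusters as "impure", "non-diabetic", or "diabetic" would be decided afterwards from each cluster's class composition:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for the train records (the real data has 8 PIMA features).
X_train = np.vstack([rng.normal(0.0, 0.5, (60, 8)),
                     rng.normal(4.0, 0.5, (60, 8)),
                     rng.normal(-4.0, 0.5, (60, 8))])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)
labels = gmm.predict(X_train)   # cluster index 0, 1, or 2 for every record
```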
In the first stage, the distances between the test record xtest and all records of the non-diabetic (Cn) and diabetic (Cd) clusters are calculated. The closest record from each cluster is then chosen: the minimum distance of xtest to Cn (dn) and the minimum distance of xtest to Cd (dd). To validate the decision of assigning a record to a cluster at this stage, a threshold limit is imposed on the distance between the record and the cluster. If the test record's distance to the nearest cluster record is within the threshold limit (min{dn, dd} < threshold), it is assigned a label based on its cluster: if dn < dd, the label of xtest is considered non-diabetic; otherwise (dn > dd), it is considered diabetic. If min{dn, dd} > threshold, the algorithm does not assign any label to the record and places it among the rejected records.
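A minimal sketch of the Stage 1 decision rule, assuming Euclidean nearest-record distances; the function name is ours, and borderline records exactly at the threshold are treated as rejected, which the text leaves unspecified:

```python
import numpy as np

def stage1_label(x_test, non_diabetic, diabetic, threshold):
    """Label a test record from the Cn/Cd clusters, or reject it.

    Returns 'non-diabetic', 'diabetic', or None (rejected record).
    """
    d_n = np.linalg.norm(non_diabetic - x_test, axis=1).min()
    d_d = np.linalg.norm(diabetic - x_test, axis=1).min()
    if min(d_n, d_d) >= threshold:   # too far from both clusters: reject
        return None
    return 'non-diabetic' if d_n < d_d else 'diabetic'
```

Rejected records would then fall through to the second-stage classifier described in the flow diagram (Figure 3).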
Figure 2. The proposed approach for missing value imputation

Figure 3. The flow diagram of the proposed semi-supervised classification method
Table 4. Implementation of Gaussian mixture model clusters on records without missing values (number of clusters ranging from 2 to 5)
Table 5. Evaluation of machine learning algorithm performance utilizing various missing value imputation techniques
RMV: Removing records containing Missing Values; FAV: Filling in missing values using the Average Value; PMVI: Proposed Missing Value Imputation method; SVM: Support Vector Machine; RF: Random Forest; DT: Decision Tree
Table 6. The implementation of Gaussian mixture model clusters on the train dataset (number of clusters ranging from 2 to 5)
Figure 5. Visualization of optimal clusters (k = 3)

Table 7. The accuracy and the number of labeled and rejected records for the proposed method using a range of threshold values from 0.1 to 1
Re: Repetition; Avg: Average
Table 8. Repeating the proposed method five times
Figure 6. The comparison chart of the evaluation criteria of the proposed method for the total, train, and test datasets

Table 9. Comparing the proposed method on the PIMA dataset to other methods
FS: Feature Selection; MVI: Missing Value Imputation; P: Precision; R: Recall; Fs: F1-score; Se: Sensitivity; Sp: Specificity; Acc: Accuracy; KNN: K-Nearest Neighbors; PCA: Principal Component Analysis; RB-Bayes: Recursive Bayesian; LR: Logistic Regression; RF: Random Forest; ANN: Artificial Neural Networks; NSGA: Non-dominated Sorting Genetic Algorithm; LDA: Linear Discriminant Analysis; GPC: Gaussian Process Classification; MLP: Multilayer Perceptron; EBBM-based UTM: Evolutionary Bait Balls Model-based Unorganized Turing Machine
Table 10. Comparing the proposed method on the breast cancer dataset to other methods
FS: Feature Selection; MVI: Missing Value Imputation; P: Precision; R: Recall; Fs: F1-score; Se: Sensitivity; Sp: Specificity; Acc: Accuracy; SVM: Support Vector Machine; HCRF: Hierarchical Clustering Random Forest; LDA: Linear Discriminant Analysis; VIM: Variable Importance Measure; RF: Random Forest
Table 11. The impact of each component of the proposed approach on the results
MVC: Missing Value Correction; UTS1C: Using Threshold in Stage 1 of Classification; CS2: Classifier in Stage 2; Acc: Accuracy; P: Precision; R: Recall; Fs: F1-score; Rel: Reliability; DT: Decision Tree; SVM: Support Vector Machine; RF: Random Forest
Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.