An enhanced random forest approach using CoClust clustering: MIMIC‑III and SMS spam collection application


Creative Commons License

İlhan Taşkın Z., Yıldırak Ş. K., Aladağ Ç. H.

Journal of Big Data, vol.10, no.38, pp.1-36, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 10 Issue: 38
  • Publication Date: 2023
  • DOI: 10.1186/s40537-023-00720-9
  • Journal Name: Journal of Big Data
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED)
  • Pages: pp.1-36
  • Affiliated with Eskişehir Osmangazi University: Yes

Abstract

The random forest algorithm can be enhanced to produce better results with
a well-designed and organized feature selection phase. The dependency structure
between the variables is considered to be the most important criterion for selecting
the variables to be used in the algorithm during the feature selection phase. As
the dependency structure is mostly nonlinear, a tool that accounts for
nonlinearity is a more beneficial approach. The Copula-Based Clustering technique
(CoClust) clusters variables with copulas according to their nonlinear dependency. We show
that it is possible to achieve a remarkable improvement in CPU time and accuracy
by adding a CoClust-based feature selection step to the random forest technique.
We work with two different large datasets, namely the MIMIC-III Sepsis Dataset and
the SMS Spam Collection Dataset. The first dataset is large in terms of rows, each referring to
an individual ID, while the latter is an example of wide data with many
variables (columns) to be considered. In the proposed approach, random forest is first employed
without the CoClust step. Then, random forest is repeated on the clusters
obtained with CoClust. The results are compared in terms of CPU time, accuracy
and ROC (receiver operating characteristic) curve. The CoClust clustering results are also
compared with those of the K-means and hierarchical clustering techniques. The Random Forest (RF),
Gradient Boosting and Logistic Regression results obtained with these clusters, and the
success of RF and CoClust working together, are examined.
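
The abstract's point that dependency between variables is often nonlinear can be illustrated with a small sketch. It is not CoClust and not the authors' procedure: it swaps in Kendall's tau, a rank-based dependence measure, together with a hypothetical greedy grouping helper (`group_by_dependence`, with an assumed threshold of 0.5) to show how a rank-based view groups variables that a linear correlation would understate. The data are synthetic and all names are illustrative.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)


def group_by_dependence(columns, threshold=0.5):
    """Hypothetical stand-in for a clustering step (NOT CoClust):
    a variable joins the first group whose first member it depends on
    with |tau| >= threshold, otherwise it starts a new group."""
    groups = []
    for name, values in columns.items():
        for group in groups:
            if abs(kendall_tau(values, columns[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Synthetic data: exp_x is a nonlinear but monotone function of x,
# noise is drawn independently of both.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
columns = {
    "x": x,
    "exp_x": [math.exp(3 * v) for v in x],
    "noise": [rng.gauss(0, 1) for _ in range(200)],
}

r = pearson(columns["x"], columns["exp_x"])
tau = kendall_tau(columns["x"], columns["exp_x"])
groups = group_by_dependence(columns)

print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
print("groups:", groups)
```

In this toy setup the rank-based measure treats `x` and `exp_x` as perfectly dependent while the linear coefficient does not, so the two end up in the same group and `noise` in its own. A classifier such as random forest could then be fitted per group, which is the general shape of the workflow the abstract describes.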