Feature Selection and Class Imbalance Machine Learning for Early Detection of Thyroid Cancer Recurrence: A Performance-Based Analysis

Authors

  • Agus Wantoro, Universitas Aisyah Pringsewu
  • Wahyu Caesarendra, Curtin University Malaysia
  • Admi Syarif, Universitas Lampung
  • Hari Soetanto, Universitas Budi Luhur

DOI:

https://doi.org/10.55981/jet.758

Keywords:

Class imbalance, Feature selection, Machine Learning, Thyroid cancer.

Abstract

Early detection of thyroid cancer recurrence is crucial to patient survival and treatment effectiveness. Missed or incorrect detection can lead to greater disease severity, higher costs, longer recovery times, and lower service quality. The main challenges in developing a Machine Learning (ML)-based decision support system for detection are class imbalance in medical data and high feature dimensionality, both of which can affect model accuracy and efficiency. This study proposes an approach that combines feature selection with class imbalance handling to improve the performance of early thyroid cancer detection. Several feature selection techniques, including Information Gain (IG), Gain Ratio (GR), Gini Decrease (GD), and Chi-Square (CS), are used to select features based on weighted ranking. To address the imbalanced class distribution, we use the Synthetic Minority Over-Sampling Technique (SMOTE). The ML classification models k-NN, Tree, SVM, Naive Bayes, AdaBoost, Neural Network (NN), and Logistic Regression (LR) are tested and evaluated using confusion-matrix metrics (accuracy, precision, and recall) together with time and log loss. Experimental results show that the class imbalance handling strategy significantly improves the prediction performance of the ML algorithms. In addition, we found that combining CS feature selection with the NN classifier consistently gave the best performance. This study emphasizes the importance of data pre-processing and proper algorithm selection in developing an ML-based thyroid cancer detection system.
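
The two pre-processing steps named in the abstract, chi-square feature ranking and SMOTE oversampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the choice of k are assumptions made for illustration, and the oversampler below handles numeric vectors only.

```python
import math
import random
from collections import Counter


def chi_square_score(feature, labels):
    """Chi-square statistic between one categorical feature and the class label."""
    n = len(labels)
    feature_counts = Counter(feature)
    class_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    score = 0.0
    for value, n_value in feature_counts.items():
        for cls, n_cls in class_counts.items():
            expected = n_value * n_cls / n  # count expected under independence
            score += (observed[(value, cls)] - expected) ** 2 / expected
    return score


def rank_features(rows, labels):
    """Return (feature indices sorted by descending score, list of scores)."""
    n_features = len(rows[0])
    scores = [
        chi_square_score([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling (hypothetical helper, numeric features only).

    Each synthetic point is a random interpolation between a minority
    sample and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy data (assumed): feature 0 determines the label, feature 1 is unrelated.
rows = [("yes", "a"), ("yes", "b"), ("no", "a"), ("no", "b")] * 2
labels = [1, 1, 0, 0] * 2
order, scores = rank_features(rows, labels)

# Toy minority class (assumed): three 2-D points, oversampled with four more.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4, k=2)
```

On the toy data, feature 0 is ranked first because it fully separates the classes, while feature 1 scores zero. In practice the ranked features and the oversampled training set would then be passed to a classifier; oversampling is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.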

Author Biographies

  • Agus Wantoro, Universitas Aisyah Pringsewu
    Technology and Informatics
  • Wahyu Caesarendra, Curtin University Malaysia
    Department Mechanical Electronics and Mechatronics Engineering
  • Admi Syarif, Universitas Lampung
    Department of Computer Sciences
  • Hari Soetanto, Universitas Budi Luhur
    Department Information Technology

Published

2025-12-31

Section

Articles

How to Cite

[1]
A. Wantoro, W. Caesarendra, A. Syarif, and H. Soetanto, “Feature Selection and Class Imbalance Machine Learning for Early Detection of Thyroid Cancer Recurrence: A Performance-Based Analysis,” J. Elektron. dan Telekomun., vol. 25, no. 2, pp. 93–101, Dec. 2025, doi: 10.55981/jet.758.