Detection of COVID-19 Using Protein Sequence Data via Machine Learning Classification Approach

Siti Aminah, Gianinna Ardaneswari, Mufarrido Husnah, Ghani Deori, Handi Bagus Prasetyo

Research output: Contribution to journal > Article > peer-review

Abstract

The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in late 2019 led to the COVID-19 pandemic and created a need for rapid, accurate pathogen detection from protein sequence data. This study aims to develop an efficient classification model for coronavirus protein sequences, using machine learning algorithms and feature selection techniques to aid the early detection and prediction of novel viruses. We used a dataset of 2000 protein sequences: 1000 SARS-CoV-2 sequences and 1000 non-SARS-CoV-2 sequences. Feature extraction with the Discere package yielded 27 features representing primary-structure information. To optimize performance, we combined machine learning classifiers (K-nearest neighbor (KNN), XGBoost, and Naïve Bayes) with feature selection techniques (genetic algorithm (GA), LASSO, and support vector machine recursive feature elimination (SVM-RFE)). The SVM-RFE+KNN model performed exceptionally well, achieving a classification accuracy of 99.30%, a specificity of 99.52%, and a sensitivity of 99.55%. These results demonstrate the model's efficacy in classifying coronavirus protein sequences. The resulting model is robust and supports early detection and prediction for protein sequences from SARS-CoV-2 and other coronaviruses. This advance could facilitate the development of targeted treatments and preventive strategies against future viral outbreaks.
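To make the reported metrics concrete, the following is a minimal sketch of a K-nearest-neighbor classifier together with the standard definitions of accuracy, sensitivity, and specificity. It is not the authors' implementation: the function names `knn_predict` and `binary_metrics`, the choice of k = 3, the Euclidean distance, and the two-dimensional toy vectors are all assumptions made for illustration, and the sketch omits the feature extraction and SVM-RFE selection steps entirely.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Made-up 2-feature vectors for illustration only (NOT data from the study).
# Label 1 stands for SARS-CoV-2, label 0 for non-SARS-CoV-2.
train_X = [(0.9, 0.8), (1.0, 1.1), (0.8, 1.0), (0.1, 0.2), (0.2, 0.0), (0.0, 0.1)]
train_y = [1, 1, 1, 0, 0, 0]
test_X = [(0.95, 0.9), (0.05, 0.15), (0.7, 0.9), (0.3, 0.1)]
test_y = [1, 0, 1, 0]

pred = [knn_predict(train_X, train_y, x) for x in test_X]
metrics = binary_metrics(test_y, pred)
print(pred, metrics)
```

Sensitivity here is the fraction of SARS-CoV-2 sequences correctly identified, and specificity is the fraction of non-SARS-CoV-2 sequences correctly rejected, which is why the abstract reports them separately from overall accuracy.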

Original language: English
Article number: 9991095
Journal: Journal of Applied Mathematics
Volume: 2023
Publication status: Published - 2023
Externally published: Yes
