Impact Feature Vectorization Methods on Arabic Large Data Using Logistic Regression Classification


Ali Shafah
Ahmed Suleiman
Samira Alshafah

Abstract

Text categorization is the task of assigning text documents to a predetermined set of categories. This study presents an experimental assessment of three feature vectorization methods for categorizing a large Arabic corpus with a logistic regression classifier: N-gram, Bag of Words (BoW), and Term Frequency–Inverse Document Frequency (TF-IDF). The corpus comprises around 111,000 Arabic documents divided into five categories: news, sports, culture, economics, and varied. The results for each method were assessed using three performance indicators. The logistic regression model using TF-IDF and N-gram (1,2) achieved the best accuracy at 96%, while Bag of Words came second at 95%.
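
To make the three vectorization methods named in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes whitespace tokenization, reads "N-gram (1,2)" as unigrams plus bigrams, and uses a smoothed IDF of ln((1 + N) / (1 + df)) + 1 with L2 row normalization; none of these choices is stated in the abstract. The function names and the two toy sentences are hypothetical, and the logistic regression step is omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def bag_of_words(docs, n_min=1, n_max=1):
    """Map each document to raw term counts over a shared, sorted vocabulary."""
    # Assumption: plain whitespace tokenization, no Arabic-specific normalization.
    counts = [Counter(ngrams(doc.split(), n_min, n_max)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tfidf(docs, n_min=1, n_max=2):
    """Weight n-gram counts by a smoothed IDF, then L2-normalize each row."""
    vocab, counts = bag_of_words(docs, n_min, n_max)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Assumption: smoothed IDF, chosen for illustration only.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    matrix = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        matrix.append([x / norm for x in weighted])
    return vocab, matrix


# Hypothetical toy documents (a sports sentence and an economics sentence).
docs = ["الفريق فاز بالمباراة", "البنك رفع سعر الفائدة"]
vocab, X = tfidf(docs)
print(len(vocab), "unigram and bigram features")
```

Each row of `X` is a feature vector that could then be passed to a logistic regression classifier. Calling `bag_of_words(docs)` instead gives the raw-count representation over unigrams only.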

How to Cite
Shafah, A., Suleiman, A., & Alshafah, S. (2023). Impact Feature Vectorization Methods on Arabic Large Data Using Logistic Regression Classification. University of Zawia Journal of Engineering Science and Technology, 1(1). https://doi.org/10.26629/uzjest.2023.03
Section
Information Technology