Impact Feature Vectorization Methods on Arabic Large Data Using Logistic Regression Classification

Ali Shafah; Ahmed Suleiman; Samira Alshafah

Published: Dec 24, 2023

Keywords: Arabic Text Classification, Large data, Big data, Feature Vectorization, TF-IDF, BoW, N-gram

Ali Shafah

Data analysis department, Faculty of Economics, University of Zawia, Zawia, Libya

Ahmed Suleiman

Computer department, Faculty of Education, University of Zawia, Zawia, Libya

Samira Alshafah

Computer department, Faculty of Education, University of Zawia, Zawia, Libya

Abstract

Text categorization is the process of assigning text documents to a predetermined set of categories. This study presents an experimental assessment of three feature vectorization methods for categorizing a large Arabic corpus with a logistic regression classifier: N-gram, Bag of Words (BoW), and Term Frequency–Inverse Document Frequency (TF-IDF). The corpus comprises around 111,000 Arabic documents split into five categories: news, sports, culture, economics, and varied. Each method's results were assessed using three performance indicators. The logistic regression model using TF-IDF and N-gram (1,2) achieved the best accuracy at 96%, while Bag of Words came second at 95%.
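
The three vectorization methods named in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe the authors' implementation, so everything here is an assumption made for illustration: the function names (`word_ngrams`, `bag_of_words`, `tf_idf`), the whitespace tokenization, the toy three-document corpus, and the textbook weighting tf × log(N / df). Library implementations typically use smoothed or normalized variants, so their values may differ.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def bag_of_words(docs, n_min=1, n_max=1):
    """Raw term counts per document (Bag of Words over n-gram terms)."""
    return [Counter(word_ngrams(doc.split(), n_min, n_max)) for doc in docs]


def tf_idf(docs, n_min=1, n_max=2):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    counts = bag_of_words(docs, n_min, n_max)
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each term counted once per document
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append(
            {t: (k / total) * math.log(n_docs / df[t]) for t, k in c.items()}
        )
    return weighted


# Hypothetical toy corpus (not taken from the study's data).
docs = [
    "الفريق فاز في المباراة",
    "البنك رفع سعر الفائدة",
    "الفريق خسر المباراة",
]

bow = bag_of_words(docs)           # unigram counts only
tfidf = tf_idf(docs, 1, 2)         # unigrams and bigrams, i.e. N-gram (1,2)
```

In this sketch a term that appears in fewer documents receives a higher weight than an equally frequent term that appears in many, which is the property that distinguishes TF-IDF from raw Bag of Words counts. The resulting per-document weight dictionaries would then be mapped to fixed-length vectors and passed to a logistic regression classifier.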

How to Cite

Shafah, A., Suleiman, A., & Alshafah, S. (2023). Impact Feature Vectorization Methods on Arabic Large Data Using Logistic Regression Classification. University of Zawia Journal of Engineering Sciences and Technology, 1(1). Retrieved from http://journals.zu.edu.ly/index.php/UZJEST/article/view/49

Issue

Vol. 1 No. 1 (2023): University of Zawia Journal of Engineering Sciences and Technology

Section

Information Technology
